# Module 6: Pandas, Visualization
### NBA Tweets Exploration

Dataset: https://www.kaggle.com/mcdonalds/nutrition-facts

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt
%matplotlib inline

In [2]:
locations = pd.read_csv("nbatweets/locations.csv")
locations.head()

Unnamed: 0.1,Unnamed: 0,lon,lat
0,1,121.171039,14.996784
1,2,-77.094709,38.984652
2,3,-3.70379,40.416775
3,8,-83.753428,9.748917
4,11,-77.066923,39.153163


In [3]:
tweets = pd.read_csv("nbatweets/TweetsNBA.csv", encoding = "ISO-8859-1")
tweets.head()

Unnamed: 0.1,Unnamed: 0,text,retweet_count,favorite_count,favorited,truncated,id_str,in_reply_to_screen_name,source,retweeted,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
0,1,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,1004531741216989191,,"<a href=""http://twitter.com/download/iphone"" r...",False,...,,,,,,,,,,
1,2,Ja comecou e eu no onibus https://t.co/wPgRLw...,0,0,False,False,1004531741422481409,,"<a href=""http://twitter.com/download/android"" ...",False,...,,,,,,,,,https://twitter.com/NBA/status/100447552030887...,https://t.co/wPgRLwdg1O
2,3,lets go Cavs\r\n#WhateverItTakes \r\n#NBAFinals,0,0,False,False,1004531741954981888,,"<a href=""http://twitter.com/download/android"" ...",False,...,,,,,,,,,,
3,4,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,1004531743410573312,,"<a href=""http://twitter.com/download/android"" ...",False,...,,,,,,,,,,
4,5,RT @NBA: Count down @StephenCurry30's TOP 5 t...,0,0,False,False,1004531743272194048,,"<a href=""http://twitter.com/download/iphone"" r...",False,...,,,,,,,,,,


In [4]:
len(tweets)

51425

In [5]:
tweets.columns

Index(['Unnamed: 0', 'text', 'retweet_count', 'favorite_count', 'favorited',
       'truncated', 'id_str', 'in_reply_to_screen_name', 'source', 'retweeted',
       'created_at', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str',
       'lang', 'listed_count', 'verified', 'location', 'user_id_str',
       'description', 'geo_enabled', 'user_created_at', 'statuses_count',
       'followers_count', 'favourites_count', 'protected', 'user_url', 'name',
       'time_zone', 'user_lang', 'utc_offset', 'friends_count', 'screen_name',
       'country_code', 'country', 'place_type', 'full_name', 'place_name',
       'place_id', 'place_lat', 'place_lon', 'lat', 'lon', 'expanded_url',
       'url'],
      dtype='object')

In [6]:
tweets = tweets.drop(columns=["Unnamed: 0", "utc_offset", "id_str", "url", "expanded_url", "user_id_str", "in_reply_to_screen_name", 
                             "in_reply_to_status_id_str", "in_reply_to_user_id_str"])

In [7]:
tweets.head()

Unnamed: 0,text,retweet_count,favorite_count,favorited,truncated,source,retweeted,created_at,lang,listed_count,...,country_code,country,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon
0,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,en,7,...,,,,,,,,,,
1,Ja comecou e eu no onibus https://t.co/wPgRLw...,0,0,False,False,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,pt,1,...,,,,,,,,,,
2,lets go Cavs\r\n#WhateverItTakes \r\n#NBAFinals,0,0,False,False,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,en,0,...,,,,,,,,,,
3,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,en,4,...,,,,,,,,,,
4,RT @NBA: Count down @StephenCurry30's TOP 5 t...,0,0,False,False,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,en,1,...,,,,,,,,,,


### Group By and Aggregations 

We use group by to aggregate data based on the values of a field. Imagine a dataset in which each entry is a person in your class along with the color they like. To answer the question of how many people like the color pink, you would group by color and count the people in each group. 
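The classroom example can be sketched on a tiny dataframe. The names and colors below are made up for illustration and are not part of the tweets dataset.

```python
import pandas as pd

# Hypothetical toy data: one row per classmate and the color they like.
classmates = pd.DataFrame({
    "name": ["Ana", "Ben", "Cam", "Dee", "Eli"],
    "color": ["pink", "blue", "pink", "green", "pink"],
})

# Group the rows by color, then count how many names fall into each group.
people_per_color = classmates.groupby("color")["name"].count()

print(people_per_color)
print(people_per_color["pink"])  # number of classmates who like pink
```

Selecting the "name" column before counting keeps the result a simple Series indexed by color.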

Similarly, for the NBA tweets dataset, let's say we want to look at how many tweets were written in each language. The first step is to group by the lang column. 

In [9]:
grouped_by_lang = tweets[["text", "lang"]].groupby("lang")

In [10]:
grouped_by_lang

<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x00000246BE487748>

As you can see above, grouped_by_lang is not a dataframe yet: it remains a groupby object until we apply an aggregation method such as sum, count, or mean to it. 
Because in this case we only want to know how many tweets each language has, we can use a simple count function as our aggregator. 
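Here is a minimal sketch of that behavior on made-up data (the `toy` frame is hypothetical, not the tweets dataset). It also shows one difference worth knowing: `count` skips missing values, while `size` counts every row in the group.

```python
import pandas as pd

# Hypothetical toy data with one missing tweet text.
toy = pd.DataFrame({
    "lang": ["en", "en", "es", "en", "es"],
    "text": ["go cavs", "game 3", "vamos", None, "final"],
})

# Grouping alone only describes how the rows are split; nothing is computed yet.
grouped = toy.groupby("lang")
print(type(grouped).__name__)

# Applying an aggregation produces an actual result.
counts = grouped.count()  # non-missing values per column in each group
sizes = grouped.size()    # rows per group, missing values included

print(counts)
print(sizes)
```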

In [19]:
count_by_lang = grouped_by_lang.agg(np.count_nonzero).sort_values("text", ascending=False).head()
count_by_lang

lang,text
en,37756
und,3237
es,3164
ar,2129
pt,1901


Notice that the entries in the lang field are bolded. This is because lang is now the index of the result rather than a regular column, so trying to access it as a column raises a KeyError, as shown below.  

In [20]:
count_by_lang["lang"]

KeyError: 'lang'

This is because grouping by a field makes that field the index of the resulting dataframe. So if we would like to access the lang field as a column, we must reset the index. 

In [22]:
count_by_lang_reset_index = count_by_lang.reset_index()
count_by_lang_reset_index

Unnamed: 0,lang,text
0,en,37756
1,und,3237
2,es,3164
3,ar,2129
4,pt,1901


Now we can access the lang field. 

In [24]:
list(count_by_lang_reset_index["lang"])

['en', 'und', 'es', 'ar', 'pt']

As we might expect, English is the most common language among all the tweets. 
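As an alternative to resetting the index afterwards, `groupby` accepts `as_index=False`, which keeps the grouping key as a regular column from the start. The sketch below compares both routes on a small made-up frame (`toy` is hypothetical, not the tweets dataset).

```python
import pandas as pd

# Hypothetical toy data standing in for the text and lang columns.
toy = pd.DataFrame({
    "lang": ["en", "en", "es", "en", "pt"],
    "text": ["a", "b", "c", "d", "e"],
})

# Option 1: aggregate, then move the index back into a regular column.
via_reset = toy.groupby("lang").count().reset_index()

# Option 2: ask groupby to keep the grouping key as a column.
via_as_index = toy.groupby("lang", as_index=False).count()

print(via_reset)
print(via_as_index)
print(list(via_as_index["lang"]))
```

Either route leaves lang accessible as a column; which one reads better is mostly a matter of taste.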

How about finding the number of tweets per country in the dataset? Replace the placeholder values wherever you see "...".

In [None]:
grouped_by_country_code = tweets["...."].groupby("....")
count_by_country_code = grouped_by_country_code.agg("...")
count_by_country_code.head()

Sort the values in count_by_country_code and find the top 5 countries that tweet about the NBA. Assign the answer variable below to a list of these countries. 

In [None]:
answer_max_country_engagement = [...]
answer_max_country_engagement

For the next exercise, remember that the way to aggregate with a certain function is to pass that function as an argument to the .agg method. 

Common functions used are sum, np.count_nonzero (count), np.mean (average), and np.std (standard deviation). The example below groups by country code and returns the average listed count for each country. 

In [28]:
grouped_by_country_code = tweets[["country_code", "listed_count"]].groupby("country_code")
grouped_by_country_code.agg(np.mean).sort_values("listed_count", ascending=False).head()

country_code,listed_count
IN,78.0
KW,75.625
JM,62.0
ES,57.3
CM,54.8
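The .agg method also accepts a list of functions, which gives several summary columns at once. The sketch below uses a made-up frame (`toy` and its values are hypothetical) and passes the aggregations as string names, a spelling that newer pandas releases tend to prefer over NumPy callables.

```python
import pandas as pd

# Hypothetical toy data standing in for country_code and listed_count.
toy = pd.DataFrame({
    "country_code": ["US", "US", "ES", "ES", "IN"],
    "listed_count": [10, 30, 5, 15, 78],
})

# Compute count, mean, and standard deviation per country in one call,
# then sort by the mean column so the highest averages come first.
summary = (
    toy.groupby("country_code")["listed_count"]
       .agg(["count", "mean", "std"])
       .sort_values("mean", ascending=False)
)

print(summary)
```

A country with a single row, like IN here, has no defined standard deviation, so its std entry is NaN.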
