# Module 6: Pandas, Visualization
### NBA Tweets Exploration

In this activity, we'll be using the material we've learned to explore a [dataset of tweets from the 2018 NBA finals between the Warriors and the Cavaliers](https://www.kaggle.com/xvivancos/tweets-during-cavaliers-vs-warriors)! This dataset is not what we call **clean**, which means that the data contains lots of imperfections! In order to make it easier to analyze, we first want to clean our data.

In [1]:
import pandas as pd
import numpy as np 
import matplotlib.pyplot as plt

## Loading and Cleaning Data

First, we'll load in our data. Note that we're providing two additional arguments to our `pd.read_csv` function: `encoding`, which specifies how to read the text, and `index_col`, which specifies which column to use as the index (kind of like column names, but for rows) of the dataset.
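
To see what these two arguments do in isolation, here is a minimal sketch on a made-up, two-row CSV held in memory (the data and the name `toy` are assumptions for illustration, not part of the tweets file):

```python
import io
import pandas as pd

# Hypothetical CSV bytes, encoded as ISO-8859-1 (Latin-1) like the tweets file.
# The first, unnamed column holds the row labels.
raw = ",text,lang\n1,Já começou,pt\n2,lets go Cavs,en\n".encode("ISO-8859-1")

# `encoding` tells pandas how to turn the bytes back into text;
# `index_col=0` uses the first column as the row index instead of a data column.
toy = pd.read_csv(io.BytesIO(raw), encoding="ISO-8859-1", index_col=0)

print(toy)
print(toy.index.tolist())
```

With `index_col=0`, the row labels come from the file rather than from a fresh 0, 1, 2, ... counter.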

Try running the two commented-out lines of code without the extra arguments to `pd.read_csv`, and see what happens! What do you notice?

In [5]:
tweets = pd.read_csv("nbatweets/TweetsNBA.csv", encoding = "ISO-8859-1", index_col=0)
# tweets = pd.read_csv("nbatweets/TweetsNBA.csv", encoding = "ISO-8859-1")
# tweets = pd.read_csv("nbatweets/TweetsNBA.csv", index_col=0)

tweets.head()

Unnamed: 0,text,retweet_count,favorite_count,favorited,truncated,id_str,in_reply_to_screen_name,source,retweeted,created_at,...,place_type,full_name,place_name,place_id,place_lat,place_lon,lat,lon,expanded_url,url
1,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,1004531741216989191,,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,...,,,,,,,,,,
2,Ja comecou e eu no onibus https://t.co/wPgRLw...,0,0,False,False,1004531741422481409,,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,...,,,,,,,,,https://twitter.com/NBA/status/100447552030887...,https://t.co/wPgRLwdg1O
3,lets go Cavs\r\n#WhateverItTakes \r\n#NBAFinals,0,0,False,False,1004531741954981888,,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,...,,,,,,,,,,
4,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,1004531743410573312,,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,...,,,,,,,,,,
5,RT @NBA: Count down @StephenCurry30's TOP 5 t...,0,0,False,False,1004531743272194048,,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,...,,,,,,,,,,


Let's take a look at how many tweets we have and what data we have about each tweet!

In [6]:
len(tweets)

51425

In [7]:
tweets.columns

Index(['text', 'retweet_count', 'favorite_count', 'favorited', 'truncated',
       'id_str', 'in_reply_to_screen_name', 'source', 'retweeted',
       'created_at', 'in_reply_to_status_id_str', 'in_reply_to_user_id_str',
       'lang', 'listed_count', 'verified', 'location', 'user_id_str',
       'description', 'geo_enabled', 'user_created_at', 'statuses_count',
       'followers_count', 'favourites_count', 'protected', 'user_url', 'name',
       'time_zone', 'user_lang', 'utc_offset', 'friends_count', 'screen_name',
       'country_code', 'country', 'place_type', 'full_name', 'place_name',
       'place_id', 'place_lat', 'place_lon', 'lat', 'lon', 'expanded_url',
       'url'],
      dtype='object')

If you take a look at the dataset, you might notice that there are a lot of columns with values that just say `NaN`. `NaN` is shorthand for `Not a Number`, which is used to denote missing values!

We can use a few methods to work with `NaN` values in Pandas.

We can find the null values in a DataFrame (the whole table) or Series (a particular column or row) with the following commands:
- `df.isna()` returns True for each null value and False for each non-null value
- `df.notna()` returns False for each null value and True for each non-null value
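
As a small illustration of these two methods, the sketch below builds a made-up DataFrame (the values and the name `toy` are assumptions) and uses the resulting True/False masks:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing name and one missing follower count
toy = pd.DataFrame({
    "name": ["Ava", None, "GB"],
    "followers_count": [285.0, np.nan, 520.0],
})

null_mask = toy.isna()          # DataFrame of booleans, True where a value is missing
has_name = toy["name"].notna()  # Series of booleans, True where a name is present

# A boolean Series can be used to keep only the rows where it is True
named_rows = toy[has_name]

print(null_mask)
print(named_rows)
```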

Once you know what null values exist, you can handle null values with the following functions:
- [`fillna(<value>)`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html) will fill all null values with the value you pass in. We can fill null values with another value such as 'missing' or 0, or the mean of the data, or something else, depending on the data.
- [`dropna()`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) will drop all rows or columns that have null values. You can choose to drop rows or columns by using the `axis` argument; 0 means drop any rows with null values, and 1 means drop any columns that have null values. You can also control how many null values it takes to drop something with the `how` and `thresh` arguments (`thresh` is the minimum number of non-null values a row or column needs in order to be kept), which you can read more about in the [documentation](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html).
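
Here is a short sketch of both methods on a made-up DataFrame (the column names `a`, `b`, `c` and the values are assumptions for illustration):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0],     # one missing value
    "b": [np.nan, np.nan, 6.0],  # two missing values
    "c": [7.0, 8.0, 9.0],        # no missing values
})

filled = toy.fillna(0)                       # every NaN becomes 0
rows_kept = toy.dropna(axis=0)               # keep only rows with no NaN at all
cols_kept = toy.dropna(axis=1)               # keep only columns with no NaN at all
at_least_two = toy.dropna(axis=1, thresh=2)  # keep columns with >= 2 non-null values

print(rows_kept)
print(cols_kept)
print(at_least_two)
```

Note that none of these calls change `toy` itself; each one returns a new DataFrame.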

First, let's try and write some code to figure out how many missing values there are in each column:

### Question 1

In [9]:
# PROMPT
for column in "___YOUR CODE HERE___":
    # Recall that we can use `np.count_nonzero` to count the number of true values in a list, array, or Series
    # What might we want to count?
    num_true = np.count_nonzero(...)
    print(column, num_true)
    
# SOLUTION
for column in tweets.columns:
    # Recall that we can use `np.count_nonzero` to count the number of true values in a list, array, or Series
    # What might we want to count? (Hint: it involves the column we're currently looking at and one of the functions
    # described above!)
    num_true = np.count_nonzero(tweets[column].isna())
    print(column, num_true)

text 0
retweet_count 0
favorite_count 0
favorited 0
truncated 0
id_str 0
in_reply_to_screen_name 50367
source 0
retweeted 0
created_at 0
in_reply_to_status_id_str 50967
in_reply_to_user_id_str 50367
lang 0
listed_count 0
verified 0
location 14487
user_id_str 0
description 7437
geo_enabled 0
user_created_at 0
statuses_count 0
followers_count 0
favourites_count 0
protected 0
user_url 36235
name 2
time_zone 51425
user_lang 0
utc_offset 51425
friends_count 0
screen_name 0
country_code 49762
country 49762
place_type 49761
full_name 49761
place_name 49761
place_id 49761
place_lat 49761
place_lon 49761
lat 51414
lon 51414
expanded_url 41897
url 41897


Note that we earlier found that there are 51425 items in the dataset. For a lot of these columns, most of the values are null! And for some of them, all of the values are null!
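
The per-column counts can also be computed without a loop. This is only an alternative sketch, shown on a made-up DataFrame (`toy` and its values are assumptions): summing the boolean result of `isna()` counts the True values in each column, and taking the mean gives the share of missing values.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "text": ["go cavs", "dubs in 4", "what a game"],
    "location": ["Cleveland", None, None],
    "time_zone": [np.nan, np.nan, np.nan],
})

# True counts as 1 and False as 0, so the sum is the number of missing values
missing_per_column = toy.isna().sum()

# The mean of a boolean column is the fraction of True values
share_missing = toy.isna().mean()

print(missing_per_column)
print(share_missing)
```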

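Let's keep only the columns that have at least 49000 non-null values, which drops every column that is mostly empty. We can use the `thresh` and `axis` arguments here... What might we need to pass in? (Note that `thresh` counts the non-null values a column needs in order to be kept, not the number of nulls.)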

### Question 2

In [12]:
# PROMPT
cleaned_tweets = tweets.dropna(axis=..., thresh=...)

# SOLUTION
cleaned_tweets = tweets.dropna(axis=1, thresh=49000)

In [13]:
cleaned_tweets

Unnamed: 0,text,retweet_count,favorite_count,favorited,truncated,id_str,source,retweeted,created_at,lang,...,geo_enabled,user_created_at,statuses_count,followers_count,favourites_count,protected,name,user_lang,friends_count,screen_name
1,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,1004531741216989191,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,True,Thu Jan 06 20:41:19 +0000 2011,5860,285,12836,False,Ava,en,806,avakrutko
2,Ja comecou e eu no onibus https://t.co/wPgRLw...,0,0,False,False,1004531741422481409,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,pt,...,True,Thu May 04 09:46:22 +0000 2017,2200,113,1963,False,Markinho,pt,119,Marcoosloco
3,lets go Cavs\r\n#WhateverItTakes \r\n#NBAFinals,0,0,False,False,1004531741954981888,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,False,Sat Jun 24 03:16:12 +0000 2017,578,380,736,False,WR E N Z,en,319,wrenzberja
4,RT @cavs: #NBAFinals GAME 3 STARTERS:\r\n\r\n@...,0,0,False,False,1004531743410573312,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,False,Fri Aug 28 21:58:35 +0000 2015,67308,520,17161,False,GB,pt,171,duartegabriel35
5,RT @NBA: Count down @StephenCurry30's TOP 5 t...,0,0,False,False,1004531743272194048,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,True,Fri Jan 12 15:12:18 +0000 2018,10357,805,4057,False,NG,en,326,NgDaizha
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51421,RT @NBAUK: Jordan Bell lifts off! ??\r\n\r\n#D...,0,0,False,False,1004543061630779392,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:58:24 +0000 2018,en,...,False,Fri Jan 06 19:42:48 +0000 2017,1470,56,5582,False,Sco Ducks,en,920,quackerforlife
51422,RT @OfficialNBARefs: What is it like to watch ...,0,0,False,False,1004543061672669185,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:58:24 +0000 2018,en,...,True,Thu Feb 03 02:55:20 +0000 2011,954,56,628,False,Marvin Dalit,en,196,marvdalit
51423,RT @fanaticsview: LeBron James throws the bac...,0,0,False,False,1004543062087909376,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:58:24 +0000 2018,en,...,True,Sun Nov 23 10:46:20 +0000 2014,6832,1880,48676,False,<U+014A><U+03F0> <U+039C>a<U+0438><U+03B9>,en,2852,NxMani
51424,https://t.co/mrT5JMngl4,0,0,False,False,1004543061874135040,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:58:24 +0000 2018,und,...,True,Sat Jul 19 00:12:49 +0000 2014,15558,288,2099,False,Nicko <ed><U+00A0><U+00BD><ed><U+00B1><U+0091>,en,162,nnnnicole4
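
As a sanity check on what `thresh` does, the sketch below compares `dropna` against a manual filter built from `count()`, which returns the number of non-null values per column. The DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "full":   [1.0, 2.0, 3.0, 4.0],           # 4 non-null values
    "mostly": [1.0, np.nan, 3.0, 4.0],        # 3 non-null values
    "sparse": [np.nan, np.nan, np.nan, 4.0],  # 1 non-null value
})

non_null = toy.count()  # non-null values in each column

# Keep columns with at least 3 non-null values, two different ways
kept_by_hand = toy.loc[:, non_null >= 3]
kept_by_dropna = toy.dropna(axis=1, thresh=3)

print(non_null)
print(list(kept_by_dropna.columns))
```

Both approaches keep the same columns, which confirms that `thresh` is a minimum count of non-null values.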


## Analyzing Data
Now that we have a cleaned dataset, we can start some analysis! Let's begin with `groupby`, which you might recall from last class, and look at how many tweets were written in each language.

First, let's just select the `text` and `lang` columns from our dataset, then group by the language:

### Question 3

In [15]:
# PROMPT
grouped_by_lang = cleaned_tweets[...].groupby(...)

# SOLUTION
grouped_by_lang = cleaned_tweets[["text", "lang"]].groupby("lang")

In [24]:
grouped_by_lang

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x7fe46990b850>

Now, let's try aggregating these values; note that we can use the `sort_values` function to sort a DataFrame by a certain column.

In [22]:
count_by_lang = grouped_by_lang.count().sort_values('text', ascending=False)
count_by_lang

lang,text
en,37756
und,3237
es,3164
ar,2129
pt,1901
tr,1205
tl,627
fr,409
in,276
ja,194


The second-most common language is `und`, which stands for undetermined: Twitter could not detect a language for those tweets!

Notice that the entries in the lang field are bolded, and when you try to access it as a column, pandas raises a `KeyError`, as shown below.

In [26]:
count_by_lang["lang"]

KeyError: 'lang'

This is because when you `groupby` a certain column, that column becomes the `index` of the resulting DataFrame. What does that mean? It becomes the row label for each row, since there's one row for each distinct value in the column you grouped by! If you want to access the field that you grouped by as a regular column, you must reset the index.

In [23]:
count_by_lang_reset_index = count_by_lang.reset_index()
count_by_lang_reset_index

Unnamed: 0,lang,text
0,en,37756
1,und,3237
2,es,3164
3,ar,2129
4,pt,1901
5,tr,1205
6,tl,627
7,fr,409
8,in,276
9,ja,194


Now we can access the lang field!
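
The whole group, count, sort, and reset sequence can be sketched on a tiny made-up DataFrame (the rows and the name `toy` are assumptions for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "text": ["go cavs", "vamos", "lets go", "what a game"],
    "lang": ["en", "es", "en", "en"],
})

counts = toy.groupby("lang").count().sort_values("text", ascending=False)

# After grouping, `lang` is the index, not a column
print(counts.index.name)
print("lang" in counts.columns)

# reset_index() moves `lang` back into a regular column
flat = counts.reset_index()
print(flat["lang"].tolist())
```

Values in the index can still be looked up by label, for example `counts.loc["en", "text"]`, without resetting the index.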

As you may have noticed, English is the most common language among all the tweets. 

## String Manipulation

Now, let's try analyzing the contents of the tweets themselves. Pandas provides a built-in `.str` accessor on Series, which we can use to manipulate string data. For more complex patterns, you can use [Regular Expressions](https://en.wikipedia.org/wiki/Regular_expression). There are free online testers such as [regex101](https://regex101.com/), which are helpful for trying out regex code. However, we'll focus on the simpler built-in Pandas string methods.

We can use methods such as:
- `df[column].str.replace(<old_value>, <new_value>)`
    - replace a string value in a series
- `df[column].str.contains(<value>)`
    - see if a series contains a string or substring
- `df[column].str.lower()`
    - make strings all lowercase
- `df[column].str.upper()`
    - make strings all uppercase
- `df[column].str.len()`
    - calculate the length of a string
- `df[column].str.cat()`
    - `cat` is short for concatenate, which is like adding 2 strings together. This is helpful for cases such as combining first and last name to make full name.
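
Here is a short sketch that tries each of these methods on a made-up DataFrame of names (the data and variable names are assumptions for illustration):

```python
import pandas as pd

names = pd.DataFrame({
    "first": ["LeBron", "Stephen"],
    "last": ["James", "Curry"],
})

# cat: join two string columns element by element, with a separator
full = names["first"].str.cat(names["last"], sep=" ")

lower = full.str.lower()    # all lowercase
upper = full.str.upper()    # all uppercase
lengths = full.str.len()    # number of characters in each string

# contains: True where the substring appears
has_curry = lower.str.contains("curry")

# replace: regex=False treats the old value as plain text, not a pattern
replaced = full.str.replace("Stephen", "Steph", regex=False)

print(full.tolist())
print(lengths.tolist())
print(has_curry.tolist())
print(replaced.tolist())
```

Each method returns a new Series, so the result has to be assigned back to a column if you want to keep it.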
    
Note: This code might output a warning; we can ignore it.

### Question 4

In [30]:
# PROMPT
# cleaned_tweets['text'] = cleaned_tweets['text']._____ # FILL IN THE FUNCTION
# cleaned_tweets

# SOLUTION
cleaned_tweets['text'] = cleaned_tweets['text'].str.lower() # FILL IN THE FUNCTION
cleaned_tweets

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  cleaned_tweets['text'] = cleaned_tweets['text'].str.lower() # FILL IN THE FUNCTION


Unnamed: 0,text,retweet_count,favorite_count,favorited,truncated,id_str,source,retweeted,created_at,lang,...,geo_enabled,user_created_at,statuses_count,followers_count,favourites_count,protected,name,user_lang,friends_count,screen_name
1,rt @cavs: #nbafinals game 3 starters:\r\n\r\n@...,0,0,False,False,1004531741216989191,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,True,Thu Jan 06 20:41:19 +0000 2011,5860,285,12836,False,Ava,en,806,avakrutko
2,ja comecou e eu no onibus https://t.co/wpgrlw...,0,0,False,False,1004531741422481409,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,pt,...,True,Thu May 04 09:46:22 +0000 2017,2200,113,1963,False,Markinho,pt,119,Marcoosloco
3,lets go cavs\r\n#whateverittakes \r\n#nbafinals,0,0,False,False,1004531741954981888,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,False,Sat Jun 24 03:16:12 +0000 2017,578,380,736,False,WR E N Z,en,319,wrenzberja
4,rt @cavs: #nbafinals game 3 starters:\r\n\r\n@...,0,0,False,False,1004531743410573312,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,False,Fri Aug 28 21:58:35 +0000 2015,67308,520,17161,False,GB,pt,171,duartegabriel35
5,rt @nba: count down @stephencurry30's top 5 t...,0,0,False,False,1004531743272194048,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:13:25 +0000 2018,en,...,True,Fri Jan 12 15:12:18 +0000 2018,10357,805,4057,False,NG,en,326,NgDaizha
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
51421,rt @nbauk: jordan bell lifts off! ??\r\n\r\n#d...,0,0,False,False,1004543061630779392,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:58:24 +0000 2018,en,...,False,Fri Jan 06 19:42:48 +0000 2017,1470,56,5582,False,Sco Ducks,en,920,quackerforlife
51422,rt @officialnbarefs: what is it like to watch ...,0,0,False,False,1004543061672669185,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:58:24 +0000 2018,en,...,True,Thu Feb 03 02:55:20 +0000 2011,954,56,628,False,Marvin Dalit,en,196,marvdalit
51423,rt @fanaticsview: lebron james throws the bac...,0,0,False,False,1004543062087909376,"<a href=""http://twitter.com/download/android"" ...",False,Thu Jun 07 01:58:24 +0000 2018,en,...,True,Sun Nov 23 10:46:20 +0000 2014,6832,1880,48676,False,<U+014A><U+03F0> <U+039C>a<U+0438><U+03B9>,en,2852,NxMani
51424,https://t.co/mrt5jmngl4,0,0,False,False,1004543061874135040,"<a href=""http://twitter.com/download/iphone"" r...",False,Thu Jun 07 01:58:24 +0000 2018,und,...,True,Sat Jul 19 00:12:49 +0000 2014,15558,288,2099,False,Nicko <ed><U+00A0><U+00BD><ed><U+00B1><U+0091>,en,162,nnnnicole4
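
The warning printed above is pandas's `SettingWithCopyWarning`: pandas is unsure whether the DataFrame being modified is an independent table or a piece of another one. Whether it appears depends on the pandas version. One common way to make the intent explicit, sketched here on a made-up DataFrame rather than the tweets data, is to take a `.copy()` before assigning to a column:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "text": ["Go Cavs", "Dub Nation"],
    "empty": [np.nan, np.nan],
})

# .copy() makes `cleaned` an independent DataFrame, so assigning to one of
# its columns cannot be confused with modifying `toy`
cleaned = toy.dropna(axis=1, thresh=1).copy()
cleaned["text"] = cleaned["text"].str.upper()

print(cleaned)
print(toy["text"].tolist())  # the original is unchanged
```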


Let's now try taking a look at how many tweets mention the Warriors:

In [31]:
np.count_nonzero(cleaned_tweets['text'].str.contains('warriors'))

3737

So 3737 of the tweets in our dataset contain the word "warriors"!

This might be an interesting analysis to do for a bunch of words! Let's write a function that takes in a word and outputs the number of tweets in our dataset that contain it.


### Question 5

In [32]:
# PROMPT
def count_word(word):
    return ...

# SOLUTION
def count_word(word):
    return np.count_nonzero(cleaned_tweets['text'].str.contains(word))

Let's try calling our function on a few different words:

In [33]:
count_word("cavs")

5792

In [34]:
count_word("cavaliers")

1027

In [35]:
count_word("lebron")

8156

In [36]:
count_word("steph")

678

Now, try calling the function on some words of your own!
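
Two details of `str.contains` are worth keeping in mind when picking words. It matches substrings, so "steph" also matches "stephen", and by default it treats the word as a regular expression, so characters like `.` have special meaning. The sketch below is a hypothetical variant of the counting function (the name `count_tweets_containing` and the sample Series are assumptions) that matches the word literally and treats missing text as "no match":

```python
import pandas as pd

def count_tweets_containing(texts, word):
    """Count how many strings in `texts` contain `word` as a literal substring."""
    # regex=False: match the characters exactly, with no pattern syntax
    # na=False: a missing tweet counts as not containing the word
    mask = texts.str.contains(word.lower(), regex=False, na=False)
    return int(mask.sum())

# Made-up, already-lowercased tweets, including one missing value
toy_texts = pd.Series(["lets go cavs", "cavs cavs cavs!", "warriors in 4", None, "c.a.v.s"])

cavs_count = count_tweets_containing(toy_texts, "cavs")
dot_count = count_tweets_containing(toy_texts, ".")

print(cavs_count)
print(dot_count)
```

Like the function above, this counts tweets rather than occurrences: a tweet that repeats a word several times is still counted once.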

## Summary Statistics
Let's try taking a look at our data and finding some information out about it!

Summary statistics condense a column of data into a few numbers, such as its mean, minimum, and maximum. Let's take a look at our tweet data and calculate some summary statistics.
- `df.describe()`
- `df[column].mean()`
- `df[column].min()`
- `df[column].max()`
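
As a quick sketch of these methods, the example below uses a made-up column of five follower counts (the numbers are invented for illustration and do not come from the tweets data). It also calls `median()`, which is not listed above but is a standard pandas method:

```python
import pandas as pd

toy = pd.DataFrame({"followers_count": [10, 200, 350, 900, 1_000_000]})
col = toy["followers_count"]

print(col.mean())    # the average
print(col.median())  # the middle value
print(col.min())
print(col.max())

# describe() reports count, mean, std, min, quartiles, and max in one table
summary = toy.describe()
print(summary)
```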

In [37]:
cleaned_tweets.describe()

Unnamed: 0,retweet_count,favorite_count,id_str,listed_count,user_id_str,statuses_count,followers_count,favourites_count,friends_count
count,51425.0,51425.0,51425.0,51425.0,51425.0,51425.0,51425.0,51425.0,51425.0
mean,0.0,0.0,1.004537e+18,53.655226,1.846986e+17,27146.2,15277.66,11328.81,867.86244
std,0.0,0.0,3190923000000.0,779.539123,3.58742e+17,64392.3,410977.7,29156.22,5460.407658
min,0.0,0.0,1.004532e+18,0.0,3629.0,1.0,0.0,0.0,0.0
25%,0.0,0.0,1.004534e+18,0.0,280438900.0,2178.0,148.0,751.0,194.0
50%,0.0,0.0,1.004536e+18,2.0,1109339000.0,8516.0,365.0,3240.0,385.0
75%,0.0,0.0,1.004539e+18,10.0,3584163000.0,26517.0,851.0,10677.0,771.0
max,0.0,0.0,1.004543e+18,48677.0,1.004542e+18,2748293.0,27814570.0,1036221.0,742559.0


This is a little hard to read! All of these numbers are of type `float`, which is why some of them end in suffixes like `e+18` -- pandas is displaying them in scientific notation!

To make them easier to read, let's convert them to integers. 

We can use the pandas [astype()](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.astype.html) function to convert a column to a different type -- in this case, let's convert all of our columns to `int`s.

`df.astype(int)`

In [38]:
cleaned_tweets.describe().astype(int)

Unnamed: 0,retweet_count,favorite_count,id_str,listed_count,user_id_str,statuses_count,followers_count,favourites_count,friends_count
count,51425,51425,51425,51425,51425,51425,51425,51425,51425
mean,0,0,1004536761083506048,53,184698628093268288,27146,15277,11328,867
std,0,0,3190923466062,779,358742024143241472,64392,410977,29156,5460
min,0,0,1004531741216989184,0,3629,1,0,0,0
25%,0,0,1004533716486631424,0,280438883,2178,148,751,194
50%,0,0,1004536224663552000,2,1109339010,8516,365,3240,385
75%,0,0,1004539318755618816,10,3584163256,26517,851,10677,771
max,0,0,1004543062427697152,48677,1004541752487829504,2748293,27814570,1036221,742559
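
Two things happen in this conversion that are easy to miss. First, `astype(int)` truncates rather than rounds, which is why the mean `friends_count` of 867.86244 shows up as 867. Second, very large numbers such as tweet IDs cannot be stored exactly as floats, so their last digits change. The sketch below illustrates both, using two means and the first tweet's ID copied from the outputs above (the variable names are assumptions):

```python
import pandas as pd

# Two means taken from the describe() output above
stats = pd.Series({"listed_count": 53.655226, "friends_count": 867.86244})

truncated = stats.astype(int)        # drops the decimal part
rounded = stats.round().astype(int)  # rounds to the nearest integer first

print(truncated.tolist())
print(rounded.tolist())

# The ID of the first tweet in the dataset
tweet_id = 1004531741216989191

# Passing through a float loses the last few digits of such a large integer
round_trip = int(float(tweet_id))
print(round_trip)
print(round_trip == tweet_id)
```

This is one reason ID columns are often better treated as strings than as numbers: statistics like the mean of an ID are not meaningful anyway.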


### Question 6
What do you notice about the data in each of these? What is the maximum number of followers someone has? What is the average number of followers someone has? What is the median number of followers someone has? Why might some of these statistics make sense or not make sense?

Specifically, talk about the `retweet_count`, `followers_count`, and `favourites_count` columns.

## Exploratory Data Analysis

Now that you've learned Pandas, try analyzing some of `cleaned_tweets` on your own! Take a look at the columns in the dataset and try using them to investigate something about the data you've been given. 