In [1]:
import pandas as pd
import numpy as np

In [3]:
# reading data
nfl_data = pd.read_csv("/home/oktavianu/data/NFL/NFL Play by Play 2009-2017 (v4).csv")
nfl_data.head()


Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season
0,2009-09-10,2009091000,1,1,,15:00,15,3600.0,0.0,TEN,...,,0.485675,0.514325,0.546433,0.453567,0.485675,0.060758,,,2009
1,2009-09-10,2009091000,1,1,1.0,14:53,15,3593.0,7.0,PIT,...,1.146076,0.546433,0.453567,0.551088,0.448912,0.546433,0.004655,-0.032244,0.036899,2009
2,2009-09-10,2009091000,1,1,2.0,14:16,15,3556.0,37.0,PIT,...,,0.551088,0.448912,0.510793,0.489207,0.551088,-0.040295,,,2009
3,2009-09-10,2009091000,1,1,3.0,13:35,14,3515.0,41.0,PIT,...,-5.031425,0.510793,0.489207,0.461217,0.538783,0.510793,-0.049576,0.106663,-0.156239,2009
4,2009-09-10,2009091000,1,1,4.0,13:27,14,3507.0,8.0,PIT,...,,0.461217,0.538783,0.558929,0.441071,0.461217,0.097712,,,2009


#### Randomness
When cleaning or preprocessing data, we might use random operations such as filling in missing values or splitting data into training and test sets. Because these operations involve randomness, running the same code several times can give different results unless the random seed is set. This is where NumPy's random seed comes in.

#### Seed
A seed is the starting point for a sequence of pseudo-random numbers. When we set the seed using `np.random.seed(0)`, we initialize NumPy's random number generator with a specific value (in this case, 0). This ensures that the random numbers generated afterwards follow the same sequence on every run.

In [15]:
# set seed for reproducibility
np.random.seed(0)
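
To make the effect of the seed concrete, here is a minimal sketch. The helper `draw_sample` is a hypothetical name introduced only for this illustration; it reseeds the global generator and draws a few integers, so two calls with the same seed should return identical arrays.

```python
import numpy as np

def draw_sample(seed):
    # Hypothetical helper: reseed NumPy's global generator,
    # then draw five integers in the range [0, 100).
    np.random.seed(seed)
    return np.random.randint(0, 100, size=5)

first = draw_sample(0)
second = draw_sample(0)

# Same seed, same sequence: the two arrays match element by element.
same = np.array_equal(first, second)
print(same)
```

Without the call to `np.random.seed`, the two draws would almost always differ, which is exactly what makes unseeded results hard to reproduce.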

In [11]:
nfl_data.shape

(407688, 102)

In [8]:
# count the missing data points in each column
missing_value_count = nfl_data.isnull().sum()

# look at the number of missing values in the first 10 columns
missing_value_count[0:10]

Date                0
GameID              0
Drive               0
qtr                 0
down            61154
time              224
TimeUnder           0
TimeSecs          224
PlayTimeDiff      444
SideofField       528
dtype: int64

That seems like a lot! It might be helpful to see what percentage of the values in our dataset were missing to give us a better sense of 
the scale of this problem:

In [13]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)  # np.product is deprecated and was removed in NumPy 2.0
total_missing = missing_value_count.sum()

# percentage of data that is missing
percent_missing = (total_missing/total_cells) * 100
print(percent_missing)

27.66722370547874


More than a quarter of the cells in this dataset are empty! In the next step, we're going to take a closer look at some of the columns with 
missing values and try to figure out what might be going on with them.
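
Before looking at individual columns, it can help to rank them by how much of each one is missing. The following is a rough sketch on a small made-up DataFrame (the values in `toy` are invented for illustration, not taken from the NFL data); it relies on the fact that the mean of a boolean mask is the fraction of `True` values.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "down": [1.0, np.nan, 2.0, np.nan],
    "time": ["15:00", "14:53", None, "13:35"],
    "qtr": [1, 1, 1, 1],
})

# isnull() gives a boolean mask; its column-wise mean is the share of missing cells
percent_missing_by_column = toy.isnull().mean() * 100

# put the columns with the most missing data first
worst_first = percent_missing_by_column.sort_values(ascending=False)
print(worst_first)
```

The same two lines should work on `nfl_data` in place of `toy`.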

#### Figuring out why data is missing
Some people call this part data intuition. At this stage, we look closely at our data and try to figure out why values are missing and how that will affect our analysis. This is hard, and we often have to rely on intuition. One question is especially useful here:
**Is the value missing because it wasn't recorded or because it does not exist?**
If a value is missing because it does not exist (for example, the height of the oldest child of someone who doesn't have any children), then it doesn't make sense to try to guess what it might be. We probably want to keep these kinds of missing values as `NaN`. On the other hand, if a value is missing because it was not recorded, then we can try to guess what it might have been based on the other values in that column and row. This is called **imputation**.
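
As a minimal sketch of that distinction, assume a toy table with two hypothetical columns: `play_seconds`, whose gaps we assume were simply not recorded, and `penalized_team`, whose gaps we assume mean that no penalty happened. Filling with the column median is only one simple imputation choice among many, not a recommendation for this dataset.

```python
import numpy as np
import pandas as pd

# Toy data with hypothetical column names, invented for illustration
toy = pd.DataFrame({
    "play_seconds": [7.0, np.nan, 41.0, 8.0],      # assumed: value exists but was not recorded
    "penalized_team": ["PIT", None, None, "TEN"],  # assumed: no penalty, so no value exists
})

imputed = toy.copy()

# Impute the "not recorded" column with its median (NaN is skipped when computing it)
median_seconds = imputed["play_seconds"].median()
imputed["play_seconds"] = imputed["play_seconds"].fillna(median_seconds)

# Leave the "does not exist" column alone: its missing values stay missing
print(imputed)
```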

#### Strategy to handle missing values
If we're in a hurry or don't have a reason to figure out why our values are missing, one option is to remove any rows or columns that contain missing values. (This is generally not a recommended approach for important projects! It's usually worth taking the time to go through the columns with missing values one by one and really get to know the dataset.)

If we're sure we want to drop rows with missing values, pandas has a handy function, `dropna()`, to help us do this. Let's try it out on our NFL dataset!

In [16]:
# remove all the rows that contain missing values
nfl_data.dropna()

Unnamed: 0,Date,GameID,Drive,qtr,down,time,TimeUnder,TimeSecs,PlayTimeDiff,SideofField,...,yacEPA,Home_WP_pre,Away_WP_pre,Home_WP_post,Away_WP_post,Win_Prob,WPA,airWPA,yacWPA,Season


Calling `dropna()` this way is not quite right. The output shows that it removed every row: each row in our dataset has at least one missing value, so nothing is left and the result is empty. A better option here is to remove every column that has at least one missing value instead.

In [17]:
# axis=1 drops columns; the default, axis=0, drops rows
columns_with_na_dropped = nfl_data.dropna(axis=1)
columns_with_na_dropped.head()


Unnamed: 0,Date,GameID,Drive,qtr,TimeUnder,ydstogo,ydsnet,PlayAttempted,Yards.Gained,sp,...,AwayTeam,Timeout_Indicator,posteam_timeouts_pre,HomeTimeouts_Remaining_Pre,AwayTimeouts_Remaining_Pre,HomeTimeouts_Remaining_Post,AwayTimeouts_Remaining_Post,ExPoint_Prob,TwoPoint_Prob,Season
0,2009-09-10,2009091000,1,1,15,0,0,1,39,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
1,2009-09-10,2009091000,1,1,15,10,5,1,5,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
2,2009-09-10,2009091000,1,1,15,5,2,1,-3,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
3,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
4,2009-09-10,2009091000,1,1,14,8,2,1,0,0,...,TEN,0,3,3,3,3,3,0.0,0.0,2009
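
To see how much each strategy costs, we can compare shapes before and after dropping. This is a small sketch on made-up data (the values in `toy` are invented; only the column names echo the NFL dataset), showing that row-wise and column-wise `dropna()` discard different things.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only
toy = pd.DataFrame({
    "GameID": [1, 1, 2],
    "down": [1.0, np.nan, 3.0],
    "time": ["15:00", None, "13:35"],
    "Season": [2009, 2009, 2009],
})

rows_dropped = toy.dropna()        # drop rows that contain any missing value
cols_dropped = toy.dropna(axis=1)  # drop columns that contain any missing value

print("Original shape:           ", toy.shape)
print("After dropping rows:      ", rows_dropped.shape)
print("After dropping columns:   ", cols_dropped.shape)
```

Running the same comparison on `nfl_data.shape[1]` and `columns_with_na_dropped.shape[1]` would show how many columns were lost.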
