# Handling Missing Data (Imputation)

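- dropna, isnull, notnull (filter out the missing data)
  - e.g. dropna(axis=0, how='all') or dropna(axis=0, thresh=2); the keyword is thresh, and recent pandas versions do not accept how and thresh together
- fillna (filling in missing data)
  - function arguments: value, method (default is None; 'ffill' or 'bfill' when given), axis, inplace, limit
  - fillna({1: 0.5, 2: 0}): pass a dict to fill in a different value for each column
  - fillna(0, inplace=True): fill every missing value with 0, modifying the frame in place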
   
- remove duplicates and date format
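
The calls listed above can be tried on a small frame before running them on real data. This is a minimal sketch; the toy values are made up for illustration and are not from the NFL dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical data) with integer column labels 0, 1, 2
df = pd.DataFrame(
    [[1.0, 6.5, 3.0],
     [1.0, np.nan, np.nan],
     [np.nan, np.nan, np.nan],
     [np.nan, 6.5, 3.0]]
)

# Drop only the rows where every value is missing
all_missing_dropped = df.dropna(axis=0, how="all")

# Keep only the rows that have at least two non-missing values
at_least_two = df.dropna(axis=0, thresh=2)

# Fill column 1 with 0.5 and column 2 with 0; column 0 is left untouched
filled_per_column = df.fillna({1: 0.5, 2: 0})

# Carry the last observed value forward, but at most one row per gap
forward_filled = df.ffill(limit=1)
```

None of these calls modify `df`; each returns a new frame, which is usually safer than `inplace=True` while exploring.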

## Discussion on missing data:

**Understand why data is missing**
In many cases, some predictors have no values for a given sample. It is important to understand *why* the values are missing. First and foremost, it is important to know if the pattern of missing data is related to the outcome. This is called *informative missingness* since the missing data pattern is instructional on its own. Informative missingness can induce significant bias in the model.

**Censored data vs missing data**
Missing data should not be confused with *censored* data where the exact value is missing but something is known about its value. When building traditional statistical models focused on interpretation or inference, the censoring is usually taken into account in a formal manner by making assumptions about the censoring mechanism. For predictive models, it is more common to treat these data as simple missing data or use the censored value as the observed value.

**How much missing data is there?**
Missing values are more often related to the predictor variables than to the samples. Because of this, the missing data may be concentrated in a subset of predictors rather than occurring randomly across all of them. In some cases, the percentage of missing data is substantial enough to remove the predictor from subsequent modeling activities.

There are cases where the missing values might be concentrated in specific samples. For large datasets, removal of samples based on missing values is not a problem, assuming that the missingness is not informative. In smaller datasets, there is a steep price in removing samples; some of the alternative approaches described below may be more appropriate.
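
One way to see where the missing values are concentrated is to compute the missing fraction per column and per row. The sketch below uses a made-up frame, and the 50% cutoff is an arbitrary assumption, not a recommended value.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: column "b" is mostly missing
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 1.0],
    "c": [1.0, np.nan, 3.0, 4.0],
})

# Fraction of missing values per predictor (column) and per sample (row)
col_missing = df.isnull().mean()
row_missing = df.isnull().mean(axis=1)

# Drop predictors whose missing fraction exceeds an (assumed) cutoff
max_missing = 0.5
kept = df.loc[:, col_missing <= max_missing]
```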

**I have missing data, now what?**
If we do not remove the missing data, there are two general approaches. First, a few predictive models, especially tree-based techniques, can specifically account for missing data. Alternatively, missing data can be imputed. In this case, we can use information in the training set predictors to, in essence, estimate the values of other predictors.

*Imputation* is just another layer of modeling where we try to estimate values of the predictor variables based on other predictor variables. The most relevant scheme for accomplishing this is to use the training set to build an imputation model for each predictor in the data set. Prior to model training or the prediction of new samples, missing values are filled in using imputation. Note that this extra layer of models adds uncertainty. If we are using resampling to select tuning parameter values or to estimate performance with machine learning models, the imputation should be incorporated within the resampling. This will increase the computational time for building models, but it will also provide honest estimates of model performance.
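
To make "imputation inside the resampling" concrete, here is a minimal sketch using simple mean imputation and a hand-rolled 4-fold split. The data, the column names `x1` and `x2`, and the number of folds are all assumptions for illustration; the point is only that the fill values are learned from the training part of each fold and then applied to the held-out part.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical predictors; every fourth value of x1 is knocked out
X = pd.DataFrame(rng.normal(size=(20, 2)), columns=["x1", "x2"])
X.iloc[::4, 0] = np.nan

# Split the row labels into 4 folds
folds = np.array_split(rng.permutation(len(X)), 4)

imputed_parts = []
for held_out in folds:
    train = X.drop(index=held_out)
    # The imputation "model" (column means) is estimated on the training rows only
    fill_values = train.mean()
    # ...and then applied to the held-out rows
    imputed_parts.append(X.loc[held_out].fillna(fill_values))
    # A predictive model would be fit on train.fillna(fill_values)
    # and evaluated on the imputed held-out rows at this point.

imputed = pd.concat(imputed_parts).sort_index()
```

In scikit-learn the same idea is usually expressed by placing an imputer inside a `Pipeline`, so that cross-validation refits it on every fold.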

If the number of predictors affected by missing values is small, an exploratory analysis of the relationships between the predictors is a good idea. For example, visualization or methods like *PCA* can be used to determine if there are strong relationships between the predictors. If a variable with missing values is highly correlated with another predictor that has few missing values, a focused model can often be effective for imputation.
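
As a sketch of such a focused model, the example below checks the correlation between two synthetic predictors and then imputes the missing values of one from the other with a straight-line fit. The data and the choice of a linear fit are assumptions for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Hypothetical predictors: y is close to a linear function of x
x = rng.normal(size=50)
df = pd.DataFrame({"x": x, "y": 2.0 * x + 1.0 + rng.normal(scale=0.1, size=50)})
df.loc[[3, 10, 25], "y"] = np.nan

# Correlation is computed on the rows where both values are present
corr = df["x"].corr(df["y"])

# Fit y ~ slope * x + intercept on the complete rows
observed = df.dropna(subset=["y"])
slope, intercept = np.polyfit(observed["x"], observed["y"], deg=1)

# Use the fitted line to fill in the missing y values
missing = df["y"].isnull()
df.loc[missing, "y"] = slope * df.loc[missing, "x"] + intercept
```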

One popular technique for imputation is a *$K$-nearest neighbor* model. A new sample is imputed by finding the samples in the training set "closest" to it and averaging these nearby points to fill in the value. One advantage of this approach is that the imputed data are confined to be within the range of the training set values. One disadvantage is that the entire training set is required every time a missing value needs to be imputed. Also, the number of neighbors is a tuning parameter, as is the method for determining "closeness" of two points. However, Troyanskaya et al. (2001) found the nearest neighbor approach to be fairly robust to the tuning parameters, as well as the amount of missing data.
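
The nearest-neighbor idea can be sketched in a few lines of NumPy. The helper `knn_impute` below is hypothetical (written for this note, not a library function); it assumes the training rows are complete, the features are on comparable scales, and Euclidean distance is the measure of "closeness".

```python
import numpy as np

def knn_impute(train, sample, k=3):
    """Fill NaNs in `sample` with the mean of its k nearest training rows."""
    sample = np.asarray(sample, dtype=float).copy()
    missing = np.isnan(sample)
    # Euclidean distance, using only the features observed in `sample`
    diffs = train[:, ~missing] - sample[~missing]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    # Indices of the k closest training rows
    nearest = np.argsort(dists)[:k]
    # Average the neighbors' values for each missing feature
    sample[missing] = train[nearest][:, missing].mean(axis=0)
    return sample

# Hypothetical training set: two clear clusters
train = np.array([
    [1.0, 1.0, 10.0],
    [1.1, 0.9, 12.0],
    [0.9, 1.1, 11.0],
    [5.0, 5.0, 50.0],
    [5.2, 4.8, 52.0],
])

new_sample = np.array([1.0, 1.0, np.nan])
imputed = knn_impute(train, new_sample, k=3)
```

Because the imputed value is an average of training values, it cannot fall outside the range seen in the training set. scikit-learn's `KNNImputer` offers a more complete implementation of the same idea.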

In [None]:
import pandas as pd
import numpy as np
# set seed
np.random.seed(0)

In [None]:
# read the data
nfl_data = pd.read_csv('NFL Play by Play 2009-2017.csv', low_memory = False)

In [None]:
# handling missing data
nfl_data.head()

In [None]:
nfl_data.info(memory_usage = 'deep')

In [None]:
# get the number of missing data points per column
missing_values_count = nfl_data.isnull().sum()

# look at the # of missing points in the first ten columns
missing_values_count[0:10]

To get a better sense of the problem, we can also look at the percentage of missing data:

In [None]:
# how many total missing values do we have?
total_cells = np.prod(nfl_data.shape)
total_missing = missing_values_count.sum()

# percent of data that is missing
round((total_missing/total_cells) * 100,2)

The TimeSecs column records how many seconds are left in the game. Its values are probably missing because they were not recorded, so it makes more sense to guess (impute) them than to leave them blank.

PenalizedTeam, on the other hand, is missing whenever there was no penalty on the play. One way to handle this is to leave the values empty, or to create a new column that says whether there was a penalty.
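
A minimal sketch of the second option, on a made-up frame. The team codes, the `has_penalty` and `PenalizedTeam_filled` column names, and the "No penalty" label are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical plays: NaN means no penalty was called
plays = pd.DataFrame({"PenalizedTeam": ["NE", np.nan, np.nan, "DAL"]})

# Indicator column: was there a penalty on this play?
plays["has_penalty"] = plays["PenalizedTeam"].notnull()

# Alternatively, make the "no penalty" case an explicit category
plays["PenalizedTeam_filled"] = plays["PenalizedTeam"].fillna("No penalty")
```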

In [None]:
# let's see the missing TimeSecs rows:
nfl_data[nfl_data['TimeSecs'].isnull()]

Looks like almost 25% of the cells in the dataset are missing. Next, we'll decide what to do with the missing data.

# What do we do with missing values?

## a) Drop missing values
- dropna(): removes all the rows that contain a missing value
- to remove all columns with at least one missing value instead, set axis = 1

In [None]:
nfl_data_na_column_dropped = nfl_data.dropna(axis=1)
nfl_data_na_column_dropped.head()

In [None]:
# the original dataset had 102 columns, 41 columns are removed
print(nfl_data.shape)
print(nfl_data_na_column_dropped.shape)

## b) Filling in missing values automatically

In [None]:
nfl_data[nfl_data['TimeSecs'].isnull()].head()

In [None]:
# replace all NA's with 0
subset_timesec_na = nfl_data[nfl_data['TimeSecs'].isnull()]
subset_timesec_na.fillna(0).head()

In [None]:
subset_timesec_na.head()

In [None]:
# replace missing values with whatever value comes directly after it in the same column
# using bfill method
# (fillna(method='bfill') is deprecated in recent pandas; bfill() is the equivalent call)
subset_timesec_na.bfill().head()

# What else can we clean up?

## 1. Check for Duplicates

In [None]:
# show the rows that are exact duplicates of an earlier row
nfl_data[nfl_data.duplicated()]
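
Once duplicates have been found, they can be counted and dropped. A minimal sketch on a made-up frame; the column names `GameID` and `play` are assumptions for illustration.

```python
import pandas as pd

# Hypothetical data: row 1 is an exact copy of row 0
df = pd.DataFrame({
    "GameID": [1, 1, 2, 2],
    "play": ["run", "run", "pass", "kick"],
})

# Number of rows that repeat an earlier row exactly
n_duplicates = df.duplicated().sum()

# Keep the first occurrence of each fully identical row
deduplicated = df.drop_duplicates()

# Or define "duplicate" by a subset of columns only
by_game = df.drop_duplicates(subset=["GameID"], keep="first")
```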

## 2. Check the Date

In [None]:
nfl_data.Date.head()

In [None]:
nfl_data.Date.dtype

## 3. Convert date column to a real date!
- 1/17/07 has the format "%m/%d/%y"
- 17-1-2007 has the format "%d-%m-%Y"
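
The two format strings above can be checked directly with `pd.to_datetime`. The second part is a sketch of how `errors="coerce"` turns values that do not match the format into `NaT`, so they can be inspected; the example strings are made up.

```python
import pandas as pd

# Both strings describe the same day, written in different formats
a = pd.to_datetime("1/17/07", format="%m/%d/%y")
b = pd.to_datetime("17-1-2007", format="%d-%m-%Y")

# Values that do not match the format become NaT instead of raising an error
raw = pd.Series(["2017-01-05", "not a date"])
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")
n_unparsed = parsed.isnull().sum()
```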


In [None]:
# overwrite the Date column with the parsed dates
nfl_data.Date = pd.to_datetime(nfl_data.Date, format = "%Y-%m-%d")
nfl_data.Date.head()

There may be cases where the date column has multiple formats; in that case, use "infer_datetime_format=True".


In [None]:
nfl_data['Date_dt'] = pd.to_datetime(nfl_data.Date, infer_datetime_format=True)

It is better to specify the date format, because infer_datetime_format isn't very efficient and may be unable to identify the correct format. The argument is also deprecated as of pandas 2.0.


## 4. Using the parsed date column

In [None]:
# get the day of the month from the Date_dt column
nfl_data['day_of_month'] = nfl_data['Date_dt'].dt.day
nfl_data['day_of_month'].describe()

In [None]:
import seaborn as sns
# distplot is deprecated in recent seaborn; histplot draws the same histogram
sns.histplot(nfl_data['day_of_month'], bins=31);