
# Handling Missing Values

### Introduction

In the last lesson we saw how to detect missing values. In this lesson we will see how to handle them. As we'll see, the fact that data is missing may itself contain information.

### Throw it out?

The easiest, and perhaps most tempting, way to handle missing values is to throw the buggers out.

In general, this is not a great idea. Here is why:

**The missingness may be an indication of something that can predict the target variable.** And thus by throwing away missing data, we will be further exposed to omitted variable bias.

For example, in our SAT score data, we have a feature recording the percentage of students who took the SAT at each school. And in this feature, we have some missing data.

Well, whether or not a school reports this information may not be random.  For example: 
* It could be that schools that do not record the percent tested are less likely to put effort into prepping for the SAT.
* Or it could be that schools that don't record the percent tested require every student to take the SAT, and their students take the exam very seriously.

Either way, there may be an association between this information being recorded and a student's performance.  And because of this, we don't want to throw out this information.  Our model may find it helpful in predicting our target. 
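
One rough way to check whether missingness carries a signal is to compare the average target for rows where the feature is missing against rows where it is present. The sketch below uses made-up toy numbers, not the SAT dataset, and the helper name `mean_target_by_missingness` and the column `avg_math_score` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only (not the SAT dataset).
toy = pd.DataFrame({
    "percent_tested": [90.0, np.nan, 75.0, np.nan, 60.0, 85.0],
    "avg_math_score": [520, 410, 480, 430, 450, 510],
})

def mean_target_by_missingness(df, feature, target):
    """Average of `target` for rows where `feature` is missing (True) vs. present (False)."""
    is_missing = df[feature].isna().rename(f"{feature}_missing")
    return df.groupby(is_missing)[target].mean()

summary = mean_target_by_missingness(toy, "percent_tested", "avg_math_score")
print(summary)
```

If the two averages differ noticeably, that is a hint (not proof) that the missingness itself may help predict the target.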

### How to handle missing data

So how do we allow our model to handle missing data? After all, the features with missing data are often numeric, and we know that a missing value can be stored as `NaN`, which stands for "not a number".

Well, the general way to handle missing data is to impute the data. Impute just means to replace the data with something else.

The following is the easiest, and recommended, mechanism for handling missing data: 

1. create a new feature that indicates whether or not the value is missing, and 
2. replace the missing number with the mean of the feature

Ok, let's get to it, and then we'll explain why this is a good strategy.

### Loading our Data

Let's start by loading our SAT data.

In [0]:
import pandas as pd
url = "https://raw.githubusercontent.com/jigsawlabs-student/feature-engineering/master/3-handling-missing-data-reading/scores.csv"
df = pd.read_csv(url)
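# regex=False treats '(' and ')' as literal characters, not regex syntax
df.columns = df.columns.str.strip().str.lower().str.replace(' ', '_', regex=False).str.replace('(', '', regex=False).str.replace(')', '', regex=False)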

Remember, we saw that in the `percent_tested` column there are a number of missing values.

In [0]:
df.percent_tested.isna().value_counts()

False    386
True      49
Name: percent_tested, dtype: int64

So let's create a new feature called `is_null` (a NumPy array with one entry per row) to indicate whether `percent_tested` is missing.

In [0]:
import numpy as np
is_null = np.where(df.percent_tested.isna(), 1, 0)
is_null[0:3]

array([1, 1, 0])

Above, we find every row where a value for `percent_tested` is not available. Our `is_null` array holds a 1 where the data is missing and a 0 otherwise. Then, we simply attach this array to our dataframe as a new column.

In [0]:
df['null_percent_tested'] = is_null
df[['percent_tested', 'null_percent_tested']][0:3]

| | percent_tested | null_percent_tested |
|---|---|---|
| 0 | NaN | 1 |
| 1 | NaN | 1 |
| 2 | 91.0% | 0 |


Now, we can perform our second step of replacing the `NaN`s with the mean.

In [0]:
# Strip the trailing '%' and convert the strings to floats (missing values stay NaN)
df['percent_tested'] = pd.to_numeric(df['percent_tested'].str.rstrip('%'))

In [0]:
df['percent_tested'].mean()

63.34715025906737

In [0]:
# Fill missing values with the column mean, rounded to one decimal place (63.3)
df['percent_tested'] = df['percent_tested'].fillna(round(df['percent_tested'].mean(), 1))

In [0]:
df[['percent_tested', 'null_percent_tested']][0:3]

| | percent_tested | null_percent_tested |
|---|---|---|
| 0 | 63.3 | 1 |
| 1 | 63.3 | 1 |
| 2 | 91.0 | 0 |


Here's why this is beneficial.

1. By adding a `null_percent_tested` column, we allow our machine learning model to learn whether missingness in `percent_tested` is associated with a higher or lower math SAT score.
2. Then, for our `percent_tested` column, we replace each `NaN` with the mean, a value that will have a relatively neutral effect.

So we are effectively decreasing the impact of our missing value in the percent tested column, and instead capturing that impact in the new column, `null_percent_tested`.
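
The two steps can also be wrapped into one small helper so they are easy to repeat for other columns. This is a minimal sketch on toy data: the function name `add_indicator_and_impute_mean` is hypothetical, and the `null_<column>` naming simply follows the convention used above.

```python
import numpy as np
import pandas as pd

def add_indicator_and_impute_mean(df, column):
    """Return a copy of df with a null_<column> indicator and the column mean-filled."""
    out = df.copy()
    # Step 1: record where the value was missing (1) or present (0).
    out[f"null_{column}"] = out[column].isna().astype(int)
    # Step 2: replace missing values with the mean of the observed values.
    out[column] = out[column].fillna(out[column].mean())
    return out

# Toy data, invented for illustration only.
toy = pd.DataFrame({"percent_tested": [np.nan, 80.0, 60.0, np.nan]})
result = add_indicator_and_impute_mean(toy, "percent_tested")
print(result)
```

Because the helper works on a copy, the original dataframe keeps its `NaN` values, which makes it easy to compare before and after.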

### Missing values in the target variable

So far we have discussed missing values in a feature variable. However, things are not so easy with a target variable.

With our target variable there are a couple of other approaches we can employ:

1. Drop the rows with a null target
    * Here, we are conceding defeat, as it is difficult to learn from a row when its target is null.
2. Use a machine learning model to impute target data
    * We can treat the missing target data as information we are trying to predict, and then replace our values with those we predict.  
    * See the [fancy impute](https://github.com/iskandr/fancyimpute) library for more on this.
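
Both options can be sketched on toy data. The column names and numbers below are made up, and the straight-line fit from `np.polyfit` is only a stand-in for "a machine learning model"; it is not how fancyimpute works, just the simplest model we can fit without extra libraries.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration only: one row has a null target.
toy = pd.DataFrame({
    "percent_tested": [50.0, 60.0, 70.0, 80.0, 90.0],
    "avg_math_score": [400.0, 420.0, np.nan, 460.0, 480.0],
})

# Option 1: drop the rows whose target is null.
dropped = toy.dropna(subset=["avg_math_score"])

# Option 2: fit a simple model on the rows with a known target,
# then predict the target for the rows where it is missing.
known = toy["avg_math_score"].notna()
slope, intercept = np.polyfit(
    toy.loc[known, "percent_tested"].to_numpy(),
    toy.loc[known, "avg_math_score"].to_numpy(),
    deg=1,
)
imputed = toy.copy()
imputed.loc[~known, "avg_math_score"] = (
    slope * imputed.loc[~known, "percent_tested"] + intercept
)
print(dropped)
print(imputed)
```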

### Also, be scrappy

Finally, remember that we do have coding skills.  If we cannot find our data in one dataset, we can always keep looking.  Try to fill in the data by collecting data from other datasets.
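
As a rough sketch of what that could look like, suppose we found a second dataset that shares an identifier with ours. Everything here is hypothetical (the `school_id` key, the `other_source` dataframe, and the values); the point is only the merge-then-fill pattern.

```python
import numpy as np
import pandas as pd

# Our dataset, with gaps (toy values, invented for illustration).
scores = pd.DataFrame({
    "school_id": ["A", "B", "C"],
    "percent_tested": [np.nan, 72.0, np.nan],
})
# A hypothetical second dataset covering some of the same schools.
other_source = pd.DataFrame({
    "school_id": ["A", "C", "D"],
    "percent_tested": [88.0, np.nan, 55.0],
})

# Bring the other source's values alongside ours, matched on school_id.
merged = scores.merge(
    other_source, on="school_id", how="left", suffixes=("", "_other")
)
# Keep our own value where we have one; otherwise use the other source's value.
merged["percent_tested"] = merged["percent_tested"].fillna(
    merged["percent_tested_other"]
)
filled = merged.drop(columns="percent_tested_other")
print(filled)
```

School A is filled from the second dataset, while school C stays missing because neither source has a value for it. Those remaining gaps can then be handled with the indicator-and-mean approach.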

### Summary

In this lesson, we learned how to work with missing data.  Our main approach involves two steps.  We decrease the numerical impact that our missing data has by replacing it with the mean in the original feature.  But, to see if the missingness itself has predictive value, we add a new boolean feature to indicate missingness and then include this in our model.  

### Resources

[Gelman missing data](http://www.stat.columbia.edu/~gelman/arm/missing.pdf)