# MISSING DATA

Missing data is common in large datasets. In fact, you might not find a single sample that is free of it. Annoying as this is, simply ignoring missing data usually isn't an option, as it can wreak havoc on your analysis if not handled properly. If not accounted for, missing data can lead you to erroneous conclusions about your samples by producing incorrect sums and means, and even by skewing distributions.
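As a minimal sketch of how a single gap can distort sums and means, the snippet below uses a small, hypothetical array of salaries (the values and variable names are assumptions for illustration only).

```python
import numpy as np
import pandas as pd

# Hypothetical salary readings; one value is missing.
salaries = np.array([72000.0, 48000.0, np.nan, 61000.0])

# Plain NumPy reductions propagate NaN, so the result is NaN.
plain_mean = np.mean(salaries)

# NaN-aware reductions skip the gap, so the mean rests on 3 values, not 4.
nan_aware_mean = np.nanmean(salaries)

# pandas skips NaN by default, which can silently hide the gap.
pandas_sum = pd.Series(salaries).sum()

print(plain_mean, nan_aware_mean, pandas_sum)
```

Neither behavior is wrong in itself, but each one changes what the resulting number means, which is why missing values need an explicit decision.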

When a value is missing for a variable in a particular case, it is very important to fill it with a sensible value, if possible. A reasonable estimate of a suitable value is better than leaving the variable blank. The process of deciding what data to use to fill these blanks is called **data imputation**.

If there is a strong pattern among the missing values of a variable (e.g. caused by a broken sensor), the variable should be eliminated from the model.
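One possible way to look for such a pattern is to measure the fraction of missing values per column and the longest run of consecutive gaps. The sketch below uses hypothetical sensor readings; the helper `longest_missing_run` and the 50% threshold are assumptions for illustration, not a standard rule.

```python
import numpy as np
import pandas as pd

# Hypothetical sensor readings: the temperature sensor fails from index 3 onward.
readings = pd.DataFrame({
    "temperature": [20.1, 20.4, 20.3, np.nan, np.nan, np.nan, np.nan, np.nan],
    "humidity": [40.0, np.nan, 41.0, 42.0, 41.5, 40.5, 41.0, 42.5],
})

def longest_missing_run(series):
    """Length of the longest run of consecutive missing values."""
    is_missing = series.isnull()
    # A new group starts wherever the missing/non-missing state changes.
    group_ids = is_missing.astype(int).diff().ne(0).cumsum()
    # Summing booleans per group gives the run length (0 for non-missing runs).
    return int(is_missing.groupby(group_ids).sum().max())

missing_fraction = readings.isnull().mean()
runs = {col: longest_missing_run(readings[col]) for col in readings.columns}

# Assumed rule of thumb: drop a column if more than half of it is missing.
MAX_MISSING_FRACTION = 0.5
to_drop = [c for c in readings.columns if missing_fraction[c] > MAX_MISSING_FRACTION]
cleaned = readings.drop(columns=to_drop)

print(runs)
print(cleaned.columns.tolist())
```

A long unbroken run, as in the `temperature` column here, suggests a systematic cause rather than random gaps.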

Basic methods for mitigating missing data:
* Any time missing data (NaN) is encountered, **replace it with a scalar value** (mean, median, mode, etc.). Imputing missing values with the mean of the non-missing cases is referred to as __mean substitution__ (see note below).
* If working with time series and the data is ordered, **replace it with the next or previous non-missing value**.
* Another method for time series is to interpolate missing values from the non-NaN values that come immediately before and after (interpolation methods: nearest, cubic, spline, etc.).
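The three methods in the list above can be sketched with pandas as follows. The series values are hypothetical, and linear interpolation is used only because it needs no extra dependencies.

```python
import numpy as np
import pandas as pd

# Hypothetical ordered readings with two gaps.
series = pd.Series([1.0, np.nan, 3.0, np.nan, 7.0])

# 1. Scalar replacement: fill every gap with the mean of the observed values.
mean_filled = series.fillna(series.mean())

# 2. Carry the previous observed value forward (use .bfill() for the next one).
forward_filled = series.ffill()

# 3. Linear interpolation between the neighbors on either side of each gap.
interpolated = series.interpolate(method="linear")

print(mean_filled.tolist())
print(forward_filled.tolist())
print(interpolated.tolist())
```

Forward filling and interpolation only make sense when the row order carries meaning, as it does in a time series.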

<div class="alert alert-block alert-info" style="margin-top: 20px">
<strong>Note about MEAN SUBSTITUTION</strong>
<br/>
If a decision rule can be safely applied to supply a specific value for the missing data, that value may be closer to the true value than mean substitution would be. For example, it is more reasonable to replace a missing value for number of children with zero than with the mean or median number of children across all the other records (many couples are childless).
</div>
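As a minimal sketch of the decision rule from the note, the snippet below compares zero-filling with mean substitution. The survey table and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical survey records; two respondents left "children" blank.
survey = pd.DataFrame({
    "household": ["A", "B", "C", "D", "E"],
    "children": [2.0, np.nan, 0.0, 3.0, np.nan],
})

# Decision rule (an assumption for this sketch): a blank means "no children".
rule_filled = survey["children"].fillna(0)

# Mean substitution for comparison: it produces a fractional child count.
mean_filled = survey["children"].fillna(survey["children"].mean())

print(rule_filled.tolist())
print(mean_filled.tolist())
```

The mean-filled column contains values that no real household could report, which is one reason a domain rule can be preferable when it is defensible.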

## Replacing Missing Data

In [33]:
import pandas as pd

dataset = pd.read_csv('../data/00_DataPreparation/missing_data_example.csv')

In [34]:
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,,Yes
5,France,35.0,58000.0,Yes
6,Spain,,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [35]:
dataset.dtypes

Country       object
Age          float64
Salary       float64
Purchased     object
dtype: object

### Finding Missing Data

In [36]:
dataset.describe()

Unnamed: 0,Age,Salary
count,9.0,9.0
mean,38.777778,63777.777778
std,7.693793,12265.579662
min,27.0,48000.0
25%,,
50%,,
75%,,
max,50.0,83000.0


In [37]:
dataset.isnull()

Unnamed: 0,Country,Age,Salary,Purchased
0,False,False,False,False
1,False,False,False,False
2,False,False,False,False
3,False,False,False,False
4,False,False,True,False
5,False,False,False,False
6,False,True,False,False
7,False,False,False,False
8,False,False,False,False
9,False,False,False,False


In [38]:
dataset.isnull().sum()

Country      0
Age          1
Salary       1
Purchased    0
dtype: int64
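Beyond per-column counts, it can help to see which rows are incomplete and what share of each column is missing. The sketch below rebuilds the example table by hand so that it runs without the CSV file; `incomplete_rows` and `missing_share` are names chosen for illustration.

```python
import numpy as np
import pandas as pd

# The example table shown above, rebuilt by hand.
dataset = pd.DataFrame({
    "Country": ["France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"],
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
    "Purchased": ["No", "Yes", "No", "No", "Yes",
                  "Yes", "No", "Yes", "No", "Yes"],
})

# Rows that contain at least one missing value.
incomplete_rows = dataset[dataset.isnull().any(axis=1)]

# Share of missing values per column (0.1 means 10% of the rows).
missing_share = dataset.isnull().mean()

print(incomplete_rows)
print(missing_share)
```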

### Replacing Missing Data Using the Mean

In [39]:
import numpy as np
from sklearn.impute import SimpleImputer

# SimpleImputer (scikit-learn >= 0.20) fills each column's NaNs with that column's mean.
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(dataset[['Age', 'Salary']])
dataset[['Age', 'Salary']] = imputer.transform(dataset[['Age', 'Salary']])
dataset

Unnamed: 0,Country,Age,Salary,Purchased
0,France,44.0,72000.0,No
1,Spain,27.0,48000.0,Yes
2,Germany,30.0,54000.0,No
3,Spain,38.0,61000.0,No
4,Germany,40.0,63777.777778,Yes
5,France,35.0,58000.0,Yes
6,Spain,38.777778,52000.0,No
7,France,48.0,79000.0,Yes
8,Germany,50.0,83000.0,No
9,France,37.0,67000.0,Yes


In [40]:
dataset.describe()

Unnamed: 0,Age,Salary
count,10.0,10.0
mean,38.777778,63777.777778
std,7.253777,11564.099406
min,27.0,48000.0
25%,35.5,55000.0
50%,38.388889,62388.888889
75%,43.0,70750.0
max,50.0,83000.0
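For comparison, the same mean substitution can be sketched with pandas alone. The numeric columns are rebuilt by hand from the example table, and the variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

# Numeric columns of the example table, rebuilt by hand.
numeric = pd.DataFrame({
    "Age": [44, 27, 30, 38, 40, 35, np.nan, 48, 50, 37],
    "Salary": [72000, 48000, 54000, 61000, np.nan,
               58000, 52000, 79000, 83000, 67000],
})

std_before = numeric.std()

# Passing a Series of column means fills each column with its own mean.
imputed = numeric.fillna(numeric.mean())

std_after = imputed.std()

print(imputed.loc[[4, 6]])
print(std_before)
print(std_after)
```

As the two `describe()` outputs above show, the means are unchanged by mean substitution, while the standard deviations shrink (for `Age`, from 7.693793 to 7.253777), because the imputed values sit exactly at the center of each distribution.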
