![@mikegchambers](../images/header.png)

# Imputing Missing Data

In this notebook, we fill in the gaps.

![missing](jigsaw.png)

First let's import some libraries:

In [None]:
from sklearn.impute import SimpleImputer

import matplotlib.pyplot as plt
import pandas as pd
import numpy as np

## Loading the data and sizing the problem

First we load some data.  The file has an index column as its first column, so we will drop that now.

In [None]:
data = pd.read_csv("data.csv")
data = data.drop(data.columns[0], axis=1)

Let's have a look at the data:

In [None]:
data.head(10)

We have missing data.  How big is the problem?

In [None]:
data.isnull().sum()
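
Raw counts are easier to judge next to the size of the dataset. As a rough sketch on a small made-up frame (`toy` is hypothetical, not the notebook's `data.csv`), the same check can also be expressed as a fraction of rows:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the notebook's data.
toy = pd.DataFrame({
    "X": [1.0, 2.0, 3.0, 4.0],
    "y": [2.0, np.nan, 6.0, np.nan],
})

# Number of missing values per column.
missing_counts = toy.isnull().sum()

# Fraction of rows missing per column (the mean of a boolean column).
missing_fraction = toy.isnull().mean()

print(missing_counts)
print(missing_fraction)
```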

Let's see what the data looks like in a graph:

In [None]:
data.plot(kind='scatter', x='X', y='y')

## Using scikit-learn's SimpleImputer

Create the imputer object:

In [None]:
imp = SimpleImputer(missing_values=np.nan, strategy='mean')

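Fit the imputer to our data.  Easy.  Also notice that we call `fit_transform`: the `fit` part means the imputer learns something from the data (here, the mean of each column), much like an ML algorithm does.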

In [None]:
X = imp.fit_transform(data)
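
To see what fitting actually stores, here is a minimal sketch on a made-up array (`toy` and `toy_imp` are hypothetical names). With `strategy='mean'`, the fitted imputer keeps one mean per column in its `statistics_` attribute and uses those values to fill the gaps:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy array with one missing value in the second column.
toy = np.array([
    [1.0, 10.0],
    [2.0, np.nan],
    [3.0, 30.0],
])

toy_imp = SimpleImputer(missing_values=np.nan, strategy="mean")
toy_filled = toy_imp.fit_transform(toy)

# Per-column means computed from the non-missing values.
print(toy_imp.statistics_)
# The array with the gap replaced by the second column's mean.
print(toy_filled)
```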

Now let's load that data into a dataframe so we can take a look.

In [None]:
mean_impute_data = pd.DataFrame(X)
mean_impute_data.head(10)

And fix up the column names:

In [None]:
mean_impute_data.columns = ['X', 'y']
mean_impute_data.head(10)
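
The column names go missing because `fit_transform` returns a plain NumPy array by default. One possible shortcut, sketched here on a made-up frame (`toy` is hypothetical), is to pass the original columns and index when rebuilding the dataframe:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy frame with one missing y value.
toy = pd.DataFrame({"X": [1.0, 2.0, 3.0], "y": [10.0, np.nan, 30.0]})

toy_array = SimpleImputer(strategy="mean").fit_transform(toy)

# Reuse the original labels instead of renaming the columns afterwards.
toy_imputed = pd.DataFrame(toy_array, columns=toy.columns, index=toy.index)
print(toy_imputed)
```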

It looks like the missing data has been filled in.  Let's check:

In [None]:
mean_impute_data.isnull().sum()

What does it look like?

In [None]:
axes = plt.axes()

axes.scatter(x=mean_impute_data['X'], y=mean_impute_data['y'], c='red', s=20)
axes.scatter(x=data['X'], y=data['y'], c='blue', s=20)

plt.show()

## Doing a better(?) job with LinearRegression

Using the same method we used in 'My First Model', let's train a LinearRegression model on the complete data we have:

In [None]:
import sklearn.linear_model

In [None]:
model = sklearn.linear_model.LinearRegression()

We need complete data to train this model (that's the point), so we need to drop the rows with empty values:

In [None]:
data_complete = data.dropna()

Now we can fit the LinearRegression model to our complete data

In [None]:
model.fit(data_complete[['X']], data_complete['y'])
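
With a single feature, fitting `LinearRegression` amounts to learning a slope and an intercept. A small sketch on made-up points that lie exactly on y = 2x + 1 (the data and the name `toy_model` are hypothetical) shows where those values end up:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical complete data lying exactly on the line y = 2x + 1.
toy_complete = pd.DataFrame({"X": [0.0, 1.0, 2.0, 3.0], "y": [1.0, 3.0, 5.0, 7.0]})

toy_model = LinearRegression()
toy_model.fit(toy_complete[["X"]], toy_complete["y"])

# The learned slope (one per feature) and the intercept.
print(toy_model.coef_[0], toy_model.intercept_)
```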

Now that we have a model, we need to isolate our problematic incomplete rows, so we can iterate through them and predict the imputed values.

Here we build a boolean mask that flags the rows containing missing data:

In [None]:
flag_isnan = data.isnull()
row_num_isnan = flag_isnan.any(axis=1)
row_num_isnan.head(10)
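
As a quick illustration of what this produces, here is a sketch on a made-up frame (`toy` and the other `toy_` names are hypothetical). The result is a boolean Series with one entry per row, and the actual row labels can be recovered from it if needed:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame where only the middle row is incomplete.
toy = pd.DataFrame({"X": [1.0, 2.0, 3.0], "y": [10.0, np.nan, 30.0]})

# True for any row with at least one missing cell.
toy_mask = toy.isnull().any(axis=1)

# Index labels of the incomplete rows.
toy_labels = toy.index[toy_mask].tolist()
print(toy_mask.tolist(), toy_labels)
```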

And here we create a new data frame with just these problematic rows:

In [None]:
data_incomplete = data[row_num_isnan]
data_incomplete.head(10)

Now let's use `model.predict` to impute the values in these rows:

In [None]:
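# UPDATE: This code has been tweaked to work with newer versions of the Pandas library:
# Work on an explicit copy so the assignments below are not made on a slice of `data`.
data_incomplete = data_incomplete.copy()
for index, row in data_incomplete.iterrows():
    predict_df = pd.DataFrame([row['X']], columns=['X'])
    predicted_value = model.predict(predict_df)[0]
    # Write back with .loc; assigning to `row` may only change a temporary copy.
    data_incomplete.loc[index, 'y'] = predicted_value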
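
`model.predict` accepts many rows at once, so a loop is not strictly required. As an alternative sketch on made-up data (all `toy_` names are hypothetical, and it assumes only `y` is ever missing), the predictions can be made in a single call:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: complete rows lie on y = 2x + 1, two y values are missing.
toy = pd.DataFrame({"X": [0.0, 1.0, 2.0, 3.0, 4.0], "y": [1.0, 3.0, np.nan, 7.0, np.nan]})

toy_complete = toy.dropna()
# .copy() gives an independent frame that is safe to assign into.
toy_incomplete = toy[toy.isnull().any(axis=1)].copy()

toy_model = LinearRegression().fit(toy_complete[["X"]], toy_complete["y"])

# Predict every missing y in one call and store the results.
toy_incomplete["y"] = toy_model.predict(toy_incomplete[["X"]])
print(toy_incomplete)
```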

How did we do?

In [None]:
data_incomplete.head(10)

Looking good.

Let's join our newly imputed rows with the complete data we had before.  Our dataset is ready.

In [None]:
data_impute = pd.concat((data_complete, data_incomplete))
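
`pd.concat` stacks the second frame underneath the first, so the imputed rows end up at the bottom. If the original row order matters, one option (sketched on made-up frames; the `toy_` names are hypothetical) is to sort by the index afterwards:

```python
import pandas as pd

# Hypothetical frames: row 2 was incomplete and has been imputed separately.
toy_complete = pd.DataFrame({"X": [0.0, 1.0, 3.0], "y": [1.0, 3.0, 7.0]}, index=[0, 1, 3])
toy_imputed_rows = pd.DataFrame({"X": [2.0], "y": [5.0]}, index=[2])

toy_joined = pd.concat((toy_complete, toy_imputed_rows))

# Restore the original row order using the preserved index labels.
toy_sorted = toy_joined.sort_index()
print(toy_joined.index.tolist(), toy_sorted.index.tolist())
```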

As a final check let's see if there are any incomplete rows:

In [None]:
data_impute.isnull().sum()

And let's see what that data looks like on a graph:

In [None]:
axes = plt.axes()

axes.scatter(x=data_incomplete['X'], y=data_incomplete['y'], c='red', s=40)
axes.scatter(x=data_complete['X'], y=data_complete['y'], c='blue', s=20)

plt.show()

Note that in the graph above we plot the two sets of data separately, rather than `data_impute`, so that we can easily see the newly imputed data.