# Missing Data

Real-world datasets often have missing values. Datasets with missing values are incomplete, which is a problem because not all machine learning algorithms can handle missing data. Accordingly, we need to find ways to transform an incomplete dataset into a complete dataset.

Two of the most **common solutions** to transform an incomplete dataset into a complete dataset are:

1. **Use only valid data.** In these cases we remove all the observations with missing data.
1. **Impute data.** Here we replace missing values with estimated values based on other information available in the dataset.
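
As a rough illustration of the two solutions, here is a minimal sketch on a toy array. The values and the names `X_toy`, `y_toy` are made up for this example and are not part of the Iris data used later.

```python
import numpy as np

# Toy feature matrix (hypothetical values) with two missing entries
X_toy = np.array([[5.1, 3.5],
                  [np.nan, 3.0],
                  [4.7, np.nan],
                  [4.6, 3.1]])
y_toy = np.array([0, 0, 1, 1])

# Solution 1: keep only the rows where every feature is observed
complete_rows = ~np.isnan(X_toy).any(axis=1)
X_valid, y_valid = X_toy[complete_rows], y_toy[complete_rows]

# Solution 2: replace each missing value with the mean of its column
col_means = np.nanmean(X_toy, axis=0)
X_imputed = np.where(np.isnan(X_toy), col_means, X_toy)

print(X_valid.shape)    # two complete rows remain
print(X_imputed.shape)  # all four rows are kept
```

The first solution shrinks the dataset, while the second one keeps every observation at the cost of introducing estimated values.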

The question now is: which is the best solution? 

Assuming that our goal is to make predictions, the **best solution** is the one that leads us to the **most accurate model**. We can find this out through the following process:

1. Choose an **error metric** (e.g. accuracy).
1. Select a **machine learning algorithm** (e.g. logistic regression).
1. Apply a **missing data solution** (e.g. impute data) to get a complete dataset.
1. Evaluate model performance through **cross-validation**.

In the end, we should choose the solution that gives us the most accurate model. 

Let's see how to do this in practice.

---

# Example

We will start by creating a dataset with missing data. In this example, missing values will be artificially implanted into the [Iris dataset](https://en.wikipedia.org/wiki/Iris_flower_data_set). This dataset uses morphologic data (e.g. petal length) to characterize three different species of the Iris flower.

In [1]:
# Load Iris dataset
from sklearn.datasets import load_iris

df = load_iris()
df

 'data': array([[5.1, 3.5, 1.4, 0.2],
        [4.9, 3. , 1.4, 0.2],
        [4.7, 3.2, 1.3, 0.2],
        [4.6, 3.1, 1.5, 0.2],
        [5. , 3.6, 1.4, 0.2],
        [5.4, 3.9, 1.7, 0.4],
        [4.6, 3.4, 1.4, 0.3],
        [5. , 3.4, 1.5, 0.2],
        [4.4, 2.9, 1.4, 0.2],
        [4.9, 3.1, 1.5, 0.1],
        [5.4, 3.7, 1.5, 0.2],
        [4.8, 3.4, 1.6, 0.2],
        [4.8, 3. , 1.4, 0.1],
        [4.3, 3. , 1.1, 0.1],
        [5.8, 4. , 1.2, 0.2],
        [5.7, 4.4, 1.5, 0.4],
        [5.4, 3.9, 1.3, 0.4],
        [5.1, 3.5, 1.4, 0.3],
        [5.7, 3.8, 1.7, 0.3],
        [5.1, 3.8, 1.5, 0.3],
        [5.4, 3.4, 1.7, 0.2],
        [5.1, 3.7, 1.5, 0.4],
        [4.6, 3.6, 1. , 0.2],
        [5.1, 3.3, 1.7, 0.5],
        [4.8, 3.4, 1.9, 0.2],
        [5. , 3. , 1.6, 0.2],
        [5. , 3.4, 1.6, 0.4],
        [5.2, 3.5, 1.5, 0.2],
        [5.2, 3.4, 1.4, 0.2],
        [4.7, 3.2, 1.6, 0.2],
        [4.8, 3.1, 1.6, 0.2],
        [5.4, 3.4, 1.5, 0.4],
        [5.2, 4.1, 1.5, 0.1],
  

As we can see, there are four features (sepal length, sepal width, petal length, petal width) and three different species (setosa, versicolor, virginica).

In [2]:
# Split data into features and target variable
X, y = df.data, df.target
print('X\n', X)
print('Y\n', y)

X
 [[5.1 3.5 1.4 0.2]
 [4.9 3.  1.4 0.2]
 [4.7 3.2 1.3 0.2]
 [4.6 3.1 1.5 0.2]
 [5.  3.6 1.4 0.2]
 [5.4 3.9 1.7 0.4]
 [4.6 3.4 1.4 0.3]
 [5.  3.4 1.5 0.2]
 [4.4 2.9 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.4 3.7 1.5 0.2]
 [4.8 3.4 1.6 0.2]
 [4.8 3.  1.4 0.1]
 [4.3 3.  1.1 0.1]
 [5.8 4.  1.2 0.2]
 [5.7 4.4 1.5 0.4]
 [5.4 3.9 1.3 0.4]
 [5.1 3.5 1.4 0.3]
 [5.7 3.8 1.7 0.3]
 [5.1 3.8 1.5 0.3]
 [5.4 3.4 1.7 0.2]
 [5.1 3.7 1.5 0.4]
 [4.6 3.6 1.  0.2]
 [5.1 3.3 1.7 0.5]
 [4.8 3.4 1.9 0.2]
 [5.  3.  1.6 0.2]
 [5.  3.4 1.6 0.4]
 [5.2 3.5 1.5 0.2]
 [5.2 3.4 1.4 0.2]
 [4.7 3.2 1.6 0.2]
 [4.8 3.1 1.6 0.2]
 [5.4 3.4 1.5 0.4]
 [5.2 4.1 1.5 0.1]
 [5.5 4.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [5.  3.2 1.2 0.2]
 [5.5 3.5 1.3 0.2]
 [4.9 3.1 1.5 0.1]
 [4.4 3.  1.3 0.2]
 [5.1 3.4 1.5 0.2]
 [5.  3.5 1.3 0.3]
 [4.5 2.3 1.3 0.3]
 [4.4 3.2 1.3 0.2]
 [5.  3.5 1.6 0.6]
 [5.1 3.8 1.9 0.4]
 [4.8 3.  1.4 0.3]
 [5.1 3.8 1.6 0.2]
 [4.6 3.2 1.4 0.2]
 [5.3 3.7 1.5 0.2]
 [5.  3.3 1.4 0.2]
 [7.  3.2 4.7 1.4]
 [6.4 3.2 4.5 1.5]
 [6.9 3.1

The dataset isn't shuffled. We need to shuffle it to avoid data segregation problems during cross-validation. If you want to read more about shuffling data and other data cleaning tasks, you can read this [blog post](http://pmarcelino.com/data-cleaning-general-techniques/).
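
To see what the segregation problem looks like, here is a small sketch that uses scikit-learn's `KFold` without shuffling on labels sorted by class. The arrays `y_sorted` and `X_dummy` are stand-ins created for this illustration only.

```python
import numpy as np
from sklearn.model_selection import KFold

# Labels sorted by class, as in the unshuffled Iris data (50 per species)
y_sorted = np.repeat([0, 1, 2], 50)
X_dummy = np.zeros((len(y_sorted), 1))  # placeholder features, values unused

# Plain k-fold without shuffling splits the rows into contiguous blocks
kf = KFold(n_splits=3, shuffle=False)
folds = []
for train_idx, val_idx in kf.split(X_dummy):
    train_classes = np.unique(y_sorted[train_idx])
    val_classes = np.unique(y_sorted[val_idx])
    folds.append((train_classes, val_classes))
    print('train:', train_classes, 'validation:', val_classes)
```

Each validation fold contains a single species that the model never saw during training. Note that when *cross_val_score* receives an integer `cv` and a classifier, it uses stratified folds, which reduces this risk. Shuffling is still a cheap safeguard.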

In [3]:
# Shuffle data
from sklearn.utils import shuffle

X, y = shuffle(X, y)
print('X\n', X)
print('Y\n', y)

X
 [[5.7 4.4 1.5 0.4]
 [4.9 3.1 1.5 0.1]
 [5.7 3.  4.2 1.2]
 [5.8 2.7 5.1 1.9]
 [6.9 3.1 5.4 2.1]
 [6.3 3.4 5.6 2.4]
 [6.7 3.3 5.7 2.1]
 [5.1 3.8 1.5 0.3]
 [5.  3.5 1.3 0.3]
 [6.6 2.9 4.6 1.3]
 [5.3 3.7 1.5 0.2]
 [5.  3.4 1.6 0.4]
 [5.  3.  1.6 0.2]
 [6.1 3.  4.9 1.8]
 [4.6 3.2 1.4 0.2]
 [4.9 3.1 1.5 0.1]
 [6.  2.2 5.  1.5]
 [4.9 3.1 1.5 0.1]
 [5.6 3.  4.1 1.3]
 [5.6 2.8 4.9 2. ]
 [6.3 2.3 4.4 1.3]
 [6.4 3.2 4.5 1.5]
 [6.7 3.1 4.4 1.4]
 [5.1 3.7 1.5 0.4]
 [5.7 2.5 5.  2. ]
 [6.4 2.8 5.6 2.1]
 [5.6 3.  4.5 1.5]
 [5.4 3.4 1.7 0.2]
 [6.7 3.1 4.7 1.5]
 [5.  3.2 1.2 0.2]
 [6.3 2.5 5.  1.9]
 [4.8 3.  1.4 0.3]
 [4.6 3.6 1.  0.2]
 [6.1 3.  4.6 1.4]
 [6.8 2.8 4.8 1.4]
 [5.8 2.7 4.1 1. ]
 [7.6 3.  6.6 2.1]
 [6.1 2.6 5.6 1.4]
 [6.5 3.  5.2 2. ]
 [7.  3.2 4.7 1.4]
 [5.2 3.4 1.4 0.2]
 [5.9 3.2 4.8 1.8]
 [7.1 3.  5.9 2.1]
 [6.  2.2 4.  1. ]
 [6.3 2.9 5.6 1.8]
 [7.7 2.8 6.7 2. ]
 [6.4 3.2 5.3 2.3]
 [5.4 3.  4.5 1.5]
 [5.1 3.8 1.9 0.4]
 [6.1 2.8 4.  1.3]
 [7.4 2.8 6.1 1.9]
 [5.5 2.4 3.7 1. ]
 [7.7 3.8

The dataset is loaded and converted into a workable format.

Now, we need to artificially implant missing data. There are countless ways to do it. Here, we will do it by:
1. Generating a random number.
1. Subtracting it from the values in the dataset.
1. Assigning a missing value to the cases where the absolute value of the subtraction is less than a certain threshold.

Parameters and threshold values are defined randomly, but taking into account the order of magnitude of the numbers involved. 

In [4]:
# Implant artificial missing values
import numpy as np

rand = np.random.RandomState(0)  # Random number generator
X_missing = X.copy()

mask = []  # We will need this later to filter observations in y
features_missing = np.shape(X)[1]  # Missing values in all features
loc = 0  # Mean in numpy.random.normal
scale = 5  # Standard deviation in numpy.random.normal 
threshold = 1.75
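for i in range(0, features_missing):
    # Note: the mask for every feature is derived from column 1 (sepal width)
    mask_partial = np.abs(X[:,1] - rand.normal(loc=loc, scale=scale, size=X.shape[0])) < threshold
    X_missing[mask_partial, i] = np.nan  # np.NaN was removed in NumPy 2.0
    mask.append(mask_partial)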

X_missing

array([[5.7, 4.4, 1.5, 0.4],
       [nan, 3.1, 1.5, nan],
       [5.7, 3. , 4.2, nan],
       [5.8, 2.7, 5.1, nan],
       [6.9, 3.1, nan, nan],
       [6.3, 3.4, 5.6, 2.4],
       [nan, 3.3, 5.7, 2.1],
       [5.1, 3.8, 1.5, 0.3],
       [5. , 3.5, nan, 0.3],
       [nan, 2.9, 4.6, 1.3],
       [5.3, 3.7, 1.5, 0.2],
       [5. , 3.4, 1.6, 0.4],
       [nan, nan, 1.6, 0.2],
       [6.1, 3. , 4.9, 1.8],
       [nan, 3.2, 1.4, 0.2],
       [nan, nan, 1.5, 0.1],
       [6. , 2.2, 5. , 1.5],
       [4.9, 3.1, 1.5, 0.1],
       [nan, 3. , 4.1, 1.3],
       [5.6, nan, 4.9, nan],
       [6.3, 2.3, 4.4, 1.3],
       [nan, nan, 4.5, 1.5],
       [nan, nan, nan, 1.4],
       [5.1, 3.7, 1.5, 0.4],
       [5.7, 2.5, 5. , 2. ],
       [6.4, nan, 5.6, 2.1],
       [5.6, 3. , 4.5, 1.5],
       [5.4, 3.4, 1.7, 0.2],
       [6.7, 3.1, 4.7, 1.5],
       [5. , 3.2, nan, nan],
       [nan, 2.5, nan, nan],
       [nan, 3. , 1.4, 0.3],
       [4.6, 3.6, nan, 0.2],
       [6.1, 3. , 4.6, 1.4],
       [6.8, n

Done! Our dataset is ready.

Now, let's compare the two approaches we discussed at the beginning of the notebook:
1. Use only valid data.
1. Impute data.

# 1. Use only valid data

As we said at the beginning, the procedure that we need to follow is:
1. Choose an **error metric** (e.g. accuracy).
1. Select a **machine learning algorithm** (e.g. logistic regression).
1. Apply a **missing data solution** (e.g. impute data) to get a complete dataset.
1. Evaluate model performance through **cross-validation**.

## 1.1. Error metric

We will use **accuracy**. Accuracy is given by the ratio between the number of correctly predicted labels and the total number of observations in the sample.
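
As a quick worked example of this ratio, here is a sketch with made-up labels (`y_true` and `y_pred` are hypothetical, not model output from this notebook):

```python
import numpy as np

# Hypothetical true and predicted species labels for six flowers
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])

# Accuracy = correctly predicted labels / total observations
accuracy = np.mean(y_true == y_pred)
print('Accuracy: %.3f' % accuracy)  # prints Accuracy: 0.833
```

Five of the six labels match, so the accuracy is 5/6.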

## 1.2. Machine learning algorithm

We will use **logistic regression**. It's a simple and well-known algorithm that fits the illustrative purposes of our example.

## 1.3. Missing data solution

In the *only valid data* approach, the idea is to remove the observations with missing values and keep only the observations with complete data. Let's do it.

In [5]:
# Delete observations with missing values
import pandas as pd

X_filtered = pd.DataFrame(X_missing)
X_filtered.dropna(inplace=True)

# Removing the matching observations from y takes one extra step:
# mask holds, for each feature, the rows where a value went missing,
# so a row is dropped if it is flagged in any of the features
mask_total = np.any(mask, axis=0)
y_filtered = y[~mask_total]

print(np.shape(X_filtered))
print(np.shape(y_filtered))

(54, 4)
(54,)


## 1.4. Evaluation through cross-validation

Now, to estimate the performance of the model, we will use cross-validation. In scikit-learn, we can use cross-validation through the *cross_val_score* function. Since it applies k-fold cross-validation, we need to define the number of folds. Usually, 5 or 10 are enough. Here, we will use 5 folds (cv=5) because I like the number 5.

In [6]:
# Estimate model's performance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression()
score = cross_val_score(lr, X_filtered, y_filtered, cv=5)
print('Score: %.3f +/- %.3f' % (np.mean(score), np.std(score)))

Score: 0.909 +/- 0.058


# 2. Impute data

Ok, here the solution is different but the process is similar. Do you still remember the four steps to glory?

## 2.1. Error metric

We need to use **accuracy** because we want to compare the approaches.

## 2.2. Machine learning algorithm

For the same reasons as above, we will go for **logistic regression**.

## 2.3. Imputation method

This is where things start getting interesting. So, we want to impute values to replace missing values. These imputed values must be estimated. How do we estimate them? The easiest way to estimate them is to say that they result from the average of the known values (in each feature). This means that, for example, the missing values of petal length can be replaced by the mean value of all the known petal length values.
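
Here is a minimal sketch of that idea using `SimpleImputer` from `sklearn.impute` with `strategy='mean'`. The measurements in `X_toy` are invented for the illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical petal length / petal width measurements with gaps
X_toy = np.array([[1.4, 0.2],
                  [np.nan, 0.4],
                  [4.5, np.nan],
                  [5.1, 1.8]])

imputer = SimpleImputer(strategy='mean')
X_filled = imputer.fit_transform(X_toy)

# statistics_ holds the per-feature means learned from the known values
print(imputer.statistics_)
print(X_filled)
```

The missing petal length is replaced by the mean of the three known petal lengths, and the same happens independently for petal width.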

Mean imputation is one of the simplest ways to estimate missing values. We can go for more complex solutions, but in scikit-learn we are somewhat restricted. Let's see why.

### Data leakage

Imagine that you have a dataset with missing values and you will use all the data in the dataset to compute means:

<img src="figures/missing_data_dataset_incomplete.jpeg" style="max-width:50%; width: 25%">

Now, you will impute those means into your dataset to complete it:

<img src="figures/missing_data_dataset_complete.jpeg" style="max-width:50%; width: 25%">

Ok. You have a complete dataset. What's next? Next, you use cross-validation to evaluate the performance of the model:

<img src="figures/missing_data_5foldcv.jpg" style="max-width:50%; width: 25%">

And then you find out that what you're doing is **wrong**. 

Let me tell you why. If you use the entire dataset to compute the means, you'll be using information from the validation set to fill missing values in the training set. This corresponds to a **data leakage** situation. You fall into a data leakage situation every time you train your model with data that, somehow, has information about the data used to evaluate model performance. When data leakage occurs, you'll get overly optimistic results during cross-validation because the model will be tested on seen data, instead of unseen data. That's why you should be careful when using imputation and cross-validation together.
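
The difference can be made concrete with a tiny, deliberately exaggerated sketch. The single-feature arrays `X_train` and `X_val` are hypothetical, and `SimpleImputer` from `sklearn.impute` is used here as one way to learn the mean from the training rows only.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical single-feature data, already split into train and validation
X_train = np.array([[1.0], [np.nan], [2.0], [3.0]])
X_val = np.array([[10.0], [np.nan]])

# Leaky: the mean is computed from training AND validation rows
leaky_mean = np.nanmean(np.vstack([X_train, X_val]))

# Leak-free: the imputer learns the mean from the training rows only ...
imputer = SimpleImputer(strategy='mean').fit(X_train)
# ... and that same training mean fills the gaps in both sets
X_train_filled = imputer.transform(X_train)
X_val_filled = imputer.transform(X_val)

print('leaky mean:', leaky_mean)
print('train-only mean:', imputer.statistics_[0])
```

The leaky mean is pulled towards the validation value, so the training set ends up carrying information about data it should never have seen.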

### Pipelines

So, what should we do? What we need to do is to make sure that we first split the data into train and validation sets, and only then compute the means on the training set and impute them into both sets. In this way, we avoid mixing the datasets.

The following diagram illustrates what I want to say:

<img src="figures/missing_data_correct_pipeline_1.jpg" style="width:30%">

<img src="figures/missing_data_correct_pipeline_2.jpeg" style="width:30%">

<img src="figures/missing_data_correct_pipeline_3.jpeg" style="width:30%">

We can easily implement these steps in scikit-learn using the **Pipeline** class. Pipelines allow us to integrate multiple steps into a single unit (the pipeline). In scikit-learn, you can use this unit in the same way you use an estimator. This means that the unit Pipeline works like LogisticRegression or any other model. It has fit, predict, and score methods, and you can use it as a classifier.
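
As a sketch of what "works like an estimator" means, the example below builds a pipeline and calls fit, predict, and score on it. The random 10% of missing entries, the train/test split, and the names `X_demo` and `pipe_demo` are assumptions made for this illustration. They are not the missing data implanted earlier in the notebook.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

X_demo, y_demo = load_iris(return_X_y=True)
X_demo = X_demo.copy()
rng = np.random.RandomState(0)
# Knock out roughly 10% of the entries at random (illustrative only)
X_demo[rng.rand(*X_demo.shape) < 0.1] = np.nan

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, random_state=0, stratify=y_demo)

pipe_demo = Pipeline([
    ('impute', SimpleImputer(strategy='mean')),
    ('clf', LogisticRegression(max_iter=1000)),
])

pipe_demo.fit(X_tr, y_tr)              # means are learned from X_tr only
predictions = pipe_demo.predict(X_te)  # X_te is filled with those means
print(pipe_demo.score(X_te, y_te))     # accuracy on the held-out rows
```

Because the imputer lives inside the pipeline, every call to fit recomputes the means from whatever data is passed in, which is what makes it safe to hand the pipeline to cross-validation.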

### Building pipelines

The easiest way to build a pipeline is through the function *make_pipeline*. The syntax for *make_pipeline* is as simple as: 

> make_pipeline(*steps you want to do*)

Every step that you want to do, except the last one (the model itself), must be a 'transform'. And that's a problem...

### Pipelines limitations

Saying that a step must be a 'transform' means that it must implement fit and transform methods. Accordingly, if we are trying an imputation method that does not have these methods, we can't use pipelines. And if we can't use pipelines, we can't evaluate our model through cross-validation because of the data leakage problem.

This is the reason why I told you that scikit-learn restricts the use of more complex imputation methods. You can still apply them if you work around the code, but it will never be as straightforward as it is to apply the mean imputation method.
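
One possible workaround is to wrap the imputation logic in a small class that exposes fit and transform. The sketch below is only an illustration: `ColumnMedianImputer` is a hypothetical class written for this example (median imputation is simple enough to keep the code short), and the toy data is made up.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline


class ColumnMedianImputer(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: fills NaNs with per-column medians."""

    def fit(self, X, y=None):
        # Learn the medians from the data passed to fit (the training fold)
        self.medians_ = np.nanmedian(np.asarray(X, dtype=float), axis=0)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        # Fill gaps with the medians learned in fit, never recomputed here
        return np.where(np.isnan(X), self.medians_, X)


# Toy data (made up) with two missing entries
X_toy = np.array([[1.0, 10.0], [np.nan, 20.0], [3.0, np.nan],
                  [5.0, 40.0], [2.0, 15.0], [6.0, 35.0]])
y_toy = np.array([0, 0, 0, 1, 0, 1])

pipe_custom = make_pipeline(ColumnMedianImputer(),
                            LogisticRegression(max_iter=1000))
pipe_custom.fit(X_toy, y_toy)
print(pipe_custom.predict(X_toy))
```

Any imputation method can be plugged in this way, as long as everything it learns is computed in fit and only applied in transform.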

## 2.4. Evaluation through cross-validation

Now that we already discussed the need for pipelines, let's solve the problem.

In [7]:
# Estimate score with pipeline
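# Imputer was removed from sklearn.preprocessing; SimpleImputer replaces it
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# strategy='mean' (the default) fills each gap with its feature's mean
pipe = make_pipeline(SimpleImputer(strategy='mean'), LogisticRegression())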
score = cross_val_score(pipe, X_missing, y, cv=5)
print('Score: %.3f +/- %.3f' % (np.mean(score), np.std(score)))

Score: 0.813 +/- 0.078


This is the score that should be compared with the score resulting from the approach in which we used only valid data. It's the comparison between these two scores that should guide our decision on which solution we should use to solve the missing data problem.

---

# Summary

In this example, we saw how to select a solution for the missing data problem. We discussed two solutions: 
1. Use only valid data.
1. Impute data (mean imputation).

While the first solution is easy to apply, the second one is tricky. In particular, the second solution can lead us into a data leakage situation. To avoid this situation, we must use pipelines. Since the preprocessing steps of a pipeline are restricted to 'transformers', we cannot always apply complex imputation methods in scikit-learn.

The general process to compare missing data solutions is:
1. Choose an error metric.
1. Select a machine learning algorithm.
1. Apply a missing data solution to get a complete dataset.
1. Evaluate model performance through cross-validation.

Once you have finished this process, you will be able to select the missing data solution that leads you to the most accurate predictions.