# Marking and Removal of Missing Data

Real-world data often has missing values. Data can have missing values for a number of reasons such as observations that were not recorded and data corruption. Handling missing data is important as many machine learning algorithms do not support data with missing values. In this exercise, you will discover how to handle missing data for machine learning with Python.

Specifically, after completing this exercise you will know:
- How to mark invalid or corrupt values as missing in your dataset.
- How to confirm that the presence of marked missing values causes problems for learning algorithms.
- How to remove rows with missing data from your dataset and evaluate a learning algorithm on the transformed dataset.

---

## 1. Exercise Overview

This exercise is divided into 4 parts; they are:
1. Diabetes Dataset
2. Mark Missing Values
3. Missing Values Cause Problems
4. Remove Rows With Missing Values

---

## 2. Diabetes Dataset

As the basis of this exercise, we will use the *diabetes* dataset. The dataset classifies each patient record as either showing an onset of diabetes within five years or not. There are 768 examples and eight input variables. It is a binary classification problem. A naive model can achieve an accuracy of about 65 percent on this dataset. A good score is about 77 percent. We will aim for this region, but note that the models in this exercise are not optimized; they are designed to demonstrate how to handle missing data.

Looking at the data, we can see that all eight input variables are numerical; the ninth column is the class label.

```
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
...
```

This dataset is known to have missing values. Specifically, there are missing observations for
some columns that are marked as a zero value. We can corroborate this by the definition of
those columns and the domain knowledge that a zero value is invalid for those measures, e.g. a
zero for body mass index or blood pressure is invalid.

---

## 3. Mark Missing Values

Missing data are not rare in real datasets. In fact, the chance that at least one data point is missing increases as the size of the dataset increases. In this section, we will look at how we can identify and mark values as missing.

- Use the `describe()` function to help identify missing or corrupt data.

This is useful. We can see that there are columns that have a minimum value of zero (0).
On some columns, a value of zero does not make sense and indicates an invalid or missing value.
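As a minimal sketch of this check, the snippet below runs `describe()` on only the five sample rows shown earlier, loaded from an in-memory string instead of the full CSV file (an assumption made so the example is self-contained). The same call applies to the complete dataset.

```python
from io import StringIO
from pandas import read_csv

# the five sample rows shown earlier, standing in for the full CSV file
sample = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)
dataset = read_csv(sample, header=None)

# summary statistics for every column
summary = dataset.describe()

# the 'min' row reveals which columns contain zero values
print(summary.loc['min'])
```

Even on this small sample, columns 3 and 4 report a minimum of zero.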

Specifically, the following columns have an invalid zero minimum value:
1. Plasma glucose concentration
2. Diastolic blood pressure
3. Triceps skinfold thickness
4. 2-Hour serum insulin
5. Body mass index

- Confirm this by looking at the first 20 rows of data.

You will see that there are 0 values in columns 2, 3, 4, and 5.


We can get a count of the number of missing values on each of these columns. We can do this by marking all of the values in the subset of the dataframe we are interested in that have zero values as True.

- Count the number of true values in each column.
  
  Expected output:
  ```
  1 5
  2 35
  3 227
  4 374
  5 11
  ```
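One possible way to produce such a count is sketched below. It again uses only the five sample rows from earlier (an assumption for self-containment), so the counts differ from the expected output for the full dataset.

```python
from io import StringIO
from pandas import read_csv

# the five sample rows shown earlier, standing in for the full CSV file
sample = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)
dataset = read_csv(sample, header=None)

# comparing with 0 marks zero values as True; sum() counts the True values per column
num_zero = (dataset[[1, 2, 3, 4, 5]] == 0).sum()
print(num_zero)
```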

We can see that columns 1, 2 and 5 have just a few zero values, whereas columns 3 and 4 show a lot more: about 30 percent (227 of 768) and nearly half (374 of 768) of the rows, respectively. This highlights that different missing value strategies may be needed for different columns, e.g. to ensure that there are still a sufficient number of records left to train a predictive model.

In Python, specifically Pandas, NumPy and Scikit-Learn, we mark missing values as NaN (Not a Number). Pandas ignores NaN values in operations like sum and count.
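A small illustration of this behavior, using a made-up three-element Series:

```python
from numpy import nan
from pandas import Series

# a toy series with one missing value
values = Series([1.0, nan, 3.0])

# pandas skips the NaN entry in both operations
total = values.sum()    # 4.0
count = values.count()  # 2
print(total, count)
```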

- Mark zero values as NaN in the Pandas `DataFrame` by using the `replace()` function on the subset of columns we are interested in. You may need to import `nan` from NumPy. Confirm your new dataset by looking at the first 20 rows of data.

- After marking the missing values, use the `isnull()` function to mark all of the NaN values in the dataset as `True` and count the missing values for each column.

  Expected output:
  ```
  0      0
  1      5
  2     35
  3    227
  4    374
  5     11
  6      0
  7      0
  8      0
  ```
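The marking and counting steps can be sketched as follows, once more on the five sample rows from earlier rather than the full file (an assumption for self-containment, so the counts are smaller than in the expected output).

```python
from io import StringIO
from numpy import nan
from pandas import read_csv

# the five sample rows shown earlier, standing in for the full CSV file
sample = StringIO(
    "6,148,72,35,0,33.6,0.627,50,1\n"
    "1,85,66,29,0,26.6,0.351,31,0\n"
    "8,183,64,0,0,23.3,0.672,32,1\n"
    "1,89,66,23,94,28.1,0.167,21,0\n"
    "0,137,40,35,168,43.1,2.288,33,1\n"
)
dataset = read_csv(sample, header=None)

# replace zeros with NaN only in the columns where zero is invalid
dataset[[1, 2, 3, 4, 5]] = dataset[[1, 2, 3, 4, 5]].replace(0, nan)

# isnull() marks NaN values as True; sum() counts them per column
missing = dataset.isnull().sum()
print(missing)
```

Note that the zero in column 0 of the last row is left untouched, because a zero is a valid value for that column.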

Before we look at handling missing values, let's first demonstrate that having missing values
in a dataset can cause problems.

---

## 4. Missing Values Cause Problems

Having missing values in a dataset can cause errors with some machine learning algorithms. Missing values are common occurrences in data. Unfortunately, most predictive modeling techniques cannot handle any missing values. Therefore, this problem must be addressed prior to modeling.

In this section, we will try to evaluate the *Linear Discriminant Analysis* (LDA) algorithm
on the dataset with missing values. While you have not learnt this, LDA is an algorithm that does not work when there are missing values in the dataset.

The example below marks the missing values in the dataset, as we did in the previous section, then attempts to evaluate LDA using 3-fold cross-validation and print the mean accuracy.

- Try running the cell below:

```python
# example where missing values cause errors
from numpy import nan
from pandas import read_csv
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# split dataset into inputs and outputs
values = dataset.values
X = values[:,0:8]
y = values[:,8]
# define the model
model = LinearDiscriminantAnalysis()
# define the model evaluation procedure
cv = KFold(n_splits=3, shuffle=True, random_state=1)
# evaluate the model
result = cross_val_score(model, X, y, cv=cv, scoring='accuracy')
# report the mean performance
print('Accuracy: %.3f' % result.mean())
```

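Running the example fails: depending on your scikit-learn version, it either raises an error or reports failed fits and a NaN accuracy, because LDA cannot be fit on input that contains NaN values. We are therefore prevented from evaluating LDA (and other algorithms) on the dataset with missing values. Many popular predictive models, such as support vector machines, glmnet, and neural networks, cannot tolerate any amount of missing values.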
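To see the underlying failure in isolation, the sketch below calls `fit()` directly on a tiny array with one NaN entry. The six rows are made-up toy values, not the real dataset, and the exact error message may vary between scikit-learn versions.

```python
from numpy import array, nan
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# toy inputs (two features) with a single missing value in the third row
X = array([
    [148.0, 33.6],
    [85.0, 26.6],
    [183.0, nan],
    [89.0, 28.1],
    [137.0, 43.1],
    [116.0, 25.6],
])
y = array([1, 0, 1, 0, 1, 0])

model = LinearDiscriminantAnalysis()
error = None
try:
    # input validation rejects the NaN before any fitting takes place
    model.fit(X, y)
except ValueError as exc:
    error = exc
    print('fit failed with', type(exc).__name__)
```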

Now, we can look at methods to handle the missing values.

---

## 5. Remove Rows With Missing Values

The simplest strategy for handling missing data is to remove records that contain a missing
value.

We can do this by creating a new Pandas `DataFrame` with the rows containing missing values removed. Pandas provides the `dropna()` function that can be used to drop either columns or rows with missing data. 
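The difference between dropping rows and dropping columns can be illustrated on a made-up toy `DataFrame` (the column names `a` and `b` are hypothetical):

```python
from numpy import nan
from pandas import DataFrame

# toy frame with one missing value in column 'a'
frame = DataFrame({'a': [1.0, nan, 3.0], 'b': [4.0, 5.0, 6.0]})

# default (axis=0): drop every row that contains a NaN
rows_kept = frame.dropna()

# axis=1: drop every column that contains a NaN
cols_kept = frame.dropna(axis=1)

print(rows_kept.shape, cols_kept.shape)
```

By default `dropna()` returns a new `DataFrame` and leaves the original unchanged; passing `inplace=True` modifies the original instead.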

- Complete the following code snippet using `dropna()` to remove all rows with missing data:

```python
# example of removing rows that contain missing values
from numpy import nan
from pandas import read_csv
# load the dataset
dataset = read_csv('pima-indians-diabetes.csv', header=None)
# summarize the shape of the raw data
print(dataset.shape)
# replace '0' values with 'nan'
dataset[[1,2,3,4,5]] = dataset[[1,2,3,4,5]].replace(0, nan)
# drop rows with missing values (Write your code here)

# summarize the shape of the data with missing rows removed
print(dataset.shape)
```

You should expect the number of rows to be aggressively cut from 768 in the original dataset to 392 after all the rows containing NaN have been removed.

- Using the new dataset, re-evaluate the LDA algorithm (Answer: Accuracy 0.781)
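The overall flow of this step can be sketched on synthetic data, so that it runs without the CSV file. The data from `make_classification` and the pattern of injected NaN values are assumptions for illustration only, so the accuracy will not match the answer above.

```python
import numpy as np
from pandas import DataFrame
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold, cross_val_score

# synthetic stand-in for the dataset: 8 inputs plus a class label in column 8
X, y = make_classification(n_samples=200, n_features=8, random_state=1)
data = DataFrame(X)
data[8] = y

# mark every tenth value in column 3 as missing, then drop the affected rows
data.loc[::10, 3] = np.nan
data = data.dropna()

# split into inputs and outputs, as in the earlier example
values = data.values
X_clean, y_clean = values[:, 0:8], values[:, 8]

# evaluate LDA with 3-fold cross-validation on the rows that remain
cv = KFold(n_splits=3, shuffle=True, random_state=1)
result = cross_val_score(LinearDiscriminantAnalysis(), X_clean, y_clean, cv=cv, scoring='accuracy')
print('Accuracy: %.3f' % result.mean())
```

With the NaN rows removed, the evaluation completes and reports a numeric accuracy.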

However, removing rows with missing values can be too limiting on some predictive modeling problems. An alternative is to impute missing values. We will explore how we can impute missing data values using statistics later.

---

## 6. Summary

In this exercise, you discovered how to handle machine learning data that contains missing
values. Specifically, you learned:
- How to mark invalid or corrupt values as missing in your dataset.
- How to confirm that the presence of marked missing values causes problems for learning algorithms.
- How to remove rows with missing data from your dataset and evaluate a learning algorithm on the transformed dataset.