# Learning Objectives

* Introduce the issue around datasets with missing values
* Investigate different strategies for dealing with missing values in datasets

### Motivation

* Even the simple PM2.5 dataset we introduced had missing values (indicated by "NA")
* So far we dealt with them simply by **discarding** those instances

```Python
df = df.dropna(subset=['pm2.5'])
```

**OR**

```Python
dataset = [d for d in dataset if d[5] != 'NA']
```

* This was an okay strategy when dealing with a single feature where missing data was rare, but how would it generalize?

* In particular, this approach wouldn't work if **many** features might be missing
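
To make the last point concrete, here is a minimal sketch on synthetic data (not the PM2.5 dataset). It assumes a hypothetical table with 10 features, each missing independently about 10% of the time, and compares how many rows survive when we filter on one feature versus on all of them. The names `toy`, `kept_one`, and `kept_all` are illustrative only.

```python
import numpy as np
import pandas as pd

# Synthetic data (an assumption for illustration): 1000 rows, 10 features,
# each entry independently missing with probability 0.1
rng = np.random.default_rng(0)
n_rows, n_features = 1000, 10
values = rng.normal(size=(n_rows, n_features))
values[rng.random(size=(n_rows, n_features)) < 0.1] = np.nan
toy = pd.DataFrame(values, columns=[f"f{i}" for i in range(n_features)])

# Fraction of rows kept when filtering on a single feature
kept_one = len(toy.dropna(subset=["f0"])) / n_rows

# Fraction of rows kept when any missing feature discards the row
kept_all = len(toy.dropna()) / n_rows

print(f"kept (one feature): {kept_one:.2f}")
print(f"kept (all features): {kept_all:.2f}")
```

Under these assumptions we would expect roughly 90% of rows to survive the single-feature filter, but only around 0.9^10 (about a third) to survive filtering on every feature.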

## Concept: Strategies for dealing with missing values

In this lecture we'll look at three strategies for dealing with missing data:

* **Filtering** (i.e., discarding instances with missing values), as we discussed on the previous slide
* **Missing data imputation**: filling in the missing values with "reasonable" estimates
* **Modeling**: changing our regression/classification algorithms to handle missing data explicitly

### Missing data imputation

Imputation seeks to replace missing values with reasonable estimates:

* A simple scheme would be to replace every missing value with the **average** for that feature.
  - What are the consequences of such a scheme?
    - The average may be sensitive to outlying values (though this could be addressed by using the **median** instead)
    - The imputed value may or may not be "reasonable" (e.g. consider our `gender = male` feature, where the average of a binary indicator is a fraction that no actual instance takes)
    
    
* Alternatively we could consider more sophisticated data imputation schemes
  - Rather than imputing with the overall mean, would it make more sense to use the mean of **a certain subgroup** (e.g. if "height" is missing, we could impute the average height of users with the same gender)?
  
  - We could also train a separate **predictor** to impute the missing values (though this is complex if there are missing values for many different features)
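
The simpler schemes above can be sketched in a few lines of pandas. This is only an illustration on a tiny hypothetical table (the `toy` DataFrame and its `gender` and `height` columns are made up for this example), showing imputation with the overall mean, the overall median, and the mean of the matching subgroup.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: 'height' has two missing entries (NaN)
toy = pd.DataFrame({
    "gender": ["female", "female", "female", "male", "male", "male"],
    "height": [160.0, 164.0, np.nan, 178.0, np.nan, 182.0],
})

# Scheme 1: replace every missing value with the overall mean (NaNs are skipped)
mean_imputed = toy["height"].fillna(toy["height"].mean())

# Scheme 2: use the median instead, which is less sensitive to outliers
median_imputed = toy["height"].fillna(toy["height"].median())

# Scheme 3: replace each missing value with the mean of its own subgroup
group_means = toy.groupby("gender")["height"].transform("mean")
group_imputed = toy["height"].fillna(group_means)

print(pd.DataFrame({
    "gender": toy["gender"],
    "mean": mean_imputed,
    "median": median_imputed,
    "group_mean": group_imputed,
}))
```

In this toy table the overall mean fills both gaps with the same value (171), whereas the subgroup mean fills them with 162 and 180, which may be closer to what we would consider "reasonable" for each row.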

### Modeling missing data

How can we directly model the missing values within a regression or classification algorithm?

* One simple scheme: add an **additional feature** indicating that a value is missing
* E.g.:
$$
\begin{array}{l}
\mbox{feature} = [1, 0, 0] \mbox{ for "female"} \\
\mbox{feature} = [0, 1, 0] \mbox{ for "male"} \\
\mbox{feature} = [0, 0, 1] \mbox{ for "feature missing"}
\end{array}
$$
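
The encoding above could be written as a small helper, in the same plain-Python style as our earlier filtering code. The function names `encode_gender` and `encode_numeric` are hypothetical. The second helper is an assumption that goes slightly beyond the slide: it applies the same "missing indicator" idea to a numeric feature by pairing a placeholder value of 0 with a flag the model can learn a weight for.

```python
def encode_gender(value):
    """One-hot encode gender, with a third slot meaning 'feature missing'."""
    if value == "female":
        return [1, 0, 0]
    if value == "male":
        return [0, 1, 0]
    # Anything else (e.g. 'NA' or None) is treated as missing
    return [0, 0, 1]


def encode_numeric(value, missing_marker="NA"):
    """Return [value, is_missing]; the value slot is 0 when the entry is missing."""
    if value is None or value == missing_marker:
        return [0.0, 1]
    return [float(value), 0]


print(encode_gender("female"))   # [1, 0, 0]
print(encode_gender("NA"))       # [0, 0, 1]
print(encode_numeric("12.5"))    # [12.5, 0]
print(encode_numeric("NA"))      # [0.0, 1]
```

With this kind of encoding no instances are discarded; the model instead fits a separate parameter for the "missing" case.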


## Summary of concepts

* Discussed some simple schemes for dealing with missing data
* Introduced the ideas of **data imputation** and **modeling missing data**

### On your own...

* Extend our previous code (on pm2.5 levels vs. air temperature) to handle missing features
* Experiment with different missing data imputation schemes and note their effect on performance