# Chapter 4: Data Preprocessing

## Introduction

A good machine learning model is nothing without good data. In this chapter, Raschka goes over some common techniques for getting your data into shape for the training/testing process.

We cover three general elements of preprocessing:

1. [Removing and imputing **missing data**](#Removing-and-imputing-missing-data)
2. [Dealing with **categorical** variables](#Dealing-with-categorical-variables)
3. [**Feature selection**](#Feature-selection)

In the process, we make liberal use of the `sklearn.preprocessing` module, which has a lot of handy preprocessing functions built-in.

### Using scikit-learn's estimator API

Most of scikit-learn's preprocessing classes (transformers) share two essential methods:

- `fit()`: learns parameters based on sample data
- `transform()`: uses learned parameters to change the values of the inputs

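The `fit()` method should look familiar: all of the classification models we've used so far learn their parameters with it. The difference is that classifiers pair `fit()` with `predict()`, while preprocessing classes pair it with `transform()`.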

Scikit-learn transformers also usually include a method that combines these two steps into one operation:

- `fit_transform()`: learns parameters based on sample data, then uses those parameters to transform and return the sample data

We'll most often use `fit_transform()` when we're preprocessing data, since we'll often want a transformed version of the training data. The separate `fit()` and `transform()` methods matter once test data is involved: we fit on the training set only, then reuse the learned parameters to transform the test set, so that both sets are changed in exactly the same way.
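As a minimal sketch of the three methods, the example below uses `StandardScaler` from `sklearn.preprocessing`. The choice of scaler and the toy arrays are assumptions made for illustration; they are not taken from the chapter.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: two features, three training rows.
X_train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test = np.array([[2.0, 40.0]])

scaler = StandardScaler()

# fit() learns the per-feature mean and standard deviation from X_train.
scaler.fit(X_train)

# transform() applies those learned parameters to any data,
# including data that was not seen during fit().
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

# fit_transform() performs both steps on the same data in one call.
X_train_scaled_again = StandardScaler().fit_transform(X_train)
```

Note that the test row is scaled with the training mean and standard deviation, not its own.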

## Removing and imputing missing data

### What's up with missing data?

If they're not handled properly, missing data (i.e. missing *features* - row values that are empty) can be a huge source of error in machine learning models.

Some common reasons that row values can be missing from a dataset:

1. Errors in the collection process
2. Conscious decisions by the schema designers (e.g. `NULL` indicates that the respondent refused to answer)
3. The feature does not apply to the sample (e.g. conditional features)

It's important to be aware of any **intended meanings** behind missing values before making a decision about how to interpret them. In 2 and 3 above, for example, missing values have *semantic meaning* that must be interpreted in the way that the producers of the data intended.

### Common tactics

In general, there are two ways of dealing with missing data:

1. **Eliminate samples** (or features) that contain erroneously missing values
2. **Impute (guess)** missing values based on context

The advantage of eliminating samples or features is that it's easy and never introduces guessed values; the downside is that it can remove valuable information from your data and, if the values are not missing at random, lead to a biased estimator.
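A minimal sketch of the elimination approach, assuming the data lives in a pandas `DataFrame`. The column names and values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset with two missing values (NaN).
df = pd.DataFrame({
    "A": [1.0, 5.0, 10.0],
    "B": [2.0, 6.0, 11.0],
    "C": [3.0, np.nan, 12.0],
    "D": [4.0, 8.0, np.nan],
})

# Count the missing values in each column before deciding what to drop.
missing_per_column = df.isnull().sum()

# Drop every row (sample) that contains at least one missing value.
rows_dropped = df.dropna(axis=0)

# Drop every column (feature) that contains at least one missing value.
cols_dropped = df.dropna(axis=1)
```

On this toy data, dropping rows keeps only one of the three samples, which shows how quickly elimination can discard information.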

Since training data is often precious and difficult to come by, imputation is frequently the better option. Two common methods for **imputing** missing data are:

1. **Mean imputation**: Substitute the mean of the feature for the missing value (most common for numerical variables)
2. **Mode imputation**: Substitute the mode of the feature for the missing value (most common for categorical variables)

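Different imputation strategies are available via the [`sklearn.preprocessing.Imputer`](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.Imputer.html#sklearn.preprocessing.Imputer) class. Note that `Imputer` was deprecated in scikit-learn 0.20 and removed in 0.22; in current versions, `sklearn.impute.SimpleImputer` takes its place.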
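The two strategies can be sketched as follows. This example assumes a recent scikit-learn release and therefore uses `sklearn.impute.SimpleImputer` rather than the older `Imputer` class named above. The arrays are made-up toy data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Mean imputation on a numerical array (toy values, for illustration).
X_num = np.array([[1.0, 2.0],
                  [np.nan, 4.0],
                  [3.0, np.nan]])
mean_imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Each NaN is replaced by the mean of the observed values in its column.
X_num_imputed = mean_imputer.fit_transform(X_num)

# Mode imputation on a categorical column stored as Python objects.
X_cat = np.array([["red"], ["blue"], [np.nan], ["red"]], dtype=object)
mode_imputer = SimpleImputer(missing_values=np.nan, strategy="most_frequent")
# The NaN is replaced by the most frequent observed value in the column.
X_cat_imputed = mode_imputer.fit_transform(X_cat)
```

Because the imputers follow the estimator API, the statistics learned from the training data with `fit()` can later be applied to test data with `transform()`.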