# Preprocessing with pandas and scikit-learn

## What you'll learn in this class 🧐🧐

In the coming weeks, you will learn how to do machine learning. The preparation step you must take before using any machine learning model is called preprocessing. In the previous course we learned about ETL, the process of taking messy, unstructured data and converting it to a structured, often tabular, form that is fit for further analysis. Preprocessing happens once you have structured data that you need to clean before you can start training your models. Your data may be structured, but it can still contain a wide variety of data types (numbers, text, dates, times, ...) that you will have to transform, and you will also have to deal with missing data, because a machine learning model is a mathematical model that only knows how to process numbers. Here's what you'll be able to do after this course:

* Choose the right preprocessing methods according to the type of data you want to use in a machine learning model

## What is preprocessing 👷👷

From this module onwards, you will get to the heart of machine learning! Without going into details (this is not the purpose of this course), you can remember for now that "machine learning" is a set of methods that make it possible to carry out tasks without explicitly coding all the steps these tasks require. The applications are very diverse: image recognition, semantic analysis, cost prediction, customer segmentation, ... The principle of machine learning is to "show" data to a mathematical model, which "learns" from it according to a set of rules that define the optimal way to perform its given task.

You may have discovered that in Data Science, we have to manipulate different types of data: it can be tables of numbers, but also text, or even images. However, since machine learning models are mathematical models, they are only able to process numbers. This course will provide you with techniques to transform the different types of data into numbers that can be interpreted by a model.

As you will discover during the rest of the Full Stack program, preprocessing is a crucial step in machine learning, and it is a science in its own right! Data Scientists often spend a lot of time figuring out how to properly prepare the data to optimize the performance of a machine learning model (some statistics say preprocessing accounts for 80% of the total duration of a machine learning project).

Today's course will introduce the "basic" steps of preprocessing, those you will almost always encounter when you do machine learning. You will learn about two libraries used to code the different preprocessing steps in Python: pandas and scikit-learn.


## The different types of data 🔢🔢

### Structured data 📊

All data that can be represented in tabular form is called "structured data" (you may also see the term "flat database"). The purpose of the ETL process we learned about in the previous course is to produce such data: SQL table, CSV file, dictionary returned by an API, etc. As a rule of thumb, if you can find a way to read and store your data in a pandas DataFrame, then you are working with structured data.

When working with structured data, the columns of the table are called variables: age, salary, gender, country, etc. These variables can be classified into four broad families. These families are useful to keep in mind because the preprocessing steps we apply depend on the family of the variable we are working on.

#### The four main families of variables 📈

**Quantitative/numeric variables**
These are all variables that are written as a number. There are two sub-categories:

- continuous quantitative variable: can take all possible values in a given interval (similar to Python's *float* type). For example: a distance, temperature, salary, a patient's blood sugar level, etc.
- discrete quantitative variable: all potential values are separated by an interval of a given length (similar to Python's *int* type). For example: age, shoe size, exam score, etc.

**Qualitative/categorical variables** 
These are all variables that are not (directly) expressed as a number or that take a finite number of possible values. They are expressed in the form of short texts, called "modalities" or categories, or sometimes a limited collection of numbers. There are two subfamilies:

- ordinal qualitative variables: there is a hierarchy between the modalities. For example: the answers to a satisfaction poll: "very bad", "bad", "average", "good", "very good", or the types of house you created in exercise S1-3B "house market": "small house" / "large house" / "very large house".
- nominal qualitative variables: there is no hierarchy between modalities. For example: nationality, gender, socio-professional category, cities, postcodes... In general, most categorical variables are nominal.
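
As a rough illustration of these families, the sketch below builds a small, invented DataFrame (the column names and values are hypothetical) and uses pandas' `select_dtypes` to separate numeric columns from the others. Note that pandas only sees storage types: it cannot tell an ordinal variable from a nominal one, so that distinction remains your own judgment.

```python
import pandas as pd

# Hypothetical toy dataset, invented for illustration only
df = pd.DataFrame({
    "salary": [32000.5, 45000.0, 51000.75],        # continuous quantitative
    "age": [25, 41, 37],                           # discrete quantitative
    "satisfaction": ["bad", "good", "very good"],  # ordinal qualitative
    "country": ["France", "Spain", "Germany"],     # nominal qualitative
})

# Numeric columns (int and float) on one side, everything else on the other
numeric_columns = df.select_dtypes(include="number").columns.tolist()
categorical_columns = df.select_dtypes(exclude="number").columns.tolist()

print(numeric_columns)      # ['salary', 'age']
print(categorical_columns)  # ['satisfaction', 'country']
```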


#### Target variable / explanatory variables 💹

Most of the time in machine learning, we build models in order to predict the value of one variable from the other variables. This is called *supervised* machine learning. Supervised machine learning often uses the following notations and vocabulary:

- **Y**, the "target variable", "explained variable" or the "predicted variable" is the variable you are trying to predict.
- **X**, the "explanatory variables" are the other variables that the model will use to make its prediction.

For example: if you are trying to create a model that will predict a person's salary according to their professional experience in years, you are doing supervised machine learning. **Y** is the salary, **X** is the variable giving the person's experience, and you will need data for which the values of both **X** and **Y** are known in order to train the model, i.e. teach it how to make the best possible salary prediction based on experience.
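
In pandas, separating the target from the explanatory variables can be sketched as follows. The column names `experience_years` and `salary` and all the values are made up for the example.

```python
import pandas as pd

# Hypothetical training data where both X and Y are known
df = pd.DataFrame({
    "experience_years": [1, 3, 5, 10],
    "salary": [30000, 36000, 42000, 60000],
})

target_name = "salary"

Y = df[target_name]                 # target variable (a Series)
X = df.drop(columns=[target_name])  # explanatory variables (a DataFrame)

print(X.columns.tolist())  # ['experience_years']
print(Y.name)              # salary
```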


### Unstructured data 📋

Unstructured data is any data that cannot be stored in a tabular or JSON/dictionary format. Here are some examples of unstructured data:

- all the articles of Wikipedia (this is called a corpus of texts)
- images
- digitized sounds

The examples listed above require specific preprocessing techniques that you will learn later on during the Deep Learning module. For now, we will restrict this course to the preparation of structured data.

## The different types of preprocessing for structured data ⚙️⚙️

### Drop rows and columns 🗑️

First of all, we have to clean the dataset to keep only the rows and columns that will be useful for machine learning.

#### Drop columns ⬇️
Some columns need to be excluded from our datasets because they are not suitable to train a machine learning model:

- All columns that are "unique identifiers": dataset index, surname/first name of a person, social security number, transaction number. Machine learning relies on finding common patterns among the explanatory variables, so a variable that takes a unique value for each row cannot reveal such patterns.
- Columns with too many missing values. There is no general rule, but if the rate of missing values exceeds 60-70%, it is usually better to exclude the column from the dataset. Missing values cannot be used to train a model, so the rows containing them must either be excluded, which here would mean discarding 60-70% of the available data, or be filled with values chosen by the data scientist, which would introduce a bias in 60-70% of the rows.
- Nominal variables that have too many modalities. Again, there is no general rule, but for the small datasets we will be working on during the training, you will have to exclude columns that have more than 20-30 modalities. This rule must be interpreted relative to the size of the dataset: a variable with 100 modalities over 1,000 rows can't be used as is, whereas the same variable in a sample of 1,000,000 rows can be used for predictions.
- If two columns are colinear with each other (the absolute value of their correlation coefficient equals 1), only one of them should be kept: typically, you should never keep both the age and the year of birth of a person in your dataset.
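
Here is a minimal sketch of three of these rules on an invented dataset. The column names, the values and the 0.6 threshold are assumptions chosen for the example, not fixed rules. The number of modalities of a nominal column can be checked in a similar way with `nunique()`.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, invented for illustration only
df = pd.DataFrame({
    "customer_id": [101, 102, 103, 104, 105],
    "age": [30, 40, 50, 60, 70],
    "year_of_birth": [1994, 1984, 1974, 1964, 1954],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, 1.0],
    "salary": [30000, 42000, 39000, 58000, 51000],
})

# 1. Drop unique identifiers
df = df.drop(columns=["customer_id"])

# 2. Drop columns whose rate of missing values exceeds the chosen threshold
missing_rate = df.isna().mean()  # share of missing values per column
too_empty = missing_rate[missing_rate > 0.6].index.tolist()
df = df.drop(columns=too_empty)

# 3. Drop one column out of a perfectly correlated pair
correlation = df["age"].corr(df["year_of_birth"])  # close to -1 here
if abs(correlation) > 0.999:
    df = df.drop(columns=["year_of_birth"])

print(df.columns.tolist())  # ['age', 'salary']
```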

#### Drop rows ➡️
The following rows are excluded from the dataset:

- If we are working on a supervised machine learning problem, all the rows for which the target variable **Y** is missing.
- The rows with too many outliers or too many missing values across their columns, i.e. those with "strange" or even inconsistent values, or values very far from the usual ones: negative age, very high salary compared to the rest of the sample, unknown city name that appears only once in the dataset, ...
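
These filters could look like the sketch below. The data is invented, `salary` is assumed to be the target **Y**, and the requirement of at least two non-missing values per row is an arbitrary choice for this small example.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset; "salary" plays the role of the target Y
df = pd.DataFrame({
    "age": [25, -3, 41, 37, np.nan],
    "city": ["Paris", "Lyon", None, "Paris", None],
    "salary": [30000, 35000, np.nan, 42000, 39000],
})

# Drop the rows for which the target is missing
df = df.dropna(subset=["salary"])

# Drop inconsistent values: keep rows whose age is missing or non-negative
df = df[df["age"].isna() | (df["age"] >= 0)]

# Drop rows with too many missing values:
# keep only rows that have at least 2 non-missing values
df = df.dropna(thresh=2)

print(df.index.tolist())  # [0, 3]
```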

### Imputation of missing values 🔮
Imputation in machine learning means replacing missing values with valid information. There is a multitude of imputation methods, and it is a science in its own right. Below are the most common methods. Depending on the type of variable, you will use one or the other.

#### Quantitative variables 🔢
**Mean Imputation**

Mean imputation consists in replacing the missing values with the mean value of the variable. This method works on continuous quantitative variables.

**Median Imputation**

Median imputation involves replacing missing values with the median of the variable. This method is used for discrete quantitative variables, or for continuous quantitative variables that may have extreme values, such as salary, because the median is less sensitive to extreme values than the mean.
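
Both strategies are available in scikit-learn's `SimpleImputer`. The sketch below applies them to an invented dataset (the column names and values are hypothetical). Notice how the extreme salary would pull a mean upwards, while the median stays close to the typical values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical dataset with one missing value per column
df = pd.DataFrame({
    "temperature": [10.0, 20.0, np.nan, 30.0, 40.0],
    "salary": [30000.0, 32000.0, 31000.0, np.nan, 500000.0],
})

mean_imputer = SimpleImputer(strategy="mean")
median_imputer = SimpleImputer(strategy="median")

# fit_transform expects a 2D input and returns a 2D array, hence ravel()
df["temperature"] = mean_imputer.fit_transform(df[["temperature"]]).ravel()
df["salary"] = median_imputer.fit_transform(df[["salary"]]).ravel()

print(df.loc[2, "temperature"])  # 25.0, the mean of 10, 20, 30, 40
print(df.loc[3, "salary"])       # 31500.0, the median of the known salaries
```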


#### Categorical variables 🔠
For categorical variables, no mean or median can be calculated. In this case, the simplest way is to replace the missing values with the most frequent modality of the variable. This is called "mode" imputation.
Another very simple imputation method for categorical variables is to replace all missing values with the character string 'missing'. This lets the model view a missing value as a distinct piece of information rather than replacing it with an existing category.
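
Both options can be sketched with plain pandas on an invented column, as below. scikit-learn's `SimpleImputer` offers the same two behaviours through `strategy="most_frequent"` and `strategy="constant"`.

```python
import pandas as pd

# Hypothetical categorical column with two missing values
country = pd.Series(["France", "Spain", None, "France", None], name="country")

# Mode imputation: fill with the most frequent modality
most_frequent = country.mode()[0]
mode_imputed = country.fillna(most_frequent)

# Constant imputation: treat "missing" as a modality of its own
flag_imputed = country.fillna("missing")

print(mode_imputed.tolist())  # ['France', 'Spain', 'France', 'France', 'France']
print(flag_imputed.tolist())  # ['France', 'Spain', 'missing', 'France', 'missing']
```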

### Standardization/Normalization (quantitative variables only) 📊
To avoid having to deal with very large values (this is rarely advisable in computing, as it can cause numerical precision issues), and to help our models train properly, we will normalize all the quantitative explanatory variables so that their values are contained in reasonable intervals, typically a few units. There are various methods for normalizing data, but the most common is called Standardization: the variable $X$ is replaced with $\frac{X - \bar{X}}{\sqrt{\mathrm{Var}(X)}}$. We subtract the variable's mean and divide by its standard deviation, resulting in a variable with a mean of 0 and a standard deviation of 1.
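
A minimal sketch of standardization, using scikit-learn's `StandardScaler` on made-up values and comparing it with the formula written out by hand. Both use the population standard deviation (NumPy's default `ddof=0`).

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical quantitative variable, one column and five rows
X = np.array([[1000.0], [2000.0], [3000.0], [4000.0], [5000.0]])

# With scikit-learn
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Same computation by hand: subtract the mean, divide by the standard deviation
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))  # True
print(X_scaled.ravel())                 # values now lie within a few units of 0
```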

Note: When using supervised machine learning, the target variable is generally not normalized. It can however be rescaled when its distribution is really skewed, in which case we can apply a $\log$ transformation to the target variable!
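
As a small illustration with made-up salaries, a $\log$ transformation compresses a skewed target, and the exponential brings predictions back to the original scale. This assumes strictly positive values, since the logarithm is undefined otherwise.

```python
import numpy as np

# Hypothetical skewed target: one salary is far above the others
y = np.array([20000.0, 25000.0, 30000.0, 1000000.0])

y_log = np.log(y)        # compressed scale, used for training
y_back = np.exp(y_log)   # back to the original scale

print(y.max() / y.min())          # 50.0, a very wide range
print(y_log.max() - y_log.min())  # about 3.9, a much narrower range
```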

### Encoding (categorical variables only) 💻
Finally, since machine learning models deal with numbers only, all categorical variables must be encoded, that is, turned into numbers. Depending on the type of variable, different methods will be used:


#### Target variable 🎯
In supervised machine learning, we often deal with categorical target variables. In this case, we will simply encode it by making each modality correspond to a number. This transformation can be summarized as follows:

|  Y  | 
|-----|
| no  |
| yes |
| no  |
| no  |
| yes |

becomes:

| Y | 
|---|
| 0 |
| 1 |
| 0 |
| 0 |
| 1 |
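
The transformation shown in the tables above can be sketched with scikit-learn's `LabelEncoder`, which assigns numbers to the modalities in sorted order (so "no" becomes 0 and "yes" becomes 1).

```python
from sklearn.preprocessing import LabelEncoder

# Same target values as in the table above
y = ["no", "yes", "no", "no", "yes"]

encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)

print(y_encoded)         # [0 1 0 0 1]
print(encoder.classes_)  # ['no' 'yes']

# The encoder can also translate numbers back into modalities
print(encoder.inverse_transform([1, 0]))  # ['yes' 'no']
```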


#### Explanatory variables 🔡
**Ordinal variable**
Ordinal variables can be encoded in the same way as the target variable, by mapping each modality to a number that respects the hierarchy. If you choose to do that, you may then normalize them as if they were quantitative variables. However, remember that it is rare to have to deal with ordinal categorical variables.
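
One possible way to do this is scikit-learn's `OrdinalEncoder`, sketched below on the satisfaction poll example. The order of the modalities is passed explicitly through `categories`; without it, the encoder would sort them alphabetically and lose the hierarchy. The column name and values are invented.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical answers to a satisfaction poll
df = pd.DataFrame(
    {"satisfaction": ["good", "very bad", "average", "very good", "bad"]}
)

# Modalities listed from lowest to highest
levels = ["very bad", "bad", "average", "good", "very good"]

encoder = OrdinalEncoder(categories=[levels])
encoded = encoder.fit_transform(df[["satisfaction"]])

print(encoded.ravel())  # [3. 0. 2. 4. 1.]
```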

**Nominal variable**
Most categorical explanatory variables are nominal, so the method described here is the one you will use most of the time.

When the variable is nominal, it cannot be encoded by simply replacing each modality with a different number, because numbers are ordered (2 is bigger than 1) and the modalities of a nominal variable are not. In this case, the simplest method is what is called One Hot Encoding (or Dummy Encoding): we will create a binary variable for each modality. Let's present an example to make it clearer:

| Country |
|---------|
| France  |
| France  |
| Spain   |
| Germany |
| Germany |
| Spain   |
| Spain   |

becomes:

| France | Spain | Germany |
|--------|-------|---------|
|    1    |   0    |    0     |
|    1    |   0    |    0     |
|    0    |   1    |    0     |
|    0    |   0    |    1     |
|    0    |   0    |    1     |
|    0    |   1    |    0     |
|    0    |   1    |    0     |

An additional subtlety: we saw earlier in this course that we should avoid colinear columns. In the above example, if the values in the first two columns are known, the values in the third column can be determined using the linear equation (Germany = 1 - France - Spain). This is called colinearity. For this reason, one of the columns produced by the encoding is usually discarded, resulting in:

| France | Spain |
|--------|-------|
|    1    |   0    |
|    1    |   0    |
|    0    |   1    |
|    0    |   0    |
|    0    |   0    |
|    0    |   1    |
|    0    |   1    |

In this way, no information was lost, but colinearity was avoided.
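
The example above can be reproduced with `pandas.get_dummies`, as in the sketch below. The "Germany" column is dropped by name to match the tables. The shortcut `drop_first=True` also exists, but it drops the first modality in alphabetical order ("France" here), and scikit-learn's `OneHotEncoder` has a comparable `drop` parameter.

```python
import pandas as pd

# Same values as in the table above
df = pd.DataFrame(
    {"Country": ["France", "France", "Spain", "Germany", "Germany", "Spain", "Spain"]}
)

# One binary column per modality (columns come out in alphabetical order)
one_hot = pd.get_dummies(df["Country"], dtype=int)
print(one_hot.columns.tolist())  # ['France', 'Germany', 'Spain']

# Discard one column to avoid colinearity
reduced = one_hot.drop(columns=["Germany"])
print(reduced["France"].tolist())  # [1, 1, 0, 0, 0, 0, 0]
print(reduced["Spain"].tolist())   # [0, 0, 1, 0, 0, 1, 1]
```

Rows where both remaining columns are 0 correspond to Germany, so the dropped column can always be recovered.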
