# Section 7. Preprocessing

#### Instructor: Pierre Biscaye

The content of this notebook draws on material from UC Berkeley D-Lab's Python Machine Learning [course](https://github.com/dlab-berkeley/Python-Machine-Learning).

Preprocessing is the process of data cleaning and preparation for analysis. This is an essential step for any data work, and no less for the machine learning workflow and the performance of models. This notebook will introduce the major steps of preprocessing for machine learning. 


### Sections

1. Missing data
2. Processing categorical data: dummy encoding
3. Processing continuous data: outliers and normalization


## Load Data

For today, we will be working with the `penguins` data set, a common public data set for teaching visualization and exploration. This data set is from [Kaggle](https://www.kaggle.com/parulpandey/penguin-dataset-the-new-iris) and includes data on penguins of three different species, their location, and some measurements for each penguin.

First, let's import some packages we'll need.

In [None]:
import warnings

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split

Now, let's load in the data.

In [None]:
data = pd.read_csv('Data/penguins.csv')
data

Below is the information for each of the columns:
1. **species**: Species of penguin [Adelie, Chinstrap, Gentoo]
2. **island**: Island where the penguin was found [Biscoe, Dream, Torgersen]
3. **culmen_length_mm**: Length of upper part of penguin's bill (millimeters)
4. **culmen_depth_mm**: Height of upper part of bill (millimeters)
5. **flipper_length_mm**: Length of penguin flipper (millimeters)
6. **body_mass_g**: Body mass of the penguin (grams)
7. **sex**: Biological sex of the penguin [MALE, FEMALE]

*Question:* Which of the columns are continuous? Which are categorical?

We will need to treat the numeric and categorical data differently in preprocessing.


## 1. Missing Data Preprocessing

First, let's check to see if there are any missing values in the data set. Missing values are represented by `NaN`. 

*Question:* In this case, what do missing values stand for?

In [None]:
data.isnull().sum()

It is also possible for missing values to be coded as something other than `NaN`. For example, let's take a look at the `sex` column.

In [None]:
data['sex'].unique()

In this case, the `.` represents a missing value, so let's replace those with `np.nan` objects.

In [None]:
data.replace('.', np.nan, inplace=True)

data['sex'].unique()
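As an alternative to replacing the placeholder after loading, pandas can treat it as missing while parsing the file. The sketch below uses a small, made-up CSV string (not the real penguins file) to show the `na_values` argument of `pd.read_csv`.

```python
import io

import pandas as pd

# Hypothetical three-row CSV standing in for the penguins file
csv_text = "species,sex\nAdelie,MALE\nGentoo,.\nChinstrap,FEMALE\n"

# na_values tells pandas to read "." as a missing value
toy = pd.read_csv(io.StringIO(csv_text), na_values=".")

print(toy["sex"].isna().sum())
```

The same argument should work with a file path in place of the `StringIO` object.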

### Imputation

In the case of missing values, we have the option to fill in the missing values with a best guess. `sklearn` has a class, `SimpleImputer`, that has flexible options for how to approach this.

There are many strategies that can be used to impute missing data ([see function documentation](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html)), some of which we have discussed previously.

Here we'll impute any missing values for two selected variables using the average, or mean, of all the data that does exist for those variables -- this is making a best guess for what the values would have been. We'll then save this as a new dataset, rather than overwriting the original data.

Let's see how the `SimpleImputer` works on a subset of the data. 

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean', 
                        copy=True)
imputed = imputer.fit_transform(data[['body_mass_g','flipper_length_mm']])
imputed

Now let's check that the previously null values have been filled in. 

In [None]:
# Use a boolean mask so the lookup does not depend on the DataFrame's index labels
print(imputed[data['body_mass_g'].isna().to_numpy()])
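To see exactly what mean imputation does, here is a minimal sketch on a made-up three-row table (the values are invented for illustration). After fitting, the `statistics_` attribute holds the per-column means that are used as fill values.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({
    "body_mass_g": [3000.0, np.nan, 5000.0],
    "flipper_length_mm": [180.0, 200.0, np.nan],
})

toy_imputer = SimpleImputer(strategy="mean")
toy_imputed = toy_imputer.fit_transform(toy)

# Per-column means learned from the non-missing values
print(toy_imputer.statistics_)
# The missing entries are replaced by those means
print(toy_imputed)
```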

### Dropping Null Values

Another option is to use `DataFrame.dropna()` to drop null values from the `DataFrame`. This should almost always be used with the `subset` argument, which restricts the method to dropping only rows that are null in the specified column(s).

Here we will actually overwrite the data.

In [None]:
data = data.dropna(subset=['sex'])

# Now this line will return an empty dataframe
data[data['sex'].isna()]

In [None]:
# There are now 11 fewer rows
data.shape
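The effect of the `subset` argument is easiest to see on a small example. The sketch below uses an invented three-row table in which a different row is missing a value in each column.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: row 1 is missing sex, row 2 is missing body mass
toy = pd.DataFrame({
    "sex": ["MALE", np.nan, "FEMALE"],
    "body_mass_g": [3700.0, 3800.0, np.nan],
})

# Without subset, any row with a missing value in any column is dropped
dropped_any = toy.dropna()
# With subset, only rows missing a value in the listed columns are dropped
dropped_sex = toy.dropna(subset=["sex"])

print(len(dropped_any), len(dropped_sex))
```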

Whether you need to drop observations with missing data yourself depends on the tools you use. Some statistical packages automatically drop any observation with a missing value in a variable included in the model, whereas most `sklearn` estimators raise an error when the input contains `NaN`. Dropping explicitly is also potentially important if you are running analyses on different subsets of variables and want to make sure you always include the same observations.

Either way, if you do not impute values for missing data, those observations cannot be used in any analysis involving the variables with the missing data.

## 2. Categorical Data Processing

As we saw earlier, the `penguins` dataset contains both categorical and continuous features, which each need to be preprocessed in different ways. First, we want to transform the categorical variables from strings to **indicator variables**. Indicator variables have one column per level. For example, the island variable will change from Biscoe/Dream/Torgersen to Biscoe (1/0), Dream (1/0), and Torgersen (1/0). For each set of indicator variables, there should be a 1 in exactly one column.

 Let's make a list of the categorical variable names to be transformed into indicator variables, and save a dataset of just these variables.

In [None]:
# Define the variable names that are categorical for use later
cat_var_names = ['species','island', 'sex']
# Take a copy so that adding columns later does not modify a view of `data`
data_cat = data[cat_var_names].copy()
data_cat.head()

### Categorical Variable Encoding (One-hot & Dummy)

Many machine learning algorithms require that categorical data be encoded numerically in some fashion. There are two main ways to do so:


- **One-hot-encoding**, which creates `k` new variables for a single categorical variable with `k` categories (or levels), where each new variable is coded with a `1` for the observations that contain that category, and a `0` for each observation that doesn't. 
- **Dummy encoding**, which creates `k-1` new variables for a categorical variable with `k` categories

However, with some machine learning algorithms we can run into the so-called ["Dummy Variable Trap"](https://www.algosome.com/articles/dummy-variable-trap-regression.html) when using one-hot encoding. Each set of one-hot-encoded variables sums across columns to a single column of all `1`s, so the set is perfectly multicollinear with an intercept term and with any other one-hot-encoded set in the same model. This can lead to misleading results.

To resolve this, we can keep an intercept term in our model (which is all `1`s) and remove the first one-hot-encoded variable for each categorical variable, resulting in `k-1` so-called "Dummy Variables".
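The collinearity can be checked numerically. The sketch below builds a made-up design matrix for one three-level categorical variable and compares its rank with its number of columns. A rank lower than the column count means one column is a linear combination of the others.

```python
import numpy as np

# Hypothetical sample: six observations of a three-level category
levels = np.array([0, 1, 2, 0, 1, 2])
one_hot = np.eye(3)[levels]      # k = 3 indicator columns
intercept = np.ones((6, 1))      # column of all 1s

# Intercept plus all k indicators: the indicators sum to the intercept
X_trap = np.hstack([intercept, one_hot])
# Intercept plus k-1 indicators: the first level is dropped
X_dummy = np.hstack([intercept, one_hot[:, 1:]])

print(np.linalg.matrix_rank(X_trap), X_trap.shape[1])
print(np.linalg.matrix_rank(X_dummy), X_dummy.shape[1])
```

The first matrix is rank deficient (rank 3 with 4 columns), while the dummy-encoded matrix has full column rank.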

Luckily the `OneHotEncoder` from `sklearn` can perform both one-hot and dummy encoding simply by setting the `drop` parameter (`drop = 'first'` for Dummy Encoding and `drop = None` for One Hot Encoding). 

**Question:** How many total columns will there be in the output?

In [None]:
from sklearn.preprocessing import OneHotEncoder
# sparse_output=False requires scikit-learn >= 1.2; use sparse=False on older versions
dummy_e = OneHotEncoder(categories='auto', drop='first', sparse_output=False)
dummy_e.fit(data_cat);
dummy_e.categories_

In [None]:
temp = dummy_e.transform(data_cat)
temp
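To compare the two settings of `drop` side by side, here is a small sketch on invented data with one three-level and one two-level variable. It calls `.toarray()` on the default sparse output so that it does not depend on the `sparse_output` or `sparse` argument name.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: color has 3 levels, size has 2 levels
toy = pd.DataFrame({
    "color": ["red", "green", "blue", "green"],
    "size": ["S", "L", "S", "L"],
})

# One-hot encoding keeps all k columns per variable: 3 + 2
one_hot_out = OneHotEncoder(drop=None).fit_transform(toy).toarray()
# Dummy encoding keeps k-1 columns per variable: 2 + 1
dummy_out = OneHotEncoder(drop="first").fit_transform(toy).toarray()

print(one_hot_out.shape, dummy_out.shape)
```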

We can also encode categorical variables as indicator variables manually by looping through the categorical values.

In [None]:
for v in data_cat['sex'].unique():
    data_cat[v]=data_cat['sex']==v
data_cat.head()

That loop was one-hot encoding. What if we want to do dummy encoding? We want $k-1$ new variables. We can tweak the loop to accomplish this by including an if statement and telling the loop to stop before the last unique categorical value.

In [None]:
data_cat['species'].unique()

In [None]:
i = 1
for v in data_cat['species'].unique():
    if i < data_cat['species'].nunique():
        data_cat[v]=data_cat['species']==v
        i += 1
data_cat.head()
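pandas also offers a one-line alternative to the manual loop. The sketch below applies `pd.get_dummies` with `drop_first=True` to an invented column. Note that it drops the first level in sorted order, whereas the loop above skips the last level in order of appearance, so the omitted category can differ.

```python
import pandas as pd

# Hypothetical toy column with three species
toy = pd.DataFrame({"species": ["Adelie", "Gentoo", "Chinstrap", "Adelie"]})

# drop_first=True keeps k-1 indicator columns (levels are sorted first)
dummies = pd.get_dummies(toy["species"], drop_first=True)

print(dummies.columns.tolist())
```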

## 3. Continuous Data Preprocessing

For numeric data, we don't need to create indicator variables, but there are many potential considerations for preparing the data for analysis.

 Let's subset out the continuous variables to be normalized.

In [None]:
# cat_var_names already includes 'species', so dropping it removes all categorical columns
data_num = data.drop(columns=cat_var_names)
data_num.head()

### Checking for outliers

One consideration that we discussed previously is checking for **outliers** and deciding how to treat those values.

A simple approach to identifying potential outliers is plotting histograms.

In [None]:
fig, ax = plt.subplots(ncols=2, nrows=2, figsize=(8,8))

data_num['culmen_length_mm'].hist(grid=False, ax=ax[0,0])
data_num['culmen_depth_mm'].hist(grid=False, ax=ax[0,1])
data_num['flipper_length_mm'].hist(grid=False, ax=ax[1,0])
data_num['body_mass_g'].hist(grid=False, ax=ax[1,1])

plt.show()

There are no obvious outliers for any of these variables. If there were, we would need to determine how to *define* an outlier (i.e., values outside some threshold, which may be a percentile of the distribution), and how to *process* the outliers (i.e., set to missing, set to the threshold value, set to the median, etc.).
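As an illustration of one possible definition and two possible treatments, the sketch below uses an invented series with one extreme value. The 5th and 95th percentile thresholds are an arbitrary assumption, and a percentile rule will always flag some values in the tails, whether or not they are genuinely anomalous.

```python
import pandas as pd

# Hypothetical measurements with one extreme value
values = pd.Series([10.0, 11.0, 12.0, 13.0, 14.0,
                    15.0, 16.0, 17.0, 18.0, 500.0])

# Define outliers as values outside the 5th to 95th percentile range
lower, upper = values.quantile([0.05, 0.95])
is_outlier = (values < lower) | (values > upper)

# Treatment 1: set outliers to the threshold values
clipped = values.clip(lower=lower, upper=upper)
# Treatment 2: set outliers to missing
set_missing = values.mask(is_outlier)

print(is_outlier.sum(), clipped.max(), set_missing.isna().sum())
```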

### Normalization

Another processing approach that is often useful for machine learning is normalizing our variables. This puts all variables on a common, unitless scale with the same mean and spread, which helps improve the performance of many machine learning models (see [here](https://en.wikipedia.org/wiki/Feature_scaling)).

[Normalization](https://en.wikipedia.org/wiki/Normalization_(statistics)) is a transformation that puts data into some known "normal" scale. There are many forms of normalization, but perhaps the most useful to machine learning algorithms is the "z-score", also known as the standard score. This approach is also useful in econometrics, particularly when comparing effect sizes of different variables or on different outcomes.

To z-score normalize the data, we simply subtract the mean of the data, and divide by the standard deviation. This results in data with a mean of `0` and a standard deviation of `1`.

We'll use the `StandardScaler` from `sklearn` to do normalization, but you can also code this manually.

In [None]:
from sklearn.preprocessing import StandardScaler
norm_e = StandardScaler()
norm_e.fit_transform(data_num)

To check that the normalization works, let's look at the mean and standard deviation of the resulting columns.

In [None]:
# Fit and transform once, then inspect the result
data_norm = norm_e.fit_transform(data_num)
print('mean:', data_norm.mean(axis=0))
print('std:', data_norm.std(axis=0))
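For comparison, the z-score can be computed by hand. The sketch below uses invented values and checks the manual result against `StandardScaler`. One detail to watch: `StandardScaler` divides by the population standard deviation (`ddof=0`), while the pandas `.std()` method defaults to the sample standard deviation (`ddof=1`).

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy measurements
toy = pd.DataFrame({
    "body_mass_g": [3000.0, 4000.0, 5000.0, 6000.0],
    "flipper_length_mm": [180.0, 190.0, 200.0, 230.0],
})

# Manual z-score: subtract the mean, divide by the population std
manual = (toy - toy.mean()) / toy.std(ddof=0)
scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual.to_numpy(), scaled))
```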

## 4. Combine it all together

Now let's combine what we've learned to preprocess the entire dataset.

First we will reload the data set to start with a clean copy.

In [None]:
# load raw data
data = pd.read_csv('Data/penguins.csv')
# ensure all missing values are coded as np.nan
data.replace('.', np.nan, inplace=True)
# drop observations with missing sex
data = data.dropna(subset=['sex'])


We will now split the data into a training and test set. In the next notebook we will be developing a **classification model** to predict penguin species using other characteristics.

We will stratify the split by species to ensure the training and test shares are the same within each species.

Why do we split the data before preprocessing? The best practice is to fit preprocessing methods on the training data only. This avoids **data leakage**, where information from the test data influences the training process. For example, we don't want to impute means that include test data or normalize based on a distribution that includes the test data.

In [None]:
# Perform the train-test split
y = data['species']
X = data.drop(columns='species')
X_train, X_test, y_train, y_test=train_test_split(X, y, test_size=.25, stratify=y, random_state=28)
print(X_train.shape)
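To see what stratifying does, the sketch below splits an invented data set with a 60/40 class balance and compares the class counts in each part. The class labels and sizes are assumptions chosen so that the shares divide evenly.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels: 60 observations of class A, 40 of class B
y_toy = pd.Series(["A"] * 60 + ["B"] * 40)
X_toy = pd.DataFrame({"x": range(100)})

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, stratify=y_toy, random_state=0
)

# Both parts keep the 60/40 class balance of the full data
print(y_tr.value_counts())
print(y_te.value_counts())
```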

We want to fit our preprocessing steps on the training data using the `fit_transform` method, then apply them to the test data using the `transform` method. This more closely resembles what the workflow would look like if you were bringing in brand new test data.
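The fit-then-transform pattern can be sketched with a scaler on a few invented numbers. The mean and standard deviation are learned from the training rows only, and the test row is scaled with those same parameters.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical one-feature data
train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[10.0]])

# fit learns the mean and std from the training rows only
scaler = StandardScaler().fit(train)
print(scaler.mean_, scaler.scale_)

# transform reuses the training parameters on the test row
test_scaled = scaler.transform(test)
print(test_scaled)
```

Because the test row played no part in fitting, its scaled value need not be close to zero.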

First, we will subset out the categorical and numerical features separately. 

In [None]:
# Define the categorical and numerical variable names
cat_var = ['island', 'sex']
num_var = ['culmen_length_mm', 'culmen_depth_mm',
           'flipper_length_mm', 'body_mass_g']
# Subset the training data
X_train_cat = X_train[cat_var]
X_train_num = X_train[num_var]

# Subset the test data
X_test_cat = X_test[cat_var]
X_test_num = X_test[num_var]

Now, let's process the categorical data with **dummy encoding**.

In [None]:
print(X_train_cat['island'].nunique())
print(X_train_cat['sex'].nunique())

In [None]:
warnings.filterwarnings('ignore')

# Categorical feature encoding
# sparse_output=False requires scikit-learn >= 1.2; use sparse=False on older versions
dummy_e = OneHotEncoder(categories='auto', drop='first', sparse_output=False)
X_train_dummy = dummy_e.fit_transform(X_train_cat)
X_test_dummy = dummy_e.transform(X_test_cat)

# Check the shape
X_train_dummy.shape, X_test_dummy.shape

Is this the number of variables we expected?

Now, let's process the numerical data by imputing any missing values using the mean in the training set and normalizing the results.

In [None]:
# Numerical feature standardization

# Impute means for missing observations
imputer = SimpleImputer(missing_values=np.nan,
                        strategy='mean', 
                        copy=True)
X_train_imp = imputer.fit_transform(X_train_num)
X_test_imp = imputer.transform(X_test_num)

# Check for missing values
np.isnan(X_train_imp).any(), np.isnan(X_test_imp).any()

In [None]:
# normalize
norm_e = StandardScaler()
X_train_norm = norm_e.fit_transform(X_train_imp)
X_test_norm = norm_e.transform(X_test_imp)

X_train_norm.shape, X_test_norm.shape

Now that we've processed the numerical and categorical data separately, we can put the two arrays back together.

In [None]:
X_train = np.hstack((X_train_dummy, X_train_norm))
X_test = np.hstack((X_test_dummy, X_test_norm))

X_train.shape, X_test.shape
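The same steps can also be bundled into a single object. The sketch below is an alternative to the manual approach in this notebook, not the method used here: it combines `ColumnTransformer` and `Pipeline` from `sklearn` on invented toy data with a reduced set of columns.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy training and test data
toy_train = pd.DataFrame({
    "island": ["Biscoe", "Dream", "Torgersen", "Dream"],
    "sex": ["MALE", "FEMALE", "FEMALE", "MALE"],
    "body_mass_g": [3500.0, np.nan, 4200.0, 5000.0],
})
toy_test = pd.DataFrame({
    "island": ["Dream"], "sex": ["FEMALE"], "body_mass_g": [np.nan],
})

# Numeric columns: impute the mean, then z-score
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])

# Categorical columns are dummy encoded; outputs are stacked side by side
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(drop="first"), ["island", "sex"]),
     ("num", numeric_steps, ["body_mass_g"])],
    sparse_threshold=0,  # always return a dense array
)

train_out = preprocess.fit_transform(toy_train)  # fit on training data only
test_out = preprocess.transform(toy_test)        # reuse the fitted parameters

print(train_out.shape, test_out.shape)
```

The missing body mass in the test row is filled with the training mean, so it becomes 0 after scaling.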

Finally, let's save our results as separate `.csv` files, so we won't have to run the preprocessing again. We will use these files later.

First we will make them DataFrames, then add column names, and then save them as .csv files. Note that the order of column names is based on how we stacked the categorical and continuous data sets together, and how the one-hot encoding generated the dummy variables based on the island and sex categories.

In [None]:
X_train = pd.DataFrame(X_train)
X_train.columns = ['Dream','Torgersen', 'Male',
                   'culmen_length_mm', 'culmen_depth_mm',
                   'flipper_length_mm', 'body_mass_g']

X_test = pd.DataFrame(X_test)

X_test.columns = ['Dream','Torgersen', 'Male',
                   'culmen_length_mm', 'culmen_depth_mm',
                   'flipper_length_mm', 'body_mass_g']
y_train = pd.DataFrame(y_train)
y_train.columns = ['species']

y_test = pd.DataFrame(y_test)
y_test.columns = ['species']

X_train.to_csv('Data/penguins_X_train.csv')
X_test.to_csv('Data/penguins_X_test.csv')
y_train.to_csv('Data/penguins_y_train.csv')
y_test.to_csv('Data/penguins_y_test.csv')


Although we will now move on to classification, keep in mind that the choices we make in the preprocessing pipeline can strongly affect the performance of machine learning models.