# Chapter #4: Preprocessing and Pipelines

## 1. Preprocessing data

**scikit-learn requirements**
> - Recall that **scikit-learn requires numeric data, with no missing values**.
> - All the data that we have used so far has been in this format.
> - However, with **real-world data**, this will rarely be the case, and instead **we need to preprocess our data before we can build models**.

**Dealing with categorical features**
> - Say we have a dataset containing **categorical features**, such as color.
> - As these are not numeric, **scikit-learn will not accept them** and we need to convert them into **numeric features**.
> - We achieve this by **splitting the feature into multiple binary features called dummy variables**, one for each category.
> - **`0`** means the observation was **not that category**, while **`1`** means **it was**.

**Dummy variables**
> - Say we are working with a music dataset that has a genre feature with 10 values such as Electronic, Hip-Hop, and Rock.
> - We create **binary features** for each genre.
> - As each song has one genre, **each row will have a `1` in one of the 10 columns** and `0` in the rest.
> - If a song is not any of the first nine genres, **then implicitly, it is a rock song**.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img01.png">

**Dummy variables**
> - That means **we only need nine features, so we can delete the Rock column**.
> - If we do not do this, **we are duplicating information, which might be an issue for some models**.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img02.png">

**Dealing with categorical features in Python**
> - To create **dummy variables** we can use:
>> - **scikit-learn's** `OneHotEncoder()`,
>> - **pandas'** `get_dummies()`.
> - We will use **`get_dummies()`**.

**Music dataset**
> - We will be working with a **music dataset** in this chapter, for both **classification** and **regression** problems.
> - Initially, we will build a **regression model** using all features in the dataset to **predict song `popularity`**.
> - There is one categorical feature, **`genre`**, with **ten possible values**.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img03.png">

**EDA w/ categorical feature**
> - This box plot shows how **`popularity` varies by `genre`**.
> - Let's encode this feature using **dummy variables**.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img04.png">

**Encoding dummy variables**
> - We **import** pandas, **read** in the DataFrame, and **call** `pd.get_dummies()`, **passing** the categorical column.
> - As we only need to keep 9 out of our 10 binary features, we can **set** the `drop_first` argument to `True`.
> - Printing the first five rows, we see **pandas creates 9 new binary features**.
> - The first song is Jazz, and the second is Rap, indicated by a `1` in the respective columns.
> - To bring these binary features back into our original DataFrame we can **use** `pd.concat()`, **passing** a list containing the `music_df` DataFrame and our `music_dummies` DataFrame, and setting `axis` equal to `1`.
> - Lastly, we can **remove** the original `genre` column using `.drop()`, **passing** the column, and setting `axis` equal to `1`.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img05.png">
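
The steps above can be sketched as follows. The four-row `music_df` here is a hypothetical stand-in for the real dataset, and `dtype=int` is an assumption added so the dummies print as `0`/`1` rather than booleans in recent pandas versions.

```python
import pandas as pd

# Hypothetical miniature stand-in for music_df.
music_df = pd.DataFrame({
    "popularity": [41.0, 62.0, 42.0, 64.0],
    "tempo": [102.6, 173.9, 145.1, 120.4],
    "genre": ["Jazz", "Rap", "Electronic", "Rock"],
})

# Encode only the genre column; the first category (Electronic) is dropped.
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True, dtype=int)

# Attach the dummy columns side by side, then remove the original column.
music_dummies = pd.concat([music_df, music_dummies], axis=1)
music_dummies = music_dummies.drop("genre", axis=1)

print(music_dummies)
```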

**Encoding dummy variables**
> - If the DataFrame only has one categorical feature, we can **pass** the entire DataFrame, thus skipping the step of combining variables.
> - If we **don't specify a column**, the new DataFrame's binary columns will have the original feature name **prefixed**, so they will start with `genre_`, as shown here.
> - Notice the original genre column is **automatically dropped**.
> - Once we have dummy variables, we can **fit** models as before.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img07.png">
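
A minimal sketch of the whole-DataFrame shortcut, again on a hypothetical four-row stand-in for `music_df`. Numeric columns pass through untouched, and only the text column is encoded.

```python
import pandas as pd

# Hypothetical miniature stand-in for music_df.
music_df = pd.DataFrame({
    "popularity": [41.0, 62.0, 42.0, 64.0],
    "tempo": [102.6, 173.9, 145.1, 120.4],
    "genre": ["Jazz", "Rap", "Electronic", "Rock"],
})

# Passing the whole DataFrame encodes every non-numeric column,
# prefixes the new columns with the feature name, and drops the original.
music_dummies = pd.get_dummies(music_df, drop_first=True, dtype=int)

print(music_dummies.columns.tolist())
```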

**Linear regression with dummy variables**
> - Using the `music_dummies` DataFrame, the process for creating **training** and **test sets** remains unchanged.
> - To perform **cross-validation** we then **create** a `KFold()` object, **instantiate** a linear regression model, and **call** `cross_val_score()`.
> - We **set** scoring equal to `neg_mean_squared_error`, which **returns the negative MSE**.
> - This is because scikit-learn's cross-validation metrics presume a **higher score is better**, so **MSE** is changed to **negative MSE** to counteract this.
> - We can **calculate** the training RMSE by **converting the scores to positive and taking the square root**, achieved by **calling** `np.sqrt()` and **passing** our scores with a **minus sign** in front.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img08.png">
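
The cross-validation steps above might look like the sketch below. The data is synthetic (an assumption made so the block is self-contained), and the variable names are illustrative rather than taken from the slides.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic regression data standing in for the music features (assumption).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.5, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

kf = KFold(n_splits=5, shuffle=True, random_state=42)
linreg = LinearRegression()

# One negative MSE value per fold (higher, i.e. closer to zero, is better).
neg_mse = cross_val_score(
    linreg, X_train, y_train, cv=kf, scoring="neg_mean_squared_error"
)

# Flip the sign, then take the square root to get the RMSE per fold.
rmse = np.sqrt(-neg_mse)
print(rmse)
```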

### 1.1. Creating dummy variables

Being able to **include categorical features in the model building process** can enhance performance as they may add information that contributes to prediction accuracy.

The `music_df` dataset has been preloaded for you, and its shape is printed. Also, `pandas` has been imported as `pd`.

Now you will create a new DataFrame containing the original columns of `music_df` plus dummy variables from the `"genre"` column.

- Set up the workspace.

In [1]:
import pandas as pd
music_df = pd.read_csv("./datasets/music.csv").drop(columns="Unnamed: 0")
music_df.head()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre
0,41.0,0.644,0.823,236533.0,0.814,0.687,0.117,-5.611,0.177,102.619,0.649,Jazz
1,62.0,0.0855,0.686,154373.0,0.67,0.0,0.12,-7.626,0.225,173.915,0.636,Rap
2,42.0,0.239,0.669,217778.0,0.736,0.000169,0.598,-3.223,0.0602,145.061,0.494,Electronic
3,64.0,0.0125,0.522,245960.0,0.923,0.017,0.0854,-4.56,0.0539,120.406497,0.595,Rock
4,60.0,0.121,0.78,229400.0,0.467,0.000134,0.314,-6.645,0.253,96.056,0.312,Rap


- Use a relevant function, passing the entire `music_df` DataFrame, to create `music_dummies`, dropping the first binary column.

In [2]:
music_dummies = pd.get_dummies(music_df, drop_first=True).rename(columns=lambda x: x.lower())
music_dummies.head()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre_anime,genre_blues,genre_classical,genre_country,genre_electronic,genre_hip-hop,genre_jazz,genre_rap,genre_rock
0,41.0,0.644,0.823,236533.0,0.814,0.687,0.117,-5.611,0.177,102.619,0.649,0,0,0,0,0,0,1,0,0
1,62.0,0.0855,0.686,154373.0,0.67,0.0,0.12,-7.626,0.225,173.915,0.636,0,0,0,0,0,0,0,1,0
2,42.0,0.239,0.669,217778.0,0.736,0.000169,0.598,-3.223,0.0602,145.061,0.494,0,0,0,0,1,0,0,0,0
3,64.0,0.0125,0.522,245960.0,0.923,0.017,0.0854,-4.56,0.0539,120.406497,0.595,0,0,0,0,0,0,0,0,1
4,60.0,0.121,0.78,229400.0,0.467,0.000134,0.314,-6.645,0.253,96.056,0.312,0,0,0,0,0,0,0,1,0


- Print the shape of `music_dummies`.

In [3]:
music_dummies.shape

(1000, 20)

### 1.2. Regression with categorical features

Now that you have created `music_dummies`, containing binary features for each song's genre, it's time to **build a ridge regression model to predict song popularity**.

`music_dummies` has been preloaded for you, along with `Ridge`, `cross_val_score`, `numpy` as `np`, and a `KFold()` object stored as `kf`.

The model will be evaluated by calculating the average RMSE, but first, you will need to convert the scores for each fold to positive values and take their square root. This metric shows the average error of our model's predictions, so it can be compared against the standard deviation of the target value, `"popularity"`.

- Set up the workspace.

In [4]:
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import Ridge

In [5]:
music_dummies.head()

Unnamed: 0,popularity,acousticness,danceability,duration_ms,energy,instrumentalness,liveness,loudness,speechiness,tempo,valence,genre_anime,genre_blues,genre_classical,genre_country,genre_electronic,genre_hip-hop,genre_jazz,genre_rap,genre_rock
0,41.0,0.644,0.823,236533.0,0.814,0.687,0.117,-5.611,0.177,102.619,0.649,0,0,0,0,0,0,1,0,0
1,62.0,0.0855,0.686,154373.0,0.67,0.0,0.12,-7.626,0.225,173.915,0.636,0,0,0,0,0,0,0,1,0
2,42.0,0.239,0.669,217778.0,0.736,0.000169,0.598,-3.223,0.0602,145.061,0.494,0,0,0,0,1,0,0,0,0
3,64.0,0.0125,0.522,245960.0,0.923,0.017,0.0854,-4.56,0.0539,120.406497,0.595,0,0,0,0,0,0,0,0,1
4,60.0,0.121,0.78,229400.0,0.467,0.000134,0.314,-6.645,0.253,96.056,0.312,0,0,0,0,0,0,0,1,0


In [6]:
kf = KFold(n_splits=5, shuffle=True, random_state=42)

- Create `X`, containing all features in `music_dummies`, and `y`, consisting of the `"popularity"` column, respectively.

In [7]:
X = music_dummies.drop(columns="popularity").values
y = music_dummies["popularity"].values

- Instantiate a `ridge` regression model, setting `alpha` equal to `0.2`.

In [8]:
ridge = Ridge(alpha=0.2)

- Perform cross-validation on `X` and `y` using the `ridge` model, setting `cv` equal to `kf`, and using negative mean squared error as the scoring metric.

In [9]:
scores = cross_val_score(ridge, X, y, cv=kf, scoring="neg_mean_squared_error")

- Print the RMSE values by converting negative `scores` to positive and taking the square root.

In [10]:
rmse = np.sqrt(-scores)

In [11]:
print(f"Average RMSE: {np.mean(rmse):.2f}")
print(f"Standard deviation of the target array: {np.std(y):.2f}")

Average RMSE: 8.24
Standard deviation of the target array: 14.02


## 2. Handling missing data

**Missing data**
> - When there is **no value for a feature in a particular row**, we call it **missing data**.
> - This can happen because there was no observation, or because the data is corrupt.
> - Whatever the reason, we need to deal with it.

**Music dataset**
> - Previously we worked with a modified music dataset.
> - Now let's inspect the **original version**, which contains **1,000 rows**.
> - We do this by chaining pandas' `.isna()` with `.sum()` and `.sort_values()`.
> - Each feature is missing between 8 and 200 values!
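
As a small illustration of the missing-value count described above, here is a sketch on a hypothetical four-row DataFrame (the real dataset is not reproduced here).

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with a few gaps.
toy = pd.DataFrame({
    "popularity": [41.0, np.nan, 42.0, 64.0],
    "tempo": [102.6, 173.9, np.nan, np.nan],
    "genre": ["Jazz", "Rap", None, "Rock"],
})

# isna() marks missing cells, sum() counts them per column,
# and sort_values() orders the columns from fewest to most missing.
missing_counts = toy.isna().sum().sort_values()
print(missing_counts)
```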

**Dropping missing data**
> - A common approach is to **remove observations with missing values in columns where they account for less than 5% of all data**.
> - To do this, we use pandas' `.dropna()` method, passing a list of the columns with less than 5% missing values to the `subset` argument.
> - If a row has a missing value in any of the `subset` columns, the entire row is removed.
> - Rechecking the DataFrame, we see fewer missing values.
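
One way to apply the 5% rule is sketched below on synthetic data. Selecting the columns programmatically from the missing fraction is an assumption made for this example; the columns can equally be listed by hand.

```python
import numpy as np
import pandas as pd

# Synthetic frame: 100 rows, "tempo" missing 2% and "loudness" missing 20%.
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "tempo": rng.normal(120, 10, 100),
    "loudness": rng.normal(-6, 2, 100),
})
toy.loc[[3, 50], "tempo"] = np.nan
toy.loc[10:29, "loudness"] = np.nan  # label slice, inclusive: 20 rows

# Fraction of missing values per column.
missing_frac = toy.isna().mean()

# Columns with some, but less than 5%, missing values.
sparse_cols = missing_frac[(missing_frac > 0) & (missing_frac < 0.05)].index.tolist()

# Drop only the rows that are missing a value in those columns.
cleaned = toy.dropna(subset=sparse_cols)

print(sparse_cols)
print(cleaned.isna().sum())
```

The `loudness` column keeps its gaps here, which is the case that imputation is meant to handle.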

**Imputing values**
> - Another option is to **impute** missing data.
> - This means making an **educated guess** as to what the missing values could be.
> - We can impute the **mean** of all non-missing entries for a given feature.
> - We can also use other values like the **median**.
> - For categorical values we commonly impute the **most frequent value**.
> - Note that we must **split our data before imputing** to avoid leaking test set information to our model, a problem known as **data leakage**.

**Imputation with scikit-learn**
> - Here is a workflow for imputation to predict song popularity.
> - We import `SimpleImputer` from `sklearn.impute`.
> - As we will use different imputation methods for categorical and numeric features, we first split them, storing them as `X_cat` and `X_num` respectively, along with our target array as `y`.
> - We create categorical training and test sets, then repeat this for the numeric features.
> - By using the same value for the `random_state` argument, both splits select the same rows, so the target arrays' values remain unchanged.
> - To impute missing categorical values we instantiate a `SimpleImputer`, setting `strategy` to `"most_frequent"`.
> - By default, `SimpleImputer` expects `np.nan` to represent missing values.
> - Now we call `.fit_transform()` to impute the training categorical features' missing values.
> - For the test categorical features, we call `.transform()`.

**Imputation with scikit-learn**
> - For our numeric data, we instantiate another imputer.
> - By default, it fills values with the **mean**.
> - We fit and transform the training features, and transform the test features.
> - We then combine our training data using `np.append()`, passing our two arrays, and set `axis` equal to `1`.
> - We repeat this for our test data.
> - Due to their ability to transform our data, imputers are known as **transformers**.
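
The imputation workflow described above can be sketched as follows. The ten-row arrays are hypothetical stand-ins for the music features, and the `random_state` value is arbitrary. The block is a sketch under those assumptions rather than the exact code from the slides.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical toy data: one categorical feature, one numeric feature, one target.
genre = np.array(
    ["Jazz", "Rap", np.nan, "Rock", "Rap", "Rap", "Jazz", np.nan, "Rock", "Rap"],
    dtype=object,
)
tempo = np.array([102.6, np.nan, 145.1, 120.4, 96.1, 110.0, np.nan, 130.2, 99.5, 140.0])
y = np.array([41.0, 62.0, 42.0, 64.0, 60.0, 55.0, 48.0, 51.0, 66.0, 58.0])

X_cat = genre.reshape(-1, 1)
X_num = tempo.reshape(-1, 1)

# Same random_state in both calls, so both splits select the same rows.
X_train_cat, X_test_cat, y_train, y_test = train_test_split(
    X_cat, y, test_size=0.2, random_state=12
)
X_train_num, X_test_num, y_train, y_test = train_test_split(
    X_num, y, test_size=0.2, random_state=12
)

# Categorical: fill with the most frequent training value.
imp_cat = SimpleImputer(strategy="most_frequent")
X_train_cat = imp_cat.fit_transform(X_train_cat)
X_test_cat = imp_cat.transform(X_test_cat)

# Numeric: the default strategy fills with the training mean.
imp_num = SimpleImputer()
X_train_num = imp_num.fit_transform(X_train_num)
X_test_num = imp_num.transform(X_test_num)

# Recombine the numeric and categorical columns side by side.
X_train = np.append(X_train_num, X_train_cat, axis=1)
X_test = np.append(X_test_num, X_test_cat, axis=1)

print(X_train)
```

Both imputers are fitted on the training split only and then applied to the test split, which is what keeps test set information out of the fill values.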

**Imputing within a pipeline**
> - We can also impute using a **pipeline**, which is an object used to run a series of transformations and build a model in a single workflow.
> - To do this, we import `Pipeline` from `sklearn.pipeline`.
> - Here we perform **binary classification** to predict whether a song is rock or another genre.
> - We drop missing values accounting for less than 5% of our data.
> - We convert values in the `genre` column, which will be the target, to `1` if Rock, else `0`, using `np.where()`.
> - We then create `X` and `y`.
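
The target conversion with `np.where()` might look like this on a hypothetical four-row DataFrame. The column names mirror the music dataset, but the values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame.
toy = pd.DataFrame({
    "tempo": [120.4, 102.6, 131.0, 173.9],
    "genre": ["Rock", "Jazz", "Rock", "Rap"],
})

# 1 where the genre is Rock, 0 for every other genre.
toy["genre"] = np.where(toy["genre"] == "Rock", 1, 0)

# Features are everything except the target column.
X = toy.drop("genre", axis=1).values
y = toy["genre"].values

print(y)
```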

**Imputing within a pipeline**
> - To build a pipeline we construct a **list of steps**, where each step is a tuple pairing a step name, given as a string, with an instantiated transformer or model.
> - We pass this list when instantiating a `Pipeline`.
> - We then split our data, and fit the pipeline to the training data, as with any other model.
> - Finally, we compute accuracy.
> - Note that, in a pipeline, **each step but the last must be a transformer**.
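
A minimal pipeline sketch under stated assumptions: the data is synthetic, the step names are arbitrary, and `LogisticRegression` is chosen here as the final model since the text above does not name one.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic binary classification data (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Knock out roughly 10% of the feature values to simulate missing data.
mask = rng.random(X.shape) < 0.1
X[mask] = np.nan

# Each step is a (name, estimator) tuple; every step but the last is a transformer.
steps = [
    ("imputation", SimpleImputer()),
    ("logistic_regression", LogisticRegression()),
]
pipeline = Pipeline(steps)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# Fitting imputes the training data and then trains the model in one call.
pipeline.fit(X_train, y_train)

# score() imputes the test data with the training means, then reports accuracy.
accuracy = pipeline.score(X_test, y_test)
print(accuracy)
```

Because the imputer lives inside the pipeline, it is fitted only on the training split, which avoids the data leakage concern raised earlier.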