# Chapter #4: Preprocessing and Pipelines

## 1. Preprocessing data

**scikit-learn requirements**
> - Recall that **scikit-learn requires numeric data, with no missing values**.
> - All the data that we have used so far has been in this format.
> - However, with **real-world data**, this will rarely be the case, and instead **we need to preprocess our data before we can build models**.

**Dealing with categorical features**
> - Say we have a dataset containing **categorical features**, such as color.
> - As these are not numeric, **scikit-learn will not accept them** and we need to convert them into **numeric features**.
> - We achieve this by **splitting the feature into multiple binary features called dummy variables**, one for each category.
> - **`0`** means the observation was **not that category**, while **`1`** means **it was**.

**Dummy variables**
> - Say we are working with a music dataset that has a genre feature with 10 values such as Electronic, Hip-Hop, and Rock.
> - We create **binary features** for each genre.
> - As each song has one genre, **each row will have a `1` in one of the 10 columns** and `0` in the rest.
> - If a song is not any of the first nine genres, **then implicitly, it is a rock song**.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img01.png">

**Dummy variables**
> - That means **we only need nine features, so we can delete the Rock column**.
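> - If we do not do this, **we are duplicating information, which might be an issue for some models**, such as linear regression, where the redundant column is perfectly collinear with the others and the coefficients are no longer uniquely determined.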

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img02.png">
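
The idea of dropping one redundant column can be sketched on a toy example. The data below is hypothetical (three genres rather than ten) and is not the chapter's dataset; it only illustrates that a row of all zeros implies the dropped category.

```python
import pandas as pd

# Hypothetical toy data: five songs, three genres (not the chapter's dataset).
genres = pd.Series(["Rock", "Jazz", "Rap", "Jazz", "Rock"], name="genre")

# One binary column per category: every row contains exactly one 1.
full = pd.get_dummies(genres, dtype=int)
print(full)

# Dropping the Rock column loses no information:
# a row of all zeros now implies the song is Rock.
reduced = full.drop(columns="Rock")
print(reduced)
```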

**Dealing with categorical features in Python**
> - To create **dummy variables** we can use:
>> - **scikit-learn's** `OneHotEncoder()`,
>> - **pandas'** `get_dummies()`.
> - We will use **`get_dummies()`**.
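
For comparison, here is a minimal sketch of the scikit-learn alternative. The toy array is hypothetical, and the choice of `drop="first"` is an assumption made to mirror dropping one redundant column; the rest of the chapter uses `get_dummies()` instead.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data; the encoder expects a 2D array (n_samples, n_features).
genres = np.array([["Rock"], ["Jazz"], ["Rap"], ["Jazz"]])

# drop="first" removes the first category of each feature (here: Jazz).
encoder = OneHotEncoder(drop="first")

# By default the result is a sparse matrix, so convert it for printing.
encoded = encoder.fit_transform(genres).toarray()

print(encoder.categories_)  # categories found in the data, in sorted order
print(encoded)              # remaining columns correspond to Rap and Rock
```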

**Music dataset**
> - We will be working with a **music dataset** in this chapter, for both **classification** and **regression** problems.
> - Initially, we will build a **regression model** using all features in the dataset to **predict song `popularity`**.
> - There is one categorical feature, **`genre`**, with **ten possible values**.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img03.png">

**EDA w/ categorical feature**
> - This box plot shows how **`popularity` varies by `genre`**.
> - Let's encode this feature using **dummy variables**.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img04.png">

**Encoding dummy variables**
> - We **import** pandas, **read** in the DataFrame, and **call** `pd.get_dummies()`, **passing** the categorical column.
> - As we only need to keep 9 out of our 10 binary features, we can **set** the `drop_first` argument to `True`.
> - Printing the first five rows, we see **pandas creates 9 new binary features**.
> - The first song is Jazz, and the second is Rap, indicated by a `1` in the respective columns.
> - To bring these binary features back into our original DataFrame we can **use** `pd.concat()`, **passing** a list containing the `music_df` DataFrame and our `music_dummies` DataFrame, and setting `axis` equal to `1`.
> - Lastly, we can **remove** the original `genre` column using `.drop()`, **passing** the column, and setting `axis` equal to `1`.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img05.png">
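
The steps above can be sketched as follows. The small `music_df` built here is a hypothetical stand-in with made-up values and only four genres; in the chapter the DataFrame is read from a file, whose name is not given in these notes.

```python
import pandas as pd

# Hypothetical stand-in for the chapter's music dataset (values are made up).
music_df = pd.DataFrame({
    "popularity": [41, 62, 59, 54],
    "loudness": [-7.1, -5.3, -4.8, -6.0],
    "genre": ["Jazz", "Rap", "Electronic", "Rock"],
})

# Encode only the categorical column; drop_first=True drops the first
# category in sorted order (Electronic in this toy data).
music_dummies = pd.get_dummies(music_df["genre"], drop_first=True)
print(music_dummies.head())

# Join the binary features to the original columns, side by side.
music_dummies = pd.concat([music_df, music_dummies], axis=1)

# Remove the original text column, which is now redundant.
music_dummies = music_dummies.drop("genre", axis=1)
print(music_dummies.columns.tolist())
```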

**Encoding dummy variables**
> - If the DataFrame only has one categorical feature, we can **pass** the entire DataFrame, thus skipping the step of combining variables.
> - If we **don't specify a column**, the new DataFrame's binary columns will have the original feature name **prefixed**, so they will start with `genre_`, as shown here.
> - Notice the original genre column is **automatically dropped**.
> - Once we have dummy variables, we can **fit** models as before.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img07.png">
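
A minimal sketch of this shortcut, again using a hypothetical stand-in for `music_df` with made-up values:

```python
import pandas as pd

# Hypothetical stand-in for the chapter's music dataset (values are made up).
music_df = pd.DataFrame({
    "popularity": [41, 62, 59, 54],
    "loudness": [-7.1, -5.3, -4.8, -6.0],
    "genre": ["Jazz", "Rap", "Electronic", "Rock"],
})

# Passing the whole DataFrame encodes the text column in one step.
# The new columns are prefixed with "genre_" and the original column is dropped.
music_dummies = pd.get_dummies(music_df, drop_first=True)
print(music_dummies.columns.tolist())
```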

**Linear regression with dummy variables**
> - Using the `music_dummies` DataFrame, the process for creating **training** and **test sets** remains unchanged.
> - To perform **cross-validation** we then **create** a `KFold()` object, **instantiate** a linear regression model, and **call** `cross_val_score()`.
> - We **set** `scoring` equal to `"neg_mean_squared_error"`, which **returns the negative MSE**.
> - This is because scikit-learn's cross-validation metrics presume a **higher score is better**, so **MSE** is changed to **negative MSE** to counteract this.
> - We can **calculate** the training RMSE by **converting the scores to positive and taking the square root**, achieved by **calling** `np.sqrt()` and **passing** our scores with a **minus sign** in front.

<img style="margin-left: auto; margin-right: auto;" src="./assets/ch04_01_preprocessing_data_img08.png">
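
The cross-validation workflow can be sketched as below. To keep the example self-contained it uses synthetic data in place of the `music_dummies` features and the `popularity` target; the split size, number of folds, and `random_state` values are assumptions rather than the chapter's settings.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in for the features and the popularity target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Creating training and test sets works exactly as with any numeric data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Cross-validate a linear regression on the training set.
kf = KFold(n_splits=5, shuffle=True, random_state=42)
linreg = LinearRegression()
linreg_cv = cross_val_score(
    linreg, X_train, y_train, cv=kf, scoring="neg_mean_squared_error"
)

# Scores are negative MSE values: flip the sign, then take the square root.
rmse = np.sqrt(-linreg_cv)
print(rmse)
```

The result holds one RMSE value per fold, so `rmse.mean()` gives a single summary figure.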