
# Transformers & Preprocessing

---

### Learning Objectives
- Understand which scikit-learn transformers to use and how to use them
- Fill missing values using SimpleImputer
- Encode categorical features with OneHotEncoder
- Standardize features with StandardScaler
- Add new features with PolynomialFeatures
- Reduce the number of features with RFE 


You have future lessons on feature engineering and interpretation. This lesson is designed to give you familiarity with scikit-learn transformers and the tools you need to create models that perform well.

## Imports

In [1]:
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
import pandas as pd

from sklearn import __version__
__version__

'1.1.2'

#### Load modified tips dataset

One waiter's tips. Data dictionary [here](https://vincentarelbundock.github.io/Rdatasets/doc/reshape2/tips.html).

In [139]:
tips = pd.read_csv('data/tips_miss.csv', index_col=0)

#### Peek and get info

In [140]:
tips.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 244 entries, 0 to 243
Data columns (total 7 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   total_bill  220 non-null    float64
 1   tip         244 non-null    float64
 2   sex         220 non-null    object 
 3   smoker      220 non-null    object 
 4   day         220 non-null    object 
 5   time        220 non-null    object 
 6   size        220 non-null    float64
dtypes: float64(3), object(4)
memory usage: 15.2+ KB


In [141]:
tips.head(2)

| | total_bill | tip | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|---|
| 0 | 16.99 | 1.01 | Female | No | Sun | Dinner | 2.0 |
| 1 | 10.34 | 1.66 | Male | No | Sun | Dinner | 3.0 |


#### Set up X and y

In [142]:
#just use total_bill for now


#### train_test_split

In [143]:
from sklearn.model_selection import train_test_split

In [144]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

From now on, you deal with the training data.
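
As a rough sketch of the setup and split steps above, the cell below builds a tiny made-up frame in place of `tips_miss.csv` (the values are illustrative, not the real data) and uses `total_bill` as the only feature and `tip` as the target, which is an assumption based on the cell comment.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy stand-in for the tips data (values are made up for illustration)
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, np.nan, 23.68, 24.59, 25.29, 8.77, 26.88],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71, 2.00, 3.12],
})

# Double brackets keep X two-dimensional (a DataFrame), which scikit-learn expects
X = toy[["total_bill"]]
y = toy["tip"]

# The default test_size is 0.25, so 8 rows split into 6 train and 2 test
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)
print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so later cells see the same training rows each time the notebook runs.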

## Rule 1: the test set is off limits 
Don't do anything to it that you couldn't do to new data.
For example, don't one-hot encode a value that shows up only in the test set.

### Let's jump into transformations you can do to your features.

---

## 1. Fill missing values 



You often want to deal with missing data early so you can do other preprocessing. Dropping rows can be fine if you have a lot of other data. The same goes for dropping columns if most values are missing and there is not much unique signal in the column. How might you know if a column has little signal in it?

If you can figure out why the values are missing, you might want to fill the values accordingly. For example, maybe people who didn't respond to a survey question about owning a car don't own a car.

Often, though, you don't know why the data are missing.

#### Options:

- If continuous numeric data, fill with the mean, median, mode or a constant you choose.
- If nominal categorical data, fill with the mode or a constant you choose.

This is called _imputing_ missing values. scikit-learn's SimpleImputer can help us.

(Ignore forward or backward filling time series data and adding sentinel values for non-linear algorithms for now).

All of these options reduce the variance in your data, so they are not ideal.

All scikit-learn transformers should be fit on the training data and transform the training data. They should ONLY transform the test data. Remember Rule 1! 😀

In [146]:
from sklearn.impute import SimpleImputer

In [147]:
#instantiate


#### Fit on X_train

In [2]:
#fit on X_train


#### Transform X_train and save the result

In [149]:
#assign as X_train_filled


#### Transform (no fit) X_test and save the result
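
The instantiate, fit, transform sequence above could look something like the following minimal sketch. The small arrays are invented for illustration, and `strategy="mean"` is just one choice.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up single-feature data with one missing value in each split
X_train = np.array([[10.0], [20.0], [np.nan], [30.0]])
X_test = np.array([[np.nan], [100.0]])

# Instantiate
imputer = SimpleImputer(strategy="mean")

# Fit on the training data only: this learns the training mean
imputer.fit(X_train)

# Transform train and test with the statistic learned from train
X_train_filled = imputer.transform(X_train)
X_test_filled = imputer.transform(X_test)

print(imputer.statistics_)
print(X_test_filled)
```

The missing test value is filled with the training mean, not a value computed from the test set, which is what Rule 1 asks for.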

#### `strategy='most_frequent'` will work on non-numeric columns. `strategy='mean'` won't. ⚠️

Check out other SimpleImputer options.
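
To make the point about non-numeric columns concrete, here is a small sketch on an invented two-column frame. It compares `most_frequent`, `constant` (with a hypothetical fill label `"missing"`), and `mean`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Made-up categorical columns with missing entries
cats = pd.DataFrame({
    "day": ["Sun", "Sun", np.nan, "Fri"],
    "time": ["Dinner", np.nan, "Dinner", "Lunch"],
})

# most_frequent fills each column with its mode
mode_imputer = SimpleImputer(strategy="most_frequent")
filled = mode_imputer.fit_transform(cats)

# constant fills with a label you choose ("missing" is an arbitrary example)
const_imputer = SimpleImputer(strategy="constant", fill_value="missing")
flagged = const_imputer.fit_transform(cats)

# mean is undefined for strings, so fitting raises a ValueError
try:
    SimpleImputer(strategy="mean").fit(cats)
    mean_failed = False
except ValueError:
    mean_failed = True

print(filled)
print(flagged)
print(mean_failed)
```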

Iterative imputing, in which an algorithm is fit to the data that is not missing, is likely to create values that help your model perform better. This process can be slow. IterativeImputer is an experimental class in scikit-learn as of this writing. 

KNNImputer often performs better than SimpleImputer. It can also be a little slow. You learned about KNN classification, and this transformer is similar.

You can evaluate different missing value strategies. GridSearching with Pipelines makes this process much easier, so we'll put it off until we see those techniques.

Adding a column to indicate that a value was missing (a missing indicator) does not appear to help model performance, in most cases. This is an option with most imputation transformers.

⚠️ Interpretation becomes a bit tricky when you create data. Just note what you did.

### Always communicate how you treated missing data!

In [6]:
from sklearn.experimental import enable_iterative_imputer

In [7]:
from sklearn.impute import KNNImputer, IterativeImputer
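
The following is a hedged sketch of the alternatives mentioned above, run on a tiny invented array where the second column is roughly twice the first. The parameter choices (`n_neighbors=2`, `random_state=0`, `strategy="median"`) are illustrative assumptions, not recommendations.

```python
import numpy as np
# The experimental import must run before IterativeImputer can be imported
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Made-up data: column 1 is twice column 0, with one value missing
X_train = np.array([[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, np.nan], [5.0, 10.0]])

# KNNImputer averages the feature over the nearest rows that have it
knn = KNNImputer(n_neighbors=2)
X_knn = knn.fit_transform(X_train)

# IterativeImputer models each feature with missing values from the others
iterative = IterativeImputer(random_state=0)
X_iter = iterative.fit_transform(X_train)

# add_indicator appends a 0/1 column marking where values were missing
flagging = SimpleImputer(strategy="median", add_indicator=True)
X_flagged = flagging.fit_transform(X_train)

print(X_knn[3], X_iter[3], X_flagged[3])
```

As with `SimpleImputer`, each of these would be fit on the training data and then used only to transform the test data.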

---
## 2. Encode categorical features



Our data generally needs to be numeric. If your data is nominal categorical data, one-hot encoding (dummy encoding) is the most common method.

We generally don't want to encode a column into numeric data before splitting it because that would violate Rule 1. 

If there were 50 categories and some were rare, our model might see one in the real world that it had never seen before. That might make our model perform worse in the real world (assuming that feature is important). We don't want our model to give us test set results that are overly optimistic.

Generally, if there aren't any values that show up only a few times in a column, you can one-hot encode your columns before creating a test set, and not worry about overstating your test set scores.

In [152]:
X = tips.drop('tip', axis = 1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

In [153]:
X_train.head()

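| | total_bill | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 203 | 16.4 | Female | Yes | Thur | Lunch | 2.0 |
| 220 | 12.16 | Male | Yes | Fri | NaN | NaN |
| 151 | 13.13 | Male | No | Sun | Dinner | 2.0 |
| 54 | 25.56 | Male | No | Sun | Dinner | 4.0 |
| 51 | 10.29 | Female | No | Sun | Dinner | NaN |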


#### Instantiate, fit and transform X_train, transform (no fit) X_test

In [154]:
#still need to fill the missing values
imputer = SimpleImputer(strategy='most_frequent')

In [155]:
#fit and fill training data


In [156]:
#transform test


In [8]:
#check to see all have been removed


In [9]:
#examine the .statistics_


In [159]:
from sklearn.preprocessing import OneHotEncoder

In [160]:
#instantiate


In [161]:
#fit and transform


array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 1., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [162]:
X_train.head(2)

| | total_bill | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 203 | 16.4 | Female | Yes | Thur | Lunch | 2.0 |
| 220 | 12.16 | Male | Yes | Fri | NaN | NaN |


In [163]:
ohe.get_feature_names_out()[:10]

array(['x0_5.75', 'x0_7.25', 'x0_7.51', 'x0_8.35', 'x0_8.51', 'x0_8.52',
       'x0_8.58', 'x0_8.77', 'x0_9.68', 'x0_9.94'], dtype=object)

In [164]:
X_train_filled = pd.DataFrame(X_train_filled, columns = X.columns)
X_test_filled = pd.DataFrame(X_test_filled, columns = X.columns)

In [165]:
X_train_filled.head(2)

| | total_bill | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 0 | 16.4 | Female | Yes | Thur | Lunch | 2.0 |
| 1 | 12.16 | Male | Yes | Fri | Dinner | 2.0 |


In [166]:
X_test_filled.head(2)

| | total_bill | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 0 | 18.71 | Male | No | Thur | Lunch | 3.0 |
| 1 | 38.07 | Male | No | Sun | Dinner | 3.0 |


## `make_column_transformer`
If we want to apply a transformation to only some of our X columns, we need to specify which columns with `make_column_transformer`.

In [167]:
from sklearn.compose import make_column_transformer

In [168]:
ohe = OneHotEncoder(sparse = False, drop = 'if_binary')

In [169]:
# ohe on ['sex', 'smoker', 'day', 'time']


In [170]:
#fit on filled


In [171]:
#transform test


### Convert to a DataFrame

In [173]:
X_train_encoded = pd.DataFrame(X_train_encoded, columns = encoder.get_feature_names_out())
X_train_encoded.head(2)

| | sex_Male | smoker_Yes | day_Fri | day_Sat | day_Sun | day_Thur | time_Lunch | total_bill | size |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 16.4 | 2.0 |
| 1 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 12.16 | 2.0 |


---
## 3. StandardScaler
You've seen how to make sure each feature has a mean of 0 and a standard deviation of 1.

It's a good idea to standardize the features for any model that uses regularization. Then one feature with large values won't overwhelm other features with small values.





Here is a [post on standardizing and scaling options](https://towardsdatascience.com/scale-standardize-or-normalize-with-scikit-learn-6ccc7d176a02?sk=a82c5faefadd171fe07506db4d4f29db).

If a feature doesn't look very normal after standard scaling, you could try `QuantileTransformer(output_distribution='normal')` to make the distribution more normal.

In [175]:
from sklearn.preprocessing import StandardScaler

#### Instantiate, fit and transform X_train, transform (no fit) X_test
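
A minimal sketch of this step on invented numbers follows. The variable name `sscaler` mirrors the name used later in the notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: two features on very different scales
X_train = np.array([[10.0, 1.0], [20.0, 2.0], [30.0, 3.0]])
X_test = np.array([[40.0, 2.0]])

# Instantiate, then learn each column's mean and standard deviation from train
sscaler = StandardScaler()
X_train_scaled = sscaler.fit_transform(X_train)

# Reuse the training mean and standard deviation on the test data
X_test_scaled = sscaler.transform(X_test)

print(sscaler.mean_)
print(X_train_scaled.mean(axis=0), X_train_scaled.std(axis=0))
print(X_test_scaled)
```

The scaled test data will generally not have a mean of exactly 0 or a standard deviation of exactly 1, because the statistics come from the training set.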

---
## 4. PolynomialFeatures



You've seen how to add interactions and polynomials to create more features. This can help capture non-linear relationships for a regression model.

Features *a* and *b* expand into features `1, a, b, a^2, ab, b^2`

Watch out for a feature explosion! 🧨

Let's do this now so we can then see how to reduce the number of features later.

In [178]:
from sklearn.preprocessing import PolynomialFeatures

#### Instantiate, fit and transform X_train, transform (only) X_test

In [10]:
#instantiate


In [None]:
#transform train data


In [None]:
#transform test data


---
## 5. Feature Elimination with RFE



You can drop features manually, but that's not ideal if you have lots and lots of features. 

If you want to try out a model with fewer features you can automatically drop what are probably the least useful features.

RFE stands for *Recursive Feature Elimination*. It takes an estimator and the number or proportion of features to select. It fits the estimator, drops the feature with the smallest coefficient magnitude (or the lowest feature importance for models that don't have coefficients), and repeats until the requested number of features remains.

You have to pass it the estimator to use. If the estimator works better when you have more observations than features, as linear regression does, keep that in mind when choosing it.

In [183]:
from sklearn.feature_selection import RFE 
from sklearn.linear_model import LinearRegression

#### Instantiate, fit and transform X_train, transform X_test
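
Here is a sketch of this step on synthetic data, constructed so that only two of the four columns carry signal. The data, the coefficients, and `n_features_to_select=2` are all illustrative assumptions.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data: columns 0 and 2 drive the target, columns 1 and 3 are noise
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 4))
X_test = rng.normal(size=(20, 4))
y_train = 3.0 * X_train[:, 0] - 2.0 * X_train[:, 2] + rng.normal(scale=0.1, size=100)

# Instantiate with the estimator and the number of features to keep
rfe = RFE(estimator=LinearRegression(), n_features_to_select=2)

# Fit needs y as well as X, because RFE fits the estimator repeatedly
X_train_reduced = rfe.fit_transform(X_train, y_train)

# Transform (no fit) the test data: the same columns are kept
X_test_reduced = rfe.transform(X_test)

print(rfe.support_)   # boolean mask of the kept features
print(rfe.ranking_)   # 1 marks a kept feature; larger numbers were dropped earlier
```

Because the ranking relies on coefficient size, features on very different scales can be ranked misleadingly, so it usually makes sense to scale before running RFE with a linear model.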

---
# Transform *y*

All of the above transformers change your X (features, independent variables).

You can transform y, too. It's fairly common to try to make the y more normal in a regression problem. Often a log transform works.

Scikit-learn's TransformedTargetRegressor is what you want.
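
As a hedged illustration, the sketch below fits a linear model to the log of a synthetic, strictly positive target and lets `TransformedTargetRegressor` convert predictions back to the original scale. The data and the choice of `np.log` / `np.exp` are assumptions. For targets that contain zeros, `np.log1p` and `np.expm1` are a common alternative.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

# Synthetic right-skewed target that is linear on the log scale
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 5, size=(200, 1))
y_train = np.exp(0.5 * X_train[:, 0] + rng.normal(scale=0.1, size=200))

# func is applied to y before fitting; inverse_func is applied to predictions
model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,
    inverse_func=np.exp,
)
model.fit(X_train, y_train)

# Predictions come back in the original units of y
preds = model.predict(np.array([[2.0]]))
print(preds)
```

Wrapping the transform this way means you never have to remember to undo it by hand when scoring or predicting.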

---

You don't have to do all these things. In fact, usually you won't do all of them.

### After you've done your transformations, it's time to model! ⭐️

You'll learn how to try lots of transformer combinations when you combine GridSearch and Pipelines next.

## Summary

You've seen how to use scikit-learn transformers to 

- Fill missing values using SimpleImputer
- Encode categorical features with OneHotEncoder
- Standardize features with StandardScaler
- Add new features with PolynomialFeatures
- Reduce the number of features with RFE 
- Transform y

Read the scikit-learn docs for each of the transformers when you get a chance. 

## Doing it in a Pipeline



- Fill missing values
- One Hot Encode categorical features
- Scale the data
- Use a `LinearRegression` and a `KNeighborsRegressor` to model the data

In [189]:
imputer

In [190]:
encoder

In [191]:
sscaler

In [192]:
poly_feats

In [193]:
rfe

In [194]:
from sklearn.pipeline import make_pipeline

In [195]:
pipe = make_pipeline(imputer, encoder, poly_feats, sscaler, rfe)

In [196]:
from sklearn.compose import make_column_selector, make_column_transformer
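
One way the listed steps could be arranged is sketched below on a small invented frame. Numeric and categorical columns get separate sub-pipelines, which `make_column_selector` routes by dtype. The column names, the imputation strategies, `handle_unknown="ignore"`, and `n_neighbors=3` are illustrative assumptions rather than the lesson's required settings.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.compose import make_column_selector, make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training data with missing values in every column
X_train = pd.DataFrame({
    "total_bill": [16.40, 12.16, 13.13, 25.56, 10.29, 18.71, np.nan, 38.07],
    "day": ["Thur", "Fri", "Sun", "Sun", "Sun", np.nan, "Sat", "Sun"],
    "size": [2.0, np.nan, 2.0, 4.0, 2.0, 3.0, 2.0, 3.0],
})
y_train = pd.Series([2.5, 2.2, 2.0, 4.3, 2.6, 4.0, 3.0, 4.0])
# New data with an unseen category and a missing value
X_test = pd.DataFrame({"total_bill": [20.0], "day": ["Mon"], "size": [np.nan]})

# Numeric columns: fill missing values, then scale
numeric = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
# Categorical columns: fill missing values, then one-hot encode
categorical = make_pipeline(
    SimpleImputer(strategy="most_frequent"),
    OneHotEncoder(handle_unknown="ignore"),
)
preprocess = make_column_transformer(
    (numeric, make_column_selector(dtype_include=np.number)),
    (categorical, make_column_selector(dtype_exclude=np.number)),
    sparse_threshold=0,  # always return a dense array
)

# Each model gets its own copy of the preprocessing steps
lr_pipe = make_pipeline(clone(preprocess), LinearRegression())
knn_pipe = make_pipeline(clone(preprocess), KNeighborsRegressor(n_neighbors=3))

# fit runs fit_transform on every step; predict only transforms, then predicts
lr_pipe.fit(X_train, y_train)
knn_pipe.fit(X_train, y_train)
lr_pred = lr_pipe.predict(X_test)
knn_pred = knn_pipe.predict(X_test)
print(lr_pred, knn_pred)
```

Because the whole chain lives in one object, calling `predict` on new data applies exactly the transformations learned from the training data, which keeps Rule 1 intact without extra bookkeeping.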

In [197]:
X_train.head(1)

| | total_bill | sex | smoker | day | time | size |
|---|---|---|---|---|---|---|
| 203 | 16.4 | Female | Yes | Thur | Lunch | 2.0 |


#### Problem

Build a pipeline to add Polynomial Features, encode categorical variables, and model with `LinearRegression` for the `Balance` column in the credit data (`Credit.csv`).

In [123]:
credit = pd.read_csv('data/Credit.csv', index_col = 0)
credit.head()

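| | Income | Limit | Rating | Cards | Age | Education | Gender | Student | Married | Ethnicity | Balance |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 14.891 | 3606 | 283 | 2 | 34 | 11 | Male | No | Yes | Caucasian | 333 |
| 2 | 106.025 | 6645 | 483 | 3 | 82 | 15 | Female | Yes | Yes | Asian | 903 |
| 3 | 104.593 | 7075 | 514 | 4 | 71 | 11 | Male | No | No | Asian | 580 |
| 4 | 148.924 | 9504 | 681 | 3 | 36 | 11 | Female | No | No | Asian | 964 |
| 5 | 55.882 | 4897 | 357 | 2 | 68 | 16 | Male | No | Yes | Caucasian | 331 |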
