# CPSC 330 Lecture 7

### Lecture plan

- 👋
- **Turn on recording**
- Announcements
- Missing data: abridged version (10 min)
- Feature scaling (25 min)
- Break (5 min)
- Putting it all together with `ColumnTransformer` (30 min)
- Hyperparameter search, revisited (5 min)
- Summary (5 min)

## Announcements

- New office hours starting this week, see [calendar](https://htmlpreview.github.io/?https://github.com/UBC-CS/cpsc330/blob/master/docs/calendar.html).
- I'll try to release hw4 earlier than Tuesday (I have some time later today to work on it)
- Will try Live Q&A one more time (today) and then will send out a poll to get your feedback.
  - Reminder - turn on Live Q&A
- New plot in [Tuesday's lecture notebook](https://github.com/UBC-CS/cpsc330/blob/master/lectures/06_optimization-bias-categoricals.ipynb)

## Learning objectives

- Use `SimpleImputer` to impute values where data is missing.
- Explain the motivation for feature scaling.
- Compare standardization vs. normalization.
  - (but we won't really discuss much about the pros and cons of each).
- Identify which preprocessing steps apply to which types of data (e.g. `StandardScaler` is for numeric features)
- Use `ColumnTransformer` for more complex pipelines.
- Use `GridSearchCV` and `RandomizedSearchCV` on nested pipelines.

In [103]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.rcParams['font.size'] = 16

from sklearn.dummy import DummyClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score, cross_validate, GridSearchCV, RandomizedSearchCV
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn import set_config, config_context

## Dealing with missing data: abridged version (10 min)

Today we'll continue with the census data:

In [2]:
census = pd.read_csv('data/adult.csv')

As discussed last time, we'll drop the `education` column because it's already been ordinally encoded in `education.num`.

In [3]:
census = census.drop(columns=["education"])

In [4]:
census_train, census_test = train_test_split(census, test_size=0.2, random_state=123)

In [5]:
census_train.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,?,77053,9,Widowed,?,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,?,186061,10,Widowed,?,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


- We can see we have a bunch of missing values, where presumably the person did not answer that question on the census.
- Interestingly, pandas did not pick these up as missing values:

In [6]:
census_train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26048 entries, 17064 to 19966
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             26048 non-null  int64 
 1   workclass       26048 non-null  object
 2   fnlwgt          26048 non-null  int64 
 3   education.num   26048 non-null  int64 
 4   marital.status  26048 non-null  object
 5   occupation      26048 non-null  object
 6   relationship    26048 non-null  object
 7   race            26048 non-null  object
 8   sex             26048 non-null  object
 9   capital.gain    26048 non-null  int64 
 10  capital.loss    26048 non-null  int64 
 11  hours.per.week  26048 non-null  int64 
 12  native.country  26048 non-null  object
 13  income          26048 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.0+ MB


- Everything is non-null because the missing values were encoded as the string "?" instead of an actual NaN in Python.
- We saw those last class, where "?" was a category generated by OHE.
- Let's change them to actual nulls:

In [7]:
df_train_nan = census_train.replace('?', np.nan)  # np.nan is the spelling that works across NumPy versions
df_test_nan  = census_test.replace( '?', np.nan)

df_train_nan.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country,income
0,90,,77053,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States,<=50K
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States,<=50K
2,66,,186061,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States,<=50K
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States,<=50K
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States,<=50K


In [8]:
df_train_nan.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26048 entries, 17064 to 19966
Data columns (total 14 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             26048 non-null  int64 
 1   workclass       24600 non-null  object
 2   fnlwgt          26048 non-null  int64 
 3   education.num   26048 non-null  int64 
 4   marital.status  26048 non-null  object
 5   occupation      24595 non-null  object
 6   relationship    26048 non-null  object
 7   race            26048 non-null  object
 8   sex             26048 non-null  object
 9   capital.gain    26048 non-null  int64 
 10  capital.loss    26048 non-null  int64 
 11  hours.per.week  26048 non-null  int64 
 12  native.country  25573 non-null  object
 13  income          26048 non-null  object
dtypes: int64(6), object(8)
memory usage: 3.0+ MB


- Now we can see the null values, and likely these would be picked out by pandas profiler.
  - Note: we'll address null values in the features, not in the targets.
- So, how should we address these?
- Disclaimer: we will only cover this in a super simplistic way.
- See STAT courses for a proper treatment of this topic!
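
Before choosing a strategy, it can help to count how much is actually missing in each column. The snippet below is a minimal sketch on a made-up toy dataframe (the `toy` data is hypothetical, not the census file); it only assumes the same `"?"` placeholder convention seen above.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data where "?" marks a missing answer, as in the census file.
toy = pd.DataFrame({
    "age": [25, 40, 31, 58],
    "workclass": ["Private", "?", "State-gov", "?"],
    "occupation": ["Sales", "?", "Tech-support", "Sales"],
})

# Turn the placeholder into real missing values.
toy_nan = toy.replace("?", np.nan)

# Count and fraction of missing values per column.
missing_counts = toy_nan.isna().sum()
missing_fraction = toy_nan.isna().mean()
print(missing_counts)
print(missing_fraction)
```

The same two calls (`isna().sum()` and `isna().mean()`) should work on `df_train_nan` as well.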

#### Gotta drop 'em all

In [9]:
X_train_nan = df_train_nan.drop(columns=['income'])
X_test_nan  = df_test_nan.drop(columns=['income'])
y_train = df_train_nan['income']
y_test = df_test_nan['income']

In [10]:
X_train_nan.shape

(26048, 13)

In [11]:
X_train_nan.dropna(axis=0).shape

(24144, 13)

- So, we dropped about 2000 rows.
- We'd need to do the same in our test set.
- But what if we get a missing value in deployment?
- And furthermore, what if the missing values don't occur at random and we're systematically dropping certain data?
- This is not a great solution, especially if there are a lot of missing values.

In [12]:
X_train_nan.dropna(axis=1).shape

(26048, 10)

- One can also drop all _columns_ with missing values using `axis=1`. 
- This generally throws away a lot of information, because you lose a whole column just for 1 missing value.
- But I might drop a column if it's 99.9% missing values, for example.
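
As a rough sketch of that last idea, one could drop only the columns whose fraction of missing values exceeds some cutoff. The toy dataframe and the `max_missing_fraction` cutoff below are hypothetical choices for illustration, not a recommendation.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one column is almost entirely missing.
toy = pd.DataFrame({
    "age": [25, 40, 31, 58, 47],
    "workclass": ["Private", np.nan, "State-gov", "Private", "Private"],
    "mostly_missing": [np.nan, np.nan, np.nan, np.nan, "x"],
})

max_missing_fraction = 0.5  # hypothetical cutoff; tune for your data

# Fraction of missing values in each column.
missing_fraction = toy.isna().mean()

# Keep only the columns at or below the cutoff.
cols_to_keep = missing_fraction[missing_fraction <= max_missing_fraction].index
toy_reduced = toy[cols_to_keep]
print(list(toy_reduced.columns))
```

Unlike `dropna(axis=1)`, this keeps `workclass`, which has only one missing value.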

#### Imputation

- Imputation means inventing values for the missing data.
- The strategies are different for numeric vs. categorical.
- In this dataset it turns out we only have missing values in the categorical features.

In [13]:
from sklearn.impute import SimpleImputer

In [14]:
imp = SimpleImputer(strategy='most_frequent')

- This imputer is another transformer, like the other ones we've seen (`CountVectorizer`, `OrdinalEncoder`, `OneHotEncoder`).
- The "most_frequent" strategy puts in the most frequent value seen in that column.
- There are also strategies for numeric variables, like taking the mean or median value.
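
For numeric columns, the mean and median strategies might look like the sketch below. The array `X_toy` is made up for illustration (the census numeric features have no missing values); note how the outlier `100.0` pulls the mean but not the median.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical numeric data with one missing value per column.
X_toy = np.array([[1.0, 10.0],
                  [2.0, np.nan],
                  [np.nan, 30.0],
                  [100.0, 20.0]])

mean_imp = SimpleImputer(strategy="mean")
median_imp = SimpleImputer(strategy="median")

# fit_transform learns the per-column statistic and fills in the NaNs.
X_mean = mean_imp.fit_transform(X_toy)
X_median = median_imp.fit_transform(X_toy)

print(mean_imp.statistics_)    # per-column means of the observed values
print(median_imp.statistics_)  # per-column medians of the observed values
```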

In [15]:
numeric_features = ['age', 'fnlwgt', 'education.num', 'capital.gain', 
                    'capital.loss', 'hours.per.week']
categorical_features = ['workclass', 'marital.status', 'occupation', 
                        'relationship', 'race', 'sex', 'native.country']
target_column = 'income'

In [16]:
imp.fit(X_train_nan[categorical_features]);

In [17]:
X_train_imp_cat = pd.DataFrame(imp.transform(X_train_nan[categorical_features]),
                           columns=categorical_features, index=X_train_nan.index)
X_test_imp_cat = pd.DataFrame(imp.transform(X_test_nan[categorical_features]),
                           columns=categorical_features, index=X_test_nan.index)

X_train_imp = X_train_nan.copy()
X_train_imp.update(X_train_imp_cat)

X_test_imp = X_test_nan.copy()
X_test_imp.update(X_test_imp_cat)

We can see the missing values filled in. Before:

In [18]:
X_train_nan.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,90,,77053,9,Widowed,,Not-in-family,White,Female,0,4356,40,United-States
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States
2,66,,186061,10,Widowed,,Unmarried,Black,Female,0,4356,40,United-States
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States


After:

In [19]:
X_train_imp.sort_index().head()

Unnamed: 0,age,workclass,fnlwgt,education.num,marital.status,occupation,relationship,race,sex,capital.gain,capital.loss,hours.per.week,native.country
0,90,Private,77053,9,Widowed,Prof-specialty,Not-in-family,White,Female,0,4356,40,United-States
1,82,Private,132870,9,Widowed,Exec-managerial,Not-in-family,White,Female,0,4356,18,United-States
2,66,Private,186061,10,Widowed,Prof-specialty,Unmarried,Black,Female,0,4356,40,United-States
3,54,Private,140359,4,Divorced,Machine-op-inspct,Unmarried,White,Female,0,3900,40,United-States
4,41,Private,264663,10,Separated,Prof-specialty,Own-child,White,Female,0,3900,40,United-States


We won't go into any detail about methods of imputation, but you can consider the different approaches as hyperparameters.

#### Fill value

- By the way, another option would be to just leave in the "?" and have this be its own category for categorical variables.

In [20]:
SimpleImputer(strategy='constant', fill_value="?");

- I won't get too deep into this in this course. 
- We can just say what we always say - treat it as a hyperparameter unless you have a better idea.
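
Treating the imputation strategy as a hyperparameter could look roughly like the sketch below. Everything about the data here is hypothetical (a synthetic `workclass` column and a synthetic target), and the step name `imputation` is just the label chosen for this toy pipeline; the point is only the `imputation__strategy` entry in the grid.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: one categorical feature with about 20% missing values.
rng = np.random.default_rng(0)
n = 200
workclass = rng.choice(["Private", "State-gov", "Self-emp"], size=n).astype(object)
y_toy = np.where(workclass == "Private", ">50K", "<=50K")  # synthetic target
workclass[rng.random(n) < 0.2] = np.nan
X_toy = pd.DataFrame({"workclass": workclass})

toy_pipe = Pipeline([
    # fill_value is only used when strategy='constant'
    ("imputation", SimpleImputer(strategy="most_frequent", fill_value="?")),
    ("ohe", OneHotEncoder(handle_unknown="ignore")),
    ("lr", LogisticRegression(max_iter=1000)),
])

# Try both "fill with the most frequent value" and "keep '?' as its own category".
param_grid = {"imputation__strategy": ["most_frequent", "constant"]}
search = GridSearchCV(toy_pipe, param_grid, cv=5)
search.fit(X_toy, y_toy)
print(search.best_params_, search.best_score_)
```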

#### Pipeline

Let's build a Pipeline with what we have so far for categorical features only.

In [22]:
pipe = Pipeline([('imputation', SimpleImputer(strategy='most_frequent')),
                 ('ohe', OneHotEncoder(handle_unknown='ignore')),
                 ('lr', LogisticRegression(max_iter=1000))])

- Now we have a Pipeline with 3 stages: 2 transformers followed by a classifier.
- Now we can go back to that image from Lecture 5 and it's more appropriate:

<img src="img/pipeline.png" width="700">

[Source](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#18)

We can run the pipeline:

In [23]:
pd.DataFrame(cross_validate(pipe, X_train_nan[categorical_features], y_train))

Unnamed: 0,fit_time,score_time,test_score
0,0.994029,0.016973,0.813244
1,0.954693,0.015995,0.808829
2,0.972394,0.017054,0.80595
3,0.962992,0.016949,0.819159
4,0.940569,0.016533,0.815512


- Great, so this all works, but we only used the categorical features.
- Later today we'll see how to combine everything nicely with `ColumnTransformer`.
- But first, one more thing: preprocessing of the numeric variables!

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Feature scaling (25 min)

Here are the numeric features:

In [24]:
X_train_imp[numeric_features]

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,20,110998,10,0,0,30
18434,22,263670,9,0,0,80
3294,51,335997,9,4386,0,55
31317,53,111939,13,0,0,35
4770,52,51048,13,0,0,55
...,...,...,...,...,...,...
28636,48,70668,9,0,0,50
17730,35,340018,6,0,0,38
28030,26,373553,10,0,0,42
15725,28,155621,3,0,0,40


Let's train a model using only these features:

In [27]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_imp[numeric_features], y_train, return_train_score=True))

Unnamed: 0,fit_time,score_time,test_score,train_score
0,0.149345,0.012755,0.793858,0.800557
1,0.141597,0.012674,0.798848,0.800125
2,0.114308,0.011775,0.80096,0.799261
3,0.113999,0.012915,0.803993,0.798503
4,0.128847,0.013455,0.798426,0.799079


Ok, so `DummyClassifier` gets

In [26]:
DummyClassifier(strategy='prior').fit(None, y_train).score(None, y_train)

0.7605190417690417

- And here we do a few percent better.
- But let's look at the coefficients:

In [28]:
lr.fit(X_train_imp[numeric_features], y_train);

In [29]:
pd.DataFrame(data=lr.coef_[0], index=numeric_features, columns=['Coefficient'])

Unnamed: 0,Coefficient
age,-0.007233
fnlwgt,-4e-06
education.num,-0.001697
capital.gain,0.000337
capital.loss,0.000785
hours.per.week,-0.007883


- What we see here is a very small coefficient for `fnlwgt` (description of this feature [here](https://www.kaggle.com/uciml/adult-census-income), I couldn't quite decipher it).
- Why is this coefficient so small?

In [30]:
X_train_nan.describe()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
count,26048.0,26048.0,26048.0,26048.0,26048.0,26048.0
mean,38.586686,189229.5,10.070485,1075.695754,87.629991,40.433239
std,13.619181,105000.5,2.572231,7334.297499,404.192112,12.346313
min,17.0,13769.0,1.0,0.0,0.0,1.0
25%,28.0,117583.0,9.0,0.0,0.0,40.0
50%,37.0,177785.0,10.0,0.0,0.0,40.0
75%,48.0,236885.2,12.0,0.0,0.0,45.0
max,90.0,1366120.0,16.0,99999.0,4356.0,99.0


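- Answer: because the values are so big (the mean is about 190,000). A coefficient multiplies its feature's value, so a feature with huge values only needs a tiny coefficient to move the prediction.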
- And what if these values happened to be even larger? Or what if capital gain/loss was measured in thousands of dollars?
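
The link between a feature's units and the size of its coefficient can be sketched with plain least squares. This is a simplification: it is not the regularized `LogisticRegression` used in this lecture, where the relationship is not exact. The data and the helper `least_squares_slope` are made up for illustration.

```python
import numpy as np

# Hypothetical data: y depends linearly on a single feature x.
rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 * x + rng.normal(scale=0.1, size=100)

def least_squares_slope(x, y):
    """Slope of the best-fit line through the origin: w = (x . y) / (x . x)."""
    return (x @ y) / (x @ x)

w_original = least_squares_slope(x, y)
w_rescaled = least_squares_slope(1000 * x, y)  # same feature, units 1000x larger

# The coefficient shrinks by exactly the factor the feature grew by.
print(w_original, w_rescaled)
```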

In [31]:
X_train_mod = X_train_imp[numeric_features].copy()
X_train_mod["capital.gain"] /= 1000
X_train_mod["capital.loss"] /= 1000
X_train_mod["fnlwgt"] *= 1000

In [32]:
X_train_mod.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,20,110998000,10,0.0,0.0,30
18434,22,263670000,9,0.0,0.0,80
3294,51,335997000,9,4.386,0.0,55
31317,53,111939000,13,0.0,0.0,35
4770,52,51048000,13,0.0,0.0,55


In [33]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_mod, y_train, return_train_score=True)).mean()

fit_time       0.061292
score_time     0.012938
test_score     0.760519
train_score    0.760519
dtype: float64

- Now our train & test scores went down to basically `DummyClassifier` level!
- But what is up with that, these units are arbitrary to begin with!!
- BTW, decision trees don't have this problem because they only compare each feature against a threshold, rather than crunching the actual number; rescaling a feature just rescales the threshold.
  - [Great post on Piazza](https://piazza.com/class/kb2e6nwu3uj23?cid=256) pointing out something similar with the spacing in ordinal encodings!

In [34]:
dt = DecisionTreeClassifier(random_state=1)
cross_val_score(dt, X_train_imp[numeric_features], y_train).mean()

0.7707694308794502

In [35]:
dt = DecisionTreeClassifier(random_state=1)
cross_val_score(dt, X_train_mod, y_train).mean()

0.7707694308794502

- But this problem affects plenty of ML methods.
- So it would be nice to just take care of this issue.
- The general approach is to rescale the features.
- Two specific approaches we'll cover are standardization and normalization.

## Q&A

(Pause for Q&A)

<br><br><br><br>

| Approach | What it does | How to update $X$ (but see below!) | sklearn implementation | 
|---------|------------|-----------------------|----------------|
| normalization | sets range to $[0,1]$   | `X -= np.min(X,axis=0)`<br>`X /= np.max(X,axis=0)`  | [`MinMaxScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html) |
| standardization | sets sample mean to $0$, s.d. to $1$   | `X -= np.mean(X,axis=0)`<br>`X /=  np.std(X,axis=0)` | [`StandardScaler()`](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html#sklearn.preprocessing.StandardScaler) |

There are all sorts of articles on this; see, e.g. [here](http://www.dataminingblog.com/standardization-vs-normalization/) and [here](https://medium.com/@rrfd/standardize-or-normalize-examples-in-python-e3f174b65dfc).
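
As a quick sanity check, the NumPy updates in the table can be compared against the sklearn scalers on a small made-up array (`X_toy` is hypothetical). Note that this computes the statistics from the same array it transforms, which only makes sense for training data.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical toy data with two features on very different scales.
X_toy = np.array([[20.0, 110998.0],
                  [22.0, 263670.0],
                  [51.0, 335997.0],
                  [53.0, 111939.0]])

# Normalization by hand: subtract the column min, then divide by the new column max.
X_norm = X_toy.copy()
X_norm -= np.min(X_norm, axis=0)
X_norm /= np.max(X_norm, axis=0)

# Standardization by hand: subtract the column mean, divide by the column s.d.
X_std = X_toy.copy()
X_std -= np.mean(X_std, axis=0)
X_std /= np.std(X_std, axis=0)

print(np.allclose(X_norm, MinMaxScaler().fit_transform(X_toy)))
print(np.allclose(X_std, StandardScaler().fit_transform(X_toy)))
```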

Let's use these scaling methods:

In [36]:
from sklearn.preprocessing import StandardScaler, MinMaxScaler

In [37]:
scaler = StandardScaler()
scaler.fit(X_train_imp[numeric_features]);

In [38]:
scaler.transform(X_train_imp[numeric_features])

array([[-1.36476947, -0.74507317, -0.02740291, -0.14666932, -0.21680698,
        -0.84506515],
       [-1.21791495,  0.70896709, -0.41617802, -0.14666932, -0.21680698,
         3.20480459],
       [ 0.91147565,  1.39780572, -0.41617802,  0.45135445, -0.21680698,
         1.17986972],
       ...,
       [-0.9242059 ,  1.75548713, -0.02740291, -0.14666932, -0.21680698,
         0.12690359],
       [-0.77735138, -0.32008602, -2.74882863, -0.14666932, -0.21680698,
        -0.0350912 ],
       [ 0.10377577, -0.36129614, -0.41617802, -0.14666932, -0.21680698,
         0.61288796]])

In [39]:
scaled_train_df = pd.DataFrame(scaler.transform(X_train_imp[numeric_features]),
                           columns=numeric_features, index=X_train_imp.index)
scaled_train_df.head()

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.17987
31317,1.05833,-0.736111,1.138922,-0.146669,-0.216807,-0.440078
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.17987


- Note the same Golden Rule issue we talked about before.
  - We fit the transformer on the training data, and then transform both data sets.
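  - We need to use a Pipeline for cross-validation because the transformation of each row depends on statistics (the mean and standard deviation) computed from the other rows, so the scaler must be refit on the training portion of each fold.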

In [40]:
scaled_test_df = pd.DataFrame(scaler.transform(X_test_imp[numeric_features]),
                           columns=numeric_features, index=X_test_imp.index)

Let's check that it did what we expected:

In [41]:
scaled_train_df.mean(axis=0)

age               3.634832e-16
fnlwgt           -4.961863e-17
education.num     1.371028e-15
capital.gain     -3.863724e-16
capital.loss      7.671122e-16
hours.per.week    9.381657e-16
dtype: float64

These are basically all zero ($10^{-16}$ is zero to within numerical precision).

In [42]:
scaled_train_df.std(axis=0)

age               1.000019
fnlwgt            1.000019
education.num     1.000019
capital.gain      1.000019
capital.loss      1.000019
hours.per.week    1.000019
dtype: float64
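
The values are 1.000019 rather than exactly 1 because pandas' `.std()` divides by $n-1$ by default (`ddof=1`), while `StandardScaler` uses the $n$ version (`ddof=0`). With $n = 26048$ training rows, the ratio is $\sqrt{n/(n-1)} \approx 1.000019$. A small sketch on made-up data (the `toy` dataframe is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy feature.
rng = np.random.default_rng(0)
toy = pd.DataFrame({"x": rng.normal(loc=40, scale=12, size=1000)})

scaled = pd.DataFrame(StandardScaler().fit_transform(toy), columns=toy.columns)
n = len(scaled)

std_pandas_default = scaled["x"].std()     # ddof=1: slightly above 1
std_population = scaled["x"].std(ddof=0)   # ddof=0: what StandardScaler normalizes

print(std_pandas_default, np.sqrt(n / (n - 1)))
print(std_population)
```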

Note that for test we get something different - that is OK!!

In [43]:
scaled_test_df.mean(axis=0)

age              -0.001850
fnlwgt            0.026132
education.num     0.019814
capital.gain      0.001331
capital.loss     -0.004034
hours.per.week    0.001708
dtype: float64

In [44]:
scaled_test_df.std(axis=0)

age               1.007872
fnlwgt            1.025728
education.num     1.000891
capital.gain      1.034391
capital.loss      0.984757
hours.per.week    1.000546
dtype: float64

## Q&A

<br><br><br><br>

Let's re-run our experiments now.

1. Without scaling

In [45]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_imp[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.125385
score_time     0.012492
test_score     0.799217
train_score    0.799505
dtype: float64

2. With scaling

In [46]:
pipe = Pipeline([('scaling', StandardScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])

In [47]:
pd.DataFrame(cross_validate(pipe, X_train_imp[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.061689
score_time     0.012663
test_score     0.814727
train_score    0.815024
dtype: float64

Here we actually do a little better! Cool.

3. After messing with the data by rescaling some features

In [48]:
lr = LogisticRegression(max_iter=1000)
pd.DataFrame(cross_validate(lr, X_train_mod[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.061530
score_time     0.012299
test_score     0.760519
train_score    0.760519
dtype: float64

These are the same bad results we saw earlier.

4. After messing with the data, but using feature scaling

In [49]:
pipe = Pipeline([('scaling', StandardScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])

In [50]:
pd.DataFrame(cross_validate(pipe, X_train_mod[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.067296
score_time     0.013045
test_score     0.814727
train_score    0.815024
dtype: float64

BAM! The scaling always sets the variance to 1, so the fact that we scaled up/down by 1000 is irrelevant!
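
This invariance can be checked directly, without a classifier: standardizing a feature and standardizing the same feature in different units should give the same numbers. A minimal sketch on made-up data (`X_toy` is hypothetical):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on different scales.
rng = np.random.default_rng(0)
X_toy = rng.normal(loc=[40.0, 1000.0], scale=[12.0, 7000.0], size=(50, 2))

# Same data, but the second feature is measured in thousands.
X_rescaled = X_toy.copy()
X_rescaled[:, 1] /= 1000

Z_original = StandardScaler().fit_transform(X_toy)
Z_rescaled = StandardScaler().fit_transform(X_rescaled)

# After standardization the change of units has disappeared.
print(np.allclose(Z_original, Z_rescaled))
```

Since the classifier in the pipeline only ever sees the standardized values, its scores come out the same either way.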

## Q&A

(Pause for Q&A)

<br><br><br><br>

We can redo the same experiments but with min/max scaling:

In [51]:
pipe = Pipeline([('scaling', MinMaxScaler()),
                 ('lr', LogisticRegression(max_iter=1000))])

In [52]:
pd.DataFrame(cross_validate(pipe, X_train_imp[numeric_features], y_train, return_train_score=True)).mean()

fit_time       0.086178
score_time     0.012658
test_score     0.810427
train_score    0.810888
dtype: float64

- Here, we get similar results. 
- We can also check that it does what it's supposed to do.

In [57]:
minmax = MinMaxScaler()
minmax.fit(X_train_imp[numeric_features])
normalized_train = minmax.transform(X_train_imp[numeric_features])
normalized_test = minmax.transform(X_test_imp[numeric_features])

Let's again check the results:

In [58]:
normalized_train.min(axis=0)

array([0., 0., 0., 0., 0., 0.])

In [59]:
normalized_train.max(axis=0)

array([1., 1., 1., 1., 1., 1.])

And again for test:

In [60]:
normalized_test.min(axis=0)

array([ 0.        , -0.00109735,  0.        ,  0.        ,  0.        ,
        0.        ])

In [61]:
normalized_test.max(axis=0)

array([1.        , 1.08768803, 1.        , 1.        , 0.84550046,
       1.        ])

In [64]:
minmax.data_min_

array([1.7000e+01, 1.3769e+04, 1.0000e+00, 0.0000e+00, 0.0000e+00,
       1.0000e+00])

In [65]:
minmax.data_max_

array([9.00000e+01, 1.36612e+06, 1.60000e+01, 9.99990e+04, 4.35600e+03,
       9.90000e+01])
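
The test values fall slightly outside $[0,1]$ because the scaler reuses the min and max learned from the training data; a test value below the training min or above the training max maps outside the range. A minimal sketch with made-up numbers (both toy arrays are hypothetical):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy data: the test set has values outside the training range.
X_train_toy = np.array([[10.0], [20.0], [30.0]])
X_test_toy = np.array([[5.0], [25.0], [40.0]])

# Fit on train only: learns min=10 and max=30.
minmax_toy = MinMaxScaler().fit(X_train_toy)

# Each test value is mapped with (x - 10) / (30 - 10).
test_scaled = minmax_toy.transform(X_test_toy)
print(test_scaled.ravel())
```

As with standardization, this is expected and OK.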

#### Preprocessing the targets?

- We'll discuss this when we get to numeric targets (regression) in a couple weeks

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Break (5 min)

<br><br><br><br>

## Putting it all together with `ColumnTransformer` (30 min)

- Ok, so this is all great, but now we have ourselves a BIG MESS.
- Attribution: some code in this section is adapted from the [sklearn documentation](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html).
- We have a Pipeline for the categorical features:

In [66]:
pipe_cat = Pipeline([('imputation', SimpleImputer(strategy='most_frequent')),
                     ('ohe', OneHotEncoder(handle_unknown='ignore')),
                     ('lr', LogisticRegression(max_iter=1000))])

And a pipeline for the numeric features:

In [67]:
pipe_num = Pipeline([('scaler', StandardScaler()), # there were no missing values here
                     ('lr', LogisticRegression(max_iter=1000))])

- But we need to join together the scaled numeric features with the imputed/OHE'd categorical features BEFORE passing the whole dataframe into the classifier!
- We can do this awkwardly without a Pipeline:

In [68]:
imputer = SimpleImputer(strategy='most_frequent')
ohe     = OneHotEncoder(handle_unknown='ignore', sparse=False)  # `sparse` is named `sparse_output` in scikit-learn >= 1.2
scaler  = StandardScaler()
lr      = LogisticRegression(max_iter=1000)

Now we process the categorical variables:

In [69]:
cat_train = X_train_nan[categorical_features]
cat_train_imp = imputer.fit_transform(cat_train)
cat_train_imp_ohe = ohe.fit_transform(cat_train_imp)
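# Note: from scikit-learn 1.0 this method is available as `get_feature_names_out` (`get_feature_names` was removed in 1.2).
cat_train_imp_ohe_df = pd.DataFrame(data=cat_train_imp_ohe, columns=ohe.get_feature_names(categorical_features), index=X_train_nan.index)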
cat_train_imp_ohe_df.head()

Unnamed: 0,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,workclass_Self-emp-inc,workclass_Self-emp-not-inc,workclass_State-gov,workclass_Without-pay,marital.status_Divorced,marital.status_Married-AF-spouse,...,native.country_Portugal,native.country_Puerto-Rico,native.country_Scotland,native.country_South,native.country_Taiwan,native.country_Thailand,native.country_Trinadad&Tobago,native.country_United-States,native.country_Vietnam,native.country_Yugoslavia
17064,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
18434,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3294,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
31317,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4770,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0


Now we process the numeric variables:

In [70]:
num_train = X_train_nan[numeric_features]
num_train_scaler = scaler.fit_transform(num_train)
num_train_scaler_df = pd.DataFrame(data=num_train_scaler, columns=numeric_features, index=X_train_nan.index)
num_train_scaler_df

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.179870
31317,1.058330,-0.736111,1.138922,-0.146669,-0.216807,-0.440078
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.179870
...,...,...,...,...,...,...
28636,0.691194,-1.129174,-0.416178,-0.146669,-0.216807,0.774883
17730,-0.263361,1.436102,-1.582503,-0.146669,-0.216807,-0.197086
28030,-0.924206,1.755487,-0.027403,-0.146669,-0.216807,0.126904
15725,-0.777351,-0.320086,-2.748829,-0.146669,-0.216807,-0.035091


Now we smush them together:

In [71]:
X_train_this_is_so_annoying = pd.concat((num_train_scaler_df, cat_train_imp_ohe_df), axis=1)
X_train_this_is_so_annoying

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native.country_Portugal,native.country_Puerto-Rico,native.country_Scotland,native.country_South,native.country_Taiwan,native.country_Thailand,native.country_Trinadad&Tobago,native.country_United-States,native.country_Vietnam,native.country_Yugoslavia
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.179870,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
31317,1.058330,-0.736111,1.138922,-0.146669,-0.216807,-0.440078,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.179870,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28636,0.691194,-1.129174,-0.416178,-0.146669,-0.216807,0.774883,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
17730,-0.263361,1.436102,-1.582503,-0.146669,-0.216807,-0.197086,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
28030,-0.924206,1.755487,-0.027403,-0.146669,-0.216807,0.126904,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15725,-0.777351,-0.320086,-2.748829,-0.146669,-0.216807,-0.035091,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [72]:
lr.fit(X_train_this_is_so_annoying, y_train)
lr.score(X_train_this_is_so_annoying, y_train)

0.8523495085995086

And now, time to do all that again for the test data!

![](img/mike_not_gonna_happen.png)

- Right, so the above is horribly messy and also results in a Golden Rule violation if we do cross-validation.
- Enter `ColumnTransformer` to save the day!

In [73]:
from sklearn.compose import ColumnTransformer

<img src="img/column-transformer.png" width=800>

Image adapted from [here](https://amueller.github.io/COMS4995-s20/slides/aml-04-preprocessing/#37).

- A big advantage here is that we build all our transformations together into one object, and that way we're sure we do the same operations to all splits of the data. 
- Otherwise we might, for example, do the OHE on both train and test but forget to scale the test data.

Let's get to work! 

In [74]:
numeric_features

['age',
 'fnlwgt',
 'education.num',
 'capital.gain',
 'capital.loss',
 'hours.per.week']

In [75]:
preprocessor = ColumnTransformer([
    ('scale', StandardScaler(), numeric_features),
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False), categorical_features)  # `sparse` is named `sparse_output` in scikit-learn >= 1.2
])

In [76]:
type(preprocessor)

sklearn.compose._column_transformer.ColumnTransformer

Above:

- The `ColumnTransformer` syntax is somewhat similar to `Pipeline` in that you pass in a list of tuples.
- But here each tuple has 3 values instead of 2: (name, object, list of columns)

In [81]:
preprocessor.fit(X_train_imp);

In [82]:
X_train_preproc = preprocessor.transform(X_train_imp)
X_train_preproc

array([[-1.36476947, -0.74507317, -0.02740291, ...,  1.        ,
         0.        ,  0.        ],
       [-1.21791495,  0.70896709, -0.41617802, ...,  1.        ,
         0.        ,  0.        ],
       [ 0.91147565,  1.39780572, -0.41617802, ...,  1.        ,
         0.        ,  0.        ],
       ...,
       [-0.9242059 ,  1.75548713, -0.02740291, ...,  1.        ,
         0.        ,  0.        ],
       [-0.77735138, -0.32008602, -2.74882863, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.10377577, -0.36129614, -0.41617802, ...,  1.        ,
         0.        ,  0.        ]])

In [79]:
type(X_train_preproc)

numpy.ndarray

In [80]:
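# Note: from scikit-learn 1.0 this method is available as `get_feature_names_out` (`get_feature_names` was removed in 1.2).
new_columns = numeric_features + list(preprocessor.named_transformers_['ohe'].get_feature_names(categorical_features))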
pd.DataFrame(data=X_train_preproc, columns=new_columns, index=X_train_imp.index)

Unnamed: 0,age,fnlwgt,education.num,capital.gain,capital.loss,hours.per.week,workclass_Federal-gov,workclass_Local-gov,workclass_Never-worked,workclass_Private,...,native.country_Portugal,native.country_Puerto-Rico,native.country_Scotland,native.country_South,native.country_Taiwan,native.country_Thailand,native.country_Trinadad&Tobago,native.country_United-States,native.country_Vietnam,native.country_Yugoslavia
17064,-1.364769,-0.745073,-0.027403,-0.146669,-0.216807,-0.845065,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
18434,-1.217915,0.708967,-0.416178,-0.146669,-0.216807,3.204805,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3294,0.911476,1.397806,-0.416178,0.451354,-0.216807,1.179870,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
31317,1.058330,-0.736111,1.138922,-0.146669,-0.216807,-0.440078,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4770,0.984903,-1.316034,1.138922,-0.146669,-0.216807,1.179870,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
28636,0.691194,-1.129174,-0.416178,-0.146669,-0.216807,0.774883,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
17730,-0.263361,1.436102,-1.582503,-0.146669,-0.216807,-0.197086,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
28030,-0.924206,1.755487,-0.027403,-0.146669,-0.216807,0.126904,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
15725,-0.777351,-0.320086,-2.748829,-0.146669,-0.216807,-0.035091,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


- BAM! Scaling and OHE applied at the same time!
- When we `fit` the `ColumnTransformer`, it fits _all_ the transformers. And likewise for `transform`.
- Warning: by default `ColumnTransformer` throws away any columns not accounted for in its steps.
- Setting `remainder='passthrough'` keeps the rest of the columns intact, as in the image above.
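
To make the last two bullets concrete, here is a minimal sketch on a tiny made-up dataframe (the `toy` data and the column names `hours` and `id` are hypothetical, not part of the census data). It compares the default behaviour with `remainder='passthrough'`:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data (not the census data)
toy = pd.DataFrame({
    'age': [25, 40, 55],
    'hours': [20, 40, 60],
    'id': [101, 102, 103],  # not listed in any transformer below
})

# Default: columns that no transformer asks for are dropped
ct_drop = ColumnTransformer([('scale', StandardScaler(), ['age', 'hours'])])

# remainder='passthrough': unlisted columns are kept, untouched
ct_keep = ColumnTransformer([('scale', StandardScaler(), ['age', 'hours'])],
                            remainder='passthrough')

out_drop = ct_drop.fit_transform(toy)
out_keep = ct_keep.fit_transform(toy)

print(out_drop.shape)    # only the two scaled columns remain
print(out_keep.shape)    # scaled columns plus the untouched 'id' column
print(out_keep[:, -1])   # passthrough columns come after the transformed ones
```

Note that the passed-through columns are appended after the transformed ones, so the column order of the output can differ from the input.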

## Q&A

(Pause for Q&A)

<br><br><br><br>

So, let's put everything together in a pipeline: 

#### Attempt 1: one pipeline

- We use the preprocessor as a step in the pipeline!
- This step treats numeric features and categorical features differently:

In [83]:
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('preproc', preprocessor), # <-- this is the ColumnTransformer!
    ('lr', LogisticRegression(max_iter=1000))
])

In [84]:
pipe.fit(X_train_nan, y_train)

ValueError: Specifying the columns using strings is only supported for pandas DataFrames

- Ugh. What happened?
- The problem is that `SimpleImputer.transform` outputs a numpy array, not a pandas dataframe:

In [86]:
type(imputer.transform(X_train_nan[categorical_features]))

numpy.ndarray

We've been putting it into a dataframe manually so that it looks nice:

In [87]:
pd.DataFrame(data=imputer.transform(X_train_nan[categorical_features]), columns=categorical_features, index=X_train_nan.index)

| | workclass | marital.status | occupation | relationship | race | sex | native.country |
|---|---|---|---|---|---|---|---|
| 17064 | Private | Never-married | Adm-clerical | Own-child | Asian-Pac-Islander | Female | United-States |
| 18434 | Private | Never-married | Other-service | Own-child | Black | Male | United-States |
| 3294 | Private | Married-civ-spouse | Exec-managerial | Husband | White | Male | United-States |
| 31317 | Private | Married-civ-spouse | Other-service | Husband | White | Male | United-States |
| 4770 | Self-emp-inc | Married-civ-spouse | Sales | Husband | White | Male | United-States |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 28636 | Private | Married-civ-spouse | Machine-op-inspct | Wife | White | Female | United-States |
| 17730 | Private | Never-married | Other-service | Unmarried | Black | Female | United-States |
| 28030 | Private | Married-civ-spouse | Adm-clerical | Wife | White | Female | United-States |
| 15725 | Private | Never-married | Craft-repair | Not-in-family | White | Male | Columbia |


But, when we made the `ColumnTransformer` we referred to columns _by name_ rather than by index:

In [88]:
preprocessor = ColumnTransformer([
    ('scale', StandardScaler(), numeric_features),
    # Note: `sparse=False` is spelled `sparse_output=False` in scikit-learn >= 1.2
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False), categorical_features)])

In [89]:
numeric_features

['age',
 'fnlwgt',
 'education.num',
 'capital.gain',
 'capital.loss',
 'hours.per.week']

In [90]:
categorical_features

['workclass',
 'marital.status',
 'occupation',
 'relationship',
 'race',
 'sex',
 'native.country']

It can't use these names if it's getting a numpy array where the columns aren't named.

In [None]:
pipe = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('preproc', preprocessor), # <-- this is the ColumnTransformer!
    ('lr', LogisticRegression(max_iter=1000))
])
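
The failure can be reproduced in isolation. The following is a small sketch on made-up data (the `toy` dataframe is hypothetical): the imputer returns a numpy array, and a `ColumnTransformer` that selects columns by name then refuses to fit on it.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data with one missing categorical value
toy = pd.DataFrame({
    'age': [25.0, 40.0, 55.0, 40.0],
    'workclass': pd.Series(['Private', np.nan, 'Private', 'Local-gov'], dtype=object),
})

# The imputer's output is a numpy array: the column names are gone
imputed = SimpleImputer(strategy='most_frequent').fit_transform(toy)
print(type(imputed))

# Selecting columns by name needs a dataframe, so this fails on the array
by_name = ColumnTransformer([('scale', StandardScaler(), ['age'])])
try:
    by_name.fit(imputed)
    raised = False
except ValueError as err:
    raised = True
    print('ValueError:', err)
```

The exact wording of the error message may differ between scikit-learn versions, but the cause is the same as in the pipeline above.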

## Q&A

(Pause for Q&A)

<br><br><br><br>

#### Attempt 2: separate pipeline for each feature type

Let's make a pipeline for the categorical features (preprocessing only):

In [91]:
pipe_cat = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    # Note: `sparse=False` is spelled `sparse_output=False` in scikit-learn >= 1.2
    ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))
])

We don't need a pipeline for the numerical features because there's only one step, scaling.

Now, let's put these together to make a `ColumnTransformer` for all the preprocessing:

In [92]:
preprocessor = ColumnTransformer([
    ('cat', pipe_cat, categorical_features),
    ('num', StandardScaler(), numeric_features)
])

- Take a minute to digest this... what does it do?
- When ready, let's combine this with the classifier in _another pipeline_:

In [93]:
pipe = Pipeline([
    ('preprocessing', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

In [94]:
pipe.fit(X_train_nan, y_train);

Wow, what just happened here?

- We fit the imputer, one-hot encoder, standard scaler, and logistic regression.
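
To check that all four pieces really were fit, we can dig into the fitted objects by name. This is a sketch on a tiny made-up dataset (`X_toy`, `y_toy` and the `toy_` variable names are hypothetical); the step names mirror the ones used above.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data (not the census data)
X_toy = pd.DataFrame({
    'age': [25.0, 40.0, 55.0, 33.0, 61.0, 29.0],
    'workclass': pd.Series(
        ['Private', np.nan, 'Private', 'Local-gov', 'Private', 'Local-gov'],
        dtype=object),
})
y_toy = ['<=50K', '>50K', '>50K', '<=50K', '>50K', '<=50K']

toy_pipe_cat = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
toy_preprocessor = ColumnTransformer([
    ('cat', toy_pipe_cat, ['workclass']),
    ('num', StandardScaler(), ['age']),
])
toy_pipe = Pipeline([
    ('preprocessing', toy_preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(X_toy, y_toy)

# Walk down the nesting: pipeline -> ColumnTransformer -> inner pipeline
fitted_ct = toy_pipe.named_steps['preprocessing']
fitted_cat = fitted_ct.named_transformers_['cat']

print(fitted_cat.named_steps['impute'].statistics_)    # value used to fill missing entries
print(fitted_cat.named_steps['ohe'].categories_)       # categories seen during fit
print(fitted_ct.named_transformers_['num'].mean_)      # mean of 'age' learned by the scaler
print(toy_pipe.named_steps['classifier'].coef_.shape)  # one weight per transformed column
```

Each of these attributes only exists after fitting, so being able to print them shows that a single `fit` call reached every nested step.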

In [95]:
pipe.predict(X_test_nan)

array(['<=50K', '>50K', '<=50K', ..., '<=50K', '<=50K', '<=50K'],
      dtype=object)

In [96]:
cross_validate(pipe, X_train_nan, y_train)

{'fit_time': array([1.68061996, 1.45436502, 1.71592689, 1.49593711, 1.52128315]),
 'score_time': array([0.03714681, 0.03344703, 0.03336501, 0.03313684, 0.03448796]),
 'test_score': array([0.84932821, 0.84913628, 0.84414587, 0.85717028, 0.85486658])}

- A LOT of steps just happened here!
- This is so cool (if you ask me)

Two images for this:

In [108]:
set_config(display='diagram')
pipe

<img src="img/pipeline_columntransformer.png" width=800>

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Hyperparameter search, revisited (5 min)

- Let's do hyperparameter optimization on this pipeline.
- Let's optimize `C` from logistic regression and `strategy` from `SimpleImputer` (either `most_frequent` or `constant`).

In [None]:
hypers = {
    'classifier__C' : [0.01, 0.1, 1, 10, 100]
}

- How do we access the imputer strategy?
- Well, it's the strategy of the imputer of the categorical part of the preprocessing...

In [97]:
hypers = {
    'classifier__C' : [0.01, 0.1, 1, 10, 100],
    'preprocessing__cat__impute__strategy' : ['most_frequent', 'constant']
}

- We are indexing into the mess of pipelines and `ColumnTransformer`s here.
- Note that we're using the names given when we made those objects, _not_ the Python variable names.
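
If you're unsure what the double-underscore name should be, `get_params()` lists every valid one. Below is a small sketch that rebuilds the same nested structure with hypothetical column names (no data is needed just to list the parameters):

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same structure and step names as above; the column names are placeholders
toy_pipe_cat = Pipeline([
    ('impute', SimpleImputer(strategy='most_frequent')),
    ('ohe', OneHotEncoder(handle_unknown='ignore')),
])
toy_preprocessor = ColumnTransformer([
    ('cat', toy_pipe_cat, ['workclass']),
    ('num', StandardScaler(), ['age']),
])
toy_pipe = Pipeline([
    ('preprocessing', toy_preprocessor),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# Every key here is a legal name to use in a hyperparameter grid
param_names = sorted(toy_pipe.get_params().keys())

# Pick out the two hyperparameters we want to search over
searchable = [name for name in param_names
              if name.endswith('__strategy') or name.endswith('__C')]
print(searchable)
```

The names are built from the step names (`'preprocessing'`, `'cat'`, `'impute'`, `'classifier'`), joined by `__`, which is why the Python variable names never appear.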

In [98]:
searcher = GridSearchCV(pipe, hypers, n_jobs=-1, verbose=2, return_train_score=True)

In [99]:
searcher.fit(X_train_nan, y_train);

Fitting 5 folds for each of 10 candidates, totalling 50 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed:    9.4s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   26.6s finished


In [100]:
columns = [
    'mean_test_score', 'mean_train_score', 'mean_fit_time', 'rank_test_score', 'param_classifier__C', 'param_preprocessing__cat__impute__strategy'
]
pd.DataFrame(searcher.cv_results_)[columns].sort_values(by=['rank_test_score'])

| | mean_test_score | mean_train_score | mean_fit_time | rank_test_score | param_classifier__C | param_preprocessing__cat__impute__strategy |
|---|---|---|---|---|---|---|
| 5 | 0.851659 | 0.853492 | 2.342403 | 1 | 1.0 | constant |
| 3 | 0.851467 | 0.852733 | 0.900799 | 2 | 0.1 | constant |
| 7 | 0.851313 | 0.853703 | 5.383548 | 3 | 10.0 | constant |
| 9 | 0.851237 | 0.853626 | 3.53007 | 4 | 100.0 | constant |
| 4 | 0.850929 | 0.852695 | 4.088326 | 5 | 1.0 | most_frequent |
| 2 | 0.850737 | 0.851956 | 2.270041 | 6 | 0.1 | most_frequent |
| 8 | 0.850661 | 0.852897 | 6.67337 | 7 | 100.0 | most_frequent |
| 6 | 0.850584 | 0.852849 | 7.300103 | 8 | 10.0 | most_frequent |
| 1 | 0.848511 | 0.849125 | 0.439311 | 9 | 0.01 | constant |
| 0 | 0.848434 | 0.848904 | 1.715281 | 10 | 0.01 | most_frequent |


Interestingly, the range of scores is super small here (the mean validation scores only run from about 0.848 to 0.852); it seems like these hyperparameters don't matter too much in this case.

In [101]:
searcher.score(X_test_nan, y_test)

0.8519883310302472

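The test score (about 0.852) is essentially the same as the best cross-validation score above, so it looks like overfitting on the validation set was not a serious issue here, most likely because the dataset is fairly large: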

In [102]:
X_train_nan.shape

(26048, 13)

## Q&A

(Pause for Q&A)

<br><br><br><br>

## Summary (5 min)

- We can use `SimpleImputer` to impute values where data is missing.
- Feature scaling...
  - improves performance for some models (so far: logistic regression but not decision trees)
  - is generally a good idea for numeric features
  - I'll say more about `StandardScaler` vs. `MinMaxScaler` later in the course
- `ColumnTransformer` is great for more complex pipelines, though it's not simple!
  - It allows us to perform different operations (possibly pipelines!) on different sets of columns.
  - We have to carefully reference the hyperparameters in `GridSearchCV` and `RandomizedSearchCV`.
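
The same naming scheme carries over to `RandomizedSearchCV`, which we did not run above. Here is a minimal sketch on synthetic data (the data, the `toy_` names, and the choice of `loguniform(1e-2, 1e2)`, `n_iter=5` and `cv=3` are all assumptions made for illustration):

```python
import numpy as np
import pandas as pd
from scipy.stats import loguniform
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic toy data (not the census data)
rng = np.random.RandomState(0)
n = 80
X_toy = pd.DataFrame({
    'age': rng.randint(18, 70, size=n).astype(float),
    'workclass': pd.Series(
        rng.choice(['Private', 'Local-gov', 'Self-emp'], size=n).tolist(),
        dtype=object),
})
# Make some categorical values missing
X_toy.loc[rng.choice(n, size=10, replace=False), 'workclass'] = np.nan
y_toy = np.where(X_toy['age'] > 40, '>50K', '<=50K')

toy_pipe = Pipeline([
    ('preprocessing', ColumnTransformer([
        ('cat', Pipeline([
            ('impute', SimpleImputer(strategy='most_frequent')),
            ('ohe', OneHotEncoder(handle_unknown='ignore')),
        ]), ['workclass']),
        ('num', StandardScaler(), ['age']),
    ])),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# A distribution for C (sampled), a list for the imputer strategy (picked from)
param_distributions = {
    'classifier__C': loguniform(1e-2, 1e2),
    'preprocessing__cat__impute__strategy': ['most_frequent', 'constant'],
}

search = RandomizedSearchCV(toy_pipe, param_distributions,
                            n_iter=5, cv=3, random_state=0)
search.fit(X_toy, y_toy)

print(search.best_params_)
print(search.best_score_)
```

Unlike the grid search, this tries only `n_iter` sampled combinations, so `C` can be drawn from a continuous range instead of a fixed list.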