## <font color='darkblue'>Office Hours Session2</font>
This notebook follows [**Office Hours: Session2**](https://www.crowdcast.io/e/ml-course/4). Before going through the questions, let's paste the code from the [previous course](https://github.com/johnklee/ml_articles/blob/master/dataschool/20200423_BuildingAnEffectiveMLWorkflowWithScikit-learn/sklearn_workflow.ipynb):

In [1]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# Columns as features
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

# Training data
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

# Testing data
df_new = pd.read_csv('http://bit.ly/kaggletest')
X_new = df_new[cols]

# Build column transformer
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()

imp_ohe = make_pipeline(imp_constant, ohe)
vect = CountVectorizer()
imp = SimpleImputer()

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

logreg = LogisticRegression(solver='liblinear', random_state=1)

# Build pipeline
pipe = make_pipeline(ct, logreg)

# Training & Prediction
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

### K: Why did you create the `imp_ohe` pipeline? Why didn't you instead add `imp_constant` to the ColumnTransformer?

In [2]:
# Original/suggested way
X_tiny = X[:10]
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()

imp_ohe = make_pipeline(imp_constant, ohe)
make_column_transformer(
    (imp_ohe, ['Embarked']),
    remainder='drop').fit_transform(X_tiny)

array([[0., 0., 1.],
       [1., 0., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 1., 0.],
       [0., 0., 1.],
       [0., 0., 1.],
       [0., 0., 1.],
       [1., 0., 0.]])

In [3]:
# Problematic approach
make_column_transformer(
    (imp_constant, ['Embarked']),
    (ohe, ['Embarked']),
    remainder='drop').fit_transform(X_tiny)

array([['S', 0.0, 0.0, 1.0],
       ['C', 1.0, 0.0, 0.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['Q', 0.0, 1.0, 0.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['S', 0.0, 0.0, 1.0],
       ['C', 1.0, 0.0, 0.0]], dtype=object)

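A ColumnTransformer applies each transformer to its columns independently and stacks the outputs side by side, whereas a Pipeline runs its steps in sequence (the output of the first step is the input of the second step). That is why the problematic approach returns the imputed column next to the one-hot columns instead of one-hot encoding the imputed values.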
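
The difference can be reproduced without the Titanic file. The sketch below uses a hypothetical four-row frame named `toy` (an assumption for illustration, not data from the session) and compares the output shapes of the sequential and the parallel arrangement:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data (not the Titanic file): one categorical column
toy = pd.DataFrame({'Embarked': pd.Series(['S', 'C', 'S', 'Q'], dtype=object)})

imp_constant = SimpleImputer(strategy='constant', fill_value='missing')

# Sequential: the imputer's output becomes the encoder's input
sequential = make_pipeline(imp_constant, OneHotEncoder())
seq_out = sequential.fit_transform(toy[['Embarked']])

# Parallel: each transformer sees the original column, outputs are stacked
parallel = make_column_transformer(
    (imp_constant, ['Embarked']),
    (OneHotEncoder(), ['Embarked']),
    remainder='drop')
par_out = parallel.fit_transform(toy)

print(seq_out.shape)  # only the one-hot columns
print(par_out.shape)  # the imputed column plus the one-hot columns
```

The parallel version has one extra column because the imputed strings are passed through unchanged instead of being encoded.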

### Justin: What's the cost versus the benefit of adding so many columns for Name?

In [4]:
from sklearn.model_selection import cross_val_score

%time cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

Wall time: 126 ms


0.8114619295712762

In [5]:
ct_no = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    ('drop', 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

In [6]:
no_name = make_pipeline(ct_no, logreg)
%time cross_val_score(no_name, X, y, cv=5, scoring='accuracy').mean()

Wall time: 67.8 ms


0.7833908731404181

In [7]:
only_name = make_pipeline(vect, logreg)
%time cross_val_score(only_name, X['Name'], y, cv=5, scoring='accuracy').mean()

Wall time: 58.8 ms


0.7945954428472788

Including the `Name` column gives better accuracy (0.811 versus 0.783 without it). The cost is a longer execution time (126 ms versus 67.8 ms in the runs above).
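
To see where the extra columns come from, here is a minimal sketch that vectorizes three of the passenger names on their own. The list `names` is just a small sample picked for illustration; on the full training set the vocabulary is much larger.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three passenger names taken as a small sample
names = ['Braund, Mr. Owen Harris',
         'Heikkinen, Miss. Laina',
         'Allen, Mr. William Henry']

vect = CountVectorizer()
dtm = vect.fit_transform(names)

# One column per distinct lowercase token of two or more characters
print(dtm.shape)
print(sorted(vect.vocabulary_))
```

Every distinct word becomes its own feature, so the number of columns grows with the number of distinct words in `Name`.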

### Anton: What's the target accuracy that you are trying to reach?
It depends on your time, resources and plan.


### Motasem: When using cross_val_score, is the imputation value calculated separately for each fold?

In [8]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

pipe = make_pipeline(ct, logreg)
%time cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

Wall time: 125 ms


0.8114619295712762
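
Because the imputer is a step inside the pipeline, `cross_val_score` refits it on the training portion of every fold, so the imputation value is computed separately each time. The sketch below checks this on hypothetical toy data (`toy_X`, `toy_y` and `toy_pipe` are illustrative names, not part of the session) by using `cross_validate` with `return_estimator=True` to keep the fitted pipelines:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical toy data with two missing values
toy_X = pd.DataFrame({'Age': [1, 2, np.nan, 4, 5, 6, 7, 8, np.nan, 10, 11, 12]})
toy_y = np.array([0, 1] * 6)

toy_pipe = make_pipeline(
    SimpleImputer(),
    LogisticRegression(solver='liblinear', random_state=1))

# return_estimator=True keeps the pipeline fitted on each training fold
results = cross_validate(toy_pipe, toy_X, toy_y, cv=3, return_estimator=True)

# The mean learned by the imputer in each fold
fold_means = [est.named_steps['simpleimputer'].statistics_[0]
              for est in results['estimator']]
print(fold_means)
```

The printed means differ from fold to fold, which shows that no information from the held-out rows leaks into the imputation.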

### Hause: Regarding cross_val_score, what algorithm does scikit-learn use to split the data into different folds? Can you examine the data used in each fold?

In [9]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy')

array([0.79888268, 0.8258427 , 0.80337079, 0.78651685, 0.84269663])

In [10]:
from sklearn.model_selection import StratifiedKFold

kf = StratifiedKFold(5)
cross_val_score(pipe, X, y, cv=kf, scoring='accuracy')

array([0.79888268, 0.8258427 , 0.80337079, 0.78651685, 0.84269663])

For a classifier, an integer `cv` makes `cross_val_score` split the data with [**StratifiedKFold**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html). You can inspect the training and testing indices of each fold with the sample code below:

In [11]:
kfold_splits = list(kf.split(X, y))
first_split_train_indices, first_split_test_indices = kfold_splits[0]
print("first split of training indices ({:,d}):\n{}".format(len(first_split_train_indices), first_split_train_indices))
print("first split of testing indices ({:,d}):\n{}".format(len(first_split_test_indices), first_split_test_indices))

first split of training indices (712):
[168 169 170 171 173 174 175 176 177 178 179 180 181 182 185 188 189 191
 196 197 199 200 201 202 203 204 205 206 207 208 209 210 211 212 213 214
 215 216 217 218 219 220 221 222 223 224 225 226 227 228 229 230 231 232
 233 234 235 236 237 238 239 240 241 242 243 244 245 246 247 248 249 250
 251 252 253 254 255 256 257 258 259 260 261 262 263 264 265 266 267 268
 269 270 271 272 273 274 275 276 277 278 279 280 281 282 283 284 285 286
 287 288 289 290 291 292 293 294 295 296 297 298 299 300 301 302 303 304
 305 306 307 308 309 310 311 312 313 314 315 316 317 318 319 320 321 322
 323 324 325 326 327 328 329 330 331 332 333 334 335 336 337 338 339 340
 341 342 343 344 345 346 347 348 349 350 351 352 353 354 355 356 357 358
 359 360 361 362 363 364 365 366 367 368 369 370 371 372 373 374 375 376
 377 378 379 380 381 382 383 384 385 386 387 388 389 390 391 392 393 394
 395 396 397 398 399 400 401 402 403 404 405 406 407 408 409 410 411 412
 413 414 415

In [12]:
# Shuffle the rows before splitting if you want randomized folds
kf = StratifiedKFold(5, shuffle=True, random_state=5)
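
What "stratified" means can be checked on a small example. This is a minimal sketch on a hypothetical label vector `toy_y` with six negatives and three positives (an assumption for illustration); each test fold should receive the same share of positives:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical labels: 6 negatives and 3 positives
toy_y = np.array([0] * 6 + [1] * 3)
toy_X = np.zeros((len(toy_y), 1))  # feature values do not affect the split

skf = StratifiedKFold(3, shuffle=True, random_state=5)

positives_per_fold = []
for train_idx, test_idx in skf.split(toy_X, toy_y):
    # Count the positive samples that landed in this test fold
    positives_per_fold.append(int(toy_y[test_idx].sum()))
    print(test_idx, toy_y[test_idx])
```

Shuffling changes which rows end up in each fold, but the class proportions per fold stay the same.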

### Leona: Would you recommend using a validation set when tuning the model's hyperparameters?
Depends on your time and resources.

### JV: What is the difference between FeatureUnion and ColumnTransformer?

In [13]:
SimpleImputer(add_indicator=True).fit_transform(X_tiny[['Age']])

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

In [14]:
imp.fit_transform(X_tiny[['Age']])

array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [28.11111111],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ]])

In [15]:
from sklearn.impute import MissingIndicator
indicator = MissingIndicator()
indicator.fit_transform(X_tiny[['Age']])

array([[False],
       [False],
       [False],
       [False],
       [False],
       [ True],
       [False],
       [False],
       [False],
       [False]])

We use [`make_union`](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_union.html) to stack the result side by side:

In [16]:
from sklearn.pipeline import make_union
imp_indicator = make_union(imp, indicator)
imp_indicator.fit_transform(X_tiny[['Age']])

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

[**FeatureUnion**](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.FeatureUnion.html#sklearn.pipeline.FeatureUnion) applies several transformers to the same input and stacks their outputs side by side. A ColumnTransformer can do the same thing, and it also lets you choose which columns each transformer receives. For example:

In [17]:
make_column_transformer(
    (imp_indicator, ['Age']),
    remainder='drop'
).fit_transform(X_tiny)

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

In [18]:
make_column_transformer(
    (imp, ['Age']),
    (indicator, ['Age']),
    remainder='drop'
).fit_transform(X_tiny)

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])
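
The two tools differ once more than one column is involved. In the sketch below (hypothetical frame `toy`, not data from the session), the union sends both columns to both transformers, while the column transformer sends each column to exactly one transformer:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import MissingIndicator, SimpleImputer
from sklearn.pipeline import make_union

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame({'Age': [22.0, np.nan, 26.0],
                    'Fare': [7.25, 71.28, np.nan]})

# FeatureUnion: every transformer receives ALL input columns
union = make_union(SimpleImputer(), MissingIndicator())
union_out = union.fit_transform(toy)

# ColumnTransformer: each transformer receives only the listed columns
ct = make_column_transformer(
    (SimpleImputer(), ['Age']),
    (MissingIndicator(), ['Fare']),
    remainder='drop')
ct_out = ct.fit_transform(toy)

print(union_out.shape)  # imputed Age and Fare + indicators for Age and Fare
print(ct_out)           # imputed Age + indicator for Fare only
```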

### Elin: How would you add feature selection to our Pipeline?
This section introduces [**SelectPercentile**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectPercentile.html) and [**SelectFromModel**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectFromModel.html#sklearn.feature_selection.SelectFromModel), which help you select good features from all available columns.

In [19]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8114619295712762

In [20]:
from sklearn.feature_selection import SelectPercentile, chi2

selection = SelectPercentile(chi2, percentile=50)

In [21]:
pipe_selection = make_pipeline(ct, selection, logreg)

In [22]:
cross_val_score(pipe_selection, X, y, cv=5, scoring='accuracy').mean()

0.8193019898311469

In [23]:
logreg_selection = LogisticRegression(solver='liblinear', penalty='l1', random_state=1)

In [24]:
from sklearn.feature_selection import SelectFromModel
selection = SelectFromModel(logreg_selection, threshold='mean')

In [25]:
pipe_selection = make_pipeline(ct, selection, logreg)

In [26]:
cross_val_score(pipe_selection, X, y, cv=5, scoring='accuracy').mean()

0.8248885820099178
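
To make the selection step less of a black box, here is a minimal sketch of `SelectPercentile` with `chi2` on a hypothetical count matrix (`toy_X` and `toy_y` are made up for illustration). The first two features separate the classes, the last two barely vary between them:

```python
import numpy as np
from sklearn.feature_selection import SelectPercentile, chi2

# Hypothetical non-negative count features (chi2 requires non-negative input)
toy_X = np.array([[5, 0, 1, 1],
                  [4, 0, 1, 2],
                  [0, 5, 1, 1],
                  [0, 4, 2, 1]])
toy_y = np.array([0, 0, 1, 1])

# Keep the 50% of features with the highest chi-squared score
selector = SelectPercentile(chi2, percentile=50)
reduced = selector.fit_transform(toy_X, toy_y)

print(selector.scores_)         # one score per input feature
print(selector.get_support())   # boolean mask of the kept features
print(reduced.shape)
```

Inside a pipeline the selector is refit on every training fold, so the kept features can differ between folds.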

### Khaled: How would you add feature standardization to our Pipeline?
Scaling is not always needed, but if you want to standardize features you can add a scaler as below:

In [27]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()

In [28]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

In [29]:
imp_scaler = make_pipeline(imp, scaler)

In [30]:
ct_scaler = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_scaler, ['Age', 'Fare']),
    remainder='passthrough')

In [31]:
pipe_scaler = make_pipeline(ct_scaler, logreg)

In [32]:
cross_val_score(pipe_scaler, X, y, cv=5, scoring='accuracy').mean()

0.8092210156299039

In [33]:
# For a sparse matrix, you may consider the scaler below
from sklearn.preprocessing import MaxAbsScaler
scaler = MaxAbsScaler()

In [34]:
pipe_scaler = make_pipeline(ct_scaler, scaler, logreg)

In [35]:
cross_val_score(pipe_scaler, X, y, cv=5, scoring='accuracy').mean()

0.8103320569957944
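
The reason for switching scalers on the column transformer's output is that it can be a sparse matrix. The sketch below uses a hypothetical 3x3 sparse matrix `toy` (an assumption for illustration) to show that `MaxAbsScaler` keeps the matrix sparse, while `StandardScaler` with its default centering refuses sparse input:

```python
from scipy import sparse
from sklearn.preprocessing import MaxAbsScaler, StandardScaler

# Hypothetical sparse matrix with four non-zero entries
toy = sparse.csr_matrix([[0.0, 2.0, 0.0],
                         [4.0, 0.0, 0.0],
                         [0.0, 1.0, 8.0]])

# MaxAbsScaler divides each column by its maximum absolute value,
# so zeros stay zeros and the result is still sparse
scaled = MaxAbsScaler().fit_transform(toy)
print(sparse.issparse(scaled), scaled.nnz)

# StandardScaler centers by default, which would destroy sparsity
try:
    StandardScaler().fit_transform(toy)
    centering_failed = False
except ValueError as err:
    centering_failed = True
    print(err)
```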

### Hussain: Should we scale all features, or only the features that were originally numerical?
Keep things simple: apply the scaler to all features first, then experiment to see whether another setup gives a better fit.


### Gaurav: How would you add outlier handling to our Pipeline?
So far, scikit-learn does not support removing rows (such as outliers) inside a Pipeline.

### Leona: How would you adapt this Pipeline to use a different model, such as a RandomForestClassifier?
The hyperparameter tuning steps written for one model may not work for another, and the column transformer may also need to be customized from model to model. Therefore, you may have to build a pipeline for each model, although some steps can often be reused.
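
As a rough sketch of that reuse, the code below plugs the same column transformer into two pipelines with different final estimators. The frame `toy_X`, the labels `toy_y` and the reduced transformer `toy_ct` are hypothetical stand-ins, and `n_estimators=10` is an arbitrary choice made to keep the example small:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data with one categorical and one numeric column
toy_X = pd.DataFrame({
    'Sex': pd.Series(['male', 'female', 'female', 'male',
                      'male', 'female', 'male', 'female'], dtype=object),
    'Age': [22.0, 38.0, 26.0, np.nan, 35.0, 27.0, 54.0, 14.0]})
toy_y = np.array([0, 1, 1, 0, 0, 1, 0, 1])

# The preprocessing is defined once and shared by both pipelines
toy_ct = make_column_transformer(
    (OneHotEncoder(), ['Sex']),
    (SimpleImputer(), ['Age']))

models = {
    'logreg': LogisticRegression(solver='liblinear', random_state=1),
    'forest': RandomForestClassifier(n_estimators=10, random_state=1),
}

predictions = {}
for name, model in models.items():
    # Only the last step changes from pipeline to pipeline
    toy_pipe = make_pipeline(toy_ct, model)
    toy_pipe.fit(toy_X, toy_y)
    predictions[name] = toy_pipe.predict(toy_X)
    print(name, predictions[name])
```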

### DS: How can I include custom transformations for feature engineering within a Pipeline?


In [36]:
df_tiny = df[:10]
df_tiny

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [37]:
import numpy as np
np.floor(df_tiny[['Age', 'Fare']])

Unnamed: 0,Age,Fare
0,22.0,7.0
1,38.0,71.0
2,26.0,7.0
3,35.0,53.0
4,35.0,8.0
5,,8.0
6,54.0,51.0
7,2.0,21.0
8,27.0,11.0
9,14.0,30.0


In [38]:
from sklearn.preprocessing import FunctionTransformer

In [39]:
get_floor = FunctionTransformer(np.floor)

In [40]:
get_floor.fit_transform(df_tiny[['Age', 'Fare']])

Unnamed: 0,Age,Fare
0,22.0,7.0
1,38.0,71.0
2,26.0,7.0
3,35.0,53.0
4,35.0,8.0
5,,8.0
6,54.0,51.0
7,2.0,21.0
8,27.0,11.0
9,14.0,30.0


In [41]:
make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    remainder='drop'
).fit_transform(df_tiny)

array([[22.,  7.],
       [38., 71.],
       [26.,  7.],
       [35., 53.],
       [35.,  8.],
       [nan,  8.],
       [54., 51.],
       [ 2., 21.],
       [27., 11.],
       [14., 30.]])

Above, we defined `get_floor` to apply a custom function to column(s) by wrapping it in a [**FunctionTransformer**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.FunctionTransformer.html). Below is another example:

In [42]:
# 2D input -> 2D output
df_tiny[['Cabin']].apply(lambda x: x.str.slice(0, 1))

Unnamed: 0,Cabin
0,
1,C
2,
3,C
4,
5,
6,E
7,
8,
9,


In [43]:
def first_letter(df):
    # We can not guarantee that `df` is always a data frame
    # So we do a force type cast here
    return pd.DataFrame(df).apply(lambda x: x.str.slice(0, 1))

In [44]:
get_first_letter = FunctionTransformer(first_letter)

In [45]:
make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    (get_first_letter, ['Cabin']),
    remainder='drop'
).fit_transform(df_tiny)

array([[22.0, 7.0, nan],
       [38.0, 71.0, 'C'],
       [26.0, 7.0, nan],
       [35.0, 53.0, 'C'],
       [35.0, 8.0, nan],
       [nan, 8.0, nan],
       [54.0, 51.0, 'E'],
       [2.0, 21.0, nan],
       [27.0, 11.0, nan],
       [14.0, 30.0, nan]], dtype=object)

Another example is to get the sum of two columns:

In [46]:
df_tiny[['SibSp', 'Parch']].sum(axis=1)

0    1
1    1
2    0
3    1
4    0
5    0
6    0
7    4
8    2
9    1
dtype: int64

In [47]:
def sum_cols(df):
    return np.array(df).sum(axis=1).reshape(-1, 1)

In [48]:
sum_cols(df_tiny[['SibSp', 'Parch']])

array([[1],
       [1],
       [0],
       [1],
       [0],
       [0],
       [0],
       [4],
       [2],
       [1]], dtype=int64)

In [51]:
get_sum = FunctionTransformer(sum_cols)

In [52]:
make_column_transformer(
    (get_floor, ['Age', 'Fare']),
    (get_first_letter, ['Cabin']),
    (get_sum, ['SibSp', 'Parch']),
    remainder='drop'
).fit_transform(df_tiny)

array([[22.0, 7.0, nan, 1],
       [38.0, 71.0, 'C', 1],
       [26.0, 7.0, nan, 0],
       [35.0, 53.0, 'C', 1],
       [35.0, 8.0, nan, 0],
       [nan, 8.0, nan, 0],
       [54.0, 51.0, 'E', 0],
       [2.0, 21.0, nan, 4],
       [27.0, 11.0, nan, 2],
       [14.0, 30.0, nan, 1]], dtype=object)

In [53]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age', 'Cabin', 'SibSp']

In [54]:
X = df[cols]
X_new = df_new[cols]

In [55]:
imp_floor = make_pipeline(imp, get_floor)

In [57]:
# We got missing values in column `Cabin`
X['Cabin'].str.slice(0, 1).value_counts(dropna=False)

NaN    687
C       59
B       47
D       33
E       32
A       15
F       13
G        4
T        1
Name: Cabin, dtype: int64

In [60]:
# When this parameter is set to 'ignore' and an unknown category is encountered during transform,
# the resulting one-hot encoded columns for this feature will be all zeros
ohe_ignore = OneHotEncoder(handle_unknown='ignore')
letter_imp_ohe = make_pipeline(get_first_letter, imp_constant, ohe_ignore)
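
The effect of `handle_unknown='ignore'` is easy to verify in isolation. In this minimal sketch the encoder is fitted on three hypothetical cabin letters and then asked to transform a letter it has never seen:

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

ohe_ignore = OneHotEncoder(handle_unknown='ignore')

# Fit on three known categories (hypothetical cabin letters)
ohe_ignore.fit(np.array([['A'], ['B'], ['C']], dtype=object))

# 'B' was seen during fit, 'T' was not
encoded = ohe_ignore.transform(np.array([['B'], ['T']], dtype=object)).toarray()
print(encoded)
```

The row for the unseen letter contains only zeros instead of raising an error, which matters here because a letter that appears only once may be missing from some training folds or appear only in the new data.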

In [65]:
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp_floor, ['Age', 'Fare']),
    (letter_imp_ohe, ['Cabin']),
    (get_sum, ['SibSp', 'Parch']),
    remainder='drop'
)

In [66]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [67]:
cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8271420500910175

### JV: How can I keep up-to-date with new features that are released in scikit-learn?
Check the [scikit-learn documentation](https://scikit-learn.org/stable/).