This notebook collects notes from the DataSchool lesson [**Building an Effective ML Workflow with scikit-learn**](https://www.crowdcast.io/e/ml-course):

<font size=3>**Outline:**</font>
1. [Review of the basic Machine Learning workflow](#part1)
2. [Encoding categorical data](#part2)
3. [Using ColumnTransformer and Pipeline](#part3)
4. [Encoding text data](#part4)
5. [Handling missing values](#part5)
6. [Switching to the full dataset](#part6)
7. [Evaluating and tuning a Pipeline](#part7)

<a id='part1'></a>
## <font color='darkblue'>Part1 - Review of the basic Machine Learning workflow</font>

In [1]:
# Make sure your scikit-learn version is 0.22.x or later
import sklearn

print(sklearn.__version__)  # This notebook was run with a 0.22.x version

0.22.2.post1


Load the Kaggle [**Titanic: Machine Learning from Disaster**](https://www.kaggle.com/c/titanic) dataset and show a few of its records.

In [2]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Next, we extract `X` (the features) and `y` (the target class) from the dataset:

In [3]:
# Use columns `Parch` and `Fare` as features
X = df[['Parch', 'Fare']]
# df[['Survived']] gives you a DataFrame, while df['Survived'] gives you a Series
y = df['Survived']  # the target class
y.shape

(10,)
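The difference between the two selection styles is easy to check on a small frame. The sketch below uses a hypothetical three-row `toy` DataFrame (not the Titanic data) to compare single and double brackets:

```python
import pandas as pd

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({'Survived': [0, 1, 1], 'Fare': [7.25, 71.28, 7.92]})

as_series = toy['Survived']    # single brackets -> 1-D Series
as_frame = toy[['Survived']]   # double brackets -> 2-D DataFrame

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

scikit-learn expects a 2-D feature matrix and a 1-D target, which is why `X` is selected with double brackets and `y` with single brackets.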

Then we use [**LogisticRegression**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to demonstrate the workflow of building an ML model:

In [4]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)

We can then use [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (see [**sklearn.model_selection**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)) to estimate the performance of the model with `cv=3` (k-fold with k=3):

In [5]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()

0.6944444444444443
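To make the `cv=3` argument less of a black box, here is a minimal sketch of what `cross_val_score` does for a classifier: it splits the data into 3 stratified folds, trains on two and scores on the third. The data below is synthetic (`X_toy`, `y_toy` are assumptions for illustration, not the Titanic data):

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, illustrative data: 30 samples, 2 features, balanced classes
rng = np.random.RandomState(0)
y_toy = np.array([0, 1] * 15)
X_toy = rng.normal(size=(30, 2)) + y_toy[:, None]

model = LogisticRegression(solver='liblinear', random_state=1)

# One call does the splitting, fitting and scoring
scores = cross_val_score(model, X_toy, y_toy, cv=3, scoring='accuracy')

# The same procedure written out by hand
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_toy, y_toy):
    fold_model = clone(model).fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

print(scores, scores.mean())
print(manual_scores)
```

Each fold trains a fresh copy of the model, so the estimator passed to `cross_val_score` is left unfitted.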

The cross-validated accuracy is about 0.69. Let's train the model now:

In [6]:
logreg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Let's load only the first 10 rows of the testing dataset to speed up the demonstration:

In [7]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
df_new

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [8]:
X_new = df_new[['Parch', 'Fare']]
X_new

Unnamed: 0,Parch,Fare
0,0,7.8292
1,0,7.0
2,0,9.6875
3,0,8.6625
4,1,12.2875
5,0,9.225
6,0,7.6292
7,1,29.0
8,0,7.2292
9,0,24.15


Let's make predictions on `X_new`:

In [9]:
logreg.predict(X_new)

array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1], dtype=int64)

<a id='part2'></a>
## <font color=darkblue>Part 2 - Encoding categorical data</font>

We need to encode the categorical column, here using [**OneHotEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html):

In [10]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['Embarked']])

<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

You may ask the following questions:
* What is a sparse matrix, and why use one?
* What values does the sparse matrix hold?
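A short sketch may help with both questions. A sparse matrix stores only the non-zero entries, which saves memory because each one-hot encoded column group holds a single 1 per row and zeros everywhere else. The example uses a hypothetical four-row `toy` frame:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column, used only for illustration
toy = pd.DataFrame({'Embarked': ['S', 'C', 'S', 'Q']})

encoded = OneHotEncoder().fit_transform(toy[['Embarked']])

# Only the non-zero entries are stored: one per row here
print(encoded.shape, encoded.nnz)

# Converting to a dense array reveals the zeros that were not stored
print(encoded.toarray())
```

The "10 stored elements" in the output above has the same meaning: one stored 1 for each of the 10 rows.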

You can get the learned categories as below:

In [11]:
# Show the learned category values
ohe.categories_

[array(['C', 'Q', 'S'], dtype=object)]

In [12]:
ohe.get_feature_names()

array(['x0_C', 'x0_Q', 'x0_S'], dtype=object)

In [13]:
import numpy as np

# Show comparison between category and encoded data
transformed_data = ohe.transform(df[['Embarked']]).toarray()
ohe_df = pd.DataFrame(transformed_data, columns=ohe.get_feature_names())
pd.concat([df, ohe_df], axis=1)[['Embarked']+ohe.get_feature_names().tolist()]

Unnamed: 0,Embarked,x0_C,x0_Q,x0_S
0,S,0.0,0.0,1.0
1,C,1.0,0.0,0.0
2,S,0.0,0.0,1.0
3,S,0.0,0.0,1.0
4,S,0.0,0.0,1.0
5,Q,0.0,1.0,0.0
6,S,0.0,0.0,1.0
7,S,0.0,0.0,1.0
8,S,0.0,0.0,1.0
9,C,1.0,0.0,0.0


In [14]:
# Return a dense array instead of a sparse matrix
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['Embarked', 'Sex']])

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])

Whatever you do to the training data, you must also do to the testing data! To avoid duplicating that work, we will use a pipeline to define the transformations in one place.
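Done by hand, "the same thing" means fitting the encoder on the training data only and then reusing it, already fitted, on the testing data. A minimal sketch, assuming two hypothetical frames `train` and `test`:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical frames, used only for illustration
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
test = pd.DataFrame({'Embarked': ['Q', 'S']})

encoder = OneHotEncoder()
encoder.fit(train[['Embarked']])          # learn the categories from training data

# transform (not fit_transform) keeps the column layout learned from training
test_encoded = encoder.transform(test[['Embarked']]).toarray()
print(test_encoded)
```

Calling `fit_transform` on the testing data instead would relearn the categories, and the columns could then differ from what the model was trained on.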

<a id='part3'></a>
## <font color='darkblue'>Part3 - Using ColumnTransformer and Pipeline</font>

In [15]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

X = df[cols]
X.tail(n=10)

Unnamed: 0,Parch,Fare,Embarked,Sex
0,0,7.25,S,male
1,0,71.2833,C,female
2,0,7.925,S,female
3,0,53.1,S,female
4,0,8.05,S,male
5,0,8.4583,Q,male
6,0,51.8625,S,male
7,1,21.075,S,male
8,2,11.1333,S,female
9,0,30.0708,C,female


Now we are going to use the [**sklearn.compose**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) module to define the column transformation process.

In [16]:
from sklearn.compose import make_column_transformer

ohe = OneHotEncoder()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough'
)

In [17]:
transformed_data = ct.fit_transform(X)
transformed_data

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])

In [18]:
ct_df = pd.DataFrame(transformed_data, columns=['e1', 'e2', 'e3', 's1', 's2', 'Parch', 'Fare'])
ct_df

Unnamed: 0,e1,e2,e3,s1,s2,Parch,Fare
0,0.0,0.0,1.0,0.0,1.0,0.0,7.25
1,1.0,0.0,0.0,1.0,0.0,0.0,71.2833
2,0.0,0.0,1.0,1.0,0.0,0.0,7.925
3,0.0,0.0,1.0,1.0,0.0,0.0,53.1
4,0.0,0.0,1.0,0.0,1.0,0.0,8.05
5,0.0,1.0,0.0,0.0,1.0,0.0,8.4583
6,0.0,0.0,1.0,0.0,1.0,0.0,51.8625
7,0.0,0.0,1.0,0.0,1.0,1.0,21.075
8,0.0,0.0,1.0,1.0,0.0,2.0,11.1333
9,1.0,0.0,0.0,1.0,0.0,0.0,30.0708
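The column names above were typed by hand. As a sketch of an alternative, the names can be derived from the fitted encoder's `categories_`, followed by the passthrough columns in their original order. The `toy` frame and the `Column_category` naming scheme are assumptions made for illustration:

```python
import pandas as pd
from scipy import sparse
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Parch': [0, 1, 2],
    'Fare': [7.25, 71.28, 8.05],
    'Embarked': ['S', 'C', 'Q'],
    'Sex': ['male', 'female', 'male'],
})
encoded_cols = ['Embarked', 'Sex']

toy_ct = make_column_transformer(
    (OneHotEncoder(), encoded_cols),
    remainder='passthrough',
)
out = toy_ct.fit_transform(toy)
if sparse.issparse(out):      # the result may be sparse or dense
    out = out.toarray()

# Encoded columns come first, in the order of the learned categories
fitted_ohe = toy_ct.named_transformers_['onehotencoder']
names = [f'{col}_{cat}'
         for col, cats in zip(encoded_cols, fitted_ohe.categories_)
         for cat in cats]
# Passthrough columns follow, in their original order
names += [c for c in toy.columns if c not in encoded_cols]

toy_ct_df = pd.DataFrame(out, columns=names)
print(toy_ct_df)
```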


In [19]:
pd.concat([df[['Embarked', 'Sex']], ohe_df], axis=1)

Unnamed: 0,Embarked,Sex,x0_C,x0_Q,x0_S
0,S,male,0.0,0.0,1.0
1,C,female,1.0,0.0,0.0
2,S,female,0.0,0.0,1.0
3,S,female,0.0,0.0,1.0
4,S,male,0.0,0.0,1.0
5,Q,male,0.0,1.0,0.0
6,S,male,0.0,0.0,1.0
7,S,male,0.0,0.0,1.0
8,S,female,0.0,0.0,1.0
9,C,female,1.0,0.0,0.0


Then we build the pipeline with the [**sklearn.pipeline**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline) module:

In [20]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(ct, logreg)

In [21]:
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Embarked', 'Sex'])],
                                   verbose=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                         

In [22]:
# The single call above is similar to the following chain of calls
logreg.fit(ct.fit_transform(X), y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

We can access each step from pipeline as below:

In [23]:
pipe.named_steps

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex'])],
                   verbose=False),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                    warm_start=False)}

In [24]:
pipe.named_steps.logisticregression.coef_

array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])

Now we can make predictions with the pipeline:

In [25]:
X_new = df_new[cols]
X_new

Unnamed: 0,Parch,Fare,Embarked,Sex
0,0,7.8292,Q,male
1,0,7.0,S,female
2,0,9.6875,Q,male
3,0,8.6625,S,male
4,1,12.2875,S,female
5,0,9.225,S,male
6,0,7.6292,Q,female
7,1,29.0,S,male
8,0,7.2292,C,female
9,0,24.15,S,male


In [26]:
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

In [27]:
# Similar to
logreg.predict(ct.transform(X_new))

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)
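The equivalence between the pipeline and the manual call chain can be checked directly. The sketch below uses hypothetical frames (`train`, `new`) and builds separate transformer and model objects for each route, so neither route reuses state fitted by the other:

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data, used only for illustration
train = pd.DataFrame({
    'Fare': [7.25, 71.28, 7.92, 53.10, 8.05, 30.07],
    'Sex': ['male', 'female', 'female', 'female', 'male', 'female'],
})
y_train = pd.Series([0, 1, 1, 1, 0, 1])
new = pd.DataFrame({'Fare': [7.83, 12.29], 'Sex': ['male', 'female']})


def build_parts():
    """Return a fresh (column transformer, model) pair."""
    ct = make_column_transformer((OneHotEncoder(), ['Sex']),
                                 remainder='passthrough')
    model = LogisticRegression(solver='liblinear', random_state=1)
    return ct, model


# Route A: let the pipeline chain the steps
ct_a, model_a = build_parts()
pipe_pred = make_pipeline(ct_a, model_a).fit(train, y_train).predict(new)

# Route B: chain the steps by hand
ct_b, model_b = build_parts()
model_b.fit(ct_b.fit_transform(train), y_train)
manual_pred = model_b.predict(ct_b.transform(new))

print(pipe_pred, manual_pred)
```

Note that new data only goes through `transform`, never `fit_transform`; the pipeline takes care of that distinction inside `predict`.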

## <font color='darkblue'>Recap</font>
https://gist.github.com/justmarkham/6a04f852443a0bc522afc0740dd9cb7f

In [28]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# 0) Select columns as features (must be defined before they are used)
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

# 1) Load data
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']

df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]


# 2) Define pipeline
ohe = OneHotEncoder()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')

logreg = LogisticRegression(solver='liblinear', random_state=1)

pipe = make_pipeline(ct, logreg)

# 3) Training & Making prediction
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

<a id='part4'></a>
## <font color='darkblue'>Part4 - Encoding text data</font>
We are going to create features from text columns with the [**sklearn.feature_extraction**](https://scikit-learn.org/stable/modules/feature_extraction.html) module:

In [29]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [30]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
dtm = vect.fit_transform(df['Name'])
dtm

<10x40 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>

Let's explore the `dtm` object a little:

In [31]:
# Found 40 unique words from column `Name`
print(vect.get_feature_names())

['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']


In [32]:
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,achem,adele,allen,berg,bradley,braund,briggs,cumings,elisabeth,florence,...,nasser,nicholas,oscar,owen,palsson,peel,thayer,timothy,vilhelmina,william
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,1,0,1,1,0,1,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,0,0,0,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
9,1,1,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0


In [33]:
df.loc[0, 'Name']

'Braund, Mr. Owen Harris'
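To see how a single name turns into columns of the document-term matrix, the vectorizer's analyzer can be called directly. This is a small sketch using the default `CountVectorizer` settings, which lowercase the text, drop punctuation and ignore one-character tokens:

```python
from sklearn.feature_extraction.text import CountVectorizer

# build_analyzer returns the function that splits raw text into tokens
analyze = CountVectorizer().build_analyzer()

tokens_braund = analyze('Braund, Mr. Owen Harris')
tokens_mccarthy = analyze('McCarthy, Mr. Timothy J')

print(tokens_braund)
print(tokens_mccarthy)
```

This is why the vocabulary above contains `timothy` but no `j`: the single-letter initial is discarded by the default token pattern.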

Now we are going to add `Name` column as one of the features:

In [34]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X

Unnamed: 0,Parch,Fare,Embarked,Sex,Name
0,0,7.25,S,male,"Braund, Mr. Owen Harris"
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,0,7.925,S,female,"Heikkinen, Miss. Laina"
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,0,8.05,S,male,"Allen, Mr. William Henry"
5,0,8.4583,Q,male,"Moran, Mr. James"
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J"
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard"
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)"


In [35]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough'
)

In [36]:
ct.fit_transform(X)

<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>
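The 47 columns are the sum of the parts: 3 for `Embarked`, 2 for `Sex`, 40 words from `Name`, and 2 passthrough columns (`Parch`, `Fare`). The sketch below repeats that accounting on a hypothetical three-row `toy` frame. It also shows why `'Name'` is passed as a plain string rather than a list: `CountVectorizer` expects a 1-D sequence of strings, and a string selector hands it a 1-D column.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy frame, used only for illustration
toy = pd.DataFrame({
    'Parch': [0, 0, 1],
    'Fare': [7.25, 71.28, 21.08],
    'Embarked': ['S', 'C', 'S'],
    'Sex': ['male', 'female', 'male'],
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Moran, Mr. James'],
})

toy_ct = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),   # list -> 2-D input
    (CountVectorizer(), 'Name'),              # string -> 1-D input
    remainder='passthrough',
)
out = toy_ct.fit_transform(toy)

fitted = toy_ct.named_transformers_
n_ohe = sum(len(cats) for cats in fitted['onehotencoder'].categories_)
n_words = len(fitted['countvectorizer'].vocabulary_)
n_passthrough = 2                              # Parch and Fare

print(out.shape, n_ohe, n_words, n_passthrough)
```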

In [37]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(analyzer...
                                            

In [38]:
pipe.named_steps

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex']),
                                 ('countvectorizer',
                                  CountVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=<class 'numpy.int64'>,
                                                  encoding='utf-8',
                                                  input='content',
                                    

In [39]:
X_new = df_new[cols]
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

<a id='part5'></a>
## <font color='darkblue'>Part5 - Handling missing values</font>
This part is from the second session ([course link](https://www.crowdcast.io/e/ml-course/3?utm_source=crowdcast&utm_medium=browser-push&utm_campaign=followers)). We are going to handle missing values, which appear as `NaN` in column `Age`. For example, check row 5 below:

In [40]:
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [41]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']
X = df[cols]
X

Unnamed: 0,Parch,Fare,Embarked,Sex,Name,Age
0,0,7.25,S,male,"Braund, Mr. Owen Harris",22.0
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,0,7.925,S,female,"Heikkinen, Miss. Laina",26.0
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,0,8.05,S,male,"Allen, Mr. William Henry",35.0
5,0,8.4583,Q,male,"Moran, Mr. James",
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J",54.0
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard",2.0
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)",14.0


In [42]:
# Got exception:
# Input contains NaN, infinity or a value too large for dtype('float64').
# pipe.fit(X, y)

In [43]:
# Option 1: drop the rows that contain missing values
# This approach is reasonable when the values are missing at random
X.dropna()

Unnamed: 0,Parch,Fare,Embarked,Sex,Name,Age
0,0,7.25,S,male,"Braund, Mr. Owen Harris",22.0
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th...",38.0
2,0,7.925,S,female,"Heikkinen, Miss. Laina",26.0
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",35.0
4,0,8.05,S,male,"Allen, Mr. William Henry",35.0
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J",54.0
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard",2.0
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",27.0
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)",14.0


In [44]:
# Option 2: drop the columns that contain missing values
X.dropna(axis='columns')

Unnamed: 0,Parch,Fare,Embarked,Sex,Name
0,0,7.25,S,male,"Braund, Mr. Owen Harris"
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,0,7.925,S,female,"Heikkinen, Miss. Laina"
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,0,8.05,S,male,"Allen, Mr. William Henry"
5,0,8.4583,Q,male,"Moran, Mr. James"
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J"
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard"
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)"


We will use package [**sklearn.impute**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.impute) (<font color='brown'>Transformers for missing value imputation</font>) to handle missing values:

In [45]:
from sklearn.impute import SimpleImputer

imp = SimpleImputer()

In [46]:
imp.fit_transform(X[['Age']])

array([[22.        ],
       [38.        ],
       [26.        ],
       [35.        ],
       [35.        ],
       [28.11111111],
       [54.        ],
       [ 2.        ],
       [27.        ],
       [14.        ]])

In [47]:
imp.statistics_  # Value to fill the NaN

array([28.11111111])
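The fill value is simply the mean of the non-missing ages, since `SimpleImputer` uses `strategy='mean'` by default. A quick check using the ten `Age` values shown above:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The ten Age values from the table above, with the one missing entry
ages = np.array([22, 38, 26, 35, 35, np.nan, 54, 2, 27, 14],
                dtype=float).reshape(-1, 1)

mean_imp = SimpleImputer()            # strategy='mean' is the default
filled = mean_imp.fit_transform(ages)

# The learned statistic matches the mean computed while ignoring NaN
print(mean_imp.statistics_, np.nanmean(ages))
```

Other strategies, such as `'median'` or `'most_frequent'`, can be selected through the `strategy` parameter.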

In [48]:
# Update column transformer by adding (imp, ['Age'])
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough'
)

In [49]:
ct.fit_transform(X)

<10x48 sparse matrix of type '<class 'numpy.float64'>'
	with 88 stored elements in Compressed Sparse Row format>

In [50]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)
pipe.named_steps

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex']),
                                 ('countvectorizer',
                                  CountVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=...
                                                  input='content',
                                                  lowercase=True, max_df=1.0,
                                             

In [51]:
X_new = df_new[cols]
pipe.predict(X_new)

array([0, 0, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

We can also treat missingness itself as a feature with the parameter `add_indicator`, which is useful when the data is MNAR (missing not at random):

In [52]:
imp_indicator = SimpleImputer(add_indicator=True)
imp_indicator.fit_transform(X[['Age']])

array([[22.        ,  0.        ],
       [38.        ,  0.        ],
       [26.        ,  0.        ],
       [35.        ,  0.        ],
       [35.        ,  0.        ],
       [28.11111111,  1.        ],
       [54.        ,  0.        ],
       [ 2.        ,  0.        ],
       [27.        ,  0.        ],
       [14.        ,  0.        ]])

<a id='part6'></a>
## <font color='darkblue'>Part 6 - Switching to the full dataset</font>
We will switch to the full dataset and handle the problems we encounter along the way.

In [53]:
# Training data
df = pd.read_csv('http://bit.ly/kaggletrain')
df.shape

(891, 12)

In [54]:
# Testing data
df_new = pd.read_csv('http://bit.ly/kaggletest')
df_new.shape

(418, 11)

In [55]:
df.isna().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

In [56]:
df_new.isna().sum()

PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64

In [57]:
X = df[cols]
y = df['Survived']

In [58]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough'
)

In [59]:
# Raises an exception because `Embarked` has missing values:
# ValueError: Input contains NaN
# ct.fit_transform(X)

In [60]:
# Fill missing values with the constant string `missing`
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')

In [61]:
# Making pipeline:
#  Imputer -> One hot encoder
imp_ohe = make_pipeline(imp_constant, ohe)
imp_ohe.fit_transform(X[['Embarked']])

<891x4 sparse matrix of type '<class 'numpy.float64'>'
	with 891 stored elements in Compressed Sparse Row format>

In [62]:
# An equivalent way to write it:
ohe.fit_transform(imp_constant.fit_transform(X[['Embarked']]))

<891x4 sparse matrix of type '<class 'numpy.float64'>'
	with 891 stored elements in Compressed Sparse Row format>
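The result has 4 columns rather than 3 because the filled-in string `missing` becomes a category of its own. A minimal sketch on a hypothetical four-row `toy` column shows the learned categories:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with one missing value
toy = pd.DataFrame(
    {'Embarked': pd.Series(['S', np.nan, 'C', 'Q'], dtype=object)}
)

toy_imp_ohe = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OneHotEncoder(),
)
encoded = toy_imp_ohe.fit_transform(toy).toarray()

# The imputed string shows up as an extra learned category
categories = toy_imp_ohe.named_steps['onehotencoder'].categories_[0]
print(list(categories))
print(encoded)
```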

In [63]:
# Now let's update our column transformer
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age']),
    remainder='passthrough'
)

In [64]:
ct.fit_transform(X)

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

We still have to handle the missing value in column `Fare`, which occurs only in the testing data.
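An imputer can be fitted on a column that has no missing values at all: it changes nothing in the training data but still learns the training mean, which is then used whenever a missing value turns up later. A minimal sketch with hypothetical fare values:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical values: complete training column, incomplete testing column
train_fare = np.array([[10.0], [20.0], [30.0]])
test_fare = np.array([[np.nan], [5.0]])

fare_imp = SimpleImputer().fit(train_fare)    # learns the training mean
test_filled = fare_imp.transform(test_fare)   # fills NaN with that mean

print(fare_imp.statistics_)
print(test_filled)
```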

In [65]:
X.columns

Index(['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age'], dtype='object')

In [66]:
# Add column `Fare` into imputation transformer
ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough'
)

In [67]:
ct

ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                  transformer_weights=None,
                  transformers=[('pipeline',
                                 Pipeline(memory=None,
                                          steps=[('simpleimputer',
                                                  SimpleImputer(add_indicator=False,
                                                                copy=True,
                                                                fill_value='missing',
                                                                missing_values=nan,
                                                                strategy='constant',
                                                                verbose=0)),
                                                 ('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=

In [68]:
# Do the fit transform again
ct.fit_transform(X)

<891x1518 sparse matrix of type '<class 'numpy.float64'>'
	with 7328 stored elements in Compressed Sparse Row format>

In [69]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('pipeline',
                                                  Pipeline(memory=None,
                                                           steps=[('simpleimputer',
                                                                   SimpleImputer(add_indicator=False,
                                                                                 copy=True,
                                                                                 fill_value='missing',
                                                                                 missing_values=nan,
                                                                                 strategy='constant',
                               

In [70]:
X_new = df_new[cols]
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

In [71]:
# Fill values learned for columns `Age` and `Fare`
ct.named_transformers_.simpleimputer.statistics_

array([29.69911765, 32.20420797])

In [72]:
# Fill values used for columns `Embarked` and `Sex`
ct.named_transformers_.pipeline.named_steps.simpleimputer.statistics_

array(['missing', 'missing'], dtype=object)

## <font color='darkblue'>Recap</font>
For the recap, refer to [link here](https://gist.github.com/justmarkham/ae7793dec68169488b08242181583b47):
<br/>

![Pipeline](images/recap2.png)

In [73]:
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# Columns as features
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name', 'Age']

# Training data
df = pd.read_csv('http://bit.ly/kaggletrain')
X = df[cols]
y = df['Survived']

# Testing data
df_new = pd.read_csv('http://bit.ly/kaggletest')
X_new = df_new[cols]

# Build column transformer
imp_constant = SimpleImputer(strategy='constant', fill_value='missing')
ohe = OneHotEncoder()

imp_ohe = make_pipeline(imp_constant, ohe)
vect = CountVectorizer()
imp = SimpleImputer()

ct = make_column_transformer(
    (imp_ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    (imp, ['Age', 'Fare']),
    remainder='passthrough')

logreg = LogisticRegression(solver='liblinear', random_state=1)

# Build pipeline
pipe = make_pipeline(ct, logreg)

# Training & Prediction
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1,
       0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 0,
       0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0,

<a id='part7'></a>
## <font color='darkblue'>Part7 - Evaluating and tuning a Pipeline</font>
There is still a lot we can do to improve the model. One option is to tune the [**model hyperparameters**](https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)). First, let's check the cross-validated accuracy of the current pipeline as a baseline:

In [74]:
from sklearn.model_selection import cross_val_score

cross_val_score(pipe, X, y, cv=5, scoring='accuracy').mean()

0.8114619295712762
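The single number above is the mean of five per-fold accuracies. As a rough sketch of what `cross_val_score` returns before `.mean()` is applied, the example below uses synthetic data from `make_classification` (an assumption made so the snippet runs without downloading the Titanic data):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data, not the Titanic dataset
X_toy, y_toy = make_classification(n_samples=100, random_state=1)

# One accuracy value per fold (cv=5 gives an array of length 5)
fold_scores = cross_val_score(LogisticRegression(max_iter=1000),
                              X_toy, y_toy, cv=5, scoring='accuracy')
print(fold_scores)
print(fold_scores.mean(), fold_scores.std())
```

Looking at the spread of the fold scores, not only their mean, helps judge whether a small difference between two models is meaningful.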

In [75]:
pipe.named_steps.keys()

dict_keys(['columntransformer', 'logisticregression'])
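These step names matter because they become the prefixes of the hyperparameter names used in a grid search. A small sketch of the naming rule, using `StandardScaler` purely as a stand-in first step (it is not part of the pipeline in this notebook):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# make_pipeline names each step after its lowercased class name
naming_pipe = make_pipeline(StandardScaler(), LogisticRegression())
step_names = list(naming_pipe.named_steps.keys())
print(step_names)

# A parameter of a step is addressed as '<step name>__<parameter name>'
naming_pipe.set_params(logisticregression__C=10)
print(naming_pipe.named_steps['logisticregression'].C)
```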

Next, we define the hyperparameter grid to be searched by [**sklearn.model_selection.GridSearchCV**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html). Each key has the form `<step name>__<parameter name>`:

In [76]:
# Define our hyperparameters for grid search
params = {}
params['logisticregression__penalty'] = ['l1', 'l2']
params['logisticregression__C'] = [0.1, 1, 10]
params

{'logisticregression__penalty': ['l1', 'l2'],
 'logisticregression__C': [0.1, 1, 10]}
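A grid search tries every combination of the listed values. As a quick sanity check, `ParameterGrid` can enumerate the combinations without fitting anything. The sketch below rebuilds the same dictionary under the hypothetical name `demo_params`:

```python
from sklearn.model_selection import ParameterGrid

# Same grid as above, rebuilt so the snippet is self-contained
demo_params = {
    'logisticregression__penalty': ['l1', 'l2'],
    'logisticregression__C': [0.1, 1, 10],
}

candidates = list(ParameterGrid(demo_params))
n_candidates = len(candidates)   # 2 penalties x 3 values of C
n_fits = n_candidates * 5        # each candidate is fit once per fold with cv=5
print(n_candidates, n_fits)
for candidate in candidates:
    print(candidate)
```

The cost of a grid search grows multiplicatively with every parameter added, so it is worth counting candidates before calling `fit`.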

In [77]:
# Create a grid search object that tries every hyperparameter
# combination and scores each one with 5-fold cross-validation.
from sklearn.model_selection import GridSearchCV

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')

In [78]:
grid.fit(X, y);

In [79]:
results = pd.DataFrame(grid.cv_results_)
results

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
0,0.016141,0.000424,0.006391,0.000468,0.1,l1,"{'logisticregression__C': 0.1, 'logisticregres...",0.787709,0.803371,0.769663,0.758427,0.797753,0.783385,0.016946,6
1,0.017161,0.000397,0.006775,0.000733,0.1,l2,"{'logisticregression__C': 0.1, 'logisticregres...",0.798883,0.803371,0.764045,0.775281,0.803371,0.78899,0.016258,5
2,0.018759,0.00076,0.006971,2.5e-05,1.0,l1,"{'logisticregression__C': 1, 'logisticregressi...",0.815642,0.820225,0.797753,0.792135,0.848315,0.814814,0.019787,2
3,0.020519,0.002954,0.007395,0.001476,1.0,l2,"{'logisticregression__C': 1, 'logisticregressi...",0.798883,0.825843,0.803371,0.786517,0.842697,0.811462,0.020141,3
4,0.028144,0.00277,0.006971,0.00088,10.0,l1,"{'logisticregression__C': 10, 'logisticregress...",0.832402,0.814607,0.820225,0.786517,0.853933,0.821537,0.022107,1
5,0.021729,0.000975,0.006988,1.1e-05,10.0,l2,"{'logisticregression__C': 10, 'logisticregress...",0.782123,0.803371,0.808989,0.797753,0.853933,0.809234,0.02408,4


In [80]:
results.sort_values('rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
4,0.028144,0.00277,0.006971,0.00088,10.0,l1,"{'logisticregression__C': 10, 'logisticregress...",0.832402,0.814607,0.820225,0.786517,0.853933,0.821537,0.022107,1
2,0.018759,0.00076,0.006971,2.5e-05,1.0,l1,"{'logisticregression__C': 1, 'logisticregressi...",0.815642,0.820225,0.797753,0.792135,0.848315,0.814814,0.019787,2
3,0.020519,0.002954,0.007395,0.001476,1.0,l2,"{'logisticregression__C': 1, 'logisticregressi...",0.798883,0.825843,0.803371,0.786517,0.842697,0.811462,0.020141,3
5,0.021729,0.000975,0.006988,1.1e-05,10.0,l2,"{'logisticregression__C': 10, 'logisticregress...",0.782123,0.803371,0.808989,0.797753,0.853933,0.809234,0.02408,4
1,0.017161,0.000397,0.006775,0.000733,0.1,l2,"{'logisticregression__C': 0.1, 'logisticregres...",0.798883,0.803371,0.764045,0.775281,0.803371,0.78899,0.016258,5
0,0.016141,0.000424,0.006391,0.000468,0.1,l1,"{'logisticregression__C': 0.1, 'logisticregres...",0.787709,0.803371,0.769663,0.758427,0.797753,0.783385,0.016946,6


Now let's also tune the transformers: the `drop` parameter of [**OneHotEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html), the `ngram_range` parameter of [**CountVectorizer**](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) and the `add_indicator` parameter of [**SimpleImputer**](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html). First we look up the transformer names inside the column transformer:

In [81]:
pipe.named_steps.columntransformer.named_transformers_

{'pipeline': Pipeline(memory=None,
          steps=[('simpleimputer',
                  SimpleImputer(add_indicator=False, copy=True,
                                fill_value='missing', missing_values=nan,
                                strategy='constant', verbose=0)),
                 ('onehotencoder',
                  OneHotEncoder(categories='auto', drop=None,
                                dtype=<class 'numpy.float64'>,
                                handle_unknown='error', sparse=True))],
          verbose=False),
 'countvectorizer': CountVectorizer(analyzer='word', binary=False, decode_error='strict',
                 dtype=<class 'numpy.int64'>, encoding='utf-8', input='content',
                 lowercase=True, max_df=1.0, max_features=None, min_df=1,
                 ngram_range=(1, 1), preprocessor=None, stop_words=None,
                 strip_accents=None, token_pattern='(?u)\\b\\w\\w+\\b',
                 tokenizer=None, vocabulary=None),
 'simpleimputer': SimpleImp

In [85]:
params['columntransformer__pipeline__onehotencoder__drop'] = [None, 'first']
params['columntransformer__countvectorizer__ngram_range'] = [(1, 1), (1, 2)]
params['columntransformer__simpleimputer__add_indicator'] = [False, True]

grid = GridSearchCV(pipe, params, cv=5, scoring='accuracy')
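Long nested names like these are easy to mistype. One way to check them is to look them up in `get_params()`, which lists every tunable parameter of a pipeline. The sketch below rebuilds an unfitted copy of the recap pipeline under the hypothetical names `check_ct` and `check_pipe`, so it runs without the data:

```python
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Unfitted copy of the recap pipeline, used only to inspect parameter names
check_ct = make_column_transformer(
    (make_pipeline(SimpleImputer(strategy='constant', fill_value='missing'),
                   OneHotEncoder()), ['Embarked', 'Sex']),
    (CountVectorizer(), 'Name'),
    (SimpleImputer(), ['Age', 'Fare']),
    remainder='passthrough')
check_pipe = make_pipeline(check_ct, LogisticRegression(solver='liblinear'))

available = check_pipe.get_params().keys()
wanted = [
    'columntransformer__pipeline__onehotencoder__drop',
    'columntransformer__countvectorizer__ngram_range',
    'columntransformer__simpleimputer__add_indicator',
]
# Each name should be reported as present
name_found = {name: name in available for name in wanted}
print(name_found)
```

A misspelled key would otherwise only surface as an error when `grid.fit` is called.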

In [86]:
grid.fit(X, y);

In [87]:
results = pd.DataFrame(grid.cv_results_)
results.sort_values('rank_test_score')

Unnamed: 0,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_columntransformer__countvectorizer__ngram_range,param_columntransformer__pipeline__onehotencoder__drop,param_columntransformer__simpleimputer__add_indicator,param_logisticregression__C,param_logisticregression__penalty,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,split4_test_score,mean_test_score,std_test_score,rank_test_score
28,0.033961,0.002002,0.007202,0.0004231692,"(1, 2)",,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.860335,0.825843,0.825843,0.780899,0.859551,0.830494,0.029113,1
40,0.043926,0.002406,0.007804,0.001306359,"(1, 2)",first,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.849162,0.825843,0.814607,0.786517,0.853933,0.826012,0.024517,2
34,0.035233,0.002802,0.007584,0.0004706093,"(1, 2)",,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.854749,0.820225,0.820225,0.780899,0.853933,0.826006,0.027231,3
46,0.041702,0.004648,0.007612,0.0008498794,"(1, 2)",first,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.843575,0.831461,0.814607,0.780899,0.853933,0.824895,0.0256,4
4,0.027326,0.003489,0.006982,4.862804e-07,"(1, 1)",,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.832402,0.814607,0.820225,0.786517,0.853933,0.821537,0.022107,5
22,0.031297,0.004401,0.006988,0.0006304437,"(1, 1)",first,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.821229,0.820225,0.814607,0.792135,0.853933,0.820426,0.019787,6
16,0.034094,0.003792,0.007381,0.0004885781,"(1, 1)",first,False,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.826816,0.820225,0.814607,0.780899,0.853933,0.819296,0.023467,7
10,0.024928,0.002748,0.006439,0.0004350619,"(1, 1)",,True,10.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.821229,0.820225,0.808989,0.780899,0.853933,0.817055,0.023494,8
20,0.018728,0.000419,0.006597,0.0004971077,"(1, 1)",first,True,1.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.810056,0.820225,0.797753,0.792135,0.853933,0.81482,0.021852,9
44,0.025333,0.001384,0.007429,0.0005787612,"(1, 2)",first,True,1.0,l1,{'columntransformer__countvectorizer__ngram_ra...,0.810056,0.820225,0.797753,0.792135,0.853933,0.81482,0.021852,9


In [88]:
grid.best_score_

0.8304940053982801

In [89]:
grid.best_params_

{'columntransformer__countvectorizer__ngram_range': (1, 2),
 'columntransformer__pipeline__onehotencoder__drop': None,
 'columntransformer__simpleimputer__add_indicator': False,
 'logisticregression__C': 10,
 'logisticregression__penalty': 'l1'}
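After the search finishes, `GridSearchCV` refits the best combination on all of the training data by default (`refit=True`), so the grid object can be used directly for prediction. A minimal sketch on synthetic data, with `StandardScaler` as a stand-in preprocessing step and hypothetical variable names:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data, not the Titanic dataset
X_syn, y_syn = make_classification(n_samples=100, random_state=1)

syn_pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
syn_grid = GridSearchCV(syn_pipe, {'logisticregression__C': [0.1, 1, 10]},
                        cv=5, scoring='accuracy')
syn_grid.fit(X_syn, y_syn)

# best_estimator_ is the whole pipeline, refit with the best parameters
best_pipe = syn_grid.best_estimator_
print(syn_grid.best_params_)

# Calling predict on the grid delegates to best_estimator_
syn_preds = syn_grid.predict(X_syn)
```

In this notebook the analogous call would be `grid.predict(X_new)`, which applies the tuned preprocessing and model to the test set in one step.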

## <font color='darkblue'>Supplement</font>
* [FAQ - Using Scikit-Learn OneHotEncoder with a Pandas DataFrame](https://stackoverflow.com/questions/58101126/using-scikit-learn-onehotencoder-with-a-pandas-dataframe)