This notebook collects my notes from the Data School course [**Building an Effective ML Workflow with scikit-learn**](https://www.crowdcast.io/e/ml-course):

<font size=3>**Outline:**</font>
1. [Review of the basic Machine Learning workflow](#part1)
2. [Encoding categorical data](#part2)
3. [Using ColumnTransformer and Pipeline](#part3)
4. [Encoding text data](#part4)
5. Handling missing values
6. Switching to the full dataset
7. Evaluating and tuning a Pipeline

<a id='part1'></a>
## <font color='darkblue'>Part1</font>

In [1]:
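# The course requires scikit-learn 0.22 or later; this notebook was run on 0.22.x.
# Some calls used below (get_feature_names, OneHotEncoder(sparse=False)) were renamed in later releases.
import sklearn

print(sklearn.__version__)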

0.22.2.post1


Load the training data from the Kaggle [**Titanic: Machine Learning from Disaster**](https://www.kaggle.com/c/titanic) data set (first 10 rows only) and show a few records:

In [2]:
import pandas as pd
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Next, we extract `X` (the features) and `y` (the target class) from the dataset:

In [3]:
# Use the `Parch` and `Fare` columns as features
X = df[['Parch', 'Fare']]

# df[['Survived']] returns a DataFrame, while df['Survived'] returns a Series
y = df['Survived']
y.shape

(10,)
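The DataFrame-versus-Series distinction noted in the cell above can be checked directly. This is a minimal sketch on a made-up three-row frame (the values are hypothetical; only the column names mirror the notebook):

```python
import pandas as pd

# Toy frame with made-up values; only the column names mirror the notebook.
toy = pd.DataFrame({
    'Parch': [0, 1, 2],
    'Fare': [7.25, 21.0, 11.5],
    'Survived': [0, 1, 1],
})

X_toy = toy[['Parch', 'Fare']]   # list of columns -> 2-D DataFrame
y_frame = toy[['Survived']]      # one-element list -> still a 2-D DataFrame
y_series = toy['Survived']       # bare column name -> 1-D Series

print(type(X_toy).__name__, X_toy.shape)
print(type(y_frame).__name__, y_frame.shape)
print(type(y_series).__name__, y_series.shape)
```

scikit-learn generally expects a 2-D feature matrix and a 1-D target, which is why `X` is selected with a list and `y` with a single column name.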

Then we use [**LogisticRegression**](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html) to demonstrate the workflow in building the ML model as below:

In [4]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(solver='liblinear', random_state=1)

Then we can use [cross validation](https://en.wikipedia.org/wiki/Cross-validation_(statistics)) (see [**sklearn.model_selection.cross_val_score**](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html)) to estimate the accuracy of the model with `cv=3` (k-fold with k=3):

In [5]:
from sklearn.model_selection import cross_val_score
cross_val_score(logreg, X, y, cv=3, scoring='accuracy').mean()

0.6944444444444443
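To make concrete what `cross_val_score` does with `cv=3`, here is a rough sketch that repeats the splits by hand. It assumes a made-up toy dataset (not the Titanic rows) and relies on the fact that, for a classifier with an integer `cv`, scikit-learn uses stratified k-fold splits without shuffling:

```python
import numpy as np
from sklearn.base import clone
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Made-up toy data (not the Titanic rows): two features, binary target.
X_toy = np.array([[0, 7.0], [1, 70.0], [0, 8.0], [2, 55.0],
                  [0, 9.0], [1, 60.0], [0, 6.5], [1, 45.0],
                  [0, 10.0], [2, 80.0], [0, 7.5], [1, 52.0]])
y_toy = np.array([0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1])

model = LogisticRegression(solver='liblinear', random_state=1)

# Fit a fresh copy of the model on each training fold and score it
# on the held-out fold (accuracy is the default score for classifiers).
manual_scores = []
for train_idx, test_idx in StratifiedKFold(n_splits=3).split(X_toy, y_toy):
    fold_model = clone(model)
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    manual_scores.append(fold_model.score(X_toy[test_idx], y_toy[test_idx]))

auto_scores = cross_val_score(model, X_toy, y_toy, cv=3, scoring='accuracy')
print(np.mean(manual_scores), auto_scores.mean())
```

The two means should agree, since both paths fit a fresh model per fold and report held-out accuracy.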

The mean cross-validation accuracy is about 0.69. Let's train the model on all ten rows now:

In [6]:
logreg.fit(X, y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

Let's load only the first 10 rows of the testing data set to speed up the demonstration:

In [7]:
df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
df_new

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S
5,897,3,"Svensson, Mr. Johan Cervin",male,14.0,0,0,7538,9.225,,S
6,898,3,"Connolly, Miss. Kate",female,30.0,0,0,330972,7.6292,,Q
7,899,2,"Caldwell, Mr. Albert Francis",male,26.0,1,1,248738,29.0,,S
8,900,3,"Abrahim, Mrs. Joseph (Sophie Halaut Easu)",female,18.0,0,0,2657,7.2292,,C
9,901,3,"Davies, Mr. John Samuel",male,21.0,2,0,A/4 48871,24.15,,S


In [8]:
X_new = df_new[['Parch', 'Fare']]
X_new

Unnamed: 0,Parch,Fare
0,0,7.8292
1,0,7.0
2,0,9.6875
3,0,8.6625
4,1,12.2875
5,0,9.225
6,0,7.6292
7,1,29.0
8,0,7.2292
9,0,24.15


Let's make predictions on `X_new`:

In [9]:
logreg.predict(X_new)

array([0, 0, 0, 0, 1, 0, 0, 1, 0, 1], dtype=int64)

<a id='part2'></a>
## <font color='darkblue'>Part2</font>

We need to encode the categorical columns, using [**OneHotEncoder**](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html) here:

In [10]:
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder()
ohe.fit_transform(df[['Embarked']])

<10x3 sparse matrix of type '<class 'numpy.float64'>'
	with 10 stored elements in Compressed Sparse Row format>

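Two questions naturally come up here:
* What is a sparse matrix, and why is one returned? A sparse matrix stores only the nonzero entries and their positions. One-hot encoded data is mostly zeros, so this saves memory.
* What values does the sparse matrix hold? Each row contains a single 1.0 in the column of that row's category, which is why the 10x3 matrix above has only 10 stored elements.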
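As a rough illustration of the sparse-matrix questions above, the sketch below builds the dense one-hot array for the ten `Embarked` values in this notebook by hand and converts it with `scipy.sparse`. It is only meant to show how few entries actually need storing:

```python
import numpy as np
from scipy import sparse

# One-hot rows for the ten `Embarked` values shown in this notebook
# (column order C, Q, S).
dense = np.array([
    [0, 0, 1], [1, 0, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1],
    [0, 1, 0], [0, 0, 1], [0, 0, 1], [0, 0, 1], [1, 0, 0],
], dtype=float)

csr = sparse.csr_matrix(dense)

print(csr.shape)   # same logical shape as the dense array
print(csr.nnz)     # number of explicitly stored (nonzero) entries
print(dense.size)  # number of cells a dense array has to hold
```

With only three categories the saving is small, but it grows quickly when a column has hundreds of distinct values.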

You can get the learned categories as below:

In [11]:
# Show the learned category values
ohe.categories_

[array(['C', 'Q', 'S'], dtype=object)]

In [12]:
ohe.get_feature_names()

array(['x0_C', 'x0_Q', 'x0_S'], dtype=object)

In [13]:
import numpy as np

# Show comparison between category and encoded data
transformed_data = ohe.transform(df[['Embarked']]).toarray()
ohe_df = pd.DataFrame(transformed_data, columns=ohe.get_feature_names())
pd.concat([df, ohe_df], axis=1)[['Embarked']+ohe.get_feature_names().tolist()]

Unnamed: 0,Embarked,x0_C,x0_Q,x0_S
0,S,0.0,0.0,1.0
1,C,1.0,0.0,0.0
2,S,0.0,0.0,1.0
3,S,0.0,0.0,1.0
4,S,0.0,0.0,1.0
5,Q,0.0,1.0,0.0
6,S,0.0,0.0,1.0
7,S,0.0,0.0,1.0
8,S,0.0,0.0,1.0
9,C,1.0,0.0,0.0


In [14]:
# Return a dense array instead of a sparse matrix
# (scikit-learn 1.2+ renames this parameter to `sparse_output`)
ohe = OneHotEncoder(sparse=False)
ohe.fit_transform(df[['Embarked', 'Sex']])

array([[0., 0., 1., 0., 1.],
       [1., 0., 0., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 1., 0.],
       [0., 0., 1., 0., 1.],
       [0., 1., 0., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 0., 1.],
       [0., 0., 1., 1., 0.],
       [1., 0., 0., 1., 0.]])

Whatever you do to the training data, you must do the same thing to the testing data! To avoid duplicating that work, we will use a pipeline to define the transformations in one place.
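A minimal sketch of that rule for a single encoder: fit on the training data only, then reuse the fitted object on new data. The frames are made up, and `'Z'` is a hypothetical port that appears only in the new data; `handle_unknown='ignore'` is one option for such values (the default, `'error'`, raises instead):

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up toy frames; 'Z' is a hypothetical port seen only in the new data.
train = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})
new = pd.DataFrame({'Embarked': ['Q', 'Z']})

# handle_unknown='ignore' encodes unseen categories as all zeros instead of raising.
encoder = OneHotEncoder(handle_unknown='ignore')
encoder.fit(train[['Embarked']])   # learn the categories from training data only

# transform (not fit_transform) keeps the column layout learned during fit
encoded_new = encoder.transform(new[['Embarked']]).toarray()
print(encoder.categories_)
print(encoded_new)
```

Calling `fit_transform` on the new data instead would learn a different set of categories, and the columns would no longer line up with what the model was trained on.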

<a id='part3'></a>
## <font color='darkblue'>Part3</font>

In [15]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

X = df[cols]
X.tail(n=10)

Unnamed: 0,Parch,Fare,Embarked,Sex
0,0,7.25,S,male
1,0,71.2833,C,female
2,0,7.925,S,female
3,0,53.1,S,female
4,0,8.05,S,male
5,0,8.4583,Q,male
6,0,51.8625,S,male
7,1,21.075,S,male
8,2,11.1333,S,female
9,0,30.0708,C,female


Now we are going to use the [**sklearn.compose**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.compose) module to define how each column is transformed.

In [16]:
from sklearn.compose import make_column_transformer

ohe = OneHotEncoder()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough'
)

In [40]:
transformed_data = ct.fit_transform(X)
transformed_data

array([[ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  7.25  ],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 71.2833],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    ,  7.925 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  0.    , 53.1   ],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    ,  8.05  ],
       [ 0.    ,  1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  8.4583],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  0.    , 51.8625],
       [ 0.    ,  0.    ,  1.    ,  0.    ,  1.    ,  1.    , 21.075 ],
       [ 0.    ,  0.    ,  1.    ,  1.    ,  0.    ,  2.    , 11.1333],
       [ 1.    ,  0.    ,  0.    ,  1.    ,  0.    ,  0.    , 30.0708]])
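The column order in the array above (encoded columns first, then `Parch` and `Fare`) comes from `remainder='passthrough'`. The sketch below, on a made-up four-row frame, contrasts it with the default `remainder='drop'`:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder

# Made-up toy frame using the notebook's column names.
toy = pd.DataFrame({
    'Parch': [0, 1, 2, 0],
    'Fare': [7.25, 71.28, 11.13, 8.46],
    'Embarked': ['S', 'C', 'S', 'Q'],
    'Sex': ['male', 'female', 'female', 'male'],
})

keep_rest = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),
    remainder='passthrough',   # append untouched columns after the encoded ones
)
drop_rest = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),   # remainder defaults to 'drop'
)

kept = keep_rest.fit_transform(toy)
dropped = drop_rest.fit_transform(toy)
print(kept.shape)     # 3 + 2 encoded columns, plus Parch and Fare
print(dropped.shape)  # encoded columns only
```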

In [44]:
ct_df = pd.DataFrame(transformed_data, columns=['e1', 'e2', 'e3', 's1', 's2', 'Parch', 'Fare'])
ct_df

Unnamed: 0,e1,e2,e3,s1,s2,Parch,Fare
0,0.0,0.0,1.0,0.0,1.0,0.0,7.25
1,1.0,0.0,0.0,1.0,0.0,0.0,71.2833
2,0.0,0.0,1.0,1.0,0.0,0.0,7.925
3,0.0,0.0,1.0,1.0,0.0,0.0,53.1
4,0.0,0.0,1.0,0.0,1.0,0.0,8.05
5,0.0,1.0,0.0,0.0,1.0,0.0,8.4583
6,0.0,0.0,1.0,0.0,1.0,0.0,51.8625
7,0.0,0.0,1.0,0.0,1.0,1.0,21.075
8,0.0,0.0,1.0,1.0,0.0,2.0,11.1333
9,1.0,0.0,0.0,1.0,0.0,0.0,30.0708


In [45]:
# Compare the original categorical columns with the transformed columns
pd.concat([df[['Embarked', 'Sex']], ct_df], axis=1)

Unnamed: 0,Embarked,Sex,e1,e2,e3,s1,s2,Parch,Fare
0,S,male,0.0,0.0,1.0,0.0,1.0,0.0,7.25
1,C,female,1.0,0.0,0.0,1.0,0.0,0.0,71.2833
2,S,female,0.0,0.0,1.0,1.0,0.0,0.0,7.925
3,S,female,0.0,0.0,1.0,1.0,0.0,0.0,53.1
4,S,male,0.0,0.0,1.0,0.0,1.0,0.0,8.05
5,Q,male,0.0,1.0,0.0,0.0,1.0,0.0,8.4583
6,S,male,0.0,0.0,1.0,0.0,1.0,0.0,51.8625
7,S,male,0.0,0.0,1.0,0.0,1.0,1.0,21.075
8,S,female,0.0,0.0,1.0,1.0,0.0,2.0,11.1333
9,C,female,1.0,0.0,0.0,1.0,0.0,0.0,30.0708


Then we build the pipeline with the [**sklearn.pipeline**](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.pipeline) module:

In [19]:
from sklearn.pipeline import make_pipeline

pipe = make_pipeline(ct, logreg)

In [20]:
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Embarked', 'Sex'])],
                                   verbose=False)),
                ('logisticregression',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                         

In [21]:
# The single call above is roughly equivalent to this chain of calls
logreg.fit(ct.fit_transform(X), y)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

We can access each step from pipeline as below:

In [22]:
pipe.named_steps

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex'])],
                   verbose=False),
 'logisticregression': LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                    intercept_scaling=1, l1_ratio=None, max_iter=100,
                    multi_class='auto', n_jobs=None, penalty='l2',
                    random_state=1, solver='liblinear', tol=0.0001, verbose=0,
                    warm_start=False)}

In [23]:
pipe.named_steps.logisticregression.coef_

array([[ 0.26491287, -0.19848033, -0.22907928,  1.0075062 , -1.17015293,
         0.20056557,  0.01597307]])
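The raw `coef_` array is hard to read without column labels. One way to label it, sketched here on made-up data, is to rebuild the names from the fitted encoder's `categories_` followed by the passthrough columns. The `column_category` naming scheme is my own choice for illustration, not something scikit-learn produces:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data using the notebook's column names.
toy = pd.DataFrame({
    'Parch': [0, 1, 2, 0, 0, 1],
    'Fare': [7.25, 71.28, 11.13, 8.46, 53.1, 30.07],
    'Embarked': ['S', 'C', 'S', 'Q', 'S', 'C'],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
})
target = [0, 1, 1, 0, 1, 0]

encoded_cols = ['Embarked', 'Sex']
passthrough_cols = ['Parch', 'Fare']   # remaining columns, in original order

toy_pipe = make_pipeline(
    make_column_transformer((OneHotEncoder(), encoded_cols), remainder='passthrough'),
    LogisticRegression(solver='liblinear', random_state=1),
)
toy_pipe.fit(toy, target)

# Rebuild readable names: one per learned category, then the passthrough columns.
fitted_ohe = toy_pipe.named_steps['columntransformer'].named_transformers_['onehotencoder']
names = [f'{col}_{cat}'
         for col, cats in zip(encoded_cols, fitted_ohe.categories_)
         for cat in cats]
names += passthrough_cols

coefs = toy_pipe.named_steps['logisticregression'].coef_[0]
for name, coef in zip(names, coefs):
    print(f'{name:12s} {coef: .4f}')
```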

Now we can make predictions with the pipeline:

In [36]:
X_new = df_new[cols]
X_new

Unnamed: 0,Parch,Fare,Embarked,Sex
0,0,7.8292,Q,male
1,0,7.0,S,female
2,0,9.6875,Q,male
3,0,8.6625,S,male
4,1,12.2875,S,female
5,0,9.225,S,male
6,0,7.6292,Q,female
7,1,29.0,S,male
8,0,7.2292,C,female
9,0,24.15,S,male


In [37]:
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

In [39]:
# Equivalent to transforming the new data and predicting by hand
logreg.predict(ct.transform(X_new))

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)
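The manual call works because, with the default settings, the pipeline keeps references to the objects passed in rather than copies, so fitting the pipeline also fits them. A small sketch on made-up data (all `toy_` names are hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

# Made-up toy data using the notebook's column names.
train = pd.DataFrame({
    'Parch': [0, 1, 2, 0, 0, 1],
    'Fare': [7.25, 71.28, 11.13, 8.46, 53.1, 30.07],
    'Sex': ['male', 'female', 'female', 'male', 'female', 'male'],
})
target = [0, 1, 1, 0, 1, 0]
new = pd.DataFrame({'Parch': [0, 1], 'Fare': [9.0, 40.0], 'Sex': ['female', 'male']})

toy_ct = make_column_transformer((OneHotEncoder(), ['Sex']), remainder='passthrough')
toy_logreg = LogisticRegression(solver='liblinear', random_state=1)
toy_pipe = make_pipeline(toy_ct, toy_logreg)
toy_pipe.fit(train, target)

# The pipeline's final step is the very object we created above.
same_object = toy_pipe.named_steps['logisticregression'] is toy_logreg

by_pipeline = toy_pipe.predict(new)
by_hand = toy_logreg.predict(toy_ct.transform(new))
print(same_object, by_pipeline, by_hand)
```

The pipeline form is still preferable in practice, because it cannot accidentally skip or reorder a transformation step.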

## <font color='darkblue'>Recap</font>
https://gist.github.com/justmarkham/6a04f852443a0bc522afc0740dd9cb7f

In [47]:
import pandas as pd
from sklearn.preprocessing import OneHotEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline

# 0) Select columns as features (must be defined before the data is sliced)
cols = ['Parch', 'Fare', 'Embarked', 'Sex']

# 1) Load data
df = pd.read_csv('http://bit.ly/kaggletrain', nrows=10)
X = df[cols]
y = df['Survived']

df_new = pd.read_csv('http://bit.ly/kaggletest', nrows=10)
X_new = df_new[cols]


# 2) Define pipeline
ohe = OneHotEncoder()
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    remainder='passthrough')

logreg = LogisticRegression(solver='liblinear', random_state=1)

pipe = make_pipeline(ct, logreg)

# 3) Training & Making prediction
pipe.fit(X, y)
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)

<a id='part4'></a>
## <font color='darkblue'>Part4</font>
We are going to create features from text columns with the [**sklearn.feature_extraction**](https://scikit-learn.org/stable/modules/feature_extraction.html) module:

In [48]:
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [50]:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
dtm = vect.fit_transform(df['Name'])
dtm

<10x40 sparse matrix of type '<class 'numpy.int64'>'
	with 46 stored elements in Compressed Sparse Row format>

Let's explore the `dtm` (document-term matrix) object a little:

In [52]:
# 40 unique tokens were learned from the `Name` column
print(vect.get_feature_names())

['achem', 'adele', 'allen', 'berg', 'bradley', 'braund', 'briggs', 'cumings', 'elisabeth', 'florence', 'futrelle', 'gosta', 'harris', 'heath', 'heikkinen', 'henry', 'jacques', 'james', 'john', 'johnson', 'laina', 'leonard', 'lily', 'master', 'may', 'mccarthy', 'miss', 'moran', 'mr', 'mrs', 'nasser', 'nicholas', 'oscar', 'owen', 'palsson', 'peel', 'thayer', 'timothy', 'vilhelmina', 'william']


In [53]:
pd.DataFrame(dtm.toarray(), columns=vect.get_feature_names())

Unnamed: 0,achem,adele,allen,berg,bradley,braund,briggs,cumings,elisabeth,florence,...,nasser,nicholas,oscar,owen,palsson,peel,thayer,timothy,vilhelmina,william
0,0,0,0,0,0,1,0,0,0,0,...,0,0,0,1,0,0,0,0,0,0
1,0,0,0,0,1,0,1,1,0,1,...,0,0,0,0,0,0,1,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,1,0,0,0,0
4,0,0,1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,1
5,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
6,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,1,0,0
7,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,1,0,0,0,0,0
8,0,0,0,1,0,0,0,0,1,0,...,0,0,1,0,0,0,0,0,1,0
9,1,1,0,0,0,0,0,0,0,0,...,1,1,0,0,0,0,0,0,0,0


In [54]:
df.loc[0, 'Name']

'Braund, Mr. Owen Harris'
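To see how a name turns into vocabulary entries, the sketch below approximates the default `CountVectorizer` analyzer with a regular expression (lowercase, then keep runs of two or more word characters) and compares it with the vectorizer's own `build_analyzer()`. It uses the name from row 6 of the notebook's data, where the single-letter initial is dropped:

```python
import re
from sklearn.feature_extraction.text import CountVectorizer

name = 'McCarthy, Mr. Timothy J'

# Rough stand-in for the default analyzer: lowercase, then keep runs of
# two or more word characters (so the single letter 'J' is dropped).
manual_tokens = re.findall(r'(?u)\b\w\w+\b', name.lower())

# The vectorizer's own analyzer, for comparison.
sklearn_tokens = CountVectorizer().build_analyzer()(name)

print(manual_tokens)
print(sklearn_tokens)
```

This matches the vocabulary printed earlier, which contains `mccarthy` and `timothy` but no `j`.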

Now we are going to add `Name` column as one of the features:

In [55]:
cols = ['Parch', 'Fare', 'Embarked', 'Sex', 'Name']
X = df[cols]
X

Unnamed: 0,Parch,Fare,Embarked,Sex,Name
0,0,7.25,S,male,"Braund, Mr. Owen Harris"
1,0,71.2833,C,female,"Cumings, Mrs. John Bradley (Florence Briggs Th..."
2,0,7.925,S,female,"Heikkinen, Miss. Laina"
3,0,53.1,S,female,"Futrelle, Mrs. Jacques Heath (Lily May Peel)"
4,0,8.05,S,male,"Allen, Mr. William Henry"
5,0,8.4583,Q,male,"Moran, Mr. James"
6,0,51.8625,S,male,"McCarthy, Mr. Timothy J"
7,1,21.075,S,male,"Palsson, Master. Gosta Leonard"
8,2,11.1333,S,female,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)"
9,0,30.0708,C,female,"Nasser, Mrs. Nicholas (Adele Achem)"


In [56]:
ct = make_column_transformer(
    (ohe, ['Embarked', 'Sex']),
    (vect, 'Name'),
    remainder='passthrough'
)

In [57]:
ct.fit_transform(X)

<10x47 sparse matrix of type '<class 'numpy.float64'>'
	with 78 stored elements in Compressed Sparse Row format>
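The 47 columns above are 3 `Embarked` categories + 2 `Sex` categories + 40 name tokens + 2 passthrough columns. The sketch below repeats that arithmetic on a made-up three-row frame (names borrowed from the notebook, other values hypothetical). Note that `'Name'` is passed as a plain string so the vectorizer receives a 1-D column, while the encoder gets a list and therefore 2-D input:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import OneHotEncoder

# Made-up toy frame using the notebook's column names.
toy = pd.DataFrame({
    'Parch': [0, 1, 0],
    'Fare': [7.25, 71.28, 8.05],
    'Embarked': ['S', 'C', 'Q'],
    'Sex': ['male', 'female', 'male'],
    'Name': ['Braund, Mr. Owen Harris',
             'Heikkinen, Miss. Laina',
             'Allen, Mr. William Henry'],
})

toy_ct = make_column_transformer(
    (OneHotEncoder(), ['Embarked', 'Sex']),   # list -> 2-D input for the encoder
    (CountVectorizer(), 'Name'),              # string -> 1-D input for the vectorizer
    remainder='passthrough',
)
out = toy_ct.fit_transform(toy)

fitted_ohe = toy_ct.named_transformers_['onehotencoder']
fitted_vect = toy_ct.named_transformers_['countvectorizer']
n_onehot = sum(len(cats) for cats in fitted_ohe.categories_)
n_words = len(fitted_vect.vocabulary_)
n_passthrough = 2   # Parch and Fare

print(out.shape, n_onehot, n_words, n_passthrough)
```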

In [58]:
pipe = make_pipeline(ct, logreg)
pipe.fit(X, y)

Pipeline(memory=None,
         steps=[('columntransformer',
                 ColumnTransformer(n_jobs=None, remainder='passthrough',
                                   sparse_threshold=0.3,
                                   transformer_weights=None,
                                   transformers=[('onehotencoder',
                                                  OneHotEncoder(categories='auto',
                                                                drop=None,
                                                                dtype=<class 'numpy.float64'>,
                                                                handle_unknown='error',
                                                                sparse=True),
                                                  ['Embarked', 'Sex']),
                                                 ('countvectorizer',
                                                  CountVectorizer(analyzer...
                                            

In [60]:
pipe.named_steps

{'columntransformer': ColumnTransformer(n_jobs=None, remainder='passthrough', sparse_threshold=0.3,
                   transformer_weights=None,
                   transformers=[('onehotencoder',
                                  OneHotEncoder(categories='auto', drop=None,
                                                dtype=<class 'numpy.float64'>,
                                                handle_unknown='error',
                                                sparse=True),
                                  ['Embarked', 'Sex']),
                                 ('countvectorizer',
                                  CountVectorizer(analyzer='word', binary=False,
                                                  decode_error='strict',
                                                  dtype=<class 'numpy.int64'>,
                                                  encoding='utf-8',
                                                  input='content',
                                    

In [61]:
X_new = df_new[cols]
pipe.predict(X_new)

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0], dtype=int64)
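Names in the new data mostly contain words the vectorizer never saw during fitting. A fitted `CountVectorizer` simply ignores such words at transform time, as this small sketch with three names from the notebook's rows suggests:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two names from the training rows and one from the test rows shown above.
train_names = ['Braund, Mr. Owen Harris', 'Heikkinen, Miss. Laina']
new_names = ['Kelly, Mr. James']

toy_vect = CountVectorizer()
toy_vect.fit(train_names)   # vocabulary comes from the training names only

row = toy_vect.transform(new_names).toarray()[0]
known = sorted(toy_vect.vocabulary_, key=toy_vect.vocabulary_.get)
print(dict(zip(known, row)))   # only 'mr' is counted; 'kelly' and 'james' are ignored
```

With only ten training rows, most of the name columns are therefore zero for new passengers, which may be why the predictions here are the same as without the `Name` column.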

## <font color='darkblue'>Supplement</font>
* [FAQ - Using Scikit-Learn OneHotEncoder with a Pandas DataFrame](https://stackoverflow.com/questions/58101126/using-scikit-learn-onehotencoder-with-a-pandas-dataframe)