# Complex machine-learning pipeline

First, let's load the "titanic" dataset. It is available on OpenML; here we read it from a local CSV file.

In [1]:
import pandas as pd

In this dataset, missing values are encoded with the character `"?"`. We tell pandas about it when reading the CSV file.

In [2]:
df = pd.read_csv("../datasets/titanic.csv", na_values="?")
df.head()

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2.0,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.9167,1,2,113781,151.55,C22 C26,S,11.0,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
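
To see what `na_values` does in isolation, here is a minimal sketch on a made-up three-row extract (the `csv_text` string is an assumption for illustration, not part of the Titanic file). Without the option, a column containing `"?"` is read as text; with it, the column becomes numeric with `NaN` where the marker was.

```python
import io

import pandas as pd

# Hypothetical extract: "?" marks a missing value, as in the Titanic CSV.
csv_text = "age,fare\n29,211.3375\n?,151.55\n2,?\n"

# Without na_values, "?" is kept as text and the columns are not numeric.
raw = pd.read_csv(io.StringIO(csv_text))

# With na_values, "?" is converted to NaN and the columns are parsed as floats.
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.dtypes)
print(parsed.dtypes)
print(parsed.isna().sum())  # number of missing values per column
```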


The classification task is to predict whether or not a passenger survived the Titanic disaster.

In [3]:
X_df = df.drop(columns='survived')
y = df['survived']

We will split the data into a training and a testing set.

In [4]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, random_state=42
)

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>What would happen if you try to fit a <tt>RandomForestClassifier</tt>?</li>
    </ul>
</div>

In [5]:
from sklearn.ensemble import RandomForestClassifier

# TODO

# Working only with numerical data

We already saw in the previous lecture that we can easily train a scikit-learn model on numerical data. Therefore, we will start by selecting only the numerical columns and training a model on them.

## Pandas preprocessing

Before using scikit-learn, we will do some simple preprocessing with pandas. First, let's select only the numerical columns.

In [6]:
num_cols = ['age', 'pclass', 'parch', 'fare']

X_train_num = X_train[num_cols]

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>What would happen if you try to fit a <tt>RandomForestClassifier</tt> on these numerical columns only?</li>
    </ul>
</div>

In [7]:
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train_num, y_train)

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

We might want to look into a summary of the data that we try to fit.

In [8]:
X_train_num.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 1139 to 1126
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     784 non-null    float64
 1   pclass  981 non-null    int64  
 2   parch   981 non-null    int64  
 3   fare    980 non-null    float64
dtypes: float64(2), int64(2)
memory usage: 38.3 KB


Since some values are missing, we can replace them with the mean of the corresponding column, computed on the training set.

In [9]:
X_train_num_imputed = X_train_num.fillna(X_train_num.mean())
X_train_num_imputed.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 1139 to 1126
Data columns (total 4 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   age     981 non-null    float64
 1   pclass  981 non-null    int64  
 2   parch   981 non-null    int64  
 3   fare    981 non-null    float64
dtypes: float64(2), int64(2)
memory usage: 38.3 KB


In [10]:
model.fit(X_train_num_imputed, y_train)

RandomForestClassifier()

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
    <li>What should we do if there are also missing values in the test set?</li>
    <li>Process the test set so as to be able to compute the test score of the model.</li>
    </ul>
</div>

In [11]:
X_test_num_imputed =  # TODO

SyntaxError: invalid syntax (1309997571.py, line 1)

In [12]:
model.score(X_test_num_imputed, y_test)

NameError: name 'X_test_num_imputed' is not defined
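
One possible way to approach the exercise above, sketched on made-up data rather than on the Titanic frames: the test set is filled with the means computed on the training set, never with its own means. The frames `toy_train` and `toy_test` are hypothetical stand-ins for `X_train_num` and the numerical columns of `X_test`.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the numerical training and testing frames.
toy_train = pd.DataFrame(
    {"age": [20.0, 40.0, np.nan], "fare": [10.0, np.nan, 30.0]}
)
toy_test = pd.DataFrame({"age": [np.nan, 50.0], "fare": [5.0, np.nan]})

# Statistics are learned on the training set only.
train_means = toy_train.mean()

# The same training statistics are reused to fill the test set.
toy_test_imputed = toy_test.fillna(train_means)

print(train_means)
print(toy_test_imputed)
```

Under the same assumption, the notebook equivalent would be along the lines of `X_test[num_cols].fillna(X_train_num.mean())`.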

## Make it less error prone using scikit-learn

Scikit-learn provides "transformers" to preprocess the data. `sklearn.impute.SimpleImputer` is a transformer that does the same job as the pandas processing above. In addition, we will see later that it integrates well with other scikit-learn components.

In [13]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")

Like any estimator in scikit-learn, a transformer has a `fit` method, which should be called on the training data to learn the required statistics. In the case of a mean imputer, `fit` computes the mean of each feature.

In [14]:
imputer.fit(X_train_num)

SimpleImputer()

In [15]:
imputer.statistics_

array([29.34768278,  2.29867482,  0.39143731, 33.68646633])

To replace the missing values with the learned means, we can use the `transform` method.

In [16]:
imputer.transform(X_train_num)

array([[38.    ,  3.    ,  0.    ,  7.8958],
       [ 6.    ,  3.    ,  1.    , 15.2458],
       [52.    ,  1.    ,  1.    , 79.65  ],
       ...,
       [28.5   ,  3.    ,  0.    , 16.1   ],
       [26.    ,  3.    ,  0.    ,  7.925 ],
       [28.    ,  3.    ,  0.    ,  7.8958]])

As previously mentioned, when imputing the test set we should use the values computed by `fit` on the training set.
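
A minimal sketch of this fit-on-train, transform-on-test pattern, using small made-up arrays (`toy_train` and `toy_test` are assumptions for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical training and testing arrays with missing entries.
toy_train = np.array([[20.0, 10.0], [40.0, np.nan], [np.nan, 30.0]])
toy_test = np.array([[np.nan, 5.0], [50.0, np.nan]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(toy_train)  # learns the per-column means on the training data

# The test set is filled with the training means stored in statistics_.
test_imputed = imputer.transform(toy_test)

print(imputer.statistics_)
print(test_imputed)
```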

We saw earlier that we should be careful when preprocessing data to avoid any "data leak" (i.e. letting information from the test set influence the preprocessing or the training of our model). Scikit-learn provides the `Pipeline` class to chain successive transformations with a final estimator. In addition, it ensures that the right operations are applied at the right time: `fit` only uses the training data, and the learned statistics are reused when predicting or scoring.

In [17]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(SimpleImputer(strategy='mean'),
                      RandomForestClassifier(n_estimators=100))
model.fit(X_train_num, y_train)

Pipeline(steps=[('simpleimputer', SimpleImputer()),
                ('randomforestclassifier', RandomForestClassifier())])

In [18]:
X_test_num = X_test[num_cols]  # same numerical columns as X_train_num
model.score(X_test_num, y_test)

If we want to fit the model directly on `X_train`, we can select the numerical columns using a `ColumnTransformer` object:

In [19]:
from sklearn.compose import make_column_transformer


numerical_preprocessing = make_column_transformer(
    (SimpleImputer(strategy='mean'), num_cols)
)
model = make_pipeline(
    numerical_preprocessing,
    RandomForestClassifier(n_estimators=100),
)
model.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('simpleimputer',
                                                  SimpleImputer(),
                                                  ['age', 'pclass', 'parch',
                                                   'fare'])])),
                ('randomforestclassifier', RandomForestClassifier())])

In [20]:
model.score(X_test, y_test)

0.6707317073170732
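
As a side note on the column selection used above: with its default settings, `make_column_transformer` drops every column that is not listed. The sketch below illustrates this on a made-up frame (`toy` and its column values are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Hypothetical frame mixing numerical columns and a text column.
toy = pd.DataFrame(
    {
        "age": [20.0, np.nan, 40.0],
        "fare": [10.0, 30.0, np.nan],
        "name": ["a", "b", "c"],  # not listed below, so it is dropped
    }
)

preprocessing = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["age", "fare"]),
)

# Only the two listed columns come out, with missing values imputed.
out = preprocessing.fit_transform(toy)

print(out.shape)
print(out)
```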

# Working only with categorical data

Categorical columns (and string columns in particular) are not supported natively by most machine-learning algorithms and require a preprocessing step usually called encoding. With tree-based algorithms, a simple and usually adequate choice is the `OrdinalEncoder`, which maps each category to an integer.
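
To make the encoding concrete, here is a small sketch on made-up values (the `toy` array is an assumption for illustration). Each column gets its own mapping, and the categories are sorted before being numbered.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical categorical data: one column for sex, one for the port.
toy = np.array(
    [["male", "S"], ["female", "C"], ["male", "Q"]],
    dtype=object,
)

encoder = OrdinalEncoder()
codes = encoder.fit_transform(toy)

print(encoder.categories_)  # sorted categories found in each column
print(codes)                # position of each value in its column's categories
```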

In [21]:
X_train.head()

Unnamed: 0,pclass,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1139,3,"Rekic, Mr. Tido",male,38.0,0,0,349249,7.8958,,S,,,
678,3,"Boulos, Master. Akar",male,6.0,1,1,2678,15.2458,,C,,,"Syria Kent, ON"
290,1,"Taussig, Mr. Emil",male,52.0,1,1,110413,79.65,E67,S,,,"New York, NY"
285,1,"Straus, Mr. Isidor",male,67.0,1,0,PC 17483,221.7792,C55 C57,S,,96.0,"New York, NY"
1157,3,"Rosblom, Mr. Viktor Richard",male,18.0,1,1,370129,20.2125,,S,,,


In [22]:
cat_col = ['sex', 'embarked', 'pclass']

In [23]:
X_train_cat = X_train[cat_col]

In [24]:
X_train_cat.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 981 entries, 1139 to 1126
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   sex       981 non-null    object
 1   embarked  980 non-null    object
 2   pclass    981 non-null    int64 
dtypes: int64(1), object(2)
memory usage: 30.7+ KB


In [25]:
from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OrdinalEncoder(),
)

categorical_preprocessing = make_column_transformer(
    (cat_pipeline, cat_col)
)
model = make_pipeline(
    categorical_preprocessing,
    RandomForestClassifier(n_estimators=100)
)
model.fit(X_train, y_train)

Pipeline(steps=[('columntransformer',
                 ColumnTransformer(transformers=[('pipeline',
                                                  Pipeline(steps=[('simpleimputer',
                                                                   SimpleImputer(fill_value='missing',
                                                                                 strategy='constant')),
                                                                  ('ordinalencoder',
                                                                   OrdinalEncoder())]),
                                                  ['sex', 'embarked',
                                                   'pclass'])])),
                ('randomforestclassifier', RandomForestClassifier())])

In [26]:
model.score(X_test, y_test)

0.7713414634146342

# Combining both categorical and numerical data in the pipeline

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
    <li>Try to combine the numerical and categorical pipelines into a single <tt>ColumnTransformer</tt>.</li>
    <li>Fit a <tt>RandomForestClassifier</tt> on the output of this feature engineering. How does the test score change?</li>
    </ul>
</div>
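
As a starting point for this exercise, the sketch below shows one possible shape for a combined preprocessing step, on a tiny made-up frame (`toy_X`, `toy_y` and the column lists are assumptions for illustration; the scores it would give say nothing about the Titanic data).

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical data with missing values in both kinds of columns.
toy_X = pd.DataFrame(
    {
        "age": [22.0, np.nan, 35.0, 58.0, 4.0, np.nan],
        "fare": [7.25, 71.3, 8.05, np.nan, 16.7, 26.55],
        "sex": ["male", "female", "female", "male", "female", "male"],
        "embarked": ["S", "C", np.nan, "S", "S", "Q"],
    }
).astype({"sex": object, "embarked": object})
toy_y = [0, 1, 1, 0, 1, 0]

# One branch per kind of column, gathered in a single ColumnTransformer.
preprocessing = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["age", "fare"]),
    (
        make_pipeline(
            SimpleImputer(strategy="constant", fill_value="missing"),
            OrdinalEncoder(),
        ),
        ["sex", "embarked"],
    ),
)

model = make_pipeline(
    preprocessing,
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(toy_X, toy_y)

print(model.predict(toy_X))
```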