# Complex machine-learning pipeline

First, let's fetch the "titanic" dataset directly from OpenML.

In [None]:
import pandas as pd

In this dataset, missing values are encoded with the character `"?"`. We will tell pandas about this when reading the CSV file.

In [None]:
df = pd.read_csv(
    "https://www.openml.org/data/get_csv/16826755/phpMYEkMl.csv",
    na_values='?'
)
df.head()
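
To see what `na_values` changes, here is a minimal sketch on a made-up three-row CSV held in memory (the values are illustrative only, not taken from the Titanic file). Without the option, a column containing `"?"` cannot be parsed as numbers; with it, the marker becomes `NaN` and the column is numeric.

```python
import io

import pandas as pd

# Hypothetical CSV content; "?" marks a missing value, as in the Titanic file.
csv_text = "age,fare\n29,211.3\n?,151.55\n30,?\n"

raw = pd.read_csv(io.StringIO(csv_text))
parsed = pd.read_csv(io.StringIO(csv_text), na_values="?")

print(raw.dtypes)           # columns are read as text, not as numbers
print(parsed.dtypes)        # columns are parsed as float64
print(parsed.isna().sum())  # one missing value per column
```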

The classification task is to predict whether or not a person will survive the Titanic disaster.

In [None]:
X_df = df.drop(columns='survived')
y = df['survived']

We will split the data into a training and a testing set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X_df, y, random_state=42
)

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>What would happen if you try to fit a <tt>RandomForestClassifier</tt>?</li>
    </ul>
</div>

In [None]:
from sklearn.ensemble import RandomForestClassifier

# TODO

# Working only with numerical data

We saw in the previous lecture that scikit-learn models can easily be trained on numerical data. We will therefore start by selecting only the numerical columns and training a model on them.

## Pandas preprocessing

Before using scikit-learn, we will do some simple preprocessing with pandas. First, let's select only the numerical columns.

In [None]:
num_cols = ['age', 'pclass', 'parch', 'fare']

X_train_num = X_train[num_cols]

<div class="alert alert-success">
    <p><b>QUESTIONS</b>:</p>
    <ul>
        <li>What would happen if you try to fit a <tt>RandomForestClassifier</tt> on these numerical columns only?</li>
    </ul>
</div>

In [None]:
model = RandomForestClassifier(n_estimators=100)
model.fit(X_train_num, y_train)

We might want to look into a summary of the data that we try to fit.

In [None]:
X_train_num.info()

Since some values are missing, we can replace them with the mean of the corresponding column.

In [None]:
X_train_num_imputed = X_train_num.fillna(X_train_num.mean())
X_train_num_imputed.info()
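
The mechanics of this one-liner can be checked on a tiny, made-up frame (the numbers below are illustrative only): `mean()` skips missing entries, and `fillna` matches the resulting Series to the columns by name, so each column is filled with its own mean.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column.
toy = pd.DataFrame({"age": [20.0, np.nan, 40.0], "fare": [10.0, 30.0, np.nan]})

col_means = toy.mean()               # NaN entries are skipped: age 30.0, fare 20.0
toy_imputed = toy.fillna(col_means)  # each column is filled with its own mean
print(toy_imputed)
```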

In [None]:
model.fit(X_train_num_imputed, y_train)

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
    <li>What should we do if there are also missing values in the test set?</li>
    <li>Process the test set so as to be able to compute the test score of the model.</li>
    </ul>
</div>

In [None]:
X_test_num_imputed =  # TODO

In [None]:
model.score(X_test_num_imputed, y_test)

## Make it less error prone using scikit-learn

Scikit-learn provides "transformers" to preprocess the data. `sklearn.impute.SimpleImputer` is a transformer that does the same job as the processing we did with pandas. However, we will see later that it integrates well with other scikit-learn components.

In [None]:
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="mean")

Like any estimator in scikit-learn, a transformer has a `fit` method, which should be called on the training data to learn the required statistics. In the case of a mean imputer, this means computing the mean of each feature.

In [None]:
imputer.fit(X_train_num)

In [None]:
imputer.statistics_

To replace the missing values with these means, we can use the `transform` method.

In [None]:
imputer.transform(X_train_num)

As previously mentioned, when imputing the test set we should reuse the values computed in `fit` on the training set.
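
A minimal sketch of that rule on two small, made-up arrays (illustrative values only): the imputer is fitted on the training array, and the test array is then filled with the training means rather than with its own.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy arrays standing in for the training and test sets.
train = np.array([[20.0, 10.0], [np.nan, 30.0], [40.0, 20.0]])
test = np.array([[np.nan, 100.0], [60.0, np.nan]])

toy_imputer = SimpleImputer(strategy="mean").fit(train)
print(toy_imputer.statistics_)  # column means of the training data only

# Missing test entries are replaced by the training means, not the test means.
test_imputed = toy_imputer.transform(test)
print(test_imputed)
```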

We saw earlier that we should be careful when preprocessing data to avoid any "data leak" (i.e. letting information from the test set influence what is learned from the training set). Scikit-learn provides the `Pipeline` class to chain successive transformations. In addition, it ensures that the right operations are applied at the right time.

In [None]:
from sklearn.pipeline import make_pipeline

model = make_pipeline(SimpleImputer(strategy='mean'),
                      RandomForestClassifier(n_estimators=100))
model.fit(X_train_num, y_train)

In [None]:
X_test_num = X_test[num_cols]  # same numerical columns as in training
model.score(X_test_num, y_test)
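
To check what the pipeline learned, one can inspect its fitted steps. The sketch below uses a tiny, made-up dataset (values, labels, and `random_state` are illustrative assumptions). `make_pipeline` names each step after its lowercased class name, so the fitted imputer is available under `"simpleimputer"`.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: one feature, with a missing value in each set.
X_tr = np.array([[20.0], [np.nan], [40.0], [30.0]])
y_tr = np.array([0, 1, 1, 0])
X_te = np.array([[np.nan], [100.0]])

toy_model = make_pipeline(
    SimpleImputer(strategy="mean"),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
toy_model.fit(X_tr, y_tr)  # fits the imputer, then the forest, on training data

fitted_imputer = toy_model.named_steps["simpleimputer"]
print(fitted_imputer.statistics_)  # mean of the training column only

# At prediction time the imputer only transforms; it is not refitted.
predictions = toy_model.predict(X_te)
print(predictions)
```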

If we want to fit the model directly on `X_train`, we can select the numerical columns using a `ColumnTransformer` object:

In [None]:
from sklearn.compose import make_column_transformer


numerical_preprocessing = make_column_transformer(
    (SimpleImputer(strategy='mean'), num_cols)
)
model = make_pipeline(
    numerical_preprocessing,
    RandomForestClassifier(n_estimators=100),
)
model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)
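
One detail worth keeping in mind: by default, a `ColumnTransformer` drops every column that is not listed in one of its transformers (`remainder='drop'`). A minimal sketch on a made-up frame (column values are illustrative only):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer

# Hypothetical toy frame mixing numerical and string columns.
toy = pd.DataFrame({
    "age": [20.0, np.nan, 40.0],
    "fare": [10.0, 30.0, 20.0],
    "sex": ["female", "male", "male"],
})

toy_preprocessing = make_column_transformer(
    (SimpleImputer(strategy="mean"), ["age", "fare"])
)
out = toy_preprocessing.fit_transform(toy)

print(out.shape)  # only the two listed columns remain; "sex" is dropped
print(out)
```

This is why the pipeline above can be fitted on the full `X_train` even though it still contains string columns: they never reach the classifier.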

# Working only with categorical data

Categorical columns (string-typed ones in particular) are not natively supported by most machine-learning algorithms and require a preprocessing step usually called encoding. With tree-based algorithms, ordinal encoding with `OrdinalEncoder` is usually a sensible choice.
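
As a quick illustration of what ordinal encoding does, here is a minimal sketch on a made-up single-column array (illustrative values only). Each distinct category is mapped to an integer code, stored as a float, following the sorted order of the categories.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column; OrdinalEncoder expects 2D input.
sex = np.array([["male"], ["female"], ["female"], ["male"]])

toy_encoder = OrdinalEncoder()
codes = toy_encoder.fit_transform(sex)

print(toy_encoder.categories_)  # categories are sorted: "female", then "male"
print(codes.ravel())            # one float code per row
```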

In [None]:
X_train.head()

In [None]:
cat_col = ['sex', 'embarked', 'pclass']

In [None]:
X_train_cat = X_train[cat_col]

In [None]:
X_train_cat.info()

In [None]:
from sklearn.preprocessing import OrdinalEncoder

cat_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='missing'),
    OrdinalEncoder(),
)

categorical_preprocessing = make_column_transformer(
    (cat_pipeline, cat_col)
)
model = make_pipeline(
    categorical_preprocessing,
    RandomForestClassifier(n_estimators=100)
)
model.fit(X_train, y_train)

In [None]:
model.score(X_test, y_test)
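
To see what the categorical pipeline produces, the sketch below applies the same two steps to a made-up column (illustrative values only). The constant imputer turns the missing entry into the string `"missing"`, which the encoder then treats as a category of its own.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column with one missing entry.
toy = pd.DataFrame(
    {"embarked": pd.Series(["S", np.nan, "C", "S"], dtype=object)}
)

toy_cat_pipeline = make_pipeline(
    SimpleImputer(strategy="constant", fill_value="missing"),
    OrdinalEncoder(),
)
codes = toy_cat_pipeline.fit_transform(toy)

fitted_encoder = toy_cat_pipeline.named_steps["ordinalencoder"]
print(fitted_encoder.categories_)  # sorted categories, including "missing"
print(codes.ravel())               # one float code per row
```

Note that, with its default settings, `OrdinalEncoder` raises an error at `transform` time when it meets a category that was not present during `fit`.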

# Combining both categorical and numerical data in the pipeline

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
        <li>Try to combine the numerical and categorical pipelines into a single <tt>ColumnTransformer</tt>.</li>
        <li>Fit a <tt>RandomForestClassifier</tt> on the output of this feature engineering. How does the test score change?</li>
    </ul>
</div>