# Complex machine-learning pipeline

We saw in the previous notebook that we can encounter different types of data:
(i) numerical and (ii) categorical data. We showed how to handle each of these
types.

In this notebook, we introduce a new scikit-learn class called
`ColumnTransformer`, which allows us to preprocess each data type differently
before training a machine learning model.

First, let's load the Adult census dataset.

In [None]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")
# drop the duplicated column `"education-num"` as stated in the first notebook
adult_census = adult_census.drop(columns="education-num")

target_name = "class"
target = adult_census[target_name]

data = adult_census.drop(columns=[target_name])

We previously used the function `make_column_selector` to automatically
select columns based on their data types. We will reuse this function to
select the categorical columns.

In [None]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
categorical_columns
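
To see what the selector returns without loading the full dataset, here is a minimal sketch on a hypothetical toy frame. The name `toy_data` and its values are made up for illustration; only the column names echo the Adult census data. The string columns are created with an explicit `object` dtype so that the dtype-based selection does not depend on how a given pandas version infers string columns.

```python
import pandas as pd
from sklearn.compose import make_column_selector as selector

# Hypothetical toy data: two categorical (object) columns and two numerical ones.
toy_data = pd.DataFrame(
    {
        "age": [25, 38, 52],
        "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
        "hours-per-week": [40, 50, 35],
        "sex": pd.Series(["Male", "Female", "Female"], dtype=object),
    }
)

# A selector is a callable: given a dataframe, it returns a list of column names.
toy_categorical = selector(dtype_include=object)(toy_data)
toy_numerical = selector(dtype_exclude=object)(toy_data)

print(toy_categorical)
print(toy_numerical)
```

The selected names keep the column order of the original dataframe.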

We will use a `RandomForestClassifier`. As we saw previously, an
`OrdinalEncoder` is a sufficient encoding strategy in this case. We will
therefore use a `ColumnTransformer` to encode the categorical columns and let
the numerical data pass through as-is.

In [None]:
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

preprocessor = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        categorical_columns,
    ),
    remainder="passthrough",
)
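
The behavior of such a preprocessor can be sketched on a tiny hypothetical example (`toy_train` and `toy_new` are made-up frames, not part of the dataset). It illustrates two points: the encoded categorical column comes first in the output, followed by the passed-through columns, and a category not seen during `fit` is mapped to `unknown_value`.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical training data with one numerical and one categorical column.
toy_train = pd.DataFrame(
    {
        "age": [25, 38, 52],
        "workclass": pd.Series(["Private", "State-gov", "Private"], dtype=object),
    }
)
# Hypothetical new sample whose category was never seen during fit.
toy_new = pd.DataFrame(
    {"age": [40], "workclass": pd.Series(["Never-worked"], dtype=object)}
)

toy_preprocessor = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        ["workclass"],
    ),
    remainder="passthrough",  # "age" is forwarded unchanged
)

# Output columns: encoded "workclass" first, then the remainder ("age").
encoded_train = toy_preprocessor.fit_transform(toy_train)
# The unseen category "Never-worked" is encoded as -1.
encoded_new = toy_preprocessor.transform(toy_new)

print(encoded_train)
print(encoded_new)
```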

We can quickly check the effect of applying this preprocessor.

In [None]:
preprocessor.fit_transform(data)

Now that the preprocessor is working, we can train a `RandomForestClassifier`.

In [None]:
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier

model = make_pipeline(preprocessor, RandomForestClassifier())


Finally, we can evaluate our model by cross-validation, as we did previously.

In [None]:
from sklearn.model_selection import cross_validate

cv_results = cross_validate(model, data, target)
cv_results

In [None]:
scores = cv_results["test_score"]
print(f"The accuracy is: {scores.mean():.3f} +/- {scores.std():.3f}")
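
For readers who want to run the whole chain without the dataset file, the following is a minimal end-to-end sketch on hypothetical data (`toy_data`, `toy_target`, and the forest settings `n_estimators=10` and `random_state=0` are assumptions chosen for illustration). It shows that the pipeline receives the raw dataframe, and that `cross_validate` returns a dictionary with one test score per fold.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy data: 20 rows, one categorical and one numerical column.
toy_data = pd.DataFrame(
    {
        "workclass": pd.Series(
            ["Private", "State-gov", "Self-emp", "Private"] * 5, dtype=object
        ),
        "age": [25, 38, 52, 41] * 5,
    }
)
# Made-up target that depends only on age, giving 10 samples per class.
toy_target = toy_data["age"].map(lambda age: ">50K" if age > 40 else "<=50K")

toy_preprocessor = make_column_transformer(
    (
        OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
        ["workclass"],
    ),
    remainder="passthrough",
)
toy_model = make_pipeline(
    toy_preprocessor, RandomForestClassifier(n_estimators=10, random_state=0)
)

# The preprocessor is refit on the training part of each fold.
toy_cv_results = cross_validate(toy_model, toy_data, toy_target, cv=5)
toy_scores = toy_cv_results["test_score"]
print(f"The accuracy is: {toy_scores.mean():.3f} +/- {toy_scores.std():.3f}")
```

Because the preprocessing lives inside the pipeline, the encoder only ever learns its categories from the training folds, which is why `handle_unknown` matters here.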

## Exercise

Now, it is your turn to create a similar complex pipeline, but this time
using a linear model. You will need to adjust both the categorical and
numerical preprocessing.

In [None]:
# %load solutions/solution_01.py