# 📝 Exercise M1.04

The goal of this exercise is to evaluate the impact of using an arbitrary
integer encoding for categorical variables along with a linear
classification model such as Logistic Regression.

To do so, let's try to use `OrdinalEncoder` to preprocess the categorical
variables. This preprocessor is assembled in a pipeline with
`LogisticRegression`. The statistical performance of the pipeline can be
evaluated by cross-validation and then compared to the score obtained when
using `OneHotEncoder` or to some other baseline score.
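To make "arbitrary integer encoding" concrete, here is a minimal sketch on a made-up single column (the `toy` values are illustrative and are not read from the dataset). It assumes the default behaviour of `OrdinalEncoder`, which sorts the categories before assigning integer codes, and contrasts it with `OneHotEncoder`, which creates one binary column per category.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Toy column, made up for illustration only.
toy = np.array([["HS-grad"], ["Masters"], ["Bachelors"]], dtype=object)

ordinal = OrdinalEncoder()
toy_ordinal = ordinal.fit_transform(toy)
# Categories are sorted lexicographically, so the integer codes say
# nothing about the actual level of education.
print(ordinal.categories_[0])
print(toy_ordinal.ravel())

one_hot = OneHotEncoder()
# The default output is a sparse matrix; densify it for display.
toy_one_hot = one_hot.fit_transform(toy).toarray()
print(toy_one_hot)
```

A linear model learns a single coefficient for the ordinal column, so it treats the codes as positions on a numeric scale. With one-hot encoding, each category gets its own coefficient.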

First, we load the dataset.

In [1]:
import pandas as pd

adult_census = pd.read_csv("../datasets/adult-census.csv")

In [2]:
target_name = "class"
target = adult_census[target_name]
data = adult_census.drop(columns=[target_name, "education-num"])

In the previous notebook, we used `sklearn.compose.make_column_selector` to
automatically select columns with a specific data type (also called `dtype`).
Here, we will use this selector to get only the columns containing strings
(column with `object` dtype) that correspond to categorical features in our
dataset.

In [3]:
from sklearn.compose import make_column_selector as selector

categorical_columns_selector = selector(dtype_include=object)
categorical_columns = categorical_columns_selector(data)
data_categorical = data[categorical_columns]

Our dataset is now filtered so that it contains only categorical features.
Define a scikit-learn pipeline composed of an `OrdinalEncoder` and a
`LogisticRegression` classifier.

Because `OrdinalEncoder` raises an error by default when it sees an unknown
category at prediction time, you can set `handle_unknown="use_encoded_value"`
together with the `unknown_value` parameter. You can refer to the
[scikit-learn documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html)
for more details regarding these parameters.
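As a rough illustration of these two parameters, the sketch below fits an encoder on a toy training column and then transforms a column containing a category it has never seen. The data are made up, and `unknown_value=-1` is an arbitrary choice for this sketch; any integer that is not already used as a category code should work.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Toy data, made up for illustration only.
train = np.array([["Private"], ["State-gov"]], dtype=object)
test = np.array([["Private"], ["Never-worked"]], dtype=object)

encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
encoder.fit(train)

# "Never-worked" was not seen during fit, so it is mapped to unknown_value
# instead of raising an error.
encoded_test = encoder.transform(test)
print(encoded_test.ravel())
```

Without `handle_unknown="use_encoded_value"`, the call to `transform` on `test` would fail, which matters during cross-validation because a rare category can be absent from a training fold.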

In [11]:
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

model_o = make_pipeline(
    # Map categories unseen during `fit` to a fixed code instead of raising.
    OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=100),
    LogisticRegression(max_iter=500),
)

Your model is now defined. Evaluate it by cross-validation with
`sklearn.model_selection.cross_validate`.

In [14]:
from sklearn.model_selection import cross_validate

cv_results_o = cross_validate(model_o, data_categorical, target)
cv_results_o

{'fit_time': array([0.60099888, 0.43199515, 0.5714469 , 0.58800197, 0.46999693]),
 'score_time': array([0.05700207, 0.04799867, 0.04499888, 0.05500221, 0.04200149]),
 'test_score': array([0.75514382, 0.75555328, 0.75573301, 0.75307125, 0.75788288])}

Now, we would like to compare the statistical performance of our previous
model with a new model that uses a `OneHotEncoder` instead of an
`OrdinalEncoder`. Repeat the model evaluation using cross-validation.
Compare the scores of both models and draw a conclusion about the impact of
choosing a specific encoding strategy when using a linear model.

In [15]:
from sklearn.preprocessing import OneHotEncoder

model_h = make_pipeline(
    # Encode categories unseen during `fit` as all zeros for that feature.
    OneHotEncoder(handle_unknown="ignore"),
    LogisticRegression(max_iter=500),
)

In [16]:
cv_results_h = cross_validate(model_h, data_categorical, target)
cv_results_h

{'fit_time': array([1.12699723, 0.918715  , 0.82701635, 0.89847875, 0.84300184]),
 'score_time': array([0.03200006, 0.03199649, 0.04098797, 0.03598309, 0.03200197]),
 'test_score': array([0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421])}
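To compare the two models, one option is to summarize each set of test scores by its mean and standard deviation. The sketch below copies the `test_score` arrays from the two outputs above so that it runs on its own; in the notebook you could use `cv_results_o["test_score"]` and `cv_results_h["test_score"]` directly.

```python
import numpy as np

# Test scores copied from the two cross-validation outputs above.
scores_ordinal = np.array(
    [0.75514382, 0.75555328, 0.75573301, 0.75307125, 0.75788288]
)
scores_one_hot = np.array(
    [0.83222438, 0.83560242, 0.82872645, 0.83312858, 0.83466421]
)

mean_ordinal = scores_ordinal.mean()
mean_one_hot = scores_one_hot.mean()

for name, scores in [
    ("OrdinalEncoder", scores_ordinal),
    ("OneHotEncoder", scores_one_hot),
]:
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

On these five folds the mean accuracy is about 0.755 with `OrdinalEncoder` and about 0.833 with `OneHotEncoder`. This suggests that, for a linear model, an arbitrary integer encoding of categories costs a noticeable amount of accuracy compared with one-hot encoding.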