# Linear models - Preprocessing

In this notebook, we will discuss the importance of preprocessing in linear models, especially when the solver relies on a gradient-based optimization method.

In [None]:
# temporary fix to avoid spurious warning raised in scikit-learn 1.0.0
# it will be solved in scikit-learn 1.0.1
import warnings
warnings.filterwarnings("ignore", message="X has feature names.*")
warnings.filterwarnings("ignore", message="X does not have valid feature names.*")

## Importance of data scaling

Since we would like to demonstrate an issue related to gradient-based optimization solvers, we need a linear model that is not fitted with a closed-form solution. `LogisticRegression` is such a model (in contrast to `LinearRegression` or `Ridge`).

Thus, let's start by loading our penguins dataset to distinguish the different species.

In [None]:
import pandas as pd

data = pd.read_csv("../datasets/penguins_classification.csv")
target_name = "Species"
X = data.drop(columns=target_name)
y = data[target_name]

Up to now, we did not bother much about evaluating our model: we used a single dataset just to illustrate some fitting properties of the different estimators. However, the preprocessing that we are going to use needs to be applied differently depending on whether one is training or testing a model. Therefore, we will start by splitting the data into a training set and a testing set.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, random_state=0
)
X_train.head()

Previously, we showed that we could train a `LogisticRegression` model in the following manner.

In [None]:
from sklearn.linear_model import LogisticRegression

model = LogisticRegression()
model.fit(X_train, y_train)
model.score(X_test, y_test)

We previously stated that this model minimizes a specific loss function (log loss). However, we did not mention which algorithm is used to find the optimal parameters $\beta$ that minimize this log loss. We only discussed such details for `LinearRegression`, where we used the Normal equation, a closed-form solution to the least squares minimization.
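As a reminder of what a closed-form solution looks like, here is a minimal sketch of the Normal equation on synthetic data. The data and the names `X_demo`, `y_demo`, and `X_design` are made up for this illustration, and scikit-learn may use a different numerical routine internally; the point is only that no iterative procedure is needed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# synthetic regression data (illustrative only, not the penguins dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -3.0]) + rng.normal(scale=0.1, size=100)

# prepend a column of ones so that the intercept is part of beta
X_design = np.hstack([np.ones((X_demo.shape[0], 1)), X_demo])

# Normal equation: solve (X^T X) beta = X^T y in a single step
beta = np.linalg.solve(X_design.T @ X_design, X_design.T @ y_demo)

# compare with the coefficients found by scikit-learn
reference = LinearRegression().fit(X_demo, y_demo)
print(beta)
print(reference.intercept_, reference.coef_)
```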

For `LogisticRegression`, the problem does not have a closed-form solution. Instead, the different algorithms rely on the derivatives of the log loss to find the best parameters. One can check the available algorithms by looking at the documentation of the `solver` parameter: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html

All the different solvers rely on derivatives of the log loss, meaning that an iterative algorithm is run to find the optimal parameters of the model. Therefore, once the `LogisticRegression` is fitted, we can check the number of iterations that the algorithm needed to find the optimal parameters.

In [None]:
model.n_iter_
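To see what `n_iter_` contains independently of the penguins data, here is a small sketch on a synthetic binary problem. The dataset and the names `X_demo`, `y_demo`, and `clf_demo` are assumptions made for this illustration. `n_iter_` is an array, and its values cannot exceed `max_iter`.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0,
    random_state=0,
)

# max_iter is the upper bound on the number of solver iterations
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# n_iter_ is an array holding the number of iterations actually performed
print(clf_demo.n_iter_)
```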

However, something is not quite right with our training dataset. Let's have a look at it.

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_context("poster")

_, ax = plt.subplots(figsize=(8, 8))
ax = sns.kdeplot(
    data=X_train,
    x="Culmen Length (mm)",
    y="Culmen Depth (mm)",
    levels=10,
    fill=True,
    cmap=plt.cm.viridis,
    ax=ax,
)
_ = ax.axis("square")

Looking at our data distribution, we observe that the spread around the mean is larger for the culmen length feature than for the culmen depth feature. This has an impact when fitting a model with a gradient-based solver.

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    <ul>
        <li>Using pandas, rescale the dataset such that it is centered at (0, 0) and has a unit standard deviation for both features.</li>
        <li>Plot the distribution of the data as previously done.</li>
        <li>Fit a <tt>LogisticRegression</tt> on the scaled dataset.</li>
        <li>Check the number of iterations needed to train the model.</li>
    </ul>
</div>

In [None]:
# %load solutions/solution_25.py

In [None]:
# %load solutions/solution_26.py

In [None]:
# %load solutions/solution_27.py

## Scikit-learn transformers API

There is a family of estimators in scikit-learn that "transform" data. Like predictors, they learn some state during `fit` and later reuse this state when calling the method `transform`. Let's perform the previous scaling using the `StandardScaler` transformer from scikit-learn.

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)

As stated, we expect our scaler to have some state after calling `fit`. Indeed, our scaler has stored the mean and standard deviation of each feature of the dataset.

In [None]:
scaler.mean_, scaler.scale_
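The following sketch, on synthetic data, checks what these two attributes hold: `mean_` is the per-feature mean and `scale_` is the per-feature standard deviation as computed by NumPy (that is, with `ddof=0`). The names `X_demo` and `scaler_demo` are introduced only for this illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# two synthetic features with different locations and spreads
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=[45.0, 17.0], scale=[5.0, 2.0], size=(200, 2))

scaler_demo = StandardScaler().fit(X_demo)

# the fitted state matches the statistics computed by hand
print(np.allclose(scaler_demo.mean_, X_demo.mean(axis=0)))
print(np.allclose(scaler_demo.scale_, X_demo.std(axis=0)))
```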

Now we can use the `transform` method to scale the data.

In [None]:
X_train_scaled = scaler.transform(X_train)
# by default, `transform` returns a NumPy array even when given a pandas dataframe
X_train_scaled = pd.DataFrame(X_train_scaled, columns=X_train.columns)
X_train_scaled.head()

Let's plot the distribution of the data to convince ourselves that the transformation applied is the same as the one we did by hand earlier.

In [None]:
_, ax = plt.subplots(figsize=(8, 8))
ax = sns.kdeplot(
    data=X_train_scaled,
    x="Culmen Length (mm)",
    y="Culmen Depth (mm)",
    levels=10,
    fill=True,
    cmap=plt.cm.viridis,
    ax=ax,
)
_ = ax.axis("square")

In [None]:
X_train_scaled.mean()

In [None]:
X_train_scaled.std()
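The standard deviations reported by pandas may differ slightly from 1. A likely explanation is that `DataFrame.std` uses the sample estimator (`ddof=1`) by default, while `StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below illustrates this on synthetic data; `df_demo` and `scaled_demo` are names made up for the example.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# small synthetic dataframe so that the difference is visible
rng = np.random.default_rng(0)
n_samples = 50
df_demo = pd.DataFrame(
    rng.normal(loc=10.0, scale=3.0, size=(n_samples, 2)), columns=["a", "b"]
)

scaled_demo = pd.DataFrame(
    StandardScaler().fit_transform(df_demo), columns=df_demo.columns
)

# pandas default (ddof=1): slightly larger than 1
std_pandas = scaled_demo.std()
# population standard deviation (ddof=0): 1 up to rounding errors
std_population = scaled_demo.std(ddof=0)
print(std_pandas)
print(std_population)
```

The two estimates differ by a factor of `sqrt(n / (n - 1))`, which becomes negligible for large datasets.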

The advantage of using scikit-learn transformers over manually manipulating the dataset is that we can build complex pipelines. A pipeline can be represented as a sequence of scikit-learn transformers ending with a scikit-learn predictor. This pipeline has the same API as a scikit-learn predictor (i.e. `fit`, `predict`, `predict_proba`, `decision_function`), and takes care of the transformations and their state for us.

Let's define such a scikit-learn pipeline.

In [None]:
import sklearn
# to make nice diagram when plotting complex pipeline
sklearn.set_config(display="diagram")

In [None]:
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

model = make_pipeline(StandardScaler(), LogisticRegression())
model

Here, we created a pipeline that is in charge of scaling the data first and then passing it to the classifier. Let's demonstrate how to train this pipeline.

In [None]:
model.fit(X_train, y_train)

We just trained our model without having to take care of the scaling ourselves. We can check that the model has internal state learned during `fit`. Let's first check our scaler.

In [None]:
model[0].mean_, model[0].scale_

So the first step of the pipeline stored the mean and standard deviation of the training set. Did we learn the optimal parameters of the `LogisticRegression`?

In [None]:
model[-1].coef_

Apparently, we did. We can even check the number of iterations that it took to train the model.

In [None]:
model[-1].n_iter_
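Besides integer indexing, the steps of a pipeline can also be reached by name. The sketch below, on synthetic data, assumes a pipeline built with `make_pipeline`, which names each step after its lowercased class name; `pipe_demo` is a name introduced only for this example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=200, n_features=4, random_state=0)
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression())
pipe_demo.fit(X_demo, y_demo)

# step names generated by make_pipeline
step_names = list(pipe_demo.named_steps)
print(step_names)

# indexing by position or by name returns the same fitted objects
same_scaler = pipe_demo[0] is pipe_demo.named_steps["standardscaler"]
same_clf = pipe_demo[-1] is pipe_demo.named_steps["logisticregression"]
print(same_scaler, same_clf)
```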

So the pipeline did all the work that we previously did manually, with the advantage that it exposes the same API as any predictor that we have used up to now. What about prediction and scoring?

Indeed, during prediction, we need to scale the testing set using the statistics computed during training. Using the pipeline with the `predict` method takes care of this processing for us.

In [None]:
y_pred = model.predict(X_test)

In [None]:
(y_test == y_pred).mean()
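To make explicit what the pipeline does at prediction time, here is a sketch on synthetic data comparing it with the manual approach: the scaler is fitted on the training set only, and the same training statistics are reused to transform the testing set. All names ending in `_demo`, as well as `X_tr`, `X_te`, `y_tr`, and `y_te`, are assumptions made for this illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# synthetic data (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

# manual approach: fit the scaler on the training set only...
scaler_demo = StandardScaler().fit(X_tr)
clf_demo = LogisticRegression().fit(scaler_demo.transform(X_tr), y_tr)
# ...and reuse the training statistics to scale the testing set
manual_pred = clf_demo.predict(scaler_demo.transform(X_te))

# pipeline approach: the same steps are handled internally
pipe_demo = make_pipeline(StandardScaler(), LogisticRegression()).fit(X_tr, y_tr)
pipe_pred = pipe_demo.predict(X_te)

print(np.array_equal(manual_pred, pipe_pred))
# score() on a classifier pipeline returns the mean accuracy
print(pipe_demo.score(X_te, y_te), (pipe_pred == y_te).mean())
```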

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    In the first course, we saw that we could pass a model to <tt>cross_validate</tt> to get a distribution of scores.
    Use the previous complex pipeline, and evaluate it using the <tt>cross_validate</tt> function.
</div>

## Side effect of longer processing

We would like to trigger a behaviour that you could encounter in scikit-learn. It is quite important to know what it means and how to act. Let's load a dataset first.

In [None]:
data = pd.read_csv("../datasets/adult-census-numeric-all.csv")
data.head()

In [None]:
target_name = "class"
X = data.drop(columns=target_name)
y = data[target_name]

This dataset is linked to a classification task.

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    Fit a <tt>LogisticRegression</tt> model without scaling the data first. In addition, limit the solver to at most 50 iterations. What is the result of the training?
</div>

In [None]:
# %load solutions/solution_28.py

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    Implement both proposals suggested in the warning message, then argue which option you should choose and why.
</div>

In [None]:
# %load solutions/solution_29.py

In [None]:
# %load solutions/solution_30.py

In [None]:
# %load solutions/solution_31.py

In [None]:
# %load solutions/solution_32.py

<div class="alert alert-success">
    <p><b>EXERCISE</b>:</p>
    What is the impact of using a scaler on the coefficients?
</div>

In [None]:
# %load solutions/solution_33.py

In [None]:
# %load solutions/solution_34.py