<div class="alert alert-block alert-success">
    <b>ARTIFICIAL INTELLIGENCE (E016350A)</b> <br>
ALEKSANDRA PIZURICA <br>
GHENT UNIVERSITY <br>
AY 2024/2025 <br>
Assistant: Nicolas Vercheval
</div>

# Pipeline

In this notebook, we explain how to implement a *pipeline*. A pipeline is a sequence of steps that are grouped (conceptually and functionally) into a single operation. We implement it using the `pipeline` module of the `scikit-learn` library.

In [1]:
from sklearn import pipeline
from sklearn import datasets
from sklearn import linear_model
from sklearn import model_selection
from sklearn import preprocessing

In the previous lab session, we repeated some steps when building, training, tuning and testing a model. For example, we fitted the standardization on the training set and then applied it to both the training and the test set by calling the `transform` method. The same was true when processing the attributes: we selected the appropriate attributes using the training set and then extracted them from both the training and the test set. A pipeline defines a series of steps that are executed sequentially. Any class that provides the `fit` and `transform` methods can be used as an intermediate step; the final step only needs a `fit` method and is typically the model itself. A pipeline exposes the same `fit`/`transform`/`predict` interface as its individual steps, so it can be used wherever a single estimator is expected.
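
As a rough illustration of the idea, the sketch below chains a scaler, an attribute selector and a classifier into one object. It uses the tumour dataset that this notebook works with. The selector (`SelectKBest` with `f_classif`), the choice of `k=10` and the name `sketch_pipeline` are assumptions made only for this example and are not taken from the previous lab.

```python
from sklearn import datasets
from sklearn import feature_selection
from sklearn import linear_model
from sklearn import model_selection
from sklearn import pipeline
from sklearn import preprocessing

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=10
)

# Hypothetical three-step pipeline: scale, keep the 10 best-scoring attributes, classify.
# The selector and k=10 are illustrative assumptions.
sketch_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(),
    feature_selection.SelectKBest(feature_selection.f_classif, k=10),
    linear_model.LogisticRegression(),
)

# One call fits every step on the training set only.
sketch_pipeline.fit(X_train, y_train)

# One call applies the fitted scaler and selector to the test set, then predicts.
test_predictions = sketch_pipeline.predict(X_test)
print(test_predictions.shape)
```

The scaling statistics and the selected attributes are learned from `X_train` alone and then reused on `X_test`, which is the bookkeeping we previously did by hand.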

## Tumour dataset
To demonstrate how to work with pipelines, we load the tumour dataset from `scikit-learn` and divide it into training and test sets.

In [2]:
data = datasets.load_breast_cancer()

In [3]:
X = data.data
y = data.target

In [4]:
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size = 0.33, stratify = y, random_state = 10)

### No Pipeline

In [5]:
# standardization
scaler = preprocessing.StandardScaler()

# fitting the values for standardization
scaler.fit(X_train)

# apply to train
X_train_transformed = scaler.transform(X_train)

# apply to test
X_test_transformed = scaler.transform(X_test)

We train the classification model of choice. Let it be a logistic regression model with $C = 2$. The parameter $C$ is the reciprocal of the regularization strength, so larger values of $C$ mean weaker regularization.

In [6]:
model = linear_model.LogisticRegression(C = 2)

In [7]:
model.fit(X_train_transformed, y_train)

We evaluate the model either by calling the `score` method, which returns the accuracy for a classifier, or by calling the `predict` method and passing the predictions to an appropriate function of the `metrics` module.

In [8]:
model.score(X_test_transformed, y_test)

0.973404255319149
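
The second option can be sketched as follows. The block repeats the steps above so that it runs on its own. The choice of `accuracy_score` and `f1_score` is an assumption made for illustration; any other function of the `metrics` module could be used instead.

```python
from sklearn import datasets
from sklearn import linear_model
from sklearn import metrics
from sklearn import model_selection
from sklearn import preprocessing

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=10
)

# Same manual preprocessing as above: fit on train, apply to train and test.
scaler = preprocessing.StandardScaler().fit(X_train)
X_train_transformed = scaler.transform(X_train)
X_test_transformed = scaler.transform(X_test)

model = linear_model.LogisticRegression(C=2).fit(X_train_transformed, y_train)

# Predict first, then pick the metric explicitly.
y_pred = model.predict(X_test_transformed)
accuracy = metrics.accuracy_score(y_test, y_pred)
f1 = metrics.f1_score(y_test, y_pred)  # illustrative second metric

print(accuracy, f1)
```

For a classifier, `accuracy_score` on the predictions gives the same value as `score`.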

### Pipeline
The following code replicates the code above more compactly. The `make_pipeline` function creates a rudimentary pipeline consisting of a standard scaler and the model. When the `fit` method of the pipeline object is called, it fits each intermediate step in order, transforms the data with it, and finally fits the last step on the transformed data. Calling the `score` method of a pipeline means calling the `transform` method of every step except the last one and then calling the `score` method of the final step. The values computed during `fit` are reused whenever `transform` is executed. This routine removes the risk of omitting one of the steps or applying it to the wrong set.

In [19]:
linreg_pipeline = pipeline.make_pipeline(preprocessing.StandardScaler(), linear_model.LogisticRegression(C=2))

In [20]:
linreg_pipeline.fit(X_train, y_train)

In [21]:
linreg_pipeline.score(X_test, y_test)

0.973404255319149
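
To make the equivalence concrete, the following sketch pulls the fitted steps out of the pipeline by position and repeats the scoring by hand. The names `fitted_scaler`, `fitted_model`, `manual_score` and `pipeline_score` are introduced here only for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn import linear_model
from sklearn import model_selection
from sklearn import pipeline
from sklearn import preprocessing

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=10
)

linreg_pipeline = pipeline.make_pipeline(
    preprocessing.StandardScaler(), linear_model.LogisticRegression(C=2)
)
linreg_pipeline.fit(X_train, y_train)

# Steps can be accessed by position: 0 is the scaler, -1 is the model.
fitted_scaler = linreg_pipeline[0]
fitted_model = linreg_pipeline[-1]

# The scaler was fitted on the training set only.
print(np.allclose(fitted_scaler.mean_, X_train.mean(axis=0)))

# Scoring by hand: transform the test set, then score the final step.
manual_score = fitted_model.score(fitted_scaler.transform(X_test), y_test)
pipeline_score = linreg_pipeline.score(X_test, y_test)
print(manual_score == pipeline_score)
```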

### Pipeline alternative implementation
Besides the `make_pipeline` function, the `Pipeline` class can also be used. It offers a handy way to manage the steps by letting us specify a name for each of them, which is convenient when adjusting or retrieving the hyperparameters of a particular step.

In [22]:
linreg_pipeline = pipeline.Pipeline(steps=[('scaler', preprocessing.StandardScaler()), ('logreg', linear_model.LogisticRegression())])
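
For comparison, the short sketch below prints the step names of both variants. `make_pipeline` derives each name from the lowercased class name, while `Pipeline` keeps the names we pass in. The variable names `auto_named` and `explicitly_named` are introduced only for this example.

```python
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing

# Names are generated automatically from the class names.
auto_named = pipeline.make_pipeline(
    preprocessing.StandardScaler(), linear_model.LogisticRegression()
)
auto_names = [name for name, _ in auto_named.steps]
print(auto_names)  # ['standardscaler', 'logisticregression']

# Names are chosen by us.
explicitly_named = pipeline.Pipeline(
    steps=[
        ("scaler", preprocessing.StandardScaler()),
        ("logreg", linear_model.LogisticRegression()),
    ]
)
explicit_names = [name for name, _ in explicitly_named.steps]
print(explicit_names)  # ['scaler', 'logreg']
```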

We can adjust the parameters of any pipeline element with the `set_params` method. Each parameter is addressed by the name of the pipeline element, followed by a double underscore and the parameter name, for example `logreg__C`. When specifying the values of several parameters at once, the keyword arguments are separated by commas.

In [24]:
linreg_pipeline.set_params()  # your code here
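
One possible way to use this naming scheme is sketched below on a separate pipeline, so that the exercise cell above stays open. The values `C=2` (mirroring the earlier cells) and `max_iter=200`, as well as the name `example_pipeline`, are assumptions made for illustration.

```python
from sklearn import linear_model
from sklearn import pipeline
from sklearn import preprocessing

example_pipeline = pipeline.Pipeline(
    steps=[
        ("scaler", preprocessing.StandardScaler()),
        ("logreg", linear_model.LogisticRegression()),
    ]
)

# <step name>__<parameter name>; several parameters are separated by commas.
# The chosen values are illustrative assumptions.
example_pipeline.set_params(logreg__C=2, logreg__max_iter=200)

# The same names can be used to read the hyperparameters back.
params = example_pipeline.get_params()
print(params["logreg__C"], params["logreg__max_iter"])
```

`set_params` returns the pipeline itself, so the call can be chained with `fit` if desired.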

The `fit` method works as described above.

In [25]:
linreg_pipeline.fit(X_train, y_train)

Individual elements of the pipeline can be retrieved using their names. For example, the next block of code obtains the learned coefficients of the logistic regression model.

In [26]:
linreg_pipeline['logreg'].coef_

array([[-0.71287111, -0.54929786, -0.68058605, -0.74137892, -0.29316   ,
         0.32212727, -0.87269498, -0.79199619,  0.27300058,  0.33862541,
        -1.1171623 ,  0.35158335, -0.76694739, -0.84669554, -0.32886658,
         0.65717451,  0.02115188, -0.02726388, -0.0933187 ,  0.64001075,
        -1.01571092, -1.13111336, -0.88277088, -0.94940954, -1.05485044,
         0.1135596 , -0.82438701, -0.51588989, -0.72373462, -0.49042159]])

In [27]:
# bias
linreg_pipeline['logreg'].intercept_

array([0.08878887])

The `score` method works as before.

In [28]:
linreg_pipeline.score(X_test, y_test)

0.973404255319149

The same holds for the `predict` method.

In [31]:
linreg_pipeline.predict([X_test[0,:]])

array([1])