<a href="https://colab.research.google.com/github/pipuf/ml_dev_cert/blob/main/9_1_3_THEORY_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Pipeline

It's time to bring together the concepts we have seen, not only in this last class but throughout the entire course. To do that, we introduce the concept of a __Pipeline__ and see how to use the [Scikit-Learn Pipeline](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline).

Any machine learning problem can be divided into successive black boxes (or steps) that are executed sequentially: __transformations on the input data__ (standardizations, dimension reduction, etc.), __training a model__, __optimization of hyperparameters with cross-validation__ and __prediction__.

All of this can be done "by hand" by connecting the different steps yourself, but it is good practice to implement a pipeline: instead of executing each step individually, we execute the Pipeline, which contains everything related to the model.
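
To make the comparison concrete, here is a minimal sketch of the same two-step workflow written once "by hand" and once as a Pipeline. It uses synthetic data from `make_classification` rather than the tutorial's dataset, and every variable name in it (`X_demo`, `demo_pipe`, and so on) is an assumption made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration (not the tutorial's dataset)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, random_state=0)

# "By hand": fit each step and pass its output to the next one
scaler = StandardScaler().fit(Xd_train)
manual_knn = KNeighborsClassifier().fit(scaler.transform(Xd_train), yd_train)
manual_pred = manual_knn.predict(scaler.transform(Xd_test))

# Pipeline: the same steps, fitted and applied with a single object
demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('classifier', KNeighborsClassifier()),
])
demo_pipe.fit(Xd_train, yd_train)
pipe_pred = demo_pipe.predict(Xd_test)

# Both approaches should give the same predictions
print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline version keeps the scaler and the classifier together, so there is no way to forget to transform the test data before predicting.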


Import the necessary libraries

In [None]:
import pandas as pd # For dataframes
import numpy as np # For matrices
import matplotlib.pyplot as plt # For plotting data
import seaborn as sns # For plotting data
from sklearn.model_selection import train_test_split # For train/test splits
from sklearn.neighbors import KNeighborsClassifier # The k-nearest neighbor classifier
from sklearn.feature_selection import VarianceThreshold # Feature selector
from sklearn.pipeline import Pipeline # For setting up pipeline
from sklearn.preprocessing import Normalizer, StandardScaler, MinMaxScaler, PowerTransformer, MaxAbsScaler, LabelEncoder

#### The Dataset
We’ll use the Ecoli Dataset from the UCI Machine Learning Repository to demonstrate all the concepts of this tutorial. This dataset is maintained by Kenta Nakai. Let’s first load the Ecoli dataset in a Pandas DataFrame and view the first few rows.

In [None]:
# Download a copy of the Ecoli dataset (originally from the UCI ML
# Repository) with gdown, then load it into the dataframe df
!gdown "1VPCAftz3JpUDRPz2usM8BYnidU6P1K1M"

df = pd.read_csv('pipeline.csv')
df

Downloading...
From: https://drive.google.com/uc?id=1VPCAftz3JpUDRPz2usM8BYnidU6P1K1M
To: /content/pipeline.csv
  0% 0.00/17.2k [00:00<?, ?B/s]100% 17.2k/17.2k [00:00<00:00, 36.0MB/s]


Unnamed: 0.1,Unnamed: 0,0,1,2,3,4,5,6,7,8
0,0,AAT_ECOLI,0.49,0.29,0.48,0.5,0.56,0.24,0.35,cp
1,1,ACEA_ECOLI,0.07,0.40,0.48,0.5,0.54,0.35,0.44,cp
2,2,ACEK_ECOLI,0.56,0.40,0.48,0.5,0.49,0.37,0.46,cp
3,3,ACKA_ECOLI,0.59,0.49,0.48,0.5,0.52,0.45,0.36,cp
4,4,ADI_ECOLI,0.23,0.32,0.48,0.5,0.55,0.25,0.35,cp
...,...,...,...,...,...,...,...,...,...,...
331,331,TREA_ECOLI,0.74,0.56,0.48,0.5,0.47,0.68,0.30,pp
332,332,UGPB_ECOLI,0.71,0.57,0.48,0.5,0.48,0.35,0.32,pp
333,333,USHA_ECOLI,0.61,0.60,0.48,0.5,0.44,0.39,0.38,pp
334,334,XYLF_ECOLI,0.59,0.61,0.48,0.5,0.42,0.42,0.37,pp


We’ll ignore the leading columns (the saved index and the sequence name). The last column is the class label. Let’s separate the features from the class label and split the dataset into 2/3 training examples and 1/3 test examples.

In [None]:
# The data matrix X
X = df.iloc[:,2:-1]
# The labels
y = df.iloc[:,-1]

# Encode the labels into unique integers
encoder = LabelEncoder()
y = encoder.fit_transform(np.ravel(y))

# Split the data into test and train
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=1/3,
    random_state=0)

print(X_train.shape)
print(X_test.shape)

(224, 7)
(112, 7)


Great! Now we have 224 samples in the training set and 112 samples in the test set. We have chosen a small dataset so that we can focus on the concepts, rather than the data itself.
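
The label-encoding step used in the split above can be sketched on its own. This is a small illustration rather than the notebook's actual data: the labels `cp` and `pp` appear in the table, while the third label and the names `labels` and `demo_encoder` are placeholders chosen for the example.

```python
from sklearn.preprocessing import LabelEncoder

# Illustrative labels only; 'im' is a placeholder third class
labels = ['cp', 'pp', 'cp', 'im', 'pp']

demo_encoder = LabelEncoder()
codes = demo_encoder.fit_transform(labels)

# classes_ holds the sorted unique labels; a label's code is its index there
print(demo_encoder.classes_)
print(codes)

# inverse_transform maps the integer codes back to the original strings
print(demo_encoder.inverse_transform(codes))
```

Keeping the fitted encoder around is useful later, because predictions come back as integers and `inverse_transform` turns them into readable class names.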

For this tutorial, we have chosen the k-nearest neighbor classifier to perform the classification of this dataset.

In [None]:
knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print('Training set score: ' + str(knn.score(X_train,y_train)))
print('Test set score: ' + str(knn.score(X_test,y_test)))

Training set score: 0.9017857142857143
Test set score: 0.8482142857142857


For this tutorial, we’ll set up a very basic pipeline that consists of the following sequence:

1. Scaler: For pre-processing data, i.e., transform the data to zero mean and unit variance using the StandardScaler().
2. Feature selector: Use VarianceThreshold() for discarding features whose variance is less than a certain defined threshold.
3. Classifier: KNeighborsClassifier(), which implements the k-nearest neighbor classifier: it assigns the majority class among the k training points closest to the test example.
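
One detail about the order of the first two steps is worth a quick check. After standardization, every non-constant feature has variance 1, so a VarianceThreshold placed after StandardScaler with a threshold below 1 can only drop constant features. The sketch below illustrates this on a made-up array; `toy`, `scaled`, and `sel` are names assumed for the example.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
toy = np.column_stack([
    rng.normal(0, 10, 100),    # high-variance feature
    rng.normal(0, 0.01, 100),  # low-variance feature
    np.full(100, 0.5),         # constant feature
])

# After standardization the two non-constant columns both have variance 1
scaled = StandardScaler().fit_transform(toy)
print(scaled.var(axis=0))

# The default threshold (0) therefore removes only the constant column
sel = VarianceThreshold().fit(scaled)
print(sel.get_support())
```

With a scaler that does not equalize variances, such as MinMaxScaler, the threshold has more room to discriminate between features.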

In [None]:
pipe = Pipeline(
  [
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier()),
  ]
)

The pipe object is simple to understand: scale first, select features second, and classify last. Let’s call the fit() method of the pipe object on our training data and get the training and test scores.

In [None]:
pipe.fit(X_train, y_train)

print('Training set score: ' + str(pipe.score(X_train,y_train)))
print('Test set score: ' + str(pipe.score(X_test,y_test)))

Training set score: 0.8794642857142857
Test set score: 0.8392857142857143


So it looks like the performance of this pipeline is worse than the single classifier performance on raw data. Not only did we add extra processing, but it was all in vain.
Don’t despair, the real benefit of the pipeline comes from its tuning. The next section explains how to do that.
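
Before tuning, it can help to look inside a fitted pipeline and see what each step learned. The following is a minimal sketch on synthetic data, with `Xi`, `yi`, and `inspect_pipe` as assumed names; it relies on the `named_steps` attribute and on slicing a Pipeline to get its transformer part.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
Xi, yi = make_classification(
    n_samples=150, n_features=7, n_informative=4, random_state=1)

inspect_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier()),
]).fit(Xi, yi)

# Each fitted step is available by the name given in the pipeline
print(inspect_pipe.named_steps['scaler'].mean_)

# Boolean mask of the features the selector kept
kept = inspect_pipe.named_steps['selector'].get_support()
print(kept)

# Slicing off the classifier gives the data as the classifier sees it
Xi_transformed = inspect_pipe[:-1].transform(Xi)
print(Xi_transformed.shape)
```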

### Tuning the Pipeline

In the code below, we’ll show the following:

1. We can search for the best scaler. Instead of just the StandardScaler(), we can try MinMaxScaler(), Normalizer() and MaxAbsScaler().
2. We can search for the best variance threshold to use in the selector, i.e., VarianceThreshold().
3. We can search for the best value of k for the KNeighborsClassifier().
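
These three searches can also be automated. The sketch below is one possible way to do it with scikit-learn's `GridSearchCV`, which the cells in this notebook do not use; it runs on synthetic data, and the names `Xg`, `search_pipe`, `param_grid`, and `grid`, as well as the candidate values, are assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import (MaxAbsScaler, MinMaxScaler, Normalizer,
                                   StandardScaler)

# Synthetic data, used only for illustration
Xg, yg = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0)
Xg_train, Xg_test, yg_train, yg_test = train_test_split(
    Xg, yg, test_size=1/3, random_state=0)

search_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('selector', VarianceThreshold()),
    ('classifier', KNeighborsClassifier()),
])

# A whole step is addressed by its name, a parameter by <step>__<parameter>
param_grid = {
    'scaler': [StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()],
    'selector__threshold': [0, 0.001, 0.01],
    'classifier__n_neighbors': [1, 3, 5, 7, 9],
}

# 5-fold cross-validation on the training set only
grid = GridSearchCV(search_pipe, param_grid, cv=5)
grid.fit(Xg_train, yg_train)

print(grid.best_params_)
print('Best cross-validation score: ' + str(grid.best_score_))
print('Test set score: ' + str(grid.score(Xg_test, yg_test)))
```

Because the candidates are compared with cross-validation on the training data, the test set is touched only once, for the final score.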

In [None]:
pipe = Pipeline([
('scaler', MinMaxScaler()),
('selector', VarianceThreshold(0.001)),
('classifier', KNeighborsClassifier(leaf_size=1, n_neighbors=5))])

In [None]:
pipe.fit(X_train, y_train)

print('Training set score: ' + str(pipe.score(X_train,y_train)))
print('Test set score: ' + str(pipe.score(X_test,y_test)))

Training set score: 0.8928571428571429
Test set score: 0.8482142857142857


There are some changes: the scores differ slightly between the first pipe and the last one. We strongly recommend making one change at a time, so that you can tell which parameter adds noise and which one improves the pipe. Let’s do another example.

In [None]:
# Remember to change just one parameter!
pipe = Pipeline([
('scaler', MinMaxScaler()),
('selector', VarianceThreshold(0.001)),
('classifier', KNeighborsClassifier(leaf_size=1, n_neighbors=5))])

In [None]:
pipe.fit(X_train, y_train)

print('Training set score: ' + str(pipe.score(X_train,y_train)))
print('Test set score: ' + str(pipe.score(X_test,y_test)))

Training set score: 0.8928571428571429
Test set score: 0.8482142857142857
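
Changing one parameter at a time does not require retyping the whole pipeline. As a minimal sketch, assuming synthetic data and the illustrative names `base_pipe` and `candidate`, `set_params` can vary a single value while everything else stays fixed:

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data, used only for illustration
Xs, ys = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=0)
Xs_train, Xs_test, ys_train, ys_test = train_test_split(
    Xs, ys, test_size=1/3, random_state=0)

base_pipe = Pipeline([
    ('scaler', MinMaxScaler()),
    ('selector', VarianceThreshold(0.001)),
    ('classifier', KNeighborsClassifier(leaf_size=1, n_neighbors=5)),
])

# Vary only n_neighbors; clone() leaves base_pipe itself untouched
for k in [3, 5, 7, 9]:
    candidate = clone(base_pipe).set_params(classifier__n_neighbors=k)
    candidate.fit(Xs_train, ys_train)
    print(k,
          candidate.score(Xs_train, ys_train),
          candidate.score(Xs_test, ys_test))
```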


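With the cell as shown (n_neighbors=5), the scores match the previous run. Increasing the number of neighbors seemed to make the pipe worse, so let’s set the neighbors back to five and change another parameter instead, the scaler. Then you can try it yourself!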

In [None]:
pipe = Pipeline([
('scaler', StandardScaler()),
('selector', VarianceThreshold(0.001)),
('classifier', KNeighborsClassifier(leaf_size=1, n_neighbors=5))])

In [None]:
pipe.fit(X_train, y_train)

print('Training set score: ' + str(pipe.score(X_train,y_train)))
print('Test set score: ' + str(pipe.score(X_test,y_test)))

Training set score: 0.8794642857142857
Test set score: 0.8392857142857143


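Comparing the runs above, the MinMaxScaler pipeline matches the plain classifier's test score (about 0.848), while the StandardScaler pipeline stays slightly below it (about 0.839). Manual tuning has not yet beaten the simple classifier here, which is why it is important to analyze the results of each change.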

Now it's your turn: try to reach a test set score of at least 0.85.
Change the leaf_size, the scaler (StandardScaler(), MinMaxScaler(), Normalizer(), MaxAbsScaler()), and so on.
## You can do it