# Pipelines in Machine Learning

Video Link: https://www.youtube.com/watch?v=HZ9MUzCRlzI&list=PLT6wrBlkasCNqKnKcs1hOoCEhtcUZiAqo&index=102&t=215s

In [2]:
from sklearn.pipeline import Pipeline

## Pipeline Implementation Basic Example

### 1. Import standard scaling and regression technique

In [4]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

### 2. Combine standard scaler with logistic regression

`StandardScaler` is a transformer and `LogisticRegression` is an estimator, that is, the model we want to fit. To apply them together we define a list of steps. Each step is a `(name, object)` tuple, and the steps run in the order in which they are listed.

In [5]:
steps=[("standard_scaler", StandardScaler()),
       ("classifier", LogisticRegression())
    ]

In [6]:
steps

[('standard_scaler', StandardScaler()), ('classifier', LogisticRegression())]

**Inference:**
    
1. The first step is a standard scaler.
2. The second step is a classifier.

### 3. Convert the steps into a pipeline.

In [8]:
pipe=Pipeline(steps)

In [9]:
pipe

Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('classifier', LogisticRegression())])
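The names given in the step tuples are how individual steps are looked up later. The following is a minimal sketch, assuming the same two steps as above; `pipe_demo` is an illustrative name that is not part of the original notebook.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative pipeline with the same two steps as in the notebook.
pipe_demo = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# The step names, in execution order.
step_names = [name for name, _ in pipe_demo.steps]

# A step can be fetched by name or by position.
scaler_step = pipe_demo.named_steps["standard_scaler"]
final_step = pipe_demo[-1]

print(step_names)
print(type(scaler_step).__name__, type(final_step).__name__)
```

Only the last step needs to be an estimator; every earlier step has to be a transformer.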

### 4. Visualize the pipeline

In [10]:
from sklearn import set_config

In [11]:
set_config(display='diagram')

In [12]:
pipe

**Inference:**

The steps in the pipeline are visualised here as a diagram, in the order in which they run.

### 5. Creating a Sample dataset

Create a classification dataset

In [13]:
from sklearn.datasets import make_classification

In [14]:
X,y=make_classification() # create a random classification dataset.

In [15]:
X

array([[-0.8893426 , -0.48468934, -0.70284202, ..., -2.32482432,
         2.80715278, -0.46992368],
       [-1.6974948 , -0.31449376,  1.43447252, ...,  0.73839567,
         0.20834267, -2.05826059],
       [-1.40240351, -0.30746283,  0.22037491, ...,  0.38789815,
        -0.55328078,  1.29948373],
       ...,
       [ 0.00672725, -0.40953861,  1.36179955, ...,  0.33324286,
        -0.13486778,  2.00019353],
       [ 0.36112235, -0.36786791,  0.42053888, ...,  1.39141826,
        -0.7998025 ,  0.91773249],
       [ 1.23801716,  1.75867201, -1.1755313 , ...,  1.03609155,
        -0.05100509,  1.07133446]])

In [16]:
y

array([0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0,
       0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1,
       0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1])
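Because `make_classification()` is called without a seed, the arrays shown above change on every run. The sketch below spells out the scikit-learn defaults (100 samples, 20 features, 2 classes) and adds `random_state=0` to make the data reproducible. The seed value and the names `X_demo` and `y_demo` are assumptions for illustration, not taken from the notebook.

```python
from sklearn.datasets import make_classification

# Same defaults as the bare call, written out explicitly, plus a fixed seed.
X_demo, y_demo = make_classification(
    n_samples=100,
    n_features=20,
    n_classes=2,
    random_state=0,  # assumed seed, only for reproducibility
)

print(X_demo.shape)  # (100, 20)
print(y_demo.shape)  # (100,)
```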

### 6. Train Test Split the Dataset

In [17]:
from sklearn.model_selection import train_test_split

In [18]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)

In [19]:
X_train

array([[ 0.31584134, -1.26598152,  0.74370039, ...,  0.676992  ,
         0.07601347,  1.74167262],
       [ 1.02462789,  1.1845937 ,  0.37518611, ...,  0.55508769,
         0.30563926, -0.3852162 ],
       [-0.05955848,  0.91760858, -0.11861463, ...,  0.12433064,
         0.09922655,  1.12533622],
       ...,
       [-1.25726864, -0.24975358, -0.71358001, ...,  0.47218291,
        -0.72596439,  1.7079668 ],
       [-1.31921601, -1.09533032,  0.14033446, ...,  0.12105736,
         1.05055007,  0.53419023],
       [-0.39372686, -0.73266008,  0.42498512, ..., -1.60258066,
        -0.0496499 , -1.58634599]])

In [20]:
y_train

array([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1,
       0])
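With 100 samples and `test_size=0.33`, the split leaves 67 rows for training and 33 for testing, which matches the length of `y_train` above. A quick check, assuming a seeded dataset (`random_state=0` and the `*_demo`, `X_tr`, `X_te` names are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data; the seed is an assumption for reproducibility.
X_demo, y_demo = make_classification(random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print(X_tr.shape, X_te.shape)  # (67, 20) (33, 20)
print(y_tr.shape, y_te.shape)  # (67,) (33,)
```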

### 7. Fit the pipeline on X_train and y_train data

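Once we call `fit` on `X_train` and `y_train`, the data passes through each step of the pipeline: the scaler is fitted on the training data and transforms it, and logistic regression is then trained on the scaled output.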

In [21]:
pipe.fit(X_train, y_train)
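One way to confirm that `fit` also fits the scaler is to inspect the scaler inside the pipeline afterwards: its learned means should equal the column means of the training data. This is a sketch on illustrative data; the seed and the `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (seed assumed for reproducibility).
X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

pipe_demo = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe_demo.fit(X_tr, y_tr)

# The scaler step is fitted in place during pipe_demo.fit(...).
fitted_scaler = pipe_demo.named_steps["standard_scaler"]
means_match = np.allclose(fitted_scaler.mean_, X_tr.mean(axis=0))
print(means_match)  # True
```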

### 8. Prediction

In [23]:
y_pred = pipe.predict(X_test)

In [24]:
y_pred

array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0])

**Inference:**

- At prediction time, when *X_test* passes through the pipeline, the *StandardScaler* step only calls *transform()*. It is not refitted, so it reuses the statistics learned from the training data.

- This is how we can use pipelines: we can chain transformation steps, such as feature scaling, with a final estimator.
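The transform-only behaviour at prediction time can be checked by reproducing `predict` by hand: transform the test data with the already fitted scaler, then call the classifier's `predict`. This is a sketch on illustrative data; the seed and the `*_demo` names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data and split (seed assumed for reproducibility).
X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

pipe_demo = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
pipe_demo.fit(X_tr, y_tr)

# Manual route: transform only (no refit), then predict.
fitted_scaler = pipe_demo.named_steps["standard_scaler"]
fitted_clf = pipe_demo.named_steps["classifier"]
manual_pred = fitted_clf.predict(fitted_scaler.transform(X_te))

# Pipeline route.
pipeline_pred = pipe_demo.predict(X_te)

same = np.array_equal(manual_pred, pipeline_pred)
print(same)  # True
```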

## Pipeline Implementation Example 2

### 1. Import PCA and SVC

In [27]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

### 2. Steps to apply for model creation

In [28]:
steps = [
    ("scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC", SVC())
]

### 3. Convert the Steps into Pipeline

In [31]:
pipe2=Pipeline(steps)

In [32]:
pipe2

**Inference:**
    
The pipeline first applies the standard scaler, then PCA (reducing the data to 3 components), and finally SVC.

### 4. Fit the pipeline on X_train and y_train

In [33]:
pipe2.fit(X_train,y_train)

### 5. Prediction

In [35]:
pipe2.predict(X_test)

array([1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1])
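To see what the SVC actually receives, the pipeline can be sliced to drop its final step and then used as a transformer. With `n_components=3`, the 20 input features should be reduced to 3 columns. This is a sketch on illustrative data; the seed and the `*_demo` names are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data and split (seed assumed for reproducibility).
X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

pipe2_demo = Pipeline([
    ("scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC", SVC()),
])
pipe2_demo.fit(X_tr, y_tr)

# pipe2_demo[:-1] is a sub-pipeline of the fitted scaler and PCA only.
reduced = pipe2_demo[:-1].transform(X_te)
print(reduced.shape)  # (33, 3)

preds = pipe2_demo.predict(X_te)
print(preds.shape)  # (33,)
```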

## Pipeline Implementation Example 3 - Column Transformer

In [36]:
# time: 15:45