# Pipelines in Machine Learning

Video Link: 
    https://www.youtube.com/watch?v=HZ9MUzCRlzI&list=PLT6wrBlkasCNqKnKcs1hOoCEhtcUZiAqo&index=102&t=215s

In [1]:
from sklearn.pipeline import Pipeline

## Pipeline Implementation Basic Example

### 1. Import standard scaling and regression technique

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

### 2. Combine standard scaler with logistic regression

StandardScaler is a transformer and LogisticRegression is an estimator. An estimator is the model we want to fit. To build a pipeline we define a list of steps, where each step is a `(name, object)` tuple.

In [3]:
steps=[("standard_scaler", StandardScaler()),
       ("classifier", LogisticRegression())
    ]

In [4]:
steps

[('standard_scaler', StandardScaler()), ('classifier', LogisticRegression())]

**Inference:**
    
    1. The first step is a standard scaler.
    2. The second step is a classifier.

### 3. Convert the steps into a pipeline.

In [5]:
pipe=Pipeline(steps)

In [6]:
pipe

Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('classifier', LogisticRegression())])
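
As a small illustration of how the step names are used, the sketch below rebuilds the same two-step pipeline under a hypothetical name (`demo_pipe`) and looks the steps up by name and by position. It assumes a scikit-learn version that supports indexing a `Pipeline`.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# demo_pipe is a hypothetical name; it mirrors the pipeline built above.
demo_pipe = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])

# Each step is stored as a (name, object) tuple in .steps.
step_names = [name for name, _ in demo_pipe.steps]

# Steps can be looked up by name, or by position like a list.
scaler = demo_pipe.named_steps["standard_scaler"]
final_estimator = demo_pipe[-1]

print(step_names)
print(type(scaler).__name__, type(final_estimator).__name__)
```

The names chosen in the tuples are the handles used later to inspect or tune an individual step.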

### 4. Visualize the pipeline

In [7]:
from sklearn import set_config

In [8]:
set_config(display='diagram')

In [9]:
pipe

**Inference:**

With `display='diagram'`, the notebook renders the pipeline as a diagram of its steps.

### 5. Creating a Sample dataset

Create a classification dataset

In [10]:
from sklearn.datasets import make_classification

In [11]:
X,y=make_classification() # create a random classification dataset.

In [12]:
X

array([[-2.51876924, -0.07106745,  0.06826223, ..., -1.11460356,
         0.20461509,  2.05256632],
       [ 0.32646908,  1.08815252,  1.90898687, ...,  0.42515242,
         1.94851406,  0.21353191],
       [ 1.001031  ,  1.08257456,  0.70872979, ...,  1.3713969 ,
        -0.21497598, -0.08871041],
       ...,
       [-0.39291761, -0.31187759,  1.46832999, ..., -1.75207364,
        -1.09218128,  1.33628158],
       [-0.16326442,  0.54194698, -0.69813275, ..., -0.40051423,
        -1.08109713, -0.15673293],
       [-1.60363925,  0.48560624,  0.10257006, ...,  0.15958213,
        -0.52968568, -1.21680435]])

In [13]:
y

array([1, 1, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0,
       0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0,
       1, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0])
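
To see what `make_classification` returns with its default arguments, a quick shape check helps. The sketch below uses hypothetical names (`X_demo`, `y_demo`) and adds `random_state=0`, which is not in the original cell, only so the data is reproducible.

```python
from sklearn.datasets import make_classification

# random_state=0 is an assumption added for reproducibility.
X_demo, y_demo = make_classification(random_state=0)

# By default: 100 samples, 20 features and 2 classes.
print(X_demo.shape, y_demo.shape)
print(sorted(set(y_demo.tolist())))
```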

### 6. Train Test Split the Dataset

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)

In [16]:
X_train

array([[ 1.35493571, -1.28393659,  0.04572112, ..., -0.00366179,
         0.99295554, -0.97590252],
       [ 0.76097742, -2.72437166,  0.63668071, ..., -1.03921526,
        -0.90379363,  2.31514132],
       [-0.07924928,  1.02925873,  0.41424283, ..., -1.10933297,
         1.0985804 ,  0.22658714],
       ...,
       [ 1.11515487,  0.33833086,  0.80532659, ..., -0.24313914,
        -0.31174538,  0.30374692],
       [ 1.39192703,  1.52442735, -1.24589215, ...,  0.08077754,
         1.45197455,  0.22819958],
       [-0.60951902, -0.33028115,  0.57507368, ...,  0.29912976,
        -0.26560902, -1.84825889]])

In [17]:
y_train

array([1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1,
       0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1,
       1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1,
       1])
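
With 100 samples and `test_size=0.33`, the split should give 67 training rows and 33 test rows, which matches the lengths of `y_train` above and `y_pred` below. A minimal check, using hypothetical names and an assumed `random_state=0` for the dataset:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Dataset seed is an assumption; the split arguments match the cell above.
X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print(X_tr.shape, X_te.shape)
print(y_tr.shape, y_te.shape)
```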

### 7. Fit the pipeline on X_train and y_train data

Once we call fit on X_train and y_train, the data passes through the pipeline: the StandardScaler is fitted on X_train and transforms it, and the LogisticRegression is then trained on the scaled data.

In [18]:
pipe.fit(X_train, y_train)
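
To make the fit step concrete, the sketch below compares a fitted pipeline with the same two steps done by hand. All names (`demo_pipe`, `manual_scaler`, `manual_clf`) are hypothetical and the dataset seed is an assumption; it is only meant to show that both routes should learn the same parameters.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Route 1: let the pipeline handle scaling and training.
demo_pipe = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_pipe.fit(X_tr, y_tr)

# Route 2: the same work by hand, fit_transform then fit.
manual_scaler = StandardScaler()
X_tr_scaled = manual_scaler.fit_transform(X_tr)
manual_clf = LogisticRegression().fit(X_tr_scaled, y_tr)

same_means = np.allclose(
    demo_pipe.named_steps["standard_scaler"].mean_, manual_scaler.mean_
)
same_coefs = np.allclose(
    demo_pipe.named_steps["classifier"].coef_, manual_clf.coef_
)
print(same_means, same_coefs)
```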

### 8. Prediction

In [19]:
y_pred = pipe.predict(X_test)

In [20]:
y_pred

array([1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 1,
       1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 1])

**Inference:**

- In prediction, *X_test* passes through the already fitted *StandardScaler*, which only calls *transform()* (it is not refitted). The scaled data then goes to the classifier's *predict()*.

- This is how we can use pipelines. We can chain transformation steps such as feature scaling with a final estimator.
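
The transform-only behaviour at prediction time can be checked with a small sketch. It uses hypothetical names and an assumed dataset seed, and compares `predict` on the pipeline with calling `transform` and `predict` on the fitted steps by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_pipe = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_tr, y_tr)

scaler = demo_pipe.named_steps["standard_scaler"]
clf = demo_pipe.named_steps["classifier"]
means_before = scaler.mean_.copy()

# What the pipeline does internally: transform only, then predict.
manual_pred = clf.predict(scaler.transform(X_te))
pipe_pred = demo_pipe.predict(X_te)

same_predictions = np.array_equal(manual_pred, pipe_pred)
# The scaler statistics still come from the training data only.
scaler_unchanged = np.array_equal(means_before, scaler.mean_)
print(same_predictions, scaler_unchanged)
```

Because the scaler is never refitted on the test data, no information from *X_test* leaks into the preprocessing.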

## Pipeline Implementation Example 2

### 1. Import PCA and SVC

In [21]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

### 2. Steps to apply for model creation

In [22]:
steps = [
    ("scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC", SVC())
]

### 3. Convert the Steps into Pipeline

In [23]:
pipe2=Pipeline(steps)

In [24]:
pipe2

**Inference:**
    
    This means in the first step we apply standard scaler, followed by PCA and SVC.

### 4. Fit the pipeline on X_train and y_train

In [25]:
pipe2.fit(X_train,y_train)

### 5. Prediction

In [26]:
pipe2.predict(X_test)

array([1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0])
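
To see what the SVC actually receives, the pipeline can be sliced so that only the transformers run. The sketch below uses hypothetical names and an assumed dataset seed, and relies on `Pipeline` slicing (available in reasonably recent scikit-learn versions).

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_pipe2 = Pipeline([
    ("scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC", SVC()),
]).fit(X_tr, y_tr)

# Everything except the last step: scaling followed by PCA.
reduced = demo_pipe2[:-1].transform(X_te)
predictions = demo_pipe2.predict(X_te)

# 20 input features are reduced to 3 components before the SVC.
print(X_te.shape, reduced.shape, predictions.shape)
```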

## Pipeline Implementation Example 3 - Column Transformer

### 1. Impute missing values using pipeline

In [27]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [28]:
import numpy as np

### 2. Create a numerical processing pipeline to impute missing values in numerical columns.

In [29]:
numeric_processor = Pipeline(steps=[
    ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler())
])

# here missing values (np.nan) are replaced with the column mean.

In [30]:
numeric_processor

**Inference**

Here the first step is SimpleImputer, followed by StandardScaler.
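
A tiny worked example may help here. The sketch below runs the same two steps on a made-up 3x2 array (`toy_numeric` and `demo_numeric` are hypothetical names) and checks which values the imputer learned.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_numeric = Pipeline(steps=[
    ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
])

# Made-up data with one missing value per column.
toy_numeric = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
])

processed = demo_numeric.fit_transform(toy_numeric)

# Per-column means used to fill the gaps: 2.0 and 15.0.
learned_means = demo_numeric.named_steps["imputation_mean"].statistics_
print(learned_means)
print(processed)
```

An imputed entry equals its column mean, so after standard scaling it ends up at 0.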

### 3. Categorical processing pipeline to impute missing data in categorical columns

**Steps**

    1. Simple Imputation
    2. One Hot Encoding

In [31]:
from sklearn.preprocessing import OneHotEncoder

categorical_processing = Pipeline(steps=[
    ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
    ("onehot_encoding", OneHotEncoder(handle_unknown="ignore"))
])

In [32]:
categorical_processing

#### Inference

**Steps**

    1. Fill missing values with the constant "missing".
    2. One-hot encode the columns, ignoring categories that were not seen during fit.
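
The two steps above can be sketched on a made-up array of strings. All names (`demo_categorical`, `toy_categorical`) and values are hypothetical. The example also transforms a row with categories that were never seen during fit, to show the effect of `handle_unknown="ignore"`.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_categorical = Pipeline(steps=[
    ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
    ("onehot_encoding", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up gender / city values with one gap in each column.
toy_categorical = np.array([
    ["male", "Delhi"],
    [np.nan, "Mumbai"],
    ["female", np.nan],
], dtype=object)

# The encoder returns a sparse matrix by default, so convert for printing.
encoded = demo_categorical.fit_transform(toy_categorical).toarray()

# "missing" becomes a category of its own in each column.
categories = demo_categorical.named_steps["onehot_encoding"].categories_

# Categories never seen during fit are encoded as all zeros.
unseen_row = demo_categorical.transform(
    np.array([["other", "Pune"]], dtype=object)
).toarray()

print([c.tolist() for c in categories])
print(encoded)
print(unseen_row)
```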

### 4. Combine both the numerical and categorical pipelines

In [37]:
from sklearn.compose import ColumnTransformer

In [39]:
preprocessor = ColumnTransformer(
    [("categorical", categorical_processing, ['gender','City']),
     ("numerical", numeric_processor, ['age', 'height']),
    ]
)

In [40]:
preprocessor

**Inference:**
    
    We combined both pipelines. For the categorical columns, the simple imputer is applied first, followed by the one-hot encoder.
    For the numerical columns, the simple imputer is applied first, followed by the standard scaler.
    
    Here, gender, City, age and height are column names of the input data.
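
Because the columns are selected by name, the preprocessor expects a pandas DataFrame. The sketch below applies an equivalent `ColumnTransformer` to a small made-up DataFrame. The data and the names `toy_df` and `demo_preprocessor` are hypothetical, and pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_steps = Pipeline(steps=[
    ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
])
categorical_steps = Pipeline(steps=[
    ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
    ("onehot_encoding", OneHotEncoder(handle_unknown="ignore")),
])
demo_preprocessor = ColumnTransformer([
    ("categorical", categorical_steps, ["gender", "City"]),
    ("numerical", numeric_steps, ["age", "height"]),
])

# Made-up rows with a gap in every column.
toy_df = pd.DataFrame({
    "gender": ["male", "female", np.nan, "female"],
    "City": ["Delhi", "Mumbai", "Delhi", np.nan],
    "age": [25.0, np.nan, 40.0, 31.0],
    "height": [170.0, 160.0, np.nan, 180.0],
})

transformed = demo_preprocessor.fit_transform(toy_df)
# The result may be sparse or dense depending on its density.
if hasattr(transformed, "toarray"):
    transformed = transformed.toarray()

# 3 gender categories + 3 City categories + 2 scaled numeric columns.
print(transformed.shape)
```

The one-hot columns come first in the output because the categorical transformer is listed first.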

### 5. Create a custom pipeline and combine with an estimator

In [41]:
from sklearn.pipeline import make_pipeline

In [44]:
pipe = make_pipeline(preprocessor, LogisticRegression()) 

# estimator - LogisticRegression.

In [43]:
pipe
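
The notebook stops before fitting this final pipeline. As a rough sketch of how it could be used end to end, the example below fits an equivalent pipeline on a made-up DataFrame and predicts on new rows. The data, the target values and all names (`toy_df`, `toy_target`, `demo_model`, `new_rows`) are hypothetical, and pandas is assumed to be installed.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

demo_preprocessor = ColumnTransformer([
    ("categorical", Pipeline(steps=[
        ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
        ("onehot_encoding", OneHotEncoder(handle_unknown="ignore")),
    ]), ["gender", "City"]),
    ("numerical", Pipeline(steps=[
        ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["age", "height"]),
])

# Made-up training data and labels, for illustration only.
toy_df = pd.DataFrame({
    "gender": ["male", "female", np.nan, "female", "male", "female"],
    "City": ["Delhi", "Mumbai", "Delhi", np.nan, "Mumbai", "Delhi"],
    "age": [25.0, np.nan, 40.0, 31.0, 52.0, 23.0],
    "height": [170.0, 160.0, np.nan, 180.0, 175.0, 158.0],
})
toy_target = np.array([0, 1, 0, 1, 0, 1])

# make_pipeline names each step after its lowercased class name.
demo_model = make_pipeline(demo_preprocessor, LogisticRegression())
demo_model.fit(toy_df, toy_target)

# New rows may contain gaps and an unseen city ("Pune").
new_rows = pd.DataFrame({
    "gender": ["female", np.nan],
    "City": ["Pune", "Delhi"],
    "age": [30.0, np.nan],
    "height": [np.nan, 172.0],
})
predictions = demo_model.predict(new_rows)

print(list(demo_model.named_steps))
print(predictions)
```

Unlike `Pipeline`, `make_pipeline` does not take step names, which is why the steps end up named after their classes.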