# Pipelines in Machine Learning

Video Link: 
    https://www.youtube.com/watch?v=HZ9MUzCRlzI&list=PLT6wrBlkasCNqKnKcs1hOoCEhtcUZiAqo&index=102&t=215s

In [1]:
from sklearn.pipeline import Pipeline

## Pipeline Implementation Basic Example

### 1. Import standard scaling and regression technique

In [2]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

### 2. Combine the standard scaler with logistic regression

StandardScaler is a transformer and LogisticRegression is an estimator, that is, the model we want to fit. To apply the model we define a list of steps, where each step is a tuple of the form (name, object).

In [3]:
steps=[("standard_scaler", StandardScaler()),
       ("classifier", LogisticRegression())
    ]

In [4]:
steps

[('standard_scaler', StandardScaler()), ('classifier', LogisticRegression())]

**Inference:**
    
    1. The first step is a standard scaler.
    2. The second step is a classifier.

### 3. Convert the steps into a pipeline.

In [5]:
pipe=Pipeline(steps)

In [6]:
pipe

Pipeline(steps=[('standard_scaler', StandardScaler()),
                ('classifier', LogisticRegression())])

### 4. Visualize the pipeline

In [7]:
from sklearn import set_config

In [8]:
set_config(display='diagram')

In [9]:
pipe

**Inference:**

The steps involved in the pipeline are visualised here.

### 5. Creating a Sample dataset

Create a classification dataset

In [10]:
from sklearn.datasets import make_classification

In [11]:
X,y=make_classification() # create a random classification dataset.

In [12]:
X

array([[ 1.35099749,  0.4111458 , -0.77686541, ...,  0.454708  ,
         0.53500001,  0.46129788],
       [ 1.06852358,  0.97327654,  0.11177978, ..., -0.74464801,
        -1.19467619, -0.60661708],
       [-0.39455779, -1.05462452, -0.13985979, ...,  1.18658429,
         1.5280588 ,  0.9528506 ],
       ...,
       [-0.2629402 ,  3.04260847,  0.22485953, ..., -2.69629097,
         0.41637013, -3.43482468],
       [ 1.42620522, -0.52787131,  0.50045873, ...,  1.11191361,
         0.09699036,  1.05100121],
       [ 0.97390141,  0.07349023,  0.55239911, ...,  1.27328512,
        -1.61257602,  1.81299812]])

In [13]:
y

array([0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1,
       0, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0,
       1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 0, 1,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1,
       1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0])
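
The arrays above can be summarised by their shape and label values. The sketch below assumes the default arguments of make_classification (100 samples, 20 features, 2 classes). The `random_state=0` argument and the names `X_demo` and `y_demo` are additions for reproducibility and do not appear in the cell above.

```python
import numpy as np
from sklearn.datasets import make_classification

# random_state is an assumption added here so the example is repeatable.
X_demo, y_demo = make_classification(random_state=0)

print(X_demo.shape)                # (100, 20) with the default n_samples and n_features
print(np.unique(y_demo).tolist())  # [0, 1], a binary target by default
```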

### 6. Train Test Split the Dataset

In [14]:
from sklearn.model_selection import train_test_split

In [15]:
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.33,random_state=42)

In [16]:
X_train

array([[ 0.35887048,  0.5261859 , -0.08823691, ...,  0.84287421,
        -0.86463594,  1.48526829],
       [-0.20644335, -0.51020794, -1.28234859, ..., -1.10190841,
         0.68838791, -1.73148931],
       [-0.65120957,  1.31834014, -0.75048807, ...,  1.32258963,
         0.0647055 ,  1.75841825],
       ...,
       [-0.85257485,  0.66238881,  0.87150754, ..., -2.20881234,
        -1.51540075, -2.86993861],
       [-0.17907684,  1.39338275, -1.4778554 , ...,  0.61538041,
        -1.03007462,  0.66218852],
       [-0.33350289, -0.37746613,  1.47481218, ...,  0.15937483,
         0.63328215,  0.83027206]])

In [17]:
y_train

array([0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1,
       0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1,
       0])
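
With 100 samples and test_size=0.33, the split should leave 67 rows for training and 33 for testing, which matches the lengths of the arrays shown in this notebook. A minimal check, assuming a freshly generated dataset (the names `X_tr`, `X_te`, `y_tr`, `y_te` and the `random_state=0` seed for the dataset are illustrative only):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Illustrative data; the seed for make_classification is an assumption.
X_demo, y_demo = make_classification(random_state=0)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

print(X_tr.shape, X_te.shape)  # (67, 20) (33, 20)
print(len(y_tr), len(y_te))    # 67 33
```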

### 7. Fit the pipeline on X_train and y_train data

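Once we call fit on X_train and y_train, the data passes through each step of the pipeline: the scaler is fitted on X_train and transforms it, and the logistic regression model is then trained on the scaled data.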

In [18]:
pipe.fit(X_train, y_train)
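
To make the effect of fit more concrete, the sketch below compares the pipeline with the same two steps applied by hand. It is only an illustration under assumptions: the data is regenerated with a fixed seed, and the names `demo_pipe`, `manual_scaler`, and `manual_clf` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; the dataset seed is an assumption.
X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# Pipeline version: one call to fit runs both steps.
demo_pipe = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
])
demo_pipe.fit(X_tr, y_tr)

# Manual version: fit_transform on the scaler, then fit on the model.
manual_scaler = StandardScaler()
X_tr_scaled = manual_scaler.fit_transform(X_tr)
manual_clf = LogisticRegression().fit(X_tr_scaled, y_tr)

# The scaler inside the pipeline learned its statistics from X_tr only.
pipe_scaler = demo_pipe.named_steps["standard_scaler"]
same_mean = np.allclose(pipe_scaler.mean_, X_tr.mean(axis=0))

# Both routes should end with the same model coefficients.
same_coef = np.allclose(
    demo_pipe.named_steps["classifier"].coef_, manual_clf.coef_
)
print(same_mean, same_coef)
```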

### 8. Prediction

In [19]:
y_pred = pipe.predict(X_test)

In [20]:
y_pred

array([1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

**Inference:**

- At prediction time, when *X_test* passes through the pipeline, the *StandardScaler* step only calls *transform()*. It is not fitted again on the test data.

- This is how pipelines are used: transformation steps such as feature scaling are chained with an estimator into a single object.
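
The prediction behaviour described here can be checked by calling the fitted steps one at a time. This is a sketch under assumptions: the data is regenerated with a fixed seed, and `demo_pipe`, `manual_pred`, and `pipe_pred` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Illustrative data; the dataset seed is an assumption.
X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_pipe = Pipeline([
    ("standard_scaler", StandardScaler()),
    ("classifier", LogisticRegression()),
]).fit(X_tr, y_tr)

scaler = demo_pipe.named_steps["standard_scaler"]
clf = demo_pipe.named_steps["classifier"]
mean_before = scaler.mean_.copy()

# predict on the pipeline: transform with the fitted scaler, then predict.
pipe_pred = demo_pipe.predict(X_te)

# The same thing done by hand with the already-fitted steps.
manual_pred = clf.predict(scaler.transform(X_te))

print(np.array_equal(pipe_pred, manual_pred))  # True
print(np.allclose(mean_before, scaler.mean_))  # True: the scaler was not refitted
```

The pipeline also exposes score, so `demo_pipe.score(X_te, y_te)` returns the accuracy on the test split without any manual scaling.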

## Pipeline Implementation Example 2

### 1. Import PCA and SVC

In [21]:
from sklearn.decomposition import PCA
from sklearn.svm import SVC

### 2. Steps to apply for model creation

In [22]:
steps = [
    ("scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC", SVC())
]

### 3. Convert the Steps into Pipeline

In [23]:
pipe2=Pipeline(steps)

In [24]:
pipe2

**Inference:**
    
    This means the standard scaler is applied first, followed by PCA and then SVC.

### 4. Fit the pipeline on X_train and y_train

In [25]:
pipe2.fit(X_train,y_train)

### 5. Prediction

In [26]:
pipe2.predict(X_test)

array([1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1,
       1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0])
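
To see what the SVC step actually receives, the fitted pipeline can be sliced so that only the scaling and PCA steps run. The sketch below is illustrative: the data is regenerated with a fixed seed, and `demo_pipe2` and `reduced` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Illustrative data; the dataset seed is an assumption.
X_demo, y_demo = make_classification(random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

demo_pipe2 = Pipeline([
    ("scaling", StandardScaler()),
    ("PCA", PCA(n_components=3)),
    ("SVC", SVC()),
]).fit(X_tr, y_tr)

# Slicing drops the final estimator and keeps the fitted transformers.
reduced = demo_pipe2[:-1].transform(X_te)
print(reduced.shape)  # (33, 3): 20 features reduced to 3 components

preds = demo_pipe2.predict(X_te)
print(preds.shape)    # (33,)
```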

## Pipeline Implementation Example 3 - Column Transformer

### 1. Impute missing values using pipeline

In [33]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

In [30]:
import numpy as np

### 2. Create a numerical processing pipeline to impute missing values in numerical columns.

In [34]:
numeric_processor = Pipeline(steps=[
    ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler())
])

# Here, missing values (np.nan) are replaced with the column mean.

In [35]:
numeric_processor

**Inference**

Here the first step is SimpleImputer, followed by standard scaling.
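
A small worked example may help show what the two steps do. The one-column array `toy` below is made up for illustration, and `demo_numeric` is a hypothetical copy of the pipeline defined above.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_numeric = Pipeline(steps=[
    ("imputation_mean", SimpleImputer(missing_values=np.nan, strategy="mean")),
    ("scaler", StandardScaler()),
])

# Made-up data: one numeric column with a missing value.
toy = np.array([[1.0], [np.nan], [3.0]])

scaled = demo_numeric.fit_transform(toy)

# Output of the first step alone: NaN is replaced by the column mean (2.0).
imputed = demo_numeric.named_steps["imputation_mean"].transform(toy)
print(imputed.ravel())  # [1. 2. 3.]

# After scaling, the column has zero mean and unit variance.
print(scaled.ravel())
```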

### 3. Categorical processing pipeline to impute missing data in categorical columns

**Steps**

    1. Simple Imputation
    2. One Hot Encoding

In [45]:
from sklearn.preprocessing import OneHotEncoder

categorical_processing = Pipeline(steps=[
    ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
    ("onehot_encoding", OneHotEncoder(handle_unknown="ignore"))
])

In [41]:
categorical_processing
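
The behaviour of this pipeline can be illustrated on a tiny made-up column. In the sketch below, `toy_train`, `demo_categorical`, and the colour values are hypothetical. The input is a NumPy object array, so the default missing-value marker (np.nan) applies.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

demo_categorical = Pipeline(steps=[
    ("imputation_constant", SimpleImputer(fill_value="missing", strategy="constant")),
    ("onehot_encoding", OneHotEncoder(handle_unknown="ignore")),
])

# Made-up data: one categorical column with a missing entry.
toy_train = np.array([["red"], [np.nan], ["blue"]], dtype=object)

# OneHotEncoder returns a sparse matrix by default, so convert for display.
encoded = demo_categorical.fit_transform(toy_train).toarray()
categories = demo_categorical.named_steps["onehot_encoding"].categories_[0]

print(list(categories))  # ['blue', 'missing', 'red']
print(encoded)

# A category not seen during fit is encoded as a row of zeros.
unseen = demo_categorical.transform(np.array([["green"]], dtype=object)).toarray()
print(unseen)  # [[0. 0. 0.]]
```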

#### Inference

**Steps**

    1. Fill missing values with the constant "missing".
    2. One-hot encode the categories; categories not seen during fitting are ignored (encoded as all zeros).
