#### Pipeline
A Pipeline in ML bundles all the steps of a machine learning workflow into a single object.<br>
Instead of manually doing:
- Data Preprocessing (e.g. handling missing values, scaling)
- Feature Transformation (e.g. encoding)
- Model Training<br>
you can put them all inside one pipeline and execute them in sequence.
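
To make "execute them in sequence" concrete, here is a minimal pure-Python sketch of the idea. `MiniPipeline`, `MeanImputer` and `ThresholdClassifier` are hypothetical names invented for this illustration; this is not how scikit-learn implements `Pipeline`, only a simplified picture of the fit/transform chaining.

```python
class MeanImputer:
    """Toy transformer: replaces None with the mean learned during fit."""
    def fit(self, X, y=None):
        known = [x for x in X if x is not None]
        self.mean_ = sum(known) / len(known)
        return self

    def transform(self, X):
        return [self.mean_ if x is None else x for x in X]


class ThresholdClassifier:
    """Toy model: predicts 1 when a value is above the training mean."""
    def fit(self, X, y):
        self.threshold_ = sum(X) / len(X)
        return self

    def predict(self, X):
        return [int(x > self.threshold_) for x in X]


class MiniPipeline:
    """Hypothetical, simplified stand-in for a pipeline object."""
    def __init__(self, steps):
        self.steps = steps  # list of (name, object) pairs

    def fit(self, X, y):
        # Every step except the last is fitted, then used to transform the data
        for _, transformer in self.steps[:-1]:
            X = transformer.fit(X, y).transform(X)
        # The last step is the model, fitted on the transformed data
        self.steps[-1][1].fit(X, y)
        return self

    def predict(self, X):
        # New data goes through the already-fitted transformers (no refitting)
        for _, transformer in self.steps[:-1]:
            X = transformer.transform(X)
        return self.steps[-1][1].predict(X)


pipe = MiniPipeline(steps=[('imputer', MeanImputer()), ('model', ThresholdClassifier())])
pipe.fit([1.0, None, 3.0, 8.0], [0, 0, 0, 1])
print(pipe.predict([None, 9.0]))  # [0, 1]
```

The key detail is in `predict`: new data is only transformed with values learned during `fit`, never refitted.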

#### Why use Pipeline?
- <b>Cleaner Code - </b>No need to repeat preprocessing for training and testing separately.
- <b>Avoid Data Leakage - </b>Transformations are learned only on training data and applied to test data automatically.
- <b>Easier Hyperparameter Tuning - </b>You can tune preprocessing and model parameters together using <b>GridSearchCV</b>.
- <b>Reproducibility - </b>A single object contains the full ML workflow.
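
The data leakage point can be checked directly. The sketch below uses synthetic data rather than this notebook's dataset (the arrays and the `pipe` name are assumptions made for illustration). It fits a pipeline on the training split and confirms that the scaler's learned mean comes from the training rows only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for this illustration
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(200, 2))
y = (X[:, 0] + X[:, 1] > 10).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

pipe = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('model', LogisticRegression()),
])
pipe.fit(X_train, y_train)  # the scaler only ever sees X_train here

learned_mean = pipe.named_steps['scaler'].mean_
# Matches the training-split mean ...
print(np.allclose(learned_mean, X_train.mean(axis=0)))
# ... and not the mean of the full dataset (train + test)
print(np.allclose(learned_mean, X.mean(axis=0)))

# predict() reuses the training statistics to scale X_test
print(pipe.predict(X_test).shape)
```

Scaling the full dataset before splitting would instead let test-set statistics influence training, which is the leakage the pipeline avoids.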

#### Basic Structure of a Pipeline
Pipeline(steps=[<br>
    ('step1',transformer1),<br>
    ('step2',transformer2),<br>
    ('model',model)<br>
])

In [52]:
import pandas as pd
import seaborn as sns
df = sns.load_dataset('titanic')

In [53]:
df = df[['survived','sex','age','fare','embarked']]
df.shape

(891, 5)

In [54]:
df.head(2)

   survived     sex   age     fare embarked
0         0    male  22.0   7.2500        S
1         1  female  38.0  71.2833        C


In [55]:
df.isnull().sum()

survived      0
sex           0
age         177
fare          0
embarked      2
dtype: int64

In [56]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   survived  891 non-null    int64  
 1   sex       891 non-null    object 
 2   age       714 non-null    float64
 3   fare      891 non-null    float64
 4   embarked  889 non-null    object 
dtypes: float64(2), int64(1), object(2)
memory usage: 34.9+ KB


In [57]:
X = df.drop(columns=['survived'])
y = df['survived']

In [58]:
from sklearn.model_selection import train_test_split

In [59]:
X_train , X_test , y_train , y_test = train_test_split(X,y,test_size=0.2,random_state=42)

#### Pipeline Creation

In [60]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler , OneHotEncoder
from sklearn.compose import ColumnTransformer

In [61]:
# Numerical Feature
numerical_feature = ['age','fare']
numerical_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='mean')),
    ('scaler',StandardScaler())
])

In [62]:
# Categorical Feature
categorical_feature = ['sex','embarked']
categorical_transformer = Pipeline(steps=[
    ('imputer',SimpleImputer(strategy='most_frequent')),
    ('onehot',OneHotEncoder(handle_unknown='ignore'))
])

In [63]:
# Combine preprocessing for both feature types
preprocessor = ColumnTransformer(transformers=[
    ('num',numerical_transformer,numerical_feature),
    ('cat',categorical_transformer,categorical_feature)
])
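
To see what a preprocessor of this shape produces, the following sketch rebuilds the same structure on a few made-up rows. The `toy` DataFrame and the `toy_preprocessor` name are hypothetical and only reuse the column names from above. It shows how two numerical columns plus the one-hot encoded categories end up as one feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy rows with the same column names as the Titanic subset
toy = pd.DataFrame({
    'age': [22.0, 38.0, np.nan, 35.0, 54.0, 2.0],
    'fare': [7.25, 71.28, 7.92, 53.10, 51.86, 21.07],
    'sex': ['male', 'female', 'female', 'female', 'male', 'male'],
    'embarked': ['S', 'C', 'S', np.nan, 'Q', 'S'],
})

toy_preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='mean')),
        ('scaler', StandardScaler()),
    ]), ['age', 'fare']),
    ('cat', Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('onehot', OneHotEncoder(handle_unknown='ignore')),
    ]), ['sex', 'embarked']),
])

transformed = toy_preprocessor.fit_transform(toy)
# 2 scaled numeric columns + 2 'sex' columns + 3 'embarked' columns
print(transformed.shape)  # (6, 7)

# Categories learned by the encoder, one list per categorical column
onehot = toy_preprocessor.named_transformers_['cat'].named_steps['onehot']
print([list(c) for c in onehot.categories_])
```

The missing `embarked` value is filled with the most frequent category before encoding, so it does not become a category of its own.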

In [64]:
# Add Model into the Pipeline
from sklearn.linear_model import LogisticRegression

In [65]:
clf = Pipeline(steps=[
    ('preprocessor',preprocessor),
    ('classifier',LogisticRegression())
])

In [66]:
clf.fit(X_train,y_train)

In [67]:
y_pred = clf.predict(X_test)

In [68]:
# Model Evaluation
from sklearn.metrics import accuracy_score, classification_report

In [69]:
print("Accuracy Score :",accuracy_score(y_test,y_pred))
print("Classification Report")
print(classification_report(y_test,y_pred))

Accuracy Score : 0.776536312849162
Classification Report
              precision    recall  f1-score   support

           0       0.80      0.83      0.81       105
           1       0.74      0.70      0.72        74

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



##### Benefits of the Pipeline in Action
- No manual preprocessing for <b>X_test</b> - it's done inside the pipeline.
- If we switch <b>LogisticRegression to RandomForestClassifier</b>, everything else still works.
- We can <b>tune both preprocessing and model hyperparameters</b> in one place.
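
As a rough illustration of the model-swap point, the sketch below replaces the final step of a small pipeline with `set_params`. The synthetic data and the `demo_clf` name are assumptions for this example; the step name `'classifier'` mirrors the one used in this notebook.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for this illustration
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))
y_demo = (X_demo[:, 0] > 0).astype(int)

demo_clf = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
])
demo_clf.fit(X_demo, y_demo)

# Replace only the final step; the preprocessing definition is untouched
demo_clf.set_params(
    classifier=RandomForestClassifier(n_estimators=50, random_state=42)
)
demo_clf.fit(X_demo, y_demo)  # refit so the new model is trained

print(type(demo_clf.named_steps['classifier']).__name__)  # RandomForestClassifier
print(demo_clf.predict(X_demo[:5]).shape)  # (5,)
```

Defining a new `Pipeline` with a different last step works just as well; `set_params` is simply convenient when the rest of the object should stay the same.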

#### Hyperparameter Tuning with Pipeline

In [70]:
from sklearn.model_selection import GridSearchCV

In [94]:
param_grid = {
    'classifier__C':[0.1,1.0,10],    # C: inverse of regularization strength in LogisticRegression
}

In [95]:
grid_search = GridSearchCV(clf,param_grid,cv=5,n_jobs=-1)

In [96]:
grid_search.fit(X_train,y_train)

In [98]:
print("Best Parameters :",grid_search.best_params_)
print("Best Cross-validation Score :",grid_search.best_score_)

Best Parameters : {'classifier__C': 0.1}
Best Cross-validation Score : 0.783669851275485


In [99]:
from sklearn.metrics import accuracy_score,classification_report

In [100]:
y_pred = grid_search.predict(X_test)

In [103]:
print("Test Accuracy :",accuracy_score(y_test,y_pred))
print("Classification Report\n",classification_report(y_test,y_pred))

Test Accuracy : 0.776536312849162
Classification Report
               precision    recall  f1-score   support

           0       0.80      0.83      0.81       105
           1       0.74      0.70      0.72        74

    accuracy                           0.78       179
   macro avg       0.77      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179
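
The grid above only varies the classifier. Preprocessing parameters can go into the same grid, using `__` to walk down through nested step names. The sketch below is a self-contained illustration on synthetic data (the `demo` DataFrame, `target`, and `search` are assumed names, not part of this notebook). It tunes the imputation strategy together with `C`.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic data, used only for this illustration
rng = np.random.default_rng(0)
n = 120
demo = pd.DataFrame({
    'age': rng.normal(30, 10, n),
    'fare': rng.exponential(30, n),
    'sex': rng.choice(['male', 'female'], n),
})
demo.loc[::10, 'age'] = np.nan  # inject some missing values
target = ((demo['sex'] == 'female') | (demo['fare'] > 60)).astype(int)

num = Pipeline(steps=[('imputer', SimpleImputer()), ('scaler', StandardScaler())])
cat = Pipeline(steps=[('onehot', OneHotEncoder(handle_unknown='ignore'))])
pre = ColumnTransformer(transformers=[
    ('num', num, ['age', 'fare']),
    ('cat', cat, ['sex']),
])
pipe = Pipeline(steps=[('preprocessor', pre), ('classifier', LogisticRegression())])

# Parameter path: <pipeline step>__<transformer name>__<inner step>__<parameter>
param_grid = {
    'preprocessor__num__imputer__strategy': ['mean', 'median'],
    'classifier__C': [0.1, 1.0, 10],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(demo, target)

print(search.best_params_)
# 2 imputation strategies x 3 values of C = 6 candidate settings
print(len(search.cv_results_['params']))
```

Because the imputer sits inside the pipeline, each cross-validation fold refits it on that fold's training portion only.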



### Key Points to Remember
- <b>Always put preprocessing inside the pipeline</b> to avoid leakage.
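- <b>ColumnTransformer</b> applies different preprocessing to different column types (e.g. numerical vs. categorical) in a single step.
- All steps except the last must be transformers; the <b>last step is usually the final estimator</b> (classifier/regressor).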
- Pipelines can be saved using <b>joblib or pickle</b>.

#### Save a Pipeline using Joblib

In [104]:
import joblib

In [116]:
# Save Pipeline
joblib.dump(grid_search.best_estimator_,'titanic_pipeline.pkl')

['titanic_pipeline.pkl']

In [117]:
# Load pipeline
loaded_pipeline = joblib.load('titanic_pipeline.pkl')

In [118]:
predictions  = loaded_pipeline.predict(X_test)

In [119]:
predictions[:10]

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1], dtype=int64)

##### Why Joblib?
- Commonly used to persist fitted scikit-learn estimators.
- Handles large NumPy arrays efficiently.
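
A quick way to gain confidence in a saved pipeline is a round-trip check: save, reload, and compare predictions. The sketch below does this on synthetic data in a temporary directory (the `fitted` and `restored` names and the file name are assumptions for illustration). The optional `compress` argument of `joblib.dump` trades some speed for a smaller file.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for this illustration
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(80, 2))
y_demo = (X_demo[:, 0] + X_demo[:, 1] > 0).astype(int)

fitted = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression()),
]).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'demo_pipeline.joblib')
    joblib.dump(fitted, path, compress=3)  # compression level 0-9
    restored = joblib.load(path)

# The reloaded pipeline should reproduce the original predictions
same = np.array_equal(fitted.predict(X_demo), restored.predict(X_demo))
print(same)  # True
```

Like pickle files, joblib files should only be loaded from sources you trust, and ideally with the same scikit-learn version that created them.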

#### Using Pickle

In [120]:
import pickle

In [121]:
# Save
with open('titanic_pipeline2.pkl','wb') as f:
    pickle.dump(grid_search.best_estimator_ , f)

In [122]:
# load
with open('titanic_pipeline2.pkl','rb') as f:
    loaded_pipeline = pickle.load(f)

In [123]:
predictions  = loaded_pipeline.predict(X_test)

In [124]:
predictions[:10]

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1], dtype=int64)

### Note: Joblib is usually more efficient for models that hold large NumPy arrays, but pickle works too.