# Pipelines 

In machine learning, a pipeline is a way of organizing and streamlining the various steps in a machine learning workflow. It ensures that all steps, from data preprocessing to model evaluation, are executed in a consistent and reproducible manner.

Components of a Pipeline:

1. **Data Preprocessing**: This includes steps like data cleaning (handling missing values), scaling or normalizing features, encoding categorical variables, and feature selection.
2. **Model Training**: The actual machine learning algorithm (e.g., decision trees, support vector machines) is applied to the processed data to build a model.
3. **Model Evaluation**: This step involves evaluating the model’s performance using metrics like accuracy, precision, recall, etc., typically with a validation set or using cross-validation.

Why Use Pipelines?

1. **Streamlined Workflow**: Pipelines allow you to chain multiple steps (like preprocessing and model training) together into a single object. This reduces the risk of errors when manually performing each step individually.
2. **Consistency**: With a pipeline, you ensure that the same preprocessing steps are applied to both the training and testing data, which is crucial for model generalization.
3. **Reusability**: Pipelines can be reused and shared with others, making it easier to apply the same sequence of operations to different datasets.
4. **Ease of Hyperparameter Tuning**: When performing grid search or other hyperparameter optimization methods, pipelines ensure that all transformations are applied to each fold of the data in the correct order.
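
The consistency point can be illustrated with a minimal sketch. It assumes a synthetic dataset from `make_classification` and a `LogisticRegression` model, neither of which is part of the walkthrough in this notebook; it only checks that a pipeline reproduces the manual "fit the scaler on the training data, reuse it on the test data" routine.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.25, random_state=0)

# Manual route: fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler().fit(X_tr)
manual_model = LogisticRegression().fit(scaler.transform(X_tr), y_tr)
manual_pred = manual_model.predict(scaler.transform(X_te))

# Pipeline route: the same two steps wrapped in a single object
demo_pipe = Pipeline([("std", StandardScaler()), ("clf", LogisticRegression())])
demo_pipe.fit(X_tr, y_tr)
pipe_pred = demo_pipe.predict(X_te)

# Both routes should give the same predictions
print(np.array_equal(manual_pred, pipe_pred))
```

The pipeline removes the chance of accidentally calling `fit_transform` on the test split, which is the usual way the manual route goes wrong.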


```python

# pipeline example 

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Sample data (features and target)
X = ...  # feature matrix
y = ...  # target variable

# Split data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Create a pipeline
# Note: OneHotEncoder here encodes every column of X, so this layout assumes
# all features are categorical; mixed column types need a ColumnTransformer.
pipeline = Pipeline([
    ('ohe', OneHotEncoder()),  # Step 1: One-hot encode the features
    ('std', StandardScaler(with_mean=False)),  # Step 2: Scale the features (no centering: the encoder output is sparse)
    ('clf', RandomForestClassifier())  # Step 3: Train a Random Forest model
])

# Train the pipeline
pipeline.fit(X_train, y_train)

# Make predictions
y_pred = pipeline.predict(X_test)

# Evaluate the model
accuracy = (y_pred == y_test).mean()
print(f'Accuracy: {accuracy}')
```

# Pipelines with grid search

```python 

from sklearn.model_selection import GridSearchCV

# Define parameter grid
param_grid = {
    'clf__n_estimators': [50, 100],  # 'clf' is the name of the RandomForest step
    'clf__max_depth': [10, 20]
}

# Perform grid search on the pipeline
grid_search = GridSearchCV(pipeline, param_grid, cv=5)
grid_search.fit(X_train, y_train)

# Get best parameters
print("Best parameters:", grid_search.best_params_)

```
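
Grid keys follow the `<step name>__<parameter>` pattern. One way to check which keys are valid is to list the pipeline's parameters with `get_params()`. The sketch below rebuilds a two-step pipeline (scaler plus random forest, with the step names `std` and `clf` used above) so that it runs on its own.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

demo_pipeline = Pipeline([
    ("std", StandardScaler()),
    ("clf", RandomForestClassifier()),
])

# Every parameter of every step is exposed as '<step>__<param>'
clf_params = sorted(name for name in demo_pipeline.get_params() if name.startswith("clf__"))
print(clf_params[:5])

# set_params uses the same naming scheme as param_grid
demo_pipeline.set_params(clf__n_estimators=50, std__with_mean=False)
print(demo_pipeline.named_steps["clf"].n_estimators)
```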

## Logistic regression without pipelines

In [1]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt  
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler,MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

In [2]:
# data loading
df =  pd.read_csv("./data/titanic_data.csv",index_col="PassengerId")
df.head()

| PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [3]:
# info
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 11 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Ticket    891 non-null    object 
 8   Fare      891 non-null    float64
 9   Cabin     204 non-null    object 
 10  Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 83.5+ KB


Age, Cabin, and Embarked contain null values. We drop Cabin, which is mostly empty, and impute Age.

In [4]:
# drop the Cabin column (only 204 of 891 values are non-null)
df.drop("Cabin",axis=1,inplace=True)

We also drop Embarked, Ticket, and Name.

In [5]:
# drop unnecessary columns
df.drop(["Embarked","Ticket","Name"],axis=1,inplace=True)

In [6]:
# make a copy of the original dataframe
orginal_df = df.copy()

In [7]:
# impute missing values in 'Age' column
imputer =  SimpleImputer(strategy="mean")
df["Age"] = imputer.fit_transform(df[["Age"]])
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(4), object(1)
memory usage: 55.7+ KB


In [8]:
df.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 |


### OHE Categorical columns 

In [9]:
# OHE encoding
df =  pd.get_dummies(df,columns=["Sex"],drop_first=True,dtype=int)
df.head()

| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 |
| 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 0 |
| 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 |
| 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 |
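
`pd.get_dummies` is a pandas function rather than a scikit-learn transformer, so it cannot be used as a pipeline step. `OneHotEncoder` is the scikit-learn counterpart. The sketch below uses a made-up `Sex` column (not the Titanic file) and suggests that `drop="first"` produces the same indicator column as `drop_first=True`.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up column, used only for illustration
toy = pd.DataFrame({"Sex": ["male", "female", "female", "male"]})

# pandas route: one indicator column, first category ("female") dropped
dummies = pd.get_dummies(toy, columns=["Sex"], drop_first=True, dtype=int)

# scikit-learn route: the encoder returns a sparse matrix by default
encoder = OneHotEncoder(drop="first")
encoded = encoder.fit_transform(toy[["Sex"]]).toarray()

print(dummies["Sex_male"].to_numpy())
print(encoded.ravel())
```

Unlike `get_dummies`, the fitted encoder remembers the categories it saw during `fit`, which is what allows it to be applied to new data inside a pipeline.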


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Age       891 non-null    float64
 3   SibSp     891 non-null    int64  
 4   Parch     891 non-null    int64  
 5   Fare      891 non-null    float64
 6   Sex_male  891 non-null    int32  
dtypes: float64(2), int32(1), int64(4)
memory usage: 52.2 KB


In [11]:
# split X and y
X = df.drop("Survived",axis =1)
y = df["Survived"]
X.head()

| PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_male |
|---|---|---|---|---|---|---|
| 1 | 3 | 22.0 | 1 | 0 | 7.25 | 1 |
| 2 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 3 | 3 | 26.0 | 0 | 0 | 7.925 | 0 |
| 4 | 1 | 35.0 | 1 | 0 | 53.1 | 0 |
| 5 | 3 | 35.0 | 0 | 0 | 8.05 | 1 |


In [12]:
X.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 6 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    891 non-null    int64  
 1   Age       891 non-null    float64
 2   SibSp     891 non-null    int64  
 3   Parch     891 non-null    int64  
 4   Fare      891 non-null    float64
 5   Sex_male  891 non-null    int32  
dtypes: float64(2), int32(1), int64(3)
memory usage: 45.2 KB


In [13]:
# scale numerical columns to the [0, 1] range with MinMaxScaler
scaler = MinMaxScaler()
X[["Age","Parch","Fare"]] = scaler.fit_transform(X[["Age","Parch","Fare"]])
X.head()

| PassengerId | Pclass | Age | SibSp | Parch | Fare | Sex_male |
|---|---|---|---|---|---|---|
| 1 | 3 | 0.271174 | 1 | 0.0 | 0.014151 | 1 |
| 2 | 1 | 0.472229 | 1 | 0.0 | 0.139136 | 0 |
| 3 | 3 | 0.321438 | 0 | 0.0 | 0.015469 | 0 |
| 4 | 1 | 0.434531 | 1 | 0.0 | 0.103644 | 0 |
| 5 | 3 | 0.434531 | 0 | 0.0 | 0.015713 | 1 |


In [14]:
# train/test split
# test_size=20 is an absolute count: 20 rows are held out (use 0.2 to hold out 20% of the data)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=20,random_state=42)
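
`train_test_split` treats an integer `test_size` as a number of rows and a float as a fraction of the data. A quick sketch on placeholder arrays with 891 rows (the size of this dataset) shows the difference:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same number of rows as the Titanic data
X_demo = np.arange(891 * 2).reshape(891, 2)
y_demo = np.arange(891) % 2

# Integer test_size: absolute number of test rows
_, X_te_count, _, _ = train_test_split(X_demo, y_demo, test_size=20, random_state=42)

# Float test_size: fraction of the rows (rounded up)
_, X_te_frac, _, _ = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

print(len(X_te_count), len(X_te_frac))  # 20 rows vs. 179 rows
```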

In [15]:
# Model training and evaluation
model =  LogisticRegression()

model.fit(X_train,y_train)
y_pred = model.predict(X_test)

In [16]:
# report
print(classification_report(y_pred=y_pred,y_true=y_test))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20



## With a Grid Search

In [17]:
# Define the hyperparameters to tune
param_grid = {
    'C': [0.1, 1, 10],            # Inverse of regularization strength (smaller values regularize more)
    'solver': ['liblinear', 'saga'],  # Optimization algorithms
    'penalty': ['l2', 'l1'],         # Regularization types
    'class_weight': [None, 'balanced']  # Handle imbalanced classes (optional)
}


In [18]:
# Set up GridSearchCV

grid_search = GridSearchCV(estimator=LogisticRegression(),param_grid=param_grid,cv=20)


In [19]:
# Run the grid search: every parameter combination is fitted with cross-validation
grid_search.fit(X_train,y_train)



GridSearchCV(cv=20, estimator=LogisticRegression(),
             param_grid={'C': [0.1, 1, 10], 'class_weight': [None, 'balanced'],
                         'penalty': ['l2', 'l1'],
                         'solver': ['liblinear', 'saga']})

In [20]:
grid_search.best_params_

{'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'saga'}

In [21]:
grid_search.predict(X_test)

array([0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
      dtype=int64)

In [23]:
# Output the best parameters and score
print("Best parameters found: ", grid_search.best_params_)
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))

# Evaluate  the model on the test set
best_model = grid_search.best_estimator_
test_score = best_model.score(X_test, y_test)
print("Test set score: {:.2f}".format(test_score))

Best parameters found:  {'C': 0.1, 'class_weight': None, 'penalty': 'l2', 'solver': 'saga'}
Best cross-validation score: 0.80
Test set score: 0.90


# With Pipelines 

In [25]:
# import pipelines
from sklearn.pipeline import Pipeline


In [26]:
# recheck data 
df.head()


| PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | Sex_male |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | 22.0 | 1 | 0 | 7.25 | 1 |
| 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 | 0 |
| 3 | 1 | 3 | 26.0 | 0 | 0 | 7.925 | 0 |
| 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1 | 0 |
| 5 | 0 | 3 | 35.0 | 0 | 0 | 8.05 | 1 |


In [27]:
# re assign X and y
X=df.drop("Survived",axis=1)
y=df["Survived"]

In [28]:
# train_test_split (test_size=20 again holds out 20 rows, not 20%)
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=20,random_state=42)

In [29]:
# Build a pipeline for scaling and modeling

pipe = Pipeline([
    ("one",MinMaxScaler()),
    ("model",LogisticRegression())
])



In [30]:
# fit pipeline and make predictions
pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)


In [31]:
print(classification_report(y_pred=y_pred,y_true=y_test))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20



## Pipelines with Grid Search

In [39]:
# redefine the parameter grid; each key is prefixed with the pipeline step name ("model__")
param_grid = {
    'model__C': [0.1, 1, 10],            # Inverse of regularization strength (smaller values regularize more)
    'model__solver': ['liblinear', 'saga'],  # Optimization algorithms
    'model__penalty': ['l2', 'l1'],         # Regularization types
    'model__class_weight': [None, 'balanced']  # Handle imbalanced classes (optional)
}

In [40]:
# define a grid

grid = GridSearchCV(estimator=pipe,param_grid=param_grid)

In [41]:
# fit the grid search and predict with the best pipeline found
grid.fit(X_train,y_train)

y_pred = grid.predict(X_test)

In [35]:
print(classification_report(y_pred=y_pred,y_true=y_test))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20



## Changing models/steps in a pipeline


In [42]:
# switch the scaler to StandardScaler and the model to a decision tree
from sklearn.tree import DecisionTreeClassifier
pipe.set_params(one=StandardScaler())
pipe.set_params(model=DecisionTreeClassifier())



Pipeline(steps=[('one', StandardScaler()), ('model', DecisionTreeClassifier())])

In [43]:
# fit and make predictions 
pipe.fit(X_train,y_train)

y_pred = pipe.predict(X_test)



In [44]:
print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.78      0.70      0.74        10
           1       0.73      0.80      0.76        10

    accuracy                           0.75        20
   macro avg       0.75      0.75      0.75        20
weighted avg       0.75      0.75      0.75        20



In [51]:
# TODO: switch to any other model and run a grid search as well.
# The keys in param_grid have to match the model you are using,
# so update param_grid whenever you change the model in the pipeline.

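Can you fit GridSearchCV after changing the model in a pipeline?
Short answer: not with the old param_grid. Grid keys are resolved against the pipeline's current steps, so a key such as model__C, which belongs to LogisticRegression, does not exist on the new model and the search fails with an error. If a key happens to exist on both models, the search runs, but over values that were chosen for the old model. Update param_grid whenever you swap the model.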
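
A minimal sketch of the failure, on synthetic data from `make_classification` (an assumption for illustration, not the Titanic data). In current scikit-learn versions the stale key is expected to be rejected with a `ValueError` when `fit` is called; the names `stale_grid` and `fresh_grid` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration
X_demo, y_demo = make_classification(n_samples=120, n_features=4, random_state=0)

demo_pipe = Pipeline([("one", StandardScaler()), ("model", LogisticRegression())])
stale_grid = {"model__C": [0.1, 1]}  # valid for LogisticRegression only

# Swap the model but keep the old grid
demo_pipe.set_params(model=RandomForestClassifier(n_estimators=10, random_state=0))

try:
    GridSearchCV(demo_pipe, stale_grid, cv=3).fit(X_demo, y_demo)
    stale_grid_failed = False
except ValueError as err:
    stale_grid_failed = True
    print("Stale grid rejected:", type(err).__name__)

# A grid that matches the new model works as usual
fresh_grid = {"model__max_depth": [3, 5]}
search = GridSearchCV(demo_pipe, fresh_grid, cv=3).fit(X_demo, y_demo)
print(search.best_params_)
```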

In [52]:
from sklearn.ensemble import RandomForestClassifier
pipe.set_params(model=RandomForestClassifier())
param_grid = {
    'model__n_estimators': [50, 100],  # RF params (not LogisticRegression's 'C')
    'model__max_depth': [3, 5]
}

In [56]:
grid = GridSearchCV(estimator=pipe,param_grid=param_grid, cv=6)
grid.fit(X_train,y_train)
y_pred = grid.predict(X_test)
print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.83      1.00      0.91        10
           1       1.00      0.80      0.89        10

    accuracy                           0.90        20
   macro avg       0.92      0.90      0.90        20
weighted avg       0.92      0.90      0.90        20



In [58]:
grid.best_estimator_

Pipeline(steps=[('one', StandardScaler()),
                ('model',
                 RandomForestClassifier(max_depth=5, n_estimators=50))])

## Column Transformers

A **ColumnTransformer** in machine learning is used to apply different preprocessing techniques to different subsets of columns (features) in a dataset. It allows you to transform numerical and categorical columns with different operations, such as scaling numerical data or encoding categorical data, in a clean and efficient way.

### Benefits:
1. **Streamlined preprocessing**: You can apply different transformations to different columns in a single, unified step.
2. **Cleaner code**: Organizes preprocessing tasks and avoids manually separating data by column types.
3. **Flexibility**: You can specify custom transformations for each set of columns (e.g., scaling for numerical columns, one-hot encoding for categorical columns).
4. **Improved pipeline integration**: It integrates well within machine learning pipelines, ensuring consistency when training and testing models.

In [59]:
# imports 
from sklearn.compose import ColumnTransformer

from sklearn.preprocessing import OneHotEncoder,OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer

Original data inspection

In [60]:
orginal_df.head()

| PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.25 |
| 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 |
| 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.925 |
| 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1 |
| 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.05 |


In [61]:
orginal_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 7 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
dtypes: float64(2), int64(4), object(1)
memory usage: 55.7+ KB


Custom Function

In [62]:
# custom function (can be wrapped in a FunctionTransformer)
def plus_one(x):
    return x + 1
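
`FunctionTransformer` (imported above) wraps a plain function so that it can act as a step in a `Pipeline` or `ColumnTransformer`. A small sketch with a hypothetical `add_one` helper and a made-up array:

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

def add_one(x):
    # hypothetical helper, used only for illustration
    return x + 1

# Wrap the function so it exposes fit/transform like any other transformer
adder = FunctionTransformer(add_one)

values = np.array([[1.0], [2.0], [3.0]])
shifted = adder.fit_transform(values)
print(shifted.ravel())
```

Because the wrapped function is stateless, `fit` learns nothing here; the transformer simply applies the function to whatever it receives.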

## Transformers

In [63]:
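# Create the transformers
# Note: ColumnTransformer applies its entries in parallel to the original columns,
# so "std" scales the raw (unimputed) Age and Fare values, and both the imputed and
# the scaled copies end up in the output. tranformer3 further down chains the two
# steps with a nested Pipeline.

tranformer = ColumnTransformer([
    
    ("ohe",OneHotEncoder(),["Sex"]),
    ("impute",SimpleImputer(strategy="mean"),["Age","Fare"]),
    ("std",MinMaxScaler(),["Age","Fare"])
    
],remainder="passthrough")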


pipe = Pipeline([
    ("pre_pro",tranformer),
    ("model",LogisticRegression())
])
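
One thing to watch for in a layout like the one above: `ColumnTransformer` runs its entries side by side on the original columns rather than one after another. The sketch below uses a made-up three-row frame and suggests that listing an imputer and a scaler as separate entries for the same columns yields duplicated columns, with the scaled copies still containing NaN, whereas nesting both steps in a `Pipeline` chains them.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Made-up data with missing values, used only for illustration
toy = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Fare": [10.0, 30.0, np.nan]})
cols = ["Age", "Fare"]

# Separate entries: each one sees the original columns
side_by_side = ColumnTransformer([
    ("impute", SimpleImputer(strategy="mean"), cols),
    ("scale", MinMaxScaler(), cols),
])
parallel_out = side_by_side.fit_transform(toy)

# Nested pipeline: impute first, then scale the imputed values
chained = ColumnTransformer([
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", MinMaxScaler()),
    ]), cols),
])
chained_out = chained.fit_transform(toy)

print(parallel_out.shape, np.isnan(parallel_out).any())
print(chained_out.shape, np.isnan(chained_out).any())
```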



In [64]:
def upper_case(x):
    return str.upper(x)

In [65]:
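# illustrative column groups for a general template;
# "Name" and "grades" are not columns of orginal_df
cat = ["Sex","Name"] 
num = ["Age","Fare"]
cat_ord = ["grades"]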

In [None]:
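tranformer2 = ColumnTransformer([
    ("cat",Pipeline([
        ("imputer",SimpleImputer(strategy="most_frequent")),  # mode imputation
        ("ohe",OneHotEncoder())
        ])
    ,cat
    ),
    ("num",Pipeline([
        ("imputer",SimpleImputer(strategy="mean")),
        ("scaler",MinMaxScaler())
        ]),
     num
    ),
    ("ord",Pipeline([
        # assumption: ordinal columns are mode-imputed, then encoded with OrdinalEncoder
        ("imputer",SimpleImputer(strategy="most_frequent")),
        ("encoder",OrdinalEncoder())
        ]),cat_ord)
])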

In [71]:
tranformer3 = ColumnTransformer([
    ("cat",Pipeline([
        ("impute_mode",SimpleImputer(strategy="most_frequent")),
        ("ohe",OneHotEncoder())
        ]),["Sex"]
     ),
    ("num",Pipeline([
        ("impute_mean",SimpleImputer(strategy="mean")),
        ("scale",MinMaxScaler()),
        ]),["Age","Fare"])
])

In [73]:
pipe3 = Pipeline([
    ("tranformer",tranformer3),
    ("model",LogisticRegression())
])

In [74]:
X=orginal_df.drop("Survived",axis=1)
y=orginal_df["Survived"]

In [75]:
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [76]:
pipe3.fit(X_train,y_train)

y_pred = pipe3.predict(X_test)

print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.80      0.84      0.82       105
           1       0.75      0.70      0.73        74

    accuracy                           0.78       179
   macro avg       0.78      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179



In [77]:
# structural difference between Pipeline and ColumnTransformer
# ColumnTransformer = [("name", transformer, ["columns"]), ...]
# Pipeline = [("name", step), ...]

In [232]:
[("ohe",OneHotEncoder())]

[('ohe', OneHotEncoder())]

In [None]:
# Pipeline
pipe = Pipeline([
    ("pre_pro",tranformer2),
    ("model",LogisticRegression())
])

In [234]:
# re assign X and y
X=orginal_df.drop("Survived",axis=1)
y=orginal_df["Survived"]

In [235]:
# train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=42)

In [236]:
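# fit and make predictions
# Left commented out: tranformer2 refers to columns ("Name", "grades") that are not in X,
# so fitting this pipeline would fail.
# pipe.fit(X_train,y_train)

# y_pred = pipe.predict(X_test)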

In [237]:
# y_pred still holds the pipe3 predictions from In [76], so this report repeats the earlier one
print(classification_report(y_true=y_test,y_pred=y_pred))

              precision    recall  f1-score   support

           0       0.80      0.84      0.82       105
           1       0.75      0.70      0.73        74

    accuracy                           0.78       179
   macro avg       0.78      0.77      0.77       179
weighted avg       0.78      0.78      0.78       179

