# **Machine Learning with Pipeline - 1**
Building machine learning models with a pipeline is common practice in data science. A pipeline chains a series of data processing components (transformers followed by a final estimator) so that the output of each step becomes the input of the next. It organizes and automates the steps involved in building and evaluating a model, and it ensures that the preprocessing learned on the training data is applied unchanged at prediction time.

<img src="https://static.javatpoint.com/tutorial/machine-learning/images/machine-learning-pipeline2.png" style="width:100%">

## **Import Required Libraries**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## **Read the Data**

In [2]:
df = pd.read_csv(r"D:\Coding\Datasets\titanic.csv")
df

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


In [3]:
# Drop the unnecessary columns
df.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"], inplace=True)
df.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [4]:
# Check the column information
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       714 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


## **Train Test Split**

In [5]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(df.drop("Survived", axis=1),
                                                    df["Survived"],
                                                    test_size=0.3,
                                                    random_state=0)
x_train.shape, x_test.shape

((623, 7), (268, 7))

In [6]:
x_train

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
857,1,male,51.0,0,0,26.5500,S
52,1,female,49.0,1,0,76.7292,C
386,3,male,1.0,5,2,46.9000,S
124,1,male,54.0,0,1,77.2875,S
578,3,female,,1,0,14.4583,C
...,...,...,...,...,...,...,...
835,1,female,39.0,1,1,83.1583,C
192,3,female,19.0,1,0,7.8542,S
629,3,male,,0,0,7.7333,Q
559,3,female,36.0,1,0,17.4000,S


## **Preprocess the Data using Column Transformer**

In [7]:
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.compose import ColumnTransformer

In [8]:
# Create an imputation transformer for 'Age' and 'Embarked' columns
transformer_1 = ColumnTransformer([
    ("impute_age", SimpleImputer(), [2]),
    ("impute_embarked", SimpleImputer(strategy="most_frequent"), [6])
], remainder="passthrough")

In [9]:
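# Create a one-hot encoding transformer, intended for the 'Sex' and 'Embarked' columns.
# Caution: transformer_1 outputs its columns as [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare],
# so indices [1, 6] here select 'Embarked' and 'Fare'; [1, 3] would select 'Embarked' and 'Sex'.
# The outputs recorded in this notebook were produced with [1, 6].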
transformer_2 = ColumnTransformer([
    ("ohe_sex_embarked", OneHotEncoder(sparse_output=False, handle_unknown="ignore"), [1, 6])
], remainder="passthrough")

In [10]:
# Create a transformer to scale the first 10 columns to the [0, 1] range
transformer_3 = ColumnTransformer([
    ("scale", MinMaxScaler(), slice(0, 10))
])
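Because this transformer is created without `remainder="passthrough"`, the default `remainder="drop"` applies, so any column outside `slice(0, 10)` is discarded rather than passed on. A minimal sketch of that behaviour on a made-up array (the array `X` and the narrower slice are assumptions chosen to keep the example small):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical 4-column array
X = np.array([[0.0, 10.0, 5.0, 1.0],
              [5.0, 20.0, 6.0, 2.0],
              [10.0, 30.0, 7.0, 3.0]])

# Default remainder="drop": only the sliced columns survive
scale_only = ColumnTransformer([("scale", MinMaxScaler(), slice(0, 2))])
scaled = scale_only.fit_transform(X)

# remainder="passthrough": unscaled columns are appended after the scaled ones
scale_keep = ColumnTransformer([("scale", MinMaxScaler(), slice(0, 2))],
                               remainder="passthrough")
kept = scale_keep.fit_transform(X)

print(scaled.shape, kept.shape)
# (3, 2) (3, 4)
```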

In [11]:
# Create a transformer to select the 8 best features using the chi-squared test
transformer_4 = SelectKBest(score_func=chi2, k=8)
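The `chi2` score function only accepts non-negative features, which is one reason a `MinMaxScaler` step fits well before `SelectKBest` here. The following sketch, on randomly generated data that is purely an assumption for illustration, shows `chi2` rejecting negative inputs and working once the features are scaled to [0, 1].

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: standard-normal features, so some values are negative
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = (X[:, 0] > 0).astype(int)

# chi2 raises a ValueError when any feature value is negative
try:
    SelectKBest(score_func=chi2, k=2).fit(X, y)
    raised = False
except ValueError:
    raised = True

# After min-max scaling every value lies in [0, 1], so chi2 can be computed
X_scaled = MinMaxScaler().fit_transform(X)
selector = SelectKBest(score_func=chi2, k=2).fit(X_scaled, y)
X_selected = selector.transform(X_scaled)

print(raised, X_selected.shape)
```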

## **Train a Decision Tree Model**

In [12]:
from sklearn.tree import DecisionTreeClassifier

In [13]:
# Create a decision tree classifier (it is trained later, as the last step of the pipeline)
dt_classifier = DecisionTreeClassifier(random_state=0)

## **Create Pipeline**

In [14]:
from sklearn.pipeline import Pipeline

In [15]:
pipe = Pipeline([
    ("transformer_1", transformer_1),
    ("transformer_2", transformer_2),
    ("transformer_3", transformer_3),
    ("transformer_4", transformer_4),
    ("dt_classifier", dt_classifier)
])

## **Pipeline vs make_pipeline**

In [16]:
from sklearn.pipeline import make_pipeline

In [17]:
# Alternative syntax: make_pipeline names each step automatically from its class name
pipe = make_pipeline(transformer_1, transformer_2, transformer_3, transformer_4, dt_classifier)
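The practical difference between the two constructors is step naming. `Pipeline` takes explicit `(name, estimator)` pairs, while `make_pipeline` derives each name from the lowercased class name and appends `-1`, `-2`, ... when a class appears more than once. The sketch below uses small stand-in steps (the names `explicit` and `auto` are illustrative) to show the generated names, which are also the prefixes used in parameter names such as `decisiontreeclassifier__max_depth`.

```python
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Explicit names chosen by the user
explicit = Pipeline([
    ("scale", MinMaxScaler()),
    ("tree", DecisionTreeClassifier(random_state=0)),
])

# Names generated automatically; the repeated class gets numeric suffixes
auto = make_pipeline(MinMaxScaler(), MinMaxScaler(),
                     DecisionTreeClassifier(random_state=0))

explicit_names = list(explicit.named_steps)
auto_names = list(auto.named_steps)
print(explicit_names)  # ['scale', 'tree']
print(auto_names)      # ['minmaxscaler-1', 'minmaxscaler-2', 'decisiontreeclassifier']

# Step parameters are addressed as <step name>__<parameter name>
has_param = "decisiontreeclassifier__max_depth" in auto.get_params()
```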

## **Train the Model using Pipeline**

In [18]:
# Train the model
pipe.fit(x_train, y_train)

## **Explore the Pipeline**

In [19]:
# Print the steps in the pipeline
pipe.named_steps

{'columntransformer-1': ColumnTransformer(remainder='passthrough',
                   transformers=[('impute_age', SimpleImputer(), [2]),
                                 ('impute_embarked',
                                  SimpleImputer(strategy='most_frequent'),
                                  [6])]),
 'columntransformer-2': ColumnTransformer(remainder='passthrough',
                   transformers=[('ohe_sex_embarked',
                                  OneHotEncoder(handle_unknown='ignore',
                                                sparse_output=False),
                                  [1, 6])]),
 'columntransformer-3': ColumnTransformer(transformers=[('scale', MinMaxScaler(), slice(0, 10, None))]),
 'selectkbest': SelectKBest(k=8, score_func=<function chi2 at 0x000001D093F07560>),
 'decisiontreeclassifier': DecisionTreeClassifier(random_state=0)}

In [20]:
# Check the mean learned by the SimpleImputer for the 'Age' column
pipe.named_steps["columntransformer-1"].transformers_[0][1].statistics_

array([29.91533865])

## **Accuracy Assessment**

In [21]:
# Predict on the test data
y_pred = pipe.predict(x_test)

In [22]:
from sklearn.metrics import accuracy_score

In [23]:
# Print the accuracy of the model on the test set
accuracy_score(y_test, y_pred)

0.6417910447761194

## **Cross Validation using Pipeline**

In [24]:
from sklearn.model_selection import cross_val_score

In [25]:
# Cross validation using cross_val_score
cross_val_score(pipe, x_train, y_train, cv=5, scoring="accuracy").mean()

0.6324387096774193

## **GridSearch using Pipeline**

In [26]:
# Define the parameters for GridSearch
params = {
    "decisiontreeclassifier__max_depth":[1, 2, 3, 4, 5, None]
}

In [27]:
from sklearn.model_selection import GridSearchCV

In [28]:
# Create an object of the GridSearchCV Class
grid = GridSearchCV(estimator=pipe, param_grid=params, cv=5, scoring="accuracy")

# Fit the training data
grid.fit(x_train, y_train)

In [29]:
# Print the best parameters for the model
grid.best_params_

{'decisiontreeclassifier__max_depth': 5}

In [30]:
# Print the mean cross-validated accuracy of the best parameter setting
grid.best_score_

0.6324387096774193
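Note that `best_score_` is a cross-validation average over the training folds, not a test-set score. With the default `refit=True`, the grid search also refits the best pipeline on the full training data and exposes it as `best_estimator_`, so held-out data can be scored directly. A minimal sketch on synthetic data (the dataset from `make_classification` and the names `toy_pipe` and `toy_grid` are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic classification data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

toy_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
toy_grid = GridSearchCV(
    estimator=toy_pipe,
    param_grid={"decisiontreeclassifier__max_depth": [1, 2, 3, None]},
    cv=5,
    scoring="accuracy",
)
toy_grid.fit(X_tr, y_tr)

cv_score = toy_grid.best_score_            # mean accuracy across the 5 folds
test_score = toy_grid.score(X_te, y_te)    # accuracy of best_estimator_ on held-out data
best_depth = toy_grid.best_params_["decisiontreeclassifier__max_depth"]
print(best_depth, cv_score, test_score)
```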

## **Export the Pipeline**

In [31]:
import pickle

In [32]:
# Use a raw string for the Windows path and close the file automatically
with open(r"D:\Coding\Models\pipe.pkl", "wb") as f:
    pickle.dump(pipe, f)
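A pickled pipeline can be loaded back later and used for prediction without refitting, since the fitted preprocessing steps are stored together with the model. The round trip can be sketched as follows; the synthetic data, the temporary directory, and the names `toy_pipe` and `restored` are assumptions for illustration. Only unpickle files from sources you trust, and preferably with the same scikit-learn version that created them.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical fitted pipeline standing in for `pipe`
X, y = make_classification(n_samples=100, n_features=5, random_state=0)
toy_pipe = make_pipeline(MinMaxScaler(), DecisionTreeClassifier(random_state=0))
toy_pipe.fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipe.pkl")

    # Save the fitted pipeline
    with open(path, "wb") as f:
        pickle.dump(toy_pipe, f)

    # Load it back
    with open(path, "rb") as f:
        restored = pickle.load(f)

# The restored pipeline reproduces the original predictions
same_predictions = np.array_equal(toy_pipe.predict(X), restored.predict(X))
print(same_predictions)  # True
```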