<a href="https://colab.research.google.com/github/prachigupta2006/feature-engineering/blob/main/titanic_with_pipelines.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
import pandas as pd
import numpy as np

In [2]:
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline,make_pipeline
from sklearn.feature_selection import SelectKBest,chi2

In [3]:
from google.colab import files
uploaded = files.upload()

Saving Titanic-Dataset.csv to Titanic-Dataset.csv


In [4]:
df= pd.read_csv('Titanic-Dataset.csv')
df.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
df.drop(['PassengerId', 'Name', 'Ticket', 'Cabin'], axis=1, inplace=True)

In [6]:
df.isnull().sum()

column,null_count
Survived,0
Pclass,0
Sex,0
Age,177
SibSp,0
Parch,0
Fare,0
Embarked,2


In [7]:
X_train,X_test,y_train,y_test = train_test_split(df.drop('Survived',axis=1),df['Survived'],test_size=0.2,random_state=42)

In [8]:
X_train

,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
331,1,male,45.5,0,0,28.5000,S
733,2,male,23.0,0,0,13.0000,S
382,3,male,32.0,0,0,7.9250,S
704,3,male,26.0,1,0,7.8542,S
813,3,female,6.0,4,2,31.2750,S
...,...,...,...,...,...,...,...
106,3,female,21.0,0,0,7.6500,S
270,1,male,,0,0,31.0000,S
860,3,male,41.0,2,0,14.1083,S
435,1,female,14.0,1,2,120.0000,S


In [9]:
y_train

,Survived
331,0
733,0
382,0
704,0
813,0
...,...
106,1
270,0
860,0
435,1


When imputing, it is better to refer to columns by index number than by name: each ColumnTransformer returns a NumPy array without column names, so the later steps of the pipeline can only address columns by position.


In [10]:
#imputation transformer
trf1=ColumnTransformer([
    ('impute_age',SimpleImputer(),[2]),
    ('impute_embarked',SimpleImputer(strategy='most_frequent'),[6])
],remainder='passthrough')
#2 = age column
#6 = embarked column
#if not used passthrough other columns would have been dropped
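
One consequence of working with indices is worth checking on a small example: a ColumnTransformer places the columns it transforms first and appends the passthrough columns after them, so positions change from one step to the next. The sketch below assumes a made-up three-row array (`toy` is a hypothetical stand-in, not the real dataset) with the same column order as X_train.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical toy rows in the X_train column order:
# Pclass, Sex, Age, SibSp, Parch, Fare, Embarked (values are made up)
toy = np.array([
    [3, 'male', 22.0, 1, 0, 7.25, 'S'],
    [1, 'female', np.nan, 0, 0, 71.28, np.nan],
    [2, 'male', 40.0, 0, 2, 13.0, 'S'],
], dtype=object)

imputer = ColumnTransformer([
    ('impute_age', SimpleImputer(), [2]),
    ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
], remainder='passthrough')

out = imputer.fit_transform(toy)

# Output order: Age, Embarked, then the passthrough columns
# Pclass, Sex, SibSp, Parch, Fare
print(out[0])
print(out[1])  # the missing Age and Embarked values are now filled in
```

In this output Sex sits at index 3 and Embarked at index 1, so any index used by a later step has to refer to these new positions rather than to the positions in the original frame.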

In [11]:
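#one hot encoding
#trf1 outputs [Age, Embarked, Pclass, Sex, SibSp, Parch, Fare]:
#transformed columns come first, passthrough columns follow,
#so Embarked is now at index 1 and Sex at index 3
#sparse_output was named sparse before scikit-learn 1.2
trf2=ColumnTransformer([
    ('ohe_sex_embarked',OneHotEncoder(sparse_output=False,drop='first'),[1,3])
],remainder='passthrough')

#Sex and Embarked are one-hot encoded together in a single step
#drop='first' drops the first category of each column (dummy encoding)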

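Dummy encoding (dropping one category per feature) matters mostly in linear and regression models, where the full set of one-hot columns is perfectly collinear; a decision tree does not really need it.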
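
To make the effect of dummy encoding concrete, here is a minimal sketch on a made-up column of ports (the `ports` array is hypothetical). It calls `.toarray()` on the encoder's default sparse output so that it does not depend on the name of the dense-output argument.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical single column with three categories: C, Q, S
ports = np.array([['S'], ['C'], ['Q'], ['S']], dtype=object)

# Full one-hot encoding: one column per category
full = OneHotEncoder().fit_transform(ports).toarray()

# Dummy encoding: the first category (C) is dropped
dummy = OneHotEncoder(drop='first').fit_transform(ports).toarray()

print(full.shape)   # (4, 3)
print(dummy.shape)  # (4, 2)
print(dummy[1])     # the row for 'C' is all zeros
```

The dropped category is still recoverable, because a row of all zeros can only mean 'C'.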

In [12]:
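#scaling
#slice(0,10) selects the first ten columns; columns outside the slice
#would be dropped, because remainder defaults to 'drop'
trf3 = ColumnTransformer([
    ('scale',MinMaxScaler(),slice(0,10))
])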

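We use min-max scaling because of the feature selection in the next step: the chi2 score function only accepts non-negative features, and MinMaxScaler maps every column into the range 0 to 1. The decision tree itself does not need scaled inputs.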
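
The link between scaling and feature selection can be checked directly. This is a small sketch on made-up numbers (`X` and `y` are hypothetical): chi2 rejects input that contains a negative value and accepts the same data once it has been min-max scaled.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: the first column contains a negative value
X = np.array([[-1.0, 10.0], [0.0, 20.0], [2.0, 15.0], [3.0, 40.0]])
y = np.array([0, 0, 1, 1])

try:
    chi2(X, y)
    raised = False
except ValueError:
    # chi2 refuses negative input
    raised = True

# After min-max scaling every value lies between 0 and 1
X_scaled = MinMaxScaler().fit_transform(X)
scores, p_values = chi2(X_scaled, y)

print(raised)
print(scores)
```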

In [13]:
#feature selection

trf4=SelectKBest(score_func=chi2,k=8)

In [14]:
#train the model

trf5=DecisionTreeClassifier()

# CREATE PIPELINE

In [15]:
from sklearn.pipeline import Pipeline,make_pipeline

In [16]:
pipe = Pipeline([
    ('trf1',trf1),
    ('trf2',trf2),
    ('trf3',trf3),
    ('trf4',trf4),
    ('trf5',trf5)
])

PIPELINE VS make_pipeline

In [17]:
#alternate syntax
'''pipe=make_pipeline(trf1,trf2,trf3,trf4,trf5)'''

'pipe=make_pipeline(trf1,trf2,trf3,trf4,trf5)'

The difference between Pipeline and make_pipeline is that with make_pipeline you don't pass names, only the objects; each step is named automatically after its class, in lowercase.
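
The naming difference is easy to see on a two-step example. The step names 'impute' and 'tree' below are made up for illustration; the automatic names come from make_pipeline itself.

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Pipeline: we choose the step names ('impute' and 'tree' are hypothetical)
named = Pipeline([
    ('impute', SimpleImputer()),
    ('tree', DecisionTreeClassifier()),
])

# make_pipeline: names are generated from the lowercased class names
auto = make_pipeline(SimpleImputer(), DecisionTreeClassifier())

print(list(named.named_steps))  # ['impute', 'tree']
print(list(auto.named_steps))   # ['simpleimputer', 'decisiontreeclassifier']
```

These names matter later, because hyperparameters are addressed through them, for example `decisiontreeclassifier__max_depth` for a pipeline built with make_pipeline.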

In [18]:
#train

pipe.fit(X_train,y_train)



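We call "fit" when the pipeline ends with an algorithm (an ML model): the pipeline is fitted on the training data and then used to predict. When the pipeline contains only transformers and no model, we call "fit_transform" instead, which fits the steps and returns the transformed data.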
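
A minimal sketch of the two cases, on made-up data (`X`, `y`, `prep` and `model` are hypothetical names): a transformer-only pipeline is used with fit_transform, while a pipeline that ends in a model is used with fit and predict.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical data with one missing value
X = np.array([[1.0], [np.nan], [4.0], [7.0]])
y = np.array([0, 1, 1, 1])

# Transformers only: fit_transform returns the transformed data
prep = make_pipeline(SimpleImputer(), MinMaxScaler())
X_ready = prep.fit_transform(X)

# Ends with a model: fit on the training data, then predict
model = make_pipeline(SimpleImputer(), MinMaxScaler(),
                      DecisionTreeClassifier(random_state=0))
model.fit(X, y)
pred = model.predict(X)

print(X_ready.ravel())
print(pred)
```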

In [19]:
#display

from sklearn import set_config
set_config(display='diagram')

# EXPLORE PIPELINES


In [20]:
pipe.named_steps['trf1'].transformers_

[('impute_age', SimpleImputer(), [2]),
 ('impute_embarked', SimpleImputer(strategy='most_frequent'), [6]),
 ('remainder', 'passthrough', [0, 1, 3, 4, 5])]

In [21]:
pipe.named_steps['trf1'].transformers_[0]

('impute_age', SimpleImputer(), [2])

In [22]:
pipe.named_steps['trf1'].transformers_[0][1]

In [23]:
pipe.named_steps['trf1'].transformers_[0][1].statistics_

array([29.49884615])

In [24]:
pipe.named_steps['trf1'].transformers_[1][1].statistics_

array(['S'], dtype=object)

By inspecting the fitted steps like this, we can trace back through the pipeline and see what each step learned, such as the mean age and the most frequent port used for imputation, which helps with further analysis.

We can experiment with the transformers and do a post-mortem of the fitted pipeline.


# PREDICTION

In [None]:
y_pred=pipe.predict(X_test)

In [None]:
y_pred

In [None]:
from sklearn.metrics import accuracy_score
accuracy_score(y_test,y_pred)

# CROSS VALIDATION

In [None]:
#cross validation using cross_val_score
from sklearn.model_selection import cross_val_score
cross_val_score(pipe,X_train,y_train,cv=5,scoring='accuracy').mean()

# GRIDSEARCH USING PIPELINE

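With the max_depth hyperparameter you can tune the model: a deeper tree fits the training data more closely, and the grid search below tries several depths to see which gives the best cross-validated accuracy.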

In [None]:
#gridsearchcv
params ={
    'trf5__max_depth':[1,2,3,4,5,None]
}

In [None]:
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(pipe,params,cv=5,scoring='accuracy')
grid.fit(X_train,y_train)

Here trf5 is the name we gave to the model step, the decision tree, so its hyperparameters are written as trf5__ followed by the parameter name.
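
The step name and parameter name are joined with a double underscore. The sketch below uses a made-up two-step pipeline (`demo` and the step name 'scale' are hypothetical) to show that the same key works for get_params and set_params, which is what the grid search relies on.

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.tree import DecisionTreeClassifier

# Hypothetical pipeline whose model step is also called 'trf5'
demo = Pipeline([
    ('scale', MinMaxScaler()),
    ('trf5', DecisionTreeClassifier()),
])

# Every tunable parameter is listed under <step name>__<parameter name>
has_key = 'trf5__max_depth' in demo.get_params()

# Setting the key changes the parameter on the step itself
demo.set_params(trf5__max_depth=3)
depth = demo.named_steps['trf5'].max_depth

print(has_key)  # True
print(depth)    # 3
```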

In [None]:
grid.best_score_

In [None]:
grid.best_params_

# EXPORT THE PIPELINE

In [None]:
#export
import pickle
import numpy as np
#the with block closes the file once the pipeline has been written
with open('pipe.pkl','wb') as f:
    pickle.dump(pipe,f)

You don't have to export the one-hot encoder and the simple imputer separately, because they are already part of the pipe object saved above.
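
A small round trip shows what this means in practice. The sketch assumes made-up data and pickles to an in-memory bytes object instead of a file (`fitted`, `restored` and `raw` are hypothetical names); the restored pipeline imputes raw input on its own before predicting.

```python
import pickle

import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Hypothetical training data with one missing value
X = np.array([[1.0], [2.0], [np.nan], [9.0], [10.0]])
y = np.array([0, 0, 0, 1, 1])

fitted = make_pipeline(SimpleImputer(),
                       DecisionTreeClassifier(random_state=0)).fit(X, y)

# Serialize and restore the whole pipeline, preprocessing included
restored = pickle.loads(pickle.dumps(fitted))

# Raw input, still containing a missing value, goes straight to predict
raw = np.array([[np.nan], [9.5]])
print(restored.predict(raw))
```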

In [None]:
#assume user input (Pclass, Sex, Age, SibSp, Parch, Fare, Embarked)
test_input2 = np.array([2,'male',31.0,0,0,10.5,'S'],dtype=object).reshape(1,7)


In [None]:
pipe.predict(test_input2)

In the file without a pipeline you had to write a lot of code to repeat every preprocessing step before predicting. With a pipeline, the code is shorter and takes less time to write.

How to load the file?

Copy the URL and paste it in another tab. You will not be able to view the file there, but you can load it.