# Titanic Competition
You should build an end-to-end machine learning pipeline to predict survivors of the Titanic disaster and participate in the corresponding Kaggle competition. In particular, you should do the following:
- Read the Titanic competition page on [Kaggle](https://www.kaggle.com/competitions/titanic/overview).
- Load the `titanic` dataset using [Pandas](https://pandas.pydata.org/docs/reference/api/pandas.read_csv.html). You can find this dataset in the datasets folder.
- Split the dataset into training and test sets using [Scikit-Learn](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html).
- Build an end-to-end machine learning pipeline, including all necessary steps, to obtain a working baseline solution and measure its performance.
- Collaborate with your groupmates to finalize your pipeline by
    - reading the discussion forum to learn from other community members;
    - discussing the bottlenecks of your current solution;
    - running experiments on your pipeline;
    - improving the performance of your pipeline.
- Test the best pipeline on the test set and report various [evaluation metrics](https://scikit-learn.org/0.15/modules/model_evaluation.html).  
- Present your pipeline.
- Submit your predictions to Kaggle.

## Importing Libraries

In [382]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression, Lasso, Ridge
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from mlxtend.plotting import plot_decision_regions


## Importing Dataset

In [354]:
df = pd.read_csv('https://raw.githubusercontent.com/m-mahdavi/teaching/refs/heads/main/datasets/titanic.csv')

## Splitting Dataset

In [355]:
df_train, df_test = train_test_split(df, random_state=42)
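
The call above relies on the default 75/25 split and does not stratify on the target. As a minimal sketch, assuming a made-up `toy` frame in place of the Titanic data (the column values are invented for illustration), passing `stratify` keeps the share of survivors the same in both parts:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Titanic frame: 40% positives.
toy = pd.DataFrame({
    "Fare": range(100),
    "Survived": [1] * 40 + [0] * 60,
})

# stratify= preserves the Survived ratio in the training and test parts.
toy_train, toy_test = train_test_split(
    toy, test_size=0.25, random_state=42, stratify=toy["Survived"]
)

train_rate = toy_train["Survived"].mean()
test_rate = toy_test["Survived"].mean()
print(len(toy_train), len(toy_test), train_rate, test_rate)
```

Whether stratification changes the results on the real data is not something this sketch shows; it only illustrates the mechanism.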

## Data Exploration

In [356]:
df_train.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 298 | 299 | 1 | 1 | Saalfeld, Mr. Adolphe | male | NaN | 0 | 0 | 19988 | 30.5 | C106 | S |
| 884 | 885 | 0 | 3 | Sutehall, Mr. Henry Jr | male | 25.0 | 0 | 0 | SOTON/OQ 392076 | 7.05 | NaN | S |
| 247 | 248 | 1 | 2 | Hamalainen, Mrs. William (Anna) | female | 24.0 | 0 | 2 | 250649 | 14.5 | NaN | S |
| 478 | 479 | 0 | 3 | Karlsson, Mr. Nils August | male | 22.0 | 0 | 0 | 350060 | 7.5208 | NaN | S |
| 305 | 306 | 1 | 1 | Allison, Master. Hudson Trevor | male | 0.92 | 1 | 2 | 113781 | 151.55 | C22 C26 | S |


In [357]:
df_train.shape

(668, 12)

In [358]:
df_train.isnull().sum()

PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,132
SibSp,0
Parch,0
Ticket,0
Fare,0


In [359]:
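# Impute with statistics computed on the training split.
# Note: only df_train is filled here; df_test keeps its missing values.
df_train["Age"] = df_train["Age"].fillna(df_train["Age"].mean())
df_train["Embarked"] = df_train["Embarked"].fillna(df_train["Embarked"].mode()[0])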

In [360]:
df_train.isnull().sum()

PassengerId,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,0
SibSp,0
Parch,0
Ticket,0
Fare,0
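
The imputation above is applied to `df_train` only. One way to reuse the training statistics on the test split is scikit-learn's `SimpleImputer`. The following is a minimal sketch on made-up frames (`train` and `test` here are hypothetical stand-ins, not the notebook's variables):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature frames with missing ages.
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0]})
test = pd.DataFrame({"Age": [np.nan, 50.0]})

imputer = SimpleImputer(strategy="mean")

# The mean (30.0) is learned from the training frame only...
train_filled = imputer.fit_transform(train)
# ...and the same value is reused for the test frame.
test_filled = imputer.transform(test)

print(train_filled.ravel(), test_filled.ravel())
```

Learning the fill value on the training split and reusing it on the test split avoids leaking test-set information into preprocessing.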


## Data Preprocessing

In [361]:
x_train = df_train.drop(["PassengerId","Survived", "Name", "Ticket", "Cabin"], axis=1)
x_test = df_test.drop(["PassengerId","Survived", "Name", "Ticket", "Cabin"], axis=1)
y_train = df_train["Survived"]
y_test = df_test["Survived"]

## Feature Engineering

In [362]:
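categorical_attributes = x_train.select_dtypes(include=['object']).columns
# Note: 'int64' selects only Pclass, SibSp and Parch. The float64 columns
# (Age, Fare) match neither selector, so the ColumnTransformer drops them
# (remainder='drop' is the default), which is why 8 columns remain below.
numerical_attributes = x_train.select_dtypes(include=['int64']).columns
ct = ColumnTransformer(
    [
        ("scaling", StandardScaler(), numerical_attributes),
        ("encoding", OneHotEncoder(handle_unknown='ignore'), categorical_attributes),
    ]
)
# Fit on the training data only, then apply the same transform to both sets.
ct.fit(x_train)
x_train = ct.transform(x_train)
x_test = ct.transform(x_test)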

print("x_train size:",x_train.shape)
print("x_test size:",x_test.shape)

x_train size: (668, 8)
x_test size: (223, 8)
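
The 8 output columns come from three scaled integer columns plus five one-hot columns. `Age` and `Fare` are `float64`, so the `int64` selector skips them and the transformer discards them. A possible alternative, sketched here on a made-up frame (`x_toy` and its values are hypothetical), selects every numeric dtype and imputes inside the transformer:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical miniature of the feature frame (values are made up).
x_toy = pd.DataFrame({
    "Pclass": [1, 3, 2, 3],
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 35.0, 4.0],
    "Fare": [71.3, 7.9, 13.0, 21.1],
})

# "number" matches integer and float columns alike.
numerical = x_toy.select_dtypes(include="number").columns
categorical = x_toy.select_dtypes(exclude="number").columns

# Missing numeric values are filled before scaling.
numeric_steps = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

ct_toy = ColumnTransformer([
    ("num", numeric_steps, numerical),
    ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
])

x_toy_t = ct_toy.fit_transform(x_toy)
print(x_toy_t.shape)  # 3 numeric columns + 2 one-hot columns
```

Applying this to the notebook would change the feature count and therefore the scores reported in the following sections, so it is shown only as a direction for experiments.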


## Model Training

In [363]:
sv = SVC()
sv.fit(x_train,y_train)

In [364]:
lr = LogisticRegression()
lr.fit(x_train,y_train)

In [365]:
dt = DecisionTreeClassifier()
dt.fit(x_train,y_train)

In [366]:
knn = KNeighborsClassifier()
knn.fit(x_train,y_train)

In [367]:
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

## Model Prediction

In [368]:
sv_pred = sv.predict(x_test)
lr_pred = lr.predict(x_test)
dt_pred = dt.predict(x_test)
knn_pred = knn.predict(x_test)
rf_pred = rf.predict(x_test)

## Model Evaluation

In [369]:
print("SVM: ",sv.score(x_train,y_train)*100, sv.score(x_test,y_test)*100)
print("LR: ",lr.score(x_train,y_train)*100, lr.score(x_test,y_test)*100)
print("DT: ",dt.score(x_train,y_train)*100, dt.score(x_test,y_test)*100)
print("KNN: ",knn.score(x_train,y_train)*100, knn.score(x_test,y_test)*100)
print("RF: ",rf.score(x_train,y_train)*100, rf.score(x_test,y_test)*100)

SVM:  81.88622754491018 80.26905829596413
LR:  79.49101796407186 77.57847533632287
DT:  84.28143712574851 78.9237668161435
KNN:  82.03592814371258 80.71748878923766
RF:  84.28143712574851 78.47533632286996
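
Each line above prints training accuracy followed by test accuracy. The same comparison can be written as a loop over a dictionary of models. The sketch below uses synthetic data from `make_classification` rather than the Titanic features, so its numbers are not comparable to the ones above:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption made for this illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=1)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "DT": DecisionTreeClassifier(random_state=1),
    "KNN": KNeighborsClassifier(),
}

# Store (train accuracy, test accuracy) per model; a large gap hints at overfitting.
scores = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    scores[name] = (model.score(X_tr, y_tr), model.score(X_te, y_te))
    print(f"{name}: train={scores[name][0]:.3f} test={scores[name][1]:.3f}")
```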


In [370]:
print("SVM Report:\n",classification_report(y_test, sv_pred))

SVM Report:
               precision    recall  f1-score   support

           0       0.81      0.88      0.84       134
           1       0.79      0.69      0.73        89

    accuracy                           0.80       223
   macro avg       0.80      0.78      0.79       223
weighted avg       0.80      0.80      0.80       223



In [371]:
print("Logistic Regression Report:\n",classification_report(y_test, lr_pred))

Logistic Regression Report:
               precision    recall  f1-score   support

           0       0.83      0.79      0.81       134
           1       0.71      0.75      0.73        89

    accuracy                           0.78       223
   macro avg       0.77      0.77      0.77       223
weighted avg       0.78      0.78      0.78       223



In [372]:
print("Decision Trees Report:\n",classification_report(y_test, dt_pred))

Decision Trees Report:
               precision    recall  f1-score   support

           0       0.79      0.88      0.83       134
           1       0.78      0.65      0.71        89

    accuracy                           0.79       223
   macro avg       0.79      0.77      0.77       223
weighted avg       0.79      0.79      0.79       223



In [373]:
print("KNN Report:\n",classification_report(y_test, knn_pred))

KNN Report:
               precision    recall  f1-score   support

           0       0.84      0.84      0.84       134
           1       0.76      0.75      0.76        89

    accuracy                           0.81       223
   macro avg       0.80      0.80      0.80       223
weighted avg       0.81      0.81      0.81       223



In [374]:
print("Random Forest Report: \n",classification_report(y_test, rf_pred))

Random Forest Report: 
               precision    recall  f1-score   support

           0       0.79      0.87      0.83       134
           1       0.77      0.66      0.71        89

    accuracy                           0.78       223
   macro avg       0.78      0.76      0.77       223
weighted avg       0.78      0.78      0.78       223
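
`confusion_matrix` is imported at the top of the notebook but not used. As a small illustration of how the precision and recall figures in the reports relate to raw counts, here is a sketch on hand-made labels (`y_true` and `y_hat` are invented, not the notebook's predictions):

```python
from sklearn.metrics import confusion_matrix

# Hand-made labels for illustration only.
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_hat = [0, 0, 1, 1, 1, 0, 1, 0]

# For binary labels the matrix is [[TN, FP], [FN, TP]].
cm = confusion_matrix(y_true, y_hat)
tn, fp, fn, tp = cm.ravel()

precision = tp / (tp + fp)  # share of predicted survivors that are correct
recall = tp / (tp + fn)     # share of actual survivors that are found
print(cm)
print(precision, recall)
```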



## Hyperparameter Tuning Using Grid Search CV

## SVM

In [375]:
gs_svm = {
    'C': [0.1, 1, 10, 100],
    'kernel': ['linear', 'poly', 'rbf', 'sigmoid']
}

gd_svm = GridSearchCV(estimator=sv, param_grid=gs_svm, cv=10)
gd_svm.fit(x_train, y_train)

## Logistic Regression

In [376]:
gs_lr = {
    'C': [0.001, 0.01, 0.1, 1, 10, 100],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear', 'saga'],
    'max_iter': [100, 200, 300]
}
gd_lr = GridSearchCV(estimator=lr, param_grid=gs_lr, cv=10)
gd_lr.fit(x_train, y_train)



## K-Nearest Neighbors

In [377]:
gs_knn = {"n_neighbors" : [2],
         "metric" : ["euclidean", "manhattan", "chebyshev", "minkowski"]}

gd_knn = GridSearchCV(estimator=knn, param_grid=gs_knn, cv=10)
gd_knn.fit(x_train, y_train)

## Decision Tree

In [378]:
gs_dt = {"criterion" : ["gini", "entropy", "log_loss"],
         "splitter" : ["best", "random"],
         "max_depth" : [i for i in range(1,10)]
         }

gd = GridSearchCV(estimator=dt, param_grid=gs_dt, cv=10)
gd.fit(x_train, y_train)

## Random Forest

In [379]:
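gs_rf = {
    'n_estimators': [50, 100],
    'max_depth': [None, 10],
    'min_samples_split': [2, 5],
    'min_samples_leaf': [1, 2],
    # Note: 'auto' was removed in scikit-learn 1.3, so every fit that uses it
    # fails and is scored as nan (see the failure report in the output).
    'max_features': ['auto', 'sqrt']
}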

gd_rf = GridSearchCV(estimator=rf, param_grid=gs_rf, cv=10)
gd_rf.fit(x_train, y_train)

160 fits failed out of a total of 320.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
160 fits failed with the following error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.11/dist-packages/sklearn/model_selection/_validation.py", line 866, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 1382, in wrapper
    estimator._validate_params()
  File "/usr/local/lib/python3.11/dist-packages/sklearn/base.py", line 436, in _validate_params
    validate_parameter_constraints(
  File "/usr/local/lib/python3.11/dist-packages/sklearn/utils/_param_validation.py", line 98, in validate_parameter_constraints
    raise InvalidParameterError(
sk
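
The 160 failed fits correspond to the half of the 32 parameter combinations that use `max_features='auto'`, a value that scikit-learn deprecated in 1.1 and removed in 1.3. A minimal sketch of a grid without that value follows. It assumes synthetic data and a smaller grid so that it runs quickly, and `'log2'` is only one possible replacement:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (an assumption made for this illustration).
X_demo, y_demo = make_classification(n_samples=120, n_features=8, random_state=42)

grid_demo = {
    "n_estimators": [10, 20],
    "max_depth": [None, 5],
    "max_features": ["sqrt", "log2"],  # both are accepted by current releases
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid=grid_demo,
    cv=3,
    error_score="raise",  # surface invalid parameters immediately instead of nan
)
search.fit(X_demo, y_demo)
print(search.best_params_)
```

Setting `error_score="raise"` is the debugging aid the warning itself suggests: an invalid parameter then stops the search with an exception rather than being silently scored as nan.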

In [380]:
print("SVM: ",gd_svm.best_params_)
print("Logistic Regression: ",gd_lr.best_params_)
print("KNN: ",gd_knn.best_params_)
print("Decision Tree: ",gd.best_params_)
print("Random Forest: ",gd_rf.best_params_)

SVM:  {'C': 0.1, 'kernel': 'rbf'}
Logistic Regression:  {'C': 0.1, 'max_iter': 100, 'penalty': 'l2', 'solver': 'liblinear'}
KNN:  {'metric': 'chebyshev', 'n_neighbors': 2}
Decision Tree:  {'criterion': 'entropy', 'max_depth': 4, 'splitter': 'random'}
Random Forest:  {'max_depth': 10, 'max_features': 'sqrt', 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 100}


In [381]:
print("Decision Tree: ",gd.best_score_*100)
print("SVM: ",gd_svm.best_score_*100)
print("Random Forest: ",gd_rf.best_score_*100)
print("Logistic Regression: ",gd_lr.best_score_*100)
print("KNN: ",gd_knn.best_score_*100)


Decision Tree:  81.00407055630934
SVM:  80.6919945725916
Random Forest:  80.10176390773405
Logistic Regression:  79.19945725915876
KNN:  77.40615106286748
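
The `best_score_` values above are mean cross-validated accuracies on the training data, not test-set results. To check a tuned model on held-out data, the refitted search object can predict directly. The sketch below assumes synthetic data and a deliberately small grid, so its numbers say nothing about the Titanic models:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption made for this illustration).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=0)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 4, 6]},
    cv=5,
)
search.fit(X_tr, y_tr)

# With refit=True (the default) the best setting is retrained on all of X_tr,
# so search.predict uses that refitted estimator.
test_acc = accuracy_score(y_te, search.predict(X_te))
print(search.best_params_, round(search.best_score_, 3), round(test_acc, 3))
```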
