
# Introduction
In this notebook I will dive into the Titanic dataset and see whether I can predict, with high accuracy, if a passenger survived or not.

I will be incorporating machine learning, as well as feature engineering to build my models.

This is for the ongoing Titanic - Machine Learning from Disaster competition.

**Importing Libraries**

In [None]:
# For data manipulation
import pandas as pd
import numpy as np

# For data visualization
import matplotlib.pyplot as plt
import seaborn as sns

# For Machine Learning model building
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier


# For model evaluation
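from sklearn.metrics import accuracy_score, roc_auc_score, ConfusionMatrixDisplay

# plot_confusion_matrix was removed in scikit-learn 1.2; fall back to its replacement
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:
    plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator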



import warnings
warnings.filterwarnings('ignore')


# **Process the data**

**I downloaded 'train.csv' and 'test.csv' from the Kaggle Titanic competition page (https://www.kaggle.com/c/titanic/data).**

In [None]:
train_df = pd.read_csv('/train.csv')
test_df = pd.read_csv('/test.csv')


**Viewing the Data Information**

In [None]:
print(f'Train Data Shape: {train_df.shape}')
print(f'Test Data Shape: {test_df.shape}')

In [None]:
# let's check the first five rows of the train data
train_df.head()

# **Cleaning the Data**

***Dropping PassengerId, Name and Ticket, as I don't think they are useful for model performance.***

In [None]:
train_df.drop(['PassengerId','Name','Ticket'], axis=1, inplace=True)

In [None]:
# Summary Statistics
train_df.describe()

In [None]:
# checking for null values in train data
train_df.isnull().sum()

In [None]:
# checking for null values in test data
test_df.isnull().sum()

***Most of the Cabin data is missing so I have decided to drop Cabin.***

In [None]:
train_df.drop('Cabin', axis=1, inplace=True)

In [None]:
# Let's get an overview of features datatype
train_df.dtypes

***We can see that Survived, Pclass, SibSp and Parch have an integer data type, but we know that they are categorical variables. So let's convert them to the object datatype.***

In [None]:
# convert datatype to object for further analysis
columns_to_convert = ['Survived','Pclass','SibSp','Parch']
train_df[columns_to_convert] = train_df[columns_to_convert].astype(str)

In [None]:
# categorical and numeric features
cat_features = train_df.select_dtypes(exclude="number").columns
num_cols = train_df.select_dtypes(include="number").columns
print('Categorical Features are: ', cat_features)
print('Numerical Features are: ', num_cols)

In [None]:
plt.figure(figsize=(8,6))
sns.countplot(x = 'Survived', data = train_df)
plt.show()

# **Exploring the Data**

**It's clear from the above plot that the majority of the people on board the Titanic didn't survive.**

In [None]:
fig, ax = plt.subplots(2, 3, figsize=(25, 15))
fig.suptitle("Visualizing Continuous Features", size=25)  # set the figure title once
for i, col in enumerate(num_cols):
    # histplot(kde=True) replaces the deprecated sns.distplot
    d = sns.histplot(data=train_df, x=col, kde=True, stat='density', ax=ax[i, 0])
    d.set_title(f'Distribution Plot of {col}', loc='center', y=1.05, size=18, weight='bold', color='r')
    b = sns.boxplot(data=train_df, x=col, ax=ax[i, 1])
    b.set_title(f'Boxplot of {col}', loc='center', y=1.04, size=18, weight='bold', color='b')
    # fill=True replaces the deprecated shade=True
    s = sns.kdeplot(data=train_df, x=col, hue='Survived', fill=True, ax=ax[i, 2], palette='ocean')
    s.set_title(f'Distribution of {col} based on Survived', loc='center', y=1.04, size=18, weight='bold', color='green')


***From the above distribution plot of Age by survival, we can say that children had a better chance of survival than older passengers.***

In [None]:
# Checking value in each categorical feature
cat_cols = cat_features[1:]
for col in cat_cols:
    print(f'============{col}============\n {train_df[col].value_counts()}\n')

In [None]:
fig, ax = plt.subplots(5, 2, figsize=(18, 30))
for i, col in enumerate(cat_cols):
    sns.countplot(data=train_df, x=col, ax=ax[i, 0])
    sns.countplot(data=train_df, x=col, hue='Survived', ax=ax[i, 1])
# Title each column of plots once, on the top row
ax[0, 0].set_title('Count plot for Categorical Features', loc='center', y=1.1, size=18, weight='bold', color='green')
ax[0, 1].set_title('Count plot for Categorical Features Based on Survived', loc='center', y=1.1, size=18, weight='bold', color='green')

***1. It seems like people travelling in 3rd class were less likely to survive than people travelling in 1st class.***

***2. We can also conclude that female passengers were more likely to survive than male passengers.***

***3. The majority of passengers embarked from Southampton, so I have decided to fill the missing values in the Embarked column with S (Southampton).***
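
As a small illustration of the third point, here is a sketch of filling missing values with the most frequent category. The `embarked` series below is hypothetical toy data standing in for `train_df['Embarked']`, not the real column.

```python
import pandas as pd

# Hypothetical toy column standing in for train_df['Embarked']
embarked = pd.Series(['S', 'C', 'S', None, 'Q', 'S', None], name='Embarked')

# mode() returns the most frequent non-null value(s); take the first one
most_common = embarked.mode()[0]

# Replace the missing entries with that value
embarked_filled = embarked.fillna(most_common)

print(most_common)
print(embarked_filled.tolist())
```

In the notebook itself, the same effect is achieved later by `SimpleImputer(strategy='most_frequent')` inside the categorical pipeline.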

# Data Preprocessing

**Both the SibSp and Parch columns indicate whether the person was travelling with family or not. So we will combine these features into a single binary feature called SibSP_Parch.**

In [None]:
# convert back to integer datatype for feature engineering and modelling
columns_to_convert = ['Survived','Pclass','SibSp','Parch']
train_df[columns_to_convert] = train_df[columns_to_convert].astype(int)

In [None]:
train_df['SibSP_Parch'] = np.where(train_df['SibSp'] + train_df['Parch'] > 0, 1, 0)
# drop SibSp and Parch
train_df.drop(['SibSp', 'Parch'], axis= 1, inplace= True)
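
To make the new feature concrete, here is a minimal sketch on a hypothetical four-passenger frame (`toy` is made up for illustration). The flag is 1 whenever a passenger has at least one sibling, spouse, parent or child on board, and 0 otherwise.

```python
import numpy as np
import pandas as pd

# Hypothetical toy passengers: alone, with a sibling, with children, with both
toy = pd.DataFrame({'SibSp': [0, 1, 0, 3], 'Parch': [0, 0, 2, 1]})

# Same rule as in the notebook: 1 if any family member is on board, else 0
toy['SibSP_Parch'] = np.where(toy['SibSp'] + toy['Parch'] > 0, 1, 0)

print(toy)
```

Note that this keeps only "with family or alone" and discards the family size.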

In [None]:
onehot_encoding_features = ['Sex','Embarked']
scaling_features = num_cols

In [None]:
# splitting the data into X and y
X = train_df.drop('Survived', axis=1)
y = train_df['Survived']

**Displaying the first 5 rows of X**

In [None]:
X.head()

***For categorical features, we'll impute the missing values with the mode of the column and encode them with One-Hot encoding***

In [None]:

categorical_pipeline = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy='most_frequent')),
        # sparse_output replaces the sparse argument from scikit-learn 1.2 onwards
        ("onehot-encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False, drop='first')),
    ]
)

***For the numeric features, specifically 'Age', we will impute the missing values with the median of the column and then standardize them***

In [None]:
numeric_pipeline = Pipeline(
    steps=[
           ("imputer", SimpleImputer(strategy='median')),
            ("scaler", StandardScaler())
         ]
)
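
The following sketch shows the two steps on a hypothetical `Age` column with one missing value and one large outlier. The median is less sensitive to the outlier than the mean would be, which is one reason to choose `strategy='median'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical toy ages: one missing value and one outlier (100)
toy = pd.DataFrame({'Age': [20.0, np.nan, 30.0, 40.0, 100.0]})

pipe = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),  # fills NaN with the median of 20, 30, 40, 100
    ('scaler', StandardScaler()),                   # rescales to zero mean and unit variance
])
scaled = pipe.fit_transform(toy)

fill_value = pipe.named_steps['imputer'].statistics_[0]
print('Imputed with:', fill_value)
print('Mean after scaling:', scaled.mean())
print('Std after scaling:', scaled.std())
```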

***Next, we will pass these features, along with their corresponding pipelines, into a ColumnTransformer instance***

In [None]:
col_transformer  = ColumnTransformer(
    transformers=[
        ("numeric", numeric_pipeline, scaling_features),
        ("categorical", categorical_pipeline, onehot_encoding_features),
    ],
    remainder='passthrough'
)

***Apply preprocessing***

In [None]:
X_transformed  = col_transformer.fit_transform(X)
y_transformed = y.values.reshape(-1,1)
print('X Shape: ', X_transformed.shape)
print('y shape: ', y_transformed.shape)

In [None]:
onehot_cols = (
    col_transformer
    .named_transformers_["categorical"]
    .named_steps["onehot-encoder"]
    .get_feature_names_out(onehot_encoding_features)
)
onehot_cols

In [None]:
passthrough_features = [col for col in X.columns if (col not in onehot_encoding_features) and (col not in scaling_features)]
transformed_columns = scaling_features.tolist() + onehot_cols.tolist() + passthrough_features
transformed_columns

In [None]:
X_transformed = pd.DataFrame(X_transformed, columns = transformed_columns)
X_transformed.head()

***Now we will split the training data into a train set and a hold-out test set***

In [None]:
# train test split
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y_transformed, test_size = 0.2, random_state = 2022)

# **Choosing the Model**

**Selecting my classification model (trying 7 models to find the best one)**

Here we will implement the LogisticRegression, KNeighborsClassifier, SVC (Support Vector Classifier), GaussianNB, DecisionTreeClassifier, RandomForestClassifier and XGBClassifier models.

In [None]:
# Models
models = [
           LogisticRegression(solver='liblinear'),
           KNeighborsClassifier(n_neighbors = 5),
           SVC(probability=True),
           GaussianNB(),
           DecisionTreeClassifier(random_state=2022),
           RandomForestClassifier(random_state=2022),
           XGBClassifier(random_state=2022)]

model_names=['Logistic Regression','KNN', 'SVM','Naive Bayes', 'Decision Tree','Random Forest','XGBoost']

***Put the ROC AUC scores and accuracy scores in a data frame. A ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds. AUC stands for "Area Under the ROC Curve" and provides an aggregate measure of performance across all possible classification thresholds.***
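
One way to read the AUC is as the probability that a randomly chosen positive example gets a higher score than a randomly chosen negative one. The sketch below checks this on four hypothetical predictions (the labels and scores are made up for illustration) by comparing `roc_auc_score` with a manual count over all positive/negative pairs.

```python
from sklearn.metrics import roc_auc_score

# Hypothetical labels and predicted probabilities
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc_score(y_true, y_score)

# Manual check: fraction of (positive, negative) pairs ranked correctly
pos = [s for s, t in zip(y_score, y_true) if t == 1]
neg = [s for s, t in zip(y_score, y_true) if t == 0]
pairs = [(p, n) for p in pos for n in neg]
manual_auc = sum(p > n for p, n in pairs) / len(pairs)

print(auc, manual_auc)
```

Three of the four pairs are ranked correctly here (0.35 is below 0.4), so both numbers come out the same.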


In [None]:
def build_models(models, model_names):
    # ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix (removed in scikit-learn 1.2)
    from sklearn.metrics import ConfusionMatrixDisplay

    # empty lists to collect the results
    roc_auc_scores = []
    accuracy_scores = []

    fig, axes = plt.subplots(nrows=3, ncols=3, figsize=(15, 10))
    axes = axes.ravel()
    # remove the axes of the grid that have no model to show
    for unused_ax in axes[len(models):]:
        fig.delaxes(unused_ax)

    # scikit-learn expects 1-D target arrays
    y_fit = np.ravel(y_train)
    y_true = np.ravel(y_test)

    # zip() pairs each model with its name and its subplot axis
    for name, clf, ax in zip(model_names, models, axes):
        clf.fit(X_train, y_fit)
        y_pred = clf.predict(X_test)
        y_pred_proba = clf.predict_proba(X_test)[:, 1]
        roc_auc = roc_auc_score(y_true, y_pred_proba)
        accuracy = accuracy_score(y_true, y_pred)

        ConfusionMatrixDisplay.from_estimator(clf, X_test, y_true, ax=ax, cmap='Blues')
        ax.set_title(name)

        print("Model: {}".format(name))
        print("Accuracy: {}".format(accuracy))
        print("Roc Auc Score: {}".format(roc_auc))
        print('\n')

        roc_auc_scores.append(roc_auc)
        accuracy_scores.append(accuracy)

    plt.tight_layout()
    plt.show()
    # Put the roc_auc_scores and accuracy scores in a data frame.
    models_scores_df = pd.DataFrame({'Model': model_names,
                                     'ROC AUC Score': roc_auc_scores,
                                     'Accuracy Score': accuracy_scores})
    return models_scores_df

In [None]:
models_scores_df = build_models(models, model_names)

***Above we can see each model's name along with its accuracy and ROC AUC score, together with a confusion matrix for each model.***
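
For reference, here is a small sketch of how a confusion matrix is laid out in scikit-learn, using five hypothetical labels and predictions. Rows are the true classes and columns are the predicted classes, so for a binary problem the matrix reads `[[TN, FP], [FN, TP]]`.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions
y_true = [0, 0, 1, 1, 1]
y_pred = [0, 1, 1, 1, 0]

cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of the diagonal (correct predictions)
acc_from_cm = (tn + tp) / cm.sum()

print(cm)
print(acc_from_cm, accuracy_score(y_true, y_pred))
```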

**Accuracy measures how many observations, both positive and negative, were correctly classified. We shouldn't rely on accuracy for imbalanced problems, because it is easy to get a high accuracy score by simply classifying all observations as the majority class.**


**ROC AUC is a better fit when we care equally about the positive and negative classes, that is, when true negatives matter as much as true positives. It scores how well the model ranks positives above negatives, rather than how many labels it gets right at a single threshold.**
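
A quick sketch of the majority-class problem, on hypothetical labels with a 95/5 class split (far more imbalanced than the Titanic data, purely for illustration). A "model" that always predicts the majority class looks good on accuracy, while its ROC AUC shows that it cannot separate the classes at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 95 negatives, 5 positives
y_true = np.array([0] * 95 + [1] * 5)

# A trivial "model" that always predicts the majority class
y_pred = np.zeros(100, dtype=int)
y_score = np.zeros(100)  # same constant score for every observation

acc = accuracy_score(y_true, y_pred)
auc = roc_auc_score(y_true, y_score)

print('Accuracy:', acc)
print('ROC AUC:', auc)
```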

***Now we will create a comparison between all models:***

In [None]:
models_scores_df

**From the table above, we will choose SVM (Support Vector Machine) as our final model, since it has the highest accuracy and a very good ROC AUC score.**

In [None]:
# final model
svc = SVC(random_state=2022,verbose=0)
svc.fit(X_transformed, y_transformed)

# **Prediction on Test Data**

**The fitted model will now predict a survival label for each passenger in the unseen test data.**

In [None]:
# passenger ids
test_ids = test_df['PassengerId']

In [None]:
# feature engineering
test_df['SibSP_Parch'] = np.where(test_df['SibSp'] + test_df['Parch'] > 0, 1, 0)

In [None]:
# convert datatype to object
test_columns_to_convert = ['Pclass','SibSP_Parch']
test_df[test_columns_to_convert] = test_df[test_columns_to_convert].astype(str)

In [None]:
test_data = test_df[X.columns]
test_data.head()

In [None]:
# categorical and numeric features
test_cat_features = test_data.select_dtypes(exclude="number").columns
test_num_cols = test_data.select_dtypes(include="number").columns
print('Test Data Categorical Features are: ', test_cat_features)
print('Test Data Numerical Features are: ', test_num_cols)

**Next, we will reuse the ColumnTransformer that was fitted on the training data, so that the test set is imputed, scaled and encoded with the training statistics rather than its own.**

***Apply preprocessing on test data***

In [None]:
# Apply the already-fitted preprocessing (transform only, no refit on the test data).
# astype(float) is needed because Pclass and SibSP_Parch were cast to str above.
X_transformed_test = col_transformer.transform(test_data).astype(float)
print('Test Data Shape: ', X_transformed_test.shape)
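
As a side note on preprocessing new data, the sketch below contrasts reusing a scaler fitted on the training data with fitting a fresh one on the test data. The arrays are hypothetical. The two approaches put the same raw value at different positions, and the model was trained on values scaled with the training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature train and test sets
train = np.array([[10.0], [20.0], [30.0]])
test = np.array([[20.0], [40.0]])

# Fit on train, then only transform the test data (uses the train mean of 20)
scaler = StandardScaler().fit(train)
test_reused = scaler.transform(test)

# Fitting a fresh scaler on the test data uses the test mean of 30 instead
test_refit = StandardScaler().fit_transform(test)

print(test_reused.ravel())
print(test_refit.ravel())
```

The raw value 20 maps to 0 with the reused scaler but to -1 with the refitted one, so the model would see it as a different input.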

In [None]:
X_transformed_test = pd.DataFrame(X_transformed_test, columns = transformed_columns)
X_transformed_test.head()

In [None]:
final_pred = svc.predict(X_transformed_test)

In [None]:
submission = pd.DataFrame({
    'PassengerId': test_ids,
    'Survived': final_pred
})

**Submission data shape:**

In [None]:
submission.shape

In [None]:
submission

In [None]:
submission.Survived.value_counts()

In [None]:
submission.to_csv('submission.csv', index= False)