<a href="https://colab.research.google.com/github/proffranciscofernando/titanic_survival_prediction/blob/main/main.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Titanic Survival Prediction

This notebook demonstrates a machine learning pipeline built with Scikit-Learn to predict passenger survival on the Titanic. It encompasses data preprocessing, model training, hyperparameter optimisation using Grid Search with cross-validation, and an interactive interface for making predictions based on user input.

## 1. Importing Necessary Libraries

First, we import the required libraries and suppress all warnings to keep the output clean.

In [1]:
# Importing Necessary Libraries and Suppressing Warnings

import numpy as np
import pandas as pd

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

import joblib  # For saving and loading the model

# Suppressing warnings
import warnings
warnings.filterwarnings('ignore')
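
Note that `warnings.filterwarnings('ignore')` hides every warning, including ones that may point to real problems. As a minimal sketch of a narrower alternative, the snippet below ignores only one category. Choosing `FutureWarning` as that category is an assumption made for illustration, and the two warnings raised here are made up.

```python
import warnings

# Record warnings inside this block so we can inspect which ones get through
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter('always')
    # Ignore only FutureWarning; every other category is still reported
    warnings.filterwarnings('ignore', category=FutureWarning)

    warnings.warn('this API will change', FutureWarning)  # suppressed
    warnings.warn('check your input data', UserWarning)   # still reported

caught_categories = [w.category for w in caught]
print(caught_categories)
```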

## 2. Loading and Exploring the Dataset

We will use the **Titanic** dataset to predict the survival of passengers.

### 2.1 Loading the Dataset

In [2]:
# Loading the dataset
url = 'https://raw.githubusercontent.com/proffranciscofernando/titanic_survival_prediction/refs/heads/main/titanic.csv'
data = pd.read_csv(url)

# Viewing the first few rows
data.head()

|   | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


### 2.2 Exploratory Data Analysis

Let's check the basic information of the dataset to understand its structure.

In [3]:
# Basic information about the dataset
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


### 2.3 Selecting Features and Target Variable

To keep the example simple, we use four features (`Pclass`, `Sex`, `Age`, and `Fare`) and `Survived` as the target variable.

In [4]:
# Selecting features and target variable
X = data[['Pclass', 'Sex', 'Age', 'Fare']]
y = data['Survived']

### 2.4 Handling Missing Values

We will check for any missing values in the selected data.

In [5]:
# Checking for missing values
X.isnull().sum()

| Column | Missing values |
|---|---|
| Pclass | 0 |
| Sex | 0 |
| Age | 177 |
| Fare | 0 |


As we can see, the **Age** column has missing values. We will fill these values with the mean age.

In [6]:
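# Filling missing values in the 'Age' column with the mean.
# assign() returns a new DataFrame; a chained fillna(..., inplace=True) on a
# column selection may warn or fail to update X in recent pandas versions.
X = X.assign(Age=X['Age'].fillna(X['Age'].mean()))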
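
The mean above is computed over the whole dataset, so it also uses rows that later end up in the test set. One possible alternative, sketched below, is to make imputation a pipeline step so that the mean is learned from the training rows only. The `SimpleImputer` step and the toy frames are assumptions for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy training and test frames (made-up values, not the Titanic data)
train = pd.DataFrame({'Age': [20.0, np.nan, 40.0], 'Fare': [10.0, 20.0, 30.0]})
test = pd.DataFrame({'Age': [np.nan], 'Fare': [100.0]})

# Imputation is a pipeline step, so its statistics come from fit() data only
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
    ('scaler', StandardScaler())
])

numeric_transformer.fit(train)

# Mean age learned from the non-missing training values
learned_mean = numeric_transformer.named_steps['imputer'].statistics_[0]
print(learned_mean)

# The missing test age is filled with the training mean, then scaled
test_scaled = numeric_transformer.transform(test)
print(test_scaled)
```

With this arrangement the manual `fillna` step would no longer be needed, because the pipeline handles missing ages both during cross-validation and at prediction time.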

## 3. Data Preparation

### 3.1 Splitting Data into Training and Testing Sets

We split the dataset into training and testing sets to evaluate the performance of the models.

In [7]:
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
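
The split above is random but not stratified, so the share of survivors can differ slightly between the training and test sets. As a hedged sketch, passing `stratify` to `train_test_split` keeps the class proportions equal in both parts. The toy arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 samples, 40% positive class (made-up values)
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 60 + [1] * 40)

# stratify=y_toy preserves the 60/40 class ratio in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(y_tr.mean(), y_te.mean())
```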

## 4. Building the Pipeline

### 4.1 Creating the Preprocessing Pipeline

We define transformations for numerical and categorical columns.

In [8]:
# Defining numerical and categorical columns
numeric_features = ['Age', 'Fare']
numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_features = ['Pclass', 'Sex']
categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Combining the transformations
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])

## 5. Hyperparameter Optimisation with Grid Search

### 5.1 Defining Models and Parameters

We will define a list of classification models and the hyperparameters we wish to optimise for each.

In [9]:
# Defining models and their hyperparameters
models_params = {
    'Logistic Regression': {
        'model': LogisticRegression(max_iter=1000),
        'params': {
            'classifier__C': [0.1, 1, 10],
            'classifier__penalty': ['l2']
        }
    },
    'Random Forest': {
        'model': RandomForestClassifier(),
        'params': {
            'classifier__n_estimators': [50, 100, 200],
            'classifier__max_depth': [None, 5, 10]
        }
    },
    'SVM': {
        'model': SVC(probability=True),
        'params': {
            'classifier__C': [0.1, 1, 10],
            'classifier__kernel': ['linear', 'rbf']
        }
    },
    'KNN': {
        'model': KNeighborsClassifier(),
        'params': {
            'classifier__n_neighbors': [3, 5, 7],
            'classifier__weights': ['uniform', 'distance']
        }
    },
    'Decision Tree': {
        'model': DecisionTreeClassifier(),
        'params': {
            'classifier__max_depth': [None, 5, 10],
            'classifier__criterion': ['gini', 'entropy']
        }
    }
}
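
Before running the search, it can help to estimate how much work these grids imply. The sketch below counts the candidate combinations per model with `ParameterGrid`. The grids are copied from the cell above, and the fold count of 5 is an assumption taken from the `cv=5` setting used in the next section.

```python
from sklearn.model_selection import ParameterGrid

# Same grids as in models_params, without the estimator objects
grids = {
    'Logistic Regression': {'classifier__C': [0.1, 1, 10],
                            'classifier__penalty': ['l2']},
    'Random Forest': {'classifier__n_estimators': [50, 100, 200],
                      'classifier__max_depth': [None, 5, 10]},
    'SVM': {'classifier__C': [0.1, 1, 10],
            'classifier__kernel': ['linear', 'rbf']},
    'KNN': {'classifier__n_neighbors': [3, 5, 7],
            'classifier__weights': ['uniform', 'distance']},
    'Decision Tree': {'classifier__max_depth': [None, 5, 10],
                      'classifier__criterion': ['gini', 'entropy']},
}

# Number of hyperparameter combinations per model
candidate_counts = {name: len(ParameterGrid(grid)) for name, grid in grids.items()}

# Each combination is fitted once per fold (the final refit is not counted)
n_folds = 5
total_fits = sum(candidate_counts.values()) * n_folds

print(candidate_counts)
print(total_fits)
```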

### 5.2 Running Grid Search with Cross-Validation

Now, we will perform Grid Search with cross-validation for each model.

In [10]:
# Running Grid Search with cross-validation for each model
results = []

for name, mp in models_params.items():
    model = mp['model']
    params = mp['params']

    # Creating the pipeline with the current model
    pipeline = Pipeline(steps=[
        ('preprocessor', preprocessor),
        ('classifier', model)
    ])

    # Grid Search with cross-validation
    grid_search = GridSearchCV(
        estimator=pipeline,
        param_grid=params,
        cv=5,
        scoring='f1',
        n_jobs=-1
    )

    # Training the model
    grid_search.fit(X_train, y_train)

    # Best model found
    best_model = grid_search.best_estimator_

    # Making predictions on the test set
    y_pred = best_model.predict(X_test)

    # Calculating metrics
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    # Storing the results
    results.append({
        'Model': name,
        'Best Hyperparameters': grid_search.best_params_,
        'Accuracy': acc,
        'Precision': prec,
        'Recall': rec,
        'F1-Score': f1,
        'Pipeline': best_model  # Storing the best model
    })
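
Two details of the loop above are easy to miss: the `classifier__` prefix routes each hyperparameter to the pipeline step named `classifier`, and `best_score_` is the mean cross-validated score on the training folds, which is separate from the test-set metrics computed afterwards. The sketch below illustrates both on synthetic data from `make_classification`. The data and the reduced pipeline are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic binary classification data (not the Titanic data)
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)

toy_pipeline = Pipeline(steps=[
    ('scaler', StandardScaler()),
    ('classifier', LogisticRegression(max_iter=1000)),
])

# '<step name>__<parameter>' sets C on the step called 'classifier'
toy_search = GridSearchCV(
    estimator=toy_pipeline,
    param_grid={'classifier__C': [0.1, 1, 10]},
    cv=5,
    scoring='f1',
)
toy_search.fit(X_toy, y_toy)

# Mean cross-validated F1 of the best candidate
print(toy_search.best_params_)
print(round(toy_search.best_score_, 3))

# The refitted pipeline carries the selected value of C
chosen_C = toy_search.best_estimator_.named_steps['classifier'].C
print(chosen_C)
```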

### 5.3 Comparing Optimised Models

Let's display the test-set metrics of each optimised model in a DataFrame, sorted by F1-Score.

In [11]:
# Comparing the optimised models
df_results = pd.DataFrame(results)
df_results = df_results.sort_values(by='F1-Score', ascending=False)
df_results.reset_index(drop=True, inplace=True)
df_results[['Model', 'Accuracy', 'Precision', 'Recall', 'F1-Score', 'Best Hyperparameters']]

|   | Model | Accuracy | Precision | Recall | F1-Score | Best Hyperparameters |
|---|---|---|---|---|---|---|
| 0 | Random Forest | 0.810056 | 0.822581 | 0.689189 | 0.75 | {'classifier__max_depth': 10, 'classifier__n_e... |
| 1 | Logistic Regression | 0.798883 | 0.779412 | 0.716216 | 0.746479 | {'classifier__C': 0.1, 'classifier__penalty': ... |
| 2 | SVM | 0.810056 | 0.833333 | 0.675676 | 0.746269 | {'classifier__C': 10, 'classifier__kernel': 'r... |
| 3 | KNN | 0.782123 | 0.777778 | 0.662162 | 0.715328 | {'classifier__n_neighbors': 3, 'classifier__we... |
| 4 | Decision Tree | 0.77095 | 0.836735 | 0.554054 | 0.666667 | {'classifier__criterion': 'entropy', 'classifi... |
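
As a quick sanity check on the table, the F1-Score is the harmonic mean of Precision and Recall. The sketch below recomputes it from the Random Forest row, using the precision and recall values reported above.

```python
# Precision and recall reported for the Random Forest row
precision = 0.822581
recall = 0.689189

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)

print(round(f1, 4))  # 0.75
```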


## 6. Selecting and Saving the Best Model

### 6.1 Selecting the Best Model

In [12]:
# Selecting the best model based on F1-Score
best_model_name = df_results.loc[0, 'Model']
best_pipeline = df_results.loc[0, 'Pipeline']

print(f"The best model was: {best_model_name}")
print(f"With hyperparameters: {df_results.loc[0, 'Best Hyperparameters']}")

The best model was: Random Forest
With hyperparameters: {'classifier__max_depth': 10, 'classifier__n_estimators': 200}


### 6.2 Saving the Model

In [13]:
# Saving the model using joblib
joblib.dump(best_pipeline, 'best_model.pkl')

['best_model.pkl']

## 7. Loading the Model and Making Predictions

### 7.1 Making Predictions with New User Input

In this section, we will create an interactive interface where the user can enter a new passenger's details and the model will make a prediction from them.

In [14]:
# Loading the saved model
loaded_model = joblib.load('best_model.pkl')
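
A simple way to confirm that a model survives the save and load round trip is to compare predictions before and after. The sketch below does this with a small random forest on synthetic data. The toy model, the temporary directory, and the file name `toy_model.pkl` are assumptions for illustration.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Small synthetic problem and model (not the notebook's best_pipeline)
X_toy, y_toy = make_classification(n_samples=100, n_features=4, random_state=0)
toy_model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X_toy, y_toy)

# Save to a temporary file and load it back
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy_model.pkl')
    joblib.dump(toy_model, path)
    restored = joblib.load(path)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(toy_model.predict(X_toy), restored.predict(X_toy))
print(same_predictions)
```

Files written by joblib are pickle-based, so they should only be loaded from trusted sources and ideally with the same scikit-learn version that produced them.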

#### Function to Get User Input Data

In [15]:
# Function to get user input data
def get_user_data():
    print("Please enter the passenger's details:")

    # Prompting user for input
    pclass = input("Class (1, 2 or 3): ")
    while pclass not in ['1', '2', '3']:
        print("Invalid value. Please enter 1, 2, or 3.")
        pclass = input("Class (1, 2 or 3): ")
    pclass = int(pclass)

    sex = input("Sex (male or female): ").lower()
    while sex not in ['male', 'female']:
        print("Invalid value. Please enter 'male' or 'female'.")
        sex = input("Sex (male or female): ").lower()

    age = input("Age: ")
    while True:
        try:
            age = float(age)
            if not 0 <= age < float('inf'):  # also rejects 'nan' and 'inf'
                raise ValueError
            break
        except ValueError:
            print("Invalid value. Please enter a non-negative number.")
            age = input("Age: ")

    fare = input("Fare Paid (between £0 and £513): ")
    while True:
        try:
            fare = float(fare)
            if not 0 <= fare <= 513:
                raise ValueError
            break
        except ValueError:
            print("Invalid value. Please enter a number between £0 and £513.")
            fare = input("Fare Paid (between £0 and £513): ")

    # Creating a DataFrame with the entered data
    new_passenger = pd.DataFrame({
        'Pclass': [pclass],
        'Sex': [sex],
        'Age': [age],
        'Fare': [fare]
    })

    return new_passenger
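
Validation written inside `input()` loops is hard to test automatically. One possible refactoring, sketched below, moves the check into a pure function that returns `None` for invalid text. The name `parse_fare` and its `low` and `high` parameters are hypothetical and do not appear in the original notebook.

```python
def parse_fare(text, low=0.0, high=513.0):
    """Return the fare as a float, or None if text is not a number in [low, high]."""
    try:
        value = float(text)
    except ValueError:
        return None
    # Comparisons with NaN are False, so 'nan' is rejected here as well
    return value if low <= value <= high else None


print(parse_fare('90'))
print(parse_fare('abc'))
print(parse_fare('600'))
```

The interactive loop could then keep prompting until the helper returns a value other than `None`.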

#### Interactive Loop for Predictions

In [16]:
# Interactive loop for making predictions
while True:
    # Get user input data
    new_passenger = get_user_data()

    # Make prediction
    prediction = loaded_model.predict(new_passenger)
    probability = loaded_model.predict_proba(new_passenger)

    result = "Survived" if prediction[0] == 1 else "Did Not Survive"
    prob_survived = probability[0][1] * 100

    print(f"\nPrediction Result: {result}")
    print(f"Probability of Survival: {prob_survived:.2f}%\n")

    # Ask if the user wants to input another passenger
    continue_input = input("Would you like to enter another passenger? (y/n): ").lower()
    if continue_input != 'y':
        print("Ending predictions.")
        break

Please enter the passenger's details:
Class (1, 2 or 3): 1
Sex (male or female): male
Age: 22
Fare Paid (between £0 and £513): 90

Prediction Result: Did Not Survive
Probability of Survival: 26.40%

Would you like to enter another passenger? (y/n): n
Ending predictions.
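
The loop above reads `probability[0][1]` as the survival probability, which relies on class `1` being the second entry of the model's `classes_`. The sketch below looks the column up explicitly and scores several passengers in one call. The toy training data and the logistic regression classifier are made up for illustration and stand in for the saved pipeline.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Made-up training rows with the notebook's four feature columns
train = pd.DataFrame({
    'Pclass': [1, 1, 2, 3, 3, 3],
    'Sex': ['female', 'male', 'female', 'male', 'male', 'female'],
    'Age': [29.0, 40.0, 24.0, 22.0, 35.0, 18.0],
    'Fare': [80.0, 60.0, 20.0, 7.25, 8.05, 7.9],
})
labels = [1, 0, 1, 0, 0, 1]

toy_model = Pipeline(steps=[
    ('preprocessor', ColumnTransformer(transformers=[
        ('num', StandardScaler(), ['Age', 'Fare']),
        ('cat', OneHotEncoder(handle_unknown='ignore'), ['Pclass', 'Sex']),
    ])),
    ('classifier', LogisticRegression(max_iter=1000)),
]).fit(train, labels)

# Several new passengers scored in a single call
batch = pd.DataFrame({
    'Pclass': [1, 3],
    'Sex': ['male', 'female'],
    'Age': [22.0, 30.0],
    'Fare': [90.0, 8.0],
})

# predict_proba columns follow the order of classes_
survived_col = list(toy_model.classes_).index(1)
probs = toy_model.predict_proba(batch)[:, survived_col]
print(probs)
```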


## 8. Conclusion

In this notebook, we built a machine learning pipeline and compared five classification models. We used Grid Search with 5-fold cross-validation to optimise the hyperparameters of each model, then evaluated each tuned model on the held-out test set using Accuracy, Precision, Recall, and F1-Score. We selected the model with the highest F1-Score (Random Forest in this run), saved it with joblib, and implemented an interactive interface that lets users enter a passenger's details and obtain a prediction.

Comparing tuned models on the same held-out data gives a consistent basis for choosing a model and its hyperparameters, and saving the fitted pipeline keeps the preprocessing and the classifier together for later reuse.

## 9. References

- [Scikit-Learn Documentation on Pipelines](https://scikit-learn.org/stable/modules/compose.html#pipeline)
- [Scikit-Learn Documentation on GridSearchCV](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html)
- [Saving Models with Joblib](https://scikit-learn.org/stable/modules/model_persistence.html)
- [Titanic Dataset on Kaggle](https://www.kaggle.com/c/titanic/data)
- [Cross-Validation in Scikit-Learn](https://scikit-learn.org/stable/modules/cross_validation.html)
- [Classification Metrics](https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics)