# 🚀 **Titanic Machine Learning Models Comparison**  

## By Kao Panboonyuen

### This Colab notebook will guide you through:

* ✅ Loading the Titanic dataset
* ✅ Exploring and analyzing the data (EDA)
* ✅ Splitting the data into training and testing sets
* ✅ Training 3 machine learning models:
     - 🏗️ **Random Forest Classifier**
     - 🔥 **Gradient Boosting Classifier**
     - 🏆 **Neural Network Classifier**
* ✅ Evaluating model performance using:
     - 📊 **Confusion Matrix**
     - 📉 **Precision, Recall, F1-Score**
* ✅ Comparing the performance of each model to find the best one

### 📌 **Step 1: Import Required Libraries**  

To get started, we'll need several libraries for data manipulation, visualization, and machine learning tasks.

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report, f1_score, precision_score, recall_score

### 📌 **Step 2: Load the Titanic Dataset**  

Now, let's load the Titanic dataset from the provided GitHub URL.

In [None]:
# Load Titanic dataset from the provided link
url = 'https://github.com/kaopanboonyuen/OCSB-AI/blob/main/dataset/titanic-dataset.csv?raw=true'
# Write your code here

# Display the first few rows of the dataset
# Write your code here
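
One way to fill in the cell above, as a minimal sketch: `pd.read_csv` accepts a URL directly, so the file does not need to be downloaded first. The helper name `load_titanic` is hypothetical, and the two inline rows are made up only so the snippet also runs without network access.

```python
import io
import pandas as pd

url = 'https://github.com/kaopanboonyuen/OCSB-AI/blob/main/dataset/titanic-dataset.csv?raw=true'

def load_titanic(source):
    """Read the Titanic CSV from a URL, a local path, or a file-like object."""
    return pd.read_csv(source)

# In the notebook (requires network access):
# data = load_titanic(url)
# data.head()

# Offline illustration with two made-up rows (NOT real dataset records).
sample_csv = io.StringIO(
    "PassengerId,Survived,Pclass,Sex,Age\n"
    "1,0,3,male,30\n"
    "2,1,1,female,25\n"
)
sample = load_titanic(sample_csv)
print(sample.head())
```

Later cells expect the loaded DataFrame to be named `data`.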

### 📌 **Step 3: Exploratory Data Analysis (EDA) 🔍**  

Before diving into the models, it's important to understand our data. We will look for missing values, visualize some basic distributions, and explore the relationships between different features.

#### 3.1 Checking for Missing Values

We should start by checking for any missing values in the dataset.

In [None]:
# Check for missing values
# Write your code here
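
A possible way to complete this check, sketched under the assumption that the DataFrame is called `data`: `isnull().sum()` counts missing entries per column. The helper `missing_summary` and the small `demo` frame are hypothetical and exist only to make the snippet self-contained.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values for columns that have any."""
    counts = df.isnull().sum()
    percent = (counts / len(df) * 100).round(2)
    summary = pd.DataFrame({'missing': counts, 'percent': percent})
    return summary[summary['missing'] > 0].sort_values('missing', ascending=False)

# In the notebook: missing_summary(data)

# Made-up illustration frame (not real dataset records).
demo = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 8.05, 53.10],
})
print(missing_summary(demo))
```

Columns that show up here are the ones that need imputing or dropping before training.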

#### 3.2 Visualizing Data Distributions 📊

Let's visualize the distribution of the "Survived" column (the target variable) and of passenger age.

In [None]:
# Visualize the target variable "Survived"
sns.countplot(x='Survived', data=data)
plt.title('Distribution of Survived')
plt.show()

# Visualize the distribution of age
sns.histplot(data['Age'].dropna(), kde=True)
plt.title('Age Distribution')
plt.show()
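
Step 3 also mentions exploring relationships between features. A simple starting point, sketched here, is the survival rate within each group of a categorical column. The helper `survival_rate_by` is hypothetical, and the `demo` frame contains made-up rows so the snippet runs without the real data.

```python
import pandas as pd

def survival_rate_by(df, column):
    """Mean of the 0/1 'Survived' column within each group of `column`."""
    return df.groupby(column)['Survived'].mean()

# In the notebook: survival_rate_by(data, 'Sex') or survival_rate_by(data, 'Pclass')

# Made-up illustration frame (not real dataset records).
demo = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male'],
    'Survived': [0, 1, 1, 1],
})
rates = survival_rate_by(demo, 'Sex')
print(rates)
```

Because `Survived` is coded as 0 or 1, the group mean is the fraction of passengers in that group who survived.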

### 📌 **Step 4: Preprocess the Data ⚙️**  

We need to handle missing values and convert categorical features into numerical ones. We will also separate the dataset into features (`X`) and the target label (`y`).

In [None]:
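# Impute missing values: mean for Age, most frequent value for Embarked.
# Assign the result back instead of using inplace=True on a column selection,
# which newer pandas versions warn about and may not apply to the DataFrame.
data['Age'] = data['Age'].fillna(data['Age'].mean())
data['Embarked'] = data['Embarked'].fillna(data['Embarked'].mode()[0])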

# Convert categorical features (Sex and Embarked) to numeric
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
data = pd.get_dummies(data, columns=['Embarked'], drop_first=True)

# Features and target variable
X = data.drop(['Survived', 'Name', 'Ticket', 'Cabin', 'PassengerId'], axis=1)
y = data['Survived']

### 📌 **Step 5: Split the Data into Training and Testing Sets (with Fixed Seed)**  

Now, let's split the data into training and testing sets, ensuring we can reproduce the results by setting a random seed.

In [None]:
# Split the data with a fixed seed for reproducibility
# Write your code here
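
A minimal sketch of the split, assuming an 80/20 ratio and seed 42 (neither value is specified in this notebook). The `stratify=y` argument is also an assumption; it keeps the class proportions the same in both sets. The helper `split_data` and the demo arrays are hypothetical.

```python
import numpy as np
from sklearn.model_selection import train_test_split

def split_data(X, y, test_size=0.2, seed=42):
    """Stratified train/test split that is reproducible for a given seed."""
    return train_test_split(
        X, y, test_size=test_size, random_state=seed, stratify=y
    )

# In the notebook: X_train, X_test, y_train, y_test = split_data(X, y)

# Made-up arrays so the snippet runs on its own.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)
X_train, X_test, y_train, y_test = split_data(X_demo, y_demo)
print(X_train.shape, X_test.shape)
```

Calling `split_data` again with the same seed returns the same rows in each set, which is what makes later model comparisons repeatable.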

### 📌 **Step 6: Train and Evaluate the Models 🏗️**  

We will now train three different machine learning models: Random Forest, Gradient Boosting, and Neural Network.

#### 6.1 Random Forest Classifier

In [None]:
# Train Random Forest Classifier
# Write your code here

# Evaluate the model
rf_pred = rf_model.predict(X_test)
print("Random Forest Classifier - Classification Report:\n", classification_report(y_test, rf_pred, digits=4))
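
The training step left open in the cell above could be sketched as follows. The hyperparameters (`n_estimators=100`, `random_state=42`) are assumptions rather than values given in this notebook, and the synthetic data from `make_classification` is only a stand-in so the snippet runs on its own.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook, use the X_train, X_test,
# y_train, y_test produced in Step 5 instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Fit the forest; a fixed random_state makes the trees reproducible.
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
rf_model.fit(X_train, y_train)

rf_pred = rf_model.predict(X_test)
print(rf_pred[:10])
```

The variable name `rf_model` matches what the evaluation lines in the cell expect.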

#### 6.2 Gradient Boosting Classifier 🔥

In [None]:
# Train Gradient Boosting Classifier
# Write your code here

# Evaluate the model
gb_pred = gb_model.predict(X_test)
print("Gradient Boosting Classifier - Classification Report:\n", classification_report(y_test, gb_pred, digits=4))
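
A comparable sketch for the Gradient Boosting training step. The settings shown (`n_estimators=100`, `learning_rate=0.1`, `random_state=42`) are assumptions chosen for illustration, and the synthetic data again stands in for the Titanic split.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; in the notebook, use the Step 5 split instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Each boosting stage fits a small tree to the errors of the previous stages.
gb_model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, random_state=42
)
gb_model.fit(X_train, y_train)

gb_pred = gb_model.predict(X_test)
print(gb_pred[:10])
```

The variable name `gb_model` matches what the evaluation lines in the cell expect.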

#### 6.3 Neural Network Classifier 🏆

In [None]:
# Train Neural Network Classifier (MLP)
# Write your code here

# Evaluate the model
nn_pred = nn_model.predict(X_test)
print("Neural Network Classifier - Classification Report:\n", classification_report(y_test, nn_pred, digits=4))
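
For the neural network, one possible sketch wraps `MLPClassifier` in a pipeline with `StandardScaler`. The scaling step is an addition not described in this notebook; it is included because multilayer perceptrons are sensitive to feature scale. The layer sizes, `max_iter`, and seed are likewise assumptions, and the data is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; in the notebook, use the Step 5 split instead.
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# The scaler is fitted on the training data only, then applied at predict time.
nn_model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=1000, random_state=42),
)
nn_model.fit(X_train, y_train)

nn_pred = nn_model.predict(X_test)
print(nn_pred[:10])
```

Because the pipeline exposes `predict`, it can be used as `nn_model` in the evaluation lines without further changes.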

### 📌 **Step 7: Confusion Matrix and Performance Comparison 📊**  

Let's create confusion matrices and compare the F1-score, Precision, and Recall for each model.

#### 7.1 Confusion Matrix

In [None]:
# Plot one confusion matrix per model (rows: actual class, columns: predicted class)
for name, pred in [("Random Forest", rf_pred),
                   ("Gradient Boosting", gb_pred),
                   ("Neural Network", nn_pred)]:
    plt.figure(figsize=(6, 6))
    sns.heatmap(confusion_matrix(y_test, pred), annot=True, fmt='d', cmap='Blues')
    plt.xlabel("Predicted label")
    plt.ylabel("Actual label")
    plt.title(f"{name} Confusion Matrix")
    plt.show()

#### 7.2 Performance Metrics Comparison

In [None]:
f1_rf = f1_score(y_test, rf_pred)
f1_gb = f1_score(y_test, gb_pred)
f1_nn = f1_score(y_test, nn_pred)

print(f"Random Forest F1-Score: {f1_rf:.4f}")
print(f"Gradient Boosting F1-Score: {f1_gb:.4f}")
print(f"Neural Network F1-Score: {f1_nn:.4f}")

# Precision and Recall Comparison
precision_rf = precision_score(y_test, rf_pred)
precision_gb = precision_score(y_test, gb_pred)
precision_nn = precision_score(y_test, nn_pred)

recall_rf = recall_score(y_test, rf_pred)
recall_gb = recall_score(y_test, gb_pred)
recall_nn = recall_score(y_test, nn_pred)

print(f"Random Forest Precision: {precision_rf:.4f}, Recall: {recall_rf:.4f}")
print(f"Gradient Boosting Precision: {precision_gb:.4f}, Recall: {recall_gb:.4f}")
print(f"Neural Network Precision: {precision_nn:.4f}, Recall: {recall_nn:.4f}")

### 📌 **Step 8: Model Comparison 🏆**  

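Finally, we will compare the three models on the test set to determine which one performed best.

- Use **F1-Score**, **Precision**, and **Recall** for the positive class (survived) to pick the winning model. F1-Score combines the other two, so it is a reasonable single number to rank by when Precision and Recall disagree.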
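
The comparison can be sketched as a small table sorted by F1-Score. The helper `compare_models` is hypothetical, and the labels and predictions below are made up to keep the snippet self-contained; they are not results from the Titanic models.

```python
import pandas as pd
from sklearn.metrics import f1_score, precision_score, recall_score

def compare_models(y_true, predictions):
    """Build a metrics table (one row per model), best F1-Score first."""
    rows = {
        name: {
            'precision': precision_score(y_true, pred),
            'recall': recall_score(y_true, pred),
            'f1': f1_score(y_true, pred),
        }
        for name, pred in predictions.items()
    }
    table = pd.DataFrame.from_dict(rows, orient='index')
    return table.sort_values('f1', ascending=False)

# In the notebook:
# summary = compare_models(y_test, {'Random Forest': rf_pred,
#                                   'Gradient Boosting': gb_pred,
#                                   'Neural Network': nn_pred})

# Made-up labels and predictions for illustration only.
y_true = [1, 0, 1, 1, 0, 0]
summary = compare_models(y_true, {
    'model_a': [1, 0, 1, 1, 0, 0],
    'model_b': [0, 0, 0, 1, 0, 0],
})
print(summary)
print('Best by F1:', summary.index[0])
```

Sorting by a single metric is a simplification; if false negatives and false positives carry different costs, Precision or Recall may be the better ranking column.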

---

### 🎉 **Conclusion**  

In this notebook, we have:
- Loaded the Titanic dataset and performed exploratory data analysis (EDA).
- Preprocessed the data by imputing missing values and encoding categorical features.
- Split the data into training and testing sets with a fixed seed.
- Built and evaluated three machine learning models: Random Forest, Gradient Boosting, and a Neural Network.
- Compared the models using confusion matrices, classification reports, and F1-Score, Precision, and Recall.