<div class="alert alert-block alert-success">
    <h1 align="center">Titanic Dataset</h1>
    <h3 align="center">Building Interactive Web Applications for the Titanic Dataset</h3>
    <h4 align="center"><a href="https://www.linkedin.com/in/iman-mansouri-76647a45/">Iman Mansouri</a></h4>
</div>

In [1]:
# Install required libraries
!pip install scikit-learn xgboost

Collecting xgboost
  Downloading xgboost-1.7.5-py3-none-win_amd64.whl (70.9 MB)
     ---------------------------------------- 70.9/70.9 MB 4.3 MB/s eta 0:00:00
Installing collected packages: xgboost
Successfully installed xgboost-1.7.5


In [2]:
# Import necessary libraries and modules

# Pandas is used for data manipulation and analysis
import pandas as pd

# The 're' module provides support for regular expressions (regex)
import re

# NumPy is used for numerical operations on arrays and matrices
import numpy as np

# sklearn.preprocessing provides functions for preprocessing input data
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# sklearn.impute provides functions for handling missing values
from sklearn.impute import SimpleImputer

# sklearn.pipeline provides a pipeline for building and executing a sequence of data processing steps
from sklearn.pipeline import Pipeline

# sklearn.compose provides tools for combining transformers and estimators
from sklearn.compose import ColumnTransformer, make_column_transformer

# sklearn.base provides base classes for creating custom estimators and transformers
from sklearn.base import BaseEstimator, TransformerMixin

# XGBoost is a gradient boosting library for classification and regression tasks
from xgboost import XGBClassifier

# sklearn.model_selection provides functions for splitting datasets and performing cross-validation
from sklearn.model_selection import train_test_split

# sklearn.metrics provides functions for model evaluation
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# The warnings module is used to suppress any warning messages
import warnings
warnings.filterwarnings('ignore')

In [3]:
# Load the dataset
df = pd.read_csv(r'C:\Users\im_pc\Documents\TitanicDashboard\train.csv')

# Separate the features (X) and the target variable (y)
X = df.drop('Survived', axis=1)
y = df['Survived']

In [4]:
# Display the first few rows of the DataFrame
df.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [5]:
# Summary statistics of the DataFrame
# .round(2): Round the values to two decimal places
df.describe().round(2)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.38,2.31,29.7,0.52,0.38,32.2
std,257.35,0.49,0.84,14.53,1.1,0.81,49.69
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.12,0.0,0.0,7.91
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.45
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.33
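
The `count` row above is lower for `Age` than for the other columns, which means some ages are missing. A minimal sketch of that arithmetic, using only the counts printed in the summary (the variable names are illustrative):

```python
n_rows = 891   # count reported for PassengerId
n_age = 714    # count reported for Age (non-missing values only)

n_missing_age = n_rows - n_age
share_missing = n_missing_age / n_rows

print(f"{n_missing_age} missing ages ({share_missing:.1%})")
# 177 missing ages (19.9%)
```

Roughly one passenger in five has no recorded age, which is why the preprocessing step later imputes this column.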


In [6]:
# Inheriting from BaseEstimator gives the transformer the get_params and set_params
# methods that scikit-learn expects of every estimator.
# Inheriting from TransformerMixin adds the fit_transform method.

# Define a custom transformer class for data preprocessing
class PrepProcesor(BaseEstimator, TransformerMixin): 
    def fit(self, X, y=None): 
        # Initialize and fit the age imputer
        self.ageImputer = SimpleImputer()
        self.ageImputer.fit(X[['Age']])        
        return self 
        
    def transform(self, X, y=None):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()

        # Impute missing values in 'Age' column (SimpleImputer defaults to the column mean)
        X['Age'] = self.ageImputer.transform(X[['Age']])

        # Preprocess 'Cabin' column: missing cabins become 'M', then letters go to
        # CabinClass and digits to CabinNumber (0 when no digits are present)
        cabin = X['Cabin'].fillna('M').apply(lambda x: str(x).replace(" ", ""))
        X['CabinClass'] = cabin.apply(lambda x: re.sub(r'[^a-zA-Z]', '', x))
        X['CabinNumber'] = cabin.apply(lambda x: re.sub(r'[^0-9]', '', x)).replace('', 0).astype(float)
        
        # Fill missing values in 'Embarked' column
        X['Embarked'] = X['Embarked'].fillna('M')
        
        # Drop unnecessary columns
        X = X.drop(['PassengerId', 'Name', 'Ticket','Cabin'], axis=1)
        
        return X
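
To make the `Cabin` handling concrete, here is a small plain-Python sketch that applies the same regular expressions to single values. The helper `split_cabin` is hypothetical (it is not part of the notebook), and `None` stands in for a missing value that pandas would represent as `NaN`.

```python
import re

def split_cabin(cabin):
    """Mirror the Cabin handling in PrepProcesor.transform for one value."""
    # Missing cabins are replaced by the placeholder 'M', as in the notebook
    raw = "M" if cabin is None else str(cabin).replace(" ", "")
    cabin_class = re.sub(r"[^a-zA-Z]", "", raw)    # keep letters only
    digits = re.sub(r"[^0-9]", "", raw)            # keep digits only
    cabin_number = digits if digits != "" else 0   # empty string becomes 0
    return cabin_class, cabin_number

for value in ["C85", "C22 C26", None]:
    print(repr(value), "->", split_cabin(value))
# 'C85' -> ('C', '85')
# 'C22 C26' -> ('CC', '2226')
# None -> ('M', 0)
```

Note that a passenger with several cabins, such as `C22 C26`, ends up with the letters and digits concatenated (`CC` and `2226`). That is how the transformer behaves as written; splitting on the first cabin only would be one possible refinement.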

In [7]:
# Instantiate a PrepProcesor object
preproc = PrepProcesor()  

# Define a numeric pipeline with a StandardScaler
numeric_pipeline = Pipeline([
    ('Scaler', StandardScaler())
])

# Define a categorical pipeline with OneHotEncoder
categorical_pipeline = Pipeline([
    ('OneHot', OneHotEncoder(handle_unknown='ignore'))
])

# Create a transformer using ColumnTransformer
# The transformer applies the numeric pipeline to the columns ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'CabinNumber']
# and the categorical pipeline to the columns ['Sex', 'Embarked', 'CabinClass']
transformer = ColumnTransformer([
    ('num', numeric_pipeline, ['Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'CabinNumber']),
    ('cat', categorical_pipeline, ['Sex', 'Embarked', 'CabinClass'])
])
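
The two pipelines above do different things to their columns. As a rough illustration, not the scikit-learn implementation, the sketch below reproduces the idea in plain Python: standard scaling centres a column and divides by its population standard deviation, and one-hot encoding with `handle_unknown='ignore'` yields an all-zero row for a category that was not seen during fitting. The helper names, the port list, and the value `'X'` are assumptions made for the example.

```python
from statistics import fmean, pstdev

def standard_scale(values):
    """Centre to mean 0 and scale to unit variance (population standard deviation)."""
    mu, sigma = fmean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(values, categories):
    """One column per known category; an unseen category gives an all-zero row."""
    return [[1 if v == c else 0 for c in categories] for v in values]

fares = [7.25, 71.2833, 7.925, 53.1, 8.05]   # Fare values from the first five rows
scaled = standard_scale(fares)

known_ports = ["C", "Q", "S"]                    # categories assumed to be seen during fit
encoded = one_hot(["S", "C", "X"], known_ports)  # 'X' is a hypothetical unseen value

print([round(s, 2) for s in scaled])
print(encoded)   # [[0, 0, 1], [1, 0, 0], [0, 0, 0]]
```

Ignoring unknown categories matters here because a category that appears only in new data would otherwise raise an error at prediction time.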

In [8]:
# Split the data into training and testing sets

# The test set size is 20% of the entire dataset
# The random_state is set to 1234 to ensure reproducibility of the split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20, random_state=1234)
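
As a quick sanity check on the split sizes, the arithmetic can be sketched as follows. This assumes, as I understand scikit-learn's behaviour, that a fractional `test_size` is rounded up to a whole number of rows and the remainder goes to the training set.

```python
import math

n_samples = 891    # rows in train.csv, per df.describe()
test_size = 0.20

n_test = math.ceil(test_size * n_samples)   # the test share is rounded up
n_train = n_samples - n_test

print(n_train, n_test)   # 712 179
```

The test size of 179 matches the shape of the prediction array shown later in the notebook.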

In [9]:
# Display the training features
X_train

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
125,126,3,"Nicola-Yarred, Master. Elias",male,12.00,1,0,2651,11.2417,,C
305,306,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.5500,C22 C26,S
631,632,3,"Lundahl, Mr. Johan Svensson",male,51.00,0,0,347743,7.0542,,S
643,644,3,"Foo, Mr. Choong",male,,0,0,1601,56.4958,,S
808,809,2,"Meyer, Mr. August",male,39.00,0,0,248723,13.0000,,S
...,...,...,...,...,...,...,...,...,...,...,...
204,205,3,"Cohen, Mr. Gurshon ""Gus""",male,18.00,0,0,A/5 3540,8.0500,,S
53,54,2,"Faunthorpe, Mrs. Lizzie (Elizabeth Anne Wilkin...",female,29.00,1,0,2926,26.0000,,S
294,295,3,"Mineff, Mr. Ivan",male,24.00,0,0,349233,7.8958,,S
723,724,2,"Hodges, Mr. Henry Price",male,50.00,0,0,250643,13.0000,,S


In [10]:
# Create a machine learning pipeline
# The pipeline consists of three steps:
#   1. InitialPreproc: PrepProcesor() is used to perform initial preprocessing
#   2. Transformer: The previously defined transformer is applied to further transform the data
#   3. xgb: XGBClassifier() is used as the final estimator in the pipeline
mlpipe = Pipeline([
    ('InitialPreproc', PrepProcesor()),
    ('Transformer', transformer),
    ('xgb', XGBClassifier())
])

In [11]:
# Fit the machine learning pipeline to the training data

mlpipe.fit(X_train, y_train)

In [12]:
# Use the trained machine learning pipeline to predict labels for the test data

y_hat = mlpipe.predict(X_test)

In [13]:
# Retrieve the shape of the predicted labels
# y_hat.shape: Shape of the predicted labels array
y_hat.shape

(179,)

In [14]:
# Display the true labels for the test data
y_test

523    1
778    0
760    0
496    1
583    0
      ..
100    0
773    0
222    0
495    0
99     0
Name: Survived, Length: 179, dtype: int64

In [15]:
# Calculate the precision score between the true labels (y_test) and predicted labels (y_hat)
precision_score(y_test, y_hat)

0.7714285714285715
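
Precision is the share of predicted survivors who actually survived, TP / (TP + FP). It says nothing about survivors the model missed, which is what recall, TP / (TP + FN), measures. The sketch below computes both by hand on a small made-up label set; the demo labels and the helper `confusion_counts` are hypothetical and are not the notebook's actual predictions.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = survived)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical labels, for illustration only
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 1, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
precision = tp / (tp + fp)
recall = tp / (tp + fn)

print(f"precision={precision:.2f}, recall={recall:.2f}")
# precision=0.67, recall=0.50
```

On the real data, `recall_score(y_test, y_hat)` and `confusion_matrix(y_test, y_hat)`, both already imported above, would give the corresponding figures.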

In [16]:
# Import the joblib module for saving and loading models
import joblib

In [17]:
# Save the trained machine learning pipeline to a joblib file
# mlpipe: Trained machine learning pipeline
# 'xgbpipe.joblib': File path where the pipeline will be saved
joblib.dump(mlpipe, 'xgbpipe.joblib')

['xgbpipe.joblib']

In [18]:
# Load the trained machine learning pipeline from a joblib file
# 'xgbpipe.joblib': File path from where the pipeline will be loaded
# model: Loaded machine learning pipeline
model = joblib.load('xgbpipe.joblib')

In [19]:

# Read the test data from a CSV file into a pandas DataFrame
test = pd.read_csv(r'C:\Users\im_pc\Documents\TitanicDashboard\test.csv')

In [20]:
# Retrieve the column names of the test data
test.columns

Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

In [21]:
# Predict labels for test.csv, which has no 'Survived' column
# model: pipeline reloaded from 'xgbpipe.joblib' above
# yhat: Predicted labels
yhat = model.predict(test)
yhat

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0,
       1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0,
       1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1,
       1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1, 0,

In [22]:
# Export the installed package versions (uncomment to run)
# !pip freeze > requirements.txt

**Conclusion**

In this notebook, I built a preprocessing and XGBoost pipeline to predict passenger survival on the Titanic dataset. On the held-out 20% split (179 passengers), the model reached a precision of 0.77: of the passengers it predicted to survive, about 77% actually did. Precision alone says nothing about the survivors the model missed, so recall and the confusion matrix should be reported alongside it. Future work may involve comparing alternative models, accounting for the class imbalance (about 38% of passengers in the training data survived), and engineering additional features to improve predictive performance.