<a href="https://www.kaggle.com/code/jennisjane/titanic-survival-prediction-gbt?scriptVersionId=225465473" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

## Titanic Survival Prediction – Kaggle Competition  
**Objective:** Predict passenger survival using machine learning models.  

### Key Steps  

### Data Preprocessing & Feature Engineering  
- Loaded dataset using **NumPy, Pandas, TensorFlow**.  
- Handled **missing values** (replaced with median).  
- Tokenized **names and ticket numbers** into meaningful features.  
- Removed **irrelevant features** to improve model performance.  

### Exploratory Data Analysis (EDA)  
- Analyzed survival rates based on **class, gender, age, and fare distribution**.  
- Explored relationships between **family size (SibSp, Parch) and survival probability**.  
- Visualized data trends to guide feature selection.  

### Model Development & Training  
- Converted **Pandas DataFrame** into **TensorFlow dataset** for efficient processing.  
- Trained a **Gradient Boosting model** with hyperparameter tuning.  
- Evaluated model performance and optimized predictions.  

### Results  
- Final accuracy: **0.8261**



## Import libraries

In [1]:
import os
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import re 
import plotly.express as px
import tensorflow as tf
import tensorflow_decision_forests as tfdf
import random
from itertools import product

## Upload dataset

In [2]:
#load dataset
train_df = pd.read_csv("/kaggle/input/titanic/train.csv")
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")

# Data Dictionary for the dataset

| Variable | Definition | Key |
|---|---|---|
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

## Variable Notes
- pclass: A proxy for socio-economic status (SES): 1st = Upper, 2nd = Middle, 3rd = Lower
- age: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
- sibsp: The dataset defines family relations in this way...
  - Sibling = brother, sister, stepbrother, stepsister
  - Spouse = husband, wife (mistresses and fiancés were ignored)
- parch: The dataset defines family relations in this way...
  - Parent = mother, father
  - Child = daughter, son, stepdaughter, stepson
  - Some children travelled only with a nanny, therefore parch=0 for them.



In [3]:
train_df.head(10)



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S
5,6,0,3,"Moran, Mr. James",male,,0,0,330877,8.4583,,Q
6,7,0,1,"McCarthy, Mr. Timothy J",male,54.0,0,0,17463,51.8625,E46,S
7,8,0,3,"Palsson, Master. Gosta Leonard",male,2.0,3,1,349909,21.075,,S
8,9,1,3,"Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)",female,27.0,0,2,347742,11.1333,,S
9,10,1,2,"Nasser, Mrs. Nicholas (Adele Achem)",female,14.0,1,0,237736,30.0708,,C


In [4]:
train_df.describe()

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,891.0,714.0,891.0,891.0,891.0
mean,446.0,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,257.353842,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,1.0,0.0,1.0,0.42,0.0,0.0,0.0
25%,223.5,0.0,2.0,20.125,0.0,0.0,7.9104
50%,446.0,0.0,3.0,28.0,0.0,0.0,14.4542
75%,668.5,1.0,3.0,38.0,1.0,0.0,31.0
max,891.0,1.0,3.0,80.0,8.0,6.0,512.3292


## Cleaning data 
- Strip punctuation from "Name"
- Split "Ticket" into a prefix and a number
- Check the missing values in each feature
- Remove irrelevant features

In [5]:
def preprocessed(df):
    df = df.copy()

    def split_name(name):
        return re.sub(r'[^\w\s]','',name)
   
    def split_ticket(ticket):
        if not isinstance(ticket, str) or not ticket.strip():
            return "None", "0"  # Default values for empty tickets
        items = ticket.split()
        ticket_item = items[0] if len(items) > 1 else "None"
        ticket_number = items[-1]
        return ticket_item, ticket_number

    df["Name"] = df["Name"].apply(split_name)
    df['Ticket_Item'],df['Ticket_Number'] = zip(*df['Ticket'].apply(split_ticket))
                 
    return df

In [6]:
preprocessed_train_df = preprocessed(train_df)
preprocessed_test_df = preprocessed(test_df)

In [7]:
preprocessed_train_df.head(10)



Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Ticket_Item,Ticket_Number
0,1,0,3,Braund Mr Owen Harris,male,22.0,1,0,A/5 21171,7.25,,S,A/5,21171
1,2,1,1,Cumings Mrs John Bradley Florence Briggs Thayer,female,38.0,1,0,PC 17599,71.2833,C85,C,PC,17599
2,3,1,3,Heikkinen Miss Laina,female,26.0,0,0,STON/O2. 3101282,7.925,,S,STON/O2.,3101282
3,4,1,1,Futrelle Mrs Jacques Heath Lily May Peel,female,35.0,1,0,113803,53.1,C123,S,,113803
4,5,0,3,Allen Mr William Henry,male,35.0,0,0,373450,8.05,,S,,373450
5,6,0,3,Moran Mr James,male,,0,0,330877,8.4583,,Q,,330877
6,7,0,1,McCarthy Mr Timothy J,male,54.0,0,0,17463,51.8625,E46,S,,17463
7,8,0,3,Palsson Master Gosta Leonard,male,2.0,3,1,349909,21.075,,S,,349909
8,9,1,3,Johnson Mrs Oscar W Elisabeth Vilhelmina Berg,female,27.0,0,2,347742,11.1333,,S,,347742
9,10,1,2,Nasser Mrs Nicholas Adele Achem,female,14.0,1,0,237736,30.0708,,C,,237736


All punctuation has been removed from the names, and each ticket has been separated into two columns, `Ticket_Item` and `Ticket_Number`.
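To make the splitting rule concrete, here is a small standalone sketch that mirrors the two helpers from the cell above and runs them on a few ticket strings. The name `strip_punctuation` and the sample list are chosen for illustration only; the logic is intended to match `preprocessed`.

```python
import re

def strip_punctuation(name):
    # Drop every character that is neither a word character nor whitespace.
    return re.sub(r"[^\w\s]", "", name)

def split_ticket(ticket):
    # Empty or non-string tickets get placeholder values.
    if not isinstance(ticket, str) or not ticket.strip():
        return "None", "0"
    items = ticket.split()
    # A prefix exists only when the ticket has more than one token.
    prefix = items[0] if len(items) > 1 else "None"
    # The last token is always taken as the ticket number.
    return prefix, items[-1]

examples = ["A/5 21171", "STON/O2. 3101282", "113803", ""]
results = [split_ticket(t) for t in examples]
for ticket, (prefix, number) in zip(examples, results):
    print(f"{ticket!r:>20} -> prefix={prefix!r}, number={number!r}")

clean_name = strip_punctuation("Braund, Mr. Owen Harris")
print(clean_name)
```

One consequence of this rule worth keeping in mind: a ticket made of a single non-numeric token ends up entirely in the number column, and any middle tokens of a three-part ticket are discarded.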


## Check missing values

In [8]:
print(preprocessed_train_df.isnull().sum())

PassengerId        0
Survived           0
Pclass             0
Name               0
Sex                0
Age              177
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            687
Embarked           2
Ticket_Item        0
Ticket_Number      0
dtype: int64


'Age' has 177 NaN values, so we replace them with the median age.

In [9]:
# Assign the result back to the column; chained inplace=True is deprecated in pandas
preprocessed_train_df['Age'] = preprocessed_train_df['Age'].fillna(preprocessed_train_df['Age'].median())


In [10]:
# checking NaN value on each feature
print(preprocessed_train_df.isnull().sum())

PassengerId        0
Survived           0
Pclass             0
Name               0
Sex                0
Age                0
SibSp              0
Parch              0
Ticket             0
Fare               0
Cabin            687
Embarked           2
Ticket_Item        0
Ticket_Number      0
dtype: int64


The missing values in the 'Age' column are now resolved. 
'Cabin', however, has 687 missing values out of 891 rows, so it is dropped later. 
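As a minimal sketch of median imputation, the example below uses two made-up frames standing in for the train and test splits. The notebook fills only the training frame; reusing the training median for the test frame is an assumption about what one might want, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Toy frames standing in for the train and test splits (values are made up).
toy_train = pd.DataFrame({"Age": [22.0, np.nan, 35.0, 54.0, np.nan]})
toy_test = pd.DataFrame({"Age": [np.nan, 30.0]})

# Compute the median on the training split only; NaN values are skipped.
age_median = toy_train["Age"].median()

# Assign the filled column back instead of calling fillna(inplace=True)
# on a column selection.
toy_train["Age"] = toy_train["Age"].fillna(age_median)
toy_test["Age"] = toy_test["Age"].fillna(age_median)

print(age_median)
print(toy_train["Age"].tolist())
print(toy_test["Age"].tolist())
```

Computing the statistic once on the training data and reusing it keeps the two frames consistent and avoids using test-set information during preprocessing.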

## Exploring the relationships between features


In [11]:
#Import libraries

import seaborn as sns
import matplotlib.pyplot as plt

### 1. Survival by Class

In [12]:
# Create the histogram using Plotly
fig = px.histogram(preprocessed_train_df, 
                   x='Pclass', 
                   color='Survived',
                   barmode='group', 
                   title='Survived Count by Passenger Class',
                   labels={'Pclass': 'Passenger Class', 'count': 'Count'},
                   text_auto=True)  # Displays values on the bars

# Update layout for better visuals
fig.update_layout(
    xaxis_title='Passenger Class',
    yaxis_title='Count',
    legend_title='Survival Status',
    bargap=0.2,  # Gap between bars
)

# Show the figure
fig.show()


As we can see from the graph, 
- The 3rd class had the highest number of fatalities, while the 1st class had the highest survival rate.
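The histogram shows counts, so the statement about survival rates is easier to verify numerically. The sketch below computes counts and rates per class on a tiny made-up frame; the same `groupby` call should work on `preprocessed_train_df`, assuming the `Pclass` and `Survived` columns shown earlier.

```python
import pandas as pd

# Made-up rows for illustration; not the Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    "Survived": [1, 1, 0, 1, 0, 0, 0, 1, 0, 0],
})

# Survived is 0/1, so its sum is the survivor count and its mean is the rate.
summary = toy.groupby("Pclass")["Survived"].agg(
    passengers="count",
    survivors="sum",
    survival_rate="mean",
)
summary["fatalities"] = summary["passengers"] - summary["survivors"]
print(summary)
```

Reporting the rate next to the raw counts separates two effects that a count plot mixes together: how large a class is and how likely its passengers were to survive.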

### 2. Survival by Gender


In [13]:
fig = px.histogram(preprocessed_train_df, 
                   x='Sex', 
                   color='Survived',
                   barmode='group', 
                   title='Survived Count by Gender',
                   labels={'Sex': 'Sex', 'count': 'Count'},
                   text_auto=True)  # Displays values on the bars

# Update layout for better visuals
fig.update_layout(
    xaxis_title='Sex',
    yaxis_title='Count',
    legend_title='Survival Status',
    bargap=0.2,  # Gap between bars
)

# Show the figure
fig.show()



From the graph, 
- The female survival rate is higher than the male survival rate. 

### 3. Age Distribution of Survivors
A histogram to compare the age distributions of survivors and non-survivors:

In [14]:
fig = px.histogram(preprocessed_train_df, 
                   x='Age', 
                   color='Survived',
                   barmode='group', 
                   title='Survived Count by Age',
                   labels={'Age': 'Age', 'count': 'Count'},
                   text_auto=True)  # Displays values on the bars

# Update layout for better visuals
fig.update_layout(
    xaxis_title='Age',
    yaxis_title='Count',
    legend_title='Survival Status',
    bargap=0.2,  # Gap between bars
)

# Show the figure
fig.show()


### 4. Fare Distribution of Survivors
A boxplot to compare the fare paid by survivors and non-survivors:

In [15]:
#sns.boxplot(x='Survived', y='Fare', data=preprocessed_train_df)
#plt.title('Fare Distribution of Survivors vs. Non-Survivors')
#plt.xlabel('Survived')
#plt.ylabel('Fare')
#plt.xticks([0, 1], ['Not Survived', 'Survived'])
#plt.show()


# Create the boxplot using Plotly Express
fig = px.box(preprocessed_train_df, 
             x='Survived',  # Categorical variable (Survived)
             y='Fare',  # Numerical variable (Fare)
             color='Survived',  # Color by survival status
             title='Fare Distribution of Survivors vs. Non-Survivors',
             labels={'Survived': 'Survival Status', 'Fare': 'Fare'})  # Labels for axes

# Update layout for better visuals
fig.update_layout(
    xaxis_title='Survival Status',
    yaxis_title='Fare',
    legend_title='Survival Status',
)

# Show the figure
fig.show()



### 5. Survival by Family Size (SibSp and Parch)

In [18]:

# Group the data by 'SibSp', 'Parch', and 'Survived' to count occurrences
grouped_data = preprocessed_train_df.groupby(['SibSp', 'Parch', 'Survived']).size().reset_index(name='Count')

# Create a grouped bar chart
fig = px.bar(grouped_data, 
             x='SibSp', 
             y='Count', 
             color='Survived',  # Color by survival status
             barmode='group',   # Group the bars together
             facet_col='Parch',  # Split by 'Parch'
             title='Survival by SibSp and Parch',
             labels={'SibSp': 'Number of Siblings/Spouses', 'Count': 'Survival Count', 'Parch': 'Number of Parents/Children'},
             text='Count')  # Add count labels to each bar

fig.update_layout(
    xaxis_title='Number of Siblings/Spouses Aboard',
    yaxis_title='Count of Passengers',
    legend_title='Survival Status',
    bargap=0.2,  # Gap between bars
    width=1800, 
    height=800   
)

fig.show()



- The chart shows passenger counts, not survival rates: each panel corresponds to one Parch value, the x-axis is SibSp, and the bar color separates survivors from non-survivors.

- Passengers with no siblings/spouses and no parents/children form by far the largest group, so this group has the tallest bars for both outcomes. A tall bar reflects group size, not a high survival rate.

- Large families fared poorly: for the higher SibSp and Parch values the groups are small, and non-survivors generally outnumber survivors.

- Because the groups differ so much in size, survival should be compared as a rate (survivors divided by passengers in the group) rather than read off the raw counts.

In Summary:

The graph shows how survival counts on the Titanic varied with the number of siblings/spouses and parents/children onboard. It suggests that passengers traveling with large families had a lower chance of survival; whether solo travelers did better than small families has to be checked with per-group rates.
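Because the bar chart reports counts, a per-group rate is a more direct way to compare family sizes. The sketch below combines `SibSp` and `Parch` into a single column; the name `FamilySize` and the toy rows are assumptions made for illustration and do not appear in the notebook.

```python
import pandas as pd

# Made-up rows for illustration; not the Titanic data.
toy = pd.DataFrame({
    "SibSp":    [0, 0, 0, 0, 1, 1, 0, 4, 4, 3],
    "Parch":    [0, 0, 0, 0, 0, 1, 2, 2, 2, 1],
    "Survived": [0, 0, 1, 0, 1, 1, 1, 0, 0, 0],
})

# Hypothetical derived feature: relatives aboard plus the passenger.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# count = passengers in the group, mean = share of them who survived.
rate_by_family = toy.groupby("FamilySize")["Survived"].agg(["count", "mean"])
print(rate_by_family)
```

Keeping the `count` column next to the rate makes it easy to spot groups that are too small for their rate to be meaningful.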

In [19]:
#drop 'cabin' which has too many missing values
preprocessed_train_df.drop(columns=['Cabin'], inplace=True)

print(preprocessed_train_df.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Embarked', 'Ticket_Item', 'Ticket_Number'],
      dtype='object')


## Convert Pandas dataset into Tensorflow dataset

Before converting, the features need to be separated into categorical and numerical.
## Handle Categorical Variables
- Sex: Encode male as 0 and female as 1
- Embarked: Convert the Embarked column (e.g., S, C, Q) into numeric values using label encoding.
- Ticket_Item: Apply one-hot encoding to the Ticket_Item column.
- Ticket_Number: Depending on your analysis, you can either treat this as a numerical or categorical feature. If you treat it as categorical, one-hot encoding could be applied.
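The encodings listed above could look roughly like this in pandas. This is a sketch on a made-up frame; the `*_encoded` column names are hypothetical. The cells that follow pass the string columns to TF-DF unchanged, so this step would mainly matter for models that require numeric inputs.

```python
import pandas as pd

# Made-up rows for illustration.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "Q", "S"],
    "Ticket_Item": ["A/5", "PC", "None", "None"],
})

# Sex: explicit binary mapping (male = 0, female = 1).
toy["Sex_encoded"] = toy["Sex"].map({"male": 0, "female": 1})

# Embarked: label encoding through categorical codes. Categories are
# ordered alphabetically (C, Q, S); a missing value would get code -1.
toy["Embarked_encoded"] = toy["Embarked"].astype("category").cat.codes

# Ticket_Item: one-hot encoding, one indicator column per distinct prefix.
ticket_dummies = pd.get_dummies(toy["Ticket_Item"], prefix="Ticket_Item")
encoded = pd.concat([toy, ticket_dummies], axis=1)
print(encoded.head())
```

Label encoding imposes an arbitrary order on the ports, which tree models usually tolerate; one-hot encoding avoids that order at the cost of extra columns.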

In [20]:
# Remove columns that should not be used as input features
input_features = list(preprocessed_train_df.columns)
input_features.remove("Ticket")
input_features.remove("PassengerId")
input_features.remove("Survived")
#input_features.remove("Ticket_number")

print(f"Input features: {input_features}")


Input features: ['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Ticket_Item', 'Ticket_Number']


In [21]:
def tokenize_names(features, labels=None):
    """Divide the names into tokens. TF-DF can consume text tokens natively."""
    features["Name"] =  tf.strings.split(features["Name"])
    return features, labels

train_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_train_df,label="Survived").map(tokenize_names)
test_ds = tfdf.keras.pd_dataframe_to_tf_dataset(preprocessed_test_df).map(tokenize_names)






Now `train_ds` and `test_ds` are ready to be used for training and serving, respectively.

# Train the model with the best parameters
- Hyperparameter tuning uses random search: before the final training run, several parameter combinations are sampled and a `GradientBoostedTreesModel` is trained with each one.
- The combination with the highest self-evaluation accuracy is selected to optimize model performance.
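Independent draws with `random.choice` can repeat a combination. One possible variation, sketched here without any TF-DF calls, is to build the full grid with `itertools.product` and sample distinct combinations from it with a seeded generator. The helper name `sample_configs` and its `seed` argument are hypothetical.

```python
import random
from itertools import product

param_space = {
    "num_trees": [50, 100, 200],
    "max_depth": [5, 10, 15],
    "shrinkage": [0.1, 0.05, 0.01],
    "subsample": [0.8, 1.0],
    "min_examples": [2, 5, 10],
}

def sample_configs(space, n_trials, seed=0):
    """Draw up to n_trials distinct parameter combinations from the grid."""
    keys = list(space)
    # Enumerate every combination once (162 for the space above).
    grid = [dict(zip(keys, values)) for values in product(*space.values())]
    # A dedicated generator makes the draw reproducible.
    rng = random.Random(seed)
    # sample() draws without replacement, so no combination repeats.
    return rng.sample(grid, min(n_trials, len(grid)))

configs = sample_configs(param_space, n_trials=20, seed=1234)
print(len(configs), "distinct configurations")
print(configs[0])
```

Each dictionary in `configs` could then be unpacked into the model constructor in place of the per-trial `random.choice` draw. Enumerating the grid is only practical while the search space stays small.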



In [22]:
# Define hyperparameter search space
param_space = {
    "num_trees": [50, 100, 200],
    "max_depth": [5, 10, 15],
    "shrinkage": [0.1, 0.05, 0.01],
    "subsample": [0.8, 1.0],
    "min_examples": [2, 5, 10],
}

# Random search with sampling
search_trials = 20  # Number of trials
best_accuracy = 0
best_params = None

for _ in range(search_trials):
    params = {key: random.choice(values) for key, values in param_space.items()}
    model = tfdf.keras.GradientBoostedTreesModel(
        verbose=0,
        features=[tfdf.keras.FeatureUsage(name=n) for n in input_features],
        exclude_non_specified_features=True,
        num_trees=params["num_trees"],
        max_depth=params["max_depth"],
        shrinkage=params["shrinkage"],
        subsample=params["subsample"],
        min_examples=params["min_examples"],
        random_seed=1234,
    )
    model.fit(train_ds)
    self_evaluation = model.make_inspector().evaluation()
    print(f"Params: {params}")
    print(f"Accuracy: {self_evaluation.accuracy}, Loss: {self_evaluation.loss}")
    
    if self_evaluation.accuracy > best_accuracy:
        best_accuracy = self_evaluation.accuracy
        best_params = params

print(f"Best Accuracy: {best_accuracy} with Parameters: {best_params}")


Params: {'num_trees': 100, 'max_depth': 5, 'shrinkage': 0.01, 'subsample': 1.0, 'min_examples': 2}
Accuracy: 0.782608687877655, Loss: 0.9822587370872498
Params: {'num_trees': 200, 'max_depth': 5, 'shrinkage': 0.1, 'subsample': 0.8, 'min_examples': 5}
Accuracy: 0.79347825050354, Loss: 0.914966344833374
Params: {'num_trees': 50, 'max_depth': 15, 'shrinkage': 0.01, 'subsample': 0.8, 'min_examples': 5}
Accuracy: 0.8152173757553101, Loss: 1.0805474519729614
Params: {'num_trees': 200, 'max_depth': 10, 'shrinkage': 0.1, 'subsample': 0.8, 'min_examples': 10}
Accuracy: 0.8152173757553101, Loss: 0.9404773712158203
Params: {'num_trees': 100, 'max_depth': 15, 'shrinkage': 0.1, 'subsample': 1.0, 'min_examples': 2}
Accuracy: 0.804347813129425, Loss: 0.9376649856567383
Params: {'num_trees': 100, 'max_depth': 5, 'shrinkage': 0.1, 'subsample': 0.8, 'min_examples': 5}
Accuracy: 0.79347825050354, Loss: 0.914966344833374
Params: {'num_trees': 200, 'max_depth': 15, 'shrinkage': 0.01, 'subsample': 1.0, 'min

## Fit the model with the best parameters and start training

In [23]:
best_params = {
    "num_trees": 50,
    "max_depth": 10,
    "shrinkage": 0.01,
    "subsample": 1.0,
    "min_examples": 5,
}

model = tfdf.keras.GradientBoostedTreesModel(
    verbose=0,
    features=[tfdf.keras.FeatureUsage(name=n) for n in input_features],
    exclude_non_specified_features=True,
    **best_params,
    random_seed=1234,
)
model.fit(train_ds)

# Inspect the trained model to view information about the trees and features used
inspector = model.make_inspector()

# Print the model structure information
print(f"Number of Trees: {inspector.num_trees()}")
print(f"Features used in the model: {inspector.features()}")

# For more detailed evaluation:
evaluation = inspector.evaluation()
print(f"Final Accuracy: {evaluation.accuracy}")
print(f"Final Loss: {evaluation.loss}")

# Additional model insights
print("\n--- Model Insights ---")
print(f"Variable importances: {inspector.variable_importances()}\n")


Number of Trees: 50
Features used in the model: ["Age" (1; #0), "Embarked" (4; #1), "Fare" (1; #2), "Name" (5; #3), "Parch" (1; #4), "Pclass" (1; #5), "Sex" (4; #6), "SibSp" (1; #7), "Ticket_Item" (4; #8), "Ticket_Number" (4; #9)]
Final Accuracy: 0.8260869383811951
Final Loss: 1.068824052810669

--- Model Insights ---
Variable importances: {'NUM_NODES': [("Name" (5; #3), 1742.0), ("Fare" (1; #2), 1057.0), ("Age" (1; #0), 1002.0), ("Ticket_Item" (4; #8), 267.0), ("Embarked" (4; #1), 152.0), ("Parch" (1; #4), 131.0), ("SibSp" (1; #7), 102.0), ("Pclass" (1; #5), 87.0), ("Ticket_Number" (4; #9), 71.0), ("Sex" (4; #6), 52.0)], 'SUM_SCORE': [("Sex" (4; #6), 1605.6526038999227), ("Fare" (1; #2), 704.0864378250374), ("Name" (5; #3), 483.6270302723298), ("Pclass" (1; #5), 471.02402290400687), ("Age" (1; #0), 435.018253210779), ("SibSp" (1; #7), 203.09745964407017), ("Ticket_Item" (4; #8), 154.98678843270582), ("Embarked" (4; #1), 57.75564853094524), ("Ticket_Number" (4; #9), 43.3624663395651), 

## Make predictions

In [24]:
def prediction_to_kaggle_format(model, threshold=0.5):
    proba_survive = model.predict(test_ds, verbose=0)[:,0]
    return pd.DataFrame({
        "PassengerId":test_df["PassengerId"],
        "Survived": (proba_survive >= threshold).astype(int)
    })

def make_submission(kaggle_predictions):
    path="/kaggle/working/submission.csv"
    kaggle_predictions.to_csv(path, index=False)
    print(f"Submission exported to {path}")
    
kaggle_predictions = prediction_to_kaggle_format(model)
make_submission(kaggle_predictions)
!head /kaggle/working/submission.csv


Submission exported to /kaggle/working/submission.csv
PassengerId,Survived
892,0
893,0
894,0
895,0
896,1
897,0
898,0
899,0
900,0
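The thresholding step in `prediction_to_kaggle_format` can be tried in isolation. The sketch below applies the same comparison to made-up probabilities; `to_submission` is a hypothetical helper, not part of the notebook.

```python
import numpy as np
import pandas as pd

def to_submission(passenger_ids, proba_survive, threshold=0.5):
    """Turn survival probabilities into 0/1 labels in Kaggle's layout."""
    # Probabilities at or above the threshold count as survived.
    labels = (np.asarray(proba_survive) >= threshold).astype(int)
    return pd.DataFrame({"PassengerId": passenger_ids, "Survived": labels})

# Made-up ids and probabilities for illustration.
ids = [892, 893, 894, 895]
proba = [0.12, 0.50, 0.49, 0.91]

submission = to_submission(ids, proba)
strict = to_submission(ids, proba, threshold=0.95)
print(submission)
```

Raising the threshold makes the model predict survival less often, so it trades false positives for false negatives.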
