<a href="https://www.kaggle.com/code/lucascarpantonio/titanic-machine-learning-from-disaster?scriptVersionId=285818002" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import plotly.express as px


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/titanic/train.csv
/kaggle/input/titanic/test.csv
/kaggle/input/titanic/gender_submission.csv


## 1. Introduction

On the night of April 14th, 1912, an engineering marvel called RMS Titanic struck an iceberg during its maiden voyage and tragically sank in the North Atlantic. Of the more than 2,200 passengers and crew on board, only around 32% survived. The disaster revealed how social class, cabin location, age, gender and family structure profoundly influenced the likelihood of survival.

The purpose of this analysis is to explore these factors using the Titanic dataset provided by Kaggle:  
1. to examine which features had the strongest impact on survival,  
2. to build a predictive model using rigorous data preparation and feature engineering, and  
3. to generate survival predictions for the passengers in the test set.

This notebook follows a structured workflow — from exploratory data analysis to model training and final prediction — aiming to reproduce, as faithfully as possible, the real-world patterns that shaped survival on the Titanic.

In [2]:
# loading training data
train_data = pd.read_csv("/kaggle/input/titanic/train.csv")
train_data.info()
train_data.head(4)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S


## 2. Exploring and Cleaning Datasets

## 2.1 Do all passengers have an embarkation port?

In [3]:
train_data[train_data['Embarked'].isna()]

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
61,62,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,B28,
829,830,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,B28,


In [4]:
train_data.groupby('Embarked')['Fare'].describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Embarked,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
C,168.0,59.954144,83.912994,4.0125,13.69795,29.7,78.500025,512.3292
Q,77.0,13.27603,14.188047,6.75,7.75,7.75,15.5,90.0
S,644.0,27.079812,35.887993,0.0,8.05,13.0,27.9,263.0


In [5]:
# Impute Cherbourg ('C'): the fare of 80 is closest to Cherbourg's fare range.
train_data['Embarked'] = train_data['Embarked'].fillna('C')

# Confirm that every passenger now has an embarkation port
train_data['Embarked'].isna().sum()

0

### Observations
- The dataset shows that only **two passengers** have a missing value in the `Embarked` column.  
- A comparison of fare distributions across embarkation ports (`train_data.groupby('Embarked')['Fare'].describe()`) shows that ticket prices differ markedly between Southampton (S), Cherbourg (C), and Queenstown (Q): the median fares are 13.0, 29.7, and 7.75 respectively.  
- At £80, the fare paid by these two passengers sits just above Cherbourg's upper quartile (about 78.5) and far above Southampton's (27.9), so it is **closer to the typical price range of tickets purchased at Cherbourg** than at Southampton.  
- The historical record is incomplete and the impact on the model is negligible (2 out of 891 entries). This notebook therefore imputes `'C'`, following the fare evidence; many Kaggle Titanic notebooks instead fill the gap with the most frequent port, `'S'`.
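
The fare argument above can be checked numerically. The following is a minimal sketch, not part of the original notebook: it copies the quartiles from the `describe()` output and measures how far a fare of 80 lies from each port's interquartile range. The names `fare_stats`, `missing_fare`, and `distance_to_iqr` are illustrative assumptions.

```python
import pandas as pd

# Quartiles copied from the groupby('Embarked')['Fare'].describe() output
fare_stats = pd.DataFrame(
    {
        "q25": [13.69795, 7.75, 8.05],
        "q50": [29.7, 7.75, 13.0],
        "q75": [78.500025, 15.5, 27.9],
    },
    index=["C", "Q", "S"],
)

missing_fare = 80.0  # fare paid by the two passengers without a port

# Distance from the fare to each port's interquartile range (0 if inside it)
below = (fare_stats["q25"] - missing_fare).clip(lower=0)
above = (missing_fare - fare_stats["q75"]).clip(lower=0)
distance_to_iqr = below + above

closest_port = distance_to_iqr.idxmin()
print(distance_to_iqr.round(2))
print("Closest port by fare:", closest_port)
```

This is only one possible distance measure; comparing against medians or using `Pclass` as an extra condition would be equally defensible.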


## 2.2 Is it plausible that most passengers have no cabin?

> **Historical Note – Why Most Passengers Have No Cabin Recorded**
>
> The large number of missing cabin entries in the Titanic dataset is historically accurate.  
> Many 3rd-class passengers slept in open dormitories or shared berths without individual cabin numbers.  
> For others—especially families or last-minute travelers—the cabin was assigned only after boarding and never appeared in the surviving records.  
> Passenger manifests were reconstructed from partial documents after the disaster, so numerous cabin assignments were simply never documented.

**Deck Estimation**

Most passengers do not have a recorded `Cabin`, so only a limited subset can be directly assigned to a specific deck.  
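Where cabin information is missing, this notebook approximates the deck from shared tickets: a passenger with no recorded cabin inherits the cabin of a companion on the same ticket, provided all cabins recorded for that ticket lie on one deck. Correlated features such as `Pclass` and `Fare`, which historically reflected the physical accommodation level aboard the Titanic, offer an alternative basis for the same estimate.  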
This estimation is not meant to reconstruct the exact cabin, but rather to approximate the **cabin position**, a factor that may have influenced the passenger’s proximity to evacuation routes and, consequently, their likelihood of survival.  
Incorporating an estimated deck thus adds a potentially meaningful spatial component to the modelling process.

In [6]:
def choose_cabin(cabins):
    """
    cabins: array of cabin strings for one ticket, e.g. ["B51 B53 B55", "B57 B59"]
    Returns:
      - the first cabin string if all listed cabins are on the same deck
      - np.nan if the cabins span more than one deck
    """
    # Collect the deck letter of every cabin listed for this ticket
    deck_letters = set()
    for c in cabins:
        for part in str(c).split():   # e.g. "B51 B53 B55" -> ["B51", "B53", "B55"]
            deck_letters.add(part[0]) # the deck is the first character

    if len(deck_letters) == 1:
        # All cabins share one deck: return the first cabin string
        return cabins[0]
    else:
        # Several decks on one ticket: not reliable enough to impute
        return np.nan

In [7]:
# Tickets with at least one recorded cabin
tmp = train_data[train_data['Cabin'].notna()][['Ticket', 'Cabin']]

# Group the recorded cabins by ticket
cabins_by_ticket = tmp.groupby('Ticket')['Cabin'].unique()


# Keep one cabin per single-deck ticket; drop tickets that span several decks
ticket_to_cabin = cabins_by_ticket.apply(choose_cabin).dropna()

train_data['Cabin_imputed'] = train_data['Cabin']
mask = train_data['Cabin_imputed'].isna() & train_data['Ticket'].isin(ticket_to_cabin.index)

train_data.loc[mask, 'Cabin_imputed'] = train_data.loc[mask, 'Ticket'].map(ticket_to_cabin)

# update the Deck
train_data['Deck_original'] = train_data['Cabin'].str[0].fillna('U')
train_data['Deck_imputed']  = train_data['Cabin_imputed'].str[0].fillna('U')

pd.crosstab(train_data['Deck_original'], train_data['Deck_imputed'])

Deck_imputed,A,B,C,D,E,F,G,T,U
Deck_original,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1
A,15,0,0,0,0,0,0,0,0
B,0,47,0,0,0,0,0,0,0
C,0,0,59,0,0,0,0,0,0
D,0,0,0,33,0,0,0,0,0
E,0,0,0,0,32,0,0,0,0
F,0,0,0,0,0,13,0,0,0
G,0,0,0,0,0,0,4,0,0
T,0,0,0,0,0,0,0,1,0
U,0,1,8,0,1,0,0,0,677


### Observations: Refined Cabin Imputation and Ticket Consistency

- After restricting the imputation to tickets whose recorded cabins all belong to the *same deck*, the procedure preserves only structurally consistent information.  
- Tickets containing cabins from multiple decks (e.g., mix of C and E) are excluded, since they do not allow a reliable spatial inference.
- With this refinement, only **one cabin assignment** was discarded due to deck inconsistency, ensuring that all imputed values reflect realistic and historically coherent cabin placements.
- This conservative filtering strengthens the internal validity of the dataset and avoids injecting noise into downstream modelling.
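
To make the same-deck rule concrete, here is a small sketch on hypothetical data (the tickets `T1`, `T2`, `T3` and their cabins are invented for illustration). The helper `single_deck_cabin` is a compact stand-in for the notebook's `choose_cabin`, assuming the same rule: keep a ticket's first cabin only when every cabin on that ticket shares one deck letter.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: ticket T1 stays on deck B, ticket T2 mixes decks C and E
toy = pd.DataFrame({
    "Ticket": ["T1", "T1", "T1", "T2", "T2", "T2", "T3"],
    "Cabin":  ["B51 B53", "B57", np.nan, "C10", "E20", np.nan, np.nan],
})

def single_deck_cabin(cabins):
    """Return the first cabin if every listed cabin is on one deck, else NaN."""
    decks = {part[0] for cabin in cabins for part in str(cabin).split()}
    return cabins[0] if len(decks) == 1 else np.nan

# Build the ticket -> cabin map from rows that have a recorded cabin
known = toy.dropna(subset=["Cabin"])
ticket_to_cabin = (
    known.groupby("Ticket")["Cabin"].unique().apply(single_deck_cabin).dropna()
)

# Fill missing cabins from the map, then derive the deck letter ('U' = unknown)
toy["Cabin_imputed"] = toy["Cabin"].fillna(toy["Ticket"].map(ticket_to_cabin))
toy["Deck_imputed"] = toy["Cabin_imputed"].str[0].fillna("U")
print(toy)
```

In this toy frame the third `T1` passenger inherits deck B, while the `T2` passenger with no cabin stays unknown because the ticket spans two decks.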

In [8]:
train_data = train_data.drop(columns=['Deck_original','Cabin'])
train_data.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Cabin_imputed,Deck_imputed
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,,U
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,,U
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,C123,C
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,S,,U


## 2.3 How does the survival rate vary with family size?

In [9]:
train_data['FamilySize'] = train_data['SibSp'] + train_data['Parch'] + 1
train_data['IsAlone'] = (train_data['FamilySize'] == 1).astype(int)

# Calculate survival rate for each FamilySize and Sex
fs_gender = train_data.groupby(['FamilySize', 'Sex'])['Survived'].mean().reset_index()

fig = px.line(
    fs_gender,
    x='FamilySize',
    y='Survived',
    color='Sex',
    markers=True,
    title='Survival Rate by Family Size and Gender',
    labels={
        'FamilySize': 'Family Size (SibSp + Parch + 1)',
        'Survived': 'Survival Rate',
        'Sex': 'Gender'
    }
)

fig.update_layout(
    yaxis=dict(tickformat=".0%"),
    hovermode='x unified'
)

fig.show()

### Observations

- **Women** show consistently higher survival rates across all family sizes, reflecting the impact of the “women and children first” evacuation policy.
- **Men travelling alone (FamilySize = 1)** exhibit extremely low survival rates, forming a clear downward outlier compared to every other group.
- **Small families (FamilySize = 2–4)** achieve the highest survival rates for both genders. These groups were more likely to remain together during evacuation and access lifeboats in an orderly way.
- **Large families (FamilySize ≥ 5)** show a sharp drop in survival, especially among men. Coordinating evacuation across many members drastically reduced their chances.
- The overall pattern forms an **inverted-U-shaped relationship**: survival peaks for small family units, while isolated individuals and large family groups fare much worse.
- The strong interaction between **FamilySize** and **Sex** suggests that family structure and gender jointly influenced survival outcomes, making both features valuable inputs for modelling.
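
The three groups named in these observations can be turned into a categorical feature. The sketch below is an illustration on hypothetical rows, not the notebook's own code; the column name `FamilyGroup` and the labels `Alone`, `Small`, and `Large` are assumptions, with bin edges taken from the ranges quoted above (1, 2 to 4, 5 or more).

```python
import pandas as pd

# Hypothetical toy rows; in the notebook these columns come from train_data
toy = pd.DataFrame({
    "SibSp":    [0, 1, 1, 4, 0, 3, 0, 2],
    "Parch":    [0, 0, 2, 2, 0, 2, 1, 0],
    "Survived": [0, 1, 1, 0, 1, 0, 1, 0],
})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Right-closed bins: (0, 1] = alone, (1, 4] = small family, (4, inf) = large family
toy["FamilyGroup"] = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, float("inf")],
    labels=["Alone", "Small", "Large"],
)

# Mean of the 0/1 target is the survival rate within each group
rate_by_group = toy.groupby("FamilyGroup", observed=False)["Survived"].mean()
print(rate_by_group)
```

Whether such a binned feature helps a tree-based model more than the raw `FamilySize` is an empirical question this sketch does not answer.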

## 2.4 What's the Age Factor?

The distribution shows a strong presence of young adults, with children and infants forming a smaller proportion of the passengers. 
From a survival perspective, age interacts closely with other factors such as gender and class: 
children benefited from the “women and children first” evacuation policy, while adult males faced lower survival chances, especially in third class. 
Younger adults (20–40 years) form the bulk of the dataset, and their survival rate reflects the general demographic composition of the passengers. 
Although age alone is not a perfect predictor, because its effects depend strongly on sex and class, it contributes meaningful discriminative information and is commonly included in predictive models for the Titanic competition.

In [10]:
fig5 = px.histogram(
    train_data,
    x='Age',
    nbins=52,
    title='Age Distribution',
    labels={'Age': 'Age (years)'},
    opacity=0.75
)

fig5.update_layout(
    bargap=0.05,
    hovermode='x unified'
)

fig5.show()

In [11]:
fig = px.box(
    train_data,
    x='Survived',
    y='Age',
    color='Survived',
    points='all',
    title='Survival vs Age — Boxplot',
    labels={
        'Survived': 'Survived (0 = No, 1 = Yes)',
        'Age': 'Age'
    }
)

fig.show()

### Observations:

- The age distribution of Titanic passengers is broad and continuous, spanning from infants to elderly individuals, with a peak concentration between 20 and 40 years.  
- When comparing survivors and non-survivors through boxplots, both groups exhibit very similar age distributions, with overlapping medians and comparable interquartile ranges.  
- No clear age-based trend emerges: younger adults, middle-aged passengers, and even some elderly individuals appear in both survival categories.  
- Although children (particularly very young ones) were prioritized during evacuation, their numbers are relatively small compared to the overall dataset, limiting the impact of this pattern on the global distribution.  
- Overall, **age does not appear to be a strong standalone predictor of survival**, especially when compared to more influential factors such as sex, passenger class, or family structure.
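
One way to look past the overlapping boxplots is to compare survival rates per age band, keeping passengers with a missing age in a band of their own. The following sketch uses hypothetical rows; the band edges and the labels (including `Unknown`) are assumptions chosen for illustration, not values from the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows; the last two passengers have no recorded age
toy = pd.DataFrame({
    "Age":      [2, 9, 17, 25, 31, 38, 45, 62, np.nan, np.nan],
    "Survived": [1, 1, 0, 0, 1, 0, 1, 0, 0, 1],
})

# Assumed band edges (right-closed intervals)
bins = [0, 12, 18, 40, 60, np.inf]
labels = ["Child", "Teen", "Young adult", "Middle-aged", "Senior"]
toy["AgeBand"] = pd.cut(toy["Age"], bins=bins, labels=labels)

# Give missing ages their own category instead of dropping them silently
toy["AgeBand"] = toy["AgeBand"].cat.add_categories("Unknown").fillna("Unknown")

# Count and survival rate per band
summary = toy.groupby("AgeBand", observed=False)["Survived"].agg(["count", "mean"])
print(summary)
```

Reporting the count next to the rate matters here, because small bands such as children can show extreme rates that rest on very few passengers.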

## 2.5 Which deck has the highest survival rate?

In [12]:
pct_surv = pd.crosstab(train_data['Deck_imputed'], train_data['Survived'], normalize='index') * 100
pct_surv.round(1)

Survived,0,1
Deck_imputed,Unnamed: 1_level_1,Unnamed: 2_level_1
A,53.3,46.7
B,25.0,75.0
C,38.8,61.2
D,24.2,75.8
E,27.3,72.7
F,38.5,61.5
G,50.0,50.0
T,100.0,0.0
U,70.6,29.4


## 3. Modelling

After completing the exploratory data analysis and building a refined set of engineered features, I proceed to the modelling phase.  
The objective of this section is to evaluate how well the selected variables (derived from demographic information, family structure, ticket-based cabin inference, deck estimation, and embarkation details) can predict passenger survival on the Titanic.

The modelling process will follow a structured approach:

1. **Feature selection:** we retain only variables that proved meaningful during the EDA, avoiding redundancy (e.g., using `FamilySize` instead of `SibSp` and `Parch`).  
2. **Train–test split:** the cleaned dataset is split into training (70%) and validation (30%) subsets to obtain an unbiased estimate of model performance.  
3. **Model choice:** we start with simple baseline models and progressively evaluate more expressive algorithms, monitoring improvements in accuracy.  
4. **Performance evaluation:** predictions are compared against the validation set, and accuracy is used as the primary metric, while being mindful of dataset imbalance and the limitations of accuracy alone.

This chapter therefore moves from *understanding the data* to *leveraging it* — translating insights into a predictive model that reflects both historical constraints and statistical evidence.
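
The workflow above can be sketched end to end on synthetic data. This is a minimal illustration under stated assumptions, not the notebook's model: `make_classification` stands in for the Titanic features, the class weights roughly mimic the 62/38 survival imbalance, and a majority-class `DummyClassifier` serves as the simple baseline against which accuracy is judged.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the engineered features (not the Titanic data)
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, n_informative=4, n_redundant=0,
    weights=[0.62, 0.38], random_state=0,
)

# 70/30 split, stratified so both parts keep the same class ratio
X_tr, X_val, y_tr, y_val = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Baseline: always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X_tr, y_tr)
baseline_acc = accuracy_score(y_val, baseline.predict(X_val))

# More expressive model, evaluated on the same validation set
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
forest_acc = accuracy_score(y_val, forest.predict(X_val))

print(f"Baseline accuracy: {baseline_acc:.3f}")
print(f"Random forest accuracy: {forest_acc:.3f}")
```

On imbalanced data the baseline already scores close to the majority share, which is why accuracy is best read relative to it rather than in isolation.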

In [13]:
new_train_data = train_data[['PassengerId','Survived','Pclass','Sex','FamilySize','IsAlone','Fare','Embarked','Deck_imputed']].copy()
new_train_data['Sex'] = new_train_data['Sex'].map({'male':0, 'female':1})
new_train_data['Fare_log'] = np.log1p(new_train_data['Fare'])
new_train_data = pd.get_dummies(
    new_train_data,
    columns=['Embarked', 'Deck_imputed'],
    drop_first=True
)

new_train_data.head(4)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,FamilySize,IsAlone,Fare,Fare_log,Embarked_Q,Embarked_S,Deck_imputed_B,Deck_imputed_C,Deck_imputed_D,Deck_imputed_E,Deck_imputed_F,Deck_imputed_G,Deck_imputed_T,Deck_imputed_U
0,1,0,3,0,2,0,7.25,2.110213,False,True,False,False,False,False,False,False,False,True
1,2,1,1,1,2,0,71.2833,4.280593,False,False,False,True,False,False,False,False,False,False
2,3,1,3,1,1,1,7.925,2.188856,False,True,False,False,False,False,False,False,False,True
3,4,1,1,1,2,0,53.1,3.990834,False,True,False,True,False,False,False,False,False,False


In [14]:
false_summary = pd.DataFrame({
    'False_count': (new_train_data == False).sum(),
    'False_percent': ((new_train_data == False).mean() * 100).round(1)
}).sort_values('False_count', ascending=False)

false_summary

Unnamed: 0,False_count,False_percent
Deck_imputed_T,890,99.9
Deck_imputed_G,887,99.6
Deck_imputed_F,878,98.5
Deck_imputed_D,858,96.3
Deck_imputed_E,858,96.3
Deck_imputed_B,843,94.6
Deck_imputed_C,824,92.5
Embarked_Q,814,91.4
Sex,577,64.8
Survived,549,61.6


> **Note on Rare Deck Categories**
>
> Although some deck indicators are almost entirely `False`, they can still carry information.  
> `Deck_imputed_T`, for example, is `True` for a single training passenger 
> (890 of 891 rows are `False`), and that passenger did not survive.  
> Removing this feature reduced the model's accuracy by ~2% on the split used here, which suggests 
> that even rare categories can contribute discriminative power in tree-based models.  
> For this reason, rare deck indicators are retained in the final feature set.
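
A feature-removal comparison of this kind can be run with cross-validation, which averages over several splits instead of relying on one. The sketch below uses synthetic data, not the Titanic frame; the column `Deck_rare` is a hypothetical stand-in for a one-passenger indicator such as `Deck_imputed_T`, and the label rule (the `Sex` value flipped for about 20% of rows) is an assumption made only to give the model something to learn.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n = 300

# Synthetic stand-in for the encoded training frame
X_demo = pd.DataFrame({
    "Sex": rng.integers(0, 2, size=n),
    "Pclass": rng.integers(1, 4, size=n),
    "Deck_rare": np.zeros(n, dtype=bool),
})
X_demo.loc[0, "Deck_rare"] = True  # a single row carries the rare flag

# Assumed label rule: equal to Sex, flipped for roughly 20% of rows
flip = rng.random(n) < 0.2
y_demo = np.where(flip, 1 - X_demo["Sex"], X_demo["Sex"])

model = RandomForestClassifier(n_estimators=50, random_state=0)

# 5-fold cross-validated accuracy with and without the rare indicator
scores_with = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")
scores_without = cross_val_score(
    model, X_demo.drop(columns="Deck_rare"), y_demo, cv=5, scoring="accuracy"
)

print(f"with rare flag:    {scores_with.mean():.3f} +/- {scores_with.std():.3f}")
print(f"without rare flag: {scores_without.mean():.3f} +/- {scores_without.std():.3f}")
```

Comparing the difference in means against the fold-to-fold spread gives a rough sense of whether a change of a couple of percentage points exceeds the noise.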

In [15]:
from sklearn.model_selection import train_test_split

X = new_train_data.drop(columns='Survived')
y = new_train_data['Survived']

# Stratify on the target so both splits keep the same survival ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

In [16]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

model = RandomForestClassifier(
    n_estimators=300,
    max_depth=None,
    random_state=42
)

model.fit(X_train, y_train)

y_pred = model.predict(X_test)

acc = accuracy_score(y_test, y_pred)
print(f"Prediction accuracy is {acc*100:.1f}%")

Prediction accuracy is 78.4%


In [17]:
false_counts = (new_train_data == False).sum()
print(false_counts)

PassengerId         0
Survived          549
Pclass              0
Sex               577
FamilySize          0
IsAlone           354
Fare               15
Fare_log           15
Embarked_Q        814
Embarked_S        247
Deck_imputed_B    843
Deck_imputed_C    824
Deck_imputed_D    858
Deck_imputed_E    858
Deck_imputed_F    878
Deck_imputed_G    887
Deck_imputed_T    890
Deck_imputed_U    214
dtype: int64



## 4. Prediction

In [19]:
train_data.head(4)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Embarked,Cabin_imputed,Deck_imputed,FamilySize,IsAlone
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,S,,U,2,0
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C,C85,C,2,0
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,S,,U,1,1
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,S,C123,C,2,0


In [20]:
test_data = pd.read_csv('/kaggle/input/titanic/test.csv')
test_data.head(4)
test_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.1+ KB


In [21]:
test_data[test_data['Fare'].isna()]



Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
152,1044,3,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S


In [22]:
fare_fill = train_data[
    (train_data['Pclass'] == 3) &
    (train_data['Embarked'] == 'S') &
    (train_data['SibSp'] == 0) &
    (train_data['Parch'] == 0)
]['Fare'].mean()


test_data['Fare'] = test_data['Fare'].fillna(fare_fill)
test_data['Fare'].isna().sum()

0
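
The cell above fills the single missing fare with the mean fare of comparable training passengers. A variant of the same idea is sketched below on hypothetical data: it learns a per-group statistic on the training frame and applies it to the test frame. Note that it swaps in the median instead of the mean, since fares are right-skewed; the frames `train_toy` and `test_toy` and the helper column `Fare_group_median` are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test frames
train_toy = pd.DataFrame({
    "Pclass":   [3, 3, 3, 1, 1],
    "Embarked": ["S", "S", "S", "C", "C"],
    "Fare":     [7.0, 8.0, 30.0, 80.0, 100.0],
})
test_toy = pd.DataFrame({
    "Pclass":   [3, 1],
    "Embarked": ["S", "C"],
    "Fare":     [np.nan, 90.0],
})

keys = ["Pclass", "Embarked"]

# Learn the group medians on the training data only
group_median = train_toy.groupby(keys)["Fare"].median().rename("Fare_group_median")

# Attach each test row's group median, then fill only the missing fares
test_toy = test_toy.join(group_median, on=keys)
test_toy["Fare"] = test_toy["Fare"].fillna(test_toy["Fare_group_median"])
test_toy = test_toy.drop(columns="Fare_group_median")
print(test_toy)
```

In this toy group the median is 8.0 while the mean would be 15.0, which shows how a single expensive ticket can pull a mean-based fill upward.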

In [23]:
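# Tickets in the test set with at least one recorded cabin
tmp = test_data[test_data['Cabin'].notna()][['Ticket', 'Cabin']]

# Group the test-set cabins by ticket (computed here but not used below)
cabin_by_ticket = tmp.groupby('Ticket')['Cabin'].unique()


# Note: `cabins_by_ticket` (plural) is the training-set map built in In [7],
# so test cabins are imputed from tickets shared with training passengers
ticket_to_cabin = cabins_by_ticket.apply(choose_cabin).dropna()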

test_data['Cabin_imputed'] = test_data['Cabin']
mask = test_data['Cabin_imputed'].isna() & test_data['Ticket'].isin(ticket_to_cabin.index)

test_data.loc[mask, 'Cabin_imputed'] = test_data.loc[mask, 'Ticket'].map(ticket_to_cabin)

# update the Deck
test_data['Deck_original'] = test_data['Cabin'].str[0].fillna('U')
test_data['Deck_imputed']  = test_data['Cabin_imputed'].str[0].fillna('U')

pd.crosstab(test_data['Deck_original'], test_data['Deck_imputed'])

Deck_imputed,A,B,C,D,E,F,G,U
Deck_original,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
A,7,0,0,0,0,0,0,0
B,0,18,0,0,0,0,0,0
C,0,0,35,0,0,0,0,0
D,0,0,0,13,0,0,0,0
E,0,0,0,0,9,0,0,0
F,0,0,0,0,0,8,0,0
G,0,0,0,0,0,0,1,0
U,0,2,2,0,0,0,0,323


In [24]:
test_data['FamilySize'] = test_data['SibSp'] + test_data['Parch'] + 1
test_data['IsAlone'] = (test_data['FamilySize'] == 1).astype(int)
test_data['Embarked'] = test_data['Embarked'].fillna('C')

new_test_data = test_data[['PassengerId','Pclass','Sex','FamilySize','IsAlone','Fare','Embarked','Deck_imputed']].copy()
new_test_data['Sex'] = new_test_data['Sex'].map({'male':0, 'female':1})
new_test_data['Fare_log'] = np.log1p(new_test_data['Fare'])
new_test_data = pd.get_dummies(
    new_test_data,
    columns=['Embarked', 'Deck_imputed'],
    drop_first=True
)

new_test_data['Deck_imputed_T'] = False

new_test_data.head(4)

Unnamed: 0,PassengerId,Pclass,Sex,FamilySize,IsAlone,Fare,Fare_log,Embarked_Q,Embarked_S,Deck_imputed_B,Deck_imputed_C,Deck_imputed_D,Deck_imputed_E,Deck_imputed_F,Deck_imputed_G,Deck_imputed_U,Deck_imputed_T
0,892,3,0,1,1,7.8292,2.178064,True,False,False,False,False,False,False,False,True,False
1,893,3,1,2,0,7.0,2.079442,False,True,False,False,False,False,False,False,True,False
2,894,2,0,1,1,9.6875,2.369075,True,False,False,False,False,False,False,False,True,False
3,895,3,0,1,1,8.6625,2.268252,False,True,False,False,False,False,False,False,True,False
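
Adding `Deck_imputed_T` by hand appends it as the last column, so the test frame ends up with a different column order than the training frame. scikit-learn estimators fitted on a DataFrame generally expect the same column names in the same order at prediction time. One way to align the two frames is `DataFrame.reindex`, sketched here on hypothetical data (the frames and the `Deck` column are invented for illustration).

```python
import pandas as pd

# Hypothetical raw frames: deck 'T' appears only in the training data
train_raw = pd.DataFrame({"Fare": [7.25, 30.0, 50.0], "Deck": ["U", "T", "C"]})
test_raw = pd.DataFrame({"Fare": [8.05, 60.0], "Deck": ["U", "C"]})

train_enc = pd.get_dummies(train_raw, columns=["Deck"])
test_enc = pd.get_dummies(test_raw, columns=["Deck"])

# Reorder test columns to match training; categories missing in test become False
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=False)

print(list(train_enc.columns))
print(list(test_aligned.columns))
```

`reindex` also silently drops any dummy column that exists only in the test set, which is usually what a model fitted on the training columns needs.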


In [25]:
new_train_data.head(4)

Unnamed: 0,PassengerId,Survived,Pclass,Sex,FamilySize,IsAlone,Fare,Fare_log,Embarked_Q,Embarked_S,Deck_imputed_B,Deck_imputed_C,Deck_imputed_D,Deck_imputed_E,Deck_imputed_F,Deck_imputed_G,Deck_imputed_T,Deck_imputed_U
0,1,0,3,0,2,0,7.25,2.110213,False,True,False,False,False,False,False,False,False,True
1,2,1,1,1,2,0,71.2833,4.280593,False,False,False,True,False,False,False,False,False,False
2,3,1,3,1,1,1,7.925,2.188856,False,True,False,False,False,False,False,False,False,True
3,4,1,1,1,2,0,53.1,3.990834,False,True,False,True,False,False,False,False,False,False


In [26]:
encode_td = train_data[['PassengerId','Survived','Pclass','Sex','FamilySize','IsAlone','Fare','Embarked','Deck_imputed']].copy()

In [27]:
from sklearn.model_selection import train_test_split

encode_td = train_data[['PassengerId','Survived','Pclass','Sex','FamilySize','IsAlone','Fare','Embarked','Deck_imputed']].copy()

encode_td = encode_td.set_index('PassengerId')

print(f"The index of encode_td is {encode_td.index}")

X = encode_td.drop('Survived', axis=1)
y = encode_td['Survived']

The index of encode_td is Index([  1,   2,   3,   4,   5,   6,   7,   8,   9,  10,
       ...
       882, 883, 884, 885, 886, 887, 888, 889, 890, 891],
      dtype='int64', name='PassengerId', length=891)


In [28]:
encode_test_data = test_data[['PassengerId','Pclass','Sex','FamilySize','IsAlone','Fare','Embarked','Deck_imputed']].copy()

In [29]:
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X, y, 
                                                                train_size=0.7, test_size=0.3,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns (all object-dtype columns; no cardinality filter is applied)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()

In [30]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

In [31]:
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(n_estimators=200, random_state=0)

In [32]:
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocess the Kaggle test data and get predictions
preds = my_pipeline.predict(encode_test_data)
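
The cell above splits off a validation set but predicts directly on the Kaggle test data, so the pipeline's own accuracy is never measured. A minimal sketch of that missing step follows. It uses synthetic data as a stand-in for `X` and `y`; the names `X_demo`, `y_demo`, and `valid_acc`, and the label rule (female passengers survive, with about 20% of labels flipped), are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)
n = 200

# Synthetic stand-in for the feature frame (not the Titanic data)
sex = rng.choice(["male", "female"], size=n)
X_demo = pd.DataFrame({
    "Sex": sex,
    "Pclass": rng.integers(1, 4, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
# Assumed label rule: 1 for female, flipped for roughly 20% of rows
y_demo = np.where(rng.random(n) < 0.2, sex == "male", sex == "female").astype(int)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# Same structure as the notebook's preprocessor: impute, then one-hot encode
preprocessor = ColumnTransformer(transformers=[
    ("num", SimpleImputer(strategy="constant"), ["Pclass", "Fare"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Sex"]),
])

pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

# Fit on the training part, score on the held-out validation part
pipeline.fit(X_tr, y_tr)
valid_preds = pipeline.predict(X_va)
valid_acc = accuracy_score(y_va, valid_preds)
print(f"Validation accuracy: {valid_acc:.3f}")
```

In the notebook the equivalent call would score `my_pipeline` on `X_valid` and `y_valid` before the test predictions are written out.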


## 5. Conclusions

With the final pipeline fitted on the training split of `train_data` and used to generate survival predictions for the Kaggle `test_data`, the predicted labels are recombined with the original test dataset.  
This step allows us to **verify whether the survival patterns learned by the model are consistent with the trends observed in the training set**.

Since the true `Survived` values are not available for `test_data`, this comparison cannot measure accuracy.  
Instead, it provides a **qualitative sanity check**: we compare key demographic distributions (such as Age) for predicted survivors and non-survivors against their real counterparts in the training data.

If the model has captured meaningful patterns, the predicted distributions should resemble those found in the original dataset.  
The comparison is made with a boxplot of age by survival status for both populations. Despite natural differences, the overall survival trends across age groups appear coherent between the real (training) and predicted (test) populations, suggesting that the model has internalized the general dynamics of survival on the Titanic.
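
The comparison described above can also be made numerically. The sketch below stacks a training frame with real labels and a test frame with predicted labels, then summarizes `Age` per source and survival status. All rows and the `preds_toy` list are hypothetical stand-ins; in the notebook the inputs would be `train_data`, `test_data`, and `preds`.

```python
import pandas as pd

# Hypothetical stand-ins for train_data (real labels) and test_data (no labels)
train_toy = pd.DataFrame({"Age": [22, 38, 4, 54, 30, 27], "Survived": [0, 1, 1, 0, 0, 1]})
test_toy = pd.DataFrame({"Age": [34, 47, 8, 62, 25, 19]})
preds_toy = [0, 1, 1, 0, 0, 1]  # stand-in for model predictions

# Tag each frame with its origin and stack them
real = train_toy.assign(Source="train (real)")
predicted = test_toy.assign(Survived=preds_toy, Source="test (predicted)")
combined = pd.concat([real, predicted], ignore_index=True)

# Quartiles of Age for each (source, survival status) pair
age_summary = (
    combined.groupby(["Source", "Survived"])["Age"]
    .describe()[["count", "25%", "50%", "75%"]]
)
print(age_summary)
```

The `combined` frame has the long format that plotting functions such as `px.box` expect, so the same table can feed a boxplot with `Source` as the color.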

In [33]:
output = pd.DataFrame({
    'PassengerId': test_data['PassengerId'],
    'Survived': preds
})
output.to_csv('submission.csv', index=False)
