# Titanic survival prediction

**Project overview**

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the RMS Titanic, widely considered “unsinkable”, sank after colliding with an iceberg. There weren’t enough lifeboats for everyone on board, and 1502 of the 2224 passengers and crew died.

While there was some element of luck involved in surviving, certain groups of people appear to have been more likely to survive than others. In this project, we aim to build a predictive model that answers the question “what sorts of people were more likely to survive?” using passenger data (e.g., name, age, gender, and socio-economic class).

**Data description**

The dataset provided by Kaggle includes a training set and a test set. The features included involve passenger demographics and travel characteristics:

1. `PassengerId`: Unique identifier for each passenger
2. `Survived`: Survival (0 = No, 1 = Yes)
3. `Pclass`: Ticket class — a proxy for socio-economic status (1 = 1st, 2 = 2nd, 3 = 3rd)
4. `Name`: Full name of the passenger
5. `Sex`: Gender of the passenger
6. `Age`: Age in years
7. `SibSp`: Number of siblings/spouses aboard the Titanic
8. `Parch`: Number of parents/children aboard the Titanic
9. `Ticket`: Ticket number
10. `Fare`: Passenger fare
11. `Cabin`: Cabin number
12. `Embarked`: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

**Objective**

The primary objective of this project is to predict which passengers survived. Our main metric is `accuracy`: the percentage of passengers whose outcome we predict correctly.
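
As a minimal illustration of the metric, the sketch below computes accuracy on five made-up labels, once by hand and once with scikit-learn's `accuracy_score` (imported later in this notebook). The arrays `y_true` and `y_pred` are hypothetical and are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels for five passengers (1 = survived, 0 = deceased).
y_true = np.array([0, 1, 1, 0, 1])
y_pred = np.array([0, 1, 0, 0, 1])

# Accuracy is the share of predictions that match the true label.
manual_accuracy = float((y_true == y_pred).mean())
sklearn_accuracy = float(accuracy_score(y_true, y_pred))

print(manual_accuracy, sklearn_accuracy)
```

Four of the five predictions match, so both values equal 0.8.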

**Output**

A `csv` file with 418 rows plus a header row, containing two columns:
1. `PassengerId` (sorted in any order)
2. `Survived` (contains your binary predictions: 1 for survived, 0 for deceased)
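
The expected layout can be sketched as follows. The passenger IDs and predictions are placeholders for illustration only, and the in-memory buffer stands in for a real file path such as `submission.csv` (an assumed name).

```python
import io

import pandas as pd

# Placeholder predictions for three passengers; the real file has 418 rows.
passenger_ids = [892, 893, 894]
predictions = [0, 1, 0]

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# Write without the DataFrame index so the file holds exactly two columns.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()

print(csv_text)
```

Passing `index=False` matters here: without it pandas writes the row index as an extra, unnamed first column.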

**Methodology**

Our approach will consist of the following steps:

1. Data exploration: Analyzing the features to understand the data's structure and the relationships between different variables.
2. Data cleaning and preprocessing: Dealing with missing values, encoding categorical variables, and scaling features where necessary.
3. Feature engineering: Creating new features from the existing data to improve the predictive power of our model.
4. Model selection: Comparing different machine learning algorithms and selecting the most appropriate model for our data.
5. Model training and evaluation: Training the model using the training dataset and evaluating its performance with a validation set.
6. Model tuning: Improving the model by tuning its parameters.
7. Prediction: Applying the final model to the test set to predict survival.
8. Results interpretation: Understanding the output of the model and the factors that influence the prediction.
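
To make steps 2, 4, and 5 concrete, here is a minimal sketch on synthetic data. It is not the pipeline built in this notebook: the `toy` frame is randomly generated (so the resulting score is meaningless), and `OneHotEncoder` and `SimpleImputer` are swapped in as simple stand-ins for whatever encoding and imputation the project ends up using.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the training data (NOT the real Titanic data).
rng = np.random.default_rng(42)
n = 200
toy = pd.DataFrame({
    "Sex": rng.choice(["male", "female"], size=n),
    "Age": rng.uniform(1, 80, size=n),
    "Fare": rng.uniform(5, 100, size=n),
})
toy.loc[rng.choice(n, size=20, replace=False), "Age"] = np.nan  # simulate missing ages
toy["Survived"] = rng.integers(0, 2, size=n)  # random labels, for illustration only

X = toy.drop(columns="Survived")
y = toy["Survived"]

# Step 2: impute and scale numeric columns, one-hot encode categorical ones.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer([
    ("num", numeric, ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])

# Steps 4-5: choose a model, fit on the training split, score on the validation split.
model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression(max_iter=1000)),
])
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)
model.fit(X_train, y_train)
valid_accuracy = accuracy_score(y_valid, model.predict(X_valid))

print(f"validation accuracy: {valid_accuracy:.2f}")
```

Wrapping preprocessing and the classifier in one `Pipeline` means the imputer and scaler are fitted on the training split only, which avoids leaking validation statistics into training.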


In [1]:
import pandas as pd
import numpy as np
import plotly.express as px
import optuna

from caseconverter import snakecase
from collections import defaultdict
from IPython.display import display

from fast_ml import eda
from ydata_profiling import ProfileReport

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler
from category_encoders import MEstimateEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.metrics import accuracy_score

In [2]:
FIG_WIDTH = 9 * 100
FIG_HEIGHT = 5 * 100
RANDOM_SEED = 42

In [3]:
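try:
    # Local run: the CSV files sit next to the notebook.
    raw_train = pd.read_csv('train.csv')
    raw_test = pd.read_csv('test.csv')
except FileNotFoundError:
    # Kaggle run: fall back to the competition's input directory.
    raw_train = pd.read_csv('/kaggle/input/titanic/train.csv')
    raw_test = pd.read_csv('/kaggle/input/titanic/test.csv')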

# Exploratory Data Analysis

In this section, we focus on the critical aspects of understanding the Titanic dataset:

1. Outlier detection: identify data points that deviate significantly from other observations.
2. Missing values: quantify and analyze the presence of missing data across different features.
3. Data consistency: check for any discrepancies or anomalies in the dataset that could indicate errors.
4. Feature distributions: examine the distribution of each feature to understand the spread and central tendencies.
5. Correlation analysis: investigate the relationships between different features, especially how they relate to the target variable `Survived`.
6. Data types: assess the type of data (numerical/categorical) for appropriate preprocessing techniques.

By addressing these points, we aim to prepare the dataset adequately for the subsequent stages of modeling and prediction.
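
Three of the checks listed above (missing values, data types, and correlation with the target) can be sketched with plain pandas. This is an illustrative alternative to the `fast_ml` and `ydata_profiling` helpers used in the cells that follow, and the `toy` frame is made up rather than drawn from the real data.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame that mimics a few Titanic columns (not real data).
toy = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1, 0],
    "Pclass": [3, 1, 3, 2, 1, 3],
    "Sex": ["male", "female", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0],
})

# Missing values: count and percentage per column.
missing = pd.DataFrame({
    "num_missing": toy.isna().sum(),
    "perc_missing": toy.isna().mean() * 100,
})

# Data types: separate numerical from categorical columns.
numerical_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Correlation analysis: Pearson correlation of numeric features with the target.
target_corr = toy[numerical_cols].corr()["Survived"].drop("Survived")

print(missing)
print(numerical_cols, categorical_cols)
print(target_corr)
```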

## Train data

Let's first explore the training data.

In [4]:
raw_train.head(5)

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
display(eda.df_info(raw_train))

| | data_type | data_type_grp | num_unique_values | sample_unique_values | num_missing | perc_missing |
|---|---|---|---|---|---|---|
| PassengerId | int64 | Numerical | 891 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | 0 | 0.0 |
| Survived | int64 | Numerical | 2 | [0, 1] | 0 | 0.0 |
| Pclass | int64 | Numerical | 3 | [3, 1, 2] | 0 | 0.0 |
| Name | object | Categorical | 891 | [Braund, Mr. Owen Harris, Cumings, Mrs. John B... | 0 | 0.0 |
| Sex | object | Categorical | 2 | [male, female] | 0 | 0.0 |
| Age | float64 | Numerical | 88 | [22.0, 38.0, 26.0, 35.0, nan, 54.0, 2.0, 27.0,... | 177 | 19.86532 |
| SibSp | int64 | Numerical | 7 | [1, 0, 3, 4, 2, 5, 8] | 0 | 0.0 |
| Parch | int64 | Numerical | 7 | [0, 1, 2, 5, 3, 4, 6] | 0 | 0.0 |
| Ticket | object | Categorical | 681 | [A/5 21171, PC 17599, STON/O2. 3101282, 113803... | 0 | 0.0 |
| Fare | float64 | Numerical | 248 | [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.... | 0 | 0.0 |


In [6]:
display(round(raw_train.describe().T, 2))

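| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| PassengerId | 891.0 | 446.0 | 257.35 | 1.0 | 223.5 | 446.0 | 668.5 | 891.0 |
| Survived | 891.0 | 0.38 | 0.49 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 |
| Pclass | 891.0 | 2.31 | 0.84 | 1.0 | 2.0 | 3.0 | 3.0 | 3.0 |
| Age | 714.0 | 29.7 | 14.53 | 0.42 | 20.12 | 28.0 | 38.0 | 80.0 |
| SibSp | 891.0 | 0.52 | 1.1 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 |
| Parch | 891.0 | 0.38 | 0.81 | 0.0 | 0.0 | 0.0 | 0.0 | 6.0 |
| Fare | 891.0 | 32.2 | 49.69 | 0.0 | 7.91 | 14.45 | 31.0 | 512.33 |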
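
The gap between the 75th percentile of `Fare` (31.0) and its maximum (512.33) hints at heavy right-tail outliers. One common way to quantify this, assumed here for illustration rather than taken from the notebook, is Tukey's 1.5 × IQR rule applied to the rounded quartiles reported by `describe()`:

```python
# Quartiles and maximum of Fare as reported (rounded) in the describe() output.
q1, q3 = 7.91, 31.0
fare_max = 512.33

iqr = q3 - q1                  # interquartile range
upper_fence = q3 + 1.5 * iqr   # values above this are flagged as outliers
lower_fence = q1 - 1.5 * iqr   # negative here, so no fare can fall below it

print(iqr, upper_fence, lower_fence)
```

With these numbers the upper fence is about 65.6, so the maximum fare of 512.33 lies far beyond it. Whether such fares are errors or genuine first-class prices is a separate question that the rule itself cannot answer.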


In [7]:
ProfileReport(raw_train).to_widgets()
