# Titanic survival prediction

**Project overview**

The sinking of the Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems certain groups of people were more likely to survive than others. In this project, we aim to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (e.g., name, age, gender, and socio-economic class).

**Data description**

The dataset provided by Kaggle includes a training set and a test set. The features included involve passenger demographics and travel characteristics:

1. `PassengerId`: Unique identifier for each passenger
2. `Survived`: Survival (0 = No, 1 = Yes)
3. `Pclass`: Ticket class — a proxy for socio-economic status (1 = 1st, 2 = 2nd, 3 = 3rd)
4. `Name`: Full name of the passenger
5. `Sex`: Gender of the passenger
6. `Age`: Age in years
7. `SibSp`: Number of siblings/spouses aboard the Titanic
8. `Parch`: Number of parents/children aboard the Titanic
9. `Ticket`: Ticket number
10. `Fare`: Passenger fare
11. `Cabin`: Cabin number
12. `Embarked`: Port of Embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

**Objective**

The primary objective of this project is to predict whether each passenger survived. Our main metric is `accuracy`: the percentage of passengers whose outcome we predicted correctly.
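
As a minimal sketch of the metric, accuracy is the share of predictions that equal the true labels. The helper name `accuracy` and the five labels below are hypothetical and serve only as an illustration; the notebook itself relies on `accuracy_score` from scikit-learn.

```python
import numpy as np


def accuracy(y_true, y_pred) -> float:
    """Fraction of predictions that match the true labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    return float((y_true == y_pred).mean())


# Hypothetical labels for five passengers: four of the five predictions match.
example_accuracy = accuracy([0, 1, 1, 0, 1], [0, 1, 0, 0, 1])
print(example_accuracy)  # 0.8
```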

**Output**

A `csv` file with 418 entries plus a header row:
1. `PassengerId` (sorted in any order)
2. `Survived` (contains your binary predictions: 1 for survived, 0 for deceased)
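
The format above can be checked before uploading. The sketch below is one possible check, not part of the Kaggle tooling; the function name `check_submission` and the three-row demo frame are assumptions made for illustration.

```python
import pandas as pd

EXPECTED_ROWS = 418  # number of entries required in the submission


def check_submission(submission: pd.DataFrame, expected_rows: int = EXPECTED_ROWS) -> list:
    """Return a list of problems; an empty list means the frame looks valid."""
    problems = []
    if list(submission.columns) != ['PassengerId', 'Survived']:
        problems.append(f'unexpected columns: {list(submission.columns)}')
    if len(submission) != expected_rows:
        problems.append(f'expected {expected_rows} rows, got {len(submission)}')
    if 'Survived' in submission and not submission['Survived'].isin([0, 1]).all():
        problems.append('Survived must contain only 0 or 1')
    return problems


# Hypothetical three-row frame, checked against expected_rows=3 for illustration.
demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
print(check_submission(demo, expected_rows=3))  # []
```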

**Methodology**

Our approach will consist of the following steps:

1. Data exploration: Analyzing the features to understand the data's structure and the relationships between different variables.
2. Data cleaning and preprocessing: Dealing with missing values, encoding categorical variables, and scaling features where necessary.
3. Feature engineering: Creating new features from the existing data to improve the predictive power of our model.
4. Model selection: Comparing different machine learning algorithms and selecting the most appropriate model for our data.
5. Model training and evaluation: Training the model using the training dataset and evaluating its performance with a validation set.
6. Model tuning: Improving the model by tuning its parameters.
7. Prediction: Applying the final model to the test set to predict survival.
8. Results interpretation: Understanding the output of the model and the factors that influence the prediction.


In [54]:
import pandas as pd
import numpy as np
import plotly.express as px
import optuna

from caseconverter import snakecase
from collections import defaultdict
from IPython.display import display

from fast_ml import eda
from ydata_profiling import ProfileReport

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
from category_encoders import MEstimateEncoder

from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

from sklearn.model_selection import cross_val_score
from sklearn.metrics import accuracy_score

In [55]:
FIG_WIDTH = 9 * 100
FIG_HEIGHT = 5 * 100
RANDOM_SEED = 42

In [56]:
try:
    raw_train = pd.read_csv('train.csv')
    raw_test = pd.read_csv('test.csv')
except FileNotFoundError:
    raw_train = pd.read_csv('/kaggle/input/titanic/train.csv')
    raw_test = pd.read_csv('/kaggle/input/titanic/test.csv')

# Exploratory Data Analysis

In this section, we focus on the critical aspects of understanding the Titanic dataset:

1. Outlier detection: identify data points that deviate significantly from other observations.
2. Missing values: quantify and analyze the presence of missing data across different features.
3. Data consistency: check for any discrepancies or anomalies in the dataset that could indicate errors.
4. Feature distributions: examine the distribution of each feature to understand the spread and central tendencies.
5. Correlation analysis: investigate the relationships between different features, especially how they relate to the target variable 'Survived'.
6. Data types: Assess the type of data (numerical/categorical) for appropriate preprocessing techniques.

By addressing these points, we aim to prepare the dataset adequately for the subsequent stages of modeling and prediction.

## Train data

Let's first explore train data.

Dataset sample:

In [57]:
raw_train.head(5)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


Dataset overall summary:

In [58]:
display(eda.df_info(raw_train))

Unnamed: 0,data_type,data_type_grp,num_unique_values,sample_unique_values,num_missing,perc_missing
PassengerId,int64,Numerical,891,"[1, 2, 3, 4, 5, 6, 7, 8, 9, 10]",0,0.0
Survived,int64,Numerical,2,"[0, 1]",0,0.0
Pclass,int64,Numerical,3,"[3, 1, 2]",0,0.0
Name,object,Categorical,891,"[Braund, Mr. Owen Harris, Cumings, Mrs. John B...",0,0.0
Sex,object,Categorical,2,"[male, female]",0,0.0
Age,float64,Numerical,88,"[22.0, 38.0, 26.0, 35.0, nan, 54.0, 2.0, 27.0,...",177,19.86532
SibSp,int64,Numerical,7,"[1, 0, 3, 4, 2, 5, 8]",0,0.0
Parch,int64,Numerical,7,"[0, 1, 2, 5, 3, 4, 6]",0,0.0
Ticket,object,Categorical,681,"[A/5 21171, PC 17599, STON/O2. 3101282, 113803...",0,0.0
Fare,float64,Numerical,248,"[7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51....",0,0.0


Dataset numerical distributions:

In [59]:
display(round(raw_train.describe().T, 2))

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,891.0,446.0,257.35,1.0,223.5,446.0,668.5,891.0
Survived,891.0,0.38,0.49,0.0,0.0,0.0,1.0,1.0
Pclass,891.0,2.31,0.84,1.0,2.0,3.0,3.0,3.0
Age,714.0,29.7,14.53,0.42,20.12,28.0,38.0,80.0
SibSp,891.0,0.52,1.1,0.0,0.0,0.0,1.0,8.0
Parch,891.0,0.38,0.81,0.0,0.0,0.0,0.0,6.0
Fare,891.0,32.2,49.69,0.0,7.91,14.45,31.0,512.33


Detailed breakdown:

In [60]:
ProfileReport(raw_train).to_widgets()


**A summary of key observations that stand out:**

Completeness of Data:
1. The `Age` feature has 177 missing values, which is about 20% of the data. This is significant and needs to be addressed, either through imputation or by discarding incomplete records depending on the chosen strategy.
2. The `Cabin` feature has a substantial amount of missing data, with 687 missing values, accounting for 77% of the entire dataset. The high percentage of missing data might make this feature less reliable for predictive modeling unless we can infer missing cabins from other data or decide to exclude it from the analysis.
3. There are 2 missing values in the `Embarked` feature, a negligible amount of the data which should be relatively straightforward to handle, either by removal or imputation.
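
The counts and percentages quoted above can be reproduced with plain pandas. This is a minimal sketch on a hypothetical four-row frame; the helper name `missing_summary` is an assumption, and on the real data the call would be `missing_summary(raw_train)`.

```python
import numpy as np
import pandas as pd


def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Count and percentage of missing values per column, largest first."""
    summary = pd.DataFrame({
        'num_missing': df.isna().sum(),
        'perc_missing': (df.isna().mean() * 100).round(2),
    })
    return summary.sort_values('num_missing', ascending=False)


# Hypothetical toy frame, not the Titanic data.
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
})
toy_summary = missing_summary(toy)
print(toy_summary)
```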

Potential Outliers:
1. The `Fare` feature shows a considerable standard deviation (49.7) relative to its mean (32.2), and the maximum value is 512.3, which suggests the presence of outliers that could be distorting the overall distribution. These may need to be investigated to determine if they are legitimate values or anomalies due to errors or other factors.
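
One common heuristic for flagging such values is Tukey's interquartile-range rule. It is shown here only as an illustration and is not the method used in this notebook; the function name `iqr_fences` is hypothetical, and the quartiles are the `Fare` values from the `describe()` output above.

```python
def iqr_fences(q1: float, q3: float, k: float = 1.5):
    """Tukey's rule: values outside [q1 - k*IQR, q3 + k*IQR] are potential outliers."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# 25% and 75% quantiles of Fare in the train set.
lower, upper = iqr_fences(q1=7.91, q3=31.0)
print(round(lower, 1), round(upper, 1))  # -26.7 65.6
```

Under this rule the maximum fare of 512.33 lies far above the upper fence, which supports taking a closer look at the most expensive tickets.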

Feature Distributions:
1. The target variable `Survived` is imbalanced. With a mean of 0.38, fewer passengers in the training set survived (38%) than did not, which is an essential consideration for model training.
2. All three `Pclass` values, which are indicative of socio-economic status, are represented, but the distribution leans towards third class: the median is 3, so at least half of the passengers travelled third class.
3. `SibSp` and `Parch` have a right-skewed distribution, with most passengers having no siblings/spouses or parents/children aboard. However, there are passengers with as many as 8 siblings/spouses and 6 parents/children, which could be outliers or represent large families traveling together.
4. `Age` is roughly bell-shaped but slightly right-skewed: the mean (29.7) sits above the median (28), pulled up by a tail of older passengers reaching 80 years.

Categorical Features:
1. `Name` is unique to each passenger and `Ticket` has 681 distinct values across 891 rows, so neither is directly useful for machine learning models without feature engineering. For instance, titles extracted from names could provide information on social status or gender.
2. The `Sex` feature has two unique values and will need to be encoded into numerical form for model processing.
3. The `Cabin` data has many unique values with a high percentage of missing data, which complicates its use directly as a feature.

Data Integrity:
1. There are no missing values reported for several critical features such as `PassengerId`, `Survived`, `Pclass`, `Name`, `Sex`, `SibSp`, `Parch`, `Ticket`, and `Fare`, which is good for the integrity of those columns.

**A summary of correlations:**

Survivability correlations:
1. There is a moderate positive correlation between `Survived` and `Fare` (0.28), which suggests that passengers who paid higher fares had a better chance of surviving. This could be linked to the socio-economic status of passengers, where higher-paying passengers might have been given priority during the evacuation.
2. `Survived` is also moderately correlated with `Pclass` (0.34 in magnitude), indicating that passenger class had an impact on survival, with first-class passengers more likely to survive than those in third class. Because `Pclass` is coded 1 = 1st to 3 = 3rd, higher survival goes with lower `Pclass` values.
3. The strongest correlation with `Survived` is with `Sex` (0.54), showing that females had a much higher likelihood of survival compared to males, likely due to the "women and children first" protocol followed during the evacuation.

Socio-economic status:
1. `Fare` and `Pclass` have a strong negative correlation (-0.48), as expected, because first-class tickets were more expensive. This suggests that `Pclass` could be a proxy for socio-economic status and financial capability.
2. `Age` and `Pclass` are correlated (0.27 in magnitude), implying that older passengers tended to travel in higher classes, that is, with lower `Pclass` values.

Family and traveling companions:
1. There is a substantial positive correlation between `SibSp` (number of siblings/spouses aboard) and `Parch` (number of parents/children aboard) (0.45). This suggests that families tended to travel together on the Titanic.

Age-related correlations:
1. Both `SibSp` and `Parch` show a negative correlation with `Age`, -0.18 and -0.25 respectively. This might indicate that younger passengers were more likely to be traveling with siblings and parents.

Embarkation points:
1. The correlations involving `Embarked` are relatively low, suggesting that the port of embarkation does not have a strong linear relationship with other numerical variables in the dataset. However, considering that `Embarked` is a categorical variable, correlation coefficients may not fully capture the relationships with this feature.

It is important to note that correlation does not imply causation. High or low correlation coefficients indicate a possible association but do not confirm a direct cause-and-effect relationship. Additionally, for categorical variables like `Sex` and `Embarked`, which have been numerically encoded, the interpretation of correlations can be less intuitive and require careful analysis.
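
To illustrate why encoded categorical variables need care, the sketch below shows that the sign of a correlation with a binary feature depends purely on which category is coded as 1. The six-row sample is hypothetical and is not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy sample.
toy = pd.DataFrame({
    'sex': ['male', 'female', 'female', 'male', 'male', 'female'],
    'survived': [0, 1, 1, 0, 1, 1],
})

is_male = toy['sex'].map({'male': 1, 'female': 0})
is_female = 1 - is_male

# Pearson correlation of each 0/1 coding with the target.
corr_male = is_male.corr(toy['survived'])
corr_female = is_female.corr(toy['survived'])
print(round(corr_male, 3), round(corr_female, 3))
```

The two coefficients have the same magnitude and opposite signs, so only the magnitude carries information that is independent of the chosen coding.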

## Test data

Now let's have a look at test data.

Dataset sample:

In [61]:
raw_test.head(5)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


Dataset overall summary:

In [62]:
display(eda.df_info(raw_test))

Unnamed: 0,data_type,data_type_grp,num_unique_values,sample_unique_values,num_missing,perc_missing
PassengerId,int64,Numerical,418,"[892, 893, 894, 895, 896, 897, 898, 899, 900, ...",0,0.0
Pclass,int64,Numerical,3,"[3, 2, 1]",0,0.0
Name,object,Categorical,418,"[Kelly, Mr. James, Wilkes, Mrs. James (Ellen N...",0,0.0
Sex,object,Categorical,2,"[male, female]",0,0.0
Age,float64,Numerical,79,"[34.5, 47.0, 62.0, 27.0, 22.0, 14.0, 30.0, 26....",86,20.574163
SibSp,int64,Numerical,7,"[0, 1, 2, 3, 4, 5, 8]",0,0.0
Parch,int64,Numerical,8,"[0, 1, 3, 2, 4, 6, 5, 9]",0,0.0
Ticket,object,Categorical,363,"[330911, 363272, 240276, 315154, 3101298, 7538...",0,0.0
Fare,float64,Numerical,169,"[7.8292, 7.0, 9.6875, 8.6625, 12.2875, 9.225, ...",1,0.239234
Cabin,object,Categorical,76,"[nan, B45, E31, B57 B59 B63 B66, B36, A21, C78...",327,78.229665


Dataset numerical distributions:

In [63]:
display(round(raw_test.describe().T, 2))

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
PassengerId,418.0,1100.5,120.81,892.0,996.25,1100.5,1204.75,1309.0
Pclass,418.0,2.27,0.84,1.0,1.0,3.0,3.0,3.0
Age,332.0,30.27,14.18,0.17,21.0,27.0,39.0,76.0
SibSp,418.0,0.45,0.9,0.0,0.0,0.0,1.0,8.0
Parch,418.0,0.39,0.98,0.0,0.0,0.0,0.0,9.0
Fare,417.0,35.63,55.91,0.0,7.9,14.45,31.5,512.33


Detailed breakdown:

In [64]:
ProfileReport(raw_test).to_widgets()


**A summary of key observations that stand out:**

Missing values:
1. `Age` has missing values (86), which is approximately 20.6% of the test data. This is slightly higher than in the train dataset, which had about 19.9% missing values for `Age`.
2. `Cabin` has a large number of missing values (327), accounting for 78% of the test data, which is in line with the train dataset (77%).
3. `Fare` has 1 missing value, which is about 0.24% of the test data. In the train dataset, there were no missing values for Fare.

Data distribution:
1. The `Age` distribution seems similar, with the median age in the 20s. The mean age is slightly higher in the test set (30.3) compared to the train set (29.7), which could indicate a slightly older passenger demographic in the test set.
2. The `Fare` distribution in the test set has a similar range to the train set, but the mean is slightly higher (35.6 in the test set vs. 32.2 in the train set). This might suggest that the test set includes passengers who paid a bit more on average, but the median values are very close (14.5 in both), indicating a similar central tendency.
3. `Pclass`, `SibSp`, and `Parch` distributions seem to be consistent between the train and test datasets, indicating that these features have similar spreads and central tendencies.

PassengerId:
1. `PassengerId` in the test set runs from 892 to 1309, directly continuing the train dataset's range of 1 to 891.

Potential Issues for model generalization:
1. If the test data has a slightly different distribution from the train data (e.g., a higher average fare or different age distribution), the model trained on the train data might not generalize as well to the test data. It's important to consider these distribution differences when evaluating model performance.
2. Given that there's a missing value in `Fare`, it will need to be imputed for the test dataset before making predictions, ideally using a method consistent with how missing values were handled in the training dataset.

When preparing to apply a model to this test data, the key is to apply the same preprocessing steps as were used on the train dataset to maintain consistency. This includes handling missing values in `Age`, `Fare`, and `Cabin` in the same manner as the train dataset, and ensuring categorical variables are encoded similarly if a model requires numerical input.
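
A minimal sketch of this fit-on-train, apply-to-test pattern follows. For brevity it swaps in a median `SimpleImputer` rather than the `KNNImputer` used later in the notebook, and the two miniature frames are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical miniature train/test frames.
train = pd.DataFrame({'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05]})
test = pd.DataFrame({'Fare': [7.8292, np.nan, 9.6875]})

imputer = SimpleImputer(strategy='median')
imputer.fit(train)                     # statistics are learned from train only
test_filled = imputer.transform(test)  # the gap is filled with the train median

print(test_filled.ravel())  # the missing fare becomes 8.05, the train median
```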

# Data cleaning and preprocessing

As the next step we need to clean and prepare the data for ML. The data is already split into train and test, so to avoid data leakage every transformer is fitted on the train set only and then applied to both sets:

1. Drop columns that are not useful for the model: `PassengerId`.
2. Convert column names to `snake_case` and rename them to be more self-explanatory.
3. Create additional features using existing columns (e.g., split names or create age groups).
4. Fill in missing values in `Age` (both `train` and `test`) and `Fare` (`test` only) using the standard `KNNImputer`. For `Embarked` in `train` we can use `SimpleImputer` with the `most_frequent` strategy.
5. Standardise numerical features with `StandardScaler`.
6. Encode categorical features with `MEstimateEncoder`.
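
The age grouping in step 3 can be sketched with `pd.cut`, using the same bins and labels as the preprocessing function below. The sample ages are hypothetical. Note that the intervals are right-inclusive by default, so an age of exactly 10 falls into `child`, and a missing age stays missing.

```python
import numpy as np
import pandas as pd

# Hypothetical ages chosen to hit the bin edges.
ages = pd.Series([0.42, 10, 10.5, 20, 40, 80, np.nan])

age_group = pd.cut(
    ages,
    bins=[0, 10, 20, 40, 60, 100],
    labels=['child', 'teen', 'adult', 'middle-age', 'senior'],
)
print(age_group.tolist())
# ['child', 'child', 'teen', 'teen', 'adult', 'senior', nan]
```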

In [65]:
def preprocess_data(df: pd.DataFrame, columns: list) -> pd.DataFrame:
    """
    Perform preprocessing on Titanic dataset to prepare for machine learning analysis.

    The function processes a DataFrame by executing several steps:
    - Dropping the 'PassengerId' column as it is not needed for analysis.
    - Renaming columns to snake_case for Pythonic consistency.
    - Extracting 'ticket_number' from the 'ticket' column.
    - Extracting 'title', 'surname', and 'first_name' from the 'name' column for further analysis.
    - Categorizing 'age' into predefined groups for more granular analysis.
    - Computing 'family_size' by adding 'siblibngs_spouses_no' and 'parents_children_no', then adding 1 (for the passenger themself).
    - Creating a binary 'is_male' column from the 'sex' column.
    - Generating a 'class_sex' interaction feature to capture the interplay between class and gender.
    - Retaining only the specified relevant columns for further analysis.
    - Converting column data types to ensure proper format for machine learning algorithms.

    The 'Survived' column, if present, is maintained and its data type is converted to an integer
    for use as a label in classification tasks.

    Parameters:
    - df (pd.DataFrame): The input DataFrame containing passenger data from the Titanic dataset.
    - columns (list): A list of column names that should be kept in the final DataFrame. 

    Returns:
    - pd.DataFrame: A DataFrame with the preprocessing applied, ready for further analysis or as input to machine learning models.
    """
    columns_to_keep = columns.copy()

    columns_datatypes = {
        'is_male': 'int', 'is_alone': 'int', 'age': 'float',
        'siblibngs_spouses_no': 'int', 'parents_children_no': 'int',
        # 'ticket_number': 'int', 
        'fare': 'float'
    }
    
    if 'Survived' in df.columns:
        columns_to_keep.append('survived')
        columns_datatypes['survived'] = 'int'

    df = (
        df
        .copy()
        .drop(['PassengerId'], axis=1)
        .rename(columns={
            'Pclass': 'ticket_class', 'SibSp': 'siblibngs_spouses_no', 'Parch': 'parents_children_no',
        })
        .rename(columns=lambda column: snakecase(column))
        .assign(
            ticket_number=lambda df: df.ticket.str.extract(r'(\d+$)', expand=False).fillna('0'),
            title=lambda df: (
                df.name.str.extract(r',\s*([^\.]*)\.', expand=False)
                .replace(['Mlle', 'Ms', 'Mme', 'Master'], ['Miss', 'Miss', 'Mrs', 'Mr'])
                .replace(['Lady', 'the Countess', 'Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Rare')
            ),
            surname=lambda df: df.name.str.extract(r'(^[\w\s]+),', expand=False),
            first_name=lambda df: df.name.str.extract(r'\.\s+(.*)', expand=False),
            age_group=lambda df: pd.cut(
                df.age, bins=[0, 10, 20, 40, 60, 100], labels=['child', 'teen', 'adult', 'middle-age', 'senior']
            ),
            family_size=lambda df: df.siblibngs_spouses_no + df.parents_children_no + 1,
            is_alone=lambda df: df.family_size == 1,
            is_male=lambda df: df.sex.map({'male': 1, 'female': 0}),
            class_sex=lambda df: df.ticket_class * df.is_male
        )
        .loc[:, columns_to_keep]
        .astype(columns_datatypes)
    )    
    
    return df

Let's store both datasets in a single dictionary that we will use for ML.

In [66]:
columns_to_keep = [
    'title', # 'first_name', 'surname', 
    'age', 'age_group', 'is_male',
    'siblibngs_spouses_no', 'parents_children_no', 'family_size', 'is_alone',
    # 'ticket_number', 
    'ticket_class', 'embarked', 'fare', 'class_sex'
]

# Preprocess the train set once, then split it into features and target.
train_processed = preprocess_data(raw_train, columns_to_keep)

dct_splits = {
    'train': {
        'features': train_processed.drop('survived', axis=1),
        'target': train_processed[['survived']]
    },
    'test': {
        'features': preprocess_data(raw_test, columns_to_keep)
    }
}

display(dct_splits['train']['features'].head())

Unnamed: 0,title,age,age_group,is_male,siblibngs_spouses_no,parents_children_no,family_size,is_alone,ticket_class,embarked,fare,class_sex
0,Mr,22.0,adult,1,1,0,2,0,3,S,7.25,3
1,Mrs,38.0,adult,0,1,0,2,0,1,C,71.2833,0
2,Miss,26.0,adult,0,0,0,1,1,3,S,7.925,0
3,Mrs,35.0,adult,0,1,0,2,0,1,S,53.1,0
4,Mr,35.0,adult,1,0,0,1,1,3,S,8.05,3


Now we can impute missing values, scale numerical values and encode category values.

In [67]:
numerical_transformer = Pipeline(steps=[
    ('imputer', KNNImputer()),
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', MEstimateEncoder())
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, ['age', 'fare']),
        ('cat', categorical_transformer, ['embarked', 'title', 'age_group'])
    ],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessor', preprocessor),
])

pipeline.fit(
    dct_splits['train']['features'],
    dct_splits['train']['target']
)

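# ColumnTransformer outputs the 'num' columns first, then the 'cat' columns,
# then the passthrough remainder, so the labels have to follow that order.
transformed_columns = ['age', 'fare', 'embarked', 'title', 'age_group']
output_columns = transformed_columns + [
    column for column in columns_to_keep if column not in transformed_columns
]

dct_splits['train']['features'] = pd.DataFrame(
    pipeline.transform(dct_splits['train']['features']), columns=output_columns
)
dct_splits['test']['features'] = pd.DataFrame(
    pipeline.transform(dct_splits['test']['features']), columns=output_columns
)

display(dct_splits['train']['features'].head())

Unnamed: 0,age,fare,embarked,title,age_group,is_male,siblibngs_spouses_no,parents_children_no,family_size,is_alone,ticket_class,class_sex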
0,-0.572142,-0.502445,0.339079,0.187068,0.364803,1.0,1.0,0.0,2.0,0.0,3.0,3.0
1,0.626331,0.786845,0.552567,0.790424,0.364803,0.0,1.0,0.0,2.0,0.0,1.0,0.0
2,-0.272524,-0.488854,0.339079,0.700988,0.364803,0.0,0.0,0.0,1.0,1.0,3.0,0.0
3,0.401618,0.42073,0.339079,0.790424,0.364803,0.0,1.0,0.0,2.0,0.0,1.0,0.0
4,0.401618,-0.486337,0.339079,0.187068,0.364803,1.0,0.0,0.0,1.0,1.0,3.0,3.0


# Model selection and training

In this section we will go through `LogisticRegression`, `DecisionTreeClassifier`, `LGBMClassifier` and `CatBoostClassifier` and evaluate their performance. We will use `optuna` to optimise the hyperparameters.

Let's create a function that runs the optimisation for us.

In [84]:
def optimize_classifiers(ftr_train, tgt_train, n_trials: int):
    """
    Trains and optimizes classification models using Optuna.

    Args:
    - ftr_train (pd.DataFrame): Training features.
    - tgt_train (pd.Series): Training target.
    - n_trials (int): The number of trials for Optuna optimization.

    Returns:
    - An Optuna study object containing the optimal model and its parameters.
    """
    def get_classifier(trial):
        classifiers = {
            'LogisticRegression': LogisticRegression(
                penalty=trial.suggest_categorical('lr_penalty', ['l1', 'l2']),
                C=trial.suggest_float('lr_C', 0.1, 10.0),
                solver='liblinear',
                random_state=RANDOM_SEED
            ),
            'DecisionTreeClassifier': DecisionTreeClassifier(
                max_depth=trial.suggest_int('dt_max_depth', 1, 32),
                min_samples_split=trial.suggest_int('dt_min_samples_split', 2, 150),
                random_state=RANDOM_SEED
            ),
            'LGBMClassifier': LGBMClassifier(
                max_depth=trial.suggest_int('lgbm_max_depth', 30, 100),
                n_estimators=trial.suggest_int('lgbm_n_estimators', 100, 1000),
                learning_rate=trial.suggest_float('lgbm_learning_rate', 0.001, 0.1),
                random_state=RANDOM_SEED
            ),
            'CatBoostClassifier': CatBoostClassifier(
                iterations=trial.suggest_int('cb_iterations', 100, 600),
                learning_rate=trial.suggest_float('cb_learning_rate', 0.01, 0.3),
                depth=trial.suggest_int('cb_depth', 1, 10),
                silent=True,
                random_state=RANDOM_SEED
            )
        }
        classifier_name = trial.suggest_categorical('classifier', list(classifiers.keys()))
        return classifiers[classifier_name]

    def objective(trial):
        """
        Objective function for Optuna optimization. Computes the accuracy for a given classifier.

        Args:
        - trial (optuna.Trial):  A trial is a process of evaluating an objective function. This object
        is passed to an objective function and provides interfaces to suggest hyperparameters.

        Returns:
        - float: Accuracy of the classifier's predictions.
        """
        classifier_obj = get_classifier(trial)
        # Flatten the single-column target so scikit-learn receives a 1-D array.
        score = cross_val_score(
            classifier_obj, ftr_train, np.ravel(tgt_train), n_jobs=-1, cv=3, scoring='accuracy'
        ).mean()
        return score

    optuna.logging.set_verbosity(optuna.logging.WARNING)
    study = optuna.create_study(direction='maximize', sampler=optuna.samplers.TPESampler(seed=RANDOM_SEED))
    study.optimize(objective, n_trials=n_trials)

    return study
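
The objective above scores each candidate by its mean accuracy over 3 cross-validation folds. The sketch below shows the same `cross_val_score` call in isolation. It runs on synthetic data from `make_classification` (an assumption made to keep it self-contained) and wraps the scaler in a `Pipeline`, one way to make sure preprocessing is refit on the training part of every fold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic features.
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

model = Pipeline([
    ('scaler', StandardScaler()),  # refit on the training part of every fold
    ('classifier', LogisticRegression(max_iter=1000)),
])

# One accuracy value per fold; the objective would return their mean.
scores = cross_val_score(model, X, y, cv=3, scoring='accuracy')
print(scores.mean())
```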

Let's find the best classifier.

In [85]:
study = optimize_classifiers(
    dct_splits['train']['features'],
    dct_splits['train']['target'],
    n_trials=50
)

And check its parameters.

In [86]:
best_params = study.best_params
classifier_name = best_params['classifier']

param_prefixes = {
    'LogisticRegression': 'lr_',
    'DecisionTreeClassifier': 'dt_',
    'LGBMClassifier': 'lgbm_',
    'CatBoostClassifier': 'cb_'
}

# Filter the best_params to only include relevant parameters for the selected classifier
relevant_params = {k: v for k, v in best_params.items() if k.startswith(param_prefixes[classifier_name]) or k == 'classifier'}

formatted_params = "\n".join([f"  {key}: {value}" for key, value in relevant_params.items()])
print(f"Best params:\n{formatted_params}")

Best params:
  lgbm_max_depth: 91
  lgbm_n_estimators: 338
  lgbm_learning_rate: 0.010396975557648184
  classifier: LGBMClassifier
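
Because every parameter name carries a classifier prefix, the winning model's keyword arguments can be recovered by filtering on that prefix and stripping it. This is a small sketch; the helper name `params_for` is hypothetical, and the dictionary repeats the values printed above.

```python
def params_for(best_params: dict, prefix: str) -> dict:
    """Keep the entries whose key starts with `prefix` and drop the prefix."""
    return {
        key[len(prefix):]: value
        for key, value in best_params.items()
        if key.startswith(prefix)
    }


# Values copied from the printed output; a real study.best_params also holds
# the parameters suggested for the other classifiers.
best_params = {
    'lgbm_max_depth': 91,
    'lgbm_n_estimators': 338,
    'lgbm_learning_rate': 0.010396975557648184,
    'classifier': 'LGBMClassifier',
}

lgbm_kwargs = params_for(best_params, 'lgbm_')
print(lgbm_kwargs)
# {'max_depth': 91, 'n_estimators': 338, 'learning_rate': 0.010396975557648184}
```

The resulting dictionary could then be unpacked into the constructor, for example `LGBMClassifier(**lgbm_kwargs, random_state=RANDOM_SEED)`.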


The plots below will also show how models were optimised and which model performed best.

In [87]:
fig = optuna.visualization.plot_optimization_history(study)
fig.update_layout(
    legend=dict(orientation='h'),
    template='plotly_white',
    width=FIG_WIDTH, height=FIG_HEIGHT
)
fig.show()

fig = optuna.visualization.plot_slice(study, params=list(relevant_params))
fig.update_layout(
    legend=dict(orientation='h'),
    template='plotly_white',
    width=FIG_WIDTH, height=FIG_HEIGHT
)
fig.update_xaxes(tickangle=-90)
fig.show()

Finally, let's use the optimal model to predict if a person would survive the Titanic disaster.

In [81]:
# model = CatBoostClassifier(
#     iterations=study.best_params['cb_iterations'],
#     learning_rate=study.best_params['cb_learning_rate'],
#     depth=study.best_params['cb_depth'],
#     silent=True,
#     random_state=RANDOM_SEED
# )

model = LGBMClassifier(
    max_depth=study.best_params['lgbm_max_depth'],
    n_estimators=study.best_params['lgbm_n_estimators'],
    learning_rate=study.best_params['lgbm_learning_rate'],
    random_state=RANDOM_SEED
)

model.fit(dct_splits['train']['features'], dct_splits['train']['target'].values.ravel());

for dataset in ['train', 'test']:
    dct_splits[dataset]['prediction'] = pd.DataFrame(
        model.predict(dct_splits[dataset]['features']), columns=['survived']
    )

print(
    'Accuracy on training dataset:',
    round(accuracy_score(
        dct_splits['train']['target'], dct_splits['train']['prediction']
    ), 3)
)

Accuracy on training dataset: 0.827


Let's visualise the results.

In [82]:
df_temp = (
    pd.concat([
        dct_splits['train']['target'].assign(is_predicted='actual'),
        dct_splits['train']['prediction'].assign(is_predicted='predicted')
    ])
)

fig = px.bar(
    df_temp.groupby(['survived', 'is_predicted']).size().reset_index(name='count'),
    y='survived',
    x='count',
    color='is_predicted',
    barmode='group',
    orientation='h',
    template='plotly_white',
    width=FIG_WIDTH, height=FIG_HEIGHT
)

fig.update_layout(
    title_text='Actual vs predicted Split',
    xaxis_title_text='Count',
    yaxis_title_text='Survived',
    bargap=0.1,
    legend=dict(orientation='h')
)

fig.show()
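
A grouped bar chart compares class totals but does not show which individual passengers were misclassified. As a complementary sketch, a confusion matrix built with `pd.crosstab` pairs each actual label with its prediction. The six labels below are hypothetical.

```python
import pandas as pd

# Hypothetical labels for six passengers.
actual = pd.Series([0, 0, 1, 1, 1, 0], name='actual')
predicted = pd.Series([0, 1, 1, 0, 1, 0], name='predicted')

# Rows are actual classes, columns are predicted classes.
confusion = pd.crosstab(actual, predicted)
print(confusion)

# Correct predictions sit on the diagonal.
accuracy = (confusion.loc[0, 0] + confusion.loc[1, 1]) / confusion.to_numpy().sum()
```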

And save a Kaggle submission!

In [83]:
submission = pd.DataFrame({
    'PassengerId': raw_test.PassengerId,
    'Survived': dct_splits['test']['prediction']['survived']
})
# index=False keeps the file to the two required columns.
submission.to_csv('Submission.csv', index=False)

display(submission.head())

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1
