# Introduction

## Task

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

## Guide description

In this notebook we will go step by step through my solution. It covers reading the data, initial exploration, handling missing values, exploratory data analysis, feature engineering, and combining all the steps with classification models in a compact pipeline. At the end we will look at two ensemble techniques: stacking and voting classifiers.

In [None]:
# EDA imports
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_theme(style="whitegrid")

import warnings
warnings.filterwarnings('ignore')

Read the data

In [None]:
df = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

## Initial Data Exploration and preprocessing

When we read the data for the first time, we want to get to know it. This means looking at the feature values and their ranges, the data types, the missing values, summary statistics of the distributions, and so on. The first functions you'll usually see are `head()`, `info()` and `describe()`.

In [None]:
df.head()

A first look shows that the data will require some processing. Machine learning models usually require numerical features. We could drop the non-numerical features, but we would lose a lot of information. We will look at various techniques for dealing with this later.

In [None]:
df.describe()

The minimum `Age` of 0 seems suspicious. The ranges of the other numerical features seem okay, but their distributions are heavily skewed.

In [None]:
df['Age'].sort_values().hist(bins=int(df['Age'].max()))

Looking at the histogram of `Age`, we can see that 0 is either used as a placeholder for missing values or there were a lot of infants on board :).
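To put a number on what the histogram suggests, we can count the zero-age rows and compare them with the neighbouring ages. The sketch below runs on a small made-up `ages` Series (an assumption for illustration, not the competition data); in the notebook the same calls would be applied to `df['Age']`.

```python
import pandas as pd

# Toy ages, invented for illustration; in the notebook this would be df['Age'].
ages = pd.Series([0, 0, 0, 1, 2, 2, 18, 24, 24, 35, None])

# Count how often each age occurs, youngest first (NaN is excluded by default).
age_counts = ages.value_counts().sort_index()

# Share of the non-missing rows that are exactly 0.
zero_share = (ages == 0).sum() / ages.notna().sum()

print(age_counts.head())
print(f"share of Age == 0: {zero_share:.2f}")
```

If the count at 0 is far above the counts at ages 1 and 2, a placeholder value becomes the more likely explanation.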

In [None]:
df.info()

We can see that some of the values are missing. Let's explore those first.

## Missing values

In [None]:
df.isna().sum()

In [None]:
print(df.isna().any(axis=1).sum(), 'rows have at least 1 value equal to NaN.')

In [None]:
sns.heatmap(df[df.isna().any(axis=1)==True].isna().astype(int))

From the data description we know that passengers in cryosleep are confined to their cabins. Therefore their spending on services (`RoomService`, `FoodCourt`, etc.) should be 0.

In [None]:
df[df['CryoSleep']==True].describe()
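A more direct check than scanning the `describe()` table is to take the maximum over all service columns for passengers in cryosleep. The sketch below uses a small invented `toy` DataFrame with the same column names; the values are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

service_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Toy records, invented for illustration; the real check would run on df.
toy = pd.DataFrame({
    'CryoSleep': [True, True, False, False],
    'RoomService': [0.0, np.nan, 120.0, 0.0],
    'FoodCourt': [0.0, 0.0, 30.0, np.nan],
    'ShoppingMall': [0.0, 0.0, 0.0, 15.0],
    'Spa': [np.nan, 0.0, 5.0, 0.0],
    'VRDeck': [0.0, 0.0, 0.0, 80.0],
})

# Largest recorded spend among cryosleep passengers (NaN is skipped by max()).
in_cryo = toy['CryoSleep'] == True
max_cryo_spend = toy.loc[in_cryo, service_cols].max().max()
print(max_cryo_spend)
```

A maximum of 0 means no cryosleep passenger has any recorded spending, which is what the hypothesis predicts.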

Since the hypothesis is confirmed, we can use this information to fill in the missing service values of passengers in cryosleep.

In [None]:
service_cols = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
df_preprocess = df.copy()
df_preprocess.loc[
    (df_preprocess['CryoSleep']==True) &
    (df_preprocess[service_cols].isna().any(axis=1))
    ,service_cols]=0

Additionally we can take a look at how passengers spend w.r.t. their age.

In [None]:
for service in service_cols:
    sns.scatterplot(df_preprocess, x='Age',y=service)

In [None]:
df_preprocess.groupby('Age')[service_cols].sum().max(axis=1).loc[:15]

We can see that passengers younger than 13 years old do not spend any money on services. Therefore we can fill missing values for service spendings for passengers younger than 13 years old with 0.

In [None]:
df_preprocess.loc[(df_preprocess['Age'] < 13) & (df_preprocess[service_cols].isna().any(axis=1)),service_cols] = 0

### Summary - missing values

* We have filled missing values where the value could be determined from the data.
* For the other values we have a couple of options:
    * Impute the mean, mode or median of the feature, or a set constant.
    * Use kNN to impute the value from the nearest neighbors' values.
    * Train a classification or regression model for imputing.
* We will try the first 2 options later, as they have to be fitted only on the training part of the dataset to avoid data leakage.
* By filling in the values that could be determined from the data, we have reduced the number of rows with at least 1 missing value by around 300.

In [None]:
print(df_preprocess.isna().any(axis=1).sum(), 'rows have at least 1 value equal to NaN.')
sns.heatmap(df_preprocess[df_preprocess.isna().any(axis=1)==True].isna().astype(int))
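The first two imputation options from the summary can be sketched as follows. This is a minimal example on invented arrays (`X_tr` and `X_val` are hypothetical names, not variables from this notebook); it shows that the imputers are fitted on the training rows only and then applied to held-out rows.

```python
import numpy as np
from sklearn.impute import KNNImputer, SimpleImputer

# Toy numeric matrices with gaps, invented for illustration.
X_tr = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [np.nan, 40.0]])
X_val = np.array([[np.nan, 20.0], [2.5, np.nan]])

# Option 1: a column statistic learned from the training rows only.
median_imp = SimpleImputer(strategy='median').fit(X_tr)
X_val_median = median_imp.transform(X_val)

# Option 2: average over the nearest training rows
# (distances are computed on the coordinates both rows have).
knn_imp = KNNImputer(n_neighbors=2).fit(X_tr)
X_val_knn = knn_imp.transform(X_val)

print(median_imp.statistics_)
print(X_val_median)
print(X_val_knn)
```

Because `fit` only ever sees `X_tr`, nothing about the validation rows leaks into the imputed values.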

## EDA and Feature engineering

Now that we have (sort of) dealt with the missing values, we can move on with the EDA. In this section we will take a look at certain features w.r.t. our target feature `Transported`. From the insights we find, we will try to create new features that could help our future model with classification.
____

"Sort of" because not all missing values are filled yet. We are going to use some of the techniques mentioned above, but for that we first have to split our data into training and validation sets so that we do not leak any information. We will do that as the last step before modelling.

Before we explore further, we can create some new features: `TotalSpend`, the sum of all expenses of a passenger, and `Spent`, a boolean feature indicating whether the passenger spent any money. As we saw, the distributions of the service features are heavily skewed: the richest passengers spent a lot, but the majority did not spend any money. We will therefore also create log-transformed features `{service}_log`.

In [None]:
df_preprocess = (df_preprocess
    .assign(
        TotalSpend=lambda x: x[service_cols].sum(axis=1),
        Spent=lambda x: (x['TotalSpend'] > 0).astype(int),
        TotalSpend_log=lambda x: np.log1p(x['TotalSpend']),
        # Bind s as a default argument; otherwise every lambda would use the last column
        **{f'{s}_log': lambda x, s=s: np.log1p(x[s]) for s in service_cols}
    )
)
df_preprocess.head()
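To see why the logarithmic transformation helps with skewed spending, here is a small sketch on invented values (the `spend` Series is an assumption for illustration). It compares the sample skewness before and after the transformation.

```python
import numpy as np
import pandas as pd

# Toy spend values, invented for illustration: mostly zeros plus a few big spenders.
spend = pd.Series([0, 0, 0, 0, 5, 20, 150, 4000, 12000], dtype=float)

# log1p(x) computes log(1 + x), so zero spend maps to exactly 0.
spend_log = np.log1p(spend)

print(f"skew before: {spend.skew():.2f}")
print(f"skew after:  {spend_log.skew():.2f}")
```

The `+1` inside the logarithm matters here because most passengers have a spend of 0, where the plain logarithm is undefined.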

Now we will look at distributions of categorical features w.r.t. target variable.

In [None]:
for c in df_preprocess.select_dtypes(object).columns.tolist():
    if df_preprocess[c].unique().size < 10:
        sns.countplot(df_preprocess, x=c, hue='Transported')
        plt.show()

We can do the same for the `Age` feature.

In [None]:
sns.histplot(df_preprocess, x='Age', hue='Transported')

There is a pretty big difference between the distributions for passengers under the age of 13. As we saw earlier, these passengers did not spend any money on board. Let's take a look at the distributions of the categorical features w.r.t. the target feature only for passengers under the age of 13.

In [None]:
under_13 = df_preprocess[df_preprocess['Age'] < 13]
for c in under_13.select_dtypes(object).columns.tolist():
    # Only plot low-cardinality features, checked on the same subset that is plotted
    if under_13[c].unique().size < 10:
        sns.countplot(under_13, x=c, hue='Transported')
        plt.show()

From the plots we can see that almost everyone under the age of 13 who does not come from Earth got transported. We can create features for that: first a binary `AgeLimit` feature for passengers under 13, and then a binary `UnderageEarth` feature for passengers under 13 who come from Earth.

We also see that passengers under that age limit are never VIP. We can check whether there are any missing `VIP` values for underage passengers and fill them if that is the case.

In [None]:
df_preprocess[(df_preprocess['Age'] < 13) & (df_preprocess['VIP'].isna())].shape[0]

In [None]:
df_preprocess.loc[(df_preprocess['Age'] < 13) & (df_preprocess['VIP'].isna()), 'VIP'] = False
df_preprocess = (df_preprocess
    .assign(
        AgeLimit=lambda x: (x['Age']<13).astype(int),
        UnderageEarth=lambda x: ((x['Age']< 13) & (x['HomePlanet'] == 'Earth')).astype(int)
    )
)
df_preprocess.head()
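A general pitfall with boolean masks of this kind: in Python `&` binds more tightly than `<`, so each comparison needs its own pair of parentheses. The sketch below uses an invented `toy` DataFrame (an assumption for illustration) to show that the two spellings select different rows.

```python
import numpy as np
import pandas as pd

# Toy data, invented for illustration.
toy = pd.DataFrame({
    'Age': [5.0, 40.0, 8.0, 0.0],
    'VIP': [np.nan, np.nan, False, np.nan],
})

# Intended mask: young passengers whose VIP flag is missing.
intended = (toy['Age'] < 13) & (toy['VIP'].isna())

# Without the parentheses Python evaluates `13 & toy['VIP'].isna()` first,
# so Age is no longer compared with 13.
unintended = toy['Age'] < 13 & toy['VIP'].isna()

print(intended.tolist())
print(unintended.tolist())
```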

Now we are going to focus on extracting features from the given ones. `PassengerId` can be split into the passenger group and the passenger's number within the group. The group ID could be a useful feature, so we will add it together with the group size. After that we will split the `Cabin` feature into the `Deck`, `Num` and `Side` features.

In [None]:
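def process_cabin(cabin: str) -> pd.Series:
    # Cabin has the format deck/num/side; missing cabins yield NaN in all three parts
    if not pd.isna(cabin):
        deck, num, side = cabin.split('/')
    else:
        # np.nan is used because the np.NaN alias was removed in NumPy 2.0
        deck, num, side = np.nan, np.nan, np.nan
    return pd.Series({'Deck':deck, 'Num':float(num), 'Side':side})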

def process_passengerid(passengerid: str) -> int:
    group, _ = passengerid.split('_')
    return int(group)

df_preprocess[['Deck','Num','Side']] = df_preprocess['Cabin'].apply(process_cabin)
df_preprocess['Side'] = df_preprocess['Side'].map({'P':0.0, 'S':1.0})
df_preprocess['Group'] = df_preprocess['PassengerId'].apply(process_passengerid)
df_preprocess['GroupLen'] = df_preprocess.groupby('Group')['Group'].transform('size')
df_preprocess.head()
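The same features can also be extracted without a row-wise `apply`, using pandas string methods. This is an alternative sketch, not the approach used in the cell above, and the `toy` DataFrame is invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy records, invented for illustration, in the same format as the competition data.
toy = pd.DataFrame({
    'PassengerId': ['0001_01', '0002_01', '0002_02', '0003_01'],
    'Cabin': ['B/0/P', 'F/12/S', np.nan, 'G/3/S'],
})

# Split 'deck/num/side' into three columns; missing cabins stay missing.
cabin_parts = toy['Cabin'].str.split('/', expand=True)
cabin_parts.columns = ['Deck', 'Num', 'Side']
cabin_parts['Num'] = pd.to_numeric(cabin_parts['Num'])
toy = toy.join(cabin_parts)

# The part before the underscore identifies the travel group.
toy['Group'] = toy['PassengerId'].str.split('_').str[0].astype(int)
toy['GroupLen'] = toy.groupby('Group')['Group'].transform('size')
print(toy)
```

On a dataset of this size the speed difference is small, so the choice is mostly a matter of readability.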

In this guide I will skip feature engineering on the `Name` feature. One could split it into a first and a last name and then possibly use a label encoder to map the values to numbers for additional information.
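For readers who want to try the `Name` idea anyway, a minimal sketch could look like this. The names are invented, and `pd.factorize` is used here in place of scikit-learn's `LabelEncoder` because it handles missing values directly; both choices are assumptions for illustration, not part of my solution.

```python
import pandas as pd

# Toy names, invented for illustration, in 'First Last' format.
names = pd.Series(['Ana Reyes', 'Bo Chen', 'Cy Reyes', 'Dee Okoro', None])

# Split on the first space only; missing names propagate as missing values.
name_parts = names.str.split(' ', n=1, expand=True)
name_parts.columns = ['FirstName', 'LastName']

# factorize assigns an integer code per distinct surname and -1 to missing values.
last_name_code, last_name_index = pd.factorize(name_parts['LastName'])
name_parts['LastNameCode'] = last_name_code
print(name_parts)
```

Passengers sharing a surname get the same code, which could act as a rough family indicator.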

Additionally, before moving to the next step, we will drop the redundant columns.

In [None]:
df_preprocess = df_preprocess.drop(columns=['Cabin','PassengerId','Name'])

## Modelling and Pipeline composition

In this section we will move on to modelling. Now that we have done the majority of the EDA, we will convert all the preprocessing code into functions and classes to make it more readable and maintainable.

Some of the code applies to the data before imputing and some after, so we will split our pipeline into a few steps.

In [None]:
from sklearn.base import BaseEstimator, TransformerMixin

class PreprocessBeforeImputing(BaseEstimator, TransformerMixin):
    def __init__(self, service_cols) -> None:
        super().__init__()
        self.service_cols = service_cols

    def fit(self, X:pd.DataFrame, y=None):
        return self

    def transform(self, X:pd.DataFrame, y=None)->pd.DataFrame:
        X = X.copy()

        # Impute values based on data relationships
        X.loc[(X['CryoSleep']==True) & (X[self.service_cols].isna().any(axis=1)),self.service_cols]=0
        X.loc[(X['Age'] < 13) & (X[self.service_cols].isna().any(axis=1)),self.service_cols]=0
        X.loc[(X['Age'] < 13) & (X['VIP'].isna()),'VIP'] = False

        # Extract features from PassengerId
        X.loc[:,'Group'] = X['PassengerId'].apply(self.process_passengerid)
        X['GroupLen'] = X.groupby('Group')['Group'].transform('size')
        X = X.drop(columns=['PassengerId','Name'])
        
        return X

    def process_passengerid(self, passengerid: str) -> int:
        group, _ = passengerid.split('_')
        return int(group)
        

class PreprocessAfterImputing(BaseEstimator, TransformerMixin):
    def __init__(self, service_cols) -> None:
        super().__init__()
        self.service_cols = service_cols
    
    def fit(self, X:pd.DataFrame, y=None):
        return self
    
    def transform(self, X:pd.DataFrame, y=None) -> pd.DataFrame:
        X = X.copy()

        # Create features based on Age
        X.loc[:,'AgeLimit'] = (X['Age']<13).astype(int)
        X.loc[:,'UnderageEarth'] = ((X['Age']< 13) & (X['HomePlanet'] == 'Earth')).astype(int)
        
        # Feature extraction from Cabin
        X.loc[:,['Deck','Num','Side']] = X['Cabin'].apply(self.process_cabin)
        X = X.drop(columns=['Cabin'])
        X['Side'] = X['Side'].replace({'P':0.0, 'S':1.0})
        X['Num'] = X['Num'].astype(int)
        
        # Features from services
        X['TotalSpend'] = X[self.service_cols].sum(axis=1)
        X['Spent'] = (X['TotalSpend'] > 0).astype(float)
        for col in self.service_cols:
            X[f'{col}_log']=np.log(X[col]+1)
        X['TotalSpend_log'] = np.log(X['TotalSpend']+1)
        
        return X
    
    def process_cabin(self, cabin: str) -> pd.Series:
        if not pd.isna(cabin):
            deck, num, side = cabin.split('/')
        else:
            deck, num, side = np.nan, np.nan, np.nan
        return pd.Series({'Deck':deck, 'Num':float(num), 'Side':side})
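The pattern behind both classes can be shown on a much smaller, hypothetical transformer. `AddTotal` below is not part of the solution; it only illustrates what the two base classes provide: `fit_transform` comes from `TransformerMixin`, while `get_params` and support for `clone` come from `BaseEstimator`, as long as `__init__` stores its arguments unchanged.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone

class AddTotal(TransformerMixin, BaseEstimator):
    """Hypothetical stateless transformer that sums a list of columns."""

    def __init__(self, cols):
        self.cols = cols  # stored unchanged so get_params() and clone() work

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        X = X.copy()  # never modify the caller's DataFrame
        X['Total'] = X[self.cols].sum(axis=1)
        return X

toy = pd.DataFrame({'a': [1.0, 2.0], 'b': [3.0, None]})
step = AddTotal(cols=['a', 'b'])
out = step.fit_transform(toy)   # provided by TransformerMixin
copy_of_step = clone(step)      # what a hyperparameter search does for each fit
print(out)
print(copy_of_step.get_params())
```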

Now let's start putting everything together.

First we import everything needed to build the pipeline and call `set_config(transform_output='pandas')` so that the transformers output a `pd.DataFrame`.

In [None]:
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, KFold
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.feature_selection import RFECV
from sklearn import set_config
set_config(transform_output='pandas')


Read the data again and separate the features from the labels.

In [None]:
target = 'Transported'

# Read data
df_train = pd.read_csv('data/train.csv')
df_test = pd.read_csv('data/test.csv')

# Separate training data and labels
X_train = df_train.drop(columns=[target])
y_train = df_train[target].astype(float)
X_test = df_test.copy()

Next, we will create a `ColumnTransformer` that imputes all the missing values. It consists of one imputer for the numerical features and one for the categorical features. After that we will create a second `ColumnTransformer` with a `OneHotEncoder` that encodes all categorical features.

In [None]:
impute = ColumnTransformer([
    ('numimp',KNNImputer(),make_column_selector(dtype_include=np.number)),
    ('catimp',SimpleImputer(strategy='most_frequent'),make_column_selector(dtype_include=object)),
    ], verbose_feature_names_out=False)

ohe = ColumnTransformer([(
        'ohe',
        OneHotEncoder(sparse_output=False, drop='if_binary', handle_unknown='infrequent_if_exist'),
        make_column_selector(dtype_include=object)
    )], remainder='passthrough', verbose_feature_names_out=False
)

Now for the last step: putting together the entire pipeline. We will use `RandomForestClassifier` as the first model.

In [None]:
service_cols = [
    'RoomService',
    'FoodCourt',
    'ShoppingMall',
    'Spa',
    'VRDeck'
]

pipe = Pipeline(steps=[
    ('bp', PreprocessBeforeImputing(service_cols=service_cols)),
    ('imp', impute),
    ('ap', PreprocessAfterImputing(service_cols=service_cols)),
    ('ohe', ohe),
    ('clf', RandomForestClassifier())
])

One of the advantages of sklearn's pipelines is the ability to optimize hyperparameters over the entire pipeline. In the next step we will create a hyperparameter grid and search for the best hyperparameters of the chosen classifier together with the best imputing technique. Another advantage of the pipeline is that we do not have to worry about data leakage: even in the cross-validation setting, all the steps of the pipeline are fitted only on the training folds.

In [None]:
param_grid = {
    'clf__n_estimators': [100],
    'clf__max_depth': list(range(3,17)),
    'clf__min_samples_split': [2**i for i in range(1,7)],
    'clf__min_samples_leaf': [2**i for i in range(0,7)],
    'imp__numimp':[SimpleImputer(strategy='mean'), SimpleImputer(strategy='median'), KNNImputer()],
}

kf = KFold(n_splits=5)
est = RandomizedSearchCV(pipe, param_grid, scoring='accuracy', cv=kf, n_iter=100, n_jobs=8, verbose=1)
est.fit(X_train, y_train)
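The naming scheme used in the parameter grid can be sketched on a smaller example. The data below is synthetic and `toy_pipe`, `toy_params` and `search` are hypothetical names; the point is only that `'<step>__<param>'` sets a parameter of a step, while a bare step name replaces the whole step.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.model_selection import KFold, RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data with roughly 10% of the values knocked out, for illustration only.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
rng = np.random.default_rng(0)
X[rng.random(X.shape) < 0.1] = np.nan

toy_pipe = Pipeline([
    ('imp', SimpleImputer()),
    ('clf', RandomForestClassifier(n_estimators=20, random_state=0)),
])

# '<step>__<param>' sets a parameter of a step; a bare step name swaps the whole step.
toy_params = {
    'imp': [SimpleImputer(strategy='mean'), SimpleImputer(strategy='median'), KNNImputer()],
    'clf__max_depth': [3, 5, 8],
}

search = RandomizedSearchCV(
    toy_pipe, toy_params, n_iter=5, scoring='accuracy',
    cv=KFold(n_splits=3, shuffle=True, random_state=0), random_state=0,
)
search.fit(X, y)  # each candidate is refitted on the training folds only
print(search.best_params_)
print(round(search.best_score_, 3))
```

Nested names such as `imp__numimp` in the cell above follow the same rule, applied once more inside the `ColumnTransformer`.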

Let's see how the model performed with different imputing strategies.

In [None]:
res = pd.DataFrame(est.cv_results_)
res['imp'] = [str(i) for i in res['param_imp__numimp'].values]

In [None]:
sns.boxplot(res,x='mean_test_score', y='imp')

Based on the previous results we will use `SimpleImputer` with `strategy='mean'`.

In [None]:
impute = ColumnTransformer([
    ('numimp',SimpleImputer(strategy='mean'),make_column_selector(dtype_include=np.number)),
    ('catimp',SimpleImputer(strategy='most_frequent'),make_column_selector(dtype_include=object)),
    ], verbose_feature_names_out=False)

pipe = Pipeline(steps=[
    ('bp', PreprocessBeforeImputing(service_cols=service_cols)),
    ('imp', impute),
    ('ap', PreprocessAfterImputing(service_cols=service_cols)),
    ('ohe', ohe),
])

X_train = pipe.fit_transform(X_train, y_train)
X_test = pipe.transform(X_test)

Now that we have a pipeline for data processing, we can fully move on to modelling and predictions. We will use `RandomForestClassifier` and various boosted-tree algorithms. We will train many instances of each algorithm and check the performance of the best instance on the test data. Then we will build stacking and voting classifier ensembles to see whether they perform better.

In [None]:
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import GradientBoostingClassifier, HistGradientBoostingClassifier

def optimize_model(model, X_train, y_train, *, param_grid, n_iter):
    kf = KFold()
    est = RandomizedSearchCV(
            model,
            param_grid,
            scoring='accuracy',
            cv=kf,
            n_iter=n_iter,
            n_jobs=-1,
            verbose=0
    )
    est.fit(X_train, y_train)
    print(model)
    print(est.best_score_)
    return est

def make_predictions_and_write_csv(model, X_test, df_test, filename):
    y = model.predict(X_test)
    pd.concat(
        [df_test['PassengerId'],pd.Series(y,name='Transported',dtype=bool)],
        axis=1
    ).to_csv(f'{filename}.csv', index=False)


In [None]:
pg = {
    'n_estimators': [100, 200],
    'max_depth': list(range(3,17)),
    'min_samples_split': [2**i for i in range(1,7)],
    'min_samples_leaf': [2**i for i in range(0,7)],
}

gb_pg = {
    'n_estimators' :[100, 200],
    'max_depth' : list(range(3,17)),
    'max_leaves' : [2**i for i in range(4,8)],
    'learning_rate':[0.1],
}

lgb_pg = {
    'n_estimators' :[100, 200],
    'max_depth' : list(range(3,20)),
    'num_leaves' : [2**i for i in range(4,10)],
    'learning_rate':[0.1],
    'min_child_samples' : [2**i for i in range(2,7)],
}

g_pg = {
    'n_estimators': [100, 200],
    'learning_rate': [0.1],
    'max_depth': list(range(3,17)),
    'min_samples_split': [2**i for i in range(1,7)],
    'min_samples_leaf': [2**i for i in range(0,7)],
}

hist_pg = {
    'learning_rate':[0.1],
    'max_iter':[100, 200],
    'max_leaf_nodes':[2**i for i in range(4,10)],
    'max_depth':list(range(3,20)),
    'min_samples_leaf':[2**i for i in range(2,7)],
}

In [None]:
models = [
    RandomForestClassifier(n_jobs=1),
    xgb.XGBClassifier(n_jobs=1,),
    lgb.LGBMClassifier(n_jobs=1, objective='binary'),
    GradientBoostingClassifier(),
    HistGradientBoostingClassifier(),
]
pgs = [
    pg,
    gb_pg,
    lgb_pg,
    g_pg,
    hist_pg
]

counts = [3,2,5,2,5]
abbrs = ['rf','xgb','lgb','gbc','hgbc']
n_iters = [50, 35, 50, 35, 50]

ests = []
for model, grid, count, abbr, n_iter in zip(models, pgs, counts, abbrs, n_iters):
    for j in range(count):
        # Every repeat runs a fresh randomized search for the same model type
        search = optimize_model(
            model,
            X_train,
            y_train,
            param_grid=grid,
            n_iter=n_iter
        )
        ests.append((f'{abbr}_{j}', search.best_estimator_))

## Final Predictions

The best single estimator was an `LGBMClassifier`, with a leaderboard score of 0.80102.

In [None]:
make_predictions_and_write_csv(ests[5][1], X_test, df_test, 'preds_best')

The stacking classifier achieved a leaderboard score of 0.7912.

The voting classifier achieved a leaderboard score of 0.80289.

In [None]:
from sklearn.ensemble import StackingClassifier, VotingClassifier

sc = StackingClassifier(
    ests,
    final_estimator=LogisticRegression(),
    n_jobs=-1)
sc.fit(X_train, y_train)
make_predictions_and_write_csv(sc, X_test, df_test, 'preds_sc')

vc = VotingClassifier(
    ests,
    voting='hard',
    n_jobs=-1)
vc.fit(X_train, y_train)
make_predictions_and_write_csv(vc, X_test, df_test, 'preds_vc')
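The difference between the two ensembles can be seen on a small synthetic problem. The sketch below is independent of the competition data, and the base models are arbitrary choices for illustration: hard voting takes the majority label of the base models, while stacking trains a final estimator on their cross-validated predictions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier, RandomForestClassifier,
                              StackingClassifier, VotingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

base = [
    ('rf', RandomForestClassifier(n_estimators=30, random_state=0)),
    ('gbc', GradientBoostingClassifier(n_estimators=30, random_state=0)),
    ('lr', LogisticRegression(max_iter=1000)),
]

# Hard voting: every base model casts one vote and the majority label wins.
vote = VotingClassifier(base, voting='hard').fit(X_tr, y_tr)

# Stacking: a final estimator is trained on out-of-fold predictions of the base models.
stack = StackingClassifier(base, final_estimator=LogisticRegression()).fit(X_tr, y_tr)

print('voting  :', vote.score(X_val, y_val))
print('stacking:', stack.score(X_val, y_val))
```

Both ensembles refit clones of the base models, so passing already tuned estimators, as done above with `ests`, reuses their hyperparameters but not their fitted state.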