---
---

# Spaceship Titanic (Kaggle competition)

Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

<div style='text-align: center;'>
    <img src='../imgs/spaceshipTitanic.jpg' alt='Spaceship Titanic' width='50%'/>
</div>

---
## Feature Information:

|Variable|Description|
|-|-|
|`PassengerId`|A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always|
|`HomePlanet`|The planet the passenger departed from, typically their planet of permanent residence|
|`CryoSleep`|Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins|
|`Cabin`|The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard|
|`Destination`|The planet the passenger will be debarking to|
|`Age`|The age of the passenger|
|`VIP`|Whether the passenger has paid for special VIP service during the voyage|
|`RoomService`|Amount the passenger has billed for room service|
|`FoodCourt`|Amount the passenger has billed at the food court|
|`ShoppingMall`|Amount the passenger has billed at the shopping mall|
|`Spa`|Amount the passenger has billed at the spa|
|`VRDeck`|Amount the passenger has billed at the VR deck|
|`Name`|The first and last name of the passenger|
|`Transported`|Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict|
  
## Submission Info:

A submission file in the correct format:
 - `PassengerId`: Id for each passenger in the test set.
 - `Transported`: The target. For each passenger, predict either True or False.
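As a minimal sketch of what such a file could look like, the snippet below builds a two-column DataFrame and renders it as CSV. The passenger IDs and predictions are made-up placeholders, not real model output.

```python
import pandas as pd

# Hypothetical test-set IDs and classifier output, for illustration only
passenger_ids = ["0013_01", "0018_01", "0019_01"]
predictions = [1, 0, 1]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    # Cast to bool so the CSV holds True/False rather than 1/0
    "Transported": pd.Series(predictions).astype(bool),
})

# index=False keeps the file to exactly the two required columns
csv_text = submission.to_csv(index=False)
print(csv_text)
```

In practice the same call with a file path, `submission.to_csv(path, index=False)`, writes the file to disk.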

## Metric Info:

Submissions are evaluated based on their classification accuracy, the percentage of predicted labels that are correct.
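To make the metric concrete, here is a small sketch on made-up labels. It computes accuracy by hand with NumPy; `accuracy_score` from sklearn computes the same quantity.

```python
import numpy as np

# Toy ground truth and predictions (not competition data)
y_true = np.array([True, False, True, True])
y_pred = np.array([True, False, False, True])

# Accuracy is the share of positions where prediction and truth agree
accuracy = float(np.mean(y_true == y_pred))
print(f"Accuracy: {accuracy:.2%}")
```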

---
---
## Introduction

This notebook aims to:
1. Carry out a full EDA of the Spaceship Titanic train dataset
    - Univariate
    - Bivariate
    - Multivariate
2. Include relevant feature engineering
3. Develop machine learning models (with sklearn)
    - Classification algorithms
    - Try using VotingClassifier and StackingClassifier
4. Produce a valid submission for the Kaggle competition

In [1]:
# ------------------------------------------------------------------------------------------------------------------------------------------------------------
# Data Handling and Processing
import numpy as np
import pandas as pd
import math
from sklearn.impute import KNNImputer
from sklearn.utils import shuffle
from sklearn.preprocessing import OneHotEncoder, StandardScaler, PolynomialFeatures, PowerTransformer, FunctionTransformer
# ------------------------------------------------------------------------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------------------------------------------------------------------------
# Visualization 
import matplotlib.pyplot as plt
import seaborn as sns
import viztoolz as viz
import mltoolz as mlt
# ------------------------------------------------------------------------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------------------------------------------------------------------------
# Model Selection, Metrics & Evaluation
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier
# ------------------------------------------------------------------------------------------------------------------------------------------------------------

# ------------------------------------------------------------------------------------------------------------------------------------------------------------
# Pipeline Construction 
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
# ------------------------------------------------------------------------------------------------------------------------------------------------------------

---
## Load Datasets

Separate train and test datasets are provided.

In [2]:
train = pd.read_csv('../data/raw/train.csv')
test = pd.read_csv('../data/raw/test.csv')

print('-'*16)
print(f'Train Set Shape:\n{train.shape}')
print('-'*16)
print(f'Test Set Shape:\n{test.shape}')
print('-'*16)

----------------
Train Set Shape:
(8693, 14)
----------------
Test Set Shape:
(4277, 13)
----------------


---
## Train Dataset Info Review:

In [3]:
mlt.describe_and_suggest(train)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns)
dtypes: object(7), float64(6), bool(1)
memory usage: 3461.6 KB

Total Percentage of Null Values: 26.73%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|PassengerId|object|8693|0|0.0|8693|100.0|Categorical|
|HomePlanet|object|8492|201|2.31|3|0.03|Categorical|
|CryoSleep|object|8476|217|2.5|2|0.02|Binary|
|Cabin|object|8494|199|2.29|6560|75.46|Categorical|
|Destination|object|8511|182|2.09|3|0.03|Categorical|
|Age|float64|8514|179|2.06|80|0.92|Numerical Discrete|
|VIP|object|8490|203|2.34|2|0.02|Binary|
|RoomService|float64|8512|181|2.08|1273|14.64|Numerical Continuous|
|FoodCourt|float64|8510|183|2.11|1507|17.34|Numerical Continuous|
|ShoppingMall|float64|8485|208|2.39|1115|12.83|Numerical Continuous|


In [4]:
train.head()

| |PassengerId|HomePlanet|CryoSleep|Cabin|Destination|Age|VIP|RoomService|FoodCourt|ShoppingMall|Spa|VRDeck|Name|Transported|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|0001_01|Europa|False|B/0/P|TRAPPIST-1e|39.0|False|0.0|0.0|0.0|0.0|0.0|Maham Ofracculy|False|
|1|0002_01|Earth|False|F/0/S|TRAPPIST-1e|24.0|False|109.0|9.0|25.0|549.0|44.0|Juanna Vines|True|
|2|0003_01|Europa|False|A/0/S|TRAPPIST-1e|58.0|True|43.0|3576.0|0.0|6715.0|49.0|Altark Susent|False|
|3|0003_02|Europa|False|A/0/S|TRAPPIST-1e|33.0|False|0.0|1283.0|371.0|3329.0|193.0|Solam Susent|False|
|4|0004_01|Earth|False|F/1/S|TRAPPIST-1e|16.0|False|303.0|70.0|151.0|565.0|2.0|Willy Santantines|True|


---
## Initial Ideas:

1. `PassengerId`, `Name`, and `Cabin` have very high % cardinality.

***`PassengerId`***:
- `PassengerId` can be transformed to get information about whether a passenger travelled in a group and the size of that group, which will also help with imputing `HomePlanet`
- This will also create two new features `InGroup` (binary) and `GroupSize` (probably discrete groupings)

***`Name`***:
- This feature has near 100% cardinality. It would work as an index, but it's not really necessary.
- It could be interesting to extract last names to see if there is a link between last names and `HomePlanet`
- For now, I will most likely drop this feature

***`Cabin`***:
- Again, very high cardinality
- It can be split into more useful, lower-cardinality information, given that the string represents deck/number/side.
- These three elements can be extracted as new features: `Deck`, `CabinNumber`, `Side`.
- As `CabinNumber` would still have very high cardinality and offer little value on its own, it can be binned into four sections (quarters of the cabin-number range) to represent the cabin's location fore and aft on the ship. This creates the feature `CabinPosition` and helps assess whether a cabin's location had an impact on whether a passenger was transported off the ship.

2. Numerical features `Age`, `RoomService`, `ShoppingMall`, `FoodCourt`, `Spa` and `VRDeck` will most likely need scaling at the very least, and possibly a transformation towards a more Gaussian distribution to better suit the assumptions of regression-based algorithms. However, I will check this with a `.describe()` call after splitting the training data into a train and validation set.
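As a rough sketch of the kind of check this implies, the snippet below compares skewness before and after a `log1p` transform on made-up spending values. The numbers are illustrative only, and `log1p` is just one candidate transformation; `PowerTransformer` from the imports above would be another.

```python
import numpy as np
import pandas as pd

# Toy spending values: mostly zeros with a few large bills (not real data)
spend = pd.Series([0.0, 0.0, 0.0, 12.0, 45.0, 300.0, 6715.0], name="Spa")

raw_skew = spend.skew()
# log1p(x) = log(1 + x), so zero bills stay at zero
log_skew = np.log1p(spend).skew()

print(f"Skew before: {raw_skew:.2f}, after log1p: {log_skew:.2f}")
```

A skewness closer to zero after the transform suggests the feature is nearer to symmetric, which is the property the scaling step would benefit from.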

---
## Split train into train and validation

- The plan was to fill NaNs first and then turn them back into NaNs to implement a cleaning and imputation strategy. The fill and restore cells below are wrapped in string literals, so they do not run.
- I have chosen to split the training data into a train and validation set to avoid data leakage and preserve a sample for validating models before moving on to the test / prediction set and submitting to Kaggle.
    - As there are over 8000 samples in the training dataset, I do not think the reduction from splitting will drastically reduce the performance of any models; however, it might be worth trialling training on the entire dataset as well.
    - For now I will continue with the dataset split into train and validation.

In [None]:
# Split train into train and val sets
train_set, val_set = train_test_split(train, test_size=0.25, stratify=train['Transported'], random_state=42)
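To illustrate what `stratify` does in the split above, here is a small sketch on a made-up DataFrame with a perfectly balanced target. The column `x` and the 100-row size are assumptions for the example only.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with a 50/50 target (not the competition data)
toy = pd.DataFrame({"x": range(100), "Transported": [True, False] * 50})

tr, va = train_test_split(
    toy, test_size=0.25, stratify=toy["Transported"], random_state=42
)

# With stratification both parts keep roughly the original class balance
train_rate = tr["Transported"].mean()
val_rate = va["Transported"].mean()
print(len(tr), len(va), round(train_rate, 3), round(val_rate, 3))
```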

In [6]:
# Change NaNs to 'unk'
"""train.fillna('unk', axis=1, inplace=True)"""

"train.fillna('unk', axis=1, inplace=True)"

In [7]:
# Turn 'unk' back into NaNs 
"""for dset in [train_set, val_set]:
    dset.replace('unk', np.nan, inplace=True)"""

"for dset in [train_set, val_set]:\n    dset.replace('unk', np.nan, inplace=True)"

---
## Define Target

As the objective is to predict `Transported`, I will define the target here.

In [8]:
target = 'Transported'

---
---
## Imputation Strategy for NaN values

- All columns except `PassengerId` and `Transported` have around 2% NaN values
- A full dropna() call is out of the question given that the total percentage of missing values is nearly 27%
- Need to devise a strategy to impute missing values

### Idea for workflow:

- Use `PassengerId` to extract whether a passenger travelled in a group or not
- Assuming that people within the same group likely left/are from the same planet:
    1. Impute `HomePlanet` NaNs based on travel groups
    2. Any remaining NaNs in categorical or binary features will be filled with modal values using SimpleImputer(strategy='most_frequent').

- Impute NaNs in `Age` with the mean age for the passenger's `HomePlanet`.
- Impute NaN values in the other numerical columns using KNNImputer(n_neighbors=5)
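The `Age` step in this plan could be sketched as follows. The toy data and the fallback to the overall mean (for a planet with no observed ages) are my own assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Toy data: Mars has no observed ages at all
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Europa", "Mars"],
    "Age": [20.0, 30.0, np.nan, 40.0, np.nan, np.nan],
})

# Mean age per planet, broadcast back to every row of that planet
planet_mean = toy.groupby("HomePlanet")["Age"].transform("mean")
overall_mean = toy["Age"].mean()

# Fill from the planet mean first, then fall back to the overall mean
toy["Age"] = toy["Age"].fillna(planet_mean).fillna(overall_mean)
print(toy)
```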

In [9]:
mlt.describe_and_suggest(train_set)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6519 entries, 0 to 6518
Data columns (total 14 columns)
dtypes: object(7), float64(6), bool(1)
memory usage: 2646.3 KB

Total Percentage of Null Values: 26.81%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|PassengerId|object|6519|0|0.0|6519|100.0|Categorical|
|HomePlanet|object|6362|157|2.41|3|0.05|Categorical|
|CryoSleep|object|6358|161|2.47|2|0.03|Binary|
|Cabin|object|6371|148|2.27|5150|79.0|Categorical|
|Destination|object|6372|147|2.25|3|0.05|Categorical|
|Age|float64|6390|129|1.98|80|1.23|Numerical Discrete|
|VIP|object|6370|149|2.29|2|0.03|Binary|
|RoomService|float64|6385|134|2.06|1055|16.18|Numerical Continuous|
|FoodCourt|float64|6373|146|2.24|1241|19.04|Numerical Continuous|
|ShoppingMall|float64|6371|148|2.27|953|14.62|Numerical Continuous|


---
## Functions for imputation

I am going to define functions to carry out my imputation strategy so that they can be easily called in a pipeline

In [10]:
# 1. Create functions to transform PassengerId and Cabin

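def transform_passengerId(X):
    # Work on a copy so the caller's DataFrame (a slice of `train`) is not modified in place
    X = X.copy()
    # PassengerId has the form gggg_pp: group id and number within the group
    X['GroupId'] = X['PassengerId'].str.split('_').str[0]
    X['PassengerNumber'] = X['PassengerId'].str.split('_').str[1].astype(float)
    # Group size is counted within the DataFrame passed in
    group_counts = X['GroupId'].value_counts()
    X['GroupSize'] = X['GroupId'].map(group_counts)
    X['InGroup'] = np.where(X['GroupSize'] > 1, 1, 0)
    return X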

def transform_cabin(X):
    X['Deck'] = X['Cabin'].str.split('/').str[0]
    X['CabinNumber'] = X['Cabin'].str.split('/').str[1].astype(float)
    X['Side'] = X['Cabin'].str.split('/').str[2]
    bin_edges = np.linspace(X['CabinNumber'].min(), X['CabinNumber'].max(), 5)
    X['CabinPosition'] = pd.cut(X['CabinNumber'],
                                 bins=bin_edges,
                                 labels=['Front','Second','Third','Back'],
                                 include_lowest=True)
    return X

# 2. Create function to impute HomePlanet

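def impute_homePlanet(X):
    # Most common HomePlanet within each travel group (NaN if the group has no known planet)
    group_modes = X.groupby('GroupId')['HomePlanet'].transform(lambda x: x.mode().iloc[0] if not x.mode().empty else np.nan)
    X.loc[X['HomePlanet'].isna(), 'HomePlanet'] = group_modes[X['HomePlanet'].isna()]
    # Fall back to the overall mode; assign back rather than calling fillna(inplace=True) on a column selection
    X['HomePlanet'] = X['HomePlanet'].fillna(X['HomePlanet'].mode().iloc[0])
    return X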
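To show what the binning in `transform_cabin` produces, here is a small worked sketch on made-up cabin strings. Note that `np.linspace` gives equal-width bins over the cabin-number range, so the four labels need not hold equal numbers of passengers; `pd.qcut` would be the equal-count alternative.

```python
import numpy as np
import pandas as pd

# Toy cabin strings in deck/num/side form, including one missing value
cabins = pd.Series(["B/0/P", "F/0/S", "A/0/S", "F/1/S", "G/1500/P", None])

# The middle element is the cabin number; missing cabins become NaN
number = pd.to_numeric(cabins.str.split("/").str[1])

# Five edges define four equal-width bins between min and max
edges = np.linspace(number.min(), number.max(), 5)
position = pd.cut(
    number,
    bins=edges,
    labels=["Front", "Second", "Third", "Back"],
    include_lowest=True,
)
print(list(edges))
print(position.tolist())
```

With these toy values the edges are 0, 375, 750, 1125 and 1500, so the four low cabin numbers all land in `Front` and the missing cabin stays NaN.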

## Idea:

- It has occurred to me that carrying out an arbitrary imputation strategy before exploratory data analysis could bias how missing values are filled and therefore hurt the performance of any models.
- I am going to transform `PassengerId` and `Cabin` and impute `HomePlanet`, but then I will carry out some preliminary exploratory data analysis.
- I want to see whether passengers from the different `HomePlanet` values are distributed differently across the other categorical features, or whether a simple and straightforward mode imputation for each feature is sufficient.
- If the distributions of the categorical features differ by `HomePlanet`, I will devise a strategy that imputes the missing values in each categorical feature proportionally.
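One way to run this comparison is a row-normalised cross-tabulation, sketched below on made-up data. `Destination` stands in for any categorical feature, and the values are illustrative only.

```python
import pandas as pd

# Toy data: four passengers per planet (not the competition data)
toy = pd.DataFrame({
    "HomePlanet": ["Earth"] * 4 + ["Europa"] * 4,
    "Destination": [
        "TRAPPIST-1e", "TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e",
        "TRAPPIST-1e", "TRAPPIST-1e", "55 Cancri e", "55 Cancri e",
    ],
})

# normalize="index" makes each HomePlanet row sum to 1
shares = pd.crosstab(toy["HomePlanet"], toy["Destination"], normalize="index")
print(shares)
```

If the rows look alike, a single modal value per feature is probably enough; if they differ noticeably, imputing per `HomePlanet` is easier to justify.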

In [11]:
train_set = transform_passengerId(train_set)
train_set = transform_cabin(train_set)
train_set = impute_homePlanet(train_set)

In [12]:
info_df = mlt.describe_and_suggest(train_set)
info_df

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6519 entries, 0 to 6518
Data columns (total 22 columns)
dtypes: object(10), float64(8), int64(2), bool(1), category(1)
memory usage: 3829.0 KB

Total Percentage of Null Values: 33.49%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|PassengerId|object|6519|0|0.0|6519|100.0|Categorical|
|HomePlanet|object|6519|0|0.0|3|0.05|Categorical|
|CryoSleep|object|6358|161|2.47|2|0.03|Binary|
|Cabin|object|6371|148|2.27|5150|79.0|Categorical|
|Destination|object|6372|147|2.25|3|0.05|Categorical|
|Age|float64|6390|129|1.98|80|1.23|Numerical Discrete|
|VIP|object|6370|149|2.29|2|0.03|Binary|
|RoomService|float64|6385|134|2.06|1055|16.18|Numerical Continuous|
|FoodCourt|float64|6373|146|2.24|1241|19.04|Numerical Continuous|
|ShoppingMall|float64|6371|148|2.27|953|14.62|Numerical Continuous|


In [13]:
cat_cols = info_df[info_df['Suggested Type'].isin(['Categorical', 'Binary'])].index.to_list()
num_cols = info_df[info_df['Suggested Type'].isin(['Numerical Discrete', 'Numerical Continuous'])].index.to_list()

In [14]:
train_set.groupby('HomePlanet')['CryoSleep'].value_counts(True)

HomePlanet  CryoSleep
Earth       False        0.688543
            True         0.311457
Europa      False        0.563218
            True         0.436782
Mars        False        0.615326
            True         0.384674
Name: CryoSleep, dtype: float64

To better explain my thinking: it seems more sensible to impute `CryoSleep` in the True/False proportions observed for each `HomePlanet`.
- For passengers with `HomePlanet` 'Earth' and NaN in `CryoSleep`, impute 69% False, 31% True
- For passengers with `HomePlanet` 'Europa' and NaN in `CryoSleep`, impute 56% False, 44% True
- For passengers with `HomePlanet` 'Mars' and NaN in `CryoSleep`, impute 62% False, 38% True

In [15]:
# Create function to impute categorical features missing values proportionally

def proportional_imputer(X, impute_cols):
    for col in impute_cols:
        # store proportions from groupby of feature
        proportions = X.groupby('HomePlanet')[col].value_counts(normalize=True)

        # Define inner function that imputes a single row
        def impute_values(row):
            if pd.isna(row[col]):
                group = row['HomePlanet']
                if pd.notna(group) and group in proportions.index:
                    group_proportions = proportions.loc[group].dropna()
                    return np.random.choice(group_proportions.index, p=group_proportions.values)
            return row[col]
        
        # Apply the impute function to each row
        X[col] = X.apply(impute_values, axis=1)
    return X
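Because `np.random.choice` is unseeded here, each run of `proportional_imputer` can fill different values. The sketch below is one possible reproducible variant using a seeded generator; the function name `proportional_impute_seeded`, its parameters, and the toy data are my own assumptions, not part of the notebook.

```python
import numpy as np
import pandas as pd


def proportional_impute_seeded(df, col, by="HomePlanet", seed=42):
    """Fill NaNs in `col` by sampling from the observed values of each `by` group."""
    rng = np.random.default_rng(seed)
    out = df.copy()
    # Rows with a missing `by` value are skipped by groupby and stay untouched
    for _, idx in out.groupby(by).groups.items():
        values = out.loc[idx, col]
        observed = values.dropna()
        missing_idx = values.index[values.isna()]
        if observed.empty or len(missing_idx) == 0:
            continue
        # Observed category shares within the group act as sampling weights
        probs = observed.value_counts(normalize=True)
        out.loc[missing_idx, col] = rng.choice(
            probs.index.to_numpy(), size=len(missing_idx), p=probs.to_numpy()
        )
    return out


# Toy data where each planet has a single observed destination
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Europa", "Europa"],
    "Destination": ["TRAPPIST-1e", "TRAPPIST-1e", np.nan,
                    "55 Cancri e", np.nan, "55 Cancri e"],
})
filled = proportional_impute_seeded(toy, "Destination")
print(filled)
```

Sampling once per group instead of once per row also avoids the row-wise `apply`, which may matter on larger frames.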

In [16]:
train_set = proportional_imputer(train_set, cat_cols)

In [17]:
mlt.describe_and_suggest(train_set)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6519 entries, 0 to 6518
Data columns (total 22 columns)
dtypes: object(9), float64(8), bool(3), int64(2)
memory usage: 3739.6 KB

Total Percentage of Null Values: 15.06%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|PassengerId|object|6519|0|0.0|6519|100.0|Categorical|
|HomePlanet|object|6519|0|0.0|3|0.05|Categorical|
|CryoSleep|bool|6519|0|0.0|2|0.03|Binary|
|Cabin|object|6519|0|0.0|5150|79.0|Categorical|
|Destination|object|6519|0|0.0|3|0.05|Categorical|
|Age|float64|6390|129|1.98|80|1.23|Numerical Discrete|
|VIP|bool|6519|0|0.0|2|0.03|Binary|
|RoomService|float64|6385|134|2.06|1055|16.18|Numerical Continuous|
|FoodCourt|float64|6373|146|2.24|1241|19.04|Numerical Continuous|
|ShoppingMall|float64|6371|148|2.27|953|14.62|Numerical Continuous|


In [18]:
# Create function to carry out KNN imputation of NaN values in numerical features
def knn_imputer(X, columns):
    imputer = KNNImputer(n_neighbors=5)
    X[columns] = imputer.fit_transform(X[columns])
    return X

In [19]:
knnimputer_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
train_set = knn_imputer(train_set, knnimputer_cols)
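Since `knn_imputer` calls `fit_transform` every time, applying it to the validation set would refit the imputer on validation data. Given the stated aim of avoiding leakage, one possible variant is to fit once on the training rows and reuse the fitted imputer, as sketched below on made-up values with `n_neighbors=2` to suit the tiny sample.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

cols = ["Age", "Spa"]

# Toy train and validation frames (not the competition data)
train_toy = pd.DataFrame({"Age": [20.0, 22.0, 40.0, np.nan],
                          "Spa": [0.0, 10.0, 500.0, 5.0]})
val_toy = pd.DataFrame({"Age": [np.nan], "Spa": [8.0]})

imputer = KNNImputer(n_neighbors=2)
# Fit on the training rows only
train_toy[cols] = imputer.fit_transform(train_toy[cols])
# Reuse the fitted imputer, so validation NaNs are filled from training neighbours
val_toy[cols] = imputer.transform(val_toy[cols])

print(val_toy)
```

Here the validation passenger's missing `Age` is filled from the two training rows with the closest `Spa` values.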

In [20]:
mlt.describe_and_suggest(train_set)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6519 entries, 0 to 6518
Data columns (total 22 columns)
dtypes: object(9), float64(8), bool(3), int64(2)
memory usage: 3740.6 KB

Total Percentage of Null Values: 2.27%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|PassengerId|object|6519|0|0.0|6519|100.0|Categorical|
|HomePlanet|object|6519|0|0.0|3|0.05|Categorical|
|CryoSleep|bool|6519|0|0.0|2|0.03|Binary|
|Cabin|object|6519|0|0.0|5150|79.0|Categorical|
|Destination|object|6519|0|0.0|3|0.05|Categorical|
|Age|float64|6519|0|0.0|127|1.95|Numerical Discrete|
|VIP|bool|6519|0|0.0|2|0.03|Binary|
|RoomService|float64|6519|0|0.0|1108|17.0|Numerical Continuous|
|FoodCourt|float64|6519|0|0.0|1301|19.96|Numerical Continuous|
|ShoppingMall|float64|6519|0|0.0|1009|15.48|Numerical Continuous|


In [21]:
# Create 'TotalSpent' feature
def create_totalSpent(X):
    X['TotalSpent'] = X[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].sum(axis=1)
    return X

In [22]:
train_set = create_totalSpent(train_set)

In [23]:
# Convert specific columns to integers
def convert_to_int(X):
    for col in ['InGroup', 'CryoSleep', 'VIP', 'Transported']:
        if col in X.columns:
            X[col] = X[col].astype(int)
    return X

In [24]:
train_set = convert_to_int(train_set)

In [25]:
mlt.describe_and_suggest(train_set)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6519 entries, 0 to 6518
Data columns (total 23 columns)
dtypes: object(9), float64(9), int64(5)
memory usage: 3925.2 KB

Total Percentage of Null Values: 2.27%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|PassengerId|object|6519|0|0.0|6519|100.0|Categorical|
|HomePlanet|object|6519|0|0.0|3|0.05|Categorical|
|CryoSleep|int64|6519|0|0.0|2|0.03|Binary|
|Cabin|object|6519|0|0.0|5150|79.0|Categorical|
|Destination|object|6519|0|0.0|3|0.05|Categorical|
|Age|float64|6519|0|0.0|127|1.95|Numerical Discrete|
|VIP|int64|6519|0|0.0|2|0.03|Binary|
|RoomService|float64|6519|0|0.0|1108|17.0|Numerical Continuous|
|FoodCourt|float64|6519|0|0.0|1301|19.96|Numerical Continuous|
|ShoppingMall|float64|6519|0|0.0|1009|15.48|Numerical Continuous|


The columns `Cabin`, `CabinNumber`, `GroupId`, `PassengerId`, `PassengerNumber` and `Name` can now be dropped.

In [26]:
def drop_unwanted(X):
    droppers = ['Cabin', 'CabinNumber', 'GroupId', 'PassengerId', 'PassengerNumber', 'Name']
    X.drop(droppers, axis=1, inplace=True)
    return X

In [27]:
train_set = drop_unwanted(train_set)

In [28]:
# create a function that calls all of the defined functions for quicker execution
def process_dataframe(X):
    X = transform_passengerId(X)
    X = transform_cabin(X)
    X = impute_homePlanet(X)
    X = proportional_imputer(X, impute_cols=['Destination', 'Deck', 'Side', 'CabinPosition', 'VIP', 'CryoSleep'])
    X = knn_imputer(X, columns=['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck'])
    X = create_totalSpent(X)
    X = convert_to_int(X)
    X = drop_unwanted(X)
    return X
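Since the aim is for these steps to be callable in a pipeline, one option is to wrap each function in sklearn's `FunctionTransformer`. The sketch below uses two simplified stand-in functions (`add_group_size` and `add_total_spent` are hypothetical names) and a three-row frame with only a subset of the spending columns.

```python
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer


def add_group_size(X):
    # Copy so the pipeline input is left unchanged
    X = X.copy()
    group_id = X["PassengerId"].str.split("_").str[0]
    X["GroupSize"] = group_id.map(group_id.value_counts())
    return X


def add_total_spent(X):
    X = X.copy()
    # Only two spending columns are used to keep the toy frame small
    X["TotalSpent"] = X[["RoomService", "Spa"]].sum(axis=1)
    return X


preprocess = Pipeline([
    ("group", FunctionTransformer(add_group_size)),
    ("spend", FunctionTransformer(add_total_spent)),
])

toy = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "RoomService": [43.0, 0.0, 303.0],
    "Spa": [6715.0, 3329.0, 565.0],
})
out = preprocess.fit_transform(toy)
print(out)
```

These wrapped functions are stateless, so anything that must be learned from the training data only (bin edges, imputation proportions) would need a proper fitted transformer instead.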

In [29]:
train = pd.read_csv('../data/raw/train.csv')
train_set, val_set = train_test_split(train, test_size=0.25, stratify=train['Transported'], random_state=42)

In [30]:
mlt.describe_and_suggest(train_set)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6519 entries, 0 to 6518
Data columns (total 14 columns)
dtypes: object(7), float64(6), bool(1)
memory usage: 2646.3 KB

Total Percentage of Null Values: 26.81%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|PassengerId|object|6519|0|0.0|6519|100.0|Categorical|
|HomePlanet|object|6362|157|2.41|3|0.05|Categorical|
|CryoSleep|object|6358|161|2.47|2|0.03|Binary|
|Cabin|object|6371|148|2.27|5150|79.0|Categorical|
|Destination|object|6372|147|2.25|3|0.05|Categorical|
|Age|float64|6390|129|1.98|80|1.23|Numerical Discrete|
|VIP|object|6370|149|2.29|2|0.03|Binary|
|RoomService|float64|6385|134|2.06|1055|16.18|Numerical Continuous|
|FoodCourt|float64|6373|146|2.24|1241|19.04|Numerical Continuous|
|ShoppingMall|float64|6371|148|2.27|953|14.62|Numerical Continuous|


In [31]:
train_set = process_dataframe(train_set)

In [32]:
mlt.describe_and_suggest(train_set)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6519 entries, 0 to 6518
Data columns (total 17 columns)
dtypes: float64(7), object(5), int64(5)
memory usage: 2370.4 KB

Total Percentage of Null Values: 0.00%


| |Data Type|Not-Null|Missing|Missing (%)|Unique|Cardinality (%)|Suggested Type|
|-|-|-|-|-|-|-|-|
|HomePlanet|object|6519|0|0.0|3|0.05|Categorical|
|CryoSleep|int64|6519|0|0.0|2|0.03|Binary|
|Destination|object|6519|0|0.0|3|0.05|Categorical|
|Age|float64|6519|0|0.0|127|1.95|Numerical Discrete|
|VIP|int64|6519|0|0.0|2|0.03|Binary|
|RoomService|float64|6519|0|0.0|1108|17.0|Numerical Continuous|
|FoodCourt|float64|6519|0|0.0|1301|19.96|Numerical Continuous|
|ShoppingMall|float64|6519|0|0.0|1009|15.48|Numerical Continuous|
|Spa|float64|6519|0|0.0|1156|17.73|Numerical Continuous|
|VRDeck|float64|6519|0|0.0|1127|17.29|Numerical Continuous|
