# Install Dependencies

In [1]:
%pip install --upgrade pip pandas scikit-learn scipy

Defaulting to user installation because normal site-packages is not writeable
Note: you may need to restart the kernel to use updated packages.


# Import Libraries

In [2]:
import pickle
import pandas as pd
import numpy as np

from scipy.stats import randint, uniform
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer, make_column_selector
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder


In [3]:
RANDOM_STATE = 33

# Load Data

In [4]:
dataset_df = pd.read_csv('../kaggle/datasets/spaceship-titanic/train.csv')
print('Train dataset shape:', dataset_df.shape)

Train dataset shape: (8693, 14)


In [5]:
# Extract the target variable
y = dataset_df['Transported']
X = dataset_df.drop(['Transported'], axis=1)

In [6]:
# Split the dataset into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=RANDOM_STATE)

The data is also available in the [Kaggle Spaceship Titanic competition](https://www.kaggle.com/competitions/spaceship-titanic/data).

# Exploratory Data Analysis

**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
- `PassengerId` - A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.
- `HomePlanet` - The planet the passenger departed from, typically their planet of permanent residence.
- `CryoSleep` - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
- `Cabin` - The cabin number where the passenger is staying. Takes the form `deck/num/side`, where side can be either P for Port or S for Starboard.
- `Destination` - The planet the passenger will be debarking to.
- `Age` - The age of the passenger.
- `VIP` - Whether the passenger has paid for special VIP service during the voyage.
- `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, `VRDeck` - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- `Name` - The first and last names of the passenger.
- `Transported` - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.

**test.csv** - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

**sample_submission.csv** - A submission file in the correct format.
- `PassengerId` - Id for each passenger in the test set.
- `Transported` - The target. For each passenger, predict either True or False.

In [7]:
dataset_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   PassengerId   8693 non-null   object 
 1   HomePlanet    8492 non-null   object 
 2   CryoSleep     8476 non-null   object 
 3   Cabin         8494 non-null   object 
 4   Destination   8511 non-null   object 
 5   Age           8514 non-null   float64
 6   VIP           8490 non-null   object 
 7   RoomService   8512 non-null   float64
 8   FoodCourt     8510 non-null   float64
 9   ShoppingMall  8485 non-null   float64
 10  Spa           8510 non-null   float64
 11  VRDeck        8505 non-null   float64
 12  Name          8493 non-null   object 
 13  Transported   8693 non-null   bool   
dtypes: bool(1), float64(6), object(7)
memory usage: 891.5+ KB


In [8]:
dataset_df.head()

,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


Looking at the data in combination with the description, we can see that the data is a mix of categorical and numerical data. The categorical data is `PassengerId`, `HomePlanet`, `CryoSleep`, `Cabin`, `Destination`, `VIP`, `Name`, and `Transported`. The numerical data is `Age`, `RoomService`, `FoodCourt`, `ShoppingMall`, `Spa`, and `VRDeck`.

The description reveals further information which is not immediately obvious and can be used to engineer new features. The `PassengerId` is a unique identifier for each passenger, but it is also a group identifier. The `Cabin` column contains information about the deck, room number, and side of the ship.
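As a rough illustration of the two observations above, the sketch below splits both columns with pandas on three toy rows. The frame `sample` and the missing `Cabin` value are made up for the example; the new column names mirror the ones used later in the notebook.

```python
import pandas as pd

# Toy rows modelled on the head() output above, plus one missing Cabin.
sample = pd.DataFrame({
    'PassengerId': ['0003_01', '0003_02', '0004_01'],
    'Cabin': ['A/0/S', 'A/0/S', None],
})

# expand=True returns one column per part of the split string.
sample[['Group', 'Number']] = sample['PassengerId'].str.split('_', expand=True)
sample[['Deck', 'Room', 'Side']] = sample['Cabin'].str.split('/', expand=True)

print(sample)
```

Rows with a missing `Cabin` simply get missing values in `Deck`, `Room`, and `Side`, so they can still be imputed later.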

In [9]:
dataset_df.describe()

,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


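The descriptive statistics show that the middle half of the passengers are between 19 and 38 years old, with a mean age of 28.83 and a median of 27. For every amenity the median charge is 0, so at least half of the passengers with a recorded value were billed nothing for that amenity, while the maxima (14,327 to 29,813) show that a few passengers spent heavily.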
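A per-column median of 0 is not the same as a passenger having no charges at all. The sketch below, on a hypothetical four-row frame `toy` modelled on the `head()` output, shows how the two shares could be computed separately.

```python
import pandas as pd

amenities = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Toy rows modelled on the head() output above.
toy = pd.DataFrame(
    [[0.0, 0.0, 0.0, 0.0, 0.0],
     [109.0, 9.0, 25.0, 549.0, 44.0],
     [43.0, 3576.0, 0.0, 6715.0, 49.0],
     [0.0, 1283.0, 371.0, 3329.0, 193.0]],
    columns=amenities,
)

# Share of zero bills per amenity (what each per-column median reflects).
zero_share_per_amenity = (toy[amenities] == 0).mean()

# Share of passengers with no bill at any amenity (a stricter condition).
no_spend_share = (toy[amenities].sum(axis=1) == 0).mean()

print(zero_share_per_amenity)
print(no_spend_share)
```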

In [10]:
# Check for missing values
dataset_df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

Every column except `PassengerId` and the target `Transported` has roughly 180 to 220 missing values, which is about 2% of the 8,693 rows.

In [11]:
# Check rows without any missing values
dataset_df.dropna().shape

(6606, 14)

If we drop the rows with missing values, we keep 6,606 of 8,693 rows and lose 2,087, or about 24% of the data. This is a significant amount of data to lose, so we will impute the missing values instead. Since each column is missing only about 2% of its values, the much larger share of incomplete rows tells us that the missing values are spread across many rows rather than concentrated in a few.
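One way to check how the gaps are distributed is to count the missing cells per row. This is a minimal sketch on a made-up frame `toy` standing in for `dataset_df`; the values are illustrative only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for dataset_df.
toy = pd.DataFrame({
    'Age': [39.0, np.nan, 58.0, 33.0],
    'Spa': [0.0, 549.0, np.nan, 3329.0],
    'VIP': [False, False, True, None],
})

# Number of missing cells in each row.
missing_per_row = toy.isnull().sum(axis=1)

# How many rows have 0, 1, 2, ... missing cells.
print(missing_per_row.value_counts().sort_index())

# Share of rows that dropna() would discard.
share_dropped = (missing_per_row > 0).mean()
print(f'Rows with at least one missing value: {share_dropped:.0%}')
```

In this toy frame each column misses a single value, yet three of the four rows are incomplete, which is the same pattern as in the real data on a smaller scale.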

# Custom Preprocessing

While scikit-learn has a lot of preprocessing tools, some preprocessing steps are too specific to this dataset to be included in the library. For example, the `Cabin` column contains information about the deck, room number, and side of the ship. We can extract this information and create new features.

In [12]:
class PassengerIdSplitter(BaseEstimator, TransformerMixin):
    '''Split the PassengerId into Group and Number'''
    
    def fit(self, X: pd.DataFrame, y=None):
        return self

    def transform(self, X: pd.DataFrame):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        # Split the PassengerId into Group and Number (both kept as strings)
        parts = X['PassengerId'].str.split('_')
        X['Group'] = parts.str[0]
        X['Number'] = parts.str[1]
        # Drop the original column
        return X.drop(['PassengerId'], axis=1)
        

In [13]:
class CabinSplitter(BaseEstimator, TransformerMixin):
    '''Split the Cabin into Deck, Room and Side'''

    def fit(self, X: pd.DataFrame, y=None):
        return self

    def transform(self, X: pd.DataFrame):
        # Work on a copy so the caller's DataFrame is not modified in place
        X = X.copy()
        # Split the Cabin into Deck, Room and Side (port or starboard)
        parts = X['Cabin'].str.split('/')
        X['Deck'] = parts.str[0]
        # Treat Room as numerical to avoid high cardinality;
        # float (not int) keeps rows with a missing Cabin as NaN
        X['Room'] = parts.str[1].astype(float)
        X['Side'] = parts.str[2]
        # Drop the original column
        return X.drop(['Cabin'], axis=1)

In [14]:
class ColumnDropper(BaseEstimator, TransformerMixin):
    '''Drop the specified columns'''

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X: pd.DataFrame, y=None):
        return self

    def transform(self, X: pd.DataFrame):
        # Drop the specified columns
        return X.drop(self.columns, axis=1)
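
A quick way to check a transformer of this kind is to run it on a few rows inside a `Pipeline`. The sketch below uses a hypothetical stand-in class, `CabinSplitterSketch`, and a made-up frame `toy`; it assumes at least one non-missing `Cabin` value so that the split yields three parts.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class CabinSplitterSketch(BaseEstimator, TransformerMixin):
    '''Illustrative stand-in for a Cabin splitter that tolerates missing cabins.'''

    def fit(self, X: pd.DataFrame, y=None):
        return self

    def transform(self, X: pd.DataFrame):
        X = X.copy()  # leave the caller's frame untouched
        parts = X['Cabin'].str.split('/', expand=True)
        X['Deck'] = parts[0]
        X['Room'] = pd.to_numeric(parts[1])  # missing cabins stay missing
        X['Side'] = parts[2]
        return X.drop(columns=['Cabin'])


toy = pd.DataFrame({'Cabin': ['B/0/P', None, 'F/1/S'], 'Age': [39.0, 24.0, 16.0]})

feature_engineering = Pipeline([('cabin_splitter', CabinSplitterSketch())])
out = feature_engineering.fit_transform(toy)
print(out)
```

Because `fit` learns nothing, the step is stateless; it only needs to follow the `fit`/`transform` contract so that it can be chained with other steps.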

# Column Transformer

Now that we have our custom preprocessing steps, we can create a column transformer. This will allow us to apply different preprocessing steps to different columns based on their data type.

First, we will create a pipeline for the numerical data. We will use the `SimpleImputer` to impute the missing values with the median. Then we will use the `StandardScaler` to scale the data.

In [15]:
numerical_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
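
To see what these two steps do, the sketch below runs an equivalent pipeline on a single made-up column. The array `ages` and the name `toy_preprocessor` are illustrative only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

toy_preprocessor = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
])

# One numerical column with a missing value; the median of 10, 20, 60 is 20.
ages = np.array([[10.0], [20.0], [np.nan], [60.0]])
scaled = toy_preprocessor.fit_transform(ages)

# The median learned during fit is reused for any later transform call.
print(toy_preprocessor.named_steps['imputer'].statistics_)
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, the column has a mean of approximately 0 and a standard deviation of approximately 1.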

Next, we will create a pipeline for the categorical data. It consists of a single `OneHotEncoder` step; unlike the numerical pipeline, it has no `SimpleImputer`, so missing values are not replaced by the most frequent value. Instead, `OneHotEncoder` (scikit-learn 0.24 and later) encodes a missing value as a category of its own. We set `handle_unknown='ignore'` so that categories that appear only in the test set are encoded as all zeros instead of raising an error, and `sparse_output=False` to return a dense array instead of a sparse matrix.

In [16]:
categorical_preprocessor = Pipeline([
    ('onehot', OneHotEncoder(sparse_output=False, handle_unknown='ignore'))
])


Finally, we will combine the two pipelines using the `ColumnTransformer`. Here we can use `make_column_selector` to choose the columns for each pipeline by data type. This works because values that look numerical but represent categories (such as the `Group` and `Number` parts of the `PassengerId`) are kept as strings, so they are selected as categorical rather than numerical.

In [17]:
column_transformer = ColumnTransformer([
    ('numerical_preprocessing', numerical_preprocessor, make_column_selector(dtype_include=np.number)),
    ('categorical_preprocessing', categorical_preprocessor, make_column_selector(dtype_include=object))
])
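
A selector is a callable that takes a DataFrame and returns the matching column names, which makes it easy to inspect. The sketch below uses a made-up two-row frame `toy` and assumes text columns are stored with the `object` dtype, as in the `info()` output earlier.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_selector

toy = pd.DataFrame({
    'HomePlanet': ['Europa', 'Earth'],
    'Age': [39.0, 24.0],
    'Spa': [0.0, 549.0],
})
# Assumption for this sketch: text columns use the object dtype.
toy['HomePlanet'] = toy['HomePlanet'].astype(object)

numerical_columns = make_column_selector(dtype_include=np.number)(toy)
categorical_columns = make_column_selector(dtype_include=object)(toy)

print(numerical_columns)
print(categorical_columns)
```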

# Creating a Baseline Model

Now that we have our data preprocessing steps, we can create a baseline model. We will use the Random Forest Classifier with default hyperparameters as our baseline model.

In [18]:
pipeline = Pipeline([
    ('column_dropper', ColumnDropper(columns=['Name'])),
    ('column_transformer', column_transformer),
    ('classifier', RandomForestClassifier(random_state=RANDOM_STATE, n_jobs=-1))
    ])

In [19]:
pipeline.fit(X_train, y_train)

In [20]:
baseline_accuracy = pipeline.score(X_test, y_test)
print(f'Accuracy score: {baseline_accuracy:.3}')

Accuracy score: 0.798


# Model Selection and Hyperparameter Tuning

Now that we have a baseline model, we can try different models and tune their hyperparameters to improve performance. We will search over `RandomForestClassifier` and `GradientBoostingClassifier` in a single `RandomizedSearchCV`, with a separate search space for each model.

In [21]:
search_space = [
    {
        'classifier': [RandomForestClassifier(random_state=RANDOM_STATE)],
        'classifier__n_estimators': randint(100, 500),
        'classifier__max_depth': randint(3, 16),
        'classifier__min_samples_split': randint(2, 100),
        'classifier__min_samples_leaf': randint(1, 50),
        # uniform(0.1, 1) samples from [0.1, 1.1], which exceeds the valid
        # fraction range (0, 1]; choose between the named strategies instead
        'classifier__max_features': ['sqrt', 'log2'],
    },
    {
        'classifier': [GradientBoostingClassifier(random_state=RANDOM_STATE)],
        'classifier__n_estimators': randint(100, 500),
        'classifier__learning_rate': uniform(0.01, 0.3),
        'classifier__max_depth': randint(3, 16),
        'classifier__min_samples_split': randint(2, 100),
        'classifier__min_samples_leaf': randint(1, 50),
        'classifier__max_features': ['sqrt', 'log2'],
    }
]
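
The two scipy distributions used in the search space are parameterised differently from what their arguments might suggest: `uniform(loc, scale)` covers the interval from `loc` to `loc + scale`, and `randint(low, high)` excludes `high`. The short sketch below checks this for the learning rate and depth ranges; the variable names are illustrative.

```python
from scipy.stats import randint, uniform

# uniform(loc, scale) covers [loc, loc + scale], not [loc, scale].
learning_rate = uniform(0.01, 0.3)
low, high = learning_rate.support()
print(low, high)

# randint(low, high) draws integers from low up to high - 1.
depth = randint(3, 16)
samples = depth.rvs(size=1000, random_state=33)
print(samples.min(), samples.max())
```

This matters for parameters with a hard upper bound: scikit-learn expects a float `max_features` to lie in (0, 1], so a range meant to end at 1.0 needs a scale of `1.0 - loc`.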

In [22]:
random_search = RandomizedSearchCV(
    pipeline, 
    search_space,
    scoring='accuracy',
    refit=True,
    n_iter=10,  # number of sampled candidates; the recorded run below used 10
    cv=4, 
    verbose=1, 
    n_jobs=-1,
    random_state=RANDOM_STATE
)

In [23]:
random_search.fit(X_train, y_train)

Fitting 4 folds for each of 10 candidates, totalling 40 fits
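
The log line reports folds times candidates: 4 folds × 10 candidates = 40 fits. The sketch below reproduces the same mechanics at a much smaller scale on synthetic data; `X_toy`, `toy_pipeline`, `toy_space`, and the tiny parameter ranges are assumptions chosen only to keep it fast.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

# Synthetic data, used only to keep the sketch fast and self-contained.
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=33)

toy_pipeline = Pipeline([('classifier', RandomForestClassifier(random_state=33))])

# Each dict is its own sub-space; the 'classifier' entry swaps the pipeline step.
toy_space = [
    {
        'classifier': [RandomForestClassifier(random_state=33)],
        'classifier__n_estimators': randint(10, 30),
    },
    {
        'classifier': [GradientBoostingClassifier(random_state=33)],
        'classifier__n_estimators': randint(10, 30),
    },
]

toy_search = RandomizedSearchCV(
    toy_pipeline, toy_space, n_iter=4, cv=2, scoring='accuracy', random_state=33
)
toy_search.fit(X_toy, y_toy)

# n_iter is the number of sampled candidates; each is fitted once per fold.
print(len(toy_search.cv_results_['params']))
print(type(toy_search.best_params_['classifier']).__name__)
```

Here 4 candidates and 2 folds give 8 cross-validation fits, plus one final refit of the best candidate on all of the data.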


In [24]:
best_estimator = random_search.best_estimator_
print(best_estimator)

Pipeline(steps=[('column_dropper', ColumnDropper(columns=['Name'])),
                ('column_transformer',
                 ColumnTransformer(transformers=[('numerical_preprocessing',
                                                  Pipeline(steps=[('imputer',
                                                                   SimpleImputer(strategy='median')),
                                                                  ('scaler',
                                                                   StandardScaler())]),
                                                  <sklearn.compose._column_transformer.make_column_selector object at 0x1478c5130>),
                                                 ('categorical_preprocessing',
                                                  Pipeline(steps=[('onehot',
                                                                   OneHotEncoder(handle_unknown='ignore',
                                                                           

In [25]:
best_params = random_search.best_params_
print(best_params)

{'classifier': GradientBoostingClassifier(random_state=33), 'classifier__learning_rate': 0.04298260317086207, 'classifier__max_depth': 6, 'classifier__max_features': 'sqrt', 'classifier__min_samples_leaf': 43, 'classifier__min_samples_split': 47, 'classifier__n_estimators': 391}


In [26]:
random_search_accuracy = random_search.score(X_test, y_test)
print(f'Accuracy score: {random_search_accuracy:.3}')

Accuracy score: 0.745


In [27]:
results_df = pd.DataFrame(random_search.cv_results_)
results_df.sort_values(by='rank_test_score').head(10)

,mean_fit_time,std_fit_time,mean_score_time,std_score_time,param_classifier,param_classifier__max_depth,param_classifier__max_features,param_classifier__min_samples_leaf,param_classifier__min_samples_split,param_classifier__n_estimators,param_classifier__learning_rate,params,split0_test_score,split1_test_score,split2_test_score,split3_test_score,mean_test_score,std_test_score,rank_test_score
1,6.592054,0.062966,0.137776,0.015946,GradientBoostingClassifier(random_state=33),6,sqrt,43,47,391,0.042983,{'classifier': GradientBoostingClassifier(rand...,0.749281,0.72858,0.735903,0.739931,0.738424,0.007474,1
2,6.822435,0.059294,0.156666,0.030797,GradientBoostingClassifier(random_state=33),15,sqrt,40,14,495,0.018032,{'classifier': GradientBoostingClassifier(rand...,0.743531,0.72743,0.743383,0.733602,0.736986,0.006829,2
9,1.87127,0.091501,0.063203,0.010778,GradientBoostingClassifier(random_state=33),15,sqrt,11,54,188,0.047212,{'classifier': GradientBoostingClassifier(rand...,0.747556,0.717079,0.7313,0.736479,0.733103,0.010958,3
0,5.748564,0.276807,0.188149,0.026811,RandomForestClassifier(random_state=33),10,sqrt,3,20,301,,{'classifier': RandomForestClassifier(random_s...,0.746406,0.715929,0.719793,0.733026,0.728789,0.011986,4
5,1.303303,0.019728,0.135826,0.014702,GradientBoostingClassifier(random_state=33),6,log2,19,83,352,0.25925,{'classifier': GradientBoostingClassifier(rand...,0.710178,0.722829,0.717491,0.712888,0.715847,0.004805,5
6,1.11007,0.036101,0.111809,0.017836,GradientBoostingClassifier(random_state=33),9,log2,27,34,336,0.136524,{'classifier': GradientBoostingClassifier(rand...,0.710178,0.713053,0.71519,0.708861,0.711821,0.002466,6
7,1.87337,0.051847,0.133745,0.018246,RandomForestClassifier(random_state=33),12,sqrt,31,55,155,,{'classifier': RandomForestClassifier(random_s...,0.730305,0.684876,0.72267,0.696778,0.708657,0.018518,7
8,1.309343,0.085351,0.119116,0.017536,GradientBoostingClassifier(random_state=33),4,log2,37,99,275,0.168265,{'classifier': GradientBoostingClassifier(rand...,0.710178,0.713053,0.705409,0.673188,0.700457,0.015979,8
3,2.640391,0.070992,0.130081,0.020787,RandomForestClassifier(random_state=33),3,sqrt,43,18,184,,{'classifier': RandomForestClassifier(random_s...,0.705003,0.684301,0.705984,0.697353,0.69816,0.008671,9
4,0.871305,0.013645,0.163705,0.058066,RandomForestClassifier(random_state=33),11,log2,24,75,199,,{'classifier': RandomForestClassifier(random_s...,0.508913,0.505463,0.505178,0.506904,0.506615,0.00148,10


In [28]:
results_df.to_csv('results.csv', index=False)

In [29]:
# Use a context manager so the file is flushed and closed after writing
with open('model.pkl', 'wb') as f:
    pickle.dump(random_search, f)
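
Loading the model back works the same way in reverse. The sketch below shows a round trip with a stand-in `DummyClassifier` and a temporary directory, both of which are assumptions made to keep the example self-contained; in the notebook the pickled object would be the fitted `random_search`.

```python
import pickle
import tempfile
from pathlib import Path

from sklearn.dummy import DummyClassifier

# Stand-in model; in the notebook this would be the fitted random_search object.
model = DummyClassifier(strategy='most_frequent').fit([[0], [1], [2]], [1, 1, 0])

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = Path(tmp_dir) / 'model.pkl'

    with open(model_path, 'wb') as f:
        pickle.dump(model, f)

    with open(model_path, 'rb') as f:
        restored = pickle.load(f)

# The restored object predicts like the original one.
print(restored.predict([[5]]))
```

Pickle stores custom classes by reference, so classes such as `ColumnDropper` must be defined or importable in the session that loads the file. Only unpickle files from a trusted source.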