Data is taken from https://www.kaggle.com/c/spaceship-titanic/data?select=train.csv.

In [189]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

import numpy as np
import random

from xgboost import XGBClassifier

random.seed(0)
np.random.seed(0)

In [190]:
df = pd.read_csv("train.csv")
print(df.head())

  PassengerId HomePlanet CryoSleep  Cabin  Destination   Age    VIP  \
0     0001_01     Europa     False  B/0/P  TRAPPIST-1e  39.0  False   
1     0002_01      Earth     False  F/0/S  TRAPPIST-1e  24.0  False   
2     0003_01     Europa     False  A/0/S  TRAPPIST-1e  58.0   True   
3     0003_02     Europa     False  A/0/S  TRAPPIST-1e  33.0  False   
4     0004_01      Earth     False  F/1/S  TRAPPIST-1e  16.0  False   

   RoomService  FoodCourt  ShoppingMall     Spa  VRDeck               Name  \
0          0.0        0.0           0.0     0.0     0.0    Maham Ofracculy   
1        109.0        9.0          25.0   549.0    44.0       Juanna Vines   
2         43.0     3576.0           0.0  6715.0    49.0      Altark Susent   
3          0.0     1283.0         371.0  3329.0   193.0       Solam Susent   
4        303.0       70.0         151.0   565.0     2.0  Willy Santantines   

   Transported  
0        False  
1         True  
2        False  
3        False  
4         True  


Some feature engineering is needed before the data can be used for modelling.

In [191]:
print(df["HomePlanet"].value_counts())

Earth     4602
Europa    2131
Mars      1759
Name: HomePlanet, dtype: int64


HomePlanet can be one-hot encoded

In [192]:
df = df.join(pd.get_dummies(data = df["HomePlanet"], prefix = "From")).drop("HomePlanet", axis = 1)
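
As a small illustration of what `pd.get_dummies` does in the cell above, here is a sketch on a made-up three-row frame (the values are illustrative, not taken from train.csv). One thing worth noticing: with the default `dummy_na=False`, a missing HomePlanet yields a row with no indicator set, so such rows are not removed by a later `dropna()`.

```python
import numpy as np
import pandas as pd

# Toy data; the notebook applies the same call to df["HomePlanet"].
toy = pd.DataFrame({"HomePlanet": ["Earth", "Europa", np.nan]})

dummies = pd.get_dummies(toy["HomePlanet"], prefix="From")
encoded = toy.join(dummies).drop("HomePlanet", axis=1)

print(encoded)
# Number of indicators set per row; the row with the missing planet has none.
print(encoded.sum(axis=1).tolist())
```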

In [193]:
df.head()

Unnamed: 0,PassengerId,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,From_Earth,From_Europa,From_Mars
0,0001_01,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0,1,0
1,0002_01,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,1,0,0
2,0003_01,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,0,1,0
3,0003_02,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,0,1,0
4,0004_01,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,1,0,0


The Name column could contain useful information, such as the family surname or gender, but extracting it would be rather tedious, so we drop it.

In [194]:
df = df.drop("Name", axis = 1)

Let's drop rows with missing values.

In [195]:
df = df.dropna()

And convert CryoSleep and VIP to the bool dtype.

In [196]:
df["CryoSleep"] = df["CryoSleep"].astype(bool)
df["VIP"] = df["VIP"].astype(bool)
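
The order of these two steps matters. A minimal sketch on made-up values, assuming the column has object dtype (which is what pandas typically produces for a True/False column with gaps): casting to bool before dropping missing values would silently turn NaN into True, because NaN is truthy.

```python
import numpy as np
import pandas as pd

# Made-up values standing in for a column such as CryoSleep.
col = pd.Series([True, False, np.nan], dtype=object)

# Casting first: the gap becomes True.
cast_first = col.astype(bool)

# Dropping first, as the notebook does: only real values remain.
drop_first = col.dropna().astype(bool)

print(cast_first.tolist())
print(drop_first.tolist())
```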

Destination can be one-hot encoded.

In [197]:
df["Destination"].value_counts()

TRAPPIST-1e      4812
55 Cancri e      1466
PSO J318.5-22     654
Name: Destination, dtype: int64

In [198]:
df = df.join(pd.get_dummies(data = df["Destination"], prefix = "Dest")).drop("Destination", axis = 1)

The Cabin column has the form deck/num/side, so it can be split into three columns.

In [199]:
# n must be passed by keyword in recent pandas versions
split_cabin = df["Cabin"].str.split('/', n = 2, expand = True)
df[["cab_deck", "cab_num", "cab_side"]] = split_cabin
df = df.drop("Cabin", axis = 1)
df["cab_num"] = df["cab_num"].astype(int)
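
To make the split above concrete, here is a sketch on two made-up cabin strings in the same format. The column names match the notebook; the data is illustrative only.

```python
import pandas as pd

# Made-up cabins in the deck/num/side format.
cabins = pd.Series(["B/0/P", "F/12/S"])

# n=2 limits the split to two separators, giving three columns.
parts = cabins.str.split("/", n=2, expand=True)
parts.columns = ["cab_deck", "cab_num", "cab_side"]
parts["cab_num"] = parts["cab_num"].astype(int)

print(parts)
```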

Cabin deck and cabin side can be encoded: the deck as a number (A = 1, B = 2, ...) and the side with one-hot encoding.

In [200]:
df["cab_deck"] = df["cab_deck"].astype(str)
df["cab_deck"] = [ ord(x[0])  - 64 for x in df["cab_deck"] ]
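
The list comprehension above maps each deck letter to its position in the alphabet, relying on `ord("A")` being 65. A small sketch of the same arithmetic; `deck_to_number` is a hypothetical helper name used only for illustration, not part of the notebook.

```python
def deck_to_number(deck: str) -> int:
    """Map an uppercase deck letter to its 1-based alphabet position."""
    return ord(deck[0]) - 64  # ord("A") == 65, so "A" -> 1


for letter in ["A", "B", "F"]:
    print(letter, deck_to_number(letter))
```

Note that this treats the deck as an ordered quantity, unlike the one-hot encoding used for the side.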



In [201]:
df = df.join(pd.get_dummies(data = df["cab_side"], prefix = "side")).drop("cab_side", axis = 1)
df.head()

Unnamed: 0,PassengerId,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,From_Earth,From_Europa,From_Mars,Dest_55 Cancri e,Dest_PSO J318.5-22,Dest_TRAPPIST-1e,cab_deck,cab_num,side_P,side_S
0,0001_01,False,39.0,False,0.0,0.0,0.0,0.0,0.0,False,0,1,0,0,0,1,2,0,1,0
1,0002_01,False,24.0,False,109.0,9.0,25.0,549.0,44.0,True,1,0,0,0,0,1,6,0,0,1
2,0003_01,False,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,0,1,0,0,0,1,1,0,0,1
3,0003_02,False,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,0,1,0,0,0,1,1,0,0,1
4,0004_01,False,16.0,False,303.0,70.0,151.0,565.0,2.0,True,1,0,0,0,0,1,6,1,0,1


PassengerId can be split into a group ID and the passenger's number within that group.

In [202]:
# n must be passed by keyword in recent pandas versions
split_id = df["PassengerId"].str.split('_', n = 1, expand = True)
df[["group_id", "id_in_group"]] = split_id
df = df.drop("PassengerId", axis = 1)

In [203]:
df["id_in_group"] = df["id_in_group"].astype(int)
df["group_id"] = df["group_id"].astype(int)
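
A short sketch of the same steps on three IDs from the head of the data. The group-size count at the end is an illustration only, not a feature used in this notebook; it assumes, as the column names suggest, that passengers travelling together share a group ID.

```python
import pandas as pd

ids = pd.Series(["0001_01", "0003_01", "0003_02"])

parts = ids.str.split("_", n=1, expand=True)
parts.columns = ["group_id", "id_in_group"]
parts = parts.astype(int)  # "0003" -> 3, leading zeros are dropped

# How many passengers share each group_id (illustration only).
group_sizes = parts["group_id"].value_counts().sort_index()
print(parts)
print(group_sizes.to_dict())
```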

Let's split the data into train (80%) and test (20%) sets.

In [204]:
train, test = train_test_split(df, test_size=0.2, random_state = 0)

In [205]:
train_Y = train["Transported"]
train_X = train.drop("Transported", axis = 1)

test_Y = test["Transported"]
test_X = test.drop("Transported", axis = 1)
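
The two cells above can be checked on a toy frame. This sketch uses made-up data with the same target column name and shows that `test_size=0.2` holds out one fifth of the rows and that the target is removed from the feature frames.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame: one feature and a boolean target, 10 rows.
toy = pd.DataFrame({"x": range(10), "Transported": [True, False] * 5})

train, test = train_test_split(toy, test_size=0.2, random_state=0)

train_X, train_Y = train.drop("Transported", axis=1), train["Transported"]
test_X, test_Y = test.drop("Transported", axis=1), test["Transported"]

print(len(train), len(test))
print(list(train_X.columns))
```

Passing `stratify=toy["Transported"]` is an option when the class proportions should be preserved in both parts; the notebook does not do this.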

Our problem is classification, so the first model we'll try is a random forest classifier. We start by tuning its hyperparameters with a cross-validated randomized search.

In [206]:
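# Note: for classifiers 'auto' meant the same as 'sqrt'; it was deprecated in
# scikit-learn 1.1 and removed in 1.3, so drop it when running on newer versions.
grid = {'n_estimators': [90, 100, 150, 250],
        'max_features': ['auto', 'sqrt'],
        'max_depth': range(2, 20),
        'min_samples_split': range(2, 10),
        'min_samples_leaf': range(2, 10),
        'bootstrap': [True, False]}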
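
To get a sense of how much of this grid a randomized search with `n_iter = 100` actually visits, the number of combinations can be counted directly. A small sketch using only the grid definition above:

```python
from math import prod

grid = {'n_estimators': [90, 100, 150, 250],
        'max_features': ['auto', 'sqrt'],
        'max_depth': range(2, 20),
        'min_samples_split': range(2, 10),
        'min_samples_leaf': range(2, 10),
        'bootstrap': [True, False]}

# Number of distinct combinations in the full grid.
n_combinations = prod(len(values) for values in grid.values())
print(n_combinations)        # 18432

# Fraction of the grid covered by 100 sampled candidates (about 0.5%).
print(100 / n_combinations)
```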

Tuning hyperparameters

In [207]:
model = RandomForestClassifier(random_state = 0)

cv = RandomizedSearchCV(estimator = model, param_distributions = grid, scoring = "accuracy", n_iter = 100, cv = 3, verbose=2, n_jobs = -1, random_state = 0)
cv.fit(train_X, train_Y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3, estimator=RandomForestClassifier(random_state=0),
                   n_iter=100, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'max_depth': range(2, 20),
                                        'max_features': ['auto', 'sqrt'],
                                        'min_samples_leaf': range(2, 10),
                                        'min_samples_split': range(2, 10),
                                        'n_estimators': [90, 100, 150, 250]},
                   random_state=0, scoring='accuracy', verbose=2)

Best parameters:

In [208]:
print(cv.best_estimator_)

RandomForestClassifier(max_depth=13, max_features='sqrt', min_samples_leaf=6,
                       min_samples_split=3, random_state=0)
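
A note on the search object: with the default `refit=True`, `RandomizedSearchCV` refits the best configuration on the whole training set, so `best_estimator_` can be used for prediction directly, while `best_params_` and `best_score_` give the chosen settings and their mean cross-validated score. A minimal sketch on synthetic data (the dataset and the tiny grid are assumptions for illustration only):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for train_X / train_Y.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

small_grid = {"n_estimators": [10, 20], "max_depth": [2, 4, 6]}
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=small_grid,
    n_iter=4, cv=3, scoring="accuracy", random_state=0,
)
search.fit(X, y)

print(search.best_params_)   # chosen hyperparameters
print(search.best_score_)    # mean CV accuracy of the best candidate
# best_estimator_ is already fitted on all of X, y (refit=True by default).
print(search.best_estimator_.score(X, y))
```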


In [218]:
model = RandomForestClassifier(max_depth=13, max_features='sqrt', min_samples_leaf=6,
                       min_samples_split=3, random_state=0)
model.fit(train_X, train_Y)
print("Model accuracy on test data", model.score(test_X, test_Y))

Model accuracy on test data 0.8147080028839221


Now we'll fit XGBoost.

Hyperparameter grid and hyperparameter search 

In [210]:
param_grid = {'gamma': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4, 200],
              'learning_rate': [0.01, 0.03, 0.06, 0.1, 0.15, 0.2, 0.25, 0.300000012, 0.4, 0.5, 0.6, 0.7],
              'max_depth': [5,6,7,8,9,10,11,12,13,14],
              'n_estimators': [50,65,80,100,115,130,150],
              'reg_alpha': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200],
              'reg_lambda': [0,0.1,0.2,0.4,0.8,1.6,3.2,6.4,12.8,25.6,51.2,102.4,200]}

In [211]:
model = XGBClassifier()

cv = RandomizedSearchCV(estimator = model, param_distributions = param_grid, scoring = "accuracy", n_iter = 100, cv = 3, verbose=2, n_jobs = -1, random_state = 0)
cv.fit(train_X, train_Y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3, estimator=XGBClassifier(), n_iter=100, n_jobs=-1,
                   param_distributions={'gamma': [0, 0.1, 0.2, 0.4, 0.8, 1.6,
                                                  3.2, 6.4, 12.8, 25.6, 51.2,
                                                  102.4, 200],
                                        'learning_rate': [0.01, 0.03, 0.06, 0.1,
                                                          0.15, 0.2, 0.25,
                                                          0.300000012, 0.4, 0.5,
                                                          0.6, 0.7],
                                        'max_depth': [5, 6, 7, 8, 9, 10, 11, 12,
                                                      13, 14],
                                        'n_estimators': [50, 65, 80, 100, 115,
                                                         130, 150],
                                        'reg_alpha': [0, 0.1, 0.2, 0.4, 0.8,
                                

In [212]:
print(cv.best_estimator_)

XGBClassifier(gamma=0.8, learning_rate=0.5, max_depth=13, n_estimators=65,
              reg_alpha=0.8, reg_lambda=200)


In [217]:
model = XGBClassifier(gamma=0.8, learning_rate=0.5, max_depth=13, n_estimators=65,
              reg_alpha=0.8, reg_lambda=200)
model.fit(train_X, train_Y)
y_pred = model.predict(test_X)
print((test_Y == y_pred).sum() / len(test_Y))

0.8190338860850757
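
The expression in the cell above is plain accuracy: the fraction of predictions that match the labels. As a sanity check on made-up labels, it agrees with `sklearn.metrics.accuracy_score`:

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score

# Made-up labels and predictions; test_Y is a Series, y_pred an array.
test_Y = pd.Series([True, False, True, True])
y_pred = np.array([True, False, False, True])

manual = (test_Y == y_pred).sum() / len(test_Y)
print(manual, accuracy_score(test_Y, y_pred))
```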


We'll also fit a KNN model.

In [214]:
from sklearn.neighbors import KNeighborsClassifier
leaf_size = list(range(1, 50))
n_neighbors = list(range(1, 30))
p = [1, 2, 3]

param_grid = {
    "leaf_size" : leaf_size,
    "n_neighbors" : n_neighbors,
    "p" : p
}

model = KNeighborsClassifier()
cv = RandomizedSearchCV(estimator = model, param_distributions = param_grid, scoring = "accuracy", n_iter = 100, cv = 3, verbose=2, n_jobs = -1, random_state = 0)
cv.fit(train_X, train_Y)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


RandomizedSearchCV(cv=3, estimator=KNeighborsClassifier(), n_iter=100,
                   n_jobs=-1,
                   param_distributions={'leaf_size': [1, 2, 3, 4, 5, 6, 7, 8, 9,
                                                      10, 11, 12, 13, 14, 15,
                                                      16, 17, 18, 19, 20, 21,
                                                      22, 23, 24, 25, 26, 27,
                                                      28, 29, 30, ...],
                                        'n_neighbors': [1, 2, 3, 4, 5, 6, 7, 8,
                                                        9, 10, 11, 12, 13, 14,
                                                        15, 16, 17, 18, 19, 20,
                                                        21, 22, 23, 24, 25, 26,
                                                        27, 28, 29],
                                        'p': [1, 2, 3]},
                   random_state=0, scoring='accuracy', verbose=2)

In [215]:
print(cv.best_estimator_)

KNeighborsClassifier(leaf_size=5, n_neighbors=21, p=3)


In [219]:
model = KNeighborsClassifier(leaf_size=5, n_neighbors=21, p=3)
model.fit(train_X, train_Y)
print(model.score(test_X, test_Y))

0.7764960346070656


XGBoost and the random forest classifier perform similarly well (test accuracy of about 0.82 and 0.81, respectively). My result is probably close to what is achievable, since the results in the competition are near 0.8, measured by accuracy. KNN is the worst at about 0.78, which I don't find surprising: there are quite a few features, and distance-based methods tend to suffer as the number of features grows. In theory I could apply a method for reducing the feature count, but I think that is out of scope for this exercise. I didn't use neural networks, as I believe they usually offer little advantage over tree-based models on tabular problems like this one. The remaining errors may be caused by, for example, imbalance in the data: the number of cabins on different decks differs a lot.
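
The feature-reduction idea mentioned above could look roughly like the following. This is only a sketch on synthetic data, not something run in this notebook: the choice of PCA, the number of components, and the added scaling step are all assumptions, and no claim is made that it would improve the KNN score here.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in with 20 features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Scale, project onto 5 principal components, then classify.
knn_pca = make_pipeline(
    StandardScaler(),
    PCA(n_components=5),
    KNeighborsClassifier(n_neighbors=21),
)
knn_pca.fit(X_train, y_train)
print(knn_pca.score(X_test, y_test))
```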