<a href="https://colab.research.google.com/github/jhonnybenna/machine_learning/blob/main/Spaceship_Titanic.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#Machine Learning and Data Mining Project: Spaceship Titanic
In this Colab notebook we will use machine learning techniques to solve the Kaggle competition "Spaceship Titanic".



#Importing Dependencies


In [48]:
URL = "https://raw.githubusercontent.com/jhonnybenna/machine_learning/main/"
OUTPUT_PATH = "kaggle_submissions/"
RANDOM_STATE = 3993
TRAIN_SIZE = 0.8

DROPCOLUMS=["PassengerId","Name","Transported"]

#Insert here the description of your test in order to submit to Kaggle
Description="Submission for the kaggle competition Spaceship Titanic "

In [49]:
import os
import graphviz
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import (
    GridSearchCV,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OrdinalEncoder
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from typing import List


Since the dataset contains, as we will see shortly, both numerical and categorical values, we cannot use scikit-learn's Decision Tree without first encoding the data in numerical form. Therefore, we will use TensorFlow Decision Forests, which handles different types of values automatically, without the need for encoding.

In [50]:
!pip install tensorflow_decision_forests
import tensorflow_decision_forests as tfdf



#Importing Dataset from GitHub

In [51]:
train = pd.read_csv(URL + "train.csv")
test = pd.read_csv(URL + "test.csv")

In [52]:
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [53]:
test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers
2,0019_01,Europa,True,C/0/S,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus
3,0021_01,Europa,False,C/1/S,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter
4,0023_01,Earth,False,F/5/S,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez


#Data Inspection

In [54]:
train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

As we can see, the dataset contains both numerical data and non-numerical data (object).

Our goal is to predict the value of "Transported" based on the other features.

But, before training the model, let's make sure all of the data is in a proper format.

Since the Cabin column contains data written in the format Deck/Cabin_num/Side, we need to split it into three separate columns, each containing only one of the three variables.

In [55]:
train[["Deck", "Cabin_num", "Side"]] = train["Cabin"].str.split("/", expand=True)
#dropping the column Cabin after splitting it into three separate columns
train.drop('Cabin', axis='columns', inplace=True)

In [56]:
test[["Deck", "Cabin_num", "Side"]] = test["Cabin"].str.split("/", expand=True)
#dropping the column Cabin after splitting it into three separate columns
test.drop('Cabin', axis='columns', inplace=True)
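The split can be checked on a small example. The sketch below uses a made-up three-row frame (the `toy` name and its values are assumptions for illustration) to show that a missing Cabin produces missing values in all three new columns, and that Cabin_num comes out as text until it is converted explicitly.

```python
import pandas as pd

# Toy frame mimicking the Cabin column, including one missing value (illustrative data)
toy = pd.DataFrame({"Cabin": ["B/0/P", None, "F/12/S"]})

# expand=True returns one column per piece; a missing Cabin yields missing pieces
toy[["Deck", "Cabin_num", "Side"]] = toy["Cabin"].str.split("/", expand=True)

# The pieces are strings, so Cabin_num needs an explicit numeric conversion
toy["Cabin_num"] = pd.to_numeric(toy["Cabin_num"])

print(toy.drop(columns="Cabin"))
```

This is consistent with Deck, Cabin_num and Side sharing the same number of missing values later in the notebook.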

In [57]:
train

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Deck,Cabin_num,Side
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,B,0,P
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,F,0,S
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,A,0,S
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,A,0,S
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,F,1,S
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,9276_01,Europa,False,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,Gravior Noxnuther,False,A,98,P
8689,9278_01,Earth,True,PSO J318.5-22,18.0,False,0.0,0.0,0.0,0.0,0.0,Kurta Mondalley,False,G,1499,S
8690,9279_01,Earth,False,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,Fayey Connon,True,G,1500,S
8691,9280_01,Europa,False,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,Celeon Hontichre,False,E,608,S


In [58]:
test.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Deck,Cabin_num,Side
0,0013_01,Earth,True,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,Nelly Carsoning,G,3,S
1,0018_01,Earth,False,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,Lerome Peckers,F,4,S
2,0019_01,Europa,True,55 Cancri e,31.0,False,0.0,0.0,0.0,0.0,0.0,Sabih Unhearfus,C,0,S
3,0021_01,Europa,False,TRAPPIST-1e,38.0,False,0.0,6652.0,0.0,181.0,585.0,Meratz Caltilter,C,1,S
4,0023_01,Earth,False,TRAPPIST-1e,20.0,False,10.0,0.0,635.0,0.0,0.0,Brence Harperez,F,5,S


In [59]:
number_of_missing_in_cols = train.shape[0] - train.count()
#count() counts not_NA values
number_of_missing_in_cols.sort_values()

PassengerId       0
Transported       0
Age             179
RoomService     181
Destination     182
FoodCourt       183
Spa             183
VRDeck          188
Deck            199
Cabin_num       199
Side            199
Name            200
HomePlanet      201
VIP             203
ShoppingMall    208
CryoSleep       217
dtype: int64

As the counts above show, every column has missing values except Transported (the target) and PassengerId.

This is not a problem for the model, as TensorFlow Decision Forests can handle missing values natively.
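The per-column count computed above can be cross-checked with `isna().sum()`, which counts missing entries directly. A minimal sketch on a made-up frame (the `toy` data is an assumption, not taken from the competition files):

```python
import pandas as pd

# Illustrative frame with a few missing entries
toy = pd.DataFrame(
    {
        "Age": [39.0, None, 58.0, None],
        "VIP": [False, None, True, False],
        "Transported": [False, True, False, True],
    }
)

# Approach used in the notebook: total rows minus non-missing rows
by_count = toy.shape[0] - toy.count()

# Equivalent direct count of missing entries per column
by_isna = toy.isna().sum()

# Fraction of missing entries per column, often easier to compare across columns
missing_share = toy.isna().mean()

print(by_isna.sort_values())
```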

##Dealing with Missing Values


In [60]:
# Count the number of unique values in dataframe
cols_unique_vals_count = train.drop(columns=DROPCOLUMS).nunique().sort_values()
cols_unique_vals_count

CryoSleep          2
VIP                2
Side               2
HomePlanet         3
Destination        3
Deck               8
Age               80
ShoppingMall    1115
RoomService     1273
VRDeck          1306
Spa             1327
FoodCourt       1507
Cabin_num       1817
dtype: int64

In [61]:
# If the column has only 2 unique values it is a binary col
BINARY_COLS = [col for col, val in cols_unique_vals_count.items() if val == 2]
BINARY_COLS

['CryoSleep', 'VIP', 'Side']

N.B.: The Transported column has been dropped even though it is binary: as the missing-value counts above show, it has no missing values, so we do not need to deal with it.

In [62]:
# Maximum number of unique values which represent a Nominal (categorical) feature
NOMINAL_NUNIQUE_THRESHOLD = 10
NOMINAL_COLS = [
    col
    for col, val in cols_unique_vals_count.items()
    if val > 2 and val < NOMINAL_NUNIQUE_THRESHOLD
]
NOMINAL_COLS

['HomePlanet', 'Destination', 'Deck']

In [63]:
# Columns with at least NOMINAL_NUNIQUE_THRESHOLD unique values are treated as numerical
NUMERICAL_COLS = [
    col
    for col, val in cols_unique_vals_count.items()
    if val >= NOMINAL_NUNIQUE_THRESHOLD
]
NUMERICAL_COLS

['Age',
 'ShoppingMall',
 'RoomService',
 'VRDeck',
 'Spa',
 'FoodCourt',
 'Cabin_num']
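The three cells above group columns purely by their number of unique values. The same rule can be sketched as one function; `group_columns_by_cardinality` and the toy data are hypothetical names introduced here for illustration, and the example uses a threshold of 4 only because the toy frame is tiny.

```python
import pandas as pd


def group_columns_by_cardinality(df: pd.DataFrame, nominal_threshold: int = 10) -> dict:
    """Group columns into binary, nominal and numerical by unique-value count."""
    groups = {"binary": [], "nominal": [], "numerical": [], "other": []}
    # nunique() ignores missing values by default
    for col, n_unique in df.nunique().items():
        if n_unique == 2:
            groups["binary"].append(col)
        elif 2 < n_unique < nominal_threshold:
            groups["nominal"].append(col)
        elif n_unique >= nominal_threshold:
            groups["numerical"].append(col)
        else:
            # Constant or empty columns fit none of the three groups
            groups["other"].append(col)
    return groups


toy = pd.DataFrame(
    {
        "CryoSleep": [True, False, None, True, False, True],
        "HomePlanet": ["Earth", "Europa", "Mars", "Earth", None, "Mars"],
        "Age": [39.0, 24.0, 58.0, 33.0, 16.0, 44.0],
    }
)
groups = group_columns_by_cardinality(toy, nominal_threshold=4)
print(groups)
```

This heuristic looks only at cardinality, not at dtype: Cabin_num lands in the numerical group even though the split stored it as text, and a numeric column with few distinct values would be labelled nominal.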

In [64]:
number_of_missing_in_cols = train.shape[0] - train.count()
#count() counts not_NA values in every column
number_of_missing_in_cols.sort_values()

PassengerId       0
Transported       0
Age             179
RoomService     181
Destination     182
FoodCourt       183
Spa             183
VRDeck          188
Deck            199
Cabin_num       199
Side            199
Name            200
HomePlanet      201
VIP             203
ShoppingMall    208
CryoSleep       217
dtype: int64

In [66]:
train.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Deck,Cabin_num,Side
0,0001_01,Europa,False,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,B,0,P
1,0002_01,Earth,False,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,F,0,S
2,0003_01,Europa,False,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,A,0,S
3,0003_02,Europa,False,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,A,0,S
4,0004_01,Earth,False,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,F,1,S


In [67]:
ord_enc = OrdinalEncoder()
train_code = pd.DataFrame(train["Age"])
train_code["home_code"] = ord_enc.fit_transform(train[["HomePlanet"]])
train_code["destination_code"] = ord_enc.fit_transform(train[["Destination"]])
train_code["deck_code"] = ord_enc.fit_transform(train[["Deck"]])
train_code["side_code"] = ord_enc.fit_transform(train[["Side"]])
# The spending columns (RoomService, FoodCourt, ShoppingMall) are not added here:
# DataFrame.append concatenates rows rather than columns and was removed in pandas 2.0.
# Use train_code.join(train[[...]]) to add them as columns.
train_code.head()


Unnamed: 0,Age,home_code,destination_code,deck_code,side_code
0,39.0,1.0,2.0,1.0,0.0
1,24.0,0.0,2.0,5.0,1.0
2,58.0,1.0,2.0,0.0,1.0
3,33.0,1.0,2.0,0.0,1.0
4,16.0,0.0,2.0,5.0,1.0
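In the cell above one encoder is refit for every column, so the fitted categories are overwritten each time and cannot be reused on the test set. A minimal sketch of an alternative, on made-up frames (`toy_train` and `toy_test` are assumptions for illustration): fit a single `OrdinalEncoder` on several columns at once, then apply the same mapping to new data.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data; the real notebook would pass the train/test columns instead
toy_train = pd.DataFrame(
    {"HomePlanet": ["Europa", "Earth", "Mars", "Earth"], "Side": ["P", "S", "S", "P"]}
)
toy_test = pd.DataFrame({"HomePlanet": ["Mars", "Europa"], "Side": ["S", "P"]})

# Fit once on the training columns; categories are sorted per column
encoder = OrdinalEncoder()
encoder.fit(toy_train)

# The same fitted mapping is applied to both frames
train_codes = encoder.transform(toy_train)
test_codes = encoder.transform(toy_test)

print(encoder.categories_)
print(test_codes)
```

Because categories are sorted, Earth maps to 0 and Europa to 1, which matches the codes shown in the output above.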


In [68]:
ord_enc = OrdinalEncoder()
train_code = train.copy()
# Replace each categorical or binary column with its ordinal code;
# fit_transform refits the encoder on every call
train_code["home_code"] = ord_enc.fit_transform(train[["HomePlanet"]])
train_code.drop('HomePlanet', axis='columns', inplace=True)
train_code["destination_code"] = ord_enc.fit_transform(train[["Destination"]])
train_code.drop('Destination', axis='columns', inplace=True)
train_code["deck_code"] = ord_enc.fit_transform(train[["Deck"]])
train_code.drop('Deck', axis='columns', inplace=True)
train_code["side_code"] = ord_enc.fit_transform(train[["Side"]])
train_code.drop('Side', axis='columns', inplace=True)
# CryoSleep and VIP are encoded from their own columns (not from Side)
train_code["cryo_code"] = ord_enc.fit_transform(train[["CryoSleep"]])
train_code.drop('CryoSleep', axis='columns', inplace=True)
train_code["vip_code"] = ord_enc.fit_transform(train[["VIP"]])
train_code.drop('VIP', axis='columns', inplace=True)
# Name and PassengerId are identifiers, not features
train_code.drop('Name', axis='columns', inplace=True)
train_code.drop('PassengerId', axis='columns', inplace=True)
train_code.head()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_num,home_code,destination_code,deck_code,side_code,cryo_code,vip_code
0,39.0,0.0,0.0,0.0,0.0,0.0,False,0,1.0,2.0,1.0,0.0,0.0,0.0
1,24.0,109.0,9.0,25.0,549.0,44.0,True,0,0.0,2.0,5.0,1.0,0.0,0.0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,False,0,1.0,2.0,0.0,1.0,0.0,1.0
3,33.0,0.0,1283.0,371.0,3329.0,193.0,False,0,1.0,2.0,0.0,1.0,0.0,0.0
4,16.0,303.0,70.0,151.0,565.0,2.0,True,1,0.0,2.0,5.0,1.0,0.0,0.0


In [70]:
def array_to_dataframe(arr: np.ndarray, columns: List[str]) -> pd.DataFrame:
    # SimpleImputer returns a plain array, so the column names are restored here
    return pd.DataFrame(data=arr, columns=columns)

In [72]:
# Replace every remaining missing value with the median of its column
median_imp_arr = SimpleImputer(strategy="median").fit_transform(train_code)
median_imp_df = array_to_dataframe(arr=median_imp_arr, columns=train_code.columns)
median_imp_df

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Cabin_num,home_code,destination_code,deck_code,side_code,cryo_code,vip_code
0,39.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,2.0,1.0,0.0,0.0,0.0
1,24.0,109.0,9.0,25.0,549.0,44.0,1.0,0.0,0.0,2.0,5.0,1.0,0.0,0.0
2,58.0,43.0,3576.0,0.0,6715.0,49.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,1.0
3,33.0,0.0,1283.0,371.0,3329.0,193.0,0.0,0.0,1.0,2.0,0.0,1.0,0.0,0.0
4,16.0,303.0,70.0,151.0,565.0,2.0,1.0,1.0,0.0,2.0,5.0,1.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8688,41.0,0.0,6819.0,0.0,1643.0,74.0,0.0,98.0,1.0,0.0,0.0,0.0,0.0,1.0
8689,18.0,0.0,0.0,0.0,0.0,0.0,0.0,1499.0,0.0,1.0,6.0,1.0,1.0,0.0
8690,26.0,0.0,0.0,1872.0,1.0,0.0,1.0,1500.0,0.0,2.0,6.0,1.0,0.0,0.0
8691,32.0,0.0,1049.0,0.0,353.0,3235.0,0.0,608.0,1.0,0.0,4.0,1.0,0.0,0.0
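The cell above fits the imputer and transforms the training data in one step. If the same medians should also fill the test set, the fitted imputer has to be kept. A minimal sketch on made-up frames (`toy_train`, `toy_test` and their values are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Illustrative data with missing values in both frames
toy_train = pd.DataFrame(
    {"Age": [16.0, 24.0, np.nan, 58.0], "Spa": [0.0, np.nan, 549.0, 565.0]}
)
toy_test = pd.DataFrame({"Age": [np.nan, 33.0], "Spa": [10.0, np.nan]})

# Learn the per-column medians from the training data only
imputer = SimpleImputer(strategy="median")
train_filled = pd.DataFrame(imputer.fit_transform(toy_train), columns=toy_train.columns)

# Reuse the training medians for the test data
test_filled = pd.DataFrame(imputer.transform(toy_test), columns=toy_test.columns)

print(imputer.statistics_)
print(test_filled)
```

Note that the notebook cell imputes every column of `train_code`, including the Transported target; since that column has no missing values, it is only converted to floats.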


#Model Training

In [None]:
# Fill missing flags and spending amounts with 0 before casting booleans to integers
ZERO_FILL_COLS = ['VIP', 'CryoSleep', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']
train[ZERO_FILL_COLS] = train[ZERO_FILL_COLS].fillna(value=0)
train.isnull().sum().sort_values(ascending=False)

Next, we convert the boolean values in Transported, VIP and CryoSleep into the numerical values 0 and 1.

In [None]:
train["Transported"] = train["Transported"].astype(int)
train["VIP"] = train["VIP"].astype(int)
train["CryoSleep"] = train["CryoSleep"].astype(int)
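The order of the two steps matters: a column that still contains missing values is expected to fail the integer cast, which is why the missing flags are filled first. A small sketch on a made-up series (the `flags` data is an assumption for illustration):

```python
import numpy as np
import pandas as pd

# Object column with booleans and one missing value, as read_csv produces for VIP
flags = pd.Series([True, False, np.nan, True], dtype=object, name="VIP")

# Casting directly is expected to fail because NaN has no integer representation
try:
    flags.astype(int)
    cast_failed = False
except (TypeError, ValueError):
    cast_failed = True

# Filling first, as the notebook does, makes the cast succeed
as_int = flags.fillna(0).astype(int)

print(cast_failed, as_int.tolist())
```

Filling with 0 treats a missing flag as False; that is a modelling choice made by the notebook rather than something the data guarantees.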

At this point, we can train the Random Forest model.

In [None]:
X_train = tfdf.keras.pd_dataframe_to_tf_dataset(train, label="Transported")

tfdf_model = tfdf.keras.RandomForestModel()
tfdf_model.fit(X_train)
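The model above is trained on the full training frame, so no rows are left for checking it. The constants TRAIN_SIZE and RANDOM_STATE defined at the top are not used in the cells shown; one possible use, offered here as an assumption about their intent, is a hold-out split done in pandas before the conversion to a TensorFlow dataset. The `holdout_split` helper and the toy frame are hypothetical.

```python
import pandas as pd

TRAIN_SIZE = 0.8
RANDOM_STATE = 3993


def holdout_split(df: pd.DataFrame, train_size: float = TRAIN_SIZE, seed: int = RANDOM_STATE):
    """Randomly split a frame into a training part and a validation part."""
    # Sample the training rows reproducibly, then keep the rest for validation
    train_part = df.sample(frac=train_size, random_state=seed)
    valid_part = df.drop(index=train_part.index)
    return train_part, valid_part


# Illustrative frame standing in for the prepared training data
toy = pd.DataFrame({"Age": range(10), "Transported": [0, 1] * 5})
train_part, valid_part = holdout_split(toy)

print(len(train_part), len(valid_part))
```

Each part could then be passed to `tfdf.keras.pd_dataframe_to_tf_dataset` with `label="Transported"`, in the same way as the full frame above.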

In [None]:
tfdf.model_plotter.plot_model_in_colab(tfdf_model, tree_idx=1)

In [None]:
tfdf_model.summary()