# Spaceship Titanic - EDA + Random Forest (Work in Progress)

> Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.
> The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.
> While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension
> **To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system. Help save them and change history!**

[Link to the competition](https://www.kaggle.com/competitions/spaceship-titanic/overview)

## Variables
- **HomePlanet** - The planet the passenger departed from.
- **CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage.
- **Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
- **Destination** - The planet the passenger will be debarking to.
- **Age** - The age of the passenger.
- **VIP** - Whether the passenger has paid for special VIP service during the voyage.
- **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
- **Name** - The first and last names of the passenger.
- **Transported** - Whether the passenger was transported to another dimension. This is the **target column**.

## Goal
The competition is a binary classification problem with two possible outcomes of the voyage in space: the passenger has either been transported to another dimension (`True`) or not (`False`). The main metric for the competition is classification `accuracy`.

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
%matplotlib inline

In [None]:
df = pd.read_csv("/kaggle/input/spaceship-titanic/train.csv")
df.head()

In [None]:
print(df.info())
display(df.describe())
display(df.describe(exclude=np.number))

In [None]:
display(df.shape)
display(df.duplicated().sum())
display(df.nunique())
display(df.isna().sum())

In [None]:
print(df.HomePlanet.value_counts(), "\n")
print(df.CryoSleep.value_counts(), "\n")
print(df.Destination.value_counts(), "\n")
print(df.VIP.value_counts())

In [None]:
fig = px.histogram(df, x="Age", color="Transported")
fig.show()

In [None]:
fig = px.histogram(df, x="RoomService", color="Transported")
fig.show()

In [None]:
pd.crosstab(index=df.Transported, columns="perc", normalize=True)

In [None]:
pd.crosstab(index=df.HomePlanet, columns=df.Transported, normalize="index")

In [None]:
pd.crosstab(index=df.CryoSleep, columns=df.Transported, normalize="index")

In [None]:
pd.crosstab(index=df.Destination, columns=df.Transported, normalize="index")

In [None]:
pd.crosstab(index=df.Age, columns=df.Transported, normalize="index")

In [None]:
pd.crosstab(index=df.VIP, columns=df.Transported, normalize="index")

In [None]:
fig = px.imshow(df.corr(), text_auto=True, width=800, height=800)
fig.show()

## Data Transformation

In [None]:
df.isna().sum()

In [232]:
from sklearn.model_selection import train_test_split

df = pd.read_csv("data/raw/train.csv")
df["Missingness"] = df.isna().sum(axis=1)
df[["CabinDeck", "CabinNumber", "CabinSide"]] = df.Cabin.str.split("/", expand=True)
df[["GroupId", "GroupNum"]] = df.PassengerId.str.split("_", expand=True)
df[["FirstName", "LastName"]] = df.loc[:, "Name"].str.split(expand=True)
df["GroupSize"] = df.groupby("GroupId")["GroupId"].transform("count")
df["Solo"] = df.GroupSize == 1
df["TotalSpending"] = (
    df["RoomService"]
    + df["FoodCourt"]
    + df["ShoppingMall"]
    + df["Spa"]
    + df["VRDeck"]
)
df["NoSpending"] = df["TotalSpending"] == 0
df = df.fillna(df.mode().iloc[0])

features = df.drop("Transported", axis=1)
target = df.set_index("PassengerId").loc[:, "Transported"].astype(int)

features_train, features_test, target_train, target_test = train_test_split(
    features, target, test_size=0.3, random_state=42
)

In [233]:
import pdpipe as pdp

preprocessing = pdp.PdPipeline(
    [
        pdp.df.set_index("PassengerId"),
        pdp.Encode(["CryoSleep", "VIP", "CabinSide", "Solo", "NoSpending"]),
        pdp.OneHotEncode(["HomePlanet", "Destination", "CabinDeck", "Missingness"], drop_first=True),
        pdp.Log(["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "TotalSpending"], drop=True, const_shift=1),
        pdp.Scale("MinMaxScaler"),
        pdp.ColDrop(["Cabin", "Name", "CabinNumber", "GroupId", "GroupNum", "FirstName", "LastName"])
    ]
)

In [234]:
features_train_clean = preprocessing.fit_transform(features_train)
features_test_clean = preprocessing.transform(features_test)

In [227]:
features_train_clean.head()

Unnamed: 0_level_0,CryoSleep,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,CabinSide,GroupSize,...,CabinDeck_B,CabinDeck_C,CabinDeck_D,CabinDeck_E,CabinDeck_F,CabinDeck_G,CabinDeck_T,Missingness_1,Missingness_2,Missingness_3
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1,Unnamed: 9_level_1,Unnamed: 10_level_1,Unnamed: 11_level_1,Unnamed: 12_level_1,Unnamed: 13_level_1,Unnamed: 14_level_1,Unnamed: 15_level_1,Unnamed: 16_level_1,Unnamed: 17_level_1,Unnamed: 18_level_1,Unnamed: 19_level_1,Unnamed: 20_level_1,Unnamed: 21_level_1
3282_03,0.0,0.544304,0.0,0.0,0.710954,0.0,0.453163,0.497185,1.0,0.285714,...,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
8276_02,1.0,0.291139,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.142857,...,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1911_01,0.0,0.582278,0.0,0.229596,0.633582,0.0,0.182285,0.447001,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1808_01,0.0,0.417722,0.0,0.0,0.648928,0.218315,0.111767,0.340289,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
6995_01,0.0,0.303797,0.0,0.0,0.398584,0.638694,0.0,0.370383,1.0,0.0,...,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0


In [228]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

model_base = DecisionTreeClassifier(criterion='entropy', random_state=42)

model_base.fit(features_train_clean, target_train)
target_test_pred = model_base.predict(features_test_clean)
accuracy_score(target_test, target_test_pred)

0.7315950920245399

In [229]:
from sklearn.ensemble import RandomForestClassifier

model_rf = RandomForestClassifier(random_state=42)

model_rf.fit(features_train_clean, target_train)
target_test_pred = model_rf.predict(features_test_clean)
accuracy_score(target_test, target_test_pred)

0.7772239263803681

In [241]:
from catboost import CatBoostClassifier

model_cat = CatBoostClassifier(verbose=0, random_state=42)
model_cat.fit(features_train_clean, target_train)
target_test_pred = model_cat.predict(features_test_clean)
accuracy_score(target_test, target_test_pred)

0.803680981595092

In [242]:
model_cat.feature_importances_

array([7.94884614e+00, 7.56420385e+00, 1.01512053e-01, 7.44564489e+00,
       7.81501967e+00, 4.61814831e+00, 1.19612319e+01, 9.99811880e+00,
       3.68920682e+00, 1.80870111e+00, 2.79174179e-01, 7.39655685e+00,
       9.79161497e-01, 7.55419328e+00, 4.05024804e+00, 5.75290786e-01,
       1.80600000e+00, 1.38822116e+00, 2.95644225e+00, 3.07323155e-01,
       3.49140256e+00, 2.62632826e+00, 2.42629457e+00, 1.49430657e-03,
       9.95053353e-01, 1.96610791e-01, 1.95714694e-02])

In [239]:
from sklearn.model_selection import GridSearchCV

model_cat = CatBoostClassifier(verbose=0, random_state=42)
param_grid = {'iterations':[500, 600,700,800],
              'learning_rate':[0.04, 0.05, 0.06],
              'depth':[6,7,8],
              'l2_leaf_reg': [10,20,30]
             }

grid_cat = GridSearchCV(estimator=model_cat, param_grid=param_grid, cv=3, scoring="accuracy")
grid_cat.fit(features_train_clean, target_train)
grid_cat.best_params_

{'depth': 7, 'iterations': 600, 'l2_leaf_reg': 20, 'learning_rate': 0.05}

In [238]:
model_cat_best = CatBoostClassifier(depth=7, iterations=600, learning_rate=0.1, l2_leaf_reg=20, verbose=0, random_state=42)
model_cat_best.fit(features_train_clean, target_train)
target_test_pred = model_cat_best.predict(features_test_clean)
accuracy_score(target_test, target_test_pred)

0.7960122699386503

In [None]:
from sklearn.model_selection import RandomizedSearchCV# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

max_features = ['auto', 'sqrt']

max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

min_samples_split = [2, 5, 10]

min_samples_leaf = [1, 2, 4]

bootstrap = [True, False]
random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}

In [None]:
rf_random = RandomizedSearchCV(estimator=model_rf, param_distributions=random_grid, n_iter=100, cv=3, verbose=2, random_state=42, n_jobs=-1)
rf_random.fit(features_train_clean, target_train)

In [None]:
rf_random.best_params_

In [None]:
rf_random.best_params_

In [None]:
from sklearn.model_selection import GridSearchCV# Create the parameter grid based on the results of random search
param_grid = {
    'bootstrap': [True],
    'max_depth': [40, 50, 60],
    'max_features': ['sqrt'],
    'min_samples_leaf': [3, 4, 5],
    'min_samples_split': [9, 10, 11],
    'n_estimators': [700, 800, 900]
}

grid_search = GridSearchCV(estimator=model_rf, param_grid=param_grid, scoring='accuracy',
                          cv=3, n_jobs=-1, verbose=2)

In [None]:
grid_search.fit(features_train_clean, target_train)
grid_search.best_params_

In [None]:
model_rf_best = grid_search.best_estimator_
model_rf_best.fit(features_train_clean, target_train)
target_test_pred = model_rf_best.predict(features_test_clean)
accuracy_score(target_test, target_test_pred)

In [219]:
features_aim = pd.read_csv("data/raw/test.csv")
features_aim["Missingness"] = features_aim.isna().sum(axis=1)
features_aim = features_aim.fillna(features_aim.mode().iloc[0])
features_aim[["CabinDeck", "CabinNumber", "CabinSide"]] = features_aim.Cabin.str.split("/", expand=True)
features_aim[["GroupId", "GroupNum"]] = features_aim.PassengerId.str.split("_", expand=True)
features_aim[["FirstName", "LastName"]] = features_aim.loc[:, "Name"].str.split(expand=True)
features_aim["GroupSize"] = features_aim.groupby("GroupId")["GroupId"].transform("count")
features_aim["Solo"] = features_aim.GroupSize == 1
features_aim["TotalSpending"] = (
    features_aim["RoomService"]
    + features_aim["FoodCourt"]
    + features_aim["ShoppingMall"]
    + features_aim["Spa"]
    + features_aim["VRDeck"]
)
features_aim["NoSpending"] = features_aim["TotalSpending"] == 0

In [220]:
features_aim_clean = preprocessing.transform(features_aim)
features_clean = preprocessing.transform(features)
model_cat_best.fit(features_clean, target)

<catboost.core.CatBoostClassifier at 0x7f1cfc6e7520>

In [221]:
target_aim_pred = model_cat_best.predict(features_aim_clean)
target_aim_pred

array([1, 0, 1, ..., 1, 1, 1])

In [222]:
submission = pd.DataFrame({"PassengerId": features_aim_clean.index, "Transported": target_aim_pred})
submission["Transported"] = submission["Transported"].astype(bool)
submission.to_csv("submission_cat_optimized.csv", index=False)