
In [None]:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import KFold, cross_val_score
import xgboost as xgb

# 0. Load Datasets

In [None]:
INPUT_DIR = "../input/spaceship-titanic"
y_variate = "Transported"

train_data = pd.read_csv(f"{INPUT_DIR}/train.csv")
test_data = pd.read_csv(f"{INPUT_DIR}/test.csv")

# 1. First Look at Training and Testing Data

Let's see what the training and testing datasets look like:

In [None]:
train_data.head()

In [None]:
test_data.head()

Initial observations:

* According to the problem brief, the `Cabin` variate actually consists of three variates: the deck, the number and the side. We will want to extract these before doing any more data analyses.

* Moreover, the `PassengerId` variate actually consists of two components: the group a passenger is travelling with, and their number within the group. We will also want to extract these before doing any more data analyses.

Similarly, we can check their dimensions:

In [None]:
train_data.shape

In [None]:
test_data.shape

Finally, let's check how many missing values each column of the training dataset has:

In [None]:
train_data.isna().sum()
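Raw counts can be hard to compare across columns, so it may help to look at the share of missing values instead. The sketch below uses a small made-up frame (`toy` and its values are illustrative, not taken from the competition data) to show the idea:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "VIP": [False, None, None, True],
})

# isna() gives a boolean frame; the column-wise mean is the missing share
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```

The same expression applied to `train_data` would list the columns with the largest share of missing values first.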

# 2. Feature Engineering

Before we proceed with our analyses, it is helpful to fill in the missing values explicitly and to extract the features mentioned above (the cabin components and the passenger group) into their own columns.

In [None]:
def process_data(data):
    """Split composite columns and fill missing values (modifies `data` in place)."""
    # split Cabin into deck, number and side; a missing cabin gives NaN in all three
    data[["Deck", "Num", "Side"]] = data["Cabin"].str.split("/", expand=True)
    data["Num"] = pd.to_numeric(data["Num"])
    # Cabin itself is kept here; it is dropped later in prepare_Xy

    # split PassengerId into the travel group and the number within the group
    data[["Group", "NumberInGroup"]] = data["PassengerId"].str.split("_", expand=True)

    # categorical columns: use an explicit "Missing" category
    data["HomePlanet"] = data["HomePlanet"].fillna("Missing")
    data["Destination"] = data["Destination"].fillna("Missing")

    # boolean columns: treat missing as False and enforce a bool dtype
    data["CryoSleep"] = data["CryoSleep"].fillna(False).astype(bool)
    data["VIP"] = data["VIP"].fillna(False).astype(bool)

    # numeric columns: treat missing spending and age as 0
    spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
    data[spend_cols] = data[spend_cols].fillna(0)
    data["Age"] = data["Age"].fillna(0)

    # cabin components: 'M' marks a missing deck or side, 0 a missing number
    data["Deck"] = data["Deck"].fillna("M")
    data["Num"] = data["Num"].fillna(0)
    data["Side"] = data["Side"].fillna("M")

    return data

train_data = process_data(train_data)
test_data = process_data(test_data)
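To make the splitting step concrete, here is a minimal sketch on a few made-up rows (the `toy` frame and its values are hypothetical). It shows how `str.split` with `expand=True` turns `Cabin` and `PassengerId` into separate columns, and what happens when a cabin is missing:

```python
import numpy as np
import pandas as pd

# Hypothetical rows for illustration, including one missing cabin
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02"],
    "Cabin": ["B/0/P", np.nan, "F/12/S"],
})

# expand=True returns one column per component;
# a missing cabin stays missing in all three new columns
toy[["Deck", "Num", "Side"]] = toy["Cabin"].str.split("/", expand=True)
toy["Num"] = pd.to_numeric(toy["Num"])

# same idea for the passenger id: group, then number within the group
toy[["Group", "NumberInGroup"]] = toy["PassengerId"].str.split("_", expand=True)

print(toy)
```

Note that `Group` and `NumberInGroup` remain strings here, so leading zeros are preserved.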

# 3. EDA

## 3.1. Home Planet

Next, we can perform some exploratory data analysis. We start with the distribution of `HomePlanet` in both the training and test data. For the training dataset, we additionally split the counts by `Transported` to compare passengers who were transported with those who were not:

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2)
fig.suptitle("Home Planet")
order = ["Earth", "Europa", "Mars", "Missing"]

sns.countplot(ax=axes[0], data=train_data, x="HomePlanet", hue="Transported", order=order)
axes[0].set_title("Training Data")
sns.countplot(ax=axes[1], data=test_data, x="HomePlanet", order=order)
axes[1].set_title("Test Data")
plt.show()

We see that in both datasets the majority of passengers come from `Earth`, with roughly similar numbers coming from `Europa` and `Mars`. In the training data, the distribution of `HomePlanet` looks similar for passengers who were transported and those who were not.
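Count plots make it hard to judge small differences between categories, so a per-category transported rate can complement them. The following is a sketch on made-up values (the `toy` frame is hypothetical and says nothing about the real data):

```python
import pandas as pd

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Europa", "Mars"],
    "Transported": [True, False, False, True, True, False],
})

# the mean of a boolean column is the share of True values,
# so this gives the transported rate for each home planet
rates = toy.groupby("HomePlanet")["Transported"].mean()
print(rates)
```

Applied to `train_data`, the same `groupby` would show whether the transported rate really is similar across home planets.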

## 3.2. Groups

We can also investigate how many distinct groups there are in the training and test datasets:

In [None]:
print("Number of groups in training dataset:", train_data["Group"].nunique())
print("Number of passengers in training dataset:", train_data.shape[0])

print("Number of groups in test dataset:", test_data["Group"].nunique())
print("Number of passengers in test dataset:", test_data.shape[0])

Since the number of groups is very close to the number of passengers in both datasets, most groups appear to contain only a single passenger. The raw `Group` identifier is therefore nearly unique per passenger, so it might not make sense to put it in our model.
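One way to check this more directly is to count how many passengers each group contains. The sketch below does so on a handful of made-up ids (the `toy` frame is hypothetical):

```python
import pandas as pd

# Hypothetical passenger ids, for illustration only
toy = pd.DataFrame(
    {"PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"]}
)
toy["Group"] = toy["PassengerId"].str.split("_").str[0]

# passengers per group
group_sizes = toy["Group"].value_counts()
# number of groups for each group size
size_distribution = group_sizes.value_counts().sort_index()
# share of groups that consist of a single passenger
solo_share = (group_sizes == 1).mean()

print(size_distribution)
print(f"Share of single-passenger groups: {solo_share:.2f}")
```

If most groups turn out to be single passengers, the group identifier itself carries little information, although the group size could still be examined as a separate variate.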

## 3.3. Destination

We can also investigate the `Destination` variate for both datasets. As before, the plot for the training dataset splits the counts into passengers who were transported and those who were not:

In [None]:
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 5))
fig.suptitle("Destination")
order = ['TRAPPIST-1e', 'PSO J318.5-22', '55 Cancri e', 'Missing']

sns.countplot(ax=axes[0], data=train_data, x="Destination", hue="Transported", order=order)
axes[0].set_title("Training Data")
sns.countplot(ax=axes[1], data=test_data, x="Destination", order=order)
axes[1].set_title("Test Data")
plt.show()

We again see a similar distribution of `Destination` across both datasets, and, within the training dataset, a similar distribution for passengers who were transported and those who were not.

## 3.4. Boolean Variates

Subsequently, we can investigate the distribution of the boolean variates:

In [None]:
for variate in ['CryoSleep', 'VIP']:
    for name, data in [("training", train_data), ("test", test_data)]:
        # the sum of a boolean column is the number of True values
        n_true = data[variate].sum()
        print(f"Proportion of {name} data with true {variate}: {n_true} / {data.shape[0]} = {n_true / data.shape[0]:.1%}")

We can see that about one third of the passengers were in `CryoSleep`, whereas about 5% of the passengers in both the training and test datasets are `VIP`, which aligns with our intuition.

## 3.5. Cabin Variates

Next, we can analyze the `Deck`, `Num` and `Side` variates. `Num` has many unique values, which makes sense intuitively, so let us focus on the other two variates.

In [None]:
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
fig.tight_layout(pad=5.0)
cabin_variates = ['Deck', 'Side']

for idx, v in enumerate(cabin_variates):
    sns.countplot(ax=axes[idx, 0], data=train_data, x=v, hue="Transported")
    axes[idx, 0].set_title(f"Training Data: {v}")
    sns.countplot(ax=axes[idx, 1], data=test_data, x=v)
    axes[idx, 1].set_title(f"Test Data: {v}")
plt.show()

From this, it seems that the distribution of transported passengers across the `Deck` and `Side` variates is similar to that of passengers who were not transported.

## 3.6. Luxury Amenities Variates

Next, we can investigate the variates associated with the luxury amenities. For the training data, we further split the histograms by whether the passenger was `Transported` or not:

In [None]:
fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(15, 15))
fig.tight_layout(pad=5.0)
luxury_variates = ['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

for idx, v in enumerate(luxury_variates):
    sns.histplot(ax=axes[idx, 0], data=train_data, x=v, hue="Transported", bins=30)
    axes[idx, 0].set_title(f"Training Data: {v}")
    sns.histplot(ax=axes[idx, 1], data=test_data, x=v, bins=30)
    axes[idx, 1].set_title(f"Test Data: {v}")
plt.show()

All these histograms are heavily right-skewed: most of the mass sits near zero, with a long tail of large values. Applying a log transformation to each of them might help stabilize their variance and make the histograms more symmetric:

In [None]:
for data in (train_data, test_data):
    for v in luxury_variates:
        # log1p(x) = log(1 + x), which is well defined for zero spending
        data["Transformed" + v] = np.log1p(data[v])
    # drop the raw columns in place now that the transformed versions exist
    data.drop(columns=luxury_variates, inplace=True)

fig, axes = plt.subplots(nrows=5, ncols=2, figsize=(15, 15))
fig.tight_layout(pad=5.0)

transformed_variates = ["Transformed" + v for v in luxury_variates]
for idx, v in enumerate(transformed_variates):
    sns.histplot(ax=axes[idx, 0], data=train_data, x=v, hue="Transported", bins=30)
    axes[idx, 0].set_title(f"Training Data: {v}")
    sns.histplot(ax=axes[idx, 1], data=test_data, x=v, bins=30)
    axes[idx, 1].set_title(f"Test Data: {v}")
plt.show()

We see that the histograms are a lot more symmetric now. In particular, the distributions of the luxury variates for passengers who were transported are very similar to those for passengers who were not.
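The effect of the transformation can also be quantified with the sample skewness. The sketch below uses synthetic spending values (an assumption for illustration: half zeros, half log-normal draws, not the competition data) and compares the skewness before and after `np.log1p`:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic spending: roughly half the passengers spend nothing,
# the rest follow a log-normal distribution (illustrative assumption)
is_zero = rng.random(1000) < 0.5
amounts = rng.lognormal(mean=5, sigma=1.5, size=1000)
spend = pd.Series(np.where(is_zero, 0.0, amounts))

# log1p(x) = log(1 + x) maps zero spending to zero
log_spend = np.log1p(spend)

print("skewness before:", spend.skew())
print("skewness after: ", log_spend.skew())
```

A positive skewness indicates a long right tail, and a value closer to zero indicates a more symmetric distribution.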

# 3. Feature Engineering

Before modelling, we need to convert all our categorical variates into one-hot encodings.

In [None]:
def make_one_hot_encodings(data):
    categorical_cols = ["HomePlanet", "Deck", "Side", "Destination"]
    data = pd.get_dummies(data, columns=categorical_cols) 
    return data
    
train_data = make_one_hot_encodings(train_data)
test_data = make_one_hot_encodings(test_data)

train_data.head()

In [None]:
train_data.columns
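Because the two datasets are encoded separately, a category that appears in only one of them would produce mismatched columns. A minimal sketch of one way to guard against this, using made-up frames (`train_toy` and `test_toy` are hypothetical), is to reindex the test columns to the training columns:

```python
import pandas as pd

# Hypothetical toy data: deck "T" appears only in the training frame
train_toy = pd.DataFrame({"Deck": ["A", "B", "T"]})
test_toy = pd.DataFrame({"Deck": ["A", "B", "B"]})

train_enc = pd.get_dummies(train_toy, columns=["Deck"])
test_enc = pd.get_dummies(test_toy, columns=["Deck"])

# align the test columns with the training columns;
# categories unseen in the test data become all-False columns
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=False)

print(test_enc)
```

With the real data, the target column would need to be excluded from the list of training columns before reindexing, since the test data has no `Transported` column.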

# 4. Modelling

We are now ready to start modelling. Our strategy is to fit gradient-boosted trees (XGBoost's `XGBClassifier`) and evaluate them with k-fold cross-validation.

We first need to prepare the model inputs by removing any variates not needed for fitting and separating the features from the target.

In [None]:
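def prepare_Xy(data, isTest=False):
    """Return the feature matrix X, plus the target y for training data."""
    # identifiers and the raw Cabin string are not used as model features
    drop_cols = ["PassengerId", "Name", "Group", "NumberInGroup", "Cabin"]
    X = data.drop(columns=drop_cols, errors="ignore")
    if isTest:
        return X
    X = X.drop(columns=[y_variate])
    y = data[y_variate]
    return X, y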

X_train, y_train = prepare_Xy(train_data)
X_train.head()

We can now fit the model and calculate the average cross-validation score:

In [None]:
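xgbc = xgb.XGBClassifier(n_estimators=100)

# estimate out-of-sample accuracy with shuffled 10-fold cross-validation;
# cross_val_score fits a fresh clone of the model on each fold
kfold = KFold(n_splits=10, shuffle=True)
kf_cv_scores = cross_val_score(xgbc, X_train, y_train, cv=kfold)
print("K-fold CV average score: %.2f" % kf_cv_scores.mean())

# fit the final model on the full training set for prediction
xgbc.fit(X_train, y_train)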
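The average alone hides how much the score varies between folds. The sketch below reports the per-fold scores and their standard deviation as well. It runs on synthetic data and, as a stand-in for XGBoost, uses scikit-learn's `GradientBoostingClassifier`; the data, the stratified splitter and the fixed `random_state` values are all assumptions made for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data (not the competition data)
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Stand-in gradient boosting model; fixed seed for reproducibility
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Stratified folds keep the class balance similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("per-fold accuracy:", np.round(scores, 3))
print("mean: %.3f, std: %.3f" % (scores.mean(), scores.std()))
```

Fixing `random_state` on the splitter makes the folds, and therefore the reported scores, repeatable between runs.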

# 5. Prediction

Finally, we can predict the `Transported` variate for the passengers in the test data.

In [None]:
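X_test = prepare_Xy(test_data, isTest=True)
# make sure the test columns match the training columns in name and order
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)
y_test = xgbc.predict(X_test)

# copy to avoid pandas' SettingWithCopyWarning when adding a column
submission = test_data[["PassengerId"]].copy()
submission["Transported"] = y_test.astype(bool)
submission.head()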

In [None]:
# write the submission file for the competition
submission.to_csv("submission.csv", index=False)