# Spaceship Titanic

The main goal of this competition is to predict whether a passenger was transported to an alternate dimension during the *Spaceship Titanic's* collision with the spacetime anomaly.

* Link to the competition: https://www.kaggle.com/competitions/spaceship-titanic/overview

## Get Data

We have two files:

* **train.csv** - personal records for about two-thirds (~8700) of the passengers
* **test.csv** - personal records for the remaining one-third (~4300) of the passengers. We will need to predict the value of `Transported` for the passengers in this set.

In [None]:
pip install kaggle

In [None]:
from google.colab import userdata

# Retrieve credentials
KAGGLE_KEY =  userdata.get('KAGGLE_KEY')
KAGGLE_USERNAME = userdata.get('KAGGLE_USERNAME')

# Set environment variables with %env so the kaggle CLI can read the credentials
%env KAGGLE_USERNAME=$KAGGLE_USERNAME
%env KAGGLE_KEY=$KAGGLE_KEY

In [None]:
!kaggle competitions download -c spaceship-titanic


In [None]:
!unzip /content/spaceship-titanic.zip

## Inspect Data

In [None]:
import pandas as pd
test_df = pd.read_csv('/content/test.csv')
train_df = pd.read_csv('/content/train.csv')

In [None]:
# Check the train_df
train_df.head()

In [None]:
train_df.describe()

In [None]:
train_df.info()

In [None]:
# Check how much data is missing in each column
train_df.isnull().sum()
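
Raw counts are easier to judge as a share of the rows. The sketch below shows one way to do that on a small toy frame (hypothetical values standing in for `train_df`); the same expression should work on the real dataframe.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train_df (hypothetical values)
toy_df = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "HomePlanet": ["Europa", "Earth", None, "Earth"],
    "VIP": [False, False, True, False],
})

# isnull() gives True/False per cell; the mean of that is the fraction missing
missing_pct = toy_df.isnull().mean().mul(100).round(1)
print(missing_pct)
```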

## Data Visualization

In this competition we want to predict which passengers were `Transported`. We are going to check whether there is any relation between different features and whether a passenger was `Transported` or not.

In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

figs, axs = plt.subplots(2,2, figsize=(15, 10)) # Adjust figsize for better readability

# Plot Transported vs VIP
pd.crosstab(train_df["Transported"], train_df["VIP"]).plot(kind="bar",color = ["lightseagreen", "olivedrab"], ax=axs[0][0])
axs[0][0].set_title("Transported vs VIP")
axs[0][0].set_xlabel("Transported")
axs[0][0].set_ylabel("Count")
axs[0][0].tick_params(axis='x', rotation=0) # Rotate x-axis labels

# Plot Transported vs CryoSleep
pd.crosstab(train_df["Transported"], train_df["CryoSleep"]).plot(kind="bar",color = ["lightseagreen", "olivedrab"], ax=axs[0][1])
axs[0][1].set_title("Transported vs CryoSleep")
axs[0][1].set_xlabel("Transported")
axs[0][1].set_ylabel("Count")
axs[0][1].tick_params(axis='x', rotation=0) # Rotate x-axis labels

# Plot Transported vs HomePlanet
pd.crosstab(train_df["Transported"], train_df["HomePlanet"]).plot(kind="bar",color = ["lightseagreen", "olivedrab", "indigo"], ax=axs[1][0])
axs[1][0].set_title("Transported vs HomePlanet")
axs[1][0].set_xlabel("Transported")
axs[1][0].set_ylabel("Count")
axs[1][0].tick_params(axis='x', rotation=0) # Rotate x-axis labels


# Plot Transported vs Age
sns.kdeplot(data=train_df, x='Age', hue='Transported', fill=True, ax=axs[1][1], palette='magma')
axs[1][1].set_title("Transported vs Age")
# Keep the legend seaborn builds from `hue` (so labels match the curves) and only retitle it
axs[1][1].get_legend().set_title("Transported")


plt.tight_layout() # Adjust layout to prevent overlapping titles/labels
plt.show()

From the different plots we can see that:

* **VIP** status shows no clear relation with being transported.
* Passengers in **CryoSleep** were transported more often than those who weren't.
* Passengers from **Europa** and **Mars** were transported more often than those from Earth.
* Passengers aged roughly 10 to 35 were the ones transported most often.
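
Observations read off bar charts can be checked numerically by computing the share of transported passengers per group. This is a minimal sketch on a toy frame (hypothetical rows, not competition data); on the real data the same `groupby` would be applied to `train_df`.

```python
import pandas as pd

# Toy stand-in for train_df (hypothetical rows, not competition data)
toy_df = pd.DataFrame({
    "CryoSleep": [True, True, False, False, False, True],
    "HomePlanet": ["Europa", "Earth", "Earth", "Mars", "Earth", "Europa"],
    "Transported": [True, True, False, True, False, True],
})

# The mean of a boolean column is the fraction of True values,
# so this gives the share of transported passengers in each group.
rate_by_cryo = toy_df.groupby("CryoSleep")["Transported"].mean()
rate_by_planet = toy_df.groupby("HomePlanet")["Transported"].mean()

print(rate_by_cryo)
print(rate_by_planet)
```

Comparing rates instead of counts avoids being misled by groups of very different sizes.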

## Prepare Data

We are going to split the `Cabin` and `PassengerId` columns into several columns, splitting on "/" and "_" respectively.

 * `Cabin` is going to be split into `Deck`, `Num` and `Side`.

 * `PassengerId` is going to be split into `Group_ID` and `Number_Group`. From `Group_ID` we compute `GroupSize`, and from that a new column called `Alone` that is True for passengers travelling alone and False for those travelling in a group.

After this we need to fill the missing data in each column.
 * For the *numerical columns* we are going to use the median to fill those values.
 * For the *categorical/boolean columns* we are going to use **.mode()**, the value that appears most often, to fill those missing values.

 After all this, we will drop the columns that we don't need anymore.
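
To make the plan concrete, here is a small sketch of the same steps on a toy frame (hypothetical rows; only the column names come from the dataset). It is meant as an illustration of the idea, not as a replacement for the cells that follow.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw data (hypothetical rows)
toy_df = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0002_02", "0003_01"],
    "Cabin": ["B/0/P", "F/1/S", None, "F/2/S"],
    "Age": [39.0, np.nan, 16.0, 24.0],
})

# "deck/num/side" -> three columns; a missing Cabin stays missing in all three
toy_df[["Deck", "Num", "Side"]] = toy_df["Cabin"].str.split("/", expand=True)

# "gggg_pp" -> group id and position within the group
toy_df[["Group_ID", "Number_Group"]] = toy_df["PassengerId"].str.split("_", expand=True)

# Group size is the number of rows sharing a Group_ID
toy_df["GroupSize"] = toy_df.groupby("Group_ID")["Group_ID"].transform("count")
toy_df["Alone"] = toy_df["GroupSize"] == 1

# Median for a numerical column, mode for a categorical one
toy_df["Age"] = toy_df["Age"].fillna(toy_df["Age"].median())
toy_df["Deck"] = toy_df["Deck"].fillna(toy_df["Deck"].mode()[0])

print(toy_df[["Deck", "Num", "Side", "GroupSize", "Alone", "Age"]])
```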

In [None]:
# Drop Name
train_df = train_df.drop('Name', axis=1)
test_df = test_df.drop('Name', axis=1)

In [None]:
# Recheck how much data is missing in each column
train_df.isnull().sum()

In [None]:
# Split Cabin into three different columns
train_df[["Deck", "Num", "Side"]] = train_df["Cabin"].str.split("/", expand = True)
test_df[["Deck", "Num", "Side"]] = test_df["Cabin"].str.split("/", expand = True)

# Convert the Num column to float (it still contains missing values, so int is not possible yet)
train_df["Num"] = train_df["Num"].astype(float)
test_df["Num"] = test_df["Num"].astype(float)

# Split PassengerId into two different columns
train_df[["Group_ID", "Number_Group"]] = train_df["PassengerId"].str.split("_", expand = True)
test_df[["Group_ID", "Number_Group"]] = test_df["PassengerId"].str.split("_", expand = True)

In [None]:
# Calculate GroupSize
train_df["GroupSize"] = train_df.groupby("Group_ID")["Group_ID"].transform("count")
test_df["GroupSize"] = test_df.groupby("Group_ID")["Group_ID"].transform("count")

# Drop PassengerId
train_df = train_df.drop('PassengerId', axis=1)
test_df = test_df.drop('PassengerId', axis=1)

# Create a new column to flag whether each passenger travels alone
train_df["Alone"] = (train_df["GroupSize"] == 1)
test_df["Alone"] = (test_df["GroupSize"] == 1)

# Change the Number group to an int
train_df["Number_Group"] = train_df["Number_Group"].astype(int)
test_df["Number_Group"] = test_df["Number_Group"].astype(int)

# Drop Group_ID since we have the groupsize and the alone column
train_df = train_df.drop('Group_ID', axis=1)
test_df = test_df.drop('Group_ID', axis=1)

In [None]:
#Calculate median for numerical columns
age_median = train_df["Age"].median()
room_service_median = train_df["RoomService"].median()
food_court_median = train_df["FoodCourt"].median()
shopping_mall_median = train_df["ShoppingMall"].median()
spa_median = train_df["Spa"].median()
vr_deck_median = train_df["VRDeck"].median()
num_median = train_df["Num"].median()

# Calculate mode (the value that appears the most often) for categorical/boolean columns
home_planet_mode = train_df["HomePlanet"].mode()[0]
destination_mode = train_df["Destination"].mode()[0]
vip_mode = train_df["VIP"].mode()[0]
cryosleep_mode = train_df["CryoSleep"].mode()[0]
deck_mode = train_df["Deck"].mode()[0]
side_mode = train_df["Side"].mode()[0]

# Map each column to its fill value (computed on train_df above)
fill_values = {
    "Age": age_median,
    "RoomService": room_service_median,
    "FoodCourt": food_court_median,
    "ShoppingMall": shopping_mall_median,
    "Spa": spa_median,
    "VRDeck": vr_deck_median,
    "Num": num_median,
    "HomePlanet": home_planet_mode,
    "Destination": destination_mode,
    "VIP": vip_mode,
    "CryoSleep": cryosleep_mode,
    "Deck": deck_mode,
    "Side": side_mode,
}

# Fill NA values in both dataframes with the training statistics.
# Assigning the result back avoids the chained `inplace=True` pattern,
# which newer pandas versions warn about and may not apply to the dataframe.
for col, value in fill_values.items():
    train_df[col] = train_df[col].fillna(value)
    test_df[col] = test_df[col].fillna(value)


In [None]:
train_df.isnull().sum()


In [None]:
test_df.isnull().sum()

In [None]:
# Now we can drop the Cabin column
train_df = train_df.drop('Cabin', axis=1)
test_df = test_df.drop('Cabin', axis=1)

After all these steps we need to apply one-hot encoding to the categorical columns, because ML models only work with numbers and some of our columns contain strings. We use **pd.get_dummies()** to turn each categorical column into multiple binary columns.

In [None]:
# One-hot encode the train and test dataframes
train_df = pd.get_dummies(train_df)
test_df = pd.get_dummies(test_df)

Now we need to synchronize the test dataframe with the train dataframe so that both have the same columns in the same order and can be used with the same model.

In [None]:
# Sync columns between the train and test sets
test_df = test_df.reindex(columns=train_df.columns, fill_value=0)
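
Why the sync step matters is easiest to see on a toy example. In the sketch below (hypothetical frames), the test frame is missing one category, so its encoded columns differ from the train columns until `reindex` aligns them.

```python
import pandas as pd

# Toy frames (hypothetical): the test frame lacks the "Mars" category
toy_train = pd.DataFrame({"Age": [30.0, 12.0, 45.0],
                          "HomePlanet": ["Earth", "Mars", "Europa"]})
toy_test = pd.DataFrame({"Age": [22.0, 60.0],
                         "HomePlanet": ["Europa", "Earth"]})

train_encoded = pd.get_dummies(toy_train)
test_encoded = pd.get_dummies(toy_test)

# Before aligning, the test frame has no HomePlanet_Mars column
print(test_encoded.columns.tolist())

# reindex adds missing columns (filled with 0) and enforces the train column order
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print(test_encoded.columns.tolist())
```

Note that `reindex` also drops any column that exists only in the test frame, which is the desired behaviour here because the model never saw such a column during training.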

## Split data

In [None]:
# Create X
X = train_df.drop("Transported", axis = 1)
# Create y
y = train_df["Transported"]

In [None]:
# Split the data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
len(X_train), len(X_test), len(y_train), len(y_test)

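We need to scale the numerical columns because they are on very different scales. Scaling matters most for distance-based models such as KNN and helps Logistic Regression converge; tree-based models such as Random Forest and Gradient Boosting are largely unaffected by it. The scaler is fitted on the training split only and then applied to the held-out split.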

In [None]:
# Scale numerical columns

from sklearn.preprocessing import StandardScaler
numerical_cols = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Num', 'Number_Group']
scaler = StandardScaler()
X_train[numerical_cols] = scaler.fit_transform(X_train[numerical_cols])
X_test[numerical_cols] = scaler.transform(X_test[numerical_cols])
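
The difference between `fit_transform` and `transform` can be sketched on a tiny array (hypothetical numbers): the statistics come from the training part only and are then reused on the held-out part.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one feature in years, one in spending units
train_part = np.array([[20.0, 0.0], [30.0, 100.0], [40.0, 200.0]])
valid_part = np.array([[30.0, 400.0]])

toy_scaler = StandardScaler()
# fit_transform learns mean and standard deviation from the training part only
train_scaled = toy_scaler.fit_transform(train_part)
# transform reuses those statistics, so nothing is learned from the held-out part
valid_scaled = toy_scaler.transform(valid_part)

print(train_scaled.mean(axis=0))  # approximately 0 for each column
print(valid_scaled)
```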

## Start Modelling

In [None]:
# Import necessary libraries
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# Put models in a dictionary
models = {
    "KNN": KNeighborsClassifier(),
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(),
    "Gradient Boosting": GradientBoostingClassifier()
}

In [None]:
import numpy as np  # needed for np.random.seed below

# Create a function to fit and score models
def fit_and_score(models, X_train, X_test, y_train, y_test):
    """
    Fits and evaluates given machine learning models.
    models : a dict of different Scikit-Learn machine learning models
    X_train : training data (no labels)
    X_test : testing data (no labels)
    y_train : training labels
    y_test : test labels
    """
    # Set random seed
    np.random.seed(42)

    # Make dictionary to keep model scores
    model_scores = {}

    # Loop through models
    for name, model in models.items():
        # Fit the model to the data
        model.fit(X_train, y_train)
        # Evaluate the model (accuracy on the held-out split) and store its score in model_scores
        model_scores[name] = model.score(X_test, y_test)
    return model_scores

In [None]:
model_results = fit_and_score(models, X_train, X_test, y_train, y_test)

In [None]:
model_results

In [None]:
from sklearn.model_selection import GridSearchCV
# Define parameter grid
param_grid = {
    'n_estimators': [200],
    'max_depth': [3],
    "learning_rate": [0.025, 0.05, 0.075, 0.1],
    "min_samples_leaf": [20],
    "subsample": [0.8]
    }

# Instantiate the Grid search object
gscv = GridSearchCV(
    estimator=GradientBoostingClassifier(),
    param_grid=param_grid,
    cv=5,
    verbose=2,
    scoring = "accuracy"
)

gscv.fit(X_train, y_train)

In [None]:
gscv.best_score_

In [None]:
gscv.best_params_
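
The search result can also feed a final model directly instead of copying values by hand. The following is a minimal sketch on synthetic data (`make_classification` stands in for the prepared features, and the names `toy_grid`, `toy_search` and `toy_final` are illustrative); it relies on `best_params_` being a plain dict.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for X_train / y_train (assumption, not competition data)
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=42)

toy_grid = {"learning_rate": [0.05, 0.1], "n_estimators": [20]}
toy_search = GridSearchCV(
    estimator=GradientBoostingClassifier(random_state=42),
    param_grid=toy_grid,
    cv=3,
    scoring="accuracy",
)
toy_search.fit(X_toy, y_toy)

# best_params_ is a plain dict, so it can be unpacked into a new estimator
toy_final = GradientBoostingClassifier(random_state=42, **toy_search.best_params_)
toy_final.fit(X_toy, y_toy)

print(toy_search.best_params_, round(toy_search.best_score_, 3))
```

Unpacking the dict keeps the final model in step with the search if the grid is changed later.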

In [None]:
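# Retrain on the full training set (X, y) with the selected hyperparameters.
# X is not scaled here; tree-based models are largely insensitive to feature scaling.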
final_model = GradientBoostingClassifier(random_state=42, learning_rate=0.05, max_depth=3, n_estimators=200, min_samples_leaf=20, subsample=0.8)
final_model.fit(X,y)

## Make predictions

In [None]:
# Drop the placeholder Transported column that reindex added to test_df
test_df = test_df.drop('Transported', axis=1)

In [None]:
final_predictions = final_model.predict(test_df)
final_predictions

## Create submission file

In [None]:
import pandas as pd

# Load the original test data to recover PassengerId, which was dropped during preparation
original_test_df = pd.read_csv('/content/test.csv')
passenger_ids = original_test_df['PassengerId']

# Build the submission DataFrame; rows are in the same order as the prepared test_df
submission_df = pd.DataFrame({
    'PassengerId': passenger_ids,
    'Transported': final_predictions
})

# Kaggle expects True/False values; the predictions are already boolean, so this is only a safeguard
submission_df['Transported'] = submission_df['Transported'].astype(bool)

# Save the submission file
submission_df.to_csv('submission.csv', index=False)
print("Submission file 'submission.csv' successfully created!")
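
Before uploading, it can help to sanity-check the file's shape. The helper below, `check_submission`, is a hypothetical function written for this notebook (not part of Kaggle or pandas), shown on a toy frame with made-up ids; it assumes the expected format is the two columns used above.

```python
import pandas as pd

# Toy stand-in for submission_df (hypothetical ids and predictions)
toy_submission = pd.DataFrame({
    "PassengerId": ["0013_01", "0018_01", "0019_01"],
    "Transported": [True, False, True],
})

def check_submission(df, expected_rows):
    """Return a list of problems found; an empty list means the frame looks fine."""
    problems = []
    if list(df.columns) != ["PassengerId", "Transported"]:
        # Stop early: the remaining checks need both columns
        return ["unexpected columns"]
    if len(df) != expected_rows:
        problems.append("unexpected row count")
    if df["PassengerId"].duplicated().any():
        problems.append("duplicate PassengerId")
    if df["Transported"].dtype != bool:
        problems.append("Transported is not boolean")
    return problems

print(check_submission(toy_submission, expected_rows=3))
```

On the real data the call would be `check_submission(submission_df, expected_rows=len(original_test_df))`.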