# Spaceship Titanic


**Dataset Description**
In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.

**File and Data Field Descriptions**

train.csv - Personal records for about two-thirds (8700) of the passengers, to be used as training data.

PassengerId - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.

HomePlanet - The planet the passenger departed from, typically their planet of permanent residence.

CryoSleep - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.

Cabin - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.

Destination - The planet the passenger will be debarking to.

Age - The age of the passenger.

VIP - Whether the passenger has paid for special VIP service during the voyage.

RoomService, FoodCourt, ShoppingMall, Spa, VRDeck - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.

Name - The first and last names of the passenger.

Transported - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
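
The `gggg_pp` and `deck/num/side` formats described above can be taken apart with pandas string methods. The following is a minimal sketch on a hypothetical four-row frame (the `sample` frame and the `NumberInGroup` column name are assumptions for illustration, not part of the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical records following the formats described above (one cabin is missing)
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", np.nan],
})

# PassengerId: gggg_pp -> group identifier and number within the group
id_parts = sample["PassengerId"].str.split("_", expand=True)
sample["Group"] = id_parts[0]
sample["NumberInGroup"] = id_parts[1].astype(int)

# Cabin: deck/num/side -> three columns; a missing cabin stays missing in all three
sample[["Deck", "Num", "Side"]] = sample["Cabin"].str.split("/", expand=True)
# Num is text after the split, so convert it before using it as a number
sample["Num"] = pd.to_numeric(sample["Num"])

print(sample)
```

Keeping `Group` as text preserves the leading zeros, while `Num` has to be numeric before a median or a scaler can be applied to it.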

test.csv - Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.

PassengerId - Id for each passenger in the test set.

Transported - The target. For each passenger, predict either True or False.
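
Since the target has to be reported as True or False for each PassengerId, predictions produced as 0/1 need to be converted back before they are handed in. A minimal sketch, assuming hypothetical IDs and model outputs (the exact submission file name and layout are not stated in this notebook):

```python
import numpy as np
import pandas as pd

# Hypothetical IDs and 0/1 model outputs; in practice these would come from
# the test set and a fitted model
passenger_ids = pd.Series(["0013_01", "0018_01", "0019_01"])
predictions = np.array([1, 0, 1])

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    # Convert 0/1 predictions back to the True/False values the target uses
    "Transported": predictions.astype(bool),
})

# to_csv() without a path returns the CSV text; pass a file name to write a file instead
csv_text = submission.to_csv(index=False)
print(csv_text)
```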

 # **The lab is a competition to see which student gets the best result on the test set. You will need to explain, with code comments or text, which steps you are following.**


In [66]:
#from google.colab import drive
#drive.mount('/content/drive')

### Import Libraries

In [67]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set pandas to display wider tables
pd.set_option('display.width', 1000)
pd.set_option('display.max_columns', 100)

### Load Data

In [68]:
#SpaceTrain = pd.read_csv("/content/drive/MyDrive/DataIntesive/Data/train_Lab3.csv")
SpaceTrain = pd.read_csv("train_Lab3.csv")

#SpaceTest = pd.read_csv("/content/drive/MyDrive/DataIntesive/Data/test_Lab3.csv")
SpaceTest = pd.read_csv("test_Lab3.csv")

In [None]:
SpaceTrain.head()

In [None]:
SpaceTest.head()

### Exploring the Data

In [None]:
# Check the first few rows of both datasets
print("Training Data:")
print(SpaceTrain.head())

print("\n") # Add a line break

print("Test Data:")
print(SpaceTest.head())

In [None]:
# Check dimensions of the datasets
print(f"Training data shape: {SpaceTrain.shape}")

print(f"Test data shape: {SpaceTest.shape}")

In [None]:
# Check for missing values in both datasets
print("Missing values in training data:")
print(SpaceTrain.isnull().sum())

print("\n") # Add a line break

print("Missing values in test data:")
print(SpaceTest.isnull().sum())
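
Raw counts are easier to judge next to the share of rows they represent. The helper below is a small sketch of that idea; `missing_summary` and the `toy` frame are hypothetical names used only for illustration, and in the notebook the function would be called on `SpaceTrain` and `SpaceTest`.

```python
import numpy as np
import pandas as pd

def missing_summary(df):
    """Return the count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values
    return summary[summary["missing"] > 0].sort_values("missing", ascending=False)

# Hypothetical toy frame with a few gaps
toy = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "VIP": [False, None, None, False],
    "Spa": [0.0, 549.0, 6715.0, 3329.0],
})
report = missing_summary(toy)
print(report)
```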

In [None]:
# Check data types of each column in both datasets
print("Training data types:")
print(SpaceTrain.info())

print("\n") # Add a line break

print("Test data types:")
print(SpaceTest.info())

In [None]:
# Basic statistics for numerical features in both datasets
print("Training data statistics:")
print(SpaceTrain.describe())

print("Test data statistics:")
print(SpaceTest.describe())

### Visualizing the Data

In [None]:
# Data Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Set up the matplotlib figure
plt.figure(figsize=(12, 6))

# Plot the distribution of the target variable
sns.countplot(x='Transported', data=SpaceTrain)
plt.title('Distribution of Transported')
plt.show()

# Plot the distribution of HomePlanet
plt.figure(figsize=(10, 5))
sns.countplot(x='HomePlanet', data=SpaceTrain)
plt.title('Distribution of HomePlanet')
plt.show()

# Plot the distribution of CryoSleep
plt.figure(figsize=(10, 5))
sns.countplot(x='CryoSleep', data=SpaceTrain)
plt.title('Distribution of CryoSleep')
plt.show()

# Plot the distribution of VIP
plt.figure(figsize=(10, 5))
sns.countplot(x='VIP', data=SpaceTrain)
plt.title('Distribution of VIP')
plt.show()

# Plot the distribution of Age
plt.figure(figsize=(10, 5))
sns.histplot(SpaceTrain['Age'].dropna(), kde=True)
plt.title('Distribution of Age')
plt.show()

### Handling Missing Values/Data

In [None]:
# Columns to impute, grouped by the statistic used to fill them
mode_fill_columns = ['HomePlanet', 'Destination', 'CryoSleep', 'VIP']
median_fill_columns = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']

# Apply the same steps to the training and the test data, each with its own statistics.
# Assigning the result back avoids chained in-place fillna on a column, which
# recent pandas versions warn about and which may leave the DataFrame unchanged.
for df in (SpaceTrain, SpaceTest):
    # Fill missing values for categorical columns with the mode
    for col in mode_fill_columns:
        df[col] = df[col].fillna(df[col].mode()[0])
    # Fill missing values for numerical columns with the median
    for col in median_fill_columns:
        df[col] = df[col].fillna(df[col].median())

# Handle missing values in the Cabin column
# Split the Cabin column into Deck, Num, and Side
SpaceTrain[['Deck', 'Num', 'Side']] = SpaceTrain['Cabin'].str.split('/', expand=True)
SpaceTest[['Deck', 'Num', 'Side']] = SpaceTest['Cabin'].str.split('/', expand=True)

# Fill missing values for 'Deck' and 'Side' with mode, and 'Num' with median
for df in (SpaceTrain, SpaceTest):
    # Num is text after str.split, so convert it to numeric before taking the median
    df['Num'] = pd.to_numeric(df['Num'])
    df['Deck'] = df['Deck'].fillna(df['Deck'].mode()[0])
    df['Num'] = df['Num'].fillna(df['Num'].median())
    df['Side'] = df['Side'].fillna(df['Side'].mode()[0])

# Optionally, drop the Name column if not needed
SpaceTrain.drop('Name', axis=1, inplace=True)
SpaceTest.drop('Name', axis=1, inplace=True)

# Optionally, drop the original Cabin column since it's split into Deck, Num, and Side
SpaceTrain.drop('Cabin', axis=1, inplace=True)
SpaceTest.drop('Cabin', axis=1, inplace=True)

# Final check for missing values
print("Remaining missing values in SpaceTrain after handling:")
print(SpaceTrain.isnull().sum())

print("\nRemaining missing values in SpaceTest after handling:")
print(SpaceTest.isnull().sum())


### Feature Engineering

In [None]:
# Cabin was already split into Deck, Num, and Side (and then dropped) in the previous cell,
# so only make sure Num is numeric here
SpaceTrain['Num'] = pd.to_numeric(SpaceTrain['Num'])
SpaceTest['Num'] = pd.to_numeric(SpaceTest['Num'])

# Create group size from PassengerId
SpaceTrain['Group'] = SpaceTrain['PassengerId'].apply(lambda x: x.split('_')[0])
SpaceTrain['GroupSize'] = SpaceTrain.groupby('Group')['PassengerId'].transform('count')

SpaceTest['Group'] = SpaceTest['PassengerId'].apply(lambda x: x.split('_')[0])
SpaceTest['GroupSize'] = SpaceTest.groupby('Group')['PassengerId'].transform('count')

# Create TotalSpending feature
SpaceTrain['TotalSpending'] = SpaceTrain[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
SpaceTest['TotalSpending'] = SpaceTest[['RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)


### Label Encoding Categorical Variables

In [58]:
from sklearn.preprocessing import LabelEncoder

# The models below need numeric inputs, so the categorical columns are label encoded
label_columns = ['CryoSleep', 'VIP', 'HomePlanet', 'Destination', 'Deck', 'Side']

for col in label_columns:
    # Fit on the combined training and test values so both sets share one mapping,
    # then transform the training and test sets separately
    le = LabelEncoder()
    le.fit(pd.concat([SpaceTrain[col], SpaceTest[col]]).astype(str))
    SpaceTrain[col] = le.transform(SpaceTrain[col].astype(str))
    SpaceTest[col] = le.transform(SpaceTest[col].astype(str))


### Data Preprocessing

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Split the data into features and target
# Drop irrelevant columns; errors='ignore' skips Name and Cabin if they were already dropped above
X = SpaceTrain.drop(['Transported', 'PassengerId', 'Name', 'Cabin', 'Group'], axis=1, errors='ignore')
y = SpaceTrain['Transported'].astype(int)  # Ensure the target is in binary format (0,1)

# Split the dataset into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

# Show dataset dimensions
print(f"Training samples: {X_train.shape[0]}, Validation samples: {X_val.shape[0]}")

# Standardize the numerical features
scaler = StandardScaler()
numerical_columns = ['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'TotalSpending', 'Num']
X_train[numerical_columns] = scaler.fit_transform(X_train[numerical_columns])
X_val[numerical_columns] = scaler.transform(X_val[numerical_columns])


In [60]:
from sklearn.impute import SimpleImputer

# Create an imputer for numerical columns
numerical_imputer = SimpleImputer(strategy='median')

# Impute the missing values in numerical columns
X_train[numerical_columns] = numerical_imputer.fit_transform(X_train[numerical_columns])
X_val[numerical_columns] = numerical_imputer.transform(X_val[numerical_columns])

# Impute missing values in categorical columns (using the most frequent value)
categorical_columns = ['HomePlanet', 'CryoSleep', 'Destination', 'VIP', 'Deck', 'Side']
categorical_imputer = SimpleImputer(strategy='most_frequent')

X_train[categorical_columns] = categorical_imputer.fit_transform(X_train[categorical_columns])
X_val[categorical_columns] = categorical_imputer.transform(X_val[categorical_columns])
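
The imputation and scaling steps above could also be bundled into one scikit-learn preprocessing object, so that every statistic is learned from the training split and reused unchanged on the validation split. This is a sketch under stated assumptions rather than the notebook's method: the `toy_train` and `toy_val` frames are hypothetical, and one-hot encoding is swapped in here for the categorical column instead of label encoding.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with the same kinds of columns as the notebook
toy_train = pd.DataFrame({
    "Age": [39.0, np.nan, 58.0, 33.0],
    "Spa": [0.0, 549.0, np.nan, 3329.0],
    "HomePlanet": ["Europa", "Earth", np.nan, "Earth"],
})
toy_val = pd.DataFrame({
    "Age": [np.nan, 16.0],
    "Spa": [10.0, np.nan],
    "HomePlanet": ["Mars", np.nan],
})

numeric_features = ["Age", "Spa"]
categorical_features = ["HomePlanet"]

preprocess = ColumnTransformer([
    # Numerical columns: median imputation, then standardization
    ("num", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), numeric_features),
    # Categorical columns: most-frequent imputation, then one-hot encoding;
    # categories unseen during fit (here "Mars") are encoded as all zeros
    ("cat", Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ]), categorical_features),
])

# Statistics are learned from the training split only and reused on the validation split
train_matrix = preprocess.fit_transform(toy_train)
val_matrix = preprocess.transform(toy_val)
print(train_matrix.shape, val_matrix.shape)
```

A classifier can be appended as a final `Pipeline` step so that the same object handles preprocessing and prediction.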

In [None]:
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report

# HistGradientBoosting Classifier (this can handle missing values natively)
hist_gb_clf = HistGradientBoostingClassifier(random_state=42)
hist_gb_clf.fit(X_train, y_train)

# Make predictions
y_pred_hist_gb = hist_gb_clf.predict(X_val)

# Evaluate HistGradientBoosting
print("HistGradientBoosting Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_hist_gb):.4f}")
print(classification_report(y_val, y_pred_hist_gb))


### Boosting Model/Algorithm

In [None]:
# Boosting algorithms for the Spaceship Titanic dataset
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier
from sklearn.metrics import accuracy_score, classification_report

# Reusing X_train, X_val, y_train, y_val from earlier

# AdaBoost Classifier
ada_clf = AdaBoostClassifier(n_estimators=50, random_state=42)
ada_clf.fit(X_train, y_train)
y_pred_ada = ada_clf.predict(X_val)
print("AdaBoost Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_ada):.4f}")
print(classification_report(y_val, y_pred_ada))

# Gradient Boosting Classifier
gb_clf = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
gb_clf.fit(X_train, y_train)
y_pred_gb = gb_clf.predict(X_val)
print("Gradient Boosting Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_gb):.4f}")
print(classification_report(y_val, y_pred_gb))

# XGBoost Classifier
xgb_clf = XGBClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42, eval_metric='logloss')
xgb_clf.fit(X_train, y_train)
y_pred_xgb = xgb_clf.predict(X_val)
print("XGBoost Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_xgb):.4f}")
print(classification_report(y_val, y_pred_xgb))

# CatBoost Classifier
catboost_clf = CatBoostClassifier(iterations=100, learning_rate=0.1, depth=3, verbose=0, random_state=42)
catboost_clf.fit(X_train, y_train)
y_pred_catboost = catboost_clf.predict(X_val)
print("CatBoost Results:")
print(f"Accuracy: {accuracy_score(y_val, y_pred_catboost):.4f}")
print(classification_report(y_val, y_pred_catboost))


### Model Building

In [64]:
# from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
# from sklearn.linear_model import LogisticRegression
# from sklearn.neighbors import KNeighborsClassifier
# from sklearn.svm import SVC
# import xgboost as xgb
# from sklearn.model_selection import cross_val_score
# from sklearn.metrics import accuracy_score, classification_report

# # List of classifiers to evaluate
# classifiers = {
#     'Random Forest': RandomForestClassifier(random_state=42),
#     'Gradient Boosting': GradientBoostingClassifier(random_state=42),
#     'XGBoost': xgb.XGBClassifier(random_state=42, eval_metric='logloss'),
#     'Logistic Regression': LogisticRegression(random_state=42),
#     'KNN': KNeighborsClassifier(),
#     'SVM': SVC(random_state=42)
# }

# # Function to train and evaluate the models
# def evaluate_classifiers(classifiers, X_train, y_train, X_val, y_val):
#     for name, clf in classifiers.items():
#         # Train the model
#         clf.fit(X_train, y_train)
        
#         # Predict on the validation set
#         y_pred = clf.predict(X_val)
        
#         # Evaluate the performance
#         accuracy = accuracy_score(y_val, y_pred)
#         print(f"Classifier: {name}")
#         print(f"Accuracy: {accuracy:.4f}")
#         print(classification_report(y_val, y_pred))
#         print("\n")


# # Call the evaluation function
# evaluate_classifiers(classifiers, X_train, y_train, X_val, y_val)


#### Random Forest

### Cross Validation

In [65]:
# from sklearn.model_selection import cross_val_score

# # Function to evaluate classifiers using cross-validation
# def cross_validate_classifiers(classifiers, X_train, y_train, cv=5):
#     for name, clf in classifiers.items():
#         # Perform cross-validation
#         cv_scores = cross_val_score(clf, X_train, y_train, cv=cv, scoring='accuracy')
        
#         # Print the cross-validation results
#         print(f"Classifier: {name}")
#         print(f"Mean CV Accuracy: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})\n")

# # Perform cross-validation
# cross_validate_classifiers(classifiers, X_train, y_train)
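
As a runnable illustration of the cross-validation idea above, the sketch below scores two classifiers with stratified 5-fold cross-validation. It uses synthetic data from `make_classification` rather than the Spaceship Titanic features, and the `candidates` and `cv_results` names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data standing in for the prepared features
X_cv, y_cv = make_classification(n_samples=300, n_features=8, random_state=42)

candidates = {
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(n_estimators=50, random_state=42),
}

# Stratified folds keep the class balance similar in every split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

cv_results = {}
for name, clf in candidates.items():
    scores = cross_val_score(clf, X_cv, y_cv, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: mean CV accuracy {scores.mean():.4f} (+/- {scores.std():.4f})")
```

Comparing the mean and spread across folds gives a steadier estimate than a single train/validation split.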
