# Spaceship Titanic
### Predict which passengers are transported to an alternate dimension

### License: Attribution 4.0 International (CC BY 4.0)
#### Competition link: https://www.kaggle.com/competitions/spaceship-titanic/data

## Dataset Description
### File and Data Field Descriptions

---

#### **train.csv**
Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

- **PassengerId** – A unique Id for each passenger. Each Id takes the form `gggg_pp` where `gggg` indicates a group the passenger is travelling with and `pp` is their number within the group. People in a group are often family members, but not always.  
- **HomePlanet** – The planet the passenger departed from, typically their planet of permanent residence.  
- **CryoSleep** – Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.  
- **Cabin** – The cabin number where the passenger is staying. Takes the form `deck/num/side`, where side can be either **P** for Port or **S** for Starboard.  
- **Destination** – The planet the passenger will be debarking to.  
- **Age** – The age of the passenger.  
- **VIP** – Whether the passenger has paid for special VIP service during the voyage.  
- **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** – Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.  
- **Name** – The first and last names of the passenger.  
- **Transported** – Whether the passenger was transported to another dimension. This is the **target**, the column you are trying to predict.
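
The `gggg_pp` and `deck/num/side` formats described above can be unpacked into separate columns. The following is a minimal sketch, assuming pandas and three rows taken from `train.csv`; the output column names (`Group`, `NumInGroup`, `Deck`, `CabinNum`, `Side`) are illustrative choices, not part of the dataset.

```python
import pandas as pd

# Three rows taken from train.csv; only the two columns needed here.
sample = pd.DataFrame({
    "PassengerId": ["0003_01", "0003_02", "0004_01"],
    "Cabin": ["A/0/S", "A/0/S", "F/1/S"],
})

# `gggg_pp`: four-digit group id, underscore, two-digit number within the group.
ids = sample["PassengerId"].str.extract(r"^(?P<Group>\d{4})_(?P<NumInGroup>\d{2})$")

# `deck/num/side`: deck label, cabin number, and P (Port) or S (Starboard).
cabin = sample["Cabin"].str.extract(r"^(?P<Deck>[^/]+)/(?P<CabinNum>\d+)/(?P<Side>[PS])$")

# Column names below come from the named groups in the patterns (illustrative names).
parsed = pd.concat([sample, ids, cabin], axis=1)
print(parsed)
```

Values that are missing or do not match the pattern come back as `NaN`, so malformed entries surface instead of being silently mis-split.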

---

#### **test.csv**
Personal records for the remaining one-third (~4300) of the passengers, to be used as test data.  
Your task is to predict the value of **Transported** for the passengers in this set.

---

#### **sample_submission.csv**
A submission file in the correct format.

- **PassengerId** – Id for each passenger in the test set.  
- **Transported** – The target. For each passenger, predict either **True** or **False**.
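
A submission in this two-column format could be assembled as sketched below. The passenger ids come from `test.csv`, but the predictions are made-up placeholders, not model output.

```python
import pandas as pd

# Ids taken from test.csv; the predictions are placeholders for illustration only.
passenger_ids = ["0013_01", "0018_01", "0019_01"]
predictions = [True, False, True]

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Transported": predictions,
})

# index=False keeps the output to exactly the two expected columns.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file path as the first argument of `to_csv` writes the same content to disk instead of returning it as a string.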



# Project Data Overview

In [99]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [100]:
data = pd.read_csv('../source_data/train.csv')
data

|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | Gravior Noxnuther | False |
| 8689 | 9278_01 | Earth | True | G/1499/S | PSO J318.5-22 | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Kurta Mondalley | False |
| 8690 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | Fayey Connon | True |
| 8691 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | Celeon Hontichre | False |


### Basic Understanding

---



In [101]:
# Shape of the train dataset
data.shape

(8693, 14)

- The train set has `8693` rows and `14` columns, including the target `Transported`

In [102]:
test_set = pd.read_csv('../source_data/test.csv')
test_set

|  | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0013_01 | Earth | True | G/3/S | TRAPPIST-1e | 27.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Nelly Carsoning |
| 1 | 0018_01 | Earth | False | F/4/S | TRAPPIST-1e | 19.0 | False | 0.0 | 9.0 | 0.0 | 2823.0 | 0.0 | Lerome Peckers |
| 2 | 0019_01 | Europa | True | C/0/S | 55 Cancri e | 31.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Sabih Unhearfus |
| 3 | 0021_01 | Europa | False | C/1/S | TRAPPIST-1e | 38.0 | False | 0.0 | 6652.0 | 0.0 | 181.0 | 585.0 | Meratz Caltilter |
| 4 | 0023_01 | Earth | False | F/5/S | TRAPPIST-1e | 20.0 | False | 10.0 | 0.0 | 635.0 | 0.0 | 0.0 | Brence Harperez |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 4272 | 9266_02 | Earth | True | G/1496/S | TRAPPIST-1e | 34.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Jeron Peter |
| 4273 | 9269_01 | Earth | False | NaN | TRAPPIST-1e | 42.0 | False | 0.0 | 847.0 | 17.0 | 10.0 | 144.0 | Matty Scheron |
| 4274 | 9271_01 | Mars | True | D/296/P | 55 Cancri e | NaN | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Jayrin Pore |
| 4275 | 9273_01 | Europa | False | D/297/P | NaN | NaN | False | 0.0 | 2680.0 | 0.0 | 0.0 | 523.0 | Kitakan Conale |


In [103]:
# Shape of the test dataset
test_set.shape

(4277, 13)

- The test set has `4277` rows and `13` columns; the target column `Transported` is not included
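
The column difference between the two files can be checked directly instead of by eye. This is a minimal sketch using empty stand-in frames built from the column names listed in the data description; in the notebook, `data` and `test_set` would take their place.

```python
import pandas as pd

feature_columns = [
    "PassengerId", "HomePlanet", "CryoSleep", "Cabin", "Destination", "Age",
    "VIP", "RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck", "Name",
]

# Empty stand-ins that only carry the column names of each file.
train_like = pd.DataFrame(columns=feature_columns + ["Transported"])
test_like = pd.DataFrame(columns=feature_columns)

# Columns present in one frame but not in the other.
only_in_train = train_like.columns.difference(test_like.columns).tolist()
only_in_test = test_like.columns.difference(train_like.columns).tolist()

print(only_in_train, only_in_test)
```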

### Data Types and some insights from the columns

---



In [104]:
data.dtypes

PassengerId      object
HomePlanet       object
CryoSleep        object
Cabin            object
Destination      object
Age             float64
VIP              object
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Name             object
Transported        bool
dtype: object

- **PassengerId**: shows whether the passenger travelled in a group or alone. This can matter for the final classification, so this feature deserves careful analysis.
- **HomePlanet**: when passengers are split by home planet, we may find that people from a specific planet have a higher chance of being transported.
- **CryoSleep**: can show whether passengers in cryosleep spent any money, whether passengers are divided by social status (richer or poorer), and whether this plays an important role in being transported.
- **Cabin**: each cabin is encoded as `deck/num/side`. After splitting the value into its parts, one-hot encoding can be applied (for example, to `deck` and `side`) to give the model as much information as possible. The model may then capture hidden relationships that we cannot detect ourselves and separate the data more effectively.
- **Destination**: where the passenger was going. Some destinations may have had more transported passengers, which could mean those were priority routes or had more people in cryosleep.
- **Age**: a numeric feature. We can look at how old the passengers were and check whether younger or older people were more likely to be transported.
- **VIP**: tells us whether someone had special status. VIPs might behave differently: they may spend more or have a different chance of being transported.
- **RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck**: how much each passenger spent at each amenity. These values might be zero or missing, especially for those in cryosleep. Adding them up gives total spending, which might affect the chance of being transported.
- **Name**: probably not useful by itself, but it might help spot duplicates, or, if any names include titles, those could tell us more about a person's age or status.
- **Transported**: the target we are trying to predict. All the points above are about understanding what makes a person more or less likely to be transported: their age, origin, destination, spending, or sleep status.
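
Several of the ideas in this list (group membership, total spending, one-hot encoding of the cabin parts) can be sketched in a few lines of pandas. The rows are taken from `train.csv`; the derived column names (`Group`, `GroupSize`, `TotalSpend`, `Deck`, `CabinNum`, `Side`) are illustrative assumptions, and this is one possible approach rather than the notebook's final feature pipeline.

```python
import pandas as pd

spend_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Four rows taken from train.csv, restricted to the columns used below.
df = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01"],
    "Cabin": ["B/0/P", "A/0/S", "A/0/S", "F/1/S"],
    "RoomService": [0.0, 43.0, 0.0, 303.0],
    "FoodCourt": [0.0, 3576.0, 1283.0, 70.0],
    "ShoppingMall": [0.0, 0.0, 371.0, 151.0],
    "Spa": [0.0, 6715.0, 3329.0, 565.0],
    "VRDeck": [0.0, 49.0, 193.0, 2.0],
})

# Group size: number of passengers sharing the same `gggg` prefix.
df["Group"] = df["PassengerId"].str.split("_").str[0]
df["GroupSize"] = df.groupby("Group")["PassengerId"].transform("count")

# Total spending over the five amenities; sum() skips NaN by default.
df["TotalSpend"] = df[spend_cols].sum(axis=1)

# Split Cabin into its parts, then one-hot encode the two categorical parts.
df[["Deck", "CabinNum", "Side"]] = df["Cabin"].str.split("/", expand=True)
encoded = pd.get_dummies(df, columns=["Deck", "Side"])

print(encoded.filter(regex="^(GroupSize|TotalSpend|Deck_|Side_)"))
```

Encoding only `Deck` and `Side` keeps the number of new columns small; the cabin number has many distinct values and is left as a single column here.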


## Descriptive Statistics
#### Descriptive statistics such as mean, std, min, and percentiles for the numerical features

---

In [112]:
# Count and percentage of NaN values for every feature
columns_dict = {}
for feature in data.columns:
    nan_count = data[feature].isna().sum()
    nan_percent = round((nan_count / len(data[feature])) * 100, 2)
    columns_dict[feature] = [nan_count, nan_percent]

new_data = pd.DataFrame(columns_dict, index=['NaN Values', 'NaN %'])
new_data.T

| Feature | NaN Values | NaN % |
|---|---|---|
| PassengerId | 0.0 | 0.0 |
| HomePlanet | 201.0 | 2.31 |
| CryoSleep | 217.0 | 2.5 |
| Cabin | 199.0 | 2.29 |
| Destination | 182.0 | 2.09 |
| Age | 179.0 | 2.06 |
| VIP | 203.0 | 2.34 |
| RoomService | 181.0 | 2.08 |
| FoodCourt | 183.0 | 2.11 |
| ShoppingMall | 208.0 | 2.39 |
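
The same summary can also be computed without an explicit loop. This is a minimal sketch on a small made-up frame (`toy` is a stand-in for `data`, and its values are invented); building the result column by column also keeps the counts as integers.

```python
import numpy as np
import pandas as pd

# Made-up stand-in for `data`; the values are for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Europa", "Mars"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "VIP": [False, False, True, False],
})

is_missing = toy.isna()

summary = pd.DataFrame({
    "NaN Values": is_missing.sum(),                 # count of missing cells per column
    "NaN %": (is_missing.mean() * 100).round(2),    # share of missing cells per column
})
print(summary)
```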


#### Every feature with missing values is missing in roughly 2% of rows, with only small deviations between features. Such uniform rates suggest the gaps may not be random and may be a key factor in the classification, so how we handle these missing values will be crucial.
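
One way to probe whether missingness carries signal is to compare the share of transported passengers between rows where a feature is missing and rows where it is present. The helper below, `transported_rate_by_missingness`, is a hypothetical name introduced for this sketch, and the toy values are invented, so its output says nothing about the real data.

```python
import numpy as np
import pandas as pd


def transported_rate_by_missingness(df, target="Transported"):
    """For each feature with NaN values, compare the mean of the target
    between rows where the feature is missing and rows where it is present."""
    rows = {}
    for col in df.columns.drop(target):
        is_missing = df[col].isna()
        if is_missing.any():
            rows[col] = {
                "rate_when_missing": df.loc[is_missing, target].mean(),
                "rate_when_present": df.loc[~is_missing, target].mean(),
            }
    return pd.DataFrame.from_dict(rows, orient="index")


# Invented toy data, for illustration only.
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Europa", "Earth", "Mars", "Earth", "Europa"],
    "Age": [39.0, np.nan, 58.0, np.nan, 16.0, 26.0],
    "Transported": [False, True, False, True, True, False],
})

result = transported_rate_by_missingness(toy)
print(result)
```

If the two rates are close for a feature on the real training set, its missing values are unlikely to be informative on their own and a simple imputation may be enough; a large gap would argue for keeping an explicit missing-value indicator.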

In [None]:
# All numeric features
numeric_data = data[['Age', 'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck']]

In [113]:
# Descriptive statistics for all numeric features
numeric_data.describe()

|  | Age | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck |
|---|---|---|---|---|---|---|
| count | 8514.0 | 8512.0 | 8510.0 | 8485.0 | 8510.0 | 8505.0 |
| mean | 28.82793 | 224.687617 | 458.077203 | 173.729169 | 311.138778 | 304.854791 |
| std | 14.489021 | 666.717663 | 1611.48924 | 604.696458 | 1136.705535 | 1145.717189 |
| min | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | 19.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 50% | 27.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 75% | 38.0 | 47.0 | 76.0 | 27.0 | 59.0 | 46.0 |
| max | 79.0 | 14327.0 | 29813.0 | 23492.0 | 22408.0 | 24133.0 |
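
The median of every spending column above is `0.0`, so at least half of the recorded values in each column are zero. A possible follow-up is to measure that share directly. This sketch uses a few invented rows (`spend` is a stand-in for the spending columns of `numeric_data`) and divides by the number of non-missing values so that `NaN` entries are not counted as spending.

```python
import numpy as np
import pandas as pd

# Invented stand-in for two of the spending columns.
spend = pd.DataFrame({
    "RoomService": [0.0, 109.0, 43.0, 0.0, np.nan],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0],
})

# Share of exact zeros among the non-missing values of each column.
zero_share = spend.eq(0).sum() / spend.notna().sum()
print(zero_share)
```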
