# Spaceship Titanic
## Kaggle Competition 
[Competition data page](https://www.kaggle.com/competitions/spaceship-titanic/data)

### Dataset Description

In this competition your task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the spacetime anomaly. To help you make these predictions, you're given a set of personal records recovered from the ship's damaged computer system.
#### File and Data Field Descriptions

**train.csv** - Personal records for about two-thirds (~8700) of the passengers, to be used as training data.

**PassengerId** - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.  
**HomePlanet** - The planet the passenger departed from, typically their planet of permanent residence.  
**CryoSleep** - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.  
**Cabin** - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.  
**Destination** - The planet the passenger will be debarking to.  
**Age** - The age of the passenger.  
**VIP** - Whether the passenger has paid for special VIP service during the voyage.  
**RoomService**, **FoodCourt**, **ShoppingMall**, **Spa**, **VRDeck** - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.  
**Name** - The first and last names of the passenger.  
**Transported** - Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.  

#### Initial Thoughts  
+ The data is mostly simple and categorical, so a tree model is a good place to start  
+ The **Cabin** variable needs to be split into multiple features, with port and starboard converted to binary  
+ **PassengerId** may need to be split into group and ID, possibly with an engineered feature for group size  
+ Luxury amenities could be transformed into whether or not a passenger used each facility and the total amount spent  
+ **Name** as it exists is useless for categorizing; however, breaking it into first and last name and testing whether travelling in a family unit makes transportation more likely is an experiment to run  
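
The ideas above could be sketched roughly as follows. This is only an illustration on five rows copied from the training preview, not the feature set the notebook ends up using. The helper `engineer_features` and the new column names (`Deck`, `CabinNum`, `IsStarboard`, `Group`, `NumInGroup`, `GroupSize`, `TotalSpent`, `Used...`, `LastName`, `FamilySize`) are hypothetical. It also assumes that a "family" means passengers sharing both a group and a last name.

```python
import pandas as pd

# Five rows copied from the training preview; the real frame comes from train.csv.
sample = pd.DataFrame({
    "PassengerId": ["0001_01", "0002_01", "0003_01", "0003_02", "0004_01"],
    "Cabin": ["B/0/P", "F/0/S", "A/0/S", "A/0/S", "F/1/S"],
    "RoomService": [0.0, 109.0, 43.0, 0.0, 303.0],
    "FoodCourt": [0.0, 9.0, 3576.0, 1283.0, 70.0],
    "ShoppingMall": [0.0, 25.0, 0.0, 371.0, 151.0],
    "Spa": [0.0, 549.0, 6715.0, 3329.0, 565.0],
    "VRDeck": [0.0, 44.0, 49.0, 193.0, 2.0],
    "Name": ["Maham Ofracculy", "Juanna Vines", "Altark Susent",
             "Solam Susent", "Willy Santantines"],
})

AMENITIES = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: derive the features listed in the initial thoughts."""
    out = df.copy()

    # Cabin has the form deck/num/side; rows that do not match (or are missing) become NaN.
    cabin = out["Cabin"].str.extract(r"^(?P<Deck>[^/]+)/(?P<CabinNum>\d+)/(?P<Side>[PS])$")
    out["Deck"] = cabin["Deck"]
    out["CabinNum"] = pd.to_numeric(cabin["CabinNum"])
    out["IsStarboard"] = cabin["Side"].map({"P": 0, "S": 1})

    # PassengerId has the form gggg_pp: travel group and number within the group.
    ids = out["PassengerId"].str.split("_", expand=True)
    out["Group"] = ids[0]
    out["NumInGroup"] = ids[1].astype(int)
    out["GroupSize"] = out.groupby("Group")["PassengerId"].transform("count")

    # Spending: sum() skips NaN, so a missing bill is implicitly treated as 0 here.
    spend = out[AMENITIES]
    out["TotalSpent"] = spend.sum(axis=1)
    for col in AMENITIES:
        out[f"Used{col}"] = spend[col] > 0  # a missing bill compares as False

    # Assumption: same group + same last name is a reasonable proxy for a family unit.
    out["LastName"] = out["Name"].str.split().str[-1]
    out["FamilySize"] = out.groupby(["Group", "LastName"])["PassengerId"].transform("count")
    return out


features = engineer_features(sample)
print(features[["PassengerId", "Deck", "IsStarboard", "GroupSize", "TotalSpent", "FamilySize"]])
```

On this sample, the two Susent passengers in group 0003 are the only ones with a group or family size above one. Whether these features actually help the model is still an open question to test.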


## Import Data and Begin Cleaning

In [1]:
import pandas as pd
df = pd.read_csv('data/train.csv')
df

| | PassengerId | HomePlanet | CryoSleep | Cabin | Destination | Age | VIP | RoomService | FoodCourt | ShoppingMall | Spa | VRDeck | Name | Transported |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0001_01 | Europa | False | B/0/P | TRAPPIST-1e | 39.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Maham Ofracculy | False |
| 1 | 0002_01 | Earth | False | F/0/S | TRAPPIST-1e | 24.0 | False | 109.0 | 9.0 | 25.0 | 549.0 | 44.0 | Juanna Vines | True |
| 2 | 0003_01 | Europa | False | A/0/S | TRAPPIST-1e | 58.0 | True | 43.0 | 3576.0 | 0.0 | 6715.0 | 49.0 | Altark Susent | False |
| 3 | 0003_02 | Europa | False | A/0/S | TRAPPIST-1e | 33.0 | False | 0.0 | 1283.0 | 371.0 | 3329.0 | 193.0 | Solam Susent | False |
| 4 | 0004_01 | Earth | False | F/1/S | TRAPPIST-1e | 16.0 | False | 303.0 | 70.0 | 151.0 | 565.0 | 2.0 | Willy Santantines | True |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 8688 | 9276_01 | Europa | False | A/98/P | 55 Cancri e | 41.0 | True | 0.0 | 6819.0 | 0.0 | 1643.0 | 74.0 | Gravior Noxnuther | False |
| 8689 | 9278_01 | Earth | True | G/1499/S | PSO J318.5-22 | 18.0 | False | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | Kurta Mondalley | False |
| 8690 | 9279_01 | Earth | False | G/1500/S | TRAPPIST-1e | 26.0 | False | 0.0 | 0.0 | 1872.0 | 1.0 | 0.0 | Fayey Connon | True |
| 8691 | 9280_01 | Europa | False | E/608/S | 55 Cancri e | 32.0 | False | 0.0 | 1049.0 | 0.0 | 353.0 | 3235.0 | Celeon Hontichre | False |


### Look for blank cells

In [9]:
# A plain loop avoids the list of None values that a print() comprehension returns
for col in df.columns:
    print(f'The number of nulls in column {col} is {df[col].isna().sum()}')

The number of nulls in column PassengerId is 0
The number of nulls in column HomePlanet is 201
The number of nulls in column CryoSleep is 217
The number of nulls in column Cabin is 199
The number of nulls in column Destination is 182
The number of nulls in column Age is 179
The number of nulls in column VIP is 203
The number of nulls in column RoomService is 181
The number of nulls in column FoodCourt is 183
The number of nulls in column ShoppingMall is 208
The number of nulls in column Spa is 183
The number of nulls in column VRDeck is 188
The number of nulls in column Name is 200
The number of nulls in column Transported is 0

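That is a fair number of blanks: every column except PassengerId and Transported is missing roughly 180 to 220 values, or about 2% of the ~8700 rows. Maybe we can infer some of the values.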
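
One way such inference might look is sketched below, on a made-up four-row frame (these rows are hypothetical, not taken from train.csv). It rests on two assumptions that have not been checked against the data. First, since passengers in cryosleep are confined to their cabins, their missing amenity bills are taken to be 0. Second, a travel group is assumed to share a home planet, so a missing **HomePlanet** is filled only when the rest of the group agrees on a single planet. The helper name `infer_missing` is also hypothetical.

```python
import pandas as pd

AMENITIES = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Made-up rows with blanks, only to exercise the logic.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0001_02", "0002_01", "0003_01"],
    "HomePlanet": ["Europa", None, "Earth", None],
    "CryoSleep": [False, True, True, False],
    "RoomService": [10.0, None, None, None],
    "Spa": [None, None, 0.0, 5.0],
})


def infer_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: fill blanks that can be inferred from other columns."""
    out = df.copy()

    # Assumption: a passenger in cryosleep cannot spend, so missing bills become 0.
    amenities = [c for c in AMENITIES if c in out.columns]
    asleep = out["CryoSleep"].eq(True)
    out.loc[asleep, amenities] = out.loc[asleep, amenities].fillna(0.0)

    # Assumption: a group shares one home planet. Fill only when the known
    # values in the group are unanimous; otherwise leave the blank alone.
    group = out["PassengerId"].str.split("_").str[0]
    planets = out["HomePlanet"].groupby(group).agg(["nunique", "first"])
    unambiguous = planets.loc[planets["nunique"] == 1, "first"]
    out["HomePlanet"] = out["HomePlanet"].fillna(group.map(unambiguous))
    return out


filled = infer_missing(toy)
print(toy.isna().sum().sum(), "blanks before,", filled.isna().sum().sum(), "after")
```

Blanks that neither rule covers (an awake passenger's missing bill, a solo traveller's missing planet) are left untouched and still need another strategy. Both assumptions should be validated on the rows where the values are known before applying them to the full dataset.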