# Spaceship Titanic - Predicting which passengers are transported to an alternate dimension.

## Introduction

In this project, we are presented with an imaginary scenario in which the spaceship *Titanic* is involved in a disaster that transports about half of its passengers to an alternate dimension. The goal is to predict which passengers were transported, based on damaged records recovered from the spaceship's computers.

The following features may need further explanation:
- PassengerId: A unique ID for each passenger. The first 4 digits represent the group they are travelling with, and the last 2 digits are their number within the group.
- CryoSleep: Whether the passenger was put into cryosleep for the voyage. If True, they are confined to their cabin.
- Cabin: Where the passenger is staying, in the form deck/num/side. Side is either P for Port or S for Starboard.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: Amount the passenger has been billed at each of the amenities.

## Exploratory Data Analysis

### General

The 'train' dataset is loaded first for analysis. Every original column except PassengerId and Transported has missing values (between 98 and 203 per column, out of 8693 rows), which will need to be filled before the model is able to process the data.

In [1]:
import numpy as np 
import seaborn as sns
import pandas as pd

In [2]:
df = pd.read_csv('./train.csv')

In [3]:
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True


In [92]:
df.isnull().sum()

PassengerId                  0
HomePlanet                 201
CryoSleep                   98
Cabin                      199
Destination                182
Age                        179
VIP                        203
RoomService                113
FoodCourt                  113
ShoppingMall               112
Spa                        118
VRDeck                     126
Name                       200
Transported                  0
TotalAmenities_ignoreNa      0
GroupID                      0
Num In Group                 0
Deck                       199
Num                        199
Side                       199
TotalAmenities             561
dtype: int64
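
Raw counts are easier to judge as a share of the rows (for example, 201 missing HomePlanet values out of 8693 rows is about 2.3%). The following is a minimal sketch of that calculation on a small made-up frame; `toy`, `missing_share`, and `rows_with_gap` are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (values are illustrative only)
toy = pd.DataFrame({
    "HomePlanet": ["Earth", None, "Mars", "Europa"],
    "Age": [24.0, 39.0, np.nan, np.nan],
    "Transported": [True, False, True, False],
})

# Fraction of missing values per column, largest first
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Number of rows that have at least one missing value
rows_with_gap = int(toy.isnull().any(axis=1).sum())
print(rows_with_gap)
```

The second number matters when deciding between dropping rows and imputing: if the gaps are spread across many different rows, dropping them would discard far more data than any single column's count suggests.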

In [5]:
df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [63]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PassengerId              8693 non-null   object 
 1   HomePlanet               8492 non-null   object 
 2   CryoSleep                8595 non-null   object 
 3   Cabin                    8494 non-null   object 
 4   Destination              8511 non-null   object 
 5   Age                      8514 non-null   float64
 6   VIP                      8490 non-null   object 
 7   RoomService              8580 non-null   float64
 8   FoodCourt                8580 non-null   float64
 9   ShoppingMall             8581 non-null   float64
 10  Spa                      8575 non-null   float64
 11  VRDeck                   8567 non-null   float64
 12  Name                     8493 non-null   object 
 13  Transported              8693 non-null   bool   
 14  TotalAmenities_ignoreNa 

In [7]:
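# numeric_only=True restricts the calculation to numeric and boolean columns;
# on pandas >= 2.0, calling corr() with string columns present raises an error
df.corr(numeric_only=True)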

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported
Age,1.0,0.068723,0.130421,0.033133,0.12397,0.101007,-0.075026
RoomService,0.068723,1.0,-0.015889,0.05448,0.01008,-0.019581,-0.244611
FoodCourt,0.130421,-0.015889,1.0,-0.014228,0.221891,0.227995,0.046566
ShoppingMall,0.033133,0.05448,-0.014228,1.0,0.013879,-0.007322,0.010141
Spa,0.12397,0.01008,0.221891,0.013879,1.0,0.153821,-0.221131
VRDeck,0.101007,-0.019581,0.227995,-0.007322,0.153821,1.0,-0.207075
Transported,-0.075026,-0.244611,0.046566,0.010141,-0.221131,-0.207075,1.0
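
In the matrix above, the column of interest is the correlation of each feature with Transported (RoomService, Spa, and VRDeck are the most negative). A small sketch of pulling out and ranking just that column is shown below, using made-up values; the boolean target is cast to int explicitly so the example does not depend on how a given pandas version treats boolean columns.

```python
import pandas as pd

# Toy data (illustrative only): spenders on RoomService tend not to be transported
toy = pd.DataFrame({
    "RoomService": [0.0, 500.0, 0.0, 900.0, 0.0, 300.0],
    "FoodCourt": [10.0, 0.0, 200.0, 5.0, 150.0, 0.0],
    "Transported": [True, False, True, False, True, False],
})

# Cast the boolean target to 0/1 so it takes part in the Pearson correlation
numeric = toy.assign(Transported=toy["Transported"].astype(int))

# Keep only the correlations with the target, most negative first
corr_with_target = (
    numeric.corr()["Transported"]
    .drop("Transported")
    .sort_values()
)
print(corr_with_target)
```

Pearson correlation only captures linear association, so for heavily skewed spending columns it is best read as a rough ranking rather than a measure of predictive strength.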


### Cabin

From the background information provided, we know that the cabin is formatted as 'Deck/Num/Side'. The Cabin string is split on '/' into three separate columns.

In [None]:
df[['Deck', 'Num', 'Side']] = df['Cabin'].str.split(r'/', expand=True)
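
The sketch below illustrates how this split behaves on a small made-up frame, including a missing cabin. It also converts Num to a number, which is an optional extra step not taken in the notebook; `toy` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Toy frame with one missing cabin (values are illustrative only)
toy = pd.DataFrame({"Cabin": ["B/0/P", "F/12/S", np.nan]})

# A missing Cabin yields missing values in all three new columns
toy[["Deck", "Num", "Side"]] = toy["Cabin"].str.split("/", expand=True)

# Optional: the split produces strings, so convert Num for numeric comparisons
toy["Num"] = pd.to_numeric(toy["Num"])
print(toy)
```

This is consistent with the null counts shown earlier, where Deck, Num, and Side each have the same 199 missing values as Cabin.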

### CryoSleep and Spending at Amenities

The amenities columns are provided as strings, so they are first converted to float in order to perform mathematical operations on them. Two total columns are then created: TotalAmenities_ignoreNa sums the amenities columns while ignoring null values, and TotalAmenities is null whenever any of the amenities values is missing.

In [62]:
amenity_cols = ['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']
df[amenity_cols] = df[amenity_cols].astype(float)
# Total that treats missing amenity values as 0
df['TotalAmenities_ignoreNa'] = df[amenity_cols].sum(axis=1, skipna=True)
# Total that is NaN whenever any amenity value is missing
df['TotalAmenities'] = df[amenity_cols].sum(axis=1, skipna=False)
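
The difference between the two totals is easiest to see on a few rows. The following sketch uses a made-up frame (`toy`, `total_ignore_na`, and `total_strict` are illustrative names) with one complete row, one row with a single gap, and one row with no data at all.

```python
import numpy as np
import pandas as pd

amenity_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]

# Toy rows: complete, one value missing, everything missing
toy = pd.DataFrame(
    [
        [109.0, 9.0, 25.0, 549.0, 44.0],
        [0.0, 0.0, 0.0, 0.0, np.nan],
        [np.nan, np.nan, np.nan, np.nan, np.nan],
    ],
    columns=amenity_cols,
)

# skipna=True treats every missing value as 0
total_ignore_na = toy[amenity_cols].sum(axis=1, skipna=True)

# skipna=False propagates NaN if any value in the row is missing
total_strict = toy[amenity_cols].sum(axis=1, skipna=False)

print(pd.DataFrame({"ignoreNa": total_ignore_na, "strict": total_strict}))
```

Note that with skipna=True a row with no data at all still sums to 0.0, so a zero in TotalAmenities_ignoreNa does not by itself prove that the passenger spent nothing. If that distinction matters, passing `min_count=1` to `sum` makes an all-missing row return NaN instead.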

First, the missing data for the amenities (room service, food court, shopping mall, spa, and VR deck) can be filled in for passengers in cryosleep. Since these passengers are confined to their cabins, it should be safe to assume that they do not spend any money on the amenities. This is confirmed with .describe(): for passengers with CryoSleep equal to True, both the mean and the maximum of every amenity column are 0 wherever data is available.

In [9]:
df[df['CryoSleep']==True].describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,TotalAmenities_ignoreNa
count,2955.0,2969.0,2967.0,2941.0,2972.0,2975.0,3037.0
mean,27.405415,0.0,0.0,0.0,0.0,0.0,0.0
std,15.080469,0.0,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,18.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,26.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,37.0,0.0,0.0,0.0,0.0,0.0,0.0
max,78.0,0.0,0.0,0.0,0.0,0.0,0.0


In [10]:
df.loc[df['CryoSleep'] == True, ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]] = df.loc[df['CryoSleep'] == True, ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]].fillna(0.0)
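
As a hedged illustration of what this fill does and does not touch, the sketch below applies the same idea to a made-up frame with only two amenity columns. The names `toy` and `in_cryo` are illustrative; `.eq(True)` is used in place of `== True` but selects the same rows.

```python
import numpy as np
import pandas as pd

amenity_cols = ["RoomService", "Spa"]

# Toy frame: two cryosleepers, one awake passenger, one unknown status
toy = pd.DataFrame({
    "CryoSleep": [True, True, False, np.nan],
    "RoomService": [0.0, np.nan, np.nan, np.nan],
    "Spa": [np.nan, 0.0, 50.0, np.nan],
})

# Missing CryoSleep compares as False, so unknown-status rows are not selected
in_cryo = toy["CryoSleep"].eq(True)

# Fill gaps with 0.0 only for passengers known to be in cryosleep
toy.loc[in_cryo, amenity_cols] = toy.loc[in_cryo, amenity_cols].fillna(0.0)
print(toy)
```

Gaps for awake passengers and for passengers with unknown CryoSleep status are left as they are, since the zero-spending assumption does not apply to them.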

It is reasonable to assume that passengers with a missing value in the CryoSleep column who have spent money on amenities were not in CryoSleep.

In [11]:
# Passengers with unknown CryoSleep status who spent money cannot have been in CryoSleep
spent_but_unknown = (df['TotalAmenities_ignoreNa'] > 0.0) & (df['CryoSleep'].isnull())
df.loc[spent_but_unknown, 'CryoSleep'] = False

Although spending on amenities rules out CryoSleep, the reverse cannot be assumed, as there are also passengers who did not go into CryoSleep and still spent 0 on the amenities. Passengers with zero spending and an unknown CryoSleep status, such as those below, therefore cannot be filled in with this rule.

In [26]:
df[(df['TotalAmenities_ignoreNa']==0.0) & (df['CryoSleep'].isnull())].head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,TotalAmenities_ignoreNa,GroupID,Num In Group,Deck,Num,Side
92,0099_02,Earth,,G/12/P,TRAPPIST-1e,2.0,False,0.0,0.0,0.0,0.0,0.0,Thewis Connelson,True,0.0,99,2,G,12,P
111,0115_01,Mars,,F/24/P,TRAPPIST-1e,26.0,False,0.0,0.0,0.0,0.0,,Rohs Pead,True,0.0,115,1,F,24,P
175,0198_01,Earth,,G/30/P,PSO J318.5-22,52.0,False,0.0,0.0,0.0,0.0,0.0,Jeroy Cookson,True,0.0,198,1,G,30,P
266,0290_03,Europa,,B/7/S,TRAPPIST-1e,43.0,False,0.0,0.0,0.0,0.0,0.0,Dhenar Excialing,True,0.0,290,3,B,7,S
392,0433_01,Europa,,B/20/P,55 Cancri e,27.0,False,0.0,0.0,0.0,0.0,0.0,Hekark Mormonized,True,0.0,433,1,B,20,P
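
One way to see why the reverse does not hold is to cross-tabulate zero spending against CryoSleep for passengers whose status is known. The sketch below does this on made-up data; the frame and the label names ("zero", "spent", "cryo", "awake") are illustrative assumptions, and the resulting share is not a figure from the real dataset.

```python
import pandas as pd

# Toy frame with known CryoSleep status (values are illustrative only)
toy = pd.DataFrame({
    "CryoSleep": [True, True, False, False, False],
    "TotalAmenities_ignoreNa": [0.0, 0.0, 0.0, 736.0, 1091.0],
})

# Readable labels for both axes of the table
spend_label = (
    toy["TotalAmenities_ignoreNa"].eq(0.0)
    .map({True: "zero", False: "spent"})
    .rename("Spend")
)
cryo_label = toy["CryoSleep"].map({True: "cryo", False: "awake"})

# Counts of passengers in each combination
table = pd.crosstab(spend_label, cryo_label)
print(table)

# Share of zero spenders who were actually in CryoSleep
zero_rows = toy[toy["TotalAmenities_ignoreNa"].eq(0.0)]
share_cryo_given_zero = zero_rows["CryoSleep"].mean()
print(share_cryo_given_zero)
```

If the same table on the real data showed that nearly all zero spenders were in CryoSleep, filling the remaining gaps with True could be a defensible heuristic, but that would be a modelling choice rather than a logical consequence.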


### PassengerID

From the information provided, PassengerId can be split into 2 parts: the group number and the passenger's number within the group. These two parts are split on '_' into two new columns.

In [19]:
df[['GroupID', 'Num In Group']] = df['PassengerId'].str.split('_',expand=True)
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,TotalAmenities_ignoreNa,GroupID,Num In Group
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0.0,1,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,736.0,2,1
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,10383.0,3,1
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,5176.0,3,2
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,1091.0,4,1


In [20]:
more_than_1_person_in_group = df[df['GroupID'].duplicated()]['GroupID'].to_list()
df[df['GroupID'].isin(more_than_1_person_in_group)].head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,TotalAmenities_ignoreNa,GroupID,Num In Group
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,10383.0,3,1
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,5176.0,3,2
6,0006_01,Earth,False,F/2/S,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,Billex Jacostaffey,True,1584.0,6,1
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,0.0,Candra Jacostaffey,True,0.0,6,2
9,0008_01,Europa,True,B/1/P,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,Erraiam Flatic,True,0.0,8,1
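
The same group information can also be turned into per-passenger features. The sketch below is one possible approach, assuming a group-size column is wanted; `GroupSize` and `InGroup` are hypothetical column names that do not appear in the notebook.

```python
import pandas as pd

# Toy IDs in the gggg_pp format (illustrative only)
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0004_01", "0006_01", "0006_02"],
})

# Both parts stay as strings, so leading zeros are preserved unless cast to int
toy[["GroupID", "Num In Group"]] = toy["PassengerId"].str.split("_", expand=True)

# Number of passengers sharing each GroupID, broadcast back to every row
toy["GroupSize"] = toy.groupby("GroupID")["PassengerId"].transform("size")

# Flag passengers who travel with at least one other person
toy["InGroup"] = toy["GroupSize"] > 1
print(toy)
```

Compared with collecting duplicated IDs into a list, the `transform` approach keeps the result aligned with the original rows and gives the size of the group as well as a yes/no flag.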


### HomePlanet and Spending Habits

The passengers aboard the ship come from only 3 planets: Earth, Europa, and Mars.

In [None]:
df['HomePlanet'].value_counts()

Earth     4602
Europa    2131
Mars      1759
Name: HomePlanet, dtype: int64

In [None]:
df['HomePlanet'].isnull().sum()

201

Passengers' spending habits also differ based on their home planet. Among passengers who have no missing data in any of the amenities columns and who spent more than 0.0 in total, those from Europa spent the most, with large amounts spent on the food court, spa, and VR deck.

Considering the mean, passengers from Mars spent more on room service and the shopping mall, while passengers from Earth spread their spending fairly evenly across all the amenities and also spent the least.

The median indicates that it is highly likely that Earth just has passengers who spend

In [132]:
df[(df['TotalAmenities'] != 0.0) & (df['TotalAmenities'].notna())][['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck', 'HomePlanet','TotalAmenities_ignoreNa']].groupby(["HomePlanet"]).describe().transpose()

Unnamed: 0,HomePlanet,Earth,Europa,Mars
RoomService,count,2561.0,1013.0,858.0
RoomService,mean,219.34752,278.260612,1018.631702
RoomService,std,468.448855,1100.984277,1075.534331
RoomService,min,0.0,0.0,0.0
RoomService,25%,0.0,0.0,230.75
RoomService,50%,4.0,0.0,797.5
RoomService,75%,240.0,9.0,1332.5
RoomService,max,6256.0,14327.0,9920.0
FoodCourt,count,2561.0,1013.0,858.0
FoodCourt,mean,218.340492,2929.447187,94.931235
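
Because spending is heavily skewed (for Earth, the RoomService mean above is about 219 while the median is 4), it can help to view the mean and the median side by side. The sketch below shows a compact way to do that with `agg` on made-up numbers; `toy` and `summary` are illustrative names, and the values are not taken from the dataset.

```python
import pandas as pd

# Toy spenders (illustrative only): a few large values dominate the means
toy = pd.DataFrame({
    "HomePlanet": ["Earth", "Earth", "Earth", "Europa", "Europa", "Mars", "Mars"],
    "RoomService": [0.0, 4.0, 650.0, 0.0, 9.0, 800.0, 1200.0],
    "FoodCourt": [200.0, 10.0, 0.0, 3000.0, 2500.0, 0.0, 90.0],
})

# Mean and median per planet for each amenity column
summary = toy.groupby("HomePlanet")[["RoomService", "FoodCourt"]].agg(["mean", "median"])
print(summary)
```

A large gap between mean and median for a planet suggests that a small number of big spenders drive the average, which is worth keeping in mind before using the mean to impute missing values.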


In [102]:
df[df['HomePlanet'].isnull()][['HomePlanet','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','TotalAmenities_ignoreNa']].head()

Unnamed: 0,HomePlanet,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,TotalAmenities_ignoreNa
59,,0.0,0.0,0.0,0.0,0.0,0.0
113,,0.0,2344.0,0.0,65.0,6898.0,9307.0
186,,0.0,0.0,0.0,0.0,0.0,0.0
225,,313.0,1.0,691.0,283.0,0.0,1288.0
234,,0.0,0.0,0.0,0.0,0.0,0.0


In [133]:
df[df['HomePlanet'] =='Earth'][['HomePlanet','RoomService','FoodCourt','ShoppingMall','Spa','VRDeck','TotalAmenities_ignoreNa']]

Unnamed: 0,HomePlanet,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,TotalAmenities_ignoreNa
1,Earth,109.0,9.0,25.0,549.0,44.0,736.0
4,Earth,303.0,70.0,151.0,565.0,2.0,1091.0
5,Earth,0.0,483.0,0.0,291.0,0.0,774.0
6,Earth,42.0,1539.0,3.0,0.0,0.0,1584.0
7,Earth,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...
8681,Earth,0.0,0.0,0.0,0.0,0.0,0.0
8682,Earth,240.0,242.0,510.0,0.0,0.0,992.0
8683,Earth,86.0,3.0,149.0,208.0,329.0,775.0
8689,Earth,0.0,0.0,0.0,0.0,0.0,0.0


The following assumptions were made:
- Passengers going into CryoSleep do not spend money on amenities.
- Passengers who have spent money on amenities are not in CryoSleep.
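
These assumptions can be checked against the data before they are relied on. The sketch below shows one way to count rows that contradict them, using a hypothetical helper `count_cryosleep_spenders` and made-up frames; it is a sanity check under those assumptions, not part of the original analysis.

```python
import numpy as np
import pandas as pd


def count_cryosleep_spenders(frame: pd.DataFrame) -> int:
    """Count passengers marked as in CryoSleep who nevertheless spent money."""
    in_cryo = frame["CryoSleep"].eq(True)  # missing status compares as False
    spent = frame["TotalAmenities_ignoreNa"].gt(0.0)
    return int((in_cryo & spent).sum())


# Toy frame that is consistent with both assumptions
consistent = pd.DataFrame({
    "CryoSleep": [True, False, np.nan, True],
    "TotalAmenities_ignoreNa": [0.0, 736.0, 1288.0, 0.0],
})

# Same frame with one contradicting row: a cryosleeper with spending
inconsistent = consistent.copy()
inconsistent.loc[0, "TotalAmenities_ignoreNa"] = 50.0

print(count_cryosleep_spenders(consistent))
print(count_cryosleep_spenders(inconsistent))
```

A non-zero count on the real data would indicate either recording errors or that the assumptions need to be relaxed before they are used for imputation.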