# Spaceship Titanic - Predicting which passengers are transported to an alternate dimension.

## Introduction

In this project, we are presented with a fictional scenario in which the spaceship *Titanic* is involved in a disaster that transports about half of its passengers to an alternate dimension. The goal is to predict which passengers were transported, using damaged records recovered from the spaceship's computers.

The features below may need further explanation:
- PassengerId: A unique ID for each passenger (for example, 0003_02). The first 4 digits represent the group they are travelling with, and the last 2 digits are their number within the group.
- CryoSleep: Whether the passenger was put into cryosleep for the voyage. If True, they are confined to their cabin.
- Cabin: Where the passenger is staying, in the form deck/num/side. Side can be either P for Port or S for Starboard.
- RoomService, FoodCourt, ShoppingMall, Spa, VRDeck: Amount the passenger has been billed at each of the amenities.

## Exploratory Data Analysis

The 'train' dataset is loaded first for analysis. Every column except PassengerId and Transported has missing values (roughly 180 to 220 each), which need to be filled before a model can process the data.

In [2]:
import numpy as np 
import seaborn as sns
import pandas as pd

In [71]:
df = pd.read_csv('./train.csv')

In [4]:
df.head(10)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True
5,0005_01,Earth,False,F/0/P,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,Sandie Hinetthews,True
6,0006_01,Earth,False,F/2/S,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,Billex Jacostaffey,True
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,,Candra Jacostaffey,True
8,0007_01,Earth,False,F/3/S,TRAPPIST-1e,35.0,False,0.0,785.0,17.0,216.0,0.0,Andona Beston,True
9,0008_01,Europa,True,B/1/P,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,Erraiam Flatic,True


In [5]:
df.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64
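
With 8693 rows in the dataset, the counts above correspond to roughly 2 to 2.5% missing values per affected column. The sketch below shows one way to report missing values as counts and percentages side by side. The helper name `missing_summary` and the toy frame are made up for illustration and are not part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, largest first."""
    counts = frame.isnull().sum()
    summary = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(2),
    })
    return summary.sort_values("missing", ascending=False)


# Toy frame for illustration only (not the competition data).
toy = pd.DataFrame({
    "HomePlanet": ["Europa", None, "Earth", "Earth"],
    "Age": [39.0, 24.0, np.nan, np.nan],
    "Transported": [False, True, False, True],
})

summary = missing_summary(toy)
print(summary)
```

On the real data this would be called as `missing_summary(df)`.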

In [6]:
df.describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,8514.0,8512.0,8510.0,8485.0,8510.0,8505.0
mean,28.82793,224.687617,458.077203,173.729169,311.138778,304.854791
std,14.489021,666.717663,1611.48924,604.696458,1136.705535,1145.717189
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,19.0,0.0,0.0,0.0,0.0,0.0
50%,27.0,0.0,0.0,0.0,0.0,0.0
75%,38.0,47.0,76.0,27.0,59.0,46.0
max,79.0,14327.0,29813.0,23492.0,22408.0,24133.0


In [50]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8693 entries, 0 to 8692
Data columns (total 21 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   PassengerId              8693 non-null   object 
 1   HomePlanet               8492 non-null   object 
 2   CryoSleep                8476 non-null   object 
 3   Cabin                    8494 non-null   object 
 4   Destination              8511 non-null   object 
 5   Age                      8514 non-null   float64
 6   VIP                      8490 non-null   object 
 7   RoomService              8580 non-null   float64
 8   FoodCourt                8580 non-null   float64
 9   ShoppingMall             8581 non-null   float64
 10  Spa                      8575 non-null   float64
 11  VRDeck                   8567 non-null   float64
 12  Name                     8493 non-null   object 
 13  Transported              8693 non-null   bool   
 14  GroupID                 

### CryoSleep and Spending at Amenities

First, the missing data for the amenities (room service, food court, shopping mall, spa, and VR deck) can be filled in. Passengers in cryosleep are confined to their cabins, so it should be safe to assume that they do not spend any money on those amenities. This is confirmed with .describe(): for passengers in cryosleep with data available, both the mean and the maximum spending at every amenity are 0.

In [8]:
df[df['CryoSleep']==True].describe()

Unnamed: 0,Age,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck
count,2955.0,2969.0,2967.0,2941.0,2972.0,2975.0
mean,27.405415,0.0,0.0,0.0,0.0,0.0
std,15.080469,0.0,0.0,0.0,0.0,0.0
min,0.0,0.0,0.0,0.0,0.0,0.0
25%,18.0,0.0,0.0,0.0,0.0,0.0
50%,26.0,0.0,0.0,0.0,0.0,0.0
75%,37.0,0.0,0.0,0.0,0.0,0.0
max,78.0,0.0,0.0,0.0,0.0,0.0


In [72]:
amenity_cols = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]
in_cryosleep = df['CryoSleep'] == True
# Passengers in cryosleep do not spend, so their missing amenity values become 0
df.loc[in_cryosleep, amenity_cols] = df.loc[in_cryosleep, amenity_cols].fillna(0.0)

It may also be reasonable to assume that passengers whose CryoSleep value is missing, but who have spent money on amenities, are not in cryosleep.

In [75]:
# Make sure the amenity columns are numeric
df[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']] = df[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].astype(float)
# Sum with +: the total is NaN if any amenity value is missing
df['TotalAmenities'] = df['RoomService'] + df['FoodCourt'] + df['ShoppingMall'] + df['VRDeck'] + df['Spa']
# Sum with skipna=True: missing amenity values are treated as 0
df['TotalAmenities_ignoreNa'] = df[['RoomService','FoodCourt','ShoppingMall','Spa','VRDeck']].sum(axis=1, skipna=True)
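
The assumption that spending implies the passenger is awake could be applied roughly as follows. This is a minimal sketch, not the notebook's own code: the function name `infer_cryosleep_from_spending` and the toy rows are hypothetical. Passengers with a missing CryoSleep value and zero recorded spending are deliberately left missing, since zero spending alone does not show that someone was in cryosleep.

```python
import numpy as np
import pandas as pd

AMENITIES = ["RoomService", "FoodCourt", "ShoppingMall", "Spa", "VRDeck"]


def infer_cryosleep_from_spending(frame: pd.DataFrame) -> pd.DataFrame:
    """Set CryoSleep to False where it is missing and the passenger spent money."""
    out = frame.copy()
    # Missing amenity values are ignored, so any known positive amount counts.
    total = out[AMENITIES].sum(axis=1, skipna=True)
    spent_but_unknown = out["CryoSleep"].isnull() & (total > 0)
    out.loc[spent_but_unknown, "CryoSleep"] = False
    return out


# Toy rows for illustration only (not the competition data).
toy = pd.DataFrame({
    "CryoSleep": [True, None, None, False],
    "RoomService": [0.0, 109.0, 0.0, 43.0],
    "FoodCourt": [0.0, np.nan, 0.0, 3576.0],
    "ShoppingMall": [0.0, 25.0, 0.0, 0.0],
    "Spa": [0.0, 549.0, 0.0, 6715.0],
    "VRDeck": [0.0, 44.0, 0.0, 49.0],
})

result = infer_cryosleep_from_spending(toy)
print(result["CryoSleep"])
```

On the real data, `df = infer_cryosleep_from_spending(df)` would apply the same rule.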

### HomePlanet

The passengers aboard the ship come from only 3 planets: Earth, Europa, and Mars.

In [11]:
df['HomePlanet'].value_counts()

Earth     4602
Europa    2131
Mars      1759
Name: HomePlanet, dtype: int64

In [12]:
df['HomePlanet'].isnull().sum()

201

In [13]:
df[df['HomePlanet'].isnull()].head(1)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
59,0064_02,,True,E/3/S,TRAPPIST-1e,33.0,False,0.0,0.0,0.0,0.0,0.0,Colatz Keen,True


### Destination

In [42]:
df['Destination'].value_counts()

TRAPPIST-1e      5915
55 Cancri e      1800
PSO J318.5-22     796
Name: Destination, dtype: int64

### PassengerId

As described in the introduction, PassengerId can be split into two parts: the group number and the passenger's number within the group. These are stored in two new columns, GroupID and Num In Group.

In [14]:
df[['GroupID', 'Num In Group']] = df['PassengerId'].str.split('_',expand=True)
df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,GroupID,Num In Group
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,1,1
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,2,1
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,3,1
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,3,2
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,4,1


In [15]:
more_than_1_person_in_group = df[df['GroupID'].duplicated()]['GroupID'].to_list()
df[df['GroupID'].isin(more_than_1_person_in_group)].head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,GroupID,Num In Group
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,3,1
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,3,2
6,0006_01,Earth,False,F/2/S,TRAPPIST-1e,26.0,False,42.0,1539.0,3.0,0.0,0.0,Billex Jacostaffey,True,6,1
7,0006_02,Earth,True,G/0/S,TRAPPIST-1e,28.0,False,0.0,0.0,0.0,0.0,0.0,Candra Jacostaffey,True,6,2
9,0008_01,Europa,True,B/1/P,55 Cancri e,14.0,False,0.0,0.0,0.0,0.0,0.0,Erraiam Flatic,True,8,1
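
The same selection of passengers who share a group can be written without the intermediate list by passing `keep=False` to `duplicated`, which marks every occurrence of a repeated value rather than only the later ones. The sketch below uses a handful of passenger IDs taken from the output above. Note that `str.split` returns strings, so the group IDs keep their leading zeros unless they are converted to integers.

```python
import pandas as pd

# A few passenger IDs copied from the output above, for illustration.
toy = pd.DataFrame({
    "PassengerId": ["0001_01", "0003_01", "0003_02", "0006_01", "0006_02"],
})
toy[["GroupID", "Num In Group"]] = toy["PassengerId"].str.split("_", expand=True)

# keep=False flags all members of any group that appears more than once.
in_shared_group = toy["GroupID"].duplicated(keep=False)
shared = toy[in_shared_group]
print(shared)
```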


### Cabin

In [16]:
df[['Deck', 'Num', 'Side']] = df['Cabin'].str.split(r'/', expand=True)
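
The split above can be checked on a small example. In this sketch a missing Cabin value propagates as a missing value to all three new columns, and the cabin number comes back as a string until it is converted. The conversion with `pd.to_numeric` is an optional extra step assumed here, not something the original notebook does.

```python
import numpy as np
import pandas as pd

# Two cabins in deck/num/side form plus one missing value, for illustration.
cabins = pd.Series(["B/0/P", "F/1/S", np.nan], name="Cabin")

parts = cabins.str.split("/", expand=True)
parts.columns = ["Deck", "Num", "Side"]

# Optional: convert the cabin number from string to a numeric type.
parts["Num"] = pd.to_numeric(parts["Num"])
print(parts)
```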

The following assumptions were made:
- Passengers going into CryoSleep do not spend money on amenities.