<h1 style="; font-weight:bold"> Spaceship Titanic Analysis & ML </h1>

<p style="margin-left:30%">~  Nikhil Sharma</p>

# Table of Contents

<ol style= 'list-style-type: decimal'>
    <li><a href="#Introduction">Introduction</a></li>
    <li><a href="#data_understanding">Data Understanding</a></li>
    <li>
        <a href="#data_preparation">Data Preparation</a>
        <ol style= 'list-style-type: decimal' >
            <li><a href="#libraries">Import Required Libraries</a></li>
            <li><a href="#data_loading">Data Loading</a></li>
            <li><a href="#data_info">Brief Data Information</a></li>
        </ol>
    </li>
    <li>
        <a href="#data_preprocessing">Data Preprocessing</a>
        <ol style= 'list-style-type: decimal'>
            <li><a href="#data_cleaning">Imputing Missing Values</a></li>
            <li><a href="#feature_engineering">Feature Engineering</a></li>
        </ol>
    </li>
    <li>
        <a href="model">Model Building</a>
    </li>
    <li><a href="#conclusion">Conclusion</a></li>
 </ol>    


# 1. Introduction: <a class="anchor" id="introduction"></a>
Welcome to the year 2912, where your data science skills are needed to solve a cosmic mystery. We've received a transmission from four lightyears away and things aren't looking good.

The Spaceship Titanic was an interstellar passenger liner launched a month ago. With almost 13,000 passengers on board, the vessel set out on its maiden voyage transporting emigrants from our solar system to three newly habitable exoplanets orbiting nearby stars.

While rounding Alpha Centauri en route to its first destination—the torrid 55 Cancri E—the unwary Spaceship Titanic collided with a spacetime anomaly hidden within a dust cloud. Sadly, it met a similar fate as its namesake from 1000 years before. Though the ship stayed intact, almost half of the passengers were transported to an alternate dimension!

To help rescue crews and retrieve the lost passengers, you are challenged to predict which passengers were transported by the anomaly using records recovered from the spaceship’s damaged computer system.

Help save them and change history!


# 2. Data Understanding:<a class="anchor" id="data_understanding"></a>


Our task is to predict whether a passenger was transported to an alternate dimension during the Spaceship Titanic's collision with the space-time anomaly. To help us make these predictions, we're given a set of personal records recovered from the ship's damaged computer system.
<ol>
    <li><b>Train.csv </b>- Personal records for about two-thirds (~8700) of the passengers, to be used as training data.
        <ol>
            <li>
                <b>PassengerId</b> - A unique Id for each passenger. Each Id takes the form gggg_pp where gggg indicates a group the passenger is travelling with and pp is their number within the group. People in a group are often family members, but not always.
            </li>
            <li>
                <b>HomePlanet </b>- The planet the passenger departed from, typically their planet of permanent residence.
            </li>
            <li>
           <b> CryoSleep</b> - Indicates whether the passenger elected to be put into suspended animation for the duration of the voyage. Passengers in cryosleep are confined to their cabins.
            </li>
            <li>
           <b> Cabin</b> - The cabin number where the passenger is staying. Takes the form deck/num/side, where side can be either P for Port or S for Starboard.
            </li>
            <li>
            <b>Destination</b> - The planet the passenger will be debarking to.
            </li>
            <li>
          <b> Age </b>- The age of the passenger.
            </li>
             <li>
          <b>VIP </b>- Whether the passenger has paid for special VIP service during the voyage.
            </li>
             <li>
          <b> RoomService, FoodCourt, ShoppingMall, Spa, VRDeck</b> - Amount the passenger has billed at each of the Spaceship Titanic's many luxury amenities.
            </li>
             <li>
   <b> Name </b>- The first and last names of the passenger.
            </li>
            <li>
   <b> Transported </b>- Whether the passenger was transported to another dimension. This is the target, the column you are trying to predict.
            </li>
        </ol>
    </li>
    <li>

   <b> test.csv </b>- Personal records for the remaining one-third (~4300) of the passengers, to be used as test data. Your task is to predict the value of Transported for the passengers in this set.
    </li>
<ol/>


# 3. Data Preparation:<a class="anchor" id="data_preparation"></a>
In this Section:
* We will Install and Import the required Libraries.
* We will load the data.
* We will see few basic information about our data.

## 3.1 Import Required Libraries:<a class="anchor" id="libraries"></a>

In [1]:
#Import Required Librarires
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#For Model training
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier


#Libraries for splitting the data and tuning the parameters of models.
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

#Supress Warning
def warn(*args, **kwargs):
    pass
import warnings
warnings.warn = warn
warnings.simplefilter(action='ignore', category=FutureWarning)

### 3.2 Data Loading:<a class="anchor" id="data_loading"></a>

In [2]:
test_data = pd.read_csv('data/test.csv')
train_data = pd.read_csv('data/train.csv')

In [3]:
train_data.head(3)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False


### 3.3 Brief Data Information:<a class="anchor" id="data_info"></a>
  

In [4]:
train_data.isnull().sum()

PassengerId       0
HomePlanet      201
CryoSleep       217
Cabin           199
Destination     182
Age             179
VIP             203
RoomService     181
FoodCourt       183
ShoppingMall    208
Spa             183
VRDeck          188
Name            200
Transported       0
dtype: int64

*Few Findings :*
* Every features has null value except for `PassengerId` and `transported`.
* We will have to fill these null/missing values.
* We also need to fill `CryoSleep` feature first.

# 4. Data Processing: <a class="anchor" id="data_preprocessing"></a>
In this Section:
* We will clean our data.
* We will fill missing values
* We will do feature Engineering.

## 4.1 Imputing Missing Values: <a class="anchor" id="data_cleaning"></a>

Let's work on a copy of our dataframe!

In [5]:
train_df =  train_data.copy()
train_df.head(2)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True


Now we have to make an `Expenses` feature temporarily because if someone is in `Cryosleep`, they are not spending any money. So, knowing someone's expenses can help us impute values for `CryoSleep`.

In [6]:
train_df['Expenses'] = train_df[['RoomService', 'FoodCourt',
                                           'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)
train_df[["PassengerId","HomePlanet", "CryoSleep", "Expenses"]].head(3)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Expenses
0,0001_01,Europa,False,0.0
1,0002_01,Earth,False,736.0
2,0003_01,Europa,False,10383.0


In [7]:
train_df['Expenses'].groupby(train_df['Age'] < 13).describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
False,7887.0,1588.113478,2902.797444,0.0,0.0,786.0,1624.0,35987.0
True,806.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [8]:
train_df['Expenses'].groupby(train_df['Age'] >= 13).describe()

Unnamed: 0_level_0,count,mean,std,min,25%,50%,75%,max
Age,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1,Unnamed: 8_level_1
False,985.0,235.716751,1316.031595,0.0,0.0,0.0,0.0,22261.0
True,7708.0,1594.871562,2903.579335,0.0,0.0,787.0,1669.0,35987.0


* Since we can't really guess anyone's age, we'll impute these null values with the median. 
* An interesting thing I observed is that only people who are 13+ have expenses. I guess the little kids don't get pocket money. Quite unfair.

In [9]:
train_df.Age = train_df.Age.fillna(train_df.Age.median())

In [10]:
train_df[["PassengerId",'Name',"HomePlanet", "CryoSleep","Age", "Expenses"]].head(6)

Unnamed: 0,PassengerId,Name,HomePlanet,CryoSleep,Age,Expenses
0,0001_01,Maham Ofracculy,Europa,False,39.0,0.0
1,0002_01,Juanna Vines,Earth,False,24.0,736.0
2,0003_01,Altark Susent,Europa,False,58.0,10383.0
3,0003_02,Solam Susent,Europa,False,33.0,5176.0
4,0004_01,Willy Santantines,Earth,False,16.0,1091.0
5,0005_01,Sandie Hinetthews,Earth,False,44.0,774.0


We can see our newly created feature `Expenses` right after the feature `Age` in a demo presentation of our dataframe above. 


*Funnily enough, the first person on the dataset, Mr. Maham Ofracculy, didn't spend any money, and they weren't even in `CryoSleep`. Maybe they are broke? In that case I can relate with them.*

we're going to make a new column for cryosleep, with all values equal to False (or 0):

In [11]:
train_df['Cryosleep'] = 0

Now, for every row where Expenses is `0`, we're going to put `1` as the value. Because if someone has not spent any money, they are probably in `CryoSleep`. But don't worry, we'll deal with the exceptions, like *Mr. Maham Ofracculy*, later.

In [12]:
train_df.loc[train_df['Expenses'] == 0, 'Cryosleep'] = 1

Now, we are going to set this feature's value to `1` wherever the original `CryoSleep` is equal to `True`.

In [13]:
train_df.loc[train_df.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1

Conversely, we will put it to `0 `wherever `CryoSleep` is equal to `False`.

In [14]:
train_df.loc[train_df.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0

Let's take a look at this new feature:

In [15]:
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Name,Transported,Expenses,Cryosleep
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,Maham Ofracculy,False,0.0,0
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,Juanna Vines,True,736.0,0
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,Altark Susent,False,10383.0,0
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,Solam Susent,False,5176.0,0
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,Willy Santantines,True,1091.0,0


*What we have just completed:*

* First, we created a new feature `Cryosleep` and  set all its value to `False` = `0`.
* Next, we set `Cryosleep` as `True` = `1` for everyone whose expenses is 0.
* Finally, we used the original `CryoSleep` feature, to correct `CryoSleep` status for the people who haven't spent any money, but aren't in `cryosleep`. Just in case our last step incorrectly classified them as being in `cryosleep`.

Logical, right?

Now, let's just replace the original column with this one. There's probably a better way of doing this than how I did it here here:

In [16]:
train_df['Cryosleep'] = train_df['Cryosleep'].astype('bool')
train_df['CryoSleep'] = train_df['Cryosleep']
train_df.drop('Cryosleep',axis=1,inplace=True)

In [17]:
train_df[["PassengerId",'Name',"HomePlanet", "CryoSleep","Age", "Expenses"]].sample(6)

Unnamed: 0,PassengerId,Name,HomePlanet,CryoSleep,Age,Expenses
4147,4430_01,Ireen Robins,Earth,False,18.0,1515.0
2610,2795_01,Genons Brantcable,Europa,False,34.0,5687.0
4034,4309_01,Coracy Whitledges,Earth,False,37.0,788.0
2613,2797_01,Violan Burchanez,Earth,False,22.0,1345.0
3305,3550_03,Celina Carvis,Earth,True,21.0,0.0
6072,6419_01,Estina Holcompson,Earth,False,31.0,873.0


*What we have just completed*

* We have now replaced the values of our original `CryoSleep` column, that had missing values, with the values of our newly created `Cryosleep` column which doesn't have any null values. Then we dropped our new column.

* Our new column also states accurately that *Mr. Maham Ofracculy* is not in `cryosleep`, and he still hasn't spent any money on amenities, i.e., `RoomService`,`FoodCourt`,`ShoppingMall`,`Spa` and `VRDeck`.

The new feature shouldn't have any null values now. Let's check just in case

In [18]:
train_df.CryoSleep.isnull().any()

False

Since the only important person in this dataset is *Mr. Maham Ofracculy*, we don't need the names column.


In [19]:
train_df.drop('Name',axis=1,inplace=True)

Now for the amenities, we can easily impute null values for `Cryosleep` == `True`, since we know they are going to be zero as the person is in `CryoSleep`.

In [20]:
train_df.loc[train_df.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0
train_df.loc[train_df.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()

RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64

Before dealing with the rest of the amenities' values, let's make some more new features to aid us.

In [21]:
train_df['Adults'] = train_df['Age'] >= 13

*We know that 13 year olds aren't adults, But in this case, we are dividing it in terms of the `Expenses` because they are able to spend money at this age.*

Now, let's make a column now that tells us if someone is 13+ and is spending money.

In [22]:
train_df['Adult_and_spending'] = (train_df['Expenses'] > 0) & (train_df['Age'] >=13)

Let's take a look at the rows that are `True` for our new Adult_and_spending feature:

In [23]:
train_df.loc[train_df.Adult_and_spending == True]

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Expenses,Adults,Adult_and_spending
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,736.0,True,True
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,10383.0,True,True
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,5176.0,True,True
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,1091.0,True,True
5,0005_01,Earth,False,F/0/P,PSO J318.5-22,44.0,False,0.0,483.0,0.0,291.0,0.0,True,774.0,True,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
8687,9275_03,Europa,False,A/97/P,TRAPPIST-1e,30.0,False,0.0,3208.0,0.0,2.0,330.0,True,3540.0,True,True
8688,9276_01,Europa,False,A/98/P,55 Cancri e,41.0,True,0.0,6819.0,0.0,1643.0,74.0,False,8536.0,True,True
8690,9279_01,Earth,False,G/1500/S,TRAPPIST-1e,26.0,False,0.0,0.0,1872.0,1.0,0.0,True,1873.0,True,True
8691,9280_01,Europa,False,E/608/S,55 Cancri e,32.0,False,0.0,1049.0,0.0,353.0,3235.0,False,4637.0,True,True


So there are 5040 people who are 13+ and are spending money.

*What will we do next:*

* Now we are going to impute the values for our amenities.

* We know if someone is not an adult and has `zero` expenses, they are either below `13`, which means they definitely haven't spent on any amenities, or they are in `CryoSleep`, which again means they definitely haven't spent on amenities.

* So, wherever we have `Adult_and_spending` == `False`, we'll impute them with `0`

In [24]:
#Imputing missing values for RoomService feature where Adult_and_spending == False
train_df.RoomService = train_df.RoomService.fillna(train_df.RoomService.mean())
train_df.loc[train_df.Adult_and_spending ==False, 'RoomService'] = 0

#Imputing missing values for FoodCourt feature where Adult_and_spending == False
train_df.FoodCourt = train_df.FoodCourt.fillna(train_df.FoodCourt.mean())
train_df.loc[train_df.Adult_and_spending ==False, 'FoodCourt'] = 0

#Imputing missing values for ShoppingMall feature where Adult_and_spending == False
train_df.ShoppingMall = train_df.ShoppingMall.fillna(train_df.ShoppingMall.mean())
train_df.loc[train_df.Adult_and_spending ==False, 'ShoppingMall'] = 0

#Imputing missing values for Spa feature where Adult_and_spending == False
train_df.Spa = train_df.Spa.fillna(train_df.Spa.mean())
train_df.loc[train_df.Adult_and_spending ==False, 'Spa'] = 0

#Imputing missing values for VRDeck feature where Adult_and_spending == False
train_df.VRDeck = train_df.VRDeck.fillna(train_df.VRDeck.mean())
train_df.loc[train_df.Adult_and_spending ==False, 'VRDeck'] = 0

Neat! Now we are done with imputing these columns as well.

Let's take a look:

In [25]:
train_df[['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']].isna().sum()

RoomService     0
FoodCourt       0
ShoppingMall    0
Spa             0
VRDeck          0
dtype: int64

Great!

For the remaining columns, we can't figure out what values to fill in this manner. So we are just going to fill them with the values that the majority of people have in the dataset, i.e., the mode.

In [26]:
train_df.HomePlanet.mode()

0    Earth
dtype: object

In [27]:
train_df.Destination.mode()

0    TRAPPIST-1e
dtype: object

In [28]:
train_df.VIP.mode()

0    False
dtype: object

So, these are the values we will be imputing with.

In [29]:
train_df.HomePlanet = train_df.HomePlanet.fillna('Earth')
train_df.Destination = train_df.Destination.fillna('TRAPPIST-1e')
train_df.VIP = train_df.VIP.fillna('False')
train_df.VIP = train_df.VIP.astype('bool')

Let's see how much we are completed imputing null values:

In [30]:
train_df.isnull().sum()

PassengerId             0
HomePlanet              0
CryoSleep               0
Cabin                 199
Destination             0
Age                     0
VIP                     0
RoomService             0
FoodCourt               0
ShoppingMall            0
Spa                     0
VRDeck                  0
Transported             0
Expenses                0
Adults                  0
Adult_and_spending      0
dtype: int64

The cabin is the only column that remains with null values!

Filling this is not easy due to my limited skill. I am just going to use `ffill` to fill these null values. What that does is basically use the previous value to impute the missing one.

So, for example, if we have a dataset like:

    [5, 8, 15, null, 19]

If we use fill on this, it'll become:

    [5, 8, 15, 15, 19]

In [31]:
train_df['Cabin'] = train_df.Cabin.fillna(method='ffill')

In [32]:
train_df.isnull().sum()

PassengerId           0
HomePlanet            0
CryoSleep             0
Cabin                 0
Destination           0
Age                   0
VIP                   0
RoomService           0
FoodCourt             0
ShoppingMall          0
Spa                   0
VRDeck                0
Transported           0
Expenses              0
Adults                0
Adult_and_spending    0
dtype: int64

And so, we are done with imputing. Time to move on to feature engineering.

## 4.2 Feature Engineering <a class="anchor" id="feature_engineering"></a>

These are the features that I am going to add to this dataset.

In [33]:
train_df['Group_nums'] = train_df.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
train_df['Grouped'] = ((train_df['Group_nums'].value_counts() > 1).reindex(train_df['Group_nums'])).tolist()
train_df['Deck'] = train_df.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
train_df['Side'] = train_df.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
train_df['Has_expenses'] = train_df['Expenses'] > 0
train_df['Is_Embryo'] = train_df['Age'] == 0

*These specifies:*

* If someone was alone or in a group.
* Which deck someone was in.
* Which side (Starboard or Port).
* If the passenger was 0 years old (i.e, an embryo).

Let's get rid of our temporary columns:

In [34]:
train_df.drop(['Adult_and_spending','Group_nums','Expenses'],axis=1,\
                   inplace=True)

Our Final Dataset

In [35]:
train_df.head()

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Adults,Grouped,Deck,Side,Has_expenses,Is_Embryo
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,B,P,False,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,True,False,F,S,True,False
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,True,A,S,True,False
3,0003_02,Europa,False,A/0/S,TRAPPIST-1e,33.0,False,0.0,1283.0,371.0,3329.0,193.0,False,True,True,A,S,True,False
4,0004_01,Earth,False,F/1/S,TRAPPIST-1e,16.0,False,303.0,70.0,151.0,565.0,2.0,True,True,False,F,S,True,False


In [36]:
train_df.to_csv('Cleaned and imputed data.csv',index=False)

Pre-proccesing of Test Data

In [37]:
test_df_copy = test_data.copy()

test_df_copy['Expenses'] = test_df_copy[['RoomService', 'FoodCourt',
                                           'ShoppingMall', 'Spa', 'VRDeck']].sum(axis=1)

test_df_copy.Age = test_df_copy.Age.fillna(test_df_copy.Age.median())

test_df_copy['Adult_spending_awake'] = (test_df_copy['Expenses'] > 0)\
                                     & (test_df_copy['Age'] >= 13)\
                                     & (test_df_copy['CryoSleep'] == False)

test_df_copy['Cryosleep'] = 0
test_df_copy.loc[test_df_copy['Expenses'] == 0, 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'True', 'Cryosleep'] = 1
test_df_copy.loc[test_df_copy.CryoSleep.astype('str') == 'False', 'Cryosleep'] = 0
test_df_copy['Cryosleep'] = test_df_copy['Cryosleep'].astype('bool')
test_df_copy['CryoSleep'] = test_df_copy['Cryosleep']
test_df_copy.drop('Cryosleep',axis=1,inplace=True)
test_df_copy.drop('Name',axis=1,inplace=True)

test_df_copy.loc[test_df_copy.CryoSleep == True,['RoomService', 'FoodCourt','ShoppingMall', 'Spa', 'VRDeck']] = 0

test_df_copy['Adults'] = test_df_copy['Age'] >= 13

test_df_copy['Adult_and_spending'] = (test_df_copy['Expenses'] > 0) & (test_df_copy['Age'] >=13)
test_df_copy.loc[test_df_copy.Adult_and_spending == True]

test_df_copy.RoomService = test_df_copy.RoomService.fillna(test_df_copy.RoomService.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'RoomService'] = 0

test_df_copy.FoodCourt = test_df_copy.FoodCourt.fillna(test_df_copy.FoodCourt.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'FoodCourt'] = 0

test_df_copy.ShoppingMall = test_df_copy.ShoppingMall.fillna(test_df_copy.ShoppingMall.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'ShoppingMall'] = 0

test_df_copy.Spa = test_df_copy.Spa.fillna(test_df_copy.Spa.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'Spa'] = 0

test_df_copy.VRDeck = test_df_copy.VRDeck.fillna(test_df_copy.VRDeck.mean())
test_df_copy.loc[test_df_copy.Adult_and_spending ==False, 'VRDeck'] = 0

test_df_copy.HomePlanet = test_df_copy.HomePlanet.fillna('Earth')
test_df_copy.Destination = test_df_copy.Destination.fillna('TRAPPIST-1e')
test_df_copy.VIP = test_df_copy.VIP.fillna('False')
test_df_copy.VIP = test_df_copy.VIP.astype('bool')

test_df_copy['Cabin'] = test_df_copy.Cabin.fillna(method='ffill')

test_df_copy['Group_nums'] = test_df_copy.PassengerId.apply(lambda x: x.split('_')).apply(lambda x: x[0])
test_df_copy['Grouped'] = ((test_df_copy['Group_nums'].value_counts() > 1).reindex(test_df_copy['Group_nums'])).tolist()
test_df_copy['Deck'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[0])
test_df_copy['Side'] = test_df_copy.Cabin.apply(lambda x: str(x).split('/')).apply(lambda x: x[2])
test_df_copy['Has_expenses'] = test_df_copy['Expenses'] > 0
test_df_copy['Is_Embryo'] = test_df_copy['Age'] == 0

test_df_copy.columns
test_df_copy.drop(['Expenses', 'Adult_spending_awake', 'Adult_and_spending','Adults'],axis=1, inplace=True)

test_df_copy.to_csv('Cleaned and imputed test data.csv',index=False)

# 5. Model Building <a class="anchor" id="model"></a>

In [38]:
df_train = pd.read_csv('Cleaned and imputed data.csv')
df_test = pd.read_csv('Cleaned and imputed test data.csv')

In [39]:
df_train.head(3)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Transported,Adults,Grouped,Deck,Side,Has_expenses,Is_Embryo
0,0001_01,Europa,False,B/0/P,TRAPPIST-1e,39.0,False,0.0,0.0,0.0,0.0,0.0,False,True,False,B,P,False,False
1,0002_01,Earth,False,F/0/S,TRAPPIST-1e,24.0,False,109.0,9.0,25.0,549.0,44.0,True,True,False,F,S,True,False
2,0003_01,Europa,False,A/0/S,TRAPPIST-1e,58.0,True,43.0,3576.0,0.0,6715.0,49.0,False,True,True,A,S,True,False


In [40]:
df_test.head(2)

Unnamed: 0,PassengerId,HomePlanet,CryoSleep,Cabin,Destination,Age,VIP,RoomService,FoodCourt,ShoppingMall,Spa,VRDeck,Group_nums,Grouped,Deck,Side,Has_expenses,Is_Embryo
0,0013_01,Earth,True,G/3/S,TRAPPIST-1e,27.0,False,0.0,0.0,0.0,0.0,0.0,13,False,G,S,False,False
1,0018_01,Earth,False,F/4/S,TRAPPIST-1e,19.0,False,0.0,9.0,0.0,2823.0,0.0,18,False,F,S,True,False


Now we are going to do some feature selection.

In [41]:
df_train.dtypes

PassengerId      object
HomePlanet       object
CryoSleep          bool
Cabin            object
Destination      object
Age             float64
VIP                bool
RoomService     float64
FoodCourt       float64
ShoppingMall    float64
Spa             float64
VRDeck          float64
Transported        bool
Adults             bool
Grouped            bool
Deck             object
Side             object
Has_expenses       bool
Is_Embryo          bool
dtype: object

In [42]:
features = ['HomePlanet', 'CryoSleep', 'Destination', 'Age', 'VIP',
            'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 
            'Grouped', 'Deck', 'Has_expenses', 'Side', 'Is_Embryo']

These are the features that I decided to use for model training and testing. I don't know if these are the best ones. So you can try different ones, and could even get a better result than mine!

Now we will assign the data in the training set to feature and target variables, and do a train-test-split split for evaluation

In [43]:
X = pd.get_dummies(df_train[features])
y = df_train['Transported']

In [44]:
X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1)

In [45]:
model = LogisticRegression(max_iter=10000)
model.fit(X_train,y_train)
model.score(X_test,y_test)

0.8008279668813247

Score of `0.800` Not bad!

Since we actually have to predict the test set that Kaggle has provided, we want to use all of the train data to train the model. The more data the model gets to learn from, the better the prediction.

In [46]:
model2 = LogisticRegression(max_iter=10000)
model2.fit(X,y)
model2.score(X,y)

0.7929368457379501

Let's predict our test set now and save it:

In [47]:
y_pred_log2 = model2.predict(pd.get_dummies(df_test[features]))

Now I'll use the only other classification model I knew at the time,` K-Neighbors Classifier`.

In [48]:
#knn = KNeighborsClassifier()
#param_grid = {'n_neighbors':np.arange(2,15)}
#knn_gscv = GridSearchCV(knn, param_grid, cv=5)
#knn_gscv.fit(X,y)
#knn_gscv.best_params_

In [49]:
knn2 = KNeighborsClassifier(n_neighbors=14)
knn2.fit(X,y)
knn2.score(X,y)

0.8129529506499482

In [50]:
#Saving the output
y_pred_knn = knn2.predict(pd.get_dummies(df_test[features]))

Now, I did see that the model that seemed to perform great on this data is Gradient Boosting Classifier. So I looked it up and just used it with default hyperparameters:

In [51]:
#Model
gbc = GradientBoostingClassifier(random_state = 1)
  
# Fit to training set
gbc.fit(X, y)
gbc.score(X,y)

0.8130679857356494

Seems slightly worse than our K-Neighbors Classifier. But still, we'll keep its predictions as well.

In [52]:
#save the prediction
y_pred_gbc = gbc.predict(pd.get_dummies((df_test[features])))

Since Gradient Boosting was performing well, and I had also stumbled upon Extreme Gradient Boosting, it only seems logical to try that out as well (maybe we'll get extremely good results):

In [53]:
from xgboost import XGBClassifier
xgb = XGBClassifier()
xgb.fit(X,y)
xgb.score(X,y)

0.8887610721269987

In [54]:
#save the prediction
y_pred_xgb = xgb.predict(pd.get_dummies((df_test[features])))

The last thing I want to do is tune the Gradient Boost further using GridSearchCV

In [55]:
# = GradientBoostingClassifier()
#parameters = {
#    "n_estimators":[5,50,100],
#    "max_depth":[1,3,5],
#   "learning_rate":[0.01,0.1,1]
#}

#gbcv = RandomizedSearchCV(gbc, parameters, n_iter=27, scoring='accuracy', n_jobs=-1, cv=5, random_state=1)
#gbcv.fit(X,y)
#gbcv.best_params_

In [56]:
gbcv = GradientBoostingClassifier(n_estimators=50,max_depth=5,learning_rate=0.1) #best params from gscv

gbcv.fit(X,y)
gbcv.score(X,y)

0.831013459105027

In [57]:
#save the prediction
pred_y_gbc2 = gbcv.predict(pd.get_dummies((df_test[features])))

And so, we are done!

Time for submission.

In [58]:
gbc_out = pd.DataFrame({'PassengerId':df_test.PassengerId, 'Transported':pred_y_gbc2})
gbc_out.to_csv('submission.csv',index=False)

# 6. Conclusion <a class="anchor" id="conclusion"></a>
Hope you enjoyed my notebook and got to learn something. I recognize that I could probably increase my accuracy further, but I am probably not going to pursue that for now.

Still, I'd appreciate any feedback on this notebook.

I wish you the best on your data science journey. Farewell

<p style="text-color:gray; text-align:center">Copyright © Nikhil Sharma</p>