# Applying XGBoost to Spaceship Titanic
I recently studied XGBoost through a podium solution to another challenge I was interested in. I found it very cool and wanted to try it on this classification problem as a test to play around with.

In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/spaceship-titanic/sample_submission.csv
/kaggle/input/spaceship-titanic/train.csv
/kaggle/input/spaceship-titanic/test.csv


In [2]:
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

Let's start by importing data and saving useful variables:

In [3]:
train_data_raw = pd.read_csv("../input/spaceship-titanic/train.csv")
test_data_raw = pd.read_csv("../input/spaceship-titanic/test.csv")

column_names = train_data_raw.columns
raw_features = column_names.drop(['PassengerId', 'Transported'])

train_data = train_data_raw[raw_features]
test_data = test_data_raw[raw_features]

response = train_data_raw['Transported']

print(raw_features)

Index(['HomePlanet', 'CryoSleep', 'Cabin', 'Destination', 'Age', 'VIP',
       'RoomService', 'FoodCourt', 'ShoppingMall', 'Spa', 'VRDeck', 'Name'],
      dtype='object')


For simplicity, I will drop `Cabin` and `Name`:

In [4]:
train_data = train_data.drop(['Cabin', 'Name'], axis=1)
test_data = test_data.drop(['Cabin', 'Name'], axis=1)

I also treat `Destination` and `HomePlanet` as categoricals and one-hot encode them.

In [5]:
train_data = pd.get_dummies(train_data, columns=['Destination', 'HomePlanet'])
test_data = pd.get_dummies(test_data, columns=['Destination', 'HomePlanet'])
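
One caveat with encoding the two frames separately: if a category shows up in only one of them, the dummy columns no longer match. It did not cause trouble in this notebook, but a minimal sketch of a safeguard, on made-up toy frames, could reindex the test frame to the training columns:

```python
import pandas as pd

# Toy frames (made up): the test set lacks a category seen in training.
train = pd.DataFrame({"HomePlanet": ["Earth", "Mars", "Europa"],
                      "Age": [30.0, 41.0, 25.0]})
test = pd.DataFrame({"HomePlanet": ["Earth", "Mars"],
                     "Age": [19.0, 52.0]})

train_enc = pd.get_dummies(train, columns=["HomePlanet"])
test_enc = pd.get_dummies(test, columns=["HomePlanet"])

# Give the test frame exactly the training columns, in the same order.
# Dummy columns missing from the test set are created and filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

Columns that exist only in the test frame are dropped by the reindex, which is usually what you want since the model never saw them.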

In my previous attempt I binarized the luxury spending features based on whether the passenger had paid anything for them or not. This time I prefer to leave these features as they are. Next, I fill in the NaN values (the mean for `Age`, 0 for every other column), and then convert the booleans (features `CryoSleep` and `VIP`) to 1 and 0:

In [6]:
for column in train_data.columns:
    if column == 'Age':
        train_data[column] = train_data[column].fillna(np.nanmean(train_data['Age']))
    else: 
        train_data[column] = train_data[column].fillna(0)

for column in test_data.columns:
    if column == 'Age':
        test_data[column] = test_data[column].fillna(np.nanmean(test_data['Age']))
    else: 
        test_data[column] = test_data[column].fillna(0)
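
The cell above fills the test set with its own `Age` mean. A common variant, shown here only as a sketch on made-up toy frames, computes the fill values once on the training data and reuses them for the test data, so both frames are preprocessed identically:

```python
import numpy as np
import pandas as pd

# Toy frames (made up) with missing values in both columns.
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0], "Spa": [0.0, 5.0, np.nan]})
test = pd.DataFrame({"Age": [np.nan, 60.0], "Spa": [np.nan, 1.0]})

# Series.mean() skips NaN by default, like np.nanmean.
age_mean = train["Age"].mean()

# One fill value per column: the training mean for Age, 0 elsewhere.
fill_values = {col: 0 for col in train.columns}
fill_values["Age"] = age_mean

train_filled = train.fillna(fill_values)
test_filled = test.fillna(fill_values)

print(test_filled)
```

Passing a dict to `fillna` replaces the per-column loop and makes it explicit which value goes where.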

In [7]:
# Convert the boolean flags to 0/1 integers
for column in ['CryoSleep', 'VIP']:
    train_data[column] = train_data[column].astype(int)
    test_data[column] = test_data[column].astype(int)

## Model
It's now time to train the XGB model:

In [8]:
features = train_data.columns

We define a helper that runs stratified 5-fold cross-validation, which lets us compare different parameters for the model (found in another notebook, ask for details):

In [9]:
def five_fold_cv(model, X_train, Y_train, verbose = True):
    skf = StratifiedKFold(n_splits = 5)
    fold = 1
    scores = []

    for train_index, test_index in skf.split(X_train, Y_train):
        X_train_fold, X_test_fold = X_train.iloc[train_index], X_train.iloc[test_index]
        Y_train_fold, Y_test_fold = Y_train.iloc[train_index], Y_train.iloc[test_index]

        model.fit(X_train_fold, Y_train_fold)
        
        preds = model.predict(X_test_fold)
        # preds = [x[1] for x in preds]

        score = accuracy_score(Y_test_fold, preds)
        scores.append(score)
        if verbose:
            print('Fold', fold, '    ', score)
        fold += 1

    avg = np.mean(scores)
    if verbose:
        print()
        print('Average', avg)
    return avg
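
For comparison, scikit-learn's `cross_val_score` performs the same kind of loop. The sketch below uses synthetic data and `LogisticRegression` as a stand-in model (both are assumptions for illustration, not part of this notebook); any estimator with the scikit-learn interface should work in its place:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic binary classification data, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Stand-in model; not the XGBClassifier used in this notebook.
model = LogisticRegression(max_iter=1000)

# Same splitter as in five_fold_cv: 5 folds that preserve the class ratio.
skf = StratifiedKFold(n_splits=5)

# The model is cloned and refit on each fold; one accuracy per fold is returned.
scores = cross_val_score(model, X, y, cv=skf, scoring="accuracy")

print("Per-fold accuracy:", scores)
print("Average:", np.mean(scores))
```

The hand-written version is still handy when you want per-fold printing or access to the fold predictions.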

After some tuning of the parameters, we choose `learning_rate = 0.2`, `max_depth = 5` and `reg_lambda = 0.8`. None of these changes the final score much.

In [10]:
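# eval_metric is only evaluated when an eval_set is passed to fit(), so it is unused here;
# 'error' is XGBoost's built-in binary classification error rate (1 - accuracy).
model = XGBClassifier(eval_metric='error', objective='binary:logistic',
                      learning_rate=0.2, max_depth=5, subsample=1, reg_lambda=0.8)
score = five_fold_cv(model, train_data, response, verbose=True)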

Fold 1      0.7918343875790684
Fold 2      0.7855089131684876
Fold 3      0.7906843013225991
Fold 4      0.8032220943613348
Fold 5      0.8037974683544303

Average 0.795009432957184
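
The tuning itself is not shown in this notebook. As a rough sketch of what such a search could look like, the loop below scores a small grid with stratified cross-validation. The data is synthetic, `GradientBoostingClassifier` from scikit-learn stands in for `XGBClassifier`, and the grid values are only illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import ParameterGrid, StratifiedKFold, cross_val_score

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Illustrative grid; these are not the values actually searched in this notebook.
grid = ParameterGrid({"learning_rate": [0.1, 0.2], "max_depth": [3, 5]})
cv = StratifiedKFold(n_splits=3)

results = []
for params in grid:
    # Stand-in gradient boosting model, kept small so the loop runs quickly.
    model = GradientBoostingClassifier(n_estimators=20, random_state=0, **params)
    mean_acc = cross_val_score(model, X, y, cv=cv, scoring="accuracy").mean()
    results.append((mean_acc, params))

# Keep the combination with the highest mean accuracy.
best_acc, best_params = max(results, key=lambda r: r[0])
print(best_params, best_acc)
```

When the scores across the grid are close, as they were here, the choice of parameters matters little and the simplest setting is a reasonable pick.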


## Predictions
Time to try some predictions.

In [11]:
model.fit(train_data, response)
# predict() already returns class labels, so no 0.5 threshold is needed
predictions = model.predict(test_data).astype(bool)
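
A threshold such as 0.5 is meaningful on predicted probabilities rather than on the labels returned by `predict`. The sketch below uses made-up numbers shaped like the output of `predict_proba` (one row per passenger, one column per class) to show how a custom cutoff would be applied:

```python
import numpy as np

# Hypothetical probabilities, shaped like the output of predict_proba:
# column 0 is P(class 0), column 1 is P(class 1).
proba = np.array([[0.90, 0.10],
                  [0.35, 0.65],
                  [0.50, 0.50],
                  [0.20, 0.80]])

# Cutoff on the positive-class probability; 0.5 matches the usual default.
threshold = 0.5
transported = proba[:, 1] > threshold

print(transported)
```

Raising or lowering `threshold` trades false positives against false negatives, which can be worth checking with the cross-validation helper before submitting.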

In [12]:
output = pd.DataFrame({'PassengerId': test_data_raw.PassengerId, 'Transported': predictions})
output.to_csv('spacetitanic_xgb.csv', index=False)