## XGBoost with Titanic Dataset

The Titanic dataset contains records for 891 of the real Titanic passengers. Each row represents one person, and the columns describe attributes of that person, such as whether they survived, their age, passenger class, sex, and the fare they paid.


### About the algorithm

In [None]:
import os
path = '/rapids/notebooks/ml_tutorial/ML_intro/data/'
if not os.path.exists(str(path + 'titanic')):
    print("error, data is missing!")

## 1. Loading the Data

In [None]:
import cudf
titanic_gdf = cudf.read_csv(path + 'titanic/train.csv')  # 891 rows x 12 columns

## 2. Cleaning the Data 

In [None]:
#encoding
survived = titanic_gdf['Survived'] == 1
die = titanic_gdf['Survived'] == 0


In [None]:
import numpy as np
# Encode Sex as a number: male -> 1, female -> 0.
# Compare on the GPU and cast, instead of indexing a host NumPy array with a cuDF mask.
titanic_gdf['Sex'] = (titanic_gdf['Sex'] == 'male').astype('float64')

In [None]:
# Fill the NA values with the most common port, 'S'
embarked = titanic_gdf['Embarked'].fillna('S')
# Encode Embarked as a number: S -> 1, C -> 2, Q -> 3 (computed on the GPU)
titanic_gdf['Embarked'] = (
    (embarked == 'S').astype('float64') * 1
    + (embarked == 'C').astype('float64') * 2
    + (embarked == 'Q').astype('float64') * 3
)
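
The encoding rule in the cell above can be checked on a few values without a GPU. The following is a minimal plain-Python sketch, not the notebook's cuDF code; the helper name `encode_embarked` and the sample values are made up for illustration.

```python
# Plain-Python sketch of the Embarked encoding; names and sample data are hypothetical.
EMBARKED_CODES = {"S": 1, "C": 2, "Q": 3}

def encode_embarked(port):
    """Return the integer code for a port, treating a missing value as 'S'."""
    if port is None:
        port = "S"  # same fill value the notebook uses for missing entries
    return EMBARKED_CODES[port]

sample = ["S", "C", None, "Q"]
encoded = [encode_embarked(p) for p in sample]
print(encoded)  # [1, 2, 1, 3]
```

Note that the integer codes impose an order (S < C < Q) that the ports do not really have; tree-based models usually tolerate this, but it is a design choice worth keeping in mind.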

In [None]:
# Dealing with the missing values in the Age feature: fill them with the mean age.
age = titanic_gdf['Age'].fillna(titanic_gdf['Age'].mean())

# Encode Age into three bins: under 20 -> 1, 20 to 59 -> 2, 60 and over -> 3.
# Each comparison adds 1 once its threshold is reached, so the result stays on the GPU.
titanic_gdf['Age'] = 1.0 + (age >= 20).astype('float64') + (age >= 60).astype('float64')
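
To make the bin edges explicit, the age rule can be written as a small plain-Python function. This is only a sketch for checking boundary cases; `age_bucket` is a hypothetical helper, not part of the notebook.

```python
def age_bucket(age):
    """Map an age in years to the notebook's three codes (hypothetical helper)."""
    if age < 20:
        return 1  # under 20
    if age < 60:
        return 2  # 20 up to, but not including, 60
    return 3      # 60 and over

# Boundary values land in the higher bin because the comparisons use >=.
print([age_bucket(a) for a in (19.9, 20, 59.9, 60)])  # [1, 2, 2, 3]
```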

In [None]:
# Dropping the columns we are not using for the analysis
titanic_gdf = titanic_gdf.drop(columns= ['PassengerId','Name','Ticket','Cabin'])

In [None]:
titanic_gdf.head()

## 3. Splitting the Data into Training and Testing

In [None]:
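import cudf
# Newer cuML releases expose this function as cuml.model_selection.train_test_split.
from cuml.preprocessing.model_selection import train_test_split

# Separate the label from the features; columns= makes the drop act on a column, not a row label.
target = titanic_gdf['Survived']
titanic_gdf = titanic_gdf.drop(columns=['Survived'])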

In [None]:
X_train, X_test, y_train, y_test = train_test_split(titanic_gdf, target,
                                                    train_size=0.8, shuffle=False)
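
With `shuffle=False`, the split keeps the rows in file order, so the test set is the tail of the data. The sketch below illustrates that idea in plain Python; `ordered_split` is a hypothetical helper, and the way it rounds the cut point is an assumption that may differ from cuML's behaviour at the boundary.

```python
def ordered_split(rows, train_size):
    """Split rows without shuffling: the first share trains, the rest tests."""
    cut = int(len(rows) * train_size)  # assumed rounding; cuML may differ by one row
    return rows[:cut], rows[cut:]

rows = list(range(10))  # stand-in for 10 passenger rows
train, test = ordered_split(rows, 0.8)
print(train)  # [0, 1, 2, 3, 4, 5, 6, 7]
print(test)   # [8, 9]
```

An unshuffled split is reproducible, but it only gives a fair estimate if the row order carries no pattern related to survival.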

In [None]:
import xgboost as xgb
print('XGBoost Version:', xgb.__version__)

In [None]:
X_train = xgb.DMatrix(X_train, label= y_train)
X_test = xgb.DMatrix(X_test, label = y_test)

## 4. Modelling

### 4.1. Define Parameters

In [None]:
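# 'n_estimators' belongs to the scikit-learn wrapper and is ignored by xgb.train;
# the number of boosting rounds is set with xgb.train's num_boost_round argument.
params = {
    'max_depth': 3,
    'learning_rate': 0.02,
    # XGBoost 2.0 and later prefer 'tree_method': 'hist' together with 'device': 'cuda'.
    'tree_method': 'gpu_hist',
    'objective': 'binary:logistic',
    'gamma': 0.0,
    'subsample': 0.8
}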
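
As a rough intuition for how `learning_rate` and `'objective': 'binary:logistic'` interact, the sketch below sums hypothetical per-tree outputs, shrinks them by the learning rate, and passes the result through a sigmoid. It is a conceptual illustration under simplifying assumptions, not XGBoost's actual implementation; `predict_proba`, `tree_outputs`, and `base_margin` are made-up names.

```python
import math

def predict_proba(tree_outputs, learning_rate, base_margin=0.0):
    """Conceptual boosted prediction: shrink each tree's raw output, then apply a sigmoid."""
    margin = base_margin + learning_rate * sum(tree_outputs)
    return 1.0 / (1.0 + math.exp(-margin))

# With no trees the prediction is the neutral 0.5.
print(predict_proba([], learning_rate=0.02))
# Fifty hypothetical trees that each vote +1.0 move the margin to about 1.0.
print(round(predict_proba([1.0] * 50, learning_rate=0.02), 3))
```

A small learning rate such as 0.02 means each tree moves the prediction only slightly, which is why it is usually paired with a large number of boosting rounds.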


### 4.2. Fit the model

In [None]:
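# xgb.train takes the number of boosting rounds as num_boost_round (default 10), not from params.
model = xgb.train(params, X_train, num_boost_round=750)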

### 4.3. Prediction

In [None]:
# With 'binary:logistic', predict returns the predicted probability of survival for each row.
cuml_pred = model.predict(X_test)

### 4.4. Scoring

In [None]:
from sklearn.metrics import accuracy_score
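
Because the `binary:logistic` objective yields probabilities, they need to be turned into 0/1 labels before an accuracy metric can compare them with the true labels. The following plain-Python sketch shows the idea with made-up numbers; `to_labels` and `accuracy` are hypothetical helpers, and the 0.5 threshold is an assumption rather than something the notebook specifies.

```python
def to_labels(probabilities, threshold=0.5):
    """Convert predicted probabilities to 0/1 class labels (threshold is an assumption)."""
    return [1 if p > threshold else 0 for p in probabilities]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

probs = [0.91, 0.12, 0.55, 0.40]  # made-up model outputs
truth = [1, 0, 0, 0]              # made-up ground truth
labels = to_labels(probs)
print(labels)                    # [1, 0, 1, 0]
print(accuracy(truth, labels))   # 0.75
```

In the notebook's setting, the same thresholding would be applied to `cuml_pred` before passing it, together with the host copy of the test labels, to `accuracy_score`.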