# Overview
    Task: Implement a model for binary classification
    Dataset: Titanic
    Algorithm: XGBoost (gradient boosting)
    Metric: accuracy score (fraction of correctly classified samples)
    
    Life-cycle:
    
    1) Importing

    2) Preprocessing
        1) Reading from .csv | Train & Test datasets
        2) Splitting into dependent & independent variables
        3) Checking for anomalies and handling them:
                - Missing values
                - Categorical features
                - Dummy variable trap
        
    3) Model processing
        1) Data structure preparation
        2) Building & training
        3) Predicting
    
    4) Benchmarking
        1) Preparing benchmark
        2) Calculating accuracy score

    5) Printing results

# 1) Importing

In [1]:
import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, Imputer


# 2) Preprocessing

+ 2.1) Reading from .csv | Train & Test datasets

In [2]:
dataset_train = pd.read_csv('train.csv')
dataset_test = pd.read_csv('test.csv')

In [3]:
dataset_train.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [4]:
dataset_test.head(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q


+ 2.2) Splitting into dependent and independent variables

In [5]:
def extract_independent_for_train(_dataset):
    # Columns 4-7: Sex, Age, SibSp, Parch; column 2: Pclass (appended last)
    X = _dataset.iloc[:,4:8].values
    return np.concatenate((X, _dataset.iloc[:,2:3].values), axis = 1)

def extract_dependent_for_train(_dataset):
    # Column 1: Survived
    return _dataset.iloc[:,1].values

def extract_independent_for_test(_dataset):
    # The test file has no Survived column, so every index shifts by one
    X = _dataset.iloc[:,3:7].values
    return np.concatenate((X, _dataset.iloc[:,1:2].values), axis = 1)

X_train = extract_independent_for_train(dataset_train)
y_train = extract_dependent_for_train(dataset_train)

X_test = extract_independent_for_test(dataset_test)

# Sanity-check the split
print(f"Do the train features and labels have the same length? : {len(X_train)==len(y_train)}")
print(f"Does the test dataset contain any data? : {len(X_test)>0}")

Do the train features and labels have the same length? : True
Does the test dataset contain any data? : True
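
Positional slices such as `iloc[:,4:8]` depend on the column order of each file, which is why the train and test functions use different indices. A minimal sketch of the same selection by column name is shown here; the two-row frame and the `FEATURES` list are illustrative stand-ins, not part of the notebook.

```python
import pandas as pd

# Hypothetical two-row stand-in for train.csv, using the same column order.
frame = pd.DataFrame({
    "PassengerId": [1, 2],
    "Survived": [0, 1],
    "Pclass": [3, 1],
    "Name": ["passenger A", "passenger B"],
    "Sex": ["male", "female"],
    "Age": [22.0, 38.0],
    "SibSp": [1, 1],
    "Parch": [0, 0],
})

# Same feature order the positional slices produce: Sex, Age, SibSp, Parch, Pclass.
FEATURES = ["Sex", "Age", "SibSp", "Parch", "Pclass"]

X_by_name = frame[FEATURES].values
y_by_name = frame["Survived"].values

# The positional selection used in the cell above, for comparison.
X_by_position = pd.concat([frame.iloc[:, 4:8], frame.iloc[:, 2:3]], axis=1).values
```

Selecting by name would let one function serve both files, since the missing `Survived` column in the test file no longer shifts anything.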


+ 2.3) Checking for anomalies:

In [6]:
# - Missing values
print(f"Columns with missing data: {dataset_train.columns[dataset_train.isnull().any()]}")

Columns with missing data: Index(['Age', 'Cabin', 'Embarked'], dtype='object')


* Of these columns, only Age is among the selected features (Sex, Age, SibSp, Parch, Pclass), so only its missing values need to be handled.
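
The same conclusion can be reached programmatically. The sketch below counts missing values per column and keeps only the columns that are actually used as features; the small frame and the `selected` list are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature dataset with gaps in Age and Cabin.
frame = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
    "Pclass": [3, 1, 3],
})
selected = ["Age", "Pclass"]  # assumed subset of columns used as features

# Number of missing entries in every column.
missing_per_column = frame.isnull().sum()

# Selected features that contain at least one missing value.
needs_imputation = [c for c in selected if frame[c].isnull().any()]
```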

In [7]:
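# Replace missing values in the Age column (index 1 of X) with the column mean
imputer = Imputer(missing_values='NaN', strategy='mean', axis=0)
X_train[:,1:2] = imputer.fit_transform(X_train[:,1:2])
# Note: this refits on the test set, so test rows are filled with the test mean;
# imputer.transform(X_test[:,1:2]) would reuse the training mean instead
X_test[:,1:2] = imputer.fit_transform(X_test[:,1:2])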
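
To make the mechanics of mean imputation explicit, here is a minimal NumPy-only sketch in which the mean is learned from the training column and reused for the test column. The helper names `fit_mean` and `fill_missing` are hypothetical. In recent scikit-learn releases the `Imputer` class is no longer available and `sklearn.impute.SimpleImputer` takes its place, so treat this as an illustration of the idea rather than of either API.

```python
import numpy as np

def fit_mean(column):
    """Return the mean of a 1-D array, ignoring NaN entries."""
    return float(np.nanmean(column.astype(float)))

def fill_missing(column, fill_value):
    """Return a float copy of column with NaN replaced by fill_value."""
    filled = column.astype(float)  # astype returns a new array
    filled[np.isnan(filled)] = fill_value
    return filled

# Made-up Age values with gaps.
age_train = np.array([22.0, np.nan, 26.0])
age_test = np.array([np.nan, 47.0])

train_mean = fit_mean(age_train)                       # learned from training data only
age_train_filled = fill_missing(age_train, train_mean)
age_test_filled = fill_missing(age_test, train_mean)   # reused, not refitted
```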



In [8]:
# - Categorical features
def label_encode(X):
    # Encode Sex (column 0) as integers, modifying X in place
    label_encoder = LabelEncoder()
    X[:,0] = label_encoder.fit_transform(X[:,0])


def one_hot_encode(X):
    # One-hot encode Pclass (column 4, the last column of X)
    encoder = OneHotEncoder(categorical_features = [0])
    encoded = encoder.fit_transform(X[:,4:5])
    X = X[:,:-1] # Drop the last column, which holds the original Pclass values
    encoded = encoded.toarray() # Convert the sparse matrix to a 2-D NumPy array
    return np.concatenate((X, encoded), axis = 1)
    
label_encode(X_train)
X_train = one_hot_encode(X_train)

label_encode(X_test)
X_test = one_hot_encode(X_test)

In case you used a LabelEncoder before this OneHotEncoder to convert the categories to integers, then you can now use the OneHotEncoder directly.
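
The encoders above are fitted separately on the train and test arrays, which works only while both contain the same set of categories. A hedged pandas-only sketch of encoding both sets against categories taken from the training data follows; the frames and the `encode` helper are illustrative assumptions, not the notebook's method.

```python
import pandas as pd

# Made-up miniature train and test sets; the test set lacks Pclass 1 and 2.
train = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 2]})
test = pd.DataFrame({"Sex": ["female", "male"], "Pclass": [3, 3]})

def encode(frame, sex_categories, pclass_categories):
    """Integer-code Sex and one-hot encode Pclass against fixed category lists."""
    out = pd.DataFrame(index=frame.index)
    # Integer codes follow the position in sex_categories (like label encoding).
    out["Sex"] = pd.Categorical(frame["Sex"], categories=sex_categories).codes
    # One indicator column per listed class, even if a class is absent from frame.
    pclass = pd.Series(
        pd.Categorical(frame["Pclass"], categories=pclass_categories),
        index=frame.index,
    )
    dummies = pd.get_dummies(pclass, prefix="Pclass").astype(int)
    return pd.concat([out, dummies], axis=1)

# Categories are taken from the training data only.
sex_categories = sorted(train["Sex"].unique())
pclass_categories = sorted(train["Pclass"].unique())

train_encoded = encode(train, sex_categories, pclass_categories)
test_encoded = encode(test, sex_categories, pclass_categories)
```

Because the category lists are fixed up front, both encoded frames end up with identical columns in identical order.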


In [9]:
print(f"length of train dataset: {len(X_train)}")
print(f"\nlength of test dataset: {len(X_test)}")

length of train dataset: 891

length of test dataset: 418


In [10]:
# - Dummy variable trap: drop one of the one-hot columns
X_train = X_train[:,:-1]
X_test = X_test[:,:-1]

In [11]:
print(f"Do train and test have the same number of features after encoding? : {len(X_test[0])==len(X_train[0])}")

Do train and test have the same number of features after encoding? : True
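
Why one dummy column is dropped can be seen from a small rank check: a full set of one-hot columns always sums to 1, so together with an intercept column it is linearly dependent. The sketch below uses made-up class values. The redundancy mainly matters for linear models; tree-based models split on one feature at a time, so for them the drop is likely a matter of tidiness rather than necessity.

```python
import numpy as np

# Made-up Pclass values.
pclass = np.array([1, 2, 3, 3, 1])

# One indicator column per class (3 columns).
one_hot = (pclass[:, None] == np.array([1, 2, 3])).astype(float)

intercept = np.ones((len(pclass), 1))
full = np.hstack([intercept, one_hot])             # intercept + all 3 dummies
reduced = np.hstack([intercept, one_hot[:, :-1]])  # intercept + 2 dummies

rank_full = np.linalg.matrix_rank(full)        # lower than the column count
rank_reduced = np.linalg.matrix_rank(reduced)  # equal to the column count
```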


# 3) Model processing

+ 3.1) Data structure preparation

In [12]:
dtrain = xgb.DMatrix(X_train, label = y_train )
dtest = xgb.DMatrix(X_test)

+ 3.2) Building & training

In [13]:
# XGBoost trains on its own data type, DMatrix, which is why the data was converted in the previous step

param = {
        'max_depth':3, # Maximum depth of each tree
        'eta':0.5, # Learning rate: shrinks the contribution of each new tree
        'silent':0, # 0 - print training messages, 1 - silent mode
        'tree_method':'exact',
        'objective':'count:poisson', # Poisson regression on the 0/1 label; 'binary:logistic' is the usual objective for binary classification
#         'predictor' : 'gpu_predictor',
        }

num_round = 2
bst = xgb.train(param, dtrain, num_round)

[16:42:14] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
[16:42:14] src/tree/updater_prune.cc:74: tree pruning end, 1 roots, 14 extra nodes, 0 pruned nodes, max_depth=3
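
For comparison, a parameter set using the `binary:logistic` objective might look like the sketch below. This is an assumed alternative and not the configuration that produced the results in this notebook. With that objective the predictions are probabilities, so they are turned into labels with a threshold; `to_labels` is a hypothetical helper.

```python
import numpy as np

# Assumed alternative parameters, not the ones used for the results in this notebook.
param_binary = {
    "max_depth": 3,
    "eta": 0.5,
    "objective": "binary:logistic",  # predictions are probabilities in [0, 1]
}

def to_labels(probabilities, threshold=0.5):
    """Map predicted probabilities to 0/1 labels (hypothetical helper)."""
    return (np.asarray(probabilities) >= threshold).astype(int)

# With xgboost installed, usage would look like:
#   bst = xgb.train(param_binary, dtrain, num_round)
#   labels = to_labels(bst.predict(dtest))
example_labels = to_labels([0.12, 0.5, 0.93])
```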


+ 3.3) Predicting

In [14]:
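predicted = bst.predict(dtest) # Predicted values from the Poisson objective (non-negative floats)
predicted = predicted.round() # Round to the nearest integer to obtain class labels
predicted = predicted.astype(int)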

# 4) Benchmarking

+ 4.1) Preparing benchmark

In [15]:
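# gender_submission.csv serves as the reference labels for the test set.
# Note: on Kaggle this file is a sample submission that assumes all and only female
# passengers survive, so the score below measures agreement with that baseline
# rather than with the true outcomes.
y_test = pd.read_csv('gender_submission.csv')
y = y_test.iloc[:,1:2].values # Survived column, shape (n, 1)
print(f"Length of our test set: {len(y)}")
y_test.head()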

Length of our test set: 418


Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


+ 4.2) Calculating accuracy score

In [16]:
print(f"{accuracy_score(y, predicted)}")

0.9784688995215312
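
The accuracy score is simply the fraction of positions where the reference and predicted labels agree. A minimal NumPy sketch with made-up labels is given below; the `accuracy` helper is hypothetical and is not scikit-learn's implementation. It flattens both inputs first, because comparing an `(n, 1)` array with an `(n,)` array directly would broadcast to an `(n, n)` matrix.

```python
import numpy as np

def accuracy(y_true, y_pred):
    """Fraction of positions where the two label arrays agree."""
    y_true = np.asarray(y_true).ravel()  # (n, 1) -> (n,)
    y_pred = np.asarray(y_pred).ravel()
    if y_true.shape != y_pred.shape:
        raise ValueError("label arrays must have the same length")
    return float(np.mean(y_true == y_pred))

y_column = np.array([[0], [1], [0], [0]])  # shape (4, 1), like iloc[:,1:2].values
y_hat = np.array([0, 1, 1, 0])             # shape (4,), like the rounded predictions

score = accuracy(y_column, y_hat)  # 3 of 4 labels agree
```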


# 5) Printing results

In [17]:
del y_test['Survived'] # Delete the reference values
predicted_dataframe = pd.DataFrame({'Survived' : predicted}) # Create a new dataframe with the predicted values
y_test = y_test.join(predicted_dataframe) # Combine the dataframes (PassengerId + predicted class)
y_test.to_csv('result.csv', index=False)
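
The same two-column result can also be built in one step from the passenger ids and the predictions, without deleting and re-joining a column. The sketch below uses made-up predictions and writes to an in-memory buffer that stands in for `result.csv`.

```python
import io

import numpy as np
import pandas as pd

passenger_ids = np.array([892, 893, 894])  # illustrative ids
predicted_labels = np.array([0, 1, 0])     # hypothetical predictions

# One row per passenger: id plus predicted class.
submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": predicted_labels,
})

buffer = io.StringIO()  # stands in for the file 'result.csv'
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```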