In [1]:
import os
import pandas as pd
import sklearn as sk
import numpy as np
from sklearn import preprocessing
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

In [2]:
train_data=pd.read_csv("./Data/train.csv")
test_data=pd.read_csv("./Data/test.csv")

In [3]:
def CleanData(data):
    data["Age"]=data["Age"].fillna(data["Age"].mean())
    data["Embarked"]=data["Embarked"].fillna("X")
    data["SibSp"]=data["SibSp"].fillna(data["SibSp"].median())
    data["Parch"]=data["Parch"].fillna(data["Parch"].median())
    data["Fare"]=data["Fare"].fillna(data["Fare"].mean())
    data=data.drop(["PassengerId","Name","Ticket","Cabin"],axis=1)
    return data

This function takes a data frame and cleans it: <br/>
1) Fill any null value in Age with the mean age. <br/>
2) Fill any null in the Embarked column with "X". <br/>
3) Fill null rows in Parch and SibSp with the median rather than the mean, since a fractional number of relatives makes no sense. <br/>
4) Fill any missing (null) Fare with the mean. <br/>
5) Drop unneeded features: <br/>
     a) Name and PassengerId do not matter: they differ from one passenger to the next, so no pattern can be learned from them. <br/>
     b) Cabin has a lot of null values, so keeping it would lead to inaccurate results. <br/>
     c) Ticket values differ for almost all passengers, so no pattern can be recognized. <br/>
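
The imputation steps above can be tried on a small frame first. This is a minimal sketch: the `toy` frame and its values are made up for illustration and are not taken from the Titanic data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame; the values are invented for illustration only.
toy = pd.DataFrame({
    "Age": [20.0, np.nan, 40.0, 36.0],
    "SibSp": [0.0, 1.0, np.nan, 1.0],
    "Embarked": ["S", None, "C", "S"],
})

# Mean imputation: a fractional result would be acceptable for Age.
toy["Age"] = toy["Age"].fillna(toy["Age"].mean())
# Median imputation: here the median of the observed counts is a whole number.
toy["SibSp"] = toy["SibSp"].fillna(toy["SibSp"].median())
# Missing ports get their own placeholder category.
toy["Embarked"] = toy["Embarked"].fillna("X")

print(toy)
```

After this runs, the missing Age becomes 32.0 (the mean of 20, 40 and 36), the missing SibSp becomes 1.0, and the missing Embarked becomes "X".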

In [4]:
PassengerId=test_data["PassengerId"]
train_data=CleanData(train_data)
train_data_X=train_data.drop(["Survived"],axis=1)
train_data_Y=train_data["Survived"]
test_data_X=CleanData(test_data)


Line 1: Extract passenger ID column of test data to be used in submission <br/>
Line 2: Clean the Training data <br/>
Line 3: Get the training data without the label <br/>
Line 4: Get the label of training data <br/>
Line 5: Clean the test data <br/>

In [5]:
encoder=preprocessing.LabelEncoder()
Coded_Features = ['Sex','Embarked']
for col in Coded_Features:
    train_data_X[col] = encoder.fit_transform(train_data_X[col])
    test_data_X[col] = encoder.transform(test_data_X[col])

Loop over the text-valued features and encode each one as integers: fit_transform learns the mapping on the training data, and transform applies that same mapping to the test data.
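
To see why the training data is fitted and the test data is only transformed, here is a small sketch on made-up lists (`train_sex` and `test_sex` are hypothetical, not the real columns). LabelEncoder assigns codes in sorted order of the labels it saw during fitting and refuses labels it has not seen.

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

# Hypothetical values standing in for the 'Sex' column.
train_sex = ["male", "female", "female", "male"]
test_sex = ["female", "male"]

# fit_transform learns the label-to-code mapping from the training values.
train_codes = encoder.fit_transform(train_sex)
# transform reuses that mapping, so codes mean the same thing in both sets.
test_codes = encoder.transform(test_sex)

print(encoder.classes_.tolist())  # ['female', 'male']
print(train_codes.tolist())       # [1, 0, 0, 1]
print(test_codes.tolist())        # [0, 1]

# A label that was not present during fitting is rejected.
try:
    encoder.transform(["unknown"])
    unseen_rejected = False
except ValueError:
    unseen_rejected = True
print(unseen_rejected)            # True
```

This is also why filling missing Embarked values with "X" only works for the test set if every value in the test column was also present in the training column.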

In [6]:
train_data_X, validate_data_X, train_data_Y, validate_data_Y = train_test_split(train_data_X, train_data_Y, test_size=0.25, random_state=42)

Split the training data: 75% is kept for training and 25% is held out for validation. Fixing random_state makes the split reproducible.
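
A quick way to check the proportions is to split a small synthetic array. This sketch uses 100 made-up rows (the names `X` and `y` are placeholders) rather than the Titanic frame.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic placeholder data: 100 rows, 2 features, alternating labels.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_val.shape)  # (75, 2) (25, 2)

# The same random_state gives the same split on a second call.
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
print(np.array_equal(X_train, X_train_again))  # True
```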

In [7]:
GBM = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=2, random_state=42).fit(train_data_X, train_data_Y)

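The gradient boosting classifier is used because it combines many weak models, and the combination is built up sequentially during training (each new tree corrects the errors of the trees before it) rather than being merged only at the end.
It is a strong model, but it must be used carefully to prevent overfitting.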
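
The sequential build-up can be inspected with `staged_predict`, which yields the ensemble's predictions after each added tree. The sketch below is an illustration on synthetic data from `make_classification`, not on the Titanic set, so the numbers it prints say nothing about the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: any binary classification set would do).
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

model = GradientBoostingClassifier(
    n_estimators=100, learning_rate=0.1, max_depth=2, random_state=42
)
model.fit(X_train, y_train)

# One validation accuracy per boosting stage (after 1 tree, 2 trees, ...).
staged_acc = [accuracy_score(y_val, pred) for pred in model.staged_predict(X_val)]

for n_trees in (1, 10, 50, 100):
    print(n_trees, round(staged_acc[n_trees - 1], 3))
```

If the validation accuracy peaks and then falls as trees are added, that is a sign the later trees are starting to overfit.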

## Hyperparameter tuning:
The learning rate: I tried several learning rates: 1, 0.01, 0.05 and 0.1. <br/>
   A learning rate of 0.1 was the best. A rate of 1 takes steps that are too large and oscillates, so it does not reach a good result. <br/>
   0.01 and 0.05 moved slowly towards the optimal solution and stalled before reaching it. <br/>
n_estimators: more trees give the model more capacity, but too many can start to overfit. 1000 scored lower than 100 on the validation data, so 100 was chosen by trial and error. <br/>
The fit function is given the training features (train_data_X) and the training labels (train_data_Y) to build the model.
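
The trial-and-error search described above can be written as a small loop. This is a sketch on synthetic data (an assumption made so the block runs on its own); with the real notebook variables, `X_train`, `y_train`, `X_val` and `y_val` would be the split produced earlier, and the best value may differ.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the example is self-contained.
X, y = make_classification(n_samples=500, n_features=8, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Validation accuracy for each candidate learning rate.
results = {}
for lr in (1.0, 0.1, 0.05, 0.01):
    model = GradientBoostingClassifier(
        n_estimators=100, learning_rate=lr, max_depth=2, random_state=42
    )
    model.fit(X_train, y_train)
    results[lr] = model.score(X_val, y_val)

for lr, acc in results.items():
    print(f"learning_rate={lr}: validation accuracy={acc:.3f}")

# Pick the rate with the highest validation accuracy.
best_lr = max(results, key=results.get)
print("best:", best_lr)
```

The same loop can be nested over n_estimators values such as 100 and 1000 to compare them in one pass.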

In [8]:
score=GBM.score(validate_data_X,validate_data_Y)
print(score)

0.8251121076233184


Check the score on the validation data: the model predicts from validate_data_X, the predictions are compared with the real labels in validate_data_Y, and the score is the fraction predicted correctly (about 82.5% here).
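
For a classifier, `score` reports mean accuracy, which is the share of predictions that match the true labels. A minimal sketch with made-up label arrays (`y_true` and `y_pred` are hypothetical) shows the calculation:

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical labels and predictions, 8 passengers.
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fraction of positions where prediction and label agree: 6 of 8.
manual = (y_true == y_pred).mean()
print(manual)                          # 0.75
print(accuracy_score(y_true, y_pred))  # 0.75
```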

In [9]:
test_data_Y=GBM.predict(test_data_X)

Use the trained model to predict the Survived label for each row of the cleaned test data (test_data_X).

In [10]:
submit=pd.DataFrame({'PassengerId': PassengerId, 'Survived': test_data_Y})
submit.to_csv('./Data/submission_prediction.csv',index=False)

Write the predictions to a CSV file, paired with the PassengerId column that was extracted from test_data earlier.
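
Before submitting, it can help to read the file back and confirm its shape and column names. The sketch below uses made-up IDs and predictions and writes to a temporary directory, so it does not touch `./Data`.

```python
import os
import tempfile

import pandas as pd

# Hypothetical IDs and predictions standing in for the real ones.
passenger_id = pd.Series([892, 893, 894], name="PassengerId")
predictions = [0, 1, 0]
submit = pd.DataFrame({"PassengerId": passenger_id, "Survived": predictions})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "submission_prediction.csv")
    # index=False keeps the row index out of the file.
    submit.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(reloaded.columns.tolist())  # ['PassengerId', 'Survived']
print(reloaded.shape)             # (3, 2)
```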