# <center>  Titanic, ML from Disaster (II) </center>

**Brull Borràs, Pere Miquel, 28/01/2018.**

- **4 Modelling**
    - Hyperparameter tuning

In [65]:
# Modules for Data Analysis
import numpy as np
import pandas as pd
import regex as re
import lightgbm as lgb

# And for visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings("ignore")

from xgboost import plot_importance
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import model_selection
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV

from collections import Counter

plt.style.use('fivethirtyeight')
sns.set_palette("GnBu_d")

# Load processed datasets from previous notebook
train = pd.read_csv("../input/processed_train.csv", sep = ",")
test = pd.read_csv("../input/processed_test.csv", sep = ",")
test_id = test['PassengerId']

## Modelling

OK, what has happened so far?

- Loaded and visualized the data to get an idea of the relationship between the explanatory variables and the response variable (the target).
- Created new features to exploit hidden information and better explain the target.
- Encoded categorical variables.
- Dropped useless information.

We are now left with a numerical matrix of data. The training dataset will be split into a train set, used to fit the model, and a validation set, used to assess performance. Then the best hyperparameters for the model will be searched for with grid search (`GridSearchCV`), which evaluates every combination in a given parameter grid by cross-validation and reports the one with the best score.


In [76]:
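# Target and features for the training data
y_train = train['Survived']
X_train = train.drop(['Survived'], axis=1)
# The test set has no target: y_test only holds the passenger IDs (same as test_id)
y_test = test['PassengerId']
X_test = test.drop(['PassengerId'], axis=1)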

In [77]:
# Create training and validation sets
x, x_val, y, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42, stratify=y_train)
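
The `stratify=y_train` argument asks `train_test_split` to keep the class ratio of the target roughly equal in both parts of the split. A minimal sketch on made-up data (the arrays `X_toy` and `y_toy` are hypothetical and not part of the Titanic dataset) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 38 positives and 62 negatives
rng = np.random.RandomState(42)
X_toy = rng.normal(size=(100, 3))
y_toy = np.array([1] * 38 + [0] * 62)

# Same call pattern as in the notebook: 80/20 split, stratified on the target
x_tr, x_va, y_tr, y_va = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Share of positives in each part; with stratification they stay close
train_rate = y_tr.mean()
val_rate = y_va.mean()
print(f"positives in train: {train_rate:.3f}, in validation: {val_rate:.3f}")
```

Without `stratify`, a small validation set could end up with a noticeably different survival rate than the train set, which makes the validation accuracy harder to interpret.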

In [78]:
# Create LightGBM Dataset containers for the train and validation sets
lgb_train = lgb.Dataset(x, y)
lgb_eval = lgb.Dataset(x_val, y_val, reference=lgb_train)

In [95]:
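# Baseline configuration as a dict. Note: in LightGBM, colsample_bytree is an alias of
# feature_fraction and subsample is an alias of bagging_fraction, so each pair is set twice here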
params = {'boosting_type': 'gbdt', 
          'colsample_bytree': 0.64, 
          'learning_rate': 0.1, 
          'n_estimators': 64, 
          'num_leaves': 8, 
          'objective': 'binary', 
          'random_state': 501, 
          'feature_fraction': 0.5,
          'bagging_fraction': 0.5,
          'bagging_freq': 20,
          'reg_alpha': 1, 
          'reg_lambda': 1.2, 
          'subsample': 0.75,
          'max_depth': -1,
          'subsample_for_bin': 200,}

In [103]:
gridParams = {
    'learning_rate': [0.01,0.05,0.1],
    'n_estimators': [100,200,300],
    'num_leaves': [8,10,12],
    'boosting_type' : ['gbdt'],
    'objective' : ['binary'],
    'seed' : [777],
    'colsample_bytree' : [0.7,0.85,1],
    'subsample' : [0.7,0.85,1],
    'reg_alpha' : [0,0.5,1],
    'reg_lambda' : [0,2,6,7,10],
    'max_depth': [4,6]
}
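
Before launching a grid search it can help to know how many fits it implies. A small sketch, assuming a copy of the grid with `max_depth` set to `[4, 6]` (the name `grid_copy` is only for illustration), counts the candidates with scikit-learn's `ParameterGrid`:

```python
from sklearn.model_selection import ParameterGrid

# Illustrative copy of the search grid; each key maps to the values to try
grid_copy = {
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 300],
    'num_leaves': [8, 10, 12],
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'seed': [777],
    'colsample_bytree': [0.7, 0.85, 1],
    'subsample': [0.7, 0.85, 1],
    'reg_alpha': [0, 0.5, 1],
    'reg_lambda': [0, 2, 6, 7, 10],
    'max_depth': [4, 6],
}

# Number of candidates is the product of the list lengths
n_candidates = len(ParameterGrid(grid_copy))

# Each candidate is fitted once per cross-validation fold
cv_folds = 4
n_fits = n_candidates * cv_folds
print(n_candidates, n_fits)  # 7290 29160
```

The count grows multiplicatively with every extra value, so trimming a single list (for example `reg_lambda`) shortens the search proportionally.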

In [104]:
mdl = lgb.LGBMClassifier(boosting_type= 'gbdt', 
          objective = 'binary', 
          n_jobs = -1, # Updated from 'nthread' 
          silent = True)

# Create the grid
grid = GridSearchCV(mdl, gridParams, verbose=1, cv=4, n_jobs=-1)
grid.fit(X_train, y_train)

# Print the best parameters found
print(grid.best_params_)
print(grid.best_score_)

Fitting 4 folds for each of 7290 candidates, totalling 29160 fits


[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done 184 tasks      | elapsed:    6.2s
[Parallel(n_jobs=-1)]: Done 434 tasks      | elapsed:   10.8s
[Parallel(n_jobs=-1)]: Done 784 tasks      | elapsed:   20.9s
[Parallel(n_jobs=-1)]: Done 1234 tasks      | elapsed:   37.7s
[Parallel(n_jobs=-1)]: Done 1784 tasks      | elapsed:  1.0min
[Parallel(n_jobs=-1)]: Done 2434 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 3184 tasks      | elapsed:  1.9min
[Parallel(n_jobs=-1)]: Done 4034 tasks      | elapsed:  2.3min
[Parallel(n_jobs=-1)]: Done 4984 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 6034 tasks      | elapsed:  3.6min
[Parallel(n_jobs=-1)]: Done 7184 tasks      | elapsed:  4.3min
[Parallel(n_jobs=-1)]: Done 8434 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 9784 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 11234 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 12784 tasks      | elapsed:  7

{'boosting_type': 'gbdt', 'colsample_bytree': 0.85, 'learning_rate': 0.1, 'max_depth': 4, 'n_estimators': 300, 'num_leaves': 8, 'objective': 'binary', 'reg_alpha': 1, 'reg_lambda': 10, 'seed': 777, 'subsample': 1}
0.8361391694725028


In [106]:
best_params = {'boosting_type': 'gbdt', 'colsample_bytree': 0.85, 'learning_rate': 0.1, 'max_depth': 4, 
               'n_estimators': 300, 'num_leaves': 8, 'objective': 'binary', 'reg_alpha': 1, 'reg_lambda': 10,
               'seed': 777, 'subsample': 1}

In [107]:
mdl.set_params(**best_params)
mdl.fit(X_train, y_train)
# shuffle=True is needed for random_state to have an effect in KFold
kfold = KFold(n_splits=10, shuffle=True, random_state=7)
# Cross-validate on the validation split created earlier (x_val, y_val)
results = cross_val_score(mdl, np.array(x_val), np.array(y_val), cv=kfold)
print("Accuracy: %.2f%% (%.2f%%)" % (results.mean()*100, results.std()*100))
lgb.plot_importance(mdl)
plt.show()
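
The k-fold evaluation pattern used in this cell can be sketched in isolation. The example below is only an illustration: it uses synthetic data from `make_classification` and a scikit-learn `LogisticRegression` as a stand-in for the LightGBM classifier, so the numbers it prints say nothing about the Titanic model.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Synthetic binary classification data (hypothetical, for illustration only)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)

# 10 shuffled folds; shuffle=True makes random_state meaningful
kfold_demo = KFold(n_splits=10, shuffle=True, random_state=7)

# cross_val_score clones the estimator and refits it on each training fold,
# returning one accuracy value per held-out fold
scores = cross_val_score(LogisticRegression(max_iter=1000), X_demo, y_demo, cv=kfold_demo)

print("Accuracy: %.2f%% (%.2f%%)" % (scores.mean() * 100, scores.std() * 100))
```

Because `cross_val_score` refits a fresh clone on every fold, any earlier `fit` call on the estimator does not influence the reported scores.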

In [87]:
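# ASSUMPTION: y_pred is not defined in the cells shown; here it is taken to be the
# predicted survival probability of the tuned model on the test set
y_pred = mdl.predict_proba(X_test)[:, 1]
result = pd.DataFrame({'PassengerId': test_id,'Survived': y_pred}, columns=['PassengerId','Survived'])
result['Survived'] = result['Survived'].map(lambda x: 1 if x>0.5 else 0)
result.head()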

|   | PassengerId | Survived |
|---|-------------|----------|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
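
The `map(lambda x: 1 if x>0.5 else 0)` step turns scores into class labels. Assuming `y_pred` holds survival probabilities, the same thresholding can be written in vectorized form. The probabilities and IDs below are made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical predicted probabilities and passenger IDs
proba = np.array([0.12, 0.5, 0.51, 0.93])
ids = [892, 893, 894, 895]

# Strictly greater than 0.5 counts as survived, matching the lambda in the notebook
labels = (proba > 0.5).astype(int)

toy_result = pd.DataFrame({'PassengerId': ids, 'Survived': labels})
print(toy_result)
```

A probability of exactly 0.5 is mapped to 0 under this rule, since the comparison is strict.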


In [88]:
result.to_csv('../output/sub4_lgb.csv', sep=",", index=False)