# Model Selection

In this notebook, I'll work with **K-Fold Cross Validation**, **Grid Search** and **XGBoost**.

## K-Fold Cross Validation

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
%matplotlib inline

from sklearn.svm import SVC
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import confusion_matrix

import warnings
warnings.filterwarnings('ignore')

In [0]:
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, 2:-1]
y = dataset.iloc[:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)

In [0]:
standardScaler = StandardScaler()
X_train = standardScaler.fit_transform(X_train)
X_test = standardScaler.transform(X_test)

We will use the **Support Vector Classifier** for understanding K-Fold and Grid Search.

In [0]:
classifier = SVC(kernel = 'rbf', random_state = 0)

I'll now apply K-Fold cross validation to get an average accuracy. I set the value of K as 10.

In [5]:
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(classifier, X_train, y_train, cv = 10, n_jobs = -1)
print("Mean Accuracy: {:.2f}%".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f}%".format(accuracies.std()*100))

Mean Accuracy: 90.05%
Standard Deviation: 6.39%


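A standard deviation of about 6% means the accuracy varies noticeably from fold to fold: individual folds typically land within roughly 6 percentage points of the 90% mean, and some can fall further out.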
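To make the mean and standard deviation above concrete, here is a minimal sketch of what `cross_val_score` does for a classifier when `cv = 10`: it splits the data into 10 stratified folds, trains a fresh copy of the model on 9 of them, and scores it on the remaining one. The data comes from `make_classification` as a synthetic stand-in (an assumption, not the Social_Network_Ads data), and the names `X_demo`, `y_demo`, `manual_scores` and `library_scores` are illustrative only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
model = SVC(kernel='rbf', random_state=0)

# For a classifier, an integer cv uses stratified folds without shuffling
folds = StratifiedKFold(n_splits=10)
manual_scores = []
for train_idx, val_idx in folds.split(X_demo, y_demo):
    fold_model = clone(model)  # fresh, unfitted copy for every fold
    fold_model.fit(X_demo[train_idx], y_demo[train_idx])
    manual_scores.append(fold_model.score(X_demo[val_idx], y_demo[val_idx]))
manual_scores = np.array(manual_scores)

# The library call should produce the same ten accuracies
library_scores = cross_val_score(model, X_demo, y_demo, cv=10)

print("Mean Accuracy: {:.2f}%".format(manual_scores.mean() * 100))
print("Standard Deviation: {:.2f}%".format(manual_scores.std() * 100))
```

The mean summarises typical performance, while the standard deviation shows how much that performance depends on which samples end up in the validation fold.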

## Grid Search

I'll search over two parameter grids, one for the linear kernel and one for the RBF kernel, to identify the best parameter combination.

In [6]:
from sklearn.model_selection import GridSearchCV

parameters = [
    {
        'C': [1, 10, 100, 1000],
        'kernel': ['linear']
    },
    {
        'C': [1, 10, 100, 1000],
        'kernel': ['rbf'],
        'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
    }
]

grid_search = GridSearchCV(classifier, 
                           param_grid = parameters, 
                           scoring = 'accuracy',
                           cv= 10,
                           n_jobs = -1,
                           verbose = 5)
grid_search.fit(X_train, y_train)
print("Best Accuracy: {}%".format(grid_search.best_score_*100))

Fitting 10 folds for each of 40 candidates, totalling 400 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.


Best Accuracy: 90.33333333333333%


[Parallel(n_jobs=-1)]: Done 400 out of 400 | elapsed:    3.2s finished
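The "40 candidates, totalling 400 fits" line in the log follows directly from the grid definition. As a quick check, `ParameterGrid` can enumerate the same combinations: 4 for the linear kernel plus 4 × 9 for the RBF kernel, each fitted once per fold. The variable names below (`parameters_demo`, `candidates`) are illustrative.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the cell above, repeated so the snippet is self-contained
parameters_demo = [
    {'C': [1, 10, 100, 1000], 'kernel': ['linear']},
    {'C': [1, 10, 100, 1000], 'kernel': ['rbf'],
     'gamma': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]},
]

candidates = list(ParameterGrid(parameters_demo))
n_linear = sum(1 for c in candidates if c['kernel'] == 'linear')
n_rbf = sum(1 for c in candidates if c['kernel'] == 'rbf')
n_fits = len(candidates) * 10  # one fit per candidate per fold (cv = 10)

print(n_linear, n_rbf, len(candidates), n_fits)  # 4 36 40 400
```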


In [7]:
best_parameters = grid_search.best_params_
print("Best parameters: {}".format(best_parameters))

Best parameters: {'C': 1, 'gamma': 0.7, 'kernel': 'rbf'}
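The best score reported by the grid search is a cross-validation score on the training split. A natural follow-up, not run in this notebook, is to score the refitted best model on the held-out test split. The sketch below shows one way to do that under stated assumptions: it uses synthetic data from `make_classification`, a deliberately small grid, and illustrative names (`search`, `X_tr`, `X_te`, `cm`), so its numbers say nothing about the Social_Network_Ads results.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (assumption: not the notebook's dataset)
X_demo, y_demo = make_classification(n_samples=400, n_features=2,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.25, random_state=0)

# Fit the scaler on the training split only, then reuse it on the test split
scaler = StandardScaler()
X_tr = scaler.fit_transform(X_tr)
X_te = scaler.transform(X_te)

# Small illustrative grid; refit=True (the default) retrains the best
# candidate on the whole training split after the search
search = GridSearchCV(SVC(random_state=0),
                      param_grid={'C': [1, 10], 'gamma': [0.1, 0.7],
                                  'kernel': ['rbf']},
                      scoring='accuracy', cv=5)
search.fit(X_tr, y_tr)

y_pred = search.predict(X_te)
cm = confusion_matrix(y_te, y_pred)
test_accuracy = search.score(X_te, y_te)

print(cm)
print("Test Accuracy: {:.2f}%".format(test_accuracy * 100))
```

The diagonal of the confusion matrix counts correct predictions, so its trace divided by the total equals the test accuracy.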


## XGBoost

For XGBoost, I'm using the **Churn Modelling** dataset.

In [0]:
dataset = pd.read_csv('Churn_Modelling.csv')
X = dataset.iloc[:, 3:-1]
y = dataset.iloc[:, -1]

In [0]:
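from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer

# The `categorical_features` argument of OneHotEncoder is no longer available
# in recent scikit-learn releases, so the columns are selected with a
# ColumnTransformer instead.
# Columns 1 and 2 of X are the categorical ones. drop='first' removes one
# dummy column per feature, which avoids the dummy variable trap and turns
# the two-valued column 2 into a single 0/1 column.
columnTransformer = ColumnTransformer(
    [('categorical', OneHotEncoder(drop = 'first'), [1, 2])],
    remainder = 'passthrough',   # keep the numeric columns unchanged
    sparse_threshold = 0)        # always return a dense array
X = columnTransformer.fit_transform(X)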
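The dummy variable trap mentioned in the comment above arises because the one-hot columns of a feature always sum to 1, so any one of them is redundant given the others. A minimal sketch on a made-up four-row frame (the `toy` data and its column names are hypothetical) shows the effect of dropping the first dummy column:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Score': [600, 700, 650, 720],
    'Country': ['France', 'Spain', 'Germany', 'France'],
})

encoder = ColumnTransformer(
    [('country', OneHotEncoder(drop='first'), ['Country'])],
    remainder='passthrough',  # keep 'Score' as it is
    sparse_threshold=0)       # always return a dense array
encoded = encoder.fit_transform(toy)

# Categories are sorted alphabetically, so 'France' is the dropped baseline.
# Columns of the result: Germany, Spain, Score
print(encoded)
```

A row of two zeros in the dummy columns therefore stands for the baseline category, here France.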

In [0]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

In [11]:
from xgboost import XGBClassifier

classifier = XGBClassifier()
classifier.fit(X_train, y_train)

XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=100,
       n_jobs=1, nthread=None, objective='binary:logistic', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)

In [12]:
accuracies = cross_val_score(classifier, X_train, y_train, cv = 10, n_jobs = -1)
print("Mean Accuracy: {:.2f}%".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f}%".format(accuracies.std()*100))

Mean Accuracy: 86.30%
Standard Deviation: 1.07%


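We achieved a 10-fold cross-validated **accuracy of 86.3%** on the training split, with a standard deviation of only 1.07 percentage points across folds, so the model's performance is stable from fold to fold.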