________
<a id="top"></a>
# DS 7331 Data Mining: Lab 2 iPython Notebook
Created On: February 11, 2019
### Authors:  
- Arora, Tanvi                
- Chandna, Rajat
- Henderson Kuns, Nicol
- Ramasundaram, Kumar
- Vasquez, James

# Logistic Regression and Support Vector Machines

## Contents
* <a href="#DataPrep">Data Prepping</a>
    * <a href="#onehotencode">One Hot Encoding</a>
    * <a href="#Perform8020split">Perform 80/20 split</a>  
    * <a href="#PrepTestData">Prep Test Data</a>    
* <a href="#CreateLRModel">Create Models</a>
    * <a href="#CreateLRModel">Simple Logistic Regression Model</a>  
    * <a href="#LRGridSearch">Grid Search</a>   
    * <a href="#LRInterpertFeat">Feature Interpretation</a>
* <a href="#SVMModel">Simple SVM Model</a>
    * <a href="#SVMRBF">RBF Grid Search</a>   
    * <a href="#SVMPOLY">Poly Grid Search</a>   
    * <a href="#SVMFINAL">Final SVM Model on Validation Dataset</a>
    * <a href="#SVMFINAL_Test">Final SVM Model on Additional Test Dataset</a> 
* <a href="#MODELADV">Model Advantages</a>
* <a href="#INTVECT">Interpret Support Vector</a>
* <a href="#ECPWORK">Exceptional Work</a>

<a id="DataPrep"></a>
### Getting Dataset Ready for Model Building

In [1]:
# Importing the needed modules
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.simplefilter('ignore', DeprecationWarning)
warnings.simplefilter('ignore', FutureWarning)


# To display plots inside the iPython Notebook itself
%matplotlib inline

In [2]:
# Check how the data is organized in the file (to find the delimiter) and then
# use the corresponding function to open it, e.g.
# the data could be in .csv, .tsv, Excel format, etc.
pathOfDataFile = "data/bank-full.csv"
firstFewLines = list()
noOfLinesToView = 5

with open(pathOfDataFile) as dataFile:
    firstFewLines = [next(dataFile) for i in range(noOfLinesToView)]
    for line in firstFewLines:
        print(line)

"age";"job";"marital";"education";"default";"balance";"housing";"loan";"contact";"day";"month";"duration";"campaign";"pdays";"previous";"poutcome";"y"

58;"management";"married";"tertiary";"no";2143;"yes";"no";"unknown";5;"may";261;1;-1;0;"unknown";"no"

44;"technician";"single";"secondary";"no";29;"yes";"no";"unknown";5;"may";151;1;-1;0;"unknown";"no"

33;"entrepreneur";"married";"secondary";"no";2;"yes";"yes";"unknown";5;"may";76;1;-1;0;"unknown";"no"

47;"blue-collar";"married";"unknown";"no";1506;"yes";"no";"unknown";5;"may";92;1;-1;0;"unknown";"no"
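
Instead of reading the delimiter off the printed lines by eye, the standard library's `csv.Sniffer` can guess it. The following is a minimal sketch, assuming a short in-memory sample; the `sample` string is copied from the preview above rather than read from disk.

```python
import csv

# Two lines copied from the file preview above (a stand-in for the real file).
sample = (
    '"age";"job";"marital";"education"\n'
    '58;"management";"married";"tertiary"\n'
)

# Restrict the candidates to the delimiters we consider plausible.
dialect = csv.Sniffer().sniff(sample, delimiters=";,\t")
print(repr(dialect.delimiter))
```

The detected delimiter could then be passed to `pd.read_csv` through its `sep` argument.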



In [3]:
# Import the semi-colon delimited data file into pandas dataFrame
bankPromo_df = pd.read_csv(pathOfDataFile, sep = ";")

# Rename the Target/Final Outcome column from "y" to "Subscribed" as based on data description.
bankPromo_df = bankPromo_df.rename(columns={"y":"Subscribed"})

bankPromo_df.head(7)

,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,Subscribed
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
5,35,management,married,tertiary,no,231,yes,no,unknown,5,may,139,1,-1,0,unknown,no
6,28,management,single,tertiary,no,447,yes,yes,unknown,5,may,217,1,-1,0,unknown,no


In [4]:
bankPromo_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
age           45211 non-null int64
job           45211 non-null object
marital       45211 non-null object
education     45211 non-null object
default       45211 non-null object
balance       45211 non-null int64
housing       45211 non-null object
loan          45211 non-null object
contact       45211 non-null object
day           45211 non-null int64
month         45211 non-null object
duration      45211 non-null int64
campaign      45211 non-null int64
pdays         45211 non-null int64
previous      45211 non-null int64
poutcome      45211 non-null object
Subscribed    45211 non-null object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


In [5]:
# Get the unique values(Levels) for categorical variables.
# List to hold names of categorical variables
categoricalVars = list()
# List to hold names of numerical variables
numericalVars = list()

for colName in bankPromo_df.columns:
    if bankPromo_df[colName].dtype == np.int64:
        numericalVars.append(colName)
    elif bankPromo_df[colName].dtype == object:  # np.object was removed from NumPy
        categoricalVars.append(colName)
    else:
        pass
    
# Remove Target column from final categorical Var list
categoricalVars.remove('Subscribed')

print(numericalVars)
print(categoricalVars)

['age', 'balance', 'day', 'duration', 'campaign', 'pdays', 'previous']
['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact', 'month', 'poutcome']
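
The loop above can also be written with `DataFrame.select_dtypes`, which avoids comparing dtypes by hand. This is a sketch on a tiny stand-in frame; `demo_df`, `numerical_cols` and `categorical_cols` are illustrative names, not part of the notebook.

```python
import pandas as pd

# Tiny stand-in frame with the same kinds of columns as the bank data.
demo_df = pd.DataFrame({
    "age": [58, 44, 33],
    "job": ["management", "technician", "entrepreneur"],
    "balance": [2143, 29, 2],
    "Subscribed": ["no", "no", "yes"],
})

# Numeric columns first, then everything else minus the target column.
numerical_cols = demo_df.select_dtypes(include="number").columns.tolist()
categorical_cols = (
    demo_df.select_dtypes(exclude="number")
    .columns.drop("Subscribed")
    .tolist()
)

print(numerical_cols)
print(categorical_cols)
```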


________________________________________________________________________________________________________
<a id="onehotencode"></a>
<a href="#top">Back to Top</a>
### Perform One Hot Encoding for categorical variables in dataset

In [6]:
# Make a copy of original data frame
bankPromoModel_Df = bankPromo_df.copy()
bankPromoModel_Df['Target'] = bankPromoModel_Df['Subscribed'].apply(lambda resp : 1 if resp == "yes" else 0)
bankPromoModel_Df['Target'] = bankPromoModel_Df['Target'].astype(int)  # np.int was removed from NumPy
# Delete the original 'Subscribed' column
del bankPromoModel_Df['Subscribed']





In [7]:
# Drop the pdays feature because it is highly correlated with the "previous" feature
del bankPromoModel_Df['pdays']

In [8]:
# Convert all categorical variables to corresponding indicator variables
for categoricalVar in categoricalVars:
    tmpDf = pd.DataFrame()
    # Remove 1st class level to avoid multicollinearity
    tmpDf = pd.get_dummies(bankPromoModel_Df[categoricalVar], prefix=categoricalVar, drop_first=True)
    bankPromoModel_Df = pd.concat((bankPromoModel_Df, tmpDf), axis=1)

# Now remove the original categorical vars since indicator variables are created from them.
bankPromoModel_Df.drop(categoricalVars, inplace=True, axis=1)
bankPromoModel_Df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 42 columns):
age                    45211 non-null int64
balance                45211 non-null int64
day                    45211 non-null int64
duration               45211 non-null int64
campaign               45211 non-null int64
previous               45211 non-null int64
Target                 45211 non-null int64
job_blue-collar        45211 non-null uint8
job_entrepreneur       45211 non-null uint8
job_housemaid          45211 non-null uint8
job_management         45211 non-null uint8
job_retired            45211 non-null uint8
job_self-employed      45211 non-null uint8
job_services           45211 non-null uint8
job_student            45211 non-null uint8
job_technician         45211 non-null uint8
job_unemployed         45211 non-null uint8
job_unknown            45211 non-null uint8
marital_married        45211 non-null uint8
marital_single         45211 non-null uint8
education_s
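
The per-column loop above can be collapsed into a single `pd.get_dummies` call by passing the `columns` argument. The sketch below uses a small made-up frame (`demo_df`) rather than the bank data.

```python
import pandas as pd

demo_df = pd.DataFrame({
    "age": [58, 44, 33, 47],
    "marital": ["married", "single", "married", "divorced"],
    "housing": ["yes", "yes", "no", "yes"],
})

# columns= encodes only the listed columns and drops the originals;
# drop_first=True omits the first (alphabetical) level of each one.
encoded_df = pd.get_dummies(demo_df, columns=["marital", "housing"], drop_first=True)

print(encoded_df.columns.tolist())
```

Here `marital_divorced` and `housing_no` become the implicit reference levels, which mirrors what the loop does for each categorical variable.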

________________________________________________________________________________________________________
<a id="Perform8020split"></a>
<a href="#top">Back to Top</a>
### Create 10 Splits Stratified Cross Validation Object

In [9]:
# Training and Test Split
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.exceptions import DataConversionWarning
warnings.filterwarnings(action='ignore', category=DataConversionWarning)

if 'Target' in bankPromoModel_Df:
    y = bankPromoModel_Df['Target'].values # get the labels we want
    del bankPromoModel_Df['Target']        # get rid of the class label
    X = bankPromoModel_Df.values           # use everything else to predict!

    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
    
    
# To use the cross validation object in scikit learn, we need to grab an instance
# of the object and set it up. This object will be able to split our data into 
# training and testing splits
num_cv_iterations = 10
stratified_cv_object = StratifiedShuffleSplit(n_splits=num_cv_iterations,
                         test_size  = 0.2, random_state=999)
                         
print(stratified_cv_object)


StratifiedShuffleSplit(n_splits=10, random_state=999, test_size=0.2,
            train_size=None)
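
To illustrate what this object does, the sketch below runs `StratifiedShuffleSplit` on synthetic labels (an assumption for illustration, not the bank data) and checks that every test split keeps the original class ratio. Unlike K-fold, the test sets of different iterations may overlap.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic labels: 900 negatives and 100 positives (10% positive rate).
y_demo = np.array([0] * 900 + [1] * 100)
X_demo = np.zeros((y_demo.size, 1))  # feature values are irrelevant for splitting

splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=999)

for train_idx, test_idx in splitter.split(X_demo, y_demo):
    # Print the test split size and its share of positive labels.
    print(len(test_idx), y_demo[test_idx].mean())
```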


In [10]:
# Training and Test Split
from sklearn.model_selection import StratifiedKFold

if 'Target' in bankPromoModel_Df:
    y = bankPromoModel_Df['Target'].values # get the labels we want
    del bankPromoModel_Df['Target']        # get rid of the class label
    X = bankPromoModel_Df.values           # use everything else to predict!

    ## X and y are now numpy matrices, by calling 'values' on the pandas data frames we
    #    have converted them into simple matrices to use with scikit learn
    
    
# To use the cross validation object in scikit learn, we need to grab an instance
# of the object and set it up. This object will be able to split our data into 
# training and testing splits
num_cv_iterations = 10
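# Note: shuffle is left at its default (False), so random_state has no effect here;
# recent scikit-learn releases raise a ValueError for this combination.
stratifiedKfold_cv_object = StratifiedKFold(n_splits=num_cv_iterations, random_state=999)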
                         
print(stratifiedKfold_cv_object)


StratifiedKFold(n_splits=10, random_state=999, shuffle=False)
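
With `shuffle=False`, `StratifiedKFold` assigns rows to folds in their existing order within each class, so if the rows are ordered (for example chronologically), each fold covers a different slice of that ordering. The sketch below shows the shuffled variant on synthetic labels; the data and the name `shuffled_cv` are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y_demo = np.array([0] * 90 + [1] * 10)
X_demo = np.arange(y_demo.size).reshape(-1, 1)

# shuffle=True randomizes row order within each class before folding;
# only in that case does random_state have an effect.
shuffled_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=999)

seen = []
for train_idx, test_idx in shuffled_cv.split(X_demo, y_demo):
    seen.extend(test_idx.tolist())
    # Print the fold size and the number of positives it contains.
    print(len(test_idx), int(y_demo[test_idx].sum()))
```

Every row lands in exactly one test fold, and each fold keeps the 9:1 class ratio.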


________________________________________________________________________________________________________
<a id="PrepTestData"></a>
<a href="#top">Back to Top</a>
### Preparing the Additional Test Dataset (10% of Instances) for Final Model Fitting and Evaluation

In [11]:
pathOfAdditionalDataFile = "data/bank.csv"

# Import the semi-colon delimited data file into pandas dataFrame
bankPromoAdditional_df = pd.read_csv(pathOfAdditionalDataFile, sep = ";")

# Rename the Target/Final Outcome column from "y" to "Subscribed" as based on data description.
bankPromoAdditional_df = bankPromoAdditional_df.rename(columns={"y":"Subscribed"})

bankPromoAdditional_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 17 columns):
age           4521 non-null int64
job           4521 non-null object
marital       4521 non-null object
education     4521 non-null object
default       4521 non-null object
balance       4521 non-null int64
housing       4521 non-null object
loan          4521 non-null object
contact       4521 non-null object
day           4521 non-null int64
month         4521 non-null object
duration      4521 non-null int64
campaign      4521 non-null int64
pdays         4521 non-null int64
previous      4521 non-null int64
poutcome      4521 non-null object
Subscribed    4521 non-null object
dtypes: int64(7), object(10)
memory usage: 600.5+ KB


In [12]:
bankPromoAdditional_df['Target'] = bankPromoAdditional_df['Subscribed'].apply(lambda resp : 1 if resp == "yes" else 0)
bankPromoAdditional_df['Target'] = bankPromoAdditional_df['Target'].astype(int)  # np.int was removed from NumPy
# Delete the original 'Subscribed' column
del bankPromoAdditional_df['Subscribed']

In [13]:
# Remove pDays
del bankPromoAdditional_df['pdays']

In [14]:
# Convert all categorical variables to corresponding indicator variables
for categoricalVar in categoricalVars:
    tmpDf = pd.DataFrame()
    # Remove 1st class level to avoid multicollinearity
    tmpDf = pd.get_dummies(bankPromoAdditional_df[categoricalVar], prefix=categoricalVar, drop_first=True)
    bankPromoAdditional_df = pd.concat((bankPromoAdditional_df, tmpDf), axis=1)

# Now remove the original categorical vars since indicator variables are created from them.
bankPromoAdditional_df.drop(categoricalVars, inplace=True, axis=1)

if 'Target' in bankPromoAdditional_df:
    y_Final = bankPromoAdditional_df['Target'].values # get the labels we want
    del bankPromoAdditional_df['Target']        # get rid of the class label
    X_Final = bankPromoAdditional_df.values

bankPromoAdditional_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4521 entries, 0 to 4520
Data columns (total 41 columns):
age                    4521 non-null int64
balance                4521 non-null int64
day                    4521 non-null int64
duration               4521 non-null int64
campaign               4521 non-null int64
previous               4521 non-null int64
job_blue-collar        4521 non-null uint8
job_entrepreneur       4521 non-null uint8
job_housemaid          4521 non-null uint8
job_management         4521 non-null uint8
job_retired            4521 non-null uint8
job_self-employed      4521 non-null uint8
job_services           4521 non-null uint8
job_student            4521 non-null uint8
job_technician         4521 non-null uint8
job_unemployed         4521 non-null uint8
job_unknown            4521 non-null uint8
marital_married        4521 non-null uint8
marital_single         4521 non-null uint8
education_secondary    4521 non-null uint8
education_tertiary     4521 non-n
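
Encoding the two files separately only yields matching feature columns if every categorical level appears in both. One way to guard against a level missing from the smaller file is to fix the levels with a `CategoricalDtype` taken from the training data before calling `get_dummies`. This is a sketch on made-up frames; `train_df`, `test_df` and `month_dtype` are hypothetical names.

```python
import pandas as pd

train_df = pd.DataFrame({"month": ["may", "jun", "jul"]})
test_df = pd.DataFrame({"month": ["may", "jun"]})  # "jul" never appears here

# Fix the set of levels from the training data so both frames share it.
month_dtype = pd.CategoricalDtype(categories=sorted(train_df["month"].unique()))

train_X = pd.get_dummies(
    train_df.astype({"month": month_dtype}), columns=["month"], drop_first=True
)
test_X = pd.get_dummies(
    test_df.astype({"month": month_dtype}), columns=["month"], drop_first=True
)

print(train_X.columns.tolist())
print(test_X.columns.tolist())
```

Because both frames use the same categories, the same reference level is dropped in each and the resulting columns line up in name and order.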

________________________________________________________________________________________________________
<a id="CreateLRModel"></a>
<a href="#top">Back to Top</a>
# Create Model


________________________________________________________________________________________________________
<a id="SVMModel"></a>
<a href="#top">Back to Top</a>
### Simple SVM Model Fit

In [15]:
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline


scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

# Standardize the features first: the RBF kernel is distance-based, so unscaled
# features with large ranges would dominate, and scaling also helps the solver converge faster.

# return_train_score=True keeps the train_* columns (newer scikit-learn versions omit them by default).
svmModel = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf', degree=3 , gamma='auto', random_state=999))
scores = cross_validate(svmModel, X, y=y, cv=stratified_cv_object, n_jobs=-1, scoring=scoring, return_train_score=True)

print()
display(pd.DataFrame(scores))

scores = cross_validate(svmModel, X, y=y, cv=stratifiedKfold_cv_object, n_jobs=-1, scoring=scoring, return_train_score=True)
display(pd.DataFrame(scores))




,fit_time,score_time,test_F1_Score,train_F1_Score,test_AUC,train_AUC,test_Accuracy,train_Accuracy,test_Precision,train_Precision,test_Recall,train_Recall
0,55.997428,29.277847,0.442489,0.536374,0.905848,0.943094,0.901913,0.917358,0.660413,0.780235,0.332703,0.40865
1,55.573726,29.79394,0.409762,0.533934,0.908474,0.942348,0.898374,0.917026,0.639279,0.778533,0.301512,0.406287
2,53.945217,29.854525,0.460468,0.522341,0.91084,0.942594,0.905673,0.916058,0.695985,0.781176,0.344045,0.392342
3,53.975717,29.032582,0.442467,0.534931,0.903748,0.942832,0.903019,0.917358,0.675728,0.782787,0.328922,0.406287
4,53.551564,29.336625,0.447059,0.526694,0.905101,0.943542,0.90125,0.916169,0.648115,0.775632,0.34121,0.398724
5,56.740093,29.887593,0.44345,0.525689,0.911227,0.940267,0.903682,0.91628,0.684418,0.779378,0.327977,0.396597
6,56.034785,30.636996,0.434286,0.535808,0.900787,0.944734,0.901471,0.917026,0.661509,0.775291,0.323251,0.409359
7,55.871675,29.466311,0.431423,0.527187,0.909024,0.941691,0.902355,0.916335,0.676768,0.777778,0.316635,0.398724
8,52.515114,30.386943,0.458831,0.529183,0.901194,0.943467,0.904788,0.916363,0.684803,0.77484,0.344991,0.401796
9,56.857944,30.009609,0.434069,0.529936,0.901098,0.942322,0.900807,0.91686,0.652751,0.782548,0.325142,0.400615


,fit_time,score_time,test_F1_Score,train_F1_Score,test_AUC,train_AUC,test_Accuracy,train_Accuracy,test_Precision,train_Precision,test_Recall,train_Recall
0,68.547299,16.551178,0.026119,0.540348,0.88924,0.944899,0.884564,0.917545,1.0,0.776684,0.013233,0.414286
1,82.267283,16.542563,0.007308,0.546727,0.370369,0.942571,0.81977,0.918823,0.010274,0.788287,0.005671,0.418487
2,71.679508,16.518146,0.05169,0.559631,0.455206,0.945343,0.788985,0.920324,0.054507,0.791699,0.049149,0.432773
3,70.234187,16.177203,0.050304,0.56984,0.393581,0.947252,0.757797,0.922045,0.046474,0.803749,0.05482,0.441387
4,76.29343,16.650535,0.144304,0.555252,0.580249,0.942971,0.850476,0.920079,0.218391,0.795455,0.10775,0.426471
5,74.162877,16.490273,0.197415,0.559253,0.503552,0.94469,0.848927,0.919931,0.26087,0.785334,0.15879,0.434244
6,70.751371,16.391119,0.166983,0.553717,0.575662,0.946234,0.805795,0.919145,0.167619,0.781394,0.166352,0.428782
7,63.744324,14.995524,0.032051,0.618125,0.208879,0.955003,0.398806,0.92782,0.019746,0.810986,0.085066,0.49937
8,73.711308,17.453439,0.260456,0.543772,0.609869,0.942091,0.827914,0.918801,0.26195,0.793312,0.258979,0.413655
9,64.364488,14.17105,0.292779,0.660222,0.723169,0.954828,0.525442,0.935219,0.177246,0.854521,0.840909,0.537912
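
Per-split tables like the two above are easier to compare once they are reduced to a mean and standard deviation per metric. The sketch below does this on a small synthetic dataset from `make_classification`; the data, the reduced metric set and the name `summary` are assumptions for illustration, not the notebook's results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Small synthetic, imbalanced stand-in for the bank data.
X_demo, y_demo = make_classification(
    n_samples=400, n_features=8, weights=[0.85, 0.15], random_state=999
)

cv = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=999)
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", gamma="auto"))

scores = cross_validate(
    model, X_demo, y_demo, cv=cv,
    scoring={"F1_Score": "f1", "AUC": "roc_auc", "Recall": "recall"},
    return_train_score=True,
)

# One row per metric: mean and standard deviation across the splits.
summary = (
    pd.DataFrame(scores)
    .filter(regex="^(test|train)_")
    .agg(["mean", "std"])
    .T
)
print(summary.round(3))
```

A large gap between the `train_` and `test_` rows, or a large `std`, points to overfitting or unstable folds respectively.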


In [None]:
# For class balance

svmModel = make_pipeline(StandardScaler(), SVC(C=1.0, kernel='rbf', degree=3 , gamma='auto',class_weight="balanced", random_state=999))

scores = cross_validate(svmModel, X, y=y, cv=stratified_cv_object, n_jobs=-1, scoring=scoring)

display(pd.DataFrame(scores))

scores = cross_validate(svmModel, X, y=y, cv=stratifiedKfold_cv_object, n_jobs=-1, scoring=scoring)
display(pd.DataFrame(scores))
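
For reference, `class_weight="balanced"` weights each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that with scikit-learn's `compute_class_weight` on synthetic labels with a 9:1 imbalance (an assumption for illustration).

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Synthetic labels with a 9:1 imbalance.
y_demo = np.array([0] * 90 + [1] * 10)

# Weights as scikit-learn computes them for class_weight="balanced".
weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y_demo)

# The same formula written out by hand.
manual = y_demo.size / (2 * np.bincount(y_demo))

print(weights)
print(manual)
```

The minority class receives the larger weight, so misclassifying a positive example costs the SVM more than misclassifying a negative one.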

________________________________________________________________________________________________________
<a id="SVMRBF"></a>
<a href="#top">Back to Top</a>
### Tuning The Model Hyper Parameters for SVM Using Grid Search


In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import f1_score

param_grid = {
     'svc__kernel' : ['poly', 'rbf'],
    'svc__C' : np.logspace(-10, 2, 5),
    'svc__degree' : [1,2,3],
    'svc__gamma': np.logspace(-9, 3, 5)}


# Create grid search object

grid = GridSearchCV(make_pipeline(StandardScaler(), SVC(class_weight='balanced', random_state=999)), \
                   param_grid = param_grid, cv = stratified_cv_object, \
                   verbose=False, n_jobs=-1, scoring=scoring, refit='F1_Score', \
                   return_train_score=True)

grid.fit(X, y=y)


print("The best parameters are %s with a score of %0.2f"
      % (grid.best_params_, grid.best_score_))
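
Before launching a search this size it helps to count the fits. The sketch below uses `ParameterGrid` to count the candidates in the grid above, and in a hypothetical variant (`split_grid`) that gives each kernel only the parameters it uses, since the RBF kernel ignores `degree`.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

C_values = np.logspace(-10, 2, 5)
gamma_values = np.logspace(-9, 3, 5)

# Same grid as above: every combination is tried, including 'degree'
# values paired with the RBF kernel, which does not use 'degree'.
full_grid = {
    "svc__kernel": ["poly", "rbf"],
    "svc__C": C_values,
    "svc__degree": [1, 2, 3],
    "svc__gamma": gamma_values,
}

# A list of dicts lets each kernel carry only the parameters it uses.
split_grid = [
    {"svc__kernel": ["poly"], "svc__C": C_values,
     "svc__degree": [1, 2, 3], "svc__gamma": gamma_values},
    {"svc__kernel": ["rbf"], "svc__C": C_values, "svc__gamma": gamma_values},
]

n_splits = 10  # matches num_cv_iterations in this notebook
n_full = len(ParameterGrid(full_grid))
n_split = len(ParameterGrid(split_grid))

print(n_full, n_full * n_splits)    # candidates, total fits
print(n_split, n_split * n_splits)
```

Given that a single SVM fit took close to a minute in the tables above, trimming redundant candidates could shorten the search noticeably.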

In [16]:
########## Random Forest ############################

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

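baseRfModel = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1))
# return_train_score=True keeps the train_* columns (newer scikit-learn versions omit them by default).
scores = cross_validate(baseRfModel, X, y=y, cv=stratified_cv_object, n_jobs=-1, scoring=scoring, return_train_score=True)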

display(pd.DataFrame(scores))


,fit_time,score_time,test_F1_Score,train_F1_Score,test_AUC,train_AUC,test_Accuracy,train_Accuracy,test_Precision,train_Precision,test_Recall,train_Recall
0,1.455259,0.670275,0.441088,0.963994,0.887794,0.999714,0.897711,0.991844,0.61139,0.996719,0.344991,0.933349
1,0.319,0.543636,0.43203,0.968424,0.888017,0.999747,0.898817,0.992811,0.629295,0.996003,0.328922,0.94233
2,0.320733,0.543752,0.454159,0.962628,0.892616,0.999738,0.900586,0.991539,0.634975,0.995957,0.353497,0.931458
3,0.244978,0.540238,0.42698,0.964416,0.892466,0.999737,0.8976,0.991927,0.61828,0.995472,0.326087,0.93524
4,0.22735,0.539349,0.446301,0.96786,0.889007,0.999801,0.897379,0.992701,0.605178,0.997991,0.353497,0.939494
5,0.226471,0.538469,0.42409,0.965534,0.890089,0.999742,0.898485,0.992175,0.630597,0.99598,0.319471,0.936894
6,0.231513,0.540741,0.436881,0.970923,0.889342,0.999798,0.89937,0.993364,0.632616,0.996023,0.333648,0.947057
7,0.205409,0.540837,0.44417,0.965526,0.893627,0.999672,0.900365,0.992175,0.639432,0.996229,0.340265,0.936658
8,0.264353,0.538117,0.4546,0.969167,0.887864,0.999763,0.899701,0.992977,0.624793,0.996257,0.357278,0.943512
9,0.249046,0.541407,0.435474,0.965635,0.895221,0.999768,0.897932,0.992203,0.616984,0.99673,0.336484,0.936422


In [None]:
from sklearn.ensemble import RandomForestClassifier

baseRfModel = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1, class_weight='balanced'))
scores = cross_validate(baseRfModel, X, y=y, cv=stratified_cv_object, n_jobs=-1, scoring=scoring)

display(pd.DataFrame(scores))

In [None]:
from sklearn.ensemble import RandomForestClassifier

baseRfModel = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1, class_weight='balanced_subsample'))
scores = cross_validate(baseRfModel, X, y=y, cv=stratified_cv_object, n_jobs=-1, scoring=scoring)

display(pd.DataFrame(scores))

In [17]:
#################################
# Create randomized grid
#################################

# Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]

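# Number of features to consider at every split
# ('auto' means sqrt(n_features) for classifiers; scikit-learn 1.3+ no longer accepts it, use 'sqrt' there)
max_features = ['auto', 'log2', 8, 9, 10]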

# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)

# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10, 15]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]

#Class weights
class_weight = ['balanced', 'balanced_subsample']

# Method of selecting samples for training each tree
bootstrap = [True, False]

# Create the random grid
random_grid = {'randomforestclassifier__n_estimators': n_estimators,
               'randomforestclassifier__max_features': max_features,
               'randomforestclassifier__max_depth': max_depth,
               'randomforestclassifier__min_samples_split': min_samples_split,
               'randomforestclassifier__min_samples_leaf': min_samples_leaf,
               'randomforestclassifier__class_weight': class_weight,
               'randomforestclassifier__bootstrap': bootstrap}

print(random_grid)

{'randomforestclassifier__n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000], 'randomforestclassifier__max_features': ['auto', 'log2', 8, 9, 10], 'randomforestclassifier__max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None], 'randomforestclassifier__min_samples_split': [2, 5, 10, 15], 'randomforestclassifier__min_samples_leaf': [1, 2, 4], 'randomforestclassifier__class_weight': ['balanced', 'balanced_subsample'], 'randomforestclassifier__bootstrap': [True, False]}
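
To put the randomized search in perspective, the sketch below counts the combinations in this grid and draws 100 of them with `ParameterSampler`. The keys are shortened for readability and `demo_grid` is an illustrative name; the value lists match the grid printed above.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Same value lists as the grid printed above, with shortened keys.
demo_grid = {
    "n_estimators": list(range(200, 2001, 200)),
    "max_features": ["auto", "log2", 8, 9, 10],
    "max_depth": list(range(10, 111, 10)) + [None],
    "min_samples_split": [2, 5, 10, 15],
    "min_samples_leaf": [1, 2, 4],
    "class_weight": ["balanced", "balanced_subsample"],
    "bootstrap": [True, False],
}

total = len(ParameterGrid(demo_grid))
sampled = list(ParameterSampler(demo_grid, n_iter=100, random_state=999))

print(total)                 # number of combinations in the full grid
print(len(sampled) / total)  # fraction covered by 100 random draws
```

An exhaustive search over this grid would be impractical at the fit times reported in this notebook, which is the usual argument for sampling.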


In [18]:
from sklearn.model_selection import RandomizedSearchCV
#################################
# Random Search Training
#################################

# Use the random grid to search for best hyperparameters
# First create the base model to tune
#rf = RandomForestClassifier() #Originally was this
rf = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1)) 

# Random search of parameters, using the 10-split stratified shuffle cross validation object,
# search across 100 different combinations, and use all available cores
rf_randomgrid = RandomizedSearchCV(estimator = rf, param_distributions = random_grid, 
                                   n_iter = 100, 
                                   cv = stratified_cv_object,
                                   verbose=2, 
                                   random_state=999, 
                                   n_jobs = -1,
                                   scoring=scoring,
                                   refit='F1_Score',
                                   return_train_score=True)


# Fit the random search model
rf_randomgrid.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (rf_randomgrid.best_params_, rf_randomgrid.best_score_))
#rf_random.best_params_

Fitting 10 folds for each of 100 candidates, totalling 1000 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.7min
[Parallel(n_jobs=-1)]: Done 237 tasks      | elapsed: 12.1min
[Parallel(n_jobs=-1)]: Done 520 tasks      | elapsed: 26.0min
[Parallel(n_jobs=-1)]: Done 1000 out of 1000 | elapsed: 45.7min finished


The best parameters are {'randomforestclassifier__n_estimators': 1200, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__max_depth': 50, 'randomforestclassifier__class_weight': 'balanced', 'randomforestclassifier__bootstrap': False} with a score of 0.62
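
With several scorers, `best_score_` reports only the metric named in `refit`; the other metrics live in `cv_results_` under keys such as `mean_test_AUC`. The sketch below shows how they could be inspected, using a small synthetic dataset and a deliberately tiny search (both assumptions for illustration).

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedShuffleSplit

X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, weights=[0.8, 0.2], random_state=999
)

search = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=999),
    param_distributions={"max_depth": [2, 4, 8, None],
                         "min_samples_leaf": [1, 2, 4]},
    n_iter=5,
    cv=StratifiedShuffleSplit(n_splits=3, test_size=0.2, random_state=999),
    scoring={"F1_Score": "f1", "AUC": "roc_auc"},
    refit="F1_Score",
    random_state=999,
)
search.fit(X_demo, y_demo)

# One row per sampled candidate, ranked by the refit metric.
results = pd.DataFrame(search.cv_results_)
cols = ["params", "mean_test_F1_Score", "std_test_F1_Score",
        "mean_test_AUC", "rank_test_F1_Score"]
print(results[cols].sort_values("rank_test_F1_Score").head())
```

Looking at `std_test_F1_Score` alongside the mean helps judge whether differences between the top candidates are meaningful.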


In [15]:
########################################################
# Create Smaller grid 1 based upon Random Grid CV results
#######################################################

# Number of trees in random forest
n_estimators = [1209, 1211, 1213, 1207]

# Number of features to consider at every split
max_features = ['auto']

# Maximum number of levels in tree
max_depth = [49,51]

# Minimum number of samples required to split a node
min_samples_split = [3,4,17]

# Minimum number of samples required at each leaf node
min_samples_leaf = [2,4,6]

#Class weights
class_weight = ['balanced']

# Method of selecting samples for training each tree
bootstrap = [False]

# Create the sub grid
subGrid = {'randomforestclassifier__n_estimators': n_estimators,
        'randomforestclassifier__max_features': max_features,
        'randomforestclassifier__max_depth': max_depth,
        'randomforestclassifier__min_samples_split': min_samples_split,
        'randomforestclassifier__min_samples_leaf': min_samples_leaf,
        'randomforestclassifier__class_weight': class_weight,
        'randomforestclassifier__bootstrap': bootstrap}

print(subGrid)


{'randomforestclassifier__n_estimators': [1209, 1211, 1213, 1207], 'randomforestclassifier__max_features': ['auto'], 'randomforestclassifier__max_depth': [49, 51], 'randomforestclassifier__min_samples_split': [3, 4, 17], 'randomforestclassifier__min_samples_leaf': [2, 4, 6], 'randomforestclassifier__class_weight': ['balanced'], 'randomforestclassifier__bootstrap': [False]}


In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

#################################
# Sub Grid Search
#################################

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

rfSubGridEstimator = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1)) 

rfSubGridModel = GridSearchCV(estimator = rfSubGridEstimator, 
                              param_grid= subGrid,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
rfSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (rfSubGridModel.best_params_, rfSubGridModel.best_score_))


Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  4.8min
[Parallel(n_jobs=-1)]: Done 237 tasks      | elapsed: 18.2min
[Parallel(n_jobs=-1)]: Done 520 tasks      | elapsed: 37.2min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 47.4min finished


The best parameters are {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__class_weight': 'balanced', 'randomforestclassifier__max_depth': 49, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__min_samples_split': 17, 'randomforestclassifier__n_estimators': 1209} with a score of 0.61
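
Note that the first sub grid does not contain the random search's own best values (1200 trees, depth 50, `min_samples_split` of 10), so it cannot rediscover that configuration. One possible approach is to build each refinement list around the incumbent value so that it is always included. The helper `neighbors_around` and the step sizes below are hypothetical choices for illustration, not something the notebook uses.

```python
def neighbors_around(best, step, count=1, minimum=1):
    """Return sorted values centered on `best`, always including `best`."""
    values = {best + k * step for k in range(-count, count + 1)}
    return sorted(v for v in values if v >= minimum)


# Incumbent values taken from the random search output above;
# the step sizes are arbitrary assumptions.
sub_grid = {
    "randomforestclassifier__n_estimators": neighbors_around(1200, 100),
    "randomforestclassifier__max_depth": neighbors_around(50, 5),
    "randomforestclassifier__min_samples_split": neighbors_around(10, 3, minimum=2),
    "randomforestclassifier__min_samples_leaf": neighbors_around(2, 1),
}

for name, values in sub_grid.items():
    print(name, values)
```

A grid built this way can score no worse than the incumbent on the same splits, because the incumbent is one of its candidates.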


In [17]:
########################################################
# Create Smaller grid 2 based upon Random Grid CV results
#######################################################

# Number of trees in random forest
n_estimators = [1190, 1175, 1150, 1125]

# Number of features to consider at every split
max_features = ['auto']

# Maximum number of levels in tree
max_depth = [45,55]

# Minimum number of samples required to split a node
min_samples_split = [2,25,30]

# Minimum number of samples required at each leaf node
min_samples_leaf = [11,12,14]

#Class weights
class_weight = ['balanced']

# Method of selecting samples for training each tree
bootstrap = [False]

# Create the sub grid
subGrid = {'randomforestclassifier__n_estimators': n_estimators,
        'randomforestclassifier__max_features': max_features,
        'randomforestclassifier__max_depth': max_depth,
        'randomforestclassifier__min_samples_split': min_samples_split,
        'randomforestclassifier__min_samples_leaf': min_samples_leaf,
        'randomforestclassifier__class_weight': class_weight,
        'randomforestclassifier__bootstrap': bootstrap}

print(subGrid)

{'randomforestclassifier__n_estimators': [1190, 1175, 1150, 1125], 'randomforestclassifier__max_features': ['auto'], 'randomforestclassifier__max_depth': [45, 55], 'randomforestclassifier__min_samples_split': [2, 25, 30], 'randomforestclassifier__min_samples_leaf': [11, 12, 14], 'randomforestclassifier__class_weight': ['balanced'], 'randomforestclassifier__bootstrap': [False]}


In [18]:
#################################
# Sub Grid Search
#################################

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

rfSubGridEstimator = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1)) 

rfSubGridModel = GridSearchCV(estimator = rfSubGridEstimator, 
                              param_grid= subGrid,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
rfSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (rfSubGridModel.best_params_, rfSubGridModel.best_score_))

Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  3.9min
[Parallel(n_jobs=-1)]: Done 237 tasks      | elapsed: 15.5min
[Parallel(n_jobs=-1)]: Done 520 tasks      | elapsed: 32.6min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 41.4min finished


The best parameters are {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__class_weight': 'balanced', 'randomforestclassifier__max_depth': 45, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__min_samples_leaf': 11, 'randomforestclassifier__min_samples_split': 25, 'randomforestclassifier__n_estimators': 1190} with a score of 0.58


In [15]:
########################################################
# Create Smaller grid 3 based upon Random Grid CV results
#######################################################

# Number of trees in random forest
n_estimators = [1192, 1194, 1196, 1198]

# Number of features to consider at every split
max_features = ['auto']

# Maximum number of levels in tree
max_depth = [48,50]

# Minimum number of samples required to split a node
min_samples_split =  [9,11,13]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1,4,7]

#Class weights
class_weight = ['balanced']

# Method of selecting samples for training each tree
bootstrap = [False]

# Create the sub grid (searched exhaustively, not sampled)
subGrid = {'randomforestclassifier__n_estimators': n_estimators,
        'randomforestclassifier__max_features': max_features,
        'randomforestclassifier__max_depth': max_depth,
        'randomforestclassifier__min_samples_split': min_samples_split,
        'randomforestclassifier__min_samples_leaf': min_samples_leaf,
        'randomforestclassifier__class_weight': class_weight,
        'randomforestclassifier__bootstrap': bootstrap}

print(subGrid)



{'randomforestclassifier__n_estimators': [1192, 1194, 1196, 1198], 'randomforestclassifier__max_features': ['auto'], 'randomforestclassifier__max_depth': [48, 50], 'randomforestclassifier__min_samples_split': [9, 11, 13], 'randomforestclassifier__min_samples_leaf': [1, 4, 7], 'randomforestclassifier__class_weight': ['balanced'], 'randomforestclassifier__bootstrap': [False]}
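
The size of an exhaustive grid search can be checked before launching it. The sketch below counts the combinations in the sub grid printed above and multiplies by the number of CV splits. The value `n_splits = 10` is an assumption read off the "Fitting 10 folds" log lines; `stratified_cv_object` itself is defined elsewhere in the notebook.

```python
from itertools import product

# Sub grid 3, copied from the printed output above.
sub_grid = {
    'randomforestclassifier__n_estimators': [1192, 1194, 1196, 1198],
    'randomforestclassifier__max_features': ['auto'],
    'randomforestclassifier__max_depth': [48, 50],
    'randomforestclassifier__min_samples_split': [9, 11, 13],
    'randomforestclassifier__min_samples_leaf': [1, 4, 7],
    'randomforestclassifier__class_weight': ['balanced'],
    'randomforestclassifier__bootstrap': [False],
}

# Assumption: 10 CV splits, taken from the "Fitting 10 folds" log lines.
n_splits = 10

# A grid search evaluates every combination of the listed values.
candidates = list(product(*sub_grid.values()))
n_candidates = len(candidates)
n_fits = n_candidates * n_splits

print(n_candidates, n_fits)
```

With four values for `n_estimators`, two for `max_depth`, and three each for `min_samples_split` and `min_samples_leaf`, this gives 72 candidates and 720 fits, which matches the fit counts reported in the search logs.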


In [16]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

#################################
# Sub Grid Search
#################################

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

rfSubGridEstimator = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1)) 

rfSubGridModel = GridSearchCV(estimator = rfSubGridEstimator, 
                              param_grid= subGrid,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
rfSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (rfSubGridModel.best_params_, rfSubGridModel.best_score_))



Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed: 29.0min
[Parallel(n_jobs=-1)]: Done 154 tasks      | elapsed: 122.3min
[Parallel(n_jobs=-1)]: Done 357 tasks      | elapsed: 273.3min
[Parallel(n_jobs=-1)]: Done 640 tasks      | elapsed: 491.4min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 548.5min finished


The best parameters are {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__class_weight': 'balanced', 'randomforestclassifier__max_depth': 48, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__min_samples_leaf': 1, 'randomforestclassifier__min_samples_split': 13, 'randomforestclassifier__n_estimators': 1196} with a score of 0.61


In [17]:
########################################################
# Create Smaller grid 4 based upon Random Grid CV results
#######################################################

# Number of trees in random forest
n_estimators = [1210, 1225, 1250, 1275]

# Number of features to consider at every split
max_features = ['auto']

# Maximum number of levels in tree
max_depth = [49,51]

# Minimum number of samples required to split a node
min_samples_split = [5,15,20]

# Minimum number of samples required at each leaf node
min_samples_leaf = [8,9,10]

#Class weights
class_weight = ['balanced']

# Method of selecting samples for training each tree
bootstrap = [False]

# Create the sub grid (searched exhaustively, not sampled)
subGrid = {'randomforestclassifier__n_estimators': n_estimators,
        'randomforestclassifier__max_features': max_features,
        'randomforestclassifier__max_depth': max_depth,
        'randomforestclassifier__min_samples_split': min_samples_split,
        'randomforestclassifier__min_samples_leaf': min_samples_leaf,
        'randomforestclassifier__class_weight': class_weight,
        'randomforestclassifier__bootstrap': bootstrap}

print(subGrid)


{'randomforestclassifier__n_estimators': [1210, 1225, 1250, 1275], 'randomforestclassifier__max_features': ['auto'], 'randomforestclassifier__max_depth': [49, 51], 'randomforestclassifier__min_samples_split': [5, 15, 20], 'randomforestclassifier__min_samples_leaf': [8, 9, 10], 'randomforestclassifier__class_weight': ['balanced'], 'randomforestclassifier__bootstrap': [False]}


In [18]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

#################################
# Sub Grid Search
#################################

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

rfSubGridEstimator = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1)) 

rfSubGridModel = GridSearchCV(estimator = rfSubGridEstimator, 
                              param_grid= subGrid,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
rfSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (rfSubGridModel.best_params_, rfSubGridModel.best_score_))


Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 8 concurrent workers.
[Parallel(n_jobs=-1)]: Done  25 tasks      | elapsed: 14.6min
[Parallel(n_jobs=-1)]: Done 146 tasks      | elapsed: 71.3min
[Parallel(n_jobs=-1)]: Done 349 tasks      | elapsed: 165.5min
[Parallel(n_jobs=-1)]: Done 632 tasks      | elapsed: 296.8min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 336.1min finished


The best parameters are {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__class_weight': 'balanced', 'randomforestclassifier__max_depth': 49, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__min_samples_leaf': 8, 'randomforestclassifier__min_samples_split': 5, 'randomforestclassifier__n_estimators': 1250} with a score of 0.59


In [20]:
########################################################
# Create Smaller grid 5 based upon Random Grid CV results
#######################################################

# Number of trees in random forest
n_estimators = [1202, 1204, 1206, 1208]

# Number of features to consider at every split
max_features = ['auto']

# Maximum number of levels in tree
max_depth = [50,52]

# Minimum number of samples required to split a node
min_samples_split = [8,10,12]

# Minimum number of samples required at each leaf node
min_samples_leaf = [1,2,3]

#Class weights
class_weight = ['balanced']

# Method of selecting samples for training each tree
bootstrap = [False]

# Create the sub grid (searched exhaustively, not sampled)
subGrid = {'randomforestclassifier__n_estimators': n_estimators,
        'randomforestclassifier__max_features': max_features,
        'randomforestclassifier__max_depth': max_depth,
        'randomforestclassifier__min_samples_split': min_samples_split,
        'randomforestclassifier__min_samples_leaf': min_samples_leaf,
        'randomforestclassifier__class_weight': class_weight,
        'randomforestclassifier__bootstrap': bootstrap}

print(subGrid)


{'randomforestclassifier__n_estimators': [1202, 1204, 1206, 1208], 'randomforestclassifier__max_features': ['auto'], 'randomforestclassifier__max_depth': [50, 52], 'randomforestclassifier__min_samples_split': [8, 10, 12], 'randomforestclassifier__min_samples_leaf': [1, 2, 3], 'randomforestclassifier__class_weight': ['balanced'], 'randomforestclassifier__bootstrap': [False]}


In [21]:
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

#################################
# Sub Grid Search
#################################

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

rfSubGridEstimator = make_pipeline(StandardScaler(), RandomForestClassifier(random_state=999, n_jobs=-1)) 

rfSubGridModel = GridSearchCV(estimator = rfSubGridEstimator, 
                              param_grid= subGrid,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
rfSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (rfSubGridModel.best_params_, rfSubGridModel.best_score_))



Fitting 10 folds for each of 72 candidates, totalling 720 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 24 concurrent workers.
[Parallel(n_jobs=-1)]: Done 114 tasks      | elapsed: 17.7min
[Parallel(n_jobs=-1)]: Done 317 tasks      | elapsed: 46.1min
[Parallel(n_jobs=-1)]: Done 600 tasks      | elapsed: 85.8min
[Parallel(n_jobs=-1)]: Done 720 out of 720 | elapsed: 100.5min finished


The best parameters are {'randomforestclassifier__bootstrap': False, 'randomforestclassifier__class_weight': 'balanced', 'randomforestclassifier__max_depth': 50, 'randomforestclassifier__max_features': 'auto', 'randomforestclassifier__min_samples_leaf': 2, 'randomforestclassifier__min_samples_split': 10, 'randomforestclassifier__n_estimators': 1206} with a score of 0.62
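
Because each sub grid search overwrites `rfSubGridModel`, it can help to record the printed winners side by side. The sketch below copies the best scores and parameters from the logs above into a plain dictionary and picks the highest F1. The "grid 3" to "grid 5" labels follow the cell header comments; "earlier sub grid" is a label chosen here for the first run. The scores are rounded to two decimals in the logs, so small gaps between runs should be read with caution.

```python
# Best mean cross-validated F1 and parameters, as printed by each sub grid search.
best_by_grid = {
    'earlier sub grid': {'score': 0.58, 'n_estimators': 1190, 'max_depth': 45,
                         'min_samples_split': 25, 'min_samples_leaf': 11},
    'grid 3': {'score': 0.61, 'n_estimators': 1196, 'max_depth': 48,
               'min_samples_split': 13, 'min_samples_leaf': 1},
    'grid 4': {'score': 0.59, 'n_estimators': 1250, 'max_depth': 49,
               'min_samples_split': 5, 'min_samples_leaf': 8},
    'grid 5': {'score': 0.62, 'n_estimators': 1206, 'max_depth': 50,
               'min_samples_split': 10, 'min_samples_leaf': 2},
}

# Pick the run whose best candidate has the highest (rounded) F1.
winner = max(best_by_grid, key=lambda name: best_by_grid[name]['score'])

print(winner, best_by_grid[winner])
```

On these numbers the two runs that allowed small leaf sizes (`min_samples_leaf` of 1 or 2) scored highest, which suggests, but does not prove, that leaf size matters more here than the exact tree count.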


In [15]:
#### Start XGBoost ####
import xgboost as xgb
from xgboost.sklearn import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

#class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, 
#                            silent=True, objective='binary:logistic', booster='gbtree', n_jobs=1,
#                            nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1,
#                            colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, 
#                            scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)

xgb_baseModel = XGBClassifier(n_jobs=-1, random_state=999)
scores = cross_validate(xgb_baseModel, X, y=y, cv=stratified_cv_object, n_jobs=-1, scoring=scoring)
display(pd.DataFrame(scores))

fold,fit_time,score_time,test_F1_Score,train_F1_Score,test_AUC,train_AUC,test_Accuracy,train_Accuracy,test_Precision,train_Precision,test_Recall,train_Recall
0,4.16582,0.147896,0.477972,0.497419,0.922039,0.930914,0.904346,0.908483,0.661102,0.695541,0.374291,0.387143
1,4.179397,0.144644,0.460695,0.512528,0.921989,0.93125,0.902134,0.910169,0.64837,0.701726,0.357278,0.403687
2,4.347942,0.164581,0.507126,0.489957,0.924644,0.930365,0.908216,0.907321,0.682109,0.68774,0.403592,0.380525
3,4.168341,0.14933,0.46966,0.500827,0.924256,0.931075,0.903351,0.908179,0.655932,0.687861,0.365784,0.39376
4,4.198545,0.158289,0.483568,0.504308,0.920983,0.930557,0.902687,0.90934,0.637771,0.699664,0.389414,0.394233
5,4.241407,0.16005,0.483557,0.500831,0.928431,0.929023,0.906226,0.908676,0.679795,0.694468,0.375236,0.391633
6,4.236522,0.178961,0.457846,0.509183,0.91994,0.930568,0.902577,0.909119,0.656085,0.691403,0.351607,0.402978
7,4.310461,0.494365,0.479369,0.498946,0.92059,0.930127,0.90512,0.907985,0.669492,0.687267,0.373346,0.391633
8,4.149833,0.16108,0.483262,0.5061,0.918979,0.930944,0.906115,0.90934,0.678632,0.697674,0.375236,0.397069
9,4.18522,0.164816,0.47619,0.502187,0.92009,0.930367,0.902687,0.908731,0.643087,0.69375,0.378072,0.393524
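
The table shows accuracy near 0.90 alongside recall below 0.40, which is why F1 is a more informative target than accuracy for this baseline. As a quick check on how the columns relate, the sketch below recomputes F1 for the first row from its test precision and recall. The helper name `f1_from_precision_recall` is introduced here for illustration only.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; returns 0.0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# First row of the table above: test_Precision and test_Recall.
test_precision = 0.661102
test_recall = 0.374291

f1 = f1_from_precision_recall(test_precision, test_recall)
print(round(f1, 6))
```

The result agrees with the reported `test_F1_Score` of 0.477972 to within rounding, and it shows that the low recall, not the precision, is what holds F1 down.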


In [16]:
#####################################################################################
# Create randomized grid for Boosting Params only for now with default tree params
#####################################################################################

# Number of boosted trees to fit.
n_estimators = [int(x) for x in np.arange(10, 1000, 20)]

# Boosting learning rate (xgb’s “eta”)
learning_rate = [x for x in np.arange(0.05, 3, 0.05)]


# Create the random grid for boosting params
random_grid_boosting = {'n_estimators': n_estimators,
              'learning_rate': learning_rate}

print(random_grid_boosting)

{'n_estimators': [10, 30, 50, 70, 90, 110, 130, 150, 170, 190, 210, 230, 250, 270, 290, 310, 330, 350, 370, 390, 410, 430, 450, 470, 490, 510, 530, 550, 570, 590, 610, 630, 650, 670, 690, 710, 730, 750, 770, 790, 810, 830, 850, 870, 890, 910, 930, 950, 970, 990], 'learning_rate': [0.05, 0.1, 0.15000000000000002, 0.2, 0.25, 0.3, 0.35000000000000003, 0.4, 0.45, 0.5, 0.55, 0.6000000000000001, 0.6500000000000001, 0.7000000000000001, 0.7500000000000001, 0.8, 0.8500000000000001, 0.9000000000000001, 0.9500000000000001, 1.0, 1.05, 1.1, 1.1500000000000001, 1.2000000000000002, 1.2500000000000002, 1.3, 1.35, 1.4000000000000001, 1.4500000000000002, 1.5000000000000002, 1.55, 1.6, 1.6500000000000001, 1.7000000000000002, 1.7500000000000002, 1.8, 1.85, 1.9000000000000001, 1.9500000000000002, 2.0, 2.05, 2.1, 2.15, 2.1999999999999997, 2.25, 2.3, 2.35, 2.4, 2.45, 2.5, 2.55, 2.6, 2.65, 2.7, 2.75, 2.8, 2.85, 2.9, 2.95]}
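
Values such as `0.15000000000000002` in the printed grid come from accumulating a floating-point step in `np.arange`. This is cosmetic for the search itself, but it makes the printed best parameters harder to read. One possible alternative, sketched here with plain Python and no change to the grid's intended values, is to step over integers and then scale and round.

```python
# Step over integers, then scale and round to two decimals,
# so 0.15 is stored as 0.15 rather than 0.15000000000000002.
learning_rate = [round(0.05 * i, 2) for i in range(1, 60)]

# The tree counts are integers already, so a plain range is enough.
n_estimators = list(range(10, 1000, 20))

print(learning_rate[:4], learning_rate[-1], len(learning_rate))
print(n_estimators[0], n_estimators[-1], len(n_estimators))
```

This reproduces the same 59 learning rates (0.05 to 2.95) and 50 tree counts (10 to 990) as the grid above, only with cleaner float values.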


In [18]:
# Finding the optimal value of the n_estimators param for a given learning rate
from sklearn.model_selection import RandomizedSearchCV
#################################
# Random Search Training
#################################

# Use the random grid to search for best hyperparameters for Boosting while keeping tree 
# parameters to default values

xgbEstimator = XGBClassifier(n_jobs=-1, random_state=999)

# Perform Random Grid search using Stratified Shuffle Split CV Object.
xgb_randomgrid = RandomizedSearchCV(estimator = xgbEstimator, param_distributions = random_grid_boosting, 
                                   n_iter = 250, 
                                   cv = stratified_cv_object,
                                   verbose=2, 
                                   random_state=999, 
                                   n_jobs = -1,
                                   scoring=scoring,
                                   refit='F1_Score', \
                                   return_train_score=True)


# Fit the random search model
xgb_randomgrid.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (xgb_randomgrid.best_params_, xgb_randomgrid.best_score_))

Fitting 10 folds for each of 250 candidates, totalling 2500 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:   32.5s
[Parallel(n_jobs=-1)]: Done 237 tasks      | elapsed:  3.0min
[Parallel(n_jobs=-1)]: Done 520 tasks      | elapsed:  6.8min
[Parallel(n_jobs=-1)]: Done 885 tasks      | elapsed: 11.6min
[Parallel(n_jobs=-1)]: Done 1330 tasks      | elapsed: 17.2min
[Parallel(n_jobs=-1)]: Done 1857 tasks      | elapsed: 23.4min
[Parallel(n_jobs=-1)]: Done 2500 out of 2500 | elapsed: 31.7min finished


The best parameters are {'n_estimators': 770, 'learning_rate': 0.35000000000000003} with a score of 0.55
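
A randomized search only samples part of the grid, so it is worth knowing how much. The sketch below works out the coverage of this run from the grid sizes printed earlier and `n_iter = 250`. The fold count of 10 is an assumption taken from the "Fitting 10 folds" log line.

```python
# Grid sizes from the printed random grid: 50 tree counts, 59 learning rates.
n_estimators_values = len(range(10, 1000, 20))
learning_rate_values = len(range(1, 60))  # 0.05, 0.10, ..., 2.95

total_combinations = n_estimators_values * learning_rate_values

n_iter = 250    # candidates sampled by RandomizedSearchCV in the cell above
n_splits = 10   # assumption: taken from the "Fitting 10 folds" log line

n_fits = n_iter * n_splits
coverage = n_iter / total_combinations

print(total_combinations, n_fits, round(coverage, 3))
```

Roughly 8.5% of the 2,950 combinations were tried, which is one reason to follow up with a dense grid around the sampled optimum.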


In [16]:
## Running Grid Search for boosting parameters in vicinity of values found during Random Grid Search
# Number of boosted trees to fit.

n_estimators = [int(x) for x in np.arange(750, 780, 3)]

# Boosting learning rate (xgb’s “eta”)
learning_rate = [x for x in np.arange(0.28, 0.40, 0.02)]


# Create the sub grid for boosting params
sub_grid_boosting = {'n_estimators': n_estimators,
              'learning_rate': learning_rate}

print(sub_grid_boosting)

{'n_estimators': [750, 753, 756, 759, 762, 765, 768, 771, 774, 777], 'learning_rate': [0.28, 0.30000000000000004, 0.32000000000000006, 0.3400000000000001, 0.3600000000000001, 0.3800000000000001]}


In [17]:
######################################
# Sub Grid Search for boosting params
######################################

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

xgbSubGridEstimator = XGBClassifier(n_jobs=-1, random_state=999)

xgbSubGridModel = GridSearchCV(estimator = xgbSubGridEstimator, 
                              param_grid = sub_grid_boosting,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
xgbSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (xgbSubGridModel.best_params_, xgbSubGridModel.best_score_))

Fitting 10 folds for each of 60 candidates, totalling 600 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 237 tasks      | elapsed:  5.1min
[Parallel(n_jobs=-1)]: Done 600 out of 600 | elapsed: 12.0min finished


The best parameters are {'learning_rate': 0.30000000000000004, 'n_estimators': 771} with a score of 0.55


In [16]:
########################################################################################
# Create randomized grid for 2 Important Tree Params using Boosting Params obtained 
# via Grid Search
########################################################################################

# Maximum tree depth for base learners.
max_depth = [int(x) for x in np.arange(3, 11)]

# Minimum sum of instance weight(hessian) needed in a child.
min_child_weight = [int(x) for x in np.arange(1, 7)]


# Create the sub grid for tree params
sub_grid_tree = {'max_depth': max_depth,
              'min_child_weight': min_child_weight}

print(sub_grid_tree)

{'max_depth': [3, 4, 5, 6, 7, 8, 9, 10], 'min_child_weight': [1, 2, 3, 4, 5, 6]}


In [17]:
############################################################
# Sub Grid Search for max_depth and min_child_weight params
############################################################

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

xgbSubGridEstimator = XGBClassifier(learning_rate = 0.30, n_estimators = 771, n_jobs=-1, random_state=999)

xgbSubGridModel = GridSearchCV(estimator = xgbSubGridEstimator, 
                              param_grid = sub_grid_tree,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
xgbSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (xgbSubGridModel.best_params_, xgbSubGridModel.best_score_))

Fitting 10 folds for each of 48 candidates, totalling 480 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 237 tasks      | elapsed:  6.9min
[Parallel(n_jobs=-1)]: Done 480 out of 480 | elapsed: 17.9min finished


The best parameters are {'max_depth': 5, 'min_child_weight': 1} with a score of 0.55


In [21]:
## Tune gamma parameter value
gamma_vals = {
 'gamma':[i/10.0 for i in range(0,7)]
}

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

xgbSubGridEstimator = XGBClassifier(learning_rate = 0.30, n_estimators = 771, max_depth = 5, \
                                    min_child_weight = 1, n_jobs=-1, random_state=999)

xgbSubGridModel = GridSearchCV(estimator = xgbSubGridEstimator, 
                              param_grid = gamma_vals,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
xgbSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (xgbSubGridModel.best_params_, xgbSubGridModel.best_score_))


Fitting 10 folds for each of 7 candidates, totalling 70 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  15 out of  70 | elapsed:  1.6min remaining:  5.8min
[Parallel(n_jobs=-1)]: Done  51 out of  70 | elapsed:  1.8min remaining:   39.5s
[Parallel(n_jobs=-1)]: Done  70 out of  70 | elapsed:  2.3min finished


The best parameters are {'gamma': 0.0} with a score of 0.55
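
The XGBoost tuning so far proceeds in stages: boosting parameters first, then tree shape, then `gamma`, each time fixing the winners from the previous stage. The sketch below compares the cost of that staged approach with a single joint grid over the same values, using the grid sizes printed above. The fold count of 10 is an assumption taken from the logs.

```python
# Number of values per parameter in each stage, from the printed grids.
stages = {
    'boosting': {'n_estimators': 10, 'learning_rate': 6},
    'tree': {'max_depth': 8, 'min_child_weight': 6},
    'gamma': {'gamma': 7},
}
n_splits = 10  # assumption: taken from the "Fitting 10 folds" log lines

def stage_size(sizes):
    """Number of candidates in one exhaustive grid."""
    total = 1
    for count in sizes.values():
        total *= count
    return total

# Staged search: the stages run one after another, so candidates add up.
staged_candidates = sum(stage_size(sizes) for sizes in stages.values())

# Joint search: every combination across all stages, so candidates multiply.
joint_candidates = 1
for sizes in stages.values():
    joint_candidates *= stage_size(sizes)

print(staged_candidates * n_splits, joint_candidates * n_splits)
```

The staged route costs 1,150 fits (600 + 480 + 70, as in the logs) against 201,600 for the joint grid. The saving relies on the parameters not interacting strongly, so the staged result may miss the joint optimum.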


In [22]:
## Tune Subsample and colsample_bytree tree params
param_test_2vars = {
 'subsample':[i/10.0 for i in range(3,10)],
 'colsample_bytree':[i/10.0 for i in range(3,10)]
}

scoring = {'F1_Score': 'f1', 'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score), 'Precision': 'precision', \
          'Recall': 'recall'}

xgbSubGridEstimator = XGBClassifier(learning_rate = 0.30, n_estimators = 771, max_depth = 5, \
                                    min_child_weight = 1, gamma = 0, n_jobs=-1, random_state=999)

xgbSubGridModel = GridSearchCV(estimator = xgbSubGridEstimator, 
                              param_grid = param_test_2vars,  
                              cv = stratified_cv_object,
                              verbose=2, 
                              n_jobs = -1,
                              scoring=scoring,
                              refit='F1_Score', 
                              return_train_score=True)


# Fit the grid search model
xgbSubGridModel.fit(X, y=y)

print("The best parameters are %s with a score of %0.2f"
      % (xgbSubGridModel.best_params_, xgbSubGridModel.best_score_))

Fitting 10 folds for each of 49 candidates, totalling 490 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 64 concurrent workers.
[Parallel(n_jobs=-1)]: Done  34 tasks      | elapsed:  1.3min
[Parallel(n_jobs=-1)]: Done 237 tasks      | elapsed:  6.0min
[Parallel(n_jobs=-1)]: Done 490 out of 490 | elapsed: 12.5min finished


The best parameters are {'colsample_bytree': 0.9, 'subsample': 0.8} with a score of 0.54
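
The best score at this stage (0.54) is slightly below the 0.55 reported by the earlier stages. One possible explanation is that `range(3,10)` stops at 0.9, so the defaults `subsample=1` and `colsample_bytree=1` (listed in the commented signature earlier) were never candidates. The sketch below shows the gap and a hypothetical extended range that includes the default. It does not rerun the search, so whether 1.0 would actually win is untested.

```python
# Grid as defined in the cell above: 0.3, 0.4, ..., 0.9.
subsample_grid = [i / 10.0 for i in range(3, 10)]
default_in_grid = 1.0 in subsample_grid

# Hypothetical alternative: one more step so the default 1.0 is also tried.
subsample_grid_with_default = [i / 10.0 for i in range(3, 11)]

print(subsample_grid, default_in_grid)
print(subsample_grid_with_default)
```

Keeping the previous stage's setting in every new grid means a later stage cannot score worse than the stage before it, up to the rounding in the printed score.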
