# Exercise - Ensemble

In this exercise, we will focus on underage drinking. The data set contains data about high school students. Each row represents a single student. The columns include the characteristics of deidentified students. This is a binary classification task: predict whether a student drinks alcohol or not (this is the **alc** column: 1=Yes, 0=No). This is an important prediction task to detect underage drinking and deploy intervention techniques. 

## Description of Variables

The description of variables are provided in "Alcohol - Data Dictionary.docx"

## Goal

Use the **alcohol.csv** data set and build a model to predict **alc**. 

# Read and Prepare the Data

In [1]:
# Common imports

import pandas as pd
import numpy as np

import time

np.random.seed(1)

pd.set_option('display.max_colwidth', None)


# Get the data

In [2]:
#We will predict the "price" value in the data set:

alcohol = pd.read_csv("alcohol.csv")
alcohol.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,health,absences,gender,alc
0,18,2,1,4,2,0,5,4,2,5,2,M,1
1,18,4,3,1,0,0,4,4,2,3,9,M,1
2,15,4,3,2,3,0,5,3,4,5,0,F,0
3,15,3,3,1,4,0,4,3,3,3,10,F,0
4,17,3,2,1,2,0,5,3,5,5,2,M,1


In [3]:
## Identify any issues with data imbalance

alcohol['alc'].value_counts() # we can see that these are a bit imbalanced, but nothing to be too concerned about. If the imbalance was greater, use one of the techniques to balance the data that we discussed in data mining.


# If you had not seen how to address data imbalance, you would do this only on the test set (so later on in this code). See the section later in this document. 


alc
0    17757
1    16243
Name: count, dtype: int64

## Feature Engineering: Derive a new column

Examples:
- Ratio of study time to travel time
- Student is younger than 18 or not
- Average of father's and mother's level of education
- (etc.)

In [4]:
alcohol['study_2_travel'] = (alcohol['studytime'] / alcohol['traveltime']).replace([np.inf, -np.inf], np.nan)
alcohol['younger_than_18'] = (alcohol['age'] < 18).astype(int)
alcohol['avg_edu'] = (alcohol['Medu'] + alcohol['Fedu']) / 2

alcohol.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,health,absences,gender,alc,study_2_travel,younger_than_18,avg_edu
0,18,2,1,4,2,0,5,4,2,5,2,M,1,0.5,0,1.5
1,18,4,3,1,0,0,4,4,2,3,9,M,1,0.0,0,3.5
2,15,4,3,2,3,0,5,3,4,5,0,F,0,1.5,1,3.5
3,15,3,3,1,4,0,4,3,3,3,10,F,0,4.0,1,3.0
4,17,3,2,1,2,0,5,3,5,5,2,M,1,2.0,1,2.5


In [5]:
# encode gender M and F to 1 and 0 respectively

alcohol = pd.get_dummies(alcohol, columns=['gender', 'alc'], drop_first=True, dtype='int')

alcohol.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,health,absences,study_2_travel,younger_than_18,avg_edu,gender_M,alc_1
0,18,2,1,4,2,0,5,4,2,5,2,0.5,0,1.5,1,1
1,18,4,3,1,0,0,4,4,2,3,9,0.0,0,3.5,1,1
2,15,4,3,2,3,0,5,3,4,5,0,1.5,1,3.5,0,0
3,15,3,3,1,4,0,4,3,3,3,10,4.0,1,3.0,0,0
4,17,3,2,1,2,0,5,3,5,5,2,2.0,1,2.5,1,1


In [6]:
alcohol = alcohol.rename(columns={'alc_1': 'alc_use'})

alcohol.head()

Unnamed: 0,age,Medu,Fedu,traveltime,studytime,failures,famrel,freetime,goout,health,absences,study_2_travel,younger_than_18,avg_edu,gender_M,alc_use
0,18,2,1,4,2,0,5,4,2,5,2,0.5,0,1.5,1,1
1,18,4,3,1,0,0,4,4,2,3,9,0.0,0,3.5,1,1
2,15,4,3,2,3,0,5,3,4,5,0,1.5,1,3.5,0,0
3,15,3,3,1,4,0,4,3,3,3,10,4.0,1,3.0,0,0
4,17,3,2,1,2,0,5,3,5,5,2,2.0,1,2.5,1,1


# Data Prep

In [7]:
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer

In [8]:
# Split into X and y

y = alcohol['alc_use']
X = alcohol.drop('alc_use', axis=1)


##  Identify the numeric, binary, and categorical columns

In [9]:
# Identify the numerical columns
numeric_columns = X.select_dtypes('number').columns.to_list()

# Identify the categorical columns
categorical_columns = X.select_dtypes('object').columns.to_list()

In [10]:
numeric_columns

['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'famrel',
 'freetime',
 'goout',
 'health',
 'absences',
 'study_2_travel',
 'younger_than_18',
 'avg_edu',
 'gender_M']

In [11]:
categorical_columns

[]

In [12]:
binary_columns = [col for col in X.columns if X[col].nunique() == 2]
binary_columns

['younger_than_18', 'gender_M']

In [13]:
for binary_col in binary_columns:
    numeric_columns.remove(binary_col)
    
numeric_columns


['age',
 'Medu',
 'Fedu',
 'traveltime',
 'studytime',
 'failures',
 'famrel',
 'freetime',
 'goout',
 'health',
 'absences',
 'study_2_travel',
 'avg_edu']

# Split data (train/test)

In [14]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

## Address any data imbalance issues

> [!NOTE]
> See presentation for more details on the pros and cons of each technique. It's up to you to decide which one to use, and justify why you chose it. Another approach would be to train the models on resampled data using each of the 4 techniques and report which approach resulted in the best outcome. Also, it's important to note that you should not resample the test data, only the training data.


In [15]:

# There are three main techniques to balance the data (see powerpoint presentation for more details on these techniques):
# 1. Random Over Sampling
# 2. Random Under Sampling
# 3. SMOTE (Synthetic Minority Over-sampling Technique)
# 4. ADASYN (Adaptive Synthetic Sampling)

# from imblearn.over_sampling import RandomOverSampler
# from imblearn.under_sampling import RandomUnderSampler
# from imblearn.over_sampling import SMOTE
# from imblearn.over_sampling import ADASYN

# ros = RandomOverSampler(random_state=0)
# rus = RandomUnderSampler(random_state=0)
# smote = SMOTE(random_state=0)
# adasyn = ADASYN(random_state=0)

# X_train_resampled, y_train_resampled = ros.fit_resample(X_train, y_train)
# X_train_resampled, y_train_resampled = rus.fit_resample(X_train, y_train)
# X_train_resampled, y_train_resampled = smote.fit_resample(X_train, y_train)
# X_train_resampled, y_train_resampled = adasyn.fit_resample(X_train, y_train)

# Pipeline

In [16]:
numeric_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='median')),
        ('scaler', StandardScaler())
    ]
)

In [17]:
categorical_transformer = Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value=-1)),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]
)

In [18]:
binary_transformer = Pipeline( steps=[
        ('imputer', SimpleImputer(strategy='most_frequent'))
    ]
)

In [19]:
preprocessor = ColumnTransformer(
    [
        ('num', numeric_transformer, numeric_columns),
        ('cat', categorical_transformer, categorical_columns), # we don't have any categorical columns in this data set, so we don't need to include this line
        ('binary', binary_transformer, binary_columns)        
    ],
    remainder='passthrough'
)

# Transform: fit_transform() for TRAIN

In [20]:
#Fit and transform the train data
X_train = preprocessor.fit_transform(X_train)

In [21]:
X_train.shape

(23800, 15)

# Tranform: transform() for TEST

In [22]:
# Transform the test data
X_test = preprocessor.transform(X_test)

X_test

array([[-1.23984621,  0.33104402,  1.76705606, ...,  1.01168573,
         1.        ,  0.        ],
       [-1.23984621, -0.30388608,  0.04019664, ..., -0.1702172 ,
         1.        ,  1.        ],
       [-0.28670367,  0.33104402,  0.04019664, ...,  0.22375045,
         1.        ,  1.        ],
       ...,
       [ 0.66643886, -0.30388608,  0.04019664, ..., -0.1702172 ,
         1.        ,  0.        ],
       [-1.23984621, -0.93881619,  0.04019664, ..., -0.56418484,
         1.        ,  1.        ],
       [-1.23984621,  0.96597412,  0.04019664, ...,  0.61771809,
         1.        ,  0.        ]])

In [23]:
X_test.shape

(10200, 15)

# Develop Models

## Create dataframe to store results


In [24]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
from sklearn.model_selection import RandomizedSearchCV
from sklearn.linear_model import LogisticRegression

results = pd.DataFrame(columns=['Model', 'Duration', 'Accuracy', 'Precision', 'Recall', 'F1', 'AUC', 'Best Parameters'])

iters = 5
folds = 2

## Train a Logistic Regress Classifier (use random search hyperparameter tuning)

In [25]:
start = time.time()

# setup parameters for RandomizedSearchCV for Logistic Regression
param_distributions = {
    'C': np.logspace(-4, 4, 100),
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']
}

log_reg = LogisticRegression()
log_reg_cv = RandomizedSearchCV(log_reg, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
log_reg_cv.fit(X_train, y_train)
model01 = log_reg_cv.best_estimator_

# calculate accuracy, precision, recall, f1, auc
y_pred = model01.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

# Pandas used to have a append method, but it is now deprecated. The recommended way is to use the concat method or the following method:
results.loc[len(results.index)] = ['Logistic Regression', end-start, accuracy, precision, recall, f1, auc, str(log_reg_cv.best_params_)]
results


Fitting 2 folds for each of 5 candidates, totalling 10 fits


Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"


## Train a random forest classifier (use random search hyperparameter tuning)

In [26]:
start = time.time()

# set up parameters for RandomizedSearchCV for Random Forest

from sklearn.ensemble import RandomForestClassifier

param_distributions = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    'max_features': ['auto', 'sqrt'],
    'max_depth': [int(x) for x in np.linspace(2, 100, num=2)] + [None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
    'bootstrap': [True, False]
}

rf = RandomForestClassifier()
rf_cv = RandomizedSearchCV(rf, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
rf_cv.fit(X_train, y_train)
model02 = rf_cv.best_estimator_

# calculate accuracy, precision, recall, f1, auc
y_pred = model02.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

results.loc[len(results.index)] = ['Random Forest', end-start, accuracy, precision, recall, f1, auc, str(rf_cv.best_params_)]

results
    

Fitting 2 folds for each of 5 candidates, totalling 10 fits


Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"


## Train an adaboost classifier (use random search hyperparameter tuning)

In [27]:
start = time.time()

# set up parameters for RandomizedSearchCV for adabooost

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

param_distributions = {
    'estimator': [DecisionTreeClassifier(max_depth=1), DecisionTreeClassifier(max_depth=2), DecisionTreeClassifier(max_depth=3), DecisionTreeClassifier(max_depth=4)],
    'n_estimators': [50, 100, 200, 500],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1.0]
}

ada = AdaBoostClassifier()
ada_cv = RandomizedSearchCV(ada, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
ada_cv.fit(X_train, y_train)
model03 = ada_cv.best_estimator_

# calculate accuracy, precision, recall, f1, auc
y_pred = model03.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

results.loc[len(results.index)] = ['AdaBoost', end-start, accuracy, precision, recall, f1, auc, str(ada_cv.best_params_)]

results

Fitting 2 folds for each of 5 candidates, totalling 10 fits


Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"
2,AdaBoost,6.462828,0.826078,0.82381,0.81171,0.817715,0.825541,"{'n_estimators': 500, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=2)}"


## KNN Classifier

In [28]:
# set up parameters for RandomizedSearchCV for KNN  (this is a slow process)

start = time.time()

from sklearn.neighbors import KNeighborsClassifier

param_distributions = {
    'n_neighbors': [3, 5, 7, 9, 11],
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'p': [1, 2]
}

knn = KNeighborsClassifier()

knn_cv = RandomizedSearchCV(knn, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
knn_cv.fit(X_train, y_train)
model04 = knn_cv.best_estimator_

# calculate accuracy, precision, recall, f1, auc
y_pred = model04.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()
results.loc[len(results.index)] = ['KNN', end-start, accuracy, precision, recall, f1, auc, str(knn_cv.best_params_)]
results

Fitting 2 folds for each of 5 candidates, totalling 10 fits


Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"
2,AdaBoost,6.462828,0.826078,0.82381,0.81171,0.817715,0.825541,"{'n_estimators': 500, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=2)}"
3,KNN,2.696353,0.809118,0.808133,0.790494,0.799216,0.808422,"{'weights': 'uniform', 'p': 2, 'n_neighbors': 11, 'algorithm': 'auto'}"


## XGBoost Classifier

In [29]:
# set up parameters for RandomizedSearchCV for XGBClassifier

start = time.time()

from xgboost import XGBClassifier

param_distributions = {
    'n_estimators': [int(x) for x in np.linspace(start=200, stop=2000, num=10)],
    'max_depth': [int(x) for x in np.linspace(2, 100, num=2)] + [None],
    'learning_rate': [0.01, 0.05, 0.1, 0.5, 1.0],
    'subsample': [0.5, 0.75, 1.0],
    'colsample_bytree': [0.5, 0.75, 1.0],
    'gamma': [0, 1, 5],
    'reg_alpha': [0, 1, 5],
    'reg_lambda': [0, 1, 5]
}

xgb = XGBClassifier()
xgb_cv = RandomizedSearchCV(xgb, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)
xgb_cv.fit(X_train, y_train)
model05 = xgb_cv.best_estimator_

# calculate accuracy, precision, recall, f1, auc
y_pred = model05.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

results.loc[len(results.index)] = ['XGBClassifier', end-start, accuracy, precision, recall, f1, auc, str(xgb_cv.best_params_)]
results

Fitting 2 folds for each of 5 candidates, totalling 10 fits


Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"
2,AdaBoost,6.462828,0.826078,0.82381,0.81171,0.817715,0.825541,"{'n_estimators': 500, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=2)}"
3,KNN,2.696353,0.809118,0.808133,0.790494,0.799216,0.808422,"{'weights': 'uniform', 'p': 2, 'n_neighbors': 11, 'algorithm': 'auto'}"
4,XGBClassifier,3.602068,0.827059,0.82471,0.812933,0.81878,0.826531,"{'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0, 'n_estimators': 1000, 'max_depth': None, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.75}"


## SVC

Uncomment the code below to train a SVC model. This model is computationally expensive and may take a long time to train. 


In [30]:
"""

# set up parameters for RandomizedSearchCV for SVM  (this is a slow process)

start = time.time()

from sklearn.svm import SVC

param_distributions = {
    'C': [0.1, 1, 2, 5, 10, 15, 20, 40, 80, 100],
    'gamma': [1, 0.5, 0.1, 0.05, 0.01, 0.001],
    'kernel': ['rbf', 'poly', 'sigmoid']
}

svc = SVC()

svc_cv = RandomizedSearchCV(svc, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)

svc_cv.fit(X_train, y_train)
model06 = svc_cv.best_estimator_

# calculate accuracy, precision, recall, f1, auc
y_pred = model06.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

results.loc[len(results.index)] = ['SVM', end-start, accuracy, precision, recall, f1, auc, str(svc_cv.best_params_)]
results

"""

"\n\n# set up parameters for RandomizedSearchCV for SVM  (this is a slow process)\n\nstart = time.time()\n\nfrom sklearn.svm import SVC\n\nparam_distributions = {\n    'C': [0.1, 1, 2, 5, 10, 15, 20, 40, 80, 100],\n    'gamma': [1, 0.5, 0.1, 0.05, 0.01, 0.001],\n    'kernel': ['rbf', 'poly', 'sigmoid']\n}\n\nsvc = SVC()\n\nsvc_cv = RandomizedSearchCV(svc, param_distributions, n_iter=iters, cv=folds, scoring='f1', verbose=1, n_jobs=-1, random_state=42)\n\nsvc_cv.fit(X_train, y_train)\nmodel06 = svc_cv.best_estimator_\n\n# calculate accuracy, precision, recall, f1, auc\ny_pred = model06.predict(X_test)\n\naccuracy = accuracy_score(y_test, y_pred)\nprecision = precision_score(y_test, y_pred, zero_division=0)\nrecall = recall_score(y_test, y_pred, zero_division=0)\nf1 = f1_score(y_test, y_pred, zero_division=0)\nauc = roc_auc_score(y_test, y_pred)\n\nend = time.time()\n\nresults.loc[len(results.index)] = ['SVM', end-start, accuracy, precision, recall, f1, auc, str(svc_cv.best_params_)]\nre

## Train a Voting Classifier using previous models (test both soft and hard voting)

### Hard Voting

This is the default behavior of the VotingClassifier. In hard voting, the predicted output class is a class with the highest majority of votes i.e the class which had the highest probability of being predicted by each of the classifiers. Suppose three classifiers predicted the output class(A, A, B), so here by majority class A has been predicted.


In [31]:
start = time.time()

# train a voting classifier using the three models (model01, model02, model03)

from sklearn.ensemble import VotingClassifier

voting_clf = VotingClassifier(
#    estimators=[('lr', model01), ('rf', model02), ('ada', model03), ('knn', model04), ('xgb', model05), ('svc', model06)],
    estimators=[('lr', model01), ('rf', model02), ('ada', model03), ('knn', model04), ('xgb', model05)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

# calculate accuracy, precision, recall, f1, auc
y_pred = voting_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

results.loc[len(results.index)] = ['Voting Classifier-Hard', end-start, accuracy, precision, recall, f1, auc, '']

results

Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"
2,AdaBoost,6.462828,0.826078,0.82381,0.81171,0.817715,0.825541,"{'n_estimators': 500, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=2)}"
3,KNN,2.696353,0.809118,0.808133,0.790494,0.799216,0.808422,"{'weights': 'uniform', 'p': 2, 'n_neighbors': 11, 'algorithm': 'auto'}"
4,XGBClassifier,3.602068,0.827059,0.82471,0.812933,0.81878,0.826531,"{'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0, 'n_estimators': 1000, 'max_depth': None, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.75}"
5,Voting Classifier-Hard,6.679722,0.826765,0.824065,0.813137,0.818565,0.826255,


### Soft Voting

This voting classifier predicts the class label based on the argmax of the sums of the predicted probabilities. Soft voting takes into account the probability of each label. It predicts the class label based on the argmax of the sum of the predicted probabilities. 

In [32]:
start = time.time()

# train a voting classifier using the three models (model01, model02, model03, model04, model05, model06)

voting_clf = VotingClassifier(
#    estimators=[('lr', model01), ('rf', model02), ('ada', model03), ('knn', model04), ('xgb', model05), ('svc', model06)],
    estimators=[('lr', model01), ('rf', model02), ('ada', model03), ('knn', model04), ('xgb', model05)],
    voting='hard'
)

voting_clf.fit(X_train, y_train)

# calculate accuracy, precision, recall, f1, auc
y_pred = voting_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

results.loc[len(results.index)] = ['Voting Classifier-Soft', end-start, accuracy, precision, recall, f1, auc, '']

results


Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"
2,AdaBoost,6.462828,0.826078,0.82381,0.81171,0.817715,0.825541,"{'n_estimators': 500, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=2)}"
3,KNN,2.696353,0.809118,0.808133,0.790494,0.799216,0.808422,"{'weights': 'uniform', 'p': 2, 'n_neighbors': 11, 'algorithm': 'auto'}"
4,XGBClassifier,3.602068,0.827059,0.82471,0.812933,0.81878,0.826531,"{'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0, 'n_estimators': 1000, 'max_depth': None, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.75}"
5,Voting Classifier-Hard,6.679722,0.826765,0.824065,0.813137,0.818565,0.826255,
6,Voting Classifier-Soft,6.676889,0.826765,0.824199,0.812933,0.818527,0.826248,


## Train a StackedClassifier with the above models (minus the VotingClassifier)

In [33]:
start = time.time()

# train a stacking classifier using the three models (model01, model02, model03)

from sklearn.ensemble import StackingClassifier

stacking_clf = StackingClassifier(
#    estimators=[('lr', model01), ('rf', model02), ('ada', model03), ('knn', model04), ('xgb', model05), ('svc', model06)],
    estimators=[('lr', model01), ('rf', model02), ('ada', model03), ('knn', model04), ('xgb', model05)],
    final_estimator=LogisticRegression()
)

stacking_clf.fit(X_train, y_train)

# calculate accuracy, precision, recall, f1, auc
y_pred = stacking_clf.predict(X_test)

accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
f1 = f1_score(y_test, y_pred, zero_division=0)
auc = roc_auc_score(y_test, y_pred)

end = time.time()

results.loc[len(results.index)] = ['Stacking Classifier', end-start, accuracy, precision, recall, f1, auc, '']

results


Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"
2,AdaBoost,6.462828,0.826078,0.82381,0.81171,0.817715,0.825541,"{'n_estimators': 500, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=2)}"
3,KNN,2.696353,0.809118,0.808133,0.790494,0.799216,0.808422,"{'weights': 'uniform', 'p': 2, 'n_neighbors': 11, 'algorithm': 'auto'}"
4,XGBClassifier,3.602068,0.827059,0.82471,0.812933,0.81878,0.826531,"{'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0, 'n_estimators': 1000, 'max_depth': None, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.75}"
5,Voting Classifier-Hard,6.679722,0.826765,0.824065,0.813137,0.818565,0.826255,
6,Voting Classifier-Soft,6.676889,0.826765,0.824199,0.812933,0.818527,0.826248,
7,Stacking Classifier,32.927035,0.827647,0.825736,0.812933,0.819285,0.827097,


## Discuss the results of the models and the best model based on F1 score results.

In [34]:
results.to_csv('results.csv', index=False)     
results

Unnamed: 0,Model,Duration,Accuracy,Precision,Recall,F1,AUC,Best Parameters
0,Logistic Regression,0.702046,0.821373,0.822446,0.801306,0.811738,0.820623,"{'solver': 'liblinear', 'penalty': 'l2', 'C': 0.6280291441834259}"
1,Random Forest,8.487113,0.816667,0.812706,0.803754,0.808205,0.816184,"{'n_estimators': 200, 'min_samples_split': 2, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 100, 'bootstrap': True}"
2,AdaBoost,6.462828,0.826078,0.82381,0.81171,0.817715,0.825541,"{'n_estimators': 500, 'learning_rate': 0.1, 'estimator': DecisionTreeClassifier(max_depth=2)}"
3,KNN,2.696353,0.809118,0.808133,0.790494,0.799216,0.808422,"{'weights': 'uniform', 'p': 2, 'n_neighbors': 11, 'algorithm': 'auto'}"
4,XGBClassifier,3.602068,0.827059,0.82471,0.812933,0.81878,0.826531,"{'subsample': 1.0, 'reg_lambda': 0, 'reg_alpha': 0, 'n_estimators': 1000, 'max_depth': None, 'learning_rate': 0.01, 'gamma': 1, 'colsample_bytree': 0.75}"
5,Voting Classifier-Hard,6.679722,0.826765,0.824065,0.813137,0.818565,0.826255,
6,Voting Classifier-Soft,6.676889,0.826765,0.824199,0.812933,0.818527,0.826248,
7,Stacking Classifier,32.927035,0.827647,0.825736,0.812933,0.819285,0.827097,
