<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Initialization" data-toc-modified-id="Initialization-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Initialization</a></span></li><li><span><a href="#Read-Data" data-toc-modified-id="Read-Data-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Read Data</a></span></li><li><span><a href="#Preprocessing" data-toc-modified-id="Preprocessing-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>Preprocessing</a></span><ul class="toc-item"><li><span><a href="#Dummy-Encoding-of-Metric-Features" data-toc-modified-id="Dummy-Encoding-of-Metric-Features-3.1"><span class="toc-item-num">3.1&nbsp;&nbsp;</span>Dummy Encoding of Metric Features</a></span></li><li><span><a href="#Some-simple-helper-functions" data-toc-modified-id="Some-simple-helper-functions-3.2"><span class="toc-item-num">3.2&nbsp;&nbsp;</span>Some simple helper functions</a></span></li></ul></li><li><span><a href="#Classifiers" data-toc-modified-id="Classifiers-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Classifiers</a></span><ul class="toc-item"><li><span><a href="#Zero-Rule" data-toc-modified-id="Zero-Rule-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Zero Rule</a></span></li><li><span><a href="#Logistic-Regression" data-toc-modified-id="Logistic-Regression-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Logistic Regression</a></span></li><li><span><a href="#Linear-Discriminant-Analysis" data-toc-modified-id="Linear-Discriminant-Analysis-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Linear Discriminant Analysis</a></span></li><li><span><a href="#Gaussian-Naive-Bayes" data-toc-modified-id="Gaussian-Naive-Bayes-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Gaussian Naive Bayes</a></span></li><li><span><a href="#k-Nearest-Neighbor" data-toc-modified-id="k-Nearest-Neighbor-4.5"><span class="toc-item-num">4.5&nbsp;&nbsp;</span>k-Nearest Neighbor</a></span></li><li><span><a 
href="#Decision-Tree-with-Gini-Criterion-for-splitting" data-toc-modified-id="Decision-Tree-with-Gini-Criterion-for-splitting-4.6"><span class="toc-item-num">4.6&nbsp;&nbsp;</span>Decision Tree with Gini Criterion for splitting</a></span></li><li><span><a href="#Decision-Tree-with-Entropy-Criterion-for-splitting" data-toc-modified-id="Decision-Tree-with-Entropy-Criterion-for-splitting-4.7"><span class="toc-item-num">4.7&nbsp;&nbsp;</span>Decision Tree with Entropy Criterion for splitting</a></span></li><li><span><a href="#Random-Forest-with-Gini-Criterion-for-splitting" data-toc-modified-id="Random-Forest-with-Gini-Criterion-for-splitting-4.8"><span class="toc-item-num">4.8&nbsp;&nbsp;</span>Random Forest with Gini Criterion for splitting</a></span></li><li><span><a href="#Random-Forest-with-Entropy-Criterion-for-splitting" data-toc-modified-id="Random-Forest-with-Entropy-Criterion-for-splitting-4.9"><span class="toc-item-num">4.9&nbsp;&nbsp;</span>Random Forest with Entropy Criterion for splitting</a></span></li><li><span><a href="#Gradient-Boosting-Classifier" data-toc-modified-id="Gradient-Boosting-Classifier-4.10"><span class="toc-item-num">4.10&nbsp;&nbsp;</span>Gradient Boosting Classifier</a></span></li><li><span><a href="#Support-Vector-Machine-with-Linear-Kernel" data-toc-modified-id="Support-Vector-Machine-with-Linear-Kernel-4.11"><span class="toc-item-num">4.11&nbsp;&nbsp;</span>Support Vector Machine with Linear Kernel</a></span></li><li><span><a href="#Support-Vector-Machine-with-Radial-Basis-Function-Kernel" data-toc-modified-id="Support-Vector-Machine-with-Radial-Basis-Function-Kernel-4.12"><span class="toc-item-num">4.12&nbsp;&nbsp;</span>Support Vector Machine with Radial Basis Function Kernel</a></span></li></ul></li></ul></div>

# Classification Algorithms with Scikit-learn
This notebook fits and compares several classifiers from [scikit-learn](https://scikit-learn.org).  
The data is the
[Bank Marketing Dataset](https://archive.ics.uci.edu/ml/datasets/Bank+Marketing) from the UCI Machine Learning Repository.

## Initialization

In [None]:
# a few imports for preprocessing
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# the classifiers we want to evaluate
from sklearn.dummy import DummyClassifier # zero rule
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.svm import SVC

# a few quality measures
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

import warnings
warnings.filterwarnings('ignore')

## Read Data
The examples run on the balanced data set (`data_sets[2]`). To evaluate the algorithms on the original, unbalanced set, change the index below to `data_sets[1]` and re-run the notebook. Note, however, that some classifiers, SVMs in particular, take considerably longer on the larger data set.

In [None]:
data_sets = ('bank-10percent', 'bank-full', 'bank-balanced')
bank = pd.read_csv('../data/' + data_sets[2] + '.csv')

In [None]:
label_col = 'y'
label = bank[label_col]
features = bank.drop(columns=[label_col])
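
Whether a data set counts as balanced can be read off the label distribution. The following is a minimal sketch on a made-up label column; the name `toy_label` and its `'yes'`/`'no'` values are assumptions for illustration and are not read from the CSV files.

```python
import pandas as pd

# hypothetical label column standing in for bank['y']
toy_label = pd.Series(['no', 'no', 'no', 'yes'], name='y')

# relative frequency of each class; a balanced set would show roughly 0.5 / 0.5
class_share = toy_label.value_counts(normalize=True)
print(class_share)
```

The same call on `bank[label_col]` would show the class shares of whichever data set is loaded.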

## Preprocessing

### Dummy Encoding of Metric Features

In [None]:
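# the label becomes a single 0/1 column; take it as a 1-D Series, as scikit-learn expects
label_encoded = pd.get_dummies(label, drop_first=True).iloc[:, 0].astype(int)
# one-hot encode the categorical columns; numeric columns pass through unchanged
features_encoded = pd.get_dummies(features, drop_first=True)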
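
To see what this encoding does, here is a small sketch on a made-up frame (the column names and values are illustrative assumptions). `pd.get_dummies` leaves numeric columns untouched and replaces each categorical column with indicator columns; `drop_first=True` drops the first category of each column, since it is implied by the others.

```python
import pandas as pd

# hypothetical mini data set with one numeric and one categorical column
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'marital': ['married', 'single', 'divorced'],
})

# 'divorced' is the first category in sorted order, so its indicator is dropped
toy_encoded = pd.get_dummies(toy, drop_first=True)
print(list(toy_encoded.columns))
```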

### Split into training and test set

In [None]:
X_train, X_test, y_train, y_test = train_test_split(features_encoded, label_encoded, test_size = 0.2, random_state = 167)

### Normalize Features

In [None]:
sc_X = StandardScaler()
X_train = sc_X.fit_transform(X_train)
X_test = sc_X.transform(X_test)
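
The scaler is fitted on the training data only and then applied to both sets, so no information from the test set leaks into the preprocessing. A minimal sketch with made-up numbers (the `toy_` names are illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

toy_train = np.array([[1.0], [2.0], [3.0]])
toy_test = np.array([[2.0], [4.0]])

toy_scaler = StandardScaler()
toy_train_scaled = toy_scaler.fit_transform(toy_train)  # learns mean and std from the training data
toy_test_scaled = toy_scaler.transform(toy_test)        # reuses the training mean and std

print(toy_train_scaled.mean(), toy_train_scaled.std())
print(toy_test_scaled.ravel())
```

The scaled training data has mean 0 and standard deviation 1; the test data generally does not, because it is scaled with the training statistics.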

### Some simple helper functions

This function computes a few quality measures for a fitted classifier:
* confusion matrix
* classification accuracy
* area under curve (AUC)

In [None]:
def quality(model, features, labels):
    predictions = model.predict(features)
    # column 1 holds the predicted probability of the positive class
    probabilities = model.predict_proba(features)
    conf_matrix = confusion_matrix(labels, predictions)
    accuracy = accuracy_score(labels, predictions)
    auc = roc_auc_score(labels, probabilities[:, 1])
    return conf_matrix, accuracy, auc
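
As a small sanity check of these measures, the sketch below applies the same scikit-learn functions to four hand-made labels and scores. The numbers are illustrative, and the 0.5 threshold is an assumption standing in for `model.predict`.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, roc_auc_score

toy_labels = np.array([0, 0, 1, 1])
toy_scores = np.array([0.1, 0.4, 0.35, 0.8])        # predicted probability of class 1
toy_predictions = (toy_scores >= 0.5).astype(int)   # hard predictions at threshold 0.5

toy_conf = confusion_matrix(toy_labels, toy_predictions)  # rows: true class, columns: predicted class
toy_acc = accuracy_score(toy_labels, toy_predictions)     # share of correct predictions
toy_auc = roc_auc_score(toy_labels, toy_scores)           # depends only on the ranking of the scores
print(toy_conf, toy_acc, toy_auc)
```

Accuracy is computed from the hard predictions, whereas AUC is computed from the scores and does not depend on a threshold.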

Fit a model on the training data and evaluate it on the test data:

In [None]:
def fit_and_evaluate(model, X_train, y_train, X_test, y_test, show_confusion_matrix=True):
    model.fit(X_train, y_train)
    conf_matrix, accuracy, auc = quality(model, X_test, y_test)
    if show_confusion_matrix:
        print("Confusion Matrix:\n{0}".format(conf_matrix))
    # accuracy is reported in percent; AUC is a value between 0 and 1, not a percentage
    print("Accuracy: {0:.2f} %".format(accuracy*100), "AUC: {0:.3f}".format(auc))

## Classifiers

### Zero Rule

In [None]:
dummy = DummyClassifier(strategy='most_frequent') 
fit_and_evaluate(dummy, X_train, y_train, X_test, y_test)
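
The zero rule ignores the features and always predicts the most frequent training class, which makes it a useful baseline. Its accuracy equals the share of the majority class, and its AUC is 0.5 because every sample receives the same score. A minimal sketch on made-up data:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

toy_X = np.zeros((4, 1))          # the zero rule ignores the features
toy_y = np.array([0, 0, 0, 1])    # class 0 is the majority

zero_rule = DummyClassifier(strategy='most_frequent').fit(toy_X, toy_y)
zero_pred = zero_rule.predict(toy_X)
zero_acc = accuracy_score(toy_y, zero_pred)
zero_auc = roc_auc_score(toy_y, zero_rule.predict_proba(toy_X)[:, 1])
print(zero_pred, zero_acc, zero_auc)
```

Any classifier worth using should beat both numbers on the test set.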

### Logistic Regression

In [None]:
logistic = LogisticRegression() 
fit_and_evaluate(logistic, X_train, y_train, X_test, y_test)

### Linear Discriminant Analysis

In [None]:
lda = LinearDiscriminantAnalysis() 
fit_and_evaluate(lda, X_train, y_train, X_test, y_test)

### Gaussian Naive Bayes

In [None]:
gaussianNB = GaussianNB()
fit_and_evaluate(gaussianNB, X_train, y_train, X_test, y_test)

### k-Nearest Neighbor

In [None]:
knn = KNeighborsClassifier(n_neighbors=13)
fit_and_evaluate(knn, X_train, y_train, X_test, y_test)

### Decision Tree with Gini Criterion for splitting

In [None]:
dtree = DecisionTreeClassifier(criterion='gini')
fit_and_evaluate(dtree, X_train, y_train, X_test, y_test)

### Decision Tree with Entropy Criterion for splitting

In [None]:
dtree = DecisionTreeClassifier(criterion='entropy')
fit_and_evaluate(dtree, X_train, y_train, X_test, y_test)

### Random Forest with Gini Criterion for splitting

In [None]:
rfc = RandomForestClassifier(criterion='gini')
fit_and_evaluate(rfc, X_train, y_train, X_test, y_test)

### Random Forest with Entropy Criterion for splitting

In [None]:
rfc = RandomForestClassifier(criterion='entropy')
fit_and_evaluate(rfc, X_train, y_train, X_test, y_test)

### Gradient Boosting Classifier

In [None]:
boosting = GradientBoostingClassifier()
fit_and_evaluate(boosting, X_train, y_train, X_test, y_test)

### Support Vector Machine with Linear Kernel

In [None]:
svc_linear = SVC(kernel = 'linear', probability=True)
fit_and_evaluate(svc_linear, X_train, y_train, X_test, y_test)
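
`probability=True` makes `SVC` run an additional calibration step with internal cross-validation, which adds to the training time. Since AUC depends only on the ranking of the scores, one possible alternative is to score with `decision_function` instead of `predict_proba`. The sketch below uses synthetic data from `make_classification` as a stand-in for the bank data and is not part of the evaluation above; the `quality` helper would need to be adapted to use it.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# synthetic stand-in data (an assumption, not the bank data)
toy_X, toy_y = make_classification(n_samples=200, n_features=5, random_state=0)
tX_train, tX_test, ty_train, ty_test = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=0)

svc_fast = SVC(kernel='linear')                    # no probability calibration
svc_fast.fit(tX_train, ty_train)
svc_scores = svc_fast.decision_function(tX_test)   # signed distance to the separating hyperplane
svc_auc = roc_auc_score(ty_test, svc_scores)
print("AUC: {0:.3f}".format(svc_auc))
```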

### Support Vector Machine with Radial Basis Function Kernel

In [None]:
svc_rbf = SVC(kernel = 'rbf', probability=True)
fit_and_evaluate(svc_rbf, X_train, y_train, X_test, y_test)
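
To compare the classifiers side by side rather than cell by cell, the results can be collected in one table. The following is a sketch under assumptions: it uses synthetic data from `make_classification` in place of the bank data, only a subset of the classifiers, and a hypothetical `toy_models` dictionary.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# synthetic stand-in data (an assumption, not the bank data)
cX, cy = make_classification(n_samples=300, n_features=8, random_state=0)
cX_train, cX_test, cy_train, cy_test = train_test_split(
    cX, cy, test_size=0.2, random_state=0)

# hypothetical selection of models to compare
toy_models = {
    'Logistic Regression': LogisticRegression(),
    'Gaussian Naive Bayes': GaussianNB(),
    'Decision Tree (Gini)': DecisionTreeClassifier(criterion='gini', random_state=0),
}

rows = []
for name, model in toy_models.items():
    model.fit(cX_train, cy_train)
    rows.append({
        'model': name,
        'accuracy': accuracy_score(cy_test, model.predict(cX_test)),
        'auc': roc_auc_score(cy_test, model.predict_proba(cX_test)[:, 1]),
    })

# one row per classifier, best AUC first
comparison = pd.DataFrame(rows).sort_values('auc', ascending=False)
print(comparison)
```

Fixing `random_state` for the tree-based models keeps such a table reproducible between runs.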