# Lab 5 - Classification Algorithms
In this lab, you will learn how to train and evaluate classification algorithms
on a dataset. It is usually difficult to know in advance which algorithm will
perform best for a given dataset, so it is important to understand how to
compare multiple models in order to select the best one.

We will need to import a variety of models from `sklearn`. Descriptions
of these models and links to documentation will be provided when 
we fit each model.

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split 
from sklearn import tree
from sklearn import ensemble
from sklearn.neural_network import MLPClassifier
from sklearn import svm
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn import metrics

# Data
The data comes from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).
Most of the features (the `V` columns) are anonymized to protect the identity of the individuals.
These anonymized features were created using 
[Principal Components Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis).

In [None]:
ccfraud = pd.read_csv('data/creditcard.csv')

## Preprocessing
This dataset is already cleaned and ready for use. Let's view some
descriptive statistics to verify this.

In [None]:
# the display() function works better for displaying dataframes in Jupyter Notebooks than print().
display(ccfraud.describe())
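
`describe()` summarizes numeric columns, but it does not directly show missing values or how rare the positive class is. Below is a minimal sketch of two extra checks, run on a tiny made-up frame (`toy` and its values are hypothetical stand-ins, not the Kaggle data).

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for ccfraud: two features and a binary Class column.
toy = pd.DataFrame({
    'Amount': [12.5, 3.0, np.nan, 250.0, 8.75, 40.0],
    'V1':     [-1.2, 0.4, 0.9, -3.1, 0.0, 1.7],
    'Class':  [0, 0, 0, 1, 0, 0],
})

# Count missing values per column; a cleaned dataset should show all zeros.
missing_per_column = toy.isna().sum()
print(missing_per_column)

# Share of each class; a very small share for class 1 signals imbalance.
class_share = toy['Class'].value_counts(normalize=True)
print(class_share)
```

The same two calls should work on `ccfraud` directly, assuming it has been loaded as above.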

## Train/test split
Next, we will use a 75/25 split for this data. We will then summarize
a few variables by the outcome `Class` in each split.

In [None]:
np.random.seed(516)

# create train and test
train, test = train_test_split(ccfraud, test_size=0.25)
print("Rows in train:", len(train))
print("Rows in test:", len(test))

# view some stats by different variables
train_stats = train.groupby('Class')[['Time', 'Amount', 'V1']].agg(['mean', 'count'])
print("Training data:\n", train_stats)
test_stats = test.groupby('Class')[['Time', 'Amount', 'V1']].agg(['mean', 'count'])
print("Testing data:\n", test_stats)
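
When the positive class is rare, a purely random split can give the two sets slightly different class proportions. One option, sketched here on hypothetical labels rather than the lab data, is to pass `stratify` to `train_test_split`. The `random_state` argument is used here as an alternative to seeding NumPy globally.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 20 positives out of 1000 rows.
y = np.array([1] * 20 + [0] * 980)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature column

# stratify=y keeps the class proportions (almost) identical in both splits;
# random_state makes the split reproducible without touching the global seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=516)

print("Positive rate in train:", y_train.mean())
print("Positive rate in test: ", y_test.mean())
```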

Next, we will select the variables we want to use for prediction. 

In [None]:
pred_vars = ['Time', 'Amount', 'V1', 'V2', 'V3', 'V4'] 

# Train models
In this section, we will train several different types of classifiers. They
will be presented one-by-one in this lab, but it is possible to set up the 
training in a `for` loop to make your code simpler and more flexible.
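
As a rough illustration of the `for` loop idea, the sketch below fits three classifiers on synthetic data. The dictionary keys and the `X_demo`/`y_demo` names are arbitrary choices for this example. In the lab you would pass `train[pred_vars]` and `train['Class']` instead.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for train[pred_vars] and train['Class'].
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=0)

# Map a readable name to each unfitted classifier (names are arbitrary).
candidates = {
    'tree': DecisionTreeClassifier(max_depth=5, random_state=0),
    'naive_bayes': GaussianNB(),
    'logistic': LogisticRegression(max_iter=1000),
}

# fit() returns the estimator itself, so it can be stored directly.
fitted_demo = {}
for name, model in candidates.items():
    fitted_demo[name] = model.fit(X_demo, y_demo)
    print(name, "training accuracy:", round(model.score(X_demo, y_demo), 3))
```

Adding or removing a model then only requires changing the dictionary.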

## Decision tree
The first model we fit is a decision tree, which we have been working with for some 
time. 

In [None]:
dtree = tree.DecisionTreeClassifier(criterion='entropy', max_depth=10)
dtree.fit(train[pred_vars], train['Class'])

## Random Forest
A random forest classifier consists of many decision trees, each trained
on a random sample of the rows and considering a random subset of the
features at each split (hence the name). One of the parameters
is the number of trees in the classifier (`n_estimators`). The default is 100, but 
this can be reduced for large datasets if training is slow.
Full documentation on this implementation is 
[here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html).

In [None]:
rf = ensemble.RandomForestClassifier()
rf.fit(train[pred_vars], train['Class'])
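
To show how the number of trees and the tree-level settings are passed, here is a small sketch on synthetic data. The specific values (25 trees, depth 5) are arbitrary and not a recommendation for this dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

# n_estimators sets the number of trees; max_depth and max_features are
# per-tree settings that also exist on DecisionTreeClassifier.
small_rf = RandomForestClassifier(n_estimators=25, max_depth=5,
                                  max_features='sqrt', random_state=0)
small_rf.fit(X_demo, y_demo)

# Each fitted tree is stored in the estimators_ list.
print("Trees in the forest:", len(small_rf.estimators_))
```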

## Neural Networks
There are many types of neural networks in use today. We will 
use perhaps the most common type, the Multi-Layer Perceptron (MLP). 
The documentation for this method is 
[here](https://scikit-learn.org/stable/modules/neural_networks_supervised.html).

In [None]:
mlp = MLPClassifier(hidden_layer_sizes=(20,20))
mlp.fit(train[pred_vars], train['Class'])

## Support Vector Machines
Next, we will train a support vector machine. A linear kernel
works well for linearly separable classes. Note that `SVC` uses an RBF
kernel by default, which can perform better when the classes are not
linearly separable. You may also want to try `kernel='linear'` for comparison.

The documentation for SVMs is 
[here](https://scikit-learn.org/stable/modules/svm.html). By default, 
the model only produces label outputs. We will set `probability=True`
so that the model can also report class probabilities.

In [None]:
svc = svm.SVC(probability=True)
svc.fit(train[pred_vars], train['Class'])
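
The difference between label outputs and class probabilities can be seen in a small sketch on synthetic data (`svc_demo` and the data are illustrative only). Setting `probability=True` makes fitting slower because an extra calibration step is run internally.

```python
from sklearn import svm
from sklearn.datasets import make_classification

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=200, n_features=6, random_state=0)

svc_demo = svm.SVC(probability=True, random_state=0)
svc_demo.fit(X_demo, y_demo)

labels = svc_demo.predict(X_demo[:3])        # hard class labels
probas = svc_demo.predict_proba(X_demo[:3])  # one column per class

print(labels)
print(probas)
```

Each row of `probas` sums to 1, and the second column is the estimated probability of class 1.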

## Naive Bayes
The Naive Bayes classification algorithm is based on conditional probabilities
from Bayes' Theorem. It is one of the simplest algorithms, yet tends to 
perform quite well in many scenarios.

Documentation for this is 
[here](https://scikit-learn.org/stable/modules/naive_bayes.html).

In [None]:
nb = GaussianNB()
nb.fit(train[pred_vars], train['Class'])
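
To make the link to Bayes' Theorem concrete, the posterior for a single feature under a Gaussian assumption can be worked out by hand. The priors, means, and standard deviations below are made up for illustration and are not estimated from the lab data.

```python
import math

def gaussian_pdf(x, mean, std):
    """Density of a normal distribution with the given mean and std at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2 * math.pi))

# Hypothetical per-class summaries of one feature.
prior = {'legit': 0.99, 'fraud': 0.01}
mean = {'legit': 50.0, 'fraud': 300.0}
std = {'legit': 40.0, 'fraud': 100.0}

x = 280.0  # observed feature value

# Bayes' theorem: the posterior is proportional to prior * likelihood.
unnormalized = {c: prior[c] * gaussian_pdf(x, mean[c], std[c]) for c in prior}
total = sum(unnormalized.values())
posterior = {c: unnormalized[c] / total for c in prior}
print(posterior)
```

With several features, the "naive" assumption is that the per-feature likelihoods can simply be multiplied together within each class.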

## Logistic Regression
The last method we will use in this lab is logistic regression.
This is a method that has existed in statistics for a long time and,
like decision trees, results in models that are human-interpretable.

The documentation for this method is 
[here](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html).

In [None]:
lr = LogisticRegression()
lr.fit(train[pred_vars], train['Class'])
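
One common way to read a fitted logistic regression is to exponentiate its coefficients into odds ratios. The sketch below does this on synthetic data. The feature names are hypothetical, and the interpretation assumes the other features are held fixed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in for the training data.
X_demo, y_demo = make_classification(n_samples=300, n_features=4,
                                     n_informative=2, n_redundant=0,
                                     random_state=0)
feature_names = ['f0', 'f1', 'f2', 'f3']  # hypothetical names

lr_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

# coef_ has shape (1, n_features) for a binary problem.
# exp(coefficient) is the multiplicative change in the odds of class 1
# for a one-unit increase in that feature, holding the others fixed.
odds_ratios = np.exp(lr_demo.coef_[0])
for name, ratio in zip(feature_names, odds_ratios):
    print(name, round(float(ratio), 3))
```

A ratio above 1 means the feature pushes predictions toward class 1, and a ratio below 1 pushes them toward class 0.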

# Evaluation
Now that we have trained these models, the next step 
is to evaluate their performance on out-of-sample data (our test dataset).
We will use multiple evaluation statistics.

The code for this section is adapted from 
[this tutorial](https://abdalimran.github.io/2019-06-01/Drawing-multiple-ROC-Curves-in-a-single-plot),
which seems to be an adaptation of 
[this StackOverflow answer](https://stackoverflow.com/questions/42894871/how-to-plot-multiple-roc-curves-in-one-plot-with-legend-and-auc-scores-in-python).

It works by iterating through the models using a `for` loop
and then storing the statistics in a dataframe. These models 
perform poorly and may produce warning messages from `sklearn`. 

In [None]:
# list of our models
fitted = [dtree, rf, mlp, svc, nb, lr]

# list to collect one row of results per model
# (DataFrame.append was removed in pandas 2.0, so we build the dataframe at the end)
rows = []

for clf in fitted:
    # print the name of the classifier
    print(clf.__class__.__name__)
    
    # get predictions
    yproba = clf.predict_proba(test[pred_vars])
    yclass = clf.predict(test[pred_vars])
    
    # auc information
    fpr, tpr, _ = metrics.roc_curve(test['Class'],  yproba[:,1])
    auc = metrics.roc_auc_score(test['Class'], yproba[:,1])
    
    # log loss
    log_loss = metrics.log_loss(test['Class'], yproba[:,1])
    
    # add some other stats based on confusion matrix
    clf_report = metrics.classification_report(test['Class'], yclass)
    
    # add the results for this model to the list
    rows.append({'classifier_name': clf.__class__.__name__,
                 'fpr': fpr, 
                 'tpr': tpr, 
                 'auc': auc,
                 'log_loss': log_loss,
                 'clf_report': clf_report})

# dataframe to store the results, one row per model
result_table = pd.DataFrame(rows, columns=['classifier_name', 'fpr', 'tpr', 'auc', 
                                           'log_loss', 'clf_report'])

## View the results
For easy formatting later, reset the dataframe index to be the classifier names
rather than a numeric index.

In [None]:
result_table.set_index('classifier_name', inplace=True)
display(result_table)

The raw dataframe is not very pleasing to look at, so let's work on that next.
First, run a for loop to show the classification report and log loss for 
each model.

In [None]:
for i in result_table.index:
    print('\n---- statistics for', i, "----\n")
    print(result_table.loc[i, 'clf_report'])
    print("Model log loss:", result_table.loc[i, 'log_loss'])
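
To see where the numbers in the classification report and the log loss come from, the sketch below computes precision, recall, and log loss by hand on six made-up predictions. The 0.5 threshold is an assumption that matches the usual default for binary classifiers.

```python
import math

# Hypothetical true labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1, 0, 1]
p_hat = [0.1, 0.6, 0.8, 0.3, 0.2, 0.9]

# Threshold the probabilities at 0.5 to get class predictions.
y_pred = [1 if p >= 0.5 else 0 for p in p_hat]

# Confusion-matrix counts for the positive class.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of the predicted positives, how many were right
recall = tp / (tp + fn)     # of the actual positives, how many were found

# Log loss: average negative log-probability assigned to the true class.
log_loss_value = -sum(
    t * math.log(p) + (1 - t) * math.log(1 - p)
    for t, p in zip(y_true, p_hat)
) / len(y_true)

print("precision:", precision, "recall:", recall, "log loss:", log_loss_value)
```

Unlike precision and recall, log loss uses the probabilities themselves, so a confident wrong prediction is penalized more than a hesitant one.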

### Plot the ROC curve
A common technique in comparing models is to 
plot the ROC curve for each model and compare
the shape and the AUC for each. We will use another
`for` loop to generate the plot, and then
format the axes and titles.

In [None]:
fig = plt.figure(figsize=(14,12))

for i in result_table.index:
    plt.plot(result_table.loc[i]['fpr'], 
             result_table.loc[i]['tpr'], 
             label="{}, AUC={:.3f}".format(i, result_table.loc[i]['auc']))
    
plt.plot([0,1], [0,1], color='orange', linestyle='--')

plt.xticks(np.arange(0.0, 1.1, step=0.1))
plt.xlabel("False Positive Rate", fontsize=15)

plt.yticks(np.arange(0.0, 1.1, step=0.1))
plt.ylabel("True Positive Rate", fontsize=15)

plt.title('ROC Curve Analysis', fontweight='bold', fontsize=15)
plt.legend(prop={'size':13}, loc='lower right')

plt.show()

Recall from the last lab that a better model has a curve that is closer to the 
upper-left corner. The AUC is the area underneath each curve, and
the baseline AUC is 0.5 (see the dashed diagonal line). The SVC model performed remarkably poorly, 
and none of the models performed particularly well. The
best model by AUC is Naive Bayes, but the decision tree had the highest recall for fraud cases. This is interesting
because the classification report tells a much different story than the AUC, with Naive Bayes
having lower recall. This is evidence that relying 
on just one evaluation statistic or algorithm is usually not a good idea.
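
The area-under-the-curve calculation mentioned above can be sketched with the trapezoidal rule. The ROC points here are made up for illustration. In the lab, `metrics.roc_auc_score` does this work for you.

```python
def trapezoid_area(xs, ys):
    """Area under a piecewise-linear curve through (xs[i], ys[i]), xs increasing."""
    area = 0.0
    for i in range(1, len(xs)):
        width = xs[i] - xs[i - 1]
        mean_height = (ys[i] + ys[i - 1]) / 2
        area += width * mean_height
    return area

# Hypothetical ROC points, sorted by increasing false positive rate.
fpr_demo = [0.0, 0.1, 0.4, 1.0]
tpr_demo = [0.0, 0.6, 0.8, 1.0]

auc_demo = trapezoid_area(fpr_demo, tpr_demo)

# The diagonal from (0, 0) to (1, 1) corresponds to random guessing.
auc_baseline = trapezoid_area([0.0, 1.0], [0.0, 1.0])

print("AUC of the example curve:", auc_demo)
print("AUC of the diagonal:", auc_baseline)
```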

# Exercises

1. Run the models with at least 10 predictors. Does it improve the overall performance? 
2. Adjust some parameters of the random forest model. Note how similar these parameters are to those of decision trees.
3. Overall, which model would you choose? Why?


## Optional
1. Construct a voting classifier (voting on class labels or on probabilities).
    Voting classifiers take the predictions from several models and "vote" on the most
    likely class. They can also use the average probability over
    all models. There is a built-in class in `sklearn` to make
    this fairly simple.