# Lab 4 - Additional Evaluation Techniques

# Overview
In this lab, we will try to predict fraudulent credit card
transactions. This is a difficult task because the vast majority of
transactions are legitimate, so the model has very few examples of fraud to learn from.

## Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split 
from sklearn import tree
from sklearn import metrics

# Data
The data comes from [Kaggle](https://www.kaggle.com/mlg-ulb/creditcardfraud).
The data dimensions are anonymized to protect the identity of the individuals.
The process to create the dimensions is 
[Principal Components Analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis).

In [None]:
ccfraud = pd.read_csv('data/creditcard.csv')

The data is highly imbalanced: only a fraction of a percent of transactions
are actually fraudulent. On average, fraudulent transactions are about \$34 larger than legitimate transactions.

In [None]:
# summary statistics for every column (print so it shows mid-cell)
print(ccfraud.describe())
# get the average and count for each type
ccstats = ccfraud.groupby('Class')['Amount'].agg(['mean', 'count'])
# stats for fraud by count and average transaction amount
print(ccstats)

In [None]:
# fraction of transactions that are fraudulent
print("Fraudulent transaction ratio:", ccstats.loc[1, 'count']/ccstats['count'].sum())

## Pre-processing
This dataset is already very clean and does not need any preprocessing.
It is somewhat large, however, at about 284k rows. To speed up model
training while writing your code, you may wish to work with a subset of
about 10,000 rows. Once your code works, remove the
subset and use the entire dataset instead.
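One way to take such a subset is with the `sample()` method of a pandas dataframe. The sketch below is only an illustration: it builds a small synthetic stand-in (`demo`) instead of reading the real file, and the names `demo` and `subset` are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the lab data (assumption: 'Amount' and 'Class' columns)
rng = np.random.default_rng(516)
demo = pd.DataFrame({
    "Amount": rng.exponential(scale=90.0, size=50_000).round(2),
    "Class": (rng.random(50_000) < 0.002).astype(int),
})

# Draw 10,000 rows at random, without replacement;
# random_state makes the same rows come back on every run
subset = demo.sample(n=10_000, random_state=516)

print("Rows in subset:", len(subset))
print(subset["Class"].value_counts())
```

With the lab data, the equivalent call would be `ccfraud.sample(n=10_000, random_state=516)`. Keep in mind that a subset this small will contain only a handful of fraudulent rows, so the evaluation statistics will be noisy.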

# Training and testing sets
One recommendation is to call `np.random.seed()` so that your results are reproducible.
The random number generator will always produce the same "random" numbers
after `np.random.seed()` is called with a specific value.

In [None]:
np.random.seed(516)
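As a quick check of this behavior, the sketch below draws three random numbers, resets the seed, and draws again. The variable names are only for illustration.

```python
import numpy as np

np.random.seed(516)
first = np.random.rand(3)

np.random.seed(516)   # reset to the same seed
second = np.random.rand(3)

np.random.seed(517)   # a different seed gives a different sequence
third = np.random.rand(3)

print("Same seed, same numbers:", np.array_equal(first, second))
print("Different seed, same numbers:", np.array_equal(first, third))
```

Many scikit-learn functions, including `train_test_split()`, also accept a `random_state` argument if you prefer to set the seed for a single call rather than globally.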

Next, we will create train and test sets of the data. We will
fit the model with the training set, and use the test set to evaluate the model.
We will do a 75/25 split (75% will be training data).
As in the prior lab, we use the built-in `train_test_split()` function,
which randomly samples rows from our dataset into the two sets.

In [None]:
train, test = train_test_split(ccfraud, test_size=0.25)
print("Rows in train:", len(train))
print("Rows in test:", len(test))
train_stats = train.groupby('Class')['Amount'].agg(['mean', 'count'])
print("Training data:\n", train_stats)
test_stats = test.groupby('Class')['Amount'].agg(['mean', 'count'])
print("Testing data:\n", test_stats)
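With so few fraudulent rows, a purely random split can leave the train and test sets with slightly different fraud rates. One option, not used in this lab, is the `stratify` argument of `train_test_split()`, which keeps the class proportions the same in both sets. The sketch below uses a synthetic stand-in (`demo`) with exactly 40 positive rows; all names are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 20,000 rows, 40 of them labeled as fraud (0.2%)
rng = np.random.default_rng(516)
n = 20_000
labels = np.zeros(n, dtype=int)
labels[:40] = 1
demo = pd.DataFrame({
    "Amount": rng.exponential(scale=90.0, size=n),
    "Class": labels,
})

# Stratified split: each set keeps the same share of fraud rows
train_s, test_s = train_test_split(
    demo, test_size=0.25, stratify=demo["Class"], random_state=516
)

print("Fraud rows in train:", train_s["Class"].sum(), "of", len(train_s))
print("Fraud rows in test:", test_s["Class"].sum(), "of", len(test_s))
```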

# Train the Model
We will again use decision trees to create the model. 

First, let's look at using column names, rather than column indices,
to select variables in the model. You can specify the names
in a list, and then use `.loc` to select those columns from the dataframe.

In [None]:
# view all columns
print(list(ccfraud.columns))

In [None]:
# use column names 
pred_vars = ['Time', 'Amount', 'V8', 'V1']
print(ccfraud.loc[:, pred_vars])

Next, let's train the actual model. We will use the columns specified in `pred_vars`. An advantage 
of using column names in this way is that we only need to update `pred_vars` in one location, and 
then reuse that list in all other instances where we train or test the model. In this case, I have
selected `"entropy"` as the measure to compute the splits.

In [None]:
dtree = tree.DecisionTreeClassifier(criterion="entropy")
dtree.fit(train.loc[:, pred_vars], train['Class'])

There are some statistics about the decision tree that we can view. 
First, let's see how many leaves (terminal nodes) there are in this tree.

In [None]:
print(dtree.get_n_leaves())

This is a large tree! It may have over-fit, but we won't know until evaluating with our holdout data.
We can also look at tree depth.

In [None]:
print(dtree.get_depth())

## Evaluate model performance

Now, we will evaluate our model with previously unseen data
(the test set). This will give us a vector of predicted labels.
There are two prediction functions for decision trees. First,
we will use the `predict()` function to see the class label (0 or 1).
The second, `predict_proba()`, allows you to get predicted class
probabilities (shown later in this lab).

In [None]:
pred_labels = dtree.predict(test.loc[:, pred_vars])
pred_labels[0:4]

We can see how well the model performed on the new data.
First, let's view the confusion matrix.

In [None]:
# draw the confusion matrix (ConfusionMatrixDisplay requires scikit-learn >= 1.0)
metrics.ConfusionMatrixDisplay.from_estimator(dtree, test.loc[:, pred_vars], test['Class'])

Next, view the classification report for the various
measures of performance.

In [None]:
print(metrics.classification_report(test['Class'], pred_labels, digits=5))
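To see where the numbers in the report come from, the sketch below computes accuracy, precision, recall, and $F_1$ by hand from a confusion matrix. The labels are a small made-up example, not the lab data.

```python
import numpy as np
from sklearn import metrics

# Made-up labels: 18 legitimate (0) and 2 fraudulent (1) transactions
y_true = np.array([0] * 18 + [1, 1])
# The model raises one false alarm, catches one fraud, and misses the other
y_pred = np.array([0] * 17 + [1] + [1, 0])

# For binary labels, ravel() returns the counts in this order
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted frauds, how many were real?
recall = tp / (tp + fn)      # of the real frauds, how many were caught?
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("F1:", f1)
```

Accuracy looks high here even though half of the fraudulent transactions are missed, which is why accuracy alone is a poor guide on imbalanced data.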

# Probabilistic Evaluation
The probabilistic predictions are useful for applications and advanced
evaluation metrics. First, we need to get probabilities for 
each class (rather than a fixed label). The documentation
for this is 
[here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier.predict_proba).
The result is a matrix with two columns, one for each class label.

In [None]:
pred_probs = dtree.predict_proba(test.loc[:, pred_vars])
pred_probs[0:5, :]

The first five rows show predictions with absolute confidence (100% not fraud).
This is due to the way that decision trees compute probabilities. If a leaf
of the tree is pure, i.e. every training observation in that leaf is of the same class,
then any observation that reaches that leaf will have a probability of 100% for that class.
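The effect is easy to reproduce on toy data. The sketch below is an illustration under assumptions: the data is synthetic and noisy, and the names `deep` and `shallow` are hypothetical. It fits one unrestricted tree and one tree limited with `max_depth`, then lists the distinct probabilities each one produces.

```python
import numpy as np
from sklearn import tree

# Synthetic data: four continuous features and a noisy binary label
rng = np.random.default_rng(516)
X = rng.normal(size=(2000, 4))
y = ((X[:, 0] + rng.normal(scale=1.0, size=2000)) > 1.5).astype(int)

# Unrestricted tree: grows until every leaf is pure
deep = tree.DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
# Restricted tree: at most 3 levels, so leaves stay mixed
shallow = tree.DecisionTreeClassifier(
    criterion="entropy", max_depth=3, random_state=0
).fit(X, y)

deep_probs = deep.predict_proba(X)[:, 1]
shallow_probs = shallow.predict_proba(X)[:, 1]

print("Distinct probabilities, unrestricted tree:", np.unique(deep_probs))
print("Distinct probabilities, depth-3 tree:", np.unique(shallow_probs).round(3))
```

The probability reported for a leaf is the share of training observations of each class in that leaf, so a pure leaf can only report 0 or 1.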

This looks like a classic case of overfitting. The tree is extremely deep,
having 39 layers and over 500 nodes with just four predictor variables.

These extreme probability values are likely to be punished by model 
evaluation statistics that rely on probability rather than class label.

## Area Under the Curve (AUC)
AUC is the area under the ROC curve, which shows the tradeoff between the true positive rate
and the false positive rate as the decision threshold changes.
This measure is less sensitive to imbalanced classes than accuracy, and it can be displayed on a graph.
The baseline for this measure is 0.5 (random guessing), and a perfect model scores 1.0.

First, let's compute the AUC.

In [None]:
metrics.roc_auc_score(test['Class'], pred_probs[:,1])
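To make the computation less of a black box, the sketch below builds the ROC points with `metrics.roc_curve()` and integrates them with `metrics.auc()`. The labels and scores are made up for illustration. It also shows what happens when the scores take only the values 0 and 1, as with our fully grown tree.

```python
import numpy as np
from sklearn import metrics

# Made-up labels and predicted probabilities of the positive class
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores = np.array([0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9])

# One (FPR, TPR) point per threshold, then the area under those points
fpr, tpr, thresholds = metrics.roc_curve(y_true, scores)
auc_from_curve = metrics.auc(fpr, tpr)
auc_direct = metrics.roc_auc_score(y_true, scores)
print("AUC from curve:", auc_from_curve)
print("AUC directly:", auc_direct)

# With hard 0/1 scores there is only one usable threshold,
# so the "curve" collapses to two straight line segments
hard = (scores >= 0.5).astype(float)
fpr_h, tpr_h, _ = metrics.roc_curve(y_true, hard)
print("ROC points with graded scores:", len(fpr))
print("ROC points with 0/1 scores:", len(fpr_h))
```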

Next, plot the curve. Since there is very little variation in probabilities 
for the predictions, this curve will not look like much of a curve. The 
exercises later will show different results. The plot function
also displays the AUC on the plot.

In [None]:
# draw the ROC curve (RocCurveDisplay requires scikit-learn >= 1.0)
metrics.RocCurveDisplay.from_estimator(dtree, test.loc[:, pred_vars], test['Class'])

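We can compute a statistic similar to AUC that uses precision and recall
instead of TPR and FPR. This is the average precision score, which
summarizes the precision/recall curve in a single number.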

In [None]:
metrics.average_precision_score(test['Class'], pred_probs[:,1])

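Next, plot the precision/recall curve.

In [None]:
# draw the precision/recall curve (PrecisionRecallDisplay requires scikit-learn >= 1.0)
metrics.PrecisionRecallDisplay.from_estimator(dtree, test.loc[:, pred_vars], test['Class'])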

## Log Loss
[Log loss](http://wiki.fast.ai/index.php/Log_Loss) is a measure that considers
the confidence of a prediction, rather than just its correctness. The lower
this number, the better the model is. Unlike precision, recall, or $F_1$,
its value is not easy to interpret on its own,
but it is a very straightforward way to compare models. Log loss can be as
low as 0 (a perfect model) and has no upper bound: confident predictions
that turn out to be wrong are penalized heavily.

In [None]:
print(metrics.log_loss(test['Class'], pred_probs[:,1]))

We will use the value here as a baseline for additional iterations of this model.
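For binary labels, log loss is the average of $-[y \log(p) + (1 - y) \log(1 - p)]$, where $y$ is the true label and $p$ is the predicted probability of class 1. The sketch below computes this by hand on made-up predictions and checks it against `metrics.log_loss()`. The helper `manual_log_loss` is a hypothetical name used only here.

```python
import numpy as np
from sklearn import metrics

def manual_log_loss(y, p):
    """Average negative log-likelihood of the true labels."""
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

y_true = np.array([0, 0, 1, 1])

# Both models get the last observation wrong (probability below 0.5),
# but the second model is far more confident about every prediction
cautious = np.array([0.2, 0.3, 0.6, 0.4])
confident = np.array([0.01, 0.01, 0.99, 0.01])

loss_cautious = manual_log_loss(y_true, cautious)
loss_confident = manual_log_loss(y_true, confident)

print("cautious model:", loss_cautious, metrics.log_loss(y_true, cautious))
print("confident model:", loss_confident, metrics.log_loss(y_true, confident))
```

Both models make the same single mistake at a 0.5 cutoff, yet the confident model has the higher (worse) log loss because it was nearly certain about the prediction it got wrong.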

# Exercises

1. What are the strengths and weaknesses of each evaluation criterion (precision/recall/F1/accuracy; model cost; log loss)?
2. This model is severely over-fit. Try creating a new model and 
    restricting the maximum depth of the tree to 5 levels (using the `max_depth` parameter).
    Run the various evaluation statistics on this new model.
    1. How does the tree compare to the original model? 
    2. On which measures is it better/worse?
3. Does adding additional variables to the model improve performance?
4. This data is anonymized, which means the column names and their values have been
    obscured. What data columns do you think would be useful for detecting fraudulent
    credit card transactions?
    
## Optional Exercises
1. What happens when you build the model using a different value for `np.random.seed()`? 
2. In this context, which evaluation statistic do you think is most relevant? Why?

# Additional Resources
There are a few other elements of decision trees that I would like to mention here.
First, we can see the tree by plotting. This works well on small trees, but for
larger trees like those in this lab, the value is minimal.

In [None]:
import matplotlib.pyplot as plt
tree.plot_tree(dtree)
plt.show()
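A text rendering is sometimes easier to read than a plot. The sketch below uses `tree.export_text()` on a deliberately tiny tree fit to synthetic data; the feature names `f0` and `f1` and the variable names are hypothetical.

```python
import numpy as np
from sklearn import tree

# Synthetic data where the label depends only on the first feature
rng = np.random.default_rng(516)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0.5).astype(int)

small = tree.DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# One line per node: split rules for internal nodes, predicted class for leaves
rules = tree.export_text(small, feature_names=["f0", "f1"])
print(rules)
```

For the tree in this lab, passing `feature_names=pred_vars` would label the splits with the real column names, although the output for a tree this large would be very long.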

Another interesting part of decision trees is viewing the path that a row
takes to reach the decision. Again, this is typically more useful for smaller trees.

To interpret the printed output, look at the pair of values in parentheses. The first value
is the row's position in the data passed to `decision_path()` (in this case, we are looking at
the subset of fraudulent observations only). The second value is the ID of a node that the row
passes through on its way from the root to a leaf. The final value is an indicator that is
always 1, meaning that the row visits that node.

In [None]:
paths_for_fraud = dtree.decision_path(train.loc[train['Class'] == 1, pred_vars])
print(paths_for_fraud)
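The sparse output is easier to follow on a tiny example. The sketch below fits a tree to four made-up rows, converts the result of `decision_path()` to a dense array, and uses `apply()` to get the leaf that each row ends in. All names are hypothetical.

```python
import numpy as np
from sklearn import tree

# Four made-up rows with one feature; a single split separates the classes
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([0, 0, 1, 1])
stump = tree.DecisionTreeClassifier(random_state=0).fit(X, y)

# Sparse indicator matrix of shape (n_samples, n_nodes)
paths = stump.decision_path(X)
dense = paths.toarray()
print(dense)

# apply() returns the ID of the leaf each row ends in
leaf_ids = stump.apply(X)
print("Leaf for each row:", leaf_ids)
```

Each row of the dense array is one observation and each column is one node. A 1 means the observation passes through that node, so column 0 (the root) is 1 for every row.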