# Progress Report Summary

## Outline of work done

* ### Data preparation 
* ### Metrics we care about
* ### Initial runs using logistic regression
* ### Look at sampling techniques
* ### Naive validation using train_test_split
* ### Custom cross validation
* ### Running on all classifiers, for all sampling techniques
* ### Baseline models evaluation
* ### CNN work - data and model prep
* ### CNN model 1 running and cross-val

### Data Preparation

I started the baseline work by exploring the data and preparing it. First, I examined the structure of the data and the imbalance between the two classes. I then normalised the 'Amount' column to range between -1 and 1, like the rest of the data, and dropped the 'Time' column entirely. Dropping 'Time' avoids treating the problem as time series and keeps the focus on classification using the remaining 29 features.
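
As a rough illustration of this preparation step, the sketch below rescales an 'Amount' column into [-1, 1] and drops 'Time'. It assumes min-max scaling, which is only one way to reach that range, and it uses a tiny made-up array in place of the real data; `scale_to_range` is a hypothetical helper, not the function used in the notebook.

```python
import numpy as np

def scale_to_range(values, low=-1.0, high=1.0):
    """Min-max scale a 1-D array into [low, high] (assumed scaling method)."""
    values = np.asarray(values, dtype=float)
    v_min, v_max = values.min(), values.max()
    if v_max == v_min:
        # A constant column carries no information; map it to the midpoint.
        return np.full_like(values, (low + high) / 2.0)
    return low + (values - v_min) * (high - low) / (v_max - v_min)

# Toy stand-in for the real table: columns are Time, V1, Amount.
columns = ["Time", "V1", "Amount"]
data = np.array([
    [0.0, -0.5, 10.0],
    [1.0, 0.3, 250.0],
    [2.0, 0.1, 0.0],
])

# Rescale 'Amount' in place.
amount_idx = columns.index("Amount")
data[:, amount_idx] = scale_to_range(data[:, amount_idx])

# Drop 'Time' so that only the non-temporal features remain.
time_idx = columns.index("Time")
features = np.delete(data, time_idx, axis=1)
feature_names = [c for c in columns if c != "Time"]
```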




### Metrics we care about

#### F1-score:
F1-score is the harmonic mean of precision and recall. Intuitively, precision is the ability of the classifier not to label a negative sample as positive, and recall is the ability of the classifier to find all the positive samples.
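
To make the definition concrete, here is a small sketch that computes all three metrics from confusion-matrix counts. The function name and the counts are hypothetical and chosen only for illustration; they are not taken from the experiments.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Hypothetical balanced case: 80 frauds caught, 20 false alarms, 20 frauds missed.
p_bal, r_bal, f1_bal = precision_recall_f1(tp=80, fp=20, fn=20)

# Hypothetical skewed case: most frauds caught, but swamped by false alarms.
p_skew, r_skew, f1_skew = precision_recall_f1(tp=90, fp=1410, fn=10)
```

In the skewed case recall is 0.9 but precision is only 0.06, and the harmonic mean stays close to the weaker of the two.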

So why do we care about F1 score? 

#### Context of the bank
In the context of a bank, and of how these metrics add or lose value, F1 score matters. As the SMOTE results table shows, LinearSVC has the highest recall (0.87), which means it is very good at finding truly fraudulent cases. However, its precision is very low at 0.06. In other words, the classifier falsely labels a large number of non-fraudulent transactions as fraud. This is bad because it loses the bank money and gives customers a bad experience. With poor precision we frequently flag legitimate activity as fraud, freeze customers' cards and accounts, and text them to say that we suspect fraud, only to verify eventually that everything was benign and reverse the action. This leaves a bad impression on customers, who may switch banks or lose faith in the bank's intelligence systems.

Hence we care about F1 score, which balances these two metrics: recall, the ability to catch true frauds, and precision, the ability to classify correctly and so reduce the number of false positives.

### Baseline Models Evaluation
---

### Original vs Under-Sampled

The results show that under-sampling scores extremely well. However, when we think about why, it is probably not a wise approach. Under-sampling reduces the information available from over 200,000 real-life examples (albeit benign transactions) to fewer than 500. In doing so we lose a *lot* of information that the classifier could learn from to become more generalisable. It is likely that we simply overfit to the small dataset.
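
The amount of data thrown away is easy to see in a small sketch of random under-sampling. `random_undersample` is a hypothetical helper written for this illustration, and the toy labels below are made up, so the fraction shown is not the one from the real dataset.

```python
import numpy as np

def random_undersample(X, y, majority_label=0, seed=0):
    """Keep every minority row and an equally sized random subset of majority rows."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    majority_idx = np.flatnonzero(y == majority_label)
    minority_idx = np.flatnonzero(y != majority_label)
    # Assumes the minority class is no larger than the majority class.
    keep_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.sort(np.concatenate([keep_majority, minority_idx]))
    return X[keep], y[keep]

# Toy data: 1000 benign rows (0) and 5 fraud rows (1).
y = np.array([0] * 1000 + [1] * 5)
X = np.arange(len(y)).reshape(-1, 1)

X_small, y_small = random_undersample(X, y)
discarded_fraction = 1 - len(y_small) / len(y)
```

Here only 10 of the 1005 rows survive, so the classifier never sees the vast majority of the benign examples.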

For this reason, using no sampling appears to be the better choice here. The original data performed fairly well across the classifiers, with precision higher than recall for most of them.

### Oversample vs SMOTE

The oversampling results are higher than the SMOTE results for several classifiers (for example KNeighborsClassifier and DecisionTreeClassifier), but given how the algorithms work and how we resample inside the cross-validation loop, these results are likely to be biased.

Oversampling simply duplicates data points at random, so it produces a lot of redundant data. During cross-validation, the test fold is therefore likely to contain duplicates of points in the training folds, which amounts to a 'leakage' of test data. This would explain why oversampling appears to achieve better results.
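
The leakage can be illustrated with a toy example in which the data is oversampled before it is split. This is a sketch under assumed conditions (made-up class sizes, a single 80/20 split instead of the full CV procedure), not a reproduction of the notebook's pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labels: 95 benign (0) and 5 fraud (1); row ids stand in for the features.
y = np.array([0] * 95 + [1] * 5)
row_ids = np.arange(len(y))

# Oversample BEFORE splitting: draw fraud rows with replacement until the classes balance.
fraud_ids = row_ids[y == 1]
extra = rng.choice(fraud_ids, size=90, replace=True)
oversampled_ids = np.concatenate([row_ids, extra])

# Shuffle and hold out the last 20% as a test fold.
shuffled = rng.permutation(oversampled_ids)
cut = int(0.8 * len(shuffled))
train_ids, test_ids = shuffled[:cut], shuffled[cut:]

# Original rows that appear on both sides of the split are leaked test data.
leaked = np.intersect1d(train_ids, test_ids)
```

Because the benign rows are unique, any row in `leaked` is a duplicated fraud row that the model has effectively already seen during training.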

For this reason the SMOTE results are preferred, as they are more 'true'. SMOTE uses k-nearest neighbours to pick, at random, one of the nearby minority-class data points, and then creates a synthetic point on the line segment between the original point and that neighbour, at a random fraction in the range [0,1] of the distance. Effectively, this creates new data points, which are much better to train on than duplicated information.
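
A minimal sketch of the SMOTE idea, assuming plain Euclidean distance and a brute-force neighbour search: each synthetic point is an interpolation between a minority point and one of its k nearest minority neighbours. `smote_sketch` is a hypothetical name for this illustration and is not the library implementation used for the results.

```python
import numpy as np

def smote_sketch(X_minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority points by interpolating towards random neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)
    n = len(X_minority)
    k = min(k, n - 1)  # needs at least two minority points

    # Brute-force pairwise distances between minority points.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    # Pick a base point, then one of its k nearest neighbours, both at random.
    base = rng.integers(0, n, size=n_new)
    chosen = neighbours[base, rng.integers(0, k, size=n_new)]

    # Move a random fraction in [0, 1) of the way from the base point to the neighbour.
    lam = rng.random((n_new, 1))
    return X_minority[base] + lam * (X_minority[chosen] - X_minority[base])

# Toy minority class: the four corners of the unit square.
corners = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
synthetic = smote_sketch(corners, n_new=10, k=2)
```

Every synthetic point lies between two existing minority points, so none of them is an exact copy unless the random fraction happens to be zero.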

### Original vs SMOTE

This raises the question: is SMOTE worth it at all, given that the original dataset performs quite well in comparison? It depends on what we care about most. If we care about recall, and hence about catching fraudulent transactions, then SMOTE has the advantage. Taking the best-performing classifier, RandomForest, SMOTE raises recall considerably (0.79 vs 0.74) at the expense of some precision (0.83 vs 0.87), while keeping a marginally higher F1 score overall (0.81 vs 0.79). So, for the classifier that appears best suited to the problem, SMOTE lets us score higher than the original dataset.

### train_test_split vs custom cross_val_score using KFold
#### How a difference in splitting can influence results

To show the importance of using all of the data to validate the model (with KFold), we look at the results of splitting the data with train_test_split, resampling the training data only (to preserve the test data), and averaging the scores over n runs.
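
For comparison, the fold-wise alternative can be sketched as a generator that resamples only the training part of each fold and leaves the test fold untouched. This is an assumed, simplified version (plain shuffled K-fold with random oversampling, NumPy only); `kfold_with_oversampling` is a hypothetical name and not the custom function from the notebook.

```python
import numpy as np

def kfold_with_oversampling(y, n_splits=5, minority_label=1, seed=0):
    """Yield (train_idx, test_idx) pairs; only train_idx is oversampled."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = np.array_split(rng.permutation(len(y)), n_splits)
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        minority = train_idx[y[train_idx] == minority_label]
        majority = train_idx[y[train_idx] != minority_label]
        n_extra = len(majority) - len(minority)
        if len(minority) > 0 and n_extra > 0:
            # Duplicate minority rows drawn from the training part only.
            extra = rng.choice(minority, size=n_extra, replace=True)
            train_idx = np.concatenate([train_idx, extra])
        yield train_idx, test_idx

# Toy labels: 90 benign (0) and 10 fraud (1).
y = np.array([0] * 90 + [1] * 10)
splits = list(kfold_with_oversampling(y, n_splits=5))
```

Each row lands in exactly one test fold, so every example is used for validation once, and no duplicated training row can appear in the fold it is scored against.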



# Results Appendix

In [None]:
 ''' Accounting properly for sampling in CV
 
 ORIGINAL
 ==============================
Cross validation training results: 
                        F1 Score  Precision    Recall   Training Time
Classifier                                                           
KNeighborsClassifier    0.773953   0.834210  0.733740 00:00:00.672486
LinearSVC               0.702664   0.911093  0.609756 00:00:44.615623
DecisionTreeClassifier  0.648748   0.584061  0.747967 00:00:12.705187
RandomForestClassifier  0.789572   0.867006  0.737805 00:00:11.946860
MLPClassifier           0.740725   0.789986  0.701220 00:00:10.611695
GaussianNB              0.114077   0.061255  0.833333 00:00:00.110761
==============================

UNDER
==============================
Cross validation training results: 
                        F1 Score  Precision    Recall   Training Time
Classifier                                                           
KNeighborsClassifier    0.929551   0.978015  0.886179 00:00:00.000986
LinearSVC               0.908181   0.955016  0.865854 00:00:00.016134
DecisionTreeClassifier  0.900284   0.910581  0.890244 00:00:00.009997
RandomForestClassifier  0.915486   0.952169  0.882114 00:00:00.040110
MLPClassifier           0.916529   0.956296  0.880081 00:00:00.545562
GaussianNB              0.900877   0.965324  0.845528 00:00:00.001272
==============================

OVER
==============================
Cross validation training results: 
                        F1 Score  Precision    Recall   Training Time
Classifier                                                           
KNeighborsClassifier    0.634392   0.563856  0.798780 00:00:01.675460
LinearSVC               0.117779   0.063179  0.871951 00:01:19.514793
DecisionTreeClassifier  0.649036   0.661330  0.664634 00:00:07.959666
RandomForestClassifier  0.801992   0.884864  0.745935 00:00:10.799792
MLPClassifier           0.658083   0.595006  0.774390 00:00:24.078549
GaussianNB              0.100591   0.053461  0.855691 00:00:00.198887
==============================

SMOTE
==============================
Cross validation training results: 
                        F1 Score  Precision    Recall   Training Time
Classifier                                                           
KNeighborsClassifier    0.500628   0.375985  0.831301 00:00:01.722399
LinearSVC               0.116943   0.062724  0.867886 00:01:21.129306
DecisionTreeClassifier  0.441011   0.325126  0.711382 00:00:23.395320
RandomForestClassifier  0.806464   0.833590  0.794715 00:00:21.880070
MLPClassifier           0.709039   0.708030  0.725610 00:00:24.371739
GaussianNB              0.107311   0.057277  0.855691 00:00:00.193990
==============================
 
 '''

In [None]:
''' Naive averaging, using test_train_split only. 
    Oversampling training split and preserving test set, averaging over 3 runs
    
SUMMARY OF RESULTS (AVG over three random iterations)
=====================================================================
                        F1 Score  Precision    Recall   Training Time
Classifier                                                           
KNeighborsClassifier    0.611444   0.476876  0.853809 00:00:01.758558
LinearSVC               0.119385   0.063951  0.896587 00:01:24.125692
DecisionTreeClassifier  0.536858   0.411079  0.774736 00:00:23.579666
RandomForestClassifier  0.846437   0.879453  0.816384 00:00:23.179558
MLPClassifier           0.750672   0.704481  0.807383 00:00:27.263501
GaussianNB              0.109543   0.058463  0.870450 00:00:00.213850
=====================================================================
'''

In [None]:
''' Tuned Random Forest Classifier, the best performer:

['RandomForestClassifier', 0.827025, 0.853782, 0.813008, datetime.timedelta(0, 419, 267422)] 

F1 SCORE = 0.827025

'''