<a href="https://colab.research.google.com/github/polo-music/credit-card-fraud-detection/blob/main/credit_card_fraud_prediction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Idea & project
The idea of this project is to fit different binary classification algorithms from the sklearn library and find one that can predict, with a solid degree of confidence, whether a credit card transaction is fraudulent or not.

While doing this project I found that working with such heavily unbalanced data is difficult and challenging. I ended up reducing the dataset to a smaller, more balanced version.

In [71]:
# Importing libraries
import pandas as pd
import numpy as np
import plotly.express as ex
import plotly.graph_objects as go
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import roc_auc_score

# The dataset is too large to upload to GitHub (150MB), so I uploaded it to the disk space of Google Colab
df = pd.read_csv('/content/creditcard.csv')

print(df.describe())

                Time            V1            V2            V3            V4  \
count  284807.000000  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean    94813.859575  1.168375e-15  3.416908e-16 -1.379537e-15  2.074095e-15   
std     47488.145955  1.958696e+00  1.651309e+00  1.516255e+00  1.415869e+00   
min         0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00   
25%     54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01   
50%     84692.000000  1.810880e-02  6.548556e-02  1.798463e-01 -1.984653e-02   
75%    139320.500000  1.315642e+00  8.037239e-01  1.027196e+00  7.433413e-01   
max    172792.000000  2.454930e+00  2.205773e+01  9.382558e+00  1.687534e+01   

                 V5            V6            V7            V8            V9  \
count  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05  2.848070e+05   
mean   9.604066e-16  1.487313e-15 -5.556467e-16  1.213481e-16 -2.406331e-15   
std    1.380247e+00  1.332271e+00  1.23709

Okay. The dataset is loaded. There are a lot of feature columns (from V1 to V28) which contain different numbers. The dataset documentation tells us that these are numerical values resulting from a PCA transformation, and that due to confidentiality issues the original features and further background cannot be provided. This should not be a problem, since it is the machine learning algorithm that will take care of "understanding" these values. The only drawback is that we, as humans, cannot interpret the features to judge whether the model is doing its job well. The three columns we can interpret directly are the "Amount" column, the "Time" column and the "Class" column (binary, with 1 being fraud and 0 non-fraud).

In [40]:
# Since the dataset is large and has too many variables, I don't think it would be useful to try to visualize
# it in some sort of graph. Let's get to the real deal.
# Now let's split the data so we can test some algorithms
x = df.drop(labels = ['Class'], axis = 1)
y = df['Class']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 7)

## Decision Tree Algorithm

The decision tree is one of the simplest and most popular classification algorithms. To build the model, the decision tree algorithm considers all the provided features of the data and identifies the most important ones.

Because of this, decision trees are also used to measure feature importance, which helps when handpicking features.

During training, the model picks the most informative features and builds a set of rules from them. These rules are then used to predict future cases or the test dataset.

We will train the model on the training split and test its output on the held-out test split.

In [41]:
# Declare instance of DTC and train the model
decision_tree = DecisionTreeClassifier()
decision_tree.fit(x_train, y_train)

DecisionTreeClassifier()

In [46]:
# We use a separate code cell to avoid retraining the model
y_predict = decision_tree.predict(x_test)
acc = decision_tree.score(x_test, y_test)
print(acc)

0.9991924440855307


Wow! The accuracy of our model is 99.9%, but does this mean that it is a good model?
Sometimes accuracy is not the right metric to tell whether a model is good or bad, and in this case it sure isn't. Let's dive into what accuracy really is:
> Accuracy is the proportion of correct predictions among all the cases.

Keeping our dataset in mind, we can clearly see why this accuracy is so high. As noted at the beginning, our dataset is heavily unbalanced: class 1 (fraud) transactions make up an extremely low 0.17% of the total. So, let's imagine that the model (for some reason) always says that the transaction is valid. With this ratio, even though it misses every single fraud, it is still right about 99.83% of the time, so the accuracy will always be super high. That's why this particular metric does not work in this case.

> Before starting any coding or designing any algorithm, I always recommend pausing for a minute and investing the time to DEEPLY understand what kind of dataset we are working with and what type of question we want to ask. With this in mind, the above observation comes almost immediately.

After looking at this particular case, we can affirm that accuracy is a good metric only when we have well-balanced data (not this case).
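
To make the point concrete, here is a minimal sketch with made-up labels (not the real dataset): 10,000 transactions of which 17 are fraud, matching the 0.17% ratio, and a hypothetical "model" that always predicts class 0.

```python
# Illustrative labels only: 10,000 transactions, 17 of them fraud (0.17%)
y_true = [1] * 17 + [0] * 9983

# A hypothetical "model" that always answers "valid" (class 0)
y_always_valid = [0] * len(y_true)

# Accuracy: share of predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_always_valid))
baseline_accuracy = correct / len(y_true)

# Share of the actual frauds that were caught
caught = sum(t == 1 and p == 1 for t, p in zip(y_true, y_always_valid))
fraud_caught_ratio = caught / sum(y_true)

print(baseline_accuracy)   # 0.9983
print(fraud_caught_ratio)  # 0.0
```

An accuracy of 99.83% with zero frauds detected is exactly the situation that accuracy alone cannot reveal.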

In [47]:
# To dive deeper into the real metrics of the algorithm, let's print a classification report
# This should give us something more to work with and a deeper understanding of the model itself
print(classification_report(y_test, y_predict))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00     56862
           1       0.76      0.79      0.77       100

    accuracy                           1.00     56962
   macro avg       0.88      0.89      0.89     56962
weighted avg       1.00      1.00      1.00     56962



For this, we use the **classification_report** function from the sklearn library.

This function gives us several more classification metrics to take into account when validating models. It's time to dive deeper into some of these :)

---

First of all, we have to take a step back and understand what values we're working with when we make an ML prediction (or a prediction of any kind). The 'values' we have are these:

|  | Positive (actual data) | Negative (actual data) |
| --- | --- | --- |
| Positive (predicted data) | TP | FP |
| Negative (predicted data) | FN | TN |

The values of the matrix are the following:


*   TP: True Positive
*   FP: False Positive
*   FN: False Negative
*   TN: True Negative

This is good to keep in mind because we can extract some formulas and some concepts from it to try to understand the real performance of the actual model we're working with.
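
As a small sketch of how these four counts come out of a set of predictions, the snippet below tallies them by hand on made-up labels. The helper name `confusion_counts` is hypothetical, written only for illustration; it is not an sklearn function.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (TP, FP, FN, TN) for binary labels."""
    tp = fp = fn = tn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive and actual == positive:
            tp += 1  # predicted fraud, really fraud
        elif predicted == positive:
            fp += 1  # predicted fraud, really valid
        elif actual == positive:
            fn += 1  # predicted valid, really fraud
        else:
            tn += 1  # predicted valid, really valid
    return tp, fp, fn, tn

# Made-up labels for illustration
y_true_demo = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred_demo = [1, 0, 0, 1, 0, 1, 1, 0]
tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
print(tp, fp, fn, tn)  # 3 1 1 3
```

sklearn's `confusion_matrix` gives the same information, but note that it lays the matrix out with actual classes as rows and predicted classes as columns, so for labels 0 and 1 it reads `[[TN, FP], [FN, TP]]`.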

## Accuracy
As we stated before, the accuracy is the proportion of correct predictions ("trues") the model got out of the total amount of data.
> Accuracy = (TP + TN) / (TP + FP + FN + TN)

In this case, where we have heavily unbalanced data, this is not the most informative metric we can extract.

## Precision
When we talk about precision we mean: "what proportion of predicted positives is actually positive?"
> Precision = TP / (TP + FP)

The precision metric works well when we want to be sure about our positive predictions (e.g. when deciding whether to administer a medicine to a patient).

## Recall
Recall measures the model from a different point of view: "what proportion of actual positives is correctly classified?"
> Recall = TP / (TP + FN)

The recall metric is useful when (of course, depending on the problem and project we're working on) we want to catch as many of the actual positives as possible, for example when predicting the possible development of a disease.

---

These are the three main metrics we can extract when we're working with a classification model. But we can take it a step further and combine them to get different information.

## F1 score
The F1 score is a metric between 0 and 1, and it's the harmonic mean of the recall (R) and the precision (P).
> F1 = 2 * (P * R) / (P + R)

Because it is a harmonic mean, the F1 score is pulled toward the lower of the two values: if either the recall or the precision is close to 0, so is the F1. On an unbalanced dataset this exposes weak performance on the minority class that accuracy hides. Here the F1 for class 1 is 0.77 while the accuracy is 99.9%, so we can clearly see that the accuracy of the model is not representative in this case.
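
The four formulas can be checked against the report printed earlier. The counts below are not printed by the notebook; they are back-calculated from that report (100 fraudulent and 56862 valid test transactions, recall 0.79, precision 0.76) and are consistent with the 0.99919 accuracy, so treat them as a reconstruction.

```python
# Counts reconstructed from the first classification report (an inference,
# not values printed by the notebook)
tp, fp, fn, tn = 79, 25, 21, 56837

accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy={accuracy:.4f} precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
# accuracy=0.9992 precision=0.76 recall=0.79 f1=0.77
```

Only 46 of the 56962 test transactions are misclassified, which is why the accuracy looks excellent even though 21 of the 100 frauds are missed.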

## ROC curve and AUC
This is a very useful classification metric when we're facing a problem like this one. The ROC (Receiver Operating Characteristic) curve is the graphical representation of the true positive rate (the recall) against the false positive rate of a binary classification model as the decision threshold varies. The AUC is the Area Under the Curve.

The curve comes from a trade-off: lowering the threshold flags more transactions as positive, which raises the recall but also the number of false positives, and therefore usually lowers the precision. Where we settle depends on what we really want to achieve with the model we're developing: do we want it to catch as many true positives as possible, or do we want as many of its positive predictions as possible to be correct?
> Plotting these two rates for every threshold gives a graph that looks something like this:
![picture](https://www.statology.org/wp-content/uploads/2021/08/read_roc2.png)

So, the larger the area under this curve, the better the model: 1.0 is a perfect classifier and 0.5 is no better than random guessing. The sklearn library provides the `roc_auc_score` function to calculate it.
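
One way to build intuition for the AUC: it equals the probability that a randomly chosen positive gets a higher score than a randomly chosen negative. The sketch below computes it that way on four made-up scores. `pairwise_auc` is a hypothetical helper written for illustration; on these inputs sklearn's `roc_auc_score(y_true, y_score)` should return the same value.

```python
def pairwise_auc(y_true, y_score):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a pair.
    """
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

y_true_demo = [0, 0, 1, 1]
y_score_demo = [0.1, 0.4, 0.35, 0.8]  # made-up fraud scores
auc_demo = pairwise_auc(y_true_demo, y_score_demo)
print(auc_demo)  # 0.75
```

Three of the four positive/negative pairs are ordered correctly (only 0.35 versus 0.4 is not), hence 0.75.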

---

After this super fast class about classification metrics, we can conclude that training and evaluating on data this unbalanced is neither suitable nor reliable for this case.

What we can do now is take a more balanced subsample of our data, retrain the model and look at the metrics again.

In [67]:
# Undersample: keep 150 random valid transactions and 50 random fraudulent ones
reduced_df_class0 = df[df.Class == 0].sample(n = 150)
reduced_df_class1 = df[df.Class == 1].sample(n = 50)
reduced_df = pd.concat([reduced_df_class0, reduced_df_class1])

print(reduced_df)

            Time        V1        V2         V3        V4        V5        V6  \
269221  163594.0  1.920637 -1.251337  -0.661176 -0.530568 -1.127583 -0.364154   
146417   87690.0 -0.722795  0.993829   1.387846 -0.332280 -0.310226 -0.306936   
125533   77706.0  0.628874 -1.013755  -1.384760  0.653579  1.676106  3.934178   
157320  109835.0  2.083117  0.123442  -1.337087  0.350009  0.413914 -0.718075   
55214    46837.0 -0.681522  0.453120   1.760494  0.926386 -0.967963  1.097663   
...          ...       ...       ...        ...       ...       ...       ...   
42590    41164.0 -5.932778  4.571743  -9.427247  6.577056 -6.115218 -3.661798   
150662   93853.0 -5.839192  7.151532 -12.816760  7.031115 -9.651272 -2.938427   
203700  134928.0  1.204934  3.238070  -6.010324  5.720847  1.548400 -2.321064   
222133  142840.0 -3.613850 -0.922136  -4.749887  3.373001 -0.545207 -1.171301   
93424    64412.0 -1.348042  2.522821  -0.782432  4.083047 -0.662280 -0.598776   

               V7        V8

Nice, we now have a ratio of 3 to 1 between valid and fraudulent transactions, a much larger share of fraud than in the initial case. We can now redo the splitting and training of the decision tree model.

In [68]:
x = reduced_df.drop(labels = ['Class'], axis = 1)
y = reduced_df['Class']

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, random_state = 7)

In [69]:
reduced_decision_tree = DecisionTreeClassifier()
reduced_decision_tree.fit(x_train, y_train)

DecisionTreeClassifier()

In [72]:
y_predict = reduced_decision_tree.predict(x_test)
print(classification_report(y_test, y_predict))
print(roc_auc_score(y_test, y_predict))

              precision    recall  f1-score   support

           0       0.97      0.97      0.97        29
           1       0.91      0.91      0.91        11

    accuracy                           0.95        40
   macro avg       0.94      0.94      0.94        40
weighted avg       0.95      0.95      0.95        40

0.9373040752351097


Now these are some useful results. As we can see, the F1-scores of both class 1 and class 0 are more realistic than before, although the test set now holds only 40 transactions. We also have an AUC of about 0.94, which is pretty good.
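
One caveat on that AUC value: `roc_auc_score` was given the hard 0/1 predictions rather than scores. In that situation the ROC curve has a single intermediate point, so the AUC reduces to the average of the two per-class recalls. The counts below are inferred from the report above (28 of 29 valid and 10 of 11 fraudulent transactions classified correctly); they are not printed by the notebook.

```python
# Correct predictions per class, inferred from the second classification report
correct_class0, support_class0 = 28, 29
correct_class1, support_class1 = 10, 11

recall_class0 = correct_class0 / support_class0
recall_class1 = correct_class1 / support_class1

# With hard 0/1 predictions, the AUC equals the mean of the two recalls
auc_from_hard_labels = (recall_class0 + recall_class1) / 2
print(round(auc_from_hard_labels, 4))  # 0.9373
```

To get an AUC that reflects the whole range of thresholds, one option is to pass class-1 probabilities instead, for example `reduced_decision_tree.predict_proba(x_test)[:, 1]`, keeping in mind that a fully grown tree tends to output probabilities of exactly 0 or 1.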

We can assume that this model will work better than the one trained on the unbalanced data. My final touch would be to repeat the random sampling from the bigger dataset several times to see whether the decision tree algorithm is good enough for our purpose or whether we need to look for some other classification method. :)
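
The repeated sampling could be sketched as follows. This is a plain-Python illustration on made-up labels, not the notebook's code: `undersample_indices` is a hypothetical helper, and the 5,000/100 class sizes are placeholders for the real `df['Class']` column.

```python
import random

def undersample_indices(labels, n_valid, n_fraud, seed):
    """Pick random row positions: n_valid of class 0 and n_fraud of class 1."""
    rng = random.Random(seed)
    valid_idx = [i for i, label in enumerate(labels) if label == 0]
    fraud_idx = [i for i, label in enumerate(labels) if label == 1]
    return rng.sample(valid_idx, n_valid) + rng.sample(fraud_idx, n_fraud)

# Made-up labels standing in for df['Class']: 5,000 valid and 100 fraudulent rows
labels_demo = [0] * 5000 + [1] * 100

# One sample per seed; each one would be split, fitted and scored separately
samples = [undersample_indices(labels_demo, 150, 50, seed) for seed in range(5)]

for seed, idx in enumerate(samples):
    n_fraud = sum(labels_demo[i] for i in idx)
    print(seed, len(idx), n_fraud)  # every line shows 200 rows, 50 of them fraud
```

In the notebook itself, the same effect can be had by passing a different `random_state` to `DataFrame.sample` on each repetition, then comparing or averaging the resulting metrics across runs.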