# <strong>Final Report - Credit Card Fraud Detection Model</strong>
<strong>Team members</strong>: Alexander Shih, Diksha Holla, Jihye Oh, Robert Pigue, Srujal Gawali
<br>
<br>
## 1. Introduction


---
### 1.1 Background

Credit card fraud is a major issue for banks and financial institutions across the world. A bank cannot manually verify every transaction that takes place, so it needs an automated way to detect which transactions are fraudulent. According to the Federal Trade Commission, there were over 2.7 million reports of credit card fraud in the United States in 2021, amounting to almost $5.8 billion in losses (Federal Trade Commission, 2022). Credit card fraud detection algorithms can help reduce financial losses for both banks and individual customers.
<br>

### 1.2 Problem Definition

As mentioned in the background, credit card fraud is a critical concern for banks. In this project, we address the problem with machine learning models that banks could use to detect fraudulent transactions. Because it can adapt to evolving transaction patterns, an ML-based approach is a strong alternative to traditional methods such as rule-based systems or signature-based detection. Unlike these conventional approaches, which rely on predefined rules, an ML model can identify subtle anomalies in evolving fraudulent transactions and may reduce false positives. We therefore believe that ML-based credit card fraud detection offers a proactive solution for banks seeking to protect their customers against emerging threats.
<br>

### 1.3 Dataset

[Our dataset](https://www.kaggle.com/datasets/nelgiriyewithana/credit-card-fraud-detection-dataset-2023) consists of card transaction data from European cardholders in 2023. It contains more than 550,000 records and has been anonymized to conceal the cardholders' identities.
## 2. Methods


---
To solve this problem, we used Gaussian mixture model (GMM) clustering as an unsupervised approach and a random forest classifier as a supervised approach.
<br>
### 2.1 Pre-processing and Feature Selection
Since the data in the dataset was already cleaned, pre-processing consisted of screening the data for skewness and collinearity. Features were screened with two methods. First, they were ranked by collinearity, and the features with higher levels of collinearity were used to train the random forest model.  
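
As a minimal sketch of the ranking step, the snippet below orders features by their absolute correlation with the label. It is not the project's actual code: it uses synthetic data, it assumes "collinearity" means correlation with a `Class` column, and the column names and the 0.3 cutoff are illustrative assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the anonymized data; column names are illustrative.
rng = np.random.default_rng(0)
n = 2000
label = rng.integers(0, 2, size=n)
df = pd.DataFrame({
    "V1": 2.0 * label + rng.normal(size=n),   # strongly tied to the label
    "V2": -1.5 * label + rng.normal(size=n),  # moderately tied to the label
    "V3": rng.normal(size=n),                 # pure noise
    "Class": label,
})

# Rank features by absolute Pearson correlation with the label.
corr_with_class = (
    df.drop(columns="Class")
    .corrwith(df["Class"])
    .abs()
    .sort_values(ascending=False)
)

# Keep features above a hypothetical cutoff of 0.3.
selected = corr_with_class[corr_with_class > 0.3].index.tolist()

print(corr_with_class.round(3))
print("Selected features:", selected)
```

The same ranking could be applied to pairwise correlations between features if that reading of collinearity is the intended one.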

For feature selection, we used PCA as an initial approach for both methods. We examined all the principal components resulting from PCA based on their explained variance, to get a sense of how well the components represented the dataset. For the GMM, we considered this approach sufficient for feature selection, but for the random forest classifier we used additional methods to achieve higher accuracy.  
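
The explained-variance check can be sketched as follows. This is an illustration on synthetic data from `make_classification`, not the project's notebook; the 0.85 target is borrowed from the threshold mentioned later in this report, and the variable names are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA

# Synthetic stand-in for the transaction features.
X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)

# Fit PCA with all components and accumulate the explained variance ratios.
pca = PCA().fit(X)
cumulative = np.cumsum(pca.explained_variance_ratio_)

# Smallest number of components whose cumulative explained variance reaches 85%.
n_components_85 = int(np.searchsorted(cumulative, 0.85)) + 1

print("Cumulative explained variance:", cumulative.round(3))
print("Components needed for 85%:", n_components_85)
```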

The random forest classifier in sklearn can quantify the importance of each feature used for training by measuring the decrease in impurity that the feature produces within each tree. This importance metric was also used as a way of selecting features for the model. Finally, for the random forest classifier, more work was done with PCA to see if we could get better results. PCA was performed both on the entire dataset and on the high-collinearity subset, to produce principal components that could be used to train the random forest model.  
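
A minimal sketch of reading impurity-based importances from a fitted forest is shown below. The data is synthetic and built so that only the first two columns determine the label; the feature names and coefficients are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: only the first two columns influence the label.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 6))
y = (3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.5 * rng.normal(size=n) > 0).astype(int)
feature_names = [f"V{i + 1}" for i in range(X.shape[1])]

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Impurity-based importances, normalized by scikit-learn to sum to 1.
importances = forest.feature_importances_

# Indices of the features sorted from most to least important.
ranking = np.argsort(importances)[::-1]
top_2 = [feature_names[i] for i in ranking[:2]]

for i in ranking:
    print(f"{feature_names[i]}: {importances[i]:.3f}")
```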

The results were evaluated by examining the overall training and testing accuracy, precision, recall, F1-score, and confusion matrix of the trained model. When judging model effectiveness, we prioritized the true positive rate (the share of fraud instances that are caught) and a low number of false negatives. That is, the goal was to catch as much fraud as possible and avoid classifying fraudulent transactions as legitimate, even at a slight cost of flagging legitimate transactions as fraudulent.  
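
To make these metrics concrete, here is a small worked example on ten hand-written labels (1 = fraudulent, 0 = legitimate). The labels are made up for illustration and are not project results.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    f1_score,
    precision_score,
    recall_score,
)

# Ten hypothetical transactions: 4 fraudulent, 6 legitimate.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

# Rows are true classes, columns are predicted classes, in the order [0, 1].
cm = confusion_matrix(y_true, y_pred)
tn, fp, fn, tp = cm.ravel()

accuracy = accuracy_score(y_true, y_pred)    # (tp + tn) / total
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall

print(cm)
print(accuracy, precision, recall, f1)
```

Here one fraud case is missed (a false negative) and one legitimate transaction is flagged (a false positive), so recall and precision are both 3/4.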

### 2.2 Gaussian Mixture Model
A Gaussian Mixture Model is a useful unsupervised approach to this problem because it estimates the probability that a data point belongs to each cluster. A GMM can model clusters with different shapes and spreads, which some other clustering algorithms such as k-means cannot, so it captures heterogeneity in the data. One advantage it has over more robust approaches like random forest classifiers is a much shorter training time: it took as little as 2 seconds to train a model on 80% of our dataset, whereas the random forest classifier took over 10 minutes on the same data.
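
The sketch below shows how a two-component GMM yields both hard cluster assignments and per-cluster membership probabilities. It runs on two synthetic groups reduced to 2 PCA components; the group locations and sizes are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture

# Two synthetic groups in 5 dimensions, standing in for legitimate and fraud rows.
rng = np.random.default_rng(0)
legit = rng.normal(loc=0.0, scale=1.0, size=(500, 5))
fraud = rng.normal(loc=4.0, scale=1.0, size=(500, 5))
X = np.vstack([legit, fraud])

# Reduce to 2 principal components before clustering.
X_2d = PCA(n_components=2, random_state=0).fit_transform(X)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X_2d)
clusters = gmm.predict(X_2d)          # hard assignment: one cluster id per row
membership = gmm.predict_proba(X_2d)  # soft assignment: one probability per cluster

print(clusters[:5])
print(membership[:5].round(3))
```

Cluster ids are arbitrary, so cluster 0 is not guaranteed to correspond to the legitimate class; the ids need to be matched to labels before building a confusion matrix.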

### 2.3 Random Forest Classifier  
Random forest classifiers are a good supervised approach to this problem because they handle classification with a large number of variables well and provide good information on which variables are the most important to consider. The random forest classifier was trained on several subsets of the data, and the performance of the resulting models was compared. As a baseline, the model was trained on the whole dataset with no preprocessing, and on a set of features with higher collinearity (V1-V22). Next, PCA was performed on the data, and 9 principal components were selected to train the model because together they explained over 85% of the variance.  
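
One way to train a forest on just enough principal components to reach a variance target is sketched below. It is an assumption-laden illustration on synthetic data: passing a float to `PCA(n_components=...)` with `svd_solver="full"` lets scikit-learn choose the component count, and the forest settings are placeholders rather than the project's configuration.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

X, y = make_classification(
    n_samples=2000, n_features=20, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# PCA keeps the fewest components that explain at least 85% of the variance,
# then the forest is trained on those components only.
model = make_pipeline(
    PCA(n_components=0.85, svd_solver="full"),
    RandomForestClassifier(n_estimators=50, random_state=0),
)
model.fit(X_train, y_train)

n_kept = model.named_steps["pca"].n_components_
explained = model.named_steps["pca"].explained_variance_ratio_.sum()
test_accuracy = model.score(X_test, y_test)

print(f"Components kept: {n_kept}, explained variance: {explained:.3f}")
print(f"Test accuracy: {test_accuracy:.3f}")
```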

The importance of each feature was plotted to identify important features, and the model was trained on subsets of the 3, 5, 6, and 9 most important features as determined by the initial random forest model.
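
The subset comparison can be sketched like this: rank features with a baseline forest, then retrain on the top-k columns for each k. The data is synthetic and the forest settings are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=12, n_informative=6, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Baseline forest on all features provides the importance ranking.
baseline = RandomForestClassifier(n_estimators=50, random_state=0)
baseline.fit(X_train, y_train)
ranking = np.argsort(baseline.feature_importances_)[::-1]

# Retrain on the k most important features and record test accuracy.
scores = {}
for k in (3, 5, 6, 9):
    cols = ranking[:k]
    subset_model = RandomForestClassifier(n_estimators=50, random_state=0)
    subset_model.fit(X_train[:, cols], y_train)
    scores[k] = subset_model.score(X_test[:, cols], y_test)

for k, score in scores.items():
    print(f"top {k} features: test accuracy {score:.3f}")
```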


## 3. Results and Discussion


---
### 3.1 Data Preprocessing
a. Correlation Heatmap

Here is a correlation heatmap of our features. It shows the correlation between every pair of features. If our data were not anonymized, it would be interesting to see which named features are correlated. Despite the lack of named features, we can still notice interesting patterns. For example, seeing that V16-V18 are all closely correlated, or that V4 and V11 are correlated with class, gives us insights into patterns in the data.

<p style="text-align:center;"><img src="images/heatmap.png" style="width:70%;"/></p>

b. Feature Skew Graphs  
Here is a collection of graphs representing the skew of various features. Skewness measures asymmetry in a dataset.

<p style="text-align:center;"><img src="images/feature_skew.png" style="width:70%;"/></p>

Features V1, V10, and V23 have a negative skew. Their distributions are shown below:

<p style="text-align:center;"><img src="images/feature_skew_v1.png" style="width:70%;"/></p>
<p style="text-align:center;"><img src="images/feature_skew_v10.png" style="width:70%;"/></p>
<p style="text-align:center;"><img src="images/feature_skew_v23.png" style="width:70%;"/></p>
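
As a rough illustration of how skewness separates left-tailed from right-tailed features, the snippet below computes it with pandas on synthetic columns. The column names and the -0.5 cutoff are assumptions made for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5000
df = pd.DataFrame({
    "left_tailed": -rng.exponential(size=n),  # long tail to the left: negative skew
    "symmetric": rng.normal(size=n),          # skew close to zero
    "right_tailed": rng.exponential(size=n),  # long tail to the right: positive skew
})

# Sample skewness of each column.
skewness = df.skew()

# Flag columns that are clearly negatively skewed (hypothetical cutoff).
negative = skewness[skewness < -0.5].index.tolist()

print(skewness.round(2))
print("Negatively skewed:", negative)
```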

c. Class Distribution  
This graph shows us the proportion of the dataset that falls into the classifications of Legitimate and Fraudulent.

<p style="text-align:center;"><img src="images/class_distribution.png" style="width:40%;"/></p>

### 3.2 Employed Metrics  
  
##### GMM:  

For our GMM implementation, we computed a confusion matrix. We trained the model with 2 and 3 PCA components, and the extra component produced negligible changes in model accuracy. The results are not as accurate as those of the random forest, but we still get high true positive and true negative rates. The GMM did predict somewhat more false negatives, which is the costlier error for our goal, since false fraudulent predictions are preferable to false legitimate predictions. Overall, the unsupervised approach still predicts fraud with high accuracy.  
  
<p style="text-align:center;"><img src="images/2d.png" style="width:50%;"/></p>

<p style="text-align:center;"><img src="images/3d.png" style="width:50%;"/></p>

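We also computed the Adjusted Rand Score, which compares the cluster assignments with the true labels and corrects for chance: a score near 0 indicates random labeling and a score of 1 indicates a perfect match. Our GMM implementation achieved a score of 0.69219, which shows that the clusters agree substantially with the true classes and indicates that they were not formed at random.  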
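
The following toy example, with made-up labels, illustrates how the Adjusted Rand Score behaves. In particular, it does not depend on which cluster id is assigned to which class.

```python
from sklearn.metrics import adjusted_rand_score

true_labels = [0, 0, 0, 0, 1, 1, 1, 1]

# Same grouping as the truth, but with the cluster ids swapped.
flipped_ids = [1, 1, 1, 1, 0, 0, 0, 0]
# One point placed in the wrong cluster.
one_mistake = [0, 0, 0, 1, 1, 1, 1, 1]
# Clusters that cut across both classes evenly.
mixed = [0, 1, 0, 1, 0, 1, 0, 1]

perfect = adjusted_rand_score(true_labels, flipped_ids)
partial = adjusted_rand_score(true_labels, one_mistake)
chance_like = adjusted_rand_score(true_labels, mixed)

print(perfect, round(partial, 3), round(chance_like, 3))
```

The swapped-id case scores 1.0, a single mistake lowers the score noticeably on such a small sample, and the evenly mixed clustering scores below zero.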
  
In addition to the numerical analysis to validate our GMM model results, we plotted the PCA components and colored them by cluster, with 1 as fraudulent, and 0 as legitimate. Below are the clusterings for both the 2 component and 3 component GMM models, and it is clear that there are distinct clusters.  
  
<p style="text-align:center;"><img src="images/cluster1.png" style="width:70%;"/></p>

<p style="text-align:center;"><img src="images/cluster2.png" style="width:70%;"/></p>

The GMM true fraud accuracy (86%) was lower than the true legitimate accuracy (97%), which was not an issue we had with the random forest. This is likely because the cluster for fraud cases largely overlaps the legitimate cluster, and the legitimate cluster has a wider spread, which means it may absorb some of the fraud cases.  

Based on this, we found that GMM performance depends on the separation of the cluster means and the spread of each cluster: more overlap between clusters tends to reduce overall accuracy and, more specifically, per-class accuracy. This application had two classes with significant overlap, which is why we see a significant decrease in accuracy with the GMM compared to the random forest.
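
The effect of overlap can be reproduced on synthetic data. The sketch below fits a two-component GMM to a wide cluster and a narrow cluster at two different separations; the spreads, separations, and the helper name `gmm_accuracy` are all illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture


def gmm_accuracy(separation, n_per_class=1000, seed=0):
    """Cluster two synthetic classes with a GMM and return label agreement."""
    rng = np.random.default_rng(seed)
    # The "legitimate" cluster is wider than the "fraud" cluster.
    legit = rng.normal(loc=0.0, scale=2.0, size=(n_per_class, 2))
    fraud = rng.normal(loc=separation, scale=1.0, size=(n_per_class, 2))
    X = np.vstack([legit, fraud])
    y = np.array([0] * n_per_class + [1] * n_per_class)

    clusters = GaussianMixture(n_components=2, random_state=seed).fit_predict(X)

    # Cluster ids are arbitrary, so score the better of the two id-to-class mappings.
    agreement = np.mean(clusters == y)
    return max(agreement, 1.0 - agreement)


overlapping = gmm_accuracy(separation=1.0)
well_separated = gmm_accuracy(separation=8.0)

print(f"overlapping clusters:    {overlapping:.3f}")
print(f"well-separated clusters: {well_separated:.3f}")
```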


##### Random Forest:  

We implemented a random forest and, to study the results, used the ROC curve to show the trade-off between the true positive rate and the false positive rate. We also summarized the overall performance of our implementation by computing the AUC (area under the curve) of our ROC curve. Our curve is shown below; the calculated AUC value is approximately 0.9652.

<p style="text-align:center;"><img src="images/roc.png" style="width:50%;"/></p>

Our curve and AUC value show that our random forest implementation is good at separating fraudulent transactions from legitimate ones. We wanted our AUC to be well above 0.5, since 0.5 is equivalent to guessing randomly whether a transaction is fraudulent. A value close to 1.0, which is what we have, shows that we can predict fraud accurately. However, when a configuration reached 100% accuracy, we treated it as a red flag for overfitting, which we wanted to avoid.
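
A minimal sketch of computing an ROC curve and its AUC from a forest's predicted probabilities is given below. It uses synthetic data and placeholder hyperparameters, so the resulting AUC is unrelated to the value reported above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

forest = RandomForestClassifier(n_estimators=50, max_depth=5, random_state=0)
forest.fit(X_train, y_train)

# The ROC curve needs a continuous score: the predicted probability of class 1.
positive_scores = forest.predict_proba(X_test)[:, 1]

# One (false positive rate, true positive rate) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_test, positive_scores)
auc = roc_auc_score(y_test, positive_scores)

print(f"AUC: {auc:.3f}")
```

Plotting `fpr` against `tpr` gives the curve; the diagonal from (0, 0) to (1, 1) corresponds to an AUC of 0.5.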

To assess our random forest model, we also looked at individual decision trees to get a feel for how the forest was operating. A random forest is an ensemble learning method: it uses numerous decision trees and makes its decision by selecting the classification that a plurality of trees selects. Here is one such tree:

<p style="text-align:center;"><img src="images/decision_tree.png" style="width:60%;"/></p>

There are 8 levels to this tree, one more than the number of features used in our model. If we zoom in, we can see the condition checked at every node. The color of each node shows what proportion of the samples reaching it are fraudulent or legitimate. For example, the first split, which checks whether feature V10 <= 0.02, leads to two drastically different sides of the tree. The left side, for which the condition is true, is blue; the right side is orange. This means that from that one check alone, the tree has already separated the samples into a mostly fraudulent side and a mostly legitimate side. Deeper in the tree, we can see a few white nodes. The observations in these nodes are balanced between legitimate and fraudulent and require more conditions to be checked in order to determine legitimacy. Analysis of these decision trees gives us insight into which features are most important for determining legitimacy.
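
To make the ensemble vote concrete, the sketch below collects each tree's prediction from a fitted forest and takes the majority. It is an illustration on synthetic data; note that scikit-learn's own `predict` weights each tree's vote by its probability estimate, so the two results can differ on borderline samples.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(
    n_samples=1000, n_features=10, n_informative=5, random_state=0
)

# An odd number of trees avoids ties in a two-class vote.
forest = RandomForestClassifier(n_estimators=25, max_depth=4, random_state=0)
forest.fit(X, y)

# One row of 0/1 predictions per tree: shape (n_trees, n_samples).
votes = np.stack([tree.predict(X) for tree in forest.estimators_])

# Share of trees voting for class 1, then a simple majority decision.
positive_vote_share = votes.mean(axis=0)
hard_vote = (positive_vote_share > 0.5).astype(int)

# Compare the hand-rolled majority vote with the forest's own prediction.
agreement = np.mean(hard_vote == forest.predict(X))
print(f"Agreement with forest.predict: {agreement:.3f}")
```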

We also plotted the training and validation accuracy of the model across different maximum depths of the decision trees.

<p style="text-align:center;"><img src="images/accuracy.png" style="width:50%;"/></p>

As the maximum depth increases from 4 to 7, both accuracies show a consistent upward trend in the range of (0.95, 0.97). This implies that the model is successfully capturing subtle variations in the training data with high accuracy. Yet, it is noteworthy that as the maximum depth approaches its upper limit, we observe a less steep increase in the validation accuracy, which suggests a possibility of overfitting for high max depth values.
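
A depth sweep of this kind can be sketched as follows. The data is synthetic and the forest size is a placeholder, so the accuracies will not match the plot above; only the procedure is illustrated.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5, random_state=0
)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Train one forest per maximum depth and record (train, validation) accuracy.
results = {}
for depth in range(4, 8):
    forest = RandomForestClassifier(
        n_estimators=50, max_depth=depth, random_state=0
    )
    forest.fit(X_train, y_train)
    results[depth] = (
        forest.score(X_train, y_train),
        forest.score(X_val, y_val),
    )

for depth, (train_acc, val_acc) in results.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, validation={val_acc:.3f}")
```

A widening gap between the training and validation columns as depth grows is the usual sign of overfitting.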

We also used a confusion matrix to help visualize the predictions against the actual outcomes. It contains not just the true and false positives, as the ROC curve does, but the true and false negatives as well.

<p style="text-align:center;"><img src="images/confusion_matrix.png" style="width:50%;"/></p>

We can see that the true positives and true negatives have a very high rate, whereas false positives and false negatives have a very low rate. The matrix shows that our implementation, for the most part, has accurate predictions on whether there is fraud or not.

### 3.3 Overall Results

We achieved an accuracy rate of 96% in fraud detection with our Random Forest implementation, demonstrating the effectiveness of machine learning models for credit card fraud detection and their promise for future research. Our model is highly accurate even at low depths, indicating that only a few key features are most important for accurate classification of fraud.  

We achieved an accuracy rate of 91% in fraud detection with our GMM implementation. The model forms clusters that align with the true classes rather than random groupings, which leads to its high accuracy in detecting fraud.

### 3.4 Model Comparison: GMM vs Random Forest

Gaussian Mixture Model (GMM) and Random Forest are both commonly used in machine learning, each with its own advantages and drawbacks, as compared in the following table.

<table>
  <thead>
    <tr>
      <th>Methods</th>
      <th>GMM</th>
      <th>Random Forest</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Advantages</td>
      <td>
        <ul>
          <li>Faster Training Time</li>
          <li>Better performance with PCA components</li>
        </ul>
      </td>
      <td>
        <ul>
          <li>High accuracy in general</li>
          <li>High recall of true fraudulent cases</li>
          <li>Can identify individual important variables</li>
        </ul>
      </td>
    </tr>
    <tr>
      <td>Drawbacks</td>
      <td>
        <ul>
          <li>Poorer performance in general</li>
          <li>Lower recall of true fraudulent cases</li>
          <li>Reliant on more distinct class means</li>
        </ul>
      </td>
      <td>
        <ul>
          <li>Significantly longer training time</li>
          <li>Prone to overfitting without proper tuning</li>
          <li>Overfitting with PCA components</li>
        </ul>
      </td>
    </tr>
  </tbody>
</table>


## 4. Conclusion

---
Our project addressed the pressing issue of credit card fraud detection by implementing both a Gaussian Mixture Model (GMM) and a Random Forest. With accuracy rates reaching 91% for the GMM and 96% for the Random Forest, our findings underscore the efficacy and potential of these models. These results further support our approaches as adaptable solutions for financial institutions seeking robust measures to protect their customers from fraudulent transactions.

### Future Iterations
Moving forward, we aim to enhance the GMM implementation by adjusting its sensitivity to class means and to explore alternative models such as Naive Bayes to further optimize our credit card fraud detection models. Additionally, we intend to assess our models' performance on unbalanced datasets, which represent real-life scenarios more accurately, and to understand how these variations affect overall performance.

<br>

## References

---
Aburbeian, A. M., & Ashqar, H. I. (2023). Credit Card Fraud Detection Using Enhanced Random Forest Classifier for Imbalanced Data. https://doi.org/10.48550/arxiv.2303.06514  
Federal Trade Commission (2022). Consumer Sentinel Network Data Book 2021. https://www.ftc.gov/reports/consumer-sentinel-network-data-book-2021  
Melcher, K., & Silipo, R. (2019). Fraud detection using random forest, neural autoencoder, and isolation forest techniques. InfoQ. https://www.infoq.com/articles/fraud-detection-random-forest/  
Xuan, S., Liu, G., Li, Z., Zheng, L., Wang, S., & Jiang, C. (2018). Random forest for credit card fraud detection. 2018 IEEE 15th International Conference on Networking, Sensing and Control (ICNSC), pp. 1-6. https://doi.org/10.1109/ICNSC.2018.8361343
<br>
## Proposed Timeline

---
https://docs.google.com/spreadsheets/d/1TIT890cfIWgXPvwbE-o2cGhjh4PTi0kxe592Ma09y-8/edit?usp=sharing
<br>

## Technical Contribution Table

---

A technical contribution table below presents the names of all team members, explicitly detailing each member's technical contribution to the project preparation and implementation.

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Data Preprocessing</th>
      <th>GMM & Random Forest Implementation</th>
      <th>Result Visualization</th>
      <th>Discussion</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alexander Shih</td>
      <td></td>
      <td>✔️</td>
      <td>✔️</td>
      <td>✔️</td>
    </tr>
    <tr>
      <td>Diksha Holla</td>
      <td></td>
      <td></td>
      <td>✔️</td>
      <td>✔️</td>
    </tr>
    <tr>
      <td>Jihye Oh</td>
      <td></td>
      <td></td>
      <td>✔️</td>
      <td>✔️</td>
    </tr>
    <tr>
      <td>Robert Pigue</td>
      <td></td>
      <td></td>
      <td>✔️</td>
      <td>✔️</td>
    </tr>
    <tr>
      <td>Srujal Gawali</td>
      <td>✔️</td>
      <td></td>
      <td>✔️</td>
      <td>✔️</td>
    </tr>
  </tbody>
</table>

<br>

## Documentation Contribution Table

---

A documentation contribution table below presents the names of all team members, explicitly detailing each member's contribution to the project report.

<table>
  <thead>
    <tr>
      <th>Name</th>
      <th>Introduction</th>
      <th>Problem Definition</th>
      <th>Methods</th>
      <th>Results & Discussion</th>
      <th>Github Pages</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Alexander Shih</td>
      <td></td>
      <td></td>
      <td>✔️</td>
      <td>✔️</td>
      <td></td>
    </tr>
    <tr>
      <td>Diksha Holla</td>
      <td></td>
      <td></td>
      <td></td>
      <td>✔️</td>
      <td></td>
    </tr>
    <tr>
      <td>Jihye Oh</td>
      <td></td>
      <td></td>
      <td></td>
      <td>✔️</td>
      <td>✔️</td>
    </tr>
    <tr>
      <td>Robert Pigue</td>
      <td></td>
      <td></td>
      <td></td>
      <td>✔️</td>
      <td></td>
    </tr>
    <tr>
      <td>Srujal Gawali</td>
      <td>✔️</td>
      <td>✔️</td>
      <td></td>
      <td>✔️</td>
      <td></td>
    </tr>
  </tbody>
</table>


