Credit card fraud detection is a critical task in the financial industry to prevent unauthorized transactions and protect consumers from financial losses. Both machine learning and deep learning techniques have been extensively applied to address this problem. Here's an overview of how each approach can be used:

**Machine Learning for Credit Card Fraud Detection**:

- Supervised Learning: In supervised learning, historical transaction data labeled as either fraudulent or non-fraudulent is used to train classification models. Common algorithms include logistic regression, decision trees, random forests, support vector machines (SVM), and ensemble methods like gradient boosting classifiers. These models learn patterns and features from the data to classify new transactions as fraudulent or legitimate.

- Anomaly Detection: Anomaly detection techniques are also widely used for fraud detection, especially in cases where fraudulent transactions are rare and hard to distinguish from normal behavior. Algorithms like Isolation Forest, Local Outlier Factor (LOF), and One-Class SVM are commonly used for detecting anomalies in credit card transactions based on deviations from normal behavior patterns.
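As a rough illustration of the supervised route described above, the sketch below trains a logistic regression on a synthetic, imbalanced stand-in for labeled transactions. The data, the names `X_demo` and `y_demo`, and the choice of `class_weight="balanced"` are assumptions made for the example; none of them come from the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for labeled transactions (NOT the notebook's dataset):
# 5000 "legitimate" points around 0 and 100 "fraud" points around 3.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, size=(5000, 2)),
                    rng.normal(3.0, 1.0, size=(100, 2))])
y_demo = np.array([0] * 5000 + [1] * 100)

# Stratify so that both splits keep the same fraud ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.3, stratify=y_demo, random_state=0)

# class_weight="balanced" up-weights the rare class during training.
clf = LogisticRegression(class_weight="balanced", max_iter=1000)
clf.fit(X_train, y_train)

fraud_recall = recall_score(y_test, clf.predict(X_test))
print(f"Recall on the fraud class: {fraud_recall:.2f}")
```

Real transaction data is rarely this cleanly separated, so the recall printed here says nothing about what to expect in practice.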

**Deep Learning for Credit Card Fraud Detection**:

- Neural Networks: Deep learning models, particularly neural networks, have shown promise in detecting complex patterns and nonlinear relationships in high-dimensional data like credit card transactions. Architectures like feedforward neural networks, convolutional neural networks (CNNs), and recurrent neural networks (RNNs) can be employed for fraud detection tasks.

- Autoencoders: Autoencoder architectures, which consist of an encoder and a decoder, can be used for unsupervised anomaly detection. The model learns to reconstruct its input, and anomalies are identified by large reconstruction errors.

- Graph Neural Networks (GNNs): GNNs can capture the relational structure between entities (e.g., accounts, merchants) in transaction networks. By modeling transaction graphs, GNNs can learn to detect fraudulent patterns that involve complex relationships between entities.

**Challenges and Considerations**:

Imbalanced Data: Credit card fraud datasets are typically highly imbalanced, with fraudulent transactions being a small fraction of the total. Dealing with imbalanced data requires careful selection of evaluation metrics, sampling techniques, and model optimization strategies.
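To see why the choice of metric matters under this kind of imbalance, consider a "model" that never flags fraud. The sketch below uses the class counts reported later in this notebook (15862 legitimate, 73 fraudulent); the variable names are made up for the example.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Class counts reported later in this notebook (after dropping the NaN row).
n_legit, n_fraud = 15862, 73
y_true_demo = np.array([0] * n_legit + [1] * n_fraud)

# A trivial baseline that labels every transaction as legitimate.
y_pred_demo = np.zeros_like(y_true_demo)

baseline_accuracy = accuracy_score(y_true_demo, y_pred_demo)
baseline_recall = recall_score(y_true_demo, y_pred_demo)
print(f"accuracy={baseline_accuracy:.4f}, fraud recall={baseline_recall:.2f}")
```

The baseline scores above 99% accuracy while catching no fraud at all, which is why precision, recall, and F1 on the minority class are the numbers to watch.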

Feature Engineering: Extracting relevant features from transaction data is crucial for building effective fraud detection models. Features may include transaction amount, time, location, user behavior patterns, and more. Feature engineering techniques play a vital role in improving model performance.

Real-Time Detection: In practice, credit card fraud detection systems need to operate in real-time to block suspicious transactions as they occur. This requires efficient model inference and integration with transaction processing systems.

In summary, both machine learning and deep learning techniques offer effective approaches for credit card fraud detection, each with its strengths and considerations. The choice of approach depends on factors such as the nature of the data, the complexity of fraud patterns, and the requirements of the deployment environment.

In [1]:
import pandas as pd

In [2]:
data= pd.read_csv("creditcard.csv")
data.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,0,-1.359807,-0.072781,2.536347,1.378155,-0.338321,0.462388,0.239599,0.098698,0.363787,...,-0.018307,0.277838,-0.110474,0.066928,0.128539,-0.189115,0.133558,-0.021053,149.62,0.0
1,0,1.191857,0.266151,0.16648,0.448154,0.060018,-0.082361,-0.078803,0.085102,-0.255425,...,-0.225775,-0.638672,0.101288,-0.339846,0.16717,0.125895,-0.008983,0.014724,2.69,0.0
2,1,-1.358354,-1.340163,1.773209,0.37978,-0.503198,1.800499,0.791461,0.247676,-1.514654,...,0.247998,0.771679,0.909412,-0.689281,-0.327642,-0.139097,-0.055353,-0.059752,378.66,0.0
3,1,-0.966272,-0.185226,1.792993,-0.863291,-0.010309,1.247203,0.237609,0.377436,-1.387024,...,-0.1083,0.005274,-0.190321,-1.175575,0.647376,-0.221929,0.062723,0.061458,123.5,0.0
4,2,-1.158233,0.877737,1.548718,0.403034,-0.407193,0.095921,0.592941,-0.270533,0.817739,...,-0.009431,0.798278,-0.137458,0.141267,-0.20601,0.502292,0.219422,0.215153,69.99,0.0


In [3]:
import numpy as np
np.unique(data["Class"] , return_counts=True)

# This means 15862 legitimate transactions (label 0),
# 73 fraudulent transactions (label 1), and 1 row with a missing label.
# contamination = fraud / total = 73 / 15935 ≈ 0.0046, an estimate of the fraction of outliers in the data

(array([ 0.,  1., nan]), array([15862,    73,     1]))
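The contamination estimate can also be computed rather than worked out by hand. The sketch below rebuilds a label array from the counts printed above instead of reading `creditcard.csv`, so the array `labels_demo` is a stand-in, not the notebook's `data["Class"]` column.

```python
import numpy as np

# Stand-in for data["Class"], rebuilt from the counts printed above.
labels_demo = np.array([0.0] * 15862 + [1.0] * 73 + [np.nan])

# Ignore the row with a missing label, then take the share of label 1.
labels_demo = labels_demo[~np.isnan(labels_demo)]
contamination_estimate = float((labels_demo == 1).mean())
print(f"contamination ≈ {contamination_estimate:.4f}")
```

The detectors below use `contamination=0.004`, a slightly rounded-down version of this estimate.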

In [7]:
# Drop rows with missing values
data.dropna(inplace=True)

In [8]:
from sklearn.neighbors import LocalOutlierFactor


In [9]:
# Create the feature matrix X
X = data.drop("Class", axis=1)

# Create the LOF model
LOF = LocalOutlierFactor(n_neighbors=10, contamination=0.004)

# Fit the model and predict outliers
y_predict = LOF.fit_predict(X)

In [11]:
print(y_predict)

[1 1 1 ... 1 1 1]


In [12]:
np.unique(y_predict , return_counts=True)
# -1 marks an outlier (predicted fraud); 1 marks an inlier

(array([-1,  1]), array([   64, 15871]))

In [13]:
y_predict[y_predict==1] = 0
y_predict[y_predict==-1] = 1

In [14]:
# Labels now match the dataset's convention: fraud -> 1, legitimate -> 0
np.unique(y_predict , return_counts=True)

(array([0, 1]), array([15871,    64]))

In [16]:
from sklearn import metrics



In [17]:
y_true= data["Class"]


In [19]:
print(metrics.classification_report(y_true=y_true, y_pred=y_predict))

              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     15862
         1.0       0.06      0.05      0.06        73

    accuracy                           0.99     15935
   macro avg       0.53      0.53      0.53     15935
weighted avg       0.99      0.99      0.99     15935



The results indicate that **the model's performance is not satisfactory**, especially for class 1 (the minority class). Here are some observations:

Precision and Recall for Class 1 (Anomalies): **The precision and recall values for class 1 are very low** (0.06 and 0.05, respectively). This means that only **a very small fraction** of the instances predicted as anomalies **are actually true anomalies** (precision), and the model also misses most of the true anomalies (recall).

F1-score: The F1-score, which is the harmonic mean of precision and recall, is also very low for class 1 (0.06). This indicates poor overall performance in identifying anomalies.
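The arithmetic behind these three numbers can be reproduced by hand. The notebook does not print a confusion matrix, so the true-positive count below is an assumption: 4 is a value consistent with the rounded report, given 64 predicted outliers and 73 actual frauds.

```python
# Counts taken from the outputs above, except tp, which is an ASSUMED value
# chosen to be consistent with the rounded classification report.
tp = 4                 # assumed number of correctly flagged frauds
predicted_pos = 64     # transactions flagged as outliers by LOF
actual_pos = 73        # fraudulent transactions in the data

precision = tp / predicted_pos
recall = tp / actual_pos
f1 = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
```

Under that assumption, about 60 of the 64 alerts would be false alarms and about 69 of the 73 frauds would go undetected.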

Imbalanced Dataset: The dataset seems to be heavily imbalanced, with a large number of instances belonging to class 0 (normal) and a very small number belonging to class 1 (anomalies). This imbalance can affect the model's ability to learn patterns in the minority class and may lead to biased results.

Accuracy: While the overall accuracy of 99% may seem high, it can be misleading in the presence of imbalanced data. In this case, the high accuracy is mainly due to the large number of correctly predicted instances in class 0, but it does not reflect the model's performance on the minority class.

In summary, the model's performance, especially for detecting anomalies (class 1), is not satisfactory, and further investigation and possibly model improvement are needed.

# Python Outlier Detection(PyOD)

PyOD includes both angle-based and distance-based detectors; which family is in play depends on the algorithm selected.

**Angle-Based Methods** (scikit-learn does not ship an angle-based detector):

Angle-based methods look at the angles that a data point forms with pairs of other points, rather than only at distances. The idea is aimed at high-dimensional spaces, where distances alone become less informative.
One example of an angle-based method in PyOD is the Angle-based Outlier Detector (**ABOD**), which computes the variance of the angles formed at a data point by pairs of other points. A point deep inside a cluster sees its neighbors in many directions (high variance), whereas an outlier sees most other points in roughly the same direction (low variance), so low variance indicates an outlier.

**Angle-based methods can be effective when dealing with data that exhibit complex geometric structures or non-linear relationships.**
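The angle-variance idea can be sketched in a few lines of NumPy. This is a simplified illustration, not PyOD's implementation: it takes the plain variance of cosines over all pairs, whereas ABOD as published also weights each pair by distance. The function name `angle_variance` and the toy data are assumptions made for the example.

```python
from itertools import combinations

import numpy as np


def angle_variance(X, i):
    """Variance of the cosine of angles formed at X[i] by pairs of other points."""
    p = X[i]
    others = np.delete(X, i, axis=0)
    cosines = []
    for a, b in combinations(others, 2):
        va, vb = a - p, b - p
        cosines.append(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))
    return np.var(cosines)


# Toy data: a 2-D cluster of 30 points plus one far-away point (index 30).
rng = np.random.default_rng(0)
X_abod = np.vstack([rng.normal(size=(30, 2)), [[10.0, 10.0]]])

variances = np.array([angle_variance(X_abod, i) for i in range(len(X_abod))])
most_outlying = int(np.argmin(variances))  # lowest variance = most outlying
print("Index with the smallest angle variance:", most_outlying)
```

The loop over all pairs is cubic in the number of points overall, which is why practical implementations usually restrict the pairs to each point's nearest neighbors.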


**Distance-Based Methods**:

Distance-based methods measure the distance between data points in the feature space. Outliers are often identified as data points that are farthest from the majority of the data.
Popular proximity-based algorithms in PyOD include k-Nearest Neighbors (kNN) and Local Outlier Factor (LOF); LOF compares local densities derived from neighbor distances. Isolation Forest, which is also available in PyOD, is tree-based rather than distance-based.

Distance-based methods work well on low-dimensional data and remain usable in higher dimensions, although distances tend to become less discriminative as dimensionality grows. They are particularly effective when outliers are defined as instances that are significantly distant from the rest of the data points.
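A minimal sketch of a distance-based score, assuming the common convention of using the distance to the k-th nearest neighbor as the outlier score. It uses scikit-learn's `NearestNeighbors` directly rather than PyOD's `KNN` class, and the toy data and `k = 5` are arbitrary choices for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Toy data: 100 clustered points plus one distant point (index 100).
rng = np.random.default_rng(0)
X_knn = np.vstack([rng.normal(size=(100, 2)), [[8.0, 8.0]]])

k = 5
# Query k + 1 neighbors because each point's nearest neighbor is itself.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_knn)
distances, _ = nn.kneighbors(X_knn)

# Outlier score: distance to the k-th nearest *other* point.
knn_scores = distances[:, -1]
farthest = int(np.argmax(knn_scores))
print("Index with the largest k-NN distance:", farthest)
```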



Both angle-based and distance-based methods have their advantages and limitations, and the choice between them depends on factors such as the nature of the data, the desired level of interpretability, and the computational requirements. PyOD provides a variety of algorithms that cover both types of methods, allowing users to select the most appropriate approach based on their specific needs and the characteristics of the dataset.

PyOD (Python Outlier Detection) is a comprehensive Python library for detecting outliers and anomalies in data. It provides a wide range of algorithms and tools for various anomaly detection tasks. Here's an overview of PyOD's application in anomaly detection:

Wide Range of Algorithms:

PyOD offers a rich collection of outlier detection algorithms, including both traditional statistical methods and modern machine learning techniques. These algorithms cover various approaches such as proximity-based, linear models, clustering-based, and ensemble methods.
Some popular algorithms included in PyOD are Isolation Forest, Local Outlier Factor (LOF), k-Nearest Neighbors (kNN), One-Class SVM, Principal Component Analysis (PCA), and more.

Flexibility and Customization:

PyOD provides a unified API (`fit`, `predict`, `decision_function`) across its algorithms, making it easy to compare and evaluate multiple methods on the same dataset.
Users can customize the parameters and settings of each algorithm to adapt to different data characteristics and application requirements.

Scalability and Performance:

Some algorithms in PyOD are designed to handle large datasets efficiently. For example, Isolation Forest scales roughly linearly with the number of data points, whereas kNN-based methods depend on neighbor search, which is quadratic in the worst case unless a tree or other index speeds it up.
Several PyOD detectors also accept an `n_jobs` parameter to run in parallel on multicore systems.

Evaluation and Model Selection:

PyOD includes utilities for evaluating and benchmarking outlier detection algorithms. Users can assess the performance of different methods using standard metrics such as precision, recall, F1-score, and area under the ROC curve (AUC-ROC).
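The library also ships small helper utilities for printing evaluation results and visualizing detector output. Cross-validation, model selection, and hyperparameter tuning are typically handled with external tooling such as scikit-learn rather than by PyOD itself.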
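Threshold-free metrics such as AUC-ROC work on the raw outlier scores instead of the 0/1 predictions. The sketch below is one way to compute them with scikit-learn; the synthetic data, the variable names, and `n_neighbors=20` are assumptions for the example, not settings taken from this notebook.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: 1000 inliers near the origin, 20 outliers spread widely.
rng = np.random.default_rng(0)
X_eval = np.vstack([rng.normal(size=(1000, 5)),
                    rng.uniform(-8.0, 8.0, size=(20, 5))])
y_eval = np.array([0] * 1000 + [1] * 20)

lof = LocalOutlierFactor(n_neighbors=20)
lof.fit(X_eval)

# negative_outlier_factor_ is lower for outliers, so flip the sign
# to get a score where higher means "more anomalous".
scores_eval = -lof.negative_outlier_factor_

auc_demo = roc_auc_score(y_eval, scores_eval)
ap_demo = average_precision_score(y_eval, scores_eval)
print(f"AUC-ROC={auc_demo:.3f}, average precision={ap_demo:.3f}")
```

Average precision is often the more informative of the two on heavily imbalanced data, because AUC-ROC is dominated by the large number of true negatives.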

Application Areas:

PyOD can be applied to various domains and use cases where anomaly detection is required, including fraud detection, network security, IoT (Internet of Things) monitoring, financial transactions, healthcare, and more.
It is suitable for both batch processing and real-time anomaly detection scenarios.

Overall, PyOD is a powerful and versatile library for anomaly detection tasks, offering a comprehensive set of algorithms, tools, and utilities to support a wide range of applications. It simplifies the process of implementing, evaluating, and deploying outlier detection solutions, making it accessible to both researchers and practitioners in the field.

In [21]:
!pip install pyod


Collecting pyod
  Downloading pyod-1.1.3.tar.gz (160 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: pyod
  Building wheel for pyod (setup.py) ... done
  Created wheel for pyod: filename=pyod-1.1.3-py3-none-any.whl size=190251 sha256=e0747c081ca776bd4e430b9c0c7b519ffec5e32ed5226c3d95973daf4b978c97
  Stored in directory: /root/.cache/pip/wheels/05/f8/db/124d43bec122d6ec0ab3713fadfe25ebed8af52ec561682b4e
Successfully built pyod
Installing collected packages: pyod
Successfully installed pyod-1.1.3


In [25]:
!pip install --upgrade pyod




In [26]:
# PyOD ABOD: an angle-based detector
from pyod.models.abod import ABOD
from sklearn import metrics

# X is the feature matrix created earlier (data without the "Class" column)

# Create the ABOD model
abod_model = ABOD(contamination=0.004)

# Fit the model and predict outliers
abod_model.fit(X)
y_predict = abod_model.predict(X)

# Get true labels
y_true = data["Class"]

# Calculate and print classification report
print(metrics.classification_report(y_true=y_true, y_pred=y_predict))


  return _methods._var(a, axis=axis, dtype=dtype, out=out, ddof=ddof,
  arrmean = um.true_divide(arrmean, div, out=arrmean,
  ret = ret.dtype.type(ret / rcount)


              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     15862
         1.0       0.06      0.04      0.05        73

    accuracy                           0.99     15935
   macro avg       0.53      0.52      0.52     15935
weighted avg       0.99      0.99      0.99     15935



In [23]:
# PyOD LOF: a distance (local density) based detector
from pyod.models.lof import LOF
from sklearn import metrics

# Create the feature matrix X
X = data.drop("Class", axis=1)

# Create the LOF model
lof_model = LOF(n_neighbors=10, contamination=0.004)

# Fit the model and predict outliers
lof_model.fit(X)
y_predict = lof_model.predict(X)

# Get true labels
y_true = data["Class"]

# Calculate and print classification report
print(metrics.classification_report(y_true=y_true, y_pred=y_predict))


              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     15862
         1.0       0.05      0.03      0.03        73

    accuracy                           0.99     15935
   macro avg       0.52      0.51      0.52     15935
weighted avg       0.99      0.99      0.99     15935



In [24]:
from pyod.models.knn import KNN
from sklearn import metrics

# Create the feature matrix X
X = data.drop("Class", axis=1)

# Create the KNN model
knn_model = KNN(contamination=0.004)

# Fit the model and predict outliers
knn_model.fit(X)
y_predict = knn_model.predict(X)

# Get true labels
y_true = data["Class"]

# Calculate and print classification report
print(metrics.classification_report(y_true=y_true, y_pred=y_predict))


              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     15862
         1.0       0.04      0.03      0.03        73

    accuracy                           0.99     15935
   macro avg       0.52      0.51      0.51     15935
weighted avg       0.99      0.99      0.99     15935



# IsolationForest

Isolation Forest is an algorithm used for anomaly detection, particularly in high-dimensional datasets. It works by isolating anomalies in the dataset by randomly selecting a feature and then randomly selecting a split value between the maximum and minimum values of the selected feature. This process is repeated recursively until all instances are isolated.

The main idea behind Isolation Forest is that anomalies are likely to be isolated in fewer splits compared to normal instances, as they tend to have attribute values that are very different from those of normal instances. Therefore, the average path length to isolate an anomaly is expected to be shorter than that of a normal instance.
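The path-length intuition can be checked with a toy version of a single isolation path. This is a simplified sketch, not scikit-learn's `IsolationForest`: it follows one point through repeated random splits and counts how many splits it takes to be left alone. The function name `isolation_depth` and the toy data are assumptions made for the example.

```python
import numpy as np


def isolation_depth(X, idx, rng, max_depth=50):
    """Number of random splits needed until X[idx] is alone in its partition."""
    rows = np.arange(len(X))
    depth = 0
    while len(rows) > 1 and depth < max_depth:
        f = rng.integers(X.shape[1])            # random feature
        lo, hi = X[rows, f].min(), X[rows, f].max()
        if lo == hi:                            # nothing left to split on
            break
        split = rng.uniform(lo, hi)             # random split value
        # Keep only the rows that fall on the same side as the tracked point.
        same_side = (X[rows, f] < split) == (X[idx, f] < split)
        rows = rows[same_side]
        depth += 1
    return depth


rng = np.random.default_rng(0)
X_iso = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])
outlier_idx = 200
inlier_idx = int(np.argmin(np.linalg.norm(X_iso[:-1], axis=1)))  # most central point

outlier_depth = np.mean([isolation_depth(X_iso, outlier_idx, rng) for _ in range(200)])
inlier_depth = np.mean([isolation_depth(X_iso, inlier_idx, rng) for _ in range(200)])
print(f"mean depth: outlier={outlier_depth:.1f}, inlier={inlier_depth:.1f}")
```

Averaging over many random paths plays the role that averaging over many trees plays in the real algorithm.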

Isolation Forest has several advantages:

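Efficiency: Isolation Forest handles large, high-dimensional datasets efficiently because each split only picks one random feature and one random split value, and each tree is built on a small random subsample. No pairwise distances are computed.

Scalability: Building a tree on a subsample of size ψ costs roughly O(ψ log ψ), and scoring grows linearly with the number of data points, which makes the method scalable to large datasets.

Robustness: Because each tree sees only a small random subsample, Isolation Forest tends to be less sensitive to noise, and to groups of anomalies masking one another, than methods that rely on the full dataset.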

No assumptions about the data: It does not make any assumptions about the distribution of the data, making it suitable for a wide range of applications.

Whether Isolation Forest performs better than other detectors, such as the PyOD (Python Outlier Detection) models and scikit-learn's LocalOutlierFactor (LOF) used above, depends on various factors such as the characteristics of the dataset, the nature of the anomalies, and the specific requirements of the application.

In some cases, Isolation Forest may outperform other algorithms, especially when dealing with high-dimensional data or when anomalies are well-separated from normal instances. However, it's essential to evaluate the performance of different algorithms empirically on the specific dataset to determine which one works best for a particular use case. Additionally, ensemble methods like Isolation Forest can also be combined with other anomaly detection techniques for improved performance.

In [27]:
from sklearn.ensemble import IsolationForest
from sklearn import metrics

# Create the feature matrix X
X = data.drop("Class", axis=1)

# Create the Isolation Forest model
isoforest_model = IsolationForest(contamination=0.004, random_state=42)

# Fit the model and predict outliers
isoforest_model.fit(X)
y_predict = isoforest_model.predict(X)

# Convert predictions to binary labels (0: inliers, 1: outliers)
y_predict_binary = np.where(y_predict == -1, 1, 0)

# Get true labels
y_true = data["Class"]

# Calculate and print classification report
print(metrics.classification_report(y_true=y_true, y_pred=y_predict_binary))




              precision    recall  f1-score   support

         0.0       1.00      1.00      1.00     15862
         1.0       0.58      0.51      0.54        73

    accuracy                           1.00     15935
   macro avg       0.79      0.75      0.77     15935
weighted avg       1.00      1.00      1.00     15935



The result looks reasonable, but it depends on the specific requirements and context of your anomaly detection task.

Here's a breakdown of the metrics:

Precision (for class 1): Precision measures the ratio of true positive predictions to the total number of positive predictions made by the model. In this case, it **indicates that about 58% of the instances predicted** as anomalies **are actually true anomalies**.

Recall (for class 1): Recall measures the ratio of true positive predictions to the total number of actual positive instances in the dataset. It indicates that the model **correctly identifies around 51% of the actual anomalies**.

F1-score (for class 1): The F1-score is the harmonic mean of precision and recall. It provides a balance between precision and recall. In this case, the F1-score for class 1 is 0.54, which suggests a reasonable balance between precision and recall.

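Accuracy: Accuracy measures the ratio of correctly predicted instances to the total number of instances in the dataset. The reported 1.00 is a rounded value, not a perfect score: with a recall of 0.51 for class 1, roughly half of the fraudulent transactions are still missed. Because class 0 dominates the dataset, accuracy stays close to 1.00 regardless of how well fraud is detected.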

Macro avg and weighted avg: These metrics provide the average scores across all classes. They give a general overview of the model's performance across all classes.

Overall, the model achieves high accuracy and precision for the majority class (normal instances), but it has relatively lower recall and F1-score for the minority class (anomalies). Depending on your specific requirements, you may need to adjust the model or explore other techniques to improve its performance on detecting anomalies.
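One way to adjust the model without retraining is to move the decision threshold on the anomaly scores instead of relying on the cut-off implied by `contamination`. The sketch below picks the threshold with the best precision among those that reach a target recall. The synthetic data, the 0.9 recall target, and all variable names are assumptions for the example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_recall_curve, recall_score

# Synthetic data: 1000 inliers around 0 and 30 outliers around 4.
rng = np.random.default_rng(0)
X_thr = np.vstack([rng.normal(0.0, 1.0, size=(1000, 2)),
                   rng.normal(4.0, 1.0, size=(30, 2))])
y_thr = np.array([0] * 1000 + [1] * 30)

model = IsolationForest(random_state=42).fit(X_thr)
# score_samples is lower for anomalies; flip the sign so higher = more anomalous.
scores_thr = -model.score_samples(X_thr)

precision, recall, thresholds = precision_recall_curve(y_thr, scores_thr)

# precision/recall have one more entry than thresholds; ignore that last point.
reaches_target = recall[:-1] >= 0.9
best = int(np.argmax(np.where(reaches_target, precision[:-1], -1.0)))
chosen_threshold = thresholds[best]

flagged = (scores_thr >= chosen_threshold).astype(int)
achieved_recall = recall_score(y_thr, flagged)
print(f"threshold={chosen_threshold:.3f}, recall={achieved_recall:.2f}, "
      f"precision={precision[best]:.2f}")
```

Selecting the threshold on the same data used for evaluation is optimistic; in practice the threshold would be chosen on a held-out validation set.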