Fraud Detection Analysis in Financial Transactions

Fraud detection analysis in financial transactions involves identifying and preventing fraudulent activities in financial data. This is a critical task in the financial industry, as fraud can result in significant financial losses and damage to a company's reputation.

Unsupervised anomaly detection algorithms, such as Isolation Forest and Local Outlier Factor (LOF), can be used to flag potentially fraudulent transactions in financial datasets. These algorithms identify data points that deviate from the bulk of the data, which may indicate fraudulent activity.

Importing the necessary libraries

In [1]:
import pandas as pd
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

Reading the file

In [32]:
data = pd.read_csv('C:/Users/Dell/OneDrive/Desktop/codtech internship/financial dataset.csv')

Creating a DataFrame

In [33]:
df = pd.DataFrame(data)  # pd.read_csv already returns a DataFrame, so this step is optional

Returning the first five rows of the data

In [34]:
df.head()

Unnamed: 0,Type,Amount,Customer ID,Recipient ID,Is Fraud
0,PAYMENT,9839.64,11231006815,21979787155,0
1,PAYMENT,1864.28,11666544295,22044282225,0
2,TRANSFER,181.0,11305486145,3553264065,1
3,CASH_OUT,181.0,1840083671,338997010,1
4,PAYMENT,11668.14,12048537720,21230701703,0


In [35]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1699 entries, 0 to 1698
Data columns (total 5 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Type          1699 non-null   object 
 1   Amount        1699 non-null   float64
 2   Customer ID   1699 non-null   int64  
 3   Recipient ID  1699 non-null   int64  
 4   Is Fraud      1699 non-null   int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 66.5+ KB


Checking for null values

In [36]:
df.isnull()

Unnamed: 0,Type,Amount,Customer ID,Recipient ID,Is Fraud
0,False,False,False,False,False
1,False,False,False,False,False
2,False,False,False,False,False
3,False,False,False,False,False
4,False,False,False,False,False
...,...,...,...,...,...
1694,False,False,False,False,False
1695,False,False,False,False,False
1696,False,False,False,False,False
1697,False,False,False,False,False
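
The element-wise table above is hard to scan for a frame of this size. A more compact check, sketched here on a small hypothetical frame (`toy` and its values are made up for illustration and are not the real dataset), is to count the missing values per column:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for the real dataset.
toy = pd.DataFrame({
    "Type": ["PAYMENT", "TRANSFER", None],
    "Amount": [9839.64, np.nan, 181.0],
    "Is Fraud": [0, 1, 1],
})

null_counts = toy.isnull().sum()        # number of missing values per column
has_nulls = toy.isnull().values.any()   # single True/False answer for the whole frame

print(null_counts)
print("Any nulls:", has_nulls)
```

Applied to `df`, the same two calls would confirm in one line what `df.info()` already suggests: every column has 1699 non-null entries.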


Converting the Categorical data into Numerical data

In [37]:
df['Type'] = df['Type'].map({'PAYMENT':0,
                                     'TRANSFER':1,
                                     'CASH_OUT':2,
                                     'CASH_IN':3,
                                     'DEBIT':4})
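
One thing to watch with `Series.map` and a dictionary is that any category missing from the dictionary silently becomes NaN. The sketch below shows a quick check for that; the sample series and the `REFUND` category are hypothetical and only serve to trigger the case.

```python
import pandas as pd

type_codes = {"PAYMENT": 0, "TRANSFER": 1, "CASH_OUT": 2, "CASH_IN": 3, "DEBIT": 4}

# Hypothetical sample; "REFUND" is deliberately absent from the mapping.
types = pd.Series(["PAYMENT", "CASH_OUT", "REFUND"])
encoded = types.map(type_codes)

# Categories that the dictionary did not cover end up as NaN after map().
unmapped = types[encoded.isna()].unique().tolist()

print(encoded.tolist())
print("Unmapped categories:", unmapped)
```

If `unmapped` is empty for the real `Type` column, the mapping covers every transaction type in the data.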

In [38]:
df.head()

Unnamed: 0,Type,Amount,Customer ID,Recipient ID,Is Fraud
0,0,9839.64,11231006815,21979787155,0
1,0,1864.28,11666544295,22044282225,0
2,1,181.0,11305486145,3553264065,1
3,2,181.0,1840083671,338997010,1
4,0,11668.14,12048537720,21230701703,0


Splitting the data into training and testing sets

In [39]:
X = df.drop('Is Fraud', axis=1)
y = df['Is Fraud']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
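
With very few fraud cases, a purely random split can leave the test set with only a handful of them. One option, shown here as a sketch on hypothetical synthetic labels (`X_demo` and `y_demo` are not the real data), is the `stratify` argument of `train_test_split`, which keeps the class proportions the same in both parts:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: 990 normal rows and 10 fraud rows.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3))
y_demo = np.array([0] * 990 + [1] * 10)

# stratify=y_demo preserves the 1% fraud share in both the train and test parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

print("Fraud in train:", y_tr.sum(), "Fraud in test:", y_te.sum())
```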

Implementing the Isolation Forest Algorithm for Anomaly Detection

In [41]:
# Isolation Forest with a contamination rate of 0.1
# (i.e., 10% of the data is expected to be anomalies)
if_model = IsolationForest(contamination=0.1)
if_model.fit(X_train)

y_pred_if = if_model.predict(X_test)
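
Isolation Forest builds its trees from random subsamples and random splits, so repeated runs without a fixed seed can give slightly different predictions. The sketch below, on hypothetical synthetic data, assumes one wants reproducible results and passes `random_state`; it also shows that `contamination=0.1` leads the model to flag roughly 10% of the data it was fitted on.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Hypothetical synthetic data standing in for the transaction features.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 2))

def fit_and_predict(seed):
    """Fit an Isolation Forest with a fixed seed and label the same points."""
    model = IsolationForest(contamination=0.1, random_state=seed)
    return model.fit(X_demo).predict(X_demo)

pred_a = fit_and_predict(42)
pred_b = fit_and_predict(42)

same = bool(np.array_equal(pred_a, pred_b))    # identical seed, identical labels
flagged_share = float(np.mean(pred_a == -1))   # share of points labelled as anomalies

print("Reproducible:", same)
print("Share flagged:", flagged_share)
```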



Converting the Isolation Forest predictions into fraud labels (-1, an anomaly, becomes 1; 1, a normal point, becomes 0)

In [43]:
y_pred_if = [1 if x == -1 else 0 for x in y_pred_if]
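
scikit-learn's outlier detectors return -1 for outliers and 1 for inliers, which is why the labels are remapped before they are compared with `Is Fraud`. A vectorized alternative to the list comprehension, sketched here on a hypothetical prediction array, is `np.where`:

```python
import numpy as np

# Hypothetical output of predict(): 1 = inlier, -1 = outlier.
raw_pred = np.array([1, -1, 1, 1, -1])

# Map outliers (-1) to the fraud label 1 and inliers (1) to the normal label 0.
fraud_pred = np.where(raw_pred == -1, 1, 0)

print(fraud_pred.tolist())
```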

In [44]:
print("Isolation Forest:")
print("Accuracy:", accuracy_score(y_test, y_pred_if))
print("Classification Report:")
print(classification_report(y_test, y_pred_if))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_if))

Isolation Forest:
Accuracy: 0.9029411764705882
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.91      0.95       337
           1       0.00      0.00      0.00         3

    accuracy                           0.90       340
   macro avg       0.50      0.46      0.47       340
weighted avg       0.98      0.90      0.94       340

Confusion Matrix:
[[307  30]
 [  3   0]]
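
The figures in the classification report follow directly from the confusion matrix. The sketch below recomputes them by hand from the four counts printed above (rows are true classes, columns are predicted classes, which is the layout `confusion_matrix` uses):

```python
import numpy as np

# Confusion matrix reported for the Isolation Forest run above.
cm = np.array([[307, 30],
               [3, 0]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tn + tp) / cm.sum()       # 307 / 340
recall_normal = tn / (tn + fp)        # 307 / 337: normal rows labelled normal
precision_normal = tn / (tn + fn)     # 307 / 310: predicted-normal rows that are normal
recall_fraud = tp / (tp + fn)         # 0 / 3: fraud rows that were caught

print(f"accuracy={accuracy:.4f}")
print(f"class 0: precision={precision_normal:.2f}, recall={recall_normal:.2f}")
print(f"class 1: recall={recall_fraud:.2f}")
```

The last line makes the zero recall for class 1 explicit: none of the 3 fraudulent test transactions is among the 30 points the model flagged.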


Also Implementing the Local Outlier Factor Algorithm

In [42]:
# Local Outlier Factor (LOF) model with 20 nearest neighbors and a contamination rate of 0.1.
# In its default mode LOF has no separate predict step, so it is fitted on the testing data itself.
lof_model = LocalOutlierFactor(n_neighbors=20, contamination=0.1)
y_pred_lof = lof_model.fit_predict(X_test)
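
In this cell LOF is fitted on the testing data, whereas the Isolation Forest was fitted on the training data. If the same train-then-predict workflow is wanted for LOF, scikit-learn offers a novelty mode (`novelty=True`) that enables `predict` on unseen points. The following is a sketch on hypothetical synthetic data, not the notebook's method:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical "training" points drawn around the origin.
rng = np.random.default_rng(0)
X_fit = rng.normal(size=(200, 2))

# Two unseen points: one typical, one far from the training cloud.
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])

# novelty=True lets the fitted model score new data via predict().
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True)
lof_novelty.fit(X_fit)

pred_new = lof_novelty.predict(X_new)  # 1 = inlier, -1 = outlier
print(pred_new)
```

In novelty mode, `predict` is meant for new data only; calling `fit_predict` is not available on such a model.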

Converting the LOF predictions into fraud labels (-1, an anomaly, becomes 1; 1, a normal point, becomes 0)

In [45]:
y_pred_lof = [1 if x == -1 else 0 for x in y_pred_lof]

In [46]:
print("\nLocal Outlier Factor:")
print("Accuracy:", accuracy_score(y_test, y_pred_lof))
print("Classification Report:")
print(classification_report(y_test, y_pred_lof))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred_lof))


Local Outlier Factor:
Accuracy: 0.8911764705882353
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.90      0.94       337
           1       0.00      0.00      0.00         3

    accuracy                           0.89       340
   macro avg       0.50      0.45      0.47       340
weighted avg       0.98      0.89      0.93       340

Confusion Matrix:
[[303  34]
 [  3   0]]


Conclusion

Isolation Forest

The model identifies normal transactions (class 0) well, with high precision (0.99) and recall (0.91) for this class.
The accuracy of the model is 0.90, which means that it correctly classifies 90% of the test transactions.
The confusion matrix shows that the model misclassifies 30 normal transactions as anomalous (false positives) and all 3 fraudulent transactions as normal (false negatives), so precision and recall for the fraud class (class 1) are both 0.00.

Local Outlier Factor (LOF)

The model also identifies normal transactions (class 0) well, with high precision (0.99) and recall (0.90) for this class.
The accuracy of the model is 0.89, which means that it correctly classifies 89% of the test transactions.
The confusion matrix shows that the model misclassifies 34 normal transactions as anomalous (false positives) and all 3 fraudulent transactions as normal (false negatives), so precision and recall for the fraud class (class 1) are both 0.00.

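Both models perform similarly, with the Isolation Forest model having a slightly higher accuracy (0.90 vs 0.89) and fewer false positives (30 vs 34) than the LOF model.
Neither model detects any of the 3 fraudulent transactions in the test set. Because only 3 of the 340 test transactions are fraudulent, the accuracy figures mostly reflect how well normal transactions are classified, not how well fraud is detected.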
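
To put the accuracy figures in context, it helps to compare them with a trivial baseline that labels every transaction as normal. The sketch below rebuilds labels with the same class counts as the test set reported above (337 normal, 3 fraud); the baseline itself is a hypothetical comparison and not part of the original analysis.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels with the same class counts as the reported test set.
y_true = np.array([0] * 337 + [1] * 3)

# Trivial baseline: predict "normal" for every transaction.
y_all_normal = np.zeros_like(y_true)

baseline_accuracy = accuracy_score(y_true, y_all_normal)      # 337 / 340
baseline_fraud_recall = recall_score(y_true, y_all_normal)    # 0 / 3

print(f"Baseline accuracy: {baseline_accuracy:.4f}")
print(f"Baseline fraud recall: {baseline_fraud_recall:.2f}")
```

The baseline reaches an accuracy of about 0.99 while catching no fraud, which is why recall and precision on class 1 are more informative than accuracy for this dataset.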