<a href="https://colab.research.google.com/github/karan-2004/financeanomalydetection/blob/main/Untitled1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# *****Anomaly detection in financial transactions*****

Approach:

      1. To utilize both the labelled and unlabelled datasets available.

      2. Due to the scarcity of the labelled data our approach is more towards the unsupervised learning and semi-supervised learning.

      3. considered algorithms:

                * Isolation forest
                * DBscan
                * One-class
                * Auto-encoders(self supervised learning)

      4. Eventhough we opted with Isolation forest  because of its robustness. It out performs auto encoders on low dimension datasets.

      5. Labelled data are utilized for the purpose of the validation.

      6. Due to the uneven proportions in labels, GAN is used to increase the efficiency of the NN model which is then used for the validation. **bold text**

Importing the necessary modules

In [None]:
import scipy
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

Loading of data

In [None]:
df = pd.read_csv("./financial_anomaly_data.csv")

Analysing the data

In [None]:
df.head()

Unnamed: 0,Timestamp,TransactionID,AccountID,Amount,Merchant,TransactionType,Location
0,01-01-2023 08:00,TXN1127,ACC4,95071.92,MerchantH,Purchase,Tokyo
1,01-01-2023 08:01,TXN1639,ACC10,15607.89,MerchantH,Purchase,London
2,01-01-2023 08:02,TXN872,ACC8,65092.34,MerchantE,Withdrawal,London
3,01-01-2023 08:03,TXN1438,ACC6,87.87,MerchantE,Purchase,London
4,01-01-2023 08:04,TXN1338,ACC6,716.56,MerchantI,Purchase,Los Angeles


In [None]:
df.shape

(76107, 7)

In [None]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76107 entries, 0 to 76106
Data columns (total 7 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Timestamp        76107 non-null  object 
 1   TransactionID    76107 non-null  object 
 2   AccountID        76107 non-null  object 
 3   Amount           76107 non-null  float64
 4   Merchant         76107 non-null  object 
 5   TransactionType  76107 non-null  object 
 6   Location         76107 non-null  object 
dtypes: float64(1), object(6)
memory usage: 4.1+ MB


In [None]:
df.describe()

Unnamed: 0,Amount
count,76107.0
mean,50141.400761
std,29403.04923
min,10.84
25%,25053.575
50%,50234.14
75%,75098.465
max,978942.26


In [None]:
df = df.dropna()

In [None]:
columns = df.columns

In [None]:
for column in columns:
  print(column, "\n")
  print(df[f'{column}'].value_counts(), "\n\n")

Timestamp 

01-01-2023 08:00    1
05-02-2023 13:44    1
05-02-2023 13:42    1
05-02-2023 13:41    1
05-02-2023 13:40    1
                   ..
18-01-2023 22:48    1
18-01-2023 22:47    1
18-01-2023 22:46    1
18-01-2023 22:45    1
23-02-2023 04:26    1
Name: Timestamp, Length: 76107, dtype: int64 


TransactionID 

TXN584     61
TXN1412    58
TXN1099    57
TXN840     56
TXN340     56
           ..
TXN60      22
TXN922     22
TXN1071    22
TXN411     20
TXN1437    19
Name: TransactionID, Length: 1999, dtype: int64 


AccountID 

ACC7     5145
ACC15    5132
ACC5     5127
ACC14    5114
ACC2     5103
ACC11    5077
ACC8     5071
ACC6     5057
ACC13    5055
ACC12    5053
ACC4     5051
ACC9     5050
ACC1     5037
ACC10    5035
ACC3     5000
Name: AccountID, dtype: int64 


Amount 

36475.58    3
42510.03    2
60777.86    2
55969.27    2
15104.71    2
           ..
6180.11     1
44835.23    1
79254.55    1
78577.52    1
57378.89    1
Name: Amount, Length: 75809, dtype: int64 


Merchant 

Mer

In [None]:
columns

Index(['Timestamp', 'TransactionID', 'AccountID', 'Amount', 'Merchant',
       'TransactionType', 'Location'],
      dtype='object')

# ***Isolation Forest***

In [None]:
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import LabelEncoder

In [None]:
dropped_df = df.drop(['Timestamp', 'TransactionID', 'AccountID'], axis=1)
dropped_df.head()

Unnamed: 0,Amount,Merchant,TransactionType,Location
0,95071.92,MerchantH,Purchase,Tokyo
1,15607.89,MerchantH,Purchase,London
2,65092.34,MerchantE,Withdrawal,London
3,87.87,MerchantE,Purchase,London
4,716.56,MerchantI,Purchase,Los Angeles


In [None]:
df_selected = dropped_df

In [None]:
label_encoder = LabelEncoder()
df_selected['Merchant'] = label_encoder.fit_transform(df_selected['Merchant'])
df_selected['TransactionType'] = label_encoder.fit_transform(df_selected['TransactionType'])
df_selected['Location'] = label_encoder.fit_transform(df_selected['Location'])

In [None]:
df_selected

Unnamed: 0,Amount,Merchant,TransactionType,Location,AnomalyScore
0,95071.92,7,0,4,1
1,15607.89,7,0,0,1
2,65092.34,4,2,0,1
3,87.87,4,0,0,1
4,716.56,8,0,1,1
...,...,...,...,...,...
216955,62536.88,0,2,3,1
216956,68629.69,6,1,0,1
216957,8203.57,5,0,0,1
216958,77800.36,5,0,2,1


In [None]:
model = IsolationForest(contamination=0.05, random_state=42)

In [None]:
model.fit(df_selected)



In [None]:
df_selected['AnomalyScore'] = model.predict(df_selected)
df_selected.rename(columns = {'AnomalyScore':'Anomalylabel'}, inplace=True)

In [None]:
anomalies = df_selected[df_selected['Anomalylabel'] == -1]

# Print or handle anomalies as needed
print("Anomalies:")
print(anomalies)

Anomalies:
          Amount  Merchant  TransactionType  Location  Anomalylabel
14      96525.88         8                0         4            -1
15      98688.82         7                0         0            -1
26      92970.47         8                0         0            -1
38      39888.46         0                0         0            -1
57      96665.82         2                0         0            -1
...          ...       ...              ...       ...           ...
216803  97458.45         0                0         0            -1
216840    138.05         6                0         0            -1
216904  98376.93         9                0         4            -1
216920  83299.21         0                0         0            -1
216932  97969.69         0                0         0            -1

[10834 rows x 5 columns]


In [69]:
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow.keras.layers import Dense, LeakyReLU, BatchNormalization, Input, Concatenate
from tensorflow.keras.models import Sequential, Model
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report, confusion_matrix

# Assuming df is your labeled dataset with 'AnomalyLabel' as the anomaly label (-1 or 1)
# ...

# Features and labels
X = df_selected.drop('Anomalylabel', axis=1)
y = df_selected['Anomalylabel']

# Map -1 to 0 for binary classification
y_binary = (y + 1) // 2

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y_binary, test_size=0.2, random_state=42)

# Normalize features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# GAN Generator Model
def build_generator(latent_dim, output_dim):
    model = Sequential([
        Dense(128, input_dim=latent_dim, activation='relu'),
        BatchNormalization(),
        Dense(256, activation='relu'),
        BatchNormalization(),
        Dense(output_dim, activation='tanh')
    ])
    return model

# GAN Discriminator Model
def build_discriminator(input_dim):
    model = Sequential([
        Dense(256, input_dim=input_dim, activation=LeakyReLU(alpha=0.2)),
        Dense(128, activation=LeakyReLU(alpha=0.2)),
        Dense(1, activation='sigmoid')
    ])
    return model

# GAN Combined Model (Generator + Discriminator)
def build_gan(generator, discriminator):
    discriminator.trainable = False
    model = Sequential([
        generator,
        discriminator
    ])
    return model

# Define GAN parameters
latent_dim = 100
output_dim = X_train_scaled.shape[1]

# Build and compile GAN components
generator = build_generator(latent_dim, output_dim)
discriminator = build_discriminator(output_dim)
discriminator.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

gan = build_gan(generator, discriminator)
gan.compile(optimizer='adam', loss='binary_crossentropy')

# Training the GAN
epochs_gan = 1
batch_size_gan = 64

for epoch in range(epochs_gan):
    noise = np.random.normal(0, 1, size=(batch_size_gan, latent_dim))

    # Generate synthetic anomalies using the GAN generator
    generated_anomalies = generator.predict(noise)

    # Create labels for synthetic anomalies (1 for normal)
    labels_gan = np.ones((batch_size_gan, 1))

    # Train the discriminator on real anomalies
    d_loss_real = discriminator.train_on_batch(X_train_scaled[y_train == 1], np.ones((sum(y_train == 1), 1)))

    # Train the discriminator on synthetic anomalies
    d_loss_fake = discriminator.train_on_batch(generated_anomalies, labels_gan)

    # Train the GAN
    noise = np.random.normal(0, 1, size=(batch_size_gan, latent_dim))
    labels_gan = np.zeros((batch_size_gan, 1))  # Labels for the GAN (0 for normal)
    g_loss = gan.train_on_batch(noise, labels_gan)

    # Print progress
    if epoch % 100 == 0:
        print(f"Epoch {epoch}, D Loss Real: {d_loss_real}, D Loss Fake: {d_loss_fake}, G Loss: {g_loss}")

# Generate synthetic anomalies using the trained GAN generator
synthetic_anomalies = generator.predict(np.random.normal(0, 1, size=(len(X_train_scaled[y_train == 1]), latent_dim)))

# Combine real anomalies and synthetic anomalies for training the neural network
X_train_gan = np.concatenate([X_train_scaled[y_train == 0], X_train_scaled[y_train == 1], synthetic_anomalies])
y_train_gan = np.concatenate([np.zeros(sum(y_train == 0)), np.ones(sum(y_train == 1) + len(synthetic_anomalies))])

# Build and train the neural network for anomaly detection
model_nn = Sequential([
    Dense(64, activation='relu', input_shape=(output_dim,)),
    Dense(32, activation='relu'),
    Dense(1, activation='sigmoid')
])

model_nn.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

# Train the model
model_nn.fit(X_train_gan, y_train_gan, epochs=10, batch_size=32, validation_split=0.2, verbose=2)

# Evaluate the model on the real test data
X_test_gan = X_test_scaled
y_test_gan = y_test
predictions_nn = model_nn.predict(X_test_gan)
predictions_nn_binary = (predictions_nn > 0.5).astype(int)

# Print classification report
print("Classification Report (Neural Network with GAN):")
print(classification_report(y_test_gan, predictions_nn_binary))

# Print confusion matrix
print("Confusion Matrix (Neural Network with GAN):")
print(confusion_matrix(y_test_gan, predictions_nn_binary))


Epoch 0, D Loss Real: [0.6669498682022095, 0.5650649070739746], D Loss Fake: [0.697708010673523, 0.359375], G Loss: 0.8524422645568848
Epoch 1/10
2969/2969 - 8s - loss: 0.0931 - accuracy: 0.9733 - val_loss: 1.4003e-05 - val_accuracy: 1.0000 - 8s/epoch - 3ms/step
Epoch 2/10
2969/2969 - 6s - loss: 0.0350 - accuracy: 0.9860 - val_loss: 2.6141e-08 - val_accuracy: 1.0000 - 6s/epoch - 2ms/step
Epoch 3/10
2969/2969 - 7s - loss: 0.0282 - accuracy: 0.9892 - val_loss: 9.6223e-10 - val_accuracy: 1.0000 - 7s/epoch - 2ms/step
Epoch 4/10
2969/2969 - 6s - loss: 0.0247 - accuracy: 0.9903 - val_loss: 1.1086e-10 - val_accuracy: 1.0000 - 6s/epoch - 2ms/step
Epoch 5/10
2969/2969 - 8s - loss: 0.0228 - accuracy: 0.9911 - val_loss: 1.1080e-11 - val_accuracy: 1.0000 - 8s/epoch - 3ms/step
Epoch 6/10
2969/2969 - 6s - loss: 0.0213 - accuracy: 0.9911 - val_loss: 6.2074e-13 - val_accuracy: 1.0000 - 6s/epoch - 2ms/step
Epoch 7/10
2969/2969 - 7s - loss: 0.0203 - accuracy: 0.9917 - val_loss: 5.6991e-14 - val_accuracy