<a href="https://colab.research.google.com/github/kkgh2024/Anomaly-Detection-in-Credit-Card-Transactions/blob/master/Anomaly_Github.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Anomaly Detection in Credit Card Transactions**
## **Project Summary:**
This project focuses on building an anomaly detection system to help a bank identify potentially fraudulent transactions and monitor customers' monthly spending. By leveraging an unsupervised learning approach, particularly Isolation Forest, we aim to detect unusual transactions without needing historical fraud labels, a key advantage in identifying emerging fraud patterns that a supervised model might miss.

## **Project Objectives:**

1. **Anomalous Transaction Detection:** Develop a model to flag transactions that deviate from typical spending patterns. We experimented with several algorithms, including K-Means, DBSCAN, One-Class SVM, and Local Outlier Factor. Isolation Forest was ultimately chosen for its efficiency on large datasets, robustness in detecting outliers, minimal parameter tuning, and its interpretable anomaly score. This approach enables proactive fraud detection by flagging outliers as transactions arrive.

2. **Monthly Limit Monitoring:** Create a function that checks whether any user has exceeded their monthly spending limit. The function filters transactions by user and month, then compares the total spending to the credit limit. It can run daily, giving the bank a continuous view of customers' spending so that overspending is caught early.

## **Data Used:**

* `cc_info.csv`: Contains credit card details for 843 users, including anonymized credit card numbers, spending limits, and user location information.
* `tx_data.csv`: Holds about 230,000 transaction records, with details on transaction amounts, dates, and geographical locations.

## **The Key Findings:**

The system allows the bank to identify and investigate potential fraud and over-limit spending proactively, protecting customers and enhancing financial oversight. By integrating anomaly detection with daily spending checks, the project aims to deliver a comprehensive fraud monitoring tool adaptable to new fraud trends and spending behaviors.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime  # the class, used later as datetime.today()

from sklearn.ensemble import IsolationForest

%matplotlib inline

In [2]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [3]:
df_1 =  pd.read_csv("/content/drive/MyDrive/Anomaly_detection/tx_data.csv")
df_2 =  pd.read_csv("/content/drive/MyDrive/Anomaly_detection/cc_info.csv")

In [20]:
df = df_2.merge(df_1, on=['cc_number'])
df.head()

Unnamed: 0,user_id,zipcode,state,cc_number,cc_limit,date,transaction,long,lat
0,29421272,3280,NH,4983684529421272,10000,2016-08-11 19:14:06,44.65,-72.174098,43.156873
1,29421272,3280,NH,4983684529421272,10000,2016-06-10 22:09:53,5.68,-72.131747,43.213248
2,29421272,3280,NH,4983684529421272,10000,2016-07-27 04:18:40,63.19,-72.022366,43.161782
3,29421272,3280,NH,4983684529421272,10000,2016-05-27 00:39:42,122.04,-72.159013,43.23553
4,29421272,3280,NH,4983684529421272,10000,2016-08-15 22:14:28,53.66,-72.169355,43.170859
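
Since each card should appear once in `cc_info.csv` and many times in `tx_data.csv`, the merge itself can check that assumption. The following is a minimal sketch on toy frames (all values are made up for illustration); `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Toy stand-ins for cc_info.csv and tx_data.csv (values are illustrative only)
cards = pd.DataFrame({"cc_number": [111, 222], "cc_limit": [10000, 4000]})
txs = pd.DataFrame({"cc_number": [111, 111, 222, 333],
                    "transaction": [44.65, 5.68, 63.19, 12.00]})

# validate raises MergeError if a card number is duplicated on the card side;
# indicator adds a _merge column recording where each row came from
merged = cards.merge(txs, on="cc_number", how="outer",
                     validate="one_to_many", indicator=True)

# Transactions whose card is missing from the card table
orphans = merged[merged["_merge"] == "right_only"]
print(orphans[["cc_number", "transaction"]])
```

The default inner merge used above silently drops such unmatched rows, so a check like this is one way to see how many transactions, if any, are lost.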


In [5]:
!pip install folium



# Visualizing all transactions made by a specific user

In [21]:
import folium
import pandas as pd

# Assuming df is already defined with columns ['user_id', 'zipcode', 'state', 'cc_number', 'cc_limit', 'date', 'transaction', 'long', 'lat']

# Specify the user ID you want to visualize
specific_user_id = 29421272  # Replace with the desired user_id

# Filter the DataFrame for the specific user
user_df = df[df['user_id'] == specific_user_id].reset_index(drop=True)

# Initialize a map centered on the average latitude and longitude of all points for the specific user
center_lat = user_df['lat'].mean()
center_long = user_df['long'].mean()
user_map = folium.Map(location=[center_lat, center_long], zoom_start=10)

# Add markers for each transaction of the specific user
for _, row in user_df.iterrows():
    folium.Marker(
        location=[row['lat'], row['long']],
        popup=(f"User ID: {row['user_id']}<br>Transaction: ${row['transaction']}"
               f"<br>Date: {row['date']}<br>Zipcode: {row['zipcode']}<br>State: {row['state']}"),
        tooltip="Click for details"
    ).add_to(user_map)

# Save the map as an HTML file on the mounted drive
user_map.save("/content/drive/MyDrive/Anomaly_detection/user_map.html")
print("Map saved as user_map.html. Open the file to view the map.")


Map saved as user_map.html. Open the file to view the map.


In [22]:
from IPython.display import display

# user_map was built in the previous cell for specific_user_id;
# display it inline in the notebook instead of opening the saved HTML file
display(user_map)


By visualizing a user's transactions on a map, you can see their usual spending locations and how often they transact over time. Anomalies tend to stand out as transactions in unexpected locations, unusually high or low amounts, or unusual frequencies (e.g., several transactions within a very short period, or transactions at uncommon hours). This spatial and temporal view makes irregularities easier to spot than in the raw data, which helps with anomaly detection and fraud prevention.

## Preprocess Data

Convert the date column in df to datetime for better handling:

In [None]:
df['date'] = pd.to_datetime(df['date'])
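
Once the dates are parsed, the temporal signals mentioned above (uncommon hours, bursts of transactions) can be turned into columns. This is a minimal sketch on toy data; the column names `hour` and `secs_since_prev` are hypothetical and are not used elsewhere in this notebook.

```python
import pandas as pd

# Toy transactions for two cards (values are illustrative only)
toy = pd.DataFrame({
    "cc_number": [111, 111, 111, 222],
    "date": pd.to_datetime(["2016-08-11 19:14:06", "2016-08-11 19:15:00",
                            "2016-08-12 03:00:00", "2016-08-11 10:00:00"]),
})

toy = toy.sort_values(["cc_number", "date"])

# Hour of day, to spot transactions at uncommon hours
toy["hour"] = toy["date"].dt.hour

# Seconds since the same card's previous transaction (NaN for the first one)
toy["secs_since_prev"] = (
    toy.groupby("cc_number")["date"].diff().dt.total_seconds()
)
print(toy)
```

Very small values of `secs_since_prev` indicate several transactions in quick succession on the same card.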


## Anomaly Detection using Isolation Forest

In [None]:
def detect_anomalies(df):
    # Select relevant features for anomaly detection
    features = df[['transaction', 'long', 'lat']]

    # Initialize and fit the Isolation Forest model
    model = IsolationForest(contamination=0.01, random_state=42)
    df['anomaly'] = model.fit_predict(features)

    # Keep only the rows flagged as anomalies (fit_predict returns -1 for them)
    anomalies = df[df['anomaly'] == -1]
    return anomalies

# Apply the function to detect anomalies
anomalous_transactions = detect_anomalies(df)
#print("Anomalous Transactions:", anomalous_transactions)
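
Besides the -1/1 labels, Isolation Forest exposes a continuous score, which is what makes its output easy to rank and inspect. The sketch below uses synthetic data (not the bank data) to show `score_samples`, where lower values mean more anomalous; the injected extreme row is an assumption made purely for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Synthetic data: 500 ordinary points plus one extreme point (illustrative only)
toy = pd.DataFrame({
    "transaction": rng.normal(50, 10, 500),
    "long": rng.normal(-72.0, 0.1, 500),
    "lat": rng.normal(43.0, 0.1, 500),
})
toy.loc[len(toy)] = [5000.0, 97.0, 24.0]

features = toy[["transaction", "long", "lat"]]
model = IsolationForest(contamination=0.01, random_state=42).fit(features)

# score_samples: the lower the score, the more anomalous the point
toy["score"] = model.score_samples(features)

# Rank transactions so that the most suspicious ones are reviewed first
most_suspicious = toy.sort_values("score").head(5)
print(most_suspicious)
```

Ranking by score lets an analyst review a fixed number of cases per day instead of relying only on the `contamination` cut-off.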


In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM
from sklearn.neighbors import LocalOutlierFactor
from scipy.spatial.distance import cdist
# Function for Isolation Forest
def detect_anomalies_isolation_forest(data):
    model = IsolationForest(contamination=0.01, random_state=42)
    data['isolation_forest'] = model.fit_predict(data[['transaction', 'long', 'lat']])
    return data['isolation_forest'] == -1

# Function for DBSCAN
def detect_anomalies_dbscan(data, eps=0.5, min_samples=5):
    model = DBSCAN(eps=eps, min_samples=min_samples)
    labels = model.fit_predict(data[['transaction', 'long', 'lat']])
    return labels == -1  # Points labeled -1 are outliers

# Function for K-Means with Distance Threshold
#def detect_anomalies_kmeans(data, n_clusters=5, threshold=2.5):
#    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
#    data['cluster'] = kmeans.fit_predict(data[['transaction', 'long', 'lat']])
#    data['distance'] = cdist(data[['transaction', 'long', 'lat']], kmeans.cluster_centers_[data['cluster']], 'euclidean').diagonal()
#    return data['distance'] > threshold  # Points with distance above threshold are anomalies

# Function for One-Class SVM
def detect_anomalies_ocsvm(data, nu=0.01, kernel="rbf", gamma=0.1):
    model = OneClassSVM(nu=nu, kernel=kernel, gamma=gamma)
    return model.fit_predict(data[['transaction', 'long', 'lat']]) == -1

# Function for Local Outlier Factor
def detect_anomalies_lof(data, n_neighbors=20, contamination=0.01):
    model = LocalOutlierFactor(n_neighbors=n_neighbors, contamination=contamination)
    return model.fit_predict(data[['transaction', 'long', 'lat']]) == -1

# Main function to run all algorithms and compare results
def compare_anomaly_algorithms(data):
    # Initialize a DataFrame to hold results
    results = pd.DataFrame(index=data.index)

    # Run each algorithm and store results
    results['IsolationForest'] = detect_anomalies_isolation_forest(data)
    results['DBSCAN'] = detect_anomalies_dbscan(data)
#    results['KMeans'] = detect_anomalies_kmeans(data)
    results['OneClassSVM'] = detect_anomalies_ocsvm(data)
    results['LOF'] = detect_anomalies_lof(data)

    # Count anomalies detected by each algorithm
    anomaly_counts = results.sum()
    print("Anomaly Counts by Algorithm:\n", anomaly_counts)

    # Find overlap in anomalies detected by each pair of algorithms
    overlap_matrix = pd.DataFrame(index=results.columns, columns=results.columns)
    for col1 in results.columns:
        for col2 in results.columns:
            overlap_matrix.loc[col1, col2] = (results[col1] & results[col2]).sum()

    print("\nOverlap in Detected Anomalies:\n", overlap_matrix)

    # Return results for additional analysis if needed
    return results, anomaly_counts, overlap_matrix



# Convert date column to datetime format in the example DataFrame
df['date'] = pd.to_datetime(df['date'])

# Run the comparison
results, anomaly_counts, overlap_matrix = compare_anomaly_algorithms(df)


Anomaly Counts by Algorithm:
 IsolationForest     2331
DBSCAN             15416
OneClassSVM         6360
LOF                 2331
dtype: int64

Overlap in Detected Anomalies:
                 IsolationForest DBSCAN OneClassSVM   LOF
IsolationForest            2331   2309         889    19
DBSCAN                     2309  15416        3365   855
OneClassSVM                 889   3365        6360   175
LOF                          19    855         175  2331
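
The K-Means variant is commented out in the cell above. Its `cdist(...).diagonal()` call builds a full n-by-n distance matrix, which is impractical for a few hundred thousand rows. The sketch below is one possible alternative, not the version used for the results in this notebook: it computes only each point's distance to its own centroid, standardises the features so that dollars and degrees are comparable, and uses a quantile cut-off (the `quantile` parameter is an assumption, not a tuned value).

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def detect_anomalies_kmeans(data, n_clusters=5, quantile=0.99):
    """Flag points unusually far from their own K-Means centroid."""
    # Standardise so that amount and coordinates contribute on a similar scale
    X = StandardScaler().fit_transform(data[["transaction", "long", "lat"]])
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=42).fit(X)
    # Distance of each point to its assigned centroid only: O(n) memory
    dist = np.linalg.norm(X - kmeans.cluster_centers_[kmeans.labels_], axis=1)
    # Points beyond the chosen quantile of distances are treated as anomalies
    return pd.Series(dist > np.quantile(dist, quantile), index=data.index)

# Synthetic example (illustrative only)
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "transaction": rng.normal(50, 10, 300),
    "long": rng.normal(-72.0, 0.1, 300),
    "lat": rng.normal(43.0, 0.1, 300),
})
flags = detect_anomalies_kmeans(toy, n_clusters=3)
print(flags.sum(), "points flagged")
```

The same scaling step is worth considering for DBSCAN and One-Class SVM, since both are distance-based and the transaction amount otherwise dominates the coordinates.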


In [None]:
from sklearn.metrics import silhouette_score
import numpy as np

# Function to calculate silhouette score for each algorithm
def calculate_silhouette_scores(data, results):
    scores = {}
    for algorithm in results.columns:
        # Create labels for silhouette calculation: 1 for anomalies, 0 for normal points
        labels = np.where(results[algorithm], 1, 0)

        # Calculate the Silhouette Score if there are enough anomalies detected
        if len(np.unique(labels)) > 1:  # Silhouette requires at least 2 different labels
            score = silhouette_score(data[['transaction', 'long', 'lat']], labels)
            scores[algorithm] = score
        else:
            scores[algorithm] = None  # Insufficient anomalies to calculate Silhouette Score

    return scores

# Run the silhouette score calculation after running the comparison
#results, anomaly_counts, overlap_matrix = compare_anomaly_algorithms(df)
silhouette_scores = calculate_silhouette_scores(df, results)

print("\nSilhouette Scores by Algorithm:\n", silhouette_scores)



Silhouette Scores by Algorithm:
 {'IsolationForest': 0.8487902563849317, 'DBSCAN': 0.6652386718138972, 'OneClassSVM': 0.5620215712065457, 'LOF': -0.06001518861289016}


Judged by Silhouette Score, Isolation Forest gives the cleanest separation on this dataset: its score of 0.85 indicates that the points it flags lie far from the points it treats as normal. DBSCAN (0.67) also separates the two groups reasonably well, but it flags far more points (15,416 versus 2,331). One-Class SVM (0.56) shows only moderate separation, and Local Outlier Factor (LOF), with a slightly negative score of -0.06, flags points that overlap heavily with the normal ones, making it a poor fit for this feature set. Because the dataset has no fraud labels, the Silhouette Score measures only how geometrically distinct the flagged points are, not whether they are actual fraud. On that basis, Isolation Forest is the most reliable choice of the four here.
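
Computing the Silhouette Score over every row requires all pairwise distances, which is slow on a dataset of this size. A minimal sketch, on synthetic data, of scikit-learn's `sample_size` option, which estimates the score from a random subset; the sample size of 500 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)

# Synthetic data: a main cloud plus a small, clearly separated group
X = rng.normal(0, 1, size=(2000, 3))
X[:100] += 15

# Same label convention as above: 1 for flagged points, 0 for normal points
labels = np.zeros(len(X), dtype=int)
labels[:100] = 1

# Estimate the score from 500 randomly chosen rows instead of all 2000
score = silhouette_score(X, labels, sample_size=500, random_state=42)
print(round(score, 3))
```

Repeating the estimate with a few different `random_state` values gives a sense of how stable the sampled score is.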

# Monthly Limit Exceedance Detection

The code below flags customers who exceed their monthly limit by summing each card's transactions per month and comparing the total against the card's limit from cc_info.

In [None]:
df.tail()

Unnamed: 0,user_id,zipcode,state,cc_number,cc_limit,date,transaction,long,lat,anomaly,isolation_forest
233038,93357426,45201,OH,8500470993357426,10000,2016-08-06 01:27:05,13.23,-84.598038,39.180146,1,1
233039,93357426,45201,OH,8500470993357426,10000,2016-07-06 01:32:12,18.15,-84.500397,39.173975,1,1
233040,93357426,45201,OH,8500470993357426,10000,2016-08-12 20:18:30,14.43,97.190652,24.455904,-1,-1
233041,93357426,45201,OH,8500470993357426,10000,2016-08-06 18:21:36,30.1,-84.462941,39.198185,1,1
233042,93357426,45201,OH,8500470993357426,10000,2016-06-14 05:18:04,16.51,-84.572132,39.235629,1,1


In [None]:

# Ensure 'date' column is in datetime format
df['date'] = pd.to_datetime(df['date'])

# Extract month and year for grouping purposes
df['year_month'] = df['date'].dt.to_period('M')  # Creates a 'YYYY-MM' period column

# Aggregate transaction amounts by user, card, and month
monthly_transactions = df.groupby(['user_id', 'cc_number', 'cc_limit', 'year_month'])['transaction'].sum().reset_index()

# Keep the (card, month) pairs whose total exceeds the card limit
limit_exceeded = monthly_transactions[monthly_transactions['transaction'] > monthly_transactions['cc_limit']]
user_exceed = limit_exceeded['user_id'].tolist()

# A user_id appears once for every month in which the limit was exceeded
user_exceed[:5]


[154142, 594227, 1406445, 1542116, 1542116]

In [None]:


def detect_limit_exceedance(data, date=None):
    """Return user_ids whose spending in the month of `date` exceeds cc_limit."""
    if date is None:
        date = datetime.today()

    # Work on a copy so the caller's DataFrame is not modified
    data = data.copy()
    data['date'] = pd.to_datetime(data['date'])

    # Compare months as Period objects rather than as strings
    data['month'] = data['date'].dt.to_period('M')
    current_month = pd.Period(date, freq='M')

    # Aggregate transaction amounts by user and card for that month
    monthly_transactions = data[data['month'] == current_month].groupby(['user_id', 'cc_number', 'cc_limit'])['transaction'].sum().reset_index()

    # Identify users who exceeded their monthly limit
    limit_exceeded = monthly_transactions[monthly_transactions['transaction'] > monthly_transactions['cc_limit']]

    # Return a list of user_ids who exceeded their limit
    return limit_exceeded['user_id'].tolist()


# Run the function. With no date it checks the current month, so on the
# 2016 sample data it returns an empty list; pass e.g. datetime(2016, 8, 31)
# to check a historical month.
users_exceeding_limit = detect_limit_exceedance(df)
print("Users exceeding their monthly limit:", users_exceeding_limit)


Users exceeding their monthly limit: []
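
For a daily run, it can help to check spending month-to-date as of an explicit date and to return the amounts rather than only the user IDs. The following is a minimal sketch under those assumptions; the function name `month_to_date_exceedance` and the `as_of` parameter are hypothetical, and the toy values are made up.

```python
import pandas as pd

def month_to_date_exceedance(data, as_of):
    """Cards whose spending in as_of's month, up to as_of, exceeds cc_limit."""
    as_of = pd.Timestamp(as_of)
    dates = pd.to_datetime(data["date"])
    # Same calendar month as as_of, and not later than as_of
    in_window = (dates.dt.to_period("M") == as_of.to_period("M")) & (dates <= as_of)
    spent = (data[in_window]
             .groupby(["user_id", "cc_number", "cc_limit"], as_index=False)["transaction"]
             .sum())
    return spent[spent["transaction"] > spent["cc_limit"]]

# Toy data: user 1 passes the limit on 2016-08-20, user 2 never does
toy = pd.DataFrame({
    "user_id": [1, 1, 2, 2],
    "cc_number": [111, 111, 222, 222],
    "cc_limit": [100, 100, 100, 100],
    "date": ["2016-08-02", "2016-08-20", "2016-08-03", "2016-07-30"],
    "transaction": [60.0, 70.0, 50.0, 90.0],
})

over_end_of_month = month_to_date_exceedance(toy, "2016-08-31")
over_mid_month = month_to_date_exceedance(toy, "2016-08-10")
print(over_end_of_month)
```

Note that a date-only `as_of` is interpreted as midnight, so transactions later on that same day are excluded unless a time is given.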


In [None]:
df.shape

(233043, 11)