# Transaction Anomaly Detection Prototype

This notebook demonstrates a prototype for detecting anomalies in financial transaction data using an unsupervised learning approach (Isolation Forest).

## 1. Setup and Imports

In [None]:
import pandas as pd
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix

# Set a seed for reproducibility
np.random.seed(42)

## 2. Data Simulation/Generation

We'll generate a synthetic dataset of transactions. Most transactions will be normal, but we'll inject a small percentage of anomalies.

In [None]:
def generate_transaction_data(num_transactions=1500, anomaly_percentage=0.03):
    data = []
    num_anomalies = int(num_transactions * anomaly_percentage)
    transaction_ids = list(range(1, num_transactions + 1))
    np.random.shuffle(transaction_ids) # Shuffle IDs to mix anomalies

    # Define some securities and their typical price ranges
    securities = {
        'SEC_A': {'mean_price': 100, 'std_price': 5},
        'SEC_B': {'mean_price': 50, 'std_price': 2},
        'SEC_C': {'mean_price': 200, 'std_price': 10}
    }
    security_ids = list(securities.keys())

    for i in range(num_transactions):
        tx_id = transaction_ids[i]
        client_id = np.random.randint(1001, 1101)
        sec_id = np.random.choice(security_ids)
        buy_sell = np.random.choice([0, 1]) # 0 for Sell, 1 for Buy
        timestamp = pd.Timestamp('2023-01-01 09:00:00') + pd.Timedelta(minutes=np.random.randint(0, 8*60*20)) # Random offset within 9,600 contiguous minutes (20 x 8 h, about 6.7 calendar days); trading hours are not modeled
        is_anomaly = 0

        # Base normal transaction parameters
        base_price = np.random.normal(securities[sec_id]['mean_price'], securities[sec_id]['std_price'])
        quantity = np.random.randint(10, 500)

        # Inject anomalies
        if i < num_anomalies:
            is_anomaly = 1
            anomaly_type = np.random.choice(['high_amount', 'high_quantity', 'price_dev'])
            
            if anomaly_type == 'high_amount':
                # Price can be normal, but quantity makes amount high
                quantity = np.random.randint(2000, 5000) 
            elif anomaly_type == 'high_quantity':
                quantity = np.random.randint(3000, 6000)
            elif anomaly_type == 'price_dev':
                # Price significantly different from typical for this security
                if np.random.rand() > 0.5:
                    base_price *= np.random.uniform(1.5, 2.5) # Significantly higher
                else:
                    base_price *= np.random.uniform(0.3, 0.5) # Significantly lower
                quantity = np.random.randint(50, 1000) # Quantity can be normal or slightly elevated
        
        price = round(max(1.0, base_price), 2) # Ensure price is positive
        amount = round(quantity * price, 2)

        data.append([
            tx_id, client_id, sec_id, buy_sell, quantity, price, timestamp, amount, is_anomaly
        ])
    
    df = pd.DataFrame(data, columns=[
        'TransactionID', 'ClientID', 'SecurityID', 'BuySell', 'Quantity', 
        'Price', 'Timestamp', 'Amount', 'IsAnomaly_GroundTruth'
    ])
    return df.sort_values(by='Timestamp').reset_index(drop=True)

df_transactions = generate_transaction_data(num_transactions=1500, anomaly_percentage=0.04)
df_transactions.head()

In [None]:
df_transactions['IsAnomaly_GroundTruth'].value_counts()

## 3. Exploratory Data Analysis (EDA) & Preprocessing (Brief)

In [None]:
print("Basic Data Statistics:")
print(df_transactions.describe())

plt.figure(figsize=(12, 6))
sns.histplot(df_transactions['Amount'], bins=100, kde=True)
plt.title('Distribution of Transaction Amount')
plt.xlabel('Amount')
plt.ylabel('Frequency')
plt.show()

plt.figure(figsize=(12, 6))
sns.scatterplot(data=df_transactions, x='Price', y='Quantity', hue='IsAnomaly_GroundTruth', style='IsAnomaly_GroundTruth', alpha=0.7)
plt.title('Price vs. Quantity (Colored by Ground Truth Anomaly)')
plt.xlabel('Price')
plt.ylabel('Quantity')
plt.show()

For this prototype, we will select numerical features that are likely to indicate anomalies. `SecurityID` and `BuySell` are categorical. `Timestamp` would require more complex feature engineering (e.g., extracting hour, day of week) which we will omit for this initial prototype.
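
For reference, the omitted feature engineering could look roughly like the sketch below. It is illustrative only and is not used by the model in this notebook. The `sample` frame, its values, and the `Sec` prefix are hypothetical; only the column names mirror `df_transactions`.

```python
import pandas as pd

# Tiny hypothetical sample with the same column names as df_transactions.
sample = pd.DataFrame({
    'Timestamp': pd.to_datetime(['2023-01-02 09:15:00',
                                 '2023-01-03 14:40:00',
                                 '2023-01-07 23:05:00']),
    'SecurityID': ['SEC_A', 'SEC_B', 'SEC_A'],
    'BuySell': [1, 0, 1],
    'Amount': [1200.0, 830.5, 45000.0],
})

# Calendar features derived from the timestamp.
sample['Hour'] = sample['Timestamp'].dt.hour
sample['DayOfWeek'] = sample['Timestamp'].dt.dayofweek  # Monday=0 ... Sunday=6

# One-hot encode the categorical security identifier.
encoded = pd.get_dummies(sample, columns=['SecurityID'], prefix='Sec', dtype=int)
print(encoded.drop(columns='Timestamp'))
```

`BuySell` is already encoded as 0/1, so it could be passed to the model as is.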

In [None]:
features_for_model = ['Amount', 'Quantity', 'Price']
X = df_transactions[features_for_model].copy()

# Scaling matters for distance-based or variance-based algorithms. Isolation Forest splits
# one feature at a time at a random threshold, so it is largely insensitive to feature scaling.
# For completeness, let's scale them anyway.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=features_for_model)

## 4. Model Implementation (Isolation Forest)

Isolation Forest is an unsupervised learning algorithm that isolates observations by randomly selecting a feature and then randomly selecting a split value between the minimum and maximum values of that feature. Since anomalies are "few and different," they need fewer random splits to be separated from the rest of the data, so they typically end up in leaves closer to the root of each tree. The anomaly score is derived from this path length, averaged over all trees.
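
The isolation idea can be illustrated with a one-dimensional toy, sketched below. This is not scikit-learn's implementation: `isolation_depth` is a hypothetical helper that counts how many random splits it takes to separate one point from the rest, and the data are made up for the illustration.

```python
import numpy as np

def isolation_depth(values, target_index, rng):
    """Count the random splits needed until the target point is alone."""
    current = np.asarray(values, dtype=float)
    target = current[target_index]
    depth = 0
    # Stop when the target is isolated or only duplicates of it remain.
    while len(current) > 1 and current.min() < current.max():
        split = rng.uniform(current.min(), current.max())
        # Keep only the side of the split that contains the target.
        current = current[current < split] if target < split else current[current >= split]
        depth += 1
    return depth

rng = np.random.default_rng(0)
# 200 "normal" prices around 100 plus one far-away value at index 200.
values = np.append(rng.normal(100, 5, size=200), 400.0)

outlier_depth = np.mean([isolation_depth(values, 200, rng) for _ in range(200)])
inlier_depth = np.mean([isolation_depth(values, 0, rng) for _ in range(200)])
print(f"mean splits to isolate the outlier: {outlier_depth:.1f}")
print(f"mean splits to isolate an inlier:   {inlier_depth:.1f}")
```

The far-away point should need noticeably fewer splits on average, which is the signal an Isolation Forest turns into an anomaly score.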

In [None]:
# Estimate contamination from the share of injected anomalies (ground truth).
# Using .mean() instead of value_counts(...)[1] avoids a KeyError when no anomalies are present.
contamination_rate = float(df_transactions['IsAnomaly_GroundTruth'].mean())
if contamination_rate == 0 or contamination_rate > 0.5: # outside the (0, 0.5] range scikit-learn accepts
    contamination_rate = 'auto'
else:
    print(f"Using estimated contamination rate: {contamination_rate:.4f}")

model = IsolationForest(n_estimators=100, 
                        contamination=contamination_rate, # or 'auto'
                        random_state=42,
                        n_jobs=-1)

model.fit(X_scaled_df)

# decision_function: lower scores are more anomalous; negative scores are the ones predict() labels as anomalies
df_transactions['AnomalyScore'] = model.decision_function(X_scaled_df)
# predict: -1 for anomalies, 1 for inliers
df_transactions['IsAnomaly_Predicted'] = model.predict(X_scaled_df)

# Convert predictions to 0 (normal) / 1 (anomaly)
df_transactions['IsAnomaly_Predicted'] = (df_transactions['IsAnomaly_Predicted'] == -1).astype(int)

df_transactions.head()

In [None]:
print("Predicted Anomaly Counts:")
print(df_transactions['IsAnomaly_Predicted'].value_counts())

## 5. Results Visualization and Basic Evaluation

In [None]:
plt.figure(figsize=(10, 6))
sns.histplot(df_transactions['AnomalyScore'], bins=50, kde=True)
plt.title('Distribution of Anomaly Scores')
plt.xlabel('Anomaly Score')
plt.ylabel('Frequency')
plt.show()

anomalies_df = df_transactions[df_transactions['IsAnomaly_Predicted'] == 1]
normals_df = df_transactions[df_transactions['IsAnomaly_Predicted'] == 0]

plt.figure(figsize=(14, 8))
plt.scatter(normals_df['Amount'], normals_df['Quantity'], c='blue', label='Predicted Normal', alpha=0.5, s=20)
plt.scatter(anomalies_df['Amount'], anomalies_df['Quantity'], c='red', label='Predicted Anomaly', alpha=0.7, s=50, marker='x')

# Highlight ground truth anomalies for comparison
ground_truth_anomalies = df_transactions[df_transactions['IsAnomaly_GroundTruth'] == 1]
plt.scatter(ground_truth_anomalies['Amount'], ground_truth_anomalies['Quantity'], 
            facecolors='none', edgecolors='yellow', s=100, linewidth=2, label='Actual Injected Anomaly')

plt.title('Transaction Amount vs. Quantity (Colored by Model Prediction)')
plt.xlabel('Transaction Amount')
plt.ylabel('Transaction Quantity')
plt.legend()
plt.grid(True)
plt.show()

### Basic Evaluation against Ground Truth

Since we injected anomalies, we have a ground truth. We can use this to evaluate how well our unsupervised model performed. **Note:** In a real-world scenario, true labels are often unavailable or very costly to obtain for unsupervised anomaly detection problems. This step is for validating the prototype's basic capability.

In [None]:
y_true = df_transactions['IsAnomaly_GroundTruth']
y_pred = df_transactions['IsAnomaly_Predicted']

print("Confusion Matrix:")
cm = confusion_matrix(y_true, y_pred)
print(cm)

plt.figure(figsize=(6,4))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues', 
            xticklabels=['Predicted Normal', 'Predicted Anomaly'], 
            yticklabels=['Actual Normal', 'Actual Anomaly'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()

print("\nClassification Report:")
print(classification_report(y_true, y_pred, target_names=['Normal (0)', 'Anomaly (1)']))

### Discussion

**How well did the model identify the injected anomalies?**

Based on the confusion matrix and classification report:
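*   **True Positives:** injected anomalies that the model flagged (bottom-right cell of the confusion matrix).
*   **True Negatives:** normal transactions that the model left unflagged (top-left cell). With roughly 96% of the data normal, this is by far the largest cell.
*   **False Positives and False Negatives:** normal transactions flagged as anomalies (top-right cell) and injected anomalies the model missed (bottom-left cell). Because `contamination` was set to the true anomaly rate, the model flags about as many transactions as were injected, so these two counts should be roughly equal, and precision and recall for the anomaly class should be close to each other.
*   The `contamination` parameter in Isolation Forest is crucial because it sets the score threshold. If set too low, the model misses anomalies. If set too high, it flags too many normal points as anomalous. In this prototype, we used the known injection rate, which is an ideal scenario.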
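
The sensitivity to `contamination` can be checked with a small sweep. The sketch below uses hypothetical stand-in data (`X_demo`, `y_demo`) with a 4% outlier share, not the notebook's transactions, so the numbers it prints are only indicative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.metrics import precision_score, recall_score

rng = np.random.default_rng(0)
# Hypothetical stand-in data: 960 normal points and 40 obvious outliers (4%).
normal = rng.normal(loc=0.0, scale=1.0, size=(960, 2))
outliers = rng.normal(loc=8.0, scale=1.0, size=(40, 2))
X_demo = np.vstack([normal, outliers])
y_demo = np.array([0] * 960 + [1] * 40)

results = []
for rate in [0.01, 0.04, 0.10]:
    # Same random_state, so only the decision threshold changes between runs.
    forest = IsolationForest(n_estimators=100, contamination=rate, random_state=42)
    flagged = (forest.fit_predict(X_demo) == -1).astype(int)
    results.append((rate,
                    int(flagged.sum()),
                    precision_score(y_demo, flagged, zero_division=0),
                    recall_score(y_demo, flagged, zero_division=0)))

for rate, n_flagged, precision, recall in results:
    print(f"contamination={rate:.2f}  flagged={n_flagged:4d}  "
          f"precision={precision:.2f}  recall={recall:.2f}")
```

A rate below the true share caps recall, because fewer points are flagged than there are outliers. A rate above it can only add false positives once all outliers are caught.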

**Limitations of this simple prototype:**

*   **Simulated Data:** The data is synthetic and simplified. Real-world transaction data is far more complex, noisy, and may have more subtle anomalies.
*   **Feature Engineering:** We used only a few raw numerical features. More sophisticated feature engineering is typically required, such as:
    *   Handling `Timestamp`: Extracting features like hour of day, day of week, time since last transaction for a client, etc.
    *   Handling `SecurityID`: One-hot encoding or embedding if the number of securities is large. Anomalies might be specific to certain securities or client behaviors related to securities.
    *   Creating interaction features or ratios.
    *   Considering client-specific historical behavior (e.g., deviation from a client's own average transaction amount).
*   **Model Simplicity:** Isolation Forest is a good baseline, but other algorithms (e.g., Local Outlier Factor (LOF), One-Class SVM, Autoencoders) might perform better or capture different types of anomalies.
*   **Static Anomalies:** Our injected anomalies are relatively straightforward (very high amount/quantity, significant price deviation). Real anomalies can be more nuanced.
*   **Evaluation:** Evaluation relies on ground truth, which is rare. In practice, anomaly detection evaluation often involves manual review of flagged items by domain experts.
*   **Scalability:** For very large datasets, the `sklearn` implementation might need to be run on sampled data, or distributed computing solutions might be necessary.
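
As an illustration of the client-specific feature mentioned above, the sketch below computes each transaction's deviation from the client's own average amount. The `tx` frame and the derived column names are hypothetical. For simplicity the statistics include the transaction being scored, whereas a real pipeline would more likely use only each client's earlier history.

```python
import pandas as pd

# Hypothetical transactions for two clients; column names mirror df_transactions.
tx = pd.DataFrame({
    'ClientID': [1001, 1001, 1001, 1001, 1002, 1002, 1002, 1002],
    'Amount':   [500.0, 520.0, 480.0, 5000.0, 9000.0, 9500.0, 8800.0, 9100.0],
})

# Per-client mean and standard deviation, broadcast back to every row.
grouped = tx.groupby('ClientID')['Amount']
tx['ClientMeanAmount'] = grouped.transform('mean')
tx['ClientStdAmount'] = grouped.transform('std')

# z-score of each amount relative to the same client's transactions.
tx['AmountZ_Client'] = (tx['Amount'] - tx['ClientMeanAmount']) / tx['ClientStdAmount']
print(tx[['ClientID', 'Amount', 'AmountZ_Client']].round(2))
```

The 5,000 transaction is smaller than anything client 1002 trades, yet it has the largest z-score because it is unusual for client 1001. A global view of `Amount` would not capture that.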

**Next steps for a more robust solution:**

*   **Use Real-World Data:** Obtain and analyze actual (anonymized) transaction data.
*   **Advanced Feature Engineering:** Incorporate time-based features, categorical features (properly encoded), client-specific aggregates, and interaction terms.
*   **Explore Other Algorithms:** Experiment with LOF, One-Class SVM, Gaussian Mixture Models, or deep learning approaches like Autoencoders or LSTMs for sequential transaction data.
*   **Dynamic Thresholding:** Anomaly scores often require careful thresholding. This might involve statistical methods or domain expertise to set appropriate alert levels.
*   **Feedback Loop:** Implement a mechanism for domain experts to review and label predicted anomalies, which can be used to refine the model (semi-supervised learning) or evaluate its performance over time.
*   **Consider Seasonality and Trends:** Financial data often has trends and seasonal patterns that normal behavior might follow. Models should account for this to avoid flagging normal peaks as anomalies.
*   **Scalability and Deployment:** Plan for deploying the model in a production environment, including data pipelines, retraining schedules, and monitoring.
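
One simple take on the thresholding step is to derive the alert threshold from a quantile of scores on a reference window, then apply it to new data. The sketch below assumes a hypothetical reference window and new batch, and the 1% quantile is an arbitrary illustrative choice, not a recommendation.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)
# Hypothetical reference window (assumed mostly normal) and a new batch to score.
reference = rng.normal(0.0, 1.0, size=(2000, 3))
new_batch = np.vstack([rng.normal(0.0, 1.0, size=(95, 3)),
                       rng.normal(6.0, 1.0, size=(5, 3))])  # last 5 rows are unusual

forest = IsolationForest(n_estimators=100, random_state=42).fit(reference)

# score_samples: lower values mean "more anomalous".
reference_scores = forest.score_samples(reference)
threshold = np.quantile(reference_scores, 0.01)  # alert on the lowest 1% of reference scores

new_scores = forest.score_samples(new_batch)
alerts = np.where(new_scores < threshold)[0]
print(f"threshold={threshold:.3f}, alert rows: {alerts.tolist()}")
```

In practice the quantile would be tuned to the number of alerts reviewers can handle, and the threshold recomputed as the reference window moves.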

## 6. Conclusion for Prototype

This prototype successfully demonstrated the basic application of the Isolation Forest algorithm for detecting anomalies in simulated transaction data. It highlighted the process of data generation, model training, prediction, and a basic form of evaluation using injected ground truth.

The results indicate that even a simple unsupervised model can identify obvious anomalies. However, the discussion section underscores the significant simplifications made and outlines the many considerations and improvements needed to develop a production-ready, robust anomaly detection system for financial transactions. This prototype serves as a foundational step and a proof-of-concept for further development.