# Cognitive Shield: High-Fidelity Data Generation

This notebook generates a synthetic, time-series dataset simulating user behavior in a digital banking environment. The data includes both normal, persona-driven activity and sophisticated, injected fraud scenarios. This dataset will serve as the foundation for training and evaluating our three-engine fraud detection system.

**Key Features:**
- **Entity Simulation:** Creates users with distinct personas, devices, and merchants.
- **Behavioral Simulation:** Generates event sequences based on user personas (e.g., `tech_savvy_youth`, `cautious_pensioner`).
- **Fraud Injection:** Programmatically inserts complex fraud patterns like Account Takeover, Mule Rings, and Loan Application Fraud.

In [2]:
import pandas as pd
import numpy as np
from faker import Faker
import random
from datetime import datetime, timedelta
import uuid
import pytz
import os

In [3]:
# --- Configuration ---
# Setting a seed for reproducibility
random.seed(42)
np.random.seed(42)

# Initialize Faker to generate realistic fake data for India
fake = Faker('en_IN')
india_tz = pytz.timezone('Asia/Kolkata')

# Dataset parameters
NUM_USERS = 5000
NUM_MERCHANTS = 500
NUM_DEVICES = int(NUM_USERS * 1.5)
START_DATE = datetime(2025, 7, 1, tzinfo=india_tz)
END_DATE = datetime(2025, 9, 28, tzinfo=india_tz) # Today's date
TOTAL_DAYS = (END_DATE - START_DATE).days

# Output file configuration
DATA_DIR = '../data/'
OUTPUT_FILE = os.path.join(DATA_DIR, 'events.csv')

# Ensure the data directory exists
os.makedirs(DATA_DIR, exist_ok=True)


print(f"Configuration set. Data will be generated from {START_DATE.date()} to {END_DATE.date()}.")
print(f"Output will be saved to: {OUTPUT_FILE}")

Configuration set. Data will be generated from 2025-07-01 to 2025-09-28.
Output will be saved to: ../data/events.csv


## 2. Entity Generation

First, we create the core entities of our digital bank: users, devices, and merchants. Each user is assigned a `persona` which will dictate their typical behavior, providing a realistic baseline for our simulation.

In [4]:
def create_users():
    """Generates a DataFrame of users with different personas."""
    users = []
    user_personas = ['tech_savvy_youth', 'routine_commuter', 'cautious_pensioner', 'small_business_owner']
    for _ in range(NUM_USERS):
        customer_since = fake.date_time_between(start_date='-5y', end_date='-1y', tzinfo=india_tz)
        users.append({
            'user_id': f"usr_{uuid.uuid4().hex[:12]}",
            'customer_since': customer_since,
            'kyc_risk_level': random.choice(['low', 'medium', 'high']),
            'persona': random.choice(user_personas)
        })
    return pd.DataFrame(users)

def create_devices():
    """Generates a pool of unique device IDs."""
    return [f"dev_{uuid.uuid4().hex[:12]}" for _ in range(NUM_DEVICES)]

def create_merchants():
    """Generates a DataFrame of merchants."""
    merchants = []
    categories = ['groceries', 'electronics', 'travel', 'utilities', 'entertainment', 'clothing', 'restaurants']
    for _ in range(NUM_MERCHANTS):
        merchants.append({
            'merchant_id': f"mer_{uuid.uuid4().hex[:12]}",
            'merchant_category': random.choice(categories)
        })
    return pd.DataFrame(merchants)

# Generate the entities
users_df = create_users()
devices_pool = create_devices()
merchants_df = create_merchants()

print(f"Generated {len(users_df)} users, {len(devices_pool)} devices, and {len(merchants_df)} merchants.")
print("\nSample Users:")
display(users_df.head())

Generated 5000 users, 7500 devices, and 500 merchants.

Sample Users:


Unnamed: 0,user_id,customer_since,kyc_risk_level,persona
0,usr_2ba35ac017c5,2023-10-01 17:45:14+05:30,high,tech_savvy_youth
1,usr_d3065d2f8a7f,2022-12-20 21:39:08+05:30,low,cautious_pensioner
2,usr_dbff63d649d0,2022-12-13 03:17:37+05:30,low,routine_commuter
3,usr_481cd5455472,2024-03-01 14:54:43+05:30,low,tech_savvy_youth
4,usr_a5fbfacbb839,2023-02-21 19:26:20+05:30,high,tech_savvy_youth


## 3. Simulating Normal Behavior

This is the core of the simulation. We loop through each day, activating a subset of users. Each active user performs a series of actions within a "session" based on their assigned persona. This creates the rich, sequential data needed for Engine A.

In [5]:
def create_event(user_id, timestamp, event_type, channel, device_id, ip_address, session_id, **kwargs):
    """Helper function to create a single event dictionary."""
    return {
        'event_id': f"evt_{uuid.uuid4().hex[:12]}",
        'user_id': user_id,
        'timestamp': timestamp,
        'event_type': event_type,
        'channel': channel,
        'amount': kwargs.get('amount', 0.0),
        'source_account': f"acc_{user_id[4:]}",
        'destination_account': kwargs.get('destination_account', None),
        'device_id': device_id,
        'ip_address': ip_address,
        'session_id': session_id,
        'is_fraud': False
    }

def generate_persona_action(persona, merchants_df, users_df):
    """Generates a single action based on user persona."""
    event_type = 'transaction_inquiry' # Default action
    details = {}
    
    # Simplified logic for persona actions
    actions_map = {
        'tech_savvy_youth': [('merchant_payment', {'amount': (100, 2000)}), ('p2p_transfer', {'amount': (500, 5000)}), ('transaction_inquiry', {})],
        'routine_commuter': [('bill_pay', {'amount': (500, 5000)}), ('merchant_payment', {'amount': (100, 1500)}), ('transaction_inquiry', {})],
        'cautious_pensioner': [('atm_withdrawal', {'amount': (1000, 10000)}), ('transaction_inquiry', {})],
        'small_business_owner': [('bulk_transfer', {'amount': (50000, 500000)}), ('merchant_payment', {'amount': (1000, 10000)}), ('beneficiary_add', {})]
    }
    
    action, params = random.choice(actions_map[persona])
    event_type = action
    
    if 'amount' in params:
        details['amount'] = round(random.uniform(*params['amount']), 2)
    
    if action == 'merchant_payment':
        details['destination_account'] = random.choice(merchants_df['merchant_id'])
    elif action == 'p2p_transfer':
        details['destination_account'] = random.choice(users_df['user_id'])
    elif action == 'bill_pay':
        details['destination_account'] = 'util_company_'+str(random.randint(1,5))
    elif action == 'bulk_transfer':
        details['destination_account'] = 'supplier_'+str(random.randint(1,20))
        
    return event_type, details

def generate_normal_events(users_df, devices_pool, merchants_df):
    """Simulates normal user behavior based on personas."""
    all_events = []
    user_devices = {user_id: random.sample(devices_pool, k=random.randint(1, 3)) for user_id in users_df['user_id']}

    for day in range(TOTAL_DAYS):
        current_date = START_DATE + timedelta(days=day)
        active_users_df = users_df.sample(frac=random.uniform(0.1, 0.3))

        for _, user in active_users_df.iterrows():
            persona = user['persona']
            user_id = user['user_id']
            session_id = f"sess_{uuid.uuid4().hex[:12]}"
            
            # Simplified session start time
            start_hour = random.randint(0, 23)
            session_start_time = current_date.replace(hour=start_hour, minute=random.randint(0, 59))
            
            timestamp = session_start_time
            device = random.choice(user_devices[user_id])
            ip_address = fake.ipv4()
            channel = random.choice(['mobile_app', 'web_browser'])

            all_events.append(create_event(user_id, timestamp, 'login_success', channel, device, ip_address, session_id))

            for _ in range(random.randint(1, 5)): # Simplified number of actions
                timestamp += timedelta(seconds=random.randint(10, 300))
                event_type, event_details = generate_persona_action(persona, merchants_df, users_df)
                all_events.append(create_event(user_id, timestamp, event_type, channel, device, ip_address, session_id, **event_details))
    
    return all_events

# Generate the normal events
normal_events = generate_normal_events(users_df, devices_pool, merchants_df)
print(f"Generated {len(normal_events)} normal user events.")

Generated 354129 normal user events.


## 4. Injecting Sophisticated Fraud Scenarios

With a baseline of normal behavior established, we now inject specific, hard-to-detect fraud patterns into the dataset. Each function creates a narrative of a particular type of attack.

In [6]:
# Fraud Scenario 1: Account Takeover (ATO)
def inject_account_takeover(events, users_df, devices_pool):
    fraud_events = []
    # Ensure we don't select a user who might not have devices in the main pool, for simplicity
    victim = users_df.sample(1).iloc[0]
    
    # Get a list of devices NOT associated with the victim to ensure it's a new device
    # This is a conceptual approach; a real implementation would be more robust
    victim_devices = set(users_df[users_df['user_id'] == victim['user_id']].index) # Simplified way to get some devices
    non_victim_devices = list(set(devices_pool) - victim_devices)
    if not non_victim_devices: # In case the victim has all devices, which is unlikely
        non_victim_devices = [f"dev_{uuid.uuid4().hex[:12]}"] # Create a brand new one
    fraud_device = random.choice(non_victim_devices)
    
    fraud_ip = fake.ipv4()
    attack_time = fake.date_time_between(start_date=START_DATE, end_date=END_DATE, tzinfo=india_tz)
    session_id = f"sess_{uuid.uuid4().hex[:12]}"
    
    # CORRECTED STRUCTURE: (delay_seconds, event_type, other_details_dict)
    fraud_sequence = [
        (0, 'login_attempt_failed', {}),
        (15, 'password_reset', {}),
        (45, 'login_success', {}),
        (105, 'beneficiary_add', {'destination_account': f"fraud_ben_{uuid.uuid4().hex[:6]}"}),
        (180, 'high_value_transfer', {'amount': round(random.uniform(50000, 200000), 2), 'destination_account': f"fraud_ben_{uuid.uuid4().hex[:6]}"})
    ]
    
    for delay, event_type, details in fraud_sequence:
        event_time = attack_time + timedelta(seconds=delay)
        # CORRECTED FUNCTION CALL: We pass event_type directly and unpack only the 'details' dictionary
        event = create_event(victim['user_id'], event_time, event_type, 'web_browser', fraud_device, fraud_ip, session_id, **details)
        event['is_fraud'] = True
        fraud_events.append(event)
        
    return events + fraud_events

# Fraud Scenario 2: Mule Account Ring
def inject_mule_ring(events, users_df):
    fraud_events = []
    num_mules = 5
    mule_ids = [f"usr_{uuid.uuid4().hex[:12]}" for _ in range(num_mules)]
    victim = users_df.sample(1).iloc[0]['user_id']
    fraud_device = f"dev_{uuid.uuid4().hex[:12]}" # A single device used by the mule herder
    fraud_ip = fake.ipv4() # A single IP used by the mule herder
    attack_time = fake.date_time_between(start_date=START_DATE, end_date=END_DATE, tzinfo=india_tz)
    
    # 1. Victim to collector mule
    collector_mule = mule_ids[0]
    total_fraud_amount = round(random.uniform(200000, 1000000), 2)
    # The initial fraudulent transaction from a compromised device/ip
    event = create_event(
        victim, 
        attack_time, 
        'high_value_transfer', 
        'web_browser', 
        f"dev_{uuid.uuid4().hex[:12]}", # compromised device
        fake.ipv4(), # compromised ip
        f"sess_{uuid.uuid4().hex[:12]}", 
        amount=total_fraud_amount, 
        destination_account=f"acc_{collector_mule[4:]}"
    )
    event['is_fraud'] = True
    fraud_events.append(event)
    
    # 2. Collector disperses funds using a shared device/IP
    session_id_mule = f"sess_{uuid.uuid4().hex[:12]}" # One session for the mule herder
    current_time = attack_time + timedelta(minutes=random.randint(5,15))
    amount_per_mule = round(total_fraud_amount / (num_mules - 1), 2)

    for mule in mule_ids[1:]:
        current_time += timedelta(seconds=random.randint(20, 90)) # Rapid dispersal
        event = create_event(
            collector_mule, 
            current_time, 
            'p2p_transfer', 
            'web_browser', 
            fraud_device, # Shared device
            fraud_ip,     # Shared IP
            session_id_mule, 
            amount=amount_per_mule, 
            destination_account=f"acc_{mule[4:]}"
        )
        event['is_fraud'] = True
        fraud_events.append(event)
        
    return events + fraud_events
    
# Inject a few scenarios
events_with_fraud = list(normal_events) # Make a copy
num_fraud_scenarios = 50 # Let's make it configurable
print(f"Injecting {num_fraud_scenarios} of each fraud type...")

for i in range(num_fraud_scenarios):
    events_with_fraud = inject_account_takeover(events_with_fraud, users_df, devices_pool)
    events_with_fraud = inject_mule_ring(events_with_fraud, users_df)

print(f"Total events after injecting fraud: {len(events_with_fraud)}")

Injecting 50 of each fraud type...
Total events after injecting fraud: 354629


## 5. Final Assembly and Export

Finally, we combine all the generated events into a single pandas DataFrame. We sort the data chronologically by `timestamp` to create a coherent event log and then save it to a CSV file in the `../data/` directory. This file is now ready for feature engineering and model training.

In [7]:
# Final Assembly
final_df = pd.DataFrame(events_with_fraud)

# Convert timestamp to datetime objects and sort
final_df['timestamp'] = pd.to_datetime(final_df['timestamp'])
final_df = final_df.sort_values(by='timestamp').reset_index(drop=True)

# --- NEW LINE ADDED ---
# Save the user profiles DataFrame as well
users_output_file = os.path.join(DATA_DIR, 'users.csv')
users_df.to_csv(users_output_file, index=False)
# --- END OF NEW LINE ---

# Save the events CSV
final_df.to_csv(OUTPUT_FILE, index=False)

print("--- Data Generation Complete ---")
print(f"User profiles saved to: {users_output_file}")
print(f"Total events created: {len(final_df)}")
print(f"Total fraudulent events: {final_df['is_fraud'].sum()}")
print(f"Events dataset saved to: {OUTPUT_FILE}")

print("\nSample of the final generated data:")
display(final_df.head())

print("\nRandom sample of fraudulent events:")
display(final_df[final_df['is_fraud'] == True].sample(5))

--- Data Generation Complete ---
User profiles saved to: ../data/users.csv
Total events created: 354629
Total fraudulent events: 500
Events dataset saved to: ../data/events.csv

Sample of the final generated data:


Unnamed: 0,event_id,user_id,timestamp,event_type,channel,amount,source_account,destination_account,device_id,ip_address,session_id,is_fraud
0,evt_cb95b14fdf9b,usr_9c73435ecbaa,2025-06-30 23:37:00+05:30,login_success,web_browser,0.0,acc_9c73435ecbaa,,dev_6c4187fad28a,126.58.205.46,sess_78c0846b5447,False
1,evt_b70b72d6ffd9,usr_43ca4ced01d5,2025-06-30 23:37:00+05:30,login_success,web_browser,0.0,acc_43ca4ced01d5,,dev_3ca1b360a357,164.109.173.147,sess_f76c862eaf29,False
2,evt_b85458c561d1,usr_9c73435ecbaa,2025-06-30 23:38:20+05:30,merchant_payment,web_browser,185.52,acc_9c73435ecbaa,mer_0b3027feabe1,dev_6c4187fad28a,126.58.205.46,sess_78c0846b5447,False
3,evt_258143911fb7,usr_9c73435ecbaa,2025-06-30 23:39:24+05:30,bill_pay,web_browser,2092.25,acc_9c73435ecbaa,util_company_2,dev_6c4187fad28a,126.58.205.46,sess_78c0846b5447,False
4,evt_6fe1088f6c95,usr_9f216b9a3801,2025-06-30 23:41:00+05:30,login_success,mobile_app,0.0,acc_9f216b9a3801,,dev_eae0754b288c,206.187.228.206,sess_fb0bc76192d2,False



Random sample of fraudulent events:


Unnamed: 0,event_id,user_id,timestamp,event_type,channel,amount,source_account,destination_account,device_id,ip_address,session_id,is_fraud
36596,evt_5fd3505565c9,usr_965d32470fe3,2025-07-10 20:18:05+05:30,p2p_transfer,web_browser,231096.23,acc_965d32470fe3,acc_48aa84dc2a22,dev_8ec6b2a53041,209.197.35.69,sess_94e14c5bf7dc,True
57448,evt_a264c9425885,usr_0fe3a088e222,2025-07-16 09:49:08+05:30,p2p_transfer,web_browser,215438.42,acc_0fe3a088e222,acc_6c6328041297,dev_dea65ad7300a,68.94.178.27,sess_00d3bb1398d9,True
25448,evt_9cdc2d0a4d9c,usr_2b614dc63207,2025-07-08 00:52:00+05:30,p2p_transfer,web_browser,71256.62,acc_2b614dc63207,acc_66525311d453,dev_7191e4cea77d,126.1.247.161,sess_9439db1408fe,True
31893,evt_46f137f481fa,usr_1169ee54a13d,2025-07-09 19:36:26+05:30,p2p_transfer,web_browser,199529.85,acc_1169ee54a13d,acc_fd25ca3cbe6a,dev_b558ea116a1e,101.216.228.60,sess_fd9a4b61baa4,True
283097,evt_6494b4540bd5,usr_3ba32eeeb82b,2025-09-10 14:03:56+05:30,p2p_transfer,web_browser,142806.65,acc_3ba32eeeb82b,acc_fd17fb68001c,dev_ea318edcc410,189.177.211.208,sess_40519f5bf654,True
