# Enterprise Attack Simulator - Dataset Exploration

## What Makes This Dataset Different

This is **not a static dataset** - it's a **sample output** from an AI-powered attack simulator that generates:

- **Realistic attack timelines** (23 days, not hours)
- **Service account hijacking** (attack hidden in 185K normal DB logs)
- **Defense product logs** (EDR, DLP, SIEM responses)
- **Defense-aware attacks** (AI-generated based on your security stack)
- **Enterprise-scale noise** (8.1M logs, 0.007% attack signal)

## Sample Dataset Overview

**Attack:** `living_off_land_basic` (APT29-style)  
**Attacker:** joseph.wilson475 (Security Analyst - blends with normal activity)  
**Timeline:** 23 days (Dec 16, 2025 → Jan 7, 2026)  
**Techniques:** PowerShell → Domain Discovery → RDP → Exfiltration  

**Scale:**
- Total logs: 8,108,152
- Attack logs: 570 (0.007%)
- Defense alerts: 209 (37% detection rate)
- Users: 500
- Service accounts: 55
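
The percentages above follow from the raw counts. A quick check using only the figures quoted in this section, and assuming the "detection rate" is the ratio of defense alerts to attack logs (the notebook does not define it explicitly, but the numbers match):

```python
# Figures quoted in the overview above
total_logs = 8_108_152
attack_log_count = 570
defense_alerts = 209

# Share of all logs that belong to the attack
attack_signal_pct = attack_log_count / total_logs * 100

# Assumption: detection rate = defense alerts per attack log
detection_rate_pct = defense_alerts / attack_log_count * 100

print(f"Attack signal: {attack_signal_pct:.3f}%")
print(f"Detection rate: {detection_rate_pct:.0f}%")
```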

---

## 1. Load Dataset

In [27]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime

# Load dataset (pandas infers gzip compression from the .gz extension;
# despite the .csv name, the file is tab-separated)
df = pd.read_csv('../data/simulation.csv.gz', sep="\t")

print(f"Total logs: {len(df):,}")
print(f"Date range: {df['timestamp'].min()} → {df['timestamp'].max()}")
print(f"Memory: {df.memory_usage(deep=True).sum() / 1024**3:.2f} GB")
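
With 8.1M rows, the full frame may not fit comfortably in memory on smaller machines. The following is a minimal sketch of a streaming load that keeps only attack rows, assuming the same tab-separated layout and the `attack_id` column used later in this notebook. The helper name `load_attack_rows` and the default chunk size are illustrative and not part of the simulator's tooling.

```python
import pandas as pd

def load_attack_rows(path, chunksize=500_000):
    """Stream a large tab-separated file and keep only rows with an attack_id."""
    kept = []
    # Compression is inferred from the file extension, so .gz works as-is
    for chunk in pd.read_csv(path, sep="\t", chunksize=chunksize):
        kept.append(chunk[chunk["attack_id"].notna()])
    return pd.concat(kept, ignore_index=True)
```

The same loop could instead collect per-chunk aggregates (for example, `value_counts()` on `log_type`) when only summary statistics are needed.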

## 2. Dataset Composition

### Attack vs Benign Breakdown

In [26]:
# Separate attack and benign logs
attack_mask = df['attack_id'].notna()
benign_mask = ~attack_mask

attack_logs = df[attack_mask]
benign_logs = df[benign_mask]

print(f"Attack logs: {len(attack_logs):,} ({len(attack_logs)/len(df)*100:.4f}%)")
print(f"Benign logs: {len(benign_logs):,} ({len(benign_logs)/len(df)*100:.2f}%)")
print(f"\nAttack ID: {attack_logs['attack_id'].iloc[0]}")
print(f"Attacker: {attack_logs['user'].iloc[0]}")
print(f"Department: {attack_logs['department'].iloc[0]}")

### Log Type Distribution

**Key Insight:** Defense product logs are separate from benign Windows logs.

In [25]:
# Categorize log types
def categorize_log(log_type):
    if pd.isna(log_type):
        return 'unknown'
    elif 'windows' in log_type:
        return 'windows_event'
    elif any(x in log_type for x in ['defender', 'crowdstrike', 'sentinelone', 'carbonblack']):
        return 'edr'
    elif 'dlp' in log_type:
        return 'dlp'
    elif 'siem' in log_type:
        return 'siem'
    elif any(x in log_type for x in ['pam', 'mfa', 'nac']):
        return 'access_control'
    else:
        return 'other_defense'

df['log_category'] = df['log_type'].apply(categorize_log)

# Distribution
log_dist = df['log_category'].value_counts()
print("Log Type Distribution:")
for cat, count in log_dist.items():
    print(f"  {cat}: {count:,} ({count/len(df)*100:.2f}%)")

# Visualize
plt.figure(figsize=(12, 6))
log_dist.plot(kind='bar')
plt.title('Log Type Distribution')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.tight_layout()
plt.show()

## 3. Attack Timeline Analysis

### Multi-Week Attack Progression

In [24]:
# Parse timestamps (copy first so the new columns are not written to a slice of df)
attack_logs = attack_logs.copy()
attack_logs['datetime'] = pd.to_datetime(attack_logs['timestamp'])
attack_logs['date'] = attack_logs['datetime'].dt.date

# Attack progression by stage and date
stage_by_date = attack_logs.groupby(['date', 'stage_number']).size().reset_index(name='count')

print("Attack Timeline:")
print(f"  First action: {attack_logs['datetime'].min()}")
print(f"  Last action: {attack_logs['datetime'].max()}")
print(f"  Duration: {(attack_logs['datetime'].max() - attack_logs['datetime'].min()).days} days")
print(f"  Total stages: {attack_logs['stage_number'].nunique()}")

# Visualize timeline (DataFrame.plot creates its own figure, so no separate plt.figure() call)
pivot = stage_by_date.pivot(index='date', columns='stage_number', values='count').fillna(0)
pivot.plot(kind='bar', stacked=True, figsize=(16, 8), colormap='viridis')
plt.title('Attack Progression Over Time (Stages per Day)')
plt.xlabel('Date')
plt.ylabel('Number of Attack Actions')
plt.xticks(rotation=45)
plt.legend(title='Stage', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()

print("\nStages by Date:")
for date, stages in attack_logs.groupby('date')['stage_number'].apply(lambda x: sorted(set(x))).items():
    print(f"  {date}: Stages {stages}")

### MITRE ATT&CK Technique Distribution

In [23]:
# Technique usage
technique_map = {
    't1059.001': 'PowerShell Execution',
    't1087.002': 'Domain Discovery',
    't1021.001': 'RDP Lateral Movement',
    't1041': 'Exfiltration'
}

tech_dist = attack_logs['attack_type'].value_counts()
print("MITRE Technique Usage:")
for tech, count in tech_dist.items():
    name = technique_map.get(tech, tech)
    print(f"  {tech} ({name}): {count} logs")

# Visualize
plt.figure(figsize=(10, 6))
tech_dist.plot(kind='barh')
plt.title('MITRE ATT&CK Technique Distribution')
plt.xlabel('Number of Attack Logs')
plt.tight_layout()
plt.show()

## 4. Service Account Hijacking Analysis

### The Needle in the Haystack Problem

Attack logs are buried in massive service account activity:

In [22]:
# Service account usage
svc_accounts = df[df['service_account'] == True]['account'].value_counts()

print("Top 10 Service Accounts by Activity:")
for account, count in svc_accounts.head(10).items():
    # Check if used in attack
    attack_usage = attack_logs[attack_logs['account'] == account]
    if len(attack_usage) > 0:
        signal = len(attack_usage) / count * 100
        print(f"  {account}: {count:,} logs ({len(attack_usage)} attack logs = {signal:.3f}% signal)")
    else:
        print(f"  {account}: {count:,} logs")

### Service Account Lateral Movement Detection

Key insight: the `user` field shows who is actually using the service account.

In [21]:
# Analyze svc_database usage
svc_db_logs = df[df['account'] == 'svc_database']

print(f"svc_database Total Activity: {len(svc_db_logs):,} logs")
print(f"\nUser Field Analysis:")

# Who's using this service account?
user_usage = svc_db_logs['user'].value_counts()
for user, count in user_usage.head(10).items():
    # Check if attack
    attack_usage = svc_db_logs[(svc_db_logs['user'] == user) & (svc_db_logs['attack_id'].notna())]
    if len(attack_usage) > 0:
        print(f"   {user}: {count} logs ({len(attack_usage)} ATTACK)")
    else:
        print(f"   {user}: {count} logs (legitimate)")

print("\nKey Insight:")
print("- user=svc_database → Normal service account background activity")
print("- user=<actual_user> → Legitimate user authenticating TO database")
print("- user=joseph.wilson475 (with attack_id) → ATTACKER hijacked svc_database")
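
The pattern described above can be turned into a simple per-account summary: for each service account, count how many distinct identities other than the account itself appear in `user`. This is a minimal sketch on a toy frame, assuming the `user`, `account`, and `service_account` columns behave as in the cells above. The helper name and the sample rows are hypothetical.

```python
import pandas as pd

def humans_per_service_account(df):
    """Count distinct users acting through each service account."""
    svc = df[df["service_account"].eq(True)]
    # Rows where user == account are the service's own background activity
    human = svc[svc["user"].notna() & (svc["user"] != svc["account"])]
    return (human.groupby("account")["user"]
                 .nunique()
                 .sort_values(ascending=False))

# Hypothetical sample rows for illustration only
toy = pd.DataFrame({
    "user": ["svc_database", "alice", "joseph.wilson475", "bob"],
    "account": ["svc_database", "svc_database", "svc_database", "bob"],
    "service_account": [True, True, True, False],
})
print(humans_per_service_account(toy))
```

A high count is not an alert on its own, since legitimate users also authenticate to the database. It is a starting point for deciding which accounts deserve a per-user baseline.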

## 5. Defense Product Analysis

### Detection Rate by Product

In [20]:
# Defense logs only (exclude Windows events and rows with no log_type)
defense_logs = df[~df['log_category'].isin(['windows_event', 'unknown'])]

print(f"Total Defense Logs: {len(defense_logs):,}")
print(f"\nDefense Log Types:")
defense_dist = defense_logs['log_type'].value_counts()
for log_type, count in defense_dist.head(15).items():
    print(f"  {log_type}: {count}")

# Which defense products triggered?
print("\nDefense Actions Taken:")
if 'action_taken' in defense_logs.columns:
    action_dist = defense_logs['action_taken'].value_counts()
    for action, count in action_dist.items():
        print(f"  {action}: {count}")

### EDR Detection Analysis

In [19]:
# EDR alerts
edr_logs = df[df['log_category'] == 'edr']

print(f"Total EDR Alerts: {len(edr_logs)}")

if len(edr_logs) > 0:
    print("\nEDR Vendors:")
    if 'vendor' in edr_logs.columns:
        vendor_dist = edr_logs['vendor'].value_counts()
        for vendor, count in vendor_dist.items():
            print(f"  {vendor}: {count} alerts")
    
    print("\nEDR Severity Distribution:")
    if 'severity' in edr_logs.columns:
        severity_dist = edr_logs['severity'].value_counts()
        for severity, count in severity_dist.items():
            print(f"  {severity}: {count}")

## 6. Compromised User Behavior Analysis

### Benign vs Malicious Activity Comparison

In [18]:
# Get compromised user (some attack rows have an empty `user`, so skip those)
attacker = attack_logs['user'].dropna().iloc[0]
print(f"Compromised User: {attacker}")

# All logs from this user
user_logs = df[df['user'] == attacker]
user_attack = user_logs[user_logs['attack_id'].notna()]
user_benign = user_logs[user_logs['attack_id'].isna()]

print(f"\nTotal activity: {len(user_logs):,} logs")
print(f"  Benign: {len(user_benign):,} ({len(user_benign)/len(user_logs)*100:.1f}%)")
print(f"  Attack: {len(user_attack):,} ({len(user_attack)/len(user_logs)*100:.1f}%)")

# Process comparison
print("\nTop Processes (Benign):")
benign_procs = user_benign['process_name'].value_counts().head(5)
for proc, count in benign_procs.items():
    print(f"  {proc}: {count}")

print("\nTop Processes (Attack):")
attack_procs = user_attack['process_name'].value_counts().head(5)
for proc, count in attack_procs.items():
    print(f"  {proc}: {count}")

print("\nOverlap Problem:")
overlap = set(benign_procs.index) & set(attack_procs.index)
if overlap:
    print(f"Processes used both legitimately AND maliciously: {overlap}")
    print("Process name alone is insufficient for detection")
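
Since process names overlap, one option is to look at how the process was invoked rather than which process ran. The following is a minimal sketch, assuming the `command_line` column shown in the sample output later in this notebook. The pattern list is illustrative and deliberately small; it is not a vetted detection rule, and a security analyst may legitimately trigger the same flags.

```python
import pandas as pd

# Hypothetical indicator patterns (case-insensitive regular expressions)
SUSPICIOUS_PATTERNS = {
    "exec_policy_bypass": r"-executionpolicy\s+bypass",
    "encoded_command": r"-enc(?:odedcommand)?\s",
}

def add_cmdline_flags(df):
    """Add one boolean column per pattern, matched against command_line."""
    out = df.copy()
    cmd = out["command_line"].fillna("")  # treat missing command lines as empty
    for name, pattern in SUSPICIOUS_PATTERNS.items():
        out[name] = cmd.str.contains(pattern, case=False, regex=True)
    return out

# Hypothetical sample rows for illustration only
toy = pd.DataFrame({"command_line": [
    'powershell -ExecutionPolicy Bypass -Command "Get-ExecutionPolicy"',
    "powershell Get-Process",
    None,
]})
flags = add_cmdline_flags(toy)
print(flags[["exec_policy_bypass", "encoded_command"]])
```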

## 7. Detection Challenge Visualization

### Why This Attack Is Hard to Detect

In [17]:
# Create a 2x2 comparison figure
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# 1. Attack hidden in user activity
ax1 = axes[0, 0]
user_breakdown = pd.Series({
    'Benign Activity': len(user_benign),
    'Attack Activity': len(user_attack)
})
user_breakdown.plot(kind='pie', autopct='%1.1f%%', ax=ax1)
ax1.set_title(f'{attacker} Activity Breakdown')
ax1.set_ylabel('')

# 2. Attack hidden in service account noise
ax2 = axes[0, 1]
svc_db_breakdown = pd.Series({
    'Benign Service Activity': len(svc_db_logs) - len(svc_db_logs[svc_db_logs['attack_id'].notna()]),
    'Attack (Hijacked)': len(svc_db_logs[svc_db_logs['attack_id'].notna()])
})
svc_db_breakdown.plot(kind='pie', autopct='%1.2f%%', ax=ax2)
ax2.set_title('svc_database Activity Breakdown')
ax2.set_ylabel('')

# 3. Alert counts by defense log type
ax3 = axes[1, 0]
if len(defense_logs) > 0:
    defense_logs['log_type'].value_counts().head(8).plot(kind='barh', ax=ax3)
    ax3.set_title('Defense Product Alerts')
    ax3.set_xlabel('Count')

# 4. Attack timeline heatmap
ax4 = axes[1, 1]
if len(stage_by_date) > 0:
    pivot = stage_by_date.pivot(index='date', columns='stage_number', values='count').fillna(0)
    sns.heatmap(pivot.T, annot=False, cmap='YlOrRd', ax=ax4, cbar_kws={'label': 'Actions'})
    ax4.set_title('Attack Progression Heatmap')
    ax4.set_xlabel('Date')
    ax4.set_ylabel('Stage Number')

plt.tight_layout()
plt.show()

print("\nKey Takeaways:")
print(f"1. Only {len(attack_logs)/len(df)*100:.4f}% of total logs are attack (needle in haystack)")
print("2. Attacker is a security analyst - tools overlap with legitimate use")
print(f"3. Service account hijacking provides perfect cover ({len(svc_db_logs):,} logs)")
print(f"4. Multi-week timeline avoids spike detection ({(attack_logs['datetime'].max() - attack_logs['datetime'].min()).days} days)")
# Compute the ratio instead of hard-coding it, so it always matches the printed count
print(f"5. Defense products logged {len(defense_logs):,} events ({len(defense_logs)/len(attack_logs)*100:.0f}% of attack actions)")

## 8. Sample Attack Logs

### Stage 0: Initial PowerShell Execution

In [None]:
# Show first attack stage
stage_0 = attack_logs[attack_logs['stage_number'] == 0.0].head(5)
display_cols = ['timestamp', 'user', 'account', 'process_name', 'command_line', 'success', 'error']
print("Stage 0 Sample (PowerShell Execution Attempts):")
print(stage_0[display_cols].to_string())

Stage 0 Sample (PowerShell Execution Attempts):
                   timestamp              user           account    process_name                                                         command_line success                                                                                             error
5633622  2025-12-16 01:32:03  joseph.wilson475  joseph.wilson475  powershell.exe  powershell -ExecutionPolicy Bypass -Command \Get-ExecutionPolicy\""   False                                                                                 Access is denied.
5633623  2025-12-16 01:32:03               NaN  joseph.wilson475  powershell.exe  powershell -executionpolicy bypass -command \get-executionpolicy\""     NaN                                                                                               NaN
5633624  2025-12-16 01:32:03               NaN  joseph.wilson475             NaN                                                                  NaN     NaN                           

## 9. Lateral Movement

### Service Account Hijacking: svc_adconnect

Key insight: the `user` field shows who is actually using the service account.

In [42]:
# Show lateral movement to svc_adconnect
lateral_mask = (attack_logs['stage_number'] == 5.0) & (attack_logs['account'] == 'svc_adconnect')
svc_adconnect_lateral = attack_logs[lateral_mask].head(10)
display_cols = ['timestamp', 'user', 'account', 'process_name', 'command_line', 'alert_name']
print("Service Account Hijacking: svc_adconnect")
print(svc_adconnect_lateral[display_cols].to_string())

print("\nNote: Service account hijacking pattern:")
print(f"  user={svc_adconnect_lateral['user'].iloc[0]} (original attacker)")
print(f"  account=svc_adconnect (hijacked AD Connect service account)")
print(f"  commands: AD enumeration (net group, dsquery, net user)")
print(f"  defense response: {svc_adconnect_lateral['alert_name'].notna().sum()} alerts triggered")

## Suggested Next Steps

This dataset can be explored in many ways:

### Analysis Ideas
- Visualize the 23-day attack timeline
- Compare service account patterns (benign vs. hijacked)
- Analyze defense product effectiveness by stage
- Identify legitimate vs. malicious lateral movement patterns
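
As a starting point for the "effectiveness by stage" idea, the sketch below summarizes how many attack logs in each stage carry an alert. It assumes the `stage_number` and `alert_name` columns used earlier in this notebook, and that a non-empty `alert_name` means a defense product fired on that log. The helper name and toy rows are hypothetical.

```python
import pandas as pd

def alert_coverage_by_stage(attack_logs):
    """Per stage: attack log count, logs with an alert, and the share alerted."""
    grouped = attack_logs.groupby("stage_number")["alert_name"]
    summary = pd.DataFrame({
        "attack_logs": grouped.size(),
        "alerted": grouped.count(),  # count() skips missing alert names
    })
    summary["coverage_pct"] = summary["alerted"] / summary["attack_logs"] * 100
    return summary

# Hypothetical sample rows for illustration only
toy = pd.DataFrame({
    "stage_number": [0.0, 0.0, 5.0, 5.0],
    "alert_name": [None, "PS bypass", "AD enum", "AD enum"],
})
coverage = alert_coverage_by_stage(toy)
print(coverage)
```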

### Detection Experiments
- Test different feature combinations
- Experiment with sequence-based approaches
- Measure false positive rates in realistic noise
- Build baselines for users and service accounts
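
One simple form of baseline is each user's own daily log volume. The sketch below computes a z-score per user per day, assuming the `user` and `timestamp` columns from this dataset. The helper name is hypothetical, and volume is only one of many possible baseline features.

```python
import pandas as pd

def daily_count_zscores(df):
    """Z-score of each user's daily log count against that user's own history."""
    ts = pd.to_datetime(df["timestamp"])
    daily = (df.assign(date=ts.dt.date)
               .groupby(["user", "date"])
               .size()
               .reset_index(name="count"))
    stats = daily.groupby("user")["count"].agg(["mean", "std"])
    daily = daily.join(stats, on="user")
    # std is NaN for single-day users and 0 for constant users; leave those as NaN
    daily["zscore"] = (daily["count"] - daily["mean"]) / daily["std"].where(daily["std"] > 0)
    return daily
```

A low-and-slow campaign like this one may never produce a large z-score on volume alone, so this works best as a reference point to compare richer features against.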

### Practice Scenarios
- Follow the attack as an investigation exercise
- Identify where defenses succeeded and failed
- Reconstruct the attack timeline from logs alone

This is one scenario; real-world detection requires adapting approaches to your specific environment and threats.

---

## Summary

**Dataset characteristics:**
- 8.1M logs spanning 25 days
- 570 attack logs (0.007% signal)
- 23-day intermediate-skill campaign
- 209 defense alerts (37% detection rate)

**Detection challenges demonstrated:**
- Tool overlap (Security analyst = attacker)
- Service account cover (0.03% signal in 185K logs)
- Multi-week dwell time (no spike to detect)
- Legitimate credentials (bypass many controls)
- Living-off-land techniques (no malware)

**What's needed for detection:**
- Context (who, when, what, why)
- Sequences (what happened before/after)
- Baselines (what's normal for this user/account)
- Multi-source correlation (defense + raw logs)
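
Multi-source correlation can be sketched as a time-windowed join between raw logs and defense alerts. The example below uses `pandas.merge_asof` to attach, for each raw log, the first alert for the same user within a short window after it. It assumes both frames have `timestamp` and `user` columns, that the alert frame has `alert_name`, and that the raw frame does not already carry an `alert_name` column. The function name and the 5-minute window are illustrative choices.

```python
import pandas as pd

def attach_next_alert(raw_logs, alerts, window="5min"):
    """For each raw log, attach the first same-user alert within `window` after it."""
    raw = (raw_logs.dropna(subset=["user"])
                   .assign(ts=lambda d: pd.to_datetime(d["timestamp"]))
                   .sort_values("ts"))
    al = (alerts.dropna(subset=["user"])
                .assign(ts=lambda d: pd.to_datetime(d["timestamp"]))
                .sort_values("ts"))
    return pd.merge_asof(
        raw, al[["ts", "user", "alert_name"]],
        on="ts", by="user",
        direction="forward",             # look for alerts at or after the log
        tolerance=pd.Timedelta(window),  # ignore alerts outside the window
    )

# Hypothetical sample rows for illustration only
raw = pd.DataFrame({
    "timestamp": ["2025-12-16 10:00:00", "2025-12-16 11:00:00"],
    "user": ["a", "a"],
})
alerts = pd.DataFrame({
    "timestamp": ["2025-12-16 10:02:00"],
    "user": ["a"],
    "alert_name": ["PS bypass"],
})
joined = attach_next_alert(raw, alerts)
print(joined[["ts", "user", "alert_name"]])
```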

This dataset provides one realistic scenario to practice these challenges.