# Dataset Exploration: Enterprise Security Logs with Embedded Attack Chain

## Overview

This notebook provides a comprehensive exploration of a synthetic enterprise log dataset containing:

- 500 users across multiple departments (Engineering, Finance, Security, Sales, etc.)
- 50+ service accounts with 24/7 activity patterns
- 1 embedded multi-stage attack chain (MITRE ATT&CK) with ground-truth labels
- Realistic benign activity spanning multiple days

### Key Features

Ground Truth Labels: Every attack event labeled with `attack_id`, `attack_type` (MITRE technique), and `stage_number`  
Role-Based Profiles: Security analysts, finance users, engineers with realistic tool usage  
Service Account Abuse: Compromised service accounts (svc_backup, svc_database) mixed with benign activity  
Temporal Realism: Multi-day attack campaigns with persistent threat behavior  
Behavioral Overlap: Admin tools used legitimately AND maliciously (detection challenge!)  

### Use Cases

- ML Training: UEBA models, anomaly detection, sequence modeling
- Detection Testing: Validate EDR/SIEM rules against realistic scenarios
- SOC Training: Teach analysts behavioral analysis and incident investigation
- Threat Research: Study attack progression and persistence patterns

---

## 1. Setup and Data Loading

Let's import necessary libraries and load the dataset.

In [19]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import json
from collections import Counter, defaultdict
from datetime import datetime
import warnings

# Configuration
warnings.filterwarnings('ignore')
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 6)
plt.rcParams['font.size'] = 10

print("Libraries loaded successfully")

In [18]:
# Load the dataset

# Path to dataset
DATASET_PATH = 'data/enterprise_logs_sample.csv'
# Load using pandas (the file is read as tab-delimited despite its .csv extension)
df = pd.read_csv(DATASET_PATH, sep="\t")

print(f"Dataset loaded: {len(df):,} events")
print(f"Memory usage: {df.memory_usage(deep=True).sum() / 1024**2:.2f} MB")

---

## 2. Schema Exploration

Understanding the data structure and available fields.

In [17]:
# Display first few events
print("Sample Events (first 3 rows):\n")
df.head(3)

In [16]:
# Schema information
print("Dataset Schema:\n")
print(f"Total columns: {len(df.columns)}")
print(f"Total rows: {len(df):,}\n")

# Column details
schema_info = pd.DataFrame({
    'Column': df.columns,
    'Data Type': df.dtypes.values,
    'Non-Null Count': df.count().values,
    'Null Count': df.isnull().sum().values,
    'Unique Values': [df[col].nunique() for col in df.columns]
})

schema_info

In [181]:
# Key field descriptions
field_descriptions = {
    'timestamp': 'ISO 8601 timestamp with timezone',
    'user': 'Human identity (e.g., linda.davis381) or service account entity',
    'account': 'Security principal used in event (user or service account)',
    'service_account': 'Flag indicating if account is a service account (true/false)',
    'hostname': 'Host where event originated',
    'device_type': 'Type of device (workstation, mobile, server, domain_controller)',
    'location': 'Logical site (NYC_HQ, SF_Office, Remote_VPN, etc.)',
    'process_name': 'Process that generated the event',
    'command_line': 'Command executed (if applicable)',
    'event_type': 'Type of event (process_start, network_connection, file_access, etc.)',
    'sha256_hash': 'SHA-256 file hash for binary/file events',
    'md5_hash': 'MD5 file hash for binary/file events',
    'prevalence_score': 'File rarity score (0=common, 1=rare/unique)',
    'signed': 'Digital signature status (true/false)',
    'attack_id': 'Attack identifier (null for benign, "ATK_XXXXX" for attack)',
    'attack_type': 'MITRE ATT&CK technique (null for benign, "t1xxx.xxx" for attack)',
    'stage_number': 'Attack stage in sequence (null for benign, integer for attack)',
}

print("\nKey Field Descriptions:\n")
for field, desc in field_descriptions.items():
    if field in df.columns:
        print(f"• {field}: {desc}")
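
The label fields above are null for benign events, so a binary target can be derived from `attack_id` alone. The following is a minimal sketch on an invented three-row sample (the rows, dates, and user names are made up, and it assumes benign rows hold either an empty cell or the text `NA` in the raw file). It also shows that pandas parses the text `NA` as a missing value by default, so comparing the column against the string `'NA'` matches nothing after loading.

```python
import io
import pandas as pd

# Hypothetical sample in the same tab-separated layout; all values are invented.
sample_tsv = (
    "timestamp\tuser\tprocess_name\tattack_id\tattack_type\tstage_number\n"
    "2024-01-01T09:00:00+00:00\talice\tchrome.exe\tNA\tNA\tNA\n"
    "2024-01-01T09:05:00+00:00\tbob\tpowershell.exe\tATK_00001\tt1059.001\t0\n"
    "2024-01-01T09:10:00+00:00\talice\toutlook.exe\t\t\t\n"
)
sample = pd.read_csv(io.StringIO(sample_tsv), sep="\t")

# Binary label: an event is an attack event when attack_id is present.
sample["is_attack"] = sample["attack_id"].notnull()

n_attack = int(sample["is_attack"].sum())
# "NA" and empty cells were both parsed as NaN, so no row equals the string "NA".
n_string_na = int((sample["attack_id"] == "NA").sum())
print(n_attack, n_string_na)
```

Using `notnull()` / `isnull()` for the attack/benign split keeps the logic independent of how the missing value was spelled in the file.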

---

## 3. Dataset Statistics

High-level metrics about the dataset composition.

In [10]:
# Basic statistics
print("="*80)
print("DATASET STATISTICS")
print("="*80)

# Temporal range
df['timestamp_parsed'] = pd.to_datetime(df['timestamp'])
date_range = df['timestamp_parsed'].dt.date
print(f"\n Temporal Coverage:")
print(f"   First Event: {df['timestamp_parsed'].min()}")
print(f"   Last Event:  {df['timestamp_parsed'].max()}")
print(f"   Duration:    {(df['timestamp_parsed'].max() - df['timestamp_parsed'].min()).days} days")
print(f"   Active Days: {date_range.nunique()} unique days")

# User and account statistics
print(f"\n👥 Users & Accounts:")
print(f"   Unique Users:          {df['user'].nunique():,}")
print(f"   Unique Accounts:       {df['account'].nunique():,}")
service_accounts = df[df['service_account'] == True]['account'].nunique()
print(f"   Service Accounts:      {service_accounts}")
human_users = df['user'].nunique() - service_accounts
print(f"   Human Users:           {human_users:,}")

# Infrastructure
print(f"\n  Infrastructure:")
print(f"   Unique Hosts:          {df['hostname'].nunique():,}")
print(f"   Unique Locations:      {df['location'].nunique()}")
print(f"   Device Types:          {df['device_type'].nunique()}")

# Events
print(f"\n Events:")
print(f"   Total Events:          {len(df):,}")
print(f"   Unique Event Types:    {df['event_type'].nunique()}")
print(f"   Unique Processes:      {df['process_name'].nunique():,}")

# Attack vs Benign
attack_events = df[df['attack_id'].notnull()]
benign_events = df[df['attack_id'].isnull()]
print(f"\n Attack vs Benign:")
print(f"   Attack Events:         {len(attack_events):,} ({len(attack_events)/len(df)*100:.2f}%)")
print(f"   Benign Events:         {len(benign_events):,} ({len(benign_events)/len(df)*100:.2f}%)")
print(f"   Attack-to-Benign Ratio: 1:{len(benign_events)/len(attack_events):.2f}")

print("\n" + "="*80)

In [183]:
# Event type distribution
print("Top 15 Event Types:\n")
event_type_counts = df['event_type'].value_counts().head(15)

fig, ax = plt.subplots(figsize=(12, 6))
event_type_counts.plot(kind='barh', ax=ax, color='steelblue')
ax.set_xlabel('Number of Events')
ax.set_ylabel('Event Type')
ax.set_title('Distribution of Event Types', fontsize=14, fontweight='bold')
ax.invert_yaxis()

# Add count labels
for i, v in enumerate(event_type_counts.values):
    ax.text(v + 100, i, f'{v:,}', va='center')

plt.tight_layout()
plt.show()

event_type_counts

In [184]:
# Process distribution
print("Top 20 Processes:\n")
process_counts = df['process_name'].value_counts().head(20)

fig, ax = plt.subplots(figsize=(12, 8))
process_counts.plot(kind='barh', ax=ax, color='coral')
ax.set_xlabel('Number of Events')
ax.set_ylabel('Process Name')
ax.set_title('Top 20 Most Active Processes', fontsize=14, fontweight='bold')
ax.invert_yaxis()

# Add count labels
for i, v in enumerate(process_counts.values):
    ax.text(v + 100, i, f'{v:,}', va='center')

plt.tight_layout()
plt.show()

process_counts

---

## 4. Attack Analysis - The Embedded Threat

Deep dive into the attack chain embedded in the dataset.

In [11]:
# Attack overview
attack_df = df[df['attack_id'].notnull()].copy()

print("="*80)
print("ATTACK CHAIN ANALYSIS")
print("="*80)

# Basic attack info
unique_attacks = attack_df['attack_id'].unique()
print(f"\n Attack Overview:")
print(f"   Unique Attack IDs:     {len(unique_attacks)}")
print(f"   Attack IDs:            {', '.join(unique_attacks)}")
print(f"   Total Attack Events:   {len(attack_df):,}")

# Focus on primary attack
primary_attack_id = unique_attacks[0]
primary_attack = attack_df[attack_df['attack_id'] == primary_attack_id]

print(f"\n Primary Attack: {primary_attack_id}")
print(f"   Events:                {len(primary_attack):,}")
print(f"   Compromised User:      {primary_attack['user'].mode()[0]}")
print(f"   Primary Hostname:      {primary_attack['hostname'].mode()[0]}")
print(f"   Location:              {primary_attack['location'].mode()[0]}")

# Temporal span
attack_start = primary_attack['timestamp_parsed'].min()
attack_end = primary_attack['timestamp_parsed'].max()
attack_duration = (attack_end - attack_start).total_seconds() / 3600
print(f"\n Attack Timeline:")
print(f"   First Event:           {attack_start}")
print(f"   Last Event:            {attack_end}")
print(f"   Duration:              {attack_duration:.1f} hours ({attack_duration/24:.1f} days)")

# MITRE ATT&CK techniques
techniques = primary_attack['attack_type'].value_counts()
print(f"\n MITRE ATT&CK Techniques:")
for technique, count in techniques.items():
    print(f"   {technique}: {count:,} events ({count/len(primary_attack)*100:.1f}%)")

# Attack stages
stages = primary_attack['stage_number'].value_counts().sort_index()
print(f"\n Attack Stages:")
print(f"   Total Stages:          {len(stages)}")
print(f"   Stage Range:           {stages.index.min()} → {stages.index.max()}")

# Account abuse
compromised_accounts = primary_attack['account'].value_counts()
print(f"\n Compromised/Abused Accounts:")
for account, count in compromised_accounts.head(10).items():
    is_svc = '(service)' if account.startswith('svc_') else '(user)'
    print(f"   {account:30s} {is_svc:12s} {count:4,} events")


# Lateral movement: Show hostname progression
print(f"\n Lateral Movement (Hostname Progression):")
hostname_progression = primary_attack.groupby('hostname').agg({
    'timestamp_parsed': ['min', 'max', 'count']
}).sort_values(('timestamp_parsed', 'min'))
print(hostname_progression)

print("\n" + "="*80)

In [12]:
# Process-based Attack Detection
print("="*80)
print("PROCESS-BASED ANOMALY ANALYSIS")
print("="*80)

print(f"\n Key Insight: Authentication != Attack")
print(f"   Users legitimately authenticate with service accounts for admin tasks.")
print(f"   What matters is WHAT PROCESSES RUN after authentication.\n")

# Analyze processes used in attack vs benign on same hostnames
attack_hostnames = primary_attack['hostname'].unique()

# Get attack processes on service servers
attack_procs = primary_attack['process_name'].value_counts()
print(f"Processes Used During Attack:")
for proc, count in attack_procs.head(15).items():
    print(f"   {proc:30s} {count:3d} events")

# Identify highly suspicious processes
suspicious_processes = [
    'reg.exe', 'procdump.exe', 'rundll32.exe', 'mimikatz', 'wmic.exe',
    'certutil.exe', 'bitsadmin.exe', 'ftp.exe'
]

attack_suspicious = primary_attack[primary_attack['process_name'].isin(suspicious_processes)]
print(f"\n Highly Suspicious Processes in Attack:")
print(f"   Total Events: {len(attack_suspicious)} / {len(primary_attack)} ({len(attack_suspicious)/len(primary_attack)*100:.1f}%)")
for proc in suspicious_processes:
    count = len(attack_suspicious[attack_suspicious['process_name'] == proc])
    if count > 0:
        print(f"   • {proc}: {count} events")

print("\n Detection Strategy:")
print("   • Focus on PROCESS ANOMALIES, not authentication patterns")
print("   • Build baselines: What processes are normal for each server?")
print("   • Alert on: reg.exe, procdump, rundll32 on non-admin servers")
print("   • Combine with: Off-hours activity, volume spikes, command-line patterns")

print("\n" + "="*80)
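
The "build baselines" idea from the detection strategy can be sketched as a lookup of (host, process) pairs seen during a known-clean period. This is only an illustration under stated assumptions: the host names, process lists, and the `unseen_on_host` column are invented, and a real baseline would need a much longer observation window and some tolerance for rare but legitimate tools.

```python
import pandas as pd

# Hypothetical clean-period events used to learn what is normal per host.
baseline = pd.DataFrame({
    "hostname": ["SRV-01", "SRV-01", "DC-01"],
    "process_name": ["sqlservr.exe", "powershell.exe", "lsass.exe"],
})

# Hypothetical new events to score against the baseline.
new_events = pd.DataFrame({
    "hostname": ["SRV-01", "SRV-01", "DC-01"],
    "process_name": ["powershell.exe", "procdump.exe", "lsass.exe"],
})

# Set of (host, process) pairs observed during the baseline period.
known_pairs = set(zip(baseline["hostname"], baseline["process_name"]))

# Flag events whose process was never seen on that host before.
new_events["unseen_on_host"] = [
    (host, proc) not in known_pairs
    for host, proc in zip(new_events["hostname"], new_events["process_name"])
]
print(new_events)
```

A flag like this is a feature, not a verdict: it is most useful when combined with the off-hours and command-line signals listed above.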

In [187]:
# MITRE ATT&CK technique visualization
technique_counts = primary_attack['attack_type'].value_counts()

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(range(len(technique_counts)), technique_counts.values, color=['#d62728', '#ff7f0e', '#2ca02c', '#9467bd'])
ax.set_xticks(range(len(technique_counts)))
ax.set_xticklabels(technique_counts.index, rotation=0)
ax.set_ylabel('Number of Events')
ax.set_title(f'Attack Chain: MITRE ATT&CK Technique Distribution ({primary_attack_id})', 
             fontsize=14, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

# Add count labels on bars
for bar in bars:
    height = bar.get_height()
    ax.text(bar.get_x() + bar.get_width()/2., height,
            f'{int(height)}',
            ha='center', va='bottom', fontweight='bold')

# Add technique descriptions (only for techniques present in the plotted data)
technique_names = {
    't1059.001': 'Execution (PowerShell)',
    't1087.002': 'Account Discovery',
    't1003.001': 'Credential Dumping',
    't1021.001': 'Lateral Movement (RDP)',
    't1041': 'Exfiltration'
}

plt.figtext(0.5, -0.065, 
            '\n'.join([f'{k}: {technique_names.get(k, "Unknown")}' for k in technique_counts.index]),
            ha='center', fontsize=9, style='italic')

plt.tight_layout()
plt.show()

In [188]:
# Attack stage progression
stage_counts = primary_attack['stage_number'].astype(int).value_counts().sort_index()

fig, ax = plt.subplots(figsize=(14, 6))
ax.plot(stage_counts.index, stage_counts.values, marker='o', linewidth=2, markersize=8, color='crimson')
ax.fill_between(stage_counts.index, stage_counts.values, alpha=0.3, color='crimson')
ax.set_xlabel('Attack Stage', fontsize=12)
ax.set_ylabel('Number of Events', fontsize=12)
ax.set_title(f'Attack Progression: Events per Stage ({primary_attack_id})', 
             fontsize=14, fontweight='bold')
ax.grid(True, alpha=0.3)
ax.set_xticks(stage_counts.index)

# Annotate major stages
for idx, val in stage_counts.items():
    if val > 10:  # Annotate stages with >10 events
        ax.annotate(f'{val}', xy=(idx, val), xytext=(0, 10), 
                   textcoords='offset points', ha='center', fontweight='bold')

plt.tight_layout()
plt.show()

print(f"\nStage Distribution Summary:")
print(f"Total Stages: {len(stage_counts)}")
print(f"Average Events per Stage: {stage_counts.mean():.1f}")
print(f"Max Events in Single Stage: {stage_counts.max()} (Stage {stage_counts.idxmax()})")

### Attack Kill Chain Summary

The embedded attack follows a realistic MITRE ATT&CK kill chain:

Stage 0-3: Initial Access & Execution (t1059.001)
- PowerShell execution with `-ExecutionPolicy Bypass` and `-WindowStyle Hidden`
- Malicious script execution from compromised user context (linda.davis381)
- Initial system reconnaissance and environment profiling
- Service account compromise (escalation to svc_adconnect)

Stage 4-6: Discovery (t1087.002)  
- Domain admin enumeration: `net group "Domain Admins" /domain`
- Active Directory user discovery with PowerShell `Get-ADUser`
- Account privilege mapping across domain infrastructure

Stage 7-10: Lateral Movement (t1021.001)
- RDP connections targeting domain controllers (DC-01, DC-02)
- PowerShell remoting sessions with compromised credentials
- Remote Desktop Protocol (rdpclip.exe) sessions established
- Multiple lateral movement attempts 
- Targeting of critical infrastructure (domain controllers)

Stage 11-15: Exfiltration (t1041)
- Sensitive data collection (NTDS.dit, credentials, system files)
- Multiple exfiltration channels for redundancy:
    - PowerShell `Invoke-WebRequest` (port 8080)
    - `certutil` file transfer (port 443)
    - `curl` POST requests (port 9090)
- C2 communication to external attacker infrastructure (203.0.113.70)
- Password-protected archives to evade DLP detection
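
The stage-to-technique mapping described above can be checked directly from the labels. The sketch below runs on a small invented frame whose stage numbers follow the ranges listed in this summary. On the real data the same aggregation would be applied to `primary_attack`, and the resulting ranges may differ from this toy example.

```python
import pandas as pd

# Toy labeled events; stage numbers are chosen to mirror the summary above.
toy_attack = pd.DataFrame({
    "attack_type": ["t1059.001", "t1059.001", "t1087.002",
                    "t1021.001", "t1041", "t1041"],
    "stage_number": [0.0, 3.0, 4.0, 7.0, 11.0, 15.0],
})

# First stage, last stage, and event count per technique, in kill-chain order.
stage_summary = (
    toy_attack.assign(stage=toy_attack["stage_number"].astype(int))
    .groupby("attack_type")["stage"]
    .agg(first_stage="min", last_stage="max", events="count")
    .sort_values("first_stage")
)
print(stage_summary)
```

Overlapping stage ranges between two techniques in the output would indicate that the phases are interleaved rather than strictly sequential.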

---


## 5. User Behavior Profiles - Role-Based Detection

Comparing compromised vs benign users to understand detection challenges.

In [189]:
# Identify the compromised user
compromised_user = primary_attack['user'].mode()[0]
compromised_user_df = df[df['user'] == compromised_user].copy()

# Get a benign user for comparison (one with significant activity)
benign_users = df[df['attack_id'].isnull()]['user'].value_counts()
# Fixed comparison user; benign_users.index[0] would instead pick the most active benign user
benign_user = "thomas.anderson128"
benign_user_df = df[df['user'] == benign_user].copy()

print("="*80)
print("USER BEHAVIOR COMPARISON")
print("="*80)

print(f"\n👤 Compromised User: {compromised_user}")
print(f"   Total Events:          {len(compromised_user_df):,}")
# attack_id is null (NaN) for benign events, so test for nulls rather than the string 'NA'
attack_count = len(compromised_user_df[compromised_user_df['attack_id'].notnull()])
benign_count = len(compromised_user_df[compromised_user_df['attack_id'].isnull()])
print(f"   Attack Events:         {attack_count:,} ({attack_count/len(compromised_user_df)*100:.1f}%)")
print(f"   Benign Events:         {benign_count:,} ({benign_count/len(compromised_user_df)*100:.1f}%)")
print(f"   Primary Hostname:      {compromised_user_df['hostname'].mode()[0] if len(compromised_user_df) > 0 else 'N/A'}")
print(f"   Location:              {compromised_user_df['location'].mode()[0] if len(compromised_user_df) > 0 else 'N/A'}")

print(f"\n Benign User: {benign_user}")
print(f"   Total Events:          {len(benign_user_df):,}")
print(f"   Attack Events:         0 (0.0%)")
print(f"   Benign Events:         {len(benign_user_df):,} (100.0%)")
print(f"   Primary Hostname:      {benign_user_df['hostname'].mode()[0] if len(benign_user_df) > 0 else 'N/A'}")
print(f"   Location:              {benign_user_df['location'].mode()[0] if len(benign_user_df) > 0 else 'N/A'}")

print("\n" + "="*80)

In [190]:
# Process comparison between users
comp_processes = compromised_user_df['process_name'].value_counts().head(15)
benign_processes = benign_user_df['process_name'].value_counts().head(15)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Compromised user
comp_processes.plot(kind='barh', ax=ax1, color='crimson', alpha=0.7)
ax1.set_title(f'Top Processes: {compromised_user} (COMPROMISED)', fontsize=12, fontweight='bold')
ax1.set_xlabel('Number of Events')
ax1.invert_yaxis()

# Benign user  
benign_processes.plot(kind='barh', ax=ax2, color='green', alpha=0.7)
ax2.set_title(f'Top Processes: {benign_user} (BENIGN)', fontsize=12, fontweight='bold')
ax2.set_xlabel('Number of Events')
ax2.invert_yaxis()

plt.tight_layout()
plt.show()

# Process overlap analysis
comp_proc_set = set(compromised_user_df['process_name'].unique())
benign_proc_set = set(benign_user_df['process_name'].unique())

overlap = comp_proc_set & benign_proc_set
overlap_filtered = [x for x in overlap if isinstance(x, str)]

print(f"\n Process Overlap Analysis:")
print(f"   Compromised User Processes:    {len(comp_proc_set)}")
print(f"   Benign User Processes:         {len(benign_proc_set)}")
print(f"   Overlapping Processes:         {len(overlap)}")
print(f"\n   Shared processes: {sorted(list(overlap_filtered))[:10]}...")

In [191]:
# Hourly activity comparison
comp_user_df_copy = compromised_user_df.copy()
benign_user_df_copy = benign_user_df.copy()

comp_user_df_copy['hour'] = comp_user_df_copy['timestamp_parsed'].dt.hour
benign_user_df_copy['hour'] = benign_user_df_copy['timestamp_parsed'].dt.hour

comp_hourly = comp_user_df_copy['hour'].value_counts().sort_index()
benign_hourly = benign_user_df_copy['hour'].value_counts().sort_index()

fig, ax = plt.subplots(figsize=(14, 6))

x = np.arange(24)
width = 0.35

# Align both to 0-23 hours
comp_hourly_aligned = [comp_hourly.get(i, 0) for i in range(24)]
benign_hourly_aligned = [benign_hourly.get(i, 0) for i in range(24)]

ax.bar(x - width/2, comp_hourly_aligned, width, label=f'{compromised_user} (Compromised)', color='crimson', alpha=0.7)
ax.bar(x + width/2, benign_hourly_aligned, width, label=f'{benign_user} (Benign)', color='green', alpha=0.7)

ax.set_xlabel('Hour of Day')
ax.set_ylabel('Number of Events')
ax.set_title('Activity Patterns: Compromised vs Benign User', fontsize=14, fontweight='bold')
ax.set_xticks(x)
ax.set_xticklabels([f'{h:02d}:00' for h in range(24)], rotation=45)
ax.grid(axis='y', alpha=0.3)

# Highlight off-hours (00:00 - 06:00); drawn before the legend so its label is included
ax.axvspan(-0.5, 5.5, alpha=0.1, color='red', label='Off-hours')
ax.legend()

plt.tight_layout()
plt.show()

print("\n Off-Hours Activity (00:00 - 06:00):")
comp_offhours = sum([comp_hourly_aligned[i] for i in range(6)])
benign_offhours = sum([benign_hourly_aligned[i] for i in range(6)])
print(f"   Compromised User: {comp_offhours} events")
print(f"   Benign User:      {benign_offhours} events")

### Key Detection Insight: Role-Based Baselines

Why Simple Signature Detection Fails:

Many processes appear in BOTH compromised and benign Security analyst activity:
- `powershell.exe` - Used legitimately for security automation AND maliciously for credential dumping
- `net.exe` - Routine domain administration AND attacker reconnaissance
- `whoami.exe`, `systeminfo.exe` - Normal system validation AND reconnaissance
- `mstsc.exe`, `Enter-PSSession` - Legitimate remote administration AND lateral movement

The Challenge:

- linda.davis381 (compromised): 891 events, 11.8% attack rate
- Comparison baseline: Clean Security analysts with similar tool usage patterns
- Process overlap: ~60% of attack tools are legitimate administrative tools
- Role complexity: Security analysts EXPECT to use reconnaissance and admin tools

Detection Must Be Contextual:

1. Temporal patterns - Attack concentrated over 5-day period (Nov 18-22)
2. Command sequences - PowerShell execution → Domain enumeration → RDP lateral movement → Exfiltration
3. Account progression - User account (linda.davis381) → Service account (svc_adconnect)
4. Network behavior - Multiple C2 channels (ports 8080, 443, 9090) to external IP (203.0.113.70)
5. Target selection - Focused reconnaissance of domain controllers (DC-01, DC-02)
6. Multi-stage behavior - Clear kill chain: Initial Access → Discovery (2 days) → Lateral Movement (2 days) → Exfiltration
7. Tool chaining - Legitimate tools used in attack sequences (PowerShell → net.exe → mstsc → curl)

Bottom Line: Models must learn INTENT, SEQUENCE, and CONTEXT, not just tool presence. A Security analyst running `net group "Domain Admins"` is routine; the same command followed by PowerShell remoting to multiple domain controllers and external data transfers is an attack.
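
The point about sequence over tool presence can be illustrated with a simple rule: flag a user only when a discovery tool is followed by a remote-access tool within a short window. This is a sketch under stated assumptions, not the dataset's labeling logic. The user names, timestamps, the one-hour window, and the helper `users_with_discovery_then_remote` are all hypothetical.

```python
import pandas as pd

# Invented events for two analysts who both run net.exe and mstsc.exe.
events = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2024-01-01 10:00", "2024-01-01 10:20", "2024-01-01 11:00",
        "2024-01-01 10:05", "2024-01-02 15:00",
    ]),
    "user": ["analyst_a", "analyst_a", "analyst_a", "analyst_b", "analyst_b"],
    "process_name": ["net.exe", "mstsc.exe", "curl.exe", "net.exe", "mstsc.exe"],
})

DISCOVERY = {"net.exe", "whoami.exe", "systeminfo.exe"}
REMOTE = {"mstsc.exe"}
WINDOW = pd.Timedelta(hours=1)  # assumed window; tune on real data


def users_with_discovery_then_remote(frame, window=WINDOW):
    """Return users with a remote-access event within `window` after a discovery event."""
    flagged = []
    for user, grp in frame.sort_values("timestamp").groupby("user"):
        disc_times = grp.loc[grp["process_name"].isin(DISCOVERY), "timestamp"]
        remote_times = grp.loc[grp["process_name"].isin(REMOTE), "timestamp"]
        hit = any(
            ((remote_times > t) & (remote_times <= t + window)).any()
            for t in disc_times
        )
        if hit:
            flagged.append(user)
    return flagged


flagged_users = users_with_discovery_then_remote(events)
print(flagged_users)
```

Both toy users run the same two tools, but only the one who chains them within the window is flagged, which is the distinction a pure signature rule cannot make.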




---

## 6. Service Account Baseline Analysis

Service accounts generate 30-50% of enterprise logs but are often overlooked in detection.

In [192]:
# Service account analysis
service_accounts_df = df[df['service_account'] == True].copy()

print("="*80)
print("SERVICE ACCOUNT ANALYSIS")
print("="*80)

print(f"\n Service Account Activity:")
print(f"   Total Service Account Events:  {len(service_accounts_df):,} ({len(service_accounts_df)/len(df)*100:.1f}% of all logs)")
print(f"   Unique Service Accounts:       {service_accounts_df['account'].nunique()}")

# Top service accounts
top_service_accounts = service_accounts_df['account'].value_counts().head(10)
print(f"\n🔝 Top 10 Service Accounts by Activity:")
for account, count in top_service_accounts.items():
    print(f"   {account:30s} {count:6,} events")

# Service account abuse in attack
service_in_attack = primary_attack[primary_attack['account'].str.startswith('svc_', na=False)]
if len(service_in_attack) > 0:
    abused_accounts = service_in_attack['account'].value_counts()
    print(f"\n Service Accounts Abused in Attack:")
    for account, count in abused_accounts.items():
        print(f"   {account:30s} {count:6,} attack events")

print("\n" + "="*80)

In [193]:
# Service account hourly distribution (should be relatively uniform - 24/7 operation)
service_accounts_df_copy = service_accounts_df.copy()
service_accounts_df_copy['hour'] = service_accounts_df_copy['timestamp_parsed'].dt.hour
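# Reindex to all 24 hours so the 24 tick labels set below always match the bars
service_hourly = service_accounts_df_copy['hour'].value_counts().sort_index().reindex(range(24), fill_value=0)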

# Compare with human users
human_users_df = df[df['service_account'] == False].copy()
human_users_df['hour'] = human_users_df['timestamp_parsed'].dt.hour
human_hourly = human_users_df['hour'].value_counts().sort_index().reindex(range(24), fill_value=0)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Service accounts
service_hourly.plot(kind='bar', ax=ax1, color='steelblue', alpha=0.7)
ax1.set_title('Service Account Activity (24/7 Pattern)', fontsize=12, fontweight='bold')
ax1.set_xlabel('Hour of Day')
ax1.set_ylabel('Number of Events')
ax1.set_xticklabels([f'{h:02d}:00' for h in range(24)], rotation=45)
ax1.axhline(service_hourly.mean(), color='red', linestyle='--', alpha=0.5, label=f'Mean: {service_hourly.mean():.0f}')
ax1.legend()
ax1.grid(axis='y', alpha=0.3)

# Human users
human_hourly.plot(kind='bar', ax=ax2, color='orange', alpha=0.7)
ax2.set_title('Human User Activity (Business Hours Pattern)', fontsize=12, fontweight='bold')
ax2.set_xlabel('Hour of Day')
ax2.set_ylabel('Number of Events')
ax2.set_xticklabels([f'{h:02d}:00' for h in range(24)], rotation=45)
ax2.axhline(human_hourly.mean(), color='red', linestyle='--', alpha=0.5, label=f'Mean: {human_hourly.mean():.0f}')
ax2.legend()
ax2.grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\n Activity Pattern Analysis:")
print(f"   Service Accounts - Std Dev: {service_hourly.std():.2f} (low = uniform)")
print(f"   Human Users - Std Dev:      {human_hourly.std():.2f} (high = business hours)")
print(f"\n    Service accounts show relatively uniform 24/7 activity")
print(f"    Human users show business hours peak (9 AM - 5 PM)")

In [194]:
df_copy = df.copy()
df_copy['date'] = df_copy['timestamp_parsed'].dt.date
df_copy['is_attack'] = df_copy['attack_id'].notnull()

df_copy.groupby(['date', 'is_attack']).size().unstack(fill_value=0)

### Service Account Detection Challenge

Normal vs Abusive Behavior:

Service accounts like `svc_adconnect` legitimately:
- Run 24/7 on domain controllers (off-hours activity is NORMAL)
- Execute domain synchronization and replication tasks
- Access multiple systems for Active Directory operations
- Use administrative tools (PowerShell, net.exe, WMI) routinely
- Authenticate across domain infrastructure

When compromised (as seen in this attack):
- Abused for domain reconnaissance (`net group "Domain Admins" /domain`)
- Used as pivot point for lateral movement (RDP to DC-01, DC-02)
- Leveraged for credential access (targeting NTDS.dit)
- Blend attack activity with normal AD synchronization traffic
- Exploit trusted service account privileges across domain

Detection requires:
- Behavioral baselines per service account type
- Anomaly detection on command sequences (normal sync vs reconnaissance)
- Correlation with user activity (linda.davis381 → svc_adconnect transition)
- Network behavior analysis (normal DC-to-DC vs DC-to-external)
- Temporal pattern recognition (service account used for interactive sessions)

Key Indicator: Service accounts typically run automated tasks, not interactive sessions. `svc_adconnect` establishing RDP sessions and running manual reconnaissance commands is a strong attack signal.
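
The key indicator above translates into a simple filter: service-account events whose process belongs to an interactive session. The sketch below uses invented rows, and the `INTERACTIVE_PROCESSES` set is an assumption that would need extending for a real environment. Process names such as `sync_agent.exe` and `backup_agent.exe` are placeholders.

```python
import pandas as pd

# Invented events: one service account mixes automated and interactive activity.
events = pd.DataFrame({
    "account": ["svc_adconnect", "svc_adconnect", "svc_backup", "jane.doe"],
    "service_account": [True, True, True, False],
    "process_name": ["sync_agent.exe", "rdpclip.exe", "backup_agent.exe", "mstsc.exe"],
})

# Assumed list of processes that imply an interactive (human-driven) session.
INTERACTIVE_PROCESSES = {"mstsc.exe", "rdpclip.exe"}

is_interactive = events["process_name"].isin(INTERACTIVE_PROCESSES)
suspicious = events[events["service_account"] & is_interactive]
flagged_accounts = sorted(suspicious["account"].unique())
print(flagged_accounts)
```

The human account running `mstsc.exe` is not flagged, since interactive sessions are expected for people; the rule only fires when the account type and the session type disagree.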

In [195]:
# Hourly heatmap of attack activity
attack_df_copy = attack_df.copy()
attack_df_copy['date'] = attack_df_copy['timestamp_parsed'].dt.date
attack_df_copy['hour'] = attack_df_copy['timestamp_parsed'].dt.hour

# Create pivot table for heatmap - reindex to ensure all 24 hours are present
heatmap_data = attack_df_copy.groupby(['date', 'hour']).size().unstack(fill_value=0)
heatmap_data = heatmap_data.reindex(columns=range(24), fill_value=0)

fig, ax = plt.subplots(figsize=(16, 6))
sns.heatmap(heatmap_data, cmap='Reds', annot=True, fmt='d', cbar_kws={'label': 'Number of Attack Events'},
            linewidths=0.5, ax=ax)
ax.set_xlabel('Hour of Day')
ax.set_ylabel('Date')
ax.set_title('Attack Activity Heatmap: When Did the Attack Occur?', fontsize=14, fontweight='bold')
ax.set_xticklabels([f'{h:02d}:00' for h in range(24)])
plt.tight_layout()
plt.show()

print("\n Attack Hotspots (>10 events):")
for date in heatmap_data.index:
    for hour in heatmap_data.columns:
        count = heatmap_data.loc[date, hour]
        if count > 10:
            print(f"   {date} at {hour:02d}:00 - {count} attack events")

---

## 7. Temporal Visualizations - Attack Timeline

Visualizing when the attack occurred and how it progressed over time.

In [196]:
# Attack progression timeline - show stage evolution
attack_timeline = primary_attack.copy()
attack_timeline = attack_timeline.sort_values('timestamp_parsed')
attack_timeline['stage_int'] = attack_timeline['stage_number'].astype(int)
attack_timeline['hours_from_start'] = (attack_timeline['timestamp_parsed'] - attack_timeline['timestamp_parsed'].min()).dt.total_seconds() / 3600

fig, ax = plt.subplots(figsize=(16, 8))

# Color by technique
technique_colors = {
    't1087.002': 'blue',
    't1059.001': 'purple', 
    't1021.001': 'green',
    't1041': 'red'
}

for technique, color in technique_colors.items():
    mask = attack_timeline['attack_type'] == technique
    ax.scatter(attack_timeline[mask]['hours_from_start'], 
              attack_timeline[mask]['stage_int'],
              c=color, label=technique, alpha=0.6, s=100)

ax.set_xlabel('Hours from Attack Start', fontsize=12)
ax.set_ylabel('Attack Stage', fontsize=12)
ax.set_title('Attack Timeline: Stage Progression Over Time', fontsize=14, fontweight='bold')
ax.legend(title='MITRE Technique', bbox_to_anchor=(1.05, 1), loc='upper left')
ax.grid(True, alpha=0.3)

# Add vertical lines for day boundaries
for day in range(1, int(attack_timeline['hours_from_start'].max() / 24) + 1):
    ax.axvline(day * 24, color='gray', linestyle='--', alpha=0.3)
    ax.text(day * 24, ax.get_ylim()[1], f'Day {day+1}', ha='center', va='bottom', fontsize=9)

plt.tight_layout()
plt.show()

print(f"\n⏱  Attack Duration Analysis:")
print(f"   Total Duration: {attack_timeline['hours_from_start'].max():.1f} hours ({attack_timeline['hours_from_start'].max()/24:.1f} days)")
print(f"   Stages Covered: {attack_timeline['stage_int'].min()} → {attack_timeline['stage_int'].max()}")

---

## 8. Detection Challenges - Why This Dataset is Hard

Highlighting the realistic complexity that makes this dataset valuable for ML training.

In [197]:
# Process overlap challenge
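# Drop missing process names so the sets hold only strings (sorted() below fails on NaN mixed with str)
attack_processes = set(attack_df['process_name'].dropna().unique())
benign_processes = set(benign_events['process_name'].dropna().unique())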
overlapping_processes = attack_processes & benign_processes

print("="*80)
print("DETECTION CHALLENGES")
print("="*80)

print(f"\n Challenge 1: Process Overlap")
print(f"   Processes used in attacks:        {len(attack_processes)}")
print(f"   Processes used benignly:          {len(benign_processes)}")
print(f"   Overlapping (both contexts):      {len(overlapping_processes)}")
print(f"   Overlap percentage:               {len(overlapping_processes)/len(attack_processes)*100:.1f}%")
print(f"\n   Common dual-use processes: {sorted(list(overlapping_processes))[:15]}")

# Analyze PowerShell usage as an example
if 'powershell.exe' in overlapping_processes:
    ps_attack = len(attack_df[attack_df['process_name'] == 'powershell.exe'])
    ps_benign = len(benign_events[benign_events['process_name'] == 'powershell.exe'])
    ps_total = ps_attack + ps_benign
    print(f"\n   Example: PowerShell Usage")
    print(f"   • Attack events:   {ps_attack:,} ({ps_attack/ps_total*100:.1f}%)")
    print(f"   • Benign events:   {ps_benign:,} ({ps_benign/ps_total*100:.1f}%)")
    print(f"   ‚Ä¢ Simple signature detection would flag {ps_total:,} events with {ps_benign:,} false positives!")

print("\n" + "="*80)

In [198]:
# Attack-to-benign ratio analysis
print("\n Challenge 2: Class Imbalance (Needle in Haystack)\n")

attack_ratio = len(attack_events) / len(df)
print(f"   Attack events:        {len(attack_events):,}")
print(f"   Benign events:        {len(benign_events):,}")
print(f"   Attack ratio:         {attack_ratio*100:.2f}%")
print(f"   Ratio:                1 attack : {len(benign_events)/len(attack_events):.1f} benign")
print(f"\n   This mirrors real enterprise logs, where attack events are a small minority of all events.")
print(f"   Perfect for training ML models on imbalanced data.")

# Visualize the imbalance
fig, ax = plt.subplots(figsize=(8, 8))
sizes = [len(benign_events), len(attack_events)]
labels = [f'Benign\n{len(benign_events):,} events\n({len(benign_events)/len(df)*100:.3f}%)',
          f'Attack\n{len(attack_events):,} events\n({len(attack_events)/len(df)*100:.3f}%)']
colors = ['green', 'red']
explode = (0, 0.1)  # Explode attack slice

ax.pie(sizes, explode=explode, labels=labels, colors=colors, autopct='',
       shadow=True, startangle=90, textprops={'fontsize': 12, 'weight': 'bold'})
ax.set_title('Dataset Composition: Attack vs Benign Events', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

In [199]:
# Temporal patterns challenge
print("\n Challenge 3: Temporal Context Matters\n")

# Get compromised user's off-hours activity
comp_user_attack = compromised_user_df[compromised_user_df['attack_id'].notnull()].copy()
comp_user_attack['hour'] = comp_user_attack['timestamp_parsed'].dt.hour

offhours_attack = len(comp_user_attack[comp_user_attack['hour'].isin(range(0, 6))])
total_attack = len(comp_user_attack)

print(f"   Off-hours attack activity (00:00-06:00): {offhours_attack} / {total_attack} events ({offhours_attack/total_attack*100:.1f}%)")
print(f"\n   Detection insight:")
print(f"   • Same process (e.g., mstsc.exe) is benign at 2 PM, suspicious at 2 AM")
print(f"   • Time-of-day is a critical feature for anomaly detection")
print(f"   • But service accounts operate 24/7 - can't use simple time-based rules!")

### Summary of Detection Challenges

This dataset intentionally includes realistic complexity:

1. **Process Overlap**: ~50% of attack processes are also used benignly
   - Forces models to learn behavioral context, not just signatures

2. **Class Imbalance**: <1% attack traffic (realistic enterprise ratio)
   - Requires handling imbalanced data (SMOTE, class weights, etc.)

3. **Temporal Context**: Time-of-day matters differently for users vs services
   - Off-hours = suspicious for humans, normal for service accounts

4. **Role-Based Behavior**: Security analysts legitimately use "attacker tools"
   - PowerShell, net.exe, reg.exe are both benign and malicious

5. **Service Account Abuse**: High-privilege accounts used as pivot points
   - svc_backup doing credential dumps blends with normal backup activity

**These challenges make the dataset valuable for training robust ML models!**
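
As one concrete reading of the class-imbalance point above, the sketch below computes inverse-frequency ("balanced") class weights with the heuristic `n_samples / (n_classes * class_count)`, which is the same formula scikit-learn documents for `class_weight='balanced'`. The helper name and the label counts are illustrative assumptions, not values taken from this dataset.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * cnt) for cls, cnt in counts.items()}

# Illustrative labels only (0 = benign, 1 = attack); not the dataset's real counts.
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)
print(weights)
```

The rare class receives a much larger weight, so misclassifying an attack event costs the model more than misclassifying a benign one.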

---

## 9. Feature Engineering Ideas

Examples of features that could be extracted for ML models.

In [200]:
# Feature engineering examples
print("="*80)
print("FEATURE ENGINEERING IDEAS FOR ML MODELS")
print("="*80)

# Create sample features
df_features = df.copy()
df_features['hour'] = df_features['timestamp_parsed'].dt.hour
df_features['day_of_week'] = df_features['timestamp_parsed'].dt.dayofweek
df_features['is_weekend'] = df_features['day_of_week'].isin([5, 6])
df_features['is_offhours'] = df_features['hour'].isin(range(0, 6)) | df_features['hour'].isin(range(22, 24))
df_features['is_service_account'] = df_features['service_account'] == True

# Calculate user activity volume features
user_event_counts = df_features.groupby('user').size()
df_features['user_total_events'] = df_features['user'].map(user_event_counts)

# Process frequency features
process_counts = df_features['process_name'].value_counts()
df_features['process_frequency'] = df_features['process_name'].map(process_counts)
df_features['is_rare_process'] = df_features['process_frequency'] < 10

print("\n Sample Feature Categories:\n")

print("1. Temporal Features:")
print("   • hour (0-23)")
print("   • day_of_week (0-6)")
print("   • is_weekend (boolean)")
print("   • is_offhours (boolean)")

print("\n2. User/Account Features:")
print("   • is_service_account (boolean)")
print("   • user_total_events (volume)")
print("   • department (if available)")
print("   • location")

print("\n3. Process Features:")
print("   • process_name (categorical)")
print("   • process_frequency (how common)")
print("   • is_rare_process (boolean)")
print("   • parent_process (categorical)")

print("\n4. Network Features:")
print("   • destination_ip (external vs internal)")
print("   • destination_port")
print("   • protocol")

print("\n5. Event Features:")
print("   • event_type (categorical)")
print("   • success (boolean)")
print("   • command_line_length")
print("   • has_error (boolean)")

print("\n6. Sequence Features (for RNNs/Transformers):")
print("   • Previous N events for user")
print("   • Time since last event")
print("   • Event sequence patterns")

print("\n7. Aggregation Features (rolling windows):")
print("   • Events per user in last hour")
print("   • Unique processes in last hour")
print("   • Failed auth attempts in last hour")

print("\n" + "="*80)
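
The sequence and rolling-window ideas listed above ("Time since last event", "Events per user in last hour") can be sketched as follows. This is a minimal plain-Python version that works on a list of event dicts so it runs without the notebook's `df`; the function name, the sample users, and the timestamps are hypothetical, and only the `user` and `timestamp_parsed` field names follow the notebook.

```python
from collections import defaultdict, deque
from datetime import datetime, timedelta

def add_sequence_features(events, window=timedelta(hours=1)):
    """Annotate time-sorted events with per-user gap and rolling-count features."""
    last_seen = {}                # user -> timestamp of that user's previous event
    recent = defaultdict(deque)   # user -> timestamps still inside the window
    out = []
    for ev in sorted(events, key=lambda e: e["timestamp_parsed"]):
        user, ts = ev["user"], ev["timestamp_parsed"]
        prev = last_seen.get(user)
        gap = (ts - prev).total_seconds() if prev is not None else None
        q = recent[user]
        while q and ts - q[0] > window:   # drop events older than the window
            q.popleft()
        q.append(ts)                      # the count includes the current event
        out.append({**ev, "secs_since_last": gap, "events_last_hour": len(q)})
        last_seen[user] = ts
    return out

# Hypothetical events for illustration only.
events = [
    {"user": "alice", "timestamp_parsed": datetime(2024, 1, 1, 9, 0)},
    {"user": "svc_example", "timestamp_parsed": datetime(2024, 1, 1, 9, 10)},
    {"user": "alice", "timestamp_parsed": datetime(2024, 1, 1, 9, 30)},
    {"user": "alice", "timestamp_parsed": datetime(2024, 1, 1, 10, 45)},
]
featured = add_sequence_features(events)
for row in featured:
    print(row["user"], row["secs_since_last"], row["events_last_hour"])
```

Both features only look backwards in time, which avoids leaking future activity into the feature value of an earlier event.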

In [201]:
# Show feature importance of simple features
print("\n Feature Analysis: Correlation with Attack Events\n")

# Create target variable
df_features['is_attack'] = (df_features['attack_id'].notnull()).astype(int)

# Calculate attack rates for key features
print("Off-hours activity:")
offhours_attack_rate = df_features[df_features['is_offhours']]['is_attack'].mean()
regular_attack_rate = df_features[~df_features['is_offhours']]['is_attack'].mean()
print(f"   Attack rate during off-hours:  {offhours_attack_rate*100:.2f}%")
print(f"   Attack rate during reg hours:  {regular_attack_rate*100:.2f}%")
print(f"   Lift:                          {offhours_attack_rate/regular_attack_rate:.2f}x")

print("\nService accounts:")
service_attack_rate = df_features[df_features['is_service_account']]['is_attack'].mean()
human_attack_rate = df_features[~df_features['is_service_account']]['is_attack'].mean()
print(f"   Attack rate for service accts: {service_attack_rate*100:.2f}%")
print(f"   Attack rate for human users:   {human_attack_rate*100:.2f}%")
print(f"   Lift (human vs service):       {human_attack_rate/service_attack_rate:.2f}x")

print("\nRare processes:")
rare_attack_rate = df_features[df_features['is_rare_process']]['is_attack'].mean()
common_attack_rate = df_features[~df_features['is_rare_process']]['is_attack'].mean()
print(f"   Attack rate for rare procs:    {rare_attack_rate*100:.2f}%")
print(f"   Attack rate for common procs:  {common_attack_rate*100:.2f}%")
# Guard the denominator: lift is undefined when common processes have no attack events
if common_attack_rate > 0:
    print(f"   Lift:                          {rare_attack_rate/common_attack_rate:.2f}x")
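
The three lift calculations above follow one pattern: the attack rate where a flag is set, divided by the attack rate where it is not. A minimal sketch of that pattern as a reusable helper is shown here; the function name and the toy inputs are assumptions for illustration, and the helper returns `None` instead of dividing by zero.

```python
def attack_rate_lift(flags, labels):
    """Return (rate_when_flag, rate_when_not_flag, lift); lift is None if undefined."""
    pos = [y for f, y in zip(flags, labels) if f]
    neg = [y for f, y in zip(flags, labels) if not f]
    rate_pos = sum(pos) / len(pos) if pos else None
    rate_neg = sum(neg) / len(neg) if neg else None
    if rate_pos is None or not rate_neg:   # empty group or zero baseline rate
        return rate_pos, rate_neg, None
    return rate_pos, rate_neg, rate_pos / rate_neg

# Toy data: flag = is_offhours, label = is_attack (illustrative values only).
flags = [True, True, False, False, False, False]
labels = [1, 0, 1, 0, 0, 0]
result = attack_rate_lift(flags, labels)
print(result)

# Baseline group with no attacks: lift is undefined rather than infinite.
undefined = attack_rate_lift([True, False], [1, 0])
print(undefined)
```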

---

## 10. Ground Truth Label Quality

Validating that labels are consistent and suitable for supervised learning.

In [202]:
# Label quality checks
print("="*80)
print("GROUND TRUTH LABEL QUALITY VALIDATION")
print("="*80)

print("\n Label Completeness:")
# pd.read_csv parses the literal string "NA" as NaN by default, so benign rows
# show up as missing labels; only attack rows carry non-null label values.
n_events = len(df)
for col in ['attack_id', 'attack_type', 'stage_number']:
    n_labeled = df[col].notna().sum()
    print(f"   Non-null {col:<13} {n_labeled:,} / {n_events:,} ({n_labeled / n_events * 100:.2f}%)")

print("\n Label Consistency:")
# Attack rows (attack_id present) should always carry a technique label
is_attack_row = df['attack_id'].notna()
attack_with_type = df[is_attack_row & df['attack_type'].notna()]
attack_without_type = df[is_attack_row & df['attack_type'].isna()]
print(f"   Attack events with technique:  {len(attack_with_type):,} / {len(attack_events):,}")
print(f"   Attack events missing tech:    {len(attack_without_type):,}")

# Benign rows should have all three label columns empty ("NA" in the raw file)
benign_with_na = df[df['attack_id'].isna() & df['attack_type'].isna() & df['stage_number'].isna()]
print(f"   Benign events with NA labels:  {len(benign_with_na):,} / {len(benign_events):,}")

print("\n MITRE ATT&CK Technique Mapping:")
techniques = attack_df['attack_type'].unique()
print(f"   Unique techniques in dataset:  {len(techniques)}")
for tech in sorted(techniques):
    count = len(attack_df[attack_df['attack_type'] == tech])
    print(f"   • {tech}: {count:,} events")

print("\n Stage Number Validation:")
stages = attack_df['stage_number'].astype(int).unique()
print(f"   Unique stages:                 {len(stages)}")
print(f"   Stage range:                   {min(stages)} - {max(stages)}")
print(f"   Sequential stages:             {sorted(stages)}")

# Check for gaps in stage progression
expected_stages = set(range(min(stages), max(stages) + 1))
actual_stages = set(stages)
missing_stages = expected_stages - actual_stages
if missing_stages:
    print(f"   ⚠️  Missing stages:              {missing_stages}")
else:
    print(f"    No gaps in stage sequence")

print("\n" + "="*80)
print("\n LABEL QUALITY: EXCELLENT")
print("   • All events have complete labels")
print("   • Attack/benign distinction is clear (attack_id == 'NA' vs != 'NA')")
print("   • MITRE techniques are properly mapped")
print("   • Stage progression is sequential")
print("   • Ready for supervised learning!")
print("="*80)
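
The checks in this section can also be written as explicit invariants that report problems instead of printing a fixed verdict. The sketch below is one way to do that on plain event dicts; it assumes missing labels are represented as `None`, and the function name, the attack ID, and the sample rows are hypothetical.

```python
def check_labels(events):
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []
    label_fields = ("attack_id", "attack_type", "stage_number")
    stages = set()
    for i, ev in enumerate(events):
        present = [ev.get(f) is not None for f in label_fields]
        # A row must be fully labeled (attack) or fully unlabeled (benign).
        if any(present) and not all(present):
            problems.append(f"event {i}: partially labeled")
        if all(present):
            stages.add(int(ev["stage_number"]))
    if stages:
        # Stage numbers should form a contiguous range with no gaps.
        missing = set(range(min(stages), max(stages) + 1)) - stages
        if missing:
            problems.append(f"missing stages: {sorted(missing)}")
    return problems

# Hypothetical rows: one benign, two complete attack rows, one partially labeled row.
events = [
    {"attack_id": None, "attack_type": None, "stage_number": None},
    {"attack_id": "ATK_EXAMPLE", "attack_type": "t1059.001", "stage_number": 0},
    {"attack_id": "ATK_EXAMPLE", "attack_type": None, "stage_number": 1},
    {"attack_id": "ATK_EXAMPLE", "attack_type": "t1041", "stage_number": 2},
]
problems = check_labels(events)
print(problems)
```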

---

## 11. Key Insights Summary

### Dataset Strengths

Realistic Complexity: Attack events blend with benign activity (~60% process overlap)  
Role-Based Profiles: Security analysts vs Finance users have different behavioral baselines  
Service Account Coverage: 50+ service accounts with 24/7 activity patterns  
Multi-Day Campaigns: Attack spans 5 days (Nov 18-22) with realistic APT dwell time  
Ground Truth Labels: Every event labeled with attack_id, MITRE technique, stage_number  
Realistic Imbalance: <1% attack ratio (24 attack events in ~55,000 total) - realistic for enterprise environments  
Infrastructure Realism: Attacks target actual infrastructure (DC-01, DC-02, domain controllers)  

### Detection Challenges (Why This Dataset is Valuable)

**Process Overlap:** PowerShell, net.exe, mstsc used in both benign and attack contexts  
**Temporal Context:** 5-day attack window mimics slow APT reconnaissance (not instant breach)  
**Role Deviations:** Security analysts legitimately use reconnaissance tools daily  
**Service Abuse:** svc_adconnect doing domain enumeration blends with normal AD operations  
**Account Progression:** User account → Service account escalation (linda.davis381 → svc_adconnect)  
**Target Selection:** Domain controller targeting appears as normal administration  
**Multi-Channel Exfiltration:** C2 communication uses multiple ports (8080, 443, 9090) to evade detection  

### Machine Learning Applications

Supervised Learning Tasks:
- Event-level classification (attack vs benign)
- Session-level detection (compromised sessions)
- User-level profiling (compromised user: linda.davis381)
- Technique prediction (MITRE ATT&CK classification: t1059.001, t1087.002, t1021.001, t1041)
- Attack stage identification (stages 0-15 across the kill chain)

Unsupervised Learning Tasks:
- Anomaly detection (outlier events in Security analyst behavior)
- Clustering (user/service account behavioral groups)
- Sequence modeling (kill chain progression: Initial Access → Discovery → Lateral Movement → Exfiltration)
- Behavioral baseline creation (per-role, per-account)
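
To illustrate the per-account baseline idea in the list above, here is a minimal sketch that scores how surprising a process is for a given user, using add-alpha smoothed frequencies from a baseline period. The function names, the smoothing choice, and the sample events are assumptions for illustration; they are not the method used to build this dataset.

```python
import math
from collections import Counter, defaultdict

def fit_baseline(events):
    """Count how often each user ran each process during the baseline period."""
    counts = defaultdict(Counter)
    for ev in events:
        counts[ev["user"]][ev["process_name"]] += 1
    return counts

def surprise(counts, vocab_size, user, process, alpha=1.0):
    """Negative log-probability of `process` for `user` with add-alpha smoothing."""
    user_counts = counts.get(user, Counter())
    total = sum(user_counts.values())
    p = (user_counts[process] + alpha) / (total + alpha * vocab_size)
    return -math.log(p)

# Hypothetical baseline: a finance user who mostly runs office applications.
baseline = ([{"user": "alice", "process_name": "excel.exe"}] * 8
            + [{"user": "alice", "process_name": "outlook.exe"}] * 2)
counts = fit_baseline(baseline)
vocab_size = 3  # excel.exe, outlook.exe, powershell.exe

score_usual = surprise(counts, vocab_size, "alice", "excel.exe")
score_unusual = surprise(counts, vocab_size, "alice", "powershell.exe")
print(round(score_usual, 3), round(score_unusual, 3))
```

A process the user has never run scores higher than a routine one, while the same process could score low for a security analyst whose baseline includes it.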

Feature Engineering:
- Temporal features (5-day attack window, stage progression timing)
- User features (role: Security, account transitions, privilege escalation)
- Process features (command sequences, parent-child relationships)
- Network features (C2 infrastructure: 203.0.113.70, multi-port exfiltration)
- Sequence features (stage_number, technique progression, kill chain context)
- Target features (domain controller focus, infrastructure targeting)
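
For the network features above, a simple internal-versus-external flag can be derived with the standard-library `ipaddress` module. The sketch below checks membership in the RFC 1918 ranges explicitly; treating exactly these ranges as "internal" is an assumption about the simulated network. An explicit list is used because `ip_address(...).is_private` is documented to cover IANA special-purpose ranges more broadly, which in current CPython versions includes the 203.0.113.0/24 documentation range that the C2 address 203.0.113.70 belongs to.

```python
import ipaddress

# Assumed definition of "internal": the RFC 1918 private IPv4 ranges.
INTERNAL_NETS = [
    ipaddress.ip_network(n)
    for n in ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")
]

def is_internal(ip):
    """Return True/False for valid addresses, None if the value cannot be parsed."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        return None
    return any(addr in net for net in INTERNAL_NETS)

for ip in ("10.1.2.3", "203.0.113.70", "not-an-ip"):
    print(ip, is_internal(ip))
```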

---

## 12. Next Steps

### Recommended Analyses

1. Baseline Detection Models:
   - Random Forest classifier for event-level detection
   - LSTM/Transformer models for kill chain sequence prediction
   - Isolation Forest for behavioral anomaly detection
   - Graph Neural Networks for lateral movement path analysis
   
2. Deep Dive Investigations:
   - Reconstruct the full 5-day attack timeline for ATK_63070 (linda.davis381)
   - Analyze service account privilege escalation (user → svc_adconnect)
   - Study command-line sequences (benign admin vs attack reconnaissance)
   - Investigate multi-channel exfiltration patterns (ports 8080, 443, 9090)
   
3. Advanced Modeling:
   - Transformers for multi-stage attack prediction (16 stages)
   - Temporal graph networks for user-account-host relationships
   - Time series anomaly detection for C2 beaconing patterns
   - Role-based behavioral profiling (Security vs Finance vs IT)
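
Sequence models such as the LSTM and Transformer options above need fixed-length inputs. One minimal way to prepare them, sketched here under the assumption that each event has `user`, `timestamp`, and `event_type` fields, is to slide a window of the `n` most recent event types over each user's time-ordered history. The function name, event type values, and timestamps are hypothetical.

```python
from collections import defaultdict

def user_event_windows(events, n=3):
    """Return (user, window) pairs, where window holds n consecutive event types."""
    by_user = defaultdict(list)
    # ISO 8601 strings in a single format sort chronologically as plain strings.
    for ev in sorted(events, key=lambda e: e["timestamp"]):
        by_user[ev["user"]].append(ev["event_type"])
    windows = []
    for user, seq in by_user.items():
        for end in range(n, len(seq) + 1):
            windows.append((user, tuple(seq[end - n:end])))
    return windows

# Hypothetical events for illustration only.
events = [
    {"user": "alice", "timestamp": "2024-01-01T09:00:00", "event_type": "logon"},
    {"user": "bob", "timestamp": "2024-01-01T09:01:00", "event_type": "logon"},
    {"user": "alice", "timestamp": "2024-01-01T09:02:00", "event_type": "process_start"},
    {"user": "alice", "timestamp": "2024-01-01T09:03:00", "event_type": "network_connection"},
    {"user": "bob", "timestamp": "2024-01-01T09:04:00", "event_type": "logoff"},
    {"user": "alice", "timestamp": "2024-01-01T09:05:00", "event_type": "file_access"},
]
windows = user_event_windows(events, n=3)
for user, window in windows:
    print(user, window)
```

Users with fewer than `n` events produce no windows; padding short histories is a possible alternative if those users should still be scored.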

### Dataset Usage

For ML Engineers/Data Scientists:
- Use ground truth labels (attack_type, stage_number) for supervised learning
- Experiment with role-based feature engineering (department, task_category)
- Test model robustness against realistic class imbalance (<1% attack rate)
- Build sequence models using kill chain progression (Initial Access → Discovery → Lateral Movement → Exfiltration)
- Evaluate detection latency across 5-day attack window
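
When testing against this level of class imbalance, accuracy alone is misleading. The sketch below uses the approximate figures quoted in this notebook (24 attack events in roughly 55,000 total) to show that a model predicting "benign" for everything scores very high accuracy with zero recall. The function name is hypothetical and the total of exactly 55,000 is an assumption based on the rounded figure.

```python
def imbalance_report(y_true, y_pred):
    """Compute accuracy, precision, and recall for binary labels (1 = attack)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }

# Approximate class counts from this notebook: 24 attacks in ~55,000 events.
y_true = [1] * 24 + [0] * (55_000 - 24)
always_benign = [0] * len(y_true)

report = imbalance_report(y_true, always_benign)
print(report)
```

Precision, recall, and related metrics such as PR-AUC are therefore more informative than accuracy for comparing detectors on this dataset.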

For Threat Researchers:
- Study APT behavior patterns (5-day dwell time, slow reconnaissance)
- Analyze MITRE ATT&CK technique combinations in real attack sequences
- Investigate service account abuse (svc_adconnect compromise)
- Research C2 infrastructure patterns (multi-port exfiltration)
- Examine attacker operational security (blending with normal admin activity)

For SOC Analysts:
- Practice incident investigation on ATK_63070 (Security analyst compromise)
- Learn to identify subtle behavioral anomalies in privileged user activity
- Understand account progression indicators (linda.davis381 → svc_adconnect)
- Recognize domain controller targeting patterns (DC-01, DC-02)
- Distinguish legitimate Security analyst activity from attack reconnaissance

For Academics:
- Research role-based behavioral modeling
- Study attack sequence prediction and kill chain reconstruction
- Investigate class imbalance handling in cybersecurity ML
- Explore explainable AI for security event classification
- Publish findings on real-world attack simulation quality

For SOC Automation Teams:
- Build SOAR playbooks for kill chain stages (0-15)
- Test automated response triggers for service account anomalies
- Develop correlation rules for multi-stage attacks
- Benchmark detection rule coverage against MITRE ATT&CK (t1059.001, t1087.002, t1021.001, t1041)
- Measure MTTD (Mean Time to Detect) across 5-day attack window
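
Time to detect for a single attack chain can be sketched as the gap between the first attack event and the first alert raised at or after it; MTTD is then the mean of that value across incidents. The function name and all timestamps below are hypothetical and are not taken from the dataset.

```python
from datetime import datetime, timedelta

def detection_latency(attack_timestamps, alert_timestamps):
    """Time from the first attack event to the first later alert; None if undetected."""
    if not attack_timestamps:
        return None
    start = min(attack_timestamps)
    # Alerts raised before the attack started cannot count as detections of it.
    later = [a for a in alert_timestamps if a >= start]
    return min(later) - start if later else None

# Hypothetical timestamps for illustration only.
attack_times = [datetime(2024, 1, 1, 2, 0), datetime(2024, 1, 2, 3, 0)]
alert_times = [datetime(2023, 12, 31, 23, 0), datetime(2024, 1, 1, 14, 30)]

latency = detection_latency(attack_times, alert_times)
print(latency)
```

With a single embedded attack chain, this yields one latency value per detection rule or model, which can be compared against the length of the 5-day attack window.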

For EDR/SIEM Vendors:
- Test detection rules against realistic APT scenarios
- Measure false positive rates in high-privilege user contexts (Security analysts)
- Benchmark MITRE ATT&CK coverage and alert fidelity
- Validate behavioral analytics for service account abuse
- Assess detection latency for slow-burn attacks (multi-day campaigns)

---

**Project:** Phantom Armor (synthetic cybersecurity log simulator)  
**Author:** Greg Rothman  
**Contact:** gregralr@phantomarmor.com  
**Note:** All data and scenarios are fully synthetic. No real users, systems, or organizations are represented.
