# Advanced Threat Hunting - Microsoft Sentinel Data Lake

This notebook provides advanced threat hunting capabilities using Microsoft Sentinel Data Lake, combining multiple data sources for sophisticated threat detection.

## 🎯 **SIMPLE SETUP: Just Update the Workspace Names!**
This notebook uses a simple manual configuration system. 
**Just update the workspace names in the configuration cell and run!**

## Advanced Use Cases Covered
1. **Command and Control (C2) Detection** - Identify beacon behavior and C2 communications
2. **Living off the Land** - Detect abuse of legitimate tools for malicious purposes
3. **Data Exfiltration Patterns** - Multi-stage data theft detection
4. **Advanced Persistent Threat (APT) Indicators** - Long-term compromise detection
5. **User Behavior Analytics** - Detect anomalous user activities
6. **Behavioral Analytics** - Advanced statistical analysis for threat detection

## Prerequisites ✅
- ✅ **Update workspace names in the configuration cell below** (see the simple setup instructions)
- ✅ **Multiple data sources available** (SigninLogs, DeviceEvents, DeviceNetworkEvents)
- ✅ **Microsoft Sentinel Data Lake enabled** in your environment

## 🚀 **Publication-Ready Features:**
- ✅ **Simple manual configuration** - just update workspace names
- ✅ **Works in any environment** with any workspace names
- ✅ **Adapts to available data** - uses whatever data sources you have
- ✅ **No hardcoded values** - completely portable once configured
- ✅ **Advanced analytics** - sophisticated threat detection algorithms
- ✅ **Clear error handling** - helpful messages if data isn't available

---

In [4]:
# Import advanced libraries for threat hunting
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Import Sentinel Data Lake libraries
from sentinel_lake.providers import MicrosoftSentinelProvider
from pyspark.sql.functions import (
    col, count as spark_count, desc, asc, when, from_json, 
    countDistinct, sum as spark_sum, avg, stddev,
    date_trunc, hour, dayofweek, minute,
    regexp_extract, lower, upper, split, concat,
    to_timestamp, datediff, current_timestamp,
    substring, length,
    expr, lit, coalesce, isnan, isnull,
    collect_list, collect_set, array_contains,
    unix_timestamp, from_unixtime, lag, lead
)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType
from pyspark.sql.window import Window
import matplotlib.pyplot as plt
import seaborn as sns

# Set visualization style
try:
    plt.style.use('seaborn-v0_8')
except OSError:
    # Style name not available in this matplotlib release; fall back
    plt.style.use('default')

# Initialize data provider
data_provider = MicrosoftSentinelProvider(spark)

print("✅ Advanced threat hunting libraries imported")

# 🔄 WORKSPACE CONFIGURATION
# ===========================================
# 🎯 SIMPLE SETUP: Copy the workspace names from your setup notebook!

PRIMARY_WORKSPACE = "ak-SecOps"      # 🏢 Copy your primary workspace name here
ENTRA_WORKSPACE = "default"          # 🔵 Copy your Entra workspace name here

# Analysis configuration
ANALYSIS_HOURS = 24                  # 📅 Hours of data to analyze
SENTINEL_ENVIRONMENT = True          # 🔍 Enable advanced analysis features

# Advanced settings (optional - can leave as defaults)
WORKSPACE_MAPPING = {
    'SigninLogs': PRIMARY_WORKSPACE,
    'DeviceEvents': PRIMARY_WORKSPACE,
    'DeviceProcessEvents': PRIMARY_WORKSPACE,
    'DeviceNetworkEvents': PRIMARY_WORKSPACE,
    'DeviceFileEvents': PRIMARY_WORKSPACE,
    'DeviceInfo': PRIMARY_WORKSPACE,
    'SecurityEvent': PRIMARY_WORKSPACE,
    'CommonSecurityLog': PRIMARY_WORKSPACE,
    'AADNonInteractiveUserSignInLogs': PRIMARY_WORKSPACE,
    'AADServicePrincipalSignInLogs': PRIMARY_WORKSPACE,
    'AADManagedIdentitySignInLogs': PRIMARY_WORKSPACE,
    'AuditLogs': PRIMARY_WORKSPACE,
    'EntraUsers': ENTRA_WORKSPACE,
    'EntraGroups': ENTRA_WORKSPACE,
    'EntraApplications': ENTRA_WORKSPACE,
    'EntraServicePrincipals': ENTRA_WORKSPACE,
    'EntraGroupMemberships': ENTRA_WORKSPACE,
    'EntraMembers': ENTRA_WORKSPACE,
    'EntraOrganizations': ENTRA_WORKSPACE
}

print(f"\n🎯 ADVANCED THREAT HUNTING CONFIGURATION:")
print(f"🏢 Primary workspace: '{PRIMARY_WORKSPACE}'")
print(f"🔵 Entra workspace: '{ENTRA_WORKSPACE}'")
print(f"Analysis window: {ANALYSIS_HOURS} hours")

# Configuration validation
if PRIMARY_WORKSPACE == "YOUR_WORKSPACE_NAME_HERE":
    print(f"\n⚠️  CONFIGURATION NEEDED!")
    print(f"📝 Please update the workspace names above:")
    print(f"   1. Run 01_Setup_and_Configuration.ipynb first")
    print(f"   2. Copy the discovered workspace names")
    print(f"   3. Update PRIMARY_WORKSPACE and ENTRA_WORKSPACE above")
    print(f"   4. Re-run this cell")
elif PRIMARY_WORKSPACE == "test-workspace":
    print(f"\n✅ DEMO MODE: Using test configuration")
    print(f"💡 Configuration system is working correctly!")
    print(f"📝 For real analysis, replace with your actual workspace names")
else:
    print(f"\n✅ Configuration looks good!")
    print(f"Ready for advanced threat hunting in your environment")
    
    # Show workspace mapping
    print(f"\n📊 Table-to-workspace mapping:")
    for table, workspace in WORKSPACE_MAPPING.items():
        print(f"   • {table} → {workspace}")

print(f"\n🎯 ADVANCED THREAT HUNTING READY!")
print(f"🔍 This notebook will perform sophisticated multi-source threat detection")

# Helper function for safe table checking using discovered mapping
def safe_table_check(table_name, workspace_name=None):
    """Safely check table availability using workspace mapping"""
    try:
        # Use workspace mapping from configuration
        if workspace_name is None:
            workspace_name = WORKSPACE_MAPPING.get(table_name, PRIMARY_WORKSPACE)
        
        df = data_provider.read_table(table_name, workspace_name)
        
        # Get basic stats with small sample for performance
        sample_count = df.limit(100).count()
        columns = df.columns
        
        return {
            'available': True,
            'sample_rows': sample_count,
            'total_columns': len(columns),
            'columns': columns[:5],  # Show first 5 columns
            'workspace': workspace_name,
            'error': None
        }
    except Exception as e:
        return {
            'available': False,
            'sample_rows': 0,
            'total_columns': 0,
            'columns': [],
            'workspace': workspace_name,
            'error': str(e)
        }

# Quick verification of key threat hunting data sources
print(f"\n🔍 VERIFYING THREAT HUNTING DATA SOURCES...")
print("=" * 50)

# Test key tables across different data source types
threat_hunting_tables = [
    "SigninLogs",           # Identity data
    "DeviceEvents",         # Endpoint data  
    "DeviceProcessEvents",  # Process data
    "DeviceNetworkEvents",  # Network data
    "SecurityEvent",        # Windows events
    "CommonSecurityLog"     # Network security devices
]

accessible_hunting_tables = []

for table in threat_hunting_tables:
    result = safe_table_check(table)
    if result['available']:
        accessible_hunting_tables.append(table)
        print(f"✅ {table}: {result['sample_rows']} sample rows in '{result['workspace']}'")
    else:
        print(f"❌ {table}: Not accessible ({result['error'][:50]}...)")

print(f"\n📊 THREAT HUNTING DATA AVAILABILITY:")
print(f"   ✅ Accessible data sources: {len(accessible_hunting_tables)}/{len(threat_hunting_tables)}")

if accessible_hunting_tables:
    print(f"   📋 Ready for hunting: {', '.join(accessible_hunting_tables)}")
    print(f"\n🚀 Ready for advanced threat hunting!")
    
    # Categorize available data types
    identity_data = [t for t in accessible_hunting_tables if 'signin' in t.lower()]
    endpoint_data = [t for t in accessible_hunting_tables if 'device' in t.lower()]
    security_data = [t for t in accessible_hunting_tables if 'security' in t.lower()]
    
    print(f"\n📈 DATA SOURCE CATEGORIES:")
    if identity_data:
        print(f"   🔐 Identity: {', '.join(identity_data)}")
    if endpoint_data:
        print(f"   💻 Endpoint: {', '.join(endpoint_data)}")
    if security_data:
        print(f"   🛡️  Security: {', '.join(security_data)}")
        
else:
    print(f"   ⚠️ Limited threat hunting data available")
    print(f"   💡 Advanced threat hunting works best with multiple data sources")
    print(f"   📝 The analysis sections will adapt to available data")

print("\n" + "="*60)


✅ Advanced threat hunting libraries imported

🎯 ADVANCED THREAT HUNTING CONFIGURATION:
🏢 Primary workspace: 'ak-SecOps'
🔵 Entra workspace: 'default'
Analysis window: 24 hours

✅ Configuration looks good!
Ready for advanced threat hunting in your environment

📊 Table-to-workspace mapping:
   • SigninLogs → ak-SecOps
   • DeviceEvents → ak-SecOps
   • DeviceProcessEvents → ak-SecOps
   • DeviceNetworkEvents → ak-SecOps
   • DeviceFileEvents → ak-SecOps
   • DeviceInfo → ak-SecOps
   • SecurityEvent → ak-SecOps
   • CommonSecurityLog → ak-SecOps
   • AADNonInteractiveUserSignInLogs → ak-SecOps
   • AADServicePrincipalSignInLogs → ak-SecOps
   • AADManagedIdentitySignInLogs → ak-SecOps
   • AuditLogs → ak-SecOps
   • EntraUsers → default
   • EntraGroups → default
   • EntraApplications → default
   • EntraServicePrincipals → default
   • EntraGroupMemberships → default
   • EntraMembers → defau

## 1. Command and Control (C2) Beacon Detection

Detect potential C2 beaconing by measuring how regular each device-to-destination connection pattern is across hourly time windows.
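
As a minimal sketch of the statistic behind this kind of detection, the plain-Python example below scores a series of hourly connection counts by its coefficient of variation (sample standard deviation divided by the mean). The function name `beacon_score` and the sample series are hypothetical and exist only for illustration. The thresholds mirror those in the notebook cell, which runs the same idea on Spark DataFrames rather than Python lists.

```python
from statistics import mean, stdev

def beacon_score(hourly_counts, jitter_threshold=0.1):
    """Return (score, jitter) for a list of per-hour connection counts.

    jitter is the coefficient of variation: sample stddev / mean.
    A lower jitter means more regular traffic, which is beacon-like.
    """
    # With fewer than two windows the stddev is undefined; treat as non-beacon
    if len(hourly_counts) < 2 or mean(hourly_counts) == 0:
        return 25, float("inf")

    jitter = stdev(hourly_counts) / mean(hourly_counts)

    if jitter <= jitter_threshold:
        score = 100
    elif jitter <= 0.3:
        score = 75
    elif jitter <= 0.5:
        score = 50
    else:
        score = 25
    return score, jitter

# Hypothetical sample data: one regular series and one bursty series
steady = [12, 12, 11, 12, 13, 12, 12, 12, 11, 13, 12, 12]
bursty = [1, 40, 0, 3, 95, 2, 0, 0, 60, 1, 5, 0]

steady_score, steady_jitter = beacon_score(steady)
bursty_score, bursty_jitter = beacon_score(bursty)
print(f"steady: score={steady_score}, jitter={steady_jitter:.3f}")
print(f"bursty: score={bursty_score}, jitter={bursty_jitter:.3f}")
```

Regular hourly counts are only a hint. Scheduled backups and monitoring agents can produce the same pattern, so flagged pairs still need analyst review.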

In [6]:
# Advanced C2 beacon detection
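def detect_c2_beacons(network_df, min_connections=10, jitter_threshold=0.1):
    """
    Detect potential C2 beacons from the regularity of hourly connection counts.

    min_connections is the minimum number of hourly windows in which a
    device/destination pair must appear. jitter_threshold is the highest
    coefficient of variation (stddev / mean of the hourly counts) that
    still earns the top BeaconScore of 100.
    """
    print("🔍 ANALYZING C2 BEACON PATTERNS...\n")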
    
    # First, let's see what columns are available
    print("📊 Available columns in DeviceNetworkEvents:")
    print(network_df.columns)
    
    # Group connections by source and destination with time bucketing
    # Note: Using connection frequency instead of byte counts for beacon detection
    beacon_analysis = network_df.withColumn(
        "TimeWindow", date_trunc("hour", col("Timestamp"))
    ).groupBy(
        "DeviceName", "LocalIP", "RemoteIP", "RemotePort", "TimeWindow"
    ).agg(
        spark_count("*").alias("ConnectionCount")
    )
    
    # Calculate beacon characteristics per connection pair
    beacon_stats = beacon_analysis.groupBy(
        "DeviceName", "LocalIP", "RemoteIP", "RemotePort"
    ).agg(
        spark_count("*").alias("TimeWindows"),
        avg("ConnectionCount").alias("AvgConnectionsPerHour"),
        stddev("ConnectionCount").alias("StdDevConnections")
    ).filter(
        col("TimeWindows") >= min_connections
    )
    
    # Identify potential beacons (low jitter, consistent timing)
    potential_beacons = beacon_stats.withColumn(
        "ConnectionJitter", 
        coalesce(col("StdDevConnections") / col("AvgConnectionsPerHour"), lit(999))
    ).withColumn(
        "BeaconScore",
        when(col("ConnectionJitter") <= jitter_threshold, 100)
        .when(col("ConnectionJitter") <= 0.3, 75)
        .when(col("ConnectionJitter") <= 0.5, 50)
        .otherwise(25)
    ).filter(
        col("BeaconScore") >= 50
    ).orderBy(desc("BeaconScore"), desc("TimeWindows"))
    
    return potential_beacons

# Load network data for C2 analysis
try:
    if SENTINEL_ENVIRONMENT:
        # Use safe_table_check to get proper workspace
        table_info = safe_table_check("DeviceNetworkEvents")
        if table_info['available']:
            network_events = data_provider.read_table("DeviceNetworkEvents", table_info['workspace'])
            print(f"✅ Loaded DeviceNetworkEvents from {table_info['workspace']}")
        else:
            print(f"❌ DeviceNetworkEvents not available: {table_info['error']}")
            network_events = None
    else:
        network_events = None
    
    # Filter to analysis window and external connections
    if network_events is not None:
        network_filtered = network_events.filter(
            col("Timestamp") >= (current_timestamp() - expr(f"INTERVAL {ANALYSIS_HOURS} HOURS"))
        ).filter(
            # Focus on external connections (not internal RFC1918)
            ~col("RemoteIP").rlike(r"^(10\.|192\.168\.|172\.(1[6-9]|2[0-9]|3[0-1])\.).*$")
        ).filter(
            # Filter out common legitimate traffic
            ~col("RemotePort").isin([80, 443, 53, 123])  # HTTP, HTTPS, DNS, NTP
        )
        
        # Detect C2 beacons
        c2_beacons = detect_c2_beacons(network_filtered)
        
        beacon_count = c2_beacons.count()
    else:
        print("❌ Network events not available - skipping C2 analysis")
        beacon_count = 0
        network_filtered = None
    
    if beacon_count > 0:
        print(f"🚨 POTENTIAL C2 BEACONS DETECTED: {beacon_count}\n")
        
        c2_beacons.select(
            "DeviceName", "RemoteIP", "RemotePort", 
            "BeaconScore", "TimeWindows", "AvgConnectionsPerHour", "ConnectionJitter"
        ).show(20, truncate=False)
        
        # Analyze beacon timing patterns
        high_confidence_beacons = c2_beacons.filter(col("BeaconScore") >= 75)
        
        if high_confidence_beacons.count() > 0:
            print("\n🔥 HIGH CONFIDENCE C2 BEACONS:")
            high_confidence_beacons.show(10, truncate=False)
            
            print("\n🚨 IMMEDIATE ACTIONS REQUIRED:")
            print("   1. Block identified C2 IPs at firewall")
            print("   2. Isolate affected devices immediately")
            print("   3. Analyze malware samples on affected systems")
            print("   4. Search for related IOCs across environment")
            print("   5. Review user accounts on affected devices")
        
    else:
        print("✅ No obvious C2 beacon patterns detected")
        
        # Show some network statistics (only if network_filtered is available)
        if 'network_filtered' in locals() and network_filtered is not None:
            external_connections = network_filtered.count()
            unique_destinations = network_filtered.select("RemoteIP").distinct().count()
            
            print(f"📊 External Network Analysis Summary:")
            print(f"   External Connections: {external_connections:,}")
            print(f"   Unique Destinations: {unique_destinations:,}")
        else:
            print("📊 Network statistics not available")
    
except Exception as e:
    print(f"⚠️  Error analyzing network data: {str(e)}")
    print("   DeviceNetworkEvents may not be available")


{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Loading table: DeviceNetworkEvents"}
{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Successfully loaded table DeviceNetworkEvents"}
✅ Loaded DeviceNetworkEvents from ak-SecOps
🔍 ANALYZING C2 BEACON PATTERNS...

📊 Available columns in DeviceNetworkEvents:
['TenantId', 'ActionType', 'AdditionalFields', 'AppGuardContainerId', 'DeviceId', 'DeviceName', 'InitiatingProcessAccountDomain', 'Ini

## 2. Living off the Land (LotL) Detection

Detect abuse of legitimate system tools for malicious purposes.
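
The matching idea can be sketched in plain Python: a record is flagged only when both the process name and the command line match a technique's patterns. The example below uses a reduced, illustrative subset of patterns and made-up sample records. `LOTL_PATTERNS`, `match_lotl`, and the sample command lines are hypothetical names and data, not part of the notebook.

```python
import re

# Illustrative subset: technique -> (process-name regex, command-line regex)
LOTL_PATTERNS = {
    "PowerShell Abuse": (
        r"powershell|pwsh",
        r"encodedcommand|bypass|downloadstring|invoke-expression",
    ),
    "Certificate Abuse": (
        r"certutil",
        r"urlcache|decode|encode",
    ),
}

def match_lotl(file_name, command_line):
    """Return the techniques whose process AND command-line patterns both match."""
    name = file_name.lower()
    cmd = command_line.lower()
    return [
        technique
        for technique, (proc_re, cmd_re) in LOTL_PATTERNS.items()
        if re.search(proc_re, name) and re.search(cmd_re, cmd)
    ]

# Hypothetical sample records: (FileName, ProcessCommandLine)
samples = [
    ("powershell.exe", "powershell.exe -NoProfile -ExecutionPolicy Bypass -File deploy.ps1"),
    ("certutil.exe", "certutil -urlcache -split -f http://example.test/a.txt a.txt"),
    ("notepad.exe", "notepad.exe report.txt"),
]

for file_name, command_line in samples:
    print(file_name, "->", match_lotl(file_name, command_line))
```

Requiring both patterns keeps ordinary use of these tools out of the results, but legitimate administration scripts will still match, so hits are leads rather than verdicts.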

In [7]:
# Living off the Land detection
def detect_lotl_abuse(process_events):
    """
    Detect abuse of legitimate tools (Living off the Land techniques)
    """
    print("🎭 DETECTING LIVING OFF THE LAND TECHNIQUES...\n")
    
    # Define suspicious usage patterns for legitimate tools.
    # Raw strings keep regex escapes such as \. from being read as
    # (invalid) Python string escapes.
    lotl_patterns = {
        "PowerShell Abuse": {
            "process": r"powershell|pwsh",
            "cmdline": r"encodedcommand|bypass|unrestricted|hidden|downloadstring|iex|invoke-expression|reflection\.assembly"
        },
        "WMI Abuse": {
            "process": r"wmic|wmiprvse",
            "cmdline": r"process.*call.*create|shadowcopy.*delete|service.*create"
        },
        "Certificate Abuse": {
            "process": r"certutil",
            "cmdline": r"urlcache|decode|encode|-f"
        },
        "BitsAdmin Abuse": {
            "process": r"bitsadmin",
            "cmdline": r"transfer|addfile|setnotifyflags"
        },
        "RegSvr32 Abuse": {
            "process": r"regsvr32",
            "cmdline": r"/s.*http|/u.*http|scrobj\.dll"
        },
        "MSBuild Abuse": {
            "process": r"msbuild",
            "cmdline": r"\.xml|inline"
        },
        "WMIC Process Creation": {
            "process": r"wmic",
            "cmdline": r"process.*call.*create|/node:"
        },
        "Rundll32 Abuse": {
            "process": r"rundll32",
            "cmdline": r"javascript|vbscript|url\.dll|shell32.*control_rundll"
        }
    }
    
    lotl_results = {}
    
    for technique, pattern in lotl_patterns.items():
        lotl_detections = process_events.filter(
            lower(col("FileName")).rlike(pattern["process"]) &
            lower(col("ProcessCommandLine")).rlike(pattern["cmdline"])
        )
        
        detection_count = lotl_detections.count()
        
        if detection_count > 0:
            lotl_results[technique] = {
                'count': detection_count,
                'data': lotl_detections
            }
    
    return lotl_results

# Load and analyze process events for LotL
try:
    if SENTINEL_ENVIRONMENT:
        # Use safe_table_check to get proper workspace
        table_info = safe_table_check("DeviceProcessEvents")
        if table_info['available']:
            process_events = data_provider.read_table("DeviceProcessEvents", table_info['workspace'])
            print(f"✅ Loaded DeviceProcessEvents from {table_info['workspace']}")
        else:
            print(f"❌ DeviceProcessEvents not available: {table_info['error']}")
            process_events = None
    else:
        process_events = None
    
    # Filter to analysis window
    if process_events is not None:
        process_filtered = process_events.filter(
            col("Timestamp") >= (current_timestamp() - expr(f"INTERVAL {ANALYSIS_HOURS} HOURS"))
        )
        
        # Detect LotL techniques
        lotl_results = detect_lotl_abuse(process_filtered)
    else:
        print("❌ Process events not available - skipping LotL analysis")
        lotl_results = {}
    
    if lotl_results:
        print(f"🎭 LIVING OFF THE LAND TECHNIQUES DETECTED: {len(lotl_results)}\n")
        
        for technique, result in lotl_results.items():
            print(f"🔍 {technique}: {result['count']} instances")
            
            # Show top examples
            technique_summary = result['data'].groupBy(
                "ProcessCommandLine", "AccountName"
            ).agg(
                spark_count("*").alias("Count"),
                countDistinct("DeviceName").alias("UniqueDevices")
            ).orderBy(desc("Count"))
            
            print(f"   Top command patterns:")
            technique_summary.show(5, truncate=False)
            print()
        
        # Overall LotL summary
        total_lotl_events = sum(result['count'] for result in lotl_results.values())
        
        print(f"📊 LIVING OFF THE LAND SUMMARY:")
        print(f"   Total Suspicious Events: {total_lotl_events:,}")
        print(f"   Techniques Detected: {len(lotl_results)}")
        
        # Get affected users and devices
        all_lotl_events = None
        for result in lotl_results.values():
            if all_lotl_events is None:
                all_lotl_events = result['data']
            else:
                all_lotl_events = all_lotl_events.union(result['data'])
        
        affected_users = all_lotl_events.select("AccountName").distinct().count()
        affected_devices = all_lotl_events.select("DeviceName").distinct().count()
        
        print(f"   Affected Users: {affected_users}")
        print(f"   Affected Devices: {affected_devices}")
        
        print("\n🛡️  LOTL MITIGATION RECOMMENDATIONS:")
        print("   1. Implement PowerShell logging and monitoring")
        print("   2. Restrict administrative tools usage")
        print("   3. Enable application allowlisting")
        print("   4. Monitor process creation events")
        print("   5. Train users on social engineering tactics")
        
    else:
        print("✅ No Living off the Land techniques detected")
        
except Exception as e:
    print(f"⚠️  Error analyzing process events: {str(e)}")


{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Loading table: DeviceProcessEvents"}
{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Successfully loaded table DeviceProcessEvents"}
✅ Loaded DeviceProcessEvents from ak-SecOps
🎭 DETECTING LIVING OFF THE LAND TECHNIQUES...


## 3. Advanced Persistence Detection

Detect sophisticated persistence mechanisms used by advanced threats.
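
Process-name filters for this kind of detection are sensitive to regex anchoring, because an unanchored search matches anywhere inside the file name. The sketch below compares an unanchored and an anchored pattern on a few hypothetical file names. `UNANCHORED`, `ANCHORED`, and the name list are illustrative only.

```python
import re

# An unanchored search matches anywhere in the string
UNANCHORED = r"schtasks|at\.exe"
# ^ and $ force the whole file name to match
ANCHORED = r"^(schtasks|at)\.exe$"

# Hypothetical process file names
names = ["schtasks.exe", "at.exe", "format.exe", "notepad.exe"]

unanchored_hits = [n for n in names if re.search(UNANCHORED, n)]
anchored_hits = [n for n in names if re.search(ANCHORED, n)]

print("unanchored:", unanchored_hits)
print("anchored:  ", anchored_hits)
```

Here `format.exe` is caught by the unanchored pattern because it ends in `at.exe`, while the anchored pattern keeps only the two intended tools.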

In [8]:
# Advanced persistence detection
def detect_persistence_mechanisms(process_events, registry_events=None):
    """
    Detect various persistence mechanisms
    """
    print("🔄 DETECTING PERSISTENCE MECHANISMS...\n")
    
    persistence_indicators = []
    
    # 1. Scheduled task creation/modification
    # The name is anchored so that at.exe does not also match e.g. format.exe
    schtasks_persistence = process_events.filter(
        lower(col("FileName")).rlike(r"^(schtasks|at)\.exe$") &
        lower(col("ProcessCommandLine")).rlike("/create|/change|/run")
    )
    
    schtasks_count = schtasks_persistence.count()
    if schtasks_count > 0:
        persistence_indicators.append(("Scheduled Tasks", schtasks_count, schtasks_persistence))
    
    # 2. Service creation/modification
    # The name is anchored so that sc.exe/net.exe do not also match e.g. wsc.exe or dotnet.exe
    service_persistence = process_events.filter(
        lower(col("FileName")).rlike(r"^(sc|net)\.exe$") &
        lower(col("ProcessCommandLine")).rlike("create.*binpath|config.*binpath|start.*auto")
    )
    
    service_count = service_persistence.count()
    if service_count > 0:
        persistence_indicators.append(("Service Persistence", service_count, service_persistence))
    
    # 3. Registry-based persistence (RunKey modifications)
    run_key_persistence = process_events.filter(
        lower(col("ProcessCommandLine")).rlike("reg.*add.*run|reg.*add.*runonce")
    )
    
    run_key_count = run_key_persistence.count()
    if run_key_count > 0:
        persistence_indicators.append(("Registry Run Keys", run_key_count, run_key_persistence))
    
    # 4. DLL hijacking indicators (unusual DLL loads)
    dll_hijacking = process_events.filter(
        lower(col("ProcessCommandLine")).rlike("rundll32|regsvr32") &
        lower(col("ProcessCommandLine")).rlike("appdata|temp|public")
    )
    
    dll_count = dll_hijacking.count()
    if dll_count > 0:
        persistence_indicators.append(("DLL Hijacking", dll_count, dll_hijacking))
    
    # 5. WMI event subscription persistence
    wmi_persistence = process_events.filter(
        lower(col("ProcessCommandLine")).rlike("wmic.*eventfilter|wmic.*consumer|wmic.*subscription")
    )
    
    wmi_count = wmi_persistence.count()
    if wmi_count > 0:
        persistence_indicators.append(("WMI Persistence", wmi_count, wmi_persistence))
    
    return persistence_indicators

# Analyze persistence mechanisms
persistence_results = detect_persistence_mechanisms(process_filtered) if 'process_filtered' in locals() else []

if persistence_results:
    print(f"🔄 PERSISTENCE MECHANISMS DETECTED: {len(persistence_results)}\n")
    
    for mechanism, mechanism_count, data in persistence_results:
        print(f"🎯 {mechanism}: {mechanism_count} instances")
        
        # Show details for each mechanism
        mechanism_details = data.groupBy(
            "ProcessCommandLine", "AccountName", "DeviceName"
        ).agg(
            spark_count("*").alias("Count")
        ).orderBy(desc("Count"))
        
        print("   Command examples:")
        mechanism_details.show(3, truncate=False)
        print()
    
    # Risk assessment
    total_persistence = sum(mechanism_count for _, mechanism_count, _ in persistence_results)
    risk_level = "HIGH" if total_persistence >= 10 else "MEDIUM" if total_persistence >= 5 else "LOW"
    
    print(f"🛡️  PERSISTENCE RISK LEVEL: {risk_level}")
    print(f"   Total Indicators: {total_persistence}")
    
    if risk_level in ["HIGH", "MEDIUM"]:
        print("\n🚨 IMMEDIATE ACTIONS REQUIRED:")
        print("   1. Review all scheduled tasks and services")
        print("   2. Check registry run keys for unauthorized entries")
        print("   3. Validate DLL authenticity and locations")
        print("   4. Audit WMI subscriptions and filters")
        print("   5. Implement application allowlisting")
        print("   6. Monitor for lateral movement patterns")

else:
    print("✅ No obvious persistence mechanisms detected")
    print("   Continue monitoring for sophisticated techniques")


🔄 DETECTING PERSISTENCE MECHANISMS...

✅ No obvious persistence mechanisms detected
   Continue monitoring for sophisticated techniques


## 4. Data Exfiltration Pattern Analysis

Detect sophisticated data exfiltration patterns combining multiple indicators.
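
One way indicators could be combined is by time correlation: an archive is created on a device and an external connection follows shortly afterwards. The sketch below illustrates that idea on plain Python tuples. It is an assumption about how such a correlation might look and is not what the notebook cell does. `staged_then_sent`, the one-hour window, and the sample events are all hypothetical.

```python
from datetime import datetime, timedelta

def staged_then_sent(archive_events, network_events, window=timedelta(hours=1)):
    """Return (device, archived_at, connected_at) triples where an external
    connection follows an archive creation on the same device within `window`.

    Both inputs are lists of (device_name, timestamp) tuples.
    """
    hits = []
    for device, archived_at in archive_events:
        for net_device, connected_at in network_events:
            delay = connected_at - archived_at
            # Keep only connections on the same device that happen at or
            # after the archive time and no later than the window allows
            if net_device == device and timedelta(0) <= delay <= window:
                hits.append((device, archived_at, connected_at))
    return hits

# Hypothetical sample events
t0 = datetime(2024, 1, 1, 9, 0)
archives = [("host-a", t0), ("host-b", t0)]
connections = [
    ("host-a", t0 + timedelta(minutes=20)),  # same device, inside the window
    ("host-b", t0 + timedelta(hours=5)),     # same device, too late
    ("host-a", t0 - timedelta(minutes=5)),   # happened before the archive
]

hits = staged_then_sent(archives, connections)
for device, archived_at, connected_at in hits:
    print(device, archived_at, connected_at)
```

The nested loop is fine for a handful of events. On full telemetry the same logic would be expressed as a join on device name with a time-range condition.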

In [9]:
# Advanced data exfiltration detection
def detect_data_exfiltration_patterns():
    """
    Detect sophisticated data exfiltration using multiple data sources
    """
    print("📤 ANALYZING DATA EXFILTRATION PATTERNS...\n")
    
    exfiltration_indicators = []
    
    try:
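        # 1. High-frequency external connections (a proxy for large transfers, adapted for available columns)
        # locals() inside a function only sees the function's own names,
        # so the notebook-level DataFrame is looked up via globals()
        if globals().get('network_filtered') is not None: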
            # Focus on connection frequency to external IPs as a proxy for data transfers
            external_connections = network_filtered.groupBy(
                "DeviceName", "RemoteIP"
            ).agg(
                spark_count("*").alias("ConnectionCount")
            ).filter(col("ConnectionCount") > 100)  # High connection frequency
            
            high_connection_count = external_connections.count()
            if high_connection_count > 0:
                exfiltration_indicators.append(("High External Connection Frequency", high_connection_count, external_connections))
        
        # 2. Archive creation before external transfers
        if globals().get('process_filtered') is not None:  # notebook-level variable, not a function local
            archive_creation = process_filtered.filter(
                lower(col("ProcessCommandLine")).rlike(
                    "7z.*a |winrar.*a |zip.*-r|tar.*-czf|makecab"
                ) &
                lower(col("ProcessCommandLine")).rlike(
                    "documents|desktop|users|programdata|temp"
                )
            )
            
            archive_count = archive_creation.count()
            if archive_count > 0:
                exfiltration_indicators.append(("Suspicious Archive Creation", archive_count, archive_creation))
        
        # 3. Cloud storage tool usage
        if globals().get('process_filtered') is not None:  # notebook-level variable, not a function local
            cloud_tools = process_filtered.filter(
                lower(col("FileName")).rlike(
                    "rclone|aws|gsutil|azcopy|dropbox|googledrive"
                ) |
                lower(col("ProcessCommandLine")).rlike(
                    "s3.*cp|blob.*upload|drive.*upload|dropbox.*upload"
                )
            )
            
            cloud_count = cloud_tools.count()
            if cloud_count > 0:
                exfiltration_indicators.append(("Cloud Storage Tools", cloud_count, cloud_tools))
        
        # 4. Encoded/encrypted data preparation
        if globals().get('process_filtered') is not None:  # notebook-level variable, not a function local
            encoding_activities = process_filtered.filter(
                lower(col("ProcessCommandLine")).rlike(
                    "base64|certutil.*encode|openssl.*enc|gpg.*encrypt"
                )
            )
            
            encoding_count = encoding_activities.count()
            if encoding_count > 0:
                exfiltration_indicators.append(("Data Encoding/Encryption", encoding_count, encoding_activities))
    
    except Exception as e:
        print(f"⚠️  Error in exfiltration analysis: {str(e)}")
    
    return exfiltration_indicators

# Detect data exfiltration patterns
exfiltration_results = detect_data_exfiltration_patterns()

if exfiltration_results:
    print(f"📤 DATA EXFILTRATION INDICATORS DETECTED: {len(exfiltration_results)}\n")
    
    for indicator, count, data in exfiltration_results:
        print(f"🚨 {indicator}: {count} instances")
        
        # Show relevant details based on indicator type
        if "Connection" in indicator:
            data.select(
                "DeviceName", "RemoteIP", "ConnectionCount"
            ).show(5, truncate=False)
        else:
            summary = data.groupBy("ProcessCommandLine", "AccountName").agg(
                spark_count("*").alias("Count"),
                countDistinct("DeviceName").alias("UniqueDevices")
            ).orderBy(desc("Count"))
            
            summary.show(3, truncate=False)
        print()
    
    # Risk assessment
    total_indicators = len(exfiltration_results)
    risk_level = "HIGH" if total_indicators >= 3 else "MEDIUM" if total_indicators >= 2 else "LOW"
    
    print(f"🎯 EXFILTRATION RISK LEVEL: {risk_level}")
    print(f"   Total Indicators: {total_indicators}")
    
    if risk_level in ["HIGH", "MEDIUM"]:
        print("\n🚨 IMMEDIATE ACTIONS REQUIRED:")
        print("   1. Investigate all flagged activities immediately")
        print("   2. Review data access logs for affected users")
        print("   3. Check for unauthorized cloud storage usage")
        print("   4. Validate legitimate business purposes")
        print("   5. Consider temporary network restrictions")
        print("   6. Implement DLP policies if not already present")
    
else:
    print("✅ No obvious data exfiltration patterns detected")
    print("   Continue monitoring for suspicious data movement")


📤 ANALYZING DATA EXFILTRATION PATTERNS...

✅ No obvious data exfiltration patterns detected
   Continue monitoring for suspicious data movement


## 5. User Behavior Analytics (UBA)

Advanced behavioral analysis to detect anomalous user activities.
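
A simple baseline for this kind of analysis is the set of hours that hold a meaningful share of a user's sign-ins. Anything outside that set is a candidate anomaly. The sketch below assumes a 10% share threshold, the figure mentioned in the cell's comments, and works on a plain list of sign-in hours for one user. It groups by hour only, while the cell also groups by day of week. `normal_hours`, `off_hours`, and the sample data are hypothetical.

```python
from collections import Counter

def normal_hours(signin_hours, min_share=0.10):
    """Return the hours (0-23) that each hold more than `min_share` of sign-ins."""
    counts = Counter(signin_hours)
    total = sum(counts.values())
    return {hour for hour, n in counts.items() if n / total > min_share}

def off_hours(signin_hours, min_share=0.10):
    """Return the sorted hours with sign-ins that fall outside the user's normal hours."""
    usual = normal_hours(signin_hours, min_share)
    return sorted(set(signin_hours) - usual)

# Hypothetical sign-in hours for one user: mostly 09:00-14:00, one at 03:00
hours = [9] * 6 + [10] * 5 + [14] * 4 + [3]

print("normal hours:", sorted(normal_hours(hours)))
print("off hours:   ", off_hours(hours))
```

With only a day or two of data the baseline is thin, so a longer lookback window generally gives a more reliable picture of what is normal for a user.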

In [10]:
# User Behavior Analytics
def perform_user_behavior_analysis():
    """
    Perform advanced user behavior analysis across multiple data sources
    """
    print("👤 USER BEHAVIOR ANALYTICS...\n")
    
    try:
        # Load sign-in data for behavioral analysis using smart table checking
        if SENTINEL_ENVIRONMENT:
            table_info = safe_table_check("SigninLogs")
            if table_info['available']:
                signin_logs = data_provider.read_table("SigninLogs", table_info['workspace'])
                print(f"✅ Loaded SigninLogs from {table_info['workspace']}")
            else:
                print(f"❌ SigninLogs not available: {table_info['error']}")
                signin_logs = None
        else:
            signin_logs = None
        
        if signin_logs is None:
            print("⚠️ SigninLogs not available - skipping user behavior analysis")
            return 0  # Return a count (not None) so the summary cell can compare it numerically
        
        # Filter to analysis window
        signin_filtered = signin_logs.filter(
            col("CreatedDateTime") >= (current_timestamp() - expr(f"INTERVAL {ANALYSIS_HOURS*2} HOURS"))  # Longer window for baseline
        )
        
        # 1. Unusual time-based patterns
        user_time_patterns = signin_filtered.withColumn(
            "HourOfDay", hour(col("CreatedDateTime"))
        ).withColumn(
            "DayOfWeek", dayofweek(col("CreatedDateTime"))
        ).groupBy(
            "UserPrincipalName", "HourOfDay", "DayOfWeek"
        ).agg(
            spark_count("*").alias("SignInCount")
        )
        
        # Calculate each user's normal time slots (hour/weekday combinations with >5% of their activity)
        user_total_signins = user_time_patterns.groupBy("UserPrincipalName").agg(
            spark_sum("SignInCount").alias("TotalSignIns")
        )
        
        user_normal_hours = user_time_patterns.join(
            user_total_signins, "UserPrincipalName"
        ).withColumn(
            "ActivityPercentage", col("SignInCount") / col("TotalSignIns")
        ).filter(
            col("ActivityPercentage") > 0.05  # Hours with >5% of activity
        )
        
        # Find recent sign-ins outside normal patterns
        recent_signins = signin_filtered.filter(
            col("CreatedDateTime") >= (current_timestamp() - expr(f"INTERVAL {ANALYSIS_HOURS} HOURS"))
        ).withColumn(
            "HourOfDay", hour(col("CreatedDateTime"))
        ).withColumn(
            "DayOfWeek", dayofweek(col("CreatedDateTime"))
        )
        
        # Anti-join to find sign-ins outside normal patterns
        anomalous_time_signins = recent_signins.join(
            user_normal_hours.select("UserPrincipalName", "HourOfDay", "DayOfWeek"),
            ["UserPrincipalName", "HourOfDay", "DayOfWeek"],
            "left_anti"
        )
        
        anomalous_time_count = anomalous_time_signins.count()
        
        if anomalous_time_count > 0:
            print(f"⏰ ANOMALOUS TIME-BASED SIGN-INS: {anomalous_time_count}")
            
            time_anomalies = anomalous_time_signins.groupBy(
                "UserPrincipalName", "UserDisplayName"
            ).agg(
                spark_count("*").alias("AnomalousSignIns"),
                countDistinct("IPAddress").alias("UniqueIPs")
            ).orderBy(desc("AnomalousSignIns"))
            
            time_anomalies.show(10, truncate=False)
        
        # 2. Geographic anomalies
        if signin_filtered.filter(col("LocationDetails").isNotNull()).count() > 0:
            location_schema = StructType([
                StructField("countryOrRegion", StringType(), True)
            ])
            
            user_locations = signin_filtered.filter(
                col("LocationDetails").isNotNull()
            ).withColumn(
                "Country", from_json(col("LocationDetails"), location_schema).getField("countryOrRegion")
            )
            
            # Baseline countries for each user
            baseline_countries = user_locations.filter(
                col("CreatedDateTime") < (current_timestamp() - expr(f"INTERVAL {ANALYSIS_HOURS} HOURS"))
            ).select("UserPrincipalName", "Country", "UserDisplayName").distinct()
            
            # Recent countries
            recent_countries = user_locations.filter(
                col("CreatedDateTime") >= (current_timestamp() - expr(f"INTERVAL {ANALYSIS_HOURS} HOURS"))
            ).select("UserPrincipalName", "Country", "UserDisplayName").distinct()
            
            new_country_signins = recent_countries.join(
                baseline_countries,
                ["UserPrincipalName", "Country"],
                "left_anti"
            )
            
            new_country_count = new_country_signins.count()
            
            if new_country_count > 0:
                print(f"\n🌍 NEW COUNTRY SIGN-INS: {new_country_count}")
                new_country_signins.show(10, truncate=False)
        
        # 3. Application usage anomalies
        app_anomalies = recent_signins.filter(
            col("AppDisplayName").isNotNull()
        ).groupBy(
            "UserPrincipalName", "AppDisplayName"
        ).agg(
            spark_count("*").alias("RecentUsage")
        )
        
        # Find apps used in baseline period
        baseline_app_usage = signin_filtered.filter(
            col("CreatedDateTime") < (current_timestamp() - expr(f"INTERVAL {ANALYSIS_HOURS} HOURS"))
        ).filter(
            col("AppDisplayName").isNotNull()
        ).groupBy(
            "UserPrincipalName", "AppDisplayName"
        ).agg(
            spark_count("*").alias("BaselineUsage")
        )
        
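        app_usage_anomalies = app_anomalies.join(
            baseline_app_usage,
            ["UserPrincipalName", "AppDisplayName"],
            "left"
        ).filter(
            # Apps with no baseline usage have a null BaselineUsage after the left join;
            # keep them explicitly, otherwise the comparison below silently drops them
            col("BaselineUsage").isNull() |
            (col("RecentUsage") > (col("BaselineUsage") * 3))  # 3x normal usage
        ).filter(
            col("RecentUsage") > 5  # More than 5 sign-ins
        )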
        
        app_anomaly_count = app_usage_anomalies.count()
        
        if app_anomaly_count > 0:
            print(f"\n📱 APPLICATION USAGE ANOMALIES: {app_anomaly_count}")
            app_usage_anomalies.orderBy(desc("RecentUsage")).show(10, truncate=False)
        
        # Summary
        total_anomalies = sum([
            anomalous_time_count if 'anomalous_time_count' in locals() else 0,
            new_country_count if 'new_country_count' in locals() else 0,
            app_anomaly_count if 'app_anomaly_count' in locals() else 0
        ])
        
        if total_anomalies > 0:
            print("\n🎯 USER BEHAVIOR ANALYSIS SUMMARY:")
            print(f"   Total Behavioral Anomalies: {total_anomalies}")
            print("\n🔍 INVESTIGATION PRIORITIES:")
            print("   1. Review users with multiple anomaly types")
            print("   2. Correlate with recent security events")
            print("   3. Verify account compromise indicators")
            print("   4. Check for privilege escalation attempts")
            print("   5. Monitor for additional suspicious activities")
        else:
            print("✅ No significant user behavior anomalies detected")
        
        # Return total anomalies for summary
        return total_anomalies
    
    except Exception as e:
        print(f"⚠️  Error in user behavior analysis: {e}")
        return 0

# Perform user behavior analysis
total_anomalies = perform_user_behavior_analysis()

StatementMeta(MSGMedium, 39, 11, Finished, Available, Finished)

👤 USER BEHAVIOR ANALYTICS...

{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Loading table: SigninLogs"}
{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Successfully loaded table SigninLogs"}
{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Loading table: SigninLogs"}
{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Successfully loaded table SigninLogs"}
{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Loading table: SigninLogs"}
{"level": "INFO", "run_id": "c538f690-2d90-44c0-92e9-6903ff10ab25", "message": "Successfully loaded table SigninLogs"}
✅ Loaded SigninLogs from ak-SecOps

🌍 NEW COUNTRY SIGN-INS: 1
+--------------------------------

## 6. Advanced Threat Hunting Summary

Aggregates the findings from the preceding sections into a weighted threat score, maps that score to a threat level, and prints recommendations for that level.
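The weighted scoring can be expressed as a small rule table, which makes the weights and thresholds easier to review and test. This is a minimal sketch, not the notebook's implementation: the names `SCORING_RULES`, `LEVELS`, and `score_findings` are hypothetical, while the weights, minimum counts, and level thresholds are copied from the summary cell in this section.

```python
# Hypothetical rule table mirroring the weights used in the summary cell:
# (finding name, minimum count to trigger, points, counts as a critical finding)
SCORING_RULES = [
    ("C2 Beacons", 1, 40, True),
    ("Data Exfiltration Indicators", 2, 30, True),
    ("Living off the Land", 3, 20, False),
    ("Persistence Mechanisms", 2, 15, False),
    ("Behavioral Anomalies", 5, 10, False),
]

# Score floors, checked from highest to lowest; anything below 15 is LOW
LEVELS = [(50, "CRITICAL"), (30, "HIGH"), (15, "MEDIUM")]


def score_findings(findings):
    """Return (score, level, critical_count) for a {finding name: count} dict."""
    score = 0
    critical = 0
    for name, minimum, points, is_critical in SCORING_RULES:
        if findings.get(name, 0) >= minimum:
            score += points
            critical += int(is_critical)
    level = next((label for floor, label in LEVELS if score >= floor), "LOW")
    return score, level, critical


# Counts taken from the recorded output of the summary cell
findings = {
    "C2 Beacons": 1,
    "Living off the Land": 1,
    "Persistence Mechanisms": 0,
    "Data Exfiltration Indicators": 0,
    "Behavioral Anomalies": 1,
}
print(score_findings(findings))  # (40, 'HIGH', 1)
```

With these counts only the C2 rule fires, so the score is 40 and the level is HIGH, matching the recorded run. The weights sum to 115, so the raw score can exceed 100 when every rule fires.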

In [11]:
# Comprehensive threat hunting summary
print("🎯 ADVANCED THREAT HUNTING SUMMARY")
print("=" * 50)

# Collect all findings
threat_findings = {}

if 'beacon_count' in locals():
    threat_findings['C2 Beacons'] = beacon_count
if 'lotl_results' in locals():
    threat_findings['Living off the Land'] = len(lotl_results)
if 'persistence_results' in locals():
    threat_findings['Persistence Mechanisms'] = len(persistence_results)
if 'exfiltration_results' in locals():
    threat_findings['Data Exfiltration Indicators'] = len(exfiltration_results)
if 'total_anomalies' in locals():
    threat_findings['Behavioral Anomalies'] = total_anomalies

# Display findings
print(f"📊 THREAT HUNTING RESULTS ({ANALYSIS_HOURS}-hour analysis):")
for finding, count in threat_findings.items():
    status = "🚨" if count > 0 else "✅"
    print(f"   {status} {finding}: {count}")

# Calculate overall threat score
threat_score = 0
critical_findings = 0

# Weight different finding types
if threat_findings.get('C2 Beacons', 0) > 0:
    threat_score += 40
    critical_findings += 1
if threat_findings.get('Data Exfiltration Indicators', 0) >= 2:
    threat_score += 30
    critical_findings += 1
if threat_findings.get('Living off the Land', 0) >= 3:
    threat_score += 20
if threat_findings.get('Persistence Mechanisms', 0) >= 2:
    threat_score += 15
if threat_findings.get('Behavioral Anomalies', 0) >= 5:
    threat_score += 10

# Cap the score: the weights above sum to 115, but the score is reported out of 100
threat_score = min(threat_score, 100)

# Determine threat level
if threat_score >= 50:
    threat_level = "CRITICAL"
    color = "🔴"
elif threat_score >= 30:
    threat_level = "HIGH"
    color = "🟠"
elif threat_score >= 15:
    threat_level = "MEDIUM"
    color = "🟡"
else:
    threat_level = "LOW"
    color = "🟢"

print(f"\n{color} OVERALL THREAT LEVEL: {threat_level} (Score: {threat_score}/100)")

# Provide specific recommendations based on findings
print("\n🎯 THREAT-SPECIFIC RECOMMENDATIONS:")

if threat_level == "CRITICAL":
    print("   🚨 IMMEDIATE INCIDENT RESPONSE REQUIRED")
    print("   1. Activate incident response team")
    print("   2. Isolate affected systems immediately")
    print("   3. Preserve evidence for forensic analysis")
    print("   4. Reset credentials for affected accounts")
    print("   5. Implement emergency containment measures")
    print("   6. Contact legal and compliance teams")
    print("   7. Prepare external communications if needed")

elif threat_level == "HIGH":
    print("   ⚠️  ELEVATED THREAT - IMMEDIATE INVESTIGATION")
    print("   1. Begin detailed investigation of all findings")
    print("   2. Implement enhanced monitoring")
    print("   3. Consider network segmentation")
    print("   4. Review and update security policies")
    print("   5. Increase security team alertness")
    print("   6. Prepare for potential escalation")

elif threat_level == "MEDIUM":
    print("   🔍 ACTIVE MONITORING AND INVESTIGATION")
    print("   1. Investigate flagged activities systematically")
    print("   2. Validate findings with additional context")
    print("   3. Implement targeted monitoring")
    print("   4. Review security controls effectiveness")
    print("   5. Update threat hunting playbooks")

else:
    print("   ✅ BASELINE SECURITY POSTURE MAINTAINED")
    print("   1. Continue regular monitoring")
    print("   2. Maintain current security controls")
    print("   3. Schedule next threat hunting cycle")
    print("   4. Review and update hunting queries")
    print("   5. Train team on new techniques")

# Next steps and continuous improvement
print("\n🔄 CONTINUOUS IMPROVEMENT:")
print("   📊 Update baselines with new data")
print("   🎓 Train analysts on identified techniques")
print("   🔧 Tune detection rules based on findings")
print("   📝 Document lessons learned")
print("   🕒 Schedule regular threat hunting cycles")

print(f"\n📅 Analysis completed: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print(f"🎯 Next recommended analysis: {(datetime.now() + timedelta(days=7)).strftime('%Y-%m-%d')}")
print("✨ Advanced threat hunting cycle complete!")

# Generate hunting report summary
if threat_level in ["CRITICAL", "HIGH"]:
    print("\n📋 EXECUTIVE SUMMARY FOR STAKEHOLDERS:")
    print(f"   • Threat Level: {threat_level}")
    print(f"   • Critical Findings: {critical_findings}")
    print(f"   • Analysis Period: {ANALYSIS_HOURS} hours")
    print(f"   • Total Indicators: {sum(threat_findings.values())}")
    print("   • Immediate action required for security posture")

StatementMeta(MSGMedium, 39, 12, Finished, Available, Finished)

🎯 ADVANCED THREAT HUNTING SUMMARY
📊 THREAT HUNTING RESULTS (24-hour analysis):
   🚨 C2 Beacons: 1
   🚨 Living off the Land: 1
   ✅ Persistence Mechanisms: 0
   ✅ Data Exfiltration Indicators: 0
   🚨 Behavioral Anomalies: 1

🟠 OVERALL THREAT LEVEL: HIGH (Score: 40/100)

🎯 THREAT-SPECIFIC RECOMMENDATIONS:
   ⚠️  ELEVATED THREAT - IMMEDIATE INVESTIGATION
   1. Begin detailed investigation of all findings
   2. Implement enhanced monitoring
   3. Consider network segmentation
   4. Review and update security policies
   5. Increase security team alertness
   6. Prepare for potential escalation

🔄 CONTINUOUS IMPROVEMENT:
   📊 Update baselines with new data
   🎓 Train analysts on identified techniques
   🔧 Tune detection rules based on findings
   📝 Document lessons learned
   🕒 Schedule regular threat hunting cycles

📅 Analysis completed: 2025-09-03 09:35:25
🎯 Next recommended analysis: 2025-09-10
✨ Advanced threat hunting cycle complet