# Anomalous Sign-In Detection and Enrichment

### Overview

This notebook analyzes sign-in logs to detect anomalous authentication patterns, enriches them with threat intelligence and geo-location data, and saves findings to a managed table for alerting and investigation.

### How to Run Notebook

Reference the general [Sentinel Notebook Readme](../README.md) for guidance on installing and running notebooks.

This job ships with its own job YAML file. Before scheduling the job, set the following fields in that file:
- **StartTime**: What day and time the job should start running.
- **EndTime**: What day and time the job should stop running.
- **JobName (Optional)**: If you change the job name, prefix the new name with 'Anomalous-SignIn-Detection'.

### Key Features:
- **Anomaly Detection**: Identifies suspicious sign-in patterns including:
  - Multiple failed attempts followed by success
  - Impossible travel scenarios (geographically distant logins)
  - Sign-ins from risky IPs or locations
  - Unusual sign-in times or locations for users
- **Threat Intelligence Enrichment**: Correlates suspicious IPs with known threat indicators
- **Geo-Location Analysis**: Detects travel velocity anomalies
- **Risk Scoring**: Assigns risk scores based on multiple factors
- **Incremental Updates**: Avoids duplicate alerts on subsequent runs

### Data Sources:
- **SigninLogs**: Standard user sign-in activities
- **AADNonInteractiveUserSignInLogs**: Non-interactive authentication
- **ThreatIntelIndicators**: IP threat intelligence data

### Required Customer Input:
- **WORKSPACE_NAME**: Customer Log Analytics workspace name. If 'None', auto-detects first available workspace.
- **LOOKBACK_DAYS**: 1-90. Analysis period for sign-in logs. Default 7.
- **MIN_FAILED_ATTEMPTS**: Minimum failed logins before success to flag as anomalous. Default 3.
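- **IMPOSSIBLE_TRAVEL_KM_PER_HOUR**: Speed threshold, in km/h, for impossible travel detection. Default 800.
- **MAX_TRAVEL_TIME_HOURS**: Maximum time, in hours, between two sign-ins for the pair to be checked for impossible travel. Default 1.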

### Output Schema:
Results are saved to the `AnomalousSignInFindings_SPRK_CL` custom table with the following schema:

| Column Name | Type | Description |
|-------------|------|-------------|
| FindingId | string | Unique identifier for the finding |
| JobId | string | Identifier for the analysis job execution |
| JobStartTime | datetime | Timestamp when the job started |
| JobEndTime | datetime | Timestamp when the job completed |
| AnomalyType | string | Type of anomaly detected (e.g., "FailedThenSuccess", "ImpossibleTravel", "ThreatIP") |
| UserPrincipalName | string | User account involved in anomalous activity |
| IPAddress | string | Source IP address |
| Location | string | Sign-in location |
| RiskScore | int | Calculated risk score (0-100) |
| FailedAttempts | int | Number of failed attempts (for FailedThenSuccess type) |
| TravelDistanceKm | double | Distance traveled (for ImpossibleTravel type) |
| TravelVelocityKmh | double | Travel velocity (for ImpossibleTravel type) |
| ThreatIntelMatch | boolean | Whether IP matched threat intelligence |
| ThreatActors | dynamic | Threat actors associated with IP (if any) |
| EventReferences | dynamic | Array of related sign-in event IDs |
| FirstSeen | datetime | First occurrence of this anomaly pattern |
| LastSeen | datetime | Most recent occurrence |
| OccurrenceCount | int | Number of times this pattern occurred |
| TenantId | string | Azure tenant identifier |
| TimeGenerated | datetime | Timestamp of record creation |

### Version
1.0.0

In [None]:
# ===============================================================================
# PARAMETERS AND CONFIGURATION
# ===============================================================================

# Workspace Configuration
WORKSPACE_NAME = None  # Set to your workspace name or leave as None for auto-detection
LOOKBACK_DAYS = 7  # Days to look back for analysis (1-90)

# Anomaly Detection Thresholds
MIN_FAILED_ATTEMPTS = 3  # Minimum failed logins before success to flag
IMPOSSIBLE_TRAVEL_KM_PER_HOUR = 800  # Speed threshold for impossible travel (km/h)
MAX_TRAVEL_TIME_HOURS = 1  # Maximum time between logins to check for impossible travel

# Risk Scoring Weights (total should equal 100)
WEIGHT_FAILED_ATTEMPTS = 30
WEIGHT_IMPOSSIBLE_TRAVEL = 40
WEIGHT_THREAT_INTEL = 30

# Table Names
THREAT_INTEL_TABLE = "ThreatIntelIndicators"
RESULTS_TABLE = "AnomalousSignInFindings_SPRK_CL"

# Debug Settings
SHOW_DEBUG_LOGS = False
SHOW_STATS = True

# Version
VERSION = "1.0.0"

# ===============================================================================
# PARAMETER VALIDATION
# ===============================================================================
if not (1 <= LOOKBACK_DAYS <= 90):
    raise ValueError("LOOKBACK_DAYS must be between 1 and 90")

if MIN_FAILED_ATTEMPTS < 1:
    raise ValueError("MIN_FAILED_ATTEMPTS must be at least 1")

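if IMPOSSIBLE_TRAVEL_KM_PER_HOUR <= 0:
    raise ValueError("IMPOSSIBLE_TRAVEL_KM_PER_HOUR must be positive")

if MAX_TRAVEL_TIME_HOURS <= 0:
    raise ValueError("MAX_TRAVEL_TIME_HOURS must be positive")

# Enforce the documented constraint that the scoring weights total 100
if WEIGHT_FAILED_ATTEMPTS + WEIGHT_IMPOSSIBLE_TRAVEL + WEIGHT_THREAT_INTEL != 100:
    raise ValueError("Risk scoring weights must sum to 100")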

if not RESULTS_TABLE.endswith('_SPRK_CL'):
    RESULTS_TABLE = f"{RESULTS_TABLE}_SPRK_CL"

print(f"Notebook version: {VERSION}")
print(f"Configuration loaded: Workspace={WORKSPACE_NAME}, Lookback={LOOKBACK_DAYS} days")
print(f"Thresholds: Failed attempts>={MIN_FAILED_ATTEMPTS}, Travel speed>{IMPOSSIBLE_TRAVEL_KM_PER_HOUR} km/h")

## Imports, Sentinel Provider, and Spark Configuration

In [None]:
# ===============================================================================
# IMPORTS AND SETUP
# ===============================================================================
import json
import uuid
import time
from datetime import datetime, timedelta
from math import radians, cos, sin, asin, sqrt

from pyspark.sql.functions import (
    col, lit, current_timestamp, array, struct, when, count, row_number,
    first, last, collect_list, collect_set, size, explode, concat_ws, concat,
    array_distinct, coalesce, sum as spark_sum, count_distinct, avg,
    min as spark_min, max as spark_max, lag, lead, unix_timestamp,
    from_json, to_json, get_json_object, broadcast, expr, udf, greatest, least, split, flatten
)
from pyspark.sql.types import (
    StringType, IntegerType, DoubleType, BooleanType, ArrayType,
    StructType, StructField, TimestampType, LongType
)
from pyspark.sql.window import Window
from pyspark.sql import functions as F

from sentinel_lake.providers import MicrosoftSentinelProvider

# Start timer
start_time = time.time()

# Spark configuration for performance
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(150 * 1024 * 1024))

# Initialize provider
data_provider = MicrosoftSentinelProvider(spark)
print("✓ Microsoft Sentinel data provider initialized")

# Job tracking
job_id = str(uuid.uuid4())
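# Capture the start time now as a literal; current_timestamp() is evaluated lazily
# when the query runs, which would make JobStartTime equal to JobEndTime
job_start_time = lit(datetime.now())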
print(f"✓ Job ID: {job_id}")

# Auto-select workspace if not provided
if WORKSPACE_NAME is None or WORKSPACE_NAME.strip() == "":
    print("Auto-selecting workspace...")
    databases = data_provider.list_databases()
    for db in databases:
        if db.lower() not in ["default", "system tables"]:
            WORKSPACE_NAME = db
            print(f"✓ Auto-selected workspace: {WORKSPACE_NAME}")
            break

if WORKSPACE_NAME is None or WORKSPACE_NAME.strip() == "":
    raise ValueError("No workspace available. Please specify WORKSPACE_NAME.")

## Helper Functions

In [None]:
# ===============================================================================
# HELPER FUNCTIONS
# ===============================================================================

def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points on Earth.
    Returns distance in kilometers.
    """
    if None in [lat1, lon1, lat2, lon2]:
        return None
    
    # Convert to radians
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    
    # Haversine formula
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    # Clamp to 1.0 so floating-point rounding cannot push asin() out of its domain
    c = 2 * asin(min(1.0, sqrt(a)))
    
    # Radius of Earth in kilometers
    r = 6371
    return c * r

# Register UDF for distance calculation
haversine_udf = udf(haversine_distance, DoubleType())

def calculate_risk_score(failed_attempts, travel_velocity, has_threat_intel):
    """
    Calculate risk score (0-100) based on multiple factors.
    """
    score = 0
    
    # Failed attempts component (0-30 points)
    if failed_attempts:
        score += min(WEIGHT_FAILED_ATTEMPTS, (failed_attempts / 10) * WEIGHT_FAILED_ATTEMPTS)
    
    # Impossible travel component: any velocity above the threshold earns the full
    # weight (40 points by default), because (1 + excess * 0.5) > 1 and min() caps it
    if travel_velocity:
        if travel_velocity > IMPOSSIBLE_TRAVEL_KM_PER_HOUR:
            excess = (travel_velocity - IMPOSSIBLE_TRAVEL_KM_PER_HOUR) / IMPOSSIBLE_TRAVEL_KM_PER_HOUR
            score += min(WEIGHT_IMPOSSIBLE_TRAVEL, WEIGHT_IMPOSSIBLE_TRAVEL * (1 + excess * 0.5))
    
    # Threat intel component (0-30 points)
    if has_threat_intel:
        score += WEIGHT_THREAT_INTEL
    
    return min(100, int(score))

# Register UDF for risk scoring
calculate_risk_udf = udf(calculate_risk_score, IntegerType())

print("✓ Helper functions defined")
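
As a quick sanity check of the great-circle formula used by `haversine_distance`, the standalone sketch below recomputes the same formula in plain Python, without Spark, for two well-known cities. The function name `haversine_km`, the approximate city coordinates, and the 30-minute gap are illustrative assumptions, not values taken from the notebook's data.

```python
from math import radians, cos, sin, asin, sqrt

EARTH_RADIUS_KM = 6371  # mean Earth radius, same constant as the notebook helper


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, [lat1, lon1, lat2, lon2])
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    # Clamp to 1.0 so rounding error can never push asin() out of its domain
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(a)))


# Approximate city-center coordinates (illustrative values)
london = (51.5074, -0.1278)
new_york = (40.7128, -74.0060)

distance_km = haversine_km(*london, *new_york)
velocity_kmh = distance_km / 0.5  # two sign-ins assumed to be 30 minutes apart

print(f"Distance: {distance_km:.0f} km, implied speed: {velocity_kmh:.0f} km/h")
```

The implied speed is far above the default 800 km/h threshold, so a pair of sign-ins like this would be flagged as impossible travel.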

## Load Sign-In Logs

In [None]:
# ===============================================================================
# LOAD SIGN-IN LOGS
# ===============================================================================

# Calculate date range
end_date = datetime.now()
start_date = end_date - timedelta(days=LOOKBACK_DAYS)

print(f"Loading sign-in logs from {start_date.strftime('%Y-%m-%d')} to {end_date.strftime('%Y-%m-%d')}...")

# Load interactive sign-in logs
try:
    signin_logs = data_provider.read_table("SigninLogs", WORKSPACE_NAME)
    signin_logs = signin_logs.filter(
        (col("TimeGenerated") >= lit(start_date)) &
        (col("TimeGenerated") <= lit(end_date))
    ).select(
        col("Id").alias("EventId"),
        col("TimeGenerated"),
        col("UserPrincipalName"),
        col("IPAddress"),
        col("Location"),
        col("ResultType"),
        col("ResultDescription"),
        when(col("ResultType") == "0", lit(True)).otherwise(lit(False)).alias("IsSuccess"),
        col("TenantId")
    )
    print(f"✓ Loaded SigninLogs: {signin_logs.count():,} records")
except Exception as e:
    print(f"⚠ Could not load SigninLogs: {e}")
    signin_logs = None

# Load non-interactive sign-in logs
try:
    noninteractive_logs = data_provider.read_table("AADNonInteractiveUserSignInLogs", WORKSPACE_NAME)
    noninteractive_logs = noninteractive_logs.filter(
        (col("TimeGenerated") >= lit(start_date)) &
        (col("TimeGenerated") <= lit(end_date))
    ).select(
        col("Id").alias("EventId"),
        col("TimeGenerated"),
        col("UserPrincipalName"),
        col("IPAddress"),
        col("Location"),
        col("ResultType"),
        col("ResultDescription"),
        when(col("ResultType") == "0", lit(True)).otherwise(lit(False)).alias("IsSuccess"),
        col("TenantId")
    )
    print(f"✓ Loaded AADNonInteractiveUserSignInLogs: {noninteractive_logs.count():,} records")
except Exception as e:
    print(f"⚠ Could not load AADNonInteractiveUserSignInLogs: {e}")
    noninteractive_logs = None

# Combine logs
if signin_logs is not None and noninteractive_logs is not None:
    all_signin_logs = signin_logs.union(noninteractive_logs)
elif signin_logs is not None:
    all_signin_logs = signin_logs
elif noninteractive_logs is not None:
    all_signin_logs = noninteractive_logs
else:
    raise RuntimeError("No sign-in logs available")

all_signin_logs = all_signin_logs.cache()
total_signin_count = all_signin_logs.count()
print(f"✓ Total sign-in events: {total_signin_count:,}")

if SHOW_DEBUG_LOGS:
    print("\nSample sign-in logs:")
    all_signin_logs.show(5, truncate=True)

## Detect Anomaly: Failed Attempts Followed by Success

In [None]:
# ===============================================================================
# ANOMALY DETECTION: FAILED THEN SUCCESS
# ===============================================================================

print("\nDetecting failed-then-success patterns...")

# Window for ordering events by user and IP over time
user_window = Window.partitionBy("UserPrincipalName", "IPAddress").orderBy("TimeGenerated")

# Frame covering every later event in the same partition. Using lead(..., 1) here
# would only see the single failure directly before each success, so the count of
# failures per success could never exceed 1.
following_window = user_window.rowsBetween(1, Window.unboundedFollowing)

# For every event, find the next successful sign-in for the same user and IP
signin_with_next = all_signin_logs.withColumn(
    "NextEventId",
    first(when(col("IsSuccess"), col("EventId")), ignorenulls=True).over(following_window)
).withColumn(
    "NextTime",
    first(when(col("IsSuccess"), col("TimeGenerated")), ignorenulls=True).over(following_window)
)

# Keep failed attempts whose next success happens within a reasonable time (1 hour)
failed_then_success = signin_with_next.filter(
    (col("IsSuccess") == False) &
    col("NextEventId").isNotNull() &
    ((unix_timestamp("NextTime") - unix_timestamp("TimeGenerated")) <= 3600)
)

# Group by user/IP and count consecutive failures before success
failed_counts = failed_then_success.groupBy(
    "UserPrincipalName", 
    "IPAddress",
    "NextEventId"
).agg(
    count("*").alias("FailedAttempts"),
    spark_min("TimeGenerated").alias("FirstFailure"),
    spark_max("NextTime").alias("SuccessTime"),
    first("Location").alias("Location"),
    first("TenantId").alias("TenantId"),
    collect_list("EventId").alias("FailedEventIds")
).filter(col("FailedAttempts") >= MIN_FAILED_ATTEMPTS)

# Create findings with properly structured EventReferences array
failed_then_success_findings = failed_counts.withColumn(
    "FindingId",
    concat_ws("-", lit("FTS"), col("UserPrincipalName"), col("IPAddress"), col("NextEventId"))
).withColumn(
    "AnomalyType", lit("FailedThenSuccess")
).withColumn(
    "EventReferences",
    array_distinct(concat(col("FailedEventIds"), array(col("NextEventId"))))
).select(
    "FindingId",
    "AnomalyType",
    "UserPrincipalName",
    "IPAddress",
    "Location",
    "FailedAttempts",
    lit(None).cast(DoubleType()).alias("TravelDistanceKm"),
    lit(None).cast(DoubleType()).alias("TravelVelocityKmh"),
    "FirstFailure",
    "SuccessTime",
    "EventReferences",
    "TenantId"
)

failed_then_success_count = failed_then_success_findings.count()
print(f"✓ Detected {failed_then_success_count:,} failed-then-success anomalies")

if SHOW_DEBUG_LOGS and failed_then_success_count > 0:
    print("\nSample findings:")
    failed_then_success_findings.show(5, truncate=True)
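
The intent of this cell is to count the failed sign-ins that precede a success for the same user and IP within one hour. That rule can be sketched in plain Python, which is convenient for unit-testing it without Spark. The function name `count_failures_before_success` and the tuple layout are hypothetical, chosen only for this illustration, and the events are made-up sample data.

```python
from datetime import datetime, timedelta


def count_failures_before_success(events, max_gap_seconds=3600):
    """Return (success_time, failed_count) for each success preceded by failures.

    `events` is a list of (timestamp, is_success) tuples for ONE user/IP pair.
    Only failures within `max_gap_seconds` before the success are counted.
    """
    results = []
    pending_failures = []  # failures seen since the last success
    for ts, is_success in sorted(events, key=lambda e: e[0]):
        if not is_success:
            pending_failures.append(ts)
            continue
        recent = [f for f in pending_failures
                  if (ts - f).total_seconds() <= max_gap_seconds]
        if recent:
            results.append((ts, len(recent)))
        pending_failures = []  # a success resets the run of failures
    return results


t0 = datetime(2024, 1, 1, 9, 0, 0)
sample = [
    (t0, False),
    (t0 + timedelta(minutes=1), False),
    (t0 + timedelta(minutes=2), False),
    (t0 + timedelta(minutes=3), True),           # three failures, then success
    (t0 + timedelta(hours=5), False),
    (t0 + timedelta(hours=5, minutes=1), True),  # only one failure before success
]

MIN_FAILED_ATTEMPTS = 3
flagged = [r for r in count_failures_before_success(sample)
           if r[1] >= MIN_FAILED_ATTEMPTS]
print(flagged)
```

Only the first success is flagged, since the second one is preceded by a single failure and falls below `MIN_FAILED_ATTEMPTS`.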

## Detect Anomaly: Impossible Travel

In [None]:
# ===============================================================================
# ANOMALY DETECTION: IMPOSSIBLE TRAVEL
# ===============================================================================

print("\nDetecting impossible travel patterns...")

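# Parse latitude and longitude out of the Location string.
# Assumes Location looks like "City, State, Country, Lat:XX.XX, Long:YY.YY".
# Rows without both coordinates are dropped by the filter below, so if Location
# holds only a country code in your workspace, this detection returns no findings
# and the coordinates may need to come from another column such as LocationDetails.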
signin_with_geo = all_signin_logs.filter(col("IsSuccess") == True).withColumn(
    "LocationParts", expr("split(Location, ',')")  
).withColumn(
    "Latitude",
    expr("cast(regexp_extract(Location, 'Lat:([\\-0-9\\.]+)', 1) as double)")
).withColumn(
    "Longitude",
    expr("cast(regexp_extract(Location, 'Long:([\\-0-9\\.]+)', 1) as double)")
).filter(
    col("Latitude").isNotNull() & col("Longitude").isNotNull()
)

# Get consecutive logins per user
user_time_window = Window.partitionBy("UserPrincipalName").orderBy("TimeGenerated")

signin_with_prev = signin_with_geo.withColumn(
    "PrevTime", lag("TimeGenerated", 1).over(user_time_window)
).withColumn(
    "PrevLat", lag("Latitude", 1).over(user_time_window)
).withColumn(
    "PrevLon", lag("Longitude", 1).over(user_time_window)
).withColumn(
    "PrevLocation", lag("Location", 1).over(user_time_window)
).withColumn(
    "PrevEventId", lag("EventId", 1).over(user_time_window)
).filter(
    col("PrevTime").isNotNull()
)

# Calculate distance and velocity
travel_analysis = signin_with_prev.withColumn(
    "TimeDiffHours",
    (unix_timestamp("TimeGenerated") - unix_timestamp("PrevTime")) / 3600.0
).withColumn(
    "DistanceKm",
    haversine_udf(col("PrevLat"), col("PrevLon"), col("Latitude"), col("Longitude"))
).withColumn(
    "VelocityKmh",
    when(col("TimeDiffHours") > 0, col("DistanceKm") / col("TimeDiffHours")).otherwise(lit(None))
)

# Identify impossible travel (velocity exceeds threshold and within time window)
impossible_travel = travel_analysis.filter(
    (col("VelocityKmh") > IMPOSSIBLE_TRAVEL_KM_PER_HOUR) &
    (col("TimeDiffHours") <= MAX_TRAVEL_TIME_HOURS)
)

# Create findings
impossible_travel_findings = impossible_travel.withColumn(
    "FindingId",
    concat_ws("-", lit("IT"), col("UserPrincipalName"), col("EventId"))
).withColumn(
    "AnomalyType", lit("ImpossibleTravel")
).withColumn(
    "EventReferences",
    array(col("PrevEventId"), col("EventId"))
).select(
    "FindingId",
    "AnomalyType",
    "UserPrincipalName",
    "IPAddress",
    concat_ws(" -> ", col("PrevLocation"), col("Location")).alias("Location"),
    lit(None).cast(IntegerType()).alias("FailedAttempts"),
    col("DistanceKm").alias("TravelDistanceKm"),
    col("VelocityKmh").alias("TravelVelocityKmh"),
    col("PrevTime").alias("FirstFailure"),
    col("TimeGenerated").alias("SuccessTime"),
    "EventReferences",
    "TenantId"
)

impossible_travel_count = impossible_travel_findings.count()
print(f"✓ Detected {impossible_travel_count:,} impossible travel anomalies")

if SHOW_DEBUG_LOGS and impossible_travel_count > 0:
    print("\nSample findings:")
    impossible_travel_findings.show(5, truncate=True)
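
The coordinate extraction in this cell depends on the `Location` string containing `Lat:` and `Long:` tokens, as stated in the cell's comment. A minimal sketch of the same regular expressions in plain Python shows what gets extracted and what happens when the tokens are absent. `parse_lat_lon` is a hypothetical helper, not part of the notebook, and the sample strings are made up.

```python
import re

# Same patterns as the regexp_extract calls in the cell above
LAT_RE = re.compile(r"Lat:([\-0-9.]+)")
LON_RE = re.compile(r"Long:([\-0-9.]+)")


def parse_lat_lon(location):
    """Return (lat, lon) as floats, or None if either coordinate is missing."""
    if not location:
        return None
    lat = LAT_RE.search(location)
    lon = LON_RE.search(location)
    if not lat or not lon:
        return None
    try:
        return float(lat.group(1)), float(lon.group(1))
    except ValueError:
        # A malformed number such as "1.2.3" matches the pattern but is not a float
        return None


print(parse_lat_lon("Seattle, Washington, US, Lat:47.61, Long:-122.33"))
print(parse_lat_lon("US"))
```

A string without coordinates yields `None`, which corresponds to the rows that the `isNotNull` filter removes before the travel analysis.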

## Enrich with Threat Intelligence

In [None]:
# ===============================================================================
# THREAT INTELLIGENCE ENRICHMENT
# ===============================================================================

print("\nEnriching findings with threat intelligence...")

# Combine all findings
all_findings = failed_then_success_findings.union(impossible_travel_findings)

if all_findings.count() == 0:
    print("⚠ No anomalies detected. Skipping threat intelligence enrichment.")
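    enriched_findings = all_findings.withColumn(
        "ThreatIntelMatch", lit(False)
    ).withColumn(
        # Cast so the column type matches the array<string> produced by the enriched path
        "ThreatActors", array().cast(ArrayType(StringType()))
    )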
else:
    # Load threat intelligence for IPs
    try:
        threat_intel = data_provider.read_table(THREAT_INTEL_TABLE, WORKSPACE_NAME)
        
        # Filter for IP-related indicators that are currently valid
        current_time = current_timestamp()
        ip_threat_intel = threat_intel.filter(
            (col("ObservableKey").isin(
                "ipv4-addr:value", 
                "ipv6-addr:value",
                "network-traffic:src_ref.value",
                "network-traffic:dst_ref.value"
            )) &
            (col("IsActive") == True) &
            (col("ValidFrom") <= current_time) &
            (col("ValidUntil") >= current_time)
        ).select(
            col("ObservableValue").alias("ThreatIP"),
            get_json_object(col("Data"), "$.threat_actors").alias("ThreatActorsJson"),
            col("ThreatType"),
            col("Confidence")
        ).withColumn(
            "ThreatActors",
            from_json(col("ThreatActorsJson"), ArrayType(StringType()))
        ).drop("ThreatActorsJson")
        
        ti_count = ip_threat_intel.count()
        print(f"✓ Loaded {ti_count:,} IP threat intelligence indicators")
        
        # Join findings with threat intel
        enriched_findings = all_findings.join(
            broadcast(ip_threat_intel),
            all_findings.IPAddress == ip_threat_intel.ThreatIP,
            "left"
        ).withColumn(
            "ThreatIntelMatch",
            when(col("ThreatIP").isNotNull(), lit(True)).otherwise(lit(False))
        ).select(
            all_findings["*"],
            "ThreatIntelMatch",
            coalesce(col("ThreatActors"), array()).alias("ThreatActors")
        )
        
        matched_count = enriched_findings.filter(col("ThreatIntelMatch") == True).count()
        print(f"✓ Matched {matched_count:,} findings with threat intelligence")
        
    except Exception as e:
        print(f"⚠ Could not load threat intelligence: {e}")
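        enriched_findings = all_findings.withColumn(
            "ThreatIntelMatch", lit(False)
        ).withColumn(
            # Cast so the column type matches the array<string> produced by the enriched path
            "ThreatActors", array().cast(ArrayType(StringType()))
        )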

if SHOW_DEBUG_LOGS:
    print("\nSample enriched findings:")
    enriched_findings.show(5, truncate=True)
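
The enrichment step is a left join: every finding is kept, and findings whose IP appears in a currently valid indicator get `ThreatIntelMatch = True`. The plain-Python sketch below mirrors that logic on invented sample data. Only the field names `IsActive`, `ValidFrom`, and `ValidUntil` mirror columns used in the cell; the `ip` and `actors` keys, the actor name, and the documentation-range IP addresses are assumptions made for the example.

```python
from datetime import datetime, timedelta

now = datetime(2024, 1, 15, 12, 0, 0)  # fixed "current time" for a reproducible example

# Invented sample indicators
indicators = [
    {"ip": "203.0.113.10", "IsActive": True,
     "ValidFrom": now - timedelta(days=5), "ValidUntil": now + timedelta(days=5),
     "actors": ["ExampleActor"]},
    {"ip": "198.51.100.7", "IsActive": True,
     "ValidFrom": now - timedelta(days=30), "ValidUntil": now - timedelta(days=1),
     "actors": []},  # expired, so it must not match
]

# Keep only indicators that are active and inside their validity window,
# merged into one entry per IP
valid = {}
for ind in indicators:
    if ind["IsActive"] and ind["ValidFrom"] <= now <= ind["ValidUntil"]:
        valid.setdefault(ind["ip"], set()).update(ind["actors"])

findings = [
    {"FindingId": "IT-a", "IPAddress": "203.0.113.10"},
    {"FindingId": "FTS-b", "IPAddress": "198.51.100.7"},
]

# Left-join semantics: every finding survives, matched or not
enriched = [
    {**f,
     "ThreatIntelMatch": f["IPAddress"] in valid,
     "ThreatActors": sorted(valid.get(f["IPAddress"], set()))}
    for f in findings
]
for row in enriched:
    print(row)
```

Collapsing indicators to one entry per IP before the lookup is a design choice of this sketch: it keeps a finding from being duplicated when several indicators share the same IP.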

## Calculate Risk Scores and Aggregate

In [None]:
# ===============================================================================
# RISK SCORING AND AGGREGATION
# ===============================================================================

print("\nCalculating risk scores...")

# Calculate risk score
findings_with_risk = enriched_findings.withColumn(
    "RiskScore",
    calculate_risk_udf(
        col("FailedAttempts"),
        col("TravelVelocityKmh"),
        col("ThreatIntelMatch")
    )
)

# Aggregate by finding (in case of duplicates)
final_findings = findings_with_risk.groupBy("FindingId").agg(
    first("AnomalyType").alias("AnomalyType"),
    first("UserPrincipalName").alias("UserPrincipalName"),
    first("IPAddress").alias("IPAddress"),
    first("Location").alias("Location"),
    first("RiskScore").alias("RiskScore"),
    first("FailedAttempts").alias("FailedAttempts"),
    first("TravelDistanceKm").alias("TravelDistanceKm"),
    first("TravelVelocityKmh").alias("TravelVelocityKmh"),
    first("ThreatIntelMatch").alias("ThreatIntelMatch"),
    first("ThreatActors").alias("ThreatActors"),
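    # De-duplicate: rows repeated by the threat intel join would otherwise repeat event IDs
    array_distinct(flatten(collect_list("EventReferences"))).alias("EventReferences"),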
    spark_min("FirstFailure").alias("FirstSeen"),
    spark_max("SuccessTime").alias("LastSeen"),
    count("*").alias("OccurrenceCount"),
    first("TenantId").alias("TenantId")
).withColumn(
    "JobId", lit(job_id)
).withColumn(
    "JobStartTime", job_start_time
).withColumn(
    "JobEndTime", current_timestamp()
).withColumn(
    "TimeGenerated", current_timestamp()
)

# Convert arrays to JSON for storage
final_findings = final_findings.withColumn(
    "ThreatActors", to_json(col("ThreatActors"))
).withColumn(
    "EventReferences", to_json(col("EventReferences"))
)

final_count = final_findings.count()
print(f"✓ Prepared {final_count:,} findings for ingestion")

if SHOW_STATS and final_count > 0:
    print("\n" + "="*80)
    print("FINDINGS SUMMARY")
    print("="*80)
    
    # Risk distribution
    print("\n📊 Risk Score Distribution:")
    final_findings.withColumn(
        "RiskBucket",
        when(col("RiskScore") >= 80, lit("Critical (80-100)"))
        .when(col("RiskScore") >= 60, lit("High (60-79)"))
        .when(col("RiskScore") >= 40, lit("Medium (40-59)"))
        .otherwise(lit("Low (0-39)"))
    ).groupBy("RiskBucket").count().orderBy("RiskBucket").show()
    
    # Anomaly type breakdown
    print("\n📊 Findings by Anomaly Type:")
    final_findings.groupBy("AnomalyType").agg(
        count("*").alias("Count"),
        avg("RiskScore").alias("AvgRiskScore")
    ).show()
    
    # Top users
    print("\n👤 Top 10 Users by Finding Count:")
    final_findings.groupBy("UserPrincipalName").count().orderBy(
        col("count").desc()
    ).limit(10).show(truncate=False)
    
    # Threat intel matches
    threat_matches = final_findings.filter(col("ThreatIntelMatch") == True).count()
    print(f"\n🎯 Threat Intelligence Matches: {threat_matches:,} / {final_count:,} ({threat_matches/final_count*100:.1f}%)")

if SHOW_DEBUG_LOGS:
    print("\nSample final findings:")
    final_findings.show(5, truncate=True)
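
To make the weighting concrete, the standalone sketch below restates the scoring rule from `calculate_risk_score` with the notebook's default weights and evaluates a few hand-picked inputs. The function name `risk_score` and the sample inputs are illustrative. Note that, as the rule is written, any velocity above the threshold already earns the full travel weight, because the scaled term is capped by `min`.

```python
WEIGHT_FAILED_ATTEMPTS = 30
WEIGHT_IMPOSSIBLE_TRAVEL = 40
WEIGHT_THREAT_INTEL = 30
IMPOSSIBLE_TRAVEL_KM_PER_HOUR = 800


def risk_score(failed_attempts, travel_velocity, has_threat_intel):
    """Plain-Python restatement of the notebook's scoring rule (0-100)."""
    score = 0
    if failed_attempts:
        # Linear ramp: 10 or more failures earn the full weight
        score += min(WEIGHT_FAILED_ATTEMPTS,
                     (failed_attempts / 10) * WEIGHT_FAILED_ATTEMPTS)
    if travel_velocity and travel_velocity > IMPOSSIBLE_TRAVEL_KM_PER_HOUR:
        excess = (travel_velocity - IMPOSSIBLE_TRAVEL_KM_PER_HOUR) / IMPOSSIBLE_TRAVEL_KM_PER_HOUR
        # (1 + excess * 0.5) is always greater than 1 here, so min() returns the full weight
        score += min(WEIGHT_IMPOSSIBLE_TRAVEL,
                     WEIGHT_IMPOSSIBLE_TRAVEL * (1 + excess * 0.5))
    if has_threat_intel:
        score += WEIGHT_THREAT_INTEL
    return min(100, int(score))


print(risk_score(5, None, False))       # 15: half of the failed-attempts weight
print(risk_score(None, 1600.0, False))  # 40: full travel weight
print(risk_score(5, None, True))        # 45: failures plus threat intel match
print(risk_score(12, 2000.0, True))     # 100: every factor at its maximum
```

With these defaults, a finding reaches the "Critical (80-100)" bucket only when at least two of the three factors contribute substantially.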

## Save Results to Log Analytics

In [None]:
# ===============================================================================
# SAVE RESULTS TO LOG ANALYTICS
# ===============================================================================

if final_count == 0:
    print("\n✓ No anomalies detected in this run. No data to save.")
else:
    print(f"\nSaving {final_count:,} findings to {RESULTS_TABLE}...")
    
    try:
        # Check if table exists
        existing_df = None
        table_exists = False
        
        try:
            existing_df = data_provider.read_table(RESULTS_TABLE, WORKSPACE_NAME)
            table_exists = existing_df.count() > 0
        except Exception:
            table_exists = False
        
        if not table_exists:
            # Create new table
            print(f"📁 Creating new results table: {RESULTS_TABLE}")
            data_provider.save_as_table(final_findings, RESULTS_TABLE, WORKSPACE_NAME)
            print(f"✓ Created table with {final_count:,} findings")
        else:
            # Table exists - deduplicate by FindingId
            existing_count = existing_df.count()
            print(f"📁 Found existing results table with {existing_count:,} records")
            
            # Anti-join to find new findings only
            new_findings = final_findings.join(
                existing_df.select("FindingId"),
                "FindingId",
                "leftanti"
            )
            
            new_count = new_findings.count()
            duplicate_count = final_count - new_count
            
            print(f"\n📈 Deduplication results:")
            print(f"  • Total findings: {final_count:,}")
            print(f"  • Duplicate findings (already exist): {duplicate_count:,}")
            print(f"  • New findings to add: {new_count:,}")
            
            if new_count > 0:
                # Append new findings
                data_provider.save_as_table(
                    new_findings,
                    RESULTS_TABLE,
                    WORKSPACE_NAME,
                    mode="append"
                )
                print(f"✓ Appended {new_count:,} new findings to table")
            else:
                print("✓ All findings already exist in table. No updates needed.")
        
        print(f"\n✅ Results saved successfully to workspace '{WORKSPACE_NAME}'")
        print(f"   Query your findings: {RESULTS_TABLE}")
        
    except Exception as e:
        print(f"\n✗ Error saving results: {e}")
        raise
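
The append path in this cell is an anti-join on `FindingId`: only findings whose ID is not already in the results table are written. The same idea in plain Python, using invented IDs and scores, looks like this.

```python
# IDs already stored in the results table (invented sample values)
existing_ids = {"FTS-alice-1", "IT-bob-7"}

# Findings produced by the current run
current_run = [
    {"FindingId": "FTS-alice-1", "RiskScore": 45},  # already stored
    {"FindingId": "IT-carol-3", "RiskScore": 70},   # new
]

# Anti-join: keep rows whose key has no match in the existing table
new_findings = [f for f in current_run if f["FindingId"] not in existing_ids]
duplicate_count = len(current_run) - len(new_findings)

print(f"new={len(new_findings)}, duplicates={duplicate_count}")
```

This works across runs because `FindingId` is built from the user, IP address, and event ID, so re-analyzing an overlapping lookback window reproduces the same IDs.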

## Summary and Performance Metrics

In [None]:
# ===============================================================================
# SUMMARY AND METRICS
# ===============================================================================

end_time = time.time()
elapsed_seconds = end_time - start_time
elapsed_minutes = elapsed_seconds / 60

print("\n" + "="*80)
print("JOB SUMMARY")
print("="*80)
print(f"Notebook version: {VERSION}")
print(f"Job ID: {job_id}")
print(f"Workspace: {WORKSPACE_NAME}")
print(f"Analysis period: {LOOKBACK_DAYS} days")
print(f"Runtime: {elapsed_minutes:.2f} minutes")
print("\n📊 Processing Summary:")
print(f"  • Sign-in events analyzed: {total_signin_count:,}")
print(f"  • Failed-then-success anomalies: {failed_then_success_count:,}")
print(f"  • Impossible travel anomalies: {impossible_travel_count:,}")
print(f"  • Total anomalies detected: {final_count:,}")
print(f"\n📁 Output:")
print(f"  • Table: {RESULTS_TABLE}")
print(f"  • Workspace: {WORKSPACE_NAME}")
print("="*80)
print("\n✓ Analysis complete!")