# 🧪 Microsoft Sentinel Data Lake Notebook Examples

Practical walkthrough covering each scenario from the [Microsoft Sentinel data lake notebook examples](https://learn.microsoft.com/azure/sentinel/datalake/notebook-examples).

## 🎯 Scenarios covered
- Explore Microsoft Entra ID groups
- Filter sign-ins for a specific user
- Examine sign-in location details
- Detect sign-ins from unusual countries
- Spot brute-force patterns
- Identify lateral movement attempts
- Surface credential dumping indicators

Each section reuses the zero-config loader to keep the workflow consistent with the rest of this notebook collection.

---

## 🚀 Quick Start
1. Attach the Sentinel kernel (medium pool recommended)
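2. Update the configuration variables in the first two code cells (replace the `TARGET_USER_UPN` placeholder and review the workspace settings)
   - Set `PRIMARY_WORKSPACE` to have that workspace tried first, before system-table discovery
   - Flip `AUTO_DISCOVER` to `False` if every table lives in that workspace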
3. Run cells in order – each scenario will gracefully handle missing tables

⚙️ **Notebook defaults:** No manual workspace mapping required – the helper functions auto-discover available workspaces just like the other notebooks in this repo.

In [None]:
# 🎯 Notebook parameters
TARGET_USER_UPN = "<add your UPN user here>"
print(f"TARGET_USER_UPN set to: {TARGET_USER_UPN}")

In [None]:
# 📦 Imports & zero-config setup
from sentinel_lake.providers import MicrosoftSentinelProvider
from pyspark.sql.functions import (
    col, count as spark_count, countDistinct, expr, when,
    lower, upper, lit, date_trunc, hour, to_date,
    from_json, schema_of_json
)
from pyspark.sql import DataFrame
from pyspark.sql.types import StructType, MapType, StringType
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import timedelta
import warnings
warnings.filterwarnings('ignore')

sns.set_theme(style='whitegrid')

data_provider = MicrosoftSentinelProvider(spark)
print('✅ Environment initialized (auto workspace detection enabled)')

PRIMARY_WORKSPACE = 'ak-SecOps'  # Replace with your own workspace name, or set to None to rely on auto-discovery
AUTO_DISCOVER = True  # Set to False to skip probing System Tables
FALLBACK_WORKSPACES = ['default']  # Additional workspaces to probe if needed
ANALYSIS_HOURS = 72

def try_read(table_name: str, workspace: str | None):
    if workspace:
        return data_provider.read_table(table_name, workspace)
    return data_provider.read_table(table_name)

def smart_load(table_name: str):
    last_error = None

    def attempt(workspace: str | None):
        nonlocal last_error
        try:
            df = try_read(table_name, workspace)
            return df, (workspace or 'auto'), None
        except Exception as e:
            last_error = str(e)
            return None

    if PRIMARY_WORKSPACE:
        result = attempt(PRIMARY_WORKSPACE)
        if result:
            return result

    if AUTO_DISCOVER:
        result = attempt(None)
        if result:
            return result

    for ws in FALLBACK_WORKSPACES:
        if PRIMARY_WORKSPACE and ws == PRIMARY_WORKSPACE:
            continue
        result = attempt(ws)
        if result:
            return result

    return None, None, last_error

def summarize_table(name: str, df: DataFrame | None, workspace: str | None, err: str | None):
    if df is None:
        print(f'❌ {name} not available: {err}')
        return False
    rows = df.count()
    print(f'✅ {name} loaded from workspace={workspace} ({rows:,} rows)')
    return rows > 0

print('🎯 Ready to explore all notebook scenarios')
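
The order in which `smart_load` tries workspaces (primary workspace, then auto-discovery, then each fallback) can be illustrated without a Spark session. The following is a minimal sketch, not the loader itself: `resolve_order`, `load_first_available`, and `fake_read` are hypothetical names, and `fake_read` merely stands in for `data_provider.read_table`.

```python
def resolve_order(primary, auto_discover, fallbacks):
    """Return the workspaces to try, in order (None means auto-discovery)."""
    order = []
    if primary:
        order.append(primary)
    if auto_discover:
        order.append(None)
    # Skip a fallback that duplicates the primary workspace
    order.extend(ws for ws in fallbacks if ws != primary)
    return order


def load_first_available(table_name, order, read):
    """Try each workspace in turn; return (data, workspace_label, last_error)."""
    last_error = None
    for workspace in order:
        try:
            return read(table_name, workspace), (workspace or 'auto'), None
        except Exception as exc:  # broad on purpose, mirroring smart_load
            last_error = str(exc)
    return None, None, last_error


# Hypothetical stand-in for data_provider.read_table: only 'default' has the table
def fake_read(table_name, workspace):
    if workspace != 'default':
        raise LookupError(f'{table_name} not found in {workspace or "auto"}')
    return ['row-1', 'row-2']


order = resolve_order('ak-SecOps', True, ['default'])
data, used_workspace, error = load_first_available('SigninLogs', order, fake_read)
print(order, used_workspace, error)
```

When every attempt fails, the last error message is returned in the third slot, which is what `summarize_table` prints.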

## 1. Microsoft Entra ID Group Inventory
Discover group metadata (display name, group types, description) directly from the `EntraGroups` table in the data lake tier.

In [None]:
entra_groups, entra_workspace, entra_err = smart_load('EntraGroups')
if summarize_table('EntraGroups', entra_groups, entra_workspace, entra_err):
    sample_cols = ['displayName', 'groupTypes', 'mail', 'mailNickname', 'description', 'tenantId']
    available_cols = [c for c in sample_cols if c in entra_groups.columns]
    if available_cols:
        entra_groups.select(*available_cols).show(25, truncate=False)
    else:
        print('⚠️ Expected columns not found in EntraGroups table')

## 2. Sign-ins for a specific user
Start with an overview of recent sign-ins, then drill into a specific account when you assign `TARGET_USER_UPN` (for example, by running `TARGET_USER_UPN = "alice@contoso.com"` in a cell above this one).

In [None]:
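TARGET_USER_UPN = globals().get('TARGET_USER_UPN')
if TARGET_USER_UPN is not None:
    TARGET_USER_UPN = str(TARGET_USER_UPN).strip() or None
# Treat the unedited "<...>" placeholder from the parameters cell as "not set"
if TARGET_USER_UPN and TARGET_USER_UPN.startswith('<') and TARGET_USER_UPN.endswith('>'):
    TARGET_USER_UPN = None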

signin_df, signin_workspace, signin_err = smart_load('SigninLogs')

if summarize_table('SigninLogs', signin_df, signin_workspace, signin_err):
    if TARGET_USER_UPN:
        focused = signin_df.filter(lower(col('UserPrincipalName')) == lower(lit(TARGET_USER_UPN)))
        total = focused.count()
        print(f'📊 Sign-ins for {TARGET_USER_UPN}: {total:,}')
        if total > 0:
            # SigninLogs names this column IPAddress; the membership check below is case-sensitive
            display_cols = ['TimeGenerated', 'ResultType', 'AppDisplayName', 'IPAddress', 'Location']
            available = [c for c in display_cols if c in focused.columns]
            focused.orderBy(col('TimeGenerated').desc()).select(*available).show(20, truncate=False)
        else:
            print('ℹ️ No sign-in records found for the specified user in the current retention window')
    else:
        if 'UserPrincipalName' not in signin_df.columns:
            print('⚠️ UserPrincipalName column missing – unable to summarise sign-ins by user')
        else:
            print('📌 Showing top sign-in activity. Set TARGET_USER_UPN = "user@domain" and rerun to drill into a specific account.')
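            # Only count distinct apps when the column exists, matching the guard used below
            user_aggs = [spark_count('*').alias('SignIns')]
            if 'AppDisplayName' in signin_df.columns:
                user_aggs.append(countDistinct('AppDisplayName').alias('DistinctApps'))
            user_counts = (
                signin_df
                .groupBy('UserPrincipalName')
                .agg(*user_aggs)
                .orderBy(col('SignIns').desc())
                .limit(15)
            )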
            user_counts.show(truncate=False)
            if 'AppDisplayName' in signin_df.columns:
                app_counts = (
                    signin_df
                    .groupBy('AppDisplayName')
                    .agg(spark_count('*').alias('SignIns'))
                    .orderBy(col('SignIns').desc())
                    .limit(10)
                )
                print('\n🎯 Top applications by sign-in volume:')
                app_counts.show(truncate=False)
            if 'ResultType' in signin_df.columns:
                result_counts = (
                    signin_df
                    .groupBy('ResultType')
                    .agg(spark_count('*').alias('Attempts'))
                    .orderBy(col('Attempts').desc())
                )
                print('\n🧾 Result type breakdown:')
                result_counts.show(truncate=False)

## 3. Sign-in location details
Parse the LocationDetails JSON blob to extract city, state, and country information for visualization or enrichment.

In [None]:
if 'signin_df' not in globals():
    signin_df, signin_workspace, signin_err = smart_load('SigninLogs')

if summarize_table('SigninLogs', signin_df, signin_workspace, signin_err):
    if 'LocationDetails' not in signin_df.columns:
        print('⚠️ LocationDetails column absent – skipping extraction')
    else:
        sample_row = signin_df.select('LocationDetails').dropna().limit(1).collect()
        if not sample_row:
            print('ℹ️ No LocationDetails data to parse')
        else:
            sample_json = sample_row[0]['LocationDetails']
            schema_literal = spark.read.json(spark.sparkContext.parallelize([sample_json])).schema
            location_expanded = signin_df.select(
                'TimeGenerated',
                'UserPrincipalName',
                from_json(col('LocationDetails'), schema_literal).alias('Location'),
                'AppDisplayName',
                'IPAddress',
                'ResultType'
            )
            expanded_cols = [
                col('Location.city').alias('City'),
                col('Location.state').alias('State'),
                col('Location.countryOrRegion').alias('CountryOrRegion'),
                col('Location.geoCoordinates.latitude').alias('Latitude'),
                col('Location.geoCoordinates.longitude').alias('Longitude')
            ]
            location_expanded.select('TimeGenerated', 'UserPrincipalName', *expanded_cols).show(20, truncate=False)
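
To see what the `from_json` step yields for a single record, the same fields can be pulled out of one JSON string in plain Python. This is a sketch under the assumption that `LocationDetails` arrives as a JSON string with the keys referenced in the cell above; the sample values and the `flatten_location` helper are invented for illustration.

```python
import json

# Invented sample; the key names match those selected in the cell above
sample_location = json.dumps({
    'city': 'Seattle',
    'state': 'Washington',
    'countryOrRegion': 'US',
    'geoCoordinates': {'latitude': 47.6, 'longitude': -122.3},
})


def flatten_location(raw):
    """Flatten one LocationDetails JSON string; bad or missing input yields None values."""
    try:
        parsed = json.loads(raw) if raw else {}
    except (TypeError, ValueError):
        parsed = {}
    if not isinstance(parsed, dict):
        parsed = {}
    geo = parsed.get('geoCoordinates')
    if not isinstance(geo, dict):
        geo = {}
    return {
        'City': parsed.get('city'),
        'State': parsed.get('state'),
        'CountryOrRegion': parsed.get('countryOrRegion'),
        'Latitude': geo.get('latitude'),
        'Longitude': geo.get('longitude'),
    }


flat = flatten_location(sample_location)
print(flat)
```

The helper returns `None` for absent fields rather than raising, so a record with an empty or malformed blob still produces a row.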

## 4. Sign-ins from unusual countries
Flag users whose sign-ins span more distinct countries than `UNUSUAL_COUNTRY_THRESHOLD` (3 by default). The check is a simple count per user, not a comparison against each user's historical baseline.

In [None]:
UNUSUAL_COUNTRY_THRESHOLD = 3
if 'signin_df' not in globals():
    signin_df, signin_workspace, signin_err = smart_load('SigninLogs')

if summarize_table('SigninLogs', signin_df, signin_workspace, signin_err):
    country_df = None

    if 'Location' in signin_df.columns:
        loc_field_type = signin_df.schema['Location'].dataType
        if isinstance(loc_field_type, StructType):
            country_df = signin_df.select(
                col('UserPrincipalName'),
                col('Location.countryOrRegion').alias('Country')
            )
        elif isinstance(loc_field_type, MapType):
            country_df = signin_df.select(
                col('UserPrincipalName'),
                col("Location['countryOrRegion']").alias('Country')
            )
        elif isinstance(loc_field_type, StringType):
            sample_value = signin_df.select('Location').dropna().limit(1).collect()
            if sample_value:
                sample_json = sample_value[0]['Location']
                trimmed = sample_json.strip()
                if trimmed.startswith('{') and trimmed.endswith('}'):
                    inferred_schema = spark.read.json(
                        spark.sparkContext.parallelize([trimmed])
                    ).schema
                    country_df = signin_df.select(
                        col('UserPrincipalName'),
                        from_json(col('Location'), inferred_schema).alias('ParsedLocation')
                    ).select(
                        col('UserPrincipalName'),
                        col('ParsedLocation.countryOrRegion').alias('Country')
                    )
                else:
                    country_df = signin_df.select(
                        col('UserPrincipalName'),
                        col('Location').alias('Country')
                    )
    if country_df is None and 'LocationDetails' in signin_df.columns:
        sample_value = signin_df.select('LocationDetails').dropna().limit(1).collect()
        if sample_value:
            sample_json = sample_value[0]['LocationDetails']
            inferred_schema = spark.read.json(
                spark.sparkContext.parallelize([sample_json])
            ).schema
            country_df = signin_df.select(
                col('UserPrincipalName'),
                from_json(col('LocationDetails'), inferred_schema).alias('LocationDetailsStruct')
            ).select(
                col('UserPrincipalName'),
                col('LocationDetailsStruct.countryOrRegion').alias('Country')
            )

    if country_df is None:
        print('⚠️ Unable to derive country information from Location/LocationDetails columns')
    else:
        country_counts = country_df.dropna()
        per_user = country_counts.groupBy('UserPrincipalName').agg(countDistinct('Country').alias('DistinctCountries'))
        unusual = per_user.filter(col('DistinctCountries') > UNUSUAL_COUNTRY_THRESHOLD)
        total = unusual.count()
        if total > 0:
            print(f'🚨 Users exceeding country threshold ({UNUSUAL_COUNTRY_THRESHOLD}): {total}')
            unusual.orderBy(col('DistinctCountries').desc()).show(20, truncate=False)
        else:
            print('✅ No users exceeded the unusual country threshold')
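
The threshold logic in the cell above reduces to counting distinct countries per user and keeping those strictly above the limit. A small plain-Python sketch of that rule follows; the sample rows and the `users_over_threshold` helper are invented for illustration.

```python
from collections import defaultdict

UNUSUAL_COUNTRY_THRESHOLD = 3

# Invented (user, country) pairs standing in for country_df rows
signins = [
    ('alice@contoso.com', 'US'), ('alice@contoso.com', 'US'),
    ('bob@contoso.com', 'US'), ('bob@contoso.com', 'DE'),
    ('bob@contoso.com', 'BR'), ('bob@contoso.com', 'JP'),
    ('carol@contoso.com', None),
]


def users_over_threshold(rows, threshold):
    """Map each user with more than `threshold` distinct countries to that count."""
    countries = defaultdict(set)
    for user, country in rows:
        # Mirror dropna(): skip rows missing either value
        if user is not None and country is not None:
            countries[user].add(country)
    return {u: len(c) for u, c in countries.items() if len(c) > threshold}


flagged = users_over_threshold(signins, UNUSUAL_COUNTRY_THRESHOLD)
print(flagged)
```

Because the comparison is strict, a user with exactly three countries is not flagged at the default threshold.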

## 5. Brute-force activity (multiple failed sign-ins)
Flag accounts whose failed sign-in count reaches `FAIL_THRESHOLD`, and report how many distinct IP addresses those failures came from.

In [None]:
FAIL_THRESHOLD = 20
if 'signin_df' not in globals():
    signin_df, signin_workspace, signin_err = smart_load('SigninLogs')

if summarize_table('SigninLogs', signin_df, signin_workspace, signin_err):
    required = {'ResultType', 'UserPrincipalName', 'IPAddress'}
    if not required.issubset(signin_df.columns):
        print('⚠️ Missing ResultType/UserPrincipalName/IPAddress columns')
    else:
        # Compare as strings so the success check does not depend on how ResultType is typed
        failures = signin_df.filter(~col('ResultType').cast('string').isin('0', 'Success', 'success'))
        grouped = failures.groupBy('UserPrincipalName').agg(
            spark_count('*').alias('FailedAttempts'),
            countDistinct('IPAddress').alias('DistinctFailIPs')
        )
        suspects = grouped.filter(col('FailedAttempts') >= FAIL_THRESHOLD)
        suspect_total = suspects.count()
        if suspect_total > 0:
            print(f'🚨 Accounts with >= {FAIL_THRESHOLD} failures: {suspect_total}')
            suspects.orderBy(col('FailedAttempts').desc()).show(20, truncate=False)
        else:
            print('✅ No accounts crossed the brute-force threshold')
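
The failure test depends on how `ResultType` is represented, so it can help to check the rule on a handful of rows first. Below is a plain-Python sketch of the same aggregation; the rows, the `is_failure` helper, and the lowered threshold are assumptions made for the example.

```python
from collections import Counter, defaultdict

FAIL_THRESHOLD = 3  # lowered from the notebook's 20 so the tiny sample triggers
SUCCESS_VALUES = {'0', 'success'}


def is_failure(result_type):
    """Anything other than 0, '0' or 'Success' counts as a failure; None is ignored."""
    if result_type is None:
        return False
    return str(result_type).strip().lower() not in SUCCESS_VALUES


# Invented (user, result_type, ip) rows; result types appear as both str and int
rows = [
    ('bob@contoso.com', '50126', '203.0.113.5'),
    ('bob@contoso.com', '50126', '203.0.113.6'),
    ('bob@contoso.com', 50126, '203.0.113.5'),
    ('bob@contoso.com', '0', '198.51.100.7'),
    ('alice@contoso.com', 0, '198.51.100.8'),
    ('alice@contoso.com', '50126', '198.51.100.8'),
]

failed = Counter()
fail_ips = defaultdict(set)
for user, result_type, ip in rows:
    if is_failure(result_type):
        failed[user] += 1
        fail_ips[user].add(ip)

# Keep users at or above the threshold as (failed attempts, distinct failing IPs)
suspects = {u: (n, len(fail_ips[u])) for u, n in failed.items() if n >= FAIL_THRESHOLD}
print(suspects)
```

Normalizing to a lowercase string before comparing keeps the check stable whether the column holds integers or text.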

## 6. Lateral movement attempts
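Inspect DeviceNetworkEvents for repeated connections on ports commonly used for lateral movement: SMB (445), RDP (3389), and WinRM (5985/5986). The cell filters by port only and does not check whether the destination is internal.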

In [None]:
network_df, network_ws, network_err = smart_load('DeviceNetworkEvents')
if summarize_table('DeviceNetworkEvents', network_df, network_ws, network_err):
    required_cols = {'InitiatingProcessAccountName', 'RemoteUrl', 'RemotePort', 'Protocol', 'ReportId'}
    available = set(network_df.columns)
    if not required_cols.issubset(available):
        print('⚠️ Required columns missing – adjust schema or ingestion settings')
    else:
        suspicious_ports = [3389, 5985, 5986, 445]
        lateral = network_df.filter(col('RemotePort').isin(suspicious_ports))
        aggregated = lateral.groupBy('InitiatingProcessAccountName', 'RemoteUrl', 'RemotePort').agg(
            spark_count('*').alias('Events'),
            countDistinct('ReportId').alias('DistinctReports')
        )
        flagged = aggregated.filter((col('Events') >= 10) | (col('DistinctReports') >= 5))
        count_flagged = flagged.count()
        if count_flagged > 0:
            print(f'🚨 Potential lateral movement connections: {count_flagged}')
            flagged.orderBy(col('Events').desc()).show(20, truncate=False)
        else:
            print('✅ No high-volume suspicious lateral connections detected (threshold: >=10 events or >=5 reports)')
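
Since the cell above matches on port alone, one way to narrow results toward host-to-host movement is to keep only private destination addresses. The sketch below uses Python's standard `ipaddress` module; the sample events and the `is_internal` helper are invented, and a real pipeline would need a destination address column (for example `RemoteIP`, which this notebook does not reference, so treat that name as an assumption).

```python
import ipaddress

SUSPICIOUS_PORTS = {445, 3389, 5985, 5986}


def is_internal(address):
    """True for private addresses; False for public, loopback, link-local or invalid input."""
    if not address:
        return False
    try:
        ip = ipaddress.ip_address(address)
    except ValueError:
        return False
    return ip.is_private and not ip.is_loopback and not ip.is_link_local


# Invented (account, remote_ip, remote_port) events
events = [
    ('svc-backup', '10.0.4.17', 445),
    ('svc-backup', '52.96.0.1', 445),    # monitored port, external destination
    ('jdoe', '192.168.1.20', 3389),
    ('jdoe', '192.168.1.20', 443),       # internal, but not a monitored port
]

internal_hits = [e for e in events if e[2] in SUSPICIOUS_PORTS and is_internal(e[1])]
print(internal_hits)
```

Only the two events that combine a monitored port with a private destination remain.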

## 7. Credential dumping indicators
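Search DeviceProcessEvents for known credential dumping tools and suspicious access to LSASS. The match is a heuristic: the `lsass` term also hits legitimate `lsass.exe` activity, so review results before escalating.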

In [None]:
process_df, process_ws, process_err = smart_load('DeviceProcessEvents')
if summarize_table('DeviceProcessEvents', process_df, process_ws, process_err):
    required_cols = {'FileName', 'ProcessCommandLine', 'InitiatingProcessFileName', 'DeviceName', 'AccountName'}
    if not required_cols.issubset(process_df.columns):
        print('⚠️ DeviceProcessEvents missing required fields for this heuristic')
    else:
        pattern_terms = [
            'mimikatz',
            'procdump',
            'lsass',
            'comsvcs\\.dll',
            'rundll32\\.exe.*comsvcs\\.dll.*minidump'
        ]
        pattern_regex = '(' + '|'.join(pattern_terms) + ')'
        indicators = process_df.filter(
            lower(col('FileName')).rlike(pattern_regex) |
            lower(col('ProcessCommandLine')).rlike(pattern_regex)
        )
        total_hits = indicators.count()
        if total_hits > 0:
            print(f'🚨 Credential dumping indicators detected: {total_hits}')
            indicators.select('TimeGenerated', 'DeviceName', 'AccountName', 'FileName', 'ProcessCommandLine').orderBy(col('TimeGenerated').desc()).show(25, truncate=False)
        else:
            print('✅ No credential dumping indicators detected with the current heuristics')
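
Before running the pattern across a large table, it can be checked against a few command lines with Python's `re` module. This is a sketch: the sample strings are invented, and `re.search` is used here as a stand-in for Spark's `rlike`, which also matches anywhere in the string.

```python
import re

# Same terms as the cell above
pattern_terms = [
    'mimikatz',
    'procdump',
    'lsass',
    'comsvcs\\.dll',
    'rundll32\\.exe.*comsvcs\\.dll.*minidump',
]
pattern = re.compile('(' + '|'.join(pattern_terms) + ')')

# Invented command lines mapped to whether a match is expected
samples = {
    'rundll32.exe C:\\Windows\\System32\\comsvcs.dll, MiniDump 624 out.dmp full': True,
    'procdump64.exe -ma lsass.exe lsass.dmp': True,
    'notepad.exe report.txt': False,
    'C:\\Windows\\System32\\lsass.exe': True,  # a plain LSASS path also matches
}

# Lowercase first, as the notebook does with lower(col(...))
results = {cmd: bool(pattern.search(cmd.lower())) for cmd in samples}
for cmd, hit in results.items():
    print(f'{hit!s:5} {cmd}')
```

The last sample shows why the bare `lsass` term is noisy: it matches the process path itself, not only dump attempts.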

## 📋 Wrap-up
- Update thresholds to fit your organization's baseline
- Promote interesting findings to the analytics tier for detections
- Convert cells into scheduled jobs where recurring monitoring is required

🛠️ This notebook follows the same pattern as the rest of the collection – feel free to adapt or extend each scenario into a dedicated hunting playbook.