# 🔐 Password Spray Features Notebook

## 📖 Overview
This notebook generates the **Password Spray Features Table** by combining recent raw `SigninLogs` with historical rollups from `signin_summary_daily`.  
It enriches per-IP behavior with normalized metrics, a composite spray score, and categorical labels for triage.  

This design allows efficient detection of:
- 🚨 **High-volume sprays** (short bursts visible in raw logs)  
- 🐢 **Low-and-slow sprays** (persistent activity captured across the `lookback_days` window via the summary table)  

---

## 🎯 Objectives
- ✅ Load recent raw `SigninLogs` and expand JSON location details into structured fields (City, Country, ASN).  
- ✅ Load historical `signin_summary_daily` for N-day lookback without rescanning raw logs.  
- ✅ Compute per-IP aggregated features: attempts, successes, distinct users, days active, entropy.  
- ✅ Normalize features and calculate a **spray_score** using weighted components:  
  - `distinct_users_norm` → breadth of targeted accounts  
  - `success_rate` → effectiveness of attempts  
  - `entropy_norm` → randomness/distribution of usernames  
- ✅ Assign categorical labels (**LOW / MEDIUM / HIGH**) based on thresholds.  
- ✅ Write enriched feature rows into `password_spray_features`, partitioned by run date.  
- ✅ Provide schema + sample preview for validation.  

---

## 🛠️ Workflow
1. 🔧 **Parameters & Config** — define lookback period, table names, selected fields.  
2. 📥 **Load Recent Raw Logs** — ingest `SigninLogs` for the last few hours/day, expand location JSON into structured columns.  
3. 🧮 **Aggregate Recent Data** — compute per-IP counts (attempts, distinct users, successes, first/last seen) and entropy.  
4. 📅 **Load Historical Summary** — pull per-IP daily rollups from `signin_summary_daily` covering the lookback window.  
5. 🔗 **Combine History + Recent** — merge both datasets into a unified summary.  
6. 📊 **Generate Features** — aggregate across the merged window to produce normalized features.  
7. 🎯 **Compute Spray Score & Labels** — apply formula and assign LOW/MEDIUM/HIGH classification.  
8. 💾 **Write to Features Table** — save results into `password_spray_features`, partitioned by run_date.  
9. 👀 **Preview Outputs** — inspect schema and sample rows for validation.  
10. ⚠️ **Optional Reset** — commented safeguard to delete table if re-run is required.  

---

## 🗺️ Data Flow (Mermaid Diagram)
```mermaid
flowchart TD
    %% Inputs
    A[📥 SigninLogs Raw Data - Recent] --> A1[📑 Select Columns<br/>User, IP, ASN, Location, Result*]
    B[💾 signin_summary_daily Historical Rollups] --> B1[📅 Filter Last N Days<br/>lookback window]

    %% Recent Processing
    A1 --> C[🧮 Aggregate per IP Recent<br/>attempts_total, success_count,<br/>distinct_users, first_seen, last_seen]
    A1 --> D[🧮 Username Entropy Recent<br/>calc p=count/sum, -Σ p*log2 p]
    C --> E[🔗 Join Aggregates and Entropy]
    D --> E

    %% Combine with History
    B1 --> F[🔗 Union Recent Summary and Historical Summary]
    E --> F

    %% Feature Aggregates
    F --> G[📊 Group per IP<br/>Σ attempts_total, Σ success_count,<br/>Σ distinct_users, count days_active,<br/>avg entropy, first_seen, last_seen]

    %% Normalization & Scoring
    G --> H[⚖️ Normalize Features<br/>distinct_users_norm, success_rate,<br/>entropy_norm]
    H --> I[🎯 Compute Spray Score<br/>0.5*users_norm + 0.2* 1-success_rate<br/>+ 0.3*entropy_norm]
    I --> J[🏷️ Assign Label by Score<br/>LOW <0.3, MEDIUM <0.6, HIGH]

    %% Output
    J --> K[💾 password_spray_features<br/>partitioned by run_date]

```
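
The scoring and labeling steps at the end of the diagram can be sketched in plain Python. This is a minimal illustration of the weighted formula and thresholds shown above, not the Spark implementation itself; the function names and the input values are hypothetical.

```python
def spray_score(distinct_users_norm, success_rate, entropy_norm):
    """Weighted score: breadth of users, failure rate, and username entropy."""
    score = (
        0.5 * distinct_users_norm
        + 0.2 * (1 - success_rate)
        + 0.3 * entropy_norm
    )
    return round(score, 2)


def spray_label(score):
    """Map a score to the LOW / MEDIUM / HIGH buckets used in this notebook."""
    if score < 0.3:
        return "LOW"
    if score < 0.6:
        return "MEDIUM"
    return "HIGH"


# Hypothetical IP that touched many accounts, almost always failed,
# and spread its attempts evenly across usernames.
noisy = spray_score(distinct_users_norm=0.8, success_rate=0.05, entropy_norm=0.9)

# Hypothetical IP with few accounts and mostly successful sign-ins.
quiet = spray_score(distinct_users_norm=0.1, success_rate=0.9, entropy_norm=0.2)

print(noisy, spray_label(noisy))
print(quiet, spray_label(quiet))
```

Because each input lies in the range 0 to 1 and the weights sum to 1.0, the score also stays between 0 and 1, which is what makes fixed thresholds such as 0.3 and 0.6 meaningful.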

# 🔧 Parameters & Config

This section defines the **runtime parameters** and ensures the notebook can be reused or adapted easily.  
It sets up **time windows, input/output tables, and selected fields** before processing begins.

- ⏱️ **Time ranges**  
  - `run_start` → start of the current processing window (last 4 hours by default).  
  - `run_end` → end of the current processing window.  
  - `lookback_days` → number of historical days pulled from `signin_summary_daily` to combine with fresh data.  

- 🗂️ **Table names**  
  - Raw input → `SigninLogs` (fresh, high-fidelity events).  
  - Historical summary → `signin_summary_daily` (compact daily rollups).  
  - Output → `password_spray_features` (normalized features, spray score, and labels).  

- 📝 **Selected fields**  
  - Core identity fields → `UserPrincipalName`, `UserDisplayName`, `UserType`.  
  - Request context → `IPAddress`, `ResultType`, `ResultSignature`, `Status`, `UserAgent`.  
  - Enrichment fields → `AutonomousSystemNumber`, `LocationDetails` (JSON expanded into City, Country, Latitude, Longitude).  

- ⚙️ **Write options**  
  - Append mode is used, so results from earlier runs are preserved; each row carries the `run_date` of the run that produced it.  

---

✅ With this modular setup, analysts can quickly adjust **time windows, lookback period, field selection, or output tables** without modifying the core feature engineering logic.

In [None]:
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from sentinel_lake.providers import MicrosoftSentinelProvider

# Initialize provider
data_provider = MicrosoftSentinelProvider(spark)

# -----------------------------
# Parameters
# -----------------------------

# Time range: the most recent 4 hours of fresh data, ending at the top of the current hour
run_end   = datetime.now().replace(minute=0, second=0, microsecond=0)
run_start = run_end - timedelta(hours=4)
lookback_days = 90

# Workspace name (update for your environment)
workspace_name = "Woodgrove-LogAnalyiticsWorkspace"  # Replace with your actual workspace name

# Table names (easy to swap)
input_table_raw  = "SigninLogs"
input_table_summary = "signin_summary_daily"
output_datalake_table_features = "password_spray_features" 

# Write options (append mode keeps history)
write_options = {"mode": "append"}

# -----------------------------
# Field selection
# -----------------------------
# Core fields + native enrichment (ASN, geolocation)
signin_fields = [
    "TimeGenerated",
    "UserPrincipalName",
    "UserDisplayName",
    "IPAddress",
    "ResultType",
    "ResultSignature",
    "Status",
    "UserType",
    "UserAgent",
    # Geo & ASN enrichment, natively available in SigninLogs
    "LocationDetails",         # JSON object with location info: city, state, countryOrRegion, geoCoordinates
    "AutonomousSystemNumber"
]


# -----------------------------
# Status output
# -----------------------------
print("📅 Window Parameters")
print(f"   Start → End: {run_start} → {run_end}")
print(f"   Lookback:   {lookback_days} days\n")

print("📂 Tables")
print(f"   Input raw:       {input_table_raw}")
print(f"   Input summary:   {input_table_summary}")
print(f"   Output features: {output_datalake_table_features}\n")

print("📝 Selected Fields:")
print("   " + ", ".join(signin_fields))

## 📥 Load Raw SigninLogs

We read the source authentication data from the Azure AD `SigninLogs` table.  

- 📌 **Purpose**: Focus only on fields relevant to password spray detection.  
- 🌍 **Enrichment**: Expand the `LocationDetails` JSON into flat columns (City, Country, Latitude, Longitude).  
- 🔢 **Network context**: Capture `AutonomousSystemNumber (ASN)` for attribution to ISPs and hosting providers.  
- 📅 **Date column**: Add a derived `date` column (from `TimeGenerated`) to support partition pruning and time-based queries.  

✅ At this stage, the data is **normalized and structured**, ready for the feature engineering steps that follow.
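
To make the expansion concrete, here is a plain-Python sketch of what flattening `LocationDetails` produces for a single record. It uses the same JSON keys the notebook reads (`city`, `countryOrRegion`, `geoCoordinates.latitude`, `geoCoordinates.longitude`); the helper name and the sample payload are made up for illustration.

```python
import json


def expand_location(location_details):
    """Flatten a LocationDetails JSON string into the columns used here."""
    doc = json.loads(location_details) if location_details else {}
    geo = doc.get("geoCoordinates") or {}
    return {
        "City": doc.get("city"),
        "Country": doc.get("countryOrRegion"),
        "Latitude": geo.get("latitude"),
        "Longitude": geo.get("longitude"),
    }


# Illustrative payload only; real records may carry more or fewer keys.
sample = (
    '{"city": "Seattle", "state": "Washington", "countryOrRegion": "US", '
    '"geoCoordinates": {"latitude": 47.6, "longitude": -122.3}}'
)

print(expand_location(sample))
print(expand_location(None))  # missing JSON yields None for every column
```

Missing keys come back as `None`, much as a missing JSON path produces a null column value in Spark, so downstream grouping should expect nulls in `City` and `Country`.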

---

In [None]:
df_recent_raw = (
    data_provider.read_table(input_table_raw, workspace_name)
    .select(*signin_fields)
    .filter((F.col("TimeGenerated") >= F.lit(run_start)) &
            (F.col("TimeGenerated") <  F.lit(run_end)))
    .withColumn("date", F.to_date("TimeGenerated"))
    .withColumn("City", F.get_json_object("LocationDetails", "$.city"))
    .withColumn("Country", F.get_json_object("LocationDetails", "$.countryOrRegion"))
    .withColumn("Latitude", F.get_json_object("LocationDetails", "$.geoCoordinates.latitude").cast("double"))
    .withColumn("Longitude", F.get_json_object("LocationDetails", "$.geoCoordinates.longitude").cast("double"))
    .withColumnRenamed("AutonomousSystemNumber", "ASN")
)

df_recent_raw.printSchema()  # Schema only (metadata); lightweight because it does not scan the data
print("✅ Loaded SigninLogs with expanded LocationDetails and ASN")

## 🔄 Recurring Features Run

This notebook is designed to run on a **recurring schedule** (e.g., every 4 hours).  
Each run produces **per-IP spray features** using a combination of **fresh raw SigninLogs** and a **historical summary table**.

### Step-by-Step
1. 🕒 **Define batch window**  
   - `run_end` is truncated to the top of the current hour and `run_start` is 4 hours earlier (e.g., 00:00–04:00, 04:00–08:00 when the job is scheduled on those boundaries).  
   - Ensures consistent processing windows regardless of runtime.  

2. 📥 **Load fresh SigninLogs**  
   - Extract the last 4 hours of raw events.  
   - Expand JSON location details (City, Country, Latitude, Longitude).  
   - Aggregate attempts, successes, distinct users, and entropy.  

3. 📅 **Load historical summary**  
   - Pull N days of `signin_summary_daily`.  
   - Provides context for **low-and-slow spray attempts** beyond the current batch window.  

4. 🔗 **Combine fresh + history**  
   - Union recent aggregates with historical rollups.  
   - Ensures both short-term spikes and long-term campaigns are visible.  

5. 🧮 **Compute features**  
   - Aggregate per IP across the combined window.  
   - Normalize metrics:  
     - `distinct_users_norm`  
     - `success_rate`  
     - `entropy_norm`  
   - Calculate a **spray_score** using weighted formula.  
   - Assign categorical **labels** (LOW / MEDIUM / HIGH).  

6. 💾 **Append to `password_spray_features`**  
   - Results are written partitioned by `run_date` (aligned to `run_end`).  
   - Enables dashboards, detections, and investigations to consume the latest spray features.  

---

📌 Each scheduled run enriches the `password_spray_features` table with **up-to-date per-IP signals**, helping analysts detect both **fast bursts** and **stealthy campaigns**.
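
The configuration cell truncates `run_end` to the top of the hour, so the window boundaries depend on when the job starts. If fixed boundaries such as 00:00–04:00 are wanted regardless of start time, one option is to floor `run_end` to a multiple of the window length. The sketch below is an assumption about how that could be done, not part of the notebook; `aligned_window` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def aligned_window(now, hours=4):
    """Return (run_start, run_end) with run_end floored to a multiple of `hours`."""
    top_of_hour = now.replace(minute=0, second=0, microsecond=0)
    # Step back to the most recent hour that is a multiple of the window length.
    run_end = top_of_hour - timedelta(hours=top_of_hour.hour % hours)
    return run_end - timedelta(hours=hours), run_end


# A run that starts at 09:37 processes the 04:00-08:00 window.
start, end = aligned_window(datetime(2025, 1, 15, 9, 37))
print(start, end)
```

With this approach a late or retried run still processes the same window as an on-time run, which helps avoid overlapping or skipped intervals in an append-only table.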

```mermaid
flowchart TD
    subgraph FeaturesRun[Recurring Features Run - Every 4 Hours]
        A[🕒 Define run_start & run_end<br/>Aligned to Hour] --> B[📥 Load Fresh Raw SigninLogs<br/>Expand Geo + ASN]
        B --> C[🧮 Aggregate per IP<br/>Attempts, Successes, Distinct Users, Entropy]
        A --> D[📅 Load signin_summary_daily<br/>Last N Days Context]
        C --> E[🔗 Union Fresh + Historical]
        D --> E
        E --> F[📊 Compute Features<br/>Normalized Metrics]
        F --> G[🎯 Spray Score + Label]
        G --> H[💾 Write to password_spray_features<br/>Partition run_date]
    end
```
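
The username entropy used in the aggregation step is Shannon entropy over the share of attempts each username receives from one IP. A small plain-Python sketch, with made-up usernames, shows why it separates sprays from ordinary traffic; the function name is hypothetical.

```python
import math
from collections import Counter


def username_entropy(usernames):
    """Shannon entropy (in bits) of the username distribution for one IP."""
    counts = Counter(usernames)
    total = sum(counts.values())
    # p * log2(1/p) is the same quantity as -p * log2(p).
    return sum((c / total) * math.log2(total / c) for c in counts.values())


# One account hit repeatedly: no spread across usernames.
focused = username_entropy(["alice"] * 8)

# Four accounts hit equally often: maximum spread for four usernames.
sprayed = username_entropy(["u1", "u2", "u3", "u4"] * 2)

print(focused, sprayed)
```

An IP that hammers a single account scores zero, while attempts spread evenly over k accounts score log2(k) bits, so higher entropy points toward spray-like breadth rather than a single-account brute force.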

In [None]:

# -----------------
# Candidate features (single groupBy)
# Carry over ASN, City, Country for attribution
# -----------------
agg_recent = (
    df_recent_raw.groupBy("IPAddress", "ASN", "City", "Country","date")
      .agg(
        F.count("*").alias("attempts_total"),
        F.sum(F.when(F.col("ResultSignature")=="Success",1).otherwise(0)).alias("success_count"),
        F.countDistinct("UserPrincipalName").alias("distinct_users"),
        F.min("TimeGenerated").alias("first_seen"),
        F.max("TimeGenerated").alias("last_seen")
      )
)

# -----------------
# Entropy (optimized with window instead of join)
# -----------------
user_counts = df_recent_raw.groupBy("IPAddress","UserPrincipalName").count()
entropy_recent = (
    user_counts
        .withColumn("p", F.col("count") / F.sum("count").over(Window.partitionBy("IPAddress")))
        .groupBy("IPAddress")
        .agg((-F.sum(F.col("p")*F.log2("p"))).alias("username_entropy"))
)

df_recent_summary = agg_recent.join(entropy_recent, "IPAddress", "left")

print("✅ Transformed recent logs into summary schema")

history_start = (run_end.date() - timedelta(days=lookback_days))

df_history = (
    data_provider.read_table(input_table_summary, workspace_name)
      .filter(F.col("date") >= F.lit(history_start.isoformat()))
)

print(f"✅ Loaded historical summary from {history_start} → {run_end.date()}")

df_combined = df_history.unionByName(df_recent_summary, allowMissingColumns=True)
print(f"✅ Combined {df_combined.count()} summary rows (history + fresh)")


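# Aggregate over the combined (history + fresh) rows so that totals and
# days_active cover the full lookback window, not only the current batch.
features = (
    df_combined.groupBy("IPAddress", "ASN", "City", "Country")
        .agg(
            F.sum("attempts_total").alias("attempts_total"),
            F.sum("success_count").alias("success_count"),
            # Sum of per-day distinct counts: an upper bound, since a user seen on several days is counted once per day
            F.sum("distinct_users").alias("distinct_users"),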
            F.countDistinct("date").alias("days_active"),
            F.min("first_seen").alias("first_seen"),
            F.max("last_seen").alias("last_seen"),
            F.round(F.avg("username_entropy"), 2).alias("avg_entropy")
        )
        .withColumn("run_date", F.lit(run_end.date().isoformat()))
        .withColumn("detection_window_start", F.lit(history_start.isoformat()))
        .withColumn("detection_window_end", F.lit(run_end.date().isoformat()))
)

# Normalize & spray score
global_max = features.agg(
    F.max("distinct_users").alias("max_users"),
    F.max("avg_entropy").alias("max_entropy")
).first()

max_users = global_max.max_users or 1
max_entropy = global_max.max_entropy or 1

features = (
    features
      .withColumn("distinct_users_norm", F.col("distinct_users") / F.lit(max_users))
      .withColumn("success_rate", F.col("success_count") / F.greatest(F.col("attempts_total"), F.lit(1)))
      .withColumn("entropy_norm", F.col("avg_entropy") / F.lit(max_entropy))
      .withColumn(
            "spray_score",
            F.round(
                0.5*F.col("distinct_users_norm") +
                0.2*(1 - F.col("success_rate")) +
                0.3*F.col("entropy_norm"),
                2
            )
        )
      .withColumn(
          "spray_score_label",
          F.when(F.col("spray_score") < 0.3, "LOW")
           .when(F.col("spray_score") < 0.6, "MEDIUM")
           .otherwise("HIGH")
      )
)

print("✅ Computed spray score & labels")


# Write candidates
data_provider.save_as_table(features, output_datalake_table_features, write_options=write_options)
print(f"🎯 Wrote spray candidates for run {run_end} into {output_datalake_table_features}")

# 👀 Preview Outputs

After each scheduled run, it’s important to validate that the pipeline produced the expected data.  
This section shows the **schema** and **sample rows** for the candidates table.

---

### 🧩 Candidates Table (`password_spray_features`)

This table contains **per-IP aggregated features** across the rolling lookback window.  
It highlights which IP addresses exhibit password spray-like behavior and provides the necessary context for triage.

- 🌍 **Geo & ASN context** → IPAddress, ASN, City, Country  
- 📊 **Behavioral metrics** → attempts_total, distinct_users, days_active, entropy  
- 🔢 **Normalized features** → distinct_users_norm, entropy_norm, success_rate  
- 🎯 **Spray score** → weighted score reflecting spray likelihood  
- 🏷️ **Spray score label** → LOW / MEDIUM / HIGH for easier prioritization  

---

These outputs enable:
- 📈 **Dashboards** (geo heatmaps, spray score trends, IP hotspots)  
- 🚨 **Alerts** (triggering on high-score IPs)  
- 🕵️ **Investigations** (pivoting into ASN, city, or recurring IP patterns)
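
Beyond eyeballing the preview, a few row-level sanity checks can catch regressions early. The sketch below is one possible set of checks in plain Python, operating on dictionaries; the helper names, the chosen checks, and the sample row (which uses a documentation-range IP) are assumptions for illustration.

```python
REQUIRED_COLUMNS = (
    "IPAddress", "attempts_total", "success_count",
    "spray_score", "spray_score_label",
)


def expected_label(score):
    """Thresholds used by the scoring step: LOW < 0.3, MEDIUM < 0.6, else HIGH."""
    if score < 0.3:
        return "LOW"
    if score < 0.6:
        return "MEDIUM"
    return "HIGH"


def validate_feature_row(row):
    """Return a list of problems found in one features row (empty means OK)."""
    problems = [f"missing column: {c}" for c in REQUIRED_COLUMNS if c not in row]
    if problems:
        return problems
    if not 0.0 <= row["spray_score"] <= 1.0:
        problems.append("spray_score outside [0, 1]")
    if row["success_count"] > row["attempts_total"]:
        problems.append("success_count exceeds attempts_total")
    if row["spray_score_label"] != expected_label(row["spray_score"]):
        problems.append("label does not match score thresholds")
    return problems


good_row = {
    "IPAddress": "203.0.113.10", "attempts_total": 120, "success_count": 2,
    "spray_score": 0.71, "spray_score_label": "HIGH",
}
bad_row = dict(good_row, spray_score_label="LOW")

print(validate_feature_row(good_row))
print(validate_feature_row(bad_row))
```

In a Spark session, a small sample could be fed to these checks by collecting a limited number of rows and converting each with `Row.asDict()`.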

In [None]:
print("📑 Spray Candidates Schema")
features.printSchema()

print("\n🔍 Sample Rows")
display(
    features.select(
        "IPAddress", "ASN", "City", "Country",
        "attempts_total", "distinct_users", "avg_entropy",
        "spray_score", "spray_score_label"
    ).limit(20)
)