# 🔐 SigninLogs Summary + Stats Daily Notebook



## 📖 Overview

This notebook produces **two key tables** from raw Microsoft Entra ID `SigninLogs`:  

- 🧮 **signin_summary_daily_SPRK** → pre-aggregated per-IP features per day (efficient base for rolling 30-day candidate analysis).  

- 📊 **signin_stats_daily_SPRK** → daily rollups for reporting, dashboards, and baselines.  

By combining both in a single run, we avoid scanning raw logs twice, saving cost and runtime.  


---

## 🎯 Objectives

- 🎯 Load raw `SigninLogs` with relevant fields (user, IP, ASN, geo, status).
- 🔧 Expand JSON location details into structured columns (City, State, Country, Latitude, Longitude).
- 📊 Compute **daily per-IP features**: attempts, successes, distinct users, first/last seen, username entropy.
- 🏷️ Compute **daily stats** across all IPs: total attempts, distinct users, distinct source IPs, lockouts, successes.
- 💾 Write results to dedicated Sentinel data lake tables for further use (detections, dashboards, investigations).
- 👀 Provide a preview of the schema and sample rows for validation.


---


## 🗺️ Data Flow

```mermaid
flowchart TD
   A[📥 SigninLogs<br/>Raw Data] --> A1[📑 Column Selection<br/>TimeGenerated, User, IP, ASN, Location, Result*]
   
   %% Daily Summary
   A1 --> B[🧮 Daily Summary per IP<br/>by IP, ASN, City, Country, Date]
   B -->|Aggregates| B1[🔢 attempts_total, success_count,<br/>distinct_users, first_seen, last_seen,<br/>username_entropy]
   B1 --> D[💾 signin_summary_daily_SPRK<br/>partitioned by date]

   %% Daily Stats
   A1 --> C[📊 Daily Stats Rollup<br/>by Date only]
   C -->|Aggregates| C1[🔢 total_attempts, distinct_users,<br/>distinct_IPs, lockouts, successes]
   C1 --> E[💾 signin_stats_daily_SPRK<br/>partitioned by date]

``` 

In [None]:
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.window import Window
from sentinel_lake.providers import MicrosoftSentinelProvider

# Initialize provider
data_provider = MicrosoftSentinelProvider(spark)

# -----------------------------
# Parameters
# -----------------------------

# Time window for a daily recurring run.
end_date = datetime.now().date() - timedelta(days=1)  # yesterday
start_date = end_date   # Same day: the load filter treats start_date and end_date as inclusive.

# Sentinel Workspace name (update for your environment)
workspace_name = "<YOUR_WORKSPACE_NAME>"  # Replace with your actual workspace name

# Table names (easy to swap)
input_table_raw = "SigninLogs"
output_datalake_summary_table = "signin_summary_daily_SPRK"
output_datalake_stats_table = "signin_stats_daily_SPRK"

# Write options (append mode keeps history)
write_options = {"mode": "append", "partitionBy": ["date"]}   # partitionBy must be a list of columns; it is supported only in the data lake tier.

# -----------------------------
# Field selection
# -----------------------------
# Core fields + native enrichment (ASN, geolocation)
signin_fields = [
    "TimeGenerated",
    "UserPrincipalName",
    "UserDisplayName",
    "IPAddress",
    "ResultType",
    "ResultSignature",
    "Status",
    "UserType",
    "UserAgent",
    # Geo & ASN enrichment - natively available in SigninLogs
    "LocationDetails",  # JSON object with detailed location info: city, state, countryOrRegion, geoCoordinates
    "AutonomousSystemNumber",
]


# -----------------------------
# Status output
# -----------------------------
print("📅 Data Loading Parameters")
print(f"👉   Start → End: {start_date} → {end_date}")

print("\n\n⚠️ Please validate the input and target tables before proceeding:")
print(f"\t📂  Input Raw Table:      {input_table_raw}")
print(f"\t📂  Signins Summary: {output_datalake_summary_table}")
print(f"\t📂  Signins Stats: {output_datalake_stats_table}")

print("\n\n📝 Selected Fields:")
print("👉   " + ", ".join(signin_fields))

# 📥 Load Raw SigninLogs

We read the source authentication data from the Microsoft Entra ID `SigninLogs` table.  

- 📌 **Purpose**: Focus only on fields relevant to password spray detection.  
- 🌍 **Enrichment**: Expand the `LocationDetails` JSON into flat columns (City, State, Country, Latitude, Longitude).  
- 🔢 **Network context**: Capture `AutonomousSystemNumber (ASN)` for attribution to ISPs and hosting providers.  
- 📅 **Date column**: Add a derived `date` column (from `TimeGenerated`) to support partition pruning and time-based queries.  

✅ At this stage, the data is **normalized and structured**, ready for the daily aggregation steps that follow.
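
The flattening described above can be illustrated outside Spark. The following is a minimal plain-Python sketch, not the notebook's code: `flatten_signin` is a hypothetical helper, the sample record is synthetic, and it assumes `LocationDetails` arrives as a JSON string with the same keys that the `get_json_object` paths read.

```python
import json
from datetime import datetime


def flatten_signin(row: dict) -> dict:
    """Flatten one raw sign-in record (hypothetical helper, illustration only)."""
    # Tolerate a missing or empty LocationDetails value.
    loc = json.loads(row.get("LocationDetails") or "{}")
    geo = loc.get("geoCoordinates") or {}
    ts = datetime.fromisoformat(row["TimeGenerated"])
    return {
        "IPAddress": row["IPAddress"],
        "date": ts.date().isoformat(),          # plays the role of F.to_date("TimeGenerated")
        "City": loc.get("city"),                # $.city
        "State": loc.get("state"),              # $.state
        "Country": loc.get("countryOrRegion"),  # $.countryOrRegion
        "Latitude": geo.get("latitude"),        # $.geoCoordinates.latitude
        "Longitude": geo.get("longitude"),      # $.geoCoordinates.longitude
    }


# Synthetic record; the IP is from a documentation-only address range.
sample = {
    "TimeGenerated": "2025-01-15T08:30:00",
    "IPAddress": "203.0.113.7",
    "LocationDetails": json.dumps({
        "city": "Amsterdam",
        "state": "Noord-Holland",
        "countryOrRegion": "NL",
        "geoCoordinates": {"latitude": 52.37, "longitude": 4.89},
    }),
}
flat = flatten_signin(sample)
```

Missing keys come back as `None`, which is analogous to the `null` that `get_json_object` returns when a path does not match.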

---

In [None]:
signin_daily_df = (
    data_provider.read_table(input_table_raw, workspace_name)
        .select(*signin_fields)
        .withColumn("date", F.to_date("TimeGenerated"))
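        # Half-open window [start_date, end_date + 1 day): a bare date compares as midnight,
        # so `<= end_date` would keep only rows stamped exactly 00:00:00 on the end date.
        .filter((F.col("TimeGenerated") >= F.lit(start_date)) &
                (F.col("TimeGenerated") < F.lit(end_date + timedelta(days=1))))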
        .withColumn("City", F.get_json_object("LocationDetails", "$.city"))
        .withColumn("State", F.get_json_object("LocationDetails", "$.state"))
        .withColumn("Country", F.get_json_object("LocationDetails", "$.countryOrRegion"))
        .withColumn("Latitude", F.get_json_object("LocationDetails", "$.geoCoordinates.latitude").cast("double"))
        .withColumn("Longitude", F.get_json_object("LocationDetails", "$.geoCoordinates.longitude").cast("double"))
        .withColumnRenamed("AutonomousSystemNumber", "ASN")
        .cache()        # Cache as used multiple times below
)


signin_daily_df.printSchema()  # Prints only the schema (metadata); it does not scan the data, so it is lightweight
print("✅ Loaded SigninLogs with expanded LocationDetails and ASN")

# 🧮 Daily Summary per IP

This step aggregates authentication activity **per source IP, per day**, providing richer attribution context than global daily stats.  
It compresses raw sign-in logs into **per-IP rollups** while still retaining critical behavioral signals.

---
### 🔄 Transformation Details
For each combination of `IPAddress`, `ASN`, `City`, `Country`, and `date`, we compute:

- 🔢 **attempts_total** → total number of sign-in attempts from that IP on that day  
- ✅ **success_count** → number of successful logons from the IP  
- 👤 **distinct_users** → number of unique targeted accounts  
- ⏱️ **first_seen / last_seen** → earliest and latest timestamps for that IP’s activity on the day  

Additionally, we enrich with **username entropy**:  
- 🧮 **username_entropy** → measures the randomness/spread of targeted usernames.  
   - Low entropy = focused attack (few accounts repeatedly).  
   - High entropy = broad spray (many accounts targeted).  
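
To make the entropy signal concrete, here is a minimal plain-Python sketch of Shannon entropy over the usernames one IP targeted, H = -Σ p·log2(p), where p is each username's share of the attempts. The helper name `username_entropy` and the sample usernames are illustrative assumptions; the notebook itself computes the value in Spark.

```python
import math
from collections import Counter


def username_entropy(usernames):
    """Shannon entropy (in bits) of the targeted-username distribution."""
    counts = Counter(usernames)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # p * log2(1/p) equals -p * log2(p); this form avoids a negative zero.
    h = sum((c / total) * math.log2(total / c) for c in counts.values())
    return round(h, 2)


focused = username_entropy(["alice"] * 8)                     # one account hit repeatedly
spray = username_entropy(["alice", "bob", "carol", "dave"])   # four accounts, one attempt each
mixed = username_entropy(["alice", "alice", "alice", "bob"])  # mostly one account
```

Here `focused` is 0.0, `spray` is 2.0 (log2 of 4 equally likely usernames), and `mixed` falls in between at 0.81.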

---
### 💾 Output
The results are appended into the **daily summary table** (`signin_summary_daily_SPRK`), partitioned by `date`.  
This table enables:  
- ⚡ Efficient long lookbacks (no need to scan raw `SigninLogs`)  
- 🎯 Foundation for **feature scoring** (spray likelihood models)  
- 🔍 Investigative pivots by IP, ASN, or geographic location  
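
As an illustration of the long-lookback use case, the sketch below rolls daily summary rows up into per-IP features over a 30-day window. It is plain Python over synthetic rows, not a read from the data lake; `rollup_window` and the output keys (`attempts`, `successes`, `days_active`) are hypothetical names. `distinct_users` is deliberately not summed, because the same account can appear on several days.

```python
from collections import defaultdict
from datetime import date, timedelta


def rollup_window(daily_rows, end_date, days=30):
    """Aggregate daily per-IP summary rows over the `days` ending at end_date (inclusive)."""
    start = end_date - timedelta(days=days - 1)
    acc = defaultdict(lambda: {"attempts": 0, "successes": 0, "active_dates": set()})
    for r in daily_rows:
        if start <= r["date"] <= end_date:
            a = acc[r["IPAddress"]]
            a["attempts"] += r["attempts_total"]
            a["successes"] += r["success_count"]
            a["active_dates"].add(r["date"])
    return {
        ip: {
            "attempts": a["attempts"],
            "successes": a["successes"],
            "days_active": len(a["active_dates"]),
        }
        for ip, a in acc.items()
    }


# Synthetic daily summary rows (documentation-only IP ranges).
rows = [
    {"date": date(2025, 1, 1), "IPAddress": "203.0.113.7", "attempts_total": 40, "success_count": 0},
    {"date": date(2025, 2, 10), "IPAddress": "203.0.113.7", "attempts_total": 120, "success_count": 1},
    {"date": date(2025, 2, 14), "IPAddress": "203.0.113.7", "attempts_total": 80, "success_count": 0},
    {"date": date(2025, 2, 14), "IPAddress": "198.51.100.2", "attempts_total": 5, "success_count": 5},
]
features = rollup_window(rows, end_date=date(2025, 2, 15))
```

The row from 1 January falls outside the window, so the first IP ends up with 200 attempts across 2 active days.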


In [None]:
agg_all = (
    signin_daily_df.groupBy("IPAddress", "ASN", "City", "Country", "date")
        .agg(
            F.count("*").alias("attempts_total"),
            F.sum(F.when(F.col("ResultSignature") == "Success", 1).otherwise(0)).alias("success_count"),
            F.countDistinct("UserPrincipalName").alias("distinct_users"),
            F.min("TimeGenerated").alias("first_seen"),
            F.max("TimeGenerated").alias("last_seen")
        )
)

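# Username entropy per IP per day: H = -sum(p * log2(p)), where p is each user's share of that IP's attempts.
user_counts = signin_daily_df.groupBy("IPAddress", "date", "UserPrincipalName").count()
ip_day = Window.partitionBy("IPAddress", "date")
entropy = (
    user_counts
        .withColumn("p", F.col("count") / F.sum("count").over(ip_day))
        .groupBy("IPAddress", "date")
        .agg(F.round(-F.sum(F.col("p") * F.log2("p")), 2).alias("username_entropy"))
)

# Join on both keys so each daily row receives the entropy of its own day.
summary = agg_all.join(entropy, ["IPAddress", "date"], "left")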

try:
    data_provider.save_as_table(
        summary,
        output_datalake_summary_table,
        "System tables",        # "System tables" refers to writing to a data lake tier table.
        write_options
    )
    print(f"✅ Wrote summary for {end_date} into {output_datalake_summary_table}")
except Exception as save_err:
    print(f"❌ Failed writing summary for {end_date}: {save_err}")

# 📊 Daily Summary Stats

This step aggregates **all sign-in activity for a given day** into a compact rollup table.  
Instead of tracking activity per IP, this view focuses on **global daily metrics** that help analysts understand the overall scale and impact of password spray attempts.

---
### 🔄 Transformation Details
For each `date`, we compute:

- 🔢 **total_attempts** → total number of sign-in attempts observed  
- 👤 **distinct_targeted_users** → number of unique accounts targeted that day  
- 🌐 **distinct_source_ips** → number of unique source IPs seen attempting logons  
- 🚫 **lockouts** → count of lockout errors (`ResultType=50053`), indicating repeated failed attempts  
- ✅ **successes** → count of successful logons, useful for spotting when a spray attack succeeds  
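
The rollup above can be sketched in plain Python to show what each metric counts. This is an illustration over synthetic rows rather than the notebook's Spark code; `daily_stats` is a hypothetical helper, and the success test simply mirrors the notebook's `ResultSignature == "Success"` condition.

```python
from collections import defaultdict

LOCKOUT_CODE = "50053"  # lockout ResultType used by the notebook


def daily_stats(rows):
    """Group sign-in rows by date and compute the daily rollup metrics."""
    by_date = defaultdict(list)
    for r in rows:
        by_date[r["date"]].append(r)
    return {
        d: {
            "total_attempts": len(rs),
            # Skip None values, as a distinct count in Spark ignores nulls.
            "distinct_targeted_users": len({r["UserPrincipalName"] for r in rs
                                            if r["UserPrincipalName"] is not None}),
            "distinct_source_ips": len({r["IPAddress"] for r in rs
                                        if r["IPAddress"] is not None}),
            "lockouts": sum(r["ResultType"] == LOCKOUT_CODE for r in rs),
            "successes": sum(r["ResultSignature"] == "Success" for r in rs),
        }
        for d, rs in by_date.items()
    }


# Synthetic sign-in rows for a single day.
signins = [
    {"date": "2025-01-15", "UserPrincipalName": "alice@example.com", "IPAddress": "203.0.113.7",
     "ResultType": "50126", "ResultSignature": "None"},
    {"date": "2025-01-15", "UserPrincipalName": "bob@example.com", "IPAddress": "203.0.113.7",
     "ResultType": "50126", "ResultSignature": "None"},
    {"date": "2025-01-15", "UserPrincipalName": "bob@example.com", "IPAddress": "203.0.113.7",
     "ResultType": "50053", "ResultSignature": "None"},
    {"date": "2025-01-15", "UserPrincipalName": "alice@example.com", "IPAddress": "198.51.100.2",
     "ResultType": "0", "ResultSignature": "Success"},
]
stats_by_day = daily_stats(signins)
```

For this sample day the rollup reports 4 attempts against 2 users from 2 IPs, with 1 lockout and 1 success.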


---

### 💾 Output
The aggregated results are appended into the **daily stats table** (`signin_stats_daily_SPRK`), partitioned by `date`.  
This table enables:  
- 📈 High-level trend dashboards  
- 🚨 Alert thresholds (e.g., sudden spikes in distinct IPs or lockouts)  
- 🔍 Context for investigations when reviewing candidate IP behavior  
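
One way an alert threshold on this table could look is a simple spike test against a trailing average. The sketch below is an assumption for illustration only: `is_spike`, the factor of 3, and the 7-day minimum history are arbitrary choices, not values taken from this notebook.

```python
from statistics import mean


def is_spike(history, today, factor=3.0, min_history=7):
    """Return True when today's value exceeds `factor` times the trailing mean.

    history: earlier daily values of one metric (e.g. lockouts), oldest first.
    """
    if len(history) < min_history:
        return False  # too little history for a meaningful baseline
    return today > factor * mean(history)


# Synthetic lockout counts for the previous 7 days (mean is 11).
lockout_history = [10, 12, 9, 11, 10, 13, 12]
quiet_day = is_spike(lockout_history, today=30)
noisy_day = is_spike(lockout_history, today=40)
```

With a baseline mean of 11 the threshold is 33, so a day with 30 lockouts passes quietly and a day with 40 is flagged.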


In [None]:
stats = (
    signin_daily_df.groupBy("date")
        .agg(
            F.count("*").alias("total_attempts"),
            F.countDistinct("UserPrincipalName").alias("distinct_targeted_users"),
            F.countDistinct("IPAddress").alias("distinct_source_ips"),
            F.sum(F.when(F.col("ResultType") == "50053", 1).otherwise(0)).alias("lockouts"),
            F.sum(F.when(F.col("ResultSignature") == "Success", 1).otherwise(0)).alias("successes")
        )
)

try:
    data_provider.save_as_table(
        stats,
        output_datalake_stats_table,
        "System tables",        # "System tables" refers to writing to a data lake tier table.
        write_options
    )
    print(f"✅ Wrote daily stats for {end_date} into {output_datalake_stats_table}")
except Exception as save_err:
    print(f"❌ Failed writing stats for {end_date}: {save_err}")

# Release cached DataFrame
signin_daily_df.unpersist()
print("🧹 Released cached signin_daily_df")

# 👀 Preview Outputs

After the daily run completes, it’s important to validate that the pipeline produced the expected data.  
This section shows **schemas** and **sample rows** for both output tables.

---

### 🗂️ Daily Summary Table (`signin_summary_daily_SPRK`)

This table provides **per-IP, per-day aggregates** of sign-in activity.  
It compresses raw `SigninLogs` into a daily rollup for each source IP, while preserving attribution context.  

- 🌍 **Geo & ASN context** → IPAddress, ASN, City, Country  
- 📅 **Date** → reporting day of the aggregation  
- 🔢 **Total attempts** → total number of authentication attempts from the IP on that day  
- ✅ **Success count** → number of successful logons (helps measure spray effectiveness)  
- 👤 **Distinct users** → number of unique targeted accounts  
- ⏱️ **First seen / Last seen** → earliest and latest attempt timestamps for that IP within the day  
- 🧮 **Username entropy** → entropy score measuring spread/randomness of targeted usernames  

---

### 📊 Daily Stats Table (`signin_stats_daily_SPRK`)

This table provides **daily rollups** of authentication activity.  
It helps track the overall level of spray attempts and anomalies over time.

- 📅 **Date** → reporting date  
- 🔢 **Total attempts** → all sign-in attempts for the day  
- 👤 **Distinct targeted users** → how many unique accounts were hit  
- 🌐 **Distinct source IPs** → how many unique IPs attempted logons  
- 🚫 **Lockouts** → counts of account lockout events (ResultType=50053)  
- ✅ **Successes** → counts of successful logons  
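
Beyond eyeballing sample rows, a few invariants follow directly from the column definitions above and can be checked mechanically. The sketch below is a hypothetical plain-Python check on one summary row (`summary_row_problems` is not part of the notebook, and the rows are synthetic).

```python
from datetime import datetime


def summary_row_problems(row):
    """Return a list of invariant violations for one daily summary row."""
    problems = []
    if row["success_count"] > row["attempts_total"]:
        problems.append("success_count exceeds attempts_total")
    if row["distinct_users"] > row["attempts_total"]:
        problems.append("distinct_users exceeds attempts_total")
    if row["first_seen"] > row["last_seen"]:
        problems.append("first_seen is after last_seen")
    entropy = row["username_entropy"]
    if entropy is not None and entropy < 0:
        problems.append("username_entropy is negative")
    return problems


good_row = {
    "attempts_total": 120, "success_count": 1, "distinct_users": 95,
    "first_seen": datetime(2025, 1, 15, 0, 5), "last_seen": datetime(2025, 1, 15, 23, 40),
    "username_entropy": 6.41,
}
bad_row = dict(good_row, success_count=200, first_seen=datetime(2025, 1, 16, 1, 0))

good_problems = summary_row_problems(good_row)
bad_problems = summary_row_problems(bad_row)
```

An empty list means the row is consistent; `bad_problems` lists the two violations introduced on purpose.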

---

In [None]:
# -----------------------------
# Preview sample outputs
# -----------------------------

print("\n📑 Daily Summary Table Schema")
summary.printSchema()

print("\n🔍 Daily Summary Sample Rows from the loaded spark dataframe")
print("⏳ Preparing to show DataFrame...this may take 2–3 minutes ⌛")
display(
    summary.select(
        "date", "IPAddress", "attempts_total", "success_count",
        "distinct_users", "username_entropy", "ASN", "City", "Country"
    ).limit(10)
)

print("\n📑 Daily Stats Table Schema")
stats.printSchema()

print("\n🔍 Daily Stats Sample Rows from the loaded spark dataframe")
print("⏳ Preparing to show DataFrame...this may take 2–3 minutes ⌛")
display(
    stats.select(
        "date", "total_attempts", "distinct_targeted_users",
        "distinct_source_ips", "lockouts", "successes"
    ).limit(10)
)