# User Risk Score Simple - Training Notebook

This simplified notebook is designed as a **training tool** for users new to Microsoft Sentinel. It demonstrates fundamental concepts:

- Reading data from Sentinel tables
- Performing basic PySpark aggregations
- Calculating simple risk scores
- Visualizing results with charts
- Writing results back to a custom table

## What This Notebook Does

Calculates a **50-point user risk score** based on sign-in behavior patterns:
- **IP Diversity (20 pts)**: How many different IP addresses?
- **Sign-in Frequency (20 pts)**: How many sign-ins in 14 days?
- **Consistency Pattern (10 pts)**: Is activity spread out or concentrated?

**Risk Levels:** Low (0-15), Medium (16-30), High (31-50)
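
The banding above can be checked by hand with a small plain-Python function. This is only an illustrative sketch: the name `classify_risk` is an assumption, and the notebook itself performs the classification with PySpark in Section 6.

```python
def classify_risk(total_risk_score: int) -> str:
    """Map a total score to the Low/Medium/High bands listed above."""
    if total_risk_score <= 15:
        return "Low"
    if total_risk_score <= 30:
        return "Medium"
    return "High"


# Boundary values belong to the lower band: 15 is Low, 30 is Medium.
for score in (0, 15, 16, 30, 31, 50):
    print(score, classify_risk(score))
```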

## Data Sources (2 Tables)

- **SigninLogs**: User authentication events (UserId, UserPrincipalName, IPAddress)
- **EntraUsers**: User profiles (id, displayName, mail, department)

## Output

Results saved to: `UserRiskScoreSimple_SPRK` (12 columns)

## References

* [Available workspace tables](https://learn.microsoft.com/en-us/azure/azure-monitor/reference/tables-index)
* [Available system tables](https://learn.microsoft.com/en-us/azure/sentinel/datalake/enable-data-connectors)
* [Microsoft Sentinel Provider class](https://learn.microsoft.com/en-us/azure/sentinel/datalake/sentinel-provider-class-reference)
* [Notebook examples](https://learn.microsoft.com/en-us/azure/sentinel/datalake/notebook-examples)

---

## 1. Setup & Configuration

In this section, we'll set up our environment and configure parameters for our risk analysis.

**What we're doing:**
- Import required PySpark libraries for data processing
- Initialize the Microsoft Sentinel provider to access data lake tables
- Set the analysis time window (14 days)

**Expected Output:**
Confirmation of configuration settings.

In [None]:
# Import required libraries
from sentinel_lake.providers import MicrosoftSentinelProvider
from pyspark.sql.functions import (
    col, count, countDistinct, when, lit, expr,
    current_timestamp, avg, coalesce
)

# Configuration - Update this with your workspace name
WORKSPACE_NAME = "<YOUR_WORKSPACE_NAME>"

# Analysis window
ANALYSIS_DAYS = 14

# Initialize Sentinel provider
sentinel_provider = MicrosoftSentinelProvider(spark)

print("="*60)
print("USER RISK SCORE SIMPLE - CONFIGURATION")
print("="*60)
print(f"Analysis Window: {ANALYSIS_DAYS} days")
print(f"Workspace: {WORKSPACE_NAME}")
print(f"Tables: SigninLogs + EntraUsers")
print(f"Output: UserRiskScoreSimple_SPRK")
print("="*60)

## 2. Load SigninLogs

Load sign-in events for behavioral analysis.

**What we're doing:**
- Read from the SigninLogs table
- Filter for Member users (exclude guests)
- Filter for last 14 days only
- Select only 3 columns: UserId, UserPrincipalName, IPAddress
- Cache the data for faster processing

**Key Learning:**
- `.read_table()` - How to read from Sentinel tables
- `.filter()` - How to filter data with conditions
- `.select()` - How to choose specific columns
- `.persist()` - How to cache data for reuse

**Expected Output:**
Count of sign-in events loaded and a sample of the data.
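
To see what the time-window filter in this section evaluates to, the sketch below builds the same Spark SQL expression string and computes an equivalent cutoff with the standard library. `window_expr` and `window_cutoff` are hypothetical helper names used only for illustration, and the fixed `now` value is made up so the example is repeatable.

```python
from datetime import datetime, timedelta, timezone

ANALYSIS_DAYS = 14  # same value as in Section 1


def window_expr(days: int) -> str:
    """Return the Spark SQL expression string that the filter passes to expr()."""
    return f"current_timestamp() - INTERVAL {days} DAYS"


def window_cutoff(days: int, now: datetime) -> datetime:
    """Plain-Python equivalent: the earliest timestamp still inside the window."""
    return now - timedelta(days=days)


now = datetime(2025, 1, 15, 12, 0, tzinfo=timezone.utc)  # hypothetical "current" time
print(window_expr(ANALYSIS_DAYS))
print(window_cutoff(ANALYSIS_DAYS, now).isoformat())
```

Rows whose `TimeGenerated` is at or after the cutoff are kept; anything older is filtered out.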

In [None]:
print("📊 Loading SigninLogs...")

signin_df = (
    sentinel_provider.read_table('SigninLogs', WORKSPACE_NAME)
    .filter(
        (col("UserType") == "Member") & 
        (col("UserId").isNotNull()) &
        (col("TimeGenerated") >= expr(f"current_timestamp() - INTERVAL {ANALYSIS_DAYS} DAYS"))
    )
    .select("UserId", "UserPrincipalName", "IPAddress")
    .persist()
)

signin_count = signin_df.count()
print(f"✅ Loaded {signin_count} sign-in events")

print("\n📋 Sample of sign-in data (first 5 rows):")
signin_df.show(5, truncate=False)

## 3. Load EntraUsers

Load user profile information to enrich our risk scores with context.

**What we're doing:**
- Read from the EntraUsers table
- Filter out null or empty user IDs
- Select 4 columns: id, displayName, mail, department
- Remove any duplicate user records
- Cache the data

**Key Learning:**
- Working with identity data
- `.dropDuplicates()` - Ensuring data quality
- Multiple filter conditions with `&`

**Expected Output:**
Count of user profiles loaded and a sample of the data.
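
If the cleaning steps feel abstract, here is a plain-Python sketch of the same idea on a few made-up rows. `clean_users` is a hypothetical helper, and keeping the first row per `id` is an assumption of this sketch: Spark's `dropDuplicates` does not promise which duplicate survives.

```python
def clean_users(rows):
    """Drop rows with a missing or empty id, then keep one row per id."""
    kept = {}
    for row in rows:
        user_id = row.get("id")
        if user_id is None or user_id == "":
            continue  # same effect as the isNotNull() and != "" filters
        # Keep the first row seen for each id (an assumption of this sketch).
        kept.setdefault(user_id, row)
    return list(kept.values())


# Made-up sample rows for illustration only.
sample = [
    {"id": "u1", "displayName": "Ana", "department": "IT"},
    {"id": "u1", "displayName": "Ana (duplicate)", "department": "IT"},
    {"id": "", "displayName": "Empty id", "department": None},
    {"id": None, "displayName": "Null id", "department": None},
    {"id": "u2", "displayName": "Ben", "department": "Finance"},
]
for user in clean_users(sample):
    print(user["id"], user["displayName"])
```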

In [None]:
print("📊 Loading EntraUsers...")

users_df = (
    sentinel_provider.read_table('EntraUsers')
    .filter(
        (col("id").isNotNull()) &
        (col("id") != "")
    )
    .select("id", "displayName", "mail", "department")
    .dropDuplicates(["id"])
    .persist()
)

users_count = users_df.count()
print(f"✅ Loaded {users_count} user profiles")

print("\n📋 Sample of user profile data (first 5 rows):")
users_df.show(5, truncate=False)

## 4. Calculate Sign-in Metrics

Now we aggregate the sign-in data to calculate per-user metrics.

**What we're doing:**
- Group sign-in events by user (UserId, UserPrincipalName)
- Count unique IP addresses per user
- Count total sign-ins per user

**Key Learning:**
- `.groupBy()` - Grouping data for aggregation
- `.agg()` - Performing multiple aggregations
- `countDistinct()` - Counting unique values
- `count("*")` - Counting all rows
- `.alias()` - Naming columns

**Expected Output:**
Top 10 users by sign-in count with their unique IP counts.
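
For intuition, the same two aggregations can be written in plain Python over a handful of made-up events. `signin_metrics_py` is a hypothetical name for this sketch; it is not part of the notebook. Note that `countDistinct` ignores null values while `count("*")` counts every row, which the sketch mirrors.

```python
from collections import defaultdict


def signin_metrics_py(events):
    """Group (UserId, UserPrincipalName, IPAddress) tuples and compute both metrics."""
    ips = defaultdict(set)
    totals = defaultdict(int)
    for user_id, upn, ip in events:
        key = (user_id, upn)
        totals[key] += 1      # like count("*"): every row counts
        if ip is not None:    # like countDistinct: nulls are skipped
            ips[key].add(ip)
    return {
        key: {"unique_ip_count": len(ips[key]), "total_signins": totals[key]}
        for key in totals
    }


# Made-up sample events for illustration only.
events = [
    ("u1", "ana@contoso.com", "10.0.0.1"),
    ("u1", "ana@contoso.com", "10.0.0.1"),
    ("u1", "ana@contoso.com", "10.0.0.2"),
    ("u2", "ben@contoso.com", None),
]
for key, metrics in signin_metrics_py(events).items():
    print(key, metrics)
```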

In [None]:
print("🔍 Calculating sign-in metrics per user...")

signin_metrics = (
    signin_df
    .groupBy("UserId", "UserPrincipalName")
    .agg(
        countDistinct("IPAddress").alias("unique_ip_count"),
        count("*").alias("total_signins")
    )
)

metrics_count = signin_metrics.count()
print(f"✅ Calculated metrics for {metrics_count} users")

print("\n📊 Top 10 Users by Sign-in Count:")
signin_metrics.orderBy(col("total_signins").desc()).show(10, truncate=False)

## 5. Join with User Profiles

Enrich our sign-in metrics with user profile information.

**What we're doing:**
- Join signin_metrics with users_df
- Use LEFT JOIN to keep all users who signed in
- Join on: signin_metrics.UserId = users_df.id
- Select columns from both DataFrames using DataFrame.column syntax
- Rename displayName to UserDisplayName for clarity

**Key Learning:**
- LEFT JOIN syntax in PySpark
- Qualifying columns from specific DataFrames (DataFrame.column)
- Using `.alias()` to rename columns
- Why enrichment matters (adds context for analysis)

**Expected Output:**
Sample of enriched data showing user names and departments alongside metrics.
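
The key property of a LEFT JOIN is that every row on the left side survives, even when no profile matches. The plain-Python sketch below shows that behavior with made-up data; `left_join_profiles` is a hypothetical helper, not a notebook function.

```python
def left_join_profiles(metrics, profiles_by_id):
    """Attach profile fields to each metrics row; unmatched users get None."""
    enriched = []
    for row in metrics:
        profile = profiles_by_id.get(row["UserId"], {})
        enriched.append({
            **row,
            "UserDisplayName": profile.get("displayName"),  # renamed, as in the notebook
            "department": profile.get("department"),
        })
    return enriched


# Made-up sample data for illustration only.
metrics = [
    {"UserId": "u1", "unique_ip_count": 2, "total_signins": 40},
    {"UserId": "u9", "unique_ip_count": 1, "total_signins": 5},  # no profile
]
profiles = {"u1": {"displayName": "Ana", "department": "IT"}}

for row in left_join_profiles(metrics, profiles):
    print(row["UserId"], row["UserDisplayName"], row["department"])
```

The user without a profile is still present in the result, with empty profile fields, which is why the enriched count equals the metrics count when ids are unique.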

In [None]:
print("🔗 Enriching metrics with user profiles...")

combined_df = (
    signin_metrics
    .join(
        users_df,
        signin_metrics.UserId == users_df.id,
        "left"
    )
    .select(
        col("UserId"),
        col("UserPrincipalName"),
        users_df.displayName.alias("UserDisplayName"),
        users_df.department,
        col("unique_ip_count"),
        col("total_signins")
    )
)

combined_count = combined_df.count()
print(f"✅ Enriched {combined_count} user records")

print("\n📋 Sample of enriched data (first 5 rows):")
combined_df.show(5, truncate=False)

## 6. Calculate Risk Scores

Now we calculate risk scores based on the metrics we've gathered.

**Risk Categories (50 points total):**

1. **IP Diversity Score (0-20 points)**
   - 1 IP = 0 (normal single location)
   - 2-3 IPs = 5 (home + office or travel)
   - 4-6 IPs = 12 (elevated - multiple locations)
   - 7+ IPs = 20 (high - suspicious diversity)

2. **Frequency Score (0-20 points)**
   - ≤50 sign-ins = 0 (low activity)
   - 51-150 = 5 (moderate activity)
   - 151-300 = 12 (high activity)
   - 301+ = 20 (very high - unusual)

3. **Consistency Score (0-10 points)**
   - Detects automation: high frequency from few IPs
   - 200+ sign-ins from ≤2 IPs = 10 (suspicious)
   - 200+ sign-ins from 3+ IPs = 5 (monitor)
   - <200 sign-ins = 0 (normal)
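
Before reading the PySpark version, it can help to try the thresholds on a few numbers. The sketch below mirrors the three rules in plain Python; the function names are assumptions chosen for this illustration, and the logic is a hand-written mirror of the rules listed above rather than the notebook's own code.

```python
def ip_diversity_score(unique_ip_count: int) -> int:
    """0-20 points. Checks run in order, so a count of 0 lands in the 5-point tier."""
    if unique_ip_count == 1:
        return 0
    if unique_ip_count <= 3:
        return 5
    if unique_ip_count <= 6:
        return 12
    return 20


def frequency_score(total_signins: int) -> int:
    """0-20 points based on sign-in volume in the analysis window."""
    if total_signins <= 50:
        return 0
    if total_signins <= 150:
        return 5
    if total_signins <= 300:
        return 12
    return 20


def consistency_score(total_signins: int, unique_ip_count: int) -> int:
    """0-10 points: high volume from very few IPs suggests automation."""
    if total_signins < 200:
        return 0
    return 10 if unique_ip_count <= 2 else 5


def total_risk_score(unique_ip_count: int, total_signins: int) -> int:
    return (
        ip_diversity_score(unique_ip_count)
        + frequency_score(total_signins)
        + consistency_score(total_signins, unique_ip_count)
    )


print(total_risk_score(2, 250))  # 5 + 12 + 10
print(total_risk_score(8, 400))  # 20 + 20 + 5

# Brute-force the highest total these rules can produce.
print(max(total_risk_score(ips, n) for ips in range(1, 21) for n in range(1, 501)))
```

One thing the sketch makes visible: the 20-point IP tier needs 7 or more IPs, while the 10-point consistency tier needs 2 or fewer, so the two cannot apply to the same user. The highest total these rules can actually produce is therefore 45, even though the category maximums add up to 50.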

**Key Learning:**
- `.withColumn()` - Creating new columns
- `when().otherwise()` - Conditional logic (like IF-THEN-ELSE)
- Nested conditions for complex logic
- Building composite scores by adding columns
- Classification logic (Low/Medium/High)

**Expected Output:**
Top 10 highest risk users with their risk scores broken down by category.

In [None]:
print("🎯 Calculating risk scores...")

risk_scores = (
    combined_df
    # IP Diversity Score (0-20)
    .withColumn("ip_diversity_score",
        when(col("unique_ip_count") == 1, 0)
        .when(col("unique_ip_count") <= 3, 5)
        .when(col("unique_ip_count") <= 6, 12)
        .otherwise(20)
    )
    # Frequency Score (0-20)
    .withColumn("frequency_score",
        when(col("total_signins") <= 50, 0)
        .when(col("total_signins") <= 150, 5)
        .when(col("total_signins") <= 300, 12)
        .otherwise(20)
    )
    # Consistency Score (0-10) - detects automation patterns
    .withColumn("consistency_score",
        when(col("total_signins") >= 200, 
            when(col("unique_ip_count") <= 2, 10).otherwise(5)
        ).otherwise(0)
    )
    # Total Risk Score (0-50)
    .withColumn("total_risk_score",
        col("ip_diversity_score") + 
        col("frequency_score") + 
        col("consistency_score")
    )
    # Risk Level Classification
    .withColumn("risk_level",
        when(col("total_risk_score") <= 15, "Low")
        .when(col("total_risk_score") <= 30, "Medium")
        .otherwise("High")
    )
)

print("✅ Risk scores calculated")

print("\n🔴 Top 10 Highest Risk Users:")
risk_scores.select(
    "UserPrincipalName",
    "department",
    "unique_ip_count",
    "total_signins",
    "ip_diversity_score",
    "frequency_score",
    "consistency_score",
    "total_risk_score",
    "risk_level"
).orderBy(col("total_risk_score").desc()).show(10, truncate=False)

## 7. Visualize Risk Distribution

Create a bar chart showing how many users fall into each risk level.

**What we're doing:**
- Group users by risk_level and count
- Convert PySpark DataFrame to Pandas for plotting
- Create a bar chart with matplotlib
- Color-code: Green (Low), Orange (Medium), Red (High)
- Add percentages on bars

**Key Learning:**
- `.toPandas()` - Converting Spark to Pandas
- Creating bar charts with matplotlib
- Color coding for visual clarity
- Adding data labels to charts
- Professional chart formatting

**Expected Output:**
A bar chart showing the risk level distribution across all users.
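
The counts and percentage labels behind the chart can be sketched without Spark or matplotlib. The helper name `risk_distribution` and the sample data are made up for illustration. The sketch also lists levels in severity order (Low, Medium, High); sorting the level names alphabetically, as `orderBy("risk_level")` does, would give High, Low, Medium instead.

```python
from collections import Counter

LEVEL_ORDER = ["Low", "Medium", "High"]  # severity order, an assumption of this sketch


def risk_distribution(levels):
    """Return (level, count, label) tuples, with the same label format as the chart."""
    counts = Counter(levels)
    total = len(levels)
    rows = []
    for level in LEVEL_ORDER:
        n = counts.get(level, 0)
        pct = n / total * 100 if total else 0.0  # avoid dividing by zero on empty input
        rows.append((level, n, f"{n}\n({pct:.1f}%)"))
    return rows


# Made-up sample: 6 Low, 3 Medium, 1 High.
sample_levels = ["Low"] * 6 + ["Medium"] * 3 + ["High"]
for level, n, label in risk_distribution(sample_levels):
    print(level, n, label.replace("\n", " "))
```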

In [None]:
print("📊 Creating risk distribution chart...")

import matplotlib.pyplot as plt

# Calculate distribution
risk_dist = risk_scores.groupBy("risk_level").count().orderBy("risk_level").toPandas()
total_users = risk_scores.count()

# Create bar chart
plt.figure(figsize=(10, 6))
colors = {"Low": "#2ecc71", "Medium": "#f39c12", "High": "#e74c3c"}
bars = plt.bar(
    risk_dist["risk_level"], 
    risk_dist["count"],
    color=[colors[level] for level in risk_dist["risk_level"]],
    edgecolor="black",
    linewidth=1.2
)

# Add labels on bars
for bar in bars:
    height = bar.get_height()
    plt.text(
        bar.get_x() + bar.get_width()/2., height,
        f"{int(height)}\n({int(height)/total_users*100:.1f}%)",
        ha="center", va="bottom", fontsize=12, fontweight="bold"
    )

plt.xlabel("Risk Level", fontsize=14, fontweight="bold")
plt.ylabel("Number of Users", fontsize=14, fontweight="bold")
plt.title(f"User Risk Distribution (Total: {total_users} users)", 
          fontsize=16, fontweight="bold", pad=20)
plt.grid(axis="y", alpha=0.3, linestyle="--")
plt.tight_layout()
plt.show()

print("✅ Chart created")

# Print summary statistics
print("\n" + "="*60)
print("RISK DISTRIBUTION SUMMARY")
print("="*60)
for _, row in risk_dist.iterrows():
    level = row["risk_level"]
    # Named user_count so it does not shadow PySpark's count() imported in Section 1
    user_count = row["count"]
    pct = user_count/total_users*100
    print(f"{level:8s}: {user_count:4d} users ({pct:5.1f}%)")
print("="*60)

## 8. Prepare Final Output

Prepare the final DataFrame with all columns properly ordered and metadata added.

**What we're doing:**
- Add timestamp columns (calculation_date, TimeGenerated)
- Select all columns in a logical order:
  - Identity fields (UserId, UserPrincipalName, UserDisplayName, department)
  - Risk scores (total_risk_score, risk_level, ip_diversity_score, frequency_score; consistency_score is not part of the 12-column output)
  - Metrics (unique_ip_count, total_signins)
  - Metadata (timestamps)
- Order by total_risk_score descending (highest risk first)

**Key Learning:**
- `current_timestamp()` - Adding timestamps
- Column ordering for readability
- `.select()` for final schema
- `.orderBy()` with descending order

**Output Schema (12 columns):**
1. UserId
2. UserPrincipalName
3. UserDisplayName
4. department
5. total_risk_score
6. risk_level
7. ip_diversity_score
8. frequency_score
9. unique_ip_count
10. total_signins
11. calculation_date
12. TimeGenerated

**Expected Output:**
Sample of the final output showing the top 10 highest risk users.

In [None]:
print("📝 Preparing final output...")

output_df = (
    risk_scores
    .withColumn("calculation_date", current_timestamp())
    .withColumn("TimeGenerated", current_timestamp())
    .select(
        # Identity (4 columns)
        "UserId",
        "UserPrincipalName",
        "UserDisplayName",
        "department",
        # Risk Scores (4 columns)
        "total_risk_score",
        "risk_level",
        "ip_diversity_score",
        "frequency_score",
        # Metrics (2 columns)
        "unique_ip_count",
        "total_signins",
        # Metadata (2 columns)
        "calculation_date",
        "TimeGenerated"
    )
    .orderBy(col("total_risk_score").desc())
)

output_count = output_df.count()
column_count = len(output_df.columns)

print(f"✅ Output prepared:")
print(f"   Users: {output_count}")
print(f"   Columns: {column_count}")
print(f"   Ordered by: total_risk_score (descending)")

print("\n📋 Sample Final Output (Top 10 highest risk users):")
output_df.show(10, truncate=False)

## 9. Write to Custom Table

Write the risk scores to a custom Sentinel table for querying and analysis.

**Required Permissions:**
- **Microsoft Sentinel Contributor** role on the workspace, OR
- **Storage Blob Data Contributor** role on the storage account

**What we're doing:**
- Write output_df to `UserRiskScoreSimple_SPRK` table
- Mode: "overwrite" - replaces existing data with current analysis
- Format: Delta Lake for ACID transactions
- Handle permission errors gracefully

**Custom Table Benefits:**
- Query risk scores directly in KQL
- Create workbooks and dashboards
- Use in analytics rules for alerting
- Join with other Sentinel tables

**If write fails:**
- Data remains in output_df for analysis
- Can export to CSV for manual ingestion
- Contact admin for permissions

**Expected Output:**
Success message with record count, OR permission error with alternatives.
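
As an alternative way to think about "handle errors gracefully", the sketch below wraps a write call so that `sys.stderr` is silenced only for the duration of the call and the temporary file handle is always closed. `quiet_write` and `fake_write` are hypothetical names; `fake_write` merely stands in for the real write call and simulates a permission failure. Keep in mind this only redirects output written through Python's `sys.stderr`; messages emitted elsewhere, for example by the Spark runtime, may still appear.

```python
import contextlib
import os


def quiet_write(write_fn):
    """Call write_fn with sys.stderr redirected to os.devnull; return (ok, error)."""
    try:
        # Both context managers clean up on exit, even if write_fn raises.
        with open(os.devnull, "w") as devnull, contextlib.redirect_stderr(devnull):
            write_fn()
        return True, None
    except Exception as exc:
        return False, exc


def fake_write():
    """Stand-in for the real write; simulates a missing permission."""
    raise PermissionError("hypothetical: missing write permission")


ok, error = quiet_write(fake_write)
if ok:
    print("Write succeeded")
else:
    print(f"Write failed: {type(error).__name__}: {error}")
```

Returning the exception instead of discarding it keeps the underlying reason available, which makes a permissions problem easier to tell apart from other failures.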

In [None]:
import sys
import os

print("💾 Writing risk scores to custom table...")
print("⚠️  Note: Writing requires Microsoft Sentinel Contributor permissions\n")

# Custom table name - following the _SPRK convention
CUSTOM_TABLE_NAME = "UserRiskScoreSimple_SPRK"
write_success = False

# Save original stderr
original_stderr = sys.stderr

try:
    # Suppress stderr during write attempt to avoid mixed output
    sys.stderr = open(os.devnull, 'w')
    
    # Write using Sentinel provider's save_as_table method
    sentinel_provider.save_as_table(
        output_df,
        CUSTOM_TABLE_NAME,
        write_options={
            "mode": "overwrite",
            "mergeSchema": "true"
        }
    )
    
    write_success = True
    
    # Restore stderr before printing success
    sys.stderr = original_stderr
    
    record_count = output_df.count()
    print(f"✅ Successfully wrote {record_count} records to {CUSTOM_TABLE_NAME}")
    print(f"\n📊 Table Details:")
    print(f"   Table Name: {CUSTOM_TABLE_NAME}")
    print(f"   Records: {record_count}")
    print(f"   Columns: {len(output_df.columns)}")
    
    print(f"\n🔍 Sample KQL Queries:")
    print(f"   // Query all high-risk users")
    print(f"   {CUSTOM_TABLE_NAME}")
    print(f"   | where risk_level == 'High'")
    print(f"   | order by total_risk_score desc")
    print(f"")
    print(f"   // Query by department")
    print(f"   {CUSTOM_TABLE_NAME}")
    print(f"   | where department == 'IT'")
    print(f"   | summarize avg(total_risk_score) by risk_level")
    
except Exception as e:
    # Restore stderr
    sys.stderr = original_stderr
    
    print(f"❌ Could not write to custom table")
    # Show a short form of the underlying error so the cause is not hidden
    print(f"   Error: {type(e).__name__}: {str(e)[:300]}")
    print(f"\n⚠️  Common causes:")
    print(f"   - Missing Microsoft Sentinel Contributor permissions")
    print(f"   - Storage account access issues")
    print(f"   See: https://learn.microsoft.com/en-us/azure/sentinel/roles")
    
finally:
    # Ensure stderr is always restored
    sys.stderr = original_stderr

if not write_success:
    print(f"\n💡 Your risk scores are still available in the 'output_df' variable!")
    print(f"\n📊 You can still analyze the data:")
    print(f"   output_df.filter(col('risk_level') == 'High').show()")
    print(f"\n📁 To export to CSV:")
    print(f"   pdf = output_df.toPandas()")
    print(f"   pdf.to_csv('user_risk_scores_simple.csv', index=False)")

## 10. Summary & Next Steps

## 🎉 Congratulations!

You've successfully completed your first Sentinel security analytics notebook!

### What You Accomplished

✅ **Loaded data from 2 Sentinel tables**
- SigninLogs (authentication events)
- EntraUsers (user profiles)

✅ **Performed data operations**
- Filtered data with conditions
- Aggregated with groupBy() and agg()
- Joined two DataFrames
- Created calculated columns

✅ **Calculated risk scores**
- IP diversity scoring (0-20 points)
- Sign-in frequency scoring (0-20 points)
- Consistency pattern detection (0-10 points)
- Risk level classification (Low/Medium/High)

✅ **Created visualizations**
- Risk distribution bar chart
- Color-coded by risk level

✅ **Saved results**
- Wrote to UserRiskScoreSimple_SPRK table (if permissions allowed)
- Created queryable data for KQL

---

## What You Learned

### Sentinel Skills
- Connecting to workspace with MicrosoftSentinelProvider
- Reading from workspace tables (SigninLogs) and system tables (EntraUsers)
- Understanding table schemas and relationships
- Writing to custom tables for analysis

### PySpark Skills
- **Filtering and selecting**: `.filter()`, `.select()`
- **Aggregating**: `.groupBy()`, `.agg()`, `countDistinct()`, `count()`
- **Joining**: LEFT JOIN with qualified column names
- **Transforming**: `.withColumn()`, `when().otherwise()`
- **Sorting**: `.orderBy()` with ascending/descending

### Security Analytics Skills
- IP diversity as a risk signal
- Threshold-based scoring methodology
- Risk level classification
- Pattern detection (automation, consistency)

---

## Next Steps

### 1. Experiment with This Notebook

Try modifying the scoring thresholds:
```python
# In Section 6, change IP diversity thresholds:
.withColumn("ip_diversity_score",
    when(col("unique_ip_count") == 1, 0)
    .when(col("unique_ip_count") <= 5, 5)  # Changed from 3
    .when(col("unique_ip_count") <= 10, 12)  # Changed from 6
    .otherwise(20)
)
```

Try different time windows:
```python
# In Section 1:
ANALYSIS_DAYS = 7  # Change from 14 to 7 days
```

### 2. Enhance This Notebook

**Add Off-Hours Analysis:**
```python
# In Section 2, add to .select():
.select("UserId", "UserPrincipalName", "IPAddress", "TimeGenerated")

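# In Section 1, extend the import (the alias avoids shadowing Python's built-in sum):
# from pyspark.sql.functions import hour, sum as spark_sum

# In Section 4, add to .agg():
spark_sum(
    when((hour("TimeGenerated") < 6) | (hour("TimeGenerated") >= 18), 1)
    .otherwise(0)
).alias("offhours_signins")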
```
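
To check which hours the off-hours condition treats as "off", here is the same rule as a plain-Python predicate. `is_off_hours` is a hypothetical name, the timestamps are made up, and the sketch assumes the timestamp is already in the time zone you care about, which is a decision you would need to make for your own data.

```python
from datetime import datetime


def is_off_hours(ts: datetime) -> bool:
    """True before 06:00 or from 18:00 onward, matching the condition above."""
    return ts.hour < 6 or ts.hour >= 18


# Made-up timestamps for illustration only.
for ts in (
    datetime(2025, 1, 15, 5, 59),
    datetime(2025, 1, 15, 6, 0),
    datetime(2025, 1, 15, 17, 59),
    datetime(2025, 1, 15, 18, 0),
):
    print(ts.strftime("%H:%M"), is_off_hours(ts))
```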

**Add Application Diversity:**
```python
# In Section 2, add to .select():
.select("UserId", "UserPrincipalName", "IPAddress", "AppId")

# In Section 4, add to .agg():
countDistinct("AppId").alias("unique_app_count")
```

### 3. Progress to Full Version

Once comfortable with this notebook, explore:
- **Risk Score.ipynb** - Full production version
  - 4 tables (adds AuditLogs, SecurityAlert)
  - 6 risk categories (100 points)
  - Department baselines
  - 27 output columns
  - More sophisticated scoring

### 4. Apply in Production

**Schedule Regular Runs:**
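- Run daily or weekly to track risk trends (Section 9 writes in `overwrite` mode, so each run replaces the previous results unless you retain them separately)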
- Monitor changes in user risk profiles
- Identify emerging threats early

**Create Analytics Rules:**
```kql
UserRiskScoreSimple_SPRK
| where risk_level == "High"
| where total_risk_score >= 40
| project UserPrincipalName, total_risk_score, unique_ip_count, department
```

**Build Workbooks:**
- Visualize risk trends over time
- Compare departments
- Track high-risk users

**Join with Other Tables:**
```kql
UserRiskScoreSimple_SPRK
| join kind=inner (
    SecurityAlert
    | where CompromisedEntity != ""
) on $left.UserPrincipalName == $right.CompromisedEntity
| summarize AlertCount=count() by UserPrincipalName, risk_level
```

---

## Resources

**Documentation:**
- [Sentinel Data Lake Documentation](https://learn.microsoft.com/en-us/azure/sentinel/datalake/)
- [PySpark SQL Functions](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html)
- [KQL Reference](https://learn.microsoft.com/en-us/azure/data-explorer/kusto/query/)

**Related Notebooks:**
- `Risk Score.ipynb` - Full production version
- Design documents in `/plans/` directory

**Get Help:**
- [Microsoft Tech Community - Sentinel](https://techcommunity.microsoft.com/t5/microsoft-sentinel/bd-p/MicrosoftSentinel)
- [Stack Overflow - Azure Sentinel](https://stackoverflow.com/questions/tagged/azure-sentinel)

---

## 🎓 You're Ready!

You now have the foundational skills to:
- Work with Sentinel data programmatically
- Perform security analytics with PySpark
- Calculate risk scores from behavioral data
- Create visualizations and reports
- Build production security analytics

Keep exploring, experimenting, and building!

---