# 🚀 Getting Started with Apache Spark for Microsoft Sentinel data lake

This notebook is designed for **beginners** (security analysts, data engineers, SOC researchers) 
who are new to working with Spark notebooks and Sentinel data lake.

## 🎯 What You'll Learn

By the end of this tutorial, you'll be able to:
- 🔗 Connect to Sentinel data lake using the `MicrosoftSentinelProvider`
- 📊 Load and explore log data (SecurityEvent, SigninLogs, AuditLogs, etc.)
- 🔍 Perform common **security analysis queries** using Spark operations
- 💾 Save processed results back to the data lake for further analysis
- 🗑️ Safely manage tables (create, read, delete)

> **💡 Tip**: This is a hands-on tutorial - run each cell step by step to see the results!

---

## 🔗 Step 1: Connect to Sentinel data lake

First, we'll establish a connection using the **MicrosoftSentinelProvider** - your gateway to Sentinel data lake.

### What is MicrosoftSentinelProvider?
The `MicrosoftSentinelProvider` is a Python class that:
- 🔗 Connects your Spark session to Microsoft Sentinel data lake
- 📋 Lists available databases and tables
- 📖 Reads security log data at scale
- 💾 Saves processed results back to the data lake

### Getting Started
➡️ Simply initialize the provider with your active Spark session - that's it!

📚 **Learn More**: [Microsoft Sentinel Provider Class Reference](https://learn.microsoft.com/en-us/azure/sentinel/datalake/sentinel-provider-class-reference)


In [None]:
# Generic libraries for data manipulation and pyspark operations
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Loading the MicrosoftSentinelProvider from sentinel_lake library
from sentinel_lake.providers import MicrosoftSentinelProvider

# Initialize provider
data_provider = MicrosoftSentinelProvider(spark)

print("✅ MicrosoftSentinelProvider initialized")

## ⚙️ Step 2: Configure Parameters & Tables

Before we dive into the data, let's set up our configuration parameters.

### 📋 What We're Configuring:
- **🕒 Time Window** → How far back to look for data (the last 4 hours in our example)
- **📂 Input Table** → Which Sentinel table to read from (SigninLogs in our example)
- **🎯 Output Table** → Where to save our processed results
- **🏢 Workspace** → Your Sentinel workspace name

### 🔧 Why This Matters:
- **Reusability**: Change dates/tables without modifying code
- **Performance**: Smaller time windows = faster queries
- **Organization**: Clear naming helps track your analysis

> **⚠️ Remember**: Replace `<workspace_name>` with your actual Sentinel workspace name!
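
The configuration cell below derives its window from `datetime.now()`, which returns the local time of the machine running the notebook. As a minimal sketch, the same calculation can be wrapped in a helper that takes the reference time as an argument, which makes the window reproducible and easy to pin to UTC. The name `compute_window` and the `demo_` variables are hypothetical, and whether a UTC-aware value is the right input depends on your Spark session's time zone settings, so treat that as an assumption to verify.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

def compute_window(now: datetime, lookback_hours: int) -> Tuple[datetime, datetime]:
    """Return (start, end), with end rounded down to the start of the hour."""
    if lookback_hours <= 0:
        raise ValueError("lookback_hours must be positive")
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=lookback_hours)
    return start, end

# A fixed reference time keeps the example reproducible
fixed_now = datetime(2025, 1, 15, 10, 42, 7, tzinfo=timezone.utc)
demo_start, demo_end = compute_window(fixed_now, lookback_hours=4)
print(demo_start.isoformat(), "->", demo_end.isoformat())
```

In a live run you would pass `datetime.now(timezone.utc)` instead of the fixed value.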

In [None]:
# Time window
lookback_hours = 4  # Lookback period in hours; adjust to fit your analysis
# Define run_start and run_end for use in queries
run_end = datetime.now().replace(minute=0, second=0, microsecond=0)  # Round down to the start of the current hour
run_start = run_end - timedelta(hours=lookback_hours)   # Window start: lookback_hours before run_end

# Workspace name (replace with your own Sentinel workspace)
workspace_name = "<workspace_name>"  # Change this to your Sentinel workspace name  

# Table names
input_table_raw = "SigninLogs"
output_data_lake_table = "Test_output_table_SPRK"    # Change this to your desired output table name; it must end with the _SPRK suffix

# Write options (append adds new rows and keeps existing data)
write_options = {"mode": "append"}  # Simple append mode for this demo

print("📅 Time Window:", run_start, "→", run_end)
print("✅ Parameters configured")

print("📂 Tables configured as:")
print(f"\t   Input:      {input_table_raw}")
print(f"\t   Output Table: {output_data_lake_table}")


## 📊 Step 3: Load & Transform Data

Now we'll load real security data from the **SigninLogs** table and prepare it for analysis.

### 🎯 What This Code Does:
1. **📖 Reads Data**: Connects to SigninLogs in your Analytics Tier workspace
2. **🔍 Filters Time**: Only gets data from our specified time window
3. **🏗️ Transforms Data**: 
   - Selects key security fields (user, IP, location, etc.)
   - Expands JSON location data into separate columns (City, Country, coordinates)
   - Adds a date column for easier querying
4. **⚡ Optimizes**: Marks the DataFrame for caching so later operations reuse it instead of re-reading the source

### 🗂️ Key Fields We're Working With:
- **Identity**: UserPrincipalName, UserDisplayName
- **Network**: IPAddress, AutonomousSystemNumber (ASN)
- **Location**: City, Country, Latitude/Longitude (from LocationDetails JSON)
- **Security**: ResultType, ResultSignature, Status
- **Context**: UserAgent, TimeGenerated

> **💡 Pro Tip**: `.cache()` is lazy. Spark stores the data the first time an action (such as `count()` or `display()`) runs, and repeated queries on the same DataFrame are much faster after that!
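
To make the JSON expansion concrete, here is a plain-Python sketch of what the JSON paths used in the cell below resolve to for a single record. The sample `LocationDetails` payload is made up for illustration (its field names follow the comment in the notebook), and `resolve_path` is a hypothetical helper that handles only simple dotted keys, not everything Spark's `get_json_object` accepts.

```python
import json

# Made-up sample payload; real records may contain more or fewer fields
sample_location_details = json.dumps({
    "city": "Seattle",
    "state": "Washington",
    "countryOrRegion": "US",
    "geoCoordinates": {"latitude": 47.6062, "longitude": -122.3321},
})

def resolve_path(json_text, path):
    """Resolve a simple '$.a.b' path; return None if any key is missing."""
    if not path.startswith("$."):
        raise ValueError("path must start with '$.'")
    node = json.loads(json_text)
    for key in path[2:].split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node

city = resolve_path(sample_location_details, "$.city")
latitude = resolve_path(sample_location_details, "$.geoCoordinates.latitude")
missing = resolve_path(sample_location_details, "$.geoCoordinates.altitude")
print(city, latitude, missing)
```

In Spark, `get_json_object` returns a string (or null when the path is missing), which is why the notebook casts latitude and longitude to `double`.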

In [None]:
# Core fields + enrichment
signin_fields = [
    "TimeGenerated",
    "UserPrincipalName",
    "UserDisplayName",
    "IPAddress",
    "ResultType",
    "ResultSignature",
    "Status",
    "UserType",
    "UserAgent",
    "LocationDetails",      # Dynamic JSON field (city, state, countryOrRegion, geoCoordinates); expanded into columns below
    "AutonomousSystemNumber"
]

df_recent_raw = (
    data_provider.read_table(input_table_raw, workspace_name)   # Read SigninLogs from the analytics tier of the given workspace
    .select(*signin_fields)
    .filter((F.col("TimeGenerated") > F.lit(run_start)) & 
            (F.col("TimeGenerated") <= F.lit(run_end))) # Filter by time window as defined above
    .withColumn("date", F.to_date("TimeGenerated"))
    # Expand JSON fields
    .withColumn("City", F.get_json_object("LocationDetails", "$.city"))
    .withColumn("Country", F.get_json_object("LocationDetails", "$.countryOrRegion"))
    .withColumn("Latitude", F.get_json_object("LocationDetails", "$.geoCoordinates.latitude").cast("double"))
    .withColumn("Longitude", F.get_json_object("LocationDetails", "$.geoCoordinates.longitude").cast("double"))
    .withColumnRenamed("AutonomousSystemNumber", "ASN")
    .cache()
)

# Prints only the DataFrame schema (metadata); this does not scan the data, so it is lightweight
df_recent_raw.printSchema()
print("✅ Loaded SigninLogs with expanded LocationDetails and ASN")

## 🔍 Step 4: Explore Security Data

Now for the fun part - let's analyze the data! We'll demonstrate common security analysis patterns.

### 🎯 Example Analysis: Failed Login Attempts

We're looking for accounts with failed authentication attempts (ResultType = "50126", the Microsoft Entra ID error code for an invalid username or password).
This helps identify:
- 🚨 **Potential brute force attacks**
- 🔐 **Accounts under attack**
- 📊 **Attack patterns and trends**

### 🛠️ What the Code Does:
1. **Filters** for failed login attempts
2. **Groups** by user account
3. **Counts** failures per account
4. **Sorts** by highest failure count
5. **Checks** if any results were found
6. **Conditionally displays** results with appropriate messaging:
   - If records found → Shows HTML table with top 10 targeted accounts
   - If no records → Provides helpful context and next steps

> **💡 Security Insight**: High failure counts on specific accounts often indicate targeted attacks or compromised credentials!
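
The filter, group, count, sort, and limit steps can also be sketched in plain Python on a handful of rows, which may help if the Spark method chain is unfamiliar. The sample records and the `top_failed_accounts` name are invented for illustration; the notebook itself performs these steps with Spark DataFrame operations.

```python
from collections import Counter

# Invented sample rows: (UserPrincipalName, ResultType)
sample_signins = [
    ("alice@contoso.com", "50126"),
    ("bob@contoso.com", "0"),
    ("alice@contoso.com", "50126"),
    ("carol@contoso.com", "50126"),
    ("alice@contoso.com", "0"),
]

def top_failed_accounts(rows, failure_code="50126", limit=10):
    """Count failures per account and return the highest counts first."""
    failures = Counter(user for user, result in rows if result == failure_code)
    return failures.most_common(limit)

top_accounts = top_failed_accounts(sample_signins)
print(top_accounts)
```

Here `Counter` plays the role of `groupBy(...).count()`, and `most_common(limit)` combines the descending sort with the top-10 cutoff.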

### 🔧 **Troubleshooting Display Issues:**
If you see `SynapseWidget(Synapse.DataFrame, xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx)` instead of an HTML table:
1. **Click the 3 dots (⋯)** next to the output cell that shows the widget text
2. **Select "Change Presentation"** from the dropdown menu
3. **Choose a different renderer** (try "VS Code Builtin Notebook Output Renderer" or "Fabric Data Engineering Notebook Output Renderer")
4. This will convert the widget to a proper interactive table view
5. If it still does not show, you may have to restart the session.

> **💡 Pro Tip**: This widget issue sometimes occurs in VS Code with certain DataFrame sizes or configurations, but the data is still there!

In [None]:
# Create the analysis DataFrame
failed_logins_df = (df_recent_raw.filter(F.col("ResultType") == "50126") 
                   .groupBy("UserPrincipalName")
                   .count()
                   .orderBy(F.desc("count"))
                   .limit(10))

# Check if we have any results
record_count = failed_logins_df.count()

if record_count > 0:
    print(f"📊 Showing the top {record_count} accounts by failed login attempts")
    print("⏳ Preparing to show DataFrame... groupBy/count/orderBy operations may take a few minutes ⌛")
    display(failed_logins_df)
    print("✅ Displayed: Top accounts with failed logins")
else:
    print("🎉 Great news! No failed login attempts found in the selected time window.")
    print("💡 This could mean:")
    print("   • No sign-ins failed with this ResultType during the window")
    print("   • The time window might be too narrow (try increasing lookback_hours)")
    print("   • Different ResultType codes might be relevant for your environment")
    print("   • Consider checking other security events like successful logins from new locations")

## 💾 Step 5: Save Results to data lake

After processing and analyzing data, you'll often want to save results for:
- 📊 **Dashboards and reports**
- 🔍 **Future investigations** 
- 🤝 **Sharing with team members**
- ⚡ **Faster subsequent queries**

### 🎯 What We're Doing:
Using `save_as_table()` to write our processed data back to the Sentinel data lake.

### 📋 Save Options:
- **Table Name**: `Test_output_table_SPRK` (clearly marked as test data)
- **Mode**: `append` (adds to existing data rather than overwriting)
- **Tier**: "System tables" (data lake tier for long-term storage)

> **⚠️ Important**: This creates a test table that we'll clean up at the end of this tutorial.
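
Because the output table name is expected to end with `_SPRK` (per the configuration comment in Step 2), it can be worth checking the name before writing. A minimal sketch, where `ensure_sprk_suffix` is a hypothetical helper and the suffix is the only thing it checks:

```python
REQUIRED_SUFFIX = "_SPRK"  # Suffix named in this notebook's configuration comment

def ensure_sprk_suffix(table_name: str) -> str:
    """Return table_name unchanged if it ends with _SPRK, else raise ValueError."""
    if not table_name.endswith(REQUIRED_SUFFIX):
        raise ValueError(
            f"Output table '{table_name}' must end with {REQUIRED_SUFFIX}"
        )
    return table_name

checked_name = ensure_sprk_suffix("Test_output_table_SPRK")
print(checked_name)
```

Failing fast on a bad name is cheaper than discovering the problem after a long-running write.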

In [None]:
try:
    data_provider.save_as_table(
        df_recent_raw,
        output_data_lake_table,
        "System tables",        # "System tables" targets a table in the data lake tier
        write_options
    )
    print(f"✅ Wrote test data into {output_data_lake_table}")
except Exception as save_err:
    print(f"❌ Failed writing data into {output_data_lake_table}: {save_err}")

## 🗑️ Step 6: Clean Up Test Data

### ⚠️ **CAUTION: Table Deletion**

We're about to demonstrate the `delete_table()` operation. This is **permanent** and **cannot be undone**.

### 🎯 Why We're Doing This:
- 🧹 **Clean up**: Remove the test table we created
- 📚 **Learning**: Show you how to safely manage tables
- 💰 **Cost control**: Avoid unnecessary storage charges

### 🛡️ Safety Guidelines:
- ✅ **Only delete tables you created for testing**
- ❌ **Never delete production tables without team approval**
- 📝 **Always double-check the table name before deletion**

> **🏭 Best Practice**: In production, implement approval workflows and backup procedures before any deletion operations!

### 🔍 What's Being Deleted:
Table: `Test_output_table_SPRK` (our demo table from Step 5)
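
One way to apply the safety guidelines above is to route deletions through a small guard that refuses any table not on an explicit allow-list. This is a sketch only: `guarded_delete` and `fake_delete` are hypothetical names, and in this notebook the `delete_fn` argument would be `data_provider.delete_table`.

```python
def guarded_delete(table_name, allowed_tables, delete_fn):
    """Call delete_fn(table_name) only if the table is on the allow-list."""
    if table_name not in allowed_tables:
        raise PermissionError(
            f"Refusing to delete '{table_name}': not on the allow-list"
        )
    delete_fn(table_name)
    return table_name

# Stand-in for data_provider.delete_table so the sketch runs anywhere
deleted = []

def fake_delete(name):
    deleted.append(name)

tables_created_here = {"Test_output_table_SPRK"}
guarded_delete("Test_output_table_SPRK", tables_created_here, fake_delete)
print(deleted)
```

A request to delete `SigninLogs` through this guard would raise `PermissionError` before any delete call is made.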

In [None]:
print(f"⚠️ Deleting table - {output_data_lake_table} - created for demo purposes")
data_provider.delete_table(output_data_lake_table) 

print(f"✅ Deleted table {output_data_lake_table}")

## 🎉 Congratulations!

You've successfully completed your first Spark notebook for Microsoft Sentinel data lake analysis! 

### 🚀 What You've Accomplished:
- ✅ Connected to Sentinel data lake using MicrosoftSentinelProvider
- ✅ Loaded and transformed real security data (SigninLogs)
- ✅ Performed security analysis (failed login detection)
- ✅ Saved results back to the data lake
- ✅ Safely managed table operations (create/delete)

### 🎯 Next Steps:
- **Explore More Tables**: Try SecurityEvent, AuditLogs, or DeviceEvents
- **Advanced Analytics**: Implement time-series analysis and anomaly detection
- **Automation**: Schedule notebooks to run automatically
- **Visualization**: Create dashboards from your saved results

---

## 📚 References & Further Learning

### 🔗 Microsoft Sentinel data lake Documentation:
- **[Sentinel Provider Class Reference](https://learn.microsoft.com/en-us/azure/sentinel/datalake/sentinel-provider-class-reference)** - Complete API documentation for MicrosoftSentinelProvider
- **[Notebook Jobs in Sentinel](https://learn.microsoft.com/en-us/azure/sentinel/datalake/notebook-jobs)** - Learn to schedule and automate your notebooks

### 📖 Additional Learning Resources:
- **[Apache Spark Official Documentation](https://spark.apache.org/docs/latest/)** - Comprehensive Spark programming guide
- **[PySpark Quickstart: DataFrame](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html)** - Introduction and quickstart for the PySpark DataFrame API
- **[Microsoft Sentinel Skill-up Training](https://learn.microsoft.com/en-us/azure/sentinel/skill-up-resources)** - Free training modules for security analysts

### 🛠️ Advanced Topics:
- **[PySpark DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)** - Master data manipulation techniques
- **[Spark SQL Functions](https://spark.apache.org/docs/latest/api/sql/index.html)** - Powerful built-in functions for data analysis

### 🤝 Community & Support:
- **[Microsoft Sentinel data lake GitHub](https://github.com/microsoft/Sentinel)** - Out-of-the-box notebooks and KQL queries for the data lake
- **[Microsoft Tech Community](https://techcommunity.microsoft.com/t5/microsoft-sentinel/bd-p/MicrosoftSentinel)** - Connect with other security professionals

---

> **💡 Pro Tip**: Use this notebook as a starter template for your own security analysis projects!