# 🚀 Getting Started with Apache Spark for Sentinel data lake

This notebook is designed for **beginners** (security analysts, data engineers, SOC researchers) 
who are new to working with Spark notebooks and Microsoft Sentinel data.

## 🎯 What You'll Learn

By the end of this tutorial, you'll be able to:
- 🔗 Connect to Sentinel data lake using the `MicrosoftSentinelProvider`
- 📊 Load and explore log data (SecurityEvent, SigninLogs, AuditLogs, etc.)
- 🔍 Perform common **security analysis queries** using Spark operations
- 💾 Save processed results back to the data lake for further analysis
- 🗑️ Safely manage tables (create, read, delete)

> **💡 Tip**: This is a hands-on tutorial - run each cell step by step to see the results!

---

## 🔗 Step 1: Connect to Sentinel data lake

First, we'll establish a connection using the **MicrosoftSentinelProvider** - your gateway to Sentinel data.

### What is MicrosoftSentinelProvider?
The `MicrosoftSentinelProvider` is a Python class that:
- 🔗 Connects your Spark session to Microsoft Sentinel's data lake
- 📋 Lists available databases and tables
- 📖 Reads security log data at scale
- 💾 Saves processed results back to the data lake

### Getting Started
➡️ Simply initialize the provider with your active Spark session - that's it!

📚 **Learn More**: [Microsoft Sentinel Provider Class Reference](https://learn.microsoft.com/en-us/azure/sentinel/datalake/sentinel-provider-class-reference)


In [None]:
# Generic libraries for data manipulation and pyspark operations
from datetime import datetime, timedelta
from pyspark.sql import functions as F
from pyspark.sql.window import Window
# Loading the MicrosoftSentinelProvider from sentinel_lake library
from sentinel_lake.providers import MicrosoftSentinelProvider

# Initialize provider
data_provider = MicrosoftSentinelProvider(spark)

print("✅ MicrosoftSentinelProvider initialized")

## ⚙️ Step 2: Configure Parameters & Tables

Before we dive into the data, let's set up our configuration parameters.

### 📋 What We're Configuring:
- **🕒 Time Window** → How far back to look for data (e.g., last 1 hour)
- **📂 Input Table** → Which Sentinel table to read from (SigninLogs in our example)
- **🎯 Output Table** → Where to save our processed results
- **🏢 Workspace** → Your Sentinel workspace name

### 🔧 Why This Matters:
- **Reusability**: Change dates/tables without modifying code
- **Performance**: Smaller time windows = faster queries
- **Organization**: Clear naming helps track your analysis

> **⚠️ Remember**: Replace `<YOUR_WORKSPACE_NAME>` with your actual Sentinel workspace name!

In [None]:
# Time window
lookback_hours = 1
run_end = datetime.now().replace(minute=0, second=0, microsecond=0)  # Round down to the nearest hour
run_start = run_end - timedelta(hours=lookback_hours)   # Setting start time based on lookback hours relative to run_end

# Workspace name (replace with your own Sentinel workspace)
workspace_name = "<YOUR_WORKSPACE_NAME>"

# Table names
input_table_raw = "SigninLogs"
output_datalake_table = "Test_output_table_SPRK"

# Write options (append keeps the rows written by earlier runs; no partitioning is configured in this demo)
write_options = {"mode": "append"}  # simple append mode for the demo

print("📅 Time Window:", run_start, "→", run_end)
print("✅ Parameters configured")

print("📂 Tables configured as:")
print(f"\t   Input:      {input_table_raw}")
print(f"\t   Output Table: {output_datalake_table}")

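The window arithmetic from the cell above can be checked in isolation. The following is a minimal sketch using only the standard library; the helper name `hourly_window` and the fixed timestamp are illustrative assumptions, not part of the notebook or the provider API.

```python
from datetime import datetime, timedelta


def hourly_window(now, lookback_hours=1):
    """Return (start, end), where end is `now` rounded down to the hour."""
    end = now.replace(minute=0, second=0, microsecond=0)
    start = end - timedelta(hours=lookback_hours)
    return start, end


# Fixed, hypothetical timestamp so the result is reproducible.
window_start, window_end = hourly_window(datetime(2024, 5, 17, 14, 37, 52))
print(window_start, "->", window_end)
# 2024-05-17 13:00:00 -> 2024-05-17 14:00:00
```

Because the end of the window is rounded down, a run at 14:37 covers 13:00 to 14:00; events from the current partial hour are picked up by the next run.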

## 📊 Step 3: Load & Transform Data

Now we'll load real security data from the **SigninLogs** table and prepare it for analysis.

### 🎯 What This Code Does:
1. **📖 Reads Data**: Connects to SigninLogs in your Analytics Tier workspace
2. **🔍 Filters Time**: Only gets data from our specified time window
3. **🏗️ Transforms Data**: 
   - Selects key security fields (user, IP, location, etc.)
   - Expands JSON location data into separate columns (City, Country, coordinates)
   - Adds a date column for easier querying
4. **⚡ Optimizes**: Marks the DataFrame for caching so later operations can reuse it instead of re-reading the source

### 🗂️ Key Fields We're Working With:
- **Identity**: UserPrincipalName, UserDisplayName
- **Network**: IPAddress, AutonomousSystemNumber (ASN)
- **Location**: City, Country, Latitude/Longitude (from LocationDetails JSON)
- **Security**: ResultType, ResultSignature, Status
- **Context**: UserAgent, TimeGenerated

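> **💡 Pro Tip**: `.cache()` is lazy: the data is stored (in memory, spilling to disk if needed) the first time an action such as `display()` or `count()` runs, which makes repeated queries after that much faster!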

In [None]:
# Core fields + enrichment
signin_fields = [
    "TimeGenerated",
    "UserPrincipalName",
    "UserDisplayName",
    "IPAddress",
    "ResultType",
    "ResultSignature",
    "Status",
    "UserType",
    "UserAgent",
    "LocationDetails",      # JSON: city, state, countryOrRegion, geoCoordinates
    "AutonomousSystemNumber"
]

df_recent_raw = (
    data_provider.read_table(input_table_raw, workspace_name)   # Read SigninLogs from the analytics tier workspace named in workspace_name
    .select(*signin_fields)
    .filter((F.col("TimeGenerated") > F.lit(run_start)) & 
            (F.col("TimeGenerated") <= F.lit(run_end))) # Keep rows inside the lookback window (run_start, run_end]
    .withColumn("date", F.to_date("TimeGenerated"))
    # Expand JSON fields
    .withColumn("City", F.get_json_object("LocationDetails", "$.city"))
    .withColumn("Country", F.get_json_object("LocationDetails", "$.countryOrRegion"))
    .withColumn("Latitude", F.get_json_object("LocationDetails", "$.geoCoordinates.latitude").cast("double"))
    .withColumn("Longitude", F.get_json_object("LocationDetails", "$.geoCoordinates.longitude").cast("double"))
    .withColumnRenamed("AutonomousSystemNumber", "ASN")
    .cache()
)

# printSchema() prints only the DataFrame schema (metadata); it does not scan the data, so it is lightweight
df_recent_raw.printSchema()
print("✅ Loaded SigninLogs with expanded LocationDetails and ASN")

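To make the JSON expansion in Step 3 concrete, here is a plain-Python sketch of what the four JSON paths pull out of a single `LocationDetails` value. The sample payload and the helper names are hypothetical; real payloads may contain other fields, and the notebook itself does this work in Spark with `F.get_json_object`.

```python
import json

# Hypothetical LocationDetails value, shaped after the field comment in Step 3.
sample_location = json.dumps({
    "city": "Seattle",
    "state": "Washington",
    "countryOrRegion": "US",
    "geoCoordinates": {"latitude": 47.6062, "longitude": -122.3321},
})


def _to_float(value):
    """Convert to float, returning None when the value is missing or not numeric."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def expand_location(raw):
    """Extract City, Country, Latitude, Longitude; anything missing becomes None."""
    try:
        doc = json.loads(raw) if raw else {}
    except json.JSONDecodeError:
        doc = {}
    if not isinstance(doc, dict):
        doc = {}
    geo = doc.get("geoCoordinates")
    if not isinstance(geo, dict):
        geo = {}
    return {
        "City": doc.get("city"),
        "Country": doc.get("countryOrRegion"),
        "Latitude": _to_float(geo.get("latitude")),
        "Longitude": _to_float(geo.get("longitude")),
    }


print(expand_location(sample_location))
```

Missing keys and unparseable input yield `None` here, in the same way the Spark columns come back as null when a path is absent.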
## 🔍 Step 4: Explore Security Data

Now for the fun part - let's analyze the data! We'll demonstrate common security analysis patterns.

### 🎯 Example Analysis: Failed Login Attempts

We're looking for accounts with failed authentication attempts (ResultType = "50126", the Microsoft Entra error code for an invalid username or password).
This helps identify:
- 🚨 **Potential brute force attacks**
- 🔐 **Accounts under attack**
- 📊 **Attack patterns and trends**

### 🛠️ What the Code Does:
1. **Filters** for failed login attempts
2. **Groups** by user account
3. **Counts** failures per account
4. **Sorts** by highest failure count
5. **Displays** top 10 targeted accounts

> **💡 Security Insight**: High failure counts on specific accounts often indicate targeted attacks or compromised credentials!

In [None]:

print("⏳ Preparing to show DataFrame... groupBy/count/orderBy operations may take a few minutes ⌛")
display(            # display() renders a formatted HTML table in the notebook. Alternatively, use .show() for plain-text table output
    df_recent_raw.filter(F.col("ResultType") == "50126")
             .groupBy("UserPrincipalName")
             .count()
             .orderBy(F.desc("count")).limit(10)
)

print("✅ Displayed Example: Top 10 accounts with failed logons")

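The filter, group, count, sort, and limit steps of Step 4 can be tried on a handful of rows without a cluster. This is a minimal sketch with `collections.Counter`; the sign-in records are made-up sample data, and the notebook itself runs the same logic in Spark.

```python
from collections import Counter

# Hypothetical sign-in records with only the two fields the analysis uses.
signins = [
    {"UserPrincipalName": "alice@contoso.com", "ResultType": "50126"},
    {"UserPrincipalName": "bob@contoso.com", "ResultType": "50126"},
    {"UserPrincipalName": "alice@contoso.com", "ResultType": "50126"},
    {"UserPrincipalName": "carol@contoso.com", "ResultType": "0"},
    {"UserPrincipalName": "alice@contoso.com", "ResultType": "50126"},
]

# Filter on the failure code, then count occurrences per account.
failed = Counter(
    row["UserPrincipalName"] for row in signins if row["ResultType"] == "50126"
)

# most_common(10) sorts by count descending and keeps at most ten accounts.
top_accounts = failed.most_common(10)
print(top_accounts)
# [('alice@contoso.com', 3), ('bob@contoso.com', 1)]
```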
## 💾 Step 5: Save Results to data lake

After processing and analyzing data, you'll often want to save results for:
- 📊 **Dashboards and reports**
- 🔍 **Future investigations**
- 🤝 **Sharing with team members**
- ⚡ **Faster subsequent queries**

### 🎯 What We're Doing:
Using `save_as_table()` to write our processed data back to the Sentinel data lake.

### 📋 Save Options:
- **Table Name**: `Test_output_table_SPRK` (clearly marked as test data)
- **Mode**: `append` (adds to existing data rather than overwriting)
- **Tier**: "System tables" (data lake tier for long-term storage)

> **⚠️ Important**: This creates a test table that we'll clean up at the end of this tutorial.

In [None]:
try:
    data_provider.save_as_table(
        df_recent_raw,
        output_datalake_table,
        "System tables",        # "System tables" writes to a data lake tier table
        write_options
    )
    print(f"✅ Wrote test data into {output_datalake_table}")
except Exception as save_err:
    print(f"❌ Failed writing data into {output_datalake_table}: {save_err}")

## 🗑️ Step 6: Clean Up Test Data

### ⚠️ **CAUTION: Table Deletion**

We're about to demonstrate the `delete_table()` operation. This is **permanent** and **cannot be undone**.

### 🎯 Why We're Doing This:
- 🧹 **Clean up**: Remove the test table we created
- 📚 **Learning**: Show you how to safely manage tables
- 💰 **Cost control**: Avoid unnecessary storage charges

### 🛡️ Safety Guidelines:
- ✅ **Only delete tables you created for testing**
- ❌ **Never delete production tables without team approval**
- 📝 **Always double-check the table name before deletion**

> **💡 Best Practice**: In production, implement approval workflows and backup procedures before any deletion operations!

### 🔍 What's Being Deleted:
Table: `Test_output_table_SPRK` (our demo table from Step 5)

In [None]:
print(f"⚠️ Deleting table - {output_datalake_table} - created for demo purposes")
data_provider.delete_table(output_datalake_table)

print("✅ Specified table deleted successfully.")

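One way to enforce the safety guidelines from Step 6 is to wrap the delete call in a small guard that refuses any table name not on an explicit allow-list. The sketch below is an illustration only: `delete_test_table`, `allowed_tables`, and `delete_fn` are hypothetical names, and the list-based stand-in replaces the real delete call so the example runs anywhere.

```python
def delete_test_table(table_name, delete_fn, allowed_tables):
    """Call delete_fn(table_name) only if the name is on the allow-list."""
    if table_name not in allowed_tables:
        raise ValueError(f"Refusing to delete '{table_name}': not a known test table")
    delete_fn(table_name)
    return table_name


# Stand-in for the real delete call: it only records which names were passed.
deleted = []
allowed = {"Test_output_table_SPRK"}

delete_test_table("Test_output_table_SPRK", deleted.append, allowed)

try:
    delete_test_table("SigninLogs", deleted.append, allowed)
except ValueError as err:
    print(err)

print(deleted)
# ['Test_output_table_SPRK']
```

In the notebook, `data_provider.delete_table` could be passed as `delete_fn` in place of `deleted.append`.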
## 🎉 Congratulations!

You've successfully completed your first Spark notebook for Microsoft Sentinel data lake analysis!

### 🚀 What You've Accomplished:
- ✅ Connected to Sentinel data lake using MicrosoftSentinelProvider
- ✅ Loaded and transformed real security data (SigninLogs)
- ✅ Performed security analysis (failed login detection)
- ✅ Saved results back to the data lake
- ✅ Safely managed table operations (create/delete)

### 🎯 Next Steps:
- **Explore More Tables**: Try SecurityEvent, AuditLogs, or DeviceEvents
- **Advanced Analytics**: Implement time-series analysis and anomaly detection
- **Automation**: Schedule notebooks to run automatically
- **Visualization**: Create dashboards from your saved results

---

## 📚 References & Further Learning

### 🔗 Microsoft Sentinel data lake Documentation:
- **[Sentinel Provider Class Reference](https://learn.microsoft.com/en-us/azure/sentinel/datalake/sentinel-provider-class-reference)** - Complete API documentation for MicrosoftSentinelProvider
- **[Notebook Jobs in Sentinel](https://learn.microsoft.com/en-us/azure/sentinel/datalake/notebook-jobs)** - Learn to schedule and automate your notebooks

### 📖 Additional Learning Resources:
- **[Apache Spark Official Documentation](https://spark.apache.org/docs/latest/)** - Comprehensive Spark programming guide
- **[PySpark Quickstart: DataFrame](https://spark.apache.org/docs/latest/api/python/getting_started/quickstart_df.html)** - Introduction and quickstart for the PySpark DataFrame API
- **[Microsoft Sentinel Skill-up Training](https://learn.microsoft.com/en-us/azure/sentinel/skill-up-resources)** - Free training modules for security analysts

### 🛠️ Advanced Topics:
- **[PySpark DataFrame API](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/dataframe.html)** - Master data manipulation techniques
- **[Spark SQL Functions](https://spark.apache.org/docs/latest/api/sql/index.html)** - Powerful built-in functions for data analysis

### 🤝 Community & Support:
- **[Microsoft Sentinel data lake GitHub](https://github.com/microsoft/Sentinel)** - Out-of-the-box notebooks and KQL queries for the data lake
- **[Microsoft Tech Community](https://techcommunity.microsoft.com/t5/microsoft-sentinel/bd-p/MicrosoftSentinel)** - Connect with other security professionals

---

> **💡 Pro Tip**: Use this notebook as a starter template for your own security analysis projects!