# 01 - Data Exploration

## Email Compliance Intelligence Demo

**Goal:** Explore the email dataset and understand the compliance classification task.

In this notebook, we'll:
1. Connect to Snowflake and explore the raw email data
2. Understand the label distribution (class imbalance)
3. Identify patterns that signal compliance risk
4. Set up baseline metrics for model evaluation

---

## Setup: Connect to Snowflake

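Snowpark is Snowflake's Python API. It lets you write DataFrame operations that execute inside Snowflake, so the data is processed in the platform and only the results you explicitly request (for example via `show()` or `collect()`) come back to the notebook.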

In [None]:
# Import Snowpark
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, count, avg, sum as sum_, when, hour, length, lit

# Connect using default config (~/.snowflake/config.toml)
session = Session.builder.getOrCreate()

# Set context for the demo
session.use_warehouse("COMPLIANCE_DEMO_WH")
session.use_database("COMPLIANCE_DEMO")
session.use_schema("EMAIL_SURVEILLANCE")

print(f"Connected as: {session.get_current_user()}")
print(f"Warehouse: {session.get_current_warehouse()}")
print(f"Database: {session.get_current_database()}")

---
## 1. Load and Preview the Data

We have 10,000 synthetic hedge fund emails, each with a ground-truth compliance label.

In [None]:
# Load the emails table as a Snowpark DataFrame
emails = session.table("EMAILS")

# Check the count
total_count = emails.count()
print(f"Total emails in dataset: {total_count:,}")

In [None]:
# Preview a few rows
emails.select(
    "EMAIL_ID", 
    "SENDER", 
    "RECIPIENT", 
    "SUBJECT", 
    "COMPLIANCE_LABEL"
).show(10)

---
## 2. Label Distribution (Class Imbalance)

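In real compliance scenarios, most emails are legitimate and violations are rare events. This class imbalance matters for model training: a model can score high accuracy simply by predicting the majority class while missing the violations we actually care about.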

In [None]:
# Count by compliance label with percentages
print("=" * 50)
print("COMPLIANCE LABEL DISTRIBUTION")
print("=" * 50)

session.sql("""
    SELECT 
        COMPLIANCE_LABEL,
        COUNT(*) AS EMAIL_COUNT,
        ROUND(COUNT(*) * 100.0 / SUM(COUNT(*)) OVER (), 1) AS PERCENTAGE
    FROM EMAILS
    GROUP BY COMPLIANCE_LABEL
    ORDER BY EMAIL_COUNT DESC
""").show()
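The `PERCENTAGE` expression in the query above can look opaque: `SUM(COUNT(*)) OVER ()` is a window aggregate evaluated over the already-grouped rows, so every row sees the same grand total. The sketch below mirrors that arithmetic in plain Python. The helper `label_percentages` is hypothetical, and the counts are placeholders chosen to match the ~70/30 split described in this notebook, not actual query output.

```python
def label_percentages(label_counts):
    """Return each label's share of the total as a percentage with one decimal."""
    # Equivalent of SUM(COUNT(*)) OVER (): one grand total shared by every group.
    total = sum(label_counts.values())
    # Python's round() resolves ties to even, while Snowflake's ROUND rounds
    # half away from zero by default, so exact ties may differ slightly.
    return {label: round(n * 100.0 / total, 1) for label, n in label_counts.items()}


# Placeholder counts (assumed for illustration, not real query output).
placeholder_counts = {"CLEAN": 7000, "VIOLATION": 3000}
percentages = label_percentages(placeholder_counts)
print(percentages)
```

Doing this in SQL keeps the aggregation inside Snowflake; the Python version is only meant to make the window function's behavior easier to see.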

**Key Insight:** ~70% of emails are CLEAN. This matches real-world distributions where violations are the minority class. Our models need to handle this imbalance.
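To make the imbalance concrete, here is a minimal sketch of a majority-class baseline: a "model" that always predicts the most frequent label. The function name `majority_class_baseline` is hypothetical, and the counts are placeholders derived from the ~70% CLEAN figure on 10,000 emails rather than real query results. Any trained classifier should at least beat these numbers.

```python
def majority_class_baseline(label_counts, violation_labels):
    """Score a model that always predicts the most frequent label."""
    total = sum(label_counts.values())
    majority_label = max(label_counts, key=label_counts.get)
    accuracy = label_counts[majority_label] / total

    # Violations are only "caught" if the majority label is itself a violation.
    total_violations = sum(label_counts.get(label, 0) for label in violation_labels)
    caught = label_counts[majority_label] if majority_label in violation_labels else 0
    violation_recall = caught / total_violations if total_violations else 0.0

    return {
        "predicts": majority_label,
        "accuracy": accuracy,
        "violation_recall": violation_recall,
    }


# Placeholder counts (assumed): ~70% CLEAN out of 10,000 emails.
placeholder_counts = {"CLEAN": 7000, "VIOLATION": 3000}
baseline = majority_class_baseline(placeholder_counts, violation_labels={"VIOLATION"})
print(baseline)
```

With these placeholder counts the baseline reaches 70% accuracy while recalling none of the violations, which is why accuracy alone is a poor metric for this task and per-class recall and precision are worth tracking.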

---
## 3. Department Communication Patterns

One key compliance signal is **cross-department communication**, especially between Research and Trading (information barriers / "Chinese walls").

In [None]:
# Label counts for Research <-> Trading emails (either direction)
print("=" * 50)
print("RESEARCH <-> TRADING COMMUNICATIONS (Info Barrier Risk)")
print("=" * 50)

session.sql("""
    SELECT 
        COMPLIANCE_LABEL,
        COUNT(*) AS EMAIL_COUNT
    FROM EMAILS
    WHERE (SENDER_DEPT = 'Research' AND RECIPIENT_DEPT = 'Trading')
       OR (SENDER_DEPT = 'Trading' AND RECIPIENT_DEPT = 'Research')
    GROUP BY COMPLIANCE_LABEL
    ORDER BY EMAIL_COUNT DESC
""").show()
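Raw counts for the Research <-> Trading subset are easier to interpret when compared against the dataset as a whole. The sketch below computes the share of non-CLEAN emails in a label-to-count dict and the ratio ("lift") between a subset and the full dataset. The helper `violation_share` is hypothetical, and all numbers are made up purely to exercise the function; substitute the counts returned by the queries in this notebook.

```python
def violation_share(label_counts, clean_label="CLEAN"):
    """Fraction of emails whose label is anything other than the clean label."""
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    violations = total - label_counts.get(clean_label, 0)
    return violations / total


# Made-up counts for illustration only (not results from the dataset).
overall = {"CLEAN": 7000, "VIOLATION": 3000}
research_trading = {"CLEAN": 40, "INFO_BARRIER_VIOLATION": 60}

# Lift > 1 means the subset contains proportionally more violations than average.
lift = violation_share(research_trading) / violation_share(overall)
print(f"Violation share lift for Research <-> Trading: {lift:.2f}x")
```

A lift well above 1 would support treating cross-department traffic as a risk feature; a lift near 1 would suggest the signal is weaker than expected.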

---
## 4. Sample Emails by Category

Let's look at example emails for each compliance category to understand what we're detecting.

In [None]:
# Show one arbitrary email per category (LIMIT without ORDER BY is not a random sample)
categories = ["CLEAN", "INSIDER_TRADING", "CONFIDENTIALITY_BREACH", "PERSONAL_TRADING", "INFO_BARRIER_VIOLATION"]

for category in categories:
    print("=" * 60)
    print(f"SAMPLE: {category}")
    print("=" * 60)
    
    sample = session.sql(f"""
        SELECT SUBJECT, LEFT(BODY, 300) as BODY_PREVIEW
        FROM EMAILS
        WHERE COMPLIANCE_LABEL = '{category}'
        LIMIT 1
    """).collect()
    
    if sample:
        print(f"Subject: {sample[0]['SUBJECT']}")
        print(f"Body: {sample[0]['BODY_PREVIEW']}...")
    print()
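The loop above interpolates `category` directly into the SQL string. That is safe here because the list is hard-coded, but if the labels ever came from user input or a config file, it would be worth validating them first. A minimal sketch, assuming the five labels listed above are the complete set; `ALLOWED_LABELS` and `build_sample_query` are hypothetical names, not part of Snowpark:

```python
ALLOWED_LABELS = frozenset({
    "CLEAN",
    "INSIDER_TRADING",
    "CONFIDENTIALITY_BREACH",
    "PERSONAL_TRADING",
    "INFO_BARRIER_VIOLATION",
})


def build_sample_query(category, preview_chars=300):
    """Build the per-category preview query, rejecting unknown labels."""
    if category not in ALLOWED_LABELS:
        raise ValueError(f"Unknown compliance label: {category!r}")
    if type(preview_chars) is not int or preview_chars <= 0:
        raise ValueError("preview_chars must be a positive integer")
    # Interpolation is acceptable only because both values were validated above.
    return (
        f"SELECT SUBJECT, LEFT(BODY, {preview_chars}) AS BODY_PREVIEW\n"
        f"FROM EMAILS\n"
        f"WHERE COMPLIANCE_LABEL = '{category}'\n"
        f"LIMIT 1"
    )


query = build_sample_query("INSIDER_TRADING")
print(query)
```

The resulting string can be passed to `session.sql(...)` exactly as in the loop above.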

---
## Summary

**Key findings from data exploration:**

1. **Class Imbalance:** ~70% clean emails, ~30% violations (realistic distribution)
2. **Department Patterns:** Research and Trading communications are high-risk for info barrier violations
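3. **Violation Types:** Four distinct violation categories (`INSIDER_TRADING`, `CONFIDENTIALITY_BREACH`, `PERSONAL_TRADING`, `INFO_BARRIER_VIOLATION`), each with different characteristics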

**Next:** In notebook 02, we'll use Snowflake's **Feature Store** to build reusable compliance risk features from these patterns.