# 02 - Feature Engineering with Feature Store

## Building Reusable Compliance Risk Features

**Goal:** Use Snowflake's Feature Store to create reusable, versioned features for compliance risk detection.

### What is a Feature Store?

A Feature Store is a centralized repository for ML features. It provides:
- **Reusability:** Define features once, use across multiple models
- **Versioning:** Track feature changes over time
- **Consistency:** Same feature logic for training and inference
- **Point-in-time correctness:** Generate training data without data leakage
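
To make the last point concrete, here is a minimal, Snowflake-free sketch of a point-in-time lookup. The feature history, its values, and the helper `feature_as_of` are hypothetical and exist only for illustration; the Feature Store performs the equivalent join inside Snowflake when timestamp columns are supplied.

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical feature history for one entity: (timestamp, value), sorted by timestamp.
feature_history = [
    (datetime(2024, 1, 1), 0.10),
    (datetime(2024, 2, 1), 0.25),
    (datetime(2024, 3, 1), 0.40),
]


def feature_as_of(history, as_of):
    """Return the latest feature value recorded at or before `as_of`, else None."""
    timestamps = [ts for ts, _ in history]
    idx = bisect_right(timestamps, as_of)
    if idx == 0:
        return None  # nothing was known yet at that time
    return history[idx - 1][1]


# A label observed on 2024-02-15 may only see the value written on 2024-02-01.
# Using the 2024-03-01 value instead would leak future information into training.
print(feature_as_of(feature_history, datetime(2024, 2, 15)))
```

Picking the newest value regardless of the label's timestamp is the classic source of leakage that this kind of as-of lookup avoids.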

### Features We'll Build

| Feature | Description | Risk Signal |
|---------|-------------|-------------|
| `IS_CROSS_DEPT` | 1 if sender and recipient departments differ | Cross-department traffic = potential leaks |
| `IS_AFTER_HOURS` | 1 if sent outside business hours (8am-6pm) | Unusual activity pattern |
| `URGENCY_SCORE` | Count of urgency keywords | High urgency = suspicious |
| `SECRECY_SCORE` | Count of secrecy phrases | Explicit secrecy = red flag |
| `IS_BARRIER_CROSSING` | 1 if Research↔Trading email | Information barrier risk |

---

In [None]:
# Setup
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col, count, avg, sum as sum_, when, hour, length, lit
from snowflake.ml.feature_store import FeatureStore, Entity, FeatureView

session = Session.builder.getOrCreate()
session.use_warehouse("COMPLIANCE_DEMO_WH")
session.use_database("COMPLIANCE_DEMO")
session.use_schema("ML")

print(f"Connected as: {session.get_current_user()}")

## 1. Initialize Feature Store

The Feature Store uses a schema to store metadata about entities and features.

In [None]:
# Initialize Feature Store
fs = FeatureStore(
    session=session,
    database="COMPLIANCE_DEMO",
    name="ML",  # Schema name
    default_warehouse="COMPLIANCE_DEMO_WH",
)
print("✅ Feature Store initialized")

## 2. Define Entities

An entity names the business object that features describe and defines the join keys used to look those features up. We'll create an EMAIL entity keyed on `EMAIL_ID`.

In [None]:
# Create Email entity
email_entity = Entity(
    name="EMAIL",
    join_keys=["EMAIL_ID"],
    desc="Individual email message for compliance analysis"
)

fs.register_entity(email_entity)
print("✅ Registered entity: EMAIL")

# List all entities
fs.list_entities().show()

## 3. Compute Features

Now we'll compute compliance risk features from the raw email data. These features capture patterns we identified in data exploration.

In [None]:
# First, create UDFs for text-based features
# These run inside Snowflake - no data movement

session.sql("""
CREATE OR REPLACE FUNCTION COMPLIANCE_DEMO.ML.DETECT_URGENCY(text STRING)
RETURNS INT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'
HANDLER = 'detect_urgency'
AS $$
def detect_urgency(text: str) -> int:
    '''Return urgency score based on keyword presence (0-6).'''
    if text is None:
        return 0
    text_upper = text.upper()
    urgency_keywords = ["URGENT", "ASAP", "IMMEDIATELY", "TIME SENSITIVE", "ACT NOW", "ACT FAST"]
    return sum(1 for kw in urgency_keywords if kw in text_upper)
$$
""").collect()

session.sql("""
CREATE OR REPLACE FUNCTION COMPLIANCE_DEMO.ML.DETECT_SECRECY(text STRING)
RETURNS INT
LANGUAGE PYTHON
RUNTIME_VERSION = '3.11'
HANDLER = 'detect_secrecy'
AS $$
def detect_secrecy(text: str) -> int:
    '''Return secrecy score based on suspicious phrases (0-6).'''
    if text is None:
        return 0
    text_upper = text.upper()
    secrecy_phrases = [
        "DELETE", "DON'T TELL", "KEEP THIS BETWEEN", 
        "OFF THE RECORD", "CONFIDENTIAL", "SECRET"
    ]
    return sum(1 for phrase in secrecy_phrases if phrase in text_upper)
$$
""").collect()

print("✅ Created UDFs: DETECT_URGENCY, DETECT_SECRECY")
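
Before relying on the UDFs, it can help to exercise the same scoring logic locally. The sketch below copies the keyword lists from the UDF bodies into plain Python; the helper name `keyword_score` and the sample sentences are made up for illustration.

```python
URGENCY_KEYWORDS = ["URGENT", "ASAP", "IMMEDIATELY", "TIME SENSITIVE", "ACT NOW", "ACT FAST"]
SECRECY_PHRASES = [
    "DELETE", "DON'T TELL", "KEEP THIS BETWEEN",
    "OFF THE RECORD", "CONFIDENTIAL", "SECRET",
]


def keyword_score(text, keywords):
    """Count how many distinct keywords occur in text (case-insensitive substring match)."""
    if text is None:
        return 0
    text_upper = text.upper()
    return sum(1 for kw in keywords if kw in text_upper)


sample = "Urgent: please delete this after reading and keep this between us."
print(keyword_score(sample, URGENCY_KEYWORDS))  # matches URGENT
print(keyword_score(sample, SECRECY_PHRASES))   # matches DELETE and KEEP THIS BETWEEN

# Substring matching also fires inside longer words: SECRET matches "secretary".
print(keyword_score("Ask the secretary to book a room.", SECRECY_PHRASES))
```

Two properties are worth keeping in mind: each keyword counts at most once no matter how often it repeats, and plain substring matching can produce false positives such as the "secretary" case. Word-boundary matching would be one way to tighten this if it turns out to matter.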

In [None]:
# Compute all features using SQL (cleaner than Snowpark for complex logic)
email_features_df = session.sql("""
    SELECT 
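        EMAIL_ID,
        -- COMPLIANCE_LABEL is left out on purpose: the label comes from the
        -- training spine, and storing it as a feature would leak the target.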
        
        -- Text-based features using our UDFs
        COMPLIANCE_DEMO.ML.DETECT_URGENCY(BODY) AS URGENCY_SCORE,
        COMPLIANCE_DEMO.ML.DETECT_SECRECY(BODY) AS SECRECY_SCORE,
        COMPLIANCE_DEMO.ML.DETECT_URGENCY(BODY) + COMPLIANCE_DEMO.ML.DETECT_SECRECY(BODY) AS TOTAL_RISK_SCORE,
        
        -- Structural features
        LENGTH(BODY) AS BODY_LENGTH,
        
        -- Cross-department indicator
        CASE WHEN SENDER_DEPT != RECIPIENT_DEPT THEN 1 ELSE 0 END AS IS_CROSS_DEPT,
        
        -- Information barrier crossing (Research <-> Trading)
        CASE 
            WHEN (SENDER_DEPT = 'Research' AND RECIPIENT_DEPT = 'Trading')
              OR (SENDER_DEPT = 'Trading' AND RECIPIENT_DEPT = 'Research')
            THEN 1 ELSE 0 
        END AS IS_BARRIER_CROSSING,
        
        -- After-hours indicator
        CASE 
            WHEN HOUR(SENT_AT) < 8 OR HOUR(SENT_AT) >= 18 
            THEN 1 ELSE 0 
        END AS IS_AFTER_HOURS
        
    FROM COMPLIANCE_DEMO.EMAIL_SURVEILLANCE.EMAILS
""")

print("--- Feature Preview ---")
email_features_df.show(10)
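
The structural flags are simple enough to mirror in plain Python for unit testing. This is a sketch only: the function name `structural_flags` and the sample values are hypothetical, and the `None` checks are an attempt to reproduce how the SQL `CASE` expressions fall through to `ELSE 0` when a column is NULL.

```python
from datetime import datetime

BARRIER_PAIR = {"Research", "Trading"}


def structural_flags(sender_dept, recipient_dept, sent_at):
    """Mirror the SQL CASE expressions as 0/1 flags for a single email."""
    both_known = sender_dept is not None and recipient_dept is not None
    return {
        # In SQL, a comparison involving NULL is not true, so the flag stays 0.
        "IS_CROSS_DEPT": int(both_known and sender_dept != recipient_dept),
        # Research -> Trading or Trading -> Research, in either direction.
        "IS_BARRIER_CROSSING": int({sender_dept, recipient_dept} == BARRIER_PAIR),
        # Business hours are 08:00 up to, but not including, 18:00.
        "IS_AFTER_HOURS": int(
            sent_at is not None and (sent_at.hour < 8 or sent_at.hour >= 18)
        ),
    }


print(structural_flags("Research", "Trading", datetime(2024, 5, 1, 19, 30)))
print(structural_flags("Sales", "Sales", datetime(2024, 5, 1, 8, 0)))
```

Note that the hour is read from `SENT_AT` as stored, so the after-hours flag is only meaningful if that column is in the sender's local time zone; the notebook does not say whether it is.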

## 4. Save and Register Feature View

Feature Views wrap feature computation logic and register it with the Feature Store for reuse.

In [None]:
# Save features to a table (Feature View source)
email_features_df.write.mode("overwrite").save_as_table("COMPLIANCE_DEMO.ML.EMAIL_RISK_FEATURES")
print("✅ Saved features to COMPLIANCE_DEMO.ML.EMAIL_RISK_FEATURES")

# Register Feature View
email_risk_fv = FeatureView(
    name="EMAIL_RISK_FEATURES",
    entities=[email_entity],
    feature_df=session.table("COMPLIANCE_DEMO.ML.EMAIL_RISK_FEATURES"),
    desc="Per-email risk signals from text analysis and metadata"
)

email_risk_fv = fs.register_feature_view(
    feature_view=email_risk_fv,
    version="V1",
    block=True,
)
print("✅ Registered FeatureView: EMAIL_RISK_FEATURES/V1")

## 5. Generate Training Dataset

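Use the Feature Store to join the registered features onto a spine of labeled emails. Neither the spine nor this Feature View carries a timestamp column, so the join here is a plain key lookup on `EMAIL_ID`; point-in-time correctness comes into play once timestamp columns are supplied.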

In [None]:
# Create training spine (list of entities to get features for)
training_spine = session.sql("""
    SELECT 
        EMAIL_ID,
        COMPLIANCE_LABEL AS LABEL
    FROM COMPLIANCE_DEMO.EMAIL_SURVEILLANCE.EMAILS
""")

print(f"Training spine has {training_spine.count():,} records")

# Generate training dataset with features
training_dataset = fs.generate_dataset(
    name="EMAIL_COMPLIANCE_TRAINING",
    spine_df=training_spine,
    features=[email_risk_fv],
    output_type="table",
    desc="Training dataset for email compliance classification"
)

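print("✅ Generated training dataset")

# With output_type="table", generate_dataset returns a Snowpark DataFrame
# over the materialized table rather than a Dataset object.
print(f"Columns: {training_dataset.columns}")

# Preview the dataset
training_dataset.show(10)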
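
Conceptually, the dataset generation step attaches feature columns to each spine row by matching on the entity's join key. The sketch below imitates that with plain Python dictionaries; the rows, label values, and the helper `join_spine` are hypothetical and stand in for what Snowflake does at scale.

```python
# Hypothetical spine rows: one per email, carrying only the key and the label.
spine = [
    {"EMAIL_ID": 1, "LABEL": "VIOLATION"},
    {"EMAIL_ID": 2, "LABEL": "CLEAN"},
    {"EMAIL_ID": 3, "LABEL": "CLEAN"},
]

# Hypothetical feature rows keyed by EMAIL_ID (email 3 has no features yet).
features = {
    1: {"URGENCY_SCORE": 2, "SECRECY_SCORE": 1},
    2: {"URGENCY_SCORE": 0, "SECRECY_SCORE": 0},
}
FEATURE_COLUMNS = ["URGENCY_SCORE", "SECRECY_SCORE"]


def join_spine(spine_rows, feature_lookup, feature_columns):
    """Left-join features onto the spine by EMAIL_ID; unmatched rows get None."""
    joined = []
    for row in spine_rows:
        feats = feature_lookup.get(row["EMAIL_ID"], {})
        joined.append({**row, **{c: feats.get(c) for c in feature_columns}})
    return joined


training_rows = join_spine(spine, features, FEATURE_COLUMNS)
for row in training_rows:
    print(row)
```

Every spine row survives the join, so emails without matching features show up with missing values instead of silently disappearing. Checking for such gaps before training is a cheap sanity test.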

## Summary

**What we built:**
- Custom Python UDFs for text analysis (urgency, secrecy detection)
- Feature computation pipeline for compliance risk signals
- Registered Feature View in the Feature Store
- Generated training dataset for model training

**Features created:**

| Feature | Type | Description |
|---------|------|-------------|
| URGENCY_SCORE | INT | Count of urgent keywords (0-6) |
| SECRECY_SCORE | INT | Count of secrecy phrases (0-6) |
| TOTAL_RISK_SCORE | INT | Combined risk score |
| BODY_LENGTH | INT | Email body length |
| IS_CROSS_DEPT | INT | 1 if cross-department |
| IS_BARRIER_CROSSING | INT | 1 if Research↔Trading |
| IS_AFTER_HOURS | INT | 1 if sent outside 8am-6pm |

**Next:** In notebook 03, we'll train an XGBoost classifier using these features and register it in the Model Registry.