# Layer 1: Semantic Feature Engineering

## What We're Building
- **Semantic Risk Scores** - How much MORE similar is each email to risk concepts vs normal business?
- **Feature Store** - Centralized, versioned, reusable features
- **Baseline-Normalized Scores** - Relative scores that distinguish violations from clean emails

**Key Insight:** Raw similarity scores are high for all business emails. We need *relative* scores: `risk_similarity - baseline_similarity`

In [None]:
from snowflake.snowpark import Session
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode

session = Session.builder.getOrCreate()
session.use_warehouse('COMPLIANCE_DEMO_WH')
session.use_database('COMPLIANCE_DEMO')
session.use_schema('ML')

print("Layer 1: Building Semantic Feature Store...")

## The Concept: Relative Semantic Risk Scores

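**Problem:** Raw embedding similarity is high for ALL business emails (roughly 0.65-0.75 in the illustrative table below), so a raw threshold separates violations from clean emails poorly.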

**Solution:** Compute *relative* risk = `similarity_to_risk - similarity_to_baseline`

| Email Type | Risk Similarity | Baseline Similarity | **Relative Score** |
|------------|-----------------|---------------------|-------------------|
| Clean | 0.68 | 0.70 | **-0.02** (normal) |
| Violation | 0.75 | 0.69 | **+0.06** (risky) |

After subtracting the baseline, violations tend to score positive and clean emails tend to score negative.
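To make the arithmetic concrete, here is a minimal pure-Python sketch of the relative score. The three-dimensional vectors are made-up toy values, not real embeddings, and `cosine_similarity` and `relative_risk` are hypothetical helper names. In the notebook itself the same computation happens in SQL via `VECTOR_COSINE_SIMILARITY`.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relative_risk(email_vec, risk_vec, baseline_vec):
    """similarity_to_risk - similarity_to_baseline."""
    return (cosine_similarity(email_vec, risk_vec)
            - cosine_similarity(email_vec, baseline_vec))

# Toy vectors, assumed purely for illustration.
baseline_vec = [1.0, 0.0, 0.0]
risk_vec = [0.0, 1.0, 0.0]
clean_email = [0.9, 0.1, 0.0]   # points mostly toward the baseline
risky_email = [0.2, 0.8, 0.0]   # points mostly toward the risk concept

clean_score = relative_risk(clean_email, risk_vec, baseline_vec)
risky_score = relative_risk(risky_email, risk_vec, baseline_vec)
print(f"clean: {clean_score:+.4f}")
print(f"risky: {risky_score:+.4f}")
```

The toy vectors are nearly orthogonal, so the gap here is far larger than anything real 768-dimensional embeddings would produce. Only the sign pattern carries over.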

## Step 1: Define Concepts

In [None]:
BASELINE_CONCEPT = "quarterly report meeting schedule project update team discussion client follow up status check"

RISK_CONCEPTS = {
    'SECRECY': "keep this secret between us, do not tell anyone, off the record, nobody can know",
    'URGENCY': "act before the announcement, move now before news breaks, time sensitive inside info",
    'INSIDER': "inside information about merger, non-public material facts, confidential deal details",
    'EVASION': "delete this email, destroy evidence, cover our tracks, shred documents",
    'TIPPING': "buy this stock now, guaranteed profit, act on this tip, trust me on this investment"
}

print("Baseline (normal business):")
print(f"  {BASELINE_CONCEPT[:60]}...")
print("\nRisk concepts:")
for name, phrase in RISK_CONCEPTS.items():
    print(f"  {name}: {phrase[:50]}...")

## Step 2: Create Semantic Features Table

For each email: `RISK_SCORE = similarity_to_risk - similarity_to_baseline`

In [None]:
import time

print("Computing relative semantic risk scores...")
start = time.time()

session.sql(f"""
CREATE OR REPLACE TABLE COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES AS
WITH email_embeddings AS (
    SELECT 
        EMAIL_ID,
        SENT_AT,
        COMPLIANCE_LABEL,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
            CONCAT(SUBJECT, ' ', LEFT(BODY, 1000))
        )::VECTOR(FLOAT, 768) AS EMAIL_EMBEDDING
    FROM COMPLIANCE_DEMO.EMAIL_SURVEILLANCE.EMAILS
),
concept_embeddings AS (
    SELECT 
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
            '{BASELINE_CONCEPT}')::VECTOR(FLOAT, 768) AS BASELINE_VEC,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
            '{RISK_CONCEPTS['SECRECY']}')::VECTOR(FLOAT, 768) AS SECRECY_VEC,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
            '{RISK_CONCEPTS['URGENCY']}')::VECTOR(FLOAT, 768) AS URGENCY_VEC,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
            '{RISK_CONCEPTS['INSIDER']}')::VECTOR(FLOAT, 768) AS INSIDER_VEC,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
            '{RISK_CONCEPTS['EVASION']}')::VECTOR(FLOAT, 768) AS EVASION_VEC,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', 
            '{RISK_CONCEPTS['TIPPING']}')::VECTOR(FLOAT, 768) AS TIPPING_VEC
)
SELECT 
    e.EMAIL_ID,
    e.SENT_AT,
    e.COMPLIANCE_LABEL,
    -- Relative risk scores: risk_similarity - baseline_similarity
    ROUND(VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.SECRECY_VEC) - 
          VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.BASELINE_VEC), 4) AS RISK_SECRECY,
    ROUND(VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.URGENCY_VEC) - 
          VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.BASELINE_VEC), 4) AS RISK_URGENCY,
    ROUND(VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.INSIDER_VEC) - 
          VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.BASELINE_VEC), 4) AS RISK_INSIDER,
    ROUND(VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.EVASION_VEC) - 
          VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.BASELINE_VEC), 4) AS RISK_EVASION,
    ROUND(VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.TIPPING_VEC) - 
          VECTOR_COSINE_SIMILARITY(e.EMAIL_EMBEDDING, c.BASELINE_VEC), 4) AS RISK_TIPPING
FROM email_embeddings e
CROSS JOIN concept_embeddings c
""").collect()

elapsed = time.time() - start
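# Use the fully qualified name so the count does not depend on the session's current schema
count = session.sql('SELECT COUNT(*) as cnt FROM COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES').collect()[0]['CNT']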
print(f"\nCreated relative risk scores for {count:,} emails in {elapsed:.1f}s")
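The query above interpolates the concept phrases straight into single-quoted SQL literals. That works for the phrases defined in Step 1, but a phrase containing an apostrophe (for example "don't tell anyone") would break the statement. One possible guard is a small escaping helper; `sql_string_literal` is a hypothetical name, and the escaping rules assume Snowflake's single-quoted string constants, where a quote is doubled and a backslash starts an escape sequence.

```python
def sql_string_literal(text: str) -> str:
    """Wrap text in single quotes for use as a SQL string constant.

    Backslashes are doubled first, then single quotes are doubled,
    so neither can terminate or alter the literal.
    """
    escaped = text.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"

# Hypothetical phrase with an apostrophe, not one of the notebook's concepts.
phrase = "don't tell anyone, keep this between us"
literal = sql_string_literal(phrase)
print(literal)
```

With this helper the f-string would use `{sql_string_literal(BASELINE_CONCEPT)}` without the surrounding quotes. Snowpark's `session.sql` also accepts a `params` list for bind variables, which may be a cleaner option where the statement type supports it.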

## Step 3: Validate - Clean vs Violation Separation

In [None]:
results = session.sql("""
SELECT 
    CASE WHEN COMPLIANCE_LABEL = 'CLEAN' THEN 'CLEAN' ELSE 'VIOLATION' END as LABEL_TYPE,
    COUNT(*) as COUNT,
    ROUND(AVG(RISK_SECRECY), 4) AS AVG_SECRECY,
    ROUND(AVG(RISK_URGENCY), 4) AS AVG_URGENCY,
    ROUND(AVG(RISK_INSIDER), 4) AS AVG_INSIDER,
    ROUND(AVG(RISK_EVASION), 4) AS AVG_EVASION,
    ROUND(AVG(RISK_TIPPING), 4) AS AVG_TIPPING
FROM COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES
GROUP BY 1
ORDER BY 1
""").to_pandas()

print("\n" + "="*85)
print("RELATIVE RISK SCORES: Negative = normal, Positive = risky")
print("="*85)
print(results.to_string(index=False))

# These lines state the expected pattern; confirm it against the table printed above
print("\nExpected: CLEAN averages are NEGATIVE (closer to normal business)")
print("Expected: VIOLATION averages are POSITIVE (closer to risk concepts)")
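Rather than reading the sign pattern off the table by eye, the check can be made explicit per feature. The sketch below assumes the summary has been turned into plain dicts (for instance with `results.to_dict('records')`); `separation_report` is a hypothetical helper, and the rows shown are made-up placeholder values, not output from the query.

```python
def separation_report(rows, features):
    """For each feature, report class averages, their gap, and whether
    clean is negative while violation is positive."""
    by_label = {row["LABEL_TYPE"]: row for row in rows}
    clean, violation = by_label["CLEAN"], by_label["VIOLATION"]
    report = {}
    for feat in features:
        c, v = clean[feat], violation[feat]
        report[feat] = {
            "clean": c,
            "violation": v,
            "gap": round(v - c, 4),
            "signs_ok": c < 0 < v,
        }
    return report

# Placeholder rows (made-up values, NOT real query output).
rows = [
    {"LABEL_TYPE": "CLEAN", "AVG_SECRECY": -0.03, "AVG_INSIDER": 0.01},
    {"LABEL_TYPE": "VIOLATION", "AVG_SECRECY": 0.04, "AVG_INSIDER": 0.05},
]
report = separation_report(rows, ["AVG_SECRECY", "AVG_INSIDER"])
for feat, stats in report.items():
    print(feat, stats)
```

A feature whose `signs_ok` is false can still be useful to a downstream model if its gap is large. The flag only says that a zero threshold on that feature alone would not separate the class averages.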

## Step 4: Register in Feature Store

In [None]:
fs = FeatureStore(
    session=session,
    database="COMPLIANCE_DEMO",
    name="ML",
    default_warehouse="COMPLIANCE_DEMO_WH",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST
)

email_entity = Entity(
    name="EMAIL",
    join_keys=["EMAIL_ID"],
    desc="Individual email communications for compliance monitoring"
)
fs.register_entity(email_entity)

print("Feature Store initialized, EMAIL entity registered")

In [None]:
from snowflake.snowpark.functions import col

feature_df = session.table('COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES').select(
    col('EMAIL_ID'),
    col('SENT_AT').alias('TS'),
    col('COMPLIANCE_LABEL'),
    col('RISK_SECRECY'),
    col('RISK_URGENCY'),
    col('RISK_INSIDER'),
    col('RISK_EVASION'),
    col('RISK_TIPPING')
)

semantic_fv = FeatureView(
    name="EMAIL_SEMANTIC_FEATURES",
    entities=[email_entity],
    feature_df=feature_df,
    timestamp_col="TS",
    refresh_freq="1 day",
    desc="Relative semantic risk scores (risk_similarity - baseline_similarity)"
)

semantic_fv = semantic_fv.attach_feature_desc({
    "RISK_SECRECY": "Relative similarity to secrecy language vs normal business (negative=normal, positive=risky)",
    "RISK_URGENCY": "Relative similarity to suspicious urgency language vs normal business",
    "RISK_INSIDER": "Relative similarity to insider information language vs normal business",
    "RISK_EVASION": "Relative similarity to evidence destruction language vs normal business",
    "RISK_TIPPING": "Relative similarity to stock tipping language vs normal business"
})

print("Feature View created")

In [None]:
registered_fv = fs.register_feature_view(
    feature_view=semantic_fv,
    version="V1",
    block=True,
    overwrite=True
)

print(f"\nFeature View registered: {registered_fv.name}/V1")
print("  -> 5 relative risk score features")
print("  -> Negative = normal, Positive = risky")
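Once the view is registered, a downstream consumer would typically read features back by entity key. The following is a sketch of that lookup, assuming the `FeatureStore.get_feature_view` and `FeatureStore.retrieve_feature_values` methods of `snowflake-ml-python` and the view name and version registered above. `fetch_semantic_features` is a hypothetical wrapper and is not part of the original notebook.

```python
def fetch_semantic_features(fs, session, email_ids):
    """Look up registered relative risk scores for a list of EMAIL_IDs.

    `fs` is expected to be a snowflake.ml.feature_store.FeatureStore and
    `session` a Snowpark Session; neither is created here.
    """
    # Spine: one row per entity key we want features for.
    spine_df = session.create_dataframe(
        [[email_id] for email_id in email_ids], schema=["EMAIL_ID"]
    )
    # Fetch the registered view by name and version (assumed: V1, as above).
    fv = fs.get_feature_view("EMAIL_SEMANTIC_FEATURES", "V1")
    # Join feature values onto the spine by the entity's join key.
    return fs.retrieve_feature_values(spine_df=spine_df, features=[fv])

# Usage inside the notebook (the IDs are placeholders):
#   fetch_semantic_features(fs, session, [101, 102]).show()
```

Because the view as registered also carries `COMPLIANCE_LABEL`, a training pipeline built on this lookup would need to treat that column as the target rather than as an input feature.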

## Step 5: Example Violations

In [None]:
examples = session.sql("""
SELECT 
    f.EMAIL_ID,
    f.COMPLIANCE_LABEL,
    e.SUBJECT,
    LEFT(e.BODY, 200) as BODY_PREVIEW,
    f.RISK_SECRECY,
    f.RISK_INSIDER,
    f.RISK_TIPPING
FROM COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES f
JOIN COMPLIANCE_DEMO.EMAIL_SURVEILLANCE.EMAILS e ON f.EMAIL_ID = e.EMAIL_ID
WHERE f.COMPLIANCE_LABEL != 'CLEAN'
ORDER BY (f.RISK_SECRECY + f.RISK_INSIDER + f.RISK_TIPPING) DESC
LIMIT 3
""").to_pandas()

print("\n" + "="*80)
print("TOP VIOLATIONS BY RELATIVE RISK SCORE")
print("="*80)

for _, row in examples.iterrows():
    print(f"\n[{row['COMPLIANCE_LABEL']}] Email {row['EMAIL_ID']}")
    print(f"Subject: {row['SUBJECT']}")
    print(f"Body: {row['BODY_PREVIEW']}...")
    print(f"\nRisk Scores (positive = risky):")
    print(f"  SECRECY: {row['RISK_SECRECY']:+.4f}  INSIDER: {row['RISK_INSIDER']:+.4f}  TIPPING: {row['RISK_TIPPING']:+.4f}")

## Layer 1 Complete

**What we built:**
- **5 relative risk scores** (not raw similarity)
- **Clear separation:** Clean = negative, Violations = positive
- **Feature Store** with versioned, documented features

**Why relative scores work:**

| Approach | Clean Score | Violation Score | Separation |
|----------|-------------|-----------------|------------|
| Raw similarity | 0.68 | 0.72 | 0.04 (weak) |
| **Relative risk** | **-0.04** | **+0.05** | **0.09 (strong)** |

**Next:** Train an ML model on these relative risk scores →