# Layer 1: Semantic Feature Engineering

### What We're Building:
- **Semantic Risk Scores** - How much MORE similar is each email to risk concepts vs normal business?
- **Feature Store** - Centralized, versioned, reusable features
- **Normalized Scores** - Baseline-relative similarity scores that separate violations from clean emails

**Key Insight:** Raw similarity scores are high for all business emails. We need *relative* scores: `risk_similarity - baseline_similarity`

In [None]:
from snowflake.snowpark import Session
from snowflake.ml.feature_store import FeatureStore, FeatureView, Entity, CreationMode

session = Session.builder.getOrCreate()
session.use_warehouse('COMPLIANCE_DEMO_WH')
session.use_database('COMPLIANCE_DEMO')
session.use_schema('ML')

print("Layer 1: Building Semantic Feature Store...")

## The Concept: Relative Semantic Risk Scores

**Problem:** Raw embedding similarity to a risk concept is high for ALL business emails (0.65-0.70), so a raw score on its own barely separates violations from clean emails.

**Solution:** Compute *relative* risk = `similarity_to_risk - similarity_to_baseline`

| Email Type | Risk Similarity | Baseline Similarity | **Relative Score** |
|------------|-----------------|---------------------|-------------------|
| Clean | 0.68 | 0.70 | **-0.02** (normal) |
| Violation | 0.75 | 0.69 | **+0.06** (risky) |

Violations now score positive and clean emails score negative.
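
The subtraction can be sketched in plain Python. This is only an illustration: the three-dimensional vectors are made-up stand-ins for real 768-dimensional embeddings, and `cosine_similarity` and `relative_risk` are hypothetical helper names, not part of Snowflake.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def relative_risk(email_vec, risk_vec, baseline_vec):
    """similarity_to_risk - similarity_to_baseline for one email."""
    return cosine_similarity(email_vec, risk_vec) - cosine_similarity(email_vec, baseline_vec)

# Made-up toy "embeddings", chosen only to illustrate the sign of the score
baseline_vec = [1.0, 0.0, 0.0]
risk_vec = [0.0, 1.0, 0.0]
clean_email = [0.9, 0.1, 0.0]   # points mostly toward the baseline concept
risky_email = [0.2, 0.9, 0.0]   # points mostly toward the risk concept

clean_score = relative_risk(clean_email, risk_vec, baseline_vec)
risky_score = relative_risk(risky_email, risk_vec, baseline_vec)
print(f"clean: {clean_score:+.3f}, risky: {risky_score:+.3f}")
```

An email closer to the baseline than to the risk concept gets a negative score; one closer to the risk concept gets a positive score.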

## Step 1: Define Concepts

In [None]:
BASELINE_CONCEPT = "quarterly report meeting schedule administrative compliance training office operations normal business"

RISK_CONCEPTS = {
    'MNPI': "material non-public information insider trading tip before announcement confidential merger acquisition",
    'CONFIDENTIALITY': "client portfolio details proprietary strategy confidential fee structure private investor information",
    'PERSONAL_TRADING': "personal account trade my portfolio bought shares robinhood brokerage unreported position",
    'INFO_BARRIER': "research rating upgrade downgrade price target analyst opinion before publication trading desk"
}

print("Baseline (normal business):")
print(f"  {BASELINE_CONCEPT[:60]}...")
print("\nRisk concepts:")
for name, phrase in RISK_CONCEPTS.items():
    print(f"  {name}: {phrase[:50]}...")
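
The concept phrases are later interpolated directly into single-quoted SQL literals. The phrases above contain no quotes, so this works as written, but a phrase with an apostrophe would break the query. A minimal sketch of a guard, assuming a hypothetical helper named `sql_string_literal`:

```python
def sql_string_literal(text: str) -> str:
    """Return text as a single-quoted SQL string literal.

    Backslashes are doubled first because Snowflake treats a backslash as an
    escape character inside single-quoted string constants; single quotes are
    then doubled, which is the standard SQL way to embed a quote.
    """
    escaped = text.replace("\\", "\\\\").replace("'", "''")
    return "'" + escaped + "'"

# Hypothetical phrase containing an apostrophe
print(sql_string_literal("client's portfolio details"))
```

With this helper, `'{BASELINE_CONCEPT}'` in an f-string query would become `{sql_string_literal(BASELINE_CONCEPT)}`, with no surrounding quotes.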

## Step 2: Create Semantic Features Table

For each email: `RISK_SCORE = similarity_to_risk - similarity_to_baseline`

In [None]:
import time

print("Computing semantic risk scores...")
start = time.time()

session.sql(f"""
CREATE OR REPLACE TABLE COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES AS
WITH EMBEDDED AS (
    -- Embed each email once and reuse the vector for every comparison
    SELECT
        e.EMAIL_ID,
        e.SENDER,
        e.RECIPIENT,
        e.SENDER_DEPT,
        e.RECIPIENT_DEPT,
        e.SUBJECT,
        e.BODY,
        e.SENT_AT,
        e.COMPLIANCE_LABEL,
        SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', CONCAT(e.SUBJECT, ' ', e.BODY)) AS EMAIL_VEC
    FROM COMPLIANCE_DEMO.EMAIL_SURVEILLANCE.EMAILS e
),
SIMILARITY AS (
    -- Raw cosine similarity to the baseline and to each risk concept
    SELECT
        m.*,
        VECTOR_COSINE_SIMILARITY(m.EMAIL_VEC,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', '{BASELINE_CONCEPT}')) AS BASELINE_SIMILARITY,
        VECTOR_COSINE_SIMILARITY(m.EMAIL_VEC,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', '{RISK_CONCEPTS['MNPI']}')) AS MNPI_SIMILARITY,
        VECTOR_COSINE_SIMILARITY(m.EMAIL_VEC,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', '{RISK_CONCEPTS['CONFIDENTIALITY']}')) AS CONFIDENTIALITY_SIMILARITY,
        VECTOR_COSINE_SIMILARITY(m.EMAIL_VEC,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', '{RISK_CONCEPTS['PERSONAL_TRADING']}')) AS PERSONAL_TRADING_SIMILARITY,
        VECTOR_COSINE_SIMILARITY(m.EMAIL_VEC,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m-v1.5', '{RISK_CONCEPTS['INFO_BARRIER']}')) AS INFO_BARRIER_SIMILARITY
    FROM EMBEDDED m
)
SELECT
    s.EMAIL_ID,
    s.SENDER,
    s.RECIPIENT,
    s.SENDER_DEPT,
    s.RECIPIENT_DEPT,
    s.SUBJECT,
    s.BODY,
    s.SENT_AT,
    s.COMPLIANCE_LABEL,
    s.BASELINE_SIMILARITY,

    -- Relative risk score = similarity to risk concept - similarity to baseline
    s.MNPI_SIMILARITY - s.BASELINE_SIMILARITY AS MNPI_RISK_SCORE,
    s.CONFIDENTIALITY_SIMILARITY - s.BASELINE_SIMILARITY AS CONFIDENTIALITY_RISK_SCORE,
    s.PERSONAL_TRADING_SIMILARITY - s.BASELINE_SIMILARITY AS PERSONAL_TRADING_RISK_SCORE,
    s.INFO_BARRIER_SIMILARITY - s.BASELINE_SIMILARITY AS INFO_BARRIER_RISK_SCORE,

    -- 1 when the email crosses the Research/Trading information barrier
    CASE WHEN (s.SENDER_DEPT = 'Research' AND s.RECIPIENT_DEPT = 'Trading')
              OR (s.SENDER_DEPT = 'Trading' AND s.RECIPIENT_DEPT = 'Research')
         THEN 1 ELSE 0 END AS CROSS_BARRIER_FLAG,

    -- Binary label derived from COMPLIANCE_LABEL
    CASE WHEN s.COMPLIANCE_LABEL = 'CLEAN' THEN 0 ELSE 1 END AS IS_VIOLATION

FROM SIMILARITY s
""").collect()

elapsed = time.time() - start
count = session.sql('SELECT COUNT(*) as cnt FROM EMAIL_SEMANTIC_FEATURES').collect()[0]['CNT']
print(f"\nCreated semantic risk scores for {count:,} emails in {elapsed:.1f}s")
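
The `CROSS_BARRIER_FLAG` rule in the query is simple enough to mirror in Python, which can help when unit-testing the logic outside Snowflake. A minimal sketch; `BARRIER_PAIRS` and `cross_barrier_flag` are hypothetical names introduced here for illustration:

```python
# Department pairs that sit on opposite sides of the information barrier
BARRIER_PAIRS = {("Research", "Trading"), ("Trading", "Research")}

def cross_barrier_flag(sender_dept, recipient_dept):
    """Return 1 if the email goes between Research and Trading, else 0.

    A missing department (None) yields 0, matching the ELSE branch of the
    SQL CASE expression when a comparison involves NULL.
    """
    return 1 if (sender_dept, recipient_dept) in BARRIER_PAIRS else 0

print(cross_barrier_flag("Research", "Trading"))
print(cross_barrier_flag("Research", "Research"))
```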

## Step 3: Validate - Clean vs Violation Separation

In [None]:
results = session.sql("""
SELECT 
    COMPLIANCE_LABEL,
    COUNT(*) as COUNT,
    ROUND(AVG(MNPI_RISK_SCORE), 3) AS AVG_MNPI,
    ROUND(AVG(CONFIDENTIALITY_RISK_SCORE), 3) AS AVG_CONF,
    ROUND(AVG(PERSONAL_TRADING_RISK_SCORE), 3) AS AVG_PT,
    ROUND(AVG(INFO_BARRIER_RISK_SCORE), 3) AS AVG_IB
FROM COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES
GROUP BY 1
ORDER BY COUNT DESC
""").to_pandas()

print("\nSemantic Risk Scores by Label:")
print("="*80)
print(results.to_string(index=False))
print("\n** Higher scores = more similar to risk concept language **")
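
One way to put a number on the separation is the gap between the mean score of flagged emails and the mean score of clean emails. The sketch below uses made-up rows and a hypothetical helper `mean_separation`; the label value `'MNPI'` is also an assumption about how violations are labelled.

```python
from statistics import mean

def mean_separation(rows, score_key):
    """Mean score of non-CLEAN rows minus mean score of CLEAN rows."""
    clean = [r[score_key] for r in rows if r["COMPLIANCE_LABEL"] == "CLEAN"]
    flagged = [r[score_key] for r in rows if r["COMPLIANCE_LABEL"] != "CLEAN"]
    return mean(flagged) - mean(clean)

# Made-up per-email rows, for illustration only
rows = [
    {"COMPLIANCE_LABEL": "CLEAN", "MNPI_RISK_SCORE": -0.03},
    {"COMPLIANCE_LABEL": "CLEAN", "MNPI_RISK_SCORE": -0.01},
    {"COMPLIANCE_LABEL": "MNPI", "MNPI_RISK_SCORE": 0.05},
    {"COMPLIANCE_LABEL": "MNPI", "MNPI_RISK_SCORE": 0.07},
]

gap = mean_separation(rows, "MNPI_RISK_SCORE")
print(f"Mean separation: {gap:.3f}")
```

A larger positive gap means the score does more to distinguish the two groups; a gap near zero means the feature carries little signal.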

## Step 4: Register in Feature Store

In [None]:
fs = FeatureStore(
    session=session,
    database="COMPLIANCE_DEMO",
    name="ML",
    default_warehouse="COMPLIANCE_DEMO_WH",
    creation_mode=CreationMode.CREATE_IF_NOT_EXIST
)

email_entity = Entity(
    name="EMAIL",
    join_keys=["EMAIL_ID"],
    desc="Individual email communications for compliance monitoring"
)
fs.register_entity(email_entity)

print("Feature Store initialized, EMAIL entity registered")

In [None]:
from snowflake.snowpark.functions import col

feature_df = session.table('COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES').select(
    col('EMAIL_ID'),
    col('SENT_AT').alias('TS'),
    col('BASELINE_SIMILARITY'),
    col('MNPI_RISK_SCORE'),
    col('CONFIDENTIALITY_RISK_SCORE'),
    col('PERSONAL_TRADING_RISK_SCORE'),
    col('INFO_BARRIER_RISK_SCORE'),
    col('CROSS_BARRIER_FLAG'),
    col('IS_VIOLATION')  # label column stored alongside the features; exclude it from model inputs
)

semantic_fv = FeatureView(
    name="EMAIL_SEMANTIC_FEATURES",
    entities=[email_entity],
    feature_df=feature_df,
    timestamp_col="TS",
    refresh_freq="1 day",
    desc="Semantic risk scores for email compliance"
)

semantic_fv = semantic_fv.attach_feature_desc({
    "BASELINE_SIMILARITY": "Similarity to normal business language",
    "MNPI_RISK_SCORE": "Similarity to MNPI/insider trading language",
    "CONFIDENTIALITY_RISK_SCORE": "Similarity to confidentiality breach language",
    "PERSONAL_TRADING_RISK_SCORE": "Similarity to personal trading violation language",
    "INFO_BARRIER_RISK_SCORE": "Similarity to info barrier violation language",
    "CROSS_BARRIER_FLAG": "1 if the email goes between Research and Trading, else 0"
})

print("Feature View created")

In [None]:
registered_fv = fs.register_feature_view(
    feature_view=semantic_fv,
    version="V1",
    block=True,
    overwrite=True
)

print(f"\nFeature View registered: {registered_fv.name}/V1")
print("  -> 4 risk score features (MNPI, confidentiality, personal trading, info barrier)")
print(f"  -> Negative = normal, Positive = risky")
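
Once registered, the feature view can be looked up by name and version from any session. A minimal sketch, assuming the installed `snowflake-ml-python` release exposes `get_feature_view` and `read_feature_view` on `FeatureStore`; the wrapper `load_semantic_features` is a hypothetical name:

```python
def load_semantic_features(fs, name="EMAIL_SEMANTIC_FEATURES", version="V1"):
    """Look up a registered feature view and return its rows as a DataFrame.

    `fs` is expected to be a snowflake.ml.feature_store.FeatureStore; the two
    method calls below are assumed to be available in the installed version.
    """
    feature_view = fs.get_feature_view(name, version)
    return fs.read_feature_view(feature_view)

# Usage in the notebook, with the `fs` object created in Step 4:
# features_df = load_semantic_features(fs)
# features_df.show(5)
```

Reading through the Feature Store rather than the underlying table keeps downstream code pinned to an explicit version.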

## Step 5: Example Violations

In [None]:
examples = session.sql("""
SELECT 
    EMAIL_ID,
    COMPLIANCE_LABEL,
    SUBJECT,
    LEFT(BODY, 200) as BODY_PREVIEW,
    MNPI_RISK_SCORE,
    CONFIDENTIALITY_RISK_SCORE,
    INFO_BARRIER_RISK_SCORE
FROM COMPLIANCE_DEMO.ML.EMAIL_SEMANTIC_FEATURES
WHERE COMPLIANCE_LABEL != 'CLEAN'
ORDER BY (MNPI_RISK_SCORE + CONFIDENTIALITY_RISK_SCORE + INFO_BARRIER_RISK_SCORE) DESC
LIMIT 3
""").to_pandas()

print("\n" + "="*80)
print("TOP VIOLATIONS BY RISK SCORE")
print("="*80)

for _, row in examples.iterrows():
    print(f"\n[{row['COMPLIANCE_LABEL']}]")
    print(f"Subject: {row['SUBJECT']}")
    print(f"Body: {row['BODY_PREVIEW']}...")
    print(f"Risk Scores: MNPI={row['MNPI_RISK_SCORE']:.3f}, CONF={row['CONFIDENTIALITY_RISK_SCORE']:.3f}, IB={row['INFO_BARRIER_RISK_SCORE']:.3f}")

## Layer 1 Complete

**What we built:**
- **4 relative risk scores** (MNPI, confidentiality, personal trading, info barrier), not raw similarity
- **Clear separation:** Clean = negative, Violations = positive
- **Feature Store** with versioned, documented features

**Why relative scores work:**

| Approach | Clean Score | Violation Score | Separation |
|----------|-------------|-----------------|------------|
| Raw similarity | 0.68 | 0.72 | 0.04 (weak) |
| **Relative risk** | **-0.04** | **+0.05** | **0.09 (strong)** |

**Next:** Train an ML model on these relative risk scores →