# 06 - Vector Search with Snowflake Arctic Embed

## Semantic Search for Similar Violations

**Goal:** Use vector embeddings to find emails similar to known violations.

### Use Cases for Compliance

1. **"Find emails like this violation"** - When you find a problematic email, search for similar ones
2. **Anomaly detection** - Emails far from "normal" clusters may warrant review
3. **Pattern discovery** - Group similar communications to find systemic issues
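
The anomaly-detection idea in item 2 can be sketched locally: average the embeddings of known-normal emails into a centroid, then rank other emails by cosine distance from it. This is a minimal illustration on toy 3-dimensional vectors, not part of the notebook's pipeline; the helper names (`cosine_similarity`, `centroid`, `rank_by_distance`) and the sample vectors are hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def rank_by_distance(center, emails):
    """Return (email_id, cosine distance) pairs, farthest from the center first."""
    scored = [(eid, 1.0 - cosine_similarity(center, vec)) for eid, vec in emails.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy stand-ins for real 768-dimensional embeddings (hypothetical data).
normal = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.0], [1.0, 0.0, 0.1]]
candidates = {"E1": [0.95, 0.1, 0.05], "E2": [0.0, 0.2, 1.0]}

ranking = rank_by_distance(centroid(normal), candidates)
print(ranking[0][0])  # ID of the candidate farthest from the "normal" centroid
```

In practice the distance cutoff that counts as "far" would need to be calibrated on labeled data; the sketch only produces a ranking.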

### Embedding Model: snowflake-arctic-embed-m

| Model | Dimensions | Optimized For |
|-------|------------|---------------|
| e5-base-v2 | 768 | General purpose |
| **snowflake-arctic-embed-m** | 768 | **Retrieval/Search** |
| snowflake-arctic-embed-l | 1024 | Highest quality |

We use `snowflake-arctic-embed-m` because it is optimized for retrieval tasks and its 768-dimensional output matches the `EMBED_TEXT_768` function used below.

---

In [None]:
# Setup
from snowflake.snowpark import Session

session = Session.builder.getOrCreate()
session.use_warehouse("COMPLIANCE_DEMO_WH")
session.use_database("COMPLIANCE_DEMO")
session.use_schema("SEARCH")

print(f"Connected as: {session.get_current_user()}")

## 1. Generate Email Embeddings

Create vector embeddings for all emails using Arctic Embed.

In [None]:
# Generate embeddings using snowflake-arctic-embed-m
# This may take a few minutes for 10K emails
print("🔄 Generating embeddings for all emails...")
print("   Using model: snowflake-arctic-embed-m (768 dimensions)")

session.sql("""
INSERT OVERWRITE INTO COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS
SELECT 
    EMAIL_ID,
    SUBJECT,
    LEFT(BODY, 500) as BODY_PREVIEW,
    COMPLIANCE_LABEL,
    SENDER,
    RECIPIENT,
    
    -- Generate embedding using Arctic Embed
    SNOWFLAKE.CORTEX.EMBED_TEXT_768(
        'snowflake-arctic-embed-m',
        CONCAT(
            'Subject: ', COALESCE(SUBJECT, ''), ' ',
            'Body: ', COALESCE(BODY, '')
        )
    ) AS EMBEDDING,
    
    CURRENT_TIMESTAMP() as EMBEDDED_AT
    
FROM COMPLIANCE_DEMO.EMAIL_SURVEILLANCE.EMAILS
""").collect()

# Check count
count = session.sql("SELECT COUNT(*) as CNT FROM COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS").collect()[0]["CNT"]
print(f"✅ Generated embeddings for {count:,} emails")
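
The `COALESCE` calls in the query matter because Snowflake's `CONCAT` returns `NULL` when any argument is `NULL`, so an email with a missing subject would otherwise have no input text at all. The same input format can be mirrored in Python, for example to preview what is sent to the model. `build_embedding_text` is a hypothetical helper, not something the notebook defines.

```python
def build_embedding_text(subject, body):
    """Mirror the SQL CONCAT/COALESCE expression that builds the model input."""
    # `or ''` plays the role of COALESCE(..., '') for missing (None) values.
    return f"Subject: {subject or ''} Body: {body or ''}"

preview = build_embedding_text("Q3 numbers", None)
print(repr(preview))
```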

## 2. Semantic Search: Find Similar Emails

Given a known violation, find similar emails that may also be problematic.

In [None]:
# Find an example insider trading email to use as query
print("--- Reference: Known Insider Trading Email ---")
reference_email = session.sql("""
    SELECT EMAIL_ID, SUBJECT, BODY_PREVIEW 
    FROM COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS 
    WHERE COMPLIANCE_LABEL = 'INSIDER_TRADING'
    LIMIT 1
""").collect()[0]

print(f"Email ID: {reference_email['EMAIL_ID']}")
print(f"Subject: {reference_email['SUBJECT']}")
print(f"Preview: {reference_email['BODY_PREVIEW'][:200]}...")

In [None]:
# Search for similar emails using VECTOR_COSINE_SIMILARITY
print("\n--- Top 10 Most Similar Emails ---")

ref_id = reference_email['EMAIL_ID']

session.sql(f"""
    WITH query_embedding AS (
        SELECT EMBEDDING 
        FROM COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS 
        WHERE EMAIL_ID = '{ref_id}'
    )
    SELECT 
        e.EMAIL_ID,
        e.COMPLIANCE_LABEL,
        ROUND(VECTOR_COSINE_SIMILARITY(e.EMBEDDING, q.EMBEDDING), 4) as SIMILARITY,
        e.SUBJECT,
        LEFT(e.BODY_PREVIEW, 100) as PREVIEW
    FROM COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS e
    CROSS JOIN query_embedding q
    WHERE e.EMAIL_ID != '{ref_id}'
    ORDER BY SIMILARITY DESC
    LIMIT 10
""").show()
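
The query above is a brute-force nearest-neighbor scan: score every stored embedding against the reference, drop the reference itself, and keep the ten highest scores. A minimal local sketch of the same logic, using toy vectors and hypothetical names (`top_k_similar`, the sample IDs), might look like this:

```python
import heapq
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def top_k_similar(ref_id, embeddings, k=10):
    """Score every email against the reference and return the k best matches."""
    ref_vec = embeddings[ref_id]
    scored = (
        (round(cosine_similarity(vec, ref_vec), 4), email_id)
        for email_id, vec in embeddings.items()
        if email_id != ref_id  # mirrors WHERE e.EMAIL_ID != ref_id
    )
    return heapq.nlargest(k, scored)  # mirrors ORDER BY SIMILARITY DESC LIMIT k

# Toy stand-ins for stored embeddings (hypothetical data).
embeddings = {
    "ref": [1.0, 0.0, 0.0],
    "near": [0.9, 0.1, 0.0],
    "far": [0.0, 1.0, 0.0],
    "opposite": [-1.0, 0.0, 0.0],
}
results = top_k_similar("ref", embeddings, k=2)
for score, email_id in results:
    print(email_id, score)
```

Cosine similarity ranges from -1 to 1, so scores close to 1 indicate vectors pointing in nearly the same direction.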

## 3. Search by Natural Language Query

Investigators can search using plain English descriptions.

In [None]:
# Search using natural language query
search_query = "emails discussing upcoming acquisitions or mergers before public announcement"

print(f"🔍 Search Query: '{search_query}'")
print("\n--- Search Results ---")

# Pass the query as a bind variable so quotes in the text cannot break the SQL
session.sql("""
    SELECT 
        EMAIL_ID,
        COMPLIANCE_LABEL,
        ROUND(VECTOR_COSINE_SIMILARITY(
            EMBEDDING,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', ?)
        ), 4) as RELEVANCE,
        SUBJECT,
        LEFT(BODY_PREVIEW, 150) as PREVIEW
    FROM COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS
    ORDER BY RELEVANCE DESC
    LIMIT 10
""", params=[search_query]).show()

## 4. Analyze Embedding Clusters by Label

See how different violation types cluster in embedding space.

In [None]:
# Calculate average pairwise similarity within each compliance label
print("--- Intra-Label Similarity (how similar are emails within the same category?) ---")

session.sql("""
    WITH label_samples AS (
        SELECT 
            COMPLIANCE_LABEL,
            EMAIL_ID,
            EMBEDDING,
            ROW_NUMBER() OVER (PARTITION BY COMPLIANCE_LABEL ORDER BY RANDOM()) as RN
        FROM COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS
    )
    SELECT 
        a.COMPLIANCE_LABEL,
        ROUND(AVG(VECTOR_COSINE_SIMILARITY(a.EMBEDDING, b.EMBEDDING)), 4) as AVG_INTRA_SIMILARITY
    FROM label_samples a
    JOIN label_samples b 
        ON a.COMPLIANCE_LABEL = b.COMPLIANCE_LABEL 
        AND a.EMAIL_ID < b.EMAIL_ID
    WHERE a.RN <= 50 AND b.RN <= 50  -- Sample for speed
    GROUP BY a.COMPLIANCE_LABEL
    ORDER BY AVG_INTRA_SIMILARITY DESC
""").show()
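
Intra-label similarity is easier to interpret next to a between-label baseline: a label is well separated only if its emails are more similar to each other than to emails from other labels. The following sketch computes both averages on toy 2-dimensional vectors. The function names, the `CLEAN` label, and the sample data are hypothetical, and the code runs locally rather than in Snowflake.

```python
import math
from itertools import combinations, product
from statistics import mean

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def intra_label_similarity(by_label):
    """Average pairwise similarity among emails that share a label."""
    return {
        label: mean(cosine_similarity(a, b) for a, b in combinations(vecs, 2))
        for label, vecs in by_label.items()
        if len(vecs) > 1  # a single email has no pairs to compare
    }

def inter_label_similarity(by_label):
    """Average similarity between emails of each pair of different labels."""
    return {
        (l1, l2): mean(
            cosine_similarity(a, b) for a, b in product(by_label[l1], by_label[l2])
        )
        for l1, l2 in combinations(sorted(by_label), 2)
    }

# Toy stand-ins for sampled embeddings (hypothetical data).
by_label = {
    "CLEAN": [[1.0, 0.0], [0.9, 0.1]],
    "INSIDER_TRADING": [[0.0, 1.0], [0.1, 0.9]],
}
intra = intra_label_similarity(by_label)
inter = inter_label_similarity(by_label)
print(intra)
print(inter)
```

If the intra-label averages are not clearly higher than the inter-label ones, the embeddings alone are unlikely to separate those categories.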

## 5. Build a Reusable Search Function

Create a SQL function for easy searching.

In [None]:
# Create a SQL table function (UDTF) for semantic search
session.sql("""
CREATE OR REPLACE FUNCTION COMPLIANCE_DEMO.SEARCH.SEARCH_EMAILS(query STRING, limit_results INT)
RETURNS TABLE (
    EMAIL_ID VARCHAR,
    COMPLIANCE_LABEL VARCHAR,
    RELEVANCE FLOAT,
    SUBJECT VARCHAR,
    BODY_PREVIEW VARCHAR
)
AS
$$
    SELECT 
        EMAIL_ID,
        COMPLIANCE_LABEL,
        ROUND(VECTOR_COSINE_SIMILARITY(
            EMBEDDING,
            SNOWFLAKE.CORTEX.EMBED_TEXT_768('snowflake-arctic-embed-m', query)
        ), 4) as RELEVANCE,
        SUBJECT,
        BODY_PREVIEW
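    FROM COMPLIANCE_DEMO.SEARCH.EMAIL_EMBEDDINGS
    -- QUALIFY rather than LIMIT: LIMIT expects a constant, not a function argument
    QUALIFY ROW_NUMBER() OVER (ORDER BY RELEVANCE DESC) <= limit_results
    ORDER BY RELEVANCE DESC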
$$
""").collect()

print("✅ Created COMPLIANCE_DEMO.SEARCH.SEARCH_EMAILS(query, limit_results)")

# Test the function
print("\n--- Test: SEARCH_EMAILS('confidential client information leak', 5) ---")
session.sql("""
    SELECT * FROM TABLE(COMPLIANCE_DEMO.SEARCH.SEARCH_EMAILS('confidential client information leak', 5))
""").show()
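
For callers working from Python, the table function can be wrapped in a small helper. This is a sketch under assumptions: `search_emails` is a hypothetical name, it expects a Snowpark `Session` like the one created in Setup, and it assumes bind variables passed through `session.sql(..., params=[...])` are accepted as table function arguments in your environment. Binding the query text means apostrophes in it do not break the SQL string.

```python
def search_emails(session, query, limit=10):
    """Call SEARCH_EMAILS with bind variables and return the rows as dicts."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    sql = "SELECT * FROM TABLE(COMPLIANCE_DEMO.SEARCH.SEARCH_EMAILS(?, ?))"
    # `session` is assumed to be a snowflake.snowpark.Session
    rows = session.sql(sql, params=[query, int(limit)]).collect()
    return [row.as_dict() for row in rows]
```

A call such as `search_emails(session, "client's confidential data", limit=5)` would then return up to five result rows as plain dictionaries.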

## Summary

**What we built:**
- Generated 768-dimensional embeddings for 10K emails using `snowflake-arctic-embed-m`
- Demonstrated similarity search ("find emails like this violation")
- Showed natural language search ("emails about upcoming acquisitions")
- Analyzed embedding clusters by compliance label
- Created a reusable `SEARCH_EMAILS()` function

**Key Benefits:**
1. **Find Hidden Patterns:** Discover related violations missed by keyword search
2. **Natural Language Interface:** Investigators search in plain English
3. **Scalable:** Arctic Embed is optimized for retrieval at scale
4. **All in Snowflake:** No data movement to external services

**Production Use:**
```sql
-- Find emails matching a natural-language description of a violation
SELECT * FROM TABLE(COMPLIANCE_DEMO.SEARCH.SEARCH_EMAILS(
    'sharing non-public information about company earnings',
    20
));
```