# Parallel Data Enrichment with Snowflake

This notebook demonstrates how to deploy and use the `parallel-web-tools` Snowflake integration for data enrichment.

## Features

- **Batched table function**: Rows in each partition are sent in a single API call; `PARTITION BY 1` batches the whole input
- **Secure API access**: An External Access Integration restricts the function to HTTPS calls to the Parallel API
- **Multiple processors**: Trade speed against research depth
- **Easy deployment**: Use the Python helper or run the SQL manually

## Prerequisites

```bash
pip install "parallel-web-tools[snowflake]"
export PARALLEL_API_KEY="your-api-key"
```

## Setup

In [None]:
# Install dependencies if needed
# !pip install "parallel-web-tools[snowflake]"

In [None]:
import os

from parallel_web_tools.integrations.snowflake import (
    get_cleanup_sql,
    get_setup_sql,
    get_udf_sql,
)

print("Parallel Snowflake integration loaded!")

## Configuration

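Load your Parallel API key from the environment or a `.env` file. The deployment and cleanup cells also expect `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, and `SNOWFLAKE_PASSWORD` to be defined before you run them.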

In [None]:
from dotenv import load_dotenv

# Load environment variables from .env file (if present)
load_dotenv()

api_key = os.environ.get("PARALLEL_API_KEY")
if api_key:
    print(f"PARALLEL_API_KEY is set ({len(api_key)} chars)")
else:
    print("PARALLEL_API_KEY not found. Create a .env file with:")
    print("  PARALLEL_API_KEY=your-key")
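
The deployment and cleanup cells refer to `SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, and `SNOWFLAKE_PASSWORD`, which this notebook never defines. One way to supply them is to read them from the environment as well. The sketch below assumes the environment variables carry those same names, which is a convention chosen here rather than something the package requires; `read_snowflake_credentials` is a hypothetical helper.

```python
import os
from typing import Mapping, Optional

# Assumed variable names; adjust them to match your own .env file.
REQUIRED_VARS = ("SNOWFLAKE_ACCOUNT", "SNOWFLAKE_USER", "SNOWFLAKE_PASSWORD")


def read_snowflake_credentials(env: Optional[Mapping[str, str]] = None) -> dict:
    """Return the Snowflake settings found in `env`, raising if any are missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise ValueError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in REQUIRED_VARS}


# Example (uncomment once the variables are set):
# creds = read_snowflake_credentials()
# SNOWFLAKE_ACCOUNT = creds["SNOWFLAKE_ACCOUNT"]
```

Failing early with the list of missing names is usually easier to debug than a connection error raised later during deployment.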

## Deployment Options

### Option 1: Python Deployment (Recommended)

Use the Python helper to deploy everything automatically.

In [None]:
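# Deploy to Snowflake (uncomment to run).
# Note: deploy_parallel_functions and the SNOWFLAKE_* variables are not
# defined by the cells above; import and set them before running.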
# deploy_parallel_functions(
#     account=SNOWFLAKE_ACCOUNT,
#     user=SNOWFLAKE_USER,
#     password=SNOWFLAKE_PASSWORD,
#     parallel_api_key=os.environ.get("PARALLEL_API_KEY"),
# )

### Option 2: Manual SQL Deployment

Get the SQL templates and run them manually in Snowflake.

In [None]:
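# Get setup SQL with your API key.
# The generated SQL may embed the key, so treat the printed output as sensitive.
setup_sql = get_setup_sql(api_key=os.environ.get("PARALLEL_API_KEY"))
print("=" * 60)
print("SETUP SQL (01_setup.sql)")
print("=" * 60)
# Show a preview; add an ellipsis only if the text was actually truncated
print(setup_sql[:2000] + ("..." if len(setup_sql) > 2000 else ""))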

In [None]:
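# Get UDF creation SQL
udf_sql = get_udf_sql()
print("=" * 60)
print("UDF SQL (02_create_udf.sql)")
print("=" * 60)
# Show a preview; add an ellipsis only if the text was actually truncated
print(udf_sql[:2000] + ("..." if len(udf_sql) > 2000 else ""))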
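
To run the templates manually it can help to save them to disk instead of copying them from the truncated preview. The helper below is a hypothetical convenience, not part of `parallel-web-tools`; it writes whatever strings you pass it, using the file names shown in the banners above.

```python
from pathlib import Path
from typing import List, Mapping


def write_sql_files(scripts: Mapping[str, str], directory: str = "sql") -> List[Path]:
    """Write each script to `directory/<filename>` and return the paths written."""
    out_dir = Path(directory)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, sql in scripts.items():
        path = out_dir / filename
        path.write_text(sql, encoding="utf-8")
        written.append(path)
    return written


# Example (uncomment after running the two cells that build the SQL):
# write_sql_files({"01_setup.sql": setup_sql, "02_create_udf.sql": udf_sql})
```

If the setup script embeds your API key, keep the output directory out of version control.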

## SQL Usage Examples

After deployment, the `parallel_enrich()` table function is available in Snowflake SQL.

### Basic Enrichment

```sql
WITH companies AS (
    SELECT * FROM (VALUES
        ('Google', 'google.com'),
        ('Anthropic', 'anthropic.com'),
        ('Apple', 'apple.com')
    ) AS t(company_name, website)
)
SELECT
    e.input:company_name::STRING AS company_name,
    e.input:website::STRING AS website,
    e.enriched:ceo_name::STRING AS ceo_name,
    e.enriched:founding_year::STRING AS founding_year
FROM companies t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT('company_name', t.company_name, 'website', t.website)),
         ARRAY_CONSTRUCT('CEO name', 'Founding year')
     ) OVER (PARTITION BY 1)) e;
```

**How it works:**
- `PARTITION BY 1` batches all rows into a single API call
- `TO_JSON(OBJECT_CONSTRUCT(...))` creates the input JSON
- Returns `input` (original data) and `enriched` (results) columns
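
The query shape above is regular enough to generate from Python. The following is a minimal sketch, assuming you want the raw `input` and `enriched` columns back; `build_enrich_query` and `sql_literal` are hypothetical helpers, and only the function path and SQL structure come from the example above.

```python
from typing import Sequence

# Fully qualified function name as used in the SQL examples.
FUNCTION_NAME = "PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich"


def sql_literal(value: str) -> str:
    """Quote a Python string as a SQL string literal.

    Backslashes and single quotes are escaped so the text survives
    Snowflake's single-quoted string parsing.
    """
    return "'" + value.replace("\\", "\\\\").replace("'", "''") + "'"


def build_enrich_query(
    table: str,
    input_columns: Sequence[str],
    outputs: Sequence[str],
    partition_by: str = "1",
) -> str:
    """Build a parallel_enrich query for `table`.

    `table`, `input_columns`, and `partition_by` are inserted as identifiers
    without escaping, so pass trusted names only.
    """
    pairs = ", ".join(f"{sql_literal(col)}, t.{col}" for col in input_columns)
    descriptions = ", ".join(sql_literal(text) for text in outputs)
    return (
        "SELECT e.input, e.enriched\n"
        f"FROM {table} t,\n"
        f"     TABLE({FUNCTION_NAME}(\n"
        f"         TO_JSON(OBJECT_CONSTRUCT({pairs})),\n"
        f"         ARRAY_CONSTRUCT({descriptions})\n"
        f"     ) OVER (PARTITION BY {partition_by})) e;"
    )


print(build_enrich_query("companies", ["company_name", "website"], ["CEO name", "Founding year"]))
```

Generating the `OBJECT_CONSTRUCT` pairs from one column list keeps the JSON keys and the source columns in sync.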

### Multiple Input Fields

```sql
SELECT
    e.input:company_name::STRING AS company_name,
    e.enriched:ceo_name::STRING AS ceo_name,
    e.enriched:founding_year::STRING AS founding_year,
    e.enriched:headquarters_city::STRING AS headquarters,
    e.enriched:number_of_employees::STRING AS employees
FROM my_companies t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT(
             'company_name', t.company_name,
             'website', t.website,
             'industry', t.industry
         )),
         ARRAY_CONSTRUCT(
             'CEO name (current CEO or equivalent leader)',
             'Founding year (YYYY format)',
             'Headquarters city',
             'Number of employees (approximate)'
         )
     ) OVER (PARTITION BY 1)) e;
```

### Custom Processor

Pass a processor name as the optional third argument to trade speed for depth.

```sql
SELECT
    e.input:company_name::STRING AS company_name,
    e.enriched:ceo_name::STRING AS ceo_name,
    e.enriched:recent_news_headline::STRING AS news
FROM my_companies t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT('company_name', t.company_name)),
         ARRAY_CONSTRUCT('CEO name', 'Recent news headline'),
         'base-fast'  -- processor option
     ) OVER (PARTITION BY 1)) e;
```

**Processor Options:**

| Processor | Speed | Cost | Best For |
|-----------|-------|------|----------|
| `lite-fast` | Fastest | Lowest | Basic metadata, high volume |
| `base-fast` | Fast | Low | Standard enrichments |
| `core-fast` | Medium | Medium | Cross-referenced data |
| `pro-fast` | Slow | High | Deep research |
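
When the processor name comes from user input or configuration, it may be worth validating it before it reaches the SQL. This is a small sketch: the four names are taken from the table above, `lite-fast` is used as the fallback because the Best Practices section describes it as the default, and `processor_argument` is a hypothetical helper.

```python
# Processor names as listed in the table above.
PROCESSORS = ("lite-fast", "base-fast", "core-fast", "pro-fast")


def processor_argument(name: str = "lite-fast") -> str:
    """Return `name` as a quoted SQL literal, rejecting unknown processors."""
    if name not in PROCESSORS:
        raise ValueError(
            f"Unknown processor {name!r}; expected one of: {', '.join(PROCESSORS)}"
        )
    return f"'{name}'"


print(processor_argument("base-fast"))
```

Checking against a fixed list also keeps arbitrary strings out of a query that is assembled by string formatting.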

### Enrich Table Rows

```sql
-- Enrich rows from an existing table
SELECT
    e.input:company_name::STRING AS company_name,
    e.input:website::STRING AS website,
    e.enriched:ceo_name::STRING AS ceo_name,
    e.enriched:founding_year::STRING AS founding_year
FROM companies t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT('company_name', t.company_name, 'website', t.website)),
         ARRAY_CONSTRUCT('CEO name', 'Founding year')
     ) OVER (PARTITION BY 1)) e;
```

### Partitioning Strategies

The `PARTITION BY` clause controls how rows are batched into API calls.

**All rows in one batch (default):**
```sql
-- Single API call for all rows
TABLE(parallel_enrich(...) OVER (PARTITION BY 1))
```

**One batch per group:**
```sql
-- Process each region as a separate API call
SELECT
    e.input:company_name::STRING AS company_name,
    e.input:region::STRING AS region,
    e.enriched:ceo_name::STRING AS ceo_name
FROM companies t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT('company_name', t.company_name, 'region', t.region)),
         ARRAY_CONSTRUCT('CEO name')
     ) OVER (PARTITION BY t.region)) e;
```

**Fixed batch sizes:**
```sql
-- Process in batches of 100 rows each
WITH numbered AS (
    SELECT *, CEIL(ROW_NUMBER() OVER (ORDER BY company_name) / 100.0) AS batch_id
    FROM companies
)
SELECT
    e.input:company_name::STRING AS company_name,
    e.enriched:ceo_name::STRING AS ceo_name
FROM numbered t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT('company_name', t.company_name)),
         ARRAY_CONSTRUCT('CEO name')
     ) OVER (PARTITION BY t.batch_id)) e;
```

| Pattern | Use Case |
|---------|----------|
| `PARTITION BY 1` | Small datasets; fastest when there are only a few rows |
| `PARTITION BY column` | Natural groupings, incremental processing |
| `PARTITION BY batch_id` | Large datasets with fixed batch sizes |
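
The fixed-size pattern relies on ceiling division of the row number by the batch size. The sketch below reproduces that arithmetic in Python so you can check how many batches, and of what size, a given row count produces; `batch_ids` is a hypothetical helper that mirrors the `batch_id` expression in the SQL.

```python
from collections import Counter
from typing import List


def batch_ids(row_count: int, batch_size: int = 100) -> List[int]:
    """Return the batch id for row numbers 1..row_count.

    Equivalent to CEIL(row_number / batch_size), computed with integer
    arithmetic to avoid floating-point rounding.
    """
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [
        (row_number + batch_size - 1) // batch_size
        for row_number in range(1, row_count + 1)
    ]


# 250 rows in batches of 100: two full batches and one partial batch.
sizes = Counter(batch_ids(250))
print(dict(sizes))  # {1: 100, 2: 100, 3: 50}
```

Each distinct batch id becomes one partition, so 250 rows at a batch size of 100 result in three API calls.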

### Save Results to Table

```sql
-- Create a new table with enriched data
CREATE TABLE enriched_companies AS
SELECT
    e.input:company_name::STRING AS company_name,
    e.input:website::STRING AS website,
    e.enriched:ceo_name::STRING AS ceo_name,
    e.enriched:founding_year::STRING AS founding_year
FROM companies t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT('company_name', t.company_name, 'website', t.website)),
         ARRAY_CONSTRUCT('CEO name', 'Founding year')
     ) OVER (PARTITION BY 1)) e;
```

## Column Name Mapping

Each output description is automatically converted to a valid JSON property name, which is the key you read from the `enriched` column:

| Description | Property Name |
|-------------|---------------|
| `"CEO name"` | `ceo_name` |
| `"Founding year (YYYY)"` | `founding_year` |
| `"Annual revenue [USD]"` | `annual_revenue` |
| `"2024 Revenue"` | `col_2024_revenue` |
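
The integration's exact conversion rules are not shown here. The sketch below is one set of rules that reproduces the four rows in the table (drop bracketed qualifiers, lowercase, join words with underscores, prefix names that start with a digit); treat it as an approximation for predicting property names, not as the library's implementation. `to_property_name` is a hypothetical helper.

```python
import re


def to_property_name(description: str) -> str:
    """Approximate the description-to-property-name mapping shown in the table."""
    # Drop qualifiers in parentheses or square brackets, e.g. "(YYYY)" or "[USD]".
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", description)
    # Lowercase and collapse every run of non-alphanumeric characters to "_".
    text = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    if not text:
        # Assumed fallback for descriptions with no usable characters.
        return "col"
    # Names may not start with a digit, so prefix them as in the last table row.
    if text[0].isdigit():
        text = "col_" + text
    return text


for description in ["CEO name", "Founding year (YYYY)", "Annual revenue [USD]", "2024 Revenue"]:
    print(f"{description!r} -> {to_property_name(description)}")
```

If a column comes back empty, inspect the keys in the raw `enriched` value to confirm the actual property name.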

## Error Handling

Errors are returned in the `enriched` column under an `error` key, so you can select `enriched:error` alongside the regular fields.

```sql
-- Check for errors in results
SELECT
    e.input:company_name::STRING AS company_name,
    e.enriched:error::STRING AS error_message,
    e.enriched:ceo_name::STRING AS ceo_name
FROM companies t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(
         TO_JSON(OBJECT_CONSTRUCT('company_name', t.company_name)),
         ARRAY_CONSTRUCT('CEO name')
     ) OVER (PARTITION BY 1)) e;
```

Common errors:
- `"No API key provided"` - Secret not configured
- `"Timeout waiting for enrichment"` - API took too long
- `"API request failed: ..."` - Network or API error
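
After fetching results into Python, you may want to separate failed rows so they can be retried. This sketch assumes each row is an `(input, enriched)` pair whose values arrive either as JSON strings or as already-parsed dicts, and that a failure is signalled by a non-empty `error` key as in the query above. `split_results` is a hypothetical helper and the sample rows are made up.

```python
import json
from typing import Any, Dict, Iterable, List, Tuple, Union

JsonValue = Union[str, Dict[str, Any]]


def _as_dict(value: JsonValue) -> Dict[str, Any]:
    """Parse a JSON string, or copy a value that is already a dict."""
    return json.loads(value) if isinstance(value, str) else dict(value)


def split_results(
    rows: Iterable[Tuple[JsonValue, JsonValue]],
) -> Tuple[List[Tuple[dict, dict]], List[Tuple[dict, str]]]:
    """Split (input, enriched) rows into successes and failures."""
    succeeded, failed = [], []
    for raw_input, raw_enriched in rows:
        record = _as_dict(raw_input)
        enriched = _as_dict(raw_enriched)
        if enriched.get("error"):
            failed.append((record, enriched["error"]))
        else:
            succeeded.append((record, enriched))
    return succeeded, failed


# Illustrative rows only; real values come from the parallel_enrich query.
sample = [
    ('{"company_name": "Example A"}', '{"ceo_name": "Example CEO"}'),
    ('{"company_name": "Example B"}', '{"error": "Timeout waiting for enrichment"}'),
]
ok, errors = split_results(sample)
print(len(ok), "succeeded;", len(errors), "failed")
```

The failed inputs can then be written to a retry table and enriched again in a smaller batch.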

## Cleanup

To remove all Parallel integration objects from Snowflake:

In [None]:
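# Option 1: Python cleanup (uncomment to run).
# Note: cleanup_parallel_functions and the SNOWFLAKE_* variables are not
# defined by the cells above; import and set them before running.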
# cleanup_parallel_functions(
#     account=SNOWFLAKE_ACCOUNT,
#     user=SNOWFLAKE_USER,
#     password=SNOWFLAKE_PASSWORD,
# )

In [None]:
# Option 2: Get cleanup SQL
cleanup_sql = get_cleanup_sql()
print("=" * 60)
print("CLEANUP SQL (03_cleanup.sql)")
print("=" * 60)
print(cleanup_sql)

## Best Practices

### 1. Use Specific Descriptions

```sql
-- Good - specific descriptions
ARRAY_CONSTRUCT(
    'CEO name (current CEO or equivalent leader)',
    'Founding year (YYYY format)',
    'Annual revenue (USD, most recent fiscal year)'
)

-- Less specific - may get inconsistent results
ARRAY_CONSTRUCT('CEO', 'Year', 'Revenue')
```

### 2. Use Appropriate Processors

- `lite-fast`: Basic metadata, high volume (cheapest, default)
- `base-fast`: Standard company information
- `core-fast`: Cross-referenced data from multiple sources
- `pro-fast`: Deep research requiring multiple sources

### 3. Batching via PARTITION BY

Use `PARTITION BY 1` to batch all rows into a single API call, or partition by a column to send one call per group:

```sql
-- All rows processed together (efficient)
TABLE(parallel_enrich(...) OVER (PARTITION BY 1))

-- Each group processed separately
TABLE(parallel_enrich(...) OVER (PARTITION BY region))
```

### 4. Cache Results

```sql
-- Store enriched results to avoid re-processing
CREATE TABLE enriched_cache AS
SELECT e.input, e.enriched
FROM source_table t,
     TABLE(PARALLEL_INTEGRATION.ENRICHMENT.parallel_enrich(...) OVER (PARTITION BY 1)) e;
```

## Security

The integration uses Snowflake's security features:

1. **Network Rule**: Only allows egress to `api.parallel.ai:443`
2. **Secret**: The API key is stored encrypted and is not visible in SQL
3. **External Access Integration**: Combines the network rule and the secret
4. **Roles**: `PARALLEL_DEVELOPER` and `PARALLEL_USER` control permissions

Grant `PARALLEL_USER` to users who need to run enrichments:

```sql
GRANT ROLE PARALLEL_USER TO USER analyst_user;
```

## Next Steps

- See the [Snowflake Setup Guide](../docs/snowflake-setup.md) for more details
- Check [Parallel Documentation](https://docs.parallel.ai) for API information
- View [parallel-web-tools on GitHub](https://github.com/parallel-web/parallel-web-tools)