# Parallel Data Enrichment with Snowflake

This notebook demonstrates how to deploy and use the `parallel-web-tools` Snowflake integration for data enrichment.

## Features

- **SQL UDF**: Use `parallel_enrich()` directly in SQL queries
- **Secure API Access**: External Access Integration for secure HTTPS calls
- **Multiple processors**: Choose speed vs. depth tradeoff
- **Easy deployment**: Python helper or manual SQL

## Prerequisites

```bash
pip install "parallel-web-tools[snowflake]"
export PARALLEL_API_KEY="your-api-key"
```

## Setup

In [None]:
# Install dependencies if needed
# !pip install "parallel-web-tools[snowflake]"

In [None]:
import os

from parallel_web_tools.integrations.snowflake import (
    get_cleanup_sql,
    get_setup_sql,
    get_udf_sql,
)

print("Parallel Snowflake integration loaded!")

## Configuration

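Load your Parallel API key from the environment. The Python deployment and cleanup helpers below also expect Snowflake credentials (`SNOWFLAKE_ACCOUNT`, `SNOWFLAKE_USER`, `SNOWFLAKE_PASSWORD`), which this notebook does not define for you.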

In [None]:
from dotenv import load_dotenv

# Load environment variables from .env file (if present)
load_dotenv()

api_key = os.environ.get("PARALLEL_API_KEY")
if api_key:
    print(f"PARALLEL_API_KEY is set ({len(api_key)} chars)")
else:
    print("PARALLEL_API_KEY not found. Create a .env file with:")
    print("  PARALLEL_API_KEY=your-key")

## Deployment Options

### Option 1: Python Deployment (Recommended)

Use the Python helper to deploy everything automatically.

In [None]:
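# Deploy to Snowflake (uncomment to run).
# deploy_parallel_functions is not imported in the Setup cell; import it first.
# SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, and SNOWFLAKE_PASSWORD must be defined beforehand.
# deploy_parallel_functions(
#     account=SNOWFLAKE_ACCOUNT,
#     user=SNOWFLAKE_USER,
#     password=SNOWFLAKE_PASSWORD,
#     parallel_api_key=os.environ.get("PARALLEL_API_KEY"),
# )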

### Option 2: Manual SQL Deployment

Get the SQL templates and run them manually in Snowflake.

In [None]:
# Get setup SQL with your API key.
# The generated SQL may embed the key, so clear this cell's output before sharing the notebook.
setup_sql = get_setup_sql(api_key=os.environ.get("PARALLEL_API_KEY"))
print("=" * 60)
print("SETUP SQL (01_setup.sql)")
print("=" * 60)
print(setup_sql[:2000] + "...")

In [None]:
# Get UDF creation SQL
udf_sql = get_udf_sql()
print("=" * 60)
print("UDF SQL (02_create_udf.sql)")
print("=" * 60)
print(udf_sql[:2000] + "...")
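To run the scripts by hand, it can help to write them to disk first. The sketch below shows one way to do that. `write_sql_files` and the temporary output directory are hypothetical names for illustration, not part of `parallel-web-tools`, and the file names simply mirror the labels printed above. The helper takes plain strings, so it runs without the package installed.

```python
from pathlib import Path
import tempfile


def write_sql_files(scripts: dict[str, str], out_dir: str) -> list[Path]:
    """Write each SQL script into out_dir and return the created paths."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    written = []
    for filename, sql in scripts.items():
        path = target / filename
        path.write_text(sql, encoding="utf-8")
        written.append(path)
    return written


# Placeholder strings stand in for the output of get_setup_sql(...) and get_udf_sql().
demo_scripts = {
    "01_setup.sql": "-- setup statements go here\n",
    "02_create_udf.sql": "-- UDF statements go here\n",
}
demo_dir = tempfile.mkdtemp()
demo_paths = write_sql_files(demo_scripts, demo_dir)
print([p.name for p in demo_paths])
```

In this notebook you would pass `{"01_setup.sql": setup_sql, "02_create_udf.sql": udf_sql}` instead of the placeholders. If the setup script contains your API key, keep that file out of version control.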

## SQL Usage Examples

After deployment, you can use `parallel_enrich()` in Snowflake SQL.

### Basic Enrichment

```sql
SELECT parallel_enrich(
    OBJECT_CONSTRUCT('company_name', 'Google'),
    ARRAY_CONSTRUCT('CEO name', 'Founding year')
) AS enriched_data;
```

**Example result:**
```json
{
  "ceo_name": "Sundar Pichai",
  "founding_year": "1998",
  "basis": [...]
}
```

### Multiple Input Fields

```sql
SELECT parallel_enrich(
    OBJECT_CONSTRUCT(
        'company_name', 'Apple',
        'website', 'apple.com',
        'industry', 'Technology'
    ),
    ARRAY_CONSTRUCT(
        'CEO name (current CEO or equivalent leader)',
        'Founding year (YYYY format)',
        'Headquarters city',
        'Number of employees (approximate)'
    )
) AS enriched_data;
```

### Custom Processor

Pass a processor name as the optional third argument to trade speed for depth.

```sql
-- Use base-fast for more depth
SELECT parallel_enrich(
    OBJECT_CONSTRUCT('company_name', 'Microsoft'),
    ARRAY_CONSTRUCT('CEO name', 'Recent news headline', 'Stock ticker'),
    'base-fast'
) AS enriched_data;
```

**Processor Options:**

| Processor | Speed | Cost | Best For |
|-----------|-------|------|----------|
| `lite-fast` | Fastest | Lowest | Basic metadata, high volume |
| `base-fast` | Fast | Low | Standard enrichments |
| `core-fast` | Medium | Medium | Cross-referenced data |
| `pro-fast` | Slow | High | Deep research |

### Enrich Table Rows

```sql
-- Enrich each row in a table
SELECT
    company_name,
    parallel_enrich(
        OBJECT_CONSTRUCT('company_name', company_name, 'website', website),
        ARRAY_CONSTRUCT('CEO name', 'Industry', 'Founding year')
    ) AS enriched_data
FROM companies
LIMIT 10;
```

### Parse Enriched Results

```sql
-- Extract specific fields from enriched data
WITH enriched AS (
    SELECT
        company_name,
        parallel_enrich(
            OBJECT_CONSTRUCT('company_name', company_name),
            ARRAY_CONSTRUCT('CEO name', 'Founding year')
        ) AS data
    FROM companies
)
SELECT
    company_name,
    data:ceo_name::STRING AS ceo,
    data:founding_year::STRING AS founded,
    data:basis AS sources
FROM enriched;
```

### Save Results to Table

```sql
-- Create a new table with enriched data
CREATE TABLE enriched_companies AS
SELECT
    c.*,
    e.enriched_data:ceo_name::STRING AS ceo_name,
    e.enriched_data:founding_year::STRING AS founding_year
FROM companies c
CROSS JOIN LATERAL (
    SELECT parallel_enrich(
        OBJECT_CONSTRUCT('company_name', c.company_name),
        ARRAY_CONSTRUCT('CEO name', 'Founding year')
    ) AS enriched_data
) e;
```

## Column Name Mapping

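Each output description is automatically converted to a snake_case JSON property name. As the examples show, parenthetical and bracketed qualifiers are dropped, and names that would start with a digit get a `col_` prefix: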

| Description | Property Name |
|-------------|---------------|
| `"CEO name"` | `ceo_name` |
| `"Founding year (YYYY)"` | `founding_year` |
| `"Annual revenue [USD]"` | `annual_revenue` |
| `"2024 Revenue"` | `col_2024_revenue` |
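To predict which property name a description will produce, a small helper can reproduce the mapping in the table. This is a minimal sketch inferred from the four examples above, not the library's actual implementation. `to_property_name` is a hypothetical name, and edge cases outside the table may behave differently in the real UDF.

```python
import re


def to_property_name(description: str) -> str:
    """Approximate the description-to-property-name mapping shown in the table."""
    # Drop parenthetical or bracketed qualifiers such as "(YYYY)" or "[USD]".
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", description)
    # Lowercase, then collapse every run of non-alphanumeric characters to "_".
    text = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    # Prefix names that would otherwise start with a digit.
    if text and text[0].isdigit():
        text = "col_" + text
    return text


for description in ["CEO name", "Founding year (YYYY)", "Annual revenue [USD]", "2024 Revenue"]:
    print(f"{description!r} -> {to_property_name(description)}")
```

Knowing the property name in advance is useful when writing path expressions such as `data:ceo_name::STRING`.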

## Error Handling

When an enrichment fails, the result JSON contains an `error` field with the message, so check that field before reading other properties.

```sql
-- Check for errors
SELECT
    enriched_data:error::STRING AS error_message
FROM (
    SELECT parallel_enrich(
        OBJECT_CONSTRUCT('company_name', 'NonexistentCompanyXYZ123'),
        ARRAY_CONSTRUCT('CEO name')
    ) AS enriched_data
);
```

Common errors:
- `"No API key provided"` - Secret not configured
- `"Timeout waiting for enrichment"` - API took too long
- `"API request failed: ..."` - Network or API error
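When results are fetched into Python, the same check can be applied on the client side. The sketch below assumes each result arrives as a JSON string or a dict with an optional `error` key, as in the query above. `split_result` is a hypothetical helper, and the sample payloads are illustrative rather than captured output.

```python
import json


def split_result(raw):
    """Return (data, error) for one parallel_enrich() result."""
    # Accept either a JSON string or an already-parsed mapping.
    result = json.loads(raw) if isinstance(raw, str) else dict(raw)
    error = result.pop("error", None)
    if error is not None:
        return None, str(error)
    return result, None


# Illustrative payloads shaped like the examples in this notebook.
ok_payload = '{"ceo_name": "Sundar Pichai", "founding_year": "1998", "basis": []}'
failed_payload = '{"error": "No API key provided"}'

for payload in (ok_payload, failed_payload):
    data, error = split_result(payload)
    if error:
        print(f"enrichment failed: {error}")
    else:
        print(f"CEO: {data['ceo_name']}")
```

Separating failures this way makes it easy to collect rows with errors for a later retry instead of writing them to the enriched table.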

## Cleanup

To remove all Parallel integration objects from Snowflake:

In [None]:
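# Option 1: Python cleanup (uncomment to run).
# cleanup_parallel_functions is not imported in the Setup cell; import it first.
# SNOWFLAKE_ACCOUNT, SNOWFLAKE_USER, and SNOWFLAKE_PASSWORD must be defined beforehand.
# cleanup_parallel_functions(
#     account=SNOWFLAKE_ACCOUNT,
#     user=SNOWFLAKE_USER,
#     password=SNOWFLAKE_PASSWORD,
# )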

In [None]:
# Option 2: Get cleanup SQL
cleanup_sql = get_cleanup_sql()
print("=" * 60)
print("CLEANUP SQL (03_cleanup.sql)")
print("=" * 60)
print(cleanup_sql)

## Best Practices

### 1. Use Specific Descriptions

```sql
-- Good - specific descriptions
ARRAY_CONSTRUCT(
    'CEO name (current CEO or equivalent leader)',
    'Founding year (YYYY format)',
    'Annual revenue (USD, most recent fiscal year)'
)

-- Less specific - may get inconsistent results
ARRAY_CONSTRUCT('CEO', 'Year', 'Revenue')
```

### 2. Use Appropriate Processors

- `lite-fast`: Basic metadata, high volume (cheapest)
- `base-fast`: Standard company information
- `core-fast`: Cross-referenced data
- `pro-fast`: Deep research requiring multiple sources

### 3. Process in Batches

```sql
-- For large tables, process in batches
SELECT parallel_enrich(...)
FROM companies
WHERE id BETWEEN 1 AND 100;
```
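The batch boundaries can be generated rather than typed by hand. The following sketch assumes an integer `id` column, as in the query above. `id_batches` is a hypothetical helper that only computes the ranges and does not talk to Snowflake.

```python
def id_batches(min_id: int, max_id: int, batch_size: int):
    """Yield inclusive (start, end) ID ranges that cover min_id..max_id."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    start = min_id
    while start <= max_id:
        # Clamp the final batch so it never runs past max_id.
        end = min(start + batch_size - 1, max_id)
        yield start, end
        start = end + 1


for start, end in id_batches(1, 250, 100):
    print(f"WHERE id BETWEEN {start} AND {end}")
```

Each printed predicate can be substituted into the query above, which keeps every statement to a bounded number of API calls.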

### 4. Cache Results

```sql
-- Store enriched results to avoid re-processing
CREATE TABLE enriched_cache AS
SELECT company_name, parallel_enrich(...) AS data
FROM companies;
```

## Security

The integration uses Snowflake's security features:

1. **Network Rule**: Only allows egress to `api.parallel.ai:443`
2. **Secret**: API key stored encrypted (not visible in SQL)
3. **External Access Integration**: Combines rule and secret
4. **Roles**: `PARALLEL_DEVELOPER` and `PARALLEL_USER` for permissions

Grant `PARALLEL_USER` to users who need to run enrichments:

```sql
GRANT ROLE PARALLEL_USER TO USER analyst_user;
```

## Next Steps

- See the [Snowflake Setup Guide](../docs/snowflake-setup.md) for more details
- Check [Parallel Documentation](https://docs.parallel.ai) for API information
- View [parallel-web-tools on GitHub](https://github.com/parallel-web/parallel-web-tools)