# Parallel Data Enrichment with BigQuery

This notebook demonstrates how to use the `parallel-web-tools` package to enrich data directly in BigQuery using SQL-native Remote Functions.

## Architecture

```
BigQuery SQL Query
       │
       ▼
BigQuery Remote Function (parallel_enrich)
       │
       ▼
Cloud Function (HTTP endpoint)
       │
       ▼
Parallel Task API
```
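
The hop between BigQuery and the Cloud Function follows the standard Remote Function contract: BigQuery posts a JSON body whose `calls` array holds one argument list per row, and expects a `replies` array of the same length in return. The sketch below illustrates that contract in isolation. `handle_remote_call` and `fake_enrich` are hypothetical names used for illustration, not the handler that `parallel-web-tools` deploys, and the `lite-fast` fallback is taken from the Processor Options section.

```python
import json
from typing import Any, Callable


def handle_remote_call(body: dict[str, Any], enrich: Callable[[Any, Any, str], dict]) -> dict[str, Any]:
    """Map a BigQuery remote-function request body to a response body.

    BigQuery sends {"calls": [[arg1, arg2], ...], "userDefinedContext": {...}}
    and expects {"replies": [...]} with one reply per call, in the same order.
    """
    context = body.get("userDefinedContext") or {}
    processor = context.get("processor", "lite-fast")
    replies = []
    for input_data, output_columns in body.get("calls", []):
        try:
            result = enrich(input_data, output_columns, processor)
        except Exception as exc:
            # Report the failure per row instead of failing the whole batch.
            result = {"error": str(exc)}
        replies.append(json.dumps(result))
    return {"replies": replies}


def fake_enrich(input_data, output_columns, processor):
    """Hypothetical stand-in for the real Parallel Task API call."""
    return {"processor": processor, "columns": len(output_columns)}


request_body = {
    "userDefinedContext": {"processor": "pro-fast"},
    "calls": [[{"company_name": "Google"}, ["CEO name", "Founding year"]]],
}
response = handle_remote_call(request_body, fake_enrich)
print(response)
```

Returning one reply per call, even on failure, keeps the row count aligned with what BigQuery sent.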

## Features

- **SQL-native**: Use `parallel_enrich()` directly in BigQuery SQL
- **Serverless**: Cloud Function scales automatically
- **Secure**: API key stored in Secret Manager
- **Multiple processors**: Choose speed vs. depth tradeoff

## Prerequisites

1. Google Cloud project with billing enabled
2. `gcloud` CLI installed and authenticated
3. `bq` CLI (comes with Google Cloud SDK)
4. Parallel API key from [platform.parallel.ai](https://platform.parallel.ai)

```bash
# Install the package
pip install parallel-web-tools

# Authenticate with GCP
gcloud auth login
gcloud auth application-default login
```

## Configuration

Set your GCP project and Parallel API key:

In [None]:
import os

# Your GCP project ID
PROJECT_ID = "your-gcp-project"  # <-- CHANGE THIS
REGION = "us-central1"
DATASET_ID = "parallel_functions"

# Your Parallel API key (or set PARALLEL_API_KEY env var)
PARALLEL_API_KEY = os.environ.get("PARALLEL_API_KEY", "your-parallel-api-key")  # <-- CHANGE THIS

print(f"Project: {PROJECT_ID}")
print(f"Region: {REGION}")
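# Only show a masked key when a real one has been supplied
key_is_set = PARALLEL_API_KEY != "your-parallel-api-key" and len(PARALLEL_API_KEY) > 8
print(f"API Key: {'*' * 8 + '...' + PARALLEL_API_KEY[-4:] if key_is_set else '(not set)'}")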

## Step 1: Deploy the Integration

Deploy the Cloud Function and BigQuery Remote Functions. This creates:

1. **Secret** in Secret Manager for the API key
2. **Cloud Function** (Gen2) that handles enrichment requests
3. **BigQuery Connection** for remote function calls
4. **BigQuery Dataset** (`parallel_functions`)
5. **Remote Functions**:
   - `parallel_enrich(input_data, output_columns)` - Main enrichment function
   - `parallel_enrich_company(name, website, fields)` - Convenience function

### Option A: Using Python API

In [None]:
from parallel_web_tools.integrations.bigquery import deploy_bigquery_integration

# Deploy the integration (takes 2-3 minutes)
result = deploy_bigquery_integration(
    project_id=PROJECT_ID,
    api_key=PARALLEL_API_KEY,
    region=REGION,
    dataset_id=DATASET_ID,
)

print(f"\nFunction URL: {result['function_url']}")
print(f"\nExample query:\n{result['example_query']}")

### Option B: Using CLI

Alternatively, deploy from the command line:

```bash
parallel-cli enrich deploy --system bigquery \
    --project=your-gcp-project \
    --region=us-central1 \
    --api-key=your-parallel-api-key
```

## Step 2: Connect to BigQuery

Now let's connect to BigQuery and run some enrichment queries.

In [None]:
from google.cloud import bigquery

# Create BigQuery client
client = bigquery.Client(project=PROJECT_ID)


def run_query(sql):
    """Run a query and return the results as a pandas DataFrame.

    Requires the pandas and db-dtypes packages alongside google-cloud-bigquery.
    """
    return client.query(sql).to_dataframe()


print(f"Connected to BigQuery project: {PROJECT_ID}")

## Step 3: Basic Enrichment

Use `parallel_enrich()` to enrich data directly in SQL:

In [None]:
# Basic enrichment - single company
query = f"""
SELECT
    'Google' as company_name,
    `{PROJECT_ID}.{DATASET_ID}.parallel_enrich`(
        JSON_OBJECT('company_name', 'Google', 'website', 'google.com'),
        JSON_ARRAY('CEO name', 'Founding year', 'Brief description')
    ) as enriched_data
"""

print("Running enrichment query...")
df = run_query(query)
df
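
When the inputs come from Python rather than a table, the `JSON_OBJECT` and `JSON_ARRAY` arguments can be generated instead of typed by hand. The helper below is a hypothetical sketch (`build_enrich_sql` and `sql_str` are not part of `parallel-web-tools`). It escapes backslashes and single quotes so that a value such as `O'Reilly Media` does not break the query. For untrusted input, BigQuery query parameters are the safer choice.

```python
def sql_str(value: str) -> str:
    """Quote a Python string as a single-quoted BigQuery string literal."""
    escaped = value.replace("\\", "\\\\").replace("'", "\\'")
    return f"'{escaped}'"


def build_enrich_sql(project_id: str, dataset_id: str,
                     input_fields: dict[str, str], output_columns: list[str]) -> str:
    """Build a SELECT that calls parallel_enrich for one input record."""
    # JSON_OBJECT takes alternating key, value arguments.
    object_args = ", ".join(
        f"{sql_str(key)}, {sql_str(val)}" for key, val in input_fields.items()
    )
    array_args = ", ".join(sql_str(col) for col in output_columns)
    return (
        "SELECT\n"
        f"    `{project_id}.{dataset_id}.parallel_enrich`(\n"
        f"        JSON_OBJECT({object_args}),\n"
        f"        JSON_ARRAY({array_args})\n"
        "    ) AS enriched_data"
    )


sql = build_enrich_sql(
    "your-gcp-project",
    "parallel_functions",
    {"company_name": "O'Reilly Media", "website": "oreilly.com"},
    ["CEO name", "Founding year"],
)
print(sql)
```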

## Step 4: Parse JSON Results

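The enrichment returns a JSON object. In the query below, each requested column comes back under a snake_case key without its parenthetical hint (for example, `'CEO name (current CEO)'` is read from `$.ceo_name`). Use `JSON_EXTRACT_SCALAR` to pull each field into its own column: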

In [None]:
# Parse JSON results into columns
query = f"""
WITH enriched AS (
    SELECT
        name,
        `{PROJECT_ID}.{DATASET_ID}.parallel_enrich`(
            JSON_OBJECT('company_name', name),
            JSON_ARRAY(
                'CEO name (current CEO)',
                'Founding year (YYYY)',
                'Headquarters city'
            )
        ) as info
    FROM UNNEST(['Google', 'Microsoft', 'Apple']) as name
)
SELECT
    name,
    JSON_EXTRACT_SCALAR(info, '$.ceo_name') as ceo,
    JSON_EXTRACT_SCALAR(info, '$.founding_year') as founded,
    JSON_EXTRACT_SCALAR(info, '$.headquarters_city') as hq
FROM enriched
"""

print("Running enrichment with JSON parsing...")
df = run_query(query)
df
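
The same parsing can be done client-side once the raw JSON has been fetched. This is a minimal sketch, assuming each result is a JSON string with snake_case keys as in the query above. `parse_enrichment` is a hypothetical helper, and the sample payloads are made up for illustration rather than real API output.

```python
import json


def parse_enrichment(raw, fields):
    """Extract the requested fields from one enrichment result.

    Missing keys and unparseable payloads yield None values instead of raising.
    """
    try:
        data = json.loads(raw) if isinstance(raw, str) else dict(raw or {})
    except (TypeError, ValueError):  # json.JSONDecodeError is a ValueError
        data = {}
    if not isinstance(data, dict):
        data = {}
    return {field: data.get(field) for field in fields}


# Made-up payloads: one well-formed result and one broken one.
samples = [
    '{"ceo_name": "Example Person", "founding_year": "1999"}',
    "not json",
]
wanted = ["ceo_name", "founding_year", "headquarters_city"]
rows = [parse_enrichment(sample, wanted) for sample in samples]
for row in rows:
    print(row)
```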

## Step 5: Company Convenience Function

Use `parallel_enrich_company()` for a simpler interface when enriching company data:

In [None]:
# Convenience function for companies
query = f"""
SELECT
    `{PROJECT_ID}.{DATASET_ID}.parallel_enrich_company`(
        'Parallel Web Systems',
        'parallel.ai',
        JSON_ARRAY('CEO name', 'Stock ticker', 'Number of employees')
    ) as company_info
"""

print("Running company enrichment...")
df = run_query(query)
df

## Step 6: Enrich from Existing Tables

Enrich data from your existing BigQuery tables:

In [None]:
# First, create a sample table
create_table_query = f"""
CREATE OR REPLACE TABLE `{PROJECT_ID}.{DATASET_ID}.sample_companies` AS
SELECT * FROM UNNEST([
    STRUCT('Amazon' as name, 'amazon.com' as website),
    STRUCT('Netflix' as name, 'netflix.com' as website),
    STRUCT('Spotify' as name, 'spotify.com' as website)
])
"""

client.query(create_table_query).result()
print("Created sample_companies table")

In [None]:
# Enrich from the table
query = f"""
SELECT
    name,
    website,
    `{PROJECT_ID}.{DATASET_ID}.parallel_enrich`(
        JSON_OBJECT('company_name', name, 'website', website),
        JSON_ARRAY('Industry', 'Founded year', 'Business model')
    ) as enriched_data
FROM `{PROJECT_ID}.{DATASET_ID}.sample_companies`
"""

print("Enriching from table...")
df = run_query(query)
df

## Processor Options

The default processor is `lite-fast`. To use a different processor, create an additional remote function that passes another `processor` value in `user_defined_context`.

| Processor | Speed | Cost | Best For |
|-----------|-------|------|----------|
| `lite`, `lite-fast` | Fastest | ~$0.005/query | Basic metadata, high volume |
| `base`, `base-fast` | Fast | ~$0.01/query | Standard enrichments |
| `core`, `core-fast` | Medium | ~$0.025/query | Cross-referenced data |
| `pro`, `pro-fast` | Slow | ~$0.10/query | Deep research |

### Creating a Pro-tier Function

```sql
-- Create a function that uses the pro-fast processor.
-- Replace YOUR_FUNCTION_URL with the Function URL printed in Step 1.
CREATE OR REPLACE FUNCTION `your-gcp-project.parallel_functions.parallel_enrich_pro`(
    input_data STRING,
    output_columns STRING
)
RETURNS STRING
REMOTE WITH CONNECTION `your-gcp-project.us-central1.parallel-connection`
OPTIONS (
    endpoint = 'YOUR_FUNCTION_URL',
    user_defined_context = [("processor", "pro-fast")]
);
```
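
Since only the function name and the `processor` value change between tiers, the DDL can be templated. The sketch below generates a statement shaped like the one above for any tier in the processor table. `build_processor_function_ddl` is a hypothetical helper. The connection name `parallel-connection` and the parameter types are copied from the example above, not verified against a deployment, and the `parallel_enrich_<tier>` naming is an assumption.

```python
VALID_PROCESSORS = {
    "lite", "lite-fast", "base", "base-fast",
    "core", "core-fast", "pro", "pro-fast",
}


def build_processor_function_ddl(project_id: str, dataset_id: str, region: str,
                                 endpoint: str, processor: str) -> str:
    """Return CREATE FUNCTION DDL for a remote function pinned to one processor."""
    if processor not in VALID_PROCESSORS:
        raise ValueError(f"Unknown processor: {processor!r}")
    # Hyphens are not valid in unquoted function names, so use underscores.
    suffix = processor.replace("-", "_")
    return (
        f"CREATE OR REPLACE FUNCTION `{project_id}.{dataset_id}.parallel_enrich_{suffix}`(\n"
        "    input_data STRING,\n"
        "    output_columns STRING\n"
        ")\n"
        "RETURNS STRING\n"
        f"REMOTE WITH CONNECTION `{project_id}.{region}.parallel-connection`\n"
        "OPTIONS (\n"
        f"    endpoint = '{endpoint}',\n"
        f"    user_defined_context = [(\"processor\", \"{processor}\")]\n"
        ");"
    )


ddl = build_processor_function_ddl(
    "your-gcp-project", "parallel_functions", "us-central1",
    "YOUR_FUNCTION_URL", "core-fast",
)
print(ddl)
```

The resulting string can be passed to `client.query(ddl).result()` once `YOUR_FUNCTION_URL` is replaced with the deployed endpoint.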

## Best Practices

### 1. Be Specific in Column Descriptions

```sql
-- Good - specific descriptions
JSON_ARRAY(
    'CEO name (current CEO or equivalent leader)',
    'Founding year (YYYY format)',
    'Annual revenue (USD, most recent fiscal year)'
)

-- Less specific - may get inconsistent results
JSON_ARRAY('CEO', 'Year', 'Revenue')
```

### 2. Handle Errors

```sql
SELECT
    name,
    CASE 
        WHEN JSON_EXTRACT_SCALAR(enriched, '$.error') IS NOT NULL
        THEN CONCAT('Error: ', JSON_EXTRACT_SCALAR(enriched, '$.error'))
        ELSE JSON_EXTRACT_SCALAR(enriched, '$.ceo_name')
    END as ceo_or_error
FROM ...
```
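
On the client side, the same check can separate rows worth keeping from rows worth retrying. This is a minimal sketch, assuming failures are reported under an `error` key as in the SQL above. `partition_results` is a hypothetical helper and the payloads are made up for illustration.

```python
import json


def partition_results(rows):
    """Split (name, raw_json) pairs into successes and failures.

    A row counts as failed when its payload has a non-null "error" key
    or cannot be parsed as JSON.
    """
    ok, failed = {}, {}
    for name, raw in rows:
        try:
            data = json.loads(raw)
        except (TypeError, ValueError):  # covers None and malformed JSON
            failed[name] = "invalid JSON"
            continue
        if isinstance(data, dict) and data.get("error") is not None:
            failed[name] = str(data["error"])
        else:
            ok[name] = data
    return ok, failed


# Made-up results: one success, one reported error, one empty payload.
results = [
    ("Company A", '{"ceo_name": "Example Person"}'),
    ("Company B", '{"error": "timeout"}'),
    ("Company C", ""),
]
ok, failed = partition_results(results)
print("ok:", ok)
print("retry:", sorted(failed))
```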

### 3. Batch Processing

For large datasets, process in batches to avoid timeouts:

```sql
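-- Process in batches of 100.
-- ORDER BY a unique, stable key so batches neither overlap nor skip rows;
-- `id` is a placeholder for such a column in your table.
SELECT * FROM (
    SELECT *, ROW_NUMBER() OVER (ORDER BY id) AS rn
    FROM your_table
)
WHERE rn BETWEEN 1 AND 100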
```
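
To drive the batches from the notebook, the `BETWEEN` bounds can be generated rather than edited by hand. The generator below is a hypothetical helper, not part of `parallel-web-tools`. Note that row numbers only stay consistent from one batch query to the next if the `ROW_NUMBER()` ordering is deterministic.

```python
def batch_ranges(total_rows: int, batch_size: int = 100):
    """Yield inclusive (start, end) row-number ranges covering 1..total_rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(1, total_rows + 1, batch_size):
        yield start, min(start + batch_size - 1, total_rows)


for start, end in batch_ranges(250, 100):
    print(f"WHERE rn BETWEEN {start} AND {end}")
```

Each printed clause can replace the `WHERE` line of the batch query, with one query submitted per range.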

### 4. Save Enriched Results

```sql
-- Create a table with enriched results
CREATE TABLE `project.dataset.enriched_companies` AS
SELECT
    name,
    parallel_enrich(...) as enriched_data,
    CURRENT_TIMESTAMP() as enriched_at
FROM source_table
```

## Cost Estimation

| Component | Cost |
|-----------|------|
| Cloud Functions | ~$0.40/million invocations + compute |
| BigQuery | Query processing costs |
| Parallel API | $0.005-$0.10 per enrichment |
| Secret Manager | ~$0.06 per active secret version per month + ~$0.03/10,000 accesses |

**Example**: 1,000 company enrichments using `lite-fast`:
- Parallel API: ~$5 (1,000 × ~$0.005)
- GCP infrastructure: <$1
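
The Parallel API portion of that estimate is just rows times the per-query price, which is easy to check before running a large job. The sketch below uses the approximate prices from the Processor Options table and assumes, as that table does, that each `-fast` variant costs the same as its base tier. `estimate_parallel_cost` is a hypothetical helper and does not include GCP infrastructure costs.

```python
# Approximate per-query prices in USD, copied from the Processor Options table.
PRICE_PER_QUERY = {"lite": 0.005, "base": 0.01, "core": 0.025, "pro": 0.10}


def estimate_parallel_cost(rows: int, processor: str = "lite-fast") -> float:
    """Estimate Parallel API cost in USD for enriching `rows` rows."""
    # Assumption: "-fast" variants share the price of their base tier.
    tier = processor[:-5] if processor.endswith("-fast") else processor
    if tier not in PRICE_PER_QUERY:
        raise ValueError(f"Unknown processor: {processor!r}")
    return round(rows * PRICE_PER_QUERY[tier], 2)


print(estimate_parallel_cost(1000, "lite-fast"))
print(estimate_parallel_cost(1000, "pro"))
```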

## Check Deployment Status

In [None]:
from parallel_web_tools.integrations.bigquery import get_deployment_status

status = get_deployment_status(
    project_id=PROJECT_ID,
    region=REGION,
)

if status["function_deployed"]:
    print("✓ Cloud Function deployed")
    print(f"  URL: {status['function_url']}")
else:
    print("✗ Cloud Function not deployed")
    print("  Run the deployment cell above")

## Cleanup

Remove all deployed resources when you're done:

In [None]:
# Uncomment to clean up all resources
# WARNING: This will delete the Cloud Function, connection, dataset, and optionally the secret

# from parallel_web_tools.integrations.bigquery import cleanup_bigquery_integration
#
# cleanup_bigquery_integration(
#     project_id=PROJECT_ID,
#     region=REGION,
#     dataset_id=DATASET_ID,
#     delete_secret=True,  # Also delete API key secret
# )
# print("Cleanup complete!")

Or manually via `gcloud` and `bq`:

```bash
gcloud functions delete parallel-enrich --gen2 --region=us-central1 --project=your-gcp-project --quiet
bq rm --connection --force your-gcp-project.us-central1.parallel-connection
bq rm -r -f your-gcp-project:parallel_functions
gcloud secrets delete parallel-api-key --project=your-gcp-project --quiet
```

## Troubleshooting

### "Permission denied" when calling function

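The BigQuery connection's service account needs the Cloud Functions Invoker and Cloud Run Invoker roles. The commands below look up that service account and grant Cloud Run Invoker, which is the role Gen2 functions require: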

```bash
# Get the connection's service account (requires jq)
CONNECTION_SA=$(bq show --connection your-gcp-project.us-central1.parallel-connection --format=json | jq -r '.cloudResource.serviceAccountId')

# Grant Cloud Run Invoker (required for Gen2 functions)
gcloud run services add-iam-policy-binding parallel-enrich \
    --region=us-central1 \
    --member="serviceAccount:$CONNECTION_SA" \
    --role="roles/run.invoker" \
    --project=your-gcp-project
```

### Function timeout

- Use the `lite-fast` processor for faster results
- Process smaller batches
- Increase function timeout (max 3600s for Gen2)

### View logs

```bash
gcloud functions logs read parallel-enrich --gen2 --region=us-central1 --project=your-gcp-project
```

## Next Steps

- **Scheduled enrichment**: Use BigQuery scheduled queries to enrich new data periodically
- **Data pipelines**: Integrate with Dataform, dbt, or Apache Airflow
- **Custom processors**: Create additional functions with different processor tiers

For more information:
- [BigQuery Setup Guide](../docs/bigquery-setup.md)
- [Parallel Documentation](https://docs.parallel.ai)
- [parallel-web-tools GitHub](https://github.com/parallel-web/parallel-web-tools)