# Parallel Data Enrichment with Polars

This notebook demonstrates how to use the `parallel-web-tools` package to enrich Polars DataFrames using the Parallel API.

## Features

- **DataFrame-native**: Works directly with Polars DataFrames
- **Batch processing**: All rows processed in a single efficient batch
- **Multiple processors**: Trade speed against research depth by picking a processor
- **Error handling**: Graceful handling with detailed error reporting

## Prerequisites

```bash
pip install "parallel-web-tools[polars]"
export PARALLEL_API_KEY="your-api-key"
```

## Setup

In [None]:
# Install dependencies if needed
# !pip install "parallel-web-tools[polars]"

In [None]:
import os

import polars as pl

from parallel_web_tools.integrations.polars import parallel_enrich

print(f"Polars version: {pl.__version__}")

## Authentication

Set your Parallel API key through the `PARALLEL_API_KEY` environment variable, or pass it directly.

In [None]:
from dotenv import load_dotenv

# Load environment variables from .env file (if present)
load_dotenv()

api_key = os.environ.get("PARALLEL_API_KEY")
if api_key:
    print(f"PARALLEL_API_KEY is set ({len(api_key)} chars)")
else:
    print("PARALLEL_API_KEY not found. Create a .env file with:")
    print("  PARALLEL_API_KEY=your-key")
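
For scripts that should stop early instead of failing on the first API call, the check above can be wrapped in a small helper. This is a minimal sketch: `require_api_key` is a hypothetical name and not part of `parallel-web-tools`.

```python
import os
from typing import Mapping, Optional


def require_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return PARALLEL_API_KEY from the environment, or raise if it is missing or blank."""
    # Hypothetical helper: accepts any mapping so it can be tested without touching os.environ.
    env = os.environ if env is None else env
    key = env.get("PARALLEL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("PARALLEL_API_KEY is not set; export it or add it to a .env file")
    return key
```

Calling `require_api_key()` right after `load_dotenv()` surfaces a missing key before any rows are sent.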

## Create Sample Data

In [None]:
# Sample company data
df = pl.DataFrame(
    {
        "company_name": ["Google", "Microsoft", "Apple", "Amazon", "Parallel Web Systems"],
        "website": ["google.com", "microsoft.com", "apple.com", "amazon.com", "parallel.ai"],
        "industry": ["Technology", "Technology", "Technology", "E-commerce", "Technology"],
    }
)

df

## Basic Enrichment

Enrich the DataFrame with company information.

In [None]:
# Enrich with CEO name, founding year, and a brief description
# Note: This will make API calls - may take a few seconds

result = parallel_enrich(
    df.head(2),  # Start with just 2 rows for demo
    input_columns={
        "company_name": "company_name",
        "website": "website",
    },
    output_columns=[
        "CEO name (current CEO or equivalent leader)",
        "Founding year (YYYY format)",
        "Brief company description (1-2 sentences)",
    ],
)

print(f"Success: {result.success_count}, Errors: {result.error_count}")
print(f"Time: {result.elapsed_time:.2f} seconds")
result.result

## Understanding the Result

The `EnrichmentResult` object contains:
- `result`: The enriched DataFrame with new columns
- `success_count`: Number of rows successfully enriched
- `error_count`: Number of rows that failed
- `errors`: List of error details for failed rows
- `elapsed_time`: Total processing time in seconds

In [None]:
# Check for any errors
if result.error_count > 0:
    print("Errors encountered:")
    for error in result.errors:
        print(f"  Row {error['row']}: {error['error']}")
else:
    print("All rows enriched successfully!")
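
When many rows fail, a per-message count is often easier to read than a row-by-row listing. The sketch below uses stand-in data; the `"row"` and `"error"` keys mirror the loop above, but the exact shape of `result.errors` is an assumption, and `summarize_errors` is a hypothetical helper.

```python
from collections import Counter

# Stand-in for result.errors (assumed shape: dicts with "row" and "error" keys).
errors = [
    {"row": 1, "error": "timeout"},
    {"row": 4, "error": "timeout"},
    {"row": 7, "error": "no data found"},
]


def summarize_errors(errors):
    """Count failures per error message, most common first."""
    return Counter(str(e["error"]) for e in errors).most_common()


summary = summarize_errors(errors)
for message, count in summary:
    print(f"{count} x {message}")
```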

## Column Name Mapping

Output column descriptions are automatically converted to valid Python identifiers in snake_case; parenthetical hints are dropped:

| Description | Column Name |
|-------------|-------------|
| `"CEO name"` | `ceo_name` |
| `"Founding year (YYYY)"` | `founding_year` |
| `"Brief company description"` | `brief_company_description` |

In [None]:
# See the column names
print("Original columns:", df.columns)
print("Enriched columns:", result.result.columns)
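
The mapping in the table can be approximated in a few lines of Python. This is a sketch that reproduces the three rows above; the package's actual normalization rule may differ, and `to_column_name` is a hypothetical helper.

```python
import re


def to_column_name(description: str) -> str:
    """Approximate the description-to-column-name conversion shown in the table."""
    # Drop parenthetical hints such as "(YYYY)".
    text = re.sub(r"\([^)]*\)", "", description)
    # Lowercase, then collapse every run of non-alphanumerics into one underscore.
    text = re.sub(r"[^a-z0-9]+", "_", text.lower()).strip("_")
    # Identifiers cannot be empty or start with a digit (the "col_" prefix is an assumption).
    if not text or text[0].isdigit():
        text = f"col_{text}"
    return text


for description in ["CEO name", "Founding year (YYYY)", "Brief company description"]:
    print(description, "->", to_column_name(description))
```

Predicting the name this way helps when writing `select` or `filter` expressions before the enrichment has run, but checking `result.result.columns` remains the reliable source.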

## Working with Enriched Data

The enriched DataFrame works like any other Polars DataFrame.

In [None]:
# Select specific columns
result.result.select(["company_name", "ceo_name", "founding_year"])

In [None]:
# Filter and transform
(
    result.result.filter(pl.col("founding_year").is_not_null())
    .with_columns(pl.col("founding_year").cast(pl.Int64).alias("founded_int"))
    .select(["company_name", "ceo_name", "founded_int"])
)
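
The cast above fails if any `founding_year` value is not a clean integer string, for example `"Founded in 1998"`. Assuming the enriched values come back as text (which the cast implies), a tolerant parser is one way to handle that. `parse_year` is a hypothetical helper, not part of the package.

```python
import re
from typing import Optional


def parse_year(value: Optional[str]) -> Optional[int]:
    """Extract the first standalone four-digit year (1000-2099) from text, or return None."""
    if value is None:
        return None
    match = re.search(r"\b(1[0-9]{3}|20[0-9]{2})\b", str(value))
    return int(match.group(1)) if match else None
```

It could be applied with `pl.col("founding_year").map_elements(parse_year, return_dtype=pl.Int64)`. A simpler alternative is `cast(pl.Int64, strict=False)`, which turns unparseable values into nulls instead of raising.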

## Including Citations (Basis)

You can include the sources used for enrichment by setting `include_basis=True`.

In [None]:
# Get enrichment with citations
result_with_basis = parallel_enrich(
    df.head(1),
    input_columns={"company_name": "company_name"},
    output_columns=["CEO name"],
    include_basis=True,
)

# Access the basis (citations)
for row in result_with_basis.result.iter_rows(named=True):
    print(f"Company: {row['company_name']}")
    print(f"CEO: {row['ceo_name']}")
    print(f"Sources: {row['_basis']}")

## Processor Options

Choose a processor based on your needs:

| Processor | Speed | Cost | Best For |
|-----------|-------|------|----------|
| `lite-fast` | Fastest | Lowest | Basic metadata, high volume |
| `base-fast` | Fast | Low | Standard enrichments |
| `core-fast` | Medium | Medium | Cross-referenced data |
| `pro-fast` | Slow | High | Deep research |

In [None]:
# Use a different processor for more depth
result_detailed = parallel_enrich(
    df.head(1),
    input_columns={"company_name": "company_name"},
    output_columns=[
        "Recent news headline about this company",
        "Stock ticker symbol",
    ],
    processor="base-fast",  # Use base processor for more depth
)

result_detailed.result
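
If the processor is chosen in several places, a small lookup keeps the choice consistent. The sketch below copies the names and use cases from the table above; `PROCESSOR_FOR` and `pick_processor` are hypothetical, and falling back to `lite-fast` is a design choice made here, not documented package behavior.

```python
# Processor names and use cases copied from the table above.
PROCESSOR_FOR = {
    "basic metadata": "lite-fast",
    "standard enrichment": "base-fast",
    "cross-referenced data": "core-fast",
    "deep research": "pro-fast",
}


def pick_processor(use_case: str, default: str = "lite-fast") -> str:
    """Map a use-case label to a processor name, falling back to the cheapest tier."""
    return PROCESSOR_FOR.get(use_case.strip().lower(), default)
```

The returned string can then be passed as `processor=pick_processor("deep research")`.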

## Large Dataset Processing

For large datasets, consider processing in batches to manage API costs and timeouts.

In [None]:
def enrich_in_batches(
    df: pl.DataFrame, input_columns: dict, output_columns: list, batch_size: int = 50, **kwargs
) -> pl.DataFrame:
    """Process a large DataFrame in batches."""
    results = []
    total_success = 0
    total_errors = 0

    for i in range(0, len(df), batch_size):
        batch = df.slice(i, batch_size)
        print(f"Processing rows {i} to {i + len(batch) - 1}...")

        result = parallel_enrich(batch, input_columns=input_columns, output_columns=output_columns, **kwargs)

        results.append(result.result)
        total_success += result.success_count
        total_errors += result.error_count

    print(f"\nTotal: {total_success} success, {total_errors} errors")
    return pl.concat(results)


# Example usage (commented out to avoid API calls)
# large_df = pl.DataFrame({"company": ["Company " + str(i) for i in range(100)]})
# enriched = enrich_in_batches(
#     large_df,
#     input_columns={"company_name": "company"},
#     output_columns=["CEO name"],
#     batch_size=25
# )
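
Since `enrich_in_batches` issues one `parallel_enrich` call per batch, it can help to see the batch boundaries before spending API calls. This sketch computes the `(offset, length)` pairs that the `range` and `df.slice` loop above would produce; `batch_bounds` is a hypothetical helper.

```python
from typing import List, Tuple


def batch_bounds(n_rows: int, batch_size: int) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs covering n_rows, matching df.slice(offset, length)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    return [(start, min(batch_size, n_rows - start)) for start in range(0, n_rows, batch_size)]


print(batch_bounds(100, 25))  # [(0, 25), (25, 25), (50, 25), (75, 25)]
print(batch_bounds(10, 4))    # [(0, 4), (4, 4), (8, 2)]
```

The length of the returned list is the number of API calls a given `batch_size` implies.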

## Error Handling

A failure on one row does not stop the rest of the batch. Failed rows have null (`None`) values in the enriched columns.

In [None]:
# Example with potential errors
df_with_issues = pl.DataFrame(
    {
        "company_name": ["Google", "NonexistentCompanyXYZ123"],
    }
)

result = parallel_enrich(
    df_with_issues,
    input_columns={"company_name": "company_name"},
    output_columns=["CEO name"],
)

print(f"Success: {result.success_count}, Errors: {result.error_count}")

# Filter to only successful rows
successful_df = result.result.filter(pl.col("ceo_name").is_not_null())
successful_df

## Best Practices

### 1. Be Specific in Descriptions

```python
# Good - specific descriptions
output_columns = [
    "CEO name (current CEO or equivalent leader)",
    "Founding year (YYYY format)",
    "Annual revenue (USD, most recent fiscal year)",
]

# Less specific - may get inconsistent results
output_columns = ["CEO", "Year", "Revenue"]
```

### 2. Use Appropriate Processors

- `lite-fast`: Basic metadata, high volume (cheapest)
- `base-fast`: Standard company information
- `core-fast`: Cross-referenced data
- `pro-fast`: Deep research requiring multiple sources

### 3. Handle Errors Gracefully

```python
import logging

logger = logging.getLogger(__name__)
result = parallel_enrich(df, ...)
if result.error_count > 0:
    logger.warning("%d rows failed", result.error_count)
```

### 4. Consider Batch Sizes

For very large datasets (1000+ rows), process in batches to:
- Avoid timeout issues
- Get partial results faster
- Contain a failure to one batch instead of losing the whole run
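
The points above can be sketched as a loop that keeps completed batches even when a later call raises (for example on a timeout). This is an illustration under assumptions: `run_batches` is a hypothetical helper, and `enrich` stands for any callable that processes one batch, such as a wrapper around `parallel_enrich`.

```python
from typing import Callable, Iterable, List, Tuple, TypeVar

B = TypeVar("B")
R = TypeVar("R")


def run_batches(
    batches: Iterable[B], enrich: Callable[[B], R]
) -> Tuple[List[R], List[Tuple[int, Exception]]]:
    """Run enrich on each batch; keep successful results and record failures by batch index."""
    results: List[R] = []
    failures: List[Tuple[int, Exception]] = []
    for index, batch in enumerate(batches):
        try:
            results.append(enrich(batch))
        except Exception as exc:  # broad on purpose: one bad batch should not lose the rest
            failures.append((index, exc))
    return results, failures
```

The `failures` list gives the indices of batches to retry, while `results` already holds everything that completed.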

## Next Steps

- See the [Polars Setup Guide](../docs/polars-setup.md) for more details
- Check [Parallel Documentation](https://docs.parallel.ai) for API information
- View [parallel-web-tools on GitHub](https://github.com/parallel-web/parallel-web-tools)