# Receipt Data Extraction: Extracting Structured Data from Ad-Campaign Receipts

## Overview
This notebook demonstrates how to use Snowflake Cortex AI functions (`AI_PARSE_DOCUMENT` and `AI_COMPLETE`) to automatically extract structured information from ad-campaign receipt PDFs. We'll process receipts uploaded to our Snowflake stage and transform them into queryable structured data.

## What We'll Accomplish
- Parse PDF receipts from the `@RECEIPTS_PROCESSING_DB.RAW.RECEIPTS` stage using `AI_PARSE_DOCUMENT`
- Extract structured receipt data (vendor, transaction details, campaign info, metrics) using `AI_COMPLETE` with a defined schema
- Transform unstructured receipt content into structured, analyzable data

## Prerequisites
- Access to Snowflake with Cortex AI features enabled
- Receipts uploaded to `@RECEIPTS_PROCESSING_DB.RAW.RECEIPTS` stage
- RECEIPTS_PROCESSING_DB database and RAW schema configured
- Appropriate permissions for the ETL service role


In [None]:
# Import python packages
import streamlit as st
import pandas as pd

# Use Snowpark for our analyses
from snowflake.snowpark.context import get_active_session
session = get_active_session()


## Step 1: Environment Setup and Session Initialization

Setting up our environment by importing necessary packages and establishing a Snowflake session.

### Key Components:
- **Streamlit**: For building interactive applications
- **Pandas**: For data manipulation and analysis
- **Snowpark Session**: Connection to Snowflake and access to Cortex AI capabilities

`get_active_session()` returns the notebook's active Snowflake session, which we use to execute SQL and call the AI functions.


In [None]:
session.sql("USE ROLE SYSADMIN").collect()

# Set warehouse for AI processing
session.sql("USE WAREHOUSE RECEIPTS_PARSE_COMPLETE_WH").collect()

session.sql("ALTER WAREHOUSE RECEIPTS_PARSE_COMPLETE_WH SET WAREHOUSE_SIZE='XSMALL'").collect()

session.use_database('RECEIPTS_PROCESSING_DB')
session.use_schema('RAW')


## Step 2: Set Database and Schema Context

Setting the working context for our session:

- **Role**: `SYSADMIN`
- **Warehouse**: `RECEIPTS_PARSE_COMPLETE_WH`, resized to `XSMALL`
- **Database**: `RECEIPTS_PROCESSING_DB` - Our receipt processing database
- **Schema**: `RAW` - The schema containing our receipts stage

This ensures all operations execute within the correct context without needing to fully qualify object names.


In [None]:
from pydantic import BaseModel, Field
from typing import List, Optional
import json
#from snowflake.cortex import complete, CompleteOptions
#from snowflake.snowpark.functions import col, prompt, ai_complete


## Step 3: Import AI and Processing Libraries

Importing specialized libraries for document AI processing:

### Key Imports:
- **Pydantic**: For data validation and schema definition
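- **Snowflake Cortex** (commented out): Python access to AI functions (`complete`, `CompleteOptions`); not needed here because the extraction below runs through SQL
- **Snowpark Functions** (commented out): Column helpers including `ai_complete` and `prompt`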
- **JSON**: For handling structured data
- **Typing**: For type hints (List, Optional)

These libraries enable us to process receipt documents and extract structured information using large language models.


In [None]:
--REMOVE @RECEIPTS_PROCESSING_DB.RAW.RECEIPTS; -- REMOVES ALL FILES FROM THE STAGE
ALTER STAGE RECEIPTS_PROCESSING_DB.RAW.RECEIPTS REFRESH;
SELECT * FROM DIRECTORY(@RECEIPTS_PROCESSING_DB.RAW.RECEIPTS);


## Step 4: Explore Available Receipts

Before processing, let's see what receipt files are available in our stage.

### What This Shows:
- File names and paths of receipts ready for processing
- File sizes and metadata
- Upload timestamps

The `DIRECTORY()` function provides a view of all files in the `@RECEIPTS_PROCESSING_DB.RAW.RECEIPTS` stage, essential for understanding our data source.


In [None]:
# Probably unnecessary to change the warehouse size, as AI_PARSE_DOCUMENT runs on separate AI services infrastructure
session.sql("ALTER WAREHOUSE RECEIPTS_PARSE_COMPLETE_WH SET WAREHOUSE_SIZE='MEDIUM'").collect()

# Create parsed_receipts table if it doesn't exist
session.sql("""
CREATE TABLE IF NOT EXISTS parsed_receipts (
    relative_path STRING,
    content STRING
)
""").collect()

# Only parse documents that haven't been parsed yet
insert_result = session.sql("""
INSERT INTO parsed_receipts
SELECT
    relative_path,
    AI_PARSE_DOCUMENT(
        to_file('@RECEIPTS_PROCESSING_DB.RAW.RECEIPTS', relative_path), 
        {'mode': 'layout'}
    ):content AS content
FROM DIRECTORY(@RECEIPTS_PROCESSING_DB.RAW.RECEIPTS)
WHERE relative_path NOT IN (SELECT relative_path FROM parsed_receipts)
""").collect()

# An INSERT returns a single row whose first column is the number of rows inserted,
# so len(insert_result) would always be 1
new_count = insert_result[0][0]
print(f"✓ Parsed {new_count} new receipt(s)")

session.sql("ALTER WAREHOUSE RECEIPTS_PARSE_COMPLETE_WH SET WAREHOUSE_SIZE='XSMALL'").collect()

## Step 5: Parse Receipt PDFs with AI_PARSE_DOCUMENT (Incremental)

Using Snowflake's AI to extract text content from receipt PDFs - only parsing new documents.

### What's Happening:
1. **Create Table If Not Exists**: Creates `parsed_receipts` table on first run (not transient)
2. **Incremental Processing**: Only parses documents NOT already in `parsed_receipts` table
3. **AI_PARSE_DOCUMENT**: This function:
   - Reads PDF files from `@RECEIPTS_PROCESSING_DB.RAW.RECEIPTS` stage
   - Uses `'layout'` mode to preserve receipt structure
   - Extracts text content including vendor info, line items, amounts, campaign details
4. **INSERT Results**: Adds only new parsed content to existing table

### Benefits of Incremental Processing:
- ✅ Avoids re-parsing already processed documents (saves time and costs)
- ✅ Preserves existing parsed data
- ✅ Only processes new receipts uploaded to stage
- ✅ Can run repeatedly without duplicating work

### Why Layout Mode?
Layout mode preserves the receipt's visual structure (headers, tables, campaign details section), which helps the AI understand:
- Vendor branding and headers
- Line item tables with services and amounts
- Campaign details section (display/video formats, metrics)
- Total amounts and tax calculations
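
The incremental pattern described above can be tried locally without Snowflake. The following is a minimal sketch using Python's built-in `sqlite3`: the `staged_files` table is a hypothetical stand-in for `DIRECTORY(@stage)`, and `UPPER(...)` is a placeholder for the `AI_PARSE_DOCUMENT` call. It only illustrates why re-running the `INSERT ... WHERE ... NOT IN` statement does not duplicate work.

```python
import sqlite3

# Hypothetical stand-ins: staged_files plays the role of DIRECTORY(@stage),
# and UPPER(...) is a placeholder for the AI_PARSE_DOCUMENT call.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staged_files (relative_path TEXT)")
conn.execute("CREATE TABLE IF NOT EXISTS parsed_receipts (relative_path TEXT, content TEXT)")


def count_parsed(conn):
    return conn.execute("SELECT COUNT(*) FROM parsed_receipts").fetchone()[0]


def parse_new_receipts(conn):
    """Insert only staged files not yet in parsed_receipts; return how many were added."""
    before = count_parsed(conn)
    conn.execute(
        "INSERT INTO parsed_receipts "
        "SELECT relative_path, UPPER(relative_path) "
        "FROM staged_files "
        "WHERE relative_path NOT IN (SELECT relative_path FROM parsed_receipts)"
    )
    return count_parsed(conn) - before


conn.executemany("INSERT INTO staged_files VALUES (?)", [("a.pdf",), ("b.pdf",)])
first_run = parse_new_receipts(conn)   # both files are new
second_run = parse_new_receipts(conn)  # nothing new, so nothing is inserted

conn.execute("INSERT INTO staged_files VALUES ('c.pdf')")
third_run = parse_new_receipts(conn)   # only the newly staged file

print(first_run, second_run, third_run)
```

One caveat that applies to the real query as well: `NOT IN` matches nothing if the subquery returns a NULL `relative_path`, so the pattern relies on that column never being NULL.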


In [None]:
#session.table('parsed_receipts').to_pandas()
session.table('parsed_receipts').to_pandas().head()


## Step 6: Preview Parsed Receipt Content

Examining the parsed receipt data:

- **relative_path**: Original receipt filename (e.g., `receipt_TechAds_Pro_20251020.pdf`)
- **content**: Extracted text from the receipt PDF

This verification ensures successful parsing and shows the text quality for extraction.


In [None]:
prompt_text = """
    CONCAT($$Analyze the ad-campaign receipt document provided and extract the following structured information:
    
    1. Vendor/Provider details (name, contact info)
    2. Transaction information (ID, date, time, payment method)
    3. Customer/Client information (name, company)
    4. Campaign details (name, content types, ad formats, period start date, period end date)
    5. Financial details (line items, subtotal, tax, total)
    6. Campaign metrics (CPM, CTR, bounce rate %, targets)
    7. Targeting information (geography, demographics, devices)
    
    <document-content>$$, {0}, $$
    </document-content>

    <output-format>
    Provide JSON output in the exact format specified in the response schema.
    For any fields not found in the document, use an empty string.
    For numeric fields, use numbers (not strings).
    For percentages, use whole numbers.
    For arrays, provide empty arrays [] if no data is found.
    </output-format>$$)
"""


## Step 7: Define Receipt Extraction Prompt

Creating the instruction prompt to guide AI extraction from receipt documents.

### Prompt Structure:
- **Task Definition**: Clear instructions to analyze receipts and extract specific fields
- **Document Content Placeholder**: `{0}` will be replaced with actual receipt text
- **Output Format**: Explicit JSON structure requirements

### Information to Extract:
1. **Vendor Details**: Name, branding, contact information
2. **Transaction Info**: Receipt ID, date, time, payment method
3. **Customer Info**: Client name and company
4. **Campaign Details**: Name, content types, ad formats, campaign period (start and end dates)
5. **Financials**: Line items, subtotal, tax, total amount
6. **Metrics**: CPM, CTR, bounce rate, target impressions/clicks
7. **Targeting**: Geographic, demographic, device targeting

This prompt instructs the AI model on exactly what data to extract and how to format the response.
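
It may help to see what the `{0}` placeholder turns into. The sketch below uses a shortened, hypothetical template rather than the full prompt. Note that `str.format` inserts the column *name* `content`, not the receipt text itself; the per-row substitution happens later, when Snowflake evaluates the `CONCAT(...)` expression.

```python
# Shortened, hypothetical version of the notebook's prompt template.
template = """
    CONCAT($$Extract the receipt fields.
    <document-content>$$, {0}, $$
    </document-content>$$)
"""

# str.format replaces {0} with the literal text 'content', which SQL then
# reads as a reference to the content column of parsed_receipts.
prompt_sql = template.format("content")
print(prompt_sql)
```

Because the template goes through `str.format`, any literal `{` or `}` added to the prompt text would need to be doubled (`{{`, `}}`).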


## Step 8: Define Response Schema for Receipt Extraction

Defining the exact structure for extracted receipt data with all receipt-specific fields including vendor, transaction, campaign details, metrics (CPM, CTR, Bounce Rate), budget, targeting, and line items.
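
As a rough illustration of what the `required` list at the top level of the schema implies, the sketch below checks a result for missing sections. The helper `missing_sections` and the sample record are hypothetical and not part of the notebook; the section names are taken from the schema.

```python
# Top-level sections the schema marks as required.
REQUIRED_SECTIONS = [
    "vendor", "transaction", "customer", "campaign",
    "financials", "metrics", "budget", "targeting",
]


def missing_sections(extracted: dict) -> list:
    """Return the required top-level sections absent from an extraction result."""
    return [name for name in REQUIRED_SECTIONS if name not in extracted]


# Hypothetical, deliberately incomplete result used only for illustration.
sample = {"vendor": {"vendor_name": "Example Vendor"}, "transaction": {}}
print(missing_sections(sample))
```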


In [None]:
resp_schema = """
{
    'type': 'json',
    'schema': {
        'type': 'object',
        'properties': {
            'vendor': {
                'type': 'object',
                'properties': {
                    'vendor_name': {'type': 'string'}
                },
                'required': ['vendor_name']
            },
            'transaction': {
                'type': 'object',
                'properties': {
                    'receipt_id': {'type': 'string'},
                    'date': {'type': 'string'},
                    'payment_method': {'type': 'string'}
                },
                'required': ['receipt_id', 'date', 'payment_method']
            },
            'customer': {
                'type': 'object',
                'properties': {
                    'customer_name': {'type': 'string'},
                    'company_name': {'type': 'string'}
                },
                'required': ['company_name']
            },
            'campaign': {
                'type': 'object',
                'properties': {
                    'name': {'type': 'string'},
                    'content_types': {'type': 'string'},
                    'ad_formats': {'type': 'array'},
                    'period_startdate': {'type': 'string'},
                    'period_enddate': {'type': 'string'},
                    'budget': {'type':'number'}
                },
                'required': ['name', 'content_types', 'ad_formats', 'period_startdate', 'period_enddate']
            },
            'financials': {
                'type': 'object',
                'properties': {
                    'line_items': {'type': 'object'},
                    'subtotal': {'type': 'number'},
                    'tax': {'type': 'number'},
                    'total': {'type': 'number'}
                },
                'required': ['total', 'line_items', 'subtotal', 'tax']
            },
            'metrics': {
                'type': 'object',
                'properties': {
                    'cpm': {'type': 'string', 'description':'Cost per mille (cost per thousand impressions), abbreviated as CPM'},
                    'ctr': {'type': 'string', 'description':'Click-through rate, abbreviated as CTR'},
                    'bounce_rate': {'type': 'string', 'description':'Bounce Rate, sometimes just referred to as Bounce, a % value'},
                    'targets': {'type': 'object'},
                    'pricing_model': {'type': 'string'}
                },
                'required': ['cpm', 'ctr', 'bounce_rate', 'targets', 'pricing_model']
            },
            'budget':{
                'type': 'object',
                'properties': {
                    'daily_budget': {'type': 'string'},
                    'total_budget': {'type': 'string'}
                },
                'required': ['daily_budget', 'total_budget']
            },
            'targeting': {
                'type': 'object',
                'properties': {
                    'geography': {'type': 'array'},
                    'demographics': {'type': 'string'},
                    'age_range': {'type': 'string'},
                    'devices': {'type': 'string'}
                },
                'required': ['geography', 'demographics', 'age_range', 'devices']
            }
        },
        'required': ['vendor', 'transaction', 'customer', 'campaign', 'financials', 'metrics', 'budget', 'targeting']
    }
}
"""


In [None]:
# Create extracted_receipt_data table if it doesn't exist
session.sql("""
CREATE TABLE IF NOT EXISTS extracted_receipt_data (
    relative_path STRING,
    content STRING,
    extracted_data VARIANT
)
""").collect()

# Only extract data from newly parsed receipts
query = f"""
INSERT INTO extracted_receipt_data
SELECT
    relative_path,
    content,
    ai_complete(
        model=>'claude-sonnet-4-5',
        prompt=>{prompt_text.format('content')},
        response_format=>{resp_schema}
    ) as extracted_data
FROM parsed_receipts
WHERE relative_path NOT IN (SELECT relative_path FROM extracted_receipt_data)
"""

In [None]:
session.sql(query).collect()

## Step 9: Extract Structured Data with AI_COMPLETE (Incremental)

Using Snowflake's AI to extract structured receipt data - only processing newly parsed receipts.

### Incremental Extraction:
- **CREATE TABLE IF NOT EXISTS**: Preserves existing extracted data
- **INSERT INTO**: Adds only new extractions
- **WHERE NOT IN**: Only processes receipts not already in extracted_receipt_data
- **Saves Costs**: Avoids re-running expensive AI_COMPLETE on same receipts

The AI reads each new receipt and extracts vendor details, transaction info, campaign details (display/video formats), financial totals, performance metrics (CPM, CTR, Bounce Rate), targeting parameters, and line items into structured JSON.


In [None]:
# View the extracted data
result_df = session.table('extracted_receipt_data').to_pandas()
result_df.head()


## Step 10: Preview Extracted Receipt Data

Each row contains a complete structured representation of a receipt with all extracted fields in JSON format, ready for flattening and analysis.
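
To inspect a single row in Python, the JSON can be decoded with the standard library. This is a sketch under the assumption that the `VARIANT` column arrives in pandas as JSON text (worth verifying in your environment); the cell value below is a hypothetical example, not real output.

```python
import json

# Hypothetical cell value; real values would come from result_df["EXTRACTED_DATA"].
raw_cell = '{"vendor": {"vendor_name": "Example Vendor"}, "financials": {"total": 1250.5}}'

record = json.loads(raw_cell)  # JSON text -> nested dicts
vendor_name = record["vendor"]["vendor_name"]
total = record["financials"]["total"]
print(vendor_name, total)
```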


In [None]:
# Parse and flatten the JSON data for analysis
flattened_df = session.sql("""
SELECT
    relative_path,
    extracted_data:vendor.vendor_name::STRING AS vendor_name,
    extracted_data:transaction.receipt_id::STRING AS receipt_id,
    extracted_data:transaction.date::DATE AS transaction_date,
    extracted_data:transaction.payment_method::STRING AS payment_method,
    extracted_data:customer.company_name::STRING AS company_name,
    extracted_data:campaign.name::STRING AS campaign_name,
    extracted_data:campaign.period_startdate::DATE AS period_startdate,
    extracted_data:campaign.period_enddate::DATE AS period_enddate,
    extracted_data:campaign.content_types::STRING AS content_types,
    extracted_data:financials.total::NUMBER(12,2) AS total_amount,
    extracted_data:metrics.cpm::STRING AS cpm,
    extracted_data:metrics.ctr::STRING AS ctr,
    extracted_data:metrics.bounce_rate::STRING AS bounce_rate,
    extracted_data:metrics.pricing_model::STRING AS pricing_model,
    extracted_data:budget.daily_budget::STRING AS daily_budget,
    extracted_data:budget.total_budget::STRING AS total_budget
FROM extracted_receipt_data
""").to_pandas()

flattened_df.head(10)


## Step 11: Flatten and Query Extracted Data

Converting nested JSON into a flat table showing key receipt fields including vendor, transaction date, campaign details, total amount, and performance metrics (CPM, CTR, Bounce Rate).
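
The path syntax used in the query (for example `extracted_data:vendor.vendor_name`) walks the nested JSON one key at a time and yields NULL when a key is missing. A rough pure-Python analogue, with a hypothetical helper `get_path` and a made-up record, looks like this:

```python
def get_path(data: dict, path: str):
    """Follow a dotted path through nested dicts; return None if any key is missing."""
    current = data
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical record used only for illustration.
record = {
    "vendor": {"vendor_name": "Example Vendor"},
    "metrics": {"cpm": "$4.50"},
}

row = {
    "vendor_name": get_path(record, "vendor.vendor_name"),
    "cpm": get_path(record, "metrics.cpm"),
    "daily_budget": get_path(record, "budget.daily_budget"),  # missing section
}
print(row)
```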


In [None]:
# Create a view with flattened receipt data (skipped if the view already exists)
# Strip non-numeric characters ($, %, commas) before converting to numbers
session.sql("""
CREATE VIEW IF NOT EXISTS receipt_analytics_vw AS
SELECT
    relative_path AS receipt_filename,
    extracted_data:vendor.vendor_name::STRING AS vendor_name,
    extracted_data:transaction.receipt_id::STRING AS receipt_id,
    TRY_TO_DATE(extracted_data:transaction.date::STRING) AS transaction_date,
    extracted_data:transaction.payment_method::STRING AS payment_method,
    extracted_data:customer.company_name::STRING AS company_name,
    extracted_data:customer.customer_name::STRING AS customer_name,
    extracted_data:campaign.name::STRING AS campaign_name,
    TRY_TO_DATE(extracted_data:campaign.period_startdate::STRING) AS period_startdate,
    TRY_TO_DATE(extracted_data:campaign.period_enddate::STRING) AS period_enddate,
    extracted_data:campaign.content_types::STRING AS content_types,
    -- Strip $ and , from financial values and preserve decimals
    TRY_TO_DECIMAL(REPLACE(REPLACE(extracted_data:financials.subtotal::STRING, '$', ''), ',', ''), 10, 2) AS subtotal,
    TRY_TO_DECIMAL(REPLACE(REPLACE(extracted_data:financials.tax::STRING, '$', ''), ',', ''), 10, 2) AS tax,
    TRY_TO_DECIMAL(REPLACE(REPLACE(extracted_data:financials.total::STRING, '$', ''), ',', ''), 10, 2) AS total_amount,
    -- Strip $ from CPM and preserve decimals
    TRY_TO_DECIMAL(REPLACE(extracted_data:metrics.cpm::STRING, '$', ''), 10, 2) AS cpm,
    -- Strip % from CTR and Bounce Rate and preserve decimals
    TRY_TO_DECIMAL(REPLACE(extracted_data:metrics.ctr::STRING, '%', ''), 10, 2) AS ctr_percent,
    TRY_TO_DECIMAL(REPLACE(extracted_data:metrics.bounce_rate::STRING, '%', ''), 10, 2) AS bounce_rate_percent,
    extracted_data:metrics.pricing_model::STRING AS pricing_model,
    -- Strip $ and , from budget values and preserve decimals
    TRY_TO_DECIMAL(REPLACE(REPLACE(extracted_data:budget.daily_budget::STRING, '$', ''), ',', ''), 10, 2) AS daily_budget,
    TRY_TO_DECIMAL(REPLACE(REPLACE(extracted_data:budget.total_budget::STRING, '$', ''), ',', ''), 10, 2) AS campaign_budget,
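    -- Frequency cap stays as string (not numeric); it is not defined in the response
    -- schema above, so this column is typically NULL unless the schema is extended
    extracted_data:targeting.frequency_cap::STRING AS frequency_cap,
    extracted_data:targeting.age_range::STRING AS age_range,
    -- In a view this is evaluated at query time, not when the receipt was processed
    CURRENT_TIMESTAMP() AS processed_at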
FROM extracted_receipt_data
""").collect()

print("✓ receipt_analytics_vw view created successfully!")


In [None]:
select * from receipt_analytics_vw limit 10;

## Step 12: Create Analytics View

Creating a flattened view, `receipt_analytics_vw`, for receipt analytics with proper type conversion and descriptive column names. Because it is a view rather than a table, `processed_at` reflects the time of each query. Ready for dashboards and reporting!
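
The type conversion follows a "strip symbols, then try to convert" pattern: `REPLACE` removes `$`, `,`, or `%`, and `TRY_TO_DECIMAL` returns NULL instead of failing on values it cannot parse. The sketch below is an approximate Python analogue; the helper `try_to_decimal` is hypothetical, and unlike the SQL it strips all three symbols from every value for brevity.

```python
from decimal import Decimal, InvalidOperation, ROUND_HALF_UP


def try_to_decimal(value, strip="$,%"):
    """Remove the given symbols, then convert to a 2-decimal Decimal; None on failure."""
    if value is None:
        return None
    cleaned = str(value)
    for symbol in strip:
        cleaned = cleaned.replace(symbol, "")
    try:
        return Decimal(cleaned.strip()).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
    except InvalidOperation:
        # Mirrors TRY_TO_DECIMAL returning NULL for unparseable input
        return None


print(try_to_decimal("$1,234.50"))
print(try_to_decimal("2.35%"))
print(try_to_decimal("N/A"))
```

Returning `None` (NULL in SQL) rather than raising keeps one malformed receipt from breaking the whole view, at the cost of silently dropping that value from aggregates such as `AVG`.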


In [None]:
# Example analytics: Spending by vendor
session.sql("""
SELECT 
    vendor_name,
    COUNT(*) AS receipt_count,
    SUM(total_amount) AS total_spending,
    AVG(total_amount) AS avg_receipt_amount,
    AVG(cpm) AS avg_cpm,
    AVG(ctr_percent) AS avg_ctr
FROM receipt_analytics_vw
GROUP BY vendor_name
ORDER BY total_spending DESC
""").to_pandas()


## Step 13: Analyze Spending by Vendor

Analyzing receipt count, total spending, average amounts, and performance metrics (CPM, CTR) by vendor to identify top advertising partners and their performance.


In [None]:
# Campaign type analysis
session.sql("""
SELECT 
    content_types,
    COUNT(*) AS campaign_count,
    AVG(total_amount) AS avg_spending,
    AVG(cpm) AS avg_cpm,
    AVG(ctr_percent) AS avg_ctr,
    AVG(bounce_rate_percent) AS avg_bounce_rate
FROM receipt_analytics_vw
WHERE content_types IS NOT NULL
GROUP BY content_types
ORDER BY campaign_count DESC
""").to_pandas()


## Step 14: Analyze by Campaign Content Type

Comparing performance between Display, Video, and mixed campaigns to optimize content strategy and budget allocation.


In [None]:
# Performance metrics by pricing model
session.sql("""
SELECT 
    pricing_model,
    COUNT(*) AS receipt_count,
    AVG(cpm) AS avg_cpm,
    AVG(ctr_percent) AS avg_ctr,
    AVG(bounce_rate_percent) AS avg_bounce_rate,
    SUM(total_amount) AS total_spending
FROM receipt_analytics_vw
WHERE pricing_model IS NOT NULL
GROUP BY pricing_model
ORDER BY receipt_count DESC
""").to_pandas()


## Step 15: Analyze by Pricing Model

Understanding performance across different pricing models (CPM, CPC, CPA, CPV, Flat Rate) to determine which delivers the best ROI.


## Summary

### What We've Accomplished:

1. ✅ **Parsed Receipts**: Extracted text from PDF receipts using AI_PARSE_DOCUMENT
2. ✅ **Structured Extraction**: Converted unstructured receipts to structured JSON using AI_COMPLETE
3. ✅ **Analytics View**: Created the `receipt_analytics_vw` view with flattened, queryable data
4. ✅ **Generated Insights**: Analyzed spending, performance, and campaign metrics

### Key Metrics Captured:
- **Financial**: Subtotal, tax, total amounts
- **Performance**: CPM, CTR, Bounce Rate
- **Campaign**: Display formats, video placements, targeting
- **Budget**: Daily and total campaign budgets

### Next Steps:
- Build dashboards on the `receipt_analytics_vw` view
- Create automated alerts for unusual spending
- Analyze trends over time
- Compare vendor performance
- Optimize campaign strategies based on metrics

### Objects Created:
1. `parsed_receipts` - Raw parsed text from PDFs (table)
2. `extracted_receipt_data` - Structured JSON extraction (table)
3. `receipt_analytics_vw` - Flattened, analytics-ready view

---

**Your receipt data is now structured, queryable, and ready for analysis!** 📊
