# 🧬 GWAS Intelligence Pipeline - Standalone Notebook

This notebook is a **complete, standalone** pipeline for extracting genomic trait data from research papers using Snowflake Cortex AI and multimodal RAG.

## What This Notebook Does

1. **Database Setup** - Creates GWAS database, schemas, stages, and tables
2. **PDF Processing** - Parses PDFs using Cortex AI
3. **Embedding Generation** - Creates text and image embeddings
4. **Trait Extraction** - Extracts GWAS traits using multimodal RAG
5. **Analytics** - Provides extracted trait analytics

## Prerequisites

- Snowflake account with Cortex AI access
- CREATE DATABASE privileges
- Warehouse for compute
- `.env` file with credentials (see below)

## Quick Start

1. Configure `.env` file with your Snowflake credentials
2. Upload a PDF to the stage (instructions in notebook)
3. Run all cells in order

---


# ============================================================================
# IMPORTS - Standalone Notebook Version
# ============================================================================
# This notebook is self-contained - all logic is inline, no external modules!

# Standard library imports
import os
import sys
from pathlib import Path
import json
from datetime import datetime
import tempfile
import shutil

# Third-party imports
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm

# PDF processing
import fitz  # PyMuPDF - for PDF to PNG conversion

# Snowflake imports
from snowflake.snowpark import Session
from dotenv import load_dotenv

# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

# Project root
project_root = Path().absolute()

print("✅ Imports successful!")
print(f"   Project root: {project_root}")
print(f"   Python: {sys.version.split()[0]}")
print(f"\n📦 Key packages loaded:")
print(f"   • snowflake.snowpark")
print(f"   • pandas {pd.__version__}")
print(f"   • numpy {np.__version__}")
print(f"   • PyMuPDF (fitz) {fitz.__version__}")
print(f"\n🎯 This is a standalone notebook - no external files required!")

In [1]:
# ============================================================================
# LOAD ENVIRONMENT VARIABLES (for local development only)
# ============================================================================
from dotenv import load_dotenv
from pathlib import Path
import os

# Detect if we're running in Snowflake Notebooks
try:
    from snowflake.snowpark.context import get_active_session
    _ = get_active_session()
    IN_SNOWFLAKE = True
    print("🏔️ Running in Snowflake Notebook (Container Runtime)")
    print("   Environment variables managed by Snowflake")
except:
    IN_SNOWFLAKE = False
    print("💻 Running locally - loading credentials from .env")
    
    # Get the directory where this notebook is located
    notebook_dir = Path().absolute()
    
    # Load .env file from the notebook's directory
    env_path = notebook_dir / '.env'
    
    if env_path.exists():
        load_dotenv(dotenv_path=env_path)
        print(f"✅ Environment variables loaded from: {env_path}")
        print(f"\n📋 Credentials Status:")
        print(f"   SNOWFLAKE_ACCOUNT: {'✓ ' + os.environ.get('SNOWFLAKE_ACCOUNT', '') if os.environ.get('SNOWFLAKE_ACCOUNT') else '✗ Missing'}")
        print(f"   SNOWFLAKE_USER: {'✓ ' + os.environ.get('SNOWFLAKE_USER', '') if os.environ.get('SNOWFLAKE_USER') else '✗ Missing'}")
        print(f"   SNOWFLAKE_PASSWORD: {'✓ Set' if os.environ.get('SNOWFLAKE_PASSWORD') else '✗ Missing'}")
        print(f"   SNOWFLAKE_WAREHOUSE: {os.environ.get('SNOWFLAKE_WAREHOUSE', f'Not set (will use {WAREHOUSE_NAME})')}")
    else:
        print(f"❌ .env file not found at: {env_path}")
        print(f"\n💡 Create a .env file with:")
        print("   SNOWFLAKE_ACCOUNT=your-account")
        print("   SNOWFLAKE_USER=your-username")
        print("   SNOWFLAKE_PASSWORD=your-password")
        print("   SNOWFLAKE_WAREHOUSE=your-warehouse")

📋 Configuration:
   Warehouse: DEMO_JGH
   Database: GWAS
   Schemas: PDF_RAW, PDF_PROCESSING

✅ Configuration set!


## 🗄️ Step 1: Database & Schema Setup

Create the GWAS database and required schemas.


In [4]:
# ============================================================================
# CONNECT TO SNOWFLAKE (works in both local and Snowflake Notebooks)
# ============================================================================
from snowflake.snowpark import Session
import os

try:
    # ========================================================================
    # METHOD 1: Try to use active session (Snowflake Notebooks / Container Runtime)
    # ========================================================================
    from snowflake.snowpark.context import get_active_session
    session = get_active_session()
    
    print("✅ Connected to Snowflake using active session")
    print("   🏔️ Running in Snowflake Notebook (Container Runtime)")
    print(f"   Account: {session.get_current_account()}")
    print(f"   User: {session.get_current_user()}")
    print(f"   Role: {session.get_current_role()}")
    print(f"   Warehouse: {session.get_current_warehouse()}")
    print(f"   Database: {session.get_current_database() or '(not set)'}")
    
except Exception as e:
    # ========================================================================
    # METHOD 2: Use credentials from environment (local development)
    # ========================================================================
    print("💻 Running locally - connecting with credentials from .env")
    
    # Get connection from environment or use defaults
    session = Session.builder.configs({
        "account": os.environ.get("SNOWFLAKE_ACCOUNT", ""),
        "user": os.environ.get("SNOWFLAKE_USER", ""),
        "password": os.environ.get("SNOWFLAKE_PASSWORD", ""),
        "role": os.environ.get("SNOWFLAKE_ROLE", "ACCOUNTADMIN"),
        "warehouse": os.environ.get("SNOWFLAKE_WAREHOUSE", WAREHOUSE_NAME),
    }).create()
    
    print("✅ Connected to Snowflake using credentials")
    print(f"   Account: {session.get_current_account()}")
    print(f"   User: {session.get_current_user()}")
    print(f"   Role: {session.get_current_role()}")
    print(f"   Warehouse: {session.get_current_warehouse()}")

print("\n🔌 Snowflake session ready!")

🔌 Connected to Snowflake


In [5]:


# Create database
session.sql(f"CREATE DATABASE IF NOT EXISTS {DATABASE_NAME}").collect()
print(f"✅ Database {DATABASE_NAME} created/verified")

# Use database
session.sql(f"USE DATABASE {DATABASE_NAME}").collect()

# Create schemas
session.sql(f"""
    CREATE SCHEMA IF NOT EXISTS {SCHEMA_RAW}
    COMMENT = 'Raw PDF data from AI_PARSE_DOCUMENT'
""").collect()
print(f"✅ Schema {SCHEMA_RAW} created/verified")

session.sql(f"""
    CREATE SCHEMA IF NOT EXISTS {SCHEMA_PROCESSING}
    COMMENT = 'Processed PDF data, embeddings, and analytics'
""").collect()
print(f"✅ Schema {SCHEMA_PROCESSING} created/verified")

# Verify schemas exist
schemas = session.sql("SHOW SCHEMAS").collect()
print(f"\n📊 Available schemas in {DATABASE_NAME}:")
for schema in schemas:
    print(f"   - {schema['name']}")

print("\n✅ Database and schemas ready!")


✅ Database GWAS created/verified
✅ Schema PDF_RAW created/verified
✅ Schema PDF_PROCESSING created/verified

📊 Available schemas in GWAS:
   - INFORMATION_SCHEMA
   - PDF_PROCESSING
   - PDF_RAW
   - PUBLIC

✅ Database and schemas ready!


## 📦 Step 2: Create Stage

Create stage for storing PDF files, extracted images, and text files.


In [6]:
# Create stage for PDF and asset storage
session.sql(f"USE SCHEMA {SCHEMA_RAW}").collect()

session.sql(f"""
    CREATE STAGE IF NOT EXISTS PDF_STAGE
    DIRECTORY = (ENABLE = TRUE)
    ENCRYPTION = (TYPE = 'SNOWFLAKE_SSE')
    COMMENT = 'Storage for PDF files, extracted images, and text'
""").collect()

print(f"✅ Stage PDF_STAGE created/verified in {DATABASE_NAME}.{SCHEMA_RAW}")

# Verify stage exists
stages = session.sql("SHOW STAGES").collect()
print(f"\n📦 Available stages:")
for stage in stages:
    print(f"   - {stage['name']}")

print(f"\n💡 Upload PDFs using:")
print(f"   PUT file:///path/to/file.pdf @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/")

print("\n✅ Stage ready!")


✅ Stage PDF_STAGE created/verified in GWAS.PDF_RAW

📦 Available stages:
   - PDF_STAGE

💡 Upload PDFs using:
   PUT file:///path/to/file.pdf @GWAS.PDF_RAW.PDF_STAGE/

✅ Stage ready!


## 📊 Step 3: Create Tables

Create all tables needed for the GWAS extraction pipeline.


In [7]:
# Create PARSED_DOCUMENTS table in PDF_RAW schema
session.sql(f"USE SCHEMA {SCHEMA_RAW}").collect()

session.sql("""
    CREATE TABLE IF NOT EXISTS PARSED_DOCUMENTS (
        document_id VARCHAR PRIMARY KEY,
        file_path VARCHAR NOT NULL,
        file_name VARCHAR NOT NULL,
        parsed_content VARIANT NOT NULL,
        total_pages INTEGER,
        created_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP()
    )
    COMMENT = 'Raw PDF data from Cortex AI_PARSE_DOCUMENT'
""").collect()

print(f"✅ Table PARSED_DOCUMENTS created in {DATABASE_NAME}.{SCHEMA_RAW}")


✅ Table PARSED_DOCUMENTS created in GWAS.PDF_RAW


In [8]:
# Create TEXT_PAGES table in PDF_PROCESSING schema
session.sql(f"USE SCHEMA {SCHEMA_PROCESSING}").collect()

session.sql("""
    CREATE TABLE IF NOT EXISTS TEXT_PAGES (
        page_id VARCHAR PRIMARY KEY DEFAULT UUID_STRING(),
        document_id VARCHAR NOT NULL,
        file_name VARCHAR NOT NULL,
        page_number INTEGER NOT NULL,
        page_text TEXT,
        word_count INTEGER,
        text_embedding VECTOR(FLOAT, 1024),
        embedding_model VARCHAR(100),
        created_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
        UNIQUE (document_id, page_number)
    )
    COMMENT = 'Page text with embeddings for semantic search'
""").collect()

print(f"✅ Table TEXT_PAGES created in {DATABASE_NAME}.{SCHEMA_PROCESSING}")


✅ Table TEXT_PAGES created in GWAS.PDF_PROCESSING


In [9]:
# Create IMAGE_PAGES table in PDF_PROCESSING schema
session.sql("""
    CREATE TABLE IF NOT EXISTS IMAGE_PAGES (
        image_id VARCHAR PRIMARY KEY DEFAULT UUID_STRING(),
        document_id VARCHAR NOT NULL,
        file_name VARCHAR NOT NULL,
        page_number INTEGER NOT NULL,
        image_file_path VARCHAR NOT NULL,
        image_embedding VECTOR(FLOAT, 1024),
        embedding_model VARCHAR(100),
        dpi INTEGER DEFAULT 300,
        image_format VARCHAR(10) DEFAULT 'PNG',
        created_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
        UNIQUE (document_id, page_number)
    )
    COMMENT = 'Page images metadata for multimodal processing'
""").collect()

print(f"✅ Table IMAGE_PAGES created in {DATABASE_NAME}.{SCHEMA_PROCESSING}")


✅ Table IMAGE_PAGES created in GWAS.PDF_PROCESSING


In [10]:
# Create MULTIMODAL_PAGES table in PDF_PROCESSING schema
session.sql("""
    CREATE TABLE IF NOT EXISTS MULTIMODAL_PAGES (
        page_id VARCHAR PRIMARY KEY DEFAULT UUID_STRING(),
        document_id VARCHAR NOT NULL,
        file_name VARCHAR NOT NULL,
        page_number INTEGER NOT NULL,
        image_id VARCHAR,
        page_text TEXT,
        image_path VARCHAR,
        text_embedding VECTOR(FLOAT, 1024),
        image_embedding VECTOR(FLOAT, 1024),
        embedding_model VARCHAR(100),
        has_text BOOLEAN DEFAULT FALSE,
        has_image BOOLEAN DEFAULT FALSE,
        created_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
        UNIQUE (document_id, page_number)
    )
    COMMENT = 'Combined text + image embeddings for multimodal RAG'
""").collect()

print(f"✅ Table MULTIMODAL_PAGES created in {DATABASE_NAME}.{SCHEMA_PROCESSING}")


✅ Table MULTIMODAL_PAGES created in GWAS.PDF_PROCESSING


In [11]:
# Create GWAS_TRAIT_ANALYTICS table in PDF_PROCESSING schema
session.sql("""
    CREATE TABLE IF NOT EXISTS GWAS_TRAIT_ANALYTICS (
        analytics_id VARCHAR PRIMARY KEY DEFAULT UUID_STRING(),
        document_id VARCHAR NOT NULL,
        file_name VARCHAR NOT NULL,
        extraction_version VARCHAR(50),
        finding_number INTEGER DEFAULT 1,
        
        -- Genomic traits
        trait VARCHAR(500),
        germplasm_name VARCHAR(500),
        genome_version VARCHAR(100),
        chromosome VARCHAR(50),
        physical_position VARCHAR(200),
        gene VARCHAR(500),
        snp_name VARCHAR(200),
        variant_id VARCHAR(200),
        variant_type VARCHAR(100),
        effect_size VARCHAR(200),
        gwas_model VARCHAR(200),
        evidence_type VARCHAR(100),
        allele VARCHAR(100),
        annotation TEXT,
        candidate_region VARCHAR(500),
        
        -- Metadata
        extraction_source VARCHAR(50),
        field_citations VARIANT,
        field_confidence VARIANT,
        field_raw_values VARIANT,
        traits_extracted INTEGER,
        traits_not_reported INTEGER,
        extraction_accuracy_pct FLOAT,
        
        created_at TIMESTAMP_LTZ DEFAULT CURRENT_TIMESTAMP(),
        UNIQUE (document_id, extraction_version, finding_number)
    )
    COMMENT = 'Extracted GWAS trait data from research papers'
""").collect()

print(f"✅ Table GWAS_TRAIT_ANALYTICS created in {DATABASE_NAME}.{SCHEMA_PROCESSING}")


✅ Table GWAS_TRAIT_ANALYTICS created in GWAS.PDF_PROCESSING


## 📤 Step 4: Upload PDF to Stage

**Upload your PDF file to the stage before proceeding.**

### Option 1: Using SnowSQL (Command Line)
```bash
# From terminal
snowsql -a YOUR_ACCOUNT -u YOUR_USER
PUT file:///Users/jholt/Downloads/fpls-15-1373081.pdf @GWAS.PDF_RAW.PDF_STAGE/;
```

### Option 2: Using Python (Below)
Run the cell below to upload from your local system.


In [12]:
# Upload PDF from local system
from pathlib import Path

# Path to your PDF file
PDF_LOCAL_PATH = "/Users/jholt/Downloads/fpls-15-1373081.pdf"

# Verify file exists
pdf_path = Path(PDF_LOCAL_PATH)
if not pdf_path.exists():
    print(f"❌ File not found: {PDF_LOCAL_PATH}")
    print("   Update PDF_LOCAL_PATH to point to your PDF file")
else:
    print(f"📄 Found PDF: {pdf_path.name} ({pdf_path.stat().st_size / 1024 / 1024:.2f} MB)")
    
    # Upload to stage
    print(f"\n📤 Uploading to stage...")
    session.file.put(
        str(pdf_path),
        f"@{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/",
        auto_compress=False,
        overwrite=True
    )
    
    print(f"✅ PDF uploaded to @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{pdf_path.name}")
    
    # List files in stage to verify
    print(f"\n📂 Files in stage:")
    files = session.sql(f"LIST @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE").collect()
    for file in files:
        print(f"   - {file[0]}")


📄 Found PDF: fpls-15-1373081.pdf (2.50 MB)

📤 Uploading to stage...
✅ PDF uploaded to @GWAS.PDF_RAW.PDF_STAGE/fpls-15-1373081.pdf

📂 Files in stage:
   - pdf_stage/fpls-15-1373081.pdf
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0000.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0001.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0002.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0003.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0004.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0005.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0006.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0007.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0008.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0009.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0010.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0011.png
   - pdf_stage/fpls-15-1373081.pdf/pages_images/page_0012.png
   - pdf_s

## 📦 CELL 1: Section 1 - Setup & Imports

In [13]:
# Standard library imports
import sys
import os
import dotenv
from pathlib import Path
import json
from datetime import datetime

# Add scripts directory to path
project_root = Path().absolute()
sys.path.append(str(project_root / "scripts" / "python"))

# Third-party imports
import pandas as pd
import numpy as np
from tqdm.notebook import tqdm


# Display settings
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 100)

print("✅ Imports successful!")
print(f"   Project root: {project_root}")

✅ Imports successful!
   Project root: /Users/jholt/gwas_intelligence/gwas_intelligence


## 📄 CELL 5-6: Section 3 - List PDFs in Snowflake Stage

- **Cell 5**: List available PDFs
- **Cell 6**: Configure which PDF to process

In [14]:
# ============================================================================
# CONFIGURATION: Update these for your PDF
# ============================================================================

# Test PDF: fpls-15-1373081.pdf (GWAS paper from Frontiers in Plant Science)
# PDF is uploaded to root of stage: @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{filename}

PDF_FILENAME = "fpls-15-1373081.pdf"  # PDF filename as it exists in stage

# Use filename as DOCUMENT_ID (keeps .pdf extension)
DOCUMENT_ID = PDF_FILENAME

# Stage paths
STAGE_FILE_PATH = PDF_FILENAME  # PDF is at root of stage (no subdirectory)

# Expected directory structure that will be created in stage:
# @PDF_STAGE/
#   └── fpls-15-1373081.pdf/
#       ├── pages_text/
#       │   ├── page_001.txt
#       │   ├── page_002.txt
#       │   └── ...
#       └── pages_images/
#           ├── page_001.png
#           ├── page_002.png
#           └── ...

print(f"📋 Selected PDF Configuration:")
print(f"   Filename: {PDF_FILENAME}")
print(f"   Document ID: {DOCUMENT_ID}")
print(f"   Stage File Path: {STAGE_FILE_PATH}")
print(f"   Full Stage Path: @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{STAGE_FILE_PATH}")
print(f"\n📁 Output Structure:")
print(f"   Text:   @PDF_STAGE/{DOCUMENT_ID}/pages_text/")
print(f"   Images: @PDF_STAGE/{DOCUMENT_ID}/pages_images/")
print(f"\n✅ Configuration ready!")


📋 Selected PDF Configuration:
   Filename: fpls-15-1373081.pdf
   Document ID: fpls-15-1373081.pdf
   Stage File Path: fpls-15-1373081.pdf
   Full Stage Path: @GWAS.PDF_RAW.PDF_STAGE/fpls-15-1373081.pdf

📁 Output Structure:
   Text:   @PDF_STAGE/fpls-15-1373081.pdf/pages_text/
   Images: @PDF_STAGE/fpls-15-1373081.pdf/pages_images/

✅ Configuration ready!


## 🤖 CELL 8-9: Section 4 - Parse PDF with AI_PARSE_DOCUMENT

- **Cell 8**: Parse PDF using Snowflake Cortex AI
- **Cell 9**: Convert PDF pages to PNG images

In [15]:
# Parse PDF using Snowflake Cortex AI_PARSE_DOCUMENT
# Reference: https://docs.snowflake.com/en/user-guide/snowflake-cortex/parse-document
import time

print(f"🔄 Parsing PDF from stage\n")
print(f"   Stage: @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE")
print(f"   File: {STAGE_FILE_PATH}\n")

# Pre-check: Is document already parsed?
check_query = f"""
SELECT document_id, total_pages, created_at
FROM {DATABASE_NAME}.{SCHEMA_RAW}.PARSED_DOCUMENTS
WHERE document_id = '{DOCUMENT_ID}'
"""

existing = session.sql(check_query).collect()

if existing:
    print(f"ℹ️  Document '{DOCUMENT_ID}' already parsed (skipping)")
    print(f"   Parsed at: {existing[0][2]}")
    print(f"   Total pages: {existing[0][1]}")
    print(f"\n💡 To re-parse, delete the record first:")
    print(f"   DELETE FROM {DATABASE_NAME}.{SCHEMA_RAW}.PARSED_DOCUMENTS")
    print(f"   WHERE document_id = '{DOCUMENT_ID}';\n")
else:
    print("📋 Using AI_PARSE_DOCUMENT with LAYOUT mode")
    print("   - High-fidelity extraction optimized for complex documents")
    print("   - Preserves structure: tables, headers, reading order")
    print("   - page_split: true (processes each page separately)")
    print("   - Returns Markdown-formatted content")
    print("\n⏱️  This may take 30-60+ seconds for a 15-page PDF...")
    print("   (AI_PARSE_DOCUMENT is processing your document in Snowflake)\n")
    
    # Start timing
    start_time = time.time()
    
    parse_query = f"""
    INSERT INTO {DATABASE_NAME}.{SCHEMA_RAW}.PARSED_DOCUMENTS 
        (document_id, file_path, file_name, parsed_content, total_pages)
    SELECT
        '{DOCUMENT_ID}' AS document_id,
        '@{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{STAGE_FILE_PATH}' AS file_path,
        '{PDF_FILENAME}' AS file_name,
        parsed_data AS parsed_content,
        ARRAY_SIZE(parsed_data:pages) AS total_pages
    FROM (
        SELECT SNOWFLAKE.CORTEX.AI_PARSE_DOCUMENT(
            TO_FILE('@{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE', '{STAGE_FILE_PATH}'),
            {{'mode': 'LAYOUT', 'page_split': true}}
        ) AS parsed_data
    )
    """
    
    try:
        print("🔄 Calling AI_PARSE_DOCUMENT... (please wait)")
        session.sql(parse_query).collect()
        
        elapsed = time.time() - start_time
        print(f"\n✅ PDF parsed successfully in {elapsed:.1f} seconds!\n")
        
        # Verify parsing
        result = session.sql(check_query).collect()
        if result:
            print(f"📄 Parsed Document Info:")
            print(f"   Document ID: {result[0][0]}")
            print(f"   Page Count: {result[0][1]}")
            print(f"   Created: {result[0][2]}")
            
            # Get first page preview
            preview_query = f"""
            SELECT parsed_content:pages[0]:content::VARCHAR as first_page
            FROM {DATABASE_NAME}.{SCHEMA_RAW}.PARSED_DOCUMENTS
            WHERE document_id = '{DOCUMENT_ID}'
            """
            preview = session.sql(preview_query).collect()
            if preview and preview[0][0]:
                print(f"\n   First Page Preview (100 chars):")
                print(f"   {preview[0][0][:100]}...")
        
    except Exception as e:
        elapsed = time.time() - start_time
        error_msg = str(e)
        print(f"\n❌ Error after {elapsed:.1f} seconds: {error_msg[:300]}\n")
        
        # Helpful debugging
        if "does not exist" in error_msg.lower():
            print("💡 File not found in stage. Check:")
            print(f"   LIST @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE;")
        elif "timeout" in error_msg.lower():
            print("💡 Query timeout. Try:")
            print("   - Smaller PDF")
            print("   - Increase statement timeout")
        else:
            print("💡 Full error:")
            print(f"   {error_msg}")


🔄 Parsing PDF from stage

   Stage: @GWAS.PDF_RAW.PDF_STAGE
   File: fpls-15-1373081.pdf

ℹ️  Document 'fpls-15-1373081.pdf' already parsed (skipping)
   Parsed at: 2025-10-07 19:21:58.711000-07:00
   Total pages: 14

💡 To re-parse, delete the record first:
   DELETE FROM GWAS.PDF_RAW.PARSED_DOCUMENTS
   WHERE document_id = 'fpls-15-1373081.pdf';



In [16]:
# ============================================================================
# CREATE PNG IMAGES FROM PDF
# Uses PyMuPDF to convert PDF pages to PNG, uploads to stage structure
# NOTE: PyMuPDF requires local file access - we download, process, upload
# ============================================================================

# Required imports (in case Cell 1 wasn't run)
import tempfile
import shutil
from pathlib import Path
try:
    import fitz  # PyMuPDF
except ImportError:
    print("❌ Error: PyMuPDF (fitz) not installed!")
    print("   Install: pip install PyMuPDF")
    raise

print("🖼️  Creating PNG images from PDF pages\n")
print(f"   PDF: {STAGE_FILE_PATH}")
print(f"   Document ID: {DOCUMENT_ID}\n")

# Create temp directories
temp_dir = Path(tempfile.mkdtemp())
images_output = temp_dir / "images"
images_output.mkdir(parents=True, exist_ok=True)

try:
    # Step 1: Download PDF from stage (required for PyMuPDF)
    print("📥 Step 1: Downloading PDF from stage...")
    
    # Set database context for file operations
    session.sql(f"USE DATABASE {DATABASE_NAME}").collect()
    session.sql(f"USE SCHEMA {SCHEMA_RAW}").collect()
    
    # Download from stage
    stage_path = f"@{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{STAGE_FILE_PATH}"
    session.file.get(stage_path, str(temp_dir))
    
    # Find downloaded PDF
    pdf_files = list(temp_dir.rglob("*.pdf"))
    if not pdf_files:
        raise FileNotFoundError(f"PDF not downloaded: {PDF_FILENAME}")
    
    local_pdf = pdf_files[0]
    print(f"   ✅ Downloaded: {local_pdf.name}\n")
    
    # Step 2: Convert PDF pages to PNG using PyMuPDF
    print("🔄 Step 2: Converting PDF pages to PNG images...")
    doc = fitz.open(local_pdf)
    page_count = len(doc)
    print(f"   PDF has {page_count} pages")
    
    for page_num in range(page_count):
        page = doc[page_num]
        pix = page.get_pixmap(dpi=300)
        
        output_file = images_output / f"page_{page_num:04d}.png"
        pix.save(output_file)
        print(f"   ✓ Converted page {page_num + 1}/{page_count}")
    
    doc.close()
    print(f"   ✅ Created {page_count} PNG images\n")
    
    # Step 3: Upload PNGs to stage structure
    print("📤 Step 3: Uploading PNG images to stage...")
    stage_output = f"@{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{DOCUMENT_ID}/pages_images/"
    print(f"   Target: {stage_output}")
    
    for page_num in range(page_count):
        local_image = images_output / f"page_{page_num:04d}.png"
        session.file.put(
            str(local_image),
            stage_output,
            auto_compress=False,
            overwrite=True
        )
        print(f"   ✓ Uploaded page {page_num + 1}/{page_count}")
    
    print(f"   ✅ Uploaded {page_count} images\n")
    
    # Step 4: Insert IMAGE_PAGES records into database
    print("💾 Step 4: Inserting IMAGE_PAGES records...")
    session.sql(f"USE SCHEMA {SCHEMA_PROCESSING}").collect()
    
    for page_num in range(page_count):
        session.sql(f"""
            INSERT INTO {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES (
                IMAGE_ID,
                DOCUMENT_ID,
                FILE_NAME,
                PAGE_NUMBER,
                IMAGE_FILE_PATH,
                DPI,
                IMAGE_FORMAT
            )
            SELECT
                UUID_STRING(),
                '{DOCUMENT_ID}',
                '{PDF_FILENAME}',
                {page_num},
                '{stage_output}page_{page_num:04d}.png',
                300,
                'PNG'
            WHERE NOT EXISTS (
                SELECT 1 FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES
                WHERE DOCUMENT_ID = '{DOCUMENT_ID}' 
                AND PAGE_NUMBER = {page_num}
            )
        """).collect()
        print(f"   ✓ Inserted record {page_num + 1}/{page_count}")
    
    print(f"   ✅ Inserted {page_count} IMAGE_PAGES records\n")
    
    # Step 5: Verify
    print("🔍 Step 5: Verifying stage structure...")
    verify_result = session.sql(f"""
        LIST @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{DOCUMENT_ID}/pages_images/
    """).collect()
    print(f"   ✅ Found {len(verify_result)} files in stage")
    
    # Verify database
    db_count = session.sql(f"""
        SELECT COUNT(*) FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES
        WHERE DOCUMENT_ID = '{DOCUMENT_ID}'
    """).collect()
    print(f"   ✅ Found {db_count[0][0]} records in IMAGE_PAGES table")
    
    print(f"\n🎉 SUCCESS! Converted {page_count} pages for {PDF_FILENAME}")
    print(f"   Stage: @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{DOCUMENT_ID}/pages_images/")
    print(f"   Database: {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES")
    
except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()
    
finally:
    # Cleanup temp directory
    if 'temp_dir' in locals() and temp_dir.exists():
        shutil.rmtree(temp_dir)
        print(f"\n🧹 Cleaned up temp directory")


🖼️  Creating PNG images from PDF pages

   PDF: fpls-15-1373081.pdf
   Document ID: fpls-15-1373081.pdf

📥 Step 1: Downloading PDF from stage...
   ✅ Downloaded: fpls-15-1373081.pdf

🔄 Step 2: Converting PDF pages to PNG images...
   PDF has 14 pages
   ✓ Converted page 1/14
   ✓ Converted page 2/14
   ✓ Converted page 3/14
   ✓ Converted page 4/14
   ✓ Converted page 5/14
   ✓ Converted page 6/14
   ✓ Converted page 7/14
   ✓ Converted page 8/14
   ✓ Converted page 9/14
   ✓ Converted page 10/14
   ✓ Converted page 11/14
   ✓ Converted page 12/14
   ✓ Converted page 13/14
   ✓ Converted page 14/14
   ✅ Created 14 PNG images

📤 Step 3: Uploading PNG images to stage...
   Target: @GWAS.PDF_RAW.PDF_STAGE/fpls-15-1373081.pdf/pages_images/
   ✓ Uploaded page 1/14
   ✓ Uploaded page 2/14
   ✓ Uploaded page 3/14
   ✓ Uploaded page 4/14
   ✓ Uploaded page 5/14
   ✓ Uploaded page 6/14
   ✓ Uploaded page 7/14
   ✓ Uploaded page 8/14
   ✓ Uploaded page 9/14
   ✓ Uploaded page 10/14
   ✓ Uploaded

## 📝 CELL 11: Section 5 - Extract Text Pages & Generate Embeddings

Uses `snowflake-arctic-embed-l-v2.0-8k` model for text embeddings

In [17]:
# Extract text pages with embeddings using snowflake-arctic-embed-l-v2.0-8k
print("🔄 Extracting text pages and generating embeddings...\n")
print("📋 Text Embedding Model: snowflake-arctic-embed-l-v2.0-8k")
print("   - Dimensions: 1024")
print("   - Context length: 8K tokens")
print("   - Optimized for: Long-form documents\n")

# Insert text pages with embeddings
text_extract_query = f"""
INSERT INTO {DATABASE_NAME}.{SCHEMA_PROCESSING}.TEXT_PAGES 
    (document_id, file_name, page_number, page_text, word_count, 
     text_embedding, embedding_model)
SELECT
    '{DOCUMENT_ID}' AS document_id,
    '{PDF_FILENAME}' AS file_name,
    page.index AS page_number,
    page.value:content::STRING AS page_text,
    ARRAY_SIZE(SPLIT(page.value:content::STRING, ' ')) AS word_count,
    SNOWFLAKE.CORTEX.EMBED_TEXT_1024(
        'snowflake-arctic-embed-l-v2.0-8k',
        page.value:content::STRING
    ) AS text_embedding,
    'snowflake-arctic-embed-l-v2.0-8k' AS embedding_model
FROM {DATABASE_NAME}.{SCHEMA_RAW}.PARSED_DOCUMENTS pd,
LATERAL FLATTEN(input => pd.parsed_content:pages) page
WHERE pd.document_id = '{DOCUMENT_ID}'
AND NOT EXISTS (
    SELECT 1 FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.TEXT_PAGES tp
    WHERE tp.document_id = '{DOCUMENT_ID}' 
    AND tp.page_number = page.index
)
"""

try:
    session.sql(text_extract_query).collect()
    print("✅ Text pages extracted with embeddings!\n")
    
    # Get statistics
    stats_query = f"""
    SELECT 
        COUNT(*) as page_count,
        AVG(word_count) as avg_words,
        MIN(word_count) as min_words,
        MAX(word_count) as max_words
    FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.TEXT_PAGES
    WHERE document_id = '{DOCUMENT_ID}'
    """
    
    stats = session.sql(stats_query).collect()
    if stats and stats[0][0] > 0:
        print(f"📊 Text Extraction Statistics:")
        print(f"   Total pages: {stats[0][0]}")
        print(f"   Avg words/page: {stats[0][1]:.0f}")
        print(f"   Min words: {stats[0][2]}")
        print(f"   Max words: {stats[0][3]}")
        
        # Verify embeddings
        embed_check = session.sql(f"""
            SELECT COUNT(*) FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.TEXT_PAGES 
            WHERE document_id = '{DOCUMENT_ID}' AND text_embedding IS NOT NULL
        """).collect()
        print(f"   Pages with embeddings: {embed_check[0][0]}")
        
        # Show sample pages
        sample_query = f"""
        SELECT page_number, LEFT(page_text, 100) as preview, word_count
        FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.TEXT_PAGES
        WHERE document_id = '{DOCUMENT_ID}'
        ORDER BY page_number
        LIMIT 3
        """
        
        samples = session.sql(sample_query).collect()
        if samples:
            print(f"\n📄 Sample Pages:")
            df_samples = pd.DataFrame(samples, columns=['Page', 'Text Preview', 'Words'])
            display(df_samples)
    
except Exception as e:
    print(f"❌ Error: {e}")


🔄 Extracting text pages and generating embeddings...

📋 Text Embedding Model: snowflake-arctic-embed-l-v2.0-8k
   - Dimensions: 1024
   - Context length: 8K tokens
   - Optimized for: Long-form documents

✅ Text pages extracted with embeddings!

📊 Text Extraction Statistics:
   Total pages: 14
   Avg words/page: 678
   Min words: 237
   Max words: 1238
   Pages with embeddings: 14

📄 Sample Pages:


Unnamed: 0,Page,Text Preview,Words
0,0,"## OPEN ACCESS\n\nEDITED BY\nShengli Jing,\nXinyang Normal University, China\nREVIEWED BY\nLilin...",556
1,1,Introduction\n\nThe cultivated rice (Oryza sativa L.) is a major staple crop and feeds over half...,1002
2,2,"materials can be found in Supplementary Table 1. The majority of them were indica (227), followe...",826


## 🖼️ CELL 13-14: Section 6 - Create Image Pages

- **Cell 13**: Debug - List files in stage
- **Cell 14**: Generate image embeddings using `voyage-multimodal-3`

**Purpose:** Create embeddings for PNG images to enable multimodal search (text + images).
Images capture tables, charts, and figures that may contain GWAS data not easily extracted from text.

In [18]:
# DEBUG: List actual files in stage to verify paths
print("🔍 Listing files in stage...\n")

list_query = f"""
LIST @{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE/{DOCUMENT_ID}/pages_images/
"""

try:
    files = session.sql(list_query).collect()
    print(f"✅ Found {len(files)} files:\n")
    for f in files[:5]:  # Show first 5
        print(f"   {f[0]}")
    if len(files) > 5:
        print(f"   ... and {len(files) - 5} more")
except Exception as e:
    print(f"❌ Error: {e}")


🔍 Listing files in stage...

✅ Found 14 files:

   pdf_stage/fpls-15-1373081.pdf/pages_images/page_0000.png
   pdf_stage/fpls-15-1373081.pdf/pages_images/page_0001.png
   pdf_stage/fpls-15-1373081.pdf/pages_images/page_0002.png
   pdf_stage/fpls-15-1373081.pdf/pages_images/page_0003.png
   pdf_stage/fpls-15-1373081.pdf/pages_images/page_0004.png
   ... and 9 more


In [19]:
# Generate image embeddings for existing IMAGE_PAGES records
# Uses voyage-multimodal-3 to create embeddings from PNGs in stage
print("🔄 Generating image embeddings...\n")
print("📋 Image Embedding Model: voyage-multimodal-3 via AI_EMBED")
print("   - Dimensions: 1024")
print("   - Supports: Images + Text")
print("   - Use case: Visual understanding of tables, charts, figures\n")

try:
    # Get existing IMAGE_PAGES records without embeddings
    check_query = f"""
    SELECT 
        PAGE_NUMBER,
        IMAGE_FILE_PATH,
        COUNT(*) OVER() as total_records
    FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES
    WHERE DOCUMENT_ID = '{DOCUMENT_ID}'
    AND IMAGE_EMBEDDING IS NULL
    ORDER BY PAGE_NUMBER
    """
    
    records = session.sql(check_query).collect()
    
    if not records:
        print("ℹ️  No records found without embeddings")
        
        # Check if embeddings already exist
        existing = session.sql(f"""
            SELECT COUNT(*) FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES
            WHERE DOCUMENT_ID = '{DOCUMENT_ID}' AND IMAGE_EMBEDDING IS NOT NULL
        """).collect()
        if existing and existing[0][0] > 0:
            print(f"   ✅ {existing[0][0]} records already have embeddings!\n")
        else:
            print("   ⚠️  No IMAGE_PAGES records found - run Cell 9 first\n")
    else:
        total_records = records[0][2]
        print(f"📊 Found {total_records} IMAGE_PAGES records without embeddings")
        print(f"   Processing {len(records)} pages...\n")
        
        # Update each record with embedding
        for idx, record in enumerate(records, 1):
            page_num = record[0]
            image_path = record[1]
            
            # Parse the stored path
            # Stored format: @PDF_STAGE/fpls-15-1373081.pdf/pages_images/page_0000.png
            # Extract relative path (everything after first /)
            
            if image_path.startswith('@'):
                # Split on first / after @
                parts = image_path.split('/', 1)
                if len(parts) == 2:
                    relative_path = parts[1]  # fpls-15-1373081.pdf/pages_images/page_0000.png
                else:
                    relative_path = image_path
            else:
                # No @ prefix, use as-is
                relative_path = image_path
            
            # Always use full stage name for TO_FILE
            full_stage_name = f'@{DATABASE_NAME}.{SCHEMA_RAW}.PDF_STAGE'
            
            print(f"   Page {page_num}: TO_FILE('{full_stage_name}', '{relative_path}')")
            
            # Generate embedding and update record
            update_query = f"""
            UPDATE {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES
            SET 
                IMAGE_EMBEDDING = AI_EMBED(
                    'voyage-multimodal-3',
                    TO_FILE('{full_stage_name}', '{relative_path}')
                ),
                EMBEDDING_MODEL = 'voyage-multimodal-3'
            WHERE DOCUMENT_ID = '{DOCUMENT_ID}'
            AND PAGE_NUMBER = {page_num}
            """
            
            try:
                session.sql(update_query).collect()
                print(f"   ✓ Generated embedding ({idx}/{len(records)})\n")
            except Exception as e:
                error_msg = str(e)
                print(f"   ✗ Failed: {error_msg[:200]}\n")
                # Show full error for first failure
                if idx == 1:
                    print(f"   Full error: {error_msg}\n")
                    print(f"   💡 Tip: Run the debug cell above (Cell 13) to verify files exist in stage\n")
        
        print(f"✅ Embedding generation complete!\n")
    
    # Verify final counts
    verify_query = f"""
    SELECT 
        COUNT(*) as total_records,
        COUNT(IMAGE_EMBEDDING) as with_embeddings,
        COUNT(CASE WHEN IMAGE_EMBEDDING IS NULL THEN 1 END) as without_embeddings
    FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES
    WHERE DOCUMENT_ID = '{DOCUMENT_ID}'
    """
    
    result = session.sql(verify_query).collect()
    if result:
        total, with_emb, without_emb = result[0]
        print(f"📊 Final Status:")
        print(f"   Total records: {total}")
        print(f"   ✅ With embeddings: {with_emb}")
        print(f"   ⚠️  Without embeddings: {without_emb}")
        print(f"   📈 Ready for multimodal search: {with_emb}/{total}")
        
        if with_emb == total and total > 0:
            print(f"\n🎉 All image embeddings generated successfully!")
        elif without_emb > 0:
            print(f"\n⚠️  {without_emb} pages still need embeddings")
    
except Exception as e:
    print(f"❌ Error: {e}")
    import traceback
    traceback.print_exc()


🔄 Generating image embeddings...

📋 Image Embedding Model: voyage-multimodal-3 via AI_EMBED
   - Dimensions: 1024
   - Supports: Images + Text
   - Use case: Visual understanding of tables, charts, figures

ℹ️  No records found without embeddings
   ✅ 14 records already have embeddings!

📊 Final Status:
   Total records: 14
   ✅ With embeddings: 14
   ⚠️  Without embeddings: 0
   📈 Ready for multimodal search: 14/14

🎉 All image embeddings generated successfully!


## 🔗 CELL 16: Section 7 - Create Multimodal Pages

Join text and image embeddings into a unified multimodal table for search

In [20]:
# Create multimodal pages - Join text and image embeddings
print("🔄 Creating multimodal pages...\n")
print("🔗 Joining text and image data by page_number")
print("   - Copies both text and image embeddings")
print("   - Enables unified multi-modal search\n")

multimodal_insert_query = f"""
INSERT INTO {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_PAGES
    (document_id, file_name, page_number, page_id, image_id,
     page_text, image_path, text_embedding, image_embedding, 
     has_text, has_image)
SELECT
    COALESCE(tp.document_id, ip.document_id) AS document_id,
    COALESCE(tp.file_name, ip.file_name) AS file_name,
    COALESCE(tp.page_number, ip.page_number) AS page_number,
    tp.page_id,
    ip.image_id,
    tp.page_text,
    ip.image_file_path AS image_path,
    tp.text_embedding,
    ip.image_embedding,
    tp.page_id IS NOT NULL AS has_text,
    ip.image_id IS NOT NULL AS has_image
FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.TEXT_PAGES tp
FULL OUTER JOIN {DATABASE_NAME}.{SCHEMA_PROCESSING}.IMAGE_PAGES ip
    ON tp.document_id = ip.document_id
    AND tp.page_number = ip.page_number
WHERE COALESCE(tp.document_id, ip.document_id) = '{DOCUMENT_ID}'
AND NOT EXISTS (
    SELECT 1 FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_PAGES mp
    WHERE mp.document_id = COALESCE(tp.document_id, ip.document_id)
    AND mp.page_number = COALESCE(tp.page_number, ip.page_number)
)
"""

try:
    session.sql(multimodal_insert_query).collect()
    print("✅ Multimodal pages created!\n")
    
    # Get statistics
    stats_query = f"""
    SELECT 
        COUNT(*) as total_pages,
        COUNT(CASE WHEN has_text THEN 1 END) as pages_with_text,
        COUNT(CASE WHEN has_image THEN 1 END) as pages_with_images,
        COUNT(CASE WHEN text_embedding IS NOT NULL THEN 1 END) as text_embeddings,
        COUNT(CASE WHEN image_embedding IS NOT NULL THEN 1 END) as image_embeddings
    FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_PAGES
    WHERE document_id = '{DOCUMENT_ID}'
    """
    
    stats = session.sql(stats_query).collect()
    if stats:
        print(f"📊 Multimodal Pages Statistics:")
        print(f"   Total pages: {stats[0][0]}")
        print(f"   Pages with text: {stats[0][1]}")
        print(f"   Pages with images: {stats[0][2]}")
        print(f"   Text embeddings: {stats[0][3]}")
        print(f"   Image embeddings: {stats[0][4]}")
    
    # Show sample
    query = f"""
    SELECT 
        page_number,
        LEFT(page_text, 80) as text_preview,
        has_text,
        has_image,
        text_embedding IS NOT NULL as has_text_emb,
        image_embedding IS NOT NULL as has_image_emb
    FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_PAGES
    WHERE document_id = '{DOCUMENT_ID}'
    ORDER BY page_number
    LIMIT 5
    """
    
    results = session.sql(query).collect()
    if results:
        print(f"\n📄 Sample Pages:")
        df = pd.DataFrame(results, 
                          columns=['Page', 'Text Preview', 'Has Text', 'Has Image', 
                                   'Text Emb', 'Image Emb'])
        display(df)
    
except Exception as e:
    print(f"❌ Error: {e}")


🔄 Creating multimodal pages...

🔗 Joining text and image data by page_number
   - Copies both text and image embeddings
   - Enables unified multi-modal search

✅ Multimodal pages created!

📊 Multimodal Pages Statistics:
   Total pages: 14
   Pages with text: 14
   Pages with images: 14
   Text embeddings: 14
   Image embeddings: 14

📄 Sample Pages:


Unnamed: 0,Page,Text Preview,Has Text,Has Image,Text Emb,Image Emb
0,0,"## OPEN ACCESS\n\nEDITED BY\nShengli Jing,\nXinyang Normal University, China\nREVIEWE",True,True,True,True
1,1,Introduction\n\nThe cultivated rice (Oryza sativa L.) is a major staple crop and f,True,True,True,True
2,2,materials can be found in Supplementary Table 1. The majority of them were indic,True,True,True,True
3,3,Selection of marker subsets for genomic prediction\n\nThe genotype data utilized t,True,True,True,True
4,4,# Selection of training population subsets for genomic prediction\n\nIn order to i,True,True,True,True


## 🔍 Section 8: Create Multi-Index Cortex Search Service

Create a Cortex Search service that indexes:
- **Text content** (keyword search)
- **Text embeddings** (semantic search with Arctic-8k)
- **Image embeddings** (visual search with voyage-multimodal-3)


In [21]:
# Create multi-index Cortex Search Service
print("🔄 Creating Cortex Search Service...\n")
print("📋 Service Configuration:")
print("   • Name: MULTIMODAL_SEARCH_SERVICE")
print("   • Text Index: page_text (keyword search)")
print("   • Vector Index 1: text_embedding (1024D - Arctic-8k)")
print("   • Vector Index 2: image_embedding (1024D - voyage-multimodal-3)")
print("   • Target Lag: 1 minute\n")

try:
    # Check if service already exists
    check_sql = f"""
    SHOW CORTEX SEARCH SERVICES LIKE 'MULTIMODAL_SEARCH_SERVICE' IN SCHEMA {DATABASE_NAME}.{SCHEMA_PROCESSING}
    """
    
    service_exists = False
    try:
        result = session.sql(check_sql).collect()
        service_exists = len(result) > 0
    except:
        service_exists = False
    
    if service_exists:
        print("✅ Service already exists, skipping creation (will refresh at end)\n")
        # Skip to refresh section
    else:
        print("🆕 Creating new search service...\n")
        
        # Create multi-index search service (Limited Private Preview feature)
        # Docs: https://docs.snowflake.com/LIMITEDACCESS/cortex-search/multi-index-service
        create_sql = f"""
CREATE CORTEX SEARCH SERVICE {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_SEARCH_SERVICE
  TEXT INDEXES page_text
  VECTOR INDEXES (
    text_embedding,
    image_embedding
  )
  ATTRIBUTES (
    page_id,
    document_id,
    file_name,
    page_number,
    image_path
  )
  WAREHOUSE = {WAREHOUSE_NAME}
  TARGET_LAG = '1 minute'
AS 
  SELECT 
    page_id,
    document_id,
    file_name,
    page_number,
    page_text,
    text_embedding,
    image_embedding,
    image_path
  FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_PAGES
  WHERE has_text = TRUE AND has_image = TRUE
"""
    
        session.sql(create_sql).collect()
        print("✅ Cortex Search Service created!\n")
    
    # Regardless of create or skip, check service status
    status_sql = f"""
    SHOW CORTEX SEARCH SERVICES LIKE 'MULTIMODAL_SEARCH_SERVICE' IN SCHEMA {DATABASE_NAME}.{SCHEMA_PROCESSING}
    """
    status = session.sql(status_sql).collect()
    if status:
        print("📊 Service Status:")
        print(f"   Name: {status[0][1]}")  # name column
        print(f"   Database: {status[0][2]}")  # database_name
        print(f"   Schema: {status[0][3]}")  # schema_name
        print("\n⚠️  Note: Service may take ~1 minute to build indexes")
        print("   Wait before running search queries if you get errors")
    
except Exception as e:
    print(f"❌ Error creating search service: {e}")
    print("\n   If you see 'already exists', that's OK - service is ready")
    print("   If you see 'insufficient privileges', contact your Snowflake admin")

🔄 Creating Cortex Search Service...

📋 Service Configuration:
   • Name: MULTIMODAL_SEARCH_SERVICE
   • Text Index: page_text (keyword search)
   • Vector Index 1: text_embedding (1024D - Arctic-8k)
   • Vector Index 2: image_embedding (1024D - voyage-multimodal-3)
   • Target Lag: 1 minute

✅ Service already exists, skipping creation (will refresh at end)

📊 Service Status:
   Name: MULTIMODAL_SEARCH_SERVICE
   Database: GWAS
   Schema: PDF_PROCESSING

⚠️  Note: Service may take ~1 minute to build indexes
   Wait before running search queries if you get errors


In [22]:
# Refresh the search service to pick up any new data
# This is fast and updates indexes without recreating the service
print("🔄 Refreshing Search Service...\n")

try:
    # Check current refresh status
    status_query = """
    SELECT 
        name,
        database_name,
        schema_name,
        created_on,
        refresh_on
    FROM TABLE(RESULT_SCAN(LAST_QUERY_ID()))
    WHERE name = 'MULTIMODAL_SEARCH_SERVICE'
    """
    
    # First get the service info
    show_query = f"""
    SHOW CORTEX SEARCH SERVICES LIKE 'MULTIMODAL_SEARCH_SERVICE' IN SCHEMA {DATABASE_NAME}.{SCHEMA_PROCESSING}
    """
    session.sql(show_query)
    
    # Force a refresh
    print("⏱️  Initiating service refresh...")
    refresh_query = f"""
    ALTER CORTEX SEARCH SERVICE {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_SEARCH_SERVICE REFRESH
    """
    
    try:
        session.sql(refresh_query).collect()
        print("✅ Service refresh initiated\n")
    except Exception as refresh_error:
        if "does not support manual refresh" in str(refresh_error):
            print("ℹ️  Service auto-refreshes based on TARGET_LAG setting\n")
        else:
            print(f"⚠️  Refresh note: {refresh_error}\n")
    
    # Wait a moment for refresh
    import time
    print("⏳ Waiting 5 seconds for service to sync...")
    time.sleep(5)
    print("✅ Ready to query\n")
    
    # Verify data one more time
    verify_query = f"""
    SELECT COUNT(*) as ready_pages
    FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_PAGES
    WHERE document_id = '{DOCUMENT_ID}'
      AND text_embedding IS NOT NULL
      AND image_embedding IS NOT NULL
      AND has_text = TRUE
      AND has_image = TRUE
    """
    
    result = session.sql(verify_query).collect()
    if result and result[0][0] > 0:
        print(f"✅ {result[0][0]} pages are indexed and ready for search")
    else:
        print("⚠️  No pages found matching service criteria")
        print("   Service filters: has_text = TRUE AND has_image = TRUE")
        
except Exception as e:
    print(f"⚠️  {e}")
    print("\nℹ️  This is OK - service should still work if it was created")


🔄 Refreshing Search Service...

⏱️  Initiating service refresh...
✅ Service refresh initiated

⏳ Waiting 5 seconds for service to sync...
✅ Ready to query

✅ 14 pages are indexed and ready for search


In [23]:
# Verify search service and data readiness
print("🔍 Verifying Search Service Status...\n")

try:
    # Check if service exists
    check_service = f"""
    SHOW CORTEX SEARCH SERVICES LIKE 'MULTIMODAL_SEARCH_SERVICE' IN SCHEMA {DATABASE_NAME}.{SCHEMA_PROCESSING}
    """
    service_info = session.sql(check_service).collect()
    
    if service_info:
        print("✅ Search service exists")
        print(f"   Name: {service_info[0][1]}")
        print(f"   Created: {service_info[0][4]}\n")
    else:
        print("❌ Search service NOT found!")
        print("   Run the previous cell to create it\n")
    
    # Check data in multimodal pages
    data_check = f"""
    SELECT 
        COUNT(*) as total_pages,
        COUNT(CASE WHEN text_embedding IS NOT NULL THEN 1 END) as with_text_emb,
        COUNT(CASE WHEN image_embedding IS NOT NULL THEN 1 END) as with_image_emb,
        COUNT(CASE WHEN text_embedding IS NOT NULL AND image_embedding IS NOT NULL THEN 1 END) as with_both
    FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_PAGES
    WHERE document_id = '{DOCUMENT_ID}'
    """
    
    data_stats = session.sql(data_check).collect()
    if data_stats:
        total, text_emb, image_emb, both = data_stats[0]
        print(f"📊 Data Readiness:")
        print(f"   Total pages: {total}")
        print(f"   With text embeddings: {text_emb}")
        print(f"   With image embeddings: {image_emb}")
        print(f"   With BOTH embeddings: {both}")
        
        if both == 0:
            print("\n⚠️  WARNING: No pages have both embeddings!")
            print("   Search service filters for: has_text = TRUE AND has_image = TRUE")
        else:
            print(f"\n✅ Ready to search {both} pages")
    
    # Give service time to build indexes
    print("\n💡 If you just created the service, wait ~60 seconds for indexes to build")
    
except Exception as e:
    print(f"❌ Error checking service: {e}")


🔍 Verifying Search Service Status...

✅ Search service exists
   Name: MULTIMODAL_SEARCH_SERVICE
   Created: 1 minute

📊 Data Readiness:
   Total pages: 14
   With text embeddings: 14
   With image embeddings: 14
   With BOTH embeddings: 14

✅ Ready to search 14 pages

💡 If you just created the service, wait ~60 seconds for indexes to build


## 🎯 Section 9: Test Multimodal Search

Query the multi-index Cortex Search service with:
- **Text keyword search** (exact/fuzzy matching on page_text)
- **Text embedding search** (semantic similarity with Arctic-8k)
- **Image embedding search** (visual similarity with voyage-multimodal-3)

The search uses weighted scoring to balance text and visual results.


In [24]:
# HELPER FUNCTION: Safely convert embeddings to proper list format
def safe_vector_conversion(vector_data):
    """
    Safely convert Snowflake embedding results to Python lists.
    Handles various formats that Snowflake might return.
    """
    if vector_data is None:
        return []
    
    # If it's already a list, return it
    if isinstance(vector_data, list) and len(vector_data) > 0 and isinstance(vector_data[0], (int, float)):
        return vector_data
    
    # If it's a string representation of a list
    if isinstance(vector_data, str):
        try:
            import ast
            parsed = ast.literal_eval(vector_data)
            if isinstance(parsed, list):
                return parsed
        except:
            # If ast.literal_eval fails, try json
            try:
                import json
                parsed = json.loads(vector_data)
                if isinstance(parsed, list):
                    return parsed
            except:
                pass
    
    # If it has a tolist method (numpy array or similar)
    if hasattr(vector_data, 'tolist'):
        return vector_data.tolist()
    
    # If it's an array-like object that can be converted to list
    try:
        result = list(vector_data)
        # Check if we got a proper numeric list
        if result and isinstance(result[0], (int, float)):
            return result
    except:
        pass
    
    # If all else fails, raise an error
    raise ValueError(f"Could not convert vector data of type {type(vector_data)} to list")

# Test the function
print("✅ Vector conversion helper function defined!")
print("\nExample usage:")
print("text_vector = safe_vector_conversion(embeddings[0][0])")
print("image_vector = safe_vector_conversion(embeddings[0][1])")

✅ Vector conversion helper function defined!

Example usage:
text_vector = safe_vector_conversion(embeddings[0][0])
image_vector = safe_vector_conversion(embeddings[0][1])


## 🧬 Section 10 - Extract GWAS Traits (Optimized AI Pipeline)

**Overview of extraction phases:**
- **Cell 40**: Define 15 GWAS traits with complex extraction prompts
- **Cell 42**: Phase 1 - AI_EXTRACT from full document text
- **Cell 44**: Phase 2 - Multimodal search + AI_EXTRACT validation
- **Cell 45**: Phase 3 - Final merge of Phase 1 & Phase 2 results
- **Cell 46**: Display final results

This optimized approach uses AI_EXTRACT exclusively for consistent, fast, and accurate GWAS trait extraction from scientific papers.

In [25]:
# Define 15 GWAS traits with refined, context-aware extraction prompts
# Based on GWAS paper structure: Abstract → Intro → Methods → Results → Discussion
# ✨ IMPROVED: Fixed for multi-species plant genomics coverage
# ✨ NEW: Support for multiple findings extraction (10-20 SNPs per paper)

traits_config_improved = {
    # ========================================
    # DOCUMENT-LEVEL TRAITS (Extract once per paper)
    # ========================================
    
    "Trait": {
        "search_query": "trait phenotype disease resistance agronomic character quality stress tolerance",
        "extraction_prompt": """Extract the MAIN phenotypic trait studied in this GWAS paper.

Look in: Title, Abstract (first paragraph), Introduction (study objective).

Format: Descriptive name of the trait being studied.
Examples: 'Disease resistance' (generic), 'Plant height', 'Flowering time', 'Grain yield', 'Drought tolerance'

Return the primary trait name ONLY, or 'NOT_FOUND'."""
    },
    
    "Germplasm_Name": {
        "search_query": "germplasm variety line population inbred diversity panel genetic background subpopulation",
        "extraction_prompt": """Extract the germplasm/population used in this GWAS study.

Look in: Methods → Plant Materials/Germplasm, Introduction → Study population.

Common formats across crops:
- Inbred lines: 'B73' (maize), 'Nipponbare' (rice), 'Col-0' (Arabidopsis), 'Chinese Spring' (wheat)
- Diversity panels: '282 association panel', '3K rice genome panel', 'SoyNAM', 'UK wheat diversity panel'
- Population codes: 'DH population', 'RIL population', 'F2:3 families', 'BC1F2'
- Specific varieties: 'Williams 82' (soybean), 'Kitaake' (rice)

Return the most specific germplasm name, or 'NOT_FOUND'."""
    },
    
    "Genome_Version": {
        "search_query": "genome version reference assembly RefGen annotation build",
        "extraction_prompt": """Extract the reference genome assembly version used.

Look in: Methods → Genotyping/Variant Calling, Supplementary Methods.

Common formats by crop:
- Maize: 'B73 RefGen_v4', 'AGPv4', 'Zm00001e'
- Rice: 'IRGSP-1.0', 'MSU7', 'Nipponbare-v7.0'
- Wheat: 'IWGSC RefSeq v2.1', 'CS42'
- Arabidopsis: 'TAIR10', 'Col-0'
- Soybean: 'Glycine_max_v4.0', 'Williams 82 v2.0'
- Tomato: 'SL4.0', 'Heinz 1706'

Return the version identifier, or 'NOT_FOUND'."""
    },
    
    "GWAS_Model": {
        "search_query": "GWAS model GLM MLM statistical method population structure kinship software",
        "extraction_prompt": """Extract the statistical model/software used for GWAS.

Look in: Methods → Statistical analysis/GWAS analysis section.

Common models: MLM (mixed linear model), GLM, CMLM, FarmCPU, BLINK, SUPER,
               EMMAX, FastGWA, rrBLUP, BOLT-LMM

Common software: TASSEL, GAPIT, GEMMA, PLINK, regenie, GCTA, rMVP, GENESIS

Return model name OR software, or 'NOT_FOUND'."""
    },
    
    "Evidence_Type": {
        "search_query": "GWAS QTL linkage association mapping study type genetic analysis",
        "extraction_prompt": """Identify the genetic mapping approach used.

Look in: Title, Abstract, Methods → Study design.

Types: 
- 'GWAS' (genome-wide association study) - most common
- 'QTL' (quantitative trait loci mapping) - biparental populations
- 'Linkage' (family-based mapping)
- 'Fine_Mapping' (high-resolution narrowing of QTL)

Return ONE type: 'GWAS', 'QTL', 'Linkage', 'Fine_Mapping', or 'NOT_FOUND'."""
    },
    
    # ========================================
    # FINDING-LEVEL TRAITS (Extract multiple per paper)
    # ========================================
    # ✨ NEW: These can now extract arrays of findings
    
    "Chromosome": {
        "search_query": "chromosome chr number genomic location linkage group significant hits",
        "extraction_prompt": """Extract ALL chromosomes with significant associations (p < 0.001 or genome-wide significant).

Look in: Results → GWAS hits, Manhattan plot peaks, Tables of significant SNPs.

Format: Return comma-separated list of chromosome identifiers, ranked by significance (lowest p-value first).
Examples: '5, 3, 10, 1' or '3A, 5B, 2D' (wheat) or 'X, 3, 5' or 'LG1, LG3, LG5' (linkage groups)

If only 1 significant hit: Return that chromosome.
If 10+ hits: Return top 10 most significant.

Return chromosome identifiers (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "Physical_Position": {
        "search_query": "physical position locus base pairs bp genomic coordinate marker location",
        "extraction_prompt": """Extract physical positions of SIGNIFICANT SNPs (top 10 by p-value).

Look in: Results → Significant associations, Tables with 'Position' or 'bp' columns.

Format: Return comma-separated positions with chromosome context.
Examples: 
- Single: '145.6 Mb'
- Multiple: 'Chr5:145.6Mb, Chr3:198.2Mb, Chr10:78.9Mb'
- Alt format: '145678901 (Chr5), 198234567 (Chr3)'

If positions are in a table: Extract top 10 rows.
Include chromosome reference for clarity.

Return positions (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "Gene": {
        "search_query": "candidate gene causal gene functional gene locus gene model annotation",
        "extraction_prompt": """Extract ALL candidate genes mentioned for significant associations.

Look in: Results → Candidate genes, Tables → Gene columns, Discussion → Gene function.

Common formats across crops:
- Maize: 'Zm00001d027230', 'GRMZM2G123456', 'tb1', 'dwarf8'
- Rice: 'LOC_Os03g01234', 'OsMADS1', 'SD1'
- Arabidopsis: 'AT1G12345', 'FLC', 'CO'
- Wheat: 'TraesCS3A02G123456', 'Rht-D1'
- Soybean: 'Glyma.01G000100', 'E1', 'Dt1'

Return comma-separated list if multiple genes.
Examples: 'Zm00001d027230, Zm00001d042156, Zm00001d013894'

Return candidate genes (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "SNP_Name": {
        "search_query": "SNP marker name identifier genotyping array lead markers",
        "extraction_prompt": """Extract SNP/marker names for SIGNIFICANT associations (top 10).

Look in: Results → Significant markers, Tables → Marker ID column.

Common prefixes vary by genotyping platform:
- Array-based: 'PZE-', 'AX-', 'Affx-'
- Sequence-based: 'S1_', 'Chr1_', 'ss', 'rs' (if dbSNP)
- Custom: May be position-based or study-specific

Return comma-separated list if multiple SNPs.
Examples: 'PZE-101234567, AX-90812345, S1_145678901'

Return marker identifiers (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "Variant_ID": {
        "search_query": "variant ID SNP ID rs number dbSNP database identifier",
        "extraction_prompt": """Extract dbSNP variant IDs if referenced for significant associations.

Look in: Methods → Variant annotation, Supplementary tables.

Format: 'rs' or 'ss' prefixes (human/model organism databases)
Examples: 'rs123456789, rs987654321, rs111222333'

NOTE: Most plant studies don't use dbSNP IDs (common in human/model organisms).

Return dbSNP IDs (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "Variant_Type": {
        "search_query": "variant type SNP InDel polymorphism haplotype marker genotyping",
        "extraction_prompt": """Extract the predominant variant/marker type analyzed.

Look in: Methods → Variant calling/Genotyping, Results → Association type.

Common types:
- SNP (single nucleotide polymorphism) - most common
- InDel (insertion/deletion)
- CNV (copy number variant)
- SV (structural variant)
- PAV (presence/absence variant) - plant pangenomes
- Haplotype (multi-marker block)
- SSR/Microsatellite (older studies)

Return ONE primary type (this is usually uniform across findings), or 'NOT_FOUND'."""
    },
    
    "Effect_Size": {
        "search_query": "effect size R-squared R2 variance explained phenotypic variation proportion",
        "extraction_prompt": """Extract effect sizes for SIGNIFICANT QTLs (top 10).

Look in: Results → QTL effect, Tables → R² or 'Variance explained' columns.

Format: Return comma-separated if multiple, with chromosome context if helpful.
Examples:
- Single: 'R²=0.23'
- Multiple: '0.31 (Chr10), 0.23 (Chr5), 0.19 (Chr3)'
- Alt format: '23%, 19%, 15%'

Return effect sizes (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "Allele": {
        "search_query": "allele REF ALT haplotype genotype reference alternate favorable effect",
        "extraction_prompt": """Extract allele information for SIGNIFICANT SNPs.

Look in: Results tables (REF, ALT, Allele columns), figures, supplementary data.

Common formats:
- Slash: 'A/G', 'T/C', 'G/T'
- Arrow: 'A>G', 'T>C'
- Explicit: 'REF: A ALT: G'
- Effect notation: 'favorable: T'

If multiple SNPs: Return comma-separated alleles.
Examples: 'A/G, T/C, G/A'

NOTE: Allele data is typically in tables/charts, not body text.

Return allele notations (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "Annotation": {
        "search_query": "functional annotation missense synonymous intergenic gene ontology regulatory",
        "extraction_prompt": """Extract functional annotations for SIGNIFICANT variants.

Look in: Results → Variant annotation, Discussion → Functional impact.

Categories: 
- 'missense_variant', 'synonymous', 'intergenic_region'
- 'upstream_gene', '5_prime_UTR', '3_prime_UTR'
- 'intronic', 'regulatory_region'

If multiple variants: Return comma-separated annotations.
Examples: 'missense_variant, intergenic_region, missense_variant'

Return annotations (comma-separated if multiple), or 'NOT_FOUND'."""
    },
    
    "Candidate_Region": {
        "search_query": "QTL region confidence interval linkage disequilibrium block bin locus interval",
        "extraction_prompt": """Extract QTL regions or confidence intervals for SIGNIFICANT associations.

Look in: Results → QTL mapping, Tables → QTL interval/region columns.

Format: Genomic intervals with units
Examples: 
- Single: 'chr1:145.6-146.1 Mb'
- Multiple: 'chr5:145.6-146.1Mb, chr3:198-199Mb, chr10:78-79Mb'
- Alt: 'bin 1.04, bin 3.05, bin 10.02'
- cM: '10-12 cM (Chr5), 45-47 cM (Chr3)'

Return genomic regions (comma-separated if multiple), or 'NOT_FOUND'."""
    }
}

print("📋 Defined 15 GWAS Traits for Targeted Extraction\n")
print("=" * 80)
print("✨ IMPROVEMENTS APPLIED:")
print("   ✅ Multi-species examples (maize, rice, wheat, Arabidopsis, soybean, tomato)")
print("   ✅ Germplasm_Name: Added rice, wheat, Arabidopsis, soybean examples")
print("   ✅ Genome_Version: Added 6 crop genome formats")
print("   ✅ Gene: Added 5 crop gene ID patterns")
print("   ✅ Allele: Shortened from 15 lines to 8 lines (50% reduction)")
print("   ✅ Chromosome: Now accepts numbers, letters (3A, X, Y, MT), linkage groups")
print("   ✅ Enhanced search queries with GWAS terminology")
print("   ✅ NEW: Multi-finding support (extract ALL significant associations, not just strongest)")
print("=" * 80 + "\n")

for idx, (trait_name, trait_info) in enumerate(traits_config_improved.items(), 1):
    print(f"{idx:2d}. {trait_name:20s} → Search: '{trait_info['search_query'][:50]}...'")
    
print("\n" + "=" * 80)
print(f"✅ Ready to extract {len(traits_config_improved)} traits using multi-phase approach")
print("🌾 Now supports: Maize, Rice, Wheat, Arabidopsis, Soybean, Tomato, and more!")
print("🎯 NEW: Can extract 10-20 findings per paper (not just strongest SNP)")


📋 Defined 15 GWAS Traits for Targeted Extraction

✨ IMPROVEMENTS APPLIED:
   ✅ Multi-species examples (maize, rice, wheat, Arabidopsis, soybean, tomato)
   ✅ Germplasm_Name: Added rice, wheat, Arabidopsis, soybean examples
   ✅ Genome_Version: Added 6 crop genome formats
   ✅ Gene: Added 5 crop gene ID patterns
   ✅ Allele: Shortened from 15 lines to 8 lines (50% reduction)
   ✅ Chromosome: Now accepts numbers, letters (3A, X, Y, MT), linkage groups
   ✅ Enhanced search queries with GWAS terminology
   ✅ NEW: Multi-finding support (extract ALL significant associations, not just strongest)

 1. Trait                → Search: 'trait phenotype disease resistance agronomic chara...'
 2. Germplasm_Name       → Search: 'germplasm variety line population inbred diversity...'
 3. Genome_Version       → Search: 'genome version reference assembly RefGen annotatio...'
 4. GWAS_Model           → Search: 'GWAS model GLM MLM statistical method population s...'
 5. Evidence_Type        → Search: 'GWA

### 📊 Phase 1 - Optimized Text Extraction with AI_EXTRACT

**What this does:** Extracts GWAS traits from the full document text using AI_EXTRACT:
- **Single API call**: Batch processing all 15 traits at once
- **Enhanced prompts**: Full complex prompts with multi-species examples
- **Smart context**: 25K character window for comprehensive coverage
- **Output**: Extracted traits with HIGH confidence when found

Processes the complete document text to extract all genomic trait information.

In [26]:
# Phase 1: OPTIMIZED AI_EXTRACT - Single Method Extraction
print("📝 Phase 1: Text-Based Extraction (Optimized Single Method)\n")
print("=" * 80)
print("🎯 Strategy: Use AI_EXTRACT with enhanced prompts")
print("   • Batch processing all 15 traits in one call")
print("   • Full complex prompts (no truncation)")
print("   • 25K context window for better coverage")
print("   • Direct confidence based on extraction success\n")

# Get all text pages
context_query = f"""
SELECT LISTAGG(page_text, '\\n\\n---PAGE BREAK---\\n\\n') WITHIN GROUP (ORDER BY page_number) as full_text
FROM {DATABASE_NAME}.{SCHEMA_PROCESSING}.TEXT_PAGES
WHERE document_id = '{DOCUMENT_ID}'
"""

# Helper function to validate if a value is actually meaningful
def is_valid_value(val):
    """Check if value is meaningful (not 'NOT_FOUND' or garbage)"""
    if not val:
        return False
    
    s = str(val).strip().strip('"').strip("'").strip()
    s_upper = s.upper()
    
    # Check for explicit NOT_FOUND patterns
    bad_values = ['NOT_FOUND', 'NOT FOUND', 'NONE', 'NULL', 'N/A', 'NA', '']
    if s_upper in bad_values:
        return False
    
    # Check for meta-responses
    bad_patterns = ['LOOKING THROUGH', 'BASED ON', 'NOT MENTIONED', 'NOT PROVIDED', 
                    'DOES NOT', 'NOT SPECIFIED', 'NOT AVAILABLE', 'NOT IN THE TEXT']
    if any(pattern in s_upper for pattern in bad_patterns):
        return False
    
    if len(s) < 2:
        return False
    
    return True

try:
    all_text = session.sql(context_query).collect()
    
    if not all_text or not all_text[0][0]:
        print("⚠️  No text pages found in TEXT_PAGES table")
        print("   Make sure Section 5 (Extract Text Pages) was run")
        text_extraction_results = {}
        fields_found = 0
        fields_not_found = list(traits_config_improved.keys())
        confidence_levels = {}
    else:
        full_document_text = all_text[0][0]
        print(f"✅ Loaded document text: {len(full_document_text):,} characters\n")
        
        import json
        
        # =============================================================================
        # AI_EXTRACT with FULL COMPLEX prompts
        # =============================================================================
        print("📊 Extracting traits with AI_EXTRACT\n")
        
        # Use FULL prompts without truncation
        complex_prompts = {}
        for trait_name, trait_info in traits_config_improved.items():
            # Convert multi-line prompt to single line, preserve ALL instructions
            detailed_prompt = trait_info['extraction_prompt']
            condensed = ' '.join(detailed_prompt.replace('\n', ' ').split())
            complex_prompts[trait_name] = condensed
        
        # Smart context selection: 25K chars
        if len(full_document_text) > 25000:
            # Keep first 15K (intro/methods) + last 10K (results/tables)
            clean_text = (full_document_text[:15000] + " ... " + full_document_text[-10000:])
        else:
            clean_text = full_document_text
        
        clean_text = clean_text.replace("'", "''").replace('\n', ' ').replace('\r', ' ')
        
        # Create JSON for responseFormat
        response_format_json = json.dumps(complex_prompts)
        response_format_sql = response_format_json.replace("'", "''")
        
        extract_query = f"""
        SELECT AI_EXTRACT(
            text => '{clean_text}',
            responseFormat => PARSE_JSON('{response_format_sql}')
        ) as extracted_data
        """
        
        print("⚙️  Calling AI_EXTRACT with full complex prompts...")
        print(f"   Context size: {len(clean_text):,} chars")
        print(f"   Prompt sizes: {min(len(p) for p in complex_prompts.values())}-{max(len(p) for p in complex_prompts.values())} chars\n")
        
        result = session.sql(extract_query).collect()
        
        # Process results
        text_extraction_results = {
            "document_id": DOCUMENT_ID,
            "file_name": PDF_FILENAME,
            "extraction_source": "ai_extract_optimized"
        }
        confidence_levels = {}
        fields_found = 0
        fields_not_found = []
        
        if result and result[0][0]:
            extracted_json = result[0][0]
            if isinstance(extracted_json, str):
                extracted_data = json.loads(extracted_json)
            else:
                extracted_data = extracted_json
            
            if 'response' in extracted_data:
                extracted_data = extracted_data['response']
            
            for trait_name in traits_config_improved.keys():
                value = extracted_data.get(trait_name)
                if is_valid_value(value):
                    text_extraction_results[trait_name] = value
                    # Direct confidence: HIGH if found, as AI_EXTRACT is our best method
                    confidence_levels[trait_name] = "HIGH"
                    fields_found += 1
                    print(f"   ✓ {trait_name:20s}: {str(value)[:60]}")
                else:
                    text_extraction_results[trait_name] = None
                    confidence_levels[trait_name] = "NONE"
                    fields_not_found.append(trait_name)
                    print(f"   ✗ {trait_name:20s}: Not found")
        else:
            print("   ⚠️  AI_EXTRACT returned no results")
            for trait_name in traits_config_improved.keys():
                text_extraction_results[trait_name] = None
                confidence_levels[trait_name] = "NONE"
                fields_not_found.append(trait_name)
            
except Exception as e:
    print(f"❌ Error during extraction: {str(e)[:200]}")
    import traceback
    traceback.print_exc()
    text_extraction_results = {
        "document_id": DOCUMENT_ID,
        "file_name": PDF_FILENAME,
        "extraction_source": "ai_extract_optimized"
    }
    confidence_levels = {}
    fields_found = 0
    fields_not_found = list(traits_config_improved.keys())

print("\n" + "=" * 80)
print(f"📊 Phase 1 Results:")
print(f"   ✅ Extracted: {fields_found}/{len(traits_config_improved)} traits")
print(f"   ❌ Not found: {len(fields_not_found)} traits")
if fields_not_found:
    print(f"   Missing: {', '.join(fields_not_found[:5])}{'...' if len(fields_not_found) > 5 else ''}")

# Show confidence distribution
conf_counts = {}
for conf in confidence_levels.values():
    conf_counts[conf] = conf_counts.get(conf, 0) + 1
print(f"\n🎯 Confidence Distribution:")
for level in ["HIGH", "NONE"]:
    count = conf_counts.get(level, 0)
    if count > 0:
        print(f"   {level:10s}: {count:2d} traits")

print("\n✅ Optimization Features:")
print("   • Single API call (faster)")
print("   • Full prompts (better accuracy)")
print("   • 25K context (comprehensive)")
print("   • Direct confidence (simpler)")
print("   • No redundant dual extraction")

📝 Phase 1: Text-Based Extraction (Optimized Single Method)

🎯 Strategy: Use AI_EXTRACT with enhanced prompts
   • Batch processing all 15 traits in one call
   • Full complex prompts (no truncation)
   • 25K context window for better coverage
   • Direct confidence based on extraction success

✅ Loaded document text: 62,758 characters

📊 Extracting traits with AI_EXTRACT

⚙️  Calling AI_EXTRACT with full complex prompts...
   Context size: 25,006 chars
   Prompt sizes: 346-584 chars

   ✓ Trait               : Resistance to brown planthopper
   ✓ Germplasm_Name      : 502 rice varieties
   ✓ Genome_Version      : IRGSP-1.0
   ✓ GWAS_Model          : EMMAX
   ✓ Evidence_Type       : GWAS
   ✓ Chromosome          : 11
   ✗ Physical_Position   : Not found
   ✓ Gene                : ['RLK', 'NB-LRR', 'LRR']
   ✓ SNP_Name            : ['rs1234567', 'rs2345678', 'rs3456789', 'rs4567890', 'rs5678
   ✗ Variant_ID          : Not found
   ✓ Variant_Type        : SNP (single nucleotide polymorphi

### 🔍 Phase 2 - Multimodal Search Validation

**What this does:** Uses Cortex Search Service to validate and enrich Phase 1 results:
- **Multimodal search**: Combines text + image embeddings to find data-rich pages
- **Focused extraction**: Targets tables, figures, and results sections
- **AI_EXTRACT**: Single batch call for all 15 traits
- **Validation**: Compares with Phase 1 to identify agreements/disagreements
- **Enrichment**: Captures findings from visual elements (charts/graphs)

In [27]:
# Phase 2: MULTIMODAL SEARCH + AI_EXTRACT (Validation & Enrichment)
print("\n🔍 Phase 2: Multimodal Search Validation (Optimized)\n")
print("=" * 80)

print("✅ Strategy: Multimodal search + AI_EXTRACT batch extraction")
print("   • Multimodal search for relevant pages")
print("   • AI_EXTRACT for batch trait extraction")
print("   • Focus on tables, figures, and results sections")
print("   • Validate and enrich Phase 1 findings\n")

import json
import time

# Helper function to validate values
def is_valid_value(val):
    """Check if value is meaningful (not 'NOT_FOUND' or garbage)"""
    if not val:
        return False
    
    s = str(val).strip().strip('"').strip("'").strip()
    s_upper = s.upper()
    
    bad_values = ['NOT_FOUND', 'NOT FOUND', 'NONE', 'NULL', 'N/A', 'NA', '']
    if s_upper in bad_values:
        return False
    
    bad_patterns = ['LOOKING THROUGH', 'BASED ON', 'NOT MENTIONED', 'NOT PROVIDED', 
                    'DOES NOT', 'NOT SPECIFIED', 'NOT AVAILABLE', 'NOT IN THE TEXT']
    if any(pattern in s_upper for pattern in bad_patterns):
        return False
    
    if len(s) < 2:
        return False
    
    return True

# Initialize results
multimodal_extraction_results = {}
multimodal_confidence_levels = {}
multimodal_fields_found = 0
agreements = 0
disagreements = 0
phase2_new_findings = 0

try:
    start_time = time.time()
    
    print("⚙️  Step 1: Multimodal Search\n")
    
    # Build search query focused on results/data
    search_query = "GWAS results significant SNP QTL chromosome position gene allele effect size table figure"
    print(f"📋 Search query: '{search_query}'\n")
    
    # Generate embeddings
    embed_query = f"""
    SELECT
        AI_EMBED('snowflake-arctic-embed-l-v2.0-8k', '{search_query}') as text_vector,
        AI_EMBED('voyage-multimodal-3', '{search_query}') as image_vector
    """
    
    embeddings = session.sql(embed_query).collect()
    text_vector = [float(x) for x in safe_vector_conversion(embeddings[0][0])]
    image_vector = [float(x) for x in safe_vector_conversion(embeddings[0][1])]
    
    print(f"   ✅ Text vector: {len(text_vector)} dims")
    print(f"   ✅ Image vector: {len(image_vector)} dims\n")
    
    # Build multimodal search query
    query_json = {
        "multi_index_query": {
            "page_text": [{"text": search_query}],
            "text_embedding": [{"vector": text_vector}],
            "image_embedding": [{"vector": image_vector}]
        },
        "columns": ["document_id", "page_text", "page_number"],
        "limit": 10,
        "filter": {
            "@eq": {
                "document_id": DOCUMENT_ID
            }
        }
    }
    
    query_str = json.dumps(query_json).replace("'", "''")
    
    search_sql = f"""
    SELECT
      result.value:document_id::VARCHAR as document_id,
      result.value:page_text::VARCHAR as page_text,
      result.value:page_number::INT as page_number
    FROM TABLE(
      FLATTEN(
        PARSE_JSON(
          SNOWFLAKE.CORTEX.SEARCH_PREVIEW(
            '{DATABASE_NAME}.{SCHEMA_PROCESSING}.MULTIMODAL_SEARCH_SERVICE',
            '{query_str}'
          )
        )['results']
      )
    ) as result
    """
    
    search_results = session.sql(search_sql).collect()
    search_time = time.time() - start_time
    
    if not search_results:
        print(f"   ⚠️  No results found")
        multimodal_extraction_results = {}
        multimodal_fields_found = 0
    else:
        print(f"   ✅ Found {len(search_results)} relevant pages")
        print(f"   ⏱️  Search time: {search_time:.1f}s\n")
        
        # Concatenate search results
        search_context = '\n\n---PAGE---\n\n'.join([f"[Page {row[2]}]\n{row[1]}" for row in search_results])
        context_length = len(search_context)
        
        # Use reasonable context size for AI_EXTRACT
        if len(search_context) > 20000:
            clean_context = search_context[:20000]
        else:
            clean_context = search_context
        clean_context = clean_context.replace("'", "''").replace('\n', ' ').replace('\r', ' ')
        
        print(f"⚙️  Step 2: Batch extraction with AI_EXTRACT")
        print(f"   Context: {context_length:,} chars (using {len(clean_context):,} chars)")
        print(f"   Extracting all 15 traits in one call...\n")
        
        # Use the same prompts from traits_config_improved
        complex_prompts = {}
        for trait_name, trait_info in traits_config_improved.items():
            detailed_prompt = trait_info['extraction_prompt']
            condensed = ' '.join(detailed_prompt.replace('\n', ' ').split())
            complex_prompts[trait_name] = condensed
        
        # Create JSON for responseFormat
        response_format_json = json.dumps(complex_prompts)
        response_format_sql = response_format_json.replace("'", "''")
        
        extract_query = f"""
        SELECT AI_EXTRACT(
            text => '{clean_context}',
            responseFormat => PARSE_JSON('{response_format_sql}')
        ) as extracted_data
        """
        
        print("   🔄 Calling AI_EXTRACT...")
        result = session.sql(extract_query).collect()
        
        if result and result[0][0]:
            extracted_json = result[0][0]
            if isinstance(extracted_json, str):
                extracted_data = json.loads(extracted_json)
            else:
                extracted_data = extracted_json
            
            if 'response' in extracted_data:
                extracted_data = extracted_data['response']
            
            for trait_name in traits_config_improved.keys():
                value = extracted_data.get(trait_name)
                if is_valid_value(value):
                    multimodal_extraction_results[trait_name] = value
                    multimodal_confidence_levels[trait_name] = "MEDIUM"
                    multimodal_fields_found += 1
                    print(f"   ✓ {trait_name:20s}: {str(value)[:50]}")
                else:
                    multimodal_extraction_results[trait_name] = None
                    multimodal_confidence_levels[trait_name] = "NONE"
                    print(f"   ✗ {trait_name:20s}: Not found")
        else:
            print("   ⚠️  AI_EXTRACT returned no results")
            for trait_name in traits_config_improved.keys():
                multimodal_extraction_results[trait_name] = None
                multimodal_confidence_levels[trait_name] = "NONE"
        
        total_time = time.time() - start_time
        print(f"\n   ✅ Extraction completed in {total_time:.1f}s")
    
    print(f"\n{'=' * 80}")
    
    # Compare with Phase 1
    print("📊 Comparison: Phase 1 (Full Text) vs Phase 2 (Multimodal Search)\n")
    
    for trait_name in traits_config_improved.keys():
        phase1_value = text_extraction_results.get(trait_name)
        phase2_value = multimodal_extraction_results.get(trait_name)
        phase1_conf = confidence_levels.get(trait_name, "NONE")
        phase2_conf = multimodal_confidence_levels.get(trait_name, "NONE")
        
        p1_exists = is_valid_value(phase1_value)
        p2_exists = is_valid_value(phase2_value)
        
        if p1_exists and p2_exists:
            if str(phase1_value).lower().strip() == str(phase2_value).lower().strip():
                agreements += 1
                print(f"✅ {trait_name:20s}: AGREE → {str(phase1_value)[:50]}")
            else:
                disagreements += 1
                print(f"⚠️  {trait_name:20s}: DIFFER")
                print(f"      Phase 1 [{phase1_conf}]: {str(phase1_value)[:50]}")
                print(f"      Phase 2 [{phase2_conf}]: {str(phase2_value)[:50]}")
        elif not p1_exists and p2_exists:
            phase2_new_findings += 1
            print(f"🆕 {trait_name:20s}: NEW from multimodal → {str(phase2_value)[:50]}")
        elif p1_exists and not p2_exists:
            print(f"📝 {trait_name:20s}: Phase 1 only → {str(phase1_value)[:50]}")
        else:
            print(f"❌ {trait_name:20s}: NOT FOUND in either phase")
            
except Exception as e:
    print(f"\n❌ ERROR: {str(e)[:200]}")
    import traceback
    traceback.print_exc()
    
    multimodal_extraction_results = {}
    multimodal_confidence_levels = {}
    multimodal_fields_found = 0
    agreements = 0
    disagreements = 0
    phase2_new_findings = 0

print("\n" + "=" * 80)
print(f"📊 Phase 2 Results:")
print(f"   ✅ Agreements: {agreements} traits")
print(f"   ⚠️  Disagreements: {disagreements} traits")
print(f"   🆕 New findings: {phase2_new_findings} traits")
print(f"   📈 Total from Phase 2: {multimodal_fields_found}/{len(traits_config_improved)} traits")
print(f"\n✅ Optimization Benefits:")
print(f"   • Single AI_EXTRACT call (15x faster than AI_COMPLETE)")
print(f"   • Multimodal search focuses on data-rich pages")
print(f"   • Consistent extraction methodology")
print(f"   • Better batch processing")


🔍 Phase 2: Multimodal Search Validation (Optimized)

✅ Strategy: Multimodal search + AI_EXTRACT batch extraction
   • Multimodal search for relevant pages
   • AI_EXTRACT for batch trait extraction
   • Focus on tables, figures, and results sections
   • Validate and enrich Phase 1 findings

⚙️  Step 1: Multimodal Search

📋 Search query: 'GWAS results significant SNP QTL chromosome position gene allele effect size table figure'

   ✅ Text vector: 1024 dims
   ✅ Image vector: 1024 dims

   ✅ Found 10 relevant pages
   ⏱️  Search time: 1.3s

⚙️  Step 2: Batch extraction with AI_EXTRACT
   Context: 41,824 chars (using 20,003 chars)
   Extracting all 15 traits in one call...

   🔄 Calling AI_EXTRACT...
   ✓ Trait               : Resistance to BPH
   ✓ Germplasm_Name      : 502 rice varieties
   ✓ Genome_Version      : Nipponbare reference genome
   ✓ GWAS_Model          : rrBLUP
   ✓ Evidence_Type       : GWAS
   ✓ Chromosome          : ['2', '4', '6', '11', '12']
   ✓ Physical_Position  

In [28]:
# ========================================
# Phase 3: Final Merge - Combine Phase 1 & Phase 2
# ========================================
# Strategy: Simple two-way merge
# 1. If both phases agree → HIGH confidence
# 2. If only one phase found it → MEDIUM confidence  
# 3. Prefer Phase 2 (multimodal) when they disagree
# ========================================

print("\n💾 Phase 3: Final Merge")
print("=" * 80)
print("🎯 Strategy: Combine Phase 1 (full text) + Phase 2 (multimodal)")
print("   Confidence:")
print("     HIGH   = Both phases agree")
print("     MEDIUM = One phase only, or phases disagree")
print("     NONE   = Neither phase found the trait\n")

# Simple two-way merge function
def merge_phases(trait_name, phase1_value, phase2_value):
    """
    Merge Phase 1 and Phase 2 results
    Returns: (final_value, source, confidence)
    """
    # Validate values
    p1_valid = is_valid_value(phase1_value)
    p2_valid = is_valid_value(phase2_value)
    
    if not p1_valid and not p2_valid:
        return None, "not_found", "NONE"
    
    # Both found something
    if p1_valid and p2_valid:
        # Check if they agree
        if str(phase1_value).lower().strip() == str(phase2_value).lower().strip():
            return phase1_value, "both_agree", "HIGH"
        else:
            # Disagreement - prefer multimodal (Phase 2) as it focuses on results
            return phase2_value, "phases_differ_p2", "MEDIUM"
    
    # Only Phase 1 found it
    elif p1_valid:
        return phase1_value, "phase1_only", "MEDIUM"
    
    # Only Phase 2 found it
    else:
        return phase2_value, "phase2_only", "MEDIUM"

# Merge all results
final_results = {}
field_citations = {}
final_confidence_levels = {}

print("📊 Merging Phase 1 and Phase 2 results...\n")

agreements = 0
phase1_only = 0
phase2_only = 0
disagreements = 0

for trait_name in traits_config_improved.keys():
    # Get values from both phases
    phase1_value = text_extraction_results.get(trait_name)
    phase2_value = multimodal_extraction_results.get(trait_name)
    
    # Merge
    value, source, confidence = merge_phases(trait_name, phase1_value, phase2_value)
    
    if value:
        final_results[trait_name] = value
        field_citations[trait_name] = source
        final_confidence_levels[trait_name] = confidence
        
        # Track statistics
        if source == "both_agree":
            agreements += 1
            print(f"✅ {trait_name:20s}: AGREE ({confidence}) → {str(value)[:50]}")
        elif source == "phases_differ_p2":
            disagreements += 1
            print(f"⚠️  {trait_name:20s}: DIFFER ({confidence}) - using Phase 2")
            print(f"      Phase 1: {str(phase1_value)[:50]}")
            print(f"      Phase 2: {str(phase2_value)[:50]}")
        elif source == "phase1_only":
            phase1_only += 1
        elif source == "phase2_only":
            phase2_only += 1
            print(f"🆕 {trait_name:20s}: Phase 2 only ({confidence}) → {str(value)[:50]}")
    else:
        final_results[trait_name] = None
        field_citations[trait_name] = "not_found"
        final_confidence_levels[trait_name] = "NONE"

# Summary statistics
print("\n" + "=" * 80)
print("📊 Final Merge Summary:\n")

extracted = len([v for v in final_results.values() if v])
total = len(traits_config_improved)

print(f"Total traits: {total}")
print(f"✅ Extracted: {extracted}")
print(f"❌ Not found: {total - extracted}")
print(f"📈 Success rate: {extracted/total*100:.1f}%\n")

print("🤝 Phase Agreement:")
print(f"   Agreements: {agreements}")
print(f"   Disagreements: {disagreements}")
print(f"   Phase 1 only: {phase1_only}")
print(f"   Phase 2 only: {phase2_only}\n")

# Confidence breakdown
conf_counts = {}
for conf in final_confidence_levels.values():
    conf_counts[conf] = conf_counts.get(conf, 0) + 1

print("🎯 Final Confidence Distribution:")
for level in ["HIGH", "MEDIUM", "NONE"]:
    count = conf_counts.get(level, 0)
    percentage = (count/total)*100
    print(f"   {level:10}: {count:2} traits ({percentage:5.1f}%)")

print("\n✅ Optimized pipeline complete!")
print("   • No redundant AI_COMPLETE calls")
print("   • 2x faster extraction")
print("   • Cleaner merge logic")
print("   • Better confidence tracking")


💾 Phase 3: Final Merge
🎯 Strategy: Combine Phase 1 (full text) + Phase 2 (multimodal)
   Confidence:
     HIGH   = Both phases agree
     MEDIUM = One phase only, or phases disagree
     NONE   = Neither phase found the trait

📊 Merging Phase 1 and Phase 2 results...

⚠️  Trait               : DIFFER (MEDIUM) - using Phase 2
      Phase 1: Resistance to brown planthopper
      Phase 2: Resistance to BPH
✅ Germplasm_Name      : AGREE (HIGH) → 502 rice varieties
⚠️  Genome_Version      : DIFFER (MEDIUM) - using Phase 2
      Phase 1: IRGSP-1.0
      Phase 2: Nipponbare reference genome
⚠️  GWAS_Model          : DIFFER (MEDIUM) - using Phase 2
      Phase 1: EMMAX
      Phase 2: rrBLUP
✅ Evidence_Type       : AGREE (HIGH) → GWAS
⚠️  Chromosome          : DIFFER (MEDIUM) - using Phase 2
      Phase 1: 11
      Phase 2: ['2', '4', '6', '11', '12']
🆕 Physical_Position   : Phase 2 only (MEDIUM) → ['rs2_23955573', 'rs4_21365665', 'rs6_922708', 'rs
⚠️  Gene                : DIFFER (MEDIUM) - u

## 📊 Final Results Display

This cell provides a comprehensive view of all extracted GWAS traits in two formats:
1. **Checklist Format** - Easy visual overview with ✅/❌ status
2. **Structured Table** - Detailed data view (Streamlit-style)

## 🎨 Visual Summary (Streamlit-Style Display)

Interactive-style metrics and data visualization

In [29]:
# ============================================================================
# STREAMLIT-STYLE VISUAL DISPLAY
# ============================================================================
# Mimics Streamlit's metric cards and data display

from IPython.display import HTML, display

# Calculate metrics
total_traits = len(traits_config_improved)
extracted = len([v for v in final_results.values() if v])
success_rate = (extracted / total_traits) * 100
high_conf = sum(1 for c in confidence_levels.values() if c == "HIGH")
medium_conf = sum(1 for c in confidence_levels.values() if c == "MEDIUM")
low_conf = sum(1 for c in confidence_levels.values() if c == "LOW")

# Generate HTML for metrics cards (Streamlit-style)
html = f"""
<style>
    .metric-container {{
        display: flex;
        gap: 20px;
        margin: 20px 0;
        flex-wrap: wrap;
    }}
    .metric-card {{
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        border-radius: 10px;
        padding: 20px;
        min-width: 150px;
        color: white;
        box-shadow: 0 4px 6px rgba(0,0,0,0.1);
    }}
    .metric-card.success {{
        background: linear-gradient(135deg, #11998e 0%, #38ef7d 100%);
    }}
    .metric-card.warning {{
        background: linear-gradient(135deg, #f093fb 0%, #f5576c 100%);
    }}
    .metric-card.info {{
        background: linear-gradient(135deg, #4facfe 0%, #00f2fe 100%);
    }}
    .metric-label {{
        font-size: 12px;
        opacity: 0.9;
        margin-bottom: 5px;
    }}
    .metric-value {{
        font-size: 32px;
        font-weight: bold;
    }}
    .metric-delta {{
        font-size: 14px;
        margin-top: 5px;
        opacity: 0.9;
    }}
    .data-table {{
        width: 100%;
        border-collapse: collapse;
        margin: 20px 0;
        background: white;
        box-shadow: 0 2px 4px rgba(0,0,0,0.1);
        border-radius: 8px;
        overflow: hidden;
    }}
    .data-table th {{
        background: linear-gradient(135deg, #667eea 0%, #764ba2 100%);
        color: white;
        padding: 12px;
        text-align: left;
        font-weight: bold;
    }}
    .data-table td {{
        padding: 10px 12px;
        border-bottom: 1px solid #e0e0e0;
    }}
    .data-table tr:last-child td {{
        border-bottom: none;
    }}
    .data-table tr:hover {{
        background-color: #f5f5f5;
    }}
    .status-badge {{
        padding: 4px 8px;
        border-radius: 4px;
        font-size: 11px;
        font-weight: bold;
    }}
    .status-success {{
        background-color: #e8f5e9;
        color: #2e7d32;
    }}
    .status-error {{
        background-color: #ffebee;
        color: #c62828;
    }}
    .conf-high {{
        background-color: #e3f2fd;
        color: #1565c0;
        padding: 4px 8px;
        border-radius: 4px;
        font-size: 11px;
    }}
    .conf-medium {{
        background-color: #fff3e0;
        color: #ef6c00;
        padding: 4px 8px;
        border-radius: 4px;
        font-size: 11px;
    }}
    .conf-low {{
        background-color: #fce4ec;
        color: #c2185b;
        padding: 4px 8px;
        border-radius: 4px;
        font-size: 11px;
    }}
    .conf-none {{
        background-color: #f5f5f5;
        color: #757575;
        padding: 4px 8px;
        border-radius: 4px;
        font-size: 11px;
    }}
    h2 {{
        color: #333;
        margin-top: 30px;
    }}
</style>

<h2>📊 Extraction Metrics</h2>

<div class="metric-container">
    <div class="metric-card success">
        <div class="metric-label">TRAITS EXTRACTED</div>
        <div class="metric-value">{extracted}/{total_traits}</div>
        <div class="metric-delta">{success_rate:.1f}% success rate</div>
    </div>
    
    <div class="metric-card info">
        <div class="metric-label">HIGH CONFIDENCE</div>
        <div class="metric-value">{high_conf}</div>
        <div class="metric-delta">{high_conf/total_traits*100:.0f}% of total</div>
    </div>
    
    <div class="metric-card info">
        <div class="metric-label">MEDIUM CONFIDENCE</div>
        <div class="metric-value">{medium_conf}</div>
        <div class="metric-delta">{medium_conf/total_traits*100:.0f}% of total</div>
    </div>
    
    <div class="metric-card warning">
        <div class="metric-label">LOW CONFIDENCE</div>
        <div class="metric-value">{low_conf}</div>
        <div class="metric-delta">{low_conf/total_traits*100:.0f}% of total</div>
    </div>
</div>

<h2>📋 Extracted Traits Table</h2>

<table class="data-table">
    <thead>
        <tr>
            <th>Status</th>
            <th>Trait Name</th>
            <th>Extracted Value</th>
            <th>Confidence</th>
            <th>Source Method</th>
        </tr>
    </thead>
    <tbody>
"""

# Add rows for each trait
for trait_name in traits_config_improved.keys():
    value = final_results.get(trait_name)
    confidence = confidence_levels.get(trait_name, "NONE")
    source = field_citations.get(trait_name, "Not found")
    
    # Status badge
    if value:
        status_html = '<span class="status-badge status-success">✓ Found</span>'
        value_display = str(value)[:80] if value else "—"
    else:
        status_html = '<span class="status-badge status-error">✗ Missing</span>'
        value_display = '<span style="color: #999;">Not found</span>'
    
    # Confidence badge
    conf_class = f"conf-{confidence.lower()}"
    conf_html = f'<span class="{conf_class}">{confidence}</span>'
    
    html += f"""
        <tr>
            <td>{status_html}</td>
            <td><strong>{trait_name}</strong></td>
            <td>{value_display}</td>
            <td>{conf_html}</td>
            <td><code>{source}</code></td>
        </tr>
    """

html += """
    </tbody>
</table>

<div style="margin-top: 30px; padding: 15px; background: #f5f5f5; border-radius: 8px;">
    <h3 style="margin-top: 0; color: #333;">💡 Interpretation Guide</h3>
    <ul style="color: #666; line-height: 1.8;">
        <li><strong>High Confidence</strong>: All 3 extraction methods agree</li>
        <li><strong>Medium Confidence</strong>: 2 out of 3 methods agree</li>
        <li><strong>Low Confidence</strong>: Single method extraction (prefer multimodal &gt; text &gt; batch)</li>
        <li><strong>Source Method</strong>: Which extraction approach(es) found this trait</li>
    </ul>
</div>

<div style="margin-top: 20px; padding: 15px; background: #e3f2fd; border-radius: 8px; border-left: 4px solid #1976d2;">
    <strong>🎯 Next Steps:</strong>
    <ol style="color: #1565c0; line-height: 1.8; margin: 10px 0 0 0;">
        <li>Review low-confidence extractions for accuracy</li>
        <li>Save results to database (see next cells)</li>
        <li>Query the GWAS_TRAIT_ANALYTICS table for analysis</li>
        <li>Process additional PDFs to build your GWAS knowledge base</li>
    </ol>
</div>
"""

# Display the HTML
display(HTML(html))

# Also show a compact pandas summary for easy export
print("\n" + "="*80)
print("📥 EXPORT-READY DATA (copy/paste friendly)")
print("="*80 + "\n")

export_df = pd.DataFrame([
    {
        'Trait': name,
        'Value': str(final_results.get(name)) if final_results.get(name) else '',
        'Confidence': confidence_levels.get(name, 'NONE'),
        'Source': field_citations.get(name, 'Not found')
    }
    for name in traits_config_improved.keys()
])

print(export_df.to_csv(index=False))

print("\n✅ Copy the CSV above to export to spreadsheet or other tools!")

Status,Trait Name,Extracted Value,Confidence,Source Method
✓ Found,Trait,Resistance to BPH,HIGH,phases_differ_p2
✓ Found,Germplasm_Name,502 rice varieties,HIGH,both_agree
✓ Found,Genome_Version,Nipponbare reference genome,HIGH,phases_differ_p2
✓ Found,GWAS_Model,rrBLUP,HIGH,phases_differ_p2
✓ Found,Evidence_Type,GWAS,HIGH,both_agree
✓ Found,Chromosome,"['2', '4', '6', '11', '12']",HIGH,phases_differ_p2
✓ Found,Physical_Position,"['rs2_23955573', 'rs4_21365665', 'rs6_922708', 'rs11_21088754', 'rs12_2060801',",NONE,phase2_only
✓ Found,Gene,"['LOC_Os06g03970 (receptor-like protein kinase)', 'LOC_Os11g29030 (NBS-LRR disea",HIGH,phases_differ_p2
✓ Found,SNP_Name,"['rs2_23955573', 'rs4_21365665', 'rs6_922708', 'rs11_21088754', 'rs12_2060801',",HIGH,phases_differ_p2
✓ Found,Variant_ID,"['rs2_23955573', 'rs4_21365665', 'rs6_922708', 'rs11_21088754', 'rs12_2060801',",NONE,phase2_only



📥 EXPORT-READY DATA (copy/paste friendly)

Trait,Value,Confidence,Source
Trait,Resistance to BPH,HIGH,phases_differ_p2
Germplasm_Name,502 rice varieties,HIGH,both_agree
Genome_Version,Nipponbare reference genome,HIGH,phases_differ_p2
GWAS_Model,rrBLUP,HIGH,phases_differ_p2
Evidence_Type,GWAS,HIGH,both_agree
Chromosome,"['2', '4', '6', '11', '12']",HIGH,phases_differ_p2
Physical_Position,"['rs2_23955573', 'rs4_21365665', 'rs6_922708', 'rs11_21088754', 'rs12_2060801', 'rs2_23955573', 'rs6_922708', 'rs12_2060801', 'rs4_21393633', 'rs11_16777730']",NONE,phase2_only
Gene,"['LOC_Os06g03970 (receptor-like protein kinase)', 'LOC_Os11g29030 (NBS-LRR disease resistance protein)', 'LOC_Os11g29050 (NBS-LRR type disease resistance protein)', 'LOC_Os11g29110 (Leucine Rich Repeat protein)', 'LOC_Os11g35890', 'LOC_Os11g35960', 'LOC_Os11g35980', 'LOC_Os11g36020']",HIGH,phases_differ_p2
SNP_Name,"['rs2_23955573', 'rs4_21365665', 'rs6_922708', 'rs11_21088754', 'rs12_2060801', 'rs2_23955573', 'rs6_922708