# Salesforce BLIP3-OCR-200M Data Processing and Reasoning Trace Generation

**Objective:** Use the `Salesforce/blip3-ocr-200m` dataset to extract OCR information, then generate reasoning traces with the language model named in the project configuration. This notebook runs in a Kaggle SSH environment.

## 1. Setup and Installations

Install necessary Python packages for data handling, Hugging Face interactions, and any other dependencies required by the `multimodal_QRA_pair.py` script and the chosen models.

In [None]:
# Ensure these packages are installed in your Kaggle environment
# You might need to restart the kernel after running this cell for the first time.
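# Note: the PyPI package that provides `import dotenv` is `python-dotenv`; `pyyaml` provides `import yaml`.
!pip install datasets pandas huggingface_hub pyarrow tqdm python-dotenv pyyaml requests pillow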
# !git clone https://github.com/minojosh/moremi_reasoning.git

In [2]:
import os
import sys
from pathlib import Path


def set_working_directory():
    """Move into the cloned repository on Kaggle and make it importable."""
    # Get the current working directory
    current_dir = Path(os.getcwd())

    # Check if the script is running in a Kaggle environment
    if "KAGGLE_KERNEL_RUN_TYPE" in os.environ:
        # The repository is expected at <parent>/working/moremi_reasoning
        repo_dir = current_dir.parent / "working" / "moremi_reasoning"
        os.chdir(repo_dir)
        # sys.path.append returns None, so its result is not assigned
        sys.path.append(str(repo_dir))
    else:
        # Outside Kaggle, stay in the current directory and make it importable
        sys.path.append(str(current_dir))


set_working_directory()
os.getcwd()

'/kaggle/working/moremi_reasoning'

In [4]:
# # Option 1: Stash your local changes, pull, then reapply (recommended)
# !git stash push -m "Local config changes before pull"
# !git pull origin main
# !git stash pop

In [5]:
path = "src/config"
os.listdir(path)

['settings.yaml',
 'reasoning_config.yaml',
 'reasoning_config.yaml.backup',
 'reasoning_prompts.yaml',
 'modalities.yaml']

In [7]:
# !git config --global user.email "joshua@minohealth.org"
# !git config --global user.name "minojosh"
# # !git add 'src/config/reasoning_config.yaml'
# # !git commit -m "Update reasoning_config.yaml"
# !git push origin main

In [6]:
# login to huggingface_hub with HF_TOKEN from secrets
from huggingface_hub import login
from kaggle_secrets import UserSecretsClient

user_secrets = UserSecretsClient()
HF_TOKEN = user_secrets.get_secret("HF_TOKEN")
login(HF_TOKEN)
OPEN_ROUTER_API_KEY = user_secrets.get_secret("OPEN_ROUTER_API_KEY")
API_KEY = OPEN_ROUTER_API_KEY
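
The cell above only works inside a Kaggle kernel, because `kaggle_secrets` is not available elsewhere. A minimal sketch of a fallback, assuming the same secret names are exported as environment variables when running outside Kaggle (the helper name `get_secret` is hypothetical and not part of the project):

```python
import os


def get_secret(name, default=None):
    """Return a secret from Kaggle Secrets if possible, else from the environment."""
    try:
        # kaggle_secrets only exists inside Kaggle kernels.
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        return os.environ.get(name, default)
    try:
        return UserSecretsClient().get_secret(name)
    except Exception:
        # The secret is not attached to this notebook; fall back to the environment.
        return os.environ.get(name, default)
```

With this helper, `HF_TOKEN = get_secret("HF_TOKEN")` would behave the same on Kaggle and could also pick up a value from a local shell.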

## 2. Import Libraries

In [7]:
import os
import json
import pandas as pd
from datasets import load_dataset
from huggingface_hub import HfApi, snapshot_download
import yaml
from tqdm.auto import tqdm

# Potentially more imports will be needed based on multimodal_QRA_pair.py
# and the specific models used.

In [None]:
os.listdir()

['salesforce_data',
 'salesforce_images',
 'logs',
 'salesforce_images.zip',
 '.gitignore',
 'section_5_enhancement_summary.md',
 'all_image_files.txt',
 'ocr_test_data.json',
 'ocr_test_samples_from_list.json',
 'salesforce_ocr',
 'README.md',
 'state.db',
 'output.json',
 '.git',
 'src']

In [13]:
!rm -rf section_5_enhancement_summary.md

In [15]:
path = "salesforce_ocr"
os.listdir()

['salesforce_data',
 'salesforce_images',
 'logs',
 'salesforce_images.zip',
 '.gitignore',
 'all_image_files.txt',
 'ocr_test_data.json',
 'ocr_test_samples_from_list.json',
 'salesforce_ocr',
 'README.md',
 'state.db',
 'output.json',
 '.git']

## 3. Configuration Loading

Load project configurations. Critical paths like `data_path` and `image_dir` need to be correctly set for the Kaggle environment.

In [7]:
# Path to the configuration file (relative to this notebook)
CONFIG_PATH = "src/config/reasoning_config.yaml"
PROMPTS_PATH = "src/config/reasoning_prompts.yaml"


def load_config(path):
    with open(path, "r") as f:
        return yaml.safe_load(f)


print(f"Loading configuration from: {CONFIG_PATH}")
config = load_config(CONFIG_PATH)
print(f"Loading prompts from: {PROMPTS_PATH}")
prompts = load_config(PROMPTS_PATH)

# --- CRITICAL: REVIEW AND UPDATE THESE PATHS FOR KAGGLE ---
print("\n--- CONFIGURATION REQUIRES YOUR ATTENTION ---")
print(f"Current data_path: {config.get('data_path')}")
print(
    "ACTION: Update 'data_path' in reasoning_config.yaml to the Kaggle path for Salesforce Parquet files."
)

print(f"Current image_dir: {config.get('image_dir')}")
print(
    "ACTION: Update 'image_dir' in reasoning_config.yaml to the Kaggle path for images, if applicable."
)

print(f"Current model_name (for reasoning): {config.get('model_name')}")
print(f"API URL (for reasoning model): {config.get('api_url')}")

api_key = OPEN_ROUTER_API_KEY
if api_key:
    print(f"✓ API key already set (length: {len(api_key)})")
else:
    print("⚠️ API_KEY not set!")
    print("Please set your OpenRouter API key:")
    print("  os.environ['API_KEY'] = 'your_actual_api_key_here'")
    print("Or run in terminal: export API_KEY='your_actual_api_key_here'")
print("---------------------------------------------")

Loading configuration from: src/config/reasoning_config.yaml
Loading prompts from: src/config/reasoning_prompts.yaml

--- CONFIGURATION REQUIRES YOUR ATTENTION ---
Current data_path: /kaggle/working/moremi_reasoning/ocr_test_samples_from_list.json
ACTION: Update 'data_path' in reasoning_config.yaml to the Kaggle path for Salesforce Parquet files.
Current image_dir: /kaggle/working/moremi_reasoning/salesforce_images
ACTION: Update 'image_dir' in reasoning_config.yaml to the Kaggle path for images, if applicable.
Current model_name (for reasoning): google/gemini-2.0-flash-001
API URL (for reasoning model): https://openrouter.ai/api/v1
✓ API key already set (length: 73)
---------------------------------------------
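
Since the configured paths must exist before the later cells can run, a quick check can catch a stale `data_path` or `image_dir` early. The following is a sketch only; `check_config_paths` is a hypothetical helper, and it assumes the config is a plain dictionary as returned by `yaml.safe_load`:

```python
from pathlib import Path


def check_config_paths(cfg, keys=("data_path", "image_dir")):
    """Map each config key to 'ok', 'missing' (path not found), or 'unset'."""
    status = {}
    for key in keys:
        value = cfg.get(key)
        if not value:
            status[key] = "unset"
        elif Path(value).exists():
            status[key] = "ok"
        else:
            status[key] = "missing"
    return status
```

Calling `check_config_paths(config)` right after loading would report which entries still point to a location that does not exist in the current environment.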


## 4. Load Salesforce BLIP3-OCR-200M Dataset

Load the dataset. This might involve using `load_dataset` from Hugging Face `datasets` or directly reading Parquet files if downloaded via `huggingface_hub` or available through Kaggle datasets.

In [None]:
import logging
import sys
import os
from tqdm import tqdm

from huggingface_hub import HfApi, snapshot_download

# Parquet shards to download, identified by their zero-padded numeric prefix
prefixes = ["001", "047"]  # Hard-coded shard prefixes
# prefixes = [
#     f"{str(n).zfill(3)}" for n in random_numbers
# ]  # Use the random numbers instead
download_list = [f"{prefix}.parquet" for prefix in prefixes]

# Create a directory for data if it doesn't exist
SALESFORCE_DATA_PATH = os.path.join(os.getcwd(), "salesforce_data")
os.makedirs(SALESFORCE_DATA_PATH, exist_ok=True)

# Update the data_path in config for later use
config["data_path"] = SALESFORCE_DATA_PATH
print(f"Updated data_path in config to: {SALESFORCE_DATA_PATH}")


def download_files(download_list, output_dir):
    """
    Function to download files from the Hugging Face Hub.
    Args:
        download_list (list): List of files to download.
        output_dir (str): Directory to save the downloaded files.
    """
    api = HfApi()
    for file in tqdm(download_list, desc="Downloading parquet files"):
        try:
            api.hf_hub_download(
                repo_id="Salesforce/blip3-ocr-200m",
                repo_type="dataset",
                filename=file,
                local_dir=output_dir,
                local_dir_use_symlinks=False,
            )
            print(f"Successfully downloaded {file} to {output_dir}")
        except Exception as e:
            print(f"Error downloading {file}: {e}")


# Download the files
print(f"Downloading {len(download_list)} parquet files: {download_list}")
download_files(download_list, SALESFORCE_DATA_PATH)
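
The commented-out lines in the cell above hint at choosing shards at random. A minimal sketch of a reproducible selection follows. It assumes shard files are named with three-digit prefixes such as `001.parquet`; the number of shards and the first index are placeholders that must be checked against the dataset repository, and `pick_shard_files` is a hypothetical helper:

```python
import random


def pick_shard_files(num_shards, k, seed=42, first_index=1):
    """Pick k distinct shard file names such as '007.parquet', reproducibly."""
    # A local RNG keeps the global random state untouched.
    rng = random.Random(seed)
    indices = rng.sample(range(first_index, first_index + num_shards), k)
    return [f"{n:03d}.parquet" for n in sorted(indices)]
```

The returned list has the same shape as `download_list`, so it could be passed to `download_files` directly.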

### 4.1. Specify Parquet File and Load Data

Ensure `SALESFORCE_DATA_PATH` (set in the download cell above) points to the directory that contains your Parquet files in the Kaggle environment. We'll load one file to start.

In [6]:
import os

# Only change directory when the repository is a subdirectory of the current one.
# If we are already inside the repository, chdir would look for
# moremi_reasoning/moremi_reasoning and raise FileNotFoundError.
MOREMI_REASONING_PATH = os.path.join(os.getcwd(), "moremi_reasoning")
if os.path.isdir(MOREMI_REASONING_PATH):
    os.chdir(MOREMI_REASONING_PATH)
print(f"Current working directory: {os.getcwd()}")

In [7]:
# SALESFORCE_DATA_PATH is the directory containing the downloaded .parquet files
# (the same location used by the download cell above).
import os
import pandas as pd
import sys

os.getcwd()

# Let's list available parquet files if the path is a directory
SALESFORCE_DATA_PATH = os.path.join(os.getcwd(), "salesforce_data")
parquet_file_to_load = None
if (
    SALESFORCE_DATA_PATH
    and os.path.exists(SALESFORCE_DATA_PATH)
    and os.path.isdir(SALESFORCE_DATA_PATH)
):
    all_parquet_files = sorted(
        [
            os.path.join(SALESFORCE_DATA_PATH, f)
            for f in os.listdir(SALESFORCE_DATA_PATH)
            if f.endswith(".parquet")
        ]
    )
    if all_parquet_files:
        # Index 1 is the second file in sorted order; this assumes at least two files exist
        parquet_file_to_load = all_parquet_files[1]
        print(
            f"Found {len(all_parquet_files)} Parquet files. Will load: {parquet_file_to_load}"
        )
    else:
        print(f"ERROR: No .parquet files found in directory: {SALESFORCE_DATA_PATH}")
else:
    print(
        f"ERROR: SALESFORCE_DATA_PATH ('{SALESFORCE_DATA_PATH}') is not a valid directory or does not exist. Please run the download cell first."
    )

# Load the selected Parquet file into a Pandas DataFrame
df = None
if parquet_file_to_load:
    try:
        # Make sure pyarrow is available before reading the file
        %pip install pyarrow -q
        import pyarrow.parquet as pq

        # Read the parquet file metadata to see available columns
        parquet_metadata = pq.read_metadata(parquet_file_to_load)
        print(f"Total rows in file: {parquet_metadata.num_rows}")

        # Get column names from metadata
        column_names = list(parquet_metadata.schema.names)
        print(f"Available columns: {column_names}")

        # Read the full table with all columns (every row is loaded into memory;
        # sampling happens in the next cell)
        table = pq.read_table(
            parquet_file_to_load,
            columns=column_names,
            use_threads=True,
        )

    except Exception as e:
        print(f"Error loading Parquet file {parquet_file_to_load}: {e}")

Found 2 Parquet files. Will load: /kaggle/working/moremi_reasoning/salesforce_data/037.parquet
Note: you may need to restart the kernel to use updated packages.
Total rows in file: 3904179
Available columns: ['uid', 'face_bboxes', 'url', 'key', 'captions', 'metadata', '__index_level_0__']


In [8]:
# Select N random rows with a fixed seed
import numpy as np
import random
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

SEED = 42
N = 15000

# Set seed for reproducibility
np.random.seed(SEED)
random.seed(SEED)

# Get total row count
total_rows = table.num_rows
print(f"Total rows in dataset: {total_rows}")

# Generate random indices
if total_rows > N:
    random_indices = sorted(np.random.choice(total_rows, size=N, replace=False))

    # Use pyarrow directly to read specific rows efficiently
    reader = pq.ParquetFile(parquet_file_to_load)

    # Read in batches and collect only the rows we need
    all_selected_rows = []
    current_row = 0

    for batch in reader.iter_batches(batch_size=10000):
        batch_df = batch.to_pandas()
        batch_size = len(batch_df)

        # Find which random indices fall within this batch
        batch_indices = [
            i - current_row
            for i in random_indices
            if current_row <= i < current_row + batch_size
        ]

        if batch_indices:
            selected_rows = batch_df.iloc[batch_indices]
            all_selected_rows.append(selected_rows)

        current_row += batch_size

        # Early stopping if we've processed all needed indices
        if current_row > max(random_indices):
            break

    # Combine all selected rows
    df = pd.concat(all_selected_rows, ignore_index=True)
else:
    # If we have fewer than N rows, just use the whole table
    df = table.to_pandas()
    print(f"Warning: Requested {N} rows but dataset only has {total_rows} rows.")

# Save to file in chunks to avoid memory issues
output_path = "output.json"
chunk_size = 500
for i in range(0, len(df), chunk_size):
    chunk = df.iloc[i : i + chunk_size]
    mode = "w" if i == 0 else "a"
    chunk.to_json(output_path, orient="records", lines=True, mode=mode)

print(f"Selected {len(df)} random rows from {total_rows} total rows")
print(f"Data saved to {output_path}")
print(df.head())  # Print only the first few rows
# df

Total rows in dataset: 3904179
Selected 15000 random rows from 3904179 total rows
Data saved to output.json
                                uid  \
0  6372b42d9b959cf13d343a9b89744ef4   
1  6311b021e5d9f170c075eaf6b7e0aab1   
2  ab76c147109d324901f1a89306e1eb08   
3  d3b5fcc722628afbb52c8e987cab532f   
4  3850c90d2af63aea53310542ddbe44d2   

                                         face_bboxes  \
0  [[0.6712742447853088, 0.5487404465675354, 0.69...   
1  [[0.6411637663841248, 0.6060816049575806, 0.74...   
2  [[0.06078694760799408, 0.17436712980270386, 0....   
3                                                 []   
4                                                 []   

                                                 url           key  \
0  http://direct-ns.rhap.com/imageserver/v2/album...  000008121023   
1  https://encrypted-tbn0.gstatic.com/images?q=tb...  000008123345   
2  https://www.azquotes.com/picture-quotes/quote-...  000008126392   
3  https://facty.mblycdn.com/uploads/fh/
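
The sampling cell scans the full list of sampled indices once per batch. Because the indices are already sorted, the slice belonging to each batch can be located with a binary search instead. The sketch below shows that idea with the standard library only; `split_indices_by_batch` is a hypothetical helper, not part of the notebook:

```python
from bisect import bisect_left


def split_indices_by_batch(sorted_indices, batch_sizes):
    """Yield (batch_number, local_offsets) for each batch containing sampled rows.

    sorted_indices: ascending global row indices to keep.
    batch_sizes: number of rows in each consecutive batch.
    """
    start = 0
    for batch_number, size in enumerate(batch_sizes):
        end = start + size
        # Binary search for the slice of indices that falls in [start, end).
        lo = bisect_left(sorted_indices, start)
        hi = bisect_left(sorted_indices, end)
        if hi > lo:
            yield batch_number, [i - start for i in sorted_indices[lo:hi]]
        start = end
```

For example, `list(split_indices_by_batch([0, 3, 4, 9], [4, 4, 4]))` gives `[(0, [0, 3]), (1, [0]), (2, [1])]`; the local offsets are what `batch_df.iloc[...]` expects.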

### 4.2. Explore a Sample Row

Let's examine the content of a single row, focusing on `uid`, `url`, `captions`, and `metadata`.

In [6]:
def explore_sample_row(dataframe, index=0):
    if dataframe is None or dataframe.empty:
        print("DataFrame is not loaded or is empty.")
        return
    if index >= len(dataframe):
        print(
            f"Index {index} is out of bounds for DataFrame of length {len(dataframe)}."
        )
        return

    sample = dataframe.iloc[index]
    print(f"--- Exploring Sample Row (Index: {index}) ---")
    print(f"UID: {sample.get('uid')}")
    print(f"Key: {sample.get('key')}")
    image_url = sample.get("url")
    print(f"Image URL: {image_url}")

    print("\n--- Parsed Captions (from 'captions' field) ---")
    captions_str = sample.get("captions")
    if captions_str:
        try:
            captions_list = json.loads(captions_str)
            for i, cap_obj in enumerate(captions_list):
                print(f"  Caption Entry {i}:")
                print(f"    Granularity: {cap_obj.get('granularity')}")
                print(
                    f"    Include DataComp Raw Cap: {cap_obj.get('include_datacomp_raw_cap')}"
                )
                # Limit printing of very long text fields
                ocr_text = cap_obj.get("text", "")
                print(
                    f"    Text: {ocr_text[:500] + ('...' if len(ocr_text) > 500 else '')}"
                )
        except json.JSONDecodeError as e:
            print(f"  Error parsing 'captions' JSON: {e}")
            print(f"  Raw captions string: {captions_str}")
    else:
        print("  'captions' field is missing or empty.")

    print("\n--- Parsed Metadata (from 'metadata' field) ---")
    metadata_str = sample.get("metadata")
    if metadata_str:
        try:
            metadata_obj = json.loads(metadata_str)
            print(f"  Length (tokens): {metadata_obj.get('length')}")
            print(f"  OCR UID: {metadata_obj.get('uid')}")
            entries = metadata_obj.get("entries", [])
            print(f"  Number of token entries: {len(entries)}")
            if entries:
                print("    First 3 token entries (if available):")
                for i, entry in enumerate(entries[:3]):
                    print(
                        f"      Token {i}: Text='{entry.get('text')}', Confidence={entry.get('confidence')}, BBox={entry.get('bbox')}"
                    )
        except json.JSONDecodeError as e:
            print(f"  Error parsing 'metadata' JSON: {e}")
            print(f"  Raw metadata string: {metadata_str}")
    else:
        print("  'metadata' field is missing or empty.")


if df is not None:
    explore_sample_row(df, 0)  # Explore the first row
    if len(df) > 1:
        explore_sample_row(df, 1)  # Explore the second row if available

--- Exploring Sample Row (Index: 0) ---
UID: 73142d356bceb4a9b16e41f7a78807e9
Key: 000008120942
Image URL: https://d2840bd96689bd871bd1-41a6b1bb2f8a7c769906f5cde19335fe.ssl.cf1.rackcdn.com/thumbnails/3HGGK5H46LM702372/0d58289257f95d4db9562b8e86124ab5.jpg

--- Parsed Captions (from 'captions' field) ---
  Caption Entry 0:
    Granularity: 0
    Include DataComp Raw Cap: False
    Text: The image contains the text "BOCH", the text "20 Year / 200,000 Mile Warranty"
  Caption Entry 1:
    Granularity: 1
    Include DataComp Raw Cap: False
    Text: The image contains the text "BOCH" located at the bottom left corner, the text "20 Year / 200,000 Mile Warranty" located below the center
  Caption Entry 2:
    Granularity: 2
    Include DataComp Raw Cap: False
    Text: The image contains the text "BOCH" located at 18/19 of the image from top to bottom and 2/15 of the image from left to right, the text "20 Year / 200,000 Mile Warranty" located at 18/19 of the image from top to bottom and 10/19
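
The output above suggests that `captions` is a JSON-encoded list of objects, each with a `granularity` and a `text` field. Under that assumption, a small lookup table per row makes it easier to compare granularities. This is a sketch; `captions_by_granularity` is a hypothetical helper:

```python
import json


def captions_by_granularity(captions_json):
    """Parse the 'captions' JSON string into {granularity: text}.

    The first entry seen for each granularity wins. Returns {} on bad input.
    """
    try:
        entries = json.loads(captions_json)
    except (TypeError, json.JSONDecodeError):
        return {}
    if not isinstance(entries, list):
        return {}
    result = {}
    for entry in entries:
        if not isinstance(entry, dict):
            continue
        level = entry.get("granularity")
        if level is not None and level not in result:
            result[level] = entry.get("text", "")
    return result
```

For a row `sample`, `captions_by_granularity(sample.get("captions")).get(0)` would return the granularity 0 text, or `None` if that level is absent.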

### 4.3. Retrieve Images from the `url` Key

Based on the exploration, we download each image from its `url` and record the result under the row's `uid`, so we can tell which rows have a valid image that is still accessible on the web.

In [None]:
import requests
from PIL import Image
from io import BytesIO
import os
from tqdm.auto import tqdm
import time

# Create a directory to store the images
IMAGE_DIR = os.path.join(os.getcwd(), "salesforce_images")
os.makedirs(IMAGE_DIR, exist_ok=True)
MAX_IMAGES_TO_DOWNLOAD = 15000  # Limit to prevent excessive downloads
DATA_SOURCE = "output.json"

# Update the image_dir in config for later use
config["image_dir"] = IMAGE_DIR
print(f"Images will be saved to: {IMAGE_DIR}")


# Function to download an image from URL
def download_image(url, uid, timeout=10, max_retries=3):
    """
    Downloads an image from a URL and saves it with the uid as filename.

    Args:
        url (str): URL of the image
        uid (str): Unique ID to use as filename
        timeout (int): Timeout in seconds for the request
        max_retries (int): Number of retry attempts

    Returns:
        tuple: (success (bool), path_if_successful (str) or error_message (str))
    """
    if not url:
        return False, "No URL provided"

    image_path = os.path.join(IMAGE_DIR, f"{uid}.jpg")

    # Check if already downloaded
    if os.path.exists(image_path):
        return True, image_path

    retry_count = 0
    while retry_count < max_retries:
        try:
            response = requests.get(url, timeout=timeout)
            if response.status_code == 200:
                try:
                    # Verify it's a valid image by opening it with PIL
                    img = Image.open(BytesIO(response.content))
                    # JPEG cannot store alpha or palette modes, so convert to RGB first
                    img.convert("RGB").save(image_path, format="JPEG")
                    return True, image_path
                except Exception as e:
                    return False, f"Invalid image data: {str(e)}"
            else:
                retry_count += 1
                if retry_count == max_retries:
                    return False, f"HTTP error: {response.status_code}"
                time.sleep(1)  # Wait before retrying
        except requests.exceptions.Timeout:
            retry_count += 1
            if retry_count == max_retries:
                return False, "Request timed out"
            time.sleep(1)  # Wait before retrying
        except requests.exceptions.RequestException as e:
            return False, f"Request error: {str(e)}"

    return False, "Max retries exceeded"


# Download a sample of images (limit to prevent excessive downloads)
max_images_to_download = MAX_IMAGES_TO_DOWNLOAD

download_results = {}

if df is not None and not df.empty:
    num_to_download = min(len(df), max_images_to_download)
    print(f"Downloading {num_to_download} images...")

    for idx, row in tqdm(
        df.head(num_to_download).iterrows(),
        total=num_to_download,
        desc="Downloading images",
    ):
        uid = row.get("uid")
        url = row.get("url")

        if uid and url:
            success, result = download_image(url, uid)
            download_results[uid] = {"url": url, "success": success, "result": result}

    # Print download statistics
    successful_downloads = sum(1 for res in download_results.values() if res["success"])
    print(
        f"Successfully downloaded {successful_downloads} out of {len(download_results)} images."
    )

    # Print details of a few successful downloads
    successful_items = [item for item in download_results.items() if item[1]["success"]]
    if successful_items:
        print("\nSample of successful downloads:")
        for uid, data in successful_items[:3]:  # Show first 3 successful downloads
            print(f"  UID: {uid}")
            print(f"  Image path: {data['result']}")
            print(f"  URL: {data['url']}")
            print()

    # Print details of failed downloads
    failed_items = [item for item in download_results.items() if not item[1]["success"]]
    if failed_items:
        print("Sample of failed downloads:")
        for uid, data in failed_items[:3]:  # Show first 3 failed downloads
            print(f"  UID: {uid}")
            print(f"  Error: {data['result']}")
            print(f"  URL: {data['url']}")
            print()

Images will be saved to: /kaggle/working/moremi_reasoning/salesforce_images
Downloading 15000 images...


Downloading images:   0%|          | 0/15000 [00:00<?, ?it/s]
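
The download function waits a fixed one second between attempts. When many hosts are slow or rate-limited, growing the delay between attempts is a common alternative. The sketch below shows exponential backoff with the standard library only; `call_with_backoff` and its parameters are hypothetical and not part of the notebook:

```python
import time


def call_with_backoff(func, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call func() until it succeeds or max_attempts is reached.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... between attempts
    and re-raises the last exception if every attempt fails.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * (2 ** attempt))
```

A call such as `call_with_backoff(lambda: requests.get(url, timeout=10))` would retry on any exception; in practice the `except` clause should probably be narrowed to the errors that are worth retrying, such as timeouts.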

In [14]:
# save zip file of images
!zip -r salesforce_images.zip salesforce_images

  adding: salesforce_images/ (stored 0%)
  adding: salesforce_images/a39b38e28cdbf2aec2881327abf8d3f7.jpg (deflated 17%)
  adding: salesforce_images/c5d49916ea5afbd3e38824b2436a725d.jpg (deflated 1%)
  adding: salesforce_images/71689cde1602595155619fc6a61fb746.jpg (deflated 14%)
  adding: salesforce_images/c5fef3a3dc7ceb49996b84b6c4ad5663.jpg (deflated 1%)
  adding: salesforce_images/6376ce71d99cc392fc97e817701a8e31.jpg (deflated 0%)
  adding: salesforce_images/3faae2fe35c1c73455f7ff17359acefe.jpg (deflated 6%)
  adding: salesforce_images/4144cfc95294430ee8f09c6ad0245e32.jpg (deflated 1%)
  adding: salesforce_images/3f678482a9a9232c0fb6af105c05ffb9.jpg (deflated 1%)
  adding: salesforce_images/bf9a3d18b17551c8d0647b84999ce4d5.jpg (deflated 1%)
  adding: salesforce_images/8cb41d758aa700775ca7b6c3b22b6060.jpg (deflated 3%)
  adding: salesforce_images/85dea36edc1caa8063843119ef41fffb.jpg (deflated 2%)
  adding: salesforce_images/7330da27782fe3f46e461d0420f23066.jpg (deflated 1%)
  adding:

### 4.4. Generate OCR QA Pairs for Reasoning Pipeline

Now we'll use the OCR Data Bridge to convert our Salesforce OCR data into the format expected by the multimodal QRA reasoning pipeline. This is the critical integration step that connects OCR data processing with reasoning trace generation.
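
To make the target format concrete, the sketch below builds and checks one record with the four keys this notebook uses for the reasoning script (`process_id`, `Open-ended Verifiable Question`, `Ground-True Answer`, `img_urls`). It is an illustration only: `build_qa_item` and `find_issues` are hypothetical helpers and do not describe how `OCRDataBridge` works internally.

```python
REQUIRED_KEYS = (
    "process_id",
    "Open-ended Verifiable Question",
    "Ground-True Answer",
    "img_urls",
)

DEFAULT_QUESTION = "What is all the text visible in this image?"


def build_qa_item(uid, answer, image_ref, question=DEFAULT_QUESTION):
    """Build one QA record; img_urls is always a list, even for a single image."""
    return {
        "process_id": uid,
        "Open-ended Verifiable Question": question,
        "Ground-True Answer": answer,
        "img_urls": [image_ref],
    }


def find_issues(item):
    """Return a list of human-readable problems; an empty list means the item looks usable."""
    issues = [f"missing or empty: {key}" for key in REQUIRED_KEYS if not item.get(key)]
    if item.get("img_urls") and not isinstance(item["img_urls"], list):
        issues.append("img_urls must be a list")
    return issues
```

Running `find_issues` on every record before saving gives a cheap sanity check that is independent of the bridge's own validation.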

In [9]:
os.getcwd()

'/kaggle/working/moremi_reasoning'

In [None]:
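# Read the image file names from the saved list file
import json
import os
import random
import re  # used by get_ocr_text_for_qna, so it must be imported before the function is called
import sys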

# sys.path.append("../src")
from src.ocr_data_bridge import OCRDataBridge

# Initialize the bridge
bridge = OCRDataBridge()

IMAGE_DIR = os.path.join(os.getcwd(), "salesforce_images")
SAMPLE_IMAGE_LIST_FILE = os.path.join(os.getcwd(), "salesforce_ocr/all_image_files.txt")


with open(SAMPLE_IMAGE_LIST_FILE, "r") as f:
    all_image_files = [line.strip() for line in f.readlines() if line.strip()]

# Print the number of files found
files_found = len(all_image_files)
print(f"Found {files_found} files in the directory: {IMAGE_DIR}")
sample_size = files_found


def get_ocr_text_for_qna(
    sample_row, desired_granularity=0, prefer_no_raw_datacomp=True
):
    if sample_row is None:
        return None

    captions_str = sample_row.get("captions")
    if not captions_str:
        return None

    try:
        captions_list = json.loads(captions_str)
        # Try to find the exact granularity, optionally filtering by include_datacomp_raw_cap
        for cap_obj in captions_list:
            if cap_obj.get("granularity") == desired_granularity:
                if (
                    prefer_no_raw_datacomp
                    and cap_obj.get("include_datacomp_raw_cap") == True
                ):
                    continue
                return cap_obj.get("text")

        # Fallback: if exact granularity with preference not found, try without preference
        if prefer_no_raw_datacomp:
            for cap_obj in captions_list:
                if cap_obj.get("granularity") == desired_granularity:
                    return cap_obj.get("text")

        # Fallback: if desired_granularity not found at all, try to get any text from granularity 0 or 5
        for cap_obj in captions_list:
            if cap_obj.get("granularity") == 0:
                if (
                    prefer_no_raw_datacomp
                    and cap_obj.get("include_datacomp_raw_cap") == True
                ):
                    continue
                return cap_obj.get("text")
        for cap_obj in captions_list:
            if cap_obj.get("granularity") == 5:
                if (
                    prefer_no_raw_datacomp
                    and cap_obj.get("include_datacomp_raw_cap") == True
                ):
                    continue
                # Clean the XML-like tags from granularity 5 for a cleaner answer
                text_g5 = cap_obj.get("text")
                if text_g5:
                    return re.sub(
                        r"<ocr>([^<]+)</ocr><bbox>[^<]+</bbox>", r"\1", text_g5
                    ).strip()
                return None  # Should not happen if text_g5 exists

        # Final fallback: return the first caption's text if any
        if captions_list:
            return captions_list[0].get("text")

    except json.JSONDecodeError:
        return None
    return None


if df is not None and not df.empty:
    sample_for_qna = df.iloc[0]

    # --- Configuration for Q&A Generation ---
    # You can change this to experiment with different OCR outputs
    # Granularity 0: Basic text extraction.
    # Granularity 5: Text with OCR tags and bounding boxes (will be cleaned by get_ocr_text_for_qna).
    CHOSEN_GRANULARITY_FOR_ANSWER = 0
    # --- End Configuration ---

    ground_truth_ocr_text = get_ocr_text_for_qna(
        sample_for_qna, CHOSEN_GRANULARITY_FOR_ANSWER
    )

    if ground_truth_ocr_text:
        question = "What is all the text visible in this image?"
        image_uid = sample_for_qna.get("uid")
        image_url_for_qna = sample_for_qna.get("url")

        print(f"--- Generated Q&A for Image UID: {image_uid} ---")
        print(f"Image URL: {image_url_for_qna}")
        print(f"Question: {question}")
        print(
            f"Ground-Truth Answer (from Granularity {CHOSEN_GRANULARITY_FOR_ANSWER}, cleaned):\n{ground_truth_ocr_text}"
        )

        # This is the data structure one item of your preprocessed list would look like for multimodal_QRA_pair.py
        # (assuming multimodal_QRA_pair.py's main data loading is bypassed or adapted)
        prepared_data_item = {
            "process_id": image_uid,  # Or some other unique ID
            "Open-ended Verifiable Question": question,
            "Ground-True Answer": ground_truth_ocr_text,
            "img_urls": [
                image_url_for_qna
            ],  # multimodal_QRA_pair.py expects a list of URLs
        }
        print("\nPrepared data item structure for the reasoning script:")
        print(json.dumps(prepared_data_item, indent=2))
    else:
        print(
            f"Could not extract suitable OCR text for Q&A from sample row (UID: {sample_for_qna.get('uid')})."
        )
else:
    print("DataFrame not loaded or empty, skipping Q&A formulation.")

# Make sure re is imported if not already
import re
import json
import random

# Since we have the file names in all_image_files, let's randomly sample one
random_file = random.choice(all_image_files) if all_image_files else None

if random_file:
    # Extract UID from filename (assuming filename contains the UID)
    uid = random_file.split(".")[0]  # Remove extension

    print(f"Selected random file: {random_file}")
    print(f"Extracted UID: {uid}")

    # Create a mock sample row with minimal required fields
    sample_row = {
        "uid": uid,
        "url": f"{IMAGE_DIR}/{random_file}",
        "captions": json.dumps(
            [
                {
                    "granularity": 0,
                    "text": f"Sample OCR text for {uid}",
                    "include_datacomp_raw_cap": False,
                },
                {
                    "granularity": 5,
                    "text": f"<ocr>Sample OCR text for {uid}</ocr><bbox>0,0,100,100</bbox>",
                    "include_datacomp_raw_cap": False,
                },
            ]
        ),
    }

    # Use the existing function to get OCR text
    ocr_text = get_ocr_text_for_qna(sample_row, CHOSEN_GRANULARITY_FOR_ANSWER)

    # Create sample Q&A
    question = "What is all the text visible in this image?"
    prepared_data_item = {
        "process_id": uid,
        "Open-ended Verifiable Question": question,
        "Ground-True Answer": ocr_text,
        "img_urls": [sample_row["url"]],
    }

    print("\nGenerated sample Q&A:")
    print(json.dumps(prepared_data_item, indent=2))
else:
    print("No files found to sample.")

# Create QA pairs from our Salesforce data
print("=== Generating OCR QA Pairs ===")
print(f"Working with {len(df)} rows of Salesforce OCR data")
print(f"Image directory: {IMAGE_DIR}")

# Generate QA pairs using different granularities
for granularity in [0, 5]:  # Test both granularities
    print(f"\n--- Processing Granularity {granularity} ---")

    qa_pairs = bridge.create_ocr_qa_pairs(
        salesforce_df=df,
        image_dir=IMAGE_DIR,
        num_samples=50,  # Start with small sample for testing
        granularity=granularity,
        seed=42,
    )

    print(f"Generated {len(qa_pairs)} QA pairs")

    # Validate the QA pairs
    validation = bridge.validate_qa_pairs(qa_pairs)
    print(f"Validation Results:")
    print(f"  Total pairs: {validation['total_pairs']}")
    print(f"  Valid pairs: {validation['valid_pairs']}")
    print(f"  Issues found: {len(validation['issues'])}")

    if validation["issues"]:
        print("\nFirst few issues:")
        for issue in validation["issues"][:3]:
            print(f"  - Pair {issue['pair_index']}: {issue['issues']}")

    # Save for reasoning pipeline
    output_file = f"ocr_qa_pairs_granularity_{granularity}.json"
    bridge.save_for_reasoning_pipeline(qa_pairs, output_file)

    # Show sample QA pair
    if qa_pairs:
        print(f"\nSample QA Pair (Granularity {granularity}):")
        sample = qa_pairs[0]
        print(f"  Question: {sample['Open-ended Verifiable Question']}")
        print(
            f"  Answer: {sample['Ground-True Answer'][:200]}{'...' if len(sample['Ground-True Answer']) > 200 else ''}"
        )
        print(f"  Image URLs: {sample['img_urls']}")
        print(f"  Process ID: {sample['process_id']}")

print("\n=== OCR QA Pair Generation Complete ===")
print("Files generated:")
print("- ocr_qa_pairs_granularity_0.json")
print("- ocr_qa_pairs_granularity_5.json")
print("\nThese files are now ready for the reasoning pipeline!")

Found 11 files in the directory: /kaggle/working/moremi_reasoning/salesforce_images
--- Generated Q&A for Image UID: 73142d356bceb4a9b16e41f7a78807e9 ---
Image URL: https://d2840bd96689bd871bd1-41a6b1bb2f8a7c769906f5cde19335fe.ssl.cf1.rackcdn.com/thumbnails/3HGGK5H46LM702372/0d58289257f95d4db9562b8e86124ab5.jpg
Question: What is all the text visible in this image?
Ground-Truth Answer (from Granularity 0, cleaned):
The image contains the text "BOCH", the text "20 Year / 200,000 Mile Warranty"

Prepared data item structure for the reasoning script:
{
  "process_id": "73142d356bceb4a9b16e41f7a78807e9",
  "Open-ended Verifiable Question": "What is all the text visible in this image?",
  "Ground-True Answer": "The image contains the text \"BOCH\", the text \"20 Year / 200,000 Mile Warranty\"",
  "img_urls": [
    "https://d2840bd96689bd871bd1-41a6b1bb2f8a7c769906f5cde19335fe.ssl.cf1.rackcdn.com/thumbnails/3HGGK5H46LM702372/0d58289257f95d4db9562b8e86124ab5.jpg"
  ]
}
Selected random file: 98

FileNotFoundError: [Errno 2] No such file or directory: ''

In [18]:
# Test images from SAMPLE_IMAGE_LIST_FILE
import os
import json
from PIL import Image
import requests
from io import BytesIO
from src.ocr_question_generator import OCRQuestionGenerator

ocr_generator = OCRQuestionGenerator()
# Read image files from the list
print(f"Reading image files from: {SAMPLE_IMAGE_LIST_FILE}")
print(f"Image directory: {IMAGE_DIR}")

# Test a few images from the list
test_image_files = all_image_files[:5]  # Test first 5 images
print(f"\nTesting {len(test_image_files)} images from the list:")

for i, image_file in enumerate(test_image_files, 1):
    image_path = os.path.join(IMAGE_DIR, image_file)
    uid = image_file.split(".")[0]  # Extract UID from filename

    print(f"\n{i}. Testing image: {image_file}")
    print(f"   UID: {uid}")
    print(f"   Path: {image_path}")

    # Check if image exists locally
    if os.path.exists(image_path):
        try:
            # Try to open and verify the image
            with Image.open(image_path) as img:
                print(f"   ✓ Valid image - Size: {img.size}, Mode: {img.mode}")

                # Find corresponding data in DataFrame
                matching_row = df[df["uid"] == uid]
                if not matching_row.empty:
                    row_data = matching_row.iloc[0]
                    print(f"   ✓ Found matching data in DataFrame")
                    print(f"   URL: {row_data.get('url', 'N/A')}")

                    # Extract OCR text for testing
                    captions_str = row_data.get("captions")
                    if captions_str:
                        try:
                            captions_list = json.loads(captions_str)
                            for cap_obj in captions_list:
                                if cap_obj.get("granularity") == 0:
                                    ocr_text = cap_obj.get("text", "")
                                    print(f"   OCR Text (preview): {ocr_text[:100]}...")
                                    break
                        except json.JSONDecodeError:
                            print("   ⚠️ Could not parse captions JSON")
                else:
                    print(f"   ⚠️ No matching data found in DataFrame for UID: {uid}")

        except Exception as e:
            print(f"   ✗ Error opening image: {e}")
    else:
        print(f"   ✗ Image file not found at: {image_path}")

# Generate OCR questions for the test images
print(f"\n=== Generating OCR Questions for Test Images ===")
test_samples = []

for image_file in test_image_files:
    image_path = os.path.join(IMAGE_DIR, image_file)
    uid = image_file.split(".")[0]

    if os.path.exists(image_path):
        # Find matching row in DataFrame
        matching_row = df[df["uid"] == uid]
        if not matching_row.empty:
            row_data = matching_row.iloc[0].to_dict()

            # Generate diverse questions for this image
            diverse_questions = ocr_generator.generate_diverse_questions(3)

            for question in diverse_questions:
                # Extract OCR text as ground truth
                ground_truth = get_ocr_text_for_qna(row_data, 0)  # Use granularity 0

                if ground_truth:
                    test_sample = {
                        "process_id": uid,
                        "img_urls": [image_path],  # Use local path
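                        # Use the key names of the prepared data item shown earlier;
                        # the display code below reads these same keys
                        "Open-ended Verifiable Question": question,
                        "Ground-True Answer": ground_truth,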
                        "metadata": {
                            "difficulty": ocr_generator.assess_difficulty(question),
                            "question_type": ocr_generator.categorize_question(
                                question
                            ),
                            "image_file": image_file,
                        },
                    }
                    test_samples.append(test_sample)

print(f"Generated {len(test_samples)} test samples from image list")

# Save test samples for reasoning pipeline
if test_samples:
    test_output_file = "ocr_test_samples_from_list.json"
    with open(test_output_file, "w", encoding="utf-8") as f:
        json.dump(test_samples, f, ensure_ascii=False, indent=2)

    print(f"\nSaved test samples to: {test_output_file}")

    # Display sample
    print("\nSample test data:")
    sample = test_samples[0]
    print(f"  Image: {sample['metadata']['image_file']}")
    print(f"  Question: {sample['Open-ended Verifiable Question']}")
    print(f"  Answer: {sample['Ground-True Answer'][:150]}...")
    print(f"  Difficulty: {sample['metadata']['difficulty']}")
    print(f"  Type: {sample['metadata']['question_type']}")
else:
    print("No valid test samples generated")

Reading image files from: /kaggle/working/moremi_reasoning/salesforce_ocr/all_image_files.txt
Image directory: /kaggle/working/moremi_reasoning/salesforce_images

Testing 5 images from the list:

1. Testing image: 6f87885362ba58d8ebee51378da401a7.jpg
   UID: 6f87885362ba58d8ebee51378da401a7
   Path: /kaggle/working/moremi_reasoning/salesforce_images/6f87885362ba58d8ebee51378da401a7.jpg
   ✓ Valid image - Size: (128, 159), Mode: RGB
   ✓ Found matching data in DataFrame
   URL: https://books.google.ca/books/content?id=NhAEAAAAMBAJ&printsec=frontcover&img=1&zoom=1&edge=curl&imgtk=AFLRE737XKBhH1OfMiawD5sbc38i4ikRDgU9pBJQZjjAH0yekR8XlABkFusCEqQKpw0lkHB9TWF_rjRIEle6M1NsDKffy4VXb3_7zB9MX5VDKlzfif8pmx4SgJkXy74BbpOUzJPbQeUS
   OCR Text (preview): The image contains the text "Billboard", the text "kylie minogue"...

2. Testing image: 72732f1e5509203199c0757a82c3c321.jpg
   UID: 72732f1e5509203199c0757a82c3c321
   Path: /kaggle/working/moremi_reasoning/salesforce_images/72732f1e5509203199c0757a8

In [12]:
!rm -rf salesforce_images.zip


In [13]:
os.listdir()
# !rm -rf salesforce_images.zip

['section_5_enhancement_summary.md',
 'ocr_test_samples_from_list.json',
 'ocr_test_data.json',
 'all_image_files.txt',
 'state.db',
 'salesforce_data',
 'output.json',
 'logs',
 'src',
 'salesforce_images',
 'salesforce_ocr',
 '.gitignore',
 '.git',
 'README.md']

In [10]:
# show number of files in `salesforce_images` directory
import os

IMAGE_DIR = os.path.join(os.getcwd(), "salesforce_images")
if not os.path.isdir(IMAGE_DIR):
    print(f"ERROR: Image directory '{IMAGE_DIR}' does not exist.")
else:
    print(f"Image directory exists: {IMAGE_DIR}")
    # List the directory only once it is known to exist; os.listdir raises otherwise
    image_files = os.listdir(IMAGE_DIR)
    print(f"Number of image files in '{IMAGE_DIR}': {len(image_files)}")

Image directory exists: /kaggle/working/moremi_reasoning/salesforce_images
Number of image files in '/kaggle/working/moremi_reasoning/salesforce_images': 12081


## 5. Running the OCR Reasoning Pipeline

Now that we have properly formatted QA pairs, we can run the multimodal QRA reasoning pipeline with our improved OCR-specific prompts.
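
Before launching the pipeline, it can help to confirm that every record in the QA file carries the four fields shown in the prepared data item earlier in this notebook. The following is a minimal sketch: `REQUIRED_KEYS`, `find_malformed_records`, and `check_qa_file` are hypothetical names, not part of `multimodal_QRA_pair.py`, and the assumption that the pipeline needs exactly these keys is inferred from that example record rather than from the pipeline source.

```python
import json
from pathlib import Path

# Assumed from the sample record printed earlier; not confirmed against the pipeline source.
REQUIRED_KEYS = (
    "process_id",
    "Open-ended Verifiable Question",
    "Ground-True Answer",
    "img_urls",
)


def find_malformed_records(records):
    """Return (index, missing_keys) for every record lacking a required key."""
    problems = []
    for index, record in enumerate(records):
        missing = [key for key in REQUIRED_KEYS if key not in record]
        if missing:
            problems.append((index, missing))
    return problems


def check_qa_file(path):
    """Load a QA JSON file (a list of dicts) and report malformed records."""
    with open(Path(path), "r", encoding="utf-8") as f:
        records = json.load(f)
    return find_malformed_records(records)


# Toy records for illustration only
example_records = [
    {
        "process_id": "abc",
        "Open-ended Verifiable Question": "What is all the text visible in this image?",
        "Ground-True Answer": 'The image contains the text "BOCH"',
        "img_urls": ["salesforce_images/abc.jpg"],
    },
    {"process_id": "def", "question": "Read the text.", "img_urls": []},
]
example_problems = find_malformed_records(example_records)
print(example_problems)
```

Calling `check_qa_file("ocr_test_samples_from_list.json")` from the project root would apply the same check to the file that the configuration below points at; an empty list means no record is missing a key.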

In [7]:
# Enhanced OCR Reasoning Pipeline Execution
import os
import sys
import json
import yaml
import subprocess
import time
import traceback
from pathlib import Path
from datetime import datetime

# 1. Prepare the environment and configurations
print("=== Setting up Enhanced OCR Reasoning Pipeline ===")

# Get the current working directory and set up paths
current_dir = Path.cwd()
src_dir = current_dir / "src"
config_dir = src_dir / "config"

print(f"Current directory: {current_dir}")
print(f"Source directory: {src_dir}")
print(f"Config directory: {config_dir}")

# Create logs directory for tracking
logs_dir = current_dir / "logs"
logs_dir.mkdir(exist_ok=True)


# 2. Validate environment setup
def validate_environment():
    """Validate that all required components are available"""
    issues = []

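    # Check API key: prefer the environment variable, then fall back to a
    # notebook-level API_KEY variable if one was defined in an earlier cell
    api_key = os.environ.get("API_KEY") or globals().get("API_KEY")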
    if not api_key:
        issues.append("API_KEY environment variable not set")
    elif len(api_key) < 20:
        issues.append("API_KEY appears to be too short")

    # Check configuration files
    config_file = config_dir / "reasoning_config.yaml"
    prompts_file = config_dir / "reasoning_prompts.yaml"

    if not config_file.exists():
        issues.append(f"Configuration file missing: {config_file}")
    if not prompts_file.exists():
        issues.append(f"Prompts file missing: {prompts_file}")

    # Check QA pairs file
    qa_pairs_file = current_dir / "ocr_test_samples_from_list.json"
    if not qa_pairs_file.exists():
        issues.append(f"QA pairs file missing: {qa_pairs_file}")

    # Check source files
    pipeline_script = src_dir / "multimodal_QRA_pair.py"
    if not pipeline_script.exists():
        issues.append(f"Pipeline script missing: {pipeline_script}")

    return issues


# 3. Enhanced configuration setup
def setup_pipeline_config():
    """Set up optimized configuration for OCR reasoning"""
    config_file = config_dir / "reasoning_config.yaml"

    if not config_file.exists():
        print("⚠️ Configuration file not found!")
        return False

    with open(config_file, "r") as f:
        reasoning_config = yaml.safe_load(f)

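    # Backup original config (only once, so reruns do not overwrite the
    # backup with an already-modified configuration)
    backup_file = config_file.with_suffix(".yaml.backup")
    if not backup_file.exists():
        with open(backup_file, "w") as f:
            yaml.dump(reasoning_config, f, default_flow_style=False)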

    # Update configuration for OCR processing
    updates = {
        "data_path": str(current_dir / "ocr_test_samples_from_list.json"),
        "image_dir": str(current_dir / "salesforce_images"),
        "limit_num": 10,  # Start with manageable size
        "max_search_attempts": 3,
        "efficient_search": True,
        "num_process": 1,
        "batch_size": 3,
        "save_progress": True,
        "verbose": True,
        "model_name": "google/gemini-2.0-flash-001",
    }

    for key, value in updates.items():
        reasoning_config[key] = value

    # Save updated configuration
    with open(config_file, "w") as f:
        yaml.dump(reasoning_config, f, default_flow_style=False)

    print("✓ Updated reasoning configuration:")
    for key, value in updates.items():
        print(f"  {key}: {value}")

    return True


# 4. Run validation
print("\n=== Environment Validation ===")
validation_issues = validate_environment()

if validation_issues:
    print("❌ Environment validation failed:")
    for issue in validation_issues:
        print(f"  - {issue}")
    print("\nPlease fix these issues before proceeding.")
else:
    print("✅ Environment validation passed!")

# 5. Set up configuration
if not validation_issues:
    config_success = setup_pipeline_config()
    if config_success:
        print("✓ Configuration setup complete")


# 6. Enhanced pipeline execution with monitoring
def run_ocr_reasoning_pipeline():
    """Run the OCR reasoning pipeline with enhanced monitoring"""

    if validation_issues:
        print("❌ Cannot run pipeline due to validation issues")
        return False

    print("\n🚀 Starting Enhanced OCR Reasoning Pipeline...")

    # Set up logging
    timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_file = logs_dir / f"ocr_reasoning_pipeline_{timestamp}.log"

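    # Remember the starting directory before entering the try block so the
    # finally clause can always restore it
    original_dir = os.getcwd()

    try:
        # Add src to Python path
        sys.path.insert(0, str(src_dir))

        # Change to src directory
        os.chdir(str(src_dir))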

        # Start timing
        start_time = time.time()

        # Import and run the pipeline
        from multimodal_QRA_pair import main as run_reasoning_pipeline

        print("✓ Pipeline module imported successfully")
        print("📊 Processing OCR questions with Chain of Thought reasoning...")
        print(f"📝 Logs will be saved to: {log_file}")

        # Capture output for logging
        import io
        import contextlib

        f = io.StringIO()
        with contextlib.redirect_stdout(f):
            run_reasoning_pipeline()

        output = f.getvalue()

        # Save log
        with open(log_file, "w", encoding="utf-8") as log:
            log.write(f"OCR Reasoning Pipeline Log - {timestamp}\n")
            log.write("=" * 50 + "\n\n")
            log.write(output)

        # Calculate execution time
        execution_time = time.time() - start_time

        print(f"\n✅ Pipeline completed successfully in {execution_time:.2f} seconds!")
        print(f"📊 Check the output files for reasoning results")

        return True

    except ImportError as e:
        error_msg = f"Failed to import pipeline module: {str(e)}"
        print(f"❌ {error_msg}")
        with open(log_file, "w") as log:
            log.write(f"ERROR: {error_msg}\n{traceback.format_exc()}")
        return False

    except Exception as e:
        error_msg = f"Pipeline execution failed: {str(e)}"
        print(f"❌ {error_msg}")
        print(f"💡 Check log file for details: {log_file}")
        with open(log_file, "w") as log:
            log.write(f"ERROR: {error_msg}\n{traceback.format_exc()}")
        return False

    finally:
        # Return to original directory
        os.chdir(original_dir)


# 7. Execute the pipeline
if not validation_issues:
    pipeline_success = run_ocr_reasoning_pipeline()

    if pipeline_success:
        # 8. Quick results check
        print("\n=== Quick Results Check ===")

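        # The first two patterns assume outputs named after an "ocr_qa_pairs" input.
        # This run reads "ocr_test_samples_from_list.json", so the generic
        # "*CoT_search*.json" pattern is included as a fallback.
        output_files = []
        for directory in [current_dir, src_dir]:
            output_files.extend(list(directory.glob("*ocr_qa_pairs*CoT_search*.json")))
            output_files.extend(list(directory.glob("simplified_*ocr_qa_pairs*.json")))
            output_files.extend(list(directory.glob("*CoT_search*.json")))
        output_files = sorted(set(output_files))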

        if output_files:
            print(f"✓ Found {len(output_files)} output files:")
            for file in output_files[:3]:  # Show first 3
                try:
                    with open(file, "r", encoding="utf-8") as f:
                        data = json.load(f)
                    if isinstance(data, list) and data:
                        successful = sum(
                            1
                            for item in data
                            if item.get("found_correct_answer", False)
                        )
                        print(
                            f"  📁 {file.name}: {len(data)} samples, {successful} successful"
                        )
                except Exception as e:
                    print(f"  📁 {file.name}: Could not analyze ({e})")
        else:
            print("⚠️ No output files found - check logs for issues")

else:
    print("\n💡 To proceed manually:")
    print("1. Fix the validation issues listed above")
    print("2. Set API_KEY: export API_KEY='sk-or-v1-your-key'")
    print("3. Run: cd src && python multimodal_QRA_pair.py")

print("\n=== Pipeline Execution Summary ===")
if not validation_issues and "pipeline_success" in locals() and pipeline_success:
    print("🎉 OCR Reasoning Pipeline completed successfully!")
    print("📈 Proceed to quality assessment in the next cell")
elif validation_issues:
    print("🔧 Environment setup needed - address validation issues first")
else:
    print("⚠️ Pipeline execution encountered issues - check logs for details")

print("\n📋 Next Steps:")
print("1. Review quality assessment results")
print("2. Analyze reasoning strategies and success patterns")
print("3. Refine prompts based on performance")
print("4. Scale up to larger datasets if results are satisfactory")

=== Setting up Enhanced OCR Reasoning Pipeline ===
Current directory: /kaggle/working/moremi_reasoning
Source directory: /kaggle/working/moremi_reasoning/src
Config directory: /kaggle/working/moremi_reasoning/src/config

=== Environment Validation ===
✅ Environment validation passed!
✓ Updated reasoning configuration:
  data_path: /kaggle/working/moremi_reasoning/ocr_test_samples_from_list.json
  image_dir: /kaggle/working/moremi_reasoning/salesforce_images
  limit_num: 10
  max_search_attempts: 3
  efficient_search: True
  num_process: 1
  batch_size: 3
  save_progress: True
  verbose: True
  model_name: google/gemini-2.0-flash-001
✓ Configuration setup complete

🚀 Starting Enhanced OCR Reasoning Pipeline...


Processing samples:   0%|          | 0/10 [00:00<?, ?sample/s]

✓ Pipeline module imported successfully
📊 Processing OCR questions with Chain of Thought reasoning...
📝 Logs will be saved to: /kaggle/working/moremi_reasoning/logs/ocr_reasoning_pipeline_20250528_170654.log


Processing samples: 100%|██████████| 10/10 [01:21<00:00,  8.16s/sample]


✅ Pipeline completed successfully in 84.44 seconds!
📊 Check the output files for reasoning results

=== Quick Results Check ===
⚠️ No output files found - check logs for issues

=== Pipeline Execution Summary ===
🎉 OCR Reasoning Pipeline completed successfully!
📈 Proceed to quality assessment in the next cell

📋 Next Steps:
1. Review quality assessment results
2. Analyze reasoning strategies and success patterns
3. Refine prompts based on performance
4. Scale up to larger datasets if results are satisfactory





In [None]:
# Comprehensive OCR Reasoning Results Analysis
import json
import os
import pandas as pd
import matplotlib.pyplot as plt
from pathlib import Path
from collections import defaultdict, Counter
import seaborn as sns

print("=== Comprehensive OCR Reasoning Analysis ===")

# Look for output files
current_dir = Path.cwd()
src_dir = current_dir / "src"

# Enhanced search for output files
output_files = []
search_patterns = [
    "*ocr_qa_pairs*CoT_search*.json",
    "simplified_*ocr_qa_pairs*.json",
    "reasoning_data_new/**/*.json",
    "*CoT_search*.json"
]

for directory in [current_dir, src_dir]:
    for pattern in search_patterns:
        output_files.extend(list(directory.glob(pattern)))

# Remove duplicates and sort by modification time
output_files = list(set(output_files))
output_files.sort(key=lambda x: x.stat().st_mtime, reverse=True)

def analyze_ocr_reasoning_performance(file_path):
    """Comprehensive analysis of OCR reasoning performance"""
    
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        if not isinstance(data, list) or not data:
            return None
        
        print(f"\n{'='*60}")
        print(f"Analyzing: {file_path.name}")
        print(f"{'='*60}")
        
        # Basic metrics
        total_samples = len(data)
        successful = [item for item in data if item.get('found_correct_answer', False)]
        failed = [item for item in data if not item.get('found_correct_answer', False)]
        
        success_rate = len(successful) / total_samples * 100
        
        print(f"📈 Overall Performance:")
        print(f"  Total samples: {total_samples}")
        print(f"  Successful: {len(successful)} ({success_rate:.1f}%)")
        print(f"  Failed: {len(failed)} ({100-success_rate:.1f}%)")
        
        # Reasoning attempt analysis
        attempt_counts = []
        strategy_usage = defaultdict(int)
        
        for item in data:
            attempts = item.get('reasoning_attempts', [])
            attempt_counts.append(len(attempts))
            
            for strategy in attempts:
                strategy_usage[strategy] += 1
        
        if attempt_counts:
            avg_attempts = sum(attempt_counts) / len(attempt_counts)
            print(f"\n🤔 Reasoning Analysis:")
            print(f"  Average attempts per question: {avg_attempts:.1f}")
            print(f"  Max attempts: {max(attempt_counts)}")
            print(f"  Min attempts: {min(attempt_counts)}")
            
            if strategy_usage:
                print(f"\n🛠️ Strategy Usage:")
                for strategy, count in sorted(strategy_usage.items(), key=lambda x: x[1], reverse=True):
                    percentage = count / sum(strategy_usage.values()) * 100
                    print(f"  {strategy}: {count} times ({percentage:.1f}%)")
        
        # Text length analysis
        gt_lengths = []
        response_lengths = []
        question_lengths = []
        
        for item in data:
            gt_text = item.get('Ground-True Answer', '')
            response_text = item.get('Response', '')
            question_text = item.get('Question', '')
            
            gt_lengths.append(len(gt_text))
            response_lengths.append(len(response_text))
            question_lengths.append(len(question_text))
        
        if gt_lengths and response_lengths:
            print(f"\n📝 Text Analysis:")
            print(f"  Avg question length: {sum(question_lengths)/len(question_lengths):.0f} chars")
            print(f"  Avg ground truth length: {sum(gt_lengths)/len(gt_lengths):.0f} chars")
            print(f"  Avg response length: {sum(response_lengths)/len(response_lengths):.0f} chars")
            
            # Response completeness
            response_ratios = [r/g if g > 0 else 0 for r, g in zip(response_lengths, gt_lengths)]
            avg_ratio = sum(response_ratios) / len(response_ratios)
            print(f"  Avg response/ground_truth ratio: {avg_ratio:.2f}")
        
        # OCR-specific analysis
        print(f"\n🔍 OCR-Specific Analysis:")
        
        # Analyze question types
        question_types = Counter()
        for item in data:
            question = item.get('Question', '').lower()
            if 'text' in question and 'visible' in question:
                question_types['text_extraction'] += 1
            elif 'read' in question:
                question_types['text_reading'] += 1
            elif 'transcribe' in question:
                question_types['transcription'] += 1
            else:
                question_types['other'] += 1
        
        print(f"  Question type distribution:")
        for qtype, count in question_types.items():
            percentage = count / total_samples * 100
            print(f"    {qtype}: {count} ({percentage:.1f}%)")
        
        # Performance by question complexity
        short_questions = [item for item in data if len(item.get('Question', '')) < 50]
        long_questions = [item for item in data if len(item.get('Question', '')) >= 50]
        
        if short_questions and long_questions:
            short_success = sum(1 for item in short_questions if item.get('found_correct_answer', False))
            long_success = sum(1 for item in long_questions if item.get('found_correct_answer', False))
            
            print(f"  Performance by question length:")
            print(f"    Short questions: {short_success}/{len(short_questions)} ({short_success/len(short_questions)*100:.1f}%)")
            print(f"    Long questions: {long_success}/{len(long_questions)} ({long_success/len(long_questions)*100:.1f}%)")
        
        # Sample analysis
        if successful:
            print(f"\n✅ Sample Successful Case:")
            sample = successful[0]
            print(f"  Question: {sample.get('Question', '')[:100]}...")
            print(f"  Ground Truth: {sample.get('Ground-True Answer', '')[:100]}...")
            print(f"  Response: {sample.get('Response', '')[:100]}...")
            print(f"  Reasoning attempts: {len(sample.get('reasoning_attempts', []))}")
        
        if failed:
            print(f"\n❌ Sample Failed Case:")
            sample = failed[0]
            print(f"  Question: {sample.get('Question', '')[:100]}...")
            print(f"  Ground Truth: {sample.get('Ground-True Answer', '')[:100]}...")
            print(f"  Response: {sample.get('Response', '')[:100]}...")
            print(f"  Reasoning attempts: {len(sample.get('reasoning_attempts', []))}")
        
        # Recommendations
        print(f"\n💡 Recommendations:")
        if success_rate > 80:
            print("  ✅ Excellent performance! Ready for larger datasets.")
        elif success_rate > 60:
            print("  🟡 Good performance. Fine-tune prompts for edge cases.")
        elif success_rate > 40:
            print("  🟠 Moderate performance. Review failed cases for patterns.")
        else:
            print("  🔴 Low performance. Significant prompt optimization needed.")
        
        if avg_attempts > 2.5:
            print("  🔄 High attempt count - improve initial prompt specificity.")
        
        if avg_ratio < 0.5:
            print("  📝 Responses shorter than expected - encourage more detail.")
        elif avg_ratio > 2:
            print("  📝 Responses longer than needed - add conciseness guidelines.")
        
        return {
            'file_name': file_path.name,
            'total_samples': total_samples,
            'success_rate': success_rate,
            'avg_attempts': avg_attempts,
            'strategy_usage': dict(strategy_usage),
            'question_types': dict(question_types)
        }
        
    except Exception as e:
        print(f"Error analyzing {file_path}: {str(e)}")
        return None

if output_files:
    print(f"✓ Found {len(output_files)} output files to analyze\n")
    
    # Analyze each file
    analysis_results = []
    for file in output_files:
        result = analyze_ocr_reasoning_performance(file)
        if result:
            analysis_results.append(result)
    
    # Summary comparison if multiple files
    if len(analysis_results) > 1:
        print(f"\n{'='*60}")
        print("COMPARATIVE SUMMARY")
        print(f"{'='*60}")
        
        print(f"\n📈 Performance Comparison:")
        for result in analysis_results:
            print(f"  {result['file_name']}: {result['success_rate']:.1f}% success ({result['total_samples']} samples)")
        
        # Best performing configuration
        best_result = max(analysis_results, key=lambda x: x['success_rate'])
        print(f"\n🏆 Best performing configuration: {best_result['file_name']}")
        print(f"    Success rate: {best_result['success_rate']:.1f}%")
        print(f"    Avg attempts: {best_result['avg_attempts']:.1f}")
    
    # Generate summary report
    summary_file = current_dir / "ocr_reasoning_summary.json"
    with open(summary_file, 'w', encoding='utf-8') as f:
        json.dump(analysis_results, f, indent=2, ensure_ascii=False)
    
    print(f"\n📊 Summary saved to: {summary_file}")
    
else:
    print("⚠️ No output files found for analysis")
    print("\nPossible reasons:")
    print("1. Pipeline hasn't completed yet")
    print("2. Pipeline encountered errors")
    print("3. Output files saved in different location")
    print("4. API key issues prevented execution")

print("\n=== Analysis Complete ===")
if output_files and analysis_results:
    print("🎆 Your OCR reasoning system is now fully operational!")
print("\n🚀 Next steps for optimization:")
print("1. 🔍 Analyze failed cases for common patterns")
print("2. ⚙️ Refine prompts based on performance metrics")
print("3. 📈 Test with different OCR granularities (0 vs 5)")
print("4. 🌐 Scale to larger datasets for production use")
print("5. 🤝 Compare with other OCR reasoning approaches")

### 5.1 Manual Pipeline Execution (If Needed)

If the automatic pipeline execution fails, you can run it manually with these steps:
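
The manual steps can also be scripted from Python instead of typed into a terminal. The sketch below assumes only what the notebook states: the script is `src/multimodal_QRA_pair.py` and it reads the key from the `API_KEY` environment variable. `build_manual_run` and `run_manual` are hypothetical helper names introduced for illustration.

```python
import os
import subprocess
import sys
from pathlib import Path


def build_manual_run(project_dir, api_key):
    """Return (command, cwd, env) for running the pipeline script from src/."""
    src_dir = Path(project_dir) / "src"
    command = [sys.executable, "multimodal_QRA_pair.py"]
    env = dict(os.environ)
    env["API_KEY"] = api_key  # mirrors `export API_KEY=...` from the manual steps
    return command, src_dir, env


def run_manual(project_dir, api_key):
    """Run the pipeline in a child process and return its exit code."""
    command, cwd, env = build_manual_run(project_dir, api_key)
    completed = subprocess.run(command, cwd=cwd, env=env, check=False)
    return completed.returncode


# Build (but do not execute) the command for the Kaggle project directory
command, cwd, env = build_manual_run(
    "/kaggle/working/moremi_reasoning", "sk-or-v1-your-key"
)
print(command[1], cwd)
```

Running the script in a child process keeps the notebook's working directory and `sys.path` untouched. A non-zero return value from `run_manual` indicates that the script exited with an error.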

In [None]:
# Manual execution commands for the OCR reasoning pipeline
print("=== Manual Pipeline Execution Instructions ===")
print()
print("If the automatic execution above failed, run these commands manually:")
print()
print("1. Open a terminal in this environment")
print("2. Navigate to the src directory:")
print(f"   cd {Path.cwd() / 'src'}")
print()
print("3. Set your API key:")
print("   export API_KEY='sk-or-v1-your-actual-openrouter-api-key'")
print()
print("4. Run the reasoning pipeline:")
print("   python multimodal_QRA_pair.py")
print()
print("5. The pipeline will:")
print("   - Load OCR QA pairs from the JSON file")
print("   - Apply Chain of Thought reasoning with OCR-specific prompts")
print("   - Generate reasoning traces for each question")
print("   - Save results in reasoning_data_new/ directory")
print()
print("Expected output files:")
print("- ocr_qa_pairs_granularity_0_CoT_search_X.json (full results)")
print("- simplified_ocr_qa_pairs_granularity_0_CoT_search_X.json (simplified)")
print("- reasoning_data_new/*/progress.json (progress tracking)")
print()
print("Troubleshooting:")
print("- If API errors: Check your OpenRouter API key and credits")
print("- If import errors: Ensure all dependencies are installed")
print("- If file errors: Verify QA pairs JSON file exists and is valid")
print("- If memory errors: Reduce limit_num in reasoning_config.yaml")
print()
print("📊 Monitor the process - it will show progress and verification results")
print("🕐 Processing time depends on the model; the run above averaged about 8 s per sample with google/gemini-2.0-flash-001")

### 5.2 Quality Assessment and Next Steps

After running the OCR reasoning pipeline, evaluate the results:
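
As a small worked example of the core metric, the sketch below computes a success rate from toy records. It assumes each result record carries a boolean `found_correct_answer` field, which is the field the analysis code in this notebook reads; `summarize_results` is a hypothetical helper name and the records are made up for illustration.

```python
def summarize_results(records):
    """Compute the success count and rate from records with 'found_correct_answer'."""
    total = len(records)
    correct = sum(1 for item in records if item.get("found_correct_answer", False))
    # Guard against an empty results list to avoid a ZeroDivisionError
    rate = (correct / total * 100) if total else 0.0
    return {"total": total, "correct": correct, "success_rate": rate}


toy_results = [
    {"found_correct_answer": True},
    {"found_correct_answer": False},
    {"found_correct_answer": True},
    {},  # a record without the flag counts as a failure
]
summary = summarize_results(toy_results)
print(summary)
```

Records that lack the flag are counted as failures here, so a low rate may partly reflect missing fields rather than wrong answers.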

In [None]:
# Quality Assessment of OCR Reasoning Results
import json
import pandas as pd
from pathlib import Path
import matplotlib.pyplot as plt
import seaborn as sns

def analyze_reasoning_quality(results_file):
    """Analyze the quality of OCR reasoning results"""
    
    with open(results_file, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    if not data:
        print("No data found in results file")
        return
    
    print(f"=== Quality Analysis for {len(data)} samples ===")
    
    # Basic statistics
    correct_answers = sum(1 for item in data if item.get('found_correct_answer', False))
    success_rate = correct_answers / len(data) * 100
    
    print(f"Success Rate: {correct_answers}/{len(data)} ({success_rate:.1f}%)")
    
    # Analyze reasoning attempts
    attempt_counts = []
    for item in data:
        attempts = item.get('reasoning_attempts', [])
        attempt_counts.append(len(attempts))
    
    avg_attempts = sum(attempt_counts) / len(attempt_counts)
    print(f"Average reasoning attempts: {avg_attempts:.1f}")
    
    # Analyze reasoning strategies used
    strategies_used = {}
    for item in data:
        attempts = item.get('reasoning_attempts', [])
        for strategy in attempts:
            strategies_used[strategy] = strategies_used.get(strategy, 0) + 1
    
    print("\nReasoning strategies used:")
    for strategy, count in strategies_used.items():
        print(f"  {strategy}: {count} times")
    
    # Analyze text length of OCR ground truth vs responses
    gt_lengths = []
    response_lengths = []
    
    for item in data:
        gt_text = item.get('Ground-True Answer', '')
        response_text = item.get('Response', '')
        
        gt_lengths.append(len(gt_text))
        response_lengths.append(len(response_text))
    
    print(f"\nText Analysis:")
    print(f"  Average ground truth length: {sum(gt_lengths)/len(gt_lengths):.0f} chars")
    print(f"  Average response length: {sum(response_lengths)/len(response_lengths):.0f} chars")
    
    # Sample successful and failed cases
    successful_cases = [item for item in data if item.get('found_correct_answer', False)]
    failed_cases = [item for item in data if not item.get('found_correct_answer', False)]
    
    # Print the first example of each outcome, truncated to 100 characters per field
    for label, cases in (("Successful", successful_cases), ("Failed", failed_cases)):
        print(f"\n=== Sample {label} Case ===")
        if cases:
            sample = cases[0]
            print(f"Question: {sample.get('Question', '')[:100]}...")
            print(f"Ground Truth: {sample.get('Ground-True Answer', '')[:100]}...")
            print(f"Response: {sample.get('Response', '')[:100]}...")
            print(f"Attempts: {sample.get('reasoning_attempts', [])}")
    
    return {
        'total_samples': len(data),
        'success_rate': success_rate,
        'avg_attempts': avg_attempts,
        'strategies_used': strategies_used
    }

# Find and analyze results files
print("=== OCR Reasoning Quality Assessment ===")

results_files = []
current_dir = Path.cwd()
src_dir = current_dir / "src"

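# Look for results files. The two patterns can match the same file,
# so deduplicate before analysis to avoid reporting a file twice.
for directory in [current_dir, src_dir]:
    results_files.extend(directory.glob("*ocr_qa_pairs*CoT_search*.json"))
    results_files.extend(directory.glob("simplified_*ocr_qa_pairs*.json"))
results_files = sorted(set(results_files))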

if results_files:
    print(f"Found {len(results_files)} results files to analyze\n")
    
    for i, file in enumerate(results_files, 1):
        print(f"\n{'='*60}")
        print(f"Analyzing file {i}: {file.name}")
        print(f"{'='*60}")
        
        try:
            analysis = analyze_reasoning_quality(file)
            
            # Recommendations based on results
            print("\n=== Recommendations ===")
            if analysis['success_rate'] > 80:
                print("✅ Excellent performance! Consider scaling up to larger datasets.")
            elif analysis['success_rate'] > 60:
                print("🟡 Good performance. Consider refining prompts for edge cases.")
            elif analysis['success_rate'] > 40:
                print("🟠 Moderate performance. Review failed cases and improve prompts.")
            else:
                print("🔴 Low performance. Significant prompt engineering needed.")
            
            if analysis['avg_attempts'] > 3:
                print("🔄 High attempt count - consider more targeted initial prompts.")
            
        except Exception as e:
            print(f"Error analyzing {file}: {e}")
            
else:
    print("⚠️ No results files found.")
    print("Make sure the reasoning pipeline has completed successfully.")
    print("Expected files: *ocr_qa_pairs*CoT_search*.json or simplified_*ocr_qa_pairs*.json")

print("\n=== Next Steps for Improvement ===")
print("1. 📊 Analyze failed cases to identify prompt weaknesses")
print("2. ⚙️ Refine OCR-specific prompts based on common failure patterns")
print("3. 🔍 Test with different OCR granularities (0 vs 5)")
print("4. 📝 Scale to larger datasets once quality is satisfactory")
print("5. 🎨 Consider ensemble approaches with multiple reasoning strategies")
print("\n🎆 Congratulations! You've built an end-to-end OCR reasoning system!")
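
The metrics reported by the cell above can also be checked on in-memory records, without reading a results file. The following is a minimal sketch, assuming each record carries the `found_correct_answer` and `reasoning_attempts` fields used above. The helper `summarize_attempts`, the sample records, and the strategy names inside them are hypothetical and serve only as an illustration.

```python
from collections import Counter


def summarize_attempts(records):
    """Return success rate (%), mean attempt count, and strategy counts."""
    total = len(records)
    if total == 0:
        # Empty input: return zeros instead of raising ZeroDivisionError
        return {"success_rate": 0.0, "avg_attempts": 0.0, "strategies_used": Counter()}

    correct = sum(1 for r in records if r.get("found_correct_answer", False))
    # Treat a missing or None attempts field as an empty list
    attempts_per_record = [r.get("reasoning_attempts") or [] for r in records]
    strategies = Counter(s for attempts in attempts_per_record for s in attempts)

    return {
        "success_rate": 100.0 * correct / total,
        "avg_attempts": sum(len(a) for a in attempts_per_record) / total,
        "strategies_used": strategies,
    }


# Hypothetical records for illustration only (not real pipeline output)
sample_records = [
    {"found_correct_answer": True, "reasoning_attempts": ["Init_CoT"]},
    {"found_correct_answer": False, "reasoning_attempts": ["Init_CoT", "Backtracking"]},
]
summary = summarize_attempts(sample_records)
print(summary)
```

Returning zeros for an empty list keeps the summary usable when a results file exists but contains no records, which is one way to handle that case; raising an error would be an equally reasonable choice.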

## 6. Evaluation and Performance Analysis

Evaluate the OCR reasoning pipeline performance and identify areas for improvement: