# Salesforce BLIP3-OCR-200M Data Processing and Reasoning Trace Generation

**Objective:** Use the `Salesforce/blip3-ocr-200m` dataset to extract OCR information, then generate reasoning traces with a specified language model. This notebook is intended to run in a Kaggle SSH environment.

## 1. Setup and Installations

Install necessary Python packages for data handling, Hugging Face interactions, and any other dependencies required by the `multimodal_QRA_pair.py` script and the chosen models.

In [1]:
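# Ensure these packages are installed in your Kaggle environment.
# You might need to restart the kernel after running this cell for the first time.
# pyyaml provides the `yaml` module imported in the next section.
!pip install datasets pandas huggingface_hub pyarrow pyyaml tqdm
# Cloning over SSH requires an SSH key registered with GitHub in this environment.
!git clone git@github.com:minojosh/moremi_reasoning.git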


## 2. Import Libraries

In [2]:
import os
import json
import pandas as pd
from datasets import load_dataset
from huggingface_hub import HfApi, snapshot_download
import yaml
from tqdm.auto import tqdm

# Potentially more imports will be needed based on multimodal_QRA_pair.py
# and the specific models used.

## 3. Configuration Loading

Load project configurations. Critical paths like `data_path` and `image_dir` need to be correctly set for the Kaggle environment.

In [None]:
# Path to the configuration file (relative to this notebook)
CONFIG_PATH = "../config/reasoning_config.yaml"
PROMPTS_PATH = "../config/reasoning_prompts.yaml"

def load_config(path):
    with open(path, 'r') as f:
        return yaml.safe_load(f)

print(f"Loading configuration from: {CONFIG_PATH}")
config = load_config(CONFIG_PATH)
print(f"Loading prompts from: {PROMPTS_PATH}")
prompts = load_config(PROMPTS_PATH)

# --- CRITICAL: REVIEW AND UPDATE THESE PATHS FOR KAGGLE ---
print("\n--- CONFIGURATION REQUIRES YOUR ATTENTION ---")
print(f"Current data_path: {config.get('data_path')}")
print("ACTION: Update 'data_path' in reasoning_config.yaml to the Kaggle path for Salesforce Parquet files.")

print(f"Current image_dir: {config.get('image_dir')}")
print("ACTION: Update 'image_dir' in reasoning_config.yaml to the Kaggle path for images, if applicable.")

print(f"Current model_name (for reasoning): {config.get('model_name')}")
print(f"API URL (for reasoning model): {config.get('api_url')}")
print("---------------------------------------------")

# Example of how you might set the Kaggle dataset path if you know it
# KAGGLE_SALESFORCE_OCR_DATASET_PATH = "/kaggle/input/salesforce-blip3-ocr-200m"
# config['data_path'] = os.path.join(KAGGLE_SALESFORCE_OCR_DATASET_PATH, "parquet_files_subfolder_if_any")
# config['image_dir'] = os.path.join(KAGGLE_SALESFORCE_OCR_DATASET_PATH, "image_files_subfolder_if_any")
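
The commented-out override above can be made explicit with a small helper. The following is a minimal sketch, assuming the dataset is mounted under a single root directory. The helper names `apply_kaggle_paths` and `missing_keys` and the subfolder arguments are hypothetical; the key names (`data_path`, `image_dir`, `model_name`, `api_url`) are the ones this notebook reads from the config.

```python
import os

def apply_kaggle_paths(config, dataset_root, parquet_subdir="", image_subdir=""):
    """Return a copy of config with data_path and image_dir pointed at dataset_root.

    Hypothetical helper: the subfolder names depend on how the dataset is mounted.
    """
    updated = dict(config)  # shallow copy so the loaded config is left untouched
    updated["data_path"] = os.path.join(dataset_root, parquet_subdir) if parquet_subdir else dataset_root
    updated["image_dir"] = os.path.join(dataset_root, image_subdir) if image_subdir else dataset_root
    return updated

def missing_keys(config, required=("data_path", "image_dir", "model_name", "api_url")):
    """List the required keys that are absent or empty in config."""
    return [key for key in required if not config.get(key)]
```

Calling `missing_keys(config)` right after loading gives a quick list of settings that still need attention before any data is read.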

## 4. Load Salesforce BLIP3-OCR-200M Dataset

Load the dataset. This might involve using `load_dataset` from Hugging Face `datasets` or directly reading Parquet files if downloaded via `huggingface_hub` or available through Kaggle datasets.

In [None]:
# Option 1: Using Hugging Face datasets library (if dataset is directly loadable and internet is available)
# try:
#     print("Attempting to load dataset via Hugging Face datasets library...")
#     ocr_dataset = load_dataset("Salesforce/blip3-ocr-200m", split='train') # Or specific splits
#     print("Dataset loaded successfully via load_dataset.")
#     print(ocr_dataset.info)
# except Exception as e:
#     print(f"Could not load dataset using load_dataset: {e}")
#     print("Falling back to manual download/load if paths are configured.")
#     ocr_dataset = None

# Option 2: Manually specify path if downloaded or using Kaggle datasets
# This assumes config['data_path'] has been correctly updated for your Kaggle environment.

SALESFORCE_DATA_PATH = config.get('data_path', None)
# IMPORTANT: Update `data_path` in reasoning_config.yaml to your Kaggle path.
# For local testing, you might set it directly here, e.g.:
# SALESFORCE_DATA_PATH = "/path/to/your/downloaded/salesforce_ocr_data/001.parquet"
# OR if it's a directory: SALESFORCE_DATA_PATH = "/path/to/your/downloaded/salesforce_ocr_data/"

raw_ocr_df = None

if SALESFORCE_DATA_PATH and os.path.exists(SALESFORCE_DATA_PATH):
    print(f"Attempting to load data from: {SALESFORCE_DATA_PATH}")
    if os.path.isdir(SALESFORCE_DATA_PATH):
        parquet_files = sorted([os.path.join(SALESFORCE_DATA_PATH, f) for f in os.listdir(SALESFORCE_DATA_PATH) if f.endswith('.parquet')])
        if parquet_files:
            print(f"Found {len(parquet_files)} Parquet files. Loading the first one: {parquet_files[0]}")
            try:
                raw_ocr_df = pd.read_parquet(parquet_files[0])
                print(f"Loaded dataframe from {parquet_files[0]}. Shape: {raw_ocr_df.shape}")
            except Exception as e:
                print(f"Error loading Parquet file {parquet_files[0]}: {e}")
        else:
            print(f"No Parquet files found in directory: {SALESFORCE_DATA_PATH}.")
    elif SALESFORCE_DATA_PATH.endswith('.parquet'):
        try:
            raw_ocr_df = pd.read_parquet(SALESFORCE_DATA_PATH)
            print(f"Loaded dataframe from {SALESFORCE_DATA_PATH}. Shape: {raw_ocr_df.shape}")
        except Exception as e:
            print(f"Error loading Parquet file {SALESFORCE_DATA_PATH}: {e}")
    else:
        print(f"Path '{SALESFORCE_DATA_PATH}' is not a recognized directory or .parquet file.")
elif SALESFORCE_DATA_PATH:
    print(f"Path '{SALESFORCE_DATA_PATH}' does not exist. Please verify.")
else:
    print("'data_path' not configured in reasoning_config.yaml or is empty.")

if raw_ocr_df is not None:
    print("\nFirst 5 rows of the loaded data:")
    print(raw_ocr_df.head())
    print("\nColumns in the DataFrame:", raw_ocr_df.columns.tolist())
else:
    print("\nNo data loaded. Cannot proceed with exploration.")
    raw_ocr_df = pd.DataFrame() # Ensure it's a DataFrame to prevent errors
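
The directory-or-file check above is repeated in the next cell, so it may be worth factoring it into one function. The sketch below uses only the standard library; `list_parquet_files` is a hypothetical helper name, not part of the project code.

```python
from pathlib import Path

def list_parquet_files(path):
    """Return a sorted list of .parquet file paths for a directory or a single file.

    Returns an empty list when the path is unset, missing, or not a Parquet file.
    """
    if not path:
        return []
    p = Path(path)
    if p.is_dir():
        return sorted(str(f) for f in p.iterdir() if f.is_file() and f.suffix == ".parquet")
    if p.is_file() and p.suffix == ".parquet":
        return [str(p)]
    return []
```

With this helper, loading the first shard reduces to checking that the returned list is non-empty and passing its first element to `pd.read_parquet`.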

### 4.1. Specify Parquet File and Load Data

Ensure `SALESFORCE_DATA_PATH` (set in the previous cell from `data_path` in `reasoning_config.yaml`) points to the directory containing your Parquet files in the Kaggle environment. We'll load one file to start.

In [None]:
# Assumes SALESFORCE_DATA_PATH is a directory containing the .parquet files.
# It was defined in the previous cell from config['data_path'].

# Let's list available parquet files if the path is a directory
parquet_file_to_load = None
if SALESFORCE_DATA_PATH and os.path.exists(SALESFORCE_DATA_PATH) and os.path.isdir(SALESFORCE_DATA_PATH):
    all_parquet_files = sorted([os.path.join(SALESFORCE_DATA_PATH, f) for f in os.listdir(SALESFORCE_DATA_PATH) if f.endswith('.parquet')])
    if all_parquet_files:
        parquet_file_to_load = all_parquet_files[0] # Load the first file
        print(f"Found {len(all_parquet_files)} Parquet files. Will load: {parquet_file_to_load}")
    else:
        print(f"ERROR: No .parquet files found in directory: {SALESFORCE_DATA_PATH}")
else:
    print(f"ERROR: SALESFORCE_DATA_PATH ('{SALESFORCE_DATA_PATH}') is not a valid directory or does not exist. Please check 'data_path' in reasoning_config.yaml.")

# Load the selected Parquet file into a Pandas DataFrame
df = None
if parquet_file_to_load:
    try:
        df = pd.read_parquet(parquet_file_to_load)
        print(f"Successfully loaded {parquet_file_to_load}")
    except Exception as e:
        print(f"Error loading Parquet file {parquet_file_to_load}: {e}")

if df is not None:
    print("\nDataFrame Info:")
    df.info()
    print("\nDataFrame Head:")
    print(df.head())

### 4.2. Explore a Sample Row

Let's examine the content of a single row, focusing on `uid`, `url`, `captions`, and `metadata`.

In [None]:
def explore_sample_row(dataframe, index=0):
    if dataframe is None or dataframe.empty:
        print("DataFrame is not loaded or is empty.")
        return
    if index >= len(dataframe):
        print(f"Index {index} is out of bounds for DataFrame of length {len(dataframe)}.")
        return

    sample = dataframe.iloc[index]
    print(f"--- Exploring Sample Row (Index: {index}) ---")
    print(f"UID: {sample.get('uid')}")
    print(f"Key: {sample.get('key')}")
    image_url = sample.get('url')
    print(f"Image URL: {image_url}")

    print("\n--- Parsed Captions (from 'captions' field) ---")
    captions_str = sample.get('captions')
    if captions_str:
        try:
            captions_list = json.loads(captions_str)
            for i, cap_obj in enumerate(captions_list):
                print(f"  Caption Entry {i}:")
                print(f"    Granularity: {cap_obj.get('granularity')}")
                print(f"    Include DataComp Raw Cap: {cap_obj.get('include_datacomp_raw_cap')}")
                # Limit printing of very long text fields
                ocr_text = cap_obj.get('text', '')
                print(f"    Text: {ocr_text[:500] + ('...' if len(ocr_text) > 500 else '')}")
        except json.JSONDecodeError as e:
            print(f"  Error parsing 'captions' JSON: {e}")
            print(f"  Raw captions string: {captions_str}")
    else:
        print("  'captions' field is missing or empty.")

    print("\n--- Parsed Metadata (from 'metadata' field) ---")
    metadata_str = sample.get('metadata')
    if metadata_str:
        try:
            metadata_obj = json.loads(metadata_str)
            print(f"  Length (tokens): {metadata_obj.get('length')}")
            print(f"  OCR UID: {metadata_obj.get('uid')}")
            entries = metadata_obj.get('entries', [])
            print(f"  Number of token entries: {len(entries)}")
            if entries:
                print("    First 3 token entries (if available):")
                for i, entry in enumerate(entries[:3]):
                    print(f"      Token {i}: Text='{entry.get('text')}', Confidence={entry.get('confidence')}, BBox={entry.get('bbox')}")
        except json.JSONDecodeError as e:
            print(f"  Error parsing 'metadata' JSON: {e}")
            print(f"  Raw metadata string: {metadata_str}")
    else:
        print("  'metadata' field is missing or empty.")

if df is not None:
    explore_sample_row(df, 0) # Explore the first row
    if len(df) > 1:
        explore_sample_row(df, 1) # Explore the second row if available
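
To see at a glance which granularity levels a row offers, the parsed caption entries can be grouped by their `granularity` value. This is a minimal sketch: `captions_by_granularity` is a hypothetical helper, and the sample string is synthetic data written for illustration that only mirrors the field names used above (`granularity`, `include_datacomp_raw_cap`, `text`).

```python
import json

def captions_by_granularity(captions_str):
    """Group parsed caption entries by their 'granularity' value."""
    try:
        entries = json.loads(captions_str) if captions_str else []
    except (json.JSONDecodeError, TypeError):
        return {}
    if not isinstance(entries, list):
        return {}
    grouped = {}
    for entry in entries:
        if isinstance(entry, dict):  # skip anything that is not a caption object
            grouped.setdefault(entry.get("granularity"), []).append(entry)
    return grouped

# Synthetic example, not real dataset content.
sample_captions = json.dumps([
    {"granularity": 0, "include_datacomp_raw_cap": False, "text": "The image contains the text HELLO"},
    {"granularity": 0, "include_datacomp_raw_cap": True, "text": "A sign. The image contains the text HELLO"},
    {"granularity": 5, "include_datacomp_raw_cap": False, "text": "<ocr>HELLO</ocr><bbox>[1, 2, 3, 4]</bbox>"},
])
grouped = captions_by_granularity(sample_captions)
for level, items in sorted(grouped.items()):
    print(f"granularity {level}: {len(items)} entries")
```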

### 4.3. Formulate Simple Question & Answer from OCR Data

Based on the exploration, we'll select an OCR text (e.g., from a specific granularity) to serve as the 'ground-truth answer' for a generic OCR question.

In [None]:
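import re  # used by the granularity-5 cleanup; imported here so it exists before the function is called

def get_ocr_text_for_qna(sample_row, desired_granularity=0, prefer_no_raw_datacomp=True):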
    if sample_row is None:
        return None
    
    captions_str = sample_row.get('captions')
    if not captions_str:
        return None
    
    try:
        captions_list = json.loads(captions_str)
        # Try to find the exact granularity, optionally filtering by include_datacomp_raw_cap
        for cap_obj in captions_list:
            if cap_obj.get('granularity') == desired_granularity:
                if prefer_no_raw_datacomp and cap_obj.get('include_datacomp_raw_cap') == True:
                    continue
                return cap_obj.get('text')
        
        # Fallback: if exact granularity with preference not found, try without preference
        if prefer_no_raw_datacomp:
            for cap_obj in captions_list:
                if cap_obj.get('granularity') == desired_granularity:
                    return cap_obj.get('text')
        
        # Fallback: if desired_granularity is not found at all, try granularity 0, then granularity 5
        for cap_obj in captions_list:
            if cap_obj.get('granularity') == 0:
                if prefer_no_raw_datacomp and cap_obj.get('include_datacomp_raw_cap') == True:
                    continue
                return cap_obj.get('text')
        for cap_obj in captions_list:
            if cap_obj.get('granularity') == 5:
                if prefer_no_raw_datacomp and cap_obj.get('include_datacomp_raw_cap') == True:
                    continue
                # Strip the XML-like <ocr>/<bbox> tags from granularity 5 for a cleaner answer
                text_g5 = cap_obj.get('text')
                if text_g5:
                    return re.sub(r'<ocr>([^<]+)</ocr><bbox>[^<]+</bbox>', r'\1', text_g5).strip()
                return None  # The entry exists but its text is missing or empty
                     
        # Final fallback: return the first caption's text if any
        if captions_list:
            return captions_list[0].get('text')
            
    except json.JSONDecodeError:
        return None
    return None

if df is not None and not df.empty:
    sample_for_qna = df.iloc[0]
    
    # --- Configuration for Q&A Generation ---
    # You can change this to experiment with different OCR outputs
    # Granularity 0: Basic text extraction.
    # Granularity 5: Text with OCR tags and bounding boxes (will be cleaned by get_ocr_text_for_qna).
    CHOSEN_GRANULARITY_FOR_ANSWER = 0 
    # --- End Configuration ---

    ground_truth_ocr_text = get_ocr_text_for_qna(sample_for_qna, CHOSEN_GRANULARITY_FOR_ANSWER)
    
    if ground_truth_ocr_text:
        question = "What is all the text visible in this image?"
        image_uid = sample_for_qna.get('uid')
        image_url_for_qna = sample_for_qna.get('url')
        
        print(f"--- Generated Q&A for Image UID: {image_uid} ---")
        print(f"Image URL: {image_url_for_qna}")
        print(f"Question: {question}")
        print(f"Ground-Truth Answer (from Granularity {CHOSEN_GRANULARITY_FOR_ANSWER}, cleaned):\n{ground_truth_ocr_text}")
        
        # This is the data structure one item of your preprocessed list would look like for multimodal_QRA_pair.py
        # (assuming multimodal_QRA_pair.py's main data loading is bypassed or adapted)
        prepared_data_item = {
            'process_id': image_uid, # Or some other unique ID
            'Open-ended Verifiable Question': question,
            'Ground-True Answer': ground_truth_ocr_text,
            'img_urls': [image_url_for_qna] # multimodal_QRA_pair.py expects a list of URLs
        }
        print("\nPrepared data item structure for the reasoning script:")
        print(json.dumps(prepared_data_item, indent=2))
    else:
        print(f"Could not extract suitable OCR text for Q&A from sample row (UID: {sample_for_qna.get('uid')}).")
else:
    print("DataFrame not loaded or empty, skipping Q&A formulation.")

# Make sure re is imported if not already
import re
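
The granularity-5 cleanup in `get_ocr_text_for_qna` is easier to follow in isolation. The sketch below reuses the same regular expression on a synthetic string. The tagged format shown here is an assumption inferred from that pattern, not a verified sample from the dataset, and `strip_ocr_tags` is a hypothetical helper name.

```python
import re

# Same pattern as the granularity-5 fallback in get_ocr_text_for_qna.
OCR_TAG_PATTERN = re.compile(r"<ocr>([^<]+)</ocr><bbox>[^<]+</bbox>")

def strip_ocr_tags(text):
    """Replace each <ocr>...</ocr><bbox>...</bbox> pair with just the OCR text."""
    return OCR_TAG_PATTERN.sub(r"\1", text).strip()

# Synthetic example written for illustration.
tagged = "<ocr>OPEN</ocr><bbox>[10, 20, 90, 40]</bbox> <ocr>24 HOURS</ocr><bbox>[10, 50, 120, 70]</bbox>"
print(strip_ocr_tags(tagged))  # prints: OPEN 24 HOURS
```

Because the pattern uses `[^<]+`, a tag pair whose OCR text itself contains `<` is left unchanged, which is worth keeping in mind when checking cleaned answers.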

## 5. OCR Data Processing and Reasoning Data Generation

This section will use the `multimodal_QRA_pair.py` script's functionalities.

**CRITICAL BLOCKER:** The `multimodal_QRA_pair.py` script is currently missing. Its logic needs to be integrated or callable from here.

In [None]:
# Placeholder for functionalities from multimodal_QRA_pair.py
# Once multimodal_QRA_pair.py is available, its functions for processing OCR data
# and generating question-reasoning-answer pairs will be called here.

print("ACTION: Provide the 'multimodal_QRA_pair.py' script.")
print("This section cannot be implemented without it.")

# Example structure (highly speculative without the script):
# def process_single_image_ocr(image_path, ocr_data_from_parquet_row):
#     # ... logic from multimodal_QRA_pair.py ...
#     pass

# def generate_reasoning_trace(image_path, question, ocr_text):
#     # ... logic using prompts from reasoning_prompts.yaml and model from reasoning_config.yaml ...
#     pass

# for index, row in tqdm(df.iterrows(), total=len(df)):
#     # df is the pandas DataFrame loaded from Parquet in Section 4.1
#     image_id = row['uid'] # or 'key'
#     image_url = row['url']
#     captions_json = row['captions'] # This is a JSON string
#     metadata_json = row['metadata'] # This is a JSON string
    
#     # Parse the JSON strings
#     captions_data = json.loads(captions_json)
#     metadata_data = json.loads(metadata_json)
    
#     # Determine image path (this needs robust handling based on how images are stored/accessed in Kaggle)
#     # current_image_path = os.path.join(config.get('image_dir'), f"{image_id}.jpg") # Example, actual extension might vary
    
#     # Select desired OCR granularity from captions_data
#     # For example, granularity 5 (text with OCR tags and bounding boxes)
#     selected_ocr_text = None
#     for cap in captions_data:
#         if cap['granularity'] == 5 and not cap['include_datacomp_raw_cap']:
#             selected_ocr_text = cap['text']
#             break
            
#     if not selected_ocr_text:
#         # Fallback or skip if desired granularity not found
#         # print(f"Granularity 5 not found for {image_id}")
#         continue
        
#     # TODO: Define a question for the image/OCR data
#     # question = "Describe the main subject and any prominent text in this image."
    
#     # if os.path.exists(current_image_path):
#     #     reasoning_data = generate_reasoning_trace(current_image_path, question, selected_ocr_text)
#     #     # Save or process reasoning_data
#     # else:
#     #     # print(f"Image not found: {current_image_path}")
#     pass # End of loop placeholder
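
Until `multimodal_QRA_pair.py` is available, the prepared items from Section 4.3 can still be collected and saved so that the reasoning step has something to consume later. This is a minimal sketch under stated assumptions: it assumes the script can read a JSON list of items in the structure shown in Section 4.3, which has not been confirmed, and `build_prepared_item` and `write_prepared_items` are hypothetical helper names.

```python
import json

DEFAULT_QUESTION = "What is all the text visible in this image?"

def build_prepared_item(uid, url, answer, question=DEFAULT_QUESTION):
    """Build one item in the structure shown in Section 4.3."""
    return {
        "process_id": uid,
        "Open-ended Verifiable Question": question,
        "Ground-True Answer": answer,
        "img_urls": [url],  # a list, as noted in Section 4.3
    }

def write_prepared_items(rows, output_path):
    """Write items for an iterable of (uid, url, answer) tuples to a JSON file.

    Rows with a missing or empty answer are skipped. Returns the number written.
    """
    items = [build_prepared_item(uid, url, answer) for uid, url, answer in rows if answer]
    with open(output_path, "w", encoding="utf-8") as f:
        json.dump(items, f, ensure_ascii=False, indent=2)
    return len(items)
```

In the notebook, `rows` could be produced by iterating over `df` and calling `get_ocr_text_for_qna` on each row to obtain the answer.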