<a href="https://www.kaggle.com/code/nikita2998/structured-ocr-h2ovl-mississippi?scriptVersionId=230170678" target="_blank"><img align="left" alt="Kaggle" title="Open in Kaggle" src="https://kaggle.com/static/images/open-in-kaggle.svg"></a>

In [1]:
!pip install einops timm peft sentencepiece



In [2]:
!pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu118

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cu118


In [3]:
#!apt install g++

In [4]:
#!pip install flash-attn==2.7.3 --no-build-isolation

In [5]:
import os
import glob
import json
import torch
import logging
from transformers import AutoModel, AutoTokenizer
from tqdm import tqdm  # for progress bar

# Disable FLASH_ATTN if needed
os.environ["FLASH_ATTN_DISABLE"] = "1"

# -------------------------------
# 1. Set up logging configuration
# -------------------------------
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[logging.StreamHandler()]
)
logger = logging.getLogger(__name__)

# -------------------------------
# 2. Set up the model and tokenizer for GPU
# -------------------------------
model_path = 'h2oai/h2ovl-mississippi-2b'
model = AutoModel.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).eval().cuda()  # Running on GPU

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
generation_config = dict(max_new_tokens=1024, do_sample=True)

# -------------------------------
# 3. Define the prompt template for structured extraction
# -------------------------------
prompt = ("""<image>\n
Please extract the following information in JSON format:
- Company Name
- Address
- Contact Information (Telephone, Email)
- Date (DD/MM/YYYY format)
- Time (HH:MM:SS format)
- Product Details (for each product: Name, ID, Price, Quantity)
- Total Amount
- Amount Paid
- Change Returned 
Do not include json tags like ```json or ``` in the output. Just return the json object.
Always return the output as a valid JSON object with keys corresponding to these fields.
""")

# -------------------------------
# 4. Function to process one folder of images in batches and save JSON output to a specified output folder
# -------------------------------
def process_folder(input_folder, output_folder, batch_size=4):
    image_files = glob.glob(os.path.join(input_folder, "*.png"))
    logger.info(f"Found {len(image_files)} images in {input_folder}.")
    
    # Process images in batches using tqdm for progress tracking
    for i in tqdm(range(0, len(image_files), batch_size), desc=f"Processing {os.path.basename(input_folder)}"):
        batch_files = image_files[i:i+batch_size]
        logger.info(f"Processing batch: {batch_files}")
        try:
            # Pass the list of image paths to model.chat for batch processing
            response, history = model.chat(
                tokenizer, batch_files, prompt, generation_config, history=None, return_history=True
            )
        except Exception as e:
            logger.error(f"Error processing batch {batch_files}: {e}")
            continue
        
        # Assume that 'response' is a list of strings corresponding to each image in the batch
        for file, resp in zip(batch_files, response):
            logger.info(f"Raw response for {file}: {resp}")
            try:
                r = resp.strip()
                if '```json' in r:
                    r = r.split('```json')[1].split('```')[0].strip()
                extracted_data = json.loads(r)
            except Exception as e:
                logger.warning(f"Error parsing JSON for {file}: {e}")
                extracted_data = {"extracted_text": r}
            
            base_name = os.path.splitext(os.path.basename(file))[0]
            output_file = os.path.join(output_folder, f"{base_name}.json")
            with open(output_file, "w", encoding="utf-8") as f:
                json.dump(extracted_data, f, indent=4)
            logger.info(f"Saved extracted data to {output_file}")

# -------------------------------
# 5. Process dataset folders: train, val, and test and save outputs accordingly
# -------------------------------
dataset_base = "/kaggle/input/find-it-again-dataset/findit2"  # Update with your dataset path
output_base = "/kaggle/working/extracted_jsons"  # Base output folder for JSON files

for split in ["val", "test","train"]:
    input_folder = os.path.join(dataset_base, split)
    output_folder = os.path.join(output_base, split)
    if os.path.isdir(input_folder):
        os.makedirs(output_folder, exist_ok=True)
        logger.info(f"\nProcessing folder: {input_folder}")
        process_folder(input_folder, output_folder, batch_size=4)
    else:
        logger.error(f"Folder {input_folder} does not exist.")


config.json:   0%|          | 0.00/3.61k [00:00<?, ?B/s]

configuration_h2ovl_chat.py:   0%|          | 0.00/3.92k [00:00<?, ?B/s]

configuration_intern_vit.py:   0%|          | 0.00/5.55k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/h2oai/h2ovl-mississippi-2b:
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/h2oai/h2ovl-mississippi-2b:
- configuration_h2ovl_chat.py
- configuration_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_h2ovl_chat.py:   0%|          | 0.00/15.1k [00:00<?, ?B/s]

image_process.py:   0%|          | 0.00/6.92k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/h2oai/h2ovl-mississippi-2b:
- image_process.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


conversation.py:   0%|          | 0.00/5.16k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/h2oai/h2ovl-mississippi-2b:
- conversation.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


modeling_intern_vit.py:   0%|          | 0.00/18.3k [00:00<?, ?B/s]

A new version of the following files was downloaded from https://huggingface.co/h2oai/h2ovl-mississippi-2b:
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.
A new version of the following files was downloaded from https://huggingface.co/h2oai/h2ovl-mississippi-2b:
- modeling_h2ovl_chat.py
- image_process.py
- conversation.py
- modeling_intern_vit.py
. Make sure to double-check they do not contain any added malicious code. To avoid downloading new versions of the code file, you can pin a revision.


FlashAttention is not installed.


model.safetensors:   0%|          | 0.00/4.30G [00:00<?, ?B/s]



generation_config.json:   0%|          | 0.00/161 [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/3.31k [00:00<?, ?B/s]

tokenizer.model:   0%|          | 0.00/493k [00:00<?, ?B/s]

added_tokens.json:   0%|          | 0.00/199 [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

Processing val: 100%|██████████| 48/48 [02:47<00:00,  3.48s/it]
Processing test: 100%|██████████| 55/55 [03:22<00:00,  3.68s/it]
Processing train: 100%|██████████| 145/145 [09:33<00:00,  3.96s/it]
