<a href="https://colab.research.google.com/github/haggarwal90/invoice-fusion-ai/blob/main/nanonets_ocr2_3b.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

```markdown
# Invoice Data Extraction using Nanonets OCR and LLM

This notebook demonstrates an end-to-end pipeline for extracting structured data from PDF invoices. It leverages the Nanonets OCR2-3B model for Optical Character Recognition (OCR) to convert PDF content into text, and then uses a Large Language Model (LLM) (specifically, a Groq-hosted OpenAI-compatible model) to extract specific fields from the OCR'd text. The extracted data is then saved into a CSV file.

## Workflow:
1.  **Setup and Authentication**: Install necessary libraries, mount Google Drive, and log in to Hugging Face and Groq.
2.  **Model Loading**: Load the Nanonets OCR2-3B model and its associated tokenizer and processor.
3.  **Utility Functions**: Define functions to read PDFs, perform OCR, list files in a directory, and save data to CSV.
4.  **LLM Configuration**: Set up the LLM with a system and user prompt for structured data extraction.
5.  **Data Processing**: Iterate through PDF files, perform OCR, extract data using the LLM, and save the results.
```

```markdown
## Prerequisites

Before running this notebook, please ensure the following:

1.  **API Keys in Colab Secrets**:
    -   `HF_TOKEN`: Your Hugging Face API token, stored as a Colab secret named `HF_TOKEN`.
    -   `GROQ_API_KEY`: Your Groq API key, stored as a Colab secret named `GROQ_API_KEY`.
    (You can add these secrets by clicking the '🔑' icon on the left panel in Colab and selecting 'Add a new secret').

2.  **Google Drive Folder Structure**:
    -   Create a folder named `invoices` in your Google Drive's root directory (`My Drive`).
    -   Inside the `invoices` folder, create another subfolder named `sample`.
    -   Place all the PDF invoice files you wish to process into the `/content/drive/MyDrive/invoices/sample` directory.

    *Example Path:* `/content/drive/MyDrive/invoices/sample/invoice1.pdf`
```

```markdown
## 1. Setup and Authentication

This section handles the imports and authentication needed to access Google Drive, Hugging Face models, and the Groq API.

- `google.colab.drive`: Used to mount Google Drive for accessing files.
- `huggingface_hub`: Used for logging into Hugging Face to download pre-trained models.
- `google.colab.userdata`: Securely retrieves API keys/tokens stored in Colab's secrets manager.
```

In [None]:
from google.colab import drive
from huggingface_hub import login
from google.colab import userdata

```markdown
### 1.1 Mount Google Drive and Login to Hugging Face

This cell mounts your Google Drive to `/content/drive`, allowing the notebook to read from and write to your Drive. It also logs into Hugging Face using a token stored in Colab's user data, which is required for downloading the Nanonets OCR model.
```

In [None]:
# mount gdrive
drive.mount('/content/drive')

# Sign in to Hugging Face
hf_token = userdata.get('HF_TOKEN')
login(token=hf_token, add_to_git_credential=True)

Mounted at /content/drive


```markdown
### 1.2 Set Environment Variables

This cell sets the `TRANSFORMERS_VERBOSITY` environment variable to `info`, which provides more detailed logging from the Hugging Face Transformers library during model loading and usage. This can be helpful for debugging.
```

In [None]:
import os
os.environ["TRANSFORMERS_VERBOSITY"] = "info"
print("TRANSFORMERS_VERBOSITY set to:", os.environ.get("TRANSFORMERS_VERBOSITY"))

TRANSFORMERS_VERBOSITY set to: info


```markdown
## 2. Model Loading

This section loads the `nanonets/Nanonets-OCR2-3B` model, tokenizer, and processor from Hugging Face. This model is specialized for Image-to-Text tasks, making it suitable for OCR.

- `AutoTokenizer`: Loads the tokenizer associated with the model.
- `AutoProcessor`: Loads the image processor for preparing images for the model.
- `AutoModelForImageTextToText`: Loads the pre-trained OCR model.

`torch_dtype="auto"` loads the weights in the dtype recorded in the model's config (bfloat16 for this model), and `device_map="auto"` places the model on the available hardware (e.g., GPU).
```

In [None]:
from PIL import Image
from transformers import AutoTokenizer, AutoProcessor, AutoModelForImageTextToText

model_path = "nanonets/Nanonets-OCR2-3B"

model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto"
)
model.eval()

tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path)

JAX version 0.7.2, Flax version 0.10.7 available.


loading configuration file config.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/config.json
Model config Qwen2_5_VLConfig {
  "architectures": [
    "Qwen2_5_VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "dtype": "bfloat16",
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 2048,
  "initializer_range": 0.02,
  "intermediate_size": 11008,
  "max_position_embeddings": 128000,
  "max_window_layers": 70,
  "model_type": "qwen2_5_vl",
  "num_attention_heads": 16,
  "num_hidden_layers": 36,
  "num_key_value_heads": 2,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "text_config": {
    "_name_or_path": "nanonets/Nanonets-OCR2-3B",
    "architectures": [
      "Qwen2_5_VLForCondition

loading weights file model.safetensors from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/model.safetensors.index.json


Will use dtype=torch.bfloat16 as defined in model's config object
Instantiating Qwen2_5_VLForConditionalGeneration model under default dtype torch.bfloat16.
Generate config GenerationConfig {
  "bos_token_id": 151643,
  "eos_token_id": 151645
}

Instantiating Qwen2_5_VisionTransformerPretrainedModel model under default dtype torch.bfloat16.
Instantiating Qwen2_5_VLTextModel model under default dtype torch.bfloat16.


loading configuration file generation_config.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/generation_config.json
Generate config GenerationConfig {
  "bos_token_id": 151643,
  "do_sample": true,
  "eos_token_id": [
    151645,
    151643
  ],
  "pad_token_id": 151643,
  "repetition_penalty": 1.05,
  "temperature": 1e-06
}

Could not locate the custom_generate/generate.py inside nanonets/Nanonets-OCR2-3B.


loading file vocab.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/vocab.json
loading file merges.txt from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/merges.txt
loading file tokenizer.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/tokenizer.json
loading file added_tokens.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/added_tokens.json
loading file special_tokens_map.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/special_tokens_map.json
loading file tokenizer_config.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00

loading configuration file preprocessor_config.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/preprocessor_config.json
loading configuration file preprocessor_config.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/preprocessor_config.json
Image processor Qwen2VLImageProcessorFast {
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "disable_grouping": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_pad": null,
  "do_rescale": true,
  "do_resize": true,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_processor_type": "Qwen2VLImageProcessorFast",
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "input_data_format": null,
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_pi

loading configuration file video_preprocessor_config.json from cache at /root/.cache/huggingface/hub/models--nanonets--Nanonets-OCR2-3B/snapshots/c3886ff00bb037ce7da24988c9eafaf1fe2bed72/video_preprocessor_config.json
Video processor Qwen2VLVideoProcessor {
  "crop_size": null,
  "data_format": "channels_first",
  "default_to_square": true,
  "device": null,
  "do_center_crop": null,
  "do_convert_rgb": true,
  "do_normalize": true,
  "do_pad": null,
  "do_rescale": true,
  "do_resize": true,
  "do_sample_frames": false,
  "fps": null,
  "image_mean": [
    0.48145466,
    0.4578275,
    0.40821073
  ],
  "image_std": [
    0.26862954,
    0.26130258,
    0.27577711
  ],
  "input_data_format": null,
  "max_frames": 768,
  "max_pixels": 12845056,
  "merge_size": 2,
  "min_frames": 4,
  "min_pixels": 3136,
  "num_frames": null,
  "pad_size": null,
  "patch_size": 14,
  "processor_class": "Qwen2_5_VLProcessor",
  "resample": 3,
  "rescale_factor": 0.00392156862745098,
  "return_metadata":

```markdown
## 3. Utility Functions

This section defines several helper functions required for the overall workflow, including PDF processing, OCR, and file system operations.

### 3.1 Install PyMuPDF

`PyMuPDF` (imported as `fitz`) is a library for working with PDF documents. This cell installs it so we can convert PDF pages into images.
```

In [None]:
!pip install PyMuPDF

Collecting PyMuPDF
  Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.7-cp310-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: PyMuPDF
Successfully installed PyMuPDF-1.26.7


```markdown
### 3.2 PDF to Image Conversion Function

This function, `read_pdf_to_image`, takes a PDF file path and an optional page number, then converts the specified PDF page into a PIL (Pillow) Image object. This image can then be fed to the OCR model.
```

In [None]:
import fitz  # PyMuPDF
from PIL import Image

def read_pdf_to_image(pdf_path, page_number=0):
    """
    Reads a specific page from a PDF file and converts it into a PIL Image object.

    Args:
        pdf_path (str): The path to the PDF file.
        page_number (int): The 0-indexed page number to convert. Defaults to the first page.

    Returns:
        PIL.Image.Image or None: A PIL Image object of the specified PDF page,
        or None if the PDF could not be read or the page number is out of bounds
        (the error is printed rather than raised).
    """
    try:
        # The context manager closes the document even if rendering fails.
        with fitz.open(pdf_path) as doc:
            if not (0 <= page_number < doc.page_count):
                raise ValueError(f"Page number {page_number} is out of bounds. PDF has {doc.page_count} pages.")

            page = doc.load_page(page_number)
            # alpha=False guarantees 3-channel samples that match the "RGB" mode below.
            pix = page.get_pixmap(alpha=False)
            return Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
    except Exception as e:
        print(f"Error reading PDF or converting to image: {e}")
        return None
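
A note on resolution: `page.get_pixmap()` with no arguments renders at PyMuPDF's default of 72 dpi, which may leave small invoice print too coarse for OCR. The sketch below only does the arithmetic, showing how a target DPI translates into a zoom factor and a pixel size. `zoom_for_dpi` and `rendered_size` are hypothetical helper names, and 200 dpi is an arbitrary illustrative value rather than a setting taken from this notebook.

```python
import math

PDF_POINTS_PER_INCH = 72  # PDF user space: 1 point = 1/72 inch

def zoom_for_dpi(dpi):
    """Return the scale factor that renders a PDF page at the given DPI."""
    if dpi <= 0:
        raise ValueError("dpi must be positive")
    return dpi / PDF_POINTS_PER_INCH

def rendered_size(width_pt, height_pt, dpi):
    """Approximate pixel size of a page (given in points) rendered at `dpi`."""
    zoom = zoom_for_dpi(dpi)
    return math.ceil(width_pt * zoom), math.ceil(height_pt * zoom)

# An A4 page is roughly 595 x 842 points.
print(rendered_size(595, 842, 72))   # (595, 842)
print(rendered_size(595, 842, 200))  # (1653, 2339)
```

In PyMuPDF the zoom factor would typically be passed as `fitz.Matrix(zoom, zoom)` to `page.get_pixmap(matrix=...)`. Whether the extra resolution improves results on these invoices is untested here, and larger images cost more memory during OCR.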

```markdown
### 3.3 OCR Function with Nanonets Model

The `ocr_page_with_nanonets_s` function performs OCR on a given image using the loaded Nanonets model. It constructs a specific prompt to guide the model to extract text naturally, format tables in HTML, equations in LaTeX, and handle other document-specific elements like image descriptions, watermarks, and page numbers.

- The `prompt` is carefully crafted to get structured output.
- `processor.apply_chat_template`: Formats the messages (image and prompt) into a conversational template understood by the model.
- `model.generate`: Generates the OCR text based on the processed inputs.
```

In [None]:
def ocr_page_with_nanonets_s(img, model, processor, max_new_tokens=4096):
    """Runs OCR on a single page image and returns the extracted text as a string."""
    prompt = """Extract the text from the above document as if you were reading it naturally. Return the tables in html format. Return the equations in LaTeX representation. If there is an image in the document and image caption is not present, add a small description of the image inside the <img></img> tag; otherwise, add the image caption inside <img></img>. Watermarks should be wrapped in brackets. Ex: <watermark>OFFICIAL COPY</watermark>. Page numbers should be wrapped in brackets. Ex: <page_number>14</page_number> or <page_number>9/22</page_number>. Prefer using ☐ and ☑ for check boxes."""
    messages = [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": [
            {"type": "image", "image": img},
            {"type": "text", "text": prompt},
        ]},
    ]
    # Render the chat messages into the prompt string the model expects.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=[text], images=[img], padding=True, return_tensors="pt")
    inputs = inputs.to(model.device)

    # Greedy decoding (do_sample=False) keeps the OCR output deterministic.
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # generate() returns prompt + completion; keep only the newly generated tokens.
    generated_ids = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]

    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=True)
    return output_text[0]
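
`model.generate` returns each sequence as the prompt tokens followed by the newly generated tokens, which is why the function slices off the first `len(input_ids)` entries before decoding. The toy example below shows that slicing on plain Python lists. The token IDs are made up, and `strip_prompt_tokens` is a hypothetical name used only for illustration.

```python
def strip_prompt_tokens(input_ids_batch, output_ids_batch):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [
        output_ids[len(input_ids):]
        for input_ids, output_ids in zip(input_ids_batch, output_ids_batch)
    ]

# Made-up token IDs: each output starts with its own prompt.
prompts = [[101, 7, 8], [101, 9]]
outputs = [[101, 7, 8, 42, 43], [101, 9, 55]]

print(strip_prompt_tokens(prompts, outputs))  # [[42, 43], [55]]
```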

```markdown
### 3.4 Get Filenames from Directory Function

The `get_filenames_from_directory` function is a utility to list all files and subdirectories within a given path. This is used to find the PDF invoices in the specified directory.
```

In [None]:
import os

def get_filenames_from_directory(directory_path):
    """
    Reads all file and directory names from a given directory path.

    Args:
        directory_path (str): The path to the directory.

    Returns:
        list: A list of strings, where each string is the name of a file or directory
              within the specified path.
    """
    if not os.path.isdir(directory_path):
        print(f"Error: Directory not found at {directory_path}")
        return []
    return os.listdir(directory_path)

# Example usage (you can modify this to your specific path):
# path_to_your_directory = "/content/drive/MyDrive/invoices/sample/"
# file_names = get_filenames_from_directory(path_to_your_directory)
# print(file_names)
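
Because `os.listdir` returns every entry in arbitrary order, callers usually filter and sort the result. A minimal sketch, assuming invoices are identified by a `.pdf` extension in any letter case; `list_pdf_files` is a hypothetical helper, not part of the notebook.

```python
import os
import tempfile

def list_pdf_files(directory_path):
    """Return the sorted names of regular files ending in .pdf (any case)."""
    if not os.path.isdir(directory_path):
        return []
    return sorted(
        name
        for name in os.listdir(directory_path)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(directory_path, name))
    )

# Demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "A.PDF", "notes.txt"):
        open(os.path.join(tmp, name), "w").close()
    os.mkdir(os.path.join(tmp, "archive.pdf"))  # a directory, so it is skipped
    print(list_pdf_files(tmp))  # ['A.PDF', 'b.pdf']
```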

```markdown
## 4. LLM Configuration for Data Extraction

This section sets up the Large Language Model (LLM) for extracting specific structured data from the OCR'd invoice text. It uses an OpenAI-compatible API client, configured to point to Groq's API for faster inference.

- `LLM_MODEL`: Specifies the model to be used (e.g., `openai/gpt-oss-120b`).
- `system_prompt`: Defines the role and instructions for the LLM.
- `user_prompt`: Specifies the exact fields to extract and the desired CSV output format.

The `GROQ_API_KEY` is retrieved securely from Colab's `userdata`.
```

In [None]:
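import openai
from google.colab import userdata

LLM_MODEL = 'openai/gpt-oss-120b'

# Groq exposes an OpenAI-compatible endpoint, so the standard OpenAI client works with a custom base_url.
# Note: this rebinds the name `openai` from the module to a client instance, which later cells call directly.
openai = openai.OpenAI(api_key=userdata.get('GROQ_API_KEY'), base_url="https://api.groq.com/openai/v1")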

system_prompt = """
You are a professional accounting data extraction assistant.

You will be given raw invoice text.
Tabular data may appear in HTML-like format.

Your task is to accurately extract both sales and purchase information from the invoice.
Return only structured data as requested. Do not add explanations or extra text.
"""

user_prompt = """
Extract the following fields from the invoice text below.

Output format:
- Delimiter-separated values (CSV-style)
- Use ; as the delimiter
- Maintain the exact field order listed
- If a field value is missing or not found, return NA

Fields (in order):
buyer details,
supplier details,
buyer GSTIN,
supplier GSTIN,
item serial number,
item name,
item description,
HSN,
item quantity,
item unit,
per unit price,
SGST amount,
CGST amount,
total item tax,
total item price

Invoice Text:

"""

```markdown
### 4.1 CSV Utility Functions

These functions are used to save the extracted data.

- `save_invoice_text`: Saves the raw OCR'd text to a `.txt` file.
- `save_to_csv`: Appends lines of data to a CSV file, using `;` as a delimiter.
```

In [None]:
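import csv

def save_invoice_text(file_name, file_text):
  # Write the raw OCR text as UTF-8 so symbols such as ☐ and ☑ survive on any platform.
  with open(file_name, 'w', encoding='utf-8') as f:
    f.write(file_text)

def save_to_csv(file_name, lines):
  # Append mode keeps existing rows; newline='' lets the csv module control line endings.
  with open(file_name, 'a', newline='', encoding='utf-8') as f:
    writer = csv.writer(f, delimiter=';')
    writer.writerows(lines)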
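
Since `save_to_csv` always appends, writing the header through it adds a new header row on every run. One way around that, sketched here under the assumption that a missing or empty file means "no header yet", is a small wrapper. `append_rows_with_header` is a hypothetical name.

```python
import csv
import os
import tempfile

def append_rows_with_header(file_name, header, rows, delimiter=";"):
    """Append rows, writing the header first only if the file is new or empty."""
    needs_header = not os.path.exists(file_name) or os.path.getsize(file_name) == 0
    with open(file_name, "a", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter=delimiter)
        if needs_header:
            writer.writerow(header)
        writer.writerows(rows)

# Two separate calls, as would happen across two runs of the notebook.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "result.csv")
    append_rows_with_header(path, ["file_name", "HSN"], [["a.pdf", "1234"]])
    append_rows_with_header(path, ["file_name", "HSN"], [["b.pdf", "5678"]])
    with open(path, newline="", encoding="utf-8") as f:
        print(list(csv.reader(f, delimiter=";")))
    # [['file_name', 'HSN'], ['a.pdf', '1234'], ['b.pdf', '5678']]
```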

```markdown
### 4.2 Function to Extract Fields Using LLM

The `extract_fields_from_invoice_text` function sends the OCR'd invoice text to the configured LLM with the defined system and user prompts. The LLM then processes the text and returns the extracted fields in the specified CSV format.
```

In [None]:
def extract_fields_from_invoice_text(invoice_text):
  """Sends the OCR'd invoice text to the LLM and returns its raw text response."""
  response = openai.chat.completions.create(
    model=LLM_MODEL,
    messages=[
      {"role": "system", "content": system_prompt},
      {"role": "user", "content": user_prompt + invoice_text}
    ]
  )

  response_text = response.choices[0].message.content
  return response_text
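
Hosted APIs can fail transiently (rate limits, timeouts), and a long batch of invoices makes that more likely. The wrapper below is a generic sketch, not something the notebook does: `call_with_retries`, its parameters, and the choice to retry on any exception are all assumptions made for illustration.

```python
import time

def call_with_retries(func, *args, attempts=3, base_delay=1.0, sleep=time.sleep, **kwargs):
    """Call func(*args, **kwargs), retrying with exponential backoff on failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func(*args, **kwargs)
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))  # wait 1x, 2x, 4x, ... base_delay
```

With this in place, the extraction call could be written as `call_with_retries(extract_fields_from_invoice_text, invoice_text)`. A production version would likely retry only on specific error types.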

```markdown
## 5. Main Data Processing Pipeline

This is the main execution block of the notebook. It orchestrates the entire workflow:

1.  **Define `INVOICES_DIR_PATH`**: Specifies the directory where your PDF invoices are located.
2.  **List PDF Files**: Retrieves the PDF filenames from the specified directory. The cell keeps only the first two files.
3.  **Initialize CSV**: Writes the header row for the extracted data to `result.csv`. The file is opened in append mode, so re-running the cell on an existing file adds another header row.
4.  **Process Each Invoice**: Loops through each selected PDF file:
    -   Reads the first page of the PDF into a PIL Image using `read_pdf_to_image`.
    -   Performs OCR on the image using `ocr_page_with_nanonets_s` to get the raw text.
    -   Saves the raw OCR text to a `.txt` file for review.
    -   Extracts structured fields from the OCR text using the LLM via `extract_fields_from_invoice_text`.
    -   Parses the LLM's CSV output and saves it to the `result.csv` file, prepending the original filename to each extracted row.

This process automates the conversion of unstructured PDF invoice data into a structured, machine-readable CSV format.
```

In [None]:
INVOICES_DIR_PATH = "/content/drive/MyDrive/invoices/sample"
RESULT_CSV_PATH = f"{INVOICES_DIR_PATH}/result.csv"

files = get_filenames_from_directory(INVOICES_DIR_PATH)
files = [file for file in files if file.endswith(".pdf")]
# Process only the first two invoices; remove this line to process every PDF.
files = files[:2]

# Write the header row (save_to_csv appends, so re-running this cell adds another header).
save_to_csv(RESULT_CSV_PATH, [
    ['file_name',
    'buyer_details',
    'supplier_details',
    'buyer_GST_number',
    'supplier_GST_number',
    'item_serial_number',
    'item_name',
    'item_description',
    'HSN',
    'item_quantity',
    'item_unit',
    'per_unit_price',
    'SGST_amount',
    'CGST_amount',
    'total_item_tax',
    'total_item_price']
])

for f in files:
  img = read_pdf_to_image(f"{INVOICES_DIR_PATH}/{f}")
  if img is None:
    # read_pdf_to_image already printed the error; skip this file.
    continue
  result = ocr_page_with_nanonets_s(img, model, processor, max_new_tokens=15000)
  save_invoice_text(f"{INVOICES_DIR_PATH}/extract_{f.replace('.pdf', '.txt')}", result)

  # Extract details from result
  details = extract_fields_from_invoice_text(result)

  # One row per non-empty response line, with the source filename prepended.
  details = [[f] + line.split(";") for line in details.splitlines() if line.strip()]

  # Save the details to CSV
  save_to_csv(RESULT_CSV_PATH, details)
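
The loop above trusts the LLM to return exactly one `;`-separated row per line, which model output is not guaranteed to do. A defensive parser can help. The following is a sketch under stated assumptions: the response may contain blank lines or Markdown code-fence markers, a row should have exactly 15 fields (the count listed in `user_prompt`), and short rows are padded with `NA`. `parse_llm_rows`, `EXPECTED_FIELDS`, and `FENCE` are hypothetical names.

```python
EXPECTED_FIELDS = 15  # number of fields requested in user_prompt
FENCE = "`" * 3       # Markdown code-fence marker

def parse_llm_rows(response_text, file_name, expected_fields=EXPECTED_FIELDS):
    """Turn a ';'-delimited LLM response into rows prefixed with the file name."""
    rows = []
    for line in response_text.splitlines():
        line = line.strip()
        # Skip blank lines and code-fence markers the model may wrap around its answer.
        if not line or line.startswith(FENCE):
            continue
        fields = [field.strip() for field in line.split(";")]
        if len(fields) < expected_fields:
            # Pad short rows so every row has the same number of columns.
            fields += ["NA"] * (expected_fields - len(fields))
        elif len(fields) > expected_fields:
            # Fold any overflow back into the last field instead of dropping it.
            fields = fields[:expected_fields - 1] + [";".join(fields[expected_fields - 1:])]
        rows.append([file_name] + fields)
    return rows

# Small demonstration with 4 expected fields to keep the output short.
sample = FENCE + "csv\nACME; Widgets Ltd; NA\n\n" + FENCE
print(parse_llm_rows(sample, "invoice1.pdf", expected_fields=4))
# [['invoice1.pdf', 'ACME', 'Widgets Ltd', 'NA', 'NA']]
```

This does not detect a header row echoed by the model or a `;` inside a field value; both would need extra handling.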