# OCR Cookbook

---

## OCR Exploration and Simple Structured Outputs (Deprecated)
In this cookbook, we explore the basics of OCR and combine our OCR model with existing chat models to produce structured outputs. (We recommend using the newer Annotations feature instead, which gives better results.)

This approach is useful when a vision model alone is not accurate enough at reading text: supplying it with the OCR model's output improves structured data extraction.

---

### Models Used
- Mistral OCR
- Pixtral 12B & Ministral 8B

---

**For a more up-to-date guide on structured outputs, see our [Annotations cookbook](https://github.com/mistralai/cookbook/blob/main/mistral/ocr/data_extraction.ipynb) on Data Extraction.**


## Setup

First, let's install `mistralai` and download the required files.

In [1]:
%%capture
!pip install mistralai

### Download PDF and image files

In [2]:
%%capture
!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/mistral7b.pdf
!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/receipt.png

## Mistral OCR with PDF

We will need to set up our client. You can create an API key on our [Plateforme](https://console.mistral.ai/api-keys/).

In [4]:
# Initialize Mistral client with API key
from mistralai import Mistral
from google.colab import userdata

api_key = userdata.get('MISTRAL_API_KEY')  # Reads the key from Colab secrets; use your own key source elsewhere
client = Mistral(api_key=api_key)
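
The cell above relies on `google.colab`, which is only available inside Colab. Outside Colab, one option is to read the key from an environment variable. The sketch below is a minimal example; the helper name `load_api_key` is our own and not part of the `mistralai` package.

```python
import os


def load_api_key(env=None, name="MISTRAL_API_KEY"):
    """Return the API key stored in an environment variable, or raise if unset."""
    # `env` defaults to the real process environment; a dict can be passed for testing.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {name} environment variable first.")
    return key


# Usage outside Colab (uncomment once the variable is set):
# client = Mistral(api_key=load_api_key())
```

Failing early with a clear message avoids a less obvious authentication error on the first API call.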

There are two types of files you can apply OCR to:
- PDF files
- Image files

Let's start with a PDF file:

In [5]:
# Import required libraries
from pathlib import Path
from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
import json

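# Verify PDF file exists (test.pdf is a local file, not one of the downloads above)
pdf_file = Path("test.pdf")
assert pdf_file.is_file(), "test.pdf not found in the working directory."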

# Upload PDF file to Mistral's OCR service
uploaded_file = client.files.upload(
    file={
        "file_name": pdf_file.stem,
        "content": pdf_file.read_bytes(),
    },
    purpose="ocr",
)

# Get URL for the uploaded file
signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)

# Process PDF with OCR, including embedded images
pdf_response = client.ocr.process(
    document=DocumentURLChunk(document_url=signed_url.url),
    model="mistral-ocr-latest",
    include_image_base64=True
)

# Convert response to JSON format
response_dict = json.loads(pdf_response.model_dump_json())

print(json.dumps(response_dict, indent=4, ensure_ascii=False)[0:1000]) # check the first 1000 characters

{
    "pages": [
        {
            "index": 0,
            "markdown": "\u0bb5\u0bbe\u0b95\u0bcd\u0b95\u0bbe\u0bb3\u0bb0\u0bcd \u0baa\u0b9f\u0bcd\u0b9f\u0bbf\u0baf\u0bb2\u0bcd 2022 S22 \u0ba4\u0bae\u0bbf\u0bb4\u0bcd\u0ba8\u0bbe\u0b9f\u0bc1\n\n|  \u0b9a\u0b9f\u0bcd\u0b9f\u0bae\u0ba9\u0bcd\u0bb1\u0ba4\u0bcd \u0ba4\u0bca\u0b95\u0bc1\u0ba4\u0bbf \u0b8e\u0ba3\u0bcd, \u0baa\u0bc6\u0baf\u0bb0\u0bcd \u0bae\u0bb1\u0bcd\u0bb1\u0bc1\u0bae\u0bcd \u0b92\u0ba4\u0bc1\u0b95\u0bcd\u0b95\u0bc0\u0b9f\u0bcd\u0b9f\u0bc1\u0ba4\u0bcd\u0ba4\u0b95\u0bc1\u0ba4\u0bbf \u0ba8\u0bbf\u0bb2\u0bc8 : 110 - \u0b95\u0bcb\u0baf\u0bae\u0bcd\u0baa\u0bc1\u0ba4\u0bcd\u0ba4\u0bc2\u0bb0\u0bcd (\u0bb5\u0b9f\u0b95\u0bcd\u0b95\u0bc1) (\u0baa\u0bca\u0ba4\u0bc1) | \u0baa\u0bbe\u0b95\u0bae\u0bcd \u0b8e\u0ba3\u0bcd  |\n| --- | --- |\n|   | 1  |\n\n\u0b9a\u0b9f\u0bcd\u0b9f\u0bae\u0ba9\u0bcd\u0bb1\u0ba4\u0bcd \u0ba4\u0bca\u0b95\u0bc1\u0ba4\u0bbf \u0b85\u0b9f\u0b99\u0bcd\u0b95\u0bbf\u0baf\u0bc1\u0bb3\u0bcd\u0bb3 \u0ba8\u0bbe\u0b9f\u0

View the result with the following:
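
One way to do this is sketched below. It works on the dict form of the response (`response_dict`) and assumes that each page carries a `markdown` string and an `images` list whose items have an `id` and an `image_base64` field already holding a complete data URL, and that images are referenced in the markdown as `![id](id)`. The helper names are our own.

```python
def replace_images_in_markdown(markdown_str, images):
    """Swap each ![id](id) placeholder for an inline base64 data URL."""
    for image_id, data_url in images.items():
        markdown_str = markdown_str.replace(
            f"![{image_id}]({image_id})", f"![{image_id}]({data_url})"
        )
    return markdown_str


def get_combined_markdown(response):
    """Join the markdown of every page in a dict-form OCR response."""
    pages = []
    for page in response.get("pages", []):
        # Map image ids to their data URLs, skipping images without base64 content.
        images = {
            img["id"]: img["image_base64"]
            for img in page.get("images", [])
            if img.get("image_base64")
        }
        pages.append(replace_images_in_markdown(page["markdown"], images))
    return "\n\n".join(pages)


# In a notebook (assumes IPython is available):
# from IPython.display import Markdown, display
# display(Markdown(get_combined_markdown(response_dict)))
```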

## Mistral OCR with Image

In addition to the PDF file shown above, Mistral OCR can also process image files:

In [21]:
import base64

# Verify image exists
image_file = Path("output_page_3.png")
assert image_file.is_file()

# Encode image as base64 for API
encoded = base64.b64encode(image_file.read_bytes()).decode()
base64_data_url = f"data:image/png;base64,{encoded}"  # MIME type matches the .png input

# Process image with OCR
image_response = client.ocr.process(
    document=ImageURLChunk(image_url=base64_data_url),
    model="mistral-ocr-latest"
)

# Convert response to JSON
response_dict = json.loads(image_response.model_dump_json())
json_string = json.dumps(response_dict, indent=4, ensure_ascii=False)
print(json_string)
# display(Markdown(image_response.pages[0].markdown))

{
    "pages": [
        {
            "index": 0,
            "markdown": "# பிளவு எண் மற்றும் பெயர் : 1-வார்டு எண் 17 அருண் நகர்\n\n|  1 | IBU2609337 | 2 | IBU0520122 | 3 | IBU0263616  |\n| --- | --- | --- | --- | --- | --- |\n|  பெயர் : விமல்\nதந்தை பெயர் : பத்மநாபன்\nவீட்டு எண் : ஏ\nவயது : 47 பாலினம் : பெண் |  | பெயர் : புருஷோத்தமன்\nதந்தை பெயர் : கிருஷ்ணமூர்த்தி\nவீட்டு எண் : na\nவயது : 74 பாலினம் : ஆண் |  | பெயர் : துரைவேலன்\nதந்தை பெயர் : முனிசாமி\nவீட்டு எண் : na\nவயது : 69 பாலினம் : ஆண் | Photo : A\nAvailable  |\n|  4 | IBU0520601 | 5 | IBU0856781 | 6 | IBU0518720  |\n|  பெயர் : பிரபா\nகணவர் பெயர் : ரகோத்தமன்\nவீட்டு எண் : na\nவயது : 68 பாலினம் : பெண் |  | பெயர் : கோகுலகிருஷ்ணன்\nதந்தை பெயர் : சுப்பிரமணியன்\nவீட்டு எண் : Na\nவயது : 45 பாலினம் : ஆண் |  | பெயர் : திரேசா\nகணவர் பெயர் : சங்கர்\nவீட்டு எண் : na\nவயது : 42 பாலினம் : பெண் | Photo : A\nAvailable  |\n|  7 | IBU0748830 | 8 | IBU0263632 | 9 | IBU0263640  |\n|  பெயர் : ஜார்ஜ்கிருபாகர்\nதந்தை பெயர் : டேவிட்\nசுந்திரமோகன்\

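The cell above hardcodes the MIME type in the data URL. As a minimal sketch, the type can instead be guessed from the file extension with the standard-library `mimetypes` module. The helper name `to_data_url` and its fallback type are our own choices.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path, default_mime="application/octet-stream"):
    """Encode a file as a base64 data URL, guessing the MIME type from its extension."""
    path = Path(path)
    # guess_type returns (type, encoding); type is None for unknown extensions.
    mime, _ = mimetypes.guess_type(path.name)
    encoded = base64.b64encode(path.read_bytes()).decode()
    return f"data:{mime or default_mime};base64,{encoded}"


# Usage (uncomment if the file exists locally):
# base64_data_url = to_data_url("output_page_3.png")
```
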
## Extract structured data from OCR results

OCR results can be further processed using another model.

Our goal is to extract structured data from these results. To do this, we pass both the image and the OCR model's markdown to `pixtral-12b-latest`, so the vision model can rely on the OCR text to give more accurate answers:

In [22]:
# Get OCR results for processing
image_ocr_markdown = image_response.pages[0].markdown

# Get structured response from model
chat_response = client.chat.complete(
    model="pixtral-12b-latest",
    messages=[
        {
            "role": "user",
            "content": [
                ImageURLChunk(image_url=base64_data_url),
                TextChunk(
                    text=(
                        f"This is the image's OCR in markdown:\n\n{image_ocr_markdown}\n.\n"
                        "Convert this into a sensible structured JSON response. "
                        "The output should be strictly JSON with no extra commentary."
                    )
                ),
            ],
        }
    ],
    response_format={"type": "json_object"},
    temperature=0,
)

# Parse and print the JSON response
response_dict = json.loads(chat_response.choices[0].message.content)
print(json.dumps(response_dict, indent=4, ensure_ascii=False))

{
    "data": [
        {
            "number": 1,
            "name": "விமல்",
            "father_name": "பத்மநாபன்",
            "house_number": "ஏ",
            "age": 47,
            "gender": "பெண்",
            "photo": "A",
            "status": "Available"
        },
        {
            "number": 2,
            "name": "புருஷோத்தமன்",
            "father_name": "கிருஷ்ணமூர்த்தி",
            "house_number": "na",
            "age": 74,
            "gender": "ஆண்",
            "photo": "A",
            "status": "Available"
        },
        {
            "number": 3,
            "name": "துரைவேலன்",
            "father_name": "முனிசாமி",
            "house_number": "na",
            "age": 69,
            "gender": "ஆண்",
            "photo": "A",
            "status": "Available"
        },
        {
            "number": 4,
            "name": "பிரபா",
            "spouse_name": "ரகோத்தமன்",
            "house_number": "na",
            "age": 68,
            "gender": "

In the example above, we are leveraging a model already capable of vision tasks.

However, we could also use text-only models for the structured output. Note that in this case we do not include the image in the user message:

In [23]:
# Get OCR results for processing
image_ocr_markdown = image_response.pages[0].markdown

# Get structured response from model
chat_response = client.chat.complete(
    model="ministral-8b-latest",
    messages=[
        {
            "role": "user",
            "content": [
                TextChunk(
                    text=(
                        f"This is the image's OCR in markdown:\n\n{image_ocr_markdown}\n.\n"
                        "Convert this into a sensible structured JSON response. "
                        "The output should be strictly JSON with no extra commentary."
                    )
                ),
            ],
        }
    ],
    response_format={"type": "json_object"},
    temperature=0,
)

# Parse and print the JSON response
response_dict = json.loads(chat_response.choices[0].message.content)
print(json.dumps(response_dict, indent=4, ensure_ascii=False))


[
    {
        "id": 1,
        "name": "விமல்",
        "father_name": "பத்மநாபன்",
        "address": "ஏ",
        "age": 47,
        "gender": "பெண்"
    },
    {
        "id": 2,
        "name": "புருஷோத்தமன்",
        "father_name": "கிருஷ்ணமூர்த்தி",
        "address": "na",
        "age": 74,
        "gender": "ஆண்"
    },
    {
        "id": 3,
        "name": "துரைவேலன்",
        "father_name": "முனிசாமி",
        "address": "na",
        "age": 69,
        "gender": "ஆண்"
    },
    {
        "id": 4,
        "name": "பிரபா",
        "husband_name": "ரகோத்தமன்",
        "address": "na",
        "age": 68,
        "gender": "பெண்"
    },
    {
        "id": 5,
        "name": "கோகுலகிருஷ்ணன்",
        "father_name": "சுப்பிரமணியன்",
        "address": "Na",
        "age": 45,
        "gender": "ஆண்"
    },
    {
        "id": 6,
        "name": "திரேசா",
        "husband_name": "சங்கர்",
        "address": "na",
        "age": 42,
        "gender": "பெண்"
    },
    {
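
The two runs above return different shapes (an object with a `data` list versus a bare list) and different key names (`number`/`id`, `house_number`/`address`, `spouse_name`/`husband_name`). A minimal sketch of normalising both into one form follows. The alias table is a hypothetical choice made for this example, not something the models guarantee.

```python
# Hypothetical mapping from key names seen in the outputs above to one common set.
KEY_ALIASES = {
    "number": "id",
    "house_number": "address",
    "husband_name": "spouse_name",
}


def normalize_records(payload):
    """Return a list of records with unified keys from either output shape."""
    # The vision model wrapped the list in {"data": [...]}; the text model did not.
    records = payload.get("data", []) if isinstance(payload, dict) else payload
    return [
        {KEY_ALIASES.get(key, key): value for key, value in record.items()}
        for record in records
    ]
```

Because free-form JSON mode leaves the schema up to the model, this kind of post-processing is brittle; fixing the schema up front avoids it.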
      

## All Together - Mistral OCR + Custom Structured Output
Let's design a simple function that takes an `image_path` and returns structured JSON output in a specific format. For this example, we arbitrarily chose the following schema:

```python
class StructuredOCR:
    file_name: str  # can be any string
    topics: list[str]  # must be a list of strings
    languages: str  # string
    ocr_contents: dict  # any dictionary, can be freely defined by the model
```

We will make use of [custom structured outputs](https://docs.mistral.ai/capabilities/structured-output/custom_structured_output/).

In [40]:
from pathlib import Path
from pydantic import BaseModel
import base64


class StructuredOCR(BaseModel):
    file_name: str
    topics: list[str]
    languages: str
    ocr_contents: dict

def structured_ocr(image_path: str) -> StructuredOCR:
    """
    Process an image using OCR and extract structured data.

    Args:
        image_path: Path to the image file to process

    Returns:
        StructuredOCR object containing the extracted data

    Raises:
        AssertionError: If the image file does not exist
    """
    # Validate input file
    image_file = Path(image_path)
    assert image_file.is_file(), "The provided image path does not exist."

    # Read and encode the image file
    encoded_image = base64.b64encode(image_file.read_bytes()).decode()
    base64_data_url = f"data:image/png;base64,{encoded_image}"  # assumes a PNG input

    # Process the image using OCR
    image_response = client.ocr.process(
        document=ImageURLChunk(image_url=base64_data_url),
        model="mistral-ocr-latest"
    )
    image_ocr_markdown = image_response.pages[0].markdown

    # Prompt for extracting voter data
    voter_prompt = """
    You are an expert system trained to extract structured data from Tamil Nadu electoral roll documents (PDFs or scanned images). Each document page contains up to 30 voter entries.

    Your goal is to extract each voter’s data in Tamil and transliterated English, and return them as a *strictly ordered JSON array*, sorted by **serial number** (including cases like "R 191").

    ---

    ### Voter Entry Rules:

    Each voter record is in a rectangular box with:
    - A **serial number** at the top-left (may include prefix like "R" if deleted).
    - A **photo box** (mentioning if "Photo is Available" or not).
    - A **voter ID** (e.g., "IBU3115632").
    - Tamil details including:
      - பெயர் (Name)
      - தந்தை / கணவர் பெயர் (Relative name)
      - வயது (Age)
      - பாலினம் (Gender: ஆண்/பெண்)
      - வீட்டு எண் (House No., if visible)

    ---

    ### 1. If Entry is DELETED (Marked with watermark and serial number prefix like "R"):

    Extract only:
    - `voter_id`
    - `serial_number`
    - `part_number`: "163"
    - `is_deleted`: true

    ---

    ### 2. If Entry is VALID (No "DELETED" mark):

    Extract all:
    - `voter_id`: From the box
    - `serial_number`: The **first number** (top-left) in each voter's box. Ignore the second, smaller number next to it.
        For example: if the top-left boxes show `1211` and `1`, take only `1211` as the `"serial_number"`.
        This holds for every serial number, with or without a letter prefix (e.g. `1`, `8`, `161`, `R 191`, `S 203`). Never extract the second, smaller number.
    - `part_number`: "163"
    - `name_tamil`: Extract from "பெயர்"
    - `name_english`: Transliterate `name_tamil` (phonetic, not translated)
    - `relative_name_tamil`: From "தந்தை பெயர்" or "கணவர் பெயர்"
    - `relative_name_english`: Transliterate phonetic equivalent
    - `age`: Extract from "வயது"
    - `gender`: Translate "ஆண்" to "Male", "பெண்" to "Female"
    - `house_no`: Extract from "வீட்டு எண்", else ""
    - `is_deleted`: false
    - `predicted_religion`: Guess based on common Tamil names

    ---

    ### Extra Instructions:

    - Always set `"part_number": "163"`
    - If the serial number starts with a letter (like "R 191"), mark it as deleted
    - The serial_number is found in the top-left corner of each voter’s box.
    - Each box may sometimes contain **two small boxes** at the top:
      - **Take only the first box value** as the serial_number.
      - Ignore the second number; it is not part of the serial number and should never be extracted.
    - Valid examples: "1211", "1223", "R 191", "S 225"

    - If Tamil name is missing or unrecognizable, use `""` for Tamil/English name
    - Sort final JSON array in ascending order by `serial_number` (numeric or alphanumeric sort)
    - Ensure consistent structure for every object
    - Output only a **valid compact JSON array**, no extra text or explanation

    ---

    ### Example Output:
    ```json
    [
      {
        "voter_id": "IBU3115632",
        "serial_number": "181",
        "part_number": "163",
        "name_tamil": "ஜெயபாலன்",
        "name_english": "Jeyabalan",
        "relative_name_tamil": "வெங்கடேசன்",
        "relative_name_english": "Venkatesan",
        "age": 58,
        "gender": "Male",
        "house_no": "",
        "is_deleted": false,
        "predicted_religion": "Hindu"
      },
      {
        "voter_id": "IBU2463313",
        "serial_number": "R 191",
        "part_number": "163",
        "is_deleted": true
      }
    ]
    ```
    """

    # Parse the OCR result into a structured JSON response
    chat_response = client.chat.parse(
        model="pixtral-12b-latest",
        messages=[
            {
                "role": "user",
                "content": [
                    ImageURLChunk(image_url=base64_data_url),
                    TextChunk(text=(
                        f"This is the image's OCR in markdown:\n{image_ocr_markdown}\n.\n"
                        f"Convert this into a structured JSON response \n{voter_prompt}\n"
                        )
                    )
                ]
            }
        ],
        response_format=StructuredOCR,
        temperature=0
    )

    return chat_response.choices[0].message.parsed

We can now extract structured output from any image parsed with our OCR model.

In [43]:
# Example usage
image_path = "output_page_44.png" # Path to a sample electoral roll page image
structured_response = structured_ocr(image_path) # Process image and extract data

# Serialize the result and print it as JSON
response_dict = json.loads(structured_response.model_dump_json())
print(json.dumps(response_dict, indent=4, ensure_ascii=False))

{
    "file_name": "TamilNaduElectoralRoll",
    "topics": [
        "Electoral Roll",
        "Tamil Nadu",
        "Voter Information"
    ],
    "languages": "Tamil",
    "ocr_contents": {
        "1211": {
            "voter_id": "IBU3342540",
            "serial_number": "1211",
            "part_number": "163",
            "name_tamil": "கனகாம்பாள்",
            "name_english": "Kanagamapal",
            "relative_name_tamil": "சின்னசாமி",
            "relative_name_english": "Sinnasami",
            "age": 60,
            "gender": "Female",
            "house_no": "20/1",
            "is_deleted": false,
            "predicted_religion": "Hindu"
        },
        "1212": {
            "voter_id": "IBU3342607",
            "serial_number": "1212",
            "part_number": "163",
            "name_tamil": "பிரணவ் சுதன்",
            "name_english": "Pranav Sudan",
            "relative_name_tamil": "துரை குமார்",
            "relative_name_english": "Durai Kumar",
            
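
In the output above, `ocr_contents` is a dict keyed by serial number, so the entries are not guaranteed to come back in the order the prompt asks for. The sketch below flattens the dict into a list sorted by serial number. It assumes the keys are the serial number strings, as in this sample, and orders by the numeric part first, which is one reading of the prompt's "numeric or alphanumeric sort". The helper names are our own.

```python
import re


def serial_sort_key(serial):
    """Sort key: numeric part first, then any letter prefix such as "R" or "S"."""
    match = re.search(r"\d+", serial)
    # Serials without digits sort last.
    number = int(match.group()) if match else float("inf")
    prefix = re.sub(r"[\d\s]", "", serial)
    return (number, prefix)


def entries_in_order(ocr_contents):
    """Flatten a {serial_number: entry} dict into a list sorted by serial number."""
    return [ocr_contents[s] for s in sorted(ocr_contents, key=serial_sort_key)]


# Usage with the parsed response from the cell above:
# ordered = entries_in_order(response_dict["ocr_contents"])
```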