# OCR Exploration and Simple Structured Outputs (Deprecated)

---

In this cookbook, we explore the basics of OCR and combine our OCR model with existing chat models to produce structured outputs. For better results, we recommend using the newer Annotations feature instead.

This approach is useful when a vision model's own text recognition is not strong enough: supplying the OCR model's output alongside the input improves structured data extraction.

---

### Models Used
- Mistral OCR
- Pixtral 12B & Ministral 8B

---

**For a more up-to-date guide on structured outputs, visit our [Annotations cookbook](https://github.com/mistralai/cookbook/blob/main/mistral/ocr/data_extraction.ipynb) on Data Extraction.**


## Setup

First, let's install `mistralai` and download the required files.

In [1]:
%%capture
!pip install mistralai

### Download PDF and image files

In [4]:
%%capture
!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/mistral7b.pdf
!wget https://raw.githubusercontent.com/mistralai/cookbook/refs/heads/main/mistral/ocr/receipt.png

## Mistral OCR with PDF

We will need to set up our client. You can create an API key on our [Plateforme](https://console.mistral.ai/api-keys/).

In [1]:
# Initialize Mistral client with API key
import os

from mistralai import Mistral

# Read the key from the MISTRAL_API_KEY environment variable instead of hard-coding it
api_key = os.environ["MISTRAL_API_KEY"]
client = Mistral(api_key=api_key)

There are two types of files you can apply OCR to:
- PDF files
- Image files

Let's start with a PDF file:

In [7]:
# Import required libraries
from pathlib import Path
from mistralai import DocumentURLChunk, ImageURLChunk, TextChunk
import json

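# Verify PDF file exists
# This run uses a local PDF (a bank's monthly financial report), not the file downloaded above;
# point this path at your own document, e.g. "mistral7b.pdf"
pdf_file = Path("08 Agustus 2025 - Format New SE OJK.pdf")
assert pdf_file.is_file()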

# Upload PDF file to Mistral's OCR service
uploaded_file = client.files.upload(
    file={
        "file_name": pdf_file.stem,
        "content": pdf_file.read_bytes(),
    },
    purpose="ocr",
)

# Get URL for the uploaded file
signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)

# Process PDF with OCR, including embedded images
pdf_response = client.ocr.process(
    document=DocumentURLChunk(document_url=signed_url.url),
    model="mistral-ocr-latest",
    include_image_base64=True
)

# Convert response to JSON format
response_dict = json.loads(pdf_response.model_dump_json())

print(json.dumps(response_dict, indent=4)[0:1000]) # check the first 1000 characters

{
    "pages": [
        {
            "index": 0,
            "markdown": "| LAPORAN POSISI KEUANGAN (NERACA) BULANAN |  |\n| :--: | :--: |\n| PT BANK MANDIRI (PERSERO) Tbk. <br> Tanggal 31 Agustus 2025 |  |\n|  | Dalam Satuan Rupiah |\n| POS-POS | NOMINAL |\n| ASET |  |\n| 1.Kas | 17.543.261 |\n| 2.Penempatan pada Bank Indonesia | 101.772 .835 |\n| 3.Penempatan pada bank lain | 58.970 .016 |\n| 4.Tagihan spot dan derivatif/forward | 8.415 .865 |\n| 5.Surat berharga yang dimiliki | 254.120 .322 |\n| 6.Surat berharga yang dijual dengan janji dibeli kembali (repo) | 44.331 .731 |\n| 7.Tagihan atas surat berharga yang dibeli dengan janji dijual kembali (reverse repo) | 1.639 .854 |\n| 8.Tagihan akseptasi | 6.410 .900 |\n| 9.Kredit yang diberikan | 1.353 .438 .264 |\n| 10.Pembiayaan syariah |  |\n| 11.Penyertaan modal | 15.016 .495 |\n| 12.Aset Keuangan Lainnya | 44.141 .292 |\n| 13.Cadangan kerugian penurunan nilai aset keuangan -/- | 40.437 .897 |\n| a. Surat berharga yang dimiliki | 8.

View the result with the following:

In [8]:
from mistralai.models import OCRResponse
from IPython.display import Markdown, display

def replace_images_in_markdown(markdown_str: str, images_dict: dict) -> str:
    """
    Replace image placeholders in markdown with base64-encoded images.

    Args:
        markdown_str: Markdown text containing image placeholders
        images_dict: Dictionary mapping image IDs to base64 strings

    Returns:
        Markdown text with images replaced by base64 data
    """
    for img_name, base64_str in images_dict.items():
        markdown_str = markdown_str.replace(
            f"![{img_name}]({img_name})", f"![{img_name}]({base64_str})"
        )
    return markdown_str

def get_combined_markdown(ocr_response: OCRResponse) -> str:
    """
    Combine OCR text and images into a single markdown document.

    Args:
        ocr_response: Response from OCR processing containing text and images

    Returns:
        Combined markdown string with embedded images
    """
    markdowns: list[str] = []
    # Extract images from page
    for page in ocr_response.pages:
        image_data = {}
        for img in page.images:
            image_data[img.id] = img.image_base64
        # Replace image placeholders with actual images
        markdowns.append(replace_images_in_markdown(page.markdown, image_data))

    return "\n\n".join(markdowns)

# Display combined markdowns and images
display(Markdown(get_combined_markdown(pdf_response)))

| LAPORAN POSISI KEUANGAN (NERACA) BULANAN |  |
| :--: | :--: |
| PT BANK MANDIRI (PERSERO) Tbk. <br> Tanggal 31 Agustus 2025 |  |
|  | Dalam Satuan Rupiah |
| POS-POS | NOMINAL |
| ASET |  |
| 1.Kas | 17.543.261 |
| 2.Penempatan pada Bank Indonesia | 101.772 .835 |
| 3.Penempatan pada bank lain | 58.970 .016 |
| 4.Tagihan spot dan derivatif/forward | 8.415 .865 |
| 5.Surat berharga yang dimiliki | 254.120 .322 |
| 6.Surat berharga yang dijual dengan janji dibeli kembali (repo) | 44.331 .731 |
| 7.Tagihan atas surat berharga yang dibeli dengan janji dijual kembali (reverse repo) | 1.639 .854 |
| 8.Tagihan akseptasi | 6.410 .900 |
| 9.Kredit yang diberikan | 1.353 .438 .264 |
| 10.Pembiayaan syariah |  |
| 11.Penyertaan modal | 15.016 .495 |
| 12.Aset Keuangan Lainnya | 44.141 .292 |
| 13.Cadangan kerugian penurunan nilai aset keuangan -/- | 40.437 .897 |
| a. Surat berharga yang dimiliki | 8.010 |
| b. Kredit yang diberikan dan pembiayaan syariah | 39.019 .406 |
| c. Lainnya | 1.410 .481 |
| 14.Aset tidak berwujud | 11.671 .277 |
| Akumulasi amortisasi aset tidak berwujud -/- | 7.800 .987 |
| 15.Aset tetap dan inventaris | 76.317 .999 |
| Akumulasi penyusutan aset tetap dan inventaris -/- | 22.077 .700 |
| 16.Aset non produktif | 4.665 .156 |
| a.Properti terbengkalai | - |
| b.Aset yang diambil alih | - |
| c.Rekening tunda | 4.547 .501 |
| d.Aset antarkantor | 117.655 |
| 17.Aset Lainnya | 25.549 .773 |
| TOTAL ASET | 1.953 .688 .456 |

| LAPORAN POSISI KEUANGAN (NERACA) BULANAN |  |
| :--: | :--: |
| PT BANK MANDIRI (PERSERO) Tbk. <br> Tanggal 31 Agustus 2025 |  |
|  | Dalam Satuan Rupiah |
| POS-POS | NOMINAL |
| LIABILITAS DAN EKUITAS |  |
| LIABILITAS |  |
| 1.Giro | 590.566 .209 |
| 2.Tabungan | 505.265 .640 |
| 3.Deposito | 339.344 .142 |
| 4.Uang elektronik | 2.104 .372 |
| 5.Liabilitas kepada Bank Indonesia | - |
| 6.Liabilitas kepada bank lain | 21.597 .498 |
| 7.Liabilitas spot dan derivatif/forward | 8.128 .315 |
| 8.Liabilitas atas surat berharga yang dijual dengan janji dibeli kembali (repo) | 41.912 .368 |
| 9.Liabilitas akseptasi | 6.410 .900 |
| 10.Surat berharga yang diterbitkan | 37.084 .848 |
| 11.Pinjaman/pembiayaan yang diterima | 117.106 .672 |
| 12.Setoran jaminan | 1.230 .838 |
| 13.Liabilitas antarkantor | - |
| 14.Liabilitas lainnya | 35.708 .402 |
| TOTAL LIABILITAS | 1.706 .460 .204 |
| EKUITAS |  |
| 15.Modal disetor | 11.666 .667 |
| a. Modal dasar | 16.000 .000 |
| b. Modal yang belum disetor -/- | 4.333 .333 |
| c. Saham yang dibeli kembali (treasury stock ) -/- | - |
| 16.Tambahan modal disetor | 19.661 .550 |
| a. Agio | 19.661 .550 |
| b. Disagio -/- | - |
| c. Dana setoran modal | - |
| d. Lainnya | - |
| 17.Penghasilan komprehensif lainnya | 36.581 .818 |
| a. Keuntungan | 36.874 .187 |
| b. Kerugian (-/-) | 292.369 |
| 18.Cadangan | 2.333 .333 |
| a. Cadangan umum | 2.333 .333 |
| b. Cadangan tujuan | - |
| 19.Laba/nigi | 176.984 .884 |
| a. Tahun-tahun lalu | 189.842 .782 |
| b. Tahun berjalan | 30.652 .641 |
| c. Dividen yang dibayarkan (-/-) | 43.510 .539 |
| TOTAL EKUITAS | 247.228 .252 |
| TOTAL LIABILITAS DAN EKUITAS | 1.953 .688 .456 |

LAPORAN LABA RUGI DAN PENGHASILAN KOMPREHENSIF LAIN BULANAN PT BANK MANDIRI (PERSERO) T04. Tanggal 31 Agustus 2025

|   | Dalam Jutaan Rupiah  |
| --- | --- |
|  PEN-POS | MANDIRA  |
|  PENDAFATAN DAN BEBAN OPERASIONAL |   |
|  A. Pendapatan dan Beban Bunga |   |
|  1. Pendapatan Bunga | 80.944 .536  |
|  2. Beban Bunga | 29.769 .841  |
|  Pendapatan (Beban) Bunga bersih | 51.174 .695  |
|  B. Pendapatan dan Beban Operasional selain Bunga |   |
|  1. Keuntungan (kerugian) dari peningkatan (penurunan) nilai wajar aset keuangan | 1.383 .813  |
|  2. Keuntungan (kerugian) dari penurunan (peningkatan) nilai wajar liabilitas keuangan | -  |
|  3. Keuntungan (kerugian) dari penjualan aset keuangan | 1.903 .638  |
|  4. Keuntungan (kerugian) dari transaksi spot dan derivatif/forward (realised) | 212.497  |
|  5. Keuntungan (kerugian) dari penyertaan dengan equity method | -  |
|  6. Keuntungan (kerugian) dari penjabaran transaksi valuta asing | -  |
|  7. Pendapatan dividen | 1.461 .980  |
|  8. Komis/provis/per dan administrasi | 12.735 .035  |
|  9. Pendapatan lainnya | 5.084 .320  |
|  10. Kerugian penurunan nilai aset keuangan (impairment) | 4.490 .099  |
|  11. Kerugian terkait risiko operasional | 18.495  |
|  12. Beban tenaga kerja | 12.190 .381  |
|  13. Beban promosi | 2.422 .384  |
|  14. Beban lainnya | 17.306 .403  |
|  Pendapatan (Beban) Operasional Lainnya | (13.646.479)  |
|  LABA (RUGI) OPERASIONAL | 37.528 .216  |
|  PENDAFATAN (BEBAN) NON OPERASIONAL |   |
|  1. Keuntungan (kerugian) penjualan aset tetap dan inventaris | 256  |
|  2. Pendapatan (beban) non operasional lainnya | 82.310  |
|  LABA (RUGI) NON OPERASIONAL | 82.566  |
|  LABA (RUGI) PERIODE BERJALAN SEBELUM PAJAK | 37.610 .782  |
|  3. Pajak Penghasilan | 6.958 .141  |
|  a. Takoiran pajak periode berjalan (-/-) | 5.147 .921  |
|  b. (Pendapatan) beban pajak tangguhan | 1.810 .220  |
|  LABA (RUGI) BERIJH PERIODE BERJALAN | 30.652 .641  |
|  PENGHASILAN KOMPREHENSIF LAIN |   |
|  1. Pos-pos yang tidak akan direklasifikasi ke laba rugi | 42.717  |
|  a. Keuntungan yang berasal dari revaluasi aset tetap | -  |
|  b. Keuntungan (kerugian) yang berasal dari pengukuran kembali atas program pensiun manfaat pasti | 42.717  |
|  c. Lainnya | -  |
|  2. Pos-pos yang akan direklasifikasi ke laba rugi | 2.917 .490  |
|  a. Keuntungan (kerugian) yang berasal dari penyesuaian akibat penjabaran laporan keuangan dalam mata uang asing | 38.297  |
|  b. Keuntungan (kerugian) dari perubahan nilai aset keuangan yang diukur pada nilai wajar melalui penghasilan komprehensif lain | 2.879 .193  |
|  PENGHASILAN KOMPREHENSIF LAIN PERIODE BERJALAN - SETELAH PAJAK | 2.960 .207  |
|  TOTAL LABA (RUGI) KOMPREHENSIF PERIODE BERJALAN | 33.612 .848  |
|  TRANSFER LABA (RUGI) KE KANTOR PUSAT |   |

| LAPORAN KOMITMEN DAN KONTINJENSI BULANAN PT BANK MANDIRI (PERSERO) Tbk. Tanggal 31 Agustus 2025 |   |
| --- | --- |
|  |   |
|  POS-POS | NOMINAL  |
|  I. TAGIHAN KOMITMEN | 521.136 .164  |
|  1. Fasilitas pinjaman/pembiayaan yang belum ditarik | -  |
|  2. Posisi valas yang akan diterima dari transaksi spot dan derivatif/forward | 521.136 .164  |
|  3. Lainnya | -  |
|  II. KEWAJIBAN KOMITMEN | 817.976 .667  |
|  1. Fasilitas kredit/pembiayaan kepada nasabah yang belum ditarik | 274.330 .297  |
|  a. Committed | 56.599 .704  |
|  b. Uncommitted | 217.730 .593  |
|  2. Irrevocable L/C yang masih berjalan | 22.788 .150  |
|  3. Posisi valas yang akan diserahkan untuk transaksi spot dan derivatif/forward | 520.858 .220  |
|  4. Lainnya | -  |
|  III.TAGIHAN KONTINJENSI | 57.242 .140  |
|  1. Garansi yang diterima | 57.207 .688  |
|  2. Lainnya | 34.452  |
|  IV.KEWAJIBAN KONTINJENSI | 163.379 .562  |
|  1. Garansi yang diberikan | 158.858 .439  |
|  2. Lainnya | 4.521 .123  |
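
The amounts in the OCR markdown come back as text with dots as thousands separators and stray spaces (for example `1.353 .438 .264`), so they need normalizing before any arithmetic. The sketch below is one possible way to do that with the standard library only; `parse_amount` and `parse_table` are hypothetical helper names, and the code assumes every amount is a whole number with no decimal part. The sample rows are copied from the tables above.

```python
import re
from typing import Optional


def parse_amount(cell: str) -> Optional[int]:
    """Convert an OCR'd amount such as '1.353 .438 .264' to an integer."""
    text = cell.strip()
    if text in ("", "-"):
        return None  # an empty cell or a dash means no value was reported
    # Amounts wrapped in parentheses are treated as negative
    negative = text.startswith("(") and text.endswith(")")
    digits = re.sub(r"[^\d]", "", text)  # drop dots, spaces and parentheses
    if not digits:
        return None  # header or separator cell, not a number
    value = int(digits)
    return -value if negative else value


def parse_table(markdown: str) -> dict[str, int]:
    """Map each two-column row label to its amount, skipping non-numeric rows."""
    rows: dict[str, int] = {}
    for line in markdown.splitlines():
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if len(cells) != 2:
            continue
        label, amount = cells[0], parse_amount(cells[1])
        if label and amount is not None:
            rows[label] = amount
    return rows


sample = """\
| POS-POS | NOMINAL |
| :--: | :--: |
| TOTAL ASET | 1.953 .688 .456 |
| TOTAL LIABILITAS | 1.706 .460 .204 |
| TOTAL EKUITAS | 247.228 .252 |
| 5.Liabilitas kepada Bank Indonesia | - |
"""

totals = parse_table(sample)
# Sanity check on the OCR result: assets should equal liabilities plus equity
balanced = totals["TOTAL ASET"] == totals["TOTAL LIABILITAS"] + totals["TOTAL EKUITAS"]
print(totals)
print("Balance sheet balances:", balanced)
```

A check like this is a cheap way to catch misread digits: if a total no longer matches the sum of its parts, the page is worth a second look.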

## Mistral OCR with Image

In addition to the PDF file shown above, Mistral OCR can also process image files:

In [None]:
# import base64

# # Verify image exists
# image_file = Path("receipt.png")
# assert image_file.is_file()

# # Encode image as base64 for API (receipt.png is a PNG, so use the image/png MIME type)
# encoded = base64.b64encode(image_file.read_bytes()).decode()
# base64_data_url = f"data:image/png;base64,{encoded}"

# # Process image with OCR
# image_response = client.ocr.process(
#     document=ImageURLChunk(image_url=base64_data_url),
#     model="mistral-ocr-latest"
# )

# # Convert response to JSON
# response_dict = json.loads(image_response.model_dump_json())
# json_string = json.dumps(response_dict, indent=4)
# print(json_string)
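
When images arrive in several formats, the data URL should carry a MIME type that matches each file. The sketch below shows one way to derive it from the file extension using only the standard library; `to_data_url` is a hypothetical helper name, not part of the `mistralai` client.

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(path: Path) -> str:
    """Return a base64 data URL whose MIME type matches the file extension."""
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Unsupported or unknown image type: {path.name}")
    encoded = base64.b64encode(path.read_bytes()).decode()
    return f"data:{mime_type};base64,{encoded}"
```

The returned string can then be passed as the `image_url` of an `ImageURLChunk`, in place of a hand-built data URL.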

## Extract structured data from OCR results

OCR results can be further processed using another model.

Our goal is to extract structured data from these results. To achieve this, we pass both the image and its OCR markdown to the `pixtral-12b-latest` model, so that the model can rely on the OCR text to deliver higher-quality answers:

In [None]:
# # Get OCR results for processing
# image_ocr_markdown = image_response.pages[0].markdown

# # Get structured response from model
# chat_response = client.chat.complete(
#     model="pixtral-12b-latest",
#     messages=[
#         {
#             "role": "user",
#             "content": [
#                 ImageURLChunk(image_url=base64_data_url),
#                 TextChunk(
#                     text=(
#                         f"This is the image's OCR in markdown:\n\n{image_ocr_markdown}\n.\n"
#                         "Convert this into a sensible structured JSON response. "
#                         "The output should be strictly JSON with no extra commentary."
#                     )
#                 ),
#             ],
#         }
#     ],
#     response_format={"type": "json_object"},
#     temperature=0,
# )

# # Parse and return JSON response
# response_dict = json.loads(chat_response.choices[0].message.content)
# print(json.dumps(response_dict, indent=4))

In the example above, we are leveraging a model already capable of vision tasks.

However, we could also use text-only models for the structured output. Note that in this case we do not include the image in the user message, only the OCR markdown:

In [None]:
# Get OCR results for processing: join every page's markdown into one string,
# so the prompt contains the text itself rather than the repr of a Python list
pdf_ocr_markdown = "\n\n".join(page.markdown for page in pdf_response.pages)

# Get structured response from model
chat_response = client.chat.complete(
    model="ministral-8b-latest",
    messages=[
        {
            "role": "user",
            "content": [
                TextChunk(
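                    # Prompt (Indonesian). In English: "This is the image OCR in markdown: ...
                    # Convert this into a sensible structured JSON response in Indonesian.
                    # From the bank name, report type and report period.
                    # The output must be strictly JSON with no additional commentary."
                    text=(
                        f"Ini adalah OCR gambar dalam markdown:\n\n{pdf_ocr_markdown}\n.\n"
                        "Ubah ini menjadi respons JSON terstruktur yang masuk akal dan dalam bahasa Indonesia. "
                        "Dari nama bank, jenis laporan dan periode laporan. "
                        "Outputnya harus benar-benar JSON tanpa komentar tambahan."
                    )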
                ),
            ],
        }
    ],
    response_format={"type": "json_object"},
    temperature=0,
)

# Parse and return JSON response
response_dict = json.loads(chat_response.choices[0].message.content)
print(json.dumps(response_dict, indent=4))


{
    "bank": "PT BANK MANDIRI (PERSERO) Tbk.",
    "laporan": "LAPORAN POSISI KEUANGAN (NERACA) BULANAN",
    "periode": "31 Agustus 2025",
    "aset": {
        "kas": 17543261,
        "penempatan_pada_bank_indonesia": 101772835,
        "penempatan_pada_bank_lain": 58970016,
        "tagihan_spot_derivatif_forward": 8415865,
        "surat_berharga_dimiliki": 254120322,
        "surat_berharga_dijual_repo": 44331731,
        "tagihan_reverse_repo": 1639854,
        "tagihan_akseptasi": 6410900,
        "kredit_diberikan": 1353438264,
        "pembiayaan_syariah": 0,
        "penyertaan_modal": 15016495,
        "aset_keuangan_lainnya": 44141292,
        "cadangan_kerugian_penurunan_nilai_aset_keuangan": -40437897,
        "surat_berharga_dimiliki_cadangan": 8010,
        "kredit_diberikan_pembiayaan_syariah_cadangan": 39019406,
        "lainnya_cadangan": 1410481,
        "aset_tidak_berwujud": 11671277,
        "akumulasi_amortisasi_aset_tidak_berwujud": -7800987,
        "aset_te

## All Together - Mistral OCR + Custom Structured Output
Let's design a simple function that takes a `pdf_path` and returns a JSON structured output in a specific format. In this case, we arbitrarily decided we wanted an output with the following fields:

```python
class StructuredOCR:
    bank: str  # bank name
    laporan: str  # report type
    periode: str  # report period
    pos: dict[str, int]  # line items mapped to integer amounts
```

We will make use of [custom structured outputs](https://docs.mistral.ai/capabilities/structured-output/custom_structured_output/).

In [None]:
# Import required libraries
from pathlib import Path

from mistralai import DocumentURLChunk, TextChunk
from pydantic import BaseModel


class StructuredOCR(BaseModel):
    bank: str
    laporan: str
    periode: str
    pos: dict[str, int]


def structured_ocr(pdf_path: str) -> StructuredOCR:
    """
    Process a PDF using OCR and extract structured data.

    Args:
        pdf_path: Path to the PDF file to process

    Returns:
        StructuredOCR object containing the extracted data

    Raises:
        AssertionError: If the PDF file does not exist
    """
    # Verify PDF file exists
    pdf_file = Path(pdf_path)
    assert pdf_file.is_file()

    # Upload PDF file to Mistral's OCR service
    uploaded_file = client.files.upload(
        file={
            "file_name": pdf_file.stem,
            "content": pdf_file.read_bytes(),
        },
        purpose="ocr",
    )

    # Get URL for the uploaded file
    signed_url = client.files.get_signed_url(file_id=uploaded_file.id, expiry=1)

    # Process PDF with OCR, including embedded images
    pdf_response = client.ocr.process(
        document=DocumentURLChunk(document_url=signed_url.url),
        model="mistral-ocr-latest",
        include_image_base64=True
    )

    # Join every page's markdown into one string for the prompt
    pdf_ocr_markdown = "\n\n".join(page.markdown for page in pdf_response.pages)


    # Get structured response from model
    chat_response = client.chat.parse(
        model="ministral-8b-latest",
        messages=[
            {
                "role": "user",
                "content": [
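                    TextChunk(
                        # Prompt (Indonesian). In English: "This is the image OCR in markdown: ...
                        # Convert this into a sensible structured JSON response in Indonesian.
                        # From the bank name, report type and report period.
                        # The output must be strictly JSON with no additional commentary."
                        text=(
                            f"Ini adalah OCR gambar dalam markdown:\n\n{pdf_ocr_markdown}\n.\n"
                            "Ubah ini menjadi respons JSON terstruktur yang masuk akal dan dalam bahasa Indonesia. "
                            "Dari nama bank, jenis laporan dan periode laporan. "
                            "Outputnya harus benar-benar JSON tanpa komentar tambahan."
                        )
                    ),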
                ],
            }
        ],
        response_format=StructuredOCR,
        temperature=0,
    )



    return chat_response.choices[0].message.parsed

We can now extract structured output from any PDF parsed with our OCR model.

In [14]:
# Example usage
pdf_path = "08 Agustus 2025 - Format New SE OJK.pdf" # Path to the sample PDF report
structured_response = structured_ocr(pdf_path) # Process PDF and extract data
structured_response


StructuredOCR(bank=1, laporan=1, periode=2025, pos={}, kas=17543261)

In [15]:
# # Parse and return JSON response
response_dict = json.loads(structured_response.model_dump_json())
print(json.dumps(response_dict, indent=4))

{
    "bank": 1,
    "laporan": 1,
    "periode": 2025,
    "pos": {},
    "kas": 17543261
}
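
The recorded output above does not match the intended fields: `bank`, `laporan` and `periode` hold numbers rather than strings, `pos` is empty, and an extra `kas` key appears. It can therefore be worth checking a result before using it downstream. The sketch below is a minimal standard-library check, not a replacement for Pydantic validation; `find_schema_problems` is a hypothetical helper name, and the expected types are taken from the `StructuredOCR` class defined earlier.

```python
def find_schema_problems(data: dict) -> list[str]:
    """List the ways `data` deviates from the intended StructuredOCR fields."""
    expected = {"bank": str, "laporan": str, "periode": str, "pos": dict}
    problems = []
    # Check that every expected field is present and has the expected type
    for field, expected_type in expected.items():
        if field not in data:
            problems.append(f"missing field: {field}")
        elif not isinstance(data[field], expected_type):
            problems.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(data[field]).__name__}"
            )
    # Flag keys that are not part of the schema
    for field in data:
        if field not in expected:
            problems.append(f"unexpected field: {field}")
    # An empty `pos` means no line items were extracted
    if isinstance(data.get("pos"), dict) and not data["pos"]:
        problems.append("pos: no line items extracted")
    return problems


# The dictionary printed in the cell above
recorded = {"bank": 1, "laporan": 1, "periode": 2025, "pos": {}, "kas": 17543261}
for problem in find_schema_problems(recorded):
    print(problem)
```

When problems are reported, options include retrying with a more explicit prompt or a larger model, or switching to the Annotations feature recommended at the top of this cookbook.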


When the input is an image rather than a PDF, the original can be displayed for comparison:

In [None]:
# from PIL import Image

# image = Image.open(image_path)
# image.resize((image.width // 5, image.height // 5))
