# Docling Reader

## Overview

[Docling](https://github.com/DS4SD/docling) parses PDF, DOCX, HTML, and other document formats into a rich representation that preserves layout, tables, and other structure, and it can export that representation to Markdown or JSON.

The Docling Reader and Docling Node Parser presented in this notebook integrate Docling into LlamaIndex, enabling you to:
- use a variety of document types in your LLM applications, and
- leverage Docling's rich format for advanced, document-native grounding.

## Setup

In [None]:
%pip install -q --progress-bar off --no-warn-conflicts llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-huggingface llama-index-llms-huggingface-api llama-index-readers-file python-dotenv

%pip install -qU llama-index-embeddings-ibm llama-index-llms-ibm

%pip install -q llama-index-vector-stores-milvus
# Quote the requirement so the shell does not treat ">" as an output redirect
%pip install -q "pymilvus>=2.4.2"

In [None]:
import os

os.environ["WATSONX_APIKEY"] = "<REPLACE_WITH_API_KEY>"
PROJECT_ID = "a62d3ee3-e744-4bf0-8eaf-cfddd415f62e"

We can now define the main parameters:

In [None]:
from dotenv import load_dotenv
from llama_index.embeddings.ibm import WatsonxEmbeddings
from llama_index.llms.ibm import WatsonxLLM

# embedding model params
truncate_input_tokens = 512

# llm params
temperature = 0.5
max_new_tokens = 2000
additional_params = {
    "decoding_method": "sample",
    "min_new_tokens": 1,
    "top_k": 50,
    "top_p": 1,
}

load_dotenv()
EMBED_MODEL = WatsonxEmbeddings(
    model_id="intfloat/multilingual-e5-large",
    url="https://us-south.ml.cloud.ibm.com",
    project_id=PROJECT_ID,
    truncate_input_tokens=truncate_input_tokens,
)

GEN_MODEL = WatsonxLLM(
    model_id="mistralai/mistral-large",
    url="https://us-south.ml.cloud.ibm.com",
    project_id=PROJECT_ID,
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    additional_params=additional_params,
)


In [None]:
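SOURCE = "https://www.btpn.com/pdf/investor/laporan-keberlanjutan/2024/laporan--keberlanjutan--bank--btpn--tahun-2023_-final.pdf"
# The source document is in Indonesian, so the query is too. In English:
# "Show the complete table of Total Number of Employees by Employment Contract, by Gender"
QUERY = "Tampilkan tabel lengkap Jumlah Total Karyawan Berdasarkan Kontrak Kerja Kepegawaian, Berdasarkan Jenis Kelamin"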

## Using Markdown export

To create a simple RAG pipeline, we can:
- define a `DoclingReader`, which by default exports to Markdown, and
- use a standard node parser for these Markdown-based docs, e.g. a `MarkdownNodeParser`

In [None]:
from llama_index.core import StorageContext, VectorStoreIndex
from llama_index.core.node_parser import MarkdownNodeParser
from llama_index.readers.docling import DoclingReader
from llama_index.vector_stores.milvus import MilvusVectorStore

# Local Milvus database file; `dim` must match the embedding model's output size
vector_store = MilvusVectorStore(
    uri="./milvus_demo_1.db",
    dim=1024,
    overwrite=True,
    hybrid_ranker="RRFRanker",
    hybrid_ranker_params={"k": 60},
)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

reader = DoclingReader()  # exports to Markdown by default
node_parser = MarkdownNodeParser()

index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    embed_model=EMBED_MODEL,
    storage_context=storage_context,
)
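
To make the role of the node parser more concrete, the sketch below splits a Markdown string into one chunk per heading using only the standard library. It is a simplified illustration of header-based splitting and not `MarkdownNodeParser`'s actual implementation; the function name `split_markdown_by_headings` and the sample text are hypothetical.

```python
import re

FENCE = "`" * 3  # Markdown code-fence marker
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX headings: "# Title" to "###### Title"


def split_markdown_by_headings(markdown_text: str) -> list[dict]:
    """Split Markdown into chunks, starting a new chunk at every heading."""
    chunks = []
    heading, lines = None, []
    in_code_fence = False

    def flush():
        # Keep a chunk only if it has a heading or some non-blank content.
        if heading is not None or any(line.strip() for line in lines):
            chunks.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown_text.splitlines():
        if line.strip().startswith(FENCE):
            in_code_fence = not in_code_fence
        # Lines starting with "#" inside a code fence are not headings.
        match = None if in_code_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), [line]
        else:
            lines.append(line)
    flush()
    return chunks


# Hypothetical Markdown, standing in for a Docling Markdown export.
sample = "# Report\nIntro text.\n\n## Employees\n| Contract | Count |\n|---|---|\n"
chunks = split_markdown_by_headings(sample)
for chunk in chunks:
    print(chunk["heading"], "->", len(chunk["text"]), "characters")
```

Each chunk keeps its heading line together with the content below it, which is why heading-delimited sections such as tables tend to stay intact when they are embedded and retrieved.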


In [None]:
from IPython.display import Markdown, display

result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}")

print("\nAnswer:")
display(Markdown(result.response.strip()))

print("\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])


## Using Docling format

To leverage Docling's rich native format, we:
- create a `DoclingReader` with JSON export type, and
- employ a `DoclingNodeParser` in order to appropriately parse that Docling format.

Notice how the sources now also contain document-level grounding (e.g. page number or bounding box information):

In [None]:
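from llama_index.node_parser.docling import DoclingNodeParser

reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)
node_parser = DoclingNodeParser()

# Start from an empty collection so the Markdown-based nodes indexed earlier
# are not mixed into the results
vector_store = MilvusVectorStore(uri="./milvus_demo_1.db", dim=1024, overwrite=True)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

index = VectorStoreIndex.from_documents(
    documents=reader.load_data(SOURCE),
    transformations=[node_parser],
    embed_model=EMBED_MODEL,
    storage_context=storage_context,
)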

In [None]:
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])
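
The grounding information can be pulled out of a source node's metadata with a small helper. The sketch below assumes the metadata layout that `DoclingNodeParser` appears to produce, namely a `doc_items` list whose entries carry a `prov` list with `page_no` and `bbox` fields; check this against the metadata printed by your installed version. The helper name `extract_grounding` and the sample metadata are hypothetical.

```python
def extract_grounding(metadata: dict) -> list[dict]:
    """Collect page numbers and bounding boxes from Docling-style node metadata."""
    grounding = []
    for item in metadata.get("doc_items", []):
        for prov in item.get("prov", []):
            grounding.append(
                {
                    "label": item.get("label"),
                    "page_no": prov.get("page_no"),
                    "bbox": prov.get("bbox"),
                }
            )
    return grounding


# Hypothetical metadata mimicking the assumed structure (values are made up).
sample_metadata = {
    "doc_items": [
        {
            "label": "table",
            "prov": [{"page_no": 3, "bbox": {"l": 50.0, "t": 700.0, "r": 540.0, "b": 420.0}}],
        },
        {
            "label": "text",
            "prov": [{"page_no": 4, "bbox": {"l": 50.0, "t": 760.0, "r": 540.0, "b": 730.0}}],
        },
    ],
    "headings": ["Example heading"],
}

grounding = extract_grounding(sample_metadata)
pages = sorted({g["page_no"] for g in grounding})
print(pages)  # [3, 4]
```

With a real query result, the same helper could be applied as `extract_grounding(n.metadata)` for each `n` in `result.source_nodes`, for example to cite the pages an answer was drawn from.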

## With Simple Directory Reader

To demonstrate this usage pattern, we first set up a test document directory.

In [None]:
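from pathlib import Path
from tempfile import mkdtemp

import requests

tmp_dir_path = Path(mkdtemp())
r = requests.get(SOURCE)
r.raise_for_status()  # fail early if the download did not succeed
# SOURCE already ends in ".pdf", so its file name can be reused as-is
with open(tmp_dir_path / Path(SOURCE).name, "wb") as out_file:
    out_file.write(r.content)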

Using the `reader` and `node_parser` definitions from any of the above variants, usage with `SimpleDirectoryReader` then looks as follows:

In [None]:
from llama_index.core import SimpleDirectoryReader

dir_reader = SimpleDirectoryReader(
    input_dir=tmp_dir_path,
    file_extractor={".pdf": reader},
)

index = VectorStoreIndex.from_documents(
    documents=dir_reader.load_data(),  # files are discovered via `input_dir`
    transformations=[node_parser],
    embed_model=EMBED_MODEL,
)

In [None]:
result = index.as_query_engine(llm=GEN_MODEL).query(QUERY)
print(f"Q: {QUERY}\nA: {result.response.strip()}\n\nSources:")
display([(n.text, n.metadata) for n in result.source_nodes])