In [2]:
# Works on macOS and Linux environments. Windows platforms are currently not tested.
!pip install docling --extra-index-url https://download.pytorch.org/whl/cpu

Looking in indexes: https://pypi.org/simple, https://download.pytorch.org/whl/cpu


In [3]:
!pip install docling-core python-dotenv langchain-text-splitters langchain-huggingface langchain-milvus



In [4]:
from docling.document_converter import DocumentConverter

# source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
source = "https://arxiv.org/pdf/2206.01062"
converter = DocumentConverter()
result = converter.convert_single(source)
print(result.render_as_markdown())  # output: "## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis[...]"



## DocLayNet: A Large Human-Annotated Dataset for Document-Layout Analysis

Birgit Pfitzmann IBM Research Rueschlikon, Switzerland bpf@zurich.ibm.com

Christoph Auer IBM Research Rueschlikon, Switzerland cau@zurich.ibm.com

Michele Dolfi IBM Research Rueschlikon, Switzerland dol@zurich.ibm.com

Ahmed S. Nassar IBM Research Rueschlikon, Switzerland ahn@zurich.ibm.com

Peter Staar IBM Research Rueschlikon, Switzerland taa@zurich.ibm.com

## ABSTRACT

Accurate document layout analysis is a key requirement for highquality PDF document conversion. With the recent availability of public, large ground-truth datasets such as PubLayNet and DocBank, deep-learning models have proven to be very effective at layout detection and segmentation. While these datasets are of adequate size to train such models, they severely lack in layout variability since they are sourced from scientific article repositories such as PubMed and arXiv only. Consequently, the accuracy of the layout segmentation drops sign

In [5]:
import warnings

warnings.filterwarnings(action="ignore", category=UserWarning, module="pydantic|torch")
warnings.filterwarnings(action="ignore", category=FutureWarning, module="easyocr")

In [6]:
from enum import Enum
from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document as LCDocument
from pydantic import BaseModel

from docling.document_converter import DocumentConverter


class DocumentMetadata(BaseModel):
    dl_doc_hash: str
    # source: str


class DoclingPDFLoader(BaseLoader):
    class ParseType(str, Enum):
        MARKDOWN = "markdown"
        # JSON = "json"

    def __init__(self, file_path: str | list[str], parse_type: ParseType) -> None:
        self._file_paths = file_path if isinstance(file_path, list) else [file_path]
        self._parse_type = parse_type
        self._converter = DocumentConverter()

    def lazy_load(self) -> Iterator[LCDocument]:
        for source in self._file_paths:
            dl_doc = self._converter.convert_single(source).output
            match self._parse_type:
                case self.ParseType.MARKDOWN:
                    text = dl_doc.export_to_markdown()
                # case self.ParseType.JSON:
                #     text = dl_doc.model_dump_json()
                case _:
                    raise RuntimeError(
                        f"Unexpected parse type encountered: {self._parse_type}"
                    )
            lc_doc = LCDocument(
                page_content=text,
                metadata=DocumentMetadata(
                    dl_doc_hash=dl_doc.file_info.document_hash,
                ).model_dump(),
            )
            yield lc_doc

In [7]:
FILE_PATH = "https://arxiv.org/pdf/2206.01062"


In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

loader = DoclingPDFLoader(
    file_path=FILE_PATH,
    parse_type=DoclingPDFLoader.ParseType.MARKDOWN,
)
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
)


In [9]:
docs = loader.load()
splits = text_splitter.split_documents(docs)
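
The `chunk_size=1000` and `chunk_overlap=200` settings above control how much text each chunk holds and how much neighbouring chunks share. The sketch below illustrates that idea with a plain fixed-width character window. It is a simplification, not the algorithm `RecursiveCharacterTextSplitter` uses: the real splitter first tries to break on separators such as paragraph and line boundaries, so its chunks are usually shorter than `chunk_size`. The helper name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap.

    Hypothetical helper for illustration only; it ignores separators entirely.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# 2500 characters of dummy text, using the same settings as the cell above.
sample = "".join(str(i % 10) for i in range(2500))
chunks = sliding_window_chunks(sample, chunk_size=1000, chunk_overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900]
```

The last 200 characters of one chunk reappear as the first 200 of the next, which keeps a sentence that straddles a boundary available in at least one chunk.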

In [10]:
from langchain_huggingface.embeddings import HuggingFaceEmbeddings

HF_EMBED_MODEL_ID = "BAAI/bge-small-en-v1.5"
embeddings = HuggingFaceEmbeddings(model_name=HF_EMBED_MODEL_ID)

In [11]:
from tempfile import TemporaryDirectory
import os
from langchain_milvus import Milvus

MILVUS_URI = os.environ.get(
    "MILVUS_URL", f"{(tmp_dir := TemporaryDirectory()).name}/milvus_demo.db"
)

vectorstore = Milvus.from_documents(
    splits,
    embeddings,
    connection_args={"uri": MILVUS_URI},
    drop_old=True,
)
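
Conceptually, the vector store keeps one embedding vector per chunk and answers a query by returning the chunks whose vectors are closest to the query vector. The sketch below shows that idea with cosine similarity over a tiny hand-written index. It is only an assumption-laden illustration: the three-dimensional vectors, the `toy_index` contents, and the `top_k` helper are made up, and the index type and distance metric that Milvus actually uses depend on its configuration.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int) -> list[str]:
    """Return the texts of the k entries most similar to query_vec (hypothetical helper)."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Made-up 3-dimensional "embeddings"; real embedding models produce far longer vectors.
toy_index = [
    ("layout analysis", [0.9, 0.1, 0.0]),
    ("table structure", [0.1, 0.9, 0.0]),
    ("microscope manual", [0.0, 0.1, 0.9]),
]
print(top_k([0.8, 0.2, 0.0], toy_index, k=2))  # ['layout analysis', 'table structure']
```

A top-k search of this kind always returns the k nearest entries, even when none of them is a good match for the query.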

In [12]:
from langchain_huggingface import HuggingFaceEndpoint
import os
from getpass import getpass

hf_api_key = getpass("HF API Key: ")
os.environ["HF_API_KEY"] = hf_api_key

HF_LLM_MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"

llm = HuggingFaceEndpoint(
    repo_id=HF_LLM_MODEL_ID,
    huggingfacehub_api_token=hf_api_key,
)



In [13]:
from typing import Iterable

from langchain_core.documents import Document as LCDocument
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough


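def format_docs(docs: Iterable[LCDocument]) -> str:
    # Join the retrieved chunks into a single context string for the prompt.
    formatted_docs = "\n\n".join(doc.page_content for doc in docs)
    print(formatted_docs)  # debug output: shows the context passed to the LLM
    return formatted_docs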


retriever = vectorstore.as_retriever()

prompt = PromptTemplate.from_template(
    "Context information is below.\n---------------------\n{context}\n---------------------\nGiven the context information and not prior knowledge, answer the query.\nQuery: {question}\nAnswer:\n"
)


rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)
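
To make the data flow of the chain concrete: the retrieved chunks are joined into one context string, that string and the question are substituted into the template, and the resulting text is what the LLM receives. The sketch below reproduces the substitution step with plain `str.format`, without LangChain. The `build_prompt` helper and the two example chunks are hypothetical; the template text is the one from the cell above.

```python
PROMPT_TEMPLATE = (
    "Context information is below.\n---------------------\n{context}\n"
    "---------------------\nGiven the context information and not prior knowledge, "
    "answer the query.\nQuery: {question}\nAnswer:\n"
)


def build_prompt(chunks: list[str], question: str) -> str:
    """Join chunks with blank lines and fill the template (hypothetical helper)."""
    context = "\n\n".join(chunks)  # same join that format_docs performs
    return PROMPT_TEMPLATE.format(context=context, question=question)


example_prompt = build_prompt(
    ["First retrieved chunk.", "Second retrieved chunk."],
    "What is DocLayNet?",
)
print(example_prompt)
```

Because the prompt instructs the model to rely on the context and not on prior knowledge, the quality of the answer depends directly on which chunks the retriever returns.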

In [16]:
retriever.invoke("What bio studies and clinical uses does the BARSKA AY11236 microscope support?")

[Document(metadata={'dl_doc_hash': '5dfbd8c115a15fd3396b68409124cfee29fc8efac7b5c846634ff924e635e0dc', 'pk': 452554224253796395}, page_content='|                | human   | MRCNN   | MRCNN   | FRCNN   | YOLO   |\n|----------------|---------|---------|---------|---------|--------|\n|                | human   | R50     | R101    | R101    | v5x6   |\n| Caption        | 84-89   | 68.4    | 71.5    | 70.1    | 77.7   |\n| Footnote       | 83-91   | 70.9    | 71.8    | 73.7    | 77.2   |\n| Formula        | 83-85   | 60.1    | 63.4    | 63.5    | 66.2   |\n| List-item      | 87-88   | 81.2    | 80.8    | 81.0    | 86.2   |\n| Page-footer    | 93-94   | 61.6    | 59.3    | 58.9    | 61.1   |\n| Page-header    | 85-89   | 71.9    | 70.0    | 72.0    | 67.9   |\n| Picture        | 69-71   | 71.7    | 72.7    | 72.0    | 77.1   |\n| Section-header | 83-84   | 67.6    | 69.3    | 68.4    | 74.6   |\n| Table          | 77-81   | 82.2    | 82.9    | 82.2    | 86.3   |\n| Text           | 84-86   |

In [14]:
rag_chain.invoke("What bio studies and clinical uses does the BARSKA AY11236 microscope support?")

In order to accommodate the different types of models currently in use by the community, we provide DocLayNet in an augmented COCO format [16]. This entails the standard COCO ground-truth file (in JSON format) with the associated page images (in PNG format, 1025 × 1025 pixels). Furthermore, custom fields have been added to each COCO record to specify document category, original document filename and page number. In addition, we also provide the original PDF pages, as well as sidecar files containing parsed PDF text and text-cell coordinates (in JSON). All additional files are linked to the primary page images by their matching filenames.

structured source data must be available. PubLayNet and DocBank were both generated from scientific document repositories (PubMed and arXiv), which provide XML or L A T E X sources. Those scientific documents present a limited variability in their layouts, because they are typeset in uniform templates provided by the publishers. Obviously, documents s

'The context information does not provide any information about the BARSKA AY11236 microscope, its bio studies, or its clinical uses.'