# Docling User Guide

## Introduction

#### Docling is a powerful tool for converting documents into different formats.
#### This guide will walk you through using Docling and give you an overview of its features and limitations.

## Installation

#### To install Docling, use pip:

In [None]:
pip install docling


Note: you may need to restart the kernel to use updated packages.


## Usage

#### Here is a simple example of using Docling, taken directly from its documentation:



## Importing libraries

#### We're going to import the modules needed to use Docling in a Python environment.

In [None]:
from docling.document_converter import DocumentConverter

In [1]:
source = "https://arxiv.org/pdf/2408.09869"  # PDF path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())  # output: "### Docling Technical Report[...]"

  from .autonotebook import tqdm as notebook_tqdm
Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 33465.19it/s]
Downloading detection model, please wait. This may take several minutes depending upon your network connection.


URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)>

#### This code should convert a PDF to Markdown, but it fails with an SSL certificate error raised right after the detection model download starts.

#### The same call works on an HTML page, as shown below:

In [None]:
source = "https://newsroom.ibm.com/2024-11-14-ufc-names-ibm-as-first-ever-official-ai-partner?utm_medium=OSocial&utm_source=Linkedin&utm_content=WTXWW&utm_id=IBMUFC2024LinkedInNov14&sf207455141=1"  # document given as a local path or URL
converter = DocumentConverter()
result = converter.convert(source)
print(result.document.export_to_markdown())

- News
    - All press releases
    - Think 2024
    - Artificial intelligence
    - Hybrid cloud
    - Research and innovation
    - Corporate
    - Social impact
    - Mergers & acquisitions
- Media resources

- Asset gallery
- B-roll gallery
- Media contacts
- Global newsrooms
- Inside IBM

- Leadership
- IBM boilerplate
- Investor relations
- Annual report
- Analyst reports
- CSR
- IBM policy
- Awards
- Blog

- IBM blog
- IBM Research blog
- securityintelligence.com
- Subscribe

# All press releases

# UFC Names IBM as First-Ever Official AI Partner

<!-- image -->

- 
- 
- 
- 

ARMONK, N.Y. and LAS VEGAS, Nov. 14, 2024 /PRNewswire/ -- IBM (NYSE: IBM) and UFC®️, the world's premier mixed martial arts organization and part of TKO Group Holdings (NYSE: TKO), today announced an innovative new partnership that will combine the power of IBM's AI and data platform, watsonx, with the vast global reach of UFC's content platforms to enhance the viewing experience for millions of UFC fans ar

#### Further tests:

In [5]:
from docling.document_converter import DocumentConverter

# Example test with a PDF
source_pdf = "https://arxiv.org/pdf/2408.09869"
converter = DocumentConverter()

try:
    result = converter.convert(source_pdf)
    print("Result in Markdown:\n")
    print(result.document.export_to_markdown())
except Exception as e:
    print(f"Error during PDF conversion: {e}")
    print("Try using the CLI as an alternative.")

Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 126249.95it/s]
Downloading detection model, please wait. This may take several minutes depending upon your network connection.


Error during PDF conversion: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)>
Try using the CLI as an alternative.
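
#### The error above is raised right after the "Downloading detection model" message, so it most likely comes from HTTPS certificate verification during that download rather than from the PDF parsing itself. The sketch below uses only the standard library to show where this Python installation looks for trusted CA certificates; the function name `describe_ca_configuration` is hypothetical, and the values printed depend on the machine:

```python
import os
import ssl


def describe_ca_configuration():
    """Report where this Python looks for trusted CA certificates."""
    paths = ssl.get_default_verify_paths()
    return {
        "cafile": paths.cafile,
        "capath": paths.capath,
        # False when no CA file is configured or the configured file is missing
        "cafile_exists": bool(paths.cafile) and os.path.exists(paths.cafile),
        # Name of the environment variable OpenSSL reads to override the CA file
        "env_override_name": paths.openssl_cafile_env,
        "env_override_value": os.environ.get(paths.openssl_cafile_env),
    }


info = describe_ca_configuration()
for key, value in info.items():
    print(f"{key}: {value}")
```

#### If `cafile_exists` is False and no override is set, pointing the variable named in `env_override_name` at a valid CA bundle is a common remedy, although it was not tested in this notebook.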


In [None]:
from docling.document_converter import DocumentConverter

# Example test with an HTML page
source_html = "https://arxiv.org/"
converter = DocumentConverter()

try:
    result = converter.convert(source_html)
    print("Result in Markdown:\n")
    print(result.document.export_to_markdown())
except Exception as e:
    print(f"Error during HTML conversion: {e}")
    print("Try using the CLI as an alternative.")


Result in Markdown:

<!-- image -->

#

Help | Advanced Search

<!-- image -->

<!-- image -->

## quick links

- Login
- Help Pages
- About

arXiv is a free distribution service and an open-access archive for nearly 2.4 million
      scholarly articles in the fields of physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics.
      Materials on this site are not peer-reviewed by arXiv.

## Physics

- Astrophysics
      (astro-ph
new,
      recent,
      search)

Astrophysics of Galaxies; Cosmology and Nongalactic Astrophysics; Earth and Planetary Astrophysics; High Energy Astrophysical Phenomena; Instrumentation and Methods for Astrophysics; Solar and Stellar Astrophysics
- Condensed Matter
      (cond-mat
new,
      recent,
      search)

Disordered Systems and Neural Networks; Materials Science; Mesoscale and Nanoscale Physics; Other Condensed Matter; Quantum Gases; Soft Condensed Matter

#### We now test the conversion of a local PDF file to Markdown.

In [9]:
source_pdf = "/Users/quentin/Documents/GitHub/docling-testing/assets/maintenance-auto.pdf"
converter = DocumentConverter()

try:
    result = converter.convert(source_pdf)

    print(result.document.export_to_markdown())
except Exception as e:
    print("Erreur", e)

Fetching 9 files: 100%|██████████| 9/9 [00:00<00:00, 147456.00it/s]
Downloading detection model, please wait. This may take several minutes depending upon your network connection.


Erreur <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1006)>
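
#### The local file fails with the same certificate error, again right after the detection model download starts. One option to try is disabling OCR from Python, mirroring the `--no-ocr` flag of the CLI used later in this guide. This is a minimal sketch, assuming the installed docling version exposes `PdfPipelineOptions` and `PdfFormatOption` as described in its documentation; the helper name `build_converter_without_ocr` is hypothetical and this configuration was not run in the notebook:

```python
def build_converter_without_ocr():
    """Return a DocumentConverter whose PDF pipeline skips OCR."""
    # Imports are local so this cell can be defined even where docling is missing.
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    pipeline_options = PdfPipelineOptions()
    # Assumption: the OCR step is what triggers the failing model download.
    pipeline_options.do_ocr = False

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
        }
    )


# Usage (requires docling to be installed):
# converter = build_converter_without_ocr()
# result = converter.convert("/path/to/sample.pdf")
# print(result.document.export_to_markdown())
```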


#### We now test with a local HTML file.

In [10]:
source_html = "/Users/quentin/Documents/GitHub/docling-testing/assets/meteo_montpellier.html"
converter = DocumentConverter()

try:
    result = converter.convert(source_html)

    print(result.document.export_to_markdown())
except Exception as e:
    print("Error", e)

<!-- image -->

<!-- image -->

        - Facebook
        - Twitter
        - Linkedin

<!-- image -->

<!-- image -->

Vigilance météo

Soyez attentif

        - Previsions
        - Meteo Marine
        - Meteo Montagne
        - Climat
            - Services climatiques
                - DRIAS, les futurs du climat
                - DRIAS-Eau, les futurs de l'eau
                - Climadiag Agriculture
                - Climadiag Commune
                - Climadiag Chaleur en ville
                - Climadiag Entreprise
                - Climat HD
        - Comprendre le climat
            - Le Climat Mondial
            - Le Climat en France
            - Etudier le Climat Passé
    - Le Changement Climatique
        - Observer le Changement Climatique
        - Quel Climat Futur ?
- Normales et Relevés
    - Normales et Records
    - Relevés Météorologiques
    - Bulletins Climatologiques
- Tendances à 3 Mois

- Tendances à 3 Mois
- Actus & Dossiers

- Actualités
    - A la une
 

#### Now let's turn to the CLI, as recommended in the documentation:

![CLI command](../assets/oder_cli.png)


#### Now let's test a command with OCR:

![CLI command with OCR](../assets/oder_ocr.png)


#### Now without OCR:

![CLI command without OCR](../assets/order_no_ocr.png)


#### Let's test and analyse the tool without OCR, since the OCR step fails on PDF files here:

 - PDF conversion without OCR:
```bash
docling /path/to/sample.pdf --to md --no-ocr
```
 - Converting with OCR (may fail on some files):
```bash
docling /path/to/sample.pdf --to md --ocr
```
 - Convert HTML or URL to Markdown:
```bash
docling https://example.com/sample.html --to md
```
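
#### The three commands above can also be launched from a notebook cell. The sketch below only assembles the flags already shown (`--to`, `--ocr`, `--no-ocr`); the helper names `build_docling_command` and `run_docling` are hypothetical, and the CLI is called only if `docling` is found on the PATH:

```python
import shutil
import subprocess


def build_docling_command(source, output_format="md", ocr=True):
    """Build the argument list for the docling CLI calls listed above."""
    ocr_flag = "--ocr" if ocr else "--no-ocr"
    return ["docling", source, "--to", output_format, ocr_flag]


def run_docling(source, output_format="md", ocr=True):
    """Run the CLI if it is installed; return the completed process or None."""
    if shutil.which("docling") is None:
        return None
    command = build_docling_command(source, output_format, ocr)
    return subprocess.run(command, capture_output=True, text=True, check=False)


print(build_docling_command("/path/to/sample.pdf", ocr=False))
# prints ['docling', '/path/to/sample.pdf', '--to', 'md', '--no-ocr']
```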

#### Let's analyse how well tables are converted, using a PDF:

#### Simple table:

![Simple table, source](../assets/simple_board.png)

![Simple table, result](../assets/simple_board_result.png)


#### Note that for the first two tables, the contents of the ‘Description’ and ‘R134a’ columns have been swapped!

#### For the third table, the text, the placement of the images and the titles are all correct!

#### Result for a large table:

![Large table, source](../assets/big_board.png)


![Large table, result](../assets/big_board_result.png)


#### The word "waste" is repeated and there are a few minor discrepancies, but the result remains accurate overall. It is worth checking whether this has any impact in a RAG (Retrieval-Augmented Generation) process.

#### Let's try RAG to test information retrieval without vector indexing:

In [50]:
import os
import re
from dotenv import load_dotenv
from langchain.prompts import PromptTemplate
from genai import Credentials, Client
from genai.schema import TextGenerationParameters
from genai.extensions.langchain import LangChainInterface

# Load environment variables
load_dotenv()

CHEMIN_FICHIER_MD = "/Users/quentin/Documents/GitHub/wx-demo-rag/demo-web/data/maintenance-auto.md"

prompt_personnalisé = PromptTemplate(
    template="""
    Tu es un assistant spécialisé dans la documentation. Tu as accès à un document qui peut contenir :
    - Des chapitres
    - Des sections
    - Des sous-sections
    - Des informations techniques

    Contexte du document : {contenu}

    Question : {question}

    Donne une réponse précise et structurée basée uniquement sur les informations du document tu ne dois pas modifier ou inventer.
    Si l'information n'est pas disponible dans le contexte, indique-le clairement.
    """,
    input_variables=["contenu", "question"]
)

cle_api_bam = os.getenv("GENAI_KEY")
endpoint_bam = os.getenv("GENAI_ENDPOINT")
credentials_bam = Credentials(cle_api_bam, api_endpoint=endpoint_bam)
client_bam = Client(credentials=credentials_bam)

params_bam = TextGenerationParameters(
    decoding_method="greedy",
    max_new_tokens=1000,
    min_new_tokens=10,
    repetition_penalty=1,
    temperature=0.7,
    top_k=50,
    top_p=0.95
)

modele = LangChainInterface(
    model_id="mistralai/mixtral-8x7b-instruct-v01",
    client=client_bam,
    parameters=params_bam,
)

# Load and split the Markdown file
def charger_et_segmenter_markdown():
    """Load the Markdown file and split it into sections."""
    if not os.path.exists(CHEMIN_FICHIER_MD):
        raise FileNotFoundError(f"Fichier Markdown introuvable au chemin {CHEMIN_FICHIER_MD}")
    
    with open(CHEMIN_FICHIER_MD, 'r', encoding='utf-8') as f:
        contenu = f.read()
    
    # Split into sections on level-2 headings (##)
    sections = re.split(r'\n##\s+', contenu)
    return sections

# Find the most relevant section
def trouver_section_pertinente(question, sections):
    """Return the most relevant section for the question."""
    for section in sections:
        if question.lower() in section.lower():
            return section
    return sections[0]  # If nothing matches, return the first section

# Generate an answer based on one section
def repondre_avec_section(question):
    """Answer a question using a relevant section."""
    sections = charger_et_segmenter_markdown()
    section_pertinente = trouver_section_pertinente(question, sections)
    
    if not section_pertinente:
        return "Aucune section pertinente trouvée."
    
    prompt_complet = prompt_personnalisé.format(contenu=section_pertinente, question=question)
    
    try:
        reponse = modele.generate(prompts=[prompt_complet])
        return reponse.generations[0][0].text.strip()
    except Exception as e:
        return f"Erreur lors de la génération de la réponse : {str(e)}"

# Example usage
if __name__ == "__main__":
    try:
        print("Bienvenue ! Posez une question ou entrez 'quit' pour quitter.")
        question = input("\nEntrez votre question : ")
        
        if question.lower() in ['quit', 'exit', 'q']:
            print("Au revoir !")
        else:
            reponse = repondre_avec_section(question)
            print(f"\nRéponse :\n{reponse}")
    except KeyboardInterrupt:
        print("\nRecherche interrompue par l'utilisateur.")
    except Exception as e:
        print(f"\nErreur : {str(e)}")

Bienvenue ! Posez une question ou entrez 'quit' pour quitter.

Réponse :
Question : Qu'est-ce qu'un fluide frigorigène ?

    Réponse : Un fluide frigorigène est un fluide utilisé dans les systèmes de climatisation et de réfrigération pour transférer de la chaleur. Il est soumis à une réglementation stricte en raison de son potentiel de réchauffement planétaire et de son impact sur la couche d'ozone. Pour intervenir sur ces systèmes et manipuler ces fluides, une attestation de capacité est actuellement requise.
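
#### Note that `trouver_section_pertinente` only matches when the entire question appears verbatim inside a section, so most questions fall back to the first section. The sketch below shows one simple alternative, keyword-overlap scoring; it is not the method used in this notebook, and `tokenize`, `find_best_section` and the sample sections are hypothetical:

```python
import re


def tokenize(text):
    """Lowercase the text and return its set of words."""
    return set(re.findall(r"\w+", text.lower()))


def find_best_section(question, sections, min_word_length=4):
    """Return the section sharing the most keywords with the question."""
    # Ignore very short words, which are mostly articles and prepositions.
    keywords = {w for w in tokenize(question) if len(w) >= min_word_length}
    best_section, best_score = sections[0], 0
    for section in sections:
        score = len(keywords & tokenize(section))
        if score > best_score:
            best_section, best_score = section, score
    return best_section  # falls back to the first section when nothing matches


sample_sections = [
    "Introduction\nThis manual covers car maintenance.",
    "Refrigerants\nA refrigerant transfers heat in air-conditioning systems.",
    "Tools\nWrenches and socket sets are common tools.",
]
best = find_best_section("What is a refrigerant?", sample_sections)
print(best.splitlines()[0])  # prints Refrigerants
```

#### Exact word overlap misses plurals and synonyms, which is one reason vector-based retrieval is usually preferred for larger documents.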


#### Let's now target a specific section for the model to search:

##### In the test I target document 702 and ask the question ‘The dangers of refrigerants’.

In [42]:
def lister_titres_disponibles():
    with open(CHEMIN_FICHIER_MD, 'r', encoding='utf-8') as f:
        contenu = f.read()
    titres = re.findall(r'^##\s+(.+)$', contenu, re.MULTILINE)
    return titres

if __name__ == "__main__":
    try:
        print("Bienvenue ! Voici les sections disponibles dans le document :")
        titres = lister_titres_disponibles()
        for i, titre in enumerate(titres, 1):
            print(f"{i}. {titre}")

        choix = int(input("\nEntrez le numéro de la section (ou 0 pour quitter) : "))
        if choix == 0:
            print("Au revoir !")
        else:
            titre_recherche = titres[choix - 1]
            terme_recherche = input("\nEntrez votre question : ")
            reponse = repondre_avec_contexte_precis(terme_recherche, titre_recherche)
            print(f"\nRéponse :\n{reponse}")
    except KeyboardInterrupt:
        print("\nRecherche interrompue par l'utilisateur.")
    except Exception as e:
        print(f"\nErreur : {str(e)}")


Bienvenue ! Voici les sections disponibles dans le document :
1. Des mêmes auteurs
2. Technologie fonctionnelle de l'automobile, 6 e édition, 2009
3. © Dunod, Paris, 1994, 2010 EAN 9782100554652
4. INTRODUCTION
5. Quel est l'objectif de l'ouvrage ?
6. À qui s'adresse cet ouvrage ?
7. Comment travailler avec ce manuel ?
8. TABLE DES MATIÈRES
9. TABLE DES MATIÈRES
10. TABLE DES MATIÈRES
11. TABLE DES MATIÈRES
12. RECHERCHE D'UNE PANNE
13. Objectif
14. MATÉRIELS, CONSOMMABLES ET DOCUMENTS NÉCESSAIRES
15. LES ÉTAPES DE LA RECHERCHE D'UNE PANNE
16. À NOTER
17. IDENTIFIER ET CHOISIR L'OUTILLAGE
18. OBJECTIF
19. MATÉRIELS, CONSOMMABLES ET DOCUMENTS NÉCESSAIRES
20. L'OUTILLAGE COURANT
21. Les clés
22. Le coffret à douilles
23. L'outillage pour l'entretien
24. L'OUTILLAGE SPÉCIFIqUE (ExEMPLES)
25. À NOTER
26. Les instruments de mesure
27. ORGANISER UNE RÉPARATION
28. OBJECTIF
29. MATÉRIELS, CONSOMMABLES ET DOCUMENTS NÉCESSAIRES
30. ORGANISER SON POSTE DE TRAVAIL
31. S'informer
32. Préparer
33. 
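
#### The cell above calls `repondre_avec_contexte_precis`, whose definition does not appear in this guide. The sketch below covers only the section-lookup step such a function would need, assuming sections are delimited by level-2 headings as in `lister_titres_disponibles`; `extract_section_by_title` and the sample text are hypothetical:

```python
import re


def extract_section_by_title(markdown_text, title):
    """Return the body under the '## title' heading, up to the next '## ' heading."""
    pattern = rf"^##[ \t]+{re.escape(title)}[ \t]*$(.*?)(?=^##[ \t]|\Z)"
    match = re.search(pattern, markdown_text, re.MULTILINE | re.DOTALL)
    return match.group(1).strip() if match else None


sample_markdown = (
    "## Intro\nGeneral notes.\n\n"
    "## Dangers\nHandle with care.\n### Details\nWear gloves.\n\n"
    "## Tools\nWrenches.\n"
)
print(extract_section_by_title(sample_markdown, "Dangers"))
```

#### Deeper headings such as `###` stay inside the extracted section. The listing above contains repeated titles (for example TABLE DES MATIÈRES), and a lookup by title returns only the first match, so selecting by position may be more reliable for this document.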

## Conclusion

#### Docling is a powerful tool for extracting structured content from a variety of formats, and it does so quickly.

#### However, it has limitations, particularly when OCR is used on PDF files: in these tests, the conversion stopped with an SSL certificate error right after the detection model download started.

#### These limitations can be worked around by adjusting the parameters, by preprocessing the files, or simply by using the CLI commands.

#### For RAG on large files, vector indexing is needed; otherwise the document may exceed the model's token limit.
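
#### Before any vector indexing, a large Markdown file has to be cut into pieces that fit the model's context. The sketch below is a minimal paragraph-based chunker, assuming character count as a rough proxy for tokens; `split_into_chunks` and the 2000-character default are hypothetical, and the sketch does not perform the indexing itself:

```python
def split_into_chunks(text, max_chars=2000):
    """Greedily pack paragraphs into chunks of at most max_chars characters.

    A single paragraph longer than max_chars is kept whole in its own chunk.
    """
    chunks, current = [], ""
    for paragraph in text.split("\n\n"):
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


sample_text = "\n\n".join(["a" * 30, "b" * 30, "c" * 30])
sample_chunks = split_into_chunks(sample_text, max_chars=70)
print([len(chunk) for chunk in sample_chunks])  # prints [62, 30]
```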