### Docling
- 🗂️ Reads popular document formats (PDF, DOCX, PPTX, XLSX, Images, HTML, AsciiDoc & Markdown) and exports to HTML, Markdown and JSON (with embedded and referenced images)
- 📑 Advanced PDF document understanding including page layout, reading order & table structures
- 🧩 Unified, expressive DoclingDocument representation format
- 🤖 Plug-and-play integrations incl. LangChain, LlamaIndex, Crew AI & Haystack for agentic AI
- 🔍 OCR support for scanned PDFs
- 💻 Simple and convenient CLI
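
As a rough illustration of the format list above, the sketch below checks whether a file has one of the listed formats before it is handed to a converter. The helper name `is_supported` and the extension set are assumptions made for illustration, not part of Docling's API; the image extensions in particular are a guess at what "Images" covers.

```python
from pathlib import Path

# Hypothetical mapping from the formats listed above to file extensions.
# The image extensions are an assumption; Docling's own list may differ.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx",
    ".html", ".htm", ".adoc", ".asciidoc", ".md",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp",
}


def is_supported(path):
    """Return True if the file extension is in SUPPORTED_EXTENSIONS (case-insensitive)."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("Earnings-Presentation-Q3-2024.PDF"))
print(is_supported("notes.txt"))
```

A set lookup on the lower-cased suffix keeps the check independent of how the file name is capitalised.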

In [38]:
# https://github.com/DS4SD/docling
# https://python.langchain.com/docs/integrations/document_loaders/docling/


## Other libraries for document loading
# MegaParse and MarkItDown
# https://github.com/QuivrHQ/MegaParse
# https://github.com/microsoft/markitdown


In [None]:
# !pip install langchain_docling
# !pip install --upgrade jupyter ipywidgets



In [1]:
from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
import os

from pathlib import Path

# Example inputs; FILE_PATH is not used by the code below
# FILE_PATH = r"..\00 Dataset\docs\facebook\Meta-09-30-2024-Exhibit-99-1_FINAL.pdf"
FILE_PATH = r"..\00 Dataset\docs\facebook\Downloadable-BS-Q3-24.xlsx"
# FILE_PATH = "https://pdfobject.com/pdf/sample.pdf"

def get_markdown_without_figure(input_doc_path, target_dir):
    """Convert one document to Markdown with DoclingLoader and save it in target_dir.

    Returns the path of the Markdown file that was written.
    """
    loader = DoclingLoader(file_path=input_doc_path,
                           export_type=ExportType.MARKDOWN)

    # With ExportType.MARKDOWN the whole input file is expected in docs[0]
    docs = loader.load()

    # File name without directory or extension
    doc_filename = Path(input_doc_path).stem

    os.makedirs(target_dir, exist_ok=True)
    md_filename = f"{target_dir}/{doc_filename}-with-image-refs.md"

    # Write the Markdown to disk; the with-block closes the file
    with open(md_filename, "w", encoding="utf-8") as f:
        f.write(docs[0].page_content)

    return md_filename


input_doc_path = r"..\00 Dataset\docs\facebook\Earnings-Presentation-Q3-2024.pdf"
md_filename = get_markdown_without_figure(input_doc_path, 'output')

md_filename
# docs

If this is not desired, please set os.environ['TORCH_CUDA_ARCH_LIST'].
Could not load the custom kernel for multi-scale deformable attention: Command '['where', 'cl']' returned non-zero exit status 1.
Could not load the custom kernel for multi-scale deformable attention: DLL load failed while importing MultiScaleDeformableAttention: The specified module could not be found.

'output/Earnings-Presentation-Q3-2024-with-image-refs.md'
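
Model loading can log the same warning many times per document. One way to keep the output readable is a filter that lets each distinct message through once. This is a minimal sketch, assuming the messages are emitted through Python's `logging` module; the class name `DedupeFilter` and the logger name `dedupe_demo` are hypothetical. The filter is attached to a handler rather than a logger because logger-level filters do not see records propagated from child loggers.

```python
import logging


class DedupeFilter(logging.Filter):
    """Let each distinct log message through once and drop later repeats."""

    def __init__(self):
        super().__init__()
        self._seen = set()

    def filter(self, record):
        message = record.getMessage()
        if message in self._seen:
            return False  # repeat: suppress
        self._seen.add(message)
        return True  # first occurrence: keep


# Demo on a hypothetical logger; in practice the filter would be added to
# whichever handler prints the noisy warnings.
handler = logging.StreamHandler()
handler.addFilter(DedupeFilter())

demo_logger = logging.getLogger("dedupe_demo")
demo_logger.addHandler(handler)
demo_logger.propagate = False

demo_logger.warning("Could not load the custom kernel")
demo_logger.warning("Could not load the custom kernel")  # suppressed by the filter
```

If the warnings turn out to be written directly to stderr instead of going through `logging`, this filter has no effect on them.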

## Reading an Entire Document Directory

In [2]:
import os

def process_documents(input_dir, output_dir):

    for root, _, files in os.walk(input_dir):
        # Calculate the relative path to preserve subdirectory structure
        relative_path = os.path.relpath(root, input_dir)
        target_dir = os.path.join(output_dir, relative_path)
        os.makedirs(target_dir, exist_ok=True)  # Ensure target directory exists

        for file in files:
            print(f"Processing file: {file}")
            if file.lower().endswith(('.pdf', '.docx', '.pptx', '.xlsx')):
                input_path = os.path.join(root, file)
                print(f"Processing file: {input_path}")

                # Convert the document (PDF, DOCX, PPTX or XLSX) to Markdown
                md_filename = get_markdown_without_figure(input_path, target_dir)

                # Print the location of the saved Markdown file
                print(f"Markdown saved to: {md_filename}")

# Example Usage
input_dir = r"../00 Dataset/docs"
output_dir = r"../00 Dataset/markdown"
process_documents(input_dir, output_dir)


Processing file: llm_paper.pdf
Processing file: ../00 Dataset/docs\llm_paper.pdf


Markdown saved to: ../00 Dataset/markdown\./llm_paper-with-image-refs.md
Processing file: amazon-10-q-q3-2024.pdf
Processing file: ../00 Dataset/docs\amazon\amazon-10-q-q3-2024.pdf


Markdown saved to: ../00 Dataset/markdown\amazon/amazon-10-q-q3-2024-with-image-refs.md
Processing file: Downloadable-BS-Q3-24.xlsx
Processing file: ../00 Dataset/docs\facebook\Downloadable-BS-Q3-24.xlsx
Markdown saved to: ../00 Dataset/markdown\facebook/Downloadable-BS-Q3-24-with-image-refs.md
Processing file: Downloadable-PL-Q3-24.xlsx
Processing file: ../00 Dataset/docs\facebook\Downloadable-PL-Q3-24.xlsx
Markdown saved to: ../00 Dataset/markdown\facebook/Downloadable-PL-Q3-24-with-image-refs.md
Processing file: Earnings-Presentation-Q3-2024.pdf
Processing file: ../00 Dataset/docs\facebook\Earnings-Presentation-Q3-2024.pdf


Markdown saved to: ../00 Dataset/markdown\facebook/Earnings-Presentation-Q3-2024-with-image-refs.md
Processing file: Meta-09-30-2024-Exhibit-99-1_FINAL.pdf
Processing file: ../00 Dataset/docs\facebook\Meta-09-30-2024-Exhibit-99-1_FINAL.pdf


Markdown saved to: ../00 Dataset/markdown\facebook/Meta-09-30-2024-Exhibit-99-1_FINAL-with-image-refs.md
Processing file: META-Q3-2024-Earnings-Call-Transcript.pdf
Processing file: ../00 Dataset/docs\facebook\META-Q3-2024-Earnings-Call-Transcript.pdf


Markdown saved to: ../00 Dataset/markdown\facebook/META-Q3-2024-Earnings-Call-Transcript-with-image-refs.md
Processing file: META-Q3-2024-Follow-Up-Call-Transcript.pdf
Processing file: ../00 Dataset/docs\facebook\META-Q3-2024-Follow-Up-Call-Transcript.pdf


Markdown saved to: ../00 Dataset/markdown\facebook/META-Q3-2024-Follow-Up-Call-Transcript-with-image-refs.md
Processing file: goog-10-q-q3-2024.pdf
Processing file: ../00 Dataset/docs\google\goog-10-q-q3-2024.pdf


Markdown saved to: ../00 Dataset/markdown\google/goog-10-q-q3-2024-with-image-refs.md
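
The saved paths above mix `\` and `/` and contain a `./` segment for files in the top-level directory. The sketch below shows one way to compute the target path with `pathlib` so that the separators are uniform and the subdirectory structure is still preserved. The function name `markdown_target` and its `suffix` parameter are hypothetical; the notebook itself builds the path with `os.path.join` and an f-string.

```python
from pathlib import PurePosixPath, PureWindowsPath


def _normalise(path):
    """Parse a path that may use '\\' or '/' and return it with '/' separators."""
    return PurePosixPath(PureWindowsPath(path).as_posix())


def markdown_target(input_file, input_dir, output_dir, suffix="-with-image-refs.md"):
    """Map input_file under input_dir to a Markdown path under output_dir,
    keeping the same subdirectory layout."""
    relative = _normalise(input_file).relative_to(_normalise(input_dir))
    # relative.parent is '.' for top-level files; pathlib drops it when joining
    return _normalise(output_dir) / relative.parent / (relative.stem + suffix)


print(markdown_target(
    r"../00 Dataset/docs\facebook\Downloadable-BS-Q3-24.xlsx",
    "../00 Dataset/docs",
    "../00 Dataset/markdown",
))
```

Pure path classes are used so the mapping can be checked without touching the file system; creating the directory and writing the file would still be done separately.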
