## Document parsing with OCR tools

Several OCR tools have been developed specifically for converting scientific articles, such as [Nougat](https://github.com/facebookresearch/nougat) and [Marker](https://github.com/VikParuchuri/marker). As an example, we demonstrate the conversion of a PDF to a Markdown file using the [Nougat](https://github.com/facebookresearch/nougat) tool.

In [4]:
import os
import subprocess


def convert_pdf_with_nougat(
    pdf_path, output_dir, model="0.1.0-small", batch_size=1, no_skipping=False
):
    """
    Converts a PDF to Markdown using Nougat.

    :param pdf_path: Path to the PDF to be converted.
    :param output_dir: Output directory for the converted files.
    :param model: Model tag to use (default: 0.1.0-small).
    :param batch_size: Batch size for processing (default: 1).
    :param no_skipping: Flag to disable the failure detection heuristic.
    """
    # Pass the arguments as a list so that paths containing spaces or
    # shell metacharacters reach nougat unchanged.
    cmd = [
        "nougat", str(pdf_path),
        "-o", str(output_dir),
        "-m", model,
        "-b", str(batch_size),
    ]
    if no_skipping:
        cmd.append("--no-skipping")
    # check=True raises CalledProcessError if nougat exits with an error.
    subprocess.run(cmd, check=True)

As an example, one PDF downloaded in the [data mining chapter](../obtaining_data/data_mining.ipynb) was converted into a Markdown file.

```{warning}

Running these tools requires substantial computational power, so they are preferably run on a computing cluster.
```

In [5]:
pdf_dir = "../obtaining_data/PDFs"
output_dir = "./markdown_files"
specific_pdf_file = "10.26434_chemrxiv-2024-1l0sn.pdf"

os.makedirs(output_dir, exist_ok=True)

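# Build the full path and make sure the PDF exists before converting
pdf_path = os.path.join(pdf_dir, specific_pdf_file)
if not os.path.isfile(pdf_path):
    raise FileNotFoundError(f"PDF not found: {pdf_path}")
convert_pdf_with_nougat(pdf_path, output_dir)
print(f"Converted {specific_pdf_file} successfully.")

# Nougat writes the result as <PDF name without extension>.mmd
output_file_name = os.path.splitext(os.path.basename(pdf_path))[0] + ".mmd"
output_file_path = os.path.join(output_dir, output_file_name)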

print(f"Converted file: {output_file_path}")

INFO:root:Skipping 10.26434_chemrxiv-2024-1l0sn.pdf, already computed. Run with --recompute to convert again.


pushing model to MSP
Converted 10.26434_chemrxiv-2024-1l0sn.pdf successfully.
Converted file: ./markdown_files/10.26434_chemrxiv-2024-1l0sn.mmd
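
The log above shows that Nougat skips a PDF whose output already exists. Because conversion is expensive, it can be useful to know in advance which PDFs still lack a `.mmd` file, for example before submitting a cluster job. The following is a minimal sketch; the helper name `pdfs_without_output` is hypothetical, and it assumes that each output file is named after its PDF with the extension replaced by `.mmd`, as in the cell above.

```python
from pathlib import Path


def pdfs_without_output(pdf_dir, output_dir):
    """List PDFs in pdf_dir that have no matching .mmd file in output_dir.

    Hypothetical helper: assumes the output is named <PDF stem>.mmd.
    """
    output_dir = Path(output_dir)
    return sorted(
        pdf
        for pdf in Path(pdf_dir).glob("*.pdf")
        if not (output_dir / f"{pdf.stem}.mmd").exists()
    )
```

Calling `pdfs_without_output(pdf_dir, output_dir)` with the directories defined above would return the paths that still need to be converted.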


In [6]:
# Display the content of the converted .mmd file
with open(output_file_path, "r", encoding="utf-8") as f:
    content = f.read()
    print(content)

[MISSING_PAGE_FAIL:1]

These types of connectivity have also been converted to extended \(\pi\)-systems by oxidative follow-up reactions, allowing a higher level of conjugation and hence strong bathochromic shifts.[8] The installation of heteroatoms has however been a challenge for some time. In 2014, Shinokubo et al. presented linearly connect monomers through an azo-bridge at the \(\beta\)-position (Figure 1A (d)).[10] Linear connectivity at the \(\alpha\)-position using heteroatoms such as sulfur has been achieved through a similarly iterative process by the groups of Hao and Jiao (Figure 1A (e)).[7] Furthermore, cyclic amine-linked oligo-BODIPYs have already been synthesized in a one-pot reaction in 2022 by Song et al., utilizing Buchwald-Hartwig conditions (Figure 1A (f)).[10]

We present a novel type of BODIPY oligomers, connected via _N_-bridges in a linear fashion (Figure 1B). Utilizing both symmetric and unsymmetric BODIPY monomers as building blocks has paved the way to selec

```{important}

It is crucial to review the quality and accuracy of the conversion afterward, at least for a sample of the output. If the OCR tool cannot convert the relevant parts correctly, a different method should be considered.
```

The obtained Markdown file contains some errors, ranging from incorrectly converted chemical names to the complete omission of the table. A [Vision model](../beyond_text/beyond_images.ipynb) or an [Agentic approach](reference to agent section) could therefore be used to minimize those errors.
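
Part of this review can be automated. The converted file shown above starts with the marker `[MISSING_PAGE_FAIL:1]`, which indicates that a page could not be converted. The sketch below searches the text of a `.mmd` file for such markers. The function name `find_missing_pages` is hypothetical, and the regular expression assumes that other marker kinds follow the same `[MISSING_PAGE_<KIND>:<page>]` shape as the one seen in the output.

```python
import re

# "[MISSING_PAGE_FAIL:1]" appears in the converted file above. Treating
# other kinds as "[MISSING_PAGE_<KIND>]" with an optional ":<page>" part
# is an assumption about the marker format.
MISSING_PAGE = re.compile(r"\[MISSING_PAGE_([A-Z]+)(?::(\d+))?\]")


def find_missing_pages(mmd_text):
    """Return a (kind, page_number) tuple for every missing-page marker."""
    hits = []
    for match in MISSING_PAGE.finditer(mmd_text):
        kind, page = match.group(1), match.group(2)
        hits.append((kind, int(page) if page is not None else None))
    return hits


sample = "[MISSING_PAGE_FAIL:1]\n\nThese types of connectivity have also been converted"
print(find_missing_pages(sample))  # [('FAIL', 1)]
```

Applied to the `content` string read in the previous cell, a non-empty result identifies pages that need to be reconverted or processed with a different method. Such a check does not detect errors inside pages that were converted, such as incorrect chemical names, so a manual review remains necessary.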

Afterward, the resulting files should be [cleaned](./cleaning.ipynb).