
# **Nougat**

**Nougat is an encoder-decoder transformer model that parses PDFs to extract text, LaTeX math and tables. It is built on the Donut (Document Understanding Transformer) architecture. The model uses a visual encoder that crops each page image to a specified size and outputs a sequence of embedded patches. A transformer decoder then decodes the encoded image into a sequence of tokens. The team at Meta trained Nougat on over a million articles from arXiv, PubMed Central and the Industry Documents Library.**
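
To make the "sequence of embedded patches" idea concrete, here is a minimal pure-Python sketch that cuts an image into non-overlapping patches and flattens each one. The `patchify` helper, the 4x4 toy image and the patch size of 2 are illustrative assumptions, not Nougat's encoder code or settings. A real encoder would also project each flattened patch through learned weights.

```python
def patchify(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Hypothetical helper for illustration only; it assumes the image height
    and width are exact multiples of patch_size.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")

    patches = []
    # Walk the image in row-major order, one patch at a time.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 "image" whose pixel values are simply 0..15.
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]
patch_sequence = patchify(toy_image, patch_size=2)
print(len(patch_sequence), "patches of length", len(patch_sequence[0]))
```

The resulting list plays the role of the patch sequence that the encoder passes on, in order, to the decoder.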

**Nougat writes the content it extracts from each PDF to a MultiMarkdown (`.mmd`) file.**

**Nougat can be trained or fine-tuned on a specified data set.**

<sup>Source: [Nougat: Neural Optical Understanding for Academic Documents](https://github.com/facebookresearch/nougat) GitHub Repository</sup>

<sup>Source: [Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) Paper on arXiv</sup>

In [None]:
from IPython import display
import os

In [None]:
!pip install git+https://github.com/facebookresearch/nougat
display.clear_output()

In [None]:
!nougat -h

usage: nougat [-h] [--batchsize BATCHSIZE] [--checkpoint CHECKPOINT]
              [--out OUT] [--recompute] [--markdown]
              pdf [pdf ...]

positional arguments:
  pdf                   PDF(s) to process.

options:
  -h, --help            show this help message and exit
  --batchsize BATCHSIZE, -b BATCHSIZE
                        Batch size to use.
  --checkpoint CHECKPOINT, -c CHECKPOINT
                        Path to checkpoint directory.
  --out OUT, -o OUT     Output directory.
  --recompute           Recompute already computed PDF, discarding previous
                        predictions.
  --markdown            Add postprocessing step for markdown compatibility.
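
The flags in the help output above can also be assembled from Python, which avoids quoting mistakes when paths contain spaces. This is a minimal sketch: `build_nougat_command` is a hypothetical helper, `example.pdf` and `output` are placeholder names, and only the options listed in the help text are used.

```python
import shlex


def build_nougat_command(pdfs, out=None, batchsize=None, checkpoint=None,
                         recompute=False, markdown=False):
    """Assemble a nougat CLI call as an argument list.

    Hypothetical helper: it only emits flags shown in `nougat -h`.
    """
    command = ["nougat"]
    if batchsize is not None:
        command += ["--batchsize", str(batchsize)]
    if checkpoint is not None:
        command += ["--checkpoint", str(checkpoint)]
    if out is not None:
        command += ["--out", str(out)]
    if recompute:
        command.append("--recompute")
    if markdown:
        command.append("--markdown")
    # PDF paths are positional arguments, so they go last.
    command += [str(pdf) for pdf in pdfs]
    return command


command = build_nougat_command(["example.pdf"], out="output", markdown=True)
print(shlex.join(command))
```

The returned list can be handed to `subprocess.run(command, check=True)` instead of building a shell string by hand.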


## **Converting a Native PDF File**

In [None]:
!curl -o quantum_physics.pdf https://www.sydney.edu.au/science/chemistry/~mjtj/CHEM3117/Resources/postulates.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 69838  100 69838    0     0  76773      0 --:--:-- --:--:-- --:--:-- 76745


<sup>Source: [The Postulates of Quantum Mechanics](https://www.sydney.edu.au/science/chemistry/~mjtj/CHEM3117/Resources/postulates.pdf) from the University of Sydney</sup>

In [None]:
!nougat --markdown '/content/quantum_physics.pdf' --out 'physics'

downloading nougat checkpoint version 0.1.0-small to path /root/.cache/torch/hub/nougat
config.json: 100% 557/557 [00:00<00:00, 2.17Mb/s]
pytorch_model.bin: 100% 956M/956M [00:20<00:00, 48.1Mb/s]
special_tokens_map.json: 100% 96.0/96.0 [00:00<00:00, 536kb/s]
tokenizer.json: 100% 2.04M/2.04M [00:00<00:00, 5.53Mb/s]
tokenizer_config.json: 100% 106/106 [00:00<00:00, 516kb/s]
INFO:root:Output directory does not exist. Creating output directory.
  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
  0% 0/1 [00:00<?, ?it/s][nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.
INFO:root:Processing file /content/quantum_physics.pdf with 2 pages
100% 1/1 [00:17<00:00, 17.61s/it]


In [None]:
display.Latex(filename='/content/physics/quantum_physics.mmd')
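
Before rendering, it can help to look at the raw `.mmd` text to see exactly what Nougat produced. The sketch below prints the first lines of the output file. `preview_text_file` is a hypothetical helper, and the path is the one produced by the cell above.

```python
from pathlib import Path


def preview_text_file(path, max_lines=20):
    """Return the first max_lines lines of a text file as one string."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return "\n".join(lines[:max_lines])


mmd_path = Path("/content/physics/quantum_physics.mmd")
if mmd_path.exists():
    print(preview_text_file(mmd_path))
else:
    # The file only exists after the nougat cell has been run.
    print(f"{mmd_path} not found; run the nougat cell first")
```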

## **Converting a Scanned PDF**

In [None]:
!curl -o fundamental_quantum_equations.pdf https://www.informationphilosopher.com/solutions/scientists/dirac/Fund_QM_1925.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 1196k  100 1196k    0     0   585k      0  0:00:02  0:00:02 --:--:--  586k


<sup>Source: [The Fundamental Equations of Quantum Mechanics](https://www.informationphilosopher.com/solutions/scientists/dirac/Fund_QM_1925.pdf) by Paul Dirac from the Proceedings of the Royal Society of London</sup>

In [None]:
!nougat --markdown '/content/fundamental_quantum_equations.pdf' --out 'physics'

  return _VF.meshgrid(tensors, **kwargs)  # type: ignore[attr-defined]
INFO: likely hallucinated title at the end of the page: ## 2 x 2
INFO:root:Processing file /content/fundamental_quantum_equations.pdf with 13 pages
100% 4/4 [00:57<00:00, 14.46s/it]


In [None]:
display.Latex(filename='/content/physics/fundamental_quantum_equations.mmd')
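
Scanned documents are harder for the model, so it is worth checking whether any pages were skipped. To my understanding, Nougat inserts placeholder markers such as `[MISSING_PAGE_FAIL:3]` or `[MISSING_PAGE_EMPTY:7]` in the output when it gives up on a page. Treat that marker format as an assumption and verify it against your own `.mmd` files. The sketch below scans text for such markers; `find_missing_pages` is a hypothetical helper and the sample string is made up.

```python
import re

# Assumed marker format, e.g. "[MISSING_PAGE_FAIL:3]"; verify against real output.
MISSING_PAGE_PATTERN = re.compile(r"\[MISSING_PAGE_([A-Z]+):(\d+)\]")


def find_missing_pages(mmd_text):
    """Return (reason, page_number) pairs for each marker found in the text."""
    return [
        (reason, int(page))
        for reason, page in MISSING_PAGE_PATTERN.findall(mmd_text)
    ]


# Made-up sample text standing in for the contents of a .mmd file.
sample = "Intro text\n\n[MISSING_PAGE_FAIL:3]\n\nMore text\n\n[MISSING_PAGE_EMPTY:7]\n"
print(find_missing_pages(sample))
```

An empty list means no markers were found, which does not by itself guarantee the transcription is accurate.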

## **Batch Processing PDFs**

In [None]:
!mkdir -p pdfs
!curl -o pdfs/lec_1.pdf https://ocw.mit.edu/courses/8-04-quantum-physics-i-spring-2016/7f930e013cef9cd7dec5aa88baa83f0a_MIT8_04S16_LecNotes1.pdf -o pdfs/lec_2.pdf https://ocw.mit.edu/courses/8-04-quantum-physics-i-spring-2016/afaef4b8271759d352ac75c4e85eaee6_MIT8_04S16_LecNotes2.pdf
!curl -o pdfs/lec_3.pdf https://ocw.mit.edu/courses/8-04-quantum-physics-i-spring-2016/f928b8dce3d6a218fddda9617c5eb4f2_MIT8_04S16_LecNotes3.pdf  -o pdfs/lec_4.pdf https://ocw.mit.edu/courses/8-04-quantum-physics-i-spring-2016/0c07cbdc9c352c39eb9539b31ded90d7_MIT8_04S16_LecNotes4.pdf

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  303k  100  303k    0     0  1153k      0 --:--:-- --:--:-- --:--:-- 1154k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  509k  100  509k    0     0   264k      0  0:00:01  0:00:01 --:--:--  264k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  284k  100  284k    0     0   295k      0 --:--:-- --:--:-- --:--:--  294k
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  265k  100  265k    0     0   160k      0  0:00

<sup>Source: [Quantum Physics I](https://ocw.mit.edu/courses/8-04-quantum-physics-i-spring-2016/) from MIT OpenCourseWare</sup>

In [None]:
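nougat_cmd = "nougat --markdown --out 'batch_directory'"
pdf_path = '/content/pdfs'

# Run Nougat once per PDF in the folder; outputs are written to 'batch_directory'.
for pdf in sorted(os.listdir(pdf_path)):
  if pdf.endswith('.pdf'):
    os.system(f"{nougat_cmd} '{os.path.join(pdf_path, pdf)}'")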

In [None]:
display.Markdown(filename='/content/batch_directory/lec_1.mmd')
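
After a batch run it is useful to confirm that every PDF produced an output file. In this notebook each `name.pdf` ends up as `name.mmd` in the output directory, so a minimal sketch can compare the two folders. `pending_pdfs` is a hypothetical helper, and the directory names are the ones used in the cells above.

```python
from pathlib import Path


def pending_pdfs(pdf_dir, out_dir):
    """List PDFs in pdf_dir that have no matching .mmd file in out_dir."""
    pdf_dir, out_dir = Path(pdf_dir), Path(out_dir)
    if not pdf_dir.is_dir():
        return []
    return sorted(
        pdf for pdf in pdf_dir.glob("*.pdf")
        if not (out_dir / f"{pdf.stem}.mmd").exists()
    )


for pdf in pending_pdfs("/content/pdfs", "/content/batch_directory"):
    print("no output yet for", pdf.name)
```

Any file listed here can be passed to `nougat` again. For files that already have an output, the `--recompute` flag from the help text discards the previous predictions.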

# **References and Additional Learning**

## **Documentation**

- **[Nougat: Neural Optical Understanding for Academic Documents](https://github.com/facebookresearch/nougat) GitHub Repository**
- **[Nougat: Neural Optical Understanding for Academic Documents](https://arxiv.org/abs/2308.13418) Paper on arXiv**

## **Lectures**

- **[Quantum Physics I](https://ocw.mit.edu/courses/8-04-quantum-physics-i-spring-2016/) from MIT OpenCourseWare**

## **Papers**

- **[The Fundamental Equations of Quantum Mechanics](https://www.informationphilosopher.com/solutions/scientists/dirac/Fund_QM_1925.pdf) by Paul Dirac from the Proceedings of the Royal Society of London**
- **[The Postulates of Quantum Mechanics](https://www.sydney.edu.au/science/chemistry/~mjtj/CHEM3117/Resources/postulates.pdf) from the University of Sydney**

# **Connect**
- **Feel free to connect with Adrian on [YouTube](https://www.youtube.com/channel/UCPuDxI3xb_ryUUMfkm0jsRA), [LinkedIn](https://www.linkedin.com/in/adrian-dolinay-frm-96a289106/), [Twitter](https://twitter.com/DolinayG), [GitHub](https://github.com/ad17171717), [Medium](https://adriandolinay.medium.com/) and [Odysee](https://odysee.com/@adriandolinay:0). Happy coding!**