# Colab Edition

<div align="center">

[![logo](https://ipitio.github.io/ocr-pdf/public/wide.webp)](https://github.com/ipitio/ocr-pdf)

<h1><a href="https://github.com/ipitio/ocr-pdf" target="_blank" rel="noopener noreferrer">
    ocr2pdf
</a></h1>

**OCRmyPDF and Merge it**

---

[![downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fipitio.github.io%2Fbackage%2Fipitio%2Focr-pdf%2Focr-pdf.json&query=%24.downloads&logo=github&logoColor=959da5&labelColor=333a41&label=pulls)](https://github.com/ipitio/ocr-pdf/pkgs/container/ocr-pdf) [![build](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml/badge.svg)](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml)

</div>

This notebook is meant to be run on [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb). It converts your files and can optionally save them to the `ocr-pdf` folder in your [Drive](https://drive.google.com/drive/my-drive). See the repository linked above for more information.

## Note

- To merge files, organize them into folders and zip each one
  - The files in each zip will be merged in alphabetical order
- If you'd like to add any options for [OCRmyPDF](https://ocrmypdf.readthedocs.io/en/latest), append them to line 23 in the cell below
- The upload button will appear below the cell after running it
- At the end, you'll be offered the converted (and merged) files as a local download, whether or not Drive was connected
  - A single converted file is offered as-is; several are bundled into a zip
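
If you'd rather prepare the zips with a script than by hand, the sketch below zips every subfolder of a local directory and previews a plain alphabetical ordering of a folder's files. It is only an illustration: the function names and the `to_upload` path are hypothetical, keeping the folder as the top-level entry of each zip is an assumption about what the notebook expects, and the exact sort the converter applies (case sensitivity, numeric handling) may differ from Python's `sorted`.

```python
import shutil
from pathlib import Path


def zip_each_folder(parent: Path) -> list[Path]:
    """Create <folder>.zip next to each immediate subfolder of `parent`."""
    archives = []
    for folder in sorted(p for p in parent.iterdir() if p.is_dir()):
        # root_dir/base_dir keep the folder itself as the top-level entry
        # of the archive (assumption: this mirrors zipping the folder by hand).
        archive = shutil.make_archive(
            str(parent / folder.name), "zip",
            root_dir=parent, base_dir=folder.name,
        )
        archives.append(Path(archive))
    return archives


def preview_order(folder: Path) -> list[str]:
    """List the file names in `folder` in plain alphabetical order."""
    return sorted(p.name for p in folder.iterdir() if p.is_file())


# Hypothetical usage: zip_each_folder(Path("to_upload"))
```

In a plain string sort, `10.pdf` comes before `2.pdf`, so zero-padding names (`02.pdf`, `10.pdf`) is a safe way to get the merge order you intend.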


## Steps

1. Run the cell below. You'll be prompted to connect Drive, then to upload your files and/or zipped folders


In [None]:
import os

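# Connect to Drive (the import only succeeds on Colab)
try:
    from google.colab import files, drive
    drive.mount("/content/drive", force_remount=True)
    drive = True  # from here on, `drive` is a flag: True if Drive is mounted
except Exception:
    drive = False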

# Upload your files, then extract any zips into pdf/todo
files.upload()
![ -d pdf ] || mkdir pdf
![ -d pdf/todo ] || mkdir pdf/todo
![ -d pdf/done ] || mkdir pdf/done
!unzip -o "*.zip" -d pdf/todo 2>/dev/null
!rm -f *.zip
!mv *.* pdf/todo 2>/dev/null

# Transform them
%pip install udocker
!udocker --allow-root install
!udocker --allow-root run -v /content/pdf:/app/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf
converted = os.listdir("pdf/done")

# Copy the results to Drive (if connected), then offer them as a download
if drive and len(converted) > 0:
    ![ -d "drive/MyDrive/ocr-pdf" ] || mkdir "drive/MyDrive/ocr-pdf"
    !\cp -r "pdf/done/"* "drive/MyDrive/ocr-pdf/"

if len(converted) == 1 and os.path.isfile("pdf/done/" + converted[0]):
    files.download("pdf/done/" + converted[0])
elif len(converted) > 0:
    !zip -r "pdf.zip" "pdf/done"
    files.download("pdf.zip")
else:
    print("No PDFs found")