# Colab Edition

<div align="center">

[![logo](https://ipitio.github.io/ocr-pdf/public/wide.webp)](https://github.com/ipitio/ocr-pdf)

<h1><a href="https://github.com/ipitio/ocr-pdf" target="_blank" rel="noopener noreferrer">
    ocr2pdf
</a></h1>

**Convert images and scans to searchable PDFs!**

---

[![downloads](https://img.shields.io/badge/dynamic/json?url=https%3A%2F%2Fipitio.github.io%2Fbackage%2Fipitio%2Focr-pdf%2Focr-pdf.json&query=%24.downloads&logo=github&logoColor=959da5&labelColor=333a41&label=pulls)](https://github.com/ipitio/ocr-pdf/pkgs/container/ocr-pdf) [![build](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml/badge.svg)](https://github.com/ipitio/ocr-pdf/actions/workflows/publish.yml)

</div>

This notebook is meant to be run on [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb). It converts your files and can optionally save them to the `ocr-pdf` folder in your [Drive](https://drive.google.com/drive/my-drive). See the [repository](https://github.com/ipitio/ocr-pdf) for more information.

## Steps

1. Make two new folders, one inside the other
   - The outer one can be named anything, say `pdf`
   - The inner one must be named `todo`
2. Place your files in the `todo` folder
   - Files placed directly in `todo` will just be converted
   - Files inside subfolders of `todo` will also be merged in alphabetical order
3. Share the outer `pdf` folder with this notebook
   - Zip the folder
   - Open this notebook in [Colab](https://colab.research.google.com/github/ipitio/ocr-pdf/blob/master/colab.ipynb)
   - Run the cell below to be prompted to connect Drive and upload the zip
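
The folder layout from the steps above can also be prepared with a short local script instead of by hand. This is a minimal sketch and not part of ocr-pdf: the function name `build_upload_zip` and its parameters are hypothetical, and it assumes the notebook expects the outer folder to be the top-level entry inside the zip.

```python
import shutil
from pathlib import Path


def build_upload_zip(sources, outer="pdf", workdir="."):
    """Copy `sources` into <outer>/todo and zip <outer>; return the zip path.

    Hypothetical helper for illustration only.
    """
    workdir = Path(workdir)
    todo = workdir / outer / "todo"  # the inner folder must be named `todo`
    todo.mkdir(parents=True, exist_ok=True)

    for src in map(Path, sources):
        if src.is_dir():
            # Keep subfolders intact so their files can be merged together
            shutil.copytree(src, todo / src.name, dirs_exist_ok=True)
        else:
            shutil.copy2(src, todo / src.name)

    # base_dir keeps the outer folder as the top-level entry inside the zip
    archive = shutil.make_archive(
        str(workdir / outer), "zip", root_dir=str(workdir), base_dir=outer
    )
    return Path(archive)


# Example with made-up paths: one loose scan and one folder of pages
# build_upload_zip(["receipt.jpg", "book_pages"], outer="pdf")
```

The resulting `pdf.zip` is what you upload when the cell prompts for a file.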

You'll be offered a zip of the converted (and merged) files to download locally, whether or not Drive was connected.
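
To get a rough idea of which outputs to expect before uploading, you can list what sits directly in `todo` and what sits in each subfolder. The sketch below is only an approximation: `preview_outputs` is a hypothetical helper, and it assumes "alphabetical order" matches Python's default string sort, which may differ from what the converter actually does.

```python
from pathlib import Path


def preview_outputs(todo):
    """Return (singles, merges) for a `todo` folder.

    singles: names of files placed directly in `todo`
    merges:  subfolder name -> file names in assumed merge order
    """
    todo = Path(todo)
    singles = sorted(p.name for p in todo.iterdir() if p.is_file())
    merges = {
        d.name: sorted(f.name for f in d.iterdir() if f.is_file())
        for d in sorted(todo.iterdir())
        if d.is_dir()
    }
    return singles, merges
```

Under this assumption `page10.png` sorts before `page2.png`, so zero-padding page numbers (`page02.png`) is a safe way to keep pages in the intended order.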


In [None]:
import os

# Connect to Drive (optional); fall back to download-only if mounting fails
from google.colab import drive, files

try:
    drive.mount("/content/drive", force_remount=True)
    drive_mounted = True
except Exception:
    drive_mounted = False

# Upload your zip
files.upload()

# Find the uploaded zip file(s)
zips = [f for f in os.listdir() if f.endswith(".zip")]
if not zips:
    raise FileNotFoundError("No ZIP file found")

# Install udocker to run the converter image
%pip install udocker
!udocker --allow-root install

for zip_name in zips:
    # The zip is expected to contain a folder of the same name, e.g. pdf.zip -> pdf/
    name = zip_name[: -len(".zip")]
    done = os.path.join(name, "done")

    # Extract, then transform
    !unzip -o "{zip_name}"
    !rm -f "{zip_name}"
    !udocker --allow-root run -v "/content/{name}":/app/pdf ghcr.io/ipitio/ocr-pdf bash predict.sh pdf
    converted = os.listdir(done) if os.path.isdir(done) else []

    # And load: copy the results to Drive if it is connected
    if drive_mounted and converted:
        !mkdir -p "drive/MyDrive/ocr-pdf"
        !\cp -r "{done}/"* "drive/MyDrive/ocr-pdf/"

    # Offer a single file directly, or a zip of everything otherwise
    if len(converted) == 1 and os.path.isfile(os.path.join(done, converted[0])):
        files.download(os.path.join(done, converted[0]))
    elif converted:
        !zip -r "{name}.zip" "{done}"
        files.download(f"{name}.zip")
    else:
        print("No PDFs found")