jallendev/CS4273OCR

OCR Capstone Project

Setup

In theory, running WindowsSetup.bat sets everything up on Windows.
For Linux or Mac, WindowsSetup.bat lists all the dependencies that need to be installed.
This program has only been run successfully on Linux, because the pdf2image dependency does not work on Windows.

To Run

Call python3 extractText.py
Once the GUI is up, select the template file, the output CSV file (which will be overwritten), and the PDF file(s) to be scanned, then select run.

Template File

The template file consists of fields separated by the | character.
The first field is the column header.
After that there are three options:

  1. There are no other fields if nothing is to be inserted for that column
  2. There is a single second field containing arbitrary text if the column is always to be filled with that same text
  3. There are four additional fields that specify the bounding box from which the code will extract text in the PDFs:
    • The first additional field is the top left x coordinate of the bounding box
    • The second additional field is the top left y coordinate of the bounding box
    • The third additional field is the bottom right x coordinate of the bounding box
    • The fourth additional field is the bottom right y coordinate of the bounding box
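
The three options above can be sketched as a small parser. This is only an illustration and not the code in extractText.py: the names TemplateColumn and parse_template_line are hypothetical, and it assumes that each column sits on its own line and that the coordinates are whole numbers.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class TemplateColumn:
    """One column of the output CSV (hypothetical helper type)."""
    header: str
    constant: Optional[str] = None                     # option 2: fixed text
    box: Optional[Tuple[int, int, int, int]] = None    # option 3: bounding box


def parse_template_line(line: str) -> TemplateColumn:
    """Parse one |-separated template line (assumed: one column per line)."""
    fields = line.rstrip("\n").split("|")
    header, extra = fields[0], fields[1:]

    if len(extra) == 0:
        # Option 1: header only, nothing is inserted for this column.
        return TemplateColumn(header)
    if len(extra) == 1:
        # Option 2: the column is always filled with the same text.
        return TemplateColumn(header, constant=extra[0])
    if len(extra) == 4:
        # Option 3: top-left x, top-left y, bottom-right x, bottom-right y.
        # Integer coordinates are an assumption of this sketch.
        x1, y1, x2, y2 = (int(value) for value in extra)
        return TemplateColumn(header, box=(x1, y1, x2, y2))

    raise ValueError(f"expected 0, 1 or 4 fields after the header, got {len(extra)}")


if __name__ == "__main__":
    # Made-up template lines, one per option.
    for sample in ["Notes", "Source|scanned", "Total|100|200|300|250"]:
        print(parse_template_line(sample))
```

The box is kept in the same order as the template (top left x, top left y, bottom right x, bottom right y), which is also the (left, upper, right, lower) order that Pillow's Image.crop expects, should the page images be handled with Pillow.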

Future Work

Needs to be able to run on Windows (possibly by replacing the pdf2image library with a different one)
Needs a GUI for creating templates
Needs to be able to detect and correct skew in the PDFs

About

A university machine learning OCR project.
