A Python wrapper for Tesseract-OCR
Switch branches/tags
Nothing to show
Clone or download
Pull request Compare This branch is 4 commits ahead, 39 commits behind chris-piekarski:master.
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Failed to load latest commit information.


This appeared to be an abandoned project, so I am grabbing all of the source and compliling it and reworking it for personal use. If anyone would like to contribute, or would like to use this, please do.

Current Status:

Code from forks on GitHub and Google Code have been combined. Thinking about adding the Python "Tesseract API" library also, but it appears to be useless at the current time. 

First order of business is to clean up and make only the code needed. I currently am going through the source to rebuild the library, and I think I will become the new maintainer for the project. A majority of the work is done thanks to my previous coders. I still don't know who the original code started out with, as I have seen the same code replicated many times with different people in charge.

It is also needed to come up with a standard. Through all of the different codes I looked at, they all were near identical except some of the calls to the functions.

Python-tesseract is an optical character recognition (OCR) tool for python.
That is, it will recognize and "read" the text embedded in images.

Python-tesseract is a wrapper for google's Tesseract-OCR
( http://code.google.com/p/tesseract-ocr/ ).  It is also useful as a
stand-alone invocation script to tesseract, as it can read all image types
supported by the Python Imaging Library, including jpeg, png, gif, bmp, tiff,
and others, wheras tesseract-ocr by default only supports tiff and bmp.
Additionally, if used as a script, Python-tesseract will print the recognized
text in stead of writing it to a file. Support for confidence estimates and
bounding box data is planned for future releases.

From the shell:
 $ ./tesseract.py test.png                  # prints recognized text in image
 $ ./tesseract.py -l fra test-european.jpg  # recognizes french text
In python:
 > from tesseract import image2String
 > print image2String('test.png')
 > print image2String('test-european.jpg', lang='fra')

* Python-tesseract requires python 2.5 or later.
* You will need the Python Imaging Library (PIL).  Under Debian/Ubuntu, this is
  the package "python-imaging".
* Install google tesseract-ocr from http://code.google.com/p/tesseract-ocr/ .
  You must be able to invoke the tesseract command as "tesseract". If this
  isn't the case, for example because tesseract isn't in your PATH, you will
  have to change the "tesseract_cmd" variable at the top of 'tesseract.py'.

Python-tesseract is released under the GPL v3.
Copyright (c) Samuel Hoffstaetter, 2009 #who was this man? 
http://wiki.github.com/hoffstaetter/python-tesseract #no longer functional?