docr/smd stands for "Distributed OCR over a Swarm of Mobile Devices." Its purpose is to distribute the work of performing Optical Character Recognition (OCR) across a large number of clients. It does this by providing a server application (called a 'page server') that answers a client's request for a page image with a JPEG, TIFF, or other image file. The page server provides the client with the image; the client performs OCR on the image and sends the resulting text transcript back to the server.
The power of docr/smd is that it can respond to requests from many clients in a short period of time, in effect parallelizing the OCR task. The technical environment of docr/smd clients is independent of the page server's. Currently, docr/smd provides a Python client, but the 'm' in 'smd' stands for 'mobile': Android and iOS applications are possible because the open source Tesseract OCR engine (https://code.google.com/p/tesseract-ocr/) has been ported to both platforms. These mobile clients are currently in the planning stages. It is likely the Android client will be completed first.
The docr/smd server maintains a queue of page images that need to be OCRed. Clients periodically query the page server's REST interface for page images, perform the OCR, and PUT the resulting transcript back to the server. The server is a simple PHP application that uses SQLite to maintain its page queue. Client requests and the server's responses make heavy use of HTTP headers to supplement the REST API.
The page queue is loaded and purged by a PHP script called the queue manager, which can be run as a command-line script or as a cron job. It provides options to load the queue, list items in the queue, and purge the queue.
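As a rough illustration of the queue described in the two paragraphs above, here is a minimal Python sketch of a SQLite-backed page queue with load, list, purge, check-out, and transcript-recording operations. The actual server and queue manager are PHP; the table name `pages`, its columns, and every function name below are assumptions made for this example, not the real schema.

```python
import sqlite3


def create_queue(conn):
    # Hypothetical schema: one row per page image awaiting OCR.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               id INTEGER PRIMARY KEY AUTOINCREMENT,
               image_path TEXT UNIQUE NOT NULL,
               checked_out INTEGER NOT NULL DEFAULT 0,
               transcript_path TEXT
           )"""
    )


def load_queue(conn, image_paths):
    # Add image paths to the queue, skipping any that are already present.
    with conn:
        conn.executemany(
            "INSERT OR IGNORE INTO pages (image_path) VALUES (?)",
            [(path,) for path in image_paths],
        )


def list_queue(conn):
    # Return every queue entry in the order it was loaded.
    return conn.execute(
        "SELECT image_path, checked_out, transcript_path FROM pages ORDER BY id"
    ).fetchall()


def purge_queue(conn):
    # Remove every entry from the queue.
    with conn:
        conn.execute("DELETE FROM pages")


def check_out_next(conn):
    # Flag the oldest unprocessed image as checked out and return its path.
    # A real server would also need to stop two clients from checking out
    # the same page at the same moment; this sketch does not handle that.
    row = conn.execute(
        "SELECT image_path FROM pages WHERE checked_out = 0 ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    with conn:
        conn.execute(
            "UPDATE pages SET checked_out = 1 WHERE image_path = ?", (row[0],)
        )
    return row[0]


def record_transcript(conn, image_path, transcript_path):
    # Store where the transcript was saved, keyed by the image's path.
    with conn:
        conn.execute(
            "UPDATE pages SET transcript_path = ? WHERE image_path = ?",
            (transcript_path, image_path),
        )
```

Keying updates on the image path mirrors the way the client echoes that path back to the server with its transcript.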
docr/smd has a 'peer' mode that redirects client requests to other page servers if the local server doesn't have any images to process. This ability can vastly increase the number of potential clients available to a given server, and also reduces the likelihood that clients remain idle.
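One way to picture peer mode is as a decision the server makes on each GET request. The sketch below is illustrative only: it assumes the redirect is an ordinary HTTP redirect carrying a peer's URL in a `Location` header, that peers are tried in the order they are listed, and that the image path travels in a header named `X-Docr-Image-Path`. None of these details is specified in this document.

```python
def respond_to_get(next_image_path, peer_urls):
    """Decide how to answer a client GET.

    next_image_path: path of the next queued image, or None if the
        local queue is empty.
    peer_urls: URLs of peer page servers (hypothetical configuration).
    Returns a (status_code, headers) pair.
    """
    if next_image_path is not None:
        # Local work is available: serve it (header name is an assumption).
        return 200, {"X-Docr-Image-Path": next_image_path}
    if peer_urls:
        # Nothing to do locally: send the client to the first peer.
        return 302, {"Location": peer_urls[0]}
    # No local work and no peers configured.
    return 204, {}
```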
The details of the client/server interaction are as follows:
- The docr/smd client issues a GET request to the server.
- The server 'checks out' the next image in its queue (that is, flags it as currently being processed) and returns it to the client.
- The server also sends the image's filesystem path to the client, which is used later as a key to update the docr page queue.
- The client performs OCR on the image and sends the transcript back to the page server via a PUT; it also sends a header containing the original image's filesystem path.
- The server saves the OCR transcript to disk, updates the queue database with the location of the transcript, and, optionally, deletes the original image file.
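
The steps above can be sketched from the client's side as follows. This is not the bundled Python client: the server URL, the `X-Docr-Image-Path` header name, and the function names are all hypothetical, and the sketch uses the standard library's `urllib` to stay self-contained, whereas the real client relies on the requests library. The `run_ocr` callable stands in for a call to Tesseract.

```python
import urllib.request

# Hypothetical values; the real URL and header name come from the server's docs.
PAGE_SERVER_URL = "http://pageserver.example.org/docr/page"
PATH_HEADER = "X-Docr-Image-Path"


def get_page():
    """GET the next image. Returns (image_bytes, image_path)."""
    request = urllib.request.Request(PAGE_SERVER_URL, method="GET")
    with urllib.request.urlopen(request) as response:
        return response.read(), response.headers.get(PATH_HEADER)


def build_put_request(transcript, image_path):
    """Build the PUT that returns a transcript, echoing the image path back."""
    return urllib.request.Request(
        PAGE_SERVER_URL,
        data=transcript.encode("utf-8"),
        method="PUT",
        headers={
            PATH_HEADER: image_path,
            "Content-Type": "text/plain; charset=utf-8",
        },
    )


def put_transcript(transcript, image_path):
    """Send the transcript to the page server and return the HTTP status."""
    request = build_put_request(transcript, image_path)
    with urllib.request.urlopen(request) as response:
        return response.status


def process_one_page(run_ocr, fetch=get_page, send=put_transcript):
    """Run one fetch, OCR, and send cycle. Returns False if no page was available."""
    image_bytes, image_path = fetch()
    if not image_bytes or image_path is None:
        return False
    send(run_ocr(image_bytes), image_path)
    return True
```

Passing `fetch` and `send` as parameters keeps the cycle easy to exercise without a running page server.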
The docr/smd server is easy to install and configure. The only requirements on the server are PHP 5.3 and SQLite (make sure the SQLite driver is enabled in PHP). Also, Apache will need to have mod_rewrite enabled and 'AllowOverride All' configured for the directory that the docr/smd server will be running in. Local settings such as paths to page image and transcript directories, access control via IP whitelisting and client API tokens, and URLs of peer servers are configured in a single file, config.php.
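
To make the access-control idea concrete, here is a small Python sketch of an IP whitelist combined with an API token check. The real checks live in the PHP server and are driven by config.php; the variable names, the example addresses, and the choice to require both a whitelisted address and a valid token are assumptions made for illustration.

```python
import hmac
import ipaddress

# Hypothetical settings standing in for values that would live in config.php.
ALLOWED_NETWORKS = ["192.0.2.0/24", "127.0.0.1/32"]
ALLOWED_TOKENS = ["example-token"]


def client_is_allowed(client_ip, token):
    """Return True if the client's address is whitelisted and its token is known."""
    address = ipaddress.ip_address(client_ip)
    ip_ok = any(
        address in ipaddress.ip_network(network) for network in ALLOWED_NETWORKS
    )
    # compare_digest avoids leaking how much of a token matched via timing.
    token_ok = any(hmac.compare_digest(token, known) for known in ALLOWED_TOKENS)
    return ip_ok and token_ok
```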
Details on installing and configuring the page server are provided in the README.md file in the 'server' directory. The Python client requires that the Tesseract OCR engine (https://code.google.com/p/tesseract-ocr/) and the Python requests (http://www.python-requests.org/) library be installed.
Details on deploying the queue manager are provided in the server/README.md file.
File permissions, especially on the SQLite database, are the most common problem you will encounter. The database file must be writable by Apache's user and also by the user running the queue manager script. Common symptoms of bad permissions on the database file include:
- the Python client will fail with the error "Sorry, the docr client has experienced an I/O error(None): None."
- output from the OCR engine shows up in the $config['transcript_base_dir'] directory, but the paths to the files are not listed when you issue a 'queue_manager.php list' command.
Also, make sure that the web server's user can write to the directory defined in $config['transcript_base_dir'].
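
A quick way to test for the permission problems described above is to run a small check as each relevant user. The Python sketch below reports which paths are missing or not writable by whoever runs it; the function name and any paths you pass in are placeholders, not part of docr/smd. Note that SQLite generally also needs write access to the directory holding the database file, since it creates temporary journal files there, so that directory is worth including in the check.

```python
import os


def check_writable(paths):
    """Return a list of problem descriptions; an empty list means all paths look writable."""
    problems = []
    for path in paths:
        if not os.path.exists(path):
            problems.append(f"{path}: does not exist")
        elif not os.access(path, os.W_OK):
            problems.append(f"{path}: not writable by the current user")
    return problems
```

Call it with the database file, the directory containing it, and the transcript directory, once as Apache's user and once as the user who runs the queue manager, and compare the results.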
Some specific problems and possible fixes:
- Symptom: "could not find driver" appears 1) in Apache's error log or 2) on the command line when you run the queue manager. Possible cause: PHP doesn't have the pdo_sqlite extension installed and enabled.