Looks great. However, I don't have time to review it properly right now. I'll review it on Sunday (I can't do it before then, sorry).
No worries, it took me a month to write, it can wait a few days :-). Can you please run the unit tests from the master branch and this one so we can compare how long it takes for the tests to finish? I meant to do it, but forgot.
Unfortunately, it seems the tests on your branch remain stuck. Here is the stacktrace I get when I do Ctrl-C:
worker = threading.Thread(target=run_cuneiform, args=(cmd, img_data))
worker.start()

file_desc = codecs.open(output_file, 'r', encoding='utf-8', errors='replace')
PEP8: There shouldn't be 2 spaces after '='
Also: PEP8: Lines shouldn't be longer than 79 characters. (same problem in tesseract.py)
From what I can tell, your changes regarding Cuneiform seem to work fine. However, those regarding Tesseract don't work. So here is what I suggest you do:
And everything should be good. Other than that, nice work :) By the way, sorry if some of my comments (or their number) may seem rude. I'm just trying to keep the code as good as possible.
I did not realize you were adhering to a coding standard. I'll run it through the formatting tool.
OK, I made the necessary PEP8 changes and checked them in. I ran the tests and they are all passing. Could you please update from the repository and run the code as is?
It's much better regarding PEP8.
What version of Tesseract do you have? What OS? I'll set up a VM and test it.
I'm using Fedora 20 and Tesseract 3.02.02 (the one provided by Fedora)
Memory pipes are FIFO structures that, for the most part, behave like normal files but are much faster, because the data is kept in memory rather than on disk.
The two main differences are that pipes do not support seek operations, and that opening or reading one blocks the calling thread until the other end is ready.
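To illustrate the seek limitation, here is a minimal sketch, assuming a POSIX system. It uses an anonymous pipe from `os.pipe()` rather than a named one to keep things short, and the helper name `pipe_supports_seek` is made up for this example.

```python
import os


def pipe_supports_seek():
    """Try to seek on the read end of a pipe and report whether it worked."""
    read_fd, write_fd = os.pipe()
    try:
        # On POSIX systems this is expected to fail with ESPIPE
        # ("Illegal seek"), which Python raises as an OSError.
        os.lseek(read_fd, 0, os.SEEK_SET)
        return True
    except OSError:
        return False
    finally:
        os.close(read_fd)
        os.close(write_fd)
```

Any library call that needs to rewind or jump around in its output file would hit the same error when handed a pipe.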
To get around the blocking issue, after the two pipes are created, I start a thread that then starts the OCR engine process. This worker thread will stay blocked waiting for input until the main thread opens the pipes and writes the data.
After the main thread writes the input data, it waits for the OCR engine's thread to terminate before reading the pipe. That's done to ensure that all the OCR output has been written to the pipe.
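The flow described above could be sketched roughly as follows. This is not the code from the branch: it assumes Python 3 on a POSIX system, `CHILD_SCRIPT` is a hypothetical stand-in for the OCR engine (it just upper-cases its input), and `run_engine` and `process_through_fifos` are illustrative names. Opening the output pipe with `O_NONBLOCK` before joining the worker is a choice made for this sketch so that the engine is never left waiting for a reader.

```python
import os
import subprocess
import sys
import tempfile
import threading

# Hypothetical stand-in for the OCR engine: read the input FIFO,
# write an upper-cased copy to the output FIFO.
CHILD_SCRIPT = """
import sys
with open(sys.argv[1], 'rb') as src:
    data = src.read()
with open(sys.argv[2], 'wb') as dst:
    dst.write(data.upper())
"""


def run_engine(cmd):
    # Runs in the worker thread; blocks until the engine process exits.
    subprocess.call(cmd)


def process_through_fifos(data):
    workdir = tempfile.mkdtemp()
    in_path = os.path.join(workdir, 'input.fifo')
    out_path = os.path.join(workdir, 'output.fifo')
    os.mkfifo(in_path)
    os.mkfifo(out_path)
    try:
        cmd = [sys.executable, '-c', CHILD_SCRIPT, in_path, out_path]
        worker = threading.Thread(target=run_engine, args=(cmd,))
        worker.start()
        # A non-blocking open of the read end returns immediately, so the
        # engine can later open the write end without waiting for us.
        out_fd = os.open(out_path, os.O_RDONLY | os.O_NONBLOCK)
        try:
            # Blocks until the engine opens the input FIFO for reading.
            with open(in_path, 'wb') as pipe_in:
                pipe_in.write(data)
            # Wait for the engine to finish so all output is in the pipe.
            # Caveat: output larger than the pipe buffer would stall here.
            worker.join()
            chunks = []
            while True:
                chunk = os.read(out_fd, 4096)
                if not chunk:
                    break
                chunks.append(chunk)
        finally:
            os.close(out_fd)
        return b''.join(chunks)
    finally:
        os.remove(in_path)
        os.remove(out_path)
        os.rmdir(workdir)
```

For example, `process_through_fifos(b'hello')` sends the bytes through the stand-in engine and returns whatever it wrote to the output pipe. Joining before reading only works while the output fits in the kernel's pipe buffer; larger outputs would need to be drained while the engine is still running.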
Unfortunately, Tesseract does not support stdin, and PIL's Image.save function requires seek operations, so it can't write to the pipe. I tried manually saving the pixels to the pipe, but I'm guessing I'm missing some header, because Tesseract thinks the file is invalid. For the time being, I put back the file I/O.