File io optimization #20

ventsyv · 2014-09-19T05:25:03Z

Memory pipes are fifo structures which for the most part behave like normal files but are much faster due to the fact that the data is stored in memory and not on the disk.
The two main differences are that pipes do not support seek operations, and that they block the thread.
To get around the blocking issue, after the two pipes are created, I start a thread that then starts the OCR engine process. This worker thread will stay blocked waiting for input until the main thread opens the pipes and writes the data.
After the main thread writes the input data, it waits for the OCR engine's thread to terminate before reading the pipe. That's done to ensure that all the OCR output has been written to the pipe.

Unfortunately, Tesseract does not support stdin and the PIL.image.save function requires seek operation, thus it can't use the pipe. I tried manually saving the pixels to the pipe, but I'm guessing I'm missing some header because Tesseract thinks the file is invalid. For the time being, I put back the file IO.

jflesch · 2014-09-19T06:29:52Z

Looks great. However right now I don't have time to review it correctly. I'll review it Sunday (I can't do it before, sorry).

ventsyv · 2014-09-19T21:08:12Z

No worries, took me a month to write, it can wait a few days :-). Can you please run the unit tests from the master branch and this one so we can compare how long it takes for the tests to finish? I meant to do it, but forgot.

jflesch · 2014-09-21T15:41:54Z

Unfortunately, it seems the tests on your branch remain stuck. Here is the stacktrace I get when I do Ctrl-C:

(...)
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/home/jflesch/git/pyocr_ventsyv/tests/tests_tesseract.py", line 219, in test_basic
    self.__test_txt('test.png', 'test.words')
  File "/home/jflesch/git/pyocr_ventsyv/tests/tests_tesseract.py", line 201, in __test_txt
    builder=self.builder)
  File "src/pyocr/tesseract.py", line 241, in image_to_string
    with codecs.open(output_file_full_name, 'r', encoding='utf-8', errors='replace') as outFile_desc:
  File "/usr/lib64/python2.7/codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
KeyboardInterrupt

jflesch · 2014-09-21T15:43:15Z

src/pyocr/cuneiform.py

+    worker = threading.Thread(target=run_cuneiform, args=(cmd, img_data))
+    worker.start()
+
+    file_desc =  codecs.open(output_file, 'r', encoding='utf-8', errors='replace')


PEP8: There shouldn't be 2 spaces after '='

Also: PEP8: Lines shouldn't be longer than 80 car. (same problem in tesseract.py)

jflesch · 2014-09-21T16:06:52Z

From what I can tell, your changes regarding Cuneiform seem to work fine. However those regarding Tesseract don't work. So here what I suggest you do :

simply revert all your changes on src/pyocr/tesseract.py
fix all the PEP8 issues on src/pyocr/cuneiform.py

And everything should be good.

Other than that, nice work :)

By the way, sorry if some of my comments (or their number) may seem rude. I'm just trying to keep the code as good as possible.

ventsyv · 2014-09-21T22:28:05Z

I did not realize you were adhering to a coding standard. I'll run it through the formatting tool.
Did you run the unit tests before making any changes? The operations on the pipes need to occur in a particular order, otherwise the thread will block. I'll take a look in a minute, but the code I checked in was passing all tests.

ventsyv · 2014-09-21T23:54:25Z

Ok, I made the necessary pep8 changes and checked them in. I ran the tests and they are all passing. Could you please update from the repository and run the code as is?

jflesch · 2014-10-02T21:22:11Z

It's much better regarding PEP8.
However, the Tesseract tests still remain stuck when I run them.

ventsyv · 2014-10-06T13:28:39Z

What version of tessera ct do u have? What OS? I'll set ip a VM and test it.
On Oct 2, 2014 5:22 PM, "Jerome Flesch" notifications@github.com wrote:

It's much better regarding PEP8.
However, the Tesseract tests still remain stuck when I run them.

—
Reply to this email directly or view it on GitHub
#20 (comment).

jflesch · 2014-10-06T15:04:23Z

I'm using Fedora 20 and Tesseract 3.02.02 (the one provided by Fedora)

ventsyv added 3 commits September 19, 2014 00:49

Unit tests passing but tesseract input still writen to file

84dcf20

Removed code that was commented out

7e44e8b

Removed unused imports

1181326

jflesch reviewed Sep 21, 2014
View reviewed changes

Pep8 formatting

d06d5c2

jflesch closed this Feb 14, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

File io optimization #20

File io optimization #20

ventsyv commented Sep 19, 2014

jflesch commented Sep 19, 2014

ventsyv commented Sep 19, 2014

jflesch commented Sep 21, 2014

jflesch Sep 21, 2014

jflesch Sep 21, 2014

jflesch commented Sep 21, 2014

ventsyv commented Sep 21, 2014

ventsyv commented Sep 21, 2014

jflesch commented Oct 2, 2014

ventsyv commented Oct 6, 2014

jflesch commented Oct 6, 2014

File io optimization #20

File io optimization #20

Conversation

ventsyv commented Sep 19, 2014

jflesch commented Sep 19, 2014

ventsyv commented Sep 19, 2014

jflesch commented Sep 21, 2014

jflesch Sep 21, 2014

Choose a reason for hiding this comment

jflesch Sep 21, 2014

Choose a reason for hiding this comment

jflesch commented Sep 21, 2014

ventsyv commented Sep 21, 2014

ventsyv commented Sep 21, 2014

jflesch commented Oct 2, 2014

ventsyv commented Oct 6, 2014

jflesch commented Oct 6, 2014