Skip to content
This repository has been archived by the owner on Jun 14, 2018. It is now read-only.

File io optimization #20

Closed
wants to merge 4 commits into from
Closed

File io optimization #20

wants to merge 4 commits into from

Conversation

ventsyv
Copy link

@ventsyv ventsyv commented Sep 19, 2014

Memory pipes are fifo structures which for the most part behave like normal files but are much faster due to the fact that the data is stored in memory and not on the disk.
The two main differences are that pipes do not support seek operations, and that they block the thread.
To get around the blocking issue, after the two pipes are created, I start a thread that then starts the OCR engine process. This worker thread will stay blocked waiting for input until the main thread opens the pipes and writes the data.
After the main thread writes the input data, it waits for the OCR engine's thread to terminate before reading the pipe. That's done to ensure that all the OCR output has been written to the pipe.

Unfortunately, Tesseract does not support stdin and the PIL.image.save function requires seek operation, thus it can't use the pipe. I tried manually saving the pixels to the pipe, but I'm guessing I'm missing some header because Tesseract thinks the file is invalid. For the time being, I put back the file IO.

@jflesch
Copy link
Member

jflesch commented Sep 19, 2014

Looks great. However right now I don't have time to review it correctly. I'll review it Sunday (I can't do it before, sorry).

@ventsyv
Copy link
Author

ventsyv commented Sep 19, 2014

No worries, took me a month to write, it can wait a few days :-). Can you please run the unit tests from the master branch and this one so we can compare how long it takes for the tests to finish? I meant to do it, but forgot.

@jflesch
Copy link
Member

jflesch commented Sep 21, 2014

Unfortunately, it seems the tests on your branch remain stuck. Here is the stacktrace I get when I do Ctrl-C:

(...)
File "/usr/lib64/python2.7/unittest/case.py", line 369, in run
    testMethod()
  File "/home/jflesch/git/pyocr_ventsyv/tests/tests_tesseract.py", line 219, in test_basic
    self.__test_txt('test.png', 'test.words')
  File "/home/jflesch/git/pyocr_ventsyv/tests/tests_tesseract.py", line 201, in __test_txt
    builder=self.builder)
  File "src/pyocr/tesseract.py", line 241, in image_to_string
    with codecs.open(output_file_full_name, 'r', encoding='utf-8', errors='replace') as outFile_desc:
  File "/usr/lib64/python2.7/codecs.py", line 881, in open
    file = __builtin__.open(filename, mode, buffering)
KeyboardInterrupt

worker = threading.Thread(target=run_cuneiform, args=(cmd, img_data))
worker.start()

file_desc = codecs.open(output_file, 'r', encoding='utf-8', errors='replace')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PEP8: There shouldn't be 2 spaces after '='

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: PEP8: Lines shouldn't be longer than 80 car. (same problem in tesseract.py)

@jflesch
Copy link
Member

jflesch commented Sep 21, 2014

From what I can tell, your changes regarding Cuneiform seem to work fine. However those regarding Tesseract don't work. So here what I suggest you do :

  • simply revert all your changes on src/pyocr/tesseract.py
  • fix all the PEP8 issues on src/pyocr/cuneiform.py

And everything should be good.

Other than that, nice work :)

By the way, sorry if some of my comments (or their number) may seem rude. I'm just trying to keep the code as good as possible.

@ventsyv
Copy link
Author

ventsyv commented Sep 21, 2014

I did not realize you were adhering to a coding standard. I'll run it through the formatting tool.
Did you run the unit tests before making any changes? The operations on the pipes need to occur in a particular order, otherwise the thread will block. I'll take a look in a minute, but the code I checked in was passing all tests.

@ventsyv
Copy link
Author

ventsyv commented Sep 21, 2014

Ok, I made the necessary pep8 changes and checked them in. I ran the tests and they are all passing. Could you please update from the repository and run the code as is?

@jflesch
Copy link
Member

jflesch commented Oct 2, 2014

It's much better regarding PEP8.
However, the Tesseract tests still remain stuck when I run them.

@ventsyv
Copy link
Author

ventsyv commented Oct 6, 2014

What version of tessera ct do u have? What OS? I'll set ip a VM and test it.
On Oct 2, 2014 5:22 PM, "Jerome Flesch" notifications@github.com wrote:

It's much better regarding PEP8.
However, the Tesseract tests still remain stuck when I run them.


Reply to this email directly or view it on GitHub
#20 (comment).

@jflesch
Copy link
Member

jflesch commented Oct 6, 2014

I'm using Fedora 20 and Tesseract 3.02.02 (the one provided by Fedora)

@jflesch jflesch closed this Feb 14, 2016
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants