
Support for the OCR engine included in Windows 10. #7361

Merged
merged 11 commits into master from uwpOcr on Aug 1, 2017

Conversation

jcsteh
Contributor

@jcsteh jcsteh commented Jul 6, 2017

Link to issue number:

None. This is a new feature I started experimenting with a few months ago.

Edit by @LeonarddeR: Fixes #3050, since this adds the json module.

Summary of the issue:

There are many cases where text is inaccessible and it is useful to be able to use OCR to access it. There is already an OCR add-on for NVDA, but this requires a separate download. We could never include it in NVDA because of the size of the dependencies. Also, it uses the open source Tesseract engine, which has some quality problems. Windows 10 includes an OCR engine with support for 25+ languages. Having support for this out-of-the-box means users can use it without any additional downloads. Because it's a commercial engine, it's possible that it is better quality (though I haven't done much in the way of comparison).

Description of how this pull request fixes the issue:

This accesses the UWP OCR API via code in the nvdaHelperLocalWin10 C++/CX dll.
Users press NVDA+r to recognize the text of the current navigator object. Once recognition is complete, the result is presented in a document which can be read with the cursor keys, etc. Enter can also be pressed to click the text at the cursor.
Much of the base content recognition functionality has been abstracted into the new contentRecog framework, allowing other recognizers to be easily implemented in future (in both NVDA core and add-ons).
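The plug-in pattern described above can be roughly illustrated as follows (class and method names here are invented for illustration, not NVDA's actual contentRecog API):

```python
# Minimal sketch of a content recognition framework: a base class that
# recognizers subclass, reporting their result via a callback.
class ContentRecognizer:
    """Base class: subclasses implement recognize()."""
    def recognize(self, pixels, width, height, onResult):
        raise NotImplementedError

class EchoRecognizer(ContentRecognizer):
    """Toy recognizer that 'recognizes' by returning a fixed string."""
    def recognize(self, pixels, width, height, onResult):
        # A real engine (UWP OCR, an add-on, etc.) would process the
        # pixel buffer here, possibly asynchronously.
        onResult("recognized text")

results = []
EchoRecognizer().recognize(b"\x00" * 16, 4, 4, results.append)
```

Because recognizers only need to honour this small interface, both core engines and add-ons can plug in without touching the presentation layer.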

Testing performed:

  1. Tested in the Run dialog, both the entire dialog and individual controls.
  2. Tested on the nvaccess.org website, both on the logo and various paragraphs of text.

Known issues with pull request:

  1. Obviously, this only works in Windows 10.
  2. Content that is visually separate but positioned horizontally gets rendered on separate lines; e.g. the buttons in the Run dialog. The OCR engine is choosing to treat this as separate lines. Whether this is desirable is controversial. We could probably compare y coordinates in future if there is sufficient demand.

Change log entry:

New Features:

- NVDA can now use the OCR functionality included in Windows 10 to recognize the text of images or inaccessible applications. (#7361)
 - The language can be set from the new Windows 10 OCR dialog in NVDA Preferences.
 - To recognize the content of the current navigator object, press NVDA+r.
 - See the Content Recognition section of the User Guide for further details.

Changes for Developers:

- Support for content recognizers such as OCR and image description tools can be easily implemented using the new contentRecog package. (#7361)
- The Python json package is now included in NVDA binary builds. (#3050)

@jcsteh jcsteh requested a review from feerrenrut July 6, 2017 06:42
@jcsteh
Contributor Author

jcsteh commented Jul 6, 2017

Forgot to note that I tested clicking buttons in the Run dialog by pressing enter on them. This is an important piece of the functionality.

@derekriemer
Collaborator

Does the review cursor support now go away? (Originally it was doing this with review.)

@LeonarddeR
Collaborator

I wonder whether it would be possible to have the OCR language tied to the NVDA language by default, with a fallback to English when the language is not available for OCR? Setting the default to English means that in many cases, one will have to change the default.

@PratikP1

PratikP1 commented Jul 6, 2017

Once incubated, I'll do comparison testing with the Tesseract engine output. I probably won't have time to do testing with multiple languages. If there's a call for it, I'll put testing Spanish, French, German, and Italian on my list.

@Brian1Gaff

Brian1Gaff commented Jul 6, 2017 via email

@josephsl
Collaborator

josephsl commented Jul 6, 2017 via email

@josephsl
Collaborator

josephsl commented Jul 6, 2017 via email

@LeonarddeR
Collaborator

@jcsteh: Maybe this is the right time to add appropriate headers to screenBitmap.py?

@jcsteh
Contributor Author

jcsteh commented Jul 7, 2017

@derekriemer commented on 6 Jul 2017, 19:24 GMT+10:

Does the review cursor support now go away? (Originally it was doing this with review.)

You can still use the review cursor, but you may as well use the system cursor, since you get thrown into a fake document with the result. I had three reasons for doing this:

  1. Some users seemed to be confused by having to use the review cursor after doing OCR.
  2. The approach used by the OCR add-on overrode the object's TextInfo. This means that, for example, if you ran this on an editable text control, pressing the arrow keys afterwards would throw exceptions.
  3. It's nice to just be able to press enter on a piece of text to activate it, just like browse mode. :)

@leonardder commented on 6 Jul 2017, 19:29 GMT+10:

I wonder whether it would be possible to have the OCR language tied to the NVDA language by default, with a fallback to English when the language is not available for OCR? Setting the default to English means that in many cases, one will have to change the default.

Originally, I couldn't think of a decent way to do this; the UWP OCR code doesn't get loaded until it's needed, so there's no obvious place to put this. However, I think I might have some other ideas, so I'll see what I can do.

@josephsl commented on 7 Jul 2017, 01:17 GMT+10:

NVDA language, Windows display language, and then to English as a last resort.

The NVDA language is set based on the Windows display language in most cases unless the user explicitly overrides it.

@josephsl commented on 7 Jul 2017, 01:03 GMT+10:

Hi, would it be possible for someone to include an online, AI-based contentRecog recognizer later?

The framework does allow for that, yes. Note that most cloud image description APIs are not free.

@leonardder commented on 7 Jul 2017, 01:31 GMT+10:

Maybe this is the right time to add appropriate headers to screenBitmap.py?

Shall do.

@jcsteh
Contributor Author

jcsteh commented Jul 7, 2017

I believe I've now addressed these comments. Thanks for the pre-review feedback. :)

@LeonarddeR
Collaborator

You might want to fix the merge conflict before the review takes place.

Another thing I'd like to bring up: the script in globalCommands is now specific to Windows 10 OCR. How about a system where new OCR engines can register themselves? You could change the Windows 10 OCR dialog into a more generic dialog where the OCR engine to use can be set when multiple engines are available. This means that, on versions of Windows earlier than 10, an add-on could provide an engine and people would be able to use NVDA+r out of the box. Thus, NVDA+r would OCR with the currently selected engine.

@jcsteh
Contributor Author

jcsteh commented Jul 7, 2017 via email

@jcsteh
Contributor Author

jcsteh commented Jul 9, 2017 via email

Contributor

@feerrenrut feerrenrut left a comment

I haven't made it all the way through this. But here are my comments so far.


#include <robuffer.h>

byte* getBytes(Windows::Storage::Streams::IBuffer^ buffer);
Contributor

Some documentation?

*/

#pragma once
#define export __declspec(dllexport)
Contributor

In case I don't notice later, how does this become an import?

Contributor Author

In case I don't notice later, how does this become an import?

This exports the symbol from the dll so that a caller loading the dll can access it. We then access it using ctypes (which internally uses GetProcAddress).
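The same load-and-call pattern can be sketched against the standard C library (nvdaHelperLocalWin10.dll itself is Windows-only, so libc's exported strlen stands in here for the uwpOcr_* symbols; this is an analogy, not NVDA's actual wrapper code):

```python
import ctypes
import ctypes.util

# Load a shared library and look up an exported symbol by name.
# ctypes resolves the symbol via GetProcAddress (or dlsym on POSIX),
# exactly the mechanism described for the dll exports above.
libc = ctypes.CDLL(ctypes.util.find_library("c") or None)
libc.strlen.restype = ctypes.c_size_t
libc.strlen.argtypes = [ctypes.c_char_p]

length = libc.strlen(b"uwpOcr")  # -> 6
```

Declaring restype and argtypes up front, as NVDA does for its helper dll functions, keeps ctypes from silently mis-marshalling arguments.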

export UwpOcr* __stdcall uwpOcr_initialize(const char16* language, uwpOcr_Callback callback);
export void __stdcall uwpOcr_terminate(UwpOcr* instance);
export void __stdcall uwpOcr_recognize(UwpOcr* instance, const RGBQUAD* image, unsigned int width, unsigned int height);
export BSTR __stdcall uwpOcr_getLanguages();
Contributor

Might be handy to know what format the languages come back in?

Collaborator

A COM string.

Contributor

I mean, since this gets several languages. Presumably each language is in the form en-gb and there is some separator?

UwpOcr* __stdcall uwpOcr_initialize(const char16* language, uwpOcr_Callback callback) {
auto engine = OcrEngine::TryCreateFromLanguage(ref new Language(ref new String(language)));
if (!engine)
return NULL;
Contributor

Can you use nullptr instead of NULL?

self.language = getConfigLanguage()
self._dll = NVDAHelper.getHelperLocalWin10Dll()

def recognize(self, pixels, width, height, coordConv, onResult):
Contributor

Should this protect against width or height of zero or less?
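A guard along the lines the comment suggests might look like this (the helper name is hypothetical; only the recognize() signature comes from the diff):

```python
# Hypothetical validation helper: raising early avoids handing a
# zero- or negative-sized buffer to the native OCR call.
def validateImageSize(width, height):
    if width <= 0 or height <= 0:
        raise ValueError(
            "image dimensions must be positive, got %dx%d" % (width, height))

validateImageSize(640, 480)  # OK; validateImageSize(0, 10) would raise
```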

raise NotImplementedError

# Used by LinesWordsResult.
LwrWord = namedtuple("LwrWord", ("offset", "left", "top"))
Contributor

I'm a little confused by the name LwrWord; does it stand for "Lines Words Result Word"?
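For context, the diff suggests LwrWord abbreviates "LinesWordsResult word": a word's offset into the flattened result text plus its top-left position (a reading, not confirmed in this thread):

```python
from collections import namedtuple

# Mirrors the definition in the diff: offset into the result text,
# plus the word's left/top coordinates for click-to-activate.
LwrWord = namedtuple("LwrWord", ("offset", "left", "top"))

word = LwrWord(offset=6, left=52, top=5)
```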

word = nextWord
return textInfos.Point(word.left, word.top)

class SimpleTextResult(RecognitionResult):
Contributor

This might become clear later in the review and probably needs to be handled higher up than here, but at the very least, we would know the area of the screen that the image used for OCR came from (assuming NVDA did the screen grab). Obviously, this is not true for an image obtained from elsewhere.

def recognizeNavigatorObject(recognizer):
"""User interface function to recognize content in the navigator object.
This should be called from a script or in response to a GUI action.
@param recognizer: The content recognizer ot use.
Contributor

typo ot -> to

left, top, width, height = nav.location
resize = recognizer.getResizeFactor(width, height)
coordConv = ResultCoordConverter(left, top, resize)
destWidth = int(width * resize)
Contributor

I think it would be nice if the caller did not need to know which direction the resize works, or about the offsets, with the ResultCoordConverter.

self.assertEqual(actual, (110, 220))

def test_noOffsetWithResize(self):
conv = contentRecog.ResultCoordConverter(0, 0, 2)
Contributor

As discussed, it's a little confusing to me that a resize factor of 2 results in a halving of the coordinates.
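The behaviour being questioned can be reconstructed as follows: the image is scaled up by the resize factor before recognition, so converting result coordinates back to the screen divides by that factor and then applies the capture origin (a hypothetical reimplementation, not the actual class):

```python
# Sketch of a result coordinate converter: image coords are divided
# by the resize factor (the capture was scaled up before OCR),
# then offset by the screen position of the captured region.
class ResultCoordConverter:
    def __init__(self, left, top, resize):
        self.left, self.top, self.resize = left, top, resize

    def convert(self, x, y):
        return (self.left + int(x / self.resize),
                self.top + int(y / self.resize))

ResultCoordConverter(0, 0, 2).convert(110, 220)  # -> (55, 110)
```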

pass

class TestLinesWordsResult(unittest.TestCase):
"""Tests contentRecog.LinesWordsResult and contentRecog.LwrTextInfo.
Contributor

I think it's worth expanding on this a little bit. This is essentially testing that we are able to parse and interpret the JSON we expect to be returned from an OCR library, right?
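The kind of lines/words JSON being exercised might look roughly like this (field names here are illustrative assumptions, not the engine's exact schema):

```python
import json

# A plausible lines/words result: lines containing positioned words.
data = json.loads("""
{"lines": [
    {"words": [
        {"text": "Hello", "x": 10, "y": 5},
        {"text": "world", "x": 52, "y": 5}
    ]}
]}
""")

# Flatten the structure into readable text, as a TextInfo over the
# result would need to.
text = " ".join(w["text"] for line in data["lines"] for w in line["words"])
```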

@jcsteh jcsteh requested a review from feerrenrut July 11, 2017 04:54
in the supplied image to screen coordinates.
This should be used when returning coordinates to NVDA.
@type coordConverter: L{ResultCoordConverter}
@param imageInfo: Informationabout the image for recognition.
Contributor

missing space between Information and about

@derekriemer
Collaborator

What's with the assignment of @jcsteh here by nvaccessAuto?

@jcsteh
Contributor Author

jcsteh commented Jul 11, 2017 via email

@LeonarddeR
Collaborator

LeonarddeR commented Jul 12, 2017

Two remarks:

  1. It is not possible to use the find functionality in the resulting cursor manager. You can open the find dialog and even find something, but as soon as you return, focus will go back to the original focus object.
  2. Would it be an option to also add the space bar as a default binding for activate?

…g it on the current thread.

UWP OCR calls the result callback in a background thread, but NVDA events must never be handled in a background thread.
This was causing MSHTML vbufs to die due to COM queries failing because they were on the wrong thread.
Added a debugWarning in IAccessibleRole which would have made this easier to catch.
Fixes #7399.
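The threading fix described in this commit message can be sketched with a plain queue (illustrative only; NVDA's actual mechanism differs, but the principle is the same: the background callback only enqueues, and the main thread does all event handling):

```python
import queue
import threading

eventQueue = queue.Queue()

def ocrCallback(result):
    # Fired on the engine's background thread: do no event handling
    # here, just hand the result off to the main thread.
    eventQueue.put(result)

def mainThreadPump(handled):
    # Runs on the main thread, where COM objects and vbufs live;
    # drains queued results and processes them safely.
    while not eventQueue.empty():
        handled.append(eventQueue.get())

t = threading.Thread(target=ocrCallback, args=("recognized text",))
t.start()
t.join()
handled = []
mainThreadPump(handled)
```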
@jcsteh
Contributor Author

jcsteh commented Jul 18, 2017

@leonardder commented on 12 Jul 2017, 20:52 GMT+12:

  1. It is not possible to use the find functionality in the resulting cursor manager. You can open the find dialog and even find something, but as soon as you return, focus will go back to the original focus object.

For now, I've disabled the find commands; the user gets a message indicating that they aren't supported. This is tricky to support because the dialog causes the focus to move and we thus lose our fake focus. There are ways we can work around this, but I'd like to get this into 2017.3 and there isn't time to explore them. I think the functionality is sufficient as it is to be worth releasing. I filed #7415 as a feature request for this.

  1. Would it be an option to also add the space bar as a default binding for activate?

Done.

Thanks for the feedback.

@jcsteh jcsteh requested a review from feerrenrut July 18, 2017 04:15
@jcsteh
Contributor Author

jcsteh commented Jul 18, 2017

@feerrenrut, could you please review the latest three commits ASAP? As discussed, we want to get this into 2017.3 if possible, hence the time constraint. Thanks.

@jcsteh
Contributor Author

jcsteh commented Jul 18, 2017

Sorry @feerrenrut; just pushed one commit because I forgot to commit the User Guide update. :(

@@ -74,7 +74,7 @@ def setFocus(self):
# we want the cursor to move to the focus (#3145).
# However, we don't want this for recognition results, as these aren't focusable.
ti._enteringFromOutside = True
eventHandler.executeEvent("gainFocus", self)
eventHandler.queueEvent("gainFocus", self)
Contributor

Might be worth having a comment here to say why we do this. You can probably pretty much copy your commit message.

jcsteh added a commit that referenced this pull request Jul 19, 2017
@tmthywynn8
Sponsor

I assume that at this point pressing Enter/Space will perform the default action (NVDA+Enter) on the coordinates of the given text? What about a left double click, a right click, moving the mouse pointer, etc?

@jcsteh
Contributor Author

jcsteh commented Jul 20, 2017 via email

@tmthywynn8
Sponsor

tmthywynn8 commented Jul 20, 2017 via email

@LeonarddeR
Collaborator

It seems that, in the current situation, one is able to run multiple OCR actions at once; e.g. when the OCR result is shown, you can do another OCR on the result object, and it will show another result.

@derekriemer
Collaborator

This isn't really a bug.

@jcsteh
Contributor Author

jcsteh commented Jul 31, 2017

@leonardder commented on 24 Jul 2017, 16:06 GMT+10:

It seems that, in the current situation, one is able to run multiple OCR actions at once; e.g. when the OCR result is shown, you can do another OCR on the result object, and it will show another result.

I don't think this should delay this from merging to master or going into 2017.3, but I'll file a follow up PR to deal with it.

@jcsteh jcsteh merged commit 5527f27 into master Aug 1, 2017
@nvaccessAuto nvaccessAuto added this to the 2017.3 milestone Aug 1, 2017
@jcsteh jcsteh deleted the uwpOcr branch August 1, 2017 23:17
@fisher729

I suggest the string used when pressing the shortcut in input help say 'recognizes' rather than 'recognize', to fall in line with other descriptors.


Successfully merging this pull request may close these issues.

Add JSon module to NVDA distribution