Support for the OCR engine included in Windows 10. #7361
Conversation
This accesses the UWP OCR API via code in the nvdaHelperLocalWin10 C++/CX dll. Users press NVDA+r to recognize the text of the current navigator object. Once recognition is complete, the result is presented in a document which can be read with the cursor keys, etc. Enter can also be pressed to click the text at the cursor. Much of the base content recognition functionality has been abstracted into the new contentRecog framework, allowing other recognizers to be easily implemented in future (in both NVDA core and add-ons).
Forgot to note that I tested clicking buttons in the Run dialog by pressing enter on them. This is an important piece of the functionality.
Does the review cursor support now go away? (Originally, it was doing this with review.)
I wonder whether it would be possible to have the OCR language tied to the NVDA language by default, with a fallback to English when the language is not available for OCR? Setting the default to English means that in many cases, one will have to change the default.
Once incubated, I'll do comparison testing with the Tesseract engine output. I probably won't have time to do testing with multiple languages. If there's a call for it, I'll put testing Spanish, French, German, and Italian on my list.
Are you saying that there is an OCR in Windows 10? Also, was there not some talk a while back about a new version of the open source OCR, and the problem was its size, as I recall.
Hi, would it be possible for someone to include online, AI-based content recognition (via contentRecog) into this later? Thanks.
Hi, or rather, NVDA language, Windows display language, and then to English as a last resort. Thanks.
@jcsteh: Maybe this is the right time to add appropriate headers to screenBitmap.py?
@derekriemer commented on 6 Jul 2017, 19:24 GMT+10:
You can still use the review cursor, but you may as well use the system cursor, since you get thrown into a fake document with the result. I had three reasons for doing this:
@leonardder commented on 6 Jul 2017, 19:29 GMT+10:
Originally, I couldn't think of a decent way to do this; the UWP OCR code doesn't get loaded until it's needed, so there's no obvious place to put this. However, I think I might have some other ideas, so I'll see what I can do.

@josephsl commented on 7 Jul 2017, 01:17 GMT+10:
The NVDA language is set based on the Windows display language in most cases unless the user explicitly overrides it.

@josephsl commented on 7 Jul 2017, 01:03 GMT+10:
The framework does allow for that, yes. Note that most cloud image description APIs are not free.

@leonardder commented on 7 Jul 2017, 01:31 GMT+10:
Shall do.
I believe I've now addressed these comments. Thanks for the pre-review feedback. :)
You might want to fix the merge conflict before the review takes place. Another thing I'd like to bring up: the script in globalCommands is now specific to Windows 10 OCR. How about a system where new OCR engines can register themselves? You could change the Windows 10 OCR dialog to a more generic dialog where you can set the OCR engine to use if multiple engines are available. This means that, on Windows versions before 10, an add-on can provide an engine and people will be able to use NVDA+r out of the box. Thus, NVDA+r will OCR with the currently selected OCR engine.
Blerg. Thanks for pointing out the merge conflict; I'll take a look.
In the first version of this code (before I submitted a PR), there was a system for registering recognizers. I imagined the user would select a recognizer to use as a setting and then NVDA+r would use that, much as you suggest. However, this system was designed to be for any kind of image-to-text recognition, not just OCR. Discussions with @derekriemer and @md_curran convinced me to drop the idea of a single selectable recognizer because that would make it tedious to frequently use both Win 10 OCR and Microsoft Computer Vision for image description, for example. Also, given how new image description is, users may conceivably want to try multiple services to get the best of all worlds. This might even be true for OCR, though perhaps a little less likely.
Obviously, that's the general idea, but it seems the changes I made to address comments after I requested review (but before it occurred) inadvertently caused a merge conflict. No idea why yet; will investigate when I'm back at work, probably before Reef gets to it. :)
I haven't made it all the way through this. But here are my comments so far.
nvdaHelper/localWin10/utils.h (Outdated)

#include <robuffer.h>

byte* getBytes(Windows::Storage::Streams::IBuffer^ buffer);
Some documentation?
*/

#pragma once
#define export __declspec(dllexport)
In case I don't notice later, how does this become an import?
This exports the symbol from the dll so that a caller loading the dll can access it. We then access it using ctypes (which internally uses GetProcAddress).
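To illustrate the mechanism, here is a sketch of consuming an exported symbol from Python via ctypes. It uses the standard C math library as a stand-in for nvdaHelperLocalWin10, since the principle is the same: the loader resolves the exported name (via GetProcAddress on Windows), you declare the signature, then call it:

```python
import ctypes
import ctypes.util

# Load a shared library; libm stands in here for the NVDA helper dll.
libm = ctypes.CDLL(ctypes.util.find_library("m") or "libm.so.6")

# Declare the exported function's signature before calling it.
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

result = libm.cos(0.0)  # calls the exported C function directly
```

The helper dll's uwpOcr_* exports are consumed the same way, just with the struct and callback types those functions expect.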
nvdaHelper/localWin10/uwpOcr.h (Outdated)

export UwpOcr* __stdcall uwpOcr_initialize(const char16* language, uwpOcr_Callback callback);
export void __stdcall uwpOcr_terminate(UwpOcr* instance);
export void __stdcall uwpOcr_recognize(UwpOcr* instance, const RGBQUAD* image, unsigned int width, unsigned int height);
export BSTR __stdcall uwpOcr_getLanguages();
Might be handy to know what format the languages come back in?
A COM string.
I mean, since this gets several languages. Presumably each language is in the form en-gb and there is some separator?
nvdaHelper/localWin10/uwpOcr.cpp (Outdated)

UwpOcr* __stdcall uwpOcr_initialize(const char16* language, uwpOcr_Callback callback) {
	auto engine = OcrEngine::TryCreateFromLanguage(ref new Language(ref new String(language)));
	if (!engine)
		return NULL;
Can you use nullptr instead of NULL?
source/contentRecog/uwpOcr.py (Outdated)

		self.language = getConfigLanguage()
		self._dll = NVDAHelper.getHelperLocalWin10Dll()

	def recognize(self, pixels, width, height, coordConv, onResult):
Should this protect against width or height of zero or less?
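A guard along the lines the reviewer suggests might look like this (a hypothetical helper, not code from this PR):

```python
def validateImageSize(width, height):
    """Reject degenerate dimensions before handing the image to the
    OCR engine. Hypothetical check, not part of the PR."""
    if width <= 0 or height <= 0:
        raise ValueError(
            "Image dimensions must be positive, got %dx%d" % (width, height))

validateImageSize(200, 100)  # a zero or negative value would raise
```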
source/contentRecog/__init__.py (Outdated)

		raise NotImplementedError

# Used by LinesWordsResult.
LwrWord = namedtuple("LwrWord", ("offset", "left", "top"))
I'm a little confused by the name LwrWord. Does it stand for "Lines Words Result Word"?
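Assuming the name does mean "LinesWordsResult word", the tuple pairs a word's offset in the flattened result text with its screen position, so a cursor offset can be mapped back to coordinates. A small self-contained sketch (the word list and lookup function are illustrative, not the PR's code):

```python
from collections import namedtuple

LwrWord = namedtuple("LwrWord", ("offset", "left", "top"))

# Three words of recognized text on one line, e.g. "hello world again".
words = [LwrWord(0, 10, 5), LwrWord(6, 50, 5), LwrWord(12, 90, 5)]

def pointForOffset(words, offset):
    """Return the screen coordinates of the word containing a text offset."""
    word = words[0]
    for nextWord in words[1:]:
        if nextWord.offset > offset:
            break
        word = nextWord
    return (word.left, word.top)
```

This kind of lookup is what lets pressing Enter click the text at the cursor.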
			word = nextWord
		return textInfos.Point(word.left, word.top)

class SimpleTextResult(RecognitionResult):
This might become clear later in the review, and probably needs to be handled higher up than here, but at the very least we would know the area of the screen that the image used for the OCR came from (assuming NVDA did the screen grab). Obviously this is not true for an image obtained from elsewhere.
source/contentRecog/recogUi.py (Outdated)

def recognizeNavigatorObject(recognizer):
	"""User interface function to recognize content in the navigator object.
	This should be called from a script or in response to a GUI action.
	@param recognizer: The content recognizer ot use.
Typo: ot -> to
source/contentRecog/recogUi.py (Outdated)

	left, top, width, height = nav.location
	resize = recognizer.getResizeFactor(width, height)
	coordConv = ResultCoordConverter(left, top, resize)
	destWidth = int(width * resize)
I think it would be nice if the caller did not need to know about which direction the resize works, or about the offsets with the ResultCoordConverter.
		self.assertEqual(actual, (110, 220))

	def test_noOffsetWithResize(self):
		conv = contentRecog.ResultCoordConverter(0, 0, 2)
As discussed, it's a little confusing to me that 2 results in a halving of the coords.
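For the record, here is a reconstruction of the behavior under discussion, inferred from the quoted tests (a sketch, not the PR's implementation): the image is enlarged by the resize factor before recognition, so mapping result coordinates back to the screen divides by that factor, which is why a factor of 2 halves the coordinates:

```python
class ResultCoordConverter:
    """Sketch inferred from the quoted tests, not the PR's code.

    left/top: screen position of the captured image.
    resize: factor the image was enlarged by before recognition.
    """
    def __init__(self, left, top, resize):
        self.left = left
        self.top = top
        self.resize = resize

    def convert(self, x, y):
        # Undo the enlargement, then translate into screen coordinates.
        return (self.left + int(x / self.resize),
                self.top + int(y / self.resize))
```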
		pass

class TestLinesWordsResult(unittest.TestCase):
	"""Tests contentRecog.LinesWordsResult and contentRecog.LwrTextInfo.
I think it's worth expanding on this a little bit. This is essentially testing that we are able to parse and interpret the JSON we expect to be returned from an OCR library, right?
source/contentRecog/__init__.py (Outdated)

	in the supplied image to screen coordinates.
	This should be used when returning coordinates to NVDA.
	@type coordConverter: L{ResultCoordConverter}
	@param imageInfo: Informationabout the image for recognition.
Missing space between "Information" and "about".
What's with the assignment of @jcsteh here by nvaccessAuto?
The person who incubates an issue gets assigned to it, since they're responsible for merging it to master, etc.
Two remarks:
…g it on the current thread. UWP OCR calls the result callback in a background thread, but NVDA events must never be handled in a background thread. This was causing MSHTML vbufs to die due to COM queries failing because they were on the wrong thread. Added a debugWarning in IAccessibleRole which would have made this easier to catch. Fixes #7399.
…ogs and cause the fake focus to be lost.
@leonardder commented on 12 Jul 2017, 20:52 GMT+12:
For now, I've disabled the find commands; the user gets a message indicating that they aren't supported. This is tricky to support because the dialog causes the focus to move and we thus lose our fake focus. There are ways we can work around this, but I'd like to get this into 2017.3 and there isn't time to explore them. I think the functionality is sufficient as it is to be worth releasing. I filed #7415 as a feature request for this.
Done. Thanks for the feedback.
@feerrenrut, could you please review the latest three commits ASAP? As discussed, we want to get this into 2017.3 if possible, hence the time constraint. Thanks.
Sorry @feerrenrut; just pushed one commit because I forgot to commit the User Guide update. :(
source/contentRecog/recogUi.py (Outdated)

@@ -74,7 +74,7 @@ def setFocus(self):
	# we want the cursor to move to the focus (#3145).
	# However, we don't want this for recognition results, as these aren't focusable.
	ti._enteringFromOutside = True
-	eventHandler.executeEvent("gainFocus", self)
+	eventHandler.queueEvent("gainFocus", self)
Might be worth having a comment here to say why we do this. You can probably pretty much copy your commit message.
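The pattern behind the executeEvent-to-queueEvent change can be sketched generically. This is not NVDA's eventHandler, just an illustration of why queueing is safe from a background thread while direct handling is not:

```python
import queue
import threading

eventQueue = queue.Queue()

def queueEvent(name, obj):
    # Safe from any thread: the event is only recorded, not handled.
    eventQueue.put((name, obj))

def ocrResultCallback():
    # Stands in for the UWP OCR result callback, which fires on a
    # background thread where events must not be handled directly.
    queueEvent("gainFocus", "resultDocument")

worker = threading.Thread(target=ocrResultCallback)
worker.start()
worker.join()

# The main thread later drains the queue and handles events there,
# so thread-affine work (e.g. COM calls) runs on the right thread.
name, obj = eventQueue.get_nowait()
```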
I assume that at this point pressing Enter/Space will perform the default action (NVDA+Enter) on the coordinates of the given text? What about a left double click, a right click, moving the mouse pointer, etc?
Pressing enter or space moves the mouse and left clicks. If you want to do any other mouse things, you can route the mouse to review and then do the appropriate click, etc.
Got it. As I have not tried it yet, and the documentation had no mention of mouse commands, I assumed that either (1) the functionality was not implemented yet, or (2) anything you can do in a virtual buffer that does not cause any sort of movement can also be done in the OCR results. Thanks for the clarification.
It seems that in the current situation, one is able to run multiple OCR actions at once; e.g. when the OCR result is shown, you can do another OCR on the result object, and it will show another result.
This isn't really a bug.
@leonardder commented on 24 Jul 2017, 16:06 GMT+10:
I don't think this should delay this from merging to master or going into 2017.3, but I'll file a follow-up PR to deal with it.
I suggest the string(?) used when pressing the shortcut in input help says 'recognizes' rather than 'recognize', to fall in line with other descriptors.
… of "Recognize" for consistency. Re #7361.
Link to issue number:
None. This is a new feature I started experimenting with a few months ago.
Edit by @LeonarddeR: Fixes #3050, since this adds the json module.
Summary of the issue:
There are many cases where text is inaccessible and it is useful to be able to use OCR to access it. There is already an OCR add-on for NVDA, but this requires a separate download. We could never include it in NVDA because of the size of the dependencies. Also, it uses the open source Tesseract engine, which has some quality problems. Windows 10 includes an OCR engine with support for 25+ languages. Having support for this out-of-the-box means users can use it without any additional downloads. Because it's a commercial engine, it's possible that it is better quality (though I haven't done much in the way of comparison).
Description of how this pull request fixes the issue:
This accesses the UWP OCR API via code in the nvdaHelperLocalWin10 C++/CX dll.
Users press NVDA+r to recognize the text of the current navigator object. Once recognition is complete, the result is presented in a document which can be read with the cursor keys, etc. Enter can also be pressed to click the text at the cursor.
Much of the base content recognition functionality has been abstracted into the new contentRecog framework, allowing other recognizers to be easily implemented in future (in both NVDA core and add-ons).
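To show the shape of the framework, here is a toy recognizer using the recognize signature quoted in the review. The base classes are stubbed so the sketch runs outside NVDA; none of this is verbatim PR code:

```python
class ContentRecognizer:
    """Stub of the framework base class (assumption for illustration)."""
    def recognize(self, pixels, width, height, coordConv, onResult):
        raise NotImplementedError

class SimpleTextResult:
    """Stub result type holding plain recognized text."""
    def __init__(self, text):
        self.text = text

class EchoRecognizer(ContentRecognizer):
    """Toy recognizer demonstrating the callback flow: a real one would
    run OCR on `pixels` (often on a background thread) and call
    onResult with the outcome when finished."""
    def recognize(self, pixels, width, height, coordConv, onResult):
        onResult(SimpleTextResult("recognized %dx%d image" % (width, height)))

results = []
EchoRecognizer().recognize(None, 640, 480, None, results.append)
```

The asynchronous onResult callback is what lets slow engines (or cloud services) avoid blocking NVDA while recognition runs.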
Testing performed:
Known issues with pull request:
Change log entry:
New Features:
Changes for Developers: