Auto language detection based on script detection In t2990 #7629
Conversation
…and language priority selection by users
shouldChangeScript is not used
fixes some issues with UnicodeEncodeError and log messages, fixes handling of number characters followed by Japanese
nvaccess#2990 Japanese test cases
…rement for outer punctuation to be associated for outer script
…nguage detection does not work for a language, it can be ignored
This is a big change that I will have to go through carefully again. Thanks for all the work!
#while adding new languages we filter out existing languages in the preferred language list
ignoreLanguages = {x[0] for x in self.languageNames}
languageList = languageDetection.getLanguagesWithDescriptions(ignoreLanguages)
dialog = wx.SingleChoiceDialog(None,
Would you consider using a wx.MultiChoiceDialog? This would allow the user to add all of the languages they care about at once.
The reason we are not using a multi-choice dialog is that when a language is added, it goes to the top of the list, and in most cases users will only add one language at a time.
The order of these languages also affects which language gets priority.
source/unicodeScriptData.py
Outdated
@@ -0,0 +1,824 @@
scriptCode= [ |
Please add the usual copyright header. Also a docstring / explanation for this would be helpful.
Include in the docstring answers to the following:
- What each value represents, is it (unicodeStartOfRange, unicodeEndOfRange, rangeDescription)?
- Should value ranges overlap?
- Should all values be represented?
- Are these values contiguous?
Please also add unit tests to confirm the above restrictions, should they exist.
Added copyright and a docstring with details about each item, but adding test cases may not be needed, as these values are obtained from scripts.txt and that is only done occasionally.
Even though this data is created programmatically, tests for the data will help to ensure its correctness. It will be hard to test the script that creates this data, and removing the assumptions made would complicate the code in NVDA. An automated test to ensure that these assumptions are correct will make it much easier to track down the kind of bugs that would arise if this data does get out of order or values overlap.
Please also state if the start / end of each range is inclusive or exclusive.
This range comes from scripts.txt, which comes from unicode.org. Should we still check it? I am assuming (though I could be wrong) that Unicode would not publish an overlapping list.
Another point: this list is sorted and is only regenerated once in a while.
I'm more concerned that a mistake is made in the process of creating this. Perhaps the source data on unicode.org changes format and this causes a bug in the importer; a mistake is even more likely if it's been a while since someone last went through the process, and there is nothing in the production code to ensure the assumptions are met. It's quite hard to manually check that our assumptions about the scriptCode data hold true, but it's quite easy to write a unit test to do the same.
Added 3 tests for checking the validity of the list.
- test_unicodeRangesEntryStartLessEqualEnd
- test_unicodeRangesEntriesDoNotOverlapAndAreSorted
- test_unicodeRangesEntryScriptNamesExist
Do let me know if we need to add more tests.
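For reference, the first two checks might look roughly like this (a sketch only; the module and data names assume this PR's scriptRanges list of (start, end, scriptName) tuples, and the actual test code in the PR may differ):

import unittest
from unicodeScriptData import scriptRanges  # assumed data location, for illustration

class TestScriptRanges(unittest.TestCase):
	def test_unicodeRangesEntryStartLessEqualEnd(self):
		# Both bounds are inclusive, so start must not exceed end.
		for start, end, name in scriptRanges:
			self.assertLessEqual(start, end)

	def test_unicodeRangesEntriesDoNotOverlapAndAreSorted(self):
		# Each range must begin strictly after the previous one ends.
		for previous, current in zip(scriptRanges, scriptRanges[1:]):
			self.assertLess(previous[1], current[0])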
For completeness, we could also test that the first element's start value is greater than 0.
source/languageDetection.py
Outdated
elif 0xff10 <= characterUnicodeCode <= 0xff19:
	return "FullWidthNumber"
while( mEnd >= mStart ):
	midPoint = (mStart + mEnd) >> 1
Is there not a built-in way of performing this search, so we don't reinvent the wheel, so to speak?
I don't think so, as it is not a simple binary search.
You could use bisect. In the initialisation, create a list of the rangeEndValues like so:
unicodeScriptRangeEnd = [k[1] for k in scriptCode]
Then in this function you can do:
# Based on the following assumptions:
# - ranges must overlap
# - range end and start values are included in that range
# - there may be gaps between ranges.
# Approach: Look for the first index of a range where the range end value is greater
# than the code we are searching for. If this is found, and the start value for this range
# is less than or equal to the code we are searching for, then we have found the range.
# That is startValue <= characterUnicodeCode <= endValue
index = bisect.bisect_left(unicodeScriptRangeEnd, characterUnicodeCode)
if index == len(unicodeScriptRangeEnd):
	# There is no value of index such that: `characterUnicodeCode <= scriptCode[index][1]`;
	# characterUnicodeCode is larger than all of the range end values, so a range is not
	# found for the value:
	return None
# Since the range at index is the first where `characterUnicodeCode <= rangeEnd` is True,
# we now ensure that for the range at the index `characterUnicodeCode >= rangeStart`
# is also True.
candidateRange = scriptCode[index]
rangeStart = candidateRange[0]
if rangeStart > characterUnicodeCode:
	# characterUnicodeCode comes before the start of the range at index, so a range
	# is not found for the value.
	return None
rangeName = candidateRange[2]
return rangeName
The assumption of ranges overlapping is not valid.
At the outset, this algorithm looks to me to be O(N) while binary search is O(log N), N being the number of ranges. The loop
unicodeScriptRangeEnd = [k[1] for k in scriptCode]
runs for the number of ranges.
getScriptCode is called for every character, so it should be as fast as possible. Considering that, I think binary search is better.
I am not familiar with bisect, so I am looking into it.
Isn't it better to not modify this function if it is working fine?
The assumption of ranges overlapping is not valid.
Oh, the comment meant to say "must not overlap".
At the outset, this algorithm looks to me to be O(N) while binary search is O(log N), N being the number of ranges.
I believe bisect is O(log N). There is the step to split out the rangeEndValues, though this should only be done once, and as such should not be noticeable. If the performance of the function is such a concern, then we should measure it and compare the results of the optimised versions.
Isn't it better to not modify this function if it is working fine?
My concerns were readability, and edge cases that are easy to miss and therefore worth writing tests for.
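For anyone unfamiliar with it, bisect.bisect_left returns the first index at which a value could be inserted while keeping the list sorted, which is what makes the range-end lookup above work. A minimal illustration (the range-end values here are made up for demonstration):

import bisect

ends = [0x007F, 0x00FF, 0x017F]  # illustrative sorted range-end values
# First index whose end value is >= the code point being looked up:
assert bisect.bisect_left(ends, 0x0080) == 1
assert bisect.bisect_left(ends, 0x00FF) == 1  # equal value: leftmost position
assert bisect.bisect_left(ends, 0x0200) == 3  # past the end: returns len(ends)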
As I said earlier, bisect would be O(log N), but the loop before it would be O(N). We can take that loop out, but that method looks more complicated to me.
unicodeScriptRangeEnd = [k[1] for k in scriptCode]
I would rather write tests than change this algorithm, which took me some time to test. But at that time we didn't have unit tests in NVDA, so I had only done manual testing.
Yes, the creation of unicodeScriptRangeEnd is O(N). But it is only done once, so it does not need to be considered. Since speed is a concern, I thought I would measure these approaches; see this gist with a timing script. Place the file testSpeedUnicodeCharacterLookup.py in the <nvdaRepoRoot>/source/ directory of your branch.
The results on my machine were:
time python testSpeedUnicodeCharacterLookup.py
using 100 iterations
testing over range 0-65535, a total of 65535 values
withBisect: 5.69379344301
customBinarySearch: 12.9052367034
real 0m18.679s
user 0m0.015s
sys 0m0.000s
I have not profiled the two solutions, so I cannot say why bisect is so much faster.
You will also notice in this script that there are problems with using ord() on values of 0x10000 / 65536 or greater. This is something that might need to be considered for this PR. Is there any situation where the chr passed to getScriptCode() could be this large? The values in scriptRanges go up to 0Xe007f.
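To illustrate the issue (a sketch; the exact behavior depends on whether the Python 2 interpreter was built "narrow" (UCS-2) or "wide" (UCS-4)):

# On a narrow Python 2 build, a character above 0xFFFF is stored as a
# surrogate pair, so the "single character" is a string of length 2 and
# ord() raises a TypeError.
ch = u"\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above 0xFFFF
print(len(ch))      # 2 on a narrow build, 1 on a wide build
ord(ch)             # TypeError on a narrow build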
Thanks for your timing check. I have implemented your suggestion and removed my custom implementation of binary search.
source/languageDetection.py
Outdated
@rtype: string"""
# we are using a loop during search to maintain priority
for priorityLanguage, priorityScript, priorityDescription in languagePriorityListSpec:
	log.debugWarning(u"priorityLanguage {}, priorityScript {}, priorityDescription {}".format(priorityLanguage, priorityScript, priorityDescription))
Please remove
Done
source/languageDetection.py
Outdated
if isinstance(item, ScriptChangeCommand):
	scriptCode = item.scriptCode
else:
	log.debugWarning(u"script: {} for text {}".format(scriptCode, unicode(item)))
Should this be here? What does this log message tell us?
Removed
url = 'http://www.unicode.org/Public/UNIDATA/Scripts.txt'
scriptDataFile = urllib2.urlopen(url)
for line in scriptDataFile:
	p = re.findall(r'([0-9A-F]+)(?:\.\.([0-9A-F]+))?\W+(\w+)\s*#\s*(\w+)', line)
Please provide an example (in a comment) of the data that this should work on, including edge cases that had to be worked around.
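For reference, lines in Scripts.txt look like the following (illustrative lines in the published unicode.org format, spacing approximate; the regular expression captures the start code point, the optional end code point, the script name, and the general category):

0000..001F    ; Common # Cc  [32] <control-0000>..<control-001F>
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z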
@feerrenrut would this be part of 2018.1?
@feerrenrut requested changes. Once those review comments / changes are addressed, this PR will undergo another review and hopefully be accepted, or other changes will be requested. @dineshkaushal could you please look into the review comments above?
Oh, I am really sorry, I missed the entire comment; I had just read that it would take some time for @feerrenrut <https://github.com/feerrenrut> to review it.
@param scriptCode: the script identifier
@type scriptCode: int
"""
self.scriptCode = scriptCode
I'm a little bit worried about the name collision with the import on line 14: from unicodeScriptData import scriptCode. I don't think there is anywhere where this causes a bug, but I think it's a bit confusing, and isn't immediately obvious.
Removed the confusion by moving the getScriptCode function from languageDetection.py to unicodeScriptData.py. Earlier I did not put this function in unicodeScriptData as the entire file was being generated by unicodeScriptPrep. Now unicodeScriptPrep generates unicodeScriptDataTemp, from which the scriptRanges can be copied to unicodeScriptData. The scriptRanges list was earlier known as scriptCode.
…nicodeScriptData, added comments in unicodeScriptPrep explaining what the regular expression does and updated the unicode script ranges
characterUnicodeCode = ord(chr)
# Number should respect preferred language setting
# FullWidthNumber is in Common category, however, it indicates Japanese language context
if DIGIT_ZERO <= characterUnicodeCode <= DIGIT_NINE:
Rather than having a special condition for "Number" and "FullWidthNumber", I think it's cleaner to make these entries in the scriptRanges. Add these two entries explicitly from unicodeScriptPrep. Actually, this will require ensuring that there is no overlap with other entries, which may be tricky to do. So maybe not.
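For illustration only, such entries might look like this (a sketch, assuming the scriptRanges tuple format of (inclusiveStart, inclusiveEnd, name); both ranges currently fall inside "Common" ranges, which would have to be split to avoid overlap):

(0x0030, 0x0039, "Number"),          # ASCII digits 0-9
(0xFF10, 0xFF19, "FullWidthNumber"), # full-width digits 0-9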
What do you think of this suggestion? It would be interesting to test the performance of the getScriptFunction without the two special cases, and manually put those ranges for numbers in the scriptRanges list.
…@feerrenrut and added test cases for testing integrity of scriptRanges
ord() should be able to handle larger values if we compile with UCS-4, but I could not find out whether the NVDA build is compiled with that.
We check each character as it comes, and I am trying to determine whether each character that we check arrives as UTF-16 or UTF-32.
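For what it's worth, a quick way to check which kind of build is running (a sketch; sys.maxunicode reports the largest code point the build supports):

import sys
# 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build
print(hex(sys.maxunicode))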
@@ -885,28 +887,49 @@
( 0Xe0020 , 0Xe007f , "Common" ),
]

unicodeScriptRangeEnd = [ k[1] for k in scriptRanges]
Could you add a doc string here please? Mention that for performance reasons this should only be created once.
If NVDA can not process unicode characters greater
Again, this is something that should only be done once, and we should perhaps first test the timing to find how big a difference this actually makes, by just manually splitting the list within the tests. The reason I suggest testing for a narrow Python build, and only splitting in that case, is that in the future the build may change, and then we have support for this. Though there are likely many things that would need to be updated for that anyway. Will the move to Python 3 affect this?
OK, there is not much difference in performance.
Without limiting the unicodeRanges, the performance for standard English characters ranging from ASCII 0x20 to 0x7F over 10000 iterations is 0.784565008271.
After limiting the range to 0x10000, the performance for the same iterations is 0.76481559737.
I expect typical text processing to be 100 to 200 characters at a time, so this seems OK.
See the full result:
Full range:
using 10000 iterations
testing over range 32-127, a total of 95 values
withBisect: 0.784565008271
customBinarySearch: 1.55462321386
with range limited to 0x10000
using 10000 iterations
testing over range 32-127, a total of 95 values
withBisect: 0.76481559737
customBinarySearch: 1.3909142517
…criptRanges is greater than or equal to zero
Added the doc string.
Added the test for checking whether range start for the first range is greater than or equal to 0.
Could you fork that gist to show how you got this result?
Hiragana and Katakana are only used in Japanese. Bopomofo is used in Chinese.
@feerrenrut <https://github.com/feerrenrut> As far as I understand, only one function is the entry point for the detection code, so that should be enough. Secondly, this code only gets invoked when the user opts in to detection, so for those for whom this feature is not helpful, it has no impact.
It is, however, surprising that this is not a priority even though many users in multilingual regions might benefit from it. I am willing to work on it, but even after repeated requests I have not had an opportunity to set up a call to discuss it.
If we wait for every feature to be perfect, then we may not release anything.
@feerrenrut were there any further discussions on this PR with @dineshkaushal? Are there any further suggestions to be added / considered? I think this PR has been waiting far too long to be finalized.
@dineshkaushal did you consider further testing as @feerrenrut proposed above? I guess even if this PR is not perfect, it is an optional setting, and further issues can be reported by many users. I think this is stable enough at least for a try build or an alpha build. Then, if there are big issues, users will report them and it can be reverted. So is there a major performance lag? Note that if you enable page reporting in document formatting settings, NVDA becomes incredibly slow in big Word documents, yet that feature shipped in a final release although lots of users complain about performance issues in MS Word. So testing this PR in a big Word document with many tables and with pages enabled will bring NVDA to its limits if there are significant performance issues caused by this PR.
I completely agree with @Adriani90. As @dineshkaushal stressed, this PR is optional and won't have any impact if not enabled. I really think it deserves to see the light of day. @michaelDCurran could you kindly also give us your 2 cents on this matter?
I also completely agree with @Adriani90.
I understand the frustrations expressed here. Regarding priorities, I agree this is important, and expect this PR to become my main focus soon. This PR is stuck because there has still not been a thorough explanation of the performance impact.
only one function is the entry point for the detection code
How often is this function called per NVDA update loop / pump? I gave some suggestions for how to put the performance impact into perspective:
- The current timing specifically looks at one function and gets an average of 2.4 milliseconds for the detection.
- Some context needs to be given to this: how long (absolute and relative) is this compared to the standard core pump time?
- Analysis of the number of times this code will run within each pump (the min, norm and max).
Secondly, this code only gets invoked when the user chooses for the detection so for those, for whom this feature is not helpful, it does not impact in any way.
This is fine, but we want to be sure that we understand the performance impact of the feature when it is enabled. Speaking generally, it may seem like a good idea to merge a PR, wait for complaints and then fix them. But it actually means that NVDA maintainers are committing to maintain this code. It is also relying on users to test the feature; sometimes issues are not found for several releases. There is no guarantee that the original contributor will address issues raised. This is particularly concerning when our code review feedback is not addressed.
At least Chinese and Japanese are hard to differentiate by Unicode script type.
For example, in "実用日本语検定 - 百度文库", the words before the dash are Japanese and the words after the dash are Chinese. Both contain only CJK Unified Ideographs.
I have started looking at this PR. The results of running the unit tests on my machine are as follows; I believe the units are seconds:
Manual testing: Using the "SampleText.txt" document (from the unit tests, which has many lines in different scripts), I open it in Notepad++ and move by line (using the eSpeak synth), paying attention to the responsiveness of NVDA.
Further performance testing: To put the performance of this feature in perspective, let's answer the following:
To achieve this I:
UX
Other issues
Perhaps I should clarify my last comment. I am happy with the performance of this part of the code, though there does seem to be a performance issue with Notepad++ when opening the sample.txt file. This seems to be unrelated to NVDA; it also occurs when NVDA is not running. However, I would advise against using the sample.txt file for testing.
In order for this issue to progress, there are several other concerns that I have also highlighted: several errors, some issues with the UX, missing user guide updates, and the branch needs to be rebased and made ready for merging.
Found the following error:
Closing this PR due to lack of activity. This feature is on our roadmap; in the meantime, anyone interested may take it on.
I vote for reopening this to make it easier to find for new developers who want to take over this very valuable work. Indeed, the concept of this PR has not been rejected, and in my view finding closed pull requests is very difficult on GitHub. I suggest labelling this pull request as abandoned rather than closing it. cc: @CyrilleB79, maybe you also have an opinion on this. In the end we need a decision from @seanbudd and @michaelDCurran on how to proceed with this.
Since I'm asked for my opinion: it seems to me that the process is that only PRs that have a chance to be finalized by the initial author are kept open, so NV Access can have a look at open PRs to review them. A contributor developer wanting to take over an abandoned PR may search for closed (not merged) PRs with these 2 labels, provided PRs are correctly labeled. Otherwise, triage work should be done on closed PRs to label them correctly. If needed, a comment in the corresponding issue can be added to mention this existing PR and summarize its state (i.e. quite advanced development). Triage and contribution documentation is currently being updated; it may be the opportunity to clarify these points if it's not already clear.
This assumption is actually wrong and contradicts the open source principles of this project. There are pull requests that have been taken over by others; see for example #15331 replacing #11270. Had #11270 been closed, I am quite sure @LeonarddeR would not have found it as fast unless he was tagged on that PR or was aware of the work without knowing the PR number exactly.
@Adriani90 I disagree with you when you state that my assumption contradicts the open source principles of this project. @seanbudd, @michaelDCurran:
Abandoned PRs should be closed. Draft state implies that work will continue. Ready state implies that it is ready for review.
For this, the PR needs of course to have the correct label. I marked this as abandoned so someone can find it more easily by filtering the corresponding label.
Link to issue number:
#2990
Summary of the issue:
Often text is written in multiple languages, but some applications such as Notepad do not detect language automatically. The synthesizer needs to know which language to use for a given script.
Description of how this pull request fixes the issue:
We use the Unicode script property to detect the script of a character. This approach is different from block-based approaches; the problem with a block-based approach is that characters of the same script can be spread across different blocks.
At the next level, if a script could be used by multiple languages, there is an option in the Language detection dialog to choose the preferred language.
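As an illustration of the overall flow described above (a sketch only; detectLanguage, getScriptCode and preferredLanguageForScript are illustrative names, not the PR's actual API):

# Map each character to a script via the sorted, inclusive (start, end, script)
# ranges, then map that script to the user's preferred language for it.
def detectLanguage(text, getScriptCode, preferredLanguageForScript):
	for ch in text:
		script = getScriptCode(ch)  # e.g. "Hiragana", "Han", "Latin"
		language = preferredLanguageForScript.get(script)
		if language is not None:
			return language
	return None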
Testing performed:
@nishimoto and I have developed 21 test cases to verify our expectations.
Known issues with pull request:
The language detection may not work for some languages, so I have added a disable script detection option in the language detection dialog. For now, this option is disabled by default. Thanks to @nishimoto, Japanese language support should be working fine.
Change log entry:
Add the ability to detect language based on the Unicode script property