Auto language detection based on script detection In t2990 #7629
Conversation
…and language priority selection by users
shouldChangeScript is not used
fixes some issues with UnicodeEncodeError and log messages, fixes handling of number characters followed by Japanese
nvaccess#2990 Japanese test cases
…rement for outer punctuation to be associated for outer script
…nguage detection does not work for a language, it can be ignored
This is a big change that I will have to go through carefully again. Thanks for all the work!
#while adding new languages we filter out existing languages in the preferred language list
ignoreLanguages = {x[0] for x in self.languageNames}
languageList = languageDetection.getLanguagesWithDescriptions(ignoreLanguages)
dialog = wx.SingleChoiceDialog(None,
Would you consider using a wx.MultiChoiceDialog? This would allow the user to add all of the languages they care about at once.
The reason we are not using a multi-choice dialog is that when a language is added, it goes to the top of the list, and in most cases users will only add one language at a time.
The order of these languages also affects which language gets priority.
source/unicodeScriptData.py
Outdated
@@ -0,0 +1,824 @@
scriptCode= [ |
Please add the usual copyright header. Also a docstring / explanation for this would be helpful.
Include in the docstring answers to the following:
- What each value represents, is it (unicodeStartOfRange, unicodeEndOfRange, rangeDescription)?
- Should value ranges overlap?
- Should all values be represented?
- Are these values contiguous?
Please also add unit tests to confirm the above restrictions, should they exist.
Added copyright and a docstring with details about each item, but adding test cases may not be needed, as these values are obtained from scripts.txt and that is only done occasionally.
Even though this data is created programmatically, tests for the data will help to ensure its correctness. It will be hard to test the script that creates this data, and removing the assumptions made would complicate the code in NVDA. An automated test to ensure that these assumptions are correct will make it much easier to track down the kind of bugs that would arise if this data does get out of order or values overlap.
Please also state if the start / end of each range is inclusive or exclusive.
This range comes from scripts.txt, which comes from unicode.org. Should we still check it? I am assuming (though I could be wrong) that Unicode would not publish an overlapping list.
Another point: this list is sorted and is only regenerated once in a while.
I'm more concerned that a mistake is made in the process of creating this. Perhaps the source data on unicode.org changes format and this causes a bug in the importer; a mistake is even more likely if it's been a while since someone last went through the process, and there is nothing in the production code to ensure the assumptions are met. It's quite hard to manually check that our assumptions about the scriptCode data hold true, but it's quite easy to write a unit test to do the same.
Added 3 tests for checking the validity of the list.
- test_unicodeRangesEntryStartLessEqualEnd
- test_unicodeRangesEntriesDoNotOverlapAndAreSorted
- test_unicodeRangesEntryScriptNamesExist
Do let me know if we need to add more tests.
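For reference, the first two checks might look roughly like this (a sketch only; the module and data names assume this PR's scriptRanges list of (start, end, scriptName) tuples, and the actual test code in the PR may differ):

import unittest
from unicodeScriptData import scriptRanges  # assumed data location, for illustration

class TestScriptRanges(unittest.TestCase):
	def test_unicodeRangesEntryStartLessEqualEnd(self):
		# Both bounds are inclusive, so start must not exceed end.
		for start, end, name in scriptRanges:
			self.assertLessEqual(start, end)

	def test_unicodeRangesEntriesDoNotOverlapAndAreSorted(self):
		# Each range must begin strictly after the previous one ends.
		for previous, current in zip(scriptRanges, scriptRanges[1:]):
			self.assertLess(previous[1], current[0])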
For completeness, we could also test that the first element's start value is greater than 0.
source/languageDetection.py
Outdated
elif 0xff10 <= characterUnicodeCode <= 0xff19:
	return "FullWidthNumber"
while( mEnd >= mStart ):
	midPoint = (mStart + mEnd) >> 1
Is there not a built-in way of performing this search, so we don't reinvent the wheel, so to speak?
I don't think so, as it is not a simple binary search.
You could use bisect. In the initialisation, create a list of the rangeEndValues like so:
unicodeScriptRangeEnd = [k[1] for k in scriptCode]
Then in this function you can do:
# Based on the following assumptions:
# - ranges must overlap
# - range end and start values are included in that range
# - there may be gaps between ranges.
# Approach: Look for the first index of a range where the range end value is greater
# than the code we are searching for. If this is found, and the start value for this range
# is less than or equal to the code we are searching for, then we have found the range.
# That is startValue <= characterUnicodeCode <= endValue
index = bisect.bisect_left(unicodeScriptRangeEnd, characterUnicodeCode)
if index == len(unicodeScriptRangeEnd):
	# There is no value of index such that: `characterUnicodeCode <= scriptCode[index][1]`;
	# characterUnicodeCode is larger than all of the range end values, so a range is not
	# found for the value:
	return None
# Since the range at index is the first where `characterUnicodeCode <= rangeEnd` is True,
# we now ensure that for the range at the index `characterUnicodeCode >= rangeStart`
# is also True.
candidateRange = scriptCode[index]
rangeStart = candidateRange[0]
if rangeStart > characterUnicodeCode:
	# characterUnicodeCode comes before the start of the range at index, so a range
	# is not found for the value.
	return None
rangeName = candidateRange[2]
return rangeName
The assumption of ranges overlapping is not valid.
At the outset, this algorithm looks to me to be O(N) while binary search is O(log N), N being the number of ranges. The loop
unicodeScriptRangeEnd = [k[1] for k in scriptCode]
runs for the number of ranges.
getScriptCode is called for every character, so it should be as fast as possible. Considering that, I think binary search is better.
I am not familiar with bisect, so I am looking into it.
Isn't it better to not modify this function if it is working fine?
The assumption of ranges overlapping is not valid.
Oh, the comment meant to say "must not overlap".
At the outset, this algorithm looks to me to be O(N) while binary search is O(log N), N being the number of ranges.
I believe bisect is O(log N). There is the step to split out the rangeEndValues, though this should only be done once, and as such should not be noticeable. If the performance of the function is such a concern, then we should measure it and compare the results of the optimised versions.
Isn't it better to not modify this function if it is working fine?
My concerns were readability, and edge cases that are easy to miss and therefore worth writing tests for.
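For anyone unfamiliar with it, bisect.bisect_left returns the first index at which a value could be inserted while keeping the list sorted, which is what makes the range-end lookup above work. A minimal illustration (the range-end values here are made up for demonstration):

import bisect

ends = [0x007F, 0x00FF, 0x017F]  # illustrative sorted range-end values
# First index whose end value is >= the code point being looked up:
assert bisect.bisect_left(ends, 0x0080) == 1
assert bisect.bisect_left(ends, 0x00FF) == 1  # equal value: leftmost position
assert bisect.bisect_left(ends, 0x0200) == 3  # past the end: returns len(ends)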
As I said earlier, bisect would be O(log N), but the loop before it would be O(N). We can take that loop out, but that method looks more complicated to me.
unicodeScriptRangeEnd = [k[1] for k in scriptCode]
I would rather write tests than change this algorithm, which took me some time to test. But at that time we didn't have unit tests in NVDA, so I had only done manual testing.
Yes, the creation of unicodeScriptRangeEnd is O(N). But it is only done once, so it does not need to be considered. Since speed is a concern, I thought I would measure these approaches; see this gist with a timing script. Place the file testSpeedUnicodeCharacterLookup.py in the <nvdaRepoRoot>/source/ directory of your branch.
The results on my machine were:
time python testSpeedUnicodeCharacterLookup.py
using 100 iterations
testing over range 0-65535, a total of 65535 values
withBisect: 5.69379344301
customBinarySearch: 12.9052367034
real 0m18.679s
user 0m0.015s
sys 0m0.000s
I have not profiled the two solutions, so I cannot say why bisect is so much faster.
You will also notice in this script that there are problems with using ord() on values of 0x10000 / 65536 or greater. This is something that might need to be considered for this PR. Is there any situation where the chr passed to getScriptCode() could be this large? The values in scriptRanges go up to 0Xe007f.
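To illustrate the issue (a sketch; the exact behavior depends on whether the Python 2 interpreter was built "narrow" (UCS-2) or "wide" (UCS-4)):

# On a narrow Python 2 build, a character above 0xFFFF is stored as a
# surrogate pair, so the "single character" is a string of length 2 and
# ord() raises a TypeError.
ch = u"\U0001D11E"  # MUSICAL SYMBOL G CLEF, a code point above 0xFFFF
print(len(ch))      # 2 on a narrow build, 1 on a wide build
ord(ch)             # TypeError on a narrow build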
Thanks for your timing check. I have implemented your suggestion and removed my custom implementation of binary search.
source/languageDetection.py
Outdated
@rtype: string"""
# we are using a loop during search to maintain priority
for priorityLanguage, priorityScript, priorityDescription in languagePriorityListSpec:
	log.debugWarning(u"priorityLanguage {}, priorityScript {}, priorityDescription {}".format(priorityLanguage, priorityScript, priorityDescription))
Please remove
Done
source/languageDetection.py
Outdated
if isinstance(item, ScriptChangeCommand):
	scriptCode = item.scriptCode
else:
	log.debugWarning(u"script: {} for text {}".format(scriptCode, unicode(item)))
Should this be here? What does this log message tell us?
Removed
url = 'http://www.unicode.org/Public/UNIDATA/Scripts.txt'
scriptDataFile = urllib2.urlopen(url)
for line in scriptDataFile:
	p = re.findall(r'([0-9A-F]+)(?:\.\.([0-9A-F]+))?\W+(\w+)\s*#\s*(\w+)', line)
Please provide an example (in a comment) of the data that this should work on, including edge cases that had to be worked around.
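For reference, lines in Scripts.txt look like the following (illustrative lines in the published unicode.org format, spacing approximate; the regular expression captures the start code point, the optional end code point, the script name, and the general category):

0000..001F    ; Common # Cc  [32] <control-0000>..<control-001F>
0020          ; Common # Zs       SPACE
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z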
@feerrenrut would this be part of 2018.1?
@feerrenrut requested changes. Once those review comments / changes are addressed, this PR will undergo another review and hopefully be accepted, or other changes will be requested. @dineshkaushal could you please look into the review comments above?
Oh, I am really sorry, I missed the entire comment; I had just read that it would take some time for @feerrenrut <https://github.com/feerrenrut> to review it.
@param scriptCode: the script identifier
@type scriptCode: int
"""
self.scriptCode = scriptCode
I'm a little bit worried about the name collision with the import on line 14: from unicodeScriptData import scriptCode. I don't think there is anywhere where this causes a bug, but I think it's a bit confusing, and isn't immediately obvious.
Removed the confusion by moving the getScriptCode function from languageDetection.py to unicodeScriptData.py. Earlier I did not put this function in unicodeScriptData as the entire file was being generated by unicodeScriptPrep. Now unicodeScriptPrep generates unicodeScriptDataTemp, from which the scriptRanges can be copied to unicodeScriptData. The scriptRanges list was earlier known as scriptCode.
…nicodeScriptData, added comments in unicodeScriptPrep explaining what the regular expression does and updated the unicode script ranges
characterUnicodeCode = ord(chr)
# Number should respect preferred language setting
# FullWidthNumber is in Common category, however, it indicates Japanese language context
if DIGIT_ZERO <= characterUnicodeCode <= DIGIT_NINE:
Rather than having a special condition for "Number" and "FullWidthNumber", I think it's cleaner to make these entries in the scriptRanges. Add these two entries explicitly from unicodeScriptPrep. Actually, this will require ensuring that there is no overlap with other entries, which may be tricky to do. So maybe not.
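For illustration only, such entries might look like this (a sketch, assuming the scriptRanges tuple format of (inclusiveStart, inclusiveEnd, name); both ranges currently fall inside "Common" ranges, which would have to be split to avoid overlap):

(0x0030, 0x0039, "Number"),          # ASCII digits 0-9
(0xFF10, 0xFF19, "FullWidthNumber"), # full-width digits 0-9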
What do you think of this suggestion? It would be interesting to test the performance of the getScriptFunction without the two special cases, and manually put those ranges for numbers in the scriptRanges list.
…@feerrenrut and added test cases for testing integrity of scriptRanges
ord() should be able to handle larger values if we compile with UCS-4, but I could not find out whether the NVDA build is compiled with that.
We check each character as it comes, and I am trying to determine whether each character that we check arrives as UTF-16 or UTF-32.
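For what it's worth, a quick way to check which kind of build is running (a sketch; sys.maxunicode reports the largest code point the build supports):

import sys
# 0xFFFF on a narrow (UCS-2) build, 0x10FFFF on a wide (UCS-4) build
print(hex(sys.maxunicode))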
@@ -885,28 +887,49 @@
( 0Xe0020 , 0Xe007f , "Common" ),
]

unicodeScriptRangeEnd = [ k[1] for k in scriptRanges]
Could you add a doc string here please? Mention that for performance reasons this should only be created once.
If NVDA can not process unicode characters greater
Again, this is something that should only be done once, and we should perhaps first test the timing to find how big a difference this actually makes, by just manually splitting the list within the tests. The reason I suggest testing for a narrow Python build, and only splitting in that case, is that in the future the build may change, and then we have support for this. Though there are likely many things that would need to be updated for that anyway. Will the move to Python 3 affect this?
OK, there is not much difference in performance.
Without limiting the unicodeRanges, the performance for standard English characters ranging from ASCII 0x20 to 0x7F over 10000 iterations is 0.784565008271.
After limiting the range to 0x10000, the performance for the same iterations is 0.76481559737.
I expect typical text processing to be 100 to 200 characters at a time, so this seems OK.
See the full result:
Full range:
using 10000 iterations
testing over range 32-127, a total of 95 values
withBisect: 0.784565008271
customBinarySearch: 1.55462321386
with range limited to 0x10000
using 10000 iterations
testing over range 32-127, a total of 95 values
withBisect: 0.76481559737
customBinarySearch: 1.3909142517
…criptRanges is greater than or equal to zero
Added the doc string.
Added the test for checking whether range start for the first range is greater than or equal to 0.
Could you fork that gist to show how you got this result?
Hiragana and Katakana are only used in Japanese. Bopomofo is used in Chinese.
@feerrenrut <https://github.com/feerrenrut> As far as I understand, only one function is the entry point for the detection code, so that should be enough. Secondly, this code only gets invoked when the user opts in to detection, so for those for whom this feature is not helpful, it has no impact.
It is, however, surprising that this is not a priority even though many users in multilingual regions might benefit from it. I am willing to work on it, but even after repeated requests I have not had an opportunity to set up a call to discuss it.
If we wait for every feature to be perfect, then we may not release anything.
@feerrenrut were there any further discussions on this PR with @dineshkaushal? Are there any further suggestions to be added / considered? I think this PR has been waiting far too long to be finalized.
@dineshkaushal did you consider further testing as @feerrenrut proposed above? I guess even if this PR is not perfect, it is an optional setting, and further issues can be reported by many users. I think this is stable enough at least for a try build or an alpha build. Then, if there are big issues, users will report them and it can be reverted. So is there a major performance lag? Note that if you enable page reporting in document formatting settings, NVDA becomes incredibly slow in big Word documents, yet that feature shipped in a final release although lots of users complain about performance issues in MS Word. So testing this PR in a big Word document with many tables and with pages enabled will bring NVDA to its limits if there are significant performance issues caused by this PR.
I completely agree with @Adriani90. As @dineshkaushal stressed, this PR is optional and won't have any impact if not enabled. I really think it deserves to see the light of day. @michaelDCurran could you kindly also give us your 2 cents on this matter?
I also completely agree with @Adriani90.
I understand the frustrations expressed here. Regarding priorities, I agree this is important, and expect this PR to become my main focus soon. This PR is stuck because there has still not been a thorough explanation of the performance impact.
only one function is the entry point for the detection code
How often is this function called per NVDA update loop / pump? I gave some suggestions for how to put the performance impact into perspective:
- The current timing specifically looks at one function and gets an average of 2.4 milliseconds for the detection.
- Some context needs to be given to this: how long (absolute and relative) is this compared to the standard core pump time?
- Analysis of the number of times this code will run within each pump (the min, norm and max).
Secondly, this code only gets invoked when the user chooses for the detection so for those, for whom this feature is not helpful, it does not impact in any way.
This is fine, but we want to be sure that we understand the performance impact of the feature when it is enabled. Speaking generally, it may seem like a good idea to merge a PR, wait for complaints and then fix them. But it actually means that NVDA maintainers are committing to maintain this code. It is also relying on users to test the feature; sometimes issues are not found for several releases. There is no guarantee that the original contributor will address issues raised. This is particularly concerning when our code review feedback is not addressed.
At least Chinese and Japanese are hard to differentiate by Unicode script type.
For example, in "実用日本语検定 - 百度文库", the words before the dash are Japanese and the words after the dash are Chinese. Both contain only CJK Unified Ideographs.
I have started looking at this PR. The results of running the unit tests on my machine are as follows; I believe the units are seconds:
Manual testing: Using the "SampleText.txt" document (from the unit tests, which has many lines in different scripts), I open it in Notepad++ and move by line (using the eSpeak synth), paying attention to the responsiveness of NVDA.
Further performance testing: To put the performance of this feature in perspective, let's answer the following:
To achieve this I:
UX
Other issues
Perhaps I should clarify my last comment. I am happy with the performance of this part of the code, though there does seem to be a performance issue with Notepad++ when opening the sample.txt file. This seems to be unrelated to NVDA; it also occurs when NVDA is not running. However, I would advise against using the sample.txt file for testing.
In order for this issue to progress, there are several other concerns that I have also highlighted: several errors, some issues with the UX, missing user guide updates, and the branch needs to be rebased and made ready for merging.
Found the following error:
Closing this PR due to lack of activity. This feature is on our roadmap; in the meantime, anyone interested may take it on.
I vote for reopening this to make it easier to find for new developers who want to take over this very valuable work. Indeed, the concept of this PR has not been rejected, and in my view finding closed pull requests is very difficult on GitHub. I suggest labelling this pull request as abandoned rather than closing it. cc: @CyrilleB79, maybe you also have an opinion on this. In the end we need a decision from @seanbudd and @michaelDCurran on how to proceed with this.
Since I'm asked for my opinion: it seems to me that the process is that only PRs that have a chance to be finalized by the initial author are kept open, so NV Access can have a look at open PRs to review them. A contributor developer wanting to take over an abandoned PR may search for closed (not merged) PRs with these 2 labels, provided PRs are correctly labeled. Otherwise, triage work should be done on closed PRs to label them correctly. If needed, a comment in the corresponding issue can be added to mention this existing PR and summarize its state (i.e. quite advanced development). Triage and contribution documentation is currently being updated; it may be the opportunity to clarify these points if it's not already clear.
This assumption is actually wrong and contradicts the open source principles of this project. There are pull requests that have been taken over by others; see for example #15331 replacing #11270. Had #11270 been closed, I am quite sure @LeonarddeR would not have found it as fast unless he was tagged on that PR or was aware of the work without knowing the PR number exactly.
@Adriani90 I disagree with you when you state that my assumption contradicts the open source principles of this project. @seanbudd, @michaelDCurran:
Abandoned PRs should be closed. Draft state implies that work will continue. Ready state implies that it is ready for review.
For this, the PR needs of course to have the correct label. I marked this as abandoned so someone can find it more easily by filtering the corresponding label.
Link to issue number:
#2990
Summary of the issue:
Often text is written in multiple languages, but some applications such as Notepad do not detect language automatically. The synthesizer needs to know which language to use for a given script.
Description of how this pull request fixes the issue:
We use the Unicode script property to detect the script of a character. This approach is different from block-based approaches; the problem with a block-based approach is that characters of the same script can be spread across different blocks.
At the next level, if a script could be used by multiple languages, there is an option in the Language detection dialog to choose the preferred language.
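As an illustration of the overall flow described above (a sketch only; detectLanguage, getScriptCode and preferredLanguageForScript are illustrative names, not the PR's actual API):

# Map each character to a script via the sorted, inclusive (start, end, script)
# ranges, then map that script to the user's preferred language for it.
def detectLanguage(text, getScriptCode, preferredLanguageForScript):
	for ch in text:
		script = getScriptCode(ch)  # e.g. "Hiragana", "Han", "Latin"
		language = preferredLanguageForScript.get(script)
		if language is not None:
			return language
	return None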
Testing performed:
@nishimoto and I have developed 21 test cases to verify our expectations.
Known issues with pull request:
The language detection may not work for some languages, so I have added a disable script detection option in the language detection dialog. For now, this option is disabled by default. Thanks to @nishimoto, Japanese language support should be working fine.
Change log entry:
Add the ability to detect language based on the Unicode script property