Auto language detection based on script detection In t2990 review #7629

Closed
wants to merge 17 commits into from
1 change: 1 addition & 0 deletions source/languageDetection.py
@@ -19,6 +19,7 @@
# maintains list of priority languages as a list of languageID, ScriptName, and LanguageDescription
languagePriorityListSpec = []
Contributor

This might be a good place to use a namedtuple, then we can refer to these by name rather than remembering the order.

Contributor Author

Thanks for your suggestions! Using a namedtuple has improved code readability. I didn't know about this feature.
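To illustrate the namedtuple suggestion above, here is a minimal sketch. The type name `LanguagePriorityEntry`, its field names, and the sample entries are hypothetical; the PR only states that each entry holds a language ID, a script name, and a language description.

```python
from collections import namedtuple

# Hypothetical type and field names, chosen for illustration only.
LanguagePriorityEntry = namedtuple(
	"LanguagePriorityEntry", "languageID scriptName description"
)

# Sample data, not taken from the PR.
languagePriorityListSpec = [
	LanguagePriorityEntry("hi", "Devanagari", "Hindi"),
	LanguagePriorityEntry("el", "Greek", "Greek"),
]


def scriptsInPriorityOrder(spec):
	"""Return the script names in priority order, read by field name."""
	return [entry.scriptName for entry in spec]
```

Entries remain ordinary tuples, so existing positional access such as `entry[0]` keeps working while new code can use `entry.languageID`.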


"""scriptIDToLangID is reverse of langIDToScriptID and is used to obtain language of the current script. language of a script is used to detect whether a chunk should be broken for languages that use multiple scripts."""
scriptIDToLangID = {}
Contributor

Docstring? What is this used for, and why does it exist when langIDToScriptID exists?

Contributor Author

scriptIDToLangID is the reverse of langIDToScriptID and is used to obtain the language of the current script. The language of a script is used to detect whether a chunk should be broken for languages that use multiple scripts.

Contributor Author

Added the docstring.
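As a rough illustration of the reverse mapping described in this thread, the sketch below inverts a language-to-scripts dictionary. The shape of `langIDToScriptID` (a language ID mapped to a tuple of script names) and the helper name `buildScriptIDToLangID` are assumptions, not taken from the PR.

```python
def buildScriptIDToLangID(langIDToScriptID):
	"""Invert a {languageID: (scriptName, ...)} mapping (assumed shape).

	If two languages list the same script, the first one encountered
	while iterating over the dictionary is kept.
	"""
	scriptIDToLangID = {}
	for langID, scriptIDs in langIDToScriptID.items():
		for scriptID in scriptIDs:
			scriptIDToLangID.setdefault(scriptID, langID)
	return scriptIDToLangID


# Sample data for illustration only.
exampleLangIDToScriptID = {
	"ja": ("Han", "Hiragana", "Katakana"),
	"hi": ("Devanagari",),
}
exampleScriptIDToLangID = buildScriptIDToLangID(exampleLangIDToScriptID)
```

With such a mapping, a run of Hiragana followed by Katakana resolves to the same language for both scripts, so the text would not need to be split into separate chunks.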


LanguageDescription = namedtuple("LanguageDescription" , "languageID description")
45 changes: 34 additions & 11 deletions source/unicodeScriptData.py
@@ -10,6 +10,8 @@
These entries should not overlap, but there could be gaps
"""

import bisect

# unicode digit constants
DIGIT_ZERO = 0x30
DIGIT_NINE = 0x39
@@ -885,28 +887,49 @@
( 0Xe0020 , 0Xe007f , "Common" ),
]


unicodeScriptRangeEnd = [ k[1] for k in scriptRanges]
Contributor

Could you add a doc string here please? Mention that for performance reasons this should only be created once.


def getScriptCode(chr):
	"""performs a binary search in scriptRanges for unicode ranges
	@param chr: character for which a script should be found
	@type chr: string
	@return: script code
	@rtype: int"""
	mStart = 0
	mEnd = len(scriptRanges)-1
	characterUnicodeCode = ord(chr)
	# Number should respect preferred language setting
	# FullWidthNumber is in Common category, however, it indicates Japanese language context
	if DIGIT_ZERO <= characterUnicodeCode <= DIGIT_NINE:
Contributor

Rather than having a special condition for "Number" and "FullWidthNumber", I think it's cleaner to make these entries in scriptRanges. Add these two entries explicitly from unicodeScriptPrep. Actually, this will require ensuring that there is no overlap with other entries, which may be tricky to do. So maybe not.

Contributor

What do you think of this suggestion? It would be interesting to test the performance of the getScriptCode function without the two special cases, with those number ranges added manually to the scriptRanges list.
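One way to address the overlap concern raised above is to split the enclosing range when the number range is added. The following is only a sketch: `insertSubRange` is a hypothetical helper, the two-entry table is toy data, and it assumes the new range lies entirely inside one existing range.

```python
def insertSubRange(ranges, start, end, name):
	"""Return a new sorted range list in which [start, end] is labelled name.

	Assumes [start, end] lies entirely inside one existing range; that
	range is split so the resulting entries still do not overlap.
	"""
	result = []
	for rangeStart, rangeEnd, rangeName in ranges:
		if rangeStart <= start and end <= rangeEnd:
			if rangeStart < start:
				# part of the old range before the new sub-range
				result.append((rangeStart, start - 1, rangeName))
			result.append((start, end, name))
			if end < rangeEnd:
				# part of the old range after the new sub-range
				result.append((end + 1, rangeEnd, rangeName))
		else:
			result.append((rangeStart, rangeEnd, rangeName))
	return result


# Toy table for illustration, not the PR's scriptRanges.
toyRanges = [(0x00, 0x40, "Common"), (0x41, 0x5A, "Latin")]
toyRangesWithDigits = insertSubRange(toyRanges, 0x30, 0x39, "Number")
```

After the split, the digits have their own entry, and the two special-case comparisons before the search would no longer be needed for this table. Whether that is faster in practice would still have to be measured.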

		return "Number"
	elif FULLWIDTH_ZERO <= characterUnicodeCode <= FULLWIDTH_NINE:
		return "FullWidthNumber"
	while( mEnd >= mStart ):
		midPoint = (mStart + mEnd ) >> 1
		if characterUnicodeCode < scriptRanges[midPoint][0]:
			mEnd = midPoint -1
		elif characterUnicodeCode > scriptRanges[midPoint][1]:
			mStart = midPoint + 1
		else:
			return scriptRanges[midPoint][2]
	return None

	# Based on the following assumptions:
	# - ranges must not overlap
	# - range end and start values are included in that range
	# - there may be gaps between ranges.

	# Approach: Look for the first index of a range where the range end value is greater
	# than or equal to the code we are searching for. If this is found, and the start value
	# for this range is less than or equal to the code we are searching for, then we have
	# found the range. That is: startValue <= characterUnicodeCode <= endValue

	index = bisect.bisect_left(unicodeScriptRangeEnd, characterUnicodeCode )
	if index == len(unicodeScriptRangeEnd):
		# there is no value of index such that: `characterUnicodeCode <= scriptRanges[index][1]`
		# characterUnicodeCode is larger than all of the range end values so a range is not
		# found for the value:
		return None

	# Since the range at index is the first where `characterUnicodeCode <= rangeEnd` is True,
	# we now ensure that for the range at the index `characterUnicodeCode >= rangeStart`
	# is also True.
	candidateRange = scriptRanges[index]
	rangeStart = candidateRange[0]
	if rangeStart > characterUnicodeCode :
		# characterUnicodeCode comes before the start of the range at index so a range
		# is not found for the value
		return None

	rangeName = candidateRange[2]
	return rangeName
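The bisect-based lookup above can be tried in isolation with a small table. This is a sketch under stated assumptions: `toyScriptRanges`, `toyRangeEnds`, and `lookupScript` are illustrative names, and the ranges are simplified (the real Unicode data has more entries and gaps).

```python
import bisect

# Sorted, non-overlapping (start, end, name) entries with inclusive bounds.
toyScriptRanges = [
	(0x0041, 0x005A, "Latin"),
	(0x0061, 0x007A, "Latin"),
	(0x0370, 0x03FF, "Greek"),
]
# Built once at import time so each lookup only pays for the binary search.
toyRangeEnds = [end for _start, end, _name in toyScriptRanges]


def lookupScript(character):
	"""Return the script name for character, or None if no range contains it."""
	code = ord(character)
	# First index whose range end is >= code.
	index = bisect.bisect_left(toyRangeEnds, code)
	if index == len(toyRangeEnds):
		# code is beyond the end of the last range
		return None
	start, _end, name = toyScriptRanges[index]
	if start > code:
		# code falls in the gap before this range
		return None
	return name
```

For example, `lookupScript("A")` and `lookupScript("Z")` both return "Latin" because the bounds are inclusive, while `lookupScript("[")` (0x5B) falls in the gap between the two Latin ranges and returns None.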
17 changes: 17 additions & 0 deletions tests/unit/test_languageDetection.py
@@ -9,6 +9,7 @@
import unittest
import languageDetection
from speech import LangChangeCommand
from unicodeScriptData import scriptRanges
import config

class TestLanguageDetection(unittest.TestCase):
@@ -271,3 +272,19 @@ def test_englishWithGreekTextWithEnglishAsDefaultAndPreferedLanguageAsHindi(self
		languageDetection.updateLanguagePriorityFromConfig()
		self.compareSpeechSequence(detectedLanguageSequence , testSequence)

	def test_unicodeRangesEntryStartLessEqualEnd(self):
		for scriptRangeStart, scriptRangeEnd, scriptName in scriptRanges:
			self.assertTrue(scriptRangeStart <= scriptRangeEnd)

	def test_unicodeRangesEntriesDoNotOverlapAndAreSorted(self):
		for index in xrange( len(scriptRanges) -1):
			# check that there is no overlap
			currentRange = scriptRanges[index]
			nextRange = scriptRanges[index+1]
			currentRangeEnd = currentRange[1]
			nextRangeStart = nextRange[0]
			self.assertTrue(currentRangeEnd < nextRangeStart)

	def test_unicodeRangesEntryScriptNamesExist(self):
		for scriptRangeStart, scriptRangeEnd, scriptName in scriptRanges:
			self.assertTrue(scriptName)
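The tests above check the invariants that the bisect lookup relies on. A complementary check, sketched here on toy data, compares the bisect lookup against a simple linear scan for every code point in a small interval. All names (`toyRanges`, `bisectLookup`, `linearLookup`, `findMismatches`) are hypothetical and not part of the PR.

```python
import bisect

# Toy data: sorted, non-overlapping, inclusive ranges with gaps.
toyRanges = [
	(0x30, 0x39, "Number"),
	(0x41, 0x5A, "Latin"),
	(0x61, 0x7A, "Latin"),
]
toyRangeEnds = [end for _start, end, _name in toyRanges]


def bisectLookup(code):
	"""Binary search on the range ends, then confirm the range start."""
	index = bisect.bisect_left(toyRangeEnds, code)
	if index < len(toyRanges) and toyRanges[index][0] <= code:
		return toyRanges[index][2]
	return None


def linearLookup(code):
	"""Reference implementation: scan every range in order."""
	for start, end, name in toyRanges:
		if start <= code <= end:
			return name
	return None


def findMismatches(maxCode):
	"""Return the code points in [0, maxCode] where the two lookups disagree."""
	return [
		code for code in range(maxCode + 1)
		if bisectLookup(code) != linearLookup(code)
	]
```

An empty result from `findMismatches` indicates that both lookups agree on every code point in the interval, including range boundaries and gaps.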