Auto language detection based on script detection In t2990 review #7629
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -19,6 +19,7 @@ | |
# maintains list of priority languages as a list of languageID, ScriptName, and LanguageDescription | ||
languagePriorityListSpec = [] | ||
|
||
"""scriptIDToLangID is the reverse of langIDToScriptID and is used to obtain the language of the current script. The language of a script is used to detect whether a chunk should be broken for languages that use multiple scripts.""" | ||
scriptIDToLangID = {} | ||
Doc string? What is this used for, why does it exist when
scriptIDToLangID is the reverse of langIDToScriptID and is used to obtain the language of the current script. The language of a script is used to detect whether a chunk should be broken for languages that use multiple scripts.
Added the doc string. |
||
|
||
LanguageDescription = namedtuple("LanguageDescription" , "languageID description") | ||
|
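The reverse lookup described in the doc string above can be sketched as follows. This is a minimal illustration, not the PR's code: the contents of `langIDToScriptID` are not shown in this diff, so the sample entries and the helper `languageForScript` are hypothetical, and the sketch assumes `langIDToScriptID` is a plain dict from language ID to script name.

```python
# Hypothetical sample data; the real langIDToScriptID contents are not shown in this diff.
langIDToScriptID = {
    "en": "Latin",
    "ru": "Cyrillic",
    "el": "Greek",
}

# Build the reverse mapping once. If several languages shared a script,
# the last one seen would win, so real code would need a priority rule.
scriptIDToLangID = {script: lang for lang, script in langIDToScriptID.items()}


def languageForScript(scriptID):
    """Return the language ID registered for scriptID, or None if unknown."""
    return scriptIDToLangID.get(scriptID)
```

With this shape, a chunker could compare `languageForScript` of the current and previous script to decide whether a language switch, and therefore a chunk break, is needed.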
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -10,6 +10,8 @@ | |
These entries should not overlap, but there could be gaps | ||
""" | ||
|
||
import bisect | ||
|
||
# unicode digit constants | ||
DIGIT_ZERO = 0x30 | ||
DIGIT_NINE = 0x39 | ||
|
@@ -885,28 +887,49 @@ | |
( 0Xe0020 , 0Xe007f , "Common" ), | ||
] | ||
|
||
|
||
unicodeScriptRangeEnd = [ k[1] for k in scriptRanges] | ||
Could you add a doc string here please? Mention that for performance reasons this should only be created once. |
||
|
||
def getScriptCode(chr): | ||
"""performs a binary search in scriptRanges for unicode ranges | ||
@param chr: character for which a script should be found | ||
@type chr: string | ||
@return: script code, or None if no range contains the character | ||
@rtype: str""" | ||
mStart = 0 | ||
mEnd = len(scriptRanges)-1 | ||
characterUnicodeCode = ord(chr) | ||
# Number should respect preferred language setting | ||
# FullWidthNumber is in Common category, however, it indicates Japanese language context | ||
if DIGIT_ZERO <= characterUnicodeCode <= DIGIT_NINE: | ||
Rather than having a special condition for "Number" and "FullWidthNumber", I think it's cleaner to make these entries in the
What do you think of this suggestion? It would be interesting to test the performance of the |
||
return "Number" | ||
elif FULLWIDTH_ZERO <= characterUnicodeCode <= FULLWIDTH_NINE: | ||
return "FullWidthNumber" | ||
while( mEnd >= mStart ): | ||
midPoint = (mStart + mEnd ) >> 1 | ||
if characterUnicodeCode < scriptRanges[midPoint][0]: | ||
mEnd = midPoint -1 | ||
elif characterUnicodeCode > scriptRanges[midPoint][1]: | ||
mStart = midPoint + 1 | ||
else: | ||
return scriptRanges[midPoint][2] | ||
return None | ||
|
||
# Based on the following assumptions: | ||
# - ranges must not overlap | ||
# - range end and start values are included in that range | ||
# - there may be gaps between ranges. | ||
|
||
# Approach: Look for the first index of a range where the range end value is greater | ||
# than the code we are searching for. If this is found, and the start value for this range | ||
# is less than or equal to the code we are searching for then we have found the range. | ||
# That is startValue <= characterUnicodeCode <= endValue | ||
|
||
index = bisect.bisect_left(unicodeScriptRangeEnd, characterUnicodeCode ) | ||
if index == len(unicodeScriptRangeEnd): | ||
# there is no value of index such that: `characterUnicodeCode <= scriptRanges[index][1]` | ||
# characterUnicodeCode is larger than all of the range end values so a range is not | ||
# found for the value: | ||
return None | ||
|
||
# Since the range at index is the first where `characterUnicodeCode <= rangeEnd` is True, | ||
# we now ensure that for the range at the index `characterUnicodeCode >= rangeStart` | ||
# is also True. | ||
candidateRange = scriptRanges[index] | ||
rangeStart = candidateRange[0] | ||
if rangeStart > characterUnicodeCode : | ||
# characterUnicodeCode comes before the start of the range at index so a range | ||
# is not found for the value | ||
return None | ||
|
||
rangeName = candidateRange[2] | ||
return rangeName |
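The `bisect` approach in the added lines can be tried in isolation with a small table. The sketch below is illustrative only: `sampleRanges`, `sampleRangeEnds`, and `lookupScript` are hypothetical names, and the four ranges are a shortened stand-in for the real `scriptRanges` list. It relies on the same assumptions stated in the comments above: sorted, non-overlapping ranges with inclusive bounds and possible gaps.

```python
import bisect

# Hypothetical, shortened table of (start, end, scriptName) entries:
# sorted, non-overlapping, inclusive bounds, with gaps between ranges.
sampleRanges = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0391, 0x03A9, "Greek"),
    (0x0410, 0x044F, "Cyrillic"),
]

# Built once at import time so each lookup only pays for the bisect call.
sampleRangeEnds = [r[1] for r in sampleRanges]


def lookupScript(character):
    """Return the script name for character, or None if it falls in a gap."""
    code = ord(character)
    # First index whose range end is >= code.
    index = bisect.bisect_left(sampleRangeEnds, code)
    if index == len(sampleRangeEnds):
        # code is larger than every range end.
        return None
    start, _end, name = sampleRanges[index]
    if start > code:
        # code lies in the gap before this range.
        return None
    return name
```

For example, `lookupScript("Z")` lands on the first Latin range because `bisect_left` returns the index of an equal end value, while `lookupScript("[")` (0x5B) falls in the gap before the second range and yields `None`.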
This might be a good place to use a namedtuple, then we can refer to these by name rather than remembering the order.
Thanks for your suggestions! Using namedtuple has improved code readability. I didn't know about this feature.
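As a rough illustration of the namedtuple suggestion, the sketch below wraps one range entry so that fields are accessed by name. The type name `ScriptRange`, its field names, and the helper `containsCode` are hypothetical; the diff does not show which names the PR finally used.

```python
from collections import namedtuple

# Hypothetical type and field names, chosen for illustration only.
ScriptRange = namedtuple("ScriptRange", "start end scriptName")

greek = ScriptRange(0x0391, 0x03A9, "Greek")


def containsCode(scriptRange, code):
    """Return True if code lies within the inclusive bounds of scriptRange."""
    return scriptRange.start <= code <= scriptRange.end
```

Because a namedtuple is still a tuple, index-based code such as `greek[2]` keeps working alongside `greek.scriptName`, so existing lookups could be migrated gradually.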