Commit

optimise is_cjk(character) (#139)
The call to `ord()` and the list comprehension both showed up in my profiler.

Some observations:
- The call to `ord()` doesn't need to happen every iteration of that list comprehension.
- A list isn't necessary; we can stop searching once we've found a hit.
- The list is sorted, so once a char is at or below the upper bound of a range, we don't need to evaluate any of the higher ranges.
- Technically we could also do a binary search instead of a linear one, but I'm assuming that, overall, most chars will be in the lowest ranges (ASCII) and the loop will exit on its first iteration.
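
The observations above can be sketched as follows. This is a minimal illustration, not the sacremoses implementation: `RANGES` is a hypothetical stand-in for `CJKChars().ranges` (a sorted, non-overlapping subset of the literal list this commit removes), and the function names are made up for the example. It shows the early-exit linear scan next to the binary-search alternative mentioned in the last bullet.

```python
from bisect import bisect_left

# Hypothetical stand-in for CJKChars().ranges: sorted, non-overlapping,
# inclusive (start, end) code point pairs.
RANGES = [
    (4352, 4607),
    (11904, 42191),
    (43072, 43135),
    (44032, 55215),
]
ENDS = [end for _, end in RANGES]


def in_ranges_linear(char, ranges=RANGES):
    """Linear scan that stops at the first range whose end is >= the code point."""
    code = ord(char)  # computed once, not once per range
    for start, end in ranges:
        if code <= end:
            # Every later range starts higher, so this range decides the answer.
            return code >= start
    return False


def in_ranges_bisect(char):
    """Binary-search variant: locate the first range whose end is >= the code point."""
    code = ord(char)
    i = bisect_left(ENDS, code)
    return i < len(RANGES) and RANGES[i][0] <= code
```

For an ASCII character such as `"a"`, the linear version returns on the first iteration, which is the case the commit message expects to dominate; the bisect version always does a logarithmic number of comparisons regardless of where the character falls.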
jelmervdl committed Sep 14, 2023
1 parent 1b161ea commit cb03e74
Showing 1 changed file with 8 additions and 18 deletions.
26 changes: 8 additions & 18 deletions sacremoses/util.py
@@ -91,6 +91,9 @@ class CJKChars(object):
     ]
 
 
+_CJKChars_ranges = CJKChars().ranges
+
+
 def is_cjk(character):
     """
     This checks for CJK character.
@@ -106,24 +109,11 @@ def is_cjk(character):
     :type character: char
     :return: bool
     """
-    return any(
-        [
-            start <= ord(character) <= end
-            for start, end in [
-                (4352, 4607),
-                (11904, 42191),
-                (43072, 43135),
-                (44032, 55215),
-                (63744, 64255),
-                (65072, 65103),
-                (65381, 65500),
-                (94208, 101119),
-                (110592, 110895),
-                (110960, 111359),
-                (131072, 196607),
-            ]
-        ]
-    )
+    char = ord(character)
+    for start, end in _CJKChars_ranges:
+        if char <= end:
+            return char >= start
+    return False
 
 
 def xml_escape(text):
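
One way to check that an early-exit loop agrees with the `any(...)` form is to probe the code points around every range boundary. The sketch below is an assumption-laden illustration rather than a test from the repository: `RANGES` is a small sample of the removed literal list, and all function names are hypothetical. A variant with strict comparisons is included for contrast, since the removed expression treats both endpoints as inclusive.

```python
# Sample subset of the removed literal list (sorted, inclusive bounds).
RANGES = [(4352, 4607), (11904, 42191), (43072, 43135)]


def is_in_any(code):
    """Reference form: inclusive test against every range."""
    return any(start <= code <= end for start, end in RANGES)


def is_in_early_exit(code):
    """Early-exit form with inclusive comparisons."""
    for start, end in RANGES:
        if code <= end:
            return code >= start
    return False


def is_in_early_exit_strict(code):
    """Early-exit form with strict comparisons, shown only for contrast."""
    for start, end in RANGES:
        if code < end:
            return code > start
    return False


def mismatches(candidate):
    """Boundary-adjacent code points where candidate disagrees with is_in_any."""
    probes = sorted({b + d for rng in RANGES for b in rng for d in (-1, 0, 1)})
    return [p for p in probes if candidate(p) != is_in_any(p)]


print(mismatches(is_in_early_exit))         # []
print(mismatches(is_in_early_exit_strict))  # the six range endpoints
```

With inclusive comparisons the two forms agree on every probe; with strict comparisons they differ exactly at the first and last code point of each range.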
