Problem with non-overlapping matches of ligature pattern in string when ligature is skipped due to its form mismatch #86

Open
jurajmichalak1 opened this issue Oct 13, 2022 · 0 comments

jurajmichalak1 commented Oct 13, 2022

When I render the text "مين" (b'\u0645\u064a\u0646') with PIL using the default libraqm layout, where the text is reshaped by the libraqm library, I get the expected result.
The text was transformed into b'\ufee3\uFC94'. That is the expected behavior: "\u064a\u0646" was replaced by the ligature "\uFC94", and the initial b'\u0645' was converted to its initial form b'\ufee3'.
Note: in b'\u0645\u064a\u0646' all letters are in their unshaped form.

But when I use arabic_reshaper(text), the result is wrong.
NOTE: the rendering was produced using PIL with the basic layout (basic_layout), which means PIL renders the text letter by letter from left to right. The shaping is therefore done entirely by arabic_reshaper, and it fails in this case.
The text I get from the reshaper is b'\ufee3\ufef4\ufee6' instead of the expected b'\ufee3\uFC94'.

The source of the problem: the ligature regex is applied and its first match is '\u0645\u064A' (ARABIC LIGATURE MEEM WITH YEH), which has only an isolated form, so you correctly skip it (continue) here:
https://github.com/mpcabd/python-arabic-reshaper/blob/master/arabic_reshaper/arabic_reshaper.py#L220
But the subsequent ligature "\u064a\u0646" overlaps with the previous match "\u0645\u064a" (they share "\u064a"), so the finditer function does not return it and it is never applied.
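The behavior described above can be reproduced without the library. The following is a minimal sketch that assumes a hypothetical two-entry ligature pattern (the real `_ligatures_re` is much larger); it only shows that `re.finditer` returns non-overlapping matches, while restarting the search one character after each match start also finds the overlapping candidate.

```python
import re

# Hypothetical, reduced ligature table used only for illustration.
MEEM_YEH = '\u0645\u064a'  # the candidate that gets skipped (isolated form only)
YEH_NOON = '\u064a\u0646'  # the ligature that should be applied (U+FC94)

ligatures_re = re.compile('({})|({})'.format(MEEM_YEH, YEH_NOON))
text = '\u0645\u064a\u0646'

# finditer resumes scanning after the end of the previous match,
# so the candidate starting at index 1 is never reported.
spans = [m.span() for m in ligatures_re.finditer(text)]

# Restarting the search at match.start() + 1 also reports overlapping candidates.
overlapping = []
pos = 0
while True:
    m = ligatures_re.search(text, pos)
    if m is None:
        break
    overlapping.append(m.span())
    pos = m.start() + 1

print(spans)        # only the MEEM+YEH span
print(overlapping)  # MEEM+YEH and YEH+NOON spans
```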

I have a simple fix:

diff --git a/arabic_reshaper/arabic_reshaper.py b/arabic_reshaper/arabic_reshaper.py
index 4721a6a..a94cd0f 100644
--- a/arabic_reshaper/arabic_reshaper.py
+++ b/arabic_reshaper/arabic_reshaper.py
@@ -186,14 +186,17 @@ class ArabicReshaper(object):
             if delete_tatweel:
                 text = text.replace(TATWEEL, '')
 
-            for match in re.finditer(self._ligatures_re, text):
+            regex_start = 0
+            matchIt = re.finditer(self._ligatures_re, text)
+            match = next(matchIt, None)
+            while match:
                 group_index = next((
                     i for i, group in enumerate(match.groups()) if group
                 ), -1)
                 forms = self._get_ligature_forms_from_re_group_index(
                     group_index
                 )
-                a, b = match.span()
+                a, b = tuple(i+regex_start for i in match.span())
                 a_form = output[a][FORM]
                 b_form = output[b - 1][FORM]
                 ligature_form = None
@@ -218,9 +221,13 @@ class ArabicReshaper(object):
                     else:
                         ligature_form = MEDIAL
                 if not forms[ligature_form]:
+                    regex_start = a+1
+                    matchIt = re.finditer(self._ligatures_re, text[regex_start:])
+                    match = next(matchIt, None)
                     continue
                 output[a] = (forms[ligature_form], NOT_SUPPORTED)
                 output[a+1:b] = repeat(('', NOT_SUPPORTED), b - 1 - a)
+                match = next(matchIt, None)
 
         result = []
         if not delete_harakat and -1 in positions_harakat:
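The same rescan idea can be sketched as a standalone helper. This is not the library's code: `accepted_matches` and `is_usable` are hypothetical names, and the sketch uses `Pattern.search(text, pos)` instead of slicing the text, so match spans stay absolute and no `regex_start` offset is needed. It assumes the pattern has no anchors or lookbehinds, for which `pos` and slicing can differ.

```python
import re


def accepted_matches(pattern, text, is_usable):
    """Collect ligature matches, rescanning after each rejected candidate.

    is_usable(match) is a hypothetical callback that stands in for the
    "does this ligature have the required form?" check in the reshaper.
    """
    accepted = []
    pos = 0
    while True:
        match = pattern.search(text, pos)
        if match is None:
            break
        if is_usable(match):
            accepted.append(match)
            # Accepted ligatures consume their characters, as with finditer.
            pos = max(match.end(), match.start() + 1)
        else:
            # Rejected candidate: retry one character later so an
            # overlapping ligature still gets a chance to match.
            pos = match.start() + 1
    return accepted


# Hypothetical two-entry pattern: group 1 is MEEM+YEH, group 2 is YEH+NOON.
pattern = re.compile('(\u0645\u064a)|(\u064a\u0646)')
text = '\u0645\u064a\u0646'

# Pretend only YEH+NOON has a usable form in this context.
found = accepted_matches(pattern, text, lambda m: m.group(2) is not None)
print([m.span() for m in found])
```

Compared with re-creating `finditer` on `text[regex_start:]`, this keeps a single loop and avoids copying the string on every rejected candidate.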
jurajmichalak1 pushed a commit to jurajmichalak1/python-arabic-reshaper that referenced this issue Oct 13, 2022
…tches of ligature pattern in string when previous overlapping ligature candidate is skipped due to its form mismatch