
Additional confusables characters + deduping existing #24612

Merged
diox merged 7 commits into mozilla:master from diox:even-more-additional-homoglyphs-again
Mar 16, 2026

Conversation

Member

@diox diox commented Mar 16, 2026

Adds more confusable characters to our custom normalization, and dedupes the existing ones:

  • Remove characters already handled by unicodedata normalization
  • Remove characters already known to be confusables by Unicode
  • Remove characters already considered confusable with another letter: we only use this table to normalize into a single character, so having confusable characters in more than one list is not useful. This was mostly a problem for i and l.
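
The three rules above can be sketched roughly as follows. This is not the code from the patch: the table shape (a dict mapping a target letter to a string of look-alike characters), the helper names, and the `known_confusables` set standing in for Unicode's confusables data are all assumptions made for illustration. Keeping a character in the first list that claims it is also an arbitrary choice here.

```python
import unicodedata


def handled_by_unicodedata(char, target):
    """Return True if NFKD normalization alone already maps char to target."""
    decomposed = unicodedata.normalize('NFKD', char)
    # Drop combining marks so that an accented letter reduces to its base.
    stripped = ''.join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold() == target


def dedupe_table(table, known_confusables=frozenset()):
    """Apply the three removal rules to a hypothetical confusables table.

    table: dict mapping a target letter to a string of look-alike characters.
    known_confusables: hypothetical set of characters that the Unicode
    confusables data is assumed to cover already.
    """
    seen = set()
    result = {}
    for target, chars in table.items():
        kept = []
        for char in chars:
            if char in seen:
                # Rule 3: already listed as confusable with an earlier letter.
                continue
            if handled_by_unicodedata(char, target):
                # Rule 1: unicodedata normalization already handles it.
                continue
            if char in known_confusables:
                # Rule 2: already a known Unicode confusable.
                continue
            seen.add(char)
            kept.append(char)
        result[target] = ''.join(kept)
    return result
```

For example, with `{'i': '\u00ed\uff49\u2d4a\u04cf', 'l': '\u2d4a\u04cf'}` and `known_confusables={'\u04cf'}`, only `'\u2d4a'` (ⵊ) survives, and only under `'i'`: U+00ED and U+FF49 are reduced to `i` by NFKD, and U+04CF is assumed to be covered by the confusables data.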

Testing

With the patch applied, you can check what we think a given character is confusable with by doing:

>>> s = 'ⵊ'
>>> list(generate_lowercase_homoglyphs_variants_for_string(normalize_string_for_name_checks(s)))
['i', 'l']

That lets you spot-check any character you have doubts about.

@diox diox marked this pull request as ready for review March 16, 2026 15:33
@diox diox requested a review from eviljeff March 16, 2026 15:33
Member

@eviljeff eviljeff left a comment

Do you think it's feasible to have a test that checks for duplication programmatically? Manually de-duping seems like an inefficient process.

Member Author

diox commented Mar 16, 2026

I automated the deduping, so I'll look into adding a test so that we don't accidentally add more dupes in the future, yeah. I fear it might be a bit too slow though...

Member Author

diox commented Mar 16, 2026

I've added a script to check for dupes in a0fd570 but left it separate in scripts/, with the expectation that we run it next time we upgrade homoglyphs_fork or add new characters. It's too slow to run as part of the test suite IMHO.
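
The script from a0fd570 is not shown in this conversation, so the following is only a guess at the simplest part of such a check: reporting characters that appear under more than one target letter. The table shape and the function name are hypothetical, and the slower checks against homoglyphs_fork or unicodedata are deliberately left out.

```python
from collections import defaultdict


def find_cross_list_dupes(table):
    """Map each character listed under several targets to those targets.

    table: hypothetical dict mapping a target letter to a string of
    look-alike characters.
    """
    owners = defaultdict(list)
    for target, chars in table.items():
        # set() so that a repeat inside one list is not counted twice.
        for char in set(chars):
            owners[char].append(target)
    return {
        char: sorted(targets)
        for char, targets in owners.items()
        if len(targets) > 1
    }
```

An empty result means no character is shared between lists; a non-empty one names each offending character and the letters that claim it, which is enough to fail a check run by hand after adding new characters.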

@diox diox merged commit e89e6f1 into mozilla:master Mar 16, 2026
43 checks passed

2 participants