
updated the version to 2.0.0, introduced multilingual filtering #1

Merged
master-wayne7 merged 3 commits into master from version-2.0.0
Mar 13, 2026

Conversation

master-wayne7 (Owner) commented Mar 11, 2026

Improved performance by applying the Aho-Corasick algorithm.

Summary by CodeRabbit

  • New Features

    • Major v2.0.0 release with a modular API: SafeTextFilter and PhoneNumberChecker; significant performance improvements (single‑pass search, up to ~20× faster) and 75+ language support.
  • Deprecations

    • SafeText class deprecated; migrate to SafeTextFilter or PhoneNumberChecker.
  • Documentation

    • README rewritten with v2.0.0 upgrade guidance and new examples.
  • Chores

    • Removed bundled static blacklist asset.
  • Style

    • Workspace formatting setting adjusted (line length 80).

@master-wayne7 master-wayne7 self-assigned this Mar 11, 2026
@master-wayne7 master-wayne7 added the enhancement New feature or request label Mar 11, 2026
coderabbitai bot commented Mar 11, 2026

📝 Walkthrough

Release v2.0.0: introduces SafeTextFilter and PhoneNumberChecker backed by an Aho‑Corasick engine, removes the bundled asset/bad_words.txt, adds 40+ language data files, and rewrites README/CHANGELOG and workspace settings.

Changes

  • Configuration & Docs (/.vscode/settings.json, CHANGELOG.md, README.md): Adds the VSCode Dart lineLength setting; documents v2.0.0, the modular API (SafeTextFilter, PhoneNumberChecker), the Aho‑Corasick engine, and migration guidance; rewrites the README examples.
  • Removed legacy list (asset/bad_words.txt): Removes the bundled global blacklist (a large 1,734-line file). Verify any runtime or packaging references to this asset.
  • Language data assets, bulk (assets/data/*.txt, assets/data/.../ar.txt, assets/data/.../es.txt, assets/data/.../it.txt, assets/data/.../nl.txt, assets/data/.../de.txt): Adds 40+ newline-delimited, language-specific word lists (profane/explicit terms) as static data for the new trie/Aho‑Corasick-based filtering. Pure data additions; no code changes here.
  • Language data assets, small sets (assets/data/piy.txt, assets/data/pih.txt, assets/data/kab.txt, assets/data/mi.txt): Adds several smaller language files (short lists).

Sequence Diagram(s)

sequenceDiagram
  participant Client as Client
  participant Filter as SafeTextFilter
  participant Engine as AhoCorasickEngine
  participant Data as LanguageDataStore

  Client->>Filter: init(languages)
  Filter->>Data: loadFiles(languages)
  Data-->>Filter: wordLists
  Filter->>Engine: buildTrie(wordLists)
  Engine-->>Filter: trieReady
  Client->>Filter: filterText(text, options)
  Filter->>Engine: search(text)
  Engine-->>Filter: matches
  Filter->>Client: filteredText / matches
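The buildTrie/search flow above can be sketched with a minimal Aho-Corasick automaton. The package itself is Dart; this is an illustrative Python sketch, and the class and method names (AhoCorasick, search) are not the package's actual API:

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: build a trie from the word lists,
    wire up failure links, then scan the text in a single pass."""

    def __init__(self, words):
        self.goto = [{}]      # per-state character transitions
        self.fail = [0]       # failure links
        self.out = [set()]    # words ending at each state
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append(set())
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].add(word)

    def _build_failure_links(self):
        # BFS from the root; a node's failure link points to the longest
        # proper suffix of its path that is also a path in the trie.
        queue = deque(self.goto[0].values())
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                self.out[nxt] |= self.out[self.fail[nxt]]

    def search(self, text):
        """Return (start_index, word) for every match, in one pass."""
        state, matches = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                matches.append((i - len(word) + 1, word))
        return matches
```

Because every character advances the automaton at most a constant amortized number of times, the search cost is independent of the number of patterns, which is what makes a single-pass multi-language scan feasible.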

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~30 minutes

Poem

🐰 I hopped through folders, new and old,
Collected words in languages bold,
AC trie stitched every thread,
Filters hum where data led,
A tiny rabbit cheers the code.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title accurately summarizes the main changes (the version bump to 2.0.0 and the introduction of multilingual filtering), which align with the extensive file changes.
  • Docstring Coverage: ✅ Passed. No functions were found in the changed files, so the docstring coverage check was skipped.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.




coderabbitai bot left a comment

Actionable comments posted: 5

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (30)
assets/data/et.txt-78-78 (1)

78-78: ⚠️ Potential issue | 🟠 Major

Fix the mojibake before publishing this Estonian dataset.

pãµrgu and vã¤ike are encoding-corrupted forms, so these patterns will miss real Estonian input and suggest the source file was imported with the wrong character encoding.

Also applies to: 139-139

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/et.txt` at line 78, Replace the mojibake tokens "pãµrgu" and
"vã¤ike" with their correct Estonian forms "põrgu" and "väike" throughout the
file (including the other occurrence noted), and save the file in UTF-8 encoding
(or re-run a proper encoding conversion from the original charset) so future
edits/publishes preserve the correct characters.
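The re-encoding step suggested above is the classic Latin-1/UTF-8 round trip. A minimal Python sketch (the helper name fix_mojibake is hypothetical; it assumes the corruption came from reading UTF-8 bytes as Latin-1, before the list was lowercased, which is why "Ãµ" appears in the file as "ãµ"):

```python
def fix_mojibake(token: str) -> str:
    """Reverse a UTF-8-read-as-Latin-1 round trip.

    "põrgu" stored as UTF-8 but decoded as Latin-1 becomes "pÃµrgu";
    re-encoding as Latin-1 recovers the original bytes, which then
    decode cleanly as UTF-8. Tokens that don't round-trip are returned
    unchanged rather than corrupted further.
    """
    try:
        return token.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return token
```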
assets/data/cs.txt-11-13 (1)

11-13: ⚠️ Potential issue | 🟠 Major

Prune neutral Czech vocabulary from this blacklist.

Several entries in these ranges are common, non-abusive words (bodnutí, děloha, knoflík, mezery, odvaha, odvážný, peklo, pohlaví, počínání, příruba, vybudovat, zpomalit, úryvek). Shipping them in the profanity list will overblock ordinary Czech text and materially hurt precision.

Also applies to: 33-33, 66-67, 101-103, 120-122, 136-145, 153-169, 199-205, 221-225

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/cs.txt` around lines 11 - 13, Remove the neutral Czech words from
the profanity blacklist: delete the entries "bodnutí", "děloha", "knoflík",
"mezery", "odvaha", "odvážný", "peklo", "pohlaví", "počínání", "příruba",
"vybudovat", "zpomalit", and "úryvek" (and any duplicates) so ordinary Czech
text isn't overblocked; apply the same removals in the other specified ranges
(lines corresponding to ranges 33-33, 66-67, 101-103, 120-122, 136-145, 153-169,
199-205, 221-225) by scanning those ranges for these neutral terms and pruning
them from the blacklist file.
assets/data/mi.txt-14-15 (1)

14-15: ⚠️ Potential issue | 🟠 Major

This Māori list includes a lot of ordinary vocabulary.

Entries like kararehe, pakaru, reta, tino, tino pai, uaua, wahine, and multiple generic whaka* verbs are not profane on their own. Keeping them here will make the multilingual filter fire on normal Māori sentences.

Also applies to: 20-23, 27-38, 45-58

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/mi.txt` around lines 14 - 15, The Māori profanity list contains
many ordinary, non-profane words (e.g., "kararehe", "pakaru", "reta", "tino",
"tino pai", "uaua", "wahine", and generic "whaka*" verbs) which will trigger
false positives; go through the entries referenced (lines around
"kararehe"/"karihika" and also ranges 20-23, 27-38, 45-58), remove or relocate
any non-profane vocabulary to a separate whitelist or neutral-word file, and
keep only genuinely offensive terms in the profanity list (update
functions/consumers that load this list to read the new whitelist if needed).
assets/data/ca.txt-6-6 (1)

6-6: ⚠️ Potential issue | 🟠 Major

The Catalan dataset is polluted with foreign or corrupted entries.

Rows like harnröhre, pansexualität, masturbarse, tetas, retrocã©s, and the Cyrillic terms at the end do not belong in a Catalan blacklist. They point to a bad source import and will either miss real Catalan text or introduce cross-language false positives.

Also applies to: 64-64, 79-79, 91-91, 112-112, 122-122, 135-136

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/ca.txt` at line 6, The Catalan blacklist contains
foreign/corrupted entries that must be removed and the source import
revalidated: remove the listed non-Catalan tokens (e.g., "harnröhre",
"pansexualität", "masturbarse", "tetas", "retrocã©s", the Cyrillic terms and the
current "armando reques" entry) and any similar entries at the other reported
lines (64, 79, 91, 112, 122, 135-136); then update the import pipeline or source
filter to enforce Catalan-language detection (or whitelist trusted sources) so
future imports don’t introduce cross-language or corrupted terms and run a quick
QA pass to confirm only valid Catalan lemmas remain.
assets/data/af.txt-21-22 (1)

21-22: ⚠️ Potential issue | 🟠 Major

Remove unrelated names and neutral proper nouns from the Afrikaans blocklist.

dawid jonker, dawie joubert, luna, oom gert, shrek, and wicked do not read like profanity terms. Keeping them here will flag benign text and names for no moderation reason.

Also applies to: 163-163, 190-190, 217-217, 252-252

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/af.txt` around lines 21 - 22, Remove the unrelated benign entries
from the Afrikaans profanity blocklist by deleting the specific strings "dawid
jonker", "dawie joubert", "luna", "oom gert", "shrek", and "wicked" (and any
duplicate occurrences at the other noted locations) from the af.txt blocklist so
benign names and neutral nouns are no longer flagged; search for those exact
tokens in af.txt and remove each line containing them, leaving the rest of the
blocklist intact.
assets/data/hu.txt-49-49 (1)

49-49: ⚠️ Potential issue | 🟠 Major

This Hungarian blocklist includes routine vocabulary.

bátorság, fogantyú, kemény, kiáltotta, majom, részeg, and épít are normal words. Including them in the shipped matcher will create false positives in regular Hungarian text.

Also applies to: 98-98, 133-139, 170-170, 237-237, 295-295

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/hu.txt` at line 49, Remove common Hungarian words that cause
false positives from the blocklist: delete entries like "bátorság", "fogantyú",
"kemény", "kiáltotta", "majom", "részeg", and "épít" (and the other listed
occurrences at the indicated ranges) from the hu blocklist so the matcher only
contains genuinely disallowed tokens; after removal, run the matcher tests on
Hungarian sample text to verify no routine vocabulary triggers remain.
assets/data/gl.txt-12-13 (1)

12-13: ⚠️ Potential issue | 🟠 Major

The Galician dataset is over-inclusive and language-mixed.

can fuck is not Galician, and entries like cariño, durmir, falando, galiñas, galo, and pálido are ordinary words. If these are fed into the profanity automaton, normal Galician text will be incorrectly blocked.

Also applies to: 23-23, 32-32, 41-42, 63-63

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/gl.txt` around lines 12 - 13, The Galician profanity list is
over-inclusive and language-mixed: remove non-Galician entries like "can fuck"
and strip ordinary Galician words ("cariño", "durmir", "falando", "galiñas",
"galo", "pálido") from the automaton; instead whitelist common lemmas and only
keep validated profane lemmas. Update the token normalization routine to
normalize/strip diacritics and punctuation before matching, split multi-word
entries into single-token checks (or remove multiword English phrases), and add
a small validation step (language detection or native-speaker review) for any
remaining entries to ensure only true Galician profanity is retained.
assets/data/eu.txt-17-17 (1)

17-17: ⚠️ Potential issue | 🟠 Major

Normalize this Basque list before shipping it.

This file mixes comma-joined records and non-Basque hybrids (ipurdian-fucker, jainko-damned, prononcés, seme-of-a-puta, urbanístico). That will both miss intended matches and introduce locale-specific false positives.

Also applies to: 23-24, 37-37, 40-40, 46-46

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/eu.txt` at line 17, The Basque word list in assets/data/eu.txt
contains mixed formats and non-Basque hybrids (e.g., the comma-joined entry
"gixajo, gisarajo" and items like "ipurdian-fucker", "jainko-damned",
"prononcés", "seme-of-a-puta", "urbanístico") that break normalization and
matching; clean this by splitting comma-joined records into individual entries
(so "gixajo, gisarajo" becomes two lines "gixajo" and "gisarajo"), remove or
replace hybrid/foreign entries with pure Basque equivalents, ensure each token
is lowercased and diacritics normalized consistently, and apply the same fixes
to the other affected lines (noted around lines 23–24, 37, 40, 46) so the
functions that read this file (the list loader that expects one token per line)
will yield correct, locale-pure Basque entries.
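The one-token-per-line normalization described above can be sketched as follows (a Python sketch; the helper name normalize_entries is hypothetical, not the package's loader):

```python
def normalize_entries(lines):
    """Split comma-joined records into one lowercased token per line,
    dropping blanks and duplicates while preserving first-seen order."""
    seen, out = set(), []
    for line in lines:
        for token in line.split(","):
            token = token.strip().lower()
            if token and token not in seen:
                seen.add(token)
                out.append(token)
    return out
```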
assets/data/id.txt-162-163 (1)

162-163: ⚠️ Potential issue | 🟠 Major

This Indonesian list contains many everyday words.

Entries such as detik, dikirim, gagah, keberanian, penggemar, ruang, terima kasih, united, and ya tuhan are normal language, not profanity. Shipping them in the matcher will create widespread false positives for Indonesian users.

Also applies to: 190-190, 284-284, 468-468, 495-495, 546-546, 571-571, 579-579

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/id.txt` around lines 162 - 163, The listed Indonesian entries
(e.g., "detik", "dikirim", "gagah", "keberanian", "penggemar", "ruang", "terima
kasih", "united", "ya tuhan" and the other specified line entries) are common
non-profanity words and should not be in the profanity matcher; remove these
tokens from the profanity list (or move them into a distinct whitelist/allowlist
used by the matcher) so they no longer generate false positives; update any
tests or documentation that assumed these tokens were profane and ensure the
matcher still performs on true profanity cases after the change.
assets/data/hy.txt-90-90 (1)

90-90: ⚠️ Potential issue | 🟠 Major

This list includes common Armenian vocabulary, not just abusive terms.

Entries like vor, ամուսին, ամուսինս, գաղտնի, դրսում, and լեգենդ are ordinary words. If this file feeds the profanity matcher, these will produce a large number of false positives in normal Armenian text.

Also applies to: 104-105, 130-130, 140-140, 156-156

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/hy.txt` at line 90, The Armenian profanity wordlist contains
common non-abusive tokens (e.g., "vor", "ամուսին", "ամուսինս", "գաղտնի",
"դրսում", "լեգենդ") which will cause false positives; edit assets/data/hy.txt to
remove these ordinary words (and the other mentioned lines 104-105, 130, 140,
156) from the profanity list, or relocate them into a separate
benign/common-words list used by the matcher for allowlisting/whitelisting, and
ensure the profanity matcher reads only the sanitized profanity list.
assets/data/mk.txt-77-82 (1)

77-82: ⚠️ Potential issue | 🟠 Major

Remove unrelated foreign phrases and placeholder tokens from the Macedonian list.

Lines 77-82 and 84 contain English profanity (suck mine dick, up your mums cunt), Serbo-Croatian entries (turi mu kur, u picku mater, veliko mandalo, volim te), and a malformed placeholder (x оценет). These entries will inflate false positives and degrade Macedonian language filtering precision.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/mk.txt` around lines 77 - 82, Remove the unrelated and offensive
entries from the Macedonian wordlist by deleting the English profanity items
("suck mine dick", "up your mums cunt"), the Serbo-Croatian phrases ("turi mu
kur", "u picku mater", "veliko mandalo", "volim te"), and the malformed
placeholder token ("x оценет"); ensure only valid Macedonian words remain in the
list and run a quick scan for any other non-Macedonian or placeholder strings to
prevent false positives in filtering.
assets/data/ja.txt-11-11 (1)

11-11: ⚠️ Potential issue | 🟠 Major

Several romaji entries appear to be junk or placeholder data rather than legitimate Japanese profanity terms.

baka bah oom (line 11), okiesawada (line 97), watashiwa fucketsu desu ne (line 135), and xx (line 136) do not conform to valid Japanese romaji. For example, baka bah oom combines a real term (baka) with phonetically invalid elements; watashiwa fucketsu desu ne contains the invalid word fucketsu; and xx is a clear placeholder. These entries will not function properly in a production filter and should be removed or replaced with legitimate Japanese profanity terms before automaton building.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/ja.txt` at line 11, Remove or replace the invalid romaji entries
used as profanity tokens: delete "baka bah oom", "okiesawada", "watashiwa
fucketsu desu ne", and "xx" from the profanity list and either supply correct
romaji equivalents or legitimate Japanese profanity terms (e.g., "baka",
"fuzakeru" where appropriate) so the automaton is built from valid tokens;
update any code that references these exact strings (search for the literal
tokens "baka bah oom", "okiesawada", "watashiwa fucketsu desu ne", "xx") to use
the corrected entries.
assets/data/ja.txt-360-362 (1)

360-362: ⚠️ Potential issue | 🟠 Major

Remove neutral Japanese nouns from the blocklist.

女 (woman), 女の子 (girl), 王 (king), 隅 (corner), and 隙 (gap) are ordinary words. Blocking them will cause excessive false positives when filtering normal Japanese content.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/ja.txt` around lines 360 - 362, Remove the neutral Japanese nouns
from the blocklist by deleting the entries for "女", "女の子", "王", "隅", and "隙" in
the ja.txt blocklist so ordinary words aren't falsely filtered; locate these
literal tokens in assets/data/ja.txt (e.g., the lines containing 女, 女の子, 女子高生,
王, 隅, 隙) and remove them, ensuring the file remains one-token-per-line and is
saved without altering unrelated entries or order.
assets/data/it.txt-590-590 (1)

590-590: ⚠️ Potential issue | 🟠 Major

Remove ordinary Italian names, everyday phrases, and archaic expressions from the blocklist.

The terms filippo (common given name), garibaldi (historical figure/surname), la mamma (everyday phrase for "the mother"), perbacco (archaic exclamation), and vada a bordo (nautical phrase) are not offensive or vulgar. Blocking these in normal Italian text will generate clear false positives since they regularly appear in legitimate writing.

Also applies to: lines 640, 757, 1038, 1694

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/it.txt` at line 590, The blocklist currently includes benign
Italian tokens that cause false positives; remove the entries for "filippo",
"garibaldi", "la mamma", "perbacco", and "vada a bordo" from the Italian
blocklist file (the occurrences shown in the diff and the duplicates at the
other reported locations), ensuring you delete each exact token/phrase and any
duplicate lines (lines referenced at 640, 757, 1038, 1694) while preserving file
encoding and line endings so the rest of the list remains intact.
assets/data/my.txt-29-29 (1)

29-29: ⚠️ Potential issue | 🟠 Major

Remove mixed-language artifacts from Burmese lexicon.

The entries at lines 29, 33, 54, and 66 are mixed-script/mixed-language artifacts that should not be in a production Burmese lexicon:

  • Line 29: x ကအဆင့်သတ်မှတ်ထားသော (Latin prefix)
  • Line 33: ကော်ဇော muncher (English suffix)
  • Line 54: မိခင် fucker (English suffix)
  • Line 66: လူရှုပ်-off (English suffix)

These entries do not represent authentic Burmese words and will cause matching failures. Remove or normalize them before building the trie.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/my.txt` at line 29, The Burmese lexicon contains mixed-language
artifacts (e.g., the exact strings "x ကအဆင့်သတ်မှတ်ထားသော", "ကော်ဇော muncher",
"မိခင် fucker", "လူရှုပ်-off"); update the lexicon-loading step (where you read
and pass entries into buildTrie or loadLexicon) to filter or normalize entries:
drop any line containing Latin letters, ASCII punctuation/hyphens, or obvious
English profanity, or alternatively strip trailing/leading Latin tokens so only
Burmese script remains; ensure the check uses a Unicode script test (e.g.,
reject if regex /[A-Za-z0-9\-]/ matches) before inserting into the trie so these
four offending entries are removed/normalized.
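The script check suggested in the prompt can be sketched like this (Python; the pattern and helper name are illustrative, assuming valid entries contain no Latin letters, ASCII digits, or hyphens):

```python
import re

# Reject any lexicon entry containing Latin letters, ASCII digits, or
# hyphens, so only pure Burmese-script tokens reach the trie builder.
LATIN_ASCII = re.compile(r"[A-Za-z0-9\-]")

def keep_burmese_only(entries):
    return [e for e in entries if not LATIN_ASCII.search(e)]
```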
assets/data/bg.txt-126-126 (1)

126-126: ⚠️ Potential issue | 🟠 Major

The Bulgarian dataset contains encoding-corrupted entries with non-Cyrillic characters.

Tokens like x оѧенен, аборѣ, анѣѡ, брѣѣалноѡѣ, клиѣор਍ഀ, and лайнഀ include non-Bulgarian characters (stray Gurmukhi and Malayalam codepoints among them), indicating mojibake. Indexing these corrupted entries will add garbage states to the automaton and miss legitimate user input.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/bg.txt` at line 126, The bg.txt dataset contains lines with
mojibake/non-Bulgarian characters (examples: "x оѧенен", "аборѣ", "анѣѡ",
"брѣѣалноѡѣ", "клиѣор਍ഀ", "лайнഀ"); scan bg.txt for characters outside the
Bulgarian/Cyrillic Unicode ranges (and allowed punctuation) and either remove or
replace those entries; normalize text to Unicode NFC and strip/control-map any
Latin, Gurmukhi, Malayalam or other-script codepoints before indexing so the
automaton only receives valid Cyrillic tokens (or move suspicious lines to a
quarantine report file for manual review).
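The scan-and-quarantine step can be sketched as follows (a Python sketch; the helper name and the restriction to the modern Cyrillic letter range U+0410-U+044F are assumptions, so the archaic-letter mojibake is quarantined alongside foreign-script entries):

```python
import unicodedata

def quarantine_non_cyrillic(lines):
    """NFC-normalize each token and split the list into clean modern
    Cyrillic entries and a quarantine list for manual review."""
    clean, quarantine = [], []
    for line in lines:
        token = unicodedata.normalize("NFC", line.strip())
        if token and all(
            "\u0410" <= ch <= "\u044f" or ch in " -" for ch in token
        ):
            clean.append(token)
        else:
            quarantine.append(token)
    return clean, quarantine
```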
assets/data/ms.txt-14-14 (1)

14-14: ⚠️ Potential issue | 🟠 Major

Remove Malaysian politician names from the Malay dataset.

The entries anwar ibrahim, lancau najib, najib kongkek anwar, and rosmah bodoh are names of Malaysian political figures (Anwar Ibrahim, Najib Razak, and Rosmah Mansor). Keeping them in a profanity filter will cause false-positives on benign political news and commentary that mentions these names.

Applies to lines: 14, 102, 126, 165

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/ms.txt` at line 14, Remove the four political-name entries from
the Malay profanity dataset by deleting the exact strings "anwar ibrahim",
"lancau najib", "najib kongkek anwar", and "rosmah bodoh" from the dataset file;
ensure these tokens are removed (or commented out) wherever they appear so they
no longer trigger the profanity filter, and run a quick scan to verify no other
variants of these exact phrases remain.
assets/data/mn.txt-40-48 (1)

40-48: ⚠️ Potential issue | 🟠 Major

Remove common Mongolian words from profanity filter list.

This list contains multiple everyday words that will cause excessive filtering of normal text. Examples include:

  • өдөр (day) – fundamental everyday word
  • бөмбөг (ball), титэм (crown), тогоонууд (pots) – standard nouns
  • дуртай (favorite/beloved), бөх (wrestling) – common descriptors
  • хамгийн сайн сайхан (most beautiful), харилцаа холбоо (communication/relationship) – standard phrases
  • эрхлэгч (manager/administrator), смит (smith) – common terms

These should be removed to prevent over-censoring of legitimate Mongolian text.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/mn.txt` around lines 40 - 48, The profanity list contains common
Mongolian words that will cause over-filtering; remove everyday terms such as
"өдөр", "бөмбөг", "бөх", "дуртай", "титэм", "тогоонууд", "хамгийн сайн сайхан",
"харилцаа холбоо", "эрхлэгч", and "смит" (and similar non-profane entries found
in mn.txt like "ану", "бурханлиг", "гичий", "гэж хэлэв") from assets/data/mn.txt
so the filter only contains true profanities and slurs; review the file for
other high-frequency nouns/phrases and delete them or move questionable items to
a separate moderated list.
assets/data/ms.txt-77-77 (1)

77-77: ⚠️ Potential issue | 🟠 Major

Remove or relocate Māori language entries that contaminate this Malay word list.

The file contains te reo Māori (Māori language) entries mixed with Malay/Indonesian content. This cross-contamination compromises the profanity filter by both missing real Malay profanity and creating false positives for unrelated text.

Māori entries: lines 77, 97–98, 176, 181–182, 186, 188, 190–191, 196–199.

These should be either removed or moved to a dedicated Māori language file (e.g., mi.txt, which already exists in the dataset).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/ms.txt` at line 77, Remove the te reo Māori entries contaminating
the Malay list by extracting the listed Māori strings (e.g., "kai whakatahe" and
the entries at positions you noted: the sets at lines 97–98, 176, 181–182, 186,
188, 190–191, 196–199) from the Malay wordlist and either delete them or append
them to the existing Māori wordlist (mi.txt); ensure the removed items are no
longer present in assets/data/ms.txt and, if moved, are deduplicated and added
to mi.txt in the same plain-text format so both lists remain accurate.
assets/data/el.txt-68-68 (1)

68-68: ⚠️ Potential issue | 🟠 Major

Drop the non-profane Greek terms and club names.

Line 68 (olympiacos), Line 72-Line 75 (panathinaikos, paok, vazelos, villara), Line 127 (επάνω), Line 129 (ευτυχώς), Line 171 (ναι), and Line 223 (συνάδελφο) are normal words/proper nouns. Keeping them in the Greek profanity list will censor routine sports/news/chat content.

Also applies to: 72-75, 127-129, 171-171, 223-223

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/el.txt` at line 68, Remove the non-profane Greek terms currently
included in the profanity list: "olympiacos", "panathinaikos", "paok",
"vazelos", "villara", "επάνω", "ευτυχώς", "ναι", and "συνάδελφο" (these exact
tokens appear in the diff) — either delete these entries from the list or move
them to a safe/whitelist so they are not treated as profane; ensure any
normalization or lowercase matching logic that references this list (the
profanity-check lookup that loads these tokens) is updated to reflect the
removal so ordinary sports/news/chat content is not censored.
assets/data/be.txt-18-18 (1)

18-18: ⚠️ Potential issue | 🟠 Major

Remove obviously benign Belarusian terms from this list.

Line 40 (кран), Line 71 (прыклад), Line 99 (усклікаць), Line 117 (янот), and Line 118 (ўрваць) are ordinary words, not profanities. Shipping them in the default Belarusian lexicon will flag normal text and materially raise false positives.

Also applies to: 20-20, 40-41, 71-71, 99-99, 117-118

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/be.txt` at line 18, Remove the listed benign Belarusian words
from the default lexicon to avoid false positives: delete the entries for
"кран", "прыклад", "усклікаць", "янот", and "ўрваць" (and any duplicated
occurrences in the same file/ranges noted, e.g., the occurrences around lines
referenced as 20, 40-41, 71, 99, 117-118); ensure any surrounding commas or line
breaks are cleaned up so the file remains a valid newline-separated word list
and run a quick grep to confirm no remaining benign terms remain in the
Belarusian lexicon file.
assets/data/mr.txt-48-48 (1)

48-48: ⚠️ Potential issue | 🟠 Major

Prune the everyday Marathi words from this blocklist.

Line 48 (marathi), Line 100 (अप्सरा), Line 140 (गुप्तचर), Line 161 (दृश्य), Line 162 (देव), Line 215 (मांजर), Line 249 (शीर्षक), and Line 256 (सुगंध) are not profanities. These will create noisy false positives in completely benign text.

Also applies to: 100-100, 140-140, 161-163, 215-215, 249-249, 256-256

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/mr.txt` at line 48, Remove the listed benign Marathi words from
the profanity blocklist: delete the entries "marathi", "अप्सरा", "गुप्तचर",
"दृश्य", "देव", "मांजर", "शीर्षक", and "सुगंध" (the occurrences referenced at
lines 48, 100, 140, 161-163, 215, 249, 256) so they no longer produce false
positives; ensure the blocklist file (assets/data/mr.txt) no longer contains
those tokens and update any related comments or tests that expect them to be
blocked.
assets/data/hi.txt-582-771 (1)

582-771: ⚠️ Potential issue | 🟠 Major

hi.txt is carrying a full Sinhala lexicon block.

From Line 582 onward, this file switches to Sinhala script. If the runtime selects lexicons by locale filename, the Hindi dataset will unexpectedly flag Sinhala text and the language split introduced in 2.0.0 stops being trustworthy.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/hi.txt` around lines 582 - 771, hi.txt contains a block of
Sinhala script starting around the shown diff, which will mislabel Sinhala as
Hindi; remove the Sinhala lines from hi.txt and place them into a new si.txt
(Sinhala) lexicon file, then update the lexicon loader/registry (e.g.,
loadLexicons or registerLocale / getLocaleLexicon) to load hi.txt only for "hi"
and si.txt for "si" (and add validation to reject entries outside expected
Unicode ranges for the locale to prevent future cross-contamination).
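The Unicode-range validation suggested above can be sketched like this (Python; the helper and bucket names are hypothetical, using the Devanagari block U+0900-U+097F for hi and the Sinhala block U+0D80-U+0DFF for si):

```python
def split_by_script(lines):
    """Partition lexicon lines into Devanagari ("hi") and Sinhala ("si")
    buckets by Unicode block; mixed or unrecognized lines go to "review"."""
    def block(ch):
        if "\u0900" <= ch <= "\u097f":
            return "hi"
        if "\u0d80" <= ch <= "\u0dff":
            return "si"
        return None

    buckets = {"hi": [], "si": [], "review": []}
    for line in lines:
        scripts = {s for ch in line if (s := block(ch)) is not None}
        if len(scripts) == 1:
            buckets[scripts.pop()].append(line)
        else:
            buckets["review"].append(line)
    return buckets
```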
assets/data/lv.txt-19-19 (1)

19-19: ⚠️ Potential issue | 🟠 Major

This Latvian list also has encoding-corrupted entries.

Line 19 (bezdçt), Line 30 (dirsâ), Line 100 (pajât), Line 142 (pimpja dzîsla), Line 144 (pipïuks), Line 154 (pïûtîzers), and Line 180-Line 182 (sûdcaurums, sûdu stûmçjs, sûdîgs) are malformed. With exact string matching, these tokens will miss the real Latvian forms entirely.

Also applies to: 30-30, 100-100, 142-144, 154-154, 176-182

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/lv.txt` at line 19, The listed Latvian tokens are corrupted by
wrong encoding (e.g., "bezdçt", "dirsâ", "pajât", "pimpja dzîsla", "pipïuks",
"pïûtîzers", "sûdcaurums", "sûdu stûmçjs", "sûdîgs"); open assets/data/lv.txt in
a UTF-8-capable editor or re-decode the file using the correct source encoding,
normalize the file to UTF-8 NFC, and replace each malformed token with the
correct Latvian form (or restore from the original source list) so that exact
string matching will succeed for "bezdçt", "dirsâ", "pajât", "pimpja dzîsla",
"pipïuks", "pïûtîzers", "sûdcaurums", "sûdu stûmçjs", and "sûdîgs".
assets/data/da.txt-21-22 (1)

21-22: ⚠️ Potential issue | 🟠 Major

Fix the mojibake before publishing this Danish lexicon.

Line 21, Line 22, Line 127, Line 128, Line 154, Line 166, Line 183, and Line 184 contain broken UTF-8/Latin-1 text (bã¦skubber, bã¸sserã¸v, rã¸vbanan, ã˜lfisse, etc.). The matcher will never hit the real Danish spellings, so these entries are effectively dead and the locale coverage is wrong.

Also applies to: 127-128, 154-154, 166-166, 183-184

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/da.txt` around lines 21 - 22, The file assets/data/da.txt
contains mojibake (mis-encoded Latin-1/UTF-8) entries such as "bã¦skubber",
"bã¸sserã¸v", "rã¸vbanan" and "ã˜lfisse" that must be fixed to the correct
Danish spellings (replace the garbled sequences with the proper characters: æ,
ø, Å/Ø, etc.) — either re-open the file with the correct encoding and save as
UTF-8 or run a re-encoding step (e.g., convert from Latin-1/ISO-8859-1 to
UTF-8), then replace the mis-decoded tokens on the noted lines
(21,22,127-128,154,166,183-184) so the entries read the real Danish words (e.g.,
change "bã¦" -> "bæ", "bã¸" -> "bø", "rã¸v" -> "røv", "ã˜" -> "Ø" as
appropriate) and re-run the matcher to confirm locale coverage.
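The re-decoding step this prompt describes can be sketched as below. This is a hypothetical helper, not code from the repo; it assumes the entries are UTF-8 text that was mis-decoded as Windows-1252 and then lowercased, which is consistent with tokens like bã¦skubber and ã˜lfisse.

```python
def fix_mojibake(token: str) -> str:
    # Lowercasing the lexicon turned the UTF-8 lead byte 0xC3 ("Ã") into "ã";
    # restore it first. Safe here because "ã" is not a Danish letter.
    candidate = token.replace("ã", "Ã")
    try:
        # Undo the wrong decode: back to the original bytes via cp1252,
        # then decode them as the UTF-8 they always were.
        return candidate.encode("cp1252").decode("utf-8").lower()
    except UnicodeError:
        return token  # already clean, or corrupted some other way

print(fix_mojibake("bã¦skubber"))  # bæskubber
print(fix_mojibake("ã˜lfisse"))   # ølfisse
```

Whatever repair is applied, re-saving the file as UTF-8 and spot-checking the results against native spellings (as the prompt says) remains the final word.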
assets/data/ml.txt-101-101 (1)

101-101: ⚠️ Potential issue | 🟠 Major

Normalize the malformed Malayalam tokens.

Line 101 and Line 252 include trailing punctuation, and Line 236 (ഗര് ഭച്ഛിദ്രം), Line 323 (മൂന്നു വര് ഷത്തിനുള്ളില്), and Line 331 (രതിമൂര് ച്ഛ) contain intra-word splits. These entries will only match the malformed source text, not the actual Malayalam spellings users type.

Also applies to: 236-236, 252-252, 323-323, 331-331

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/ml.txt` at line 101, The file contains malformed tokens with
intra-word splits and trailing punctuation (examples: "ninte ammede kothil.",
"ഗര് ഭച്ഛിദ്രം", "മൂന്നു വര് ഷത്തിനുള്ളില്", "രതിമൂര് ച്ഛ"); fix by replacing
each malformed token with its contiguous/canonical form (remove internal spaces
and trailing punctuation) and apply Unicode normalization (NFC) so entries match
normal Malayalam orthography; update the specific entries for the tokens shown
(remove the trailing "." from "ninte ammede kothil." and join split Malayalam
tokens like "ഗര് ഭച്ഛിദ്രം", "മൂന്നു വര് ഷത്തിനുള്ളില്", "രതിമൂര് ച്ഛ" into
their correct single-word spellings) so they will match user input.
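The punctuation and NFC part of this cleanup is mechanical and could be sketched as below (a hypothetical helper, not repo code). The intra-word splits (ഗര് ഭച്ഛിദ്രം and the like) still need manual rejoining, since legitimate multi-word phrases in the file also contain spaces.

```python
import unicodedata

def normalize_entry(entry: str) -> str:
    # Trim surrounding whitespace and trailing punctuation, then apply NFC
    # so visually identical strings also compare equal byte-for-byte.
    cleaned = entry.strip().rstrip(".!?,")
    return unicodedata.normalize("NFC", cleaned)

print(normalize_entry("ninte ammede kothil."))       # trailing dot removed
print(normalize_entry("cafe\u0301") == "caf\u00e9")  # decomposed -> NFC: True
```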
assets/data/cy.txt-1-1 (1)

1-1: ⚠️ Potential issue | 🟠 Major

Prune common Welsh vocabulary from the blocklist.

adferiad, brawf, cyfrifiadur, orau, and tafarn are normal words, not profanity. Keeping them here will cause broad false positives in ordinary Welsh content.

Also applies to: 9-9, 41-41, 96-96, 143-143

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/cy.txt` at line 1, Remove the listed Welsh words from the
blocklist in assets/data/cy.txt: delete the entries "adferiad", "brawf",
"cyfrifiadur", "orau", and "tafarn" (and remove their duplicate occurrences at
the other indicated locations). Search the file for those exact tokens (and the
other mentioned line occurrences) and remove them so common Welsh vocabulary is
not treated as profanity; ensure no other unrelated words are altered.
assets/data/de.txt-21-24 (1)

21-24: ⚠️ Potential issue | 🟠 Major

Remove obviously benign German terms from the profanity list.

ambiguität, ameisenbär, babyface, and ich liebe dich are ordinary phrases, not abusive content. Shipping them as blocked tokens will produce false positives in normal text.

Also applies to: 66-66, 289-289

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/de.txt` around lines 21 - 24, Remove the listed benign German
tokens from the profanity block: delete "ambiguität", "ameisenbär", "babyface",
and the phrase "ich liebe dich" from assets/data/de.txt (they also appear at the
other occurrences referenced around lines 66 and 289); ensure the profanity
array or list entries (the literal strings "ambiguität", "ameisenbär",
"babyface", "ich liebe dich") are removed so they no longer produce false
positives while preserving the rest of the file.
assets/data/de.txt-22-23 (1)

22-23: ⚠️ Potential issue | 🟠 Major

Don’t block neutral identity labels as standalone profanity.

Entries like ambisexuell, homo, lesbe, pansexuell, schwul, and schwuler are neutral descriptors on their own. Treating them as profanity will incorrectly censor non-abusive content and creates an avoidable moderation bias.

Also applies to: 260-260, 343-343, 412-412, 508-509

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/de.txt` around lines 22 - 23, The profanity list currently
contains neutral identity labels (e.g., "ambisexuell", "ambosexuell", "homo",
"lesbe", "pansexuell", "schwul", "schwuler"); remove these entries from the
blocked/profanity set or move them into an explicit "allowed_identity_terms"
whitelist and update matching so these tokens are only flagged when appearing in
abusive context (e.g., require surrounding insult patterns or negative
sentiment) rather than being blocked as standalone words; ensure the same change
is applied for the other occurrences called out (lines referenced in the
comment).
assets/data/de.txt-108-108 (1)

108-108: ⚠️ Potential issue | 🟠 Major

Fix the mojibake entries before building the German trie.

These lines contain broken UTF-8 text (scheiãÿegal, jã¤ger, beschiãÿ, scheiãÿe, drecksgã¶r). They will miss real matches and add dead entries to the automaton.

Suggested cleanup
-das ist mir scheiãÿegal
+das ist mir scheißegal
-sie sind das essen und wir sind die jã¤ger
+sie sind das essen und wir sind die jäger
-so ein beschiãÿ
+so ein beschiß
-verdammte scheiãÿe
+verdammte scheiße
-verhurtes drecksgã¶r
+verhurtes drecksgör

Also applies to: 514-514, 517-517, 564-564, 575-575

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/de.txt` at line 108, Lines in assets/data/de.txt contain mojibake
sequences (e.g. "scheiãÿegal", "jã¤ger", "beschiãÿ", "scheiãÿe", "drecksgã¶r")
that must be replaced with their correct UTF-8 German forms before building the
trie; update those entries to the proper strings (for example: scheißegal,
jäger, beschiss/beschiß as appropriate, scheiße, drecksgöre) and scan the file
for the other reported occurrences (lines around the listed offsets) to replace
all similar mojibake variants so the automaton receives valid UTF-8 tokens.
🟡 Minor comments (5)
assets/data/kab.txt-19-20 (1)

19-20: ⚠️ Potential issue | 🟡 Minor

Deduplicate this repeated Kabyle entry.

tteqqun appears twice consecutively. If the loader does not normalize duplicates, this adds useless trie nodes and can also produce duplicate matches.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/kab.txt` around lines 19 - 20, Remove the duplicated Kabyle entry
"tteqqun" so it appears only once in the data set and, if applicable, update the
loader normalization step that reads this file to ignore or deduplicate
identical consecutive entries (e.g., collapse duplicate tokens during parsing)
to prevent creating redundant trie nodes or duplicate matches.
assets/data/hy.txt-49-49 (1)

49-49: ⚠️ Potential issue | 🟡 Minor

Remove the invisible character from this token.

Line 49 appears to contain a trailing soft hyphen/non-printing character after klris glookh. That will make exact matching unreliable and is very hard to debug once this file is compiled into the automaton.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/hy.txt` at line 49, The token "klris glookh" contains a trailing
invisible soft-hyphen/non-printing character; open the file and remove the
non-printing character after "klris glookh" (e.g., U+00AD or other zero-width
chars), save the file in UTF-8, and re-run a grep/hex check (search for U+00AD
or use a tool to show hidden characters) to confirm no residual invisible
characters remain in the token.
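A quick way to surface such characters is sketched below (a hypothetical helper, not repo code). One caveat: ZWNJ/ZWJ are orthographically meaningful in Persian and some Indic scripts, so they should not be stripped blindly for those locales.

```python
import unicodedata

INVISIBLES = {"\u00ad", "\u200b", "\u200c", "\u200d", "\ufeff"}

def find_invisibles(token: str):
    # Report each hidden character with its Unicode name for debugging.
    return [(i, f"U+{ord(ch):04X}", unicodedata.name(ch, "<unnamed>"))
            for i, ch in enumerate(token) if ch in INVISIBLES]

def strip_invisibles(token: str) -> str:
    # Drop soft hyphens, zero-width characters, and BOMs that survive
    # copy-paste but break exact matching.
    return "".join(ch for ch in token if ch not in INVISIBLES)

token = "klris glookh\u00ad"
print(find_invisibles(token))                    # [(12, 'U+00AD', 'SOFT HYPHEN')]
print(strip_invisibles(token) == "klris glookh")  # True
```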
assets/data/eo.txt-48-48 (1)

48-48: ⚠️ Potential issue | 🟡 Minor

Fix the corrupted Esperanto entry on line 48.

åœit contains non-Esperanto characters (å, œ) that indicate UTF-8 corruption. This entry will not match normalized Esperanto input and should be corrected to the intended word.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/eo.txt` at line 48, The entry "åœit" is corrupted and must be
replaced with the correct Esperanto word: locate the token "åœit" in the eo.txt
file and replace it with the intended Esperanto spelling (retrieve the correct
word from the original source or translation memory), ensure the file is saved
as UTF-8 and normalized (NFC) so Esperanto diacritics (ĉ, ĝ, ĥ, ĵ, ŝ, ŭ) are
preserved, and run a quick validation (e.g., grep or a script) to confirm the
corrected word matches normalized Esperanto input.
assets/data/es.txt-525-525 (1)

525-525: ⚠️ Potential issue | 🟡 Minor

Inconsistent data format with tab character and English translation

Line 525 contains a tab character followed by an English translation:

cua fea balla al mearda	bitch go to shit

This format is inconsistent with the rest of the file where each line contains only a single term. While this may not cause functional issues depending on how the data is parsed, it suggests incomplete data cleaning.

📝 Recommended fix for consistent formatting
-cua fea balla al mearda	bitch go to shit
+cua fea balla al mearda

If translations are needed for documentation purposes, consider maintaining them in a separate file or using a structured format (e.g., JSON with key-value pairs).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/es.txt` at line 525, Remove the English translation and stray tab
from the line containing the term "cua fea balla al mearda" so the file returns
to the single-term-per-line format; if the translation must be preserved, move
the English phrase ("bitch go to shit") into a separate translations file or a
structured mapping (e.g., JSON) and reference that instead, ensuring the
original assets/data/es.txt contains only the single-term entry "cua fea balla
al mearda".
assets/data/cy.txt-3-3 (1)

3-3: ⚠️ Potential issue | 🟡 Minor

Remove the stray + from this literal entry.

anws blewog+ looks malformed. In a line-delimited dictionary this will only match text that literally ends with +; if any downstream code treats entries as patterns, it changes the term’s meaning entirely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/cy.txt` at line 3, The entry "anws blewog+" contains a stray '+'
at the end; open the data file and remove the trailing '+' so the literal
becomes "anws blewog" (update the exact record "anws blewog+" in
assets/data/cy.txt).
🧹 Nitpick comments (2)
assets/data/hr.txt (1)

1-243: Consider establishing data governance practices for language assets.

With 75+ language data files being introduced, establishing clear processes will help maintain quality and consistency:

  1. Data provenance: Document the sources and methodology for each language dataset
  2. Update cadence: Define how frequently these files should be reviewed and updated
  3. Quality assurance: Consider implementing validation scripts to check for encoding issues, duplicates, or formatting inconsistencies across all data files
  4. Cultural review: Given the sensitive nature of profanity filtering, consider involving native speakers or cultural consultants for accuracy
  5. Versioning: Track changes to individual language files to understand the evolution of datasets
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/hr.txt` around lines 1 - 243, The hr.txt profanity wordlist lacks
governance: add metadata and validation—prepend a standardized header to
assets/data/hr.txt (source, author, date, license, language code "hr", review
cadence) and apply a validation pipeline (scripts invoked in CI) that checks
UTF-8 encoding, trims/normalizes lines, deduplicates entries, enforces
one-term-per-line formatting, and flags suspicious characters; add a process doc
describing provenance and update cadence, require native-speaker cultural review
for hr.txt and other language files, and ensure each language file is
versioned/changed with meaningful git commits so history for assets/data/hr.txt
(and the other 75+ language files) is auditable.
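Point 3 above (validation scripts) could start from something like this sketch. It is not an existing script in the repo; file names and rules here are assumptions drawn from the issues flagged in this review (UTF-8 validity, one term per line, no tabs, no surrounding whitespace, no duplicates).

```python
import tempfile
from pathlib import Path

def validate(path: Path) -> list[str]:
    """Flag encoding problems, stray tabs, surrounding whitespace, duplicates."""
    try:
        lines = path.read_text(encoding="utf-8").splitlines()
    except UnicodeDecodeError as err:
        return [f"{path.name}: not valid UTF-8 (byte offset {err.start})"]
    problems, seen = [], set()
    for n, line in enumerate(lines, start=1):
        if line != line.strip():
            problems.append(f"{path.name}:{n}: leading/trailing whitespace")
        if "\t" in line:
            problems.append(f"{path.name}:{n}: tab character (extra field?)")
        if line in seen:
            problems.append(f"{path.name}:{n}: duplicate entry {line!r}")
        seen.add(line)
    return problems

# Demo on a throwaway file reproducing issues flagged in this review.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "kab.txt"
    sample.write_text("tteqqun\ntteqqun\nterm\tgloss\n", encoding="utf-8")
    for problem in validate(sample):
        print(problem)
```

Wired into CI over assets/data/*.txt, a check like this would have caught the kab.txt duplicate and the es.txt tabbed gloss automatically.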
README.md (1)

97-101: Minor technical imprecision in performance explanation

The explanation states: "With 10,000+ words, this grew exponentially slow."

More precisely, a nested loop approach grows linearly with the number of patterns (O(N × P)), not exponentially (O(2^N)). The performance difference vs. Aho-Corasick is still dramatic but describing it as linear growth would be more technically accurate.

📚 Suggested refinement for technical accuracy
-Legacy versions used a nested loop approach (for every bad word, run a regex). With 10,000+ words, this grew exponentially slow. 
+Legacy versions used a nested loop approach (for every bad word, run a regex). With 10,000+ words, performance degraded linearly with pattern count (O(N × P)).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@README.md` around lines 97 - 101, The README's performance description
incorrectly says the nested-loop approach grew "exponentially slow"; update that
sentence in the "Why is v2.0.0 so fast?" section to state the complexity grows
linearly with the number of patterns (e.g., O(N × P) where N is text length and
P is number of patterns) instead of using "exponentially", and optionally add a
short comparison noting Aho-Corasick runs in O(N) for scanning the text once;
edit the paragraph containing "With 10,000+ words, this grew exponentially
slow." and the following sentence referencing Aho-Corasick to correct the
complexity claim.
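For reference, the single-pass behavior the review is describing can be shown with a minimal Aho-Corasick implementation. This is an illustrative sketch, not the package's lib/src/aho_corasick.dart: build a trie over the patterns, add failure links (longest proper suffix that is also a prefix), then scan the text once.

```python
from collections import deque

def build_automaton(patterns):
    # Trie stored as parallel arrays: goto[n] maps a character to a child
    # state, fail[n] is the failure link, out[n] lists patterns ending at n.
    goto, fail, out = [{}], [0], [[]]
    for pat in patterns:
        node = 0
        for ch in pat:
            if ch not in goto[node]:
                goto.append({}); fail.append(0); out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pat)
    # BFS to fill failure links; root's children keep fail = 0.
    queue = deque(goto[0].values())
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            queue.append(child)
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] += out[fail[child]]  # inherit the suffix state's matches
    return goto, fail, out

def search(text, automaton):
    goto, fail, out = automaton
    node, hits = 0, []
    for i, ch in enumerate(text):  # one pass over the text, regardless of P
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        for pat in out[node]:
            hits.append((i - len(pat) + 1, pat))
    return hits

matcher = build_automaton(["he", "she", "his", "hers"])
print(search("ushers", matcher))  # [(1, 'she'), (2, 'he'), (2, 'hers')]
```

The scan loop touches each character of the text once, which is where the O(N) claim (plus match reporting) comes from, versus O(N × P) for running every pattern separately.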

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 03a0441f-3a6f-405b-bf97-c31b70a5244a

📥 Commits

Reviewing files that changed from the base of the PR and between 80e8a9b and 4558d95.

⛔ Files ignored due to path filters (1)
  • assets/image/safeText.png is excluded by !**/*.png
📒 Files selected for processing (88)
  • .vscode/settings.json
  • CHANGELOG.md
  • README.md
  • asset/bad_words.txt
  • assets/data/af.txt
  • assets/data/am.txt
  • assets/data/ar.txt
  • assets/data/az.txt
  • assets/data/be.txt
  • assets/data/bg.txt
  • assets/data/ca.txt
  • assets/data/ceb.txt
  • assets/data/cs.txt
  • assets/data/cy.txt
  • assets/data/da.txt
  • assets/data/de.txt
  • assets/data/dz.txt
  • assets/data/el.txt
  • assets/data/en.txt
  • assets/data/eo.txt
  • assets/data/es.txt
  • assets/data/et.txt
  • assets/data/eu.txt
  • assets/data/fa.txt
  • assets/data/fi.txt
  • assets/data/fil.txt
  • assets/data/fr.txt
  • assets/data/gd.txt
  • assets/data/gl.txt
  • assets/data/hi.txt
  • assets/data/hr.txt
  • assets/data/hu.txt
  • assets/data/hy.txt
  • assets/data/id.txt
  • assets/data/is.txt
  • assets/data/it.txt
  • assets/data/ja.txt
  • assets/data/kab.txt
  • assets/data/kh.txt
  • assets/data/ko.txt
  • assets/data/la.txt
  • assets/data/lt.txt
  • assets/data/lv.txt
  • assets/data/mi.txt
  • assets/data/mk.txt
  • assets/data/ml.txt
  • assets/data/mn.txt
  • assets/data/mr.txt
  • assets/data/ms.txt
  • assets/data/mt.txt
  • assets/data/my.txt
  • assets/data/nl.txt
  • assets/data/no.txt
  • assets/data/pih.txt
  • assets/data/piy.txt
  • assets/data/pl.txt
  • assets/data/pt.txt
  • assets/data/ro.txt
  • assets/data/rop.txt
  • assets/data/ru.txt
  • assets/data/sk.txt
  • assets/data/sl.txt
  • assets/data/sm.txt
  • assets/data/sq.txt
  • assets/data/sr.txt
  • assets/data/sv.txt
  • assets/data/ta.txt
  • assets/data/te.txt
  • assets/data/tet.txt
  • assets/data/th.txt
  • assets/data/tlh.txt
  • assets/data/to.txt
  • assets/data/tr.txt
  • assets/data/uk.txt
  • assets/data/uz.txt
  • assets/data/vi.txt
  • assets/data/yid.txt
  • assets/data/zh.txt
  • assets/data/zu.txt
  • example/lib/main.dart
  • lib/safe_text.dart
  • lib/src/aho_corasick.dart
  • lib/src/models/language.dart
  • lib/src/phone_number_checker.dart
  • lib/src/safe_text_filter.dart
  • pubspec.yaml
  • test/phone_number_test.dart
  • test/profanity_filter_test.dart
💤 Files with no reviewable changes (1)
  • asset/bad_words.txt


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

♻️ Duplicate comments (3)
assets/data/it.txt (1)

1685-1685: ⚠️ Potential issue | 🔴 Critical

Drop the remaining malformed Italian entries.

va ciapã i rat and x valutato are still garbage tokens, so they won't match real Italian input and just pollute the Aho–Corasick trie.

Suggested fix
-va ciapã i rat
-x valutato

Also applies to: 1732-1732

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/it.txt` at line 1685, Remove the malformed Italian tokens from
the dataset: delete the entries "va ciapã i rat" and "x valutato" (and the
duplicate at the other reported location) so they no longer populate the
Aho–Corasick trie; search for these exact strings in assets/data/it.txt and
remove those lines to keep only valid Italian phrases.
assets/data/es.txt (1)

1122-1122: ⚠️ Potential issue | 🔴 Critical

Purge GitHub/UI strings from the Spanish lexicon.

open, pricing, privacy, projects, pull requests, security, skip to content, terms, and training are not profanity. Their presence means the source data still contains scraped site chrome, which will cause false positives.

Also applies to: 1300-1303, 1311-1311, 1414-1414, 1435-1435, 1481-1481, 1532-1532

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/es.txt` at line 1122, Remove the UI/chrome terms found in the
Spanish lexicon entries (e.g., the literal tokens "open", "pricing", "privacy",
"projects", "pull requests", "security", "skip to content", "terms", "training")
because they are scraped GitHub/UI strings causing false positives; search the
es.txt dataset for those exact tokens (including the duplicates noted at the
ranges around lines with "open" and the other listed entries) and delete or
exclude those lines/entries from the lexicon, then re-run the dataset
sanitization to ensure no other site-chrome tokens remain.
assets/data/ar.txt (1)

960-965: ⚠️ Potential issue | 🔴 Critical

Remove truncated Arabic fragments before building the trie.

Entries like ابو ر, الحس ط, الله لا, ان, بز, and ف are fragments or near-universal tokens, so the automaton will match ordinary Arabic text far too often.

Also applies to: 991-1019, 1045-1050, 1186-1186

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/data/cs.txt`:
- Line 127: Remove the entry "peklo" from the Czech blacklist file
(assets/data/cs.txt) so the profanity list no longer flags it; this aligns with
the test expectation in test/language_data_test.dart which asserts
containsBadWord('peklo') returns false for Language.czech. Update the cs.txt
content to exclude "peklo" and run the tests to verify containsBadWord and
Language.czech behavior is correct.

In `@assets/data/es.txt`:
- Line 630: Remove the standalone token 'es' from the stopword/substring list:
locate the literal entry "es" in the data file and delete it so the
substring-matching automaton no longer includes this common Spanish
verb/stopword; after removal, regenerate or reload any derived automaton/index
and run the substring-matching tests to confirm no excessive matches now occur.

In `@assets/data/fa.txt`:
- Line 149: The Persian words 'برنامه نویس', 'خداحافظ', 'مروارید', and 'نجار'
are benign and must be removed from the profanity blacklist in the fa.txt data
file; remove these exact tokens wherever they appear (including duplicate
occurrences) so normal Persian text is not falsely flagged, keeping the file
format (one token per line) intact and running any blacklist/unit tests to
confirm no regressions.

In `@assets/data/hi.txt`:
- Line 424: Remove the four neutral Hindi entries from the Hindi blacklist:
"आरएसी", "गोली चलाने की आवाज़", "निर्माण", and "पृथ्वी पर" (they are ordinary
words causing false positives); edit assets/data/hi.txt to delete or comment out
these exact string entries wherever they occur (including the occurrences noted
around the file regions referenced), then run the filter/unit tests to confirm
no Hindi-only false positives remain.

---

Duplicate comments:
In `@assets/data/es.txt`:
- Line 1122: Remove the UI/chrome terms found in the Spanish lexicon entries
(e.g., the literal tokens "open", "pricing", "privacy", "projects", "pull
requests", "security", "skip to content", "terms", "training") because they are
scraped GitHub/UI strings causing false positives; search the es.txt dataset for
those exact tokens (including the duplicates noted at the ranges around lines
with "open" and the other listed entries) and delete or exclude those
lines/entries from the lexicon, then re-run the dataset sanitization to ensure
no other site-chrome tokens remain.

In `@assets/data/it.txt`:
- Line 1685: Remove the malformed Italian tokens from the dataset: delete the
entries "va ciapã i rat" and "x valutato" (and the duplicate at the other
reported location) so they no longer populate the Aho–Corasick trie; search for
these exact strings in assets/data/it.txt and remove those lines to keep only
valid Italian phrases.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7ebd260d-5b76-4000-9774-6b6dd2d85786

📥 Commits

Reviewing files that changed from the base of the PR and between 4558d95 and 6316e0a.

📒 Files selected for processing (14)
  • assets/data/ar.txt
  • assets/data/cs.txt
  • assets/data/es.txt
  • assets/data/et.txt
  • assets/data/fa.txt
  • assets/data/hi.txt
  • assets/data/it.txt
  • assets/data/ja.txt
  • assets/data/kh.txt
  • assets/data/kn.txt
  • assets/data/ko.txt
  • assets/data/zh.txt
  • lib/src/models/language.dart
  • test/language_data_test.dart
✅ Files skipped from review due to trivial changes (3)
  • assets/data/kn.txt
  • assets/data/et.txt
  • assets/data/kh.txt

بخورش
بدبخت
بدم
برنامه نویس

⚠️ Potential issue | 🟠 Major

Remove benign Persian words from the profanity set.

برنامه نویس, خداحافظ, مروارید, and نجار are neutral Persian words. Putting them in the blacklist will flag normal text and undermine the new language-specific filtering.

Also applies to: 233-233, 442-442, 460-460

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/fa.txt` at line 149, The Persian words 'برنامه نویس', 'خداحافظ',
'مروارید', and 'نجار' are benign and must be removed from the profanity
blacklist in the fa.txt data file; remove these exact tokens wherever they
appear (including duplicate occurrences) so normal Persian text is not falsely
flagged, keeping the file format (one token per line) intact and running any
blacklist/unit tests to confirm no regressions.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (1)
assets/data/fa.txt (1)

440-440: ⚠️ Potential issue | 🟠 Major

Remove neutral Persian words from the blacklist.

مروارید and نجار are benign words and can trigger false positives in normal Persian text.

Also applies to: 458-459

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/fa.txt` at line 440, Remove the benign Persian words causing
false positives by deleting the blacklist entries for "مروارید" and "نجار" from
the fa.txt blacklist and also remove their duplicate occurrences noted around
the other reported lines; ensure you only remove those exact tokens (keeping
file encoding/line endings intact) and run a quick scan to confirm no other
duplicate neutral Persian words remain in the list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@assets/data/cs.txt`:
- Line 9: Remove the malformed/non‑Czech tokens from the Czech lexicon by
deleting entries like "biezh do haaye", "churak", "fuc ty", "koòomrd", "kraavo",
"shinging", and "show-staat" (and the similar noise noted at the other
occurrences) so the lexicon only contains valid Czech words; locate these exact
token strings in the cs.txt word list and remove or replace them with correct
Czech equivalents where available, then run the lexicon validation/CI check to
ensure no encoding or malformed entries remain.

In `@assets/data/es.txt`:
- Around line 58-59: Replace the mojibake tokens in the Spanish data file with
their correct UTF-8 forms: locate and change "andã¡" to "andá", "coã±o" to
"coño", "cuzã£o" to "cuzão" (or the intended word), "me cago en el corazã³n" to
"me cago en el corazón", "me jodã­" to "me jodí", and similar corrupted
sequences (any "ã" followed by a stray symbol) to the proper accented
characters (á, é, í, ó, ú, ñ); verify and normalize the whole file encoding to
UTF-8 (no BOM) after edits to prevent recurrence and ensure these tokens no
longer create spurious automaton states.
- Line 523: The entry "cua fea balla al mearda\tbitch go to shit" mixes a
Spanish profanity token with an English gloss; remove the tab and English gloss
so the lexicon line contains only the single profanity token ("cua fea balla al
mearda"), or alternatively split into two valid entries (one for the Spanish
token and a separate mapping/line for the English gloss) to keep each lexicon
line a single token.

In `@assets/data/fa.txt`:
- Line 103: Remove the scraped classification artifact token "x امتیاز" from the
Persian lexicon in assets/data/fa.txt; open the file and delete the exact line
containing 'x امتیاز' (and ensure no trailing whitespace or duplicate empty
lines are left), then save the file so the lexicon only contains valid profanity
entries.

In `@assets/data/hi.txt`:
- Around line 566-755: The hi.txt file contains Sinhala-script entries that are
being loaded by Language.hindi (which only loads assets/data/hi.txt), causing
Sinhala text to be matched as Hindi; remove or relocate the Sinhala block from
assets/data/hi.txt into the correct Sinhala lexicon (e.g., create/update
assets/data/si.txt) and ensure Language.hindi only loads Devanagari/true-Hindi
entries; as a precaution, add a small validation step where the Hindi loader
(Language.hindi) filters or rejects entries that contain non-Devanagari
codepoints before adding them to the Hindi lexicon.

In `@assets/data/it.txt`:
- Line 1284: The line containing "sborra, sburro" combines two separate terms;
split this entry into two distinct lines—one with "sborra" and one with
"sburro"—so look-up/matching will work for each term independently and remove
the comma-separated form.

---

Duplicate comments:
In `@assets/data/fa.txt`:
- Line 440: Remove the benign Persian words causing false positives by deleting
the blacklist entries for "مروارید" and "نجار" from the fa.txt blacklist and
also remove their duplicate occurrences noted around the other reported lines;
ensure you only remove those exact tokens (keeping file encoding/line endings
intact) and run a quick scan to confirm no other duplicate neutral Persian words
remain in the list.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 46116ce7-6850-475c-86f4-fad88b2e2b2c

📥 Commits

Reviewing files that changed from the base of the PR and between 6316e0a and 7b78a42.

📒 Files selected for processing (6)
  • assets/data/cs.txt
  • assets/data/es.txt
  • assets/data/fa.txt
  • assets/data/hi.txt
  • assets/data/it.txt
  • lib/src/safe_text_filter.dart

batich
bestialita
bestiální
biezh do haaye

⚠️ Potential issue | 🟡 Minor

Remove malformed/non-Czech tokens from the Czech lexicon.

These entries look like noise or encoding artifacts (biezh do haaye, churak, fuc ty, koòomrd, kraavo, shinging, show-staat) and are unlikely to match intended Czech profanity reliably.

Suggested cleanup
-biezh do haaye
-churak
-fuc ty
-koòomrd
-kraavo
-shinging
-show-staat

Also applies to: 21-21, 49-49, 77-77, 79-79, 164-165

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/cs.txt` at line 9, Remove the malformed/non‑Czech tokens from the
Czech lexicon by deleting entries like "biezh do haaye", "churak", "fuc ty",
"koòomrd", "kraavo", "shinging", and "show-staat" (and the similar noise noted
at the other occurrences) so the lexicon only contains valid Czech words; locate
these exact token strings in the cs.txt word list and remove or replace them
with correct Czech equivalents where available, then run the lexicon
validation/CI check to ensure no encoding or malformed entries remain.

Comment on lines +58 to +59
andã¡ a hacerte coger
andã¡ a hacerte ver el culo

⚠️ Potential issue | 🟠 Major

Fix mojibake/encoding-corrupted Spanish entries.

Multiple entries appear UTF-8-corrupted (andã¡, coã±o, cuzã£o, me cago en el corazã³n..., me jodã­, etc.). These patterns will miss real input and add noisy automaton states.

Also applies to: 142-142, 402-402, 515-515, 562-562, 1017-1018, 1033-1033

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/es.txt` around lines 58 - 59, Replace the mojibake tokens in the
Spanish data file with their correct UTF-8 forms: locate and change "andã¡" to
"andá", "coã±o" to "coño", "cuzã£o" to "cuzão" (or the intended word), "me cago
en el corazã³n" to "me cago en el corazón", "me jodã­" to "me jodí", and similar
corrupted sequences (any "ã" followed by a stray symbol) to the proper accented
characters (á, é, í, ó, ú, ñ); verify and normalize the whole file encoding
to UTF-8 (no BOM) after edits to prevent recurrence and ensure these tokens no
longer create spurious automaton states.

cremita
creta
crica
cua fea balla al mearda bitch go to shit

⚠️ Potential issue | 🟠 Major

Remove mixed translation text from a lexicon line.

This line includes a tab and an English gloss (bitch go to shit) in the same entry, which is not a valid single profanity token.

Suggested fix
-cua fea balla al mearda	bitch go to shit
+cua fea balla al mearda
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
cua fea balla al mearda bitch go to shit
cua fea balla al mearda
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/es.txt` at line 523: the entry "cua fea balla al mearda\tbitch
go to shit" mixes a Spanish profanity token with an English gloss; remove the
tab and the English gloss so the lexicon line contains only the single profanity
token ("cua fea balla al mearda"), or alternatively split it into two valid
entries (one for the Spanish token and a separate mapping/line for the English
gloss) so that each lexicon line holds a single token.
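A one-line cleanup sketch (hypothetical, not part of this PR) that keeps only the first tab-separated field of each lexicon line, dropping any trailing gloss:

```python
def strip_gloss(line: str) -> str:
    # Keep only the token before the first tab; glosses after it are discarded.
    return line.split("\t", 1)[0].strip()

print(strip_gloss("cua fea balla al mearda\tbitch go to shit"))
```

Lines without a tab are returned unchanged, so the function can be mapped over every data file.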

tokhm
torke char
valade zenaa
x امتیاز

⚠️ Potential issue | 🟡 Minor

Remove classification artifact token from Persian data.

x امتیاز looks like scraped metadata rather than a profanity term and should not be part of the lexicon.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/fa.txt` at line 103: remove the scraped classification artifact
"x امتیاز" from the Persian lexicon; open the file, delete the exact line
containing "x امتیاز" (leaving no trailing whitespace or duplicate empty lines),
then save so the lexicon contains only valid profanity entries.

Comment on lines +566 to +755
අකේ පුරිනවා
අකේ පුරින්නා
අප්පට හුකනවා
අප්පට හුකන්න
අම්බකිස්ස
අම්මගෙ රෙද්ද
අම්මට හුකනවා
අම්මට හුකන්න
අම්ම්ට
අම්මාගේ රෙද්ද
ඇට
ඇට දෙක
ඇට්ටිම්බ
ඇන්ට පාර
ඇරගනින්
ඇහැළමල් වේසි
උත්ති
උත්තික් කොල්ලා
උත්තියේ
උඹෙ අම්මගෙ
උඹෙ අම්මට
උඹෙ අම්මා
උඹේ අම්මා
උරනවා
උරපං
උරපන්
ඒසි
කළු බඩ්ඩ
කැන්න පොරි
කැන්න පොරියා
කැරි
කැරි තම්බියො
කැරි බඩුව
කැරි මයිල්
කැරි ලොඳයා
කැරි වේසි
කැරි සැප
කැරි සීන්
කැරි හම්බයො
කැරිය
කැරියා
කැස් පුරි
කැස්පුරි
කෑස් පුරියා
කෑස්පුරි
කිම්බ
කිඹි කාවයියා
කිඹි ජූස්
කිඹි සිඹා
කිඹිසිඹා
කුක්කු
කුට්ටං චූටි
කොන කපනවා
කොනකපාල
කොනකපාලා
කොයිපටා
කෝටු පයිය
ගැට මිරිකනවා
ගැනු හොරා
ගෑණු හොරා
ගොං කැරිය
ගොං බිජ්ජා
ගොටුකොළ පොන්නයා
ගොටුකොළ හුකන්නා
ජැක් ගහපන්
ජැක්සන්
ජුන්ඩා
ටඉකේ
ටඋකනවා
ටහිකේ
ටහුකන්න
ටොපා
ටොම්බ හිල
ටෞ
ටෞකණ්ඩ
ටෞකන්න
ටෞක්
තන
තන් දෙක
තොංගලේ
තොගෙ අම්මට
දාර පයිය
දාර බඩුව
දිව දානව
දිව දානවා
නිල්මානෙල් හුත්ති
පක
පකට
පකය
පකයා
පකේ
පකෝ
පට්ට වේසි
පම්ක
පම්කයා
පයි චූටි
පයිකොටා
පයිය
පයියා
පර බැල්ලි
පර වේසි
පාලම් කොකා
පුක
පුක පලනවා
පුක වටේ
පුක සුදුද
පුක්මන්තා
පුකේ අරිනවා
පුකේ මයිල්
පුකේ මයිලා
පුකේ හිල
පෙට්ටිය
පෙට්ටිය කඩනවා
පෙට්ටිය කැඩිලද
පොන්න කැරියා
පොන්න පකයා
පොන්න පකා
පොන්න වේස මනමාලි
පොන්නයා
පොරෝ පාර
පොල් කාවයිය
පොල් කාවයියා
පොල්ල
පෝලිමක් දානවා
පෝලිම් දානවා
බඩ වෙනවා
බඩු ආව
බඩු ආවා
බඩු කාරයා
බඩු ලොල්ලා
බැක් ගහනව
බැක් ගහනවා
බැක්සයිඩ් ඕකේ
බැල්ලි
බැල්ලිගෙ පුතා
බේසික්ගෙ
බේසිකෙ
බේසිගෙ
මෑ ඇටේ
මෑ මල
මුහුදු හුකන්නා
රෙද්ද උස්සගෙන
රෙද්ද උස්සනවා
ලබ්බ
ලව් ජූස්
ලෑලි පුක
ලෙවකනවා
ලොල්ලා
ලොවනවා
වටඇල්ල
වටේ යනවා
වම්බටු පයියා
වලත්තයා
වල්ල
වල්ල පට්ට
වැල
වැල බලනවා
වේස බල්ලා
වේස බැල්ලි
වේසාවා
වේසිගෙ පුතා
වේසිගේ පුතා
වේසෞත්තා
සක්
සක්කිලි
සක්කිලි බල්ලා
සක්කිලියා
හම්බ කැරියා
හැමිනෙනව
හැමිනෙනවා
හුකනව දාලා
හුකනවනෙ ඉතින්
හුකනවනේ ඉතින්
හුකනවා
හුකනවා දාලා
හුකන්න
හුකන්නා
හුකපං
හුක්
හුකා
හුත්ත
හුත්තා
හුත්තික් කොල්ලා
හුත්තිගෙ කොල්ලා
හුත්තිගෙ පුතා
හුත්තිගේ පුතා
හුප්තා
හුම්කන
හුම්තා
ෆක්

⚠️ Potential issue | 🟠 Major

Hindi lexicon includes Sinhala-script entries.

This block is Sinhala, not Hindi. Because Language.hindi loads only assets/data/hi.txt, these entries make Hindi filtering match Sinhala text unintentionally.

🧰 Tools
🪛 LanguageTool

[uncategorized] ~575-~575: Ger doublet «ඇට ඇට».
Context: ...ුකනවා අම්මට හුකන්න අම්ම්ට අම්මාගේ රෙද්ද ඇට ඇට දෙක ඇට්ටිම්බ ඇන්ට පාර ඇරගනින් ඇහැළමල් ව...

(GER_DOUBLET)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/hi.txt` around lines 566-755: the file contains Sinhala-script
entries that are loaded by Language.hindi (which reads only
assets/data/hi.txt), causing Sinhala text to be matched as Hindi. Remove the
Sinhala block from assets/data/hi.txt, or relocate it to the correct Sinhala
lexicon (e.g., create/update assets/data/si.txt), and ensure Language.hindi
loads only Devanagari/true-Hindi entries. As a precaution, add a small
validation step so the Hindi loader filters out entries containing
non-Devanagari codepoints before adding them to the Hindi lexicon.
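The suggested validation step could look like the following sketch (hypothetical; the package itself is Dart, so this only illustrates the check): accept a token for the Hindi lexicon only when every letter falls in the Devanagari block (U+0900..U+097F), allowing spaces and digits as separators:

```python
def is_devanagari(token: str) -> bool:
    # Every character must be Devanagari, whitespace, or a digit.
    return all(
        0x0900 <= ord(ch) <= 0x097F or ch.isspace() or ch.isdigit()
        for ch in token
    )

print(is_devanagari("गाली"))  # Devanagari -> True
print(is_devanagari("පක"))    # Sinhala    -> False
```

Running this over hi.txt would flag the entire Sinhala block above for relocation.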

sborona
sborone
sborra
sborra, sburro

⚠️ Potential issue | 🟡 Minor

Split combined terms into separate entries.

Line 1284 stores two terms in one line (sborra, sburro), so matching will require the comma form instead of either word independently.

Suggested fix
-sborra, sburro
+sborra
+sburro
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@assets/data/it.txt` at line 1284: the line "sborra, sburro" combines two
separate terms; split this entry into two distinct lines, one with "sborra" and
one with "sburro", so lookup/matching works for each term independently, and
remove the comma-separated form.
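A normalization sketch (hypothetical, not part of this PR) that expands comma-combined lexicon lines into one token per entry:

```python
def split_entry(line: str) -> list[str]:
    # Split on commas, trim whitespace, and drop any empty fragments.
    return [part.strip() for part in line.split(",") if part.strip()]

print(split_entry("sborra, sburro"))  # -> ['sborra', 'sburro']
print(split_entry("sborona"))         # -> ['sborona']
```

Applied while loading the data files, this keeps each automaton pattern a single independent term.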

@master-wayne7 master-wayne7 merged commit bcfb409 into master Mar 13, 2026
2 checks passed
@master-wayne7 master-wayne7 deleted the version-2.0.0 branch April 6, 2026 06:03

Labels

enhancement New feature or request
