updated the version to 2.0.0, introduced multi lingual filtering #1
master-wayne7 merged 3 commits into master from
Conversation
…roved performance by applying aho-corasick algo
📝 Walkthrough
Release v2.0.0: introduces SafeTextFilter and PhoneNumberChecker backed by an Aho‑Corasick engine, removes the bundled
Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client as Client
    participant Filter as SafeTextFilter
    participant Engine as AhoCorasickEngine
    participant Data as LanguageDataStore
    Client->>Filter: init(languages)
    Filter->>Data: loadFiles(languages)
    Data-->>Filter: wordLists
    Filter->>Engine: buildTrie(wordLists)
    Engine-->>Filter: trieReady
    Client->>Filter: filterText(text, options)
    Filter->>Engine: search(text)
    Engine-->>Filter: matches
    Filter->>Client: filteredText / matches
```
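The PR's actual engine code is not visible in this view, but the flow the diagram describes (build a trie from the word lists, wire failure links, then scan the text in a single pass) can be sketched in Python. `AhoCorasick` and `filter_text` below are illustrative names under that assumption, not the library's real API.

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: insert every word into a trie,
    wire failure links breadth-first, then scan text in one pass."""

    def __init__(self, words):
        self.goto = [{}]   # per-state transition tables
        self.fail = [0]    # failure links
        self.out = [[]]    # words that end at each state
        for word in words:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        state = 0
        for ch in word:
            if ch not in self.goto[state]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[state][ch] = len(self.goto) - 1
            state = self.goto[state][ch]
        self.out[state].append(word)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())   # depth-1 states keep fail = 0
        while queue:
            state = queue.popleft()
            for ch, nxt in self.goto[state].items():
                queue.append(nxt)
                # Walk the parent's failure chain to find the longest proper
                # suffix of the current path that is also a pattern prefix.
                f = self.fail[state]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[nxt] = self.goto[f].get(ch, 0)
                # Inherit outputs reachable through the failure link.
                self.out[nxt] += self.out[self.fail[nxt]]

    def search(self, text):
        """Return (start_index, word) for every match, left to right."""
        state, matches = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for word in self.out[state]:
                matches.append((i - len(word) + 1, word))
        return matches

def filter_text(text, automaton, mask="*"):
    """Replace every matched span with mask characters."""
    chars = list(text)
    for start, word in automaton.search(text):
        for j in range(start, start + len(word)):
            chars[j] = mask
    return "".join(chars)
```

The single-pass scan is what makes the 2.0.0 approach scale: match time depends on the text length plus the number of matches, not on the number of blocklist entries.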
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~30 minutes
🚥 Pre-merge checks: ✅ Passed checks (3 passed)
Actionable comments posted: 5
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (30)
assets/data/et.txt-78-78 (1)
78-78: ⚠️ Potential issue | 🟠 Major
Fix the mojibake before publishing this Estonian dataset.
`pãµrgu` and `vã¤ike` are encoding-corrupted forms, so these patterns will miss real Estonian input and suggest the source file was imported with the wrong character encoding.
Also applies to: 139-139
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/et.txt` at line 78, Replace the mojibake tokens "pãµrgu" and "vã¤ike" with their correct Estonian forms "põrgu" and "väike" throughout the file (including the other occurrence noted), and save the file in UTF-8 encoding (or re-run a proper encoding conversion from the original charset) so future edits/publishes preserve the correct characters.
assets/data/cs.txt-11-13 (1)
11-13: ⚠️ Potential issue | 🟠 Major
Prune neutral Czech vocabulary from this blacklist.
Several entries in these ranges are common, non-abusive words (`bodnutí`, `děloha`, `knoflík`, `mezery`, `odvaha`, `odvážný`, `peklo`, `pohlaví`, `počínání`, `příruba`, `vybudovat`, `zpomalit`, `úryvek`). Shipping them in the profanity list will overblock ordinary Czech text and materially hurt precision.
Also applies to: 33-33, 66-67, 101-103, 120-122, 136-145, 153-169, 199-205, 221-225
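Many of the findings in this review reduce to the same operation: delete a human-reviewed set of entries from a newline-delimited list file. A hedged sketch of that pass (the `NEUTRAL` set is just the Czech examples from this comment, and the file handling is an assumption, not the project's tooling):

```python
# Sketch of a pruning pass over a newline-delimited blocklist file.
# The NEUTRAL set here is only the Czech examples from this review
# comment; the real set should come from a human review.
NEUTRAL = {
    "bodnutí", "děloha", "knoflík", "mezery", "odvaha", "odvážný", "peklo",
    "pohlaví", "počínání", "příruba", "vybudovat", "zpomalit", "úryvek",
}

def prune(lines, neutral=NEUTRAL):
    """Drop entries that are actually neutral vocabulary, preserving
    the order of everything else."""
    return [w for w in lines if w.strip() not in neutral]
```

Applied to `cs.txt`, this would read the file with `encoding="utf-8"`, run `prune` over its lines, and write the survivors back unchanged.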
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/cs.txt` around lines 11 - 13, Remove the neutral Czech words from the profanity blacklist: delete the entries "bodnutí", "děloha", "knoflík", "mezery", "odvaha", "odvážný", "peklo", "pohlaví", "počínání", "příruba", "vybudovat", "zpomalit", and "úryvek" (and any duplicates) so ordinary Czech text isn't overblocked; apply the same removals in the other specified ranges (lines corresponding to ranges 33-33, 66-67, 101-103, 120-122, 136-145, 153-169, 199-205, 221-225) by scanning those ranges for these neutral terms and pruning them from the blacklist file.
assets/data/mi.txt-14-15 (1)
14-15: ⚠️ Potential issue | 🟠 Major
This Māori list includes a lot of ordinary vocabulary.
Entries like `kararehe`, `pakaru`, `reta`, `tino`, `tino pai`, `uaua`, `wahine`, and multiple generic `whaka*` verbs are not profane on their own. Keeping them here will make the multilingual filter fire on normal Māori sentences.
Also applies to: 20-23, 27-38, 45-58
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/mi.txt` around lines 14 - 15, The Māori profanity list contains many ordinary, non-profane words (e.g., "kararehe", "pakaru", "reta", "tino", "tino pai", "uaua", "wahine", and generic "whaka*" verbs) which will trigger false positives; go through the entries referenced (lines around "kararehe"/"karihika" and also ranges 20-23, 27-38, 45-58), remove or relocate any non-profane vocabulary to a separate whitelist or neutral-word file, and keep only genuinely offensive terms in the profanity list (update functions/consumers that load this list to read the new whitelist if needed).
assets/data/ca.txt-6-6 (1)
6-6: ⚠️ Potential issue | 🟠 Major
The Catalan dataset is polluted with foreign or corrupted entries.
Rows like `harnröhre`, `pansexualität`, `masturbarse`, `tetas`, `retrocã©s`, and the Cyrillic terms at the end do not belong in a Catalan blacklist. They point to a bad source import and will either miss real Catalan text or introduce cross-language false positives.
Also applies to: 64-64, 79-79, 91-91, 112-112, 122-122, 135-136
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/ca.txt` at line 6, The Catalan blacklist contains foreign/corrupted entries that must be removed and the source import revalidated: remove the listed non-Catalan tokens (e.g., "harnröhre", "pansexualität", "masturbarse", "tetas", "retrocã©s", the Cyrillic terms and the current "armando reques" entry) and any similar entries at the other reported lines (64, 79, 91, 112, 122, 135-136); then update the import pipeline or source filter to enforce Catalan-language detection (or whitelist trusted sources) so future imports don't introduce cross-language or corrupted terms and run a quick QA pass to confirm only valid Catalan lemmas remain.
assets/data/af.txt-21-22 (1)
21-22: ⚠️ Potential issue | 🟠 Major
Remove unrelated names and neutral proper nouns from the Afrikaans blocklist.
`dawid jonker`, `dawie joubert`, `luna`, `oom gert`, `shrek`, and `wicked` do not read like profanity terms. Keeping them here will flag benign text and names for no moderation reason.
Also applies to: 163-163, 190-190, 217-217, 252-252
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/af.txt` around lines 21 - 22, Remove the unrelated benign entries from the Afrikaans profanity blocklist by deleting the specific strings "dawid jonker", "dawie joubert", "luna", "oom gert", "shrek", and "wicked" (and any duplicate occurrences at the other noted locations) from the af.txt blocklist so benign names and neutral nouns are no longer flagged; search for those exact tokens in af.txt and remove each line containing them, leaving the rest of the blocklist intact.
assets/data/hu.txt-49-49 (1)
49-49: ⚠️ Potential issue | 🟠 Major
This Hungarian blocklist includes routine vocabulary.
`bátorság`, `fogantyú`, `kemény`, `kiáltotta`, `majom`, `részeg`, and `épít` are normal words. Including them in the shipped matcher will create false positives in regular Hungarian text.
Also applies to: 98-98, 133-139, 170-170, 237-237, 295-295
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/hu.txt` at line 49, Remove common Hungarian words that cause false positives from the blocklist: delete entries like "bátorság", "fogantyú", "kemény", "kiáltotta", "majom", "részeg", and "épít" (and the other listed occurrences at the indicated ranges) from the hu blocklist so the matcher only contains genuinely disallowed tokens; after removal, run the matcher tests on Hungarian sample text to verify no routine vocabulary triggers remain.
assets/data/gl.txt-12-13 (1)
12-13: ⚠️ Potential issue | 🟠 Major
The Galician dataset is over-inclusive and language-mixed.
`can fuck` is not Galician, and entries like `cariño`, `durmir`, `falando`, `galiñas`, `galo`, and `pálido` are ordinary words. If these are fed into the profanity automaton, normal Galician text will be incorrectly blocked.
Also applies to: 23-23, 32-32, 41-42, 63-63
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/gl.txt` around lines 12 - 13, The Galician profanity list is over-inclusive and language-mixed: remove non-Galician entries like "can fuck" and strip ordinary Galician words ("cariño", "durmir", "falando", "galiñas", "galo", "pálido") from the automaton; instead whitelist common lemmas and only keep validated profane lemmas. Update the token normalization routine to normalize/strip diacritics and punctuation before matching, split multi-word entries into single-token checks (or remove multiword English phrases), and add a small validation step (language detection or native-speaker review) for any remaining entries to ensure only true Galician profanity is retained.
assets/data/eu.txt-17-17 (1)
17-17: ⚠️ Potential issue | 🟠 Major
Normalize this Basque list before shipping it.
This file mixes comma-joined records and non-Basque hybrids (`ipurdian-fucker`, `jainko-damned`, `prononcés`, `seme-of-a-puta`, `urbanístico`). That will both miss intended matches and introduce locale-specific false positives.
Also applies to: 23-24, 37-37, 40-40, 46-46
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/eu.txt` at line 17, The Basque word list in assets/data/eu.txt contains mixed formats and non-Basque hybrids (e.g., the comma-joined entry "gixajo, gisarajo" and items like "ipurdian-fucker", "jainko-damned", "prononcés", "seme-of-a-puta", "urbanístico") that break normalization and matching; clean this by splitting comma-joined records into individual entries (so "gixajo, gisarajo" becomes two lines "gixajo" and "gisarajo"), remove or replace hybrid/foreign entries with pure Basque equivalents, ensure each token is lowercased and diacritics normalized consistently, and apply the same fixes to the other affected lines (noted around lines 23–24, 37, 40, 46) so the functions that read this file (the list loader that expects one token per line) will yield correct, locale-pure Basque entries.
assets/data/id.txt-162-163 (1)
162-163: ⚠️ Potential issue | 🟠 Major
This Indonesian list contains many everyday words.
Entries such as `detik`, `dikirim`, `gagah`, `keberanian`, `penggemar`, `ruang`, `terima kasih`, `united`, and `ya tuhan` are normal language, not profanity. Shipping them in the matcher will create widespread false positives for Indonesian users.
Also applies to: 190-190, 284-284, 468-468, 495-495, 546-546, 571-571, 579-579
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/id.txt` around lines 162 - 163, The listed Indonesian entries (e.g., "detik", "dikirim", "gagah", "keberanian", "penggemar", "ruang", "terima kasih", "united", "ya tuhan" and the other specified line entries) are common non-profanity words and should not be in the profanity matcher; remove these tokens from the profanity list (or move them into a distinct whitelist/allowlist used by the matcher) so they no longer generate false positives; update any tests or documentation that assumed these tokens were profane and ensure the matcher still performs on true profanity cases after the change.
assets/data/hy.txt-90-90 (1)
90-90: ⚠️ Potential issue | 🟠 Major
This list includes common Armenian vocabulary, not just abusive terms.
Entries like `vor`, `ամուսին`, `ամուսինս`, `գաղտնի`, `դրսում`, and `լեգենդ` are ordinary words. If this file feeds the profanity matcher, these will produce a large number of false positives in normal Armenian text.
Also applies to: 104-105, 130-130, 140-140, 156-156
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/hy.txt` at line 90, The Armenian profanity wordlist contains common non-abusive tokens (e.g., "vor", "ամուսին", "ամուսինս", "գաղտնի", "դրսում", "լեգենդ") which will cause false positives; edit assets/data/hy.txt to remove these ordinary words (and the other mentioned lines 104-105, 130, 140, 156) from the profanity list, or relocate them into a separate benign/common-words list used by the matcher for allowlisting/whitelisting, and ensure the profanity matcher reads only the sanitized profanity list.
assets/data/mk.txt-77-82 (1)
77-82: ⚠️ Potential issue | 🟠 Major
Remove unrelated foreign phrases and placeholder tokens from the Macedonian list.
Lines 77-82 and 84 contain English profanity (`suck mine dick`, `up your mums cunt`), Serbo-Croatian entries (`turi mu kur`, `u picku mater`, `veliko mandalo`, `volim te`), and a malformed placeholder (`x оценет`). These entries will inflate false positives and degrade Macedonian language filtering precision.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/mk.txt` around lines 77 - 82, Remove the unrelated and offensive entries from the Macedonian wordlist by deleting the English profanity items ("suck mine dick", "up your mums cunt"), the Serbo-Croatian phrases ("turi mu kur", "u picku mater", "veliko mandalo", "volim te"), and the malformed placeholder token ("x оценет"); ensure only valid Macedonian words remain in the list and run a quick scan for any other non-Macedonian or placeholder strings to prevent false positives in filtering.
assets/data/ja.txt-11-11 (1)
11-11: ⚠️ Potential issue | 🟠 Major
Several romaji entries appear to be junk or placeholder data rather than legitimate Japanese profanity terms.
`baka bah oom` (line 11), `okiesawada` (line 97), `watashiwa fucketsu desu ne` (line 135), and `xx` (line 136) do not conform to valid Japanese romaji. For example, `baka bah oom` combines a real term (`baka`) with phonetically invalid elements; `watashiwa fucketsu desu ne` contains the invalid word `fucketsu`; and `xx` is a clear placeholder. These entries will not function properly in a production filter and should be removed or replaced with legitimate Japanese profanity terms before automaton building.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/ja.txt` at line 11, Remove or replace the invalid romaji entries used as profanity tokens: delete "baka bah oom", "okiesawada", "watashiwa fucketsu desu ne", and "xx" from the profanity list and either supply correct romaji equivalents or legitimate Japanese profanity terms (e.g., "baka", "fuzakeru" where appropriate) so the automaton is built from valid tokens; update any code that references these exact strings (search for the literal tokens "baka bah oom", "okiesawada", "watashiwa fucketsu desu ne", "xx") to use the corrected entries.
assets/data/ja.txt-360-362 (1)
360-362: ⚠️ Potential issue | 🟠 Major
Remove neutral Japanese nouns from the blocklist.
`女` (woman), `女の子` (girl), `王` (king), `隅` (corner), and `隙` (gap) are ordinary words. Blocking them will cause excessive false positives when filtering normal Japanese content.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/ja.txt` around lines 360 - 362, Remove the neutral Japanese nouns from the blocklist by deleting the entries for "女", "女の子", "王", "隅", and "隙" in the ja.txt blocklist so ordinary words aren't falsely filtered; locate these literal tokens in assets/data/ja.txt (e.g., the lines containing 女, 女の子, 女子高生, 王, 隅, 隙) and remove them, ensuring the file remains one-token-per-line and is saved without altering unrelated entries or order.
assets/data/it.txt-590-590 (1)
590-590: ⚠️ Potential issue | 🟠 Major
Remove ordinary Italian names, everyday phrases, and archaic expressions from the blocklist.
The terms `filippo` (common given name), `garibaldi` (historical figure/surname), `la mamma` (everyday phrase for "the mother"), `perbacco` (archaic exclamation), and `vada a bordo` (nautical phrase) are not offensive or vulgar. Blocking these in normal Italian text will generate clear false positives since they regularly appear in legitimate writing.
Also applies to: lines 640, 757, 1038, 1694
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/it.txt` at line 590, The blocklist currently includes benign Italian tokens that cause false positives; remove the entries for "filippo", "garibaldi", "la mamma", "perbacco", and "vada a bordo" from the Italian blocklist file (the occurrences shown in the diff and the duplicates at the other reported locations), ensuring you delete each exact token/phrase and any duplicate lines (lines referenced at 640, 757, 1038, 1694) while preserving file encoding and line endings so the rest of the list remains intact.
assets/data/my.txt-29-29 (1)
29-29: ⚠️ Potential issue | 🟠 Major
Remove mixed-language artifacts from Burmese lexicon.
The entries at lines 29, 33, 54, and 66 are mixed-script/mixed-language artifacts that should not be in a production Burmese lexicon:
- Line 29: `x ကအဆင့်သတ်မှတ်ထားသော` (Latin prefix)
- Line 33: `ကော်ဇော muncher` (English suffix)
- Line 54: `မိခင် fucker` (English suffix)
- Line 66: `လူရှုပ်-off` (English suffix)
These entries do not represent authentic Burmese words and will cause matching failures. Remove or normalize them before building the trie.
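This class of contamination can be screened mechanically. A sketch mirroring the `/[A-Za-z0-9\-]/` test suggested in the AI prompt below; the loader name is hypothetical, since the project's real loading code is not shown in this view:

```python
import re

# Reject any lexicon entry containing Latin letters, ASCII digits, or
# hyphens before it reaches the trie builder. This is a sketch of the
# check suggested in the review, not the project's actual loader.
LATIN_OR_ASCII = re.compile(r"[A-Za-z0-9\-]")

def keep_burmese_only(entries):
    return [e for e in entries if not LATIN_OR_ASCII.search(e)]
```

All four flagged entries above fail this check, while pure Myanmar-script tokens pass through untouched.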
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/my.txt` at line 29, The Burmese lexicon contains mixed-language artifacts (e.g., the exact strings "x ကအဆင့်သတ်မှတ်ထားသော", "ကော်ဇော muncher", "မိခင် fucker", "လူရှုပ်-off"); update the lexicon-loading step (where you read and pass entries into buildTrie or loadLexicon) to filter or normalize entries: drop any line containing Latin letters, ASCII punctuation/hyphens, or obvious English profanity, or alternatively strip trailing/leading Latin tokens so only Burmese script remains; ensure the check uses a Unicode script test (e.g., reject if regex /[A-Za-z0-9\-]/ matches) before inserting into the trie so these four offending entries are removed/normalized.
assets/data/bg.txt-126-126 (1)
126-126: ⚠️ Potential issue | 🟠 Major
The Bulgarian dataset contains encoding-corrupted entries with non-Cyrillic characters.
Tokens like `x оѧенен`, `аборѣ`, `анѣѡ`, `брѣѣалноѡѣ`, `клиѣорഀ`, and `лайнഀ` include non-Bulgarian characters (Gurmukhi and Malayalam `ഀ`), indicating mojibake. Indexing these corrupted entries will add garbage states to the automaton and miss legitimate user input.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/bg.txt` at line 126, The bg.txt dataset contains lines with mojibake/non-Bulgarian characters (examples: "x оѧенен", "аборѣ", "анѣѡ", "брѣѣалноѡѣ", "клиѣорഀ", "лайнഀ"); scan bg.txt for characters outside the Bulgarian/Cyrillic Unicode ranges (and allowed punctuation) and either remove or replace those entries; normalize text to Unicode NFC and strip/control-map any Latin, Gurmukhi, Malayalam or other-script codepoints before indexing so the automaton only receives valid Cyrillic tokens (or move suspicious lines to a quarantine report file for manual review).
assets/data/ms.txt-14-14 (1)
14-14: ⚠️ Potential issue | 🟠 Major
Remove Malaysian politician names from the Malay dataset.
The entries `anwar ibrahim`, `lancau najib`, `najib kongkek anwar`, and `rosmah bodoh` are names of Malaysian political figures (Anwar Ibrahim, Najib Razak, and Rosmah Mansor). Keeping them in a profanity filter will cause false positives on benign political news and commentary that mentions these names.
Applies to lines: 14, 102, 126, 165
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/ms.txt` at line 14, Remove the four political-name entries from the Malay profanity dataset by deleting the exact strings "anwar ibrahim", "lancau najib", "najib kongkek anwar", and "rosmah bodoh" from the dataset file; ensure these tokens are removed (or commented out) wherever they appear so they no longer trigger the profanity filter, and run a quick scan to verify no other variants of these exact phrases remain.
assets/data/mn.txt-40-48 (1)
40-48: ⚠️ Potential issue | 🟠 Major
Remove common Mongolian words from the profanity filter list.
This list contains multiple everyday words that will cause excessive filtering of normal text. Examples include:
- `өдөр` (day): a fundamental everyday word
- `бөмбөг` (ball), `титэм` (crown), `тогоонууд` (pots): standard nouns
- `дуртай` (favorite/beloved), `бөх` (wrestling): common descriptors
- `хамгийн сайн сайхан` (most beautiful), `харилцаа холбоо` (communication/relationship): standard phrases
- `эрхлэгч` (manager/administrator), `смит` (smith): common terms
These should be removed to prevent over-censoring of legitimate Mongolian text.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/mn.txt` around lines 40 - 48, The profanity list contains common Mongolian words that will cause over-filtering; remove everyday terms such as "өдөр", "бөмбөг", "бөх", "дуртай", "титэм", "тогоонууд", "хамгийн сайн сайхан", "харилцаа холбоо", "эрхлэгч", and "смит" (and similar non-profane entries found in mn.txt like "ану", "бурханлиг", "гичий", "гэж хэлэв") from assets/data/mn.txt so the filter only contains true profanities and slurs; review the file for other high-frequency nouns/phrases and delete them or move questionable items to a separate moderated list.
assets/data/ms.txt-77-77 (1)
77-77: ⚠️ Potential issue | 🟠 Major
Remove or relocate Māori language entries that contaminate this Malay word list.
The file contains te reo Māori (Māori language) entries mixed with Malay/Indonesian content. This cross-contamination compromises the profanity filter by both missing real Malay profanity and creating false positives for unrelated text.
Māori entries: lines 77, 97–98, 176, 181–182, 186, 188, 190–191, 196–199.
These should be either removed or moved to a dedicated Māori language file (e.g., `mi.txt`, which already exists in the dataset).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/ms.txt` at line 77, Remove the te reo Māori entries contaminating the Malay list by extracting the listed Māori strings (e.g., "kai whakatahe" and the entries at positions you noted: the sets at lines 97–98, 176, 181–182, 186, 188, 190–191, 196–199) from the Malay wordlist and either delete them or append them to the existing Māori wordlist (mi.txt); ensure the removed items are no longer present in assets/data/ms.txt and, if moved, are deduplicated and added to mi.txt in the same plain-text format so both lists remain accurate.
assets/data/el.txt-68-68 (1)
68-68: ⚠️ Potential issue | 🟠 Major
Drop the non-profane Greek terms and club names.
Line 68 (`olympiacos`), lines 72-75 (`panathinaikos`, `paok`, `vazelos`, `villara`), line 127 (`επάνω`), line 129 (`ευτυχώς`), line 171 (`ναι`), and line 223 (`συνάδελφο`) are normal words/proper nouns. Keeping them in the Greek profanity list will censor routine sports/news/chat content.
Also applies to: 72-75, 127-129, 171-171, 223-223
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/el.txt` at line 68, Remove the non-profane Greek terms currently included in the profanity list: "olympiacos", "panathinaikos", "paok", "vazelos", "villara", "επάνω", "ευτυχώς", "ναι", and "συνάδελφο" (these exact tokens appear in the diff): either delete these entries from the list or move them to a safe/whitelist so they are not treated as profane; ensure any normalization or lowercase matching logic that references this list (the profanity-check lookup that loads these tokens) is updated to reflect the removal so ordinary sports/news/chat content is not censored.
assets/data/be.txt-18-18 (1)
18-18: ⚠️ Potential issue | 🟠 Major
Remove obviously benign Belarusian terms from this list.
Line 40 (`кран`), line 71 (`прыклад`), line 99 (`усклікаць`), line 117 (`янот`), and line 118 (`ўрваць`) are ordinary words, not profanities. Shipping them in the default Belarusian lexicon will flag normal text and materially raise false positives.
Also applies to: 20-20, 40-41, 71-71, 99-99, 117-118
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/be.txt` at line 18, Remove the listed benign Belarusian words from the default lexicon to avoid false positives: delete the entries for "кран", "прыклад", "усклікаць", "янот", and "ўрваць" (and any duplicated occurrences in the same file/ranges noted, e.g., the occurrences around lines referenced as 20, 40-41, 71, 99, 117-118); ensure any surrounding commas or line breaks are cleaned up so the file remains a valid newline-separated word list and run a quick grep to confirm no remaining benign terms remain in the Belarusian lexicon file.
assets/data/mr.txt-48-48 (1)
48-48: ⚠️ Potential issue | 🟠 Major
Prune the everyday Marathi words from this blocklist.
Line 48 (`marathi`), line 100 (`अप्सरा`), line 140 (`गुप्तचर`), line 161 (`दृश्य`), line 162 (`देव`), line 215 (`मांजर`), line 249 (`शीर्षक`), and line 256 (`सुगंध`) are not profanities. These will create noisy false positives in completely benign text.
Also applies to: 100-100, 140-140, 161-163, 215-215, 249-249, 256-256
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/mr.txt` at line 48, Remove the listed benign Marathi words from the profanity blocklist: delete the entries "marathi", "अप्सरा", "गुप्तचर", "दृश्य", "देव", "मांजर", "शीर्षक", and "सुगंध" (the occurrences referenced at lines 48, 100, 140, 161-163, 215, 249, 256) so they no longer produce false positives; ensure the blocklist file (assets/data/mr.txt) no longer contains those tokens and update any related comments or tests that expect them to be blocked.
assets/data/hi.txt-582-771 (1)
582-771: ⚠️ Potential issue | 🟠 Major
`hi.txt` is carrying a full Sinhala lexicon block.
From line 582 onward, this file switches to Sinhala script. If the runtime selects lexicons by locale filename, the Hindi dataset will unexpectedly flag Sinhala text and the language split introduced in 2.0.0 stops being trustworthy.
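A per-locale script check would catch this class of cross-contamination at load time. The Unicode ranges below are the standard Devanagari and Sinhala blocks, but the function name is hypothetical, and real data may need extra allowances (punctuation, joiners):

```python
import re

# Hypothetical load-time guard: each locale's entries must stay within
# the expected Unicode block (plus whitespace). Ranges shown are the
# standard Devanagari (U+0900-U+097F) and Sinhala (U+0D80-U+0DFF) blocks.
SCRIPT_RANGES = {
    "hi": re.compile(r"[\u0900-\u097F\s]+"),
    "si": re.compile(r"[\u0D80-\u0DFF\s]+"),
}

def validate_lexicon(locale, entries):
    """Return only the entries whose characters match the locale's script."""
    pattern = SCRIPT_RANGES[locale]
    return [e for e in entries if pattern.fullmatch(e)]
```

Run at import time, this would have rejected the Sinhala block instead of silently shipping it under the Hindi locale.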
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/hi.txt` around lines 582 - 771, hi.txt contains a block of Sinhala script starting around the shown diff, which will mislabel Sinhala as Hindi; remove the Sinhala lines from hi.txt and place them into a new si.txt (Sinhala) lexicon file, then update the lexicon loader/registry (e.g., loadLexicons or registerLocale / getLocaleLexicon) to load hi.txt only for "hi" and si.txt for "si" (and add validation to reject entries outside expected Unicode ranges for the locale to prevent future cross-contamination).
assets/data/lv.txt-19-19 (1)
19-19: ⚠️ Potential issue | 🟠 Major
This Latvian list also has encoding-corrupted entries.
Line 19 (`bezdçt`), line 30 (`dirsâ`), line 100 (`pajât`), line 142 (`pimpja dzîsla`), line 144 (`pipïuks`), line 154 (`pïûtîzers`), and lines 180-182 (`sûdcaurums`, `sûdu stûmçjs`, `sûdîgs`) are malformed. With exact string matching, these tokens will miss the real Latvian forms entirely.
Also applies to: 30-30, 100-100, 142-144, 154-154, 176-182
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/lv.txt` at line 19, The listed Latvian tokens are corrupted by wrong encoding (e.g., "bezdçt", "dirsâ", "pajât", "pimpja dzîsla", "pipïuks", "pïûtîzers", "sûdcaurums", "sûdu stûmçjs", "sûdîgs"); open assets/data/lv.txt in a UTF-8-capable editor or re-decode the file using the correct source encoding, normalize the file to UTF-8 NFC, and replace each malformed token with the correct Latvian form (or restore from the original source list) so that exact string matching will succeed for the real spellings of "bezdçt", "dirsâ", "pajât", "pimpja dzîsla", "pipïuks", "pïûtîzers", "sûdcaurums", "sûdu stûmçjs", and "sûdîgs".
assets/data/da.txt-21-22 (1)
21-22: ⚠️ Potential issue | 🟠 Major
Fix the mojibake before publishing this Danish lexicon.
Lines 21, 22, 127, 128, 154, 166, 183, and 184 contain broken UTF-8/Latin-1 text (`bã¦skubber`, `bã¸sserã¸v`, `rã¸vbanan`, `ã˜lfisse`, etc.). The matcher will never hit the real Danish spellings, so these entries are effectively dead and the locale coverage is wrong.
Also applies to: 127-128, 154-154, 166-166, 183-184
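The same UTF-8-read-as-Latin-1 corruption shows up in the Estonian, Latvian, and German findings. When the corruption is a single faithful mis-decode, a round trip recovers the original text; a sketch (note the entries displayed in this review appear lowercased, so the bytes actually stored may not round-trip exactly and should be checked on disk):

```python
def fix_mojibake(token):
    """Repair UTF-8 text that was mis-decoded as Latin-1, e.g.
    'bÃ¦skubber' -> 'bæskubber'. If the round trip fails, the token
    was most likely already correct, so return it unchanged."""
    try:
        return token.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return token
```

Because already-correct Danish text like `røv` raises a decode error on the round trip, the function can be mapped over a whole file without damaging clean entries.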
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/da.txt` around lines 21 - 22, The file assets/data/da.txt contains mojibake (mis-encoded Latin-1/UTF-8) entries such as "bã¦skubber", "bã¸sserã¸v", "rã¸vbanan" and "ã˜lfisse" that must be fixed to the correct Danish spellings (replace the garbled sequences with the proper characters: æ, ø, Å/Ø, etc.): either re-open the file with the correct encoding and save as UTF-8 or run a re-encoding step (e.g., convert from Latin-1/ISO-8859-1 to UTF-8), then replace the mis-decoded tokens on the noted lines (21, 22, 127-128, 154, 166, 183-184) so the entries read the real Danish words (e.g., change "bã¦" -> "bæ", "bã¸" -> "bø", "rã¸v" -> "røv", "ã˜" -> "Ø" as appropriate) and re-run the matcher to confirm locale coverage.
assets/data/ml.txt-101-101 (1)
101-101: ⚠️ Potential issue | 🟠 Major
Normalize the malformed Malayalam tokens.
Lines 101 and 252 include trailing punctuation, and line 236 (`ഗര് ഭച്ഛിദ്രം`), line 323 (`മൂന്നു വര് ഷത്തിനുള്ളില്`), and line 331 (`രതിമൂര് ച്ഛ`) contain intra-word splits. These entries will only match the malformed source text, not the actual Malayalam spellings users type.
Also applies to: 236-236, 252-252, 323-323, 331-331
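The punctuation and normalization half of this cleanup is mechanical; joining intra-word splits is not, because legitimate multi-word phrases also contain spaces, so that part needs per-entry review. A sketch of the mechanical half, assuming one entry per line:

```python
import unicodedata

def normalize_entry(token):
    """NFC-normalize a lexicon entry and strip surrounding whitespace
    plus trailing punctuation, so the stored form matches what users
    actually type. Intra-word space removal is deliberately NOT
    automated here, since it would also mangle real phrases."""
    token = unicodedata.normalize("NFC", token.strip())
    return token.rstrip(".,;:!?")
```

NFC matters for Indic scripts in particular: a visually identical entry stored with decomposed combining marks will never match composed user input under exact string matching.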
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/ml.txt` at line 101, The file contains malformed tokens with intra-word splits and trailing punctuation (examples: "ninte ammede kothil.", "ഗര് ഭച്ഛിദ്രം", "മൂന്നു വര് ഷത്തിനുള്ളില്", "രതിമൂര് ച്ഛ"); fix by replacing each malformed token with its contiguous/canonical form (remove internal spaces and trailing punctuation) and apply Unicode normalization (NFC) so entries match normal Malayalam orthography; update the specific entries for the tokens shown (remove the trailing "." from "ninte ammede kothil." and join split Malayalam tokens like "ഗര് ഭച്ഛിദ്രം", "മൂന്നു വര് ഷത്തിനുള്ളില്", "രതിമൂര് ച്ഛ" into their correct single-word spellings) so they will match user input.
assets/data/cy.txt-1-1 (1)
1-1: ⚠️ Potential issue | 🟠 Major
Prune common Welsh vocabulary from the blocklist.
`adferiad`, `brawf`, `cyfrifiadur`, `orau`, and `tafarn` are normal words, not profanity. Keeping them here will cause broad false positives in ordinary Welsh content.
Also applies to: 9-9, 41-41, 96-96, 143-143
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/cy.txt` at line 1, Remove the listed Welsh words from the blocklist in assets/data/cy.txt: delete the entries "adferiad", "brawf", "cyfrifiadur", "orau", and "tafarn" (and remove their duplicate occurrences at the other indicated locations). Search the file for those exact tokens (and the other mentioned line occurrences) and remove them so common Welsh vocabulary is not treated as profanity; ensure no other unrelated words are altered.
assets/data/de.txt-21-24 (1)
21-24: ⚠️ Potential issue | 🟠 Major
Remove obviously benign German terms from the profanity list.
`ambiguität`, `ameisenbär`, `babyface`, and `ich liebe dich` are ordinary phrases, not abusive content. Shipping them as blocked tokens will produce false positives in normal text.
Also applies to: 66-66, 289-289
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/de.txt` around lines 21 - 24, Remove the listed benign German tokens from the profanity block: delete "ambiguität", "ameisenbär", "babyface", and the phrase "ich liebe dich" from assets/data/de.txt (they also appear at the other occurrences referenced around lines 66 and 289); ensure the profanity array or list entries (the literal strings "ambiguität", "ameisenbär", "babyface", "ich liebe dich") are removed so they no longer produce false positives while preserving the rest of the file.
assets/data/de.txt-22-23 (1)
22-23: ⚠️ Potential issue | 🟠 Major: Don't block neutral identity labels as standalone profanity.
Entries like ambisexuell, homo, lesbe, pansexuell, schwul, and schwuler are neutral descriptors on their own. Treating them as profanity will incorrectly censor non-abusive content and creates an avoidable moderation bias.
Also applies to: 260-260, 343-343, 412-412, 508-509
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/de.txt` around lines 22 - 23, The profanity list currently contains neutral identity labels (e.g., "ambisexuell", "ambosexuell", "homo", "lesbe", "pansexuell", "schwul", "schwuler"); remove these entries from the blocked/profanity set or move them into an explicit "allowed_identity_terms" whitelist and update matching so these tokens are only flagged when appearing in abusive context (e.g., require surrounding insult patterns or negative sentiment) rather than being blocked as standalone words; ensure the same change is applied for the other occurrences called out (lines referenced in the comment).assets/data/de.txt-108-108 (1)
108-108: ⚠️ Potential issue | 🟠 Major: Fix the mojibake entries before building the German trie.
These lines contain broken UTF-8 text (scheiãÿegal, jã¤ger, beschiãÿ, scheiãÿe, drecksgã¶r). They will miss real matches and add dead entries to the automaton.
Suggested cleanup
-das ist mir scheiãÿegal
+das ist mir scheißegal
-sie sind das essen und wir sind die jã¤ger
+sie sind das essen und wir sind die jäger
-so ein beschiãÿ
+so ein beschiß
-verdammte scheiãÿe
+verdammte scheiße
-verhurtes drecksgã¶r
+verhurtes drecksgör
Also applies to: 514-514, 517-517, 564-564, 575-575
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/de.txt` at line 108, Lines in assets/data/de.txt contain mojibake sequences (e.g. "scheiãÿegal", "jã¤ger", "beschiãÿ", "scheiãÿe", "drecksgã¶r") that must be replaced with their correct UTF-8 German forms before building the trie; update those entries to the proper strings (for example: scheißegal, jäger, beschiss/beschiß as appropriate, scheiße, drecksgöre) and scan the file for the other reported occurrences (lines around the listed offsets) to replace all similar mojibake variants so the automaton receives valid UTF-8 tokens.
🟡 Minor comments (5)
assets/data/kab.txt-19-20 (1)
19-20: ⚠️ Potential issue | 🟡 Minor: Deduplicate this repeated Kabyle entry.
tteqqun appears twice consecutively. If the loader does not normalize duplicates, this adds useless trie nodes and can also produce duplicate matches.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/kab.txt` around lines 19 - 20, Remove the duplicated Kabyle entry "tteqqun" so it appears only once in the data set and, if applicable, update the loader normalization step that reads this file to ignore or deduplicate identical consecutive entries (e.g., collapse duplicate tokens during parsing) to prevent creating redundant trie nodes or duplicate matches.assets/data/hy.txt-49-49 (1)
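If the loader doesn't already normalize its input, a small guard at load time makes duplicate lines like this harmless. A minimal sketch in Python for illustration; the package itself is Dart, and `load_word_list` is a hypothetical helper, not code from this PR:

```python
def load_word_list(lines):
    """Normalize and deduplicate raw word-list lines.

    Preserves first-seen order, so a repeated entry such as "tteqqun"
    contributes only one path to the trie.
    """
    seen = set()
    words = []
    for raw in lines:
        word = raw.strip().lower()
        if word and word not in seen:
            seen.add(word)
            words.append(word)
    return words
```

Doing this once at load time is cheaper than defending against duplicates inside the automaton.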
49-49: ⚠️ Potential issue | 🟡 Minor: Remove the invisible character from this token.
Line 49 appears to contain a trailing soft hyphen/non-printing character after klris glookh. That will make exact matching unreliable and is very hard to debug once this file is compiled into the automaton.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/hy.txt` at line 49, The token "klris glookh" contains a trailing invisible soft-hyphen/non-printing character; open the file and remove the non-printing character after "klris glookh" (e.g., U+00AD or other zero-width chars), save the file in UTF-8, and re-run a grep/hex check (search for U+00AD or use a tool to show hidden characters) to confirm no residual invisible characters remain in the token.assets/data/eo.txt-48-48 (1)
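Such characters can be surfaced mechanically rather than by eye. A small Python sketch (illustrative only, not part of the PR) that flags Unicode format-category characters in a word-list entry:

```python
import unicodedata

def find_invisible(entry):
    """Report format-category (Cf) characters, e.g. the soft hyphen U+00AD.

    Caveat: ZWNJ/ZWJ (U+200C/U+200D) are also Cf but are legitimate in
    Persian and Indic text, so treat hits as review flags, not auto-strips.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(entry)
        if unicodedata.category(ch) == "Cf"
    ]
```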
48-48: ⚠️ Potential issue | 🟡 Minor: Fix the corrupted Esperanto entry on line 48.
åœit contains non-Esperanto characters (å, œ) that indicate UTF-8 corruption. This entry will not match normalized Esperanto input and should be corrected to the intended word.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/eo.txt` at line 48, The entry "åœit" is corrupted and must be replaced with the correct Esperanto word: locate the token "åœit" in the eo.txt file and replace it with the intended Esperanto spelling (retrieve the correct word from the original source or translation memory), ensure the file is saved as UTF-8 and normalized (NFC) so Esperanto diacritics (ĉ, ĝ, ĥ, ĵ, ŝ, ŭ) are preserved, and run a quick validation (e.g., grep or a script) to confirm the corrected word matches normalized Esperanto input.assets/data/es.txt-525-525 (1)
525-525: ⚠️ Potential issue | 🟡 Minor: Inconsistent data format with tab character and English translation
Line 525 contains a tab character followed by an English translation:
cua fea balla al mearda	bitch go to shit
This format is inconsistent with the rest of the file, where each line contains only a single term. While this may not cause functional issues depending on how the data is parsed, it suggests incomplete data cleaning.
📝 Recommended fix for consistent formatting
-cua fea balla al mearda	bitch go to shit
+cua fea balla al mearda
If translations are needed for documentation purposes, consider maintaining them in a separate file or using a structured format (e.g., JSON with key-value pairs).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/es.txt` at line 525, Remove the English translation and stray tab from the line containing the term "cua fea balla al mearda" so the file returns to the single-term-per-line format; if the translation must be preserved, move the English phrase ("bitch go to shit") into a separate translations file or a structured mapping (e.g., JSON) and reference that instead, ensuring the original assets/data/es.txt contains only the single-term entry "cua fea balla al mearda".assets/data/cy.txt-3-3 (1)
3-3: ⚠️ Potential issue | 🟡 Minor: Remove the stray + from this literal entry.
anws blewog+ looks malformed. In a line-delimited dictionary this will only match text that literally ends with +; if any downstream code treats entries as patterns, it changes the term's meaning entirely.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/cy.txt` at line 3, The entry "anws blewog+" contains a stray '+' at the end; open the data file and remove the trailing '+' so the literal becomes "anws blewog" (update the exact record "anws blewog+" in assets/data/cy.txt).
🧹 Nitpick comments (2)
assets/data/hr.txt (1)
1-243: Consider establishing data governance practices for language assets. With 75+ language data files being introduced, clear processes will help maintain quality and consistency:
- Data provenance: Document the sources and methodology for each language dataset
- Update cadence: Define how frequently these files should be reviewed and updated
- Quality assurance: Consider implementing validation scripts to check for encoding issues, duplicates, or formatting inconsistencies across all data files
- Cultural review: Given the sensitive nature of profanity filtering, consider involving native speakers or cultural consultants for accuracy
- Versioning: Track changes to individual language files to understand the evolution of datasets
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/hr.txt` around lines 1 - 243, The hr.txt profanity wordlist lacks governance: add metadata and validation—prepend a standardized header to assets/data/hr.txt (source, author, date, license, language code "hr", review cadence) and apply a validation pipeline (scripts invoked in CI) that checks UTF-8 encoding, trims/normalizes lines, deduplicates entries, enforces one-term-per-line formatting, and flags suspicious characters; add a process doc describing provenance and update cadence, require native-speaker cultural review for hr.txt and other language files, and ensure each language file is versioned/changed with meaningful git commits so history for assets/data/hr.txt (and the other 75+ language files) is auditable.README.md (1)
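The validation-script suggestion above can be made concrete in a few lines. A Python sketch of such a check (the heuristics, especially the mojibake hint, are assumptions that would need per-language tuning; none of this is code from the PR):

```python
import re

# Lowercased cp1252 mojibake tends to leave "ã" followed by a symbol
# character ("±", "¤", "ÿ", ...); a genuine "ã" (e.g. Portuguese "cuzão")
# is followed by an ordinary letter. This is a heuristic, not a proof.
MOJIBAKE_HINT = re.compile(r"[ãÃ][±¤µ¶ÿ³©§£¡°]")

def validate_word_list(lines):
    """Return (line_number, problem) pairs for a raw word-list file."""
    problems = []
    seen = set()
    for n, raw in enumerate(lines, start=1):
        entry = raw.rstrip("\n")
        if entry != entry.strip():
            problems.append((n, "leading/trailing whitespace"))
        if "\t" in entry:
            problems.append((n, "tab character (possible glossed entry)"))
        if MOJIBAKE_HINT.search(entry):
            problems.append((n, "possible mojibake"))
        key = entry.strip().lower()
        if key in seen:
            problems.append((n, "duplicate entry"))
        seen.add(key)
    return problems
```

Run against each file in CI, a non-empty result fails the build before bad entries reach the automaton.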
97-101: Minor technical imprecision in performance explanation. The explanation states: "With 10,000+ words, this grew exponentially slow."
More precisely, a nested loop approach grows linearly with the number of patterns (O(N × P)), not exponentially (O(2^N)). The performance difference vs. Aho-Corasick is still dramatic but describing it as linear growth would be more technically accurate.
📚 Suggested refinement for technical accuracy
-Legacy versions used a nested loop approach (for every bad word, run a regex). With 10,000+ words, this grew exponentially slow.
+Legacy versions used a nested loop approach (for every bad word, run a regex). With 10,000+ words, performance degraded linearly with pattern count (O(N × P)).
Verify each finding against the current code and only fix it if needed. In `@README.md` around lines 97 - 101, The README's performance description incorrectly says the nested-loop approach grew "exponentially slow"; update that sentence in the "Why is v2.0.0 so fast?" section to state the complexity grows linearly with the number of patterns (e.g., O(N × P) where N is text length and P is number of patterns) instead of using "exponentially", and optionally add a short comparison noting Aho-Corasick runs in O(N) for scanning the text once; edit the paragraph containing "With 10,000+ words, this grew exponentially slow." and the following sentence referencing Aho-Corasick to correct the complexity claim.
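For readers weighing the two approaches, the core of the automaton the README refers to fits in a short sketch. This is an illustrative Python implementation of the classic Aho-Corasick construction (trie plus BFS-built failure links), not the package's actual Dart code in lib/src/aho_corasick.dart:

```python
from collections import deque

class AhoCorasick:
    """Classic Aho-Corasick: build once, then scan the text in a single pass."""

    def __init__(self, patterns):
        self.goto = [{}]   # state -> {char: next state}
        self.fail = [0]    # failure links
        self.out = [[]]    # patterns recognized at each state
        for pat in patterns:                 # 1. insert every pattern into a trie
            state = 0
            for ch in pat:
                if ch not in self.goto[state]:
                    self.goto.append({})
                    self.fail.append(0)
                    self.out.append([])
                    self.goto[state][ch] = len(self.goto) - 1
                state = self.goto[state][ch]
            self.out[state].append(pat)
        queue = deque(self.goto[0].values())  # 2. BFS to set failure links
        while queue:
            r = queue.popleft()
            for ch, s in self.goto[r].items():
                queue.append(s)
                f = self.fail[r]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[s] = self.goto[f].get(ch, 0)
                self.out[s] += self.out[self.fail[s]]

    def search(self, text):
        """Yield (start_index, pattern) for every match in one scan of text."""
        state, hits = 0, []
        for i, ch in enumerate(text):
            while state and ch not in self.goto[state]:
                state = self.fail[state]
            state = self.goto[state].get(ch, 0)
            for pat in self.out[state]:
                hits.append((i - len(pat) + 1, pat))
        return hits
```

Build cost is proportional to total pattern length; the scan never rescans the text, which is where the O(N × P) to O(N + matches) gain comes from.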
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 03a0441f-3a6f-405b-bf97-c31b70a5244a
⛔ Files ignored due to path filters (1)
assets/image/safeText.png is excluded by !**/*.png
📒 Files selected for processing (88)
.vscode/settings.json, CHANGELOG.md, README.md, asset/bad_words.txt, assets/data/af.txt, assets/data/am.txt, assets/data/ar.txt, assets/data/az.txt, assets/data/be.txt, assets/data/bg.txt, assets/data/ca.txt, assets/data/ceb.txt, assets/data/cs.txt, assets/data/cy.txt, assets/data/da.txt, assets/data/de.txt, assets/data/dz.txt, assets/data/el.txt, assets/data/en.txt, assets/data/eo.txt, assets/data/es.txt, assets/data/et.txt, assets/data/eu.txt, assets/data/fa.txt, assets/data/fi.txt, assets/data/fil.txt, assets/data/fr.txt, assets/data/gd.txt, assets/data/gl.txt, assets/data/hi.txt, assets/data/hr.txt, assets/data/hu.txt, assets/data/hy.txt, assets/data/id.txt, assets/data/is.txt, assets/data/it.txt, assets/data/ja.txt, assets/data/kab.txt, assets/data/kh.txt, assets/data/ko.txt, assets/data/la.txt, assets/data/lt.txt, assets/data/lv.txt, assets/data/mi.txt, assets/data/mk.txt, assets/data/ml.txt, assets/data/mn.txt, assets/data/mr.txt, assets/data/ms.txt, assets/data/mt.txt, assets/data/my.txt, assets/data/nl.txt, assets/data/no.txt, assets/data/pih.txt, assets/data/piy.txt, assets/data/pl.txt, assets/data/pt.txt, assets/data/ro.txt, assets/data/rop.txt, assets/data/ru.txt, assets/data/sk.txt, assets/data/sl.txt, assets/data/sm.txt, assets/data/sq.txt, assets/data/sr.txt, assets/data/sv.txt, assets/data/ta.txt, assets/data/te.txt, assets/data/tet.txt, assets/data/th.txt, assets/data/tlh.txt, assets/data/to.txt, assets/data/tr.txt, assets/data/uk.txt, assets/data/uz.txt, assets/data/vi.txt, assets/data/yid.txt, assets/data/zh.txt, assets/data/zu.txt, example/lib/main.dart, lib/safe_text.dart, lib/src/aho_corasick.dart, lib/src/models/language.dart, lib/src/phone_number_checker.dart, lib/src/safe_text_filter.dart, pubspec.yaml, test/phone_number_test.dart, test/profanity_filter_test.dart
💤 Files with no reviewable changes (1)
- asset/bad_words.txt
Actionable comments posted: 4
♻️ Duplicate comments (3)
assets/data/it.txt (1)
1685-1685: ⚠️ Potential issue | 🔴 Critical: Drop the remaining malformed Italian entries.
va ciapã i rat and x valutato are still garbage tokens, so they won't match real Italian input and just pollute the Aho–Corasick trie.
Suggested fix
-va ciapã i rat
-x valutato
Also applies to: 1732-1732
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/it.txt` at line 1685, Remove the malformed Italian tokens from the dataset: delete the entries "va ciapã i rat" and "x valutato" (and the duplicate at the other reported location) so they no longer populate the Aho–Corasick trie; search for these exact strings in assets/data/it.txt and remove those lines to keep only valid Italian phrases.assets/data/es.txt (1)
1122-1122: ⚠️ Potential issue | 🔴 Critical: Purge GitHub/UI strings from the Spanish lexicon.
open, pricing, privacy, projects, pull requests, security, skip to content, terms, and training are not profanity. Their presence means the source data still contains scraped site chrome, which will cause false positives.
Also applies to: 1300-1303, 1311-1311, 1414-1414, 1435-1435, 1481-1481, 1532-1532
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/es.txt` at line 1122, Remove the UI/chrome terms found in the Spanish lexicon entries (e.g., the literal tokens "open", "pricing", "privacy", "projects", "pull requests", "security", "skip to content", "terms", "training") because they are scraped GitHub/UI strings causing false positives; search the es.txt dataset for those exact tokens (including the duplicates noted at the ranges around lines with "open" and the other listed entries) and delete or exclude those lines/entries from the lexicon, then re-run the dataset sanitization to ensure no other site-chrome tokens remain.assets/data/ar.txt (1)
960-965: ⚠️ Potential issue | 🔴 Critical: Remove truncated Arabic fragments before building the trie.
Entries like ابو ر, الحس ط, الله لا, ان, بز, and ف are fragments or near-universal tokens, so the automaton will match ordinary Arabic text far too often.
Also applies to: 991-1019, 1045-1050, 1186-1186
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/data/cs.txt`:
- Line 127: Remove the entry "peklo" from the Czech blacklist file
(assets/data/cs.txt) so the profanity list no longer flags it; this aligns with
the test expectation in test/language_data_test.dart which asserts
containsBadWord('peklo') returns false for Language.czech. Update the cs.txt
content to exclude "peklo" and run the tests to verify containsBadWord and
Language.czech behavior is correct.
In `@assets/data/es.txt`:
- Line 630: Remove the standalone token 'es' from the stopword/substring list:
locate the literal entry "es" in the data file and delete it so the
substring-matching automaton no longer includes this common Spanish
verb/stopword; after removal, regenerate or reload any derived automaton/index
and run the substring-matching tests to confirm no excessive matches now occur.
In `@assets/data/fa.txt`:
- Line 149: The Persian words 'برنامه نویس', 'خداحافظ', 'مروارید', and 'نجار'
are benign and must be removed from the profanity blacklist in the fa.txt data
file; remove these exact tokens wherever they appear (including duplicate
occurrences) so normal Persian text is not falsely flagged, keeping the file
format (one token per line) intact and running any blacklist/unit tests to
confirm no regressions.
In `@assets/data/hi.txt`:
- Line 424: Remove the four neutral Hindi entries from the Hindi blacklist:
"आरएसी", "गोली चलाने की आवाज़", "निर्माण", and "पृथ्वी पर" (they are ordinary
words causing false positives); edit assets/data/hi.txt to delete or comment out
these exact string entries wherever they occur (including the occurrences noted
around the file regions referenced), then run the filter/unit tests to confirm
no Hindi-only false positives remain.
---
Duplicate comments:
In `@assets/data/es.txt`:
- Line 1122: Remove the UI/chrome terms found in the Spanish lexicon entries
(e.g., the literal tokens "open", "pricing", "privacy", "projects", "pull
requests", "security", "skip to content", "terms", "training") because they are
scraped GitHub/UI strings causing false positives; search the es.txt dataset for
those exact tokens (including the duplicates noted at the ranges around lines
with "open" and the other listed entries) and delete or exclude those
lines/entries from the lexicon, then re-run the dataset sanitization to ensure
no other site-chrome tokens remain.
In `@assets/data/it.txt`:
- Line 1685: Remove the malformed Italian tokens from the dataset: delete the
entries "va ciapã i rat" and "x valutato" (and the duplicate at the other
reported location) so they no longer populate the Aho–Corasick trie; search for
these exact strings in assets/data/it.txt and remove those lines to keep only
valid Italian phrases.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7ebd260d-5b76-4000-9774-6b6dd2d85786
📒 Files selected for processing (14)
assets/data/ar.txt, assets/data/cs.txt, assets/data/es.txt, assets/data/et.txt, assets/data/fa.txt, assets/data/hi.txt, assets/data/it.txt, assets/data/ja.txt, assets/data/kh.txt, assets/data/kn.txt, assets/data/ko.txt, assets/data/zh.txt, lib/src/models/language.dart, test/language_data_test.dart
✅ Files skipped from review due to trivial changes (3)
- assets/data/kn.txt
- assets/data/et.txt
- assets/data/kh.txt
assets/data/fa.txt
Outdated
بخورش
بدبخت
بدم
برنامه نویس
Remove benign Persian words from the profanity set.
برنامه نویس, خداحافظ, مروارید, and نجار are neutral Persian words. Putting them in the blacklist will flag normal text and undermine the new language-specific filtering.
Also applies to: 233-233, 442-442, 460-460
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/data/fa.txt` at line 149, The Persian words 'برنامه نویس', 'خداحافظ',
'مروارید', and 'نجار' are benign and must be removed from the profanity
blacklist in the fa.txt data file; remove these exact tokens wherever they
appear (including duplicate occurrences) so normal Persian text is not falsely
flagged, keeping the file format (one token per line) intact and running any
blacklist/unit tests to confirm no regressions.
…the language unicodes
Actionable comments posted: 6
♻️ Duplicate comments (1)
assets/data/fa.txt (1)
440-440: ⚠️ Potential issue | 🟠 Major: Remove neutral Persian words from the blacklist.
مروارید and نجار are benign words and can trigger false positives in normal Persian text.
Also applies to: 458-459
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@assets/data/fa.txt` at line 440, Remove the benign Persian words causing false positives by deleting the blacklist entries for "مروارید" and "نجار" from the fa.txt blacklist and also remove their duplicate occurrences noted around the other reported lines; ensure you only remove those exact tokens (keeping file encoding/line endings intact) and run a quick scan to confirm no other duplicate neutral Persian words remain in the list.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@assets/data/cs.txt`:
- Line 9: Remove the malformed/non‑Czech tokens from the Czech lexicon by
deleting entries like "biezh do haaye", "churak", "fuc ty", "koòomrd", "kraavo",
"shinging", and "show-staat" (and the similar noise noted at the other
occurrences) so the lexicon only contains valid Czech words; locate these exact
token strings in the cs.txt word list and remove or replace them with correct
Czech equivalents where available, then run the lexicon validation/CI check to
ensure no encoding or malformed entries remain.
In `@assets/data/es.txt`:
- Around line 58-59: Replace the mojibake tokens in the Spanish data file with
their correct UTF-8 forms: locate and change "andã¡" to "anda", "coã±o" to
"coño", "cuzã£o" to "cuzão" (or the intended word), "me cago en el corazã³n" to
"me cago en el corazón", "me jodã" to "me jodí", and similar corrupted
sequences (e.g., any occurrences of á, é, Ã, ó, ú, ñ, ã) to their proper
accented characters; verify and normalize the whole file encoding to UTF-8 (no
BOM) after edits to prevent recurrence and ensure these tokens no longer create
spurious automaton states.
- Line 523: The entry "cua fea balla al mearda\tbitch go to shit" mixes a
Spanish profanity token with an English gloss; remove the tab and English gloss
so the lexicon line contains only the single profanity token ("cua fea balla al
mearda"), or alternatively split into two valid entries (one for the Spanish
token and a separate mapping/line for the English gloss) to keep each lexicon
line a single token.
In `@assets/data/fa.txt`:
- Line 103: Remove the scraped classification artifact token "x امتیاز" from the
Persian lexicon in assets/data/fa.txt; open the file and delete the exact line
containing 'x امتیاز' (and ensure no trailing whitespace or duplicate empty
lines are left), then save the file so the lexicon only contains valid profanity
entries.
In `@assets/data/hi.txt`:
- Around line 566-755: The hi.txt file contains Sinhala-script entries that are
being loaded by Language.hindi (which only loads assets/data/hi.txt), causing
Sinhala text to be matched as Hindi; remove or relocate the Sinhala block from
assets/data/hi.txt into the correct Sinhala lexicon (e.g., create/update
assets/data/si.txt) and ensure Language.hindi only loads Devanagari/true-Hindi
entries; as a precaution, add a small validation step where the Hindi loader
(Language.hindi) filters or rejects entries that contain non-Devanagari
codepoints before adding them to the Hindi lexicon.
In `@assets/data/it.txt`:
- Line 1284: The line containing "sborra, sburro" combines two separate terms;
split this entry into two distinct lines—one with "sborra" and one with
"sburro"—so look-up/matching will work for each term independently and remove
the comma-separated form.
---
Duplicate comments:
In `@assets/data/fa.txt`:
- Line 440: Remove the benign Persian words causing false positives by deleting
the blacklist entries for "مروارید" and "نجار" from the fa.txt blacklist and
also remove their duplicate occurrences noted around the other reported lines;
ensure you only remove those exact tokens (keeping file encoding/line endings
intact) and run a quick scan to confirm no other duplicate neutral Persian words
remain in the list.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 46116ce7-6850-475c-86f4-fad88b2e2b2c
📒 Files selected for processing (6)
assets/data/cs.txt, assets/data/es.txt, assets/data/fa.txt, assets/data/hi.txt, assets/data/it.txt, lib/src/safe_text_filter.dart
batich
bestialita
bestiální
biezh do haaye
Remove malformed/non-Czech tokens from the Czech lexicon.
These entries look like noise or encoding artifacts (biezh do haaye, churak, fuc ty, koòomrd, kraavo, shinging, show-staat) and are unlikely to match intended Czech profanity reliably.
Suggested cleanup
-biezh do haaye
-churak
-fuc ty
-koòomrd
-kraavo
-shinging
-show-staat
Also applies to: 21-21, 49-49, 77-77, 79-79, 164-165
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/data/cs.txt` at line 9, Remove the malformed/non‑Czech tokens from the
Czech lexicon by deleting entries like "biezh do haaye", "churak", "fuc ty",
"koòomrd", "kraavo", "shinging", and "show-staat" (and the similar noise noted
at the other occurrences) so the lexicon only contains valid Czech words; locate
these exact token strings in the cs.txt word list and remove or replace them
with correct Czech equivalents where available, then run the lexicon
validation/CI check to ensure no encoding or malformed entries remain.
andã¡ a hacerte coger
andã¡ a hacerte ver el culo
Fix mojibake/encoding-corrupted Spanish entries.
Multiple entries appear UTF-8-corrupted (andã¡, coã±o, cuzã£o, me cago en el corazã³n..., me jodã, etc.). These patterns will miss real input and add noisy automaton states.
Also applies to: 142-142, 402-402, 515-515, 562-562, 1017-1018, 1033-1033
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/data/es.txt` around lines 58 - 59, Replace the mojibake tokens in the
Spanish data file with their correct UTF-8 forms: locate and change "andã¡" to
"anda", "coã±o" to "coño", "cuzã£o" to "cuzão" (or the intended word), "me cago
en el corazã³n" to "me cago en el corazón", "me jodã" to "me jodí", and similar
corrupted sequences (e.g., any occurrences of á, é, Ã, ó, ú, ñ, ã) to
their proper accented characters; verify and normalize the whole file encoding
to UTF-8 (no BOM) after edits to prevent recurrence and ensure these tokens no
longer create spurious automaton states.
cremita
creta
crica
cua fea balla al mearda	bitch go to shit
Remove mixed translation text from a lexicon line.
This line includes a tab and an English gloss (bitch go to shit) in the same entry, which is not a valid single profanity token.
Suggested fix
-cua fea balla al mearda bitch go to shit
+cua fea balla al mearda
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-cua fea balla al mearda	bitch go to shit
+cua fea balla al mearda
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/data/es.txt` at line 523, The entry "cua fea balla al mearda\tbitch go
to shit" mixes a Spanish profanity token with an English gloss; remove the tab
and English gloss so the lexicon line contains only the single profanity token
("cua fea balla al mearda"), or alternatively split into two valid entries (one
for the Spanish token and a separate mapping/line for the English gloss) to keep
each lexicon line a single token.
tokhm
torke char
valade zenaa
x امتیاز
Remove classification artifact token from Persian data.
x امتیاز looks like scraped metadata rather than a profanity term and should not be part of the lexicon.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/data/fa.txt` at line 103, Remove the scraped classification artifact
token "x امتیاز" from the Persian lexicon in assets/data/fa.txt; open the file
and delete the exact line containing 'x امتیاز' (and ensure no trailing
whitespace or duplicate empty lines are left), then save the file so the lexicon
only contains valid profanity entries.
අකේ පුරිනවා
අකේ පුරින්නා
අප්පට හුකනවා
අප්පට හුකන්න
අම්බකිස්ස
අම්මගෙ රෙද්ද
අම්මට හුකනවා
අම්මට හුකන්න
අම්ම්ට
අම්මාගේ රෙද්ද
ඇට
ඇට දෙක
ඇට්ටිම්බ
ඇන්ට පාර
ඇරගනින්
ඇහැළමල් වේසි
උත්ති
උත්තික් කොල්ලා
උත්තියේ
උඹෙ අම්මගෙ
උඹෙ අම්මට
උඹෙ අම්මා
උඹේ අම්මා
උරනවා
උරපං
උරපන්
ඒසි
කළු බඩ්ඩ
කැන්න පොරි
කැන්න පොරියා
කැරි
කැරි තම්බියො
කැරි බඩුව
කැරි මයිල්
කැරි ලොඳයා
කැරි වේසි
කැරි සැප
කැරි සීන්
කැරි හම්බයො
කැරිය
කැරියා
කැස් පුරි
කැස්පුරි
කෑස් පුරියා
කෑස්පුරි
කිම්බ
කිඹි කාවයියා
කිඹි ජූස්
කිඹි සිඹා
කිඹිසිඹා
කුක්කු
කුට්ටං චූටි
කොන කපනවා
කොනකපාල
කොනකපාලා
කොයිපටා
කෝටු පයිය
ගැට මිරිකනවා
ගැනු හොරා
ගෑණු හොරා
ගොං කැරිය
ගොං බිජ්ජා
ගොටුකොළ පොන්නයා
ගොටුකොළ හුකන්නා
ජැක් ගහපන්
ජැක්සන්
ජුන්ඩා
ටඉකේ
ටඋකනවා
ටහිකේ
ටහුකන්න
ටොපා
ටොම්බ හිල
ටෞ
ටෞකණ්ඩ
ටෞකන්න
ටෞක්
තන
තන් දෙක
තොංගලේ
තොගෙ අම්මට
දාර පයිය
දාර බඩුව
දිව දානව
දිව දානවා
නිල්මානෙල් හුත්ති
පක
පකට
පකය
පකයා
පකේ
පකෝ
පට්ට වේසි
පම්ක
පම්කයා
පයි චූටි
පයිකොටා
පයිය
පයියා
පර බැල්ලි
පර වේසි
පාලම් කොකා
පුක
පුක පලනවා
පුක වටේ
පුක සුදුද
පුක්මන්තා
පුකේ අරිනවා
පුකේ මයිල්
පුකේ මයිලා
පුකේ හිල
පෙට්ටිය
පෙට්ටිය කඩනවා
පෙට්ටිය කැඩිලද
පොන්න කැරියා
පොන්න පකයා
පොන්න පකා
පොන්න වේස මනමාලි
පොන්නයා
පොරෝ පාර
පොල් කාවයිය
පොල් කාවයියා
පොල්ල
පෝලිමක් දානවා
පෝලිම් දානවා
බඩ වෙනවා
බඩු ආව
බඩු ආවා
බඩු කාරයා
බඩු ලොල්ලා
බැක් ගහනව
බැක් ගහනවා
බැක්සයිඩ් ඕකේ
බැල්ලි
බැල්ලිගෙ පුතා
බේසික්ගෙ
බේසිකෙ
බේසිගෙ
මෑ ඇටේ
මෑ මල
මුහුදු හුකන්නා
රෙද්ද උස්සගෙන
රෙද්ද උස්සනවා
ලබ්බ
ලව් ජූස්
ලෑලි පුක
ලෙවකනවා
ලොල්ලා
ලොවනවා
වටඇල්ල
වටේ යනවා
වම්බටු පයියා
වලත්තයා
වල්ල
වල්ල පට්ට
වැල
වැල බලනවා
වේස බල්ලා
වේස බැල්ලි
වේසාවා
වේසිගෙ පුතා
වේසිගේ පුතා
වේසෞත්තා
සක්
සක්කිලි
සක්කිලි බල්ලා
සක්කිලියා
හම්බ කැරියා
හැමිනෙනව
හැමිනෙනවා
හුකනව දාලා
හුකනවනෙ ඉතින්
හුකනවනේ ඉතින්
හුකනවා
හුකනවා දාලා
හුකන්න
හුකන්නා
හුකපං
හුක්
හුකා
හුත්ත
හුත්තා
හුත්තික් කොල්ලා
හුත්තිගෙ කොල්ලා
හුත්තිගෙ පුතා
හුත්තිගේ පුතා
හුප්තා
හුම්කන
හුම්තා
ෆක්
Hindi lexicon includes Sinhala-script entries.
This block is Sinhala, not Hindi. Because Language.hindi loads only assets/data/hi.txt, these entries make Hindi filtering match Sinhala text unintentionally.
🧰 Tools
🪛 LanguageTool
[uncategorized] ~575-~575: Ger doublet «ඇට ඇට».
Context: ...ුකනවා අම්මට හුකන්න අම්ම්ට අම්මාගේ රෙද්ද ඇට ඇට දෙක ඇට්ටිම්බ ඇන්ට පාර ඇරගනින් ඇහැළමල් ව...
(GER_DOUBLET)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/data/hi.txt` around lines 566 - 755, The hi.txt file contains
Sinhala-script entries that are being loaded by Language.hindi (which only loads
assets/data/hi.txt), causing Sinhala text to be matched as Hindi; remove or
relocate the Sinhala block from assets/data/hi.txt into the correct Sinhala
lexicon (e.g., create/update assets/data/si.txt) and ensure Language.hindi only
loads Devanagari/true-Hindi entries; as a precaution, add a small validation
step where the Hindi loader (Language.hindi) filters or rejects entries that
contain non-Devanagari codepoints before adding them to the Hindi lexicon.
sborona
sborone
sborra
sborra, sburro
Split combined terms into separate entries.
Line 1284 stores two terms in one line (sborra, sburro), so matching will require the comma form instead of either word independently.
Suggested fix
-sborra, sburro
+sborra
+sburro
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
-sborra, sburro
+sborra
+sburro
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@assets/data/it.txt` at line 1284, The line containing "sborra, sburro"
combines two separate terms; split this entry into two distinct lines—one with
"sborra" and one with "sburro"—so look-up/matching will work for each term
independently and remove the comma-separated form.
improved performance by applying aho-corasick algo
Summary by CodeRabbit
New Features
Deprecations
Documentation
Chores
Style