feat: Add Spanish + Chilean Spanish wordlist #5
Open
satelerd wants to merge 2 commits into
Adds ~200 Spanish-language profanity entries grouped into 27 families
covering general Spanish, Mexican, Argentine, and (heavily) Chilean
slang — the dialect that swears the most per capita and was missing
entirely from the detector.
General Spanish coverage:
mierda, puta, joder, coño, cabrón, pendejo, chingar, gilipollas,
maricón, verga, polla, boludo / pelotudo, imbécil, idiota, estúpido,
tarado, tonto + chat abbreviations (hdp, lpm, ptm).
Chilean (es-CL) coverage:
weón / hueón / huevón family, aweonao / ahueonado, wea / hueá /
huevada, culiao / culiado, conchetumare / conchesumadre + ctm/csm,
chucha, sacowea, cresta, gil, cagada / cagaste, leso.
Notes:
- Both accented and non-accented variants are listed (people type
"weon" and "weón"; phones autocorrect, devs skip accents).
- Severity reflects use as a swear, not dictionary meaning. e.g.
"weón" is often a friendly filler in Chile but it's still profanity
→ moderate. Hardcore insults (culiao, ctm, conchetumare) → strong.
- Group names kept ASCII so existing rollup output stays stable.
- Verified no false positives on common English words ("went",
"gilbert", "hello world").
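To make the notes above concrete, here is a minimal sketch of how a family with explicit accent variants, a usage-based severity, and an ASCII group name might be represented. The `WordFamily` shape and the `classify` helper are hypothetical and only illustrate the idea; the actual devrage wordlist types are not shown in this PR.

```typescript
// Hypothetical shapes for illustration only; not the project's real types.
type Severity = "moderate" | "strong";

interface WordFamily {
  group: string; // ASCII-only rollup key, e.g. "weon" rather than "weón"
  severity: Severity; // reflects use as a swear, not dictionary meaning
  variants: string[]; // accented and unaccented spellings listed explicitly
}

const families: WordFamily[] = [
  {
    group: "weon",
    severity: "moderate",
    variants: ["weón", "weon", "hueón", "hueon", "huevón", "huevon"],
  },
  {
    group: "conchetumare",
    severity: "strong",
    variants: ["conchetumare", "conchesumadre", "ctm", "csm"],
  },
];

// Exact, case-insensitive lookup against every listed variant.
// No Unicode normalization is applied, so both spellings must be listed.
function classify(token: string): WordFamily | undefined {
  const needle = token.toLowerCase();
  for (const family of families) {
    if (family.variants.indexOf(needle) !== -1) {
      return family;
    }
  }
  return undefined;
}
```

With this shape, `classify("WEÓN")` and `classify("weon")` both resolve to the same `weon` group, so the rollup key stays ASCII regardless of how the word was typed.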
JS regex `\b` treats accented vowels (á é í ó ú) as non-word chars, which creates artificial word boundaries inside Spanish words. Three short entries from the previous commit triggered this:
- "cono" (de-accented form of "coño") matched inside "ícono" / "íconos", which are extremely common in design/UI contexts.
- "gil" / "giles" / "gila" matched inside "ágil" / "frágil" / "frágiles", which are extremely common in any tech corpus that mentions agile methodology.
Fix: drop the de-accented "cono" (keep only "coño"), and drop the entire "gil" family. Real Chilean usage of "gil" exists, but at 3 chars it's outweighed by the noise from "ágil" in dev contexts. Comments left in-place explaining why.
Verified on a 12.5k-message corpus: false matches eliminated, all legitimate Spanish/Chilean detections preserved.
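The boundary problem described in this commit can be reproduced in a few lines. The sketch below first shows the ASCII-only `\b` behaviour, then an alternative that this PR does not use: Unicode-aware lookarounds instead of `\b`. The `wholeWord` helper is hypothetical, and it assumes a runtime with lookbehind and `\p{...}` support (ES2018 or later).

```typescript
// \b is defined in terms of \w, which is ASCII-only ([A-Za-z0-9_]).
// "í" and "á" are therefore non-word chars, and a boundary appears after them.
const asciiCono = /\bcono\b/i;
const asciiGil = /\bgil\b/i;

console.log(asciiCono.test("ícono")); // true (false positive)
console.log(asciiGil.test("ágil")); // true (false positive)
console.log(asciiGil.test("gilbert")); // false ("b" is a word char, no boundary)

// Hypothetical alternative: treat any Unicode letter, combining mark, digit,
// or underscore as part of a word, and require that none is adjacent.
function wholeWord(word: string): RegExp {
  // Escape regex metacharacters so the entry is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  const wordChar = "[\\p{L}\\p{M}\\p{N}_]";
  return new RegExp(`(?<!${wordChar})${escaped}(?!${wordChar})`, "iu");
}

console.log(wholeWord("cono").test("ícono")); // false
console.log(wholeWord("gil").test("frágil")); // false
console.log(wholeWord("gil").test("eres un gil")); // true
```

Dropping the short entries, as this commit does, is the lower-risk change because it leaves the existing matching logic untouched; the lookaround approach would keep "gil" detectable but changes how every entry is compiled.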
Author
Self-reviewed against a real 12,508-message corpus (Claude Code + Codex sessions). Found and fixed two regex false-positive classes caused by Spanish accented vowels acting as non-word chars in JS `\b`.
Both fixed in the second commit by dropping the de-accented "cono" and the entire "gil" family.
Top words discovered (real usage, not meta):
Marking as ready for review. Happy to split or trim further if you'd prefer a smaller surface area.
Summary
Adds ~200 Spanish-language profanity entries grouped into 27 families covering general Spanish, Mexican, Argentine, and (heavily) Chilean slang. Right now devrage shows 0 swears for any Spanish-speaking dev — and trust me, we swear at our coding agents plenty.
What's added
General Spanish:
mierda, puta, joder, coño, cabrón, pendejo, chingar (MX), gilipollas (ES), maricón, verga, polla, boludo / pelotudo (AR), imbécil, idiota, estúpido, tarado, tonto + chat abbreviations (hdp, lpm, ptm).
Chilean (es-CL), the heavy hitters:
- weón / hueón / huevón family (the Chilean staple)
- aweonao / ahueonado family
- wea / hueá / huevada family
- culiao / culiado family
- conchetumare / conchesumadre + ctm / csm
- chucha, sacowea, cresta, gil, cagada / cagaste, leso
Design notes
- Both accented and non-accented variants are listed: people type "weon" and "weón" interchangeably (autocorrect, accent-less keyboards, etc.). Listing both avoids relying on Unicode normalization in the regex.
- Severity reflects use as a swear, not dictionary meaning: "weón" is often a friendly filler in Chile ("hola weón!") but it's still profanity, so it is marked moderate. Hardcore insults (culiao, ctm, conchetumare) are marked strong.
- Group names are kept ASCII (weon, cono, conchetumare) so the existing rollup output stays stable across terminals.
Verification
- npm run typecheck passes
- npm run build succeeds
- No false positives on went (no wea), gilbert (no gil), hello world, plain prose
- Existing English detection is unaffected (fuck this shit still matches as before)
- Elongated and uppercase input also exercised (hueeeón, WEÓN MIERDA)
Test transcript
Happy to split this into multiple PRs (general Spanish vs Chilean, or per-family) if that's easier to review. Cheers from Santiago 🇨🇱