
feat: Add Spanish + Chilean Spanish wordlist #5

Open
satelerd wants to merge 2 commits into gricha:main from satelerd:add-spanish-chilean-words



@satelerd satelerd commented May 10, 2026

Summary

Adds ~200 Spanish-language profanity entries grouped into 27 families covering general Spanish, Mexican, Argentine, and (heavily) Chilean slang. Right now devrage shows 0 swears for any Spanish-speaking dev — and trust me, we swear at our coding agents plenty.

What's added

General Spanish:
mierda, puta, joder, coño, cabrón, pendejo, chingar (MX), gilipollas (ES), maricón, verga, polla, boludo / pelotudo (AR), imbécil, idiota, estúpido, tarado, tonto + chat abbreviations (hdp, lpm, ptm).

Chilean (es-CL) — the heavy hitters:

  • weón / hueón / huevón family (the Chilean staple)
  • aweonao / ahueonado family
  • wea / hueá / huevada family
  • culiao / culiado family
  • conchetumare / conchesumadre + ctm / csm
  • chucha, sacowea, cresta, gil, cagada / cagaste, leso

Design notes

  • Accent + plain forms both listed — people type weon and weón interchangeably (autocorrect, accent-less keyboards, etc.). Listing both avoids relying on Unicode normalization in the regex.
  • Severity reflects usage as a swear, not dictionary meaning. weón is often a friendly filler in Chile (hola weón!) but it's still profanity — marked moderate. Hardcore insults (culiao, ctm, conchetumare) marked strong.
  • Group names kept ASCII (weon, cono, conchetumare) so the existing rollup output stays stable across terminals.
  • Followed the existing file structure — same comment style, same family groupings, additions placed after the English block with a clear section header.
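
The design notes above could translate into wordlist entries shaped roughly like the following. This is a minimal sketch, not devrage's actual source: the `SwearFamily` type, its field names, and `buildVariantLookup` are hypothetical, and the variant lists are illustrative. Only the words, the ASCII group names, and the two severity levels come from this PR.

```typescript
// Hypothetical entry shape; devrage's real wordlist structure may differ.
// Only the two severity levels named in this PR are modeled here.
type Severity = "moderate" | "strong";

interface SwearFamily {
  group: string;      // ASCII-only name, so rollup output stays stable across terminals
  severity: Severity; // reflects usage as a swear, not dictionary meaning
  variants: string[]; // accented and plain spellings are both listed explicitly
}

const chileanFamilies: SwearFamily[] = [
  { group: "weon", severity: "moderate", variants: ["weón", "weon", "hueón", "hueon", "huevón", "huevon"] },
  { group: "culiao", severity: "strong", variants: ["culiao", "culiado"] },
  { group: "conchetumare", severity: "strong", variants: ["conchetumare", "conchesumadre", "ctm", "csm"] },
];

// Flatten the families into a variant -> family map for direct lookups.
function buildVariantLookup(families: SwearFamily[]): Map<string, SwearFamily> {
  const lookup = new Map<string, SwearFamily>();
  for (const family of families) {
    for (const variant of family.variants) {
      lookup.set(variant, family);
    }
  }
  return lookup;
}

const variantLookup = buildVariantLookup(chileanFamilies);
```

Listing both spellings means `variantLookup.get("weon")` and `variantLookup.get("weón")` resolve to the same family without any Unicode normalization step.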

Verification

  • npm run typecheck passes
  • npm run build succeeds
  • ✅ Tested against a battery of Spanish/Chilean phrases — all detected with correct group + severity
  • ✅ No false positives on common English words: went (no wea), gilbert (no gil), hello world, plain prose
  • ✅ Existing English detection unaffected (fuck this shit still matches as before)
  • ✅ Repeated-char + uppercase normalization works (hueeeón, WEÓN MIERDA)
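
To illustrate the repeated-character and uppercase normalization mentioned in the last item, here is a minimal sketch that reports hits in the `word(group/severity)` form used in this PR's test transcript. `normalizeText`, `detectSwears`, and the small `miniWordlist` are hypothetical stand-ins, and collapsing runs of three or more characters is an assumed rule. The sketch also tokenizes on Unicode letters instead of using the `\b`-based regex the PR describes, purely to keep the example short.

```typescript
// Hypothetical mini-wordlist: variant -> [group, severity].
const miniWordlist = new Map<string, [string, string]>([
  ["weón", ["weon", "moderate"]],
  ["hueón", ["weon", "moderate"]],
  ["mierda", ["mierda", "strong"]],
  ["chucha", ["chucha", "moderate"]],
]);

// Compose accents (NFC), lowercase, then collapse runs of 3+ identical
// characters ("hueeeón" -> "hueón"). Assumed rule: runs of exactly two
// are kept, so a word like "polla" keeps its "ll".
function normalizeText(text: string): string {
  return text.normalize("NFC").toLowerCase().replace(/(.)\1{2,}/gu, "$1");
}

// Split on anything that is not a Unicode letter or digit, then look up each token.
function detectSwears(text: string): string[] {
  const tokens = normalizeText(text).split(/[^\p{L}\p{N}]+/u).filter(Boolean);
  const found: string[] = [];
  for (const token of tokens) {
    const entry = miniWordlist.get(token);
    if (entry !== undefined) {
      found.push(`${token}(${entry[0]}/${entry[1]})`);
    }
  }
  return found;
}

const shouted = detectSwears("WEÓN MIERDA");
// ["weón(weon/moderate)", "mierda(mierda/strong)"]
const stretched = detectSwears("sacowea hueeeón, qué chucha");
// ["hueón(weon/moderate)", "chucha(chucha/moderate)"] since sacowea is not in the mini-wordlist
```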

Test transcript

"weón, esto es una mierda"           → 2 matches: weón(weon/moderate), mierda(mierda/strong)
"Ese culiao está bien aweonao, ctm"  → 3 matches: culiao(culiao/strong), aweonao(aweonao/strong), ctm(conchetumare/strong)
"qué wea, conchetumare"              → 2 matches: wea(wea/moderate), conchetumare(conchetumare/strong)
"sacowea hueón, qué chucha"          → 3 matches: sacowea(sacowea/strong), hueón(weon/moderate), chucha(chucha/moderate)
"pendejo cabrón, joder"              → 3 matches: pendejo(pendejo/strong), cabrón(cabron/strong), joder(joder/strong)
"fuck this shit"                     → 2 matches: fuck(fuck/strong), shit(shit/strong)  [unchanged]
"I went to the store"                → 0 matches  [no false positives]
"gilbert had a great time"           → 0 matches  [no false positives]

Happy to split this into multiple PRs (general Spanish vs Chilean, or per-family) if that's easier to review. Cheers from Santiago 🇨🇱

Adds ~200 Spanish-language profanity entries grouped into 27 families
covering general Spanish, Mexican, Argentine, and (heavily) Chilean
slang — the dialect that swears the most per capita and was missing
entirely from the detector.

General Spanish coverage:
  mierda, puta, joder, coño, cabrón, pendejo, chingar, gilipollas,
  maricón, verga, polla, boludo / pelotudo, imbécil, idiota, estúpido,
  tarado, tonto + chat abbreviations (hdp, lpm, ptm).

Chilean (es-CL) coverage:
  weón / hueón / huevón family, aweonao / ahueonado, wea / hueá /
  huevada, culiao / culiado, conchetumare / conchesumadre + ctm/csm,
  chucha, sacowea, cresta, gil, cagada / cagaste, leso.

Notes:
- Both accented and non-accented variants are listed (people type
  "weon" and "weón"; phones autocorrect, devs skip accents).
- Severity reflects use as a swear, not dictionary meaning. e.g.
  "weón" is often a friendly filler in Chile but it's still profanity
  → moderate. Hardcore insults (culiao, ctm, conchetumare) → strong.
- Group names kept ASCII so existing rollup output stays stable.
- Verified no false positives on common English words ("went",
  "gilbert", "hello world").
@satelerd satelerd marked this pull request as draft May 10, 2026 13:57
JS regex `\b` is defined against the ASCII-only `\w` class, so it
treats accented vowels (á é í ó ú) as non-word chars, which creates
artificial word boundaries inside Spanish words. Two families of
short entries from the previous commit triggered this:

- "cono" (de-accented form of "coño") matched inside "ícono" /
  "íconos" — extremely common in design/UI contexts.
- "gil" / "giles" / "gila" matched inside "ágil" / "frágil" /
  "frágiles" — extremely common in any tech corpus that mentions
  agile methodology.

Fix: drop the de-accented "cono" (keep only "coño"), and drop the
entire "gil" family. Real Chilean usage of "gil" exists but at 3
chars it's outweighed by the noise from "ágil" in dev contexts.
Comments left in-place explaining why.

Verified on a 12.5k-message corpus: false matches eliminated, all
legitimate Spanish/Chilean detections preserved.
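
The boundary behavior described in this commit message can be reproduced in isolation. The sketch below is illustrative only: `asciiBoundary` mimics a `\b`-wrapped pattern, while `unicodeBoundary` shows one possible alternative based on Unicode-aware lookarounds. That alternative is an assumption for comparison and is not what this PR does; the PR drops the problematic entries instead. Both helpers are hypothetical names and assume the word contains no regex metacharacters.

```typescript
// \b is defined in terms of \w, which is ASCII-only ([A-Za-z0-9_]) in JavaScript,
// so "á" and "í" count as non-word characters and create a boundary mid-word.
function asciiBoundary(word: string): RegExp {
  return new RegExp(`\\b${word}\\b`, "i");
}

// Alternative (an assumption, not the PR's fix): require that no Unicode
// letter, digit, or underscore sits directly on either side of the match.
function unicodeBoundary(word: string): RegExp {
  return new RegExp(`(?<![\\p{L}\\p{N}_])${word}(?![\\p{L}\\p{N}_])`, "iu");
}

const gilInAgil = asciiBoundary("gil").test("un equipo ágil");      // true: false positive
const conoInIcono = asciiBoundary("cono").test("cambia el ícono");  // true: false positive
const gilRejected = unicodeBoundary("gil").test("un equipo ágil");  // false: "á" is a letter
const gilStandalone = unicodeBoundary("gil").test("no seas gil");   // true: real usage still matches
```

Dropping the short entries, as this commit does, avoids touching the matcher at all; the lookaround variant trades that simplicity for keeping entries like "gil" detectable.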
@satelerd satelerd marked this pull request as ready for review May 10, 2026 14:30
@satelerd
Author

Self-reviewed against a real 12,508-message corpus (Claude Code + Codex sessions). Found and fixed two regex false-positive classes caused by Spanish accented vowels acting as non-word chars in JS \b:

  • cono (de-accented coño) was matching inside ícono / íconos — common in UI/design messages.
  • gil / giles / gila were matching inside ágil / frágil / frágiles — endemic in any tech corpus that mentions agile methodology.

Both fixed in the second commit by dropping the de-accented cono (keeping only coño) and dropping the whole gil family with a comment explaining why. After the fix, the detector found 1,317 Spanish swears in the corpus that the English-only build missed entirely (that build reported 121 matches).

Top words discovered (real usage, not meta):

  • wea 456 · weon 271 (incl. wn 149, weón 89) · mierda 140 · cagada 140 · puta 80 · chucha 51 · conchetumare 31
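
A rollup like the one above, where variants such as wn and weón count toward the weon group, could be tallied along these lines. This is a sketch under stated assumptions: `variantToGroup`, `GroupTally`, `rollupByGroup`, and the sample messages are hypothetical, and mapping wn to the weon group is inferred from the counts listed above.

```typescript
// Hypothetical variant -> ASCII group mapping (small subset for illustration).
const variantToGroup = new Map<string, string>([
  ["weon", "weon"], ["weón", "weon"], ["wn", "weon"],
  ["wea", "wea"], ["mierda", "mierda"],
]);

interface GroupTally {
  total: number;                  // all matches in the group
  byVariant: Map<string, number>; // breakdown per spelling actually typed
}

// Count matches per group while keeping a per-variant breakdown.
function rollupByGroup(messages: string[]): Map<string, GroupTally> {
  const tallies = new Map<string, GroupTally>();
  for (const message of messages) {
    const tokens = message.toLowerCase().split(/[^\p{L}\p{N}]+/u).filter(Boolean);
    for (const token of tokens) {
      const group = variantToGroup.get(token);
      if (group === undefined) continue;
      const tally = tallies.get(group) ?? { total: 0, byVariant: new Map<string, number>() };
      tally.total += 1;
      tally.byVariant.set(token, (tally.byVariant.get(token) ?? 0) + 1);
      tallies.set(group, tally);
    }
  }
  return tallies;
}

const sampleRollup = rollupByGroup(["qué wea wn", "Weón, otra wea", "ya po wn"]);
// sampleRollup.get("weon") has total 3 (wn: 2, weón: 1); sampleRollup.get("wea") has total 2
```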

Marking as ready for review. Happy to split or trim further if you'd prefer a smaller surface area.

smirea added a commit to smirea/devrage that referenced this pull request Jun 3, 2026
Apply upstream gricha#5 as a single squash commit.
Resolved overlap with gricha#1 by keeping both wordlist expansions.

Co-Authored-By: AI <me+ai@stefanmirea.com>
