Improve profanity expectation to not match when it has profanity as s…#63

Merged
JonPurvis merged 1 commit into pestphp:master from pablopetr:improve-profanity-check-to-not-match-similar-words
Jun 1, 2025

@pablopetr
Contributor

Problem

I found some edge cases in Portuguese where words that are perfectly safe contain profanities as substrings, causing our check to throw exceptions incorrectly. The same happens in English.

Examples

In Portuguese:

  • "custo" → contains "cu" (a profanity), but "custo" means "cost" and is not offensive.
  • "prolixo" → contains "lixo" (trash), but it means "verbose" or "long-winded" and may be used legitimately in programming contexts to describe a method.

In English:

  • "analytics" → contains "anal", but "analytics" is not a profanity.
  • "analyst" → contains "anal", but "analyst" is not a profanity.
  • "jsonObj" → contains "bj", but it's a common naming pattern. While using "Obj" might not be ideal stylistically, it is not offensive and may appear in many codebases.
  • "cockpit" → contains "cock", but it's a technical aviation term.

Until now, we have handled these cases by adding the false positives to src/Config/tolerated.php. That requires manual effort and doesn't scale well, because we would constantly need to add new combinations.

A better approach is to flag a profanity only when it appears as an exact, isolated word, not as a substring inside a larger, valid word.
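
To make the false-positive problem concrete, here is a minimal sketch of a naive substring check. The helper name `containsSubstring` is hypothetical and is not taken from the plugin; it only illustrates why plain substring matching flags the safe words listed above.

```php
<?php

declare(strict_types=1);

// Hypothetical helper for illustration only: a naive, case-insensitive substring check.
function containsSubstring(string $profanity, string $text): bool
{
    return stripos($text, $profanity) !== false;
}

// Safe words from the examples above are all flagged by the substring check.
var_dump(containsSubstring('anal', 'analytics')); // bool(true)
var_dump(containsSubstring('cu', 'custo'));       // bool(true)
var_dump(containsSubstring('bj', 'jsonObj'));     // bool(true)
```

Every call returns true even though none of the inputs is offensive, which is the behaviour an isolated-word check is meant to avoid.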

How do we detect this?

We will improve the detection logic to identify profanities only when they appear as isolated words, based on:

  • Whitespace-separated words: "this shit is bad"
  • camelCase or PascalCase: "thisShitIsBad" or "ThisShitIsBad"
  • snake_case: "this_shit_is_bad"
  • Isolated in a method call: dd('shit')

Code Explanation

src/Expectations/Profanity.php — Line 50

if (preg_match('/\b'.preg_quote($word, '/').'\b/i', $fileContents)) {
    return true;
}

Matches (❌ = flagged as profanity, ✅ = passes):

  • "this shit is a test" ❌
  • // this code is a shit but it is working, let's refactor when we have time ❌
  • dd('shit') ❌
  • dd('anal') ❌
  • dd('analysing this function'); ✅ (because it contains the profanity as a substring, but the word itself is not a profanity)
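
One detail worth noting: in PCRE, `\b` marks a transition between a word character (letters, digits, and the underscore) and a non-word character. Because the underscore counts as a word character, the word-boundary check alone does not see `shit` inside `this_shit_is_bad`, which is why snake_case and camelCase identifiers need the additional handling described in the following steps. A minimal sketch, where the wrapper name `matchesWholeWord` is hypothetical and the pattern is the one from the snippet above:

```php
<?php

declare(strict_types=1);

// Hypothetical wrapper around the word-boundary pattern shown above.
function matchesWholeWord(string $word, string $fileContents): bool
{
    return preg_match('/\b'.preg_quote($word, '/').'\b/i', $fileContents) === 1;
}

var_dump(matchesWholeWord('anal', "dd('anal')"));                     // bool(true)
var_dump(matchesWholeWord('anal', "dd('analysing this function');")); // bool(false)
var_dump(matchesWholeWord('shit', 'this_shit_is_bad'));               // bool(false): "_" is a word character
var_dump(matchesWholeWord('shit', 'thisShitIsBad'));                  // bool(false): no boundary inside the identifier
```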

Line 54 — snake_case detection:

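// Extract identifier-like tokens: a letter followed by word characters.
preg_match_all('/[a-zA-Z]\w*/', $fileContents, $matches);

foreach ($matches[0] as $token) {
    // Split snake_case tokens on underscores.
    $snakeParts = explode('_', $token);

    foreach ($snakeParts as $part) {
        // Case-insensitive comparison against the profanity.
        if (strcasecmp($part, $word) === 0) {
            return true;
        }
    }
}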

Matches:

  • $array['this_shit_is_working'] = true; ❌
  • const THIS_SHIT_CONST_WILL_WORK = true; ❌
  • $array['employee_is_analyst'] = true; ✅ (because it contains the profanity as a substring, but the word itself is not a profanity)
  • const OBJ_NUMBERS = 5; ✅
  • const BJ_PROFANITY = 5; ❌ (because "BJ" stands alone as a snake_case part)

Line 65 — camelCase/PascalCase detection:

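// $camelParts holds the camelCase/PascalCase pieces of the current token
// (the split that produces it is not shown in this excerpt).
if (! is_array($camelParts)) {
    return false;
}

// Flag the token when any piece equals the profanity, ignoring case.
foreach ($camelParts as $subpart) {
    if (strcasecmp($subpart, $word) === 0) {
        return true;
    }
}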

Matches:

  • public function sTart(): void {} ❌
  • public function itIsAShitMethod(): void {} ❌
  • public function HereWeHaveShitMethod(): void {} ❌
  • public function getJsonObj ✅
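
The excerpt above does not show how `$camelParts` is produced. One plausible way to obtain it, offered here as an assumption and not as the plugin's confirmed code, is to split each token at the position before every uppercase letter using `preg_split` with a lookahead. The function name `splitCamelCase` is hypothetical.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: splits a camelCase or PascalCase token into its parts.
 *
 * @return array<int, string>
 */
function splitCamelCase(string $token): array
{
    // The lookahead (?=[A-Z]) matches the empty position before each uppercase letter,
    // so the letters themselves are kept. PREG_SPLIT_NO_EMPTY drops the empty
    // first element produced by PascalCase input.
    $parts = preg_split('/(?=[A-Z])/', $token, -1, PREG_SPLIT_NO_EMPTY);

    return $parts === false ? [] : $parts;
}

print_r(splitCamelCase('itIsAShitMethod')); // it, Is, A, Shit, Method
print_r(splitCamelCase('getJsonObj'));      // get, Json, Obj
print_r(splitCamelCase('sTart'));           // s, Tart
```

Under this split, `getJsonObj` yields `Obj` and never `bj`, so it passes, while `sTart` yields `Tart`, which would be flagged if "tart" is in the word list.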

Benefits of This Change

This approach improves scalability and accuracy. For example:

  • In English, a single word like "ass" can cause dozens of false positives (assign, assistant, etc.), which we can’t realistically handle by manually adding each one to src/Config/tolerated.php.
  • In Portuguese, the same happens with "cu" or "lixo" inside safe words, and I imagine it can happen in other languages too.
  • With this new logic, we can remove the src/Config/tolerated.php file entirely and rely on smart context-aware detection.

@JonPurvis
Collaborator

Thanks for this! V4 actually removes the tolerated file completely. If you want to take V4 for a test, go right ahead! I'll get this merged and tag a new release for V3 though.

@JonPurvis JonPurvis merged commit d8535ff into pestphp:master Jun 1, 2025
20 checks passed