Improve profanity expectation to not match when it has profanity as s… by pablopetr · Pull Request #63 · pestphp/pest-plugin-profanity

pablopetr · 2025-05-31T19:24:32Z

Problem

I found some edge cases in Portuguese where words that are perfectly safe contain profanities as substrings, causing our check to throw exceptions incorrectly. The same happens in English.

Examples

In Portuguese:

"custo" → contains "cu" (a profanity), but "custo" means "cost" and is not offensive.
"prolixo" → contains "lixo" (trash), but it means "verbose" or "long-winded" and may be used legitimately in programming contexts to describe a method.

In English:

"analytics" → contains "anal", but "analytics" is not a profanity.
"analyst" → contains "anal", but "analyst" is not a profanity.
"jsonObj" → contains "bj", but it's a common naming pattern. While using "Obj" might not be ideal stylistically, it is not offensive and may appear in many codebases.
"cockpit" → contains "cock", but it's a technical aviation term.

Until now, we have handled these cases by adding each false positive to src/Config/tolerated.php. That requires manual effort and doesn't scale well, because we would constantly need to add new combinations.

Instead, a better approach would be to detect whether a profanity appears as an exact isolated word, not as a substring inside a larger, valid word.

How do we detect this?

We will improve the detection logic to identify profanities only when they appear as isolated words, based on:

Whitespace-separated words: "this shit is bad"
camelCase or PascalCase: "thisShitIsBad" or "ThisShitIsBad"
snake_case: "this_shit_is_bad"
Isolated in a method call: dd('shit')

Code Explanation

src/Expectations/Profanity.php — Line 50

if (preg_match('/\b'.preg_quote($word, '/').'\b/i', $fileContents)) {
    return true;
}

Results (❌ = flagged as profanity, ✅ = passes):

"this shit is a test" ❌
// this code is a shit but it is working, let's refactor when we have time ❌
dd('shit') ❌
dd('anal') ❌
dd('analysing this function'); ✅ (because it contains the profanity as a substring, but the word itself is not a profanity)
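
To see what the word-boundary check covers on its own, here is a minimal sketch. The helper name matchesAsWholeWord is hypothetical and not part of the plugin; it only wraps the pattern shown above. PCRE treats letters, digits, and the underscore as word characters, so \b finds a profanity in plain text and in dd('shit'), but not inside a snake_case or camelCase identifier. That is why the additional passes are still needed.

```php
<?php

// Hypothetical helper wrapping the word-boundary pattern from the PR.
function matchesAsWholeWord(string $word, string $text): bool
{
    return preg_match('/\b'.preg_quote($word, '/').'\b/i', $text) === 1;
}

var_dump(matchesAsWholeWord('shit', 'this shit is a test'));     // bool(true)
var_dump(matchesAsWholeWord('shit', "dd('shit')"));              // bool(true)
var_dump(matchesAsWholeWord('anal', 'analysing this function')); // bool(false)
var_dump(matchesAsWholeWord('shit', 'this_shit_is_bad'));        // bool(false): "_" is a word character
var_dump(matchesAsWholeWord('shit', 'thisShitIsBad'));           // bool(false): no boundary inside the identifier
```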

Line 54 — snake_case detection:

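// Collect identifier-like tokens (a letter followed by letters, digits, or underscores).
preg_match_all('/[a-zA-Z]\w*/', $fileContents, $matches);

foreach ($matches[0] as $token) {
    // Split snake_case tokens and compare each part to the profanity, ignoring case.
    $snakeParts = explode('_', $token);

    foreach ($snakeParts as $part) {
        if (strcasecmp($part, $word) === 0) {
            return true;
        }
    }
}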

Results (❌ = flagged as profanity, ✅ = passes):

$array['this_shit_is_working'] = true; ❌
const THIS_SHIT_CONST_WILL_WORK = true; ❌
$array['employee_is_analyst'] = true; ✅ (because it contains the profanity as a substring, but the word itself is not a profanity)
const OBJ_NUMBERS = 5; ✅
const BJ_PROFANITY = 5; ❌ (the "BJ" part is an exact, case-insensitive match for "bj")
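
The snake_case pass can be tried in isolation with a small self-contained sketch. The function name hasSnakeCasePart is hypothetical; the body follows the excerpt above, with a final return false added so it runs standalone. Running it on the examples shows that a part must equal the profanity exactly: OBJ_NUMBERS passes for "bj", while BJ_PROFANITY is flagged because its first part is "BJ".

```php
<?php

// Hypothetical helper built from the excerpt above; not the plugin's API.
function hasSnakeCasePart(string $word, string $text): bool
{
    // Identifier-like tokens: a letter followed by letters, digits, or underscores.
    preg_match_all('/[a-zA-Z]\w*/', $text, $matches);

    foreach ($matches[0] as $token) {
        // Compare every underscore-separated part to the word, ignoring case.
        foreach (explode('_', $token) as $part) {
            if (strcasecmp($part, $word) === 0) {
                return true;
            }
        }
    }

    return false;
}

var_dump(hasSnakeCasePart('shit', "\$array['this_shit_is_working'] = true;")); // bool(true)
var_dump(hasSnakeCasePart('anal', "\$array['employee_is_analyst'] = true;"));  // bool(false)
var_dump(hasSnakeCasePart('bj', 'const OBJ_NUMBERS = 5;'));                    // bool(false)
var_dump(hasSnakeCasePart('bj', 'const BJ_PROFANITY = 5;'));                   // bool(true)
```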

Line 65 — camelCase/PascalCase detection:

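// $camelParts holds the camelCase/PascalCase pieces of the current token
// (the split that produces it is not shown in this excerpt).
if (! is_array($camelParts)) {
    return false;
}

// Compare each piece to the profanity, ignoring case.
foreach ($camelParts as $subpart) {
    if (strcasecmp($subpart, $word) === 0) {
        return true;
    }
}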

Results (❌ = flagged as profanity, ✅ = passes):

public function sTart(): void {} ❌
public function itIsAShitMethod(): void {} ❌
public function HereWeHaveShitMethod(): void {} ❌
public function getJsonObj ✅
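
The excerpt above does not show how $camelParts is produced. One possible way, assumed here purely for illustration and not confirmed by the PR, is preg_split() with a lookahead before each uppercase letter. preg_split() returns false on failure, which would be consistent with the is_array() guard. Both helper names below are hypothetical.

```php
<?php

// Hypothetical helper: split an identifier before each uppercase letter.
// The pattern is an assumption for illustration, not taken from the plugin.
function splitCamelCase(string $identifier): array
{
    $parts = preg_split('/(?=[A-Z])/', $identifier, -1, PREG_SPLIT_NO_EMPTY);

    // preg_split() returns false on failure, so fall back to an empty list.
    return is_array($parts) ? $parts : [];
}

// Hypothetical helper: true when any camelCase piece equals the word, ignoring case.
function hasCamelCasePart(string $word, string $identifier): bool
{
    foreach (splitCamelCase($identifier) as $part) {
        if (strcasecmp($part, $word) === 0) {
            return true;
        }
    }

    return false;
}

print_r(splitCamelCase('itIsAShitMethod'));                     // it, Is, A, Shit, Method
var_dump(hasCamelCasePart('shit', 'HereWeHaveShitMethod'));     // bool(true)
var_dump(hasCamelCasePart('bj', 'getJsonObj'));                 // bool(false)
```

A split like this breaks runs of capitals into single letters ("parseHTML" becomes parse, H, T, M, L), so acronyms would need separate handling if that matters.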

Benefits of This Change

This approach improves scalability and accuracy. For example:

In English, a single word like "ass" can cause dozens of false positives (assign, assistant, etc.), which we can’t realistically handle by manually adding each one to src/Config/tolerated.php.
In Portuguese, the same happens with "cu" or "lixo" inside safe words, and I imagine it happens in other languages too.
With this new logic, we can remove the src/Config/tolerated.php file entirely and rely on smart context-aware detection.

JonPurvis · 2025-06-01T20:46:19Z

Thanks for this! V4 actually removes the tolerated file completely. If you want to take V4 for a test, go right ahead! I'll get this merged and tag a new release for V3 though.

Improve profanity expectation to not match when it has profanity as s…

36a0c19

…ubstring

JonPurvis merged commit d8535ff into pestphp:master Jun 1, 2025
20 checks passed

JonPurvis mentioned this pull request Jun 2, 2025

Remove tolerated config JonPurvis/squeaky#16

Merged

JonPurvis merged 1 commit into pestphp:master from pablopetr:improve-profanity-check-to-not-match-similar-words
