Skip to content

GlobalReplace with patterns ending in .*$ scans entire input through BitState unnecessarily #605

@cosu

Description

@cosu

When using RE2::GlobalReplace with patterns ending in .*$ (or .*) where the .* contributes no capture groups, BitState scans the entire input string to track the .* states, even though it is guaranteed to consume the rest of the string.

This is a pretty common usecase for log processing where a regex like ^https?://(?:www\.)?([^/]+)/.*$ is used to extract hostnames from web logs.

It should be possible to short-circuit the match once everything before the trailing .* has been matched, avoiding the cost of scanning the remainder of the input.

I prototyped this by stripping the trailing.*$ and using RE2::Extract instead of GlobalReplace, which significantly reduced BitState CPU. However, I had to add a guard to fall back when the replacement references \0 (full match), since the matched portion changes.

Is this approach sound? Are there other edge cases or correctness issues I might be missing? Would this be worth handling natively in RE2?

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions