Allow an option for \b regexp regression in JDK >= 19 #24

@arysin

Description

JDK 19 changed the default behavior of \b in regular expressions to make it consistent with \w.
The problem is that existing Java code whose regular expressions rely on \b matching at the boundaries of Unicode (non-ASCII) words is effectively broken on JDK >= 19. See languagetool-org/languagetool#9854.
The LanguageTool project mentioned above uses the segment project and has a large number of \b occurrences in its segmentation rules for 25 languages.
One fix would be to add (?U) to every regexp that uses \b.
It would be nice to have an option in the net.loomchild.segment project to pass Pattern.UNICODE_CHARACTER_CLASS when compiling the rules from segment.srx, so that each rule does not have to be adjusted separately. (Turning Pattern.UNICODE_CHARACTER_CLASS on by default may not be possible, since it could cause regressions for other projects, even though the probability is low.)
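To make the per-rule fix concrete, here is a minimal sketch of what prefixing a rule with (?U) does. The class and helper names (InlineUnicodeFlagDemo, withUnicodeFlag, found) are made up for illustration and are not part of segment or LanguageTool. The sample word is written with Unicode escapes only to keep the source ASCII-safe.

```java
import java.util.regex.Pattern;

// Illustrative only: these names are hypothetical, not part of any library.
final class InlineUnicodeFlagDemo {

    // Prefixes a rule with the embedded flag (?U), which is the inline
    // form of Pattern.UNICODE_CHARACTER_CLASS.
    static String withUnicodeFlag(String regex) {
        return "(?U)" + regex;
    }

    // Returns true if the regex matches anywhere in the text.
    static boolean found(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        // A Cyrillic word ("slovo"), spelled with Unicode escapes.
        String word = "\u0441\u043b\u043e\u0432\u043e";
        String text = "a " + word + " b";
        String rule = "\\b" + word + "\\b";

        // The result of the unmodified rule depends on the JDK version in use.
        System.out.println("plain rule:     " + found(rule, text));
        // With (?U), \b treats the Cyrillic letters as word characters.
        System.out.println("rule with (?U): " + found(withUnicodeFlag(rule), text));
    }
}
```

Running this on both an older JDK and JDK >= 19 should show whether the unmodified rule is affected in a given environment.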
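As a rough sketch of the requested option, the flag could be applied once at the point where rule patterns are compiled. The class below (RulePatternCompiler, with its unicodeCharacterClass constructor argument) is purely hypothetical and does not reflect how net.loomchild.segment actually compiles rules. Only Pattern.compile(String, int) and Pattern.UNICODE_CHARACTER_CLASS are real JDK APIs.

```java
import java.util.regex.Pattern;

// Hypothetical sketch: not the actual net.loomchild.segment API.
final class RulePatternCompiler {
    private final int flags;

    // unicodeCharacterClass is an assumed opt-in setting; it is off unless
    // the caller asks for it, so other projects keep their current behavior.
    RulePatternCompiler(boolean unicodeCharacterClass) {
        this.flags = unicodeCharacterClass ? Pattern.UNICODE_CHARACTER_CLASS : 0;
    }

    // Compiles a rule regex with the configured flags, so individual
    // rules do not need a (?U) prefix.
    Pattern compile(String regex) {
        return Pattern.compile(regex, flags);
    }

    public static void main(String[] args) {
        // A Cyrillic word ("slovo"), spelled with Unicode escapes.
        String word = "\u0441\u043b\u043e\u0432\u043e";
        String text = "a " + word + " b";

        RulePatternCompiler compiler = new RulePatternCompiler(true);
        Pattern rule = compiler.compile("\\b" + word + "\\b");
        System.out.println("matches with option on: " + rule.matcher(text).find());
    }
}
```

Keeping the option off by default mirrors the concern above about possible regressions in other projects that depend on the library.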
