Description
JDK 19 changed the default behavior of \b in regular expressions to make it consistent with \w: unless Unicode character classes are enabled, \b now recognizes only ASCII word characters.
As a result, existing Java code whose regular expressions rely on \b to find word boundaries around non-ASCII Unicode characters is likely to break on JDK >= 19. See (languagetool-org/languagetool#9854)
The LanguageTool project mentioned above uses the segment project and has a large number of \b occurrences in its segmentation rules for 25 languages.
The fix would be to add the (?U) inline flag to each regexp that uses \b.
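A minimal sketch of that per-rule fix, for illustration only. The class name InlineUnicodeFlag and its helper methods are hypothetical and are not part of segment or LanguageTool; the Cyrillic sample text is written with Unicode escapes so the file compiles regardless of source encoding. The result of the unflagged pattern depends on the JDK version, so no output is asserted for it here.

```java
import java.util.regex.Pattern;

// Hypothetical demo class, not part of net.loomchild.segment.
class InlineUnicodeFlag {
    // Prefix a rule with (?U) so that \b and \w use Unicode definitions.
    static String withUnicodeFlag(String regex) {
        return regex.startsWith("(?U)") ? regex : "(?U)" + regex;
    }

    // True if the regex matches anywhere in the text.
    static boolean matches(String regex, String text) {
        return Pattern.compile(regex).matcher(text).find();
    }

    public static void main(String[] args) {
        String rule = "\\b\u043A\u043E\u0442\\b";            // \b + Cyrillic "kot" + \b
        String text = "\u044D\u0442\u043E \u043A\u043E\u0442"; // Cyrillic "eto kot"
        // The first result differs between JDK < 19 and JDK >= 19.
        System.out.println("without (?U): " + matches(rule, text));
        System.out.println("with (?U):    " + matches(withUnicodeFlag(rule), text));
    }
}
```

With (?U) in place, the Cyrillic word is treated as a sequence of word characters, so \b matches at its edges on both older and newer JDKs.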
It would be nice to have an option in the net.loomchild.segment project to pass Pattern.UNICODE_CHARACTER_CLASS when compiling the rules from segment.srx, so that there is no need to adjust each rule separately. (We may not be able to turn on Pattern.UNICODE_CHARACTER_CLASS by default, as that could cause regressions in other projects, even though the probability is low.)
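One way such an option could look, as a rough sketch rather than a proposal for the actual API: the class RulePatternCompiler and its constructor parameter are hypothetical and do not exist in net.loomchild.segment. The only real API used is java.util.regex.Pattern and its UNICODE_CHARACTER_CLASS flag. The flag is off unless the caller asks for it, which keeps the current behavior for other projects.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of net.loomchild.segment.
class RulePatternCompiler {
    private final int flags;

    // unicodeCharacterClass is an opt-in switch; false keeps default matching.
    RulePatternCompiler(boolean unicodeCharacterClass) {
        this.flags = unicodeCharacterClass ? Pattern.UNICODE_CHARACTER_CLASS : 0;
    }

    // Compile a rule's regex with the configured flags, leaving the rule text unchanged.
    Pattern compile(String rule) {
        return Pattern.compile(rule, flags);
    }
}
```

A caller such as LanguageTool would then construct the compiler with true and leave its rules untouched, instead of prefixing every rule with (?U).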