RegEx for detecting words is not handling special characters correctly #1424

@TrondTollefsen

Hi,

In document.py, the line:

FIND_WORD_RE = re.compile(r"([a-zA-Z0-9]+|[^a-zA-Z0-9_\s]+)")

will detect either words that consist only of alphanumeric characters or words that consist only of special characters, but not words that combine the two. See attached image.
[Image: example regex bug]

As shown in the picture, words containing both alphanumeric and special characters are split into several words.
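
Since the image may not be visible, here is a minimal sketch that reproduces the splitting with the pattern quoted above. The sample string is a made-up input for illustration, not taken from document.py.

```python
import re

# Pattern as quoted above from document.py.
FIND_WORD_RE = re.compile(r"([a-zA-Z0-9]+|[^a-zA-Z0-9_\s]+)")

# Hypothetical sample input mixing alphanumeric and special characters.
text = "foo-bar baz@example.com"

# Each match is either a run of alphanumerics or a run of special characters,
# so a mixed token such as "foo-bar" is reported as three separate words.
words = [m.group(0) for m in FIND_WORD_RE.finditer(text)]
print(words)
# ['foo', '-', 'bar', 'baz', '@', 'example', '.', 'com']
```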

I suggest changing RegEx to:

_FIND_WORD_RE = re.compile(r"([^\s]+)")
[Image: example regex fix]

as used later in document.py.
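
A minimal sketch of the effect of the suggested pattern, assuming it is meant as `([^\s]+)` (one capture group around a run of non-whitespace characters). The name `PROPOSED_WORD_RE` and the sample string are made up for illustration.

```python
import re

# Suggested pattern, assumed to be a single group of non-whitespace characters.
# PROPOSED_WORD_RE is a hypothetical name used only in this sketch.
PROPOSED_WORD_RE = re.compile(r"([^\s]+)")

# Hypothetical sample input mixing alphanumeric and special characters.
text = "foo-bar baz@example.com"

# Every whitespace-delimited token is now reported as one word.
words = [m.group(1) for m in PROPOSED_WORD_RE.finditer(text)]
print(words)
# ['foo-bar', 'baz@example.com']
```

Note that this treats any run of non-whitespace characters as a single word, so word-wise movement would no longer stop at punctuation inside a token.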

This would also fix issue 1280

Regards,
Trond
