Bug report in SequenceTokenizerNew #657

Closed
Jabher opened this issue Nov 27, 2022 · 4 comments

Jabher commented Nov 27, 2022

SentenceTokenizerNew fails on the following call:

sentenceTokenizer.tokenize('"All ticketed passengers should now be in the Blue Concourse sleep lounge. Make sure your validation papers are in order. Thank you". The upstairs lounge was not at all grungy.') (quote from "The Jaunt" by Stephen King)

with the following message:

{
    "message": "Expected [ \\t\\n\\r.?!] or [)\\]}\"'`’] but \"M\" found.",
    "expected": [
        {
            "type": "class",
            "parts": [
                " ",
                "\t",
                "\n",
                "\r",
                ".",
                "?",
                "!"
            ],
            "inverted": false,
            "ignoreCase": false
        },
        {
            "type": "class",
            "parts": [
                ")",
                "]",
                "}",
                "\"",
                "'",
                "`",
                "’"
            ],
            "inverted": false,
            "ignoreCase": false
        }
    ],
    "found": "M",
    "location": {
        "start": {
            "offset": 75,
            "line": 1,
            "column": 76
        },
        "end": {
            "offset": 76,
            "line": 1,
            "column": 77
        }
    },
    "name": "SyntaxError"
}
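
For readers mapping the error back to the input: `location.start.offset` in the object above appears to be a zero-based index into the string passed to `tokenize`. The sketch below uses only plain JavaScript string indexing (no library calls) to show which character sits at that offset. The variable names are illustrative and not part of the report.

```javascript
// The input from the report above, as a single-quoted JS string.
const text = '"All ticketed passengers should now be in the Blue Concourse sleep lounge. Make sure your validation papers are in order. Thank you". The upstairs lounge was not at all grungy.';

// Value copied from "location.start.offset" in the error object.
const offset = 75;

// Character at the reported offset, and the eight characters just before it.
const found = text[offset];
const context = text.slice(offset - 8, offset);

console.log(JSON.stringify(found));   // "M"
console.log(JSON.stringify(context)); // "lounge. "
```

So the parser gives up at the "M" of "Make", that is, at the start of the second sentence inside the quotation marks.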

Hugo-ter-Doest (Collaborator) commented Dec 3, 2022

Strictly speaking, the second double quote is in the wrong place. It should come right after the period, like this:

Thank you."

This results in a syntax error as well, so I will try and make it more robust.

Hugo-ter-Doest added a commit that referenced this issue Dec 5, 2022
* Issue #657 part one

* Lint issues

* Removed files that were incidently added

* lint correction

Hugo-ter-Doest (Collaborator) commented

Partly solved with #658

It can now handle multiple sentences surrounded by quotes, but it cannot handle the position of the quote symbol in your example. I will leave the issue open and try later to make it more robust.

Jabher (Author) commented Dec 6, 2022

> Strictly speaking, the second double quote is in the wrong place. It should come right after the period, like this:
>
> Thank you."
>
> This results in a syntax error as well, so I will try and make it more robust.

I totally agree :) That's why I specifically noted that it is not my own text but a quote from a sci-fi classic.

Hugo-ter-Doest (Collaborator) commented

I won't fix this any further, since the parser mechanism underlying this sentence tokenizer is not flexible enough to handle it.
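
Since the tokenizer itself will not be changed further, one possible workaround is to normalize the quote placement before tokenizing, moving the closing quote to after the period as suggested earlier in this thread. The sketch below is only an illustration: `normalizeClosingQuote` is a hypothetical helper, not part of the library, and it alters the source text. Whether the tokenizer accepts the normalized string depends on the installed version (per the comments above, quoted multi-sentence passages are handled after #658).

```javascript
// Hypothetical helper (not part of the library): moves a closing double
// quote that sits directly before sentence-ending punctuation to directly
// after it, e.g.  Thank you".  becomes  Thank you."
function normalizeClosingQuote(text) {
  // $1 = last word character, $2 = closing quote, $3 = . ? or !
  return text.replace(/(\w)(["”])([.?!])/g, '$1$3$2');
}

const fixed = normalizeClosingQuote(
  'Make sure your validation papers are in order. Thank you". The upstairs lounge was not at all grungy.'
);
console.log(fixed);
```

The pattern only matches a quote that directly follows a word character, so opening quotes at the start of a sentence are left alone.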
