[en] Different sentence splitting in the GUI and the command line #6318
Comments
Does it do that explicitly? I guess it's just the default on Windows?
I tried to make a disambiguation rule that applies a SENT_END and a SENT_START postag to text that contains a line feed (LF). I could not. To try to understand why I could not make a disambiguation rule, I made a grammar rule to find the end of a sentence:
LF = U+000A (https://www.compart.com/en/unicode/U+000A). When I run testrules, I get this message:
I found this message from 2012 that discusses a similar problem: Refer to https://forum.languagetool.org/t/searching-for-specific-unicode-characters/116
@SpaceIshtar wrote, "I'm wondering whether or not I can work on this issue and open a pull request." Yes please. I would be very grateful if you could find a solution.
@languagetool-org/developers, it would be really nice if LT let users find any Unicode character that they want to find.
The different sentence splitting causes inconsistent analysis of text.
Example sentences:
The GUI shows 3 sentences:

If a sentence contains 2 instances of the word 'cat', this rule finds the first instance:
Put the example sentences into a file (test-data.txt) and use the command line to analyse the sentences. The results from the command line give a false warning:
The sentence splitting is different in the command line analysis. There is only one sentence:
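To illustrate how one policy difference could produce 3 sentences in one place and 1 in another, here is a minimal Python sketch. It is not LanguageTool's segmenter. The function `split_sentences`, its `break_on_single_lf` parameter, and the sample text are all hypothetical, chosen only to show that treating a single line feed as a sentence boundary changes the sentence count and therefore what a "second instance in the same sentence" rule would see.

```python
import re

def split_sentences(text, break_on_single_lf):
    """Toy splitter (an assumption for illustration, not LanguageTool code).

    A sentence ends after '.', '!' or '?' followed by whitespace.
    If break_on_single_lf is True, any line feed also ends a sentence;
    otherwise only a blank line (two or more line feeds) does.
    """
    text = text.replace("\r\n", "\n")  # normalise Windows line endings
    if break_on_single_lf:
        boundary = r"(?<=[.!?])\s+|\n+"
    else:
        boundary = r"(?<=[.!?])\s+|\n{2,}"
    parts = re.split(boundary, text)
    # Collapse remaining whitespace (including single LFs) inside a sentence.
    return [" ".join(p.split()) for p in parts if p.strip()]

# Hypothetical sample: three lines without end punctuation.
sample = "The cat sat\nA cat slept\nThe dog ran"

per_line = split_sentences(sample, break_on_single_lf=True)
merged = split_sentences(sample, break_on_single_lf=False)

print(len(per_line), per_line)
print(len(merged), merged)
```

With the first policy each line is its own sentence, so no sentence contains 'cat' twice. With the second policy the lines merge into one sentence that does, which is the kind of false warning described above.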