[en] Different sentence splitting in the GUI and the command line #6318

MikeUnwalla opened this issue Feb 11, 2022 · 4 comments

@MikeUnwalla
Contributor

MikeUnwalla commented Feb 11, 2022

The different sentence splitting causes inconsistent analysis of text.

Example sentences:

The cat sat on the mat.
The dog ate a bone.
Then another cat sat on the mat.

The GUI shows 3 sentences:
image

If a sentence contains 2 instances of the word 'cat', this rule finds the first instance:

      <rule id="TEST_SENTENCE_SPLIT" name="test sentence splitting">
        <pattern>
          <marker>
            <token skip="-1">cat</token>
          </marker>
          <token>cat</token>
        </pattern>
        <message>Found the first cat.</message>
        <example type="incorrect">My <marker>cat</marker> and your cat both sat on a mat.</example>
        <example type="correct">Then another cat sat on the mat.</example>
      </rule>

Put the example sentences into a file (test-data.txt) and use the command line to analyse the sentences. The results from the command line give a false warning:

D:\LanguageTool-5.7-SNAPSHOT>java -jar languagetool-commandline.jar -l en-US -eo -e TEST_SENTENCE_SPLIT test-data.txt
Expected text language: English (US) (no spell checking active)
Working on test-data.txt...
1.) Line 1, column 5, Rule ID: TEST_SENTENCE_SPLIT[1]
Message: Found the first cat.
?The cat sat on the mat.  The dog ate a bone.  Then a...
     ^^^
Time: 1638ms for 1 sentences (0.6 sentences/sec)

The sentence splitting is different in the command line analysis. There is only one sentence:

D:\LanguageTool-5.7-SNAPSHOT>java -jar languagetool-commandline.jar -l en-US -t -eo -e TEST_SENTENCE_SPLIT test-data.txt
Expected text language: English (US) (no spell checking active)
Working on test-data.txt...
<S>  The[the/DT] cat[cat/NN,E-NP-singular] sat[sit/VBD,B-VP] on[on/IN,on/JJ,on/RP,B-PP] the[the/DT,B-NP-singular] mat[mat/JJ,mat/NN,E-NP-singular].[./.,./PCT,O]  The[the/DT,B-NP-singular] dog[dog/NN,E-NP-singular] ate[ate/NN,eat/VBD,B-VP] a[a/DT,B-NP-singular] bone[bone/NN:UN,E-NP-singular].[./.,./PCT,O]  My[I/PRP$_A1S,my/PRP$,B-NP-singular] cat[cat/NN,E-NP-singular] likes[like/NNS,like/VBZ,B-VP] your[you/PRP$_A2S,you/PRP$_A2P,your/PRP$,B-NP-singular] dog[dog/NN,E-NP-singular].[./.,</S>./PCT,O]
MikeUnwalla changed the title from "[en] Different sentence splitting in the GUI and the command line can cause an incorrect analysis of text" to "[en] Different sentence splitting in the GUI and the command line" on Feb 11, 2022
@SpaceIshtar

Hi,
I've reproduced and investigated this problem.
I found that when we use the GUI, the input string uses \n as the line break. However, when the command line reads the .txt file, the input string uses \r\n as the line break, which the current sentence-splitting rules do not recognize.
The following screenshots show the difference.
The text string when using the GUI:
image
The text string when using the command line:
image
I think the simplest way to solve this problem is to replace "\r" with "" somewhere in the code.
So I tried replacing "\r" with "" in the method static List<String> tokenize(String text, SrxDocument srxDocument, String code) in SrxTools.java.
image
After that, I rebuilt the project and found that the change works without introducing any new bugs (all the tests pass with mvn clean test).
image
May I work on this issue and open a pull request to solve this problem?
If there is a better solution to this issue, please let me know so that I can solve the problem in a better way.
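
To illustrate the effect described above, here is a minimal Java sketch. It is not the code from SrxTools.java, and the break pattern is a hypothetical stand-in, not one of LanguageTool's real SRX rules. The class name `LineEndingDemo` and both method names are made up for this example. It only shows why a rule that expects a line feed directly after the full stop stops matching when a carriage return sits in between, and how removing "\r" restores the split.

```java
import java.util.Arrays;
import java.util.List;

public class LineEndingDemo {

    // Hypothetical break rule (an assumption, NOT a real LanguageTool SRX rule):
    // break at a line feed that directly follows sentence-final punctuation.
    private static final String BREAK_AFTER_PUNCTUATION = "(?<=[.!?])\n";

    /** Removes every carriage return, as proposed in the comment above. */
    public static String stripCarriageReturns(String text) {
        return text.replace("\r", "");
    }

    /** Naive splitter that exists only to illustrate the effect of \r. */
    public static List<String> naiveSplit(String text) {
        return Arrays.asList(text.split(BREAK_AFTER_PUNCTUATION));
    }

    public static void main(String[] args) {
        String windowsText = "The cat sat on the mat.\r\n"
                + "The dog ate a bone.\r\n"
                + "Then another cat sat on the mat.";

        // With \r\n the line feed follows \r, not the full stop: prints 1
        System.out.println(naiveSplit(windowsText).size());

        // After removing \r the line feed follows the full stop: prints 3
        System.out.println(naiveSplit(stripCarriageReturns(windowsText)).size());
    }
}
```

One thing to keep in mind with this approach: removing characters makes the string shorter, so offsets computed on the cleaned text no longer line up with offsets in the original file. Whether that matters for the positions that LanguageTool reports is not established here.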

@danielnaber
Member

@SpaceIshtar wrote, "I found that when we use GUI, the input string uses \n as the indicator of the next line. However, when command line reads the txt file, it uses \r\n as the indicator of the next line."

Does it do that explicitly? I guess it's just the default on Windows?

@SpaceIshtar

@danielnaber wrote, "Does it do that explicitly? I guess it's just the default on Windows?"

Yes, you are right. This bug does not occur on Linux.
image
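
To check which line endings a test file actually contains, a small stand-alone Java sketch like the following can help. The class `LineEndingInspector` and its method are hypothetical helpers written for this thread and are not part of LanguageTool. The default file name is the test-data.txt from the first comment.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class LineEndingInspector {

    /** Returns "CRLF", "LF", "CR", "mixed", or "none" for the given text. */
    public static String describeLineEndings(String text) {
        int crlf = 0;
        int lf = 0;
        int cr = 0;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    crlf++;
                    i++; // skip the line feed that belongs to this pair
                } else {
                    cr++;
                }
            } else if (c == '\n') {
                lf++;
            }
        }
        int kinds = (crlf > 0 ? 1 : 0) + (lf > 0 ? 1 : 0) + (cr > 0 ? 1 : 0);
        if (kinds == 0) {
            return "none";
        }
        if (kinds > 1) {
            return "mixed";
        }
        return crlf > 0 ? "CRLF" : (lf > 0 ? "LF" : "CR");
    }

    public static void main(String[] args) throws IOException {
        // Pass another path as the first argument to inspect a different file.
        String path = args.length > 0 ? args[0] : "test-data.txt";
        String text = new String(Files.readAllBytes(Paths.get(path)), StandardCharsets.UTF_8);
        System.out.println(path + ": " + describeLineEndings(text));
        // Shows whether the JVM's platform default separator is \r\n.
        System.out.println("Platform default is CRLF: " + "\r\n".equals(System.lineSeparator()));
    }
}
```

A file saved with a Windows editor will usually report CRLF regardless of the platform it is later read on, so the result depends on how the file was written as well as on where it is read.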

@MikeUnwalla
Contributor Author

MikeUnwalla commented Sep 7, 2022

I tried to make a disambiguation rule that applies a SENT_END and a SENT_START postag to text that contains a line feed (LF). I could not. To try to understand why I could not make a disambiguation rule, I made a grammar rule to find the end of a sentence:

    <rule id="FIND_SENTENCE_END" name="Find the end of a sentence">
      <pattern>
        <marker>
          <token regexp="yes">\u000A</token>
        </marker>
      </pattern>
      <message>Found the end of a sentence.</message>
      <example type="incorrect">The cat is on the mat.
<marker/></example>
    </rule>

LF = U+000A (https://www.compart.com/en/unicode/U+000A).

image

When I run testrules, I get this message:

Checking regexp syntax of 5540 rules for English...
*** WARNING: The English rule: FIND_SENTENCE_END[1], token [1], contains "\u000A" that is marked as regular expression but probably is not one.

I found this message from 2012 that discusses a similar problem: https://forum.languagetool.org/t/searching-for-specific-unicode-characters/116
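
As a side check on the regular expression itself, the following minimal Java sketch shows that java.util.regex accepts the `\u000A` escape and matches a line feed with it. It assumes that the token text reaches java.util.regex unchanged, which this thread does not confirm. The class name `LineFeedRegexDemo` is made up for the example. The sketch says nothing about whether LanguageTool's tokenizer ever presents a line feed as a token to the rule matcher, which may be the real reason the rule finds nothing.

```java
import java.util.regex.Pattern;

public class LineFeedRegexDemo {

    // The regex text as it appears in the rule's token: backslash, u, 0, 0, 0, A.
    // The backslash is doubled so that the Java compiler passes it to the
    // regex engine instead of translating the escape itself.
    private static final Pattern LF_ESCAPE = Pattern.compile("\\u000A");

    /** Returns true only if the whole string is a single line feed. */
    public static boolean matchesLineFeed(String s) {
        return LF_ESCAPE.matcher(s).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesLineFeed("\n"));   // prints true
        System.out.println(matchesLineFeed("\r\n")); // prints false: two characters
        System.out.println(matchesLineFeed("n"));    // prints false
    }
}
```

If this holds inside LanguageTool too, the testrules warning would be a heuristic about the pattern text and not a sign that the expression is invalid, but that is an assumption and not something the log above proves.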

@SpaceIshtar wrote, "I'm wondering whether or not I can work on this issue and open a pull request." Yes please. I would be very grateful if you could find a solution.

@languagetool-org/developers. It would be really nice if LT let users find any Unicode characters that they want to find.
