Error in the .start and .end attributes #15

felipefcunica · 2022-09-16T21:41:38Z

I'm not sure whether this is specific to cogroo4py or applies to Cogroo as a whole.

The code below reproduces the error:

CODE1 :

from cogroo4py.cogroo import Cogroo
entrada = "Refiro-me à trabalho remunerado."
cogroo = Cogroo()
docVar = cogroo.grammar_check(entrada)
misVar = docVar.mistakes

for mistake in misVar:
    print( {"errorMsg":mistake.short_msg,
            "rule":mistake.rule_id,
            "start":mistake.start,
            "end":mistake.end})
    print(f' Erro literal: "{entrada[mistake.start:mistake.end]}"')

OUTPUT1:

{'errorMsg': 'Não ocorre crase antes de palavras masculinas.', 'rule': 'xml:1', 'start': 10, 'end': 11}
 Erro literal: "à"

CODE2:

from cogroo4py.cogroo import Cogroo
entrada = """Aqui tem uma coisa antes.Refiro-me à trabalho remunerado. Aqui tem outra cosia depois, estamos escrevendo mais."""
            
cogroo = Cogroo()
docVar = cogroo.grammar_check(entrada)
misVar = docVar.mistakes

for mistake in misVar:
    print( {"errorMsg":mistake.short_msg,
            "rule":mistake.rule_id,
            "start":mistake.start,
            "end":mistake.end})
    print(f' Erro literal: "{entrada[mistake.start:mistake.end]}"')

OUTPUT2:

{'errorMsg': 'Não ocorre crase antes de palavras masculinas.', 'rule': 'xml:1', 'start': 36, 'end': 37}
 Erro literal: " "

As shown, there appears to be a bug in the .start and .end attributes in cogroo4py when the input text is longer. With only the sentence that contains the error, the result is correct. So it may depend on whether a '.' comes directly before 'Refiro-me'. I ran one last test, this time with a space after that period:

CODE3:

from cogroo4py.cogroo import Cogroo
entrada = """Aqui tem uma coisa antes. Refiro-me à trabalho remunerado. Aqui tem outra cosia depois, estamos escrevendo mais."""
            
cogroo = Cogroo()
docVar = cogroo.grammar_check(entrada)
misVar = docVar.mistakes

for mistake in misVar:
    print( {"errorMsg":mistake.short_msg,
            "rule":mistake.rule_id,
            "start":mistake.start,
            "end":mistake.end})
    print(f' Erro literal: "{entrada[mistake.start:mistake.end]}"')

OUTPUT3:

{'errorMsg': 'Não ocorre crase antes de palavras masculinas.', 'rule': 'xml:1', 'start': 36, 'end': 37}
 Erro literal: "à"

Reproducibility information:
Code run in a Jupyter Notebook in VS Code.
OS: Windows 10 Home
Python 3.10
Run inside a Python virtual environment


gpassero · 2022-09-17T08:32:08Z

Hi Felipe, I've also had this problem before. It seems like a bug in cogroo4.

felipefcunica · 2022-09-17T19:03:09Z

Apparently the _preproc function changes the number of characters in the text. Could this be the source of the bug?

Take a look:

Using the same input:

input = "Aqui tem uma coisa antes.Refiro-me à trabalho remunerado. Aqui tem outra cosia depois, estamos escrevendo mais."

we can see:

>>> len(input)
111

>>> len(cogroo._preproc(input))
112

What do you think?
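
A minimal sketch of how one inserted character could produce the shift seen in OUTPUT2. Note that `fake_preproc` is a hypothetical stand-in, not the real `cogroo._preproc`: it only assumes that pre-processing adds a space after a period that is glued to the next word.

```python
import re


def fake_preproc(text: str) -> str:
    """Hypothetical stand-in for cogroo._preproc (an assumption, not the real code).

    Adds a space after any period that is directly followed by a non-space character.
    """
    return re.sub(r"\.(?=\S)", ". ", text)


original = "Aqui tem uma coisa antes.Refiro-me à trabalho remunerado."
processed = fake_preproc(original)

# Offset of the mistake as it would be reported against the processed text.
start = processed.index("à")

print(len(original), len(processed))     # 57 58
print(repr(processed[start:start + 1]))  # 'à'
print(repr(original[start:start + 1]))   # ' '
```

If the real pre-processing behaves like this, offsets computed on the processed text land one character too far to the right in the original text for everything after the inserted space.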

felipefcunica · 2022-09-17T21:17:52Z

I got a satisfactory result using the following function to run Cogroo4py with adjustFactor == 1:

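import re

def adjustedGrammarCheck(input:str,adjustFactor:int):
  # Relies on a module-level `cogroo = Cogroo()` instance.
  doc = cogroo.grammar_check(input)
  adjustedMistakes = []
  
  for error in doc.mistakes:
    # Text flagged by Cogroo, taken from the pre-processed text (doc.text)
    ocorr = doc.text[error.start:error.end]
    # Same text widened by adjustFactor characters on each side;
    # max() keeps the slice start from going negative and wrapping around
    ocorrContexto = doc.text[
        max(0, error.start-adjustFactor):
        error.end+adjustFactor]

    # Locate the context in the original input; re.escape treats
    # punctuation such as '.' or '?' as literal characters
    for finded in re.finditer(re.escape(ocorrContexto),input):
      FragmentStart = finded.start()
      FragmentEnd = finded.end()
      TrueFragment = input[FragmentStart:FragmentEnd]
      
      # Locate the flagged text inside the matched context
      for fragmentFinded in re.finditer(re.escape(ocorr),TrueFragment):
        TrueStart = fragmentFinded.start() + FragmentStart
        TrueEnd = fragmentFinded.end() + FragmentStart
        data = {
            'ocorrencia':input[TrueStart:TrueEnd],
            'errorMsg':error.short_msg,
            'rule':error.rule_id,
            'start':TrueStart,
            'end':TrueEnd}
        adjustedMistakes.append(data)
  
  return adjustedMistakes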

Perhaps we could implement a more sophisticated version of this in the library.
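
One possible direction, sketched here with only the standard library and not taken from cogroo4py: align the pre-processed text with the original using `difflib.SequenceMatcher` and translate offsets through that alignment, instead of searching for the fragment again. The helper name `build_offset_map` and the hard-coded `processed` string are assumptions made for illustration.

```python
from difflib import SequenceMatcher


def build_offset_map(original: str, processed: str) -> list[int]:
    """Map every index of `processed` (plus its end) to an index of `original`."""
    mapping = [0] * (len(processed) + 1)
    matcher = SequenceMatcher(None, processed, original, autojunk=False)
    for tag, p1, p2, o1, o2 in matcher.get_opcodes():
        for k in range(p1, p2):
            if tag == "equal":
                mapping[k] = o1 + (k - p1)
            else:
                # Characters added or changed by pre-processing:
                # clamp them to the matching span of the original.
                mapping[k] = min(o1 + (k - p1), o2)
    mapping[len(processed)] = len(original)
    return mapping


original = "Aqui tem uma coisa antes.Refiro-me à trabalho remunerado."
# Assumed result of pre-processing: one space added after the first period.
processed = "Aqui tem uma coisa antes. Refiro-me à trabalho remunerado."

offsets = build_offset_map(original, processed)
start, end = offsets[36], offsets[37]  # 36 and 37 are the values from OUTPUT2
print(start, end, repr(original[start:end]))
```

Unlike the regex search, this does not depend on the flagged fragment being unique in the text, at the cost of running a diff over the whole input.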

gpassero · 2022-09-19T15:54:43Z

Good catch! So this is a problem caused by the pre-processing of texts in https://github.com/gpassero/cogroo4py/blob/master/python/cogroo4py/cogroo.py#L219. If I remember right, bad things happened when this wasn't done but I can't remember exactly what. A comment would have been nice to justify this step.
