Incorrect text span start and end returned #49

dakinggg · 2019-11-06T01:50:32Z

Looks like something weird happening in this case, note that the indices of the second text span are incorrect:

>>> seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
>>> seg.segment("1) The first item. 2) The second item.")
[TextSpan(sent='1) The first item.', start=0, end=18), TextSpan(sent='2) The second item.', start=0, end=19)]
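
One quick way to catch this kind of bug is to check that slicing the original text by each span's offsets reproduces the sentence. The following is only a sketch: `TextSpan` here is a hypothetical stand-in namedtuple with the same fields as the output above (not pySBD's own class), and `find_bad_spans` is an illustrative helper name.

```python
from collections import namedtuple

# Hypothetical stand-in mirroring the fields printed in the report above;
# this is NOT pySBD's own TextSpan class.
TextSpan = namedtuple("TextSpan", ["sent", "start", "end"])


def find_bad_spans(text, spans):
    """Return the spans whose offsets do not slice back to their sentence."""
    return [s for s in spans if text[s.start:s.end] != s.sent]


text = "1) The first item. 2) The second item."

# The spans exactly as reported in this issue.
reported = [
    TextSpan("1) The first item.", 0, 18),
    TextSpan("2) The second item.", 0, 19),
]

# Only the second span fails the round-trip check.
bad = find_bad_spans(text, reported)
```

For this input, the second sentence actually occupies `text[19:38]`, so a span with `start=19, end=38` would pass the check.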


nipunsadvilkar · 2019-11-17T17:38:30Z

@danielkingai2 The character offset issue in #49 and #53 is due to \r not being handled in sentence_boundary_punctuation, so the correct TextSpan is not returned.

🚧WIP branch fix can be found here:

https://github.com/nipunsadvilkar/pySBD/tree/npn-carriage-return-fix

pySBD/pysbd/processor.py

Line 81 in 995af9e

text = self.text.replace('\r', "❦")

What I have been trying to do is add a ❦ identifier marking where a sentence should be split. Although the segmented sentences are correct, the offsets get shifted by the added ❦ identifiers.
The next challenge is that spaCy's Doc.char_span returns None if I include the trailing whitespace in the character offsets - explosion/spaCy#2637.
Example:

>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = 'a. The first item b. The second item c. The third list item'
>>> doc = nlp(text)

>>> text[0:17]
# 'a. The first item'

>>> text[0:18]
# 'a. The first item ' # Note whitespace here

>>> doc.char_span(0,17) # without whitespace offset
# 'a. The first item'

>>> doc.char_span(0,18) # with whitespace offset
# None

It would be nice if you could also think of a workaround or build on top of the WIP branch.

dakinggg · 2019-11-18T21:46:43Z

with respect to the whitespace thing, can't you just take character offsets after trimming trailing whitespace or something? like if the text ends in a whitespace, subtract one from the character offset or something

dakinggg · 2019-11-19T01:08:55Z

Ah I didn't totally understand what you were saying before, but I think I get it now. So, to rephrase, is it possible to just check the text spans you are returning, and if they start/end with whitespace, edit the start/end indices appropriately? Maybe it's ok if you state up front that the char span of a sentence explicitly does not include leading or trailing whitespace?
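
To make the suggestion concrete, here is a rough sketch of adjusting a span so that it excludes leading and trailing whitespace. `trim_span` is a hypothetical helper written for illustration, not something that exists in pySBD.

```python
def trim_span(text, start, end):
    """Shrink the half-open range [start, end) so it excludes
    leading and trailing whitespace in text."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = "a. The first item b. The second item c. The third list item"

first = trim_span(text, 0, 18)    # span that ended on a trailing space
second = trim_span(text, 18, 37)  # same for the second list item
third = trim_span(text, 37, 59)   # already tight, left unchanged
```

With the example text from earlier in the thread this gives `(0, 17)`, `(18, 36)` and `(37, 59)`, which are offsets that fall on token boundaries.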

nipunsadvilkar · 2019-11-19T17:00:01Z

Yes, a possible solution should account for the following:

Original input - a. The first item b. The second item c. The third list item
preprocessed - ❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item

non-strict pattern to get text within ❦ - r'[^❦]+'

"a∯ The first item ", 1, 19
"b∯ The second item ", 20, 39
"c∯ The third list item", 40, 62

Actual spans with whitespace:

"a∯ The first item ", 0, 18
"b∯ The second item ", 18, 37
"c∯ The third list item", 37, 59

spaCy char_span needed without whitespaces:

"a∯ The first item ", 0, 17
"b∯ The second item ", 18, 36
"c∯ The third list item", 37, 59

To recover the actual spans in the original input from the preprocessed text, one would need to successively subtract the N whitespace characters and also the number of times ❦ occurred before each sentence.
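
The arithmetic above can be sketched roughly as follows. This is an illustration under stated assumptions, not pySBD's implementation: it assumes each ❦ is a pure insertion (one extra character), that ∯ replaces "." one-for-one so it does not change offsets, and `spans_in_original` is a hypothetical helper name.

```python
import re

MARKER = "❦"


def spans_in_original(preprocessed, strip_whitespace=False):
    """Map the segments between markers back to offsets in the unmarked text."""
    spans = []
    for m in re.finditer(r"[^❦]+", preprocessed):
        # Every marker before this segment shifted it right by one character.
        shift = preprocessed.count(MARKER, 0, m.start())
        start, end = m.start() - shift, m.end() - shift
        if strip_whitespace:
            segment = m.group()
            if not segment.strip():
                continue  # skip whitespace-only segments
            start += len(segment) - len(segment.lstrip())
            end -= len(segment) - len(segment.rstrip())
        spans.append((start, end))
    return spans


preprocessed = "❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item"

with_ws = spans_in_original(preprocessed)
without_ws = spans_in_original(preprocessed, strip_whitespace=True)
```

On the example, `with_ws` matches the "actual spans with whitespace" listed above and `without_ws` matches the whitespace-free spans that spaCy's char_span needs.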

dakinggg · 2019-11-19T17:39:11Z

I don't quite understand, is the solution going to be at the level of the pysbd library? or the level of downstream uses of the pysbd library?

nipunsadvilkar · 2019-11-19T17:57:13Z

Should be within pySBD and getting appropriate TextSpan objects


dkarmon · 2020-02-04T06:41:46Z

Is there an update on when this issue will be resolved?

dakinggg mentioned this issue Nov 6, 2019

Use pysbd char_span functionality allenai/scispacy#178

Merged

nipunsadvilkar added the bug label Nov 16, 2019

nipunsadvilkar added a commit that referenced this issue Nov 17, 2019

🐛 List items offset fix

995af9e

Closes #49

nipunsadvilkar self-assigned this Nov 17, 2019

andrewhead mentioned this issue Mar 13, 2020

Fix up sentence splitting allenai/scholarphi#65

Closed

2 tasks

nipunsadvilkar mentioned this issue May 26, 2020

✨ 💫 sent char_span through with spaCy & regex & ♻️ Refactoring for more languages support #63

Merged

4 tasks

nipunsadvilkar closed this as completed in 68dc962 Jun 9, 2020
