Incorrect text span start and end returned #49

Closed
dakinggg opened this issue Nov 6, 2019 · 7 comments

@dakinggg
Contributor

dakinggg commented Nov 6, 2019

Something looks wrong in this case. Note that the start and end indices of the second text span are incorrect:

>>> import pysbd
>>> seg = pysbd.Segmenter(language='en', clean=False, char_span=True)
>>> seg.segment("1) The first item. 2) The second item.")
[TextSpan(sent='1) The first item.', start=0, end=18), TextSpan(sent='2) The second item.', start=0, end=19)]
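
For comparison, the expected offsets can be recomputed from the original string. This is a minimal sketch and not how pysbd computes spans: `locate_sentences` is a hypothetical helper, and it assumes each segmented sentence appears verbatim in the input, in order.

```python
def locate_sentences(text, sentences):
    """Return (sentence, start, end) tuples by searching the original text.

    Hypothetical helper, not part of pysbd. Assumes every sentence occurs
    verbatim in `text`, in the same order as in `sentences`.
    """
    spans = []
    cursor = 0  # the search resumes where the previous sentence ended
    for sent in sentences:
        start = text.index(sent, cursor)  # raises ValueError if not found
        end = start + len(sent)
        spans.append((sent, start, end))
        cursor = end
    return spans


text = "1) The first item. 2) The second item."
sentences = ["1) The first item.", "2) The second item."]
for span in locate_sentences(text, sentences):
    print(span)
```

Under these assumptions the second span comes out as start=19, end=38, rather than the start=0, end=19 reported above.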
@nipunsadvilkar
Owner

@danielkingai2 The character offset issues in #49 and #53 are due to \r not being handled in sentence_boundary_punctuation, so the correct TextSpan is not returned.

🚧 A WIP branch with the fix can be found here:

https://github.com/nipunsadvilkar/pySBD/tree/npn-carriage-return-fix

text = self.text.replace('\r', "❦")

What I have been trying to do is add an identifier that marks where the text should be split. Although the segmented sentences are correct, the offsets shift because of the added identifier.
The next challenge is that spaCy's Doc.char_span returns None if I include whitespace in the character offsets (see explosion/spaCy#2637).
Example:

>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = 'a. The first item b. The second item c. The third list item'
>>> doc = nlp(text)

>>> text[0:17]
# 'a. The first item'

>>> text[0:18]
# 'a. The first item ' # Note the trailing whitespace here

>>> doc.char_span(0,17) # without whitespace offset
# 'a. The first item'

>>> doc.char_span(0,18) # with whitespace offset
# None

It would be great if you could also think of a workaround or build on top of the WIP branch.

nipunsadvilkar self-assigned this Nov 17, 2019
@dakinggg
Contributor Author

With respect to the whitespace issue, can't you just take the character offsets after trimming trailing whitespace? For example, if the text ends in a whitespace character, subtract one from the end offset.

@dakinggg
Contributor Author

dakinggg commented Nov 19, 2019

Ah, I didn't totally understand what you were saying before, but I think I get it now. So, to rephrase: is it possible to just check the text spans you are returning and, if they start or end with whitespace, adjust the start/end indices accordingly? Maybe it's OK if you state up front that the char span of a sentence explicitly does not include leading or trailing whitespace?
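
A minimal sketch of that adjustment, assuming spans are half-open [start, end) offsets into the original text. `trim_span` is a hypothetical helper name, not a pysbd or spaCy function.

```python
def trim_span(text, start, end):
    """Shrink the half-open span [start, end) so that it excludes
    leading and trailing whitespace. Hypothetical helper."""
    while start < end and text[start].isspace():
        start += 1
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = 'a. The first item b. The second item c. The third list item'
print(trim_span(text, 0, 18))   # (0, 17)
print(trim_span(text, 18, 37))  # (18, 36)
print(trim_span(text, 37, 59))  # (37, 59)
```

The trimmed offsets end on the last non-whitespace character of each sentence, which is the form that worked with doc.char_span in the example above.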

@nipunsadvilkar
Owner

nipunsadvilkar commented Nov 19, 2019

Yes, a possible solution should account for the following:

Original input - a. The first item b. The second item c. The third list item
preprocessed - ❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item

Matches of the non-strict pattern r'[^❦]+' on the preprocessed text (text between identifiers):

"a∯ The first item ", 1, 19
"b∯ The second item ", 20, 39
"c∯ The third list item", 40, 62

Actual spans with whitespace:

"a∯ The first item ", 0, 18
"b∯ The second item ", 18, 37
"c∯ The third list item", 37, 59

Spans needed for spaCy char_span, without trailing whitespace:

"a∯ The first item", 0, 17
"b∯ The second item", 18, 36
"c∯ The third list item", 37, 59

To recover the actual spans of the original input from the preprocessed text, one would need to successively subtract the N whitespace characters and also the number of identifiers that occurred before each span.
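
The identifier part of that arithmetic can be sketched as follows. This is an illustration under stated assumptions, not the code in the WIP branch: it assumes each ❦ is inserted as a single extra character and that ∯ replaces "." one-to-one, so only the identifiers shift the offsets. `original_spans` is a hypothetical helper name.

```python
import re

MARKER = "❦"  # split identifier from the example above


def original_spans(preprocessed):
    """Map matches in the preprocessed text back to offsets in the
    original input. Hypothetical helper, not part of pysbd."""
    spans = []
    for match in re.finditer(f"[^{MARKER}]+", preprocessed):
        # Every marker up to the match shifted it right by one character.
        shift = preprocessed.count(MARKER, 0, match.start())
        spans.append((match.group(), match.start() - shift, match.end() - shift))
    return spans


preprocessed = "❦a∯ The first item ❦b∯ The second item ❦c∯ The third list item"
for sent, start, end in original_spans(preprocessed):
    print(repr(sent), start, end)
```

For this input the helper yields (0, 18), (18, 37) and (37, 59), matching the actual spans with whitespace listed above. Trimming trailing whitespace from each span would then give the offsets needed for char_span.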

@dakinggg
Contributor Author

I don't quite understand. Is the solution going to be at the level of the pysbd library, or at the level of downstream uses of the pysbd library?

@nipunsadvilkar
Owner

nipunsadvilkar commented Nov 19, 2019 via email

@dkarmon

dkarmon commented Feb 4, 2020

Is there an update on when this issue will be resolved?
