Incorrect text span start and end returned #49
Comments
@danielkingai2 #49 and #53 character offset issue is due not 🚧WIP branch fix can be found here: https://github.com/nipunsadvilkar/pySBD/tree/npn-carriage-return-fix Line 81 in 995af9e
What I have been trying to do is, adding
>>> import spacy
>>> nlp = spacy.blank("en")
>>> text = 'a. The first item b. The second item c. The third list item'
>>> doc = nlp(text)
>>> text[0:17]
# 'a. The first item'
>>> text[0:18]
# 'a. The first item ' # Note whitespace here
>>> doc.char_span(0,17) # without whitespace offset
# 'a. The first item'
>>> doc.char_span(0,18) # with whitespace offset
# None
It would be nice if you could also think of a workaround, or of a way to build on top of the WIP branch.
With respect to the whitespace thing, can't you just take character offsets after trimming trailing whitespace or something? Like, if the text ends in whitespace, subtract one from the character offset.
Ah, I didn't totally understand what you were saying before, but I think I get it now. So, to rephrase: is it possible to just check the text spans you are returning and, if they start or end with whitespace, adjust the start/end indices accordingly? Maybe it's OK if you state up front that the char span of a sentence explicitly does not include leading or trailing whitespace?
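The adjustment suggested above could look roughly like the sketch below. `trim_span` is a hypothetical helper written for illustration, not part of pySBD or spaCy; it assumes spans are half-open `[start, end)` character offsets into the original text.

```python
def trim_span(text, start, end):
    """Shrink the half-open span [start, end) so it excludes
    leading and trailing whitespace. Hypothetical helper, not pySBD API."""
    # Move the start forward past any leading whitespace.
    while start < end and text[start].isspace():
        start += 1
    # Move the end backward past any trailing whitespace.
    while end > start and text[end - 1].isspace():
        end -= 1
    return start, end


text = 'a. The first item b. The second item c. The third list item'
print(trim_span(text, 0, 18))   # (0, 17)
print(trim_span(text, 17, 37))  # (18, 36)
```

With offsets trimmed this way, `(0, 17)` is the span that `doc.char_span` accepted in the example earlier in the thread, whereas `(0, 18)` returned `None`.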
Yes, a possible solution should account for the following: Original input - non-strict pattern to get text within
Actual spans with whitespace:
spaCy char_span needed without whitespaces:
To get the actual spans of the original input out of the preprocessed text, one would need to successively subtract N whitespaces and also the number of times
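One alternative to subtracting whitespace counts is to look each sentence up in the original text, moving a cursor forward after every match. The sketch below is only an illustration under a strong assumption: every segmented sentence, once stripped, still occurs verbatim in the original input, which will not hold if preprocessing rewrites characters. `SentenceSpan` and `locate_sentences` are hypothetical names, not pySBD's `TextSpan` or any pySBD function.

```python
from typing import List, NamedTuple


class SentenceSpan(NamedTuple):
    # Hypothetical stand-in for illustration; not pySBD's TextSpan.
    sent: str
    start: int
    end: int


def locate_sentences(text: str, sentences: List[str]) -> List[SentenceSpan]:
    """Return whitespace-free character spans of each sentence in `text`.

    Assumes each stripped sentence appears verbatim in `text`, in order.
    """
    spans = []
    cursor = 0  # never search before the end of the previous sentence
    for sent in sentences:
        stripped = sent.strip()
        start = text.find(stripped, cursor)
        if start == -1:
            raise ValueError(f"sentence not found in original text: {stripped!r}")
        end = start + len(stripped)
        spans.append(SentenceSpan(stripped, start, end))
        cursor = end
    return spans


text = 'a. The first item b. The second item c. The third list item'
sentences = ['a. The first item ', 'b. The second item ', 'c. The third list item']
for span in locate_sentences(text, sentences):
    print(span.start, span.end, repr(text[span.start:span.end]))
```

Because the spans exclude surrounding whitespace, each one ends where a word ends, as in the `doc.char_span(0,17)` case shown earlier.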
I don't quite understand: is the solution going to be at the level of the pysbd library, or at the level of downstream uses of the pysbd library?
It should be within pySBD, which should return appropriate TextSpan objects.
…On Tue, Nov 19, 2019, 11:09 PM Daniel King ***@***.***> wrote:
I don't quite understand, is the solution going to be at the level of the
pysbd library? or the level of downstream uses of the pysbd library?
Is there an update on when this issue will be resolved?
Looks like something weird is happening in this case; note that the indices of the second text span are incorrect: