cannot annotate across a sentence break #786
Comments
For sentence breaks created by the brat sentence splitter, the easiest way to do this is to switch off sentence splitting. For "hard" sentence breaks (newlines in the source text), it is a bit more difficult: the brat standoff format (http://brat.nlplab.org/standoff.html) is line-oriented and incorporates the annotated text in the .ann files, so newline characters in the source text would have to be escaped. This is currently not done. For these cases, it would be easiest to replace newline characters with spaces in the source text (if possible). |
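A minimal sketch of the newline-to-space workaround suggested above, assuming the source text can be modified at all. The function name `flatten_newlines` is hypothetical and not part of brat. The point it illustrates is that a one-for-one character substitution keeps every existing offset valid.

```python
def flatten_newlines(text: str) -> str:
    """Replace each newline character with exactly one space."""
    # One character in, one character out: every other character keeps its
    # index, so standoff offsets computed on the original text stay valid.
    # "\r" is handled too, so a "\r\n" pair becomes two spaces, not one.
    return text.replace("\r", " ").replace("\n", " ")


source = "aspirin and\nparacetamol"
flat = flatten_newlines(source)

# The length is unchanged, so a span such as (0, 23) still covers the same text.
assert len(flat) == len(source)
print(flat)
```

This only helps where the line structure of the source carries no meaning that has to be preserved.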
@spyysalo: I have a feeling this question may arise in the future as well. Could you add this to the FAQ and then close the issue? |
@ninjin : I believe @amadanmath had some ideas on how to permit this if necessary using the mechanisms that were introduced for discontinuous annotations. I wouldn't want to close this without a resolution. |
@spyysalo: Um... you mean to treat each sentence as a separate component of an annotation? Pardon my French, but that is a f*ckin' ugly hack. |
Why? It would work. Having sentences as display-only fragments is not nearly as horrible as trying to fragment things so that they fit on the screen (which would be way more work, and an ugly hack to boot). |
@amadanmath: The problem is with the format IMHO, the whole idea of the "comment" portion of it falls apart. I know it was well-intended but it shouldn't have been there. |
Ah. Indeed. Also, that's so not a comment, if it merely duplicates the text segment (and enforces the identity!) |
With discontinuous annotations, the newline character does not need to be part of the span. I thought that was the trick. Yeah, it's not pretty, but it would allow us to resolve this issue. |
@spyysalo: It isn't as ugly as it could have been. But lesson learnt, formats should store the information, sanity should go elsewhere. Supporting go-ahead with the "hack". |
Indeed, if it's "real" discontinuous annotation, the newline wouldn't be a part of the span. However, then that's none of my business: rather, you need to do server-side pre-processing to generate the annotation file with discontinuities, and the file should render correctly without any changes to the client-side code. What I meant was to split a span by sentences into fragments "on the fly" before rendering, in which case no pre-processing would be required. But in this case, Pontus's complaint stands. |
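The server-side splitting discussed above could look roughly like the following sketch. It is not the brat implementation: `split_span_at_newlines` and `format_textbound` are hypothetical names, and the output line assumes the discontinuous text-bound layout from the standoff documentation (fragment offsets joined with `;`, fragment texts joined with a single space).

```python
from typing import List, Tuple


def split_span_at_newlines(text: str, start: int, end: int) -> List[Tuple[int, int]]:
    """Split the span [start, end) into fragments that contain no newline."""
    fragments = []
    frag_start = None
    for i in range(start, end):
        if text[i] == "\n":
            # Close the current fragment just before the newline.
            if frag_start is not None:
                fragments.append((frag_start, i))
                frag_start = None
        elif frag_start is None:
            frag_start = i
    if frag_start is not None:
        fragments.append((frag_start, end))
    return fragments


def format_textbound(ann_id: str, ann_type: str, text: str,
                     fragments: List[Tuple[int, int]]) -> str:
    """Render one text-bound annotation line with discontinuous offsets."""
    offsets = ";".join(f"{s} {e}" for s, e in fragments)
    covered = " ".join(text[s:e] for s, e in fragments)
    return f"{ann_id}\t{ann_type} {offsets}\t{covered}"


text = "sodium\nchloride solution"
frags = split_span_at_newlines(text, 0, 15)
print(format_textbound("T1", "Chemical", text, frags))
```

Because the newline falls between fragments rather than inside one, the text stored in the annotation line never contains a newline and the line-oriented format stays intact.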
OK, gotcha. Longer-term, we'll probably need to adjust the storage format to also allow newlines (and probably tabs, while we're at it). Standard C-style escaping would do the job. For now, splitting into discontinuous spans server-side should do. Shall we add an option for whether to allow this? |
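For illustration, C-style escaping of the text field might be sketched as below. This is an assumption about what the longer-term format change could look like, not something the standoff format currently defines; `escape_field` and `unescape_field` are hypothetical names.

```python
# Backslash must be escaped too, otherwise a literal "\n" in the source text
# (backslash followed by "n") could not be told apart from an escaped newline.
_ESCAPES = {"\\": "\\\\", "\n": "\\n", "\t": "\\t"}
_UNESCAPES = {"\\": "\\", "n": "\n", "t": "\t"}


def escape_field(s: str) -> str:
    """Escape backslash, newline and tab so the result fits on one line."""
    return "".join(_ESCAPES.get(ch, ch) for ch in s)


def unescape_field(s: str) -> str:
    """Reverse escape_field; unknown escape sequences are kept unchanged."""
    out = []
    chars = iter(s)
    for ch in chars:
        if ch == "\\":
            nxt = next(chars, "")
            out.append(_UNESCAPES.get(nxt, "\\" + nxt))
        else:
            out.append(ch)
    return "".join(out)


original = "first line\nsecond\tline"
stored = escape_field(original)
assert "\n" not in stored and "\t" not in stored
assert unescape_field(stored) == original
```

With this kind of escaping, the stored text stays on a single line and still round-trips to the exact span text.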
Opened #819 for the longer-term solution. |
With all respect, I totally disagree. We are annotating chemicals in OCR'd patents. Annotating across sentences is really important in our case. |
@akhondi : thanks for the input -- could you please clarify which part you disagree with? |
Was any of this implemented in the end? I'm generating json annotation structures from text and some of the annotations often cross sentence boundaries. My use case is a language detection task, where I have one entity type per language, and large sections of text may be a single entity. Util.embed doesn't like this at all, and collapses everything into a single, thin line (see below). I tried to use discontinuous annotations with the individual token indices as a workaround, but the results weren't visually pleasing. Right now the only solution seems to be to manually split across sentence boundaries. |
@pflaquerre: No, this issue isn't resolved yet. I have just had a discussion with @spyysalo regarding how we resolve this, we have a resolution in the pipeline that will hopefully reach Assigning to @amadanmath, once you have removed the client blocking annotation across sentence breaks, assign to me for the back-end implementation. Although late, this one is going into v1.3. |
Removed. There is a strange thing where the post-edit displays a newline immediately after the span, and a reload kills the newline completely. It works without modification on sentence breaks that were introduced by the sentence splitter; I did not dare try to annotate across a hard LF. :p |
@amadanmath: Thanks, I'll harass you about the whole post-edit thing on the IM and get cracking with the back-end. |
As of fdd275a you can annotate across newlines, both hard ones from your text and from the built-in sentence splitter. Good job team! Closing! File new issues if there are bugs, hopefully there are none. |
Not bugs as such, but a concern: the sentence numbering will be different, and as a result sentence annotations and sentence links become unreliable. |
@amadanmath: good point. Now that you mention it, this issue isn't entirely new, as e.g. switching the sentence splitter off, adding an annotation crossing a "soft" newline otherwise (i.e. from a tagger), or upgrading to a version fixing some sentence splitter errors would cause the same unreliability. Perhaps we should anchor sentence annotations to offsets rather than index them by whatever the sentence splitting algorithm happens to do? Open a new issue? |
Sure, go ahead. It might not be that hard to resolve, if you pack a sentence identifier together with sentence offsets array. (Offsets are not that good an idea since they're meaningless to the user.) |
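One possible reading of the two suggestions above, as a rough sketch: store a character offset as the stable anchor of a sentence annotation, and derive the sentence number shown to the user from the current sentence offsets array. The line-based splitter and both function names are hypothetical stand-ins, not brat code.

```python
from bisect import bisect_right
from typing import List, Tuple


def sentence_offsets(text: str) -> List[Tuple[int, int]]:
    """Stand-in splitter: treat every line as one sentence."""
    offsets, pos = [], 0
    for line in text.split("\n"):
        offsets.append((pos, pos + len(line)))
        pos += len(line) + 1  # skip the newline itself
    return offsets


def sentence_index_for(anchor: int, offsets: List[Tuple[int, int]]) -> int:
    """Return the 0-based index of the last sentence starting at or before anchor."""
    starts = [start for start, _ in offsets]
    return bisect_right(starts, anchor) - 1


before = "First line.\nSecond line.\nThird."
after = "First line. Second line.\nThird."  # first break removed, same length

anchor = 25  # offset where "Third." starts in both versions
print(sentence_index_for(anchor, sentence_offsets(before)))
print(sentence_index_for(anchor, sentence_offsets(after)))
```

The stored anchor stays valid when the splitting changes, while the number displayed to the user is recomputed from whatever the splitter currently produces.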
Is there a way to allow annotation across a sentence break? Maybe you could point me to the relevant script.