cannot annotate across a sentence break #786

Closed
akhondi opened this issue May 25, 2012 · 23 comments

Comments

@akhondi

akhondi commented May 25, 2012

Is it possible to allow annotating across a sentence break? Maybe you could point me to the relevant script.

@ghost ghost assigned spyysalo May 28, 2012
@spyysalo
Member

For sentence breaks created by the brat sentence splitter, this can most easily be done by switching off the sentence splitting.

For "hard" sentence breaks (newlines in the source text), this is a bit more difficult. The brat standoff format (http://brat.nlplab.org/standoff.html) is line-oriented and incorporates the annotated text in the .ann files, so newline characters in the source text would have to be escaped, which is currently not done. For these cases, it would be easiest to replace newline characters with spaces in the source text (if possible).
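A minimal sketch of that workaround, assuming the source text can be modified before it is loaded for annotation. The helper name `flatten_newlines` is hypothetical and not part of brat. Each newline character is replaced one-for-one, so the text length and all character offsets stay the same.

```python
def flatten_newlines(text: str) -> str:
    """Replace each newline character with a single space.

    Every '\n' and '\r' is swapped one-for-one, so the result has the
    same length as the input and existing character offsets stay valid.
    """
    return text.translate({ord("\n"): " ", ord("\r"): " "})


# Example input with a hard LF inside a chemical name and a CRLF between sentences.
source = "2-chloro-\n4-nitrophenol was added.\r\nThe mixture was stirred."
flat = flatten_newlines(source)
```

Note that a CRLF pair becomes two spaces here. Collapsing it to one space would shift every later offset, which matters if annotations already exist for the document.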

@ghost

ghost commented Jun 28, 2012

@spyysalo: I have a feeling this question may come up again in the future. Could you add this to the FAQ and then close the issue?

@spyysalo
Member

@ninjin : I believe @amadanmath had some ideas on how to permit this if necessary using the mechanisms that were introduced for discontinuous annotations. I wouldn't want to close this without a resolution.

@ghost

ghost commented Jul 3, 2012

@spyysalo: Um... you mean to treat each sentence as a separate component of an annotation? Pardon my French, but that is a f*ckin' ugly hack.

@amadanmath
Contributor

Why? It would work. Having sentences as display-only fragments is not nearly as horrible as trying to fragment things so that they fit on the screen (which would be way more work, and an ugly hack to boot).

@ghost

ghost commented Jul 3, 2012

@amadanmath: The problem is with the format IMHO, the whole idea of the "comment" portion of it falls apart. I know it was well-intended but it shouldn't have been there.

@amadanmath
Contributor

Ah. Indeed. Also, that's so not a comment, if it merely duplicates the text segment (and enforces the identity!)

@spyysalo
Member

spyysalo commented Jul 3, 2012

With discontinuous annotations, the newline character does not need to be part of the span. I thought that was the trick.

Yeah, it's not pretty, but it would allow us to resolve this issue.

@ghost

ghost commented Jul 3, 2012

@spyysalo: It isn't as ugly as it could have been. But lesson learnt, formats should store the information, sanity should go elsewhere. Supporting go-ahead with the "hack".

@amadanmath
Contributor

Indeed, if it's "real" discontinuous annotation, the newline wouldn't be a part of the span. However, then that's none of my business - rather, you need to do server-side pre-processing to generate the annotation file with discontinuities, and the file should render correctly without any intervention into clientside code.

What I meant is to split a span by sentences into fragments "on the fly" before rendering, in which case no pre-processing would be required. But in this case, Pontus's complaint stands.

@spyysalo
Member

spyysalo commented Jul 3, 2012

OK, gotcha. Longer-term, we'll probably need to adjust the storage format to also allow newlines (and probably tabs, while we're at it). Standard C-style escaping would do the job.

For now, splitting into discontinuous spans server-side should do. Shall we add an option for whether to allow this?
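The server-side splitting could be sketched roughly as follows. This is an illustration only, not the code that went into brat: the function names are hypothetical, and the output line assumes the discontinuous text-bound layout described in the standoff documentation (offset pairs joined by `;`, fragment texts joined by a single space).

```python
from typing import List, Tuple

Span = Tuple[int, int]


def split_span_at_newlines(text: str, start: int, end: int) -> List[Span]:
    """Split [start, end) into fragments that contain no newline characters."""
    fragments: List[Span] = []
    frag_start = None
    for i in range(start, end):
        if text[i] in "\r\n":
            # Close the current fragment just before the newline.
            if frag_start is not None:
                fragments.append((frag_start, i))
                frag_start = None
        elif frag_start is None:
            frag_start = i
    if frag_start is not None:
        fragments.append((frag_start, end))
    return fragments


def textbound_line(ann_id: str, ann_type: str, text: str,
                   fragments: List[Span]) -> str:
    """Format a text-bound annotation line (assumed standoff layout)."""
    offsets = ";".join(f"{s} {e}" for s, e in fragments)
    covered = " ".join(text[s:e] for s, e in fragments)
    return f"{ann_id}\t{ann_type} {offsets}\t{covered}"


text = "2-chloro-\n4-nitrophenol was added."
frags = split_span_at_newlines(text, 0, 23)
line = textbound_line("T1", "Chemical", text, frags)
```

Because the newline falls between the fragments rather than inside one, the resulting line contains no raw newline and the line-oriented format stays intact.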

@spyysalo
Member

spyysalo commented Jul 3, 2012

Opened #819 for the longer-term solution.
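For reference, the C-style escaping mentioned for the longer-term solution might look something like the sketch below. It is hypothetical: the thread does not say what #819 ended up implementing, and the function names are made up for illustration.

```python
def escape_field(value: str) -> str:
    """Escape backslash, newline and tab so the value fits on one line."""
    # Escape the backslash first so the escapes added below are not doubled.
    return (value.replace("\\", "\\\\")
                 .replace("\n", "\\n")
                 .replace("\t", "\\t"))


def unescape_field(value: str) -> str:
    """Reverse escape_field by scanning left to right."""
    mapping = {"n": "\n", "t": "\t", "\\": "\\"}
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\\" and i + 1 < len(value) and value[i + 1] in mapping:
            out.append(mapping[value[i + 1]])
            i += 2
        else:
            out.append(ch)
            i += 1
    return "".join(out)


raw = "a\tb\nc\\n"
stored = escape_field(raw)
restored = unescape_field(stored)
```

Unescaping scans character by character instead of chaining `replace` calls, since chained replacements would misread an escaped backslash followed by a literal `n`.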

@akhondi
Author

akhondi commented Jul 3, 2012

With all respect, I totally disagree. We are annotating chemicals in OCR-processed patents. Annotating across sentences is really important in our case.
 

@spyysalo
Member

spyysalo commented Jul 3, 2012

@akhondi : thanks for the input -- could you please clarify which part you disagree with?

@ghost

ghost commented Jul 5, 2012

@spyysalo: I think it is the cross-sentence part.

@akhondi: What kind of annotations do you make to the patents? Is it something like entities, events or maybe section marking?

@pflaquerre
Contributor

Was any of this implemented in the end? I'm generating JSON annotation structures from text, and some of the annotations cross sentence boundaries. My use case is a language detection task, where I have one entity type per language, and large sections of text may be a single entity.

Util.embed doesn't like this at all, and collapses everything into a single, thin line (see below). I tried to use discontinuous annotations with the individual token indices as a workaround, but the results weren't visually pleasing. Right now the only solution seems to be to manually split across sentence boundaries.

[Screenshot: the annotation collapsed into a single, thin line]

@ghost

ghost commented Oct 3, 2012

@pflaquerre: No, this issue isn't resolved yet. I have just had a discussion with @spyysalo about how to resolve it, and we have a fix in the pipeline that will hopefully reach master very soon. I'll close the issue when that happens. We may have dragged our feet a little too long on this one; the resolution may be simpler than we thought.

Assigning to @amadanmath. Once you have removed the client-side check that blocks annotation across sentence breaks, assign it to me for the back-end implementation. Although late, this one is going into v1.3.

@amadanmath
Contributor

Removed. There is a strange thing where the post-edit displays a newline immediately after the span, and a reload kills the newline completely.

It works without modification on sentence breaks that were introduced by the sentence splitter; I did not dare try to annotate a hard LF. :p

@ghost ghost self-assigned this Oct 3, 2012
@ghost

ghost commented Oct 3, 2012

@amadanmath: Thanks, I'll harass you about the whole post-edit thing on the IM and get cracking with the back-end.

@ghost

ghost commented Oct 3, 2012

As of fdd275a you can annotate across newlines, both hard ones from your text and from the built-in sentence splitter. Good job team! Closing! File new issues if there are bugs, hopefully there are none.

@ghost ghost closed this as completed Oct 3, 2012
@amadanmath
Contributor

Not bugs as such, but a concern: the sentence numbering will be different, and as a result sentence annotations and sentence links become unreliable.

@spyysalo
Member

spyysalo commented Oct 4, 2012

@amadanmath: good point. Now that you mention it, this issue isn't entirely new, as e.g. switching the sentence splitter off, adding an annotation crossing a "soft" newline otherwise (i.e. from a tagger), or upgrading to a version fixing some sentence splitter errors would cause the same unreliability.

Perhaps we should anchor sentence annotations to offsets rather than index them by whatever the sentence splitting algorithm happens to do? Open a new issue?

@amadanmath
Contributor

Sure, go ahead. It might not be that hard to resolve, if you pack a sentence identifier together with the sentence offsets array. (Offsets alone are not that good an idea, since they're meaningless to the user.)
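One way to picture the idea discussed above: keep an offset as the stable anchor for a sentence annotation and derive the user-facing sentence number from the current array of sentence start offsets. This is a sketch under assumptions, not brat's implementation. The function names are hypothetical, and sentences are split on hard newlines only to keep the example short.

```python
from bisect import bisect_right
from typing import List


def sentence_starts(text: str) -> List[int]:
    """Start offset of each sentence, splitting on hard newlines only."""
    starts = [0]
    for i, ch in enumerate(text):
        if ch == "\n" and i + 1 < len(text):
            starts.append(i + 1)
    return starts


def sentence_index(starts: List[int], offset: int) -> int:
    """Zero-based index of the sentence that contains the given offset."""
    return bisect_right(starts, offset) - 1


text = "First.\nSecond.\nThird."
starts = sentence_starts(text)

# A sentence annotation anchored at offset 15 (the start of "Third.").
anchor = 15
before = sentence_index(starts, anchor)

# Suppose an annotation now crosses the first break, merging two sentences.
merged_starts = [0, 15]
after = sentence_index(merged_starts, anchor)
```

The anchor offset stays valid when the splitting changes, while the displayed sentence number is recomputed, which is the unreliability the index-based scheme runs into.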

This issue was closed.