The Changing Content Problem #115

nickstenning · 2012-03-27T16:56:53Z

So, I've been thinking about this, and it really can't be that hard to get an 80% case working. As I see it, there are two scenarios we have to worry about.

1. It's possible to heuristically determine the new location of the annotation. The exact details of the heuristic don't really matter, and can be encapsulated in a plugin, but the simplest possible routine would just search the document for the contents of the annotation's quote field, and if it found an exact match, it would update the annotation's ranges and save it back to the server.
2. It's not possible: all heuristics have failed. Still, this doesn't really matter if we implement a Memento timegate (see http://www.mementoweb.org/). We can simply send a request to the timegate for the current URL with an Accept-Datetime header set to the value of the annotation's updated field, and if we get a non-empty list of links back we can pop up a message to the user along the lines of "hey, this page has changed since some annotations were made, [click here] to see them and links to historical versions of this page."
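
As a rough illustration of the "simplest possible routine" in the first scenario, here is a minimal Python sketch. It is not Annotator code: plain character offsets stand in for Annotator's real range objects, the function name `find_exact_quote` is made up for this example, and treating a repeated quote as a failure is an added assumption (the text above only says "an exact match").

```python
from typing import Optional, Tuple


def find_exact_quote(document_text: str, quote: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of `quote`, or None.

    Assumption: the match only counts if the quote occurs exactly once,
    because a repeated quote cannot tell us which occurrence was annotated.
    """
    if not quote:
        return None
    start = document_text.find(quote)
    if start == -1:
        return None  # heuristic failed: this is the second scenario
    if document_text.find(quote, start + 1) != -1:
        return None  # ambiguous: the quote appears more than once
    return start, start + len(quote)


# The page gained a new lead paragraph, so the old offsets are stale.
new_page = "A new lead paragraph. Intro. The quick brown fox jumps."
annotation = {"quote": "quick brown fox", "ranges": [(11, 26)]}

match = find_exact_quote(new_page, annotation["quote"])
if match is not None:
    annotation["ranges"] = [match]  # would then be saved back to the server
```

A real plugin would have to map the offsets back to DOM ranges, which this sketch leaves out.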

Can someone tell me if I've missed anything obvious?

Really, all we need is a plugin to encapsulate the dumb heuristics side of this (which would potentially include flagging the annotations in the UI as "I tried to automatically reposition myself" and only actually saving them if a human confirms that they make sense) and a plugin to talk to a timegate and display appropriate UI.
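
For the timegate side, the request itself is small. The sketch below only builds the URL and the Accept-Datetime header (an HTTP-date in GMT) from an annotation's updated timestamp; it does not send anything. The endpoint `http://timegate.example.org/timegate/` and the convention of appending the page URL to it are assumptions for illustration, as is the helper name `timegate_request`.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Hypothetical timegate endpoint; a real deployment would supply its own.
TIMEGATE_BASE = "http://timegate.example.org/timegate/"


def timegate_request(page_url: str, updated: str) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for a timegate lookup.

    `updated` is the annotation's updated field as an ISO 8601 string.
    """
    moment = datetime.fromisoformat(updated.replace("Z", "+00:00"))
    if moment.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        moment = moment.replace(tzinfo=timezone.utc)
    moment = moment.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(moment, usegmt=True)}
    return TIMEGATE_BASE + page_url, headers


url, headers = timegate_request("http://example.com/article", "2012-03-27T16:56:53Z")
print(url)
print(headers["Accept-Datetime"])  # Tue, 27 Mar 2012 16:56:53 GMT
```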

philipn · 2012-03-27T17:34:57Z

One case that's not covered here is when there are some successes and some failures. I guess showing the successes and also showing a link to the most recent stored, annotated page would work, but there could be lots of hidden annotations - it could span back much further than one revision?

nickstenning · 2012-03-27T20:09:49Z

A fair point if you interpreted what I wrote above as an "all-or-nothing" attempt for all the annotations on the page, but I think it's pretty clear that the problem you describe goes away if you treat this as a per-annotation algorithm.

In fact, there's a pretty fundamental problem with what I've described above, namely that the timegate returns a link to the Memento in the form of an HTTP Location header on a redirect, which browsers are obliged to follow transparently in XMLHttpRequest calls, so the script never gets to see the link itself.

So, this suggests a required but much nicer UI: if any of the annotations in the page completely fail to load, we display a notification somewhere saying "Some annotations couldn't be loaded on this page, because it's changed since they were added. [Load this page in the history viewer] to see them". That would be a link to a MementoFox/Etherpad-style time-slider viewer hosted on AnnotateIt. This solves a number of problems at once, such as how to reinstantiate the annotator in the Memento, how to know which URL to search the annotator store for, and so on.
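
To make the per-annotation idea concrete, here is a minimal Python sketch that tries each annotation on its own and only raises the notification if at least one fails. The function names, the dictionary shape of an annotation, and the use of character offsets are all assumptions for illustration; the locator is pluggable so that any heuristic could be swapped in.

```python
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]

NOTICE = ("Some annotations couldn't be loaded on this page, because it's "
          "changed since they were added.")


def exact_locate(page_text: str, quote: str) -> Optional[Span]:
    """Simplest possible locator: first exact occurrence of the quote."""
    start = page_text.find(quote) if quote else -1
    return None if start == -1 else (start, start + len(quote))


def load_annotations(
    annotations: List[Dict],
    page_text: str,
    locate: Callable[[str, str], Optional[Span]] = exact_locate,
) -> Tuple[List[Dict], List[Dict], Optional[str]]:
    """Anchor each annotation independently; collect the ones that fail."""
    anchored, orphaned = [], []
    for annotation in annotations:
        span = locate(page_text, annotation["quote"])
        if span is None:
            orphaned.append(annotation)
        else:
            anchored.append({**annotation, "ranges": [span]})
    notice = NOTICE if orphaned else None
    return anchored, orphaned, notice


page = "The cat sat on the mat."
annotations = [{"id": 1, "quote": "cat sat"}, {"id": 2, "quote": "dog barked"}]
anchored, orphaned, notice = load_annotations(annotations, page)
```

Here annotation 1 is displayed as usual, annotation 2 is held back, and the notice would carry the link to the history viewer.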

rufuspollock · 2012-03-28T20:03:53Z

This was my contribution to the debate (on-list): http://lists.okfn.org/pipermail/annotator-dev/2012-March/000263.html

nickstenning · 2012-03-30T10:58:16Z

Some links of interest from the Annotator community call:

http://arxiv.org/abs/1003.2643 -- Rob Sanderson's paper about persistent web annotation and Memento
http://www9.org/w9cdrom/312/312.html -- Robust Intra-document locations
http://www.openannotation.org/spec/core/#Setup -- modeling in OA

nickstenning · 2012-04-04T17:15:02Z

Noting another heuristic approach to this taken by @donohoe for Emphasis.js. See the blog post for details, but the principle is to identify paragraphs of text in a page by the first letters of the first N words of the first and last sentences of a paragraph. So this paragraph would be NahStp.
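
The key scheme described above can be sketched in a few lines of Python. This follows the principle as stated in this comment, not the actual Emphasis.js source: the naive sentence splitter, the function names, and the default of N = 3 (inferred from the six-letter example) are assumptions.

```python
import re
from typing import List


def split_sentences(paragraph: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def first_letters(sentence: str, n: int) -> str:
    """First letter of each of the first n words of a sentence."""
    return "".join(word[0] for word in sentence.split()[:n])


def paragraph_key(paragraph: str, n: int = 3) -> str:
    """Key built from the first and last sentences of a paragraph."""
    sentences = split_sentences(paragraph)
    if not sentences:
        return ""
    return first_letters(sentences[0], n) + first_letters(sentences[-1], n)


paragraph = (
    "Noting another heuristic approach to this taken by @donohoe for "
    "Emphasis.js. See the blog post for details, but the principle is to "
    "identify paragraphs of text in a page by the first letters of the first "
    "N words of the first and last sentences of a paragraph. So this "
    "paragraph would be NahStp."
)
print(paragraph_key(paragraph))  # NahStp
```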

It occurs to me that extending this idea, by doing a Levenshtein comparison of the set of first letters of all the words in a sentence between paragraphs, is massively cheaper than doing full Levenshtein and probably works pretty well. What I really want to avoid is something which has obvious pathological cases. For example, Emphasis will fail if the first or last sentences are removed/prepended/appended to.
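
One way to read the first-letter idea is sketched below in Python: reduce each text to a signature made of the first letter of every word, then compare signatures with ordinary Levenshtein distance. The function names, the lowercasing, and the tolerance of 2 are assumptions chosen for the example, not anything settled in this thread.

```python
from typing import List, Optional


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def signature(text: str) -> str:
    """First letter of every word, lowercased."""
    return "".join(word[0].lower() for word in text.split())


def best_paragraph(quote: str, paragraphs: List[str], tolerance: int = 2) -> Optional[int]:
    """Index of the paragraph whose signature is closest, within tolerance."""
    target = signature(quote)
    best_index, best_distance = None, tolerance + 1
    for index, paragraph in enumerate(paragraphs):
        distance = levenshtein(target, signature(paragraph))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index


paragraphs = [
    "completely different words appear in this one",
    "the quick red fox jumps over the very lazy dog",
]
print(best_paragraph("the quick brown fox jumps over the lazy dog", paragraphs))  # 1
```

The signatures are one character per word, so the distance computation works on much shorter strings than a full-text comparison would.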

rufuspollock · 2012-04-04T17:27:51Z

@nickstenning I note I mentioned the matching of sections of text in the list post I linked above: "the other option i know of here is to do hashing of small string sections of the document to generate your identifiers". Agree that you will want to extend this to be more subtle.

donohoe · 2012-04-04T23:07:14Z

> For example, Emphasis will fail if the first or last sentences are removed/prepended/appended to.

No, it will not fail. It splits the key and checks it against the first and last sentences. In some cases the paragraph has its leading sentence removed but the last sentence remains the same (and similar variations of modification).

Emphasis, within a level of tolerance, finds the best match and takes that as the link. From the blog post:

> By the time the reader attaches a key to a URL, an index of all the paragraphs in the article or post has been built to match the key against. But there’s an extra twist: when matching a key to a paragraph, the key is split into two pieces (in our example, Lsa and Tpp). Both pieces are compared against the first and last sentences of each paragraph. If there’s no perfect match for the full key, the code checks for a match to one of the halves. This covers instances where one of the sentences in question was removed or edited beyond (machine) recognition.
>
> To enhance the matching even further, the Emphasis code calculates the Levenshtein distance (how “similar” two pieces of text are to each other) of the split key. This check covers many cases where a paragraph was later modified. If two pieces of different keys match closely enough within a given tolerance, that will count as a match.
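
The matching steps in the quoted passage can be sketched roughly as follows. This is a Python reading of the blog post's description, not the Emphasis source: the order of the fallbacks, summing the two halves' distances, the tolerance values, and every function name here are assumptions.

```python
import re
from typing import List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two short strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def halves(paragraph: str, n: int = 3) -> Tuple[str, str]:
    """First letters of the first n words of the first and last sentences."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    if not sentences:
        return "", ""

    def letters(sentence: str) -> str:
        return "".join(word[0] for word in sentence.split()[:n])

    return letters(sentences[0]), letters(sentences[-1])


def match_key(key: str, paragraphs: List[str], n: int = 3,
              tolerance: int = 1) -> Optional[int]:
    """Return the index of the paragraph that best matches `key`, or None."""
    head, tail = key[:n], key[n:]
    index = [halves(p, n) for p in paragraphs]
    # 1. Perfect match on the full key.
    for i, pair in enumerate(index):
        if pair == (head, tail):
            return i
    # 2. Match on one half (the other sentence was removed or rewritten).
    for i, (h, t) in enumerate(index):
        if h == head or t == tail:
            return i
    # 3. Closest pair of halves, if the summed distance is within tolerance.
    best, best_distance = None, tolerance + 1
    for i, (h, t) in enumerate(index):
        distance = levenshtein(head, h) + levenshtein(tail, t)
        if distance < best_distance:
            best, best_distance = i, distance
    return best
```

With this sketch, a paragraph that loses its first sentence still matches on the second half of the key, while one that loses both its first and last sentences matches nothing.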

nickstenning · 2012-04-05T08:38:37Z

Sorry, my point was not to "diss" Emphasis, but to point out that there are special cases in which it has no option but to fail. If you remove the first and last sentence of a paragraph, there is no possible way this algorithm can recover. My suggestion was simply that the constraint you have (a need for short keys that will fit nicely in URLs) is not one we share, and so we could employ more robust strategies such as first-letter-of-every-word-in-annotation.

donohoe · 2012-04-09T21:58:13Z

I didn't take it that way! Sorry if I came across as such.

You are correct - if both first and last sentences were removed the link will fail.

However I would argue that is a good thing. If you were linking to a chunk of text that changed significantly then it has probably lost its original intent and your link should not be valid any more.

fdev31 · 2012-05-10T17:39:11Z

Sorry if this has already been discussed or if I'm missing something, but why don't you just require the caller to provide the information you're missing, like the new location?
Maybe a handy solution could be to let the user choose a "state" for a given location, so you can store multiple, non-conflicting versions of the content and serve them back, as long as the state string is meaningful. I believe many servers can display varying content for one location, depending on some cookie or POST variable...

gka · 2012-11-15T17:15:51Z

This research paper might be helpful for solving this problem: Robust Intra-document Locations

tilgovi · 2012-11-15T21:07:51Z

Got a wiki page over here with lots of links to more: https://github.com/hypothesis/h/wiki/robust-anchors

I'd like to see more discussion of where Annotator could change to support extensible anchoring methods.

nickstenning · 2014-09-21T18:14:23Z

From this point forward, I'm going to be keeping the Annotator issue tracker for bug reports only. Enhancements and feature requests should be made on the mailing list.

As this is a feature request, I'm going to close this issue. If you feel that I've miscategorised the discussion, and there is a genuine unaddressed bug, feel free to reopen with an explanation.

nickstenning mentioned this issue Apr 3, 2012

IE9: JS problems after page-refresh #80

Closed

nickstenning mentioned this issue Mar 12, 2013

Dom manipulation #190

Closed

nickstenning mentioned this issue Jul 3, 2014

MD5 hash the annotated area for validation #396

Closed

nickstenning closed this as completed Sep 21, 2014
