The Changing Content Problem #115

Closed
nickstenning opened this issue Mar 27, 2012 · 13 comments

Comments

@nickstenning
Member

So, I've been thinking about this, and it really can't be that hard to get an 80% case working. As I see it, there are two scenarios we have to worry about.

  1. It's possible to heuristically determine the new location of the annotation. The exact details of the heuristic don't really matter, and can be encapsulated in a plugin, but the simplest possible routine would just search the document for the contents of the annotation quote field, and if it found an exact match, it would update the annotation's ranges and save it back to the server.
  2. It's not possible: all heuristics have failed. Still, this doesn't really matter if we implement a Memento timegate (see http://www.mementoweb.org/). We can simply send a request to the timegate for the current URL, with an Accept-Datetime header set to the value of the annotation's updated field, and if we get a non-empty list of links back we can pop up a message to the user along the lines of "hey, this page has changed since some annotations were made, [click here] to see them and links to historical versions of this page."
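
The "simplest possible routine" from scenario 1 might look something like the sketch below. It is only an illustration: the function name is made up, plain character offsets stand in for whatever range format the store actually uses, and treating a quote that occurs more than once as a failure is an assumption rather than something stated above.

```python
from typing import Optional, Tuple


def relocate_by_quote(document_text: str, quote: str) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of `quote`, or None.

    None is returned when the quote is empty, missing, or occurs more than
    once (treating ambiguity as failure is an assumption of this sketch).
    """
    if not quote:
        return None
    first = document_text.find(quote)
    if first == -1:
        return None  # the heuristic failed: the quoted text is gone
    if document_text.find(quote, first + 1) != -1:
        return None  # more than one candidate position
    return first, first + len(quote)
```

A caller would update the annotation and save it back only when a span is returned, and fall through to scenario 2 otherwise.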

Can someone tell me if I've missed anything obvious?

Really, all we need is a plugin to encapsulate the dumb heuristics side of this (which would potentially include flagging the annotations in the UI as "I tried to automatically reposition myself" and only actually saving them if a human confirms that they make sense) and a plugin to talk to a timegate and display appropriate UI.
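
For the timegate side, the request from scenario 2 could be assembled roughly as follows. This is a sketch under assumptions: the timegate address is a placeholder, the convention of appending the page URL to it is hypothetical, and the annotation's updated value is assumed to be available as a timezone-aware datetime. Nothing is sent over the network here.

```python
from datetime import datetime, timezone
from email.utils import format_datetime
from typing import Dict, Tuple

# Hypothetical timegate endpoint; a real deployment would supply its own.
TIMEGATE_BASE = "http://timegate.example.org/timegate/"


def timegate_request(page_url: str, updated: datetime) -> Tuple[str, Dict[str, str]]:
    """Build the URL and headers for asking a timegate about `page_url`.

    `updated` must be a timezone-aware datetime (an assumption of this sketch).
    """
    # Accept-Datetime uses the HTTP date format, expressed in GMT.
    as_utc = updated.astimezone(timezone.utc)
    headers = {"Accept-Datetime": format_datetime(as_utc, usegmt=True)}
    return TIMEGATE_BASE + page_url, headers
```

For an annotation last updated at noon UTC on 27 March 2012, the header value would be "Tue, 27 Mar 2012 12:00:00 GMT".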

@philipn

philipn commented Mar 27, 2012

One case that's not covered here is when there are some successes and some failures. I guess showing the successes and also showing a link to the most recent stored, annotated page would work, but there could be lots of hidden annotations, and they could span back much further than one revision?

@nickstenning
Member Author

A fair point if you interpreted what I wrote above as an "all-or-nothing" attempt for all the annotations on the page, but I think it's pretty clear that the problem you describe goes away if you treat this as a per-annotation algorithm.

In fact, there's a pretty fundamental problem with what I've described above, namely that the Timegate returns a link to the Memento in the form of an HTTP Location header, which browsers are obliged to follow transparently in XMLHttpRequest calls, so the calling script never gets to see the redirect itself.

So, this suggests a required but much nicer UI: if any of the annotations in the page completely fail to load, we display a notification somewhere saying "Some annotations couldn't be loaded on this page, because it's changed since they were added. [Load this page in the history viewer] to see them". That would be a link to a MementoFox/Etherpad-style time-slider viewer hosted on AnnotateIt. This solves a number of problems at once, such as how to reinstantiate the annotator in the Memento, how to know which URL to search the annotator store for, and so on.
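
Treating relocation as a per-annotation step could be sketched like this. The names are illustrative only, each heuristic is assumed to be a function taking the page text and the annotation's quote, and spans are simplified to character offsets. Annotations that no heuristic can place are collected separately, which is the condition for showing the notification.

```python
from typing import Callable, Dict, List, Optional, Tuple

Span = Tuple[int, int]
Heuristic = Callable[[str, str], Optional[Span]]


def exact_match(text: str, quote: str) -> Optional[Span]:
    """Simplest heuristic: locate the quote verbatim."""
    start = text.find(quote) if quote else -1
    return None if start == -1 else (start, start + len(quote))


def reanchor_all(text: str, annotations: List[Dict], heuristics: List[Heuristic]):
    """Try every heuristic on each annotation independently.

    Returns (placed, orphaned): placed pairs each annotation with its new
    span, orphaned lists the annotations that no heuristic could position.
    """
    placed, orphaned = [], []
    for annotation in annotations:
        span = None
        for heuristic in heuristics:
            span = heuristic(text, annotation["quote"])
            if span is not None:
                break
        if span is None:
            orphaned.append(annotation)
        else:
            placed.append((annotation, span))
    return placed, orphaned
```

If the orphaned list is non-empty, the page would show the "some annotations couldn't be loaded" notice with the link to the history viewer.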

@rufuspollock
Contributor

This was my contribution to the debate (on-list): http://lists.okfn.org/pipermail/annotator-dev/2012-March/000263.html

@nickstenning
Member Author

Some links of interest from the Annotator community call:

@nickstenning
Member Author

Noting another heuristic approach to this taken by @donohoe for Emphasis.js. See the blog post for details, but the principle is to identify paragraphs of text in a page by the first letters of the first N words of the first and last sentences of a paragraph. So this paragraph would be NahStp.
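
To make the key construction concrete, here is a minimal sketch of the principle as described in this comment. It is not the Emphasis.js code: the function name, the naive sentence splitting, and the default of three words are assumptions.

```python
import re


def paragraph_key(paragraph: str, n: int = 3) -> str:
    """First letters of the first n words of the first and last sentences."""
    # Naive sentence split on ., ! or ? followed by whitespace (an assumption).
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]
    if not sentences:
        return ""

    def initials(sentence: str) -> str:
        return "".join(word[0] for word in sentence.split()[:n])

    return initials(sentences[0]) + initials(sentences[-1])
```

A paragraph that opens with "Noting another heuristic ..." and closes with "So this paragraph ..." yields NahStp, matching the example above.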

It occurs to me that extending this idea, by doing a Levenshtein comparison of the set of first letters of all the words in a sentence between paragraphs, is massively cheaper than doing full Levenshtein and probably works pretty well. What I really want to avoid is something which has obvious pathological cases. For example, Emphasis will fail if the first or last sentences are removed/prepended/appended to.
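
A rough sketch of that extension: reduce each piece of text to the first letters of its words, then run an ordinary Levenshtein comparison on those much shorter strings. The function names and the lowercasing are assumptions made for illustration.

```python
def initials(text: str) -> str:
    """First character of every whitespace-separated word, lowercased."""
    return "".join(word[0].lower() for word in text.split())


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, keeping one row at a time."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def initials_distance(old_text: str, new_text: str) -> int:
    """Edit distance between the first-letter strings of two texts."""
    return levenshtein(initials(old_text), initials(new_text))
```

Because the compared strings have one character per word rather than one per letter, the table the algorithm fills in is far smaller than for a full-text comparison.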

@rufuspollock
Contributor

@nickstenning I note I mentioned the matching of sections of text in the list post I linked above: " the other option i know of here is to do hashing of small string sections of the document to generate your identifiers". Agree that you will want to extend to be more subtle.
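
One way to read "hashing of small string sections" is sketched below: hash every short run of consecutive words and compare the resulting sets. This is an interpretation rather than the scheme from the list post; the run length of four words, the truncated SHA-1 digests, and the Jaccard overlap are all assumptions.

```python
import hashlib
from typing import Set


def section_hashes(text: str, size: int = 4) -> Set[str]:
    """Hash every run of `size` consecutive words (or the whole text if shorter)."""
    words = text.split()
    if not words:
        return set()
    runs = [words[i:i + size] for i in range(max(1, len(words) - size + 1))]
    return {hashlib.sha1(" ".join(run).encode("utf-8")).hexdigest()[:12] for run in runs}


def similarity(old_text: str, new_text: str, size: int = 4) -> float:
    """Jaccard overlap of the two sets of section hashes, from 0.0 to 1.0."""
    old, new = section_hashes(old_text, size), section_hashes(new_text, size)
    if not old and not new:
        return 1.0
    return len(old & new) / len(old | new)
```

Exact hashes only tolerate edits that leave some runs untouched, which is presumably why a subtler extension is wanted.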

@donohoe
Contributor

donohoe commented Apr 4, 2012

> For example, Emphasis will fail if the first or last sentences are removed/prepended/appended to.

No, it will not fail. It splits the Key and checks against the First and Last sentences. In some cases the Paragraph has had its leading sentence removed but the last sentence remains the same (and similar variations of modification).

Emphasis, within a level of tolerance, finds the best match and takes that as the link. From the blog post:

> By the time the reader attaches a key to a URL, an index of all the paragraphs in the article or post has been built to match the key against. But there’s an extra twist: when matching a key to a paragraph, the key is split into two pieces (in our example, Lsa and Tpp). Both pieces are compared against the first and last sentences of each paragraph. If there’s no perfect match for the full key, the code checks for a match to one of the halves. This covers instances where one of the sentences in question was removed or edited beyond (machine) recognition.
>
> To enhance the matching even further, the Emphasis code calculates the Levenshtein distance (how “similar” two pieces of text are to each other) of the split key. This check covers many cases where a paragraph was later modified. If two pieces of different keys match closely enough within a given tolerance, that will count as a match.
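
The three-stage matching described in the quoted passage could be sketched as follows. This is a reading of the blog post, not the Emphasis source: the function names, splitting each key at its midpoint, the default tolerance of 1, and taking the closer of the two halves as the score are all assumptions.

```python
from typing import List, Optional


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a single rolling row."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + (char_a != char_b)))
        previous = current
    return previous[-1]


def split_key(key: str):
    """Split a key into its first-sentence and last-sentence halves."""
    half = len(key) // 2
    return key[:half], key[half:]


def match_key(key: str, paragraph_keys: List[str], tolerance: int = 1) -> Optional[int]:
    """Return the index of the best-matching paragraph key, or None."""
    first, last = split_key(key)
    # 1. Perfect match on the full key.
    for index, candidate in enumerate(paragraph_keys):
        if candidate == key:
            return index
    # 2. Exact match on either half.
    for index, candidate in enumerate(paragraph_keys):
        c_first, c_last = split_key(candidate)
        if c_first == first or c_last == last:
            return index
    # 3. Closest half within the tolerance.
    best_index, best_distance = None, tolerance + 1
    for index, candidate in enumerate(paragraph_keys):
        c_first, c_last = split_key(candidate)
        distance = min(edit_distance(first, c_first), edit_distance(last, c_last))
        if distance < best_distance:
            best_index, best_distance = index, distance
    return best_index
```

When neither half of the key survives within the tolerance, the function returns None and the link is treated as broken.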

@nickstenning
Member Author

Sorry, my point was not to "diss" Emphasis, but to point out that there are special cases in which it has no option but to fail. If you remove both the first and last sentences of a paragraph, there is no possible way this algorithm can recover. My suggestion was simply that the constraint you have (a need for short keys that will fit nicely in URLs) is not one we share, and so we could employ more robust strategies such as first-letter-of-every-word-in-annotation.

@donohoe
Contributor

donohoe commented Apr 9, 2012

I didn't take it that way! Sorry if I came across as such.

You are correct: if both the first and last sentences were removed, the link will fail.

However I would argue that is a good thing. If you were linking to a chunk of text that changed significantly then it has probably lost its original intent and your link should not be valid any more.

@fdev31

fdev31 commented May 10, 2012

Sorry if this has already been discussed or if I'm missing something, but why don't you just require the caller to provide the information you're missing, like the new location?
Maybe a handy solution could be to let the user choose a "state" for a given location, so you can store multiple, non-conflicting versions of the content and serve them back, as long as the state string is meaningful. I believe many servers can display varying content for one location, depending on some cookie or POST variable...

@gka

gka commented Nov 15, 2012

This work of research might be helpful for solving this problem: Robust Intra-document Locations

@tilgovi
Member

tilgovi commented Nov 15, 2012

Got a wiki page over here with lots of links to more: https://github.com/hypothesis/h/wiki/robust-anchors

I'd like to see more discussion of where Annotator could change to support extensible anchoring methods.

@nickstenning
Member Author

From this point forward, I'm going to be keeping the Annotator issue tracker for bug reports only. Enhancements and feature requests should be made on the mailing list.

As this is a feature request, I'm going to close this issue. If you feel that I've miscategorised the discussion, and there is a genuine unaddressed bug, feel free to reopen with an explanation.
