URI normalisation #2413

Merged
merged 14 commits into master from uri-normalisation on Aug 6, 2015


5 participants

@nickstenning
Member

Add URI normalisation behind a feature flag

This is the first step in a major cleanup of how we internally normalise
URIs referenced by annotations. The URIs in question divide into two
groups:

  1. URIs stored on annotations which refer to the annotated resource.
  2. URIs discovered in the annotated resource or elsewhere, which can be
    used as hints to tell us which other URIs refer to the same
    underlying document (e.g. to alternate formats of the annotated
    resource).

We use both sets of these URIs when a user visits a page and activates
Hypothesis:

  • the client sends a search request, which includes the URI of the page
    on which it was activated.
  • the server must find all annotations made on that page, but also all
    annotations made on alternate versions of the page, and return these
    to the client.

Getting this right is both important and hard.

It's important because poor implementation will result in a poor
experience of using Hypothesis. Users don't (usually) want to see
annotations from documents other than the one they're viewing, and users
(usually) don't want any of the annotations on the page they're viewing
to be missing.

It's hard because in simple terms, a user expects that when they visit
the "same page," they will see the same annotations. But "same page" is
not a well-defined concept on the web.

Indeed, one of the so-called "axioms of web architecture" is the
[opacity of URIs](http://www.w3.org/DesignIssues/Axioms.html#opaque),
which means that a priori, there are only a limited number of
normalisations of a URI which are guaranteed to preserve reference.

At the time of writing, the standard which defines permitted
normalisations is RFC 3987[1], AKA the "Internationalized Resource
Identifiers (IRIs)" RFC.

[1]: https://tools.ietf.org/html/rfc3987#section-5

Dealing correctly with internationalised identifiers is something we
will probably need to do in due course. For now, dealing correctly with
URIs will be a substantial improvement on the status quo. IRIs won't be
referred to again. The relevant standard which defines permitted
normalisations of URIs is RFC 3986[2].

[2]: https://tools.ietf.org/html/rfc3986#section-6

It's worth reading the above if you're trying to understand the starting
point for this code, but it's also worth understanding that in reality,
we have to be a lot more "heuristic" than these documents allow if we
want to ensure a good user experience. The "opacity of URIs", sadly,
must go out the window (for URLs, at least) if we want to provide a good
experience with Hypothesis on the real web.

For example, in practice, it may be the case that lexically sorting
URL query parameters is almost always a permissible (and thus
experience-improving) normalisation. But such an operation is clearly
not permitted by the URI normalisation specifications.
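
For instance, a stable lexical sort of query parameters can be sketched with the standard library. This is an illustration, not the code in this PR; `sort_query_params` is a hypothetical helper:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

def sort_query_params(uristr):
    """Lexically sort query parameters by name. sorted() is stable, so
    repeated names keep their relative order."""
    parts = urlsplit(uristr)
    params = parse_qsl(parts.query, keep_blank_values=True)
    query = urlencode(sorted(params, key=lambda kv: kv[0]))
    return urlunsplit(parts._replace(query=query))
```

Because the sort key is only the parameter name, multiple values for the same name come out in their original relative order.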

So, overall, here's the approach taken by this commit:

  • We introduce the concept of a URI normalisation, which maps an input
    URI idempotently to a normalised version of that URI.
  • When storing new annotations, we store normalised versions of the
    'source' property of targets (that identify the annotated document).
  • When searching via the API:
    • the value associated with the "uri" keyword, if passed, is used to
      look up alternate URIs.[^1]
    • Together, the query URI and alternate URIs returned from the
      previous step form a set for searching.
    • Each of the URIs in the search set are normalised.
    • The annotation database is queried to find those annotations where
      the normalised target source field is an exact match for any of the
      resulting set of normalised URIs.
  • We continue to store, unmodified, the URIs as passed by the client, so
    that we can, if necessary, alter the URI normalisation operation in
    the future.
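
The search-side steps above can be sketched as follows. This is a sketch under assumptions: `expand_uri` and `normalise` are stand-ins for whatever expansion and normalisation functions are in use, and the field name mirrors the `target.source_normalised` field this PR introduces:

```python
def uri_terms_filter(query_uri, expand_uri, normalise):
    """Build an Elasticsearch terms filter matching annotations whose
    normalised target source is in the normalised search set."""
    # The query URI plus its known alternates form the search set...
    search_set = {query_uri} | set(expand_uri(query_uri))
    # ...and each member is normalised before the exact-match lookup.
    normalised = sorted({normalise(u) for u in search_set})
    return {"terms": {"target.source_normalised": normalised}}
```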

A later commit will introduce a mechanism to regenerate the normalised
fields in the search index, to facilitate both the release of this
feature in a deployment with legacy annotations lacking such fields and
the ability to update the URI normalisation procedure later.

The addition of normalised fields to annotations is performed
unconditionally, but the alternate search procedure is guarded by a
feature flag.

[^1]: Ideally we would look up such "URI equivalences" using a
normalised URI, but this is complicated by the implementation of
annotator-store at the moment, so we don't yet do so.

Fixes #2240.
Affects (but doesn't necessarily fix) #2217. (See comments later).

@nickstenning
Member

Outstanding discussions that need to happen and/or decisions that need to be taken:

@nickstenning
Member

Also, there's a rejig in here of h.api.search from a module to a package -- that probably should be extracted so that this diff is easier to read. I'll do that...

@nickstenning
Member

Rebased on top of #2414.

@nickstenning
Member

I've also just realised that this is lacking tests for the new features of query.build, so I'll mark it as "WIP".

@nickstenning nickstenning added the WIP label Jul 29, 2015
@tilgovi
Contributor
tilgovi commented Jul 29, 2015

Question:

You say, "the normalised URI is used to look up alternate URIs, which are normalised in turn" but if we're storing the normalized URIs on ingress won't these results already be normalized?

@nickstenning
Member

You say, "the normalised URI is used to look up alternate URIs, which are normalised in turn" but if we're storing the normalized URIs on ingress won't these results already be normalized?

Documents are problematic. At the moment, when a document object is inserted (i.e. when a document field is present on an inserted annotation), the database is searched by unnormalised URI and the links found are aggregated onto the new document object.

Thus when we search for equivalent documents, we search by the unnormalised URI, because that's how annotator-store works.

We then have to normalise the returned URIs for comparison with the target.*.source_normalised fields: see here.

This should work ok for now. Ideally, in the future, we would store normalised hrefs on documents and discover document equivalences using normalised URIs as well.

You're absolutely right to bring this up, because it highlights an inaccuracy in the commit message. I'll update the PR.

@nickstenning
Member

I've removed all reference to normalisation of the document field's link.href properties. This needs to be done, but it can be done as a separate piece of work.

@nickstenning nickstenning added WIP and removed WIP labels Jul 30, 2015
@landscape-bot

Code Health
Repository health decreased by 0.05% when pulling a95f49b on uri-normalisation into 3920527 on master.

@BigBlueHat
Contributor

I'd be very curious to know how triple stores handle this problem. It seems it would be a very common issue in Semantic Web / Linked Data circles, so there ought to be plenty of existing code for URI equality comparisons, and even optimized storage based on similarity, etc.

This seems like too common a problem not to have some existing code laid down for it.

@nickstenning
Member

URI equality in 3stores is, as far as I can tell, usually defined as in RFC3986, as discussed above. We necessarily have to go further than that because real-world web publishers aren't as aware of the semantic value of URIs as your average user of a 3store.

nickstenning added some commits Jul 17, 2015
@nickstenning nickstenning Add URI normalisation behind a feature flag
(Commit message: identical to the PR description above.)
ba17d98
@nickstenning nickstenning Add normalised URIs on insert/update
In order to search against the "source_normalised" field, we need to add
this field when we index the annotation. This commit adds support for
computing "source_normalised" fields (for targets) when annotations are
inserted into the ElasticSearch index.
018179d
@tilgovi tilgovi and 1 other commented on an outdated diff Aug 5, 2015
h/api/search/query.py
- query["query"] = {
- "filtered": {
- "filter": nipsa.nipsa_filter(userid=userid),
- "query": query["query"]
- }
- }
+ # And now we simplify the query. If there are no added matches, then we can
@tilgovi
tilgovi Aug 5, 2015 Contributor

This whole block might be unnecessary optimization that Elasticsearch is smart enough to deal with.

@tilgovi
tilgovi Aug 5, 2015 Contributor

Or, alternatively, let's simplify this function a bit and make it clearer. Start with explicit empty lists:

filter_clauses = []
query_clauses = []

Then append as necessary.

Finally, construct the whole body rather than mutating it.
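
The refactor being proposed might look roughly like this (an illustrative sketch, not the code as merged; the clause shapes are simplified and use the Elasticsearch 1.x `filtered` query):

```python
def build(terms, filters):
    """Assemble an Elasticsearch query body from explicit clause lists,
    rather than mutating a half-built dict."""
    query_clauses = [{"match": {field: value}} for field, value in terms]
    filter_clauses = list(filters)

    # Start from the simplest query and only wrap it when needed.
    query = {"bool": {"must": query_clauses}} if query_clauses else {"match_all": {}}
    if filter_clauses:
        query = {"filtered": {"filter": {"and": filter_clauses},
                              "query": query}}
    return {"query": query}
```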

@nickstenning
nickstenning Aug 6, 2015 Member

This isn't really intended to be optimization for ElasticSearch. It was supposed to be optimization for humans trying to read the generated queries while debugging :)

But certainly we can do something more like what you suggested in your second comment.

@tilgovi tilgovi commented on an outdated diff Aug 5, 2015
h/api/search/query.py
@@ -98,3 +118,16 @@ def _match_clause_for_uri(uristr):
"should": matchers
}
}
+
+
+def _filter_clause_for_uri(uristr):
@tilgovi
tilgovi Aug 5, 2015 Contributor

I might suggest calling this _term_clause_for_uri as the other is called _match_clause_for_uri and both are ignorant of whether they're being added to a filter or a query. Here, filters is really just something more like clauses and it's the caller that's choosing to put it into a filter.

@BigBlueHat BigBlueHat commented on the diff Aug 5, 2015
h/api/test/uri_test.py
+ # Query: ensure OTHER characters are encoded
+ ("http://example.com?你好世界=γειά σου κόσμος", "http://example.com?%E4%BD%A0%E5%A5%BD%E4%B8%96%E7%95%8C=%CE%B3%CE%B5%CE%B9%CE%AC+%CF%83%CE%BF%CF%85+%CE%BA%CF%8C%CF%83%CE%BC%CE%BF%CF%82"),
+ ("http://example.com?love=♥", "http://example.com?love=%E2%99%A5"),
+
+ # Query: normalise case of encodings
+ ("http://example.com?love=%e2%99%a5", "http://example.com?love=%E2%99%A5"),
+
+ # Query: lexically sort parameters by name
+ ("http://example.com?a=1&b=2", "http://example.com?a=1&b=2"),
+ ("http://example.com?b=2&a=1", "http://example.com?a=1&b=2"),
+
+ # Query: preserve relative ordering of multiple params with the same name
+ ("http://example.com?b=2&b=3&b=1&a=1", "http://example.com?a=1&b=2&b=3&b=1"),
+ ("http://example.com?b=&b=3&b=1&a=1", "http://example.com?a=1&b=&b=3&b=1"),
+
+ # Query: remove parameters known to be irrelevant for document identity
@BigBlueHat
BigBlueHat Aug 5, 2015 Contributor

👎 for presuming omniscience 😉

There's no guarantee that someone even knowledgeable of their common use isn't customizing the site content based on these query params being available.

@tilgovi
tilgovi Aug 5, 2015 Contributor

Yes, but I would guess it's the exception, not the rule, that the content would be heavily customized this way and false positives (up to a point) are better than false negatives.

It's "where's my annotation?!?" << "huh, this annotation doesn't really seem to belong here"

@nickstenning
nickstenning Aug 6, 2015 Member

Google Analytics query params are designed to allow site owners to distinguish between visits to the same page, and so are almost by definition not going to be used to display different content.

But even if we ignore that fact, it's not a problem if this code identifies as the same some URLs which are in fact different. The question is whether on balance, Hypothesis users are better served by loading annotations for pages that differ only by these query params, or not.

Lacking data suggesting that large swathes of webmasters are completely irrational and have decided to overload utm_* query parameters to mean something different, I think this is the right behaviour for URI normalisation.

@tilgovi tilgovi commented on an outdated diff Aug 5, 2015
h/api/search/transform.py
+
+ Scan the passed annotation for any target URIs or document metadata URIs
+ and add normalised versions of these to the document.
+ """
+
+ # FIXME: When this becomes simply part of a search indexing operation, this
+ # should probably not mutate its argument.
+ _normalise_annotation_target_uris(annotation)
+
+
+def render(annotation):
+ """
+ Render an annotation retrieved from search for public display.
+
+ Receives data direct from the search index and reformats it for rendering
+ or display in the public API. At the moment all this does is remove
@tilgovi
tilgovi Aug 5, 2015 Contributor

I would remove "At the moment all this does..." since that's a comment that is liable to get out of date and it's more detail than anyone needs to know here. Want to know exactly what it does? Go look at that function.

@tilgovi tilgovi and 1 other commented on an outdated diff Aug 5, 2015
h/api/search/transform.py
+ return data
+
+
+def _normalise_annotation_target_uris(annotation):
+ if 'target' not in annotation:
+ return
+ if not isinstance(annotation['target'], list):
+ return
+ for target in annotation['target']:
+ if not isinstance(target, dict):
+ continue
+ if 'source' not in target:
+ continue
+ if not isinstance(target['source'], basestring):
+ continue
+ target['source_normalised'] = uri.normalise(target['source'])
@tilgovi
tilgovi Aug 5, 2015 Contributor

Is there a reason we can't just exclude this from the source in the mapping? That would let it be searchable and indexed but not stored in the source. It means that for a re-index we'd have to run annotations through our Python code again to regenerate these things, but shouldn't we do that anyway if we're starting to handle things like this in Python?

@tilgovi
tilgovi Aug 5, 2015 Contributor

Sorry, this was meant to apply to the function below.

@nickstenning
nickstenning Aug 6, 2015 Member

Right. I did consider that -- but I think for now it's clearer to keep this separate. Certainly, at some point I would imagine that hypothesis reindex is a tool that pulls stuff out of the database, prepares it for search, and then shoves it in elasticsearch.

When we do that, I'd agree with you that _source excludes might be one option for simplifying what we retrieve from the search index. But until that point I think it's going to be confusing to have a reindex function that doesn't work. (And updating reindex is outside the scope of this pull request.)

@tilgovi tilgovi and 1 other commented on an outdated diff Aug 5, 2015
h/api/views.py
@@ -76,9 +76,18 @@ def index(context, request):
@api_config(context=Root, name='search')
def search(request):
"""Search the database for annotations matching with the given query."""
+ search_normalised_uris = request.feature('search_normalised')
@tilgovi
tilgovi Aug 5, 2015 Contributor

Oof. We have a keyword argument propagating through two layers of abstraction just to support putting it behind a feature flag? Is this a case where maybe we can just say we're special enough to warrant using get_current_request()? Maybe it's silly for our feature flag check to even be a request property when we could be using the query property on the Feature model like we do with the other models. Of course, the problem with that is that the feature flag state may depend on other properties of the request, like the admin and staff flags of the current user, so hmmmmmmmmmm

@tilgovi
tilgovi Aug 5, 2015 Contributor

Can this really usefully be feature flagged anyway? I mean, I can see how it can be turned on or off for a deployment, but can it really be on for some people but not for others? The way it's described is that this argument says that we can assume all the URIs have a normalized copy stored and indexed, but that's not true if it only happens for some users' annotations.

@nickstenning
nickstenning Aug 6, 2015 Member

Yeah, so this is ugly, and it's what I meant when I said the other day that some feature flags are really technical debt, in the sense that they allow us to go faster now (by deploying things before we're certain we're ready to release them) at the cost of having to do some work later to clean them up (i.e. remove the feature flag and the other branch of the code).

In this case, I think the benefits (we can test the new URI normalisation code before releasing it to the general public) outweigh the costs. And yes, it can be usefully feature flagged -- because:

  1. source_normalised fields are added regardless of the flag
  2. we can run the data migration without harming the current setup
  3. requests to the API are authenticated, so we can switch on the new search code for staff only in testing
@tilgovi tilgovi and 1 other commented on an outdated diff Aug 5, 2015
@@ -129,6 +129,9 @@ def run_tests(self):
'h.worker': [
'notification=h.notification.worker:run',
'nipsa=h.api.nipsa.worker:worker',
+ ],
+ 'h.annotool': [
+ 'reprepare=h.api.search:prepare',
@tilgovi
tilgovi Aug 5, 2015 Contributor

Hehe. Can we just call this prepare instead of reprepare? After all, the first time we run this things won't even have been prepared.

nickstenning added some commits Jul 28, 2015
@nickstenning nickstenning Add `hypothesis annotool` for modifying the annotation database
This commit adds a command to the `hypothesis` CLI, which enables
running of functions which modify annotations, so that we can run data
migrations against the annotation database.

The only supported function at the moment is 'normalise-uris', which
ensures that all annotations in the database have normalised copies of
their "target.source" attributes.
5d2d814
@nickstenning nickstenning Filter 'source_normalised' keys from annotations before display 5261ed4
@nickstenning nickstenning Move normalisation/cleanup into `h.api.search.transform`
This moves all responsibility for adding and removing the
"source_normalised" fields from annotation data into a "transform"
module in `h.api.search`.

This keeps `h.api.uri` simple.
172fe73
@nickstenning nickstenning URI normalisation: strip trailing slashes
It would be unusual for the slashless version of a URL and the slashed
version, e.g.

    http://example.com/foo/
    http://example.com/foo

to refer to different documents. Thus, when normalising URLs, we remove
any trailing slashes from the pathinfo.
38bc599
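
A minimal sketch of this step (`strip_trailing_slashes` is a hypothetical helper; the actual implementation lives in `h.api.uri`):

```python
from urllib.parse import urlsplit, urlunsplit

def strip_trailing_slashes(uristr):
    """Remove trailing slashes from the path component. rstrip() removes
    them all at once, which keeps the operation idempotent."""
    parts = urlsplit(uristr)
    return urlunsplit(parts._replace(path=parts.path.rstrip("/")))
```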
@nickstenning nickstenning Refactor `uri.normalise` for clarity
It's slightly easier to read if we separate the normalisation functions
out into one per URL component, rather than asking each piece to do its
own URI parsing/splitting/recombining.

Each "_normalise_*" function gets passed the whole URI object, because
the correct normalised form of each component may depend on other parts
of the URL. For example, it's only correct to remove default port
declarations (such as ":80" or ":443") if the corresponding scheme is
the right one (i.e. "http", "https").
934af1a
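
The scheme-dependent port rule described above can be sketched like this (`normalise_netloc` is a hypothetical helper operating on a parsed URI, as the commit describes):

```python
from urllib.parse import urlsplit

DEFAULT_PORTS = {"http": 80, "https": 443}

def normalise_netloc(uri):
    """Drop an explicit port only when it is the default for the URI's
    scheme, e.g. ":80" for http or ":443" for https."""
    if uri.port is not None and uri.port == DEFAULT_PORTS.get(uri.scheme):
        return uri.hostname
    return uri.netloc
```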
@nickstenning nickstenning URI normalisation: add query string normalisation
Do some relatively uncontroversial normalisation of query strings. We
sort the query string by key, leaving multiple query params with the
same key in the same relative ordering.
a1189bf
@nickstenning nickstenning URI normalisation: remove Google Analytics query params
Google Analytics allows users to provide query parameters which feed
additional data into the JavaScript you run on your page. These params
can be reasonably assumed to be irrelevant for determining document
identity, so we can strip them from the normalised URL.
ec0d983
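
A sketch of this stripping step (the `utm_*` list here is illustrative; the real blacklist lives in the normalisation code):

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Google Analytics campaign parameters, assumed irrelevant to identity.
BLACKLISTED_PARAMS = {"utm_source", "utm_medium", "utm_campaign",
                      "utm_term", "utm_content"}

def strip_analytics_params(uristr):
    """Drop blacklisted query parameters, preserving everything else."""
    parts = urlsplit(uristr)
    params = [(k, v)
              for k, v in parse_qsl(parts.query, keep_blank_values=True)
              if k not in BLACKLISTED_PARAMS]
    return urlunsplit(parts._replace(query=urlencode(params)))
```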
@nickstenning nickstenning Input URIs can be unicode, outputs are strings
And we expect unicode in query params to be normalised to urlencoded
form.
b49e1aa
@nickstenning nickstenning URI normalisation: normalise percent-encoding
Normalise percent-encoding of paths and query strings as dictated by
RFC3986.

Paths and query strings can both contain [percent-encoded][1] character
data. Percent-encoding is a way to include in URLs (and URIs more
generally) certain characters which may have semantic meaning in a URL
(such as "/" in a URL path), or which may simply not be permitted (such
as literal UTF-8 characters).

Sometimes these percent-encodings must be preserved -- percent-encoded
UTF-8 unicode, for example. But sometimes URLs that differ only in
percent-encoding can (according to RFC3986) be considered the same. To
pick one example, "-" (hyphen) is an "unreserved character", so:

    http://example.com/foo-bar, AND
    http://example.com/foo%2Dbar

can be considered the same. In addition, case in percent encodings does
not carry information, so

    http://example.com/foo%2dbar, AND
    http://example.com/foo%2Dbar

are also the same URL.

This commit adds support for normalising percent-encoding of URL paths
and query strings.

[1]: https://en.wikipedia.org/wiki/Percent-encoding
57feff9
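
The rule can be sketched as follows (an illustration of the RFC 3986 rule, not the merged code): decode escapes of unreserved characters, and uppercase the hex digits of any escape that must remain.

```python
import re

# RFC 3986 "unreserved" characters: safe to decode anywhere.
UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZ"
                 "abcdefghijklmnopqrstuvwxyz"
                 "0123456789-._~")

def normalise_percent_encoding(component):
    """Normalise percent-escapes in a path or query string component."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                      # e.g. %2D -> "-"
        return "%" + match.group(1).upper()  # e.g. %e2 -> %E2
    return re.sub(r"%([0-9A-Fa-f]{2})", fix, component)
```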
@nickstenning nickstenning URI normalisation: strip multiple trailing slashes
uri.normalise should be idempotent.
5cc0e96
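
Idempotency is easy to state as a property check (illustrative helper, not part of the PR):

```python
def is_idempotent(normalise, samples):
    """True if applying the normalisation twice never changes the result
    of applying it once: normalise(normalise(u)) == normalise(u)."""
    return all(normalise(normalise(u)) == normalise(u) for u in samples)
```

For example, stripping all trailing slashes in one pass is idempotent, whereas stripping only one slash per call is not, which is exactly the bug this commit fixes.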
@nickstenning nickstenning Refactor query.build for clarity
Rather than creating half a query body and mutating it, we create the
components of it one at a time and assemble it all at the end. This
reduces the amount of

    query["query"]["filtered"]["query"]["query"]["term"]...

craziness in the code.
6a2f803
@nickstenning nickstenning removed the WIP label Aug 6, 2015
@nickstenning
Member

I've addressed most of the outstanding issues. I've also done a little bit of analysis to understand how this is going to affect the existing database of annotations -- see the comparative dataset of unique URLs before/after normalisation (this is deliberately not available to people outside the Hypothesis organisation because it's from the full dataset, not public-only annotations).

One possible concern I have is around dropping fragments. As it happens we already do this client-side and so this PR doesn't change behaviour there, but it certainly seems to lead to a subpar experience on pages such as these:

We have a load of the above pages annotated in our database, and at the moment we're basically loading all the annotations for the entire site on every page...

For now I think it's safe enough to merge this and we can test out and try different normalisation strategies before committing and flipping this on for everyone.

@nickstenning
Member

We also do still need a plan for the 13,000+ annotations lacking "target" fields. I haven't yet checked in detail, but I presume these are legacy annotations with annotator-style "ranges" objects -- we can probably do a one-off migration to give these a target field.

@landscape-bot

Code Health
Repository health increased by 0.07% when pulling 6a2f803 on uri-normalisation into 9f5674f on master.

@tilgovi tilgovi Americanise spellings :)
0ee4749
@tilgovi
Contributor
tilgovi commented Aug 6, 2015

I changed all the spellings of normalize to the American English version. I always feel sort of awkward about that, but it's consistent with everything else in the code. Even comments in this PR said analyze while the code said normalise :). Sorry, you all.

There are a couple more I noticed that aren't related to this PR; I'll replace those after I merge.

@tilgovi
Contributor
tilgovi commented Aug 6, 2015

Also, @nickstenning this is truly great. Thank you.

@tilgovi tilgovi merged commit cc4ca35 into master Aug 6, 2015

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
coverage/coveralls Coverage increased (+1.0%) to 64.141%
Details
@tilgovi tilgovi deleted the uri-normalisation branch Aug 6, 2015
@judell
Contributor
judell commented Aug 7, 2015

❤️

This closes #2217 as well?

@nickstenning
Member

Nope. #2217 is mostly about document equivalence mappings, which is the second chunk of this work.

@judell
Contributor
judell commented Aug 12, 2015

Hmm. #2217's example, of these giving different results:

https://hypothes.is/stream?q=uri:http%3A//blog.jonudell.net

https://hypothes.is/stream?q=uri:http%3A//blog.jonudell.net/

does not now repro. If #2413 wasn't the fix, what was?

@tilgovi
Contributor
tilgovi commented Aug 13, 2015

No idea, but close it if you can't repro?

@nickstenning nickstenning added a commit that referenced this pull request Sep 25, 2015
@nickstenning nickstenning Give comments/page notes a single target with source field
In order for comments (AKA "page notes") to be discoverable with the
search logic implemented in #2413 (cc4ca35), they need to have at least
one target with a source field.

Previously a comment was defined as an annotation

- with a `uri`
- without any targets
- without any references (no targets but with references == a reply)

Now comments have a single target with a single field, `source`. This
finally puts us in a position where we can consider deprecating the
`uri` field.
6f43495