URI normalization, c14n, escaping, and extraction

PostRank URI


A collection of convenience methods (Ruby 2.0+) for extracting, (un)escaping, normalizing, and canonicalizing URIs. At PostRank we process over 20M URI-associated activities each day. We need to reliably extract URIs from a variety of text formats, deal with the numerous and creative ways users escape and unescape their URIs, normalize the resulting URIs, and finally apply a set of custom canonicalization rules so that we can cross-reference when users are talking about the same URL.

In a nutshell, we need to make sure that the many creative ways of writing the same URL all resolve to the same URI.

API

  • PostRank::URI.extract(text) - Detect URIs in text, discard bad TLDs

  • PostRank::URI.clean(uri) - Unescape, normalize, and apply c14n filters; covers the 95% use case

  • PostRank::URI.normalize(uri) - Apply RFC normalization rules, discard extra path characters, drop anchors

  • PostRank::URI.unescape(uri) - Unescape URI entities, handle + vs. %20, etc.

  • PostRank::URI.escape(uri) - Escape URI
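
To make the "+ vs. %20" remark concrete, here is a minimal sketch that uses only Ruby's standard `uri` library rather than PostRank::URI itself. It shows that both spellings decode to a space in a query component, which is the kind of ambiguity an unescape step has to absorb. The sample strings are made up for illustration.

```ruby
require 'uri'

# In a query component, "+" and "%20" both decode to a space.
plus_form    = URI.decode_www_form_component('ruby+uri')
percent_form = URI.decode_www_form_component('ruby%20uri')

# Escaping goes the other way; the stdlib form encoder emits "+" for a space.
escaped = URI.encode_www_form_component('ruby uri')

puts plus_form == percent_form
puts escaped
```

Because two different escaped strings can carry the same meaning, comparing URIs only makes sense after they have been unescaped and re-escaped in one consistent way.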

Example

>> PostRank::URI.extract('some random text with http://link.to somecanadiansite.ca')
[
    [0] "http://link.to/",
    [1] "http://somecanadiansite.ca/"
]

>> PostRank::URI.clean('link.to?a=b&utm_source=FeedBurner#stuff')
[
    [0] "http://link.to/?a=b"
]
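
The normalization steps named in the API list (RFC normalization, discarding extra path characters, dropping anchors) can be approximated with the standard library alone. The sketch below is an assumption-laden illustration, not the gem's implementation: `rough_normalize` is a hypothetical helper, and treating "extra path characters" as repeated slashes is a guess at one such case.

```ruby
require 'uri'

# Hypothetical helper, for illustration only.
def rough_normalize(url)
  # URI#normalize lowercases the scheme and host and turns an empty path into "/".
  uri = URI.parse(url).normalize
  # Assumed example of "extra path characters": collapse runs of slashes.
  uri.path = uri.path.squeeze('/')
  # Drop the anchor (fragment).
  uri.fragment = nil
  uri.to_s
end

puts rough_normalize('http://Link.TO//a//b#stuff')
puts rough_normalize('http://link.to#stuff')
```

Unlike PostRank::URI.clean, this sketch expects an input that already has a scheme; it does not add a missing `http://`.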

C14N

As part of URI canonicalization, the library removes common tracking parameters from Google Analytics and several other providers. Beyond that, host-specific rules are also applied. For example, nytimes.com likes to add a 'partner' query parameter for tracking purposes, which has no effect on the content; hence, it is removed from the URI. For the full list, see the c14n.yml file.

Detecting "duplicate URLs" is a hard problem to solve (expensive in all senses), so instead we are compiling a manually assembled database of rules. If you find cases that are missing, please report them or send us a pull request!
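
The idea of global plus host-specific rules can be sketched with the standard library. This is not how the gem is implemented: the rule tables below are hypothetical stand-ins for c14n.yml, `strip_tracking_params` is an invented name, and only `utm_source` and the nytimes.com `partner` parameter come from this README (the other `utm_*` names are assumed examples of Google Analytics parameters).

```ruby
require 'uri'

# Hypothetical rule tables; the library's real rules live in c14n.yml.
GLOBAL_TRACKING_PARAMS = %w[utm_source utm_medium utm_campaign].freeze
HOST_TRACKING_PARAMS   = { 'nytimes.com' => %w[partner] }.freeze

# Hypothetical helper: drop query parameters matched by a global or host rule.
def strip_tracking_params(url)
  uri = URI.parse(url)
  return url if uri.query.nil?

  # Treat "www.nytimes.com" and "nytimes.com" as the same host (an assumption).
  host    = uri.host.to_s.downcase.sub(/\Awww\./, '')
  blocked = GLOBAL_TRACKING_PARAMS + HOST_TRACKING_PARAMS.fetch(host, [])

  kept = URI.decode_www_form(uri.query).reject { |key, _| blocked.include?(key) }
  uri.query = kept.empty? ? nil : URI.encode_www_form(kept)
  uri.to_s
end

puts strip_tracking_params('http://link.to/?a=b&utm_source=FeedBurner')
puts strip_tracking_params('http://www.nytimes.com/story.html?partner=rss')
```

Keeping the host rules in a separate table means a parameter such as `partner` is only removed where it is known to be tracking-only, and is left alone on every other site.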

Development

Setup

bundle install

Running tests

bundle exec rake

Running dependency appraisals

To verify that postrank-uri works with different versions of its runtime dependencies, you can run:

bundle exec appraisal install
bundle exec rake appraisal

This will execute the test suite with different versions of the dependencies.