A Clojure library for normalizing URLs with configurable aggressiveness.

DESCRIPTION

Normalizes URLs, with options to select normalizations that may or may not preserve semantics.

These normalizations are well tested only against the HTTP and HTTPS schemes.

SEMANTICS-PRESERVING NORMALIZATIONS

Applying these normalizations will not cause the URL to describe a different resource.

  • Lower case the scheme portion
  • Lower case the host
  • Upper case percent encodings
  • Decode unreserved characters
  • Add a trailing slash to the host
  • Remove the default port
  • Remove dot segments
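
Two of the rules above, upper-casing percent encodings and decoding unreserved characters, can be sketched in plain Clojure. These helper names are illustrative only and are not part of url-normalizer's API:

```clojure
(require '[clojure.string :as str])

;; Hypothetical standalone helpers, not url-normalizer's implementation.
(defn upper-case-encodings
  "Upper-cases the hex digits of every percent-encoded octet."
  [s]
  (str/replace s #"%[0-9a-fA-F]{2}" str/upper-case))

(defn decode-unreserved
  "Decodes percent-encoded octets that are unreserved per RFC 3986
  (ALPHA / DIGIT / \"-\" / \".\" / \"_\" / \"~\")."
  [s]
  (str/replace s #"%[0-9A-Fa-f]{2}"
    (fn [enc]
      (let [c (char (Integer/parseInt (subs enc 1) 16))]
        (if (re-matches #"[A-Za-z0-9._~-]" (str c))
          (str c)
          enc)))))

(decode-unreserved (upper-case-encodings "/%7ejane/%2ffoo"))
;; => "/~jane/%2Ffoo" -- %7E decodes to ~, %2F is reserved and stays encoded
```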

NON-SEMANTICS-PRESERVING NORMALIZATIONS

Apply these normalizations with caution: technically they change which resource the URL describes, though servers often won't care (e.g. when the fragment is removed). Hashbang fragments, used by sites like Twitter, are tricky, so there is a dedicated option to keep them.

  • Remove the directory index
  • Remove the fragment
  • Convert from an IP address to a hostname
  • Remove duplicate query keys and values
  • Remove empty query
  • Remove empty user info segment
  • Remove trailing dot in host
  • Keep the hashbang fragment
  • Force http instead of https
  • Remove the www from the host
  • Sort the query keys
  • Decode reserved characters

BUILDING

Make sure to delete your classes and lib directories if you are upgrading. Leiningen and Clojure are finicky.

USAGE

Use the normalize function to apply specific normalizations to URLs. Note that only safe, semantics-preserving normalizations are applied by default.

(use '[url-normalizer.core :exclude (resolve)])

(normalize "http://WWW.EXAMPLE.COM:80/%7ejane/foo/bar/../baz")
-> #<URI http://www.example.com/~jane/foo/baz>

(normalize "../../../../bif#foo" {:base "http://example.com:8080/a/b/c/f/d"})
-> #<URI http://example.com:8080/a/bif#foo>

(normalize "http://example.com?")
-> #<URI http://example.com/?>

(normalize "http://example.com?" {:remove-empty-query? true})
-> #<URI http://example.com/>

(normalize "http://example.com/#!/foo" {:remove-fragment? true})
-> #<URI http://example.com/>

(normalize "http://example.com/#!/foo" {:remove-fragment? true
                                        :keep-hashbang-fragment? true})
-> #<URI http://example.com/#!/foo>

(normalize "http://例え.テスト/")
-> #<URI http://xn--r8jz45g.xn--zckzah/>
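
The last example converts an internationalized domain name to its ASCII (Punycode) form. On the JVM this ToASCII conversion (RFC 3490) is available directly via java.net.IDN; whether url-normalizer uses it internally is an assumption, but it reproduces the same result:

```clojure
;; java.net.IDN ships with the JDK; this mirrors the example above.
(java.net.IDN/toASCII "例え.テスト")
;; => "xn--r8jz45g.xn--zckzah"
```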

(with-normalization-context
  {:lower-case-host? false
   :remove-default-port? false
   :remove-empty-query? true
   :remove-trailing-dot-in-host? true}
 #(normalize "http://WWW.example.COM.:80/?"))
-> #<URI http://WWW.example.COM:80/>

See #'url-normalizer.core/*context* for applicable normalizations. Some normalizations do not preserve semantics; be warned.

You can also test whether two URLs are equivalent or equal. Two URLs are equivalent if they normalize to the same URL; they are equal if their ASCII representations are the same.

(use 'url-normalizer.core)

(equivalent? "http://example.com" "http://example.com/")
-> true

(equal? "http://example.com" "http://example.com/")
-> false

AUTHORS

Tests taken from Sam Ruby's version of urlnorm.py

SEE ALSO

LICENSE

Copyright (C) 2010

Distributed under the Eclipse Public License, the same as Clojure.
