clj-tagsoup

This is an HTML parser for Clojure, somewhat akin to Common Lisp's cl-html-parse. It is a wrapper around the TagSoup Java SAX parser, but has a DOM interface. It can be built with Leiningen.

Usage

The two main functions defined by clj-tagsoup are parse and parse-string. parse accepts anything that clojure.contrib's reader function accepts, except for a Reader; parse-string parses HTML from a string.

The resulting HTML tree is a vector, consisting of:

  1. a keyword representing the tag name,
  2. a map of tag attributes (mapping keywords to strings),
  3. child nodes (strings or vectors of the same format).

This is the same format as used by hiccup, so the output of parse can be passed directly to hiccup.

There are also utility accessors (tag, attributes, children).
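
Because the tree is made of plain vectors, maps and strings, it can be walked with ordinary Clojure functions. The sketch below is illustrative only: sample-tree is hand-written rather than real parser output, and node-tag, node-attributes and node-children are hypothetical stand-ins written with core functions, not the library's own tag, attributes and children accessors.

```clojure
;; A hand-written tree in the format described above (not real parser output).
(def sample-tree
  [:html {}
   [:body {}
    [:p {:class "intro"} "Hello, " [:b {} "world"] "!"]]])

;; Hypothetical stand-ins for the accessors, built on plain vector operations.
(defn node-tag [node] (first node))
(defn node-attributes [node] (second node))
(defn node-children [node] (drop 2 node))

(defn text-content
  "Concatenates all string leaves under a node, depth-first."
  [node]
  (if (string? node)
    node
    (apply str (map text-content (node-children node)))))

(text-content sample-tree)
;; => "Hello, world!"
```

The same recursive shape (string leaf versus vector node) works for other traversals, such as collecting every :href attribute.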

clj-tagsoup will automatically use the correct encoding to parse the file if one is specified in either the HTTP headers (if the argument to parse is a URL object or a string representing one) or a <meta http-equiv="..."> tag. If you are running Clojure from within Leiningen, you might experience charset problems; see this blog post for details.

clj-tagsoup is meant to parse HTML tag soup, but in practice nothing prevents you from using it to parse arbitrary (potentially malformed) XML. The :xml keyword argument causes clj-tagsoup to take the XML header into consideration when detecting the encoding.

Another option for parsing XML is using the parse-xml function. It just invokes clojure.xml/parse with TagSoup, so the output format is compatible with clojure.xml and is not the one described above.
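
To make the difference between the two formats concrete, here is a minimal sketch that converts a clojure.xml-style node (a map with :tag, :attrs and :content keys) into the vector format described earlier. The function xml-map->vector and the sample node are hypothetical and written by hand for illustration; they are not part of clj-tagsoup.

```clojure
(defn xml-map->vector
  "Converts a clojure.xml-style node into [tag attrs & children].
  Strings (text nodes) are returned unchanged."
  [node]
  (if (map? node)
    (into [(:tag node) (or (:attrs node) {})]   ; a missing :attrs becomes an empty map
          (map xml-map->vector (:content node)))
    node))

;; A hand-written node in the clojure.xml shape (not real parser output).
(def sample-xml-node
  {:tag :p
   :attrs {:class "note"}
   :content ["See "
             {:tag :a
              :attrs {:href "http://example.com"}
              :content ["this"]}]})

(xml-map->vector sample-xml-node)
;; => [:p {:class "note"} "See " [:a {:href "http://example.com"} "this"]]
```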

Example

(parse "http://example.com")
=> [:html {}
          [:head {}
                 [:title {} "Example Web Page"]]
          [:body {}
                 [:p {} "You have reached this web page by typing \"example.com\",\n\"example.net\",\n  or \"example.org\" into your web browser."]
                 [:p {} "These domain names are reserved for use in documentation and are not available \n  for registration. See "
                     [:a {:shape "rect", :href "http://www.rfc-editor.org/rfc/rfc2606.txt"} "RFC \n  2606"]
                     ", Section 3."]]]

Author

clj-tagsoup was written by Daniel Janus.
