A Clojure wrapper for reading large XML files with XOM
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Type Name Latest commit message Commit time
Failed to load latest commit information.



This wrapper tries to make it easy to pull out interesting bits of large XML files in a space-efficient way.

Almost every XML file I ever look at seems to be large (>1GB) and oriented around records (MARC records, Solr documents, etc.). Generally all I want to do is:

  • Read the next record
  • Pull out a few of its fields
  • Throw the rest away, rinse and repeat

so that's what this does, using XOM and a queue behind the scenes.

Example usage

Parsing a Solr response

  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  <result name="response" numFound="170" start="0">
      <arr name="author"><str>Hardie &amp; Gorman Pty. Ltd</str></arr>
      <arr name="callnumber"><str>MAP FOLDER 176, LFSP 2757</str></arr>
      <arr name="title"><str>Important clearance sale of valuable city, suburban &amp; country properties, at upset prices : including investment, residential and vacant properties, building, farming, agricultural, and grazing lands ... Auction sale, Wednesday, 8th December, 1897 / Hardie &amp; Gorman, Auctioneers</str></arr>
      <arr name="author"><str>Hamburg, Harold G</str></arr>
      <arr name="callnumber"><str>PAM BOX 429</str></arr>
      <arr name="title"><str>The legislation of Richard III</str></arr>

to extract a list of authors:

(with-open [rdr (clojure.java.io/reader (java.net.URL. "http://my.host/solr/select?q=my+query"))]
  (doall (apply concat
                (xml-picker-seq.core/xpath-query "arr[@name='author']/str")))))


  • Walks through the document parsing (and loading into memory) one full "doc" node at a time

  • Uses XPath to pull out all of the str values for "author"

  • Gets the string values for each author node and returns them

Parsing MARCXML records

Slightly more fiddly because of the namespaces, but much the same:

(with-open [rdr (clojure.java.io/reader "http://www.loc.gov/standards/marcxml/xml/collection.xml")]
  (let [context (nu.xom.XPathContext. "marc" "http://www.loc.gov/MARC21/slim")
        titles (xml-picker-seq.core/xml-picker-seq
                rdr "record"
                (xml-picker-seq.core/xpath-query "//marc:datafield[@tag = '245']/marc:subfield[@code = 'a']"
                                                 :context context :final-fn first))]
    (doseq [title titles]
      (do-something-with title))))

Parsing KML files

(ns kml-parser.core
  (:use [xml-picker-seq.core :only [xml-picker-seq xpath-query]]
        [clojure.java.io :only [reader]])
  (:require [clojure.string :as string]))

(defn read-kml [filename]
  (with-open [rdr (reader filename)]
    (doall (xml-picker-seq rdr "Placemark"
                           (juxt (xpath-query "*[local-name()='Point']/*[local-name()='coordinates']" :final-fn first)
                                 (xpath-query "*[local-name()='description']" :final-fn first
                                              :extract-fn #(-> % .getValue (string/split #"\n"))))))))


  • Replace clojure.contrib.seq-utils/fill-queue, contrib is depreciated.


Copyright (C) 2009 Mark Triggs

Distributed under the Eclipse Public License. See the file "COPYING".