This wrapper tries to make it easy to pull out interesting bits of large XML files in a space-efficient way.
Almost every XML file I ever look at seems to be large (>1GB) and oriented around records (MARC records, Solr documents, etc.). Generally all I want to do is:
- Read the next record
- Pull out a few of its fields
- Throw the rest away, rinse and repeat
so that's what this does, using XOM and a queue behind the scenes.
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
</lst>
<result name="response" numFound="170" start="0">
<doc>
<arr name="author"><str>Hardie & Gorman Pty. Ltd</str></arr>
<arr name="callnumber"><str>MAP FOLDER 176, LFSP 2757</str></arr>
<arr name="title"><str>Important clearance sale of valuable city, suburban & country properties, at upset prices : including investment, residential and vacant properties, building, farming, agricultural, and grazing lands ... Auction sale, Wednesday, 8th December, 1897 / Hardie & Gorman, Auctioneers</str></arr>
</doc>
<doc>
<arr name="author"><str>Hamburg, Harold G</str></arr>
<arr name="callnumber"><str>PAM BOX 429</str></arr>
<arr name="title"><str>The legislation of Richard III</str></arr>
</doc>
...
</result>
</response>
to extract a list of authors:
(with-open [rdr (clojure.java.io/reader (java.net.URL. "http://my.host/solr/select?q=my+query"))]
(doall (apply concat
(xml-picker-seq.core/xml-picker-seq
rdr
"doc"
(xml-picker-seq.core/xpath-query "arr[@name='author']/str")))))
This:
-
Walks through the document parsing (and loading into memory) one full "doc" node at a time
-
Uses XPath to pull out all of the str values for "author"
-
Gets the string values for each author node and returns them
Slightly more fiddly because of the namespaces, but much the same:
(with-open [rdr (clojure.java.io/reader "http://www.loc.gov/standards/marcxml/xml/collection.xml")]
(let [context (nu.xom.XPathContext. "marc" "http://www.loc.gov/MARC21/slim")
titles (xml-picker-seq.core/xml-picker-seq
rdr "record"
(xml-picker-seq.core/xpath-query "//marc:datafield[@tag = '245']/marc:subfield[@code = 'a']"
:context context :final-fn first))]
(doseq [title titles]
(do-something-with title))))
(ns kml-parser.core
(:use [xml-picker-seq.core :only [xml-picker-seq xpath-query]]
[clojure.java.io :only [reader]])
(:require [clojure.string :as string]))
(defn read-kml [filename]
(with-open [rdr (reader filename)]
(doall (xml-picker-seq rdr "Placemark"
(juxt (xpath-query "*[local-name()='Point']/*[local-name()='coordinates']" :final-fn first)
(xpath-query "*[local-name()='description']" :final-fn first
:extract-fn #(-> % .getValue (string/split #"\n"))))))))
- Replace
clojure.contrib.seq-utils/fill-queue
, contrib is depreciated.
Copyright (C) 2009 Mark Triggs
Distributed under the Eclipse Public License. See the file "COPYING".