Skip to content
Attentive RSS/Atom feed parser for Clojure
Branch: master
Clone or download
Fetching latest commit…
Cannot retrieve the latest commit at this time.
Permalink
Type Name Latest commit message Commit time
Failed to load latest commit information.
art
src
.gitignore
CHANGELOG.md
LICENSE
Makefile
README.md
VERSION
project.clj

README.md

Remus

An attentive RSS and Atom feed parser for Clojure. It's built on top of well-known and powerful ROME Tools Java library. It deals with encoding issues and non-standard XML tags. Remus aimed at extracting as much information from a feed as possible.

Table of Contents

Benefits

  • Fetches all the known fields from a feed and turns them into plain Clojure data structures;
  • relies on up-to-date ROME release;
  • uses all the power of clj-http for HTTP(s) communication instead of deprecated ROME Fetcher;
  • preserves all the non-standard XML tags for further processing (see example below).

Installation

Add it into your :dependencies vector:

[remus "0.1.0-SNAPSHOT"]

Usage

First, import the library:

(ns your.project
  (:require [remus :refer [parse-url parse-file]]))

or:

(require '[remus :refer [parse-url parse-file]])

Parsing a URL

(def result (parse-url "http://planet.clojure.in/atom.xml"))

The variable result is a map with two keys: :response and :feed. These are HTTP response and a parsed feed. Here is a truncated version of a feed:

(def feed (:feed result))

(println feed)

;;;;
;; just a small subset
;;;;

{:description nil,
 :feed-type "atom_1.0",
 :entries
 [{:description nil,
   :updated-date #inst "2018-08-13T10:00:00.000-00:00",
   :extra {:tag :extra, :attrs nil, :content ()},
   :title
   "PurelyFunctional.tv Newsletter 287: DataScript, GraphQL, CRDTs",
   :author "Eric Normand",
   :link
   "https://purelyfunctional.tv/issues/purelyfunctional-tv-newsletter-287-datascript-graphql-crdts/",
   :uri "https://purelyfunctional.tv/?p=28660",
   :contents
   ({:type "html",
     :mode nil,
     :value
     "<div class=\" reset\">\n<p><em>Issue 287 August 13, 2018 <a href=\"https://purelyfunctional.tv/newsletter-archives/\">Archives</a> <a href=\"https://purelyfunctional.tv/newsletter/\" title=\"Thanks, Jeff!\">Subscribe</a></em></p>\n<p>Hi Clojurationists,</p>\n<p>I've just been digging <a href=\"https://twitter.com/puredanger/status/1028103654241443840\" title=\"\">this lovely tweet from Alex Miller</a>.</p>\n<p>Rock on!<br /><a href=\"http://twitter.com/ericnormand\">Eric Normand</a> &lt;<a href=\"mailto: ... "}),
 :published-date #inst "2018-08-13T11:59:11.000-00:00",
 :entry-links
 ({:rel "alternate",
   :href "http://planet.clojure.in/",
   :length 0}
  {:rel "self",
   :href "http://planet.clojure.in/atom.xml",
   :length 0}),
 :title "Planet Clojure",
 :language nil,
 :link "http://planet.clojure.in/",
 :uri "http://planet.clojure.in/atom.xml",
 :authors ()}

As for HTTP response, it's the same data structure that clj-http.client/response function returns. You'll need that data to save some of the HTTP headers for further requests (see below).

Parsing a file

(def feed (parse-file "/path/to/some/atom.xml"))

This function returns a parsed feed.

Parsing an input stream

(def feed (parse-stream (clojure.java.io/input-stream some-source)))

Like parse-file, it returns a parsed feed as a data structure.

HTTP communication tweaks

Since Remus relies on clj-http library for HTTP communication, you are welcome to use all its features. For example, to control redirects, security validation, authentication, etc. When calling parse-url, pass an optional map with HTTP parameters:

;; Do not check an untrusted SSL certificate.
(parse-url "http://planet.clojure.in/atom.xml"
           {:insecure? true})


;; Parse a user/pass protected HTTP resource.
(parse-url "http://planet.clojure.in/atom.xml"
           {:basic-auth ["username" "password"]})


;; Pretending being a browser. Some sites protect access by "User-Agent" header.
(parse-url "http://planet.clojure.in/atom.xml"
           {:headers {"User-Agent" "Mozilla/5.0 (Macintosh; Intel Mac...."}})

Remus overrides just two options: :as and :throw-exceptions. No matter if you pass them, their values will become :stream and true. We use streamed HTTP response since ROME works better with raw binary content rather then parsed string. We throw exceptions for non-200 responses to prevent parsing their content.

Here is how you can access the negative HTTP response:

(try
  (parse-url "http://non-existing-url")
  (catch ExceptionInfo e
    (let [response (ex-data e)
          {:keys [status headers]} response]
      (println status headers)
      ;; do anything you want
      )))

Or you may use the Slingshot approach to catch HTTP-thrown exceptions as the official manual describes.

When parsing a URL, a good option would be to pass If-None-Match and If-Modified-Since headers with values gotten from Etag and Last-Modified headers of a previous response. This trick is know as conditional GET and might prevent server from sending the data you've already received before:

(def result (parse-url "http://planet.lisp.org/rss20.xml"))

(def feed (:feed result))

(def etag (-> result :response :headers :etag))
;; "5b71766f-2f597"

(def last-modified (-> result :response :headers :last-modified))
;; Mon, 13 Aug 2018 12:15:43 GMT

;;;;
;; Now, try to fetch data passing conditionals headers:
;;;;

(def result-new
  (parse-url "http://planet.lisp.org/rss20.xml"
             {:headers {"If-None-Match" etag
                        "If-Modified-Since" last-modified}}))

(println result-new)

{:response
 {:request-time 741,
  :repeatable? false,
  :protocol-version {:name "HTTP", :major 1, :minor 1},
  :streaming? false,
  :chunked? false,
  :reason-phrase "Not Modified",
  :headers
  {"Server" "nginx/1.9.0",
   "Date" "Mon, 13 Aug 2018 13:47:12 GMT",
   "Last-Modified" "Mon, 13 Aug 2018 12:15:43 GMT",
   "Connection" "close",
   "ETag" "\"5b71766f-2f597\""},
  :orig-content-encoding nil,
  :status 304,
  :length 0,
  :body nil,
  :trace-redirects []},
 :feed nil}

Since the server returned non-200 but positive status code (304 in our case), we don't parse the response at all. So the :feed field in the result-new variable will be nil.

Non-standard tags

Sometimes, RSS/Atom feeds ship additional data in non-standard tags. A good example might be a typical YouTube feed. Let's examine one of its entries:

<entry>
  <id>yt:video:TbthtdBw93w</id>
  <yt:videoId>TbthtdBw93w</yt:videoId>
  <yt:channelId>UCaLlzGqiPE2QRj6sSOawJRg</yt:channelId>
  <title>Datomic Ions in Seven Minutes</title>
  <link rel="alternate" href="https://www.youtube.com/watch?v=TbthtdBw93w"/>
  <author>
    <name>ClojureTV</name>
    <uri>
      https://www.youtube.com/channel/UCaLlzGqiPE2QRj6sSOawJRg
    </uri>
  </author>
  <published>2018-07-03T21:16:16+00:00</published>
  <updated>2018-08-09T16:29:51+00:00</updated>
  <media:group>
    <media:title>Datomic Ions in Seven Minutes</media:title>
    <media:content url="https://www.youtube.com/v/TbthtdBw93w?version=3" type="application/x-shockwave-flash" width="640" height="390"/>
    <media:thumbnail url="https://i1.ytimg.com/vi/TbthtdBw93w/hqdefault.jpg" width="480" height="360"/>
    <media:description>
      Stuart Halloway introduces Ions for Datomic Cloud on AWS.
    </media:description>
    <media:community>
      <media:starRating count="67" average="5.00" min="1" max="5"/>
      <media:statistics views="1977"/>
    </media:community>
  </media:group>
</entry>

In addition to the standard fields, the feed carries information about the video ID, channel ID and statistics: views count, the number of times the video was starred and its average rating.

Or if you parse geo-related feeds, probably you'll face lat/lot coordinates, location names, etc.

Other RSS parsers either drop this data or require you to write a custom extension. But Remus squashes all the non-standard tags into a parsed XML data. It puts that data into an :extra field for each entry and on the top level of a feed.

This is how you can touch the extra data:

(def result (parse-url
"https://www.youtube.com/feeds/videos.xml?channel_id=UCaLlzGqiPE2QRj6sSOawJRg"))

(def feed (:feed result))

;;;;
;; Get entry-specific custom data
;;;;

(-> feed :entries first :extra)

{:tag :extra,
 :attrs nil,
 :content
 ({:tag :yt/videoId, :attrs nil, :content "TbthtdBw93w"}
  {:tag :yt/channelId,
   :attrs nil,
   :content "UCaLlzGqiPE2QRj6sSOawJRg"}
  {:tag :media/group,
   :attrs nil,
   :content
   ({:tag :media/title,
     :attrs nil,
     :content "Datomic Ions in Seven Minutes"}
    {:tag :media/content, :attrs #, :content nil}
    {:tag :media/thumbnail, :attrs #, :content nil}
    {:tag :media/description,
     :attrs nil,
     :content
     "Stuart Halloway introduces Ions for Datomic Cloud on AWS."}
    {:tag :media/community, :attrs nil, :content #})})}


;;;;
;; Get feed-specific custom data
;;;;

(-> feed :extra)

{:tag :extra,
 :attrs nil,
 :content
 ({:tag :yt/channelId,
   :attrs nil,
   :content "UCaLlzGqiPE2QRj6sSOawJRg"})}

The extra data follows XML-friendly format so it can be processed with any XML-related technics: walking, zippers, etc.

Encoding issues

All the parse-<something> functions mentioned above take additional ROME-related options. Use them to solve XML-decoding issues when dealing with weird or non-set HTTP headers. ROME's got a solid algorithm to guess encoding, but sometimes it might need your help.

At the moment, Remus supports :lenient and :encoding options. The first one indicates if ROME should try to guess encoding in case of XML parsing error. For guessing, it tries to use encoding passed under :encoding key.

Dealing with Windows encoding and unset Content-type/encoding headers:

(parse-url "https://some/rss.xml" nil {:lenient true :encoding "cp1251"})

The same options work for parsing a file or a stream:

(parse-file "https://another/atom.xml" {:lenient true :encoding "cp1251"})
(parse-stream in-source {:lenient true :encoding "cp1251"})

License

Copyright © 2018 Ivan Grishaev

Distributed under the Eclipse Public License either version 1.0 or (at your option) any later version.

You can’t perform that action at this time.
You signed in with another tab or window. Reload to refresh your session. You signed out in another tab or window. Reload to refresh your session.