== Feedzirra

I'd like feedback on the API and on any bugs encountered with feeds in the wild. I've set up a {google group here}[http://groups.google.com/group/feedzirra].

=== Description

Feedzirra is a feed library designed to get and update many feeds as quickly as possible. This includes using libcurl-multi through the taf2-curb[link:http://github.com/taf2/curb/tree/master] gem for faster HTTP GETs, and libxml through nokogiri[link:http://github.com/tenderlove/nokogiri/tree/master] and sax-machine[link:http://github.com/pauldix/sax-machine/tree/master] for faster parsing.

Once you have fetched feeds using Feedzirra, they can be updated using the feed objects. When updating, Feedzirra automatically inserts the etag and last-modified information from the earlier HTTP response headers into the request to lower bandwidth usage, eliminate unnecessary parsing, and make things speedier in general.
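As a rough illustration of that mechanism, the sketch below builds a conditional GET with Ruby's standard Net::HTTP library rather than with Feedzirra itself. The helper name +conditional_get_request+ is hypothetical, the etag and last-modified values are the ones shown in the usage section, and no request is actually sent.

```ruby
require 'net/http'
require 'time'
require 'uri'

# Hypothetical helper (not part of Feedzirra): builds a GET request that lets
# the server answer 304 Not Modified when the feed has not changed.
def conditional_get_request(feed_url, etag, last_modified)
  request = Net::HTTP::Get.new(URI(feed_url))
  # Send back the validators remembered from the previous response.
  request['If-None-Match'] = etag if etag
  request['If-Modified-Since'] = last_modified.httpdate if last_modified
  request
end

request = conditional_get_request(
  "http://feeds.feedburner.com/PaulDixExplainsNothing",
  "GunxqnEP4NeYhrqq9TyVKTuDnh0",
  Time.utc(2009, 1, 31, 22, 58, 16)
)
puts request['If-None-Match']
puts request['If-Modified-Since']
```

A server that honors these headers can reply with an empty 304 response, so there is nothing to download or parse.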

Another feature present in Feedzirra is the ability to create callback functions that get called "on success" and "on failure" when getting a feed. This makes it easy to do things like log errors or update data stores.
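To make the callback idea concrete, here is a minimal stand-alone sketch of how handlers like these could be dispatched on an HTTP status code. It does not use Feedzirra: +dispatch_response+ is a hypothetical name, the example.com URLs are placeholders, and the success handler receives the URL as a stand-in for a parsed feed. The only behavior taken from this README is that a 304 counts as success and that the failure handler receives the URL, response code, header, and body.

```ruby
# Hypothetical dispatcher, for illustration only (not Feedzirra's code).
def dispatch_response(url, code, header, body, handlers)
  if (200..299).cover?(code) || code == 304
    # 304 means "not modified", which is still a successful outcome.
    handlers[:on_success].call(url) if handlers[:on_success]
  else
    handlers[:on_failure].call(url, code, header, body) if handlers[:on_failure]
  end
end

log = []
handlers = {
  :on_success => lambda { |url| log << "ok #{url}" },
  :on_failure => lambda { |url, code, header, body| log << "failed #{url} (#{code})" }
}

dispatch_response("http://example.com/a.xml", 200, "", "<feed/>", handlers)
dispatch_response("http://example.com/b.xml", 304, "", "", handlers)
dispatch_response("http://example.com/c.xml", 404, "", "Not Found", handlers)
puts log
```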

The fetching and parsing logic have been decoupled so that either of them can be used in isolation if you'd prefer not to use everything that Feedzirra offers. However, the code examples below use helper methods in the Feed class that put everything together to make things as simple as possible.

The final feature of Feedzirra is the ability to define custom parsing classes. In truth, Feedzirra could be used to parse much more than feeds. Microformats, page scraping, and almost anything else are fair game.

=== Installation

For now Feedzirra exists only on github. It also has a few gem requirements that are only on github. Before you start you need to have libcurl[link:http://curl.haxx.se/] and libxml[link:http://xmlsoft.org/] installed. If you're on Leopard you have both. Otherwise, you'll need to grab them. Once you've got those libraries, these are the gems that get used: nokogiri, pauldix-sax-machine, taf2-curb (note that this is a fork that lives on github and not the RubyForge version of curb), and pauldix-feedzirra. The feedzirra gemspec has all the dependencies so you should be able to get up and running with the standard github gem install routine:

  gem sources -a http://gems.github.com # if you haven't already
  gem install pauldix-feedzirra

*NOTE:* Some people have been reporting a few issues related to installation. First, the RubyForge version of curb is not what you want. It will not work. Nor will the curl-multi gem that lives on RubyForge. You have to get the taf2-curb[link:http://github.com/taf2/curb/tree/master] fork installed.

If you see this error when doing a require:

  /Library/Ruby/Site/1.8/rubygems/custom_require.rb:31:in `gem_original_require': no such file to load -- curb_core (LoadError)

It means that the taf2-curb gem didn't build correctly. To resolve this, run <tt>git clone git://github.com/taf2/curb.git</tt>, then run <tt>rake gem</tt> in the curb directory, then <tt>sudo gem install pkg/curb-0.2.4.0.gem</tt>. After that you should be good.

If you see something like this when trying to run it:

  NoMethodError: undefined method `on_success' for #<Curl::Easy:0x1182724>
    from ./lib/feedzirra/feed.rb:88:in `add_url_to_multi'

This means that you are requiring curl-multi or the RubyForge version of curb somewhere. You can't use those and need to get the taf2 version up and running.

If you're on Debian or Ubuntu and getting errors while trying to install the taf2-curb gem, it could be because you don't have the latest version of libcurl installed. Do this to fix:

  sudo apt-get install libcurl4-gnutls-dev

Another problem could be if you are running MacPorts and you have libcurl installed through there. You need to uninstall it for curb to work! The version in MacPorts is old and doesn't play nice with curb. If you're running Leopard, you can just uninstall and you should be golden. If you're on an older version of OS X, you'll then need to {download curl}[http://curl.haxx.se/download.html] and build from source. Then you'll have to install the taf2-curb gem again. You might have to perform the manual build described above (git clone, rake gem, gem install).

If you're still having issues, please let me know on the mailing list. Also, {Todd Fisher (taf2)}[link:http://github.com/taf2] is working on fixing the gem install. Please send him a full error report.

=== Usage

{A gist of the following code}[link:http://gist.github.com/57285]

  require 'feedzirra'

  # fetching a single feed
  feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing")

  # feed and entries accessors
  feed.title # => "Paul Dix Explains Nothing"
  feed.url # => "http://www.pauldix.net"
  feed.feed_url # => "http://feeds.feedburner.com/PaulDixExplainsNothing"
  feed.etag # => "GunxqnEP4NeYhrqq9TyVKTuDnh0"
  feed.last_modified # => Sat Jan 31 17:58:16 -0500 2009 # it's a Time object

  entry = feed.entries.first
  entry.title # => "Ruby Http Client Library Performance"
  entry.url # => "http://www.pauldix.net/2009/01/ruby-http-client-library-performance.html"
  entry.author # => "Paul Dix"
  entry.summary # => "..."
  entry.content # => "..."
  entry.published # => Thu Jan 29 17:00:19 UTC 2009 # it's a Time object
  entry.categories # => ["...", "..."]

  # sanitizing an entry's content
  entry.title.sanitize # => returns the title with harmful stuff escaped
  entry.author.sanitize # => returns the author with harmful stuff escaped
  entry.content.sanitize # => returns the content with harmful stuff escaped
  entry.content.sanitize! # => returns content with harmful stuff escaped and replaces original (also exists for author and title)
  entry.sanitize! # => sanitizes the entry's title, author, and content in place (as in, it changes the value to clean versions)
  feed.sanitize_entries! # => sanitizes all entries in place

  # updating a single feed
  updated_feed = Feedzirra::Feed.update(feed)

  # an updated feed has the following extra accessors
  updated_feed.updated? # returns true if any of the feed attributes have been modified. returns false if there are only new entries
  updated_feed.new_entries # a collection of the entry objects that are newer than the latest in the feed before update

  # fetching multiple feeds
  feed_urls = ["http://feeds.feedburner.com/PaulDixExplainsNothing", "http://feeds.feedburner.com/trottercashion"]
  feeds = Feedzirra::Feed.fetch_and_parse(feed_urls)

  # feeds is now a hash with the feed_urls as keys and the parsed feed objects as values. If an error was thrown
  # there will be a Fixnum of the http response code instead of a feed object

  # updating multiple feeds. it expects a collection of feed objects
  updated_feeds = Feedzirra::Feed.update(feeds.values)

  # defining custom behavior on failure or success. note that a return status of 304 (not modified) will call the on_success handler
  feed = Feedzirra::Feed.fetch_and_parse("http://feeds.feedburner.com/PaulDixExplainsNothing",
    :on_success => lambda {|feed| puts feed.title },
    :on_failure => lambda {|url, response_code, response_header, response_body| puts response_body })
  # if a collection was passed into fetch_and_parse, the handlers will be called for each one

  # the behavior for the handlers when using Feedzirra::Feed.update is slightly different. The feed passed into on_success will be
  # the updated feed with the standard updated accessors. on failure it will be the original feed object passed into update

  # Defining custom parsers
  # TODO: the functionality is here, just write some good examples that show how to do this
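Because a multi-feed fetch returns a hash whose values are either feed objects or numeric response codes, callers usually need to separate the two before going further. The sketch below shows one way to do that. It uses a hand-built hash and a stand-in +FakeFeed+ struct instead of calling Feedzirra, so the struct and the sample 404 are assumptions.

```ruby
# Stand-in for a parsed feed object; hypothetical, for illustration only.
FakeFeed = Struct.new(:title)

# Shaped like the result of fetch_and_parse(feed_urls) as described above.
feeds = {
  "http://feeds.feedburner.com/PaulDixExplainsNothing" => FakeFeed.new("Paul Dix Explains Nothing"),
  "http://feeds.feedburner.com/trottercashion" => 404
}

# Response codes are Integers (Fixnum on Ruby 1.8); everything else is a feed.
failed, fetched = feeds.partition { |_url, result| result.is_a?(Integer) }

fetched.each { |url, feed| puts "#{url}: #{feed.title}" }
failed.each  { |url, code| puts "#{url}: failed with HTTP #{code}" }
```

Only the feed objects in +fetched+ would be worth passing on to an update call, since a response code cannot be updated.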

=== Extending

=== Benchmarks

One of the goals of Feedzirra is speed. This includes not only parsing, but fetching multiple feeds as quickly as possible. I ran a benchmark getting 20 feeds 10 times using Feedzirra, rFeedParser, and FeedNormalizer. For more details, the {benchmark code can be found in the project in spec/benchmarks/feedzirra_benchmarks.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/feedzirra_benchmarks.rb].

                        user     system      total        real
  feedzirra         5.170000   1.290000   6.460000 ( 18.917796)
  rfeedparser     104.260000  12.220000 116.480000 (244.799063)
  feed-normalizer  66.250000   4.010000  70.260000 (191.589862)

The result of that benchmark is a bit sketchy because of the network variability. Running 10 times against the same 20 feeds was meant to smooth some of that out. However, there is also a {benchmark comparing parsing speed in spec/benchmarks/parsing_benchmark.rb}[http://github.com/pauldix/feedzirra/blob/7fb5634c5c16e9c6ec971767b462c6518cd55f5d/spec/benchmarks/parsing_benchmark.rb] on an atom feed.

                        user     system      total        real
  feedzirra         0.500000   0.030000   0.530000 (  0.658744)
  rfeedparser       8.400000   1.110000   9.510000 ( 11.839827)
  feed-normalizer   5.980000   0.160000   6.140000 (  7.576140)

There's also a {benchmark that shows the results of using Feedzirra to perform updates on feeds}[http://github.com/pauldix/feedzirra/blob/45d64319544c61a4c9eb9f7f825c73b9f9030cb3/spec/benchmarks/updating_benchmarks.rb] you've already pulled in. I tested against 179 feeds. The first is the initial pull and the second is an update 65 seconds later. I'm not sure how many of them support etag and last-modified, so performance may be better or worse depending on what feeds you're requesting.

                                  user     system      total        real
  feedzirra fetch and parse   4.010000   0.710000   4.720000 ( 15.110101)
  feedzirra update            0.660000   0.280000   0.940000 (  5.152709)
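For a rough sense of scale, the wall-clock times quoted in this section can be turned into ratios. The snippet below only does arithmetic on the numbers printed above. It does not rerun any benchmark, and the variable names are illustrative.

```ruby
# Wall-clock ("real") seconds copied from the benchmark output above.
fetch_real = {
  "feedzirra"       => 18.917796,
  "rfeedparser"     => 244.799063,
  "feed-normalizer" => 191.589862
}
update_real = { "fetch and parse" => 15.110101, "update" => 5.152709 }

# Express each library's fetch time as a multiple of Feedzirra's.
baseline = fetch_real["feedzirra"]
fetch_real.each do |name, seconds|
  puts format("%-16s %.1fx the feedzirra wall-clock time", name, seconds / baseline)
end

# Compare the initial pull of the 179 feeds with the later update.
ratio = update_real["fetch and parse"] / update_real["update"]
puts format("the update took about 1/%.1f of the initial pull", ratio)
```

Given the network variability already mentioned, these ratios are best read as orders of magnitude rather than precise speedups.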

=== TODO

This thing needs to hammer on many different feeds in the wild. I'm sure there will be bugs. I want to find them and crush them. I didn't bother using the test suite for feedparser. I wanted to start fresh.

Here are some more specific TODOs.
* Make a feedzirra-rails gem to integrate feedzirra seamlessly with Rails and ActiveRecord.
* Add support for authenticated feeds.
* Create a super sweet DSL for defining new parsers.
* Test against Ruby 1.9.1 and fix any bugs.
* I'm not keeping track of modified on entries. Should I add this?
* Clean up the fetching code inside feed.rb so it doesn't suck so hard.
* Make the feed_spec actually mock stuff out so it doesn't hit the net.
* Readdress how feeds determine if they can parse a document. Maybe I should use namespaces instead?

=== LICENSE

(The MIT License)

Copyright (c) 2009:

{Paul Dix}[http://pauldix.net]

Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
'Software'), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:

The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED 'AS IS', WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.