mass html scraping made easy

HappyScraper

HappyScraper makes it easy to write and use large numbers of nibbler-based HTML scrapers.

It features a clean declarative DSL, the option to define scrapers in external files, and a capability-based approach to automatically selecting the correct scraper for a given input.

Example

External definition of a sample scraper (e.g. in scrapers/blog_example_org.rb)

 url  "http://blog.example.org/"

 with "body.blog"
 with "/html/head/meta[@property='og:type' and @content='blog']"

 element :title

 elements ".hentry" => :articles do
   element "h2"     => :headline
   element "a/@href" => :url
 end
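
To illustrate how a definition file like the one above could be turned into data, here is a minimal sketch of a DSL loader built on instance_eval. It is not HappyScraper's actual implementation: the DefinitionBuilder class and its urls, conditions and fields readers are hypothetical names chosen for this example, and the definition is read from an inline string rather than from a file in the scrapers directory.

```ruby
# Hypothetical sketch of a declarative DSL loader; not HappyScraper's real code.
class DefinitionBuilder
  attr_reader :urls, :conditions, :fields

  def initialize
    @urls = []
    @conditions = []
    @fields = []
  end

  # Records a URL capability.
  def url(prefix)
    @urls << prefix
  end

  # Records a selector that must match for the scraper to apply.
  def with(selector)
    @conditions << selector
  end

  # Records a single field, given as :name or as "selector" => :name.
  def element(spec)
    @fields << spec
  end

  # Records a repeated field; nested fields are collected by a fresh builder.
  def elements(spec, &block)
    nested = DefinitionBuilder.new
    nested.instance_eval(&block) if block
    @fields << [spec, nested.fields]
  end
end

# Stand-in for the contents of an external definition file.
source = <<~'RUBY'
  url  "http://blog.example.org/"
  with "body.blog"
  element :title
  elements ".hentry" => :articles do
    element "h2" => :headline
  end
RUBY

builder = DefinitionBuilder.new
builder.instance_eval(source)

p builder.urls
p builder.conditions
p builder.fields
```

Evaluating the file contents against a builder object is what lets the definition stay free of class and method boilerplate: every bare call such as url or element resolves to a method on the builder.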

And here is a Ruby application that uses the scraper above.

require 'happyscraper'
require 'open-uri'

# Load all scrapers from `scrapers` directory.
happy = HappyScraper.load_scrapers( Dir.glob( 'scrapers/*' ) )

url  = 'http://blog.example.org/'
html = URI(url).read

# Based on the 'url' and 'with' capabilities set in scraper definitions
# the correct scraper is automatically selected for scraping.
blog = happy.scrap( html, url )

blog.title                   # => 'blog title'
blog.articles.last.headline  # => 'headline of last article'
blog.articles.last.url       # => 'http://blog.example.org/entry/10'
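
The capability-based selection mentioned in the comments above can be pictured with a small standalone sketch. It is only an assumption about the general idea, not the library's real mechanism: ScraperDef, capable? and select_scraper are hypothetical names, and plain substring checks stand in for the CSS and XPath matching that the 'with' capability would need.

```ruby
# Hypothetical sketch of capability-based selection; not HappyScraper's real code.
ScraperDef = Struct.new(:name, :url_prefix, :markers) do
  # A definition can handle the input if the URL starts with its prefix
  # and every marker string occurs in the HTML.
  def capable?(html, url)
    url.start_with?(url_prefix) && markers.all? { |marker| html.include?(marker) }
  end
end

# Returns the first definition capable of handling the input, or nil.
def select_scraper(definitions, html, url)
  definitions.find { |definition| definition.capable?(html, url) }
end

definitions = [
  ScraperDef.new(:news, "http://news.example.org/", ['<body class="news">']),
  ScraperDef.new(:blog, "http://blog.example.org/", ['<body class="blog">'])
]

html   = '<html><body class="blog"><h1>blog title</h1></body></html>'
chosen = select_scraper(definitions, html, "http://blog.example.org/entry/10")

puts chosen.name
```

Because each definition declares what it can handle, the calling code never has to name a scraper explicitly, which is what keeps a large collection of scrapers manageable.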

For real-life usage, see news.rb and the related scrapers in the examples directory.

Installation with bundler

Add this line to your Gemfile and run bundle install:

gem "happyscraper", :git => "git://github.com/pdg/happyscraper.git"

Credits

HappyScraper is built upon the excellent nibbler gem, which does the heavy lifting.

License

Released under the MIT License. Copyright © 2012, Patrick Das Gupta
