grubby

Fail-fast web scraping. grubby adds a layer of utility and error-checking atop the marvelous Mechanize gem. See API listing below, or browse the full documentation.

Examples

The following code scrapes stories from the Hacker News front page:

require "grubby"

class HackerNews < Grubby::PageScraper
  scrapes(:items) do
    page.search!(".athing").map{|element| Item.new(element) }
  end

  class Item < Grubby::Scraper
    scrapes(:story_link){ source.at!("a.storylink") }

    scrapes(:story_url){ expand_url(story_link["href"]) }

    scrapes(:title){ story_link.text }

    scrapes(:comments_link, optional: true) do
      source.next_sibling.search!(".subtext a").find do |link|
        link.text.match?(/comment|discuss/)
      end
    end

    scrapes(:comments_url, if: :comments_link) do
      expand_url(comments_link["href"])
    end

    scrapes(:comment_count, if: :comments_link) do
      comments_link.text.to_i
    end

    def expand_url(url)
      url.include?("://") ? url : source.document.uri.merge(url).to_s
    end
  end
end

# The following line will raise an exception if anything goes wrong
# during the scraping process.  For example, if the structure of the
# HTML does not match expectations due to a site change, the script will
# terminate immediately with a helpful error message.  This prevents bad
# data from propagating and causing hard-to-trace errors.
hn = HackerNews.scrape("https://news.ycombinator.com/news")

# Your processing logic goes here:
hn.items.take(10).each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end
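The `expand_url` helper above relies on Ruby's standard-library URI resolution: `URI#merge` resolves a relative href against the page's base URI, while absolute URLs pass through unchanged. A quick stdlib-only illustration (the URLs here are just examples):

```ruby
require "uri"

base = URI("https://news.ycombinator.com/news")

# A relative href is resolved against the base URI's scheme, host, and path.
puts base.merge("item?id=123")
# An absolute URL simply replaces the base.
puts base.merge("https://example.com/story")
```

This is why `expand_url` only calls `merge` when the href lacks a `://` scheme marker; merging is harmless for absolute URLs too, but the guard makes the intent explicit.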

Hacker News also offers a JSON API, which may be more robust for scraping purposes. grubby can scrape JSON just as well:

require "grubby"

class HackerNews < Grubby::JsonScraper
  scrapes(:items) do
    # API returns array of top 500 item IDs, so limit as necessary
    json.take(10).map do |item_id|
      Item.scrape("https://hacker-news.firebaseio.com/v0/item/#{item_id}.json")
    end
  end

  class Item < Grubby::JsonScraper
    scrapes(:story_url){ json["url"] || hn_url }

    scrapes(:title){ json["title"] }

    scrapes(:comments_url, optional: true) do
      hn_url if json["descendants"]
    end

    scrapes(:comment_count, optional: true) do
      json["descendants"]&.to_i
    end

    def hn_url
      "https://news.ycombinator.com/item?id=#{json["id"]}"
    end
  end
end

hn = HackerNews.scrape("https://hacker-news.firebaseio.com/v0/topstories.json")

# Your processing logic goes here:
hn.items.each do |item|
  puts "* #{item.title}"
  puts "  #{item.story_url}"
  puts "  #{item.comment_count} comments: #{item.comments_url}" if item.comments_url
  puts
end
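The fail-fast behavior that both examples depend on can be sketched in plain Ruby. This is a hypothetical mini version for illustration only, not grubby's actual implementation (the real `Grubby::Scraper` also memoizes values, collects errors, and reports them with context):

```ruby
# Hypothetical sketch of the idea behind `scrapes`: each declared
# attribute runs its block, and a nil result from a non-optional
# attribute raises instead of silently propagating bad data.
class MiniScraper
  class Error < RuntimeError; end

  def self.scrapes(name, optional: false, &block)
    define_method(name) do
      value = instance_exec(&block)
      raise Error, "`#{name}` came back nil" if value.nil? && !optional
      value
    end
  end

  attr_reader :source

  def initialize(source)
    @source = source
  end
end

class TitleScraper < MiniScraper
  scrapes(:title){ source[:title] }
  scrapes(:subtitle, optional: true){ source[:subtitle] }
end

good = TitleScraper.new(title: "Hello")
puts good.title             # => "Hello"
puts good.subtitle.inspect  # => nil (optional, so no error)

bad = TitleScraper.new({})
bad.title rescue puts "raised: #{$!.message}"
```

The same principle drives grubby's bang methods such as `search!` and `at!`: where Mechanize and Nokogiri would return nil or an empty set, the bang variants raise, so a site change surfaces as an immediate, descriptive failure.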

Core API

Auxiliary API

grubby loads several gems that extend Ruby objects with utility methods. Some of those methods are listed below. See each gem's documentation for a complete API listing.

Installation

Install the grubby gem directly with gem install grubby, or add gem "grubby" to your Gemfile and run bundle install.

Contributing

Run rake test to run the tests.

License

MIT License