
Welcome to Sinew

Sinew collects structured data from web sites (screen scraping). It provides a Ruby DSL built for crawling, a robust caching system, and integration with Nokogiri. Though small, this project is the culmination of years of effort based on crawling systems built at several different companies.

Sinew requires Ruby 1.9, HTML Tidy and Curl.

Sinew is distributed as a ruby gem:

gem install sinew

Example

Here's an example that collects Amazon's bestseller list:

# get the url
get "http://www.amazon.com/gp/bestsellers/books/ref=sv_b_3"

# use nokogiri to find books
noko.css(".zg_itemRow").each do |item|
  # pull out the stuff we care about using nokogiri
  row = { }
  row[:url] = item.css(".zg_title a").first[:href]
  row[:title] = item.css(".zg_title")
  row[:img] = item.css(".zg_itemImage_normal img").first[:src]

  # append a row to the csv
  csv_emit(row)
end

If you paste this into a file called bestsellers.sinew and run sinew bestsellers.sinew, it will create a bestsellers.csv file containing the url, title and img for each bestseller.

How it Works

There are three main features provided by Sinew.

The Sinew DSL

Sinew uses recipe files to crawl web sites. Recipes have the .sinew extension, but they're plain old Ruby. The Sinew DSL makes crawling easy. Use get to make an HTTP GET:

get "https://www.google.com/search?q=darwin"
get "https://www.google.com/search", q: "charles darwin"

Once you've done a get, you have access to the document in four different formats (see Parsing for details). In general, it's easiest to use noko to automatically parse and interact with the results. If Nokogiri isn't appropriate, you can fall back to regular expressions run against raw, html, or clean.

get "https://www.google.com/search?q=darwin"

# pull out the links with nokogiri
links = noko.css("a").map { |i| i[:href] }
puts links.inspect

# or, use a regex to grab every href
links = clean.scan(/<a[^>]+href="([^"]+)"/).flatten
puts links.inspect

CSV Output

Recipes output CSV files. To continue the example above:

get "https://www.google.com/search?q=darwin"
noko.css("a").each do |i|
  row = { }
  row[:href] = i[:href]
  row[:text] = i.text
  csv_emit row
end

Sinew creates a CSV file with the same name as the recipe, and csv_emit(hash) appends a row. The values of your hash are converted to strings:

  1. Nokogiri nodes are converted to text
  2. Arrays are joined with "|", so you can separate them later
  3. HTML tags, entities and non-ASCII characters are removed
  4. Whitespace is squished
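The four rules above can be sketched in plain Ruby. This is an illustrative approximation, not Sinew's actual implementation, and rule 3 is simplified here to tag stripping only:

```ruby
# Illustrative sketch of csv_emit's value conversion - NOT Sinew's real code.
# Rule 3 is simplified: only tags are stripped, not entities or non-ASCII.
def csv_value(value)
  value = value.text if value.respond_to?(:text) # 1. Nokogiri nodes => text
  value = value.join("|") if value.is_a?(Array)  # 2. arrays joined with "|"
  value = value.to_s.gsub(/<[^>]+>/, "")         # 3. strip HTML tags
  value.gsub(/\s+/, " ").strip                   # 4. squish whitespace
end

puts csv_value(%w[Charles Darwin])          # => "Charles|Darwin"
puts csv_value("  <b>On the\n Origin</b> ") # => "On the Origin"
```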

Caching

Requests are made using curl, and all responses are cached on disk in ~/.sinew. Error responses are cached as well. Each URL will be hit exactly once, and requests are rate limited to one per second. Sinew tries to be polite.

The files in ~/.sinew have nice names and are designed to be human readable. This helps when writing recipes. Sinew never deletes files from the cache - that's up to you!

Because all requests are cached, you can run Sinew repeatedly with confidence. Run it over and over again while you build up your recipe.

DSL Reference

Making requests

  • get(url) - fetch a url with HTTP GET. You can also pass in a parameter Hash, which will be appended to the URL
  • post(url) - fetch a url with HTTP POST. You can also pass in a parameter Hash, which will be form URL encoded and posted as the request body.
  • url - the url of the last request. If the request goes through a redirect, url will reflect the final url.
  • uri - the Ruby URI of the last request. This is useful for resolving relative URLs:
imgs = noko.css("img").map { |i| i[:src] }
imgs = imgs.map { |i| (uri + i).to_s }

Parsing

  • raw - the raw HTML from the last request
  • html - Raw HTML processed through Tidy. JavaScript is removed and whitespace is normalized, which makes for easier parsing.
  • clean - Same as html, but with most attributes removed. Sometimes this is easier to parse than html.
  • noko - a Nokogiri document built from the tidied HTML

Writing

  • csv_header(keys) - specify the columns for CSV output. If you don't call this, Sinew will use the keys from the first call to csv_emit.
  • csv_emit(hash) - append a row to the CSV file
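csv_header is handy when different pages yield different keys: it pins the column order, and rows missing a key get a blank cell. Here's a rough stdlib sketch of that behavior (illustrative only, not Sinew's implementation, with made-up column names):

```ruby
require "csv"

# Fix the columns up front, like csv_header(:url, :title, :img) would.
header = [:url, :title, :img]
rows = [{ title: "On the Origin of Species", url: "http://example.com/darwin" }]

csv = CSV.generate do |out|
  out << header
  # Each emitted row is laid out in header order; missing keys are blank.
  rows.each { |row| out << header.map { |key| row[key] } }
end

puts csv
```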

Hints

Writing Sinew recipes is fun and easy. The builtin caching means you can iterate quickly, since you won't have to re-fetch the data. Here are some hints for writing idiomatic recipes:

  • Sinew doesn't (yet) check robots.txt - please check it manually.
  • Prefer Nokogiri over regular expressions wherever possible. Learn CSS selectors.
  • In Chrome, $$ in the console is your friend.
  • Fall back to regular expressions if you're desperate. Depending on the site, run your regex against raw, html, or clean. html is probably your best bet. raw is good for scraping data buried in JavaScript, but it's fragile if the site changes.
  • Learn to love String#[regexp], which is an obscure operator but incredibly handy for Sinew.
  • Laziness is useful. Keep your CSS selectors and regular expressions simple, so maybe they'll work again the next time you need to crawl a site.
  • Don't be afraid to mix CSS selectors, regular expressions, and Ruby:
noko.css("table")[4].css("td").select { |i| i[:width].to_i > 80 }.map(&:text)
  • Debug your recipes using plain old puts, or better yet use ap from awesome_print.
  • Run sinew -v to get a report on every csv_emit. Very handy.
  • Add the CSV files to your git repo. That way you can version them and get diffs!
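Since String#[regexp] comes up so often in recipes, here's a quick standalone refresher in plain Ruby:

```ruby
html = '<a href="https://example.com/darwin">Darwin</a>'

# String#[regexp, capture] returns the capture group, or nil on no match.
href = html[/href="([^"]+)"/, 1]  # => "https://example.com/darwin"
text = html[/>([^<]+)</, 1]       # => "Darwin"
miss = html[/data-id="(\d+)"/, 1] # => nil - no exception, just nil

puts href
puts text
```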

Limitations

  • Requires Ruby 1.9, HTML Tidy and Curl
  • Almost no support for international (non-English) characters

Changelog

Master (unreleased)

  • TBD!

1.0.3

  • Friendlier message if curl or tidy are missing.

1.0.2

  • Remove entity options from tidy, which didn't work on MacOS (thanks Rex!)

1.0.1

  • Trying to run on 1.8 produces a fatal error. Onward!
  • Added first batch of unit tests