How To: Scrape Web Pages
What's special about scraping Web pages that's different from log processing?
It is I/O bound. I/O bound means that the job at hand depends not only on your machine, and specifically your CPU; most commonly it means that you depend on other people's machines. And, well, other people's machines suck.
- Latency
- Transfer speed
- Failures
- Corruption
All these influence how your job will be executed.
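To make the list above concrete, here is a minimal sketch of how a single fetch attempt might be sorted into those outcomes. It is not part of Sneakers: `classify_fetch` and the `fetcher` callable are hypothetical names used for illustration, the fetcher is injected so the example runs without a network, and the "corruption" check (empty body or invalid encoding) is a deliberately crude assumption.

```ruby
require 'timeout'

# Classify one fetch attempt as :ok, :timeout, :failure or :corrupt.
# `fetcher` is a hypothetical callable that takes a URL and returns a body.
def classify_fetch(url, fetcher:, timeout_s: 1)
  # Latency and slow transfers both show up as the call taking too long.
  body = Timeout.timeout(timeout_s) { fetcher.call(url) }

  # Crude corruption check (an assumption): nothing came back,
  # or the bytes are not valid in the string's encoding.
  return :corrupt if body.nil? || body.empty? || !body.valid_encoding?

  :ok
rescue Timeout::Error
  :timeout
rescue StandardError
  # Connection resets, DNS errors, HTTP errors and friends.
  :failure
end

slow   = ->(_url) { sleep 0.5; "<html></html>" }
broken = ->(_url) { raise IOError, "connection reset" }

puts classify_fetch("http://example.com", fetcher: slow, timeout_s: 0.05)
puts classify_fetch("http://example.com", fetcher: broken)
```

In a real worker, each outcome would typically map to a different decision about the message (acknowledge it, reject it, or retry later).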
If you used a single thread or a single process, it would naively block while waiting for I/O, or fail, or both block and fail and present you with corrupt data :(.
If each request took 1 second, you would have a 1 req/s pipeline on your hands.
In Ruby, there's no silver bullet other than making more pipelines (let's ignore evented frameworks for now): more threads or more processes. Sneakers is designed to scale both.
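The effect of "more pipelines" can be seen in a small sketch that uses plain Ruby threads rather than Sneakers. Everything here is illustrative: `fake_fetch` simulates a request with `sleep`, the `LATENCY` value and the URLs are made up, and no real HTTP traffic is generated.

```ruby
# Simulated I/O-bound work: sleep stands in for waiting on the network.
LATENCY = 0.05 # seconds per simulated request (assumed value)

def fake_fetch(url)
  sleep LATENCY # the thread is parked here, as it would be on a socket read
  "body of #{url}"
end

# Run the block and return how many seconds it took.
def elapsed
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# One pipeline: each request waits for the previous one to finish.
def time_sequential(urls)
  elapsed { urls.each { |url| fake_fetch(url) } }
end

# One pipeline per URL: the waits overlap instead of adding up.
def time_threaded(urls)
  elapsed { urls.map { |url| Thread.new { fake_fetch(url) } }.each(&:join) }
end

URLS = Array.new(8) { |i| "http://example.com/page/#{i}" }

puts format("sequential: %.2fs", time_sequential(URLS))
puts format("threaded:   %.2fs", time_threaded(URLS))
```

The sequential run takes roughly eight times the latency, while the threaded run stays close to a single latency, because blocked threads cost almost nothing while they wait.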
So same as with the How To: Do Log Processing example, let's outline a worker:
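require 'sneakers'
require 'open-uri'
require 'nokogiri'
require 'sneakers/metrics/logging_metrics'

class WebScraper
  include Sneakers::Worker
  from_queue :web_pages

  # msg is expected to be the URL of the page to scrape.
  def work(msg)
    # URI.open comes from open-uri; a bare Kernel#open no longer
    # fetches URLs on Ruby 3.0 and later.
    doc = Nokogiri::HTML(URI.open(msg))
    page_title = doc.css('title').text
    worker_trace "Found: #{page_title}"
    ack!
  end
end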
However, since this worker does I/O, it will by default open up 25 threads for us. What if we want more?
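require 'sneakers'
require 'open-uri'
require 'nokogiri'
require 'sneakers/metrics/logging_metrics'

class WebScraper
  include Sneakers::Worker
  from_queue :web_pages,
             :threads => 50,          # threads doing I/O concurrently
             :prefetch => 50,         # at least as many as :threads
             :timeout_job_after => 1  # seconds before a job is timed out

  # msg is expected to be the URL of the page to scrape.
  def work(msg)
    # URI.open comes from open-uri; a bare Kernel#open no longer
    # fetches URLs on Ruby 3.0 and later.
    doc = Nokogiri::HTML(URI.open(msg))
    page_title = doc.css('title').text
    worker_trace "Found: #{page_title}"
    ack!
  end
end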
This means we set up 50 threads that will all do I/O for us at the same time. A good practice is to set a prefetch policy against RabbitMQ of at least the number of threads involved, so that no thread sits idle waiting for a message.
We also want to time out very quickly; a timeout of 1 second means a thread can be held up for at most 1 second, so this whole setup will still get through 50 req/s at worst (the worst case being every job failing and timing out on us).
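The arithmetic behind that worst-case figure can be written down as a tiny helper. This is only a back-of-the-envelope sketch: `worst_case_jobs_per_second` and its `processes` parameter are hypothetical names, and it assumes every job is held for the full timeout and nothing else limits throughput.

```ruby
# Floor on throughput when every single job runs into the timeout:
# each thread frees up (by failing) once per timeout period.
def worst_case_jobs_per_second(threads:, timeout_s:, processes: 1)
  (threads * processes) / timeout_s.to_f
end

# The configuration discussed above: 50 threads, 1 second timeout.
puts worst_case_jobs_per_second(threads: 50, timeout_s: 1) # prints 50.0
```

Halving the timeout or doubling the thread count doubles this floor, which is why both knobs matter for a scraper that talks to slow hosts.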
If you are thinking of adding a persistence layer here (for example, to save the page titles), note that Sneakers opens up this many threads and processes and, unless you opt in for connection sharing, this may cause high contention for the data store client used. If the data store client supports connection pooling and/or tunable concurrency, those settings may need adjusting to match the concurrency level used by Sneakers.
Finding suitable values is often a matter of trial and error.
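To illustrate what pooling buys you, here is a minimal sketch of 50 threads sharing a handful of connections. None of this is Sneakers API: `TinyPool`, `FakeConnection` and `run_pool_demo` are hypothetical names, the "data store" is simulated with `sleep`, and a real application would more likely use the pooling built into its data store client or a dedicated pooling library.

```ruby
# A tiny fixed-size pool built on Ruby's thread-safe Queue.
class TinyPool
  def initialize(size)
    @slots = Queue.new
    size.times { |i| @slots << yield(i) }
  end

  # Check a connection out, hand it to the block, and always return it.
  def with
    conn = @slots.pop # blocks while every connection is checked out
    yield conn
  ensure
    @slots << conn if conn
  end
end

# Stand-in for a data store client; save just sleeps briefly.
FakeConnection = Struct.new(:id) do
  def save(title)
    sleep 0.01 # simulated round trip to the data store
    "conn #{id} saved #{title}"
  end
end

# Run `threads` workers against `pool_size` connections and
# return the highest number of connections in use at once.
def run_pool_demo(threads: 50, pool_size: 5)
  pool = TinyPool.new(pool_size) { |i| FakeConnection.new(i) }
  lock = Mutex.new
  in_use = 0
  peak = 0

  workers = Array.new(threads) do |n|
    Thread.new do
      pool.with do |conn|
        lock.synchronize { in_use += 1; peak = [peak, in_use].max }
        conn.save("title #{n}")
        lock.synchronize { in_use -= 1 }
      end
    end
  end
  workers.each(&:join)
  peak
end

puts "peak connections in use: #{run_pool_demo}"
```

The pool size is the knob to experiment with: too small and the worker threads queue up behind the data store, too large and the data store sees as much contention as it would without a pool.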