Puppet Master
A scripted LMXML HTTP client that uses dispatch under the covers.
This is the web crawler to rule them all.
Motivation
Every developer needs to write a screen-scraper application once in a while. The problem scope is usually unique (downloading a series of MP3s, for example), but the actions that solve it can be rewritten as a series of small, composable actions. That's what Puppet Master does.
Puppet Master uses open source technologies to transform an LMXML file
containing a series of recursive actions into operations on a headless,
automated browser. A program to download all of the HTML files from a
specific web page link can now be contained in an LMXML file.
Configuring
A puppet client is simply a configured AsyncHttpClient. Here's an
example LMXML file that configures a client to follow redirects, keep
connections alive, and set the user agent to Mozilla:
browser
  config
    user-agent "Mozilla/5.0"
    follow-redirects "true"
    keep-alive "true"
Note: The client configuration can be stored in a properties file and loaded
from there, via browser @file="config.properties".
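To make the properties-file option concrete, here is a minimal sketch of how such a file could be read into settings using only the Java standard library. The ClientSettings type, the key names (borrowed from the LMXML example above), and the defaults are assumptions for illustration; this is not Puppet Master's own loader.

```scala
import java.io.StringReader
import java.util.Properties

// Hypothetical settings holder; not Puppet Master's actual type.
final case class ClientSettings(
  userAgent: String,
  followRedirects: Boolean,
  keepAlive: Boolean
)

object ClientSettings {
  // Reads the assumed keys, falling back to assumed defaults when a key is absent.
  def fromProperties(props: Properties): ClientSettings =
    ClientSettings(
      userAgent = props.getProperty("user-agent", "Mozilla/5.0"),
      followRedirects = props.getProperty("follow-redirects", "false").toBoolean,
      keepAlive = props.getProperty("keep-alive", "false").toBoolean
    )

  // Parses properties text, such as the contents of config.properties.
  def parse(text: String): ClientSettings = {
    val props = new Properties()
    props.load(new StringReader(text))
    fromProperties(props)
  }
}
```

Calling ClientSettings.parse on a file body with one key=value pair per line yields a settings value that could then be applied to whatever client builder is in use.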
Actions
A puppet action is simply a transformation function that turns a parsed
node into a controlled action on the configured browser. All actions are
written recursively under the instructions node.
browser @file="config.properties"
  instructions
    go @to="http://google.com" println
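The idea of actions as recursive transformations over parsed nodes can be sketched roughly as follows. Node, Context, and the two stand-in actions are hypothetical names invented for this illustration, and the go stand-in fabricates a page instead of making a request; Puppet Master's real types and behavior may differ.

```scala
// Hypothetical model of a parsed LMXML node; not the library's actual type.
final case class Node(
  name: String,
  attrs: Map[String, String] = Map.empty,
  children: List[Node] = Nil
)

// State threaded through a script: the current page source and collected output.
final case class Context(source: String = "", output: Vector[String] = Vector.empty)

object ActionSketch {
  // An action turns a node plus the current context into a new context.
  type Action = (Node, Context) => Context

  // Stand-in for a GET request: builds a fake page from the @to attribute, no network access.
  val go: Action = (node, ctx) =>
    ctx.copy(source = "<html>" + node.attrs.getOrElse("to", "") + "</html>")

  // Stand-in for println: records the current source instead of writing to stdout.
  val printSource: Action = (_, ctx) =>
    ctx.copy(output = ctx.output :+ ctx.source)

  val actions: Map[String, Action] = Map("go" -> go, "println" -> printSource)

  // Runs the node's action (if one is registered), then recurses into its
  // children, passing each child the context produced so far.
  def run(node: Node, ctx: Context): Context = {
    val next = actions.get(node.name).map(_(node, ctx)).getOrElse(ctx)
    node.children.foldLeft(next)((acc, child) => run(child, acc))
  }
}
```

In this sketch a nested action sees the result of its parent, which is why println placed under go has a page source to print.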
Puppet master ships with quite a few actions:
go: makes a GET request
println: prints out the source to stdout
find: finds specific HTML nodes by CSS selector, and loads them in find-results
set: sets a value to be used later in the script context
each: iterates over a collection of find-results
download: writes the response to a file
Note: Many more default actions are planned.
Instructions can be programmatically aliased, removed, or added.
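import actions._
// Alias the built-in GoAction under the name "get"
val instructions = Instructions("get" -> GoAction)
// Load the script and run it with the customized instructions
val promises = Lmxml.fromFile("instructions.lmxml")(instructions)
// Print each result once it is available
promises foreach (_ foreach println)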
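The three operations mentioned above (alias, remove, add) can be pictured as plain updates to an immutable name-to-action map. The sketch below is a simplified model under that assumption; RegistrySketch, its Action type, and the sample actions are hypothetical and are not the Instructions API.

```scala
object RegistrySketch {
  // Hypothetical action type: takes an argument and describes what it would do.
  type Action = String => String

  val goAction: Action = url => s"GET $url"
  val downloadAction: Action = path => s"write $path"

  val defaults: Map[String, Action] =
    Map("go" -> goAction, "download" -> downloadAction)

  // Alias: register an existing action under an additional name.
  def alias(reg: Map[String, Action], from: String, to: String): Map[String, Action] =
    reg.get(from) match {
      case Some(action) => reg.updated(to, action)
      case None         => reg
    }

  // Remove: drop a name so scripts can no longer use it.
  def remove(reg: Map[String, Action], name: String): Map[String, Action] =
    reg - name

  // Add: register a new action under a new name.
  def add(reg: Map[String, Action], name: String, action: Action): Map[String, Action] =
    reg.updated(name, action)
}
```

Because every operation returns a new map, a customized set of instructions can be built up without affecting the defaults.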