Skip to content

philcali/puppet-master

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

21 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Puppet Master

A scripted LMXML HTTP client that uses dispatch under the covers.

This is the web crawler to rule them all.

Motivation

Every developer needs to write a screenscraper application once in a while. The problem scope is usually a unique one (like download a series of mp3's, for example). The actions that fulfill that problem scope could have been rewritten as a series of miniature, composable actions. That's what puppet master does.

Puppet Master uses open source technologies to transform an LMXML file containing a series of recursive actions, to do something with a headless, automated browser. The program to download all of the html files from a specific web page link is now contained in an LMXML file.

Configuring

A puppet client is simply a configured AsyncHttpClient. Here's an example LMXML configuring a client to follow redirects, keep alive's, and setting the user agent to Mozilla:

browser
  config
    user-agent "Mozilla/5.0"
    follow-redirects "true"
    keep-alive "true"

Note: The configured client can be stored in a properties file and loaded from there, via browser @file="config.properties".

Actions

A puppet action is simply a transformation function that transforms a parsed node into controlled action on the configured browser. All actions are written recursively under the instructions node.

browser @file="config.properties"
  instructions
    go @to="http://google.com" println

Puppet master ships with quite a few actions:

  • go: makes a GET request
  • println: prints out the source to stdout.
  • find: finds specific html nodes by css, and loads them in find-results
  • set: sets a value to be used later in the script context
  • each: iterates over a collection of find-results
  • download: writes response to a file

Note: There is a plan for a lot more default actions.

Instructions can be programmatically aliased, removed, or added.

import actions._

val instructions = Instructions("get" -> GoAction)

val promises = Lmxml.fromFile("instructions.lmxml")(instructions)

promises foreach (_ foreach println)

About

Scripted LMXML HTTP client (puppet)

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages