Whitman is a simple service for sampling data from a JSON-based web API over time. It is inspired by two other more specific projects, Chameleon and Karmanaut. Chameleon was designed to sample Stack Overflow users' reputation over time; Karmanaut was designed to sample Reddit users' comment and link karma scores over time. Whitman, on the other hand, is designed to be more general. It can sample data from any JSON-based web API over time, storing the results in a MongoDB database, by simply creating and editing a JSON configuration file that declares the structure of the target web API, which data should be sampled from it, and how it should be stored.
Right now, Whitman is geared towards sampling a piece of data for a set of users of an API (for example, Stack Overflow reputation or Reddit karma). It may be generalized in the future to sample data points that are not user-specific, but it is currently not designed to do that.
Before you can sample data, you must first create the universe...or at least an
executable JAR file. You can build a JAR file containing all necessary .class
files with lein:
$ lein uberjar
-
Create a MongoDB database (you may name it whatever you want).
-
Populate the
users
collection in your MongoDB database with the IDs of the users whose data you want to record, in the following format:{_id: <user ID>}
Next, you must create the crawler configuration file. Some examples are given in
the doc/
directory. Configuration files are in JSON. They may contain the
following keys (not all keys are necessary, and if they are not present in your
configuration, defaults will be used; if a default is not specified, the key is
required):
Key | Purpose | Default |
---|---|---|
connection |
Host and port number of MongoDB server | localhost:27017 |
database |
MongoDB database name | Required |
collection |
Name of MongoDB collection where samples should be stored | Required |
user-agent |
HTTP User-Agent to use when making HTTP requests | whitman/<VERSION> |
source |
URL from which samples should be pulled. It should contain one
parameter that should be substituted with a user (or record) ID; this
parameter can be denoted with the placeholder %s . |
Required |
records |
MongoDB keypath where records that should be sampled are pulled | Required |
data |
Data points that should be crawled | (See next table) |
The data
specifies the data that should be crawled. It takes the
following keys:
Key | Purpose | Default |
---|---|---|
path
| The keypath to the data that should be recorded from the JSON response. | Required |
key |
Document key that the sampled data should be stored under | Required |
Keypaths are period-separated paths specifying how the crawler should store or retrieve various pieces of data. They are currently used in two ways: When specifying the records, stored in MongoDB, that should be polled for data, and for describing how to retrieve data samples from a JSON response.
The first keypath specifies what records should be crawled. It is in the
format <collection>.<field>
and specifies the MongoDB collection and field
that are used to construct a crawlable URL for a given record. For example,
if you want to crawl a list of users stored in a users
collection, with their
IDs specified in the _id
field, the keypath would be users._id
.
Keypaths are also used to describe how data should be sampled from a JSON
response. They are a period-separated path to the key that should be sampled.
For example, say you wanted to sample link_karma
from the following API
response:
{
"data": {
"link_karma": 1000
}
}
The keypath would be data.link_karma
.
If a key is an array, you can specify an element of the array using an integer.
For example, if you wanted to sample reputation
from the following API
response:
{
"items": [
{
"reputation": 150000
}
]
}
You would use the keypath items.0.reputation
.
Once you have built Whitman, you can run it using
$ java -jar target/uberjar/whitman-standalone.jar path/to/config.json
where path/to/config.json
is, of course, the path to your JSON configuration
file.