Whitman

Whitman is a simple service for sampling data from a JSON-based web API over time. It is inspired by two other more specific projects, Chameleon and Karmanaut. Chameleon was designed to sample Stack Overflow users' reputation over time; Karmanaut was designed to sample Reddit users' comment and link karma scores over time. Whitman, on the other hand, is designed to be more general. It can sample data from any JSON-based web API over time, storing the results in a MongoDB database, by simply creating and editing a JSON configuration file that declares the structure of the target web API, which data should be sampled from it, and how it should be stored.

Right now, Whitman is geared towards sampling a piece of data for a set of users of an API (for example, Stack Overflow reputation or Reddit karma). It may be generalized in the future to sample data points that are not user-specific, but it is currently not designed to do that.

Prerequisites

Usage

Building

Before you can sample data, you must first create the universe...or at least an executable JAR file. You can build a JAR file containing all necessary .class files with lein:

$ lein uberjar

Setup

Database

Create a MongoDB database (you may name it whatever you want).
Populate the users collection in your MongoDB database with the IDs of the users whose data you want to record, in the following format:
```
 {_id: <user ID>}
```

Configuration

Next, you must create the crawler configuration file. Some examples are given in the doc/ directory. Configuration files are in JSON. They may contain the following keys (not all keys are necessary, and if they are not present in your configuration, defaults will be used; if a default is not specified, the key is required):

Key	Purpose	Default
`connection`	Host and port number of MongoDB server	`localhost:27017`
`database`	MongoDB database name	Required
`collection`	Name of MongoDB collection where samples should be stored	Required
`user-agent`	HTTP User-Agent to use when making HTTP requests	`whitman/<VERSION>`
`source`	URL from which samples should be pulled. It should contain one parameter that should be substituted with a user (or record) ID; this parameter can be denoted with the placeholder `%s`.	Required
`records`	MongoDB keypath where records that should be sampled are pulled	Required
`data`	Data points that should be crawled	(See next table)

The data specifies the data that should be crawled. It takes the following keys:

Key	Purpose	Default
`path`	The keypath to the data that should be recorded from the JSON response.	Required
`key`	Document key that the sampled data should be stored under	Required

Keypaths

Keypaths are period-separated paths specifying how the crawler should store or retrieve various pieces of data. They are currently used in two ways: When specifying the records, stored in MongoDB, that should be polled for data, and for describing how to retrieve data samples from a JSON response.

The first keypath specifies what records should be crawled. It is in the format <collection>.<field> and specifies the MongoDB collection and field that are used to construct a crawlable URL for a given record. For example, if you want to crawl a list of users stored in a users collection, with their IDs specified in the _id field, the keypath would be users._id.

Keypaths are also used to describe how data should be sampled from a JSON response. They are a period-separated path to the key that should be sampled. For example, say you wanted to sample link_karma from the following API response:

{
  "data": {
    "link_karma": 1000
  }
}

The keypath would be data.link_karma.

If a key is an array, you can specify an element of the array using an integer. For example, if you wanted to sample reputation from the following API response:

{
  "items": [
    {
      "reputation": 150000
    }
  ]
}

You would use the keypath items.0.reputation.

Running

Once you have built Whitman, you can run it using

$ java -jar target/uberjar/whitman-standalone.jar path/to/config.json

where path/to/config.json is, of course, the path to your JSON configuration file.

Name		Name	Last commit message	Last commit date
Latest commit History 95 Commits
bin		bin
doc		doc
fixtures		fixtures
src/whitman		src/whitman
test/whitman		test/whitman
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
project.clj		project.clj

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Whitman

Prerequisites

Usage

Building

Setup

Database

Configuration

Keypaths

Running

About

Releases

Packages

Languages

License

mdippery/whitman

Folders and files

Latest commit

History

Repository files navigation

Whitman

Prerequisites

Usage

Building

Setup

Database

Configuration

Keypaths

Running

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages