The Infochimps Platform is an end-to-end, managed solution for building Big Data applications. It integrates best-of-breed technologies like Hadoop, Storm, Kafka, MongoDB, Elasticsearch, HBase, &c. and provides simple interfaces for accessing these powerful tools.
Computation, analytics, scripting, &c. are all handled by Wukong within the platform. Wukong is an abstract framework for defining computations on data. Wukong processors and flows can run in many different execution contexts including:
- locally on the command-line for testing or development purposes
- as a Hadoop mapper or reducer for batch analytics or ETL
- within Storm as part of a real-time data flow
The Infochimps Platform uses the concept of a deploy pack: a single place where developers build all of their processors, flows, and jobs. The deploy pack can be thought of as a container for all the Wukong code and plugins needed by an Infochimps Platform application. It includes the following libraries:
- wukong-hadoop: Run Wukong processors as mappers and reducers within the Hadoop framework. Model Hadoop jobs locally before you run them.
- wukong-storm: Run Wukong processors within the Storm framework. Model flows locally before you run them.
- wukong-load: Load the output data from your local Wukong jobs and flows into a variety of different data stores.
- wonderdog: Connect Wukong processors running within Hadoop to Elasticsearch as either a source or sink for data.
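To make "model Hadoop jobs locally" concrete, here is a minimal plain-Ruby sketch of what a streaming job does: map each line to key/value pairs, sort by key, then reduce each run of identical keys. It does not use wukong-hadoop or any Wukong API; the method names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) are hypothetical and chosen only for illustration.

```ruby
# A plain-Ruby model of a Hadoop streaming job: map, sort by key, reduce.
# None of these names come from Wukong; they are hypothetical.

# Mapper: split each line into words and emit [word, 1] pairs.
def map_words(lines)
  lines.flat_map { |line| line.downcase.scan(/\w+/).map { |word| [word, 1] } }
end

# Shuffle/sort: Hadoop hands mapper output to reducers sorted by key.
def shuffle(pairs)
  pairs.sort_by { |key, _count| key }
end

# Reducer: sum the counts within each run of identical keys.
def reduce_counts(sorted_pairs)
  sorted_pairs
    .chunk_while { |a, b| a[0] == b[0] }
    .map { |group| [group[0][0], group.sum { |_key, count| count }] }
end

# The whole job, run in a single local process.
def word_count(lines)
  reduce_counts(shuffle(map_words(lines)))
end
```

Calling `word_count(["the quick fox", "The lazy dog"])` returns the words in sorted order, each paired with its count. Keeping the mapper and reducer as separate steps mirrors how they would run as separate Hadoop tasks.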
The deploy pack is installed as a RubyGem:
$ sudo gem install wukong-deploy
Wukong-Deploy provides a command-line tool, wu-deploy, which can be used to create or interact with deploy packs.
Create a new deploy pack:
$ wu-deploy new my_app
Within /home/user/my_app:
  create  .
  create  app/models
  create  app/processors
  ...
This will create a directory my_app in the current directory. The --dry_run option prints what would happen without actually doing anything:
$ wu-deploy new my_app --dry_run
Within /home/user/my_app:
  create  .
  create  app/models
  create  app/processors
  ...
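The idea behind a dry run can be sketched in a few lines of Ruby. This is not how wu-deploy is implemented; `scaffold` is a hypothetical helper that relies only on the standard library's `FileUtils.mkdir_p` and its `noop` option.

```ruby
require "fileutils"

# Hypothetical helper, not part of wu-deploy: create the given directories
# under root, or only report them when dry_run is true.
# Returns the list of actions as "create <dir>" strings.
def scaffold(root, dirs, dry_run: false)
  dirs.map do |dir|
    path = File.join(root, dir)
    # With noop: true, FileUtils does not touch the filesystem.
    FileUtils.mkdir_p(path, noop: dry_run)
    "create #{dir}"
  end
end
```

For example, `scaffold("my_app", ["app/models", "app/processors"], dry_run: true)` returns the same action list as a real run but leaves the filesystem untouched.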
You'll be prompted if there is a conflict. You can pass an option to always overwrite files, or the skip option to never overwrite them.
If your current directory is within an existing deploy pack you can start up an IRB console with the deploy pack's environment already loaded:
$ wu-deploy console
irb(main):001:0>
A deploy pack is a repository with the following Rails-like file structure:
├── app
│   ├── models
│   ├── processors
│   ├── flows
│   └── jobs
├── config
│   ├── environment.rb
│   ├── application.rb
│   ├── initializers
│   ├── settings.yml
│   └── environments
│       ├── development.yml
│       ├── production.yml
│       └── test.yml
├── data
├── Gemfile
├── Gemfile.lock
├── lib
├── log
├── Rakefile
├── spec
│   ├── spec_helper.rb
│   └── support
└── tmp
Let's look at it piece by piece:
- app: The directory with all the action. It's where you define:
  - models: Your domain models or "nouns", which define and wrap the different kinds of data elements in your application. They are built using whatever framework you like (defaults to Gorillib).
  - processors: Your fundamental operations or "verbs", which are passed records and parse, filter, augment, normalize, or split them.
  - flows: Chain processors together into streaming flows for ingestion, real-time processing, or complex event processing (CEP).
  - jobs: Pair processors together to create batch jobs to run in Hadoop.
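The relationship between processors and flows can be illustrated without any framework. The sketch below uses plain Ruby only: the classes `ParseJson`, `OnlyErrors`, and `SplitWords` and the `run_flow` helper are hypothetical stand-ins, not Wukong classes or its DSL, and the `level`/`message` record fields are invented for the example. Each "processor" receives one record and yields zero or more records; the "flow" feeds the output of each stage into the next.

```ruby
require "json"

# Hypothetical stand-ins for processors. Each takes one record and yields
# zero or more output records. These are not Wukong classes.

# Parse: turn a JSON line into a Hash, dropping lines that fail to parse.
class ParseJson
  def process(line)
    record = begin
      JSON.parse(line)
    rescue JSON::ParserError
      nil
    end
    yield record if record.is_a?(Hash)
  end
end

# Filter: pass through only records whose "level" field is "error".
class OnlyErrors
  def process(record)
    yield record if record["level"] == "error"
  end
end

# Split: emit one output record per word in the "message" field.
class SplitWords
  def process(record)
    record["message"].to_s.split.each { |word| yield word }
  end
end

# A flow: every record yielded by one stage becomes input to the next.
def run_flow(processors, records)
  processors.reduce(records) do |stream, processor|
    stream.flat_map do |record|
      out = []
      processor.process(record) { |result| out << result }
      out
    end
  end
end
```

Given the lines `{"level":"error","message":"disk full"}`, `not json`, and `{"level":"info","message":"ok"}`, running them through `[ParseJson.new, OnlyErrors.new, SplitWords.new]` leaves only the words of the error message. Because each stage sees one record at a time, the same stages could in principle be driven from a file, a batch job, or a stream.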
- config: Where you place all application configuration for all environments.
  - environment.rb: Defines the runtime environment for all code, requiring and configuring all Wukong framework code. You shouldn't have to edit this file directly.
  - application.rb: Require and configure libraries specific to your application. Choose a model framework, pick what application code gets loaded by default (vs. auto-loaded).
  - initializers: Holds any files you need to load before application.rb. Useful for requiring and configuring external libraries.
  - settings.yml: Defines application-wide settings.
  - environments: Defines environment-specific settings in YAML files named after the environment. Overrides config/settings.yml.
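The way environment files override config/settings.yml can be sketched with Ruby's standard YAML library. This is an illustration only: the setting names and values are invented, `deep_merge` and `load_settings` are hypothetical helpers, and whether the deploy pack merges nested keys recursively (as assumed here) or replaces them wholesale is not stated in this document.

```ruby
require "yaml"

# Recursively merge override into base; values in override win.
# Assumption: nested settings are merged key by key.
def deep_merge(base, override)
  base.merge(override) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Stand-ins for config/settings.yml and config/environments/production.yml.
# The keys and values are invented for illustration.
SETTINGS_YML = <<~YAML
  log_level: info
  elasticsearch:
    host: localhost
    port: 9200
YAML

PRODUCTION_YML = <<~YAML
  log_level: warn
  elasticsearch:
    host: es.example.com
YAML

# Parse both documents and apply the environment on top of the defaults.
def load_settings(base_yaml, env_yaml)
  deep_merge(YAML.safe_load(base_yaml) || {}, YAML.safe_load(env_yaml) || {})
end

settings = load_settings(SETTINGS_YML, PRODUCTION_YML)
```

Here `settings["log_level"]` is `"warn"` and `settings["elasticsearch"]` keeps the default `port` while taking the production `host`.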
- data: Holds sample data in flat files. You'll develop and test your application using this data.
- Gemfile and Gemfile.lock: Defines how libraries are resolved with Bundler.
- lib: Holds any code you want to use in your application but that isn't "part of" your application (like vendored libraries, Rake tasks, &c.).
- log: A good place to stash logs.
- Rakefile: Defines Rake tasks for developing, testing, and deploying your application.
- spec: Holds all your RSpec unit tests.
  - spec_helper.rb: Loads libraries you'll use during testing, includes spec helper libraries from Wukong.
  - support: Holds support code for your tests.
- tmp: A good place to stash temporary files.