Camus is LinkedIn's Kafka-to-HDFS pipeline. It is a MapReduce job that performs distributed data loads out of Kafka. The code includes automatic topic discovery, Avro schema management, and creation of folders partitioned by date and hour.
You can get a basic overview from "Building LinkedIn's Real-time Activity Data Pipeline". There is also a Google group for discussion: you can email it at firstname.lastname@example.org or search its archives. If you are interested, please ask any questions on that mailing list.
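As a rough illustration of the date-and-hour partitioning mentioned above, the sketch below maps an event timestamp to an output folder. The function name `partition_path`, the `base_dir/topic/YYYY/MM/DD/HH` layout, and the use of UTC are assumptions made for this example, not a description of Camus's actual partitioner.

```python
from datetime import datetime, timezone


def partition_path(base_dir: str, topic: str, timestamp_ms: int) -> str:
    """Return a date/hour-partitioned folder for one event.

    Hypothetical layout: <base_dir>/<topic>/<YYYY>/<MM>/<DD>/<HH>.
    """
    # Interpret the event timestamp (milliseconds since the epoch) in UTC
    # so that the chosen folder does not depend on the local time zone.
    event_time = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return f"{base_dir}/{topic}/{event_time.strftime('%Y/%m/%d/%H')}"


if __name__ == "__main__":
    # An event stamped five hours after the epoch falls in the "05" folder.
    print(partition_path("/data/camus", "page_views", 5 * 3600 * 1000))
```

Events from the same topic and hour share a folder, so a downstream job can read one hour of data by listing a single directory.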
A single execution of Camus consists of three stages:
1. Setup Stage
2. Hadoop Job
The EtlRequests are distributed across all mappers, and each mapper processes its requests serially. Each task is responsible for:
Each mapper generates four sets of files:
3. Reporting Stage
This is a post-Hadoop stage that reads all the counts files, aggregates the values, and submits the results back to Kafka as another topic (TrackingMonitoringEvent).
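The aggregation step of the reporting stage can be sketched as below. This is a minimal illustration: it assumes each counts file has already been parsed into a mapping from topic name to event count. That record shape and the function name `aggregate_counts` are hypothetical, and publishing the result back to Kafka is left out.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def aggregate_counts(count_files: Iterable[Mapping[str, int]]) -> Dict[str, int]:
    """Sum per-topic event counts across the counts files written by all mappers.

    Each element of count_files stands for one parsed counts file
    (hypothetical shape: topic name -> number of events).
    """
    totals = Counter()
    for counts in count_files:
        for topic, n in counts.items():
            totals[topic] += n
    return dict(totals)


if __name__ == "__main__":
    per_mapper = [
        {"page_views": 120, "clicks": 30},  # counts file from one mapper
        {"page_views": 80},                 # counts file from another mapper
    ]
    # The aggregated totals are what would be sent back to Kafka.
    print(aggregate_counts(per_mapper))
```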