ehrlichja edited this page · 32 revisions

What is Camus?

Camus is LinkedIn's Kafka --> HDFS pipeline. It is a MapReduce job that performs distributed data loads out of Kafka. The code includes automatic discovery of topics, Avro schema management, and creation of folders partitioned by date and hour.

You can get a basic overview from the paper Building LinkedIn's Real-time Activity Data Pipeline. There is also a Google group for discussion: you can email it directly or search the archives. If you are interested, please ask any questions on that mailing list.

Brief Overview

A single execution of Camus consists of three stages:

1. Setup Stage

  • Fetch metadata from Kafka (the list of topics and partitions, the leader of each partition, and its latest offset)
  • Read the previous execution's files (the offsets consumed by that execution, if any)
  • Prepare an "EtlRequest" (the unit of work for Camus) for each topic
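The setup stage above can be sketched roughly as follows. This is a minimal Python sketch: `EtlRequest`, `build_requests`, and the dictionary shapes are illustrative assumptions, not the actual Camus classes.

```python
# Illustrative sketch (not the actual Camus API): build one work unit
# ("EtlRequest") per topic partition from Kafka metadata plus any offsets
# saved by the previous execution.
from dataclasses import dataclass

@dataclass
class EtlRequest:
    topic: str
    partition: int
    leader: str
    start_offset: int   # where this execution begins consuming
    latest_offset: int  # latest offset reported by the broker

def build_requests(metadata, previous_offsets):
    """metadata: {(topic, partition): (leader, latest_offset)}
       previous_offsets: {(topic, partition): last_consumed_offset}"""
    requests = []
    for (topic, partition), (leader, latest) in metadata.items():
        # Resume from the offset consumed last time, or start from 0
        start = previous_offsets.get((topic, partition), 0)
        requests.append(EtlRequest(topic, partition, leader, start, latest))
    return requests
```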

2. Hadoop Job

The EtlRequests are distributed across all mappers, and each mapper serially processes its assigned requests. Each mapper is responsible for:

  • Fetch events from Kafka and partition them across folders based on their timestamp
  • Collect count statistics
  • Store updated consumed offsets on HDFS
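As an illustration of the date/hour partitioning step, here is a hypothetical helper; the function name and path layout are assumptions for illustration, not Camus's actual code.

```python
# Illustrative sketch (hypothetical helper, not Camus code): derive the
# date/hour-partitioned output folder for an event from its timestamp.
from datetime import datetime, timezone

def output_folder(topic, timestamp_ms, base="/camus/topics"):
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    # e.g. /camus/topics/<topic>/hourly/2013/07/04/09
    return f"{base}/{topic}/hourly/{ts:%Y/%m/%d/%H}"
```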

Each mapper generates four sets of files:

  • Avro Data Files: the events fetched from Kafka
  • Count Files: the count statistics collected during the run
  • Offset Files: the consumed offsets, to be used by the next execution
  • Error Files: generated if an error was encountered; empty if no errors occurred
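The offset files are how one execution hands its position off to the next. A toy round-trip might look like the sketch below; JSON is used purely for illustration, and Camus's actual offset files do not use this format.

```python
# Illustrative sketch (hypothetical format, not Camus's actual on-disk
# format): persist consumed offsets so the next execution can resume
# where this one stopped.
import json

def write_offsets(path, offsets):
    """offsets: {(topic, partition): next_offset}, saved at end of a run."""
    serializable = {f"{t}:{p}": o for (t, p), o in offsets.items()}
    with open(path, "w") as f:
        json.dump(serializable, f)

def read_offsets(path):
    """Inverse of write_offsets: rebuild the {(topic, partition): offset} map."""
    with open(path) as f:
        raw = json.load(f)
    result = {}
    for key, offset in raw.items():
        topic, partition = key.rsplit(":", 1)
        result[(topic, int(partition))] = offset
    return result
```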

3. Reporting Stage

This is a post-Hadoop stage that reads all the count files, aggregates the values, and submits the results back to Kafka as another topic (TrackingMonitoringEvent).
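The aggregation step can be sketched as follows; this is an illustrative Python sketch, not the actual reporting code.

```python
# Illustrative sketch (not Camus code): merge the per-mapper count files
# into per-topic totals before reporting them back to Kafka.
from collections import Counter

def aggregate_counts(count_files):
    """count_files: iterable of {topic: event_count} dicts, one per mapper."""
    totals = Counter()
    for counts in count_files:
        totals.update(counts)
    return dict(totals)
```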

Deep Dive

Setup Stage
Hadoop Job
Reporting Stage
Camus InputFormat and OutputFormat Behavior

Setting up Camus

Configuration Parameters
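As a starting point, a minimal camus.properties might look like the fragment below. These property names are taken from the example configuration shipped with Camus, but they can vary between versions, so check the camus-example module in your checkout before relying on them.

```properties
# Illustrative fragment; verify each property name against the
# camus.properties example in your version of Camus.
camus.job.name=Camus Job

# Final destination for the date/hour-partitioned data files
etl.destination.path=/camus/topics

# Working and history paths used to track offsets between executions
etl.execution.base.path=/camus/exec
etl.execution.history.path=/camus/exec/history

# Topic selection: an empty whitelist means "all topics not blacklisted"
kafka.whitelist.topics=
kafka.blacklist.topics=
```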



Some other sources of information

Camus Gotchas
