Who's using Cascalog

Peter Lubell-Doughtie edited this page · 7 revisions

Factual is constantly aggregating and processing growing sets of data, and we find ourselves relying more and more on the Hadoop stack of technologies. Cascalog lets us abstract away the details of data sources (with taps, as in Cascading). More specifically, we use Cascalog to run our machine learning algorithms on billions of web pages and user-contributed data to aggregate factual data present in multiple sources. We also benefit from the ad-hoc nature of Cascalog when generating statistics across our datasets, verifying map-reduce job outputs, tracing the history of data through our processing pipeline, and running experimental data manipulations and transformations.

We're also benefiting from the availability of Clojure in Cascalog. Clojure is a natural fit when doing custom data manipulations, and it's also quite useful to use the REPL to experiment. Being able to "call out" to pure Clojure from our Cascalog queries has been a big win.

At Harvard School of Public Health we use Cascalog to query large datasets generated by next-generation sequencing. We need an approach that facilitates rapid iterations of coding and testing for algorithm development work, and then scales to handle increasingly large data volumes. As a small group that works on many projects simultaneously, we need to be as efficient as possible since any development code could potentially become part of processing pipelines.

Cascalog makes coding for Hadoop much easier. This allows us to focus on the queries and data interpretation. It additionally increases the understandability of the code, which is essential for reproducibility and transparency. A detailed writeup of some of our work with Cascalog is available here.

At Intent Media, we provide predictive analytics to help retailers recognize and react to the unique value of each site visitor. Cascalog allows us to efficiently analyze terabytes of data to help retailers make smart, real-time choices about who sees what and how to adapt their site to best realize the full value of each visitor.

Cascalog is an increasingly core component of our backend modeling pipeline comprising data aggregation, pre-processing, and feature extraction.

At Lumosity, we are committed to pioneering the understanding and enhancement of the human brain to give each person the power to unlock their full potential. Data analysis is an important part of our business, whether we are conducting new scientific studies to learn more about the human brain or analyzing user behavior on our site to optimize Lumosity and the training experience. Cascalog allows our Research & Development team to efficiently analyze our database of human cognitive performance (the largest in the world, with over 450 million data points) to gain new insights on cognitive training.

Cascalog is at the core of Twitter's tools for publisher partners. A batch workflow written using Clojure and Cascalog updates a variety of ElephantDB views a few times a day. These views include time series aggregations, influence analysis, follower distribution analysis, and more. Additionally, Cascalog is used to vertically partition the dataset, which exceeds 40TB, in a few different ways to allow for efficient querying later on. Cascalog's conciseness and expressiveness greatly reduce the complexity of our batch processing.

Cascalog is also used for ad-hoc querying and exploratory work, taking advantage of the ease of defining and running queries from the REPL. When a major event happens, we extract relevant tweets from the master datastore to a local computer where they can be analyzed in a quick iterative fashion.

Cascalog forms the core of Yieldbot's intent modeling and matching technology stack. Publishers' data is fed at regular intervals through a batch workflow that performs a wide array of tasks, such as predictive modeling, text processing, and metrics aggregation.

Cascalog and Clojure allow us to develop, deploy, explore and iterate on our workflows with extreme speed and minimal effort. You can read about our experience migrating from Apache Pig to Cascalog here: Why Yieldbot Chose Cascalog over Pig for Hadoop Processing

REDD Metrics

REDD Metrics uses Cascalog at the heart of our large-scale deforestation monitoring system, currently housed at the Center for Global Development in Washington. We process hundreds of gigabytes of NASA satellite data down into concrete predictions on the likelihood that some piece of land will be deforested in the next month. Cascalog allows us to generate time series and perform analysis at a scale unimaginable with current "state of the art" practices. We look forward to open sourcing our work in the coming months. For updates, take a look at our blog.

uSwitch uses high-level data to make business decisions and drills down to the microscopic level to enable a personalised experience for each of our customers. Cascalog sits at the heart of our modular data pipeline, transforming immutable event data to clean and extract customer features for the rest of the business. Furthermore, the logical and functional nature of Cascalog enables our small data team to build simple, composable data processing workflows at scale.
