General Repo for CS286A projects
This document is under heavy revision.
At a high level, the class project is intended to prototype infrastructure for managing metadata and lineage in the Apache Big Data context. The project should have three phases:
- requirements gathering (2 weeks)
- functional specification and design (2 weeks)
- prototype implementation (4 weeks)

These phases will likely overlap and have feedback loops. That's OK.
We envision three basic components:
- A metadata repository that (a) has a schema to capture relevant information from a set of prototypical tasks and tools, (b) is extensible to new tasks and tools with varying degrees of opacity, and (c) can scale up to large volumes of metadata and high access rates.
- A crawler that can walk large repositories of information and call out to an extensible set of external data "recognizers" or "profilers" that can assess the contents of individual data files or datasets (a minimal recognizer interface is sketched after this list). Candidate datastores include HDFS, POSIX filesystems, and relational databases (via standards like JDBC), and perhaps some special file types like IPython notebooks. The crawler should interface with a standard scheduling infrastructure at two levels:
  1. macro: fire off crawls on a schedule (nightly, weekly, etc.)
  2. micro: execute the crawl through the scheduler, i.e. visit files and feed them up via REST calls at a load-sensitive pace.
- A metadata mover facility that provides (a) an API for inserting metadata into the repository, (b) a facility for reliable bulk movement of large volumes of data into the repository, and (c) an interface to the same scheduling infrastructure as the crawler for executing bulk metadata movement.
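As a concrete illustration of how the crawler, the recognizers, and the metadata mover might fit together, here is a minimal sketch in Java. All names here (`MetadataRecord`, `Recognizer`, `CsvRecognizer`) are hypothetical and not drawn from Gobblin or any other existing tool; the point is only that a recognizer is a small pluggable unit that inspects one object and emits a metadata record the mover can insert into the repository.

```java
// Sketch only: these types are illustrative, not part of any existing tool.
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Instant;
import java.util.Map;
import java.util.Optional;
import java.util.stream.Stream;

/** A single metadata entry the crawler would hand to the mover's insert API. */
final class MetadataRecord {
    final String sourceUri;               // e.g. hdfs://... or file:///...
    final String recognizerName;          // which profiler produced this entry
    final Instant observedAt;             // when the crawl saw the object
    final Map<String, String> properties; // schema, row count, owner, etc.

    MetadataRecord(String sourceUri, String recognizerName, Map<String, String> properties) {
        this.sourceUri = sourceUri;
        this.recognizerName = recognizerName;
        this.observedAt = Instant.now();
        this.properties = properties;
    }
}

/** Pluggable "recognizer"/"profiler": inspects one object, emits metadata if it applies. */
interface Recognizer {
    String name();
    /** Return a record if this recognizer understands the file, empty otherwise. */
    Optional<MetadataRecord> profile(Path file) throws IOException;
}

/** Toy example: treats the first line of any *.csv file as a header and records the columns. */
final class CsvRecognizer implements Recognizer {
    public String name() { return "csv-header"; }

    public Optional<MetadataRecord> profile(Path file) throws IOException {
        if (!file.toString().endsWith(".csv")) {
            return Optional.empty();
        }
        try (Stream<String> lines = Files.lines(file)) {
            String header = lines.findFirst().orElse("");
            return Optional.of(new MetadataRecord(
                    file.toUri().toString(),
                    name(),
                    Map.of("columns", header,
                           "columnCount", String.valueOf(header.split(",").length))));
        }
    }
}
```

Under this sketch, a crawl amounts to visiting each file at the pace the scheduler allows, trying each registered recognizer on it, and handing any resulting records to the mover for insertion into the repository.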
The goal here is not to write everything from scratch, but rather to use and augment well-supported open source components. Potentially useful components include:
- Gobblin. This is an excellent starting point for the crawler and metadata mover projects.
- A variety of open-source databases could be used for the metadata repository. We will have to decide very early on whether we want to use a relational database or a key-value store. A concern with relational databases is that there isn't currently a well-used scale-out (parallel/distributed) relational database in the typical Apache Big Data environment.
- Kafka is a standard open-source tool for reliable bulk data movement. It could be useful here, and it will likely work well with Gobblin, since both originated at LinkedIn.
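To make the Kafka idea concrete, here is a rough sketch of how the mover might publish metadata records to a topic for a downstream loader to write into the repository in bulk. The topic name ("metadata-ingest"), the JSON payload, and keying by source URI are illustrative assumptions, not a settled design.

```java
// Sketch of bulk metadata movement over Kafka; topic name and payload are assumptions.
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class MetadataPublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        // acks=all: don't consider a metadata record "moved" until the brokers have acknowledged it.
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String sourceUri = "hdfs:///data/example/events.csv"; // hypothetical path
            String json = "{\"sourceUri\":\"" + sourceUri + "\","
                        + "\"recognizer\":\"csv-header\",\"columnCount\":\"12\"}";
            // Key on the source URI so all observations of one object land in one partition,
            // preserving per-object ordering for the repository loader that consumes the topic.
            producer.send(new ProducerRecord<>("metadata-ingest", sourceUri, json));
            producer.flush();
        }
    }
}
```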
We will endeavor to support some real-world use cases from scientists on campus, as well as partners in industry. On campus, we have access to experts on two important toolchains:
- The BDAS stack, including Spark and MLlib.
- Project Jupyter, and especially IPython.