Schedoscope is a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, lake, or whatever you choose to call your Hadoop data warehouse these days.
Latest commit 149351f Jul 1, 2016 @utzwestermann utzwestermann committed on GitHub Update README.md


Schedoscope

Introduction

Schedoscope is a scheduling framework for painfree agile development, testing, (re)loading, and monitoring of your datahub, datalake, or whatever you choose to call your Hadoop data warehouse these days.

Schedoscope takes away the headache you are certain to get when you have to frequently roll out and retroactively apply changes to computation logic and data structures in your datahub with traditional ETL job schedulers such as Oozie.

With Schedoscope,

  • you never have to create DDL and schema migration scripts;
  • you do not have to manually determine which data must be deleted and recomputed in the face of retroactive changes to logic or data structures;
  • you specify Hive table structures (called "views"), partitioning schemes, storage formats, dependent views, as well as transformation logic in a concise Scala DSL;
  • you have a wide range of options for expressing data transformations - from file operations and MapReduce jobs to Pig scripts, Hive queries, and Oozie workflows;
  • you benefit from Scala's static type system and your IDE's code completion to make fewer typos that hit you late during deployment or at runtime;
  • you can easily write unit tests for your transformation logic in ScalaTest and run them quickly right out of your IDE;
  • you schedule jobs by expressing the views you need - Schedoscope takes care that all required dependencies - and only those - are computed as well;
  • you can easily export view data in parallel to external systems such as Redis caches, JDBC, or Kafka topics;
  • you have Metascope - a nice metadata management and data lineage tracing tool - at your disposal;
  • you achieve a higher utilization of your YARN cluster's resources because job launchers are not themselves YARN applications that consume cluster capacity.
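
The idea of scheduling by naming the views you need, mentioned in the list above, can be illustrated with a small self-contained sketch. It is not Schedoscope code and does not use the Schedoscope API: the class name, the method names, and the view names (`nodes`, `shops`, and so on) are made up for illustration, and dependencies are assumed to be given as a plain map. It only shows how requesting one view yields that view plus its transitive dependencies, and nothing else.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Schedoscope.
class ViewDependencySketch {

    // Returns the requested view plus all of its transitive dependencies,
    // ordered so that every view appears after the views it depends on.
    static List<String> materializationOrder(String requested,
                                             Map<String, List<String>> dependsOn) {
        Set<String> ordered = new LinkedHashSet<>();
        visit(requested, dependsOn, ordered, new LinkedHashSet<>());
        return new ArrayList<>(ordered);
    }

    // Depth-first traversal: dependencies are added before the view itself.
    private static void visit(String view, Map<String, List<String>> dependsOn,
                              Set<String> ordered, Set<String> inProgress) {
        if (ordered.contains(view)) {
            return; // already scheduled via another path
        }
        if (!inProgress.add(view)) {
            throw new IllegalStateException("Cyclic dependency at " + view);
        }
        for (String dependency : dependsOn.getOrDefault(view, List.of())) {
            visit(dependency, dependsOn, ordered, inProgress);
        }
        inProgress.remove(view);
        ordered.add(view);
    }

    public static void main(String[] args) {
        // Made-up dependency graph; "trainstations" is not needed for "shop_profiles".
        Map<String, List<String>> dependsOn = Map.of(
                "shop_profiles", List.of("shops", "restaurants"),
                "shops", List.of("nodes"),
                "restaurants", List.of("nodes"),
                "trainstations", List.of("nodes"));
        // prints [nodes, shops, restaurants, shop_profiles]
        System.out.println(materializationOrder("shop_profiles", dependsOn));
    }
}
```

Requesting `shop_profiles` leaves `trainstations` untouched because nothing on the requested path depends on it. A real scheduler would additionally have to decide which of the selected views are already up to date.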

Getting Started

Get a glance at

Build it:

 [~]$ git clone https://github.com/ottogroup/schedoscope.git
 [~]$ cd schedoscope
 [~/schedoscope]$ MAVEN_OPTS='-XX:MaxPermSize=512m' mvn clean install

Follow the Open Street Map tutorial to install and run Schedoscope in a standard Hadoop distribution image:

Take a look at the View DSL Primer to get more information about the capabilities of the Schedoscope DSL:

More documentation can be found here:

Check out Metascope! It's an add-on to Schedoscope for collaborative metadata management, data discovery and exploration, and data lineage tracing:

Metascope

When is Schedoscope not for you?

Schedoscope is based on the following assumptions:

  • data are largely relational and meaningfully representable as Hive tables;
  • there is enough cluster time and capacity to actually allow for retroactive recomputation of data;
  • it is acceptable to compile table structures, dependencies, and transformation logic into what is effectively a project-specific scheduler.

Should any of those assumptions not hold in your context, you should probably look for a different scheduler.

Origins

Schedoscope was conceived at the Business Intelligence department of Otto Group.

Contributions

The following people have contributed to the various parts of Schedoscope so far:

Utz Westermann (maintainer), Hans-Peter Zorn, Kassem Tohme, Christian Richter, Dominik Benz, Martin Sänger, Annika Seidler, Alexander Kolb.

We would love to get contributions from you as well. We haven't got a formalized submission process yet. If you have an idea for a contribution or even coded one already, get in touch with Utz or just send us your pull request. We will work it out from there.

Please help make Schedoscope better!

News

07/01/2016 - Release 0.6.3

We have released Version 0.6.3 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

We have fixed a security issue with Metascope that allowed non-admin users to edit taxonomies.

06/30/2016 - Release 0.6.2

We have released Version 0.6.2 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Hadoop dependencies have been updated to CDH-5.7.1. A critical bug has been fixed that could cause views to stop transforming while dependent views were still waiting. Reliability of Metascope has been improved.

06/23/2016 - Release 0.6.1

We have released Version 0.6.1 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

Hive transformations are no longer submitted to the cluster via Hive Server 2 but directly via the hive-exec library. The reasons for this change are the stability and resource leakage issues commonly encountered when operating Hive Server 2. Please note that Hive transformations are now issued with hive.auto.convert.join set to false by default to limit heap consumption in Schedoscope due to involuntary local map join operations. Refer to Hive Transformation for more information on how to reenable map joins for queries that need them.

Also: quite a few bug fixes, better error messages when using the CLI client, improved parallelization of JDBC exports.

05/27/2016 - Release 0.6.0

We have released Version 0.6.0 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

We have updated the checksumming algorithm for Hive transformations such that changes to comments, settings, and formatting no longer affect the checksum. This should significantly reduce operational worries. However, the checksums of all your Hive queries will change compared to Release 0.5.0. When switching to this version, make sure you issue a materialization request with mode RESET_TRANSFORMATION_CHECKSUMS to avoid unwanted view recomputations! Hence the switch of the minor release number.
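
To illustrate why such a checksum is insensitive to comments, settings, and formatting, here is a minimal sketch of the general technique: normalize the query text first, then hash the normalized form. This is not Schedoscope's actual algorithm. The class and method names, the choice of MD5, and the simplifying assumptions (settings appear as `SET ...` statements on lines of their own, and `--` never occurs inside a string literal) are all assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical illustration only; not the algorithm used by Schedoscope.
class QueryChecksumSketch {

    // Strips block comments, line comments, SET lines, and redundant whitespace.
    static String normalize(String hiveQl) {
        String withoutBlockComments = hiveQl.replaceAll("(?s)/\\*.*?\\*/", " ");
        StringBuilder kept = new StringBuilder();
        for (String line : withoutBlockComments.split("\\R")) {
            // Simplification: treats every "--" as a comment start, even inside literals.
            String code = line.replaceAll("--.*$", "").trim();
            if (code.isEmpty() || code.toUpperCase().startsWith("SET ")) {
                continue; // skip blank lines and settings
            }
            kept.append(code).append(' ');
        }
        return kept.toString().replaceAll("\\s+", " ").trim();
    }

    // Hex-encoded MD5 digest of the normalized query text.
    static String checksum(String hiveQl) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(normalize(hiveQl).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is required on every Java platform", e);
        }
    }

    public static void main(String[] args) {
        String plain = "SELECT id, name\nFROM nodes\nWHERE year = '2016'";
        String decorated = "-- load nodes\nSET hive.auto.convert.join=false;\n"
                + "SELECT id,   name /* columns */\n  FROM nodes\n  WHERE year = '2016'  -- filter";
        // Both queries normalize to the same text, so this prints true.
        System.out.println(checksum(plain).equals(checksum(decorated)));
    }
}
```

Because only the normalized text is hashed, a change in query logic still produces a different checksum, while cosmetic edits do not.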

The test framework now automatically checks whether there is an ON condition for each JOIN clause in your Hive queries. Also, it checks whether each input view you provide in basedOn is also declared as a dependency.
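
The kind of JOIN check described above can be approximated with a few lines of text processing. The sketch below is not the test framework's implementation; the class and method names are hypothetical, and it deliberately ignores string literals, comments, subqueries, and join types that legitimately have no ON condition. It splits the query at each JOIN keyword and requires an ON keyword before the next JOIN or the end of the query.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not the check implemented by Schedoscope.
class JoinConditionCheckSketch {

    private static final Pattern JOIN = Pattern.compile("\\bJOIN\\b", Pattern.CASE_INSENSITIVE);
    private static final Pattern ON = Pattern.compile("\\bON\\b", Pattern.CASE_INSENSITIVE);

    // True if every JOIN keyword is followed by an ON keyword
    // before the next JOIN or the end of the query.
    static boolean everyJoinHasOn(String hiveQl) {
        // Limit -1 keeps a trailing empty segment, so a dangling JOIN is detected.
        String[] segments = JOIN.split(hiveQl, -1);
        for (int i = 1; i < segments.length; i++) {
            if (!ON.matcher(segments[i]).find()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // prints true
        System.out.println(everyJoinHasOn(
                "SELECT * FROM a JOIN b ON a.id = b.id LEFT OUTER JOIN c ON a.id = c.id"));
        // prints false: the join with b has no ON condition
        System.out.println(everyJoinHasOn(
                "SELECT * FROM a JOIN b JOIN c ON a.id = c.id"));
    }
}
```

A missing ON condition typically turns a join into an accidental cross product, which is why catching it in unit tests is cheaper than discovering it on the cluster.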

05/21/2016 - Release 0.5.0

We have released Version 0.5.0 as a Maven artifact to our Bintray repository (see Setting Up A Schedoscope Project for an example pom).

This is a biggie. We have added Metascope to our distribution. Metascope is a collaborative metadata management, documentation, exploration, and data lineage tracing tool that exploits the integrated specification of data structure, dependencies, and computation logic in Schedoscope views. See the tutorial and the Metascope primer for more information.

Community / Forums

Build Status


License

Licensed under the Apache License 2.0