Sparkling Warehouse

Sparkling Warehouse is a set of scripts that builds complex workflows of batch jobs for analysing raw logistics data with Spark. It handles data preprocessing (pipeline), data modeling/instantiation (nozzle), workflow management (valve), visualization (tank), and much more.
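
The division of labor between the four components can be illustrated with a minimal, self-contained sketch. The function names and the toy records below are hypothetical placeholders chosen for illustration, not the repository's actual API or schema.

```python
def pipeline_preprocess(rows):
    """pipeline: clean and normalize raw records (drop invalid weights)."""
    return [r for r in rows if r.get("weight", 0) > 0]

def nozzle_instantiate(rows):
    """nozzle: turn preprocessed rows into a simple model/summary."""
    n = len(rows)
    return {"n": n, "avg_weight": sum(r["weight"] for r in rows) / max(n, 1)}

def valve_should_run(model):
    """valve: workflow management decides whether downstream steps run."""
    return model["n"] > 0

def tank_visualize(model):
    """tank: visualization (here just a printout)."""
    print(f"{model['n']} shipments, avg weight {model['avg_weight']:.1f} kg")

raw = [{"weight": 1.2}, {"weight": 0.0}, {"weight": 3.4}]
summary = nozzle_instantiate(pipeline_preprocess(raw))
if valve_should_run(summary):
    tank_visualize(summary)  # -> "2 shipments, avg weight 2.3 kg"
```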

Background

The purpose of the Warehouse is to address the unwieldy memory issues typically associated with long-running batch processes that handle massive data in the SF Express project. After much trial and error, I decided to decompose the analytical problem into tasks that can be executed in parallel and in series, and to make every component memory-friendly.

Essentially, I rewrote in PySpark all of the Warehouse's feature-service logic that requires high-frequency read/write operations on the large raw dataset. By delegating computation to Spark, the Warehouse makes workflows easy to design without worrying about running out of memory, and it also greatly improves computational performance.
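
The following is a minimal PySpark sketch of the kind of batch feature job that gets offloaded to Spark. The input path and column names (waybill_id, src_city, weight) are hypothetical examples of logistics fields, not the project's actual schema.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("sparkling-warehouse-demo").getOrCreate()

# Read a large raw dataset as a distributed DataFrame instead of loading it
# into driver memory.
raw = spark.read.csv("hdfs:///logistics/raw/*.csv", header=True, inferSchema=True)

# A typical batch feature: per-city shipment counts and average parcel weight.
features = (
    raw.groupBy("src_city")
       .agg(F.count("waybill_id").alias("n_shipments"),
            F.avg("weight").alias("avg_weight"))
)

# Persist the feature table for downstream modeling stages.
features.write.mode("overwrite").parquet("hdfs:///logistics/features/by_city")
```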

Ideally, you can assemble pretty much any pipeline of tasks you want and manage it with valves (see the sketch below), but the Warehouse also comes with some one-button pipeline templates you can use. It includes support for running ...
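
As a rough illustration of how tasks could be chained into a pipeline and gated by a valve, the Pipeline and Valve classes below are hypothetical placeholders, not the repository's actual interfaces.

```python
class Valve:
    """Decides whether the next batch task should run."""
    def __init__(self, predicate):
        self.predicate = predicate

    def is_open(self, state):
        return self.predicate(state)


class Pipeline:
    """Runs tasks in series, checking the valve before each step."""
    def __init__(self, tasks, valve):
        self.tasks = tasks
        self.valve = valve

    def run(self, state):
        for task in self.tasks:
            if not self.valve.is_open(state):
                break
            state = task(state)
        return state


# Example usage with trivial tasks.
pipe = Pipeline(tasks=[lambda s: s + 1, lambda s: s * 2],
                valve=Valve(lambda s: s < 100))
print(pipe.run(0))  # -> 2
```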

Change Log

Dec 21, 2017 Built the first pipeline to test the workflow, which was newly refactored on Spark.
