# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

## Principles

- Both notebooks and code files are first-class citizens

Following Python package conventions, the root of the project is marked by a `__main__.py` file, and each directory of source code (Python files or notebooks) contains an `__init__.py` file. This lets Python files and notebooks reference each other.

Python notebooks and Python code can be mixed and matched, and are interoperable with each other. With datafaucet, you can include notebooks as modules in Python code, and you can include Python modules in a notebook.
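
To illustrate the general idea of treating a notebook as a module, here is a minimal standard-library sketch. It is not datafaucet's own mechanism: the function name `load_notebook_as_module` is hypothetical, and the approach (executing each code cell into a fresh module object) is a simplification that skips lines starting with `%` or `!`.

```python
import json
import types


def load_notebook_as_module(path, name):
    """Execute the code cells of a .ipynb file and return them as a module.

    Hypothetical helper for illustration only; not part of datafaucet.
    """
    with open(path, encoding="utf-8") as f:
        nb = json.load(f)

    module = types.ModuleType(name)
    module.__file__ = path
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        # The notebook format stores source as a string or a list of lines.
        if isinstance(source, list):
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [
            line for line in source.splitlines()
            if not line.lstrip().startswith(("%", "!"))
        ]
        exec(compile("\n".join(lines), path, "exec"), module.__dict__)
    return module
```

Under these assumptions, `load_notebook_as_module("helpers.ipynb", "helpers")` returns a module object whose attributes are the names defined in the notebook's code cells.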

- Decouple Code and Data Resources

Data can be located anywhere: on remote HDFS clusters, on object store services exposed via the S3 protocol, or on the local file system. No matter where the data is located, we want to decouple data resources from the code executed in the data pipeline.

Data and code are separated by declaring all data resources and providers as configuration in metadata files. Metadata files make it possible to define aliases for data resources, data services, and engine configurations, which keeps the ETL and ML code tidy, with no hardcoded parameters.
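
The alias idea can be sketched in plain Python. The metadata layout, the key names (`providers`, `resources`, `service`, `path`, `format`), and the `resolve` function are all assumptions made for illustration; they are not datafaucet's actual schema or API. The dict stands in for what a metadata YAML file might parse into.

```python
# Hypothetical metadata, shown as the dict a YAML file might parse into.
METADATA = {
    "providers": {
        "local": {"service": "file", "path": "data"},
        "lake": {"service": "s3a", "path": "my-bucket/warehouse"},
    },
    "resources": {
        "sales": {"provider": "lake", "path": "sales/2020", "format": "parquet"},
        "countries": {"provider": "local", "path": "ref/countries.csv", "format": "csv"},
    },
}


def resolve(alias, metadata=METADATA):
    """Turn a resource alias into a concrete location and format."""
    resource = metadata["resources"][alias]
    provider = metadata["providers"][resource["provider"]]
    url = f'{provider["service"]}://{provider["path"]}/{resource["path"]}'
    return {"url": url, "format": resource["format"]}
```

Pipeline code would then refer only to `"sales"` or `"countries"`; moving the data to another provider means editing the metadata, not the code.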

- Decouple Code from Configuration

Code, whether stored as notebooks or as Python files, should be decoupled from both engine configurations and data locations. All configuration is kept in metadata YAML files. Multiple setups (for example test, exploration, and production) can be defined as different metadata configuration profiles. Profiles can inherit configuration settings from other profiles.
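
Profile inheritance can be pictured as a recursive merge in which a child profile overrides only the keys it sets. The sketch below is an illustration under assumptions: the `inherit` key, the profile contents, and the function names are hypothetical, not datafaucet's actual metadata format, and there is no cycle detection.

```python
import copy

# Hypothetical profiles, shown as the dict a metadata YAML file might parse into.
PROFILES = {
    "default": {"engine": {"type": "spark", "master": "local[*]"}, "loglevel": "info"},
    "test": {"inherit": "default", "loglevel": "debug"},
    "prod": {"inherit": "default", "engine": {"master": "yarn"}, "loglevel": "warning"},
}


def deep_merge(base, override):
    """Return a copy of base with override applied, merging nested dicts."""
    result = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = deep_merge(result[key], value)
        else:
            result[key] = copy.deepcopy(value)
    return result


def resolve_profile(name, profiles=PROFILES):
    """Resolve a profile by merging it over its (recursively resolved) parent."""
    profile = profiles[name]
    own = {k: v for k, v in profile.items() if k != "inherit"}
    parent = profile.get("inherit")
    if parent is None:
        return copy.deepcopy(own)
    return deep_merge(resolve_profile(parent, profiles), own)
```

Here `resolve_profile("prod")` keeps the engine `type` from `default` while replacing `master` and `loglevel` with its own values.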

## Chapters

#### How to
  - [Installing](install.ipynb)
  - [Engine](engine.ipynb)
  - [Metadata](metadata.ipynb)
  - [Project](project.ipynb)
  - [Resources](resources.ipynb)
  - [Load and Save](loadsave.ipynb)
  - [Logging](logging.ipynb)
  - [Scaffolding](scaffolding.ipynb)
  - [CLI](run.ipynb)
  
#### Data Ingest
  - [Copy: overwrite/append](ingest.ipynb)
  - [Log: slowly changing dimensions](change.ipynb)
  
#### ETL
  - [Merge Tables](merge.ipynb)
  - [Fact Table](facts.ipynb)
  - [Star Schema](star.ipynb)

#### Aggregation
  - [Aggregated data](aggregate.ipynb)
  - [Publishing cubes](publish.ipynb)
  
#### Reporting
  - Interactive
  - BI tools
  - Publish

#### Pipelines
  - CI/CD
  - DTAP