# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

In [1]:
import datafaucet as dfc

## Load a project

A main design goal is to keep configuration and code in separate files. A project defines the working directories in which to run and find your notebooks, Python files, and configuration files. When a datafaucet project is loaded, it starts by searching for a `__main__.py` file, following Python module file naming conventions. When such a file is found, its directory is set as the project root path. All module and alias paths are relative to the project root path.
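The root-path lookup described above can be sketched as follows. This is a minimal illustration, not datafaucet's actual implementation: the function name `find_project_root` is hypothetical, and the upward walk through parent directories is an assumption about how the search proceeds.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "__main__.py") -> Optional[Path]:
    """Return the closest directory at or above `start` that contains `marker`.

    Hypothetical helper: datafaucet's real search strategy may differ.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None  # no marker file found anywhere on the path
```

Calling `find_project_root(Path.cwd())` from a notebook in a subdirectory would then return the directory holding `__main__.py`, which is the directory other paths would be resolved against.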

A profile is loaded with the `datafaucet.project.load` function. It looks for files ending with `metadata.yml`. The function can optionally set the current working directory and import the key=value pairs of the `.env` file into the Python OS environment. If no parameters are specified, the default profile is loaded.
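To illustrate what importing the key=value pairs of a `.env` file into the environment involves, here is a small standard-library sketch. The function `load_dotenv_sketch` is hypothetical and handles only simple `KEY=VALUE` lines; it is not how datafaucet itself reads the file, and a real loader would likely support more syntax.

```python
import os
from pathlib import Path


def load_dotenv_sketch(path, environ=None):
    """Parse simple KEY=VALUE lines from `path` and copy them into `environ`.

    Hypothetical helper for illustration only. Blank lines, comment lines
    starting with '#', and lines without '=' are skipped.
    """
    environ = os.environ if environ is None else environ
    loaded = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        loaded[key.strip()] = value.strip().strip("'\"")
    environ.update(loaded)
    return loaded
```

Passing a plain dictionary as `environ` makes it possible to inspect the result without touching the real process environment.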

In [2]:
help(dfc.project.load)

Help on function load in module datafaucet.project:

load(profile='default', rootpath=None)



### Project Configuration

In [3]:
# Loading default profile
import datafaucet as dfc
project = dfc.project.load()

created SparkEngine
Init engine "spark"
Connecting to spark master: local[*]
Engine context spark:2.4.4 successfully started


## Inspect current project configuration
The following call displays the loaded project configuration: the active metadata profile, the files discovered in the project, and the repository details. The configuration is available as a dictionary object.

In [4]:
dfc.project.info()

version: 0.8.2
username: natbusa
session_name: datalabframework-env.git
session_id: '0x4b8cdc24030b11ea'
profile:
rootdir: /home/natbusa/Projects/databox/demos/tutorial/demo
script_path: project.ipynb
dotenv_path: .env
notebooks_files:
  - Untitled1.ipynb
  - main.ipynb
  - install.ipynb
  - patched.ipynb
  - resources.ipynb
  - engine.ipynb
  - Untitled.ipynb
  - load_compare.ipynb
  - metadata.ipynb
  - hyperloglog.ipynb
  - project.ipynb
  - loadsave.ipynb
  - scaffolding.ipynb
  - join.ipynb
  - logging.ipynb
  - events.ipynb
python_files:
  - minimal.py
  - __main__.py
metadata_files:
  - metadata.yml
repository:
    type: git
    committer: natbusa
    hash: 84ddd60
    commit: 84ddd60c8bfeba211a120dcb1af569d836f7f49e
    branch: master
    url: https://github.com/natbusa/datalabframework-env.git
    name: datalabframework-env.git
    date: '2019-11-02T12:12:25+08:00'
    clean: false
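Since the configuration is described as a dictionary, nested values can be read with ordinary key lookups. The sketch below uses a hand-copied subset of the output above rather than a live call, because whether `dfc.project.info()` returns this mapping or only prints it is an assumption here.

```python
# Hand-copied subset of the output shown above, not produced by datafaucet.
info = {
    "version": "0.8.2",
    "python_files": ["minimal.py", "__main__.py"],
    "metadata_files": ["metadata.yml"],
    "repository": {"type": "git", "branch": "master", "clean": False},
}

# Nested lookup: which branch is the project repository on?
branch = info["repository"]["branch"]

# Defensive lookup: .get() avoids a KeyError if a section is absent.
is_clean = info.get("repository", {}).get("clean", False)

# Membership test on a list-valued entry.
has_main = "__main__.py" in info["python_files"]
```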

### Loading a specific profile

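Pass a profile name explicitly to load a profile other than the default.  
In this example the `minimal` profile is loaded; a profile such as `prod` could instead be configured to connect to a cluster in client mode.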

In [5]:
# Loading the 'minimal' profile instead of the default
project = dfc.project.load('minimal')

Init engine "spark"
Connecting to spark master: local[*]
Engine context spark:2.4.4 successfully started
