# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

In [1]:
import datafaucet as dfc

## Load a project

A central idea is to keep configuration and code in separate files. A project defines the working directories in which your notebooks, Python files, and configuration files are found and run. When a datafaucet project is loaded, it first searches for a `__main__.py` file, following Python's module file naming conventions. The directory containing that file is set as the project root path. All module and alias paths are relative to this root path.
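The root-path lookup described above can be sketched with the standard library alone. This is a minimal illustration, not datafaucet's implementation: the helper name `find_rootdir` is hypothetical, and the sketch assumes the search walks upward from a starting directory through its parents until a `__main__.py` file is found.

```python
import tempfile
from pathlib import Path
from typing import Optional


def find_rootdir(start: Path, marker: str = "__main__.py") -> Optional[Path]:
    """Return the closest directory at or above `start` containing `marker`.

    Hypothetical helper: assumes an upward search, which may differ from
    what datafaucet actually does.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None  # no marker file found on the way up


if __name__ == "__main__":
    # Build a throwaway project tree and locate its root from a nested folder.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "project"
        nested = root / "src" / "etl"
        nested.mkdir(parents=True)
        (root / "__main__.py").touch()
        print(find_rootdir(nested) == root.resolve())
```

Under this assumption, a notebook started from any subdirectory resolves to the same root, which is what lets module and alias paths stay relative to it.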

A profile is loaded with the `datafaucet.project.load` function, which looks for files whose names end with `metadata.yml`. The function can optionally set the current working directory and import the key=value pairs of the `.env` file into the Python OS environment. If no parameters are specified, the default profile is loaded.
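To make the two side effects concrete, the following sketch shows what "files ending with `metadata.yml`" and "importing `.env` key=values into the environment" could look like using only the standard library. It does not reproduce datafaucet's code: the helpers `parse_dotenv` and `find_metadata_files`, the recursive search, the quote handling, and the `DB_USER`/`DB_PASS` variables are all assumptions made for illustration.

```python
import os
import tempfile
from pathlib import Path
from typing import Dict, List


def parse_dotenv(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    pairs: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Strip surrounding whitespace and optional quotes around the value.
        pairs[key.strip()] = value.strip().strip("'\"")
    return pairs


def find_metadata_files(rootdir: Path) -> List[Path]:
    """List files under `rootdir` whose names end with 'metadata.yml'.

    Assumption: the search is recursive; datafaucet may search differently.
    """
    return sorted(p for p in rootdir.rglob("*metadata.yml") if p.is_file())


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        # Hypothetical variables, for illustration only.
        (root / ".env").write_text("# credentials\nDB_USER=etl\nDB_PASS='s3cret'\n")
        (root / "metadata.yml").touch()
        (root / "prod.metadata.yml").touch()

        # Import the .env pairs into the process environment.
        os.environ.update(parse_dotenv((root / ".env").read_text()))
        print(os.environ["DB_USER"])
        print([p.name for p in find_metadata_files(root)])
```

Keeping credentials in `.env` and structure in `metadata.yml` is one way to honor the separation of configuration and code described earlier.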

In [2]:
help(dfc.project.load)

Help on function load in module datafaucet.project:

load(profile='default', rootpath=None, reload=True, parameters=None)



### Project Configuration

In [1]:
# Loading default profile
import datafaucet as dfc
project = dfc.project.load()

ERROR:datafaucet:KafkaLoggingHandler NoBrokersAvailable  - disabling kafka logging handler
NOTICE:datafaucet:project.ipynb:engine:__init__ | Connecting to spark master: local[*]
NOTICE:datafaucet:project.ipynb:engine:__init__ | Engine context spark:2.4.4 successfully started


## Inspect current project configuration
The following displays the metadata profile and the configuration data loaded for the current project. The configuration is available as a dictionary object.

In [4]:
dfc.project.info()

version: 0.9.1
username: natbusa
session_name: default-datalabframework.git
session_id: '0xd0f7fbd01aff11ea'
profile: default
rootdir: /home/natbusa/Projects/datafaucet/examples/tutorial
script_path: project.ipynb
dotenv_path: .env
notebooks_files:
  - Untitled1.ipynb
  - main.ipynb
  - install.ipynb
  - patched.ipynb
  - Untitled2.ipynb
  - resources.ipynb
  - Untitled.ipynb
  - aggregate.ipynb
  - scd.ipynb
  - test.ipynb
  - metadata.output.ipynb
  - load_compare.ipynb
  - metadata.ipynb
  - engine-pandas.ipynb
  - hyperloglog.ipynb
  - mobilenumber.ipynb
  - project.ipynb
  - loadsave.ipynb
  - scaffolding.ipynb
  - join.ipynb
  - logging.ipynb
  - Untitled5.ipynb
  - engine-dask.ipynb
  - generate.ipynb
  - Untitled3.ipynb
  - Untitled4.ipynb
  - engine-spark.ipynb
  - events.ipynb
  - columns.ipynb
python_files:
  - hello_cereal.py
  - minimal.py
  - __main__.py
  - .ipynb_checkpoints/hello_cereal-checkpoint.py
metadata_files:
  - metadata.yml
repository:
    type: git
    commit

### Loading a specific profile

A profile other than the default can be loaded explicitly by passing its name.  
For instance, a `prod` profile could connect to a cluster in client mode; the example here loads the `minimal` profile.

In [5]:
# Loading the 'minimal' profile
project = dfc.project.load('minimal')

ERROR:datafaucet:KafkaLoggingHandler NoBrokersAvailable  - disabling kafka logging handler
INFO:datafaucet:engines.py:__call__ | Factory: Stop the current SparkEngine instance
INFO:datafaucet:engines.py:__init__ | Init engine "spark"
NOTICE:datafaucet:engine.py:__init__ | Connecting to spark master: local[*]
NOTICE:datafaucet:engine.py:__init__ | Engine context spark:2.4.4 successfully started
