# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

In [1]:
import datafaucet as dfc
from datafaucet import logging as log

## Logging

A core principle is to keep configuration and code in separate files. A project defines the working directories where your notebooks, Python files, and configuration files are run and found. When a datafaucet project is loaded, it starts by searching for a `__main__.py` file, following Python module file naming conventions. When such a file is found, its directory is set as the root path of the project. All module and alias paths are relative to the project root path.
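The root lookup described above can be pictured with a short sketch. It assumes the search walks upward from a starting directory until a directory containing `__main__.py` is found; the actual search order used by datafaucet is not stated here, and `find_project_root` is a hypothetical helper, not part of the datafaucet API.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "__main__.py") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the upward search is an assumption and may differ
    from what datafaucet actually does.
    """
    start = start.resolve()
    # Check the starting directory first, then each ancestor in turn.
    for directory in (start, *start.parents):
        if (directory / marker).is_file():
            return directory
    return None  # no marker file found anywhere up the tree
```

Calling `find_project_root(Path.cwd())` from a notebook inside the project would then return the directory that holds `__main__.py`, or `None` if there is none.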

### Metadata

Logging can be configured via the `metadata.yml` file. The logging section of the metadata lets you define three types of handlers: a stdout handler, a file handler, and a Kafka handler. The configuration details are shown below:

```
loggers:
    root:
        severity: info

    datafaucet:
        name: dfc
        stdio:
            enable: true
            severity: notice
        file:
            enable: true
            severity: notice
        kafka:
            enable: false
            severity: info
            hosts:
                - kafka-node1:9092
                - kafka-node2:9092
            topic: dfc
```

### Logs

Logging via datafaucet supports 5 levels:
  - info
  - notice
  - warning
  - error
  - fatal
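
Python's standard `logging` module has no `notice` level. The sketch below shows one way such a level can be registered between `INFO` (20) and `WARNING` (30) using only the standard library. The numeric value 25 and the helper name `notice` are assumptions for illustration, not datafaucet's actual implementation.

```python
import io
import logging

NOTICE = 25  # assumed value, between logging.INFO (20) and logging.WARNING (30)
logging.addLevelName(NOTICE, "NOTICE")


def notice(logger: logging.Logger, msg: str, *args, **kwargs) -> None:
    """Log `msg` at the custom NOTICE level (hypothetical helper)."""
    if logger.isEnabledFor(NOTICE):
        logger.log(NOTICE, msg, *args, **kwargs)


# Demo: capture output in a buffer so the result is easy to inspect.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setFormatter(logging.Formatter("%(levelname)s:%(name)s:%(message)s"))

demo = logging.getLogger("notice_demo")
demo.setLevel(NOTICE)    # INFO (20) is filtered out, NOTICE (25) passes
demo.propagate = False   # keep the demo isolated from the root logger
demo.addHandler(handler)

demo.info("dropped")
notice(demo, "kept")
print(buffer.getvalue(), end="")  # NOTICE:notice_demo:kept
```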

#### No project metadata loaded
Logging works without loading any project metadata, but in that case it uses the default configuration of the Python root logger. By default, the `debug`, `info` and `notice` levels are filtered out. To enable the full functionality, including logging to Kafka and logging custom project information (session id, username, etc.), you must load a project first.

In [2]:
log.debug('debug')
log.info('info')
log.notice('notice')
log.warning('a warning message')
log.error('this is an error')
log.critical('critical condition')

this is an error
critical condition
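
For background, an unconfigured root logger in the Python standard library has a threshold of `WARNING`, which is why lower levels are dropped. The sketch below uses only the stdlib `logging` module (not datafaucet) and a hypothetical logger name to show which standard levels pass that default threshold.

```python
import logging

# A fresh logger with no level of its own inherits the root logger's level.
logger = logging.getLogger("default_threshold_demo")  # hypothetical name

levels = {
    "debug": logging.DEBUG,
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}

# isEnabledFor reports whether a record at that level would be processed.
enabled = {name: logger.isEnabledFor(level) for name, level in levels.items()}
for name, passes in enabled.items():
    print(f"{name:<8} {'emitted' if passes else 'filtered out'}")
```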


#### Loading a metadata profile
If a logging configuration is loaded, extra functionality becomes available. In particular, log records include datafaucet-specific information, such as the session id, and data can be passed as a dictionary, optionally with a custom message.

In [3]:
dfc.project.load()

NOTICE:dfc:project.py:load Engine created SparkEngine
NOTICE:dfc:engines.py:Engine Init engine "spark"
NOTICE:dfc:project.py:load Connecting to spark master: local[*]
NOTICE:dfc:project.py:load Engine context spark:2.4.4 successfully started


<datafaucet.project.Project at 0x7f0a35d5b240>

In [4]:
log.debug('debug')
log.info('info')
log.notice('notice')
log.warning('a warning message')
log.error('this is an error')
log.critical('critical condition')

NOTICE:dfc:interactiveshell.py:run_cell_async notice
ERROR:dfc:interactiveshell.py:run_cell_async this is an error
CRITICAL:dfc:interactiveshell.py:run_cell_async critical condition


In [5]:
# custom message
dfc.logging.notice('hello world')

NOTICE:dfc:interactiveshell.py:run_cell_async hello world


In [6]:
# *args similar to print
dfc.logging.warning('message', 'can have', 'multiple parts', 'and', 'types:', dfc.__name__, 'is a', type(dfc))



In [7]:
# add custom data as a dictionary
dfc.logging.warning('custom data + message', extra={'test_value':42})



In [8]:
# the extra dictionary is not shown in stdout, but it does appear in the file (JSONL format) and Kafka log messages
!tail -n 1 dfc.log | jq .

{
  "@timestamp": "2019-12-02T07:05:58.996794",
  "sid": "0x1f56610614d211ea",
  "repohash": "d0d1774",
  "reponame": "datalabframework.git",
  "username": "natbusa",
  "filepath": "../../../../miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py",
  "funcname": "interactiveshell.py:run_cell_async",
  "message": "custom data + message",
  "data": {
    "test_value": 42
  }
}
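
To illustrate how a dictionary can end up under a `data` key in a JSON Lines record like the one above, here is a sketch built only on the standard library. The `JsonLineFormatter` class and the convention of passing the payload as `extra={'data': ...}` are assumptions for illustration; datafaucet's own handler, and the way it treats its `extra=` argument, may work differently.

```python
import io
import json
import logging
from datetime import datetime


class JsonLineFormatter(logging.Formatter):
    """Render each record as one JSON object per line (hypothetical sketch)."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "@timestamp": datetime.fromtimestamp(record.created).isoformat(),
            "severity": record.levelname,
            "funcname": record.funcName,
            "message": record.getMessage(),
            # stdlib `extra` keys become record attributes; read `data` if set
            "data": getattr(record, "data", None),
        }
        return json.dumps(entry)


buffer = io.StringIO()  # stands in for a log file such as dfc.log
handler = logging.StreamHandler(buffer)
handler.setFormatter(JsonLineFormatter())

logger = logging.getLogger("jsonl_demo")
logger.setLevel(logging.WARNING)
logger.propagate = False  # keep the demo isolated from the root logger
logger.addHandler(handler)

logger.warning("custom data + message", extra={"data": {"test_value": 42}})

# Each line of the buffer is an independent JSON document.
parsed = json.loads(buffer.getvalue().splitlines()[-1])
print(parsed["message"], parsed["data"])
```

Because every record is a complete JSON object on its own line, tools such as `tail` and `jq` can process the file one record at a time.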


In [9]:
# from a function

def my_nested_function():
    log.warning('another message')
    log.error('custom', extra=[1, 2, 3])

def my_function():
    log.notice(extra={'a': 'text', 'b': 2})
    my_nested_function()

my_function()

NOTICE:dfc:interactiveshell.py:run_ast_nodes 
ERROR:dfc:interactiveshell.py:run_code custom


In [10]:
!tail -n 3 dfc.log | jq .

{
  "@timestamp": "2019-12-02T07:06:00.308861",
  "severity": "NOTICE",
  "sid": "0x1f56610614d211ea",
  "repohash": "d0d1774",
  "reponame": "datalabframework.git",
  "username": "natbusa",
  "filepath": "../../../../miniconda3/lib/python3.7/site-packages/ipykernel_launcher.py",
  "funcname": "interactiveshell.py:run_ast_nodes",
  "message": "",
  "data": {
    "a": "text",
    "b": 2
  }
}
{
  "@timestamp":