# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

In [1]:
import datafaucet as dfc
from datafaucet import logging as log

## Logging

A core idea here is to keep configuration and code in separate files. A project is mostly about setting the correct working directories in which to run and find your notebooks, Python files, and configuration files. When a datafaucet project is loaded, it starts by searching for a `__main__.py` file, following Python module file naming conventions. When such a file is found, its directory is set as the root path of the project. All module and alias paths are relative to the project root path.
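The root lookup described above can be pictured with a short sketch. This is not datafaucet's actual implementation: the function name `find_project_root` is hypothetical, and the upward search through parent directories is an assumption, since the text only says that the directory containing `__main__.py` becomes the project root.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "__main__.py") -> Optional[Path]:
    """Return the nearest directory at or above `start` that contains `marker`.

    Hypothetical helper: the upward search order is an assumption.
    """
    start = start.resolve()
    # Check the starting directory first, then each parent up to the filesystem root.
    for candidate in (start, *start.parents):
        if (candidate / marker).is_file():
            return candidate
    return None
```

Calling `find_project_root(Path.cwd())` from a notebook directory would then give the directory against which relative module and alias paths could be resolved, or `None` when no marker file is found.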

### Metadata

Logging can be configured via the `metadata.yml` file. The logging section of the metadata lets you define three types of handlers: a stdout handler, a file handler, and a Kafka handler. The configuration details are shown below:

```
loggers:
    root:
        severity: info

    datafaucet:
        name: dfc
        stdio:
            enable: true
            severity: notice
        file:
            enable: true
            severity: notice
        kafka:
            enable: false
            severity: info
            hosts:
                - kafka-node1:9092
                - kafka-node2:9092
            topic: dfc
```

### Logs

Logging via datafaucet supports 5 levels:
  - info
  - notice
  - warning
  - error
  - fatal
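Of the levels listed above, `notice` has no counterpart in Python's standard `logging` module (where `FATAL` is an alias of `CRITICAL`). The sketch below shows one way such a level could be registered with the standard library. The numeric value 25 and the helper name `notice` are assumptions for illustration, not datafaucet's actual definitions.

```python
import logging

# Assumed value: placed between INFO (20) and WARNING (30).
NOTICE = 25
logging.addLevelName(NOTICE, "NOTICE")


def notice(logger: logging.Logger, msg: str, *args, **kwargs) -> None:
    """Log `msg` at the custom NOTICE level (hypothetical helper)."""
    logger.log(NOTICE, msg, *args, **kwargs)


# A demo logger whose threshold drops debug and info but keeps notice and above.
demo = logging.getLogger("notice_demo")
demo.setLevel(NOTICE)
```

With this threshold, `demo.info(...)` is discarded while `notice(demo, ...)` and `demo.warning(...)` pass through, which mirrors how a handler configured with `severity: notice` would be expected to behave.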

#### No project metadata loaded
Logging works without loading any project metadata, but in that case it uses the default configuration of the Python root logger. By default, the `debug`, `info`, and `notice` levels are filtered out. To enable the full functionality, including logging to Kafka and logging custom project information (session id, username, etc.), you must load a project first.

In [2]:
log.debug('debug')
log.info('info')
log.notice('notice')
log.warning('a warning message')
log.error('this is an error')
log.critical('critical condition')

this is an error
critical condition


#### Loading a metadata profile
If a logging configuration is loaded, extra functionality becomes available. In particular, log records include datafaucet-specific information, such as the session id, and data can be passed as a dictionary, optionally with a custom message.

In [3]:
dfc.project.load()

created SparkEngine
Init engine "spark"


could not autodetect driver to install for file, version None
could not autodetect driver to install for hdfs, version 3.2.1


Configuring packages:
  -  com.microsoft.sqlserver:mssql-jdbc:6.4.0.jre8
  -  com.oracle.ojdbc:ojdbc8:12.2.0.1
  -  mysql:mysql-connector-java:8.0.12
  -  org.apache.hadoop:hadoop-aws:3.2.1
  -  org.postgresql:postgresql:42.2.5
  -  ru.yandex.clickhouse:clickhouse-jdbc:0.1.54
Configuring conf:
  -  spark.hadoop.fs.s3a.access.key : ****** (redacted)
  -  spark.hadoop.fs.s3a.endpoint : http://minio:9000
  -  spark.hadoop.fs.s3a.impl : org.apache.hadoop.fs.s3a.S3AFileSystem
  -  spark.hadoop.fs.s3a.path.style.access : true
  -  spark.hadoop.fs.s3a.secret.key : ****** (redacted)
Connecting to spark master: local[*]


Could not start the engine context


Java gateway process exited before sending its port number


<datafaucet.project.Project at 0x7efc75776278>

In [4]:
# custom message
dfc.logging.notice('hello world')

NOTICE:dfc:run_code hello world


In [5]:
# custom data
dfc.logging.warning({'test_value':42})



In [6]:
# custom data and message
dfc.logging.warning('custom message', extra={'more':123})



In [7]:
# from a function

def my_nested_function():
    log.warning('another message')
    log.error('custom',extra=[1,2,3])
    
def my_function():
    log.notice({'a':'text', 'b':2})
    my_nested_function()
    
my_function()

NOTICE:dfc:my_function data
ERROR:dfc:my_nested_function custom
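
The behaviour shown in this section (a calling function name in each line, a session id, and optional dictionary data) can be approximated with the standard `logging` module alone. The sketch below is not how datafaucet implements it: `SessionFilter`, `build_logger`, the logger name `session_demo`, and the `data` field are hypothetical names chosen for illustration.

```python
import io
import logging
import uuid


class SessionFilter(logging.Filter):
    """Attach a session id (and a default `data` field) to every record."""

    def __init__(self, session_id: str):
        super().__init__()
        self.session_id = session_id

    def filter(self, record: logging.LogRecord) -> bool:
        record.session_id = self.session_id
        # Records logged without extra={'data': ...} get an empty default.
        if not hasattr(record, "data"):
            record.data = None
        return True


def build_logger(stream, session_id: str) -> logging.Logger:
    """Return a logger that writes session-aware lines to `stream`."""
    logger = logging.getLogger("session_demo")
    logger.handlers.clear()
    logger.setLevel(logging.DEBUG)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    # %(funcName)s is filled in by the standard library with the caller's name.
    handler.setFormatter(logging.Formatter(
        "%(levelname)s:%(name)s:%(funcName)s [%(session_id)s] %(message)s data=%(data)s"
    ))
    handler.addFilter(SessionFilter(session_id))
    logger.addHandler(handler)
    return logger


def my_function(logger: logging.Logger) -> None:
    # With the standard library, `extra` must be a dictionary.
    logger.warning("custom message", extra={"data": {"more": 123}})


buffer = io.StringIO()
my_function(build_logger(buffer, session_id=uuid.uuid4().hex))
```

Attaching the filter to the handler rather than to the logger keeps the session id available for every record that handler emits, whichever function produced it.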
