# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation and forecasting.

In [1]:
import datafaucet as dfc

## Metadata

Project configuration is done by loading a profile from a collection of metadata files. Metadata files can be located anywhere under the root path of the given project. Configuring datafaucet with metadata is completely optional (engine, resources and logging can all be initialized without it), but it is quite handy, especially if you need to deal with multiple profiles (dev, test, prod).

### How to use it

Metadata has three methods: `load()`, `info()` and `profile()`, as shown below:

In [2]:
dfc.metadata.load()

In [3]:
dfc.metadata.info()

files:
  - /home/natbusa/Projects/datafaucet/datafaucet/schemas/default.yml
  - /home/natbusa/Projects/databox/demos/tutorial/demo/metadata.yml
profiles:
  - default
  - dev
  - prod
  - stage
  - test
active: default

In [4]:
profile = dfc.metadata.profile()
#profile

You can use the dfc.metadata.profile() anywhere in your code.

The rest of this document explains the standard sections of the metadata profile,  
which are used to configure resources, engine and logging.

### Metadata files
    
Metadata configuration can be split into multiple files as long as they end with `metadata.yml`. For example: `metadata.yml`, `abc.metadata.yaml`, `abc_metadata.yml` are all valid metadata file names.


All metadata files in all subdirectories from the project root directory are loaded, unless the directory contains a file `metadata.ignore.yml`
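The discovery rule above can be sketched with the standard library alone. This is an illustration, not datafaucet's own code: `find_metadata_files` is a hypothetical helper, accepting both the `.yml` and `.yaml` endings is an assumption based on the example file names, and so is the choice to skip the whole subtree below a directory that holds `metadata.ignore.yml`.

```python
import os

# Assumption: both endings are accepted, as suggested by the example file names.
METADATA_SUFFIXES = ("metadata.yml", "metadata.yaml")
IGNORE_MARKER = "metadata.ignore.yml"


def find_metadata_files(root):
    """Return the sorted paths of all metadata files found under `root`."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        if IGNORE_MARKER in filenames:
            # Assumption: the marker also hides everything below this directory.
            dirnames[:] = []
            continue
        for name in filenames:
            if name.endswith(METADATA_SUFFIXES):
                found.append(os.path.join(dirpath, name))
    return sorted(found)
```

Calling `find_metadata_files(".")` from the project root would return every candidate file in a stable order, which keeps the later merge step deterministic.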


Metadata files can provide multiple profile configurations. Each profile is a _bare document_ within the same yaml file, and documents are separated by a line containing three hyphens `---` (see https://yaml.org/spec/1.2/spec.html#YAML).


As described above, each profile can be split across multiple yaml files. When the metadata files are loaded, all configuration belonging to the same profile will be merged.

All metadata profiles inherit the settings from profile `default`

### Metadata sections
    
Metadata files are composed of 6 sections:

```yaml
  - profile 
  - variables
  - providers 
  - resources
  - engine
  - loggers
```

### Metadata sections: `profile`
A metadata configuration supports multiple profiles. 
The following profiles are pre-defined and canned in each default metadata configuration

 - `default`
 - `prod`
 - `stage`
 - `test`
 - `dev`

You can extend the above profiles with extra settings, or define new custom profiles. 


By loading a different profile you can define a different configuration for your data resources, 
without having to modify your code. For instance, you can set up the files to be saved on local 
disk for testing and in hdfs for production, as described in the snippet below:

```yaml
    ---
    profile: default
    providers:
        processed_data:
            service: local
            path: data
            format: parquet
    ---
    profile: prod
    providers:
        processed_data:
            service: hdfs
            hostname: hdfs-namenode
            path: /prod/data
    ---
    profile: test
```

In the above example, the profiles `test` and `default` share the same configuration, while the profile `prod` defines the provider alias `processed_data` as an hdfs location.
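The merge and inheritance behaviour can be illustrated with plain dictionaries standing in for the parsed yaml documents. This is a minimal sketch under stated assumptions: `deep_merge` and `resolve_profiles` are hypothetical helpers, and the recursive (key by key) merge is an assumption, since the text only says that configurations are merged and that every profile inherits from `default`.

```python
import copy


def deep_merge(base, override):
    """Return a new dict; values in `override` win and nested dicts merge recursively."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


def resolve_profiles(documents):
    """Merge documents that share a profile name, then layer each profile over `default`."""
    by_profile = {}
    for doc in documents:
        name = doc.get("profile", "default")
        by_profile[name] = deep_merge(by_profile.get(name, {}), doc)
    default = by_profile.get("default", {})
    return {name: deep_merge(default, doc) for name, doc in by_profile.items()}


# The three yaml documents from the example above, written as Python dicts.
documents = [
    {"profile": "default",
     "providers": {"processed_data": {"service": "local", "path": "data", "format": "parquet"}}},
    {"profile": "prod",
     "providers": {"processed_data": {"service": "hdfs", "hostname": "hdfs-namenode", "path": "/prod/data"}}},
    {"profile": "test"},
]
profiles = resolve_profiles(documents)
print(profiles["test"]["providers"])
print(profiles["prod"]["providers"])
```

Under the recursive-merge assumption, `prod` keeps `format: parquet` from `default` while overriding `service` and `path`. A merge that replaced the whole alias would drop the inherited `format`.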


You can also use profiles to define different configurations for the spark engine or different logging options. Below is an example of a default configuration which uses a local spark setup in test/dev, while using a spark cluster for the prod and stage profiles:

```yaml
    ---
    profile: default
    engine:
        type: spark
        master: local[*]
    ---
    profile: prod
    engine:
        type: spark
        master: spark://spark-prod-cluster:17077
    ---
    profile: stage
    engine:
        type: spark
        master: spark://spark-stage-cluster:17077
    ---
    profile: test
```

### Metadata sections: `variables`

The `variables` section in the profile allows you to define variables and 
reuse them in other parts of the configuration. Datafaucet yaml files 
support jinja2 templates for variable substitution. The template rendering 
is performed only once, when the project is loaded.

Below is an example of a `variables` section and 
how to use it in the rest of the configuration:

```yaml
    ---
    profile: default
    variables: 
      a: hello
      b: "{{ variables.a}} world"
      c: "{{ env('SHELL') }}"
      d: "{{ env('ENV_VAR_NOT_DEFINED', 'foo') }}"
      e: "{{ now() }}"
      f: "{{ now(tz='UTC', format='%Y-%m-%d %H:%M:%S') }}"

      my_string_var: "Hi There!"
      my_env_var: "{{ env('DB_USERNAME', 'guest') }}"
      my_concat_var: "{{ engine.type }} running at {{ engine.master }}"

    ---
```

The above metadata profile will be rendered as:

```yaml
    ---
    profile: default
    variables:
        a: hello
        b: hello world
        c: /bin/bash
        d: foo
        e: '2019-03-27 08:42:00'
        f: '2019-03-27 08:42:00'
        my_string_var: Hi There!
        my_env_var: guest
        my_concat_var: spark running at local[*]
    ---
```

Note that:

 - variables can be defined in multiple profiles

 - the variables section in a given profile always  
   inherits the variables from the `default` profile

 - a maximum of 5 rendering passes is allowed

 - values including a jinja template context must always be quoted,  
   as in my_var: `"{{ ... }}"`
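The reason for multiple rendering passes is that a variable may reference another variable that is itself a template. The sketch below illustrates the idea with a deliberately simplified stand-in for jinja2: a regular expression that only understands `{{ dotted.path }}` lookups. `render` and `lookup` are hypothetical names, and the early stop when nothing changes is an assumption.

```python
import re

PLACEHOLDER = re.compile(r"\{\{\s*([\w.]+)\s*\}\}")
MAX_PASSES = 5  # the limit stated in the notes above


def lookup(config, dotted_path):
    """Walk the config from its root following a dotted path such as 'engine.type'."""
    node = config
    for key in dotted_path.split("."):
        node = node[key]
    return node


def render_once(node, root):
    """Substitute placeholders in every string of `node` using values from `root`."""
    if isinstance(node, dict):
        return {key: render_once(value, root) for key, value in node.items()}
    if isinstance(node, list):
        return [render_once(value, root) for value in node]
    if isinstance(node, str):
        return PLACEHOLDER.sub(lambda match: str(lookup(root, match.group(1))), node)
    return node


def render(config, max_passes=MAX_PASSES):
    """Render repeatedly until nothing changes or the pass limit is reached."""
    for _ in range(max_passes):
        rendered = render_once(config, config)
        if rendered == config:
            break
        config = rendered
    return config


profile = {
    "engine": {"type": "spark", "master": "local[*]"},
    "variables": {
        "a": "hello",
        "b": "{{ variables.a }} world",
        "c": "{{ variables.b }}!",  # needs a second pass, because b is itself a template
        "my_concat_var": "{{ engine.type }} running at {{ engine.master }}",
    },
}
print(render(profile)["variables"])
```

Here `c` resolves only on the second pass, which is why a single pass would not be enough for chained variables.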

#### Accessing configuration values in a jinja template:

Yaml object values can be referenced in the jinja template using the . notation.
To access the data item, provide the path from the root of the profile. 
For instance the provider `processed_data` format in the example above can be referenced as: `providers.processed_data.format`

Jinja rendering operations:
Please refer to [placeholder for jinja url] for a list of operators on jinja variables

#### Metadata Jinja functions:

On top of the default set of operations, two functions can be used inside a jinja rendering context:

`def env(env_var, default_value='null')`  
renders in the template the value of the environment variable `env_var`, or `default_value` (`null` unless specified) if the variable is not set. This function can be useful (also in combination with a .env file) to avoid hard-coding passwords and other login/auth data in the metadata configuration. Note that this setup is meant for convenience and not for security.

Example:

```yaml
    my_env_var: "{{ env('DB_USERNAME', 'guest') }}"
```
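For intuition, the behaviour described for `env` matches a thin wrapper over `os.environ`. The following is a sketch that reuses the signature shown above; it is not taken from the datafaucet source.

```python
import os


def env(env_var, default_value="null"):
    """Return the value of the environment variable, or the default when it is unset."""
    return os.environ.get(env_var, default_value)


# Falls back to 'guest' unless DB_USERNAME is exported in the shell.
print(env("DB_USERNAME", "guest"))
```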


`def now(tz='UTC', format='%Y-%m-%d %H:%M:%S')`  
renders the current system datetime; optionally, a different timezone and string format can be specified. This function can be useful if you want to execute code on a time window relative to the current time. Example:

```yaml
    utc_now: "{{ now(tz='UTC', format='%Y-%m-%d %H:%M:%S') }}"
```
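A comparable function can be written with the standard library, as sketched below. The signature mirrors the one shown above, while the use of `zoneinfo` (Python 3.9+) to resolve timezone names is an assumption about how such a helper could work, not a description of datafaucet internals.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo  # Python 3.9+


def now(tz="UTC", format="%Y-%m-%d %H:%M:%S"):
    """Return the current time in timezone `tz`, formatted with `format`."""
    # UTC is handled without a timezone database lookup.
    zone = timezone.utc if tz == "UTC" else ZoneInfo(tz)
    return datetime.now(zone).strftime(format)


print(now())
print(now(format="%Y-%m-%d"))
```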

### Metadata sections: `resources` and `providers`
    
A provider is a service which allows you to load and save data. Datafaucet extends the spark load/save API calls by decoupling the provider configuration from the code.


The `providers` section in the metadata allows you to define an arbitrary number of providers. A provider is declared as an alias defining a set of properties. See the example below:
    
```yaml
    profile: default

    providers:
        my_provider:
            service: hdfs
            hostname: hdfs-namenode
            path: /foo
            format: parquet
```

Here below the list of valid properties you can define for a provider:
    
`service`:  
The service which is going to be used to load/save data.  
Supported services:
```yaml
    - minio
    - hdfs
    - local
    - mysql
    - postgres
    - oracle
    - mssql
    - sqlite
```

`format`:  
The format used for reading and writing data. Default is 'parquet' for all filesystem and object store services. For other types of services, such as databases (sql, nosql, newsql), this property is ignored. Supported formats: 

```yml
    - jdbc
    - nosql
    - csv
    - parquet
    - json
    - jsonl
```

`host`, `hostname`:   
The IP address or the DNS name of the host providing the service.  
Default is `127.0.0.1`.

`port`:   
The port number of the host providing the service.  
The default port depends on the service according to the following table:

```yml
    hdfs: 8020
    mysql: 3306
    postgres: 5432
    mssql: 1433
    oracle: 1521
    elastic: 9200
    minio: 9000
```

`database`:  
The database name from the selected jdbc service

`path`:  
The root path used to save/load data resources. If the path is a fully qualified url (such as `hdfs://data.cluster.local/foo/bar`), it will be used as is.

For jdbc connectors and rdbms services, if no database is provided, the path defines the database name. If both the database and the path are provided, the provider's path defines the database schema.

`url`:  
If the url is provided directly, it will be used as such. Otherwise the url will be assembled using the following properties:
   
   - `service`
   - `host`
   - `port`
   - `database`
   - `path`
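As an illustration of how these properties could combine, here is a sketch covering only `hdfs`, `mysql` and `postgres`. `build_url` is a hypothetical helper; the default host and ports come from the tables above, the `jdbc:mysql://` and `jdbc:postgresql://` prefixes are the standard JDBC url schemes, and the exact url layout datafaucet produces is an assumption.

```python
DEFAULT_HOST = "127.0.0.1"
DEFAULT_PORTS = {"hdfs": 8020, "mysql": 3306, "postgres": 5432}
JDBC_SCHEMES = {"mysql": "jdbc:mysql", "postgres": "jdbc:postgresql"}


def build_url(provider):
    """Assemble a url from provider properties, unless one is given explicitly."""
    if provider.get("url"):
        return provider["url"]
    path = provider.get("path", "")
    if "://" in path:
        # A fully qualified path is used as is.
        return path
    service = provider["service"]
    host = provider.get("host") or provider.get("hostname") or DEFAULT_HOST
    port = provider.get("port", DEFAULT_PORTS[service])
    if service in JDBC_SCHEMES:
        # Without an explicit database, the path names the database.
        database = provider.get("database") or path.strip("/")
        return f"{JDBC_SCHEMES[service]}://{host}:{port}/{database}"
    if service == "hdfs":
        return f"hdfs://{host}:{port}/{path.lstrip('/')}"
    raise ValueError(f"service not covered by this sketch: {service}")


print(build_url({"service": "hdfs", "hostname": "hdfs-namenode", "path": "/foo"}))
print(build_url({"service": "mysql", "path": "sales"}))
```

When both `database` and `path` are given for an rdbms service, the path would select the schema instead, which this sketch leaves out of the url.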

`username`, `password`:  
The credentials used to authenticate with the given provider.

`cache`:  
Cache the data, before saving or after loading

`date_column`:   
define a column in the dataframe to be the date column (for faster read/write)  

`date_start`:  
filter data according to this start date/datetime for the `date_column`

`date_end`:  
filter data according to this end date/datetime for the `date_column`

`date_window`:  
in combination with either `date_end` or `date_start`, 
it defines a filter interval for the `date_column`. 
If defined and valid, this is implicitly applied when loading and saving data.
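The relation between the three date properties can be sketched as follows. The representation of the window as a `timedelta`, the half-open interval (start inclusive, end exclusive) and the helper names `date_interval` and `in_interval` are all assumptions made for illustration; the text above does not specify them.

```python
from datetime import datetime, timedelta


def date_interval(date_start=None, date_end=None, date_window=None):
    """Derive the missing bound from the window; a bound stays None if it cannot be derived."""
    if date_window is not None:
        if date_start is not None and date_end is None:
            date_end = date_start + date_window
        elif date_end is not None and date_start is None:
            date_start = date_end - date_window
    return date_start, date_end


def in_interval(value, start, end):
    """Half-open check: start <= value < end, with None meaning unbounded."""
    return (start is None or value >= start) and (end is None or value < end)


start, end = date_interval(date_end=datetime(2019, 3, 27), date_window=timedelta(days=7))
print(start, end)
print(in_interval(datetime(2019, 3, 25), start, end))
```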

`date_partition`:  
`update_column`:  
`hash_column`:  
`state_column`:  
Add special columns to the dataframe.

`options`:  
Extra options, as defined in the selected engine for load/save

### Metadata sections: `engine`
    
This metadata section defines the configuration of the engine used to process data.
The following properties can be defined:

`type`:  
Engine type. Currently, support is limited to the option `spark`.

`master`:  
The url of the spark master (e.g. `spark://23.195.26.187:7077`)

`timezone`:  
This option allows spark to interpret the datetime data as belonging 
to a different timezone than the one provided by the machine defaults.

When set to `naive`, each datetime object is interpreted as 'naive' and 
no timezone translation is executed. This option is equivalent 
to setting the `timezone` parameter to 'UTC'.

`detect`:  
If `detect` is set to true, some packages and jars will be added automatically, depending on the providers declared in the metadata configuration; for instance, the jdbc drivers for the declared databases.

This parameter also sets some server configurations, depending on the cluster specs, to improve script execution.

`submit`:  
A section which allows you to define and add a number of files during the engine's initialization. Files are declared as belonging to one of several groups (`jars`, `packages` and `py-files` in the example further down). Packages must contain three parts, separated by a colon `:`, according to the java ivy package dependency conventions.

`config`:  
A list of custom configurations, defined as key, value pairs. For example, check out the list of valid Spark 2.4.0 configurations as provided at https://spark.apache.org/docs/2.4.0/configuration.html


Here below an example summarizing all the given engine settings:

```yml
    type: spark
    master: "local[*]"
    timezone: naive
    submit:
        jars:
            - jarname_1
            - ...
        packages:
            - package_1
            - ...
        py-files:
            - pyfile_1
        config:
            key1: value1
            key2: value2
            # ...
```
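To relate these settings to familiar Spark tooling, the sketch below translates an engine section into `spark-submit` style arguments. The flags `--master`, `--jars`, `--packages`, `--py-files` and `--conf` are standard `spark-submit` options; `submit_args` is a hypothetical helper, and whether datafaucet passes the settings this way is an assumption. The jar names and the package coordinate are illustrative values only.

```python
def submit_args(engine):
    """Translate an engine section (as a dict) into spark-submit style arguments."""
    submit = engine.get("submit", {})
    args = ["--master", engine["master"]]
    # Each file group becomes one flag with a comma-separated value list.
    for flag, key in (("--jars", "jars"), ("--packages", "packages"), ("--py-files", "py-files")):
        values = submit.get(key)
        if values:
            args += [flag, ",".join(values)]
    # Each configuration entry becomes its own --conf key=value pair.
    for key, value in submit.get("config", {}).items():
        args += ["--conf", f"{key}={value}"]
    return args


engine = {
    "type": "spark",
    "master": "local[*]",
    "submit": {
        "jars": ["a.jar", "b.jar"],                        # illustrative names
        "packages": ["org.postgresql:postgresql:42.2.5"],  # group:artifact:version
        "config": {"spark.executor.memory": "2g"},
    },
}
args = submit_args(engine)
print(" ".join(args))
```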

### Metadata sections: `logging`

This section defines the logging configuration.