# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

Loading the default profile

In [1]:
import datafaucet as dfc

# load the metadata
dfc.metadata.load()

In [3]:
isinstance(None, type(None))

True

## Project Resources

Data binding works with the metadata files. It is good practice to declare the actual bindings in the metadata and avoid hardcoding paths in notebooks and Python source files.

Project resources are defined by an alias. For resources sharing the same provider service, it is recommended to define the common metadata configuration in the provider and leave the specific resource detail in the resource metadata. 

### Resource metadata
    
Resource metadata configuration has the same YAML properties as the provider, plus a `provider` property that refers to the corresponding provider alias. The bare minimum for a resource is its `path`, which represents a hierarchical path in a file system or object store. For databases, the path can represent either a table or a query.


Resource metadata is assembled from the resource and provider configuration. If the resource alias is not defined, the resource is assumed to be implicitly defined, and the alias is treated as the path of the resource.

The full metadata configuration of a resource can be inspected with `project.resource(...)`
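
The assembly rule described above can be illustrated with plain dictionaries. This is a minimal sketch, not datafaucet's implementation: the `METADATA` layout, the `assemble` helper, and the choice to join the provider path with the resource path are all assumptions made for illustration.

```python
# Hypothetical metadata, mimicking the provider/resource split described above.
METADATA = {
    "providers": {
        "localfs": {"service": "file", "format": "csv", "path": "data"},
    },
    "resources": {
        "ascombe": {
            "provider": "localfs",
            "path": "ascombe.csv",
            "options": {"header": True, "inferSchema": True},
        },
    },
}


def assemble(alias, metadata=METADATA):
    """Merge provider defaults with resource-specific settings (illustrative only)."""
    resource = metadata["resources"].get(alias)
    if resource is None:
        # Alias not declared: treat it as the path of an implicit resource.
        return {"path": alias}

    provider = metadata["providers"].get(resource.get("provider"), {})
    merged = {**provider, **resource}  # resource values override the provider's
    merged.pop("provider", None)

    # Assumption: the provider path is a base directory for the resource path.
    if "path" in provider and "path" in resource:
        merged["path"] = provider["path"].rstrip("/") + "/" + resource["path"]
    return merged


print(assemble("ascombe"))
print(assemble("data/other.csv"))
```

Keeping shared settings in the provider means a new resource only needs to state what differs, typically just its `path`.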

### Unified resource object
To simplify copying data from one format and provider to another, we define a dictionary-like object that can describe various types of resources.
Here are a couple of examples:

#### Local file resource
The following calls all define the same resource metadata: each one produces the same resource dictionary.

In [None]:
def mask_rootdir(resource):
    d = resource.copy()
    if d['service']=='file':
        d['url'] = '<project_rootdir>/' + dfc.utils.relpath(d['url'], dfc.rootdir())
    return d

def equal(resource, ref):
    r = dfc.resources.hash(ref)
    s = dfc.resources.hash(mask_rootdir(resource))
    return r == s

In [2]:
ref = dfc.yaml.YamlDict(
"""
hash: '0x7161cdea'
url: <project_rootdir>/data/ascombe.csv
service: file
version:
format: csv
host: 127.0.0.1
options:
    header: true
    inferSchema: true
""")

# via metadata yml resource
src = dfc.Resource('ascombe')
assert equal(src, ref)

# path and provider in metadata yml
src = dfc.Resource('ascombe.csv', 'localfs')
assert equal(src, ref)

# just local path (relative to the project root path)
src = dfc.Resource('data/ascombe.csv', header=True, inferSchema=True)
assert equal(src, ref)

# absolute path
path  = dfc.rootdir() + '/data/ascombe.csv'
src = dfc.Resource(path, header=True, inferSchema=True)
assert equal(src, ref)

# full uri
path  = f'file://{dfc.rootdir()}/data/ascombe.csv'
src = dfc.Resource(path, header=True, inferSchema=True)
assert equal(src, ref)

# from resource
src = dfc.Resource(src)
assert equal(src, ref)

mask_rootdir(src)

NameError: name 'dfc' is not defined
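
The calls above rely on relative paths, absolute paths, and `file://` URIs all resolving to the same location. A minimal sketch of that normalization follows, using only the standard library. The `local_path` helper and the example root directory are hypothetical, and POSIX-style paths are assumed; this is not how datafaucet itself resolves paths.

```python
import posixpath
from urllib.parse import urlsplit


def local_path(path, rootdir):
    """Resolve a relative path, absolute path, or file:// URI to one absolute path."""
    parts = urlsplit(path)
    if parts.scheme == "file":
        # file:///abs/path -> /abs/path
        path = parts.path
    if not posixpath.isabs(path):
        # Relative paths are interpreted against the project root directory.
        path = posixpath.join(rootdir, path)
    return posixpath.normpath(path)


root = "/home/user/project"  # hypothetical project root
for candidate in (
    "data/ascombe.csv",
    root + "/data/ascombe.csv",
    f"file://{root}/data/ascombe.csv",
):
    print(local_path(candidate, root))
```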

#### HDFS service

The following call defines a file (for example, a Parquet file) on an HDFS filesystem.  
Equivalent calls define the same resource dictionary; the default port is 8020 and the default HDFS version is 3.1.1.

In [14]:
dfc.Resource('hdfs://test/data.parquet')

hash: '0x17a27b8b'
url: hdfs://test:8020/data.parquet
service: hdfs
version: 3.2.1
format: parquet
host: test
port: 8020
options: {}
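
To show how a bare URL can expand into the `url`, `service`, `host`, and `port` fields printed above, here is a minimal sketch built on `urllib.parse`. The `parse_url` helper, the `DEFAULT_PORTS` table, and the `127.0.0.1` fallback host are assumptions for illustration, not datafaucet internals.

```python
from urllib.parse import urlsplit

# Default ports assumed for this sketch; 8020 is the HDFS default quoted above.
DEFAULT_PORTS = {"hdfs": 8020, "http": 80, "https": 443}


def parse_url(url):
    """Split a URL into service, host, and port, filling in default values."""
    parts = urlsplit(url)
    host = parts.hostname or "127.0.0.1"  # assumed fallback when no host is given
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    netloc = host if port is None else f"{host}:{port}"
    return {
        "url": f"{parts.scheme}://{netloc}{parts.path}",
        "service": parts.scheme,
        "host": host,
        "port": port,
    }


print(parse_url("hdfs://test/data.parquet"))
```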

#### Database table

The following calls define a table from a database.  
All calls define the same resource dictionary.

In [16]:
ref = dfc.yaml.YamlDict(
"""
hash: '0xab5696a2'
url: jdbc:mysql://mysql:3306/sakila
service: mysql
version: 8.0.12
format: jdbc
host: mysql
port: 3306
user: sakila
password: sakila
driver: com.mysql.cj.jdbc.Driver
database: sakila
schema: sakila
table: actor
options: {}
""")

# via metadata yml resource
src = dfc.Resource('actor', 'sakila')
assert(src==ref)

# via table and url
src = dfc.Resource('actor', 'jdbc:mysql://sakila:sakila@mysql:3306/sakila')
assert(src==ref)

# via table and url and connection details
src = dfc.Resource('actor', 'jdbc:mysql://mysql/sakila', password='sakila', user='sakila')
assert(src==ref)

# only connection details
src = dfc.Resource(
    database='sakila', 
    table='actor', 
    host='mysql', 
    service='mysql', 
    password='sakila', 
    user='sakila')
assert(src==ref)

# via db/table and connection details
src = dfc.Resource(
    'sakila/actor', 
    host='mysql', 
    service='mysql', 
    password='sakila', 
    user='sakila')
assert(src==ref)

# from resource
src = dfc.Resource(src)
assert(src==ref)

src

hash: '0x6f84aa5'
url: /home/natbusa/Projects/datafaucet/examples/tutorial/sakila/actor
service: file
version:
format:
host: 127.0.0.1
options: {}



AssertionError: 
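
The calls in this section pass connection details either inside the JDBC URL or as keyword arguments. The sketch below shows one way both styles could be reduced to the same dictionary. The `parse_jdbc_url` helper and its default-port table are hypothetical and only use the standard library; they do not reproduce datafaucet's own parsing.

```python
from urllib.parse import urlsplit

# Assumed default port for this sketch (3306 is the standard MySQL port).
DEFAULT_DB_PORTS = {"mysql": 3306}


def parse_jdbc_url(url, **overrides):
    """Extract connection details from a JDBC URL; keyword arguments take precedence."""
    if not url.startswith("jdbc:"):
        raise ValueError("expected a URL starting with 'jdbc:'")
    parts = urlsplit(url[len("jdbc:"):])  # drop the 'jdbc:' prefix before parsing
    details = {
        "service": parts.scheme,
        "host": parts.hostname,
        "port": parts.port or DEFAULT_DB_PORTS.get(parts.scheme),
        "database": parts.path.lstrip("/"),
        "user": parts.username,
        "password": parts.password,
    }
    details.update(overrides)
    return details


a = parse_jdbc_url("jdbc:mysql://sakila:sakila@mysql:3306/sakila")
b = parse_jdbc_url("jdbc:mysql://mysql/sakila", user="sakila", password="sakila")
print(a == b)
```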

#### Database queries

The following call extracts the same table as above, using a query.  
Passing a query can be helpful if you want to join tables or otherwise limit the amount of data extracted. Note that in some cases the engine can push projections (select) and predicates (where) down to the database even when they are applied after the table is loaded, thanks to lazy execution.

In [5]:
ref = dfc.yaml.YamlDict(
"""
hash: '0x9b72d98e'
url: jdbc:mysql://mysql:3306/sakila
service: mysql
version: 8.0.12
format: jdbc
host: mysql
port: 3306
user: sakila
password: sakila
driver: com.mysql.cj.jdbc.Driver
database: sakila
schema: sakila
table: ( select * from actor ) as _query
options: {}
""")

# via query, database, and connection details
src = dfc.Resource(
    'SELECT * FROM actor;',
    'sakila',
    host='mysql', 
    service='mysql', 
    password='sakila', 
    user='sakila')
assert(src==ref)
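
The reference dictionary above stores the query as a parenthesized derived table aliased `_query`. A minimal sketch of that wrapping is shown here; the `as_table_expression` helper is hypothetical, and unlike the reference output it keeps the query text exactly as written rather than changing its case.

```python
def as_table_expression(table_or_query, alias="_query"):
    """Wrap a SELECT statement as a derived table; leave plain table names untouched."""
    text = table_or_query.strip().rstrip(";").strip()
    first_word = text.split(None, 1)[0].lower() if text else ""
    if first_word == "select":
        # A JDBC reader expects something usable in a FROM clause.
        return f"( {text} ) as {alias}"
    return text


print(as_table_expression("SELECT * FROM actor;"))
print(as_table_expression("actor"))
```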


#### Other service providers

The following calls define resources from web services, file services, and object stores.

In [8]:
# from various services and protocols
path  = 'hdfs:///data/ascombe.csv'
src = dfc.Resource(path, header=True, inferSchema=True)
print(src)

path  = 'https://subdomain.example.com:88/data/ascombe.csv'
src = dfc.Resource(path, header=True, inferSchema=True)
print(src)

path  = 'http://data.example.org/data/ascombe.csv.gz'
src = dfc.Resource(path, header=True, inferSchema=True)
print(src)

path = 'https://raw.githubusercontent.com/natbusa/dfc-tutorial/master/data/examples/sample.csv'
src = dfc.Resource(path)
print(src)

hash: '0x57847e4d'
url: hdfs://127.0.0.1:8020/data/ascombe.csv
service: hdfs
version: 3.1.1
format: csv
host: 127.0.0.1
port: 8020
options:
    inferSchema: true
    header: true

hash: '0xa5649b1a'
url: https://subdomain.example.com:88/data/ascombe.csv
service: https
version:
format: csv
host: subdomain.example.com
port: 88
options:
    inferSchema: true
    header: true

hash: '0xbb98d5d9'
url: http://data.example.org:80/data/ascombe.csv.gz
service: http
version:
format: csv
host: data.example.org
port: 80
options:
    inferSchema: true
    header: true
    compression: gzip

hash: '0xaa13a118'
url: https://raw.githubusercontent.com:443/natbusa/dfc-tutorial/master/data/examples/sample.csv
service: https
version:
format: csv
host: raw.githubusercontent.com
port: 443
options: {}
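
In the outputs above, `format` and the `compression` option follow the file extension. A minimal sketch of such an inference is given below. The `infer_format` helper and its two lookup tables are assumptions for illustration and cover only a few extensions; they are not datafaucet's actual rules.

```python
import posixpath
from urllib.parse import urlsplit

# Extension tables assumed for this sketch.
COMPRESSION = {".gz": "gzip", ".bz2": "bzip2"}
FORMATS = {".csv": "csv", ".json": "json", ".parquet": "parquet"}


def infer_format(url):
    """Return (format, options) inferred from the file name in a URL or path."""
    name = posixpath.basename(urlsplit(url).path)
    stem, ext = posixpath.splitext(name)
    options = {}
    if ext in COMPRESSION:
        # Peel off the compression suffix, then look at the next extension.
        options["compression"] = COMPRESSION[ext]
        stem, ext = posixpath.splitext(stem)
    return FORMATS.get(ext), options


print(infer_format("http://data.example.org/data/ascombe.csv.gz"))
print(infer_format("hdfs://test/data.parquet"))
```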

