# Datafaucet

Basic example and directory structure.

## Elements

This ETL/Data Science scaffolding works with four elements,
which are coordinated with each other:

  - The introductory python notebook you are reading now (main.ipynb)
  - A directory structure for code and data processing (data)
  - The datafaucet python package (datafaucet)
  - Configuration files (metadata.yml, \__main__.py, Makefile)

## Principles

- **Both notebooks and code are first-class citizens**

In the source directory `src` you will find all source code. Both notebooks and code files are treated as source files. Source code is further partitioned and scaffolded into several directories to simplify and organize the data science project. Following Python package conventions, the root of the project is tagged by a `__main__.py` file, and directories contain `__init__.py` files. This way, Python and notebook files can reference each other.

Python notebooks and Python code can be mixed and matched, and are interoperable with each other. You can import a function from a notebook into Python code, and you can import Python files into a notebook.

- **Data Directories should not contain logic code**

Data can be located anywhere: on remote HDFS clusters, on object store services exposed via the S3 protocol, or on the local file system. For illustration purposes, this demo uses a local directory for data scaffolding.

Data and code are separated by moving all configuration to metadata files. Metadata files make it possible to define aliases for data resources, data services and Spark configurations, which keeps the ETL and ML code tidy, with no hardcoded parameters.

- **Decouple Code from Configuration**

Code, whether stored as notebooks or as Python files, should be decoupled from both engine configurations and data locations. All configuration is kept in `metadata.yml` YAML files. Multiple setups for test, exploration and production can be described in the same `metadata.yml` file, or in multiple separate files, using __profiles__. All profiles inherit from a default profile, which reduces duplication of configuration settings across profiles.
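
The inheritance rule can be pictured as a recursive dictionary merge. The following is a minimal sketch under that assumption, not datafaucet's actual implementation: the `deep_merge` helper, the `profiles` dictionary and the `prod` profile are hypothetical names chosen for illustration.

```python
import copy

def deep_merge(base, override):
    """Return a copy of `base` with `override` merged on top (hypothetical helper)."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # both sides are dictionaries: merge them key by key
            merged[key] = deep_merge(merged[key], value)
        else:
            # otherwise the profile value replaces the default value
            merged[key] = copy.deepcopy(value)
    return merged

# Hypothetical profiles, loosely modelled on a metadata.yml layout
profiles = {
    "default": {"engine": {"type": "spark", "context": {"master": "local[1]"}}},
    "prod": {"engine": {"context": {"master": "yarn"}}},
}

prod = deep_merge(profiles["default"], profiles["prod"])
print(prod)
```

In this sketch the `prod` profile only states what differs from `default`; every other setting is taken from the default profile.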

- **Declarative Configuration**

Metadata files are responsible for binding data and engine configurations to the code. For instance, all data in the code should be referenced by an alias, and storage and retrieval of data objects and files should happen via a common API. The metadata YAML file describes the providers for each data source, as well as the mapping of data aliases to their corresponding data objects.
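
To make the alias mapping concrete, here is a small sketch of how an alias could be resolved through its provider. It is an illustration only: the `resolve` function and the `/project` root are assumptions, and the dictionary simply mirrors the `providers`/`resources` layout of the metadata profile printed later in this notebook.

```python
import posixpath

# Hypothetical in-memory metadata, mirroring a providers/resources layout
metadata = {
    "providers": {
        "local_filesystem": {"service": "local", "format": "csv", "path": "data"},
    },
    "resources": {
        "ascombe": {"path": "ascombe.csv", "provider": "local_filesystem"},
    },
}

def resolve(alias, metadata, rootdir):
    """Map a resource alias to a location and format (illustrative helper)."""
    resource = metadata["resources"][alias]
    provider = metadata["providers"][resource["provider"]]
    # the resource path is relative to the provider path,
    # which is itself relative to the project root
    url = posixpath.join(rootdir, provider["path"], resource["path"])
    return {"url": url, "format": provider["format"], "service": provider["service"]}

info = resolve("ascombe", metadata, "/project")
print(info["url"])
```

The calling code only mentions the alias `ascombe`; moving the file or changing its format would require a change in the metadata, not in the code.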



## Project Template

The data science project is structured to facilitate the deployment of the artifacts and to make it easy to switch from batch processing to live experimentation. The top level of the project is composed of the following items:

### Top level Structure

```
├── binder
├── ci
├── data
├── resources
├── src
├── test
│
├── main.ipynb
├── versions.ipynb
│
├── __main__.py
├── metadata.yml
│
└── Makefile

```

## datafaucet

In [1]:
import datafaucet as dfc

### Package things
The package version is exposed through the package variables `version_info` and `__version__`.

In [2]:
dfc.version_info

(0, 8, 2)

In [3]:
dfc.__version__

'0.8.2'

Check if datafaucet is loaded in the current Python context:

In [4]:
try:
    __DATALOOF__
    print("the datafaucet is loaded")
except NameError:
    print("the datafaucet is not loaded")

the datafaucet is loaded


### Modules: project

The project module is all about setting the correct working directories in which to run and find your notebooks, Python files and configuration files.

When datafaucet is imported, it starts by searching for a `__main__.py` file, according to Python module file naming conventions. All module and alias paths are relative to this project root path.
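
One plausible way to perform such a search is to walk up the directory tree until the marker file is found. The sketch below assumes that strategy; `find_project_root` is a hypothetical helper, and the source does not state how datafaucet itself locates the file.

```python
from pathlib import Path

def find_project_root(start, marker="__main__.py"):
    """Walk up from `start` until a directory containing `marker` is found.

    Returns the directory as a Path, or None if no parent contains the marker.
    Illustrative helper, not part of the datafaucet API.
    """
    current = Path(start).resolve()
    for candidate in [current, *current.parents]:
        if (candidate / marker).is_file():
            return candidate
    return None

# Search upwards from the current working directory
print(find_project_root(Path.cwd()))
```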

## Load a project profile

A profile is loaded with the `datafaucet.project.load` function. It looks for files ending with `metadata.yml`. The function can optionally set the current working directory and import the key=value pairs of a `.env` file into the Python `os` environment. If no parameters are specified, the default profile is loaded.
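
As an illustration of what importing key=value pairs into the environment means, the following simplified sketch parses `.env`-style text. The `load_dotenv_text` function and the example variables are hypothetical; a real `.env` parser handles more cases (escaping, `export` prefixes, multi-line values).

```python
import os

def load_dotenv_text(text, environ=os.environ):
    """Parse KEY=VALUE lines and copy them into `environ` (simplified illustration)."""
    loaded = {}
    for line in text.splitlines():
        line = line.strip()
        # skip blank lines, comments and lines without an assignment
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        loaded[key.strip()] = value.strip().strip('"').strip("'")
    environ.update(loaded)
    return loaded

# Hypothetical .env content
example = """
# data location and engine settings
DATA_HOME=/tmp/data
SPARK_MASTER="local[2]"
"""

env = {}  # use a plain dict here, so the real process environment is left untouched
print(load_dotenv_text(example, env))
```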

In [7]:
help(dfc.project.load)

Help on function load in module datafaucet.project:

load(profile='default', rootpath=None)



In [10]:
# load the 'default' environment
dfc.project.load('default')

Additional properties are not allowed ('stdio' was unexpected) 

## schema path:
'properties/loggers/properties/datafaucet/additionalProperties'

## metadata schema definition :
type: object
properties:
    name:
        type: string
        default: dfc
    stream:
        type: object
        properties:
            severity:
                type: string
            enable:
                type: boolean
        additionalProperties: false
    stdout:
        type: object
        properties:
            severity:
                type: string
                default: notice
            enable:
                type: boolean
                default: true
        additionalProperties: false
    file:
        type: object
        properties:
            severity:
                type: string
                default: info
            path:
                type: string
            enable:
                type: boolean
                default: false
        additionalProperties: false
    kaf

KeyError: 'providers'

## Inspect loaded metadata profile

Call `datafaucet.project.metadata` to get and inspect the loaded metadata profile. It returns an object of type `Metadata`, which behaves as a read-only dictionary.

In [3]:
help(dfc.project.metadata)

Help on function metadata in module datafaucet.project:

metadata()
    return a metadata object which provides just one method:
    config() : provides the current loaded metadata profile information
    
    :return: a Metadata object



In [4]:
md = dfc.project.metadata()
md

providers:
    local_filesystem:
        write:
            options:
                header: true
                mode: overwrite
        read:
            options:
                inferSchema: true
                header: true
        path: data
        service: local
        format: csv
loggers:
    stream:
        enable: true
        severity: info
engine:
    context:
        master: local[1]
    type: spark
resources:
    correlation:
        path: correlation.csv
        provider: local_filesystem
    ascombe:
        path: ascombe.csv
        provider: local_filesystem
profile: default
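
The read-only dictionary behaviour can be illustrated with the standard library. This is only an analogy: `types.MappingProxyType` is a standard Python class, and nothing here implies that the `Metadata` object is implemented with it.

```python
from types import MappingProxyType

# A plain dictionary wrapped in a read-only view
config = {"profile": "default", "engine": {"type": "spark"}}
read_only = MappingProxyType(config)

# lookups work as with a regular dictionary
print(read_only["profile"])

# item assignment on the view is rejected
try:
    read_only["profile"] = "prod"
except TypeError as err:
    print("read-only:", err)
```

Note that the view is shallow in this sketch: nested dictionaries such as `read_only["engine"]` remain mutable.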

Metadata files support Jinja templates. This feature can be used to read in environment variables.

## Inspect current project configuration

You can inspect the current project configuration by calling the `datafaucet.project.config` function.

In [5]:
help(dfc.project.config)

Help on function config in module datafaucet.project:

config()
    Returns the current project configuration
    :return: a dictionary with project configuration data



In [6]:
import datafaucet as dfc
dfc.project.config()

version: 0.6.0
profile: default
filename: main.ipynb
rootdir: /home/natbusa/Projects/datafaucet-demos/demos/basic/demo
workdir: /home/natbusa/Projects/datafaucet-demos/demos/basic/demo
username: natbusa
repository:
    type:
    committer: ''
    hash: 0
    commit: 0
    branch: ''
    url: ''
    name: ''
    date: ''
    clean: false
files:
    notebooks:
      - main.ipynb
      - versions.ipynb
      - src/Untitled.ipynb
      - src/hello.ipynb
    python:
      - __main__.py
    metadata:
      - metadata.yml
    dotenv:
engine:
    type: spark
    name: default
    version: 2.3.1
    conf:
        spark.app.id: local-1545384032427
        spark.rdd.compress: 'True'
        spark.app.name: default
        spark.serializer.objectStreamReset: '100'
        spark.driver.port: '34699'
        spark.executor.id: driver
        spark.submit.deployMode: client
        spark.ui.showConsoleProgress: 'true'
        spark.master: local[1]
        spark.driver.host: 10.196.160.215
    env:
 

Data resources are relative to the `rootpath`. 

### Resources

Data binding works with the metadata files. It's good practice to declare the actual binding in the metadata and to avoid hardcoding paths in the notebooks and Python source files.

In [15]:
dfc.project.resource('ascombe')['url']

'/home/natbusa/Projects/datafaucet-demos/demos/basic/demo/data/ascombe.csv'

In [20]:
dfc.project.resource('./data/ascombe.csv', 'localfs')

url: /home/natbusa/Projects/datafaucet-demos/demos/basic/demo/localfs/data/ascombe.csv
service: file
format: csv
driver:
database:
username:
password:
provider_alias:
resource_alias:
resource_path: ./data/ascombe.csv
provider_path: /home/natbusa/Projects/datafaucet-demos/demos/basic/demo/localfs
read:
    cache: false
    options: {}
    filter:
        date_column:
        date_start:
        date_end:
        date_window:
        date_timezone:
    partition:
        repartition:
        coalesce:
    mapping: {}
write:
    cache: false
    options: {}
    filter:
        date_column:
        date_start:
        date_end:
        date_window:
        date_timezone:
    partition:
        repartition:
        coalesce:
    mapping: {}

### Modules: Engines

This submodule allows you to start a context from the configuration described in the metadata. It also provides basic load/store data functions, according to the aliases defined in the configuration.

Let's start by listing the aliases and the configuration of the engines declared in `metadata.yml`.


__Context: Spark__  
Let's start the engine session by selecting a Spark context from the list. You can have many Spark contexts declared, for instance for a single node.

In [1]:
import datafaucet as dfc
engine = dfc.project.engine()
engine.config()

type: spark
name: default
version: 2.3.1
conf:
    spark.driver.port: '37227'
    spark.rdd.compress: 'True'
    spark.app.name: default
    spark.serializer.objectStreamReset: '100'
    spark.executor.id: driver
    spark.submit.deployMode: client
    spark.ui.showConsoleProgress: 'true'
    spark.master: local[1]
    spark.driver.host: 10.196.160.215
    spark.app.id: local-1545384242509
env:
    PYSPARK_SUBMIT_ARGS: ' pyspark-shell'
rootdir: /home/natbusa/Projects/datafaucet-demos/demos/basic/demo

You can quickly inspect the properties of the context by calling the `config()` method, as in the cell above.

By calling the `context` method, you access the Spark SQL Context directly. The rest of your spark python code is not affected by the initialization of your session with the datafaucet.

In [2]:
engine = dfc.project.engine()
spark = engine.context()

Let's read the CSV data again, this time with the Spark engine. The cell below uses the engine `load` utility; alternatively, the data can be read directly with the Spark context, using the `dfc.data.path` function to locate the dataset.

In [3]:
#read using the engine utility
df = engine.load('ascombe')

In [4]:
df.printSchema()

root
 |-- idx: long (nullable = true)
 |-- Ix: double (nullable = true)
 |-- Iy: double (nullable = true)
 |-- IIx: double (nullable = true)
 |-- IIy: double (nullable = true)
 |-- IIIx: double (nullable = true)
 |-- IIIy: double (nullable = true)
 |-- IVx: double (nullable = true)
 |-- IVy: double (nullable = true)



In [5]:
df.show()

+---+----+-----+----+----+----+-----+----+----+
|idx|  Ix|   Iy| IIx| IIy|IIIx| IIIy| IVx| IVy|
+---+----+-----+----+----+----+-----+----+----+
|  0|10.0| 8.04|10.0|9.14|10.0| 7.46| 8.0|6.58|
|  1| 8.0| 6.95| 8.0|8.14| 8.0| 6.77| 8.0|5.76|
|  2|13.0| 7.58|13.0|8.74|13.0|12.74| 8.0|7.71|
|  3| 9.0| 8.81| 9.0|8.77| 9.0| 7.11| 8.0|8.84|
|  4|11.0| 8.33|11.0|9.26|11.0| 7.81| 8.0|8.47|
|  5|14.0| 9.96|14.0| 8.1|14.0| 8.84| 8.0|7.04|
|  6| 6.0| 7.24| 6.0|6.13| 6.0| 6.08| 8.0|5.25|
|  7| 4.0| 4.26| 4.0| 3.1| 4.0| 5.39|19.0|12.5|
|  8|12.0|10.84|12.0|9.13|12.0| 8.15| 8.0|5.56|
|  9| 7.0| 4.82| 7.0|7.26| 7.0| 6.42| 8.0|7.91|
| 10| 5.0| 5.68| 5.0|4.74| 5.0| 5.73| 8.0|6.89|
+---+----+-----+----+----+----+-----+----+----+



Finally, let's calculate the correlation between the `x` and `y` columns for each of the sets I, II, III and IV, and save the result in a separate dataset.

In [6]:
from pyspark.ml.feature import VectorAssembler

for s in ['I', 'II', 'III', 'IV']:
    va = VectorAssembler(inputCols=[s+'x', s+'y'], outputCol=s)
    df = va.transform(df)
    df = df.drop(s+'x', s+'y')
    
df.show()

+---+------------+-----------+------------+-----------+
|idx|           I|         II|         III|         IV|
+---+------------+-----------+------------+-----------+
|  0| [10.0,8.04]|[10.0,9.14]| [10.0,7.46]| [8.0,6.58]|
|  1|  [8.0,6.95]| [8.0,8.14]|  [8.0,6.77]| [8.0,5.76]|
|  2| [13.0,7.58]|[13.0,8.74]|[13.0,12.74]| [8.0,7.71]|
|  3|  [9.0,8.81]| [9.0,8.77]|  [9.0,7.11]| [8.0,8.84]|
|  4| [11.0,8.33]|[11.0,9.26]| [11.0,7.81]| [8.0,8.47]|
|  5| [14.0,9.96]| [14.0,8.1]| [14.0,8.84]| [8.0,7.04]|
|  6|  [6.0,7.24]| [6.0,6.13]|  [6.0,6.08]| [8.0,5.25]|
|  7|  [4.0,4.26]|  [4.0,3.1]|  [4.0,5.39]|[19.0,12.5]|
|  8|[12.0,10.84]|[12.0,9.13]| [12.0,8.15]| [8.0,5.56]|
|  9|  [7.0,4.82]| [7.0,7.26]|  [7.0,6.42]| [8.0,7.91]|
| 10|  [5.0,5.68]| [5.0,4.74]|  [5.0,5.73]| [8.0,6.89]|
+---+------------+-----------+------------+-----------+



After assembling the dataframe into four sets of 2D vectors, let's calculate the Pearson correlation for each set. In the case of the Anscombe sets (the `ascombe` resource), all sets should have nearly the same Pearson correlation.

In [7]:
from pyspark.ml.stat import Correlation
from pyspark.sql.types import StructType, StructField, FloatType

corr = {}
cols = ['I', 'II', 'III', 'IV']

# calculate the pearson correlation between the x and y components of each set
for s in cols:
    # the result is a one-row dataframe holding a 2x2 correlation matrix;
    # element [0,1] is the x-y correlation
    corr[s] = Correlation.corr(df, s, 'pearson').collect()[0][0][0,1].item()

# declare the output schema: one nullable float column per set
schema = StructType([StructField(s, FloatType(), True) for s in cols])

# create a one-row output dataframe
corr_df = spark.createDataFrame(data=[corr], schema=schema)

In [8]:
import pyspark.sql.functions as f
corr_df.select([f.round(f.avg(c), 3).alias(c) for c in cols]).show()

+-----+-----+-----+-----+
|    I|   II|  III|   IV|
+-----+-----+-----+-----+
|0.816|0.816|0.816|0.817|
+-----+-----+-----+-----+
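
As a cross-check that does not need Spark, the Pearson coefficient of set I can be recomputed in plain Python from the rows printed by `df.show()` earlier. The `pearson` function below is a small helper written for this check; it is not part of datafaucet or PySpark.

```python
from math import sqrt

# Set I (columns Ix and Iy), copied from the df.show() output
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / sqrt(var_x * var_y)

r = pearson(x, y)
print(round(r, 3))  # 0.816
```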



Save the results. Although this is a very small data frame, Spark assumes large data sets when saving CSV files, and writes partitioned files inside an object (a directory) that carries the name of the target file:


In [10]:
engine.save(corr_df,'correlation')

We read it back to check that all went fine:

In [11]:
engine.load('correlation').show()

+----------+---------+---------+----------+----------+
|Unnamed: 0|        I|       II|       III|        IV|
+----------+---------+---------+----------+----------+
|         0|0.8164205|0.8162365|0.81628674|0.81652147|
+----------+---------+---------+----------+----------+



### Modules: Export

This submodule allows you to export cells and import them in other notebooks as Python packages. Check the notebook [versions.ipynb](versions.ipynb), where you will see how to export the notebook, then follow the code below to check that it really works!


In [12]:
import datafaucet as dfc
dfc.project.load()

from versions import python_version

importing Jupyter notebook from versions.ipynb


In [13]:
python_version()

Hello world: python 3.6.7
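
To give an idea of what importing a notebook involves, here is a simplified sketch that executes the code cells of an `.ipynb` file (a JSON document with a `cells` list) inside a fresh module object. It is an assumption-laden illustration, not datafaucet's import mechanism: `module_from_notebook` is a hypothetical helper, it runs every code cell rather than only exported ones, and cells containing IPython magics (`%...`, `!...`) would fail.

```python
import json
import types

def module_from_notebook(path, name):
    """Execute the code cells of a .ipynb file inside a new module object (simplified)."""
    with open(path, encoding="utf-8") as fh:
        notebook = json.load(fh)
    module = types.ModuleType(name)
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", "")
        if isinstance(source, list):
            # the notebook format stores cell sources as a list of lines
            source = "".join(source)
        exec(source, module.__dict__)
    return module
```

With such a helper, `module_from_notebook("versions.ipynb", "versions").python_version()` would play the role of the import shown above, assuming the notebook defines `python_version` in a plain code cell.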
