# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

## Data Engine

### Starting the engine

Super simple, yet flexible :) 

In [1]:
import datafaucet as dfc

In [2]:
#start the engine
engine = dfc.engine('spark')

In [3]:
# you can also use the specific engine class directly
engine = dfc.SparkEngine()

Loading and saving data resources are operations performed by the engine. The engine configuration can be passed directly as parameters in the engine call, or set in metadata YAML files.

### Engine Context

You can access the underlying engine through engine.context. For the Spark engine, the context can be obtained as follows:

In [5]:
spark = dfc.context()

In [6]:
spark

In [7]:
df = dfc.range(5)
df.data.grid()

|   | id |
|---|----|
| 0 | 0  |
| 1 | 1  |
| 2 | 2  |
| 3 | 3  |
| 4 | 4  |


In [8]:
type(df)

pyspark.sql.dataframe.DataFrame

In [9]:
df.datafaucet()

{'object': 'dataframe', 'type': 'spark', 'version': '0.8.2'}

### Engine configuration

In [10]:
engine = dfc.engine()

In [11]:
engine.conf

{'spark.rdd.compress': 'True',
 'spark.app.id': 'local-1574733270733',
 'spark.serializer.objectStreamReset': '100',
 'spark.master': 'local[*]',
 'spark.executor.id': 'driver',
 'spark.submit.deployMode': 'client',
 'spark.app.name': 'None',
 'spark.ui.showConsoleProgress': 'true',
 'spark.driver.port': '40997',
 'spark.driver.host': '10.10.140.37'}

In [12]:
engine.env

SPARK_HOME:
HADOOP_HOME:
JAVA_HOME: /usr/lib/jvm/java-8-oracle
PYSPARK_PYTHON: /home/natbusa/miniconda3/bin/python
PYSPARK_DRIVER_PYTHON: /home/natbusa/miniconda3/bin/python
PYTHONPATH:
PYSPARK_SUBMIT_ARGS: ' pyspark-shell'
SPARK_DIST_CLASSPATH:

For the full configuration, execute the following statement:

In [13]:
engine.info

{'python_version': '3.7.3',
 'hadoop_version': '2.7.3',
 'hadoop_detect': 'spark',
 'hadoop_home': '',
 'spark_home': '/home/natbusa/miniconda3',
 'spark_classpath': None,
 'spark_classpath_source': '/home/natbusa/miniconda3/conf/spark-env.sh'}
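
To inspect the same environment variables outside of datafaucet, a plain-Python snapshot is enough. The sketch below is only an illustration: `spark_env_snapshot` and `ENV_KEYS` are hypothetical names, the variable list is copied from the engine.env output above, and nothing here claims to mirror how datafaucet collects these values internally.

```python
import os

# Variable names copied from the engine.env output shown above.
ENV_KEYS = [
    "SPARK_HOME",
    "HADOOP_HOME",
    "JAVA_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "PYTHONPATH",
    "PYSPARK_SUBMIT_ARGS",
    "SPARK_DIST_CLASSPATH",
]


def spark_env_snapshot(environ=None):
    """Return the Spark-related variables; unset ones map to an empty string.

    Hypothetical helper, not part of datafaucet.
    """
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, "") for key in ENV_KEYS}


# Print the snapshot for the current process, one variable per line.
for key, value in spark_env_snapshot().items():
    print(f"{key}: {value}")
```

Passing a dictionary instead of relying on `os.environ` makes the helper easy to check against a known environment.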

### Submitting engine parameters during engine initialization

Pass the master, configuration parameters, and services as engine parameters:

In [14]:
import datafaucet as dfc
dfc.engine('spark', master='local[2]', services='postgres', conf=[('spark.app.name','myapp')])

Init engine "spark"
Configuring packages:
  -  org.postgresql:postgresql:42.2.5
Configuring conf:
  -  spark.app.name : myapp
Connecting to spark master: local[2]
Engine context spark:2.4.4 successfully started


<datafaucet.spark.engine.SparkEngine at 0x7fbf035ab978>
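
The log above suggests that the engine parameters end up as ordinary Spark configuration entries. As a rough mental model only, the sketch below flattens a master URL, a list of (key, value) tuples, and a list of Maven packages into one dictionary. `merge_conf` is a hypothetical helper, not datafaucet's implementation; `spark.master` and `spark.jars.packages` are standard Spark properties, and whether datafaucet sets the latter is an assumption.

```python
def merge_conf(master, conf=(), packages=()):
    """Combine engine parameters into a flat Spark-style conf dict.

    Hypothetical helper for illustration, not part of datafaucet.
    """
    merged = {"spark.master": master}
    # conf arrives as a list of (key, value) tuples, as in the call above.
    for key, value in conf:
        merged[key] = str(value)
    if packages:
        # spark.jars.packages expects comma-separated Maven coordinates.
        merged["spark.jars.packages"] = ",".join(packages)
    return merged


conf = merge_conf(
    master="local[2]",
    conf=[("spark.app.name", "myapp")],
    packages=["org.postgresql:postgresql:42.2.5"],
)
print(conf)
```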

In [15]:
dfc.engine().conf

{'spark.submit.pyFiles': '/home/natbusa/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.repl.local.jars': 'file:///home/natbusa/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.executor.id': 'driver',
 'spark.driver.port': '35087',
 'spark.driver.host': '10.10.140.37',
 'spark.app.name': 'myapp',
 'spark.rdd.compress': 'True',
 'spark.files': 'file:///home/natbusa/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.serializer.objectStreamReset': '100',
 'spark.jars': 'file:///home/natbusa/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.submit.deployMode': 'client',
 'spark.app.id': 'local-1574733278825',
 'spark.master': 'local[2]',
 'spark.ui.showConsoleProgress': 'true'}
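
Comparing this configuration with the default one shown earlier makes it easier to see what the extra parameters changed. The following sketch is one possible way to do that in plain Python: `conf_diff` is a hypothetical helper, and the two dictionaries are abbreviated copies of the outputs above. Keys that differ on every run (application id, driver port) are ignored.

```python
def conf_diff(before, after, ignore=()):
    """Return (added, removed, changed) between two conf dictionaries."""
    added = {k: v for k, v in after.items()
             if k not in before and k not in ignore}
    removed = {k: v for k, v in before.items()
               if k not in after and k not in ignore}
    changed = {k: (before[k], after[k]) for k in before
               if k in after and before[k] != after[k] and k not in ignore}
    return added, removed, changed


# Abbreviated copies of the two engine.conf outputs in this document.
default_conf = {
    "spark.master": "local[*]",
    "spark.app.name": "None",
    "spark.app.id": "local-1574733270733",
    "spark.driver.port": "40997",
}
custom_conf = {
    "spark.master": "local[2]",
    "spark.app.name": "myapp",
    "spark.app.id": "local-1574733278825",
    "spark.driver.port": "35087",
    "spark.jars": "file:///home/natbusa/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar",
}

added, removed, changed = conf_diff(
    default_conf,
    custom_conf,
    ignore=("spark.app.id", "spark.driver.port"),
)
print("added:", added)
print("removed:", removed)
print("changed:", changed)
```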