# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

## Data Engine

### Starting the engine

Super simple, yet flexible :) 

In [1]:
import datafaucet as dfc

In [2]:
#start the engine
engine = dfc.engine('spark')

In [3]:
#you can also use the specific engine class directly
engine = dfc.SparkEngine()

Loading and saving data resources is an operation performed by the engine. The engine configuration can be passed straight as parameters in the engine call, or configured in metadata yaml files.

### Engine Context

You can access the underlying engine through `engine.context`. For the Spark engine in particular, the Spark session can be accessed as in the following example:

In [4]:
spark = engine.session
spark

In [5]:
df = dfc.range(5)
df.data.grid()

|   | id |
|---|----|
| 0 | 0  |
| 1 | 1  |
| 2 | 2  |
| 3 | 3  |
| 4 | 4  |


In [6]:
type(df)

pyspark.sql.dataframe.DataFrame

In [7]:
df.datafaucet()

{'object': 'dataframe', 'type': 'spark', 'version': '0.10.0'}

### Engine configuration

In [8]:
engine = dfc.engine()

In [9]:
engine.conf

{'spark.driver.host': 'bebcacf09518',
 'spark.rdd.compress': 'True',
 'spark.serializer.objectStreamReset': '100',
 'spark.master': 'local[*]',
 'spark.submit.pyFiles': '',
 'spark.executor.id': 'driver',
 'spark.app.id': 'local-1580638803618',
 'spark.submit.deployMode': 'client',
 'spark.app.name': 'None',
 'spark.ui.showConsoleProgress': 'true',
 'spark.driver.port': '39701'}

In [10]:
engine.env

SPARK_HOME: /opt/spark
HADOOP_HOME:
JAVA_HOME: /usr/lib/jvm/java-8-openjdk-amd64
PYSPARK_PYTHON: /opt/conda/bin/python
PYSPARK_DRIVER_PYTHON: /opt/conda/bin/python
PYTHONPATH: /opt/spark/python:/opt/spark/python/lib/py4j-0.10.8.1-src.zip
PYSPARK_SUBMIT_ARGS: ' pyspark-shell'
SPARK_DIST_CLASSPATH:
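
The `engine.env` report above lists Spark-related environment variables, with unset ones shown as empty. As a rough illustration only, the same kind of report can be put together in plain Python with `os.environ`. This is a minimal sketch, not datafaucet's implementation: `spark_env` and `SPARK_ENV_KEYS` are hypothetical names, and only the variable names come from the output above.

```python
import os

# Variable names copied from the engine.env output above.
SPARK_ENV_KEYS = [
    "SPARK_HOME",
    "HADOOP_HOME",
    "JAVA_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "PYTHONPATH",
    "PYSPARK_SUBMIT_ARGS",
    "SPARK_DIST_CLASSPATH",
]


def spark_env(environ=None):
    """Return the Spark-related variables; unset ones map to an empty string."""
    # Hypothetical helper: accepts any mapping so it can be tried without Spark.
    environ = os.environ if environ is None else environ
    return {key: environ.get(key, "") for key in SPARK_ENV_KEYS}


# Print the report for the current process, one "NAME: value" pair per line.
for key, value in spark_env().items():
    print(f"{key}: {value}")
```

Passing a plain dictionary instead of relying on `os.environ` makes it easy to check what a given environment would report before starting the engine.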

For the full configuration, execute the following statement:

In [12]:
engine.info

python_version: 3.7.6
hadoop_version: 3.2.0
hadoop_detect: spark
hadoop_home: ''
spark_home: /opt/spark
spark_classpath:
  - /opt/spark/jars/JLargeArrays-1.5.jar
  - /opt/spark/jars/JTransforms-3.1.jar
  - /opt/spark/jars/RoaringBitmap-0.7.45.jar
  - /opt/spark/jars/accessors-smart-1.2.jar
  - /opt/spark/jars/activation-1.1.1.jar
  - /opt/spark/jars/aircompressor-0.10.jar
  - /opt/spark/jars/algebra_2.12-2.0.0-M2.jar
  - /opt/spark/jars/aliyun-sdk-oss-2.8.3.jar
  - /opt/spark/jars/antlr4-runtime-4.7.1.jar
  - /opt/spark/jars/aopalliance-repackaged-2.6.1.jar
  - /opt/spark/jars/arpack_combined_all-0.1.jar
  - /opt/spark/jars/arrow-format-0.15.1.jar
  - /opt/spark/jars/arrow-memory-0.15.1.jar
  - /opt/spark/jars/arrow-vector-0.15.1.jar
  - /opt/spark/jars/audience-annotations-0.5.0.jar
  - /opt/spark/jars/avro-1.8.2.jar
  - /opt/spark/jars/avro-ipc-1.8.2.jar
  - /opt/spark/jars/avro-mapred-1.8.2-hadoop2.jar
  - /opt/spark/jars/aws-java-sdk-bundle-1.11.375.jar
  - /opt/spark/jars/azure-da
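
The `engine.info` report above includes the Python version, the Spark home, and the jars found on the Spark classpath. A similar summary can be approximated with the standard library alone. This is a sketch under stated assumptions, not how datafaucet gathers the information: `engine_info` is a hypothetical helper, and it assumes the jars live in a `jars` directory under the Spark home, as the paths in the output suggest.

```python
import platform
from pathlib import Path


def engine_info(spark_home):
    """Collect a small report: Python version, Spark home, and sorted jar paths."""
    # Assumption: jars are stored in <spark_home>/jars, as in the listing above.
    jars_dir = Path(spark_home) / "jars"
    if jars_dir.is_dir():
        jars = sorted(str(path) for path in jars_dir.glob("*.jar"))
    else:
        jars = []  # no jars directory: report an empty classpath
    return {
        "python_version": platform.python_version(),
        "spark_home": str(spark_home),
        "spark_classpath": jars,
    }


info = engine_info("/opt/spark")
print(info["python_version"], len(info["spark_classpath"]))
```

On a machine without Spark installed under `/opt/spark`, the classpath list is simply empty.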

### Submitting engine parameters during engine initialization

Submit the master, configuration parameters, and services as engine parameters.

In [22]:
import datafaucet as dfc
dfc.engine('spark', master='local[2]', services='postgres', conf=[('spark.app.name','myapp')])

<datafaucet.spark.engine.SparkEngine at 0x7fc37099aac8>

In [23]:
dfc.engine().conf

{'spark.executor.id': 'driver',
 'spark.submit.pyFiles': '/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.driver.host': '3b87dde9ea32',
 'spark.app.name': 'myapp',
 'spark.jars': 'file:///home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.app.id': 'local-1575550145078',
 'spark.rdd.compress': 'True',
 'spark.repl.local.jars': 'file:///home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.serializer.objectStreamReset': '100',
 'spark.driver.port': '34305',
 'spark.submit.deployMode': 'client',
 'spark.files': 'file:///home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.ui.showConsoleProgress': 'true',
 'spark.master': 'local[2]'}
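
In the call above, `conf` is given as a list of `(key, value)` tuples, and the resulting configuration reports `spark.app.name` as `myapp` and `spark.master` as `local[2]`. The layering of such overrides on top of default values can be pictured with a few lines of plain Python. This is only a sketch: `merge_conf` is a hypothetical helper, not part of datafaucet, and treating the master as one more override is an assumption made for illustration.

```python
def merge_conf(defaults, overrides):
    """Layer (key, value) overrides on top of a defaults dict; later pairs win."""
    merged = dict(defaults)  # copy, so the defaults are left untouched
    for key, value in overrides:
        merged[key] = str(value)  # configuration values above are all strings
    return merged


# Default values taken from the earlier engine.conf output.
defaults = {"spark.master": "local[*]", "spark.app.name": "None"}

# Overrides in the same list-of-tuples shape as the conf argument above.
conf = merge_conf(
    defaults,
    [("spark.app.name", "myapp"), ("spark.master", "local[2]")],
)
print(conf)
```

A list of tuples keeps the order of the overrides explicit, so if a key appears twice the last value is the one that is kept.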