# Datalabframework

The datalabframework is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

## Data Engine

### Starting the engine

Super simple, yet flexible :) 

In [1]:
import datalabframework as dlf

In [None]:
#start the engine
engine = dlf.engine('spark')

created SparkEngine
Init engine "spark"
Connecting to spark master: local[*]
Engine context spark:2.4.1 successfully started


In [4]:
# you can also use the specific engine class directly
engine = dlf.SparkEngine()

Loading and saving data resources are operations performed by the engine. The engine configuration can be passed directly as parameters in the engine call, or set in metadata YAML files.

### Engine Context

You can access the underlying engine context through `engine.context`. For the Spark engine in particular, the context can be obtained as in the following example:

In [5]:
spark = dlf.context()

In [6]:
spark

In [36]:
# create a dataframe with two columns, named resp. 'a' and 'b'

df = spark.createDataFrame([('yes',1),('no',0)], ('a', 'b'))
df.show()

+---+---+
|  a|  b|
+---+---+
|yes|  1|
| no|  0|
+---+---+



### Engine configuration

In [9]:
engine = dlf.engine()

In [10]:
engine.conf

{'spark.rdd.compress': 'True',
 'spark.serializer.objectStreamReset': '100',
 'spark.app.id': 'local-1560504961238',
 'spark.master': 'local[*]',
 'spark.executor.id': 'driver',
 'spark.submit.deployMode': 'client',
 'spark.driver.host': 'e3d68d7d4542',
 'spark.app.name': 'None',
 'spark.ui.showConsoleProgress': 'true',
 'spark.driver.port': '35295'}

In [39]:
engine.env

SPARK_HOME: /opt/spark
HADOOP_HOME: /opt/hadoop
JAVA_HOME: /usr/lib/jvm/java-8-openjdk-amd64
PYSPARK_PYTHON: /opt/conda/bin/python
PYSPARK_DRIVER_PYTHON: /opt/conda/bin/python
PYTHONPATH: /opt/spark/python:/opt/spark/python/lib/py4j-0.10.7-src.zip
PYSPARK_SUBMIT_ARGS: ' pyspark-shell'
SPARK_DIST_CLASSPATH:
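
The entries listed above are ordinary process environment variables, so they can also be inspected without the framework. The following is a minimal sketch using only the Python standard library. The helper name `spark_env` is hypothetical and is not part of datalabframework; the list of names is simply copied from the output above.

```python
import os

# Variable names copied from the `engine.env` output above.
SPARK_ENV_VARS = [
    "SPARK_HOME",
    "HADOOP_HOME",
    "JAVA_HOME",
    "PYSPARK_PYTHON",
    "PYSPARK_DRIVER_PYTHON",
    "PYTHONPATH",
    "PYSPARK_SUBMIT_ARGS",
    "SPARK_DIST_CLASSPATH",
]


def spark_env(environ=None):
    """Hypothetical helper: return the Spark-related variables ('' if unset)."""
    if environ is None:
        environ = os.environ
    return {name: environ.get(name, "") for name in SPARK_ENV_VARS}


# Print the variables in the same "NAME: value" layout as the output above.
for name, value in spark_env().items():
    print(f"{name}: {value}")
```

Passing a plain dictionary instead of `os.environ` makes the helper easy to try out with fixed values.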

For the full configuration, execute the following statement:

In [40]:
engine.info

{'python_version': '3.6.8',
 'hadoop_version': '3.1.1',
 'hadoop_detect': 'spark',
 'hadoop_home': '/opt/hadoop',
 'spark_home': '/opt/spark',
 'spark_classpath': ['/opt/spark/jars/*',
  '/opt/hadoop/etc/hadoop',
  '/opt/hadoop/share/hadoop/common/lib/*',
  '/opt/hadoop/share/hadoop/common/*',
  '/opt/hadoop/share/hadoop/hdfs',
  '/opt/hadoop/share/hadoop/hdfs/lib/*',
  '/opt/hadoop/share/hadoop/hdfs/*',
  '/opt/hadoop/share/hadoop/mapreduce/lib/*',
  '/opt/hadoop/share/hadoop/mapreduce/*',
  '/opt/hadoop/share/hadoop/yarn',
  '/opt/hadoop/share/hadoop/yarn/lib/*',
  '/opt/hadoop/share/hadoop/yarn/*'],
 'spark_classpath_source': '/opt/spark/conf/spark-env.sh'}

### Submitting engine parameters during engine initialization

Pass the master URL, configuration parameters, and services as engine parameters:

In [2]:
import datalabframework as dlf
dlf.engine('spark', master='spark://spark-master:7077', services='postgres')

<datalabframework.spark.engine.SparkEngine at 0x7f19f5fb8828>

In [5]:
dlf.engine().conf

{'spark.driver.host': 'e3d68d7d4542',
 'spark.executor.id': 'driver',
 'spark.submit.pyFiles': '/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.jars': 'file:///home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.driver.port': '39559',
 'spark.rdd.compress': 'True',
 'spark.master': 'spark://spark-master:7077',
 'spark.repl.local.jars': 'file:///home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.serializer.objectStreamReset': '100',
 'spark.app.id': 'app-20190618113650-0000',
 'spark.submit.deployMode': 'client',
 'spark.app.name': 'None',
 'spark.files': 'file:///home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar',
 'spark.ui.showConsoleProgress': 'true'}
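
To see what the extra engine parameters changed, the two `conf` outputs shown in this document can be compared key by key. This is a minimal sketch in plain Python: the dictionaries are copied from the outputs above, and `conf_diff` is a hypothetical helper, not a datalabframework function.

```python
# Values copied from the `engine.conf` output of the default local engine.
default_conf = {
    "spark.rdd.compress": "True",
    "spark.serializer.objectStreamReset": "100",
    "spark.app.id": "local-1560504961238",
    "spark.master": "local[*]",
    "spark.executor.id": "driver",
    "spark.submit.deployMode": "client",
    "spark.driver.host": "e3d68d7d4542",
    "spark.app.name": "None",
    "spark.ui.showConsoleProgress": "true",
    "spark.driver.port": "35295",
}

# Values copied from the output after passing master and services='postgres'.
JAR = "/home/jovyan/.ivy2/jars/org.postgresql_postgresql-42.2.5.jar"
postgres_conf = {
    "spark.driver.host": "e3d68d7d4542",
    "spark.executor.id": "driver",
    "spark.submit.pyFiles": JAR,
    "spark.jars": "file://" + JAR,
    "spark.driver.port": "39559",
    "spark.rdd.compress": "True",
    "spark.master": "spark://spark-master:7077",
    "spark.repl.local.jars": "file://" + JAR,
    "spark.serializer.objectStreamReset": "100",
    "spark.app.id": "app-20190618113650-0000",
    "spark.submit.deployMode": "client",
    "spark.app.name": "None",
    "spark.files": "file://" + JAR,
    "spark.ui.showConsoleProgress": "true",
}


def conf_diff(before, after):
    """Hypothetical helper: return (added, changed) entries between two dicts."""
    added = {key: after[key] for key in after.keys() - before.keys()}
    changed = {
        key: (before[key], after[key])
        for key in before.keys() & after.keys()
        if before[key] != after[key]
    }
    return added, changed


added, changed = conf_diff(default_conf, postgres_conf)

for key in sorted(added):
    print(f"added   {key} = {added[key]}")
for key in sorted(changed):
    old, new = changed[key]
    print(f"changed {key}: {old} -> {new}")
```

With the values above, the added keys are the four entries that point at the `org.postgresql` jar, and the changed keys are `spark.app.id`, `spark.driver.port` and `spark.master`.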