# Datafaucet

Datafaucet is a productivity framework for ETL and ML applications. It simplifies common data pipeline activities such as project scaffolding, data ingestion, star schema generation, and forecasting.

## Pandas Engine

### Starting the engine

Super simple, yet flexible :) 

In [1]:
import datafaucet as dfc

In [2]:
# start the engine
engine = dfc.engine('pandas')

created PandasEngine
Init engine "pandas"
Setting context to pandas.
Engine context pandas:0.25.1 successfully started


In [3]:
# you can also use the specific engine class directly
engine = dfc.PandasEngine()
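
The two ways of obtaining an engine shown above (by name, or through the class itself) can be pictured with a small registry. The sketch below is purely illustrative: `SketchPandasEngine`, `_REGISTRY` and this `engine` function are hypothetical names and are not datafaucet internals. It also assumes that calling the factory without arguments returns the engine that is already running, which is how the no-argument call appears to be used later in this document.

```python
# Hypothetical sketch: none of these names are datafaucet internals.
_REGISTRY = {}      # engine name -> engine class
_current = None     # most recently created engine


class SketchPandasEngine:
    """Stand-in for an engine class; it only records its configuration."""

    def __init__(self, **conf):
        self.conf = conf


_REGISTRY["pandas"] = SketchPandasEngine


def engine(name=None, **conf):
    """Create the engine registered under `name`, or return the current one."""
    global _current
    if name is None:
        if _current is None:
            raise RuntimeError("no engine has been started yet")
        return _current
    _current = _REGISTRY[name](**conf)
    return _current


by_name = engine("pandas")       # look the class up by name
current = engine()               # no argument: reuse the running engine
direct = SketchPandasEngine()    # instantiate the class directly
```

In this sketch `by_name` and `current` are the same object, while `direct` is a separate instance that the factory does not track.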

The engine performs the loading and saving of data resources. Its configuration can be passed directly as parameters in the engine call, or set in metadata YAML files.

### Engine Context

You can access the underlying engine by referring to engine.context. For the pandas engine, the context can be accessed as in the following example:

In [4]:
pd = engine.context

In [10]:
df = dfc.range(10)
df.data.grid(5)

|   | id |
|---|----|
| 0 | 0  |
| 1 | 1  |
| 2 | 2  |
| 3 | 3  |
| 4 | 4  |
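
For readers more familiar with plain pandas, the preview above roughly corresponds to the sketch below. It uses pandas only and does not call datafaucet. That `dfc.range` and `grid` behave exactly like this is an assumption made for illustration.

```python
import pandas as pd

# Build a single-column frame with ids 0..9, similar in shape to dfc.range(10).
df = pd.DataFrame({"id": range(10)})

# Keep the first five rows, comparable to the five-row preview above.
preview = df.head(5)
print(preview)
```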


In [11]:
type(df)

In [18]:
df.datafaucet()

{'object': 'dataframe', 'type': 'pandas', 'version': '0.8.2'}

### Engine configuration

In [12]:
engine = dfc.engine()

In [13]:
engine.conf

compute.use_bottleneck: true
compute.use_numexpr: false
display.chop_threshold:
display.colheader_justify: right
display.column_space: 12
display.date_dayfirst: false
display.date_yearfirst: false
display.encoding: UTF-8
display.expand_frame_repr: true
display.float_format:
display.html.border: 1
display.html.table_schema: false
display.html.use_mathjax: true
display.large_repr: truncate
display.latex.escape: true
display.latex.longtable: false
display.latex.multicolumn: true
display.latex.multicolumn_format: l
display.latex.multirow: false
display.latex.repr: false
display.max_categories: 8
display.max_columns: 20
display.max_colwidth: 50
display.max_info_columns: 100
display.max_info_rows: 1690785
display.max_rows: 60
display.max_seq_items: 100
display.memory_usage: true
display.min_rows: 10
display.multi_sparse: true
display.notebook_repr_html: true
display.pprint_nest_depth: 3
display.precision: 6
display.show_dimensions: truncate
display.unicode.ambiguous_as_wide: false
display.un
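
The keys in this listing match pandas' own option names, so the same values can presumably be read through the pandas options API. The following is a minimal sketch under that assumption. It queries pandas directly rather than going through `engine.conf`, and the selection of keys is arbitrary.

```python
import pandas as pd

# A few of the keys that appear in the listing above.
keys = ["display.max_rows", "display.max_columns", "display.precision"]

# pd.get_option returns the current value of a single option.
current = {key: pd.get_option(key) for key in keys}

for key, value in current.items():
    print(f"{key}: {value}")
```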

In [14]:
engine.env

SPARK_HOME:
JAVA_HOME: /usr/lib/jvm/java-8-oracle
PYTHONPATH:
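
A comparable view of these environment variables can be produced with the standard library alone. This is only a sketch of the idea, not how `engine.env` is implemented. Reporting unset variables as empty strings is an assumption based on the blank entries above.

```python
import os

# Variable names taken from the output above.
names = ["SPARK_HOME", "JAVA_HOME", "PYTHONPATH"]

# Unset variables are reported as empty strings.
env = {name: os.environ.get(name, "") for name in names}

for name, value in env.items():
    print(f"{name}: {value}")
```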

For the full engine information, execute the following statement:

In [15]:
engine.info

{'python_version': '3.7.3', 'pandas_version': '0.25.1'}
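
The same two fields can be gathered without datafaucet, which can be handy when checking an environment before the engine is started. The sketch below reuses the key names from the output above. How `engine.info` collects them internally is not shown in this document, so this is an assumption.

```python
import platform

import pandas as pd

# Collect the interpreter and pandas versions under the same keys as above.
info = {
    "python_version": platform.python_version(),
    "pandas_version": pd.__version__,
}
print(info)
```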

### Submitting engine parameters during engine initialization

Configuration parameters can be submitted as engine params when the engine is created:

In [16]:
import datafaucet as dfc
dfc.engine('pandas', conf=[('display.html.border','0')])

Init engine "pandas"
Setting context to pandas.
Engine context pandas:0.25.1 successfully started


<datafaucet.pandas.engine.PandasEngine at 0x7f53409b1cf8>
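
In plain pandas, a list of (key, value) pairs like the `conf` argument above could be applied with `pd.set_option`. The sketch below assumes that this is roughly what the engine does with `conf`, which this document does not confirm. It passes the integer 0 rather than the string '0' used above, because an integer is the type pandas expects for this option.

```python
import pandas as pd

# Option pairs in the same shape as the conf argument above.
conf = [("display.html.border", 0)]

before = pd.get_option("display.html.border")

# Apply each pair to the global pandas options.
for key, value in conf:
    pd.set_option(key, value)

after = pd.get_option("display.html.border")
print(f"display.html.border: {before} -> {after}")

# Restore the pandas default so later code is unaffected.
pd.reset_option("display.html.border")
```

For temporary changes, `pd.option_context` applies options only inside a `with` block and restores them on exit.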

In [17]:
dfc.engine().conf

compute.use_bottleneck: true
compute.use_numexpr: false
display.chop_threshold:
display.colheader_justify: right
display.column_space: 12
display.date_dayfirst: false
display.date_yearfirst: false
display.encoding: UTF-8
display.expand_frame_repr: true
display.float_format:
display.html.border: 1
display.html.table_schema: false
display.html.use_mathjax: true
display.large_repr: truncate
display.latex.escape: true
display.latex.longtable: false
display.latex.multicolumn: true
display.latex.multicolumn_format: l
display.latex.multirow: false
display.latex.repr: false
display.max_categories: 8
display.max_columns: 20
display.max_colwidth: 50
display.max_info_columns: 100
display.max_info_rows: 1690785
display.max_rows: 60
display.max_seq_items: 100
display.memory_usage: true
display.min_rows: 10
display.multi_sparse: true
display.notebook_repr_html: true
display.pprint_nest_depth: 3
display.precision: 6
display.show_dimensions: truncate
display.unicode.ambiguous_as_wide: false
display.un