Replace NILMTK's out-of-core code with Dask #248

Open
nipunbatra opened this issue Nov 29, 2014 · 16 comments

@nipunbatra
Member

I think I mentioned Blaze before. I really think we should keep a close eye on it. It promises out-of-core processing and multiple backends (SQL, CSV, HDF5)...

I am happy to do some initial testing.

@JackKelly
Contributor

Personally, I need to focus on my own research, and HDF5 is fine for my needs. So feel free to build a proof-of-concept BlazeDataStore for nilmtk (and I'll help where I can), but you'll need to drive this forwards if you want it.

@JackKelly
Contributor

I'm actually now quite excited about Blaze (and dask). I get the impression that, if we had the time, we could actually re-write NILMTK with a fraction of the custom code by using some of these very powerful new projects (like Blaze and dask).

BTW, here's a good talk about dask (given at MLOSS 2015 at ICML)

@JackKelly
Contributor

Having now watched the MLOSS 2015 talk on dask, I'm pretty convinced that we could throw away pretty much all of NILMTK's out-of-core code and instead use dask (and related projects) whilst maintaining a similar public NILMTK API.

@nipunbatra
Member Author

Having now watched the MLOSS 2015 talk on dask, I'm pretty convinced that we could throw away pretty much all of NILMTK's out-of-core code and instead use dask (and related projects) whilst maintaining a similar public NILMTK API.

That sounds very interesting! As an aside, I have been hacking together ways of using parallelism with nilmtk.

@JackKelly
Contributor

Nipun mentioned Blaze last year and I foolishly didn't look into it very deeply (I should have learnt by now to always take Nipun's suggestions very seriously!). Exactly as Nipun says, it looks very interesting. The Blaze website says: "Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems."

As I understand it, Blaze has 'backends' for CSV, SQL, HDF5, MongoDB and Spark, and exposes a unified API for loading data and doing computations on data from all these sources. I think Blaze builds a graph of the computation before loading any data, so it does all the 'lazy loading' and 'out-of-core' work that we currently do ourselves in NILMTK.

I'm not certain, but it looks like Blaze could pretty much replace our DataStore classes (I know it always hurts to throw away code!), as well as much of our out-of-core processing. The end result is that NILMTK would have a considerably smaller code base, and we'd gain the ability to import / export to SQL, MongoDB and Spark (as well as CSV and HDF5, of course!).

I'm not entirely sure how to do 'DataSet.set_window()' with Blaze but I'd be amazed if it's not possible.
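
For illustration, here's a minimal sketch of Blaze's lazy expression API (the CSV filename and column name are hypothetical, not from NILMTK):

from blaze import data, compute

d = data('meter_readings.csv')           # no data is loaded yet
expr = d[d.power > 10.0].power.mean()    # builds a symbolic expression graph
result = compute(expr)                   # evaluation only happens here
print(result)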

@JackKelly JackKelly changed the title Blaze Replace NILMTK's out-of-core code with Blaze Feb 9, 2016
@nipunbatra
Member Author

The function that I use the most is disaggregate_chunk. I often don't mind bypassing nilmtk's HDF functions and loading data directly from the HDFStore using Pandas. I also often like to chunk data on my own. As an example, I was doing some disaggregation on 43 homes for one year. The NILM technique I used had a disaggregate_chunk method, and I also had access to a cluster, so I ended up making 43 × 365 calls to disaggregate_chunk. All of it was very smooth. I think at times I prefer not to let nilmtk come between me and the disaggregation algorithms. To this effect, maybe, as we did for v0.1, we should let the user write the code for a lot of data-handling operations. Of course, this would be trivial in Pandas and maybe Blaze.
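
A rough sketch of that workflow (assuming a trained NILMTK disaggregator with a disaggregate_chunk method; the file path and HDF5 key are illustrative):

import pandas as pd

# Load mains data directly with Pandas, bypassing nilmtk's DataStore.
with pd.HDFStore('ukdale.h5', mode='r') as store:
    mains = store['/building1/elec/meter1']

# Chunk by day and disaggregate each chunk independently. Each call is
# self-contained, so per-home, per-day chunks are easy to farm out to a cluster.
for day, chunk in mains.groupby(pd.Grouper(freq='D')):
    if chunk.empty:
        continue
    appliance_estimates = disaggregator.disaggregate_chunk(chunk)
    # ...store or aggregate the per-day estimates here...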

@JackKelly
Contributor

yeah, I took a very similar approach when I was doing my Neural NILM work! I think your comment is especially relevant for issue #479 "NILMTK should interact with other Python tools more smoothly": my thinking behind that issue was exactly as you say: it should be easier to build a custom pipeline using Pandas, numpy, blaze and NILMTK. i.e. to switch back and forth between NILMTK and other Python packages.

@nipunbatra
Member Author

Yup, I feel that some of the DataStore.* operations could be handled by Pandas/Blaze. Further, I think that we should make public the metrics methods that operate on chunks.

I'd really prefer the user to write the for loop for handling chunks, rather than nilmtk doing it.

@JackKelly
Contributor

I'd really prefer the user to write the for loop for handling chunks, rather than nilmtk doing it

yeah, I've been wondering about this. I think there might be (at least) three distinct 'use cases' for using NILMTK:

Use case 1: computing dataset statistics across an entire dataset (or even multiple datasets). In this mode, it probably is very helpful to the user to make NILMTK handle the chunks (although, internally, we'd probably use Blaze)

Use case 2: developing a NILM algo. In this case, my hunch is that NILMTK should just prepare a DataFrame and hand that DataFrame to the NILM algo, with very little other 'NILMTK baggage'.

Use case 3: running one or more NILM algorithms across lots and lots of data (e.g. to compute performance metrics). In this case, it probably is useful for NILMTK to handle the chunking.

@nipunbatra
Member Author

Makes sense. I'm itching to play with Blaze now.

@nipunbatra
Member Author

SFrame from Dato is now open source: http://blog.dato.com/sframe-open-source-release

@JackKelly
Contributor

So, I've just spent a few days trying to use dask on a NILMTK HDF5 file.

TL;DR: dask looks promising. But the API for dask.dataframe is much smaller than Pandas' API. Maybe, today, it would be possible to re-write a lot of NILMTK using dask instead of NILMTK's own out-of-core mechanisms. But it wouldn't be easy, and would probably require us to wrap a lot of our own code (or lightly wrapped Pandas code) in dask.delayed or dask.dataframe.rolling.wrap_rolling.
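
As a sketch of what that wrapping might look like, here's existing Pandas code wrapped in dask.delayed so that chunks are loaded and processed lazily (the file path, keys and the trivial mean() are placeholders standing in for real NILMTK routines):

import dask
import pandas as pd

@dask.delayed
def load_chunk(key):
    return pd.read_hdf('ukdale.h5', key=key)

@dask.delayed
def process_chunk(df):
    # Stand-in for an existing Pandas-based NILMTK routine.
    return df.mean()

keys = ['/building1/elec/meter1', '/building1/elec/meter2']
# Nothing runs until compute() is called; dask can then parallelise the calls.
results = dask.compute(*[process_chunk(load_chunk(k)) for k in keys])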

I've just tried porting nilmtk.electric.get_activations to use dask, and I got stuck on the second line of code of get_activations! dask.dataframe doesn't have a diff() method. OK, I thought: let's roll our own diff by subtracting two copies of the same dataframe, with one shifted forwards by one row. But dask.dataframe doesn't have a shift() method either. Neither could I figure out how to index a dask.dataframe by position (in Pandas, this is what I wanted to do: df.iloc[:-1] - df.iloc[1:-1]; but dask.dataframe doesn't have an iloc method!). Finally, I tried directly indexing the dataframe's index, but that didn't work either. My current best idea for implementing diff would be to wrap some custom code in dask.dataframe.rolling.wrap_rolling, as suggested in this Stack Overflow answer.
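
(As an aside, one possible way to emulate diff, assuming a dask version that provides map_overlap: wrap Pandas' own diff so that each partition also sees the last row of its neighbour, which makes the values at partition boundaries correct. A minimal sketch:)

import dask.dataframe as dd
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'power': np.random.rand(100)})
ddf = dd.from_pandas(pdf, npartitions=4)

# before=1 shares one row of overlap between adjacent partitions, so the
# per-partition pandas diff() is correct at partition boundaries.
diffed = ddf.map_overlap(lambda part: part.diff(), before=1, after=0)
print(diffed.compute().head())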

Also, dask doesn't like the hierarchical column headings NILMTK uses in its HDF5 files. This is how I got round this:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import pandas as pd

# Start a progress bar for all computations
pbar = ProgressBar()
pbar.register()

def custom_load():
    # Load a single meter with Pandas, then flatten the two-level
    # column index (e.g. ('power', 'active') -> 'power_active'),
    # because dask.dataframe doesn't accept hierarchical columns.
    df = pd.read_hdf('/data/mine/vadeec/merged/ukdale.h5',
                     key='/building1/elec/meter11')
    df.columns = ['{}_{}'.format(a, b)
                  for a, b in zip(df.columns.get_level_values(0),
                                  df.columns.get_level_values(1))]
    return df

# Wrap the in-memory DataFrame as a dask.dataframe,
# partitioned into chunks of one million rows.
df = dd.from_pandas(custom_load(), chunksize=1000000)

And this is how I got resampling to work in dask:

# Resample active power into 30-second bins; the standard error of the
# mean is only computed when .head() triggers execution.
r = df.power_active.resample('30S')
r.sem().head()

So, yeah, dask does look like a very promising project. But I'll probably go back to using NILMTK now.

In fact, I'm starting to wonder if out-of-core processing is necessary for NILM research. You can buy 32 GBytes of DDR4 RAM for £110 (ex VAT). One year of data for a single channel at 1 Hz resolution is only a quarter of a gigabyte (two columns of 32-bit floats).
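
(To check that back-of-envelope figure:)

# Sanity check: one year of one channel at 1 Hz, two float32 columns.
seconds_per_year = 365 * 24 * 3600                # 31,536,000 samples
bytes_per_sample = 2 * 4                          # two 32-bit floats
print(seconds_per_year * bytes_per_sample / 1e9)  # ~0.25 GB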

@JackKelly
Contributor

Great news: the lovely Dask folks might implement the functionality needed to support diff: dask/dask#1765

@JackKelly JackKelly changed the title Replace NILMTK's out-of-core code with Blaze Replace NILMTK's out-of-core code with Dask Nov 8, 2016
@PMeira PMeira added this to the v1.0 milestone Sep 9, 2018