Replace NILMTK's out-of-core code with Dask #248

Open
nipunbatra opened this issue Nov 29, 2014 · 16 comments

@nipunbatra
Member

I think I mentioned Blaze before. I really think we should keep a close eye on it. It promises out-of-core processing and multiple backends (SQL, CSV, HDF5)...

I am happy to do some initial testing.

@JackKelly
Contributor

Personally, I need to focus on my own research, and HDF5 is fine for my needs. So feel free to build a proof-of-concept BlazeDataStore for nilmtk (and I'll help where I can), but you'll need to drive this forwards if you want it.

@JackKelly
Contributor

I'm actually now quite excited about Blaze (and dask). I get the impression that, if we had the time, we could actually re-write NILMTK with a fraction of the custom code by using some of these very powerful new projects (like Blaze and dask).

BTW, here's a good talk about dask (given at MLOSS 2015 at ICML)

@JackKelly
Contributor

Having now watched the MLOSS 2015 talk on dask, I'm pretty convinced that we could throw away pretty much all of NILMTK's out-of-core code and instead use dask (and related projects) whilst maintaining a similar public NILMTK API.

@nipunbatra
Member Author

Having now watched the MLOSS 2015 talk on dask, I'm pretty convinced that we could throw away pretty much all of NILMTK's out-of-core code and instead use dask (and related projects) whilst maintaining a similar public NILMTK API.

That sounds very interesting! As an aside, I have been hacking together ways of using parallelism with nilmtk.

@JackKelly
Contributor

Nipun mentioned Blaze last year and I foolishly didn't look into it very deeply (I should have learnt by now to always take Nipun's suggestions very seriously!). Exactly as Nipun says, it looks very interesting. The Blaze website says: "Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems."

As I understand it, Blaze has 'backends' for CSV, SQL, HDF5, MongoDB and Spark, and exposes a unified API for loading data and doing computations on data from all these sources. I think Blaze builds a graph of the computation before loading any data, so it does all the 'lazy loading' and 'out-of-core' work that we currently do ourselves in NILMTK.

I'm not certain, but it looks like Blaze could pretty much replace our DataStore classes (I know it always hurts to throw away code!), as well as much of our out-of-core processing. The end result is that NILMTK would have a considerably smaller code base, and we'd gain the ability to import / export to SQL, MongoDB and Spark (as well as CSV and HDF5, of course!).

I'm not entirely sure how to do 'DataSet.set_window()' with Blaze but I'd be amazed if it's not possible.
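
For illustration, here's a minimal sketch of Blaze's lazy expression API (the CSV filename and column name are hypothetical, not from NILMTK):

from blaze import data, compute

d = data('meter_readings.csv')           # no data is loaded yet
expr = d[d.power > 10.0].power.mean()    # builds a symbolic expression graph
result = compute(expr)                   # evaluation only happens here
print(result)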

@JackKelly JackKelly changed the title Blaze Replace NILMTK's out-of-core code with Blaze Feb 9, 2016
@nipunbatra
Member Author

The function that I use the most is disaggregate_chunk. I often don't mind bypassing nilmtk's HDF functions and loading data directly from the HDFStore using Pandas. I also often like to chunk data on my own. As an example, I was doing some disaggregation on 43 homes for one year. The NILM technique I used had a disaggregate_chunk method, and I also had access to a cluster, so I ended up making 43 × 365 calls to disaggregate_chunk. All of it was very smooth. I think at times I prefer not to let nilmtk come between me and the disaggregation algorithms. To this effect, maybe, as we did for v0.1, we should let the user write the code for a lot of data-handling operations. Of course, this would be trivial in Pandas and maybe Blaze.
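
A rough sketch of that workflow (assuming a trained NILMTK disaggregator with a disaggregate_chunk method; the file path and HDF5 key are illustrative):

import pandas as pd

# Load mains data directly with Pandas, bypassing nilmtk's DataStore.
with pd.HDFStore('ukdale.h5', mode='r') as store:
    mains = store['/building1/elec/meter1']

# Chunk by day and disaggregate each chunk independently. Each call is
# self-contained, so per-home, per-day chunks are easy to farm out to a cluster.
for day, chunk in mains.groupby(pd.Grouper(freq='D')):
    if chunk.empty:
        continue
    appliance_estimates = disaggregator.disaggregate_chunk(chunk)
    # ...store or aggregate the per-day estimates here...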

@JackKelly
Contributor

yeah, I took a very similar approach when I was doing my Neural NILM work! I think your comment is especially relevant for issue #479 "NILMTK should interact with other Python tools more smoothly": my thinking behind that issue was exactly as you say: it should be easier to build a custom pipeline using Pandas, numpy, blaze and NILMTK. i.e. to switch back and forth between NILMTK and other Python packages.

@nipunbatra
Member Author

Yup, I feel that some of the DataStore.* operations could be handled by Pandas/Blaze. Further, I think that we should make public the metrics methods that operate on chunks.

I'd really prefer the user to write the for loop for handling chunks, rather than nilmtk doing it.

@JackKelly
Contributor

I'd really prefer the user to write the for loop for handling chunks, rather than nilmtk doing it

yeah, I've been wondering about this. I think there might be (at least) three distinct 'use cases' for using NILMTK:

Use case 1: computing dataset statistics across an entire dataset (or even multiple datasets). In this mode, it probably is very helpful to the user to make NILMTK handle the chunks (although, internally, we'd probably use Blaze)

Use case 2: developing a NILM algo. In this case, my hunch is that NILMTK should just prepare a DataFrame and hand that DataFrame to the NILM algo, with very little other 'NILMTK baggage'.

Use case 3: running one or more NILM algorithms across lots and lots of data (e.g. to compute performance metrics). In this case, it probably is useful for NILMTK to handle the chunking.

@nipunbatra
Member Author

Makes sense. I'm itching to play with Blaze now.

@nipunbatra
Member Author

SFrame from Dato is now open source: http://blog.dato.com/sframe-open-source-release

@JackKelly
Contributor

So, I've just spent a few days trying to use dask on a NILMTK HDF5 file.

TL;DR: dask looks promising. But the API for dask.dataframe is much smaller than Pandas' API. Maybe, today, it would be possible to re-write a lot of NILMTK using dask instead of NILMTK's own out-of-core mechanisms. But it wouldn't be easy, and would probably require us to wrap a lot of our own code (or lightly wrapped Pandas code) in dask.delayed or dask.dataframe.rolling.wrap_rolling.
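
As a sketch of what that wrapping might look like, here's existing Pandas code wrapped in dask.delayed so that chunks are loaded and processed lazily (the file path, keys and the trivial mean() are placeholders standing in for real NILMTK routines):

import dask
import pandas as pd

@dask.delayed
def load_chunk(key):
    return pd.read_hdf('ukdale.h5', key=key)

@dask.delayed
def process_chunk(df):
    # Stand-in for an existing Pandas-based NILMTK routine.
    return df.mean()

keys = ['/building1/elec/meter1', '/building1/elec/meter2']
# Nothing runs until compute() is called; dask can then parallelise the calls.
results = dask.compute(*[process_chunk(load_chunk(k)) for k in keys])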

I've just tried porting nilmtk.electric.get_activations to use dask, and I got stuck on the second line of code of get_activations! dask.dataframe doesn't have a diff() method. OK, I thought: let's roll our own diff by subtracting two copies of the same dataframe, with one shifted forwards by one row. But dask.dataframe doesn't have a shift() method either. Neither could I figure out how to index a dask.dataframe by position (in Pandas, this is what I wanted to do: df.iloc[:-1] - df.iloc[1:-1]; but dask.dataframe doesn't have an iloc method!). Finally, I tried directly indexing the dataframe's index, but that didn't work either. My current best idea for implementing diff would be to wrap some custom code in dask.dataframe.rolling.wrap_rolling, as suggested in this Stack Overflow answer.
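
(As an aside, one possible way to emulate diff, assuming a dask version that provides map_overlap: wrap Pandas' own diff so that each partition also sees the last row of its neighbour, which makes the values at partition boundaries correct. A minimal sketch:)

import dask.dataframe as dd
import numpy as np
import pandas as pd

pdf = pd.DataFrame({'power': np.random.rand(100)})
ddf = dd.from_pandas(pdf, npartitions=4)

# before=1 shares one row of overlap between adjacent partitions, so the
# per-partition pandas diff() is correct at partition boundaries.
diffed = ddf.map_overlap(lambda part: part.diff(), before=1, after=0)
print(diffed.compute().head())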

Also, dask doesn't like the hierarchical column headings NILMTK uses in its HDF5 files. This is how I got round this:

import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import pandas as pd

# Start a progress bar for all computations
pbar = ProgressBar()
pbar.register()

def custom_load():
    # Load a single meter with Pandas, then flatten the two-level
    # column index (e.g. ('power', 'active') -> 'power_active'),
    # because dask.dataframe doesn't accept hierarchical columns.
    df = pd.read_hdf('/data/mine/vadeec/merged/ukdale.h5',
                     key='/building1/elec/meter11')
    df.columns = ['{}_{}'.format(a, b)
                  for a, b in zip(df.columns.get_level_values(0),
                                  df.columns.get_level_values(1))]
    return df

# Wrap the in-memory DataFrame as a dask.dataframe,
# partitioned into chunks of one million rows.
df = dd.from_pandas(custom_load(), chunksize=1000000)

And this is how I got resampling to work in dask:

# Resample active power into 30-second bins; the standard error of the
# mean is only computed when .head() triggers execution.
r = df.power_active.resample('30S')
r.sem().head()

So, yeah, dask does look like a very promising project. But I'll probably go back to using NILMTK now.

In fact, I'm starting to wonder if out-of-core processing is necessary for NILM research. You can buy 32 GBytes of DDR4 RAM for £110 (ex VAT). One year of data for a single channel at 1 Hz resolution is only a quarter of a gigabyte (two columns of 32-bit floats).
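
(To check that back-of-envelope figure:)

# Sanity check: one year of one channel at 1 Hz, two float32 columns.
seconds_per_year = 365 * 24 * 3600                # 31,536,000 samples
bytes_per_sample = 2 * 4                          # two 32-bit floats
print(seconds_per_year * bytes_per_sample / 1e9)  # ~0.25 GB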

@JackKelly
Contributor

Great news: the lovely Dask folks might implement the functionality needed to support diff: dask/dask#1765

@JackKelly JackKelly changed the title Replace NILMTK's out-of-core code with Blaze Replace NILMTK's out-of-core code with Dask Nov 8, 2016
@PMeira PMeira added this to the v1.0 milestone Sep 9, 2018