Replace NILMTK's out-of-core code with Dask #248
Comments
Personally, I need to focus on my own research needs, and HDF5 is fine for my …
I'm actually now quite excited about Blaze (and dask). I get the impression that, if we had the time, we could re-write NILMTK with a fraction of the custom code by using some of these very powerful new projects (like Blaze and dask). BTW, here's a good talk about dask (given at MLOSS 2015 at ICML).
Having now watched the MLOSS 2015 talk on dask, I'm pretty convinced that we could throw away almost all of NILMTK's out-of-core code and use dask (and related projects) instead, whilst maintaining a similar public NILMTK API.
That sounds very interesting! Aside: I have been hacking on ways of using parallelism with NILMTK.
Nipun mentioned Blaze last year and I foolishly didn't look into it very deeply (I should have learnt by now to always take Nipun's suggestions very seriously!). Exactly as Nipun says, it looks very interesting.

The Blaze website says: "Blaze translates a subset of modified NumPy and Pandas-like syntax to databases and other computing systems. Blaze allows Python users a familiar interface to query data living in other data storage systems."

As I understand it, Blaze has 'backends' for CSV, SQL, HDF5, MongoDB and Spark, and exposes a unified API for loading data and doing computations on data from all these sources. I think Blaze builds a graph of the computation before loading any data, so it does all the 'lazy loading' and 'out-of-core' stuff that we currently do ourselves in NILMTK.

I'm not certain, but it looks like Blaze could pretty much replace our DataStore classes (I know it always hurts to throw away code!), as well as much of our out-of-core processing. The end result is that NILMTK would have a considerably smaller code base, but we'd gain the ability to import/export to SQL, MongoDB and Spark (as well as CSV and HDF5, of course!).

I'm not entirely sure how to do `DataSet.set_window()` with Blaze, but I'd be amazed if it's not possible.
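To illustrate the "build a graph of the computation before loading any data" idea, here is a toy sketch of the concept in plain Python. This is purely illustrative (the `Deferred` class is invented for this comment, and is not Blaze's or dask's actual API):

```python
class Deferred:
    """Toy deferred computation: records operations and only touches the
    data source when .compute() is called -- the same basic idea Blaze
    and dask use to enable lazy, out-of-core evaluation."""
    def __init__(self, load, ops=()):
        self._load = load      # callable that actually loads the data
        self._ops = list(ops)  # pipeline of transformations to apply later

    def map(self, fn):
        # Adding a step only extends the graph; nothing is loaded yet.
        return Deferred(self._load, self._ops + [fn])

    def compute(self):
        # Only now do we load the data and run the recorded pipeline.
        value = self._load()
        for fn in self._ops:
            value = fn(value)
        return value

loaded = []
def load_chunk():
    loaded.append(True)   # record that loading actually happened
    return [1.0, 2.0, 3.0]

pipeline = Deferred(load_chunk).map(lambda xs: [x * 10 for x in xs]).map(sum)
assert loaded == []        # graph is built, but no data has been loaded
print(pipeline.compute())  # 60.0 -- data loaded only on demand
```

In a real system the "load" step would be a chunked read from HDF5/SQL/CSV, and the graph would be executed chunk-by-chunk, which is exactly the out-of-core machinery NILMTK currently hand-rolls.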
The function that I use the most is …
Yeah, I took a very similar approach when I was doing my Neural NILM work! I think your comment is especially relevant to issue #479 ("NILMTK should interact with other Python tools more smoothly"): my thinking behind that issue was exactly as you say — it should be easier to build a custom pipeline using Pandas, NumPy, Blaze and NILMTK, i.e. to switch back and forth between NILMTK and other Python packages.
Yup — I feel I'd really prefer the user to write the for loop for handling chunks, rather than having NILMTK do it.
Yeah, I've been wondering about this. I think there might be (at least) three distinct use cases for NILMTK:

- Use case 1: computing statistics across an entire dataset (or even multiple datasets). In this mode, it probably is very helpful for NILMTK to handle the chunking (although, internally, we'd probably use Blaze).
- Use case 2: developing a NILM algorithm. Here, my hunch is that NILMTK should just prepare a DataFrame and hand that DataFrame to the NILM algorithm, with very little other 'NILMTK baggage'.
- Use case 3: running one or more NILM algorithms across lots and lots of data (e.g. to compute performance metrics). In this case, it probably is useful for NILMTK to handle the chunking.
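A minimal sketch of what use case 2 could look like. Everything here is hypothetical (the `disaggregate` function and the threshold stand in for whatever real algorithm the researcher plugs in); the point is just that NILMTK's job would end once a plain pandas DataFrame exists:

```python
import pandas as pd

def disaggregate(mains: pd.DataFrame) -> pd.Series:
    """Hypothetical NILM algorithm: takes a plain DataFrame of mains
    power readings and returns a per-sample 'appliance on?' flag."""
    return (mains['power_active'] > 100).astype(int)

# NILMTK's (hypothetical) role here is only to produce this DataFrame...
index = pd.date_range('2014-01-01', periods=6, freq='6S')
mains = pd.DataFrame({'power_active': [50.0, 120.0, 300.0, 80.0, 95.0, 150.0]},
                     index=index)

# ...and hand it straight to the algorithm, with no other NILMTK baggage.
states = disaggregate(mains)
print(states.tolist())  # [0, 1, 1, 0, 0, 1]
```

The researcher stays in plain pandas/NumPy land for the algorithm itself, and only touches NILMTK for loading and metadata.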
Makes sense. Itching to play with Blaze now.
Another pertinent backend: https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html |
SFrame from Dato is now open source: http://blog.dato.com/sframe-open-source-release
The Dask ecosystem keeps getting more impressive.
So, I've just spent a few days trying to use dask on a NILMTK HDF5 file. TLDR: dask looks promising. But the API for … I've just tried porting …

Also, dask doesn't like the hierarchical column headings NILMTK uses in its HDF5 files. This is how I got round that:

```python
import dask.dataframe as dd
from dask.diagnostics import ProgressBar
import pandas as pd

# Start a progress bar for all computations
pbar = ProgressBar()
pbar.register()

def custom_load():
    df = pd.read_hdf('/data/mine/vadeec/merged/ukdale.h5',
                     key='/building1/elec/meter11')
    # Flatten the hierarchical (MultiIndex) columns into single strings,
    # e.g. ('power', 'active') -> 'power_active', which dask can handle.
    df.columns = ['{}_{}'.format(a, b)
                  for a, b in zip(df.columns.get_level_values(0),
                                  df.columns.get_level_values(1))]
    return df

df = dd.from_pandas(custom_load(), chunksize=1000000)
```

And this is how I got resampling to work in dask:

```python
r = df.power_active.resample('30S')
r.sem().head()
```

So, yeah, dask does look like a very promising project, but I'll probably go back to using NILMTK now. In fact, I'm starting to wonder whether out-of-core processing is even necessary for NILM research. You can buy 32 GB of DDR4 RAM for £110 (ex VAT), and one year of data for a single channel at 1 Hz resolution is only a quarter of a gigabyte (two columns of 32-bit floats).
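For what it's worth, the back-of-the-envelope arithmetic checks out (assuming two 32-bit float columns sampled at 1 Hz for a non-leap year):

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 samples at 1 Hz
BYTES_PER_SAMPLE = 2 * 4               # two columns of 32-bit (4-byte) floats
total_bytes = SECONDS_PER_YEAR * BYTES_PER_SAMPLE
print(total_bytes / 1e9)  # 0.252288 -- about a quarter of a gigabyte
```

So a 32 GB machine could hold over a century of single-channel 1 Hz data in RAM (ignoring index and per-object overheads).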
Great news: the lovely Dask folks might implement the functionality required to implement …
I think I mentioned Blaze before. I really think we should keep a close eye on it: it promises out-of-core processing and multiple backends (SQL, CSV, HDF5)…
I am happy to do some initial testing.