Big Data Mining with xarray/dask vs pyspark/MLlib #61

gmaze · 2018-01-08T21:19:50Z

Hi all,
It's not too late to wish you a happy and productive new year 2018!

I am currently working with pyspark/MLlib to do some big data mining on ocean datasets (e.g. unsupervised classification).
I'm also using xarray/dask, but only to work with smaller datasets or to prepare data to feed my pyspark/MLlib workflow.

I would gladly move away from pyspark toward xarray and its awesome interface, but I've looked around (e.g. dask/dask-ml, phausamann/sklearn-xarray) and I'm confused.

I have the impression that xarray/dask can only handle incremental methods so far.
Although @TomAugspurger noted in dask/dask-ml#111 that "All the estimators in dask-ml will work in parallel on distributed arrays", no methods other than incremental ones are available yet.

So my questions are:

What can xarray/dask achieve now, or in the future, compared with pyspark/MLlib for mining large datasets?
Can xarray/dask make use of a distributed parallel environment to implement classic data mining methods, without relying on incremental methods only?

I struggle to find answers to these questions, which matter to me simply because pyspark/MLlib provides fully parallel implementations of machine learning methods and I'd like to settle on a framework to work with.

Thanks for letting me know whether:

we're likely at the stage where such non-incremental methods are simply awaiting community contributions to pangeo or dask/dask-ml
pangeo folks have this in mind and will possibly address it
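
To make the "incremental" idea above concrete, here is a minimal sketch of what such a method looks like in practice: the model only ever sees one chunk of data at a time and is updated with `partial_fit`. It uses scikit-learn's `MiniBatchKMeans` as a stand-in; the synthetic chunks, the `make_chunk` helper, and the number of clusters are assumptions made up for illustration, not something taken from dask-ml or pangeo.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical cluster centers used only to generate toy data.
true_centers = np.array([[0.0, 0.0], [10.0, 10.0], [-10.0, 10.0]])


def make_chunk(n=500):
    """Return one synthetic chunk of 2-D samples drawn around the toy centers."""
    idx = rng.integers(0, len(true_centers), size=n)
    return true_centers[idx] + rng.normal(scale=0.5, size=(n, 2))


model = MiniBatchKMeans(n_clusters=3, n_init=3, random_state=0)

# Incremental training: each chunk is seen once and then discarded,
# so the full dataset never has to fit in memory at the same time.
for _ in range(20):
    model.partial_fit(make_chunk())

new_labels = model.predict(make_chunk(10))
```

The loop itself is sequential, which is the limitation raised above: a larger cluster does not by itself make this kind of training faster.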

mrocklin · 2018-01-08T21:23:08Z

I recommend looking at http://dask-ml.readthedocs.io/en/latest/ . This page lists not only incremental methods, such as you mention, but also others that are fully scalable, like GLM and k-means, as well as integration with tools like XGBoost.

Dask-ML is under active development. You might consider raising your voice at https://github.com/dask/dask-ml/issues to encourage the development of particular methods.

gmaze · 2018-01-08T22:13:03Z

Thanks for this amazingly fast answer!
My mistake: I started writing my question a while ago and copy/pasted it.
I note that the truly parallel K-Means was committed in Oct. 2017 and PCA in December. That is indeed active development!
I'll raise issues about other methods like GMM directly at dask-ml.

This raises the question: how are the dask-ml and pangeo efforts coordinated? Is there any machine learning group within pangeo?

mrocklin · 2018-01-08T23:04:26Z

"Is there any machine learning group within pangeo?"

Not as far as I know. Collaboration is quite welcome.

rabernat · 2018-01-09T02:15:18Z

@gmaze: the overall goal of pangeo is to help support the development and interoperability of the open source software stack we need to do the best possible ocean / atmosphere / climate research. Machine learning is definitely an important part of that stack, and its importance will only grow in the future!

With the first NSF grant, our focus has been on the lower-level components of the stack: xarray, dask, and making them work well in a distributed context. But the whole point of this is to eventually enable more sophisticated layers to develop.

This is a long way of saying that we welcome your leadership and collaboration on the integration of machine learning libraries. As Matt said, the pace of development is quite fast, and the landscape is changing rapidly. It would be great to define some specific use cases and examples which could help focus our efforts.

Closer integration between xarray and tensorflow is also something we should think about.

gmaze · 2018-01-11T08:32:30Z

Great, I would definitely like to work on writing some specific use cases before moving on.
Right now I think it is fairly easy to combine xarray with scikit-learn, and efforts like sklearn-xarray pave the way to even more integration. The problem is with large datasets, where scikit-learn is no longer the most effective solution.
Tell me how to proceed!
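
For reference, combining xarray with scikit-learn on a small dataset can be sketched as follows: stack the horizontal dimensions into a single `sample` dimension, cluster the resulting (sample, feature) matrix, and unstack the labels back onto the grid. The variable name, grid sizes, and random data are assumptions for illustration only; this is plain xarray plus scikit-learn, not the sklearn-xarray API.

```python
import numpy as np
import xarray as xr
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Hypothetical gridded field: one vertical profile of 10 depth levels per (lat, lon) point.
temperature = xr.DataArray(
    rng.normal(size=(4, 5, 10)),
    dims=("lat", "lon", "depth"),
    coords={
        "lat": np.linspace(-30.0, 30.0, 4),
        "lon": np.linspace(0.0, 40.0, 5),
        "depth": np.arange(10) * 100.0,
    },
    name="temperature",
)

# Reshape to the (n_samples, n_features) layout that scikit-learn expects.
profiles = temperature.stack(sample=("lat", "lon")).transpose("sample", "depth")

# Unsupervised classification of the profiles, entirely in memory.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(profiles.values)

# Reuse the stacked coordinates to map the labels back onto the (lat, lon) grid.
label_map = (
    profiles.isel(depth=0, drop=True)
    .copy(data=labels)
    .rename("cluster")
    .unstack("sample")
)
```

The `profiles.values` call is where this approach stops scaling: it loads the whole array into memory on one machine.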

stale · 2018-06-15T22:40:34Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale · 2018-06-22T23:26:20Z

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

gmaze mentioned this issue Jan 8, 2018

Example PBS Script dask/distributed#1260

Closed

TomAugspurger mentioned this issue Jan 8, 2018

sklearn-xarray compatability dask/dask-ml#112

Open
