Big Data Mining with xarray/dask vs pyspark/MLlib #61

Closed
gmaze opened this issue Jan 8, 2018 · 7 comments

@gmaze
Member

gmaze commented Jan 8, 2018

Hi all,
It's not too late to wish you a happy and productive 2018!

I am currently working with pyspark/MLlib to do some big data mining on ocean datasets (e.g. unsupervised classification).
I also use xarray/dask, but only to work with smaller datasets or to prepare data for my pyspark/MLlib workflow.

I would gladly move away from pyspark toward xarray and its awesome interface, but I've looked around (e.g. dask/dask-ml, phausamann/sklearn-xarray) and I'm confused.

I have the impression that xarray/dask can only handle incremental methods so far.
@TomAugspurger noted in dask/dask-ml#111 that "All the estimators in dask-ml will work in parallel on distributed arrays", yet no methods other than incremental ones seem to be available.

So my questions are:

  • What can xarray/dask achieve, now or in the future, compared with pyspark/MLlib for mining large datasets?
  • Can xarray/dask make use of a distributed parallel environment to implement classic data mining methods, without relying on incremental methods only?
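
To make the second question concrete, here is a minimal sketch of what a non-incremental method looks like when written in map-reduce style: one Lloyd iteration of k-means in which every chunk computes partial sums independently and the partial results are then combined. It uses only the Python standard library. The names `assign_and_sum` and `kmeans_step` and the toy data are hypothetical and are not taken from dask, dask-ml, or MLlib. Threads are used only to show the structure, not to obtain a real speed-up.

```python
from concurrent.futures import ThreadPoolExecutor


def assign_and_sum(chunk, centers):
    """Map step: per-chunk coordinate sums and point counts for each center."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in chunk:
        # Index of the nearest center (squared Euclidean distance).
        j = min(
            range(k),
            key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])),
        )
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += point[d]
    return sums, counts


def kmeans_step(chunks, centers):
    """One full-pass Lloyd iteration: map over all chunks, then reduce."""
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(lambda ch: assign_and_sum(ch, centers), chunks))
    k, dim = len(centers), len(centers[0])
    new_centers = []
    for j in range(k):
        n = sum(counts[j] for _, counts in partials)
        if n == 0:
            # No point was assigned to this center: keep it unchanged.
            new_centers.append(list(centers[j]))
            continue
        new_centers.append(
            [sum(sums[j][d] for sums, _ in partials) / n for d in range(dim)]
        )
    return new_centers


# Toy data: two chunks standing in for partitions held by different workers.
chunks = [
    [(0.0, 0.0), (0.2, 0.1), (9.8, 10.0)],
    [(10.0, 10.2), (0.1, 0.3), (10.1, 9.9)],
]
centers = [[0.0, 0.0], [10.0, 10.0]]
for _ in range(5):
    centers = kmeans_step(chunks, centers)
```

Unlike an incremental method, which updates the model after seeing each batch once, every iteration here uses all chunks before the centers change, and only the small per-chunk summaries need to travel between workers.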

I'm struggling to find answers to these questions. They matter to me because pyspark/MLlib provides fully parallel implementations of machine learning methods, and I'd like to settle on one framework to work with.

Please let me know whether:

  • we're at the stage where such non-incremental methods are simply awaiting community contributions to pangeo or dask/dask-ml
  • pangeo folks have this in mind and will possibly address it
@mrocklin
Member

mrocklin commented Jan 8, 2018

I recommend looking at http://dask-ml.readthedocs.io/en/latest/. This page lists not only incremental methods, as you suggest, but also others that are fully scalable, like GLM and k-means, as well as integration with tools like XGBoost.

Dask-ML is under active development. You might consider raising your voice at https://github.com/dask/dask-ml/issues to encourage the development of particular methods.

@gmaze
Member Author

gmaze commented Jan 8, 2018

Thanks for this amazingly fast answer!
My mistake, I started writing my question a while ago and copy/pasted it.
I note that the truly parallel K-Means was committed in Oct. 2017 and PCA in December. Indeed, that's active development!
I'll raise issues about other methods like GMM directly at dask-ml.

This raises the question: how are the dask-ml and pangeo efforts coordinated? Is there any Machine Learning group with pangeo?

@mrocklin
Member

mrocklin commented Jan 8, 2018

is there any Machine Learning group with pangeo ?

Not as far as I know. Collaboration is quite welcome.

@rabernat
Member

rabernat commented Jan 9, 2018

@gmaze: the overall goal of pangeo is to help support the development and interoperability of the open source software stack we need to do the best possible ocean / atmosphere / climate research. Machine learning is definitely an important part of that stack, and its importance will only grow in the future!

With the first NSF grant, our focus has been on the lower-level components of the stack: xarray, dask, and making them work well in a distributed context. But the whole point of this is to eventually enable more sophisticated layers to develop.

This is a long way of saying that we welcome your leadership and collaboration on the integration of machine learning libraries. As Matt said, the pace of development is quite fast, and the landscape is changing rapidly. It would be great to define some specific use cases and examples which could help focus our efforts.

Closer integration between xarray and tensorflow is also something we should think about.

@gmaze
Member Author

gmaze commented Jan 11, 2018

Great, I definitely would like to work on writing some specific use cases before moving on.
Right now I think it is fairly easy to combine xarray with scikit-learn, and efforts like sklearn-xarray pave the way to even more integration. The problem is with large datasets, where scikit-learn is no longer the most effective solution.
Tell me how to proceed !
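
For reference, the small-data case mentioned above can be sketched as follows: a gridded `xarray.DataArray` is reshaped into a (samples, features) matrix, clustered with scikit-learn's `KMeans`, and the labels are mapped back onto the grid. The synthetic `temp` field, its dimension names, and the number of clusters are assumptions made for this example only. sklearn-xarray is not used here.

```python
import numpy as np
import xarray as xr
from sklearn.cluster import KMeans

# Synthetic stand-in for a gridded ocean field on (depth, lat, lon).
rng = np.random.default_rng(0)
temp = xr.DataArray(
    rng.normal(size=(5, 4, 6)),
    dims=("depth", "lat", "lon"),
    coords={"depth": np.arange(5), "lat": np.arange(4), "lon": np.arange(6)},
    name="temperature",
)

# Each (lat, lon) column is one sample; the depth levels are its features.
X = temp.stack(sample=("lat", "lon")).transpose("sample", "depth")

# In-memory clustering: everything is loaded as a single NumPy array.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X.values)

# Reuse the stacked coordinates to put the labels back on the (lat, lon) grid.
labels = X.isel(depth=0, drop=True).copy(data=km.labels_).unstack("sample")
```

This works as long as `X.values` fits in memory on one machine, which is exactly the limit that large datasets run into.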

@stale

stale bot commented Jun 15, 2018

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Jun 15, 2018
@stale

stale bot commented Jun 22, 2018

This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.

@stale stale bot closed this as completed Jun 22, 2018

3 participants