Big Data Mining with xarray/dask vs pyspark/MLlib #61
Comments
I recommend looking at http://dask-ml.readthedocs.io/en/latest/. That page lists incremental methods, such as the ones you mention, as well as fully scalable ones like GLM and k-means, plus integration with tools like XGBoost. Dask-ML is under active development. You might consider raising your voice at https://github.com/dask/dask-ml/issues to encourage the development of particular methods.
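To make the term "incremental" concrete, here is a minimal sketch of a model that only ever sees one chunk of data at a time through a `partial_fit`-style update. It is plain Python written for illustration: the `IncrementalKMeans` class, the `make_chunks` helper, and the synthetic data are all hypothetical and are not part of dask-ml or scikit-learn.

```python
import random


class IncrementalKMeans:
    """Toy mini-batch k-means (hypothetical class, not a dask-ml API)."""

    def __init__(self, initial_centers):
        self.centers = [list(c) for c in initial_centers]
        self.counts = [0] * len(self.centers)

    def _nearest(self, point):
        # Index of the center with the smallest squared distance to the point.
        dists = [
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in self.centers
        ]
        return dists.index(min(dists))

    def partial_fit(self, chunk):
        # Update the centers from one chunk only; earlier chunks are never revisited.
        for point in chunk:
            k = self._nearest(point)
            self.counts[k] += 1
            rate = 1.0 / self.counts[k]  # running-mean step size
            self.centers[k] = [
                c + rate * (p - c) for p, c in zip(point, self.centers[k])
            ]
        return self


def make_chunks(n_chunks, chunk_size, seed=0):
    """Yield synthetic 2-D chunks drawn around (0, 0) and (10, 10)."""
    rng = random.Random(seed)
    for _ in range(n_chunks):
        chunk = []
        for _ in range(chunk_size):
            cx, cy = rng.choice([(0.0, 0.0), (10.0, 10.0)])
            chunk.append((cx + rng.gauss(0, 1), cy + rng.gauss(0, 1)))
        yield chunk


model = IncrementalKMeans(initial_centers=[(1.0, 1.0), (9.0, 9.0)])
for chunk in make_chunks(n_chunks=20, chunk_size=100):
    model.partial_fit(chunk)  # only one chunk is in memory at a time

fitted = sorted(tuple(c) for c in model.centers)
```

The point of the pattern is that memory use is bounded by the chunk size, at the cost of processing chunks one after another rather than in parallel.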
Thanks for this amazingly fast answer! This raises the question: how are the dask-ml and pangeo efforts coordinated? Is there a machine learning group within pangeo?
Not as far as I know. Collaboration is quite welcome.
@gmaze: the overall goal of pangeo is to help support the development and interoperability of the open source software stack we need to do the best possible ocean / atmosphere / climate research. Machine learning is definitely an important part of that stack, and its importance will only grow in the future! With the first NSF grant, our focus has been on the lower-level components of the stack: xarray, dask, and making them work well in a distributed context. But the whole point of this is to eventually enable more sophisticated layers to develop. This is a long way of saying that we welcome your leadership and collaboration on the integration of machine learning libraries. As Matt said, the pace of development is quite fast, and the landscape is changing rapidly. It would be great to define some specific use cases and examples which could help focus our efforts. Closer integration between xarray and tensorflow is also something we should think about.
Great, I would definitely like to work on writing some specific use cases before moving on.
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed because it had not seen recent activity. The issue can always be reopened at a later date.
Hi all,
It's not too late to wish you a happy new year 2018 and a productive year!
I am currently working with pyspark/MLlib to conduct some big data mining on ocean datasets (e.g. unsupervised classification).
I'm also using xarray/dask, but only to work with smaller datasets or to prepare data that feeds my pyspark/MLlib workflow.
I would gladly move away from pyspark toward xarray and its awesome interface, but I've looked around (e.g. dask/dask-ml, phausamann/sklearn-xarray) and I'm confused.
I have the impression that xarray/dask can only handle incremental methods so far.
Although @TomAugspurger noticed that
(dask/dask-ml#111), no methods other than incremental ones are available yet.
So my questions are:
I'm struggling to find answers to these questions, which matter to me simply because pyspark/MLlib provides fully parallel implementations of machine learning methods and I'd like to settle on a framework to work with.
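For contrast with an incremental update, a "fully parallel" k-means iteration can be sketched as a map step that computes per-cluster sums and counts on each chunk independently, followed by a reduce step that merges them. The sketch below is plain Python under that assumption; the function names and the tiny dataset are hypothetical, and it is not how MLlib or dask-ml implement k-means internally.

```python
from functools import reduce


def partial_stats(chunk, centers):
    """Map step: per-cluster coordinate sums and counts for one chunk."""
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in centers]
    counts = [0] * len(centers)
    for point in chunk:
        dists = [
            sum((p - c) ** 2 for p, c in zip(point, center))
            for center in centers
        ]
        k = dists.index(min(dists))
        counts[k] += 1
        for d in range(dim):
            sums[k][d] += point[d]
    return sums, counts


def combine(a, b):
    """Reduce step: merge the statistics of two chunks."""
    sums_a, counts_a = a
    sums_b, counts_b = b
    sums = [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(sums_a, sums_b)]
    counts = [x + y for x, y in zip(counts_a, counts_b)]
    return sums, counts


def kmeans_step(chunks, centers):
    """One Lloyd iteration over all chunks; empty clusters keep their center."""
    # Each partial_stats call is independent of the others, so a distributed
    # scheduler could run them on different workers before the reduce.
    sums, counts = reduce(combine, (partial_stats(c, centers) for c in chunks))
    return [
        [s / n for s in row] if n else list(center)
        for row, n, center in zip(sums, counts, centers)
    ]


chunks = [
    [(0.0, 0.0), (1.0, 0.0)],
    [(0.0, 1.0), (10.0, 10.0)],
    [(11.0, 10.0), (10.0, 11.0)],
]
centroids = [[1.0, 1.0], [9.0, 9.0]]
for _ in range(5):
    centroids = kmeans_step(chunks, centroids)
```

Unlike the incremental case, every iteration here uses all chunks at once, which is what makes the result independent of the order in which chunks are visited.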
Thanks for letting me know if: