Integrate XGboost into Hivemall #251
Good to see this issue here. I have a question about the hivemall project. XGBoost already comes with YARN integration, and the project itself is designed to be portable as a brick to any distributed platform. See also the most recent refactor https://github.com/dmlc/xgboost/tree/brick which stands as a cleaner version that allows customizable plugins of data sources (possibly via JNI callbacks).
@tqchen thank you for introducing the next-gen xgboost. Do you have any design document explaining how Gradient Tree Boosting is parallelized in xgboost+wormhole? Gradient Boosting itself is a sequential algorithm, so I guess tree/subtree construction in each iteration is parallelized using MPI or AllReduce. I might use it through Hive by just kicking off a YARN application. Hivemall has a custom protocol for parameter mixing among servers, called the MIX server. It's similar to parameter servers, but it is not a KVS-based parameter server or a BSP/SSP/AllReduce protocol. The MIX server is half parameter mix server and half parameter mixing protocol. It can handle coefficients as well as feature weights. It's not for tree models. A standalone version is already available and a YARN version is under development.
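(Not Hivemall's actual MIX protocol or API, just a rough conceptual sketch of the parameter-mixing idea in Scala: each trainer periodically pushes its local feature weights to a mix server, which keeps a running average and hands the mixed weight back.)

```scala
// Rough conceptual sketch of parameter mixing (NOT the real Hivemall MIX server API).
// Each trainer accumulates per-feature weights locally and periodically "mixes" them
// with the running averages held by a central mix server.
object MixSketch {
  // Mix server state: running average of a feature weight and how many updates it has seen.
  final case class Mixed(weight: Double, clock: Int)

  def main(args: Array[String]): Unit = {
    val server = scala.collection.mutable.Map.empty[String, Mixed]

    // A trainer pushes its local weight for a feature; the server folds it into the average.
    def mix(feature: String, localWeight: Double): Double = {
      val cur    = server.getOrElse(feature, Mixed(0.0, 0))
      val merged = Mixed((cur.weight * cur.clock + localWeight) / (cur.clock + 1), cur.clock + 1)
      server(feature) = merged
      merged.weight // the trainer replaces its local weight with the mixed one
    }

    // Two hypothetical trainers mixing the same feature.
    println(mix("f1", 0.8)) // 0.8
    println(mix("f1", 0.4)) // 0.6 -- averaged across trainers
  }
}
```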
I see, so this is more like a server module for you to maintain consistent state. Sounds interesting. I like this kind of improvement to data processing systems, instead of sticking to what was already there, which might not be the best fit for machine learning.
To answer your question, xgboost parallelizes tree construction. We do not yet have a detailed description of the algorithm, but you can view it as an AllReduce-style statistics aggregation. It is built on the API of https://github.com/dmlc/rabit. I am more interested in deeper integration of xgboost with other systems, as this is its main design goal. It takes two things to port a dmlc program to an existing system.
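A tiny local simulation of that AllReduce-style aggregation (conceptual only; real xgboost runs this across machines through rabit, and the numbers here are made up): each worker builds gradient statistics for its slice of the data, the per-bucket statistics are summed across workers, and every worker then evaluates the identical global histogram, which is why they all pick the same split.

```scala
// Local simulation of AllReduce-style gradient statistics aggregation
// (conceptual only; real xgboost does this through rabit across machines).
object AllReduceSketch {
  def main(args: Array[String]): Unit = {
    // Per-worker statistics: sum of gradients per candidate split bucket.
    val worker0 = Array(1.5, -0.2, 0.7)
    val worker1 = Array(0.3,  0.9, -1.1)

    // "AllReduce" with a sum op: every worker ends up with the element-wise total.
    val global = worker0.zip(worker1).map { case (a, b) => a + b }

    // Each worker now evaluates the identical global statistics and therefore
    // picks the same best split, which keeps the boosters in sync.
    val bestBucket = global.zipWithIndex.maxBy { case (g, _) => math.abs(g) }._2
    println(s"global histogram = ${global.mkString(", ")}; best bucket = $bestBucket")
  }
}
```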
Personally I believe making xgboost talk to other systems, rather than restricting it to certain platforms, is far more interesting and helpful for our users.
I would like to follow up on this. Recently there has been deeper integration of xgboost with the JVM stack, so it is now quite easy to integrate the existing abstraction into distributed computing systems. See xgboost4j-flink and xgboost4j-spark for how this can be done in a few lines of code.
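For reference, a minimal sketch of what the Spark path looks like, assuming the `XGBoost.train` entry point of xgboost4j-spark from around that time (the exact signature, parameter names, and file paths here are illustrative and may differ between releases):

```scala
// Sketch of distributed training with xgboost4j-spark (signature approximate for that era).
import org.apache.spark.SparkContext
import org.apache.spark.mllib.util.MLUtils
import ml.dmlc.xgboost4j.scala.spark.XGBoost

object XGBoostOnSpark {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext()
    // Training data as RDD[LabeledPoint], e.g. loaded from a libsvm file (path is illustrative).
    val trainingData = MLUtils.loadLibSVMFile(sc, "train.libsvm")

    val params = Map(
      "objective" -> "binary:logistic",
      "eta" -> 0.1,
      "max_depth" -> 6)

    // 100 boosting rounds on 8 workers: the Spark tasks train cooperatively via rabit
    // and all end up with the same booster.
    val model = XGBoost.train(trainingData, params, 100, 8)
    // The resulting model can then be used for prediction or exported.
    println(model)
  }
}
```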
@tqchen Thank you for the pointer. It definitely helps with implementing XGBoost on Hive. BTW, FYI, I found a pure Scala port of XGBoost for Apache Spark.
Thanks for the pointer. Since the xgboost JVM package still uses the native library in the background, it enjoys all the optimizations in xgboost, which means all of the library features as well as faster speed and efficient memory use. Our goal is to make the most optimized library available for all platforms. We have benchmarked distributed xgboost against other systems and will publish a paper about the results in the near future.
@tqchen Great work :)) One question;
@maropu Your observation is correct. However, internally, xgboost uses https://github.com/dmlc/rabit to communicate between workers, and this is embedded into the distributed training. So the workers coordinate with each other in each iteration and they get an identical booster in the end (trained from the whole dataset).
I am also interested to hear your ideas on other alternatives for integration, as it seems to me that hivemall also requires starting up additional jobs for the MIX server.
@tqchen Ah, ... I see. Off topic though, I think you can use
Rabit uses all the computation resources across all the workers and scales up. So if we only build the model at the driver, we are limited by the computation resources of a single machine.
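For contrast, a driver-only approach would look roughly like the sketch below (using the single-machine xgboost4j Scala API; names and paths are illustrative): the whole training set has to be materialized on one machine, so its memory and cores bound the model building, whereas the distributed path keeps the data partitioned across executors and lets rabit aggregate the statistics.

```scala
// Driver-only training sketch: everything funnels through one JVM,
// so the single machine's memory and cores become the bottleneck.
import ml.dmlc.xgboost4j.scala.{DMatrix, XGBoost}

object DriverOnlySketch {
  def main(args: Array[String]): Unit = {
    // All training data must fit on the driver (here read from a local libsvm file).
    val dtrain = new DMatrix("train.libsvm")
    val params = Map("objective" -> "binary:logistic", "eta" -> 0.1, "max_depth" -> 6)

    // Train 100 rounds on this one machine, watching the training set.
    val booster = XGBoost.train(dtrain, params, 100, Map("train" -> dtrain))
    booster.saveModel("xgboost.model")
  }
}
```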
Indeed, it is essentially difficult to parallelize building a model in Spark.
In xgboost, all the executors need to collaborate by communicating with each other via rabit. So each executor takes part of the data, trains the model, and synchronizes the statistics during training. We do need all the executors to collaborate and parallelize the model building so it scales up. All the executors get the same model as a result of the collaboration :)
Ah, I see. You mean that
Yes
Ah, great ;) I'll look into the code of
Merged in #281 |
https://github.com/dmlc/xgboost