This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

Integrate XGBoost into Hivemall #251

Closed · myui opened this issue Jan 6, 2016 · 19 comments

myui (Owner) commented Jan 6, 2016

https://github.com/dmlc/xgboost

myui self-assigned this Jan 6, 2016
myui added this to the v0.4 milestone Jan 6, 2016
tqchen commented Jan 14, 2016

Good to see this issue here. I have a question about the Hivemall project:
how are distributed algorithms implemented (specifically, synchronization)? Are they implemented as iterative MapReduce workflows in Hive?

XGBoost already comes with YARN integration, and the project itself is designed to be portable, as a building block, to any distributed platform. See also the most recent refactor, https://github.com/dmlc/xgboost/tree/brick, which is a cleaner version that allows customizable data-source plugins (possibly via JNI callbacks).

myui (Owner, Author) commented Jan 15, 2016

@tqchen thank you for introducing the next-generation xgboost.

Do you have any design document explaining how Gradient Tree Boosting is parallelized in xgboost+wormhole? Gradient Boosting itself is a sequential algorithm, so I guess the tree/subtree construction in each iteration is parallelized using MPI or AllReduce. I might use it from Hive by simply launching a YARN application.

Hivemall has a custom protocol for parameter mixing among servers, called the MIX server:
https://github.com/myui/hivemall/wiki/How-to-use-Model-Mixing

It is similar to a parameter server, but it is neither a KVS-based parameter server nor a BSP/SSP/AllReduce protocol; the MIX server is half parameter-mixing server and half parameter-mixing protocol. It can handle coefficients as well as feature weights, but it is not for tree models.

A standalone version is already available and a YARN version is under development:
#246
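
For readers unfamiliar with the idea, here is a tiny conceptual Scala sketch of asynchronous weight mixing: workers push accumulated per-feature weights to a mix server, which keeps running averages that workers fold back into their local models. This is only an illustration of the general idea under my own simplifying assumptions; it is not Hivemall's actual MIX protocol or API (see the wiki page above for that).

```scala
import scala.collection.mutable

// Conceptual sketch of parameter mixing (not Hivemall's actual MIX protocol):
// the server keeps a running average of each feature's weight across workers.
final class MixServerSketch {
  private val sum   = mutable.Map.empty[String, Double].withDefaultValue(0.0)
  private val count = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // A worker pushes its current local weight for a feature.
  def push(feature: String, weight: Double): Unit = synchronized {
    sum(feature) += weight
    count(feature) += 1
  }

  // A worker pulls the averaged (mixed) weight and blends it into its model.
  def pull(feature: String): Option[Double] = synchronized {
    val n = count(feature)
    if (n == 0) None else Some(sum(feature) / n)
  }
}

object MixSketchDemo extends App {
  val mix = new MixServerSketch
  // Two workers report different local weights for the same feature.
  mix.push("age", 0.8)
  mix.push("age", 0.4)
  // Each worker would blend the mixed value (0.6) back into its local weight.
  println(s"mixed weight for 'age': ${mix.pull("age")}")
}
```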

tqchen commented Jan 15, 2016

I see, so this is more like a server module that lets you maintain consistent state. Sounds interesting; I like this kind of improvement to data-processing systems, instead of sticking to what was already there, which might not be the best fit for machine learning.

tqchen commented Jan 15, 2016

To answer your question: xgboost parallelizes tree construction. We do not yet have a detailed description of the algorithm; however, you can view it as an AllReduce-style statistics aggregation. It is built on the API of https://github.com/dmlc/rabit
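
To make the idea concrete, here is a minimal, self-contained Scala sketch of AllReduce-style statistics aggregation for a single split decision. It does not use the actual rabit API; the worker data, bucket layout, and the element-wise sum standing in for rabit's Allreduce are illustrative assumptions of mine.

```scala
object AllReduceSketch {
  // Per-bucket gradient/hessian sums computed locally by one worker.
  final case class SplitStats(grad: Array[Double], hess: Array[Double])

  // Stand-in for rabit's Allreduce with a sum operator:
  // element-wise sum of every worker's local statistics.
  def allReduceSum(workerStats: Seq[SplitStats]): SplitStats = {
    val buckets = workerStats.head.grad.length
    val grad = Array.fill(buckets)(0.0)
    val hess = Array.fill(buckets)(0.0)
    for (s <- workerStats; b <- 0 until buckets) {
      grad(b) += s.grad(b)
      hess(b) += s.hess(b)
    }
    SplitStats(grad, hess)
  }

  def main(args: Array[String]): Unit = {
    // Two workers, each holding local statistics for 3 candidate split buckets.
    val worker1 = SplitStats(Array(1.0, 2.0, 0.5), Array(1.0, 1.5, 0.5))
    val worker2 = SplitStats(Array(0.5, 1.0, 2.5), Array(0.5, 1.0, 2.0))

    // After the allreduce every worker holds identical global statistics,
    // so every worker deterministically picks the same best split.
    val global = allReduceSum(Seq(worker1, worker2))
    val bestBucket = global.grad.indices.maxBy { b =>
      (global.grad(b) * global.grad(b)) / (global.hess(b) + 1.0) // lambda = 1.0
    }
    println(s"global grad sums: ${global.grad.mkString(", ")}; best split bucket: $bestBucket")
  }
}
```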

I am more interested in deeper integration of xgboost with other systems, as this is its main design goal.

It takes two things to port a dmlc program to an existing system.

  • A stage-startup protocol that starts the processes as requested (the requested number of workers for AllReduce-based jobs, plus some server processes for parameter-server jobs), sets the environment variables correctly, and restarts the same program elsewhere when a job dies before finishing (a rough launcher sketch follows this list).
  • The ability to start a tracker program somewhere (in your case it could be in a local startup script or the MIX server) to connect the workers.
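
Here is that launcher sketch in Scala: it sets the DMLC environment variables and spawns worker processes that connect to an already-running tracker. The variable names (DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_NUM_WORKER, DMLC_TASK_ID) follow common dmlc-core/rabit conventions, and the worker binary name is hypothetical; treat the details as assumptions, not a definitive recipe.

```scala
import scala.sys.process._

// Hypothetical launcher: starts `numWorkers` copies of a dmlc/rabit worker
// program, pointing each one at an already-running tracker.
object DmlcLauncherSketch {
  def launchWorkers(trackerHost: String, trackerPort: Int,
                    numWorkers: Int, workerCmd: Seq[String]): Seq[Process] = {
    (0 until numWorkers).map { taskId =>
      // Environment variables the rabit worker reads at startup (assumed names).
      val env = Seq(
        "DMLC_TRACKER_URI"  -> trackerHost,
        "DMLC_TRACKER_PORT" -> trackerPort.toString,
        "DMLC_NUM_WORKER"   -> numWorkers.toString,
        "DMLC_TASK_ID"      -> taskId.toString
      )
      // If a worker dies before finishing, the surrounding system would be
      // responsible for re-running the same command somewhere else.
      Process(workerCmd, None, env: _*).run()
    }
  }

  def main(args: Array[String]): Unit = {
    // Example: launch 4 workers of a hypothetical ./xgboost_worker binary.
    val procs = launchWorkers("tracker.example.com", 9091, 4, Seq("./xgboost_worker"))
    procs.foreach(p => println(s"worker exited with code ${p.exitValue()}"))
  }
}
```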

Personally, I believe that making xgboost talk to other systems, rather than restricting it to certain platforms, is far more interesting and helpful for our users.

tqchen commented Mar 6, 2016

I would like to follow up on this. Recently, a deeper integration of xgboost with the JVM stack has been underway, so it is now quite easy to integrate the existing abstractions into distributed computing systems.

See xgboost4j-flink and xgboost4j-spark for how this can be done in a few lines of code.
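
To give a sense of what "a few lines of code" looks like, here is a rough Scala sketch in the style of the early xgboost4j-spark API. The package path, the exact XGBoost.train signature, and the file paths are recalled from that era and should be treated as assumptions; check the xgboost4j-spark examples for the authoritative version.

```scala
// Sketch only: imports and signatures approximate the early xgboost4j-spark API.
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object XGBoostOnSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("xgboost4j-spark-sketch"))

    // Training data as an RDD of labeled points (path is a placeholder).
    val trainRDD: RDD[LabeledPoint] =
      MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/train.libsvm")

    // Booster parameters passed straight through to native xgboost.
    val params: Map[String, Any] = Map(
      "eta" -> 0.1,
      "max_depth" -> 6,
      "objective" -> "binary:logistic"
    )

    // Distributed training: each of the 4 worker tasks trains on its data
    // partition and synchronizes statistics via rabit for 100 rounds,
    // yielding one identical model on every worker.
    val model = XGBoost.train(trainRDD, params, 100, 4)
    println(s"trained model: $model")

    sc.stop()
  }
}
```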

myui (Owner, Author) commented Mar 7, 2016

@tqchen Thank you for the pointer. It definitely helps with implementing XGBoost on Hive.

BTW, I found a pure Scala port of XGBoost for Apache Spark at:
https://github.com/rotationsymmetry/sparkxgboost

FYI

tqchen commented Mar 7, 2016

Thanks for the pointer. Since the xgboost JVM package still uses the native library in the background, it enjoys all the optimizations in xgboost, which means all of the library's features as well as faster speed and more efficient memory usage. Our goal is to make the most optimized library available on all platforms.

We have benchmarked distributed xgboost against other systems and will publish a paper about the results in the near future.

maropu (Contributor) commented Mar 7, 2016

@tqchen Great work :) One question: xgboost-spark distributes the training data as an RDD and builds numWorkers models (boosters). It seems to me that only one of those models is then used for subsequent predictions. Is that correct? Please correct me if I am mistaken. Thanks!

tqchen commented Mar 7, 2016

@maropu Your observation is correct. However, internally xgboost uses https://github.com/dmlc/rabit to communicate between workers, and this is embedded into the distributed training.

So the workers coordinate with each other in each iteration, and in the end they all hold an identical booster (trained on the entire dataset).

tqchen commented Mar 7, 2016

I am also interested in hearing your ideas on other integration alternatives, as it seems to me that Hivemall also requires starting additional jobs for the MIX server.

maropu (Contributor) commented Mar 8, 2016

@tqchen Ah, I see. Off topic, but I think you could use org.apache.spark.util.random.SamplingUtils to randomly select a subset of the training data.
You collect that subset on the driver, then build a model there by using rabit.
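
A quick sketch of that idea, using the public RDD.sample/collect API rather than the SamplingUtils helper; the fraction and seed are placeholders of mine, not values from this thread.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object DriverSideSamplingSketch {
  // Randomly select roughly `fraction` of the training data and bring it to
  // the driver, where a single-machine model could then be trained on it.
  def sampleToDriver(data: RDD[LabeledPoint],
                     fraction: Double = 0.01,
                     seed: Long = 42L): Array[LabeledPoint] = {
    data.sample(withReplacement = false, fraction, seed).collect()
  }
}
```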

tqchen commented Mar 8, 2016

Rabit uses the computation resources of all the workers and scales up. So if we only build the model on the driver, we are limited to the computation resources of a single machine.

maropu (Contributor) commented Mar 8, 2016

Indeed, it is inherently difficult to parallelize model building in Spark.
What I meant was to prevent the executors other than the first one from doing unnecessary work.

tqchen commented Mar 8, 2016

In xgboost, all the executors need to collaborate by communicating with each other via rabit. Each executor takes a part of the data, trains the model, and synchronizes the statistics during training. So we do need all the executors to collaborate and to parallelize the model building, so that it scales up.

All the executors get the same model as a result of the collaboration :)

maropu (Contributor) commented Mar 8, 2016

Ah, I see. You mean that rabit internally parallelizes the model building in an AllReduce style across all the Spark executors?

tqchen commented Mar 8, 2016

Yes

maropu (Contributor) commented Mar 9, 2016

Ah, great ;) I'll look into the rabit code.

myui (Owner, Author) commented Sep 6, 2016

Merged in #281

myui closed this as completed Sep 6, 2016