This repository has been archived by the owner on Oct 8, 2019. It is now read-only.

Integrate XGBoost into Hivemall #251

Closed · myui opened this issue Jan 6, 2016 · 19 comments

myui (Owner) commented Jan 6, 2016

https://github.com/dmlc/xgboost

myui self-assigned this Jan 6, 2016
myui added this to the v0.4 milestone Jan 6, 2016
tqchen commented Jan 14, 2016

Good to see this issue here. I have a question about the Hivemall project:
how are distributed algorithms implemented (specifically, synchronization)? Are they implemented as iterative MapReduce workflows in Hive?

XGBoost already comes with YARN integration, and the project itself is designed to be portable, as a building block, to any distributed platform. See also the most recent refactor, https://github.com/dmlc/xgboost/tree/brick, which is a cleaner version that allows customizable data-source plugins (possibly via JNI callbacks).

myui (Owner, Author) commented Jan 15, 2016

@tqchen thank you for introducing the next-generation xgboost.

Do you have any design document explaining how Gradient Tree Boosting is parallelized in xgboost+wormhole? Gradient Boosting itself is a sequential algorithm, so I guess the tree/subtree construction in each iteration is parallelized using MPI or AllReduce. I might use it from Hive by simply launching a YARN application.

Hivemall has a custom protocol for parameter mixing among servers, called the MIX server:
https://github.com/myui/hivemall/wiki/How-to-use-Model-Mixing

It is similar to a parameter server, but it is neither a KVS-based parameter server nor a BSP/SSP/AllReduce protocol; the MIX server is half parameter-mixing server and half parameter-mixing protocol. It can handle coefficients as well as feature weights, but it is not for tree models.

A standalone version is already available and a YARN version is under development:
#246
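
For readers unfamiliar with the idea, here is a tiny conceptual Scala sketch of asynchronous weight mixing: workers push accumulated per-feature weights to a mix server, which keeps running averages that workers fold back into their local models. This is only an illustration of the general idea under my own simplifying assumptions; it is not Hivemall's actual MIX protocol or API (see the wiki page above for that).

```scala
import scala.collection.mutable

// Conceptual sketch of parameter mixing (not Hivemall's actual MIX protocol):
// the server keeps a running average of each feature's weight across workers.
final class MixServerSketch {
  private val sum   = mutable.Map.empty[String, Double].withDefaultValue(0.0)
  private val count = mutable.Map.empty[String, Long].withDefaultValue(0L)

  // A worker pushes its current local weight for a feature.
  def push(feature: String, weight: Double): Unit = synchronized {
    sum(feature) += weight
    count(feature) += 1
  }

  // A worker pulls the averaged (mixed) weight and blends it into its model.
  def pull(feature: String): Option[Double] = synchronized {
    val n = count(feature)
    if (n == 0) None else Some(sum(feature) / n)
  }
}

object MixSketchDemo extends App {
  val mix = new MixServerSketch
  // Two workers report different local weights for the same feature.
  mix.push("age", 0.8)
  mix.push("age", 0.4)
  // Each worker would blend the mixed value (0.6) back into its local weight.
  println(s"mixed weight for 'age': ${mix.pull("age")}")
}
```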

tqchen commented Jan 15, 2016

I see, so this is more like a server module that lets you maintain consistent state. Sounds interesting; I like this kind of improvement to data-processing systems, instead of sticking to what was already there, which might not be the best fit for machine learning.

tqchen commented Jan 15, 2016

To answer your question: xgboost parallelizes tree construction. We do not yet have a detailed description of the algorithm; however, you can view it as an AllReduce-style statistics aggregation. It is built on the API of https://github.com/dmlc/rabit
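
To make the idea concrete, here is a minimal, self-contained Scala sketch of AllReduce-style statistics aggregation for a single split decision. It does not use the actual rabit API; the worker data, bucket layout, and the element-wise sum standing in for rabit's Allreduce are illustrative assumptions of mine.

```scala
object AllReduceSketch {
  // Per-bucket gradient/hessian sums computed locally by one worker.
  final case class SplitStats(grad: Array[Double], hess: Array[Double])

  // Stand-in for rabit's Allreduce with a sum operator:
  // element-wise sum of every worker's local statistics.
  def allReduceSum(workerStats: Seq[SplitStats]): SplitStats = {
    val buckets = workerStats.head.grad.length
    val grad = Array.fill(buckets)(0.0)
    val hess = Array.fill(buckets)(0.0)
    for (s <- workerStats; b <- 0 until buckets) {
      grad(b) += s.grad(b)
      hess(b) += s.hess(b)
    }
    SplitStats(grad, hess)
  }

  def main(args: Array[String]): Unit = {
    // Two workers, each holding local statistics for 3 candidate split buckets.
    val worker1 = SplitStats(Array(1.0, 2.0, 0.5), Array(1.0, 1.5, 0.5))
    val worker2 = SplitStats(Array(0.5, 1.0, 2.5), Array(0.5, 1.0, 2.0))

    // After the allreduce every worker holds identical global statistics,
    // so every worker deterministically picks the same best split.
    val global = allReduceSum(Seq(worker1, worker2))
    val bestBucket = global.grad.indices.maxBy { b =>
      (global.grad(b) * global.grad(b)) / (global.hess(b) + 1.0) // lambda = 1.0
    }
    println(s"global grad sums: ${global.grad.mkString(", ")}; best split bucket: $bestBucket")
  }
}
```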

I am more interested in deeper integration of xgboost with other systems, as this is its main design goal.

It takes two things to port a dmlc program to an existing system.

  • A stage-startup protocol that starts the processes as requested (the requested number of workers for AllReduce-based jobs, plus some server processes for parameter-server jobs), sets the environment variables correctly, and restarts the same program elsewhere when a job dies before finishing (a rough launcher sketch follows this list).
  • The ability to start a tracker program somewhere (in your case it could be in a local startup script or the MIX server) to connect the workers.
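
Here is that launcher sketch in Scala: it sets the DMLC environment variables and spawns worker processes that connect to an already-running tracker. The variable names (DMLC_TRACKER_URI, DMLC_TRACKER_PORT, DMLC_NUM_WORKER, DMLC_TASK_ID) follow common dmlc-core/rabit conventions, and the worker binary name is hypothetical; treat the details as assumptions, not a definitive recipe.

```scala
import scala.sys.process._

// Hypothetical launcher: starts `numWorkers` copies of a dmlc/rabit worker
// program, pointing each one at an already-running tracker.
object DmlcLauncherSketch {
  def launchWorkers(trackerHost: String, trackerPort: Int,
                    numWorkers: Int, workerCmd: Seq[String]): Seq[Process] = {
    (0 until numWorkers).map { taskId =>
      // Environment variables the rabit worker reads at startup (assumed names).
      val env = Seq(
        "DMLC_TRACKER_URI"  -> trackerHost,
        "DMLC_TRACKER_PORT" -> trackerPort.toString,
        "DMLC_NUM_WORKER"   -> numWorkers.toString,
        "DMLC_TASK_ID"      -> taskId.toString
      )
      // If a worker dies before finishing, the surrounding system would be
      // responsible for re-running the same command somewhere else.
      Process(workerCmd, None, env: _*).run()
    }
  }

  def main(args: Array[String]): Unit = {
    // Example: launch 4 workers of a hypothetical ./xgboost_worker binary.
    val procs = launchWorkers("tracker.example.com", 9091, 4, Seq("./xgboost_worker"))
    procs.foreach(p => println(s"worker exited with code ${p.exitValue()}"))
  }
}
```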

Personally, I believe that making xgboost talk to other systems, rather than restricting it to certain platforms, is far more interesting and helpful for our users.

tqchen commented Mar 6, 2016

I would like to follow up on this. Recently, a deeper integration of xgboost with the JVM stack has been underway, so it is now quite easy to integrate the existing abstractions into distributed computing systems.

See xgboost4j-flink and xgboost4j-spark for how this can be done in a few lines of code.
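
To give a sense of what "a few lines of code" looks like, here is a rough Scala sketch in the style of the early xgboost4j-spark API. The package path, the exact XGBoost.train signature, and the file paths are recalled from that era and should be treated as assumptions; check the xgboost4j-spark examples for the authoritative version.

```scala
// Sketch only: imports and signatures approximate the early xgboost4j-spark API.
import ml.dmlc.xgboost4j.scala.spark.XGBoost
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object XGBoostOnSparkSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("xgboost4j-spark-sketch"))

    // Training data as an RDD of labeled points (path is a placeholder).
    val trainRDD: RDD[LabeledPoint] =
      MLUtils.loadLibSVMFile(sc, "hdfs:///path/to/train.libsvm")

    // Booster parameters passed straight through to native xgboost.
    val params: Map[String, Any] = Map(
      "eta" -> 0.1,
      "max_depth" -> 6,
      "objective" -> "binary:logistic"
    )

    // Distributed training: each of the 4 worker tasks trains on its data
    // partition and synchronizes statistics via rabit for 100 rounds,
    // yielding one identical model on every worker.
    val model = XGBoost.train(trainRDD, params, 100, 4)
    println(s"trained model: $model")

    sc.stop()
  }
}
```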

myui (Owner, Author) commented Mar 7, 2016

@tqchen Thank you for the pointer. It definitely helps with implementing XGBoost on Hive.

BTW, I found a pure Scala port of XGBoost for Apache Spark at:
https://github.com/rotationsymmetry/sparkxgboost

FYI

tqchen commented Mar 7, 2016

Thanks for the pointer. Since the xgboost JVM package still uses the native library in the background, it enjoys all the optimizations in xgboost, which means all of the library's features as well as faster speed and more efficient memory usage. Our goal is to make the most optimized library available on all platforms.

We have benchmarked distributed xgboost against other systems and will publish a paper about the results in the near future.

maropu (Contributor) commented Mar 7, 2016

@tqchen Great work :) One question: xgboost-spark distributes the training data as an RDD and builds numWorkers models (boosters). It seems to me that only one of those models is then used for subsequent predictions. Is that correct? Please correct me if I am mistaken. Thanks!

tqchen commented Mar 7, 2016

@maropu Your observation is correct. However, internally xgboost uses https://github.com/dmlc/rabit to communicate between workers, and this is embedded into the distributed training.

So the workers coordinate with each other in each iteration, and in the end they all hold an identical booster (trained on the entire dataset).

tqchen commented Mar 7, 2016

I am also interested in hearing your ideas on other integration alternatives, as it seems to me that Hivemall also requires starting additional jobs for the MIX server.

maropu (Contributor) commented Mar 8, 2016

@tqchen Ah, I see. Off topic, but I think you could use org.apache.spark.util.random.SamplingUtils to randomly select a subset of the training data.
You collect that subset on the driver, then build a model there by using rabit.
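
A quick sketch of that idea, using the public RDD.sample/collect API rather than the SamplingUtils helper; the fraction and seed are placeholders of mine, not values from this thread.

```scala
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

object DriverSideSamplingSketch {
  // Randomly select roughly `fraction` of the training data and bring it to
  // the driver, where a single-machine model could then be trained on it.
  def sampleToDriver(data: RDD[LabeledPoint],
                     fraction: Double = 0.01,
                     seed: Long = 42L): Array[LabeledPoint] = {
    data.sample(withReplacement = false, fraction, seed).collect()
  }
}
```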

tqchen commented Mar 8, 2016

Rabit uses the computation resources of all the workers and scales up. So if we only build the model on the driver, we are limited to the computation resources of a single machine.

maropu (Contributor) commented Mar 8, 2016

Indeed, it is inherently difficult to parallelize model building in Spark.
What I meant was to prevent the executors other than the first one from doing unnecessary work.

tqchen commented Mar 8, 2016

In xgboost, all the executors need to collaborate by communicating with each other via rabit. Each executor takes a part of the data, trains the model, and synchronizes the statistics during training. So we do need all the executors to collaborate and to parallelize the model building, so that it scales up.

All the executors get the same model as a result of the collaboration :)

maropu (Contributor) commented Mar 8, 2016

Ah, I see. You mean that rabit internally parallelizes the model building in an AllReduce style across all the Spark executors?

tqchen commented Mar 8, 2016

Yes

maropu (Contributor) commented Mar 9, 2016

Ah, great ;) I'll look into the rabit code.

myui (Owner, Author) commented Sep 6, 2016

Merged in #281

myui closed this as completed Sep 6, 2016