compared with xgboost new histogram based algorithm #211
Comments
xgboost's approx / hist methods currently scale very poorly with multithreading. This is one of my test benchmarks using Bosch (I can test other sets); I can't even keep CPU usage above 60% (the sparse 1M x 1K matrix is too small in this case). Worse is the approx method, which can't even use 25%. Did you check the CPU usage while running with the approx and hist methods? For single-threading, I found xgboost (fast histogram method) to be faster than LightGBM. But for multithreading, LightGBM always wins, as xgboost doesn't scale linearly with histograms.
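For reference, a minimal sketch of this kind of thread-scaling check using xgboost's Python API with `tree_method='hist'`; the synthetic data and parameter values are placeholders, not the actual Bosch setup discussed here:

```python
import time
import numpy as np
import xgboost as xgb

# Synthetic stand-in for the real dataset (placeholder, not Bosch).
X = np.random.rand(100000, 50)
y = (np.random.rand(100000) > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

params = {
    'objective': 'binary:logistic',
    'tree_method': 'hist',   # fast histogram-based method
    'max_depth': 6,
    'eta': 0.1,
}

# Time the same training run with different thread counts to see how it scales.
for nthread in (1, 6, 12):
    params['nthread'] = nthread
    start = time.time()
    xgb.train(params, dtrain, num_boost_round=50)
    print('nthread=%d: %.1fs' % (nthread, time.time() - start))
```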
@Laurae2 can you give more details about your test dataset? e.g. #data, #features, sparse / one-hot coding, and so on.
I think it is https://www.kaggle.com/c/bosch-production-line-performance/data? @guolinke did you try
@guolinke test data: train_numeric.csv
I can't seem to find a way to download the Yahoo set (email wall) and the MS 30K set (email wall).
@Laurae2 Higgs: LightGBM: 306s (lightgbm_higgs_speed.log.txt). I think your dataset is sparse, so maybe the slowdown is caused by sparse data. I will investigate it.
@wxchan try depth-wise? why?
@guolinke xgboost reaches only 30% CPU usage on Higgs. I'll run 12 threads, 6 threads, and 1 thread to compare all of this on Higgs. It seems to run much faster than your benchmark (for 12 threads; other results coming soon). Here is a sample for Higgs: Total time: Which CPU do you use? (I use an i7-3930K in my case.) This is yours (did I make a mistake in the file name?):
2 * E5-2680 v2, DDR3 1600 MHz, 256 GB (memory speed will affect training speed as well). You can also run LightGBM with the same settings and see what happens.
@guolinke I also use DDR3 1600 MHz (64GB in my case). My benchmarks on Higgs:
Something I don't understand is this when I use xgboost:
While you have:
I'm using Python 3.5, so the original Python script to create the libsvm files does not work. Instead I'm using this:

```python
import os

input_filename = "HIGGS.csv"
output_train = "higgs.train"
output_test = "higgs.test"
num_train = 10500000
read_num = 0

input = open(input_filename, "r")
train = open(output_train, "w")
test = open(output_test, "w")

# Convert one CSV row to libsvm format: "<label> 0:<v0> 1:<v1> ...".
def WriteOneLine(tokens, output):
    label = int(float(tokens[0]))
    output.write(str(label))
    for i in range(1, len(tokens)):
        feature_value = float(tokens[i])
        output.write(' ' + str(i - 1) + ':' + str(feature_value))
    output.write('\n')

# The first num_train rows go to the training file, the rest to the test file.
line = input.readline()
while line:
    tokens = line.split(',')
    if read_num < num_train:
        WriteOneLine(tokens, train)
    else:
        WriteOneLine(tokens, test)
    read_num += 1
    if read_num % 1000 == 0:
        print(read_num)
    line = input.readline()

input.close()
train.close()
test.close()
```

It does go through the 11M lines:

```
$ wc -l HIGGS.csv
11000000 HIGGS.csv
$ wc -l HIGGS.train
10500000 HIGGS.train
```

Is this normal behavior? My higgs.train is 6,082,744,083 bytes, HIGGS.csv is 8,035,497,980 bytes. I downloaded and created the libsvm files 3 times to triple check, same result. higgs.train SHA-256: First line of my HIGGS.train and HIGGS.csv:
@Laurae2 what is the data information output by your LightGBM? https://github.com/guolinke/boosting_tree_benchmarks/blob/master/lightgbm/lightgbm_higgs_speed.log#L4
@guolinke I have the same exact line as yours. My params are also identical to yours, except I changed:
Default bagging in xgboost is 1.00 (use all data). I compiled xgboost from source. When I run your sh code as is (after removing all the things I can't run), I get the same issue. Lines 3086061 to 3086063 of my higgs.train do not seem malformed; I don't get why xgboost does not want to go any further, it's really strange:
I created the matrix in xgboost's binary format using R. All my attempts using the libsvm format ended with nearly the same issue (only the row count at which xgboost stops changes; I tried Python 2.7 and 3.5...). Now xgboost works properly and matches your runs for AUC. I still have to test for speed. (A Python sketch of an equivalent binary-matrix workaround follows after the results below.) My new results. Setup:
Speed:
AUC:
Running depthwise soon.
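As an aside, a hedged sketch of an equivalent binary-matrix workaround using xgboost's Python API rather than R (file names and the row split are placeholders taken from the conversion script above; this is not the R code actually used):

```python
import numpy as np
import xgboost as xgb

# Build the DMatrix from the raw CSV instead of going through a hand-written
# libsvm file, then save xgboost's binary format for fast, parse-free reloading.
data = np.loadtxt('HIGGS.csv', delimiter=',')   # label in column 0
dtrain = xgb.DMatrix(data[:10500000, 1:], label=data[:10500000, 0])
dtrain.save_binary('higgs.train.buffer')

# Later runs can load the binary buffer directly:
dtrain = xgb.DMatrix('higgs.train.buffer')
```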
@guolinke xgboost depthwise is faster, so we can compare the best performance of xgboost to LightGBM. Anyway, is there any conclusion here? According to the newly updated graphs in dmlc/xgboost#1950, xgboost has really good performance on the Allstate dataset.
@wxchan xgboost's algorithm is better for sparse data, and LightGBM is better for dense data. I will try to reduce the time cost for sparse features in LightGBM as well.
@guolinke Will test on the Bosch dataset. @wxchan I got the reverse result for depthwise (I could test on Bosch too if needed). See the table below on Higgs: Speed:
AUC:
@Laurae2 your result is actually the same. I read your log: for 100 iterations, depthwise: 227.867, lossguide: 278.813. As that thread said, it was tested with only the first dozen or so iterations. (dmlc/xgboost#1950 (comment))
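For readers unfamiliar with the two growth modes being compared, a hedged sketch of the corresponding xgboost parameter sets (values are illustrative, not the exact benchmark settings):

```python
# Depth-wise growth (xgboost's default policy): expand all nodes at the current depth.
params_depthwise = {
    'tree_method': 'hist',
    'grow_policy': 'depthwise',
    'max_depth': 6,
}

# Loss-guide growth (LightGBM-style): always split the leaf with the largest loss reduction.
params_lossguide = {
    'tree_method': 'hist',
    'grow_policy': 'lossguide',
    'max_depth': 0,       # 0 = no depth limit in this mode
    'max_leaves': 255,
}
```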
Just chiming in to note that, although comparing performance with XGBoost set at
@Allardvm The most time-consuming part of the histogram algorithm is building the histograms.
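For context, a rough sketch of what that histogram-building step looks like (simplified Python, not the actual LightGBM or xgboost implementation; bin indices and gradients are assumed to be precomputed):

```python
import numpy as np

def build_histogram(bin_index, gradients, hessians, num_bins):
    """Accumulate per-bin gradient/hessian sums for one feature.

    bin_index holds the pre-binned feature value (one bin id per row).
    This loop runs once per feature per tree node over all rows in that
    node, which is why it dominates training time.
    """
    grad_hist = np.zeros(num_bins)
    hess_hist = np.zeros(num_bins)
    count = np.zeros(num_bins, dtype=np.int64)
    for i in range(len(bin_index)):
        b = bin_index[i]
        grad_hist[b] += gradients[i]
        hess_hist[b] += hessians[i]
        count[b] += 1
    return grad_hist, hess_hist, count
```

Split finding then only scans the `num_bins` entries of each histogram, which is why accumulating these sums over all rows dominates the runtime.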
@Allardvm for a feature, xgboost only tests the NA values aggregated together against the lowest and highest values (i.e., it tries sending all missing values to the left or to the right). It's negligible, even with 99% sparsity.
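A hedged sketch of that idea (simplified, not xgboost's actual code): the gradient/hessian statistics of the missing values are aggregated once, then each candidate split is evaluated with those aggregates added to the left or to the right side, and the better gain is kept:

```python
def split_gain(g_left, h_left, g_total, h_total, reg_lambda=1.0):
    """Standard split gain with (g_left, h_left) on the left and the rest on the right."""
    g_right, h_right = g_total - g_left, h_total - h_left
    return (g_left * g_left / (h_left + reg_lambda)
            + g_right * g_right / (h_right + reg_lambda)
            - g_total * g_total / (h_total + reg_lambda))

def best_gain_with_missing(g_left, h_left, g_miss, h_miss, g_total, h_total):
    """Try sending all missing values left, then right; keep the better gain."""
    gain_miss_left = split_gain(g_left + g_miss, h_left + h_miss, g_total, h_total)
    gain_miss_right = split_gain(g_left, h_left, g_total, h_total)
    return max(gain_miss_left, gain_miss_right)
```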
Closing now. Will give a new comparison based on LightGBM v2.
This issue has been automatically locked since there has not been any recent activity since it was closed. To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues including a reference to this.
xgboost just adopted the histogram-based idea from LightGBM (dmlc/xgboost#1950).
It is much faster now, and I ran a new comparison experiment: https://github.com/guolinke/boosting_tree_benchmarks
Environment
CPU: E5-2670 v3 * 2
Memory: 256GB DDR4 2133 MHz
Speed
The gap is much smaller, and LightGBM is now about 1x faster (about 2x in total).
Accuracy
Higgs's AUC:
NDCG at Yahoo LTR:
NDCG at MS LTR: