XGBoost error on multinode with AutoML #8840

Closed
exalate-issue-sync bot opened this issue May 12, 2023 · 4 comments

@exalate-issue-sync

Reproduced on a 4-node cluster in AWS.

{code:java}
import h2o
from h2o.automl import H2OAutoML

# Connect to an H2O cluster (assumes one is reachable at the default address)
h2o.init()

train = h2o.import_file("http://s3.amazonaws.com/h2o-public-test-data/smalldata/diabetes/diabetes_text_train.csv")

y = "diabetesMed"
aml = H2OAutoML(
    exclude_algos=["GLM", "DeepLearning"],
    max_models=100,
    max_runtime_secs_per_model=120,
    keep_cross_validation_models=False,
    keep_cross_validation_predictions=False,
    seed=1,
)
aml.train(y=y, training_frame=train)
{code}

{code:java}
08-15 09:46:46.914 10.0.0.121:54321 24460 FJ-3-25 INFO: Scoring History:
08-15 09:46:46.914 10.0.0.121:54321 24460 FJ-3-25 INFO: Timestamp Duration Number of Trees Training RMSE Training LogLoss Training AUC Training pr_auc Training Lift Training Classification Error Validation RMSE Validation LogLoss Validation AUC Validation pr_auc Validation Lift Validation Classification Error
08-15 09:46:46.914 10.0.0.121:54321 24460 FJ-3-25 INFO: 2019-08-15 09:46:46 0.064 sec 0 0.50000 0.69315 0.50000 0.00000 1.00000 0.25370 0.50000 0.69315 0.50000 0.00000 1.00000 0.25007
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: XGBoost training iteration failed
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: ml.dmlc.xgboost4j.java.XGBoostError: [09:46:47] /dot/include/xgboost/./tree_model.h:290: Check failed: static_cast(deleted_nodes_.size()) == param.num_deleted (0 vs. 4)
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR:
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: Stack trace returned 8 entries:
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (0) /tmp/libxgboost4j_minimal2120870706660257861.so(dmlc::StackTrace(unsigned long)+0x1aa) [0x7fc2ccb4cb6a]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (1) /tmp/libxgboost4j_minimal2120870706660257861.so(+0xf9419) [0x7fc2ccba8419]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (2) /tmp/libxgboost4j_minimal2120870706660257861.so(xgboost::tree::TreeSyncher::Update(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, std::vector<xgboost::RegTree*, std::allocatorxgboost::RegTree* > const&)+0x27a) [0x7fc2ccba955a]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (3) /tmp/libxgboost4j_minimal2120870706660257861.so(xgboost::gbm::GBTree::BoostNewTrees(xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::DMatrix, int, std::vector<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree >, std::allocator<std::unique_ptr<xgboost::RegTree, std::default_deletexgboost::RegTree > > >)+0x49e) [0x7fc2ccce813e]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (4) /tmp/libxgboost4j_minimal2120870706660257861.so(xgboost::gbm::GBTree::DoBoost(xgboost::DMatrix, xgboost::HostDeviceVector<xgboost::detail::GradientPairInternal >, xgboost::ObjFunction)+0x9a9) [0x7fc2ccce93f9]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (5) /tmp/libxgboost4j_minimal2120870706660257861.so(xgboost::LearnerImpl::UpdateOneIter(int, xgboost::DMatrix*)+0x421) [0x7fc2ccc8b3b1]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (6) /tmp/libxgboost4j_minimal2120870706660257861.so(XGBoosterUpdateOneIter+0x35) [0x7fc2ccc24945]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: [bt] (7) [0x7fc2d95e6d3d]
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR:
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR:
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: at ml.dmlc.xgboost4j.java.XGBoostJNI.checkCall(XGBoostJNI.java:48)
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: at ml.dmlc.xgboost4j.java.Booster.update(Booster.java:177)
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: at ml.dmlc.xgboost4j.java.XGBoostUpdater$UpdateBooster.call(XGBoostUpdater.java:126)
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: at ml.dmlc.xgboost4j.java.XGBoostUpdater$UpdateBooster.call(XGBoostUpdater.java:104)
08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: at ml.dmlc.xgboost4j.java.XGBoostUpdater.run(XGBoostUpdater.java:49)
AssertError:Allreduce: boundary error
{code}
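When triaging a log like the one above, it can help to pull out only the `ERRR` entries together with the node and thread that emitted them. The following is a minimal sketch, assuming every entry follows the `date time host:port pid thread LEVEL: message` layout seen in this log; the names `LOG_LINE` and `error_lines` are hypothetical helpers and not part of H2O.

```python
import re

# Layout assumed from the log above:
#   MM-DD HH:MM:SS.mmm host:port pid thread LEVEL: message
LOG_LINE = re.compile(
    r"^(?P<date>\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}\.\d{3}) "
    r"(?P<node>\S+) (?P<pid>\d+) (?P<thread>\S+) "
    r"(?P<level>[A-Z]+): ?(?P<message>.*)$"
)


def error_lines(lines):
    """Yield (node, thread, message) for every ERRR entry.

    Lines that do not match the assumed layout (for example the bare
    'AssertError:Allreduce: boundary error' line) are skipped.
    """
    for line in lines:
        match = LOG_LINE.match(line.rstrip("\n"))
        if match and match.group("level") == "ERRR":
            yield match.group("node"), match.group("thread"), match.group("message")


if __name__ == "__main__":
    sample = [
        "08-15 09:46:46.914 10.0.0.121:54321 24460 FJ-3-25 INFO: Scoring History:",
        "08-15 09:46:47.394 10.0.0.121:54321 24460 #l_1_cv_2 ERRR: XGBoost training iteration failed",
        "AssertError:Allreduce: boundary error",
    ]
    for node, thread, message in error_lines(sample):
        print(node, thread, message)
```

Grouping the results by node would show whether the failure is confined to one member of the cluster or appears on several.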

@exalate-issue-sync

Jan Sterba commented: This could be fixed in the latest XGBoost; we should try that.

@exalate-issue-sync

Jan Sterba commented: Upgrading XGBoost fixes this issue.

@exalate-issue-sync

Nidhi Mehta commented: XGBoost version 0.90.

@h2o-ops

h2o-ops commented May 14, 2023

JIRA Issue Migration Info

Jira Issue: PUBDEV-6793
Assignee: Jan Sterba
Reporter: Jan Sterba
State: Closed
Fix Version: 3.28.0.1
Attachments: N/A
Development PRs: N/A
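
Since the fix version is listed as 3.28.0.1, one quick check is whether an installed H2O release is at or past it. The following is a minimal sketch, assuming purely numeric dotted version strings; `parse_version` and `has_fix` are hypothetical helpers, not part of the H2O API.

```python
def parse_version(text):
    """Turn a dotted version such as '3.28.0.1' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def has_fix(installed, fixed_in="3.28.0.1"):
    """Return True if `installed` is at or after the release carrying the fix.

    Tuples compare element by element, so (3, 28, 0, 1) < (3, 30, 0, 1).
    """
    return parse_version(installed) >= parse_version(fixed_in)


if __name__ == "__main__":
    for version in ("3.26.0.2", "3.28.0.1", "3.30.0.1"):
        print(version, has_fix(version))
```

With the `h2o` Python package installed, the installed version string should be available as `h2o.__version__` and could be passed to `has_fix`; development builds with non-numeric suffixes would need extra handling.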
