Ref count mismatch for vec ERROR while training GLM models #2795
Comments
Hi @denisabarar,
Hi @mn-mikke. Thank you for your suggestion. We started experimenting with library version 3.36.1.2-1-3.2. So far, the error has not appeared (in 5 runs). We could not use the latest version, 3.36.1.3-1, because we are on Spark 3.2.1 and got an error message saying that the library is built for Spark 3.3. We will keep you updated.
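For readers hitting the same Spark mismatch: Sparkling Water version strings such as 3.36.1.2-1-3.2 end with the Spark major.minor version the artifact targets. The following is a minimal sketch of a pre-flight check, assuming that layout; the helper names `parse_sparkling_water_version` and `is_built_for_spark` are hypothetical and not part of any H2O API.

```python
import re
from typing import Optional, Tuple

# Assumed layout: <H2O version>-<Sparkling Water patch>[-<Spark major.minor>]
_SW_VERSION = re.compile(r"^(\d+(?:\.\d+){3})-(\d+)(?:-(\d+\.\d+))?$")


def parse_sparkling_water_version(version: str) -> Tuple[str, str, Optional[str]]:
    """Split a version string into (h2o_version, patch, spark_target or None)."""
    match = _SW_VERSION.match(version.strip())
    if match is None:
        raise ValueError(f"Unrecognized Sparkling Water version: {version!r}")
    return match.group(1), match.group(2), match.group(3)


def is_built_for_spark(sw_version: str, spark_version: str) -> bool:
    """Return True only if the Spark suffix matches the cluster's major.minor."""
    _, _, spark_target = parse_sparkling_water_version(sw_version)
    if spark_target is None:
        # No Spark suffix in the string, so compatibility cannot be confirmed.
        return False
    return spark_version.split(".")[:2] == spark_target.split(".")
```

On a cluster, the second argument would typically come from `spark.version`; the check only compares strings and does not contact the H2O cluster.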
Hi again! We upgraded the H2O Sparkling Water library in all of our environments, and now the error appears very frequently in one of them. In some pipeline runs, we also encounter this error when the H2O cluster gets stuck: We noticed that these errors appear more frequently when we restart the pipeline from the beginning rather than from the training step. Could this be related to memory issues? Could you please help us with what we should do next?
Hi @denisabarar,
yes, please
I gathered the stderr logs. I will follow up with the stdout logs.
@denisabarar stdout logs are crucial. They contain information from H2O nodes.
Here are the stdout logs. We cleaned them of confidential data outputs. Thank you!
Hi @denisabarar, Can you share the logic that orchestrates training of multiple GLM models?
Hi @mn-mikke, I hope this is what you're looking for:
Hi @mn-mikke, In case it helps, here are 2 other sets of logs where we ran into this error by running the notebook from the Azure Data Factory pipeline.
Hi @mn-mikke, Can we provide any other information that could help with the investigation? Thank you!
Hi @denisabarar,
Hi @denisabarar
Hi @krasinski, Thank you for looking into this! And the value for spark master
Hi @denisabarar,
Hi @mn-mikke! I created this cluster but still ended up with a similar error. I attached the error message.
Hi @denisabarar,
This error is different.
You are trying to persist the model to the location
Hi @mn-mikke, The issue with the path was also intermittent.
Closing the issue due to long inactivity. Please reopen if still needed.
We are using a Databricks notebook in which we train around 6k GLM models in one step and another 6k GLM models in a second step. The training data contains around 250 variables. We use the models to make predictions in the same notebook. The models are saved in DBFS, which is mapped to Azure Data Lake.
We are facing the following error in the training step:
The error appears at random, and training usually succeeds after the cluster is restarted.
We cannot provide a reproducible example because the error appears at random.
Most of the time it appears in the second training step. We tried calling h2o.removeAll() between the two training steps, but the error still appears sometimes, or a new type of error appears:
The error appears more frequently when running from ADF pipelines, but we have also reproduced it when running directly from Databricks.
The error appears with different frequencies in the 4 environments we are testing on, anywhere from every 10th run to every 3rd run, with no discernible pattern. We tried changing the cluster configuration to use 4 or 8 nodes, with no effect.
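The two-step flow described above (train a batch of models, save them, clean up, then train the next batch) can be sketched roughly as follows. This is only an illustration under stated assumptions, not the notebook's actual code: `train_one`, `save_model`, and `cleanup` are hypothetical callables standing in for the GLM training call, the write to DBFS, and the h2o.removeAll() call. Each model is saved before cleanup because removing all keys from the H2O cluster also discards models that were not persisted.

```python
from typing import Callable, Dict, Iterable, List, TypeVar

M = TypeVar("M")


def train_in_steps(
    steps: Iterable[List[str]],
    train_one: Callable[[str], M],
    save_model: Callable[[str, M], str],
    cleanup: Callable[[], None],
) -> Dict[str, str]:
    """Train every model of a step, persist it, then clean up before the next step.

    All three callables are placeholders (assumptions) for the real
    training, persistence, and cluster-cleanup logic.
    """
    saved_paths: Dict[str, str] = {}
    for model_ids in steps:
        for model_id in model_ids:
            model = train_one(model_id)
            # Persist first: cleanup is assumed to drop in-cluster objects.
            saved_paths[model_id] = save_model(model_id, model)
        # Runs once per step, only after every model of the step is saved.
        cleanup()
    return saved_paths
```

Running cleanup only at step boundaries, after all training calls of that step have returned, keeps it from overlapping with training that is still in progress; whether that is related to the error reported here is not established in this thread.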
Output Log.txt
Any support is welcome. If you need more information, please ask and we will try to provide it.