Ref count mismatch for vec ERROR while training GLM models #2795

Closed
denisabarar opened this issue Jul 5, 2022 · 23 comments

@denisabarar

We are using a Databricks notebook in which we train around 6k GLM models in one step and another 6k GLM models in a second step. The training data contains around 250 variables. We use the models to make predictions in the same notebook; the models are saved in DBFS, which is mapped to Azure Data Lake.

We are facing the following error in the training step:

[screenshot of the error message]
The error appears randomly, and the run usually succeeds after the cluster is restarted.

  • Sparkling Water/PySparkling/RSparkling version: ai.h2o:sparkling-water-package_2.12:3.32.0.4-1-3.0
  • Spark version: 3.0.1
  • Scala version: 2.12
  • Execution mode: Databricks Spark Cluster

We don't have reproducible code because the error appears randomly.

Most of the time it appears in the second model-training step. We tried calling h2o.removeAll() between the two training steps (see the sketch below), but the error still appears sometimes, or a new type of error shows up:
[screenshot of the new error message]

The error appears more frequently when the notebook is run from ADF pipelines, but we have also reproduced it when running directly from Databricks.
The error occurs with different frequencies across the 4 environments we are testing on, anywhere from every 10th run to every 3rd run, with no clear pattern. We tried changing the cluster configuration to use 4 or 8 nodes, with no impact.
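
For reference, the cleanup between the two steps looks roughly like this. This is only a simplified sketch: train_first_batch / train_second_batch stand in for our real training loops, and the import path is illustrative.

    library(h2o)

    # step 1: train the first batch of ~6k GLM models (placeholder for our actual loop)
    step1_models <- train_first_batch(training_frame_1)

    # ... export whatever is needed from step 1 before cleaning up ...

    # drop every key (frames and models) held by the H2O cluster
    h2o.removeAll()

    # the training frame is gone after removeAll(), so it has to be re-imported
    training_frame_2 <- h2o.importFile("/dbfs/mnt/.../step2_training_data.csv")  # illustrative path

    # step 2: train the second batch of ~6k GLM models (placeholder for our actual loop)
    step2_models <- train_second_batch(training_frame_2)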

Output Log.txt

Any kind of support is welcome; if you need more information, please ask and we'll try to provide it.

@mn-mikke
Collaborator

cc @wendycwong @michalkurka

@mn-mikke
Collaborator

Hi @denisabarar,
Does this problem also happen on the latest SW version 3.36.1.3-1? Would it be possible to share a minimal piece of code that suffers from the problem?

@denisabarar
Author

Hi @mn-mikke. Thank you for your suggestion. We started experimenting with version 3.36.1.2-1-3.2 of the library. So far the error has not appeared (in 5 runs). We could not use the latest version 3.36.1.3-1 because we are on Spark 3.2.1 and got an error saying that the library targets Spark 3.3. We will keep you updated.

@denisabarar
Author

Hi again! We upgraded the H2O Sparkling Water library in all of our environments, and now in one environment the error appears very frequently. In some pipeline runs we also encounter this error, where the H2O cluster gets stuck:

[screenshot of the error message]

We noticed that these errors appear more frequently when we restart the pipeline from the beginning rather than from the training step. Could this be related to memory issues?

Could you please help us with what we should do next?

@mn-mikke
Collaborator

mn-mikke commented Sep 2, 2022

Hi @denisabarar,
Could you share logs from the Spark executors for the failed run?

@denisabarar
Author

Is this sufficient, or do you need the stdout and stderr?

[screenshot of the logs gathered so far]

@mn-mikke
Collaborator

mn-mikke commented Sep 6, 2022

do you need the stdout and stderr?

yes, please

@denisabarar
Author

I gathered the stderr logs; I will follow up with the stdout.
error logs.zip

@mn-mikke
Collaborator

@denisabarar stdout logs are crucial. They contain information from H2O nodes.

@denisabarar
Author

Here are the stdout logs; we cleaned them of confidential data. Thank you!
out logs.zip

@mn-mikke
Collaborator

Hi @denisabarar,
It looks strange to us that requests to train a GLM model are arriving at all H2O nodes (Spark executors). In an H2O-3 cluster there is one leader node, and that node is the only one that should receive the requests.

Can you share the logic that orchestrates training of multiple GLM models?
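
For illustration, this is the pattern we would expect: one R client connected to the cluster, looping over the model specifications on the driver, so that every REST request goes only to the node the client is attached to (the leader). This is just a sketch; model_specs and the GLM parameters are placeholders, not your code.

    library(h2o)

    # the client connection (e.g. via rsparkling) points at a single H2O node, the leader;
    # every h2o.glm() call below is a REST request to that one node only
    models <- list()
    for (spec in model_specs) {                      # placeholder list of per-model settings
      models[[spec$name]] <- h2o.glm(
        x              = spec$predictors,            # feature column names
        y              = spec$response,              # response column name
        training_frame = spec$frame,                 # an H2OFrame already in the cluster
        family         = "gaussian"                  # assumed; adjust to your actual family
      )
    }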

@denisabarar
Author

denisabarar commented Sep 15, 2022

Hi @mn-mikke,

I hope this is what you're looking for:
[screenshots of the model-training code]
The parameters of the glm call are udfXvars (the features used for training) and inDat (the training data). There is no configuration on the Spark executors other than what comes with the cluster type.
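
In text form, the call in the screenshots has roughly this shape; only udfXvars and inDat are our real names, the response column and family here are illustrative:

    glm_model <- h2o.glm(
      x              = udfXvars,     # features for training
      y              = "target",     # illustrative response column name
      training_frame = inDat,        # training data (H2OFrame)
      family         = "gaussian"    # illustrative
    )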

@denisabarar
Author

Hi @mn-mikke ,

In case it helps, here are 2 other sets of logs where we ran into this error by running the notebook from the Azure Data Factory pipeline.
logs_19_09_22_PROD.zip
logs_20_09_22_PROD.zip

@denisabarar
Author

Hi @mn-mikke ,

Can we provide other information that could help with the investigation?
Are there any clues about what could be causing it?

Thank you!

@mn-mikke
Collaborator

Hi @denisabarar,
sorry for the delay. I'm currently sick, but @krasinski will take a look.

@krasinski
Member

Hi @denisabarar,
can you please show the value of spark.master?
You can get it more or less like this: spark.conf.get("spark.master")
Also, what is your Databricks cluster setup?
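
If you are connecting through sparklyr/rsparkling, something along these lines should return the real master value (a sketch; sc is assumed to be your sparklyr connection):

    library(sparklyr)

    # sc$master only shows the sparklyr gateway address;
    # ask the underlying SparkContext for the actual master
    invoke(spark_context(sc), "master")

    # or read it from the Spark context configuration
    spark_context_config(sc)[["spark.master"]]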

@denisabarar
Author

Hi @krasinski ,

Thank you for looking into this!

The databricks cluster setup:
[screenshot of the Databricks cluster configuration]

And the value of spark.master:
sc$master
[1] "sparklyr://localhost:65529/65529"
And for conf:
sc$conf
$sparklyr.cancellable
[1] FALSE
$spark.env.SPARK_LOCAL_IP.local
[1] "127.0.0.1"
$sparklyr.connect.csv.embedded
[1] "^1.*"
$spark.sql.legacy.utcTimestampFunc.enabled
[1] TRUE
$sparklyr.connect.cores.local
[1] 20
$spark.sql.shuffle.partitions.local
[1] 20
$sparklyr.shell.name
[1] "sparklyr"
$spark.r.libpaths
[1] "/local_disk0/.ephemeral_nfs/envs/rEnv-4cc5ee7a-3afe-40ed-a52d-e990869bde2f,/databricks/spark/R/lib,/local_disk0/.ephemeral_nfs/cluster_libraries/r,/usr/local/lib/R/site-library,/usr/lib/R/site-library,/usr/lib/R/library"

@mn-mikke
Collaborator

Hi @denisabarar,
Can you try switching the cluster mode from High Concurrency to Standard? I remember we had troubles with High Concurrency mode in the past.

@denisabarar
Author

Hi @mn-mikke! I created this cluster, but we still end up with a similar error; I attached the error message.
[screenshot of the error message]
errorlog_30_9_22.txt

@denisabarar
Author

Hi! Is there any update?
Do you think the failure is related to this error: ERROR sparklyr: Gateway () failed with exception, java.io.EOFException?
[screenshot of the error message]

@mn-mikke
Collaborator

Hi @denisabarar,

I created this cluster, but we still end up with a similar error; I attached the error message.

This error is different.

ERROR: Unexpected HTTP Status code: 400 Bad Request (url = http://localhost:54321/99/Models.mojo/cann_models_PAS000052C0376840003082?dir=dbfs%3A%2Fmnt%2Fadls%2FDemandBrainLake%2FDatabricks%2FQA-US%2FCannibalization_Models%2FPAS000052C0376840003082%2Fcann_models_PAS000052C0376840003082.zip&force=TRUE)

java.lang.IllegalArgumentException
 [1] "java.lang.IllegalArgumentException: Cannot find persist manager for scheme dbfs"                             
 [2] "    water.persist.PersistManager.getPersistForURI(PersistManager.java:760)"                                  
 [3] "    hex.Model.exportMojo(Model.java:2924)"                                                                   
 [4] "    water.api.ModelsHandler.exportMojo(ModelsHandler.java:280)"                                              
 [5] "    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)"                                             
 [6] "    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)"                           
 [7] "    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)"                   
 [8] "    java.lang.reflect.Method.invoke(Method.java:498)"                                                        
 [9] "    water.api.Handler.handle(Handler.java:60)"              

You are trying to persist the model to the location dbfs:/mnt/adls/DemandBrainLake/Databricks/QA-US/Cannibalization_Models/PAS000052C0376840003082/cann_models_PAS000052C0376840003082.zip. The H2O API doesn't understand the dbfs scheme (which is specific to Databricks). Try a path starting with /dbfs/mnt/adls/DemandBrainLake/... instead.
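
Concretely, something along these lines should work from the R client. This is a sketch that assumes you export the model with h2o.save_mojo (the same /dbfs path rule applies to h2o.saveModel); glm_model is a placeholder for your trained model object.

    # dbfs:/ URIs are not understood by H2O's persist layer;
    # use the local FUSE mount that Databricks exposes under /dbfs instead
    mojo_path <- h2o.save_mojo(
      glm_model,
      path  = "/dbfs/mnt/adls/DemandBrainLake/Databricks/QA-US/Cannibalization_Models/PAS000052C0376840003082",
      force = TRUE
    )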

@denisabarar
Author

Hi @mn-mikke,

The issue with the path was also intermittent.
We reran the job with the updated cluster (the Standard one) and ended up with:

  • 2 successful runs
  • 2 runs where H2O got stuck right at the beginning with this error:
    Warning in .h2o.__checkConnectionHealth() H2O cluster node 10.237.58.133:54321 is behaving slowly and should be inspected manually
  • 2 runs where H2O ended up with an "Exception thrown in awaitResult" error - error log attached.

errorMessage_14_10_22.txt

@krasinski
Member

Closing the issue due to long inactivity; please reopen if still needed.

@krasinski closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 17, 2024