java.lang.RuntimeException: Cloud size 1 under 2 #1739

Closed
BhushG opened this issue Jan 22, 2020 · 12 comments
BhushG commented Jan 22, 2020

Hi, I'm getting this exception when I execute the job on the YARN cluster. There is no problem executing the same job on a local machine.
I've tried all of the settings from http://docs.h2o.ai/sparkling-water/2.1/latest-stable/doc/configuration/internal_backend_tuning.html, but still couldn't resolve this exception.
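For context, the tuning page linked above mostly amounts to pinning the Spark executors in place before H2O forms its cloud. A minimal sketch of those settings applied through SparkConf (values are illustrative, not the reporter's exact configuration):

```scala
import org.apache.spark.SparkConf

// Sketch of the tuning knobs the linked page describes; values are illustrative.
val sparkConf = new SparkConf()
  .setAppName("H2OAutoML")
  .setMaster("yarn")
  // H2O's internal backend needs a static set of executors, so dynamic
  // allocation must be off.
  .set("spark.dynamicAllocation.enabled", "false")
  // Wait until all requested executors have registered before scheduling,
  // so that every executor can start an H2O node.
  .set("spark.scheduler.minRegisteredResourcesRatio", "1")
  // H2O nodes cannot survive task retries on a different executor,
  // so fail fast instead of retrying.
  .set("spark.task.maxFailures", "1")
```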
Here are the logs:

20/01/22 07:06:53 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Disabling executor 2.
20/01/22 07:06:53 INFO scheduler.DAGScheduler: Executor lost: 2 (epoch 18)
20/01/22 07:06:53 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/01/22 07:06:53 INFO storage.BlockManagerMasterEndpoint: Removing block manager BlockManagerId(2, project-master, 37245, None)
20/01/22 07:06:53 INFO storage.BlockManagerMaster: Removed 2 successfully in removeExecutor
20/01/22 07:06:53 INFO scheduler.DAGScheduler: Shuffle files lost for executor: 2 (epoch 18)
20/01/22 07:06:53 INFO yarn.YarnAllocator: Completed container container_1578919282272_0243_01_000003 on host: project-master (state: COMPLETE, exit status: 50)
20/01/22 07:06:53 WARN yarn.YarnAllocator: Container from a bad node: container_1578919282272_0243_01_000003 on host: project-master. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1578919282272_0243_01_000003
Exit code: 50
Stack trace: ExitCodeException exitCode=50: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 50
.
20/01/22 07:06:53 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Requesting driver to remove executor 2 for reason Container from a bad node: container_1578919282272_0243_01_000003 on host: project-master. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1578919282272_0243_01_000003
Exit code: 50
Stack trace: ExitCodeException exitCode=50: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 50
.
20/01/22 07:06:53 ERROR cluster.YarnClusterScheduler: Lost executor 2 on project-master: Container from a bad node: container_1578919282272_0243_01_000003 on host: project-master. Exit status: 50. Diagnostics: Exception from container-launch.
Container id: container_1578919282272_0243_01_000003
Exit code: 50
Stack trace: ExitCodeException exitCode=50: 
	at org.apache.hadoop.util.Shell.runCommand(Shell.java:582)
	at org.apache.hadoop.util.Shell.run(Shell.java:479)
	at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:773)
	at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
	at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)


Container exited with a non-zero exit code 50
.
20/01/22 07:06:53 INFO storage.BlockManagerMasterEndpoint: Trying to remove executor 2 from BlockManagerMaster.
20/01/22 07:06:53 INFO storage.BlockManagerMaster: Removal of executor 2 requested
20/01/22 07:06:53 INFO cluster.YarnSchedulerBackend$YarnDriverEndpoint: Asked to remove non-existent executor 2
20/01/22 07:06:56 INFO yarn.ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Max number of executor failures (1) reached)
20/01/22 07:07:52 ERROR job.projectJobDriver$: Job failed in cluster mode with IrisMemOverhead
java.lang.RuntimeException: Cloud size 1 under 2
	at water.H2O.waitForCloudSize(H2O.java:1845)
	at org.apache.spark.h2o.backends.internal.InternalH2OBackend$.org$apache$spark$h2o$backends$internal$InternalH2OBackend$$startH2OCluster(InternalH2OBackend.scala:92)
	at org.apache.spark.h2o.backends.internal.InternalH2OBackend.init(InternalH2OBackend.scala:64)
	at org.apache.spark.h2o.H2OContext.init(H2OContext.scala:130)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:418)
	at org.apache.spark.h2o.H2OContext$.getOrCreate(H2OContext.scala:446)
	at ai.h2o.sparkling.ml.algos.H2OAlgoCommonUtils$class.prepareDatasetForFitting(H2OAlgoCommonUtils.scala:47)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.prepareDatasetForFitting(H2OAutoML.scala:42)
	at ai.h2o.sparkling.ml.algos.H2OAutoML.fit(H2OAutoML.scala:57)


mn-mikke (Collaborator) commented:

Hi @BhushG,
Can you try to give Spark executors more memory?
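For anyone landing here: one way to raise executor and driver memory on YARN, sketched via SparkConf with illustrative sizes (driver memory usually has to be passed to spark-submit instead, since the driver JVM is already running by the time application code sets it):

```scala
import org.apache.spark.SparkConf

// Illustrative memory settings for a YARN deployment; sizes are examples only.
val sparkConf = new SparkConf()
  .set("spark.executor.memory", "4g")
  .set("spark.driver.memory", "4g")
  // Off-heap overhead per executor (Spark 2.3+ property name); YARN kills
  // containers that exceed executor memory plus overhead, which shows up
  // as lost executors in the driver log.
  .set("spark.executor.memoryOverhead", "1g")
```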

BhushG (Author) commented Jan 22, 2020


@mn-mikke Thanks for the quick reply. We already tried that. We used the Iris dataset for testing, which is only a few KB, and allocated 5 GB to the executors as well as the driver, but it still did not work.

mn-mikke (Collaborator) commented:

What version of Sparkling Water are you using? Could you share the code snippet you tried to run?

BhushG (Author) commented Jan 23, 2020


@mn-mikke We have tried these versions of Sparkling Water: 3.28.0.1-1-2.4 and 3.26.11-2.4, on Spark 2.4 with Scala 2.11.8.

Here is the code snippet:

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.{H2OConf, H2OContext}
import ai.h2o.sparkling.ml.algos.H2OAutoML

def main(args: Array[String]): Unit = {
  println("H2O AutoML")
  println("Creating Spark Session..")
  val sparkConf = new SparkConf()
    .setAppName("H2OAutoML")
    .setMaster("yarn")

  val sparkSession = SparkSession.builder().config(sparkConf).getOrCreate()
  val conf = new H2OConf(sparkSession)
    .setInternalClusterMode()
  H2OContext.getOrCreate(sparkSession, conf)

  val df = sparkSession.read
    .option("header", true)
    .option("inferSchema", true)
    .csv("/Users/Bhushan/Datasets/iris2.csv")
  df.show()

  val labelCol = "species"
  val predCol = "pred"
  val Array(trainingDF, testingDF) = df.randomSplit(Array(0.8, 0.2))

  val automl = new H2OAutoML()
  automl.setLabelCol(labelCol)
  automl.setSortMetric("AUTO")
  automl
    .setMaxRuntimeSecs(30)
    .setPredictionCol(predCol)
    .setConvertUnknownCategoricalLevelsToNa(true)

  val model = automl.fit(trainingDF)
  val transformed = model.transform(testingDF)
  println(model.getModelDetails())
  transformed.show(30)
}

BhushG (Author) commented Jan 23, 2020


@mn-mikke We are using internal cluster mode, and I have in fact set spark.dynamicAllocation.enabled to false.

BhushG (Author) commented Jan 30, 2020


@jakubhava Hi, is there any solution to this exception? Sometimes the model gets trained on the cluster, but when I deploy the same model for the same dataset on the cluster, it fails with "Cloud size 0 under 2". I appreciate your help.

BhushG (Author) commented Jan 31, 2020


@jakubhava @mn-mikke Is there any solution to this, or should I use the external backend?
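For reference, switching to the external backend is roughly a change to the H2OConf setup. A sketch with automatic cluster start, using the H2OConf method names documented for Sparkling Water of that era; this is an untested outline, and the external backend also needs the extended H2O driver JAR configured as described in the backend docs:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.h2o.{H2OConf, H2OContext}

// Sketch: external backend with automatic cluster start (assumes the
// extended H2O driver JAR is set up per the Sparkling Water backend docs).
val sparkSession = SparkSession.builder().getOrCreate()
val conf = new H2OConf(sparkSession)
  .setExternalClusterMode()
  .useAutoClusterStart()
  .setClusterSize(2) // number of external H2O nodes to start
val hc = H2OContext.getOrCreate(sparkSession, conf)
```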

BhushG (Author) commented Jan 31, 2020


I'm also not able to start the external backend. Created a new issue: #1759

jakubhava (Contributor) commented:

Can you share the full YARN logs (executors and driver)? We have fixed various clouding issues in the upcoming 3.28.0.3 release, and I would like to verify whether this issue is one of them.

niebloomj commented:

I am getting the same issue and have sent the full logs on the gitter channel. Thank you.

jakubhava (Contributor) commented:

Yes, this issue will be fixed in the upcoming 3.28.0.3 release.

jakubhava (Contributor) commented:

Sparkling Water 3.28.0.3 has been released, which fixes the clouding issues mentioned above:

Spark 2.4: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.4/3.28.0.3-1-2.4/index.html
Spark 2.3: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.3/3.28.0.3-1-2.3/index.html
Spark 2.2: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.2/3.28.0.3-1-2.2/index.html
Spark 2.1: http://h2o-release.s3.amazonaws.com/sparkling-water/spark-2.1/3.28.0.3-1-2.1/index.html

If you bump into any new issues, please create a new one, or feel free to reopen this issue.
