Large (DL) models cause oversize issues during serialization #13925

Closed
exalate-issue-sync bot opened this issue May 13, 2023 · 3 comments

@exalate-issue-sync

Subject: h2o.deeplearning - Java (memory) exception?

Does h2o.deeplearning have a known problem with Java memory allocation? When I tried to train a large network with two hidden layers of 5000 neurons each (see below), it crashed with an exception right after printing the network parameter settings (see Output below). This happened with both Java "1.8.0_45" with a 150g max heap and Java "1.7.0_79" with a 6g max heap. With smaller hidden layers (e.g., 2000 neurons each), the exception did not occur. Is this a known problem with H2O/Java?

Neural network:

model.dl <- h2o.deeplearning(
  x = 2:(32^3+1),
  y = 1,
  classification = T,
  data = train.h2o,
  validation = test.h2o,
  activation = "TanhWithDropout",
  hidden = c(5000,5000),
  epochs = 100,
  train_samples_per_iteration = -1
)
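
For a sense of scale, the dense weight matrices implied by this call can be estimated from the layer sizes alone. The sketch below is a back-of-the-envelope calculation, not H2O code. It assumes one 4-byte float per weight (the `putA4f` frame in the trace suggests float arrays), assumes all 32^3 = 32768 selected predictors are numeric, and ignores biases, the output layer, and any extra per-weight state the model may keep. `WeightSizeEstimate` and its methods are hypothetical names.

```java
public class WeightSizeEstimate {
    /** Bytes needed for a dense rows x cols matrix of 4-byte floats. */
    public static long weightBytes(long rows, long cols) {
        return rows * cols * Float.BYTES;
    }

    /** Total bytes for the weight matrices between consecutive layers. */
    public static long totalWeightBytes(long inputs, long... hidden) {
        long total = 0;
        long prev = inputs;
        for (long units : hidden) {
            total += weightBytes(units, prev);
            prev = units;
        }
        return total;
    }

    public static void main(String[] args) {
        long inputs = 32L * 32 * 32; // x = 2:(32^3+1) selects 32768 columns
        System.out.println("hidden = c(5000,5000): " + totalWeightBytes(inputs, 5000, 5000) + " bytes");
        System.out.println("hidden = c(2000,2000): " + totalWeightBytes(inputs, 2000, 2000) + " bytes");
        System.out.println("Integer.MAX_VALUE:     " + Integer.MAX_VALUE);
    }
}
```

Under these assumptions the 5000-unit network needs about 755 MB for weights alone, versus about 278 MB for the 2000-unit one. Both are below the roughly 2 GB limit of a single Java array, so the raw weights by themselves would not exceed it. Anything else serialized alongside them, or a buffer that grows in large steps, could push the larger model over the limit, but that part is speculation.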

Output:

java.lang.IllegalArgumentException: 0 > -2147483648
at java.util.Arrays.copyOfRange(Arrays.java:3519)
at water.MemoryManager.malloc(MemoryManager.java:251)
at water.MemoryManager.malloc(MemoryManager.java:223)
at water.MemoryManager.arrayCopyOfRange(MemoryManager.java:285)
at water.AutoBuffer.sendPartial(AutoBuffer.java:507)
at water.AutoBuffer.putA4f(AutoBuffer.java:1134)
at hex.deeplearning.Neurons$DenseRowMatrix.write(Neurons.java)
at water.AutoBuffer.put(AutoBuffer.java:604)
at water.AutoBuffer.putA(AutoBuffer.java:660)
at hex.deeplearning.DeepLearningModel$DeepLearningModelInfo.write(DeepLearningModel.java)
at water.AutoBuffer.put(AutoBuffer.java:598)
at hex.deeplearning.DeepLearningModel.write(DeepLearningModel.java)
at water.Value.<init>(Value.java:358)
at water.TAtomic.atomic(TAtomic.java:23)
at water.Atomic.compute2(Atomic.java:58)
at water.Atomic.fork(Atomic.java:42)
at water.Atomic.invoke(Atomic.java:34)
at water.Lockable.write_lock(Lockable.java:60)
at hex.deeplearning.DeepLearning.trainModel(DeepLearning.java:1039)
at hex.deeplearning.DeepLearning.buildModel(DeepLearning.java:849)
at hex.deeplearning.DeepLearning.execImpl(DeepLearning.java:755)
at water.Func.exec(Func.java:42)
at water.Job$3.compute2(Job.java:334)
at water.H2O$H2OCountedCompleter.compute(H2O.java:656)
at jsr166y.CountedCompleter.exec(CountedCompleter.java:429)
at jsr166y.ForkJoinTask.doExec(ForkJoinTask.java:263)
at jsr166y.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:974)
at jsr166y.ForkJoinPool.runWorker(ForkJoinPool.java:1477)
at jsr166y.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:104)
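
The message `0 > -2147483648` is what `java.util.Arrays.copyOfRange` reports when its `to` argument is `Integer.MIN_VALUE`, which is the value a 32-bit `int` wraps to when 2^30 is doubled. The sketch below reproduces the same message in isolation. It assumes, without having checked the H2O source, that a buffer capacity is computed by doubling an `int`. `OverflowDemo` and `grow` are hypothetical names and are not H2O's `AutoBuffer` code.

```java
import java.util.Arrays;

public class OverflowDemo {
    /** Hypothetical growth rule: double the capacity using int arithmetic. */
    public static int grow(int capacity) {
        return capacity * 2; // silently wraps past Integer.MAX_VALUE
    }

    /** Returns the message thrown when copying into the overflowed capacity. */
    public static String copyWithOverflowedLength() {
        byte[] small = new byte[16];      // stand-in for a buffer of about 1 GB
        int newCapacity = grow(1 << 30);  // 2^31 wraps to -2147483648
        try {
            Arrays.copyOfRange(small, 0, newCapacity);
            return null;
        } catch (IllegalArgumentException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(grow(1 << 30));              // -2147483648
        System.out.println(copyWithOverflowedLength()); // 0 > -2147483648
    }
}
```

If this is what happens, it would also explain why a 150g heap did not help: the limit is the `int` index range of a single array, not the amount of available memory.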

@exalate-issue-sync

Arno Candel commented: Two possible solutions:

  1. Better error message.
  2. Cut the DL model into smaller pieces that can be serialized (and stored in DKV) independently.
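
Option 2 could look roughly like the following: split one large weight array into pieces that each stay under a size limit, so every piece can be serialized and stored on its own. This is only an illustration of the idea, not H2O's implementation. `WeightChunker`, `split`, `join`, and the `maxFloats` limit are hypothetical, and a real version would also need to store each piece under its own DKV key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class WeightChunker {
    /** Splits weights into consecutive pieces of at most maxFloats elements. */
    public static List<float[]> split(float[] weights, int maxFloats) {
        if (maxFloats <= 0) throw new IllegalArgumentException("maxFloats must be positive");
        List<float[]> pieces = new ArrayList<>();
        for (int from = 0; from < weights.length; ) {
            // Use long arithmetic so from + maxFloats cannot overflow.
            int to = (int) Math.min((long) from + maxFloats, weights.length);
            pieces.add(Arrays.copyOfRange(weights, from, to));
            from = to;
        }
        return pieces;
    }

    /** Reassembles the pieces, in order, into one array. */
    public static float[] join(List<float[]> pieces) {
        int total = 0;
        for (float[] p : pieces) total = Math.addExact(total, p.length);
        float[] out = new float[total];
        int pos = 0;
        for (float[] p : pieces) {
            System.arraycopy(p, 0, out, pos, p.length);
            pos += p.length;
        }
        return out;
    }
}
```

For example, `WeightChunker.split(weights, 1 << 20)` would yield pieces of at most 4 MB each, and `join` restores the original array when the model is read back.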

@exalate-issue-sync

Arno Candel commented: Fixed with 1) for now: 5678a26
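
A clearer error (option 1) amounts to checking the size before serializing and failing with a message that names the cause. The sketch below shows that general pattern only. It is not the content of commit 5678a26, and `ModelSizeCheck`, `checkSerializable`, and the message wording are made up for illustration.

```java
public class ModelSizeCheck {
    /** Largest payload this sketch allows in one buffer (an int-indexed byte array). */
    public static final long MAX_BYTES = Integer.MAX_VALUE;

    /** Throws a descriptive error if the estimated serialized size cannot fit in one buffer. */
    public static void checkSerializable(long estimatedBytes) {
        if (estimatedBytes < 0) {
            throw new IllegalArgumentException("estimatedBytes must be non-negative");
        }
        if (estimatedBytes > MAX_BYTES) {
            throw new IllegalArgumentException(
                "Model is too large to serialize: " + estimatedBytes
                + " bytes exceeds the limit of " + MAX_BYTES
                + " bytes. Reduce the number of hidden neurons or input columns.");
        }
    }
}
```

Calling a check like this before the model is written would replace the opaque `0 > -2147483648` with a message the user can act on.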

@DinukaH2O

JIRA Issue Migration Info

Jira Issue: PUBDEV-941
Assignee: Arno Candel
Reporter: Arno Candel
State: Resolved
Fix Version: N/A
Attachments: N/A
Development PRs: N/A
