
Training Not Working #5

Open
HoltSpalding opened this issue Jul 26, 2018 · 5 comments

@HoltSpalding

I've been having a lot of trouble training new models. During the training process,
"Extracting features for level 2 inference
Done - Extracting features for level 2 inference"
is printed over and over until, at some point, my CPU usage drops to almost 0% and the shell reports that the process is out of memory. If I then wait an extremely long time, training continues. I have not been able to train any models successfully on any of the data in the AMR 1.0 release (except for one model trained on just two sentences from the BOLT treebank). Have you run into this issue, and what specs do you suggest for training a model this way? I'm running on 12 CPU cores with 32 GB of memory. I tried looking at the source code to see if I could fix the problem there, but the issue is happening in one of the precompiled jar files (I'm going to look at SPF now to see which jar is causing the problem and whether I can build a new jar from there). Any suggestions? Thank you so much.

@HoltSpalding
Author

Here's what is printed when it crashes:
Exception in thread "main" java.lang.OutOfMemoryError
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.util.concurrent.ForkJoinTask.getThrowableException(ForkJoinTask.java:598)
at java.util.concurrent.ForkJoinTask.reportException(ForkJoinTask.java:677)
at java.util.concurrent.ForkJoinTask.invoke(ForkJoinTask.java:735)
at java.util.stream.ReduceOps$ReduceOp.evaluateParallel(ReduceOps.java:714)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:233)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at edu.uw.cs.lil.amr.data.LabeledAmrSentenceCollection.<init>(LabeledAmrSentenceCollection.java:83)
at edu.uw.cs.lil.amr.data.LabeledAmrSentenceCollection$Creator.create(LabeledAmrSentenceCollection.java:169)
at edu.uw.cs.lil.amr.data.LabeledAmrSentenceCollection$Creator.create(LabeledAmrSentenceCollection.java:97)
at edu.cornell.cs.nlp.spf.explat.ParameterizedExperiment.readResrouces(ParameterizedExperiment.java:204)
at edu.cornell.cs.nlp.spf.explat.DistributedExperiment.readResrouces(DistributedExperiment.java:206)
at edu.uw.cs.lil.amr.exp.AmrExp.<init>(AmrExp.java:105)
at edu.uw.cs.lil.amr.exp.AmrExp.<init>(AmrExp.java:117)
at edu.uw.cs.lil.amr.exp.AmrGenericExperiment.main(AmrGenericExperiment.java:28)
at edu.uw.cs.lil.amr.Main.main(Main.java:61)
Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
at com.google.common.collect.Tables.immutableCell(Tables.java:67)
at com.google.common.collect.StandardTable$CellIterator.next(StandardTable.java:323)
at com.google.common.collect.StandardTable$CellIterator.next(StandardTable.java:306)
at com.google.common.collect.TransformedIterator.next(TransformedIterator.java:48)
at uk.ac.ed.easyccg.syntax.ParserAStar.parseAstar(ParserAStar.java:339)
at uk.ac.ed.easyccg.syntax.ParserAStar.doParsing(ParserAStar.java:226)
at uk.ac.ed.easyccg.syntax.ParserAStar.parseTokens(ParserAStar.java:120)
at edu.uw.cs.lil.amr.ccgbank.easyccg.EasyCCGWrapper.getSpans(EasyCCGWrapper.java:54)
at edu.uw.cs.lil.amr.data.LabeledAmrSentence.<init>(LabeledAmrSentence.java:45)
at edu.uw.cs.lil.amr.data.LabeledAmrSentenceCollection.lambda$new$0(LabeledAmrSentenceCollection.java:81)
at edu.uw.cs.lil.amr.data.LabeledAmrSentenceCollection$$Lambda$39/1560160481.apply(Unknown Source)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:747)
at java.util.stream.ReduceOps$ReduceTask.doLeaf(ReduceOps.java:721)
at java.util.stream.AbstractTask.compute(AbstractTask.java:316)
at java.util.concurrent.CountedCompleter.exec(CountedCompleter.java:731)
at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289)
at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056)
at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692)
at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157)

@yoavartzi
Member

It looks like a memory problem. I don't remember how much memory we used for training, but our machines had quite a bit of memory, maybe 64GB or 128GB. The exception seems to be happening in one of the threads. We don't have a graceful way to bring down the entire system when a thread throws an exception. Whatever happens after this kind of exception is not to be relied on, and you should just kill the process.
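For what it's worth, "GC overhead limit exceeded" just means the JVM spent nearly all its time in garbage collection before giving up, i.e. the heap was effectively exhausted. The heap ceiling is set with the standard -Xmx flag when launching training; the jar and experiment file names below are only placeholders for whatever command you are already running, not an exact command from our docs:

```sh
# Placeholder paths -- the relevant part is -Xmx, which raises the JVM heap cap.
java -Xmx28g -jar dist/amr.jar experiments/train.exp
```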

@yoavartzi
Member

One thing you can do is limit training to very short sentences. Memory consumption is tightly coupled with sentence length due to the use of a CKY chart. The model you get might not be good, but it is a good way to test your setup.
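If it helps, here is a throwaway filter for building such a subset. It is not part of the release scripts; it just assumes the standard AMR release layout, where entries are separated by blank lines and each entry carries a "# ::snt ..." line with the raw sentence:

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Keep only AMR entries whose ::snt sentence has at most maxTokens
// whitespace-separated tokens. Entries without a ::snt line (e.g. file
// headers) are dropped.
public class FilterShortAmr {
	public static void main(String[] args) throws IOException {
		final String inputFile = args[0];
		final String outputFile = args[1];
		final int maxTokens = Integer.parseInt(args[2]);

		final String text = new String(
				Files.readAllBytes(Paths.get(inputFile)), StandardCharsets.UTF_8);

		final List<String> kept = new ArrayList<>();
		// Entries are separated by one or more blank lines.
		for (final String entry : text.split("\\n\\s*\\n")) {
			for (final String line : entry.split("\\n")) {
				if (line.startsWith("# ::snt")) {
					final String sentence = line.substring("# ::snt".length()).trim();
					if (!sentence.isEmpty()
							&& sentence.split("\\s+").length <= maxTokens) {
						kept.add(entry.trim());
					}
					break;
				}
			}
		}

		Files.write(Paths.get(outputFile),
				(String.join("\n\n", kept) + "\n").getBytes(StandardCharsets.UTF_8));
	}
}
```

Run it as, e.g., `java FilterShortAmr training.txt training-short.txt 10` to keep only entries whose sentence has at most 10 tokens, and point the training data resource at the filtered file.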

@HoltSpalding
Author

HoltSpalding commented Aug 1, 2018

I fixed my memory problems and was able to train a model on all the LDC data. However, an amr.pre.sp model was created, and the logs folder was missing some of the files that were present when I trained a much smaller model. Do you know why this is? Is it because the data is split across multiple files? Would putting all the data into one file fix it? My exp and inc files seem exactly the same across experiments, except that when I trained the smaller model and got an amr.sp file, the data wasn't split up.

@yoavartzi
Member

If you use separate files, they have to be merged. I think the exp files we released do that. Once you merge the data resource, it should be used as a single resource, so I am not sure why it would behave differently.
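If you do want to try the single-file route, the simplest option is to concatenate the split training files yourself and point the data resource in your .exp/.inc file at the merged file. The file names below are placeholders for whatever split you are using:

```sh
# Placeholder file names. Concatenation is safe as long as each part ends
# with a blank line, since AMR entries are plain-text blocks separated by
# blank lines.
cat data/training-part1.txt data/training-part2.txt > data/training-all.txt
```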
