neo4j-import out of memory error #7772
Comments
Can you also share the command line that you run and some information about the files you are importing? Since it fails while handling duplicates, do you have a lot of duplicate nodes in your files? Like > 100M?
Here is the code to run the import
The largest deduplication effort comes from me concatenating the AuthorID and AffiliationID fields from the PaperAuthorAffiliations.txt file, from here [https://academicgraph.blob.core.windows.net/graph-2016-02-05/index.html]. What I have tried to do is model a hyperedge relationship, using the deduplication feature to create unique nodes for author-institution pairs. Here is the code to make the import csv file, using the file downloaded from the page above.
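The pre-processing step described above (concatenating AuthorID and AffiliationID into one composite id per hyperedge node) could be sketched roughly as below. This is a hypothetical reconstruction, not the reporter's actual script: the tab-separated column positions in PaperAuthorAffiliations.txt and the output header are assumptions.

```python
import csv

def build_hyperedge_nodes(src_path, dst_path):
    """Emit one CSV row per unique (AuthorID, AffiliationID) pair,
    keyed by a composite id suitable for neo4j-import."""
    seen = set()
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst)
        # Header is illustrative; the real file may need different property names.
        writer.writerow(["authorAffiliationId:ID", "authorId", "affiliationId"])
        for row in reader:
            author_id, affiliation_id = row[1], row[2]  # assumed column positions
            composite = author_id + affiliation_id
            if composite not in seen:  # dedupe up front, before the import tool sees it
                seen.add(composite)
                writer.writerow([composite, author_id, affiliation_id])
```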
OK, by the looks of the stack trace (although obviously it could be anything, since it's an OOM), there are lots and lots of duplicate node ids in the node input, and the OOM comes from merely collecting them (for a later report). It's possible to distinguish between different groups of node ids by using id spaces, see http://neo4j.com/docs/operations-manual/current/deployment/#import-tool-id-spaces . Could you simply cut down on the duplicates? There's no reason for them to be there, as they will not be imported anyway.
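Cutting down the duplicates can be done with a single pass over the node file before running the import. A minimal sketch (function name and CSV layout are illustrative; an in-memory set works until the ids no longer fit in RAM, after which an on-disk sort such as `sort -u` is the usual fallback):

```python
def dedupe_csv(path_in, path_out, id_column=0):
    """Copy a CSV file, dropping rows whose id (in id_column) was already seen.
    The header line is preserved as-is."""
    seen = set()
    with open(path_in, encoding="utf-8") as src, \
         open(path_out, "w", encoding="utf-8") as dst:
        dst.write(next(src))  # header
        for line in src:
            key = line.split(",", id_column + 1)[id_column]
            if key not in seen:
                seen.add(key)
                dst.write(line)
```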
@markrcosta any progress on deduplicating those node ids from the input? Maybe neo4j-import shouldn't keep all those duplicates, only a sample and a simple counter instead... would that be helpful for you?
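The change suggested here, keeping only a sample of duplicate ids plus a counter instead of collecting all of them, could look roughly like the sketch below. Names are illustrative and this is not the actual neo4j-import internals; the point is that memory stays bounded even with hundreds of millions of duplicates.

```python
class DuplicateIdReporter:
    """Collect duplicate node ids for error reporting without holding all of them.

    Only the first max_samples duplicates are kept, plus a running total,
    so the collector's memory use is constant regardless of input size.
    """

    def __init__(self, max_samples=1000):
        self.max_samples = max_samples
        self.samples = []
        self.count = 0

    def report(self, node_id):
        self.count += 1
        if len(self.samples) < self.max_samples:
            self.samples.append(node_id)

    def summary(self):
        return (f"{self.count} duplicate node ids encountered "
                f"(showing first {len(self.samples)})")
```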
Mattias, I was able to get the data into the database by pre-processing it. If you're interested in how I handled the situation, you can read about it. Thank you for your help.
Mark R. Costa, Ph.D.
OK wonderful, great work. Ideally the import tool should be changed to cope with this.
Mattias Persson
Mattias, Ideally, the import tool would work for this :) If you get around to fixing this problem, let me know. I can get the data.
The memory usage of
Bug report
The neo4j-import tool in version 3.0.4 runs out of memory and does not read the Java max heap settings set at the OS level (i.e., it identifies a max heap of 20.98 gb when the max heap is set to 72 gb).
Neo4j Version: 3.0.4
Operating System: Ubuntu 14.10
API: neo4j-import
Steps to reproduce
I am trying to import a set of relatively large CSV files related to the Microsoft Academic Graph. When I run the neo4j-import tool, it arbitrarily selects a maximum Java heap size of ~20 gb, even though I have the max heap size on my machine set to ~72 gb. After the import tool runs for some time, I get a java.lang.OutOfMemoryError: Java heap space.
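One plausible explanation for the ~21 gb figure (an assumption, not confirmed anywhere in this thread): when no explicit -Xmx is passed, HotSpot's ergonomics typically default the maximum heap to about one quarter of physical RAM, ignoring OS-level settings. A toy sanity check of that arithmetic:

```python
def ergonomic_default_heap_gb(physical_ram_gb):
    """Approximate HotSpot's ergonomic default max heap: roughly one quarter
    of physical RAM (the real calculation has extra caps on older/32-bit JVMs)."""
    return physical_ram_gb / 4

# A machine with ~84 gb of RAM (a guess) would default to ~21 gb of heap,
# which is close to the 20.98 gb the import tool reported.
```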
Expected behavior
The import tool should handle large data sets without running out of memory.
Actual behavior
The import tool aborts when the heap space is exhausted. I've pasted the output of the import below.