
neo4j-import out of memory error #7772

Closed
markrcosta opened this issue Aug 22, 2016 · 9 comments

markrcosta commented Aug 22, 2016

bug report

The neo4j-import tool in version 3.0.4 runs out of memory, and it does not respect the Java max heap setting configured at the OS level (i.e., it reports a max heap of 20.98 GB when the max heap is set to 72 GB).

Neo4j Version: 3.0.4
Operating System: Ubuntu 14.10
API: neo4j-import

Steps to reproduce

I am trying to import a set of relatively large CSV files related to the Microsoft Academic Graph. When I run the neo4j-import tool, it selects a maximum Java heap size of about 21 GB on its own, even though I have the max heap size on my machine set to ~72 GB. After the import tool runs for some time, I get a java.lang.OutOfMemoryError: Java heap space.
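For context: when no explicit -Xmx is given, the JVM picks a default max heap based on the machine's RAM ("ergonomics"), which is consistent with the ~21 GB figure reported below. The first command shows that default; the second is a hypothetical override that assumes the neo4j-import wrapper script passes JAVA_OPTS through to the JVM (whether it does varies by Neo4j version):

# Show the max heap the JVM would choose by default on this machine
java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

# Hypothetical override -- only works if the wrapper honors JAVA_OPTS
JAVA_OPTS="-Xmx64g" neo4j-import --into graph.db ...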

Expected behavior

The import tool should handle large data sets without running out of memory.

Actual behavior

The import tool aborts when the heap space is exhausted. I've pasted the output of the import below.

Neo4j version: 3.0.4
Importing the contents of these files into graph.db:
Nodes:
:Conference
/usr/share/neo4j/import/conference_header.csv
/usr/share/neo4j/import/Conferences.txt

:Author
/usr/share/neo4j/import/author_header.csv
/usr/share/neo4j/import/Authors.txt

:FoS
/usr/share/neo4j/import/fos_header.csv
/usr/share/neo4j/import/fieldsofstudy.csv

:Journal
/usr/share/neo4j/import/journal_header.csv
/usr/share/neo4j/import/Journals.txt

:Paper
/usr/share/neo4j/import/paper_header.csv
/usr/share/neo4j/import/Papers.txt

:Affiliation
/usr/share/neo4j/import/affiliation_header.csv
/usr/share/neo4j/import/Affiliations.txt

:AuthorAffiliation
/usr/share/neo4j/import/authoraffiliations.csv
Relationships:
:parent_of
/usr/share/neo4j/import/fos_hierarchy_header.csv
/usr/share/neo4j/import/foshierarchy.csv

:about
/usr/share/neo4j/import/paper_keyword_header.csv
/usr/share/neo4j/import/PaperKeywords.txt

:cited
/usr/share/neo4j/import/paper_references_header.csv
/usr/share/neo4j/import/PaperReferences.txt

:wrote
/usr/share/neo4j/import/wrote_header.csv
/usr/share/neo4j/import/PaperAuthorAffiliations.txt

:wrote_while_at
/usr/share/neo4j/import/authaffilpaper.csv

:affiliated_with
/usr/share/neo4j/import/affiliatedwith.csv

:published_in
/usr/share/neo4j/import/paperjournal.csv
/usr/share/neo4j/import/paperconference.csv

Available memory:
Free machine memory: 65.03 GB
Max heap memory : 20.98 GB
....
*DEDUPLICATE:10.91 GB------------------------------------------------------------------------] 193M
Exception in thread "Thread-581" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:113)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:589)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:493)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:281)
at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:113)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:589)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:493)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:281)
at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)

jexp (Member) commented Aug 22, 2016

Can you also share the command line that you ran and some information about the files you are importing?

Since it fails while handling duplicates: do you have a lot of duplicate nodes in your files? Like > 100M?
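One way to answer that before re-running the import is to count repeated IDs in a node file up front. A minimal sketch (a hypothetical helper, not from the thread; it assumes a tab-separated node file whose first column is the ID):

import csv
import sys
from collections import Counter

csv.field_size_limit(sys.maxsize)

# Count how often each node ID (first column) appears.
counts = Counter()
with open('authoraffiliations.csv', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader)  # skip the header row
    for row in reader:
        counts[row[0]] += 1

# Every occurrence beyond the first is a duplicate the import tool must collect.
duplicates = sum(c - 1 for c in counts.values())
print('%d duplicate ids out of %d data rows' % (duplicates, sum(counts.values())))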

markrcosta (Author) commented

Here is the command used to run the import:
nohup neo4j-import --into graph.db --id-type string --delimiter TAB --skip-bad-relationships true --skip-bad-nodes true --skip-duplicate-nodes true --ignore-empty-strings true --ignore-extra-columns true --bad-tolerance 999999999 --processors 2 --stacktrace true --nodes:Conference "/usr/share/neo4j/import/conference_header.csv,/usr/share/neo4j/import/Conferences.txt" --nodes:Author "/usr/share/neo4j/import/author_header.csv,/usr/share/neo4j/import/Authors.txt" --nodes:FoS "/usr/share/neo4j/import/fos_header.csv,/usr/share/neo4j/import/fieldsofstudy.csv" --nodes:Journal "/usr/share/neo4j/import/journal_header.csv,/usr/share/neo4j/import/Journals.txt" --nodes:Paper "/usr/share/neo4j/import/paper_header.csv,/usr/share/neo4j/import/Papers.txt" --nodes:Affiliation "/usr/share/neo4j/import/affiliation_header.csv,/usr/share/neo4j/import/Affiliations.txt" --nodes:AuthorAffiliation /usr/share/neo4j/import/authoraffiliations.csv --relationships:parent_of "/usr/share/neo4j/import/fos_hierarchy_header.csv,/usr/share/neo4j/import/foshierarchy.csv" --relationships:about "/usr/share/neo4j/import/paper_keyword_header.csv,/usr/share/neo4j/import/PaperKeywords.txt" --relationships:cited "/usr/share/neo4j/import/paper_references_header.csv,/usr/share/neo4j/import/PaperReferences.txt" --relationships:wrote "/usr/share/neo4j/import/wrote_header.csv,/usr/share/neo4j/import/PaperAuthorAffiliations.txt" --relationships:wrote_while_at /usr/share/neo4j/import/authaffilpaper.csv --relationships:affiliated_with /usr/share/neo4j/import/affiliatedwith.csv --relationships:published_in "/usr/share/neo4j/import/paperjournal.csv,/usr/share/neo4j/import/paperconference.csv"

markrcosta (Author) commented Aug 22, 2016

The largest deduplication effort comes from concatenating the AuthorID and AffiliationID fields of the PaperAuthorAffiliations.txt file, available here: https://academicgraph.blob.core.windows.net/graph-2016-02-05/index.html. What I am trying to do is model a hyperedge relationship, using the deduplication feature to create unique nodes for author-institution pairs.

Here is the code that creates the import CSV file from the file downloaded from the page above:

import csv
import sys

# Some fields in the source file are very large, so lift the default field size limit.
csv.field_size_limit(sys.maxsize)

csvin = open('PaperAuthorAffiliations.txt', 'r')
csvreader = csv.reader(csvin, delimiter='\t')

csvout = open('authoraffiliations.csv', 'w')
csvwrite = csv.writer(csvout, delimiter='\t', lineterminator='\n')

# Header row: the node ID is the AuthorID + AffiliationID concatenation.
csvwrite.writerow(['AuthorAffiliationID:ID(AuthorAffiliation)', 'AAName:String'])

for row in csvreader:
    # AuthorID is column 1, AffiliationID is column 2, affiliation name is column 4.
    csvwrite.writerow([row[1] + row[2], row[4]])

csvin.close()
csvout.close()

tinwelint (Member) commented Aug 25, 2016

OK, by the looks of the stack trace (although it could obviously be anything, since it's an OOM), there are lots and lots of duplicate node IDs in the node input, and the OOM comes from merely collecting them (for a later report). It's possible to distinguish between different groups of node IDs by using ID spaces; see http://neo4j.com/docs/operations-manual/current/deployment/#import-tool-id-spaces.
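For illustration, the documented header syntax for ID spaces puts the group name in the ID column of each node header, and relationship headers reference the same groups. A sketch with made-up column names (this thread's import uses TAB as delimiter; commas are shown here for readability, and the relationship type would come from the --relationships:NAME option as in the command above):

author_header.csv:        authorId:ID(Author),name
affiliation_header.csv:   affiliationId:ID(Affiliation),name
relationship header:      :START_ID(Author),:END_ID(Affiliation)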

Could you simply cut down on the duplicates? There's no reason for them to be there, as they will not be imported anyway.
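One way to do that is to skip IDs that have already been written, so the import tool never sees a duplicate. A hypothetical variation on the script above (not from the thread); note the set needs memory proportional to the number of distinct IDs, so this trades import-time heap for preprocessing memory:

import csv
import sys

csv.field_size_limit(sys.maxsize)

seen = set()  # node IDs already written; later occurrences are skipped

with open('PaperAuthorAffiliations.txt', 'r') as csvin, \
     open('authoraffiliations.csv', 'w') as csvout:
    csvreader = csv.reader(csvin, delimiter='\t')
    csvwrite = csv.writer(csvout, delimiter='\t', lineterminator='\n')
    csvwrite.writerow(['AuthorAffiliationID:ID(AuthorAffiliation)', 'AAName:String'])
    for row in csvreader:
        node_id = row[1] + row[2]
        if node_id not in seen:
            seen.add(node_id)
            csvwrite.writerow([node_id, row[4]])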

tinwelint (Member) commented

@markrcosta any progress on deduplicating those node IDs in the input? Maybe neo4j-import shouldn't keep all those duplicates, but only a sample plus a simple counter instead... would that be helpful for you?
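The proposed change, sketched in Python for illustration (the real collector is Java, and these names are made up): keep only a bounded sample of duplicates plus a running total, instead of an ever-growing array.

class SamplingCollector:
    """Collect a bounded sample of duplicate node IDs plus a total count."""

    def __init__(self, sample_size=10000):
        self.sample = []             # first `sample_size` duplicates, kept for the report
        self.sample_size = sample_size
        self.total = 0               # every duplicate is still counted

    def collect_duplicate_node(self, node_id, source):
        self.total += 1
        if len(self.sample) < self.sample_size:
            self.sample.append((node_id, source))

    def report(self):
        return '%d duplicate nodes (showing first %d)' % (self.total, len(self.sample))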

markrcosta (Author) commented

Mattias,

I was able to get the data into the database by pre-processing it and removing redundancies (using Python dataframes). I was hoping to use the import tool to avoid having to create secondary code, but in this case, a little work on the side greatly sped up the process.

If you're interested in how I handled the situation, you can read about it here: http://www.markcosta.net/load-the-microsoft-academic-graph-into-neo4j/

Thank you for your help.


tinwelint (Member) commented

OK wonderful, great work.

Ideally the import tool should be changed to cope with this; let's see if this can be fixed in upcoming versions.


markrcosta (Author) commented

Mattias,

Ideally, the import tool would work for this :) I had hoped to avoid writing extra code and leverage a database's ability to handle larger data sets in the processing task.

If you get around to fixing this problem, let me know. I can get the data to you or try to run the import on the updated version of the server.


chrisvest (Contributor) commented

The memory usage of neo4j-admin import has been much improved in recent versions, especially in the newly released 3.4, and especially around the handling of duplicate data. I'm going to assume that fixes this issue, but feel free to reopen if it's still a problem.
