
neo4j-import out of memory error #7772

Closed
markrcosta opened this issue Aug 22, 2016 · 9 comments

markrcosta commented Aug 22, 2016

bug report

The neo4j-import tool in version 3.0.4 runs out of memory, and it does not respect the Java max heap setting configured at the OS level (i.e., it reports a max heap of 20.98 GB when the max heap is set to 72 GB).

Neo4j Version: 3.0.4
Operating System: Ubuntu 14.10
API: neo4j-import

Steps to reproduce

I am trying to import a set of relatively large CSV files related to the Microsoft Academic Graph. When I run the neo4j-import tool, it selects a maximum Java heap size of about 21 GB on its own, even though I have the max heap size on my machine set to ~72 GB. After the import tool runs for some time, I get a java.lang.OutOfMemoryError: Java heap space.
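For context: when no explicit -Xmx is given, the JVM picks a default max heap based on the machine's RAM ("ergonomics"), which is consistent with the ~21 GB figure reported below. The first command shows that default; the second is a hypothetical override that assumes the neo4j-import wrapper script passes JAVA_OPTS through to the JVM (whether it does varies by Neo4j version):

# Show the max heap the JVM would choose by default on this machine
java -XX:+PrintFlagsFinal -version | grep -i maxheapsize

# Hypothetical override -- only works if the wrapper honors JAVA_OPTS
JAVA_OPTS="-Xmx64g" neo4j-import --into graph.db ...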

Expected behavior

The import tool should handle large data sets without running out of memory.

Actual behavior

The import tool aborts when the heap space is exhausted. I've pasted the output of the import below.

Neo4j version: 3.0.4
Importing the contents of these files into graph.db:
Nodes:
:Conference
/usr/share/neo4j/import/conference_header.csv
/usr/share/neo4j/import/Conferences.txt

:Author
/usr/share/neo4j/import/author_header.csv
/usr/share/neo4j/import/Authors.txt

:FoS
/usr/share/neo4j/import/fos_header.csv
/usr/share/neo4j/import/fieldsofstudy.csv

:Journal
/usr/share/neo4j/import/journal_header.csv
/usr/share/neo4j/import/Journals.txt

:Paper
/usr/share/neo4j/import/paper_header.csv
/usr/share/neo4j/import/Papers.txt

:Affiliation
/usr/share/neo4j/import/affiliation_header.csv
/usr/share/neo4j/import/Affiliations.txt

:AuthorAffiliation
/usr/share/neo4j/import/authoraffiliations.csv
Relationships:
:parent_of
/usr/share/neo4j/import/fos_hierarchy_header.csv
/usr/share/neo4j/import/foshierarchy.csv

:about
/usr/share/neo4j/import/paper_keyword_header.csv
/usr/share/neo4j/import/PaperKeywords.txt

:cited
/usr/share/neo4j/import/paper_references_header.csv
/usr/share/neo4j/import/PaperReferences.txt

:wrote
/usr/share/neo4j/import/wrote_header.csv
/usr/share/neo4j/import/PaperAuthorAffiliations.txt

:wrote_while_at
/usr/share/neo4j/import/authaffilpaper.csv

:affiliated_with
/usr/share/neo4j/import/affiliatedwith.csv

:published_in
/usr/share/neo4j/import/paperjournal.csv
/usr/share/neo4j/import/paperconference.csv

Available memory:
Free machine memory: 65.03 GB
Max heap memory : 20.98 GB
....
*DEDUPLICATE:10.91 GB------------------------------------------------------------------------] 193M
Exception in thread "Thread-581" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:113)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:589)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:493)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:281)
at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.Arrays.copyOf(Arrays.java:3308)
at org.neo4j.unsafe.impl.batchimport.input.BadCollector.collectDuplicateNode(BadCollector.java:113)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.detectDuplicateInputIds(EncodingIdMapper.java:589)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.buildCollisionInfo(EncodingIdMapper.java:493)
at org.neo4j.unsafe.impl.batchimport.cache.idmapping.string.EncodingIdMapper.prepare(EncodingIdMapper.java:281)
at org.neo4j.unsafe.impl.batchimport.IdMapperPreparationStep.process(IdMapperPreparationStep.java:54)
at org.neo4j.unsafe.impl.batchimport.staging.LonelyProcessingStep$1.run(LonelyProcessingStep.java:56)

jexp (Member) commented Aug 22, 2016

Can you also share the command line that you ran and some information about the files you are importing?

Since it fails while handling duplicates: do you have a lot of duplicate nodes in your files? Like > 100M?
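One way to answer that before re-running the import is to count repeated IDs in a node file up front. A minimal sketch (a hypothetical helper, not from the thread; it assumes a tab-separated node file whose first column is the ID):

import csv
import sys
from collections import Counter

csv.field_size_limit(sys.maxsize)

# Count how often each node ID (first column) appears.
counts = Counter()
with open('authoraffiliations.csv', newline='') as f:
    reader = csv.reader(f, delimiter='\t')
    next(reader)  # skip the header row
    for row in reader:
        counts[row[0]] += 1

# Every occurrence beyond the first is a duplicate the import tool must collect.
duplicates = sum(c - 1 for c in counts.values())
print('%d duplicate ids out of %d data rows' % (duplicates, sum(counts.values())))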

markrcosta (Author) commented

Here is the command used to run the import:
nohup neo4j-import --into graph.db --id-type string --delimiter TAB --skip-bad-relationships true --skip-bad-nodes true --skip-duplicate-nodes true --ignore-empty-strings true --ignore-extra-columns true --bad-tolerance 999999999 --processors 2 --stacktrace true --nodes:Conference "/usr/share/neo4j/import/conference_header.csv,/usr/share/neo4j/import/Conferences.txt" --nodes:Author "/usr/share/neo4j/import/author_header.csv,/usr/share/neo4j/import/Authors.txt" --nodes:FoS "/usr/share/neo4j/import/fos_header.csv,/usr/share/neo4j/import/fieldsofstudy.csv" --nodes:Journal "/usr/share/neo4j/import/journal_header.csv,/usr/share/neo4j/import/Journals.txt" --nodes:Paper "/usr/share/neo4j/import/paper_header.csv,/usr/share/neo4j/import/Papers.txt" --nodes:Affiliation "/usr/share/neo4j/import/affiliation_header.csv,/usr/share/neo4j/import/Affiliations.txt" --nodes:AuthorAffiliation /usr/share/neo4j/import/authoraffiliations.csv --relationships:parent_of "/usr/share/neo4j/import/fos_hierarchy_header.csv,/usr/share/neo4j/import/foshierarchy.csv" --relationships:about "/usr/share/neo4j/import/paper_keyword_header.csv,/usr/share/neo4j/import/PaperKeywords.txt" --relationships:cited "/usr/share/neo4j/import/paper_references_header.csv,/usr/share/neo4j/import/PaperReferences.txt" --relationships:wrote "/usr/share/neo4j/import/wrote_header.csv,/usr/share/neo4j/import/PaperAuthorAffiliations.txt" --relationships:wrote_while_at /usr/share/neo4j/import/authaffilpaper.csv --relationships:affiliated_with /usr/share/neo4j/import/affiliatedwith.csv --relationships:published_in "/usr/share/neo4j/import/paperjournal.csv,/usr/share/neo4j/import/paperconference.csv"

markrcosta (Author) commented Aug 22, 2016

The largest deduplication effort comes from concatenating the AuthorID and AffiliationID fields of the PaperAuthorAffiliations.txt file, available here: https://academicgraph.blob.core.windows.net/graph-2016-02-05/index.html. What I am trying to do is model a hyperedge relationship, using the deduplication feature to create unique nodes for author-institution pairs.

Here is the code that creates the import CSV file from the file downloaded from the page above:

import csv
import sys

# Some fields in the source file are very large, so lift the default field size limit.
csv.field_size_limit(sys.maxsize)

csvin = open('PaperAuthorAffiliations.txt', 'r')
csvreader = csv.reader(csvin, delimiter='\t')

csvout = open('authoraffiliations.csv', 'w')
csvwrite = csv.writer(csvout, delimiter='\t', lineterminator='\n')

# Header row: the node ID is the AuthorID + AffiliationID concatenation.
csvwrite.writerow(['AuthorAffiliationID:ID(AuthorAffiliation)', 'AAName:String'])

for row in csvreader:
    # AuthorID is column 1, AffiliationID is column 2, affiliation name is column 4.
    csvwrite.writerow([row[1] + row[2], row[4]])

csvin.close()
csvout.close()

tinwelint (Member) commented Aug 25, 2016

OK, by the looks of the stack trace (although it could obviously be anything, since it's an OOM), there are lots and lots of duplicate node IDs in the node input, and the OOM comes from merely collecting them (for a later report). It's possible to distinguish between different groups of node IDs by using ID spaces; see http://neo4j.com/docs/operations-manual/current/deployment/#import-tool-id-spaces.
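For illustration, the documented header syntax for ID spaces puts the group name in the ID column of each node header, and relationship headers reference the same groups. A sketch with made-up column names (this thread's import uses TAB as delimiter; commas are shown here for readability, and the relationship type would come from the --relationships:NAME option as in the command above):

author_header.csv:        authorId:ID(Author),name
affiliation_header.csv:   affiliationId:ID(Affiliation),name
relationship header:      :START_ID(Author),:END_ID(Affiliation)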

Could you simply cut down on the duplicates? There's no reason for them to be there, as they will not be imported anyway.
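One way to do that is to skip IDs that have already been written, so the import tool never sees a duplicate. A hypothetical variation on the script above (not from the thread); note the set needs memory proportional to the number of distinct IDs, so this trades import-time heap for preprocessing memory:

import csv
import sys

csv.field_size_limit(sys.maxsize)

seen = set()  # node IDs already written; later occurrences are skipped

with open('PaperAuthorAffiliations.txt', 'r') as csvin, \
     open('authoraffiliations.csv', 'w') as csvout:
    csvreader = csv.reader(csvin, delimiter='\t')
    csvwrite = csv.writer(csvout, delimiter='\t', lineterminator='\n')
    csvwrite.writerow(['AuthorAffiliationID:ID(AuthorAffiliation)', 'AAName:String'])
    for row in csvreader:
        node_id = row[1] + row[2]
        if node_id not in seen:
            seen.add(node_id)
            csvwrite.writerow([node_id, row[4]])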

tinwelint (Member) commented

@markrcosta any progress on deduplicating those node IDs in the input? Maybe neo4j-import shouldn't keep all those duplicates, but only a sample plus a simple counter instead... would that be helpful for you?
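The proposed change, sketched in Python for illustration (the real collector is Java, and these names are made up): keep only a bounded sample of duplicates plus a running total, instead of an ever-growing array.

class SamplingCollector:
    """Collect a bounded sample of duplicate node IDs plus a total count."""

    def __init__(self, sample_size=10000):
        self.sample = []             # first `sample_size` duplicates, kept for the report
        self.sample_size = sample_size
        self.total = 0               # every duplicate is still counted

    def collect_duplicate_node(self, node_id, source):
        self.total += 1
        if len(self.sample) < self.sample_size:
            self.sample.append((node_id, source))

    def report(self):
        return '%d duplicate nodes (showing first %d)' % (self.total, len(self.sample))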

markrcosta (Author) commented

Mattias,

I was able to get the data into the database by pre-processing it and removing redundancies (using Python dataframes). I was hoping to use the import tool to avoid having to create secondary code, but in this case, a little work on the side greatly sped up the process.

If you're interested in how I handled the situation, you can read about it here: http://www.markcosta.net/load-the-microsoft-academic-graph-into-neo4j/

Thank you for your help.


tinwelint (Member) commented

OK wonderful, great work.

Ideally the import tool should be changed to cope with this; let's see if this can be fixed in upcoming versions.


markrcosta (Author) commented

Mattias,

Ideally, the import tool would work for this :) I had hoped to avoid writing extra code and leverage a database's ability to handle larger data sets in the processing task.

If you get around to fixing this problem, let me know. I can get the data to you or try to run the import on the updated version of the server.


chrisvest (Contributor) commented

The memory usage of neo4j-admin import has been much improved in recent versions, especially in the newly released 3.4, and especially around the handling of duplicate data. I'm going to assume that fixes this issue, but feel free to reopen if it's still a problem.
