
It takes hours to bulk insert a node with many properties into a legacy index #8840

Closed

mschwore opened this issue Feb 16, 2017 · 2 comments

@mschwore commented Feb 16, 2017

Bug Report

My project uses the bulk insert interface to create an embedded Neo4j database in which some nodes have many properties indexed by a legacy full-text index. I have found that upgrading to a Neo4j version that contains #8462 causes our database import process to stall when inserting these nodes with many properties.

I have included demo code that takes many hours to insert a single node.

Neo4j Version: 3.2.0-alpha04 (but any release with #8462 should exhibit this bug)
Operating System: Centos 7.1
API: Embedded Java API

Steps to reproduce

This demo exhibits the problem.

// Imports assume the Neo4j 3.x embedded API and Apache Commons Lang.
import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang3.RandomStringUtils;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.lucene.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserters;

BatchInserter batchNode = BatchInserters.inserter(
    new File(System.getProperty("java.io.tmpdir") + File.separator + "graph.db"));

LuceneBatchInserterIndexProvider provider = new LuceneBatchInserterIndexProvider(batchNode);
BatchInserterIndex batchIndex = provider.nodeIndex("node_auto_index",
    MapUtil.stringMap(
        IndexManager.PROVIDER, "lucene",
        "type", "fulltext"));

Map<String, Object> properties = new HashMap<>();
for (int i = 0; i < 6000; i++)
{
  properties.put(Integer.toString(i), RandomStringUtils.randomAlphabetic(200));
}

long node = batchNode.createNode(properties, Label.label("NODE"));
// This index add is where the program stalls (see the jstack output below).
batchIndex.add(node, properties);

provider.shutdown();
batchNode.shutdown();

Looking at the jstack output, I can see that the program spends most of its time in Document.getFields:

"main" #1 prio=5 os_prio=0 tid=0x00002b241c00c000 nid=0xa6f runnable [0x00002b2419403000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.document.Document.getFields(Document.java:176)
	at org.neo4j.index.impl.lucene.legacy.IndexType.restoreSortFields(IndexType.java:397)
	at org.neo4j.index.impl.lucene.legacy.IndexType.addToDocument(IndexType.java:231)
	at org.neo4j.index.impl.lucene.legacy.LuceneBatchInserterIndex.addSingleProperty(LuceneBatchInserterIndex.java:126)
	at org.neo4j.index.impl.lucene.legacy.LuceneBatchInserterIndex.add(LuceneBatchInserterIndex.java:96)
	at neo4j_test.neo4j_test.Neo4jImport.main(Neo4jImport.java:43)

Looking through the source code, it appears that:

  • Each property is inserted one at a time into LuceneBatchInserterIndex, which calls restoreSortFields.
    • restoreSortFields iterates through each already-inserted field.
      • For each such field, it calls Lucene's getFields method, which again iterates through every already-inserted field.

In the end, inserting this one node causes the fields list in Lucene's Document to be scanned once per (property, prior field) pair, for O(n^3) total work, which takes a very long time.
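The cubic blow-up described above can be sketched with a small, self-contained counter. This is a hypothetical simplification: `CubicInsertDemo`, `countVisits`, and the loop names are illustrative stand-ins, not the actual Neo4j or Lucene internals.

```java
import java.util.ArrayList;
import java.util.List;

public final class CubicInsertDemo {

    // Count field visits when n properties are added one at a time, with each
    // add re-scanning every prior field (the restoreSortFields step) and each
    // scan step re-reading the whole field list (the getFields-style call).
    static long countVisits(int n) {
        long visits = 0;
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < n; i++) {                     // one add per property
            for (int j = 0; j < fields.size(); j++) {     // restoreSortFields loop
                for (int k = 0; k < fields.size(); k++) { // getFields-style scan
                    visits++;
                }
            }
            fields.add("field" + i);
        }
        return visits; // grows as roughly n^3 / 3
    }

    public static void main(String[] args) {
        System.out.println(countVisits(100)); // prints 328350
        // With n = 6000 as in the demo above, this would be ~7.2e10 visits,
        // which is why a single node can take hours to insert.
    }
}
```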

Expected behavior

The properties should be bulk inserted into the index quickly.

Actual behavior

A single large node can take hours to insert, causing database creation to essentially never complete.

@chrisvest (Contributor) commented:

Thank you for the detailed report. We'll take a look.

@MishaDemianenko (Contributor) commented:

Closed by #8887
