
It takes hours to bulk insert a node with many properties into a legacy index #8840

Closed

mschwore opened this issue Feb 16, 2017 · 2 comments

@mschwore commented Feb 16, 2017

Bug Report

My project uses the bulk insert interface to create an embedded Neo4j database in which some nodes have many properties indexed by a legacy full-text index. I have found that upgrading to a Neo4j version that contains #8462 causes our database import process to stall when inserting these nodes with many properties.

I have included demo code that takes many hours to insert a single node.

Neo4j Version: 3.2.0-alpha04 (but any release with #8462 should exhibit this bug)
Operating System: Centos 7.1
API: Embedded Java API

Steps to reproduce

This demo exhibits the problem.

// Imports assume the Neo4j 3.x embedded API and Apache Commons Lang.
import java.io.File;
import java.util.HashMap;
import java.util.Map;

import org.apache.commons.lang3.RandomStringUtils;
import org.neo4j.graphdb.Label;
import org.neo4j.graphdb.index.IndexManager;
import org.neo4j.helpers.collection.MapUtil;
import org.neo4j.index.lucene.unsafe.batchinsert.LuceneBatchInserterIndexProvider;
import org.neo4j.unsafe.batchinsert.BatchInserter;
import org.neo4j.unsafe.batchinsert.BatchInserterIndex;
import org.neo4j.unsafe.batchinsert.BatchInserters;

BatchInserter batchNode = BatchInserters.inserter(
    new File(System.getProperty("java.io.tmpdir") + File.separator + "graph.db"));

LuceneBatchInserterIndexProvider provider = new LuceneBatchInserterIndexProvider(batchNode);
BatchInserterIndex batchIndex = provider.nodeIndex("node_auto_index",
    MapUtil.stringMap(
        IndexManager.PROVIDER, "lucene",
        "type", "fulltext"));

Map<String, Object> properties = new HashMap<>();
for (int i = 0; i < 6000; i++)
{
  properties.put(Integer.toString(i), RandomStringUtils.randomAlphabetic(200));
}

long node = batchNode.createNode(properties, Label.label("NODE"));
// This index add is where the program stalls (see the jstack output below).
batchIndex.add(node, properties);

provider.shutdown();
batchNode.shutdown();

Looking at the jstack output, I can see that the program spends most of its time in Document.getFields:

"main" #1 prio=5 os_prio=0 tid=0x00002b241c00c000 nid=0xa6f runnable [0x00002b2419403000]
   java.lang.Thread.State: RUNNABLE
	at org.apache.lucene.document.Document.getFields(Document.java:176)
	at org.neo4j.index.impl.lucene.legacy.IndexType.restoreSortFields(IndexType.java:397)
	at org.neo4j.index.impl.lucene.legacy.IndexType.addToDocument(IndexType.java:231)
	at org.neo4j.index.impl.lucene.legacy.LuceneBatchInserterIndex.addSingleProperty(LuceneBatchInserterIndex.java:126)
	at org.neo4j.index.impl.lucene.legacy.LuceneBatchInserterIndex.add(LuceneBatchInserterIndex.java:96)
	at neo4j_test.neo4j_test.Neo4jImport.main(Neo4jImport.java:43)

Looking through the source code, it appears that:

  • Each property is inserted one at a time into LuceneBatchInserterIndex, which calls restoreSortFields.
    • restoreSortFields iterates through each already-inserted field.
      • For each such field, it calls Lucene's getFields method, which again iterates through every already-inserted field.

In the end, inserting this one node causes the fields list in Lucene's Document to be scanned once per (property, prior field) pair, for O(n^3) total work, which takes a very long time.
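The cubic blow-up described above can be sketched with a small, self-contained counter. This is a hypothetical simplification: `CubicInsertDemo`, `countVisits`, and the loop names are illustrative stand-ins, not the actual Neo4j or Lucene internals.

```java
import java.util.ArrayList;
import java.util.List;

public final class CubicInsertDemo {

    // Count field visits when n properties are added one at a time, with each
    // add re-scanning every prior field (the restoreSortFields step) and each
    // scan step re-reading the whole field list (the getFields-style call).
    static long countVisits(int n) {
        long visits = 0;
        List<String> fields = new ArrayList<>();
        for (int i = 0; i < n; i++) {                     // one add per property
            for (int j = 0; j < fields.size(); j++) {     // restoreSortFields loop
                for (int k = 0; k < fields.size(); k++) { // getFields-style scan
                    visits++;
                }
            }
            fields.add("field" + i);
        }
        return visits; // grows as roughly n^3 / 3
    }

    public static void main(String[] args) {
        System.out.println(countVisits(100)); // prints 328350
        // With n = 6000 as in the demo above, this would be ~7.2e10 visits,
        // which is why a single node can take hours to insert.
    }
}
```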

Expected behavior

The properties should be bulk inserted into the index quickly.

Actual behavior

A single large node can take hours to insert, causing database creation to essentially never complete.

@chrisvest (Contributor) commented:

Thank you for the detailed report. We'll take a look.

@MishaDemianenko (Contributor) commented:

Closed by #8887
