
helper procedures to use import-tool headers for node/relationship creation #489

Closed
jexp opened this issue Jul 1, 2017 · 23 comments

jexp (Member) commented Jul 1, 2017

The neo4j-import tool's CSV headers already contain information about labels, id properties, property types, skipped columns, etc.

It would be great to load these CSV files with LOAD CSV and use the header information internally to set labels/relationship types, convert property types, or skip columns.

apoc.load.csv should also support that.

```
load csv from "url" as line
call apoc.import.create/merge/find/update - Node/Relationship(line, n / r) yield node/relationship
```
  • support :LABEL and :TYPE columns
  • skip :IGNORE columns
  • do type conversion like :INTEGER or :BOOLEAN or arrays
  • set (Label) group information as label
  • use :ID properties and :START_ID , :END_ID for lookups / merge
  • make sure there are constraints/indexes for ID properties (fail if there are none)
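The type-conversion bullet could behave roughly like the following sketch (a hypothetical helper, not APOC's actual code); note that the import tool uses `;` as its default array delimiter:

```python
# Hypothetical sketch of converting a raw CSV string according to a
# neo4j-import header type suffix such as INT, BOOLEAN, or INT[].
def convert_value(value, type_name):
    t = type_name.upper()
    if t.endswith("[]"):  # array types, e.g. INT[]; ';' is the default delimiter
        return [convert_value(v, t[:-2]) for v in value.split(";")]
    if t in ("INT", "INTEGER", "LONG", "SHORT", "BYTE"):
        return int(value)
    if t in ("FLOAT", "DOUBLE"):
        return float(value)
    if t == "BOOLEAN":
        return value.lower() == "true"
    return value  # STRING and unknown types are kept as-is
```

For example, `convert_value("1;2;3", "INT[]")` yields a list of integers rather than a single string.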

From @szarnyasg:

:ID(Comment)|id:LONG|creationDate:STRING|locationIP:STRING|browserUsed:STRING|content:STRING|length:INT
274877907725|274877907725|2010-10-17T05:34:03.139+0000|88.80.236.34|Chrome|yes|3

$ bin/neo4j-import --into $DB_DIR --nodes:Message:Comment comment.csv --delimiter '|'
...

Is there a simple way to load this from Cypher? The best approach I can think of is to use

```
LOAD CSV WITH HEADERS FROM ... AS line FIELDTERMINATOR '|'
```

and instantiate nodes like

```
CREATE (:Comment {locationIP: line['locationIP:STRING'], ...})
```

szarnyasg (Contributor) commented Jul 7, 2017

Status report: I started working on this. So far, I have implemented an initial version that generates a Cypher query based on a CSV header. Currently, I am developing the code in a separate project at https://github.com/szarnyasg/cypher-csv-header

  • support :LABEL and :TYPE columns

These are not yet supported.

  • skip :IGNORE columns

OK

  • do type conversion like :INTEGER or :BOOLEAN or arrays

OK, even for arrays

  • set (Label) group information as label

What do you mean by this?

  • use :ID properties and :START_ID , :END_ID for lookups / merge

What should we do if these properties do not have a name? I'd use a name such as csv_internal_id. However, we should also add indexes for these to obtain reasonable performance.

  • make sure there are constraints/indexes for ID properties (fail if there are none)

This is a TODO as well.
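For reference, parsing a single header field such as `:ID(Comment)`, `myId:ID`, or `length:INT` into its parts can be sketched like this (illustrative only; the cypher-csv-header project may parse differently):

```python
import re

# Illustrative parser for one neo4j-import header field of the form
# name:TYPE(group), where every part is optional.
FIELD_RE = re.compile(r"^(?P<name>[^:(]*)(?::(?P<type>[^(]+))?(?:\((?P<group>[^)]+)\))?$")

def parse_field(field):
    m = FIELD_RE.match(field)
    return (m.group("name") or None,   # property name, None if absent
            m.group("type") or None,   # e.g. ID, INT, STRING
            m.group("group") or None)  # id space / group, e.g. Comment
```

So `parse_field(":ID(Comment)")` returns no property name, the type `ID`, and the group `Comment`.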

jexp (Member) commented Jul 11, 2017

@szarnyasg thanks a lot for the update

set (Label) group information as label
-> use the group header information as primary label.

use :ID properties and :START_ID , :END_ID for lookups / merge
we can create an artificial name that we can choose to keep or remove later, something like _id or just _

support :LABEL and :TYPE columns
These are not yet supported.

This could be done by a post-processing step where we aggregate the nodes/rels by label/type and set it manually or with a Cypher statement with UNWIND.

btw. we have to make sure not to exceed memory limits for tx sizes.

So it probably makes most sense to run this in batches, i.e. pass batch-size rows to a separate tx (in a separate thread) to create the data.
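The batch-wise transaction idea can be sketched as a simple chunking step (illustrative; each yielded batch would then be written in its own transaction):

```python
# Split an iterable of CSV rows into fixed-size batches; each batch would
# then be committed in its own transaction, possibly on a separate thread.
def batches(rows, batch_size):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch
```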

jexp (Member) commented Jul 25, 2017

@szarnyasg how is it going?

szarnyasg (Contributor) commented Jul 25, 2017

@jexp sorry, I went missing for some time. It's still on my to-do list, but I took some vacation and did some research/proposal/paper writing recently. I will get back to it this week; a working prototype is not far off, I hope.

szarnyasg (Contributor) commented:

@jexp I got back to working on this recently and was joined by my Master's student, @hegyibalint. One thing that occurred to us is that the approach is not entirely clear yet.

Our original understanding was that we should generate a LOAD CSV Cypher query and execute it. However, your first post suggests that we should extend the functionality of apoc.load.csv.

jexp (Member) commented Aug 20, 2017

I think the idea was to allow an efficient import similar to neo4j-import, just as a procedure?

I.e. the procedure would take full control of creating the data (nodes and rels), constraints, etc.

szarnyasg (Contributor) commented Aug 26, 2017

We have an initial implementation in my fork: https://github.com/szarnyasg/neo4j-apoc-procedures/commit/d3fcdc1c50737028439a15d441ee8ef60ce75c73

We used the apoc.export.csv package, similarly to the GraphML and Cypher exports. (I believe export was used because import is a Java keyword and cannot be used as a package name.) We considered using the existing apoc.load.csv package as well, but that is intended for simply loading a CSV's rows, not importing a graph from it.

One thing that is not entirely clear to us is how the progress reporting should be handled. Our current approach is a quick-and-dirty hack: it uses a progress.csv file to save the results.

https://github.com/szarnyasg/neo4j-apoc-procedures/blob/d3fcdc1c50737028439a15d441ee8ef60ce75c73/src/main/java/apoc/export/csv/ImportCSV.java#L73

As the loading is carried out by an internal Cypher command*, it can be difficult to track progress. Maybe we should use RETURN count(*) in LOAD CSV to summarise current progress?

* Actually, the LOAD CSV command can also call APOC to create nodes/rels with dynamic labels/types. This means that we cannot use the QueryStatistics class.

jexp (Member) commented Aug 26, 2017

I'll have a look.

Just quickly from your comments.
Good idea with the progress file. You can also write progress to the log.
Unfortunately you cannot stream (yet) to the client.

Not sure if we had a misunderstanding with "export"?
You don't need to use the package name; each procedure has its own name override via the annotation.

If you generate the LOAD CSV anyway, you could also generate a multi-pass for different labels / rel-types, or generate the labels/rel-types inline.

Yes I know the query statistics bit is quite annoying.

You can at least use apoc.create.addLabels, so the node-counts are correct.
And then manually count the created relationships with return count(r) or count(distinct r).
Although large aggregations are expensive too.

szarnyasg (Contributor) commented:

Not sure if we had a misunderstanding with "export"?
You don't need to use the package name; each procedure has its own name override via the annotation.

I found out that it's possible to override the package name for a given procedure, such as apoc.import.graphml, and used this approach for the CSV loader as well. However, I think this is quite confusing to newcomers, as it makes the code more difficult to navigate.

szarnyasg (Contributor) commented Aug 26, 2017

One thing that occurred to me though is that we could simplify the code a bit by requiring all files to have a header; this is currently optional. If a certain CSV file does not have a header, users can still use apoc.load.csv.

As we have to access the file anyway (to get the number of lines), this is quite straightforward to implement. Also, it makes the public API simpler, as the user does not have to provide the header in the CALL command (the internal API can still expose a method that takes the header as an argument and generates a query from it -- we actually use this for ingraph).

szarnyasg (Contributor) commented:

@jexp I have a conceptual question. In your opening post, you mentioned that:

make sure there are constraints/indexes for ID properties (fail if there are none)

AFAIK, indexes can only be defined for a certain label, so the user has to create the appropriate ID index for all labels. I tried to create them from APOC, but I don't think this is possible due to the "Cannot perform schema updates in a transaction that has performed data updates." limitation.

szarnyasg (Contributor) commented:

My previous question leads to another, more important one: can we start new (and parallel) transactions from a transaction calling an APOC procedure? I am starting new transactions for all batches with db.beginTx(), but I am not sure they have any effect.

szarnyasg (Contributor) commented:

Some new lessons learnt:

  • The Neo4j import tool ignores the name of the :ID (and :START_ID, :END_ID) fields, so for myId:ID, myId is discarded.
  • For the APOC procedure, users should be able to specify the name of the id. This occurred to me while using this procedure in the Train Benchmark.

jexp (Member) commented Aug 27, 2017

Really? That shouldn't be the case for :ID; at least it didn't use to be.

jexp (Member) commented Aug 27, 2017

Yes, you can start new transactions in new threads (but you need to inject GraphDatabaseAPI); see the examples here showing which Util methods help:

https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/3.2/src/main/java/apoc/stats/DegreeDistribution.java#L101

Those background tx could even be used to create the schema indexes upfront.

szarnyasg (Contributor) commented:

Really? That shouldn't be the case for :ID; at least it didn't use to be.
You're right, I mixed things up. The names for :START_ID and :END_ID are omitted, not the name for :ID.

Given the file id-test.csv:

myId:ID,name:STRING
1,John
2,Jane

And the following import sequence:

rm -rf data/databases/graph.db/
bin/neo4j-import --into data/databases/graph.db --nodes:Person import/id-test.csv --relationships:KNOWS import/rel-test.csv

It works with both rel-test.csv files:

:START_ID,:END_ID
1,2

or

myId:START_ID,myId:END_ID
1,2

jexp (Member) commented Aug 28, 2017

Ya, it doesn't need the id-names.
But it stores them afaik if you provide them.

jexp (Member) commented Aug 28, 2017

btw. something that might be relevant here too is groups. To handle conflicting id-spaces.
Where Group =~= Label

i.e.

myId:ID(Person),name:STRING
1,John
2,Jane

:ID(Company),name:STRING
1,eBay
2,neo4j

:START_ID(Person),:END_ID(Company)
1,2
2,1

jexp (Member) commented Aug 28, 2017

Feel free to send your commit as a PR

szarnyasg (Contributor) commented Aug 29, 2017

The example is getting bigger (4 CSV files now), so I created a repository for it: https://github.com/szarnyasg/neo4j-csv-ids

Here is how the CSVs look:

==> persons.csv <==
personId:ID,name
1,John
2,Jane

==> companies.csv <==
companyId:ID,name
4,Neo4j

==> unis.csv <==
uniId:ID,name
3,TU Munich

==> affiliated-with.csv <==
:START_ID,:END_ID
1,3
2,4

The problem is that if we respect the ID attributes' names (personId, companyId, uniId), we will not be able to insert the AFFILIATED_WITH relationships. So my idea is the following: if we have user-specified ID names, we should duplicate the ID field:

  1. have an internal id (currently __csv_id) and
  2. also create the user-specified field (e.g. personId)

I believe this will also work with id spaces.
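The proposed duplication can be sketched as follows (the __csv_id name comes from the discussion above; the dict-based node and the function name are purely illustrative):

```python
# Build a node's property map carrying both the internal id (used for
# relationship lookups) and the user-specified id field; the lookup key
# includes the id space so conflicting ids stay separate.
def make_node(raw_id, id_field_name, props, id_space=None):
    node = dict(props)
    node["__csv_id"] = raw_id
    if id_field_name:                 # e.g. personId, companyId, uniId
        node[id_field_name] = raw_id
    return (id_space, raw_id), node
```

With id spaces in the key, a Person with raw id 1 and a Company with raw id 1 no longer collide.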

szarnyasg (Contributor) commented Aug 30, 2017

szarnyasg (Contributor) commented Nov 15, 2017

(I originally added this comment to the PR, but it rather belongs here.)

@jexp I finally got back to this. I revamped the code a bit and now have one fundamental question left: how should we pass the URI of the CSV file to the procedure? Let me elaborate on that one a bit.

The procedure needs to use the filename at least twice:

  1. look into the file to read the CSV header: https://github.com/neo4j-contrib/neo4j-apoc-procedures/pull/581/files#diff-df1b88452931fbee58e22a3f1d27b3a8R181
  2. pass it to LOAD CSV, which will use it to open and read the file: https://github.com/neo4j-contrib/neo4j-apoc-procedures/pull/581/files#diff-df1b88452931fbee58e22a3f1d27b3a8R193

The file URI has to be relative to the specified import directory (related blog post).

There is a surprising contradiction in this:

  • if we pass a relative URI, step 1 will not work: to determine the location of the file, we'd have to know the value of the dbms.directories.import setting
  • if we pass an absolute URI, step 2 will not work: we'd have to produce a URI that is relative to the dbms.directories.import directory

So in both cases, the dbms.directories.import directory is required. Is there a way for APOC procedures to access that directory?
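Once the import directory is known, each view of the file can be derived from the other; a minimal pathlib sketch (the directory value is only an example, and the function name is hypothetical):

```python
from pathlib import PurePosixPath

# Given the dbms.directories.import setting, derive both the absolute path
# (needed to read the header ourselves) and the import-relative path
# (needed by LOAD CSV).
def resolve(import_dir, uri):
    d = PurePosixPath(import_dir)
    p = PurePosixPath(uri)
    absolute = p if p.is_absolute() else d / p
    relative = absolute.relative_to(d)
    return absolute, relative
```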

szarnyasg (Contributor) commented:

Okay, it seems I almost reinvented the wheel here. FileUtils should take care of those issues.

jexp mentioned this issue Jun 24, 2018
jexp pushed a commit that referenced this issue Jul 21, 2018
…ort tool's specification (#581)

* Add 'apoc.import.csv' procedure

This allows users to load graphs to an online database from CSVs
with headers that follow the conventions of the neo4j-import tool

* Introduce batch transactions to 'apoc.import.csv'
jexp closed this as completed in 989ac0e Jul 23, 2018
jexp pushed a commit that referenced this issue Jul 23, 2018