
helper procedures to use import-tool headers for node/relationship creation #489

Closed
jexp opened this issue Jul 1, 2017 · 23 comments

jexp (Member) commented Jul 1, 2017

The neo4j-import tool's CSV headers already contain information about labels, id properties, property types, skipped columns, etc.

It would be great to load these CSV files with LOAD CSV and use the header information internally to set labels/relationship types, convert property types, or skip columns.

apoc.load.csv should also support that.

```
load csv from "url" as line
call apoc.import.create/merge/find/update - Node/Relationship(line, n / r) yield node/relationship
```
  • support :LABEL and :TYPE columns
  • skip :IGNORE columns
  • do type conversion like :INTEGER or :BOOLEAN or arrays
  • set (Label) group information as label
  • use :ID properties and :START_ID , :END_ID for lookups / merge
  • make sure there are constraints/indexes for ID properties (fail if there are none)
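The type-conversion bullet could behave roughly like the following sketch (a hypothetical helper, not APOC's actual code); note that the import tool uses `;` as its default array delimiter:

```python
# Hypothetical sketch of converting a raw CSV string according to a
# neo4j-import header type suffix such as INT, BOOLEAN, or INT[].
def convert_value(value, type_name):
    t = type_name.upper()
    if t.endswith("[]"):  # array types, e.g. INT[]; ';' is the default delimiter
        return [convert_value(v, t[:-2]) for v in value.split(";")]
    if t in ("INT", "INTEGER", "LONG", "SHORT", "BYTE"):
        return int(value)
    if t in ("FLOAT", "DOUBLE"):
        return float(value)
    if t == "BOOLEAN":
        return value.lower() == "true"
    return value  # STRING and unknown types are kept as-is
```

For example, `convert_value("1;2;3", "INT[]")` yields a list of integers rather than a single string.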

From @szarnyasg:

:ID(Comment)|id:LONG|creationDate:STRING|locationIP:STRING|browserUsed:STRING|content:STRING|length:INT
274877907725|274877907725|2010-10-17T05:34:03.139+0000|88.80.236.34|Chrome|yes|3

$ bin/neo4j-import --into $DB_DIR --nodes:Message:Comment comment.csv --delimiter '|'
...

Is there a simple way to load this from Cypher? The best approach I can think of is to use

```
LOAD CSV WITH HEADERS FROM ... AS line FIELDTERMINATOR '|'
```

and instantiate nodes like

```
CREATE (:Comment {locationIP: line['locationIP:STRING'], ...})
```

szarnyasg (Contributor) commented Jul 7, 2017

Status report: I started working on this. So far, I have implemented an initial version that generates a Cypher query based on a CSV header. Currently, I am developing the code in a separate project at https://github.com/szarnyasg/cypher-csv-header

  • support :LABEL and :TYPE columns

These are not yet supported.

  • skip :IGNORE columns

OK

  • do type conversion like :INTEGER or :BOOLEAN or arrays

OK, even for arrays

  • set (Label) group information as label

What do you mean by this?

  • use :ID properties and :START_ID , :END_ID for lookups / merge

What should we do if these properties do not have a name? I'd use a name such as csv_internal_id. However, we should also add indexes for these to obtain reasonable performance.

  • make sure there are constraints/indexes for ID properties (fail if there are none)

This is a TODO as well.
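For reference, parsing a single header field such as `:ID(Comment)`, `myId:ID`, or `length:INT` into its parts can be sketched like this (illustrative only; the cypher-csv-header project may parse differently):

```python
import re

# Illustrative parser for one neo4j-import header field of the form
# name:TYPE(group), where every part is optional.
FIELD_RE = re.compile(r"^(?P<name>[^:(]*)(?::(?P<type>[^(]+))?(?:\((?P<group>[^)]+)\))?$")

def parse_field(field):
    m = FIELD_RE.match(field)
    return (m.group("name") or None,   # property name, None if absent
            m.group("type") or None,   # e.g. ID, INT, STRING
            m.group("group") or None)  # id space / group, e.g. Comment
```

So `parse_field(":ID(Comment)")` returns no property name, the type `ID`, and the group `Comment`.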

jexp (Member) commented Jul 11, 2017

@szarnyasg thanks a lot for the update

set (Label) group information as label
-> use the group header information as primary label.

use :ID properties and :START_ID , :END_ID for lookups / merge
we can create an artificial name that we can choose to keep or remove later, something like _id or just _

support :LABEL and :TYPE columns
These are not yet supported.

This could be done by a post-processing step where we aggregate the nodes/rels by label/type and set it manually or with a Cypher statement with UNWIND.

btw. we have to make sure not to exceed memory limits for tx sizes.

So it probably makes most sense to run this in batches, i.e. pass batch-size rows to a separate tx (in a separate thread) to create the data.
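The batch-wise transaction idea can be sketched as a simple chunking step (illustrative; each yielded batch would then be written in its own transaction):

```python
# Split an iterable of CSV rows into fixed-size batches; each batch would
# then be committed in its own transaction, possibly on a separate thread.
def batches(rows, batch_size):
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller batch
        yield batch
```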

jexp (Member) commented Jul 25, 2017

@szarnyasg how is it going?

szarnyasg (Contributor) commented Jul 25, 2017

@jexp sorry, I went missing for some time. It's still on my to-do list, but I took some vacation and did some research/proposal/paper writing recently. I will get back to it this week; a working prototype is not far off, I hope.

szarnyasg (Contributor) commented:

@jexp I got back to working on this recently and was joined by my Master's student, @hegyibalint. One thing that occurred to us is that the approach is not entirely clear yet.

Our original understanding was that we should generate a LOAD CSV Cypher query and execute it. However, your first post suggests that we should extend the functionality of apoc.load.csv.

jexp (Member) commented Aug 20, 2017

I think the idea was to allow an efficient import similar to neo4j-import, just as a procedure?

I.e. the procedure would take full control of creating the data (nodes and rels), constraints, etc.

szarnyasg (Contributor) commented Aug 26, 2017

We have an initial implementation in my fork: https://github.com/szarnyasg/neo4j-apoc-procedures/commit/d3fcdc1c50737028439a15d441ee8ef60ce75c73

We used the apoc.export.csv package, similarly to the GraphML and Cypher exports. (I believe export was used because import is a Java keyword and cannot be used as a package name.) We considered using the existing apoc.load.csv package as well, but that is intended for simply loading a CSV's rows, not importing a graph from it.

One thing that is not entirely clear to us is how the progress reporting should be handled. Our current approach is a quick-and-dirty hack: it uses a progress.csv file to save the results.

https://github.com/szarnyasg/neo4j-apoc-procedures/blob/d3fcdc1c50737028439a15d441ee8ef60ce75c73/src/main/java/apoc/export/csv/ImportCSV.java#L73

As the loading is carried out by an internal Cypher command*, it can be difficult to track progress. Maybe we should use RETURN count(*) in LOAD CSV to summarise current progress?

* Actually, the LOAD CSV command can also call APOC to create nodes/rels with dynamic labels/types. This means that we cannot use the QueryStatistics class.

jexp (Member) commented Aug 26, 2017

I'll have a look.

Just quickly from your comments.
Good idea with the progress file. You can also write progress to the log.
Unfortunately you cannot stream (yet) to the client.

Not sure if we had a misunderstanding with "export"?
You don't need to use the package name; each procedure has its own name override via the annotation.

If you generate the LOAD CSV anyway, you could also generate a multi-pass for different labels / rel-types, or generate the labels/rel-types inline.

Yes I know the query statistics bit is quite annoying.

You can at least use apoc.create.addLabels, so the node-counts are correct.
And then manually count the created relationships with return count(r) or count(distinct r).
Although large aggregations are expensive too.

szarnyasg (Contributor) commented:

Not sure if we had a misunderstanding with "export"?
You don't need to use the package name; each procedure has its own name override via the annotation.

I found out that it's possible to override the package name for a given procedure, such as apoc.import.graphml, and used this approach for the CSV loader as well. However, I think this is quite confusing to newcomers, as it makes the code more difficult to navigate.

szarnyasg (Contributor) commented Aug 26, 2017

One thing that occurred to me though is that we could simplify the code a bit by requiring all files to have a header; this is currently optional. If a certain CSV file does not have a header, users can still use apoc.load.csv.

As we have to access the file anyway (to get the number of lines), this is quite straightforward to implement. Also, it makes the public API simpler, as the user does not have to provide the header in the CALL command (the internal API can still expose a method that takes the header as an argument and generates a query from it -- we actually use this for ingraph).

szarnyasg (Contributor) commented:

@jexp I have a conceptual question. In your opening post, you mentioned that:

make sure there are constraints/indexes for ID properties (fail if there are none)

AFAIK, indexes can only be defined for a certain label, so the user has to create the appropriate ID index for all labels. I tried to create them from APOC, but I don't think this is possible due to the "Cannot perform schema updates in a transaction that has performed data updates." limitation.

szarnyasg (Contributor) commented:

My previous question leads to another, more important one: can we start new (and parallel) transactions from a transaction calling an APOC procedure? I am starting new transactions for all batches with db.beginTx(), but I am not sure they have any effect.

szarnyasg (Contributor) commented:

Some new lessons learnt:

  • The Neo4j import tool ignores the name of the :ID (and :START_ID, :END_ID) fields, so for myId:ID, myId is discarded.
  • For the APOC procedure, users should be able to specify the name of the id. This occurred to me while using this procedure in the Train Benchmark.

jexp (Member) commented Aug 27, 2017

Really? That shouldn't be the case for :ID; at least it didn't use to be.

jexp (Member) commented Aug 27, 2017

Yes, you can start new transactions in new threads (but you need to inject GraphDatabaseAPI); see the examples here showing which Util methods help:

https://github.com/neo4j-contrib/neo4j-apoc-procedures/blob/3.2/src/main/java/apoc/stats/DegreeDistribution.java#L101

Those background tx could even be used to create the schema indexes upfront.

szarnyasg (Contributor) commented:

Really? That shouldn't be the case for :ID; at least it didn't use to be.
You're right, I mixed things up. The names for :START_ID and :END_ID are omitted, not the name for :ID.

Given the file id-test.csv:

myId:ID,name:STRING
1,John
2,Jane

And the following import sequence:

rm -rf data/databases/graph.db/
bin/neo4j-import --into data/databases/graph.db --nodes:Person import/id-test.csv --relationships:KNOWS import/rel-test.csv

It works with both rel-test.csv files:

:START_ID,:END_ID
1,2

or

myId:START_ID,myId:END_ID
1,2

jexp (Member) commented Aug 28, 2017

Ya, it doesn't need the id-names.
But it stores them afaik if you provide them.

jexp (Member) commented Aug 28, 2017

btw. something that might be relevant here too is groups. To handle conflicting id-spaces.
Where Group =~= Label

i.e.

myId:ID(Person),name:STRING
1,John
2,Jane

:ID(Company),name:STRING
1,eBay
2,neo4j

:START_ID(Person),:END_ID(Company)
1,2
2,1

jexp (Member) commented Aug 28, 2017

Feel free to send your commit as a PR

szarnyasg (Contributor) commented Aug 29, 2017

The example is getting bigger (4 CSV files now), so I created a repository for it: https://github.com/szarnyasg/neo4j-csv-ids

Here is how the CSVs look:

==> persons.csv <==
personId:ID,name
1,John
2,Jane

==> companies.csv <==
companyId:ID,name
4,Neo4j

==> unis.csv <==
uniId:ID,name
3,TU Munich

==> affiliated-with.csv <==
:START_ID,:END_ID
1,3
2,4

The problem is that if we respect the ID attributes' names (personId, companyId, uniId), we will not be able to insert the AFFILIATED_WITH relationships. So my idea is the following: if we have user-specified ID names, we should duplicate the ID field:

  1. have an internal id (currently __csv_id) and
  2. also create the user-specified field (e.g. personId)

I believe this will also work with id spaces.
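The proposed duplication can be sketched as follows (the __csv_id name comes from the discussion above; the dict-based node and the function name are purely illustrative):

```python
# Build a node's property map carrying both the internal id (used for
# relationship lookups) and the user-specified id field; the lookup key
# includes the id space so conflicting ids stay separate.
def make_node(raw_id, id_field_name, props, id_space=None):
    node = dict(props)
    node["__csv_id"] = raw_id
    if id_field_name:                 # e.g. personId, companyId, uniId
        node[id_field_name] = raw_id
    return (id_space, raw_id), node
```

With id spaces in the key, a Person with raw id 1 and a Company with raw id 1 no longer collide.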

szarnyasg (Contributor) commented Aug 30, 2017

szarnyasg (Contributor) commented Nov 15, 2017

(I originally added this comment to the PR, but it rather belongs here.)

@jexp I finally got back to this. I revamped the code a bit and now have one fundamental question left: how should we pass the URI of the CSV file to the procedure? Let me elaborate on that one a bit.

The procedure needs to use the filename at least twice:

  1. look into the file to read the CSV header: https://github.com/neo4j-contrib/neo4j-apoc-procedures/pull/581/files#diff-df1b88452931fbee58e22a3f1d27b3a8R181
  2. pass it to LOAD CSV, which will use it to open and read the file: https://github.com/neo4j-contrib/neo4j-apoc-procedures/pull/581/files#diff-df1b88452931fbee58e22a3f1d27b3a8R193

The file URI has to be relative to the specified import directory (related blog post).

There is a surprising contradiction in this:

  • if we pass a relative URI, step 1 will not work: to determine the location of the file, we'd have to know the value of the dbms.directories.import setting
  • if we pass an absolute URI, step 2 will not work: we'd have to produce a URI that is relative to the dbms.directories.import directory

So in both cases, the dbms.directories.import directory is required. Is there a way for APOC procedures to access that directory?
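Once the import directory is known, each view of the file can be derived from the other; a minimal pathlib sketch (the directory value is only an example, and the function name is hypothetical):

```python
from pathlib import PurePosixPath

# Given the dbms.directories.import setting, derive both the absolute path
# (needed to read the header ourselves) and the import-relative path
# (needed by LOAD CSV).
def resolve(import_dir, uri):
    d = PurePosixPath(import_dir)
    p = PurePosixPath(uri)
    absolute = p if p.is_absolute() else d / p
    relative = absolute.relative_to(d)
    return absolute, relative
```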

szarnyasg (Contributor) commented:

Okay, it seems I almost reinvented the wheel here. FileUtils should take care of those issues.

jexp mentioned this issue Jun 24, 2018
jexp pushed a commit that referenced this issue Jul 21, 2018
…ort tool's specification (#581)

* Add 'apoc.import.csv' procedure

This allows users to load graphs to an online database from CSVs
with headers that follow the conventions of the neo4j-import tool

* Introduce batch transactions to 'apoc.import.csv'
jexp closed this as completed in 989ac0e Jul 23, 2018
jexp pushed a commit that referenced this issue Jul 23, 2018