This repository has been archived by the owner on Sep 2, 2022. It is now read-only.

Data Import and Export #1299

Closed
1 task
sorenbs opened this issue Nov 18, 2017 · 4 comments

Comments

@sorenbs
Member

sorenbs commented Nov 18, 2017

Overview

Graphcool should support importing data from various data sources including:

  • SQL
  • MongoDB
  • JSON
  • Firebase

From a high level the process looks like this:

+--------------+                  +-----------+                   +------------+
|  SQL         |                  |           |                   |            |
|  MongoDB     |    transform     |    NDF    |  chunked upload   |  Graphcool |
|  JSON        |  +----------->   |           |  +------------->  |            |
|  Firebase    |                  |           |                   |            |
+--------------+                  +-----------+                   +------------+
  1. A source mapper transforms the data into NDF (Normalized Data Format)
  2. Basic data validation is performed on the NDF
  3. Small chunks are uploaded to a dedicated import endpoint on Graphcool
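As a rough sketch of step 1, a source mapper only has to tag each source record with its model name. This is a hypothetical helper (the function name and the assumption that every source row already carries a unique "id" column are mine, not part of the spec):

```python
def rows_to_ndf_nodes(type_name, rows):
    """Map source rows (e.g. dicts from a SQL query or a Mongo cursor)
    to NDF node objects by adding the _typeName discriminator.

    Assumes each row carries a unique "id" value; every other column
    becomes a scalar field on the node.
    """
    return [{"_typeName": type_name, **row} for row in rows]
```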

NDF (Normalized data format)

Requirements

Design goals

  • support importing large datasets (10 GB+)
  • resumable
  • resilient to failure of a single node
  • simple to implement and understand

Not important

  • rollback of complete import
  • performance

Files

The NDF is stored in a series of files named nodes-x.json, lists-x.json and relations-x.json, where x is an incrementing number.

Each file is roughly 10 MB in size. The Graphcool server accepts arbitrarily large files, but files significantly larger than the 10 MB default may cause memory exhaustion.
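A writer that honors the rough 10 MB target could look like this. It is a sketch, not the CLI's actual implementation; the function name and the size heuristic (summing the JSON-encoded length of each record) are my assumptions:

```python
import json
import os


def write_ndf_files(records, out_dir, kind="nodes", max_bytes=10 * 1024 * 1024):
    """Split records into numbered NDF files of roughly max_bytes each.

    `kind` is one of "nodes", "lists" or "relations"; file naming
    follows the nodes-x.json convention from the spec.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths, chunk, size = [], [], 0

    def flush():
        nonlocal chunk, size
        if not chunk:
            return
        path = os.path.join(out_dir, f"{kind}-{len(paths) + 1}.json")
        with open(path, "w") as f:
            json.dump({"valueType": kind, "values": chunk}, f)
        paths.append(path)
        chunk, size = [], 0

    for rec in records:
        encoded = len(json.dumps(rec))
        if chunk and size + encoded > max_bytes:
            flush()  # current file would exceed the target size
        chunk.append(rec)
        size += encoded
    flush()
    return paths
```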

Structure

The import JSON object that the endpoint expects must have the following format:

{"valueType": STRING, "values": [IMPORTOBJECTS]}

where valueType is one of the three import object types: nodes, lists or relations. A file may contain only one type of import object.
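For instance, a file holding node objects would wrap its values like this:

```json
{
  "valueType": "nodes",
  "values": [
    {"_typeName": "User", "id": "johndoe", "firstName": "John", "lastName": "Doe"}
  ]
}
```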

Nodes

[
  {"_typeName": STRING, "id": STRING, "fieldName": ANY, ...},
  ...
]

// For Example:

[
  {"_typeName": "User", "id": "johndoe", "firstName": "John", "lastName": "Doe"}
]

Listvalues

Scalar lists can contain values that exceed the current size limit of 10 MB per file. To circumvent this limitation, they can be split across multiple entries and therefore multiple files. The server then concatenates them in the order they are provided.

[
  {"_typeName": STRING, "id": STRING, "fieldName": [ANY]},
...
]


// For example:

[
  {"_typeName": "User", "id": "johndoe", "hobbies": ["Fishing", "Cooking", ...]},
  {"_typeName": "User", "id": "johndoe", "hobbies": ["Biking", "Dancing", ...]}
]
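The concatenation the server performs can be sketched like this: split entries are merged in order, keyed by type name, node id and field name. The function name is hypothetical; only the merge rule comes from the spec:

```python
from collections import defaultdict


def concat_list_values(entries):
    """Merge split scalar-list entries in the order they appear,
    keyed by (_typeName, id, field name)."""
    merged = defaultdict(list)
    for entry in entries:
        for field, values in entry.items():
            if field in ("_typeName", "id"):
                continue  # key fields, not list data
            merged[(entry["_typeName"], entry["id"], field)].extend(values)
    return dict(merged)
```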

Relations

[
  [
    {"_typeName": STRING, "id": STRING, "fieldName": STRING }, 
    {"_typeName": STRING, "id": STRING, "fieldName": STRING }
  ],
...
]

// For Example:

[
   [
    {"_typeName": "Human", "id": "johndoe", "fieldName": "husband"},
    {"_typeName": "Human", "id": "janedoe", "fieldName": "wife"}
   ]
]

Graphcool 1.0 allows for optional back relations. This means that one model in the relation may not have a field associated with the relation. When such a relation is exported, Graphcool 1.0 generates a format like this:

[
   [
    {"_typeName": "Human", "id": "johndoe"},
    {"_typeName": "Human", "id": "janedoe", "fieldName": "wife"}
   ]
]

Both the Framework and Graphcool 1.0 will accept this format. The Framework of course still requires two fields defined on each relation; we simply infer the proper relation to create from the one provided field.

A format like this will also be accepted by both versions of Graphcool:

[
   [
    {"_typeName": "Human", "id": "johndoe", "fieldName": "husband"},
    {"_typeName": "Human", "id": "janedoe", "fieldName": null}
   ]
]

Value Representations

  • String - string
  • Int - number
  • Float - number
  • Boolean - boolean
  • DateTime - string ((new Date()).toJSON())
  • Enum - string
  • Json - string ("{\"key\": 42}")
  • Scalar list field - must not be included in a node item but be provided as a list item.
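The conversions above can be sketched in a small helper. This is my illustration, not CLI code; it assumes naive datetimes are UTC so they can be rendered in the (new Date()).toJSON() style:

```python
import json
from datetime import datetime


def to_ndf_value(value):
    """Convert a Python value to its NDF wire representation.

    datetimes become (new Date()).toJSON()-style strings (assumes UTC),
    dicts/lists used as Json fields become JSON strings; plain scalars
    (str, int, float, bool) pass through unchanged.
    """
    if isinstance(value, datetime):
        # e.g. 2017-11-18T12:00:00.000Z
        return value.strftime("%Y-%m-%dT%H:%M:%S.") + f"{value.microsecond // 1000:03d}Z"
    if isinstance(value, (dict, list)):
        return json.dumps(value)  # Json fields travel as strings
    return value
```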

Process

  1. The Graphcool CLI is used to transform source data to the NDF (Normalized Data Format)
  2. Optionally, the CLI is used to verify the integrity of data in the NDF
  3. The CLI is used to import all the data in the NDF

Transform

The CLI should support transforming data from various data sources to the NDF. Source mappers can be implemented as CLI plugins or built into the CLI.

Verify

The CLI will fetch the schema from the backend and perform basic validation on the type of all data.

Additionally, referential integrity can be checked for all relations, and nodes with violated required relations can be identified.
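The referential-integrity half of the check reduces to a set lookup: every (typeName, id) pair referenced by a relation tuple must exist among the imported nodes. A minimal sketch (function name is mine):

```python
def check_relations(nodes, relations):
    """Return (typeName, id) pairs that relation tuples reference
    but no node defines, i.e. dangling references."""
    known = {(n["_typeName"], n["id"]) for n in nodes}
    missing = []
    for pair in relations:
        for side in pair:
            key = (side["_typeName"], side["id"])
            if key not in known and key not in missing:
                missing.append(key)
    return missing
```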

Import

The backend exposes a new dedicated import endpoint that accepts individual files as described above. A single file is processed at a time, and each element in the array is treated individually. Required relations are not verified during import of nodes. If any node fails to import, the index and reason are returned in the response from the import endpoint. The CLI can then decide to retry or show the error to the user. Violation of a unique constraint is a likely error.
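The client side of this loop might look like the sketch below. The transport is abstracted into a caller-supplied `post` callable, and the per-element error shape ({"index": ..., "message": ...}) is my assumption about the response format, not a documented contract:

```python
def upload_files(paths, post, max_retries=3):
    """Upload NDF files one at a time.

    `post` sends one file's payload to the import endpoint and returns
    a list of per-element errors (empty list means full success).
    Files whose errors persist after max_retries attempts are collected
    and returned so the CLI can surface them to the user.
    """
    failures = {}
    for path in paths:
        with open(path) as f:
            payload = f.read()
        for _attempt in range(max_retries):
            errors = post(payload)
            if not errors:
                break  # file fully imported
        if errors:
            failures[path] = errors
    return failures
```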

Other considerations

A similar process could support data export to various formats.

On the server we should perform JSON parsing in a streaming manner so as not to clog the CPU: https://github.com/circe/circe/tree/master/examples/sf-city-lots
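The linked example uses circe (Scala); the same idea in Python, without third-party streaming libraries, can be sketched with json.JSONDecoder.raw_decode over a growing buffer. This is an illustration of the technique, not the server's implementation:

```python
import json


def stream_array(path, chunk_size=65536):
    """Yield top-level elements of a JSON array file one at a time,
    without loading the whole file into memory at once."""
    decoder = json.JSONDecoder()
    with open(path) as f:
        buf = f.read(chunk_size).lstrip()
        if not buf or buf[0] != "[":
            raise ValueError("expected a JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip().lstrip(",").lstrip()
            if buf.startswith("]"):
                return  # end of array
            try:
                obj, idx = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                more = f.read(chunk_size)
                if not more:
                    return  # truncated input; stop at last complete element
                buf += more
                continue
            if idx == len(buf):
                # A bare number could continue past the buffer edge;
                # read ahead before trusting the parse.
                more = f.read(chunk_size)
                if more:
                    buf += more
                    continue
            yield obj
            buf = buf[idx:]
```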

Open questions

  • listvalues vs lists?
@nikolasburk
Member

Would it be worth it to spec out the NDF as a JSON schema so we have a formal reference point for it? @sorenbs @marktani

@marktani
Contributor

This has now been released in the latest version of graphcool-framework and the latest developer preview of Graphcool 1.0 🎉

@marcovc

marcovc commented Jan 12, 2018

Hi,
This is probably not the best place to ask questions (where is it, by the way?).
I don't understand from the docs what is supposed to go in the "id" fields of my data when I'm importing. On the one hand, I've read that the "id" is something that Graphcool manages internally; on the other, it seems necessary to have some "id" when I'm importing related records.
I've tried importing data with my own custom "id"s (just integers), but it doesn't seem to be working. It doesn't give any errors, but nothing is inserted in the database.
Any clues?
Thank you!
Marco

@agustif

agustif commented Feb 23, 2018

So if I want to import from AirTable, what would be my best bet: prisma + AirTable's API wrapper, or export to CSV and use graphql-cli-load? @marktani

Sorry for reopening too!


7 participants