This repository has been archived by the owner on Sep 2, 2022. It is now read-only.

Data Import and Export #1299

Closed
1 task
sorenbs opened this issue Nov 18, 2017 · 4 comments

Comments

@sorenbs
Member

sorenbs commented Nov 18, 2017

Overview

Graphcool should support importing data from various data sources including:

  • SQL
  • MongoDB
  • JSON
  • Firebase

From a high level the process looks like this:

+--------------+                  +-----------+                   +------------+
|  SQL         |                  |           |                   |            |
|  MongoDB     |    transform     |    NDF    |  chunked upload   |  Graphcool |
|  JSON        |  +----------->   |           |  +------------->  |            |
|  Firebase    |                  |           |                   |            |
+--------------+                  +-----------+                   +------------+
  1. A source mapper transforms the data into NDF (Normalized Data Format)
  2. Basic data validation is performed on the NDF
  3. Small chunks are uploaded to a dedicated import endpoint on Graphcool
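As a rough sketch of step 1, a source mapper only has to tag each source record with its model name. This is a hypothetical helper (the function name and the assumption that every source row already carries a unique "id" column are mine, not part of the spec):

```python
def rows_to_ndf_nodes(type_name, rows):
    """Map source rows (e.g. dicts from a SQL query or a Mongo cursor)
    to NDF node objects by adding the _typeName discriminator.

    Assumes each row carries a unique "id" value; every other column
    becomes a scalar field on the node.
    """
    return [{"_typeName": type_name, **row} for row in rows]
```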

NDF (Normalized data format)

Requirements

Design goals

  • support importing large datasets (10 GB+)
  • resumable
  • resilient to failure of a single node
  • simple to implement and understand

Not important

  • rollback of complete import
  • performance

Files

The NDF is stored in a series of files named nodes-x.json, lists-x.json and relations-x.json, where x is an incrementing number.

Each file is roughly 10 MB in size. The Graphcool server accepts arbitrarily large files, but files significantly larger than the 10 MB default may cause memory exhaustion.
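A writer that honors the rough 10 MB target could look like this. It is a sketch, not the CLI's actual implementation; the function name and the size heuristic (summing the JSON-encoded length of each record) are my assumptions:

```python
import json
import os


def write_ndf_files(records, out_dir, kind="nodes", max_bytes=10 * 1024 * 1024):
    """Split records into numbered NDF files of roughly max_bytes each.

    `kind` is one of "nodes", "lists" or "relations"; file naming
    follows the nodes-x.json convention from the spec.
    """
    os.makedirs(out_dir, exist_ok=True)
    paths, chunk, size = [], [], 0

    def flush():
        nonlocal chunk, size
        if not chunk:
            return
        path = os.path.join(out_dir, f"{kind}-{len(paths) + 1}.json")
        with open(path, "w") as f:
            json.dump({"valueType": kind, "values": chunk}, f)
        paths.append(path)
        chunk, size = [], 0

    for rec in records:
        encoded = len(json.dumps(rec))
        if chunk and size + encoded > max_bytes:
            flush()  # current file would exceed the target size
        chunk.append(rec)
        size += encoded
    flush()
    return paths
```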

Structure

The import JSON object that the endpoint expects must have the following format:

{"valueType": STRING, "values": [IMPORTOBJECTS]}

where valueType is one of the three import object types: nodes, lists or relations. A file may contain only one type of import object.
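For instance, a file holding node objects would wrap its values like this:

```json
{
  "valueType": "nodes",
  "values": [
    {"_typeName": "User", "id": "johndoe", "firstName": "John", "lastName": "Doe"}
  ]
}
```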

Nodes

[
  {"_typeName": STRING, "id": STRING, "fieldName": ANY, ...},
  ...
]

// For Example:

[
  {"_typeName": "User", "id": "johndoe", "firstName": "John", "lastName": "Doe"}
]

Listvalues

Scalar lists can contain values that exceed the current size limit of 10 MB per file. To circumvent this limitation, they can be split across multiple entries and therefore multiple files. The server then concatenates them in the order they are provided.

[
  {"_typeName": STRING, "id": STRING, "fieldName": [ANY]},
...
]


// For example:

[
  {"_typeName": "User", "id": "johndoe", "hobbies": ["Fishing", "Cooking", ...]},
  {"_typeName": "User", "id": "johndoe", "hobbies": ["Biking", "Dancing", ...]}
]
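The concatenation the server performs can be sketched like this: split entries are merged in order, keyed by type name, node id and field name. The function name is hypothetical; only the merge rule comes from the spec:

```python
from collections import defaultdict


def concat_list_values(entries):
    """Merge split scalar-list entries in the order they appear,
    keyed by (_typeName, id, field name)."""
    merged = defaultdict(list)
    for entry in entries:
        for field, values in entry.items():
            if field in ("_typeName", "id"):
                continue  # key fields, not list data
            merged[(entry["_typeName"], entry["id"], field)].extend(values)
    return dict(merged)
```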

Relations

[
  [
    {"_typeName": STRING, "id": STRING, "fieldName": STRING }, 
    {"_typeName": STRING, "id": STRING, "fieldName": STRING }
  ],
...
]

// For Example:

[
   [
    {"_typeName": "Human", "id": "johndoe", "fieldName": "husband"},
    {"_typeName": "Human", "id": "janedoe", "fieldName": "wife"}
   ]
]

Graphcool 1.0 allows for optional back relations. This means that one model in the relation may not have a field associated with the relation. When such a relation is exported, Graphcool 1.0 generates a format like this:

[
   [
    {"_typeName": "Human", "id": "johndoe"},
    {"_typeName": "Human", "id": "janedoe", "fieldName": "wife"}
   ]
]

Both the Framework and Graphcool 1.0 will accept this format. The Framework of course still requires two fields defined on each relation; we simply infer the proper relation to create from the one provided field.

A format like this will also be accepted by both versions of Graphcool:

[
   [
    {"_typeName": "Human", "id": "johndoe", "fieldName": "husband"},
    {"_typeName": "Human", "id": "janedoe", "fieldName": null}
   ]
]

Value Representations

  • String - string
  • Int - number
  • Float - number
  • Boolean - boolean
  • DateTime - string ((new Date()).toJSON())
  • Enum - string
  • Json - string ("{\"key\": 42}")
  • Scalar list field - must not be included in a node item but be provided as a list item.
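The conversions above can be sketched in a small helper. This is my illustration, not CLI code; it assumes naive datetimes are UTC so they can be rendered in the (new Date()).toJSON() style:

```python
import json
from datetime import datetime


def to_ndf_value(value):
    """Convert a Python value to its NDF wire representation.

    datetimes become (new Date()).toJSON()-style strings (assumes UTC),
    dicts/lists used as Json fields become JSON strings; plain scalars
    (str, int, float, bool) pass through unchanged.
    """
    if isinstance(value, datetime):
        # e.g. 2017-11-18T12:00:00.000Z
        return value.strftime("%Y-%m-%dT%H:%M:%S.") + f"{value.microsecond // 1000:03d}Z"
    if isinstance(value, (dict, list)):
        return json.dumps(value)  # Json fields travel as strings
    return value
```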

Process

  1. The Graphcool CLI is used to transform source data to the NDF (Normalized Data Format)
  2. Optionally, the CLI is used to verify the integrity of data in the NDF
  3. The CLI is used to import all the data in the NDF

Transform

The CLI should support transforming data from various data sources to the NDF. Source mappers can be implemented as CLI plugins or built into the CLI.

Verify

The CLI will fetch the schema from the backend and perform basic validation on the type of all data.

Additionally, referential integrity can be checked for all relations, and nodes with violated required relations can be identified.
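The referential-integrity half of the check reduces to a set lookup: every (typeName, id) pair referenced by a relation tuple must exist among the imported nodes. A minimal sketch (function name is mine):

```python
def check_relations(nodes, relations):
    """Return (typeName, id) pairs that relation tuples reference
    but no node defines, i.e. dangling references."""
    known = {(n["_typeName"], n["id"]) for n in nodes}
    missing = []
    for pair in relations:
        for side in pair:
            key = (side["_typeName"], side["id"])
            if key not in known and key not in missing:
                missing.append(key)
    return missing
```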

Import

The backend exposes a new dedicated import endpoint that accepts individual files as described above. A single file is processed at a time, and each element in the array is treated individually. Required relations are not verified during import of nodes. If any node fails to import, the index and reason are returned in the response from the import endpoint. The CLI can then decide to retry or show the error to the user. Violation of a unique constraint is a likely error.
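The client side of this loop might look like the sketch below. The transport is abstracted into a caller-supplied `post` callable, and the per-element error shape ({"index": ..., "message": ...}) is my assumption about the response format, not a documented contract:

```python
def upload_files(paths, post, max_retries=3):
    """Upload NDF files one at a time.

    `post` sends one file's payload to the import endpoint and returns
    a list of per-element errors (empty list means full success).
    Files whose errors persist after max_retries attempts are collected
    and returned so the CLI can surface them to the user.
    """
    failures = {}
    for path in paths:
        with open(path) as f:
            payload = f.read()
        for _attempt in range(max_retries):
            errors = post(payload)
            if not errors:
                break  # file fully imported
        if errors:
            failures[path] = errors
    return failures
```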

Other considerations

A similar process could support data export to various formats.

On the server we should perform JSON parsing in a streaming manner so as not to clog the CPU: https://github.com/circe/circe/tree/master/examples/sf-city-lots
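The linked example uses circe (Scala); the same idea in Python, without third-party streaming libraries, can be sketched with json.JSONDecoder.raw_decode over a growing buffer. This is an illustration of the technique, not the server's implementation:

```python
import json


def stream_array(path, chunk_size=65536):
    """Yield top-level elements of a JSON array file one at a time,
    without loading the whole file into memory at once."""
    decoder = json.JSONDecoder()
    with open(path) as f:
        buf = f.read(chunk_size).lstrip()
        if not buf or buf[0] != "[":
            raise ValueError("expected a JSON array")
        buf = buf[1:]
        while True:
            buf = buf.lstrip().lstrip(",").lstrip()
            if buf.startswith("]"):
                return  # end of array
            try:
                obj, idx = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                more = f.read(chunk_size)
                if not more:
                    return  # truncated input; stop at last complete element
                buf += more
                continue
            if idx == len(buf):
                # A bare number could continue past the buffer edge;
                # read ahead before trusting the parse.
                more = f.read(chunk_size)
                if more:
                    buf += more
                    continue
            yield obj
            buf = buf[idx:]
```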

Open questions

  • listvalues vs lists?
@nikolasburk
Member

Would it be worth it to spec out the NDF as a JSON schema so we have a formal reference point for it? @sorenbs @marktani

@marktani
Contributor

This has now been released in the latest version of graphcool-framework and the latest developer preview of Graphcool 1.0 🎉

@marcovc

marcovc commented Jan 12, 2018

Hi,
This is probably not the best place to ask questions (where is it, by the way?).
I don't understand from the docs what is supposed to go in the "id" fields of my data when I'm importing. On the one hand, I've read that the "id" is something that Graphcool manages internally; on the other, it seems necessary to have some "id" when I'm importing related records.
I've tried importing data with my own custom "id"s (just integers), but it doesn't seem to be working. It doesn't give any errors, but nothing is inserted in the database.
Any clues?
Thank you!
Marco

@agustif

agustif commented Feb 23, 2018

So if I want to import from AirTable, what would be my best bet: prisma + AirTable's API wrapper, or export to CSV and use graphql-cli-load? @marktani

Sorry for reopening too!


7 participants