Data loaders #38

Closed · mbostock opened this issue Oct 20, 2023 · 7 comments · Fixed by #89
Labels: enhancement (New feature or request)

mbostock (Member) commented Oct 20, 2023

We should formalize the concept of a data loader: a script that runs during build (and as needed during preview) to materialize a file attachment.

As an initial sketch, it could be something like this:

  • A data loader is an arbitrary executable. Most often we expect them to be written as Node.js programs (possibly with dependencies in package.json), but we want to support arbitrary executables for a polyglot workflow, e.g. writing a data loader in R, Python, Julia, Zig, etc.
  • A data loader, when run, generates and outputs the contents of a single file to stdout.
  • A data loader doesn’t take any arguments; it is invoked automatically by the CLI.
  • A file generated by a data loader is accessed like any other file (e.g., via FileAttachment or fetch); nothing on the client indicates any difference between a “static” file and a file that is generated by a data loader.
  • The name of the generated file corresponds to the name of the data loader (i.e., file-based routing). For example, if the generated file is named docs/foo.csv, the corresponding data loader would be named docs/foo.csv.ts or docs/foo.csv.py. (See the sketch after this list.)
  • The double extension (e.g., .json.ts) is used to distinguish data loaders from static files.
  • [Optional] We could support SQL shorthand for data loaders, e.g., docs/foo.csv.sql. For this we’d need metadata to specify which database to use, and we’d infer the output format (such as CSV) from the file name.
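
As a concrete sketch of the above (hypothetical file name; assuming a Node.js loader whose stdout becomes the generated file), docs/foo.csv.ts might look like this:

// docs/foo.csv.ts — hypothetical data loader; the CLI would run it and capture
// its stdout as the contents of docs/foo.csv.
const rows = [
  {date: "2023-10-01", value: 12},
  {date: "2023-10-02", value: 17}
];
process.stdout.write("date,value\n");
for (const {date, value} of rows) process.stdout.write(`${date},${value}\n`);

On the client, the generated file is then read like any other file, e.g. FileAttachment("foo.csv").csv({typed: true}).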

During preview:

  • When a client requests a file that corresponds to a data loader, the server runs the data loader automatically, if needed.
  • The server automatically re-runs the data loader when the data loader is edited, if a client is currently watching the corresponding output file.
  • We leverage reactive updates: when you edit and save a data loader, it re-runs automatically and the browser reflects the new data as an incremental update, without reloading the page.
  • The server keeps a cache of data loader outputs. It only runs a data loader if the desired output is missing from the cache, or if the output is older than the data loader source file.
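
A minimal sketch of that staleness check (helper name is hypothetical; assumes a simple mtime comparison):

import {stat} from "node:fs/promises";

// Returns true if the cached output is missing or older than the data loader
// source, i.e. the loader needs to be (re)run.
async function isStale(outputPath: string, loaderPath: string): Promise<boolean> {
  let output;
  try {
    output = await stat(outputPath);
  } catch {
    return true; // no cached output yet
  }
  const loader = await stat(loaderPath);
  return output.mtimeMs < loader.mtimeMs;
}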

During build:

  • Any data loader that corresponds to a referenced file is run, generating the corresponding output file (see the resolution sketch after this list).
  • If a data loader is not referenced as a file attachment by any page in the project, it is not run.
  • [Optional] The build step could respect the cache, allowing incremental builds.
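
A sketch of how the file-based routing could be resolved at build time (the extension list and helper name are illustrative, not the actual implementation):

import {existsSync} from "node:fs";

const LOADER_EXTENSIONS = [".ts", ".js", ".py", ".R", ".sh", ".sql"];

// Given a referenced file such as "docs/foo.csv", look for a sibling data loader
// with a double extension, e.g. "docs/foo.csv.py"; returns undefined if the file
// is static (no loader).
function findLoader(filePath: string): string | undefined {
  return LOADER_EXTENSIONS
    .map((ext) => filePath + ext)
    .find((candidate) => existsSync(candidate));
}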

The file-based routing approach described above isn’t a requirement. But it does have some nice properties that we should seek to maintain:

  • You can add or remove a data loader just by adding or removing a file.
  • You can rename a data loader just by moving a file.
  • It’s (more) obvious which files are generated, and where the source code for a generated file lives.
  • You don’t need to manage some external mapping between files and data loaders.
  • Each file can be generated independently, and in parallel, on demand.
  • The CLI can automatically manage the data loader cache.
  • The CLI only needs to run data loaders that are referenced, both during preview and build.
mbostock added the enhancement label Oct 20, 2023

mootari (Member) commented Oct 21, 2023

It would be great if these could support directories as well, e.g. DuckDB's IMPORT DATABASE / EXPORT DATABASE requires multiple files.

mbostock (Member, Author) commented:

We could do that using a zip file, I imagine.

mootari (Member) commented Oct 21, 2023

That would negate the benefits of using DuckDB to stream data.

mbostock (Member, Author) commented:

A static file server won’t support file listing, so you’ll need to statically list the files anyway. Therefore I expect you’ll need multiple data loaders so that the set of generated files is statically analyzable by the CLI.

mootari (Member) commented Oct 21, 2023

Here is an example of the set of files produced by DuckDB's EXPORT DATABASE. All of them are loaded automatically through IMPORT DATABASE, with load.sql serving as the index.

  • schema.sql:
    CREATE TABLE annotations("runNumber" DOUBLE, "checkRunId" DOUBLE, "index" DOUBLE, "level" VARCHAR, message VARCHAR, browser VARCHAR, file VARCHAR, "row" DOUBLE, "column" DOUBLE, test VARCHAR);
    CREATE TABLE "checkRuns"("runNumber" DOUBLE, id DOUBLE, "index" DOUBLE, "name" VARCHAR, url VARCHAR, "startedAt" TIMESTAMP);
    CREATE TABLE "workflowRuns"("runNumber" DOUBLE, "createdAt" TIMESTAMP, "updatedAt" TIMESTAMP, "event" VARCHAR, "runUrl" VARCHAR, "commitHash" VARCHAR, "commitMessage" VARCHAR, "commitUrl" VARCHAR, conclusion VARCHAR, "checkUrl" VARCHAR);
  • load.sql:
    COPY annotations FROM 'docs/cache/merge-queue/db/annotations.parquet' (FORMAT 'parquet');
    COPY "checkRuns" FROM 'docs/cache/merge-queue/db/checkruns.parquet' (FORMAT 'parquet');
    COPY "workflowRuns" FROM 'docs/cache/merge-queue/db/workflowruns.parquet' (FORMAT 'parquet');
    
  • annotations.parquet
  • checkruns.parquet
  • workflowruns.parquet

They are loaded by specifying the parent folder (here /db):

const db = await DuckDBClient.of();
const conn = await db._db.connect(); // _db is the underlying AsyncDuckDB instance
await conn.query(`IMPORT DATABASE 'http://localhost:3000/_file/cache/merge-queue/db'`); // reads schema.sql, load.sql and the Parquet files from that folder

mbostock (Member, Author) commented:

Can you create three data loaders?

  • data/merge-queue/db/annotations.parquet.ts
  • data/merge-queue/db/checkruns.parquet.ts
  • data/merge-queue/db/workflowruns.parquet.ts

You can share code between them, of course.
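
For instance (purely hypothetical file names and API, following the paths above), the fetch-and-flatten logic could live in a shared module, with each loader writing just one table to stdout via a placeholder toParquet helper:

// data/merge-queue/db/_shared.ts — hypothetical shared module (not itself a loader)
export async function fetchTables() {
  const response = await fetch("https://api.example.com/merge-queue"); // placeholder API
  const runs: any[] = await response.json();
  return {
    annotations: runs.flatMap((run) => run.annotations),
    checkruns: runs.flatMap((run) => run.checkRuns),
    workflowruns: runs.map(({annotations, checkRuns, ...rest}) => rest)
  };
}

// data/merge-queue/db/annotations.parquet.ts — the other two loaders would differ
// only in which table they pick.
import {fetchTables} from "./_shared.js";
import {toParquet} from "./_serialize.js"; // placeholder Parquet serializer

const {annotations} = await fetchTables();
process.stdout.write(toParquet(annotations));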

mootari (Member) commented Oct 21, 2023

They would need to share state. The build process goes roughly like this:

  1. fetch nested JSON data from a remote API
  2. separate nested objects into multiple flat arrays
  3. import arrays as tables into an in-memory DuckDB database
  4. run EXPORT DATABASE with a target directory as argument, which produces load.sql, schema.sql, and one Parquet file per table

A practical example can be found in this BI branch (WIP but working).

Ideally we would just produce a persistent database file as the artifact (also demoed in that branch), but DuckDB's storage format is still in flux and very finicky about mismatched DuckDB versions.

This was referenced Oct 23, 2023
wiltsecarpenter linked a pull request Nov 2, 2023 that will close this issue
cinxmo removed their assignment Nov 3, 2023