Data loaders #38

Closed · mbostock opened this issue Oct 20, 2023 · 7 comments · Fixed by #89
Labels: enhancement (New feature or request)

mbostock (Member) commented Oct 20, 2023

We should formalize the concept of a data loader: a script that runs during build (and as needed during preview) to materialize a file attachment.

As an initial sketch, it could be something like this:

  • A data loader is an arbitrary executable. Most often we expect them to be written as Node.js programs (possibly with dependencies in package.json), but we want to support arbitrary executables for a polyglot workflow, e.g. writing a data loader in R, Python, Julia, Zig, etc.
  • A data loader, when run, generates and outputs the contents of a single file to stdout.
  • A data loader doesn’t take any arguments; it is invoked automatically by the CLI.
  • A file generated by a data loader is accessed like any other file (e.g., via FileAttachment or fetch); nothing on the client indicates any difference between a “static” file and a file that is generated by a data loader.
  • The name of the generated file corresponds to the name of the data loader (i.e., file-based routing). For example, if the generated file is named docs/foo.csv, the corresponding data loader would be named docs/foo.csv.ts or docs/foo.csv.py. (See the sketch after this list.)
  • The double extension (e.g., .json.ts) is used to distinguish data loaders from static files.
  • [Optional] We could support SQL shorthand for data loaders, e.g., docs/foo.csv.sql. For this we’d need metadata to specify which database to use, and we’d infer the output format (such as CSV) from the file name.
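
As a concrete sketch of the above (hypothetical file name; assuming a Node.js loader whose stdout becomes the generated file), docs/foo.csv.ts might look like this:

// docs/foo.csv.ts — hypothetical data loader; the CLI would run it and capture
// its stdout as the contents of docs/foo.csv.
const rows = [
  {date: "2023-10-01", value: 12},
  {date: "2023-10-02", value: 17}
];
process.stdout.write("date,value\n");
for (const {date, value} of rows) process.stdout.write(`${date},${value}\n`);

On the client, the generated file is then read like any other file, e.g. FileAttachment("foo.csv").csv({typed: true}).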

During preview:

  • When a client requests a file that corresponds to a data loader, the server runs the data loader automatically, if needed.
  • The server automatically re-runs the data loader when the data loader is edited, if a client is currently watching the corresponding output file.
  • We leverage reactive updates: when you edit and save a data loader, it re-runs automatically and the browser reflects the new data as an incremental update, without reloading the page.
  • The server keeps a cache of data loader outputs. It only runs a data loader if the desired output is missing from the cache, or if the output is older than the data loader source file.
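
A minimal sketch of that staleness check (helper name is hypothetical; assumes a simple mtime comparison):

import {stat} from "node:fs/promises";

// Returns true if the cached output is missing or older than the data loader
// source, i.e. the loader needs to be (re)run.
async function isStale(outputPath: string, loaderPath: string): Promise<boolean> {
  let output;
  try {
    output = await stat(outputPath);
  } catch {
    return true; // no cached output yet
  }
  const loader = await stat(loaderPath);
  return output.mtimeMs < loader.mtimeMs;
}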

During build:

  • Any data loader that corresponds to a referenced file is run, generating the corresponding output file (see the resolution sketch after this list).
  • If a data loader is not referenced as a file attachment by any page in the project, it is not run.
  • [Optional] The build step could respect the cache, allowing incremental builds.
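
A sketch of how the file-based routing could be resolved at build time (the extension list and helper name are illustrative, not the actual implementation):

import {existsSync} from "node:fs";

const LOADER_EXTENSIONS = [".ts", ".js", ".py", ".R", ".sh", ".sql"];

// Given a referenced file such as "docs/foo.csv", look for a sibling data loader
// with a double extension, e.g. "docs/foo.csv.py"; returns undefined if the file
// is static (no loader).
function findLoader(filePath: string): string | undefined {
  return LOADER_EXTENSIONS
    .map((ext) => filePath + ext)
    .find((candidate) => existsSync(candidate));
}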

The file-based routing approach described above isn’t a requirement. But it does have some nice properties that we should seek to maintain:

  • You can add or remove a data loader just by adding or removing a file.
  • You can rename a data loader just by moving a file.
  • It’s (more) obvious which files are generated, and where the source code for a generated file lives.
  • You don’t need to manage some external mapping between files and data loaders.
  • Each file can be generated independently, and in parallel, on demand.
  • The CLI can automatically manage the data loader cache.
  • The CLI only needs to run data loaders that are referenced, both during preview and build.
mbostock added the enhancement label Oct 20, 2023

mootari (Member) commented Oct 21, 2023

It would be great if these could support directories as well, e.g. DuckDB's IMPORT DATABASE / EXPORT DATABASE requires multiple files.

mbostock (Member, Author) commented:

We could do that using a zip file, I imagine.

mootari (Member) commented Oct 21, 2023

That would negate the benefits of using DuckDB to stream data.

mbostock (Member, Author) commented:

A static file server won’t support file listing, so you’ll need to statically list the files anyway. Therefore I expect you’ll need multiple data loaders so that the set of generated files is statically analyzable by the CLI.

mootari (Member) commented Oct 21, 2023

Here is an example of the set of files produced by DuckDB's EXPORT DATABASE. All of them are loaded automatically through IMPORT DATABASE, with load.sql serving as the index.

  • schema.sql:
    CREATE TABLE annotations("runNumber" DOUBLE, "checkRunId" DOUBLE, "index" DOUBLE, "level" VARCHAR, message VARCHAR, browser VARCHAR, file VARCHAR, "row" DOUBLE, "column" DOUBLE, test VARCHAR);
    CREATE TABLE "checkRuns"("runNumber" DOUBLE, id DOUBLE, "index" DOUBLE, "name" VARCHAR, url VARCHAR, "startedAt" TIMESTAMP);
    CREATE TABLE "workflowRuns"("runNumber" DOUBLE, "createdAt" TIMESTAMP, "updatedAt" TIMESTAMP, "event" VARCHAR, "runUrl" VARCHAR, "commitHash" VARCHAR, "commitMessage" VARCHAR, "commitUrl" VARCHAR, conclusion VARCHAR, "checkUrl" VARCHAR);
  • load.sql:
    COPY annotations FROM 'docs/cache/merge-queue/db/annotations.parquet' (FORMAT 'parquet');
    COPY "checkRuns" FROM 'docs/cache/merge-queue/db/checkruns.parquet' (FORMAT 'parquet');
    COPY "workflowRuns" FROM 'docs/cache/merge-queue/db/workflowruns.parquet' (FORMAT 'parquet');
    
  • annotations.parquet
  • checkruns.parquet
  • workflowruns.parquet

They are loaded by specifying the parent folder (here /db):

const db = await DuckDBClient.of();
const conn = await db._db.connect(); // _db is the underlying AsyncDuckDB instance
await conn.query(`IMPORT DATABASE 'http://localhost:3000/_file/cache/merge-queue/db'`); // reads schema.sql, load.sql and the Parquet files from that folder

mbostock (Member, Author) commented:

Can you create three data loaders?

  • data/merge-queue/db/annotations.parquet.ts
  • data/merge-queue/db/checkruns.parquet.ts
  • data/merge-queue/db/workflowruns.parquet.ts

You can share code between them, of course.
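
For instance (purely hypothetical file names and API, following the paths above), the fetch-and-flatten logic could live in a shared module, with each loader writing just one table to stdout via a placeholder toParquet helper:

// data/merge-queue/db/_shared.ts — hypothetical shared module (not itself a loader)
export async function fetchTables() {
  const response = await fetch("https://api.example.com/merge-queue"); // placeholder API
  const runs: any[] = await response.json();
  return {
    annotations: runs.flatMap((run) => run.annotations),
    checkruns: runs.flatMap((run) => run.checkRuns),
    workflowruns: runs.map(({annotations, checkRuns, ...rest}) => rest)
  };
}

// data/merge-queue/db/annotations.parquet.ts — the other two loaders would differ
// only in which table they pick.
import {fetchTables} from "./_shared.js";
import {toParquet} from "./_serialize.js"; // placeholder Parquet serializer

const {annotations} = await fetchTables();
process.stdout.write(toParquet(annotations));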

mootari (Member) commented Oct 21, 2023

They would need to share state. The build process goes roughly like this:

  1. fetch nested JSON data from a remote API
  2. separate nested objects into multiple flat arrays
  3. import arrays as tables into an in-memory DuckDB database
  4. run EXPORT DATABASE with a target directory as argument, which produces load.sql, schema.sql, and one Parquet file per table

A practical example can be found in this BI branch (WIP but working).

Ideally we would just produce a persistent database file as the artifact (also demoed in that branch), but DuckDB's storage format is still in flux and very finicky about mismatched DuckDB versions.

This was referenced Oct 23, 2023
wiltsecarpenter linked a pull request Nov 2, 2023 that will close this issue
cinxmo removed their assignment Nov 3, 2023