sql cells, backed by DuckDB #844

Merged: mbostock merged 17 commits into main from mbostock/duckdb-sql on Mar 6, 2024

Conversation

mbostock
Member

@mbostock mbostock commented Feb 17, 2024

Fixes #48. TODO

  • sql tagged-template literal implemented by npm:@observablehq/duckdb
  • sql needs implicit dependency on duckdb
  • sql literal needs a shared DuckDBClient under the hood (not a new one for each query!)
  • sql frontmatter option to declare which tables should be available to sql
  • generate registerTable calls, or some such, to register these tables for sql
  • a way to name (id) the output of a sql cell for access in JavaScript?
  • throw a syntax error if the specified id is invalid
  • destructuring assignment (e.g., id=[{foo}])
  • register table sources as files in getResolvers
  • live update of sql when front matter is edited during preview
  • display directive
  • display as table rather than the default inspector
  • support DuckDB files as source? [future]
  • re-register tables when file attachments change
  • documentation
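
The "shared DuckDBClient under the hood" item above could be approached by caching a single client promise and reusing it for every query. This is only a minimal sketch under stated assumptions: `sharedSql` and `createClient` are hypothetical names, and `createClient` stands in for whatever actually constructs the client (for example something like `DuckDBClient.of`).

```js
// Hypothetical helper: returns a `sql` tagged-template function that lazily
// creates one client and reuses it for every query.
function sharedSql(createClient) {
  let client; // promise for the single shared client, created on first use

  return async function sql(strings, ...params) {
    // Store the promise (not the resolved client) so that concurrent queries
    // issued before initialization finishes still share one client.
    if (client === undefined) client = createClient();
    const db = await client;
    return db.sql(strings, ...params);
  };
}
```

Caching the promise rather than the resolved value is the design choice that matters here: two queries started in the same tick both see the same pending promise.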

I’m imagining the frontmatter would look something like this:

---
sql: chinook.db
---

But possibly you can declare the tables individually:

---
sql:
  customers: customers.csv
  orders: orders.parquet
---

And in any case we’ll need to make sure you can use a data loader to generate the backing files (e.g., customers.csv.js). And ideally we’d correctly handle incremental edits to the frontmatter during preview, un-registering and re-registering tables, and hot module replacement to the tables 😅 with file watching. But if we have to reload the page in the interim I think this would still be a great start.

A challenge with sql.init will be avoiding the race condition — if we start to process a sql query before we’ve registered the tables, the results will be wrong. But if we’re going to do reactive updates to the sql definition anyway, maybe the answer is that sql should be defined as a generator, and any call to register or unregister a table will cause this generator to yield a new value, and then reactively reevaluate all the downstream queries. That sounds like the right path since it leverages reactivity (though it might mean some brief flashes of “wrong” results?).
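
To make the generator idea concrete, here is a rough sketch in plain JavaScript, independent of Framework's runtime. All names (`createSqlSource`, `register`, `unregister`, `values`, `runQuery`) are hypothetical and chosen for illustration. Each change to the set of tables causes the async generator to yield a fresh `sql` function, which a reactive runtime could use to re-evaluate downstream queries.

```js
// Hypothetical sketch: `runQuery(tables, strings, params)` stands in for the
// real query engine and receives a snapshot of the registered tables.
function createSqlSource(runQuery) {
  const tables = new Map(); // table name -> source
  let version = 0;          // incremented on every register/unregister
  let notify = null;        // resolver for the generator's pending wait, if any

  function changed() {
    version += 1;
    if (notify !== null) notify();
    notify = null;
  }

  return {
    register(name, source) {
      tables.set(name, source);
      changed();
    },
    unregister(name) {
      tables.delete(name);
      changed();
    },
    // Yields a new `sql` function initially and after every change.
    // Changes that happen between two pulls are coalesced into one yield.
    async *values() {
      let seen = -1;
      while (true) {
        if (seen === version) {
          await new Promise((resolve) => (notify = resolve));
        }
        seen = version;
        const snapshot = new Map(tables);
        yield (strings, ...params) => runQuery(snapshot, strings, params);
      }
    }
  };
}
```

Because the first `sql` is yielded before any table is registered, a query evaluated against it can briefly see an incomplete set of tables, which is the "flash of wrong results" trade-off mentioned above.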

One way to implement a better display for sql cells would be to use a sql.table tagged-template literal instead of sql (like we do for tex.block). We could also change the generated code under the hood so instead of this:

sql`SELECT 1 + 2`

it’s more like this:

const foo = sql`SELECT 1 + 2`;
display(Inputs.table(foo));

As for naming the output, that could be something like:

```sql id=foo
SELECT 1 + 2
```
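
The code generation described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the transform implemented in this PR: `transpileSql` is a hypothetical name, the choice to skip display for named cells unless requested is an assumption, and only plain identifiers are validated (a destructuring pattern such as `[{foo}]` would need a real JavaScript parser).

```js
// Hypothetical transform from the body of a sql fenced code block to
// JavaScript source text.
function transpileSql(content, {id, display = id === undefined} = {}) {
  // Only plain identifiers are checked in this sketch; reserved words such as
  // `class` are not rejected, and destructuring patterns are not supported.
  if (id !== undefined && !/^[A-Za-z_$][\w$]*$/.test(id)) {
    throw new SyntaxError(`invalid id: ${id}`);
  }
  // Escape backslashes and backticks so the SQL survives inside a template
  // literal; `${…}` is left untouched so interpolation keeps working.
  const body = content.trim().replace(/[\\`]/g, "\\$&");
  const query = `sql\`${body}\``;
  if (id === undefined) {
    return display ? `display(Inputs.table(${query}));` : `${query};`;
  }
  const lines = [`const ${id} = ${query};`];
  if (display) lines.push(`display(Inputs.table(${id}));`);
  return lines.join("\n");
}
```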

@Fil
Contributor

Fil commented Feb 17, 2024

Neat! I wonder if sql is the right name for the list of data sources; it could be interesting to make it more generic, since these data sources could be consumed by something other than sql in the future (maybe… a smart data table)?

@mbostock mbostock force-pushed the mbostock/duckdb-sql branch from 28e8947 to 89ddd07 Compare March 2, 2024 15:51
@mbostock
Member Author

mbostock commented Mar 2, 2024

I want to use the name sql because it matches the sql fenced code block and the sql template literal. But I could see wanting to change the engine to e.g. SQLite, so maybe that means a more explicit form like

---
sql:
  engine: sqlite
  tables: chinook.db
---

As for the display, I think we would change ```sql to have a better default display in the future? I’m not sure that needs to be configurable, but I guess it could be an attribute of the fenced code block or page-level option too.

@mbostock mbostock force-pushed the mbostock/duckdb-sql branch from 880f771 to 59cc841 Compare March 5, 2024 04:55
@mbostock mbostock force-pushed the mbostock/duckdb-sql branch from d638eef to 498f9d3 Compare March 6, 2024 00:37
@mbostock mbostock requested a review from Fil March 6, 2024 04:24
@mbostock mbostock marked this pull request as ready for review March 6, 2024 04:24
@Fil
Contributor

Fil commented Mar 6, 2024

The interactive view tends to multiply the number of tables

sql-view.mp4

@Fil
Contributor

Fil commented Mar 6, 2024

Feature requests (probably for future iterations):

  • What would be the easiest way to apply this to a JavaScript array instead of a file? For example, to work with live data fetched (and data-wrangled) from an API.

We'll probably need to add some helper code similar to this snippet?

      const table = arrow.tableFromJSON(data);
      const buffer = arrow.tableToIPC(table);
      const conn = await db.connect();
      await conn.insertArrowFromIPCStream(buffer, {
        name,
        schema: "main",
        ...options
      });
  • Also, I would love to use an absolute url as a source (to query data from files hosted somewhere else).

For example:

```yaml
sql:
  adresses: https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet
```

(this currently fails with Error: Invalid Error: Opening file 'table-adresses-reu.parquet' failed with error: Failed to open file: table-adresses-reu.parquet)

  • A way to make this available from data loaders. Setting up duckdb-wasm in a data loader is quite a bit of work.

@mbostock
Member Author

mbostock commented Mar 6, 2024

Yes, I mentioned the race condition in Slack. It’s because the promises resolve out of order, so you can get a display from an older query after a newer one. We need to handle this race condition in the display function (by ignoring the stale call). I can work on this later today but feel free to take a crack at it if you’re interested.
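
One common way to ignore stale results is to give each query a ticket number and drop any result whose ticket is no longer the latest. The sketch below illustrates that general technique only; it is not the fix that landed in Framework, and `latestOnly`, `show` and `track` are hypothetical names.

```js
// Hypothetical helper: wraps a display-like callback so that results from
// superseded queries are dropped instead of shown.
function latestOnly(show) {
  let issued = 0; // ticket of the most recently started query

  return function track(promise) {
    const ticket = ++issued;
    return promise.then((value) => {
      if (ticket !== issued) return false; // a newer query has started; ignore
      show(value);
      return true;
    });
  };
}
```

With this policy a slow, older query that resolves after a newer one is silently discarded, at the cost of showing nothing until the newest query settles.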

@mbostock
Member Author

mbostock commented Mar 6, 2024

On your requested enhancements, don’t forget that you can just override the definition of sql on a page, so you can use the full capabilities of DuckDBClient to point to in-memory or remote data.

const db = await DuckDBClient.of({gaia: FileAttachment("./lib/gaia-sample.parquet")});
const sql = db.sql.bind(db);

Seems reasonable to support external URLs in front matter, though. I think the challenge is that we’ll have to guess the format from the file extension. Perhaps it’s better to just use another SQL cell to insert the remote data:

CREATE TABLE addresses
  AS SELECT *
  FROM read_parquet('https://static.data.gouv.fr/resources/bureaux-de-vote-et-adresses-de-leurs-electeurs/20230626-135723/table-adresses-reu.parquet')
  LIMIT 100

But then we’d need the other SQL cells to block on the SQL cell that defines the addresses table… so yeah, we should do it in the front matter, but the URL isn’t sufficient as we’ll also need the MIME type (or "parquet"). I suggest we do this as a future enhancement.

@mbostock
Member Author

mbostock commented Mar 6, 2024

On the race condition, I have filed a bug and a simple reproduction in #995 (which confirms that it’s a pre-existing problem, so we could ignore it in this PR for now — say by switching to a radio button instead of a range slider). It’ll also be a bit tricky to fix, I expect; see the linked issue for discussion.

@Fil
Contributor

Fil commented Mar 6, 2024

> guess the format from the file extension

I could imagine passing it as an option in a long-form description of the source, for when it's opaque:

sql:
  - tableName:
    - source: https://<something opaque>
    - format: parquet
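
The two ideas in this thread (guess the format from the file extension, and let an explicit format win when the URL is opaque) could be combined along these lines. This is a sketch only: `guessFormat`, `resolveFormat`, the `{source, format}` object shape and the list of recognized extensions are all assumptions made for illustration, not the front matter schema.

```js
// Assumed set of recognized extensions; purely illustrative.
const KNOWN_FORMATS = new Set(["csv", "tsv", "json", "parquet", "arrow"]);

// Returns the format implied by the file extension, or null if there is none.
function guessFormat(source) {
  // Resolving against a dummy base lets relative paths and absolute URLs
  // share one code path; query strings and fragments are excluded.
  const {pathname} = new URL(source, "file:///");
  const match = /\.([a-z0-9]+)$/i.exec(pathname);
  if (match === null) return null;
  const extension = match[1].toLowerCase();
  return KNOWN_FORMATS.has(extension) ? extension : null;
}

// An explicit format takes precedence; otherwise fall back to the extension.
function resolveFormat({source, format}) {
  const resolved = format ?? guessFormat(source);
  if (resolved === null) {
    throw new Error(`unable to determine format of ${source}`);
  }
  return resolved;
}
```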

@Fil
Contributor

Fil commented Mar 6, 2024

> const sql = db.sql.bind(db);

I would not have guessed this invocation, but we can add syntactic sugar later.
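
For context on why `bind` is needed: a method pulled off an object loses its `this`, so calling it as a bare tagged template fails. The sketch below shows this with a stand-in class; `StubClient` is hypothetical and only mimics the shape of a client with a `sql` method.

```js
// Hypothetical stand-in for a database client with a `sql` method.
class StubClient {
  constructor(rows) {
    this.rows = rows;
  }
  async sql(strings, ...params) {
    // Relies on `this`, like any ordinary method.
    return this.rows;
  }
}

const db = new StubClient([{answer: 42}]);

// Detached method: `this` is undefined when it is called as a bare tag,
// so the returned promise rejects with a TypeError.
const unbound = db.sql;

// Bound method: `this` is permanently fixed to `db`, so it can be used as a
// standalone tagged-template function.
const sql = db.sql.bind(db);
```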

Comment on lines +84 to +86
```sql id=[{min}]
SELECT MIN(phot_g_mean_mag) AS min FROM gaia
```
Contributor

@Fil Fil Mar 6, 2024

I would have liked this to be extensible to define multiple values, but it doesn't seem to work:

```sql id=[{min, max}]
SELECT MIN(phot_g_mean_mag) AS min, MAX(phot_g_mean_mag) AS max FROM gaia
```

defines only min.

Member Author

Hmm, that’s surprising. I would expect that to work and I’ll investigate.

Contributor

In fact it works, I was editing the wrong code block! 😭

Contributor

We could update that example in the documentation to include min and max?

Member Author

Ah, so the attribute gotcha applies, which means you either need quotes

```sql id="[{min, max}]"
SELECT MIN(phot_g_mean_mag) AS min, MAX(phot_g_mean_mag) AS max FROM gaia
```

or to remove spaces

```sql id=[{min,max}]
SELECT MIN(phot_g_mean_mag) AS min, MAX(phot_g_mean_mag) AS max FROM gaia
```
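
The gotcha comes from splitting the fenced code block's info string on whitespace. The sketch below is a simplified illustration, not Framework's actual parser, and `parseInfo` is a hypothetical name; it shows how an unquoted value is cut at the first space while quoted or space-free values survive intact.

```js
// Hypothetical parser for an info string such as `sql id=foo display`.
// Attribute values end at the first whitespace unless wrapped in double quotes.
function parseInfo(info) {
  const trimmed = info.trim();
  const language = /^\S*/.exec(trimmed)[0];
  const attributes = {};
  const pattern = /([\w-]+)(?:=(?:"([^"]*)"|(\S+)))?/g;
  for (const [, name, quoted, bare] of trimmed.slice(language.length).matchAll(pattern)) {
    // A name without a value is treated as a boolean flag.
    attributes[name] = quoted ?? bare ?? true;
  }
  return {language, attributes};
}
```

In this sketch, `sql id=[{min, max}]` yields an `id` of `[{min,` because the value stops at the space, whereas both the quoted form and the space-free form keep the whole pattern.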

Contributor

@Fil Fil left a comment

this is truly magnificent

@mbostock
Member Author

mbostock commented Mar 6, 2024

We could add this shorthand in the future:

const sql = DuckDBClient.sql({gaia: FileAttachment("./lib/gaia-sample.parquet")});

@mbostock mbostock enabled auto-merge (squash) March 6, 2024 18:16
@mbostock mbostock merged commit 92377f8 into main Mar 6, 2024
4 checks passed
@mbostock mbostock deleted the mbostock/duckdb-sql branch March 6, 2024 18:21
Development

Successfully merging this pull request may close these issues.

SQL fenced code blocks
2 participants