## Overview

Jupyter notebooks have become the de facto standard for interactive computing
and data analysis, combining code, prose, and visualizations in a single
document.

This blog post was written in a notebook!

Jupyter's architecture separates the notebook interface, where users write and
interact with code (typically built with web technologies), from the **kernel**,
which executes it. This modular design has driven innovation, giving users
flexibility in both front ends (e.g., JupyterLab, VS Code) and programming
languages.

Since v1.37, Deno has included a built-in Jupyter kernel, bringing JavaScript
and TypeScript to data science and machine learning. Having worked extensively
with computational notebooks (mainly in Python), I find this exciting for
several reasons:

- **Easier setup** – The kernel is built into the Deno CLI, so no additional
installation is needed: just install Deno and start using notebooks.

- **Improved dependency management** – Notebooks often behave like standalone
scripts, making dependency management a challenge in other languages and
contributing to reproducibility issues. Deno’s ECMAScript module system allows
dependencies to be declared directly in code, enhancing self-containment and
reliability.

- **A unified ecosystem for interactive data analysis** – Jupyter front ends use
web technologies and support rich outputs in HTML, CSS, and JavaScript. Because
the JavaScript ecosystem dominates interactive web UI development, Deno lets the
kernel and the front end share one language, enabling new possibilities for data
science and machine learning.

In this notebook, we’ll demonstrate how to leverage Deno’s Jupyter kernel for
high-level data analysis, visualization, and interactive exploration.


## The Dataset  

The **National Gallery of Art (NGA) [Open Data Program](https://www.nga.gov/open-access-images/open-data.html)** provides an up-to-date archive of over **130,000 artworks** and their creators, available [on GitHub](https://github.com/NationalGalleryOfArt/opendata/tree/main/data).  

The dataset is structured as a **relational database**, exported as individual CSV files. These files contain linked information on artworks, artists, and images. We will use this dataset to construct an **in-memory representation**, enabling **exploratory data analysis** within the notebook.  

We will focus on three key tables:  

- **`objects.csv`** – Core metadata on artworks, including titles, dates, materials, and classifications.  
- **`constituents.csv`** – Information on artists, such as names, nationalities, and lifespans.  
- **`published_images.csv`** – Links to artwork images via the NGA’s **IIIF API**.  

By leveraging the **relational structure**, we will **join these tables** to create a single dataset that integrates artworks, artist details, and image links. This dataset will serve as the foundation for **analysis and visualization** within the notebook.  
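
As a rough illustration of what joining these tables means, here is a minimal plain-TypeScript sketch of an inner join on shared IDs. The row shapes, the `joinArtworks` helper, and the sample rows are made up for illustration; the link table is assumed to hold one artist per object.

```typescript
// Hypothetical row shapes standing in for the CSV tables (not the real schemas).
interface ArtObject { objectid: number; title: string }
interface ArtistLink { objectid: number; constituentid: number }
interface Constituent { constituentid: number; name: string }
interface JoinedRow { objectid: number; title: string; artist: string }

// Inner join: an object is kept only if it has a link row AND a matching constituent.
function joinArtworks(
  objects: ArtObject[],
  links: ArtistLink[],
  constituents: Constituent[],
): JoinedRow[] {
  const nameById = new Map<number, string>();
  for (const c of constituents) nameById.set(c.constituentid, c.name);

  // Assumes one link per object; with duplicates, the last link wins.
  const artistByObject = new Map<number, number>();
  for (const l of links) artistByObject.set(l.objectid, l.constituentid);

  const joined: JoinedRow[] = [];
  for (const o of objects) {
    const constituentid = artistByObject.get(o.objectid);
    if (constituentid === undefined) continue; // no artist link
    const artist = nameById.get(constituentid);
    if (artist === undefined) continue; // link points at an unknown constituent
    joined.push({ objectid: o.objectid, title: o.title, artist });
  }
  return joined;
}

// Made-up rows: object 2 has no link, object 3 links to a missing constituent.
const sampleJoined = joinArtworks(
  [
    { objectid: 1, title: "Untitled A" },
    { objectid: 2, title: "Untitled B" },
    { objectid: 3, title: "Untitled C" },
  ],
  [
    { objectid: 1, constituentid: 10 },
    { objectid: 3, constituentid: 99 },
  ],
  [{ constituentid: 10, name: "Artist X" }],
);
```

Only object 1 survives the join here, which is the usual trade-off of inner joins: rows without a match on every key are dropped.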

## Wrangling the data

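We start by parsing `objects.csv` with the standard library's CSV parser. This
works, but the file is large and building up relationships between tables by hand
gets tedious, which motivates using Polars to treat the files as relational tables.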

In [None]:
// Parse the CSV with JSR's @std/csv and web-standard streams

import * as csv from "jsr:@std/csv@1.0.5";
import * as streams from "jsr:@std/streams@1.0.9";

let baseUrl = new URL(
    "https://github.com/NationalGalleryOfArt/opendata/raw/refs/heads/main/data/"
);

let response = await fetch(new URL("objects.csv", baseUrl));

let objects = await Array.fromAsync(
    response.body
        .pipeThrough(new TextDecoderStream())
        .pipeThrough(new csv.CsvParseStream({ skipFirstRow: true }))
        // Uncomment to parse only the first 100 rows (the full dataset takes a while)
        // .pipeThrough(new streams.LimitedTransformStream({ size: 100 }))
    ,
    // Keep a subset of columns; unary `+` coerces the CSV strings to numbers
    (row) => ({
        objectid: +row.objectid,
        title: row.title,
        beginyear: +row.beginyear,
        endyear: +row.endyear,
        timespan: row.visualbrowsertimespan,
        medium: row.medium,
        attribution: row.attribution,
        classification: row.visualbrowserclassification,
    })
);

objects.slice(0, 3)
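
One detail of the unary `+` coercion used above is that an empty string becomes `0`, not a missing value. Whether `objects.csv` actually contains blank year fields has not been checked here, so treat the following as a defensive sketch; `toNumberOrNull` is a hypothetical helper, not part of `@std/csv`.

```typescript
// Convert a CSV field to a number, mapping blank or non-numeric text to null
// instead of silently producing 0 or NaN.
function toNumberOrNull(field: string | undefined): number | null {
  if (field === undefined || field.trim() === "") return null;
  const n = Number(field);
  return Number.isNaN(n) ? null : n;
}

const blankCoerced = +"";                 // 0: a blank year would look like year 0
const blankSafe = toNumberOrNull("");     // null
const yearSafe = toNumberOrNull("1885");  // 1885
const junkSafe = toNumberOrNull("n.d.");  // null
```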

In [None]:
import * as pl from "npm:nodejs-polars@0.18.0";

let obs = pl.readRecords(objects)

In [None]:
// Same load with Polars: parse the CSV, then select and rename columns in one expression
import * as pl from "npm:nodejs-polars@0.18.0";

let response = await fetch(new URL("objects.csv", baseUrl));
let objects = pl.readCSV(await response.text(), { quoteChar: "\"" })
    .select(
        "objectid",
        "title",
        "beginyear",
        "endyear",
        pl.col("visualbrowsertimespan").alias("timespan"),
        "medium",
        "attribution",
        pl.col("visualbrowserclassification").alias("classification"),
    );

objects.head();

In [None]:
let response = await fetch(new URL("constituents.csv", baseUrl));
let constituents = pl.readCSV(await response.text(), { quoteChar: "\"" })
    .select(
        "constituentid",
        pl.col("forwarddisplayname").alias("name"),
        pl.col("visualbrowsernationality").alias("nationality"),
    );

constituents.head()

In [None]:
let response = await fetch(new URL("objects_constituents.csv", baseUrl));
let objectToArtist = pl.readCSV(await response.text(), { quoteChar: "\"" })
    .filter(pl.col("roletype").eq(pl.lit("artist")))
    .groupBy("objectid")
    .first("constituentid") // first artist listed for object
    .select(
        "objectid",
        "constituentid", 
        "role",
    )

objectToArtist.head()
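
The filter-then-first-per-group step above can be hard to picture, so here is a plain-TypeScript sketch of the same idea. It assumes rows arrive in the order they are listed in the file; `LinkRow` and `firstArtistPerObject` are hypothetical names, and the sample rows are made up.

```typescript
// Hypothetical shape for a row of the object-to-constituent link table.
interface LinkRow {
  objectid: number;
  constituentid: number;
  roletype: string;
  role: string;
}

// Keep only "artist" rows, then keep the first such row seen for each object.
function firstArtistPerObject(rows: LinkRow[]): LinkRow[] {
  const first = new Map<number, LinkRow>();
  for (const row of rows) {
    if (row.roletype !== "artist") continue;
    if (!first.has(row.objectid)) first.set(row.objectid, row);
  }
  // Map preserves insertion order, so objects come back in first-seen order.
  return Array.from(first.values());
}

const sampleLinks: LinkRow[] = [
  { objectid: 1, constituentid: 10, roletype: "donor", role: "donor" },
  { objectid: 1, constituentid: 11, roletype: "artist", role: "painter" },
  { objectid: 1, constituentid: 12, roletype: "artist", role: "printer" },
  { objectid: 2, constituentid: 13, roletype: "artist", role: "painter" },
];
const firstArtists = firstArtistPerObject(sampleLinks);
```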

In [None]:
let response = await fetch(new URL("published_images.csv", baseUrl));
let publishedImages = pl.readCSV(await response.text(), { quoteChar: "\"" })
    .select(
        pl.col("depictstmsobjectid").alias("objectid"),
        pl.col("uuid"),
        // pl.format("https://api.nga.gov/iiif/{}/full/full/0/default.jpg", pl.col("uuid")).alias("image_url"),
    )
publishedImages.head()
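
The `uuid` column is what turns a row into an image: IIIF Image API URLs follow the pattern `{identifier}/{region}/{size}/{rotation}/{quality}.{format}`. The sketch below builds such URLs against the `https://api.nga.gov/iiif/` base used elsewhere in this notebook; `iiifImageUrl` and its option names are hypothetical, and the UUID is a placeholder.

```typescript
// Options mirror the path segments of a IIIF Image API request.
interface IiifOptions {
  region?: string;   // e.g. "full"
  size?: string;     // e.g. "full", "max", or "!200,200" (fit within 200x200)
  rotation?: number; // degrees
  quality?: string;  // e.g. "default"
  format?: string;   // e.g. "jpg"
}

function iiifImageUrl(uuid: string, opts: IiifOptions = {}): string {
  const {
    region = "full",
    size = "full",
    rotation = 0,
    quality = "default",
    format = "jpg",
  } = opts;
  return `https://api.nga.gov/iiif/${uuid}/${region}/${size}/${rotation}/${quality}.${format}`;
}

// Placeholder identifier, not a real NGA image UUID.
const fullUrl = iiifImageUrl("example-uuid");
const thumbUrl = iiifImageUrl("example-uuid", { size: "!200,200" });
```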

In [None]:
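// Fetching the list takes a while, so the IDs are cached in a local file.
// Uncomment the next three lines to regenerate it.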
// let response = await fetch("https://www.nga.gov/bin/ngaweb/collection-search-result/search.pageSize__100000.pageNumber__1.lastFacet__artobj_downloadable.json?artobj_downloadable=Image_download_available");
// let data = await response.json();
// Deno.writeTextFileSync("public-domain-ids.txt", data.results.map(object => object.id).join("\n"));
let publicDomainIds = Deno.readTextFileSync("public-domain-ids.txt").split("\n").map(d => +d);
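
The one-liner above works for a file written with `join("\n")`, but if the cached file ever gains a trailing newline, `+""` would add a spurious ID `0`. A slightly more defensive parse, as a sketch (`parseIds` is a hypothetical helper), also returns a `Set` so membership checks are cheap:

```typescript
// Parse newline-separated integer IDs, skipping blank or malformed lines.
function parseIds(text: string): Set<number> {
  const ids = new Set<number>();
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // a trailing newline would otherwise become 0
    const id = Number(trimmed);
    if (Number.isInteger(id)) ids.add(id);
  }
  return ids;
}

const naiveIds = "12\n34\n".split("\n").map((d) => +d); // [12, 34, 0]
const safeIds = parseIds("12\n34\n");                   // Set {12, 34}
```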

In [None]:
// Join everything into one data frame and flag public-domain works

let df = publishedImages
    .join(objects, { on: "objectid" })
    .join(objectToArtist, { on: "objectid" })
    .join(constituents, { on: "constituentid" })
    .select(pl.exclude("constituentid"))
    .withColumns(pl.col("objectid").isIn(publicDomainIds).alias("is_public_domain"))

df.head()

## Interactive tables

In [None]:
import { widget } from "jsr:@anywidget/deno";
import * as base64 from "jsr:@std/encoding@1.0.7/base64";

function agGrid(df: pl.DataFrame) {
    return widget({
        state: {
            // TODO: Jupyter Widgets support binary data, but I'm not sure if it's implemented in Deno yet
            ipc: base64.encodeBase64(df.writeIPC()),
            _css: "https://esm.sh/ag-grid-community@33.0.4/styles/ag-grid.css"
        },
        imports: `
import * as agGrid from "https://esm.sh/ag-grid-community@33.0.4";
import * as flech from "https://esm.sh/@uwdata/flechette@1.1.2";
import * as base64 from "https://esm.sh/jsr/@std/encoding@1.0.7/base64";
    `,
        // @ts-expect-error - function body is serialized to the front end with imports from above
        render: ({ model, el }) => {
            agGrid.ModuleRegistry.registerModules([agGrid.AllCommunityModule]);
            el.style.height = "400px";
            let bytes = base64.decodeBase64(model.get("ipc"));
            let table = flech.tableFromIPC(bytes);
            agGrid.createGrid(el, {
                columnDefs: table.names.map(field => ({ field })),
                rowData: table.toArray(),
                pagination: true,
            });
        },
    });
}

function quak(df: pl.DataFrame) {
    return widget({
        // TODO: Jupyter Widgets support binary data, but I'm not sure if it's implemented in Deno yet
        state: { parquet: base64.encodeBase64(df.writeParquet()) },
        imports: `
import * as mosaic from "https://esm.sh/@uwdata/mosaic-core@~0.11?bundle";
import * as base64 from "https://esm.sh/jsr/@std/encoding@1.0.7/base64";
import * as quak from "https://esm.sh/jsr/@manzt/quak@0.0.2";
    `,
        // @ts-expect-error - function body is serialized to the front end with imports from above
        render: async ({ model, el }) => {
            let connector = mosaic.wasmConnector();
            let db = await connector.getDuckDB();
            let coordinator = new mosaic.Coordinator();
            coordinator.databaseConnector(connector);

            let bytes = base64.decodeBase64(model.get("parquet"));
            await db.registerFileBuffer("df.parquet", bytes);
            await coordinator.exec([`CREATE OR REPLACE TABLE "df" AS SELECT * FROM "df.parquet"`])
            
            let dt = await quak.datatable("df", { coordinator, height: 400 });

            // Wrap the table in a fixed-height container before attaching it to the output
            let div = document.createElement("div");
            div.style.height = "435px";
            div.appendChild(dt.node());

            el.appendChild(div);
        },
    });
}
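
Both widgets above ship the data frame to the front end as a base64 string in widget state. Base64 encodes every 3 bytes as 4 characters, so the payload grows by roughly a third. A back-of-the-envelope sketch (`base64Length` is a hypothetical helper, and the byte counts are arbitrary):

```typescript
// Length in characters of the padded base64 encoding of `byteLength` bytes.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

const oneMegabyte = 1_000_000;
const encodedChars = base64Length(oneMegabyte); // 1_333_336
const overhead = encodedChars / oneMegabyte;     // about 1.33
```

This is one reason sending raw binary buffers would be preferable once that path is confirmed to work.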


In [None]:
// ag-grid seems to break down with >10,000 rows
agGrid(df.head(100))

In [None]:
// quak handles much larger tables (the data stays as compressed Parquet in the front end)
quak(
    df
        .select(pl.exclude("objectid", "uuid"))
        .head(50_000)
)

## Plotting

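We plot with Observable Plot. Plot renders into a browser `document`, which Deno
does not provide, so we supply one with `linkedom`. The cells below look at the
collection from several angles as part of exploratory data analysis.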

In [None]:
import * as Plot from "npm:@observablehq/plot";
import * as linkedom from "npm:linkedom";

// Plot requires a `document` instance for each plot, which we need to fill in Deno...
function Document() {
    return linkedom.parseHTML("<html></html>").document;
}

let records = df.toRecords();

Plot.plot({
  color: { legend: true },
  marks: [
    Plot.barY(
      records,
      Plot.groupX(
        { y: "count" },
        { x: "classification", sort: { x: "-y" }, fill: "is_public_domain" }
      )
    )
  ],
  marginLeft: 125,
  width: 1000,
  document: new Document()
})
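
For readers new to Plot's group transform, the counting part of the chart above is roughly equivalent to the plain-TypeScript sketch below: count rows per category, then sort categories by descending count. It ignores the stacking by `fill`; `countBy` is a hypothetical helper and the rows are made up.

```typescript
// Count rows per key and return [key, count] pairs, largest count first.
function countBy<T>(rows: T[], key: (row: T) => string): [string, number][] {
  const counts = new Map<string, number>();
  for (const row of rows) {
    const k = key(row);
    counts.set(k, (counts.get(k) ?? 0) + 1);
  }
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}

const sampleRows = [
  { classification: "print" },
  { classification: "painting" },
  { classification: "print" },
  { classification: "drawing" },
  { classification: "print" },
  { classification: "painting" },
];
const classificationCounts = countBy(sampleRows, (r) => r.classification);
// [["print", 3], ["painting", 2], ["drawing", 1]]
```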

In [None]:
Plot.plot({
  color: { legend: true },
  marks: [
    Plot.barX(
      records,
      Plot.groupY(
        { x: "count" },
        { y: "nationality", sort: { y: "-x" }, fill: "is_public_domain" }
      )
    )
  ],
  marginLeft: 125,
  document: new Document()
})

In [None]:
Plot.plot({
  color: { legend: true },
  marks: [
    Plot.barX(
      df
        .groupBy("attribution", "is_public_domain")
        .len()
        .sort("attribution_count", true)
        .head(20)
        .toRecords(),
      { x: "attribution_count", y: "attribution", sort: { y: "-x" }, fill: "is_public_domain"  }
    )
  ],
  marginLeft: 200,
  document: new Document()
})

In [None]:
let arts = df
    .groupBy("attribution", "classification", "is_public_domain")
    .len()
    .select(
        pl.col("attribution"),
        pl.col("classification"),
        pl.col("is_public_domain"),
        pl.col("attribution_count").alias("count")
    )
    .sort({ by: "count", descending: true })
    .filter(pl.col("classification").eq(pl.lit("painting")))
    .head(20);

Plot.plot({
  color: { legend: true },
  marks: [
    Plot.barX(arts.toRecords(), {
        x: "count",
        y: "attribution",
        fill: "is_public_domain",
        sort: { y: "-x" },
    }),
  ],
  marginLeft: 175,
  document: new Document()
})

In [None]:
import * as d3 from "npm:d3"; // scaleFreeY below relies on d3.scalePoint and d3.sort

let counts = objects
    .groupBy("attribution", "classification")
    .len()
    .select(
        pl.col("attribution"),
        pl.col("classification"),
        pl.col("attribution_count").alias("count")
    )
    .sort({ by: "count", descending: true });
    

let groups = pl.concat(
    ["drawing", "print", "photograph", "painting"].map(name => 
        counts.filter(pl.col("classification").eq(pl.lit(name))).head(20)
    )
)

let scaleFreeY = (options) =>
  Plot.initializer(options, (data, facets, channels, scales, dimensions) => {
    let {y: { value: Y, scale }} = channels;
    for (let index of facets) {
      let y = d3.scalePoint(d3.sort(Array.from(index, (i) => Y[i])), [
          dimensions.marginTop,
          dimensions.height - dimensions.marginBottom
        ])
        .padding(0.5);
      for (let i of index) Y[i] = y(Y[i]);
    }
    return { data, facets, channels: { y: { value: Y } } };
  });
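
The initializer above relies on a point scale to place each facet's categories independently. As a sketch of what such a scale computes, the function below spaces `n` points evenly across a pixel range with outer padding, following what I understand to be the `d3.scalePoint` formula with the default centered alignment. `pointScale` is a hypothetical stand-in, not the d3 implementation.

```typescript
// Map each domain value to an evenly spaced position within [r0, r1].
// `padding` is the outer padding, expressed as a fraction of the step.
function pointScale(
  domain: string[],
  range: [number, number],
  padding = 0.5,
): Map<string, number> {
  const [r0, r1] = range;
  const n = domain.length;
  const step = (r1 - r0) / Math.max(1, n - 1 + 2 * padding);
  // Center the points within the range.
  const start = r0 + (r1 - r0 - step * (n - 1)) * 0.5;
  const positions = new Map<string, number>();
  domain.forEach((d, i) => positions.set(d, start + step * i));
  return positions;
}

const yPositions = pointScale(["a", "b", "c", "d"], [0, 100]);
// a: 12.5, b: 37.5, c: 62.5, d: 87.5
```

With a padding of 0.5, each point sits at the center of an equal-width band, which is what makes it suitable for positioning bars.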

In [None]:
Plot.plot({
  y: { grid: true },
  color: { legend: true },
  marks: [
    Plot.rectY(
        df
            .filter(pl.col("beginyear").gt(1_400))
            .toRecords(),
        Plot.binX({y: "count"}, {x: "beginyear", fill: "classification", fy: "is_public_domain"})
    ),
    Plot.ruleY([0])
  ],
  marginLeft: 100,
  width: 1200,
  height: 400,
  document: new Document(),
})
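
The histogram above lets Plot's bin transform choose its own thresholds. To show the underlying idea, here is a simplified sketch that uses fixed-width bins instead, which is an assumption and not how Plot picks bins. `binCounts` is a hypothetical helper and the years are made up.

```typescript
// Count values per fixed-width bin, keyed by each bin's lower edge.
function binCounts(values: number[], binWidth: number): Map<number, number> {
  const bins = new Map<number, number>();
  for (const v of values) {
    const start = Math.floor(v / binWidth) * binWidth;
    bins.set(start, (bins.get(start) ?? 0) + 1);
  }
  return bins;
}

const yearBins = binCounts([1401, 1449, 1450, 1999], 50);
// 1400 -> 2, 1450 -> 1, 1950 -> 1
```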

In [None]:
Plot.plot({
  color: { legend: true },
  marks: [
    Plot.waffleY(
      df.toRecords(),
      Plot.groupZ({y: "count"}, {fx: "classification", fill: "is_public_domain", unit: 300, sort: {fx: "-y"} })
    ),
    Plot.ruleY([0])
  ],
  width: 1000,
  document: new Document(),
})

In [None]:
import * as React from "npm:react";
import { renderToString } from "npm:react-dom/server";

function render(reactNode) {
  return {
    [Deno.jupyter.$display]() {
      return {
        "text/html": renderToString(reactNode),
      }
    },
  };
}

function Gallery({ objects, size = 150 }) {
  return (
    <div
      style={{
        display: "grid",
        gridTemplateColumns: `repeat(auto-fill, minmax(${size}px, 1fr))`,
        gap: "4px",
      }}
    >
      {objects
        .select("objectid", "uuid", "title", "is_public_domain")
        .map(([objectid, uuid, title, publicDomain]) => (
          <div key={objectid} style={{ position: "relative", textAlign: "center" }}>
            <a
              href={`https://www.nga.gov/collection/art-object-page.${objectid}.html`}
              target="_blank"
              rel="noopener noreferrer"
              style={{
                display: "block",
                width: `${size}px`,
                height: `${size}px`,
                position: "relative",
              }}
            >
              <img
                src={`https://api.nga.gov/iiif/${uuid}/full/!200,200/0/default.jpg`}
                alt={title}
                style={{
                  width: "100%",
                  height: "100%",
                  objectFit: "cover",
                  borderRadius: "4px",
                }}
              />
              {publicDomain && (
                <img
                  src="https://mirrors.creativecommons.org/presskit/icons/cc.svg"
                  alt="Public Domain"
                  style={{ position: "absolute", bottom: "3px", right: "3px", width: "24px", height: "24px" }}
                />
              )}
            </a>
            <a
              href={`https://api.nga.gov/iiif/${uuid}/full/max/0/default.jpg`}
              target="_blank"
              rel="noopener noreferrer"
              style={{ fontSize: "12px", display: "block", marginTop: "4px", color: "#555", textDecoration: "none" }}
            >
              full size
            </a>
          </div>
        ))}
    </div>
  );
}

let subset = df
    .filter(pl.col("is_public_domain"))
    .filter(pl.col("classification").eq(pl.lit("painting")))

render(<Gallery objects={subset.sample(100)} />);
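
The `render` helper above works because the kernel looks for a method keyed by `Deno.jupyter.$display` and uses the MIME bundle it returns. The sketch below shows the same pattern without React. It assumes that `Deno.jupyter.$display` is the registered symbol `Symbol.for("Jupyter.display")`, which lets the example run outside a kernel; `htmlList` is a hypothetical helper that expects trusted, pre-escaped strings.

```typescript
// Assumption: this matches Deno.jupyter.$display inside the Deno kernel.
const $display: unique symbol = Symbol.for("Jupyter.display");

// Wrap a list of trusted strings in an object the kernel could display as HTML.
function htmlList(items: string[]) {
  return {
    [$display]() {
      const body = items.map((item) => `<li>${item}</li>`).join("");
      // A MIME bundle: the front end picks the richest type it supports.
      return {
        "text/html": `<ul>${body}</ul>`,
        "text/plain": items.join("\n"),
      };
    },
  };
}

const listBundle = htmlList(["painting", "print"])[$display]();
```

Returning a `text/plain` entry alongside the HTML gives front ends without HTML support something to fall back on.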

In [None]:
let response = await fetch(new URL("published_images.csv", baseUrl));
let objects2 = pl.readCSV(await response.text(), { quoteChar: "\"" })
objects2

In [None]:
render(<Gallery objects={df.filter(pl.col("attribution").eq(pl.lit("Winslow Homer")))} />)

In [None]:

objects.withColumns(
    pl.col("objectid").isIn(pl.lit(publicDomainIds)).alias("is_public_domain")
)