Headless DOM manipulation toolkit, featuring an HTML/XML parser written in Rust and TypeScript implementations of standard DOM APIs. Purpose-built for static-site generation, web scraping, and other server-side workflows.

nberlette/dawm

Portable high-speed DOM parser for HTML/XML written in Rust,
paired with a subset of DOM APIs implemented in TypeScript, suitable
for use in any ES2015+ JavaScript runtime with WebAssembly support.


Introduction

dawm is a headless DOM toolkit for parsing, traversing, manipulating, and serializing HTML/SVG/XML documents in server-side / serverless / edge environments. Its hybrid codebase couples a high-speed DOM parser written in Rust with TypeScript implementations of many of the standard Document Object Model (DOM) APIs.

The API mirrors the browser DOM, so anyone with frontend development experience can adopt dawm with little to learn and minimal friction.

Install

deno add npm:dawm
pnpm add dawm
yarn add dawm
bun add dawm
npm i dawm

Usage

import { parseHTML, type ParseOptions } from "dawm";
import assert from "node:assert";

const options = {
  allowScripts: false, // preserves <noscript> hierarchy (default: false)
  contentType: "text/html", // use "application/xml" for XML parsing
  exactErrors: false, // default error handling mode
  quirksMode: "no-quirks", // default quirks mode
  dropDoctype: false, // strip doctype from output? (default: false)
  iframeSrcdoc: false, // set to true when parsing iframe srcdoc content
  contextElement: null, // default context element name (for fragments)
} satisfies ParseOptions;

// The `parseHTML` function is also available via standard DOM APIs:
// - `new DOMParser().parseFromString(string, "text/html")`
// - `Document.parseHTML(string, options)`
const doc = parseHTML(
  "<!doctype html><html><body><h1>Hello, world!</h1></body></html>",
  options,
);

const h1 = doc.body?.firstElementChild;
assert.strictEqual(h1?.tagName, "H1");
assert.strictEqual(h1.textContent, "Hello, world!");
assert.strictEqual(h1.parentNode, doc.body);

Overview

The dawm project targets server-side and edge compute scenarios where performance and portability are paramount. Typical uses include SSR frameworks, web scrapers, and unit tests for frontend code that do not justify a full-blown DOM implementation like JSDOM.

It is designed to be fast in two senses: in raw parsing performance, and in how quickly it can be adopted and integrated into your workflows.

Features

Familiar API Surface

dawm ships TypeScript implementations of standard DOM APIs like Document, Element, and Attr, so the learning curve is minimal.

This saves you from having to learn another framework-specific API just to manipulate HTML/XML documents. If you've done any frontend web development before, you can start using dawm in your server-side workflows right away.

High Performance

At the core of dawm is an HTML/XML parser written in Rust and compiled to WebAssembly. It is built to process large documents efficiently, so DOM-heavy workloads stay fast.

Security First

The parser runs in a sandboxed WASM environment, which isolates the host application from the untrusted content being parsed. Scripts are never executed: the parser automatically strips them out of the source document.

Standards Compliant

Built on top of the html5ever crate from the Servo project, the parser is fully compliant with the HTML5 parsing algorithm as defined by the WHATWG specification.

Zero Dependencies

Designed to be lightweight and portable, dawm comes with zero external dependencies (see footnote 1). This makes it easy to integrate into any project without worrying about dependency conflicts or bloat.

Polylingual and Portable

The dawm parser is capable of parsing HTML, XML, SVG, and MathML documents, as well as HTML fragments. This makes it a versatile choice for a wide range of applications. Furthermore, it's designed to be highly portable and compatible with any modern WASM-friendly runtime, including Deno, Bun, Node, and Cloudflare Workers.


Examples

For more examples, check out the ./examples directory on GitHub.

Basic HTML Parsing

import { Document, type ParseOptions } from "dawm";
import assert from "node:assert";

const options = {
  allowScripts: false, // preserves <noscript> hierarchy (default: false)
  contentType: "text/html", // use "application/xml" for XML parsing
  exactErrors: false, // default error handling mode
  quirksMode: "no-quirks", // default quirks mode
  dropDoctype: false, // strip doctype from output? (default: false)
  iframeSrcdoc: false, // set to true when parsing iframe srcdoc content
  contextElement: null, // default context element name (for fragments)
} satisfies ParseOptions;

const doc = Document.parseHTML(
  "<!doctype html><html><head><title>foobar</title></head>" +
    "<body><h1>Hello, world!</h1></body></html>",
  options,
);

assert.strictEqual(doc.title, "foobar");
assert.strictEqual(doc.head?.nextSibling, doc.body);
const title = doc.head?.firstElementChild;
assert.strictEqual(title?.textContent, "foobar");

const h1 = doc.body?.firstElementChild;
assert.strictEqual(h1?.tagName, "H1");
assert.strictEqual(h1.textContent, "Hello, world!");
assert.strictEqual(h1.parentNode, doc.body);

CDN Usage (via esm.sh)

ES Module (recommended)

import * as dom from "https://esm.sh/dawm?bundle&dts";

UMD Module

<script src="https://esm.sh/dawm/global?bundle"></script>
<script>
  const { dawm } = globalThis;

  const options = {
    contentType: "text/html", // default mime type
    exactErrors: false, // default error mode
    quirksMode: "no-quirks", // default quirks mode
    dropDoctype: false, // strip doctype from output
    iframeSrcdoc: false, // set to true when parsing iframe srcdoc contents
    allowScripts: false, // set to true when scripting is enabled
  };

  const doc = dawm.parseHTML(
    "<!DOCTYPE html><html><body><h1>Hello, world!</h1></body></html>",
    options,
  );

  console.log(doc.body.firstElementChild.textContent); // "Hello, world!"
</script>

DOM APIs

The following is a non-exhaustive list of the implementation status of DOM APIs in the current iteration of the dawm library. Items that are checked off are fully implemented and functional; unchecked items are either on the roadmap for future implementation, or were deemed out of scope for this library (those items are noted as such).

Core

Collections

Parsing

Serialization

Traversal & Manipulation

Node

Element

Document


API

You can import from the root dawm package or from the scoped module paths shown below (e.g. import { parseHTML } from "dawm/parse").

dawm/parse

DOM-first helpers that wrap the low-level WebAssembly parser and return fully hydrated tree instances.

parseDocument

Signature
parseDocument(input: string, options?: ParseOptions | null): Document;
parseDocument(
  input: string,
  mimeType: string,
  options?: ParseOptions | null,
): Document;
Example
import { parseDocument } from "dawm";

const xml = `<note><to>Codex</to><from>dawm</from></note>`;
const doc = parseDocument(xml, "application/xml");
console.log(doc.documentElement?.nodeName); // "note"

parseFragment

Signature
parseFragment(
  input: string,
  options: FragmentParseOptions | null,
): DocumentFragment;
parseFragment(
  input: string,
  contextElement: string,
  options?: ParseOptions | null,
): DocumentFragment;
Example
import { parseFragment } from "dawm/parse";

const frag = parseFragment("<li>Two</li>", "ul");
console.log(frag.firstChild?.textContent); // "Two"

parseHTML

Signature
parseHTML(input: string, options?: ParseOptions | null): Document;
Example
import { parseHTML } from "dawm";

const doc = parseHTML("<!doctype html><html><body><h1>Hi</h1></body></html>");
console.log(doc.querySelector("h1")?.textContent); // "Hi"

parseXML

Signature
parseXML(input: string, options?: ParseOptions | null): Document;
Example
import { parseXML } from "dawm/parse";

const svg = `<svg xmlns="http://www.w3.org/2000/svg"><rect width="10"/></svg>`;
const doc = parseXML(svg);
console.log(doc.documentElement?.nodeName); // "svg"

dawm/serialize

Utilities for turning DOM nodes and collections back into strings.

serializeHTML

Signature
serializeHTML<T extends Node | Attr>(node: T | ArrayLike<T> | Iterable<T>): string;
Example
import { parseHTML, serializeHTML } from "dawm";
import assert from "node:assert";

const html = "<!doctype html><p data-msg='hi'>Hello</p>";
const doc = parseHTML(html);
const out = serializeHTML(doc.body!.firstChild!);
assert.strictEqual(out, '<p data-msg="hi">Hello</p>');

serializeDOMStringMap

Serializes a DOMStringMap into a string of valid HTML data-* attributes.

Note

This is used internally by the higher-level serialization APIs, including Element.outerHTML, Element.innerHTML, and XMLSerializer.

Signature
serializeDOMStringMap(dataset: DOMStringMap): string;
Example
import { Document, serializeDOMStringMap } from "dawm";

const el = new Document().createElement("div");
el.dataset.helloWorld = "true";
console.log(serializeDOMStringMap(el.dataset)); // ' data-hello-world="true"'
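
The naming rule behind this output comes from the HTML dataset convention: each camelCase key becomes a kebab-case data-* attribute. The following is a minimal, self-contained sketch of that rule, not dawm's actual implementation; the datasetToAttributes helper and its escaping choices are hypothetical.

```typescript
// Hypothetical helper illustrating the dataset naming rule; NOT dawm's code.
function datasetToAttributes(dataset: Record<string, string>): string {
  let out = "";
  for (const [key, value] of Object.entries(dataset)) {
    // camelCase key -> kebab-case attribute name: helloWorld -> data-hello-world
    const name = "data-" + key.replace(/[A-Z]/g, (ch) => "-" + ch.toLowerCase());
    // escape the characters that would break a double-quoted attribute value
    const escaped = value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
    // each attribute carries a leading space so it can follow a tag name directly
    out += ` ${name}="${escaped}"`;
  }
  return out;
}

console.log(datasetToAttributes({ helloWorld: "true" })); // ' data-hello-world="true"'
```

The leading space mirrors the output shown above, which lets the result be appended straight after a tag name when building an opening tag.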

serializeNamedNodeMap

Serializes a NamedNodeMap into a string of valid HTML/XML attributes.

Note

This is used internally by the higher-level serialization APIs, including Element.outerHTML, Element.innerHTML, and XMLSerializer.

Signature
serializeNamedNodeMap(attrs: NamedNodeMap): string;
Example
import { parseHTML, serializeNamedNodeMap } from "dawm";

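// HTML attribute names are lowercased by the parser, so use the real
// `aria-hidden` attribute name rather than the `ariaHidden` property name.
const doc = parseHTML("<div id='app' aria-hidden='false'></div>");
// the <div> lives inside <body>; `doc.firstElementChild` would be <html>
console.log(serializeNamedNodeMap(doc.body!.firstElementChild!.attributes));
// ' id="app" aria-hidden="false"'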

dawm/select

CSS-selector utilities powered by the parsel-js engine, which is vendored into this package for convenience and portability (see footnote 1).

querySelector

Signature
querySelector<T extends Node>(node: Node, selector: string): T | null;
Example
import { parseHTML, querySelector } from "dawm";

const doc = parseHTML("<section><h2 class='title'>Docs</h2></section>");
const heading = querySelector(doc, ".title");
console.log(heading?.textContent); // "Docs"

querySelectorAll

Signature
querySelectorAll<T extends Node>(node: Node, selector: string): T[];
Example
import { parseHTML } from "dawm/parse";
import { querySelectorAll } from "dawm/select";

const doc = parseHTML("<ul><li>A</li><li>B</li></ul>");
const items = querySelectorAll(doc, "li");
console.log(items.map((n) => n.textContent)); // ["A", "B"]

matches

Signature
matches(node: Node, selector: string): boolean;
Example
import { matches, parseHTML, querySelector } from "dawm";

const doc = parseHTML("<div id='app' class='card'></div>");
const el = querySelector(doc, "#app")!;
console.log(matches(el, ".card")); // true

select

Signature
select(node: Node, match: Matcher, opts?: { single?: boolean }): Node[];
Example
import { parseHTML } from "dawm/parse";
import { type Matcher, select } from "dawm/select";

const doc = parseHTML("<main><p>One</p><p>Two</p></main>");
const paragraphs = select(doc, ((n) => n.nodeName === "P") as Matcher);
console.log(paragraphs.length); // 2

walk

Signature
walk(
  node: Node,
  callback: (node: Node, parent?: Node | null, index?: number) => void | Promise<void>,
  parent?: Node | null,
): AsyncGenerator<Node, void, number>;
Example
import { parseHTML } from "dawm/parse";
import { walk } from "dawm/select";

const doc = parseHTML("<div><b>hi</b><i>there</i></div>");
for await (
  const node of walk(doc, async (_n) => {
    // async-safe traversal
  })
) {
  // nodes yielded in document order
}

walkSync

Signature
walkSync(
  node: Node,
  callback: (node: Node, parent?: Node | null, index?: number) => void,
  parent?: Node | null,
): void;
Example
import { parseHTML } from "dawm/parse";
import { walkSync } from "dawm/select";

const doc = parseHTML("<div><b>hi</b><i>there</i></div>");
walkSync(doc, (node) => {
  if (node.nodeType === 1) console.log(node.nodeName);
});
// logs HTML, HEAD, BODY, DIV, B, I (the HTML parser adds the implied wrapper elements)

traverseSync

Signature
traverseSync<TNode extends Node = Node, TParent extends Node | null = TNode | null>(
  node: Node,
  test: (node: Node, parent?: TParent, index?: number) => node is TNode,
  parent?: TParent,
): Generator<TNode, void, number>;
Example
import { type Element } from "dawm";
import { parseHTML } from "dawm/parse";
import { traverseSync } from "dawm/select";

const doc = parseHTML("<ul><li>A</li><li>B</li></ul>");
for (const li of traverseSync(doc, (n): n is Element => n.nodeName === "LI")) {
  console.log(li.textContent);
}
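
To show how a type-predicate test lets a generator yield narrowed nodes, here is a self-contained sketch over a tiny hand-built tree. It is not dawm's implementation: TinyNode, TinyItem, and traverseSketch are hypothetical stand-ins, and the pre-order visiting order is an assumption.

```typescript
// Hypothetical minimal tree types; dawm's real Node classes are much richer.
interface TinyNode {
  nodeName: string;
  text?: string;
  children: TinyNode[];
}
interface TinyItem extends TinyNode {
  nodeName: "LI";
  text: string;
}

// Depth-first, pre-order traversal that yields only nodes passing the test.
// Because `test` is a type predicate, yielded values are typed as T.
function* traverseSketch<T extends TinyNode>(
  node: TinyNode,
  test: (n: TinyNode) => n is T,
): Generator<T, void, undefined> {
  if (test(node)) yield node;
  for (const child of node.children) yield* traverseSketch(child, test);
}

const tree: TinyNode = {
  nodeName: "UL",
  children: [
    { nodeName: "LI", text: "A", children: [] },
    { nodeName: "LI", text: "B", children: [] },
  ],
};

const isItem = (n: TinyNode): n is TinyItem => n.nodeName === "LI";
const texts = Array.from(traverseSketch(tree, isItem)).map((li) => li.text);
console.log(texts); // ["A", "B"]
```

Because the generator is lazy, a caller can stop iterating after the first match without visiting the rest of the tree.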

specificity

Signature
specificity(selector: string): number;
Example
import { specificity } from "dawm/select";

const score = specificity("#app .card > h2");
console.log(score); // numeric specificity score

Default export

The ./select entrypoint default-exports querySelectorAll for convenience:

Signature
import querySelectorAll from "dawm/select";

dawm/options

Helpers for normalizing parser options.

normalizeParseOptions

Signature
normalizeParseOptions(options?: string | ParseOptions | null): NormalizedParseOptions;
Example
import { normalizeParseOptions } from "dawm/options";

const opts = normalizeParseOptions({
  allowScripts: true,
  contentType: "text/html",
});
console.log(opts.quirksMode); // "no-quirks" (defaulted)
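
The normalization step can be pictured as merging user input over a table of defaults. The sketch below is self-contained and is not dawm's implementation: the default values are taken from the option comments earlier in this README, while SketchParseOptions, normalizeSketch, and the choice to treat a bare string as a content type are assumptions made for illustration.

```typescript
// Hypothetical option shape modeled on the options shown in this README.
interface SketchParseOptions {
  allowScripts?: boolean;
  contentType?: string;
  exactErrors?: boolean;
  quirksMode?: string;
  dropDoctype?: boolean;
  iframeSrcdoc?: boolean;
  contextElement?: string | null;
}

type SketchNormalized = Required<SketchParseOptions>;

// Defaults as documented in the usage examples above.
const SKETCH_DEFAULTS: SketchNormalized = {
  allowScripts: false,
  contentType: "text/html",
  exactErrors: false,
  quirksMode: "no-quirks",
  dropDoctype: false,
  iframeSrcdoc: false,
  contextElement: null,
};

function normalizeSketch(
  options?: string | SketchParseOptions | null,
): SketchNormalized {
  // Assumption: a bare string is shorthand for the content type.
  if (typeof options === "string") {
    return { ...SKETCH_DEFAULTS, contentType: options };
  }
  // null and undefined both fall back to the defaults.
  return { ...SKETCH_DEFAULTS, ...(options ?? {}) };
}

const sketchOpts = normalizeSketch({ allowScripts: true });
console.log(sketchOpts.quirksMode); // "no-quirks" (defaulted)
```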

normalizeFragmentOptions

Signature
normalizeFragmentOptions(
  options?: FragmentParseOptions | string | null,
): NormalizedFragmentParseOptions;
Example
import { normalizeFragmentOptions } from "dawm/options";

const opts = normalizeFragmentOptions({ contextElement: "template" });
console.log(opts.contextElement); // "template"

dawm/types

Runtime enums and aliases re-exported from the DOM layer.

NodeType, QuirksMode, QuirksModeType

import { NodeType, QuirksMode, type QuirksModeType } from "dawm/types";

const elementNode = NodeType.Element; // 1
const quirks: QuirksModeType = QuirksMode.NoQuirks; // "no-quirks"

Advanced APIs

The modules below expose low-level APIs for advanced users, library authors, and contributors looking to build on top of dawm. You probably won't need to use these directly in most scenarios.

The data returned from most of these functions is "dehydrated" and requires secondary string-resolution and linking steps via resolveStrings and buildSubtree or buildDocumentTree.
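
As a rough picture of what "dehydrated" means here, the sketch below resolves numeric indices against a shared string table. It is self-contained and is not dawm's implementation: the SketchWireNode shape is a simplified, hypothetical version of the wire structures shown in the examples further down, and treating nodeValue as a string index is an assumption.

```typescript
// Hypothetical, simplified wire shapes; the real WireNode has more fields.
interface SketchWireAttr {
  name: number;
  value: number;
}
interface SketchWireNode {
  nodeType: number;
  nodeName: number;
  nodeValue: number | null;
  attributes: SketchWireAttr[];
}
interface SketchResolvedNode {
  nodeType: number;
  nodeName: string;
  nodeValue: string | null;
  attributes: { name: string; value: string }[];
}

// Look up one index in the string table, failing loudly on a bad index.
function lookup(strings: readonly string[], index: number): string {
  const s = strings[index];
  if (s === undefined) throw new RangeError(`string index ${index} out of bounds`);
  return s;
}

// Replace every numeric string reference with the string it points to.
function resolveSketch(
  node: SketchWireNode,
  strings: readonly string[],
): SketchResolvedNode {
  return {
    nodeType: node.nodeType,
    nodeName: lookup(strings, node.nodeName),
    nodeValue: node.nodeValue === null ? null : lookup(strings, node.nodeValue),
    attributes: node.attributes.map((a) => ({
      name: lookup(strings, a.name),
      value: lookup(strings, a.value),
    })),
  };
}

const sketchStrings = ["", "html", "class", "w-screen h-screen"];
const sketchResolved = resolveSketch(
  { nodeType: 1, nodeName: 1, nodeValue: null, attributes: [{ name: 2, value: 3 }] },
  sketchStrings,
);
console.log(sketchResolved.nodeName); // "html"
console.log(sketchResolved.attributes[0]?.value); // "w-screen h-screen"
```

Storing each distinct string once and referring to it by index keeps repeated tag and attribute names from being copied for every node that crosses the WASM boundary.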

dawm/wasm

Low-level WebAssembly bindings for the Rust-based HTML and XML parsers.

Signature
parse_doc(input: string, mime: string, options?: object | null): WireDoc;
parse_html(input: string, options?: object | null): WireDoc;
parse_xml(input: string, options?: object | null): WireDoc;
parse_frag(input: string, options: object): WireDoc;

Warning

These return raw "wire" structures that are not optimized for human-usability. Unless you know what you're doing and have a specific reason to use these APIs, you'd probably be better off with higher-level parse* APIs instead.

Example
import { buildDocumentTree, dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<em>raw</em>", null);
const doc = buildDocumentTree(toWireDoc(wire));
console.log(doc.body?.firstChild?.nodeName); // "EM"

dawm/tree

Utilities for turning raw WASM parser output into DOM objects (and vice versa).

buildDocumentTree

Signature
buildDocumentTree(document: WireDoc): Document;
Example
import { buildDocumentTree, dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<p>Hi</p>", null);
const doc = buildDocumentTree(toWireDoc(wire));
console.log(doc.body?.firstChild?.textContent); // "Hi"

buildSubtree

Signature
buildSubtree(
  node: WireNode | ResolvedWireNode,
  parent?: Node | null,
  prev?: Node | null,
  next?: Node | null,
  context?: { /* internal tree-building context */ },
): Node;
Example
import { buildSubtree, resolveStrings } from "dawm/tree";
import { type WireNode } from "dawm/wire";

const wireNode: WireNode = {
  id: 1,
  nodeType: 1,
  nodeName: 0,
  nodeValue: null,
  parentNode: null,
  firstChild: null,
  nextSibling: null,
  attributes: [],
};
const strings = ["div"];
const resolved = resolveStrings(wireNode, strings);
const element = buildSubtree(resolved, null);
console.log(element.nodeName); // "div"

resolveQuirksMode

Signature
resolveQuirksMode(mode: number | string | null | undefined): QuirksModeType;
Example
import { resolveQuirksMode } from "dawm/tree";

console.log(resolveQuirksMode("limited-quirks")); // "limited-quirks"

dawm/wire

Type guards and helpers for working with the serialized "wire" structures.

resolveStrings

Signature
resolveStrings(node: WireDoc): ResolvedWireDoc;
resolveStrings(node: WireNode, strings: string[]): ResolvedWireNode;
resolveStrings(node: WireAttr, strings: string[]): ResolvedWireAttr;
Example
import { resolveStrings } from "dawm/wire";

const resolved = resolveStrings({
  contentType: "text/html",
  quirksMode: 2,
  strings: ["", "html", "class", "data-value", "w-screen h-screen"],
  nodes: [{
    id: 1,
    nodeType: 1, // element node (it carries attributes)
    nodeName: 1,
    nodeValue: null,
    attributes: [{ name: 2, value: 4 }, { name: 3, value: 0 }],
  }],
});

console.log(resolved.nodes[0].nodeName); // "html"

toWireDoc

Converts an unknown value into a WireDoc, throwing if the value does not conform to the expected shape. This is useful (and used internally) for ensuring type safety when dealing with raw parser output.

Signature
toWireDoc(value: unknown): WireDoc;
Example
import { dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<p>Wire</p>", null);
// throws if the parser returned an unexpected shape
const safe = toWireDoc(wire);

Guard functions

isWireDoc(value: unknown): value is WireDoc;
isWireNode(value: unknown): value is WireNode;
isWireAttr(value: unknown): value is WireAttr;
isResolvedWireDoc(value: unknown): value is ResolvedWireDoc;
isResolvedWireNode(value: unknown): value is ResolvedWireNode;
isResolvedWireAttr(value: unknown): value is ResolvedWireAttr;
isNodeLike(value: unknown): value is NodeLike;
Example
import { isNodeLike, isWireDoc } from "dawm/wire";

function inspect(value: unknown) {
  if (isWireDoc(value)) console.log("wire document");
  else if (isNodeLike(value)) console.log("dom-like node");
}
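
The relationship between a guard such as isWireDoc and a throwing converter such as toWireDoc can be sketched as follows. This is a self-contained illustration under stated assumptions, not dawm's implementation: SketchDoc, isSketchDoc, and toSketchDoc are hypothetical names, and the fields checked are only a subset of what a real wire document carries.

```typescript
// Hypothetical, reduced document shape used only for this sketch.
interface SketchDoc {
  contentType: string;
  strings: string[];
  nodes: unknown[];
}

// Type guard: returns true only when the value has the expected shape.
function isSketchDoc(value: unknown): value is SketchDoc {
  if (typeof value !== "object" || value === null) return false;
  const { contentType, strings, nodes } = value as Record<string, unknown>;
  return typeof contentType === "string" &&
    Array.isArray(strings) &&
    strings.every((s) => typeof s === "string") &&
    Array.isArray(nodes);
}

// Converter: same check, but throws instead of returning false.
function toSketchDoc(value: unknown): SketchDoc {
  if (!isSketchDoc(value)) throw new TypeError("value is not a SketchDoc");
  return value;
}

console.log(isSketchDoc({ contentType: "text/html", strings: [""], nodes: [] })); // true
console.log(isSketchDoc("<p>not a wire doc</p>")); // false
```

A guard suits branching code like the inspect function above, while a throwing converter suits call sites where a malformed value should stop execution immediately.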

MIT © Nicholas Berlette. All rights reserved.

github · issues · jsr · npm · docs · contributing

Footnotes

  1. The only external package dawm relies on is debrotli, which decompresses the brotli-compressed WebAssembly binary. It also vendors parsel-js by Lea Verou, which provides CSS selector support for the querySelector and querySelectorAll APIs. For portability and convenience, we vendor, bundle, and inline the source code during the build process, resulting in a standalone, dependency-free package.

  2. Abstract superclass; cannot be instantiated directly.

  3. Both static and live NodeLists are supported; per the DOM spec, querySelectorAll returns a static NodeList, while stateful accessors like Node.childNodes and Element.getElementsByTagName return live collections.

  4. All HTMLCollections are live collections as per the DOM spec.

  5. Semi-standard implementation of the Document.parseHTML method, but without support for the same options as the standard API. Notably, this implementation does not support the sanitization options found in the standard DOM API; however, dawm always strips <script> elements from parsed documents.

  6. Non-standard extension.
