`dawm`

Portable high-speed DOM parser for HTML/XML written in Rust,
paired with a subset of DOM APIs implemented in TypeScript, suitable
for use in any ES2015+ JavaScript runtime with WebAssembly support.

Introduction

dawm is a headless DOM toolkit for parsing, traversing, manipulating, and serializing HTML/SVG/XML documents in server-side / serverless / edge environments. Its hybrid codebase couples a high-speed DOM parser written in Rust with TypeScript implementations of many of the standard Document Object Model (DOM) APIs.

The API will feel familiar to anyone with frontend development experience, so most developers can adopt dawm with minimal friction and very little to learn.

Install

deno add npm:dawm

pnpm add dawm

yarn add dawm

bun add dawm

npm i dawm

Usage

import { parseHTML, type ParseOptions } from "dawm";
import assert from "node:assert";

const options = {
  allowScripts: false, // preserves <noscript> hierarchy (default: false)
  contentType: "text/html", // use "application/xml" for XML parsing
  exactErrors: false, // default error handling mode
  quirksMode: "no-quirks", // default quirks mode
  dropDoctype: false, // strip doctype from output? (default: false)
  iframeSrcdoc: false, // set to true when parsing iframe srcdoc content
  contextElement: null, // default context element name (for fragments)
} satisfies ParseOptions;

// The `parseHTML` function is also available via standard DOM APIs:
// - `DOMParser.parseFromString(string, "text/html")`
// - `Document.parseHTML(string, options)`
const doc = parseHTML(
  "<!doctype html><html><body><h1>Hello, world!</h1></body></html>",
  options,
);

const h1 = doc.body.firstElementChild;
assert.strictEqual(h1?.tagName, "H1");
assert.strictEqual(h1.textContent, "Hello, world!");
assert.strictEqual(h1.parentNode, doc.body);

Overview

dawm is well suited to server-side and edge compute scenarios where performance and portability are paramount. Whether you're building an SSR framework, writing a web scraper, or running unit tests for frontend code without a full-blown DOM implementation like JSDOM, dawm provides the parsing, traversal, manipulation, and serialization primitives you need.

Purpose-built for intensive data processing, dawm is designed to be fast in two senses: in raw performance, and in how quickly it can be adopted and integrated into your workflows.

Features

Familiar API Surface

With TypeScript implementations of standard DOM APIs like Document, Element, and Attr, this package offers a familiar developer experience with a minimal learning curve.

This saves you from having to learn another framework-specific API just to manipulate HTML/XML documents. If you've done any frontend web development before, you can start using dawm in your server-side workflows right away.

High Performance

At the core of dawm lies an HTML/XML parser written in Rust and compiled to WebAssembly. It is designed to process large documents efficiently, so that your applications can handle heavy DOM manipulation workloads.

Security First

Running in a sandboxed WASM environment, dawm ensures that untrusted content cannot compromise the host application. Scripts are never executed, because the parser automatically strips <script> elements out of the source document.

Standards Compliant

Built on top of the html5ever crate from the Servo project, the parser implements the HTML5 parsing algorithm as defined by the WHATWG specification.

Zero Dependencies

Designed to be lightweight and portable, dawm comes with zero external dependencies.¹ This makes it easy to integrate into any project without worrying about dependency conflicts or bloat.

Polylingual and Portable

The dawm parser is capable of parsing HTML, XML, SVG, and MathML documents, as well as HTML fragments. This makes it a versatile choice for a wide range of applications. Furthermore, it's designed to be highly portable and compatible with any modern WASM-friendly runtime, including Deno, Bun, Node, and Cloudflare Workers.

Examples

For more examples, check out the ./examples directory on GitHub.

Basic HTML Parsing

import { Document, type ParseOptions } from "dawm";
import assert from "node:assert";

const options = {
  allowScripts: false, // preserves <noscript> hierarchy (default: false)
  contentType: "text/html", // use "application/xml" for XML parsing
  exactErrors: false, // default error handling mode
  quirksMode: "no-quirks", // default quirks mode
  dropDoctype: false, // strip doctype from output? (default: false)
  iframeSrcdoc: false, // set to true when parsing iframe srcdoc content
  contextElement: null, // default context element name (for fragments)
} satisfies ParseOptions;

const doc = Document.parseHTML(
  "<!doctype html><html><head><title>foobar</title></head>" +
    "<body><h1>Hello, world!</h1></body></html>",
  options,
);

assert.strictEqual(doc.title, "foobar");
assert.strictEqual(doc.head?.nextSibling, doc.body);
const title = doc.head?.firstElementChild;
assert.strictEqual(title?.textContent, "foobar");

const h1 = doc.body?.firstElementChild;
assert.strictEqual(h1?.tagName, "H1");
assert.strictEqual(h1.textContent, "Hello, world!");
assert.strictEqual(h1.parentNode, doc.body);

CDN Usage (via esm.sh)

ES Module (recommended)

import * as dom from "https://esm.sh/dawm?bundle&dts";

UMD Module

<script src="https://esm.sh/dawm/global?bundle"></script>
<script>
  const { dawm } = globalThis;

  const options = {
    contentType: "text/html", // default mime type
    exactErrors: false, // default error mode
    quirksMode: "no-quirks", // default quirks mode
    dropDoctype: false, // strip doctype from output
    iframeSrcdoc: false, // set to true when parsing iframe srcdoc contents
    allowScripts: false, // set to true when scripting is enabled
  };

  const doc = dawm.parseHTML(
    "<!DOCTYPE html><html><body><h1>Hello, world!</h1></body></html>",
    options,
  );

  console.log(doc.body.firstElementChild.textContent); // "Hello, world!"
</script>

DOM APIs

The following is a non-exhaustive list of the implementation status of DOM APIs in the current iteration of the dawm library. Items that are checked off are fully implemented and functional; unchecked items are either on the roadmap for future implementation, or were deemed out of scope for this library (those items are noted as such).

Core

Collections

Parsing

Serialization

Traversal & Manipulation

`Node`

`Element`

`Document`

API

You can import from the root dawm package or from the scoped module paths shown below (e.g. import { parseHTML } from "dawm/parse").

`dawm/parse`

DOM-first helpers that wrap the low-level WebAssembly parser and return fully hydrated tree instances.

`parseDocument`

Signature

parseDocument(input: string, options?: ParseOptions | null): Document;
parseDocument(
  input: string,
  mimeType: string,
  options?: ParseOptions | null,
): Document;

Example

import { parseDocument } from "dawm";

const xml = `<note><to>Codex</to><from>dawm</from></note>`;
const doc = parseDocument(xml, "application/xml");
console.log(doc.documentElement?.nodeName); // "note"

`parseFragment`

Signature

parseFragment(
  input: string,
  options: FragmentParseOptions | null,
): DocumentFragment;
parseFragment(
  input: string,
  contextElement: string,
  options?: ParseOptions | null,
): DocumentFragment;

Example

import { parseFragment } from "dawm/parse";

const frag = parseFragment("<li>Two</li>", "ul");
console.log(frag.firstChild?.textContent); // "Two"

`parseHTML`

Signature

parseHTML(input: string, options?: ParseOptions | null): Document;

Example

import { parseHTML } from "dawm";

const doc = parseHTML("<!doctype html><html><body><h1>Hi</h1></body></html>");
console.log(doc.querySelector("h1")?.textContent); // "Hi"

`parseXML`

Signature

parseXML(input: string, options?: ParseOptions | null): Document;

Example

import { parseXML } from "dawm/parse";

const svg = `<svg xmlns="http://www.w3.org/2000/svg"><rect width="10"/></svg>`;
const doc = parseXML(svg);
console.log(doc.documentElement?.nodeName); // "svg"

`dawm/serialize`

Utilities for turning DOM nodes and collections back into strings.

`serializeHTML`

Signature

serializeHTML<T extends Node | Attr>(node: T | ArrayLike<T> | Iterable<T>): string;

Example

import { parseHTML, serializeHTML } from "dawm";
import assert from "node:assert";

const html = "<!doctype html><p data-msg='hi'>Hello</p>";
const doc = parseHTML(html);
const out = serializeHTML(doc.body!.firstChild!);
assert.strictEqual(out, '<p data-msg="hi">Hello</p>');

`serializeDOMStringMap`

Serializes a DOMStringMap into a string of valid HTML data-* attributes.

Note

This is used internally by the higher-level serialization APIs, including Element.outerHTML, Element.innerHTML, and XMLSerializer.

Signature

serializeDOMStringMap(dataset: DOMStringMap): string;

Example

import { Document, serializeDOMStringMap } from "dawm";

const el = new Document().createElement("div");
el.dataset.helloWorld = "true";
console.log(serializeDOMStringMap(el.dataset)); // ' data-hello-world="true"'
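
The mapping from a camelCase dataset key to a data-* attribute name can be sketched in a few lines. The helpers below (datasetKeyToAttrName, escapeAttrValue, serializeDataset) are hypothetical names used for illustration only; they are not dawm's implementation, and the escaping shown is deliberately minimal.

```typescript
// Hypothetical helpers for illustration; not dawm's actual implementation.

/** Converts a camelCase dataset key to its `data-*` attribute name. */
function datasetKeyToAttrName(key: string): string {
  return "data-" + key.replace(/[A-Z]/g, (c) => "-" + c.toLowerCase());
}

/** Minimal escaping for a double-quoted attribute value. */
function escapeAttrValue(value: string): string {
  return value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
}

/** Serializes a plain object as a run of `data-*` attributes. */
function serializeDataset(dataset: Record<string, string>): string {
  let out = "";
  for (const key of Object.keys(dataset)) {
    // Each attribute is prefixed with a space so the result can be
    // appended directly after a tag name.
    out += ` ${datasetKeyToAttrName(key)}="${escapeAttrValue(dataset[key])}"`;
  }
  return out;
}

console.log(serializeDataset({ helloWorld: "true" }));
// ' data-hello-world="true"'
```

The leading space mirrors the output shown in the example above, which makes the string easy to concatenate after a tag name.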

`serializeNamedNodeMap`

Serializes a NamedNodeMap into a string of valid HTML/XML attributes.

Note

This is used internally by the higher-level serialization APIs, including Element.outerHTML, Element.innerHTML, and XMLSerializer.

Signature

serializeNamedNodeMap(attrs: NamedNodeMap): string;

Example

import { parseHTML, serializeNamedNodeMap } from "dawm";

const doc = parseHTML("<div id='app' aria-hidden='false'></div>");
console.log(serializeNamedNodeMap(doc.body!.firstElementChild!.attributes));
// ' id="app" aria-hidden="false"'

`dawm/select`

CSS-selector utilities powered by the parsel-js engine, which is vendored into this package for convenience and portability.¹

`querySelector`

Signature

querySelector<T extends Node>(node: Node, selector: string): T | null;

Example

import { parseHTML, querySelector } from "dawm";

const doc = parseHTML("<section><h2 class='title'>Docs</h2></section>");
const heading = querySelector(doc, ".title");
console.log(heading?.textContent); // "Docs"

`querySelectorAll`

Signature

querySelectorAll<T extends Node>(node: Node, selector: string): T[];

Example

import { parseHTML, querySelectorAll } from "dawm/select";

const doc = parseHTML("<ul><li>A</li><li>B</li></ul>");
const items = querySelectorAll(doc, "li");
console.log(items.map((n) => n.textContent)); // ["A", "B"]

`matches`

Signature

matches(node: Node, selector: string): boolean;

Example

import { matches, parseHTML, querySelector } from "dawm";

const doc = parseHTML("<div id='app' class='card'></div>");
const el = querySelector(doc, "#app")!;
console.log(matches(el, ".card")); // true

`select`

Signature

select(node: Node, match: Matcher, opts?: { single?: boolean }): Node[];

Example

import { type Matcher, parseHTML, select } from "dawm/select";

const doc = parseHTML("<main><p>One</p><p>Two</p></main>");
const paragraphs = select(doc, ((n) => n.nodeName === "P") as Matcher);
console.log(paragraphs.length); // 2

`walk`

Signature

walk(
  node: Node,
  callback: (node: Node, parent?: Node | null, index?: number) => void | Promise<void>,
  parent?: Node | null,
): AsyncGenerator<Node, void, number>;

Example

import { parseHTML, walk } from "dawm/select";

const doc = parseHTML("<div><b>hi</b><i>there</i></div>");
for await (
  const node of walk(doc, async (_n) => {
    // async-safe traversal
  })
) {
  // nodes yielded in document order
}

`walkSync`

Signature

walkSync(
  node: Node,
  callback: (node: Node, parent?: Node | null, index?: number) => void,
  parent?: Node | null,
): void;

Example

import { parseHTML, walkSync } from "dawm/select";

const doc = parseHTML("<div><b>hi</b><i>there</i></div>");
walkSync(doc, (node) => {
  if (node.nodeType === 1) console.log(node.nodeName);
});
// logs element names in document order, ending with DIV, B, I
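
For readers unfamiliar with the term, "document order" is a depth-first, pre-order traversal: each node is visited before its children. The sketch below shows the idea on a toy tree. MiniNode and walkPreOrder are hypothetical names; this is not how dawm implements walkSync, only an illustration of the visiting order.

```typescript
// A minimal tree shape for illustration; dawm's Node class is far richer.
interface MiniNode {
  nodeName: string;
  children: MiniNode[];
}

/** Calls `visit` for `node` and every descendant, parents before children. */
function walkPreOrder(
  node: MiniNode,
  visit: (node: MiniNode, parent: MiniNode | null, index: number) => void,
  parent: MiniNode | null = null,
  index = 0,
): void {
  visit(node, parent, index);
  node.children.forEach((child, i) => walkPreOrder(child, visit, node, i));
}

const tree: MiniNode = {
  nodeName: "DIV",
  children: [
    { nodeName: "B", children: [] },
    { nodeName: "I", children: [] },
  ],
};

const visited: string[] = [];
walkPreOrder(tree, (n) => visited.push(n.nodeName));
console.log(visited.join(", ")); // "DIV, B, I"
```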

`traverseSync`

Signature

traverseSync<TNode extends Node = Node, TParent extends Node | null = TNode | null>(
  node: Node,
  test: (node: Node, parent?: TParent, index?: number) => node is TNode,
  parent?: TParent,
): Generator<TNode, void, number>;

Example

import { parseHTML, traverseSync } from "dawm/select";
import type { Element } from "dawm";

const doc = parseHTML("<ul><li>A</li><li>B</li></ul>");
for (const li of traverseSync(doc, (n): n is Element => n.nodeName === "LI")) {
  console.log(li.textContent);
}
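
The combination of a generator and a type-guard predicate is what lets the loop variable above be typed as the narrowed node type. A rough, self-contained sketch of that pattern follows. TinyNode, TinyListItem, and filterSubtree are hypothetical names, and this is an illustration of the technique under simplified assumptions, not dawm's traverseSync.

```typescript
// Minimal node shapes for illustration; these are not dawm's classes.
interface TinyNode {
  nodeName: string;
  textContent: string;
  children: TinyNode[];
}
interface TinyListItem extends TinyNode {
  nodeName: "LI";
}

/** Lazily yields every node in the subtree that passes `test`, in document order. */
function* filterSubtree<T extends TinyNode>(
  node: TinyNode,
  test: (node: TinyNode) => node is T,
): Generator<T, void, undefined> {
  if (test(node)) yield node;
  for (const child of node.children) yield* filterSubtree(child, test);
}

const list: TinyNode = {
  nodeName: "UL",
  textContent: "AB",
  children: [
    { nodeName: "LI", textContent: "A", children: [] },
    { nodeName: "LI", textContent: "B", children: [] },
  ],
};

// The type guard narrows TinyNode to TinyListItem for the loop body.
const isListItem = (n: TinyNode): n is TinyListItem => n.nodeName === "LI";
const labels: string[] = [];
for (const li of filterSubtree(list, isListItem)) labels.push(li.textContent);
console.log(labels); // ["A", "B"]
```

Because the generator is lazy, a caller can stop iterating early (for example with break) without visiting the rest of the tree.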

`specificity`

Signature

specificity(selector: string): number;

Example

import { specificity } from "dawm/select";

const score = specificity("#app .card > h2");
console.log(score); // numeric specificity score
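
How dawm packs specificity into a single number is not described here. As background, CSS specificity is conventionally a triple: ID selectors, then classes/attributes/pseudo-classes, then type selectors/pseudo-elements. The sketch below computes that triple for simple selectors. specificityTriple is a hypothetical helper that ignores functional pseudo-classes such as :not() and :is(), so treat it as a simplified illustration rather than a conforming implementation.

```typescript
// Simplified illustration of CSS specificity; not dawm's implementation.
// Does not handle :not(), :is(), :where(), or other functional pseudo-classes.
function specificityTriple(selector: string): [number, number, number] {
  let rest = selector;
  // Counts matches of `re` and blanks them out so later passes skip them.
  const count = (re: RegExp): number => {
    let n = 0;
    rest = rest.replace(re, () => {
      n++;
      return " ";
    });
    return n;
  };
  const attrs = count(/\[[^\]]*\]/g); // [attr], [attr=value]
  const ids = count(/#[\w-]+/g); // #id
  const classes = count(/\.[\w-]+/g); // .class
  const pseudoElements = count(/::[\w-]+/g); // ::before
  const pseudoClasses = count(/:[\w-]+/g); // :hover
  const types = count(/[a-zA-Z][\w-]*/g); // div, h2 (the universal `*` counts as 0)
  return [ids, classes + attrs + pseudoClasses, types + pseudoElements];
}

console.log(specificityTriple("#app .card > h2")); // [1, 1, 1]
console.log(specificityTriple("a:hover::before")); // [0, 1, 2]
```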

Default export

The ./select entrypoint default-exports querySelectorAll for convenience:

Signature

import querySelectorAll from "dawm/select";

`dawm/options`

Helpers for normalizing parser options.

`normalizeParseOptions`

Signature

normalizeParseOptions(options?: string | ParseOptions | null): NormalizedParseOptions;

Example

import { normalizeParseOptions } from "dawm/options";

const opts = normalizeParseOptions({
  allowScripts: true,
  contentType: "text/html",
});
console.log(opts.quirksMode); // "no-quirks" (defaulted)
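
Conceptually, normalization fills in a default for every option the caller leaves out. The sketch below illustrates that idea using the option names and defaults listed in the Usage example. SketchOptions, SKETCH_DEFAULTS, and normalizeSketch are hypothetical, and treating a bare string as the content type is an assumption; the real normalizeParseOptions may behave differently.

```typescript
// Hypothetical option shape and defaults, taken from the comments in the
// Usage example; dawm's real NormalizedParseOptions may differ.
interface SketchOptions {
  allowScripts?: boolean;
  contentType?: string;
  exactErrors?: boolean;
  quirksMode?: string;
  dropDoctype?: boolean;
  iframeSrcdoc?: boolean;
  contextElement?: string | null;
}

const SKETCH_DEFAULTS: Required<SketchOptions> = {
  allowScripts: false,
  contentType: "text/html",
  exactErrors: false,
  quirksMode: "no-quirks",
  dropDoctype: false,
  iframeSrcdoc: false,
  contextElement: null,
};

function normalizeSketch(
  options?: string | SketchOptions | null,
): Required<SketchOptions> {
  // Assumption: a bare string is shorthand for the content type.
  const given: SketchOptions = typeof options === "string"
    ? { contentType: options }
    : options ?? {};
  return {
    allowScripts: given.allowScripts ?? SKETCH_DEFAULTS.allowScripts,
    contentType: given.contentType ?? SKETCH_DEFAULTS.contentType,
    exactErrors: given.exactErrors ?? SKETCH_DEFAULTS.exactErrors,
    quirksMode: given.quirksMode ?? SKETCH_DEFAULTS.quirksMode,
    dropDoctype: given.dropDoctype ?? SKETCH_DEFAULTS.dropDoctype,
    iframeSrcdoc: given.iframeSrcdoc ?? SKETCH_DEFAULTS.iframeSrcdoc,
    contextElement: given.contextElement ?? SKETCH_DEFAULTS.contextElement,
  };
}

console.log(normalizeSketch({ allowScripts: true }).quirksMode); // "no-quirks"
console.log(normalizeSketch("application/xml").contentType); // "application/xml"
```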

`normalizeFragmentOptions`

Signature

normalizeFragmentOptions(
  options?: FragmentParseOptions | string | null,
): NormalizedFragmentParseOptions;

Example

import { normalizeFragmentOptions } from "dawm/options";

const opts = normalizeFragmentOptions({ contextElement: "template" });
console.log(opts.contextElement); // "template"

`dawm/types`

Runtime enums and aliases re-exported from the DOM layer.

`NodeType`, `QuirksMode`, `QuirksModeType`

import { NodeType, QuirksMode, type QuirksModeType } from "dawm/types";

const elementNode = NodeType.Element; // 1
const quirks: QuirksModeType = QuirksMode.NoQuirks; // "no-quirks"

Advanced APIs

The modules below expose low-level APIs for advanced users, library authors, and contributors looking to build on top of dawm. You probably won't need to use these directly in most scenarios.

The data returned from most of these functions is "dehydrated" and requires secondary string-resolution and linking steps via resolveStrings and buildSubtree or buildDocumentTree.
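
To make "dehydrated" concrete: in the wire examples in this document, names and values are stored as numeric indices into a shared string table. The sketch below shows what resolving those indices might look like. The RawNode and RawAttr shapes and the resolveNode and lookup helpers are hypothetical and modeled loosely on those examples; they are not dawm's actual types or its resolveStrings implementation.

```typescript
// Hypothetical shapes modeled loosely on the wire examples in this README;
// the real WireNode/WireAttr types may differ.
interface RawAttr {
  name: number;
  value: number;
}
interface RawNode {
  nodeType: number;
  nodeName: number;
  nodeValue: number | null;
  attributes?: RawAttr[];
}
interface ResolvedAttr {
  name: string;
  value: string;
}
interface ResolvedNode {
  nodeType: number;
  nodeName: string;
  nodeValue: string | null;
  attributes: ResolvedAttr[];
}

/** Looks up an index in the string table, failing loudly on bad input. */
function lookup(strings: readonly string[], index: number): string {
  const s = strings[index];
  if (s === undefined) throw new RangeError(`string index ${index} out of range`);
  return s;
}

/** Replaces every string-table index on a node with the string it refers to. */
function resolveNode(node: RawNode, strings: readonly string[]): ResolvedNode {
  return {
    nodeType: node.nodeType,
    nodeName: lookup(strings, node.nodeName),
    nodeValue: node.nodeValue === null ? null : lookup(strings, node.nodeValue),
    attributes: (node.attributes ?? []).map((a) => ({
      name: lookup(strings, a.name),
      value: lookup(strings, a.value),
    })),
  };
}

const stringTable = ["", "html", "class", "data-value", "w-screen h-screen"];
const resolvedNode = resolveNode(
  {
    nodeType: 1,
    nodeName: 1,
    nodeValue: null,
    attributes: [{ name: 2, value: 4 }, { name: 3, value: 0 }],
  },
  stringTable,
);
console.log(resolvedNode.nodeName); // "html"
console.log(resolvedNode.attributes[0].value); // "w-screen h-screen"
```

Storing each distinct string once and referring to it by index keeps the payload that crosses the WASM boundary compact, at the cost of this extra resolution step.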

`dawm/wasm`

Low-level WebAssembly bindings for the Rust-based HTML and XML parsers.

Signature

parse_doc(input: string, mime: string, options?: object | null): WireDoc;
parse_html(input: string, options?: object | null): WireDoc;
parse_xml(input: string, options?: object | null): WireDoc;
parse_frag(input: string, options: object): WireDoc;

Warning

These functions return raw "wire" structures that are not designed for direct use. Unless you know what you're doing and have a specific reason to use these APIs, you're probably better off with the higher-level parse* APIs instead.

Example

import { buildDocumentTree, dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<em>raw</em>", null);
const doc = buildDocumentTree(toWireDoc(wire));
console.log(doc.body?.firstChild?.nodeName); // "EM"

`dawm/tree`

Utilities for turning raw WASM parser output into DOM objects (and vice versa).

`buildDocumentTree`

Signature

buildDocumentTree(document: WireDoc): Document;

Example

import { buildDocumentTree, dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<p>Hi</p>", null);
const doc = buildDocumentTree(toWireDoc(wire));
console.log(doc.body?.firstChild?.textContent); // "Hi"

`buildSubtree`

Signature

buildSubtree(
  node: WireNode | ResolvedWireNode,
  parent?: Node | null,
  prev?: Node | null,
  next?: Node | null,
  context?: { /* internal tree-building context */ },
): Node;

Example

import { buildSubtree } from "dawm/tree";
import { resolveStrings, type WireNode } from "dawm/wire";

const wireNode: WireNode = {
  id: 1,
  nodeType: 1,
  nodeName: 0,
  nodeValue: null,
  parentNode: null,
  firstChild: null,
  nextSibling: null,
  attributes: [],
};
const strings = ["div"];
const resolved = resolveStrings(wireNode, strings);
const element = buildSubtree(resolved, null);
console.log(element.nodeName); // "div"

`resolveQuirksMode`

Signature

resolveQuirksMode(mode: number | string | null | undefined): QuirksModeType;

Example

import { resolveQuirksMode } from "dawm/tree";

console.log(resolveQuirksMode("limited-quirks")); // "limited-quirks"

`dawm/wire`

Type guards and helpers for working with the serialized "wire" structures.

`resolveStrings`

Signature

resolveStrings(node: WireDoc): ResolvedWireDoc;
resolveStrings(node: WireNode, strings: string[]): ResolvedWireNode;
resolveStrings(node: WireAttr, strings: string[]): ResolvedWireAttr;

Example

import { resolveStrings } from "dawm/wire";

const resolved = resolveStrings({
  contentType: "text/html",
  quirksMode: 2,
  strings: ["", "html", "class", "data-value", "w-screen h-screen"],
  nodes: [{
    id: 1,
    nodeType: 9,
    nodeName: 1,
    nodeValue: null,
    attributes: [{ name: 2, value: 4 }, { name: 3, value: 0 }],
  }],
});

console.log(resolved.nodes[0].nodeName); // "html"

`toWireDoc`

Converts an unknown value into a WireDoc, throwing if the value does not conform to the expected shape. This is useful (and used internally) for ensuring type safety when dealing with raw parser output.

Signature

toWireDoc(value: unknown): WireDoc;

Example

import { dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<p>Wire</p>", null);
// throws if the parser returned an unexpected shape
const safe = toWireDoc(wire);

Guard functions

isWireDoc(value: unknown): value is WireDoc;
isWireNode(value: unknown): value is WireNode;
isWireAttr(value: unknown): value is WireAttr;
isResolvedWireDoc(value: unknown): value is ResolvedWireDoc;
isResolvedWireNode(value: unknown): value is ResolvedWireNode;
isResolvedWireAttr(value: unknown): value is ResolvedWireAttr;
isNodeLike(value: unknown): value is NodeLike;

Example

import { isNodeLike, isWireDoc } from "dawm/guards";

function inspect(value: unknown) {
  if (isWireDoc(value)) console.log("wire document");
  else if (isNodeLike(value)) console.log("dom-like node");
}
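
The relationship between an is* guard and a throwing to* converter can be sketched as follows. WireDocLike, isWireDocLike, and toWireDocLike are hypothetical stand-ins based on the wire example shown earlier; the real guards in dawm may check more, or different, fields.

```typescript
// Hypothetical shape based on the wire example in this README; the real
// WireDoc type and guards in dawm may check more (or different) fields.
interface WireDocLike {
  contentType: string;
  strings: string[];
  nodes: unknown[];
}

/** Type guard: returns true only if `value` has the expected top-level shape. */
function isWireDocLike(value: unknown): value is WireDocLike {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return typeof v.contentType === "string" &&
    Array.isArray(v.strings) &&
    v.strings.every((s) => typeof s === "string") &&
    Array.isArray(v.nodes);
}

/** Converter: returns the value with a narrowed type, or throws. */
function toWireDocLike(value: unknown): WireDocLike {
  if (!isWireDocLike(value)) {
    throw new TypeError("value is not a wire document");
  }
  return value;
}

const candidate: unknown = { contentType: "text/html", strings: [""], nodes: [] };
console.log(isWireDocLike(candidate)); // true
console.log(isWireDocLike({ contentType: 42 })); // false
```

The guard is useful in branching code, while the converter suits call sites where an unexpected shape should be treated as an error.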

github · issues · jsr · npm · docs · contributing

¹ The only external package dawm relies on is debrotli, for decompressing the brotli-compressed WebAssembly binary. It also vendors parsel-js by Lea Verou, which is used to provide CSS selector support for the querySelector and querySelectorAll APIs. For portability and convenience, we vendor, bundle, and inline its source code during the build process, resulting in a standalone, dependency-free package.
² Abstract superclass; cannot be instantiated directly.
³ Both static and live NodeLists are supported; per the DOM spec, querySelectorAll returns a static NodeList, while stateful APIs like childNodes and Element.getElementsByTagName return live collections.
⁴ All HTMLCollections are live collections, as per the DOM spec.
⁵ Semi-standard implementation of the Document.parseHTML method, but without support for the same options as the standard API. Notably, this implementation does not support the sanitization options found in the standard DOM API; however, dawm always strips <script> elements from parsed documents.
⁶ Non-standard extension.
