Portable high-speed DOM parser for HTML/XML written in Rust,
paired with a subset of DOM APIs implemented in TypeScript, suitable
for use in any ES2015+ JavaScript runtime with WebAssembly support.
dawm is a headless DOM toolkit for parsing, traversing, manipulating, and
serializing HTML/SVG/XML documents in server-side / serverless / edge
environments. Its hybrid codebase couples a high-speed DOM parser written in
Rust with TypeScript implementations of many of the standard
Document Object Model (DOM) APIs.
Because dawm mirrors the standard DOM APIs, the developer experience is
immediately familiar to anyone with frontend development experience, so it can
be adopted with minimal friction.
deno add npm:dawm
pnpm add dawm
yarn add dawm
bun add dawm
npm i dawm

import { parseHTML, type ParseOptions } from "dawm";
import assert from "node:assert";

const options = {
  allowScripts: false, // preserves <noscript> hierarchy (default: false)
  contentType: "text/html", // use "application/xml" for XML parsing
  exactErrors: false, // default error handling mode
  quirksMode: "no-quirks", // default quirks mode
  dropDoctype: false, // strip doctype from output? (default: false)
  iframeSrcdoc: false, // set to true when parsing iframe srcdoc content
  contextElement: null, // default context element name (for fragments)
} satisfies ParseOptions;

// The `parseHTML` function is also available via standard DOM APIs:
// - `DOMParser.parseFromString(string, "text/html")`
// - `Document.parseHTML(string, options)`
const doc = parseHTML(
  "<!doctype html><html><body><h1>Hello, world!</h1></body></html>",
  options,
);

const h1 = doc.body.firstElementChild;
assert.strictEqual(h1?.tagName, "H1");
assert.strictEqual(h1.textContent, "Hello, world!");
assert.strictEqual(h1.parentNode, doc.body);

The dawm project is ideal for use in server-side and edge compute scenarios
where performance and portability are paramount. Whether you're building an SSR
framework, writing a web scraper, or just need to run unit tests for frontend
code without a full-blown DOM implementation like JSDOM, dawm is designed to
handle it.
Purpose-built for intensive data processing tasks in server-side and edge
compute scenarios, dawm is designed to be fast: both in the literal sense of
its performance, and in terms of how quickly it can be adopted and integrated
into your workflows.
Featuring TypeScript implementations of standard DOM APIs like Document,
Element, and Attr, this package offers a familiar developer experience
with a minimal learning curve.
This saves you from having to learn another framework-specific API just to
manipulate HTML/XML documents — if you've done any frontend web dev before, you
can immediately start using dawm in your server-side workflows without missing
a beat.
At the core of dawm lies a fast HTML/XML parser written in Rust and
compiled to WebAssembly, capable of processing large documents efficiently.
This lets your applications handle heavy DOM manipulation workloads without
the parser becoming the bottleneck.
Running in a sandboxed WASM environment, dawm ensures that untrusted content
cannot compromise the host application. Scripts are never executed, because the
parser automatically strips them from the parsed document.
Built on top of the html5ever crate created by the Servo project, the parser
boasts full compliance with the HTML5 parsing algorithm as defined by the WHATWG
specification.
Designed to be lightweight and portable, dawm comes with zero external
dependencies.[1] This makes it easy to integrate into any project without
worrying about dependency conflicts or bloat.
The dawm parser is capable of parsing HTML, XML, SVG, and MathML documents, as
well as HTML fragments. This makes it a versatile choice for a wide range of
applications. Furthermore, it's designed to be highly portable and compatible
with any modern WASM-friendly runtime, including Deno, Bun,
Node, and Cloudflare Workers.
For more examples, check out the ./examples directory on GitHub.
import { Document, type ParseOptions } from "dawm";
import assert from "node:assert";

const options = {
  allowScripts: false, // preserves <noscript> hierarchy (default: false)
  contentType: "text/html", // use "application/xml" for XML parsing
  exactErrors: false, // default error handling mode
  quirksMode: "no-quirks", // default quirks mode
  dropDoctype: false, // strip doctype from output? (default: false)
  iframeSrcdoc: false, // set to true when parsing iframe srcdoc content
  contextElement: null, // default context element name (for fragments)
} satisfies ParseOptions;

const doc = Document.parseHTML(
  "<!doctype html><html><head><title>foobar</title></head>" +
    "<body><h1>Hello, world!</h1></body></html>",
  options,
);

assert.strictEqual(doc.title, "foobar");
assert.strictEqual(doc.head?.nextSibling, doc.body);

const title = doc.head.firstElementChild;
assert.strictEqual(title?.textContent, "foobar");

const h1 = doc.body?.firstElementChild;
assert.strictEqual(h1?.tagName, "H1");
assert.strictEqual(h1.textContent, "Hello, world!");
assert.strictEqual(h1.parentNode, doc.body);

CDN Usage (via esm.sh)
import * as dom from "https://esm.sh/dawm?bundle&dts";

<script src="https://esm.sh/dawm/global?bundle"></script>
<script>
  const { dawm } = globalThis;

  const options = {
    contentType: "text/html", // default mime type
    exactErrors: false, // default error mode
    quirksMode: "no-quirks", // default quirks mode
    dropDoctype: false, // strip doctype from output
    iframeSrcdoc: false, // set to true when parsing iframe srcdoc contents
    allowScripts: false, // set to true when scripting is enabled
  };

  const doc = dawm.parseHTML(
    "<!DOCTYPE html><html><body><h1>Hello, world!</h1></body></html>",
    options,
  );
  console.log(doc.root.firstChild.firstChild.textContent); // "Hello, world!"
</script>

The following is a non-exhaustive list of the implementation status of DOM
APIs in the current iteration of the dawm library. Items that are checked off
are fully implemented and functional; unchecked items are either on the roadmap
for future implementation, or were deemed out of scope for this library (those
items are noted as such).
- [x] Node [2]
- [x] Element
- [x] Attr
- [x] CharacterData [2]
- [x] Text
- [x] CDATASection
- [x] Comment
- [x] ProcessingInstruction
- [x] DocumentFragment
- [x] DocumentType
- [x] Document
- [ ] Entity (not implemented; legacy feature)
- [ ] EntityReference (not implemented; legacy feature)
- [ ] Notation (not implemented; legacy feature)

- [x] NodeList [3]
- [x] HTMLCollection [4]
- [x] NamedNodeMap
- [x] DOMTokenList
- [x] DOMStringMap
- [ ] StyleSheetList (not yet implemented)
- [ ] MediaList (not yet implemented)

- [x] DOMParser
- [x] Document.parseHTML [5]
- [x] Document.parseXML [6]
- [x] Document.parseFragment [6]
- [ ] Element.setHTML (not yet implemented)

- [x] XMLSerializer
- [x] Element.outerHTML
- [x] Element.innerHTML
- [ ] Element.insertAdjacentHTML (not yet implemented)
- [ ] Element.getHTML (not yet implemented)

- [x] Element.getElementsByClassName
- [x] Element.getElementsByTagName
- [x] Element.getElementsByTagNameNS
- [x] Element.querySelector
- [x] Element.querySelectorAll

- [x] Document.getElementById
- [x] Document.getElementsByClassName
- [x] Document.getElementsByName
- [x] Document.getElementsByTagName
- [x] Document.getElementsByTagNameNS
- [x] Document.querySelector
- [x] Document.querySelectorAll
You can import from the root dawm package or from the scoped module paths
shown below (e.g. import { parseHTML } from "dawm/parse").
DOM-first helpers that wrap the low-level WebAssembly parser and return fully hydrated tree instances.
parseDocument(input: string, options?: ParseOptions | null): Document;
parseDocument(
  input: string,
  mimeType: string,
  options?: ParseOptions | null,
): Document;

import { parseDocument } from "dawm";

const xml = `<note><to>Codex</to><from>dawm</from></note>`;
const doc = parseDocument(xml, "application/xml");
console.log(doc.documentElement?.nodeName); // "note"

parseFragment(
  input: string,
  options: FragmentParseOptions | null,
): DocumentFragment;
parseFragment(
  input: string,
  contextElement: string,
  options?: ParseOptions | null,
): DocumentFragment;

import { parseFragment } from "dawm/parse";

const frag = parseFragment("<li>Two</li>", "ul");
console.log(frag.firstChild?.textContent); // "Two"

parseHTML(input: string, options?: ParseOptions | null): Document;

import { parseHTML } from "dawm";

const doc = parseHTML("<!doctype html><html><body><h1>Hi</h1></body></html>");
console.log(doc.querySelector("h1")?.textContent); // "Hi"

parseXML(input: string, options?: ParseOptions | null): Document;

import { parseXML } from "dawm/parse";

const svg = `<svg xmlns="http://www.w3.org/2000/svg"><rect width="10"/></svg>`;
const doc = parseXML(svg);
console.log(doc.documentElement?.nodeName); // "svg"

Utilities for turning DOM nodes and collections back into strings.
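As a rough illustration of what turning a node tree back into a string involves, here is a minimal, self-contained sketch. It does not use dawm and is not its implementation: the `SketchNode` type, `VOID_TAGS` set, and `serialize` function are hypothetical names invented for this example, and the escaping rules are simplified.

```typescript
// Hypothetical minimal node shapes for illustration; dawm's real Node classes are richer.
type SketchNode =
  | { kind: "element"; tag: string; attrs: Record<string, string>; children: SketchNode[] }
  | { kind: "text"; value: string };

// HTML void elements: these never get a closing tag (tag names assumed lowercase).
const VOID_TAGS = new Set([
  "area", "base", "br", "col", "embed", "hr", "img",
  "input", "link", "meta", "source", "track", "wbr",
]);

// Simplified escaping for text content and double-quoted attribute values.
const escapeText = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");
const escapeAttr = (s: string): string =>
  s.replace(/&/g, "&amp;").replace(/"/g, "&quot;");

// Recursively serializes a node and its descendants in document order.
function serialize(node: SketchNode): string {
  if (node.kind === "text") return escapeText(node.value);
  const attrs = Object.entries(node.attrs)
    .map(([name, value]) => ` ${name}="${escapeAttr(value)}"`)
    .join("");
  const open = `<${node.tag}${attrs}>`;
  if (VOID_TAGS.has(node.tag)) return open;
  return open + node.children.map(serialize).join("") + `</${node.tag}>`;
}

const paragraph: SketchNode = {
  kind: "element",
  tag: "p",
  attrs: { "data-msg": "hi" },
  children: [
    { kind: "text", value: "Hello" },
    { kind: "element", tag: "br", attrs: {}, children: [] },
  ],
};
console.log(serialize(paragraph)); // <p data-msg="hi">Hello<br></p>
```

A real serializer also has to handle comments, doctypes, namespaces, and raw-text elements, which this sketch leaves out.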
serializeHTML<T extends Node | Attr>(node: T | ArrayLike<T> | Iterable<T>): string;

import { parseHTML, serializeHTML } from "dawm";
import assert from "node:assert";

const html = "<!doctype html><p data-msg='hi'>Hello</p>";
const doc = parseHTML(html);
const out = serializeHTML(doc.body?.firstChild);
assert.strictEqual(out, '<p data-msg="hi">Hello</p>');

Serializes a DOMStringMap into a string of valid HTML data-* attributes.
Note
This is used internally by the higher-level serialization APIs, including
Element.outerHTML, Element.innerHTML, and XMLSerializer.
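The mapping from camelCase dataset keys to `data-*` attribute names can be sketched in a few lines. This is an illustrative stand-in operating on a plain record, not dawm's code; `toDataAttrName` and `serializeDataset` are hypothetical helper names.

```typescript
// Hypothetical helper: converts a camelCase dataset key (e.g. "helloWorld")
// into its data-* attribute name (e.g. "data-hello-world").
function toDataAttrName(key: string): string {
  return "data-" + key.replace(/[A-Z]/g, (ch) => "-" + ch.toLowerCase());
}

// Serializes a plain record the way a DOMStringMap might be serialized:
// each entry becomes ` data-name="value"`, including the leading space.
function serializeDataset(dataset: Record<string, string>): string {
  let out = "";
  for (const [key, value] of Object.entries(dataset)) {
    // Simplified escaping for a double-quoted attribute value.
    const escaped = value.replace(/&/g, "&amp;").replace(/"/g, "&quot;");
    out += ` ${toDataAttrName(key)}="${escaped}"`;
  }
  return out;
}

console.log(serializeDataset({ helloWorld: "true", userId: "42" }));
// ' data-hello-world="true" data-user-id="42"'
```

The leading space on each entry lets the result be appended directly after a tag name when building an opening tag.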
serializeDOMStringMap(dataset: DOMStringMap): string;

import { Document, serializeDOMStringMap } from "dawm";

const el = new Document().createElement("div");
el.dataset.helloWorld = "true";
console.log(serializeDOMStringMap(el.dataset)); // ' data-hello-world="true"'

Serializes a NamedNodeMap into a string of valid HTML/XML attributes.
Note
This is used internally by the higher-level serialization APIs, including
Element.outerHTML, Element.innerHTML, and XMLSerializer.
serializeNamedNodeMap(attrs: NamedNodeMap): string;

import { parseHTML, serializeNamedNodeMap } from "dawm";

const doc = parseHTML("<div id='app' aria-hidden='false'></div>");
// `doc.firstElementChild` is the <html> element, so select the <div> explicitly
console.log(serializeNamedNodeMap(doc.querySelector("div")!.attributes));
// ' id="app" aria-hidden="false"'

CSS-selector utilities powered by the parsel-js engine, which is vendored
into this package for convenience and portability.[1]
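Several helpers in this module (`select`, `walkSync`, `traverseSync`) pair a tree traversal with a predicate. The general idea, a pre-order (document-order) walk that yields matching nodes, can be sketched as follows. The `TreeNode` shape and `traverse` generator are hypothetical and stand in for the real DOM classes; this is not dawm's implementation.

```typescript
// Hypothetical tree shape, for illustration only.
interface TreeNode {
  name: string;
  children: TreeNode[];
}

// Pre-order (document-order) generator: visits a node before its children
// and yields only the nodes that pass `test`.
function* traverse(
  node: TreeNode,
  test: (n: TreeNode) => boolean,
): Generator<TreeNode> {
  if (test(node)) yield node;
  for (const child of node.children) yield* traverse(child, test);
}

const tree: TreeNode = {
  name: "UL",
  children: [
    { name: "LI", children: [{ name: "B", children: [] }] },
    { name: "LI", children: [] },
  ],
};

const allNames = Array.from(traverse(tree, () => true), (n) => n.name);
console.log(allNames.join(", ")); // UL, LI, B, LI

const items = Array.from(traverse(tree, (n) => n.name === "LI"));
console.log(items.length); // 2
```

Using a generator keeps the walk lazy, so a caller that only needs the first match can stop iterating early.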
querySelector<T extends Node>(node: Node, selector: string): T | null;

import { parseHTML, querySelector } from "dawm";

const doc = parseHTML("<section><h2 class='title'>Docs</h2></section>");
const heading = querySelector(doc, ".title");
console.log(heading?.textContent); // "Docs"

querySelectorAll<T extends Node>(node: Node, selector: string): T[];

import { parseHTML, querySelectorAll } from "dawm/select";

const doc = parseHTML("<ul><li>A</li><li>B</li></ul>");
const items = querySelectorAll(doc, "li");
console.log(items.map((n) => n.textContent)); // ["A", "B"]

matches(node: Node, selector: string): boolean;

import { matches, parseHTML, querySelector } from "dawm";

const doc = parseHTML("<div id='app' class='card'></div>");
const el = querySelector(doc, "#app")!;
console.log(matches(el, ".card")); // true

select(node: Node, match: Matcher, opts?: { single?: boolean }): Node[];

import { type Matcher, parseHTML, select } from "dawm/select";

const doc = parseHTML("<main><p>One</p><p>Two</p></main>");
const paragraphs = select(doc, ((n) => n.nodeName === "P") as Matcher);
console.log(paragraphs.length); // 2

walk(
  node: Node,
  callback: (node: Node, parent?: Node | null, index?: number) => void | Promise<void>,
  parent?: Node | null,
): AsyncGenerator<Node, void, number>;

import { parseHTML, walk } from "dawm/select";

const doc = parseHTML("<div><b>hi</b><i>there</i></div>");
for await (
  const node of walk(doc, async (_n) => {
    // async-safe traversal
  })
) {
  // nodes yielded in document order
}

walkSync(
  node: Node,
  callback: (node: Node, parent?: Node | null, index?: number) => void,
  parent?: Node | null,
): void;

import { parseHTML, walkSync } from "dawm/select";

const doc = parseHTML("<div><b>hi</b><i>there</i></div>");
walkSync(doc, (node) => {
  if (node.nodeType === 1) console.log(node.nodeName);
});
// logs HTML, HEAD, BODY, DIV, B, I (the HTML parser adds the implied
// <html>, <head>, and <body> elements around the input)

traverseSync<TNode extends Node = Node, TParent extends Node | null = TNode | null>(
  node: Node,
  test: (node: Node, parent?: TParent, index?: number) => node is TNode,
  parent?: TParent,
): Generator<TNode, void, number>;

import { parseHTML, traverseSync } from "dawm/select";

const doc = parseHTML("<ul><li>A</li><li>B</li></ul>");
for (const li of traverseSync(doc, (n): n is Element => n.nodeName === "LI")) {
  console.log(li.textContent);
}

specificity(selector: string): number;

import { specificity } from "dawm/select";

const score = specificity("#app .card > h2");
console.log(score); // numeric specificity score

The ./select entrypoint default-exports querySelectorAll for convenience:

import querySelectorAll from "dawm/select";

Helpers for normalizing parser options.
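To make the idea of option normalization concrete, here is a small self-contained sketch that fills in defaults and accepts a string shorthand. It is an assumption-laden illustration, not dawm's code: `SketchOptions`, `DEFAULTS`, and `normalize` are hypothetical names, the defaults are copied from the comments in the examples earlier in this README, and treating a bare string as the content type is an assumption based on the `mimeType` overload of `parseDocument`.

```typescript
// Hypothetical option shape mirroring a few of the fields shown in this README.
interface SketchOptions {
  allowScripts?: boolean;
  contentType?: string;
  quirksMode?: "no-quirks" | "limited-quirks" | "quirks";
  dropDoctype?: boolean;
}
type NormalizedSketchOptions = Required<SketchOptions>;

// Defaults as documented in the example comments above.
const DEFAULTS: NormalizedSketchOptions = {
  allowScripts: false,
  contentType: "text/html",
  quirksMode: "no-quirks",
  dropDoctype: false,
};

// Accepts a content-type string shorthand (assumed), an options object,
// or null/undefined, and always returns a fully populated object.
function normalize(
  options?: string | SketchOptions | null,
): NormalizedSketchOptions {
  const input: SketchOptions =
    typeof options === "string" ? { contentType: options } : options ?? {};
  return {
    allowScripts: input.allowScripts ?? DEFAULTS.allowScripts,
    contentType: input.contentType ?? DEFAULTS.contentType,
    quirksMode: input.quirksMode ?? DEFAULTS.quirksMode,
    dropDoctype: input.dropDoctype ?? DEFAULTS.dropDoctype,
  };
}

console.log(normalize({ allowScripts: true }).quirksMode); // no-quirks
console.log(normalize("application/xml").contentType); // application/xml
```

Resolving each field with `??` rather than object spread means an explicitly passed `undefined` still falls back to the default.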
normalizeParseOptions(options?: string | ParseOptions | null): NormalizedParseOptions;

import { normalizeParseOptions } from "dawm/options";

const opts = normalizeParseOptions({
  allowScripts: true,
  contentType: "text/html",
});
console.log(opts.quirksMode); // "no-quirks" (defaulted)

normalizeFragmentOptions(
  options?: FragmentParseOptions | string | null,
): NormalizedFragmentParseOptions;

import { normalizeFragmentOptions } from "dawm/options";

const opts = normalizeFragmentOptions({ contextElement: "template" });
console.log(opts.contextElement); // "template"

Runtime enums and aliases re-exported from the DOM layer.
import { NodeType, QuirksMode, type QuirksModeType } from "dawm/types";

const elementNode = NodeType.Element; // 1
const quirks: QuirksModeType = QuirksMode.NoQuirks; // "no-quirks"

The modules below expose low-level APIs for advanced users, library authors, and
contributors looking to build on top of dawm. You probably won't need to use
these directly in most scenarios.
The data returned from most of these functions is "dehydrated" and requires
secondary string-resolution and linking steps via resolveStrings and
buildSubtree or buildDocumentTree.
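The two steps mentioned above, string resolution and linking, can be illustrated with a simplified, self-contained sketch. The `RawNode` and `LinkedNode` shapes and the `hydrate` function are hypothetical and only loosely modeled on the wire examples in this README; dawm's actual wire format and tree builder may differ.

```typescript
// Hypothetical, simplified "dehydrated" node: names are indices into a
// shared string table and parents are referenced by numeric id.
interface RawNode {
  id: number;
  nodeName: number;
  parentNode: number | null;
}

// Hypothetical "hydrated" node with real strings and object references.
interface LinkedNode {
  id: number;
  nodeName: string;
  parent: LinkedNode | null;
  children: LinkedNode[];
}

function hydrate(strings: string[], nodes: RawNode[]): LinkedNode[] {
  // Step 1: resolve string-table indices into actual strings.
  const byId = new Map<number, LinkedNode>();
  for (const raw of nodes) {
    byId.set(raw.id, {
      id: raw.id,
      nodeName: strings[raw.nodeName] ?? "",
      parent: null,
      children: [],
    });
  }
  // Step 2: link each node to its parent (and the parent to its child).
  for (const raw of nodes) {
    if (raw.parentNode === null) continue;
    const child = byId.get(raw.id)!;
    const parent = byId.get(raw.parentNode);
    if (!parent) throw new Error(`unknown parent id ${raw.parentNode}`);
    child.parent = parent;
    parent.children.push(child);
  }
  // Return the root nodes (those without a parent).
  return nodes
    .filter((raw) => raw.parentNode === null)
    .map((raw) => byId.get(raw.id)!);
}

const roots = hydrate(["", "ul", "li"], [
  { id: 1, nodeName: 1, parentNode: null },
  { id: 2, nodeName: 2, parentNode: 1 },
  { id: 3, nodeName: 2, parentNode: 1 },
]);
console.log(roots[0].nodeName, roots[0].children.length); // ul 2
```

Sharing one string table across all nodes keeps the payload crossing the WASM boundary small, since repeated tag and attribute names are stored once.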
Low-level WebAssembly bindings for the Rust-based HTML and XML parsers.
parse_doc(input: string, mime: string, options?: object | null): WireDoc;
parse_html(input: string, options?: object | null): WireDoc;
parse_xml(input: string, options?: object | null): WireDoc;
parse_frag(input: string, options: object): WireDoc;

Warning
These return raw "wire" structures that are not designed for direct human use.
Unless you know what you're doing and have a specific reason to use these
APIs, you'd probably be better off with the higher-level parse* APIs instead.

import { buildDocumentTree, dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<em>raw</em>", null);
const doc = buildDocumentTree(toWireDoc(wire));
console.log(doc.body?.firstChild?.nodeName); // "EM"

Utilities for turning raw WASM parser output into DOM objects (and vice versa).
buildDocumentTree(document: WireDoc): Document;

import { buildDocumentTree, dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<p>Hi</p>", null);
const doc = buildDocumentTree(toWireDoc(wire));
console.log(doc.body?.firstChild?.textContent); // "Hi"

buildSubtree(
  node: WireNode | ResolvedWireNode,
  parent?: Node | null,
  prev?: Node | null,
  next?: Node | null,
  context?: { /* internal tree-building context */ },
): Node;

import { buildSubtree, resolveStrings } from "dawm/tree";
import { type WireNode } from "dawm/wire";

const wireNode: WireNode = {
  id: 1,
  nodeType: 1,
  nodeName: 0,
  nodeValue: null,
  parentNode: null,
  firstChild: null,
  nextSibling: null,
  attributes: [],
};
const strings = ["div"];
const resolved = resolveStrings(wireNode, strings);
const element = buildSubtree(resolved, null);
console.log(element.nodeName); // "div"

resolveQuirksMode(mode: number | string | null | undefined): QuirksModeType;

import { resolveQuirksMode } from "dawm/tree";

console.log(resolveQuirksMode("limited-quirks")); // "limited-quirks"

Type guards and helpers for working with the serialized "wire" structures.
resolveStrings(node: WireDoc): ResolvedWireDoc;
resolveStrings(node: WireNode, strings: string[]): ResolvedWireNode;
resolveStrings(node: WireAttr, strings: string[]): ResolvedWireAttr;

import { resolveStrings } from "dawm/tree";

const resolved = resolveStrings({
  contentType: "text/html",
  quirksMode: 2,
  strings: ["", "html", "class", "data-value", "w-screen h-screen"],
  nodes: [{
    id: 1,
    nodeType: 9,
    nodeName: 1,
    nodeValue: null,
    attributes: [{ name: 2, value: 4 }, { name: 3, value: 0 }],
  }],
});
console.log(resolved.nodes[0].nodeName); // "html"

Converts an unknown value into a WireDoc, throwing if the value does not
conform to the expected shape. This is useful (and used internally) for ensuring
type safety when dealing with raw parser output.
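The "validate, then narrow or throw" pattern described here can be sketched with a type guard plus a throwing converter. The shape below is a deliberately reduced, hypothetical stand-in for the real WireDoc, and `isSketchWireDoc` and `toSketchWireDoc` are illustrative names, not part of dawm.

```typescript
// Hypothetical minimal shape; the real WireDoc carries more fields.
interface SketchWireDoc {
  contentType: string;
  strings: string[];
  nodes: object[];
}

// Type guard: checks the runtime shape and narrows `unknown` on success.
function isSketchWireDoc(value: unknown): value is SketchWireDoc {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  const strings = v["strings"];
  const nodes = v["nodes"];
  return (
    typeof v["contentType"] === "string" &&
    Array.isArray(strings) &&
    strings.every((s) => typeof s === "string") &&
    Array.isArray(nodes) &&
    nodes.every((n) => typeof n === "object" && n !== null)
  );
}

// Converter: returns the value with a precise type, or throws.
function toSketchWireDoc(value: unknown): SketchWireDoc {
  if (!isSketchWireDoc(value)) {
    throw new TypeError("value does not conform to the expected wire shape");
  }
  return value;
}

const candidate: unknown = { contentType: "text/html", strings: [""], nodes: [{}] };
const checked = toSketchWireDoc(candidate);
console.log(checked.contentType); // text/html
```

Keeping the guard and the converter separate lets callers choose between branching on a boolean and failing fast at a trust boundary.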
toWireDoc(value: unknown): WireDoc;

import { dawm, toWireDoc } from "dawm";

const wire = dawm.parse_html("<p>Wire</p>", null);
// throws if the parser returned an unexpected shape
const safe = toWireDoc(wire);

isWireDoc(value: unknown): value is WireDoc;
isWireNode(value: unknown): value is WireNode;
isWireAttr(value: unknown): value is WireAttr;
isResolvedWireDoc(value: unknown): value is ResolvedWireDoc;
isResolvedWireNode(value: unknown): value is ResolvedWireNode;
isResolvedWireAttr(value: unknown): value is ResolvedWireAttr;
isNodeLike(value: unknown): value is NodeLike;

import { isNodeLike, isWireDoc } from "dawm/guards";

function inspect(value: unknown) {
  if (isWireDoc(value)) console.log("wire document");
  else if (isNodeLike(value)) console.log("dom-like node");
}

MIT © Nicholas Berlette. All rights reserved.
Footnotes
- The only external package dawm relies on is debrotli, for decompressing the Brotli-compressed WebAssembly binary. dawm also vendors parsel-js by Lea Verou, which is used to provide CSS selector support for the querySelector{,All} APIs. For portability and convenience, we vendor, bundle, and inline its source code during the build process, resulting in a standalone, dependency-free package.
- Both static and live NodeLists are supported; per the DOM spec, querySelectorAll returns a static NodeList, while live accessors like childNodes and Element.getElementsByTagName return live collections.
- All HTMLCollections are live collections, as per the DOM spec.
- Semi-standard implementation of the Document.parseHTML method, but without support for the same options as the standard API. Notably, this implementation does not support the sanitization options found in the standard DOM API; however, dawm always strips <script> elements from parsed documents.