Teasel

Teasing HTML elements from plain text

Teasel is an HTML syntax tree parser written in TypeScript. Teasel aims to be a fast and reliable full-fidelity parser for HTML linters and refactoring tools.

Key Features

  • Full-fidelity tree - Every byte of the input text is represented somewhere in the output syntax tree, in the order it appeared in the source text.
  • Fault tolerant parser - Every input text produces an output tree and a set of errors. The closer the input is to a standards-compliant HTML document, the fewer error diagnostics are produced.
  • Syntax, not Semantic - Teasel parses HTML as a syntax tree. The end result is not an HTML DOM. This means that all the warts of the original document are available to dig into; ideal for linters.
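
To make the first two properties concrete, here is a minimal toy sketch of a lossless, fault tolerant tokeniser. It is not Teasel's implementation or API: the names `lexLossless`, `render`, `Token` and `LexResult` are hypothetical and exist only for this illustration. It shows that every character of the input lands in exactly one token, and that malformed input still yields output alongside a list of errors.

```typescript
// Toy illustration only: NOT Teasel's implementation or API.
// All names here (Token, LexResult, lexLossless, render) are hypothetical.
type TokenKind = 'tagOpen' | 'tagClose' | 'text' | 'error';

interface Token {
  kind: TokenKind;
  text: string; // exact source slice, never normalised
}

interface LexResult {
  tokens: Token[];
  errors: string[];
}

function lexLossless(source: string): LexResult {
  const tokens: Token[] = [];
  const errors: string[] = [];
  let pos = 0;
  while (pos < source.length) {
    if (source[pos] === '<') {
      const close = source.indexOf('>', pos);
      if (close === -1) {
        // Unterminated tag: report it, but keep the characters in the output.
        errors.push(`Unterminated tag at offset ${pos}`);
        tokens.push({kind: 'error', text: source.slice(pos)});
        pos = source.length;
      } else {
        const text = source.slice(pos, close + 1);
        const kind: TokenKind = text.startsWith('</') ? 'tagClose' : 'tagOpen';
        tokens.push({kind, text});
        pos = close + 1;
      }
    } else {
      // Plain text runs up to the next '<' or the end of the input.
      const next = source.indexOf('<', pos);
      const end = next === -1 ? source.length : next;
      tokens.push({kind: 'text', text: source.slice(pos, end)});
      pos = end;
    }
  }
  return {tokens, errors};
}

// Concatenating the token texts in order reproduces the input exactly.
function render(tokens: Token[]): string {
  return tokens.map((t) => t.text).join('');
}

const input = '<html><p>Hello World</p';
const {tokens, errors} = lexLossless(input);
console.log(render(tokens) === input); // true: nothing was dropped
console.log(errors.length); // 1: the unterminated closing tag
```

A real syntax tree adds nesting on top of a token stream like this, but the round-trip guarantee is the same idea: a linter or refactoring tool can edit one node and write everything else back unchanged.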

Docs and Getting Started

To get started, install Teasel from GitHub Packages:

$ npm install @iwillspeak/teasel@0.3.0

Once installed you can then parse any string containing HTML into a syntax tree:

import {Parser} from '@iwillspeak/teasel/lib/parse/Parser.js';

const result = Parser.parseDocument('<html><p>Hello World');

Check out the teasel docs for where to go next.

Repo Structure

This repository contains three main packages:

  • teasel - The main parser library. This is the package you want to reference as a consumer.
  • pyracantha - The language agnostic low-level syntax tree library used by teasel to represent parsed documents.
  • teasel-cli - A command line tool to test parsing HTML documents with teasel.

🐲 TODO 🐲:

  • Handle attributes on opening tags
  • Better error recovery when expect fails.
    • Tolerate and warn on some malformed whitespace. e.g.: < p>.
    • Malformed attribute lists synchronise on >.
  • Node cache should cache nodes in the green tree builder.
  • Node cache interface and implementation.
  • Parser should accept optional cache.
  • Handle closing of outer tags correctly. e.g.: <p><i>hello</p>.
  • Handle closing of non-nesting siblings. e.g.: <li>a<li>b.
  • Handle implicit self-closing of 'void' elements, e.g. <hr>.
  • Support for esoteric DOCTYPEs e.g. SYSTEM 'about:legacy-compat'.
  • Document and fragment parse APIs.
  • Syntax builder / factory API for creating and updating nodes.
  • Handling of raw text elements. e.g. script, and style.
  • Support for character references. e.g. &amp;.
  • HTML / XML crossover
  • Support for processing instructions, e.g. <?xml version="1.0">.
  • Support for CDATA values / tokens.