Teasing HTML elements from plain text
Teasel is an HTML syntax tree parser written in TypeScript. Teasel aims to be a fast and reliable full-fidelity parser for HTML linters and refactoring tools.
- Full-fidelity tree - Every byte in the input text will be represented somewhere in the output syntax tree, in the order it was in the source text.
- Fault tolerant parser - All input texts produce an output tree and a set of errors. The closer the input is to a standards-compliant HTML document, the fewer error diagnostics are reported.
- Syntax, not semantics - Teasel parses HTML as a syntax tree. The end result is not an HTML DOM. This means that all the warts of the original document are available to dig into, which is ideal for linters.
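The full-fidelity and fault-tolerance properties above can be illustrated with a toy tokenizer. This is only a minimal sketch of the idea, not Teasel's implementation: the names `tokenize`, `roundTrip`, `Token` and `TokenizeResult` are hypothetical and exist only for this example.

```typescript
// Hypothetical token kinds for this sketch; these are not Teasel's own names.
type TokenKind = 'tagOpen' | 'tagClose' | 'text' | 'error';

interface Token {
  kind: TokenKind;
  text: string;
}

interface TokenizeResult {
  tokens: Token[];
  errors: string[];
}

// Split the input into tokens without discarding a single character.
function tokenize(input: string): TokenizeResult {
  const tokens: Token[] = [];
  const errors: string[] = [];
  let pos = 0;
  while (pos < input.length) {
    if (input[pos] === '<') {
      const end = input.indexOf('>', pos);
      if (end === -1) {
        // Unterminated tag: keep the text in the output, but report it.
        errors.push(`Unterminated tag at offset ${pos}`);
        tokens.push({kind: 'error', text: input.slice(pos)});
        pos = input.length;
      } else {
        const text = input.slice(pos, end + 1);
        const kind: TokenKind = text.startsWith('</') ? 'tagClose' : 'tagOpen';
        tokens.push({kind, text});
        pos = end + 1;
      }
    } else {
      // Plain text runs up to the next '<' or the end of the input.
      const next = input.indexOf('<', pos);
      const end = next === -1 ? input.length : next;
      tokens.push({kind: 'text', text: input.slice(pos, end)});
      pos = end;
    }
  }
  return {tokens, errors};
}

// Full fidelity: concatenating the token text reproduces the source exactly.
function roundTrip(result: TokenizeResult): string {
  return result.tokens.map((token) => token.text).join('');
}
```

Even for malformed input such as `<p>oops <b`, this sketch still returns tokens that concatenate back to the original string, alongside one diagnostic. A linter can rely on that invariant to map findings back to exact source positions.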
To get started, install Teasel from GitHub Packages:
$ npm install @iwillspeak/teasel@0.3.0
Once installed, you can parse any string containing HTML into a syntax tree:
import {Parser} from '@iwillspeak/teasel/lib/parse/Parser.js';
const result = Parser.parseDocument('<html><p>Hello World');
Check out the teasel docs for where to go next.
This repository contains three main packages:
- teasel - The main parser library. This is the package you want to reference as a consumer.
- pyracantha - The language agnostic low-level syntax tree library used by teasel to represent parsed documents.
- teasel-cli - A command line tool to test parsing HTML documents with teasel.

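To give a feel for what a language agnostic syntax tree library can look like, here is a minimal sketch of an immutable "green" tree with a small token cache. It assumes a design where nodes store text widths rather than absolute offsets. Every name in it (`GreenToken`, `GreenNode`, `makeNode`, `textOf`, `TokenCache`) is hypothetical and is not pyracantha's API.

```typescript
// Hypothetical types for illustration only; these are not pyracantha's API.
interface GreenToken {
  readonly kind: string;
  readonly text: string;
}

interface GreenNode {
  readonly kind: string;
  readonly children: ReadonlyArray<GreenElement>;
  // Width in characters, so no absolute offsets are stored in the tree.
  readonly textLength: number;
}

type GreenElement = GreenNode | GreenToken;

function isToken(element: GreenElement): element is GreenToken {
  return 'text' in element;
}

function lengthOf(element: GreenElement): number {
  return isToken(element) ? element.text.length : element.textLength;
}

// Build an interior node whose width is the sum of its children's widths.
function makeNode(kind: string, children: GreenElement[]): GreenNode {
  const textLength = children.reduce((sum, child) => sum + lengthOf(child), 0);
  return {kind, children, textLength};
}

// Recover the source text by walking the tree in order.
function textOf(element: GreenElement): string {
  return isToken(element)
    ? element.text
    : element.children.map((child) => textOf(child)).join('');
}

// Hands back the same token object for identical kind and text.
class TokenCache {
  private readonly tokens = new Map<string, GreenToken>();

  token(kind: string, text: string): GreenToken {
    // Assumes kinds never contain a NUL character, so keys cannot collide.
    const key = `${kind}\u0000${text}`;
    let cached = this.tokens.get(key);
    if (cached === undefined) {
      cached = {kind, text};
      this.tokens.set(key, cached);
    }
    return cached;
  }
}
```

Because tokens carry no position, two occurrences of `<li>` can share one object, which is what makes caching worthwhile in this style of tree.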
- Handle attributes on opening tags
- Better error recovery when expect fails.
  - Tolerate and warn on some malformed whitespace, e.g. < p>.
  - Malformed attribute lists synchronise on >.
- Node cache should cache nodes in the green tree builder.
- Node cache interface and implementation.
- Parser should accept optional cache.
- Handle closing of outer tags correctly, e.g. <p><i>hello</p>.
- Handle closing of non-nesting siblings, e.g. <li>a<li>b.
- Handling for implicit self closing of 'void' elements, <hr> etc.
- Support for esoteric DOCTYPEs, e.g. SYSTEM 'about:legacy-compat'.
- Document and fragment parse APIs.
- Syntax builder / factory API for creating and updating nodes.
- Handling of raw text elements, e.g. script and style.
- Support for character references, e.g. &amp;.
- HTML / XML crossover
  - Support for processing instructions, e.g. <?xml version="1.0">.
  - Support for CDATA values / tokens.
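Several items in the list above concern implicit closing: outer end tags such as `<p><i>hello</p>`, non-nesting siblings such as `<li>a<li>b`, and void elements such as `<hr>`. The sketch below shows one simple way these rules can be applied with a stack of open elements. It is an illustration under simplifying assumptions, not Teasel's algorithm: `TagEvent`, `TreeElement`, `buildTree` and `outline` are hypothetical names, and the sibling table covers only a few of the rules in the HTML standard.

```typescript
// Hypothetical event and tree shapes for this sketch; not Teasel's types.
type TagEvent =
  | {type: 'open'; name: string}
  | {type: 'close'; name: string}
  | {type: 'text'; text: string};

interface TreeElement {
  name: string;
  children: Array<TreeElement | string>;
}

// Void elements from the HTML standard: no content and no end tag.
const VOID_ELEMENTS = new Set([
  'area', 'base', 'br', 'col', 'embed', 'hr', 'img',
  'input', 'link', 'meta', 'source', 'track', 'wbr',
]);

// Open element -> start tags that implicitly close it (deliberately partial).
const CLOSED_BY_SIBLING = new Map<string, Set<string>>([
  ['li', new Set(['li'])],
  ['dt', new Set(['dt', 'dd'])],
  ['dd', new Set(['dt', 'dd'])],
]);

function buildTree(events: TagEvent[]): TreeElement {
  const root: TreeElement = {name: '#root', children: []};
  const stack: TreeElement[] = [root];
  const top = (): TreeElement => stack[stack.length - 1]!;
  for (const event of events) {
    if (event.type === 'text') {
      top().children.push(event.text);
    } else if (event.type === 'open') {
      // A sibling that cannot nest closes the currently open element first.
      if (stack.length > 1 && CLOSED_BY_SIBLING.get(top().name)?.has(event.name)) {
        stack.pop();
      }
      const element: TreeElement = {name: event.name, children: []};
      top().children.push(element);
      // Void elements are never left open.
      if (!VOID_ELEMENTS.has(event.name)) {
        stack.push(element);
      }
    } else {
      // An end tag closes the nearest matching element and everything inside it.
      const index = stack.map((open) => open.name).lastIndexOf(event.name);
      if (index > 0) {
        stack.length = index;
      }
      // An unmatched end tag is ignored here; a real parser would report it.
    }
  }
  return root;
}

// Compact text rendering of a tree, handy for inspecting results.
function outline(node: TreeElement | string): string {
  if (typeof node === 'string') {
    return JSON.stringify(node);
  }
  return `${node.name}[${node.children.map((child) => outline(child)).join(',')}]`;
}
```

For the events corresponding to `<li>a<li>b`, `outline(buildTree(...))` yields `#root[li["a"],li["b"]]`: the second `li` becomes a sibling of the first rather than a child.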