This project demonstrates the basics of building an HTML tokenizer, parser, and serializer. I would advise against using it in production in any form: it is far from optimized, it is not forgiving of malformed markup, and it does not handle errors.
This project is a simple HTML parser written in JavaScript, which includes three main components:
- Tokenizer: Breaks down HTML into individual tokens.
- Parser: Converts tokens into a structured syntax tree (AST).
- Serializer: Converts the syntax tree back into a valid HTML string.
The Tokenizer breaks down an HTML document into tokens, which represent different parts of the HTML structure, such as tags, text nodes, and the DOCTYPE declaration. Each token is classified into a type:
- DOCTYPE: Represents the document type declaration.
- TAG: Represents HTML tags, both opening (e.g., `<p>`) and closing (e.g., `</p>`).
- TEXT: Represents the text content within tags.
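To make the token types above concrete, here is a minimal sketch of how such a tokenizer could work. It is not the project's `HTMLTokenizer`; the function name `sketchTokenize`, the regular expression, and the decision to drop whitespace-only text are all assumptions made for illustration.

```javascript
// Hypothetical sketch; NOT the project's HTMLTokenizer implementation.
function sketchTokenize(html) {
  const tokens = [];
  // Match either a "<...>" run (DOCTYPE or tag) or a run of text up to the next "<".
  const pattern = /<[^>]*>|[^<]+/g;
  for (const [chunk] of html.matchAll(pattern)) {
    if (chunk.startsWith('<')) {
      // A "<...>" chunk is a DOCTYPE if it starts with "<!doctype" (any case), else a tag.
      const type = /^<!doctype/i.test(chunk) ? 'DOCTYPE' : 'TAG';
      tokens.push({ type, value: chunk });
    } else if (chunk.trim() !== '') {
      // Assumption: whitespace-only text between tags is dropped.
      tokens.push({ type: 'TEXT', value: chunk });
    }
  }
  return tokens;
}

console.log(sketchTokenize('<!DOCTYPE html>\n<p>Hello, world!</p>'));
```

Like the project itself, this sketch is not forgiving: comments, attributes containing `>`, and unterminated tags are not handled.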
The Parser takes the tokens generated by the tokenizer and organizes them into a structured tree known as an Abstract Syntax Tree (AST). This tree structure represents the hierarchy and nesting of HTML elements, making it easy to analyze and manipulate the document's structure programmatically.
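One common way to turn a flat token list into a tree is to keep a stack of open elements. The sketch below illustrates that idea only; it is not the project's `HTMLParser`. The name `sketchParse` is hypothetical, and the sketch assumes every closing tag matches the most recently opened element.

```javascript
// Hypothetical sketch; NOT the project's HTMLParser implementation.
function sketchParse(tokens) {
  const root = { type: 'Document', children: [] };
  const stack = [root]; // the last entry is the currently open node

  for (const token of tokens) {
    const parent = stack[stack.length - 1];
    if (token.type === 'DOCTYPE') {
      parent.children.push({ type: 'DOCTYPE', value: token.value });
    } else if (token.type === 'TEXT') {
      parent.children.push({ type: 'Text', value: token.value });
    } else if (token.value.startsWith('</')) {
      // Closing tag: assume it matches the open element and pop it.
      if (stack.length > 1) stack.pop();
    } else {
      // Opening tag: the name runs up to the first whitespace, "/" or ">".
      const tagName = token.value.slice(1).split(/[\s/>]/)[0];
      const element = { type: 'Element', tagName, children: [] };
      parent.children.push(element);
      stack.push(element);
    }
  }
  return root;
}

const ast = sketchParse([
  { type: 'TAG', value: '<p>' },
  { type: 'TEXT', value: 'Hello, world!' },
  { type: 'TAG', value: '</p>' },
]);
console.log(JSON.stringify(ast, null, 2));
```

Void elements such as `<br>` have no closing tag, so a sketch this simple would nest everything after them incorrectly.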
The Serializer takes the AST generated by the parser and converts it back into a valid HTML string. This allows us to transform or modify the HTML document via the AST and then serialize it back to standard HTML format.
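Serialization can be sketched as a recursive walk over the tree. The function below is an illustration under assumptions, not the project's `HTMLSerializer`: the name `sketchSerialize` is hypothetical, and it assumes the node shapes `Document`, `DOCTYPE`, `Element`, and `Text` with no attributes.

```javascript
// Hypothetical sketch; NOT the project's HTMLSerializer implementation.
function sketchSerialize(node) {
  switch (node.type) {
    case 'Document':
      // The document itself emits nothing; only its children do.
      return node.children.map((child) => sketchSerialize(child)).join('');
    case 'DOCTYPE':
    case 'Text':
      // Both node kinds carry their literal source text in "value".
      return node.value;
    case 'Element': {
      const inner = node.children.map((child) => sketchSerialize(child)).join('');
      return `<${node.tagName}>${inner}</${node.tagName}>`;
    }
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

const doc = {
  type: 'Document',
  children: [
    { type: 'DOCTYPE', value: '<!DOCTYPE html>' },
    { type: 'Element', tagName: 'p', children: [{ type: 'Text', value: 'Hello, world!' }] },
  ],
};
console.log(sketchSerialize(doc)); // prints "<!DOCTYPE html><p>Hello, world!</p>"
```

This sketch does not reinsert line breaks or indentation, so its output is more compact than the original source would be.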
To run the tokenizer, parser, and serializer, follow these steps:
- Clone this repository.
- Open the project directory in your terminal.
- Run the code as shown in the example below.
Here’s a quick example of how to use the HTML tokenizer, parser, and serializer:
const html = `<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><p>Hello, world!</p></body>
</html>`;
// Initialize the tokenizer and split the HTML into tokens
const tokenizer = new HTMLTokenizer(html);
const tokens = tokenizer.tokenize();
console.log('Tokens:', tokens);
// Initialize parser and parse the tokens into an AST
const parser = new HTMLParser(tokens);
const ast = parser.parse();
console.log('AST:', ast);
// Serialize the AST back into HTML
const serializer = new HTMLSerializer(ast);
const serializedHtml = serializer.serialize();
console.log('Serialized HTML:', serializedHtml);

For the HTML example above, you should expect output similar to:

Tokens
[
  { "type": "DOCTYPE", "value": "<!DOCTYPE html>" },
  { "type": "TAG", "value": "<html>" },
  { "type": "TAG", "value": "<head>" },
  { "type": "TAG", "value": "<title>" },
  { "type": "TEXT", "value": "Test" },
  { "type": "TAG", "value": "</title>" },
  { "type": "TAG", "value": "</head>" },
  { "type": "TAG", "value": "<body>" },
  { "type": "TAG", "value": "<p>" },
  { "type": "TEXT", "value": "Hello, world!" },
  { "type": "TAG", "value": "</p>" },
  { "type": "TAG", "value": "</body>" },
  { "type": "TAG", "value": "</html>" }
]

AST (Abstract Syntax Tree)
{
  "type": "Document",
  "children": [
    { "type": "DOCTYPE", "value": "<!DOCTYPE html>" },
    {
      "type": "Element",
      "tagName": "html",
      "children": [
        {
          "type": "Element",
          "tagName": "head",
          "children": [
            {
              "type": "Element",
              "tagName": "title",
              "children": [{ "type": "Text", "value": "Test" }]
            }
          ]
        },
        {
          "type": "Element",
          "tagName": "body",
          "children": [
            {
              "type": "Element",
              "tagName": "p",
              "children": [{ "type": "Text", "value": "Hello, world!" }]
            }
          ]
        }
      ]
    }
  ]
}

Serialized HTML
<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><p>Hello, world!</p></body>
</html>

Queries support searching the node tree:
const result = domTree.querySelector('root > html > head > title');

Features:
- Tokenization of HTML Tags: The tokenizer can process standard HTML tags, both opening and closing.
- AST Generation: Builds an abstract syntax tree from tokens, representing the HTML document structure.
- Serialization: Converts the AST back into HTML format, preserving the document structure.
- Querying: A simple query selector for traversing the DOM tree.
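The child-path query shown earlier (`'root > html > head > title'`) could be resolved by walking the tree one path segment at a time. The sketch below is a guess at that behavior, not the project's `querySelector`: the name `sketchQuery` is hypothetical, and it assumes that `root` stands for the document node and that only the `>` combinator with plain tag names is supported.

```javascript
// Hypothetical sketch; NOT the project's querySelector implementation.
function sketchQuery(root, selector) {
  const [first, ...rest] = selector.split('>').map((part) => part.trim());
  // Assumption: every path starts with "root", which stands for the document node.
  if (first !== 'root') return null;

  let current = root;
  for (const name of rest) {
    // Descend into the first direct child element with the requested tag name.
    current = current.children.find(
      (child) => child.type === 'Element' && child.tagName === name
    );
    if (!current) return null;
  }
  return current;
}

const tree = {
  type: 'Document',
  children: [
    { type: 'Element', tagName: 'html', children: [
      { type: 'Element', tagName: 'head', children: [
        { type: 'Element', tagName: 'title', children: [{ type: 'Text', value: 'Test' }] },
      ] },
    ] },
  ],
};
console.log(sketchQuery(tree, 'root > html > head > title').children[0].value); // prints "Test"
```

Because the sketch always follows the first matching child and never backtracks, it can miss a match that lives under a later sibling with the same tag name.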