Example HTML Parser (including Tokenizer & Serializer)

This project demonstrates the basics of building an HTML tokenizer, parser, and serializer. I would advise against using it in production in any form, as it is far from optimized: it is not forgiving of malformed input and it does not handle errors.

This project is a simple HTML parser written in JavaScript, which includes three main components:

Tokenizer: Breaks down HTML into individual tokens.
Parser: Converts tokens into an abstract syntax tree (AST).
Serializer: Converts the syntax tree back into a valid HTML string.

About This Project

The sections below describe each of the three components in more detail.

Tokenizer

The Tokenizer breaks down an HTML document into tokens, which represent different parts of the HTML structure, such as tags, text nodes, and the DOCTYPE declaration. Each token is classified into a type:

DOCTYPE: Represents the document type declaration.
TAG: Represents HTML tags, both opening (e.g., <p>) and closing (e.g., </p>).
TEXT: Represents the text content within tags.
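
To make the token types concrete, here is a minimal sketch of how such a tokenizer could work. It is an illustration only, not the repository's HTMLTokenizer: the function name tokenizeSketch is hypothetical, and the sketch assumes that whitespace-only text between tags is dropped and that attributes, comments, and malformed input need no special handling.

```javascript
// Hypothetical sketch: not the repository's HTMLTokenizer implementation.
function tokenizeSketch(html) {
  const tokens = [];
  let i = 0;
  while (i < html.length) {
    if (html[i] === '<') {
      // Read up to the closing '>' (or to the end of input if it is missing).
      const end = html.indexOf('>', i);
      const stop = end === -1 ? html.length : end + 1;
      const value = html.slice(i, stop);
      const type = /^<!doctype/i.test(value) ? 'DOCTYPE' : 'TAG';
      tokens.push({ type, value });
      i = stop;
    } else {
      // Read text up to the next '<' (or to the end of input).
      const next = html.indexOf('<', i);
      const stop = next === -1 ? html.length : next;
      const value = html.slice(i, stop);
      // Assumption: whitespace-only runs between tags produce no token.
      if (value.trim() !== '') tokens.push({ type: 'TEXT', value });
      i = stop;
    }
  }
  return tokens;
}

console.log(tokenizeSketch('<!DOCTYPE html>\n<p>Hello, world!</p>'));
```

For this input the sketch produces four tokens: one DOCTYPE, an opening TAG, a TEXT token, and a closing TAG.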

Parser

The Parser takes the tokens generated by the tokenizer and organizes them into a structured tree known as an Abstract Syntax Tree (AST). This tree structure represents the hierarchy and nesting of HTML elements, making it easy to analyze and manipulate the document's structure programmatically.
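
One common way to build such a tree is to keep a stack of open elements: an opening tag pushes a new element, and a closing tag pops back to the parent. The sketch below illustrates that idea under some assumptions: parseSketch is a hypothetical name, the node shapes follow the example output in this README, and void elements (such as <br>), attributes, and mismatched closing tags are not handled. It is not the repository's HTMLParser.

```javascript
// Hypothetical sketch: not the repository's HTMLParser implementation.
function parseSketch(tokens) {
  const root = { type: 'Document', children: [] };
  const stack = [root]; // the last entry is the node currently being filled

  for (const token of tokens) {
    const parent = stack[stack.length - 1];
    if (token.type === 'DOCTYPE') {
      parent.children.push({ type: 'DOCTYPE', value: token.value });
    } else if (token.type === 'TEXT') {
      parent.children.push({ type: 'Text', value: token.value });
    } else if (token.value.startsWith('</')) {
      // Closing tag: step back up to the parent, but never pop the root.
      if (stack.length > 1) stack.pop();
    } else {
      // Opening tag: take the name after '<', ignoring any attributes.
      const tagName = token.value.slice(1, -1).trim().split(/\s+/)[0];
      const element = { type: 'Element', tagName, children: [] };
      parent.children.push(element);
      stack.push(element);
    }
  }
  return root;
}

const sketchTokens = [
  { type: 'TAG', value: '<p>' },
  { type: 'TEXT', value: 'Hello, world!' },
  { type: 'TAG', value: '</p>' },
];
console.log(JSON.stringify(parseSketch(sketchTokens), null, 2));
```

The stack makes nesting explicit: whatever sits on top of it is the parent of the next node that gets appended.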

Serializer

The Serializer takes the AST generated by the parser and converts it back into a valid HTML string. This allows us to transform or modify the HTML document via the AST and then serialize it back to standard HTML format.
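
Serialization can be sketched as a recursive walk over the tree that emits each node's text form. The example below assumes the node shapes shown in this README's example output and ignores attributes and escaping; serializeSketch is a hypothetical name, not the repository's HTMLSerializer. It also shows the modify-then-serialize workflow described above.

```javascript
// Hypothetical sketch: not the repository's HTMLSerializer implementation.
function serializeSketch(node) {
  switch (node.type) {
    case 'Document':
      return node.children.map((child) => serializeSketch(child)).join('');
    case 'DOCTYPE':
    case 'Text':
      // Both node kinds carry their literal source text in `value`.
      return node.value;
    case 'Element': {
      const inner = node.children.map((child) => serializeSketch(child)).join('');
      return `<${node.tagName}>${inner}</${node.tagName}>`;
    }
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

const sketchAst = {
  type: 'Document',
  children: [
    { type: 'DOCTYPE', value: '<!DOCTYPE html>' },
    {
      type: 'Element',
      tagName: 'p',
      children: [{ type: 'Text', value: 'Hello, world!' }],
    },
  ],
};

console.log(serializeSketch(sketchAst)); // <!DOCTYPE html><p>Hello, world!</p>

// Modify the tree, then serialize it again.
sketchAst.children[1].children[0].value = 'Goodbye!';
console.log(serializeSketch(sketchAst)); // <!DOCTYPE html><p>Goodbye!</p>
```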

How to Use

Running the Components

To run the tokenizer, parser, and serializer, follow these steps:

Clone this repository.
Open the project directory in your terminal.
Run the code as shown in the example below.

Example Code

Here’s a quick example of how to use the HTML tokenizer, parser, and serializer:

const html = `<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><p>Hello, world!</p></body>
</html>`;

// Initialize the tokenizer and split the HTML into tokens
const tokenizer = new HTMLTokenizer(html);
const tokens = tokenizer.tokenize();

console.log('Tokens:', tokens);

// Initialize parser and parse the tokens into an AST
const parser = new HTMLParser(tokens);
const ast = parser.parse();

console.log('AST:', ast);

// Serialize the AST back into HTML
const serializer = new HTMLSerializer(ast);
const serializedHtml = serializer.serialize();

console.log('Serialized HTML:', serializedHtml);

Example Output

For the HTML example above, you should expect output similar to the following.

Tokens

[
  { "type": "DOCTYPE", "value": "<!DOCTYPE html>" },
  { "type": "TAG", "value": "<html>" },
  { "type": "TAG", "value": "<head>" },
  { "type": "TAG", "value": "<title>" },
  { "type": "TEXT", "value": "Test" },
  { "type": "TAG", "value": "</title>" },
  { "type": "TAG", "value": "</head>" },
  { "type": "TAG", "value": "<body>" },
  { "type": "TAG", "value": "<p>" },
  { "type": "TEXT", "value": "Hello, world!" },
  { "type": "TAG", "value": "</p>" },
  { "type": "TAG", "value": "</body>" },
  { "type": "TAG", "value": "</html>" }
]

AST (Abstract Syntax Tree)

{
  "type": "Document",
  "children": [
    { "type": "DOCTYPE", "value": "<!DOCTYPE html>" },
    {
      "type": "Element",
      "tagName": "html",
      "children": [
        {
          "type": "Element",
          "tagName": "head",
          "children": [
            {
              "type": "Element",
              "tagName": "title",
              "children": [{ "type": "Text", "value": "Test" }]
            }
          ]
        },
        {
          "type": "Element",
          "tagName": "body",
          "children": [
            {
              "type": "Element",
              "tagName": "p",
              "children": [{ "type": "Text", "value": "Hello, world!" }]
            }
          ]
        }
      ]
    }
  ]
}

Serialized HTML

<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><p>Hello, world!</p></body>
</html>

The node tree also supports simple selector queries for searching nodes:

const result = domTree.querySelector('root > html > head > title');
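
A query like the one above can be resolved by walking the tree one path segment at a time. The sketch below is a guess at how this might work rather than the project's actual querySelector: querySketch is a hypothetical name, only chains of the child combinator (>) are supported, the leading root segment is assumed to refer to the document node, and at each step only the first matching child is followed.

```javascript
// Hypothetical sketch: not the repository's querySelector implementation.
function querySketch(root, selector) {
  const [first, ...rest] = selector.split('>').map((part) => part.trim());
  // Assumption: every path starts at the document node, written as 'root'.
  if (first !== 'root') return null;

  let current = root;
  for (const name of rest) {
    // Follow the first child element with a matching tag name.
    const next = current.children.find(
      (child) => child.type === 'Element' && child.tagName === name
    );
    if (!next) return null;
    current = next;
  }
  return current;
}

const sketchTree = {
  type: 'Document',
  children: [
    {
      type: 'Element',
      tagName: 'html',
      children: [
        {
          type: 'Element',
          tagName: 'head',
          children: [
            {
              type: 'Element',
              tagName: 'title',
              children: [{ type: 'Text', value: 'Test' }],
            },
          ],
        },
      ],
    },
  ],
};

const titleNode = querySketch(sketchTree, 'root > html > head > title');
console.log(titleNode.children[0].value); // Test
```

Because the sketch never backtracks, it can miss a match that sits under a later sibling with the same tag name; a fuller implementation would search all matching children.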

Features

Tokenization of HTML Tags: The tokenizer can process standard HTML tags, both opening and closing.
AST Generation: Builds an abstract syntax tree from tokens, representing the HTML document structure.
Serialization: Converts the AST back into HTML format, preserving the document structure.
DOM Traversal: A simple query selector for searching the node tree.

Repository: gvanastasov/example-html-parser
