gvanastasov/example-html-parser


Example HTML Parser (including Tokenizer & Serializer)

This project demonstrates the basics of building an HTML tokenizer, parser, and serializer. I would advise against using it in production in any form: it is far from optimized, is not forgiving of malformed input, and does not handle errors.

This project is a simple HTML parser written in JavaScript, which includes three main components:

  1. Tokenizer: Breaks down HTML into individual tokens.
  2. Parser: Converts tokens into a structured syntax tree (AST).
  3. Serializer: Converts the syntax tree back into a valid HTML string.

About This Project

The three components work as follows:

Tokenizer

The Tokenizer breaks down an HTML document into tokens, which represent different parts of the HTML structure, such as tags, text nodes, and the DOCTYPE declaration. Each token is classified into a type:

  • DOCTYPE: Represents the document type declaration.
  • TAG: Represents HTML tags, both opening (e.g., <p>) and closing (e.g., </p>).
  • TEXT: Represents the text content within tags.
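To make the three token types concrete, here is a minimal sketch of how such a tokenizer could work. It is not the repository's HTMLTokenizer: the function name sketchTokenize, the regular expression, and the choice to skip whitespace-only text are all assumptions made for illustration.

```javascript
// Hypothetical sketch; not the repository's HTMLTokenizer.
// Limitations: no comments, no '>' inside attribute values, no error recovery.
function sketchTokenize(html) {
  const tokens = [];
  // Match either a complete <...> chunk or a run of characters up to the next '<'.
  const pattern = /<[^>]*>|[^<]+/g;
  for (const [chunk] of html.matchAll(pattern)) {
    if (chunk.startsWith('<')) {
      // A chunk starting with "<!doctype" (any case) is the document type declaration.
      const type = /^<!doctype/i.test(chunk) ? 'DOCTYPE' : 'TAG';
      tokens.push({ type, value: chunk });
    } else if (chunk.trim() !== '') {
      // Assumption: whitespace-only text between tags is dropped.
      tokens.push({ type: 'TEXT', value: chunk });
    }
  }
  return tokens;
}

const sampleTokens = sketchTokenize('<!DOCTYPE html><p>Hello, world!</p>');
console.log(sampleTokens);
```

A single regular expression keeps the sketch short; a hand-written character scanner would be a better fit if you wanted to report positions or recover from malformed input.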

Parser

The Parser takes the tokens generated by the tokenizer and organizes them into a structured tree known as an Abstract Syntax Tree (AST). This tree structure represents the hierarchy and nesting of HTML elements, making it easy to analyze and manipulate the document's structure programmatically.
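One common way to build such a tree is to keep a stack of open elements: an opening tag pushes a new element, a closing tag pops it. The sketch below illustrates that idea under stated assumptions. It is not the repository's HTMLParser; the name sketchParse is hypothetical, and the node shapes are modelled on the example output in this README.

```javascript
// Hypothetical sketch; not the repository's HTMLParser.
// Limitations: attributes are ignored, void elements such as <br> are not
// special-cased, and closing tags are not checked against the open element.
function sketchParse(tokens) {
  const root = { type: 'Document', children: [] };
  const stack = [root]; // open elements; the last entry is the current parent
  for (const token of tokens) {
    const parent = stack[stack.length - 1];
    if (token.type === 'DOCTYPE') {
      parent.children.push({ type: 'DOCTYPE', value: token.value });
    } else if (token.type === 'TEXT') {
      parent.children.push({ type: 'Text', value: token.value });
    } else if (token.value.startsWith('</')) {
      // Closing tag: go back to the enclosing element, but never pop the root.
      if (stack.length > 1) stack.pop();
    } else {
      // Opening tag: the tag name is the first word after '<'.
      const tagName = token.value.slice(1, -1).trim().split(/\s+/)[0];
      const element = { type: 'Element', tagName, children: [] };
      parent.children.push(element);
      stack.push(element);
    }
  }
  return root;
}

const sampleAst = sketchParse([
  { type: 'TAG', value: '<title>' },
  { type: 'TEXT', value: 'Test' },
  { type: 'TAG', value: '</title>' },
]);
console.log(JSON.stringify(sampleAst, null, 2));
```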

Serializer

The Serializer takes the AST generated by the parser and converts it back into a valid HTML string. This allows us to transform or modify the HTML document via the AST and then serialize it back to standard HTML format.
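Serialization can be sketched as a recursive walk over the tree. The function below is an illustration only, not the repository's HTMLSerializer: the name sketchSerialize is hypothetical, and it assumes the node shapes (Document, DOCTYPE, Element, Text) shown in this README's example output.

```javascript
// Hypothetical sketch; not the repository's HTMLSerializer.
// Limitations: no attributes, no escaping, and no whitespace between tags.
function sketchSerialize(node) {
  switch (node.type) {
    case 'Document':
      // The document itself has no markup; emit its children in order.
      return node.children.map((child) => sketchSerialize(child)).join('');
    case 'DOCTYPE':
    case 'Text':
      // Both node kinds store their literal source text in `value`.
      return node.value;
    case 'Element': {
      const inner = node.children.map((child) => sketchSerialize(child)).join('');
      return `<${node.tagName}>${inner}</${node.tagName}>`;
    }
    default:
      throw new Error(`Unknown node type: ${node.type}`);
  }
}

const sampleHtml = sketchSerialize({
  type: 'Document',
  children: [
    { type: 'DOCTYPE', value: '<!DOCTYPE html>' },
    {
      type: 'Element',
      tagName: 'p',
      children: [{ type: 'Text', value: 'Hello, world!' }],
    },
  ],
});
console.log(sampleHtml);
```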

How to Use

Running the Components

To run the tokenizer, parser, and serializer, follow these steps:

  1. Clone this repository.
  2. Open the project directory in your terminal.
  3. Run the code as shown in the example below.

Example Code

Here’s a quick example of how to use the HTML tokenizer, parser, and serializer:

const html = `<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><p>Hello, world!</p></body>
</html>`;

// Initialize tokenizer and parse the tokens
const tokenizer = new HTMLTokenizer(html);
const tokens = tokenizer.tokenize();

console.log('Tokens:', tokens);

// Initialize parser and parse the tokens into an AST
const parser = new HTMLParser(tokens);
const ast = parser.parse();

console.log('AST:', ast);

// Serialize the AST back into HTML
const serializer = new HTMLSerializer(ast);
const serializedHtml = serializer.serialize();

console.log('Serialized HTML:', serializedHtml);

Example Output

For the HTML example above, you should expect output similar to the following.

Tokens

[
  { "type": "DOCTYPE", "value": "<!DOCTYPE html>" },
  { "type": "TAG", "value": "<html>" },
  { "type": "TAG", "value": "<head>" },
  { "type": "TAG", "value": "<title>" },
  { "type": "TEXT", "value": "Test" },
  { "type": "TAG", "value": "</title>" },
  { "type": "TAG", "value": "</head>" },
  { "type": "TAG", "value": "<body>" },
  { "type": "TAG", "value": "<p>" },
  { "type": "TEXT", "value": "Hello, world!" },
  { "type": "TAG", "value": "</p>" },
  { "type": "TAG", "value": "</body>" },
  { "type": "TAG", "value": "</html>" }
]

AST (Abstract Syntax Tree)

{
  "type": "Document",
  "children": [
    { "type": "DOCTYPE", "value": "<!DOCTYPE html>" },
    {
      "type": "Element",
      "tagName": "html",
      "children": [
        {
          "type": "Element",
          "tagName": "head",
          "children": [
            {
              "type": "Element",
              "tagName": "title",
              "children": [{ "type": "Text", "value": "Test" }]
            }
          ]
        },
        {
          "type": "Element",
          "tagName": "body",
          "children": [
            {
              "type": "Element",
              "tagName": "p",
              "children": [{ "type": "Text", "value": "Hello, world!" }]
            }
          ]
        }
      ]
    }
  ]
}

Serialized HTML

<!DOCTYPE html>
<html>
<head><title>Test</title></head>
<body><p>Hello, world!</p></body>
</html>

The node tree can also be searched with a simple query selector:

const result = domTree.querySelector('root > html > head > title');
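A selector made only of child combinators (`>`) can be resolved by walking the tree one level per step. The sketch below shows one way to do that. It is not the repository's querySelector: the name sketchQuery is hypothetical, and it assumes that the leading `root` step refers to the Document node and that nodes have the shape shown in the example AST.

```javascript
// Hypothetical sketch; not the repository's querySelector.
// Supports only tag names joined by '>' and returns the first match or null.
function matchPath(node, names) {
  if (names.length === 0) return node;
  const [name, ...rest] = names;
  for (const child of node.children || []) {
    if (child.type === 'Element' && child.tagName === name) {
      // Try this child, then fall through to later siblings if it fails.
      const found = matchPath(child, rest);
      if (found) return found;
    }
  }
  return null;
}

function sketchQuery(root, selector) {
  const names = selector.split('>').map((part) => part.trim());
  // Assumption: the first step, 'root', stands for the Document node itself.
  if (names.shift() !== 'root') return null;
  return matchPath(root, names);
}

const sampleTree = {
  type: 'Document',
  children: [
    {
      type: 'Element',
      tagName: 'html',
      children: [
        {
          type: 'Element',
          tagName: 'head',
          children: [
            {
              type: 'Element',
              tagName: 'title',
              children: [{ type: 'Text', value: 'Test' }],
            },
          ],
        },
      ],
    },
  ],
};

const titleNode = sketchQuery(sampleTree, 'root > html > head > title');
console.log(titleNode);
```

Trying later siblings when one branch fails matters when several children share a tag name, since the first one may not contain the rest of the path.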

Features

  • Tokenization of HTML Tags: The tokenizer can process standard HTML tags, both opening and closing.
  • AST Generation: Builds an abstract syntax tree from tokens, representing the HTML document structure.
  • Serialization: Converts the AST back into HTML format, preserving the document structure.
  • Query Selector: Provides a simple query selector for traversing the node tree.

About

Just a naive HTML parsing example (tokenizer, parser, and serializer).
