Dompressor Chrome Extension

A lightweight DOM compressor

Installation:

Clone into a directory:

git clone https://github.com/peterwangsc/dompressor.git

Open the Chrome extensions settings by navigating to the URL:

chrome://extensions

Enable Developer mode with the toggle in the top right corner.

Import the directory by clicking Load unpacked and selecting the dompressor folder.

Pin the extension and click its icon to open the popup.

Right-click on the popup and click Inspect to open the developer console.

Click the button in the popup, and the popup console will print the output of the compressor.

Considerations:

The output is a string of HTML elements which can be stored in an HTML file and opened in a browser.

The difference from the original page is that all irrelevant nodes are trimmed, so only the data that is useful to an LLM remains in the output.

The following are trimmed from the output:

script, noscript, meta, style, and link tags
elements with "visibility:hidden" or "display:none" styling
elements whose children consist of only one element (excluding text nodes)

The output keeps only the text nodes (wrapped in {{handlebars}} to mark them as text) and each text node's immediate parent. Every other element is left out unless it has more than one child node.
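The trimming rules above can be sketched roughly as follows. This is a minimal illustration, not the code in content.js: it runs on a hypothetical plain-object tree (`{ tag, style, children }` for elements, `{ text }` for text nodes) so that it works outside a browser, and the names `compress`, `isHidden`, and `SKIPPED_TAGS` are assumptions made for the example. It also assumes one reading of the rules: a tag is kept when it directly contains text or has more than one surviving child, and is unwrapped otherwise.

```javascript
// Hypothetical plain-object tree standing in for the DOM (an assumption of this sketch).
// Element: { tag, style?, children }   Text node: { text }
const SKIPPED_TAGS = new Set(["script", "noscript", "meta", "style", "link"]);

function isHidden(node) {
  const style = node.style || {};
  return style.display === "none" || style.visibility === "hidden";
}

function compress(node) {
  // Text nodes: wrap non-empty text in {{handlebars}}.
  if (typeof node.text === "string") {
    const text = node.text.trim();
    return text ? `{{${text}}}` : "";
  }
  if (SKIPPED_TAGS.has(node.tag) || isHidden(node)) return "";

  const children = node.children || [];
  const parts = children.map(compress).filter((part) => part !== "");
  if (parts.length === 0) return "";

  const hasDirectText = children.some(
    (child) => typeof child.text === "string" && child.text.trim() !== ""
  );
  // Keep the tag when it directly holds text or has several surviving children;
  // otherwise unwrap it so single-child wrappers disappear.
  if (hasDirectText || parts.length > 1) {
    return `<${node.tag}>${parts.join("")}</${node.tag}>`;
  }
  return parts[0];
}

const page = {
  tag: "body",
  children: [
    { tag: "script", children: [{ text: "track()" }] },
    { tag: "div", children: [{ tag: "div", children: [{ tag: "h1", children: [{ text: "Title" }] }] }] },
    { tag: "p", style: { display: "none" }, children: [{ text: "hidden" }] },
    {
      tag: "ul",
      children: [
        { tag: "li", children: [{ text: "one" }] },
        { tag: "li", children: [{ text: "two" }] },
      ],
    },
  ],
};
const compressed = compress(page);
console.log(compressed);
// <body><h1>{{Title}}</h1><ul><li>{{one}}</li><li>{{two}}</li></ul></body>
```

In this example the script tag and the hidden paragraph vanish, and the two nested div wrappers around the heading are unwrapped because each holds a single element.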

Some example outputs:

A GitHub Profile Page

https://github.com/peterwangsc/dompressor/blob/master/example/example_output-github-profile-page.html

A Wikipedia Article

https://github.com/peterwangsc/dompressor/blob/master/example/example_output-wikipedia-article.html

A Google Search

https://github.com/peterwangsc/dompressor/blob/master/example/example_output-google-search.html

Images and Iframes

Some elements live inside iframes, which have their own DOM.

To include the content of those iframes in the output, a second loop generates a separate HTML string for every iframe.

Some iframes cannot be read because of the browser's cross-origin restrictions (the same-origin policy), so they are left out of the output. Only iframes hosted on the same domain as the page are included.
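A minimal sketch of that filtering step is shown below. It relies on the standard behavior that an iframe's `contentDocument` is null for a cross-origin frame, while other access paths may throw instead. The function name `readableFrameDocuments` and the stand-in frame objects are assumptions made so the example runs outside a browser; they are not taken from the extension's source.

```javascript
// Returns the documents of the frames that can be read; cross-origin frames are skipped.
function readableFrameDocuments(frames) {
  const docs = [];
  for (const frame of Array.from(frames)) {
    try {
      // For a cross-origin frame, contentDocument is null (some access paths throw instead).
      const doc = frame.contentDocument;
      if (doc) docs.push(doc);
    } catch (err) {
      // Access denied: leave this frame out of the output.
    }
  }
  return docs;
}

// Stand-in objects so the sketch runs without a browser. In a content script the
// argument would be document.querySelectorAll("iframe").
const sameOrigin = { contentDocument: { title: "embedded form" } };
const crossOriginNull = { contentDocument: null };
const crossOriginThrows = {
  get contentDocument() {
    throw new Error("Blocked a frame from accessing a cross-origin frame");
  },
};
const frameDocs = readableFrameDocuments([sameOrigin, crossOriginNull, crossOriginThrows]);
console.log(frameDocs.length); // 1
```

Each document returned this way could then be passed through the same compression step as the top-level page.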

It took me a while to find a website with an iframe on it, but I finally stumbled upon LinkedIn, which uses an iframe of its own as well as some third-party iframes.

An images array is also added to give the LLM a list of URLs for the visible images on the page. Multimodal LLMs can use these URLs to load each image as an embedding.

The images array is sourced by first grabbing the images on the page, then searching the page for elements with background-image set to a URL.
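The two sources can be merged along these lines. This is a sketch under stated assumptions rather than the extension's actual code: `backgroundImageUrls` and `collectImageUrls` are hypothetical names, and the inputs are plain strings (image `src` values and `background-image` values such as those returned by `getComputedStyle`) so that the example runs outside a browser.

```javascript
// Pulls every url(...) target out of a CSS background-image value.
function backgroundImageUrls(cssValue) {
  const urls = [];
  const pattern = /url\(\s*(['"]?)(.*?)\1\s*\)/g;
  for (const match of cssValue.matchAll(pattern)) {
    if (match[2]) urls.push(match[2]);
  }
  return urls;
}

// Merges <img> sources with background-image URLs, dropping duplicates and empty sources.
function collectImageUrls(imgSources, backgroundValues) {
  const urls = new Set(imgSources.filter(Boolean));
  for (const value of backgroundValues) {
    for (const url of backgroundImageUrls(value)) urls.add(url);
  }
  return [...urls];
}

const images = collectImageUrls(
  ["https://example.com/logo.png", ""],
  ['url("https://example.com/hero.jpg")', "none", "url(https://example.com/logo.png)"]
);
console.log(images);
// [ 'https://example.com/logo.png', 'https://example.com/hero.jpg' ]
```

Values such as `none` or a plain gradient contain no `url(...)` and contribute nothing, which matches the rule that only elements with background-image set to a URL are counted.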

The output from these two augmentations is as expected. The iframes produce their own HTML string, and the images get scraped from the DOM's image elements as well as the elements with a background-image set to a URL.
