Installation:
Clone into a directory:
git clone https://github.com/peterwangsc/dompressor.git
Open the chrome extensions settings by navigating to the url:
chrome://extensions
Enable developer mode in the top right corner:
Import the directory by clicking on Load Unpacked:
And selecting the dompressor folder:
Pin the extension and click the icon to open the popup:
Right click on the popup and click inspect to open the developer console:
Click the button and the popup console will print the output of the compressor:
The output is a string of HTML elements which can be stored in an HTML file and opened in a browser.
The difference is that all of the irrelevant nodes are trimmed from the output so that only the useful (to LLM) data is outputted.
Some of the output that was trimmed:
- script, noscript, meta, style, and link tags
- elements with "visibility:hidden" or "display:none" styling
- elements whose children consists of only one element (excluding text nodes)
The output only renders the text nodes (wrapped in {{handlebars}} to indicate text) and leaves out everything else except each text node's immediate parent, unless the element has more than one child node.
The Github Profile Page
A Wikipedia Article
https://github.com/peterwangsc/dompressor/blob/master/example/example_output-wikipedia-article.html
A Google Search
https://github.com/peterwangsc/dompressor/blob/master/example/example_output-google-search.html
Some additional things to consider are that some elements are included inside of iframes, which have their own DOM.
In order to include the content of those iframes in the output, another loop was added to generate more HTML strings for every iframe.
Some iframes do not allow that because of CORS, so they will be left out of the output. Only iframes that are hosted on the same domain are included.
It actually took me a while to find a website with an iframe on it, but I finally stumbled upon LinkedIn which uses an iframe, as well as some other third party iframes.
The images array of URLs is also added, to give the LLM a list of URLs to the visible images on the page. LLMs that have multimodality can use these URLs to load the image as an embedding.
The images array is sourced by first grabbing the images on the page, then searching the page for elements with background-image set to a URL.
The output from these two augmentations is as expected. The iframes produce their own HTML string, and the images get scraped from the DOM's image elements as well as the elements with a background-image set to a URL.