Feat: images in fetched RAG website documents #2658

alexfromapex · 2024-05-30T02:21:58Z


Is your feature request related to a problem? Please describe.
When using # to load a website as a document, it should pull the full HTML of the website (or be configurable to download images as separate documents, etc.). Especially with the rise of multi-modal models like llava, this would enable more complex and extremely useful behaviors. E.g., I want to pull the HTML for this website about math problems and then have the model replace the <img /> tags with the content of their alt attributes, which contain the LaTeX definitions that the images represent.

Describe the solution you'd like
When using # to fetch websites as documents, the full HTML for the website URL should be pulled, and/or maybe the images could be fetched as separate documents (with a setting).

Describe alternatives you've considered
I've tried manually copying and pasting the HTML into the chat but the context length is too short and the models aren't picking up on anything except the last few bits of HTML.

Additional context
This website https://artofproblemsolving.com/wiki/index.php/2024_AIME_I_Problems contains some math problems. I'd like to fetch the HTML and then parse out the math problems (including the LaTeX in the image alt attributes):

<img src="//latex.artofproblemsolving.com/a/9/e/a9e826e68f4134acde4bc1d430a580e0e3649cff.png" class="latex" alt="$s+\frac12$" style="vertical-align: -13px" width="46" height="38">

From which the model could then extract the LaTeX portion:

$s+\frac12$

There are probably a lot of clever ways this could be done, but it's a really useful and interesting use case.
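
To make the request concrete, here is a minimal sketch of the alt-text substitution described above, done as a preprocessing step rather than by the model. It uses only Python's standard `html.parser`; the names `AltTextExtractor` and `html_to_text_with_alt` are hypothetical and are not part of this project's code. It assumes the page HTML has already been fetched, and it is only one of many ways this could work.

```python
from html.parser import HTMLParser


class AltTextExtractor(HTMLParser):
    """Collect page text, substituting each <img> with its alt attribute.

    Hypothetical helper for illustration; not an existing API of this project.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []   # text fragments in document order
        self._skip = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "img":
            # attrs is a list of (name, value) pairs; value may be None
            alt = dict(attrs).get("alt")
            if alt:
                self.parts.append(alt)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def html_to_text_with_alt(html: str) -> str:
    """Return the text of `html` with images replaced by their alt text."""
    parser = AltTextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


if __name__ == "__main__":
    # The <img> tag quoted in the post above.
    tag = (
        r'<img src="//latex.artofproblemsolving.com/a/9/e/'
        r'a9e826e68f4134acde4bc1d430a580e0e3649cff.png" class="latex" '
        r'alt="$s+\frac12$" style="vertical-align: -13px" '
        r'width="46" height="38">'
    )
    print(html_to_text_with_alt(tag))
```

Doing the substitution before the document is chunked would also shrink the text considerably, which might help with the context-length problem mentioned above. Images without an alt attribute are simply dropped in this sketch; fetching them as separate documents for a multi-modal model would need additional handling.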
