Better Automatic Token Reduction #49

Open
jamesturk opened this issue Jun 6, 2023 · 1 comment
Labels
planned enhancement New feature or request

@jamesturk
Owner

To get this working in more places, more experimentation with token reduction is needed. How stripped down/minified can we get the HTML without causing reliability issues?

This isn't as straightforward as it seems, and many off-the-shelf tools are focused on different problems:

  • Minifiers seem to confuse GPT-4 a fair bit, so using off-the-shelf obfuscators/minifiers isn't the right solution here.
  • A lot of tools exist to sanitize HTML, but they often remove class names and similar attributes that are important to keep as hints (and that will be important if we get to the point of generating XPath).

It seems like the right approach is going to be an allow/disallow-list-based approach that extends what's already been done with lxml.html.clean.
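As a rough illustration of the allow/disallow idea, here is a minimal sketch that uses only the standard library's html.parser rather than lxml, so it is a stand-in and not the project's implementation. The names `DROP_TAGS`, `KEEP_ATTRS`, `Reducer`, and `reduce_html` are hypothetical, and the lists themselves are guesses that would need the experimentation described above.

```python
import re
from html import escape
from html.parser import HTMLParser

# Hypothetical lists for illustration only.
# DROP_TAGS must not contain void tags (they never get a closing tag).
DROP_TAGS = {"script", "style", "svg", "noscript"}  # removed along with their contents
KEEP_ATTRS = {"class", "id", "href"}                # kept as hints for the model / XPath
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Reducer(HTMLParser):
    """Re-emits HTML, dropping disallowed tags and non-allowed attributes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a dropped tag

    def handle_starttag(self, tag, attrs):
        if tag in DROP_TAGS:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        kept = "".join(
            f' {name}="{escape(value, quote=True)}"'
            for name, value in attrs
            if name in KEEP_ATTRS and value is not None
        )
        self.out.append(f"<{tag}{kept}>")

    def handle_endtag(self, tag):
        if tag in DROP_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if not self.skip_depth and tag not in VOID_TAGS:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth:
            return
        # Simplification: collapse whitespace runs and drop whitespace-only text.
        text = re.sub(r"\s+", " ", data)
        if text.strip():
            self.out.append(escape(text, quote=False))


def reduce_html(markup: str) -> str:
    parser = Reducer()
    parser.feed(markup)
    parser.close()
    return "".join(parser.out)
```

Container tags are never removed here, only attributes and whole disallowed subtrees, so class names and ids survive as hints. Dropping a subtree can still shift sibling positions for anything that follows it.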

@jamesturk jamesturk changed the title Automatic Token Reduction Better Automatic Token Reduction Jun 6, 2023
@jamesturk jamesturk added the planned enhancement New feature or request label Jun 6, 2023
@jamesturk jamesturk added this to the 0.6.0 milestone Jun 7, 2023
@jamesturk
Owner Author

Leaving myself a note that it is probably desirable to have a mode that does not modify the page structure (i.e. no deletion of container tags), so that XPath can remain valid if we go down that route. This doesn't have to be a hard and fast rule, but ideally it should be configurable.

This means that, in such a mode, nested containers like these can't be simplified:

<div>
   <div>
   <div>
     Content
   </div>
   </div>
</div>
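To make the trade-off concrete, here is a minimal sketch of what a configurable toggle might look like. It assumes well-formed, XML-like input so that the standard library's xml.etree.ElementTree can parse it (real-world HTML would need a more forgiving parser), and the names `reduce_tree`, `collapse`, `is_bare_wrapper`, and `preserve_structure` are hypothetical.

```python
import copy
import xml.etree.ElementTree as ET


def is_bare_wrapper(el):
    """True for an attribute-less element whose only content is one child element."""
    if len(el) != 1 or el.attrib:
        return False
    own_text = (el.text or "").strip()
    trailing = (el[0].tail or "").strip()
    return not own_text and not trailing


def collapse(el):
    """Collapse children first, then hoist a lone child over its bare wrapper."""
    for i, child in enumerate(list(el)):
        new_child = collapse(child)
        if new_child is not child:
            new_child.tail = child.tail
            el[i] = new_child
    if is_bare_wrapper(el):
        return el[0]
    return el


def reduce_tree(root, preserve_structure=True):
    """Return a reduced copy; with preserve_structure=True containers are left alone."""
    root = copy.deepcopy(root)
    if preserve_structure:
        return root
    return collapse(root)


doc = ET.fromstring(
    "<div>\n  <div>\n    <div>\n      Content\n    </div>\n  </div>\n</div>"
)
kept = reduce_tree(doc, preserve_structure=True)
flat = reduce_tree(doc, preserve_structure=False)

# The path ./div/div still resolves in the structure-preserving copy...
print(kept.find("./div/div") is not None)
# ...but no longer matches once the wrappers are collapsed.
print(flat.find("./div/div") is None)
```

Collapsing saves tokens but breaks any XPath computed against the original page, which is the reason for keeping the behavior behind a flag.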

Counts should be avoided in generated XPath at almost any cost (e.g. never generate //table[4]), but they could run into similar issues. Maybe a second toggle for this?
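One way to avoid counts is to build each path step from attribute hints instead of sibling positions. The sketch below assumes well-formed input parsed with xml.etree.ElementTree and uses the hypothetical helpers `step_for` and `xpath_for`. It does not resolve ambiguity when siblings share the same tag and attributes, and it does not escape quotes in attribute values.

```python
import xml.etree.ElementTree as ET


def step_for(el):
    """Describe one element by id or class, never by position."""
    if "id" in el.attrib:
        return f"{el.tag}[@id='{el.attrib['id']}']"
    if "class" in el.attrib:
        return f"{el.tag}[@class='{el.attrib['class']}']"
    return el.tag


def xpath_for(root, target):
    """Build a relative path from root to target without any [n] predicates."""
    if target is root:
        return "."
    # ElementTree has no parent pointers, so build a child -> parent map.
    parents = {child: parent for parent in root.iter() for child in parent}
    steps = []
    node = target
    while node is not root:
        steps.append(step_for(node))
        node = parents[node]
    return "./" + "/".join(reversed(steps))


page = ET.fromstring(
    "<body><table class='nav'><tr><td>Home</td></tr></table>"
    "<table class='prices'><tr><td class='amount'>42</td></tr></table></body>"
)
cell = page.findall(".//td")[1]
print(xpath_for(page, cell))
```

Because the path relies on class names rather than positions, it keeps working if an unrelated table is removed during reduction, which is also why those attributes need to survive any cleaning step.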
