
Hybrid Mode: ask scrapeghost to write selectors #7

Open
jamesturk opened this issue Mar 19, 2023 · 1 comment

jamesturk (Owner) commented Mar 19, 2023

See FAQ: https://jamesturk.github.io/scrapeghost/faq/#why-not-ask-the-scraper-to-write-css-xpath-selectors

There's an alternate version of the long-page scraper that could generate extraction selectors and then apply them client-side. Would be a huge cost savings for simple list pages. I'm exploring ideas related to this and will start posting updates on it soon.
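As a rough back-of-envelope sketch of that savings (all numbers here are hypothetical, purely to illustrate the shape of the difference):

# Hypothetical numbers only; actual token counts and prices will vary.
pages = 1000
tokens_per_page = 8_000
price_per_1k_tokens = 0.002  # hypothetical rate

# sending every page through the LLM
per_page_llm_cost = pages * tokens_per_page / 1000 * price_per_1k_tokens  # $16.00
# one selector-writing call, then free client-side extraction
hybrid_cost = 1 * tokens_per_page / 1000 * price_per_1k_tokens            # $0.016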

jamesturk added the idea label Mar 20, 2023
jamesturk added this to the 0.5.0 milestone Mar 22, 2023
jamesturk changed the title from "hybrid mode" to "Hybrid Mode: ask scrapeghost to write selectors" Mar 27, 2023
jamesturk removed this from the 0.5.0 milestone Jun 6, 2023
jamesturk (Owner, Author) commented:

Revisiting this after a few attempts that didn't go anywhere.

Generating XPath (or CSS) selectors just isn't as robust as extracting the data directly, so most variations of this approach introduce a whole host of problems.

In the current request to GPT, the full HTML (modulo cleaning) is sent along with a JSON schema. This is usually enough to extract the data from the HTML without examples, because the field names carry enough semantic meaning (first name, last name, etc.) for GPT to make the extraction call.
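For reference, a sketch of that schema-driven approach; the SchemaScraper call mirrors the documented usage, but the field names and URL here are illustrative:

# Sketch of the current approach; field names and URL are hypothetical.
from scrapeghost import SchemaScraper

scraper = SchemaScraper(
    schema={"first_name": "string", "last_name": "string", "url": "url"}
)
resp = scraper("https://example.com/legislators/1")
print(resp.data)  # {"first_name": "...", "last_name": "...", "url": "..."}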

For list pages, it's common to wind up with something like this:

for row in tree.xpath(DIV_OR_TR_XPATH):
    item = {
        "field": row.xpath(FIELD_SUB_XPATH),
        "field2": row.xpath(FIELD2_SUB_XPATH),
    }

To make this work on list pages, we need to send a representative sample of the page (the entire page is often too long) and then ask GPT for DIV_OR_TR_XPATH, FIELD_SUB_XPATH, FIELD2_SUB_XPATH, etc.

We don't want each field's XPath to include the parent (e.g. we don't want //tr[3]/td[@id='xyz'], we just want ./td[@id='xyz']).
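For illustration, a minimal runnable sketch of that loop using lxml; the HTML and selector values are hypothetical stand-ins for what GPT would return:

# Minimal sketch of the hybrid loop; HTML and selectors are hypothetical,
# standing in for GPT-returned values.
import lxml.html

html = """<table>
  <tr><td class="name">Anna</td><td class="title">Mayor</td></tr>
  <tr><td class="name">Bea</td><td class="title">Clerk</td></tr>
</table>"""

CONTAINER_XPATH = "//tr"
FIELD_XPATHS = {
    "field": "./td[@class='name']",    # relative to the container node
    "field2": "./td[@class='title']",
}

tree = lxml.html.fromstring(html)
items = [
    {name: row.xpath(xp)[0].text_content() for name, xp in FIELD_XPATHS.items()}
    for row in tree.xpath(CONTAINER_XPATH)
]
print(items)  # [{'field': 'Anna', 'field2': 'Mayor'}, {'field': 'Bea', 'field2': 'Clerk'}]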

The prompt needs to be quite different.

Ideas:

  • template the code above and ask GPT to fill in the gaps (see the sketch after this list)
  • require some example data to be sent with the request (e.g. the first and third item) and use it to help GPT understand what to grab
  • first pass: request the "container" XPath (how to word this?) / second pass: given just the container's HTML, ask which XPath selects each field
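For the first idea, the prompt might look something like this sketch; the wording and response format are open questions, this is just one possible shape:

# Hypothetical prompt for the template-filling idea; wording is illustrative.
PROMPT_TEMPLATE = """Below is a sample of an HTML list page and a scraping
template. Fill in the XPath constants. Field XPaths must be relative to
the container node (start with './'), never absolute.

HTML sample:
{html_sample}

Template:
for row in tree.xpath(DIV_OR_TR_XPATH):
    item = {{
        "field": row.xpath(FIELD_SUB_XPATH),
        "field2": row.xpath(FIELD2_SUB_XPATH),
    }}

Respond with JSON:
{{"DIV_OR_TR_XPATH": "...", "FIELD_SUB_XPATH": "...", "FIELD2_SUB_XPATH": "..."}}
"""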

Potential Uses:

  • huge pages where only the first chunk fits within the context window (but cleaned HTML needs care here, since cleaning will modify the XPaths)
  • pagination, where the XPath can be determined from the first page and then reused over and over (see the sketch after this list)
  • the same could be true of templated pages, but those run into the issue that not all data is XPath-able; with list pages, at least, the general result is a list of URLs
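And a sketch of the pagination use, where one selector-writing call on page one pays for all subsequent pages; the URLs, selectors, and helper here are hypothetical:

# Pagination sketch: selectors come from a single GPT call on page 1,
# then get reused client-side. All names and URLs are hypothetical.
import lxml.html
import requests

def scrape_page(url, container_xpath, field_xpaths):
    tree = lxml.html.fromstring(requests.get(url).text)
    return [
        {f: row.xpath(xp)[0].text_content() for f, xp in field_xpaths.items()}
        for row in tree.xpath(container_xpath)
    ]

# pretend these came from one GPT call on page 1
container, fields = "//tr[@class='item']", {"name": "./td[1]"}

results = []
for n in range(1, 11):
    results += scrape_page(f"https://example.com/list?page={n}", container, fields)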
