
feat: expose ScrapeHTML function in API #3

Merged
merged 2 commits into kkyr:main on Jan 7, 2023

Conversation

hay-kot
Contributor

@hay-kot hay-kot commented Jan 3, 2023

This PR adds the ability to pass the HTML body via an io.Reader to a new function ScrapeHTML.

No business logic was changed; I've just moved the parts that work with the body into their own function, which is now called from the ScrapeFrom function.
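
Roughly, the shape of the change looks like this (a minimal sketch; the Recipe type, the scrapeBody helper, and the error handling are stand-ins, not the package's actual code):

```go
package scraper

import (
	"io"
	"net/http"
)

// Recipe stands in for the package's real result type.
type Recipe struct{}

// scrapeBody is a placeholder for the existing body-handling logic
// that this PR factors out of ScrapeFrom.
func scrapeBody(body io.Reader, urlStr string) (*Recipe, error) {
	_, _ = body, urlStr // the real implementation parses the body here
	return &Recipe{}, nil
}

// ScrapeHTML exposes the factored-out logic to callers that already
// have the HTML body in hand.
func ScrapeHTML(body io.Reader, urlStr string) (*Recipe, error) {
	return scrapeBody(body, urlStr)
}

// ScrapeFrom behaves as before: fetch over HTTP, then delegate,
// so no logic is duplicated.
func ScrapeFrom(urlStr string) (*Recipe, error) {
	res, err := http.Get(urlStr)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return ScrapeHTML(res.Body, urlStr)
}
```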

@kkyr
Owner

kkyr commented Jan 3, 2023

Thanks for the PR; however, I'm hesitant to increase the surface of the public API unless there's a good reason to, because doing so is typically not a reversible decision.

What's the use case for providing the HTML body directly? Would providing a custom http client instead solve your use case?

@hay-kot
Contributor Author

hay-kot commented Jan 3, 2023

What's the use case for providing the HTML body directly? Would providing a custom http client instead solve your use case?

No, there are two specific cases this does not cover.

  1. Fetching data en masse and feeding it to the "scraper" in a different process. Imagine a data pipeline, or a background service where one process makes the HTTP requests and another does the data processing.
  2. When a page is restricted behind a paywall, there is no way to feed the HTML in by hand. A paying user can't provide a raw HTML file to be processed later (see the sketch after this list).
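
For the second case, usage could look something like this (hypothetical; the import path, file name, and URL are placeholders):

```go
package main

import (
	"fmt"
	"log"
	"os"

	scraper "example.com/scraper" // placeholder import path
)

func main() {
	// A page the user saved from their browser while logged in.
	f, err := os.Open("saved-recipe.html")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The URL is still passed so the matching site scraper can be
	// chosen, but no network call is made.
	recipe, err := scraper.ScrapeHTML(f, "https://example.com/recipes/123")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", recipe)
}
```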

Thanks for the PR; however, I'm hesitant to increase the surface of the public API unless there's a good reason to, because doing so is typically not a reversible decision.

I appreciate the apprehension about extending the API, but I feel that the current API is missing a few things:

  • Can't provide an http.Client; you have to use the package singleton instead.
  • Can't provide data to the API in any way; it MUST be retrieved via a network call.
  • Can't access the underlying schema data, which makes it difficult to troubleshoot issues that arise and to upstream fixes when they're encountered.

I know I'm going to need those three things to move forward with what I'm building, so I was planning to upstream those changes. I absolutely understand if you don't want to expose and maintain those APIs through this package, though.

For what it's worth, the recipe-scrapers library does expose that functionality. 🤷

@kkyr
Owner

kkyr commented Jan 3, 2023

Can't provide an http.Client; you have to use the package singleton instead.

Should be resolved by providing functional options to the API.
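
i.e. something like the usual functional-options pattern (a sketch; none of these names exist in the package today):

```go
package scraper

import (
	"io"
	"net/http"
)

type options struct {
	client *http.Client
}

// Option configures a single scrape call.
type Option func(*options)

// WithHTTPClient lets the caller supply their own http.Client instead
// of the package-level default.
func WithHTTPClient(c *http.Client) Option {
	return func(o *options) { o.client = c }
}

// fetch shows how ScrapeFrom could thread the options through to the
// network call.
func fetch(urlStr string, opts ...Option) (io.ReadCloser, error) {
	o := options{client: http.DefaultClient}
	for _, opt := range opts {
		opt(&o)
	}
	res, err := o.client.Get(urlStr)
	if err != nil {
		return nil, err
	}
	return res.Body, nil
}
```

A caller could then write something like ScrapeFrom(url, WithHTTPClient(myClient)) without touching any package-level state.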

Can't provide data to the API in any way; it MUST be retrieved via a network call.

Fair call given the use cases you provided.

Can't access the underlying schema data, which makes it difficult to troubleshoot issues that arise and to upstream fixes when they're encountered.

You mean having access to the raw ld+json data?

I agree that providing a mechanism for clients to avoid the network call is something the API should offer. Let me think on this and get back to you.

@hay-kot
Contributor Author

hay-kot commented Jan 5, 2023

You mean having access to the raw ld+json data?

Yeah, some underlying access to an untyped map[string]interface{} for debugging errors and analysis. I don't know if that's worth exposing by default, given that it would require some allocations that some users may not need.
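
Something like this, purely for illustration (neither the field nor the accessor exists today):

```go
package scraper

// Recipe stands in for the real result type; the raw field and the
// RawSchema accessor are hypothetical.
type Recipe struct {
	raw map[string]interface{} // decoded ld+json, kept around for debugging
}

// RawSchema exposes the untyped schema data for troubleshooting.
func (r *Recipe) RawSchema() map[string]interface{} {
	return r.raw
}
```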

@kkyr
Owner

kkyr commented Jan 5, 2023

Yeah, I'm not sure that's something we want to expose, at least in that form. Some websites might not be scraped using ld+json data, instead relying on good old scraping of DOM elements, in which case the map[string]interface{} would be empty and misleading.

I do see, however, the issue of not having visibility into what goes wrong, and I can think of a couple of alternative ways to alleviate this:

  • Expose and return custom errors that clients can match with errors.Is() (sketched after this list).
  • Instrument the code with log lines behind a NoOpLogger by default, and optionally allow clients to provide their own Logger implementation.
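
For the first option, that could look like this (a sketch; the error names are made up):

```go
package scraper

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors the package could export.
var (
	ErrNoScraper = errors.New("scraper: no scraper registered for host")
	ErrParse     = errors.New("scraper: could not parse recipe data")
)

// lookUp shows how an internal failure would wrap a sentinel so
// callers can still match on it.
func lookUp(host string) error {
	return fmt.Errorf("scraping %q: %w", host, ErrNoScraper)
}
```

A client could then branch on errors.Is(err, scraper.ErrNoScraper) rather than string-matching error messages.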

In terms of your PR, I think what you proposed is fine; however, I would also rename ScrapeFrom to ScrapeURL, so that the API is more descriptive when read at a glance:

  • ScrapeURL
  • ScrapeHTML

If you make this change I'd be happy to merge it - but please also update doc.go and README.md with the new API, including a note that ScrapeHTML takes a urlStr parameter in order to find the corresponding scraper.
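
For reference, the resulting surface would read as follows (a signature listing only; the exact signatures are assumed from this thread):

```go
// ScrapeURL fetches urlStr over HTTP and scrapes the response body.
func ScrapeURL(urlStr string) (*Recipe, error)

// ScrapeHTML scrapes an already-fetched body; urlStr is used only to
// select the matching site scraper, so no network call is made.
func ScrapeHTML(body io.Reader, urlStr string) (*Recipe, error)
```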

@hay-kot
Contributor Author

hay-kot commented Jan 6, 2023

In terms of your PR, I think what you proposed is fine; however, I would also rename ScrapeFrom to ScrapeURL, so that the API is more descriptive when read at a glance:

  • ScrapeURL
  • ScrapeHTML

If you make this change I'd be happy to merge it - but please also update doc.go and README.md with the new API, including a note that ScrapeHTML takes a urlStr parameter in order to find the corresponding scraper.

Should be good now, thanks!

@kkyr merged commit 5758ce1 into kkyr:main on Jan 7, 2023