
feat: expose ScrapeHTML function in API #3

Merged
merged 2 commits into kkyr:main on Jan 7, 2023

Conversation

hay-kot
Contributor

@hay-kot hay-kot commented Jan 3, 2023

This PR adds the ability to pass the HTML body via an io.Reader to a new function ScrapeHTML.

No business logic was changed; I've just moved the parts that work with the body into their own function, which is now called from the ScrapeFrom function.
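
Roughly, the shape of the change looks like this (a minimal sketch; the Recipe type, the scrapeBody helper, and the error handling are stand-ins, not the package's actual code):

```go
package scraper

import (
	"io"
	"net/http"
)

// Recipe stands in for the package's real result type.
type Recipe struct{}

// scrapeBody is a placeholder for the existing body-handling logic
// that this PR factors out of ScrapeFrom.
func scrapeBody(body io.Reader, urlStr string) (*Recipe, error) {
	_, _ = body, urlStr // the real implementation parses the body here
	return &Recipe{}, nil
}

// ScrapeHTML exposes the factored-out logic to callers that already
// have the HTML body in hand.
func ScrapeHTML(body io.Reader, urlStr string) (*Recipe, error) {
	return scrapeBody(body, urlStr)
}

// ScrapeFrom behaves as before: fetch over HTTP, then delegate,
// so no logic is duplicated.
func ScrapeFrom(urlStr string) (*Recipe, error) {
	res, err := http.Get(urlStr)
	if err != nil {
		return nil, err
	}
	defer res.Body.Close()
	return ScrapeHTML(res.Body, urlStr)
}
```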

@kkyr
Owner

kkyr commented Jan 3, 2023

Thanks for the PR; however, I'm hesitant to increase the surface of the public API unless there's a good reason to, because doing so is typically not a reversible decision.

What's the use case for providing the HTML body directly? Would providing a custom http client instead solve your use case?

@hay-kot
Contributor Author

hay-kot commented Jan 3, 2023

What's the use case for providing the HTML body directly? Would providing a custom http client instead solve your use case?

No, there are two specific cases this does not cover.

  1. Fetching data en masse and feeding it to the "scraper" in a different process. Imagine a data pipeline, or a background service where one process makes the HTTP requests and another does the data processing.
  2. When a page is restricted behind a paywall, there is no way to feed the HTML in by hand. A paying user can't provide a raw HTML file to be processed later (see the sketch after this list).
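
For the second case, usage could look something like this (hypothetical; the import path, file name, and URL are placeholders):

```go
package main

import (
	"fmt"
	"log"
	"os"

	scraper "example.com/scraper" // placeholder import path
)

func main() {
	// A page the user saved from their browser while logged in.
	f, err := os.Open("saved-recipe.html")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	// The URL is still passed so the matching site scraper can be
	// chosen, but no network call is made.
	recipe, err := scraper.ScrapeHTML(f, "https://example.com/recipes/123")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%+v\n", recipe)
}
```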

Thanks for the PR; however, I'm hesitant to increase the surface of the public API unless there's a good reason to, because doing so is typically not a reversible decision.

I appreciate the apprehension about extending the API, but I feel that the current API is missing a few things:

  • Can't provide an http.Client; you have to use the package singleton instead.
  • Can't provide data to the API in any way; it MUST be retrieved via a network call.
  • Can't access the underlying schema data, which makes it difficult to troubleshoot issues that arise and to upstream fixes when they're encountered.

I know I'm going to need those three things to move forward with what I'm building, so I was planning to upstream those changes. I absolutely understand if you don't want to expose and maintain those APIs through this package, though.

For what it's worth, the recipe-scrapers library does expose that functionality. 🤷

@kkyr
Owner

kkyr commented Jan 3, 2023

Can't provide an http.Client; you have to use the package singleton instead.

Should be resolved by providing functional options to the API.
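
i.e. something like the usual functional-options pattern (a sketch; none of these names exist in the package today):

```go
package scraper

import (
	"io"
	"net/http"
)

type options struct {
	client *http.Client
}

// Option configures a single scrape call.
type Option func(*options)

// WithHTTPClient lets the caller supply their own http.Client instead
// of the package-level default.
func WithHTTPClient(c *http.Client) Option {
	return func(o *options) { o.client = c }
}

// fetch shows how ScrapeFrom could thread the options through to the
// network call.
func fetch(urlStr string, opts ...Option) (io.ReadCloser, error) {
	o := options{client: http.DefaultClient}
	for _, opt := range opts {
		opt(&o)
	}
	res, err := o.client.Get(urlStr)
	if err != nil {
		return nil, err
	}
	return res.Body, nil
}
```

A caller could then write something like ScrapeFrom(url, WithHTTPClient(myClient)) without touching any package-level state.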

Can't provide data to the API in any way; it MUST be retrieved via a network call.

Fair call given the use cases you provided.

Can't access the underlying schema data, which makes it difficult to troubleshoot issues that arise and to upstream fixes when they're encountered.

You mean having access to the raw ld+json data?

I agree that providing a mechanism for clients to avoid the network call is something the API should offer. Let me think on this and get back to you.

@hay-kot
Contributor Author

hay-kot commented Jan 5, 2023

You mean having access to the raw ld+json data?

Yeah, some underlying access to an untyped map[string]interface{} for debugging errors and analysis. I don't know if that's worth exposing by default, given that it would require some allocations that some users may not need.
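
Something like this, purely for illustration (neither the field nor the accessor exists today):

```go
package scraper

// Recipe stands in for the real result type; the raw field and the
// RawSchema accessor are hypothetical.
type Recipe struct {
	raw map[string]interface{} // decoded ld+json, kept around for debugging
}

// RawSchema exposes the untyped schema data for troubleshooting.
func (r *Recipe) RawSchema() map[string]interface{} {
	return r.raw
}
```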

@kkyr
Owner

kkyr commented Jan 5, 2023

Yeah, I'm not sure that's something we want to expose, at least in that form. Some websites might not be scraped using ld+json data, instead relying on good old scraping of DOM elements, in which case the map[string]interface{} would be empty and misleading.

I do see, however, the issue of not having visibility into what goes wrong, and I can think of a couple of alternative ways to alleviate this:

  • Expose and return custom errors that clients can match with errors.Is() (sketched after this list).
  • Instrument the code with log lines behind a NoOpLogger by default, and optionally allow clients to provide their own Logger implementation.
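
For the first option, that could look like this (a sketch; the error names are made up):

```go
package scraper

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors the package could export.
var (
	ErrNoScraper = errors.New("scraper: no scraper registered for host")
	ErrParse     = errors.New("scraper: could not parse recipe data")
)

// lookUp shows how an internal failure would wrap a sentinel so
// callers can still match on it.
func lookUp(host string) error {
	return fmt.Errorf("scraping %q: %w", host, ErrNoScraper)
}
```

A client could then branch on errors.Is(err, scraper.ErrNoScraper) rather than string-matching error messages.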

In terms of your PR, I think what you proposed is fine; however, I would also rename ScrapeFrom to ScrapeURL, so that the API is more descriptive when read at a glance:

  • ScrapeURL
  • ScrapeHTML

If you make this change I'd be happy to merge it - but please also update doc.go and README.md with the new API, including a note that ScrapeHTML takes a urlStr parameter in order to find the corresponding scraper.
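
For reference, the resulting surface would read as follows (a signature listing only; the exact signatures are assumed from this thread):

```go
// ScrapeURL fetches urlStr over HTTP and scrapes the response body.
func ScrapeURL(urlStr string) (*Recipe, error)

// ScrapeHTML scrapes an already-fetched body; urlStr is used only to
// select the matching site scraper, so no network call is made.
func ScrapeHTML(body io.Reader, urlStr string) (*Recipe, error)
```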

@hay-kot
Contributor Author

hay-kot commented Jan 6, 2023

In terms of your PR, I think what you proposed is fine; however, I would also rename ScrapeFrom to ScrapeURL, so that the API is more descriptive when read at a glance:

  • ScrapeURL
  • ScrapeHTML

If you make this change I'd be happy to merge it - but please also update doc.go and README.md with the new API, including a note that ScrapeHTML takes a urlStr parameter in order to find the corresponding scraper.

Should be good now, thanks!

@kkyr merged commit 5758ce1 into kkyr:main on Jan 7, 2023