Scraper request: bergamot.app #986

Open
josefhelie opened this issue Jan 15, 2024 · 21 comments · May be fixed by #1064

Comments

@josefhelie

I'm currently using the free app Bergamot (which is closed source) to store my recipes, but I'd like to move to Mealie. I've encountered an error message that says, 'recipe_scrapers was unable to scrape this URL.' Is it possible to get a scraper, please? 😇
Thanks for your help.
A link to a shared recipe: https://dashboard.bergamot.app/shared/T8IJLjbtHdh2pj

@jayaddison
Collaborator

Hi @josefhelie - thanks for the question / feature request.

In theory, yes this is possible - the webpage is public and represents a recipe. However, there are some potentially important items of information absent on the page: in particular, its origin (from another website? self-authored?) and the instructions.

Do you know whether those details can be included when sharing a recipe like this from the app? It's difficult to develop and test without a few complete samples.

@josefhelie
Author

I'm sorry, I shared a recipe that doesn't include all the requested fields. Here is a better example: https://dashboard.bergamot.app/shared/mIB4jYQtZU1A97
Is it better?

@jayaddison
Collaborator

Yep, that initially looks good to me @josefhelie - it's difficult to say for certain without coding it up, but it seems to have most/all of the information we'd need. Thanks!

jayaddison changed the title from "Is it possible to design a scrapper for Bergamot?" to "Scraper request: bergamot.app" on Jan 20, 2024
@josefhelie
Author

Thanks a lot @jayaddison :)

@josefhelie
Author

josefhelie commented Apr 3, 2024

May I ask if there is any update on this request, @jayaddison?
thanks :)

@jayaddison
Collaborator

Hi @josefhelie - apologies for my delayed reply. No further updates on this at the moment I'm afraid. Do you have any interest in learning some Python coding?

@mlduff
Contributor

mlduff commented Apr 15, 2024

@jayaddison I took a look; it seems fairly easy to call the API endpoint, which can be derived from the URL of the recipe.
For https://dashboard.bergamot.app/shared/mIB4jYQtZU1A97 the associated API endpoint is https://api.bergamot.app/recipes/shared?r=mIB4jYQtZU1A97.

I'm not sure how the library normally supports the case of recipes being loaded via an API call after the original page load - I can see a few examples (goustojson.py, monsieurcuisine.py) that seem to do this - I would be happy to tackle this if you are happy for me to do so?
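
For illustration, the URL-to-endpoint mapping described above could be sketched roughly as follows. This is only a sketch based on the single example pair in this comment; the function name `shared_url_to_api_url` is hypothetical, and it assumes every shared recipe URL has the form `/shared/<id>`.

```python
from urllib.parse import urlencode, urlparse

# Endpoint taken from the example in this thread.
API_BASE = "https://api.bergamot.app/recipes/shared"


def shared_url_to_api_url(url: str) -> str:
    """Derive the API endpoint from a shared-recipe URL (hypothetical helper)."""
    # Split the path into non-empty segments, e.g. ["shared", "mIB4jYQtZU1A97"].
    parts = [segment for segment in urlparse(url).path.split("/") if segment]
    if len(parts) != 2 or parts[0] != "shared":
        raise ValueError(f"not a shared recipe URL: {url}")
    # The recipe identifier is passed as the `r` query parameter.
    return f"{API_BASE}?{urlencode({'r': parts[1]})}"


api_url = shared_url_to_api_url("https://dashboard.bergamot.app/shared/mIB4jYQtZU1A97")
```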

@jayaddison
Collaborator

Thanks @mlduff!

> I'm not sure how the library normally supports the case of recipes being loaded via an API call after the original page load - I can see a few examples (goustojson.py, monsieurcuisine.py) that seem to do this - I would be happy to tackle this if you are happy for me to do so?

About the handling of APIs: yep, well discovered - we do have a few scrapers that retrieve data using APIs at the moment. A potential design/architecture problem is that this currently couples the scraper tightly to an HTTP client, namely requests. It is nearly the de facto client for Python, but even so, it may not be ideal to depend on it entirely.

Meanwhile we have a v15 development branch that can optionally use requests, but that otherwise requires callers to retrieve the HTML and pass it to the scraper themselves. Marginally less convenient, but allowing callers to use whatever HTTP client(s) they prefer (anything from built-in urlopen, low-level urllib3, requests, httpx, etc).

A long explanation, but the short answer is: yep, please go ahead, but be aware that this would currently only be supported in the v14 / mainline branch.
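
As a rough sketch of the decoupling idea described above (not the library's actual v15 API), the caller can supply the function that performs the HTTP request, so the parsing side never imports a specific client. The names `Fetcher`, `PageInput`, `fetch_with_urlopen` and `build_page_input` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Callable
from urllib.request import urlopen

# Any callable that maps a URL to the page text can act as the HTTP client.
Fetcher = Callable[[str], str]


@dataclass
class PageInput:
    """What a scraper would receive: the URL plus HTML fetched by the caller."""
    url: str
    html: str


def fetch_with_urlopen(url: str) -> str:
    """Default fetcher using only the standard library."""
    with urlopen(url, timeout=10) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset)


def build_page_input(url: str, fetch: Fetcher = fetch_with_urlopen) -> PageInput:
    # The fetcher is injected, so requests, httpx, urllib3, etc. can be swapped in.
    return PageInput(url=url, html=fetch(url))


# Usage with a stand-in fetcher (no network access needed):
page = build_page_input("https://example.org/recipe", fetch=lambda url: "<html>stub</html>")
```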

@jayaddison
Collaborator

@mlduff also a design / implementation question for your consideration: those recipes sometimes contain a link to the original source of the recipe. Should we return that as the canonical URL for recipes when possible?

@mlduff
Contributor

mlduff commented Apr 15, 2024

> Meanwhile we have a v15 development branch that can optionally use requests, but that otherwise requires callers to retrieve the HTML and pass it to the scraper themselves. Marginally less convenient, but allowing callers to use whatever HTTP client(s) they prefer (anything from built-in urlopen, low-level urllib3, requests, httpx, etc).

@jayaddison is your preference for me to develop this in the v15 branch? If I implement it in v14 (which seems easier), will it need rewriting at some point? And will the other API-based scrapers, like the examples I found, need similar rewriting?

@mlduff
Contributor

mlduff commented Apr 15, 2024

> @mlduff also a design / implementation question for your consideration: those recipes sometimes contain a link to the original source of the recipe. Should we return that as the canonical URL for recipes when possible?

Good point, will try to do that.
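
One possible shape for that fallback, as a sketch only: the field name `sourceUrl` is an assumption made for this example (the actual key in the Bergamot API response is not confirmed in this thread), and `canonical_url` is a hypothetical helper.

```python
def canonical_url(recipe: dict, shared_url: str) -> str:
    """Prefer the original source link when present, else the shared page URL."""
    # "sourceUrl" is a placeholder key; the real API field may be named differently.
    source = (recipe.get("sourceUrl") or "").strip()
    if source.startswith(("http://", "https://")):
        return source
    return shared_url


shared = "https://dashboard.bergamot.app/shared/REbGkQaNoVJ5kM"
with_source = canonical_url({"sourceUrl": "https://www.bestrecipes.com.au/recipes/peanut-butter-cookies-recipe/fowk6kuy"}, shared)
without_source = canonical_url({"sourceUrl": None}, shared)
```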

@jayaddison
Collaborator

> > Meanwhile we have a v15 development branch that can optionally use requests, but that otherwise requires callers to retrieve the HTML and pass it to the scraper themselves. Marginally less convenient, but allowing callers to use whatever HTTP client(s) they prefer (anything from built-in urlopen, low-level urllib3, requests, httpx, etc).
>
> @jayaddison is your preference for me to develop this in the v15 branch? If I implement it in v14 (which seems easier), will it need rewriting at some point? And will the other API-based scrapers, like the examples I found, need similar rewriting?

I'd recommend implementing it for v14, yep.

@josefhelie
Author

> Hi @josefhelie - apologies for my delayed reply. No further updates on this at the moment I'm afraid. Do you have any interest in learning some Python coding?

Thanks @jayaddison, but I don't have enough free time to do that, even though I would like to!! 😢
Thanks @mlduff too :)

@mlduff
Contributor

mlduff commented Apr 15, 2024

@jayaddison I noticed that the tests for the two scrapers I mentioned above are located under the legacy section - do I add my tests under there as well?

@mlduff
Contributor

mlduff commented Apr 15, 2024

@josefhelie are you able to provide a couple more recipe URLs please so I can test?

@jayaddison
Collaborator

> @jayaddison I noticed that the tests for the two scrapers I mentioned above are located under the legacy section - do I add my tests under there as well?

@mlduff yep, that's the correct place for those; thanks for checking 👍 You should be able to configure the expected_requests property in the tests to return example results for both the initial HTML HTTP GET response, and also the subsequent (probably also HTTP GET) API request.
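
The library's own `expected_requests` mechanism is not reproduced here; the following is a generic, self-contained sketch of the same idea, where a test supplies canned responses for both the initial page request and the follow-up API request. `CannedClient`, `fetch_recipe_title` and the `title` key are hypothetical names used only for illustration.

```python
import json

PAGE_URL = "https://dashboard.bergamot.app/shared/mIB4jYQtZU1A97"
API_URL = "https://api.bergamot.app/recipes/shared?r=mIB4jYQtZU1A97"


class CannedClient:
    """Stand-in HTTP client that returns pre-recorded bodies and logs each GET."""

    def __init__(self, responses):
        self._responses = dict(responses)
        self.requested = []

    def get(self, url):
        self.requested.append(url)
        if url not in self._responses:
            raise AssertionError(f"unexpected request: {url}")
        return self._responses[url]


def fetch_recipe_title(client):
    # First request: the shared HTML page; second request: the JSON API.
    client.get(PAGE_URL)
    payload = json.loads(client.get(API_URL))
    return payload["title"]  # "title" is a placeholder key for this sketch


client = CannedClient({
    PAGE_URL: "<html><body>placeholder page</body></html>",
    API_URL: json.dumps({"title": "Example recipe"}),
})
title = fetch_recipe_title(client)
```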

mlduff linked a pull request on Apr 16, 2024 that will close this issue
@jayaddison
Collaborator

@josefhelie have you found any pages shared on Bergamot where the original author is credited? I've seen a few pages that show the domain name of the source URL, and I'm wondering whether there are any that list names/usernames.

@josefhelie
Author

@jayaddison I'm not sure I have. Would it help if you provided me with a recipe that I could import into Bergamot, so that I can then give you the link to the imported recipe?

@mlduff
Contributor

mlduff commented Apr 17, 2024

@josefhelie Here is one that has an author https://www.bestrecipes.com.au/recipes/peanut-butter-cookies-recipe/fowk6kuy

@josefhelie
Author

I imported it into my Bergamot account; here it is: https://dashboard.bergamot.app/shared/REbGkQaNoVJ5kM

@jayaddison
Collaborator

Thanks @josefhelie - so roughly speaking, it seems like some source recipes may include author info, and the Bergamot page includes a link back to the original, but our scraper can't directly retrieve the author details at the moment (they're not in the Bergamot page, so it seems like we'd have to ask Bergamot to add those, or to retrieve them ourselves from the original URL).

I'm not completely sure what to do here; I personally place quite a lot of importance on retaining the author name/info (even though it's challenging sometimes) because my assumption is that a lot of recipe authors themselves would want that to be included when people view their recipes.

I haven't contacted Bergamot to ask whether they'd consider attempting to include that info themselves, so that's one option I'm considering. Is there a support/feedback option in the app itself?
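
For the other option mentioned above (retrieving author details from the original URL ourselves), a minimal sketch might look like the following. It assumes the source page embeds schema.org `Recipe` metadata as JSON-LD, which is common but not guaranteed, and it handles only the simplest layouts (no `@graph`, no list-valued `@type`). `author_from_html` is a hypothetical helper, not part of the library.

```python
import json
from html.parser import HTMLParser


class _JsonLdCollector(HTMLParser):
    """Collects the text of every <script type="application/ld+json"> element."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._buffer = []
        self.blocks = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.blocks.append("".join(self._buffer))
            self._in_jsonld = False


def author_from_html(html):
    """Return the first Recipe author name found in JSON-LD, or None."""
    collector = _JsonLdCollector()
    collector.feed(html)
    for block in collector.blocks:
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue  # skip malformed metadata rather than failing
        for item in data if isinstance(data, list) else [data]:
            if not isinstance(item, dict) or item.get("@type") != "Recipe":
                continue
            author = item.get("author")
            if isinstance(author, list) and author:
                author = author[0]
            if isinstance(author, dict):
                author = author.get("name")
            if isinstance(author, str) and author.strip():
                return author.strip()
    return None
```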
