Scaper can not recognise anything on Allerhande.nl (ah.nl) anymore while it did in the past #2888

Closed
3 tasks done
zandhaas opened this issue Dec 29, 2023 · 3 comments
Labels
bug (Something isn't working), scraper, triage

Comments

@zandhaas

First Check

  • I used the GitHub search to find a similar issue and didn't find it.

  • I have verified that this issue is not related to the underlying library
    hhursev/recipe-scrapers by 1) checking the debugger and confirming that data is
    returned, 2) verifying that there are errors in the log related to
    application-level code, or 3) verifying that the site provides recipe data or is
    otherwise supported by hhursev/recipe-scrapers

  • This issue can be replicated on the demo site (https://demo.mealie.io/)

Please provide 1-5 example URLs that are having errors

https://www.ah.nl/allerhande/recept/R-R1199309/courgettelasagne-met-3-kazen-en-gehakt
https://www.ah.nl/allerhande/recept/R-R1199239/vegan-groenterollade-met-saliestuffing-van-sanne-vogel

Please provide your logs for the Mealie container: docker logs <container-id> > mealie.logs

mealie-log.zip

Deployment

Docker (Synology)

@zandhaas added the bug, scraper, and triage labels Dec 29, 2023
@zandhaas changed the title from "Scaper can not recognis anything on Allerhande.nl anymore whil it did in th epast" to "Scaper can not recognise anything on Allerhande.nl (ah.nl) anymore while it did in the past" Dec 29, 2023
@zandhaas
Author

The scraper debugger returns:

recipe_scrapers was unable to scrape this URL

@Kuchenpirat
Collaborator

So it seems (as already suggested in the Discord by other members of the team) that the scraper is being blocked by the website.

All I can get out of the scraper is the domain and, via HTML mode, the following "Access Denied" message.

b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http&#58;&#47;&#47;www&#46;ah&#46;nl&#47;allerhande&#47;recept&#47;R&#45;R1199309&#47;courgettelasagne&#45;met&#45;3&#45;kazen&#45;en&#45;gehakt" on this server.<P>\nReference&#32;&#35;18&#46;84601302&#46;1703930024&#46;5a077ebe\n</BODY>\n</HTML>\n'
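
For anyone debugging a similar case, the entity-encoded body above can be decoded to confirm that it is a block page rather than recipe HTML. This is a minimal sketch using only the Python standard library; the helper name looks_blocked is hypothetical and is not part of Mealie or recipe-scrapers.

```python
import html
import re

# Response body copied verbatim from the scraper output quoted above.
body = (
    b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n'
    b'<H1>Access Denied</H1>\n \nYou don\'t have permission to access '
    b'"http&#58;&#47;&#47;www&#46;ah&#46;nl&#47;allerhande&#47;recept&#47;'
    b'R&#45;R1199309&#47;courgettelasagne&#45;met&#45;3&#45;kazen&#45;en&#45;gehakt" '
    b'on this server.<P>\n'
    b'Reference&#32;&#35;18&#46;84601302&#46;1703930024&#46;5a077ebe\n'
    b'</BODY>\n</HTML>\n'
)


def looks_blocked(raw: bytes) -> bool:
    """Return True if the page title is exactly "Access Denied" (hypothetical helper)."""
    text = raw.decode("utf-8", errors="replace")
    match = re.search(r"<title>(.*?)</title>", text, re.IGNORECASE | re.DOTALL)
    return bool(match) and match.group(1).strip().lower() == "access denied"


# Turn numeric character references such as &#58; and &#47; back into plain characters.
decoded = html.unescape(body.decode("utf-8"))

print(looks_blocked(body))
print(decoded)
```

The decoded text shows the requested recipe URL and a reference number, which suggests the denial comes from something in front of the site rather than from the recipe page itself.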

I'll be closing this, as there is not much Mealie can do about that.

@Kuchenpirat closed this as not planned Dec 30, 2023
@bilhert

bilhert commented Mar 27, 2024

I ran into the same issue. Some additional research gave me the following insight:
ah.nl/allerhande uses TLS fingerprinting for bot detection.

A regular request with Insomnia gave the same result as reported here.
However, sending the request through a TLS-spoofing proxy (https://github.com/LyleMi/ja3proxy)
with the "User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) Gecko/20100101 Firefox/124.0" and "Accept-Encoding:" headers set returned the correct result. (Note: Accept-Encoding only needs to be present; its value does not matter.)

So this issue may be fairly easy to resolve if the scrape request is altered so that this information is sent with it.

TLS spoofing could perhaps be built directly into Mealie as an optional feature, or supported by allowing the use of a proxy.

Additionally, the "Accept" header should probably be set alongside the User-Agent.
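
To make the suggestion concrete, here is a rough sketch of such a request using only the Python standard library. It is not how Mealie currently builds its requests. The proxy address, the Accept value, and the function names are all assumptions made for illustration; the User-Agent is the one from my test. The TLS fingerprint itself cannot be changed from this code, so the spoofing still has to happen in the proxy.

```python
import urllib.request
from typing import Optional

# Assumed address of a locally running TLS-spoofing proxy; adjust to your setup.
ASSUMED_PROXY_URL = "http://127.0.0.1:8080"

BROWSER_HEADERS = {
    # User-Agent taken from the test described above.
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:124.0) "
        "Gecko/20100101 Firefox/124.0"
    ),
    # Only the presence of this header mattered in the test; "identity"
    # asks for an uncompressed body so it can be read directly.
    "Accept-Encoding": "identity",
    # Assumed browser-like value; the exact string was not tested.
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
}


def build_request(url: str) -> urllib.request.Request:
    """Create a request that carries the browser-like headers."""
    return urllib.request.Request(url, headers=BROWSER_HEADERS)


def build_opener(proxy_url: Optional[str] = None) -> urllib.request.OpenerDirector:
    """Return an opener that routes traffic through proxy_url when one is given."""
    if proxy_url is None:
        return urllib.request.build_opener()
    proxy = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return urllib.request.build_opener(proxy)


def fetch(url: str, proxy_url: Optional[str] = None, timeout: float = 10.0) -> bytes:
    """Fetch url with the headers above, optionally through the proxy."""
    opener = build_opener(proxy_url)
    with opener.open(build_request(url), timeout=timeout) as response:
        return response.read()
```

Calling fetch(url, proxy_url=ASSUMED_PROXY_URL) would route the request through the proxy, while fetch(url) sends it directly. Keeping the proxy optional matches the idea of making this an opt-in setting. A proxy that rewrites the TLS handshake may also require trusting its certificate, which this sketch does not cover.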

https://github.com/mealie-recipes/mealie/blob/mealie-next/mealie/services/scraper/scraper_strategies.py
https://github.com/mealie-recipes/mealie/blob/mealie-next/mealie/services/recipe/recipe_data_service.py
