Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Add] Plötzblog Scraper #1062

Open
vabene1111 opened this issue Apr 14, 2024 · 13 comments · May be fixed by #1100
Open

[Add] Plötzblog Scraper #1062

vabene1111 opened this issue Apr 14, 2024 · 13 comments · May be fixed by #1100
Assignees

Comments

@vabene1111
Copy link
Collaborator

A popular german bread baking recipe website with no schema but very nice recipes

https://www.ploetzblog.de/rezepte/rustikales-weizenmischbrot/id=61b843e6f51707630ab85911
https://www.ploetzblog.de/rezepte/hafer-dinkel-brot/id=618a284828ae7154616ab156
https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768

@jayaddison
Copy link
Collaborator

cc @mlduff (re: #1063, mentioning you so that you appear in the possible assignees list for this bug)

@jayaddison
Copy link
Collaborator

Ok, that didn't work (makes sense, I suppose). Perhaps issues can only be assigned to people who have interacted with the thread themselves (not by others on their behalf).

@mlduff
Copy link
Contributor

mlduff commented Apr 16, 2024

Hopefully now it works @jayaddison

@mlduff
Copy link
Contributor

mlduff commented Apr 24, 2024

@jayaddison I have an issue with one of the test cases when implementing this site. The ingredients_groups test assumes that the groups are a split of the ingredients, however it seems to be falling over in this case because the groups use different amounts of the same ingredient (which adds up to the total in the overall ingredients).

Here is a snapshot of my unit test with the expected results:

"ingredients": [
    "558 g Weizenmehl 550",
    "389 g Wasser",
    "90 g Weizenanstellgut TA 200 (weich)",
    "13 g Salz"
  ],
  "ingredient_groups": [
    {
      "ingredients": [
        "90 g Wasser",
        "90 g Weizenmehl 550",
        "90 g Weizenanstellgut TA 200 (weich)"
      ],
      "purpose": "Weizensauerteig"
    },
    {
      "ingredients": [
        "13 g Salz",
        "298 g Wasser",
        "467 g Weizenmehl 550",
        "gesamter Weizensauerteig"
      ],
      "purpose": "Hauptteig"
    }
  ],

As far as I can tell, the options I have are:

  1. Get rid of the ingredient_groups implementation
  2. Make the overall ingredients return the separated version of the ingredients (by concatenating the ingredient_groups)
  3. Updating the test to account for this difference
  4. Exempt this scraper from that particular test

I read https://github.com/hhursev/recipe-scrapers/blob/main/docs/in-depth-guide-ingredient-groups.md, and in particular this excerpt:

The ingredients found in ingredients() and ingredient_groups() should be the same because we're presenting the same set of ingredients, just in a different way. There can sometimes be minor differences in the ingredients in the schema and the ingredients in the HTML which needs to be handled.

I think that in my case if we kept my current output as is, it would still be satisfying this requirement as it is still the same ingredients, just separated out. Therefore, I think the best solution is option 3, however I would like to get your input on if that is the best option/desired output, and also how I could tackle this.

Scraped page: https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768

@jayaddison
Copy link
Collaborator

Interesting - thanks @mlduff! In a way I'm surprised that we haven't encountered this problem before already.

To check: is is this part of the test logic that is failing:

# Assert that the ingredients returned by the ingredient_groups() function
# are the same as the ingredients return by the ingredients() function.
grouped = []
for group in actual.ingredient_groups():
grouped.extend(group.ingredients)
?

@mlduff
Copy link
Contributor

mlduff commented Apr 24, 2024

That is correct @jayaddison

@jayaddison
Copy link
Collaborator

Ok, thanks @mlduff. I agree that option three (handling this case within test coverage) would be ideal. However, I can't initially figure out how we'd handle this in out tests. Are you suggesting trying to parse and compare the quantities in the ingredient lines (grouped vs total), or to override the test for this individual case - or pehaps something else?

@mlduff
Copy link
Contributor

mlduff commented Apr 25, 2024

@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).

If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.

@jayaddison
Copy link
Collaborator

Mmm, ok. I'm continuing to think about it. If we had one Python class per scraper unit test, then perhaps what we would naturally do here would be to override the logic that checks for this ingredient-group-ingredient-list equivalence.

My sense is that something similar could be appropriate here. There's definitely possible value in adding a dev-dependency and potentially contributing improvements back there as a result of what we find; but there are also benefits in keeping developer dependencies lightweight.

@mlduff
Copy link
Contributor

mlduff commented Apr 25, 2024

@jayaddison I could move this test to the legacy tests so I can override if you like? Otherwise could we look at doing a non-legacy similar approach?

I'm also happy to investigate trying to parse the ingredients with the dev-dependencies if you would like me to.

@jayaddison
Copy link
Collaborator

@mlduff please feel free to explore any possibilities; all of those options seem like they'd provide ways to learn more about the problem and possible ways to resolve it - including the related advantages and disadvantages of each.

At the moment I don't have any further recommendations, but I'll continue thinking about it. It's not always the case, but sometimes taking a week or two to consider a problem can help to develop different ways of thinking about it that ultimately help to figure it out properly -- or at least help you to decide what the best option is from those available.

@mlduff mlduff linked a pull request Apr 26, 2024 that will close this issue
@mlduff
Copy link
Contributor

mlduff commented Apr 26, 2024

@jayaddison I have just submitted the PR for this - I went for option 4, and devised a way to exempt this scraper from that particular test. If I get time I'll further investigate ingredients-parser, however this seems to be like there still could be edge cases it falls over in (its a model that spits out confidences and the like), so I thought it probably isn't suitable for a unit test.

@jayaddison
Copy link
Collaborator

@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).

If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.

This does seem like a good approach, if possible. It would mean that we don't have to add an override option / subclass the test / provide alternative implementations per-scraper.

It seems there are at least two libraries available:

As you say, I do think that these should be dev-dependencies only for this use-case; let's prove the functionality and figure out a library that we like for test/internal purposes, and then perhaps in future we could provide parsed quantities on ingredients.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants