[Add] Plötzblog Scraper #1062

vabene1111 · 2024-04-14T16:13:20Z

A popular german bread baking recipe website with no schema but very nice recipes

https://www.ploetzblog.de/rezepte/rustikales-weizenmischbrot/id=61b843e6f51707630ab85911
https://www.ploetzblog.de/rezepte/hafer-dinkel-brot/id=618a284828ae7154616ab156
https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768

jayaddison · 2024-04-16T10:26:27Z

cc @mlduff (re: #1063, mentioning you so that you appear in the possible assignees list for this bug)

jayaddison · 2024-04-16T10:27:35Z

Ok, that didn't work (makes sense, I suppose). Perhaps issues can only be assigned to people who have interacted with the thread themselves (not by others on their behalf).

mlduff · 2024-04-16T11:01:29Z

Hopefully now it works @jayaddison

mlduff · 2024-04-24T13:09:19Z

@jayaddison I have an issue with one of the test cases when implementing this site. The ingredients_groups test assumes that the groups are a split of the ingredients, however it seems to be falling over in this case because the groups use different amounts of the same ingredient (which adds up to the total in the overall ingredients).

Here is a snapshot of my unit test with the expected results:

"ingredients": [
    "558 g Weizenmehl 550",
    "389 g Wasser",
    "90 g Weizenanstellgut TA 200 (weich)",
    "13 g Salz"
  ],
  "ingredient_groups": [
    {
      "ingredients": [
        "90 g Wasser",
        "90 g Weizenmehl 550",
        "90 g Weizenanstellgut TA 200 (weich)"
      ],
      "purpose": "Weizensauerteig"
    },
    {
      "ingredients": [
        "13 g Salz",
        "298 g Wasser",
        "467 g Weizenmehl 550",
        "gesamter Weizensauerteig"
      ],
      "purpose": "Hauptteig"
    }
  ],

As far as I can tell, the options I have are:

Get rid of the ingredient_groups implementation
Make the overall ingredients return the separated version of the ingredients (by concatenating the ingredient_groups)
Updating the test to account for this difference
Exempt this scraper from that particular test

I read https://github.com/hhursev/recipe-scrapers/blob/main/docs/in-depth-guide-ingredient-groups.md, and in particular this excerpt:

The ingredients found in ingredients() and ingredient_groups() should be the same because we're presenting the same set of ingredients, just in a different way. There can sometimes be minor differences in the ingredients in the schema and the ingredients in the HTML which needs to be handled.

I think that in my case if we kept my current output as is, it would still be satisfying this requirement as it is still the same ingredients, just separated out. Therefore, I think the best solution is option 3, however I would like to get your input on if that is the best option/desired output, and also how I could tackle this.

Scraped page: https://www.ploetzblog.de/rezepte/mildes-weizensauerteigbrot/id=619f68b528ae7154616ab768

jayaddison · 2024-04-24T14:20:12Z

Interesting - thanks @mlduff! In a way I'm surprised that we haven't encountered this problem before already.

To check: is is this part of the test logic that is failing:

recipe-scrapers/tests/__init__.py

Lines 113 to 117 in 3273fab

    
           # Assert that the ingredients returned by the ingredient_groups() function 
        
           # are the same as the ingredients return by the ingredients() function. 
        
           grouped = [] 
        
           for group in actual.ingredient_groups(): 
        
               grouped.extend(group.ingredients)

?

mlduff · 2024-04-24T20:25:55Z

That is correct @jayaddison

jayaddison · 2024-04-25T13:52:01Z

Ok, thanks @mlduff. I agree that option three (handling this case within test coverage) would be ideal. However, I can't initially figure out how we'd handle this in out tests. Are you suggesting trying to parse and compare the quantities in the ingredient lines (grouped vs total), or to override the test for this individual case - or pehaps something else?

mlduff · 2024-04-25T13:57:01Z

@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).

If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.

jayaddison · 2024-04-25T14:29:27Z

Mmm, ok. I'm continuing to think about it. If we had one Python class per scraper unit test, then perhaps what we would naturally do here would be to override the logic that checks for this ingredient-group-ingredient-list equivalence.

My sense is that something similar could be appropriate here. There's definitely possible value in adding a dev-dependency and potentially contributing improvements back there as a result of what we find; but there are also benefits in keeping developer dependencies lightweight.

mlduff · 2024-04-25T14:46:16Z

@jayaddison I could move this test to the legacy tests so I can override if you like? Otherwise could we look at doing a non-legacy similar approach?

I'm also happy to investigate trying to parse the ingredients with the dev-dependencies if you would like me to.

jayaddison · 2024-04-25T19:04:51Z

@mlduff please feel free to explore any possibilities; all of those options seem like they'd provide ways to learn more about the problem and possible ways to resolve it - including the related advantages and disadvantages of each.

At the moment I don't have any further recommendations, but I'll continue thinking about it. It's not always the case, but sometimes taking a week or two to consider a problem can help to develop different ways of thinking about it that ultimately help to figure it out properly -- or at least help you to decide what the best option is from those available.

mlduff · 2024-04-26T15:25:48Z

@jayaddison I have just submitted the PR for this - I went for option 4, and devised a way to exempt this scraper from that particular test. If I get time I'll further investigate ingredients-parser, however this seems to be like there still could be edge cases it falls over in (its a model that spits out confidences and the like), so I thought it probably isn't suitable for a unit test.

jayaddison · 2024-05-06T10:40:30Z

@jayaddison I'm not entirely sure - ideally we would be able to parse the quantities (or even if you could just separate the quantities out and check that the same ingredients are mentioned), however I'm not sure of a robust way of doing this without perhaps using an external library as a dev-dependency (e.g. ingredient-parser).

If this would be inappropriate/too hard, being able to mark in some way (is it possible to annotate or something like that?) that this particular test shouldn't be run on this particular file that would be good.

This does seem like a good approach, if possible. It would mean that we don't have to add an override option / subclass the test / provide alternative implementations per-scraper.

It seems there are at least two libraries available:

ingredient-parser (thanks for mentioning that, I hadn't seen it before)
ingredient-slicer (mentioned in [POC] Return parsed ingredients (name, unit, quantity) #733 recently)

As you say, I do think that these should be dev-dependencies only for this use-case; let's prove the functionality and figure out a library that we like for test/internal purposes, and then perhaps in future we could provide parsed quantities on ingredients.

vabene1111 added the enhancement label Apr 14, 2024

jayaddison assigned mlduff Apr 16, 2024

mlduff linked a pull request Apr 26, 2024 that will close this issue

Add Ploetzblog + configuration for unit tests #1100

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Add] Plötzblog Scraper #1062

[Add] Plötzblog Scraper #1062

vabene1111 commented Apr 14, 2024

jayaddison commented Apr 16, 2024

jayaddison commented Apr 16, 2024

mlduff commented Apr 16, 2024

mlduff commented Apr 24, 2024 •

edited

jayaddison commented Apr 24, 2024

mlduff commented Apr 24, 2024

jayaddison commented Apr 25, 2024

mlduff commented Apr 25, 2024

jayaddison commented Apr 25, 2024

mlduff commented Apr 25, 2024

jayaddison commented Apr 25, 2024

mlduff commented Apr 26, 2024

jayaddison commented May 6, 2024

[Add] Plötzblog Scraper #1062

[Add] Plötzblog Scraper #1062

Comments

vabene1111 commented Apr 14, 2024

jayaddison commented Apr 16, 2024

jayaddison commented Apr 16, 2024

mlduff commented Apr 16, 2024

mlduff commented Apr 24, 2024 • edited

jayaddison commented Apr 24, 2024

mlduff commented Apr 24, 2024

jayaddison commented Apr 25, 2024

mlduff commented Apr 25, 2024

jayaddison commented Apr 25, 2024

mlduff commented Apr 25, 2024

jayaddison commented Apr 25, 2024

mlduff commented Apr 26, 2024

jayaddison commented May 6, 2024

mlduff commented Apr 24, 2024 •

edited