
Idea: Grouping Ingredients #301

Closed · bfcarpio opened this issue Jan 24, 2021 · 20 comments

Comments

@bfcarpio (Collaborator)

@jayaddison mentioned, as part of #299, the potential inclusion of grouping ingredients. I remember having a similar conversation about instructions (#230).

If we were to implement this, we'd need to change the ingredients method (or add a new one) so that there's the potential for returning a nested data structure or some other deviation from the current implementation (akin to #283).

Just for brainstorming: should we, and how would we?

@jayaddison (Collaborator)

No precise ideas here yet, but as a general suggestion: it'd likely make sense to re-use the same grouping behaviour / programming interface across both ingredients and instructions.

Some challenges:

  • Ingredients may be grouped while instructions might not be (or vice versa)
  • Even when both ingredients and corresponding instructions are grouped, the text description of the group name might not match on some websites
  • Most recipes will not contain groupings at all

@weightwatchers-carlanderson (Contributor) commented Feb 26, 2021

I was thinking about this more generally. I think that it could be good if ingredients, instructions, ratings and possibly time were returned as objects.

That is, we could capture rating count, rating average, min rating and max rating (all optional) in one Ratings instance:

def ratings(self):
    cnt = self.soup.find("meta", {"itemprop": "ratingCount"})["content"]
    rating = self.soup.find("meta", {"itemprop": "ratingValue"})["content"]
    return Ratings(count=cnt, rating=rating)
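For illustration, a minimal sketch of such a Ratings container (the class and its field names are hypothetical, mirroring the comment above, not an existing API):

from dataclasses import dataclass
from typing import Optional

@dataclass
class Ratings:
    rating: Optional[float] = None      # average rating value
    count: Optional[int] = None         # number of ratings
    min_rating: Optional[float] = None  # lowest possible rating (optional)
    max_rating: Optional[float] = None  # highest possible rating (optional)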

Similarly for ingredients or instructions, we could let that handle the grouping under the hood:

def ingredients(self):
    obj = Ingredients()
    for ingredient in self.soup.find(....):
        obj.add_ingredient(ingredient)
    return obj

but where add_ingredient could take an optional grouping tag:

add_ingredient("1 cup mayonnaise", "sauce")
add_ingredient("1/4 teaspoon paprika", "sauce")

and you could

obj.get_ingredients()               # returns a single list
obj.get_ingredients(group="sauce")  # acts as a filter

and

obj.get_groups()  # returns the list of group terms

If that were handled internally as a list of 2-tuples or as some dict, that detail would be abstracted away from the code that sets it in the scraper.
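For illustration, a minimal sketch of such an Ingredients container, assuming the method names from the comment above (none of this is an existing API):

from typing import Dict, List, Optional

class Ingredients:
    def __init__(self):
        # Map each group name (None for ungrouped items) to its ingredient strings
        self._groups: Dict[Optional[str], List[str]] = {}

    def add_ingredient(self, text: str, group: Optional[str] = None) -> None:
        self._groups.setdefault(group, []).append(text)

    def get_ingredients(self, group: Optional[str] = None) -> List[str]:
        if group is not None:
            # Act as a filter when a group is requested
            return list(self._groups.get(group, []))
        # No filter: flatten every group into a single list
        return [item for items in self._groups.values() for item in items]

    def get_groups(self) -> List[Optional[str]]:
        return list(self._groups.keys())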

@jayaddison (Collaborator)

@weightwatchers-carlanderson interesting - if I interpret your suggestion correctly, would this mean that we could make these changes internally within the scrapers even if the external API for the library remains the same, at least to begin with?

In particular I really like your idea about making the group an optional argument during content parsing. That fits well with the ~90% case where grouping isn't present, so we can allow scraper authors to ignore it.

@weightwatchers-carlanderson (Contributor)

The interface would be the same, but the return types would be different:

def ratings(self) -> int            --->  def ratings(self) -> Ratings
def ingredients(self) -> List[str]  --->  def ingredients(self) -> Ingredients
def instructions(self) -> str       --->  def instructions(self) -> Instructions

That would still break all calling code and unit tests.

As many scraper methods use self.schema, that could be implemented once in that one parent class, so simple scraper methods such as the following would be unchanged:

    def title(self):
        return self.schema.title()

@bfcarpio (Collaborator, Author)

I like it, but I'm hoping you have ideas on how to implement it properly in a way that's coherent. It might be overwhelming if we have users overriding any one of 20 different functions, etc.

Something that could go along with this is changing which functions are overridden. Right now we have users write the public functions, e.g. title(). Instead, we could have them write _title() so that we can implement title() around it (a sketch of that pattern follows below). That might make it easier for us to maintain and give more flexibility to this object idea? I don't mind whipping out sed again.
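A hypothetical sketch of that wrapper pattern (the normalize_string post-processing step is an illustrative assumption, not a confirmed design):

class AbstractScraper:
    def title(self) -> str:
        # Public method: shared post-processing wraps the author-written hook
        return normalize_string(self._title())

    def _title(self) -> str:
        # Scraper authors would override this hook instead of title()
        raise NotImplementedError("This should be implemented.")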

@weightwatchers-carlanderson (Contributor) commented Feb 27, 2021

It might be overwhelming if we have users overriding any one of 20 different functions etc.

@bfcarpio perhaps I'm misunderstanding your point, and you understand this codebase much better than I do, but AbstractScraper defines an interface for the scrapers. In the concrete scraper subclasses, we allow them the convenience of using self.schema, but we don't enforce it, and there may be genuine reasons to implement a method with a more raw self.soup implementation. We can't control that, and that is the reality today. I don't see that what I proposed introduces any more freedom or confusion than exists today.

There are a couple of things that one could do:

  1. Drop the Optional aspect of the type hinting in AbstractScraper, so that

         @Decorators.schema_org_priority
         def ingredients(self) -> Optional[List[str]]:
             raise NotImplementedError("This should be implemented.")

     becomes

         @Decorators.schema_org_priority
         def ingredients(self) -> Ingredients:
             raise NotImplementedError("This should be implemented.")

     This will catch scrapers that don't match the right signature.

and

  2. Implement a non-abstract BaseScraper that implements a standard scraper (equivalent to wild_mode). The methods in that class use self.schema and return the right type of objects. Then, if we encourage people to subclass that rather than AbstractScraper, they can just override the methods that they need to correct, or add additional methods for that particular site (see the sketch after this list).
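A minimal sketch of what that could look like (the Ingredients return type follows the earlier proposal; neither class exists yet):

class BaseScraper(AbstractScraper):
    """Default scraper: every method delegates to the schema.org data."""

    def title(self) -> str:
        return self.schema.title()

    def ingredients(self) -> Ingredients:
        # Wrap the raw schema.org ingredient list in the proposed container
        obj = Ingredients()
        for ingredient in self.schema.ingredients():
            obj.add_ingredient(ingredient)
        return obj

    # ...and similarly for instructions(), ratings(), total_time(), etc.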

@weightwatchers-carlanderson (Contributor) commented Feb 27, 2021

P.S. Rather than sed, it should be possible to use reflection to grab the code of each method of each scraper and determine whether it uses self.schema (so no changes are necessary) or whether it uses some custom code.
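For instance, a rough sketch using the standard inspect module (the helper name is hypothetical):

import inspect

def uses_schema(scraper_cls, method_name: str) -> bool:
    """Return True if the method's source code references self.schema."""
    source = inspect.getsource(getattr(scraper_cls, method_name))
    return "self.schema" in source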

P.P.S.
My comment above just made me realize: why do we have scrapers today that look like this:

class SomeSite(AbstractScraper):
    @classmethod
    def host(cls):
        return "somesite.com"

    def title(self):
        return self.schema.title()

    def total_time(self):
        return self.schema.total_time()

    def yields(self):
        return self.schema.yields()

    def image(self):
        return self.schema.image()

    def ingredients(self):
        return self.schema.ingredients()

    def instructions(self):
        return self.schema.instructions()

    def ratings(self):
        return self.schema.ratings()

There should be a BaseScraper that looks like this, and then this site's scraper would just need to be:

class SomeSite(BaseScraper):
    @classmethod
    def host(cls):
        return "somesite.com"

@bfcarpio (Collaborator, Author)

Sorry, I think I was mixing up two different plans when I was writing that response. I do like the idea of returning objects (dataclasses?) for those methods. If anything, it would allow us to increase the documentation and type-safety information for package consumers.

we have scrapers that look like this

I think there initially was an implementation where methods didn't need to be defined, but then people started adding scrapers that didn't work. This was around the time when I found the package, so I don't have the most context in this regard.

@jayaddison (Collaborator)

@bfcarpio @weightwatchers-carlanderson @vabene1111 pinging you to mention some recent discussion in #622 that is relevant to this feature request about grouping ingredients.

The currently-suggested approach there incorporates some ideas I gathered from here - including dataclasses as mentioned by @bfcarpio, and per-field return types suggested by @weightwatchers-carlanderson (that would fit in nicely with our recently-added support for mypy type checking).

@vabene1111 this is mostly informational for you, but I think that whether or not scraper methods accept parameters could be important for some of the data export functionality - like figuring out which methods to include and how to invoke them.

@vabene1111 (Collaborator)

Awesome, thanks.

You seem to already have many good ideas regarding this topic. I am facing a similar issue in Tandoor (where I am also using recipe-scrapers under the hood). What we have is an algorithm to split a string of ingredients into unit, amount and food; after that we can combine foods of the same unit. What we are currently working on/planning is adding automatic unit conversion so that you can also combine foods of different units.

One thing that might be hard for this library is that converting a string into unit, amount and food is somewhat error-prone and usually requires some human work afterwards, so it might be a bit tricky to go with this approach in recipe-scrapers.

@jayaddison (Collaborator)

@vabene1111 agreed: identifying ingredients and units, and then standardizing units, is hard. We don't plan to handle those in recipe-scrapers, but I can share some suggestions from experience. Is there a GitHub issue in TandoorRecipes where I could add some comments?

@vabene1111 (Collaborator)

Thanks for the offer. There is currently no open issue, as the solution we currently use works pretty well, but if you have any ideas for improving it further, feel free to open a discussion or issue. Thanks :)

@bfcarpio (Collaborator, Author)

I'm going to list some thoughts here that are based on this PR, but not PR focused.

I like the idea of making dataclasses that are IngredientsGroup or InstructionsGroup.

  1. They should have minimal functionality beyond exposing their content in a couple of styles (a string with or without newlines, a list of strings) that we then use in the generic instructions() or ingredients() functions (see the sketch after this list).
  2. I think we should avoid tying them too closely to whatever schema.org objects exist. That is, the definition of these dataclasses shouldn't be a schema.org DOM model; we should keep them as logical units. If the website breaks apart ingredients visually, then regardless of whether they are implemented as schema.org objects, we should attempt to split them.
  3. Hard pass on having these objects do any sort of ingredient unit conversion. IMO, this library's goal as a scraper library is just to provide the data as raw as possible in some native data structures. Unit handling goes too far into business logic, and then we start getting into questions like "are courgettes and zucchini the same object?"
  4. My sub-goal of keeping this project newcomer-friendly: if you're new to programming, code like we have now (grab an HTML element and split on newlines) is pretty easy to get into. Suddenly having a bunch of modeling machinery would decrease the legibility of our code. Switching from a group-less to a with-group implementation should be ~3 steps.
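For illustration, a minimal sketch of such a group dataclass with those accessors (all names here are hypothetical):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IngredientsGroup:
    purpose: Optional[str] = None  # e.g. "For the sauce"; None if ungrouped
    ingredients: List[str] = field(default_factory=list)

    def to_list(self) -> List[str]:
        return list(self.ingredients)

    def to_string(self, sep: str = "\n") -> str:
        # Joined with newlines by default; pass a different sep for one line
        return sep.join(self.ingredients)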

@vabene1111 I just started using Tandoor since you guys are a consumer of ours. I figured it'd be cool to see a project I contribute to in action. Thanks for all your work!

@jayaddison (Collaborator)

As of version 14.17.1 (just released), there's support for IngredientGroup results from the NIHHealthyEating scraper - feedback welcome; let's try that out, and if it works well then perhaps we can promote the ingredient_groups method there to the AbstractScraper class.
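A usage sketch (the recipe path is elided; any page supported by the NIHHealthyEating scraper should work):

from recipe_scrapers import scrape_me

scraper = scrape_me("https://healthyeating.nhlbi.nih.gov/...")  # recipe path elided
for group in scraper.ingredient_groups():
    print(group.purpose)      # the group heading, or None if the recipe is ungrouped
    print(group.ingredients)  # the list of ingredient strings in that group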

@Chris230291

Groups sounds like a great idea.
Take this example https://realfood.tesco.com/recipes/spiced-turkey-burgers-with-guacamole.html
Right now I get...

['1 large avocado, diced', 'juice of 1 lime', '1 large garlic clove, crushed', '8 vine-ripened cherry tomatoes, quartered', '2tbsp coriander leaves, chopped', '1/2tbsp light olive oil', '1 onion, finely chopped', '2 large garlic cloves, crushed', '2tbsp parsley', '500g pack turkey mince', '1 heaped tsp mild chilli powder', '1 large red pepper, thickly sliced', '1 red onion, thickly sliced', 'burger buns, toasted', 'lettuce to serve']

(I think I am supposed to be getting "For the guacamole", "For the burgers", and "To serve" too, but I am not.)

Ideally it would be returned as something like this...

[
   {
      "For the guacamole":[
         "1 large avocado, diced",
         "juice of 1 lime",
         "1 large garlic clove, crushed",
         "8 vine-ripened cherry tomatoes, quartered",
         "2tbsp coriander leaves, chopped"
      ]
   },
   {
      "For the burgers":[
         "1/2tbsp light olive oil",
         "1 onion, finely chopped",
         "2 large garlic cloves, crushed",
         "2tbsp parsley",
         "500g pack turkey mince",
         "1 heaped tsp mild chilli powder"
      ]
   },
   {
      "To Server":[
         "1 large red pepper, thickly sliced",
         "1 red onion, thickly sliced",
         "burger buns, toasted",
         "lettuce to serve"
      ]
   }
]

@strangetom (Collaborator)

Hi guys!

I think this would be a great idea. I've had a go at prototyping something that may be generic enough to be useful, and which I think is in alignment with thoughts further up this thread.

The AbstractScraper class could include an ingredient_groups function with a default implementation that does something like:

class AbstractScraper:
    ...
    def ingredient_groups(self) -> List[IngredientGroup]:
        return [IngredientGroup(ingredients=self.ingredients(), purpose=None)]

where IngredientGroup is the dataclass suggested above. The main idea of doing this is to keep the scraper interface consistent across all scrapers.
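As a sketch, that dataclass could be as simple as the following (field names taken from the default implementation above):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class IngredientGroup:
    ingredients: List[str] = field(default_factory=list)
    purpose: Optional[str] = None  # group heading, e.g. "For the dough"; None if ungrouped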

The ingredient_groups implementation could then be modified in the derived classes where it makes sense.

There might also be an opportunity to provide a utility function to help with grouping the ingredients. Looking at a couple of examples:

The structure of the markup for the ingredients is quite similar. There's a group heading element, followed by some siblings with the ingredients, then another group heading element, etc. A helper function along the lines of:

def group_ingredients(soup, group_heading, group_element):
    """Group ingredients into sublists according to the heading.

    If the recipe doesn't group ingredients, this returns a single
    IngredientGroup object with the purpose set to None and the
    ingredients list populated with all ingredients.

    INPUTS
    ======
    soup : BeautifulSoup
        Scraped page html as a BeautifulSoup object
    group_heading : str
        CSS selector for the ingredient group heading
    group_element : str
        CSS selector for the ingredient element

    OUTPUTS
    ======
    List[IngredientGroup]
        List of IngredientGroup objects.
    """
    groupings = {None: []}
    current_heading = None

    # Select the headings and ingredient elements together, in document order
    elements = soup.select(",".join([group_heading, group_element]))
    for el in elements:
        if el in el.parent.select(group_heading):
            # This element is a group heading: start a new group
            current_heading = normalize_string(el.get_text())
            groupings[current_heading] = []
        else:
            # This element is an ingredient: add it to the current group
            text = normalize_string(el.get_text())
            groupings[current_heading].append(text)

    return [IngredientGroup(purpose=k, ingredients=v) for k, v in groupings.items() if v != []]

This works quite well on some examples.

> scraper = scrape_me("https://realfood.tesco.com/recipes/spiced-turkey-burgers-with-guacamole.html")
> group_ingredients(scraper.soup, ".recipe-detail__subheading", ".recipe-detail__list-item")
[IngredientGroup(ingredients=['1 large avocado, diced', 'juice of 1 lime', '1 large garlic clove, crushed', '8 vine-ripened cherry tomatoes, quartered', '2tbsp coriander leaves, chopped'], purpose='For the guacamole'),
 IngredientGroup(ingredients=['1/2tbsp light olive oil', '1 onion, finely chopped', '2 large garlic cloves, crushed', '2tbsp parsley', '500g pack turkey mince', '1 heaped tsp mild chilli powder'], purpose='For the burgers'),
 IngredientGroup(ingredients=['1 large red pepper, thickly sliced', '1 red onion, thickly sliced', 'burger buns, toasted', 'lettuce to serve'], purpose='To serve')]

or

> scraper = scrape_me("https://www.sainsburysmagazine.co.uk/recipes/mains/frying-pan-hawaiian-pizza", wild_mode=True)
> group_ingredients(scraper.soup, ".ingredients h5", ".ingredients li")
[IngredientGroup(ingredients=['1 tbsp olive oil, plus more for greasing', '250g strong white bread flour, plus extra for dusting', '1 tsp fast action dried yeast', '½ tsp fine sea salt'], purpose='For the dough'),
 IngredientGroup(ingredients=['1 tbsp olive oil', '1 garlic clove, crushed', '½ x 500g carton passata', '1 tsp dried oregano', '¼ tsp chilli flakes (optional)', '1 tsp sugar (any kind)', '1 tsp red wine vinegar'], purpose='For the pizza sauce'),
 IngredientGroup(ingredients=['2 pineapple rings from a tin (or about 50g fresh), drained and diced', '1 x 120g pack thick cut ham, cut into chunks or torn', '100g grated mozzarella cheese'], purpose='For the topping')]

But this isn't perfect. Because the ingredients are extracted from the html markup, they might not match the list of ingredients returned by SomeScraper.ingredients() if, for some reason, the html markup doesn't match the embedded data. An example of this is https://www.pickuplimes.com/recipe/vegan-honey-mustard-tofu-wraps-1448, where the ingredient quantities returned by the scraper are slightly different from what's shown on the page. It also doesn't work in the NIH example above, because the html markup doesn't follow this pattern.

@jayaddison (Collaborator)

Hi @strangetom - hope you had a good weekend. This short reply doesn't really do your comment justice, but I think that sounds fantastic, and I'd be glad to help review -- and, with a bit of luck, merge -- your changes to support ingredient groupings.

@vabene1111 (Collaborator)

Thanks for the concept/idea. In my opinion the last "disclaimer" disqualifies this approach. If I cannot rely on the parser picking up all ingredients in the correct amounts, then the step-ordered ingredients can never be consumed, as I would always live in fear of missing something (both when using the data in scripts and when archiving).

But I don't think the approach is completely flawed. Maybe the scraper could use the ingredients found in the way described above, together with some kind of matching algorithm (maybe similar to how I do it in tandoor: https://github.com/TandoorRecipes/recipes/blob/develop/vue/src/apps/ImportView/ImportViewStepEditor.vue#L194), to sort the list returned by the normal SomeScraper.ingredients() method to their respective steps by some kind of confidence index/threshold.
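For illustration, a minimal sketch of such a confidence-based matcher using only the standard library (the helper name and threshold value are assumptions):

import difflib
from typing import List, Optional

def best_match(html_text: str, canonical: List[str], threshold: float = 0.75) -> Optional[str]:
    """Return the canonical ingredient most similar to the HTML-derived text,
    or None if no candidate clears the confidence threshold."""
    if not canonical:
        return None
    score, match = max(
        (difflib.SequenceMatcher(None, html_text, candidate).ratio(), candidate)
        for candidate in canonical
    )
    return match if score >= threshold else None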

@strangetom (Collaborator)

Thanks for the feedback.

@vabene1111, I agree with you that it's not robust enough yet. If SomeScraper.ingredients() and SomeScraper.ingredient_groups() potentially returned different ingredients, that would not be helpful. The approach you suggest looks quite simple and would solve this particular problem. There should always be a close, if not exact, match, so this could work well.

Another, simpler approach would be to use my suggested group_ingredients function to calculate the indices of the ingredients in each group, and then get the actual ingredient sentences from the .ingredients() function (a sketch of this follows below). This assumes that the ingredients the scraper extracts appear in the same order as the ingredients in the html markup. I don't think it is unreasonable to assume this, but there would be no guarantee, so it is perhaps not as robust as the matching approach.
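A rough sketch of that index-based regrouping (the helper name is hypothetical; it assumes both sources list ingredients in the same order, and reuses the IngredientGroup dataclass from above):

from typing import List

def regroup_from_indices(scraper, html_groups: List[IngredientGroup]) -> List[IngredientGroup]:
    """Replace the HTML-derived ingredient text in each group with the
    canonical sentences from scraper.ingredients(), matched by position."""
    canonical = scraper.ingredients()
    regrouped, index = [], 0
    for group in html_groups:
        count = len(group.ingredients)
        regrouped.append(IngredientGroup(purpose=group.purpose,
                                         ingredients=canonical[index:index + count]))
        index += count
    return regrouped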

@jayaddison I would certainly appreciate any help you can offer. I will have a look at opening a draft PR in a few days or so and hopefully we can turn it into something useful.

@jayaddison (Collaborator)

I think we can consider this implemented by #781 - it's now available for a number of scrapers.
