Write a Better NLP Ingredient Parser for Binging with Babish Recipes #3

@jklewa

Ingredient parsing is currently implemented with regular expressions, which restricts how ingredient info can be written in a recipe. An interesting alternative would be natural language processing (NLP). The New York Times has an open-source project that uses NLP for ingredient parsing, and I've included a link to it. There are probably other interesting projects to consider.

Examples of current limitations:

  • All potential ingredient units must be known beforehand.
  • Any info found after a comma or in parentheses is simply ignored.
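
Both limitations can be reproduced with a few lines. The sketch below copies the two patterns from the "Current parsing method" section of this issue; `demo_parse` is a hypothetical helper written for illustration, and it mirrors only the cleaning and matching steps, not the rest of `parse_ingredient`.

```python
import re

# Patterns copied verbatim from the "Current parsing method" section.
units_pattern = r'(?:(\s?mg|g|kg|ml|L|oz|ounce|tbsp|Tbsp|tablespoon|tsp|teaspoon|cup|lb|pound|small|medium|large|whole|half)?(?:s|es)?\.?\b)'
full_pattern = r'^(?:([-\.\/\s0-9\u2150-\u215E\u00BC-\u00BE]+)?{UNITS_PATTERN})?(?:.*\sof\s)?\s?(.+?)(?:,|$)'.format(
    UNITS_PATTERN=units_pattern)
pattern = re.compile(full_pattern, flags=re.UNICODE)


def demo_parse(raw):
    """Hypothetical helper: strip parentheticals, then apply the pattern."""
    clean = re.sub(r'\(.+?\)', '', raw)
    match = pattern.match(clean)
    if match is None:
        return None
    qty, unit, name = match.groups()
    return (qty.strip() if qty else None, unit, name.strip())


# 'pinch' is not in units_pattern, so it ends up inside the name.
unknown_unit = demo_parse('1 pinch salt')

# '(sifted)' and ', divided' never reach the parsed result.
dropped_notes = demo_parse('2 cups flour (sifted), divided')

print(unknown_unit)
print(dropped_notes)
```

In the first case the unit comes back as `None` and the name as `'pinch salt'`; in the second, only `'flour'` survives as the name.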

Potential resources

Current parsing method

import re
from collections import namedtuple


class Recipe:

    Ingredient = namedtuple('Ingredient', 'qty unit name raw')
    units_pattern = r'(?:(\s?mg|g|kg|ml|L|oz|ounce|tbsp|Tbsp|tablespoon|tsp|teaspoon|cup|lb|pound|small|medium|large|whole|half)?(?:s|es)?\.?\b)'

    full_pattern = r'^(?:([-\.\/\s0-9\u2150-\u215E\u00BC-\u00BE]+)?{UNITS_PATTERN})?(?:.*\sof\s)?\s?(.+?)(?:,|$)'.format(
        UNITS_PATTERN=units_pattern)

    pattern = re.compile(full_pattern, flags=re.UNICODE)

    # https://en.wikipedia.org/wiki/Cooking_weights_and_measures#United_States_measures
    measures = {
        'drop': {'abrv': 'dr gt gtt', 'oz': 1.0 / 576},
        'smidgen': {'abrv': 'smdg smi', 'oz': 1.0 / 256},
        ...
    }

    @classmethod
    def parse_ingredient(cls, i):
        if not isinstance(i, str):
            if i.string is None:
                # multiple tags in child,
                # bs4 gets confused per https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
                s = ' '.join(i.stripped_strings)
            else:
                s = i.string.strip()
        else:
            s = i
        raw = s.replace('\xa0', '').strip()

        clean = re.sub(r'\(.+?\)', '', raw).replace('’', "'")

        parsed = cls.pattern.match(clean)
        if parsed:
            qty, unit, name = parsed.groups()
        ...

Full recipe parsing file ibdb/recipe_parser.py
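
The excerpt above elides what happens to `qty` after `parsed.groups()`. The tests listed in this issue expect strings such as '2 ½' and '2 1/2' to become 2.5, while a range such as '1-2' stays a string. Any replacement parser would need an equivalent step, so here is a minimal sketch of one way to do it. `to_quantity` is a hypothetical name, not taken from `recipe_parser.py`, and the sketch assumes the parts of a quantity are separated by whitespace.

```python
import unicodedata
from fractions import Fraction


def to_quantity(text):
    """Hypothetical helper: convert a captured quantity string to a float.

    Ranges such as '1-2' are returned unchanged as strings, matching the
    expectation in the tests listed in this issue.
    """
    text = text.strip()
    if not text:
        return None
    if '-' in text:
        return text
    total = Fraction(0)
    for part in text.split():
        if len(part) == 1 and not part.isdecimal():
            # Single vulgar-fraction character such as '½'.
            total += Fraction(unicodedata.numeric(part))
        else:
            # Plain integers, decimals, and 'a/b' fractions.
            total += Fraction(part)
    return float(total)


print(to_quantity('2 ½'))
print(to_quantity('1/2'))
print(to_quantity('1-2'))
```

Summing exact `Fraction` values and converting to `float` only at the end keeps mixed numbers such as '2 1/2' free of intermediate rounding.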

Current tests used for validation

# Tests to validate parse_ingredient()!!!
tests = [
    'Bread', (None, None, 'Bread'),
    '6 stalks celery', (6.0, None, 'Stalks Celery'),
    '4 eggs', (4.0, None, 'Eggs'),
    '2 ½ pounds of full fat cream cheese, cut', (2.5, 'pound', 'Full Fat Cream Cheese'),
    '25 oreos, finely processed', (25.0, None, 'Oreos'),
    '1-2 variable ingredients', ('1-2', None, 'Variable Ingredients'),
    '2 1/2 things', (2.5, None, 'Things'),
    '1/2 things', (0.5, None, 'Things'),
    '1 large, long sourdough loaf', (1.0, 'large', 'Long Sourdough Loaf'),
    '100ml Water', (100.0, 'ml', 'Water'),
    '1L Water', (1.0, 'L', 'Water')
]

Full test suite tests/ibdb/test_recipe_parser.py
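
The `tests` list above is flat: it alternates input strings with expected `(qty, unit, name)` tuples. A new parser could be checked against the same data by pairing the entries up first. The sketch below shows one way to do that on a shortened copy of the list; `iter_cases` is a hypothetical helper, and how the real test suite consumes the list is not shown in this issue.

```python
# Shortened copy of the flat list from this issue, for illustration only.
tests = [
    'Bread', (None, None, 'Bread'),
    '4 eggs', (4.0, None, 'Eggs'),
    '100ml Water', (100.0, 'ml', 'Water'),
]


def iter_cases(flat):
    """Hypothetical helper: pair each input string with its expected tuple."""
    if len(flat) % 2 != 0:
        raise ValueError('expected alternating input/expected entries')
    # Even indices are inputs, odd indices are expected results.
    return list(zip(flat[::2], flat[1::2]))


cases = iter_cases(tests)
for raw, expected in cases:
    print(repr(raw), '->', expected)
```

Each `(raw, expected)` pair can then be fed to whichever parser is under evaluation, for example through `pytest.mark.parametrize`.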
