Open
Labels
Hacktoberfest, enhancement (New feature or request), help wanted (Extra attention is needed)
Description
Ingredient parsing is currently implemented with regular expressions, which restricts how ingredient info can be written in a recipe. An interesting alternative would be natural language processing (NLP). The New York Times has an open-source project that used NLP for ingredient parsing, and I've included a link to it below. There are probably other projects worth considering.
Examples of current limitations:
- All potential ingredient units must be known beforehand.
- Any info found after a comma or in parentheses is simply ignored.
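To make the second limitation concrete, here is a minimal sketch of how the discarded text could be kept as a comment instead of being dropped. `SplitIngredient` and `split_comment` are hypothetical names used only for illustration; they are not part of the existing `Recipe` class, and the splitting rule is an assumption, not the project's behavior.

```python
import re
from typing import NamedTuple, Optional


class SplitIngredient(NamedTuple):
    """Hypothetical container, not part of the existing Recipe class."""
    main: str                # text that would still go through the regex
    comment: Optional[str]   # text that is currently thrown away


def split_comment(raw: str) -> SplitIngredient:
    # Collect every parenthesised aside, e.g. "(softened)".
    asides = re.findall(r'\(([^)]*)\)', raw)
    without_parens = re.sub(r'\s*\([^)]*\)', '', raw)
    # Assumption: everything after the first comma is a preparation note.
    main, _, trailing = without_parens.partition(',')
    notes = [n.strip() for n in asides + [trailing] if n.strip()]
    return SplitIngredient(main.strip(), '; '.join(notes) or None)


print(split_comment('1 cup butter (softened), cubed'))
```

A comma is not always a comment separator: the existing test '1 large, long sourdough loaf' expects the name to come from the text after the comma, so a rule like this would misfire there. That kind of ambiguity is part of the case for trying NLP.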
Potential resources
- Using NLP to parse ingredients + comments: https://github.com/NYTimes/ingredient-phrase-tagger
- Example use: https://rajmak.wordpress.com/tag/recipe-ingredients-tagging/
- Running in Docker: https://github.com/ArchSirius/docker-ingredient-phrase-tagger
Current parsing method
import re
from collections import namedtuple


class Recipe:
    Ingredient = namedtuple('Ingredient', 'qty unit name raw')
    units_pattern = r'(?:(\s?mg|g|kg|ml|L|oz|ounce|tbsp|Tbsp|tablespoon|tsp|teaspoon|cup|lb|pound|small|medium|large|whole|half)?(?:s|es)?\.?\b)'
    full_pattern = r'^(?:([-\.\/\s0-9\u2150-\u215E\u00BC-\u00BE]+)?{UNITS_PATTERN})?(?:.*\sof\s)?\s?(.+?)(?:,|$)'.format(
        UNITS_PATTERN=units_pattern)
    pattern = re.compile(full_pattern, flags=re.UNICODE)
    # https://en.wikipedia.org/wiki/Cooking_weights_and_measures#United_States_measures
    measures = {
        'drop': {'abrv': 'dr gt gtt', 'oz': 1.0 / 576},
        'smidgen': {'abrv': 'smdg smi', 'oz': 1.0 / 256},
        ...
    }

    @classmethod
    def parse_ingredient(cls, i):
        if not isinstance(i, str):
            if i.string is None:
                # multiple tags in child,
                # bs4 gets confused per https://www.crummy.com/software/BeautifulSoup/bs4/doc/#string
                s = ' '.join(i.stripped_strings)
            else:
                s = i.string.strip()
        else:
            s = i
        raw = s.replace('\xa0', '').strip()
        # Parenthesised text is removed here, before the pattern is applied
        clean = re.sub(r'\(.+?\)', '', raw).replace('’', "'")
        parsed = cls.pattern.match(clean)
        if parsed:
            qty, unit, name = parsed.groups()
            ...

Full recipe parsing file: ibdb/recipe_parser.py
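The excerpt is cut off before the captured quantity string is turned into a number, but the tests in this issue expect '2 ½' and '2 1/2' to become 2.5 and '1-2' to stay a string. The following is a hedged sketch of one way such a conversion could work; `to_quantity` is a hypothetical helper and is not taken from ibdb/recipe_parser.py.

```python
import unicodedata
from fractions import Fraction
from typing import Optional, Union


def to_quantity(text: Optional[str]) -> Union[float, str, None]:
    """Hypothetical helper: sum whitespace-separated numeric tokens."""
    if text is None or not text.strip():
        return None
    text = text.strip()
    total = Fraction(0)
    try:
        for token in text.split():
            if len(token) == 1 and not token.isdigit():
                # Single vulgar-fraction character such as '½'.
                total += Fraction(unicodedata.numeric(token))
            else:
                # Handles '2', '1/2' and '2.5'.
                total += Fraction(token)
    except (ValueError, TypeError, ZeroDivisionError):
        # Anything that is not a plain sum (e.g. the range '1-2')
        # is returned unchanged as a string.
        return text
    return float(total)


print(to_quantity('2 ½'), to_quantity('1-2'))
```

This sketch does not handle a fraction glued to a whole number without a space (for example '2½'); such input falls through to the string branch.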
Current tests used for validation
# Tests to validate parse_ingredient()!!!
tests = [
    'Bread', (None, None, 'Bread'),
    '6 stalks celery', (6.0, None, 'Stalks Celery'),
    '4 eggs', (4.0, None, 'Eggs'),
    '2 ½ pounds of full fat cream cheese, cut', (2.5, 'pound', 'Full Fat Cream Cheese'),
    '25 oreos, finely processed', (25.0, None, 'Oreos'),
    '1-2 variable ingredients', ('1-2', None, 'Variable Ingredients'),
    '2 1/2 things', (2.5, None, 'Things'),
    '1/2 things', (0.5, None, 'Things'),
    '1 large, long sourdough loaf', (1.0, 'large', 'Long Sourdough Loaf'),
    '100ml Water', (100.0, 'ml', 'Water'),
    '1L Water', (1.0, 'L', 'Water')
]

Full test suite: tests/ibdb/test_recipe_parser.py
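The `tests` list is flat, alternating an input string with its expected `(qty, unit, name)` tuple. A small sketch of how any candidate parser, regex-based or NLP-based, could be checked against that layout follows. `iter_cases`, `failures`, and the `parse` callable are hypothetical names, and the real suite in tests/ibdb/test_recipe_parser.py may be organized differently.

```python
from typing import Any, Callable, Iterator, List, Sequence, Tuple

Expected = Tuple[Any, Any, Any]


def iter_cases(flat: Sequence[Any]) -> Iterator[Tuple[str, Expected]]:
    """Pair up a flat [input, expected, input, expected, ...] list."""
    if len(flat) % 2:
        raise ValueError('expected an even number of items')
    yield from zip(flat[::2], flat[1::2])


def failures(parse: Callable[[str], Expected],
             flat: Sequence[Any]) -> List[Tuple[str, Expected, Expected]]:
    """Return (input, expected, actual) for every case the parser gets wrong."""
    return [(text, expected, parse(text))
            for text, expected in iter_cases(flat)
            if parse(text) != expected]


# Stand-in parser used only to demonstrate the harness.
def name_only(text: str) -> Expected:
    return (None, None, text.title())


sample = ['Bread', (None, None, 'Bread'),
          '4 eggs', (4.0, None, 'Eggs')]
print(failures(name_only, sample))
```

Running the same cases against both the current regex parser and an NLP-based one would give a direct comparison of where each approach fails.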