In [1]:
from datasets import load_dataset
from pprint import pprint
import datasets
from typing import List, Dict
import random
from transformers import BertTokenizerFast

In [2]:
dataset = load_dataset("m3hrdadfi/recipe_nlg_lite", split='train')

Reusing dataset recipe_nlg_lite (/Users/bking/.cache/huggingface/datasets/recipe_nlg_lite/1.0.0/1.0.0/2fd5f76dc1ed88ff2d6485b11497d6ae9516f4ebb2a6cb528dfaf0520bd8e51a)


In [3]:
print(dataset)
pprint(dataset.shape)

Dataset({
    features: ['uid', 'name', 'description', 'link', 'ner', 'ingredients', 'steps'],
    num_rows: 6118
})
(6118, 7)


### What keys are available on an example?


In [4]:
# What keys are available?
dataset[0].keys()  # not displayed: only a cell's last expression is echoed
pprint(dataset[0])  # pprint was already imported in cell 1


{'description': 'we all know how satisfying it is to make great pork '
                'tenderloin, ribs, or a roast but the end of the meal creates '
                'a new quandary what do you do with the leftover pork contrary '
                "to what you might think, it's not that difficult . how to "
                'repurpose your meal is where real cooking creativity comes '
                'into play, so let us present to you our favorite pork chop '
                "soup recipe . with this recipe, you'll discover how the "
                'natural bold flavor of pork gives this hearty soup a lift '
                "that a vegetable soup or chicken noodle soup just can't get . "
                "it's a dinner recipe to warm you up on a cold winter night or "
                'a midday restorative for a long work week . throw all the '
                'ingredients in a large pot and let it simmer on the stove for '
                'a couple hours, or turn it into a slow cooker 

### Qualities of an Example

#### `description`

The `description` field is a **single free-text string** describing the dish. It is likely the introductory copy that helps a recipe page surface in search indexes, and it gives some context for the recipe.

In [5]:
example = dataset[0]
assert isinstance(example['description'], str), f"It's actually {type(example['description'])}"

# note the long length
assert len(example['description']) == 2426, f"It's actually {len(example['description'])}"

#### `ingredients`

The `ingredients` field is a **single free-text string** listing the ingredients in the recipe. 

##### Question: what pre-processing occurred here?
- units are inconsistent
  - an ingredient **does not** need to have a unit (e.g. `salt` in the first example).
- ingredients are comma-separated

In [6]:
print(example['ingredients'])
assert isinstance(example['ingredients'], str), f"It's actually {type(example['ingredients'])}"

3.0 bone in pork chops, salt, pepper, 2.0 tablespoon vegetable oil, 2.0 cup chicken broth, 4.0 cup vegetable broth, 1.0 red onion, 4.0 carrots, 2.0 clove garlic, 1.0 teaspoon dried thyme, 0.5 teaspoon dried basil, 1.0 cup rotini pasta, 2.0 stalk celery


In [7]:
print(dataset[1]['ingredients'])

3 large eggs, 3 large egg whites, 2/3 cup sugar, 1 tsp pure vanilla extract, pinch fine sea salt, 2 cups 200 grams almond flour or almond meal, 1 cup mixed berries, powdered sugar


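Note that in this case the quantities have no decimal points (`3`, `2/3`), but the items are still comma-separated. Open question: how could we detect whether a comma that originally sat inside a single ingredient item was removed during pre-processing or left in place?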

#### `name`

`name` is the (string) name of the dish.

#### `ner`

The `ner` field is a **single free-text string** listing the named entities recognized in the recipe.

##### Question: what pre-processing occurred here?
- does `ner` line up one-to-one with the ingredients list?
  - **Yes**: for each item in `ner`, there is exactly one ingredient in `ingredients` (both lists have the same length after splitting on `,`).
- is each named entity always a substring of the recipe's `ingredients` string?
  - **Yes**: the NER parser does not insert new text.

In [8]:
print(example['ner'])

# is this only equivalent to the ingredients list?
for e in dataset:
    ing_list = e['ingredients'].split(',')
    ner_list = e['ner'].split(',')
    assert len(ing_list) == len(ner_list), f"Not the same. ingredients: {e['ingredients']}, ner: {e['ner']}"
    for ne in ner_list:
        # are named entities in each recipe always a substring of ingredients?
        assert ne in e['ingredients']

bone in pork chops, salt, pepper, vegetable oil, chicken broth, vegetable broth, red onion, carrots, garlic, dried thyme, dried basil, rotini pasta, celery


#### `steps`

The `steps` field is a **single free-text string** listing the steps for producing the recipe. Note: it is seemingly separated with `.` instead of `,`, since a single step might itself contain commas (for complex steps or added context).
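
A minimal sketch of how `steps` might be unpacked. It assumes the steps use the same space-padded ` . ` separator visible in the `description` output above; this is an assumption, since this notebook never prints a `steps` value. Splitting on a space-padded period avoids breaking on decimal quantities such as `2.0`. The helper name `split_steps` and the sample string are hypothetical.

```python
import re
from typing import List


def split_steps(steps: str) -> List[str]:
    """Split a steps string on periods that are preceded by whitespace.

    A period inside a number such as '2.0' has no whitespace before it,
    so it is not treated as a separator.
    """
    parts = re.split(r"\s+\.(?:\s+|$)", steps)
    # drop empty fragments left by a trailing separator
    return [p.strip() for p in parts if p.strip()]


# hypothetical sample text, not taken from the dataset
sample_steps = "season the pork chops with salt, pepper . heat 2.0 tablespoon oil in a pot . "
step_list = split_steps(sample_steps)
print(step_list)
```

If the real `steps` strings use a different separator, the regular expression would need to change accordingly.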



## Simple pre-processing

Goal: unpack each `ner` and `ingredients` string into a list of items, split on `,`

- use [`Dataset.map`](https://huggingface.co/docs/datasets/process.html#map)

In [9]:
help(dataset.map)

Help on method map in module datasets.arrow_dataset:

map(function: Union[Callable, NoneType] = None, with_indices: bool = False, input_columns: Union[str, List[str], NoneType] = None, batched: bool = False, batch_size: Union[int, NoneType] = 1000, drop_last_batch: bool = False, remove_columns: Union[str, List[str], NoneType] = None, keep_in_memory: bool = False, load_from_cache_file: bool = None, cache_file_name: Union[str, NoneType] = None, writer_batch_size: Union[int, NoneType] = 1000, features: Union[datasets.features.Features, NoneType] = None, disable_nullable: bool = False, fn_kwargs: Union[dict, NoneType] = None, num_proc: Union[int, NoneType] = None, suffix_template: str = '_{rank:05d}_of_{num_proc:05d}', new_fingerprint: Union[str, NoneType] = None, desc: Union[str, NoneType] = None) -> 'Dataset' method of datasets.arrow_dataset.Dataset instance
    Apply a function to all the elements in the table (individually or in batches)
    and update the table (if function does updat

In [10]:
def split_commas(e: Dict, keys: List[str]) -> Dict:
    """
    'a, b, c' -> ['a', ' b', ' c'] on all attributes in e under argument keys
    (note: the whitespace after each comma is kept)
    """
    for k in keys:
        e[k] = e[k].split(',')
    return e

split_data = dataset.map(lambda e: split_commas(e, ['ner', 'ingredients']))
split_data[0]

Loading cached processed dataset at /Users/bking/.cache/huggingface/datasets/recipe_nlg_lite/1.0.0/1.0.0/2fd5f76dc1ed88ff2d6485b11497d6ae9516f4ebb2a6cb528dfaf0520bd8e51a/cache-cb14d71c09ec9bb6.arrow


{'uid': 'dab8b7d0-e0f6-4bb0-aed9-346e80dace1f',
 'name': 'pork chop noodle soup',
 'description': "we all know how satisfying it is to make great pork tenderloin, ribs, or a roast but the end of the meal creates a new quandary what do you do with the leftover pork contrary to what you might think, it's not that difficult . how to repurpose your meal is where real cooking creativity comes into play, so let us present to you our favorite pork chop soup recipe . with this recipe, you'll discover how the natural bold flavor of pork gives this hearty soup a lift that a vegetable soup or chicken noodle soup just can't get . it's a dinner recipe to warm you up on a cold winter night or a midday restorative for a long work week . throw all the ingredients in a large pot and let it simmer on the stove for a couple hours, or turn it into a slow cooker recipe and let it percolate for an afternoon . this foolproof recipe transforms your favorite comfort food into an easy meal to warm you up again 

## A more realistic example: tokenization!


Here we'll tokenize the data so that it can support training of a transformer w/ `torch`. Specifically, we should (nearly) trivially be able to reproduce the `ner` as a target from the `ingredients`.

Here, we'll consider all `ingredients` in all recipes to be independent of each other and the containing recipe. As an exercise we'll include a shuffle step. We'll tokenize using BERT.
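
One possible way to frame "reproducing `ner` from `ingredients`" is as word-level labeling: mark which words of the ingredient string fall inside the named-entity span. The sketch below is only an illustration under that assumption, using plain whitespace splitting rather than BERT word pieces. The helper name `label_words` is hypothetical, and it only looks at the first occurrence of the entity text.

```python
from typing import List, Tuple


def label_words(ingredient: str, ner: str) -> List[Tuple[str, int]]:
    """Label each whitespace-separated word with 1 if it lies inside the ner span, else 0."""
    ingredient, ner = ingredient.strip(), ner.strip()
    start = ingredient.find(ner)  # first occurrence only
    if start == -1:
        raise ValueError(f"{ner!r} is not a substring of {ingredient!r}")
    end = start + len(ner)

    labels = []
    pos = 0
    for word in ingredient.split():
        w_start = ingredient.index(word, pos)  # character offset of this word
        w_end = w_start + len(word)
        labels.append((word, int(w_start >= start and w_end <= end)))
        pos = w_end
    return labels


labeled = label_words(" 2.0 tablespoon vegetable oil", " vegetable oil")
print(labeled)
```

Aligning these word labels to BERT's sub-word tokens would be a separate step that this sketch does not attempt.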

In [11]:
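def extract_ingredients_and_ner_pairs(example: Dict) -> Dict:
    """
    Given an example from the recipe dataset, pair each comma-separated
    ingredient with the named entity at the same position.
    """
    output = {'data': []}
    ingredients = example['ingredients'].split(',')
    ners = example['ner'].split(',')
    # cell 8 verified that both lists have the same length for every recipe
    for ing, ner in zip(ingredients, ners):
        output['data'].append({'ingredient': ing, 'ner': ner})
    return output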



In [12]:
extracted_data = dataset.map(extract_ingredients_and_ner_pairs, remove_columns=dataset.column_names)
extracted_data[0]

Loading cached processed dataset at /Users/bking/.cache/huggingface/datasets/recipe_nlg_lite/1.0.0/1.0.0/2fd5f76dc1ed88ff2d6485b11497d6ae9516f4ebb2a6cb528dfaf0520bd8e51a/cache-40ab3167e1859b88.arrow


{'data': [{'ingredient': '3.0 bone in pork chops',
   'ner': 'bone in pork chops'},
  {'ingredient': ' salt', 'ner': ' salt'},
  {'ingredient': ' pepper', 'ner': ' pepper'},
  {'ingredient': ' 2.0 tablespoon vegetable oil', 'ner': ' vegetable oil'},
  {'ingredient': ' 2.0 cup chicken broth', 'ner': ' chicken broth'},
  {'ingredient': ' 4.0 cup vegetable broth', 'ner': ' vegetable broth'},
  {'ingredient': ' 1.0 red onion', 'ner': ' red onion'},
  {'ingredient': ' 4.0 carrots', 'ner': ' carrots'},
  {'ingredient': ' 2.0 clove garlic', 'ner': ' garlic'},
  {'ingredient': ' 1.0 teaspoon dried thyme', 'ner': ' dried thyme'},
  {'ingredient': ' 0.5 teaspoon dried basil', 'ner': ' dried basil'},
  {'ingredient': ' 1.0 cup rotini pasta', 'ner': ' rotini pasta'},
  {'ingredient': ' 2.0 stalk celery', 'ner': ' celery'}]}
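
Every item after the first in the output above keeps the space that followed its comma (e.g. `' salt'`). A minimal sketch of a split that also trims that whitespace is shown below; the helper name `split_and_strip` is hypothetical and is not used elsewhere in this notebook.

```python
from typing import List


def split_and_strip(text: str, sep: str = ",") -> List[str]:
    """Split on sep and trim surrounding whitespace from every item."""
    # items are never dropped, so positions still line up between
    # an ingredients list and its ner list
    return [item.strip() for item in text.split(sep)]


items = split_and_strip("3.0 bone in pork chops, salt, pepper")
print(items)
```

Stripping does not affect the substring relationship between `ner` and `ingredient`, since both sides are trimmed the same way.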

#### Problem encountered

We were able to get the data we want, but unfortunately it is still packed into parent examples. Each example from the recipe set has a `data` attribute with a **`list`** of `(ingredient, ner)` pairs. A row-wise `map` like the one above returns exactly one output row per input row, so it cannot expand one example into many (a batched `map` may return a different number of rows than it receives, but we don't use that here). To get around this, we'll flatten the pairs into a dict of columns in plain Python and build a new dataset with `Dataset.from_dict`. We won't be able to use the parallel processing methods in `datasets`, though technically we weren't using them already.
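
For what it's worth, `Dataset.map` with `batched=True` passes the function a dict of columns (each a list) and allows the returned columns to have a different length than the input, which is one way to expand a row into many. The sketch below shows only the pure-Python batch function, run on a hand-made toy batch; `expand_pairs` and `toy_batch` are hypothetical names, and this is not the approach the notebook takes.

```python
from typing import Dict, List


def expand_pairs(batch: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Turn a batch of recipes into one output row per (ingredient, ner) pair."""
    out: Dict[str, List[str]] = {"ingredient": [], "ner": []}
    for ingredients, ners in zip(batch["ingredients"], batch["ner"]):
        # split exactly as the notebook does, keeping the space after each comma
        for ing, ne in zip(ingredients.split(","), ners.split(",")):
            out["ingredient"].append(ing)
            out["ner"].append(ne)
    return out


# toy batch in the dict-of-lists layout that a batched map passes in
toy_batch = {
    "ingredients": ["1.0 red onion, salt", "2 tsp vanilla"],
    "ner": ["red onion, salt", "vanilla"],
}
flat = expand_pairs(toy_batch)
print(flat)
```

With the real dataset this would be called as `dataset.map(expand_pairs, batched=True, remove_columns=dataset.column_names)`; the original columns have to be removed because their length no longer matches the expanded output.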

In [13]:
data = {'ingredient': [], 'ner': []}
for e in extracted_data:
    for pair in e['data']:
        data['ingredient'].append(pair['ingredient'])
        data['ner'].append(pair['ner'])
        
untokenized_ing_ner_pairs = datasets.Dataset.from_dict(data)
untokenized_ing_ner_pairs

Dataset({
    features: ['ingredient', 'ner'],
    num_rows: 62731
})

In [14]:
# inspect the dataset we constructed to verify we preserved pair relationships properly
for _ in range(8):
    # randrange excludes the upper bound; randint(0, len(...)) could index one past the end
    print(untokenized_ing_ner_pairs[random.randrange(len(untokenized_ing_ner_pairs))])

# deep check
for pair in untokenized_ing_ner_pairs:
    assert pair['ner'] in pair['ingredient']

{'ingredient': ' 1/4 cup red onion', 'ner': ' red onion'}
{'ingredient': " about 8 cups confectioners' sugar", 'ner': " confectioners' sugar"}
{'ingredient': ' 2 tsp vanilla', 'ner': ' vanilla'}
{'ingredient': ' 3.0 teaspoon soy sauce', 'ner': ' soy sauce'}
{'ingredient': ' olive oil spray', 'ner': ' olive oil spray'}
{'ingredient': ' 2.0 bananen', 'ner': ' bananen'}
{'ingredient': ' 1.0 bok choy', 'ner': ' bok choy'}
{'ingredient': ' 2 high fiber low carb whole wheat wraps', 'ner': ' high fiber low carb whole wheat wraps'}


In [15]:
# Finally, we'll tokenize it!
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')
tokenized_pairs = untokenized_ing_ner_pairs.map(lambda examples: tokenizer(examples['ingredient']), batched=True)


In [16]:
tokenized_pairs

Dataset({
    features: ['attention_mask', 'ingredient', 'input_ids', 'ner', 'token_type_ids'],
    num_rows: 62731
})

This concludes the data exploration and the experiments with the `datasets` processing tooling for now. The work continues in another notebook: [IngredientsToNer](./IngredientsToNer.ipynb)