
# Ingredient NER Tagging



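Before training, we had to assign an NER tag to every token in each recipe's directions so that ingredient mentions are marked. We used the IOB tagging format with three tags: *O* (not an ingredient), *B-ING* (first token of an ingredient), and *I-ING* (continuation of an ingredient).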

**Example tokens**: "Thoroughly", "cream", "shortening", ",", "sugar", "and", "vanilla", ".", "Beat", "in", "eggs", ",", "then", "chocolate", "."

**Example tags**: O, O, B-ING, O, B-ING, O, B-ING, O, O, O, B-ING, O, O, B-ING, O
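
To make the IOB scheme concrete, the sketch below reads the ingredient mentions back out of the example tokens and tags above. The helper `extract_spans` is a hypothetical name introduced for illustration; it is not part of this notebook.

```python
# Tokens and tags copied from the example above.
tokens = ["Thoroughly", "cream", "shortening", ",", "sugar", "and", "vanilla", ".",
          "Beat", "in", "eggs", ",", "then", "chocolate", "."]
tags = ["O", "O", "B-ING", "O", "B-ING", "O", "B-ING", "O",
        "O", "O", "B-ING", "O", "O", "B-ING", "O"]


def extract_spans(tokens, tags):
    """Group each B-ING token and its following I-ING tokens into one string.

    Hypothetical helper, for illustration only.
    """
    spans, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B-ING":
            # A new ingredient starts; close the previous one if it is open.
            if current:
                spans.append(" ".join(current))
            current = [token]
        elif tag == "I-ING" and current:
            # Continuation of the ingredient that is currently open.
            current.append(token)
        else:
            # An "O" tag (or a stray I-ING) closes any open ingredient.
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return spans


ingredients = extract_spans(tokens, tags)
print(ingredients)
# ['shortening', 'sugar', 'vanilla', 'eggs', 'chocolate']
```

A multi-word ingredient such as "brown sugar" would be tagged B-ING, I-ING and come back as a single string.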

Sources:
- https://huggingface.co/course/chapter7/2?fw=pt
- https://huggingface.co/datasets/recipe_nlg
- https://vkhangpham.medium.com/build-a-custom-ner-pipeline-with-hugging-face-a84d09e03d88

In [1]:
%%capture
!pip install transformers datasets

In [2]:
from datasets import load_dataset, DatasetDict, Sequence, ClassLabel, Value, load_metric
import pandas as pd
from google.colab import drive
import nltk
from nltk.tokenize import word_tokenize
import re
from transformers import AutoTokenizer
from ast import literal_eval
import numpy as np

nltk.download('punkt')

drive.mount('/content/drive')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Mounted at /content/drive


## Preprocess

In [3]:
!unzip "/content/drive/MyDrive/AIR/dataset_30k.zip" -d "./data"

Archive:  /content/drive/MyDrive/AIR/dataset_30k.zip
  inflating: ./data/dataset-test.csv  
  inflating: ./data/dataset-train.csv  
  inflating: ./data/dataset-valid.csv  


In [4]:
dataset = load_dataset('csv', data_files={'train': 'data/dataset-train.csv', 'valid': 'data/dataset-valid.csv', 'test': 'data/dataset-test.csv'})



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-e7bdcffcbc291a08/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-e7bdcffcbc291a08/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.



In [5]:
dataset

DatasetDict({
    train: Dataset({
        features: ['id', 'title', 'ingredients', 'directions', 'link', 'source', 'NER', '__index_level_0__'],
        num_rows: 24000
    })
    valid: Dataset({
        features: ['id', 'title', 'ingredients', 'directions', 'link', 'source', 'NER', '__index_level_0__'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['id', 'title', 'ingredients', 'directions', 'link', 'source', 'NER', '__index_level_0__'],
        num_rows: 3000
    })
})

In [6]:
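def process_row(row):
  # "directions" and "NER" are stored in the CSV as string-encoded lists, so parse them first
  directions = literal_eval(row["directions"])
  # Join all direction steps into one text and split it into word-level tokens
  row["recipe_tokenized"] = word_tokenize(" ".join(directions))
  row["NER"] = literal_eval(row["NER"])
  return row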
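
As a rough illustration of what this step does to a single cell, the sketch below parses a string-encoded list and tokenizes the joined text. The sample string is made up, and the regular expression is only a stand-in for `word_tokenize` (an assumption made so the snippet runs without the NLTK data); the two can differ on contractions, quotes, and similar cases.

```python
from ast import literal_eval
import re

# Hypothetical raw CSV cell: a Python-style list stored as one string.
raw_directions = '["Cream shortening and sugar.", "Beat in eggs."]'

# literal_eval turns the string back into a real list of direction steps.
steps = literal_eval(raw_directions)
text = " ".join(steps)

# Stand-in tokenizer (assumption): runs of word characters, or single
# punctuation marks. The notebook itself uses nltk's word_tokenize.
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)
# ['Cream', 'shortening', 'and', 'sugar', '.', 'Beat', 'in', 'eggs', '.']
```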

In [7]:
dataset["train"] = dataset["train"].map(process_row)
dataset["valid"] = dataset["valid"].map(process_row)
dataset["test"] = dataset["test"].map(process_row)


In [8]:
label_list = ["O", "B-ING", "I-ING"]
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {v: k for k, v in id2label.items()}
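
A quick sanity check of the two mappings, as a minimal sketch: the tag sequence below is a made-up example, and the point is only that encoding with `label2id` and decoding with `id2label` round-trips.

```python
label_list = ["O", "B-ING", "I-ING"]
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in id2label.items()}

# Made-up tag sequence for illustration.
tags = ["O", "B-ING", "I-ING", "O"]

ids = [label2id[tag] for tag in tags]      # tag names -> integer ids
decoded = [id2label[i] for i in ids]       # integer ids -> tag names

print(ids)
# [0, 1, 2, 0]
```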

In [9]:
def ner_tag(row):
  tokens = row["recipe_tokenized"]
  labels = ["O"] * len(tokens)
  ingredients = set(row["NER"])
  ingredient_indices = dict()

  # Map each ingredient word to its position within the ingredient name:
  # position 0 is the first word (B-ING), any later position is a continuation (I-ING)
  for ingredient in ingredients:
    for i, word in enumerate(ingredient.split()):
      ingredient_indices[word] = i

  # Tag every token whose lowercase form is a known ingredient word
  for i, token in enumerate(tokens):
    token = token.lower()
    if token in ingredient_indices:
      if ingredient_indices[token] == 0:
        labels[i] = "B-ING"
      else:
        labels[i] = "I-ING"

  # Serialize the list columns as strings so they can be written to CSV
  row["label_names"] = "[" + ",".join([f'"{label}"' for label in labels]) + "]"
  row["labels"] = "[" + ",".join([str(label2id[label]) for label in labels]) + "]"
  row["recipe_tokenized"] = "[" + ",".join([f'"{token}"' for token in tokens]) + "]"
  row["NER"] = "[" + ",".join([f'"{ner}"' for ner in row["NER"]]) + "]"

  return row
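
Because `ingredient_indices` is keyed by single words, a word that occurs in more than one ingredient (for example "sugar" in both "sugar" and "brown sugar") keeps the position from whichever ingredient was processed last, so its tag depends on the iteration order of the set. The sketch below shows one possible alternative that matches whole ingredient phrases instead. It is not the method used in this notebook; `tag_phrases` is a hypothetical helper, and the tokens are a made-up example.

```python
def tag_phrases(tokens, ingredients):
    """Label tokens by matching whole ingredient phrases.

    Hypothetical alternative to the word-level lookup, for illustration only.
    """
    lowered = [token.lower() for token in tokens]
    labels = ["O"] * len(tokens)

    # Try longer phrases first so "brown sugar" is matched before "sugar".
    phrases = sorted({tuple(name.lower().split()) for name in ingredients},
                     key=len, reverse=True)

    for phrase in phrases:
        n = len(phrase)
        if n == 0:
            continue
        for start in range(len(tokens) - n + 1):
            # Only label windows that no earlier (longer) phrase has claimed.
            window_free = all(label == "O" for label in labels[start:start + n])
            if window_free and tuple(lowered[start:start + n]) == phrase:
                labels[start] = "B-ING"
                for k in range(start + 1, start + n):
                    labels[k] = "I-ING"
    return labels


tokens = ["Add", "brown", "sugar", ",", "then", "sugar", "."]
labels = tag_phrases(tokens, ["brown sugar", "sugar"])
print(labels)
# ['O', 'B-ING', 'I-ING', 'O', 'O', 'B-ING', 'O']
```

The trade-off is that phrase matching only fires when the full ingredient name appears contiguously in the directions, whereas the word-level lookup also tags partial mentions.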

In [10]:
dataset["train"] = dataset["train"].map(ner_tag)
dataset["valid"] = dataset["valid"].map(ner_tag)
dataset["test"] = dataset["test"].map(ner_tag)


In [11]:
dataset["train"].to_csv("dataset-train-tagged.csv", index=None)
dataset["valid"].to_csv("dataset-valid-tagged.csv", index=None)
dataset["test"].to_csv("dataset-test-tagged.csv", index=None)

!zip "dataset_tagged_30k.zip" "dataset-train-tagged.csv" "dataset-valid-tagged.csv" "dataset-test-tagged.csv"

  adding: dataset-train-tagged.csv (deflated 78%)
  adding: dataset-valid-tagged.csv (deflated 78%)
  adding: dataset-test-tagged.csv (deflated 78%)
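
Since the list columns are written as strings, anything that reads the tagged CSVs later has to parse them back into lists. A minimal sketch of that step follows, using an in-memory one-row CSV with made-up values in the same shape as the `recipe_tokenized` and `labels` columns; it assumes the tokens contain no double quotes or backslashes, which would break the hand-built string format.

```python
import csv
import io
from ast import literal_eval

# Hypothetical one-row CSV shaped like the tagged files (values are made up).
# Inner double quotes are doubled, as the CSV format requires.
csv_text = 'recipe_tokenized,labels\n"[""Beat"",""in"",""eggs""]","[0,0,1]"\n'

row = next(csv.DictReader(io.StringIO(csv_text)))

# Each cell is still a string such as '["Beat","in","eggs"]'; parse it back.
tokens = literal_eval(row["recipe_tokenized"])
labels = literal_eval(row["labels"])

print(list(zip(tokens, labels)))
# [('Beat', 0), ('in', 0), ('eggs', 1)]
```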
