
# Ingredient NER extraction training

To extract ingredients from recipe text with named-entity recognition (NER), we fine-tune a pretrained BERT model from Hugging Face for token classification.

Sources:
- https://huggingface.co/course/chapter7/2?fw=pt
- https://huggingface.co/datasets/recipe_nlg
- https://vkhangpham.medium.com/build-a-custom-ner-pipeline-with-hugging-face-a84d09e03d88

In [None]:
%%capture
!pip install transformers datasets seqeval wandb

In [None]:
from datasets import load_dataset, DatasetDict, Sequence, ClassLabel, Value, load_metric
import pandas as pd
from google.colab import drive
import nltk
from nltk.tokenize import word_tokenize
import re
from transformers import AutoTokenizer
from ast import literal_eval
import numpy as np

nltk.download('punkt')

drive.mount('/content/drive')

Log in to Weights & Biases (wandb) for experiment tracking

In [35]:
!wandb login

[34m[1mwandb[0m: Logging into wandb.ai. (Learn how to deploy a W&B server locally: https://wandb.me/wandb-server)
[34m[1mwandb[0m: You can find your API key in your browser here: https://wandb.ai/authorize
[34m[1mwandb[0m: Paste an API key from your profile and hit enter, or press ctrl+c to quit: 
[34m[1mwandb[0m: Appending key for api.wandb.ai to your netrc file: /root/.netrc


Extract the dataset archive from Google Drive into `./data`

In [4]:
!unzip "/content/drive/MyDrive/AIR/dataset_tagged_30k.zip" -d "./data"

Archive:  /content/drive/MyDrive/AIR/dataset_tagged_30k.zip
  inflating: ./data/dataset-train-tagged.csv  
  inflating: ./data/dataset-valid-tagged.csv  
  inflating: ./data/dataset-test-tagged.csv  


Load dataset from extracted files

In [5]:
dataset = load_dataset('csv', data_files={'train': 'data/dataset-train-tagged.csv', 'valid': 'data/dataset-valid-tagged.csv', 'test': 'data/dataset-test-tagged.csv'})



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-d1ed193a4140fd05/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...



Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-d1ed193a4140fd05/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.



In [6]:
dataset["test"][0]

{'id': 362292,
 'title': 'Chocolate Crinkles',
 'ingredients': '["1/2 c. shortening", "1 2/3 c. sugar", "2 Tbsp. vanilla", "2 eggs", "2 squares chocolate, melted", "2 c. flour", "2 tsp. baking powder", "1/2 tsp. salt", "1/3 c. milk", "1/2 c. nuts"]',
 'directions': '["Thoroughly cream shortening, sugar and vanilla.", "Beat in eggs, then chocolate.", "Sift together dry ingredients.", "Blend in with milk; add nuts.", "Chill 3 hours; form in 1-inch balls and roll in powdered sugar.", "Place on greased cookie sheet 2 to 3 inches apart.", "Bake at 350\\u00b0 for 15 minutes.", "Cool slightly and remove from pan.", "Makes 4 dozen."]',
 'link': 'www.cookbooks.com/Recipe-Details.aspx?id=270058',
 'source': 'Gathered',
 'NER': '["shortening","sugar","vanilla","eggs","chocolate","flour","baking powder","salt","milk","nuts"]',
 '__index_level_0__': 362292,
 'recipe_tokenized': '["Thoroughly","cream","shortening",",","sugar","and","vanilla",".","Beat","in","eggs",",","then","chocolate",".","Sift","

Convert the labels from their string representation to Python lists

In [7]:
def process_tagged_rows(row):
  row["labels"] = literal_eval(row["labels"])
  return row
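
As a small illustration of what `literal_eval` does in this step, the sketch below parses a made-up label string into a Python list. The value of `raw_labels` is an assumption for demonstration only; it is not taken from the dataset.

```python
from ast import literal_eval

# Hypothetical CSV cell: the labels are stored as the string form of a list.
raw_labels = "[0, 1, 2, 0]"

# literal_eval only evaluates Python literals, so it is safer than eval().
parsed = literal_eval(raw_labels)

print(type(raw_labels).__name__, "->", type(parsed).__name__)
print(parsed)
```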

In [8]:
dataset["train"] = dataset["train"].map(process_tagged_rows)
dataset["valid"] = dataset["valid"].map(process_tagged_rows)
dataset["test"] = dataset["test"].map(process_tagged_rows)


Define labels and label-index mapping

In [10]:
label_list = ["O", "B-ING", "I-ING"]
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {v: k for k, v in id2label.items()}
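
To make the tagging scheme concrete, here is a minimal sketch of how a sequence of label ids could be decoded back into ingredient strings: `B-ING` opens an ingredient, `I-ING` continues it, and `O` is everything else. The helper `decode_ingredients` and the example tagging are hypothetical and not part of the original notebook.

```python
label_list = ["O", "B-ING", "I-ING"]
id2label = {i: label for i, label in enumerate(label_list)}

def decode_ingredients(tokens, label_ids):
    """Group B-ING/I-ING tokens into ingredient strings (illustrative helper)."""
    ingredients, current = [], []
    for token, label_id in zip(tokens, label_ids):
        label = id2label[label_id]
        if label == "B-ING":
            # A new ingredient starts; close the one that was open, if any.
            if current:
                ingredients.append(" ".join(current))
            current = [token]
        elif label == "I-ING" and current:
            # Continuation of the ingredient that is currently open.
            current.append(token)
        else:
            # "O" (or a stray I-ING) closes any open ingredient.
            if current:
                ingredients.append(" ".join(current))
            current = []
    if current:
        ingredients.append(" ".join(current))
    return ingredients

# Made-up tagging of a phrase in the style of the recipe data.
tokens = ["Sift", "baking", "powder", "and", "salt", "."]
label_ids = [0, 1, 2, 0, 1, 0]
print(decode_ingredients(tokens, label_ids))
```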

Cast the "labels" column to a sequence of `ClassLabel` values

In [11]:
new_features = dataset["train"].features.copy()
new_features["labels"] = Sequence(feature=ClassLabel(num_classes=len(label_list), names=label_list, names_file=None, id=None), length=-1, id=None)
dataset = dataset.cast(new_features)


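Because the tokenizer splits words into subwords, the word-level labels have to be aligned with the resulting tokens using `word_ids`. Special tokens get the label -100 so that the loss ignores them, and a `B-ING` label on a continuation subword is changed to `I-ING`.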

In [12]:
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

def align_labels_with_tokens(labels, word_ids):
  new_labels = []
  current_word = None
  for word_id in word_ids:
    if word_id != current_word: # Start of a new word
      current_word = word_id
      label = -100 if word_id is None else labels[word_id]
      new_labels.append(label)
    elif word_id is None: # Special token
      new_labels.append(-100)
    else: # Same word as previous token
      label = labels[word_id]
      if label == 1: # If label is B-ING we change it to I-ING
        label = 2
      new_labels.append(label)

  return new_labels
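
The alignment can be traced by hand on a short example. The sketch below repeats the function so that it runs on its own and uses a hand-written `word_ids` list; the subword split (`shortening` into two pieces) is an assumption for illustration, and the real tokenizer may split differently.

```python
def align_labels_with_tokens(labels, word_ids):
    new_labels = []
    current_word = None
    for word_id in word_ids:
        if word_id != current_word:  # Start of a new word
            current_word = word_id
            new_labels.append(-100 if word_id is None else labels[word_id])
        elif word_id is None:  # Special token
            new_labels.append(-100)
        else:  # Same word as the previous token
            label = labels[word_id]
            new_labels.append(2 if label == 1 else label)  # B-ING -> I-ING
    return new_labels

# Words: ["baking", "powder", "shortening"] tagged B-ING, I-ING, B-ING.
word_labels = [1, 2, 1]
# Assumed tokens: [CLS] baking powder short ##ening [SEP]
word_ids = [None, 0, 1, 2, 2, None]

aligned = align_labels_with_tokens(word_labels, word_ids)
print(aligned)
```

The second piece of `shortening` receives `I-ING` (2) rather than a second `B-ING`, and both special tokens receive -100.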


In [13]:
def tokenize_and_align_labels(row):
  try:
    tokens = literal_eval(row["recipe_tokenized"])
  except (ValueError, SyntaxError):
    # A lone backslash token breaks parsing; drop it and parse again
    tokens = row["recipe_tokenized"].replace(',"\\",', ",")
    tokens = literal_eval(tokens)
  tokenized_inputs = tokenizer(tokens, truncation=True, is_split_into_words=True)
  # Expand the word-level labels to subword tokens and return them with the encoding
  word_ids = tokenized_inputs.word_ids()
  tokenized_inputs["labels"] = align_labels_with_tokens(row["labels"], word_ids)
  return tokenized_inputs

In [14]:
# Columns that the model does not consume
columns_to_remove = ["id", "title", "ingredients", "directions", "link", "source", "NER", "__index_level_0__", "recipe_tokenized", "label_names"]

# DatasetDict.map applies the function to every split (train, valid, test)
tokenized_dataset = dataset.map(
    tokenize_and_align_labels,
    remove_columns=columns_to_remove,
)


In [15]:
tokenized_dataset

DatasetDict({
    train: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 24000
    })
    valid: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3000
    })
    test: Dataset({
        features: ['labels', 'input_ids', 'token_type_ids', 'attention_mask'],
        num_rows: 3000
    })
})

Define `data_collator`, which pads the inputs and labels of each batch to the same length (labels are padded with -100, which the loss ignores)

In [16]:
from transformers import DataCollatorForTokenClassification

data_collator = DataCollatorForTokenClassification(tokenizer=tokenizer)
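
The label padding can be pictured with plain Python lists. This is a simplified sketch, not the collator's implementation: the real `DataCollatorForTokenClassification` also pads `input_ids` and `attention_mask` through the tokenizer and returns tensors. The helper `pad_labels` is hypothetical.

```python
def pad_labels(batch_labels, pad_id=-100):
    """Right-pad every label list to the longest one in the batch."""
    max_len = max(len(labels) for labels in batch_labels)
    return [labels + [pad_id] * (max_len - len(labels)) for labels in batch_labels]

# Two already-aligned examples of different lengths (made-up values).
batch = [
    [-100, 1, 2, 0, -100],
    [-100, 0, -100],
]
padded = pad_labels(batch)
for row in padded:
    print(row)
```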

Define a function for computing evaluation metrics

In [25]:
metric = load_metric("seqeval")

def compute_metrics(eval_preds):
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)

    # Remove ignored index (special tokens) and convert to labels
    true_labels = [[id2label[l] for l in label if l != -100] for label in labels]
    true_predictions = [
        [id2label[p] for (p, l) in zip(prediction, label) if l != -100]
        for prediction, label in zip(predictions, labels)
    ]
    all_metrics = metric.compute(predictions=true_predictions, references=true_labels)
    return {
        "precision": all_metrics["overall_precision"],
        "recall": all_metrics["overall_recall"],
        "f1": all_metrics["overall_f1"],
        "accuracy": all_metrics["overall_accuracy"],
    }
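
seqeval scores whole entities rather than single tokens, which is why precision, recall, and F1 can be much lower than token accuracy. The sketch below illustrates the idea on made-up tags. It is a simplification under the assumption that every entity starts with `B-ING`, and seqeval's handling of malformed sequences differs; the helper `extract_spans` is hypothetical.

```python
def extract_spans(tags):
    """Return the set of (start, end) index pairs of B-ING/I-ING spans."""
    spans, start = set(), None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing span
        if start is not None and tag != "I-ING":
            spans.add((start, i))
            start = None
        if tag == "B-ING":
            start = i
    return spans

true_tags = ["O", "B-ING", "I-ING", "O", "B-ING"]
pred_tags = ["O", "B-ING", "O", "O", "B-ING"]  # the first entity is cut short

true_spans = extract_spans(true_tags)
pred_spans = extract_spans(pred_tags)
correct = len(true_spans & pred_spans)  # a span counts only if it matches exactly

precision = correct / len(pred_spans)
recall = correct / len(true_spans)
f1 = 2 * precision * recall / (precision + recall)
accuracy = sum(t == p for t, p in zip(true_tags, pred_tags)) / len(true_tags)

print(f"precision={precision} recall={recall} f1={f1} accuracy={accuracy}")
```

One wrong token out of five leaves token accuracy at 0.8, but it invalidates one of the two entities, so the entity-level F1 drops to 0.5.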

Load the pretrained model and pass it the label-index mappings

In [26]:
from transformers import AutoModelForTokenClassification

model = AutoModelForTokenClassification.from_pretrained(
    'bert-base-uncased',  # must match the tokenizer checkpoint
    id2label=id2label,
    label2id=label2id,
)

model.config.num_labels


Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForTokenClassification: ['cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.seq_relationship.weight', 'cls.predictions.bias', 'cls.predictions.transform.dense.bias', 'cls.predictions.decoder.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of BertForTokenClassification were not initialized from the model checkpoint at bert-base-u

3

Define training arguments

In [43]:
import wandb
from transformers import TrainingArguments

wandb.init(project="ingredient-NER", entity="food-analysis")

args = TrainingArguments(
    "bert-finetuned-ner",
    evaluation_strategy="epoch",
    save_strategy="epoch",
    learning_rate=2e-5,
    num_train_epochs=3,
    weight_decay=0.01,
    report_to="wandb",
)

PyTorch: setting up devices


Start the training

In [44]:
from transformers import Trainer

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["valid"],
    data_collator=data_collator,
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)
trainer.train()

***** Running training *****
  Num examples = 24000
  Num Epochs = 3
  Instantaneous batch size per device = 8
  Total train batch size (w. parallel, distributed & accumulation) = 8
  Gradient Accumulation steps = 1
  Total optimization steps = 9000
  Number of trainable parameters = 108893955
Automatic Weights & Biases logging enabled, to disable set os.environ["WANDB_DISABLED"] = "true"


| Epoch | Training Loss | Validation Loss | Precision | Recall | F1 | Accuracy |
|---|---|---|---|---|---|---|
| 1 | 0.0875 | 0.10927 | 0.774387 | 0.815025 | 0.794186 | 0.958171 |
| 2 | 0.0838 | 0.109692 | 0.767768 | 0.841551 | 0.802968 | 0.95854 |
| 3 | 0.0748 | 0.115475 | 0.778324 | 0.830623 | 0.803624 | 0.959445 |


***** Running Evaluation *****
  Num examples = 3000
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-3000
Configuration saved in bert-finetuned-ner/checkpoint-3000/config.json
Model weights saved in bert-finetuned-ner/checkpoint-3000/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-3000/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-3000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 8
Saving model checkpoint to bert-finetuned-ner/checkpoint-6000
Configuration saved in bert-finetuned-ner/checkpoint-6000/config.json
Model weights saved in bert-finetuned-ner/checkpoint-6000/pytorch_model.bin
tokenizer config file saved in bert-finetuned-ner/checkpoint-6000/tokenizer_config.json
Special tokens file saved in bert-finetuned-ner/checkpoint-6000/special_tokens_map.json
***** Running Evaluation *****
  Num examples = 3000
  Batch size = 8
Saving model checkpoin

TrainOutput(global_step=9000, training_loss=0.081714171939426, metrics={'train_runtime': 4069.423, 'train_samples_per_second': 17.693, 'train_steps_per_second': 2.212, 'total_flos': 1.045057661258976e+16, 'train_loss': 0.081714171939426, 'epoch': 3.0})
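
The step counts and throughput figures in the log above follow from the dataset size, the per-device batch size of 8, and the number of epochs. A quick check of the arithmetic, using only numbers that appear in the output:

```python
import math

num_examples = 24000      # "Num examples" in the training log
batch_size = 8            # "Total train batch size"
num_epochs = 3
train_runtime = 4069.423  # seconds, from TrainOutput

steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
steps_per_second = round(total_steps / train_runtime, 3)
samples_per_second = round(num_examples * num_epochs / train_runtime, 3)

print(steps_per_epoch, total_steps)          # one checkpoint per epoch: 3000, 6000, 9000
print(steps_per_second, samples_per_second)
```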

Save the model

In [45]:
trainer.save_model("./model")

Saving model checkpoint to ./model
Configuration saved in ./model/config.json
Model weights saved in ./model/pytorch_model.bin
tokenizer config file saved in ./model/tokenizer_config.json
Special tokens file saved in ./model/special_tokens_map.json
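
Once saved, the model directory can be loaded again for inference. The following is a minimal sketch, assuming `transformers` is installed and `./model` contains the files listed above. The function name `load_ner_pipeline` is hypothetical, and the import sits inside the function so that defining it does not require the library.

```python
def load_ner_pipeline(model_dir="./model"):
    """Build a token-classification pipeline from a saved model directory."""
    from transformers import pipeline  # imported lazily; needs transformers installed

    # aggregation_strategy="simple" merges subword predictions into whole entities
    return pipeline(
        "token-classification",
        model=model_dir,
        aggregation_strategy="simple",
    )

# Example usage (requires the saved model directory):
# ner = load_ner_pipeline()
# for entity in ner("Beat in eggs, then chocolate."):
#     print(entity["entity_group"], entity["word"], float(entity["score"]))
```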


In [46]:
!zip model.zip model/*

  adding: model/config.json (deflated 49%)
  adding: model/pytorch_model.bin (deflated 7%)
  adding: model/special_tokens_map.json (deflated 42%)
  adding: model/tokenizer_config.json (deflated 41%)
  adding: model/tokenizer.json (deflated 71%)
  adding: model/training_args.bin (deflated 48%)
  adding: model/vocab.txt (deflated 53%)


In [48]:
!du -sh model.zip

385M	model.zip
