This notebook is a playground for trying out the CookBERT model. Enjoy!

# Connect to Google Drive

In [None]:
from google.colab import drive 
drive.mount('/content/drive')
%cd /content/drive/MyDrive/BachelorThesis/

Mounted at /content/drive
/content/drive/MyDrive/BachelorThesis


# Installations and Imports

In [None]:
!pip install transformers

Successfully installed huggingface-hub-0.4.0 pyyaml-6.0 sacremoses-0.0.49 tokenizers-0.11.6 transformers-4.17.0


In [None]:
from transformers import (
    BertTokenizerFast,
    BertForTokenClassification, 
    BertForSequenceClassification,
    BertForMaskedLM,
    BertForQuestionAnswering,
    pipeline  # note: some pipeline features (e.g. NER aggregation strategies) need a fast tokenizer, which CookBERT uses
)

# Masked Language Modeling (MLM)

In this task, the models predict a word for the [MASK] token. This is (together with 'Next Sentence Prediction') the task that the original BERT model was pretrained on, and it's also the task that was used for the domain-adaptive pre-training of BERT for the cooking domain.

1. Load CookBERT (and BERT-base-uncased for comparisons) for MLM:

In [None]:
# init CookBERT and its tokenizer
MLM_CookBERT_tokenizer = BertTokenizerFast.from_pretrained("CookBERT/further_pretraining/model_output/checkpoint-final")
MLM_CookBERT = BertForMaskedLM.from_pretrained("CookBERT/further_pretraining/model_output/checkpoint-final")
# init BERT-base-uncased and its tokenizer
MLM_BERT_tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
MLM_BERT = BertForMaskedLM.from_pretrained("bert-base-uncased")

# init pipelines
MLM_CookBERT_pipeline = pipeline("fill-mask", model=MLM_CookBERT, tokenizer=MLM_CookBERT_tokenizer)
MLM_BERT_pipeline = pipeline("fill-mask", model=MLM_BERT, tokenizer=MLM_BERT_tokenizer)

Some weights of the model checkpoint at bert-base-uncased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.bias', 'cls.seq_relationship.weight']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


2. Predict word for [MASK] token:

In [None]:
text_to_fill = "Cut the [MASK] into small pieces."
print("CookBERT: \t\t", MLM_CookBERT_pipeline(text_to_fill, top_k=5)) # change top_k to show the top k predictions
print("BERT-base-uncased: \t", MLM_BERT_pipeline(text_to_fill, top_k=5))

CookBERT: 		 [{'score': 0.06830091774463654, 'token': 7975, 'token_str': 'chicken', 'sequence': 'cut the chicken into small pieces.'}, {'score': 0.061225246638059616, 'token': 8808, 'token_str': 'cheese', 'sequence': 'cut the cheese into small pieces.'}, {'score': 0.04978346452116966, 'token': 5909, 'token_str': 'fruit', 'sequence': 'cut the fruit into small pieces.'}, {'score': 0.03314376249909401, 'token': 28540, 'token_str': 'cabbage', 'sequence': 'cut the cabbage into small pieces.'}, {'score': 0.02960268408060074, 'token': 24165, 'token_str': 'sausage', 'sequence': 'cut the sausage into small pieces.'}]
BERT-base-uncased: 	 [{'score': 0.063535176217556, 'token': 3536, 'token_str': 'wood', 'sequence': 'cut the wood into small pieces.'}, {'score': 0.04211747646331787, 'token': 3259, 'token_str': 'paper', 'sequence': 'cut the paper into small pieces.'}, {'score': 0.021924778819084167, 'token': 3727, 'token_str': 'leaves', 'sequence': 'cut the leaves into small pieces.'}, {'score': 0.
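The raw pipeline output is hard to compare by eye. The following is a minimal sketch of one way to condense it, using the fully visible predictions from the output above as plain dictionaries. The helper names `summarize` and `shared_tokens` are hypothetical and not part of the notebook or of `transformers`.

```python
# Top predictions copied from the output above (only the fully visible entries).
cookbert_preds = [
    {"score": 0.06830091774463654, "token_str": "chicken"},
    {"score": 0.061225246638059616, "token_str": "cheese"},
    {"score": 0.04978346452116966, "token_str": "fruit"},
]
bert_preds = [
    {"score": 0.063535176217556, "token_str": "wood"},
    {"score": 0.04211747646331787, "token_str": "paper"},
    {"score": 0.021924778819084167, "token_str": "leaves"},
]


def summarize(preds):
    """Reduce fill-mask predictions to (token, rounded score) pairs."""
    return [(p["token_str"], round(p["score"], 3)) for p in preds]


def shared_tokens(preds_a, preds_b):
    """Return the set of tokens that both models predicted."""
    return {p["token_str"] for p in preds_a} & {p["token_str"] for p in preds_b}


print("CookBERT:          ", summarize(cookbert_preds))
print("BERT-base-uncased: ", summarize(bert_preds))
print("Shared tokens:     ", shared_tokens(cookbert_preds, bert_preds))
```

In the notebook itself, the lists returned by `MLM_CookBERT_pipeline(...)` and `MLM_BERT_pipeline(...)` could be passed to these helpers directly, since they have the same dictionary keys.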

# Text Classification

In this task, the models classify a user utterance based on the underlying (cooking) information need. This is a multi-class classification problem with 11 different classes in total. The models were finetuned on the cookversational-Search dataset (see paper ["'What Can I Cook with these Ingredients?' - Understanding Cooking-Related Information Needs in Conversational Search"](https://dl.acm.org/doi/full/10.1145/3498330) by Alexander Frummet, David Elsweiler and Bernd Ludwig (2022)).

1. Load CookBERT for text/sequence classification:

In [None]:
# init CookBERT and its tokenizer for text classification
CL_CookBERT_tokenizer = BertTokenizerFast.from_pretrained("CookBERT/finetuning_for_downstream_tasks/text_classification/model_output/CookBERT/no_context")
CL_CookBERT = BertForSequenceClassification.from_pretrained("CookBERT/finetuning_for_downstream_tasks/text_classification/model_output/CookBERT/no_context")
print(f"Task: {CL_CookBERT.num_labels}-class classification") # print the number of labels the text is classified into

# init pipeline
CL_CookBERT_pipeline = pipeline("text-classification", model=CL_CookBERT, tokenizer=CL_CookBERT_tokenizer)

Task: 11-class classification


2. Classify text:

In [None]:
text_to_classify = "Can you tell me again, what temperature does the chicken need to cook at?"
print("Information need: ", CL_CookBERT_pipeline(text_to_classify))

Information need:  [{'label': 'Temperature', 'score': 0.7460580468177795}]
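For a single-label task like this one, the `score` in the output above is a softmax probability over the class logits. The sketch below illustrates that step in plain Python under stated assumptions: the logits are made up, and apart from `Temperature` the label names are hypothetical placeholders (the real model has 11 classes).

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    largest = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - largest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_label(logits, id2label):
    """Return the most probable label in the same shape the pipeline prints."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return {"label": id2label[best], "score": probs[best]}


# Hypothetical 3-class example; only "Temperature" appears in the notebook output.
id2label = {0: "Temperature", 1: "Time", 2: "Ingredient"}
toy_logits = [2.0, 0.5, -1.0]

prediction = top_label(toy_logits, id2label)
print(prediction)
```

A score of about 0.75, as in the output above, therefore means that roughly a quarter of the probability mass is spread over the other ten classes.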


# Question Answering (QA)

This task is about extracting the passage of a given context that contains the answer to a given question. The models were finetuned on the DoQA dataset (see paper ["DoQA -- Accessing Domain-Specific FAQs via Conversational QA"](https://arxiv.org/abs/2005.01328) by Campos et al. (2020)).

Note: the models were finetuned only on answerable questions and thus expect the context to contain the answer to the question.

1. Load CookBERT for QA:

In [None]:
# init CookBERT and its tokenizer for QA
QA_CookBERT = BertForQuestionAnswering.from_pretrained("CookBERT/finetuning_for_downstream_tasks/question_answering/model_output/CookBERT")
QA_CookBERT_tokenizer = BertTokenizerFast.from_pretrained("CookBERT/finetuning_for_downstream_tasks/question_answering/model_output/CookBERT")

# init QA pipeline
QA_CookBERT_pipeline = pipeline("question-answering", model=QA_CookBERT, tokenizer=QA_CookBERT_tokenizer)

2. Extract answer to question from context:

In [None]:
context = "Rare duck meat is safe to eat because it does NOT contain the same risk of Salmonella as does chicken meat.Primarily because ducks, as mentioned above, have not traditionally been raised in the same squalid conditions as 'factory raised' chickens - salmonella is a disease that is primarily transmitted through dirt/dirty unclean conditions. Now, on the other hand, as more and more ducks are being raised in industrial conditions, they are also becoming more likely to contain strains of Salmonella."
question = "Why is rare duck meat safe?"
QA_CookBERT_pipeline(context=context, question=question, handle_impossible_answer=False, top_k=1)

{'answer': 'it does NOT contain the same risk of Salmonella as does chicken meat',
 'end': 106,
 'score': 0.012124078348279,
 'start': 38}
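The `start` and `end` values in the output above are character offsets into `context`, so slicing the context with them reproduces the `answer` string. A minimal check, using only the first sentence of the context and the result copied from the output (the helper name `answer_span` is hypothetical):

```python
# First sentence of the context used above; the offsets fall inside it.
context = (
    "Rare duck meat is safe to eat because it does NOT contain the same "
    "risk of Salmonella as does chicken meat."
)

# Result dictionary copied from the pipeline output above.
result = {
    "answer": "it does NOT contain the same risk of Salmonella as does chicken meat",
    "end": 106,
    "score": 0.012124078348279,
    "start": 38,
}


def answer_span(context, result):
    """Slice the answer out of the context using the character offsets."""
    return context[result["start"]:result["end"]]


extracted = answer_span(context, result)
print(extracted)
print("Matches 'answer':", extracted == result["answer"])
```

The offsets are useful when the answer should be highlighted in the original text rather than shown on its own.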

# Named Entity Recognition (NER)

This is a sequence labelling task, where the models tag food entities. The models were finetuned for 5 different annotation schemes on the FoodBase dataset (see paper ["FoodBase corpus: a new resource of annotated food entities"](https://academic.oup.com/database/article/doi/10.1093/database/baz121/5611291?login=true) by Popovski, Seljak and Eftimov (2019)).

1. Select the annotation scheme you want to tag the text with. Available annotation schemes are: 
- food-classification
- foodon
- hansard-closest
- hansard-parent
- snomedct
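Because the scheme name becomes part of the model directory, a typo only surfaces later as a file-not-found error. The sketch below shows one way to validate the name first. `ANNOTATION_SCHEMES`, `NER_MODEL_ROOT` and `ner_model_dir` are hypothetical names introduced here; the directory layout is the one used by the loading cell in this notebook.

```python
# The five schemes listed above.
ANNOTATION_SCHEMES = (
    "food-classification",
    "foodon",
    "hansard-closest",
    "hansard-parent",
    "snomedct",
)

# Directory that holds one finetuned NER model per annotation scheme.
NER_MODEL_ROOT = (
    "CookBERT/finetuning_for_downstream_tasks/"
    "named_entity_recognition/model_output/CookBERT"
)


def ner_model_dir(annotation_scheme):
    """Return the model directory for a scheme, rejecting unknown names early."""
    if annotation_scheme not in ANNOTATION_SCHEMES:
        raise ValueError(
            f"Unknown annotation scheme {annotation_scheme!r}; "
            f"choose one of {', '.join(ANNOTATION_SCHEMES)}"
        )
    return f"{NER_MODEL_ROOT}/{annotation_scheme}"


model_dir = ner_model_dir("food-classification")
print(model_dir)
```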

In [None]:
annotation_scheme = "food-classification"

2. Load CookBERT for NER:

In [None]:
# init CookBERT and its tokenizer for NER
NER_CookBERT_tokenizer = BertTokenizerFast.from_pretrained(f'CookBERT/finetuning_for_downstream_tasks/named_entity_recognition/model_output/CookBERT/{annotation_scheme}')
NER_CookBERT = BertForTokenClassification.from_pretrained(f'CookBERT/finetuning_for_downstream_tasks/named_entity_recognition/model_output/CookBERT/{annotation_scheme}') 

# init NER pipeline
# optional: pass aggregation_strategy="first" to merge connected tokens into one entity tagged with the first token's label
NER_CookBERT_pipeline = pipeline("ner", model=NER_CookBERT, tokenizer=NER_CookBERT_tokenizer)

3. Tag your input text:

In [None]:
text_to_tag = "Add two apples and one banana to your oats."
NER_CookBERT_pipeline(text_to_tag)

[{'end': 14,
  'entity': 'B-FOOD',
  'index': 3,
  'score': 0.5637132,
  'start': 8,
  'word': 'apples'},
 {'end': 29,
  'entity': 'B-FOOD',
  'index': 6,
  'score': 0.98528963,
  'start': 23,
  'word': 'banana'},
 {'end': 42,
  'entity': 'B-FOOD',
  'index': 9,
  'score': 0.96275073,
  'start': 38,
  'word': 'oats'}]
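Without an aggregation strategy, the pipeline returns one entry per tagged token, so a multi-word food such as "olive oil" would arrive as separate `B-` and `I-` entries. The following is a simplified sketch of merging such entries into spans via the character offsets. It is not the pipeline's own aggregation logic, the helper `merge_bio_tokens` is hypothetical, and the "olive oil" tokens are invented for illustration.

```python
def merge_bio_tokens(text, tokens):
    """Merge B-/I- tagged tokens into entity spans using character offsets."""
    spans = []
    for tok in tokens:
        prefix, sep, label = tok["entity"].partition("-")
        if not sep:
            continue  # skip tags without a B-/I- prefix, e.g. "O"
        continues_previous = (
            prefix == "I"
            and spans
            and spans[-1]["label"] == label
            and tok["index"] == spans[-1]["last_index"] + 1
        )
        if continues_previous:
            # Extend the open span to cover this token as well.
            spans[-1]["end"] = tok["end"]
            spans[-1]["last_index"] = tok["index"]
        else:
            spans.append({
                "label": label,
                "start": tok["start"],
                "end": tok["end"],
                "last_index": tok["index"],
            })
    return [
        {"label": s["label"], "text": text[s["start"]:s["end"]]}
        for s in spans
    ]


# Tokens copied from the output above (scores omitted).
text_to_tag = "Add two apples and one banana to your oats."
tagged = [
    {"entity": "B-FOOD", "index": 3, "start": 8, "end": 14},
    {"entity": "B-FOOD", "index": 6, "start": 23, "end": 29},
    {"entity": "B-FOOD", "index": 9, "start": 38, "end": 42},
]
print(merge_bio_tokens(text_to_tag, tagged))

# Invented two-token entity to show the merge step.
hypothetical_text = "Add olive oil."
hypothetical_tokens = [
    {"entity": "B-FOOD", "index": 2, "start": 4, "end": 9},
    {"entity": "I-FOOD", "index": 3, "start": 10, "end": 13},
]
print(merge_bio_tokens(hypothetical_text, hypothetical_tokens))
```

This sketch does not treat word pieces specially; for that, the pipeline's `aggregation_strategy` argument is the more robust route.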