# PROJECT-3
Author: Panagiotis Anastasiadis (22101)


# Prerequisites

Install the necessary packages

In [None]:
!pip install transformers
!pip install datasets
!pip install -U jax jaxlib
!pip install faiss-cpu

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


Import the necessary libraries

In [None]:
from transformers import pipeline, set_seed, AutoTokenizer, TFAutoModelForSequenceClassification, AutoConfig
from transformers import RagTokenizer, RagRetriever, RagTokenForGeneration
from datasets import load_dataset_builder, load_dataset
from sklearn import metrics
import tensorflow as tf

# Use of pretrained models

## Text classification

In the context of text classification, we will use the "https://huggingface.co/j-hartmann/emotion-english-roberta-large" model. **This model classifies text into seven categories: anger, disgust, fear, joy, neutral, sadness, and surprise.**

To evaluate its performance, we have created a dictionary comprising four examples that express different emotions. We will input these examples into the model to observe its ability to accurately classify them.

In [None]:
emotion_sentences = {
    "disgust": "Eeeew!",
    "sadness" : "He couldn't help but feel a deep sense of sorrow after the loss of his beloved pet.",
    "anger": "She clenched her fists tightly, trying to control her rising frustration",
    "surprise": "You won't believe what happened!"
}

Creating an instance of the model

In [None]:
emotion_txt_classifier = pipeline(model="j-hartmann/emotion-english-roberta-large")



In [None]:
print(emotion_txt_classifier(emotion_sentences["disgust"]))
print(emotion_txt_classifier(emotion_sentences["sadness"]))
print(emotion_txt_classifier(emotion_sentences["anger"]))
print(emotion_txt_classifier(emotion_sentences["surprise"]))

[{'label': 'disgust', 'score': 0.8998289108276367}]
[{'label': 'sadness', 'score': 0.9881411790847778}]
[{'label': 'anger', 'score': 0.8934529423713684}]
[{'label': 'surprise', 'score': 0.8437817096710205}]


### Results
The model labeled all four sentences correctly as disgust, sadness, anger, and surprise. Every top score was above 0.84, and the sentence expressing sadness was classified with a score of nearly 0.99.

Four sentences are too few to measure accuracy, but these confident and correct classifications suggest the model could be of practical value in production fields such as marketing and chatbots.
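
The comparison between expected and predicted labels can also be done in code rather than by eye. The following is a minimal sketch: `check_predictions` is a hypothetical helper (not part of `transformers`) and assumes the classifier returns a list whose first element is a dict with `label` and `score` keys, as in the output above. To keep the sketch runnable without downloading the model, `replay_classifier` is a stand-in that replays two of the scores printed above; in the notebook, `emotion_txt_classifier` would be passed instead.

```python
def check_predictions(classifier, labeled_sentences):
    """Compare each expected label with the classifier's top prediction."""
    rows = []
    for expected, sentence in labeled_sentences.items():
        top = classifier(sentence)[0]  # first element holds the top-scoring label
        rows.append({
            "expected": expected,
            "predicted": top["label"],
            "score": top["score"],
            "correct": top["label"] == expected,
        })
    return rows


# Two of the sentences from the dictionary above.
sample_sentences = {
    "disgust": "Eeeew!",
    "surprise": "You won't believe what happened!",
}

# Stand-in for the real pipeline (assumption): replays the scores printed above.
recorded_outputs = {
    "Eeeew!": {"label": "disgust", "score": 0.8998289108276367},
    "You won't believe what happened!": {"label": "surprise", "score": 0.8437817096710205},
}


def replay_classifier(sentence):
    return [recorded_outputs[sentence]]


rows = check_predictions(replay_classifier, sample_sentences)
for row in rows:
    print(row)
```

With a larger labeled set, the `correct` flags could be averaged to get an actual accuracy figure.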


## Zero-Shot classification

In the context of zero-shot classification, we will use the "https://huggingface.co/facebook/bart-large-mnli" model. By providing diverse sentences with various candidate labels, we'll evaluate how well the model predicts the appropriate label for unseen data.

Creating an instance of the model

In [None]:
zero_shot_classifier = pipeline(task="zero-shot-classification", model="facebook/bart-large-mnli")


In [None]:
insulting_input = {'sequence': "I can't believe how incredibly incompetent you are. It's astonishing that you can mess up such a simple task. Pathetic.",
 'labels': ['urgent', 'encouraging', 'insulting'],}

confusion_input = {'sequence': "They didn't respond. I'm not sure what to make of it",
 'labels': ['Love', 'Confusion', 'Relief'],}

relief_input = {'sequence': "They just texted me saying they feel the same way!",
 'labels': ['Love', 'Confusion', 'Relief'],}

print("Insulting input:\n", zero_shot_classifier(insulting_input['sequence'], insulting_input['labels']))
print("Confusion input:\n", zero_shot_classifier(confusion_input['sequence'], confusion_input['labels']))
print("Relief input:\n", zero_shot_classifier(relief_input['sequence'], relief_input['labels']))

Insulting input:
 {'sequence': "I can't believe how incredibly incompetent you are. It's astonishing that you can mess up such a simple task. Pathetic.", 'labels': ['insulting', 'urgent', 'encouraging'], 'scores': [0.978948712348938, 0.01720898039638996, 0.0038422641810029745]}
Confusion input:
 {'sequence': "They didn't respond. I'm not sure what to make of it", 'labels': ['Confusion', 'Relief', 'Love'], 'scores': [0.8273118734359741, 0.13391779363155365, 0.03877027705311775]}
Relief input:
 {'sequence': 'They just texted me saying they feel the same way!', 'labels': ['Relief', 'Love', 'Confusion'], 'scores': [0.4944944977760315, 0.32373538613319397, 0.18177010118961334]}


### Results

The model classifies the first example as insulting with a confidence score of about 0.98, giving the two unsuitable labels almost no weight. Note that this is a score for a single prediction, not an accuracy rate.

In the context of emotions, the model correctly identifies the second example as expressing confusion, with a score of about 0.83.

The third example is a bit tricky because it combines feelings of relief and love. However, the model still manages to correctly classify it as relief, though with a modest score of about 0.49. It also assigns "Love" a score of 0.32.

**Conclusions**

The model gives confident, correct results when the candidate labels are distinct and easy for a human to tell apart. When the labels overlap or are open to interpretation, its confidence in the correct class drops. It is therefore a good fit for production environments where the label set is distinct and unambiguous.
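
One way to quantify the confidence gap between the examples is the margin between the two highest scores. The sketch below is illustrative only: `top_margin` is a hypothetical helper, and the two dictionaries are copied from the output above (the `sequence` field is omitted) so the block runs without the model.

```python
def top_margin(result):
    """Gap between the best and the second-best score of a zero-shot result."""
    ranked = sorted(result["scores"], reverse=True)
    return ranked[0] - ranked[1]


# Results copied from the notebook output above.
insulting_result = {
    "labels": ["insulting", "urgent", "encouraging"],
    "scores": [0.978948712348938, 0.01720898039638996, 0.0038422641810029745],
}
relief_result = {
    "labels": ["Relief", "Love", "Confusion"],
    "scores": [0.4944944977760315, 0.32373538613319397, 0.18177010118961334],
}

for name, result in [("insulting", insulting_result), ("relief", relief_result)]:
    print(name, result["labels"][0], round(top_margin(result), 3))
```

A small margin, as in the relief example, flags predictions that may deserve a second look. When labels are not mutually exclusive, the pipeline also accepts `multi_label=True`, which scores each label independently; whether that helps for these inputs was not tested here.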

## Token Classification

For the token classification task, we will use the model "https://huggingface.co/vblagoje/bert-english-uncased-finetuned-pos".

The objective of this classification problem is to tag each English word with its part of speech, such as noun or verb.

To assess the model's performance, we will evaluate it on three sentences that contain a range of word types.

Creating an instance of the model

In [None]:
token_classifier = pipeline(model="vblagoje/bert-english-uncased-finetuned-pos")


Evaluating the model

In [None]:
sentences = ["In this region of me, a great dragon is lying",
             "But I grow impatient, cannot stand the wait",
             "Why do they call me there?"]

for sentence in sentences:
  print(sentence, "\n")
  tokens = token_classifier(sentence)
  for token in tokens:
    print(token)
  print("\n")

In this region of me, a great dragon is lying 

{'entity': 'ADP', 'score': 0.99955505, 'index': 1, 'word': 'in', 'start': 0, 'end': 2}
{'entity': 'DET', 'score': 0.9994382, 'index': 2, 'word': 'this', 'start': 3, 'end': 7}
{'entity': 'NOUN', 'score': 0.9990252, 'index': 3, 'word': 'region', 'start': 8, 'end': 14}
{'entity': 'ADP', 'score': 0.9992131, 'index': 4, 'word': 'of', 'start': 15, 'end': 17}
{'entity': 'PRON', 'score': 0.999451, 'index': 5, 'word': 'me', 'start': 18, 'end': 20}
{'entity': 'PUNCT', 'score': 0.99966645, 'index': 6, 'word': ',', 'start': 20, 'end': 21}
{'entity': 'DET', 'score': 0.99947983, 'index': 7, 'word': 'a', 'start': 22, 'end': 23}
{'entity': 'ADJ', 'score': 0.9973296, 'index': 8, 'word': 'great', 'start': 24, 'end': 29}
{'entity': 'NOUN', 'score': 0.9960116, 'index': 9, 'word': 'dragon', 'start': 30, 'end': 36}
{'entity': 'AUX', 'score': 0.99850583, 'index': 10, 'word': 'is', 'start': 37, 'end': 39}
{'entity': 'VERB', 'score': 0.99924505, 'index': 11, 'wor

### Results

Based on the results, the model demonstrates a strong performance in classifying the types of English words in the given sentences. It accurately identifies **various word categories** such as adpositions (ADP), determiners (DET), nouns (NOUN), pronouns (PRON), punctuation (PUNCT), adjectives (ADJ), auxiliaries (AUX), and verbs (VERB).

**Conclusions**

The model identifies parts of speech, including punctuation and auxiliaries, with high confidence on these sentences. This makes it a promising candidate for related tasks in a production environment.
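
To get an overview of which tags appear in a sentence, the token dictionaries can be tallied. This is a small sketch, not part of the original notebook: `tag_counts` is a hypothetical helper, and the token list reproduces the first ten tokens of the first sentence from the output above, keeping only the `entity` and `word` fields.

```python
from collections import Counter


def tag_counts(tokens):
    """Count how often each part-of-speech tag occurs in a token list."""
    return Counter(token["entity"] for token in tokens)


# First ten tokens of the first sentence, copied from the output above.
first_sentence_tokens = [
    {"entity": "ADP", "word": "in"},
    {"entity": "DET", "word": "this"},
    {"entity": "NOUN", "word": "region"},
    {"entity": "ADP", "word": "of"},
    {"entity": "PRON", "word": "me"},
    {"entity": "PUNCT", "word": ","},
    {"entity": "DET", "word": "a"},
    {"entity": "ADJ", "word": "great"},
    {"entity": "NOUN", "word": "dragon"},
    {"entity": "AUX", "word": "is"},
]

counts = tag_counts(first_sentence_tokens)
print(counts.most_common())
```

In the notebook, the list returned by `token_classifier(sentence)` could be passed to `tag_counts` directly.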

## Question answering (with context)
We'll be using the "https://huggingface.co/deepset/tinyroberta-squad2" model for this task. We feed the model questions together with their contexts and check whether the extracted answers are correct.

Creating an instance of the model

In [None]:
model_name = "deepset/tinyroberta-squad2"
qa_context_model = pipeline('question-answering', model=model_name, tokenizer=model_name)



We come up with four questions inspired by Andrzej Sapkowski's books, drawing in particular on the Witcher wiki, and provide the relevant context to answer them.

In [None]:
QA_input1 = {
    'question': 'Was the Witcher paid for his services?',
    'context': "An unknown time later, he killed an amphisbaena and went to the court of King Idi of Kovir, where he handed in the head of the beast. However the king's mages, Zavist and Stregobor, told the king that Geralt was little more than a charlatan, so the king didn't pay the witcher anything and demanded he leave Kovir in 12 hours, which Geralt was barely able to do on account of the king's hourglass being broken."
}

QA_input2 = {
    'question': 'Why Geralt wants to find the druids?',
    'context': "Meanwhile, Geralt meets an elf named Avallac'h who tells him about a prophecy connected with Ciri. He needs to find some druids who will reportedly know where Ciri is. Yennefer is trying to find Vilgefortz's hiding place, but it is no easy task."
}

QA_input3 = {
    'question': 'Why Rience is mistreating Dandelion?',
    'context': "At the same time, a mysterious wizard called Rience is looking for the girl. He is a servant of a more powerful mage, who remains unknown. He captures Geralt's friend, Dandelion the bard, and tortures him for information about Ciri. Dandelion is saved by the timely arrival of Yennefer, who engages in a short magic combat with Rience."
}

QA_input4 = {
    'question': 'Does Ciri like Yennefer when they are about to leave the Temple School in Ellander?',
    'context': "Yennefer became Ciri's mentor and teacher. As they are about to leave the Temple School in Ellander, Yennefer asks Ciri whether she didn't like her at first, leading to a series of flashbacks detailing Ciri's studies with Yennefer from the day they were introduced and back to the present as they are about to leave the Temple. And Ciri responds by admitting the she didn't like her at first, but it quickly changed, they both bonded together, afterwards they leave."
}

qa_inputs = [QA_input1, QA_input2, QA_input3, QA_input4]

Evaluating the model

In [None]:
for question in qa_inputs:
  print(qa_context_model(question))

{'score': 0.3071286380290985, 'start': 245, 'end': 285, 'answer': "the king didn't pay the witcher anything"}
{'score': 0.28795191645622253, 'start': 67, 'end': 97, 'answer': 'a prophecy connected with Ciri'}
{'score': 0.42856675386428833, 'start': 209, 'end': 231, 'answer': 'information about Ciri'}
{'score': 0.20405906438827515, 'start': 120, 'end': 156, 'answer': "whether she didn't like her at first"}


### Results

For questions 1 and 3, the model extracts answers that capture the essence of the questions.

As for question 2, the model extracts a span from the relevant part of the context, but it is not the correct answer. (Correct answer: "Geralt is seeking the druids as they might possess knowledge about Ciri's whereabouts.")

Question 4 is more intricate, as the answer isn't stated outright and has to be inferred from Yennefer and Ciri's conversation. The model not only fails to provide an accurate answer but also seems to miss the point of the question, returning its lowest score of the four.

**Conclusions**

In a practical setting where the context is straightforward, the meanings are easily comprehensible, and the questions are simple, this model could be suitable. However, it appears to struggle when faced with more complex inquiries.
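
Since the two answers judged correct also received the highest scores, one simple safeguard is to discard answers below a score threshold. The sketch below is illustrative only: `confident_answers` is a hypothetical helper, the results are copied from the output above (offsets omitted), and the threshold of 0.3 is an assumption picked after looking at these four results, not a recommended value.

```python
def confident_answers(results, threshold):
    """Keep only the answers whose score reaches the threshold."""
    return [result for result in results if result["score"] >= threshold]


# Results copied from the notebook output above.
recorded_results = [
    {"score": 0.3071286380290985, "answer": "the king didn't pay the witcher anything"},
    {"score": 0.28795191645622253, "answer": "a prophecy connected with Ciri"},
    {"score": 0.42856675386428833, "answer": "information about Ciri"},
    {"score": 0.20405906438827515, "answer": "whether she didn't like her at first"},
]

kept = confident_answers(recorded_results, threshold=0.3)
for result in kept:
    print(round(result["score"], 3), result["answer"])
```

A threshold like this would have to be calibrated on a much larger set of questions before it could be trusted.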

## Question answering (without context)

We're using the model **facebook/rag-token-nq** for this task, and we're following the tutorial provided at "https://huggingface.co/facebook/rag-token-nq".



In [None]:
import faiss
tokenizer = RagTokenizer.from_pretrained("facebook/rag-token-nq")
retriever = RagRetriever.from_pretrained("facebook/rag-token-nq", index_name="exact", use_dummy_dataset=True)
model = RagTokenForGeneration.from_pretrained("facebook/rag-token-nq", retriever=retriever)

The tokenizer class you load from this checkpoint is not the same type as the class this function is called from. It may result in unexpected tokenization. 
The tokenizer class you load from this checkpoint is 'RagTokenizer'. 
The class this function is called from is 'DPRQuestionEncoderTokenizer'.


Some weights of the model checkpoint at facebook/rag-token-nq were not used when initializing RagTokenForGeneration: ['rag.question_encoder.question_encoder.bert_model.pooler.dense.weight', 'rag.question_encoder.question_encoder.bert_model.pooler.dense.bias']
- This IS expected if you are initializing RagTokenForGeneration from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RagTokenForGeneration from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RagTokenForGeneration were not initialized from the model checkpoint at facebook/rag-token-nq and are newly initialized: ['rag.generator.lm_head.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictio

We test the model on various inputs (including the tutorial example from Hugging Face).

In [None]:
input_dict = tokenizer.prepare_seq2seq_batch("who holds the record in 100m freestyle", return_tensors="pt")

generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])



 michael phelps


In [None]:
input_dict = tokenizer.prepare_seq2seq_batch("What is life?", return_tensors="pt")

generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])


 animism


In [None]:
input_dict = tokenizer.prepare_seq2seq_batch("Which team have won the most Champions League titles in football?", return_tensors="pt")

generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

 chelsea


In [None]:
input_dict = tokenizer.prepare_seq2seq_batch("What is netflix?", return_tensors="pt")

generated = model.generate(input_ids=input_dict["input_ids"])
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])

 netflix


### Results


Overall, the model performs poorly on these inputs. It gives incorrect answers to the questions about the most Champions League titles and the 100m freestyle record, and it merely echoes "netflix" for the Netflix question. One likely factor is that the retriever was loaded with `use_dummy_dataset=True`, so it searches only a small sample of the passage index rather than the full one.

While the answer to the rhetorical question about life may hold some meaning, it lacks practical utility.

Consequently, the model in this configuration seems unsuitable for production environments because of its limited accuracy and usefulness.
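
The four test cells repeat the same three calls, so they could be folded into one helper. This is a sketch only: `answer_question` and the `questions` list are hypothetical names, and the function uses exactly the tokenizer and model calls from the cells above, with the objects passed in as arguments. The only change is `.strip()`, which removes the leading space seen in the printed answers.

```python
def answer_question(question, tokenizer, model):
    """Generate an answer using the same three calls as the cells above."""
    input_dict = tokenizer.prepare_seq2seq_batch(question, return_tensors="pt")
    generated = model.generate(input_ids=input_dict["input_ids"])
    # strip() drops the leading space that appears in the decoded output.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0].strip()


# The four questions tested above.
questions = [
    "who holds the record in 100m freestyle",
    "What is life?",
    "Which team have won the most Champions League titles in football?",
    "What is netflix?",
]

# In the notebook, with `tokenizer` and `model` loaded as above:
# for question in questions:
#     print(question, "->", answer_question(question, tokenizer, model))
```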

## Summarization

For the task of summarization we create four texts which consist of a book summary, two sections extracted from corresponding articles, and a text comprising dialogue.

In [None]:
BOOK_SUMMARY = """For over a century, humans, dwarves, gnomes, and elves have lived together in relative peace. But times have changed, the uneasy peace is over, and now the races are fighting once again. The only good elf, it seems, is a dead elf.
Geralt of Rivia, the cunning assassin known as The Witcher, has been waiting for the birth of a prophesied child. This child has the power to change the world - for good, or for evil.
As the threat of war hangs over the land and the child is hunted for her extraordinary powers, it will become Geralt's responsibility to protect them all - and the Witcher never accepts defeat.
The Witcher returns in this sequel to The Last Wish, as the inhabitants of his world become embroiled in a state of total war.
"""

ARTICLE1 = """Whether it's entertainment you're after, shopping, culture, history, architecture or parks, New York City's got it all.
It is a massive city that offers a wide variety of entertainment options for visitors. If you're planning a trip to New York, it can feel overwhelming. However, take your time and plan accordingly.
Find an affordable flight and a hotel. Before leaving, break down the city into sections by neighborhood and plan your days around visiting one neighborhood at a time. Make time to see New York's best attractions, like Central Park, the Statue of Liberty, and more.
"""
ARTICLE2 = """Manchester United and Real Madrid have both spoken to Inter about wing-back Federico Dimarco, as the Serie A club face huge pressure to sell this summer.
The Italy international was one of the Champions League's revelations this season and is seen as one of Inter's most marketable assets.
A number of clubs are trailing but United and Madrid have both expressed the most concerted interest so far.
While there has been some surprise that the Old Trafford club are in for Dimarco given that Luke Shaw and Tyrell Malacia have proven two of the players to enjoy the most progress under Erik ten Hag, United are considering a deal for a few reasons.
One is the possibility of signing a burgeoning talent for relatively little fee, and also the 25-year-old's immense versatility.
"""

EXAMPLE_DIALOGUE = """
“Handsome lad like you. There must be some special girl. Come on, what’s her name?" says Caesar.
Peeta sighs. "Well, there is this one girl. I’ve had a crush on her ever since I can remember. But I’m pretty sure she didn’t know I was alive until the reaping."
Sounds of sympathy from the crowd. Unrequited love they can relate to.
“She have another fellow?" asks Caesar.
“I don’t know, but a lot of boys like her," says Peeta.
“So, here’s what you do. You win, you go home. She can’t turn you down then, eh?" says Caesar encouragingly.
"I don’t think it’s going to work out. Winning...won’t help in my case," says Peeta.
“Why ever not?" says Caesar, mystified.
Peeta blushes beet red and stammers out. "Because...because...she came here with me.”
"""


## Abstractive Summarization

For abstractive summarization, where the model generates a summary by understanding the source text and paraphrasing it into a condensed form, we select the "facebook/bart-large-xsum" model from Facebook (source: https://huggingface.co/facebook/bart-large-xsum).

Creating an instance of the model.

In [None]:
abstr_summarizer = pipeline("summarization", model="facebook/bart-large-xsum")


Testing the model on the different text inputs

In [None]:
print(abstr_summarizer(ARTICLE1, max_length=120, min_length=30, do_sample=False))
print(abstr_summarizer(ARTICLE2, max_length=130, min_length=30, do_sample=False))
print(abstr_summarizer(BOOK_SUMMARY, max_length=122, min_length=30, do_sample=False))
print(abstr_summarizer(EXAMPLE_DIALOGUE, max_length=122, min_length=30, do_sample=False))

[{'summary_text': "New York City is one of the most visited cities in the world, and it's not hard to find a good place to live, work, or visit."}]
[{'summary_text': "Manchester United are in talks with Inter Milan over a deal for one of the club's best players, reports BBC Radio Solent and the BBC Sport website."}]
[{'summary_text': 'The Witcher 2 is set in a world where humans, elves, dwarves, and gnomes live side-by-side in harmony - but not anymore.'}]
[{'summary_text': 'At the end of the reaping, a young boy called Peeta is asked by his fellow contestants if he has any special someone he would like to marry.'}]


### Results

**Article 1 Summary**: Accurately portrays diverse entertainment options in NYC, capturing its essence.

**Article 2 Summary**: Mentions discussions between Manchester United and Inter Milan over a player, but omits Real Madrid and the player's name, and attributes the story to BBC outlets that the article never mentions.

**Book Summary**: Depicts a world of racial conflict, highlighting the shift from peace to total war. It incorrectly presents the text as being about the video game The Witcher 2, whereas it is actually about one of the books in the Witcher series.

**Dialogue Summary**: Briefly notes Peeta's unrequited love at the reaping, but lacks contextual details of his conversation with Caesar.

Overall, the summaries capture main points but may lack context or specifics, resulting in occasional inaccuracies or omissions. In a production environment, this may be a deal-breaker.
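
A rough way to spot content the summarizer may have invented is to list the words that appear in a summary but nowhere in its source. The following sketch assumes a simple lowercase word tokenizer; `words` and `novel_words` are hypothetical helpers, and the two strings are copied from `ARTICLE2` and its summary above.

```python
import re


def words(text):
    """Lowercase word tokens (letters, digits, and apostrophes)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def novel_words(source, summary):
    """Words that occur in the summary but nowhere in the source."""
    return sorted(set(words(summary)) - set(words(source)))


# Copied from ARTICLE2 above.
article = """Manchester United and Real Madrid have both spoken to Inter about wing-back Federico Dimarco, as the Serie A club face huge pressure to sell this summer.
The Italy international was one of the Champions League's revelations this season and is seen as one of Inter's most marketable assets.
A number of clubs are trailing but United and Madrid have both expressed the most concerted interest so far.
While there has been some surprise that the Old Trafford club are in for Dimarco given that Luke Shaw and Tyrell Malacia have proven two of the players to enjoy the most progress under Erik ten Hag, United are considering a deal for a few reasons.
One is the possibility of signing a burgeoning talent for relatively little fee, and also the 25-year-old's immense versatility.
"""

# Copied from the summarizer output above.
summary = ("Manchester United are in talks with Inter Milan over a deal for one of "
           "the club's best players, reports BBC Radio Solent and the BBC Sport website.")

print(novel_words(article, summary))
```

Novel words are not necessarily errors, since paraphrasing is the point of abstractive summarization, but names such as "BBC" or "Solent" that never occur in the source are a useful warning sign.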

## Extractive Summarization

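For the extractive summarization task, where the summary is built from important sentences or phrases taken directly from the source text, we choose "facebook/bart-large-cnn" from "https://huggingface.co/facebook/bart-large-cnn". Strictly speaking, BART is a sequence-to-sequence model that generates its output, but this checkpoint, fine-tuned on CNN/DailyMail, tends to stay close to the wording of the source.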

Creating an instance of the model

In [None]:
extractive_summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


Testing the model on the different inputs

In [None]:
print(extractive_summarizer(ARTICLE1, max_length=124, min_length=30, do_sample=False))
print(extractive_summarizer(ARTICLE2, max_length=130, min_length=30, do_sample=False))
print(extractive_summarizer(BOOK_SUMMARY, max_length=122, min_length=30, do_sample=False))
print(extractive_summarizer(EXAMPLE_DIALOGUE, max_length=122, min_length=30, do_sample=False))

[{'summary_text': 'Find an affordable flight and a hotel. Break down the city into sections by neighborhood. Make time to see Central Park, the Statue of Liberty, and more.'}]
[{'summary_text': 'Federico Dimarco is a target for Manchester United and Real Madrid. The Italy international has been linked with a move to Old Trafford. The 25-year-old has impressed in the Champions League this season.'}]
[{'summary_text': "Geralt of Rivia, the cunning assassin known as The Witcher, has been waiting for the birth of a prophesied child. As the threat of war hangs over the land, it will become Geralt's responsibility to protect them all."}]
[{'summary_text': 'Peeta has had a crush on a girl for as long as he can remember. "I’m pretty sure she didn’t know I was alive until the reaping"'}]


### Results

The results of all inputs generated through extractive summarization capture the essence of the original texts quite well, providing cohesive and meaningful summaries.
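
How extractive a summary actually is can be estimated by checking how many of its sentences occur verbatim in the source. This is a minimal sketch under simple assumptions: sentences are split on `.`, `!`, or `?` followed by whitespace, `verbatim_ratio` is a hypothetical helper, and the strings are copied from `BOOK_SUMMARY` (second and third lines) and the corresponding output above.

```python
import re


def verbatim_ratio(source, summary):
    """Fraction of summary sentences that appear word for word in the source."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", summary) if s.strip()]
    if not sentences:
        return 0.0
    copied = sum(1 for sentence in sentences if sentence in source)
    return copied / len(sentences)


# Second and third lines of BOOK_SUMMARY above.
source = (
    "Geralt of Rivia, the cunning assassin known as The Witcher, has been waiting "
    "for the birth of a prophesied child. This child has the power to change the "
    "world - for good, or for evil.\n"
    "As the threat of war hangs over the land and the child is hunted for her "
    "extraordinary powers, it will become Geralt's responsibility to protect them "
    "all - and the Witcher never accepts defeat."
)

# Summary produced for BOOK_SUMMARY in the output above.
summary = (
    "Geralt of Rivia, the cunning assassin known as The Witcher, has been waiting "
    "for the birth of a prophesied child. As the threat of war hangs over the land, "
    "it will become Geralt's responsibility to protect them all."
)

print(verbatim_ratio(source, summary))
```

Here the first summary sentence is copied verbatim, while the second is a shortened version of a source sentence, so the summary is close to the source without being purely extractive.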

## Translation

For the task of translation, we use the "t5-base" model from "https://huggingface.co/t5-base" to translate English text into German.

Creating an instance of the model

In [None]:
translation_model = pipeline(model="t5-base")



We create four examples of English text

In [None]:
english_text1 = "Your bedtime story is scaring everyone"
english_text2 = "Where the dead ships dwell"
english_text3 = "Something's wrong, shut the light, heavy thoughts tonight and they aren't of Snow White"
english_text4 = "When you're in a sad mood, it can seem like it will last forever"

To evaluate the model's performance, we translate the given examples and then check the German outputs by translating them back into English with a popular translation tool, Google Translate.

In [None]:
text = translation_model(english_text1)
print(text)

[{'translation_text': 'Ihre Bettnachricht erschreckt alle'}]


From Google translate: "Her bed message scares everyone"

In [None]:
text = translation_model(english_text2)
print(text)

[{'translation_text': 'Wo sich die Totenschiffe bewegen'}]


From Google translate: "Where the ships of the dead move"

In [None]:
text = translation_model(english_text3)
print(text)

[{'translation_text': 'Etwas ist falsch, schließe die hellen, schweren Gedanken heute Abend und sie sind nicht von Snow White'}]


From Google translate: "Something's wrong, close the light heavy thoughts tonight and they're not from Snow White"

In [None]:
text = translation_model(english_text4)
print(text)

[{'translation_text': 'Wenn man in einer traurigen Laune ist, kann es so aussehen, als würde es ewig andauern'}]


From Google translate: "When you're in a sad mood, it can seem like it lasts forever"

### Results

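Based on the back-translations, the outputs preserve the general meaning of the examples, and the fourth is translated almost exactly. There are word-level errors, however: "bedtime story" becomes "Bettnachricht" ("bed message"), and "dwell" becomes "bewegen" ("move"). The apparent change of "your" to "her" in the first sentence is likely an artifact of the back-translation, since a sentence-initial "Ihre" can also be the formal "your" in German.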

Overall, these results suggest that the model performs well and can be relied upon in a production environment.
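
The back-translation check can be given a rough number by measuring word overlap between each original sentence and its back-translation. The sketch below uses Jaccard overlap on lowercase word sets, which is a crude lexical measure chosen here for illustration rather than a standard translation metric; `word_set` and `jaccard_overlap` are hypothetical helpers, and the sentence pairs are copied from the cells above.

```python
import re


def word_set(text):
    """Set of lowercase word tokens (letters and apostrophes)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def jaccard_overlap(original, back_translation):
    """Shared words divided by all distinct words across both sentences."""
    a, b = word_set(original), word_set(back_translation)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# (original English, Google Translate back-translation) pairs from above.
pairs = [
    ("Your bedtime story is scaring everyone",
     "Her bed message scares everyone"),
    ("Where the dead ships dwell",
     "Where the ships of the dead move"),
    ("Something's wrong, shut the light, heavy thoughts tonight and they aren't of Snow White",
     "Something's wrong, close the light heavy thoughts tonight and they're not from Snow White"),
    ("When you're in a sad mood, it can seem like it will last forever",
     "When you're in a sad mood, it can seem like it lasts forever"),
]

scores = [jaccard_overlap(original, back) for original, back in pairs]
for (original, _), score in zip(pairs, scores):
    print(round(score, 2), original)
```

Low overlap does not always mean a bad translation, because the back-translation step adds its own rewording, so the numbers are best read as a prompt for closer manual inspection.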

## Language modeling
For general language modeling, we will use a text generation model called "gpt2-medium" from "https://huggingface.co/gpt2-medium" and examine its ability to generate text based on a given input.


Creating an instance of the model

In [None]:
generator = pipeline('text-generation', model='gpt2-medium')



Testing the model with different inputs

In [None]:
res = generator("My name is Satan and i like to", max_length=30, num_return_sequences=3)
for r in res:
  print(r)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'generated_text': "My name is Satan and i like to think that i'm very clever. All this time people have told me that i couldn't see through people but"}
{'generated_text': "My name is Satan and i like to serve a greater cause…but right now this country is in a crisis. The world's greatest democracy faces a"}
{'generated_text': 'My name is Satan and i like to suck the blood of Christians (Christian people)\n\nbut not of other Christian people. And also i like'}


In [None]:
res = generator("The ball hits the", max_length=30, num_return_sequences=3)
for r in res:
  print(r)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'generated_text': 'The ball hits the bat while touching the ground, and you can still roll onto the other side.\n\nBouncing\n\nSince the ball doesn'}
{'generated_text': "The ball hits the catcher. He will slide down under the bat and roll backward towards the center of the field. However, he doesn't look up"}
{'generated_text': "The ball hits the ground with a loud thud, the sound emanating from the hole in the wall directly in front of me, I don't catch"}


In [None]:
res = generator("I think i am sad because", max_length=30, num_return_sequences=3)
for r in res:
  print(r)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


{'generated_text': 'I think i am sad because there was no reason for such a big deal? lol\n\ni am not scared of death by myself. when it'}
{'generated_text': 'I think i am sad because these guys have seen over and over again how we could be more than just animals when we come together and make sacrifices.'}
{'generated_text': 'I think i am sad because i lost a bit of my confidence in myself to take the plunge, it wasnt fun but i am happy to have'}


### Results

The model generates text that is often locally plausible, but its output lacks overall coherence and practical meaning. While it can pick up the theme of the input, its inconsistent quality may make it unsuitable for production systems.
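The three continuations per prompt differ from one another, which suggests that the pipeline samples each next token instead of always taking the most likely one. The toy sketch below illustrates that difference. The four-word vocabulary and its probabilities are invented for illustration and are not GPT-2's actual distribution or decoding code.

```python
import random

# Hypothetical next-token distribution after the prompt "The ball hits the".
# The words and probabilities are made up for illustration only.
next_token_probs = {"ground": 0.5, "bat": 0.2, "catcher": 0.2, "wall": 0.1}

def sample_next_token(probs, rng):
    """Draw one token in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]

def greedy_next_token(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)

rng = random.Random(0)  # fixed seed so the draws are repeatable
samples = [sample_next_token(next_token_probs, rng) for _ in range(3)]
print("sampled:", samples)
print("greedy: ", [greedy_next_token(next_token_probs)] * 3)
```

Greedy selection returns the same word every time, whereas sampling can return a different word on each draw. In the notebook itself, calling `set_seed` (already imported above) before `generator(...)` should make the sampled continuations repeatable across runs.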


# Fine-tuning a pre-trained model

In this step, we take a pretrained model and adapt it to **our task of classifying movie reviews as positive or negative**. We do this by fine-tuning the model on the "Rotten Tomatoes" dataset, so that it learns this specific task.

### Dataset

We inspect the metadata of the Rotten Tomatoes dataset before downloading it.

In [None]:
ds_builder = load_dataset_builder("rotten_tomatoes")


In [None]:
ds_builder.info.features

{'text': Value(dtype='string', id=None),
 'label': ClassLabel(names=['neg', 'pos'], id=None)}

In [None]:
ds_builder.info.description

"Movie Review Dataset.\nThis is a dataset of containing 5,331 positive and 5,331 negative processed\nsentences from Rotten Tomatoes movie reviews. This data was first used in Bo\nPang and Lillian Lee, ``Seeing stars: Exploiting class relationships for\nsentiment categorization with respect to rating scales.'', Proceedings of the\nACL, 2005.\n"

We download the Rotten Tomatoes dataset and load its three predefined splits (train, validation, test).

In [None]:
pre_train_ds = load_dataset("rotten_tomatoes", split="train")
# Each variable holds the split it is named after
pre_val_ds = load_dataset("rotten_tomatoes", split="validation")
pre_test_ds = load_dataset("rotten_tomatoes", split="test")





Printing the shape of the 3 subsets

In [None]:
print(pre_train_ds.shape)
print(pre_val_ds.shape)
print(pre_test_ds.shape)

(8530, 2)
(1066, 2)
(1066, 2)


### Model

We select the "bert-base-cased" model from "https://huggingface.co/bert-base-cased ". It is pretrained on a large amount of English text, captures the meaning and context of words and sentences, and can be fine-tuned for various natural language processing tasks such as text classification.

Creating the classification labels

In [None]:
id_to_label = {0: "negative", 1: "positive"}
label_to_id = {"negative": 0, "positive": 1}

We load the tokenizer with AutoTokenizer and configure the model for our binary classification problem.

In [None]:
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = TFAutoModelForSequenceClassification.from_pretrained('bert-base-cased', num_labels=2, id2label=id_to_label, label2id=label_to_id)


All PyTorch model weights were used when initializing TFBertForSequenceClassification.

Some weights or buffers of the TF 2.0 model TFBertForSequenceClassification were not initialized from the PyTorch model and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


First, we apply the tokenizer to the second and third reviews of the training set (indices 1 and 2) to showcase how tokenization works and to highlight its features.

In [None]:
print(pre_train_ds[1])
print(pre_train_ds[2])
tokens = tokenizer([pre_train_ds["text"][1], pre_train_ds["text"][2]], padding=True, truncation=True)
print("input_ids: ",  tokens["input_ids"])
print("attention_mask: ", tokens["attention_mask"])
print("token_type_ids: ", tokens["token_type_ids"])


{'text': 'the gorgeously elaborate continuation of " the lord of the rings " trilogy is so huge that a column of words cannot adequately describe co-writer/director peter jackson\'s expanded vision of j . r . r . tolkien\'s middle-earth .', 'label': 1}
{'text': 'effective but too-tepid biopic', 'label': 1}
input_ids:  [[101, 1103, 10144, 1193, 9427, 14961, 1104, 107, 1103, 7692, 1104, 1103, 8374, 107, 14927, 1110, 1177, 3321, 1115, 170, 5551, 1104, 1734, 2834, 26449, 5594, 1884, 118, 2432, 120, 1900, 11109, 1200, 24498, 2142, 112, 188, 3631, 4152, 1104, 179, 119, 187, 119, 187, 119, 1106, 10493, 8584, 112, 188, 2243, 118, 4033, 119, 102], [101, 3903, 1133, 1315, 118, 21359, 25786, 25128, 20437, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]
attention_mask:  [[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

**input_ids**:

The input_ids attribute holds the tokenized text as numerical vocabulary IDs. Each sequence starts with 101 ([CLS]) and ends with 102 ([SEP]). Because the two sentences differ in length, the shorter one (index 2) is padded with the ID 0 ([PAD]) until it matches the length of the longer one.

**attention_mask**:

The attention_mask attribute provides information to the model about which tokens should be attended to and which ones should be ignored.
When a token has a value of 1 in the attention_mask, it means the model should pay attention to it during processing.

In the given example, we can see that the padding process creates additional input_ids with a value of 0 for the second sentence. Correspondingly, the attention_mask also has 0 values for these padded input_ids.


**token_type_ids**:


"*token_type_ids helps identify which sequence each token belongs to when there are multiple sequences involved*."

In our case, each input is a single review rather than a pair of sequences, so all token type values are 0.
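The three fields described above can be reproduced by hand. The following is a minimal sketch in plain Python, not the tokenizer's actual implementation. `pad_batch` is a hypothetical helper, `short_ids` are the IDs of the second example printed above, and `long_ids` is an illustrative sequence built from the first few IDs of the first example, shortened for readability.

```python
def pad_batch(sequences, pad_id=0):
    """Pad lists of token IDs to the length of the longest one in the batch."""
    target = max(len(seq) for seq in sequences)
    input_ids, attention_mask, token_type_ids = [], [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)             # real IDs, then padding
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = attend, 0 = ignore
        token_type_ids.append([0] * target)                  # single sequence: all 0
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }

# IDs of 'effective but too-tepid biopic' as printed above
short_ids = [101, 3903, 1133, 1315, 118, 21359, 25786, 25128, 20437, 102]
# Illustrative longer sequence (shortened excerpt of the first example plus [SEP])
long_ids = [101, 1103, 10144, 1193, 9427, 14961, 1104, 107, 1103, 7692,
            1104, 1103, 8374, 107, 102]

batch = pad_batch([long_ids, short_ids])
print(batch["input_ids"][1])
print(batch["attention_mask"][1])
```

The shorter sequence gains five trailing zeros in `input_ids`, and its `attention_mask` is 0 at exactly those positions.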


### Preprocessing

To handle the varying lengths of sentences in the dataset, we use a preprocessing function that applies padding and truncation. Truncation cuts any review that exceeds the model's maximum input length. Because `map` calls the function on one review at a time here (no `batched=True`), `padding=True` has nothing to pad at this stage; the padding to a common length is applied later, per batch, when the TensorFlow datasets are built.

In [None]:
def tokenize_dataset(data):
    # Keys of the returned dictionary will be added to the dataset as columns
    return tokenizer(data["text"], padding=True, truncation=True)

train_ds = pre_train_ds.map(tokenize_dataset)
test_ds = pre_test_ds.map(tokenize_dataset)
val_ds = pre_val_ds.map(tokenize_dataset)



### Preparing the subsets to be compatible with Tensorflow

In [None]:
tf_train = model.prepare_tf_dataset(train_ds, batch_size=64, shuffle=True, tokenizer=tokenizer)
tf_test = model.prepare_tf_dataset(test_ds, batch_size=64, shuffle=True, tokenizer=tokenizer)
tf_val = model.prepare_tf_dataset(val_ds, batch_size=64, shuffle=True, tokenizer=tokenizer)

You're using a BertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
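One side effect of `shuffle=True` is worth spelling out. According to the Transformers documentation, the `drop_remainder` argument of `prepare_tf_dataset` defaults to the value of `shuffle`, so the final incomplete batch is discarded. Assuming that default applies here, the arithmetic below shows how many examples each subset actually yields. `batches_seen` is a hypothetical helper written only for this illustration.

```python
def batches_seen(num_examples, batch_size, drop_remainder):
    """Return (number of batches, number of examples) produced in one pass."""
    full, leftover = divmod(num_examples, batch_size)
    if drop_remainder or leftover == 0:
        return full, full * batch_size
    return full + 1, num_examples

# Subset sizes as printed earlier in the notebook
for name, size in [("train", 8530), ("validation", 1066), ("test", 1066)]:
    dropped = batches_seen(size, 64, drop_remainder=True)
    kept = batches_seen(size, 64, drop_remainder=False)
    print(name, "drop_remainder=True:", dropped, "drop_remainder=False:", kept)
```

With a batch size of 64 this gives 8512 of the 8530 training examples and 1024 of the 1066 validation or test examples, which matches the totals of the confusion matrices reported in the evaluation output. Passing `shuffle=False` for the validation and test sets would likely keep every example for evaluation.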


### Observing the generated tensors

By examining one batch of the dataset, we can see the padding in the trailing zeros of the "input_ids" and "attention_mask" tensors.

In [None]:
for element, labels in tf_train:
    print(element["input_ids"])
    print(element["attention_mask"])
    break

tf.Tensor(
[[  101  1136  1256 ...     0     0     0]
 [  101 21718 16882 ...     0     0     0]
 [  101  1103 10850 ...     0     0     0]
 ...
 [  101  1451  1208 ...     0     0     0]
 [  101  1199  1404 ...     0     0     0]
 [  101  1274   112 ...     0     0     0]], shape=(64, 60), dtype=int64)
tf.Tensor(
[[1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 ...
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]
 [1 1 1 ... 0 0 0]], shape=(64, 60), dtype=int64)


### Training the model


Initially, we experimented with a learning rate of 0.001 for the Adam optimizer, as suggested in the assignment. However, this resulted in a validation accuracy of only 50%, which is chance level for this balanced two-class dataset.

Upon reducing the learning rate to 3e-5 (0.00003), the value used in the code below, we observed a significant improvement in the validation accuracy, reaching over 83%. Although the model shows signs of overfitting on the training data, with near-perfect training accuracy, the validation accuracy remains satisfactory.

Regarding the loss function, we followed the guidance provided by Hugging Face:

"*You don’t have to pass a loss argument to your models when you compile() them! Hugging Face models automatically choose a loss that is appropriate for their task and model architecture if this argument is left blank.*"

**Source**: "https://huggingface.co/docs/transformers/training "

Therefore, in our implementation, we do not explicitly set a loss function, relying on the default behavior of the Hugging Face models.
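For a classification head that outputs one logit per class, as this model does with `num_labels=2`, a typical loss is sparse categorical cross-entropy computed directly on the logits. The sketch below assumes that this is the kind of loss selected by default; the quoted documentation does not name it, so treat this as an illustration and not as the library's code. The logits are hypothetical values.

```python
import math

def sparse_categorical_crossentropy(logits, label):
    """Cross-entropy between softmax(logits) and an integer class label."""
    m = max(logits)  # subtract the maximum for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

# Hypothetical logits for one review, ordered [negative, positive]
logits = [-1.2, 2.3]
loss_if_positive = sparse_categorical_crossentropy(logits, 1)  # confident and correct
loss_if_negative = sparse_categorical_crossentropy(logits, 0)  # confident but wrong
print(loss_if_positive, loss_if_negative)
```

The loss is small when the larger logit belongs to the true class and large otherwise. This also suggests why a loss such as "binary_crossentropy", which expects a single probability per example, would not be a natural fit for a two-logit output.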

In [None]:
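# Adam with a small learning rate for fine-tuning
adam = tf.keras.optimizers.Adam(
  learning_rate=3e-5,
  # beta_1=0.9,
  # beta_2=0.99
)

# No loss is passed: the model falls back to its built-in, task-appropriate loss
model.compile(
  optimizer=adam,
  # loss="binary_crossentropy",
  metrics=["accuracy"]
)

# Stop once the validation loss has not improved for 5 consecutive epochs
callback = tf.keras.callbacks.EarlyStopping(
  monitor="val_loss",
  patience=5
)

# tf_train is already batched (batch_size=64), so batch_size is not passed to fit()
model.fit(
  tf_train,
  validation_data=tf_val,
  epochs=10,
  callbacks=[callback]
)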

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10


<keras.callbacks.History at 0x7f04ad3123b0>
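The log above ends after epoch 6 of 10, which is what `patience=5` produces if the validation loss is lowest in the first epoch and never improves afterwards. The per-epoch losses are not shown in the output, so the values below are hypothetical, and `early_stopping_epoch` is a simplified sketch of the patience rule, not the Keras implementation.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training stops, or None if it never does."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch
    return None

# Hypothetical validation losses that rise after the first epoch (overfitting)
hypothetical_val_losses = [0.40, 0.45, 0.52, 0.60, 0.66, 0.71, 0.75, 0.80, 0.82, 0.85]
print(early_stopping_epoch(hypothetical_val_losses, patience=5))  # prints 6
```

Note that `EarlyStopping` only restores the weights of the best epoch when `restore_best_weights=True` is set; with the default, the model keeps the weights of the last epoch it trained.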

### Creating metrics functions
We define two helper functions to support our analysis.

The first function, **confusion_matrix**, computes the confusion matrix of the model's predictions on a given subset. This matrix shows how the predicted labels are distributed and how well they align with the actual labels.

The second function, **calculate_metrics**, derives accuracy, precision, recall, and F1-score from a confusion matrix, both as overall measures and on a per-class basis.

In [None]:
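def confusion_matrix(model, subset):
    """Return the confusion matrix of `model` on the "train", "val" or "test" subset."""

    # Set the appropriate dataset based on the subset
    if subset == "test":
        ds = tf_test
    elif subset == "val":
        ds = tf_val
    elif subset == "train":
        ds = tf_train
    else:
        raise ValueError(f"Unknown subset: {subset!r}")

    y_true_list = []
    y_pred_list = []

    # Predict batch by batch, keeping each batch's labels next to its predictions
    for x, y in ds:
        y_pred = model.predict(x)
        # The predicted class is the index of the largest logit
        y_pred_list.append(tf.argmax(y_pred.logits, axis=-1))
        y_true_list.append(y)

    y_pred_list = tf.concat(y_pred_list, axis=0)
    y_true_list = tf.concat(y_true_list, axis=0)

    # Rows correspond to true labels, columns to predicted labels
    return metrics.confusion_matrix(y_true_list, y_pred_list)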
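To make the layout of the matrix concrete, here is a minimal plain-Python sketch of how a confusion matrix is tallied, using the same convention of rows for true labels and columns for predicted labels. `build_confusion_matrix` is a hypothetical helper and the label lists are made-up toy data, not model output.

```python
def build_confusion_matrix(y_true, y_pred, num_classes=2):
    """Count how often each true label (row) receives each predicted label (column)."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm

# Toy labels: 0 = negative, 1 = positive
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 1, 0, 1, 1, 0, 1, 0]

toy_cm = build_confusion_matrix(y_true, y_pred)
print(toy_cm)  # prints [[3, 1], [1, 3]]
```

The diagonal holds the correct predictions; `toy_cm[0][1]` counts negative reviews predicted as positive, and `toy_cm[1][0]` counts the opposite error.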

In [None]:
def calculate_metrics(cm):
  # Confusion Matrix
  print("-----------------------------")
  print(cm)
  print("-----------------------------")
  # Overall Accuracy
  total_samples = sum(sum(row) for row in cm)
  # Correct predictions lie on the diagonal, for any number of classes
  accuracy = sum(cm[i][i] for i in range(len(cm))) / total_samples
  print("Overall Accuracy:", accuracy)

  num_classes = len(cm)
  precision = []
  recall = []
  f1_score = []
  class_accuracy = []

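  # Micro-averaged counts over all classes. They do not depend on the loop
  # below, so they are computed once. For single-label classification,
  # micro-averaged precision and recall are equal to the overall accuracy.
  overall_tp = sum(cm[k][k] for k in range(num_classes))
  overall_fp = sum(sum(cm[j][k] for j in range(num_classes) if j != k) for k in range(num_classes))
  overall_fn = sum(sum(cm[k][j] for j in range(num_classes) if j != k) for k in range(num_classes))

  # Calculate accuracy, precision, recall, and F1-score for each class
  for i in range(num_classes):
    class_total = sum(cm[i])  # Total samples for the class
    class_correct = cm[i][i]  # Correctly classified samples for the class
    # Per-class accuracy as defined here is the same quantity as per-class recall
    class_accuracy.append(class_correct / class_total)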

    # True positives for class i
    tp = cm[i][i]
    # False positives for class i
    fp = sum(cm[j][i] for j in range(num_classes) if j != i)
    # False negatives for class i
    fn = sum(cm[i][j] for j in range(num_classes) if j != i)

    # Precision for class i
    precision.append(tp / (tp + fp) if tp + fp > 0 else 0)
    # Recall for class i
    recall.append(tp / (tp + fn) if tp + fn > 0 else 0)
    # F1-score for class i
    f1_score.append((2 * precision[i] * recall[i]) / (precision[i] + recall[i]) if (precision[i] + recall[i]) > 0 else 0)

  # Calculate overall precision, recall, and F1-score
  overall_precision = overall_tp / (overall_tp + overall_fp) if (overall_tp + overall_fp) > 0 else 0
  overall_recall = overall_tp / (overall_tp + overall_fn) if (overall_tp + overall_fn) > 0 else 0
  overall_f1_score = (2 * overall_precision * overall_recall) / (overall_precision + overall_recall) if (overall_precision + overall_recall) > 0 else 0

  # Print overall precision, recall, and F1-score
  print("Overall Precision:", round(overall_precision, 5))
  print("Overall Recall:", round(overall_recall, 5))
  print("Overall F1-score:", round(overall_f1_score, 5))
  print("-----------------------------")
  # Print accuracy, precision, recall, and F1-score per class
  print("Accuracy per class:", [f"{val:.5f}" for val in class_accuracy])
  print("Precision per class:", [f"{val:.5f}" for val in precision])
  print("Recall per class:", [f"{val:.5f}" for val in recall])
  print("F1 score per class:", [f"{val:.5f}" for val in f1_score])
  print("-----------------------------")

### Evaluate the model on all subsets

We evaluate the model on every subset (train, validation, test).

In [None]:
cm_train = confusion_matrix(model, "train")
cm_val = confusion_matrix(model, "val")
cm_test = confusion_matrix(model, "test")

# Print evaluation results
print("<========== Train ==========>")
calculate_metrics(cm_train)
print("<========== Val ============>")
calculate_metrics(cm_val)
print("<========== Test ===========>")
calculate_metrics(cm_test)
print("<===========================>")

-----------------------------
[[4243   10]
 [   2 4257]]
-----------------------------
Overall Accuracy: 0.9985902255639098
Overall Precision: 0.99859
Overall Recall: 0.99859
Overall F1-score: 0.99859
-----------------------------
Accuracy per class: ['0.99765', '0.99953']
Precision per class: ['0.99953', '0.99766']
Recall per class: ['0.99765', '0.99953']
F1 score per class: ['0.99859', '0.99859']
-----------------------------
-----------------------------
[[420  92]
 [ 52 460]]
-----------------------------
Overall Accuracy: 0.859375
Overall Precision: 0.85938
Overall Recall: 0.85938
Overall F1-score: 0.85938
-----------------------------
Accuracy per class: ['0.82031', '0.89844']
Precision per class: ['0.88983', '0.83333']
Recall per class: ['0.82031', '0.89844']
F1 score per class: ['0.85366', '0.86466']
-----------------------------
-----------------------------
[[416  95]
 [ 48 465]]
-----------------------------
Overall Accuracy: 0.8603515625
Overall Precision: 0.86035
Overall R

### Results

**Training Set:**

The model exhibits outstanding performance on the training set, achieving an overall accuracy, precision, recall, and F1-score of over 99%.

**Validation Set:**

On the validation set, overall accuracy and the other metrics are lower, at around 86%. This gap relative to the training set indicates a degree of overfitting, as noted earlier, but performance is still good overall.

**Test Set:**

Similar to the validation set, the model achieves an overall accuracy, precision, recall, and F1-score of approximately 86% on the test set.
Since test performance matches validation performance, the model generalizes well to unseen data.
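The validation figures can be re-derived by hand from the confusion matrix printed in the evaluation output. The sketch below uses that reported matrix. `per_class_metrics` is a hypothetical helper that mirrors the per-class formulas of `calculate_metrics`, and the macro average at the end is an additional summary that the notebook does not report.

```python
# Validation confusion matrix as reported above (rows: true, columns: predicted)
cm_val = [[420, 92],
          [52, 460]]

def per_class_metrics(cm, i):
    """Precision, recall and F1-score for class i of a confusion matrix."""
    tp = cm[i][i]
    fp = sum(cm[j][i] for j in range(len(cm)) if j != i)
    fn = sum(cm[i][j] for j in range(len(cm)) if j != i)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(sum(row) for row in cm_val)
accuracy = sum(cm_val[i][i] for i in range(len(cm_val))) / total

# Micro-averaging pools the counts of all classes before dividing
tp_all = sum(cm_val[i][i] for i in range(len(cm_val)))
fp_all = total - tp_all  # every misclassification is a false positive for some class
micro_precision = tp_all / (tp_all + fp_all)

# Macro-averaging takes the plain mean of the per-class values
macro_precision = sum(per_class_metrics(cm_val, i)[0] for i in range(2)) / 2

print("accuracy:", accuracy)
print("micro precision:", micro_precision)
print("negative class:", [round(v, 5) for v in per_class_metrics(cm_val, 0)])
print("positive class:", [round(v, 5) for v in per_class_metrics(cm_val, 1)])
print("macro precision:", round(macro_precision, 5))
```

This makes explicit why the "Overall Precision", "Overall Recall" and "Overall F1-score" lines always repeat the overall accuracy: they are micro-averaged. The per-class values are the more informative ones here, showing that the negative class has higher precision while the positive class has higher recall.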

