# NLP TP3: Hello World Transformers

# Mia Hallage

In [2]:
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

### **Question 1:** Understanding Pipelines

1. What is a pipeline in Hugging Face Transformers? What does it abstract away from the user?

A pipeline in Hugging Face Transformers is a high-level interface that runs an entire NLP task through a single function call. It abstracts away the preprocessing (tokenization), the model inference, and the postprocessing: you give it raw text and it returns predictions.

2. Visit the pipeline documentation and list at least 3 other tasks (besides text-classification) that are available.

- Question-answering
- Summarization
- Text generation

3. What happens when you don't specify a model in the pipeline? How can you specify a specific model?

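Hugging Face will select a default model for the task (and print a message saying which one). To choose a specific model, pass its Hub name through the `model` argument, e.g. `pipeline("text-classification", model="distilbert/distilbert-base-uncased-finetuned-sst-2-english")`.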

First thing we will do is to classify the text into two categories: positive or negative.

To do this, we will use a pre-trained model from the Hugging Face library.

We will use the pipeline function to load the model and the text-classification task.

See the documentation for more details: https://huggingface.co/docs/transformers/main/en/pipeline_tutorial

In [8]:
!pip install transformers torch --upgrade



In [3]:
from transformers import pipeline

classifier = pipeline("text-classification")

No model was supplied, defaulted to distilbert/distilbert-base-uncased-finetuned-sst-2-english and revision 714eb0f (https://huggingface.co/distilbert/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


### **Question 2:** Text Classification Deep Dive

1. What is the default model used for text-classification? Look at the output above to find its name, then search for it on the Hugging Face Model Hub.

The model is distilbert/distilbert-base-uncased-finetuned-sst-2-english

2. What dataset was this model fine-tuned on? What kind of text does it work best with?


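This model was fine-tuned on the SST-2 dataset (Stanford Sentiment Treebank, binary version), which is built from movie review sentences. It works best on short English texts that express a sentiment.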

3. The output includes a score field. What does this score represent? What range of values can it have?

The score field is the probability assigned to the predicted class. It ranges from 0 to 1, and higher values mean greater confidence.
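
As a minimal sketch of where such a score can come from, assume the classification head produces one raw logit per class and a softmax turns the logits into probabilities. The logit values below are made up for illustration and are not taken from the model.

```python
import math

def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two sentiment classes (illustrative values only).
labels = ["NEGATIVE", "POSITIVE"]
logits = [1.8, -0.4]

probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
prediction = {"label": labels[best], "score": probs[best]}
print(prediction)
```

The reported score is the probability of the winning class only, which is why it always lies between 0 and 1.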

4. Challenge: Find a different text-classification model on the Hub that classifies emotions (not just positive/negative). What is its name?

j-hartmann/emotion-english-distilroberta-base

In [10]:
import pandas as pd

outputs = classifier(text)
pd.DataFrame(outputs)    

|   | label | score |
|---|---|---|
| 0 | NEGATIVE | 0.901546 |


In [11]:
ner_tagger = pipeline("ner", aggregation_strategy="simple")
outputs = ner_tagger(text)
pd.DataFrame(outputs)    

No model was supplied, defaulted to dbmdz/bert-large-cased-finetuned-conll03-english and revision 4c53496 (https://huggingface.co/dbmdz/bert-large-cased-finetuned-conll03-english).
Using a pipeline without specifying a model name and revision in production is not recommended.
Some weights of the model checkpoint at dbmdz/bert-large-cased-finetuned-conll03-english were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Device set to use mps:0


|   | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | ORG | 0.879009 | Amazon | 5 | 11 |
| 1 | MISC | 0.990859 | Optimus Prime | 36 | 49 |
| 2 | LOC | 0.999755 | Germany | 90 | 97 |
| 3 | MISC | 0.556568 | Mega | 208 | 212 |
| 4 | PER | 0.590258 | ##tron | 212 | 216 |
| 5 | ORG | 0.669692 | Decept | 253 | 259 |
| 6 | MISC | 0.498349 | ##icons | 259 | 264 |
| 7 | MISC | 0.77536 | Megatron | 350 | 358 |
| 8 | MISC | 0.987854 | Optimus Prime | 367 | 380 |
| 9 | PER | 0.812097 | Bumblebee | 502 | 511 |


### **Question 3:** Named Entity Recognition (NER)

1. What does the aggregation_strategy="simple" parameter do in the NER pipeline? Check the token classification documentation.

It groups consecutive tokens that have the same entity label into one merged entity (for example, "Optimus" and "Prime" become the single MISC entity "Optimus Prime").
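
A simplified sketch of this grouping, assuming adjacent tokens are merged whenever they carry the same label and that "O" marks non-entity tokens. `group_entities` is a hypothetical helper written for illustration, not the library function; the real pipeline also handles B-/I- prefixes and combines the scores.

```python
def group_entities(tokens):
    """Merge adjacent (word, label) pairs that share the same entity label."""
    groups = []
    previous_label = "O"
    for word, label in tokens:
        if label == "O":
            # Non-entity token: it ends any entity that was being built.
            previous_label = "O"
            continue
        if label == previous_label:
            # Same label as the previous token: extend the current entity.
            piece = word[2:] if word.startswith("##") else " " + word
            groups[-1]["word"] += piece
        else:
            groups.append({"entity_group": label, "word": word})
        previous_label = label
    return groups

# Token-level labels chosen to mirror the rows of the NER output.
tokens = [
    ("Optimus", "MISC"), ("Prime", "MISC"),  # same label: merged
    ("action", "O"),
    ("Mega", "MISC"), ("##tron", "PER"),     # different labels: kept apart
]
entities = group_entities(tokens)
print(entities)
```

Under these assumptions, "Mega" and "##tron" stay separate because their predicted labels differ, which matches rows 3 and 4 of the output table.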

2. Looking at the output above, what do the entity types mean? (ORG, MISC, LOC, PER)

- PER: Person (names)
- ORG: Organization (companies)
- LOC: Location 
- MISC: Miscellaneous entities (e.g. events, product names)

3. Why do some words appear with ## prefix (like ##tron and ##icons)? What does this indicate about tokenization?

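The ## prefix means the piece is a continuation of the previous token, not the start of a new word. It shows that the model uses subword (WordPiece) tokenization: words that are not in the vocabulary are split into smaller known pieces.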
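
A minimal sketch of how such pieces map back to whole words, assuming the WordPiece convention that a leading ## marks a continuation. The pieces are the ones visible in the output above; the tokenizer may split the words further internally.

```python
def merge_wordpieces(pieces):
    """Join WordPiece tokens back into whole words."""
    words = []
    for piece in pieces:
        if piece.startswith("##") and words:
            # Continuation piece: glue it onto the previous word.
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words

merged = merge_wordpieces(["Mega", "##tron", "and", "Decept", "##icons"])
print(merged)  # ['Megatron', 'and', 'Decepticons']
```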

4. The model seems to have split "Megatron" and "Decepticons" incorrectly. Why might this happen? What does this tell you about the model's training data?

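Probably because names from the Transformers franchise, such as Megatron and Decepticons, are rare or absent in the training data (news articles). The tokenizer therefore breaks them into subword pieces, and the model assigns different labels to the pieces (e.g. MISC for Mega and PER for ##tron), so they are not merged into one entity.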

5. Challenge: Find the model card for dbmdz/bert-large-cased-finetuned-conll03-english. What is the CoNLL-2003 dataset?

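The model is a BERT-large (cased) model fine-tuned for named entity recognition on the CoNLL-2003 dataset. CoNLL-2003 is the dataset of the CoNLL-2003 shared task on NER; its English part consists of Reuters newswire articles annotated with four entity types: PER, ORG, LOC and MISC.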

In [12]:
reader = pipeline("question-answering")
question = "What does the customer want?"
outputs = reader(question=question, context=text)
pd.DataFrame([outputs])    

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use mps:0


|   | score | start | end | answer |
|---|---|---|---|---|
| 0 | 0.631292 | 335 | 358 | an exchange of Megatron |


### **Question 4:** Question Answering Systems

1. What type of question answering is this? (Extractive vs. Generative) Check the question answering documentation.

This is extractive question answering: the model selects a span of the context as its answer and does not generate new text.

2. The model outputs start and end indices. What do these represent? Why are they important?

start: character position in the context where the answer begins \
end: character position in the context where the answer ends \
They are important because extractive models do not write an answer; they predict where the answer is located in the input text.
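
The indices can be checked directly with a Python slice. This sketch re-declares the same `text` so that it runs on its own, and assumes that `start` and `end` are character offsets with an exclusive end, which is consistent with the output above.

```python
text = """Dear Amazon, last week I ordered an Optimus Prime action figure \
from your online store in Germany. Unfortunately, when I opened the package, \
I discovered to my horror that I had been sent an action figure of Megatron \
instead! As a lifelong enemy of the Decepticons, I hope you can understand my \
dilemma. To resolve the issue, I demand an exchange of Megatron for the \
Optimus Prime figure I ordered. Enclosed are copies of my records concerning \
this purchase. I expect to hear from you soon. Sincerely, Bumblebee."""

# Values reported by the question-answering pipeline.
start, end = 335, 358

answer = text[start:end]  # end is exclusive, like any Python slice
print(answer)  # an exchange of Megatron
```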

3. What is the SQuAD dataset? (Look up the model distilbert-base-cased-distilled-squad on the Hub)

It is the Stanford Question Answering Dataset, a large dataset of question-answer pairs based on Wikipedia articles. Each answer is a span of text extracted from the corresponding article.

4. Try to think of a question this model CANNOT answer based on the text. Why would it fail?

The model cannot answer questions whose answer is not explicitly stated in the text, such as questions requiring external knowledge (e.g. "How much did the action figure cost?"). It fails because an extractive model can only return a span of the context and cannot generate new information.

5. Challenge: What's the difference between extractive and generative question answering? Find an example of a generative QA model on the Hub.

Extractive QA selects a span from the given text, while generative QA produces new sentences. An example of a model on the Hub that can be used for generative QA is google/flan-t5-base.

### **Question 5:** Text Summarization

1. What is the difference between extractive and abstractive summarization? Check the summarization documentation.

Extractive summarization selects existing sentences or phrases from the original text without creating new wording. \
Abstractive summarization generates new sentences that may not appear in the original text. For example, in this lab the default pipeline uses an abstractive model (BART).

2. Looking at the code in the next cell, what is the default model used for summarization? Search for it on the Hugging Face Model Hub and determine:

- Is it an extractive or abstractive model?
- What architecture does it use? (Hint: look at the model name)
- What dataset was it trained on?


The default is a DistilBART model (a distilled version of BART), which is an abstractive encoder-decoder model for text generation. \
It is trained on the CNN/DailyMail summarization dataset, which consists of news articles and human-written highlights.

3. What do the max_length and min_length parameters control? What happens if min_length > max_length?

max_length sets the maximum number of tokens the generated summary can contain. \
min_length sets the minimum number of tokens. \
If min_length > max_length, the two constraints conflict and you get a warning.

4. The parameter clean_up_tokenization_spaces=True is used. What does this parameter do? Why might it be useful for summarization?

It removes tokenization artifacts, mainly unwanted spaces before punctuation and contractions. It is useful because text decoded from tokens can contain such stray spaces, and cleaning them up improves the readability of the summary.
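
A rough sketch of this kind of cleanup, assuming it boils down to a handful of string replacements. `clean_up_spaces` is a hypothetical helper and the replacement list is an approximation for illustration, not the library's exact code.

```python
def clean_up_spaces(text):
    """Remove the spaces that decoding can leave before punctuation and contractions."""
    replacements = [
        (" .", "."), (" ?", "?"), (" !", "!"), (" ,", ","),
        (" n't", "n't"), (" 'm", "'m"), (" 's", "'s"),
        (" 've", "'ve"), (" 're", "'re"),
    ]
    for old, new in replacements:
        text = text.replace(old, new)
    return text

raw = "Bumblebee did n't get Optimus Prime , he got Megatron !"
cleaned = clean_up_spaces(raw)
print(cleaned)  # Bumblebee didn't get Optimus Prime, he got Megatron!
```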


5. Challenge: Find two different summarization models on the Hub:

- One optimized for short texts (like news articles)
- One that can handle longer documents \
Compare their architectures and training data.

- Short texts: facebook/bart-large-cnn \
  Architecture: BART (encoder-decoder), fine-tuned on CNN/DailyMail to produce short news summaries
- Longer documents: google/pegasus-large \
  Architecture: PEGASUS (encoder-decoder), pretrained on a large mixed corpus (C4 and HugeNews)


### Note: I had to change this part of the code because it required torch 2.6, which I could not install on my computer despite several attempts

In [None]:
summarizer = pipeline("summarization")
outputs = summarizer(text, max_length=45, clean_up_tokenization_spaces=True)
print(outputs[0]['summary_text'])

### **Question 6:** Machine Translation

1. What is the architecture behind the Helsinki-NLP/opus-mt-en-de model? Look it up on the Model Hub.
- What does "OPUS" stand for?
- What does "MT" stand for?

It is based on the MarianMT architecture, a Transformer encoder-decoder (sequence-to-sequence) model for neural machine translation.
- OPUS (Open Parallel Corpus): a huge collection of multilingual parallel texts used to train translation models
- MT: Machine Translation

2. How would you find a model to translate from English to French? Visit the translation documentation and the Model Hub to find at least 2 different models.

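You can search the Hugging Face Model Hub (filtering by the Translation task and the languages) or look at the translation pipeline documentation. Two examples:
- Helsinki-NLP/opus-mt-en-fr
- google-t5/t5-base (English-to-French translation is one of the tasks T5 was trained on)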

3. What is the difference between bilingual and multilingual translation models? What are the advantages and disadvantages of each?

- Bilingual model: translates one specific language pair. It is usually high quality for that pair, smaller and faster, but you need a separate model for every pair and direction
- Multilingual model: can translate between many language pairs, so one model handles many translation directions. However, it is often lower quality on a given pair and more memory-intensive

4. In the code, we specify the task as "translation_en_to_de". How does this relate to the model we're loading?

The pipeline task name tells Hugging Face how to process inputs/outputs (i.e., English input → German output) \
The model name specifies which translation model to load.

5. The output shows a warning about sacremoses. What is this library used for in NLP? Check the MarianMT documentation.

It is a text normalization and tokenization library, a Python port of the Moses preprocessing tools. It performs tokenization, detokenization, punctuation normalization and truecasing.

MarianMT models were originally trained using Moses tokenization, so sacremoses helps reproduce the same preprocessing.

6. Challenge: Find a multilingual model (like mBART or M2M100) that can translate between multiple language pairs. How many language pairs does it support?

facebook/mbart-large-50-many-to-many-mmt \
Supports 50 languages \
Can translate between 50 × 49 = 2,450 possible language pairs (counting each direction, source → target, separately)
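
The count can be double-checked by enumerating every ordered (source, target) pair, assuming each of the 50 languages can be translated into each of the other 49:

```python
from itertools import permutations

num_languages = 50

# Each ordered (source, target) pair with source != target is one direction.
directions = sum(1 for _ in permutations(range(num_languages), 2))
print(directions)  # 2450
```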

In [None]:
translator = pipeline("translation_en_to_de", 
                      model="Helsinki-NLP/opus-mt-en-de")
outputs = translator(text, clean_up_tokenization_spaces=True, min_length=100)
print(outputs[0]['translation_text'])

### **Question 7:** Text Generation

1. What is the default model used for text generation in the code below? Look it up on the Hub and answer:
- What architecture does GPT-2 use? (decoder-only, encoder-decoder, or encoder-only?)
- How many parameters does the base GPT-2 model have?
- What type of generation does it perform? (autoregressive, non-autoregressive, etc.)

GPT-2 uses a decoder-only Transformer architecture. \
The base model has 124 million parameters. \
It performs autoregressive generation (it predicts the next token using only the previous tokens).
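
A toy sketch of an autoregressive loop, in which a hypothetical lookup table stands in for the language model. A real model conditions on the whole prefix and outputs a probability distribution; here the "model" only looks at the last token so that the example stays tiny.

```python
# Hypothetical next-token table standing in for a trained language model.
NEXT_TOKEN = {"I": "demand", "demand": "an", "an": "exchange", "exchange": "<eos>"}

def generate(prompt, max_new_tokens=10):
    """Append one token at a time, each chosen from what has been produced so far."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        next_token = NEXT_TOKEN.get(tokens[-1], "<eos>")
        if next_token == "<eos>":
            break  # the end-of-sequence token stops generation
        tokens.append(next_token)
    return tokens

generated = generate(["I"])
print(generated)  # ['I', 'demand', 'an', 'exchange']
```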

2. Why do we use set_seed(42) before generation? What would happen without it? Check the generation documentation.

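set_seed(42) fixes the seed of the random number generators, which makes the sampling process reproducible: the same prompt gives the same text on every run. Without it, generation that uses sampling would produce a different text each time the cell is run.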
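
The effect of seeding can be illustrated with Python's standard `random` module alone. This is only a stand-in: it assumes the sampling step behaves like any seeded pseudo-random generator, and `sample_tokens` is a hypothetical helper, not part of Transformers.

```python
import random

def sample_tokens(seed, n=5):
    """Draw n pseudo-random token ids after seeding the generator."""
    random.seed(seed)
    return [random.randrange(50257) for _ in range(n)]  # 50257 = GPT-2 vocabulary size

run_a = sample_tokens(42)
run_b = sample_tokens(42)

# Same seed, same sequence of "tokens".
print(run_a == run_b)  # True
```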

3. The code uses max_length=200. What other parameters can control text generation? Research and explain:
- temperature
- top_k
- do_sample


- temperature: controls the randomness of sampling \
Low temperature (< 0.7) → more predictable, conservative output \
High temperature (> 1.0) → more creative, diverse, chaotic output
- top_k: restricts sampling to the k most likely tokens
- do_sample: enables random sampling instead of greedy decoding \
do_sample=False → deterministic (always picks the most likely token) \
do_sample=True → introduces randomness, more creative text
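
The three parameters can be sketched together in a toy version of a single decoding step. The logits and the `next_token` helper are hypothetical and only mimic the general idea; this is not the library's implementation.

```python
import math
import random

def next_token(logits, do_sample=False, temperature=1.0, top_k=None, rng=random):
    """Pick one token index from raw logits (toy version of one decoding step)."""
    if not do_sample:
        # Greedy decoding: always take the highest-scoring token.
        return max(range(len(logits)), key=logits.__getitem__)

    # Temperature rescales the logits: < 1 sharpens, > 1 flattens the distribution.
    scaled = [x / temperature for x in logits]

    # top_k keeps only the k highest-scoring candidates.
    candidates = sorted(range(len(scaled)), key=scaled.__getitem__, reverse=True)
    if top_k is not None:
        candidates = candidates[:top_k]

    # Unnormalized softmax weights over the remaining candidates, then sample.
    peak = max(scaled[i] for i in candidates)
    weights = [math.exp(scaled[i] - peak) for i in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical scores for a 4-token vocabulary
rng = random.Random(42)

greedy = next_token(logits)
sampled = [
    next_token(logits, do_sample=True, temperature=0.7, top_k=2, rng=rng)
    for _ in range(20)
]
print(greedy, sorted(set(sampled)))
```

With `top_k=2`, only the two best tokens (indices 0 and 1) can ever be drawn, whatever the temperature.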

4. Looking at the output, you can see a warning about truncation. What does this mean? Why is the input being truncated?

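Truncation means cutting off part of the input because it exceeds the allowed maximum length. The tokenized prompt has to fit within the length limit (the max_length setting, and at most GPT-2's 1,024-token context window), so tokens beyond that limit are dropped.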

5. What does pad_token_id being set to eos_token_id mean? Why is this necessary for GPT-2?

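Meaning:
- The end-of-sequence token is reused as the padding token, which prevents errors when batching sequences of different lengths
- Together with the attention mask, it ensures the model does not try to interpret padding as real text

It is necessary because GPT-2 was originally trained without padding, so its tokenizer has no dedicated padding token.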
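
A minimal sketch of what padding with the end-of-sequence id looks like, using plain Python lists. The token ids in `batch` are made up, `pad_batch` is a hypothetical helper, and right-padding is used only for simplicity (decoder-only models are commonly left-padded for generation).

```python
EOS_TOKEN_ID = 50256  # GPT-2's end-of-text token id, reused here as the pad id

def pad_batch(sequences, pad_token_id=EOS_TOKEN_ID):
    """Right-pad lists of token ids to equal length and build attention masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        padding = max_len - len(seq)
        input_ids.append(seq + [pad_token_id] * padding)
        # 1 marks a real token, 0 marks padding the model should ignore.
        attention_mask.append([1] * len(seq) + [0] * padding)
    return input_ids, attention_mask

batch = [[464, 3290], [15496, 11, 995, 0]]  # made-up token ids
input_ids, attention_mask = pad_batch(batch)
print(input_ids)
print(attention_mask)
```

The attention mask is what tells the model which positions are padding, since the pad id itself is indistinguishable from a genuine end-of-sequence token.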

6. What are the trade-offs between model size and generation quality?

Bigger model = generally better text quality, but slower and more memory-hungry \
Smaller model = faster + easier to run locally, but lower text quality