# Lecture 2 - Introduction to Transformers Laboratory - Part 1

## Overview of this Notebook

For this laboratory we will see how to use the **Hugging Face Transformers library**, which gives us access to thousands of models shared by other users, ranging from models that are only pre-trained to models that are fully fine-tuned and ready for a downstream task (e.g. classifiers).

In particular, we'll see:
- How to find models on the HuggingFace model webpage;
- How to load and use Text and Token Classification models;
- How to load and use Text Generation models;



## ⚠️⚠️ Changing Runtime ⚠️⚠️

To use PyTorch's underlying CUDA implementation, which speeds up the Transformers library, **change the Colab runtime type to GPU** like so:

Runtime > Change Runtime Type > T4 GPU
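
After switching the runtime, it can be useful to confirm that a GPU is actually visible. The sketch below is one possible check, assuming PyTorch is the backend in use; the helper name `gpu_available` is our own and not part of any library.

```python
import importlib.util


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    # If PyTorch is not installed at all, there is nothing to check.
    if importlib.util.find_spec("torch") is None:
        return False
    import torch

    return torch.cuda.is_available()


print("GPU available:", gpu_available())
```

If this prints `False` on Colab, the runtime type has most likely not been changed yet.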

# Installing Transformers Library

To install the Transformers library, we use pip, the Python package manager.


In [None]:
!pip install transformers



# How to Use a Pre-trained Transformer Model

To use an available Transformer model, you can choose from the user-uploaded selection on [the Hugging Face models page](https://huggingface.co/models).



## Text Classification Model
Let's try a text-classification model. This is an UmBERTo model fine-tuned for an emotion classification task with four labels:
- Joy
- Fear
- Sadness
- Anger

We use the `pipeline` object from the Transformers library, which handles tokenization, model inference, and post-processing for us.

In [None]:
from transformers import pipeline
classifier = pipeline("text-classification", model='MilaNLProc/feel-it-italian-emotion')

Device set to use cuda:0


In [None]:
prediction = classifier("Oggi sono proprio contento!")
print(prediction)

[{'label': 'joy', 'score': 0.9993919134140015}]


In [None]:
prediction = classifier("Oggi sono morti un sacco di bambini in guerra")
print(prediction)

[{'label': 'fear', 'score': 0.9564937353134155}]


In [None]:
prediction = classifier("Ho pianto un sacco per il finale di quel film")
print(prediction)

[{'label': 'sadness', 'score': 0.9985344409942627}]


In [None]:
prediction = classifier("Io odio sentir parlare di queste stupidaggini!")
print(prediction)

[{'label': 'anger', 'score': 0.9982812404632568}]


### What's happening inside the "pipeline" object?

The pipeline object is a shorthand that takes care of a series of steps for you. Let's see how to do the same thing manually.

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [None]:
tokenizer = AutoTokenizer.from_pretrained("MilaNLProc/feel-it-italian-emotion")

In [None]:
classifier = AutoModelForSequenceClassification.from_pretrained("MilaNLProc/feel-it-italian-emotion")

In [None]:
inputs = tokenizer("Ho pianto un sacco per il finale di quel film", add_special_tokens=True, return_tensors="pt")

# We ask for PyTorch tensors (return_tensors="pt") so the result can be passed straight to the model; otherwise the tokenizer returns plain Python lists

In [None]:
inputs

{'input_ids': tensor([[    5,  1281, 22605,    44,  6194,    50,    59,  2292,    21,   266,
          1304,     6]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]])}

In [None]:
inputs['input_ids'].shape

torch.Size([1, 12])
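
The `attention_mask` is all ones here because we tokenized a single sentence, so there is nothing to pad. When several sentences of different lengths are batched together, shorter ones are padded and the mask marks which positions are real tokens. The following is a minimal pure-Python sketch of that idea, not the tokenizer's actual implementation: `pad_batch`, the value of `PAD_ID`, and the ids of the short sentence are all hypothetical.

```python
def pad_batch(sequences, pad_id):
    """Pad every sequence to the longest one and build the matching masks."""
    max_len = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = max_len - len(seq)
        # Real tokens first, then padding on the right
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 = real token, 0 = padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


PAD_ID = 1  # hypothetical padding id, chosen only for illustration
long_sentence = [5, 1281, 22605, 44, 6194, 50, 59, 2292, 21, 266, 1304, 6]
short_sentence = [5, 100, 200, 6]  # made-up ids for a shorter sentence

input_ids, attention_mask = pad_batch([long_sentence, short_sentence], PAD_ID)
print(attention_mask[1])
```

With a real tokenizer, passing a list of sentences together with `padding=True` produces padded `input_ids` and an `attention_mask` of this kind.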

In [None]:
# We can always decode these input_ids back to text with:
tokenizer.decode(inputs['input_ids'][0])

'<s> Ho pianto un sacco per il finale di quel film</s>'

In [None]:
with torch.no_grad():
    output = classifier(**inputs)

In [None]:
output.logits

tensor([[-2.2439, -2.4211, -1.8102,  5.4982]])

In [None]:
output.logits.shape

torch.Size([1, 4])

In [None]:
predicted_class_index = torch.argmax(output.logits, dim=1).item()

In [None]:
print(predicted_class_index)

3


In [None]:
# Great, but we don't know which emotion this index refers to. Luckily, the model config stores the mapping:

print(classifier.config.id2label)

{0: 'anger', 1: 'fear', 2: 'joy', 3: 'sadness'}


In [None]:
predicted_class_string = classifier.config.id2label[predicted_class_index]
predicted_class_string

'sadness'

What about **prediction "probability"**?

We just need to apply a softmax to the logits.

In [None]:
import torch.nn.functional as F

In [None]:
probabilities = F.softmax(output.logits, dim=1)
probabilities = probabilities[0].tolist()
probabilities

[0.0004335226840339601,
 0.0003631210420280695,
 0.000668908644001931,
 0.9985344409942627]
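
To see what the softmax is doing, here is a small pure-Python sketch that applies it to the four logits printed earlier (rounded to four decimals). The helper `softmax` is our own; it subtracts the maximum before exponentiating, a common trick for numerical stability.

```python
import math


def softmax(logits):
    """Turn a list of logits into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Logits copied from the model output above
logits = [-2.2439, -2.4211, -1.8102, 5.4982]
probs = softmax(logits)

# Index of the most likely class, same role as torch.argmax
best = max(range(len(probs)), key=probs.__getitem__)
print(best, probs[best])
```

The largest logit dominates after exponentiation, which is why the winning class ends up with a probability very close to 1.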

Now we have everything we need to re-create the output structure of the pipeline object:

In [None]:
{
    'label': predicted_class_string, 'score': probabilities[predicted_class_index]
}

{'label': 'sadness', 'score': 0.9985344409942627}

In [None]:
del classifier, tokenizer

## Token Classification Model

Now, a token-classification model, which outputs a label for each token. This one in particular performs Named Entity Recognition (NER): each relevant token (i.e. each token that is part of a named entity) receives one of these labels:

- B-PER
- I-PER
- B-ORG
- I-ORG
- B-LOC
- I-LOC
- B-MISC
- I-MISC

The model we chose is based on the XLM-RoBERTa architecture, a multilingual variant of RoBERTa, which is why it can handle Italian text even though it was fine-tuned on the English CoNLL-2003 dataset.
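
In these labels, the `B-` prefix conventionally marks the beginning of an entity and `I-` marks a token inside one, while the suffix (PER, ORG, LOC, MISC) is the entity type. The sketch below shows one way to group word-level tags into full entities. It is a simplified illustration with made-up words and tags, and `group_entities` is our own helper, not a library function.

```python
def group_entities(words, tags):
    """Group consecutive words with the same entity type into one entity."""
    entities = []
    current_words, current_type = [], None
    for word, tag in zip(words, tags):
        if tag == "O":  # "O" means the word is outside any entity
            prefix, ent_type = "O", None
        else:
            prefix, ent_type = tag.split("-", 1)
        # A new entity starts on a B- tag or whenever the type changes
        starts_new = prefix == "B" or ent_type != current_type
        if current_words and (ent_type is None or starts_new):
            entities.append((" ".join(current_words), current_type))
            current_words, current_type = [], None
        if ent_type is not None:
            current_words.append(word)
            current_type = ent_type
    if current_words:  # do not forget the entity still in progress
        entities.append((" ".join(current_words), current_type))
    return entities


words = ["Sergio", "Mattarella", "visited", "Rome"]
tags = ["B-PER", "I-PER", "O", "B-LOC"]
print(group_entities(words, tags))
```

The same function also copes with taggers that only emit `I-` tags, since a change of type or an `O` tag closes the current entity.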


In [None]:
classifier = pipeline("token-classification", model="xlm-roberta-large-finetuned-conll03-english")

Some weights of the model checkpoint at xlm-roberta-large-finetuned-conll03-english were not used when initializing XLMRobertaForTokenClassification: ['roberta.pooler.dense.bias', 'roberta.pooler.dense.weight']
- This IS expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing XLMRobertaForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


Device set to use cuda:0


In [None]:
prediction = classifier("Il Presidente della Repubblica Italiana, Sergio Mattarella, ha incontrato il Presidente degli Stati Uniti d'America a Roma. "
"Dopo una visita ai Musei Vaticani, si sono recati sulle sponde del Tevere, per poi tornare al Quirinale.")

In [None]:
for entity in prediction:
  print(entity)

{'entity': 'I-LOC', 'score': np.float32(0.9997929), 'index': 4, 'word': '▁Repubblica', 'start': 20, 'end': 30}
{'entity': 'I-LOC', 'score': np.float32(0.99974006), 'index': 5, 'word': '▁Italiana', 'start': 31, 'end': 39}
{'entity': 'I-PER', 'score': np.float32(0.9999914), 'index': 7, 'word': '▁Sergio', 'start': 41, 'end': 47}
{'entity': 'I-PER', 'score': np.float32(0.99999416), 'index': 8, 'word': '▁Matt', 'start': 48, 'end': 52}
{'entity': 'I-PER', 'score': np.float32(0.9999653), 'index': 9, 'word': 'a', 'start': 52, 'end': 53}
{'entity': 'I-PER', 'score': np.float32(0.999987), 'index': 10, 'word': 'rella', 'start': 53, 'end': 58}
{'entity': 'I-LOC', 'score': np.float32(0.9999888), 'index': 18, 'word': '▁Stati', 'start': 94, 'end': 99}
{'entity': 'I-LOC', 'score': np.float32(0.99998724), 'index': 19, 'word': '▁Uniti', 'start': 100, 'end': 105}
{'entity': 'I-LOC', 'score': np.float32(0.99993765), 'index': 20, 'word': '▁d', 'start': 106, 'end': 107}
...

To reconstruct the tokens into full words, one might do something like:

In [None]:
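# Merge sub-word tokens back into whole words.
# Note: the first print emits an empty line, and the last word in progress is never printed.
in_progress = ""
in_progress_entity = ""
for entity in prediction:
  if "▁" in entity['word']:
    # "▁" marks the start of a new word: print the word built so far, then start a new one
    print(in_progress, in_progress_entity)
    in_progress = entity['word'][1:]
    in_progress_entity = entity['entity'].split("-")[1]
  else:
    # Continuation piece: append it to the word in progress
    in_progress += entity['word']
    in_progress_entity = entity['entity'].split("-")[1]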

 
Repubblica LOC
Italiana LOC
Sergio PER
Mattarella PER
Stati LOC
Uniti LOC
d'America LOC
Roma LOC
Musei LOC
Vaticani LOC
Tevere LOC
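
A slightly more careful variant of the same idea collects the words in a list and also emits the word still in progress when the loop ends. This is only a sketch: `merge_subwords` is our own helper, and the sample entries below keep just the `entity` and `word` fields of a few tokens from the model output.

```python
def merge_subwords(entities):
    """Merge sub-word pieces into (word, entity_type) pairs."""
    words = []
    current_word, current_type = "", ""
    for entity in entities:
        piece = entity["word"]
        if piece.startswith("▁"):
            # A leading "▁" starts a new word: store the previous one first
            if current_word:
                words.append((current_word, current_type))
            current_word = piece[1:]
        else:
            # Continuation piece: glue it to the word in progress
            current_word += piece
        current_type = entity["entity"].split("-")[1]
    if current_word:  # flush the last word, which a plain loop would drop
        words.append((current_word, current_type))
    return words


sample = [
    {"entity": "I-PER", "word": "▁Sergio"},
    {"entity": "I-PER", "word": "▁Matt"},
    {"entity": "I-PER", "word": "a"},
    {"entity": "I-PER", "word": "rella"},
    {"entity": "I-LOC", "word": "▁Stati"},
    {"entity": "I-LOC", "word": "▁Uniti"},
]
print(merge_subwords(sample))
```

The token-classification pipeline also accepts an `aggregation_strategy` argument (for example `"simple"`) that performs this kind of grouping for you, so a hand-written loop is mostly useful for understanding what happens.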


## Generating Text with Transformers

While Transformers take many forms, today's well-known Large Language Models (LLMs) are mostly decoder-only Transformers, trained with generative pre-training, and are mostly used in generation tasks (even classification or regression can be cast as a text-to-text task).   
We will try [Minerva](https://huggingface.co/sapienzanlp/Minerva-3B-base-v1.0), a family of LLMs pre-trained on both Italian and English texts. To do that, you'll need a Hugging Face account, and you'll need to agree to share your contact information to access the model.
Then, you'll need to log in to Hugging Face from the notebook. This is a fairly standard procedure for most Large Language Models released by big tech companies (e.g. Llama requires a similar process), so it's useful to see it here.

In [None]:
import torch
from transformers import pipeline

In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
model = pipeline(
    "text-generation",
    model="sapienzanlp/Minerva-3B-base-v1.0",
    model_kwargs={"torch_dtype": torch.bfloat16},
    device_map="auto",
)

Device set to use cuda:0
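
Loading the weights in `bfloat16` matters for memory. The two safetensors shards downloaded for this model are roughly 4.97 GB and 817 MB. The back-of-the-envelope sketch below assumes that the checkpoint stores 2 bytes per parameter and that 1 GB is 10^9 bytes; both are our assumptions, and the resulting numbers are estimates only.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2}

# Shard sizes in GB (4.97 GB and 817 MB)
shard_sizes_gb = [4.97, 0.817]
checkpoint_gb = sum(shard_sizes_gb)

# Assumption: the checkpoint stores 16-bit weights (2 bytes per parameter)
approx_params = checkpoint_gb * 1e9 / BYTES_PER_PARAM["bfloat16"]

# The same weights held in float32 would need twice the memory
float32_gb = approx_params * BYTES_PER_PARAM["float32"] / 1e9

print(f"~{approx_params / 1e9:.2f}B parameters")
print(f"bfloat16: ~{checkpoint_gb:.1f} GB, float32: ~{float32_gb:.1f} GB")
```

Under these assumptions the weights alone take a bit under 6 GB in `bfloat16`, which fits on a T4 GPU, while a `float32` copy would need about twice as much.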


In [None]:
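output = model(
    "La capitale dell'Italia è",
    temperature=0.0,   # ignored when do_sample=False (hence the warning in the output)
    do_sample=False,   # greedy decoding: always pick the most likely next token
    max_new_tokens=10  # generate at most 10 tokens after the prompt
)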

output

The following generation flags are not valid and may be ignored: ['temperature']. Set `TRANSFORMERS_VERBOSITY=info` for more details.
Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


[{'generated_text': "La capitale dell'Italia è Roma, la città più grande e più visitata"}]
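
With `do_sample=False` the most likely token is always chosen, so the temperature plays no role. Temperature only matters when sampling: the logits are divided by it before the softmax, which sharpens or flattens the distribution. The sketch below illustrates this on made-up logits; `softmax_with_temperature` and `toy_logits` are hypothetical and not taken from the model.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax over logits divided by a positive temperature."""
    if temperature <= 0:
        # Dividing by zero is undefined; greedy decoding covers that limit
        raise ValueError("temperature must be positive when sampling")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


toy_logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(toy_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures push probability mass towards the top token, and as the temperature approaches zero, sampling behaves more and more like greedy decoding.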

In [None]:
output[0]['generated_text']

"La capitale dell'Italia è Roma, la città più grande e più visitata"

### What's Inside the Pipeline object (again)?

We will now look at how to use a generation model without the pipeline object, doing everything manually.

In [None]:
del model

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [None]:
tokenizer = AutoTokenizer.from_pretrained("sapienzanlp/Minerva-3B-base-v1.0")

In [None]:
model = AutoModelForCausalLM.from_pretrained("sapienzanlp/Minerva-3B-base-v1.0", torch_dtype=torch.bfloat16)


In [None]:
inputs = tokenizer("La capitale dell'Italia è", add_special_tokens=True, return_tensors="pt")

In [None]:
inputs

{'input_ids': tensor([[    1,   613,  8309,   370, 32379,  3065,   413]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1]])}

In [None]:
inputs['input_ids'].shape

torch.Size([1, 7])

In [None]:
tokenizer.decode(inputs['input_ids'][0])

"<s> La capitale dell'Italia è"

In [None]:
with torch.no_grad():
  output = model(**inputs)
output

CausalLMOutputWithPast(loss=None, logits=tensor([[[22.8750, 41.5000, 34.2500,  ..., 35.2500, 32.2500, 32.2500],
         [71.5000, 83.5000, 72.5000,  ..., 83.5000, 81.0000, 82.5000],
         [70.0000, 85.0000, 75.0000,  ..., 79.0000, 78.0000, 78.0000],
         ...,
         [53.7500, 75.5000, 60.0000,  ..., 72.0000, 69.5000, 72.5000],
         [70.5000, 86.5000, 76.0000,  ..., 77.5000, 78.0000, 77.5000],
         [71.0000, 87.0000, 74.0000,  ..., 81.0000, 79.5000, 80.5000]]],
       dtype=torch.bfloat16), past_key_values=None, hidden_states=None, attentions=None)

In [None]:
logits = output.logits

In [None]:
logits.shape

torch.Size([1, 7, 32768])

In [None]:
all_predictions = torch.argmax(logits, dim=-1)

In [None]:
tokenizer.decode(all_predictions.squeeze())

" nostra della'Al è Roma"

For each token, the model predicts what the next token should be, given the left-context up to that token.   
In this case, we have:

In [None]:
left_context = ""
for index in range(inputs['input_ids'].shape[1]):
  left_context += tokenizer.decode(inputs['input_ids'][0][index]) + " "
  print(f"For Left-Context '{left_context}' the model prediction is: '{tokenizer.decode(all_predictions[0][index])}'")

For Left-Context '<s> ' the model prediction is: ''
For Left-Context '<s> La ' the model prediction is: 'nostra'
For Left-Context '<s> La capitale ' the model prediction is: 'della'
For Left-Context '<s> La capitale dell ' the model prediction is: '''
For Left-Context '<s> La capitale dell ' ' the model prediction is: 'Al'
For Left-Context '<s> La capitale dell ' Italia ' the model prediction is: 'è'
For Left-Context '<s> La capitale dell ' Italia è ' the model prediction is: 'Roma'


But we don't care about all the predictions the model makes over our prompt; we only care about the first new token it produces. The fact that the model behaves like this is an artifact of how it is trained. During pre-training, each of these predictions generates a loss that is used for back-propagation, and computing all of them in parallel speeds up both training and inference.
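
To make the training view concrete, the sketch below lists the (left-context, target) pairs that a single sequence provides for next-token prediction. It works on plain strings rather than token ids; `next_token_pairs` is our own helper and `[BOS]` is a stand-in for the tokenizer's start-of-sequence token.

```python
def next_token_pairs(tokens):
    """Return one (left_context, target_token) pair per predictable position."""
    pairs = []
    for i in range(1, len(tokens)):
        # Everything before position i is the context, token i is the target
        pairs.append((tokens[:i], tokens[i]))
    return pairs


BOS = "[BOS]"  # stand-in for the start-of-sequence token
tokens = [BOS, "La", "capitale", "dell", "'", "Italia", "è"]

for context, target in next_token_pairs(tokens):
    print(" ".join(context), "->", target)
```

A sequence of 7 tokens yields 6 training targets from a single forward pass. The prediction at the last position has no target inside the sequence, and that is exactly the one we use for generation.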

To extract only the token we care about, we do:

In [None]:
next_token_logits = logits[:, -1, :]  # from shape: (batch_size, n_tokens, vocab_size) we extract last token, with shape: (batch_size, vocab_size)

In [None]:
next_token_logits.shape

torch.Size([1, 32768])

In [None]:
next_token_id = torch.argmax(next_token_logits, dim=-1)

In [None]:
next_token_id

tensor([2309])

In [None]:
tokenizer.decode(next_token_id)

'Roma'

#### Generate Method

These steps need to be repeated for each token we want to generate, in an auto-regressive way: to produce the next token, the model sees the prompt again, plus the newly generated tokens.
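
The loop can be sketched in a few lines of pure Python. Here the "model" is a tiny hypothetical lookup table that scores the next token from the last one only; `BIGRAM_SCORES`, `toy_next_token_scores`, `greedy_generate` and the `[BOS]`/`[EOS]` markers are all made up for illustration, so this shows the control flow rather than a real language model.

```python
BOS, EOS = "[BOS]", "[EOS]"  # stand-ins for start and end-of-sequence tokens

# Hypothetical scores for the next token, given only the last token
BIGRAM_SCORES = {
    BOS: {"La": 2.0, "Il": 1.0},
    "La": {"capitale": 3.0, "nostra": 2.5},
    "capitale": {"è": 2.0, "della": 1.5},
    "è": {"Roma": 4.0, "bella": 1.0},
    "Roma": {EOS: 3.0, ",": 1.0},
}


def toy_next_token_scores(context):
    """Toy stand-in for a model forward pass over the whole context."""
    return BIGRAM_SCORES.get(context[-1], {EOS: 1.0})


def greedy_generate(prompt, max_new_tokens):
    """Append the highest-scoring token until EOS or the token budget."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        scores = toy_next_token_scores(tokens)  # the model sees everything so far
        next_token = max(scores, key=scores.get)  # greedy choice (argmax)
        tokens.append(next_token)
        if next_token == EOS:
            break
    return tokens


print(greedy_generate([BOS, "La", "capitale", "è"], max_new_tokens=10))
```

The two stopping conditions in this sketch, an end-of-sequence token or a maximum number of new tokens, are the same ones used when generating with a real model.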

Luckily, Hugging Face provides a `generate` method, which does this until the model generates an EOS token or reaches a pre-defined maximum length.

In [None]:
with torch.no_grad():
  output_ids = model.generate(
    **inputs,
    max_new_tokens=10,
    do_sample=False    # greedy decoding (no sampling)
)

Setting `pad_token_id` to `eos_token_id`:2 for open-end generation.


In [None]:
output_ids

tensor([[    1,   613,  8309,   370, 32379,  3065,   413,  2309, 32368,   347,
          2175,   581,  1726,   293,  5330,   455,   398]])

In [None]:
tokenizer.decode(output_ids[0])

"<s> La capitale dell'Italia è Roma, la città più grande e popolata del"