# Lab | Transformers

---

### Section structure

1. The open-source ecosystem: increasing accessibility to machine learning (ML) software and hardware
2. Some simple code demonstrations
3. Q&A

## 1. Ease-of-use: Using Transformers in 3 lines of code


**Overview of different tasks that can be automated with ML**
* Key ingredients: (1) a model trained on a specific task; (2) input data (e.g. texts or images); (3) output produced by the model.
* Transformers are currently the most popular deep learning architecture, and most of the tasks below are solved with Transformer models. Other architectures may emerge in the medium term.



**Install the Transformers library & dependencies**

In [1]:
!pip install transformers  # The Transformers library from Hugging Face
!pip install sentencepiece
!pip install wikipedia
!pip install accelerate
!pip install tf-keras
!pip install torch

# NOTE: you might need to restart your Jupyter kernel after installing the libraries



**The Hugging Face Pipeline**
* Makes automation of many NLP tasks possible in 3 lines of code
* Detailed documentation is available [here](https://huggingface.co/transformers/main_classes/pipelines.html)
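To get an intuition for what a pipeline bundles together, here is a toy sketch in plain Python. It is not the Hugging Face implementation: `ToyPipeline` and its word-counting "model" are hypothetical stand-ins. It only illustrates the three stages a pipeline hides from you: turning raw text into model input, running the model, and turning the raw result into a readable output.

```python
class ToyPipeline:
    """Toy stand-in for a pipeline: preprocess -> model -> postprocess."""

    def __init__(self, positive_words):
        # Hypothetical "model weights": just a set of positive words.
        self.positive_words = set(positive_words)

    def preprocess(self, text):
        # Real pipelines use a tokenizer; here we lowercase and split on spaces.
        return text.lower().split()

    def forward(self, tokens):
        # Real pipelines run a neural network; here we count positive words.
        hits = sum(token.strip(".,!?") in self.positive_words for token in tokens)
        return hits / max(len(tokens), 1)

    def postprocess(self, score):
        # Return the same list-of-dicts shape that the cells below print.
        label = "positive" if score > 0 else "neutral"
        return [{"label": label, "score": score}]

    def __call__(self, text):
        return self.postprocess(self.forward(self.preprocess(text)))


toy = ToyPipeline(["great", "wonderful"])
result = toy("Have a great day!")
print(result)
```

The real `pipeline(...)` call follows the same idea, but downloads a trained tokenizer and model for you.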

In [2]:
from transformers import pipeline
import pandas as pd
import numpy as np
from pprint import pprint

Note: You might need additional libraries or updates to run the cells below. If that is the case, follow the error messages and `pip install` accordingly. ChatGPT can help if you give it the error messages.
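If you would rather check up front than wait for an import error, a small helper like the one below can list what is missing. This is a sketch using only the standard library; `find_missing` is a hypothetical helper name, and the list uses import names, which can differ from the pip package names (for example, `tf-keras` is imported as `tf_keras`).

```python
import importlib.util


def find_missing(modules):
    """Return the import names from `modules` that cannot be found."""
    return [name for name in modules if importlib.util.find_spec(name) is None]


# Import names of the libraries installed at the top of this lab.
required = ["transformers", "sentencepiece", "wikipedia", "accelerate", "tf_keras", "torch"]
missing = find_missing(required)
print("Missing packages:", missing or "none")
```

Anything printed here is a candidate for another `pip install` followed by a kernel restart.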

### 2.1 Many models tailored to specific tasks


#### 2.1.1 Text classification

Let's select a popular text classification model in the [HF model hub](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads).

Here we chose "cardiffnlp/twitter-roberta-base-irony".

We will classify text as ironic or non-ironic.

In [3]:
pipeline_classification = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-irony")  # cardiffnlp/twitter-roberta-base-irony, SamLowe/roberta-base-go_emotions

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



Device set to use cuda:0


In [4]:
!nvidia-smi

Wed Nov  5 17:42:02 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.54.15              Driver Version: 550.54.15      CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA L4                      Off |   00000000:00:03.0 Off |                    0 |
| N/A   46C    P0             28W /   72W |     719MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
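The log line "Device set to use cuda:0" means the pipeline found the GPU listed above. If you run this lab on a machine that may or may not have a GPU, you can choose the device explicitly. The sketch below assumes the integer convention for the pipeline's `device` argument (`-1` for CPU, `0` for the first GPU); `pick_device` is a hypothetical helper name.

```python
def pick_device():
    """Return 0 (first GPU) if PyTorch sees CUDA, otherwise -1 (CPU)."""
    try:
        import torch
    except ImportError:
        # Without PyTorch we cannot query CUDA, so fall back to CPU.
        return -1
    return 0 if torch.cuda.is_available() else -1


device = pick_device()
print("pipeline device argument:", device)
```

You could then pass `device=device` when creating a pipeline instead of relying on the automatic choice.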
                                                

Now that we have the model, we can pass it a string and have it return a classification.

Feel free to experiment with different sentences by changing the contents of the variable `text`.

In [6]:
text = "Well that workshop was totally not worth my time..."  # "Well that workshop was totally worth my time..."  "This smells weird, I'm not sure if I should eat this ... Yikes, it tasted like old socks!"
output = pipeline_classification(text, top_k=10)
print(output)

[{'label': 'non_irony', 'score': 0.6852609515190125}, {'label': 'irony', 'score': 0.31473907828330994}]


Let's make the output a little cleaner by loading it into a pandas DataFrame.

In [7]:
# make output a bit cleaner
df_output = pd.DataFrame(output)
print(df_output)

       label     score
0  non_irony  0.685261
1      irony  0.314739
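In practice you usually want to act on the prediction rather than just print it. The sketch below works on the list of dictionaries shown above and picks the highest-scoring label, falling back to "uncertain" when the model is not confident. The helper `top_label` and the `min_score` cutoff of 0.6 are illustrative assumptions, not part of the Transformers library.

```python
# Output copied from the classification cell above.
output = [
    {"label": "non_irony", "score": 0.6852609515190125},
    {"label": "irony", "score": 0.31473907828330994},
]


def top_label(predictions, min_score=0.6):
    """Return (label, score) of the best prediction, or 'uncertain' if below min_score."""
    best = max(predictions, key=lambda p: p["score"])
    if best["score"] < min_score:
        return "uncertain", best["score"]
    return best["label"], best["score"]


label, score = top_label(output)
print(label, round(score, 3))
```

A score of about 0.69 is not a very confident prediction, so raising `min_score` would mark this sentence as uncertain.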


As you can see, with a few lines of code and an existing model, we can classify text as ironic or non-ironic. Now you have one more tool in your machine learning toolbox.

Remember: "when you only have a hammer, everything looks like a nail." If we want to build a house (that is, do machine learning the right way), we need the right tool for the right job.

#### 2.1.2 Machine Translation

* Open-source machine translation (MT) models let you translate between many different languages without Google Translate.
* The [University of Helsinki](https://huggingface.co/Helsinki-NLP) uploaded models for more than 1,000 language pairs to the Hugging Face hub.
* [Facebook AI](https://huggingface.co/models?search=facebook+m2m) open-sourced several multilingual models.
* The [EasyNMT library](https://github.com/UKPLab/EasyNMT) provides an easy-to-use wrapper for all these models.
* Most machine translation models translate between two languages in one direction (e.g. German to English, but not English to German); some can translate in multiple directions.
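Because most of these models cover a single direction, choosing a model also means choosing a direction. The Helsinki models follow a naming pattern of the form `Helsinki-NLP/opus-mt-{source}-{target}`. The sketch below only builds the name; `opus_mt_model_name` is a hypothetical helper, and not every language pair has a model, so check the hub before relying on a name.

```python
def opus_mt_model_name(src_lang, tgt_lang):
    """Build the hub name of a one-directional Helsinki-NLP translation model."""
    return f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"


# German -> English and English -> German are two different models.
print(opus_mt_model_name("de", "en"))
print(opus_mt_model_name("en", "de"))
```

A name built this way could be passed as the `model` argument of a translation pipeline, in place of the multilingual model used below.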


In [8]:
# translation pipeline docs: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TranslationPipeline
pipeline_translate = pipeline("translation", model="facebook/m2m100_418M")


Device set to use cuda:0


M2M100 is a multilingual encoder-decoder (seq-to-seq) model trained for Many-to-Many multilingual translation.

The model can directly translate between the 9,900 directions of 100 languages.
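Where does the number 9,900 come from? Each of the 100 languages can be a source, and each of the remaining 99 can be a target, so there are 100 × 99 ordered pairs. A quick check in plain Python:

```python
from itertools import permutations

n_languages = 100

# Count ordered (source, target) pairs with source != target.
directions = sum(1 for _ in permutations(range(n_languages), 2))
print(directions)  # 9900
```

German to English and English to German count as two separate directions.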

Here we translate from German ('de') to English ('en').

In [12]:
text = "Ich bin ein Fisch und habe keine ahnung was du bist, du bist trotzdem aber kein schöner mensch"
pipeline_translate(text, src_lang="de", tgt_lang="en")

[{'translation_text': "I'm a fish and I don't know what you are, but you're not a beautiful person."}]

Let's do the same with the first 300 characters of a German Wikipedia summary.

In [16]:
# download any text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("de")

text = wikipedia.summary("Donald Trump").replace('\n', ' ')[:300]
print(f"Original text:\n{text}\n")

# translate the text from wikipedia
text_translated = pipeline_translate(text, src_lang="de", tgt_lang="en")
print(f"Translated text:\n{text_translated[0]['translation_text']}")


Original text:
Donald John Trump [ˈdɑn.əld dʒɑn tɹɐmp] (* 14. Juni 1946 in New York City) ist ein US-amerikanischer Politiker der Republikanischen Partei. Er war von 2017 bis 2021 der 45. und ist seit dem 20. Januar 2025 der 47. Präsident der Vereinigten Staaten. Außerdem ist er Unternehmer und ehemaliger Showmast

Translated text:
Donald John Trump [ˈdɑn.əld dʒɑn tɔmp] (* 14 June 1946 in New York City) is an American politician of the Republican Party. he was from 2017 to 2021 the 45th and is from 20 January 2025 the 47th President of the United States.
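Notice that slicing with `[:300]` cuts the original text in the middle of a word ("Showmast"), and the translation silently drops that fragment. One way to avoid this is to cut at the last complete sentence instead. The helper below is a rough sketch: `truncate_at_sentence` is a hypothetical name, and the rule "punctuation followed by a space" is a simplification that can be fooled by German ordinals such as "45. Präsident".

```python
def truncate_at_sentence(text, max_chars=300):
    """Cut `text` to at most `max_chars`, backing up to the last sentence end."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # Position of the last sentence-ending punctuation followed by a space.
    last_end = max(cut.rfind(". "), cut.rfind("! "), cut.rfind("? "))
    # Fall back to the hard cut if no sentence boundary was found.
    return cut[: last_end + 1] if last_end != -1 else cut


sample = "Erster Satz. Zweiter Satz. Dritter Satz ist abgeschnitten"
print(truncate_at_sentence(sample, max_chars=40))
```

In the translation cell you could apply this helper to the summary in place of the `[:300]` slice.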


#### 2.1.3 Text Summarization

In [17]:
# docs for summarisation pipeline: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
pipeline_summarize = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")  # sshleifer/distilbart-cnn-12-6 , google/pegasus-cnn_dailymail


Device set to use cuda:0


In [None]:
# download any long text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("en")

text_long = wikipedia.summary("Donald Trump").replace('\n', ' ')
print(f"Original text:\n{text_long}\n")

# summarize the text from wikipedia
text_summarized = pipeline_summarize(text_long, min_length=5, max_length=30)
print(f"Summarized text:\n{text_summarized[0]['summary_text']}")

Original text:
Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021. Trump received a Bachelor of Science in economics from the University of Pennsylvania in 1968. His father named him president of his real estate business in 1971. Trump renamed it the Trump Organization and reoriented the company toward building and renovating skyscrapers, hotels, casinos, and golf courses. After a series of business failures in the late 1990s, he launched successful side ventures, mostly licensing the Trump name. From 2004 to 2015, he co-produced and hosted the reality television series The Apprentice. He and his businesses have been plaintiffs or defendants in more than 4,000 legal actions, including six business bankruptcies. Trump won the 2016 presidential election as the Republican Party nominee against Democratic Party candidate Hillary Clinton while losing the popular vote. A 
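Two things are worth knowing when summarizing long texts. First, `min_length` and `max_length` are measured in tokens, not characters or words. Second, models can only read a limited number of tokens at once, so very long articles have to be shortened or split. The sketch below splits a text into chunks by word count; `chunk_words` is a hypothetical helper, and counting words is only a rough proxy for counting tokens.

```python
def chunk_words(text, max_words=400):
    """Split `text` into consecutive chunks of at most `max_words` words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Dummy text of 1,000 words to show the chunk sizes.
chunks = chunk_words("word " * 1000, max_words=400)
print([len(chunk.split()) for chunk in chunks])  # [400, 400, 200]
```

Each chunk could then be summarized separately and the partial summaries joined together.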

#### 2.1.4 Named Entity Recognition

Named Entity Recognition (NER) is the task of identifying specific entities in text and classifying them into predefined categories, such as names of people, organizations, locations, dates, and more.

For example, in the sentence "Apple Inc. was founded by Steve Jobs in California," NER would recognize "Apple Inc." as an organization, "Steve Jobs" as a person, and "California" as a location.

In [26]:
pipeline_ner = pipeline("token-classification", model="dslim/bert-base-NER", aggregation_strategy="simple")


Some weights of the model checkpoint at dslim/bert-base-NER were not used when initializing BertForTokenClassification: ['bert.pooler.dense.bias', 'bert.pooler.dense.weight']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



Device set to use cuda:0


In [27]:
import wikipedia
wikipedia.set_lang("en")

text_long = wikipedia.summary("Donald Trump").replace('\n', ' ')

output = pipeline_ner(text_long)

pd.DataFrame(output)

  entity_group     score                        word  start  end
0           PER  0.999391           Donald John Trump      0   17
1          MISC  0.999384                    American     45   53
2           LOC  0.999369               United States    134  147
3           ORG  0.995757            Republican Party    165  181
4           LOC  0.999507               New York City    264  277
5           PER  0.999287                       Trump    279  284
6           ORG  0.994052  University of Pennsylvania    304  330
7           ORG  0.697382      The Trump Organization    459  481
8           PER  0.998931                       Trump    615  620
9          MISC  0.593421                     The App    748  755
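The low-scoring rows deserve a second look: "The App" (score 0.59) looks like a fragment of a longer title. A common follow-up step is to drop low-confidence entities and group the rest by type. The sketch below uses a few rows copied from the table above; `group_entities` and the 0.9 cutoff are illustrative assumptions.

```python
from collections import defaultdict

# A few rows copied from the NER output above.
entities = [
    {"entity_group": "PER", "score": 0.999391, "word": "Donald John Trump"},
    {"entity_group": "LOC", "score": 0.999507, "word": "New York City"},
    {"entity_group": "ORG", "score": 0.697382, "word": "The Trump Organization"},
    {"entity_group": "MISC", "score": 0.593421, "word": "The App"},
]


def group_entities(rows, min_score=0.9):
    """Group entity strings by type, keeping only confident, unique entries."""
    grouped = defaultdict(list)
    for row in rows:
        if row["score"] < min_score:
            continue  # skip low-confidence predictions
        if row["word"] not in grouped[row["entity_group"]]:
            grouped[row["entity_group"]].append(row["word"])
    return dict(grouped)


grouped = group_entities(entities)
print(grouped)
```

With this cutoff, "The Trump Organization" is dropped as well, even though it is a correct entity, so the threshold is a trade-off you should tune on your own data.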


### 2.2 Universal models

The models mentioned above are designed to excel at a single specific task on a particular dataset. The key advantage of these models is their high performance and accuracy on that specific task and dataset.

However, in real-world applications, the problems you'll face often require solving slightly different tasks, possibly with varied category definitions or applied to different types of texts.

Universal models can help address this challenge. Although they also focus on one task, the task is general or universal enough that many other tasks can be reformulated into it. Two examples of universal tasks are:

- Natural Language Inference (NLI): A task that can effectively solve a wide range of classification tasks by determining whether a given premise supports, contradicts, or is neutral with respect to a hypothesis.

- Token Generation: An even more universal task that can be applied to solve virtually any text-related task, including translation, summarization, and text completion.

These universal tasks enable the models to be versatile and adaptable to various problems beyond the specific ones they were initially trained on.

#### Zero-shot classification


Zero-shot classification is a technique where a model can categorize data into classes it has never seen before.

Instead of relying on labeled examples for each class, the model understands the relationship between the input and the class descriptions, allowing it to make accurate predictions without needing specific training on those classes.
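This is where the NLI task from the previous section comes in. The zero-shot pipeline turns each candidate class into a hypothesis sentence and asks the NLI model whether the input text (the premise) entails it. The sketch below only builds these premise/hypothesis pairs. `build_nli_pairs` is a hypothetical helper, and the template "This example is {}." is assumed to match the pipeline's default `hypothesis_template`.

```python
def build_nli_pairs(text, classes, hypothesis_template="This example is {}."):
    """Turn a classification problem into (premise, hypothesis) pairs for an NLI model."""
    return [(text, hypothesis_template.format(label)) for label in classes]


text = "Customer: I have not received my reimbursement yet."
classes = ["payment issues", "travel advice", "bug report"]

for premise, hypothesis in build_nli_pairs(text, classes):
    print(f"premise: {premise!r} | hypothesis: {hypothesis!r}")
```

The class whose hypothesis receives the strongest "entailment" score becomes the predicted label, which is why the wording of your class names matters.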

In [20]:
pipeline_zeroshot_classification = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")


Device set to use cuda:0


Here we will give the model a list of classes ('payment issues', 'travel advice', 'bug report') for it to classify our string.

In [25]:
text = "Customer: I have not received my reimbursement yet. What the hell is going on?"
classes = ['payment issues', 'travel advice', 'bug report']  # "account opening", "customer complaint"

#text = "I do not think the government is trustworthy anymore. We need to mobilize and resist!"
#classes = ["rave festival", "praise of the government", "travel advice"]  # "collective action"

output = pipeline_zeroshot_classification(text, classes, multi_label=True)

pd.DataFrame(data=[output["labels"], output["scores"]], index=["class", "probability"]).T


            class  probability
0  payment issues     0.991144
1      bug report      0.07677
2   travel advice     0.018788
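Notice that these probabilities do not add up to 1. With `multi_label=True`, each class is judged on its own (entailment versus contradiction), so several classes can be likely at the same time. With `multi_label=False`, the classes compete against each other and the scores sum to 1. The sketch below reproduces both normalizations in plain Python. The logits are made-up numbers for illustration, not real model outputs.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    highest = max(values)
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (contradiction, entailment) logits per class.
logits = {
    "payment issues": (-2.0, 3.0),
    "bug report": (1.0, -1.5),
    "travel advice": (2.0, -2.0),
}

# multi_label=True: each class is scored independently.
multi = {label: softmax(pair)[1] for label, pair in logits.items()}

# multi_label=False: entailment logits compete across classes.
single = dict(zip(logits, softmax([pair[1] for pair in logits.values()])))

print("independent scores:", {k: round(v, 3) for k, v in multi.items()})
print("competing scores:  ", {k: round(v, 3) for k, v in single.items()})
```

Use the independent scores when a text can belong to several classes at once, and the competing scores when exactly one class should win.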


## Exercise

Now it is your turn. Go to the Hugging Face model hub: https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads

(You can use the menu on the left of the website to select the type of NLP task you want the model to perform.)

- Find an NLP model that we have not used previously.

- Get some data from wikipedia or elsewhere.

- Perform inference with the model and print the result!

- Comment your code along the way: describe what your model does and what your end goal is, from input to output.

Have fun!

In [4]:
pip install torch==2.1.2 torchvision==0.16.2

ERROR: Could not find a version that satisfies the requirement torch==2.1.2 (from versions: 2.2.0, 2.2.1, 2.2.2, 2.3.0, 2.3.1, 2.4.0, 2.4.1, 2.5.0, 2.5.1, 2.6.0, 2.7.0, 2.7.1, 2.8.0, 2.9.0)
ERROR: No matching distribution found for torch==2.1.2

In [12]:
from transformers import pipeline

toxicity = pipeline(
    "text-classification",
    model="unitary/toxic-bert",
    device=-1   # -1 runs the model on the CPU
)

texts = [
    "I hate people from your country.",
    "You are a wonderful person.",
    "Muslims are terrorists.",
    "Have a great day!",
    "Go back to where you came from you immigrant!",
    "I love everyone equally."
]

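# The top label is printed even when its score is tiny, so read the score:
# a low "toxic" score means the text is most likely NOT toxic.
for t in texts:
    result = toxicity(t)[0]
    print(f"{t}\n→ {result['label']} ({result['score']:.2%})\n")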


Device set to use cpu


I hate people from your country.
→ toxic (91.28%)

You are a wonderful person.
→ toxic (0.07%)

Muslims are terrorists.
→ toxic (94.31%)

Have a great day!
→ toxic (0.17%)

Go back to where you came from you immigrant!
→ toxic (24.30%)

I love everyone equally.
→ toxic (0.06%)
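Every line above says "toxic", even for friendly sentences, because the code prints the top label regardless of its score. To turn the scores into a yes/no decision you need a threshold. The sketch below applies one to the scores printed above, listed in the same order as `texts`. The cutoff of 0.5 and the helper `is_toxic` are illustrative assumptions.

```python
# "toxic" scores from the output above, in the same order as `texts`.
scores = [0.9128, 0.0007, 0.9431, 0.0017, 0.2430, 0.0006]


def is_toxic(score, threshold=0.5):
    """Flag a text as toxic when its score reaches the threshold."""
    return score >= threshold


flagged = [index for index, score in enumerate(scores) if is_toxic(score)]
print("Indices of texts flagged as toxic:", flagged)
```

With a cutoff of 0.5, the fifth sentence (score 24.30%) is not flagged even though many readers would consider it hostile. This shows that both the model and the threshold need to be evaluated before such a classifier is used for real decisions.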

