# Lab | Transformers

---

### Section structure

1. The open-source ecosystem: increasing accessibility to machine learning (ML) software and hardware
2. Some simple code demonstrations
3. Q&A

## 2. Ease-of-use: Using Transformers in 3 lines of code


**Overview of different tasks that can be automated with ML**
* Key ingredients: (1) a model trained on a specific task; (2) input data (e.g. texts or images); (3) output produced by the model.
* Transformers are currently the most popular type of deep learning architecture, and most of the tasks below are solved with them. Other types of algorithms may emerge in the medium term.



**Install the Transformers library & dependencies**

In [3]:
!pip install transformers~=4.31.0  # the Transformers library from Hugging Face
# optional tokeniser, required for some models (e.g. machine translation)
# NOTE: building this pinned version from source can fail (see the log below); section 2.1.2 installs an unpinned version instead
!pip install sentencepiece==0.1.96
!pip install wikipedia==1.4.0  # to download any text from Wikipedia
# accelerate is needed for running large models: https://huggingface.co/blog/accelerate-large-models
# NOTE: we need to restart the runtime after installing accelerate
!pip install accelerate~=0.21.0

Collecting sentencepiece==0.1.96
  Using cached sentencepiece-0.1.96.tar.gz (508 kB)
  Preparing metadata (setup.py) ... done
Building wheels for collected packages: sentencepiece
  Building wheel for sentencepiece (setup.py) ... error
  error: subprocess-exited-with-error

  × python setup.py bdist_wheel did not run successfully.
  │ exit code: 1
  ╰─> [180 lines of output]
      !!

              ********************************************************************************
              Usage of dash-separated 'description-file' will not be supported in future
              versions. Please use the underscore name 'description_file' instead.

              This deprecation is overdue, please update your project and remove deprecated
              calls to avoid build errors in the future.


In [4]:
# automatically choose CPU or GPU for inference, depending on your hardware
#import torch
#device_id = torch.cuda.current_device() if torch.cuda.is_available() else -1
# -1 == CPU ; 0 == GPU
#print(device_id)
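
The commented-out cell above can be turned into a small helper. The following is a minimal sketch, not part of the original lab: `select_device` is a hypothetical helper name, and it assumes the convention noted above (-1 for CPU, a non-negative index for a GPU). It falls back to the CPU when PyTorch is not installed.

```python
def select_device() -> int:
    """Return a device id: -1 for CPU, or the index of the current CUDA GPU."""
    try:
        import torch  # optional dependency; only needed for GPU detection
    except ImportError:
        return -1  # no PyTorch available: assume CPU
    if torch.cuda.is_available():
        return torch.cuda.current_device()
    return -1


device_id = select_device()
print(device_id)
# The id can then be passed on, e.g. pipeline("text-classification", model=..., device=device_id)
```

Wrapping the check in a function keeps the notebook runnable on machines without a GPU or without PyTorch.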

**The Hugging Face Pipeline**
* Makes automation of many NLP tasks possible in 3 lines of code
* Detailed documentation is available [here](https://huggingface.co/transformers/main_classes/pipelines.html)

In [5]:
from transformers import pipeline
import pandas as pd
import numpy as np
from pprint import pprint



### 2.1 Many models tailored to specific tasks


#### 2.1.1 Text classification

Let's search for a few popular text classification models in the [HF model hub](https://huggingface.co/models?pipeline_tag=text-classification&sort=downloads).

In [6]:
pipeline_classification = pipeline("text-classification", model="cardiffnlp/twitter-roberta-base-irony")  # alternative model: SamLowe/roberta-base-go_emotions



In [7]:
text = "Well that workshop was totally worth my time..."  # alternative: "This smells weird, I'm not sure if I should eat this ... Yikes, it tasted like old socks!"
output = pipeline_classification(text, top_k=10)
print(output)

[{'label': 'irony', 'score': 0.9424387812614441}, {'label': 'non_irony', 'score': 0.057561274617910385}]


In [8]:
# make output a bit cleaner
df_output = pd.DataFrame(output)
print(df_output)

       label     score
0      irony  0.942439
1  non_irony  0.057561
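
For downstream use, it often helps to keep only the most likely label and to flag low-confidence predictions. The sketch below works on the list-of-dicts format printed above; `top_prediction` and the 0.8 threshold are illustrative assumptions, not part of the Transformers API.

```python
from typing import Dict, List, Optional

# Pipeline output copied from the cell above (scores rounded for readability)
output = [
    {"label": "irony", "score": 0.9424},
    {"label": "non_irony", "score": 0.0576},
]


def top_prediction(predictions: List[Dict], threshold: float = 0.8) -> Optional[str]:
    """Return the highest-scoring label, or None if its score is below the threshold."""
    best = max(predictions, key=lambda p: p["score"])
    return best["label"] if best["score"] >= threshold else None


print(top_prediction(output))                  # irony
print(top_prediction(output, threshold=0.95))  # None
```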


#### 2.1.2 Machine Translation

* Open-source machine translation (MT) models enable you to translate between many different languages without relying on Google Translate.
* The [University of Helsinki](https://huggingface.co/Helsinki-NLP) uploaded models for more than 1000 language pairs to the Hugging Face Hub.
* [Facebook AI](https://huggingface.co/models?search=facebook+m2m) open-sourced several multilingual models.
* The [EasyNMT library](https://github.com/UKPLab/EasyNMT) provides an easy wrapper for all these models.
* Most machine translation models translate between two languages in one direction only (e.g. German to English, but not English to German). Some can translate in multiple directions.
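
Because most models cover a single direction, the model identifier usually encodes the language pair. The Helsinki-NLP models on the Hub follow the pattern `Helsinki-NLP/opus-mt-{source}-{target}`. The helper below is a hypothetical sketch that only builds the identifier string; it does not check whether a model for that pair actually exists on the Hub.

```python
def opus_mt_model_id(src_lang: str, tgt_lang: str) -> str:
    """Build a Helsinki-NLP model identifier for one translation direction."""
    return f"Helsinki-NLP/opus-mt-{src_lang}-{tgt_lang}"


# German -> English and English -> German are two separate models
print(opus_mt_model_id("de", "en"))  # Helsinki-NLP/opus-mt-de-en
print(opus_mt_model_id("en", "de"))  # Helsinki-NLP/opus-mt-en-de
```

Multilingual models such as the M2M100 model used below take the direction as arguments (`src_lang`, `tgt_lang`) at call time instead.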


In [9]:
!pip install sentencepiece



In [10]:
# translation pipeline docs: https://huggingface.co/transformers/main_classes/pipelines.html#transformers.TranslationPipeline
pipeline_translate = pipeline("translation", model="facebook/m2m100_418M")




In [11]:
text = "Ich bin ein Fisch"
pipeline_translate(text, src_lang="de", tgt_lang="en")

[{'translation_text': 'I am a fish'}]

In [12]:
# download any text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("de")

text = wikipedia.summary("Donald Trump").replace('\n', ' ')[:318]
print(f"Original text:\n{text}\n")

# translate the text from wikipedia
text_translated = pipeline_translate(text, src_lang="de", tgt_lang="en")
print(f"Translated text:\n{text_translated[0]['translation_text']}")


Original text:
Donald John Trump [ˈdɑn.əld dʒɑn tɹɐmp] (* 14. Juni 1946 in Queens, New York City, New York) ist ein US-amerikanischer Unternehmer, Entertainer und Politiker der Republikanischen Partei, der von 2017 bis 2021 der 45. Präsident der Vereinigten Staaten war. Er gilt als einer der umstrittensten Politiker der US-Geschich

Translated text:
Donald John Trump [ˈdɑn.əld dʒɑn tɔmp] (born 14 June 1946 in Queens, New York City, New York) is an American entrepreneur, entertainer and politician of the Republican Party, who from 2017 to 2021 was the 45th President of the United States.
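
Slicing with `[:318]` cuts the text after a fixed number of characters, which is why the original text above ends mid-word. A minimal sketch of a gentler truncation follows; `truncate_at_word` is a hypothetical helper, not part of any library used here, and it only backs up to the last space before the limit.

```python
def truncate_at_word(text: str, max_chars: int) -> str:
    """Truncate text to at most max_chars characters without splitting a word."""
    if len(text) <= max_chars:
        return text
    cut = text[:max_chars]
    # If the character right after the cut is whitespace, the cut ends on a full word
    if text[max_chars].isspace():
        return cut.rstrip()
    # Otherwise drop the partial last word (keep the cut as-is if it has no space)
    head, sep, _ = cut.rpartition(" ")
    return head.rstrip() if sep else cut


sample = "Er gilt als einer der umstrittensten Politiker"
print(truncate_at_word(sample, 25))  # Er gilt als einer der
```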


#### 2.1.3 Text Summarization

In [13]:
# docs for summarisation pipeline: https://huggingface.co/transformers/main_classes/pipelines.html#summarizationpipeline
pipeline_summarize = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")  # alternative model: google/pegasus-cnn_dailymail


In [14]:
# download any long text from wikipedia, via  https://pypi.org/project/wikipedia/
import wikipedia
wikipedia.set_lang("en")

text_long = wikipedia.summary("Donald Trump").replace('\n', ' ')
print(f"Original text:\n{text_long}\n")

# summarize the text from wikipedia
text_summarized = pipeline_summarize(text_long, min_length=5, max_length=30)
print(f"Summarized text:\n{text_summarized[0]['summary_text']}")

Original text:
Donald John Trump (born June 14, 1946) is an American politician, media personality, and businessman who served as the 45th president of the United States from 2017 to 2021.     Trump received a Bachelor of Science in economics from the University of Pennsylvania in 1968. His father named him president of his real estate business in 1971. Trump renamed it the Trump Organization and reoriented the company toward building and renovating skyscrapers, hotels, casinos, and golf courses. After a series of business failures in the late twentieth century, he launched successful side ventures, mostly licensing the Trump name. From 2004 to 2015, he co-produced and hosted the reality television series The Apprentice. He and his businesses have been plaintiffs or defendants in more than 4,000 legal actions, including six business bankruptcies. Trump won the 2016 presidential election as the Republican Party nominee against Democratic Party nominee Hillary Clinton while losing the po
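
Summarization models accept only a limited number of input tokens (for BART-based models such as the one above this is typically 1,024 tokens; treat the exact figure as an assumption to check on the model card). Longer inputs need to be truncated or split. One simple workaround is to split a long text into chunks and summarize each chunk separately. The sketch below uses a word count as a rough stand-in for tokens; `chunk_words` and the limit of 400 words are illustrative assumptions.

```python
from typing import List


def chunk_words(text: str, max_words: int = 400) -> List[str]:
    """Split text into consecutive chunks of at most max_words whitespace-separated words."""
    if max_words < 1:
        raise ValueError("max_words must be at least 1")
    words = text.split()
    return [
        " ".join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


chunks = chunk_words("one two three four five", max_words=2)
print(chunks)  # ['one two', 'three four', 'five']
# Each chunk could then be passed to the summarization pipeline separately,
# e.g. [pipeline_summarize(c)[0]["summary_text"] for c in chunks]
```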

#### 2.1.4 Named Entity Recognition

In [15]:
pipeline_ner = pipeline("token-classification", model="dslim/bert-base-NER-uncased", aggregation_strategy="simple")


Some weights of the model checkpoint at dslim/bert-base-NER-uncased were not used when initializing BertForTokenClassification: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).



In [16]:
import wikipedia
wikipedia.set_lang("en")

text_long = wikipedia.summary("Donald Trump").replace('\n', ' ')

output = pipeline_ner(text_long)

pd.DataFrame(output)

| | entity_group | score | word | start | end |
|---|---|---|---|---|---|
| 0 | PER | 0.989946 | donald john trump | 0 | 17 |
| 1 | MISC | 0.992103 | american | 45 | 53 |
| 2 | LOC | 0.992705 | united states | 141 | 154 |
| 3 | PER | 0.989692 | trump | 178 | 183 |
| 4 | ORG | 0.715599 | university of pennsylvania | 237 | 263 |
| 5 | PER | 0.972799 | trump | 341 | 346 |
| 6 | ORG | 0.698261 | trump organization | 362 | 380 |
| 7 | PER | 0.893604 | trump | 613 | 618 |
| 8 | MISC | 0.95431 | the apprentice | 700 | 714 |
| 9 | PER | 0.992253 | trump | 844 | 849 |
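
The `start` and `end` columns are character offsets into the input text. Since this model is uncased, the `word` column is lowercased; slicing the input with the offsets recovers the original spelling. The sketch below uses the first sentence of the input and the first two rows of the output above; the grouping helper is an illustrative assumption.

```python
from collections import defaultdict
from typing import Dict, List

text = "Donald John Trump (born June 14, 1946) is an American politician"

# First two entities from the output above (scores omitted)
entities = [
    {"entity_group": "PER", "word": "donald john trump", "start": 0, "end": 17},
    {"entity_group": "MISC", "word": "american", "start": 45, "end": 53},
]


def group_entities(text: str, entities: List[Dict]) -> Dict[str, List[str]]:
    """Group entity spans by type, using the offsets to recover the original casing."""
    grouped = defaultdict(list)
    for ent in entities:
        grouped[ent["entity_group"]].append(text[ent["start"]:ent["end"]])
    return dict(grouped)


print(group_entities(text, entities))
# {'PER': ['Donald John Trump'], 'MISC': ['American']}
```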


### 2.2 Universal models

The models above are each tailored to **one specific task from one dataset**. Their main advantage is that they perform very well on that task and dataset. In practice, however, the problems you encounter will often require a slightly different task, with different category definitions or different types of text.

Universal models can partly address this issue. They are also trained on only one task, but this task is so general that many other tasks can be reformulated as it. Two examples of universal tasks are:
- Natural Language Inference (NLI): a task into which almost any classification task can be reformulated.
- Token generation: an even more universal task, since almost any text-related task can be framed as generating text.
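
To illustrate the reformulation: zero-shot classification with an NLI model pairs the input text (the premise) with one hypothesis per candidate class and asks the model whether the text entails each hypothesis. The sketch below only builds these pairs. The template string matches the default `hypothesis_template` of the Hugging Face zero-shot pipeline as far as we know, and `build_nli_pairs` is a hypothetical helper.

```python
from typing import List, Tuple


def build_nli_pairs(
    text: str,
    classes: List[str],
    template: str = "This example is {}.",
) -> List[Tuple[str, str]]:
    """Return one (premise, hypothesis) pair per candidate class."""
    return [(text, template.format(label)) for label in classes]


pairs = build_nli_pairs(
    "Customer: I have not received my reimbursement yet.",
    ["payment issues", "travel advice", "bug report"],
)
for premise, hypothesis in pairs:
    print(hypothesis)
# This example is payment issues.
# This example is travel advice.
# This example is bug report.
```

The class with the highest entailment score is then taken as the prediction, so new categories can be tried out by editing the list of class names, without retraining.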

#### 2.2.1 Zero-shot classification

In [17]:
pipeline_zeroshot_classification = pipeline("zero-shot-classification", model="MoritzLaurer/mDeBERTa-v3-base-mnli-xnli")




In [18]:
text = "Customer: I have not received my reimbursement yet. What the hell is going on?"
classes = ['payment issues', 'travel advice', 'bug report']  # "account opening", "customer complaint"

#text = "I do not think the government is trustworthy anymore. We need to mobilize and resist!"
#classes = ["civil disobedience", "praise of the government", "travel advice"]  # "collective action"

output = pipeline_zeroshot_classification(text, classes, multi_label=True)

pd.DataFrame(data=[output["labels"], output["scores"]], index=["class", "probability"]).T


| | class | probability |
|---|---|---|
| 0 | payment issues | 0.991133 |
| 1 | bug report | 0.076115 |
| 2 | travel advice | 0.018696 |
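
With `multi_label=True`, each class is scored independently, which is why the probabilities above do not sum to 1. The sketch below contrasts the two modes on made-up logits (the numbers are hypothetical, not model output). In multi-label mode, entailment is compared against contradiction for each class separately; in single-label mode, the entailment logits are normalised across all classes. This reflects our understanding of the zero-shot pipeline's behaviour rather than its exact code.

```python
import math
from typing import List


def softmax(values: List[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (contradiction, entailment) logits for three classes
logits = {
    "payment issues": (-2.0, 3.0),
    "bug report": (1.5, -1.0),
    "travel advice": (2.5, -1.5),
}

# multi_label=True: one independent softmax per class
multi = {label: softmax(list(pair))[1] for label, pair in logits.items()}

# multi_label=False: one softmax over the entailment logits of all classes
single = dict(zip(logits, softmax([pair[1] for pair in logits.values()])))

print(multi)
print(round(sum(single.values()), 6))  # 1.0
```

Multi-label mode is the safer choice when a text may belong to several classes or to none of them.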


## 3. Exercise + Q&A


**1. Exercise:** (5 min)

Browse through the Hugging Face Hub and **identify a model or dataset that could be useful for you**. Then open this Google Doc and copy-paste the model identifier and a short explanation of why this model is interesting for you. Google Doc: https://docs.google.com/document/d/1KZ6DnZDUg_sxqpS8hhF0MDohZ0IRUZaV83Ixu93n-X8/edit?usp=sharing




Dataset: https://huggingface.co/datasets/rufimelo/PortugueseLegalSentences-v3
Interesting because it is a dataset of legal sentences in Portuguese. This could be useful for legal NLP tasks such as legal document classification and summarization. I am working on a project that involves legal documents in Portuguese, so this dataset could be useful for me.

Model: https://huggingface.co/raquelsilveira/legalbertpt_sc
Interesting because it is a Portuguese version of the LegalBERT model. This could be useful for legal NLP tasks such as legal document classification and summarization. I am working on a project that involves legal documents in Portuguese, so this model could be useful for me.

**2. Reading, thinking & asking:** (5 min)

a) Go through the notebook and ask any questions you might have. You can also run the notebook yourself.

b) Write the answers to the following questions on a piece of paper / digital notebook in your own words:

* How does open source help increase accessibility to machine learning? Where does it not help?

* In your own words, write down the main difference between standard models and universal models.

* **Post any questions in the chat/Slack!**


Open source increases accessibility to machine learning by making models and tools easier to access and use. People can reuse pre-trained models and tools developed by others, which saves time and resources. They can also contribute to the development of these models and tools, which helps improve their quality and performance. Training machine learning models also has an environmental cost: one widely cited study estimated that training a single large model can emit as much carbon as five cars over their lifetimes. Because openly shared pre-trained models can be reused instead of being retrained from scratch, open source can help reduce the environmental impact of machine learning.

The main difference between standard models and universal models is that standard models are tailored to one specific task from one dataset, while universal models are trained on one general task into which many other tasks can be reformulated. Standard models are very good at the specific task they are trained on, but they may not perform as well on other tasks or datasets. Universal models are often not as good at a specific task as a specialized standard model, but they can be applied to many different tasks and datasets.