<a target="_blank" href="https://colab.research.google.com/github/dataclair-ai/mlprague2023/blob/master/notebooks/01_assignment_intro_to_transformers.ipynb">
  <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Transformer Encoder
- easy to use HuggingFace models

In [1]:
!pip install transformers



In [1]:
from transformers import pipeline

classifier = pipeline("sentiment-analysis") # or can pick a specific model: https://huggingface.co/models?search=sentiment
classifier("ML Prague is the biggest European conference about ML, AI and deep learning applications.")

No model was supplied, defaulted to distilbert-base-uncased-finetuned-sst-2-english and revision af0f99b (https://huggingface.co/distilbert-base-uncased-finetuned-sst-2-english).
Using a pipeline without specifying a model name and revision in production is not recommended.



[{'label': 'POSITIVE', 'score': 0.9989205598831177}]

In [3]:
# do the same as above, but this time specify a suitable model for the pipeline to use
classifier = pipeline("sentiment-analysis", "cardiffnlp/twitter-roberta-base-sentiment")
classifier("ML Prague is the biggest European conference about ML, AI and deep learning applications.")

[{'label': 'LABEL_2', 'score': 0.888073205947876}]
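
The generic `LABEL_2` name can be made readable with a small lookup. The sketch below assumes the mapping given on the `cardiffnlp/twitter-roberta-base-sentiment` model card (0 = negative, 1 = neutral, 2 = positive); `label_names` is a helper introduced here for illustration, not part of the pipeline API.

```python
# Assumed mapping, taken from the model card of cardiffnlp/twitter-roberta-base-sentiment
label_names = {"LABEL_0": "negative", "LABEL_1": "neutral", "LABEL_2": "positive"}

# Pipeline output copied from the cell above
result = [{"label": "LABEL_2", "score": 0.888073205947876}]

# Replace the generic label with its readable name; unknown labels are left unchanged
readable = [{**r, "label": label_names.get(r["label"], r["label"])} for r in result]
print(readable)  # [{'label': 'positive', 'score': 0.888073205947876}]
```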

## Pipeline Steps:
### Text -> Tokenizer -> Model -> Post Processing

### Tokenizer 
#### translates text into data that can be processed by a model

- splits text into words, subwords or symbols called *tokens*
- each *token* is given a unique id (integer)
- pretrained models come with a ready made tokenizer 
- it is important to use the tokenizer that matches the model you use, otherwise the token ids are not guaranteed to match (and the text preprocessing could differ as well)
- For example: 
   - "He wore a hat" -> [tokenizer_a] -> [3, 142, 95, 7]
   - "He wore a hat" -> [tokenizer_b] -> [12, 205, 48, 2]
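
To make the mismatch concrete, here is a minimal sketch with two toy word-level vocabularies. The vocabularies and the `toy_tokenize` helper are hypothetical and only mirror the illustrative ids above; real tokenizers use far larger subword vocabularies.

```python
# Two hypothetical vocabularies; the ids mirror the illustrative example above
vocab_a = {"he": 3, "wore": 142, "a": 95, "hat": 7}
vocab_b = {"he": 12, "wore": 205, "a": 48, "hat": 2}

def toy_tokenize(text, vocab, unk_id=0):
    """Lowercase, split on whitespace, and look up each word's id."""
    return [vocab.get(word, unk_id) for word in text.lower().split()]

ids_a = toy_tokenize("He wore a hat", vocab_a)
ids_b = toy_tokenize("He wore a hat", vocab_b)
print(ids_a)  # [3, 142, 95, 7]
print(ids_b)  # [12, 205, 48, 2]

# Reading ids produced by tokenizer_a with tokenizer_b's vocabulary recovers none of the words
id_to_word_b = {i: w for w, i in vocab_b.items()}
print([id_to_word_b.get(i, "[UNK]") for i in ids_a])
```

A model only ever sees the ids, so feeding it ids from the wrong tokenizer is equivalent to feeding it a different (usually meaningless) sentence.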

#### HuggingFace makes it easy to download the model and its matching tokenizer

In [12]:
from transformers import AutoTokenizer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
tokenizer = AutoTokenizer.from_pretrained(model_name)

In [13]:
raw_inputs = [
    "ML Prague is the biggest European conference about ML, AI and deep learning applications",
    "HuggingFace is not easy to use!",
]
inputs = tokenizer(raw_inputs, padding=True, truncation=True, return_tensors="pt")
print(inputs)

{'input_ids': tensor([[  101, 19875,  8634,  2003,  1996,  5221,  2647,  3034,  2055, 19875,
          1010,  9932,  1998,  2784,  4083,  5097,   102],
        [  101, 17662, 12172,  2003,  2025,  3733,  2000,  2224,   999,   102,
             0,     0,     0,     0,     0,     0,     0]]), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
        [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]])}


### Padding and truncation
- Padding extends shorter inputs and truncation cuts longer ones, so all inputs in a batch have the same length
- The attention mask has zeros at the padded positions, which the model ignores
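
What `padding=True` does to the batch can be sketched in plain Python. The `pad_batch` helper is hypothetical and simplified: it pads on the right with id 0 (the pad id seen in the output above), and its truncation is a plain slice, whereas a real tokenizer also keeps the special tokens in place when truncating.

```python
def pad_batch(batch_ids, pad_id=0, max_length=None):
    """Optionally truncate, then right-pad every sequence to the longest one."""
    if max_length is not None:
        # Simplified truncation: a plain slice (may cut off the final special token)
        batch_ids = [ids[:max_length] for ids in batch_ids]
    longest = max(len(ids) for ids in batch_ids)
    input_ids = [ids + [pad_id] * (longest - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding
    attention_mask = [[1] * len(ids) + [0] * (longest - len(ids)) for ids in batch_ids]
    return input_ids, attention_mask

# Token ids copied from the tokenizer output above, without the padding
batch = [
    [101, 19875, 8634, 2003, 1996, 5221, 2647, 3034, 2055, 19875,
     1010, 9932, 1998, 2784, 4083, 5097, 102],
    [101, 17662, 12172, 2003, 2025, 3733, 2000, 2224, 999, 102],
]
input_ids, attention_mask = pad_batch(batch)
print(input_ids[1])
print(attention_mask[1])
```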

#### Special tokens added to lists of input ids

In [18]:
tokenized_inputs = tokenizer.encode(raw_inputs)
print(tokenized_inputs)

[101, 19875, 8634, 2003, 1996, 5221, 2647, 3034, 2055, 19875, 1010, 9932, 1998, 2784, 4083, 5097, 102, 17662, 12172, 2003, 2025, 3733, 2000, 2224, 999, 102]


- Which *tokens* do these ids correspond to?

In [19]:
print(tokenizer.convert_ids_to_tokens(tokenized_inputs))

['[CLS]', 'ml', 'prague', 'is', 'the', 'biggest', 'european', 'conference', 'about', 'ml', ',', 'ai', 'and', 'deep', 'learning', 'applications', '[SEP]', 'hugging', '##face', 'is', 'not', 'easy', 'to', 'use', '!', '[SEP]']


- The tokenizer added the special token [CLS] at the beginning and the special token [SEP] at the end of each sentence (here the two sentences were encoded together as a single sequence)
- This is because the model was pretrained with those tokens, so to get the same results at inference we need to add them as well
- Not every model will have these special tokens, some will have different ones, others may have none
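
As a rough sketch of what happens for a single sentence, the special ids can be wrapped around the plain token ids by hand. The ids 101 and 102 are the ones that appear for [CLS] and [SEP] in the output above; `add_special_tokens` is a hypothetical helper, and in practice the tokenizer does this for you.

```python
CLS_ID, SEP_ID = 101, 102  # ids shown for [CLS] and [SEP] in the output above

def add_special_tokens(token_ids):
    """Wrap one sentence's ids as [CLS] ... [SEP]."""
    return [CLS_ID] + token_ids + [SEP_ID]

# Ids for "HuggingFace is not easy to use!" without special tokens
sentence_ids = [17662, 12172, 2003, 2025, 3733, 2000, 2224, 999]
with_special = add_special_tokens(sentence_ids)
print(with_special)
# [101, 17662, 12172, 2003, 2025, 3733, 2000, 2224, 999, 102]
```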

In [20]:
tokenizer.decode(tokenized_inputs)

'[CLS] ml prague is the biggest european conference about ml, ai and deep learning applications [SEP] huggingface is not easy to use! [SEP]'

- decode converts the indices back to tokens, but also groups together the tokens that were part of the same words to produce a readable sentence
- really useful when you want to interpret a model's prediction
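
The grouping step can be sketched as follows. This is a simplified, hypothetical `merge_wordpieces` helper, not the library's implementation: it glues `##` continuation pieces onto the previous token and drops special tokens (similar in spirit to calling `decode` with `skip_special_tokens=True`), but it does not clean up the space before punctuation the way `decode` does.

```python
def merge_wordpieces(tokens, special=("[CLS]", "[SEP]", "[PAD]")):
    """Join WordPiece tokens into words; '##' marks a continuation of the previous token."""
    words = []
    for token in tokens:
        if token in special:
            continue  # skip special tokens
        if token.startswith("##") and words:
            words[-1] += token[2:]  # glue the piece onto the previous word
        else:
            words.append(token)
    return " ".join(words)

tokens = ["[CLS]", "hugging", "##face", "is", "not", "easy", "to", "use", "!", "[SEP]"]
print(merge_wordpieces(tokens))  # huggingface is not easy to use !
```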

#### Downloading the model
- compare this with the cell that downloads the tokenizer

- *AutoModel* classes pick the right architecture from the checkpoint's config, so the same method can load models with different architectures
- *ForSequenceClassification* names the task that the head of the model was trained for

In [21]:
from transformers import AutoModelForSequenceClassification

# load a model from huggingface that is suitable for sequence classification
model = AutoModelForSequenceClassification.from_pretrained(model_name)

In [24]:
outputs = model(**inputs)

- what do we expect at the output?

In [25]:
print(outputs.logits.shape)

torch.Size([2, 2])


- one row for each of the 2 input sentences
- one score in each row for each of the 2 classes (positive and negative)

In [26]:
print(outputs.logits)

tensor([[-3.3349,  3.4809],
        [ 3.5385, -2.9917]], grad_fn=<AddmmBackward0>)


- these are not probabilities, but raw unnormalized scores (logits)
- we need to apply a softmax to convert these scores to probabilities
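
For intuition, softmax exponentiates each score and divides by the sum over the row, so every row becomes positive and sums to 1. A minimal hand-written sketch on the logits printed above, in plain Python rather than the library call:

```python
import math

def softmax(scores):
    """Exponentiate and normalize; subtracting the max first avoids overflow."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Logits copied from the output above
logits = [[-3.3349, 3.4809], [3.5385, -2.9917]]
probs = [softmax(row) for row in logits]
for row in probs:
    print([round(p, 4) for p in row])
# [0.0011, 0.9989]
# [0.9985, 0.0015]
```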

In [27]:
import torch

predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
print(predictions)

tensor([[0.0011, 0.9989],
        [0.9985, 0.0015]], grad_fn=<SoftmaxBackward0>)


In [28]:
predictions.detach().numpy()[0]

array([0.00109515, 0.9989048 ], dtype=float32)

In [29]:
predictions.detach().numpy()[1]

array([0.9985435 , 0.00145652], dtype=float32)

In [30]:
model.config.id2label

{0: 'NEGATIVE', 1: 'POSITIVE'}

- First input sentence: NEGATIVE: 0.0011, POSITIVE: 0.9989
- Second input sentence: NEGATIVE: 0.9985, POSITIVE: 0.0015
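
This is the post-processing step of the pipeline: pick the most probable class in each row and look up its name. A minimal sketch using the values printed above; `postprocess` is a hypothetical helper that only mimics the shape of the pipeline's output.

```python
# Values copied from the outputs above
id2label = {0: "NEGATIVE", 1: "POSITIVE"}
probabilities = [
    [0.00109515, 0.9989048],
    [0.9985435, 0.00145652],
]

def postprocess(rows, id2label):
    """Return the top label and its probability for each row."""
    results = []
    for row in rows:
        best = max(range(len(row)), key=row.__getitem__)  # argmax over classes
        results.append({"label": id2label[best], "score": row[best]})
    return results

print(postprocess(probabilities, id2label))
# [{'label': 'POSITIVE', 'score': 0.9989048}, {'label': 'NEGATIVE', 'score': 0.9985435}]
```

The first result agrees with what the default `pipeline("sentiment-analysis")` returned for the same sentence at the top of the notebook, up to rounding.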