<a href="https://colab.research.google.com/github/lozrigby/lab-encorder-models-bert/blob/main/lab_encorder_models_bert.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Lab | Encoder Models - BERT

---

### Transformers' main components

**Hugging Face Transformers has two main components:**



1. The **tokenizer** converts the text into a format that the model understands.
    - A token is a word or a sub-word unit. In BERT's vocabulary, the word "good" is one token and the word "darwinism" is two tokens ("darwin" and "ism").
    - The tokenizer maps each token to a token ID. BERT uses these IDs to look up what it learned about each token during pre-training.

2. The **model** processes the tokenizer's output and returns a prediction, e.g. which class an input text belongs to.



Regardless of the type of model (classification, summarisation, translation, etc.), these two components work in almost the same way.
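To make the idea of sub-word tokens concrete, here is a minimal sketch of greedy longest-match splitting, the matching strategy used by WordPiece tokenizers such as BERT's. The vocabulary `TOY_VOCAB` and the function name `wordpiece_tokenize` are hypothetical and exist only for illustration; the real tokenizer also lower-cases, splits punctuation and uses a vocabulary of roughly 30,000 entries.

```python
# Toy vocabulary (hypothetical); continuation pieces carry a "##" prefix.
TOY_VOCAB = {"[UNK]", "good", "trust", "##worthy", "darwin", "##ism", "."}

def wordpiece_tokenize(word, vocab=TOY_VOCAB):
    """Split one lower-cased word into sub-word tokens by greedy longest match."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark pieces that continue a word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens

print(wordpiece_tokenize("good"))         # ['good']
print(wordpiece_tokenize("trustworthy"))  # ['trust', '##worthy']
print(wordpiece_tokenize("darwinism"))    # ['darwin', '##ism']
```

Because any word can fall back to smaller pieces, only words with no matching piece at all end up as `[UNK]`.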

In [None]:
#!pip install transformers~=4.31.0  # The Transformers library from Hugging Face

## Models like BERT (encoders)

In [None]:
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

In [None]:
# load any classification model from the HuggingFace model hub
# See here: https://huggingface.co/models?pipeline_tag=text-classification

model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# instantiate the tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
# instantiate the model
model = AutoModelForSequenceClassification.from_pretrained(model_name)

### Tokenization


In [None]:
### 1. Tokenization
# Tokenizer documentation: https://huggingface.co/transformers/main_classes/tokenizer.html

text = 'I believe that the EU is trustworthy.'
print(f"Input text: '{text}'\n")

input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"]
print(f"""The tokenizer splits the text string into separate tokens. A token is either an entire word,
or a 'sub-word unit' in case of rare words (or punctuation).
The word 'trustworthy', for example, is split into two tokens: {tokenizer.tokenize("Trustworthy")}.
The main advantage of these sub-word units is that rare words cannot be out-of-vocabulary (a common problem in other text-as-data approaches).
Transformer models typically have a vocabulary of around 30,000 to 250,000 tokens, learned from the training data.
Here is e.g. the vocabulary of DistilBERT: https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt\n""")

print(f"The input text is split into the following tokens:\n{tokenizer.tokenize(text)}.")
print("The tokenizer then maps each token to the corresponding token ID in the model's vocabulary:")
print(input_ids[0].tolist()[1:-1])
print("Transformer models only understand these token IDs.\n")

print("""In addition, the tokenizer adds two special tokens:
 First, the [CLS] (classification) token is always added at the beginning.
        While individual tokens represent individual (sub)words, the [CLS] token represents the entire text.
        The [CLS] token "is used as the aggregate sequence representation for classification tasks" (Devlin et al. 2019: 4). Details: https://arxiv.org/pdf/1810.04805.pdf
 Second, the [SEP] token separates two texts. It is useful for tasks which require two text inputs, for example question answering tasks.
        (It is not relevant in our case)
\n""")

print("""The final input for a BERT transformer model therefore looks like this:""")
token_strings = tokenizer.convert_ids_to_tokens(ids=input_ids[0])
#token_strings = tokenizer.tokenize(text)
for token_id, token_string in zip(input_ids[0].tolist(), token_strings):
  print(token_id, " == ", token_string)


# entire vocabulary: tokenizer.pretrained_vocab_files_map["vocab_file"]["distilbert-base-uncased"]
# https://huggingface.co/distilbert-base-uncased/resolve/main/vocab.txt
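
The mapping from tokens to IDs, including the two special tokens, can be sketched without the library. The vocabulary and its IDs below are hypothetical (the real IDs come from the model's `vocab.txt`), and `encode` / `decode` are illustrative helper names, not library functions.

```python
# Hypothetical toy vocabulary: token string -> token ID (real IDs differ).
TOY_VOCAB = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "i": 3, "believe": 4,
             "trust": 5, "##worthy": 6, ".": 7}
ID_TO_TOKEN = {token_id: token for token, token_id in TOY_VOCAB.items()}

def encode(tokens):
    """Wrap tokens in [CLS] ... [SEP] and map each one to its ID."""
    wrapped = ["[CLS]"] + tokens + ["[SEP]"]
    return [TOY_VOCAB.get(token, TOY_VOCAB["[UNK]"]) for token in wrapped]

def decode(token_ids):
    """Map IDs back to token strings."""
    return [ID_TO_TOKEN[token_id] for token_id in token_ids]

ids = encode(["i", "believe", "trust", "##worthy", "."])
print(ids)          # [1, 3, 4, 5, 6, 7, 2]
print(decode(ids))  # ['[CLS]', 'i', 'believe', 'trust', '##worthy', '.', '[SEP]']
print(ids[1:-1])    # [3, 4, 5, 6, 7]  (special tokens stripped)
```

Slicing with `[1:-1]` drops the special tokens at both ends, which is the same trick the cell above uses to print only the IDs of the actual text.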

### Tokens (words) flowing through the neural network

In [None]:
### Processing the input with the model
# Model class documentation: https://huggingface.co/transformers/main_classes/model.html
# Documentation for DistilBERT specifically: https://huggingface.co/transformers/model_doc/distilbert.html

print(f"""\nAfter the preprocessing by the tokenizer, the model feeds the sequence of tokens through the neural network.
Each token is represented by a vector of 768 numbers (a 768-dimensional tensor).
For example, the tensor for the token "trust" looks like this before being fed into the first neural network layer
(only the first 100 numbers are displayed):\n""")
print(model.distilbert.embeddings.word_embeddings(input_ids[0][7])[:100], "\n")

print("The tensors for each token are then fed through and transformed by roughly 6 to 24 neural network layers, depending on the model (DistilBERT has 6).\n")

output = model(input_ids, output_hidden_states=True, output_attentions=False, return_dict=True)
print("Same word after the first layer:\n\n", output.hidden_states[1][0][7][:100], "\n")  # same word embedding after the first attention layer
print("Same word after the second layer:\n\n", output.hidden_states[2][0][7][:100], "\n")  # same word embedding after the second attention layer
#print("Same word after the third layer:\n", output.hidden_states[3][0][7][:100], "\n")  # same word embedding after the third attention layer
print("\n ... etc ...\n")

print(f'The final output is a contextualised representation of the sequence: "{text}"')
#output.hidden_states[6][0][0][:100]  # final CLS token

In [None]:
print("This is what the different model layers ('the architecture') look like:\n")
print(model)
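
What "contextualised" means can be illustrated with a heavily simplified version of self-attention. This is a sketch under stated assumptions, not the DistilBERT implementation: it has a single head, no learned query/key/value projections, no feed-forward block, and the 2-dimensional vectors are made up. It only shows that the same input vector ends up different depending on its neighbours.

```python
import math

def softmax(scores):
    """Turn a list of scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def simple_self_attention(vectors):
    """Replace each vector by an attention-weighted average of all vectors."""
    dim = len(vectors[0])
    contextualised = []
    for query in vectors:
        # Similarity of this token to every token (scaled dot product).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        weights = softmax(scores)
        mixed = [sum(w * vec[i] for w, vec in zip(weights, vectors))
                 for i in range(dim)]
        contextualised.append(mixed)
    return contextualised

# Hypothetical 2-dimensional input embeddings (real ones have 768 numbers).
trust = [1.0, 0.0]
context_a = [trust, [0.0, 1.0]]
context_b = [trust, [1.0, 1.0]]

out_a = simple_self_attention(context_a)[0]
out_b = simple_self_attention(context_b)[0]
print(out_a)  # vector for "trust" in context A
print(out_b)  # vector for "trust" in context B: different from A
```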

### The final output

In [None]:
print(f"""At the end, Transformer classification models output so-called 'logits':\n one number for each class the model was trained to classify text into.\n
Our input text was: '{text}'\n
These logits are the raw, unnormalised scores for our binary sentiment classification task:\n\n{output["logits"][0].tolist()}\n""")

print("Logits are not very interpretable, so softmax converts them to percentages.\nEach percentage is the model's predicted probability that the input text belongs to that class.\n")
probabilities = torch.softmax(output["logits"][0], -1).tolist()
label_names = model.config.id2label.values()
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(probabilities, label_names)}
print(prediction)
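
The conversion from logits to percentages is a softmax. The following sketch reproduces it in plain Python so the arithmetic is visible. The logit values are made up for illustration, and the label names are assumed to match the model's `id2label` mapping.

```python
import math

def softmax(logits):
    """Exponentiate each logit and normalise so the results sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for the two classes; real values come from output["logits"].
logits = [-2.0, 2.0]
label_names = ["NEGATIVE", "POSITIVE"]  # assumed label order

probabilities = softmax(logits)
prediction = {name: round(p * 100, 1) for name, p in zip(label_names, probabilities)}
print(prediction)  # {'NEGATIVE': 1.8, 'POSITIVE': 98.2}
```

Only the difference between the logits matters: adding the same constant to both leaves the percentages unchanged.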

### Everything put together


In [None]:
## In short, the code looks like this:

# load the relevant functions from HuggingFace and PyTorch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Choose any classification model from the model hub
model_name = "distilbert-base-uncased-finetuned-sst-2-english"

# instantiate the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# tokenization
text = 'I believe that the EU is trustworthy.'
input_ids = tokenizer(text, truncation=True, return_tensors="pt")["input_ids"]  # avoid shadowing the built-in input()

# model prediction
output = model(input_ids, output_hidden_states=False, output_attentions=False, return_dict=True)
probabilities = torch.softmax(output["logits"][0], -1).tolist()
label_names = model.config.id2label.values()
prediction = {name: round(float(pred) * 100, 1) for pred, name in zip(probabilities, label_names)}
print(prediction)

In [None]:
## Or via the simplified pipeline:
from transformers import pipeline
pipe_classification = pipeline("text-classification", model="distilbert-base-uncased-finetuned-sst-2-english", top_k=2)
text = 'I believe that the EU is trustworthy.'
pipe_classification(text)

## Generative models like GPT (decoders)

In [None]:
from transformers import AutoTokenizer, AutoModelForCausalLM

# https://huggingface.co/gpt2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Today I believe we can finally"
inputs = tokenizer(prompt, return_tensors="pt")

outputs = model.generate(**inputs, max_length=30)

outputs_decoded = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(outputs_decoded)


In [None]:

# https://huggingface.co/gpt2
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Today I believe we can finally"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# gpt2's vocabulary: https://huggingface.co/gpt2/raw/main/vocab.json

outputs = model.generate(
    input_ids, max_length=30,
    output_scores=True, return_dict_in_generate=True,
    output_attentions=False, do_sample=False
)

print("\nThe output looks quite messy:\n")
print(outputs)


In [None]:
print("GPT2's vocabulary is composed of 50257 tokens. Each has a 'word vector' composed of 768 numbers:")
print(model.transformer.wte)

print(f"""\nWe can look at GPT2's entire vocabulary here: https://huggingface.co/gpt2/raw/main/vocab.json
\nFor example, the token 'Love' is at position 18565.
\nWe can access its word vector here (first 100 numbers):\n
{model.transformer.wte.weight[18565][:100]}
""")
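
Looking up a word vector is just indexing into a table with one row per token ID. A minimal sketch, with a hypothetical 4-row table standing in for GPT-2's 50257 x 768 embedding matrix:

```python
# Hypothetical embedding table with 4 tokens and 3 numbers per token
# (GPT-2's real table has 50257 rows and 768 columns).
EMBEDDING_TABLE = [
    [0.10, -0.30, 0.25],   # token ID 0
    [0.02, 0.40, -0.11],   # token ID 1
    [-0.50, 0.07, 0.33],   # token ID 2
    [0.21, 0.21, -0.08],   # token ID 3
]

def embed(token_ids):
    """Look up one row of the table per token ID."""
    return [EMBEDDING_TABLE[token_id] for token_id in token_ids]

vectors = embed([2, 0, 2])
print(vectors[0])                # [-0.5, 0.07, 0.33]
print(vectors[0] == vectors[2])  # True: the same ID always yields the same input vector
```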

In [None]:

print(f"""
While the outputs produced by classifiers like BERT are probabilities of classes,
the outputs produced by generators like GPT2 are probabilities of tokens.
\nThe scores behind these token probabilities are in the 'outputs' object returned by model.generate()
\nThe token IDs of the prompt, followed by the most probable next tokens, are:
{outputs.sequences}
\nThese token IDs can be mapped to actual words/tokens in the vocabulary:
{tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)}\n\n

Our original prompt was:\n'{prompt}'
GPT2 then tries to predict the most probable next token. One token after the other.

To calculate the first token, it makes a prediction over ALL of the 50257 tokens it knows.
Each of the 50257 tokens receives a score (a logit), which softmax can turn into a probability.
For the first token, the scores over its ENTIRE vocabulary look like this:
{outputs.scores[0][0]}

The ID of the most probable *first* token is {torch.argmax(outputs.scores[0][0], dim=0)}
The corresponding token is: {tokenizer.decode(torch.argmax(outputs.scores[0][0], dim=0))}

The ID of the most probable *second* token is {torch.argmax(outputs.scores[1][0], dim=0)}
The corresponding token is: {tokenizer.decode(torch.argmax(outputs.scores[1][0], dim=0))}

The ID of the most probable *third* token is {torch.argmax(outputs.scores[2][0], dim=0)}
The corresponding token is: {tokenizer.decode(torch.argmax(outputs.scores[2][0], dim=0))}

This is how GPT2 gradually generated the text:
{tokenizer.batch_decode(outputs.sequences, skip_special_tokens=True)}

The same principles apply to all generative LLMs like GPT-4, Llama 2, etc.
The main differences are that they are bigger, with a better architecture and better fine-tuning.
""")
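
The token-by-token procedure described above is greedy decoding (`do_sample=False`): at each step, pick the highest-scoring token and append it. Here is a minimal sketch with a made-up score table. Note one deliberate simplification: this toy table looks only at the last token, whereas GPT-2 conditions on the entire preceding sequence. The names `TOY_SCORES` and `greedy_generate` are hypothetical.

```python
# Hypothetical next-token scores: for each last token, a score per candidate.
TOY_SCORES = {
    "finally": {"see": 2.1, "go": 1.3, "the": 0.2},
    "see": {"the": 3.0, "go": 0.1, "finally": -1.0},
    "the": {"end": 1.7, "see": 0.4, "go": 0.3},
    "end": {"<eos>": 2.5, "the": 0.1},
}

def greedy_generate(prompt_tokens, max_new_tokens=5):
    """Append the highest-scoring next token until <eos> or the length limit."""
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        scores = TOY_SCORES.get(tokens[-1])
        if scores is None:
            break  # the toy table has no entry for this token
        next_token = max(scores, key=scores.get)  # argmax over the candidates
        if next_token == "<eos>":
            break
        tokens.append(next_token)
    return tokens

print(greedy_generate(["we", "can", "finally"]))
# ['we', 'can', 'finally', 'see', 'the', 'end']
```

Sampling-based decoding would instead draw the next token at random in proportion to its probability, which is why it can produce different continuations for the same prompt.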




---

## Reflection + Q&A


**Reading, thinking & asking:** (5 min)
* Write your answers to the following questions on a piece of paper or in a digital notebook. While thinking about them, don't hesitate to post any questions that come up in the chat/Slack.
    * In your own words, write down the main differences between models like BERT and models like GPT with regard to their outputs.
    * What could be disadvantages and advantages of these two different approaches (encoders vs. decoders)?

BERT is an encoder-based model that generates contextual embeddings for input tokens by considering both preceding and succeeding words. In contrast, GPT is a decoder-based model that generates text sequentially from left to right, predicting the next word based on previous words. Thus, BERT is optimized for understanding tasks, while GPT excels in text generation.

**Advantages of BERT/encoders**

BERT’s bidirectional context provides a deep understanding of each word in relation to its full context, making it effective for tasks like sentiment analysis and question answering. It performs exceptionally well on tasks requiring comprehensive text understanding, such as named entity recognition and paraphrase detection.

**Disadvantages of BERT/encoders**

BERT cannot generate text because it does not predict the next word in a sequence, which limits its use in creative text tasks. Also, because it is bidirectional, it must process the entire sequence for each prediction, which can make inference slower.

**Advantages of GPT/decoders**

GPT is highly effective at generating coherent and contextually relevant text, making it ideal for applications like chatbots and content creation. Its sequential, left-to-right processing simplifies the text generation process, producing fluent continuations of prompts.

**Disadvantages of GPT/decoders**

GPT's left-to-right processing means it cannot use future tokens to understand a word’s context, which affects tasks requiring bidirectional context. It is also constrained by its context window size, which limits the number of previous tokens it can consider when generating text. This can be problematic for understanding or generating lengthy texts.
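
The contrast between bidirectional and left-to-right context can be sketched as attention masks, where a 1 at row i, column j means "token i may look at token j". This is an illustrative sketch with hypothetical function names, not code taken from either model.

```python
def bidirectional_mask(n):
    """Encoder-style: every position may attend to every position."""
    return [[1] * n for _ in range(n)]

def causal_mask(n):
    """Decoder-style: position i may attend only to positions 0..i."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

tokens = ["I", "believe", "that"]
for name, mask in [("bidirectional", bidirectional_mask(len(tokens))),
                   ("causal", causal_mask(len(tokens)))]:
    print(name)
    for token, row in zip(tokens, mask):
        print(f"  {token:>8}: {row}")
```

In the causal mask the first token sees only itself, which is what lets a decoder generate text one token at a time without access to tokens that do not exist yet.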