<a href="https://colab.research.google.com/github/nicolaiberk/llm_ws/blob/main/notebooks/04a_tokens_attention.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Transformers: Contextualized Embeddings, Tokenization, and Inference

In [None]:
!pip install torch
!pip install transformers datasets evaluate accelerate

### REMEMBER TO RESTART HERE

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

## Contextual Embeddings

This notebook provides a hands-on introduction to three fundamental concepts in modern NLP:
1. **Simple Attention Mechanism** - Understanding how context changes word meaning
2. **Subword Tokenization** - How text is broken down for neural models
3. **Hugging Face Pipelines** - Quick inference with pre-trained models

## 1. Simple Attention Mechanism

### Understanding the Problem

The word "flies" has different meanings in these sentences:
- "Fruit flies like bananas" - *flies* = insects
- "Time flies like an arrow" - *flies* = moves quickly

Let's see how attention helps disambiguate this!

### Setup and Get Word Embeddings

We start with some example word vectors.

In [None]:
## word embeddings for the sentences above
embeddings = {
    'fruit': np.array([0.8, 0.2, 0.1, 0.3]),
    'flies': np.array([0.5, 0.5, 0.6, 0.3]),
    'like': np.array([0.3, 0.7, 0.4, 0.5]),
    'bananas': np.array([0.9, 0.1, 0.2, 0.4]),
    'time': np.array([0.1, 0.3, 0.8, 0.7]),
    'an': np.array([0.2, 0.4, 0.3, 0.1]),
    'arrow': np.array([0.2, 0.4, 0.9, 0.8]),
    # Related words for comparison
    'insects': np.array([0.7, 0.3, 0.2, 0.1]),
    'bugs': np.array([0.6, 0.4, 0.3, 0.2]),
    'soars': np.array([0.3, 0.8, 0.7, 0.9]),
    'glides': np.array([0.2, 0.7, 0.8, 0.8])
}

In [None]:
# reduce dimensionality to 2 and plot selected word embeddings

## dimensionality reduction using PCA
interesting_words = ['fruit', 'flies', 'bananas', 'insects', 'bugs', 'soars', 'glides', 'arrow']
interesting_vecs = np.array([embeddings[w] for w in interesting_words])
pca = PCA(n_components=2)
wv_2d = pca.fit_transform(interesting_vecs)
wv_2d = pd.DataFrame(wv_2d, index=interesting_words)


In [None]:
import matplotlib.pyplot as plt

plt.scatter(wv_2d[0], wv_2d[1])

for i in wv_2d.index:
    plt.annotate(i, (wv_2d[0][i], wv_2d[1][i]))

plt.show()

*Stop here for a second. Attention calculates weights (roughly) by taking the dot product of the token representations. What properties does the dot product have? What does that mean for the attention weights of different context words vis-a-vis 'flies'?*

## Calculate Attention

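For each sentence, we take the dot product of the vector for 'flies' with the vector of every word in the sentence (including 'flies' itself), normalize the resulting scores so that the weights sum to 1, and use these weights to compute a weighted average of the word vectors. We then plot the resulting contextualized vector. What can you observe?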
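
Written as formulas, the simplified calculation in the following cells looks roughly like this. The symbols are introduced here for illustration only: $q$ is the vector for 'flies', $x_i$ are the vectors of the words in the sentence, $\alpha_i$ are the attention weights, and $c$ is the contextualized vector.

$$
s_i = q \cdot x_i, \qquad
\alpha_i = \frac{\exp(s_i)}{\sum_j \exp(s_j)}, \qquad
c = \sum_i \alpha_i \, x_i
$$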

In [None]:
## calculate attention weights
sentence_a = ["time", "flies", "like", "an", "arrow"]
sentence_b = ["fruit", "flies", "like", "bananas"]

scores_a = pd.DataFrame(columns=sentence_a)
scores_b = pd.DataFrame(columns=sentence_b)

query = embeddings['flies']

for key in sentence_a:
    score = np.dot(embeddings[key], query)
    scores_a.at['flies', key] = score

for key in sentence_b:
    score = np.dot(embeddings[key], query)
    scores_b.at['flies', key] = score

In [None]:
scores_a

In [None]:
scores_a = scores_a.astype(float).values.flatten() ## flatten to a 1-D array of scores
scores_b = scores_b.astype(float).values.flatten()

In [None]:
from scipy.special import softmax
# Normalize scores to sum to 1 using softmax
norm_scores_a = softmax(scores_a)
norm_scores_b = softmax(scores_b)

In [None]:
norm_scores_a

In [None]:
sum(norm_scores_a) ## double-check

In [None]:
## Calculate Contextualized Vector
context_vector_a = np.dot(norm_scores_a, np.array([embeddings[context] for context in sentence_a]))
context_vector_b = np.dot(norm_scores_b, np.array([embeddings[context] for context in sentence_b]))

In [None]:
new_vecs = np.append(interesting_vecs, [context_vector_a, context_vector_b], axis=0)

In [None]:
pca = PCA(n_components=2)
wv_2d = pca.fit_transform(new_vecs)
wv_2d = pd.DataFrame(wv_2d, index=interesting_words + ["'flies' (A)", "'flies' (B)"])

In [None]:
import matplotlib.pyplot as plt

plt.scatter(wv_2d[0], wv_2d[1])

for i in wv_2d.index:
    plt.annotate(i, (wv_2d[0][i], wv_2d[1][i]))

plt.show()

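Note that this simple version would not work as well with the high-dimensional vectors used in transformers: dot products between such vectors tend to become large, which pushes the softmax towards putting almost all weight on a single token. Transformers therefore use a scaled dot product (the scores are divided by the square root of the key dimension) and learn weight matrices that map each token representation to separate query, key, and value vectors.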
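
A minimal sketch of single-head scaled dot-product attention in NumPy may help to make this concrete. It is an illustration under simplifying assumptions, not the implementation used in any particular library: the function names are made up for this example, the input vectors are random stand-ins for embeddings, and the projection matrices `W_q`, `W_k`, and `W_v` are random rather than learned.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Single-head attention over token vectors X (n_tokens x d_model)."""
    # Project every token into query, key, and value space
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = K.shape[-1]
    # Scale the scores by sqrt(d_k) before the softmax; each row sums to 1
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    # Each output row is a weighted average of the value vectors
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))  # 5 tokens with 8-dimensional stand-in embeddings
# Random projections for illustration; in a transformer these are learned
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))

context, weights = scaled_dot_product_attention(X, W_q, W_k, W_v)
print(context.shape)         # one contextualized vector per token
print(weights.sum(axis=1))   # attention weights per token sum to 1
```

Unlike the toy example above, every token acts as a query here, so the result contains one contextualized vector per token rather than only one for 'flies'.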

## Tokenization for Transformers

In this part of the tutorial, we are going to explore tokenization in the Hugging Face Transformers framework.

The simplest way to access a tokenizer using the transformers library is the `AutoTokenizer` class. This class automatically provides the right tokenizer for a corresponding model.

In [None]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

You can encode any input using the tokenizer you just created. It returns a dictionary with three entries, each holding one value per token.

In [None]:
encoded_input = tokenizer("Use the GPU!")
print(encoded_input)

The `input_ids` provide the ID for each token. The `attention_mask` indicates which tokens should be attended to. In this case, all tokens are attended to, so the attention mask is all ones. By setting some values to zero, you can tell the model to ignore specific tokens. The `token_type_ids` can be safely ignored for our purposes today (if you are very eager, you can learn more [here](https://huggingface.co/docs/transformers/main/en/glossary#token-type-ids)).

You can see that there are plenty more IDs (7) than words (3). Why might that be?

In [None]:
len(encoded_input['input_ids'])

You can inspect the tokenization using the `tokenize` method...

In [None]:
tokenizer.tokenize("Use the GPU!")

...and map the IDs back to actual text using the `decode` method.

In [None]:
tokenizer.decode(encoded_input["input_ids"])

Do you remember what the [CLS] and [SEP] tokens are used for?

Let's compare this output to another tokenizer, `bert-base-uncased`:

In [None]:
uncased_tknzr = AutoTokenizer.from_pretrained("bert-base-uncased")
encoded_input = uncased_tknzr("Transformers rule!")
print(uncased_tknzr.decode(encoded_input["input_ids"]))

You can also pass a list of texts to the tokenizer to encode several texts at once:

In [None]:
batch_sentences = [
    'What about second breakfast?',
    "Don't think he knows about second breakfast, Pip.",
    'What about elevensies?'
]
tokenizer(batch_sentences)

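Note that you can also explicitly ask for padding when encoding batches. Padding fills shorter sequences with a special [PAD] token so that all sequences in the batch have the same length and can be processed together, which speeds up batch processing (you usually don't have to take care of this yourself).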

In [None]:
tokenizer(batch_sentences, padding=True)

In [None]:
tokenizer.decode([101, 1327, 1164, 1248, 6462, 136, 102, 0, 0, 0, 0, 0, 0, 0, 0])

Similarly, you can ask for truncation to shorten texts which might be longer than the maximum input of your model:

In [None]:
tokenizer(batch_sentences, padding=True, truncation=True, max_length=10)
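
To see what padding and truncation do to the IDs and the attention mask, here is a simplified pure-Python sketch. The helper `pad_and_truncate` is hypothetical and written for this illustration only. It assumes a padding ID of 0 and simply cuts off IDs beyond `max_length`, whereas the actual tokenizer also takes care of keeping the special tokens in place. The first ID list is the one decoded above; the second one is made up.

```python
def pad_and_truncate(batch_ids, max_length=None, pad_id=0):
    """Truncate each ID list to max_length, then pad all lists to the longest one."""
    if max_length is not None:
        batch_ids = [ids[:max_length] for ids in batch_ids]
    target = max(len(ids) for ids in batch_ids)
    # Fill shorter sequences with the padding ID
    input_ids = [ids + [pad_id] * (target - len(ids)) for ids in batch_ids]
    # 1 marks a real token, 0 marks padding that should be ignored
    attention_mask = [[1] * len(ids) + [0] * (target - len(ids)) for ids in batch_ids]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

batch = [
    [101, 1327, 1164, 1248, 6462, 136, 102],  # IDs from the decode example above
    [101, 7, 8, 102],                         # made-up IDs for a shorter text
]

padded = pad_and_truncate(batch)
print(padded["input_ids"][1])       # shorter sequence filled up with zeros
print(padded["attention_mask"][1])  # zeros mark the padded positions

shortened = pad_and_truncate(batch, max_length=5)
print(shortened["input_ids"])
```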

Hugging Face provides the tokenizers alongside the transformer models, so you can always load the right tokenizer for each model. You can find an extensive explanation of tokenizers [here](https://github.com/huggingface/notebooks/blob/main/transformers_doc/en/tokenizer_summary.ipynb).
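
To build some intuition for subword tokenization, the following toy sketch splits words greedily into the longest pieces found in a vocabulary, in the spirit of the WordPiece scheme used by BERT models. It is a simplified illustration: the function `wordpiece_tokenize` and the tiny vocabulary are made up for this example and are not the real `bert-base-cased` vocabulary, so the output of the actual tokenizer may differ.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into subword pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shorten it step by step
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a '##' prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits: the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces

# Hypothetical toy vocabulary, NOT the real bert-base-cased vocabulary
toy_vocab = {"Use", "the", "GP", "##U", "!", "break", "##fast"}

# Add the special tokens around the subword pieces of a pre-split text
tokens = ["[CLS]"]
for word in ["Use", "the", "GPU", "!"]:
    tokens.extend(wordpiece_tokenize(word, toy_vocab))
tokens.append("[SEP]")
print(tokens)
```

Because the vocabulary is specific to each tokenizer, the same text can be split into different pieces (and mapped to different IDs) depending on which tokenizer you load.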

## First Inference with Transformers

The fastest way to run inference is the pipeline function:

In [None]:
from transformers import pipeline
feature_extractor = pipeline('feature-extraction', model='bert-base-cased', tokenizer='bert-base-cased')

In [None]:
features = feature_extractor("Transformers are great for NLP tasks!")
len(features[0])

In [None]:
len(features[0][0])  # Length of the feature vector for each token

In [None]:
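## You can simply request a different task and the pipeline adds a matching classification head
## Note: bert-base-cased has no fine-tuned sentiment head, so the head is randomly initialized and its predictions are not meaningful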
classifier = pipeline('sentiment-analysis', model='bert-base-cased', tokenizer='bert-base-cased')
classifier("Transformers are great for NLP tasks!")

Particularly useful is zero-shot classification:

In [None]:
from transformers import pipeline

classifier = pipeline(
    "zero-shot-classification",
    model="MoritzLaurer/deberta-v3-base-zeroshot-v2.0",
    tokenizer="MoritzLaurer/deberta-v3-base-zeroshot-v2.0"
)

classifier(
    "This is a course about the Transformers library",
    candidate_labels=["education", "politics", "business"],
)

In [None]:
emotions = classifier(
    ["I just cant do this anymore...", "I HATE this!", "Transformers are amazing!"],
    candidate_labels=["happy", "angry", "sad"],
)
[e['labels'][0] for e in emotions]

You can even process images or audio - the pipeline takes care of the preprocessing! (probably best to restart here)

In [None]:
from transformers import pipeline
classifier = pipeline("image-segmentation", model="facebook/detr-resnet-50-panoptic", device="cuda")  # device="cuda" requires a GPU runtime

![](https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png)

In [None]:
segments = classifier("https://huggingface.co/datasets/Narsil/image_dummy/raw/main/parrots.png")
print(segments[0]["label"])
print(segments[1]["label"])


![](https://c8.alamy.com/comp/E75E30/cute-pet-kitten-plays-with-his-deathly-injured-pray-a-magpie-bird-E75E30.jpg)

In [None]:
classifier("https://c8.alamy.com/comp/E75E30/cute-pet-kitten-plays-with-his-deathly-injured-pray-a-magpie-bird-E75E30.jpg")

# Exercise

0. Explain the logic of contextualized embeddings and their differentiation from classic word embeddings.

1. Consider that the tokenizer we load is identified with a model name - why would this be necessary? What problems would you encounter if that were not the case?

2. Take a concept from your own research and define some relevant labels based on the concept. Give three to four examples for each label.

3. Annotate the texts using the zero-shot classifier from above. Do you agree with the classifier?
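
One possible way to quantify how often you agree with the classifier is sketched below. The helper `agreement_report` and the example labels are hypothetical. In practice, the predicted labels could be collected in the same way as in the emotions example above, with `[e['labels'][0] for e in emotions]`.

```python
from collections import Counter

def agreement_report(own_labels, predicted_labels):
    """Return the share of matching labels and a count of each kind of disagreement."""
    if len(own_labels) != len(predicted_labels):
        raise ValueError("Both label lists must have the same length.")
    pairs = list(zip(own_labels, predicted_labels))
    matches = sum(own == predicted for own, predicted in pairs)
    # Count (own label, predicted label) pairs where the two differ
    disagreements = Counter((own, predicted) for own, predicted in pairs if own != predicted)
    return matches / len(pairs), disagreements

# Made-up labels for illustration
own = ["happy", "angry", "sad", "sad"]
predicted = ["happy", "angry", "angry", "sad"]

share, disagreements = agreement_report(own, predicted)
print(share)
print(disagreements)
```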

BONUS: Go to [https://huggingface.co/models](https://huggingface.co/models) and select another task from the Natural Language Processing filters on the left side. Select a model and try to use the pipeline function to run the task. Tip: many models provide the code to implement them on the model page.