# Transformers & LLMs

# Prerequisites
Please install Python and the required libraries for this exercise, or work in Google Colab.


# 1.
Neural networks can't process text directly; they only work with numbers. Tokenization is the essential first step that bridges human language and machine learning. In this exercise you are asked to build a basic word-level tokenizer class that can:

- Build a vocabulary from a list of sentences
- Encode sentences into token IDs
- Decode token IDs back into sentences

Include a special token:
- `<UNK>` (unknown) for words not in the vocabulary.

NOTE: All vector representations (embeddings) need to have the same length.

(15 points)

In [7]:
# Solution for Exercise 1: SimpleTokenizer

class SimpleTokenizer:
    def __init__(self, pad_token="<PAD>", unk_token="<UNK>"):
        self.pad_token = pad_token
        self.unk_token = unk_token
        self.word2id = {}
        self.id2word = {}
        self.vocab_built = False

    def build_vocab(self, sentences):
        # Reserve 0 for padding, 1 for unknown
        self.word2id = {self.pad_token: 0, self.unk_token: 1}
        self.id2word = {0: self.pad_token, 1: self.unk_token}
        next_id = 2

        for sentence in sentences:
            for word in sentence.lower().split():
                if word not in self.word2id:
                    self.word2id[word] = next_id
                    self.id2word[next_id] = word
                    next_id += 1

        self.vocab_built = True

    def encode(self, sentence, max_length=None):
        if not self.vocab_built:
            raise ValueError("Call build_vocab(...) before encode(...).")

        tokens = sentence.lower().split()
        unk_id = self.word2id[self.unk_token]
        ids = [self.word2id.get(tok, unk_id) for tok in tokens]

        # Make all representations the same length if max_length is given
        if max_length is not None:
            if len(ids) < max_length:
                # pad with PAD token
                pad_id = self.word2id[self.pad_token]
                ids = ids + [pad_id] * (max_length - len(ids))
            else:
                ids = ids[:max_length]

        return ids

    def decode(self, token_ids):
        pad_id = self.word2id.get(self.pad_token, 0)
        words = []
        for tid in token_ids:
            if tid == pad_id:
                continue  # skip padding
            word = self.id2word.get(tid, self.unk_token)
            words.append(word)
        return " ".join(words)


# Build vocabulary
training_sentences = [
    "I love deep learning",
    "Transformers are powerful models",
    "Tokenization turns text into numbers",
]

tokenizer = SimpleTokenizer()
tokenizer.build_vocab(training_sentences)

print("Vocabulary:", tokenizer.word2id)

# Encode a sentence with known words only
sentence1 = "I love deep learning"
encoded1 = tokenizer.encode(sentence1, max_length=6)
print("Encoded (known words):", encoded1)
print("Decoded:", tokenizer.decode(encoded1))

# Encode a sentence with unknown words
sentence2 = "I love neural networks"
encoded2 = tokenizer.encode(sentence2, max_length=6)
print("Encoded (unknown words):", encoded2)
print("Decoded:", tokenizer.decode(encoded2))


Vocabulary: {'<PAD>': 0, '<UNK>': 1, 'i': 2, 'love': 3, 'deep': 4, 'learning': 5, 'transformers': 6, 'are': 7, 'powerful': 8, 'models': 9, 'tokenization': 10, 'turns': 11, 'text': 12, 'into': 13, 'numbers': 14}
Encoded (known words): [2, 3, 4, 5, 0, 0]
Decoded: i love deep learning
Encoded (unknown words): [2, 3, 1, 1, 0, 0]
Decoded: i love <UNK> <UNK>
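
The note about equal-length representations can also be handled per batch instead of with a fixed `max_length`. The following is a minimal sketch, assuming the same ID convention as the solution above (0 for `<PAD>`, 1 for `<UNK>`); `encode_batch` is a hypothetical helper and the small vocabulary is written by hand for illustration only.

```python
def encode_batch(sentences, word2id, pad_id=0, unk_id=1):
    """Encode sentences and right-pad each one to the longest in the batch."""
    encoded = [
        [word2id.get(word, unk_id) for word in sentence.lower().split()]
        for sentence in sentences
    ]
    # Length of the longest encoded sentence (0 for an empty batch)
    max_length = max((len(ids) for ids in encoded), default=0)
    return [ids + [pad_id] * (max_length - len(ids)) for ids in encoded]


# Hand-written vocabulary (assumption), same layout as tokenizer.word2id
word2id = {"<PAD>": 0, "<UNK>": 1, "i": 2, "love": 3, "deep": 4, "learning": 5}

batch = encode_batch(
    ["I love deep learning", "I love transformers", "deep"],
    word2id,
)
for ids in batch:
    print(ids)
```

Padding to the longest sentence in the batch avoids truncating any input, at the cost of a length that varies from batch to batch.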


# 2.
Now you can use your tokenizer to embed sentences into a multidimensional space. Normally such embeddings would capture semantic relationships between sentences; it is perfectly fine if yours do not. However, let's pretend that they do. Your task is to train your tokenizer on a given set of sentences and then, for other sentences, find which of the training sentences are closest. Use Euclidean distance or cosine similarity as the measure.

In [8]:
import numpy as np

documents = [
    "Osteoarthritis causes joint pain and stiffness",
    "Diabetes affects how the body uses sugar",
    "Hypertension leads to high blood pressure",
    "Exercise can help with osteoarthritis symptoms"
]

# Embed the documents

# Embed the following sentence too and find "the most similar" sentence among the documents
query = "joint stiffness relief"

In [9]:
# Solution for Exercise 2: Embed documents and find most similar to the query

# Build vocabulary on documents + query
similarity_tokenizer = SimpleTokenizer()
similarity_tokenizer.build_vocab(documents + [query])

# Fix a common max length so all embeddings have the same size
max_len = max(len(d.split()) for d in documents + [query])


def sentence_embedding(sentence, tokenizer, max_length):
    """Use token IDs (with padding) as a simple embedding vector."""
    token_ids = tokenizer.encode(sentence, max_length=max_length)
    return np.array(token_ids, dtype=float)


doc_embeddings = np.vstack([
    sentence_embedding(doc, similarity_tokenizer, max_len)
    for doc in documents
])
query_embedding = sentence_embedding(query, similarity_tokenizer, max_len)


def cosine_similarity(a, b):
    a = a.astype(float)
    b = b.astype(float)
    denom = (np.linalg.norm(a) * np.linalg.norm(b))
    if denom == 0:
        return 0.0
    return float(np.dot(a, b) / denom)


similarities = [cosine_similarity(query_embedding, emb) for emb in doc_embeddings]

for doc, sim in zip(documents, similarities):
    print(f"Similarity to query: {sim:.4f} -> {doc}")

best_idx = int(np.argmax(similarities))
print("\nMost similar sentence to query '", query, "':", sep="")
print(documents[best_idx])


Similarity to query: 0.4144 -> Osteoarthritis causes joint pain and stiffness
Similarity to query: 0.4409 -> Diabetes affects how the body uses sugar
Similarity to query: 0.5237 -> Hypertension leads to high blood pressure
Similarity to query: 0.5956 -> Exercise can help with osteoarthritis symptoms

Most similar sentence to query 'joint stiffness relief':
Exercise can help with osteoarthritis symptoms
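
The exercise also allows Euclidean distance as the measure. The following is a minimal, self-contained sketch of that alternative; it rebuilds the same ID-based embedding with a plain dictionary instead of `SimpleTokenizer`, and the names `embed`, `doc_vecs` and `query_vec` are illustrative assumptions. Note that the closest document is the one with the smallest distance, so `argmin` replaces `argmax`.

```python
import numpy as np

documents = [
    "Osteoarthritis causes joint pain and stiffness",
    "Diabetes affects how the body uses sugar",
    "Hypertension leads to high blood pressure",
    "Exercise can help with osteoarthritis symptoms",
]
query = "joint stiffness relief"

# Stand-in for the tokenizer: IDs are assigned in order of first appearance
word2id = {"<PAD>": 0, "<UNK>": 1}
for sentence in documents + [query]:
    for word in sentence.lower().split():
        word2id.setdefault(word, len(word2id))

max_len = max(len(s.split()) for s in documents + [query])


def embed(sentence):
    """Token IDs, truncated or zero-padded to max_len, as a float vector."""
    ids = [word2id.get(w, 1) for w in sentence.lower().split()][:max_len]
    return np.array(ids + [0] * (max_len - len(ids)), dtype=float)


doc_vecs = np.vstack([embed(d) for d in documents])
query_vec = embed(query)

# Euclidean distance from the query to every document (row-wise norm)
distances = np.linalg.norm(doc_vecs - query_vec, axis=1)

for doc, dist in zip(documents, distances):
    print(f"Distance to query: {dist:.4f} -> {doc}")

best_idx = int(np.argmin(distances))
print("Closest sentence:", documents[best_idx])
```

Because the vectors are raw token IDs, both measures reflect the order in which words entered the vocabulary, not their meaning, so the two measures need not agree on the closest sentence.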


We are given:
- Query (for "The"):  **Q = [1, 0]**
- Keys:
  - K₁ ("The") = [1, 0]
  - K₂ ("cat") = [0, 1]
  - K₃ ("sat") = [1, 1]

#### 1. Attention scores (dot products)
For each key Kᵢ, the (unnormalized) attention score is:

- score₁ = Q · K₁ = [1, 0] · [1, 0] = 1·1 + 0·0 = **1**
- score₂ = Q · K₂ = [1, 0] · [0, 1] = 1·0 + 0·1 = **0**
- score₃ = Q · K₃ = [1, 0] · [1, 1] = 1·1 + 0·1 = **1**

So the scores are: **[1, 0, 1]**.

#### 2. Which key gets the highest attention weight?
- The largest score is **1**, achieved by both K₁ ("The" itself) and K₃ ("sat").
- K₂ ("cat") has score 0 and therefore gets the lowest weight.

**Intuition:** The query Q focuses entirely on the first dimension. Keys that have a large first coordinate (K₁ and K₃) are considered most similar to Q, so they receive higher attention.

#### 3. Softmax and weighted value combination
Scores = [1, 0, 1]

Softmax weights αᵢ are:

αᵢ = exp(scoreᵢ) / (exp(1) + exp(0) + exp(1)) = exp(scoreᵢ) / (2·exp(1) + 1)

Let Z = 2·exp(1) + 1. Then:
- α₁ = exp(1) / Z
- α₂ = exp(0) / Z = 1 / Z
- α₃ = exp(1) / Z

Values:
- V₁ = [0.5, 0.2]
- V₂ = [0.1, 0.8]
- V₃ = [0.6, 0.5]

The contextualized representation of "The" is:

**context = α₁·V₁ + α₂·V₂ + α₃·V₃**

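Numerically, with exp(1) ≈ 2.718 we get Z ≈ 6.437, so α₁ = α₃ ≈ 0.422 and α₂ ≈ 0.155. This gives:

context ≈ 0.422·[0.5, 0.2] + 0.155·[0.1, 0.8] + 0.422·[0.6, 0.5] ≈ **[0.48, 0.42]**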

#### Why does "The" attend more to certain words?
Because attention uses similarity (dot product) between Q and each K:
- Words whose keys point in a similar direction to the query (here, large first component) get **higher scores**, then **higher softmax weights**.
- Here, K₁ and K₃ align better with Q than K₂, so "The" mainly attends to itself and "sat", and much less to "cat".

In [6]:
# Numeric check for the self-attention calculation
import numpy as np

Q = np.array([1.0, 0.0])
K = np.array([
    [1.0, 0.0],  # "The"
    [0.0, 1.0],  # "cat"
    [1.0, 1.0],  # "sat"
])
V = np.array([
    [0.5, 0.2],  # V1
    [0.1, 0.8],  # V2
    [0.6, 0.5],  # V3
])

scores = K @ Q  # dot products with Q
print("Scores:", scores)

# Softmax over scores
exp_scores = np.exp(scores)
weights = exp_scores / exp_scores.sum()
print("Softmax weights:", weights)

# Weighted sum of values
context = weights @ V
print("Context vector for 'The':", context)


Scores: [1. 0. 1.]
Softmax weights: [0.4223188 0.1553624 0.4223188]
Context vector for 'The': [0.48008692 0.41991308]
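
The calculation above uses raw dot products. In the standard Transformer formulation the scores are additionally divided by √d_k (the key dimension) before the softmax. The following is a minimal sketch of that variant on the same Q, K and V; the `softmax` helper and the `scaled_*` names are illustrative assumptions, and subtracting the maximum score is only a common numerical-stability measure that does not change the result.

```python
import numpy as np

Q = np.array([1.0, 0.0])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # "The", "cat", "sat"
V = np.array([[0.5, 0.2], [0.1, 0.8], [0.6, 0.5]])


def softmax(x):
    """Softmax with the maximum subtracted first to avoid overflow."""
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()


d_k = K.shape[1]                        # key dimension, 2 in this example
scaled_scores = (K @ Q) / np.sqrt(d_k)  # scaled dot products
scaled_weights = softmax(scaled_scores)
scaled_context = scaled_weights @ V     # weighted sum of the value vectors

print("Scaled scores:", scaled_scores)
print("Softmax weights:", scaled_weights)
print("Context vector for 'The':", scaled_context)
```

With d_k = 2 the scaling only flattens the weights slightly, and the ranking of the keys is unchanged.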
