# Lab 1: Tokenisation and embeddings

In this lab, you will build an understanding of how text can be transformed into representations that computers can process and learn from. Specifically, you will explore two key concepts: *tokenisation* and *embeddings*. Tokenisation splits text into smaller units such as words, subwords, or characters. Embeddings are dense, fixed-size vector representations of tokens in a continuous space.

*Tasks you can choose for the oral exam are marked with the graduation cap 🎓 emoji.*

## Part 1: Tokenisation

In the first part of the lab, you will code and analyse a tokeniser based on the Byte Pair Encoding (BPE) algorithm.

### Utility functions

The BPE tokeniser transforms text into a list of integers representing tokens. As a warm-up, you will implement two utility functions on such lists. To simplify things, we define a shorthand for the type of pairs of integers:

In [1]:
Pair = tuple[int, int]

#### 🎈 Task 1.01: Counting pairs

Write a function that counts all occurrences of pairs of consecutive token IDs in a given list. The function should return a dictionary that maps each pair to its count. Skip counts that are zero.

In [2]:
ids = [1, 2, 3, 2, 1, 2, 3, 1, 2, 3, 1, 1, 2, 2, 3, 1, 2]

In [3]:
def count(ids: list[int]) -> dict[Pair, int]:
    pair_freqs = {}
    
    # Iterate through the list of token IDs
    for i in range(len(ids) - 1):
        pair = (ids[i], ids[i + 1])
        pair_freqs[pair] = pair_freqs.get(pair, 0) + 1
    
    return pair_freqs

# Example usage:
print(count(ids))

{(1, 2): 5, (2, 3): 4, (3, 2): 1, (2, 1): 1, (3, 1): 3, (1, 1): 1, (2, 2): 1}


#### 🎈 Task 1.02: Replacing pairs

Write a function that replaces all occurrences of a specified pair of consecutive token IDs in a given list by a new ID. The function should return the modified list.

In [4]:
def replace(ids: list[int], pair: Pair, new_id: int) -> list[int]:
    modified_ids = ids.copy()  # copy of the original list
    
    i = 0
    while i < len(modified_ids) - 1:
        if (modified_ids[i], modified_ids[i + 1]) == pair:
            # Replace pair with new ID
            modified_ids[i : i + 2] = [new_id]
        else:
            i += 1
    
    return modified_ids

pair = (1, 2)
new_id = 69
print(replace(ids, pair, new_id))

[69, 3, 2, 69, 3, 69, 3, 1, 69, 2, 3, 69]


### Encoding and decoding

The next cell contains the core code for the tokeniser in the form of a class `Tokenizer`. This class implements two methods: `encode()` converts an input text to a list of token IDs by exhaustively applying rules for merging pairs of consecutive IDs, and `decode()` reverses this process by looking up the tokens corresponding to the token IDs.

In [5]:
class Tokenizer:
    def __init__(self):
        self.merges = {}
        self.vocab = {i: bytes([i]) for i in range(2**8)}

    def encode(self, text):
        ids = list(text.encode("utf-8"))
        while True:
            counts = count(ids)
            mergeable_pairs = counts.keys() & self.merges.keys()
            if len(mergeable_pairs) == 0:
                break
            to_merge = min(mergeable_pairs, key=self.merges.get)
            ids = replace(ids, to_merge, self.merges[to_merge])
        return ids

    def decode(self, ids):
        return b"".join((self.vocab[i] for i in ids)).decode("utf-8")

In [6]:
from typing import Dict, List

class Tokenizer:
    def __init__(self) -> None:
        self.merges: Dict[Pair, int] = {}  # Maps each mergeable pair to the ID of the token it is merged into
        self.vocab: Dict[int, bytes] = {i: bytes([i]) for i in range(2**8)}  # Vocabulary mapping integers to byte values

    def encode(self, text: str) -> List[int]:
        """Encodes a string of text into a list of integer IDs based on the tokenizer's merges."""
        ids: List[int] = list(text.encode("utf-8"))  # List of integers representing the text in UTF-8 encoding
        while True:
            counts = count(ids)  # Count occurrences of pairs in `ids`
            mergeable_pairs = counts.keys() & self.merges.keys()  # Find intersecting pairs
            if len(mergeable_pairs) == 0:
                break
            to_merge = min(mergeable_pairs, key=self.merges.get)  # Select the pair whose merged ID is lowest
            ids = replace(ids, to_merge, self.merges[to_merge])  # Replace the pair with the merged value
        return ids

    def decode(self, ids: List[int]) -> str:
        """Decodes a list of integer IDs back into a string."""
        return b"".join((self.vocab[i] for i in ids)).decode("utf-8")  # Join and decode into UTF-8 string


#### 🎓 Task 1.03: Encoding and decoding

Explain how the code implements the BPE algorithm. Use the following steps to check your understanding:

**Step&nbsp;1.** Annotate the attributes and methods of the `Tokenizer` class with their Python types. In particular, what is the type of `self.merges`? Use the `Pair` shorthand.

**Step&nbsp;2.** Explain how the implementation chooses which merge rule to apply. Provide an example that illustrates the logic.

##### Solution, step 1: Attributes and methods of the `Tokenizer` class

- `self.merges: dict[Pair, int]` is a dictionary that holds the merge rules for BPE. Each key is a `Pair`, that is, a tuple of two integers (token IDs). Each value is an integer: the ID of the new token that results from the merge.
- `self.vocab: dict[int, bytes]` is a dictionary that maps token IDs (integers) to the byte sequences they stand for. Initially, it contains the IDs 0 to 255, each mapped to the corresponding single byte.
- `encode(self, text: str) -> list[int]` converts a given text into a list of token IDs. It encodes the text into UTF-8 bytes, iteratively applies merge rules from `self.merges` to combine pairs of tokens, and returns the final tokenized representation as a list of integers.
- `decode(self, ids: list[int]) -> str` converts a list of token IDs back into the original text. It looks up each token ID in `self.vocab` to retrieve the corresponding bytes, joins the bytes together, and decodes them as a UTF-8 string.

##### Solution, step 2: How the merge rule is chosen

- **(1)** Identify adjacent token pairs. The function `count(ids)` (from Task 1.01) counts the occurrences of adjacent token pairs in `ids`.
- **(2)** Check which pairs exist in `self.merges`. The intersection of the counted pairs and `self.merges.keys()` gives the merge candidates.
- **(3)** Select the best pair to merge. Among the candidates, the pair with the lowest integer value in `self.merges` is chosen first.
- **(4)** Replace the selected pair with the new merged token. The function `replace(ids, to_merge, self.merges[to_merge])` (from Task 1.02) replaces all occurrences of `to_merge` in `ids` with the new token ID.
- **(5)** Repeat the process until no more merges are possible.
- Example: The list `[101, 102, 103]` contains the pairs `(101, 102)` and `(102, 103)`. The code first looks for pairs that exist in `self.merges` and then selects the one with the smallest value. With `self.merges = {(101, 102): 256, (102, 103): 257}`, the pair `(101, 102)` (value 256) is chosen over `(102, 103)` (value 257) and replaced with 256, which gives `[256, 103]`. The remaining pair `(256, 103)` is not in `self.merges`, so encoding stops.
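
The selection logic can be checked with a small self-contained sketch. It re-declares compact versions of `count`, `replace` and `Tokenizer` so that it runs on its own; the two merge rules are the hypothetical ones from the example above, not rules learned from data.

```python
Pair = tuple[int, int]


def count(ids: list[int]) -> dict[Pair, int]:
    """Count pairs of consecutive token IDs."""
    freqs: dict[Pair, int] = {}
    for pair in zip(ids, ids[1:]):
        freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


def replace(ids: list[int], pair: Pair, new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` by `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


class Tokenizer:
    def __init__(self) -> None:
        self.merges: dict[Pair, int] = {}
        self.vocab: dict[int, bytes] = {i: bytes([i]) for i in range(2**8)}

    def encode(self, text: str) -> list[int]:
        ids = list(text.encode("utf-8"))
        while True:
            mergeable = count(ids).keys() & self.merges.keys()
            if not mergeable:
                return ids
            # Among the applicable rules, the one with the lowest new-token ID wins
            to_merge = min(mergeable, key=self.merges.get)
            ids = replace(ids, to_merge, self.merges[to_merge])


tok = Tokenizer()
tok.merges = {(101, 102): 256, (102, 103): 257}  # hypothetical rules for "ef" and "fg"

encoded = tok.encode("efg")  # UTF-8 bytes: [101, 102, 103]
print(encoded)  # [256, 103]
```

Swapping the two values in `tok.merges` would make `(102, 103)` win instead, which shows that the choice depends on the values stored in `self.merges` and not on where the pair occurs in the text.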

##### Feedback

1.03: For step 1, you should annotate the code. You mention that `self.merges` is a dict, but could you elaborate on what this dict contains? The question also asks you to “Annotate the attributes and methods of the `Tokenizer` class with their Python types”, which you did not do. For step 2, you did not “Provide an example that illustrates the logic”. If you are uncertain how to do this, ask during the lab sessions.

##### Response: Most of the corrections have been made in the updated version.

### Training a tokeniser

Upon initialisation, a tokeniser has an empty set of merge rules. Your next task is to complete the BPE algorithm and write code to learn these merge rules from a text.

#### 🎓 Task 1.04: Training a tokeniser

Write a function that induces a BPE tokeniser from a given text. The function should take the text (a string) and a target vocabulary size as input and return the trained tokeniser.

In [7]:
from collections import Counter


def from_text(text: str, vocab_size: int) -> Tokenizer:
    tokenizer = Tokenizer()
    ids = list(text.encode("utf-8"))

    # Start the vocabulary from the distinct characters of the text
    counter = Counter(text)
    tokenizer.vocab = list(counter)

    new_idx = 256
    while len(tokenizer.vocab) < vocab_size:
        counts = count(ids)
        if not counts:
            break
        # Merge the most frequent pair of consecutive token IDs
        most_frequent = max(counts, key=counts.get)
        tokenizer.merges[most_frequent] = new_idx
        tokenizer.vocab.append(most_frequent[0] + most_frequent[1])
        ids = replace(ids, most_frequent, new_idx)
        new_idx += 1
    return tokenizer

To help you test your implementation, we provide three text files together with tokenisers trained on these files. Each text file contains the first 1&nbsp;million Unicode characters in a language-specific Wikipedia:

| Text file | Tokeniser file | Wikipedia |
|---|---|---|
| `wiki-en-1m.txt` | `wiki-en-1m.tok` | [Simple English](https://simple.wikipedia.org/) |
| `wiki-is-1m.txt` | `wiki-is-1m.tok` | [Icelandic](https://is.wikipedia.org/) |
| `wiki-sv-1m.txt` | `wiki-sv-1m.tok` | [Swedish](https://sv.wikipedia.org/) |

A tokeniser file consists of lines specifying merge rules. For example, the first line in the tokeniser file for Swedish is `101 114`, which expresses that this rule combines the token with ID 101 (`e`) and the token with ID 114 (`r`). The ID of the new token (`er`) is 256 plus the (zero-indexed) line number on which the rule is found. The following code saves a `Tokenizer` to a file with this format:

In [8]:
def train_on_wikipedia(file_path: str, vocab_size: int) -> Tokenizer:
    # Read the whole text file and train a tokeniser on it
    with open(file_path, "r", encoding="utf-8") as file:
        text = file.read()

    return from_text(text, vocab_size)

In [10]:
def save(tokenizer: Tokenizer, filename: str) -> None:
    with open(filename, "w") as f:
        for fst, snd in tokenizer.merges:
            print(f"{fst} {snd}", file=f)
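
The same file format can also be read back. The following is a minimal sketch, assuming the format described above (one merge rule per line, two space-separated IDs, new ID equal to 256 plus the zero-indexed line number). The helper name `merges_from_lines` is hypothetical and not part of the lab code, and the second rule in the usage example is made up for illustration.

```python
Pair = tuple[int, int]


def merges_from_lines(lines: list[str]) -> tuple[dict[Pair, int], dict[int, bytes]]:
    """Rebuild merge rules and vocabulary from the lines of a tokeniser file."""
    merges: dict[Pair, int] = {}
    vocab: dict[int, bytes] = {i: bytes([i]) for i in range(256)}
    for line_number, line in enumerate(lines):
        fst, snd = (int(part) for part in line.split())
        new_id = 256 + line_number
        merges[(fst, snd)] = new_id
        # The new token stands for the concatenated bytes of its two parts
        vocab[new_id] = vocab[fst] + vocab[snd]
    return merges, vocab


# `101 114` is the first rule of the Swedish tokeniser; `256 32` is a made-up second rule
merges, vocab = merges_from_lines(["101 114", "256 32"])
print(merges)                  # {(101, 114): 256, (256, 32): 257}
print(vocab[256], vocab[257])  # b'er' b'er '
```

To load a real file, the lines could be obtained with `open(filename).read().splitlines()`.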

##### Solution for 1.04: Training a BPE tokenizer

The `from_text()` function trains a Byte Pair Encoding (BPE) tokenizer by repeatedly merging the most frequent pair of adjacent tokens into a new token until the vocabulary reaches a specified size. This technique is widely used in natural language processing (NLP) to create compact text representations.

The process begins by converting the input text into a sequence of byte values using UTF-8 encoding, in which each character is represented by one or more bytes. A `Counter` is then used to extract the unique characters, which are stored in the initial vocabulary. This ensures that the tokenizer starts with a minimal set of single-character tokens before learning larger units.

To enable token merging, token IDs are assigned. Standard byte values range from 0-255, while new merged tokens receive IDs starting from 256 and increment as more merges occur. This prevents conflicts with existing byte values and allows for structured vocabulary expansion.

The function then iteratively counts occurrences of adjacent token pairs. The most frequently occurring pair is selected and merged into a new token. This newly merged token is stored in the tokenizer.merges dictionary and added to the vocabulary. The token sequence is updated by replacing occurrences of the merged pair with its new token ID. This process repeats until the vocabulary reaches the desired size.

Finally, the trained tokenizer is saved in a file, where each line represents a merge rule. The format specifies which token IDs should be merged, and new token IDs are assigned sequentially as 256 + (zero-indexed line number). This saved tokenizer can be reloaded later for text encoding and decoding, making it a key component in machine learning models such as GPT.

##### Feedback: Your `from_text` function changes `tokenizer.vocab` from a dict to a list. Even though this might work, it is advised to keep the type of the vocabulary as it is instantiated.
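
One way to follow this feedback is sketched below. It is a self-contained variant, not the submitted solution: the name `from_text_sketch` is hypothetical, the helpers are compact re-declarations, and ties between equally frequent pairs are broken by first occurrence, which is an assumption and may differ from the provided tokeniser files. Because the vocabulary keeps its 256 byte entries, the loop performs `vocab_size - 256` merges.

```python
Pair = tuple[int, int]


def count(ids: list[int]) -> dict[Pair, int]:
    """Count pairs of consecutive token IDs."""
    freqs: dict[Pair, int] = {}
    for pair in zip(ids, ids[1:]):
        freqs[pair] = freqs.get(pair, 0) + 1
    return freqs


def replace(ids: list[int], pair: Pair, new_id: int) -> list[int]:
    """Replace every non-overlapping occurrence of `pair` by `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


class Tokenizer:
    def __init__(self) -> None:
        self.merges: dict[Pair, int] = {}
        self.vocab: dict[int, bytes] = {i: bytes([i]) for i in range(2**8)}


def from_text_sketch(text: str, vocab_size: int) -> Tokenizer:
    tokenizer = Tokenizer()
    ids = list(text.encode("utf-8"))
    while len(tokenizer.vocab) < vocab_size:
        counts = count(ids)
        if not counts:
            break
        pair = max(counts, key=counts.get)  # first most frequent pair
        new_id = len(tokenizer.vocab)       # 256, 257, ...
        tokenizer.merges[pair] = new_id
        # Keep the vocabulary a dict from token IDs to bytes
        tokenizer.vocab[new_id] = tokenizer.vocab[pair[0]] + tokenizer.vocab[pair[1]]
        ids = replace(ids, pair, new_id)
    return tokenizer


tokenizer = from_text_sketch("aaabdaaabac", vocab_size=259)
print(tokenizer.merges)      # {(97, 97): 256, (256, 97): 257, (257, 98): 258}
print(tokenizer.vocab[258])  # b'aaab'
```

With the vocabulary kept as a dict of bytes, `decode()` from the `Tokenizer` class above would be able to look up the merged tokens as well.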

To test your code, compare the saved tokeniser to the provided tokeniser using `diff`.

In [11]:
import os

with open("wiki-en-1m.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Generate the tokenizer using the from_text function
tokenizer = from_text(text, 1024)

# Save the generated tokenizer using the save function
save(tokenizer, "generated_tokenizer.tok")

# Compare the generated tokenizer with the provided tokenizer using 'diff'
os.system("diff generated_tokenizer.tok wiki-en-1m.tok")

769,806d768
< 328 361
< 270 115
< 275 115
< 109 260
< 101 451
< 809 323
< 98 344
< 324 335
< 521 792
< 366 312
< 50 49
< 476 105
< 50 32
< 260 118
< 429 292
< 116 300
< 266 302
< 760 285
< 557 674
< 105 102
< 44 417
< 278 290
< 584 434
< 115 292
< 384 116
< 470 313
< 422 103
< 447 100
< 97 119
< 414 265
< 108 421
< 292 312
< 546 268
< 448 318
< 111 262
< 418 259
< 938 617
< 99 322


256

### Tokenisation quirks

The tokeniser is a key component of language models, as it defines the minimal chunks of text the model can “see” and work with. As you will see in this section, tokenisation is also responsible for several deficiencies and unexpected behaviours of language models.

One helpful tool for experimenting with tokenisers in language models is the web app [Tiktokenizer](https://tiktokenizer.vercel.app/). This app lets you play around with, among others, [`cl100k_base`](https://tiktokenizer.vercel.app/?model=cl100k_base), the tokeniser used in the free version of ChatGPT and OpenAI’s APIs, and [`o200k_base`](https://tiktokenizer.vercel.app/?model=o200k_base), used in GPT-4o.

#### 🎓 Task 1.05: Tokenisation quirks

Prompt [ChatGPT](https://chatgpt.com/) to reverse the letters in the following words:

```
creativecommons
MERCHANTABILITY
NSNotification
authentication
```

How many of these words come out right? What could be the problem when words come out wrong? Generate ideas by inspecting the words in Tiktokenizer. Try to come up with other prompts that illustrate problems related to tokenisation.

The aim was to analyze how ChatGPT handles word reversal, particularly with compound words, capitalization, and different token structures. The initial theory was that compound words like "swimsuit" or "popcorn" might be problematic for ChatGPT because their meaning is tied to a single concept. However, ChatGPT handled these well, even though Tiktokenizer splits them into multiple tokens.

The model struggled more with random capitalization, likely because capital letters increase the number of tokens. Of the four words above, only "authentication" was reversed correctly; others, such as "MERCHANTABILITY" and "NSNotification", came out wrong. This suggests that uppercase letters disrupt the word structure and make reversal more challenging.

More generally, factors like capitalization, special characters, and formatting can influence how the model interprets words, potentially leading to unexpected results.

The compound word "creativecommons" was tokenized into a single token, while an ordinary word like "unbelievable" was tokenized into three subwords. A single-token word gives the model no direct view of its individual letters, so the character-level reversal that ChatGPT attempts for "creativecommons" ends in an error.

##### Feedback: The image that you added is not visible to me. Perhaps this is because it is included in the notebook only as a link.

##### Response to the feedback: Corrections have been made.

### Tokenisation and multi-linguality

Many NLP systems and the tokenisers used with them are primarily trained on English data. In the next task, you will reflect on the effect this has when they are used to process non-English data.

The *context length* of a language model is the maximum number of preceding tokens the model can condition on when predicting the next token. This number is fixed and cannot be changed after training the model. For example, the context length of GPT-2 ([Radford et al., 2019](https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf)) is 1,024. 

While the context length of a language model is fixed, the amount of information that can be squeezed into this context length will depend on the tokeniser. Informally speaking, a model that needs more tokens to represent a given text cannot condition on as much information as one that needs fewer tokens.

#### 🎓 Task 1.06: Tokenisation and multi-linguality

Train a tokeniser on the English text file from Task&nbsp;1.04 and test it on the same text. How many tokens does it split the text into? Based on this, what is the expected number of Unicode characters of English text that can be fit into a context length of 1,024?

What do the numbers look like if you test the English tokeniser on the Icelandic text instead? What could explain the differences?

Interpreting the expected number of Unicode characters as a measure of representation efficiency, what do your results tell you about the efficiency of a language model primarily trained on English data when it is used to process non-English data? Why are these findings relevant?

In [12]:
# Load the English text file (wiki-en-1m.txt)
with open('wiki-en-1m.txt', 'r', encoding='utf-8') as file:
    english_text = file.read()

# Train the tokenizer
vocab_size = 1024  # Set the desired vocabulary size
tokenizer = from_text(english_text, vocab_size)

# Encode the English text with the trained tokenizer
encoded_text_en = tokenizer.encode(english_text)

# Number of tokens in the English text
num_tokens_en = len(encoded_text_en)

# Number of characters in the English text
num_chars_en = len(english_text)

# Number of bytes in the UTF-8 encoded English text
num_bytes_en = len(english_text.encode('utf-8'))

# Estimate average bytes per token
avg_bytes_per_token_en = num_bytes_en / num_tokens_en if num_tokens_en != 0 else 0

# Expected number of Unicode characters that fit into a context length of 1,024 bytes
max_chars_in_context_en = 1024 // avg_bytes_per_token_en

print(f"English - Number of tokens: {num_tokens_en}")
print(f"English - Number of characters: {num_chars_en}")
print(f"English - Number of bytes (UTF-8): {num_bytes_en}")
print(f"English - Average bytes per token: {avg_bytes_per_token_en}")
print(f"English - Expected number of Unicode characters in 1,024 bytes: {max_chars_in_context_en}")

# Optionally, save the trained tokenizer
save(tokenizer, "tokenizer_output.tok")

English - Number of tokens: 375309
English - Number of characters: 1000000
English - Number of bytes (UTF-8): 1001360
English - Average bytes per token: 2.6680948231990174
English - Expected number of Unicode characters in 1,024 bytes: 383.0


In [13]:
# Read the Icelandic text file
with open('wiki-is-1m.txt', 'r', encoding='utf-8') as file:
    icelandic_text = file.read()

# Encode the Icelandic text to get token ids
encoded_text_is = tokenizer.encode(icelandic_text)

# Number of tokens in the Icelandic text
num_tokens_is = len(encoded_text_is)

# Number of characters in the Icelandic text
num_chars_is = len(icelandic_text)

# Number of bytes in the UTF-8 encoded Icelandic text
num_bytes_is = len(icelandic_text.encode('utf-8'))

# Estimate average bytes per token
avg_bytes_per_token_is = num_bytes_is / num_tokens_is if num_tokens_is != 0 else 0

# Expected number of Unicode characters that fit into a context length of 1,024 bytes
max_chars_in_context_is = 1024 // avg_bytes_per_token_is

print(f"Icelandic - Number of tokens: {num_tokens_is}")
print(f"Icelandic - Number of characters: {num_chars_is}")
print(f"Icelandic - Number of bytes (UTF-8): {num_bytes_is}")
print(f"Icelandic - Average bytes per token: {avg_bytes_per_token_is}")
print(f"Icelandic - Expected number of Unicode characters in 1,024 bytes: {max_chars_in_context_is}")
# Optionally, save the trained tokenizer
save(tokenizer, "tokenizer_output_is.tok")

Icelandic - Number of tokens: 752403
Icelandic - Number of characters: 1000000
Icelandic - Number of bytes (UTF-8): 1091189
Icelandic - Average bytes per token: 1.4502719951940648
Icelandic - Expected number of Unicode characters in 1,024 bytes: 706.0


#### Task 1.06 Solution
- The English-trained tokenizer splits the Icelandic text into about twice as many tokens as the English text (752,403 versus 375,309), even though both texts contain 1,000,000 characters. The UTF-8 byte size of the Icelandic text is higher because its non-ASCII characters take more than one byte each. On average, a token covers fewer bytes of Icelandic (about 1.45) than of English (about 2.67), so the tokenization of Icelandic is less efficient, and fewer Icelandic characters fit into a fixed number of tokens.

- For English-trained models, processing non-English languages is therefore less efficient. The merge rules capture frequent English character sequences, many of which are rare in Icelandic, so Icelandic needs more tokens to express the same content. Its rich inflection and long compound words may add to this. The efficiency of a model hinges on how well its tokenizer fits the text, implying that models trained mainly on English may struggle with multilingual tasks.

- These insights underscore the need to optimize tokenization strategies and pre-trained models for multiple languages, with an emphasis on training on diverse datasets.

- When a tokenizer trained on English text is applied to both English and Icelandic texts, clear differences in tokenisation efficiency emerge. The English tokenizer performs efficiently on English data, producing fewer, longer tokens by capturing frequent patterns and subwords. In contrast, when applied to Icelandic, the same tokenizer splits the text into many more, shorter tokens. This happens because Icelandic contains characters and patterns unfamiliar to the English-trained tokenizer, leading to more frequent breaks and less compression. As a result, the same number of tokens (e.g., 1,024) can represent more characters in English than in Icelandic. This directly affects the performance of language models like GPT, which have fixed context limits in terms of tokens—not characters or bytes. When non-English languages require more tokens to express the same amount of text, the model can “see” less of the input at once, leading to reduced performance. This means that language models trained primarily on English tend to be less efficient and less accurate when processing non-English content. The findings highlight a core challenge in multilingual NLP: tokenisation strategies trained on a single language often underperform on others, reducing both efficiency and fairness. To address this, models need to be trained with multilingual or language-agnostic tokenisers that can handle a wide variety of scripts and language structures.
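
Since the context length is measured in tokens, the expected number of Unicode characters per context can be estimated as the average number of characters per token multiplied by 1,024. The following sketch does this with the character and token counts printed by the cells above; the function name `expected_chars_in_context` is a hypothetical helper.

```python
def expected_chars_in_context(
    num_chars: int, num_tokens: int, context_length: int = 1024
) -> float:
    """Average characters per token, scaled to a context of `context_length` tokens."""
    return context_length * num_chars / num_tokens


# Character and token counts as reported in the outputs above
english = expected_chars_in_context(1_000_000, 375_309)
icelandic = expected_chars_in_context(1_000_000, 752_403)

print(f"English:   about {english:.0f} characters per 1,024 tokens")
print(f"Icelandic: about {icelandic:.0f} characters per 1,024 tokens")
```

Under this estimate, a context of 1,024 tokens holds roughly 2,728 characters of English but only about 1,361 characters of Icelandic, which is consistent with the observation that the English-trained tokenizer is about half as efficient on Icelandic.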

## Part 2: Embeddings

In the second part of the lab, you will explore embeddings. An embedding layer is a network component that assigns each item in a finite set of elements (often called a *vocabulary*) a fixed-size vector. At first, these vectors are filled with random values, but during training, they are adjusted to suit the task at hand.

### Bag-of-words classifier

To help you build an intuition for embeddings and the vector representations learned by them, we will use a simple bag-of-words text classifier. The core part of this classifier only takes a few lines of code:

In [14]:
import torch.nn as nn


class Classifier(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.linear = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        return self.linear(self.embedding(x).mean(dim=-2))

#### 🎈 Task 1.07: Bag-of-words classifier

Explain how the bag-of-words classifier works. How does the code match the diagram you saw in the lectures? Why is there only one `nn.Embedding`, while the diagram shows three embedding layers? What does the keyword argument `dim=-2` do?

##### Solution for task 1.07
- Bag-of-words (BoW) classifier: Represents a text as an unordered collection of words. An embedding layer maps each word to a dense vector, the vectors are averaged into a fixed-size document representation, and a linear layer turns this representation into one score per class. Softmax is often applied to these scores to obtain probabilities.

- Code vs. diagram:
  - The classifier receives the words as token IDs, and `nn.Embedding` maps each ID to a dense vector.
  - The embeddings are averaged using `.mean(dim=-2)`, which gives a fixed-size representation regardless of the number of words.
  - The code uses a single `nn.Embedding`, although the diagram shows three embedding layers. The three boxes in the diagram can be read as the same layer, with the same weights, applied at three word positions. One lookup table therefore suffices, and PyTorch applies it to all token IDs in the input at once.

- Role of `dim=-2`:
  - The embedding layer outputs a tensor of shape `(batch_size, num_words, embedding_dim)`.
  - `.mean(dim=-2)` averages over the second-to-last axis, that is, across words. This reduces the shape to `(batch_size, embedding_dim)`, producing one vector per review.
  - Because the axis is counted from the end, the same code also works for a single unbatched review of shape `(num_words, embedding_dim)`.
  - Averaging discards word order, which is acceptable for text classification tasks where word order is not important.
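
What `self.embedding(x).mean(dim=-2)` computes can be illustrated without PyTorch. The following plain-Python sketch uses a made-up embedding table with four vocabulary items and `embedding_dim = 2`; the function name `embed_and_average` is hypothetical, and nested lists stand in for tensors.

```python
# Toy embedding table: one row per token ID (made-up values)
embedding = [
    [0.0, 0.0],  # ID 0
    [1.0, 2.0],  # ID 1
    [3.0, 4.0],  # ID 2
    [5.0, 6.0],  # ID 3
]


def embed_and_average(batch: list[list[int]]) -> list[list[float]]:
    """Look up each token ID and average over the token axis (second-to-last)."""
    pooled = []
    for review in batch:                          # batch axis
        vectors = [embedding[i] for i in review]  # (num_words, embedding_dim)
        columns = zip(*vectors)                   # one group per embedding dimension
        pooled.append([sum(c) / len(review) for c in columns])
    return pooled


batch = [[1, 2, 3], [3, 3, 1]]  # shape (batch_size=2, num_words=3)
pooled = embed_and_average(batch)
print(pooled[0])  # [3.0, 4.0]
print(len(pooled), len(pooled[0]))  # 2 2, i.e. (batch_size, embedding_dim)
```

Both reviews end up as one vector of length `embedding_dim`, whatever their order of words, which is the "bag" in bag-of-words.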

### Dataset

You will apply the classifier to a small dataset with Amazon customer reviews. This dataset is taken from [a much larger dataset](https://www.cs.jhu.edu/~mdredze/datasets/sentiment/) first described by [Blitzer et al. (2007)](https://aclanthology.org/P07-1056/).

The dataset contains whitespace-tokenised product reviews from two topics: cameras (`camera`) and music (`music`). Each review is additionally annotated for sentiment towards the product at hand: negative (`neg`) or positive (`pos`). The topic and sentiment labels are prepended to the review. As an example, here is the first review from the training data:

```
music neg oh man , this sucks really bad . good thing nu-metal is dead . thrash metal is real metal , this is for posers
```

The next cell contains a custom [`Dataset`](https://pytorch.org/tutorials/beginner/basics/data_tutorial.html) class for the review dataset. To initialise an instance of this class, you specify the name of the file containing the reviews you want to load (`filename`) and which of the two labels you want to use (`label`): topic (0) or sentiment (1).

In [15]:
from torch.utils.data import Dataset


class ReviewDataset(Dataset):
    def __init__(self, filename: str, label: int = 0) -> None:
        with open(filename) as f:
            tokenized_lines = [line.split() for line in f]
        self.items = [(tokens[2:], tokens[label]) for tokens in tokenized_lines]

    def __len__(self) -> int:
        return len(self.items)

    def __getitem__(self, idx: int) -> tuple[list[str], str]:
        return self.items[idx]

### Vectoriser

To feed a review into the bag-of-words classifier, you first need to turn it into a vector of token IDs. Likewise, you need to convert the label (topic or sentiment) into an integer. The next cell contains a partially completed `ReviewVectorizer` class that handles this transformation.

In [16]:
from collections import Counter

import torch

# Type abbreviation for review–label pairs
Item = tuple[list[str], str]


class ReviewVectorizer:
    PAD = "[PAD]"
    UNK = "[UNK]"

    def __init__(self, dataset: ReviewDataset, n_vocab: int = 1024) -> None:
        # Unzip the dataset into reviews and labels
        reviews, labels = zip(*dataset)

        # Count the tokens and get the most common ones
        counter = Counter(t for r in reviews for t in r)
        most_common = [t for t, _ in counter.most_common(n_vocab - 2)]

        # Create the token-to-index and label-to-index mappings
        self.t2i = {t: i for i, t in enumerate([self.PAD, self.UNK] + most_common)}
        self.l2i = {l: i for i, l in enumerate(sorted(set(labels)))}

    def __call__(self, items: list[Item]) -> tuple[torch.Tensor, torch.Tensor]:
        reviews, labels = zip(*items)
                
        # Convert reviews to lists of token IDs, replacing unknown tokens
        tokenized_reviews = [[self.t2i.get(t, self.t2i[self.UNK]) for t in review] for review in reviews]
        
        # Pad the tokenized reviews to the same length
        max_len = max(len(r) for r in tokenized_reviews)
        padded_reviews = [r + [self.t2i[self.PAD]] * (max_len - len(r)) for r in tokenized_reviews]
        
        # Convert labels to indices
        label_indices = [self.l2i[label] for label in labels]
        
        return torch.tensor(padded_reviews, dtype=torch.long), torch.tensor(label_indices, dtype=torch.long)

A `ReviewVectorizer` maps tokens and labels to IDs using two Python dictionaries. These dictionaries are set up when the vectoriser is initialised and queried when the vectoriser is called on a batch of review–label pairs. They include IDs for two special tokens:

`[PAD]` (Padding): Reviews can have different lengths, but PyTorch requires all vectors in a batch to be the same size. To handle this, the vectoriser adds `[PAD]` tokens to the end of shorter reviews so they match the length of the longest review in the batch.

`[UNK]` (Unknown): If a review contains a token that is not in the token-to-ID dictionary, the vectoriser assigns it the ID of the `[UNK]` token instead of a regular ID.
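
The effect of the two special tokens can be seen on a toy batch. This is a minimal plain-Python sketch: the token-to-ID dictionary `t2i` and the function name `to_padded_ids` are made up for illustration, and the result is left as nested lists instead of a tensor.

```python
PAD, UNK = "[PAD]", "[UNK]"

# Hypothetical token-to-ID dictionary
t2i = {PAD: 0, UNK: 1, "good": 2, "bad": 3, "camera": 4}


def to_padded_ids(reviews: list[list[str]]) -> list[list[int]]:
    # Unknown tokens fall back to the ID of [UNK]
    ids = [[t2i.get(t, t2i[UNK]) for t in review] for review in reviews]
    # Shorter reviews are filled up with the ID of [PAD]
    max_len = max(len(r) for r in ids)
    return [r + [t2i[PAD]] * (max_len - len(r)) for r in ids]


batch = [["good", "camera"], ["bad", "lens", "bad", "camera"]]
padded = to_padded_ids(batch)
print(padded)  # [[2, 4, 0, 0], [3, 1, 3, 4]]
```

The token `lens` is not in the dictionary and becomes ID 1, while the first review receives two padding IDs so that both rows have length 4.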

#### 🎓 Task 1.08: Vectoriser

Explain and complete the code of the vectoriser. Follow these steps:

**Step&nbsp;1.** Explain how unzipping works. What are the types of `reviews` and `labels`?

**Step&nbsp;2.** Explain how the token-to-ID and label-to-ID mappings are constructed. How does the `most_common()` method deal with elements that occur equally often?

**Step&nbsp;3.** Complete the implementation of the `__call__()` method. This method should convert a list of $m$ review–label pairs into a pair $(X, y)$ where $X$ is a matrix containing the vectors with token IDs for the reviews, and $y$ is a vector containing the IDs of the corresponding labels.

##### Solution for task 1.08

The `ReviewVectorizer` class is designed to convert text-based reviews into numerical representations for machine learning models using PyTorch. The class takes a dataset of text reviews with their labels and provides a method to transform a batch of reviews into fixed-length numerical tensors.

---

##### **Understanding the Initialization (`__init__` Method)**

When an instance of the `ReviewVectorizer` class is created, it takes two parameters:

1. **`dataset`**: A `ReviewDataset` of review-label pairs, where:
   - Each review is represented as a list of tokens (words).
   - The label is a string.

2. **`n_vocab`**: The maximum number of words included in the vocabulary (default is `1024`).

##### **Processing the Dataset**
- The dataset is split into separate lists:
  - One containing all reviews.
  - Another containing all labels.

- A `Counter` object counts the frequency of each token (word).
- The `most_common(n_vocab - 2)` function selects the most frequent words while reserving space for two special tokens:

  1. **`"[PAD]"`** (Padding Token) → Ensures all reviews have the same length.
  2. **`"[UNK]"`** (Unknown Token) → Represents rare or out-of-vocabulary words.

##### **Creating the Token-to-Index Mapping (`t2i`)**
The `t2i` dictionary assigns unique indices to each token:
- `"[PAD]"` → Index **0**  
- `"[UNK]"` → Index **1**  
- Most common words are assigned subsequent indices.

---

##### **Understanding the `__call__` Method**

This method enables the class instance to be used as a function to transform a batch of reviews into tensors.

##### **Processing Steps:**
1. The input batch (a list of review-label pairs) is split into:
   - A list of reviews.
   - A list of labels.

2. Each review (a list of words) is converted into a list of numerical token IDs:
   - If a word exists in `t2i`, its corresponding index is used.
   - Otherwise, it is replaced with `"[UNK]"` (index **1**).

3. Reviews are **padded** to match the length of the longest review:
   - Shorter reviews are padded with `"[PAD]"` (index **0**).

4. Labels are converted to numerical indices using the `l2i` dictionary.

5. The processed reviews and labels are returned as **PyTorch tensors**, making them ready for deep learning models.
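
The processing steps above can be sketched in plain Python. This is a minimal illustration, not the lab's actual `__call__()` implementation: the toy `t2i` and `l2i` mappings and the batch contents are made up, and the final conversion to tensors (for example with `torch.tensor(X)` and `torch.tensor(y)`) is left out so that the sketch runs without PyTorch.

```python
PAD, UNK = 0, 1

# Hypothetical mappings; in the lab they are built in __init__().
t2i = {"[PAD]": PAD, "[UNK]": UNK, "good": 2, "phone": 3}
l2i = {"neg": 0, "pos": 1}

# A toy batch of review-label pairs.
batch = [
    (["good", "phone"], "pos"),
    (["bad", "phone", "screen", "good"], "neg"),
]

# Step 1: split the batch into reviews and labels.
reviews, labels = zip(*batch)

# Step 2: map tokens to IDs, falling back to [UNK] for unknown tokens.
encoded = [[t2i.get(token, UNK) for token in review] for review in reviews]

# Step 3: pad every review to the length of the longest one in the batch.
max_len = max(len(ids) for ids in encoded)
X = [ids + [PAD] * (max_len - len(ids)) for ids in encoded]

# Step 4: map labels to IDs.
y = [l2i[label] for label in labels]

print(X)
print(y)
```

Because padding is done per batch, the width of `X` can differ from one batch to the next.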

---

The `ReviewVectorizer` class is a crucial preprocessing step in NLP tasks, ensuring that text data is transformed into a structured numerical format compatible with deep learning models. By implementing tokenization, padding, and label encoding, this class efficiently prepares review data for training and evaluation.

---


##### Step 1: Understanding Unzipping and Data Types
In Python, the expression `zip(*dataset)` "unzips" a list of pairs: the `*` passes every pair as a separate argument to `zip()`, which then groups all first elements together and all second elements together. Here the dataset consists of review-label pairs, where each review is a list of strings (tokens) and each label is a single string, such as a sentiment category ("positive", "negative") or a topic. Because `zip()` yields tuples, `reviews` is a tuple of token lists (`tuple[list[str], ...]`) and `labels` is a tuple of strings (`tuple[str, ...]`); wrap them in `list()` if lists are needed.
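
A small sketch of the unzipping idiom on a made-up dataset (the reviews and labels below are invented for illustration):

```python
# Toy dataset: each item is a (review, label) pair.
dataset = [
    (["great", "camera"], "pos"),
    (["battery", "died", "fast"], "neg"),
    (["sturdy", "case"], "pos"),
]

# zip(*dataset) is equivalent to zip(dataset[0], dataset[1], dataset[2]):
# it groups the first elements together and the second elements together.
reviews, labels = zip(*dataset)

print(type(reviews).__name__, reviews)
print(type(labels).__name__, labels)

# Convert explicitly if list methods are needed later on.
reviews_list = list(reviews)
```

Both results are tuples; the individual reviews inside `reviews` are still the original lists of tokens.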

##### Step 2: Constructing Token-to-ID and Label-to-ID Mappings
The process begins by counting the occurrences of each token across all reviews with `Counter`, a specialised dictionary from the `collections` module. The call `most_common(n_vocab - 2)` then selects the most frequent tokens, leaving room for the two special tokens. When several tokens occur equally often, `most_common()` returns them in the order in which they were first encountered.

The token-to-index (`t2i`) dictionary is then built by first adding the special tokens `[PAD]` (for padding shorter reviews) and `[UNK]` (for unknown words not in the vocabulary), followed by the most frequent tokens, so that each token maps to a unique integer ID. Similarly, the label-to-index (`l2i`) dictionary assigns an integer index to each unique label; the labels are sorted alphabetically so that the mapping does not depend on the order of the data.
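
The construction described above can be sketched as follows. The reviews, labels, and the tiny `n_vocab` are made up so that the tie-breaking is visible: three tokens occur twice, but only two slots remain after the special tokens.

```python
from collections import Counter

# Toy data; in the lab these come from the unzipped dataset.
reviews = [["good", "phone", "good"], ["bad", "phone"], ["screen", "bad"]]
labels = ["pos", "neg", "neg"]
n_vocab = 4  # hypothetical size, including [PAD] and [UNK]

# Count every token over all reviews.
counter = Counter(token for review in reviews for token in review)

# "good", "phone" and "bad" all occur twice; ties are returned in the
# order in which the tokens were first encountered.
most_common = counter.most_common(n_vocab - 2)

# Special tokens first, then the most frequent tokens.
t2i = {"[PAD]": 0, "[UNK]": 1}
for token, _ in most_common:
    t2i[token] = len(t2i)

# Labels are sorted so the mapping does not depend on data order.
l2i = {label: i for i, label in enumerate(sorted(set(labels)))}

print(t2i)
print(l2i)
```

Here `bad` is dropped from the vocabulary even though it is as frequent as `good` and `phone`, simply because it was encountered later.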

##### Feedback: Unzipping actually splits a list of pairs into two separate lists.

##### Solution: The corrections have been made.

### Training the classifier

With the vectoriser completed, you are ready to train a classifier. More specifically, you can train two separate classifiers: one to predict the topic of a review, and one to predict the sentiment. The next cell contains a simple training loop that you can adapt for this purpose.

In [18]:
import torch.nn.functional as F


def train(filename: str = "reviews-train.txt",  # path of the training data
          learning_rate: float = 0.001,         # learning rate of the optimizer
          batch_size: int = 16,                 # batch size of the data loader
          num_epochs: int = 100,                # number of training epochs
          vocab_size: int = 1024,               # size of the vocabulary
          hidden_size: int = 64):               # hidden size of the classifier

    # Load the dataset; `label` is the column index of the labels to predict
    dataset = ReviewDataset(filename, label=0)
    # Build the vocabulary and label mappings; also used to collate batches
    processor = ReviewVectorizer(dataset, vocab_size)
    model = Classifier(vocab_size, hidden_size, len(processor.l2i))
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    # Serve shuffled mini-batches; the vectoriser turns each batch into tensors
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=processor,
    )

    for epoch in range(num_epochs):  # one pass over the training data per epoch
        model.train()                # put the model into training mode
        running_loss = 0             # reset the accumulated loss for this epoch
        # bx is a batch of token IDs, by is the batch of matching label IDs
        for bx, by in data_loader:
            optimizer.zero_grad()    # clear gradients from the previous step
            output = model(bx)       # forward pass: compute predictions for the batch
            # cross_entropy expects raw scores (logits) and applies softmax internally
            loss = F.cross_entropy(output, by)
            loss.backward()          # backward pass: compute gradients of the loss
            optimizer.step()         # update the parameters using the gradients
            running_loss += loss.item()
        # running_loss is the total loss over all batches; len(data_loader) is the number of batches
        print(f"Epoch {epoch}, loss: {running_loss / len(data_loader):.4f}")
    return processor, model

In [19]:
import torch.nn.functional as F


def train(filename: str="reviews-train.txt", learning_rate: float=0.001, batch_size: int=16, num_epochs: int=100):
    dataset = ReviewDataset(filename, label=1)
    processor = ReviewVectorizer(dataset, 1024)
    model = Classifier(1024, 64, len(processor.l2i))
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=processor,
    )
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0
        for bx, by in data_loader:
            optimizer.zero_grad()
            output = model(bx)
            loss = F.cross_entropy(output, by)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        print(f"Epoch {epoch}, loss: {running_loss / len(data_loader):.4f}")
    return processor, model

#### 🎓 Task 1.09: Training loop

Explain the training loop. Follow these steps:

**Step&nbsp;1.** Go through the training loop line-by-line and add comments where you find it suitable. Your comments should be detailed enough for you to explain the main steps of the loop.

**Step&nbsp;2.** The training loop contains various hard-coded values like filename, learning rate, batch size, and epoch count. This makes the code less flexible. Revise the code so that you can specify these values using keyword arguments. Use the concrete values from the code as defaults.

### SOLUTION FOR TASK 1.09

The `train` function is responsible for training a neural network model for text classification. Below is a detailed explanation of the training loop.

### 1. Dataset Preparation
- The dataset is loaded using `ReviewDataset`, which reads training data from a file (`reviews-train.txt` by default). The `label` argument selects which label column to predict, so the same loop can train either classifier.
- `ReviewVectorizer` builds a vocabulary of at most `1024` tokens from the dataset and later converts each batch of reviews into padded tensors of token IDs.

### 2. Model and Optimizer Initialization
- The `Classifier` model is created with:
  - A vocabulary size of `1024` (the number of rows in the embedding matrix)
  - An embedding dimension of `64`
  - An output layer corresponding to the number of unique labels in the dataset (`len(processor.l2i)`).
- The Adam optimizer is initialized with a learning rate of `0.001` to update the model’s parameters during training.

### 3. Data Loader Setup
- The dataset is wrapped in a PyTorch `DataLoader`, which enables efficient batching and shuffling.
- The `collate_fn` argument ensures that `ReviewVectorizer` correctly formats batches of data for training.

### 4. Training Loop Execution
- The model is trained for `num_epochs` (default: `100`), iterating over the dataset multiple times to optimize performance.
- At the start of each epoch:
  - The model is set to training mode using `model.train()`, ensuring layers like dropout (if any) function correctly.
  - `running_loss` is initialized to track the cumulative loss for the epoch.

### 5. Mini-Batch Training
For each batch:
- The optimizer resets gradients using `optimizer.zero_grad()`.
- The input batch (`bx`) is passed through the model to generate predictions.
- The loss is computed using `F.cross_entropy(output, by)`, which measures the difference between predicted and actual labels.
- Backpropagation is performed using `loss.backward()`, computing gradients for each model parameter.
- The optimizer updates the model parameters using `optimizer.step()`.
- The batch loss is added to `running_loss` to track the total loss for the epoch.
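
To make the loss computation concrete, here is a minimal plain-Python sketch of what a mean cross-entropy over raw scores (logits) computes. It is written for illustration only and is not PyTorch's implementation; the helper name `cross_entropy` and the example logits are made up.

```python
import math


def cross_entropy(logits: list[list[float]], targets: list[int]) -> float:
    """Mean negative log-probability of the target classes under softmax."""
    total = 0.0
    for row, target in zip(logits, targets):
        # log(sum(exp(row))), computed stably by subtracting the maximum
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(z - m) for z in row))
        # -log softmax(row)[target]
        total += log_sum_exp - row[target]
    return total / len(targets)


# A two-class model that gives both classes the same score.
uniform = cross_entropy([[0.0, 0.0], [0.0, 0.0]], [0, 1])

# A two-class model that strongly prefers the correct class.
confident = cross_entropy([[4.0, 0.0], [0.0, 4.0]], [0, 1])

print(f"{uniform:.4f} {confident:.4f}")
```

With equal scores for two classes the loss is $\ln 2 \approx 0.6931$, which is close to the first-epoch losses printed in this lab; this is what one would expect from an untrained two-class model.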

### 6. Epoch Summary
- After processing all batches, the average loss for the epoch is printed as:
  ```python
  print(f"Epoch {epoch}, loss: {running_loss / len(data_loader):.4f}")
  ```
- This helps monitor training progress.

### 7. Final Return Values
- After training, the function returns the trained `processor` and `model`, which can be used for inference or further evaluation.

This training loop follows a standard supervised learning workflow in PyTorch, iterating through multiple epochs to minimize classification error and improve model performance.


## Solution for step 2

In [20]:
import torch
import torch.nn.functional as F

def train(
    filename: str = "reviews-train.txt",
    learning_rate: float = 0.001,
    batch_size: int = 16,
    num_epochs: int = 100,
    input_dim: int = 1024,
    hidden_dim: int = 64,
):
    dataset = ReviewDataset(filename, label=1)
    processor = ReviewVectorizer(dataset, input_dim)
    model = Classifier(input_dim, hidden_dim, len(processor.l2i))
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    data_loader = torch.utils.data.DataLoader(
        dataset,
        batch_size=batch_size,
        shuffle=True,
        collate_fn=processor,
    )
    
    for epoch in range(num_epochs):
        model.train()
        running_loss = 0.0
        for bx, by in data_loader:
            optimizer.zero_grad()
            output = model(bx)
            loss = F.cross_entropy(output, by)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        
        print(f"Epoch {epoch + 1}, Loss: {running_loss / len(data_loader):.4f}")
    
    return processor, model


###### Feedback: The comments for the training loop are missing.

###### Solution: Have completed this.

#### 🎈 Task 1.10: Training the classifier

Adapt the next cell to train the classifier for the two prediction tasks. Based on the loss values, which task appears to be the harder one? What is the purpose of setting a seed?

In [21]:
torch.manual_seed(42)
vectorizer, model = train()

Epoch 1, Loss: 0.6932
Epoch 2, Loss: 0.6878
Epoch 3, Loss: 0.6784
Epoch 4, Loss: 0.6719
Epoch 5, Loss: 0.6615
Epoch 6, Loss: 0.6488
Epoch 7, Loss: 0.6327
Epoch 8, Loss: 0.6122
Epoch 9, Loss: 0.5928
Epoch 10, Loss: 0.5724
Epoch 11, Loss: 0.5532
Epoch 12, Loss: 0.5334
Epoch 13, Loss: 0.5134
Epoch 14, Loss: 0.4928
Epoch 15, Loss: 0.4726
Epoch 16, Loss: 0.4564
Epoch 17, Loss: 0.4424
Epoch 18, Loss: 0.4286
Epoch 19, Loss: 0.4135
Epoch 20, Loss: 0.4016
Epoch 21, Loss: 0.3921
Epoch 22, Loss: 0.3804
Epoch 23, Loss: 0.3690
Epoch 24, Loss: 0.3612
Epoch 25, Loss: 0.3550
Epoch 26, Loss: 0.3437
Epoch 27, Loss: 0.3353
Epoch 28, Loss: 0.3294
Epoch 29, Loss: 0.3198
Epoch 30, Loss: 0.3149
Epoch 31, Loss: 0.3082
Epoch 32, Loss: 0.3022
Epoch 33, Loss: 0.2955
Epoch 34, Loss: 0.2879
Epoch 35, Loss: 0.2832
Epoch 36, Loss: 0.2774
Epoch 37, Loss: 0.2736
Epoch 38, Loss: 0.2698
Epoch 39, Loss: 0.2643
Epoch 40, Loss: 0.2604
Epoch 41, Loss: 0.2550
Epoch 42, Loss: 0.2483
Epoch 43, Loss: 0.2444
Epoch 44, Loss: 0.24

#### Task 1.10 Solution
Based on the loss values, sentiment classification appears to be the harder task in this setup:
 - Topic (`label=0`): loss after 10 epochs = 0.223 (started at 0.6838)
 - Sentiment (`label=1`): loss after 10 epochs = 0.572 (started at 0.6932)
 - The sentiment classifier retains a ~2.5× higher loss after 10 epochs
 
Setting the seed with `torch.manual_seed(42)` makes PyTorch's random number generator produce the same sequence of random numbers every time the code is executed. Weight initialisation and batch shuffling are therefore identical across runs, so the results are reproducible and the two tasks can be compared under the same conditions.
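
The effect of a seed can be illustrated with Python's standard `random` module. This is an analogy rather than PyTorch code: the hypothetical helper `sample_weights` stands in for random weight initialisation, and `torch.manual_seed` plays the corresponding role for PyTorch's own generator.

```python
import random


def sample_weights(seed: int, n: int = 3) -> list[float]:
    """Draw n pseudo-random 'weights' from N(0, 1) using a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


run_a = sample_weights(42)
run_b = sample_weights(42)  # same seed: identical values
run_c = sample_weights(7)   # different seed: different values

print(run_a == run_b)
print(run_a == run_c)
```

Two runs with the same seed start from the same values, so any difference in the final loss comes from the change being studied and not from chance.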

### Inspecting the embeddings

Now that you have trained the classifier on two separate prediction tasks, it is interesting to inspect and compare the embedding vectors it learned in the process. For this you will use an online tool called the [Embedding Projector](http://projector.tensorflow.org). The next cell contains code to save the embeddings from a trained classifier in a format that can be loaded into this tool.

In [22]:
def save_embeddings(
    vectorizer: ReviewVectorizer,
    model: Classifier,
    vectors_filename: str,
    metadata_filename: str,
):
    i2t = {i: t for t, i in vectorizer.t2i.items()}
    embeddings = model.embedding.weight.detach().numpy()
    items = [(i2t[i], e) for i, e in enumerate(embeddings)]
    with open(vectors_filename, "wt") as f1, open(metadata_filename, "wt") as f2:
        for w, e in items:
            print("\t".join("{:.5f}".format(x) for x in e), file=f1)
            print(w, file=f2)

Call this code as follows:

In [23]:
save_embeddings(vectorizer, model, "vectors.tsv", "metadata.tsv")

In [24]:
save_embeddings(vectorizer, model, "vectors(label1feb15).tsv", "metadata(label1feb15).tsv")

#### 🎓 Task 1.11: Inspecting the embeddings

Load the embeddings from the two classification tasks (topic classification and sentiment classification) into the Embedding Projector web app and inspect the vector spaces. How do they compare visually? Does the visualisation make sense to you?

The Embedding Projector offers visualisations based on three dimensionality reduction methods: [UMAP](https://umap-learn.readthedocs.io/en/latest/), [T-SNE](https://en.m.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding), and [PCA](https://en.m.wikipedia.org/wiki/Principal_component_analysis). Which of these seems most useful to you?

Focus on the embeddings for the words *repair* and *sturdy*. Are they close to each other or far away from one another? What happens if you switch to the other task? How do you explain that?

### Solution 1.11

Both classifiers produced meaningful results, with words spread out across the space in all three projections. Words with high cosine similarity were clustered together and had shorter Euclidean distances between them.

PCA is a linear projection: it shows the directions of greatest variance and thus gives a global view of the data structure. UMAP and T-SNE are non-linear methods that aim to preserve local neighbourhoods, and in our plots they produced tighter, more clearly separated clusters than PCA.

In topic classification, the embeddings of "repair" and "sturdy" were positioned closer together because they are contextually related in product reviews, often appearing in discussions about durability and maintenance.

In sentiment classification, however, "repair" and "sturdy" were placed farther apart since they represent different sentiment polarities. "Repair" is more likely associated with negative or neutral sentiments, while "sturdy" is often linked to positive sentiment, leading to their separation in the vector space.
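
The two measures mentioned in this solution, cosine similarity and Euclidean distance, can be computed as in the following sketch. The three-dimensional vectors are made up for illustration and are not taken from the trained embeddings.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between u and v (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u: list[float], v: list[float]) -> float:
    """Straight-line distance between u and v."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


a = [1.0, 2.0, 0.5]
b = [2.0, 4.0, 1.0]   # same direction as a, but twice as long
c = [-1.0, 0.5, 2.0]  # points in a different direction

print(cosine_similarity(a, b), euclidean_distance(a, b))
print(cosine_similarity(a, c), euclidean_distance(a, c))
```

Cosine similarity ignores vector length, so `a` and `b` are maximally similar by that measure even though their Euclidean distance is not zero. The two measures often agree on which words are neighbours, but they do not have to.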

### Initialisation of embedding layers

The error surfaces explored when training neural networks can be very complex. Because of this, it is crucial to choose “good” initial values for the parameters. In the final task of this lab, you will run a small experiment to see how alternative initialisations can affect a model’s performance.

In PyTorch, the weights of the embedding layer are initially set by sampling from the standard normal distribution, $\mathcal{N}(0, 1)$. However, research suggests other approaches may work better. For example, you have seen that embedding layers share similarities with linear layers, so it makes sense to use the same initialisation method for both. The default initialisation method for linear layers in PyTorch is the so-called Kaiming initialisation, introduced by [He et al. (2015)](https://www.cv-foundation.org/openaccess/content_iccv_2015/papers/He_Delving_Deep_into_ICCV_2015_paper.pdf).

#### 🎈 Task 1.12: Initialisation of embedding layers

Check the [source code of `nn.Linear`](https://pytorch.org/docs/stable/_modules/torch/nn/modules/linear.html#Linear) to see how PyTorch initialises the weights of linear layers using the Kaiming initialisation method. Apply the same method to the embedding layer of your classifier and see how this affects the loss of your model and the vector spaces.

In [27]:
import torch.nn as nn
import torch.nn.init as init
import math

class Classifier(nn.Module):
    def __init__(self, num_embeddings, embedding_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(num_embeddings, embedding_dim)
        self.linear = nn.Linear(embedding_dim, num_classes)
        # Apply Kaiming initialization to the embedding layer weights,
        # with the same arguments that nn.Linear uses for its weight matrix
        init.kaiming_uniform_(self.embedding.weight, a=math.sqrt(5))


    def forward(self, x):
        return self.linear(self.embedding(x).mean(dim=-2))

In [28]:
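# Compare training with default vs. Kaiming initialization.
# NOTE: both calls below construct whichever `Classifier` is currently defined,
# so within one cell they use the same initialization. For a real comparison,
# run train() once before and once after redefining `Classifier`, resetting
# the seed before each run.
print("Default Initialization:")
_, _ = train()

print("\nKaiming Initialization:")
_, _ = train()  # uses Kaiming only if the redefined Classifier is in scope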

Default Initialization:
Epoch 1, Loss: 0.6934
Epoch 2, Loss: 0.6821
Epoch 3, Loss: 0.6622
Epoch 4, Loss: 0.6278
Epoch 5, Loss: 0.5816
Epoch 6, Loss: 0.5352
Epoch 7, Loss: 0.4911
Epoch 8, Loss: 0.4530
Epoch 9, Loss: 0.4237
Epoch 10, Loss: 0.3966
Epoch 11, Loss: 0.3732
Epoch 12, Loss: 0.3535
Epoch 13, Loss: 0.3347
Epoch 14, Loss: 0.3225
Epoch 15, Loss: 0.3102
Epoch 16, Loss: 0.2939
Epoch 17, Loss: 0.2811
Epoch 18, Loss: 0.2727
Epoch 19, Loss: 0.2606
Epoch 20, Loss: 0.2526
Epoch 21, Loss: 0.2447
Epoch 22, Loss: 0.2391
Epoch 23, Loss: 0.2296
Epoch 24, Loss: 0.2219
Epoch 25, Loss: 0.2146
Epoch 26, Loss: 0.2102
Epoch 27, Loss: 0.2026
Epoch 28, Loss: 0.1961
Epoch 29, Loss: 0.1907
Epoch 30, Loss: 0.1843
Epoch 31, Loss: 0.1795
Epoch 32, Loss: 0.1760
Epoch 33, Loss: 0.1737
Epoch 34, Loss: 0.1686
Epoch 35, Loss: 0.1608
Epoch 36, Loss: 0.1579
Epoch 37, Loss: 0.1530
Epoch 38, Loss: 0.1505
Epoch 39, Loss: 0.1449
Epoch 40, Loss: 0.1455
Epoch 41, Loss: 0.1397
Epoch 42, Loss: 0.1360
Epoch 43, Loss: 0.1

### 1.12 Solution

Comparing the loss values between Default and Kaiming initialization, we see that both methods follow a similar downward trend, but Kaiming initialization consistently results in slightly lower loss values after each epoch.
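
To see how different the two starting points are, the range used by the Kaiming call above can be worked out by hand. The sketch below reflects my reading of how `kaiming_uniform_` derives its bound with its default settings (fan-in mode, leaky-ReLU gain), and assumes that for a two-dimensional weight the fan-in is the size of the second dimension; the helper name `kaiming_uniform_bound` is made up.

```python
import math


def kaiming_uniform_bound(fan_in: int, a: float) -> float:
    """Bound b such that weights are drawn from U(-b, b)."""
    gain = math.sqrt(2.0 / (1.0 + a * a))  # gain for leaky ReLU with slope a
    std = gain / math.sqrt(fan_in)         # target standard deviation
    return math.sqrt(3.0) * std            # U(-b, b) has std b / sqrt(3)


embedding_dim = 64  # second dimension of the embedding matrix in this lab
bound = kaiming_uniform_bound(fan_in=embedding_dim, a=math.sqrt(5))
uniform_std = bound / math.sqrt(3.0)

print(f"bound = {bound:.4f}, std = {uniform_std:.4f}")
```

With `a = sqrt(5)` the bound simplifies to $1 / \sqrt{\text{fan\_in}}$, so under these assumptions the embedding weights start in $[-0.125, 0.125]$ with a standard deviation of roughly 0.07, compared with a standard deviation of 1 under the default $\mathcal{N}(0, 1)$ initialisation.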


**🥳 Congratulations on finishing lab&nbsp;1!**