## Introduction to Natural Language Processing
[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>
Date: **SS 2025**

# 8. Word Embeddings (Homework)

**Learning Goals:**

* Implement multi-class classification with a multi-layer perceptron
* Understand PyTorch tensors
* Load pre-trained word embeddings and use them in a neural network
* Perform stance classification

But first, some imports.

In [None]:
!pip install numpy==1.23.5  # gensim compatibility
!pip install -U datasets
!pip install nltk
!pip install --upgrade gensim

import csv
import os
import random

import scipy.stats  # explicit submodule import so that scipy.stats.pearsonr is available
import gensim.downloader as api

import torch
import torch.nn as nn
from torch.nn import CosineSimilarity
from torch.utils.data import Dataset, DataLoader

from datasets import load_dataset

import nltk
nltk.download("punkt_tab")
from nltk.tokenize import sent_tokenize, word_tokenize

device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
print("Computing on:", device)

## Loading Pre-trained Word Embeddings

First, we will load pre-trained word embeddings via the [gensim](https://radimrehurek.com/gensim/) library. We use this library so that we do not have to download the files manually, but in theory, you can do that as well. You can find pre-trained word embeddings in many places on the web, e.g., [here](https://wikipedia2vec.github.io/wikipedia2vec/pretrained/).

The next code cell will download the 300-dimensional word2vec vectors pre-trained on Google News (be patient, this will take around 5-10 minutes).

In [None]:
word_vectors = api.load("word2vec-google-news-300")

The `gensim` library provides several functionalities for querying the word embeddings (called _keyed vectors_). Check how to use the function `similarity` on [this page](https://radimrehurek.com/gensim/models/keyedvectors.html).

❓Repeat the exercise on word similarity ratings using `wordsim_relatedness_goldstandard.txt`. Compute the system ratings for all word pairs using this function. How does your result compare to your earlier results using LCH and PMI?

In [None]:
def compute_correlation(human_ratings, system_ratings):
  """ Input: two lists (of equal length) with numeric values.
  Computes Pearson's correlation coefficient.
  """
  assert len(human_ratings) == len(system_ratings)  # both lists must have the same length
  return scipy.stats.pearsonr(human_ratings, system_ratings)


In [None]:
# Compute similarities of word embeddings using cosine
# Check the "most similar words", using the default "cosine similarity" measure.

# A list of instances to read the tab-separated value strings into
instances = []
# A list of only the word pairs of the instances (useful later)
wordPairs = []
# A list of only the (float) scores of the instances (useful for min, max, mean)
scores = []

with open('/content/wordsim_relatedness_goldstandard.csv', newline='') as csvfile:
    filereader = csv.reader(csvfile, delimiter='\t', quotechar='|')


    for line in filereader:
        instances.append(line)
        wordPairs.append((line[0], line[1]))
        scores.append(float(line[2]))

# A list of computed similarities
sims = []

for wp in wordPairs:
    sim = word_vectors.similarity(wp[0], wp[1])
    print(f"{wp[0]} - {wp[1]} = {sim:.4f}")
    sims.append(sim)

print(f"{compute_correlation(scores, sims)}")

If your implementation is correct, the output should be:

```
PearsonRResult(statistic=0.5920509820875375, pvalue=3.143384293094993e-25)
```

This correlation (about 0.59) is pretty high!
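
To make the reported statistic more concrete, here is a minimal sketch of Pearson's correlation coefficient in plain Python. It is an illustration only: `pearson_r` and the toy ratings are hypothetical names and values, not part of the exercise data, and `scipy.stats.pearsonr` additionally returns a p-value that this sketch omits.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equally long lists of numbers."""
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: how much the two lists vary together.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator: how much each list varies on its own.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical toy ratings (not taken from the gold standard file).
toy_human = [1.0, 2.0, 3.0, 4.0]
toy_system = [0.1, 0.2, 0.3, 0.4]
print(pearson_r(toy_human, toy_system))  # close to 1.0: the ratings rise together
```

Note that the system ratings do not need to be on the same scale as the human ratings; only the linear relationship matters.
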
<br/>
<br/>


## PyTorch Tensors

❓ In order to understand the following code and solve the following exercises, you need to have a basic understanding of PyTorch tensors. Work through the following tutorial and note down the most important facts about tensors here in this notebook (with code examples).

[PyTorch.org Tutorial on Tensors](https://pytorch.org/tutorials/beginner/introyt/tensors_deeper_tutorial.html)

_Your text here_

In [None]:
# Your code here


### Initializing a PyTorch Embedding Layer With Pretrained Embeddings

The code in the next cell takes word2vec's weight matrix from gensim and initializes an [Embedding](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html) layer in PyTorch. This is a layer that can be queried using indices and that returns the word embeddings (or rather, passes them on to the next layer).

Word embeddings form the first layer of a neural network and can either be "frozen" (i.e., we optimize only the rest of the model's parameters) or optimized further during training.

In [None]:
# Create a float tensor from the word vectors
weights = torch.FloatTensor(word_vectors.vectors) # two-dimensional matrix. Rows = vocabulary items, columns = dimensions of word embedding.
embedding = nn.Embedding.from_pretrained(weights)

# Query embedding layer for a specific word embedding
print("The 1000th word of the vocabulary:", word_vectors.index_to_key[999])
input = torch.LongTensor([999]) # index 999 = the 1000th word in the vocabulary

print(embedding(input).shape)
# print(embedding(input)) # uncomment this to look at the tensor

You may wonder about the extra dimension `1` above. Keep in mind that the embedding layer contains an embedding tensor (matrix) of size [vocab_size, 300], where 300 is the dimensionality of the word embeddings.
When selecting a row from this matrix (the vector for the word "raised" that we queried above), PyTorch keeps this extra dimension because we might want to retrieve more than one vector at a time. For example, we could retrieve the embeddings of two words, which are returned as a new matrix with dimensions [2, 300], i.e., the first dimension corresponds to the number of inputs.
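
The shape behaviour described above can be mimicked with plain Python lists. This is only a sketch under assumptions: `toy_table`, `lookup`, and `shape` are hypothetical stand-ins for the embedding matrix, the embedding layer, and a tensor's `.shape`, and the table has 3 dimensions instead of 300.

```python
# Hypothetical toy embedding table: 4 vocabulary items, 3 dimensions each.
toy_table = [
    [0.0, 0.1, 0.2],
    [1.0, 1.1, 1.2],
    [2.0, 2.1, 2.2],
    [3.0, 3.1, 3.2],
]

def lookup(table, indices):
    """Return one row per index, so the result always has len(indices) rows."""
    return [table[i] for i in indices]

def shape(matrix):
    """(number of rows, number of columns) of a list of lists."""
    return (len(matrix), len(matrix[0]))

one = lookup(toy_table, [2])     # a single index still yields a matrix with one row
two = lookup(toy_table, [1, 3])  # two indices yield two rows
print(shape(one), shape(two))    # (1, 3) (2, 3)
print(one[0])                    # dropping the outer list gives the plain vector
```
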

If you haven't checked it out yet, now is the time to work through [Towards DataScience: Understanding Dimensions in PyTorch](https://towardsdatascience.com/understanding-dimensions-in-pytorch-6edf9972d3be) unless you are already familiar with PyTorch.

In [None]:
input = torch.LongTensor([77, 88])
print(embedding(input).shape)

If we want to get rid of this extra dimension and obtain a plain one-dimensional vector, we can use `squeeze`.
(Note: we did that already in 07_Neural_Networks.ipynb when we removed the batch dimension. There, we collected inputs into batches, and the model returned the predictions similarly packed into batches. We used `squeeze` to create a simple vector from these probability scores predicted by the binary classifier, as our gold labels happened to have this format, and the loss function expected this format.)

In [None]:
input = torch.LongTensor([5]) # index 5 = the 6th word in the vocabulary
emb_5 = embedding(input)
print(emb_5.shape) # this has an "outer" tensor dimension
emb_5 = emb_5.squeeze(0) # removed the first dimension (~ the outer list)
print(emb_5.shape) # this is just a vector

Gensim's `word_vectors` object provides two lookups: the dictionary `key_to_index`, which maps a word to its row index, and the list `index_to_key`, which maps a row index to the corresponding word. We use the latter to display some words of word2vec's vocabulary:
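
How the two lookups relate can be sketched with a toy vocabulary. The four words below are hypothetical and only mirror the structure of the real lookups; they are not taken from word2vec's actual ordering.

```python
# Hypothetical toy vocabulary in storage order: row index -> word.
index_to_key = ["the", "The", "car", "bike"]
# The inverse mapping: word -> row index.
key_to_index = {word: i for i, word in enumerate(index_to_key)}

row = key_to_index["car"]
print(row)                     # 2
print(index_to_key[row])       # car (the round trip returns the original word)
print("Car" in key_to_index)   # False: lookups are case-sensitive
```
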

In [None]:
vocab = ""
for i in range(0,30):
  vocab += word_vectors.index_to_key[i] + " "
vocab += "\n"
for i in range(500,520):
  vocab += word_vectors.index_to_key[i] + " "
print(vocab)

From this, we can tell that the vocabulary has been neither lemmatized nor lowercased (e.g., "doing" and "known" are not lemmas, and "The" is uppercase). Hence, if we want to work with these embeddings, standard tokenization should do the trick. (❗This is not always the case, vocabularies differ in their preprocessing!)

The `similarity` function of the gensim library that we have used above computes the cosine between the word vectors. We can compute cosine similarity using PyTorch as well. Let's do that in order to get a bit more familiar with tensors in PyTorch.
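
As a reminder of what both frameworks compute, here is a minimal plain-Python sketch of cosine similarity: the dot product of two vectors divided by the product of their lengths. The function name `cosine` and the three toy vectors are hypothetical and have nothing to do with the real word2vec vectors.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equally long vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors.
toy_car = [1.0, 2.0, 0.0]
toy_bike = [2.0, 4.0, 0.0]   # same direction as toy_car, twice as long
toy_tree = [0.0, 0.0, 3.0]   # orthogonal to both

print(cosine(toy_car, toy_bike))  # close to 1.0: vector length does not matter
print(cosine(toy_car, toy_tree))  # 0.0: no shared direction
```
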

❓The code below compares the similarity scores returned by the two frameworks for "bike" and "car". Add the comparisons "motorbike-bike" and "motorbike-car".

In [None]:
cos = nn.CosineSimilarity(dim=1, eps=1e-6)

idx_bike = word_vectors.key_to_index["bike"]
idx_car = word_vectors.key_to_index["car"]

emb_car = embedding(torch.LongTensor([idx_car]))
emb_bike = embedding(torch.LongTensor([idx_bike]))

print("car-bike:", cos(emb_car, emb_bike), word_vectors.similarity("car", "bike"))

# Compare motorbike-bike and motorbike-car.
idx_motorbike = word_vectors.key_to_index["motorbike"]
emb_motorbike = embedding(torch.LongTensor([idx_motorbike]))

print("motorbike-bike:", cos(emb_motorbike, emb_bike), word_vectors.similarity("motorbike", "bike"))
print("motorbike-car:", cos(emb_motorbike, emb_car), word_vectors.similarity("motorbike", "car"))


Same results! The difference is the data type: gensim returns simple floats, PyTorch returns tensors (even if the tensor has only one dimension and a single value here).

### Load the TweetEval Dataset

We will now work with the [TweetEval](https://huggingface.co/datasets/tweet_eval) dataset.
Your task is to train a classifier using word embeddings for predicting the _stance_ of a tweet towards "climate change":

* neutral (0)
* against (1) - This label does not mean that the person is against climate change, but that they do not believe that climate change is happening.
* favor (2) - This label does not mean that the person is in favor of climate change, but that they believe that climate change is real and that we need to do something about it.

In [None]:
train_data = load_dataset("tweet_eval", "stance_climate", split="train")
val_data = load_dataset("tweet_eval", "stance_climate", split="validation")
test_data = load_dataset("tweet_eval", "stance_climate", split="test")

def simpler_datastructure(data_set):
  # Returns a simpler data structure (easier to work with for Python beginners)
  new_data_set = [] # list of instances
  for inst in data_set:
    new_inst = {}
    for feature in inst:
      new_inst[feature] = inst[feature]
    new_data_set.append(new_inst)
  return new_data_set

train_data = simpler_datastructure(train_data)
val_data = simpler_datastructure(val_data)
test_data = simpler_datastructure(test_data)


In [None]:
# Let's look at the data
print(train_data[2])
print(train_data[80])
print(train_data[1])

❓Use nltk's `sent_tokenize` and `word_tokenize` functions to tokenize the input texts. Add two lists to the dictionary that represents an instance: (1) a list of "tokens", and (2) a list of "token_ids" that corresponds to the word2vec token ids. If the model's vocabulary does not contain a word, you can simply skip the word. (Hint: Some models also provide a particular OOV or UNK (for unknown) token.) The function should return the modified `data_set` and the number of tokens (that occur in the vocabulary) of the longest instance.
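
The hint about an UNK token can be illustrated on a toy vocabulary. The sketch below rests on assumptions: `toy_key2idx`, its `<UNK>` entry, and `to_ids_with_unk` are hypothetical, and whitespace splitting stands in for a real tokenizer. It shows the alternative to skipping unknown words, not the function the exercise asks for.

```python
# Hypothetical toy vocabulary with a dedicated unknown-word entry at index 0.
toy_key2idx = {"<UNK>": 0, "climate": 1, "change": 2, "is": 3, "real": 4}

def to_ids_with_unk(tokens, key2idx, unk_token="<UNK>"):
    """Map each token to its id; out-of-vocabulary tokens get the UNK id."""
    unk_id = key2idx[unk_token]
    return [key2idx.get(tok, unk_id) for tok in tokens]

# Whitespace splitting stands in for a real tokenizer here.
tokens = "climate change is sooo real".split()
print(to_ids_with_unk(tokens, toy_key2idx))  # [1, 2, 3, 0, 4]
```

With an UNK token, the id list stays as long as the token list; when unknown words are skipped instead, it becomes shorter.
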

In [None]:
# Your code here
def tokenize(data_set, key2idx):
  pass
  # Your code here


train_data, max_len_train = tokenize(train_data, word_vectors.key_to_index)
val_data, max_len_val = tokenize(val_data, word_vectors.key_to_index)
test_data, max_len_test = tokenize(test_data, word_vectors.key_to_index)

print(train_data[2]["tokens"])
print(train_data[2]["token_ids"])
print()

print(train_data[80]["tokens"])
print(train_data[80]["token_ids"])
print()

print(train_data[1]["tokens"])
print(train_data[1]["token_ids"])
print()

print("Maximum Number of tokens per instance:")
print("- train:", max_len_train)
print("- val:  ", max_len_val)
print("- test: ", max_len_test)
# Training data has the longest max length

*Self-check:* The output should be:


```
['It', "'s", 'nights', 'like', 'this', 'when', 'I', "'m", 'not', 'so', 'fond', 'of', 'my', 'long', 'hair', '.', 'I', 'just', 'wan', 'na', 'chop', 'it', 'all', 'off', '!', '#', 'heatwave', '#', 'pnwgirl', '#', 'SemST']
[51, 4374, 87, 28, 61, 20, 236, 13, 85, 18054, 126, 180, 3227, 20, 76, 91445, 34255, 22646, 15, 52, 104, 2992, 58827, 2992, 2992]

['tsgtalexander', ':', 'GlblWarmingNews', 'And', 'the', 'tooth', 'fairy', 'might', 'be', 'causing', 'kids', 'to', 'lose', 'teeth', '!', '#', 'carbontaxscam', '#', 'Chemtrails', '#', 'SemST']
[169, 11, 17803, 46797, 327, 16, 2733, 809, 1334, 6969, 2992, 2992, 773358, 2992]

['We', 'support', 'Australia', "'s", 'Climate', 'Roundtable', 'which', 'is', 'providing', 'a', 'framework', 'for', 'sensible', 'debate', 'ahead', 'of', 'Paris', '@', 'user', '#', 'SemST']
[62, 240, 904, 13464, 30588, 48, 4, 1109, 5600, 2, 11603, 1539, 619, 2575, 3824, 1928, 2992]

Maximum Number of tokens per instance:
- train: 43
- val:   33
- test:  37
```


### Padding
Not all the "token_ids" lists are of equal length. However, when operating with tensors, all inputs that are stacked into one tensor (and likewise the tensors at every step inside the model) must have exactly the same shape. But our inputs just aren't of equal length! What can we do?

We will use a concept called __padding__: we will simply fill up all the "token_ids" lists with zeros until they all have length `max_len`.
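
The idea can be sketched on a single list. The helper `pad_ids` and the toy batch are hypothetical; the sketch pads each list on its own, whereas the exercise works on whole data splits.

```python
def pad_ids(token_ids, max_len, pad_id=0):
    """Return a new list of length max_len, filled up with pad_id on the right."""
    if len(token_ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    return token_ids + [pad_id] * (max_len - len(token_ids))

# Hypothetical toy batch with id lists of different lengths.
toy_batch = [[51, 4374, 87], [169, 11], [62]]
longest = max(len(ids) for ids in toy_batch)
padded = [pad_ids(ids, longest) for ids in toy_batch]
print(padded)  # [[51, 4374, 87], [169, 11, 0], [62, 0, 0]]
```

After padding, every list has the same length, so the batch can be stored as one rectangular tensor.
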

❓Implement a function `pad(data_set, max_len)` that fills up the "token_ids" list of each instance with zeros such that it has a length of `max_len`. Use the training data's maximum length to pad all three datasplits.

❓Advanced, optional: The padding index 0 is not really a special token in word2vec; it is simply the first token in the vocabulary. We use it here for simplicity. Ideally, we'd use a mask that tells our model which entries are real tokens and which are padding. We'll learn about this later, but if you are already advanced in deep learning, read up on masks and implement a mask that indicates where a real token resides in our input. You will also need to modify the forward pass below (the input will have several dimensions, and the mask needs to be applied). But this is advanced, and our example today will also run with our simple heuristic. ;)
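
What a mask does can be sketched in plain Python before moving to tensors. All names here (`pad_with_mask`, `masked_sum`, `toy_table`) are hypothetical, and the sketch handles one instance with lists rather than a batch with tensors, so it is a conceptual illustration and not the modified forward pass.

```python
def pad_with_mask(token_ids, max_len, pad_id=0):
    """Pad on the right and return a mask: 1 for real tokens, 0 for padding."""
    n_real = len(token_ids)
    padded = token_ids + [pad_id] * (max_len - n_real)
    mask = [1] * n_real + [0] * (max_len - n_real)
    return padded, mask

def masked_sum(vectors, mask):
    """Sum only the vectors whose mask entry is 1."""
    total = [0.0] * len(vectors[0])
    for vec, keep in zip(vectors, mask):
        if keep:
            total = [t + v for t, v in zip(total, vec)]
    return total

# Hypothetical toy table: row 0 is an ordinary word that doubles as padding.
toy_table = [[9.0, 9.0], [1.0, 0.0], [0.0, 2.0]]
padded, mask = pad_with_mask([1, 2], max_len=4)
vectors = [toy_table[i] for i in padded]
print(mask)                       # [1, 1, 0, 0]
print(masked_sum(vectors, mask))  # [1.0, 2.0]: the padding rows are ignored
```

The mask is built from the original length rather than from the id values, so a real token with id 0 would still count as a real token.
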

In [None]:
def pad(data_set, max_len):
  # Your code here
  pass


train_data = pad(train_data, max_len_train)
val_data = pad(val_data, max_len_train)
test_data = pad(test_data, max_len_train)

print(train_data[80]["token_ids"])
print(train_data[1]["token_ids"])
print(train_data[2]["token_ids"])


*Self-check:* The output should be:

```
[169, 11, 17803, 46797, 327, 16, 2733, 809, 1334, 6969, 2992, 2992, 773358, 2992, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[62, 240, 904, 13464, 30588, 48, 4, 1109, 5600, 2, 11603, 1539, 619, 2575, 3824, 1928, 2992, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[51, 4374, 87, 28, 61, 20, 236, 13, 85, 18054, 126, 180, 3227, 20, 76, 91445, 34255, 22646, 15, 52, 104, 2992, 58827, 2992, 2992, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
```



❓ Write a class `TweetEvalDataset` that is a subclass of `torch.utils.data.Dataset` and converts these "token_ids" lists to tensors. Hint: Our inputs X and labels y are now just long integers, like the indices that we used in the example above to retrieve items from the embeddings, so we should use that `dtype` here, too. Use `dtype=torch.long` when you create the tensors.

In [None]:
class TweetEvalDataset(Dataset):
  """
  This is a custom dataset class that we need to write for our specific dataset.
  """
  def __init__(self, data_set):
    # Your code here
    pass

  def __len__(self):
    # Your code here
    pass

  def __getitem__(self, index):
    # This returns an instance for a particular index.
    # Your code here
    pass

torch_data_train = TweetEvalDataset(train_data)
torch_data_val = TweetEvalDataset(val_data)
torch_data_test = TweetEvalDataset(test_data)

In [None]:
import os
import random

import numpy as np  # needed for np.random.seed below

# Always fun with the random seeds ...
# We need to set them such that our results will be replicable.
# (Hint: for an experiment later, you can change the random seed here and check what happens.
# But for now, let's keep the answer to all questions of the universe, 42.)
seed=42
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
if torch.cuda.is_available():
  # This is needed on Colab as we are working in a distributed environment
  # If you are working in a different GPU environment, you can probably omit this line if it results in errors.
  os.environ["CUBLAS_WORKSPACE_CONFIG"]=":4096:8"

# Should we still have some source for non-determinism in our code, this will complain:
torch.use_deterministic_algorithms(True)

############
# MODEL    #
############

"""
Today, we see a different (more flexible) way of defining PyTorch models.
The __init__ method creates the various layers the model will have.
In the forward method, we define how the input x is passed through the layers.
This allows for more flexible neural network architectures than Sequential.
However, we must also be careful that the dimensions of the various inputs and
outputs of the layers match. During development, it's a good idea to inspect
the shapes of the tensors to identify potential bugs.
"""

class MyMLP(torch.nn.Module):

  def __init__(self, weights, max_len, emb_size, num_classes):
    # max_len is the number of token_ids per instance (not used by this model)
    super(MyMLP, self).__init__()
    self.embedding = nn.Embedding.from_pretrained(weights)
    hidden_size1 = 50
    self.linear1 = torch.nn.Linear(emb_size, hidden_size1)
    self.activation = torch.nn.ReLU()
    self.linear2 = torch.nn.Linear(hidden_size1, num_classes)
    # no softmax here as it's included in the implementation of the loss!

  def forward(self, x):
    #print("input:", x.shape)
    x = self.embedding(x) # obtain embeddings for input_ids
    #print("embeddings:", x.shape)
    x = torch.sum(x, dim=1) # Sum up the word vectors of all the input words
    #print("sum:", x.shape)
    x = torch.nn.functional.normalize(x) # Normalize the vector (otherwise longer inputs would differ from shorter inputs)
    #print("normalized:", x.shape)
    x = self.linear1(x) # hidden layer, reducing size
    x = self.activation(x)
    #print("activated:", x.shape)
    logits = self.linear2(x) # Classifier layer mapping to logits
    #print("logits:", logits.shape)
    # NO SOFTMAX HERE!! It's in the implementation of the loss (in PyTorch).
    return logits


weights = torch.FloatTensor(word_vectors.vectors)
model = MyMLP(weights, max_len_train, 300, 3)
print(model)

model = model.to(device)


#######################
# TRAINING PARAMETERS #
#######################

# Modify the training parameters here to experiment
num_epochs = 40
learning_rate = 1e-3
batch_size = 16

loss_fn = nn.CrossEntropyLoss() # this includes the softmax computation!
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


#######################
# DATA LOADERS        #
#######################
data_loader_train = DataLoader(torch_data_train, batch_size=batch_size, shuffle=True)
data_loader_dev = DataLoader(torch_data_val, batch_size=batch_size, shuffle=False)
data_loader_test = DataLoader(torch_data_test, batch_size=batch_size, shuffle=False)


#######################
# EVALUATION          #
#######################

def evaluate(model, data_loader):
  # Compute accuracy of model on data provided by data_loader
  correct = 0
  num_instances = len(data_loader.dataset)
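  with torch.no_grad(): # Disables gradient tracking: we are only evaluating, not training,
                        # so no gradients are stored for this block
    for X, y in data_loader:
      X, y = X.to(device), y.to(device) # no-op if the tensors already live on the model's device
      logits = model(X)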
      # predicted class: we do not need the softmax for prediction, just pick highest logit
      arg_maxs = torch.argmax(logits, dim=1) # argmax returns the class index, logits = [batch_dim, logits_per_inst]
                                             # argmax is applied to the second of these dimensions!
      # print(arg_maxs == y) # A vector where each dimension is True/False depending on whether the two tensors matched
      num_correct = torch.sum(arg_maxs == y).item() # sum up the "Trues" (where they were equal)
      correct += num_correct

  accuracy = 100 * correct / num_instances
  return accuracy


#######################
# TRAINING            #
#######################

for epoch in range(num_epochs):
  it = iter(data_loader_train)
  epoch_loss, steps = 0, 0

  for X, y in it:
    X, y = X.to(device), y.to(device) # no-op if the tensors already live on the model's device
    y_pred = model(X)
    loss = loss_fn(y_pred, y)   # Have the loss function compute the loss value
    optimizer.zero_grad()       # Reset the gradients (otherwise they accumulate across steps - would be wrong here)
    loss.backward()             # Compute the gradients (partial derivatives)
    optimizer.step()            # Update the network's weights
    epoch_loss += loss.item()   # Track the epoch's loss as a plain float (detached from the graph)
    steps += 1

  print("\nEpoch:", epoch+1, "    Loss: {:0.4f}".format(epoch_loss/steps))
  # evaluate model at end of epoch
  print("Training accuracy: {:2.1f}".format(evaluate(model, data_loader_train)))
  print("     Dev accuracy: {:2.1f}".format(evaluate(model, data_loader_dev)))


print("\n")
print("--"*50)
print("TRAINING DONE. Epochs trained:", epoch+1)

# Compute accuracy on test
print("\nTest accuracy: {:2.1f}".format(evaluate(model, data_loader_test)))

_Self-check:_ With the vanilla settings above, the code should output:


```
----------------------------------------------------------------------------------------------------
TRAINING DONE. Epochs trained: 40

Test accuracy: 76.3
```
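
The comments in the training cell note that the model returns logits without a softmax, because the loss includes it. The following plain-Python sketch shows why this works for a single instance: cross-entropy is the negative log of the softmax probability of the gold class, and the largest logit always belongs to the largest probability. The names `softmax`, `cross_entropy`, and `toy_logits` are hypothetical, and the sketch ignores the averaging over a batch that the PyTorch loss performs by default.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(logits, gold):
    """Negative log-probability that softmax assigns to the gold class."""
    return -math.log(softmax(logits)[gold])

# Hypothetical logits for the three stance classes of one tweet.
toy_logits = [2.0, 0.5, -1.0]
probs = softmax(toy_logits)

print(probs)
print(cross_entropy(toy_logits, 0))  # small loss: class 0 has the highest logit
print(cross_entropy(toy_logits, 2))  # larger loss: class 2 has the lowest logit
# Prediction does not need the softmax: the argmax is the same either way.
print(toy_logits.index(max(toy_logits)) == probs.index(max(probs)))  # True
```
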


❓Add one or two additional linear layers to `MyMLP` and try out different activation functions.

❓Try out different learning rates or batch sizes.

❓Optional: The gensim API can also load several other word embeddings. To see which, use:


```
print(api.info(name_only=True))
```
Experiment with several of those word embeddings. Do they work better/worse? Why?
