## Introduction to Natural Language Processing
[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>
Date: **SS 2025**

# 7. Neural Networks

__Learning Goals:__

* Learn some basics of PyTorch.
* Implement a multi-layer perceptron (MLP) in PyTorch.
* Implement early stopping.
* Experiment with the MLP on the SMS Spam dataset.

<br/>

In this notebook, we'll train a spam classifier using a simple neural network.
We will use the [SMS Spam dataset](https://huggingface.co/datasets/sms_spam) to train and evaluate our classifier.
This dataset consists of text messages labeled with whether they are spam or not (1=SPAM, 0=NO_SPAM).

```
1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

0 U dun say so early hor... U c already then say...

0 Nah I don't think he goes to usf, he lives around here though

1 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
```


When running on Colab, choose a GPU runtime:

```
Runtime --> Change Runtime --> GPU --> T4
```

If you're working locally, the notebook will probably still run fine this time. Training (or evaluating) neural networks is roughly 10 times faster on a GPU than on a CPU, though. The model in this notebook takes around 30 seconds to train with a GPU backend. On a CPU, that becomes around 300 seconds, which makes quite a difference in development time!

In [79]:
#  Imports
!pip install -U datasets # HuggingFace datasets
!pip install nltk

from datasets import load_dataset

import pprint as pp
from collections import defaultdict
import random
import os

import nltk
nltk.download('punkt_tab')
from nltk.tokenize import sent_tokenize, word_tokenize

import numpy as np

# Import PyTorch
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader

# Import time for measuring how long training took
import time



[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


### Dataset and Datasplits

Let's load the dataset and create a train, dev, and test split. You know this procedure!

In [80]:
# Load the dataset
sms_spam = load_dataset("sms_spam", split="train")

In [81]:
# Look at some instances
for i in range(10):
  label = sms_spam["label"][i]
  text = sms_spam["sms"][i]
  print(label, text)

0 Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...

0 Ok lar... Joking wif u oni...

1 Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's

0 U dun say so early hor... U c already then say...

0 Nah I don't think he goes to usf, he lives around here though

1 FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

0 Even my brother is not like to speak with me. They treat me like aids patent.

0 As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune

1 WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.

1 Had your mobile 

In [82]:
# Number of instances
print(len(sms_spam))

5574


In [83]:
# Let's define a training, a development, and a test set
#train_data, dev_data, test_data = {}, {}, {}
train_data = {"text":sms_spam["sms"][:2000]}
train_data["label"] = sms_spam["label"][:2000]
dev_data = {"text":sms_spam["sms"][4000:5000]}
dev_data["label"] = sms_spam["label"][4000:5000]
test_data ={"text":sms_spam["sms"][5000:]}
test_data["label"] = sms_spam["label"][5000:]


# Label distributions in these datasplits
num_spam_train = train_data["label"].count(1)
print("% spam in train:", round(100*num_spam_train/len(train_data["label"]), 1))
num_spam_dev = dev_data["label"].count(1)
print("% spam in dev:", round(100*num_spam_dev/len(dev_data["label"]), 1))
num_spam_test = test_data["label"].count(1)
print("% spam in test:", round(100*num_spam_test/len(test_data["label"]), 1))

% spam in train: 14.0
% spam in dev: 13.9
% spam in test: 12.9


### Vocabulary and Text "Embeddings"

We will represent each instance (text message) by its word counts, i.e., each dimension in the input vector corresponds to one word (token). As using the full vocabulary would result in too many dimensions, we will filter it first.
A numeric vector representation of a text is also called an __embedding__.
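
To make the idea of a count vector concrete, here is a minimal sketch on a toy example. The vocabulary, the pre-tokenized message, and the helper name `count_vector` are made up for illustration and are not part of the notebook's pipeline, which builds its vocabulary from the training data.

```python
# Toy vocabulary (hypothetical); sorting fixes which dimension belongs to which token
toy_vocab = sorted(["free", "win", "you", "!"])
toy_vocab2idx = {token: i for i, token in enumerate(toy_vocab)}

def count_vector(tokens, vocab2idx):
    """Return a list of word counts, one dimension per vocabulary entry."""
    x = [0] * len(vocab2idx)
    for token in tokens:
        if token in vocab2idx:  # tokens outside the vocabulary are ignored
            x[vocab2idx[token]] += 1
    return x

# An already tokenized toy message
toy_tokens = ["win", "a", "free", "prize", "!", "!"]
toy_x = count_vector(toy_tokens, toy_vocab2idx)

print(toy_vocab)
print(toy_x)
```

Tokens that are not in the vocabulary (here `a` and `prize`) simply do not show up in the vector. A message made up only of such tokens would be represented by all zeros.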

In [84]:
# Build vocabulary from training data
# We'll collect the most frequent tokens separately for each class (question: why does that make sense?)
complete_vocab = set()
for label in [0, 1]:
  vocab = defaultdict(int) # collect counts so we can filter the vocabulary
  for i, text in enumerate(train_data["text"]):
    if label == train_data["label"][i]:
      for sent in sent_tokenize(text):
        for token in word_tokenize(sent):
          vocab[token] += 1
  # Sort by frequency
  vocab_list = sorted([(count, token) for token, count in vocab.items()], reverse=True)
  print(len(vocab_list))
  print(vocab_list[:10])
  # Add the 50 most frequent words to the complete vocabulary (hint: experiment with larger vocabulary sizes!)
  complete_vocab.update(token for count, token in vocab_list[:50]) # a.update(b) adds the items of b to the set a in place

print(len(complete_vocab))
vocab = sorted(complete_vocab) # this order defines the dimensions of our input vectors
vocab2idx = {v:i for i, v in enumerate(vocab)} # obtain dimension of a word
vocab_size = len(vocab)

4854
[(1412, '.'), (701, 'I'), (638, 'you'), (564, 'to'), (560, ','), (503, '?'), (394, '...'), (364, 'the'), (349, 'a'), (331, 'i')]
2186
[(380, '.'), (232, 'to'), (203, '!'), (137, ','), (128, 'a'), (81, '2'), (75, '&'), (71, 'you'), (69, 'or'), (65, 'the')]
75


In [85]:
# Tokenize datasets
def tokenize_data(data_set):
  data_set["tokens"] = []
  for text in data_set["text"]:
    tokens = []
    for sent in sent_tokenize(text):
      tokens += list(word_tokenize(sent))
    data_set["tokens"].append(tokens)


tokenize_data(train_data)
tokenize_data(dev_data)
tokenize_data(test_data)

In [86]:
def vectorize_data(data_set, vocab2idx):
  count = 0 # number of instances whose vector contains only zeros
  vocab_size = len(vocab2idx)
  data_set["x"] = []
  for tokens in data_set["tokens"]:
    x = np.zeros(vocab_size)
    for token in tokens:
      if token in vocab2idx:
        x[vocab2idx[token]] += 1
    if np.sum(x) == 0:
      count += 1

    data_set["x"].append(x) # appending a numpy vector

  print(count) # this many instances are not representable - just 0s in the vector


vectorize_data(train_data, vocab2idx)
vectorize_data(dev_data, vocab2idx)
vectorize_data(test_data, vocab2idx)

26
15
7


## Multi-Layer Perceptron in PyTorch

Today, we will look at a code example of how to implement a multi-layer perceptron in [PyTorch](https://pytorch.org/), a deep learning library for Python that will help us to run our models on a GPU and that provides lots of pre-defined functionalities, e.g., neural network layers, optimizers, and loss functions.

### Dataset
First, we will map our SMS Spam dataset to a [torch.utils.data.Dataset](https://pytorch.org/docs/stable/data.html) object. To do so, we first need to define a class (here called `SmsSpanDataset`) that is a subclass of `Dataset`.
This class must implement two methods: First, `__init__` for putting the x and y values in two lists (of equal length and in the same order, of course). This method will be called exactly once: when an object of this type is created.
The second method, `__getitem__`, is called whenever the object of type `SmsSpanDataset` is asked to return a tuple corresponding to a single x and a single y value. The class below additionally implements `__len__`, which returns the number of instances.

Note: `__getitem__` is a magic method in Python that will be called when we access an object with an index: `torch_data_train[i]` (for the dataset object created below) will call this method with `index=i`.
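
The following minimal sketch shows the magic methods in plain Python, without PyTorch. The class `ToyPairs` and its contents are hypothetical and only serve to show which call triggers which method.

```python
class ToyPairs:
    """Hypothetical container holding parallel lists of inputs and labels."""

    def __init__(self, xs, ys):
        # Runs exactly once, when the object is created
        if len(xs) != len(ys):
            raise ValueError("xs and ys must have the same length")
        self.xs = xs
        self.ys = ys

    def __len__(self):
        # Called by len(obj)
        return len(self.xs)

    def __getitem__(self, index):
        # Called by obj[index]
        return self.xs[index], self.ys[index]

pairs = ToyPairs(["free entry", "ok lar"], [1, 0])
print(len(pairs))  # calls __len__
print(pairs[1])    # calls __getitem__ with index=1
```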

The most important part happens in these lines:
```
self.X = [torch.tensor(x, dtype=torch.float32) for x in data_set["x"]]
self.Y = [torch.tensor(y, dtype=torch.float32) for y in data_set["label"]]
```
Here, we map each input vector `x` to an input vector object that torch can use on the GPU. In PyTorch, vectors, matrices, and higher-dimensional tensors are all modelled using the [`torch.Tensor`](https://pytorch.org/docs/stable/tensors.html) datatype; the function `torch.tensor(...)` creates such an object from existing data. The `dtype` parameter tells PyTorch what kind of values we want to store in this tensor: here, 32-bit floating point values.

If you are new to PyTorch and tensors, I recommend this excellent explanation:
[Towards DataScience: Understanding Dimensions in PyTorch](https://towardsdatascience.com/understanding-dimensions-in-pytorch-6edf9972d3be)
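
As a small illustration of tensors and their dimensions, the sketch below creates a vector, stacks two vectors into a batch, and removes a size-1 dimension with `squeeze`. The values and sizes are made up; only the calls `torch.tensor`, `torch.zeros`, `torch.stack`, and `Tensor.squeeze` are real PyTorch API.

```python
import torch

# One count vector with 3 (hypothetical) vocabulary dimensions
x = torch.tensor([2.0, 0.0, 1.0], dtype=torch.float32)
print(x.shape, x.dtype)  # a vector: one dimension of size 3

# A batch of 2 such vectors stacked into a matrix
batch = torch.stack([x, torch.zeros(3)])
print(batch.shape)  # two dimensions: [batch size, number of features]

# A model with a single output unit returns shape [batch size, 1];
# squeeze(1) drops that trailing size-1 dimension.
outputs = torch.tensor([[0.9], [0.2]])
squeezed = outputs.squeeze(1)
print(outputs.shape, squeezed.shape)
```

The same `squeeze(1)` call appears in the training and evaluation code further down, where it aligns the shape of the model output with the shape of the labels.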

Next, if you have access to a GPU, let's use it! (This notebook will still run quickly on the CPU as we use only very few features. But in the future, you'll appreciate a GPU. So let's learn how to do that.)

In [87]:
cuda_available = torch.cuda.is_available() # checks if a GPU is available

# Set the device variable: quite handy for the code below.
# The device variable will either point to the CPU or to the GPU.
device = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")

print("We will compute on device:", device)

We will compute on device: cuda


In [88]:
############
# DATASET  #
############

class SmsSpanDataset(Dataset):
  """
  This is a custom dataset class that we need to write for our specific dataset.
  """
  def __init__(self, data_set):
    # Note the device=device assignment for the tensors. It will create the tensors
    # on the device that we configured above.
    self.X = [torch.tensor(x, dtype=torch.float32, device=device) for x in data_set["x"]]
    self.Y = [torch.tensor(y, dtype=torch.float32, device=device) for y in data_set["label"]]
    if len(self.X) != len(self.Y):
      raise Exception("X and Y must have the same length!")

  def __len__(self):
    return len(self.X)

  def __getitem__(self, index):
    # This returns an instance for a particular index.
    _x = self.X[index]
    _y = self.Y[index]

    return _x, _y

torch_data_train = SmsSpanDataset(train_data)
torch_data_dev = SmsSpanDataset(dev_data)
torch_data_test = SmsSpanDataset(test_data)

### Model

Next, we will define our model. Our model is a simple sequence of layers, hence, we'll use [`torch.nn.Sequential`](https://pytorch.org/docs/stable/generated/torch.nn.Sequential.html).
We define the layers of our model in the order in which they occur:

* An input layer taking in vectors of size `vocab_size` (as we defined them above) and mapping them to a hidden layer of size `h_layer_size`
* A ReLU activation applied element-wise to the output of the previous layer
* A second linear layer (also often called the "classifier layer"), which takes the output of the ReLU layer (`h_layer_size` dimensions) and maps it to a single logit. This is our score $z$!
* A sigmoid layer computing the probability for $z$. This layer also takes the output of its preceding layer as input.
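
To see what this sequence of layers computes, here is a minimal sketch of one forward pass in plain Python. The weights, biases, and sizes (3 input dimensions, hidden size 2) are made-up toy values, and the helper functions are hypothetical stand-ins for what `nn.Linear`, `nn.ReLU`, and `nn.Sigmoid` do internally.

```python
import math

def linear(x, weights, bias):
    """Compute W x + b, where weights holds one row per output unit."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(values):
    """Element-wise ReLU: negative values become 0."""
    return [max(0.0, v) for v in values]

def sigmoid(z):
    """Map a logit z to a value between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

# Toy input and toy parameters (assumptions, not learned values)
x = [2.0, 0.0, 1.0]
W1 = [[0.5, -1.0, 0.0],
      [-1.0, 0.0, 0.5]]
b1 = [0.0, 0.0]
W2 = [[1.0, 2.0]]
b2 = [-1.0]

hidden = relu(linear(x, W1, b1))  # first linear layer followed by ReLU
z = linear(hidden, W2, b2)[0]     # second linear layer yields a single logit
p = sigmoid(z)                    # probability of the positive class (SPAM)
print(hidden, z, p)
```

With these toy values, the second hidden unit is negative before the ReLU and is therefore set to 0, and the resulting logit of 0 corresponds to a probability of exactly 0.5.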

In [None]:
# Variables for all the parameters that I am changing for the suggested experiments portion are in THIS cell
h_layer_size = 5
num_epochs = 35
learning_rate = 1e-3
batch_size = 32

In [99]:
############
# MODEL    #
############

model = nn.Sequential(
    nn.Linear(vocab_size, h_layer_size), # maps inputs to a hidden layer of size h_layer_size
    nn.ReLU(),
    nn.Linear(h_layer_size, 1),          # maps the hidden layer to a single activation value (z, aka logit)
    nn.Sigmoid()              # maps value to a number between 0 and 1 ("probability"/"confidence")
)
print(model)

Sequential(
  (0): Linear(in_features=75, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=1, bias=True)
  (3): Sigmoid()
)


In [None]:
# Switch optimizers here by un/commenting as desired
#optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)


In [100]:
# Is our model on the CPU or on the GPU?
print("CUDA available?", cuda_available)
print("Model on GPU?", next(model.parameters()).device)
# Models are first created on the CPU.

model = model.to(device)  # Move the model to the GPU (or keep it on the CPU).
print("And now, model on GPU?", next(model.parameters()).device)

CUDA available? True
Model on GPU? cpu
And now, model on GPU? cuda:0


In [101]:
# Always fun with the random seeds ...
# We need to set them such that our results will be replicable.
# (Hint: for an experiment later, you can change the random seed here and check what happens.
# But for now, let's keep the answer to all questions of the universe, 42.)
seed=42
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
np.random.seed(seed)
random.seed(seed)
os.environ['PYTHONHASHSEED'] = str(seed)
if cuda_available:
  # Required for deterministic cuBLAS operations (CUDA 10.2 or later) once
  # torch.use_deterministic_algorithms(True) is set below.
  # If you are working in a different GPU environment, you can probably omit this line if it results in errors.
  os.environ["CUBLAS_WORKSPACE_CONFIG"]=":4096:8"

# Should we still have some source for non-determinism in our code, this will complain:
torch.use_deterministic_algorithms(True)


############
# MODEL    #
############

# I am repeating the model code here such that it gets set up properly before each
# training run (otherwise, we'd keep training the same parameters). Modify the model here.

model = nn.Sequential(
    nn.Linear(vocab_size, h_layer_size), # maps inputs to a hidden layer of size h_layer_size
    nn.ReLU(),
    nn.Linear(h_layer_size, 1),          # maps the hidden layer to a single activation value (z, aka logit)
    nn.Sigmoid()              # maps value to a number between 0 and 1 ("probability"/"confidence")
)
print(model)

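model = model.to(device)

# (Re-)create the optimizer for this freshly created model. An optimizer only
# updates the parameters it was constructed with, so an optimizer built for an
# earlier model object would leave this model untrained.
# Switch optimizers here by un/commenting as desired.
#optimizer = torch.optim.SGD(model.parameters(), lr=learning_rate)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)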


#######################
# TRAINING PARAMETERS #
#######################

# Modify the training parameters here to experiment
# I MOVED THE DEFINITION OF THESE PARAMETERS TO A CELL ABOVE
'''
num_epochs = 35
learning_rate = 1e-3
batch_size = 32
'''

loss_fn = nn.BCELoss() # Binary cross entropy loss



#######################
# DATA LOADERS        #
#######################
"""
In PyTorch, the DataLoader takes a Dataset and creates batches from it.
Iterating over the DataLoader yields one batch at a time; with shuffle=True,
the data is reshuffled each time a new iterator is created (i.e., every epoch).
We can obtain a single batch by calling next() on an iterator, or, as below,
use the iterator directly in a for loop.
Each batch is a pair of tensors with matching X and y values: the first has
shape [batch_size, dimensions of x], the second has shape [batch_size].
The advantage over a list of pairs is that we immediately get multi-dimensional
tensor objects on which the GPU can perform very fast operations.
"""

# We should always randomly shuffle the training dataset for each epoch.
# Don't worry, we fixed the random seeds above.
data_loader_train = DataLoader(torch_data_train, batch_size=batch_size, shuffle=True)
data_loader_dev = DataLoader(torch_data_dev, batch_size=batch_size, shuffle=False)
data_loader_test = DataLoader(torch_data_test, batch_size=batch_size, shuffle=False)


#######################
# EVALUATION          #
#######################

# Let's define a very simple evaluation method that computes accuracy
# Hint: it may be easier to read the training code below first, and then
# come back here.

def evaluate(model, data_loader):
  # Compute accuracy (in %) of model on data provided by data_loader
  correct = 0
  num_instances = len(data_loader.dataset)
  with torch.no_grad(): # Disable gradient tracking: we only evaluate here,
                        # so no gradients need to be stored for this block
    for X, y in data_loader:
      y_probs = model(X) # make prediction, shape [batch_size, 1]
      y_probs = y_probs.squeeze(1) # drop the trailing size-1 dimension -> [batch_size]
      y_pred = (y_probs >= 0.5).float() # threshold at 0.5: 1.0 = SPAM, 0.0 = NO_SPAM
      correct += (y_pred == y).sum().item() # count how many predictions were correct

  accuracy = 100 * correct / num_instances
  return accuracy


#######################
# TRAINING            #
#######################

# Early Stopping:
epochs_no_change = 1  # A counter for epochs gone without change,
prev_dev_acc = 0      # an acc variable for storing the last accuracy reached on the dev set,
k = 5                 # and a parameter k for setting the patience

start_time = time.time()
# Here, the training happens!
for epoch in range(num_epochs): # One epoch = one pass over the entire training dataset
  it = iter(data_loader_train)  # Create the iterator from the training dataset
  epoch_loss, steps = 0, 0      # To keep track of the current epoch's loss

  for X, y in it:               # Obtain a tensor X = batch of X-values, y accordingly
    y_pred = model(X)           # Have our model with current weights make a prediction
    y_pred = y_pred.squeeze(1)  # Drop the trailing size-1 dimension: [batch_size, 1] -> [batch_size]
    loss = loss_fn(y_pred, y)   # Have the loss function compute the loss value
    optimizer.zero_grad()       # Reset the gradients (otherwise they accumulate across batches - would be wrong here)
    loss.backward()             # Compute the gradients (partial derivatives)
    optimizer.step()            # Update the network's weights
    epoch_loss += loss.item()   # For tracking the epoch's loss (as a plain Python float)
    steps += 1

  print("\nEpoch:", epoch+1, "    Loss: {:0.4f}".format(epoch_loss/steps))
  # evaluate model at end of epoch
  print("Training accuracy: {:2.1f}".format(evaluate(model, data_loader_train)))
  dev_acc = evaluate(model, data_loader_dev)
  print("Dev accuracy: {:2.1f}".format(dev_acc))

  # Early stopping
  if dev_acc == prev_dev_acc:
    epochs_no_change += 1
    print(f"Epochs no change: {epochs_no_change}")
  else:
    epochs_no_change = 1

  if epochs_no_change >= k:
    print(f"No change in accuracy for dev set in the last {k} epochs.")
    print("Accuracy(dev) = {:2.1f}.".format(dev_acc))
    print("Stopping early.")
    break

  prev_dev_acc = dev_acc

end_time = time.time()

print("\n")
print("--"*50)
print("TRAINING DONE. Epochs trained:", epoch+1)
print(f"Execution time: {(end_time-start_time):.4f} seconds")

# Compute accuracy on test
print("\nTest accuracy: {:2.1f}".format(evaluate(model, data_loader_test)))

Sequential(
  (0): Linear(in_features=75, out_features=5, bias=True)
  (1): ReLU()
  (2): Linear(in_features=5, out_features=1, bias=True)
  (3): Sigmoid()
)

Epoch: 1     Loss: 0.8892
Training accuracy: 14.0
Dev accuracy: 13.9

Epoch: 2     Loss: 0.8686
Training accuracy: 14.1
Dev accuracy: 13.9
Epochs no change: 2

Epoch: 3     Loss: 0.8492
Training accuracy: 14.1
Dev accuracy: 13.9
Epochs no change: 3

Epoch: 4     Loss: 0.8319
Training accuracy: 14.1
Dev accuracy: 13.9
Epochs no change: 4

Epoch: 5     Loss: 0.8146
Training accuracy: 14.2
Dev accuracy: 14.1

Epoch: 6     Loss: 0.7987
Training accuracy: 14.3
Dev accuracy: 14.2

Epoch: 7     Loss: 0.7835
Training accuracy: 14.6
Dev accuracy: 14.7

Epoch: 8     Loss: 0.7691
Training accuracy: 15.4
Dev accuracy: 15.5

Epoch: 9     Loss: 0.7551
Training accuracy: 16.7
Dev accuracy: 17.8

Epoch: 10     Loss: 0.7415
Training accuracy: 20.4
Dev accuracy: 20.4

Epoch: 11     Loss: 0.7288
Training accuracy: 25.2
Dev accuracy: 24.7

Epoch: 12

__Coding Task:__

❓Check above after which epoch the training actually ends (defined as no improvements on the development set). The training loss keeps going down: we are overfitting to the training set! Implement _early stopping_: Track the changes of the accuracy on the development set. If this accuracy has not increased for 5 epochs, stop the training. (We call this a _patience_ or _tolerance_ of 5.)

💬 Dev Accuracy reached a max of 86.1 on epoch 27. I have implemented Early Stopping with a patience of k epochs without change.
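
For comparison, here is a minimal sketch of early stopping as the task describes it, i.e., stopping once the dev accuracy has not increased beyond its best value for `patience` epochs. Note that this differs from the cell above, which counts epochs whose dev accuracy is identical to that of the previous epoch. The function name `early_stopping_epoch` and the accuracy values are made up for illustration and are not results from this notebook.

```python
def early_stopping_epoch(dev_accuracies, patience):
    """Return the 1-based epoch at which training would stop.

    Training stops once the dev accuracy has not exceeded its best value
    so far for `patience` consecutive epochs. If that never happens, the
    number of the last epoch is returned.
    """
    best_acc = float("-inf")
    epochs_without_improvement = 0
    for epoch, acc in enumerate(dev_accuracies, start=1):
        if acc > best_acc:
            best_acc = acc                    # new best: reset the counter
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1   # no improvement this epoch
        if epochs_without_improvement >= patience:
            return epoch
    return len(dev_accuracies)

# Made-up dev accuracies, one value per epoch
toy_dev_accs = [80.0, 84.0, 86.0, 85.5, 86.0, 85.0, 84.0, 83.0, 82.0]
stop_epoch = early_stopping_epoch(toy_dev_accs, patience=5)
print(stop_epoch)
```

In this toy run, the best value is first reached in epoch 3, and the five following epochs bring no improvement, so training would stop after epoch 8. In a real training loop, one would typically also keep a copy of the model parameters from the best epoch.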

__Suggested Experiments:__

❓Experiment with the hyperparameters: `batch_size`, `num_epochs`, `learning_rate`. Note down your findings. When do results get better/worse? When does learning get faster/slower?

💬 The vanilla configuration's test score of 87.1 was not bested by any other combination of parameters, but the same cannot be said of its training time of 11.91s:
- ➕ Halving the batch size resulted in identical scores in only 7.44s.
- ➕ Multiplying the learning rate by a factor of 10 resulted in identical scores in only 1.3732s.
- ➖ For contrast, dividing the learning rate by a factor of 10 or doubling the batch size produced a steep drop in scores, landing below 14%.


❓Make the hidden layer size (`h_layer_size`, originally 2) bigger. Important: both the output size of the first linear layer and the input size of the second linear layer need to change! In the code above, both are set via `h_layer_size`. Increase the number of epochs. Are your results getting better?

❓Try a different optimizer: `optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)` What happens?

Great job! Are you wondering now why commercial spam filters are often so bad? 😀