![picture](https://drive.google.com/uc?export=view&id=1eCsjNAtjXuXfqBLxeEnsBpOikUO06msr)

# **NLP701@MBZUAI Fall 2025 - Lab 10**



## **Learning Outcomes**
- Critically analyze, evaluate, and improve the performance of RNN and LSTM models.
- Gain proficiency in implementing neural networks.

## **Learning Activities**
- Implement an NER tagger using RNNs.
- Improve the model by adding character-level information or other methods.


# **RNNs for Sequence Labeling**
In this lab exercise, we build an NER tagger for Arabic using **RNNs**.
We use the same dataset as in Assignment 1, ANERcorp, which has 150K words annotated with four entity types: Location (LOC), Organization (ORG), Person (PER), and Miscellaneous (MISC).

## **Data Preparation**

In [1]:
!pip install scikit-learn seqeval


In [2]:
!wget "https://camel.abudhabi.nyu.edu/anercorp/ANERcorp-CamelLabSplits.zip"
!unzip ANERcorp-CamelLabSplits.zip


In [3]:
from sklearn.model_selection import train_test_split

# read ANER named entity dataset
def read_data(file_path):
    with open(file_path, mode='r', encoding='utf-8') as f:
        sent, sents = [], []
        for line in f.readlines():
            if len(line.strip()) > 0:
                # split the line by space
                token = line.strip().split(' ')
                # unpack the token
                word, ner = token
                # append the tuple (word, ner tag) to a list for one sentence
                sent.append((word, ner.replace('PERS', 'PER')))
            # if the line is empty and the list is not empty
            elif len(sent):
                sents.append(sent)
                sent = []
        if sent:
            sents.append(sent)
    return sents

# read training and test data
_train_sents = read_data('./ANERcorp-CamelLabSplits/ANERCorp_CamelLab_train.txt')
test_sents = read_data('./ANERcorp-CamelLabSplits/ANERCorp_CamelLab_test.txt')

# this dataset has no dev set, so we hold out 10% of the training data as a dev set
train_sents, dev_sents = train_test_split(_train_sents, test_size=0.1, random_state=12345)
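
`read_data` expects one `word tag` pair per line, with a blank line between sentences. The sketch below shows that format on an in-memory sample. The function name `read_conll` and the sample text are made up for illustration, and `rsplit` is used here (an assumption, not what the lab code does) so that a line with extra whitespace does not break the unpacking.

```python
import io

def read_conll(lines):
    """Parse 'word tag' lines into sentences; blank lines separate sentences."""
    sents, sent = [], []
    for line in lines:
        line = line.strip()
        if line:
            # Take the last whitespace-separated field as the tag, the rest as the word.
            word, tag = line.rsplit(maxsplit=1)
            sent.append((word, tag.replace('PERS', 'PER')))
        elif sent:
            sents.append(sent)
            sent = []
    if sent:
        sents.append(sent)
    return sents

# Hypothetical two-sentence sample in the same layout as the ANERcorp files.
sample = "Ali B-PERS\nvisited O\nCairo B-LOC\n\nHe O\nleft O\n"
parsed = read_conll(io.StringIO(sample))
print(parsed)
```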

In [4]:
# check the dataset
print('Size of train_sents: %d'%len(train_sents))
print('Size of dev_sents: %d'%len(dev_sents))

print('The training data looks like:')
for sent in train_sents[:5]:
  print(sent)

Size of train_sents: 3575
Size of dev_sents: 398
The training data looks like:
[('وقال', 'O'), ('إن', 'O'), ('العملية', 'O'), ('استهدفت', 'O'), ('"', 'O'), ('بنى', 'O'), ('تحتية', 'O'), ('تستعمل', 'O'), ('لتخزين', 'O'), ('أسلحة', 'O'), ('عائدة', 'O'), ('للجهاد', 'O'), ('الإسلامي', 'O'), ('"', 'O'), ('،', 'O'), ('واوضح', 'O'), ('إن', 'O'), ('إسرائيل', 'B-LOC'), ('أخطرت', 'O'), ('سكان', 'O'), ('المنطقة', 'O'), ('بوقوع', 'O'), ('الغارة', 'O'), ('.', 'O')]
[('وهذه', 'O'), ('هي', 'O'), ('المباراة', 'O'), ('الرسمية', 'O'), ('الأولى', 'O'), ('لمنتخب', 'O'), ('إنجلترا', 'B-LOC'), ('تحت', 'O'), ('قيادة', 'O'), ('مدربه', 'O'), ('ستيف', 'B-PER'), ('مكلارين', 'I-PER'), ('الذي', 'O'), ('خلف', 'O'), ('السويدي', 'O'), ('غوران', 'B-PER'), ('إريكسون', 'I-PER'), ('بعد', 'O'), ('نهائيات', 'O'), ('المونديال', 'B-MISC'), ('.', 'O')]
[('في', 'O'), ('عيد', 'O'), ('الملكة', 'O'), ('السماوية', 'O'), ('"', 'O'), ('تيني', 'B-PER'), ('هاو', 'I-PER'), ('"', 'O'), ('يسود', 'O'), ('إحساس', 'O'), ('بالروحانية', 'O'),

## **Build Vocabulary**

In [5]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from tqdm import tqdm

torch.manual_seed(1)

def build_vocab(sents, idx, special_tokens=('UNK',)):
    vocab = {}
    if special_tokens:
        vocab = {k: v for v, k in enumerate(special_tokens)}

    for sent in sents:
        for word in sent:
            if word[idx] not in vocab:
                vocab[word[idx]] = len(vocab)

    return vocab

def prepare_sequence(seq, to_idx):
    idxs = []
    for w in seq:
        if w in to_idx:
            idxs.append(to_idx[w])
        else:
            idxs.append(to_idx['UNK'])
    return torch.tensor(idxs, dtype=torch.long)

In [6]:
word_to_idx = build_vocab(train_sents, 0)
tag_to_idx = build_vocab(train_sents, 1, special_tokens=None)
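
Because `word_to_idx` is built from every word in `train_sents`, no training word ever maps to `UNK`, so the `UNK` embedding is never updated during training. Unseen dev or test words then share a randomly initialized vector. One common remedy, sketched here as an assumption rather than as part of the lab solution, is a frequency cutoff that sends rare training words to `UNK`. The helper `build_vocab_min_freq` and the toy sentences are hypothetical.

```python
from collections import Counter

def build_vocab_min_freq(sents, idx, min_freq=2, unk_token='UNK'):
    """Variant of build_vocab: items seen fewer than min_freq times are left out."""
    counts = Counter(pair[idx] for sent in sents for pair in sent)
    vocab = {unk_token: 0}
    for sent in sents:
        for pair in sent:
            item = pair[idx]
            if counts[item] >= min_freq and item not in vocab:
                vocab[item] = len(vocab)
    return vocab

toy_sents = [
    [('Ali', 'B-PER'), ('visited', 'O'), ('Cairo', 'B-LOC')],
    [('Ali', 'B-PER'), ('visited', 'O'), ('Dubai', 'B-LOC')],
]
toy_vocab = build_vocab_min_freq(toy_sents, 0, min_freq=2)
# 'Cairo' occurs once, so it falls back to the UNK index during training too.
toy_ids = [toy_vocab.get(w, toy_vocab['UNK']) for w, _ in toy_sents[0]]
print(toy_vocab)
print(toy_ids)
```

With a cutoff like this, the model sees `UNK` in context during training, at the cost of discarding the identity of rare words.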

## **Create the model**

In [7]:
class RNNTagger(nn.Module):

    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(RNNTagger, self).__init__()
        self.hidden_dim = hidden_dim
        # Word embedding layer. This maps each word to a vector representation.
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)

        # The RNN takes word embeddings as inputs, and outputs hidden states
        # with dimensionality hidden_dim.
        # You can utilize the multi-layer rnn by changing num_layers.
        # Also you can use the bidirectional rnn with bidirectional=True.
        self.rnn = nn.RNN(embedding_dim, hidden_dim, num_layers=1, bidirectional=False)

        # The linear layer that maps from hidden state space to tag space
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        rnn_out, _ = self.rnn(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(rnn_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

    def predict(self, sentence):
        # Invert tag_to_idx once instead of scanning it for every token.
        idx_to_tag = {idx: tag for tag, idx in tag_to_idx.items()}
        with torch.no_grad():
            inputs = prepare_sequence(sentence, word_to_idx)
            tag_scores = self.forward(inputs)
            indices = torch.argmax(tag_scores, dim=1)
        return [idx_to_tag[i] for i in indices.tolist()]
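
The `view(len(sentence), 1, -1)` call in `forward` is there because `nn.RNN` defaults to `batch_first=False` and expects input of shape `(seq_len, batch, input_size)`; each sentence is treated as a batch of size 1. The shape check below uses small made-up dimensions and random vectors in place of real embeddings. It also shows why a bidirectional RNN needs a linear layer with `2 * hidden_dim` inputs.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Illustrative sizes only; the lab uses EMBEDDING_DIM=128 and HIDDEN_DIM=256.
seq_len, embedding_dim, hidden_dim = 4, 8, 16
embeds = torch.randn(seq_len, embedding_dim)  # stand-in for word_embeddings(sentence)

# Unidirectional: one hidden vector of size hidden_dim per token.
rnn = nn.RNN(embedding_dim, hidden_dim)
out, h_n = rnn(embeds.view(seq_len, 1, -1))
print(out.shape, h_n.shape)

# Bidirectional: forward and backward states are concatenated per token.
birnn = nn.RNN(embedding_dim, hidden_dim, bidirectional=True)
bi_out, bi_h_n = birnn(embeds.view(seq_len, 1, -1))
print(bi_out.shape, bi_h_n.shape)
```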

## **Train the model**

In [8]:
def train(model, train_sents, lr=0.1, epochs=3):
    loss_function = nn.NLLLoss()
    optimizer = optim.SGD(model.parameters(), lr=lr)

    for epoch in range(epochs):

        for sentence in tqdm(train_sents):
            # Step 1. Remember that PyTorch accumulates gradients.
            # We need to clear them out before each instance
            optimizer.zero_grad() # or model.zero_grad()

            # Step 2. Get our inputs ready for the network, that is, turn them into
            # Tensors of word indices.
            x = [word[0] for word in sentence]
            y = [word[1] for word in sentence]
            sentence_in = prepare_sequence(x, word_to_idx)
            targets = prepare_sequence(y, tag_to_idx)

            # Step 3. Run our forward pass.
            tag_scores = model(sentence_in)

            # Step 4. Compute the loss, gradients, and update the parameters by
            #  calling optimizer.step()
            loss = loss_function(tag_scores, targets)
            loss.backward()
            optimizer.step()

        print(f'epoch: {epoch+1}, loss: {loss}')

    return model
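
Note that the `print` in `train` reports the loss of the last sentence of each epoch only, which is a noisy signal. A minimal sketch of tracking the mean loss per epoch follows. The data and the linear model are toy stand-ins chosen so the block runs on its own; only the accumulation pattern is the point.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Toy stand-ins (assumptions): 20 "sentences" of 5 tokens, 8 features, 3 tags.
data = [(torch.randn(5, 8), torch.randint(0, 3, (5,))) for _ in range(20)]
model = nn.Sequential(nn.Linear(8, 3), nn.LogSoftmax(dim=1))
loss_function = nn.NLLLoss()
optimizer = optim.SGD(model.parameters(), lr=0.1)

epoch_means = []
for epoch in range(3):
    total = 0.0
    for x, y in data:
        optimizer.zero_grad()
        loss = loss_function(model(x), y)
        loss.backward()
        optimizer.step()
        total += loss.item()  # .item() returns a plain Python float
    epoch_means.append(total / len(data))
    print(f'epoch: {epoch+1}, mean loss: {epoch_means[-1]:.4f}')
```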

In [9]:
EMBEDDING_DIM = 128
HIDDEN_DIM = 256

RNN_model = RNNTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_idx), len(tag_to_idx))
RNN_tagger = train(RNN_model, train_sents, epochs=5)

torch.save(RNN_tagger.state_dict(), './rnn_tagger.pt')

100%|██████████| 3575/3575 [00:09<00:00, 358.57it/s]


epoch: 1, loss: 0.1423053741455078


100%|██████████| 3575/3575 [00:09<00:00, 375.55it/s]


epoch: 2, loss: 0.0937434509396553


100%|██████████| 3575/3575 [00:10<00:00, 340.54it/s]


epoch: 3, loss: 0.055363357067108154


100%|██████████| 3575/3575 [00:10<00:00, 337.62it/s]


epoch: 4, loss: 0.03216924890875816


100%|██████████| 3575/3575 [00:10<00:00, 328.33it/s]


epoch: 5, loss: 0.029427779838442802


## **Evaluate the Model**

Load the model:

In [10]:
RNN_tagger = RNNTagger(EMBEDDING_DIM,
                       HIDDEN_DIM,
                       len(word_to_idx),
                       len(tag_to_idx))
RNN_tagger.load_state_dict(torch.load('./rnn_tagger.pt'))
RNN_tagger.eval()

RNNTagger(
  (word_embeddings): Embedding(27364, 128)
  (rnn): RNN(128, 256)
  (hidden2tag): Linear(in_features=256, out_features=9, bias=True)
)

Evaluate the model:

In [11]:
from seqeval.metrics import f1_score

dev = [[word[0] for word in sent] for sent in dev_sents]
dev_gold = [[word[1] for word in sent] for sent in dev_sents]

dev_pred = [RNN_tagger.predict(sent) for sent in dev]
f1_score(dev_gold, dev_pred)

np.float64(0.44234404536862004)
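
The F1 reported by seqeval is computed over whole entity spans rather than individual tokens, so a prediction that gets only part of a name right earns no credit for that entity. The sketch below is a simplified re-implementation of that idea for illustration, not seqeval's actual code. The helper names are hypothetical, and its handling of malformed tag sequences (for example an `I-` tag with no preceding `B-`) is an assumption that may differ from the library.

```python
def spans(tags):
    """Return the set of (type, start, end) entity spans in one BIO sequence."""
    found, start, current = set(), None, None
    for i, tag in enumerate(list(tags) + ['O']):  # sentinel 'O' closes a trailing span
        continues = tag.startswith('I-') and tag[2:] == current
        if current is not None and not continues:
            found.add((current, start, i))
            current = None
        if tag != 'O' and not continues:
            start, current = i, tag[2:]
    return found

def entity_f1(gold_sents, pred_sents):
    """Micro-averaged F1 over exactly matching entity spans."""
    tp = n_gold = n_pred = 0
    for gold, pred in zip(gold_sents, pred_sents):
        g, p = spans(gold), spans(pred)
        tp += len(g & p)
        n_gold += len(g)
        n_pred += len(p)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = [['B-PER', 'I-PER', 'O', 'B-LOC']]
pred = [['B-PER', 'O', 'O', 'B-LOC']]  # the person name is cut short
score = entity_f1(gold, pred)
print(score)  # 0.5
```

Three of the four tokens are tagged correctly here, yet only one of the two entities matches exactly, which is why entity-level F1 is lower than token accuracy.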

# **Exercise 01: LSTM**

Create the model:

In [None]:
class LSTMTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim)
        self.hidden2tag = nn.Linear(hidden_dim, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

    def predict(self, sentence):
        # Invert tag_to_idx once instead of scanning it for every token.
        idx_to_tag = {idx: tag for tag, idx in tag_to_idx.items()}
        with torch.no_grad():
            inputs = prepare_sequence(sentence, word_to_idx)
            tag_scores = self.forward(inputs)
            indices = torch.argmax(tag_scores, dim=1)
        return [idx_to_tag[i] for i in indices.tolist()]


Train the model:

In [None]:
EMBEDDING_DIM = 128
HIDDEN_DIM = 256

"""
[Your code here]
"""

lstm_model = LSTMTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_idx), len(tag_to_idx))
lstm_tagger = train(lstm_model, train_sents, epochs=5)

torch.save(lstm_tagger.state_dict(), './lstm_tagger.pt')

100%|██████████| 3575/3575 [00:09<00:00, 377.18it/s]


epoch: 1, loss: 0.20392757654190063


100%|██████████| 3575/3575 [00:09<00:00, 372.29it/s]


epoch: 2, loss: 0.07214045524597168


100%|██████████| 3575/3575 [00:09<00:00, 365.14it/s]


epoch: 3, loss: 0.023382777348160744


100%|██████████| 3575/3575 [00:09<00:00, 362.81it/s]


epoch: 4, loss: 0.013818600215017796


100%|██████████| 3575/3575 [00:09<00:00, 363.08it/s]

epoch: 5, loss: 0.019663896411657333





Evaluate the model:

In [31]:
"""
[Your code here]
"""
from seqeval.metrics import f1_score

dev = [[word[0] for word in sent] for sent in dev_sents]
dev_gold = [[word[1] for word in sent] for sent in dev_sents]

dev_pred = [lstm_tagger.predict(sent) for sent in dev]
f1_score(dev_gold, dev_pred)

np.float64(0.5803971812940423)

## **What would be a good example to show that LSTM performs better than a simple RNN?**
- No coding involved, and no need to use Arabic as an example.
- Imagine you're writing a paper, and you want to show an example that works well in LSTM compared to RNN.
- What kind of example would you present and why?

### Example that requires a long-term dependency:

The movie started off dull, with weak performances and predictable dialogue, but by the end, it delivered an incredibly emotional and satisfying conclusion.
- Simple RNN: the hidden state is rewritten at every step, so early information (“dull”, “weak performances”) tends to be overwritten by later inputs. By the time the model reaches “emotional and satisfying conclusion,” it has largely forgotten the earlier context that helps interpret the reversal of sentiment.
- LSTM: the gated cell state lets the model decide what to remember and what to forget, so it can preserve the early cues up to the contrast marker “but by the end”. It can then learn that this marker signals a contrast, and that the final sentiment depends more on the later clause than on the initial one.
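
The intuition above can be backed by a small worked calculation. The sketch compares how much gradient survives across 30 time steps in a scalar simple RNN and along a scalar LSTM cell-state path. All values (`w`, `x`, `f`) are illustrative rather than learned, and the LSTM part deliberately ignores the gates' own dependence on earlier hidden states.

```python
import math

T = 30  # number of time steps between the early cue and the final word

# Scalar simple RNN: h_t = tanh(w * h_{t-1} + x_t)
w, x = 0.9, 0.5  # illustrative values
h, rnn_grad = 0.0, 1.0
for _ in range(T):
    h = math.tanh(w * h + x)
    rnn_grad *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}

# Scalar LSTM cell state: c_t = f * c_{t-1} + i * g
f = 0.98  # forget gate held near 1 (illustrative)
lstm_grad = f ** T  # d c_T / d c_0 along the cell-state path

print(f'RNN  d h_T / d h_0 = {rnn_grad:.2e}')
print(f'LSTM d c_T / d c_0 = {lstm_grad:.2e}')
```

Each RNN factor is at most `w`, so the product shrinks geometrically. The LSTM path keeps a usable gradient as long as the forget gate stays close to 1.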

# **Bonus Exercise: How Would You Improve the Model?**
- What can be done on top of the simple LSTM tagger to improve the performance?

In [101]:
class LSTMImproveTagger(nn.Module):
    def __init__(self, embedding_dim, hidden_dim, vocab_size, tagset_size):
        super(LSTMImproveTagger, self).__init__()
        self.hidden_dim = hidden_dim
        self.word_embeddings = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, bidirectional=True)
        self.hidden2tag = nn.Linear(hidden_dim*2, tagset_size)

    def forward(self, sentence):
        embeds = self.word_embeddings(sentence)
        lstm_out, _ = self.lstm(embeds.view(len(sentence), 1, -1))
        tag_space = self.hidden2tag(lstm_out.view(len(sentence), -1))
        tag_scores = F.log_softmax(tag_space, dim=1)
        return tag_scores

    def predict(self, sentence):
        # Invert tag_to_idx once instead of scanning it for every token.
        idx_to_tag = {idx: tag for tag, idx in tag_to_idx.items()}
        with torch.no_grad():
            inputs = prepare_sequence(sentence, word_to_idx)
            tag_scores = self.forward(inputs)
            indices = torch.argmax(tag_scores, dim=1)
        return [idx_to_tag[i] for i in indices.tolist()]


EMBEDDING_DIM = 128
HIDDEN_DIM = 256

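# Note: this redefines train() from above, replacing SGD with AdamW plus a StepLR schedule.
def train(model, train_sents, lr=0.1, epochs=3):
    loss_function = nn.NLLLoss()
    optimizer = optim.AdamW(model.parameters(), lr=lr, weight_decay=0.01)
    # With step_size=10 the learning rate first decays after 10 epochs,
    # so the schedule has no effect in the 5-epoch run below.
    scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)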

    for epoch in range(epochs):

        for sentence in tqdm(train_sents):
            # Step 1. Remember that PyTorch accumulates gradients.
            # We need to clear them out before each instance
            optimizer.zero_grad() # or model.zero_grad()

            # Step 2. Get our inputs ready for the network, that is, turn them into
            # Tensors of word indices.
            x = [word[0] for word in sentence]
            y = [word[1] for word in sentence]
            sentence_in = prepare_sequence(x, word_to_idx)
            targets = prepare_sequence(y, tag_to_idx)

            # Step 3. Run our forward pass.
            tag_scores = model(sentence_in)

            # Step 4. Compute the loss, gradients, and update the parameters by
            #  calling optimizer.step()
            loss = loss_function(tag_scores, targets)
            loss.backward()
            optimizer.step()
        scheduler.step()
        print(f'epoch: {epoch+1}, loss: {loss}')

    return model

improvebilstm_model = LSTMImproveTagger(EMBEDDING_DIM, HIDDEN_DIM, len(word_to_idx), len(tag_to_idx))
improvebilstm_tagger = train(improvebilstm_model, train_sents, lr=0.01, epochs=5)

torch.save(improvebilstm_model.state_dict(), './lstm_improved_tagger.pt')

100%|██████████| 3575/3575 [00:23<00:00, 153.35it/s]


epoch: 1, loss: 0.09997943043708801


100%|██████████| 3575/3575 [00:22<00:00, 159.15it/s]


epoch: 2, loss: 0.18903404474258423


100%|██████████| 3575/3575 [00:21<00:00, 162.67it/s]


epoch: 3, loss: 0.3200090229511261


100%|██████████| 3575/3575 [00:21<00:00, 163.84it/s]


epoch: 4, loss: 0.021573113277554512


100%|██████████| 3575/3575 [00:21<00:00, 165.17it/s]


epoch: 5, loss: 0.05264861881732941


In [102]:
from seqeval.metrics import f1_score

dev = [[word[0] for word in sent] for sent in dev_sents]
dev_gold = [[word[1] for word in sent] for sent in dev_sents]

dev_pred = [improvebilstm_tagger.predict(sent) for sent in dev]
f1_score(dev_gold, dev_pred)

np.float64(0.6088362068965518)

### Using a bidirectional LSTM (trained with AdamW, lr=0.01) raises dev F1 from 58.0 to 60.9
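
Another option named in the learning activities is character-level information, which may help with out-of-vocabulary names. The sketch below is one possible design, not the lab's reference solution: a hypothetical `CharWordEncoder` runs a small LSTM over each word's characters and concatenates its final hidden state with the word embedding. All class names, dimensions, and the toy vocabularies are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

class CharWordEncoder(nn.Module):
    """Hypothetical encoder: word embedding concatenated with a char-LSTM summary."""

    def __init__(self, vocab_size, char_vocab_size, word_dim=16, char_dim=8, char_hidden=8):
        super().__init__()
        self.word_embeddings = nn.Embedding(vocab_size, word_dim)
        self.char_embeddings = nn.Embedding(char_vocab_size, char_dim)
        self.char_lstm = nn.LSTM(char_dim, char_hidden)
        self.output_dim = word_dim + char_hidden

    def forward(self, word_ids, char_ids_per_word):
        char_vecs = []
        for char_ids in char_ids_per_word:
            # Input (num_chars, 1, char_dim); keep the final hidden state (1, 1, char_hidden).
            _, (h_n, _) = self.char_lstm(self.char_embeddings(char_ids).unsqueeze(1))
            char_vecs.append(h_n.view(-1))
        char_part = torch.stack(char_vecs)          # (seq_len, char_hidden)
        word_part = self.word_embeddings(word_ids)  # (seq_len, word_dim)
        return torch.cat([word_part, char_part], dim=1)

words = ['Ali', 'visited', 'Cairo']
toy_word_to_idx = {'UNK': 0, 'Ali': 1, 'visited': 2}  # 'Cairo' is out of vocabulary
toy_char_to_idx = {'UNK': 0}
for w in words:
    for ch in w:
        toy_char_to_idx.setdefault(ch, len(toy_char_to_idx))

word_ids = torch.tensor([toy_word_to_idx.get(w, 0) for w in words])
char_ids = [torch.tensor([toy_char_to_idx[ch] for ch in w]) for w in words]

encoder = CharWordEncoder(len(toy_word_to_idx), len(toy_char_to_idx))
features = encoder(word_ids, char_ids)
print(features.shape)  # torch.Size([3, 24])
```

The resulting features could replace `embeds` in the tagger, with the sentence-level LSTM created as `nn.LSTM(encoder.output_dim, hidden_dim, bidirectional=True)`. Even though 'Cairo' maps to `UNK` at the word level, its character summary still differs from that of other unknown words.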

## **Reference**
https://pytorch.org/tutorials/beginner/nlp/sequence_models_tutorial.html