## APAI/STAT 4011 Natural Language Processing

## Assignment 2

### Submission format: 2 files (please don't zip them together), one is the ipynb file implemented with code and comments here, and one is pdf/ html file generated from this notebook. It's highly suggested that you directly write in this notebook and submit a pdf file.

*The late submission policy*: If you have difficulty handing in on time (e.g., illness etc.), you would need to send the official certificate to Dr. Lau (and cc the tutor) at least one day before the deadline via email.


## Q1. (40 marks)

In Q1, we will be focusing on the IMDb dataset. This is a dataset for binary sentiment classification, and is provided with a set of 25,000 highly polar movie reviews for training, and 25,000 for testing. To load the dataset, you can easily download the dataset by adding this line in your colab notebook:

```
! wget http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz
```

In [1]:
# download the Large IMDB Movie Review Dataset
# the task is binary classification: positive or negative review

! wget http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz
! tar -xzf aclImdb_v1.tar.gz

--2024-11-30 02:32:28--  http://ai.stanford.edu/%7Eamaas/data/sentiment/aclImdb_v1.tar.gz
Resolving ai.stanford.edu (ai.stanford.edu)... 171.64.68.10
Connecting to ai.stanford.edu (ai.stanford.edu)|171.64.68.10|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 84125825 (80M) [application/x-gzip]
Saving to: ‘aclImdb_v1.tar.gz’


2024-11-30 02:32:57 (2.74 MB/s) - ‘aclImdb_v1.tar.gz’ saved [84125825/84125825]



In [2]:
import torch
import math
import numpy as np
import torch.nn as nn
from torch.utils.data import DataLoader
from torch import optim
import os
from collections import namedtuple

seed = 4011
torch.manual_seed(seed)
np.random.seed(seed)
torch.backends.cudnn.deterministic = True
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

train_path = "aclImdb/train/"
test_path = "aclImdb/test/"

# Hyperparameters for tuning

batch_size = 100
max_len = 300
embedding_size = 300
min_count = 10

In [3]:
###2. Build an appropriate embedding matrix based on the vocabulary and print out the size of this matrix.
from collections import Counter
from tqdm import tqdm

# Tokenize and build vocabulary
def build_vocab(paths):
    counter = Counter()
    for path in paths:
        for fname in tqdm(os.listdir(path), desc=f"Processing {path}"):
            with open(os.path.join(path, fname), 'r', encoding='utf-8') as f:
                tokens = f.read().strip().split()  # Simple whitespace tokenization
                counter.update(tokens)
    return counter

train_pos = "aclImdb/train/pos/"
train_neg = "aclImdb/train/neg/"
counter = build_vocab([train_pos, train_neg])

# Build word-to-index dictionary
special_tokens = ['<unk>', '<pad>']
word_to_idx = {word: idx for idx, (word, _) in enumerate(counter.most_common(), start=len(special_tokens))}
word_to_idx.update({tok: i for i, tok in enumerate(special_tokens)})

embedding_size = 100
vocab_size = len(word_to_idx)
embeddings = nn.Embedding(
    vocab_size,
    embedding_size,
    padding_idx=word_to_idx['<pad>']
)

print(embeddings.weight.size())

Processing aclImdb/train/pos/: 100%|██████████| 12500/12500 [00:01<00:00, 7978.80it/s]
Processing aclImdb/train/neg/: 100%|██████████| 12500/12500 [00:00<00:00, 12798.93it/s]


torch.Size([280619, 100])


In [4]:
# Load the dataset
# 25000 train and 25000 test sentences


###After loading the dataset, we will need to perform preprocessing (e.g. tokenization, build up vocabulary, etc.) on the text. We will set the minimum token frequency threshold to be 10. Then print out the size of your vocabulary.
#- Special notes: need some special tokens like `<UNK>`, `<PAD>`, `<BOS>`, `<EOS>`. `<UNK>` represents the tokens
#    that can not be found in our vocabulary. (Why do we need it?)
#    `<PAD>` means padding, and `<BOS>` and `<EOS>` represents beginning-of-sentence and end-of-sentence, respectively.

Sentence = namedtuple('Sentence', ['index', 'tokens', 'label'])

def read_imdb_movie_dataset(dataset_path):

    indices = []
    text = []
    rating = []

    i = 0

    for filename in os.listdir(os.path.join(dataset_path, "pos")):
        file_path = os.path.join(dataset_path, "pos", filename)
        data = open(file_path, 'r', encoding="ISO-8859-1").read()
        indices.append(i)
        text.append(data)
        rating.append(1)
        i = i + 1

    for filename in os.listdir(os.path.join(dataset_path, "neg")):
        file_path = os.path.join(dataset_path, "neg", filename)
        data = open(file_path, 'r', encoding="ISO-8859-1").read()
        indices.append(i)
        text.append(data)
        rating.append(0)
        i = i + 1

    sentences = [ Sentence(index, text.split(), rating)
                  for index, text, rating in zip(indices, text, rating)]

    return sentences

train_examples = read_imdb_movie_dataset(train_path)
test_examples = read_imdb_movie_dataset(test_path)

UNK = '<UNK>'
PAD = '<PAD>'
BOS = '<BOS>'
EOS = '<EOS>'


class VocabItem:

    def __init__(self, string, hash=None):
        self.string = string
        self.count = 0
        self.hash = hash


    def __str__(self):
        return 'VocabItem({})'.format(self.string)

    def __repr__(self):
        return self.__str__()


class Vocab:

    def __init__(
        self,
        min_count=0,
        no_unk=False,
        add_padding=False,
        add_bos=False,
        add_eos=False,
        unk=None):

        self.no_unk = no_unk
        self.vocab_items = []
        self.vocab_hash = {}
        self.word_count = 0
        self.special_tokens = []
        self.min_count = min_count
        self.add_padding = add_padding
        self.add_bos = add_bos
        self.add_eos = add_eos
        self.unk = unk

        self.UNK = None
        self.PAD = None
        self.BOS = None
        self.EOS = None

        self.index2token = []
        self.token2index = {}

        self.finished = False

    def add_tokens(self, tokens):
        if self.finished:
            raise RuntimeError('Vocabulary is finished')

        for token in tokens:
            if token not in self.vocab_hash:
                self.vocab_hash[token] = len(self.vocab_items)
                self.vocab_items.append(VocabItem(token))

            self.vocab_items[self.vocab_hash[token]].count += 1
            self.word_count += 1

    def finish(self):

        token2index = self.token2index
        index2token = self.index2token

        tmp = []

        if not self.no_unk:

            # we add/handle the special `UNK` token
            # and set it to have index 0 in our mapping
            if self.unk:
                self.UNK = VocabItem(self.unk, hash=0)
                self.UNK.count = self.vocab_items[self.vocab_hash[self.unk]].count
                index2token.append(self.UNK)
                self.special_tokens.append(self.UNK)

                for token in self.vocab_items:
                    if token.string != self.unk:
                        tmp.append(token)

            else:
                self.UNK = VocabItem(UNK, hash=0)
                index2token.append(self.UNK)
                self.special_tokens.append(self.UNK)

                for token in self.vocab_items:
                    if token.count <= self.min_count:
                        self.UNK.count += token.count
                    else:
                        tmp.append(token)
        else:
            for token in self.vocab_items:
                tmp.append(token)

        tmp.sort(key=lambda token: token.count, reverse=True)

        if self.add_bos:
            self.BOS = VocabItem(BOS)
            tmp.append(self.BOS)
            self.special_tokens.append(self.BOS)

        if self.add_eos:
            self.EOS = VocabItem(EOS)
            tmp.append(self.EOS)
            self.special_tokens.append(self.EOS)

        if self.add_padding:
            self.PAD = VocabItem(PAD)
            tmp.append(self.PAD)
            self.special_tokens.append(self.PAD)

        index2token += tmp

        for i, token in enumerate(self.index2token):
            token2index[token.string] = i
            token.hash = i

        self.index2token = index2token
        self.token2index = token2index

        if not self.no_unk:
            print('Unknown vocab size:', self.UNK.count)

        print('Vocab size: %d' % len(self))

        self.finished = True

    def __getitem__(self, i):
        return self.index2token[i]

    def __len__(self):
        return len(self.index2token)

    def __iter__(self):
        return iter(self.index2token)

    def __contains__(self, key):
        return key in self.token2index

    def tokens2indices(self, tokens, add_bos=False, add_eos=False):
        string_seq = []
        if add_bos:
            string_seq.append(self.BOS.hash)
        for token in tokens:
            if self.no_unk:
                string_seq.append(self.token2index[token])
            else:
                string_seq.append(self.token2index.get(token, self.UNK.hash))
        if add_eos:
            string_seq.append(self.EOS.hash)
        return string_seq

    def indices2tokens(self, indices, ignore_ids=()):
        tokens = []
        for idx in indices:
            if idx in ignore_ids:
                continue
            tokens.append(self.index2token[idx].string)

        return tokens

src_vocab = Vocab(min_count=min_count, add_padding=True)

tgt_vocab = Vocab(no_unk=True, add_padding=False)

for sentence in train_examples:
    src_vocab.add_tokens(sentence.tokens[:max_len])
    tgt_vocab.add_tokens([sentence.label])

src_vocab.finish()
tgt_vocab.finish()


Vocabs = namedtuple('Vocabs', ['src', 'tgt'])
vocabs = Vocabs(src_vocab, tgt_vocab)

Unknown vocab size: 424424
Vocab size: 22521
Vocab size: 2



### Q1-1. To get your data prepared, build up Pytorch dataloaders for model training and print out one batch of training data. (15 marks)

- To check whether your dataloader can work successfully, you can choose to use `next(iter(train_dataloader))`. You can refer to https://pytorch.org/tutorials/beginner/basics/data_tutorial.html.

In [5]:


###To get your data prepared, build up Pytorch dataloaders for model training and print out one batch of training data.
#- To check whether your dataloader can work successfully, you can choose to use `next(iter(train_dataloader))`. You can refer to https://pytorch.org/tutorials/beginner/basics/data_tutorial.html.


# The Batch objects
# To easily access all the data in a batch, let's create a special Batch object that will give us access to
# all the information we may require during training.
# Let's begin creating a more friendly object that contains a numeric representation of our inputs and outputs.
# By default we will use numpy objects, but we will also add a function to translate the contents of the object to PyTorch.
# We will create this object to be generic enough so we can use it with tasks other than classification, too.
# This object will work like a dictionary,
# but it will also allow us to access each component using an attribute with the same name.
# The main principle is that this dictionary-like batch will hold `numpy` objects as values,
# and that after calling the `to_torch_()` function, they will be turned into `pytorch` objects and moved to
# the corresponding provided device.
# In this way, we know that all our elements inside the batch object are in the right place.
# We will combine our `Batch` object with a `BatchTuple` object that will hold data relevant to a specific input of the model.

class Batch(dict):
    def __init__(self, *args, **kwargs):
        super(Batch, self).__init__(*args, **kwargs)
        self.__dict__ = self
        self._is_torch = False

    def to_torch_(self, device):
        self._is_torch = False
        for key in self.keys():
            value = self[key]
            # we move `numpy` objects to `pytorch`
            if isinstance(value, BatchTuple):
                value.to_torch_(device)
            # we also move our BatchTuple objects to `pytorch`
            if isinstance(value, np.ndarray):
                self[key] = torch.from_numpy(value).to(device)


class BatchTuple(object):
    def __init__(self, sequences, lengths, sublengths, masks):
        self.sequences = sequences
        self.lengths = lengths
        self.sublengths = sublengths
        self.masks = masks
        self._is_torch = False

    def to_torch_(self, device):
        if not self._is_torch:
            self.sequences = torch.tensor(
                self.sequences, device=device, dtype=torch.long
            )

            if self.lengths is not None:
                self.lengths = torch.tensor(
                    self.lengths, device=device, dtype=torch.long
                )

            if self.sublengths is not None:
                self.sublengths = torch.tensor(
                    self.sublengths, device=device, dtype=torch.long
                )
            if self.masks is not None:
                self.masks = torch.tensor(
                    self.masks, device=device, dtype=torch.float
                )


# The padding function

def pad_list(
    sequences,
    dim0_pad=None,
    dim1_pad=None,
    align_right=False,
    pad_value=0
):

    sequences = [np.asarray(sublist) for sublist in sequences]

    if not dim0_pad:
        dim0_pad = len(sequences)

    if not dim1_pad:
        dim1_pad = max(len(seq) for seq in sequences)

    out = np.full(shape=(dim0_pad, dim1_pad), fill_value=pad_value)

    lengths = []
    for i in range(len(sequences)):
        data_length = len(sequences[i])
        lengths.append(data_length)
        offset = dim1_pad - data_length if align_right else 0
        np.put(out[i], range(offset, offset + data_length), sequences[i])

    lengths = np.array(lengths)

    return out, lengths


In [6]:
#-------------------
# Write your code

class SequenceClassificationBatchBuilder(object):
    def __init__(self, vocabs, max_len=None):
        self.vocabs = vocabs
        self.max_len = max_len

    def __call__(self, examples):
      ids_batch = [int(sentence.index) for sentence in examples]

      src_examples = [
          self.vocabs.src.tokens2indices(sentence.tokens[: self.max_len])
          for sentence in examples
      ]

      tgt_examples = [
          self.vocabs.tgt.token2index[sentence.label] for sentence in examples
      ]

      src_padded, src_lengths = pad_list(
          src_examples, pad_value=self.vocabs.src.PAD.hash
      )

      src_batch_tuple = BatchTuple(src_padded, src_lengths, None, None)

      tgt_batch_tuple = BatchTuple(tgt_examples, None, None, None)

      return Batch(
          indices=ids_batch, src=src_batch_tuple, tgt=tgt_batch_tuple
      )

# Let's instance our `batch_builder`, feed it into the `DataLoader` object alongside the  training and test examples,
# and let's inspect a single batch of examples.
batch_builder = SequenceClassificationBatchBuilder(
    vocabs, max_len=max_len
)

train_batches = DataLoader(
    train_examples,
    batch_size=batch_size,
    shuffle=True,
    num_workers=0,
    collate_fn=batch_builder,
)

test_batches = DataLoader(
    test_examples,
    batch_size=batch_size,
    shuffle=False,
    num_workers=0,
    collate_fn=batch_builder,
)
#----------------------

In [7]:
train_batches_iter = iter(train_batches)
train_batch = next(train_batches_iter)
train_batch.src.sequences

array([[    8,   177,   202, ...,    92,    11,    48],
       [  240,    21,     1, ..., 22520, 22520, 22520],
       [  994,  6211,   207, ..., 22520, 22520, 22520],
       ...,
       [    0,     0,     0, ...,  2227,     9,  1738],
       [   17,   457,     7, ..., 22520, 22520, 22520],
       [   17,  4196,   815, ..., 22520, 22520, 22520]])

### Q1-2. We choose bidirectional LSTM (BiLSTM) as the model. Train the model for 5 epoches with embedding matrix you obtained earlier, and for each epoch, print out the training loss, training accuracy, testing loss and testing accuracy. You could choose any appropriate loss function and values for hyperparameters. (25 marks)

- If you found difficulty understanding the structure of BiLSTM, you may refer to the supplementary note named *notes_on_lstm* inside tutorial 9 for detailed information.

- You definitely want to use GPU for this colab notebook. Go to Edit > Notebook settings as the following: Click on “Notebook settings” and select “GPU”.

In [None]:
def mean_pooling(batch_hidden_states, batch_lengths):
    batch_lengths = batch_lengths.float()
    batch_lengths = batch_lengths.unsqueeze(1)
    if batch_hidden_states.is_cuda:
        batch_lengths = batch_lengths.cuda()

    pooled_batch = torch.sum(batch_hidden_states, 1)
    pooled_batch = pooled_batch / batch_lengths.expand_as(pooled_batch)

    return pooled_batch


def max_pooling(batch_hidden_states):
    pooled_batch, _ = torch.max(batch_hidden_states, 1)
    return pooled_batch

def pack_rnn_input(embedded_sequence_batch, sequence_lengths):
    sequence_lengths = sequence_lengths.cpu().numpy()

    sorted_sequence_lengths = np.sort(sequence_lengths)[::-1]
    sorted_sequence_lengths = torch.from_numpy(
        sorted_sequence_lengths.copy()
    )

    idx_sort = np.argsort(-sequence_lengths)
    idx_unsort = np.argsort(idx_sort)

    idx_sort = torch.from_numpy(idx_sort)
    idx_unsort = torch.from_numpy(idx_unsort)

    if embedded_sequence_batch.is_cuda:
        idx_sort = idx_sort.cuda()
        idx_unsort = idx_unsort.cuda()

    embedded_sequence_batch = embedded_sequence_batch.index_select(
        0, idx_sort
    )

    # Handling padding in Recurrent Networks
    packed_rnn_input = nn.utils.rnn.pack_padded_sequence(
        embedded_sequence_batch,
        sorted_sequence_lengths,
        batch_first=True
    )

    return packed_rnn_input, idx_unsort

def unpack_rnn_output(packed_rnn_output, indices):
    encoded_sequence_batch, _ = nn.utils.rnn.pad_packed_sequence(
        packed_rnn_output, batch_first=True
    )

    encoded_sequence_batch = encoded_sequence_batch.index_select(0, indices)

    return encoded_sequence_batch

class BiLSTM(nn.Module):
    def __init__(self, embeddings, hidden_size, num_labels, input_dropout=0, output_dropout=0, bidirectional=True, num_layers=2, pooling='mean'):
        super(BiLSTM, self).__init__()
        self.embeddings = embeddings
        self.pooling = pooling
        self.input_dropout = nn.Dropout(input_dropout)
        self.output_dropout = nn.Dropout(output_dropout)
        self.bidirectional = bidirectional
        self.num_layers = num_layers
        self.num_labels = num_labels
        self.hidden_size = hidden_size
        self.input_size = self.embeddings.embedding_dim
        self.lstm = nn.LSTM(self.input_size, hidden_size, bidirectional=bidirectional, num_layers=num_layers, batch_first=True)
        self.total_hidden_size = self.hidden_size * (2 if self.bidirectional else 1)
        self.output_layer = nn.Linear(self.total_hidden_size, self.num_labels)
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, src_batch, tgt_batch=None):
        src_sequences = src_batch.sequences
        src_lengths = src_batch.lengths

        embedded_sequence_batch = self.embeddings(src_sequences)
        embedded_sequence_batch = self.input_dropout(embedded_sequence_batch)

        packed_rnn_input, indices = pack_rnn_input(embedded_sequence_batch, src_lengths)
        rnn_packed_output, _ = self.lstm(packed_rnn_input)
        encoded_sequence_batch = unpack_rnn_output(rnn_packed_output, indices)

        if self.pooling == "mean":
            pooled_batch = mean_pooling(encoded_sequence_batch, src_lengths)
        elif self.pooling == "max":
            pooled_batch = max_pooling(encoded_sequence_batch)
        else:
            raise NotImplementedError

        logits = self.output_layer(pooled_batch)
        _, predictions = logits.max(1)

        if tgt_batch is not None:
            targets = tgt_batch.sequences
            loss = self.loss_function(logits, targets)
        else:
            loss = None

        return loss, predictions, logits

In [None]:
epochs = 10
hidden_size = 300
log_interval = 10
num_labels = 2
input_dropout = 0.5
output_dropout = 0.5
bidirectional = True
num_layers = 2
pooling = 'mean'
lr = 0.001
gradient_clipping = 0.25

model = BiLSTM(
    embeddings=embeddings,
    hidden_size=hidden_size,
    num_labels=num_labels,
    input_dropout=input_dropout,
    output_dropout=output_dropout,
    bidirectional=bidirectional,
    num_layers=num_layers,
    pooling=pooling
)

model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(epochs):

    epoch_correct = 0
    epoch_total = 0
    epoch_loss = 0
    i = 0

    model.train()

    for batch in train_batches:
        batch.to_torch_(device)
        src_batch = batch.src
        tgt_batch = batch.tgt

        loss, predictions, logits = model(src_batch, tgt_batch=tgt_batch)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(
            model.parameters(),
            gradient_clipping)

        optimizer.step()
        correct = (predictions == tgt_batch.sequences).long().sum()
        total = tgt_batch.sequences.size(0)
        epoch_correct += correct.item()
        epoch_total += total
        epoch_loss += loss.item()
        i += 1

    accuracy  = 100 * epoch_correct / epoch_total

    print('Epoch {}'.format(epoch))
    print('Train Loss: {}'.format(epoch_loss / len(train_batches)))
    print('Train Accuracy: {}'.format(accuracy))

    test_epoch_correct = 0
    test_epoch_total = 0
    test_epoch_loss = 0

    model.eval()

    for batch in test_batches:

        ids_batch = batch.indices
        src_batch = batch.src
        tgt_batch = batch.tgt

        batch.to_torch_(device)

        loss, predictions, logits = model.forward(
            src_batch,
            tgt_batch=tgt_batch)

        correct = (predictions == tgt_batch.sequences).long().sum()
        total = tgt_batch.sequences.size(0)
        test_epoch_correct += correct.item()
        test_epoch_total += total
        test_epoch_loss += loss.item()

    test_accuracy = 100 * test_epoch_correct / test_epoch_total

    print('\n---------------------')
    print('Test Loss: {}'.format(test_epoch_loss / len(test_batches)))
    print('Test Accuracy: {}'.format(test_accuracy))
    print('---------------------\n')

Epoch 0
Train Loss: 0.6211941704750061
Train Accuracy: 63.968

---------------------
Test Loss: 0.45040971946716307
Test Accuracy: 79.236
---------------------

Epoch 1
Train Loss: 0.45934455215930936
Train Accuracy: 77.808

---------------------
Test Loss: 0.41602176943421365
Test Accuracy: 80.764
---------------------

Epoch 2
Train Loss: 0.37459966629743574
Train Accuracy: 83.244

---------------------
Test Loss: 0.3302415445446968
Test Accuracy: 86.256
---------------------

Epoch 3
Train Loss: 0.31761866080760953
Train Accuracy: 86.276

---------------------
Test Loss: 0.3459793331623077
Test Accuracy: 85.676
---------------------

Epoch 4
Train Loss: 0.273804653942585
Train Accuracy: 88.744

---------------------
Test Loss: 0.301221303999424
Test Accuracy: 87.864
---------------------

Epoch 5
Train Loss: 0.22896914321184159
Train Accuracy: 90.864

---------------------
Test Loss: 0.3400253424048424
Test Accuracy: 87.396
---------------------

Epoch 6
Train Loss: 0.20002151882648

In [11]:
def mean_pooling(batch_hidden_states, batch_lengths):
    batch_lengths = batch_lengths.unsqueeze(1)  # Shape: [batch_size, 1]
    return torch.sum(batch_hidden_states, dim=1) / batch_lengths

def max_pooling(batch_hidden_states):
    return torch.max(batch_hidden_states, 1).values

def pack_rnn_input(embedded_batch, lengths):
    lengths_sorted, indices_sorted = lengths.sort(descending=True)
    embedded_sorted = embedded_batch.index_select(0, indices_sorted)
    packed_input = nn.utils.rnn.pack_padded_sequence(embedded_sorted, lengths_sorted.cpu(), batch_first=True)
    return packed_input, indices_sorted.argsort()

def unpack_rnn_output(packed_output, unsort_indices):
    output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True)
    return output.index_select(0, unsort_indices)

class BiLSTM(nn.Module):
    def __init__(self, embeddings, hidden_size, num_labels, input_dropout=0, output_dropout=0, bidirectional=True, num_layers=2, pooling='mean'):
        super().__init__()
        self.embeddings = embeddings
        self.pooling = pooling
        self.lstm = nn.LSTM(embeddings.embedding_dim, hidden_size, num_layers=num_layers, bidirectional=bidirectional, batch_first=True)
        self.input_dropout = nn.Dropout(input_dropout)
        self.output_dropout = nn.Dropout(output_dropout)
        self.output_layer = nn.Linear(hidden_size * (2 if bidirectional else 1), num_labels)
        self.loss_function = nn.CrossEntropyLoss()

    def forward(self, src_batch, tgt_batch=None):
        embedded = self.input_dropout(self.embeddings(src_batch.sequences))
        packed_input, unsort_indices = pack_rnn_input(embedded, src_batch.lengths)
        packed_output, _ = self.lstm(packed_input)
        output = unpack_rnn_output(packed_output, unsort_indices)

        pooled_output = mean_pooling(output, src_batch.lengths) if self.pooling == 'mean' else max_pooling(output)
        pooled_output = self.output_dropout(pooled_output)
        logits = self.output_layer(pooled_output)
        loss = self.loss_function(logits, tgt_batch.sequences) if tgt_batch else None
        predictions = logits.argmax(1)

        return loss, predictions, logits

In [13]:
epochs = 5
hidden_size = 300
log_interval = 10
num_labels = 2
input_dropout = 0.5
output_dropout = 0.5
bidirectional = True
num_layers = 2
pooling = 'mean'
lr = 0.001
gradient_clipping = 0.25

model = BiLSTM(
    embeddings=embeddings,
    hidden_size=hidden_size,
    num_labels=num_labels,
    input_dropout=input_dropout,
    output_dropout=output_dropout,
    bidirectional=bidirectional,
    num_layers=num_layers,
    pooling=pooling
)

model.to(device)

optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(epochs):

    epoch_correct = 0
    epoch_total = 0
    epoch_loss = 0
    i = 0

    model.train()

    for batch in train_batches:
        batch.to_torch_(device)
        src_batch = batch.src
        tgt_batch = batch.tgt

        loss, predictions, logits = model(src_batch, tgt_batch=tgt_batch)
        loss.backward()

        torch.nn.utils.clip_grad_norm_(
            model.parameters(),
            gradient_clipping)

        optimizer.step()
        correct = (predictions == tgt_batch.sequences).long().sum()
        total = tgt_batch.sequences.size(0)
        epoch_correct += correct.item()
        epoch_total += total
        epoch_loss += loss.item()
        i += 1

    accuracy  = 100 * epoch_correct / epoch_total

    print('Epoch {}'.format(epoch))
    print('Train Loss: {}'.format(epoch_loss / len(train_batches)))
    print('Train Accuracy: {}'.format(accuracy))

    test_epoch_correct = 0
    test_epoch_total = 0
    test_epoch_loss = 0

    model.eval()

    for batch in test_batches:

        ids_batch = batch.indices
        src_batch = batch.src
        tgt_batch = batch.tgt

        batch.to_torch_(device)

        loss, predictions, logits = model.forward(
            src_batch,
            tgt_batch=tgt_batch)

        correct = (predictions == tgt_batch.sequences).long().sum()
        total = tgt_batch.sequences.size(0)
        test_epoch_correct += correct.item()
        test_epoch_total += total
        test_epoch_loss += loss.item()

    test_accuracy = 100 * test_epoch_correct / test_epoch_total

    print('Test Loss: {}'.format(test_epoch_loss / len(test_batches)))
    print('Test Accuracy: {}'.format(test_accuracy))

Epoch 0
Train Loss: 0.3280042085647583
Train Accuracy: 85.592
Test Loss: 0.3543203083872795
Test Accuracy: 84.696
Epoch 1
Train Loss: 0.24977229115366936
Train Accuracy: 89.792
Test Loss: 0.33911906656622887
Test Accuracy: 86.684
Epoch 2
Train Loss: 0.22109418520331384
Train Accuracy: 90.964
Test Loss: 0.38633763824403283
Test Accuracy: 85.132
Epoch 3
Train Loss: 0.20376477387547492
Train Accuracy: 91.924
Test Loss: 0.4107707781791687
Test Accuracy: 84.32
Epoch 4
Train Loss: 0.18492900213599206
Train Accuracy: 92.66
Test Loss: 0.40320356205105784
Test Accuracy: 85.832


## Q2. (50 marks)
### Implement the idea in paper ***A Neural Probabilistic Language Model*** (https://www.jmlr.org/papers/volume3/bengio03a/bengio03a.pdf) to train a trigram model. We will use the brown corpus in nltk package as the dataset. Train the model for 5 epoches and print out the training loss, training accuracy, testing loss, and testing accuracy. You can use these codes to download the corpus:

```
import nltk
nltk.download("brown")
from nltk.corpus import brown
```

In [None]:
# 1. create brown corpus again with all words

# 2. create term frequency of the words and vocabulary

# 3. creating training and dev set

# 4. define Trigram Neural Network Model

# 5. using negative log-likelihood loss

# ------------------------- TRAIN & SAVE MODEL ------------------------

In [None]:
import nltk
import torch
import numpy as np
from nltk.corpus import brown
from collections import Counter
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset
import time
import multiprocessing

nltk.download("brown")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


True

In [None]:
EMBEDDING_DIM = 200
CONTEXT_SIZE = 2
HIDDEN_DIM = 100
BATCH_SIZE = 256
EPOCHS = 5
UNK_SYMBOL = "<UNK>"
MIN_FREQ = 5
DEVICE = 'cuda' if torch.cuda.is_available() else 'cpu'
NUM_WORKERS = multiprocessing.cpu_count()

In [None]:
# 1. Load and preprocess the Brown corpus
corpus = [word.lower() for para in brown.paras() for sent in para for word in sent]
vocab_freq = Counter(corpus)
vocab = {word for word, freq in vocab_freq.items() if freq >= MIN_FREQ}
vocab.add(UNK_SYMBOL)
word_to_id = {word: idx for idx, word in enumerate(vocab)}
UNK_ID = word_to_id[UNK_SYMBOL]

Loading and preprocessing Brown corpus...


In [None]:
# 2. Create term frequency and vocabulary
def word_to_id_fn(word):
    return word_to_id.get(word, UNK_ID)

In [None]:
# 3. Creating trigrams
data = np.array([[word_to_id_fn(corpus[i]), word_to_id_fn(corpus[i + 1]), word_to_id_fn(corpus[i + 2])]
                 for i in range(len(corpus) - 2)])
train_data, dev_data = data[:int(0.8 * len(data))], data[int(0.8 * len(data)):]
train_loader = DataLoader(TensorDataset(torch.tensor(train_data[:, :2]), torch.tensor(train_data[:, 2])),
                          batch_size=BATCH_SIZE, shuffle=True, num_workers=NUM_WORKERS)
dev_loader = DataLoader(TensorDataset(torch.tensor(dev_data[:, :2]), torch.tensor(dev_data[:, 2])),
                        batch_size=BATCH_SIZE, num_workers=NUM_WORKERS)

Creating trigrams...


In [None]:
# 4. Define Trigram Neural Network Model
class TrigramNN(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.fc1 = nn.Linear(embedding_dim * CONTEXT_SIZE, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, vocab_size)

    def forward(self, x):
        x = self.embedding(x.to(DEVICE)).view(x.size(0), -1)
        x = torch.relu(self.fc1(x))
        return torch.log_softmax(self.fc2(x), dim=1)

In [None]:
# 5. Using negative log-likelihood loss
model = TrigramNN(len(vocab), EMBEDDING_DIM, HIDDEN_DIM).to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=2e-3)
criterion = nn.NLLLoss()

def compute_accuracy(log_probs, labels):
    return (log_probs.argmax(dim=1) == labels).float().mean().item()

def evaluate_model(loader):
    model.eval()
    total_loss, total_acc = 0, 0
    with torch.no_grad():
        for context, target in loader:
            context, target = context.to(DEVICE), target.to(DEVICE)  # Move to DEVICE
            log_probs = model(context)
            total_loss += criterion(log_probs, target).item()
            total_acc += compute_accuracy(log_probs, target)
    return total_acc / len(loader), total_loss / len(loader)

In [None]:
# ------------------------- TRAIN & SAVE MODEL ------------------------
best_acc = 0
for epoch in range(EPOCHS):
    model.train()
    for context, target in train_loader:
        context, target = context.to(DEVICE), target.to(DEVICE)  # Move to DEVICE
        optimizer.zero_grad()
        log_probs = model(context)
        loss = criterion(log_probs, target)
        loss.backward()
        optimizer.step()

    dev_acc, dev_loss = evaluate_model(dev_loader)
    print(f"Epoch {epoch + 1}/{EPOCHS}: Dev Accuracy: {dev_acc:.4f}, Dev Loss: {dev_loss:.4f}")

    if dev_acc > best_acc:
        best_acc = dev_acc
        torch.save(model.state_dict(), f"best_model_epoch_{epoch + 1}.pth")
        print(f"New best model saved with accuracy: {best_acc:.4f}")

print("Training done")

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


Loading and preprocessing Brown corpus...
Creating trigrams...
Initializing model...
--- Training starts ---
Epoch 1/5: Dev Accuracy: 0.1367, Dev Loss: 5.8231
New best model saved with accuracy: 0.1367
Epoch 2/5: Dev Accuracy: 0.1430, Dev Loss: 5.8974
New best model saved with accuracy: 0.1430
Epoch 3/5: Dev Accuracy: 0.1458, Dev Loss: 6.0951
New best model saved with accuracy: 0.1458
Epoch 4/5: Dev Accuracy: 0.1469, Dev Loss: 6.4017
New best model saved with accuracy: 0.1469
Epoch 5/5: Dev Accuracy: 0.1452, Dev Loss: 6.7481
Training complete!


## Q3. (10 marks)

### Call the chatglm-4 API, and write a proper prompt using prompt engineering knowledge to let chatglm perform the task correctly:

``Take the last letters of the words and concatenate them.``


In [None]:
!pip install zhipuai

Collecting zhipuai
  Downloading zhipuai-2.1.5.20230904-py3-none-any.whl.metadata (10 kB)
Collecting pyjwt<2.9.0,>=2.8.0 (from zhipuai)
  Downloading PyJWT-2.8.0-py3-none-any.whl.metadata (4.2 kB)
Downloading zhipuai-2.1.5.20230904-py3-none-any.whl (104 kB)
[2K   [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m105.0/105.0 kB[0m [31m2.9 MB/s[0m eta [36m0:00:00[0m
[?25hDownloading PyJWT-2.8.0-py3-none-any.whl (22 kB)
Installing collected packages: pyjwt, zhipuai
  Attempting uninstall: pyjwt
    Found existing installation: PyJWT 2.10.0
    Uninstalling PyJWT-2.10.0:
      Successfully uninstalled PyJWT-2.10.0
Successfully installed pyjwt-2.8.0 zhipuai-2.1.5.20230904


In [None]:
words_list= ['Linius Victor', 'strawberry cake', 'Nice headshot', 'Cristiano Ronaldo', 'Brawl Star', 'Natural Language Processing']

In [None]:
import zhipuai
from zhipuai import ZhipuAI

client = ZhipuAI(api_key="6026d5961c40106882cd6848016ed219.06LGFZj3PNth6GmI")

# Define the task
messages = [
    {
        "role": "user",
        "content": (
            "Your task is to extract and concatenate the **last letter** of **each word** from the following phrases.\n"
            "Make sure to:\n"
            "1. **Extract the last letter** of every word in the phrase.\n"
            "2. **Concatenate** all extracted letters into one continuous string.\n"
            "3. Return a **list** of results for each phrase.\n\n"
            "**Example:**\n"
            "- Phrase: 'Big Red Car'\n"
            "- Extraction: 'g' from 'Big', 'd' from 'Red', 'r' from 'Car'\n"
            "- Result: 'gdr'\n\n"
            "**Phrases to process:**\n"
            "1. Linius Victor\n"
            "2. strawberry cake\n"
            "3. Nice headshot\n"
            "4. Cristiano Ronaldo\n"
            "5. Brawl Star\n"
            "6. Natural Language Processing\n\n"
            "Please return a **list** of concatenated last letters for each phrase."
        )
    }
]

# Call the API
response = client.chat.completions.create(
    model="glm-4-plus",
    messages=messages,
)

# Output the result
output_message = response.choices[0].message.content.strip()
print(output_message)

To accomplish this task, I will follow the steps outlined:

1. Extract the last letter of each word in the phrase.
2. Concatenate all extracted letters into one continuous string.
3. Return a list of results for each phrase.

Here is the list of concatenated last letters for each provided phrase:

1. **Linius Victor**
   - 'Linius' -> 's'
   - 'Victor' -> 'r'
   - Result: 'sr'

2. **strawberry cake**
   - 'strawberry' -> 'y'
   - 'cake' -> 'e'
   - Result: 'ye'

3. **Nice headshot**
   - 'Nice' -> 'e'
   - 'headshot' -> 't'
   - Result: 'et'

4. **Cristiano Ronaldo**
   - 'Cristiano' -> 'o'
   - 'Ronaldo' -> 'o'
   - Result: 'oo'

5. **Brawl Star**
   - 'Brawl' -> 'l'
   - 'Star' -> 'r'
   - Result: 'lr'

6. **Natural Language Processing**
   - 'Natural' -> 'l'
   - 'Language' -> 'e'
   - 'Processing' -> 'g'
   - Result: 'leg'

**Final List:**
```python
['sr', 'ye', 'et', 'oo', 'lr', 'leg']
```

This list contains the concatenated last letters for each of the given phrases.
