# Advanced Text Analytics Lab 2

This notebook is the second of two lab notebooks that you will submit as part of your assessment for the Advanced Data Analytics unit. The notebook contains three required sections plus an optional section:

4. **Introducing Transformers:** This section introduces the Transformers library from HuggingFace, showing you how to use it to obtain contextualised embeddings from pretrained transformer models.

5. **Question Answering with Pretrained Transformers:** Learn about how to use a pretrained model to perform automatic question answering. 

6. **Transformers for Text Classification:** Here we show you how to construct a classifier using Transformers.

7. **OPTIONAL: More on Transformers:** Some pointers to other materials if you want to learn more about transformers, e.g., if using them in your summer project. 

Example code for all the tasks has been tested on a three-year old MacBook Pro, and the longest training process took under 10 minutes. If you find that the code takes too long to run on your own machine, you can try [Google Colab](https://colab.research.google.com/), Amazon Sagemaker Studio, or use lab machines on campus provided by the school. 

## Learning Outcomes

These sections will contain tutorial-like instructions, as you have seen in previous text analytics labs. On completing these sections, the intended learning outcomes are that you will be able to...
1. Use pretrained transformers to obtain contextualised word and sentence embeddings.
1. Apply a pretrained QA model to a new dataset. 
1. Construct classifiers with pretrained transformers. 
1. Find documentation on pretrained models in the Transformers library.

## Your Tasks

Inside each of these sections there are several **'To-do's**, which you must complete for your summative assessment. Your marks will be based on your answers to these to-dos. Please make sure to:
1. Include the output of your code in the saved notebook. Plots and printed output should be visible without re-running the code. 
1. Include all code needed to generate your answers.
1. Provide sufficient comments to understand how your method works.
1. Write text in a cell in markdown format where a written answer is required. You can convert a cell to markdown format by pressing Escape-M. 

There are also some unmarked 'to-do's that are part of the tutorial to help you learn how to implement and use the methods studied here.

## Good Academic Practice

Please follow [the guidance on academic integrity provided by the university](http://www.bristol.ac.uk/students/support/academic-advice/academic-integrity/).
You are required to write your own answers -- do not share your notebooks or copy someone else's writing. Do not copy text or long blocks of code directly into the notebook from online sources -- always rewrite in your own way. Breaking the rules can lead to strong penalties. 

## Marking Criteria

1. The coursework (both notebooks) is worth 30% of the unit in total. 
1. There is a total of 100 marks available for both lab notebooks. 
1. This notebook is worth 50 of those marks.
1. The number of marks for each to-do out of 100 is shown alongside each to-do.
1. For to-dos that require you to write code, a good solution would meet the following criteria (in order of importance):
   1. Solves the task or answers the question asked in the to-do. This means, if the code cells in the notebook are executed in order, we will get the output shown in your notebook.
   1. The code is easy to follow and does not contain unnecessary steps.
   1. The comments show that you understand how your solution works.
   1. A very good answer will also provide code that is computationally efficient but easy to read.
1. You can use any suitable publicly available libraries. Unless the task explicitly asks you to implement something from scratch, there is no penalty for using libraries to implement some steps.

## Support

The main source of support will be during the remaining lab sessions (Fridays 3-6pm) for this unit. 

The TAs and lecturer will help you with questions about the lectures, the code provided for you in this notebook, and general questions about the topics we cover. For the marked 'to-dos', they can only answer clarifying questions about what you have to do. 

Office hours: You can book office hours with Edwin on Mondays 3pm-5pm by sending him an email (edwin.simpson@bristol.ac.uk). If those times are not possible for you, please contact him by email to request an alternative. 

## Deadline

The notebook must be submitted along with the second notebook on Blackboard before **Wednesday 24th May at 13.00**. 

## Submission

You will need to zip up this notebook with the previous notebook into a single .zip file, which you will submit to Blackboard through the 'assessment, submission and feedback' link on the left sidebar. 

Please name your files like this:
   * Name this notebook ADA2_<student_number>.ipynb
   * Name the zip file <student_number>.zip
   * Please don't use your name anywhere as we want to mark anonymously. 

In [245]:
import numpy as np
import torch 
from datasets import load_dataset

cache_dir = "./data_cache"

# 4. Pretrained Transformers (max. 15 marks)

HuggingFace is a company that has developed an open source library for loading pretrained transformer models. They also distribute many models that have been pretrained using language modelling tasks, or fine-tuned to specific downstream NLP tasks.  It is currently the best library to use to create NLP models on top of large, deep neural networks. This is especially useful for tasks where simpler, feature-based methods or smaller LSTM models do not perform well enough, for example, when complex processing of syntax and semantics is required (natural language 'understanding'). 

Let's start by looking at two key types of object in the transformers library: models and tokenizers.

## 4.1. Models

The neural network models available in the Transformers library are accessed through wrapper classes such as `AutoModel`. If we want to load a pretrained model, we can simply pass its name to the `from_pretrained` function, and the pretrained model weights will be downloaded from HuggingFace and a neural network model will be created with those weights. For example:

In [246]:
from transformers import AutoModel # For BERTs

model = AutoModel.from_pretrained("huawei-noah/TinyBERT_General_4L_312D") 

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertModel: ['fit_denses.2.weight', 'fit_denses.1.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.4.weight', 'fit_denses.1.bias', 'fit_denses.0.bias', 'fit_denses.0.weight', 'cls.seq_relationship.bias', 'fit_denses.3.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.bias', 'cls.seq_relationship.weight', 'fit_denses.2.bias', 'fit_denses.3.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertModel from the checkpoint of a model that you expect to be exactly identical (initializing

This code loads the TinyBERT model, which is a compressed version of BERT. It has 4.4 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While TinyBERT will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/huawei-noah/TinyBERT_General_4L_312D).

<!--the RoBERTa variant of BERT. It has 4.4 million parameters, compared to the standard version of BERT, 'BERT-base', which has 110 million parameters. While RoBERTa-tiny will not perform as well as larger models, we will use it for this notebook to save memory and computation costs. See [documentation here](https://huggingface.co/arampacha/roberta-tiny).  -->

The same functions can be used to load other models from HuggingFace's repository simply by changing the model's name. Take a look at [the Models page](https://huggingface.co/models) so see what there is on offer. Do you recognise any of the models' names?

# 4.2. Tokenizers

Before we can apply a model to some text, we need to a create Tokenizer object. In Transfomers, Tokenizer objects convert raw text to a sequence of numbers. First, the tokenizer actually performs tokenization, then it maps each token to its numerical ID. There are lots of different tokenizers that we can use to preprocess text. If we are loading a pretrained model, we will need to choose the tokenizer that corresponds to that model. 

We can load the right tokenizer as follows, in the same way we loaded the model itself:

In [247]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")

Let's see what the TinyBERT tokenizer does to an example sentence:

In [248]:
sentence = "The transformer architecture has transformed the field of NLP."

tokens = tokenizer.tokenize(sentence)
print(tokens)

['the', 'transform', '##er', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'nl', '##p', '.']


Let's compare with the NLTK tokenizer we have seen before:

In [249]:
from nltk.tokenize import word_tokenize

nltk_tokens = word_tokenize(sentence)
print(nltk_tokens)

['The', 'transformer', 'architecture', 'has', 'transformed', 'the', 'field', 'of', 'NLP', '.']


While NLTK keeps whole words as tokens, the BERT tokenizer splits some words into sub-words and inserts some special characters into the tokens. Splitting is applied to words with low frequency in the training set, such as 'transformer'. 

**TO-DO 4a:** What is the benefit of splitting rare words into sub-word tokens? **(2 marks)**

Better generalisation to unseen data as rare words are split into more general forms. Furthermore, a smaller vocabulary has to be stored by the network, thus reducing the number of model parameters, and therefore storage and computation costs.


---

It is important to use the right tokenizer with a pretrained model as each model was trained with text tokenized in a particular way. After tokenization, the Tokenizer object can also map the tokens to their IDs (indexes in the vocabulary):

In [250]:
ids = tokenizer.convert_tokens_to_ids(tokens)

print(ids)

[1996, 10938, 2121, 4294, 2038, 8590, 1996, 2492, 1997, 17953, 2361, 1012]


## 4.3. Contextualised Embeddings

Now that we have a sequence of tokens, we are almost ready to process the sequence using the pretrained model. 

Our model takes as input a PyTorch `tensor` object. In PyTorch, `tensor` is a muli-dimensional matrix. Here, we need a two-dimensional matrix, where each row is a sequence of input tokens corresponding to a single sentence or document. Let's convert our list of IDs to a 2-D tensor with a single row:

In [251]:
ids_tensor = torch.tensor([ids])

print(ids_tensor)

tensor([[ 1996, 10938,  2121,  4294,  2038,  8590,  1996,  2492,  1997, 17953,
          2361,  1012]])


Now we can process the sequence using our model. The model maps the sequence of input IDs to a sequence of output vectors, which are contextualised word embeddings. The hidden state values produced in the last hidden layer of the model are used as the contextualised embeddings:

In [252]:
model_outputs = model(ids_tensor)
print('The complete model outputs: ')
print(model_outputs)

print()
print('The last hidden state for the first token in the sequence (the first word embedding): ')
embeddings = model_outputs['last_hidden_state'][0]
print(embeddings)

The complete model outputs: 
BaseModelOutputWithPoolingAndCrossAttentions(last_hidden_state=tensor([[[ 0.3608,  0.2862, -0.1549,  ..., -0.2064,  0.2663, -0.0109],
         [ 0.0149,  0.7223, -0.0508,  ..., -0.5505,  0.2355, -0.2962],
         [ 0.1531,  0.5903, -0.1244,  ..., -0.4263,  0.0417, -0.1839],
         ...,
         [ 0.1742, -0.1091, -0.1963,  ..., -0.6736,  0.0472, -0.1840],
         [ 0.2434,  0.1021, -0.2241,  ..., -0.5400, -0.1691, -0.1314],
         [ 0.0854,  0.3272, -0.3016,  ..., -0.2154, -0.5632, -0.1921]]],
       grad_fn=<NativeLayerNormBackward0>), pooler_output=tensor([[-1.1380e-02, -6.3006e-03,  1.8521e-02,  7.1139e-03, -3.1795e-02,
          1.3882e-02, -1.5459e-02, -1.0611e-03, -1.8263e-02, -3.6515e-02,
         -2.1257e-02, -1.5479e-02, -2.8088e-04, -4.1092e-02, -2.5315e-02,
         -4.3338e-02, -1.1616e-03, -1.3931e-02,  6.0733e-03,  4.3790e-03,
          2.7091e-04, -2.1810e-02, -4.8026e-02,  2.5493e-02, -1.6502e-02,
         -1.2034e-03,  4.2757e-02,  3.

We can retrieve the embedding vector for "transform" like this ("transform" is the second token in the sequence):

In [34]:
emb = embeddings[1]  # get second embedding in the sequence

# convert it to a numpy array so we can perform various operations on it later on
emb = emb.detach().numpy()

print(emb)
print(f'The TinyBERT embeddings have {emb.shape[0]} dimensions.')

[ 1.49146989e-02  7.22317934e-01 -5.07868007e-02 -2.74206042e-01
 -1.38931692e-01  1.00099623e+00  7.11511075e-03  2.71392703e-01
 -3.92816365e-02  6.04106337e-02  1.25740662e-01  4.60632563e-01
  6.25249092e-03  1.61929548e-01  1.23912975e-01 -4.08096373e-01
  1.24867529e-01 -4.71536100e-01  2.24769399e-01  6.35191500e-02
  8.56178403e-02 -1.88044891e-01  1.77257985e-01  3.40049684e-01
 -1.95546120e-01  1.58554226e-01  9.62866843e-02  1.12648718e-01
  2.21045241e-01 -9.56113458e-01 -3.85948360e-01  1.39220908e-01
  5.90012252e-01 -8.06728959e-01 -1.34287193e-01  2.35692158e-01
 -1.02275051e-01  2.78303325e-01  7.94321120e-01 -2.49363333e-01
  1.72771603e-01 -2.07582667e-01  3.00157368e-01 -8.59338269e-02
 -2.25284770e-01 -9.75404754e-02 -3.52349609e-01  3.81161809e-01
 -3.87680292e-01 -1.77613273e-01 -4.13685918e-01  1.38046771e-01
  1.29876360e-02  6.52684271e-01  1.16502658e-01 -5.10778129e-01
 -8.30419511e-02 -2.67040953e-02  3.12862575e-01 -2.62848616e-01
 -1.43284917e-01  1.10270

TO-DO 4b: Retrieve the embedding for "architecture" (this to-do will not be marked).

In [253]:
emb_architecture = embeddings[3].detach().numpy()
print(emb_architecture)

[ 2.71386623e-01  7.74581969e-01 -3.24256986e-01 -7.14323372e-02
 -4.95713204e-04  9.37310040e-01 -4.40213643e-03 -4.26920839e-02
  1.27402283e-02  1.89264063e-02  1.02528736e-01  4.54656899e-01
  2.70435750e-01  2.30988830e-01  4.03643958e-03 -1.08995616e-01
 -4.59914207e-02 -3.51154298e-01 -1.34710521e-01  8.29390585e-02
  1.86496526e-01  5.00277281e-02  7.21668378e-02  2.28657812e-01
 -2.19697237e-01  9.40194428e-02  1.65540382e-01  1.85794443e-01
  3.17783386e-01 -5.09366930e-01 -5.00949264e-01  1.52488053e-01
  4.57998842e-01 -8.51876020e-01 -1.58632264e-01  1.58965096e-01
  4.16190289e-02  2.30997831e-01  8.78503025e-01 -6.23165891e-02
  1.87219098e-01 -1.23371631e-02  2.10084453e-01  3.48072127e-02
 -2.51239985e-01 -1.37914240e-01 -3.88697296e-01  2.98189580e-01
 -2.92032897e-01 -3.19503605e-01 -1.98434696e-01  1.32033080e-01
 -6.46374822e-02  7.43182719e-01  7.14235753e-02 -3.02117765e-01
  3.49781513e-01 -5.81784919e-02  2.85068393e-01 -4.09581661e-01
 -1.03296652e-01  1.03767

Sentences and documents usually have varying lengths. So, to put multiple sentences into a single tensor, we need to pad the sequences up to a maximum length. Luckily, the tokenizer class takes care of this for us. When we pass in a list of sentences, the tokenizer creates a matrix, where each row is a sequence of the same length:

In [256]:
sentences = [
    "They received a loan from the bank.",
    "It was not good for either his bank balance or his blood pressure.",
    "She walked along the bank of the river towards the city.",
    "They bank their cheques on Thursdays.",
    "She walked along the embankment towards the city."
]

model_inputs = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")  

print(model_inputs)

{'input_ids': tensor([[  101,  2027,  2363,  1037,  5414,  2013,  1996,  2924,  1012,   102,
             0,     0,     0,     0,     0,     0],
        [  101,  2009,  2001,  2025,  2204,  2005,  2593,  2010,  2924,  5703,
          2030,  2010,  2668,  3778,  1012,   102],
        [  101,  2016,  2939,  2247,  1996,  2924,  1997,  1996,  2314,  2875,
          1996,  2103,  1012,   102,     0,     0],
        [  101,  2027,  2924,  2037, 18178, 10997,  2006,  9432,  2015,  1012,
           102,     0,     0,     0,     0,     0],
        [  101,  2016,  2939,  2247,  1996, 22756,  2875,  1996,  2103,  1012,
           102,     0,     0,     0,     0,     0]]), 'token_type_ids': tensor([[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]]), 'attention_mask': t

`model_inputs` is a dictionary containing three objects:
 * The `input_ids` are the list of token IDs in the input sequences. 
 * The `attention_mask` records which tokens are special padding tokens and which are real tokens. Tokens with a 0 in the attention mask will be ignored.
 * `token_type_ids` is needed when two sequences are passed together as input to the model for tasks such as next sentence prediction that involve comparing two sentences. Here, each input is a single sentence, so we have only one type of token in the output above. 
 
TO-DO 4c: What value do the special padding tokens have? (this to-do is unmarked)

ANSWER: 0

---

Notice that the input_ids all start with the same token ID, 101, even though they have different first words. They also have token ID 102 before the padding tokens. This is because the tokenizer inserts two special tokens, which are used in some applicaions of BERT. 101 is the '[CLS]' token, which is a dummy token whose embedding can be trained to represent the whole sequence. the [CLS] token's embedding can then be used as input to a text classifier to classify a sentence or document. Token 102 is '[SEP]', which can be used to separate multiple input sequences in a single example. This is needed in tasks where multiple pieces of text are provided as input, e.g., a to build a classifier that can determine whether two sentences contradict each other. 

We can now pass all of the model inputs to the model to produce a set of contextualised embeddings:

In [257]:
# model_inputs is a dictionary, so to provide the arguments to model(), 
# we use the double star to unpack the dictionary so that each key in the dictionary is
# an argument to model() and each value is the value of the argument. 
model_outputs = model(**model_inputs) 

**TO-DO 4d:** The first four example sentences above all contain the word "bank", and the last example contains "embankment". Obtain a list of contextualised word embeddings for 'bank' and 'embankment' in the example sentences using our model. **(4 marks)**

Hint: you may need to convert tensors to numpy arrays.

In [258]:
#WRITE YOUR OWN CODE HERE
idxs = []

for sentence in sentences:
    tokens = tokenizer.tokenize(sentence)  # tokenise each sentence
    for i, token in enumerate(tokens):
        if token == 'bank' or 'embankment':
            idxs.append(i)  # get the token in each sentence for either 'bank' or 'embankment'

embeddings = []

for i, output in enumerate(model_outputs['last_hidden_state']):
    embeddings.append(output[idxs[i+1]].detach().numpy())  # get the embedding for the token (stored in last_hidden_state of output)

**TO-DO 4e:** Compute the similarities between these embeddings in the cell below, and show the results. Which embeddings are most similar to one another and why? **(6 marks)**

**ANSWER:** Sentence 3 and sentence 5 are the most similar. This is because are the most contextually similar as they both represent the idea of walking along the bank of a river even though they are fundamentally different words ('bank' vs 'embankment').


In [264]:
def cos_sim(a, b):  # function for cosine-similarity from notebook 1
    dot_p = np.dot(a, b)
    norm_mult = np.linalg.norm(a) * np.linalg.norm(b)
    return dot_p / norm_mult

"""
Code below loops through each embedding and computes the cosine similarity for each pair of tokens
"""
for i in range(len(embeddings)):
    for j in range(i + 1, len(embeddings)):
        print(f'S{i+1}, S{j+1}: {cos_sim(embeddings[i], embeddings[j])}')

S1, S2: 0.31865963339805603
S1, S3: 0.1883321851491928
S1, S4: 0.2595188319683075
S1, S5: 0.17240209877490997
S2, S3: 0.17107850313186646
S2, S4: 0.1947207897901535
S2, S5: 0.11836692690849304
S3, S4: 0.258332759141922
S3, S5: 0.44357869029045105
S4, S5: 0.3239094614982605


**TO-DO 4f:** Use the [CLS] token's embedding to find the most similar **sentence** to "She walked along the embankment towards the city." from the first four sentences. Print the similarities and the selected sentence. **(3 marks)**

In [265]:
cls_embeddings = []

for i, output in enumerate(model_outputs['last_hidden_state']):
    cls_embeddings.append(output[0].detach().numpy())  # get the CLS token stored in the first index of each sentence

for i in range(0, len(cls_embeddings) - 1):
    print(f'Similarity to sentence {i+1}: {cos_sim(cls_embeddings[i], cls_embeddings[4])}')  # compute each pairwise cosine similarity for each pairwise combination of sentences

Similarity to sentence 1: 0.9093042016029358
Similarity to sentence 2: 0.793634831905365
Similarity to sentence 3: 0.9947959780693054
Similarity to sentence 4: 0.8988917469978333


**ANSWER:** Selected sentence is sentence 3 as it has the highest cosine similarity to sentence 5 (0.995).

# 5. Question Answering with Pretrained Transformers (max. 11 marks)

The previous section showed us how to obtain a sequence of contextualised word embeddings using a pretrained transformer. How are these embeddings used to extract answers from documents to a given question?

First, let's load up the [Tweet QA](https://huggingface.co/datasets/tweet_qa) dataset, which we will use to test a pretrained question answering (QA) model. This dataset contains tweets along with questions about the information in the tweets, and a list of correct answers. As we are not going to train our own QA model (it requires a lot of compute time), we will only need the validation set:

In [266]:
from sklearn.metrics import f1_score

val_dataset = load_dataset(
    "tweet_qa",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)
print(f"Validation dataset with {len(val_dataset)} instances loaded")

Found cached dataset tweet_qa (C:/Users/Morg/OneDrive/Documents/MSc/TB2/advanced_data_analytics/text_analytics/advanced-labs-public/data_cache/tweet_qa/default/1.0.0/7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777)


Validation dataset with 1086 instances loaded


Now we are working with complete dataset using the HuggingFace datasets library. In the next cell, we create a tokenizer to tokenize the examples in the dataset. We need to choose the right tokenizer for the QA model we want to use, so let's decide to use `"distilbert-base-cased-distilled-squad"` as our pretrained model. This is based on a smaller version of BERT, called Distilbert, which was fine-tuned on the SQUAD question answering dataset.

In [267]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-cased-distilled-squad") 

def tokenize_function(dataset):
    # Pass two strings to the tokenizer -- it will concatenate them with a [SEP] special token between them. 
    model_inputs = tokenizer(dataset['Question'], dataset['Tweet'], padding="max_length", max_length=200, truncation='only_second')
    return model_inputs

Again, we can use the `map()` method to apply the tokenizer to each example in the dataset. 

In [268]:
val_dataset = val_dataset.map(tokenize_function, batched=True) 

Loading cached processed dataset at C:\Users\Morg\OneDrive\Documents\MSc\TB2\advanced_data_analytics\text_analytics\advanced-labs-public\data_cache\tweet_qa\default\1.0.0\7d588f7f477946b10f60c035ca55175737315ac446102b015218af38d2638777\cache-da98121affcce4b7.arrow


The type of QA model we are going to work with is _extractive_, meaning that the model will extract the answer from the 'context' (also known as the 'passage' or 'source document'). It does this by identifying the index of the start and end tokens of the answer span within the context, or returning `(0, 0)` (the index 0 for both the start and end token) if the context does not contain an answer to the given question. 

As explained in the lectures, BERT forms the basis of the QA model, and maps each token to a contextualised embedding. The QA model then maps each token's contextualised embedding to the probability that the token is the start of the answer span, and to the probability that the token is the end of the answer span. The layers that map the embeddings to the start and end probabilities are known as the 'head' of the model. [The original BERT paper](https://arxiv.org/pdf/1810.04805.pdf) depicts the QA model like this (Devlin et al., 2018):

<img src="bert_qa.png" alt="BERT QA diagram from the slides in week 10 showing the embedding of each token connected to the start and end output layers" width="400px"/>

We can see a similar structure in most neural network models. Our original text classifier from the first notebook used a fully-connected layer to produce a hidden representation of the whole sentence (rather than using BERT to produce a sequence of embeddings). This hidden representation was then fed to an output layer to produce a probability distribution over class labels (rather than the start and end probabilities):

<img src="neural_text_classifier_smaller.png" alt="Neural text classifier diagram from the slides in lecture 8.1" width="400px"/>


<!--With transformers, 
we can do something very similar, by connecting the transfomer's output to a fully-connected layer. However, with BERT, we do not need to pass the embedding of each individual word to the fully-connected layer because there is a special [CLS] token that represents the whole sentence:

The code below shows how to access a tensor containing the [CLS] embeddings:-->

Now, we have the dataset in the right format, let's see how to load a pretrained QA model based on a pretrained transformer. The QA model was trained by taking a pretrained BERT model (pretrained on masked language modelling with unlabelled text), adding the QA head, then further training the complete model on a QA dataset. 

The transformers library provides some useful wrapper classes for loading pretrained models for various NLP tasks, such as QA or text classification. These 'auto' classes are documented here: https://huggingface.co/docs/transformers/model_doc/auto . Let's use an auto class to load the `"distilbert-base-cased-distilled-squad"` pretrained QA model (this code will try to reload the model from a cache or download the model from HuggingFace):

In [269]:
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-cased-distilled-squad")

As our model was pretrained, we can use it directly on our Tweet_QA dataset (you may see a message to this effect when you run the cell above the first time). 

So, how do we get a prediction from the model? Let's take a single example from Tweet_QA and obtain the start and end probabilities for all tokens in the 'context':

In [270]:
def predict_nn(qa_model, dataset):
    
    # Switch off dropout
    qa_model.eval()

    # Pass the required inputs from the dataset to the model    
    output = qa_model(attention_mask=torch.tensor(dataset["attention_mask"]), input_ids=torch.tensor(dataset["input_ids"]))
        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    probs_start = torch.nn.Softmax(dim=1)(output["start_logits"]).detach().numpy()
    probs_end = torch.nn.Softmax(dim=1)(output["end_logits"]).detach().numpy()
        
    return probs_start, probs_end

# Run the prediction function to get the results for the first 20 examples:
probs_start, probs_end = predict_nn(model, val_dataset[0:20])

Now that we have the probabilities that each token is a start or end token, we combine these probabilities to estimate the probability of each possible answer span. This will allow us to choose the answer span with highest probability. 

In the next cell is our first attempt, which you will need to improve to get valid answers. This code loops through each possible combination of start and end tokens, obtains the start and end probabilities, and extracts the answer text for the corresponding span.

**TO-DO 5a:** Use the start and end probabilities to compute the answer span probability at the place marked inside the predict_answer() function below. **2 marks**

In [271]:
# our example:
example_index = 3

example = val_dataset[example_index]
print(f'CONTEXT = {example["Tweet"]}')
print(f'QUESTION = {example["Question"]}')
print(f'LIST OF POSSIBLE ANSWERS = {example["Answer"]}')

CONTEXT = The #endangeredriver would be a sexy bastard in this channel if it had water. Quick turns. Narrow. (I'm losing it) John D. Sutter (@jdsutter) June 21, 2014
QUESTION = what hashtag was used?
LIST OF POSSIBLE ANSWERS = ['#endangeredriver', '#endangereddriver']


In [272]:
def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)  # length of the input sequence, in the form "[CLS] question [SEP] context"

    SEP_SPECIAL_TOKEN = 102  # the input id for the sep special token the separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here
    
    for start_idx in range(0, input_length):
        for end_idx in range(0, input_length):
            
            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            
            span_probabilities.append(start_prob * end_prob)  # get the joint probability of the start and end tokens to get probability for the span
            
            span = tokenizer.decode(input_ids[start_idx:end_idx+1])
            spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)
    
    # print the top 20 answers:
    for i in range(20):
        print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)

Span prob = 0.5619068145751953, answer = endangeredriver
Span prob = 0.1524379700422287, answer = The # endangeredriver
Span prob = 0.12783248722553253, answer = # endangeredriver
Span prob = 0.04335261508822441, answer = 
Span prob = 0.01834719628095627, answer = endangeredriver would be a sexy bastard in this channel if it had water. Quick turns. Narrow
Span prob = 0.014453954994678497, answer = 
Span prob = 0.007982296869158745, answer = endangeredriver would be a sexy bastard in this channel if it had water. Quick turns
Span prob = 0.0077003296464681625, answer = endangeredriver would be a sexy bastard in this channel if it had water
Span prob = 0.007665065582841635, answer = 
Span prob = 0.006627386901527643, answer = ##river
Span prob = 0.0058457818813622, answer = 
Span prob = 0.004977354779839516, answer = The # endangeredriver would be a sexy bastard in this channel if it had water. Quick turns. Narrow
Span prob = 0.004173944238573313, answer = # endangeredriver would be a sex

Are all of the top 20 valid and unique answers? If not, what do you think is going wrong? 

**TO-DO 5b:** Use the cell below to define a new and improved version of `predict_answer()` that only includes valid answers. Summarise in a couple of sentences what kind of invalid answers your code removes. **4 marks**

**ANSWER:** The code removes answers where the end index is before the start index and therefore are empty spans.

In [276]:
### WRITE YOUR OWN CODE HERE

def predict_answer(probs_start, probs_end, input_ids, tokenizer):
    
    input_length = len(input_ids)
    
    SEP_SPECIAL_TOKEN = 102  # the input id for the sep special token the separates the question from the context. The context starts after this token. 
    PAD_SPECIAL_TOKEN = 0  # the input id for padding tokens added to the end of the context
    
    span_probabilities = []  # save the probabilities here
    spans = []  # save the possible answer spans here
    
    for start_idx in range(0, input_length):
        for end_idx in range(start_idx + 1, input_length):  # change to start from one token after the start_idx to get valid spans
            
            start_prob = probs_start[start_idx]
            end_prob = probs_end[end_idx]
            
            span_probabilities.append(start_prob * end_prob)
            
            span = tokenizer.decode(input_ids[start_idx:end_idx+1])
            spans.append(span)

    # sort the spans according to probability:
    sorted_span_index = np.argsort(span_probabilities)
    
    # print the top 20 answers:
    for i in range(20):
        print(f'Span prob = {span_probabilities[sorted_span_index[-i-1]]}, answer = {spans[sorted_span_index[-i-1]]}')
            
predict_answer(probs_start[example_index], probs_end[example_index], example['input_ids'], tokenizer)

Span prob = 0.5619068145751953, answer = past West
Span prob = 0.1524379700422287, answer = Drove past West
Span prob = 0.12783248722553253, answer = ##ove past West
Span prob = 0.01834719628095627, answer = past Westgate earlier today, still can't believe i can't shop there. It was
Span prob = 0.007982296869158745, answer = past Westgate earlier today, still can't believe i can't shop there
Span prob = 0.0077003296464681625, answer = past Westgate earlier today, still can't believe i can '
Span prob = 0.004977354779839516, answer = Drove past Westgate earlier today, still can't believe i can't shop there. It was
Span prob = 0.004173944238573313, answer = ##ove past Westgate earlier today, still can't believe i can't shop there. It was
Span prob = 0.002165492856875062, answer = Drove past Westgate earlier today, still can't believe i can't shop there
Span prob = 0.0020889986772090197, answer = Drove past Westgate earlier today, still can't believe i can '
Span prob = 0.0018159537576138

In [275]:
for i in range(10, 15):  # display five other instances and the model's predictions
    print(f'Example {i}')
    example = val_dataset[i]
    print(f'QUESTION = {example["Question"]}')
    print(f'ANSWER = {example["Answer"]}')
    predict_answer(probs_start[i], probs_end[i], example['input_ids'], tokenizer)

Example 10
QUESTION = who is davis directing their question to?
ANSWER = ['cnn.', 'cnn.']
Span prob = 0.8641284704208374, answer = Adam Davis
Span prob = 0.07723896205425262, answer = Eat the pizza, eat the dog, or both? Adam Davis
Span prob = 0.0038331663236021996, answer = Adam Davis ( @ amdhit )
Span prob = 0.0019280483247712255, answer = Eat the pizza
Span prob = 0.0016680951230227947, answer = Adam Davis ( @ amdhit
Span prob = 0.0012408460024744272, answer = Adam Davis ( @ amdhit ) July 23, 2014
Span prob = 0.0010941752698272467, answer = Eat the pizza, eat the dog, or both? Adam
Span prob = 0.0010419517057016492, answer = eat the dog, or both? Adam Davis
Span prob = 0.0007980415248312056, answer = pizza, eat the dog, or both? Adam Davis
Span prob = 0.0005266257794573903, answer = CNN What do I do about this dog? Eat the pizza, eat the dog, or both? Adam Davis
Span prob = 0.0003426224284339696, answer = Eat the pizza, eat the dog, or both? Adam Davis ( @ amdhit )
Span prob = 0.000

Span prob = 0.9885745048522949, answer = Westgate
Span prob = 0.004299418069422245, answer = Drove past Westgate
Span prob = 0.0003031263768207282, answer = Westgate earlier today
Span prob = 0.00017634339747019112, answer = Westgate earlier
Span prob = 0.00013716019748244435, answer = Westgate earlier today, still can't believe i can't shop there. It was my favorite mall! # Westgate6Months
Span prob = 7.982626266311854e-05, answer = Westgate earlier today,
Span prob = 5.914858775213361e-05, answer = Westgate earlier today, still can't believe i can't shop there. It was my favorite mall! # Westgate
Span prob = 4.606162474374287e-05, answer = Westgate earlier today, still can't believe i can't shop there
Span prob = 2.0618390408344567e-05, answer = Westgate earlier today, still can't believe i can't shop
Span prob = 1.820515899453312e-05, answer = Westgate earlier today, still can't believe i can't shop there. It was my favorite mall
Span prob = 1.8074706531479023e-05, answer = past Wes

You can try out the pretrained QA model on a few examples and try to identify its common mistakes.

**TO-DO 5c:** State one way that we could improve the performance of our extractive QA model on the Tweet QA dataset.  **2 marks**

**ANSWER:** The model sometimes outputs answers that contain the question and/or [CLS] OR [SEP] tags. This could be fixed by ensuring that the outputs printed are only those after the [SEP] tag.

--- 

As well as answering ad-hoc queries, question answering models can help us to extract structured information about entities of interest from a large set of documents. Suppose that we want to automatically collect information on tech companies, such as Apple and Open AI. We want to extract information about each company's activities from social media, including the names and release dates of new products and services, the company's earnings in a specific year, and who its CEO is.  

**TO-DO 5d:** Given a list of tech company names, how could we use question answering to extract the required information for each company from a set of tweets?  **(3 marks)** 

**ANSWER:** First, tweets should be pre-processed using tokenisation, syntax parsing, named-entity recognition, and relation extraction. Then, co-reference resolution should be used to resolve named entities, including names, dates, corporations, etc. to pronouns and phrases that refer to them. Next, text should be labelled with semantic roles, e.g., [releaser] label could help identify company earnings for a specific company. Finally, answer type  etection, candidate answer retrieval, and answer ranking are carried out to obtain a best-guess answer.


# 6. Transformer-based Text Classifiers (max. 24 marks)

The previous section showed us how to use a pretrained QA model based on a pretrained transformer. In this section, you will learn how to construct and train a text classifier on top of a pretrained transformer. 

We will use the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset to train and test a classifier. The task is to classify lines from poems into one of  0: negative, 1: positive, 2: no impact, or 3: mixed sentiment. For more information, see [Sheng and Uthus, 2020](https://arxiv.org/pdf/2011.02686.pdf). 

To begin you will need to instantiate a suitable model.

**TO-DO 6a:** Find an AutoModel class that constructs a text classifier from the pretrained TinyBERT model, "huawei-noah/TinyBERT_General_4L_312D". Create the `model` object in the cell below using this class. Refer to the [Hugging Face documentation for auto models](https://huggingface.co/transformers/v3.0.2/model_doc/auto.html) as needed. **(2 marks)**

In [277]:
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=4)  # 4 labels in the dataset

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertForSequenceClassification: ['fit_denses.2.weight', 'fit_denses.1.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.4.weight', 'fit_denses.1.bias', 'fit_denses.0.bias', 'fit_denses.0.weight', 'cls.seq_relationship.bias', 'fit_denses.3.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.bias', 'cls.seq_relationship.weight', 'fit_denses.2.bias', 'fit_denses.3.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a

**TO-DO 6b:** Provide a link to the documentation for your chosen auto model for text classification. Briefly describe how the text classifier `model` it creates differs from the QA model created by `AutoModelForQuestionAnswering`. Note: A useful reference may be the original BERT paper (https://arxiv.org/pdf/1810.04805.pdf), which includes diagrams (Figure 4) showing how BERT can be adapted to different tasks. **(2 marks)** 

https://huggingface.co/transformers/v3.0.2/model_doc/auto.html#automodelforsequenceclassification

**ANSWER:** Where models created using `AutoModelForQuestionAnswering` use the final hidden states for each of the tokens in the answer half of the sentence to attain start/end probabilities. `AutoModelForSequenceClassification`, on the other hand, uses the [CLS] token for the whole sequence to make a classification.

---

For the QA task, the complete model was pretrained and we could apply it to a dataset without further training. However, for our poem sentiment classification task,
we will need to train our model before we can use it (you may see a message in the output of the last cell telling you this). 

**TO-DO 6c:** The emotion classifier is built on top of a pretrained TinyBERT model, so why do we need to train it before we can use it? **(2 marks)**

**ANSWER:** Whereas QA tasks are predominantly standardised across applications, classification tasks can vary based on the characteristics of classes, e.g., sentiment classification is a very different task to emotion classification and therefore the model should be trained for a specific task.

---

Next, let's learn how to train our model. For some tasks it is not necessary to update the weights in the BERT model itself, so we can freeze them to save a lot of computation time. We can do this as follows. Since our pretrained model is based on BERT, we can access the weights inside BERT through the variable `model.bert`.

In [278]:
for param in model.bert.parameters():
    param.requires_grad = False

To train our model, we can make use of the Trainer class, which encapsulates a lot of the complex training steps and avoids the need to define our own training function, as we did in the previous notebook (we don't need to write our own `train_nn`).

First, define some settings for the training process. This is where we can set training hyperparameters:

In [279]:
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="transformer_checkpoints",  # specify the directory where models weights will be saved a certain points during training (checkpoints)
    num_train_epochs=10, # A sensible and sufficient number to use for the to-dos below
    per_device_train_batch_size=8,  # you can decrease this if memory usage is too high while training
    logging_steps=50,  # how often to print progress during training
    evaluation_strategy="epoch",  # report evaluation metrics after each epoch
)

Next, create a trainer object. Note that the next cell will currently fail with an error, because the variables `poem_train_dataset` and `poem_val_dataset` do not exist yet! Don't worry, we'll fix this later. 

In [280]:
from transformers import Trainer
from torch import nn

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_poems,
    eval_dataset=val_poems,
)

To train the model, you will need to call `trainer.train()`.

Once the model is trained, we can obtain predictions using the function below. Notice that it is simpler than obtaining the spans for QA -- we simply get the logits for each tweet in the test set, then apply argmax over the classes to find the most probable class for each tweet:

In [281]:
def predict_nn(trained_model, test_dataset):

    # Switch off dropout
    trained_model.eval()
    
    # Pass the required items from the dataset to the model    
    output = trained_model(attention_mask=torch.tensor(test_dataset["attention_mask"]), input_ids=torch.tensor(test_dataset["input_ids"]))
                        
    # the output dictionary contains logits, which are the unnormalised scores for each class for each example:
    pred_labs = np.argmax(output["logits"].detach().numpy(), axis=1)

    return pred_labs

You should now have all the bits and pieces needed to build and train a text classifier. Let's put them all together...

**TO-DO 6d:** Implement and test a classifier for the [Poem Sentiment](https://huggingface.co/datasets/poem_sentiment) dataset using a pretrained transformer. Evaluate the classifier with both frozen and unfrozen (i.e., fine-tuned) parameters in the pretrained transformer. Choose a suitable evaluation metric and provide a comparison of the results below, including a brief explanation  (1-2 sentences) for any differences you observe between the frozen and unfrozen variants. Make sure to comment your code.  **(10 marks)**

Notes: 
 * Strong classifier performance is not required to achive good marks -- rather, we award marks for implementing and testing a transformer-based classifier correctly.
 * You may implement any suitable kind of classifier you like, as long as you are using a pretrained transformer model.
 * 'tiny' BERT variants such as TinyBERT and roberta-tiny are recommended because they are small enough to fine-tune with a typical laptop CPU. We recommend sticking with these smaller pretrained models unless you have access to a GPU, e.g., via Google Colab. 

WRITE YOUR ANSWER HERE (DESCRIPTION OF RESULTS FOR 6d):

**ANSWER:** The model with fine-tuning performs much better than the model with frozen weights. When reviewing the accuracy, precision, recall, F1, and confusion matrix, it is clear that the frozen model (without fine-tuning) only predicts the majority class, and achieves an F1 score of 0.27. Whereas, the model with fine-tuning achieves an F1 score of 0.78. This is because the model requires fine-tuning to apply BERT effectively to new classification tasks. By using direct transfer learning, rather than inductive transfer learning, the frozen model is only applying knowledge from the basic classification tasks to the problem, rather than learning how to predict sentiment in poems.


In [282]:
from datasets import load_dataset

poem_train_dataset = load_dataset(  # load the train dataset.
    "poem_sentiment",
    split="train",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

poem_val_dataset = load_dataset(  # load the validation dataset.
    "poem_sentiment",
    split="validation",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

poem_test_dataset = load_dataset(  # load the test dataset.
    "poem_sentiment",
    split="test",
    ignore_verifications=True,
    cache_dir=cache_dir,
)

Found cached dataset poem_sentiment (C:/Users/Morg/OneDrive/Documents/MSc/TB2/advanced_data_analytics/text_analytics/advanced-labs-public/data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)
Found cached dataset poem_sentiment (C:/Users/Morg/OneDrive/Documents/MSc/TB2/advanced_data_analytics/text_analytics/advanced-labs-public/data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)
Found cached dataset poem_sentiment (C:/Users/Morg/OneDrive/Documents/MSc/TB2/advanced_data_analytics/text_analytics/advanced-labs-public/data_cache/poem_sentiment/default/1.0.0/4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099)


The next block searches through the training set to find the maximum text length. This is then used as the max_length for model inputs and any shorter texts are padded with zeros.

In [283]:
max_length = 0
for line in poem_train_dataset:
    if len(line['verse_text']) > max_length:
        max_length = len(line['verse_text'])
    
max_length

109

In [284]:
tokenizer = AutoTokenizer.from_pretrained("huawei-noah/TinyBERT_General_4L_312D")  # import right autotokenizer for our model

def tokenize_function(dataset):  # function to tokenize each text
    model_inputs = tokenizer(dataset['verse_text'], padding="max_length", max_length=max_length)
    return model_inputs

The next code block applies tokenization to all three splits.

In [285]:
train_poems = poem_train_dataset.map(tokenize_function, batched=True)
val_poems = poem_val_dataset.map(tokenize_function, batched=True)
test_poems = poem_test_dataset.map(tokenize_function, batched=True)

Loading cached processed dataset at C:\Users\Morg\OneDrive\Documents\MSc\TB2\advanced_data_analytics\text_analytics\advanced-labs-public\data_cache\poem_sentiment\default\1.0.0\4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099\cache-626f48bd9730a0c0.arrow
Loading cached processed dataset at C:\Users\Morg\OneDrive\Documents\MSc\TB2\advanced_data_analytics\text_analytics\advanced-labs-public\data_cache\poem_sentiment\default\1.0.0\4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099\cache-3175bca6b0f81b30.arrow
Loading cached processed dataset at C:\Users\Morg\OneDrive\Documents\MSc\TB2\advanced_data_analytics\text_analytics\advanced-labs-public\data_cache\poem_sentiment\default\1.0.0\4e44428256d42cdde0be6b3db1baa587195e91847adabf976e4f9454f6a82099\cache-df8707107c288815.arrow


In [159]:
trainer.train()  # train the model



Epoch,Training Loss,Validation Loss
1,1.3705,1.359082
2,1.3506,1.335096
3,1.3375,1.315157
4,1.3194,1.298243
5,1.3041,1.284494
6,1.2842,1.273413
7,1.2884,1.265001
8,1.2721,1.259087
9,1.2714,1.255654
10,1.2684,1.254551


TrainOutput(global_step=1120, training_loss=1.3085977724620275, metrics={'train_runtime': 849.6878, 'train_samples_per_second': 10.498, 'train_steps_per_second': 1.318, 'total_flos': 27233181666240.0, 'train_loss': 1.3085977724620275, 'epoch': 10.0})

In [160]:
no_ft_pred = predict_nn(model, test_poems)  # make predictions on the test set

The next code block looks at the performance of the model without fine-tuning.

In [200]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import pycm

print(f"Model without fine-tuning:\n\nACCURACY: {accuracy_score(test_poems['label'], no_ft_pred)}\nPRECISION: {precision_score(test_poems['label'], no_ft_pred, average='macro')}\nRECALL: {recall_score(test_poems['label'], no_ft_pred, average='macro')}\nF1 SCORE: {f1_score(test_poems['label'], no_ft_pred, average='macro')}\n")

cm = pycm.ConfusionMatrix(test_poems['label'], no_ft_pred)
cm.print_matrix()

Model without fine-tuning:

ACCURACY: 0.6634615384615384
PRECISION: 0.22115384615384615
RECALL: 0.3333333333333333
F1 SCORE: 0.26589595375722547

Predict  0        1        2        
Actual
0        0        0        19       

1        0        0        16       

2        0        0        69       




  _warn_prf(average, modifier, msg_start, len(result))


Next, let's see how the model performs with fine-tuning. First let's reinstatiate the model and trainer object.

In [161]:
model = AutoModelForSequenceClassification.from_pretrained("huawei-noah/TinyBERT_General_4L_312D", num_labels=4)

Some weights of the model checkpoint at huawei-noah/TinyBERT_General_4L_312D were not used when initializing BertForSequenceClassification: ['fit_denses.2.weight', 'fit_denses.1.weight', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.dense.bias', 'fit_denses.4.weight', 'fit_denses.1.bias', 'fit_denses.0.bias', 'fit_denses.0.weight', 'cls.seq_relationship.bias', 'fit_denses.3.weight', 'cls.predictions.bias', 'cls.predictions.transform.LayerNorm.bias', 'fit_denses.4.bias', 'cls.seq_relationship.weight', 'fit_denses.2.bias', 'fit_denses.3.bias', 'cls.predictions.decoder.weight', 'cls.predictions.transform.dense.weight']
- This IS expected if you are initializing BertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForSequenceClassification from the checkpoint of a

In [162]:
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_poems,
    eval_dataset=val_poems,
)

In [163]:
for param in model.bert.parameters():  # unfreezes BERT weights and turns on fine-tuning as a result
    param.requires_grad = True

In [164]:
trainer.train()  # train the model with fine-tuning



Epoch,Training Loss,Validation Loss
1,0.9461,0.737348
2,0.6585,0.655575
3,0.5251,0.599569
4,0.428,0.545868
5,0.3448,0.522004
6,0.2595,0.761997
7,0.2507,0.571598
8,0.2236,0.587284
9,0.1842,0.664679
10,0.1996,0.636919


TrainOutput(global_step=1120, training_loss=0.4207231926066535, metrics={'train_runtime': 2579.9734, 'train_samples_per_second': 3.457, 'train_steps_per_second': 0.434, 'total_flos': 27233181666240.0, 'train_loss': 0.4207231926066535, 'epoch': 10.0})

In [165]:
ft_pred = predict_nn(model, test_poems)  # make predictions using the fine-tuned model

The code below gets performance metrics for the fine-tuned BERT model for classification.

In [199]:
print(f"Model with fine-tuning:\n\nACCURACY: {accuracy_score(test_poems['label'], ft_pred)}\nPRECISION: {precision_score(test_poems['label'], ft_pred, average='macro')}\nRECALL: {recall_score(test_poems['label'], ft_pred, average='macro')}\nF1 SCORE: {f1_score(test_poems['label'], ft_pred, average='macro')}\n")

cm = pycm.ConfusionMatrix(test_poems['label'], ft_pred)
cm.print_matrix()

Model with fine-tuning:

ACCURACY: 0.8269230769230769
PRECISION: 0.7941468253968255
RECALL: 0.8043160437325195
F1 SCORE: 0.7841412040740149

Predict  0        1        2        
Actual
0        18       0        1        

1        1        10       5        

2        9        2        58       




**TO-DO 6e:** Did your sentiment classifier make use of any kind of model transfer or transfer learning? If so, what kinds of transfer were used and what benefit do they provide? **(4 marks)**

**ANSWER:** The model with frozen embeddings made use of direct transfer, using knowledge learned from the masked language modelling tasks it was trained on. With a classification head it transfers this knowledge to classfication. However, this knowledge isn't particularly useful on its own for predicting poem sentiment. Therefore, fine-tuning the model uses inductive transfer learning, adapting the model for learning on the poem sentiment classification task. This is beneficial for the task as the inductive bias from previous text understanding is transferred to the task, reducing the necessary training set size needed for effective sentiment classification.

---

**TO-DO 6f:** Use your model to compute the probability of sentiment for a sentence of your choosing. Comment your code and print the sentence with its probability distribution. Label the values so that we know which class they refer to. **(4 marks)**

Hint: you could use a poem generator, such as [this one](https://www.poemofquotes.com/tools/poetry-generator/ai-poem-generator), to generate a test sentence. 

In [244]:
from scipy.special import softmax

line = ['I burned down your meaningless claim']  # this is the line to be predicted
model_input = tokenizer(line, padding="max_length", max_length=max_length)  # tokenize the line for model input format
output = model(attention_mask=torch.tensor(model_input["attention_mask"]), input_ids=torch.tensor(model_input["input_ids"]))  # make model prediction
probabilities = softmax(output["logits"].detach().numpy(), axis=1)[0]  # run a softmax function over the output logits to get class probabilities

print(f'SENTENCE:\n{line[0]}\n\nPROBABILITIES:\nClass 0 (negative): {probabilities[0]}\nClass 1 (positive): {probabilities[1]}\nClass 2 (no impact): {probabilities[2]}\nClass 3 (mixed sentiment): {probabilities[3]}\n\nMODEL DECISION: Class 2 (negative)')

SENTENCE:
I burned down your meaningless claim

PROBABILITIES:
Class 0 (negative): 0.9050926566123962
Class 1 (positive): 0.009393616579473019
Class 2 (no impact): 0.006814749911427498
Class 3 (mixed sentiment): 0.07869898527860641

MODEL DECISION: Class 2 (negative)


# 7. OPTIONAL: More on Transformers

There are many great resources out there to show you how to use this kind of model in practice:
* An extensive online course is provided by HuggingFace: https://huggingface.co/course/chapter1/1. The pages linked from the HuggingFace course website have an 'open in Colab' button on the top right. You can open the notebook and run it on a Google server there to access GPUs.
* Chapters that may be particularly useful: 
   * Transformers, what can they do? https://huggingface.co/course/chapter1/3?fw=pt
   * Using Transformers: https://huggingface.co/course/chapter2/2?fw=pt
* They provide information on fine-tuning the transformer models here: https://huggingface.co/docs/transformers/training. Fine-tuning updates the weights inside the pretrained network and requires extensive GPU or TPU computing. 
* Text Generation: https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb. This topic goes way beyond data analytics on this unit and shows you another powerful feature of pretrained transformers.


