[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/gsarti/lcl23-xnlm-lab/blob/main/notebooks/1.1_Transformer_Syntactic_Abilities.ipynb)

# **Assessing Transformer Model Syntactic Abilities**

In this notebook, we will see how to assess the syntactic abilities of a Transformer model trained with a Masked Language Modeling (MLM) objective. In particular, we will test the model on the *subject-verb agreement* phenomenon.

The notebook is adapted from the experiments made by Yoav Goldberg in "*Assessing BERT's Syntactic Abilities*" (https://arxiv.org/pdf/1901.05287.pdf).
For further details, please also see the original github repo by Yoav Goldberg: https://github.com/yoavg/bert-syntax. 

## **Masked Language Modeling**

BERT is trained with the Masked Language Modeling objective, i.e. it learns to predict the identity of masked words in an input sequence (e.g. a sentence).

<div>
  <img src="https://media.geeksforgeeks.org/wp-content/uploads/20200422002516/maskedLanguage.jpg" width="600">
</div>

*(Source: https://www.geeksforgeeks.org/understanding-bert-nlp/)*

Relying on the MLM training objective, we can test how well the model has learned subject-verb agreement patterns.
From Goldberg's paper: 

> *“feeding into BERT the complete sentence, while masking out the single focus verb. I then ask BERT for its word predictions for the masked position, and compare the score assigned to the original correct verb to the score assigned to the incorrect one.”*
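
The comparison described in the quote can be sketched as a one-line decision rule. This is a minimal illustration, not Goldberg's code: the function name `prefers_grammatical` and the example scores are assumptions made for this sketch.

```python
def prefers_grammatical(score_correct: float, score_incorrect: float) -> bool:
    """Return True when the grammatical verb receives a strictly higher score."""
    return score_correct > score_incorrect


# Illustrative scores for the masked position in "the pilot [MASK] young ."
# (hypothetical values chosen only to show the comparison).
example_scores = {"is": 0.1598, "are": 0.00009}

print(prefers_grammatical(example_scores["is"], example_scores["are"]))
```

A tie counts as a failure here, since the model has not shown a preference for the grammatical form.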

## **1. Installation and imports**

In [23]:
!pip install transformers

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
Defaulting to user installation because normal site-packages is not writeable


In [2]:
import pandas as pd

from collections import Counter

from transformers import pipeline

## **2. Dataset**

This section shows how to load and format the dataset so that it can be passed directly to the Transformer model.

In this notebook, we will be utilizing the dataset defined in "Targeted Syntactic Evaluation of Language Models" (Marvin and Linzen, 2018, link to the paper: https://aclanthology.org/D18-1151.pdf). 
The dataset consists of various stimuli designed to assess the language model's proficiency in the following syntactic phenomena:


*   Subject-Verb Agreement;
*   Reflexive Anaphora;
*   Negative Polarity Items.

The dataset can be downloaded from https://github.com/yoavg/bert-syntax.

In [None]:
# Connect to Google Drive folder (only needed when running on Google Colab)
from google.colab import drive

drive.mount('/content/drive')

### **2.1 Data Investigation**

Before proceeding with the data processing, let's examine the structure of the dataset more closely.

In [3]:
test_dataset = "../data/marvin_linzen_dataset.tsv"

df = pd.read_csv(test_dataset, delimiter='\t', header=None)
df.head()

Unnamed: 0,0,1,2,3
0,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes laughs,the author that the guard likes laugh
1,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes swims,the author that the guard likes swim
2,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes smiles,the author that the guard likes smile
3,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes is tall,the author that the guard likes are tall
4,obj_rel_across_anim,sing_MS_MV_sing_ES_EV,the author that the guard likes is old,the author that the guard likes are old


In [4]:
# Visualize the different typologies of phenomena present in the dataset
df[0].unique()

array(['obj_rel_across_anim', 'obj_rel_within_anim',
       'obj_rel_across_inanim', 'obj_rel_within_inanim', 'subj_rel',
       'prep_anim', 'prep_inanim', 'obj_rel_no_comp_across_anim',
       'obj_rel_no_comp_within_anim', 'obj_rel_no_comp_across_inanim',
       'obj_rel_no_comp_within_inanim', 'simple_agrmt', 'sent_comp',
       'vp_coord', 'long_vp_coord', 'reflexives_across',
       'simple_reflexives', 'reflexive_sent_comp', 'npi_across_anim',
       'npi_across_inanim', 'simple_npi_anim', 'simple_npi_inanim'],
      dtype=object)

In [5]:
# Samples from the dataset with simple subject-verb agreement
df[df[0] == 'simple_agrmt'].head()

Unnamed: 0,0,1,2,3
118160,simple_agrmt,sing_MS_MV,the author laughs,the author laugh
118161,simple_agrmt,sing_MS_MV,the author swims,the author swim
118162,simple_agrmt,sing_MS_MV,the author smiles,the author smile
118163,simple_agrmt,sing_MS_MV,the author is tall,the author are tall
118164,simple_agrmt,sing_MS_MV,the author is old,the author are old


In [6]:
# Samples from the dataset with subject-verb agreement with Long VP (verb phrase) coordination
df[df[0] == 'long_vp_coord'].head()

Unnamed: 0,0,1,2,3
120820,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120821,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120822,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120823,long_vp_coord,sing_MS_LMV_LMV,the author knows many different foreign langua...,the author knows many different foreign langua...
120824,long_vp_coord,sing_MS_LMV_LMV,the author likes to watch television shows and...,the author likes to watch television shows and...


### **2.2 Data Preparation**

The following lines of code are adapted from the script *eval_bert.py* available here: https://github.com/yoavg/bert-syntax.

In [7]:
processed_dataset = []

cc = Counter()
for line in open(test_dataset, 'r'):
  sample = line.strip().split("\t")

  # Select only the configuration with simple subject-verb agreement
  if sample[0] == "simple_agrmt":
    cc[sample[1]]+=1

    # Select the correct ('g') and the erroneous sentence ('ug')
    g, ug = sample[-2], sample[-1]
    g = g.split()
    ug = ug.split()
    assert(len(g)==len(ug)),(g,ug)

    # Identify the difference between the two sentences (i.e. the different token)
    diffs = [i for i,pair in enumerate(zip(g,ug)) if pair[0]!=pair[1]]
    # Skip pairs that do not differ in exactly one token
    if len(diffs) != 1:
      continue

    # Save in 'gv' and 'ugv' the correct and incorrect token
    gv=g[diffs[0]]   # correct
    ugv=ug[diffs[0]] # incorrect

    # Recreate the input sequence by replacing the target token with [MASK]
    g[diffs[0]]="[MASK]"
    g.append(".")

    # Filter out the sentences that contain 'swims' as a possible target token, since 'swims' does not exist in the model vocabulary
    if gv != 'swims' and ugv != 'swims':
      processed_dataset.append((sample[0],sample[1]," ".join(g),gv,ugv))

At the end of the data preparation process, each instance will have the following structure:

*   Phenomenon;
*   Construction Template;
*   Sentence with the masked target token;
*   Target token with correct agreement;
*   Target token with incorrect agreement.
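
As a compact illustration of this structure, the sketch below builds one such instance from a grammatical/ungrammatical sentence pair. The names `AgreementItem` and `make_item` are hypothetical and not part of the original script; unlike the cell above, the sketch returns `None` for pairs of different length instead of raising an assertion error.

```python
from typing import NamedTuple, Optional


class AgreementItem(NamedTuple):
    phenomenon: str
    template: str
    masked_sentence: str
    correct_token: str
    incorrect_token: str


def make_item(phenomenon: str, template: str, good: str, bad: str,
              mask_token: str = "[MASK]") -> Optional[AgreementItem]:
    """Mask the single word on which the two sentences differ."""
    good_words, bad_words = good.split(), bad.split()
    if len(good_words) != len(bad_words):
        return None
    # Positions where the grammatical and ungrammatical sentences disagree
    diffs = [i for i, (g, b) in enumerate(zip(good_words, bad_words)) if g != b]
    if len(diffs) != 1:
        return None
    pos = diffs[0]
    masked = good_words[:pos] + [mask_token] + good_words[pos + 1:] + ["."]
    return AgreementItem(phenomenon, template, " ".join(masked),
                         good_words[pos], bad_words[pos])


item = make_item("simple_agrmt", "sing_MS_MV",
                 "the author is tall", "the author are tall")
print(item)
```

Using a named tuple keeps the five fields accessible by name (for example `item.correct_token`) while remaining compatible with the positional indexing used elsewhere in the notebook.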



In [8]:
print("Number of samples:", len(processed_dataset))
print()

print("Input sample:", processed_dataset[2])

Number of samples: 120

Input sample: ('simple_agrmt', 'sing_MS_MV', 'the author [MASK] tall .', 'is', 'are')


## **3. Loading the Pipeline**

To carry out the experiments, we rely on Hugging Face's Transformers library.

Transformers 🤗 (https://huggingface.co/docs/transformers/index) “*provides APIs and tools to easily download and train state-of-the-art pretrained models.*”

For our specific scenario, we utilize the *pipeline* object (https://huggingface.co/docs/transformers/main_classes/pipelines), which offers a simple API dedicated to several tasks (e.g. Named Entity Recognition, Masked Language Modeling, Feature Extraction, and more).

The *pipeline()* object can be instantiated as follows:

```python
nlp = pipeline(<task_name>, model=<model_name>)
```

In this notebook, we use *bert-base-cased* as the Transformer model (12 layers, 12 attention heads, 768 hidden units). However, it is possible to use any other model that has been trained with the MLM objective. The full list of available models can be found at the following link: [Huggingface Models](https://huggingface.co/models).

In [9]:
nlp = pipeline("fill-mask", model="bert-base-cased")

Some weights of the model checkpoint at bert-base-cased were not used when initializing BertForMaskedLM: ['cls.seq_relationship.weight', 'cls.seq_relationship.bias']
- This IS expected if you are initializing BertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


## **4. Running the pipeline on the dataset**

After loading the dataset and the model, we can simply call the *pipeline* on one item (i.e. sentence) as follows:

```python
predictions = nlp(<sentence>, targets=<target_tokens>)
```


The *targets* parameter allows us to provide the model with a set of target tokens, so that the pipeline returns their probabilities for the masked position.

Here's an example:

In [10]:
# We select a sample sentence (and the corresponding target tokens) from our dataset
sentence = processed_dataset[10][2]
targets = processed_dataset[10][3:]

# Sample sentence and target tokens
print("Sentence:", sentence)
print("Targets:", targets)
print()

predictions = nlp(sentence, targets=targets)

# Model MLM predictions
predictions

Sentence: the pilot [MASK] young .
Targets: ('is', 'are')



[{'score': 0.15975096821784973,
  'token': 1110,
  'token_str': 'is',
  'sequence': 'the pilot is young.'},
 {'score': 8.954993245424703e-05,
  'token': 1132,
  'token_str': 'are',
  'sequence': 'the pilot are young.'}]
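
Since the pipeline returns a list of dictionaries, it can be convenient to index the scores by token before comparing them. The helper below is a small sketch (the name `scores_by_token` is an assumption); it is applied to a literal copy of the output shown above rather than to a live pipeline call.

```python
def scores_by_token(predictions):
    """Map each predicted token string to its score."""
    # strip() guards against tokenizers that return padded token strings
    return {pred["token_str"].strip(): pred["score"] for pred in predictions}


# Literal copy of the fill-mask output for "the pilot [MASK] young ."
predictions = [
    {"score": 0.15975096821784973, "token": 1110,
     "token_str": "is", "sequence": "the pilot is young."},
    {"score": 8.954993245424703e-05, "token": 1132,
     "token_str": "are", "sequence": "the pilot are young."},
]

scores = scores_by_token(predictions)
correct, incorrect = "is", "are"
print(scores[correct] > scores[incorrect])
```

Looking scores up by token avoids relying on the order in which the pipeline returns the candidates.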

In the following, we apply our pipeline to all the sentences in the dataset and then we store the results in a Pandas dataframe.

In [11]:
columns = ["phenomena", "template", "sentence", 
           "correct_token", "prob_correct_token", 
           "incorrect_token", "prob_incorrect_token"]

df = pd.DataFrame(columns=columns)

for input_sample in processed_dataset:
  sentence = input_sample[2]
  targets = input_sample[3:]

  dict_results = {"phenomena": [input_sample[0]],
                  "template": [input_sample[1]],
                  "sentence": [sentence],
                  "correct_token": [targets[0]],
                  "incorrect_token": [targets[1]]}
  
  # Get BERT predictions and store them in the dictionary
  predictions = nlp(sentence, targets=targets)
  for pred in predictions:
    token = pred["token_str"]
    prob = pred["score"]
    if token == targets[0]:
      dict_results["prob_correct_token"] = [prob]
    else:
      dict_results["prob_incorrect_token"] = [prob]

  df = pd.concat([df, pd.DataFrame(dict_results)])

First 5 rows of the resulting DataFrame:

In [12]:
df.head()

Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token
0,simple_agrmt,sing_MS_MV,the author [MASK] .,laughs,0.005252,laugh,4.6e-05
0,simple_agrmt,sing_MS_MV,the author [MASK] .,smiles,0.002515,smile,5.7e-05
0,simple_agrmt,sing_MS_MV,the author [MASK] tall .,is,0.130696,are,4.5e-05
0,simple_agrmt,sing_MS_MV,the author [MASK] old .,is,0.423882,are,0.000553
0,simple_agrmt,sing_MS_MV,the author [MASK] young .,is,0.098579,are,1.6e-05


## **5. Evaluation**

As a final step, we compute the performance of the model. Specifically, we calculate its accuracy, i.e. we verify how many times BERT was able to assign a higher probability to the target word with the correct subject-verb agreement.

In [13]:
# Evaluation
corr = df.query('prob_correct_token > prob_incorrect_token')
miss = df.query('prob_correct_token < prob_incorrect_token')

tot_corr = len(corr)
tot_miss = len(miss)

print("Total number of correct predictions:", tot_corr)
print("Total number of wrong predictions:", tot_miss)
print()

accuracy = tot_corr/(tot_corr+tot_miss)
print("Accuracy:", accuracy)

Total number of correct predictions: 120
Total number of wrong predictions: 0

Accuracy: 1.0
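
The same metric can be written without pandas, which also makes it easy to break the result down by template. This is a sketch under stated assumptions: the function names are hypothetical, the rows are a handful of values copied from the tables in this notebook, and ties are counted as errors (the cell above instead leaves ties out of both counts).

```python
from collections import defaultdict


def accuracy(rows):
    """rows: iterable of (template, prob_correct, prob_incorrect) tuples."""
    rows = list(rows)
    if not rows:
        return float("nan")
    hits = sum(1 for _, p_good, p_bad in rows if p_good > p_bad)
    return hits / len(rows)


def accuracy_by_template(rows):
    """Group rows by template and compute the accuracy of each group."""
    grouped = defaultdict(list)
    for row in rows:
        grouped[row[0]].append(row)
    return {template: accuracy(group) for template, group in grouped.items()}


rows = [
    ("sing_MS_MV", 0.005252, 0.000046),
    ("sing_MS_MV", 0.130696, 0.000045),
    ("plur_MS_MV", 0.001336, 0.000065),
    ("plur_MS_MV", 0.029917, 0.000175),
]

print(accuracy(rows))
print(accuracy_by_template(rows))
```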


In [14]:
# Correct predictions
corr 

Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token
0,simple_agrmt,sing_MS_MV,the author [MASK] .,laughs,0.005252,laugh,0.000046
0,simple_agrmt,sing_MS_MV,the author [MASK] .,smiles,0.002515,smile,0.000057
0,simple_agrmt,sing_MS_MV,the author [MASK] tall .,is,0.130696,are,0.000045
0,simple_agrmt,sing_MS_MV,the author [MASK] old .,is,0.423882,are,0.000553
0,simple_agrmt,sing_MS_MV,the author [MASK] young .,is,0.098579,are,0.000016
...,...,...,...,...,...,...,...
0,simple_agrmt,plur_MS_MV,the consultants [MASK] .,smile,0.001336,smiles,0.000065
0,simple_agrmt,plur_MS_MV,the consultants [MASK] tall .,are,0.029917,is,0.000175
0,simple_agrmt,plur_MS_MV,the consultants [MASK] old .,are,0.048789,is,0.000271
0,simple_agrmt,plur_MS_MV,the consultants [MASK] young .,are,0.149911,is,0.000322


In [15]:
# Incorrect predictions
miss

Unnamed: 0,phenomena,template,sentence,correct_token,prob_correct_token,incorrect_token,prob_incorrect_token


## **Causal Language Modeling**

While BERT is trained with the MLM objective, Causal Language Models are generally trained with the language modeling (LM) objective, i.e. predicting the next token in an input sequence given its previous context.

<div>
  <img src="https://lena-voita.github.io/resources/lectures/lang_models/neural/nn_lm_idea_linear-min.png" width="800">
</div>

(Source: https://lena-voita.github.io/nlp_course/language_modeling.html)

The experiments shown above can also be reproduced with a Causal Language Model (e.g. GPT), with a few adjustments.

Given an input sequence, we can provide the model with the sequence up to the target token (i.e. prompt), and then ask to predict the next token and compare the probabilities associated with the two target words.
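
The idea can be illustrated without a real model: a causal LM produces one logit per vocabulary entry for the next position, and a log-softmax turns these into log-probabilities that can be compared for the two target words. The vocabulary and logits below are toy values invented for this sketch, not model outputs.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


# Hypothetical next-token logits after the prompt "the author"
vocab = ["laughs", "laugh", "the", "of"]
toy_logits = [2.0, 0.5, 3.0, 2.5]

log_probs = dict(zip(vocab, log_softmax(toy_logits)))

# Only the two target verbs are compared, whatever the rest of the vocabulary gets
preferred = max(("laughs", "laugh"), key=log_probs.get)
print(preferred, log_probs["laughs"], log_probs["laugh"])
```

Note that the preferred target does not need to be the most likely next token overall: in this toy example "the" has the highest logit, but only the relative order of the two verbs matters for the agreement test.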

In [24]:
import torch

import numpy as np

from transformers import AutoTokenizer, AutoModelForCausalLM

## **Loading the model**

To extract the probabilities of the target tokens, we can simply run the model on our input sequence.
Before doing so, we need to load both the model and its tokenizer.

In [25]:
model_name = 'gpt2'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

## **Running the model**

By calling the *'model.generate()'* method on the input sequence:

```python
model.generate(input_ids)
```

it is possible to obtain a generated sequence starting from the prompt defined with *'input_ids'*. It is also possible to specify several parameters to condition the generation of the model.

At the end of the generation process, with:

```python
model.compute_transition_scores()
```
 
we can extract the scores assigned to the selected token at each generation step.

In [27]:
sentence = processed_dataset[0][2]
targets = processed_dataset[0][3:]

print("Sentence:", sentence)
print()

# Split the sentence to get the sequence up to the [MASK] token
pre = sentence.split(" [MASK] ")[0]
inputs = tokenizer(pre, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=5, num_beams=4, return_dict_in_generate=True, output_scores=True)

# Older releases of transformers exposed this method as compute_transition_beam_scores
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices)

# input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
# encoder-decoder models, like BART or T5.
input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
generated_tokens = outputs.sequences[:, input_length:]

for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | log probability | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.10%}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentence: the author [MASK] .

|   286 |  of      | -1.434 | 23.8339081407%
|   262 |  the     | -1.334 | 26.3485491276%
|  1492 |  book    | -2.173 | 11.3818027079%
|   366 |  "       | -2.134 | 11.8385106325%
|   464 | The      | -1.662 | 18.9790815115%
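
Assuming the third column holds per-step log-probabilities (as the exponentiation in the cell above suggests), they can be converted to probabilities with `exp` and summed to obtain the log-probability of the whole continuation. The values below are the rounded scores copied from the table; the variable names are illustrative.

```python
import math

# Rounded per-step scores copied from the table above
step_log_probs = [-1.434, -1.334, -2.173, -2.134, -1.662]

# Probability of each generated token given its context
step_probs = [math.exp(lp) for lp in step_log_probs]

# Log-probabilities add up, which is equivalent to multiplying probabilities
sequence_log_prob = sum(step_log_probs)
sequence_prob = math.exp(sequence_log_prob)

print([round(p, 4) for p in step_probs])
print(round(sequence_log_prob, 3), sequence_prob)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.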


If we want to restrict the generation to certain target tokens (i.e. the target verbs for the subject-verb agreement task), we can modify the previous code snippet by adjusting some parameters.

In [28]:
sentence = processed_dataset[0][2]
targets = processed_dataset[0][3:]

print("Sentence:", sentence)
print("Targets:", targets)
print()

# Function to restrict the generated tokens of the model to the target ones
def restrict_decode_vocab(batch_idx, prefix_beam):
    restricted_vocab = tokenizer([targets], return_tensors="pt")['input_ids'].tolist()
  
    return restricted_vocab

# Split the sentence to get the sequence up to the [MASK] token
pre = sentence.split(" [MASK] ")[0]
inputs = tokenizer(pre, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=1, num_return_sequences=2, return_dict_in_generate=True, 
                         output_scores=True, num_beams=4, prefix_allowed_tokens_fn=restrict_decode_vocab)

# Older releases of transformers exposed this method as compute_transition_beam_scores
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices
)

# input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
# encoder-decoder models, like BART or T5.
input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
generated_tokens = outputs.sequences[:, input_length:]

# Probabilities for the first returned sequence (here it contains targets[0])
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | log probability | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.10%}")

print()

# Probabilities for the second returned sequence (here it contains targets[1])
for tok, score in zip(generated_tokens[1], transition_scores[1]):
    # | token | token string | log probability | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.10%}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Sentence: the author [MASK] .
Targets: ('laughs', 'laugh')

| 28124 | laughs   | -18.429 | 0.0000009915%

| 44944 | laugh    | -19.947 | 0.0000002173%
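
Both absolute probabilities are tiny, so the relative preference is easier to read as a ratio of the two scores. The sketch below computes it from the rounded log-probabilities printed above. It also includes a hypothetical helper, `with_leading_space`, reflecting the fact that GPT-2's byte-level BPE vocabulary distinguishes a word preceded by a space from the same word without one; scoring the space-prefixed variants is a possible variation and an assumption of this sketch, not something the cell above does.

```python
import math


def preference_ratio(log_prob_a: float, log_prob_b: float) -> float:
    """How many times more probable candidate A is than candidate B."""
    return math.exp(log_prob_a - log_prob_b)


def with_leading_space(words):
    """Prefix each candidate with a space unless it already starts with one."""
    return [w if w.startswith(" ") else " " + w for w in words]


# Rounded scores copied from the output above
ratio = preference_ratio(-18.429, -19.947)
print(f"'laughs' is about {ratio:.2f} times more probable than 'laugh'")

print(with_leading_space(("laughs", "laugh")))
```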
