[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1Z6G_DJNyYXAA8KMnOzle7OmC_nS-N_p5?usp=sharing)

# **Assessing MLM Transformer Model Syntactic Abilities**

In this notebook, we will see how to assess the syntactic abilities of a Transformer model trained with a Masked Language Modeling (MLM) objective. In particular, we will test the model on the *subject-verb agreement* phenomenon.

The notebook is adapted from the experiments conducted by Yoav Goldberg in "*Assessing BERT's Syntactic Abilities*" (https://arxiv.org/pdf/1901.05287.pdf).
For further details, please also see the original GitHub repository by Yoav Goldberg: https://github.com/yoavg/bert-syntax.

## **Masked Language Modeling**

BERT is trained with the Masked Language Modeling objective, i.e. it learns to predict the identity of masked words in an input sequence (e.g. a sentence).

<div>
  <img src="https://media.geeksforgeeks.org/wp-content/uploads/20200422002516/maskedLanguage.jpg" width="600">
</div>

(Source: https://www.geeksforgeeks.org/understanding-bert-nlp/)

Relying on the MLM training objective, we can test how well the model has learned subject-verb agreement patterns.
From Goldberg's paper: *“feeding into BERT the complete sentence, while masking out the single focus verb. I then ask BERT for its word predictions for the masked position, and compare the score assigned to the original correct verb to the score assigned to the incorrect one.”*
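
The comparison described in the quote can be sketched without loading any model. The snippet below is a minimal illustration, assuming we already have raw scores (logits) for a few vocabulary items at the masked position; the logits and the helper names `softmax` and `prefers_correct` are hypothetical and are not taken from Goldberg's code.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    highest = max(logits.values())
    exps = {tok: math.exp(score - highest) for tok, score in logits.items()}
    total = sum(exps.values())
    return {tok: value / total for tok, value in exps.items()}

def prefers_correct(logits, correct, incorrect):
    """True if the correct verb form is more probable than the incorrect one."""
    probs = softmax(logits)
    return probs[correct] > probs[incorrect]

# Hypothetical logits for the masked position in "the author [MASK] ."
toy_logits = {"laughs": 4.0, "laugh": 1.5, "the": -2.0, "runs": 3.0}
print(prefers_correct(toy_logits, "laughs", "laugh"))
```

Only the relative order of the two candidate verbs matters for this test, so the other vocabulary items affect the absolute probabilities but not the outcome.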

## **1. Installation and imports**

In [None]:
!pip install transformers

In [None]:
from google.colab import drive

import pandas as pd

from collections import Counter

from transformers import pipeline

## **2. Dataset**

The cells below show how to load and format the dataset so that it can be passed directly to the Transformer model.

In this notebook, we will be utilizing the dataset defined in "Targeted Syntactic Evaluation of Language Models" (Marvin and Linzen, 2018, link to the paper: https://aclanthology.org/D18-1151.pdf). 
The dataset consists of various stimuli designed to assess the language model's proficiency in the following syntactic phenomena:


*   Subject-Verb Agreement;
*   Reflexive Anaphora;
*   Negative Polarity Items.

The dataset can be downloaded from https://github.com/yoavg/bert-syntax.

In [None]:
# Connect to Google Drive folder
drive.mount('/content/drive')

### **2.1 Data Investigation**

Before proceeding with the data processing, let's examine the structure of the dataset more closely.

In [None]:
test_dataset = "/content/drive/My Drive/Lectures_2023/marvin_linzen_dataset.tsv"

df = pd.read_csv(test_dataset, delimiter='\t', header=None)
df.head()

In [None]:
# Show the different types of phenomena present in the dataset
df[0].unique()

In [None]:
# Samples from the dataset with simple subject-verb agreement
df[df[0] == 'simple_agrmt'].head()

In [None]:
# Samples from the dataset with subject-verb agreement with Long VP (verb phrase) coordination
df[df[0] == 'long_vp_coord'].head()

### **2.2 Data Preparation**

The following lines of code are adapted from the script *eval_bert.py* available here: https://github.com/yoavg/bert-syntax.

In [None]:
processed_dataset = []

cc = Counter()  # number of selected samples per construction template
with open(test_dataset, 'r') as f:
  for line in f:
    sample = line.strip().split("\t")

    # Select only the configuration with simple subject-verb agreement
    if sample[0] == "simple_agrmt":
      cc[sample[1]] += 1

      # Select the correct ('g') and the erroneous ('ug') sentence
      g, ug = sample[-2], sample[-1]
      g = g.split()
      ug = ug.split()
      assert len(g) == len(ug), (g, ug)

      # Identify the position where the two sentences differ (i.e. the different token)
      diffs = [i for i, pair in enumerate(zip(g, ug)) if pair[0] != pair[1]]
      # Skip pairs that do not differ in exactly one token
      if len(diffs) != 1:
        continue

      # Save in 'gv' and 'ugv' the correct and incorrect token
      gv = g[diffs[0]]    # correct
      ugv = ug[diffs[0]]  # incorrect

      # Recreate the input sequence by replacing the target token with [MASK]
      g[diffs[0]] = "[MASK]"
      g.append(".")

      # Filter out the sentences that contain 'swims' as a possible target token,
      # since 'swims' does not exist in the model vocabulary
      if gv != 'swims' and ugv != 'swims':
        processed_dataset.append((sample[0], sample[1], " ".join(g), gv, ugv))

At the end of the data preparation process, each instance will have the following structure:

*   Phenomenon;
*   Construction Template;
*   Sentence with the masked target token;
*   Target token with correct agreement;
*   Target token with incorrect agreement.
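
To make this structure concrete, here is a minimal sketch of the masking step applied to a single hand-written sentence pair (not read from the dataset file). The helper name `mask_minimal_pair` is hypothetical; unlike the preparation code, it returns `None` instead of raising an assertion error when the two sentences have different lengths.

```python
def mask_minimal_pair(good, bad, mask_token="[MASK]"):
    """Return (masked_sentence, correct_token, incorrect_token), or None
    if the two sentences do not differ in exactly one position."""
    g, ug = good.split(), bad.split()
    if len(g) != len(ug):
        return None
    diffs = [i for i, (a, b) in enumerate(zip(g, ug)) if a != b]
    if len(diffs) != 1:
        return None
    pos = diffs[0]
    correct, incorrect = g[pos], ug[pos]
    g[pos] = mask_token
    g.append(".")  # mirrors the preparation code, which appends a final period
    return " ".join(g), correct, incorrect

example = mask_minimal_pair("the author laughs", "the author laugh")
print(example)  # ('the author [MASK] .', 'laughs', 'laugh')
```

The last three fields of each processed instance correspond to the tuple returned here; the phenomenon and the construction template come directly from the first two columns of the dataset.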



In [None]:
print("Number of samples:", len(processed_dataset))
print()

print("Input sample:", processed_dataset[2])

## **3. Loading the Pipeline**

To carry out the experiments, we rely on Hugging Face's Transformers library.

Transformers 🤗 (https://huggingface.co/docs/transformers/index) “*provides APIs and tools to easily download and train state-of-the-art pretrained models.*”

For our specific scenario, we utilize the *pipeline* object (https://huggingface.co/docs/transformers/main_classes/pipelines), which offers a simple API dedicated to several tasks (e.g. Named Entity Recognition, Masked Language Modeling, Feature Extraction, and more).

The *pipeline()* object can be instantiated as follows:

```python
nlp = pipeline(<task_name>, model=<model_name>)
```

In this notebook, we use *bert-base-cased* as the Transformer model (12 layers, 12 attention heads, 768 hidden units). However, it is possible to use any other model that has been trained with the MLM objective. The full list of available models can be found at the following link: [Huggingface Models](https://huggingface.co/models).
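
One caveat when switching models: the data preparation in this notebook hard-codes BERT's `[MASK]` placeholder, while other MLM models may use a different mask token (RoBERTa-style tokenizers use `<mask>`). In Transformers, the token is exposed as the tokenizer's `mask_token` attribute (e.g. `nlp.tokenizer.mask_token` once the pipeline has been created). The helper below is a hypothetical sketch of how the prepared sentences could be adapted; it is not part of the original notebook.

```python
def use_mask_token(sentence, mask_token, placeholder="[MASK]"):
    """Swap the BERT-style placeholder for the mask token of another model."""
    return sentence.replace(placeholder, mask_token)

# "<mask>" is the mask token used by RoBERTa-style tokenizers
print(use_mask_token("the author [MASK] .", "<mask>"))
```

With a different model, the vocabulary filter in the data preparation (which excludes 'swims' for *bert-base-cased*) may also need to be revisited.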

In [None]:
nlp = pipeline("fill-mask", model="bert-base-cased")

## **4. Running the pipeline on the dataset**

After loading the dataset and the model, we can simply call the *pipeline* on one item (i.e. sentence) as follows:

```python
predictions = nlp(<sentence>, targets=<target_tokens>)
```


The *targets* parameter allows us to provide the model with a set of target tokens, so that it computes their probabilities at the masked position.

Here's an example:

In [None]:
# We select a sample sentence (and the corresponding target tokens) from our dataset
sentence = processed_dataset[10][2]
targets = processed_dataset[10][3:]

# Sample sentence and target tokens
print("Sentence:", sentence)
print("Targets:", targets)
print()

predictions = nlp(sentence, targets=targets)

# Model MLM predictions
predictions
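
The fill-mask pipeline returns a list of dictionaries, one per target, with the keys `score`, `token`, `token_str` and `sequence`. The sketch below shows how such a list could be turned into a token-to-probability mapping. It uses a hand-written mock instead of real model output, so the scores, token ids and sequences are made up, and the helper name `scores_by_token` is hypothetical.

```python
# Illustrative values only: the scores, token ids and sequences are made up
mock_predictions = [
    {"score": 0.012, "token": 1111, "token_str": "laughs",
     "sequence": "the author laughs."},
    {"score": 0.0004, "token": 2222, "token_str": "laugh",
     "sequence": "the author laugh."},
]

def scores_by_token(predictions):
    """Map each candidate token string to its predicted probability."""
    return {pred["token_str"]: pred["score"] for pred in predictions}

scores = scores_by_token(mock_predictions)
print(scores["laughs"] > scores["laugh"])
```

Looking scores up by `token_str` avoids relying on the order in which the pipeline returns the targets.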

In the following, we apply our pipeline to all the sentences in the dataset and then we store the results in a Pandas dataframe.

In [None]:
columns = ["phenomena", "template", "sentence", 
           "correct_token", "prob_correct_token", 
           "incorrect_token", "prob_incorrect_token"]

rows = []

for input_sample in processed_dataset:
  sentence = input_sample[2]
  targets = input_sample[3:]

  row = {"phenomena": input_sample[0],
         "template": input_sample[1],
         "sentence": sentence,
         "correct_token": targets[0],
         "incorrect_token": targets[1]}

  # Get BERT predictions and store them in the dictionary
  predictions = nlp(sentence, targets=targets)
  for pred in predictions:
    token = pred["token_str"]
    prob = pred["score"]
    if token == targets[0]:
      row["prob_correct_token"] = prob
    else:
      row["prob_incorrect_token"] = prob

  rows.append(row)

# Build the DataFrame once, instead of concatenating inside the loop
df = pd.DataFrame(rows, columns=columns)

First five rows of the resulting DataFrame:

In [None]:
df.head()

## **5. Evaluation**

As a final step, we compute the performance of the model. Specifically, we calculate its accuracy, i.e. we verify how many times BERT was able to assign a higher probability to the target word with the correct subject-verb agreement.

In [None]:
# Evaluation
corr = df.query('prob_correct_token > prob_incorrect_token')
miss = df.query('prob_correct_token < prob_incorrect_token')

tot_corr = len(corr)
tot_miss = len(miss)

print("Total number of correct predictions:", tot_corr)
print("Total number of wrong predictions:", tot_miss)
print()

accuracy = tot_corr/(tot_corr+tot_miss)
print("Accuracy:", accuracy)

In [None]:
# Correct predictions
corr 

In [None]:
# Incorrect predictions
miss
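
Since each row also records its construction template, the same accuracy can be broken down per template. The snippet below is a minimal sketch in plain Python, assuming rows of the form `(template, prob_correct, prob_incorrect)`; the helper name `accuracy_by_template`, the template labels and the probabilities are all hypothetical. Ties are left out of the denominator, as in the evaluation above.

```python
from collections import defaultdict

def accuracy_by_template(rows):
    """rows: iterable of (template, prob_correct, prob_incorrect) tuples.
    Ties are excluded from both the hits and the misses."""
    hits, misses = defaultdict(int), defaultdict(int)
    for template, p_ok, p_bad in rows:
        if p_ok > p_bad:
            hits[template] += 1
        elif p_ok < p_bad:
            misses[template] += 1
    return {t: hits[t] / (hits[t] + misses[t])
            for t in set(hits) | set(misses)}

# Toy rows with made-up template names and probabilities
toy_rows = [("sing", 0.9, 0.1), ("sing", 0.2, 0.3),
            ("plur", 0.6, 0.1), ("plur", 0.5, 0.5)]
print(accuracy_by_template(toy_rows))
```

With the DataFrame built in this notebook, the rows could be obtained with `df[["template", "prob_correct_token", "prob_incorrect_token"]].itertuples(index=False)`.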