# Introduction: Subtask B **ENGLISH — BERTRAM**

Based on the baseline model for [Subtask B of SemEval 2022 Task 2](https://sites.google.com/view/semeval2022task2-idiomaticity#h.qq7eefmehqf9).

Original paper: “[AStitchInLanguageModels: Dataset and Methods for the Exploration of Idiomaticity in Pre-Trained Language Models](https://arxiv.org/abs/2109.04413)”.

Phelps: https://arxiv.org/abs/2204.02821

## Pre-train only setting: Methodology 

### Requirements

- NOT allowed to train models using the idiom based training data provided. 
- Train models using STS data. 
- Note that models such as BERT typically do not output embeddings that can be meaningfully compared for semantic similarity out of the box, so training on STS data is actually required.

### Phelps' Paper

- Use pre-trained BERTRAM models to create representations for idiomatic expressions. Pre-trained BERTRAM here: https://github.com/timoschick/bertram.
- 3 epochs of context-only training, 10 epochs of form-only training and 3 epochs of combined training.

### Architecture

0. Insert the BERTRAM embeddings for idioms into the embedding matrix of a pre-trained [BERT-Base](https://doi.org/10.18653/v1/N19-1423) model.

1. Wrap the model in [Sentence BERT](https://doi.org/10.18653/v1/D19-1410).

2. Compute the cosine similarity between a given pair of sentences.
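
As a rough illustration of step 2, cosine similarity between two sentence embeddings can be computed as below. This is a minimal pure-Python sketch with made-up three-dimensional vectors (`sentence_a`, `sentence_b` and `sentence_c` are hypothetical); the notebook itself works on real Sentence Transformers embeddings and uses `paired_cosine_distances` from scikit-learn.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Hypothetical sentence embeddings, for illustration only.
sentence_a = [1.0, 2.0, 3.0]
sentence_b = [2.0, 4.0, 6.0]   # same direction as sentence_a
sentence_c = [-3.0, 0.0, 1.0]  # orthogonal to sentence_a

same_direction = cosine_similarity(sentence_a, sentence_b)
orthogonal = cosine_similarity(sentence_a, sentence_c)
print(same_direction, orthogonal)
```

Vectors pointing in the same direction score close to 1 regardless of their length, which is why cosine similarity is a common choice for comparing sentence embeddings.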

### Data

0. Pre-train: 
The methodology used for the pre-train only setting involves the introduction of new tokens associated with each MWE. Note that we do not actually continue pre-training our models, so these embeddings remain random. Despite its simplicity, this method has been shown (by the above paper) to be a good way of ensuring that compositionality is “broken”. Note that it is possible to pre-train these models on sentences containing these MWEs as long as such data is not annotated for idiomaticity. Please also see the paper for more details.

Note that we must replace MWEs occurring in the evaluation data with the tokens we’ve chosen. 

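For this setting, the baseline uses multilingual BERT with new tokens added as described above. This notebook instead uses the base model that matches the chosen BERTRAM model (RoBERTa-Large or BERT-Base-Uncased).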

## Fine-tune setting: Methodology

For the fine-tune setting, we fine-tune the model (with the MWE tokens added) using the training data provided. 

Note that we must replace MWEs occurring in both the training and evaluation data with the tokens we’ve chosen. 

For this setting, we use the same model as in the pre-train setting (multilingual BERT with new tokens added), but additionally fine-tune on the training data provided.


# Setup

Download the “AStitchInLanguageModels” code which we make use of. 


In [None]:
!ls

drive  sample_data


## Download Pre-Trained BERTRAM

In [None]:
import sys
!git clone https://github.com/timoschick/bertram.git
!git clone https://github.com/timoschick/form-context-model.git
!pip install jsonpickle
sys.path.append('bertram/')
sys.path.append('form-context-model/')

fatal: destination path 'bertram' already exists and is not an empty directory.
fatal: destination path 'form-context-model' already exists and is not an empty directory.


In [None]:
!mkdir bertram_models

### Step 2: Get Pretrained BERTRAM Models

Choose **ONE** of the following two options.

#### Option 1: Copy Pretrained BERTRAM from Drive (Recommended)

In [None]:
from google.colab import drive
drive.mount('/content/drive')

# ROBERTA
!cp "drive/Shareddrives/Multilingual Idiods/bertram_models/bertram-add-for-roberta-large.zip" "bertram_models"

# BERT BASE UNCASED
# !cp "drive/Shareddrives/Multilingual Idiods/bertram_models/bertram-add-for-bert-base-uncased.zip" "bertram_models"

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


#### Option 2: Download Directly Via Colab
If this approach does not work, please download the pretrained BERTRAM models from [here](https://github.com/timoschick/bertram#-usage) and copy the zip files to your Colab instance.

In [None]:
# -P saves the archive inside the bertram_models/ directory created above (-O expects a file name, not a directory).
!wget -P 'bertram_models' 'https://www.cis.uni-muenchen.de/~schickt/bertram-add-for-bert-base-uncased.zip'


In [None]:
# -P saves the archive inside the bertram_models/ directory created above (-O expects a file name, not a directory).
!wget -P 'bertram_models' 'https://www.cis.uni-muenchen.de/~schickt/bertram-add-for-roberta-large.zip'



### Step 3: Unzip Pretrained Models

In [None]:
# Bert base uncased
# !unzip "bertram_models/bertram-add-for-bert-base-uncased.zip"

# Roberta
!unzip "bertram_models/bertram-add-for-roberta-large.zip"

Archive:  bertram_models/bertram-add-for-roberta-large.zip
   creating: bertram-add-for-roberta-large/
  inflating: bertram-add-for-roberta-large/bertram_config.json  
  inflating: bertram-add-for-roberta-large/config.json  
  inflating: bertram-add-for-roberta-large/input_processor.json  
  inflating: bertram-add-for-roberta-large/input_processor.json.vocab  
  inflating: bertram-add-for-roberta-large/pytorch_model.bin  


## Download Helper Code

In [None]:
!git clone https://github.com/H-TayyarMadabushi/AStitchInLanguageModels.git

Cloning into 'AStitchInLanguageModels'...
remote: Enumerating objects: 1030, done.
remote: Counting objects: 100% (17/17), done.
remote: Compressing objects: 100% (13/13), done.
remote: Total 1030 (delta 11), reused 4 (delta 4), pack-reused 1013
Receiving objects: 100% (1030/1030), 79.59 MiB | 12.03 MiB/s, done.
Resolving deltas: 100% (394/394), done.


Install the modified version of Sentence Transformers which can handle the additional MWE tokens. 

Sentence Transformers provide a way of generating sentence embeddings the semantic similarity of which can be compared using cosine similarity. 


In [None]:
%cd AStitchInLanguageModels/dependencies/sentence-transformers
!pip install -e . 
%cd /content/

/content/AStitchInLanguageModels/dependencies/sentence-transformers
Obtaining file:///content/AStitchInLanguageModels/dependencies/sentence-transformers
Collecting transformers<5.0.0,>=3.1.0
  Downloading transformers-4.18.0-py3-none-any.whl (4.0 MB)
Collecting sentencepiece
  Downloading sentencepiece-0.1.96-cp37-cp37m-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.2 MB)
Collecting huggingface-hub<1.0,>=0.1.0
  Downloading huggingface_hub-0.5.1-py3-none-any.whl (77 kB)
Collecting tokenizers!=0.11.3,<0.13,>=0.11.1
  Downloading tokenizers-0.12.1-cp37-cp37m-manylinux_2_12_x86_64.manylinux2010_x86_64.whl (6.6 MB)
Collecting sacremoses
  Downloading sacremoses-0.0.53.tar.gz (880 kB)

Download and install an editable version of Hugging Face Transformers.

In [None]:
!git clone https://github.com/huggingface/transformers.git
%cd transformers/
!pip install --editable .
%cd /content/

Cloning into 'transformers'...
remote: Enumerating objects: 94094, done.
remote: Counting objects: 100% (382/382), done.
remote: Compressing objects: 100% (182/182), done.
remote: Total 94094 (delta 242), reused 303 (delta 188), pack-reused 93712
Receiving objects: 100% (94094/94094), 86.12 MiB | 11.45 MiB/s, done.
Resolving deltas: 100% (69010/69010), done.
/content/transformers
Obtaining file:///content/transformers
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
    Preparing wheel metadata ... done
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.18.0
    Uninstalling transformers-4.18.0:
      Successfully uninstalled transformers-4.18.0
  Running setup.py develop for transformers
Successfully installed transformers
/content


In [None]:
!pip install datasets
!pip install transformers
!pip install sentence_transformers

Collecting datasets
  Downloading datasets-2.2.0-py3-none-any.whl (342 kB)

In [None]:
# # Reproducibility/
# !wget 'https://github.com/huggingface/transformers/archive/refs/tags/v4.7.0.zip'
# !unzip v4.7.0.zip
# %cd transformers-4.7.0/
# !pip install --editable .
# %cd /content/ 

Required for the datasets we use.

Download the Task data and evaluation scripts

In [None]:
!git clone https://github.com/H-TayyarMadabushi/SemEval_2022_Task2-idiomaticity.git

Cloning into 'SemEval_2022_Task2-idiomaticity'...
remote: Enumerating objects: 123, done.
remote: Counting objects: 100% (123/123), done.
remote: Compressing objects: 100% (106/106), done.
remote: Total 123 (delta 48), reused 61 (delta 15), pack-reused 0
Receiving objects: 100% (123/123), 2.50 MiB | 7.16 MiB/s, done.
Resolving deltas: 100% (48/48), done.


Editable install requires runtime restart unless we do this. 

In [None]:
import site
site.main()

# Imports and Helper functions

In [None]:
import re
import os
import sys
import csv
import gzip
import math
import torch
import random
import numpy as np

from datetime                         import datetime
from torch.utils.data                 import DataLoader
from sklearn.metrics.pairwise         import paired_cosine_distances

from datasets                         import load_dataset
from transformers                     import AutoModelForMaskedLM
from transformers                     import AutoTokenizer

from sentence_transformers            import SentenceTransformer,  LoggingHandler, losses, models, util
from sentence_transformers.readers    import InputExample
from sentence_transformers.evaluation import EmbeddingSimilarityEvaluator

sys.path.append( '/content/SemEval_2022_Task2-idiomaticity/SubTaskB/' )
from SubTask2Evaluator                import evaluate_submission


In [None]:
def load_csv( path ) : 
  header = None
  data   = list()
  with open( path, encoding='utf-8') as csvfile:
    reader = csv.reader( csvfile ) 
    for row in reader : 
      if header is None : 
        header = row
        continue
      data.append( row ) 
  return header, data


We choose to create a single token for each MWE using this function. It must be used consistently when adding tokens to models and when tokenising the evaluation and training data.

In [None]:
def tokenise_idiom( phrase ) :
  return 'ID' + re.sub( r'[\s|-]', '', phrase ).lower() + 'ID'
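
To make the token format concrete, the sketch below applies the same rule to a few phrases. The function body is copied from the cell above; the example phrases are chosen for illustration and are not taken from the task data.

```python
import re

def tokenise_idiom(phrase):
    # Same rule as the notebook: drop whitespace, '|' and '-', lowercase, wrap in 'ID'.
    return 'ID' + re.sub(r'[\s|-]', '', phrase).lower() + 'ID'

# Illustrative phrases (assumed, not drawn from the dataset).
examples = ['ancient history', 'High-End', 'colégio militar']
tokens = [tokenise_idiom(p) for p in examples]

for phrase, token in zip(examples, tokens):
    print(phrase, '->', token)
```

Because spaces and hyphens are both removed, spelling variants such as `high end` and `high-end` map to the same token.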

Set seed to ensure reproducibility. 

In [None]:
def is_torch_available() :
    try:
        import torch
        return True
    except ImportError:
        return False

def is_tf_available() :
    try:
        import tensorflow as tf
        return True
    except ImportError:
        return False

def set_seed(seed: int):
    """
    Modified from : https://github.com/huggingface/transformers/blob/master/src/transformers/trainer_utils.py
    Helper function for reproducible behavior to set the seed in ``random``, ``numpy``, ``torch`` and/or ``tf`` (if
    installed).
    Args:
        seed (:obj:`int`): The seed to set.
    """
    random.seed(seed)
    np.random.seed(seed)
    if is_torch_available():
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # ^^ safe to call this function even if cuda is not available

        ## From https://pytorch.org/docs/stable/notes/randomness.html
        torch.backends.cudnn.benchmark = False

        ## Might want to use the following, but set CUBLAS_WORKSPACE_CONFIG=:16:8
        # try : 
        #   torch.use_deterministic_algorithms(True)
        # except AttributeError: 
        #   torch.set_deterministic( True )
        
    if is_tf_available():
        import tensorflow as tf
        tf.random.set_seed(seed)
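
The effect of seeding can be checked in isolation. The sketch below uses only the standard-library `random` module (the `draw` helper is hypothetical); `set_seed` above applies the same idea to `numpy`, `torch` and, if installed, TensorFlow.

```python
import random

def draw(seed, n=3):
    """Seed the generator, then draw n pseudo-random floats."""
    random.seed(seed)
    return [random.random() for _ in range(n)]

first = draw(4)
second = draw(4)  # same seed, so the same sequence
other = draw(5)   # different seed, so a different sequence

print(first == second, first == other)
```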



In [None]:
def write_csv( data, location ) : 
  with open( location, 'w', encoding='utf-8') as csvfile:
    writer = csv.writer( csvfile ) 
    writer.writerows( data ) 
  print( "Wrote {}".format( location ) ) 
  return


# Setup models and parameters

In [None]:
seed = 4 ## Found using 5 different seeds - specific to this experiment. 
set_seed( seed ) 

In [None]:
data_location = '/content/SemEval_2022_Task2-idiomaticity/SubTaskB/EvaluationData/'

In [None]:
outpath = '/content/models/'

In [None]:
dev_location                = os.path.join( data_location, 'dev.csv'                     ) 
eval_location               = os.path.join( data_location, 'eval.csv'                    ) 
dev_formated_file_location  = os.path.join( data_location, 'dev.submission_format.csv'   ) 
eval_formated_file_location = os.path.join( data_location, 'eval.submission_format.csv'   ) 

In [None]:
## WARNING: We filter everything based on this (SemEval Task 2 requires that ALL languages are included) 
languages = ['EN'] 

In [None]:
## Save tmp model here.  
outdir = os.path.join( outpath, 'mBERT' + '-' + str( seed ) ) 
## Save initial Sent Trans model here. 
sent_trans_path  = os.path.join( outpath, 'tokenizedSentTrans_' + str( seed ) )  
## Save final trained model here. 
model_save_path  = os.path.join( outpath, 'tokenizedSentTransNoPreTrain_' + str( seed ) ) 

# Setting: pre-train (No Fine-Tuning)

## Adding Idiom Tokens

### Extract idioms from Dev and Eval splits

We need this so we know which MWEs to tokenize 

In [None]:
idioms = list()
for data_split in [ 'dev', 'eval' ] : 
    file_path = os.path.join( data_location, data_split + '.csv' )
    header, data = load_csv( file_path )
    for elem in data : 
        if not elem[ header.index( 'Language' ) ] in languages :
            continue
        idioms.append( elem[ header.index( 'MWE1' ) ] )
        idioms.append( elem[ header.index( 'MWE2' ) ] )
  
idioms = list( set( idioms ) ) 
idioms.remove( 'None' ) 

print( "Found a total of {} idioms".format( len( idioms ) ) )

idioms = [ tokenise_idiom( i ) for i in idioms ]

Found a total of 60 idioms


### Download and tokenize model

The base model must match the pre-trained BERTRAM model being used: either BERT-Base-Uncased or RoBERTa-Large (selected below). (This applies to English only.)

In [None]:
# If using bert-base-uncased
# model_checkpoint = 'bert-base-uncased'

# If using roberta-large
model_checkpoint = 'roberta-large'
  
model = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
model.save_pretrained( outdir )

tokenizer = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False, truncation=True)
tokenizer.save_pretrained( outdir )

print( "Wrote to: ", outdir, flush=True )

Wrote to:  /content/models/mBERT-4


In [None]:
model          = AutoModelForMaskedLM.from_pretrained(model_checkpoint)
tokenizer      = AutoTokenizer.from_pretrained(model_checkpoint, use_fast=False, truncation=True)
old_len        = len( tokenizer )
num_added_toks = tokenizer.add_tokens( idioms ) 
print( "Old tokenizer length was {}. Added {} new tokens. New length is {}.".format( old_len, num_added_toks, len( tokenizer ) )  ) 
model.resize_token_embeddings(len(tokenizer))

model.save_pretrained    ( outdir )
tokenizer.save_pretrained( outdir )


Old tokenizer length was 50265. Added 60 new tokens. New length is 50325.


('/content/models/mBERT-4/tokenizer_config.json',
 '/content/models/mBERT-4/special_tokens_map.json',
 '/content/models/mBERT-4/vocab.json',
 '/content/models/mBERT-4/merges.txt',
 '/content/models/mBERT-4/added_tokens.json')

In [None]:
## Make sure this worked. 
print( tokenizer.tokenize('This is a IDancienthistoryID'), flush=True )
print( tokenizer.tokenize( 'This is a IDcolégiomilitarID' ) )

['This', 'Ġis', 'Ġa', 'IDancienthistoryID']
['This', 'Ġis', 'Ġa', 'ĠID', 'col', 'Ã©', 'gi', 'om', 'ilit', 'ar', 'ID']


## BERTRAM Embeddings

In [None]:
data_location = '/content/SemEval_2022_Task2-idiomaticity/SubTaskB/EvaluationData/'
outpath = '/content/models/'
dev_location                = os.path.join( data_location, 'dev.csv'                     ) 
eval_location               = os.path.join( data_location, 'eval.csv'                    ) 
dev_formated_file_location  = os.path.join( data_location, 'dev.submission_format.csv'   ) 
eval_formated_file_location = os.path.join( data_location, 'eval.submission_format.csv'   ) 
## Save tmp model here.  
outdir = os.path.join( outpath, 'mBERT' + '-' + str( 4 ) )

In [None]:
idioms = list()
for data_split in [ 'dev', 'eval' ] : 
  file_path = os.path.join( data_location, data_split + '.csv' )
  header, data = load_csv( file_path )
  for elem in data : 
    if elem[ header.index( 'Language' ) ] =='EN' :
      idioms.append( elem[ header.index( 'MWE1' ) ] )
      idioms.append( elem[ header.index( 'MWE2' ) ] )

idioms = list( set( idioms ) ) 
idioms.remove( 'None' )

In [None]:
from bertram import BertramWrapper

# If using bert-base-uncased
# bertram = BertramWrapper('bertram-add-for-bert-base-uncased', device='cpu')

# If using roberta-large
bertram = BertramWrapper('bertram-add-for-roberta-large', device='cuda')

# -----

idioms = [ tokenise_idiom(i) for i in idioms ]
words_with_contexts = {}
for i in idioms:
  words_with_contexts[i] = []

embeddings={word:bertram.infer_vector(word, contexts) for word, contexts in words_with_contexts.items()}

2022-05-11 01:18:02,629 - INFO - ngram_models - Found 94601 ngrams with min count 4 and (nmin,nmax)=(3,5), first 10: ['UNK', 'PAD', 'ing', 'ed<S>', 'ng<S>', 'es<S>', 'er<S>', 'ing<S>', 'on<S>', '<S>co'], last 10: [':40', ':40<S>', 'beur', '<S>ette', 'ion-r', '<S>yot', 'eora', 'eora<S>', 'kowo', 'kowo<S>']
Some weights of the model checkpoint at bertram-add-for-roberta-large were not used when initializing BertramForRoberta: ['shallow_combination.linear.weight', 'sep_linear.weight', 'mask_linear.bias', 'roberta.embeddings.word_embeddings.embedding.weight', 'sep_linear.bias', 'shallow_combination.linear.bias', 'mask_linear.weight']
- This IS expected if you are initializing BertramForRoberta from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertramForRoberta from the checkpoint of a model that you expect to be exactly 

In [None]:
special_tokens=[f"<BERTRAM:{word}>" for word in embeddings.keys()]
tokenizer.add_special_tokens({"additional_special_tokens": special_tokens})

60
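
Step 0 of the architecture amounts to giving each idiom token its own row in the model's embedding matrix and filling that row with the inferred vector. The sketch below illustrates only that bookkeeping with plain Python lists. The `add_vectors` helper, the toy vocabulary and the two-dimensional vectors are all hypothetical, and the code does not call BERTRAM or the Transformers API.

```python
def add_vectors(vocab, matrix, new_vectors):
    """Give every token in new_vectors its own row in the embedding matrix.

    vocab maps token -> row index; matrix is a list of equal-length rows.
    Unknown tokens get a new row appended; known tokens are overwritten.
    """
    dim = len(matrix[0])
    for token, vector in new_vectors.items():
        if len(vector) != dim:
            raise ValueError("vector for {!r} has the wrong dimension".format(token))
        if token not in vocab:
            vocab[token] = len(matrix)
            matrix.append(list(vector))
        else:
            matrix[vocab[token]] = list(vector)
    return vocab, matrix

# Toy vocabulary and embedding matrix (assumed values, for illustration only).
vocab = {'this': 0, 'is': 1}
matrix = [[0.1, 0.2], [0.3, 0.4]]

# Stand-in for a vector inferred for one idiom token.
inferred = {'<BERTRAM:IDancienthistoryID>': [0.5, 0.6]}
add_vectors(vocab, matrix, inferred)

print(vocab['<BERTRAM:IDancienthistoryID>'], matrix[-1])
```

In a real model the same two steps apply: the vocabulary grows by one entry per idiom, and the embedding matrix must grow by the same number of rows before the vectors are copied in.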

## Creating Sentence Transformers model

This will ensure that the output sentence embeddings can be compared using cosine similarity.

### Start by preparing data. 

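Here we load the STS training, dev and test data for every language listed in `languages` (the English STS Benchmark, plus the Portuguese ASSIN2 dataset if 'PT' is included). We use the dev and test splits to test the resultant sentence transformers models.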

**WARNING**: You must NOT train using the development or test sections of either of these datasets.


In [None]:
## For details and source see https://github.com/H-TayyarMadabushi/AStitchInLanguageModels/blob/main/Dataset/Task2/sentenceTransformers/training_stsbenchmark.py

sts_dataset_path = '/content/datasets/stsbenchmark.tsv.gz'
if not os.path.exists(sts_dataset_path):
    util.http_get('https://sbert.net/datasets/stsbenchmark.tsv.gz', sts_dataset_path)

  0%|          | 0.00/392k [00:00<?, ?B/s]

In [None]:
  ## For details and source see https://github.com/H-TayyarMadabushi/AStitchInLanguageModels/blob/main/Dataset/Task2/sentenceTransformers/training_stsbenchmark.py

  sts_dataset_path = os.path.join( outpath, 'datasets', 'stsbenchmark.tsv.gz' )
  if not os.path.exists(sts_dataset_path):
    util.http_get('https://sbert.net/datasets/stsbenchmark.tsv.gz', sts_dataset_path)

  train_samples = []
  dev_samples   = []
  test_samples  = []
  if 'EN' in languages : 
    with gzip.open(sts_dataset_path, 'rt', encoding='utf8') as fIn:
      reader = csv.DictReader(fIn, delimiter='\t', quoting=csv.QUOTE_NONE)
      for row in reader:
          score = float(row['score']) / 5.0  # Normalize score to range 0 ... 1
          inp_example = InputExample(texts=[row['sentence1'], row['sentence2']], label=score)

          if row['split'] == 'dev':
              dev_samples.append(inp_example)
          elif row['split'] == 'test':
              test_samples.append(inp_example)
          else:
              train_samples.append(inp_example)
              
  if 'PT' in languages : 
    for split in [ 'train', 'validation', 'test' ] :
      dataset = load_dataset( 'assin2', split=split )
      for elem in dataset :
        ## {'entailment_judgment': 1, 'hypothesis': 'Uma criança está segurando uma pistola de água', 'premise': 'Uma criança risonha está segurando uma pistola de água e sendo espirrada com água', 'relatedness_score': 4.5, 'sentence_pair_id': 1}
          score = float( elem['relatedness_score'] ) / 5.0 # Normalize score to range 0 ... 1
          inp_example = InputExample(texts=[elem['hypothesis'], elem['premise']], label=score)
          if split == 'validation':
            dev_samples.append(inp_example)
          elif split == 'test':
            test_samples.append(inp_example)
          elif split == 'train' :
            train_samples.append(inp_example)
          else :
              raise Exception( "Unknown split. Should be one of ['train', 'test', 'validation']." )
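
The score normalisation and split handling can be tried on a tiny in-memory sample. This is a minimal sketch: `sample_tsv` is a hypothetical three-row file that contains only the four columns the cell above reads (`split`, `score`, `sentence1`, `sentence2`), and plain tuples stand in for `InputExample`.

```python
import csv
import io

# Hypothetical sample, restricted to the columns the notebook reads.
sample_tsv = (
    "split\tscore\tsentence1\tsentence2\n"
    "train\t5.0\tA man is playing a guitar.\tA man plays the guitar.\n"
    "dev\t2.5\tA dog runs.\tA cat sleeps.\n"
    "test\t0.0\tIt is raining.\tShe reads a book.\n"
)

splits = {"train": [], "dev": [], "test": []}
reader = csv.DictReader(io.StringIO(sample_tsv), delimiter="\t", quoting=csv.QUOTE_NONE)
for row in reader:
    score = float(row["score"]) / 5.0  # map the 0..5 gold score onto 0..1
    # As in the notebook, anything that is not dev or test is treated as training data.
    bucket = row["split"] if row["split"] in ("dev", "test") else "train"
    splits[bucket].append((row["sentence1"], row["sentence2"], score))

for name, rows in splits.items():
    print(name, rows)
```

Scaling the labels to the 0..1 range keeps them comparable with the cosine similarity that the model is trained to output.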

 

  0%|          | 0.00/392k [00:00<?, ?B/s]

### Train Tokenized Sentence Transformer Model

In [None]:
model_path = outdir

word_embedding_model = models.Transformer(model_path)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                                 pooling_mode_mean_tokens=True,
                                 pooling_mode_cls_token=False,
                                 pooling_mode_max_tokens=False)
  
tokenizer      = AutoTokenizer.from_pretrained(
    model_path             , 
    use_fast       = False ,
    max_length     = 510   ,
    force_download = True
  )
  
model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
model._first_module().tokenizer = tokenizer

model.save( sent_trans_path )


Some weights of the model checkpoint at /content/models/mBERT-4 were not used when initializing RobertaModel: ['lm_head.layer_norm.weight', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.dense.weight', 'lm_head.bias']
- This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of RobertaModel were not initialized from the model checkpoint at /content/models/mBERT-4 and are newly initialized: ['roberta.pooler.dense.weight', 'roberta.pooler.dense.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.


In [None]:
train_batch_size = 4
num_epochs       = 4
    
train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
train_loss       = losses.CosineSimilarityLoss(model=model)
    
evaluator        = EmbeddingSimilarityEvaluator.from_input_examples(dev_samples, name='sts-dev')

# Configure the training. 
warmup_steps     = math.ceil(len(train_dataloader) * num_epochs  * 0.1) #10% of train data for warm-up
print("Warmup-steps: {}".format(warmup_steps), flush=True)

# Train the model
  
model.fit(train_objectives=[(train_dataloader, train_loss)],
            evaluator=evaluator,
            epochs=num_epochs,
            evaluation_steps=1000,
            warmup_steps=warmup_steps,
            output_path=model_save_path
  )

test_evaluator = EmbeddingSimilarityEvaluator.from_input_examples(test_samples, name='sts-test')
test_evaluator(model, output_path=model_save_path)

model_path = model_save_path

Warmup-steps: 576
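
The warm-up value can be reproduced by hand. The sketch below assumes a training split of 5749 sentence pairs (an assumption about the STS Benchmark training data, not a value printed by the notebook) and otherwise uses the batch size and epoch count set in the cell above.

```python
import math

num_train_samples = 5749  # assumption: size of the STS Benchmark training split
train_batch_size = 4
num_epochs = 4

# A DataLoader that keeps the last partial batch yields ceil(n / batch_size) batches.
steps_per_epoch = math.ceil(num_train_samples / train_batch_size)
total_steps = steps_per_epoch * num_epochs

# 10% of all optimisation steps are used for learning-rate warm-up.
warmup_steps = math.ceil(total_steps * 0.1)

print(steps_per_epoch, total_steps, warmup_steps)
```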




Epoch:   0%|          | 0/4 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1438 [00:00<?, ?it/s]

## Load/Save from/to Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


In [None]:
## Save
!mkdir -p /content/gdrive/MyDrive/ColabData/SemEval2022Task2/SubtaskB/tokenizedSentTransNoPreTrain/
!cp -r /content/models/tokenizedSentTransNoPreTrain_4/* /content/gdrive/MyDrive/ColabData/SemEval2022Task2/SubtaskB/tokenizedSentTransNoPreTrain/

In [None]:
sent_trans_path 

'/content/models/tokenizedSentTrans_4'

In [None]:
## Load
#!mkdir -p /content/models/tokenizedSentTransNoPreTrain_4
#!cp -r /content/gdrive/MyDrive/ColabData/SemEval2022Task2/SubtaskB/tokenizedSentTransNoPreTrain/* /content/models/tokenizedSentTransNoPreTrain_4 

In [None]:
model_path = model_save_path

## Generate submission File and Evaluate 

These functions compute the semantic text similarity, using Sentence Transformers, between sentences containing MWEs.

To do this we first replace all instances of MWEs in the input sentences with single tokens and then use Sentence Transformers. 


In [None]:
def prepare_eval_data( location, languages, test_print=False ) :
    header, data = load_csv( location )
    sentence1s = list()
    sentence2s = list()
    for elem in data : 
        if not languages is None and not elem[ header.index( 'Language' ) ] in languages : 
            continue
        sentence1 = elem[ header.index( 'sentence1' ) ] 
        sentence2 = elem[ header.index( 'sentence2' ) ] 
        mwe1      = elem[ header.index( 'MWE1'      ) ] 
        mwe2      = elem[ header.index( 'MWE2'      ) ] 

        if test_print : 
            print( sentence1 ) 
            print( sentence2 ) 
            print( mwe1 ) 
            print( mwe2 ) 

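        if mwe1 != 'None' : 
            ## re.escape guards against MWEs that contain regex metacharacters. 
            replaced = re.sub( re.escape( mwe1 ), tokenise_idiom( mwe1 ), sentence1, flags=re.I)
            assert replaced != sentence1
            sentence1 = replaced
        if mwe2 != 'None' : 
            ## Match on mwe2 (not mwe1) when rewriting sentence2. 
            replaced = re.sub( re.escape( mwe2 ), tokenise_idiom( mwe2 ), sentence2, flags=re.I)
            assert replaced != sentence2
            sentence2 = replaced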

        if test_print : 
            print( sentence1 ) 
            print( sentence2 ) 
            break

        sentence1s.append( sentence1 ) 
        sentence2s.append( sentence2 ) 

    return sentence1s, sentence2s


def get_similarities( location, model, languages=None ) : 
    sentences1, sentences2 = prepare_eval_data( location, languages ) 

    #Compute embedding for both lists
    embeddings1 = model.encode(sentences1, show_progress_bar=True, convert_to_numpy=True)
    embeddings2 = model.encode(sentences2, show_progress_bar=True, convert_to_numpy=True)

    # Compute cosine similarities
    cosine_scores = 1 - (paired_cosine_distances(embeddings1, embeddings2))

    return cosine_scores
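
The replacement step can be tried on a single sentence. This is a minimal sketch: `replace_mwe` is a hypothetical helper that mirrors the substitution done inside `prepare_eval_data`, and the example sentence is made up.

```python
import re

def tokenise_idiom(phrase):
    # Same rule as the notebook's helper.
    return 'ID' + re.sub(r'[\s|-]', '', phrase).lower() + 'ID'

def replace_mwe(sentence, mwe):
    """Replace every case-insensitive occurrence of mwe with its single token."""
    if mwe == 'None':  # the task data marks "no MWE" with the string 'None'
        return sentence
    replaced = re.sub(re.escape(mwe), tokenise_idiom(mwe), sentence, flags=re.I)
    if replaced == sentence:
        raise ValueError("MWE {!r} not found in sentence".format(mwe))
    return replaced

with_mwe = replace_mwe('That scandal is Ancient History now.', 'ancient history')
without_mwe = replace_mwe('Nothing to replace here.', 'None')

print(with_mwe)
print(without_mwe)
```

Matching is case-insensitive, so a capitalised occurrence in the sentence is still replaced by the lowercase token that was added to the tokenizer.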


In [None]:
model_path

'/content/models/tokenizedSentTransNoPreTrain_4'

In [None]:
model      = SentenceTransformer( model_path )

The following generates the similarities that we require for the development and evaluation splits.

In [None]:
dev_sims  = get_similarities( dev_location , model, languages ) 
eval_sims = get_similarities( eval_location, model, languages )

The following function creates a submission file with the predictions generated. 

Note that we set it up so we can load up results for only one setting. 

It requires as input the submission format file, which is available with the data. 


In [None]:
def insert_to_submission( languages, settings, sims, location ) : 
    header, data = load_csv( location ) 
    sims = list( reversed( sims ) )
    ## Validate with length
    updatable = [ i for i in data if i[ header.index( 'Language' ) ] in languages and i[ header.index( 'Setting' ) ] in settings ]
    assert len( updatable ) == len( sims ) 

    ## Will update in sequence - if data is not in sequence must update one language / setting at a time. 
    started_update = False
    for elem in data : 
        if elem[ header.index( 'Language' ) ] in languages and elem[ header.index( 'Setting' ) ] in settings : 
            sim_to_insert = sims.pop()
            elem[-1] = sim_to_insert
            started_update = True
        else :  
            assert not started_update ## Once we start, we must complete. 
        if len( sims ) == 0 : 
            break 
    assert len( sims ) == 0 ## Should be done here. 

    return [ header ] + data ## Submission file must retain header. 
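
The core of this function is an in-order fill of the last column for the rows that match the chosen language and setting. The sketch below shows that idea on toy data. The `fill_similarities` helper, the column names in `header` and the row values are hypothetical stand-ins for the real submission format file.

```python
def fill_similarities(header, rows, sims, languages, settings):
    """Write sims, in order, into the last column of the matching rows."""
    lang_col = header.index('Language')
    setting_col = header.index('Setting')
    targets = [r for r in rows
               if r[lang_col] in languages and r[setting_col] in settings]
    if len(targets) != len(sims):
        raise ValueError("expected {} similarities, got {}".format(len(targets), len(sims)))
    for row, sim in zip(targets, sims):
        row[-1] = sim
    return [header] + rows

# Toy submission-format data (assumed layout, for illustration only).
header = ['ID', 'Language', 'Setting', 'Sim']
rows = [
    ['1', 'EN', 'pre_train', ''],
    ['2', 'EN', 'pre_train', ''],
    ['3', 'PT', 'pre_train', ''],
]
filled = fill_similarities(header, rows, [0.91, 0.37], ['EN'], ['pre_train'])

for line in filled:
    print(line)
```

Unlike `insert_to_submission`, this sketch does not assert that the matching rows form one contiguous block; it only checks that the counts agree.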

For the dev set, we can use the evaluation script and gold labels to see what the results are. 


In [None]:
## Create submission file on the development set. 
submission_data = insert_to_submission( languages, [ 'pre_train' ], dev_sims, dev_formated_file_location )  
results_file    = os.path.join( outpath, 'dev.pre_train_results-' + str( seed ) + '.csv' )
write_csv( submission_data, results_file )

## Evaluate development set. 
results = evaluate_submission( results_file, os.path.join( data_location, 'dev.gold.csv' ) )

## Make results printable. 
for result in results : 
    for result_index in range( 2, 5 ) : 
        result[result_index] = 'Did Not Attempt' if result[result_index] is None else result[ result_index ]
%reload_ext google.colab.data_table
import pandas as pd
df = pd.DataFrame(data=results[1:], columns=results[0])

df

Wrote /content/models/dev.pre_train_results-4.csv


| | Settings | Languages | Spearman Rank ALL | Spearman Rank Idiom Data | Spearman Rank STS Data |
|---|---|---|---|---|---|
| 0 | pre_train | EN | 0.794031 | 0.327618 | 0.861627 |
| 1 | pre_train | PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 2 | pre_train | EN,PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 3 | fine_tune | EN | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 4 | fine_tune | PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 5 | fine_tune | EN,PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |


In [None]:
results_file = os.path.join( outpath, 'RESULTS_TABLE-dev.pre_train_' + str( seed ) + '.csv' )    
write_csv( results, results_file )

Wrote /content/models/RESULTS_TABLE-dev.pre_train_4.csv


### Generate output for the evaluation set
Note that we do not have access to the gold labels for the eval set. These results must be submitted to CodaLab.


In [None]:
# Save
!cp /content/models/RESULTS_TABLE-dev.pre_train_4.csv /content/gdrive/MyDrive/ColabData/SemEval2022Task2/SubtaskB/tokenizedSentTransNoPreTrain/

In [None]:
submission_data = insert_to_submission( languages, [ 'pre_train' ], eval_sims, eval_formated_file_location )  
results_file    = os.path.join( outpath, 'eval.pre_train_results-' + str( seed ) + '.csv' )
write_csv( submission_data, results_file )

Wrote /content/models/eval.pre_train_results-4.csv


# Setting: Fine-Tune

We must start with a model that can already output Semantic Textual Similarity (STS) scores here. 

We use the model created in the previous sections, so we continue to use the MWE tokenization. 
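As a rough illustration of what MWE tokenization means here, the sketch below replaces each occurrence of an MWE with a single token. The helper `toy_tokenise_idiom` and its `ID...ID` token format are hypothetical stand-ins; the notebook's own `tokenise_idiom` may produce a different token.

```python
import re

def toy_tokenise_idiom(mwe):
    # Hypothetical token format: strip spaces and hyphens, lowercase, wrap in "ID".
    return "ID" + re.sub(r"[\s-]+", "", mwe).lower() + "ID"

def replace_mwe(sentence, mwe):
    # re.escape guards against MWEs that contain regex metacharacters.
    return re.sub(re.escape(mwe), toy_tokenise_idiom(mwe), sentence, flags=re.I)

example = replace_mwe("That exam was a Piece of Cake.", "piece of cake")
print(example)
```

Because the match is case-insensitive, capitalised occurrences of the MWE map to the same single token.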



## Generate Training Data

### Helper Functions
We need to perform some preprocessing to generate the required training data. 

Each training pair either comes with an associated similarity or requires us to generate a similarity measure from the alternative sentences provided. 



#### _parse_train_data

The *_parse_train_data* function splits the training data provided into three lists: 
  * The first (train_data_with_labels) is the list of sentence pairs that have an associated similarity measure and so do not require further processing. 

  * The second (train_data_requiring_labels) is the list of sentence pairs that do not have associated similarities.

  * The third (need_predictions_for_train_data_labels) is the list of “associated sentence pairs” which must be used to generate similarities for the sentence pairs in the second list. Since the training data is based on self-consistency, we need to generate similarities between sentences that do not contain MWEs to compare against. 

For more details on the training data, please see the [associated section](https://sites.google.com/view/semeval2022task2-idiomaticity#h.qq7eefmehqf9) on the [task description website](https://sites.google.com/view/semeval2022task2-idiomaticity).
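The three-way split described above can be sketched on a couple of in-memory rows. This is a simplified illustration, not the notebook's function: it skips language filtering and MWE tokenization, and the rows and the name `split_rows` are made up.

```python
def split_rows(rows):
    """Split rows into labelled pairs, unlabelled pairs and their alternative pairs."""
    with_labels, requiring_labels, alternatives = [], [], []
    for row in rows:
        if row["sim"] != "None":
            # A similarity is already provided, so the pair is ready to use.
            with_labels.append([row["sentence_1"], row["sentence_2"], float(row["sim"])])
        else:
            # No similarity: keep the pair and the MWE-free alternatives side by side.
            requiring_labels.append([row["sentence_1"], row["sentence_2"]])
            alternatives.append([row["alternative_1"], row["alternative_2"]])
    return with_labels, requiring_labels, alternatives

# Toy rows (invented for illustration; the real data is loaded from train_data.csv).
toy_rows = [
    {"sentence_1": "It was a piece of cake.", "sentence_2": "It was very easy.",
     "sim": "1", "alternative_1": "None", "alternative_2": "None"},
    {"sentence_1": "It was a piece of cake.", "sentence_2": "It was a slice of dessert.",
     "sim": "None", "alternative_1": "It was very easy.",
     "alternative_2": "It was a slice of dessert."},
]

labelled, unlabelled, alternative_pairs = split_rows(toy_rows)
print(len(labelled), len(unlabelled), len(alternative_pairs))
```

The second and third lists stay index-aligned, which is what lets the predicted similarity of each alternative pair be attached to the matching unlabelled pair later.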


In [None]:
def _parse_train_data( train_data_location, languages, tokenize=True ) :

    header, train_data = load_csv( train_data_location )
    
    train_data_with_labels                 = list()
    train_data_requiring_labels            = list()
    need_predictions_for_train_data_labels = list()

    # ['ID', 'MWE1', 'MWE2', 'Language', 'sentence_1', 'sentence_2', 'sim', 'alternative_1', 'alternative_2']

    skipped = 0 

    for elem in train_data :

        if not elem[ header.index( 'Language' ) ] in languages :
            skipped += 1
            continue

        mwe1          = elem[ header.index( 'MWE1'          ) ] 
        mwe2          = elem[ header.index( 'MWE2'          ) ] 
        
        this_sim      = elem[ header.index( 'sim'           ) ]
        sentence_1    = elem[ header.index( 'sentence_1'    ) ]
        sentence_2    = elem[ header.index( 'sentence_2'    ) ]
        alternative_1 = elem[ header.index( 'alternative_1' ) ]
        alternative_2 = elem[ header.index( 'alternative_2' ) ]

        ## Remove below if you do not want to tokenize with idiom tokens!
        if tokenize : 
            if mwe1 != 'None' : 
                replaced = re.sub( mwe1, tokenise_idiom( mwe1 ), sentence_1, flags=re.I)
                assert replaced != sentence_1
                sentence_1 = replaced
            if mwe2 != 'None' : 
                replaced = re.sub( mwe2, tokenise_idiom( mwe2 ), sentence_2, flags=re.I)
                assert replaced != sentence_2
                sentence_2 = replaced
  
   
        if this_sim != 'None' :
            float( this_sim ) ## Fail early if the label is not numeric. 
            train_data_with_labels.append( [ sentence_1, sentence_2, this_sim ] ) 
            continue
            
        train_data_requiring_labels.append( [ sentence_1, sentence_2 ] ) 
        need_predictions_for_train_data_labels.append( [ alternative_1, alternative_2 ] )

    assert len( need_predictions_for_train_data_labels ) == len( train_data_requiring_labels )
    assert len( train_data ) == len( need_predictions_for_train_data_labels ) + len( train_data_with_labels ) + skipped

    return train_data_with_labels, train_data_requiring_labels, need_predictions_for_train_data_labels 


#### _get_predictions_for_train_data_labels

This function is used to generate similarities between the “associated sentences” (i.e. need_predictions_for_train_data_labels) generated above. 
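The function below turns paired cosine distances into similarities by taking `1 - distance`. A minimal pure-Python sketch of the same quantity, using made-up two-dimensional vectors in place of sentence embeddings, might look like this:

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot    = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def paired_cosine_similarities(embeddings1, embeddings2):
    """Similarity of row i in the first list with row i in the second list."""
    return [cosine_similarity(u, v) for u, v in zip(embeddings1, embeddings2)]

# Toy "embeddings" (assumptions for illustration only).
toy_sims = paired_cosine_similarities(
    [[1.0, 0.0], [1.0, 1.0]],
    [[2.0, 0.0], [1.0, -1.0]],
)
print(toy_sims)
```

The comparison is row-wise rather than all-against-all, so the output has one score per input pair.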

In [None]:
def _get_predictions_for_train_data_labels( model_path, data ) :

    model      = SentenceTransformer( model_path )

    sentences1 = [ i[0] for i in data ]
    sentences2 = [ i[1] for i in data ]

    embeddings1 = model.encode(sentences1, show_progress_bar=True, convert_to_numpy=True)
    embeddings2 = model.encode(sentences2, show_progress_bar=True, convert_to_numpy=True)

    cosine_scores = 1 - (paired_cosine_distances(embeddings1, embeddings2))

    return cosine_scores
  

#### generate_train_data

Finally, this function uses the predictions and the three lists generated above to put together the final training data that we can use to train our model.
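As a small sketch of this assembly step, assuming the unlabelled pairs and the predicted similarities are index-aligned, the final table could be put together as follows. The sentences, the score `0.42` and the helper name `attach_labels` are invented for illustration.

```python
def attach_labels(pairs, sims):
    """Attach one predicted similarity to each unlabelled sentence pair."""
    assert len(pairs) == len(sims)
    return [[s1, s2, float(sim)] for (s1, s2), sim in zip(pairs, sims)]

toy_labelled   = [["A dog barks.", "A dog is barking.", 0.9]]
toy_unlabelled = [["He kicked the bucket.", "He kicked the pail."]]
toy_predicted  = [0.42]  # Stand-in for a model prediction on the alternative pair.

toy_train_data = (
    [["sentence_1", "sentence_2", "sim"]]
    + toy_labelled
    + attach_labels(toy_unlabelled, toy_predicted)
)
print(toy_train_data)
```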


In [None]:
def generate_train_data( train_data_location, model_path, languages ) :
  
    train_data_with_labels, train_data_requiring_labels, need_predictions_for_train_data_labels = _parse_train_data( train_data_location, languages )
    sims = _get_predictions_for_train_data_labels( model_path, need_predictions_for_train_data_labels )

    train_data_requiring_labels_with_labels = list()
    for index in range( len( train_data_requiring_labels ) ) : 
        train_data_requiring_labels_with_labels.append( [ train_data_requiring_labels[index][0], train_data_requiring_labels[index][1], sims[index] ] )

    train_data = [ [ 'sentence_1', 'sentence_2', 'sim' ] ] + train_data_with_labels + train_data_requiring_labels_with_labels
    assert all( [ (len(i) == 3) for i in train_data ] )
    
    return train_data


### Generic Helper functions (Same as in Pre-Train Setting)

These helper functions are the same as above - they are included in the fine-tune setting for completeness. 


In [None]:
def get_similarities( location, model, languages=None ) : 
    sentences1, sentences2 = prepare_eval_data( location, languages ) 
    #Compute embedding for both lists
    
    embeddings1 = model.encode(sentences1, show_progress_bar=True, convert_to_numpy=True)
    embeddings2 = model.encode(sentences2, show_progress_bar=True, convert_to_numpy=True)

    # Compute cosine similarities
    cosine_scores = 1 - (paired_cosine_distances(embeddings1, embeddings2))

    return cosine_scores

def insert_to_submission( languages, settings, sims, location ) : 
    header, data = load_csv( location ) 
    sims = list( reversed( sims ) )
    ## Validate with length
    updatable = [ i for i in data if i[ header.index( 'Language' ) ] in languages and i[ header.index( 'Setting' ) ] in settings ]
    assert len( updatable ) == len( sims ) 

    ## Will update in sequence - if data is not in sequence must update one language / setting at a time. 
    started_update = False
    for elem in data : 
        if elem[ header.index( 'Language' ) ] in languages and elem[ header.index( 'Setting' ) ] in settings : 
            sim_to_insert = sims.pop()
            elem[-1] = sim_to_insert
            started_update = True
        else :  
            assert not started_update ## Once we start, we must complete. 
        if len( sims ) == 0 : 
            break 
    assert len( sims ) == 0 ## Should be done here. 

    return [ header ] + data ## Submission file must retain header. 



### Parameters and Data Generation

In [None]:
!ls /content/models

datasets		      RESULTS_TABLE-dev.pre_train_4.csv
dev.pre_train_results-4.csv   tokenizedSentTrans_4
eval.pre_train_results-4.csv  tokenizedSentTransNoPreTrain_4
mBERT-4


In [None]:
sent_trans_path 

'/content/models/tokenizedSentTrans_4'

In [None]:
seed   = 1 ## Found using multiple runs
epochs = 1 ## Found using multiple runs

In [None]:
best_pre_train_seed = 4 ## Found this by running above (as in pre-train setting) multiple times. 

train_data_location = 'SemEval_2022_Task2-idiomaticity/SubTaskB/TrainData/train_data.csv'
out_location        = 'models/FineTune/'

model_path = sent_trans_path
train_data = generate_train_data( train_data_location, model_path, languages )

Batches:   0%|          | 0/64 [00:00<?, ?it/s]

Batches:   0%|          | 0/64 [00:00<?, ?it/s]

## Train Model

### Train Function

This function fine-tunes the model for a particular seed. 

If no epoch is passed, it trains for nine epochs one at a time, printing and writing out the dev results after each. Note that it returns the results of the final epoch (not the best one). 

If the best epoch is known, it can be passed to the function and the model will be trained for that many epochs. 

Notice that we cannot use the default evaluator. Instead, we write out the results, create a submission file of the required format and then use the evaluation script for the subtask. 

We also write out the predictions for the evaluation set, which can be submitted to CodaLab. 
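The training function sets the number of warm-up steps to 10% of the total training steps. The arithmetic can be sketched as below; the helper name `warmup_steps_for` and the example values (1182 batches per epoch, 5 epochs) are assumptions chosen for illustration.

```python
import math

def warmup_steps_for(batches_per_epoch, epochs, fraction=0.1):
    """Warm-up steps as a fraction of all training steps, rounded up."""
    return math.ceil(batches_per_epoch * epochs * fraction)

# Illustrative values (assumptions): 1182 batches per epoch, 5 epochs.
steps = warmup_steps_for(1182, 5)
print("Warmup-steps: {}".format(steps))
```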



In [None]:
def create_and_eval_subtask_b_fine_tune( 
    model_path, 
    seed, 
    data_location, 
    dev_formated_file_location,
    eval_formated_file_location,
    train_data, 
    out_location, 
    languages, 
    epoch=None 
    ):

    set_seed( seed )
    
    dev_location                = os.path.join( data_location, 'dev.csv'                     ) 
    eval_location               = os.path.join( data_location, 'eval.csv'                    ) 


    ## Training Dataloader
    train_samples = list()

    header     = train_data[0] ## ['sentence_1', 'sentence_2', 'sim']
    train_data = train_data[1:]
    for elem in train_data :
        score = float( elem[2] ) 
        inp_example = InputExample(texts=[elem[0], elem[1]], label=score)
        train_samples.append(inp_example)


    ## Params
    train_batch_size = 4
        
    model            = SentenceTransformer( model_path )
    train_dataloader = DataLoader(train_samples, shuffle=True, batch_size=train_batch_size)
    train_loss       = losses.CosineSimilarityLoss(model=model)
  
    ## Train the model
    dev_sims = eval_sims = results = None
    if epoch is None :
        ## Going to test all epochs - notice we can't use the default evaluator. 
        for epoch in range( 1, 10 ) :
            warmup_steps     = math.ceil(len(train_dataloader) * epoch  * 0.1) #10% of train data for warm-up
            print("Warmup-steps: {}".format(warmup_steps), flush=True)
            
            model_save_path = os.path.join( out_location, str( seed ), str( epoch ) ) 
            model.fit(train_objectives=[(train_dataloader, train_loss)],
                        evaluator=None,
                        epochs=1,
                        evaluation_steps=0,
                        warmup_steps=warmup_steps,
                        output_path=model_save_path
            )

            dev_sims  = get_similarities( dev_location , model, languages ) 
            eval_sims = get_similarities( eval_location, model, languages )

            ## Create submission file on the development set. 
            submission_data = insert_to_submission( languages, [ 'fine_tune' ], dev_sims, dev_formated_file_location )  
            results_file    = os.path.join( outpath, 'dev.combined_results-' + str( seed ) + '.csv' )
            write_csv( submission_data, results_file )

            ## Evaluate development set. 
            results = evaluate_submission( results_file, os.path.join( data_location, 'dev.gold.csv' ) )

            ## Make results printable. 
            for result in results : 
                for result_index in range( 2, 5 ) : 
                    result[result_index] = 'Did Not Attempt' if result[result_index] is None else result[ result_index ]

            for row in results : 
                print( '\t'.join( [str(i) for i in row ] ) )
            
            results_file = os.path.join( model_save_path, 'RESULTS_TABLE-dev.pre_train_' + str(epoch) + str( seed ) + '.csv' )    
            write_csv( results, results_file )      

            ## Generate combined output for this epoch.
            submission_data = insert_to_submission( languages, [ 'fine_tune' ], eval_sims, eval_formated_file_location )  
            results_file    = os.path.join( outpath, 'eval.combined_results-' + str( seed ) + '_' + str( epoch ) + '.csv' )
            write_csv( submission_data, results_file )

 
    else :
        ## We already know the best epoch and so will use it.
        warmup_steps     = math.ceil(len(train_dataloader) * epoch  * 0.1) #10% of train data for warm-up
        print("Warmup-steps: {}".format(warmup_steps), flush=True)

        model_save_path = os.path.join( out_location, str( seed ), str( epoch ) ) 
        model.fit(train_objectives=[(train_dataloader, train_loss)],
                evaluator=None,
                epochs=epoch,
                evaluation_steps=0,
                warmup_steps=warmup_steps,
                output_path=model_save_path
        )
    
    dev_sims  = get_similarities( dev_location , model, languages ) 
    eval_sims = get_similarities( eval_location, model, languages )

    ## Create submission file on the development set. 
    submission_data = insert_to_submission( languages, [ 'fine_tune' ], dev_sims, dev_formated_file_location )  
    results_file    = os.path.join( outpath, 'dev.combined_results-' + str( seed ) + '.csv' )
    write_csv( submission_data, results_file )
    
    ## Evaluate development set. 
    results = evaluate_submission( results_file, os.path.join( data_location, 'dev.gold.csv' ) )
    
    ## Make results printable. 
    for result in results : 
        for result_index in range( 2, 5 ) : 
            result[result_index] = 'Did Not Attempt' if result[result_index] is None else result[ result_index ]
  
    results_file = os.path.join( model_save_path, 'RESULTS_TABLE-dev.pre_train_' + str(epoch) + str( seed ) + '.csv' )    
    write_csv( results, results_file )
    
    submission_data = insert_to_submission( languages, [ 'fine_tune' ], eval_sims, os.path.join( data_location, 'eval.submission_format.csv'   )  )  
    results_file    = os.path.join( outpath, 'eval.fine_tune_results-' + str( seed ) + '.csv' )
    write_csv( submission_data, results_file )

    submission_data = insert_to_submission( languages, [ 'fine_tune' ], eval_sims, eval_formated_file_location )  
    results_file    = os.path.join( outpath, 'eval.combined_results-' + str( seed ) + '.csv' )
    write_csv( submission_data, results_file )


    ## Outside if
    return results


### Train the model and Evaluate

In [None]:
epochs = 5

In [None]:
params = {
    'model_path'                  : model_path, 
    'seed'                        : seed, 
    'data_location'               : data_location, 
    'dev_formated_file_location'  : '/content/models/dev.pre_train_results-4.csv',  ## We can append to this.
    'eval_formated_file_location' : '/content/models/eval.pre_train_results-4.csv',
    'train_data'                  : train_data , 
    'out_location'                : out_location ,  
    'languages'                   : languages ,
    'epoch'                       : epochs ,
} 


In [None]:
results = create_and_eval_subtask_b_fine_tune( ** params ) 
%reload_ext google.colab.data_table
import pandas as pd
df = pd.DataFrame(data=results[1:], columns=results[0])

df

Warmup-steps: 591




Epoch:   0%|          | 0/5 [00:00<?, ?it/s]

Iteration:   0%|          | 0/1182 [00:00<?, ?it/s]

Batches:   0%|          | 0/35 [00:00<?, ?it/s]

Batches:   0%|          | 0/35 [00:00<?, ?it/s]

Batches:   0%|          | 0/37 [00:00<?, ?it/s]

Batches:   0%|          | 0/37 [00:00<?, ?it/s]

Wrote /content/models/dev.combined_results-1.csv
Wrote models/FineTune/1/5/RESULTS_TABLE-dev.pre_train_51.csv
Wrote /content/models/eval.fine_tune_results-1.csv
Wrote /content/models/eval.combined_results-1.csv


| | Settings | Languages | Spearman Rank ALL | Spearman Rank Idiom Data | Spearman Rank STS Data |
|---|---|---|---|---|---|
| 0 | pre_train | EN | 0.794031 | 0.327618 | 0.861627 |
| 1 | pre_train | PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 2 | pre_train | EN,PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 3 | fine_tune | EN | 0.681911 | 0.312658 | 0.677142 |
| 4 | fine_tune | PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 5 | fine_tune | EN,PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |


# Results

In [None]:
df

| | Settings | Languages | Spearman Rank ALL | Spearman Rank Idiom Data | Spearman Rank STS Data |
|---|---|---|---|---|---|
| 0 | pre_train | EN | 0.794031 | 0.327618 | 0.861627 |
| 1 | pre_train | PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 2 | pre_train | EN,PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 3 | fine_tune | EN | 0.761372 | 0.150482 | 0.619134 |
| 4 | fine_tune | PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
| 5 | fine_tune | EN,PT | Did Not Attempt | Did Not Attempt | Did Not Attempt |
