**CSI5138F - Intro to DL and RL - Project**

This notebook is heavily influenced by the following tutorial:
  https://mccormickml.com/2019/07/22/BERT-fine-tuning/ 

# Set up the notebook

In [0]:
'''
BASIC GUIDELINES FOR THE USAGE OF THIS NOTEBOOK

(1) To run any section of this code, you must first run the first section ''Set up the notebook''.
(2) It should no longer be necessary to run the second section ''SemCor and SensEval Parsing and Preprocessing'',
since all the preprocessed data for input into the models is already saved into two files.
(3) However, the WiC preprocessed data was not saved into a file as in (2). Thus, before training with WiC,
we must run the section ''WiC Preprocessing'', where the WiC preprocessing method is defined
(though I would recommend that we also save the preprocessed data to a file, just like for (2)).
(4) The fourth section "Training and Validation" is where we load the data and do all the training and validation.
It is divided into subsections according to which model we wish to focus on. The hyperparameters and constants have been moved to that section for now, since we will probably define them for each model individually later.
So, run all the subsections of "Training and Validation" to make sure you have all the parameters necessary for your training.
'''

'''
THINGS TO DO:

(1) Run the training loop of the Sense/POS Prediction Task and check that the loss is diminishing.
(2) Implement methods to save/load the parameters of the different models along with a name that identifies the hyperparameters (and possibly achieved loss, etc.)
(3)
'''

'''
OTHER POINTS TO CONSIDER:
(1) It may be possible to feed words in a batch for the finetuneingHead on senses if we assign a dummy sense/pos/lemma to the id -1. 
We can always experiment with that if we find that feeding words one-by-one in a loop is inefficient.
'''



In [2]:
# install pytorch transformers
!pip install pytorch-transformers


## Imports and mount drive

In [3]:
import torch
import os
import string
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SubsetRandomSampler, SequentialSampler
from pytorch_transformers import *
import numpy as np
import json
from google.colab import drive
from sklearn.utils import shuffle
from sklearn.metrics import accuracy_score

# Mount google drive containing the datasets
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Get RoBERTa tokenizer

In [4]:
tokenizer = RobertaTokenizer.from_pretrained('roberta-base')



## Utility functions

In [0]:
# Create a function to import json objects from jsonl files
def load_json_objects_from_file(filename):
  # Array for json objects
  json_objects = []
  # Read file line by line
  with open(filename, mode = "r") as jsonl_file:
      for line in jsonl_file:
          json_objects.append(json.loads(line))
  return json_objects

# Takes a word, tokenizes it and returns the pair of (start, end) positions of the FIRST occurrence of its tokens in the token_ids list
# NOTE: it can also apply to a group of words separated by spaces
# DO NOT CALL THIS METHOD DIRECTLY IF YOU WORK WITH A SENTENCE, USE THE NEXT METHOD INSTEAD: find_words_in_tokenized_sentences(wordList,token_ids)
def find_word_in_tokenized_sentence(word,token_ids):
  decomposedWord = tokenizer.encode(word)
  # Iterate through to find a matching sublist of the token_ids
  for i in range(len(token_ids)):
    if token_ids[i] == decomposedWord[0] and token_ids[i:i+len(decomposedWord)] == decomposedWord:
      return (i,i+len(decomposedWord)-1)
  # This is the output when no matching pattern is found
  return (-1,-1)

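# Takes a list of words (strings) and a sentence (as a RoBERTa tokenized ID list) and returns a list
# of pairs indicating the tokens' start and end positions in the sentence for each word
# NOTE: It is important that the list of words given describes a sentence, because the order is relevant to do the matching properly
# NOTE: when a word after the first one cannot be matched, the (-1,-1) output gets shifted by the search offset,
# so that word is assigned the position of the last token of the previous word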
def find_words_in_tokenized_sentences(wordList,token_ids):
  intList = []
  for word in wordList:
    if len(intList) == 0:
      intList.append(find_word_in_tokenized_sentence(word,token_ids))
    else:
      afterLastInterval = intList[-1][1]+1
      interv = find_word_in_tokenized_sentence(word,token_ids[afterLastInterval:])
      actualPositions = (interv[0] + afterLastInterval,interv[1]+afterLastInterval)
      intList.append(actualPositions)
  return intList

# Returns the list of senses for the words of a given semcor_object
def sensesOf(semcor_obj):
  sensesList = []
  for j in range(len(semcor_obj['lemmatized'])):
    sensesList.append(semcor_obj['lemmatized'][j] + '%' + semcor_obj['lex_sense'][j])
  return sensesList


#Call on roberta_model and the embedding intervals of your sample to get the average of the corresponding embeddings for each word in your sample
def embExtract(embeddings_model,input_ids,attention_mask,emb_intervals):
  # Start by retrieving the embeddings from the embeddings model for input_ids and attention_mask
  #embeddings, _ = embeddings_model(input_ids=input_ids, attention_mask=attention_mask) # Without MaskedLM Model, don't use unless needed
  embeddings, _ = embeddings_model.roberta(input_ids=input_ids, attention_mask=attention_mask) # RobertaForMaskedLM model
  # Only the first sample of the batch is used
  embeddings = embeddings[0]
  # Drop the (-1, -1) padding intervals
  embedding_intervals = [(x, y) for x, y in emb_intervals.data.tolist()[0] if x != -1]

  #This will hold the embeddings averaged over the appropriate interval for each word
  extracted_embeddings = []
  for start, end in embedding_intervals:
    # Average the token embeddings from start to end (inclusive)
    extracted_embeddings.append(embeddings[start:end+1].mean(dim=0))
  
  return extracted_embeddings


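# Returns 1 if the predicted class (the argmax of pred) matches the label, and 0 otherwise
def correct(pred,label): 
  # softmax is monotonic, so the argmax can be taken directly on the raw scores
  pred_actual = torch.argmax(pred)
  if (pred_actual.view(-1).item() == label.view(-1).item()):
    return 1
  else:
    return 0

# Returns the fraction of correct predictions (i.e. the accuracy), not the raw count
def countCorrect(preds,labels):
  num_of_correct = 0
  total_num = 0
  for pred, label in zip(preds, labels):
    num_of_correct += correct(pred,label)
    total_num += 1
  return num_of_correct / total_num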
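
As an illustration of the interval matching performed by `find_word_in_tokenized_sentence` and `find_words_in_tokenized_sentences`, here is a minimal, self-contained sketch on made-up token IDs. The dictionary `TOY_ENCODING` stands in for `tokenizer.encode`, and the names `find_sublist` and `find_intervals` are hypothetical, so this only shows the search logic and not the RoBERTa tokenization itself.

```python
# Toy stand-in for tokenizer.encode: each word maps to a list of made-up token IDs
TOY_ENCODING = {
    "a": [10],
    "mixture": [11],
    "of": [12],
    "myrrh": [13, 14, 15],   # a word split into three sub-word tokens
    "a hundred": [10, 16],   # a word group covering two tokens
}

def find_sublist(pattern, token_ids, start=0):
    """Return (first, last) positions of the first match at or after start, or (-1, -1)."""
    n = len(pattern)
    for i in range(start, len(token_ids) - n + 1):
        if token_ids[i:i + n] == pattern:
            return (i, i + n - 1)
    return (-1, -1)

def find_intervals(words, token_ids):
    """Match the words in order, resuming each search right after the previous match."""
    intervals = []
    cursor = 0
    for word in words:
        interval = find_sublist(TOY_ENCODING[word], token_ids, cursor)
        intervals.append(interval)
        # Only advance the cursor when the word was actually found
        if interval != (-1, -1):
            cursor = interval[1] + 1
    return intervals

words = ["a", "mixture", "of", "myrrh", "of", "a hundred"]
token_ids = [10, 11, 12, 13, 14, 15, 12, 10, 16]
intervals = find_intervals(words, token_ids)
print(intervals)
```

Searching from a moving cursor is what lets the second "of" and the "a" inside "a hundred" land on their own positions instead of the first occurrence. In this sketch an unmatched word keeps the marker `(-1, -1)`, which is a design choice of the sketch rather than the behaviour of the helper above.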


## Method for the SemCor+SensEval Vocabulary

In [0]:
# CREATE THE SEMCOR AND SENSEVAL VOCABULARY OBJECT
#IT NEEDS TO BE LOADED BEFORE EVERYTHING BECAUSE BOTH THE PREPROCESSING METHODS AND THE TRAINING METHODS MAKE USE OF IT

#This incorporates both the SemCor and SensEval vocabulary
class SemCor_SensEval_Vocab:
  def __init__(self):

    vocab = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SemCor/semcor_senseval_vocab.jsonl")[0]

    # These define the three separate vocabularies for senses, lemmas and pos respectively.
    self.SENSES = vocab['senses']
    self.LEMMAS = vocab['lemmas']
    self.POS = vocab['POS']
    #for each word in the vocabulary self.WORDS[j], there is a corresponding list of all possible senses self.SENSESPACES[j] for that word, a list of possible lemmas self.LEMMASSPACES[j] and list of pos self.POSSPACES[j]
    self.WORDS = vocab['words']
    self.SENSESPACES = vocab['senseSpaces']
    self.LEMMASSPACES = vocab['lemmaSpaces']
    self.POSSPACES = vocab['posSpaces']

  #Returns the index of a word in the vocabulary self.WORDS
  def getIndexOfWord(self, word):
    try:
      index_value = self.WORDS.index(word)
    except ValueError:
      index_value = -1
    return index_value
  
  def getIndicesOfWords(self, words):
    listOfWordIndices = []
    for j in range(len(words)):
      listOfWordIndices.append(self.getIndexOfWord(words[j]))
    return listOfWordIndices
  
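  #Returns the list of possible senses of a word, or an empty list if the word is not in the vocabulary
  def getSenseSpaceOfWord(self,word):
    j = self.getIndexOfWord(word)
    if j == -1:
      return []
    return self.SENSESPACES[j]

  #Returns the index of a sense in the corresponding list of senses in self.SENSESPACES for that word (-1 if the word or the sense is unknown)
  def giveSenseIndexOfWord(self, word,sense):
    wordIndex = self.getIndexOfWord(word)
    # Without this check, an unknown word (index -1) would silently look into the sense space of the last word
    if wordIndex == -1:
      return -1
    try:
      index_value = self.SENSESPACES[wordIndex].index(sense)
    except ValueError:
      index_value = -1
    return index_value
  
  #Returns a list of indices in the sense space for each word in the list words, in correspondence with a specific sense in senses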
  def getSenseSpaceIndices(self, words, senses):
    listOfSenseIndices = []
    for j in range(len(words)):
      listOfSenseIndices.append(self.giveSenseIndexOfWord(words[j],senses[j]))
    
    return listOfSenseIndices


  def convertSenseID_to_Sense(self, sense_id): 
    if sense_id == -1:
      return 'NO SENSE'
    else:
      return self.SENSES[sense_id]

  def convertListSenseID_to_Sense(self, sense_id_list):
    senseList =[]

    for sense_id in sense_id_list:
      senseList.append(self.convertSenseID_to_Sense(sense_id))
      
    return senseList

  def convertLemmaID_to_Lemma(self, lemma_id): 
    if lemma_id == -1:
      return 'NO LEMMA'
    else:
      return self.LEMMAS[lemma_id]

  def convertListLemmaID_to_Lemma(self, lemma_id_list):
    lemmaList =[]

    for lemma_id in lemma_id_list:
      lemmaList.append(self.convertLemmaID_to_Lemma(lemma_id))
      
    return lemmaList

  def convertPOSID_to_POS(self, pos_id): 
    if pos_id == -1:
      return 'NO POS'
    else:
      return self.POS[pos_id]

  def convertListPOSID_to_POS(self, pos_id_list):
    posList =[]

    for pos_id in pos_id_list:
      posList.append(self.convertPOSID_to_POS(pos_id))
      
    return posList

  # This returns the position of a given sense in the sense vocabulary; it acts as the sense ID.
  def getSenseID(self, sense): 
    return self.SENSES.index(sense)

  # This returns a list of the positions of a given list of senses in the sense vocabulary.
  def getSenseIDList(self, senseList):
      IDList = []
      for exp in senseList:
        IDList.append(self.getSenseID(exp))
      return IDList
  
  # This returns the position of a given lemma in the lemma vocabulary; it acts as the lemma ID.
  def getLemmaID(self, lemma):
    return self.LEMMAS.index(lemma)

  # This returns a list of the positions of a given list of lemmas in the lemma vocabulary.
  def getLemmaIDList(self, lemmaList):
      IDList = []
      for exp in lemmaList:
        IDList.append(self.getLemmaID(exp))
      return IDList    

  # This returns the position of a given part-of-speech in the part-of-speech vocabulary; it acts as the part-of-speech ID.
  def getPOSID(self, pos):
    return self.POS.index(pos)

  # This returns a list of the positions of a given list of parts-of-speech in the part-of-speech vocabulary.
  def getPOSIDList(self, posList):
      IDList = []
      for exp in posList:
        IDList.append(self.getPOSID(exp))
      return IDList

  #def convertPOSIDtoPOS() :  

# TEST THE VOCABULARY
#print(SEMCOR_VOCAB.LEMMAS[:10])
#print(SEMCOR_VOCAB.SENSES[:10])
#print(SEMCOR_VOCAB.POS[:10])

#print('# of dimensions for lemmas:', len(SEMCOR_VOCAB.LEMMAS))
#print('# of dimensions for senses:', len(SEMCOR_VOCAB.SENSES))
#print('# of dimensions for POS:', len(SEMCOR_VOCAB.POS))

#print(SEMCOR_VOCAB.getSenseID('siberia%1:15:00::'))
#print(SEMCOR_VOCAB.getSenseIDList(['siberia%1:15:00::', 'creativity%1:09:00::', 'foundation%1:06:00::']))

#print(SEMCOR_VOCAB.getLemmaID('flattened'))
#print(SEMCOR_VOCAB.getLemmaIDList(['consolidate', 'steakhouse', 'halftime', 'margin']))

#print(SEMCOR_VOCAB.getPOSID('WP'))
#print(SEMCOR_VOCAB.getPOSIDList(['WP$', 'WP','RB']))
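
The lookups in the class above rely on `list.index`, which scans the list from the start on every call. With vocabularies of tens of thousands of entries, a reverse dictionary built once may be worth trying. The following is a minimal sketch on toy data, assuming the vocabulary entries are unique (they are produced with `set`); the class name `IndexedVocab` and its methods are hypothetical and not part of the notebook.

```python
class IndexedVocab:
    """Toy vocabulary that keeps a list for ID -> item and a dict for item -> ID."""

    def __init__(self, items):
        self.items = list(items)
        # Build the reverse mapping once; entries are assumed to be unique
        self.index_of = {item: i for i, item in enumerate(self.items)}

    def get_id(self, item):
        # Same convention as getIndexOfWord: -1 when the item is unknown
        return self.index_of.get(item, -1)

    def get_ids(self, items):
        return [self.get_id(item) for item in items]

    def get_item(self, item_id, missing="NO SENSE"):
        # Same convention as convertSenseID_to_Sense: -1 marks a dummy entry
        return missing if item_id == -1 else self.items[item_id]

toy_senses = IndexedVocab(["mother%1:18:00::", "ten%None", "news%1:10:00::"])
ids = toy_senses.get_ids(["news%1:10:00::", "unknown%None"])
print(ids)
print(toy_senses.get_item(0), toy_senses.get_item(-1))
```

Note that this sketch returns -1 for an unknown sense, whereas `getSenseID` above raises a `ValueError`; which behaviour is preferable depends on how the callers treat unknown entries.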

In [0]:
# The vocabulary object must exist before running the cells that use SEMCOR_VOCAB
SEMCOR_VOCAB = SemCor_SensEval_Vocab()

In [0]:
print(SEMCOR_VOCAB.WORDS[100])
print(SEMCOR_VOCAB.SENSESPACES[100])
print(SEMCOR_VOCAB.POSSPACES[100])

wordIndex = SEMCOR_VOCAB.getIndexOfWord('mother')
print(wordIndex)

senseIndex = SEMCOR_VOCAB.giveSenseIndexOfWord(SEMCOR_VOCAB.WORDS[wordIndex],'mother%1:18:00::')
print(senseIndex)

words = [SEMCOR_VOCAB.WORDS[100],SEMCOR_VOCAB.WORDS[200],SEMCOR_VOCAB.WORDS[350],SEMCOR_VOCAB.WORDS[112]]

#print(SEMCOR_VOCAB.getSenseSpaceOfWord(words[0]))
#print(SEMCOR_VOCAB.getSenseSpaceOfWord(words[1]))
#print(SEMCOR_VOCAB.getSenseSpaceOfWord(words[2]))
#print(SEMCOR_VOCAB.getSenseSpaceOfWord(words[3]))

senses = [SEMCOR_VOCAB.getSenseSpaceOfWord(words[0])[0], SEMCOR_VOCAB.getSenseSpaceOfWord(words[1])[1], SEMCOR_VOCAB.getSenseSpaceOfWord(words[2])[1],SEMCOR_VOCAB.getSenseSpaceOfWord(words[3])[0] ]

print(words)
print(senses)


SEMCOR_VOCAB.getSenseSpaceIndices(words,senses)

mother
['mother%1:18:00::']
['NN']
100
0
['mother', 'ten', 'recent', 'trisodium orthophosphate']
['mother%1:18:00::', 'ten%None', 'recent%5:00:00:past:00', 'trisodium_orthophosphate%1:27:00::']


[0, 1, 1, 0]

# SemCor and SensEval Parsing and Preprocessing

## SemCor and SensEval Vocabulary Preprocessing

In [0]:
# CREATES THE SEMCOR + SENSEVAL VOCABULARY FILE FOR LEMMAS, SENSES AND POS AND STORES IT IN
#/content/drive/My Drive/CSI5138_Project/SemCor/semcor_senseval_vocab.jsonl
def createVocabFile():
  semcor_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SemCor/semcor_parsed2nd.jsonl")
  senseval_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SensEval/senseval.jsonl")

  vocabulary = {
      "words" : [],
      "lemmas" : [],
      "senses" : [],
      "POS" : [],
      "senseSpaces" : [],
      "lemmaSpaces" : [],
      "posSpaces" : []
      }

  words = []
  lemmas = []
  senses = []
  POS = []

  for obj in semcor_json_objs:
    words.extend(obj['wordList'])
    lemmas.extend(obj['lemmatized'])
    senses.extend(sensesOf(obj))
    POS.extend(obj['pos'])


  

  for obj in senseval_json_objs:
    words.extend(obj['wordList'])
    lemmas.extend(obj['lemmatized'])
    senses.extend(sensesOf(obj))
    POS.extend(obj['pos'])

  print(words[5620])
  print(senses[5620])


  vocabulary['words'] = list(set(words))
  vocabulary['lemmas'] = list(set(lemmas))
  vocabulary['senses'] = list(set(senses))
  vocabulary['POS'] = list(set(POS))

  


  # Group the senses, lemmas and POS observed for each word in a single pass over the corpus
  sensesByWord = {}
  lemmasByWord = {}
  posByWord = {}
  for j in range(len(words)):
    sensesByWord.setdefault(words[j], set()).add(senses[j])
    lemmasByWord.setdefault(words[j], set()).add(lemmas[j])
    posByWord.setdefault(words[j], set()).add(POS[j])

  # The spaces are stored in the same order as vocabulary['words']
  for word in vocabulary['words']:
    vocabulary['senseSpaces'].append(list(sensesByWord[word]))
    vocabulary['lemmaSpaces'].append(list(lemmasByWord[word]))
    vocabulary['posSpaces'].append(list(posByWord[word]))


  
  print(len(vocabulary['words']))  
  print(len(vocabulary['senseSpaces']))
  print(len(vocabulary['lemmaSpaces']))
  print(len(vocabulary['posSpaces']))


  #incorporating SensEval into the SemCor vocabulary increased the number of lemmas from 17176 to 17764
  #incorporating SensEval into the SemCor vocabulary increased the number of senses from 25573 to 26898
  #incorporating SensEval into the SemCor vocabulary increased the number of POS from 39 to 45 (this could affect performance since there are 6 unknown POS, though they probably arise very rarely)
  #In sum, there are no major shifts in the size of the vocabulary

  with open('/content/drive/My Drive/CSI5138_Project/SemCor/semcor_senseval_vocab.jsonl', mode = "w") as vocab_file:
      vocab_file.write(json.dumps(vocabulary))

#LAUNCH THE METHOD
createVocabFile()


news
news%1:10:00::
25457
25457
25457
25457
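
The vocabulary is written as a single JSON object on one line of a `.jsonl` file, which is why `SemCor_SensEval_Vocab` reads it back with `[0]`. The round trip can be sketched as follows, using a temporary directory instead of Google Drive. The helper names `write_jsonl` and `read_jsonl` and the toy vocabulary are hypothetical.

```python
import json
import os
import tempfile

def write_jsonl(path, objects):
    # One JSON document per line
    with open(path, mode="w") as f:
        for obj in objects:
            f.write(json.dumps(obj) + "\n")

def read_jsonl(path):
    # Skip blank lines so that a trailing newline does not raise an error
    with open(path, mode="r") as f:
        return [json.loads(line) for line in f if line.strip()]

toy_vocab = {
    "words": ["mother", "news"],
    "senseSpaces": [["mother%1:18:00::"], ["news%1:10:00::"]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy_vocab.jsonl")
    write_jsonl(path, [toy_vocab])
    # The file holds a single object, so it is the first (and only) element
    loaded = read_jsonl(path)[0]

print(loaded == toy_vocab)
```

Since JSON has no set type, the per-word spaces have to be converted to lists before writing, as `createVocabFile` does.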


## Parse the SemCor data
(to the file: '/content/drive/My Drive/CSI5138_Project/SemCor/semcor_parsed2nd.jsonl')

In [0]:
'''
    CSI5138F - Word in Context Project
    Project Group 4
    Members:
        - William Larocque
        - Simon Fortier-Garceau
        - Julian Templeton
    ---------------------------
    This file is for parsing the semcor corpus which we will use as the training set.
    Based on http://www.nltk.org/howto/corpus.html#other-corpora and https://docs.python.org/3.7/library/xml.etree.elementtree.html
'''
from xml.etree import ElementTree as ET
from os import listdir
from os.path import join
import json

def parseSemCor():
  file_location = "/content/drive/My Drive/CSI5138_Project/SemCor/tagfiles"
  semcor_files = [file for file in listdir(file_location) if '.xml' in file]
  # Buffer array to get all sentences of semcor
  parsed_semcor = []
  # Parse every file
  for xml_file in semcor_files:
      # Get all the sentences in the file
      sentences = ET.parse(join(file_location, xml_file)).findall('context/p/s')
      for sentence in sentences:
        
        parsed_sentence = {
            "sentence" : "",
            "wordList" : [],
            "lemmatized" : [],
            "pos" : [],
            "wordnet" : [],
            "lex_sense" : [],
            "source_file" : xml_file
            }
        # Deal with every word individually
        for token in sentence: # Every token is kept, punctuation included (tokens without a POS attribute are tagged 'PUNC' below)
              # Get the token attribute            
              token_attrib = token.attrib

              #replace the underscores in word groups with whitespace to facilitate proper tokenization
              token.text = token.text.replace('_', ' ')

              # Append word POS
              if 'pos' in token_attrib:
                parsed_sentence["pos"].append(token_attrib['pos'])
              else:
                parsed_sentence["pos"].append('PUNC')

              # Look for a lemmatized word, else just append the word as-is
              if 'lemma' in token_attrib:
                parsed_sentence["lemmatized"].append(token_attrib['lemma'])
              else:
                parsed_sentence["lemmatized"].append(token.text)

              # Look for word sense, else append None.
              if 'wnsn' in token_attrib:
                  parsed_sentence["wordnet"].append(token_attrib["wnsn"])
              else:
                  parsed_sentence["wordnet"].append(None)

              # Look for lemma sense, else append the string 'None' (so that sensesOf can concatenate it)
              if 'lexsn' in token_attrib:
                  parsed_sentence["lex_sense"].append(token_attrib["lexsn"])
              else:
                  parsed_sentence["lex_sense"].append('None')

              # Add word to sentence
              parsed_sentence["wordList"].append(token.text)

        parsed_sentence["sentence"] = " ".join(parsed_sentence["wordList"])

        # Append to parsed_semcor
        parsed_semcor.append(parsed_sentence)

  with open('/content/drive/My Drive/CSI5138_Project/SemCor/semcor_parsed2nd.jsonl', mode = "w") as parsed_file:
      for entry in parsed_semcor:
          parsed_file.write(json.dumps(entry) + "\n")

#LAUNCH THE METHOD
parseSemCor()
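
To make the parsing logic above easier to check without access to the tag files, here is a minimal sketch that applies the same steps to an inline XML string. The element names `wf` and `punc`, the attribute values and the sense keys in `TOY_XML` are assumptions made for illustration; the real tag files may differ in detail.

```python
from xml.etree import ElementTree as ET

# Hypothetical miniature of a tag file (made-up attribute values)
TOY_XML = """
<contextfile>
  <context>
    <p>
      <s snum="1">
        <wf pos="PRP">He</wf>
        <wf pos="VB" lemma="bring" wnsn="1" lexsn="2:35:00::">brought</wf>
        <wf pos="CD" lemma="a_hundred" wnsn="1" lexsn="5:00:00:cardinal:00">a_hundred</wf>
        <punc>.</punc>
      </s>
    </p>
  </context>
</contextfile>
"""

root = ET.fromstring(TOY_XML.strip())
parsed = []
for sentence in root.findall("context/p/s"):
    entry = {"wordList": [], "lemmatized": [], "pos": [], "lex_sense": []}
    for token in sentence:
        # Word groups are joined by underscores in the source
        text = token.text.replace("_", " ")
        attrib = token.attrib
        entry["wordList"].append(text)
        # Tokens without a POS attribute are treated as punctuation
        entry["pos"].append(attrib.get("pos", "PUNC"))
        entry["lemmatized"].append(attrib.get("lemma", text))
        entry["lex_sense"].append(attrib.get("lexsn", "None"))
    entry["sentence"] = " ".join(entry["wordList"])
    parsed.append(entry)

print(parsed[0]["sentence"])
print(parsed[0]["pos"])
print(parsed[0]["lex_sense"])
```

Using `dict.get` with a default replaces each `if ... in token_attrib` / `else` pair with a single line while keeping the same fallbacks.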

In [0]:
#TEST THE PARSED SEMCOR FILE
#Punctuation has been added
semcor_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SemCor/semcor_parsed2nd.jsonl")
#print(f"Number of training examples = {len(semcor_json_objs)}")
#print(semcor_json_objs[0])

#sentence for RoBERTa tokenization example
s =  semcor_json_objs[0]['sentence']
print(s)

totalwords = 0
for j in range(len(semcor_json_objs)):
  for t in semcor_json_objs[j]['lex_sense']:
    totalwords += 1

token_ids = tokenizer.encode(s)
tokens = tokenizer.convert_ids_to_tokens(token_ids)
sentence = tokenizer.convert_tokens_to_string(tokens)

print("As we can see, the RoBERTa tokenizer can (more or less) recover a sentence in its original format with convert_tokens_to_string:\n")
print('Original sentence:', s)
print('Derived from RoBERTa tokens:', sentence)
print('\n')
#there is an added whitespace at the beginning of the sentence

print("We can also compare the list of words of a SemCor sentence with the token assignment of RoBERTa on the sentence obtained by joining with whitespace.\n")
print('List of words of SemCor sentence:', semcor_json_objs[0]['wordList'])
print('Derived RoBERTa tokenization:', tokens)
print('\n')
print("We remark that the group of words -a hundred- gets two separate tokens, and -myrrh- gets broken down into three segments.\n")

#print("But we can find the specific interval of tokens that characterizes the words of the SemCor list.\n")
#for i in range(len(semcor_json_objs[0]['wordList'])):
#  print("Token interval for -", semcor_json_objs[0]['wordList'][i], "- is "  , intervalOfMatchTokens(semcor_json_objs[0]['wordList'][i],token_ids))

print(totalwords)


He brought with him a mixture of myrrh and aloes , of about a hundred pounds ' weight .
As we can see, the RoBERTa tokenizer can (more or less) recover a sentence in its original format with convert_tokens_to_string:

Original sentence: He brought with him a mixture of myrrh and aloes , of about a hundred pounds ' weight .
Derived from RoBERTa tokens:  He brought with him a mixture of myrrh and aloes , of about a hundred pounds ' weight .


We can also compare the list of words of a SemCor sentence with the token assignment of RoBERTa on the sentence obtained by joining with whitespace.

List of words of SemCor sentence: ['He', 'brought', 'with', 'him', 'a', 'mixture', 'of', 'myrrh', 'and', 'aloes', ',', 'of', 'about', 'a hundred', 'pounds', "'", 'weight', '.']
Derived RoBERTa tokenization: ['ĠHe', 'Ġbrought', 'Ġwith', 'Ġhim', 'Ġa', 'Ġmixture', 'Ġof', 'Ġmy', 'r', 'rh', 'Ġand', 'Ġal', 'oes', 'Ġ,', 'Ġof', 'Ġabout', 'Ġa', 'Ġhundred', 'Ġpounds', "Ġ'", 'Ġweight', 'Ġ.']


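We remark that the group of words -a hundred- gets two separate tokens, and -myrrh- gets broken down into three segments.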

## Parse the SensEval data
(to the file: '/content/drive/My Drive/CSI5138_Project/SensEval/senseval.jsonl')

In [0]:
'''
    CSI5138F - Word in Context Project
    Project Group 4
    Members:
        - William Larocque
        - Simon Fortier-Garceau
        - Julian Templeton
    ---------------------------
    This file is for parsing the SensEval corpus.
    Based on http://www.nltk.org/howto/corpus.html#other-corpora and https://docs.python.org/3.7/library/xml.etree.elementtree.html
'''
from xml.etree import ElementTree as ET
from os import listdir
from os.path import join
import json

def parseSensEval():
  file_location = "/content/drive/My Drive/CSI5138_Project/SensEval/WNet3Set"
  semcor_files = [file for file in listdir(file_location) if '.xml' in file]
  # Buffer array to get all sentences of SensEval
  parsed_semcor = []
  # Parse every file
  for xml_file in semcor_files:
      # Get all the sentences in the file
      sentences = ET.parse(join(file_location, xml_file)).findall('context/p/s')
      for sentence in sentences:
        
        parsed_sentence = {
            "sentence" : "",
            "wordList" : [],
            "lemmatized" : [],
            "pos" : [],
            "wordnet" : [],
            "lex_sense" : [],
            "source_file" : xml_file
            }
        # Deal with every word individually
        for token in sentence: # Every token is kept, punctuation included (tokens without a POS attribute are tagged 'PUNC' below)
              # Get the token attribute            
              token_attrib = token.attrib

              #replace the underscores in word groups with whitespace to facilitate proper tokenization
              token.text = token.text.replace('_', ' ')

              # Append word POS
              if 'pos' in token_attrib:
                parsed_sentence["pos"].append(token_attrib['pos'])
              else:
                parsed_sentence["pos"].append('PUNC')

              # Look for a lemmatized word, else just append the word as-is
              if 'lemma' in token_attrib:
                parsed_sentence["lemmatized"].append(token_attrib['lemma'])
              else:
                parsed_sentence["lemmatized"].append(token.text)

              # Look for word sense, else append None.
              if 'wnsn' in token_attrib:
                  parsed_sentence["wordnet"].append(token_attrib["wnsn"])
              else:
                  parsed_sentence["wordnet"].append(None)

              # Look for lemma sense, else append the string 'None' (so that sensesOf can concatenate it)
              if 'lexsn' in token_attrib:
                  parsed_sentence["lex_sense"].append(token_attrib["lexsn"])
              else:
                  parsed_sentence["lex_sense"].append('None')

              # Add word to sentence
              parsed_sentence["wordList"].append(token.text)

        parsed_sentence["sentence"] = " ".join(parsed_sentence["wordList"])

        # Append to parsed_semcor
        parsed_semcor.append(parsed_sentence)

  with open('/content/drive/My Drive/CSI5138_Project/SensEval/senseval.jsonl', mode = "w") as parsed_file:
      for entry in parsed_semcor:
          parsed_file.write(json.dumps(entry) + "\n")

#LAUNCH THE METHOD
parseSensEval()

In [0]:
#TESTING THE PARSED SENSEVAL
senseval_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SensEval/senseval.jsonl")

countNOANS = 0
countU = 0
countNOWN = 0
totalwords = 0
for j in range(len(senseval_json_objs)):
  for t in senseval_json_objs[j]['lex_sense']:
    totalwords += 1
    if t == 'NOANSWER':
      countNOANS += 1
    if t == 'U':
      countU += 1
    if t == 'NOWN':
      countNOWN += 1


print(totalwords)
print(countNOANS) #These have no adequate sense identification
print(countU) #These words have an unknown sense
print(countNOWN) #The answer provided for some collocations consists of two senses for two different words.





11100
40
112
28
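
The tally above can also be written with `collections.Counter`, which counts every sense tag at once instead of keeping one counter variable per tag. A minimal sketch on toy objects (`toy_objs` is made-up data standing in for `senseval_json_objs`):

```python
from collections import Counter

# Toy stand-in for the parsed SensEval objects
toy_objs = [
    {"lex_sense": ["None", "1:10:00::", "U"]},
    {"lex_sense": ["NOANSWER", "2:38:00::", "U", "NOWN"]},
]

# Count every sense tag over all sentences
counts = Counter(tag for obj in toy_objs for tag in obj["lex_sense"])
total_words = sum(counts.values())

print(total_words)
print(counts["NOANSWER"], counts["U"], counts["NOWN"])
```

A `Counter` returns 0 for tags that never occur, so the same code works if one of the special tags is absent from the data.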


## Preprocess the SemCor and SensEval

(This gives the input for the models, which is saved into the respective files
'CSI5138_Project/SensEval/preprocessed_semcor.jsonl' and 
'CSI5138_Project/SensEval/preprocessed_senseval.jsonl'.)
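
The preprocessing method defined in this section pads every tokenized sentence to a fixed number of tokens, builds the matching attention mask, and pads the per-word label lists with the dummy ID -1. A minimal sketch of that padding step on made-up token IDs follows. The helper names `pad_with_mask` and `pad_labels` are hypothetical, and `pad_id=1` is an assumption about the tokenizer's padding ID, to be checked against the tokenizer actually used.

```python
def pad_with_mask(token_ids, max_len, pad_id):
    """Right-pad token_ids to max_len and build the matching attention mask."""
    if len(token_ids) > max_len:
        raise ValueError("sequence is longer than max_len")
    n_pad = max_len - len(token_ids)
    input_ids = token_ids + [pad_id] * n_pad
    # 1 marks a real token, 0 marks padding
    attention_mask = [1] * len(token_ids) + [0] * n_pad
    return input_ids, attention_mask

def pad_labels(labels, max_words, fill=-1):
    """Right-pad per-word labels with a dummy ID."""
    return labels + [fill] * (max_words - len(labels))

# Made-up token IDs for a four-token sentence
input_ids, attention_mask = pad_with_mask([0, 894, 1146, 2], max_len=6, pad_id=1)
sense_ids = pad_labels([5, 7], max_words=4)

print(input_ids)
print(attention_mask)
print(sense_ids)
```

Raising an error for over-long sequences is a choice made in this sketch: with a hard-coded maximum length, a longer sentence would otherwise be left unpadded without any warning.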


### Method for preprocessing Semcor (or SensEval)

In [0]:
#A method to preprocess the SemCor or SensEval data
def sense_preprocessing(semcor_objects, testing = True, shuffle_data = False):

  SEMCOR_VOCAB = SemCor_SensEval_Vocab()



  semcor_sentences = []
  semcor_senses = []
  semcor_lemmas = []  
  semcor_pos = []
  semcor_wordLists = []
  semcor_senseSpaceIDs = []



  semcor_encoded = []
  semcor_token_intervals = []

  for example in semcor_objects:
    sentence = f"<s>{example['sentence']}</s>"

    semcor_sentences.append(sentence)
    semcor_senses.append(sensesOf(example))
    #print(sensesOf(example))
    semcor_lemmas.append(example['lemmatized'])
    semcor_pos.append(example['pos'])
    semcor_wordLists.append(example['wordList'])
    semcor_senseSpaceIDs.append(SEMCOR_VOCAB.getSenseSpaceIndices(example['wordList'],sensesOf(example)))


    # Then encode the sentences
    semcor_encoded.append(tokenizer.encode(sentence, add_special_tokens=False))  #<s> and </s> are already written into the sentence, hence add_special_tokens=False (to be double-checked later)

    
  # Pad the sequences and find the encoded word location in the combined input
  #max_len = np.array([len(ex) for ex in semcor_encoded]).max()
  #max number of tokens in a sentence
  max_len = 151

  #max number of words in a sentence
  #max_len_sentence = np.array([len(ex) for ex in semcor_senses]).max()
  max_len_sentence = 134

  semcor_padded = {"input_ids" : [], "attention_mask" : [], "sense_ids": [], "pos_ids" : [], 'lemma_ids' : [], "emb_intervals" : [], "wordList" : [], "sense_spaceids":[]}
  

  for j in range(len(semcor_senses)):    
    semcor_padded['sense_spaceids'].append(semcor_senseSpaceIDs[j])
    semcor_padded['sense_spaceids'][j].extend([-1]*(max_len_sentence-len(semcor_padded['sense_spaceids'][j])))
    semcor_padded['sense_ids'].append(SEMCOR_VOCAB.getSenseIDList(semcor_senses[j]))
    semcor_padded['sense_ids'][j].extend([-1]*(max_len_sentence-len(semcor_padded['sense_ids'][j])))
    semcor_padded['pos_ids'].append(SEMCOR_VOCAB.getPOSIDList(semcor_pos[j]))
    semcor_padded['pos_ids'][j].extend([-1]*(max_len_sentence-len(semcor_padded['pos_ids'][j])))
    semcor_padded['lemma_ids'].append(SEMCOR_VOCAB.getLemmaIDList(semcor_lemmas[j]))
    semcor_padded['lemma_ids'][j].extend([-1]*(max_len_sentence-len(semcor_padded['lemma_ids'][j])))
    semcor_padded['wordList'].append(semcor_wordLists[j])
    #print('Changing to ids and padding:', j, ' with ', semcor_padded['sense_ids'][j])


  for i in range(0, len(semcor_encoded)):
    enc_sentence = semcor_encoded[i]
    #word_locs = wic_word_locs[i]


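    # Pad the sequences with the tokenizer's padding token (ID 0 is <s> in the RoBERTa vocabulary, not the padding token)
    ex_len = len(enc_sentence)
    padded_sentence = enc_sentence.copy()
    padded_sentence.extend([tokenizer.convert_tokens_to_ids(tokenizer.pad_token)]*(max_len - ex_len))
    semcor_padded["input_ids"].append(padded_sentence)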

    semcor_padded["emb_intervals"].append(find_words_in_tokenized_sentences(semcor_wordLists[i],padded_sentence))

    #lastIntervalOfCurrent = semcor_padded["emb_intervals"][i][-1]
    lastIntervalOfCurrent = (-1, -1)

    semcor_padded["emb_intervals"][i].extend([lastIntervalOfCurrent]*(max_len_sentence-len(semcor_padded["emb_intervals"][i])))

    #print('Processing intervals:', i, ' with ', semcor_padded["emb_intervals"][i])

    padded_mask = [1] * ex_len
    padded_mask.extend([0]*(max_len - ex_len))
    semcor_padded["attention_mask"].append(padded_mask)
    
  # All the fields returned for each sample
  keys = ["input_ids", "attention_mask", "sense_ids", "pos_ids", "lemma_ids", "emb_intervals", "wordList", "sense_spaceids"]
  if testing and shuffle_data:
    # Shuffle all the fields in unison so that the samples stay aligned
    raw_set = dict(zip(keys, shuffle(*[semcor_padded[key] for key in keys])))
  else:
    # Keep the original order of the data (when testing is False, the data is never shuffled)
    raw_set = {key: semcor_padded[key] for key in keys}
  # Return the raw data (Need to put them in a PyTorch tensor and dataset)
  return raw_set


### Preprocess SemCor and save to file

In [0]:
#This function saves the preprocessed SemCor Training data to 'preprocessed_semcor.jsonl' for quick loading
def savePreprocSemcor():
  semcor_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SemCor/semcor_parsed2nd.jsonl")
  raw_train_set = sense_preprocessing(semcor_json_objs, testing = False)
  data = []
  data.append(raw_train_set['input_ids'])
  data.append(raw_train_set['attention_mask'])
  data.append(raw_train_set['sense_ids'])
  data.append(raw_train_set['pos_ids'])
  data.append(raw_train_set['lemma_ids'])
  data.append(raw_train_set['emb_intervals'])
  data.append(raw_train_set['wordList'])
  data.append(raw_train_set['sense_spaceids'])
  
  with open('/content/drive/My Drive/CSI5138_Project/SensEval/preprocessed_semcor.jsonl', mode = "w") as preproc_file:
    for entry in data:
      preproc_file.write(json.dumps(entry) + "\n")


#Execute the procedure in question
savePreprocSemcor()


### Preprocess SensEval and save to file

In [0]:
#This function saves the preprocessed SensEval Evaluation data to 'preprocessed_senseval.jsonl' for quick loading
def savePreprocSenseval():
  senseval_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SensEval/senseval.jsonl")
  raw_eval_set = sense_preprocessing(senseval_json_objs, testing = True)
  data = []
  data.append(raw_eval_set['input_ids'])
  data.append(raw_eval_set['attention_mask'])
  data.append(raw_eval_set['sense_ids'])
  data.append(raw_eval_set['pos_ids'])
  data.append(raw_eval_set['lemma_ids'])
  data.append(raw_eval_set['emb_intervals'])
  data.append(raw_eval_set['wordList'])
  data.append(raw_eval_set['sense_spaceids'])
  
  with open('/content/drive/My Drive/CSI5138_Project/SensEval/preprocessed_senseval.jsonl', mode = "w") as preproc_file:
    for entry in data:
      preproc_file.write(json.dumps(entry) + "\n")


#Execute the procedure in question
savePreprocSenseval()
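
Both save procedures above write one JSON line per field, in a fixed order (input_ids, attention_mask, sense_ids, pos_ids, lemma_ids, emb_intervals, wordList, sense_spaceids), which is why the loaded data is later indexed by position. Below is a minimal sketch of that round trip on toy data. The names `FIELD_ORDER`, `save_fields` and `load_fields` are hypothetical and not part of this notebook; `load_json_objects_from_file` is only presumed to behave like the loading half.

```python
import json
import os
import tempfile

# Assumed field order, mirroring the order of the data.append(...) calls above
FIELD_ORDER = ["input_ids", "attention_mask", "sense_ids", "pos_ids",
               "lemma_ids", "emb_intervals", "wordList", "sense_spaceids"]

def save_fields(raw_set, path):
    # Write one JSON document per line, one line per field
    with open(path, mode="w") as f:
        for name in FIELD_ORDER:
            f.write(json.dumps(raw_set[name]) + "\n")

def load_fields(path):
    # Read the lines back and re-attach the field names by position
    with open(path) as f:
        rows = [json.loads(line) for line in f if line.strip()]
    return dict(zip(FIELD_ORDER, rows))

# Tiny made-up example with two "sentences"
toy = {name: [[1, 2], [3, 4]] for name in FIELD_ORDER}
toy["wordList"] = [["a", "b"], ["c", "d"]]
toy["emb_intervals"] = [[[1, 1], [2, 3]], [[1, 2], [-1, -1]]]

toy_path = os.path.join(tempfile.mkdtemp(), "toy_preprocessed.jsonl")
save_fields(toy, toy_path)
restored = load_fields(toy_path)
```

Note that JSON has no tuple type, so tuples such as the `(-1, -1)` padding intervals come back as two-element lists after loading.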

# WiC Preprocessing

## Method for Preprocessing the WordInContext datasets

In [0]:
# Create a function to preprocess the WiC data
def wic_preprocessing(json_objects, testing = True, shuffle_data = False, verbose = False):
  wic_sentences = []
  wic_encoded = []
  wic_labels = []
  wic_word_locs = []
  for example in json_objects:
    sentence = f"<s>{example['sentence1']}</s><s>{example['sentence2']}</s>"
    wic_sentences.append(sentence)
    # Then encode the sentences
    wic_encoded.append(tokenizer.encode(sentence, add_special_tokens=False))
    # Find the word in each sentence
    word = example['word']
    word_locs = (-1, -1)
    # Split the two sentences on spaces (no lemmatization or lowercasing is applied here)
    sent1_split = example['sentence1'].split(' ')
    sent2_split = example['sentence2'].split(' ')
    # Get the index of word in both sentences
    sent1_word_char_loc = (example['start1'], example['end1'])
    sent2_word_char_loc = (example['start2'], example['end2'])
    # Create a variable to keep track of the number of characters parsed in each sentence as we loop
    sent_chars = 0
    # Loop over the words in the first sentence
    i, j = 0, 0
    word1_not_found, word2_not_found = True, True
    while word1_not_found and i < len(sent1_split):
      if sent_chars >= sent1_word_char_loc[0] and sent_chars <= sent1_word_char_loc[1]:
        word_locs = (i, -1) # Found the word in the sentence
        word1_not_found = False
      elif sent_chars > sent1_word_char_loc[1]:
        # If we somehow got past the word, assume it was the previous word
        word_locs = (i - 1, -1) # Found the word in the sentence
        word1_not_found = False
      else:
        # Look at the next word
        sent_chars += len(sent1_split[i]) + 1 # Plus one for the space
        i += 1
    # Loop over the words in the second sentence
    sent_chars = 0 # Reset
    while word2_not_found and j < len(sent2_split):
      if sent_chars >= sent2_word_char_loc[0] and sent_chars <= sent2_word_char_loc[1]:
        word_locs = (i, j) # Found the word in the sentence
        word2_not_found = False
      elif sent_chars > sent2_word_char_loc[1]:
        # If we somehow got past the word, assume it was the previous word
        word_locs = (i, j - 1) # Found the word in the sentence
        word2_not_found = False
      else:
        # Look at the next word
        sent_chars += len(sent2_split[j]) + 1 # Plus one for the space
        j += 1
    # For testing
    if verbose:
      print(word)
      print(sent1_split)
      print(sent2_split)
      print(word_locs)
    # Now to find the word in the tokenized sentences
    word1 = sent1_split[word_locs[0]].translate(str.maketrans('', '', string.punctuation)) #Remove punctuation (See https://stackoverflow.com/questions/265960/best-way-to-strip-punctuation-from-a-string)
    word2 = sent2_split[word_locs[1]].translate(str.maketrans('', '', string.punctuation)) #Remove punctuation
    token_word_locs = find_words_in_tokenized_sentences([word1, word2], wic_encoded[-1])
    wic_word_locs.append(token_word_locs)
    # Get the label if we expect it to be there
    if testing:
      if example['label']:
        wic_labels.append(1)
      else:
        wic_labels.append(0)
  # Pad the sequences and find the encoded word location in the combined input
  max_len = np.array([len(ex) for ex in wic_encoded]).max()
  wic_padded = {"input_ids" : [], "attention_mask" : [], "token_type_ids" : [], "word1_locs": [], "word2_locs" : []}
  for i in range(0, len(wic_encoded)):
    enc_sentence = wic_encoded[i]
    word_locs = wic_word_locs[i]
    # Pad the sequences
    ex_len = len(enc_sentence)
    padded_sentence = enc_sentence.copy()
    padded_sentence.extend([0]*(max_len - ex_len))
    wic_padded["input_ids"].append(padded_sentence)
    padded_mask = [1] * ex_len
    padded_mask.extend([0]*(max_len - ex_len))
    wic_padded["attention_mask"].append(padded_mask)
    # Create the vector to get back the words after RoBERTa
    token_word_locs = wic_word_locs[i]
    first_word_loc = []
    second_word_loc = []
    len_first_word = token_word_locs[0][1] - token_word_locs[0][0] + 1
    len_second_word = token_word_locs[1][1] - token_word_locs[1][0] + 1
    for j in range(0, max_len):
      if j >= token_word_locs[0][0] and j <= token_word_locs[0][1]:
        # Part of the first word
        first_word_loc.append(1.0 / len_first_word)
      else:
        first_word_loc.append(0.0)
      if j >= token_word_locs[1][0] and j <= token_word_locs[1][1]:
        # Part of the second word
        second_word_loc.append(1.0 / len_second_word)
      else:
        second_word_loc.append(0.0)
    wic_padded["word1_locs"].append(first_word_loc)
    wic_padded["word2_locs"].append(second_word_loc)
    # token_type_ids is a mask that tells where the first and second sentences are
    token_type_id = []
    first_sentence = True
    sentence_start = True
    for token in padded_sentence:
      if first_sentence and sentence_start and token == 0:
        # Allows 0 at the start of the first sentence
        token_type_id.append(0)
      elif first_sentence and token > 0:
        if sentence_start:
          sentence_start = False
        token_type_id.append(0)
      elif first_sentence and not sentence_start and token == 0:
        first_sentence = False
        # Start of second sentence
        token_type_id.append(1)
      else:
        # Second sentence
        token_type_id.append(1)
    wic_padded["token_type_ids"].append(token_type_id)
  if testing:
    if shuffle_data:
      # Shuffle the data
      raw_set = {"input_ids": [], "token_type_ids": [], "attention_mask": [], "labels": [], "word1_locs": [], "word2_locs" : []}
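      # shuffle() returns one shuffled sequence per input, so all six outputs must be unpacked
      (raw_set["input_ids"], raw_set["token_type_ids"], raw_set["attention_mask"],
       raw_set["labels"], raw_set["word1_locs"], raw_set["word2_locs"]) = shuffle(
          wic_padded["input_ids"], wic_padded["token_type_ids"],
          wic_padded["attention_mask"], wic_labels,
          wic_padded["word1_locs"], wic_padded["word2_locs"])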
    else:
      raw_set = {"input_ids": wic_padded["input_ids"], "token_type_ids": wic_padded["token_type_ids"],
                 "attention_mask": wic_padded["attention_mask"], "labels": wic_labels,
                 "word1_locs": wic_padded["word1_locs"], "word2_locs" : wic_padded["word2_locs"]}
  else: # No labels present (Testing set)
    # Do not shuffle the testing set
    raw_set = {"input_ids": wic_padded["input_ids"], "token_type_ids": wic_padded["token_type_ids"], 
               "attention_mask": wic_padded["attention_mask"], 
               "word1_locs": wic_padded["word1_locs"], "word2_locs" : wic_padded["word2_locs"]}
  # Return the raw data (Need to put them in a PyTorch tensor and dataset)
  return raw_set
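
The `word1_locs` and `word2_locs` vectors built above put a weight of 1/n on each of the n tokens that make up the target word and 0 everywhere else, so a dot product with the per-token embeddings yields the mean embedding of that word. Below is a small pure-Python sketch of the idea with made-up numbers. The helper names `interval_weights` and `weighted_sum` are hypothetical; in the notebook this would presumably be done with tensors.

```python
def interval_weights(start, end, length):
    # Weight 1/n on the n positions in [start, end], 0 elsewhere
    n = end - start + 1
    return [1.0 / n if start <= j <= end else 0.0 for j in range(length)]

def weighted_sum(weights, embeddings):
    # Combine per-token embeddings (a list of equal-length vectors) using the weights
    dim = len(embeddings[0])
    return [sum(w * emb[d] for w, emb in zip(weights, embeddings)) for d in range(dim)]

# Five tokens with 2-dimensional toy embeddings; the target word spans tokens 2..3
toy_embeddings = [[0.0, 0.0], [1.0, 1.0], [2.0, 4.0], [4.0, 8.0], [9.0, 9.0]]
weights = interval_weights(2, 3, len(toy_embeddings))
word_embedding = weighted_sum(weights, toy_embeddings)
```

Here `weights` selects tokens 2 and 3 with weight 0.5 each, so `word_embedding` is the average of those two toy vectors.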

# Training and Validation

## Constants and Hyperparameters 
(should probably make a separate one for each training model in the sections below)

In [7]:
BATCH_SIZE = 64
PATIENCE = 5

# INSTANTIATE THE VOCABULARY
SEMCOR_VOCAB = SemCor_SensEval_Vocab()

# Prepare Torch to use GPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
n_gpu = torch.cuda.device_count()
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU (no CUDA device found)")

# UTILITY FUNCTIONS
# Function to calculate the accuracy of our predictions vs labels
def flat_accuracy(preds, labels):
    pred_flat = np.argmax(preds, axis=1).flatten()
    labels_flat = labels.flatten()
    return np.sum(pred_flat == labels_flat) / len(labels_flat)



Tesla K80


## Load the RoBERTa model

In [0]:
#roberta_model = RobertaModel.from_pretrained('roberta-base') # Working without Masked LM (shouldn't use if LM is good)
# https://huggingface.co/transformers/_modules/transformers/modeling_roberta.html#RobertaForMaskedLM
roberta_model = RobertaForMaskedLM.from_pretrained('roberta-base') # Working with Masked LM
roberta_model.to(device)

100%|██████████| 473/473 [00:00<00:00, 88598.87B/s]
100%|██████████| 501200538/501200538 [00:15<00:00, 31530650.68B/s]


RobertaForMaskedLM(
  (roberta): RobertaModel(
    (embeddings): RobertaEmbeddings(
      (word_embeddings): Embedding(50265, 768, padding_idx=0)
      (position_embeddings): Embedding(514, 768)
      (token_type_embeddings): Embedding(1, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (encoder): BertEncoder(
      (layer): ModuleList(
        (0): BertLayer(
          (attention): BertAttention(
            (self): BertSelfAttention(
              (query): Linear(in_features=768, out_features=768, bias=True)
              (key): Linear(in_features=768, out_features=768, bias=True)
              (value): Linear(in_features=768, out_features=768, bias=True)
              (dropout): Dropout(p=0.1, inplace=False)
            )
            (output): BertSelfOutput(
              (dense): Linear(in_features=768, out_features=768, bias=True)
              (LayerNorm): LayerNorm((768,), eps=1e-05, elementwi

## SemCor/SensEval Training and Validation

### Load the SemCor/SensEval data

In [1]:
raw_train_set = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SensEval/preprocessed_semcor.jsonl")
raw_eval_set = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/SensEval/preprocessed_senseval.jsonl")

print(len(raw_eval_set[0][0]))
print(len(raw_train_set[0][0]))
print(raw_eval_set[5][0])

NameError: ignored

In [0]:
#TESTS
#print(raw_eval_set[0][0]) #token_ids
#print(tokenizer.convert_ids_to_tokens(raw_train_set[0][0])) #tokens
#print(raw_eval_set[1][0]) #attention_mask
#print(raw_eval_set[2][0]) #sense_ids
#print(raw_eval_set[3][0]) #pos_ids
#print(raw_eval_set[4][0]) #lemma_ids
#print(raw_eval_set[5][0]) #emb_intervals
print(raw_eval_set[6][0]) #words in the sentence

print(SEMCOR_VOCAB.convertListSenseID_to_Sense(raw_eval_set[2][0]))

wordIndices = SEMCOR_VOCAB.getIndicesOfWords(raw_eval_set[6][0])
#for j in range(0,10):
#  print(raw_eval_set[7][j]) 
#print(wordIndices)

print(SEMCOR_VOCAB.SENSESPACES[wordIndices[6]])
print(raw_eval_set[7][0]) #senseSpaceIDs in the sentence


#print(SEMCOR_VOCAB.convertListPOSID_to_POS(raw_eval_set[3][0]))
#print(SEMCOR_VOCAB.convertListLemmaID_to_Lemma(raw_eval_set[4][0]))
#check for the largest space in SENSESPACES:
maximum = 0

for space in SEMCOR_VOCAB.SENSESPACES:
  if len(space) > maximum:
    maximum = len(space)

maximum


[11497, 14403, 10347, 2380, 22201, 5898, 4082, 17627, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1]


NameError: ignored

In [0]:
# Combine the two loaded datasets into one
raw_train_set[0] = raw_train_set[0] + raw_eval_set[0]
raw_train_set[1] = raw_train_set[1] + raw_eval_set[1]
raw_train_set[2] = raw_train_set[2] + raw_eval_set[2]
raw_train_set[3] = raw_train_set[3] + raw_eval_set[3]
raw_train_set[4] = raw_train_set[4] + raw_eval_set[4]
raw_train_set[5] = raw_train_set[5] + raw_eval_set[5]
raw_train_set[7] = raw_train_set[7] + raw_eval_set[7] #these give you the position of the senses in a space of size 31


In [0]:
# Create a PyTorch dataset for training and evaluation sets for senses and POS
train_data = TensorDataset(
    torch.tensor(raw_train_set[0]), #token_ids
    torch.tensor(raw_train_set[1]), #attention_mask
    torch.tensor(raw_train_set[2]), #sense_ids
    torch.tensor(raw_train_set[3]), #pos_ids
    torch.tensor(raw_train_set[4]), #lemma_ids
    torch.tensor(raw_train_set[5]), #emb_intervals
    torch.tensor(raw_train_set[7]), #sense_spaceids
)

# Shuffle and split the data as seen in the link below to split into train/validation sets
# https://stackoverflow.com/questions/50544730/how-do-i-split-a-custom-dataset-into-training-and-test-datasets
validation_split = 0.15 # 15% validation, 85% training
indices = list(range(len(train_data)))
split_point = int(np.floor(validation_split * len(train_data)))
np.random.seed(1) # For reproducibility
np.random.shuffle(indices) # Shuffle the indices

train_indices, validation_indices = indices[split_point:], indices[:split_point]

# Create a sampler and loader for the train and validation sets
train_sampler = SubsetRandomSampler(train_indices)
validation_sampler = SubsetRandomSampler(validation_indices)

trainloader = DataLoader(train_data, sampler=train_sampler, batch_size=1)
validationloader = DataLoader(train_data, sampler=validation_sampler, batch_size=1)
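
The split above shuffles a list of indices once with a fixed seed and then slices it, so the two samplers draw from disjoint subsets of the same dataset. Below is a minimal sketch of the same idea using only the standard library. The function name `split_indices` is hypothetical, and `random.Random` stands in for `np.random`, so the exact permutation differs from the one produced in the notebook.

```python
import math
import random

def split_indices(n_samples, validation_split=0.15, seed=1):
    # Shuffle the indices reproducibly, then cut off the first chunk for validation
    indices = list(range(n_samples))
    split_point = int(math.floor(validation_split * n_samples))
    random.Random(seed).shuffle(indices)
    return indices[split_point:], indices[:split_point]

train_idx, val_idx = split_indices(100)
```

Because the seed is fixed, calling `split_indices(100)` again returns the same partition, which keeps validation results comparable across runs.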

In [0]:
# ISSUE: tokenizer not containing the function get_special_tokens_mask which means that we may be
# using an older version (see second commented link)
# 
# Taken directly from an official example on the library's Github:
# https://github.com/huggingface/transformers/blob/master/examples/run_lm_finetuning.py
# https://github.com/huggingface/transformers/releases
# https://huggingface.co/transformers/model_doc/roberta.html
# https://github.com/pytorch/fairseq/tree/master/examples/roberta
def mask_tokens(inputs, tokenizer, probability=0.15):
    """ Prepare masked tokens inputs/labels for masked language modeling: 80% MASK, 10% random, 10% original. """
    # Work on a copy so the caller's input_ids are not masked in place
    # (train_model reuses the same input_ids for the sense/POS heads)
    inputs = inputs.clone()
    labels = inputs.clone()
    # Keep every mask on the same device as the inputs instead of hard-coding .cuda()
    device = inputs.device
    # We sample a few tokens in each sequence for masked-LM training (with probability defaulting to 0.15 in Bert/RoBERTa)
    probability_matrix = torch.full(labels.shape, probability)
    special_tokens_mask = [[1 if x in (tokenizer.sep_token_id, tokenizer.cls_token_id) else 0 for x in val] for val in labels.tolist()]
    probability_matrix.masked_fill_(torch.tensor(special_tokens_mask, dtype=torch.bool), value=0.0)
    masked_indices = torch.bernoulli(probability_matrix).bool().to(device)
    # We only compute loss on masked tokens (assumes the installed release ignores label -1; newer transformers releases use -100)
    labels[~masked_indices] = -1
    # 80% of the time, we replace masked input tokens with tokenizer.mask_token
    indices_replaced = torch.bernoulli(torch.full(labels.shape, 0.8)).bool().to(device) & masked_indices
    inputs[indices_replaced] = tokenizer.convert_tokens_to_ids(tokenizer.mask_token)

    # 10% of the time (half of the remaining 20%), we replace masked input tokens with a random word
    indices_random = torch.bernoulli(torch.full(labels.shape, 0.5)).bool().to(device) & masked_indices & ~indices_replaced
    random_words = torch.randint(len(tokenizer), labels.shape, dtype=torch.long).to(device)
    inputs[indices_random] = random_words[indices_random]

    # The rest of the time (10% of the time) we keep the masked input tokens unchanged
    return inputs, labels
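
The second replacement draw in `mask_tokens` uses 0.5 rather than 0.1 because it is applied only to the selected tokens that were not already replaced by the mask token, i.e. to the remaining 20%. The expected fractions can be checked with plain arithmetic (the variable names are illustrative only; the probabilities are those used in the function above):

```python
p_select = 0.15           # probability that a token is selected for prediction
p_mask_given_sel = 0.8    # share of selected tokens replaced by the mask token
p_rand_given_rest = 0.5   # share of the remaining selected tokens replaced by a random token

p_mask = p_select * p_mask_given_sel
p_random = p_select * (1 - p_mask_given_sel) * p_rand_given_rest
p_unchanged = p_select * (1 - p_mask_given_sel) * (1 - p_rand_given_rest)

print(round(p_mask, 4), round(p_random, 4), round(p_unchanged, 4))
```

So roughly 12% of all tokens end up as the mask token, 1.5% as a random token, and 1.5% are selected but left unchanged, which matches the 80/10/10 split of the selected 15%.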

In [0]:
# Assumes a batch_size of one
# Given an embedding size, batch size, and output size, set up the fine-tuning head for the part of speech (POS) of a word
class FineTuningHeadPOS(torch.nn.Module):
    # Setup the class by defining and storing any needed variables
    def __init__(self, output_size, batch_size=1, embedding_size=768):
        super(FineTuningHeadPOS, self).__init__()
        # The size of each contextual embedding
        self.embedding_size = embedding_size
        # The size of the output
        self.output_size = output_size
        # The batch size
        self.batch_size = batch_size
        # Define size of linear layers depending on if embedding_size > output_size or embedding_size < output_size
        # Should not be equal, but the layers would all be the same size if equal
        diff = abs(embedding_size - output_size)
        # Default is that the embedding_size is greater than the output_size
        sizes = [embedding_size - diff // 4, embedding_size - diff // 2]
        # If the embedding_size is less than the output size, we grow the values appropriately
        if (embedding_size < output_size):
            sizes = [embedding_size + diff // 4, embedding_size + diff // 2]
        # Setup the layers for the MLP used to help learn the POS tags
        self.fcl1 = torch.nn.Linear(embedding_size, sizes[0])
        self.relu = torch.nn.ReLU()
        self.fcl2 = torch.nn.Linear(sizes[0], sizes[1])
        self.fcl3 = torch.nn.Linear(sizes[1], output_size)
        self.softmax = torch.nn.Softmax()

    # Forward propagate with a sentence of words, one-by-one, learning to predict the
    # sense/POS/etc of a word.
    #
    # input_ids: The input ids for the embedding model to get embeddings per input
    # attention_mask: Used by the embedding model to help get the embeddings
    # embedding_intervals: The intervals for the embeddings specifying which embeddings to combine
    def forward(self, input_ids, attention_mask, embedding_intervals, embedding_model):
      
      extracted_emb = embExtract(embedding_model,input_ids,attention_mask, embedding_intervals)
      # For each extracted word embedding, predict the POS logits
      predictions = []
      for embedding in extracted_emb:
        output = self.relu(self.fcl1(embedding))
        output = self.relu(self.fcl2(output))
        output = self.fcl3(output)
        predictions.append(output)
            #predictions.append(output.view(-1, self.output_size))
            #predictions.append(torch.argmax(output))
      predictions = torch.stack(predictions)
      # Return the predictions, computing the loss in the training algorithm
      #return torch.stack([predictions])
      return predictions

In [0]:
# Assumes a batch_size of one
# Given an embedding size, batch size, and output size, set up the fine-tuning head for the senses of a word
class FineTuningHeadSenses(torch.nn.Module):
    # Setup the class by defining and storing any needed variables
    def __init__(self, output_size=31, batch_size=1, embedding_size=768):
        super(FineTuningHeadSenses, self).__init__()
        # The size of each contextual embedding
        self.embedding_size = embedding_size
        # The size of the output
        self.output_size = output_size
        # The batch size
        self.batch_size = batch_size
        # Define size of linear layers depending on if embedding_size > output_size or embedding_size < output_size
        # Should not be equal, but the layers would all be the same size if equal
        diff = abs(embedding_size - output_size)
        # Default is that the embedding_size is greater than the output_size
        sizes = [embedding_size - diff // 4, embedding_size - diff // 2]
        # If the embedding_size is less than the output size, we grow the values appropriately
        if (embedding_size < output_size):
            sizes = [embedding_size + diff // 4, embedding_size + diff // 2]
        # Setup the layers for the MLP used to help learn the senses
        self.fcl1 = torch.nn.Linear(embedding_size, sizes[0])
        self.relu = torch.nn.ReLU()
        self.fcl2 = torch.nn.Linear(sizes[0], sizes[1])
        self.fcl3 = torch.nn.Linear(sizes[1], output_size)
        self.softmax = torch.nn.Softmax()

    # Forward propagate with a sentence of words, one-by-one, learning to predict the
    # sense/POS/etc of a word.
    #
    # input_ids: The input ids for the embedding model to get embeddings per input
    # attention_mask: Used by the embedding model to help get the embeddings
    # embedding_intervals: The intervals for the embeddings specifying which embeddings to combine
    def forward(self, input_ids, attention_mask, embedding_intervals, embedding_model):
        # Start by retrieving the embeddings from the embeddings model for input_ids and attention_mask
        extracted_emb = embExtract(embedding_model,input_ids, attention_mask, embedding_intervals)
          
        # For each extracted word embedding, predict the sense logits
        predictions = []
        for embedding in extracted_emb:
            output = self.relu(self.fcl1(embedding))
            output = self.relu(self.fcl2(output))
            output = self.fcl3(output)
            predictions.append(output)
            #predictions.append(output.view(-1, self.output_size))
            #predictions.append(torch.argmax(output))
        predictions = torch.stack(predictions)
        # Return the predictions, computing the loss in the training algorithm
        #return torch.stack([predictions])
        return predictions
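
Both heads pick their two hidden-layer widths by stepping a quarter and then half of the way from the embedding size towards the output size. Below is a small sketch of that computation on its own. The function name `hidden_sizes` is hypothetical; the arithmetic is copied from the constructors above.

```python
def hidden_sizes(embedding_size, output_size):
    # Distance between the input and output widths
    diff = abs(embedding_size - output_size)
    if embedding_size < output_size:
        # Grow towards a larger output
        return [embedding_size + diff // 4, embedding_size + diff // 2]
    # Shrink towards a smaller (or equal) output
    return [embedding_size - diff // 4, embedding_size - diff // 2]

sense_sizes = hidden_sizes(768, 31)
```

For the sense head defaults (768 inputs, 31 outputs) this gives hidden widths of 584 and 400, so the MLP narrows as 768, 584, 400, 31.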

In [0]:
# PARTIALLY COMPLETE (most likely need to modify how we use the loss, .item vs how it's set now)
# Did not specify the same optimizer setup as in the head, change if needed


min_loss = 99999999

# Given a sense model and POS model, train them
def train_model(roberta_model, sense_model, pos_model, trainloader, validationloader, tokenizer, path, alpha_lm=1, alpha_sense=0.6, alpha_pos=0.4, lr=1e-5, epochs=5, fine_tune_RoBERTa = True):
    # Store the loss values found from the validation set
    val_losses = []
    mask_losses = []
    sense_losses = []
    pos_losses = []
    # Best loss and counter from when it was last set
    global min_loss
    since_last_min = 0
    early_stop = 10
    # Define the loss function
    loss_function = torch.nn.CrossEntropyLoss()
    # Optimizers for each model
    optimizers = [None, None, None]
    # Set to the GPU and setup the optimizers
    if sense_model is not None:
        optimizers[0] = torch.optim.Adam(sense_model.parameters(), lr)
    if pos_model is not None:
        optimizers[1] = torch.optim.Adam(pos_model.parameters(), lr)

    optimizers[2] = torch.optim.Adam(roberta_model.parameters(), 1e-6)

    # Train for the specified number of epochs
    
    #prevWeights1 = sense_model.fcl1.weight
    #prevWeights2 = sense_model.fcl2.weight    

    epoch = 1
    iterations = 0
    validation_condition = max(1, len(trainloader) // 6) # At least 1 to avoid a modulo by zero on tiny datasets
    while (epoch < epochs + 1):
        # Iterate through the instances in the train DataLoader
        for sample_set in trainloader:
            # Add batch to GPU
            sample_set = tuple(t.cuda() for t in sample_set)
            # Unpack the inputs from our dataloader
            input_ids, attention_mask, sense_ids, pos_ids, lemma_ids, embedding_intervals, sense_spaceids = sample_set


            #Keep track of the total number of samples for training
            totalNumSamples = len(trainloader)            


            # Masked LM training
            # TO DO: Review and edit this as needed
            loss_masked = torch.tensor(0.0)
            loss_masked = loss_masked.cuda()
            #'''
            inputs_masked, labels_masked = mask_tokens(input_ids, tokenizer)
            inputs_masked = inputs_masked.cuda()
            labels_masked = labels_masked.cuda()
            outputs_masked = roberta_model(inputs_masked, masked_lm_labels=labels_masked)
            loss_masked = outputs_masked[0]
            #'''
            # Train the sense model if it is provided
            loss_sense = torch.tensor(0.0)
            if sense_model is not None:
              roberta_model.eval()              
              sense_model.train()
              optimizers[0].zero_grad()
              # Retrieve the sense predictions
              sense_predictions = sense_model(input_ids, attention_mask, embedding_intervals, roberta_model)
              sense_spaceids = torch.tensor([sense.view(-1) for sense in sense_spaceids.view(-1) if sense.item() != -1], dtype=torch.long)
              sense_spaceids = sense_spaceids.to(device)
              # Compute the loss for the sense predictions
              loss_sense = [loss_function(pred.view(1, -1), label.view(-1)) for pred, label in zip(sense_predictions, sense_spaceids)]
              loss_sense = torch.stack(loss_sense).sum() / len(loss_sense)
              loss_sense.backward(retain_graph=True)
              optimizers[0].step()
              sense_model.eval()

                
            # Train the POS model if it is provided
            loss_pos = torch.tensor(0.0)
            if pos_model is not None:
              roberta_model.eval()
              pos_model.train()
              optimizers[1].zero_grad()
              # Retrieve the POS predictions
              pos_predictions = pos_model(input_ids, attention_mask, embedding_intervals, roberta_model)
              pos_ids = torch.tensor([pos.view(-1) for pos in pos_ids.view(-1) if pos.item() != -1], dtype=torch.long)
              pos_ids = pos_ids.cuda()
              # Compute the loss for the POS predictions
              loss_pos = [loss_function(pred.view(1, -1), label.view(-1)) for pred, label in zip(pos_predictions, pos_ids)]
              loss_pos = torch.stack(loss_pos).sum() / len(loss_pos)
              loss_pos.backward(retain_graph=True)
              optimizers[1].step()
              pos_model.eval()

            #Want to train RoBERTa
            if fine_tune_RoBERTa:
              roberta_model.train()
            else:
              roberta_model.eval()

            optimizers[2].zero_grad()
            #total_loss = torch.autograd.Variable(alpha_lm * loss_masked + alpha_sense * loss_sense + alpha_pos * loss_pos, requires_grad=True)
            total_loss = alpha_lm * loss_masked + alpha_sense * loss_sense + alpha_pos * loss_pos
            total_loss.backward()
            optimizers[2].step()

            #Get statistics
            if iterations%5 == 0:
              #print(roberta_model.fcl1.weight)
              print("Training Losses: ", "- Iter:", iterations, "/",totalNumSamples, "- Masked LM=", loss_masked.item(), "- Sense=", loss_sense.item(), "- POS=", loss_pos.item(), "- TOTAL=", total_loss.item())
              print('Iteration:',iterations)
              #print(roberta_model.encoder.layer[1].intermediate.dense.weight)
              
              #currentWeights1 = sense_model.fcl1.weight
              #currentWeights2 = sense_model.fcl2.weight
              #print('Weights1Compare', currentWeights1 == sense_model.fcl1.weight)
              #print('Weights2Compare', currentWeights2 == sense_model.fcl2.weight)
              #prevWeights1 = currentWeights1
              #prevWeights2 = currentWeights2

              #print(pos_model.fcl1.weight)
              #print(pos_model.fcl2.weight)
              #print(pos_model.fcl3.weight)
            
            # Test on the validation set if the validation condition is met
            if (iterations % validation_condition == 0):
              total_loss, loss_mask, loss_sense, loss_pos = test_model(roberta_model, sense_model, pos_model, validationloader, tokenizer, alpha_lm, alpha_sense, alpha_pos)
              val_losses.append(total_loss)
              mask_losses.append(loss_mask)
              sense_losses.append(loss_sense)
              pos_losses.append(loss_pos)
              print("Iteration", iterations, "- Validation Loss=", total_loss)
              # Early stopping check
              if (min_loss >= total_loss):                
                min_loss = total_loss

                #Create pathname for the RoBERTa model folder
                #RoBERTaPath = path + "/RoBERTa"                 

                #Save each model to the specified path (and create one if it does not exist)
                #REMARK: os is imported at the top of the notebook
                #if not os.path.exists(RoBERTaPath):
                 # os.makedirs(RoBERTaPath)
                #roberta_model.save_pretrained(RoBERTaPath)
                #if sense_model != None:
                #  torch.save(sense_model.state_dict(), path + '/SenseModel.pt')
                #if pos_model != None:
                #  torch.save(pos_model.state_dict(), path + '/PosModel.pt') 

                since_last_min = 0

              else:
                since_last_min += 1
              if (since_last_min >= early_stop):
                epoch = max_epoch
                break
            iterations += 1
        epoch += 1

    return val_losses, mask_losses, sense_losses, pos_losses
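
The early-stopping bookkeeping at the end of `train_model` (`min_loss`, `since_last_min`, `early_stop`) can be sketched as a small standalone helper. This is only an illustration: the class name `EarlyStopper`, the starting value of `min_loss`, and the toy loss history are assumptions, not part of the notebook.

```python
class EarlyStopper:
    """Hypothetical helper mirroring the min_loss / since_last_min bookkeeping."""

    def __init__(self, patience):
        self.patience = patience          # plays the role of early_stop
        self.min_loss = float("inf")      # assumed starting value
        self.since_last_min = 0

    def update(self, loss):
        """Record one validation loss; return True when training should stop."""
        if self.min_loss >= loss:         # ties count as an improvement (>=)
            self.min_loss = loss
            self.since_last_min = 0
        else:
            self.since_last_min += 1
        return self.since_last_min >= self.patience


# Toy validation losses, made up for illustration
history = [1.74, 1.60, 1.65, 1.61, 1.59]
stopper = EarlyStopper(patience=2)
stopped_at = None
for step, loss in enumerate(history):
    if stopper.update(loss):
        stopped_at = step
        break
print("Stopped at validation step", stopped_at, "- best loss:", stopper.min_loss)
```

With a patience of 2, the run stops after two validation checks in a row fail to match or beat the best loss, so the last value in the toy history is never reached.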

In [0]:
# Test the model with a given dataloader
def test_model(roberta_model, sense_model, pos_model, dataloader, tokenizer, alpha_lm=1, alpha_sense=0.6, alpha_pos=0.4):
    # Store the values to return
    loss_values = [[], [], [], []] # 0 (total), 1 (mask), 2 (sense), 3 (pos)
    # Define the loss function
    loss_function = torch.nn.CrossEntropyLoss()    
    roberta_model.eval()

    if sense_model is not None:
        sense_model.eval()
    if pos_model is not None:
        pos_model.eval()

    correct_sense = 0
    total_sense = 0
    correct_pos = 0
    total_pos = 0

    with torch.no_grad():
        for i, sample_set in enumerate(dataloader):

            # Keep track of the total number of batches in the dataloader (for progress reporting)
            totalNumSamples = len(dataloader)

            if i%100 == 0:
              print("Testing sample:", i , "/", totalNumSamples)
            # Add batch to GPU
            sample_set = tuple(t.cuda() for t in sample_set)
            # Unpack the inputs from our dataloader
            input_ids, attention_mask, sense_ids, pos_ids, lemma_ids, embedding_intervals, sense_spaceids = sample_set
            # Masked LM evaluation
            # TO DO: Review and edit this as needed
            loss_masked = torch.tensor(0.0)
            loss_masked = loss_masked.cuda()
            #'''
            inputs_masked, labels_masked = mask_tokens(input_ids, tokenizer)
            inputs_masked = inputs_masked.cuda()
            labels_masked = labels_masked.cuda()
            outputs_masked = roberta_model(inputs_masked, masked_lm_labels=labels_masked)
            loss_masked = outputs_masked[0].item()
            #'''
            loss_values[1].append(loss_masked)
            # Evaluate the sense model if it is provided
            loss_sense = 0.0
            if sense_model is not None:
                # Retrieve the sense predictions
                sense_predictions = sense_model(input_ids, attention_mask, embedding_intervals, roberta_model)
                sense_spaceids = torch.tensor([sense.view(-1) for sense in sense_spaceids.view(-1) if sense.item() != -1], dtype=torch.long)
                sense_spaceids = sense_spaceids.cuda()

                for pred, label in zip(sense_predictions, sense_spaceids):
                  # Softmax preserves ordering, so the argmax of the raw scores is the predicted class
                  pred_actual = torch.argmax(pred)
                  if (pred_actual.view(-1).item() == label.view(-1).item()):
                    correct_sense += 1
                  total_sense += 1

                # Compute the loss for the sense predictions
                loss_sense = [loss_function(pred.view(1, -1), label.view(-1)).item() for pred, label in zip(sense_predictions, sense_spaceids)]
                loss_sense = sum(loss_sense) / len(loss_sense)
                loss_values[2].append(loss_sense)
            # Evaluate the POS model if it is provided
            loss_pos = 0.0
            if pos_model is not None:
                # Retrieve the POS predictions
                pos_predictions = pos_model(input_ids, attention_mask, embedding_intervals, roberta_model)
                pos_ids = torch.tensor([pos.view(-1) for pos in pos_ids.view(-1) if pos.item() != -1], dtype=torch.long)
                pos_ids = pos_ids.cuda()

                for pred, label in zip(pos_predictions, pos_ids):
                  # Softmax preserves ordering, so the argmax of the raw scores is the predicted class
                  pred_actual = torch.argmax(pred)
                  if (pred_actual.view(-1).item() == label.view(-1).item()):
                    correct_pos += 1
                  total_pos += 1


                # Compute the loss for the POS predictions
                loss_pos = [loss_function(pred.view(1, -1), label.view(-1)).item() for pred, label in zip(pos_predictions, pos_ids)]
                loss_pos = sum(loss_pos) / len(loss_pos)
                loss_values[3].append(loss_pos)
            loss_values[0].append(alpha_lm * loss_masked + alpha_sense * loss_sense + alpha_pos * loss_pos)
            #print(alpha_lm * loss_masked + alpha_sense * loss_sense + alpha_pos * loss_pos)
            #print(loss_masked)
            #print(loss_sense)
            #print(loss_pos)
    # Average the loss and return them
    loss_averages = []
    for loss_list in loss_values:
        avg = 0
        if len(loss_list) > 0:
            avg = sum(loss_list) / len(loss_list)
        loss_averages.append(avg)
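    # Only report an accuracy when at least one prediction was scored (avoids division by zero)
    if sense_model is not None and total_sense > 0:
        print("Sense Accuracy =", correct_sense / total_sense)
    if pos_model is not None and total_pos > 0:
        print("POS Accuracy =", correct_pos / total_pos)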
    return loss_averages[0], loss_averages[1], loss_averages[2], loss_averages[3]
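
The accuracy counts in `test_model` only need the index of the largest score for each word. Softmax is strictly increasing in each score, so the argmax of the raw scores and the argmax of the softmax probabilities always agree. A minimal pure-Python sketch of that fact, where the helper names and the toy scores are made up for illustration:

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy scores for one word over four classes (illustrative values)
logits = [-1.2, 3.4, 0.7, 3.1]
probs = softmax(logits)

# Both views pick the same class
print(argmax(logits), argmax(probs))
```

This is why the softmax call can be skipped when only the predicted class is needed.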



In [0]:
#PUT YOUR NAME FOR THE FOLDER HERE!!!
NAME = 'Common'
basepath = f"/content/drive/My Drive/CSI5138_Project/Models/{NAME}" 

# Define the model used for senses
# Note that for POS, we would simply need to change the output size.
output_size = 31
#sense_model = None
sense_model = FineTuningHeadSenses(output_size, batch_size=1, embedding_size=768)

if sense_model is not None:
  if os.path.exists(basepath + '/SenseModel.pt'):
    sense_model.load_state_dict(torch.load(basepath + '/SenseModel.pt'))
    print('Sense loaded')
  sense_model.to(device)

output_size = len(SEMCOR_VOCAB.POS)
#pos_model = None
pos_model = FineTuningHeadPOS(output_size, batch_size=1, embedding_size=768)

if pos_model is not None:
  if os.path.exists(basepath + '/PosModel.pt'):
    pos_model.load_state_dict(torch.load(basepath + '/PosModel.pt'))
    print('POS loaded')
  pos_model.to(device)

Sense loaded
POS loaded


In [0]:

ALPHA_LM = 0.75
ALPHA_SENSE = 0.15
ALPHA_POS = 0.10
LR = 1e-4
EPOCHS = 4
FINE_TUNE_ROBERTA = True


#IMPORTANT REMARK: It takes about 2 minutes for the folders and saved models to appear in Google Drive; give it some time before doing anything else
val_losses, mask_losses, sense_losses, pos_losses = train_model(roberta_model, sense_model, 
                                                                pos_model, trainloader, validationloader, 
                                                                tokenizer, basepath, alpha_lm=ALPHA_LM, alpha_sense=ALPHA_SENSE, 
                                                                alpha_pos=ALPHA_POS, lr=LR, epochs=EPOCHS, fine_tune_RoBERTa = FINE_TUNE_ROBERTA)

roberta_model.save_pretrained('/content/drive/My Drive/CSI5138_Project/Models/Simon/RoBERTa')


Training Losses:  - Iter: 0 / 9962 - Masked LM= 0.0 - Sense= 8.710225665709004e-05 - POS= 0.0018609365215525031 - TOTAL= 0.00019915899611078203
Iteration: 0
Testing sample: 0 / 1758




Testing sample: 100 / 1758
Testing sample: 200 / 1758
Testing sample: 300 / 1758
Testing sample: 400 / 1758
Testing sample: 500 / 1758
Testing sample: 600 / 1758
Testing sample: 700 / 1758
Testing sample: 800 / 1758
Testing sample: 900 / 1758
Testing sample: 1000 / 1758
Testing sample: 1100 / 1758
Testing sample: 1200 / 1758
Testing sample: 1300 / 1758
Testing sample: 1400 / 1758
Testing sample: 1500 / 1758
Testing sample: 1600 / 1758
Testing sample: 1700 / 1758
Sense Accuracy = 0.831564838601586
POS Accuracy = 0.9132413716072825
Iteration 0 - Validation Loss= 1.7442793256137408
Training Losses:  - Iter: 5 / 9962 - Masked LM= 0.015453338623046875 - Sense= 0.9299651384353638 - POS= 0.2595136761665344 - TOTAL= 0.17703615128993988
Iteration: 5
Training Losses:  - Iter: 10 / 9962 - Masked LM= 4.047239780426025 - Sense= 0.6125109791755676 - POS= 0.3687419593334198 - TOTAL= 3.1641809940338135
Iteration: 10
Training Losses:  - Iter: 15 / 9962 - Masked LM= 0.0 - Sense= 0.46051129698753357 - PO
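
The `TOTAL` value in each training log line is consistent with a weighted sum of the three task losses using `ALPHA_LM`, `ALPHA_SENSE` and `ALPHA_POS`. A small sketch that reproduces the total logged at iteration 5; the function name `combine_losses` is hypothetical, and the inputs are copied from the log above:

```python
ALPHA_LM = 0.75
ALPHA_SENSE = 0.15
ALPHA_POS = 0.10


def combine_losses(loss_masked, loss_sense, loss_pos,
                   alpha_lm=ALPHA_LM, alpha_sense=ALPHA_SENSE, alpha_pos=ALPHA_POS):
    """Weighted sum of the masked-LM, sense and POS losses."""
    return alpha_lm * loss_masked + alpha_sense * loss_sense + alpha_pos * loss_pos


# Values taken from the "Iter: 5" log line
total = combine_losses(0.015453338623046875, 0.9299651384353638, 0.2595136761665344)
logged_total = 0.17703615128993988

# The two agree up to floating-point rounding
print(total, logged_total)
```

Because the masked-LM term carries a weight of 0.75, a batch with a large masked-LM loss (such as iteration 10) dominates the total.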

In [0]:
roberta_model.save_pretrained('/content/drive/My Drive/CSI5138_Project/Models/Simon/RoBERTa')

In [0]:
#Save your POS model manually in case the automatic save inside the training loop did not run by the end of the run
torch.save(pos_model.state_dict(), '/content/drive/My Drive/CSI5138_Project/Models/Common/PosModel.pt') 

In [0]:
#Save your Sense model manually in case the automatic save inside the training loop did not run by the end of the run
torch.save(sense_model.state_dict(), '/content/drive/My Drive/CSI5138_Project/Models/Common/SenseModel.pt') 

In [0]:
#Look into POS predictions explicitly
sample_set = next(iter(trainloader))
sample_set = tuple(t.cuda() for t in sample_set)
# Unpack the inputs from our dataloader
input_ids, attention_mask, sense_ids, pos_ids, lemma_ids, embedding_intervals, sense_spaceids = sample_set
pos_model.eval()

pos_predictions = pos_model(input_ids, attention_mask, embedding_intervals, roberta_model)
pos_ids = torch.tensor([pos.view(-1) for pos in pos_ids.view(-1) if pos.item() != -1], dtype=torch.long)
pos_ids = pos_ids.cuda()
# Compare each POS prediction with its label
for pred, label in zip(pos_predictions, pos_ids):
  # Softmax preserves ordering, so the argmax of the raw scores is the predicted class
  pred_id = torch.argmax(pred).item()
  label_id = label.item()
  print('Prediction:', SEMCOR_VOCAB.convertPOSID_to_POS(pred_id),'  Actual:', SEMCOR_VOCAB.convertPOSID_to_POS(label_id))


Prediction: PRP   Actual: PRP
Prediction: VBD   Actual: VBD
Prediction: VB   Actual: VB
Prediction: IN   Actual: IN
Prediction: PRP   Actual: PRP
Prediction: VB   Actual: VB
Prediction: NN   Actual: NN
Prediction: IN   Actual: VB
Prediction: WDT   Actual: WDT
Prediction: TO   Actual: TO
Prediction: VB   Actual: VB
Prediction: PRP   Actual: PRP
Prediction: PUNC   Actual: PUNC




In [0]:
#Look into Sense predictions explicitly
sample_set = next(iter(trainloader))
sample_set = tuple(t.cuda() for t in sample_set)
# Unpack the inputs from our dataloader
input_ids, attention_mask, sense_ids, pos_ids, lemma_ids, embedding_intervals, sense_spaceids = sample_set
sense_model.eval()

sense_predictions = sense_model(input_ids, attention_mask, embedding_intervals, roberta_model)
sense_spaceids = torch.tensor([sense.view(-1) for sense in sense_spaceids.view(-1) if sense.item() != -1], dtype=torch.long)
sense_spaceids = sense_spaceids.cuda()

print(len(sense_predictions))
print(len(sense_spaceids))

# Compare each sense prediction with its label
for pred, label in zip(sense_predictions, sense_spaceids):
  # Softmax preserves ordering, so the argmax of the raw scores is the predicted class
  pred_id = torch.argmax(pred).item()
  label_id = label.item()
  print('Prediction:', pred_id,'  Actual:', label_id)


25
25
Prediction: 0   Actual: 0
Prediction: 1   Actual: 1
Prediction: 0   Actual: 0
Prediction: 0   Actual: 0
Prediction: 0   Actual: 0
Prediction: 0   Actual: 0
Prediction: 11   Actual: 11
Prediction: 3   Actual: 3
Prediction: 0   Actual: 0
Prediction: 0   Actual: 0
Prediction: 0   Actual: 0
Prediction: 0   Actual: 1
Prediction: 0   Actual: 0
Prediction: 1   Actual: 1
Prediction: 0   Actual: 0
Prediction: 0   Actual: 1
Prediction: 8   Actual: 1
Prediction: 0   Actual: 0
Prediction: 1   Actual: 1
Prediction: 0   Actual: 0
Prediction: 0   Actual: 0
Prediction: 1   Actual: 1
Prediction: 8   Actual: 8
Prediction: 0   Actual: 0
Prediction: 0   Actual: 0




In [0]:

import pandas as pd

# initialize statistics
stats = {'Validation loss':val_losses,
         'Mask loss':mask_losses,
         'Sense loss':sense_losses,
         'POS loss':pos_losses
        } 

df = pd.DataFrame(stats)

# Print the output.
print(df)

df.to_csv('/content/drive/My Drive/CSI5138_Project/Models/Common/Sense Model Statistics.csv') 

NameError: ignored

## WiC Training and Validation

### Load the WiC data

In [0]:
#LOAD THE WIC TRAINING DATA ON A DATALOADER
train_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/WiC/train.jsonl")
#print(f"Number of training exampled = {len(train_json_objs)}")
#print(train_json_objs[5])



# Process the data
raw_train_set = wic_preprocessing(train_json_objs, shuffle_data=False, verbose = False) # We do not want to shuffle for now.

# Create a PyTorch dataset for it
train_data = TensorDataset(
    torch.tensor(raw_train_set["input_ids"]),
    torch.tensor(raw_train_set["token_type_ids"]),
    torch.tensor(raw_train_set["attention_mask"]),
    torch.tensor(raw_train_set["labels"]),
    torch.tensor(raw_train_set["word1_locs"]),
    torch.tensor(raw_train_set["word2_locs"])
)

# Create a sampler and loader
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data, sampler=train_sampler, batch_size=BATCH_SIZE)

Number of training exampled = 5428
{'word': 'head', 'sentence1': 'His horse won by a head.', 'sentence2': 'He is two heads taller than his little sister.', 'idx': 5, 'label': True, 'start1': 19, 'start2': 10, 'end1': 23, 'end2': 15, 'version': 1.1}
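
Each WiC record carries character offsets (`start1`/`end1`, `start2`/`end2`) that locate the target word in each sentence. A minimal sketch of how these offsets slice out the word, using the record printed above; the helper name `target_span` is hypothetical, and how `wic_preprocessing` actually consumes the offsets is not shown here:

```python
# Record copied from the printed training example
record = {
    'word': 'head',
    'sentence1': 'His horse won by a head.',
    'sentence2': 'He is two heads taller than his little sister.',
    'idx': 5, 'label': True,
    'start1': 19, 'start2': 10, 'end1': 23, 'end2': 15,
    'version': 1.1,
}


def target_span(sentence, start, end):
    """Return the substring between the character offsets (end is exclusive)."""
    return sentence[start:end]


word1 = target_span(record['sentence1'], record['start1'], record['end1'])
word2 = target_span(record['sentence2'], record['start2'], record['end2'])
print(word1, word2)
```

The second span is an inflected form of the lemma in `word`, so the offsets rather than a string search on the lemma are the reliable way to find the target.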


In [0]:
#LOAD THE WIC TESTING AND VALIDATION DATA ON DATALOADERS
# Load the json objects from each file
test_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/WiC/test.jsonl")
valid_json_objs = load_json_objects_from_file("/content/drive/My Drive/CSI5138_Project/WiC/val.jsonl")
# Process the objects
raw_test_set = wic_preprocessing(test_json_objs, testing = False) # The labels for the testing set are unknown
raw_valid_set = wic_preprocessing(valid_json_objs)
# Create PyTorch datasets
test_data = TensorDataset(
    torch.tensor(raw_test_set["input_ids"]),
    torch.tensor(raw_test_set["token_type_ids"]),
    torch.tensor(raw_test_set["attention_mask"]),
    torch.tensor(raw_test_set["word1_locs"]),
    torch.tensor(raw_test_set["word2_locs"])
)
validation_data = TensorDataset(
    torch.tensor(raw_valid_set["input_ids"]),
    torch.tensor(raw_valid_set["token_type_ids"]),
    torch.tensor(raw_valid_set["attention_mask"]),
    torch.tensor(raw_valid_set["labels"]),
    torch.tensor(raw_valid_set["word1_locs"]),
    torch.tensor(raw_valid_set["word2_locs"])
)

# Create a sampler and loader for each
test_sampler = RandomSampler(test_data)
test_dataloader = DataLoader(test_data, sampler=test_sampler, batch_size=BATCH_SIZE)
validation_sampler = SequentialSampler(validation_data)
validation_dataloader = DataLoader(validation_data, sampler=validation_sampler, batch_size=BATCH_SIZE)

### Custom head class for WiC instead of a classification head
Based on https://pytorch.org/tutorials/beginner/examples_nn/two_layer_net_module.html and https://huggingface.co/transformers/_modules/transformers/modeling_roberta.html#RobertaModel



In [0]:
class WiCSemcorHead(torch.nn.Module):
    def __init__(self, roberta_based_model, embedding_size = 768):
        """
        Keeps a reference to the provided RoBERTa model. 
        It then adds a linear layer that takes the difference between the 
        embeddings of the two words to compare, followed by a 2-class output layer.
        """
        super(WiCSemcorHead, self).__init__()
        self.embedding_size = embedding_size
        self.embedder = roberta_based_model
        self.linear_diff = torch.nn.Linear(embedding_size, 250, bias = True)
        self.linear_seperator = torch.nn.Linear(250, 2, bias = True)
        self.loss = torch.nn.CrossEntropyLoss()
        self.activation = torch.nn.ReLU()
        self.softmax = torch.nn.Softmax(dim=-1)  # normalize over the class dimension

    def forward(self, input_ids=None, attention_mask=None, labels=None,
                word1_locs = None, word2_locs = None):
        """
        Takes the same arguments as the RoBERTa forward pass, plus two tensors giving the locations of the two words to compare
        """
        if word1_locs is None or word2_locs is None:
          raise ValueError("The tensors (word1_locs, word2_locs) containing the location of the words to compare in the input vector must be provided.")
        elif input_ids is None:
          raise ValueError("The input_ids tensor must be provided.")
        elif word1_locs.shape[0] != input_ids.shape[0] or word2_locs.shape[0] != input_ids.shape[0]:
          raise ValueError("All provided vectors should have the same batch size.")
        batch_size = word1_locs.shape[0]
        # Get the embeddings
        embs, _ = self.embedder(input_ids=input_ids, attention_mask=attention_mask)
        # Get the words
        word1s = torch.matmul(word1_locs, embs).view(batch_size, self.embedding_size)
        word2s = torch.matmul(word2_locs, embs).view(batch_size, self.embedding_size)
        diff = word1s - word2s
        # Calculate outputs using activation
        layer1_results = self.activation(self.linear_diff(diff))
        # Keep the raw logits: CrossEntropyLoss applies log-softmax internally,
        # so applying softmax first would flatten the loss and its gradients
        logits = self.linear_seperator(layer1_results)
        outputs = logits
        # Calculate the loss
        if labels is not None:
            # Cross-entropy loss over the two classes
            loss = self.loss(logits.view(-1, 2), labels.view(-1))
            outputs = (loss, logits)
        return outputs
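
The `torch.matmul(word1_locs, embs)` step in `forward` selects one embedding per example. The sketch below assumes that each location tensor is a one-hot float tensor of shape `(batch, 1, seq_len)`; the format actually produced by `wic_preprocessing` is not shown here, so treat the shapes and toy values as assumptions.

```python
import torch

batch_size, seq_len, embedding_size = 2, 5, 4

# Toy "embeddings": every entry is distinct so the selection is easy to check
embs = torch.arange(batch_size * seq_len * embedding_size, dtype=torch.float)
embs = embs.view(batch_size, seq_len, embedding_size)

# Assumed format: one-hot rows marking the target token of each example
positions = torch.tensor([1, 3])
word_locs = torch.nn.functional.one_hot(positions, num_classes=seq_len)
word_locs = word_locs.float().unsqueeze(1)  # (batch, 1, seq_len)

# Batched matmul: (batch, 1, seq_len) x (batch, seq_len, emb) -> (batch, 1, emb)
selected = torch.matmul(word_locs, embs).view(batch_size, embedding_size)

# Same result as indexing the target position directly
expected = embs[torch.arange(batch_size), positions]
print(torch.equal(selected, expected))
```

Under this assumption the matmul is equivalent to indexing, but it also keeps working if a location row marks several sub-tokens, in which case their embeddings are summed.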

### Instantiate the WiCSemcorHead

In [0]:
class_model = WiCSemcorHead(roberta_model, embedding_size = 768)

### The training loop (RoBERTa is fine-tuned along with the head)

In [0]:
# Want to maximize accuracy
max_val_acc = (0, 0)
# Put the model in GPU
class_model.cuda()
# Create the optimizer
param_optimizer = list(class_model.named_parameters())
no_decay = ['bias', 'gamma', 'beta']
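# The optimizer reads the 'weight_decay' key; an unrecognized key such as
# 'weight_decay_rate' would be ignored and the default decay used instead
optimizer_grouped_parameters = [
    {'params': [p for n, p in param_optimizer if not any(nd in n for nd in no_decay)],
     'weight_decay': 0.01},
    {'params': [p for n, p in param_optimizer if any(nd in n for nd in no_decay)],
     'weight_decay': 0.0}
]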
# I use the one that comes with the models, but any other optimizer could be used
optimizer = AdamW(optimizer_grouped_parameters, lr=1e-6)
# Store our loss and accuracy for plotting
fit_history = {"loss": [],  "accuracy": [], "val_loss": [], "val_accuracy": []}
epoch_number = 0
epoch_since_max = 0
continue_learning = True
while epoch_number < EPOCHS and continue_learning:
  epoch_number += 1
  print(f"Training epoch #{epoch_number}")
  # Tracking variables
  tr_loss, tr_accuracy = 0, 0
  nb_tr_examples, nb_tr_steps = 0, 0
  eval_loss, eval_accuracy = 0, 0
  nb_eval_steps, nb_eval_examples = 0, 0
  # Training
  # Set our model to training mode (as opposed to evaluation mode)
  class_model.train()
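  # Keep RoBERTa in evaluation mode (disables dropout). This does not freeze its
  # weights: they are still updated because they were passed to the optimizer
  class_model.embedder.eval()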
  # Train the data for one epoch
  for step, batch in enumerate(train_dataloader):
    # Add batch to GPU
    batch = tuple(t.cuda() for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_token_ids, b_input_mask, b_labels, b_word1, b_word2 = batch
    # Clear out the gradients (by default they accumulate)
    optimizer.zero_grad()
    # Forward pass
    #loss, logits = class_model(b_input_ids, token_type_ids=b_token_ids, attention_mask=b_input_mask, labels=b_labels)   
    loss, logits = class_model(b_input_ids, attention_mask=b_input_mask, 
                               labels=b_labels, word1_locs = b_word1, word2_locs = b_word2) 
    # Backward pass
    loss.backward()
    # Update parameters and take a step using the computed gradient
    optimizer.step()
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.cpu().numpy()
    # Calculate the accuracy
    b_accuracy = flat_accuracy(logits, label_ids) # For RobertaForClassification
    '''preds = []
    for logit in logits:
      if logit >= 0:
        preds.append([1])
      else:
        preds.append([-1])
    b_accuracy = accuracy_score(label_ids, preds)'''
    # Append to fit history
    fit_history["loss"].append(loss.item()) 
    fit_history["accuracy"].append(b_accuracy) 
    # Update tracking variables
    tr_loss += loss.item()
    tr_accuracy += b_accuracy
    nb_tr_examples += b_input_ids.size(0)
    nb_tr_steps += 1
    if nb_tr_steps%10 == 0:
      print("\t\tTraining Batch {}: Loss: {}; Accuracy: {}".format(nb_tr_steps, loss.item(), b_accuracy))
  print("Training:\n\tLoss: {}; Accuracy: {}".format(tr_loss/nb_tr_steps, tr_accuracy/nb_tr_steps))
  # Validation
  # Put model in evaluation mode to evaluate loss on the validation set
  class_model.eval()
  # Evaluate data for one epoch
  for batch in validation_dataloader:
    # Add batch to GPU
    batch = tuple(t.cuda() for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_token_ids, b_input_mask, b_labels, b_word1, b_word2 = batch
    # Telling the model not to compute or store gradients, saving memory and speeding up validation
    with torch.no_grad():
      # Forward pass, calculate logit predictions
      #loss, logits = class_model(b_input_ids, token_type_ids=b_token_ids, attention_mask=b_input_mask, labels=b_labels)
      loss, logits = class_model(b_input_ids, attention_mask=b_input_mask, 
                                 labels=b_labels, word1_locs = b_word1, word2_locs = b_word2)
    # Move logits and labels to CPU
    logits = logits.detach().cpu().numpy()
    label_ids = b_labels.cpu().numpy()
    # Calculate the accuracy
    b_accuracy = flat_accuracy(logits, label_ids) # For RobertaForClassification
    '''preds = []
    for logit in logits:
      if logit >= 0:
        preds.append([1])
      else:
        preds.append([-1])
    b_accuracy = accuracy_score(label_ids, preds)'''
    # Append to fit history
    fit_history["val_loss"].append(loss.item()) 
    fit_history["val_accuracy"].append(b_accuracy) 
    # Update tracking variables
    eval_loss += loss.item()
    eval_accuracy += b_accuracy
    nb_eval_examples += b_input_ids.size(0)
    nb_eval_steps += 1
    if nb_eval_steps%10 == 0:
      print("\t\tValidation Batch {}: Loss: {}; Accuracy: {}".format(nb_eval_steps, loss.item(), b_accuracy))
  eval_acc = eval_accuracy/nb_eval_steps
  if eval_acc >= max_val_acc[0]:
    max_val_acc = (eval_acc, epoch_number)
    continue_learning = True
    epoch_since_max = 0 # New max
  else:
    epoch_since_max += 1
    if epoch_since_max> PATIENCE:
      continue_learning = False # Stop learning, starting to overfit
  print("Validation:\n\tLoss={}; Accuracy: {}".format(eval_loss/nb_eval_steps, eval_accuracy/nb_eval_steps))
print(f"Best accuracy ({max_val_acc[0]}) obtained at epoch #{max_val_acc[1]}.")
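
The optimizer setup in the training cell splits parameters into two groups by name: those matching an entry of `no_decay` get no weight decay, and all others get a decay of 0.01. A minimal pure-Python sketch of that split, using the `weight_decay` key that `torch.optim` optimizers read. The function name `split_decay_groups` is hypothetical, and the string placeholders stand in for real parameter tensors.

```python
def split_decay_groups(named_params, no_decay=("bias", "gamma", "beta"), weight_decay=0.01):
    """Split (name, param) pairs into a decayed group and a non-decayed group."""
    decay, skip = [], []
    for name, param in named_params:
        # Substring match on the parameter name, as in the training cell
        if any(nd in name for nd in no_decay):
            skip.append(param)
        else:
            decay.append(param)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": skip, "weight_decay": 0.0},
    ]


# Names follow the two linear layers of WiCSemcorHead; values are placeholders
toy_params = [
    ("linear_diff.weight", "W1"),
    ("linear_diff.bias", "b1"),
    ("linear_seperator.weight", "W2"),
    ("linear_seperator.bias", "b2"),
]
groups = split_decay_groups(toy_params)
print(groups)
```

Since the match is on substrings of the parameter name, it is worth printing the real names from `class_model.named_parameters()` to confirm which group each one lands in.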

### Get the testing results
The testing labels for WiC are not available, so this cell only collects predictions; the loss and accuracy cannot be computed on the test set.

In [0]:
# Testing
# Put model in evaluation mode to predict on the test set
class_model.eval()
test_predictions = []
for batch in test_dataloader:
  # Add batch to GPU
  batch = tuple(t.cuda() for t in batch)
  # Unpack the inputs from our dataloader (test_data holds five tensors and no labels)
  b_input_ids, b_token_ids, b_input_mask, b_word1, b_word2 = batch
  # Telling the model not to compute or store gradients, saving memory and speeding up prediction
  with torch.no_grad():
    # Forward pass; without labels the model returns only the logits
    logits = class_model(b_input_ids, attention_mask=b_input_mask,
                         word1_locs = b_word1, word2_locs = b_word2)
  # Keep the predicted class of each example
  # (test_sampler is a RandomSampler, so the predictions are in shuffled order)
  test_predictions.extend(logits.argmax(dim=-1).cpu().tolist())
# Print final results
print("Testing:\n\tCollected {} predictions".format(len(test_predictions)))