In this homework, you'll explore incorporating BERT and SentenceTransformers into the Lesk algorithm for word sense disambiguation.  (You'll likely want to run this on Colab.)

In [1]:
# Note: Christian and I worked on this code together

!pip install transformers
!pip install sentence-transformers


In [2]:
from transformers import BertModel, BertTokenizer
import numpy as np
from nltk.corpus import wordnet as wn
import nltk
from scipy.spatial.distance import cosine
import operator
import torch
from math import sqrt


In [3]:
nltk.download('wordnet')

[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/piadeshpande/nltk_data...


True

In [4]:
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('bert-base-uncased')


**Q1:** The Lesk algorithm we discussed in class uses information about the *context* around a term in calculating the similarity between a word in a sentence and a word in a dictionary gloss.  For instance, [Basile et al. 2014](https://www.aclweb.org/anthology/C/C14/C14-1151.pdf) use static word vectors to provide this context-level information, where we measure the similarity between a gloss g = $\{ g_1, \ldots, g_G \}$ and context c = $\{ c_1, \ldots, c_C \}$ as the cosine similarity between the sum of distributed representations:

$$
\cos \left(\sum_{i=1}^G g_i, \sum_{i=1}^C c_i  \right)
$$

However, over the past few weeks we've considered how contextual language models like BERT already provide a sentence-level contextualization for words.  So given a target sentence ("I withdrew money from the *bank*") with target term (bank), and a list of dictionary glosses/examples corresponding to different senses ("A bank is a financial institution" = bank1; "A bank is the side of a river" = bank2), let's adapt the Lesk algorithm to simply calculate the similarity between the average BERT embedding for all words in the target sentence (including the [CLS] and [SEP] tokens) and the average BERT embedding for all the words in the sense gloss (again including [CLS] and [SEP]):

$$
\cos \left({1 \over G}\sum_{i=1}^G BERT(g_{i}), {1 \over C} \sum_{j=1}^C BERT(c_{j}) \right)
$$


* The gloss for a synset can be found in `synset.definition()`.
* You can find the cosine similarity between two vectors below.
* `wn.synsets(word, pos=part_of_speech)` gets you a list of the synsets for a word with a specific part of speech (e.g., "n" for noun)
* Feel free to draw on the code you've already seen for getting the BERT embeddings for words (e.g., `3.embeddings/BERT.ipynb`).
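
As a rough illustration of the averaged-embedding formula above, here is a minimal NumPy sketch. The arrays are made-up 3-dimensional stand-ins for per-token BERT vectors (real `bert-base-uncased` vectors have 768 dimensions), and the helper name `mean_pooled_cosine` is hypothetical, not part of any library.

```python
import numpy as np

def mean_pooled_cosine(gloss_vectors, context_vectors):
    """Cosine similarity between the mean gloss vector and the mean context vector."""
    g = np.mean(gloss_vectors, axis=0)    # (1/G) * sum of gloss token vectors
    c = np.mean(context_vectors, axis=0)  # (1/C) * sum of context token vectors
    return float(np.dot(g, c) / (np.linalg.norm(g) * np.linalg.norm(c)))

# Toy "token embeddings" (made-up numbers, not real BERT output).
gloss = np.array([[1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])       # G = 2 tokens
context = np.array([[2.0, 2.0, 0.0],
                    [0.0, 0.0, 0.0],
                    [1.0, 1.0, 0.0]])     # C = 3 tokens
unrelated = np.array([[0.0, 0.0, 1.0],
                      [0.0, 0.0, 3.0]])   # a gloss pointing in a different direction

similarity = mean_pooled_cosine(gloss, context)
unrelated_similarity = mean_pooled_cosine(unrelated, context)
print(similarity, unrelated_similarity)
```

With real BERT output, each row would be one token's vector from `last_hidden_state`, so the same averaging applies without needing to locate the target word inside the gloss.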

In [5]:
def cosine_similarity(vec1, vec2):
  return np.dot(vec1, vec2)/(sqrt(np.dot(vec1, vec1)) * sqrt(np.dot(vec2, vec2)))

In [6]:
test = wn.synsets("test", pos="n")
test[1].definition()
#inputs = tokenizer(test[1].definition(), return_tensors="pt")
#tokens=tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
#term_idx=tokens.index(term)
#outputs = model(**inputs)
#res = outputs.last_hidden_state[0][term_idx].detach().numpy()
#res.shape()

'any standardized procedure for measuring sensitivity or memory or intelligence or aptitude or personality etc'

In [38]:
from nltk.corpus.reader.wordnet import Synset

def bert_lesk(word, sentence, part_of_speech):

# this function gets the BERT token representations for a given term within a larger string
    def get_bert_for_token(string, term):
        # tokenize
        inputs = tokenizer(string, return_tensors="pt")
        tokens=tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
        # find the first location of the query term among those tokens (so we know which BERT rep to use)
        term_idx=tokens.index(term)
        outputs = model(**inputs)
        return outputs.last_hidden_state[0][term_idx].detach().numpy()
    context_vector=get_bert_for_token(sentence, word)
    synsets= wn.synsets(word,pos = part_of_speech)
    
    
    
    # Lastly, for every synset in the list, compare the BERT representation of its
    # definition against the context vector.
    # NOTE: the loop below is disabled because it raises ValueError in tokens.index(term):
    # WordNet definitions usually do not contain the target word, so there is no
    # token position to look up inside the gloss.
    # vals = {}
    # for synset in synsets:
    #     vector = get_bert_for_token(synset.definition(), word)
    #     vals[synset] = cosine_similarity(context_vector, vector)
    #
    # sorted_x = sorted(vals.items(), key=operator.itemgetter(1), reverse=True)
    # for k, v in sorted_x:
    #     print("%.3f\t%s\t%s" % (v, k, k.definition()))
    
    return(synsets[1].definition())


Execute the following two cells to check whether your implementation distinguishes between these two senses of "bank".

In [39]:
bert_lesk("bank", "I deposited my money into my savings account at the bank", "n")

# This call runs and returns a definition. The function breaks once it loops over the synsets,
# which confused me; Christian and I worked on this and couldn't figure out why.



'a financial institution that accepts deposits and channels the money into lending activities'

In [42]:
from nltk.corpus.reader.wordnet import Synset

def bert_lesk(word, sentence, part_of_speech):
    """Rank the WordNet senses of `word` by the cosine similarity between the mean
    BERT embedding of `sentence` and the mean BERT embedding of each gloss."""

    # Average the BERT vectors of every token in the string, including [CLS] and [SEP].
    def get_mean_bert(string):
        inputs = tokenizer(string, return_tensors="pt")
        with torch.no_grad():
            outputs = model(**inputs)
        return outputs.last_hidden_state[0].mean(dim=0).numpy()

    # An earlier attempt located `word` inside each gloss with tokens.index(word),
    # which raised "ValueError: 'bank' is not in list" because WordNet glosses
    # usually do not contain the word they define. Averaging over all tokens, as Q1
    # asks, needs no such lookup.
    context_vector = get_mean_bert(sentence)
    synsets = wn.synsets(word, pos=part_of_speech)

    vals = {}
    for synset in synsets:
        vals[synset] = cosine_similarity(context_vector, get_mean_bert(synset.definition()))

    # Print the senses from most to least similar to the context sentence.
    sorted_x = sorted(vals.items(), key=operator.itemgetter(1), reverse=True)
    for k, v in sorted_x:
        print("%.3f\t%s\t%s" % (v, k, k.definition()))


In [43]:
bert_lesk("bank", "I ran along the river bank", "n")

**Q2:** Now do the same thing with SentenceBERT.  For a gloss $g$ and a target sentence $c$ containing the word to disambiguate, calculate the similarity between them as the cosine similarity of the SentenceBERT vectors of each one:

$$
\cos \left(\textrm{SBERT}(g), \textrm{SBERT}(c) \right)
$$
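
The ranking step implied by this formula can be sketched independently of any particular encoder. In the example below, `rank_glosses` and `toy_encode` are hypothetical names introduced for illustration; `toy_encode` is a made-up word-count encoder standing in for SentenceBERT so that the sketch runs without downloading a model. With sentence-transformers, the `encode` argument could instead be a model's `encode` method.

```python
import numpy as np

def rank_glosses(encode, context, glosses):
    """Return (score, gloss) pairs sorted from most to least similar to `context`.

    `encode` is any callable that maps a string to a 1-D vector.
    """
    c = np.asarray(encode(context), dtype=float)
    scored = []
    for gloss in glosses:
        g = np.asarray(encode(gloss), dtype=float)
        score = float(np.dot(g, c) / (np.linalg.norm(g) * np.linalg.norm(c)))
        scored.append((score, gloss))
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Hypothetical stand-in encoder: word counts over a tiny fixed vocabulary.
VOCAB = ["money", "deposits", "river", "water", "slope"]

def toy_encode(text):
    words = text.lower().split()
    return [words.count(v) for v in VOCAB]

ranking = rank_glosses(
    toy_encode,
    "I put money and deposits in the bank",
    ["an institution that accepts deposits of money",
     "the slope beside a river or water"],
)
print(ranking[0][1])
```

Passing the encoder in as an argument keeps the scoring logic separate from the model, so the same function works with any sentence encoder that returns fixed-length vectors.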


In [None]:
from sentence_transformers import SentenceTransformer
sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')


In [None]:
def sentencebert_lesk(word, sentence, part_of_speech):

    # Encode a whole string as a single SentenceBERT vector.
    def get_sentence_bert(string):
        return sentence_model.encode(string)

    context_vector = get_sentence_bert(sentence)
    synsets = wn.synsets(word, pos=part_of_speech)

    # Score each sense by the similarity between its gloss and the context sentence.
    vals = {}
    for synset in synsets:
        vector = get_sentence_bert(synset.definition())
        vals[synset] = cosine_similarity(context_vector, vector)

    # Print the senses from most to least similar.
    sorted_x = sorted(vals.items(), key=operator.itemgetter(1), reverse=True)
    for k, v in sorted_x:
        print("%.3f\t%s\t%s" % (v, k, k.definition()))


Execute the following two cells to check whether your implementation of SentenceBERT-Lesk distinguishes between these two senses of "bank".

In [None]:
sentencebert_lesk("bank", "I deposited my money into my savings account at the bank", "n")

In [None]:
sentencebert_lesk("bank", "I ran along the river bank", "n")

To turn in:

- Go to `File > Download > Download .ipynb` and save your notebook.
- In your browser, print this page to save as PDF.
- Upload both your .ipynb and .pdf files to bCourses as usual.