
## Homework 6

In this homework, you will be working with WordNet synsets and exploring methods to align new words (not in WordNet) with an existing synset.

In [1]:
import nltk
import math
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
import numpy as np
from typing import List, Tuple, Dict

nltk.download('wordnet')
nltk.download('punkt')
!wget https://people.ischool.berkeley.edu/~dbamman/glove.6B.100d.100K.txt
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
--2022-04-08 05:54:26--  https://people.ischool.berkeley.edu/~dbamman/glove.6B.100d.100K.txt
Resolving people.ischool.berkeley.edu (people.ischool.berkeley.edu)... 128.32.78.16
Connecting to people.ischool.berkeley.edu (people.ischool.berkeley.edu)|128.32.78.16|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 85951834 (82M) [text/plain]
Saving to: ‘glove.6B.100d.100K.txt’


2022-04-08 05:54:28 (36.1 MB/s) - ‘glove.6B.100d.100K.txt’ saved [85951834/85951834]


# Preliminaries: WordNet

NLTK provides a great interface to the WordNet ontology.  Remember that the core unit within WordNet is the **synset** (a category of near-synonyms).  A word (like "blue") can appear in many different synsets, each corresponding to a distinct *sense* of that word.

In [2]:
# get all of the synsets that a specific word belongs to; print their definitions

synsets=wn.synsets('blue')
for synset in synsets:
    print (synset, synset.definition())

Synset('blue.n.01') blue color or pigment; resembling the color of the clear sky in the daytime
Synset('blue.n.02') blue clothing
Synset('blue.n.03') any organization or party whose uniforms or badges are blue
Synset('blue_sky.n.01') the sky as viewed during daylight
Synset('bluing.n.01') used to whiten laundry or hair or give it a bluish tinge
Synset('amobarbital_sodium.n.01') the sodium salt of amobarbital that is used as a barbiturate; used as a sedative and a hypnotic
Synset('blue.n.07') any of numerous small butterflies of the family Lycaenidae
Synset('blue.v.01') turn blue
Synset('blue.s.01') of the color intermediate between green and violet; having a color similar to that of a clear unclouded sky
Synset('blue.s.02') used to signify the Union forces in the American Civil War (who wore blue uniforms)
Synset('gloomy.s.02') filled with melancholy and despondency
Synset('blasphemous.s.02') characterized by profanity or cursing
Synset('blue.s.05') suggestive of sexual impropriety
Syn

Any given synset will likewise contain multiple different words (all near-synonyms of each other).

In [3]:
# get all of the words/phrases in a given synset

for lemma in wn.synset("gloomy.s.02").lemmas():
    print (lemma.name())

gloomy
grim
blue
depressed
dispirited
down
downcast
downhearted
down_in_the_mouth
low
low-spirited


Remember also that one of the powerful things about WordNet is that it places synsets within a hierarchical structure; a given synset has both **hypernyms** (other synsets that it is a subclass of) and **hyponyms** (other synsets that are subclasses of it).

In [4]:
# Functions from http://www.nltk.org/howto/wordnet.html to get *all* of a synset's hyponym/hypernyms

hypo = lambda s: s.hyponyms()
hyper = lambda s: s.hypernyms()

Find all of the synsets that are hyponyms of the target synset (descendants in the WordNet hierarchy).

In [5]:
list(wn.synset("blue.n.01").closure(hypo))

[Synset('azure.n.01'),
 Synset('dark_blue.n.01'),
 Synset('greenish_blue.n.01'),
 Synset('powder_blue.n.01'),
 Synset('prussian_blue.n.02'),
 Synset('purplish_blue.n.01'),
 Synset('steel_blue.n.01'),
 Synset('ultramarine.n.02')]

Find all of the synsets that are hypernyms (ancestors up the tree) of the target synset.

In [6]:
list(wn.synset("blue.n.01").closure(hyper))

[Synset('chromatic_color.n.01'),
 Synset('color.n.01'),
 Synset('visual_property.n.01'),
 Synset('property.n.02'),
 Synset('attribute.n.02'),
 Synset('abstraction.n.06'),
 Synset('entity.n.01')]

Here's how you can access all of the synsets in WordNet through NLTK (though note executing this may take a while, so it's commented out).

In [7]:
#for idx, synset in enumerate(wn.all_synsets()):
#   print(idx, synset)
#   if (idx > 10): break

# Homework

WordNet is a great resource, but one of its downsides is *coverage* -- many of the words in our vocabulary aren't in WordNet, but could conceivably be placed within existing synsets within it.  Your task for this homework is to develop two methods for finding the closest synset for a given new word from Urban Dictionary.

For the scope of this homework, we're going to pretend that WordNet has only 12 different synsets within it (though feel free to use the `wn.all_synsets` function above if you want to explore running it on all of WordNet).

In [8]:
target_synsets=['spread.n.01', 'formidable.s.01', 'coziness.n.01', 'mutation.n.02', 'kernel.n.03', 'faineant.s.01', 'fund-raise.v.01', 'orientation.n.06', 'inappropriate.a.01', 'stranger.n.02', 'plausibility.n.01', 'sever.v.01']

In [9]:
for synset in target_synsets:
    wn_synset=wn.synset(synset)
    print(wn_synset)
    print("\tDefinition:", wn_synset.definition())

Synset('spread.n.01')
	Definition: process or result of distributing or extending over a wide expanse of space
Synset('formidable.s.01')
	Definition: extremely impressive in strength or excellence
Synset('coziness.n.01')
	Definition: a state of warm snug comfort
Synset('mutation.n.02')
	Definition: (genetics) any event that changes genetic structure; any alteration in the inherited nucleic acid sequence of the genotype of an organism
Synset('kernel.n.03')
	Definition: the choicest or most essential or most vital part of some idea or experience
Synset('faineant.s.01')
	Definition: disinclined to work or exertion
Synset('fund-raise.v.01')
	Definition: raise money for a cause or project
Synset('orientation_course.n.01')
	Definition: a course introducing a new situation or environment
Synset('inappropriate.a.01')
	Definition: not suitable for a particular occasion etc
Synset('stranger.n.02')
	Definition: an individual that one is not acquainted with
Synset('plausibility.n.01')
	Definition:

Here are the words that do not exist in WordNet now but that we want to add.  Each element of the list is a (word, definition) tuple.

In [10]:
urban_dictionary_terms: List[Tuple[str, str]] = [
    ("Crowdfunding", "the practice of obtaining needed funding (as for a new business) by soliciting contributions from a large number of people especially from the online community"), 
    ("Hygge", "a cozy quality that makes a person feel content and comfortable"), 
    ("biohacking", "biological experimentation (as by gene editing or the use of drugs or implants) done to improve the qualities or capabilities of living organisms especially by individuals and groups working outside a traditional medical or scientific research environment"), 
    ("TL;DR", "a briefly expressed main point or key message that summarizes a longer discussion or explanation"), 
    ("Hellacious", "Exceptionally powerful or violent; remarkably good; extremely difficult; extraordinarily large"), 
    ("Unfriend", "To remove from one's list of friends (e.g. on a social networking website)"), 
    ("Infodemic", "A wide and rapid spread of misinformation through various media, namely social media"),
    ("Onboarding", "The act or process of orienting and training a new employee"), 
    ("Truthiness", "something that seems true but isn’t backed up by evidence"), 
    ("Amotivational", "Relating to, or characterised by, a lack of motivation"), 
    ("NSFW", "Not Safe For Work. Used to describe Internet content generally inappropriate for the typical workplace, i.e., would not be acceptable in the presence of your boss and colleagues"),
    ("Rando", "a person who is not known or recognizable or whose appearance (as in a conversation or narrative) seems unprompted or unwelcome")
]

In [11]:
urban_dictionary_terms

[('Crowdfunding',
  'the practice of obtaining needed funding (as for a new business) by soliciting contributions from a large number of people especially from the online community'),
 ('Hygge', 'a cozy quality that makes a person feel content and comfortable'),
 ('biohacking',
  'biological experimentation (as by gene editing or the use of drugs or implants) done to improve the qualities or capabilities of living organisms especially by individuals and groups working outside a traditional medical or scientific research environment'),
 ('TL;DR',
  'a briefly expressed main point or key message that summarizes a longer discussion or explanation'),
 ('Hellacious',
  'Exceptionally powerful or violent; remarkably good; extremely difficult; extraordinarily large'),
 ('Unfriend',
  "To remove from one's list of friends (e.g. on a social networking website)"),
 ('Infodemic',
  'A wide and rapid spread of misinformation through various media, namely social media'),
 ('Onboarding', 'The act or

Your task here is to develop two different methods for finding the best matching synset.
1. Find the WordNet synset with the highest cosine similarity between the average GloVe embeddings of its synset definition and the average GloVe embeddings of the new word definition.
2. Find the WordNet synset with the highest cosine similarity between the sentence embedding of its synset definition and the sentence embedding of the new word definition.
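
Both methods share the same selection step: given one vector for the new word's definition and one vector per candidate synset definition, pick the candidate with the highest cosine similarity. The sketch below illustrates only that step. The synset names and 3-d vectors are made-up stand-ins (not real WordNet IDs or real embeddings), and `cosine` is a local helper defined just for this example.

```python
import numpy as np

def cosine(u, v):
    # Cosine similarity between two 1-D vectors.
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical 3-d vectors standing in for synset definition embeddings.
synset_vecs = {
    "toy_synset_a": np.array([1.0, 0.0, 0.0]),
    "toy_synset_b": np.array([0.0, 1.0, 0.0]),
    "toy_synset_c": np.array([0.0, 0.0, 1.0]),
}
# Hypothetical vector standing in for the new word's definition embedding.
new_word_vec = np.array([0.1, 0.9, 0.2])

# Keep the candidate whose vector is most similar to the new word's vector.
best_synset = max(synset_vecs, key=lambda name: cosine(new_word_vec, synset_vecs[name]))
print(best_synset)  # toy_synset_b
```

The two methods differ only in how the definition vectors are produced (averaged GloVe vectors versus a sentence embedding model).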

Here is some code for reading in GloVe embeddings:


In [12]:
def read_vectors(filename: str):
    vocab_map={}
    embeddings=[]
    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            cols=line.rstrip().split(" ")
            word=cols[0]
            embedding=cols[1:]

            embeddings.append(embedding)
            vocab_map[word]=idx
    
    return vocab_map, np.array(embeddings, dtype="float")
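
To make the returned structures concrete, here is a small sketch that applies the same parsing logic to three made-up lines in the "word v1 v2 ..." layout. The words, values, and the `toy_*` names are hypothetical, and an in-memory `io.StringIO` stands in for the real file.

```python
import io
import numpy as np

# Three made-up 4-d vectors, one word per line (not real GloVe values).
toy_file = io.StringIO(
    "the 0.1 0.2 0.3 0.4\n"
    "cat 0.5 0.6 0.7 0.8\n"
    "sat 0.9 1.0 1.1 1.2\n"
)

toy_vocab = {}
toy_rows = []
for idx, line in enumerate(toy_file):
    cols = line.rstrip().split(" ")
    toy_vocab[cols[0]] = idx   # word -> row index
    toy_rows.append(cols[1:])  # remaining columns are the vector components
toy_matrix = np.array(toy_rows, dtype="float")

print(toy_matrix.shape)              # (3, 4)
print(toy_matrix[toy_vocab["cat"]])  # the row for "cat"
```

The vocabulary map stores each word's row index, so looking up a word's vector is a dictionary lookup followed by a row index into the matrix.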

In [13]:
glove_vocab_map, glove_embeddings=read_vectors("glove.6B.100d.100K.txt")

In [14]:
len(glove_embeddings)

100000

In [15]:
l = []

l.append(glove_embeddings[0])
l.append(glove_embeddings[1])

In [16]:
np.mean(glove_embeddings[0])

-0.02795817

In [17]:
np.mean(l,axis=0)

array([-0.072932  , -0.06717   ,  0.66312   , -0.47161   ,  0.378566  ,
        0.0752915 , -0.1762715 ,  0.344605  , -0.25597   , -0.003365  ,
        0.222865  , -0.44198   ,  0.22539   , -0.022225  ,  0.22637   ,
       -0.364575  ,  0.28089   , -0.25245   , -0.499855  , -0.06233   ,
        0.28604   , -0.091415  ,  0.6327    ,  0.3137955 ,  0.455655  ,
       -0.02304   , -0.0345595 , -0.48452   , -0.160425  , -0.098956  ,
       -0.167836  ,  0.27157   , -0.061065  ,  0.082118  , -0.293415  ,
        0.1221915 ,  0.45111   ,  0.478525  ,  0.22558   , -0.268374  ,
       -0.52794   , -0.498575  ,  0.213495  , -0.352715  ,  0.072525  ,
        0.17381   ,  0.57151   , -0.460005  , -0.25462   , -0.595375  ,
       -0.02871115, -0.04332   ,  0.231835  ,  1.088675  , -0.640775  ,
       -2.8898    , -0.05978   , -0.2085745 ,  1.6411    ,  0.912705  ,
       -0.23529   ,  0.554025  ,  0.083142  ,  0.30931   ,  0.9105    ,
       -0.272865  ,  0.4194525 ,  0.42568   ,  0.328929  ,  0.00

Here is some code for loading the sentence transformer package:

In [18]:
sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

sentence_vector=sentence_model.encode("this is a sentence")
print(sentence_vector.shape)


(768,)


In [19]:
len(sentence_vector)

768

Here's an implementation of cosine similarity that you will find useful.

In [20]:
def cosine_similarity(one, two):
    return np.dot(one, two) / (np.linalg.norm(one) * np.linalg.norm(two))
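
One edge case worth keeping in mind: if either vector has zero length, the denominator above is zero and the result is `nan`. The variant below is a hypothetical guard, not part of the assignment; returning `0.0` for a zero-length vector is an assumption made for illustration.

```python
import numpy as np

def safe_cosine_similarity(one, two):
    # Hypothetical variant: return 0.0 when either vector has zero length,
    # instead of dividing by zero and getting nan.
    denom = np.linalg.norm(one) * np.linalg.norm(two)
    if denom == 0:
        return 0.0
    return float(np.dot(one, two) / denom)

print(safe_cosine_similarity(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(safe_cosine_similarity(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # 0.0
print(safe_cosine_similarity(np.zeros(2), np.array([1.0, 0.0])))           # 0.0
```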

#### Q1. Implement the first method as `method_one` below.

As mentioned above, you should compute the average GloVe embedding of each Urban Dictionary (UD) word definition and use cosine similarity to compare it with the average GloVe embedding of each synset definition. For each UD word, choose the synset whose definition maximizes cosine similarity with the word's definition. Here are some things you need to do when calculating the average GloVe embedding of a sentence:
- Use `nltk.word_tokenize()` to tokenize the sentence.
- Treat everything as lowercase.
- Skip any tokens which don't appear in the GloVe vocabulary.
- Calculate the average value of the embedding vectors, which will be another vector of the same shape.
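
The averaging steps above can be sketched on a toy vocabulary. Everything here is an illustrative assumption: the 2-d vectors and `toy_*` names are made up, whitespace splitting stands in for `nltk.word_tokenize()` to keep the example self-contained, and returning `None` when no token is in the vocabulary is one possible convention rather than a requirement.

```python
import numpy as np

# Hypothetical 2-d vocabulary; the real GloVe vectors in this homework are 100-d.
toy_vocab_map = {"a": 0, "cozy": 1, "quality": 2}
toy_embeddings = np.array([[0.0, 2.0], [4.0, 0.0], [2.0, 4.0]])

def average_embedding(tokens, vocab_map, embeddings):
    # Lowercase every token and keep only those present in the vocabulary.
    rows = [embeddings[vocab_map[t.lower()]] for t in tokens if t.lower() in vocab_map]
    if not rows:
        return None  # Assumption: signal "no known tokens" to the caller.
    # Element-wise mean, giving a vector of the same shape as one embedding.
    return np.mean(rows, axis=0)

tokens = "A cozy quality hygge".split()  # "hygge" is out of vocabulary and is skipped
print(average_embedding(tokens, toy_vocab_map, toy_embeddings))  # [2. 2.]
```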

Your function should return a dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.:

`{
 "adore" : "love.v.01",
 "dripping" : "stylish.a.01"    
 }`

Please make sure that any helper functions that you use are defined *within* `method_one`! That will help us extract your code more easily with the autograder.

In [21]:
nltk.word_tokenize("the practice of obtaining needed funding")

['the', 'practice', 'of', 'obtaining', 'needed', 'funding']

In [22]:
for (i,j) in urban_dictionary_terms:
  print(i,":",j)

Crowdfunding : the practice of obtaining needed funding (as for a new business) by soliciting contributions from a large number of people especially from the online community
Hygge : a cozy quality that makes a person feel content and comfortable
biohacking : biological experimentation (as by gene editing or the use of drugs or implants) done to improve the qualities or capabilities of living organisms especially by individuals and groups working outside a traditional medical or scientific research environment
TL;DR : a briefly expressed main point or key message that summarizes a longer discussion or explanation
Hellacious : Exceptionally powerful or violent; remarkably good; extremely difficult; extraordinarily large
Unfriend : To remove from one's list of friends (e.g. on a social networking website)
Infodemic : A wide and rapid spread of misinformation through various media, namely social media
Onboarding : The act or process of orienting and training a new employee
Truthiness : so

In [23]:
d = ['a','b','c']
d.index('c')

2

In [24]:
syn_dict = []
for synset in target_synsets:
  wn_synset=wn.synset(synset)
  syn_dict.append((synset,wn_synset.definition()))


In [25]:
syn_dict

[('spread.n.01',
  'process or result of distributing or extending over a wide expanse of space'),
 ('formidable.s.01', 'extremely impressive in strength or excellence'),
 ('coziness.n.01', 'a state of warm snug comfort'),
 ('mutation.n.02',
  '(genetics) any event that changes genetic structure; any alteration in the inherited nucleic acid sequence of the genotype of an organism'),
 ('kernel.n.03',
  'the choicest or most essential or most vital part of some idea or experience'),
 ('faineant.s.01', 'disinclined to work or exertion'),
 ('fund-raise.v.01', 'raise money for a cause or project'),
 ('orientation.n.06', 'a course introducing a new situation or environment'),
 ('inappropriate.a.01', 'not suitable for a particular occasion etc'),
 ('stranger.n.02', 'an individual that one is not acquainted with'),
 ('plausibility.n.01', 'apparent validity'),
 ('sever.v.01', 'set or keep apart')]

In [26]:
def method_one(urban_dictionary_terms: List[Tuple[str, str]], target_synsets: List[str]):
    """
    Method 1: an algorithm based on GloVe embeddings that maps each urban dictionary term to a synset ID.

    Parameters
    ----------
    urban_dictionary_terms : List[Tuple[str, str]]
        a list of string 2-tuples where the first elements are words, second elements are definitions.
    target_synsets : List[str]
        a list of synset IDs that the words should be classified into.
        You can call `wn.synset("<synset ID>")` to get the synset object.
    
    Returns
    --------
    A dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.
    `{"adore" : "love.v.01", "dripping" : "stylish.a.01"}`
    
    """
    
    # Your code

    # Pair each synset ID with its WordNet definition.
    syn_list = [(synset, wn.synset(synset).definition()) for synset in target_synsets]

    def avg_vec(list_terms):
        """Return one average GloVe vector per (name, definition) pair."""
        sent_vec = []
        for _, definition in list_terms:
            # Tokenize, lowercase, and keep only tokens in the GloVe vocabulary.
            tokens = [t.lower() for t in nltk.word_tokenize(definition)]
            vecs = [glove_embeddings[glove_vocab_map[t]] for t in tokens if t in glove_vocab_map]
            # Average across tokens to get one vector per definition
            # (assumes at least one token of each definition is in the vocabulary).
            sent_vec.append(np.mean(vecs, axis=0))
        return sent_vec

    # Average vectors for the UD definitions and the synset definitions.
    ud_vec = avg_vec(urban_dictionary_terms)
    syn_vec = avg_vec(syn_list)

    # For each UD term, keep the synset whose definition vector is most similar.
    final_dict = {}
    for (word, _), word_vec in zip(urban_dictionary_terms, ud_vec):
        sims = [cosine_similarity(word_vec, s) for s in syn_vec]
        final_dict[word] = syn_list[int(np.argmax(sims))][0]

    return final_dict


In [27]:
method_one_results=method_one(urban_dictionary_terms, target_synsets)
method_one_results

{'Amotivational': 'kernel.n.03',
 'Crowdfunding': 'spread.n.01',
 'Hellacious': 'formidable.s.01',
 'Hygge': 'stranger.n.02',
 'Infodemic': 'spread.n.01',
 'NSFW': 'stranger.n.02',
 'Onboarding': 'orientation.n.06',
 'Rando': 'stranger.n.02',
 'TL;DR': 'stranger.n.02',
 'Truthiness': 'stranger.n.02',
 'Unfriend': 'stranger.n.02',
 'biohacking': 'kernel.n.03'}

#### Q2. Implement your second method as `method_two` below.

In this function, you should compute the cosine similarity between the sentence embedding of the UD word definition and that of each synset definition, then for each UD word, choose the synset with the highest cosine similarity. For consistency, use the sentence transformer model called `sentence-transformers/all-distilroberta-v1`.

Your function must also return a dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.:

`{
 "adore" : "love.v.01",
 "dripping" : "stylish.a.01"    
 }`

As before, please make sure that any helper functions that you use are defined *within* `method_two`! That will help us extract your code more easily with the autograder.

In [49]:
def method_two(urban_dictionary_terms: List[Tuple[str, str]], target_synsets: List[str]):
    """
    Method 2: an algorithm based on sentence embeddings that maps each urban dictionary term to a synset ID.

    Parameters
    ----------
    urban_dictionary_terms : List[Tuple[str, str]]
        a list of string 2-tuples where the first elements are words, second elements are definitions.
    target_synsets : List[str]
        a list of synset IDs that the words should be classified into.
        You can call `wn.synset("<synset ID>")` to get the synset object.
    
    Returns
    --------
    A dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.
    `{"adore" : "love.v.01", "dripping" : "stylish.a.01"}`
    
    """

    # Your code

    # Pair each synset ID with its WordNet definition.
    syn_list = [(synset, wn.synset(synset).definition()) for synset in target_synsets]

    # Encode every synset definition once, rather than re-encoding it for each UD term.
    syn_vecs = [sentence_model.encode(definition) for _, definition in syn_list]

    # For each UD term, keep the synset whose definition embedding is most similar.
    final_dict = {}
    for word, definition in urban_dictionary_terms:
        word_vec = sentence_model.encode(definition)
        sims = [cosine_similarity(word_vec, s) for s in syn_vecs]
        final_dict[word] = syn_list[int(np.argmax(sims))][0]

    return final_dict


In [50]:
method_two_results=method_two(urban_dictionary_terms, target_synsets)
method_two_results

{'Amotivational': 'faineant.s.01',
 'Crowdfunding': 'fund-raise.v.01',
 'Hellacious': 'formidable.s.01',
 'Hygge': 'coziness.n.01',
 'Infodemic': 'spread.n.01',
 'NSFW': 'faineant.s.01',
 'Onboarding': 'orientation.n.06',
 'Rando': 'stranger.n.02',
 'TL;DR': 'kernel.n.03',
 'Truthiness': 'plausibility.n.01',
 'Unfriend': 'stranger.n.02',
 'biohacking': 'mutation.n.02'}

#### Q3: Define an evaluation metric (accuracy).  

Throughout this semester we've stressed how critical evaluation is for any NLP method.  Implement a function `accuracy` that assesses the quality of the dictionaries you return from `method_one` and `method_two`.  This accuracy function should return a single real number (the accuracy), and its input parameters are a prediction dict (the output of your model) and a truth dict (which you will need to create based on your own judgement). Make sure that the accuracies you calculate for the two methods match what you expect.
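
As a rough illustration of the metric (one possible reading of it, not the reference solution), accuracy can be taken as the fraction of terms in the truth dict whose predicted synset matches the gold synset. The name `accuracy_sketch` and the toy dicts are hypothetical; the toy labels reuse the example IDs from the prompts above and are not judgements about the real terms.

```python
from typing import Dict

def accuracy_sketch(prediction: Dict[str, str], truth: Dict[str, str]) -> float:
    # Fraction of gold-labelled terms whose predicted synset matches the gold synset.
    # Assumption: terms missing from `prediction` count as wrong,
    # and an empty truth dict yields 0.0.
    if not truth:
        return 0.0
    correct = sum(1 for term, gold in truth.items() if prediction.get(term) == gold)
    return correct / len(truth)

# Made-up labels for illustration only.
toy_truth = {"adore": "love.v.01", "dripping": "stylish.a.01"}
toy_prediction = {"adore": "love.v.01", "dripping": "love.v.01"}
print(accuracy_sketch(toy_prediction, toy_truth))  # 0.5
```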

In [30]:
def accuracy(prediction: Dict[str, str], truth: Dict[str, str]) -> float:
    # Your code
    pass

In [31]:
truth = {
    
    



}

In [32]:
print(accuracy(method_one_results, truth))

None


In [33]:
print(accuracy(method_two_results, truth))

None


That concludes homework 6! To submit, just upload this .ipynb file to Gradescope.

#### Q4 (optional)

Use the two methods you've defined to find the best-matching synset within the **entire** WordNet. Do the results make sense? Is one method consistently better than the other? Why?

Here's a reminder of how to iterate through all synsets:

In [34]:
for idx, synset in enumerate(wn.all_synsets()):
   print(idx, synset)
   if (idx > 10): break

0 Synset('able.a.01')
1 Synset('unable.a.01')
2 Synset('abaxial.a.01')
3 Synset('adaxial.a.01')
4 Synset('acroscopic.a.01')
5 Synset('basiscopic.a.01')
6 Synset('abducent.a.01')
7 Synset('adducent.a.01')
8 Synset('nascent.a.01')
9 Synset('emergent.s.02')
10 Synset('dissilient.s.01')
11 Synset('parturient.s.02')
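
When scaling to all of WordNet, comparing every new word against every synset in a Python loop can be slow. One common alternative, sketched below under assumptions, is to stack the definition vectors into matrices, normalize each row to unit length, and compute all cosine similarities with a single matrix product. The random matrices are hypothetical stand-ins for real definition embeddings, and `normalize_rows` is a helper defined only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical stand-ins: 1,000 "synset" vectors and 3 "new word" vectors, all 8-d.
synset_matrix = rng.normal(size=(1000, 8))
query_matrix = rng.normal(size=(3, 8))

def normalize_rows(m):
    # Scale each row to unit length so a dot product equals cosine similarity.
    return m / np.linalg.norm(m, axis=1, keepdims=True)

# similarities[i, j] is the cosine similarity between query i and synset j.
similarities = normalize_rows(query_matrix) @ normalize_rows(synset_matrix).T
best_indices = similarities.argmax(axis=1)

print(similarities.shape)  # (3, 1000)
print(best_indices.shape)  # (3,)
```

Each entry of `best_indices` is the row of the best-matching synset vector, which can be mapped back to a synset ID if the IDs are kept in a list in the same order.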
