<a href="https://colab.research.google.com/github/dbamman/nlp22/blob/main/HW6/HW6.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Homework 6

In this homework, you will be working with WordNet synsets and exploring methods to align new words (not in WordNet) with an existing synset.

In [None]:
import nltk
import math
from nltk import word_tokenize
from nltk.corpus import wordnet as wn
import numpy as np
from typing import List, Tuple, Dict

nltk.download('wordnet')
nltk.download('punkt')
!wget https://people.ischool.berkeley.edu/~dbamman/glove.6B.100d.100K.txt
!pip install sentence_transformers
from sentence_transformers import SentenceTransformer

# Preliminaries: WordNet

NLTK provides a great interface to the WordNet ontology. Remember that the core unit within WordNet is the **synset** (a category of near-synonyms). A word (like "blue") can appear in many different synsets, each corresponding to a distinct *sense* of that word.

In [None]:
# get all of the synsets that a specific word belongs to; print their definitions

synsets=wn.synsets('blue')
for synset in synsets:
    print (synset, synset.definition())

Any given synset can likewise contain multiple different words (all near-synonyms of each other).

In [None]:
# get all of the words/phrases in a given synset

for lemma in wn.synset("gloomy.s.02").lemmas():
    print (lemma.name())

Remember also that one of the powerful things about WordNet is that it places synsets within a hierarchical structure; a given synset has both **hypernyms** (other synsets that it is a subclass of) and **hyponyms** (other synsets that are subclasses of it).

In [None]:
# Helper functions from http://www.nltk.org/howto/wordnet.html; pass them to closure() to get *all* of a synset's hyponyms/hypernyms

hypo = lambda s: s.hyponyms()
hyper = lambda s: s.hypernyms()

Find all of the synsets that are hyponyms of the target synset (descendants in the WordNet hierarchy).

In [None]:
list(wn.synset("blue.n.01").closure(hypo))

Find all of the synsets that are hypernyms (ancestors up the tree) of the target synset.

In [None]:
list(wn.synset("blue.n.01").closure(hyper))

Here's how you can access all of the synsets in WordNet through NLTK (though note executing this may take a while, so it's commented out).

In [None]:
#for idx, synset in enumerate(wn.all_synsets()):
#   print(idx, synset)
#   if (idx > 10): break

# Homework

WordNet is a great resource, but one of its downsides is *coverage* -- many of the words in our vocabulary aren't in WordNet, but could conceivably be placed within existing synsets within it. Your task for this homework is to develop two methods for finding the closest synset for a given new word from Urban Dictionary.

For the scope of this homework, we're going to pretend that WordNet has only 12 different synsets within it (though feel free to use the `wn.all_synsets` function above if you want to explore running it on all of WordNet).

In [1]:
target_synsets=['spread.n.01', 'formidable.s.01', 'coziness.n.01', 'mutation.n.02', 'kernel.n.03', 'faineant.s.01', 'fund-raise.v.01', 'orientation.n.06', 'inappropriate.a.01', 'stranger.n.02', 'plausibility.n.01', 'sever.v.01']

In [None]:
for synset in target_synsets:
    wn_synset=wn.synset(synset)
    print(wn_synset)
    print("\tDefinition:", wn_synset.definition())

Here are the words that do not exist in WordNet now but that we want to add. Each element of the list is a (word, definition) tuple.

In [None]:
urban_dictionary_terms: List[Tuple[str, str]] = [
    ("Crowdfunding", "the practice of obtaining needed funding (as for a new business) by soliciting contributions from a large number of people especially from the online community"), 
    ("Hygge", "a cozy quality that makes a person feel content and comfortable"), 
    ("biohacking", "biological experimentation (as by gene editing or the use of drugs or implants) done to improve the qualities or capabilities of living organisms especially by individuals and groups working outside a traditional medical or scientific research environment"), 
    ("TL;DR", "a briefly expressed main point or key message that summarizes a longer discussion or explanation"), 
    ("Hellacious", "Exceptionally powerful or violent; remarkably good; extremely difficult; extraordinarily large"), 
    ("Unfriend", "To remove from one's list of friends (e.g. on a social networking website)"), 
    ("Infodemic", "A wide and rapid spread of misinformation through various media, namely social media"),
    ("Onboarding", "The act or process of orienting and training a new employee"), 
    ("Truthiness", "something that seems true but isn’t backed up by evidence"), 
    ("Amotivational", "Relating to, or characterised by, a lack of motivation"), 
    ("NSFW", "Not Safe For Work. Used to describe Internet content generally inappropriate for the typical workplace, i.e., would not be acceptable in the presence of your boss and colleagues"),
    ("Rando", "a person who is not known or recognizable or whose appearance (as in a conversation or narrative) seems unprompted or unwelcome")
]

Your task here is to develop two different methods for finding the best matching synset.
1. Find the WordNet synset with the highest cosine similarity between the average GloVe embedding of its definition and the average GloVe embedding of the new word's definition.
2. Find the WordNet synset with the highest cosine similarity between the sentence embedding of its definition and the sentence embedding of the new word's definition.

Here is some code for reading in GloVe embeddings:


In [None]:
def read_vectors(filename: str):
    vocab_map={}
    embeddings=[]
    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            cols=line.rstrip().split(" ")
            word=cols[0]
            embedding=cols[1:]

            embeddings.append(embedding)
            vocab_map[word]=idx
    
    return vocab_map, np.array(embeddings, dtype="float")

In [None]:
glove_vocab_map, glove_embeddings=read_vectors("glove.6B.100d.100K.txt")
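To make the return values concrete, here is a minimal sketch of the same parsing logic run on a made-up two-word file in the "word v1 v2 ..." layout that `read_vectors` expects. The names `toy_path`, `toy_vocab_map`, `toy_embeddings`, and the 3-dimensional vectors are assumptions for illustration only; the real file has 100 dimensions per word.

```python
import os
import tempfile

import numpy as np


def read_vectors(filename):
    # Same logic as the function above: one word per line, followed by its vector
    vocab_map = {}
    embeddings = []
    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            cols = line.rstrip().split(" ")
            embeddings.append(cols[1:])
            vocab_map[cols[0]] = idx
    return vocab_map, np.array(embeddings, dtype="float")


# Write a tiny, made-up embedding file (not real GloVe values)
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write("blue 0.1 0.2 0.3\nsad 0.4 0.5 0.6\n")
    toy_path = tmp.name

toy_vocab_map, toy_embeddings = read_vectors(toy_path)
os.remove(toy_path)

# vocab_map gives the row index of a word; that row is the word's vector
blue_vector = toy_embeddings[toy_vocab_map["blue"]]
print(toy_embeddings.shape)
print(blue_vector)
```

The same lookup pattern, `glove_embeddings[glove_vocab_map[word]]`, should apply to the real data once it is loaded.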

Here is some code for loading the sentence transformer package:

In [None]:
sentence_model = SentenceTransformer('sentence-transformers/all-distilroberta-v1')

sentence_vector=sentence_model.encode("this is a sentence")
print(sentence_vector.shape)


Here's an implementation of cosine similarity that you will find useful.

In [None]:
def cosine_similarity(one, two):
    return np.dot(one, two) / (np.linalg.norm(one) * np.linalg.norm(two))
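As a quick sanity check, the sketch below applies the same function to made-up 2-dimensional vectors (not real embeddings). Vectors pointing the same way should score 1, orthogonal vectors 0, and opposite vectors -1, regardless of their lengths.

```python
import numpy as np


def cosine_similarity(one, two):
    return np.dot(one, two) / (np.linalg.norm(one) * np.linalg.norm(two))


# Toy vectors chosen by hand for illustration
a = np.array([1.0, 0.0])
b = np.array([2.0, 0.0])   # same direction as a, different length
c = np.array([0.0, 3.0])   # orthogonal to a
d = np.array([-1.0, 0.0])  # opposite direction to a

sim_ab = cosine_similarity(a, b)
sim_ac = cosine_similarity(a, c)
sim_ad = cosine_similarity(a, d)
print(sim_ab, sim_ac, sim_ad)
```

Note that if either input is an all-zero vector, the denominator is zero and the result is `nan`, so it may be worth deciding how to handle a definition with no usable tokens.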

#### Q1. Implement the first method as `method_one` below.

As mentioned above, you should compute the average GloVe embedding of the UD word definition and use cosine similarity to compare it with the average GloVe embedding of each synset definition. For each UD word, choose the synset whose definition maximizes the cosine similarity with the word's definition. Here are some things you need to do when calculating the average GloVe embedding of a sentence:
- Use `nltk.word_tokenize()` to tokenize the sentence.
- Treat everything as lowercase.
- Skip any tokens which don't appear in the GloVe vocabulary.
- Calculate the average value of the embedding vectors, which will be another vector of the same shape.
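The averaging steps above can be sketched on toy data as follows. This is only an illustration under stated assumptions: `toy_vocab_map` and `toy_embeddings` are hypothetical stand-ins for the GloVe data, `average_embedding` is a hypothetical helper name, and a plain whitespace split stands in for `nltk.word_tokenize()` so that the example runs without any downloads.

```python
import numpy as np

# Hypothetical 3-d stand-ins for glove_vocab_map / glove_embeddings
toy_vocab_map = {"a": 0, "cozy": 1, "quality": 2}
toy_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 2.0, 0.0],
    [0.0, 0.0, 3.0],
])


def average_embedding(tokens, vocab_map, embeddings):
    # Lowercase every token and keep only those found in the vocabulary
    rows = [vocab_map[t.lower()] for t in tokens if t.lower() in vocab_map]
    if not rows:
        # Assumption: fall back to a zero vector when no token is known
        return np.zeros(embeddings.shape[1])
    # Element-wise mean over the selected rows; same shape as one word vector
    return embeddings[rows].mean(axis=0)


# Whitespace split used here in place of nltk.word_tokenize()
tokens = "A cozy hygge quality".split()
avg = average_embedding(tokens, toy_vocab_map, toy_embeddings)
print(avg)
```

Here "hygge" is skipped because it is not in the toy vocabulary, so the result is the mean of the three remaining vectors.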

Your function should return a dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.:

`{
 "adore" : "love.v.01",
 "dripping" : "stylish.a.01"    
 }`

 Please make sure that any helper functions that you use are defined *within* `method_one`! That will help us extract your code more easily with the autograder.

In [None]:
def method_one(urban_dictionary_terms: List[Tuple[str, str]], target_synsets: List[str]):
    """
    Method 1: an algorithm based on GloVe embeddings that maps each urban dictionary term to a synset ID.

    Parameters
    ----------
    urban_dictionary_terms : List[Tuple[str, str]]
        a list of string 2-tuples where the first elements are words, second elements are definitions.
    target_synsets : List[str]
        a list of synset IDs that the words should be classified into.
        You can call `wn.synset("<synset ID>")` to get the synset object.
    
    Returns
    --------
    A dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.
    `{"adore" : "love.v.01", "dripping" : "stylish.a.01"}`
    
    """
    
    # Your code

    pass


In [None]:
method_one_results=method_one(urban_dictionary_terms, target_synsets)
method_one_results

#### Q2. Implement your second method as `method_two` below.

In this function, you should compute the cosine similarity between the sentence embedding of the UD word definition and those of the synsets, then for each UD word, choose the synset with the highest cosine similarity. For consistency, use the sentence transformer model called `sentence-transformers/all-distilroberta-v1`.
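The selection step ("choose the synset with the highest cosine similarity") can be sketched independently of the embedding model. In the example below, the 3-dimensional vectors are made up and merely stand in for the output of `sentence_model.encode(...)`; `toy_synset_vectors`, `toy_query`, and `best_synset` are hypothetical names.

```python
import numpy as np


def cosine_similarity(one, two):
    return np.dot(one, two) / (np.linalg.norm(one) * np.linalg.norm(two))


# Made-up vectors standing in for sentence embeddings of two synset definitions
toy_synset_vectors = {
    "coziness.n.01": np.array([0.9, 0.1, 0.0]),
    "sever.v.01": np.array([0.0, 0.2, 0.9]),
}

# Made-up vector standing in for the embedding of a new word's definition
toy_query = np.array([1.0, 0.0, 0.1])

# Pick the synset ID whose vector is most similar to the query
best_synset = max(
    toy_synset_vectors,
    key=lambda synset_id: cosine_similarity(toy_query, toy_synset_vectors[synset_id]),
)
print(best_synset)
```

With the real model, `encode` also accepts a list of sentences and returns one row per sentence, which may be convenient for embedding all synset definitions in a single call.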

Your function must also return a dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.:

`{
 "adore" : "love.v.01",
 "dripping" : "stylish.a.01"    
 }`

As before, please make sure that any helper functions that you use are defined *within* `method_two`! That will help us extract your code more easily with the autograder.

In [None]:
def method_two(urban_dictionary_terms: List[Tuple[str, str]], target_synsets: List[str]):
    """
    Method 2: an algorithm based on sentence embeddings that maps each urban dictionary term to a synset ID.

    Parameters
    ----------
    urban_dictionary_terms : List[Tuple[str, str]]
        a list of string 2-tuples where the first elements are words, second elements are definitions.
    target_synsets : List[str]
        a list of synset IDs that the words should be classified into.
        You can call `wn.synset("<synset ID>")` to get the synset object.
    
    Returns
    --------
    A dictionary mapping each urban dictionary term to a WordNet synset ID, e.g.
    `{"adore" : "love.v.01", "dripping" : "stylish.a.01"}`
    
    """

    # Your code

    pass


In [None]:
method_two_results=method_two(urban_dictionary_terms, target_synsets)
method_two_results

#### Q3: Define an evaluation metric (accuracy).  

Throughout this semester we've stressed how critical evaluation is for any NLP method. Implement a function `accuracy` that assesses the quality of the dictionaries you return from `method_one` and `method_two`. This function takes two parameters, a prediction dict (the output of your model) and a truth dict (which you will need to create based on your own judgement), and should return a single real number (the accuracy). Make sure that the accuracies you calculate for the two methods match what you expect.
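For reference, one common way to write accuracy is as the fraction of terms whose predicted synset equals the true synset. In the formula below, $T$ is assumed to be the set of terms in the truth dict; the notation is illustrative and not prescribed by the assignment.

```latex
\text{accuracy} = \frac{1}{|T|} \sum_{w \in T} \mathbb{1}\left[\text{prediction}(w) = \text{truth}(w)\right]
```

As a hypothetical worked case, if 9 of the 12 terms were matched to their true synset, the accuracy would be $9 / 12 = 0.75$.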

In [None]:
def accuracy(prediction: Dict[str, str], truth: Dict[str, str]) -> float:
    pass

In [None]:
truth = ...

In [None]:
print(accuracy(method_one_results, truth))

In [None]:
print(accuracy(method_two_results, truth))

That concludes homework 6! To submit, just upload this .ipynb file to Gradescope.

#### Q4 (optional)

Use the two methods you've defined to find the best-matching synset within the **entire** WordNet. Do the results make sense? Is one method consistently better than the other? Why?

Here's a reminder of how to iterate through all synsets:

In [None]:
for idx, synset in enumerate(wn.all_synsets()):
   print(idx, synset)
   if (idx > 10): break