*International Workshop on Semantic Evaluation*

| Workshop | Venue | Tasks |
|---|---|---|
| Senseval-1 (1998) | Sussex | |
| Senseval-2 (2001) | Toulouse@ACL | |
| Senseval-3 (2004) | Barcelona | |
| SemEval-2007 | Prague@ACL | |
| SemEval-2010 | Uppsala@ACL | |
| SemEval-2012 | Montreal@NAACL | |
| SemEval-2013 | Atlanta@NAACL | |
| ... | | |
| SemEval-2020 | Barcelona@COLING | https://alt.qcri.org/semeval2020/index.php?id=tasks |
| SemEval-2021 | Bangkok@ACL-IJCNLP | https://semeval.github.io/SemEval2021/tasks |
| SemEval-2022 | Seattle@NAACL | https://semeval.github.io/SemEval2022/tasks |
| SemEval-2023 | | https://semeval.github.io/SemEval2023/ |

* https://semeval.github.io/SemEval2023/tasks.html

# COLX 561 Lab Assignment 1: Word Sense Disambiguation Project (Cheat sheet)

## Assignment objectives

In this assignment you will:
- Do automatic word sense disambiguation (WSD) using the context around an ambiguous word
- Apply the semantic knowledge in WordNet

## Getting started

This assignment requires that you have downloaded the following NLTK corpora/lexicons:

In [7]:
import pandas as pd
import ssl

try:
    _create_unverified_https_context = ssl._create_unverified_context
except AttributeError:
    pass
else:
    ssl._create_default_https_context = _create_unverified_https_context

import nltk
nltk.download('senseval')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

from nltk.corpus import senseval, stopwords
from collections import defaultdict
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag, word_tokenize
from nltk.wsd import lesk
from nltk.corpus import wordnet as wn

from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier

STOPWORDS = stopwords.words("english")
OPEN_CLASS_POS = {'n', 'v', 'j', 'r'}

[nltk_data] Downloading package senseval to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package senseval is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to /Users/jungyeul/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /Users/jungyeul/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


The NLTK version of the Senseval 2 formatted files from *Senseval-3* uses well-formed XML. Each instance of the ambiguous words *hard*, *interest*, *line*, and *serve* is tagged with a sense identifier, and supplied with context.

## Tidy Submission

rubric={mechanics:2}

- You have been assigned a team for this project, which you will find in the teams.txt file on the COLX 561 course repo
- One person in each group must create a private UBC github repo, and give access to all group members as well as the members of the teaching team
- In the readme in the individual lab repo (the one created when the lab is opened) for all members of the group, you should have a link to this private, shared repo. Pushing that link is your only "submission". Don't put anything else in your repo for this lab.
- In the readme of the private shared repo, include instructions that will allow someone to reproduce your results and identify the parts of the code relevant to the major sections discussed below. 
- You do NOT have to use the ipynb you are reading. You may find it useful to have multiple ipynbs (or even standalone .py files which your .ipynb(s) import) for different parts of the assignment; just make sure you indicate how your code should be run (you should assume it will be run). You can also just add cells to this ipynb, whatever works for you.
- Note that any commits on the private shared repo after the deadline will result in a late penalty being applied to the project, so be careful about that.

### Part 1: Setting up the WSD task

rubric={accuracy:2,quality:1}

For this lab, you will be using the Senseval corpus included in NLTK, which has sense-tagged data for a small set of word types. In this lab, we will only look at the ambiguity of the word *line*. Note that this corpus is arranged in a way that is **NOT** typical for NLTK corpora. It is stored in a list of *instances*, where each instance has the sense and the context around it. You can iterate over instances of the word *line* using this code: `for instance in senseval.instances('line.pos')`. It is up to you to figure out how to extract the exact information you need from each instance (you don't need to store information other than the sense and context for each instance).  It might be worthwhile to do some debugging to see what these instances look like.

Your first goal is to reorganize the information into two Python dictionaries: *train* and *test*.
Each dictionary will contain senses as the keys, while the values are lists of POS-tagged sentences (if an *instance* in the Senseval corpus has the given sense, it is included in this list).

For each sense, the first 200 instances will be stored in the *test* dictionary, while all the rest of the instances will be in the *train* dictionary.

When you have solved the problem, please add test cases which confirm the below: 

1. both dictionaries contain information relating to 6 senses of *line* (i.e., each dictionary has 6 keys), including a "product" sense
2. for the "product" sense of *line*
   1. the `test` dictionary has 200 context sentences.
   2. the `train` dictionary contains 2017 context sentences.
   3. the first context sentence of `train` has 49 tokens
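
The split described above can be sketched on toy data. This is only a sketch, assuming the split is done per sense (as the counts table suggests); `split_by_sense`, `n_test`, and the toy instances are hypothetical names introduced for illustration, and the real code would iterate over `senseval.instances('line.pos')` instead.

```python
from collections import defaultdict

def split_by_sense(instances, n_test=200):
    """Put the first n_test contexts of each sense in test, the rest in train."""
    train, test = defaultdict(list), defaultdict(list)
    for sense, context in instances:
        # Fill the test list for this sense first, then send the remainder to train
        target = test if len(test[sense]) < n_test else train
        target[sense].append(context)
    return train, test

# Toy data standing in for (instance.senses[0], instance.context) pairs
toy_instances = [("cord", [("line", "NN")]) for _ in range(5)]
toy_instances += [("phone", [("line", "NN")]) for _ in range(4)]

toy_train, toy_test = split_by_sense(toy_instances, n_test=3)
print({s: (len(toy_test[s]), len(toy_train[s])) for s in toy_test})
# {'cord': (3, 2), 'phone': (3, 1)}
```

With `n_test=200` and the real corpus, the same logic should give 200 test contexts per sense and leave the remainder for training.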

The NLTK version of the Senseval 2 (*formatted*) files (from *Senseval-3*) uses well-formed XML.
Each instance of the ambiguous words ***hard***, ***interest***, ***line***, and ***serve*** is tagged with a sense identifier, and supplied with context.


source: 

* [nltk] https://www.nltk.org/_modules/nltk/corpus/reader/senseval.html
* [senseval] https://www.d.umn.edu/~tpederse/data.html


Gale, W. A., Church, K. W., & Yarowsky, D. (1992). One Sense Per Discourse. *Proceedings of the Workshop on Speech and Natural Language*, 233–237. https://doi.org/10.3115/1075527.1075579.


```
for instance in senseval.instances('line.pos'):
    ...
```

```
SensevalInstance(word='line-n', position=16, context=[('perhaps', 'RB'), ('not', 'RB'), ('surprisingly', 'RB'), (',', ','), ('the', 'DT'), ('locals', 'NNS'), ('often', 'RB'), ('call', 'VBP'), ('the', 'DT'), ('warden', 'NN'), ('.', '.'), ('while', 'IN'), ('noodlers', 'NNS'), ('generally', 'RB'), ('drag', 'VBP'), ('a', 'DT'), ('line', 'NN'), ('with', 'IN'), ('a', 'DT'), ('big', 'JJ'), ('hook', 'NN'), ('on', 'IN'), ('it', 'PRP'), ('through', 'IN'), ('the', 'DT'), ('water', 'NN'), ('trying', 'VBG'), ('to', 'TO'), ('snag', 'VB'), ('a', 'DT'), ('fish', 'NN'), (',', ','), ('mr', 'NNP'), ('.', '.'), ('willaby', 'NNP'), ('wades', 'NNS'), ('along', 'IN'), ('river', 'NN'), ('banks', 'NNS'), ('and', 'CC'), ('lake', 'NN'), ('shores', 'NNS'), ('until', 'IN'), ('he', 'PRP'), ('finds', 'VBZ'), ('holes', 'NNS'), ('where', 'WRB'), ('fat', 'JJ'), ('catfish', 'NN'), ('are', 'VBP'), ('laying', 'VBG'), ('eggs', 'NNS'), ('.', '.'), ('he', 'PRP'), ('dives', 'VBZ'), ('down', 'RB'), ('and', 'CC'), ('pokes', 'VBZ'), ('his', 'PRP$'), ('rod', 'NN'), ('and', 'CC'), ('a', 'DT'), ('few', 'JJ'), ('inches', 'NNS'), ('of', 'IN'), ('line', 'NN'), ('with', 'IN'), ('a', 'DT'), ('baited', 'VBN'), ('hook', 'NN'), ('into', 'IN'), ('the', 'DT'), ('nest', 'NN'), ('until', 'IN'), ('the', 'DT'), ('fish', 'NN'), ('bites', 'NNS'), ('.', '.')], senses=('cord',))
```

```
instance.senses[0] = {cord | division | formation | phone | product | text}

                      test : train
 373 cord           -> 200 : ...
 374 division       -> 200 : ...
 349 formation      -> 200 : ...
 429 phone          -> 200 : ...
2217 product        -> 200 : 2017
 404 text           -> 200 : ...
```

In [8]:
train_dict = defaultdict(list)
test_dict = defaultdict(list)

#Your code here

#Your code here

In [10]:
assert len(test_dict.keys()) == 6
assert len(train_dict.keys()) == 6
assert "product" in test_dict.keys()
assert "product" in train_dict.keys()
assert len(test_dict["product"]) ==  200
assert len(train_dict["product"]) == 2017
assert len(train_dict["product"][0]) == 49
print("Success")

Success


### Part 2: Creating and testing features for WSD

Part 2 contains most of the work in this lab.  You will be extracting features from the Senseval data you stored in Part 1.  Remember that features can be any useful information that might help in the classification.  You will be extracting several different types of features to eventually present to a classifier to do word-sense disambiguation.  You can develop them in any order, except as noted.  Be sure to coordinate with your teammates, so you aren't all doing the same work multiple times.

Note that all of the subparts here involve two steps:

1. You will write a function that takes a sentence context and returns some quantification (numbers).  For example, you might read in the sentence context and return the average count of the words (this is not actually a feature here; the features you will extract are better for WSD).
2. You will test the output of this function using your training data by averaging the numbers across all contexts for pairs of senses, seeing if an expected relationship holds.

#### 2.1: Concreteness feature
rubric={accuracy:3,quality:1,efficiency:1}

One typical distinction between senses of a word is that some senses are more concrete (involving the physical world) whereas others are more abstract.  For example, "house" is very concrete - it is a thing that exists in the world, while "happiness" is abstract - there are many different definitions, and you can't point to something and say "that's happiness". A list of words with human-assigned concreteness ratings can be found [here](https://raw.githubusercontent.com/ArtsEngine/concreteness/master/Concreteness_ratings_Brysbaert_et_al_BRM.txt); the relevant column is *Conc.M*. Note that the ratings are floating-point numbers (0 means no value was assigned). Extract this information into a Python dict (key is word, value is concreteness) and then write a function which calculates an average concreteness score for all words in a context (that is, given a list of context words, your function calculates the average concreteness of all of them). **You should lemmatize (remember the WordNet lemmatizer from COLX 521?) and lowercase the words in the context before you look them up in the dictionary.**  If a word occurs more than once, it should be counted more than once. If a word has no concreteness score (i.e., Conc.M == 0), it should be left out of the calculation (both numerator and denominator).


For example, in the sentence "This is a test", we get:

this = 2.14 <br/>
is = 1.59 <br/>
a = 1.46 <br/>
test = 3.93 <br/>

So the concreteness score should be (2.14 + 1.59 + 1.46 + 3.93) / 4 = 2.28


Then use this function to show that the "cord" sense of *line* appears in more concrete contexts, on average, than the "division" sense. You should use the function you've built, averaging the result across all the contexts for each of those two senses (using the training data from part 1). 

Remember that you've already collected all the contexts for each sense in part 1.  In this part, you'll be calculating the average concreteness score for each sense - calculate the concreteness score for each context of the sense, and then calculate the average of those scores.
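
The two levels of averaging can be sketched as follows. This is a minimal illustration: the ratings are the four values from the "This is a test" example, and `toy_conc`, `average_concreteness`, and `average_over_contexts` are hypothetical names. It only lowercases the words; the lemmatization step the lab requires is left out to keep the sketch free of NLTK downloads.

```python
# Toy ratings copied from the "This is a test" example; real values come from Conc.M
toy_conc = {"this": 2.14, "is": 1.59, "a": 1.46, "test": 3.93}

def average_concreteness(words, ratings):
    """Average the ratings of the words that have a non-zero score."""
    scores = [ratings.get(w.lower(), 0.0) for w in words]
    # Unrated words (score 0 or missing) are left out of numerator and denominator
    scores = [s for s in scores if s > 0]
    return sum(scores) / len(scores) if scores else 0.0

def average_over_contexts(contexts, ratings):
    """Average the per-context scores across all contexts of one sense."""
    per_context = [average_concreteness(c, ratings) for c in contexts]
    return sum(per_context) / len(per_context) if per_context else 0.0

score = average_concreteness(["This", "is", "a", "test"], toy_conc)
print(round(score, 2))  # 2.28
```

Returning 0.0 for a context with no rated words is an arbitrary choice made for this sketch; you may prefer to skip such contexts when averaging over a sense.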

Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. *Behavior Research Methods*, 46, 904–911. https://doi.org/10.3758/s13428-013-0403-5

**Conc.M**:
```
Word		Bigram	Conc.M	Conc.SD	Unknown	Total	Percent_known	SUBTLEX	Dom_Pos
roadsweeper	0	4.85	0.37	1	27	0.96	0	0
traindriver	0	4.54	0.71	3	29	0.90	0	0
tush		0	4.45	1.01	3	25	0.88	66	0
...
```

`# of features = 1`

In [None]:
lemmatizer= WordNetLemmatizer()

In [None]:
df = pd.read_csv("https://raw.githubusercontent.com/ArtsEngine/concreteness/master/Concreteness_ratings_Brysbaert_et_al_BRM.txt", delimiter="\t", index_col=0)
conc_dict = df["Conc.M"].to_dict()

def get_conc_score(context):
    '''calculate the average concreteness score for all words in a given context'''
    valid_context = []
    
    #Your code here
    
    
    #Your code here

In [None]:
test_context = pos_tag(word_tokenize("I have a cat"))
assert get_conc_score(test_context) == ((2.18 + 1.46 + 4.86) / 3)
print("Success")

Success


In [None]:
'''
Calculate the average concreteness of all contexts of the sense "cord" and "division".  Show that
"cord" is higher.
'''

cord_context_conc = 0
#Your code here

#Your code here
print("The concreteness score for 'cord' is: " + str(round(cord_context_conc, 3)))

div_context_conc = 0
#Your code here

#Your code here
print("The concreteness score for 'division' is: " + str(round(div_context_conc, 3)))

The concreteness score for 'cord' is: 2.722
The concreteness score for 'division' is: 2.451


#### 2.2 Gloss overlap features (Lesk)
rubric={accuracy:4,efficiency:1,quality:1}

In this part you're going to apply the Lesk approach to WSD, looking for word overlap between the gloss of the sense and the context. However, you're not going to be able to use the version included in NLTK (`nltk.wsd.lesk`), for two reasons:

1. We will be using a restricted set of senses, not all possible senses for *line* included in WordNet
2. Rather than a single feature indicating which sense was chosen, we are going to calculate an overlap score for each possible sense

To apply Lesk, you will first need to associate each sense in the Senseval dataset with a synset in WordNet. I've attached the most likely synset to each sense in the Senseval dataset.  You are free to look at the definitions of the senses and see how I arrived at those mappings.


Write a function which takes a sentence context, and calculates the number of tokens that overlap between the context and the gloss of each sense in WordNet (HINT: use set intersection - we are only interested in *type* overlap). Your overlap calculation should exclude English stopwords (see COLX 521 Lecture 2). Your function should return a dictionary where the keys are senses and the values are overlap counts.

Then, show that the average overlap of the "product" gloss from WordNet is higher with "product" contexts than "division" contexts.  Again, use the training data from Part 1.  At this point, you'll have a synset_dictionary (with the glosses from WordNet), and a list of contexts for each sense.  You can use these to calculate the average overlap of each sense in your context dictionary.

For example, if your context dictionary has "This line is busy" and "Hold the line" for the sense 'phone', then you can calculate the overlap of "This line is busy" with the synset gloss of "phone", the same thing for the line "Hold the line", and then average them.
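
The type-overlap idea can be sketched without WordNet by hard-coding a few glosses. In this sketch the three glosses are copied from the WordNet definitions printed in this notebook, while `toy_glosses`, `toy_stopwords`, `content_types`, and `toy_count_overlap` are hypothetical names, and the tiny stopword set is an assumption standing in for NLTK's English stopword list.

```python
import string

# Glosses copied from the WordNet definitions quoted in this notebook
toy_glosses = {
    "cord": "something (as a cord or rope) that is long and thin and flexible",
    "formation": "a formation of people or things one beside another",
    "text": "text consisting of a row of words written across a page or computer screen",
}
# Tiny hand-picked stopword list (assumption); the lab uses NLTK's English stopwords
toy_stopwords = {"a", "as", "or", "that", "is", "and", "of", "i", "was", "the"}

def content_types(text):
    """Lowercase, strip punctuation, and drop stopwords, returning a set of word types."""
    tokens = (t.strip(string.punctuation).lower() for t in text.split())
    return {t for t in tokens if t and t not in toy_stopwords}

def toy_count_overlap(sentence):
    """Map each sense to the number of word types shared by its gloss and the sentence."""
    context_types = content_types(sentence)
    return {sense: len(context_types & content_types(gloss))
            for sense, gloss in toy_glosses.items()}

overlaps = toy_count_overlap("I was holding a flexible line")
print(overlaps)  # {'cord': 1, 'formation': 0, 'text': 0}
```

Averaging one sense's value over all contexts of a sense in the training dictionary then gives the numbers you need to compare "product" and "division" contexts.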

Lesk, M. (1986). Automatic sense disambiguation using machine readable dictionaries: How to tell a pine cone from an ice cream cone. *Proceedings of SIGDOC '86*, 24–26. https://doi.org/10.1145/318723.318728


|`wn.synset('line.n.18').definition()`    |  |  `SensevalInstance(word='line-n', position=16, context=[('perhaps', 'RB'), ('not', 'RB'), ('surprisingly', 'RB'), (',', ','), ('the', 'DT'), ('locals', 'NNS'), ... ], senses=('cord',))` |
|---|---|---|
| *something (as a cord or rope) that is long and thin and flexible*   |  $\bigcap$  | *perhaps not surprisingly , the locals often call the warden . while noodlers generally drag a line with ...*  |



In [None]:
print(train_dict.keys())
line_synsets = wn.synsets("line")
synset_lookup = {}
synset_lookup['cord'] = wn.synset('line.n.18')
synset_lookup['division'] = wn.synset('line.n.29')
synset_lookup['formation'] = wn.synset('line.n.01')
synset_lookup['phone'] = wn.synset('telephone_line.n.02')
synset_lookup['product'] = wn.synset('line.n.22')
synset_lookup['text'] = wn.synset('line.n.05')

dict_keys(['cord', 'division', 'formation', 'phone', 'product', 'text'])


In [None]:
for synset in line_synsets:
    print(synset.name())
    print(synset.definition())
print("SIZE: ", len(line_synsets))

line.n.01
a formation of people or things one beside another
line.n.02
a mark that is long relative to its width
line.n.03
a formation of people or things one behind another
line.n.04
a length (straight or curved) without breadth or thickness; the trace of a moving point
line.n.05
text consisting of a row of words written across a page or computer screen
line.n.06
a single frequency (or very narrow band) of radiation in a spectrum
line.n.07
a fortified position (especially one marking the most forward position of troops)
argumentation.n.02
a course of reasoning aimed at demonstrating a truth or falsehood; the methodical process of logical reasoning
cable.n.02
a conductor for transmitting electrical or optical signals or electric power
course.n.02
a connected series of events or actions or developments
line.n.11
a spatial location defined by a real or imaginary unidimensional extent
wrinkle.n.01
a slight depression in the smoothness of a surface
pipeline.n.02
a pipe used to transport li

In [None]:
def count_overlap(context):
    '''Calculate the number of tokens that overlap between the context and the gloss of each sense in WordNet'''
    
    #Your code here
    
    #Your code here

In [None]:
test_sent = "I was holding a flexible line"
test_context = pos_tag(word_tokenize(test_sent))
assert count_overlap(test_context)['cord'] == 1

In [None]:
# 'Product' gloss overlap with product contexts
avg_overlap_dict = {}
avg_overlap_dict['product'] = 0
#Your code here
    
#Your code here

avg_overlap_dict['division'] = 0
#Your code here
    
#Your code here
    
print(f"The average overlap of the 'product' gloss with the product contexts is: {avg_overlap_dict['product']:0.3f}")
print(f"The average overlap of the 'product' gloss with the division contexts is: {avg_overlap_dict['division']:0.3f}")
print(avg_overlap_dict)

The average overlap of the 'product' gloss with the product contexts is: 0.059
The average overlap of the 'product' gloss with the division contexts is: 0.029
{'product': 0.059494298463063956, 'division': 0.028735632183908046}


#### 2.3 : WordNet distance features
rubric={accuracy:5,efficiency:1,quality:1}

This feature involves calculating the WordNet (Wu-Palmer) distances from the synsets of relevant senses of *line* to the synsets of mostly non-ambiguous context words. For this, you will need the Senseval -> WordNet sense mapping from 2.2.  If you can't remember how to get the Wu-Palmer (i.e., wup) value, check the lecture slides.

The biggest challenge in this problem is identifying "mostly" non-ambiguous words. We could exclude any word type that has any polysemy (i.e. associated with more than one synset), but that seems too extreme (almost all words have some rare instances of strange sense uses). Instead, we are going to consider a word mostly non-ambiguous if it appears as one particular sense more than 75% of the time, based on the corpus counts provided in WordNet. You should write a general function, `get_dominant_sense`, which takes a word and a POS (a single letter, same as the input to the WordNet lemmatizer), and returns the dominant (more than 75% of instances) synset if it exists, or `None` if it doesn't. The POS will be useful because, in order to do this properly, you will have to correctly lemmatize the word, so as to match it with the lemmas of each of its synsets and get the right count.

So this function should take a word and pos as input, and then: <br/>
1. Lemmatize the word <br/>
2. Get all the senses from the WordNet synsets for the word <br/>
3. Keep track of the counts for each sense that match the lemma <br/>
4. If the highest count is greater than 0.75 * total count, then return that synset.  Otherwise, return None. <br/>

Once you have this function, you should create another function which will, for a particular instance,

1. Use `get_dominant_sense` to get a list of synsets appearing in the context (one for each mostly non-ambiguous word). You will need to do this again in 2.4, so a separate function might be a good idea!  The function will take a context as input, and return a list of synsets.
2. For each sense of *line* in Senseval (i.e., the senses in `synset_lookup`), calculate the average distance between that sense and all the synsets in the context. You should use the built-in function for calculating Wu-Palmer distance between a synset pair; don't implement your own.
3. Return a dictionary mapping the (Senseval) sense to the average distance to the context synset.  That is, return a dictionary where the keys are the six senses in synset_lookup, and the values are the average distance from the context to that sense.

Then use the output of this function to show that the synsets associated with contexts around the "phone" sense of *line* are on average closer to the "phone" synset than synsets from "division" contexts are.

That is, for each context in your training dictionary with the sense "phone", calculate the average distance between the context and the "phone" sense in synset_lookup.  Then, do the same for each context in your training dictionary with the sense "division".  Show that the "phone" sense is closer for "phone" contexts than "division" contexts ("closer" means the number will be smaller - this is a distance)
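
The per-sense averaging can be sketched independently of WordNet. Everything here is hypothetical: `average_distances`, the placeholder synset names, and the distance values are made up purely for illustration, and `distance_fn` stands in for whatever Wu-Palmer-based measure you compute. The `None` check is there because NLTK's `wup_similarity` can return `None` when two synsets share no common ancestor.

```python
def average_distances(sense_synsets, context_synsets, distance_fn):
    """Map each sense to its mean distance from the context synsets."""
    result = {}
    for sense, sense_synset in sense_synsets.items():
        distances = [distance_fn(sense_synset, c) for c in context_synsets]
        # Skip pairs for which no distance is defined
        distances = [d for d in distances if d is not None]
        result[sense] = sum(distances) / len(distances) if distances else 0.0
    return result

# Hypothetical distances between placeholder synsets, made up for illustration
toy_distance_table = {
    ("phone_synset", "call_synset"): 0.2,
    ("phone_synset", "dial_synset"): 0.4,
    ("division_synset", "call_synset"): 0.7,
    ("division_synset", "dial_synset"): 0.9,
}
toy_senses = {"phone": "phone_synset", "division": "division_synset"}

toy_result = average_distances(
    toy_senses,
    ["call_synset", "dial_synset"],
    lambda a, b: toy_distance_table.get((a, b)),
)
print({sense: round(value, 2) for sense, value in toy_result.items()})
```

Falling back to 0.0 when a context has no usable synsets is an arbitrary choice in this sketch; decide deliberately how you want to handle empty contexts.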

<span style="color:gray"> 
Wu, Z., & Palmer, M. (1994). 
Verb Semantics and Lexical Selection. <i>Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics</i>, 133–138. https://doi.org/10.3115/981732.981751
</span>

$\text{Wu-Palmer} = \displaystyle\frac{2 \cdot \text{depth}(LCS)}{\text{depth}(sn_1) + \text{depth}(sn_2)}$
where $LCS$ = least common subsumer and $sn$ = synset. 

```
>>> car = wordnet.synset('car.n.01')
>>> boat = wordnet.synset('boat.n.01')
>>> car.wup_similarity(boat)
0.6956521739130435

    Object
      |
    Vehicle ***
      |
  ---------
  |       |
 Boat  Automobile
          |
         Car
```

```
>>> hyp = lambda s:s.hypernyms()
>>> from pprint import pprint
>>> pprint(car.tree(hyp))
[Synset('car.n.01'),
 [Synset('motor_vehicle.n.01'),
  [Synset('self-propelled_vehicle.n.01'),
   [Synset('wheeled_vehicle.n.01'),
    [Synset('container.n.01'),
        ...
    [Synset('vehicle.n.01'), ***
     [Synset('conveyance.n.03'),
      [Synset('instrumentality.n.03'),
       [Synset('artifact.n.01'),
        [Synset('whole.n.02'),
         [Synset('object.n.01'),
          [Synset('physical_entity.n.01'),
           [Synset('entity.n.01')]]]]]]]]]]]]
>>> pprint(boat.tree(hyp))
[Synset('boat.n.01'),
 [Synset('vessel.n.02'),
  [Synset('craft.n.02'),
   [Synset('vehicle.n.01'), ***
    [Synset('conveyance.n.03'),
     [Synset('instrumentality.n.03'),
      [Synset('artifact.n.01'),
       [Synset('whole.n.02'),
        [Synset('object.n.01'),
         [Synset('physical_entity.n.01'), 
          [Synset('entity.n.01')]]]]]]]]]]]
>>> 2*8 / (12+11)
0.6956521739130435
```
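
The arithmetic in the last line of the example above can be reproduced directly from the formula. This is a small sketch: `wu_palmer_similarity` is a hypothetical helper, and the depths (8 for the least common subsumer `vehicle.n.01`, 12 and 11 for the two synsets) are taken from the example. Note that NLTK's `wup_similarity` returns a similarity, where higher means closer. One simple way to turn it into a distance, assumed here and not prescribed by the lab, is `1 - similarity`.

```python
def wu_palmer_similarity(depth_lcs, depth_a, depth_b):
    """Wu-Palmer similarity from the depths of the LCS and of the two synsets."""
    return 2 * depth_lcs / (depth_a + depth_b)

# Depths taken from the car/boat example above
similarity = wu_palmer_similarity(depth_lcs=8, depth_a=12, depth_b=11)
# Assumed conversion to a distance (smaller means closer)
distance = 1 - similarity

print(similarity)
print(round(distance, 3))
```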

Before calculating Wu-Palmer, you need the dominant sense of each context word. For example, `get_dominant_sense("word","n")` might collect the following (count, synset) pairs internally:


```
[
 [0, Synset('password.n.01')],   <-- Lemma('password.n.01.word').count() = 0
 [0, Synset('word.n.07')], 
 [1, Synset('parole.n.01')], 
 [2, Synset('give_voice.v.01')], 
 [3, Synset('discussion.n.02')], 
 [3, Synset('word.n.04')], 
 [5, Synset('news.n.01')], 
 [18, Synset('word.n.02')], 
 [117, Synset('word.n.01')]      <-- dominant sense
]
117/149 = 0.785234899328859 > 0.75
```
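
The threshold check on these counts can be sketched as below. `pick_dominant` is a hypothetical helper that works on (count, synset name) pairs copied from the listing above; in the real function the counts would come from the lemma counts in WordNet.

```python
def pick_dominant(counted_synsets, ratio=0.75):
    """Return the synset name whose count exceeds ratio * total, else None."""
    total = sum(count for count, _ in counted_synsets)
    if total == 0:
        # No corpus counts at all, so no sense can be called dominant
        return None
    best_count, best_name = max(counted_synsets, key=lambda pair: pair[0])
    return best_name if best_count > ratio * total else None

# (count, synset name) pairs copied from the listing above
word_counts = [
    (0, "password.n.01"), (0, "word.n.07"), (1, "parole.n.01"),
    (2, "give_voice.v.01"), (3, "discussion.n.02"), (3, "word.n.04"),
    (5, "news.n.01"), (18, "word.n.02"), (117, "word.n.01"),
]

dominant = pick_dominant(word_counts)
print(dominant)  # word.n.01
```

A word whose counts are spread evenly, for example two senses with 5 occurrences each, yields `None`.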

In [None]:
dominant_sense_ratio = 0.75

lemmatizer = WordNetLemmatizer()

def get_dominant_sense(word,pos="n"):
    '''return the dominant (75% of instances) synset of the word if it exists, or None if it doesn't
    word -- an English word
    pos -- a single letter that represents part of speech of the input word, noun by default'''

    #Your code here

    #Your code here

In [None]:
assert get_dominant_sense('word', 'n').name() == 'word.n.01'

In [None]:
def get_dominant_sense_context(context):
    '''return a list of dominant synsets appearing in the context,
    one for each mostly non-ambiguous word'''
    valid_context = []
    dominant_synsets = []
    
    #Your code here

    #Your code here

In [None]:
def get_average_distance(context):
    '''calculate average distance between senses of the word "line" and a context'''
 
    #Your code here

    #Your code here

In [None]:
avg_phone_distance = 0
count = 0

#Your code here

#Your code here
print("The distance between the synsets associated with contexts around 'phone' sense of line and 'phone' synset: ", round(avg_phone_distance / count, 3))

The distance between the synsets associated with contexts around 'phone' sense of line and 'phone' synset:  0.758


In [None]:
avg_division_distance = 0
count = 0

#Your code here

#Your code here
print("The distance between the synsets associated with contexts around 'division' sense of line and 'phone' synset: ", round(avg_division_distance / count, 3))

The distance between the synsets associated with contexts around 'division' sense of line and 'phone' synset:  0.798


#### 2.4 WordNet Hypernyms
rubric={accuracy:5, efficiency:1, quality:1}

Now, we will consider the count of WordNet synsets in the context directly as features. However, limiting ourselves to the synsets corresponding directly to words might result in sparsity, and provide little more information than raw words would. Instead, we are going to also include all the hypernyms of words appearing in the context as potential features for doing WSD.

First, write a recursive function `get_all_hypernyms` which collects the names (e.g. `synset.name()`) of a provided WordNet synset and all of its hypernyms.  The base case can just be when an item no longer has any hypernyms.
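
The recursion can be sketched on a plain dictionary instead of WordNet. Here `toy_hypernyms` is a hypothetical stand-in for `synset.hypernyms()`, filled with the tail of the `cat.n.01` chain shown in this notebook, and `collect_hypernym_names` is an illustrative name.

```python
# Toy hypernym links copied from the tail of the cat.n.01 chain in this notebook
toy_hypernyms = {
    "living_thing.n.01": ["whole.n.02"],
    "whole.n.02": ["object.n.01"],
    "object.n.01": ["physical_entity.n.01"],
    "physical_entity.n.01": ["entity.n.01"],
    "entity.n.01": [],
}

def collect_hypernym_names(name, graph, names=None):
    """Return the name itself followed by the names of all its hypernyms."""
    if names is None:
        # Create a fresh list per top-level call instead of a shared mutable default
        names = []
    names.append(name)
    # Base case: a node without hypernyms simply skips the loop
    for parent in graph.get(name, []):
        collect_hypernym_names(parent, graph, names)
    return names

chain = collect_hypernym_names("living_thing.n.01", toy_hypernyms)
print(chain)
```

If a synset has several hypernyms, a shared ancestor can appear more than once in the result; whether to keep such repeats is a design decision this sketch does not settle.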

Then, applying this function to the synsets found in the context (step 1 of the distance function in 2.3), write a function that counts all the hypernyms of all the (again mostly non-ambiguous) synsets in the context, normalizing by the total count to get a proportion for each synset.

The function will take a context as input.  It will calculate the dominant synsets from this context.  Then, for each of these synsets, it will get all of their hypernyms, and keep track of their counts.
The returned dictionary will have the hypernym names as keys and, as values, each name's proportion of all the hypernyms counted this way.  For example, if you count 20 hypernyms in total and 5 of them are "animal", then "animal" will have a value of 0.25.
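
The counting and normalizing step can be sketched with `collections.Counter`. The two hypernym lists below are shortened, hypothetical chains used only to keep the example small, and `hypernym_proportions` is an illustrative name.

```python
from collections import Counter

def hypernym_proportions(hypernym_lists):
    """Count every hypernym name and divide by the total number counted."""
    counts = Counter(name for names in hypernym_lists for name in names)
    total = sum(counts.values())
    return {name: count / total for name, count in counts.items()} if total else {}

# Shortened, hypothetical hypernym chains for two context synsets
toy_lists = [
    ["bucket.n.01", "container.n.01", "entity.n.01"],
    ["beach.n.01", "entity.n.01"],
]

proportions = hypernym_proportions(toy_lists)
print(proportions["entity.n.01"])  # 0.4
```

Because every count is divided by the same total, the values sum to 1 up to floating-point rounding.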

Then show that the average proportion of the 'object.n.01' synset is higher in contexts involving the "cord" sense of *line* than the "division" sense. (This should be true for the same reason as in 2.1)

```
>>> from nltk.corpus import wordnet as wn
>>> wn.synsets('cat')
[Synset('cat.n.01'), Synset('guy.n.01'), Synset('cat.n.03'), 
Synset('kat.n.01'), Synset('cat-o'-nine-tails.n.01'), Synset('caterpillar.n.02'), 
Synset('big_cat.n.01'), Synset('computerized_tomography.n.01'), Synset('cat.v.01'), 
Synset('vomit.v.01')]
>>> wn.synset('cat')    <-- not OK
...
ValueError: not enough values to unpack (expected 3, got 1)
>>> wn.synset('cat.n.01')
Synset('cat.n.01')
>>> wn.synsets('cat')[0]
Synset('cat.n.01')

```

In [None]:
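def get_all_hypernyms(synset, names=None):
    '''return a list of the names of a synset and all its hypernyms'''
    # a mutable default (names=[]) would be shared across calls, so create the list here
    if names is None:
        names = []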

    #Your code here

    #Your code here

`get_all_hypernyms(wn.synsets("cat")[0])`
where `wn.synsets("cat")[0]` (or `wn.synset("cat.n.01")`) is `get_dominant_sense("cat","n")`:
```
['cat.n.01', 'feline.n.01', 'carnivore.n.01', 'placental.n.01', 'mammal.n.01', 'vertebrate.n.01', 'chordate.n.01', 'animal.n.01', 'organism.n.01', 'living_thing.n.01', 'whole.n.02', 'object.n.01', 'physical_entity.n.01', 'entity.n.01']
```

In [None]:
cat_synset = wn.synset("cat.n.01")
assert get_all_hypernyms(cat_synset, []) == [
    "cat.n.01",
    "feline.n.01",
    "carnivore.n.01",
    "placental.n.01",
    "mammal.n.01",
    "vertebrate.n.01",
    "chordate.n.01",
    "animal.n.01",
    "organism.n.01",
    "living_thing.n.01",
    "whole.n.02",
    "object.n.01",
    "physical_entity.n.01",
    "entity.n.01",
]

In [None]:
def get_all_of_all_hypernyms(context):
    '''get all the hypernyms of all the dominant synsets in a given context,
       return normalized counts'''
    
    #Your code here

    #Your code here

In [None]:
get_all_of_all_hypernyms(train_dict['cord'][16])

{'heard.s.01': 0.04,
 'struggling.s.01': 0.04,
 'rebuff.n.01': 0.04,
 'discourtesy.n.03': 0.04,
 'behavior.n.01': 0.04,
 'activity.n.01': 0.04,
 'act.n.02': 0.04,
 'event.n.01': 0.04,
 'psychological_feature.n.01': 0.04,
 'abstraction.n.06': 0.04,
 'entity.n.01': 0.12,
 'beach.n.01': 0.04,
 'geological_formation.n.01': 0.04,
 'object.n.01': 0.08,
 'physical_entity.n.01': 0.08,
 'bucket.n.01': 0.04,
 'vessel.n.03': 0.04,
 'container.n.01': 0.04,
 'instrumentality.n.03': 0.04,
 'artifact.n.01': 0.04,
 'whole.n.02': 0.04}

In [None]:
sum(get_all_of_all_hypernyms(train_dict['cord'][16]).values())

1.0000000000000002

- On average, is the proportion of the 'object.n.01' synset higher in contexts around the "cord" sense of *line* than in "division" contexts?

In [None]:
# average proportion of the "object.n.01" synset in "cord" contexts

total_prop = 0
count = 0
for context in train_dict['cord']:
    hypernyms = get_all_of_all_hypernyms(context)
    if "object.n.01" in hypernyms.keys():
        total_prop += hypernyms["object.n.01"]
        count += 1

print("avg proportion of the 'object.n.01' synset in 'cord' contexts: ", round(total_prop / count, 3))

avg proportion of the 'object.n.01' synset in 'cord' contexts:  0.062


In [None]:
# average proportion of the "object.n.01" synset in "division" contexts

total_prop = 0
count = 0
for context in train_dict['division']:
    hypernyms = get_all_of_all_hypernyms(context)
    if "object.n.01" in hypernyms.keys():
        total_prop += hypernyms["object.n.01"]
        count += 1

print("avg proportion of the 'object.n.01' synset in 'division' contexts: ", round(total_prop / count, 3))

avg proportion of the 'object.n.01' synset in 'division' contexts:  0.042


### Part 3: Building a classifier
rubric={accuracy:2,quality:1,reasoning:1}

Now that we have a collection of features which show some promise for the task, you should build a classifier for WSD of *line* which uses all the features above. As before, you should use DictVectorizers with descriptive feature names, combining all the individual outputs of the functions into a single feature dictionary for each context sentence. Note that there is 1 feature from 2.1, 6 features each from 2.2 and 2.3 (one for each sense of "line"), and 2.4 involves thousands of features. The classifications here are the Senseval senses. You can again just use a Decision Tree (with more features than in Lab 4 of COLX 521, you should probably try a higher max_depth, e.g. 5), though you're welcome to use something else. In addition to outputting the performance of the classifier on the test set using all features, please do one interesting experiment that checks the test performance using a feature set other than all the features, and briefly discuss the results. (The full ablation/feature selection is NOT necessary here).
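
One way to combine the four feature groups into a single dictionary with descriptive names can be sketched as follows. The function name `merge_features`, the key prefixes, and all input values are hypothetical; in the lab the inputs would be the outputs of your functions from 2.1 to 2.4.

```python
def merge_features(conc_score, overlaps, distances, hypernym_props):
    """Merge the outputs of the four feature functions under prefixed names."""
    features = {"concreteness": conc_score}
    features.update({f"overlap_{sense}": v for sense, v in overlaps.items()})
    features.update({f"distance_{sense}": v for sense, v in distances.items()})
    features.update({f"hypernym_{name}": v for name, v in hypernym_props.items()})
    return features

# Made-up values for illustration only
toy_features = merge_features(
    conc_score=2.5,
    overlaps={"cord": 1, "phone": 0},
    distances={"cord": 0.7, "phone": 0.8},
    hypernym_props={"object.n.01": 0.08},
)
print(sorted(toy_features))
```

The prefixes keep a gloss-overlap feature and a distance feature for the same sense from overwriting each other, and they make it easy to select or drop a whole group later.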

In [None]:
def get_feature_dict(context):
    '''Extract a feature dictionary for an input text'''
    feature_dict = {}

    #Your code here

    #Your code here

    return feature_dict

In [None]:
train_feat_dicts = []
train_classification = []
test_feat_dicts = []
test_classification = []

for key in test_dict.keys():
    for context in range(len(test_dict[key])):
        test_feat_dicts.append(get_feature_dict(test_dict[key][context]))
        test_classification.append(key)

try:
    for key in train_dict.keys():
        for context in range(len(train_dict[key])):
            train_feat_dicts.append(get_feature_dict(train_dict[key][context]))
            train_classification.append(key)
except IndexError:
    print(IndexError)
    print(train_dict[key][context])

In [None]:
def vectorize(train_dict, test_dict):
    '''vectorize given lists of feature dictionaries, return X_train and X_test'''
    vectorizer = DictVectorizer(sparse=False, dtype=float)
    X_train = vectorizer.fit_transform(train_dict)
    X_test = vectorizer.transform(test_dict)
    
    return X_train, X_test

In [None]:
X_train, X_test = vectorize(train_feat_dicts, test_feat_dicts)
y_train, y_test = train_classification, test_classification

In [1]:
# test_feat_dicts

In [None]:
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [None]:
print("Final score: ", tree.score(X_test, y_test))

Final score:  0.2225


#### Experiment: Taking out `average_distance` features (WordNet)

In [None]:
def remove_average_distance_feature(feat_dicts):
    '''remove average distance feature from data set'''
    new_feat_dicts = []

    #Your code here

    #Your code here

    return new_feat_dicts

In [None]:
new_train_feat_dicts, new_test_feat_dicts = remove_average_distance_feature(train_feat_dicts), remove_average_distance_feature(test_feat_dicts)

In [2]:
# new_test_feat_dicts

In [None]:
X_new_train, X_new_test = vectorize(new_train_feat_dicts, new_test_feat_dicts)

In [None]:
tree = DecisionTreeClassifier(max_depth=5)
tree.fit(X_new_train, y_train)

DecisionTreeClassifier(max_depth=5)

In [None]:
print("Score with new training set: ", tree.score(X_new_test, y_test))

Score with new training set:  0.2275


We tried to take out the average distance feature from the data sets because it was the feature that showed the least significant distinction between different senses of contexts. As a result, we got a slightly higher score compared to the final score with all features. Given that feature extraction takes a significant amount of time, it would be better to take out the feature from the data sets and try to implement other features.
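
Dropping a feature group for an experiment like this can be sketched as a filter on the feature names. This assumes, hypothetically, that the distance features share a `distance_` prefix; `drop_features` and the toy dictionaries are illustrative names, not the code used for the numbers above.

```python
def drop_features(feat_dicts, prefix):
    """Return copies of the feature dicts without the keys starting with prefix."""
    return [
        {name: value for name, value in feats.items() if not name.startswith(prefix)}
        for feats in feat_dicts
    ]

# Made-up feature dictionaries for illustration only
toy_feat_dicts = [
    {"concreteness": 2.5, "distance_cord": 0.7, "distance_phone": 0.8},
    {"concreteness": 2.1, "distance_cord": 0.6, "overlap_cord": 1},
]

reduced = drop_features(toy_feat_dicts, "distance_")
print(reduced)
# [{'concreteness': 2.5}, {'concreteness': 2.1, 'overlap_cord': 1}]
```

The original dictionaries are left untouched, so the full and reduced feature sets can be vectorized and compared side by side.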

### Part 4: Teamwork report
rubric={raw:1, reasoning:1}

Briefly discuss how each person contributed to the project. Though it is not necessary that every group member has an equal contribution in terms of code, every group member should have a significant contribution, a major part of the lab for which they were a primary contributor. If any team member fails to contribute significantly, the rest of the group, assuming they have made efforts to encourage the team member to contribute, should discuss what happened in this report, otherwise all team members might lose raw points for this rubric. Team members who failed to contribute may lose some or all of their grade for the assignment.

### Part 5: Extra Feature(s) (Optional)
rubric={raw:2}

Develop another feature or feature set to try to improve your WSD performance. This could be a variation on or combination of the features from above, or perhaps something totally new. However, in order to get any points, **it must rely in some key way on information from WordNet**. You will be graded based on novelty, programming difficulty, and how much improvement you see on the test set (you can still get some points for an interesting idea that fails). You can try multiple distinct features or feature types, but you'll only get points for the best one.