# IHLT Lab 5

Lab developed by:
- Niklas Long Schiefelbein
- Oriol Miró López-Feliu


**Exercise description:**
Given the following (lemma, category) pairs:
```
('the','DT'), ('man','NN'), ('swim','VB'), ('with', 'PR'), ('a', 'DT'),
('girl','NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')
```
1. For each pair, when possible, print their most frequent WordNet synset

2. For each pair of words, when possible, print their corresponding least common subsumer (LCS) and their similarity value, using the following functions:
  - Path Similarity
  - Leacock-Chodorow Similarity
  - Wu-Palmer Similarity
  - Lin Similarity

  Normalize similarity values when necessary. What similarity seems better?

## Imports

In [9]:
# basic
import nltk

# normalising
import math

# wordnet
nltk.download('wordnet')
from nltk.corpus import wordnet as wn

# for lin similarity (we need corpus)
nltk.download('wordnet_ic')
from nltk.corpus import wordnet_ic
brown_ic = wordnet_ic.ic('ic-brown.dat')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package wordnet_ic to /root/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


## Data Loading

In [10]:
# in this case, fairly simple...

# (lemma, category) pairs
pairs = [('the', 'DT'), ('man', 'NN'), ('swim', 'VB'), ('with', 'PR'), ('a', 'DT'),
         ('girl', 'NN'), ('and', 'CC'), ('a', 'DT'), ('boy', 'NN'), ('whilst', 'PR'),
         ('the', 'DT'), ('woman', 'NN'), ('walk', 'VB')]

## Exercises

### 1. For each pair, when possible, print their most frequent WordNet synset

Note: I usually put functions in a separate "useful functions" section, but since these fully solve the exercise, they fit better here.

In [11]:
# we only consider nouns (NN) and verbs (VB), since WordNet does not support
#   determiners (DT), prepositions (PR) and conjunctions (CC) (see https://wordnet.princeton.edu/frequently-asked-questions)

def get_most_frequent_synset(lemma, category):
    if category == 'NN':
        synsets = wn.synsets(lemma, pos=wn.NOUN)
    elif category == 'VB':
        synsets = wn.synsets(lemma, pos=wn.VERB)
    else:
        return None  # DT, PR, CC

    return synsets[0] if synsets else None # most common == first!

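The tag-to-POS branching above could also be written as a lookup table, which would make it easier to extend to the other two parts of speech that WordNet covers (adjectives and adverbs). This is only a sketch: the `JJ` and `RB` tags are assumptions (they do not appear in our data), and `TAG_TO_WN_POS` and `to_wordnet_pos` are hypothetical names. The single-letter codes are the values of `wn.NOUN`, `wn.VERB`, `wn.ADJ` and `wn.ADV`.

```python
# Hypothetical lookup table from our category tags to WordNet POS codes.
# 'n', 'v', 'a', 'r' are the values of wn.NOUN, wn.VERB, wn.ADJ, wn.ADV.
TAG_TO_WN_POS = {
    'NN': 'n',
    'VB': 'v',
    'JJ': 'a',  # assumption: tag not present in our data
    'RB': 'r',  # assumption: tag not present in our data
}

def to_wordnet_pos(category):
    """Return the WordNet POS code for a tag, or None if WordNet has no such POS."""
    return TAG_TO_WN_POS.get(category)

for lemma, category in [('man', 'NN'), ('swim', 'VB'), ('the', 'DT')]:
    print(lemma, category, to_wordnet_pos(category))
```

The returned code could then be passed as the `pos` argument of `wn.synsets`, with `None` meaning the pair is skipped.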

In [12]:
# loop calc
synsets = []
for lemma, category in pairs:
    synset = get_most_frequent_synset(lemma, category)
    if synset:
        print(f"Most frequent synset for {lemma} ({category}): {synset.name()}")
        synsets.append(synset) # we save them for later calculations (ex 2)
    else:
        print(f"No synset found for {lemma} ({category})") # :(

No synset found for the (DT)
Most frequent synset for man (NN): man.n.01
Most frequent synset for swim (VB): swim.v.01
No synset found for with (PR)
No synset found for a (DT)
Most frequent synset for girl (NN): girl.n.01
No synset found for and (CC)
No synset found for a (DT)
Most frequent synset for boy (NN): male_child.n.01
No synset found for whilst (PR)
No synset found for the (DT)
Most frequent synset for woman (NN): woman.n.01
Most frequent synset for walk (VB): walk.v.01


### Exercise 2

For each pair of words, when possible, print their corresponding least common subsumer (LCS) and their similarity value, using the following functions:
  - Path Similarity
  - Leacock-Chodorow Similarity
  - Wu-Palmer Similarity
  - Lin Similarity

  Normalize similarity values when necessary. What similarity seems better?
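For reference, the textbook definitions of the four measures are sketched below. The notation is ours: $d(s_1, s_2)$ is the length in edges of the shortest path between the two synsets, $D$ is the maximum depth of the taxonomy for the given POS, $\mathrm{depth}(\cdot)$ is the depth of a synset in the hypernym hierarchy, and $IC(\cdot)$ is the information content estimated from a corpus. NLTK's implementations may differ in details (for example, in how depth is counted), so these should be read as a guide rather than as the exact code path.

$$
\begin{aligned}
\mathrm{sim}_{\mathrm{path}}(s_1, s_2) &= \frac{1}{1 + d(s_1, s_2)} \\
\mathrm{sim}_{\mathrm{LCH}}(s_1, s_2) &= -\log \frac{d(s_1, s_2) + 1}{2D} \\
\mathrm{sim}_{\mathrm{WUP}}(s_1, s_2) &= \frac{2 \, \mathrm{depth}(\mathrm{LCS}(s_1, s_2))}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)} \\
\mathrm{sim}_{\mathrm{Lin}}(s_1, s_2) &= \frac{2 \, IC(\mathrm{LCS}(s_1, s_2))}{IC(s_1) + IC(s_2)}
\end{aligned}
$$

Of the four, only Leacock-Chodorow is not bounded above by 1, which is why it is the one we normalize.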

In [13]:
# for normalization in Leacock-Chodorow Similarity

max_depth_nouns = max(len(path) for synset in wn.all_synsets('n') for path in synset.hypernym_paths())
max_depth_verbs = max(len(path) for synset in wn.all_synsets('v') for path in synset.hypernym_paths())

max_lch_noun  = -math.log(1 / (2 * max_depth_nouns))
max_lch_verb  = -math.log(1 / (2 * max_depth_verbs))

# debug... (left as it is interesting to see!)
print(f"Max depth nouns: {max_depth_nouns}")
print(f"Max depth verbs: {max_depth_verbs}")
print(f"Max LCH noun: {max_lch_noun}")
print(f"Max LCH verb: {max_lch_verb}")

Max depth nouns: 20
Max depth verbs: 13
Max LCH noun: 3.6888794541139363
Max LCH verb: 3.258096538021482

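To make the normalization concrete, the sketch below evaluates the Leacock-Chodorow formula, $-\log((d+1)/(2D))$, by hand. The function name `lch` is ours, and the taxonomy depth `D = 19` used for the raw score is an assumption about the value NLTK uses internally for nouns, chosen because it reproduces the noun scores printed further down in this notebook; the divisor is our own maximum, computed from a depth of 20 as in the cell above.

```python
import math

def lch(distance, taxonomy_depth):
    """Textbook Leacock-Chodorow score for a shortest path of `distance` edges."""
    return -math.log((distance + 1) / (2 * taxonomy_depth))

ASSUMED_NLTK_NOUN_DEPTH = 19            # assumption, not read from NLTK here
max_lch_noun = -math.log(1 / (2 * 20))  # same value as computed in the cell above

# Path similarity is 1 / (1 + d), so 0.333 means d = 2 and 0.250 means d = 3.
for label, d in [("man/woman", 2), ("man/girl", 3), ("identical synsets", 0)]:
    raw = lch(d, ASSUMED_NLTK_NOUN_DEPTH)
    print(f"{label}: raw={raw:.3f} normalized={raw / max_lch_noun:.3f}")
```

If this assumption holds, two identical noun synsets would score slightly below 1 after normalization, so dividing by `synset.lch_similarity(synset)` might be a safer way to obtain the maximum for a given POS.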

In [14]:
# we compute the similarity between every pair of words, even across POS tags, for completeness, because Path Similarity still yields a value in that case (albeit a poor one)

def compute_similarities(synset1, synset2):
    print(f"\n------------------------\n\nComparing {synset1.name()} and {synset2.name()}:")

    # basic path similarity
    path_sim = synset1.path_similarity(synset2)
    print(f"Path Similarity: {path_sim:.3f}")

    # the rest of similarities require same POS

    if synset1.pos() != synset2.pos():
        print("Leacock-Chodorow Similarity not computable")
        print("Wu-Palmer Similarity not computable")
        print("Lin Similarity not computable")
        print("No LCS found")
        return

    # => implicit else

    if synset1.pos() == 'v':
        max_lch = max_lch_verb
    else:
        max_lch = max_lch_noun

    # lch similarity
    lch_sim = synset1.lch_similarity(synset2)
    norm_lch_sim = lch_sim / max_lch

    # wp similarity
    wup_sim = synset1.wup_similarity(synset2)

    # lin similarity
    lin_sim = synset1.lin_similarity(synset2, brown_ic)

    # least common subsumer for both synsets
    lcs = synset1.lowest_common_hypernyms(synset2)

    # we print
    # print(f"Leacock-Chodorow Similarity: {lch_sim}") # uncomment for debug
    print(f"Normalized Leacock-Chodorow Similarity: {norm_lch_sim:.3f}")
    print(f"Wu-Palmer Similarity: {wup_sim:.3f}")
    print(f"Lin Similarity: {lin_sim:.3f}")
    print(f"Least Common Subsumer (LCS): {lcs[0].name()}")



In [16]:
# all combinations of pairs (no repeats)
for i in range(len(synsets)):
    for j in range(i+1, len(synsets)):
        # to compare only synsets with the same POS, add here: if synsets[i].pos() == synsets[j].pos():
        compute_similarities(synsets[i], synsets[j])


------------------------

Comparing man.n.01 and swim.v.01:
Path Similarity: 0.100
Leacock-Chodorow Similarity not computable
Wu-Palmer Similarity not computable
Lin Similarity not computable
No LCS found

------------------------

Comparing man.n.01 and girl.n.01:
Path Similarity: 0.250
Normalized Leacock-Chodorow Similarity: 0.610
Wu-Palmer Similarity: 0.632
Lin Similarity: 0.714
Least Common Subsumer (LCS): adult.n.01

------------------------

Comparing man.n.01 and male_child.n.01:
Path Similarity: 0.333
Normalized Leacock-Chodorow Similarity: 0.688
Wu-Palmer Similarity: 0.667
Lin Similarity: 0.729
Least Common Subsumer (LCS): male.n.02

------------------------

Comparing man.n.01 and woman.n.01:
Path Similarity: 0.333
Normalized Leacock-Chodorow Similarity: 0.688
Wu-Palmer Similarity: 0.667
Lin Similarity: 0.787
Least Common Subsumer (LCS): adult.n.01

------------------------

Comparing man.n.01 and walk.v.01:
Path Similarity: 0.100
Leacock-Chodorow Similarity not computable
W

**What similarity seems better?**

Comparing metrics in NLP is always tough, as different metrics can be better in different contexts. One option would be to compare each metric against the similarity computed from word embeddings, which have proven robust (and are currently the state of the art). In this task, however, we evaluate the metrics against our own "human evaluations" (themselves full of biases!), checking which metric best captures our intuitions across cases.

- **Path Similarity:** Its advantage is that it is computable across POS, which gives us an approximate idea of similarity. In every other case, though, it captures similarity less well than the other metrics; for example, it scores "man" and "male_child" at a mere 0.33.

- **Leacock-Chodorow Similarity:** It is closer to our intuition. In our opinion, it correctly rates "man" and "woman" as exactly as similar as "man" and "male_child", since each pair differs in only one dimension (gender and age, respectively). However, this "one dimension" argument does not always hold: the similarity between "girl" and "male_child" is lower than we would expect (0.5), and in cases like this one the metric fails to capture the meaning of the words.

- **Wu-Palmer Similarity:** In the previous example, it correctly rates "girl" and "male_child" as similar, more so than any other metric. Nevertheless, we believe it sometimes underestimates similarity, for example between "swim" and "walk" (0.333) or between "girl" and "woman" (0.632).

- **Lin Similarity:** This metric is hard to evaluate because it also depends on the corpus used to estimate information content (Brown, in our case), so our analysis might be influenced by that choice (https://www.nltk.org/howto/wordnet.html). We found Lin Similarity to be the most fine-grained, as all the values computed are different. In many cases the similarities align well with our intuitions (e.g. "man" and "male_child" being more similar than "man" and "girl"), although in others it seems to underestimate similarity (e.g. "girl" and "male_child" being much less similar than "man" and "woman", even though both pairs differ mainly in gender). This unpredictability is a major drawback.
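
The point about granularity can be checked on the three noun pairs whose scores are printed above (man/girl, man/male_child, man/woman). The sketch below simply counts how many distinct values each metric produces on those pairs. The dictionary layout and the names `scores` and `distinct_counts` are ours, and three pairs are of course too few to generalize from.

```python
# Scores copied from the output above, in the order
# man/girl, man/male_child, man/woman.
scores = {
    "path": [0.250, 0.333, 0.333],
    "lch (normalized)": [0.610, 0.688, 0.688],
    "wup": [0.632, 0.667, 0.667],
    "lin": [0.714, 0.729, 0.787],
}

# Number of distinct values each metric assigns to the three pairs.
distinct_counts = {metric: len(set(values)) for metric, values in scores.items()}

for metric, count in distinct_counts.items():
    print(f"{metric}: {count} distinct values out of {len(scores[metric])}")
```

On these pairs, Lin is the only metric that separates all three, while the other metrics tie man/male_child with man/woman.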

In conclusion, no single metric perfectly captures our intuitive understanding of word similarity. However, we believe **Wu-Palmer Similarity** is the most balanced of those evaluated (albeit not by much). It generally aligns well with our intuitions, capturing similarities across a range of cases without being as unpredictable as some of the other metrics. We must, however, take into account the small number of samples evaluated, which means this conclusion does not carry much weight.




