## Introduction to Natural Language Processing
[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>
Date: **SS 2025**

# 4. Representing Text (Homework)

In this homework, we will try out some methods to compute semantic relatedness between words.

❓Read the first two pages of [this article](https://aclanthology.org/J06-1003.pdf) by Budanitsky and Hirst (2006). Answer the questions about the article in Vips.

## WordSim Dataset

[WordSim353](https://gabrilovich.com/resources/data/wordsim353/wordsim353.html) is a test collection for measuring word similarity or relatedness.
Each instance consists of a pair of words that were judged by humans with regard to how similar or related they are. For example, "midday" and "noon" are rated to be more similar than "noon" and "string."
The task for the models is to produce similarity scores that _correlate_ well with the human ratings. We will measure that in terms of [__Pearson's r__](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
The full math can be found [here](https://mathworld.wolfram.com/CorrelationCoefficient.html).
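
To make the metric concrete, here is a minimal sketch of Pearson's r written out in plain Python. The function name `pearson_r` and the toy rating lists are illustrative assumptions and not part of the homework; in the notebook itself, `scipy.stats.pearsonr` does this job.

```python
import math

def pearson_r(xs, ys):
    """Pearson's r: covariance of xs and ys divided by the product of their standard deviations."""
    assert len(xs) == len(ys) and len(xs) > 1
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # The 1/n factors of covariance and variances cancel, so plain sums suffice.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up ratings: one system agrees with the humans, one reverses their order.
human = [1.0, 2.0, 3.0, 4.0, 5.0]
system_good = [2.0, 4.0, 6.0, 8.0, 10.0]
system_bad = [10.0, 8.0, 6.0, 4.0, 2.0]
print(pearson_r(human, system_good))
print(pearson_r(human, system_bad))
```

A system whose scores rise and fall with the human ratings gets a value near 1, and one that reverses the order gets a value near -1. The absolute scale of the system scores does not matter.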

In this homework, we use the version by [Agirre et al., 2009](https://aclanthology.org/N09-1003/), who split the dataset into a part about relatedness and one about similarity. Your first task is to read in the dataset from a tab-separated CSV file.

The dataset is in the folder `wordsim353_sim_rel`.

Rename `wordsim_relatedness_goldstandard.txt` to `wordsim_relatedness_goldstandard.csv` (or `.tsv`) and upload it to Colab or place it in the same directory as the Jupyter notebook. The content of the file looks like this:

```
computer	keyboard	7.62
Jerusalem	Israel	8.46
planet	galaxy	8.11
canyon	landscape	7.53
OPEC	country	5.63
day	summer	3.94
...
```

In fact, the file format is a [tab-separated format](https://en.wikipedia.org/wiki/Tab-separated_values). As this is just a variant of the comma-separated format, we can easily read the file in using [Python's csv package](https://docs.python.org/3/library/csv.html) by setting the `delimiter` to `"\t"`.
If you prefer, you can also use `pandas` to handle the file. If you have never read a CSV file with Python's `csv` reader, I suggest you implement it this way, just so you know how to do that in case you ever need it.
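
As a minimal sketch of the `csv` approach, the snippet below parses three of the rows shown above from an in-memory string. The `sample` string is a hypothetical stand-in for the real file, so the example runs without any upload.

```python
import csv
import io

# Hypothetical in-memory stand-in for the file, using rows shown above.
sample = "computer\tkeyboard\t7.62\nJerusalem\tIsrael\t8.46\nplanet\tgalaxy\t8.11\n"

rows = []
with io.StringIO(sample, newline="") as handle:
    # delimiter="\t" tells the reader to split fields on tabs instead of commas.
    for word1, word2, score in csv.reader(handle, delimiter="\t"):
        rows.append((word1, word2, float(score)))

print(rows[0])
```

For the real file, the same loop works with `open(..., newline="")` in place of `io.StringIO`.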



In [2]:
# Some imports
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
import csv # see: https://docs.python.org/3/library/csv.html
import numpy as np
import scipy.stats # makes scipy.stats.pearsonr available
import math

[nltk_data] Downloading package wordnet to /root/nltk_data...


Read in the content of the file `wordsim_relatedness_goldstandard.csv` into a data structure of your choice. How many instances does the dataset contain? What are the average, the minimum, and the maximum of the human ratings? (You may use `numpy` to compute these statistics.)

In [3]:
# A list of instances to read the tab-separated (\t) rows into
instances = []
# A list of only the word pairs of the instances (useful later)
wordPairs = []
# A list of only the (float) scores of the instances (useful for min, max, mean)
scores = []

with open('/content/wordsim_relatedness_goldstandard.csv', newline='') as csvfile:
    filereader = csv.reader(csvfile, delimiter='\t', quotechar='|')


    for line in filereader:
        instances.append(line)
        wordPairs.append((line[0], line[1]))
        scores.append(float(line[2]))

    print(f"Number of instances in set: {len(instances)}.")
    print(f"Max score: {np.max(scores)}.")
    print(f"Min score: {np.min(scores)}.")
    print(f"Mean of all scores: {np.mean(scores)}.")


Number of instances in set: 252.
Max score: 8.81.
Min score: 0.23.
Mean of all scores: 5.294642857142857.




## WordNet Similarity

First, we will use WordNet to compute semantic relatedness between the words.
Re-read the Wiki page on WordNet. Recall that WordNet is a huge graph.

❓ Read sections 2.3 and section 2.5.3 of the Budanitsky and Hirst (2006) paper. Make sure you understand how the Leacock-Chodorow algorithm works. It might help you to sketch an example on a piece of paper. You can also use additional sources that you find about the algorithm.

The [`nltk`](https://www.nltk.org/index.html) toolkit provides a method to compute the Leacock-Chodorow (LCH) similarity. Go to [this website](https://www.nltk.org/howto/wordnet.html) and figure out how it works. You can assume that all the words in the wordsim353 dataset that we are using are nouns.

__Computing LCH for wordsim353__: Wait a minute, LCH works for pairs of synsets, and in wordsim353 we are just given words, completely out of context! While this may admittedly also cause some problems for human annotators, for our purpose, we will simply retrieve _all_ the synsets associated with a noun and then compute the LCH similarities between all pairs of synsets for the two words. We define the LCH similarity between the two words as the maximum similarity score we found.
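
The two ideas in this paragraph can be sketched without WordNet. The snippet below assumes the LCH formula $-\log\big(\frac{len}{2D}\big)$ from Budanitsky and Hirst (2006), where the path length $len$ counts nodes (edges plus one) and $D$ is the depth of the taxonomy. The sense names, the edge distances, and `TOY_DEPTH` are made up for illustration; `nltk` uses the real WordNet graph instead.

```python
import math
from itertools import product

TOY_DEPTH = 5  # assumed maximum depth of the toy taxonomy

# Hypothetical shortest-path distances (in edges) between word senses.
TOY_DISTANCES = {
    ("bank.river", "money.cash"): 6,
    ("bank.institution", "money.cash"): 2,
}

def lch(distance_in_edges, depth):
    """LCH similarity: -log(path length in nodes / (2 * depth))."""
    return -math.log((distance_in_edges + 1) / (2.0 * depth))

def max_lch(senses1, senses2):
    """Word-level score: the best LCH similarity over all sense pairs."""
    return max(lch(TOY_DISTANCES[(s1, s2)], TOY_DEPTH)
               for s1, s2 in product(senses1, senses2))

score = max_lch(["bank.river", "bank.institution"], ["money.cash"])
print(score)
```

The closest sense pair determines the score, so an unrelated sense (here the river bank) does not lower it.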

❓Compute the maximum LCH similarity for each pair of words in the wordsim353 dataset.

Hint: If you use a list for the computed similarities, it should start like this: `[2.2512917986064953, 1.6916760106710724, 1.55814461804655,...`

In [4]:
# Compute maximum LCH similarity between the two sets of synsets that belong to the two lemmas
# Do this for every pair of words in the wordsim353 dataset and collect the results
# Your code here

def lchmax(word1, word2):
  # Get all noun synsets for each word (we assume all words are nouns).
  sets1 = wn.synsets(word1, pos=wn.NOUN)
  sets2 = wn.synsets(word2, pos=wn.NOUN)

  # Stays 0 if either word has no noun synset in WordNet.
  best = 0

  # Compare every pair of synsets and keep the highest LCH similarity.
  for ss1 in sets1:
    for ss2 in sets2:
      lch = ss1.lch_similarity(ss2)
      if lch > best:
        best = lch

  return best


# Compute similarity for each pair into a list of similarities
compsims = []
for wordPair in wordPairs:
  sim = lchmax(wordPair[0], wordPair[1])
  print(f"Computed Similarity for {wordPair[0]} and {wordPair[1]} = {sim}.")
  compsims.append(sim)




Computed Similarity for computer and keyboard = 2.2512917986064953.
Computed Similarity for Jerusalem and Israel = 1.6916760106710724.
Computed Similarity for planet and galaxy = 1.55814461804655.
Computed Similarity for canyon and landscape = 1.1526795099383855.
Computed Similarity for OPEC and country = 1.6916760106710724.
Computed Similarity for day and summer = 2.2512917986064953.
Computed Similarity for day and dawn = 2.538973871058276.
Computed Similarity for country and citizen = 1.4403615823901665.
Computed Similarity for planet and people = 1.4403615823901665.
Computed Similarity for environment and ecology = 2.9444389791664407.
Computed Similarity for Maradona and football = 0.
Computed Similarity for OPEC and oil = 0.9985288301111273.
Computed Similarity for money and bank = 1.6916760106710724.
Computed Similarity for computer and software = 1.072636802264849.
Computed Similarity for law and lawyer = 1.2396908869280152.
Computed Similarity for weather and forecast = 0.998528

__Evaluation:__ Next, we evaluate the system ratings by computing the correlation with the human ratings.

More on computing various correlation coefficients in Python: https://realpython.com/numpy-scipy-pandas-correlation-python/#example-numpy-correlation-calculation

A function for computing Pearson's r is given below. Use it to compute Pearson's correlation for the human ratings of the wordsim353 dataset and the system ratings provided by the LCH metric that you have implemented above.

In [5]:
# function is given
def compute_correlation(human_ratings, system_ratings):
  """ Input: two lists (of equal length) with numeric values.
  Computes Pearson's correlation coefficient.
  """
  assert len(human_ratings) == len(system_ratings)
  return scipy.stats.pearsonr(human_ratings, system_ratings)

# Use the function above to compute the correlation of the human and the LCH ratings
corr_coeff, p_val = compute_correlation(scores, compsims)
print(f"Computed Pearson correlation coefficient is {corr_coeff}.")
print(f"Computed p value is {p_val}.")

Computed Pearson correlation coefficient is 0.007949148935555112.
Computed p value is 0.9000775038545066.


## Distributional Similarity

Next, we will use a distributional method to compute relatedness values.

In order to compute how often a word co-occurs with another word, we need a plaintext corpus. In this homework, we will use the Brown corpus as provided by `nltk`; check out the examples [here](https://www.nltk.org/howto/corpus.html#plaintext-corpora).

❓Import the `brown` corpus using nltk. For each pair of words in the wordsim353 dataset, compute the _Pointwise Mutual Information_ (see wiki!) as

$ \displaystyle \mathrm{PMI}(w_1, w_2) = \log_2 \left( \frac{p(w_1, w_2)}{p(w_1) \cdot p(w_2)} \right)$

* $p(w_1, w_2)$ denotes the probability that $w_1$ and $w_2$ occur together in a sentence.
* $p(w_1)$ is the probability that $w_1$ occurs in a sentence; $p(w_2)$ is defined accordingly.

Use the PMI scores as the similarity ratings between two words. Compute the Pearson correlation with the human ratings using this method. Compare the results to those of the LCH method above.
Hint: Write a function `compute_pmi` to structure your code in a good way. Again, use the `compute_correlation` function to compute the correlation between the human and the PMI-based system ratings.

By the way, a naive iterative implementation runs for about 10 minutes (hint: print some progress statements to see what is going on). My optimized solution using dictionaries (`defaultdict`) and `set` operations runs in just a few seconds.
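
The PMI definition above, together with the dictionary-and-set idea, can be illustrated on a toy corpus. This is a minimal sketch under assumptions: the sentences, the index name `sent_ids`, and the function `toy_pmi` are made up for illustration, and returning `None` for words that never co-occur is one possible convention, not the required one.

```python
import math
from collections import defaultdict

# Toy corpus of tokenized sentences (made up for illustration).
toy_sents = [
    ["the", "dog", "chews", "a", "bone"],
    ["a", "dog", "buries", "the", "bone"],
    ["the", "cat", "sleeps"],
    ["a", "bird", "sings"],
    ["the", "cat", "meows"],
    ["rain", "falls"],
    ["the", "sun", "shines"],
    ["wind", "blows"],
]

# Map each lowercased word to the set of sentence indices it occurs in.
sent_ids = defaultdict(set)
for idx, sent in enumerate(toy_sents):
    for token in sent:
        sent_ids[token.lower()].add(idx)

def toy_pmi(word1, word2, n_sents=len(toy_sents)):
    """PMI from sentence-level counts; None if the words never co-occur."""
    ids1, ids2 = sent_ids[word1.lower()], sent_ids[word2.lower()]
    joint = len(ids1 & ids2)  # sentences that contain both words
    if joint == 0:
        return None  # log2(0) is undefined
    p_joint = joint / n_sents
    return math.log2(p_joint / ((len(ids1) / n_sents) * (len(ids2) / n_sents)))

print(toy_pmi("dog", "bone"))
print(toy_pmi("dog", "cat"))
```

Storing sentence indices in sets means a sentence is counted once per word even if the word appears in it several times, and the co-occurrence count is a single set intersection.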

In [6]:
from nltk.corpus import brown
nltk.download('brown')

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


True

In [33]:
# Your code here
from collections import defaultdict

# compute_pmi takes a dictionary that maps each (lowercased) word to the list
# of sentences it occurs in, the two words to compute the PMI for,
# and an additive smoothing parameter.
# If smoothing = 0 and the two words never co-occur, the PMI is set to 0.

def compute_pmi(occurrences, word1, word2, smoothing):
  w1_occurrences = occurrences[word1.lower()]
  w2_occurrences = occurrences[word2.lower()]

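  # Note: these are occurrence counts, so a sentence that contains
  # the word twice is counted twice.
  w1_count = len(w1_occurrences)
  w2_count = len(w2_occurrences)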

  # Join each tokenized sentence (a list of words) into a single string
  # so that the sentences can be stored in a set
  w1_sentences = []
  w2_sentences = []
  for i in range(w1_count):
    w1_sentences.append(' '.join(w1_occurrences[i]))
  for i in range(w2_count):
    w2_sentences.append(' '.join(w2_occurrences[i]))

  # Eliminate repeated sentences
  w1_set = set(w1_sentences)
  w2_set = set(w2_sentences)

  # Get only the sentences that are in both lists = co-occurrences
  intersection = list(w1_set & w2_set)

  # Additive smoothing: add `smoothing` to every count (add-1 if smoothing = 1)
  w1_count += smoothing
  w2_count += smoothing
  w1w2_count = len(intersection) + smoothing

  total_sents = len(nltk.corpus.brown.sents())

  pmi = 0
  if(w1w2_count != 0):
    pmi = math.log2((w1w2_count/total_sents) / ((w1_count/total_sents)*(w2_count/total_sents)))

  print(f"{word1}: {w1_count}.")
  print(f"{word2}: {w2_count}.")
  print(f"Co-occur: {w1w2_count}.")
  print(f"PMI: {pmi}.\n")

  return pmi


# Dictionary to store occurrences for each word
# Key: word
# Value: list of sentences where the word occurs
occurrences = defaultdict(list)

for s in nltk.corpus.brown.sents():
  for w in s:
    occurrences[w.lower()].append(s)

compute_pmi(occurrences, "baseball", "season", 0)
#compute_pmi(occurrences, "baseball", "season", 1)

comp_pmis = []
for wordPair in wordPairs:
  comp_pmis.append(compute_pmi(occurrences, wordPair[0], wordPair[1], 0))

baseball: 57.
season: 105.
Co-occur: 4.
PMI: 5.260118752297021.

computer: 13.
keyboard: 4.
Co-occur: 0.
PMI: 0.

Jerusalem: 7.
Israel: 15.
Co-occur: 0.
PMI: 0.

planet: 21.
galaxy: 3.
Co-occur: 0.
PMI: 0.

canyon: 12.
landscape: 20.
Co-occur: 0.
PMI: 0.

OPEC: 0.
country: 324.
Co-occur: 0.
PMI: 0.

day: 687.
summer: 134.
Co-occur: 7.
PMI: 2.1243537269096175.

day: 687.
dawn: 28.
Co-occur: 2.
PMI: 2.5757330732521813.

country: 324.
citizen: 30.
Co-occur: 0.
PMI: 0.

planet: 21.
people: 847.
Co-occur: 0.
PMI: 0.

environment: 43.
ecology: 0.
Co-occur: 0.
PMI: 0.

Maradona: 0.
football: 36.
Co-occur: 0.
PMI: 0.

OPEC: 0.
oil: 95.
Co-occur: 0.
PMI: 0.

money: 265.
bank: 83.
Co-occur: 0.
PMI: 0.

computer: 13.
software: 0.
Co-occur: 0.
PMI: 0.

law: 299.
lawyer: 43.
Co-occur: 4.
PMI: 4.156987855227683.

weather: 69.
forecast: 10.
Co-occur: 0.
PMI: 0.

network: 30.
hardware: 11.
Co-occur: 0.
PMI: 0.

nature: 191.
environment: 43.
Co-occur: 1.
PMI: 2.8035607013900394.

FBI: 8.
investigation:

In [34]:
pmi_corr_coeff, pmi_p_val = compute_correlation(scores, comp_pmis)
print(f"Computed Pearson correlation coefficient is {pmi_corr_coeff}.")
print(f"Computed p value is {pmi_p_val}.")

for i in range(len(wordPairs)):
  print(f"{wordPairs[i][0]:13s}-{wordPairs[i][1]:13s} Score={scores[i]:6.5f} \tPMI={comp_pmis[i]:6.5f} \tLCH={compsims[i]:6.5f}.")

print(f"Number of sentences in the Brown Corpus: {len(nltk.corpus.brown.sents())}.")

Computed Pearson correlation coefficient is 0.226519011137628.
Computed p value is 0.00028869853627923954.
computer     -keyboard      Score=7.62000 	PMI=0.00000 	LCH=2.25129.
Jerusalem    -Israel        Score=8.46000 	PMI=0.00000 	LCH=1.69168.
planet       -galaxy        Score=8.11000 	PMI=0.00000 	LCH=1.55814.
canyon       -landscape     Score=7.53000 	PMI=0.00000 	LCH=1.15268.
OPEC         -country       Score=5.63000 	PMI=0.00000 	LCH=1.69168.
day          -summer        Score=3.94000 	PMI=2.12435 	LCH=2.25129.
day          -dawn          Score=7.53000 	PMI=2.57573 	LCH=2.53897.
country      -citizen       Score=7.31000 	PMI=0.00000 	LCH=1.44036.
planet       -people        Score=5.75000 	PMI=0.00000 	LCH=1.44036.
environment  -ecology       Score=8.81000 	PMI=0.00000 	LCH=2.94444.
Maradona     -football      Score=8.62000 	PMI=0.00000 	LCH=0.00000.
OPEC         -oil           Score=8.59000 	PMI=0.00000 	LCH=0.99853.
money        -bank          Score=8.50000 	PMI=0.00000 	LCH=1.691

Computing PMI with a smoothing factor of 1:
- Computed Pearson correlation coefficient is 0.1470437495408519.
- Computed p value is 0.019525119848715456.

Computing PMI with a smoothing factor of 0
(that is, if co-occurrence = 0, then PMI = 0, which would mean "the two words are independent"):
- Computed Pearson correlation coefficient is 0.226519011137628.
- Computed p value is 0.00028869853627923954.

The p value is below 0.05 in both cases, so the correlation is statistically significant, but with r ≈ 0.15 (smoothing factor 1) and r ≈ 0.23 (smoothing factor 0) it is weak.
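
The effect of the zero-count convention and of additive smoothing can be checked directly on counts. This is a sketch under assumptions: `pmi_from_counts` is a hypothetical helper that mirrors the arithmetic of `compute_pmi`, and the corpus size of 57,340 sentences is assumed here for the Brown corpus; with it, the helper reproduces the PMI printed above for "baseball" and "season".

```python
import math

N_SENTS = 57340  # assumed number of sentences in the Brown corpus

def pmi_from_counts(count1, count2, joint, n_sents=N_SENTS, smoothing=0):
    """PMI from raw counts, with the same additive smoothing as compute_pmi."""
    count1 += smoothing
    count2 += smoothing
    joint += smoothing
    if joint == 0:
        return 0  # convention used in this notebook; the true value is -infinity
    return math.log2((joint / n_sents) / ((count1 / n_sents) * (count2 / n_sents)))

# Counts printed above: "baseball" (57), "season" (105), 4 co-occurrences.
print(pmi_from_counts(57, 105, 4))
# "computer" (13) and "keyboard" (4) never co-occur.
print(pmi_from_counts(13, 4, 0))
print(pmi_from_counts(13, 4, 0, smoothing=1))
```

With a smoothing factor of 1, a pair of rare words that never co-occur receives a large positive PMI, because the added joint count is large relative to the small individual counts. This may be one reason why the smoothed variant correlates less well with the human ratings.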

❓Inspect the scores output by the PMI method. Do they always make sense? If not, what are possible reasons? For which pairs of words do they work exceptionally well? Enter your answer into Vips.

❓ Optional open-ended exercise: Come up with an automatic method to identify those pairs (hint: using rankings).
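
One possible starting point, offered as a sketch and not as a reference solution: rank the pairs by human score and by system score, then look at the per-pair rank difference. Pairs with a small difference are the ones where the system agrees with the humans. The helper names and the toy values are hypothetical, and ties are broken by position, which a real solution would need to handle more carefully (many PMI scores are tied at 0).

```python
def ranks(values):
    """Rank positions (0 = highest value); ties are broken by original order."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result

def rank_differences(pairs, human, system):
    """Absolute rank difference per pair, smallest (best agreement) first."""
    diffs = [abs(h - s) for h, s in zip(ranks(human), ranks(system))]
    return sorted(zip(pairs, diffs), key=lambda item: item[1])

# Made-up word pairs and scores for illustration.
toy_pairs = [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")]
toy_human = [8.0, 6.0, 4.0, 2.0]
toy_system = [3.0, 0.5, 2.0, 0.1]
for pair, diff in rank_differences(toy_pairs, toy_human, toy_system):
    print(pair, diff)
```

In the notebook, `wordPairs`, `scores`, and `comp_pmis` could take the place of the toy lists.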

## References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proceedings of NAACL-HLT 2009.