## Introduction to Natural Language Processing
[**CC-BY-NC-SA**](https://creativecommons.org/licenses/by-nc-sa/4.0/deed.en)<br/>
Prof. Dr. Annemarie Friedrich<br/>
Faculty of Applied Computer Science, University of Augsburg<br/>
Date: **SS 2025**

# 4. Representing Text (Homework)

In this homework, we will try out some methods to compute semantic relatedness between words.

❓Read the first two pages of [this article](https://aclanthology.org/J06-1003.pdf) by Budanitsky and Hirst (2006). Answer the questions about the article in Vips.

## WordSim Dataset

[WordSim353](https://gabrilovich.com/resources/data/wordsim353/wordsim353.html) is a test collection for measuring word similarity or relatedness.
Each instance consists of a pair of words that were judged by humans with regard to how similar or related they are. For example, "midday" and "noon" are rated to be more similar than "noon" and "string."
The task for the models is to produce similarity scores that _correlate_ well with the human ratings. We will measure that in terms of [__Pearson's r__](https://en.wikipedia.org/wiki/Pearson_correlation_coefficient).
The full math can be found [here](https://mathworld.wolfram.com/CorrelationCoefficient.html).
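
To make the measure concrete, here is a minimal sketch that computes Pearson's r directly from its definition (the covariance of the two lists divided by the product of their standard deviations). The function name `pearson_r` and the toy ratings are made up for illustration and are not part of the homework material.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equally long lists of numbers."""
    if len(xs) != len(ys):
        raise ValueError("both lists must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (unnormalized covariance)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations (unnormalized variances)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy ratings: the system scores are a linear function of the human scores
human = [1.0, 2.0, 3.0, 4.0]
system = [2.0, 4.0, 6.0, 8.0]
r_same = pearson_r(human, system)
r_reversed = pearson_r(human, list(reversed(system)))
print(r_same, r_reversed)
```

Note that the system scores do not need to be on the same scale as the human ratings: only the linear relationship between the two lists matters.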

In this homework, we use the version by [Agirre et al., 2009](https://aclanthology.org/N09-1003/), who split the dataset into a part about relatedness and one about similarity. Your first task is to read in the dataset from a tab-separated CSV file.

The dataset is in the folder `wordsim353_sim_rel`.

Rename `wordsim_relatedness_goldstandard.txt` to `wordsim_relatedness_goldstandard.csv` (or `.tsv`) and upload it to Colab or place it in the same directory as the Jupyter notebook. The content of the file looks like this:

```
computer	keyboard	7.62
Jerusalem	Israel	8.46
planet	galaxy	8.11
canyon	landscape	7.53
OPEC	country	5.63
day	summer	3.94
...
```

The file is in a [tab-separated format](https://en.wikipedia.org/wiki/Tab-separated_values). As this is just a variant of the comma-separated format, we can easily read the file using [Python's csv package](https://docs.python.org/3/library/csv.html) by setting the `delimiter` to `"\t"`.
If you prefer, you can also use `pandas` to handle the file. If you have never read a CSV file in Python with the `csv` package, I suggest you implement it this way, just so you know how to do that in case you ever need it.
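
As a minimal sketch of the `csv` approach, the snippet below parses a few of the rows shown above from an in-memory string instead of the real file. The helper name `read_pairs` and the choice of a list of tuples are assumptions for illustration; any data structure will do.

```python
import csv
import io

# Stand-in for the real file: the first rows shown above, separated by tabs
sample = "computer\tkeyboard\t7.62\nJerusalem\tIsrael\t8.46\nplanet\tgalaxy\t8.11\n"

def read_pairs(handle):
    """Return a list of (word1, word2, rating) tuples from a tab-separated stream."""
    pairs = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        word1, word2, rating = row
        pairs.append((word1, word2, float(rating)))  # ratings are read as strings
    return pairs

pairs = read_pairs(io.StringIO(sample))
print(pairs[0])
```

For the real file, the same helper can be called on a file handle opened with `open("wordsim_relatedness_goldstandard.csv", newline="", encoding="utf-8")`.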



In [None]:
# Some imports
import nltk
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
import csv # see: https://docs.python.org/3/library/csv.html
import numpy as np
import scipy.stats # makes scipy.stats.pearsonr available
import math

Read the content of the file `wordsim_relatedness_goldstandard.csv` into a data structure of your choice. How many instances does the dataset contain? What are the average, the minimum, and the maximum of the human ratings? (You may use `numpy` to compute these statistics.)

In [None]:
# Your code here
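
As a rough illustration of the `numpy` calls involved, the sketch below computes the statistics for only the six ratings shown in the file excerpt above, so the numbers are not the answer for the full dataset. The dictionary `stats` is just a convenient container chosen for this example.

```python
import numpy as np

# Ratings of the six example rows shown in the file excerpt (not the full dataset)
ratings = np.array([7.62, 8.46, 8.11, 7.53, 5.63, 3.94])

stats = {
    "count": int(ratings.size),
    "mean": float(ratings.mean()),
    "min": float(ratings.min()),
    "max": float(ratings.max()),
}
print(stats)
```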



## WordNet Similarity

First, we will use WordNet to compute semantic relatedness between the words.
Re-read the Wiki page on WordNet. Recall that WordNet is a huge graph.

❓ Read Sections 2.3 and 2.5.3 of the Budanitsky and Hirst (2006) paper. Make sure you understand how the Leacock-Chodorow algorithm works. It might help you to sketch an example on a piece of paper. You can also use additional sources that you find about the algorithm.
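
For orientation while reading, the Leacock-Chodorow measure is commonly written as follows, where $c_1$ and $c_2$ are two synsets, $len(c_1, c_2)$ is the length of the shortest path between them in the taxonomy, and $D$ is the overall depth of the taxonomy. Treat this as a summary to check against the paper: descriptions and implementations differ in details such as whether the path length counts nodes or edges.

$ \displaystyle sim_{LC}(c_1, c_2) = -\log \left( \frac{len(c_1, c_2)}{2D} \right)$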

The [`nltk`](https://www.nltk.org/index.html) toolkit provides a method to compute the Leacock-Chodorow (LCH) similarity. Go to [this website](https://www.nltk.org/howto/wordnet.html) and figure out how it works. You can assume that all the words in the wordsim353 dataset that we are using are nouns.

__Computing LCH for wordsim353__: Wait a minute, LCH works for pairs of synsets, and in wordsim353 we are just given words, completely out of context! While this may admittedly also cause some problems for human annotators, for our purpose, we will simply retrieve _all_ the synsets associated with a noun and then compute the LCH similarities between all pairs of synsets for the two words. We define the LCH similarity between the two words as the maximum similarity score we found.
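
The "maximum over all synset pairs" idea can be sketched as below. The helper names `max_pairwise_similarity` and `max_lch` are hypothetical, and returning `None` when a word has no noun synsets is an assumption; you may prefer a different fallback. The WordNet import sits inside `max_lch` only so that the generic helper can be tried out on toy data without the WordNet download.

```python
from itertools import product

def max_pairwise_similarity(items_a, items_b, similarity):
    """Return the highest similarity(a, b) over all pairs, or None if no pair yields a score."""
    scores = [similarity(a, b) for a, b in product(items_a, items_b)]
    scores = [s for s in scores if s is not None]
    return max(scores) if scores else None

def max_lch(word1, word2):
    """Word-level LCH: maximum over all pairs of noun synsets of the two words."""
    # Imported here so the generic helper above works without WordNet data
    from nltk.corpus import wordnet as wn
    return max_pairwise_similarity(
        wn.synsets(word1, pos=wn.NOUN),
        wn.synsets(word2, pos=wn.NOUN),
        lambda s1, s2: s1.lch_similarity(s2),
    )

# Toy check with integers standing in for synsets and negative distance as "similarity"
toy_score = max_pairwise_similarity([1, 5], [4, 9], lambda a, b: -abs(a - b))
print(toy_score)
```

Once WordNet has been downloaded, `max_lch("computer", "keyboard")` gives the score for a single word pair.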

❓Compute the maximum LCH similarity for each pair of words in the wordsim353 dataset.

Hint: If you use a list for the computed similarities, it should start like this: `[2.2512917986064953, 1.6916760106710724, 1.55814461804655,...`

In [None]:
# Compute maximum LCH similarity between the two sets of synsets that belong to the two lemmas
# Do this for every pair of words in the wordsim353 dataset and collect the results

# Your code here

__Evaluation:__ Next, we evaluate the system ratings by computing the correlation with the human ratings.

More on computing various correlation coefficients in Python: https://realpython.com/numpy-scipy-pandas-correlation-python/#example-numpy-correlation-calculation

A function for computing Pearson's r is given below. Use it to compute Pearson's correlation for the human ratings of the wordsim353 dataset and the system ratings provided by the LCH metric that you have implemented above.

In [None]:
# function is given
def compute_correlation(human_ratings, system_ratings):
  """ Input: two lists (of equal length) with numeric values.
  Computes Pearson's correlation coefficient.
  """
  assert len(human_ratings), len(system_ratings)
  return scipy.stats.pearsonr(human_ratings, system_ratings)

# Use the function above to compute the correlation of the human and the LCH ratings
# Your code here

## Distributional Similarity

Next, we will use a distributional method to compute relatedness values.

In order to compute how often a word co-occurs with another word, we need a plaintext corpus. In this homework, we will use the Brown corpus as provided by `nltk`. Check out the examples [here](https://www.nltk.org/howto/corpus.html#plaintext-corpora).

❓Import the `brown` corpus using nltk. For each pair of words in the wordsim353 dataset, compute the _Pointwise Mutual Information_ (see wiki!) as

$ \displaystyle PMI(w1, w2) = \log_2 \left( \frac{p(w1, w2)}{p(w1) \cdot p(w2)} \right)$

* $p(w1, w2)$ denotes the probability that $w1$ and $w2$ occur together in a sentence.
* $p(w1)$ is the probability that $w1$ occurs in a sentence; $w2$ accordingly.

Use the PMI scores as the similarity ratings between two words. Compute the Pearson correlation with the human ratings using this method. Compare the result to that of the LCH method above.
Hint: Write a function `compute_pmi` to structure your code in a good way. Again, use the `compute_correlation` function to compute the correlation between the human and the PMI-based system ratings.

By the way, a naive iterative implementation runs for about 10 minutes (hint: print some progress statements to see what is going on). My optimized solution using dictionaries (`defaultdict`) and `set` operations runs in just a few seconds.
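
One possible way to use dictionaries and sets is sketched below on a tiny hand-made corpus: each word is mapped to the set of sentence indices it occurs in, so the joint count is just the size of a set intersection. The helper `build_sentence_index`, the signature of `compute_pmi`, and returning `None` for pairs that never co-occur (where the PMI is undefined) are all assumptions made for this illustration, not the reference solution.

```python
import math
from collections import defaultdict

def build_sentence_index(sentences):
    """Map each word to the set of indices of the sentences it occurs in."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for token in sentence:
            index[token].add(i)
    return index

def compute_pmi(w1, w2, index, n_sentences):
    """Sentence-level PMI of w1 and w2, or None if they never occur together."""
    s1 = index.get(w1, set())
    s2 = index.get(w2, set())
    joint = len(s1 & s2)  # number of sentences containing both words
    if joint == 0:
        return None  # assumption: PMI is left undefined in this case
    p_joint = joint / n_sentences
    p1 = len(s1) / n_sentences
    p2 = len(s2) / n_sentences
    return math.log2(p_joint / (p1 * p2))

# Hypothetical toy corpus: a list of tokenized sentences
toy_sentences = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["a", "cat", "purred"],
    ["the", "dog", "barked"],
]
toy_index = build_sentence_index(toy_sentences)
pmi_cat_purred = compute_pmi("cat", "purred", toy_index, len(toy_sentences))
pmi_cat_dog = compute_pmi("cat", "dog", toy_index, len(toy_sentences))
print(pmi_cat_purred, pmi_cat_dog)
```

For the Brown corpus, `brown.sents()` provides the tokenized sentences. Whether to lowercase the tokens before indexing is a design decision you should make deliberately, since the dataset contains capitalized words such as "Jerusalem".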

In [None]:
from nltk.corpus import brown
nltk.download('brown')

In [None]:
# Your code here

❓Inspect the scores output by the PMI method. Do they always make sense? If not, what are possible reasons? For which pairs of words do they work exceptionally well? Enter your answer into Vips.

❓ Optional open-ended exercise: Come up with an automatic method to identify the word pairs for which the PMI scores work particularly well or particularly badly (hint: using rankings).

## References

Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca, and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. In Proceedings of NAACL-HLT 2009.