In [2]:
# This notebook requires Python 3.12.3 or higher

import sys
required_version = (3, 12, 3)
if sys.version_info < required_version:
    # Render the tuple as "3.12.3" so the message is readable
    version_str = ".".join(map(str, required_version))
    raise RuntimeError(f"This notebook requires Python {version_str} or higher!")
else:
    print(f"Python version {sys.version} is compatible.")

Python version 3.12.3 (main, Sep 11 2024, 14:17:37) [GCC 13.2.0] is compatible.


In [3]:
%pip install nltk matplotlib

Note: you may need to restart the kernel to use updated packages.


# LAB 3

The aim of this lab is to measure the similarity between two sentences using the online lexical database WordNet. For background, students can refer to the original paper by Mihalcea et al., "Corpus-based and Knowledge-based Measures of Text Semantic Similarity", which appeared at AAAI 2006: https://www.aaai.org/Papers/AAAI/2006/AAAI06-123.pdf

## Task 1: For early practice, study Section 5 of Chapter 2 of the NLTK online book, reproduce its coding examples, and try your own example words to identify synsets, hyponyms, hypernyms, and the various semantic similarity measures between two words of your choice. Suggest a script that retrieves the first hypernym and the list of all hyponyms of the words ‘car’ and ‘bus’.

In [16]:
import nltk
from nltk.corpus import wordnet as wn
from nltk.corpus.reader.wordnet import Synset
from collections.abc import Callable


nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package wordnet to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

In [5]:
def get_hypernym_hyponyms(word: str) -> tuple[Synset | None, list[Synset]]:
    """Return the first hypernym of the word's first synset and the direct hyponyms of all its synsets."""
    synsets = wn.synsets(word)

    if not synsets:
        return None, []

    # Get the first hypernym from the first synset (most common usage)
    hypernyms = synsets[0].hypernyms()
    first_hypernym = hypernyms[0] if hypernyms else None

    # Collect the direct hyponyms of every synset of the word
    hyponyms = []
    for synset in synsets:
        hyponyms.extend(synset.hyponyms())

    return first_hypernym, hyponyms


def get_hypernym_hyponyms_str(word: str) -> tuple[str | None, list[str]]:
    first_hypernym, hyponyms = get_hypernym_hyponyms(word)
    first_hypernym_name = first_hypernym.name().split('.')[0] if first_hypernym else None

    hyponym_names = [hyponym.name().split('.')[0] for hyponym in hyponyms]

    return first_hypernym_name, hyponym_names

# Words to test
words = ['car', 'bus']

for word in words:
    hypernym, hyponyms = get_hypernym_hyponyms_str(word)
    print(f"Word: {word}")
    print(f"First Hypernym: {hypernym}")
    print(f"List of Hyponyms: {hyponyms}")
    print('-' * 40)

Word: car
First Hypernym: motor_vehicle
List of Hyponyms: ['minicar', 'compact', 'hot_rod', 'cruiser', 'hatchback', 'sedan', 'stock_car', 'sports_car', 'cab', 'racer', 'hardtop', 'model_t', 'minivan', 'limousine', 'used-car', 'bus', 'sport_utility', 'horseless_carriage', 'ambulance', 'roadster', 'convertible', 'gas_guzzler', 'subcompact', 'touring_car', 'beach_wagon', 'coupe', 'pace_car', 'stanley_steamer', 'jeep', 'electric', 'loaner', 'handcar', 'freight_car', 'tender', 'mail_car', 'cabin_car', "guard's_van", 'club_car', 'passenger_car', 'van', 'baggage_car', 'slip_coach']
----------------------------------------
Word: bus
First Hypernym: public_transport
List of Hyponyms: ['trolleybus', 'school_bus', 'minibus']
----------------------------------------


## Task 2: Suggest another script that extracts the synsets of the word “car” and ranks them in order of frequency of occurrence (most common synset first, least common last). For this purpose, you may use the following code:
```python
car = wn.synsets('car', 'n')[0]  # Get the most common synset
print(car.lemmas()[0].count())  # Print the frequency count of its first lemma
```

In [6]:
def rank_synsets_by_frequency(word: str) -> list[tuple[Synset, int]]:
    synsets = wn.synsets(word, pos=wn.NOUN)

    # Build (synset, frequency) pairs; a synset's frequency is the sum of its lemma counts
    synset_frequencies = []
    for synset in synsets:
        lemmas = synset.lemmas()
        if lemmas:
            frequency = sum(lemma.count() for lemma in lemmas)
            synset_frequencies.append((synset, frequency))

    # wn.synsets() is reported to return senses already ordered by frequency,
    # but this is poorly documented (mostly third-party claims), so sort explicitly
    synset_frequencies.sort(key=lambda x: x[1], reverse=True)

    return synset_frequencies


# Display the synsets for 'car' ranked by frequency
synset_frequencies = rank_synsets_by_frequency('car')
print("Synsets of 'car' ranked by frequency:")
for synset, freq in synset_frequencies:
    print(f"Synset: {synset.name()}, Frequency: {freq}, Definition: {synset.definition()}")

Synsets of 'car' ranked by frequency:
Synset: car.n.01, Frequency: 89, Definition: a motor vehicle with four wheels; usually propelled by an internal combustion engine
Synset: car.n.02, Frequency: 2, Definition: a wheeled vehicle adapted to the rails of railroad
Synset: car.n.03, Frequency: 0, Definition: the compartment that is suspended from an airship and that carries personnel and the cargo and the power plant
Synset: car.n.04, Frequency: 0, Definition: where passengers ride up and down
Synset: cable_car.n.01, Frequency: 0, Definition: a conveyance for passengers or freight on a cable railway


## Task 3: Now we want to use WordNet semantic similarity to evaluate the similarity between words. Suggest a script that calculates the Wu and Palmer semantic similarity between the words ‘car’ and ‘bus’ in terms of the maximum S1, minimum S2 and average S3 over all synsets of these words (that is, the synset combinations that yield the maximum and minimum Wu and Palmer similarity, and the average similarity over all combinations of synsets of ‘car’ and ‘bus’). Repeat this process for the first hypernym of ‘car’ and the first hypernym of ‘bus’, and report the new values of S1, S2 and S3. Next, repeat the process for hyponyms: calculate the Wu and Palmer similarity between every hyponym of ‘car’ and every hyponym of ‘bus’, take the arithmetic average over all hyponym pairs as the new hyponym-based similarity value, and then report the new values of S1, S2 and S3 when all synsets are considered.
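The Wu and Palmer measure scores two concepts by the depth of their least common subsumer (LCS) relative to their own depths: 2 * depth(LCS) / (depth(s1) + depth(s2)). The sketch below illustrates the formula on a small hand-made taxonomy. The taxonomy, the convention that the root has depth 1, and the names `PARENT`, `path_to_root`, `depth` and `toy_wup` are all assumptions made for illustration; they are not WordNet data or NLTK functions.

```python
# Toy is-a taxonomy (child -> parent). Invented for illustration, not WordNet data.
PARENT = {
    "vehicle": "entity",
    "motor_vehicle": "vehicle",
    "car": "motor_vehicle",
    "public_transport": "vehicle",
    "bus": "public_transport",
}


def path_to_root(node: str) -> list[str]:
    """Return the chain [node, parent, ..., root]."""
    path = [node]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def depth(node: str) -> int:
    """Depth of a node, counting the root as depth 1 (an assumption of this sketch)."""
    return len(path_to_root(node))


def toy_wup(a: str, b: str) -> float:
    """Wu-Palmer style score: 2 * depth(LCS) / (depth(a) + depth(b))."""
    ancestors_b = set(path_to_root(b))
    # The first ancestor of `a` that is also an ancestor of `b` is the deepest shared one
    lcs = next(n for n in path_to_root(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(toy_wup("car", "bus"))
print(toy_wup("car", "motor_vehicle"))
```

In this toy tree, ‘car’ and ‘bus’ only meet at ‘vehicle’, so their score is lower than that of ‘car’ and its direct parent. Real WordNet synsets can have several hypernym paths, which this single-parent sketch does not model.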

In [23]:
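def wu_palmer_similarity(synset1: Synset, synset2: Synset) -> float | None:
    # Return None when the similarity cannot be computed for this pair
    try:
        return synset1.wup_similarity(synset2)
    except Exception:
        return None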

def similarity(word1: str, word2: str, compare_function: Callable[[Synset, Synset], float | None], word_type: str=wn.NOUN) -> tuple[float | None, float | None, float | None]:
    synsets1 = wn.synsets(word1, word_type)
    synsets2 = wn.synsets(word2, word_type)

    similarities = []

    for synset1 in synsets1:
        for synset2 in synsets2:
            sim = compare_function(synset1, synset2)
            if sim is not None:
                similarities.append(sim)

    if similarities:
        s1 = max(similarities)  # Maximum similarity
        s2 = min(similarities)  # Minimum similarity
        s3 = sum(similarities) / len(similarities)  # Average similarity
        return s1, s2, s3
    else:
        return None, None, None

def hyponym_similarity(hyponyms1: list[str], hyponyms2: list[str], compare_function: Callable[[Synset, Synset], float | None], word_type: str=wn.NOUN) -> tuple[float | None, float | None, float | None]:
    s1_scores = []
    s2_scores = []
    s3_scores = []

    for hyponym1 in hyponyms1:
        for hyponym2 in hyponyms2:
            s1, s2, s3 = similarity(hyponym1, hyponym2, compare_function, word_type)
            if s1 is not None:
                s1_scores.append(s1)

            if s2 is not None:
                s2_scores.append(s2)

            if s3 is not None:
                s3_scores.append(s3)

    s1 = max(s1_scores) if s1_scores else None
    s2 = min(s2_scores) if s2_scores else None
    s3 = sum(s3_scores) / len(s3_scores) if s3_scores else None
    return s1, s2, s3

def print_word_similarity(word1: str, word2: str, compare_function: Callable[[Synset, Synset], float | None], word_type: str=wn.NOUN) -> None:
    hypernym1, hyponyms1 = get_hypernym_hyponyms_str(word1)
    hypernym2, hyponyms2 = get_hypernym_hyponyms_str(word2)
    s1_synset, s2_synset, s3_synset = similarity(word1, word2, compare_function, word_type)
    s1_hypernym, s2_hypernym, s3_hypernym = similarity(hypernym1, hypernym2, compare_function, word_type)
    s1_hyponym, s2_hyponym, s3_hyponym = hyponym_similarity(hyponyms1, hyponyms2, compare_function, word_type)

    print(f"Wu & Palmer Similarity between all synsets of '{word1}' and '{word2}':")
    print(f"Max (S1): {s1_synset}, Min (S2): {s2_synset}, Avg (S3): {s3_synset}")
    print(f"Wu & Palmer Similarity between first hypernyms of '{word1}' and '{word2}':")
    print(f"Max (S1): {s1_hypernym}, Min (S2): {s2_hypernym}, Avg (S3): {s3_hypernym}")
    print(f"Wu & Palmer Similarity between hyponyms of '{word1}' and '{word2}':")
    print(f"Max (S1): {s1_hyponym}, Min (S2): {s2_hyponym}, Avg (S3): {s3_hyponym}")

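    print("Merged Similarity Score:")
    # Drop only missing (None) scores; a legitimate score of 0.0 must be kept
    s1_list = [s for s in (s1_synset, s1_hypernym, s1_hyponym) if s is not None]
    s2_list = [s for s in (s2_synset, s2_hypernym, s2_hyponym) if s is not None]
    s3_list = [s for s in (s3_synset, s3_hypernym, s3_hyponym) if s is not None]
    # Guard against empty lists so max/min/average never fail
    max_s1 = max(s1_list) if s1_list else None
    min_s2 = min(s2_list) if s2_list else None
    avg_s3 = sum(s3_list) / len(s3_list) if s3_list else None
    print(f"Max (S1): {max_s1}, Min (S2): {min_s2}, Avg (S3): {avg_s3}")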


word1 = 'car'
word2 = 'bus'
print_word_similarity(word1, word2, compare_function=wu_palmer_similarity, word_type=wn.NOUN)

Wu & Palmer Similarity between all synsets of 'car' and 'bus':
Max (S1): 0.96, Min (S2): 0.09523809523809523, Avg (S3): 0.46739299830604175
Wu & Palmer Similarity between first hypernyms of 'car' and 'bus':
Max (S1): 0.7368421052631579, Min (S2): 0.7368421052631579, Avg (S3): 0.7368421052631579
Wu & Palmer Similarity between hyponyms of 'car' and 'bus':
Max (S1): 0.9473684210526315, Min (S2): 0.1, Avg (S3): 0.5957758922012587
Merged Similarity Score:
Max (S1): 0.96, Min (S2): 0.09523809523809523, Avg (S3): 0.6000036652568195


## Task 4: Repeat Task 3 using the Jiang-Conrath similarity, with information content computed from the Brown corpus; see https://www.nltk.org/howto/wordnet.html for examples.

In [26]:
from nltk.corpus import wordnet_ic


nltk.download('wordnet_ic')
brown_ic = wordnet_ic.ic('ic-brown.dat')

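def jiang_conrath_similarity(synset1: Synset, synset2: Synset) -> float | None:
    # Return None when the similarity cannot be computed for this pair
    try:
        return synset1.jcn_similarity(synset2, brown_ic)
    except Exception:
        return None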

word1 = 'car'
word2 = 'bus'
print_word_similarity(word1, word2, compare_function=jiang_conrath_similarity, word_type=wn.NOUN)

Wu & Palmer Similarity between all synsets of 'car' and 'bus':
Max (S1): 0.34659468740185323, Min (S2): 0.05161364962677664, Avg (S3): 0.09387159388812354
Wu & Palmer Similarity between first hypernyms of 'car' and 'bus':
Max (S1): 0.27016908921466043, Min (S2): 0.27016908921466043, Avg (S3): 0.27016908921466043
Wu & Palmer Similarity between hyponyms of 'car' and 'bus':
Max (S1): 1e-300, Min (S2): 5e-301, Avg (S3): 7.44047619047619e-301
Merged Similarity Score:
Max (S1): 0.34659468740185323, Min (S2): 5e-301, Avg (S3): 0.12134689436759466


[nltk_data] Downloading package wordnet_ic to
[nltk_data]     /home/azureuser/nltk_data...
[nltk_data]   Package wordnet_ic is already up-to-date!


## Task 5: Now consider two sentences T1 and T2, each made up of a set of tokens. For this purpose, study expression (1) of the Mihalcea et al. paper cited above. You can compare with a potential implementation available in Mihalcea’s resources and elsewhere. Start with the sentences T1: “Students feel unhappy today about the class today” and T2: “Several students study hard at classes in recent days”, and study the influence of various preprocessing steps (stopword removal, stemming) on the resulting sentence-to-sentence similarity.
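Expression (1) of Mihalcea et al. averages two directional scores: for each word in one sentence, take its best word-to-word similarity against the other sentence, weight it by the word's idf, and normalise by the total idf. The sketch below is one possible reading of that formula, not a reference implementation. The names `directional_score`, `mihalcea_similarity` and `exact_match` are hypothetical, the default idf of 1.0 for every word is a simplifying assumption, and the exact-match similarity is only a placeholder for a WordNet measure such as Wu and Palmer.

```python
from collections.abc import Callable


def directional_score(
    source: list[str],
    target: list[str],
    word_sim: Callable[[str, str], float],
    idf: Callable[[str], float],
) -> float:
    """idf-weighted average of each source word's best match in the target sentence."""
    weighted = 0.0
    total_idf = 0.0
    for word in source:
        best = max((word_sim(word, other) for other in target), default=0.0)
        weighted += best * idf(word)
        total_idf += idf(word)
    return weighted / total_idf if total_idf else 0.0


def mihalcea_similarity(
    t1: list[str],
    t2: list[str],
    word_sim: Callable[[str, str], float],
    idf: Callable[[str], float] = lambda word: 1.0,  # assumption: uniform idf
) -> float:
    """Average of the two directional scores, following the shape of expression (1)."""
    return 0.5 * (
        directional_score(t1, t2, word_sim, idf)
        + directional_score(t2, t1, word_sim, idf)
    )


def exact_match(a: str, b: str) -> float:
    """Placeholder word similarity: 1.0 for identical tokens, else 0.0."""
    return 1.0 if a == b else 0.0


t1 = "students feel unhappy today about the class today".split()
t2 = "several students study hard at classes in recent days".split()
score = mihalcea_similarity(t1, t2, exact_match)
print(score)
```

Because `word_sim` and `idf` are passed in, the same skeleton can be rerun after stopword removal or stemming, or with a WordNet-based similarity, to compare the effect of each preprocessing choice.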

## Task 6: Consider a new approach to calculating the semantic similarity: transform all words of each sentence into their noun counterparts and then calculate the maximum similarity score as in Mihalcea’s formula. The noun form of each token can be extracted using the ‘morphy’ function in WordNet; see the example at https://www.nltk.org/howto/wordnet.html.

## Task 7: Now consider a new sentence-to-sentence similarity in which the score is the cosine similarity of the embedding vectors of the two sentences, where the embedding vector of each sentence is the average of the FastText embedding vectors of its words, taken before any preprocessing. Write a program that implements this similarity metric and compute the sentence-to-sentence similarity of T1 and T2. Repeat this process using word2vec embeddings and doc2vec embeddings.


## Task 8: Implement a program that calculates the sentence-to-sentence similarity as the FuzzyWuzzy score obtained by comparing the two sentences as strings, after initial preprocessing and lemmatization with the WordNet lemmatizer. Calculate the new similarity score between sentences T1 and T2.
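The overall shape of this pipeline (normalise, lemmatize, then compare the strings on a 0 to 100 scale) can be sketched with the standard library. Note the substitution: the sketch uses `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy so that it runs without extra packages, so its numbers may differ from FuzzyWuzzy's. The names `preprocess` and `fuzzy_ratio` are hypothetical, and the default lemmatizer is an identity function that leaves tokens unchanged.

```python
import string
from collections.abc import Callable
from difflib import SequenceMatcher


def preprocess(sentence: str, lemmatize: Callable[[str], str] = lambda tok: tok) -> str:
    """Lowercase, strip ASCII punctuation, and lemmatize each token."""
    cleaned = sentence.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(lemmatize(tok) for tok in cleaned.split())


def fuzzy_ratio(s1: str, s2: str) -> int:
    """String similarity on a 0-100 scale (difflib stand-in, not FuzzyWuzzy itself)."""
    return round(100 * SequenceMatcher(None, s1, s2).ratio())


T1 = "Students feel unhappy today about the class today"
T2 = "Several students study hard at classes in recent days"
score = fuzzy_ratio(preprocess(T1), preprocess(T2))
print(score)
```

With NLTK available, the `lemmatize` method of a `WordNetLemmatizer` instance from `nltk.stem` can be passed as the `lemmatize` argument, and `fuzzy_ratio` can be swapped for the FuzzyWuzzy scorer of choice.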