<h1 style="text-align: center;">WordNet & WSD - Tutorial</h1> 

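* WordNet synset names follow a fixed format, e.g., **bank.n.01**:
    * bank → **The lemma name**: the first lemma of the synset, which is not always the word you searched for (e.g., `depository_financial_institution.n.01` is a sense of "bank").
    * n → **Part of speech (POS)**: n = noun, v = verb, a = adjective, s = satellite adjective, r = adverb.
    * 01 → **The sense number** of that lemma (a word may have multiple meanings).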
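
A synset name can be split back into its three parts with plain string handling. The sketch below is only an illustration: `parse_synset_name` is a hypothetical helper (not part of NLTK), and it assumes the name follows the `lemma.pos.nn` pattern described above.

```python
def parse_synset_name(name):
    """Split a synset name such as 'bank.n.01' into (lemma, pos, sense_number)."""
    # Split from the right so that any dots inside the lemma itself are preserved.
    lemma, pos, sense = name.rsplit(".", 2)
    return lemma, pos, int(sense)


print(parse_synset_name("bank.n.01"))
# ('bank', 'n', 1)
print(parse_synset_name("depository_financial_institution.n.01"))
# ('depository_financial_institution', 'n', 1)
```

With NLTK itself, the same information is available from a synset object via `syn.name()` and `syn.pos()`.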

In [11]:
import nltk
# Download the WordNet corpus (only needed the first time)
nltk.download("wordnet")

from nltk.corpus import wordnet as wn

# Get all synsets for "bank"
for syn in wn.synsets("bank"):
    print(syn, ":", syn.definition())


Synset('bank.n.01') : sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') : a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') : a long ridge or pile
Synset('bank.n.04') : an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') : a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') : the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') : a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') : a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') : a building in which the business of banking transacted
Synset('bank.n.10') : a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)
Synset('bank.v.0



In [12]:
# Get all noun senses of 'bank'
bank_synsets = wn.synsets('bank', pos=wn.NOUN)
for syn in bank_synsets:
    print(syn, ":", syn.definition())

Synset('bank.n.01') : sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') : a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') : a long ridge or pile
Synset('bank.n.04') : an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') : a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') : the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') : a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') : a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') : a building in which the business of banking transacted
Synset('bank.n.10') : a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turning)


## Taxonomy-Based Similarity
* Path similarity is a semantic similarity metric between **two synsets (word senses) within the same part of speech** (noun-noun, verb-verb, etc.).
* It is based on the **shortest path between the two synsets in WordNet's hypernym/hyponym hierarchy**.
* **Path similarity formula:** 1 / (shortest path length + 1)
* Range:
  * 1.0 → identical synsets (maximum similarity)
  * Close to 0 → distant or unrelated synsets
  * None → no path between the synsets (usually if they are different parts of speech, e.g., noun vs verb).
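
The formula and its range can be checked with a few lines of plain Python. This is a minimal sketch of the arithmetic only; `path_similarity_from_distance` is a hypothetical helper name, and passing `None` stands in for "no path found".

```python
def path_similarity_from_distance(distance):
    """Return 1 / (distance + 1), or None when there is no path."""
    if distance is None:
        return None
    return 1.0 / (distance + 1)


for d in (0, 1, 4, None):
    print(d, "->", path_similarity_from_distance(d))
# 0 -> 1.0
# 1 -> 0.5
# 4 -> 0.2
# None -> None
```

A distance of 0 (the same synset) gives the maximum of 1.0, and the score shrinks towards 0 as the path gets longer.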


## Hypernym path = IS-A path

In [13]:
dog = wn.synset('dog.n.01')
for path in dog.hypernym_paths():
    print([syn.name() for syn in path])

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'living_thing.n.01', 'organism.n.01', 'animal.n.01', 'domestic_animal.n.01', 'dog.n.01']
['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'living_thing.n.01', 'organism.n.01', 'animal.n.01', 'chordate.n.01', 'vertebrate.n.01', 'mammal.n.01', 'placental.n.01', 'carnivore.n.01', 'canine.n.02', 'dog.n.01']


In [14]:
cat = wn.synset('cat.n.01')
for path in cat.hypernym_paths():
    print([syn.name() for syn in path])

['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02', 'living_thing.n.01', 'organism.n.01', 'animal.n.01', 'chordate.n.01', 'vertebrate.n.01', 'mammal.n.01', 'placental.n.01', 'carnivore.n.01', 'feline.n.01', 'cat.n.01']


## Path between Synsets
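* In WordNet, nouns (and verbs) are organized in a hypernym/hyponym hierarchy. It is tree-like, but a synset can have more than one hypernym, which is why `dog.n.01` has two hypernym paths above.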
    * **Synsets** (concepts) are nodes.
    * **Hypernym/Hyponym** relations are edges.
* Path distance = the number of edges between two synsets.
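
To make "number of edges" concrete, here is a small sketch that counts edges with a breadth-first search over a toy graph. The edge list is a hand-copied fragment of the hypernym paths printed above, not the full WordNet graph, and `edge_distance` is a hypothetical helper rather than an NLTK function.

```python
from collections import deque

# Toy fragment of the noun hierarchy: each pair is one (hyponym, hypernym) edge.
EDGES = [
    ("dog.n.01", "canine.n.02"),
    ("canine.n.02", "carnivore.n.01"),
    ("cat.n.01", "feline.n.01"),
    ("feline.n.01", "carnivore.n.01"),
    ("carnivore.n.01", "placental.n.01"),
]


def edge_distance(start, goal, edges):
    """Breadth-first search over undirected edges; returns None if unreachable."""
    neighbours = {}
    for child, parent in edges:
        neighbours.setdefault(child, set()).add(parent)
        neighbours.setdefault(parent, set()).add(child)

    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


print(edge_distance("dog.n.01", "cat.n.01", EDGES))  # 4
```

The edges are treated as undirected because a path between two synsets may go up through hypernyms and then down through hyponyms.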

In [15]:
from nltk.corpus import wordnet as wn

syn1 = wn.synset('dog.n.01')
syn2 = wn.synset('cat.n.01')

# Number of edges on the shortest hypernym/hyponym path between syn1 and syn2
shortest_distance = syn1.shortest_path_distance(syn2)
print("Shortest path length:", shortest_distance)

similarity = syn1.path_similarity(syn2)
print("Path Similarity:", similarity)


Shortest path length: 4
Path Similarity: 0.2


* The Lowest Common Ancestor (LCA), which NLTK calls the lowest common hypernym, is **carnivore**, and we count edges:
    * dog → canine → **carnivore** (2 steps)
    * cat → feline → **carnivore** (2 steps)
    * Shortest path = 2 + 2 = 4.
    * Similarity = 1 / (4 + 1) = 0.2
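
The same count can be reproduced from the hypernym paths alone. The sketch below hard-codes the paths printed earlier for `dog.n.01` and `cat.n.01` and finds, for each pair of paths, the deepest shared node and the edges remaining on each side. It is an illustration of the idea, not NLTK's internal algorithm, and `shortest_via_common_ancestor` is a hypothetical helper.

```python
# Root-to-synset paths copied from the hypernym_paths() output shown earlier.
TO_ANIMAL = ['entity.n.01', 'physical_entity.n.01', 'object.n.01', 'whole.n.02',
             'living_thing.n.01', 'organism.n.01', 'animal.n.01']
TO_CARNIVORE = TO_ANIMAL + ['chordate.n.01', 'vertebrate.n.01', 'mammal.n.01',
                            'placental.n.01', 'carnivore.n.01']
DOG_PATHS = [
    TO_ANIMAL + ['domestic_animal.n.01', 'dog.n.01'],
    TO_CARNIVORE + ['canine.n.02', 'dog.n.01'],
]
CAT_PATHS = [TO_CARNIVORE + ['feline.n.01', 'cat.n.01']]


def shortest_via_common_ancestor(paths_a, paths_b):
    """Return (distance, ancestor) minimised over all pairs of paths, or None."""
    best = None
    for pa in paths_a:
        for pb in paths_b:
            # Length of the shared prefix; its last node is a common ancestor.
            shared = 0
            for x, y in zip(pa, pb):
                if x != y:
                    break
                shared += 1
            if shared == 0:
                continue
            distance = (len(pa) - shared) + (len(pb) - shared)
            if best is None or distance < best[0]:
                best = (distance, pa[shared - 1])
    return best


distance, ancestor = shortest_via_common_ancestor(DOG_PATHS, CAT_PATHS)
print(ancestor, distance, 1 / (distance + 1))
# carnivore.n.01 4 0.2
```

Pairing the cat path with the `domestic_animal` path of dog would meet at `animal.n.01` and cost 9 edges, which is why the minimum over all path pairs is taken.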


## Semantic Similarity Comparison using WordNet

In [16]:
from nltk.corpus import wordnet as wn

# 1. First vs Final
first = wn.synset('first.n.01')
final = wn.synset('final.n.01')
similarity_first_final = first.path_similarity(final)

# 2. Hair vs Comb
hair = wn.synset('hair.n.01')
comb = wn.synset('comb.n.01')
similarity_hair_comb = hair.path_similarity(comb)

# 3. Doctor vs Hospital & Doctor vs Nurse
doctor = wn.synset('doctor.n.01')
hospital = wn.synset('hospital.n.01')
nurse = wn.synset('nurse.n.01')

similarity_doc_hosp = doctor.path_similarity(hospital)
similarity_doc_nurse = doctor.path_similarity(nurse)

# Print results
print(f"First vs Final similarity: {similarity_first_final}")
print(f"Hair vs Comb similarity: {similarity_hair_comb}")
print(f"Doctor vs Hospital similarity: {similarity_doc_hosp}")
print(f"Doctor vs Nurse similarity: {similarity_doc_nurse}")

# Which is stronger
if similarity_doc_hosp > similarity_doc_nurse:
    print("Doctor-Hospital is stronger.")
elif similarity_doc_hosp < similarity_doc_nurse:
    print("Doctor-Nurse is stronger.")
else:
    print("Both have equal similarity.")


First vs Final similarity: 0.08333333333333333
Hair vs Comb similarity: 0.1111111111111111
Doctor vs Hospital similarity: 0.07142857142857142
Doctor vs Nurse similarity: 0.25
Doctor-Nurse is stronger.


## Word Sense Disambiguation with WordNet
* We will disambiguate the word 'bank' in two contexts using NLTK's Lesk implementation.
* `nltk.wsd.lesk` → NLTK's implementation of **the Simplified Lesk Algorithm**.
* It predicts the best-matching sense **based on gloss overlap**: the number of words shared by the sentence context and each synset definition.
* No extra NLP processing (no stemming, no semantic similarity), just raw word overlap.
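
The overlap idea can be sketched without NLTK. The code below is a simplified illustration in the spirit of Lesk, not NLTK's actual source: `simplified_lesk` is a hypothetical helper, only two of the glosses listed earlier are included, and ties are resolved by keeping the first sense seen, which may differ from NLTK's behaviour.

```python
def simplified_lesk(context_tokens, glosses):
    """Return (sense, overlap) for the gloss sharing the most tokens with the context."""
    context = set(context_tokens)
    best_sense, best_overlap = None, -1
    for sense, gloss in glosses.items():
        # Raw whitespace tokens: no lowercasing, stemming, or punctuation stripping.
        overlap = len(context & set(gloss.split()))
        if overlap > best_overlap:
            best_sense, best_overlap = sense, overlap
    return best_sense, best_overlap


# Two glosses copied from the synset listing earlier in this tutorial.
GLOSSES = {
    "bank.n.01": "sloping land (especially the slope beside a body of water)",
    "depository_financial_institution.n.01": (
        "a financial institution that accepts deposits and channels "
        "the money into lending activities"
    ),
}

finance = "She conducts her lending activities through a bank .".split()
river = "The fisherman sat patiently on the grassy bank beside the water .".split()

print(simplified_lesk(finance, GLOSSES))
# ('depository_financial_institution.n.01', 3)
print(simplified_lesk(river, GLOSSES))
# ('bank.n.01', 2)
```

In the river example, "water" in the sentence does not match "water)" in the gloss, which shows how sensitive raw overlap is to tokenization and why function words such as "the" and "a" end up carrying much of the score.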
  

In [17]:
from nltk.wsd import lesk
from nltk.tokenize import word_tokenize
from nltk.corpus import wordnet as wn

In [18]:
# Example 1: Financial context
sentence1 = "She conducts her lending activities through a bank."
tokens1 = word_tokenize(sentence1)

sense = lesk(tokens1, 'bank', 'n')
print("Sentence 1 context:", sentence1)
print("Predicted sense:", sense.name())
print("Definition:", sense.definition())

Sentence 1 context: She conducts her lending activities through a bank.
Predicted sense: depository_financial_institution.n.01
Definition: a financial institution that accepts deposits and channels the money into lending activities


In [19]:
# Example 2: River context
sentence2 = "The fisherman sat patiently on the grassy bank beside the water."
tokens2 = word_tokenize(sentence2)

sense = lesk(tokens2, 'bank', 'n')
print("Sentence 2 context:", sentence2)
print("Predicted sense:", sense.name())
print("Definition:", sense.definition())

Sentence 2 context: The fisherman sat patiently on the grassy bank beside the water.
Predicted sense: bank.n.01
Definition: sloping land (especially the slope beside a body of water)


## WSD Using Contextual Embeddings (WordNet + BERT)
Instead of relying on word overlap (like Lesk), we now:

* Retrieve glosses (definitions) of each possible WordNet sense of the target word (bank).
* Use a pre-trained language model like BERT to **compute sentence embeddings**:
    * One for the context sentence.
    * One for each gloss.
* Measure similarity (cosine similarity) between the context embedding and each gloss embedding.
* Pick the sense with the highest similarity.
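
The last two steps (cosine similarity, then picking the maximum) can be sketched independently of any embedding model. The vectors below are made-up 3-dimensional numbers used purely for illustration; real sentence embeddings have hundreds of dimensions. `cosine_similarity` and `pick_best_sense` are hypothetical helpers, and the sketch assumes no vector is all zeros.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pick_best_sense(context_vec, gloss_vecs):
    """Score every gloss vector against the context and return (best_sense, scores)."""
    scores = {sense: cosine_similarity(context_vec, vec)
              for sense, vec in gloss_vecs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up vectors standing in for model outputs (assumption, not real embeddings).
context_vec = [0.9, 0.1, 0.2]
gloss_vecs = {
    "bank.n.01": [0.1, 0.9, 0.0],
    "depository_financial_institution.n.01": [0.8, 0.0, 0.3],
}

best, scores = pick_best_sense(context_vec, gloss_vecs)
for sense, score in scores.items():
    print(f"{sense}: {score:.3f}")
print("Best sense:", best)
```

Cosine similarity ranges from -1 to 1, so slightly negative scores for unrelated glosses are normal and not a sign of an error.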

In [20]:
from nltk.corpus import wordnet as wn
from sentence_transformers import SentenceTransformer, util

# Initialize the sentence embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')  # Compact 6-layer MiniLM model (BERT-style architecture), chosen for speed

# Context sentence
sentence = "She went to the bank to open a savings account."

# Get all noun senses of 'bank'
bank_synsets = wn.synsets('bank', pos=wn.NOUN)

# Embed the sentence
sentence_emb = model.encode(sentence, convert_to_tensor=True)

best_score = float("-inf")  # cosine similarity can be as low as -1, so start below that
best_sense = None

# Compare with each sense's gloss
for syn in bank_synsets:
    gloss = syn.definition()
    gloss_emb = model.encode(gloss, convert_to_tensor=True)
    similarity = util.cos_sim(sentence_emb, gloss_emb).item()
    print(f"Sense: {syn.name()}\nGloss: {gloss}\nSimilarity: {similarity:.4f}\n")
    
    if similarity > best_score:
        best_score = similarity
        best_sense = syn

# Final result
print("Best Sense:", best_sense.name())
print("Definition:", best_sense.definition())


Sense: bank.n.01
Gloss: sloping land (especially the slope beside a body of water)
Similarity: -0.0125

Sense: depository_financial_institution.n.01
Gloss: a financial institution that accepts deposits and channels the money into lending activities
Similarity: 0.3637

Sense: bank.n.03
Gloss: a long ridge or pile
Similarity: 0.0037

Sense: bank.n.04
Gloss: an arrangement of similar objects in a row or in tiers
Similarity: -0.0993

Sense: bank.n.05
Gloss: a supply or stock held in reserve for future use (especially in emergencies)
Similarity: 0.1110

Sense: bank.n.06
Gloss: the funds held by a gambling house or the dealer in some gambling games
Similarity: 0.3313

Sense: bank.n.07
Gloss: a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Similarity: -0.0231

Sense: savings_bank.n.02
Gloss: a container (usually with a slot in the top) for keeping money at home
Similarity: 0.2544

Sense: bank.n.09
Gloss: a build