<a href="https://colab.research.google.com/github/mimuruth-msft/NLP/blob/main/07_WordNet/Chapter_7_WordNet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

1. ***Chapter 7: WordNet***

*   WordNet is a large lexical database of English that groups English words into sets of synonyms called synsets, providing their definitions, usage examples, and relationships with other synsets. It organizes words hierarchically, allowing for traversal through hypernym and hyponym relationships, and can be used for natural language processing tasks such as word sense disambiguation, text classification, and information retrieval.

2.   Let's select the noun "dog" and output all of its synsets:



In [11]:
from nltk.corpus import wordnet as wn

import nltk
#nltk.download("popular")   # uncomment on the initial run
from nltk.corpus import wordnet

# get all noun synsets of 'dog'
dog_synsets = wordnet.synsets('dog', pos='n')
for synset in dog_synsets:
    print(synset)

Synset('dog.n.01')
Synset('frump.n.01')
Synset('dog.n.03')
Synset('cad.n.01')
Synset('frank.n.02')
Synset('pawl.n.01')
Synset('andiron.n.01')


In [12]:
# get all synsets of 'dog', across every part of speech
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

3. Let's select the synset 'dog.n.01' and extract its definition, usage examples, and lemmas. We will then traverse up the WordNet hierarchy as far as we can, outputting the synsets as we go:

In [14]:
# get a definition for the first noun synset

wn.synset('dog.n.01').definition()

'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

In [16]:
# get a definition for the first verb synset ('dog.v.01' resolves to chase.v.01)

wn.synset('dog.v.01').definition()

'go after with the intent to catch'

In [17]:
# extract examples

wn.synset('dog.v.01').examples()

['The policeman chased the mugger down the alley', 'the dog chased the rabbit']

In [18]:
# extract lemmas

wn.synset('dog.v.01').lemmas()

[Lemma('chase.v.01.chase'),
 Lemma('chase.v.01.chase_after'),
 Lemma('chase.v.01.trail'),
 Lemma('chase.v.01.tail'),
 Lemma('chase.v.01.tag'),
 Lemma('chase.v.01.give_chase'),
 Lemma('chase.v.01.dog'),
 Lemma('chase.v.01.go_after'),
 Lemma('chase.v.01.track')]

In [20]:
# iterate over synsets

exercise_synsets = wn.synsets('dog', pos=wn.VERB)
for sense in exercise_synsets:
    lemmas = [l.name() for l in sense.lemmas()]
    print("Synset: " + sense.name() + "(" +sense.definition() + ")  \n\t Lemmas:" + str(lemmas))

Synset: chase.v.01(go after with the intent to catch)  
	 Lemmas:['chase', 'chase_after', 'trail', 'tail', 'tag', 'give_chase', 'dog', 'go_after', 'track']


In [22]:
selected_synset = wordnet.synset('dog.n.01')

print(f"Definition: {selected_synset.definition()}")
print("Usage examples:")
for example in selected_synset.examples():
    print(f"- {example}")
print("Lemmas:")
for lemma in selected_synset.lemmas():
    print(f"- {lemma.name()}")

print("Hypernyms:")
hypernyms = selected_synset.hypernyms()
print(hypernyms)

print("Hyponyms:")
hyponyms = selected_synset.hyponyms()
print(hyponyms)

print("Meronyms:")
meronyms = selected_synset.part_meronyms()
print(meronyms)

print("Holonyms:")
holonyms = selected_synset.part_holonyms()
print(holonyms)

print("Antonyms:")
antonyms = selected_synset.lemmas()[0].antonyms()
print(antonyms)

Definition: a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
Usage examples:
- the dog barked all night
Lemmas:
- dog
- domestic_dog
- Canis_familiaris
Hypernyms:
[Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Hyponyms:
[Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
Meronyms:
[Synset('flag.n.07')]
Holonyms:
[]
Antonyms:
[]


- From the output, we can observe that WordNet organizes synsets in a hierarchical structure, with each synset having hypernyms (more general terms) and hyponyms (more specific terms). Synsets can also have meronym (part) and holonym (whole) relationships. In this case, the synset 'dog.n.01' has the hypernyms 'canine.n.02' and 'domestic_animal.n.01', hyponyms for various kinds of dogs (for example 'poodle.n.01' and 'puppy.n.01'), one part meronym ('flag.n.07'), and no part holonyms or antonyms.
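
*   Because 'dog.n.01' has two hypernyms, there is more than one way up the hierarchy. The sketch below walks every path to the root on a small hand-written map; the names in TOY_HYPERNYMS and the helper paths_to_root are assumptions made for illustration and are not the actual contents of WordNet. For real synsets, NLTK offers Synset.hypernym_paths().

```python
# Abbreviated, hand-written hypernym map for illustration only;
# it is not the actual content of WordNet.
TOY_HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["animal"],
    "domestic_animal": ["animal"],
    "animal": ["entity"],
    "entity": [],
}


def paths_to_root(node, hypernyms):
    """Return every path from `node` up to a root (a node with no hypernyms)."""
    parents = hypernyms[node]
    if not parents:
        return [[node]]
    paths = []
    for parent in parents:
        # extend each path found from the parent with the current node
        for tail in paths_to_root(parent, hypernyms):
            paths.append([node] + tail)
    return paths


for path in paths_to_root("dog", TOY_HYPERNYMS):
    print(" -> ".join(path))
```

Following only the first hypernym at each level would report just the first of these paths.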

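4. Let's output the hypernyms, hyponyms, meronyms, holonyms, and antonyms again in compact form. Note that selected_synset still refers to 'dog.n.01' here, so the output describes the noun "dog", not the verb "run":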

In [23]:
print(f"Hypernyms: {selected_synset.hypernyms()}")
print(f"Hyponyms: {selected_synset.hyponyms()}")
print(f"Meronyms: {selected_synset.part_meronyms()}")
print(f"Holonyms: {selected_synset.part_holonyms()}")
print(f"Antonym: {selected_synset.lemmas()[0].antonyms()}")

Hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
Hyponyms: [Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]
Meronyms: [Synset('flag.n.07')]
Holonyms: []
Antonym: []


5. Let's select the verb "run" and output all synsets:

In [24]:
from nltk.corpus import wordnet

verb_synsets = wordnet.synsets('run', pos='v')
for synset in verb_synsets:
    print(synset, synset.definition())

Synset('run.v.01') move fast by using one's feet, with one foot off the ground at any given time
Synset('scat.v.01') flee; take to one's heels; cut and run
Synset('run.v.03') stretch out over a distance, space, time, or scope; run or extend between two points or beyond a certain point
Synset('operate.v.01') direct or control; projects, businesses, etc.
Synset('run.v.05') have a particular form
Synset('run.v.06') move along, of liquids
Synset('function.v.01') perform as expected when applied
Synset('range.v.01') change or be different within limits
Synset('campaign.v.01') run, stand, or compete for an office or a position
Synset('play.v.18') cause to emit recorded audio or video
Synset('run.v.11') move about freely and without restraint, or act as if running around in an uncontrolled way
Synset('tend.v.01') have a tendency or disposition to do or be something; be inclined
Synset('run.v.13') be operating, running or functioning
Synset('run.v.14') change from one state to another
Synset('

*   This code outputs every verb synset of "run" together with its definition.




6. Let's select the first synset from the list, extract its definition, usage examples, and lemmas, and traverse up the WordNet hierarchy as far as we can, outputting the synsets as we go:

In [25]:
from nltk.corpus import wordnet

verb = 'run' # select a verb
synsets = wordnet.synsets(verb, pos=wordnet.VERB) # get all synsets for the verb and the 'verb' part of speech

if synsets:
    selected_synset = synsets[0] # select the first synset

    # extract definition, usage examples, and lemmas
    definition = selected_synset.definition()
    examples = selected_synset.examples()
    lemmas = selected_synset.lemmas()

    # traverse up the hierarchy and output synsets
    hypernyms = selected_synset.hypernyms()
    while hypernyms:
        hypernym = hypernyms[0]
        print(hypernym)
        hypernyms = hypernym.hypernyms()

    # write observations about the organization of WordNet for verbs
    print("WordNet organizes verbs into synsets based on their semantic relationships. These synsets are hierarchical, with each synset having hypernyms (more general terms) and hyponyms (more specific terms). This organization allows for easy navigation of the semantic relationships between different verbs.")

Synset('travel_rapidly.v.01')
Synset('travel.v.01')
WordNet organizes verbs into synsets based on their semantic relationships. These synsets are hierarchical, with each synset having hypernyms (more general terms) and hyponyms (more specific terms). This organization allows for easy navigation of the semantic relationships between different verbs.




*   WordNet organizes verbs into synsets based on their semantic relationships. These synsets are hierarchical, with each synset having hypernyms (more general terms) and hyponyms (more specific terms). This organization allows for easy navigation of the semantic relationships between different verbs.



In [30]:
from nltk.corpus import wordnet

synset = wordnet.synset('run.v.01')
print('Definition:', synset.definition())
print('Usage examples:', synset.examples())
print('Lemmas:', synset.lemmas())

hypernyms = synset.hypernyms()
while hypernyms:
    print('Hypernyms:', hypernyms)
    hypernyms = hypernyms[0].hypernyms()

Definition: move fast by using one's feet, with one foot off the ground at any given time
Usage examples: ["Don't run--you'll be out of breath", 'The children ran to the store']
Lemmas: [Lemma('run.v.01.run')]
Hypernyms: [Synset('travel_rapidly.v.01')]
Hypernyms: [Synset('travel.v.01')]




*   This code outputs the definition, examples, and lemmas of 'run.v.01', then traverses up the WordNet hierarchy by printing the hypernyms at each level until it reaches a synset with no hypernyms. Because the loop follows only the first hypernym at each level (hypernyms[0]), it traces a single path. WordNet links each verb to its hypernym (or parent) synsets, which express more general concepts. This hierarchy can be useful in NLP tasks such as word sense disambiguation.



In [29]:
import nltk
from nltk.corpus import wordnet

# Select a synset
synset = wordnet.synset('eat.v.01')

# Extract definition, usage examples, and lemmas
definition = synset.definition()
examples = synset.examples()
lemmas = synset.lemmas()

# Print definition, examples, and lemmas
print("Definition:", definition)
print("Examples:", examples)
print("Lemmas:", [lemma.name() for lemma in lemmas])

# Traverse up the WordNet hierarchy
current_synset = synset
while True:
    hypernyms = current_synset.hypernyms()
    if len(hypernyms) == 0:
        break
    print("Hypernyms:")
    for hypernym in hypernyms:
        print("- ", hypernym.name(), ": ", hypernym.definition())
    current_synset = hypernyms[0]  # Continue with the first hypernym

Definition: take in solid food
Examples: ['She was eating a banana', 'What did you eat for dinner last night?']
Lemmas: ['eat']
Hypernyms:
-  consume.v.02 :  serve oneself to, or consume regularly
-  eat.v.02 :  eat a meal; take a meal


*   WordNet organizes verbs hierarchically based on their semantic relationships. The hypernyms() method returns a list of synsets that represent the immediate hypernyms (or parent synsets) of the current synset. This allows us to traverse up the hierarchy from a specific synset to more general ones. Unlike nouns, which all lead up to the single root entity.n.01, verbs have no common root: each chain ends at a top-level verb synset with no hypernyms, such as travel.v.01 or consume.v.02 in the outputs above. The loop above continues only from the first hypernym, which is why the traversal from eat.v.01 stops after one level (consume.v.02 has no hypernyms) and eat.v.02 is not explored further. More general synsets sit higher up and more specific ones lower down, which supports retrieval and comparison of related words.


7. The lemmatize() method of the WordNetLemmatizer class, which relies on WordNet's morphy function, reduces an inflected word form to its base form (lemma):

In [31]:
from nltk.stem import WordNetLemmatizer

word = 'ran'
lemmatizer = WordNetLemmatizer()
# lemmatize as a verb; with the default pos='n', 'ran' would be returned unchanged
forms = {lemmatizer.lemmatize(word, pos='v')}
print(forms)

{'run'}


In [32]:
from nltk.stem import WordNetLemmatizer

word = 'eat'
lemmatizer = WordNetLemmatizer()
# 'eat' is already a base form, so the verb lemma is the word itself
forms = {lemmatizer.lemmatize(word, pos='v')}
print(forms)

{'eat'}




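- Here's an example of using WordNet's morphy machinery (through the internal _morphy() helper, which returns a list of candidate lemmas) to reduce several forms of the word "run" to their base form: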



In [33]:
from nltk.corpus import wordnet

word_forms = set()
for form in ['run', 'running', 'runs', 'ran']:
    word_forms.update(set(wordnet._morphy(form, wordnet.VERB)))
print(word_forms)

{'run'}




*   This code reduces the forms "run", "running", "runs", and "ran" to their base form using morphy. The _morphy() method of the wordnet corpus reader returns the lemmas that could have produced a given word form for the specified part of speech, here wordnet.VERB. Collecting the results in a set removes duplicates, so all four forms collapse to the single lemma 'run'. Note that _morphy() is an internal helper whose behavior may differ between NLTK versions; the public wordnet.morphy() returns a single lemma or None.
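
*   To make the idea concrete, here is a minimal sketch of a morphy-style lookup: irregular forms are checked in an exception list first, then suffix rules are tried and only candidates found in the lexicon are kept. The lexicon TOY_VERBS, the exception list TOY_EXCEPTIONS, and the function toy_morphy are hypothetical stand-ins written for this example, not WordNet's real data files or NLTK's implementation.

```python
# Hypothetical miniature lexicon and exception list (not WordNet's real files).
TOY_VERBS = {"run", "eat", "chase", "love"}
TOY_EXCEPTIONS = {"ran": ["run"], "running": ["run"], "ate": ["eat"]}
# Suffix rules in the spirit of morphy's verb rules: (suffix, replacement).
VERB_RULES = [("s", ""), ("ies", "y"), ("es", "e"), ("es", ""),
              ("ed", "e"), ("ed", ""), ("ing", "e"), ("ing", "")]


def toy_morphy(form):
    """Return the base forms of `form` that exist in the toy lexicon."""
    if form in TOY_EXCEPTIONS:  # irregular forms are looked up first
        return list(TOY_EXCEPTIONS[form])
    candidates = [form]
    for suffix, replacement in VERB_RULES:
        if form.endswith(suffix):
            candidates.append(form[: len(form) - len(suffix)] + replacement)
    # keep known lemmas only, dropping duplicates but preserving order
    seen, result = set(), []
    for cand in candidates:
        if cand in TOY_VERBS and cand not in seen:
            seen.add(cand)
            result.append(cand)
    return result


base_forms = {base for form in ["run", "running", "runs", "ran"]
              for base in toy_morphy(form)}
print(base_forms)
```

Forms with a doubled consonant such as "running" cannot be recovered by suffix stripping alone, which is why the sketch lists them as exceptions.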



In [34]:
selected_synset = synsets[0]
print("Definition:", selected_synset.definition())
print("Usage examples:", selected_synset.examples())
print("Lemmas:", selected_synset.lemmas())

print("\nHypernyms:")
for hypernym in selected_synset.hypernyms():
    print(hypernym)

print("\nHyponyms:")
for hyponym in selected_synset.hyponyms():
    print(hyponym)

print("\nMeronyms:")
for meronym in selected_synset.part_meronyms() + selected_synset.substance_meronyms() + selected_synset.member_meronyms():
    print(meronym)

print("\nHolonyms:")
for holonym in selected_synset.part_holonyms() + selected_synset.substance_holonyms() + selected_synset.member_holonyms():
    print(holonym)

print("\nAntonyms:")
for lemma in selected_synset.lemmas():
    for antonym in lemma.antonyms():
        print(antonym)


Definition: move fast by using one's feet, with one foot off the ground at any given time
Usage examples: ["Don't run--you'll be out of breath", 'The children ran to the store']
Lemmas: [Lemma('run.v.01.run')]

Hypernyms:
Synset('travel_rapidly.v.01')

Hyponyms:
Synset('hare.v.01')
Synset('jog.v.03')
Synset('lope.v.01')
Synset('outrun.v.01')
Synset('romp.v.02')
Synset('run.v.33')
Synset('run_bases.v.01')
Synset('rush.v.05')
Synset('scurry.v.01')
Synset('sprint.v.01')
Synset('streak.v.02')
Synset('trot.v.01')

Meronyms:

Holonyms:

Antonyms:


We observe that WordNet is organized hierarchically for verbs as well, with each synset having hypernyms and hyponyms.

In [35]:
synset = wordnet.synset('run.v.01')
print('Definition:', synset.definition())
print('Usage Examples:', synset.examples())
print('Lemmas:', [lemma.name() for lemma in synset.lemmas()])

hypernym = synset.hypernyms()
while hypernym:
    print('Hypernym:', hypernym[0])
    hypernym = hypernym[0].hypernyms()

Definition: move fast by using one's feet, with one foot off the ground at any given time
Usage Examples: ["Don't run--you'll be out of breath", 'The children ran to the store']
Lemmas: ['run']
Hypernym: Synset('travel_rapidly.v.01')
Hypernym: Synset('travel.v.01')


8. Let's select two words that we think might be similar, "car" and "automobile", find the specific synsets we are interested in, and then run the Wu-Palmer similarity metric and the Lesk algorithm:

In [36]:
from nltk.corpus import wordnet
from nltk.wsd import lesk

word1 = "car"
word2 = "automobile"

synset1 = wordnet.synset(f"{word1}.n.01")
synset2 = wordnet.synset(f"{word2}.n.01")

wup_similarity = synset1.wup_similarity(synset2)
print("Wu-Palmer similarity:", wup_similarity)

lesk_similarity = lesk([word1, word2], word1, "n").wup_similarity(lesk([word1, word2], word2, "n"))
print("Lesk algorithm similarity:", lesk_similarity)

Wu-Palmer similarity: 1.0
Lesk algorithm similarity: 0.47619047619047616




*   Let's select two other similar words, "dog" and "cat", find their specific synsets, and run the Wu-Palmer (wup_similarity) and Leacock-Chodorow (lch_similarity) metrics:


In [37]:
word1 = wordnet.synset('dog.n.01')
word2 = wordnet.synset('cat.n.01')
print(word1.wup_similarity(word2))
print(word1.lch_similarity(word2))

0.8571428571428571
2.0281482472922856
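
*   As a rough illustration of what wup_similarity measures, the sketch below computes a Wu-Palmer style score, 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the most specific common ancestor. The taxonomy TOY_PARENT is a hypothetical toy tree, so the numbers it produces are not expected to match the WordNet scores above.

```python
# Hypothetical toy taxonomy: each node maps to its single parent (None for the root).
TOY_PARENT = {
    "entity": None,
    "animal": "entity",
    "carnivore": "animal",
    "canine": "carnivore",
    "feline": "carnivore",
    "dog": "canine",
    "cat": "feline",
}


def ancestors(node):
    """Path from `node` up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = TOY_PARENT[node]
    return path


def depth(node):
    """Number of nodes on the path from the root to `node` (the root has depth 1)."""
    return len(ancestors(node))


def wup(a, b):
    """Wu-Palmer style similarity: 2 * depth(lcs) / (depth(a) + depth(b))."""
    ancestors_b = set(ancestors(b))
    # the first shared node on a's upward path is the deepest common ancestor
    lcs = next(n for n in ancestors(a) if n in ancestors_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wup("dog", "cat"))
print(wup("dog", "dog"))
```

The deeper the common ancestor sits relative to the two nodes, the closer the score gets to 1.0.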




*   The Wu-Palmer metric and the Lesk algorithm do different jobs. Wu-Palmer is a similarity measure between two synsets: it is based on the depths of the two synsets in the WordNet hierarchy and the depth of their most specific common ancestor, and it yields a score between 0 and 1, with 1.0 for identical synsets. The Lesk algorithm is a word sense disambiguation method: given a word and its context, it returns the synset whose gloss (i.e., definition) overlaps most with the context words. The "Lesk algorithm similarity" printed above is therefore the Wu-Palmer score between the two senses that Lesk selected for "car" and "automobile"; with only two context words to go on, it picked two different senses, which explains why the value is below 1.0. Measures like these are useful in NLP applications such as information retrieval, text classification, and machine translation.
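
*   The gloss-overlap idea behind Lesk can be sketched in a few lines: for each candidate sense, count the words its gloss shares with the context and keep the best one. The glosses in TOY_SENSES, the stopword list, and the function simplified_lesk are assumptions made up for this example; nltk.wsd.lesk works on real WordNet glosses and differs in detail.

```python
# Hypothetical glosses for two senses of "bank", written for this example.
TOY_SENSES = {
    "bank.river": "sloping land beside a body of water such as a river",
    "bank.money": "a financial institution that accepts deposits and lends money",
}
STOPWORDS = {"a", "the", "of", "that", "and", "as", "such", "to", "in", "i", "my", "at"}


def tokens(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return {w for w in text.lower().split() if w not in STOPWORDS}


def simplified_lesk(context, senses):
    """Pick the sense whose gloss shares the most words with the context."""
    context_words = tokens(context)
    return max(senses, key=lambda s: len(tokens(senses[s]) & context_words))


print(simplified_lesk("I deposited money at the bank", TOY_SENSES))
print(simplified_lesk("We sat by the river near the water", TOY_SENSES))
```

With very short contexts there is little to overlap with, so the chosen sense can be arbitrary.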





*   SentiWordNet is a lexical resource that assigns sentiment scores to synsets in WordNet. It is used in NLP applications such as sentiment analysis, text classification, and opinion mining. The following code selects an emotionally charged word, "love", finds its senti-synsets, and outputs their polarity scores:



In [38]:
nltk.download('sentiwordnet')
from nltk.corpus import sentiwordnet as swn
from nltk.corpus import wordnet

word = 'love'

synsets = wordnet.synsets(word)
for synset in synsets:
    senti_synset = swn.senti_synset(synset.name())
    print('SentiSynset:', senti_synset)
    print('Positive Score:', senti_synset.pos_score())
    print('Negative Score:', senti_synset.neg_score())
    print('Objective Score:', senti_synset.obj_score())

[nltk_data] Downloading package sentiwordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/sentiwordnet.zip.


SentiSynset: <love.n.01: PosScore=0.625 NegScore=0.0>
Positive Score: 0.625
Negative Score: 0.0
Objective Score: 0.375
SentiSynset: <love.n.02: PosScore=0.375 NegScore=0.0>
Positive Score: 0.375
Negative Score: 0.0
Objective Score: 0.625
SentiSynset: <beloved.n.01: PosScore=0.125 NegScore=0.0>
Positive Score: 0.125
Negative Score: 0.0
Objective Score: 0.875
SentiSynset: <love.n.04: PosScore=0.25 NegScore=0.0>
Positive Score: 0.25
Negative Score: 0.0
Objective Score: 0.75
SentiSynset: <love.n.05: PosScore=0.0 NegScore=0.0>
Positive Score: 0.0
Negative Score: 0.0
Objective Score: 1.0
SentiSynset: <sexual_love.n.02: PosScore=0.0 NegScore=0.0>
Positive Score: 0.0
Negative Score: 0.0
Objective Score: 1.0
SentiSynset: <love.v.01: PosScore=0.5 NegScore=0.0>
Positive Score: 0.5
Negative Score: 0.0
Objective Score: 0.5
SentiSynset: <love.v.02: PosScore=1.0 NegScore=0.0>
Positive Score: 1.0
Negative Score: 0.0
Objective Score: 0.0
SentiSynset: <love.v.03: PosScore=0.625 NegScore=0.0>
Positive Sc

In [39]:
from nltk.corpus import sentiwordnet as swn

word = "love"
synsets = wordnet.synsets(word)
senti_synsets = [swn.senti_synset(synset.name()) for synset in synsets]

for senti_synset in senti_synsets:
    print(senti_synset.synset.name(), senti_synset.pos_score(), senti_synset.neg_score(), senti_synset.obj_score())

love.n.01 0.625 0.0 0.375
love.n.02 0.375 0.0 0.625
beloved.n.01 0.125 0.0 0.875
love.n.04 0.25 0.0 0.75
love.n.05 0.0 0.0 1.0
sexual_love.n.02 0.0 0.0 1.0
love.v.01 0.5 0.0 0.5
love.v.02 1.0 0.0 0.0
love.v.03 0.625 0.0 0.375
sleep_together.v.01 0.375 0.125 0.5




*   SentiWordNet provides three scores for each senti-synset: positivity, negativity, and objectivity. The three sum to 1, so the objective score equals 1 minus the positive and negative scores, as the rows above show. These scores can be used in sentiment analysis to classify text as positive, negative, or neutral.



In [40]:
senti_synsets = list(swn.senti_synsets('love', 'v'))
for synset in senti_synsets:
    print(synset.synset.name(), synset.pos_score(), synset.neg_score())

love.v.01 0.5 0.0
love.v.02 1.0 0.0
love.v.03 0.625 0.0
sleep_together.v.01 0.375 0.125




*   The polarity scores of the synsets in SentiWordNet indicate the degree of positivity (pos_score) or negativity (neg_score) of a specific word sense, so the same word can carry a different polarity in each of its senses. These scores can be used in sentiment analysis applications to better understand the sentiment of text.
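
*   One simple way to turn per-sense scores into a single word-level number is to average pos minus neg over the senses. The sketch below does this with the four (pos_score, neg_score) pairs copied from the In [40] output above; the helper names objectivity and mean_net_polarity are hypothetical, and averaging is only one of several possible aggregation choices.

```python
# (pos_score, neg_score) pairs copied from the output above for the verb senses of "love".
LOVE_VERB_SCORES = {
    "love.v.01": (0.5, 0.0),
    "love.v.02": (1.0, 0.0),
    "love.v.03": (0.625, 0.0),
    "sleep_together.v.01": (0.375, 0.125),
}


def objectivity(pos, neg):
    """Objective score is what remains after positive and negative: 1 - (pos + neg)."""
    return 1.0 - (pos + neg)


def mean_net_polarity(scores):
    """Average of (pos - neg) over all senses; a crude word-level polarity."""
    nets = [pos - neg for pos, neg in scores.values()]
    return sum(nets) / len(nets)


for name, (pos, neg) in LOVE_VERB_SCORES.items():
    print(name, "objective:", objectivity(pos, neg))
print("mean net polarity:", mean_net_polarity(LOVE_VERB_SCORES))
```

A value above zero suggests that, across its verb senses, the word leans positive.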





*   A collocation is a sequence of words that occur together more often than chance would predict and form a meaningful expression, such as "Chief Magistrate". Collocations can help infer the meaning of individual words and provide useful information for language modeling, text classification, and other NLP tasks. Here's Python code to output collocations for the Inaugural corpus (text4 in the NLTK book) and calculate mutual information:




*   This code prints the ten bigrams with the highest pointwise mutual information (PMI) among those that occur at least five times in the Inaugural corpus. Because the raw text is split on whitespace, punctuation stays attached to the tokens.


In [41]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import inaugural

text = inaugural.raw()

finder = BigramCollocationFinder.from_words(text.split())
finder.apply_freq_filter(5)
collocations = finder.nbest(BigramAssocMeasures().pmi, 10)
#print(collocations)
for collocation in collocations:
  print(collocation)

('preserve,', 'protect,')
('coordinate', 'branches')
('Chief', 'Magistrate')
('Thank', 'you.')
('Chief', 'Justice,')
('Vice', 'President,')
('Almighty', 'God.')
('President', 'Bush,')
('God', 'bless')
('Fellow', 'citizens,')




*   To calculate the mutual information of a collocation, we need the frequency of the collocation, the individual frequencies of its constituent words, and the total number of tokens in the corpus.
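
*   As a worked sketch of that calculation, pointwise mutual information compares how often the pair occurs with how often it would occur if the two words were independent. The function below and the counts passed to it are hypothetical values chosen to keep the arithmetic easy; they are not taken from the Inaugural corpus.

```python
import math


def pmi(n_ii, n_ix, n_xi, n_xx):
    """Pointwise mutual information of a bigram, in bits.

    n_ii: count of the bigram (w1, w2)
    n_ix: count of w1
    n_xi: count of w2
    n_xx: total number of tokens in the corpus
    """
    return math.log2(n_ii * n_xx) - math.log2(n_ix * n_xi)


# Hypothetical counts: the pair occurs 8 times, its words 16 and 32 times,
# in a corpus of 4096 tokens.
print(pmi(8, 16, 32, 4096))
# If the words were independent, the pair count would be 64 * 64 / 4096 = 1.
print(pmi(1, 64, 64, 4096))
```

A score of zero means the pair occurs exactly as often as independence predicts; larger scores mean a stronger association.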





*   Here's another example of how to output collocations and calculate mutual information:


In [42]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder
from nltk.corpus import inaugural

text = inaugural.words()
finder = BigramCollocationFinder.from_words(text)
bigram_measures = BigramAssocMeasures()
collocations = finder.nbest(bigram_measures.pmi, 10)

for collocation in collocations:
    if 'freedom' in collocation:
        print(collocation)



*   The code above finds the top 10 bigram collocations by pointwise mutual information (PMI) score and prints those that contain the word 'freedom'. Nothing is printed here: without a frequency filter, the top of a PMI ranking is dominated by rare word pairs, and none of the top ten contains 'freedom'. We can still score the bigram ('freedom', 'of') directly from its counts using the following code:



In [43]:
from nltk.probability import FreqDist
from nltk.collocations import BigramAssocMeasures

word_fd = FreqDist(text)
bigram_fd = FreqDist(nltk.bigrams(text))
N = bigram_fd.N()

freedom_of_freq = bigram_fd[('freedom', 'of')]
freedom_freq = word_fd['freedom']
of_freq = word_fd['of']

mi = BigramAssocMeasures().mi_like(freedom_of_freq, (freedom_freq, of_freq), N)
print(mi)

0.012108460811208754




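*   The score of about 0.012 was computed with mi_like, a variant of mutual information that divides the bigram count raised to a power (3 by default) by the product of the two word counts; it is not the same quantity as PMI. On its own the number says little about how strongly 'freedom' and 'of' are associated: it becomes informative only when compared with the scores of other bigrams in the same corpus.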
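
*   For reference, the arithmetic behind the mi_like measure used in the cell above can be sketched as follows. The standalone function and its counts are hypothetical and only mirror the formula (bigram count to the given power, divided by the product of the word counts); unlike PMI, it takes no logarithm and does not use the corpus size.

```python
def mi_like(n_ii, n_ix, n_xi, power=3):
    """Bigram count raised to `power`, divided by the product of the word counts."""
    return n_ii ** power / (n_ix * n_xi)


# Hypothetical counts: the pair occurs 8 times, its words 16 and 32 times.
print(mi_like(8, 16, 32))
# Same word counts, but the pair is seen only twice: the score drops sharply.
print(mi_like(2, 16, 32))
```

Because the bigram count is cubed, the measure favors pairs that are frequent as well as strongly associated.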



In [45]:
import nltk

# Load the Inaugural corpus
nltk.download('inaugural')
text = nltk.corpus.inaugural.words()

# Generate collocations using the BigramAssocMeasures function
bigram_measures = nltk.collocations.BigramAssocMeasures()
finder = nltk.collocations.BigramCollocationFinder.from_words(text)
finder.apply_freq_filter(5)
collocations = finder.nbest(bigram_measures.pmi, 10)

# Select the top-ranked collocation
collocation = collocations[0]
# NOTE: the counts below are those of ('freedom', 'of') from the earlier cell,
# so this mi_like score describes that bigram, not `collocation`
mi = BigramAssocMeasures().mi_like(freedom_of_freq, (freedom_freq, of_freq), N)
print("Selected collocation:", collocation)
print("Mutual information:", mi)

[nltk_data] Downloading package inaugural to /root/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!


Selected collocation: ('Indian', 'tribes')
Mutual information: 0.012108460811208754




*   This code ranks the bigrams that occur at least five times by pointwise mutual information (PMI), keeps the top 10, and selects the first one, ('Indian', 'tribes'). It does not filter stopwords or short words. Note that the printed score reuses the ('freedom', 'of') counts from the earlier cell, which is why it again equals 0.0121...; it is the mi_like score of that bigram, not of the selected collocation.
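
*   To show the whole pipeline in one place (count bigrams, drop rare ones, rank by PMI, score each pair from its own counts), here is a minimal pure-Python sketch. The function top_bigrams_by_pmi and the toy text are assumptions written for this example; it approximates what the collocation finder does and is not NLTK's implementation.

```python
import math
from collections import Counter


def top_bigrams_by_pmi(tokens, min_freq=2, n=3):
    """Rank adjacent word pairs by PMI, ignoring pairs seen fewer than `min_freq` times."""
    word_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    scored = []
    for (w1, w2), n_ii in bigram_counts.items():
        if n_ii < min_freq:
            continue  # frequency filter: rare pairs get unreliable PMI scores
        score = math.log2(n_ii * total) - math.log2(word_counts[w1] * word_counts[w2])
        scored.append(((w1, w2), score))
    # highest PMI first; ties broken alphabetically so the order is reproducible
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:n]


# Hypothetical toy text, not taken from the Inaugural corpus.
toy_tokens = ("the united states and the people of the united states "
              "thank the people").split()

for pair, score in top_bigrams_by_pmi(toy_tokens):
    print(pair, round(score, 3))
```

Each pair is scored from its own counts, so the number printed next to a collocation always belongs to that collocation.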

