# Accessing WordNet through the NLTK interface

>- [Accessing WordNet](#Accessing-WordNet)
>
>
>- [WN-based Semantic Similarity](#WN-based-Semantic-Similarity)

---

## Accessing WordNet

WordNet 3.0 can be accessed from NLTK through the corresponding corpus reader:

In [None]:
import nltk
from nltk.corpus import wordnet as wn

### Retrieving Synsets

The easiest way to retrieve synsets is to pass the relevant lemma to the `synsets()` method, which returns the list of all the synsets containing it:

In [10]:
print(wn.synsets('dog'))

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01'), Synset('chase.v.01')]


The optional parameter `pos` allows you to constrain the search to a given part of speech

- available options: `wn.NOUN`, `wn.VERB`, `wn.ADJ`, `wn.ADV`

In [11]:
# let's ignore the verbal synsets from our previous results
print(wn.synsets('dog', pos = wn.NOUN))

[Synset('dog.n.01'), Synset('frump.n.01'), Synset('dog.n.03'), Synset('cad.n.01'), Synset('frank.n.02'), Synset('pawl.n.01'), Synset('andiron.n.01')]


You can use the `synset()` method together with the notation `lemma.pos.number` (e.g. `dog.n.01`) to access a given synset

In [12]:
# retrieve the gloss of a given synset
wn.synset('dog.n.01').definition()

'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

In [13]:
# let's see some examples
wn.synset('dog.n.01').examples()

['the dog barked all night']

Did anyone notice something weird in these results? Why did I get `frank.n.02`?

In [14]:
# let's retrieve the lemmas associated with a given synset
wn.synset('frank.n.02').lemmas()

[Lemma('frank.n.02.frank'),
 Lemma('frank.n.02.frankfurter'),
 Lemma('frank.n.02.hotdog'),
 Lemma('frank.n.02.hot_dog'),
 Lemma('frank.n.02.dog'),
 Lemma('frank.n.02.wiener'),
 Lemma('frank.n.02.wienerwurst'),
 Lemma('frank.n.02.weenie')]

What's the definition?

In [15]:
wn.synset('frank.n.02').definition()

'a smooth-textured sausage of minced beef or pork usually smoked; often served on a bread roll'

The notation `lemma.pos.number` identifies the **name** of the synset, that is, the unique id used to store it in the semantic resource

- note that it is different from the notation used to refer to synset lemmas, e.g. `frank.n.02.frank`

In [16]:
wn.synset('frank.n.02').name()

'frank.n.02'

Applied to our original query...

In [17]:
# synsets for a given word
wn.synsets('dog', pos = wn.NOUN)

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01')]

In [18]:
# synonyms for a particular meaning of a word
wn.synset('dog.n.01').lemmas()

[Lemma('dog.n.01.dog'),
 Lemma('dog.n.01.domestic_dog'),
 Lemma('dog.n.01.Canis_familiaris')]

In [19]:
wn.synset('dog.n.01').definition()

'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'

In [20]:
wn.synset('dog.n.03').lemmas()

[Lemma('dog.n.03.dog')]

In [21]:
wn.synset('dog.n.03').definition()

'informal term for a man'

**Q. How are the senses in WordNet ordered?**

A. *WordNet senses are ordered using sparse data from semantically tagged text. The order of the senses is given simply so that some of the most common uses are listed above others (and those for which there is no data are randomly ordered). The sense numbers and ordering of senses in WordNet should be considered random for research purposes.*

(source: the [FAQ section](https://wordnet.princeton.edu/frequently-asked-questions) of the official WordNet web page)

Finally, the method `all_synsets()` allows you to retrieve all the synsets in the resource:

In [22]:
for synset in list(wn.all_synsets())[:10]:
    print(synset)

Synset('able.a.01')
Synset('unable.a.01')
Synset('abaxial.a.01')
Synset('adaxial.a.01')
Synset('acroscopic.a.01')
Synset('basiscopic.a.01')
Synset('abducent.a.01')
Synset('adducent.a.01')
Synset('nascent.a.01')
Synset('emergent.s.02')


... again, you can use the optional `pos` parameter to constrain your search:

In [23]:
for synset in list(wn.all_synsets(wn.ADV))[:10]:
    print(synset)

Synset('a_cappella.r.01')
Synset('ad.r.01')
Synset('ce.r.01')
Synset('bc.r.01')
Synset('bce.r.01')
Synset('horseback.r.01')
Synset('barely.r.01')
Synset('just.r.06')
Synset('hardly.r.02')
Synset('anisotropically.r.01')


### Retrieving Semantic and Lexical Relations

#### the Nouns sub-net

NLTK makes it easy to explore the WordNet hierarchy. The `hyponyms()` method allows you to retrieve all the immediate hyponyms of our target synset 

In [24]:
wn.synset('dog.n.01').hyponyms()

[Synset('basenji.n.01'),
 Synset('corgi.n.01'),
 Synset('cur.n.01'),
 Synset('dalmatian.n.02'),
 Synset('great_pyrenees.n.01'),
 Synset('griffon.n.02'),
 Synset('hunting_dog.n.01'),
 Synset('lapdog.n.01'),
 Synset('leonberg.n.01'),
 Synset('mexican_hairless.n.01'),
 Synset('newfoundland.n.01'),
 Synset('pooch.n.01'),
 Synset('poodle.n.01'),
 Synset('pug.n.01'),
 Synset('puppy.n.01'),
 Synset('spitz.n.01'),
 Synset('toy_dog.n.01'),
 Synset('working_dog.n.01')]

To move in the opposite direction (i.e. towards more general synsets) we can use:

- either the `hypernyms()` method to retrieve the immediate hypernym (or hypernyms in the following case)

In [25]:
wn.synset('dog.n.01').hypernyms()

[Synset('canine.n.02'), Synset('domestic_animal.n.01')]

- or the `hypernym_paths()` method to retrieve every hypernym chain **up to the root node**

In [26]:
wn.synset('dog.n.01').hypernym_paths()

[[Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('living_thing.n.01'),
  Synset('organism.n.01'),
  Synset('animal.n.01'),
  Synset('chordate.n.01'),
  Synset('vertebrate.n.01'),
  Synset('mammal.n.01'),
  Synset('placental.n.01'),
  Synset('carnivore.n.01'),
  Synset('canine.n.02'),
  Synset('dog.n.01')],
 [Synset('entity.n.01'),
  Synset('physical_entity.n.01'),
  Synset('object.n.01'),
  Synset('whole.n.02'),
  Synset('living_thing.n.01'),
  Synset('organism.n.01'),
  Synset('animal.n.01'),
  Synset('domestic_animal.n.01'),
  Synset('dog.n.01')]]
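To see where the two paths come from, here is a minimal sketch that rebuilds them without NLTK. The `HYPERNYMS` dictionary is a hand-copied toy fragment of the output above (plain strings instead of synsets), and the function is an illustration of the idea, not NLTK's actual implementation:

```python
# Toy fragment copied from the paths printed above: node -> immediate hypernyms.
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "carnivore": ["placental"],
    "placental": ["mammal"],
    "mammal": ["vertebrate"],
    "vertebrate": ["chordate"],
    "chordate": ["animal"],
    "domestic_animal": ["animal"],
    "animal": ["organism"],
    "organism": ["living_thing"],
    "living_thing": ["whole"],
    "whole": ["object"],
    "object": ["physical_entity"],
    "physical_entity": ["entity"],
    "entity": [],  # root node: no hypernyms
}


def hypernym_paths(node):
    """Return every root-to-node path, root first."""
    parents = HYPERNYMS[node]
    if not parents:
        return [[node]]
    return [path + [node]
            for parent in parents
            for path in hypernym_paths(parent)]


paths = hypernym_paths("dog")
for path in paths:
    print(len(path), " > ".join(path))
```

A node with more than one immediate hypernym, like `dog` here, yields one path per hypernym, which is why two chains of different length are returned.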

Another important semantic relation for the nouns sub-net is **meronymy**, which links an object (holonym) with its parts (meronyms). There are three semantic relations of this kind in WordNet:


- **Part meronymy**: the relation between an object and its separable components:

In [27]:
wn.synset('tree.n.01').part_meronyms()

[Synset('burl.n.02'),
 Synset('crown.n.07'),
 Synset('limb.n.02'),
 Synset('stump.n.01'),
 Synset('trunk.n.01')]

- **Substance meronymy**: the relation between an object and the substance it is made of

In [28]:
wn.synset('tree.n.01').substance_meronyms()

[Synset('heartwood.n.01'), Synset('sapwood.n.01')]

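- **Member meronymy**: the relation between a group and its members (queried here from the member's side: `member_holonyms()` returns the groups that `tree.n.01` belongs to)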

In [29]:
wn.synset('tree.n.01').member_holonyms()

[Synset('forest.n.01')]

**Instances** do not have hypernyms, but **instance_hypernyms**:

In [30]:
# Amsterdam is a national capital vs *Amsterdam is a kind of national capital
wn.synset('amsterdam.n.01').instance_hypernyms()

[Synset('national_capital.n.01')]

In [31]:
wn.synset('amsterdam.n.01').hypernyms()

[]

#### the Verbs sub-net

Moving to the Verbs sub-net, the **troponymy** relation can be navigated with the same methods used for the nominal hypernymy relations

In [32]:
wn.synset('sleep.v.01').hypernyms()

[Synset('rest.v.05')]

In [33]:
wn.synset('sleep.v.01').hypernym_paths()

[[Synset('lie.v.02'),
  Synset('recumb.v.01'),
  Synset('rest.v.05'),
  Synset('sleep.v.01')]]

The other central relation in the organization of the verbs is the **entailment** one:

In [34]:
wn.synset('eat.v.01').entailments()

[Synset('chew.v.01'), Synset('swallow.v.01')]

#### Adjective clusters

Adjectives are organized in clusters of **satellite** adjectives (labeled as `lemma.s.number`) connected to a central adjective (labeled as `lemma.a.number`) by means of the **similar_to** relation

In [35]:
# a satellite adjective is linked just to one central adjective
wn.synset('quick.s.01').similar_tos()

[Synset('fast.a.01')]

In [36]:
# a central adjective is linked to many satellite adjectives
wn.synset('fast.a.01').similar_tos()

[Synset('accelerated.s.01'),
 Synset('alacritous.s.01'),
 Synset('blistering.s.03'),
 Synset('double-quick.s.01'),
 Synset('express.s.02'),
 Synset('fast-breaking.s.01'),
 Synset('fast-paced.s.01'),
 Synset('fleet.s.01'),
 Synset('high-speed.s.01'),
 Synset('hurrying.s.01'),
 Synset('immediate.s.05'),
 Synset('instantaneous.s.01'),
 Synset('meteoric.s.03'),
 Synset('quick.s.01'),
 Synset('rapid.s.01'),
 Synset('rapid.s.02'),
 Synset('smart.s.06'),
 Synset('windy.s.03'),
 Synset('winged.s.02')]

The **lemmas** of the central adjective of each cluster, moreover, are connected to their **antonyms**, that is, to lemmas that have the opposite meaning

In [37]:
wn.lemma('fast.a.01.fast').antonyms()

[Lemma('slow.a.01.slow')]

But take note:

In [38]:
try:
    wn.synset('fast.a.01').antonyms()
except AttributeError:
    print("antonymy is a LEXICAL relation, it cannot involve synsets")

antonymy is a LEXICAL relation, it cannot involve synsets


## WN-based Semantic Similarity

Simulating the human ability to estimate semantic distances between concepts is crucial for:


- Psycholinguistics: for a long time the study of human semantic memory has been tied to the study of concept similarity


- Natural Language Processing: for any task that requires some sort of semantic comprehension

### Classes of Semantic Distance Measures

#### Relatedness

- two concepts are related if **a relation of any sort** holds between them


- information can be extracted from:

    - semantic networks
    
    - dictionaries
    
    - corpora

#### Similarity


- it is a special case of relatedness


- the relation holding between two concepts **by virtue of their ontological status**, i.e. by virtue of their taxonomic positions (Resnik, 1995)

    - car - bicycle
    - \*car - fuel


- information can be extracted from

    - hierarchical networks
    
    - taxonomies

### WordNet-based Similarity Measures

In [39]:
dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
fish = wn.synset('fish.n.01')
bird = wn.synset('bird.n.01')

#### Path Length-based measures


These measures are based on $pathlen(c_1, c_2)$: 

-  i.e. the number of arcs in the shortest path connecting two nodes $c_1$ and $c_2$

![alt text](images/pathlen.png)

You can use the `shortest_path_distance()` method to count the number of arcs

In [40]:
fish.shortest_path_distance(bird)

3

In [41]:
dog.shortest_path_distance(cat)

4
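The arc count can be reproduced on a toy taxonomy. The sketch below assumes that the shortest path runs through a shared hypernym: it climbs from each node, records the distance to every ancestor, and takes the smallest combined distance. The `HYPERNYMS` fragment is hand-written for illustration (the `cat > feline > carnivore` links are an assumption, they are not printed in this notebook), and the function is not NLTK's code:

```python
# Hand-written toy fragment: node -> immediate hypernyms (truncated at carnivore).
HYPERNYMS = {
    "dog": ["canine", "domestic_animal"],
    "canine": ["carnivore"],
    "cat": ["feline"],        # assumed link, for illustration only
    "feline": ["carnivore"],  # assumed link, for illustration only
    "carnivore": [],
    "domestic_animal": [],
    "hit": [],                # a node from a different sub-net
}


def ancestor_distances(node):
    """Map the node and each of its ancestors to the minimum number of arcs needed to reach it."""
    distances = {node: 0}
    frontier = [node]
    while frontier:
        next_frontier = []
        for current in frontier:
            for parent in HYPERNYMS[current]:
                if parent not in distances:
                    distances[parent] = distances[current] + 1
                    next_frontier.append(parent)
        frontier = next_frontier
    return distances


def shortest_path_distance(a, b):
    """Arcs on the shortest path through a shared ancestor, or None if there is none."""
    dist_a, dist_b = ancestor_distances(a), ancestor_distances(b)
    shared = dist_a.keys() & dist_b.keys()
    if not shared:
        return None
    return min(dist_a[n] + dist_b[n] for n in shared)


print(shortest_path_distance("dog", "cat"))
print(shortest_path_distance("dog", "hit"))
```

On this fragment `dog` and `cat` meet at `carnivore`, two arcs away from each, while `dog` and `hit` share no ancestor at all.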

When two nodes belong to different sub-nets, it does not return any value...

In [42]:
print(dog.shortest_path_distance(hit))

None


... unless you simulate the existence of a **dummy root** by setting the `simulate_root` option to `True`

In [43]:
print(dog.shortest_path_distance(hit, simulate_root = True))

12


This is quite handy, especially when working on the **verb sub-net**, which **does not have a unique root node** (unlike the nouns sub-net)

In [44]:
print(hit.shortest_path_distance(slap))

None


In [45]:
print(hit.shortest_path_distance(slap, simulate_root = True))

6


**Simple Path Length**:

$$sim_{simple}(c_1,c_2) = \frac{1}{pathlen(c_1,c_2) + 1}$$

use the `path_similarity()` method to calculate this measure

In [46]:
dog.path_similarity(cat)

0.2
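As a quick sanity check, the formula can be applied by hand to the arc counts printed earlier (4 for dog/cat, 3 for fish/bird). The helper name below is hypothetical, and returning `None` for unconnected nodes is a choice made for this sketch:

```python
def simple_path_similarity(pathlen):
    """1 / (pathlen + 1); None when the two nodes are not connected."""
    if pathlen is None:
        return None
    return 1 / (pathlen + 1)


print(simple_path_similarity(4))  # dog / cat
print(simple_path_similarity(3))  # fish / bird
```

The score is 1 for identical nodes (`pathlen` = 0) and shrinks towards 0 as the path grows.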

**Leacock & Chodorow (1998)**

$$sim_{L\&C}(c_1,c_2) = -log \left(\frac{pathlen(c_1,c_2)}{2 \times D}\right)$$

where $D$ is the maximum depth of the taxonomy

- as a consequence, $2 \times D$ is the maximum possible pathlen

In [47]:
dog.lch_similarity(cat)

2.0281482472922856
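The printed value can be reproduced by hand under two assumptions that are not stated in this notebook: that the path is counted in nodes rather than arcs (so dog/cat gives 4 + 1 = 5) and that $D$ = 19 for the noun taxonomy. Both numbers are inferred from the output, so treat this as a sketch rather than a description of NLTK's internals:

```python
import math


def lch_similarity(pathlen, max_depth):
    """-log(pathlen / (2 * D)), as in the formula above."""
    return -math.log(pathlen / (2 * max_depth))


# Assumptions: path counted in nodes (4 arcs + 1) and D = 19 for nouns.
dog_cat = lch_similarity(4 + 1, 19)
print(dog_cat)
```

With the raw arc count (4) the same formula gives a larger value, so it is worth checking which convention your toolkit follows before comparing scores across implementations.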

You cannot compare synsets belonging to different parts of speech:

In [48]:
try:
    dog.lch_similarity(hit)
except Exception as e:
    print(e)

Computing the lch similarity requires Synset('dog.n.01') and Synset('hit.v.01') to have the same part of speech.


#### Wu & Palmer (1994)

This measure is based on the notion of **Least Common Subsumer**

-  i.e. the lowest node that dominates both synsets, e.g. `LCS({fish}, {bird}) = {vertebrate, craniate}`

![alt text](images/lcs.png)

NLTK allows you to use the `lowest_common_hypernyms()` method to identify the Least Common Subsumer of two nodes

In [49]:
dog.lowest_common_hypernyms(cat)

[Synset('carnivore.n.01')]

If necessary, use the `simulate_root` option to simulate the existence of a dummy root:

In [50]:
print(hit.lowest_common_hypernyms(slap, simulate_root = True))

[Synset('*ROOT*')]


Wu & Palmer (1994) proposed to measure the semantic similarity between concepts by contrasting the depth of the LCS with the depths of the nodes:

$$sim_{W\&P}(c_1, c_2) = \frac{2 \times depth(LCS(c_1, c_2))}{depth(c_1) + depth(c_2)}$$

where $depth(s)$ is the number of arcs between the root node and the node $s$

The minimum and the maximum depths of each node can be calculated with the `min_depth()` and `max_depth()` methods

In [51]:
print(dog.min_depth(), dog.max_depth())

8 13


...and the `wup_similarity()` method (named after the authors) calculates this measure (option `simulate_root` available)

In [52]:
print(dog.wup_similarity(cat))

0.8571428571428571
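The dog/cat score can be checked by hand. The depths used below are assumptions chosen because they reproduce the printed value: depths are counted in nodes along the path through the LCS, which gives 12 for `carnivore.n.01` and 14 for `dog.n.01` (consistent with the longer hypernym path shown earlier), and 14 is assumed for `cat.n.01` as well:

```python
def wu_palmer(depth_lcs, depth_c1, depth_c2):
    """2 * depth(LCS) / (depth(c1) + depth(c2))."""
    return 2 * depth_lcs / (depth_c1 + depth_c2)


# Assumed node-counted depths: carnivore = 12, dog = 14, cat = 14.
dog_cat = wu_palmer(12, 14, 14)
print(dog_cat)
```

The score reaches 1 only when both concepts coincide with their LCS, and it gets smaller as the LCS moves closer to the root.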


#### Information Content-based measures

- the **Information Content** of a concept $C$ is derived from $P(C)$, the probability that a randomly selected word is an instance of the concept $C$ (i.e. the synset $c$ or one of its hyponyms)

$$IC(C) = -log(P(C))$$

- Following Resnik (1995), corpus frequencies can be used to estimate this probability

$$P(C) = \frac{freq(C)}{N} = \frac{\sum_{w \in words(c)}count(w)}{N}$$

- $words(c)$ = set of words that are hierarchically included by $C$ (i.e. its hyponyms)


- N = number of corpus tokens for which there is a representation in WordNet

A fragment of the WN nominal hierarchy, in which each node has been labeled with its $P(C)$ (from Lin, 1998)

![alt text](images/ic.png)
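A minimal sketch of how $P(C)$ and $IC(C)$ can be derived from counts. The taxonomy and the counts are made up for illustration (they are not taken from Brown, SemCor or the figure above); the point is only that the frequency of a concept includes the counts of everything below it:

```python
import math

# Hypothetical toy taxonomy (child -> parent) and made-up corpus counts.
PARENT = {"dog": "carnivore", "cat": "carnivore",
          "carnivore": "animal", "bird": "animal", "animal": None}
COUNT = {"dog": 30, "cat": 20, "carnivore": 5, "bird": 25, "animal": 20}

N = sum(COUNT.values())  # tokens that have a representation in the toy taxonomy


def concept_frequency(concept):
    """Count of the concept itself plus the counts of all its hyponyms."""
    total = COUNT[concept]
    for child, parent in PARENT.items():
        if parent == concept:
            total += concept_frequency(child)
    return total


def information_content(concept):
    """IC(C) = -log(P(C)), with P(C) = freq(C) / N."""
    return -math.log(concept_frequency(concept) / N)


for concept in ("animal", "carnivore", "dog"):
    print(concept, concept_frequency(concept), round(information_content(concept), 3))
```

The root covers every token, so its probability is 1 and its information content is 0; the more specific a concept, the higher its information content.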

**Resnik (1995)**

$$sim_{resnik}(c_1,c_2) = IC(LCS(c_1,c_2)) = -log(P(LCS(c_1,c_2)))$$

Several Information Content dictionaries are available in NLTK...

In [53]:
from nltk.corpus import wordnet_ic

In [56]:
# the IC estimated from the brown corpus
brown_ic = wordnet_ic.ic('ic-brown.dat')

In [57]:
# the IC estimated from the semcor
semcor_ic = wordnet_ic.ic('ic-semcor.dat')

... or it can be estimated from an available corpus

In [60]:
from nltk.corpus import genesis
genesis_ic = wn.ic(genesis, False, 0.0)

Note that the value of the Resnik measure depends on the corpus used to generate the information content

In [61]:
print(dog.res_similarity(cat, ic = brown_ic))
print(dog.res_similarity(cat, ic = semcor_ic))
print(dog.res_similarity(cat, ic = genesis_ic))

7.911666509036577
7.2549003421277245
7.204023991374833


**Lin (1998)**

$$sim_{lin}(c_1,c_2) = \frac{log(P(common(c_1,c_2)))}{log(P(description(c_1,c_2)))} = \frac{2 \times IC(LCS(c_1,c_2))}{IC(c_1) + IC(c_2)}$$

- $common(c_1,c_2)$ = the information that is common between $c_1$ and $c_2$


- $description(c_1,c_2)$ = the information that is needed to describe $c_1$ and $c_2$

In [62]:
print(dog.lin_similarity(cat, ic = brown_ic))
print(dog.lin_similarity(cat, ic = semcor_ic))
print(dog.lin_similarity(cat, ic = genesis_ic))

0.8768009843733973
0.8863288628086228
0.8043806652422293


**Jiang & Conrath (1997)**

$$sim_{J\&C}(c_1,c_2) = \frac{1}{dist(c_1,c_2)} = \frac{1}{IC(c_1) + IC(c_2) - 2 \times IC(LCS(c_1, c_2))}$$

In [63]:
print(dog.jcn_similarity(cat, ic = brown_ic))
print(dog.jcn_similarity(cat, ic = semcor_ic))
print(dog.jcn_similarity(cat, ic = genesis_ic))

0.4497755285516739
0.537382154955756
0.28539390848096946
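The three IC-based measures are built from the same ingredients, so they can be cross-checked. From the Lin formula, $IC(c_1) + IC(c_2) = 2 \times sim_{resnik} / sim_{lin}$, and plugging that sum into the Jiang & Conrath formula should give back the printed `jcn_similarity()` values. A small sketch using only the dog/cat numbers printed above (the function name is hypothetical):

```python
# dog / cat values printed above, per IC source: (resnik, lin, jcn)
PRINTED = {
    "brown": (7.911666509036577, 0.8768009843733973, 0.4497755285516739),
    "semcor": (7.2549003421277245, 0.8863288628086228, 0.537382154955756),
    "genesis": (7.204023991374833, 0.8043806652422293, 0.28539390848096946),
}


def jcn_from_resnik_and_lin(res, lin):
    """Recover the Jiang & Conrath similarity from the Resnik and Lin scores."""
    ic_sum = 2 * res / lin          # IC(c1) + IC(c2), rearranged from Lin
    return 1 / (ic_sum - 2 * res)   # 1 / dist(c1, c2)


for source, (res, lin, jcn) in PRINTED.items():
    print(source, jcn_from_resnik_and_lin(res, lin), jcn)
```

The recomputed and printed values agree up to floating-point noise, which confirms that the three methods rely on the same LCS and the same information content for each corpus.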


### WordNet-based Relatedness Measures

#### The Lesk algorithm (1986)


- *“how to tell a pine cone from an ice cream cone”*


- Lesk's intuition: let's have a look at the dictionary glosses

pine [1]: *kind of **evergreen tree** with needle-shaped leaves*

pine [2]: *waste away through sorrow or illness*

cone [1]: *solid body which narrows to a point*

cone [2]: *something of this shape whether solid or hollow*

cone [3]: *fruit of certain **evergreen tree**.*
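Lesk's intuition can be sketched in a few lines: score every pair of senses by the number of content words their glosses share, and keep the best pair. The glosses are the ones listed above; the tiny `STOPWORDS` set and the function names are assumptions made for this illustration, not part of the original algorithm's specification:

```python
# Small hand-picked list of function words to ignore (an assumption of this sketch).
STOPWORDS = {"a", "of", "or", "to", "with", "this", "which", "through", "whether"}

PINE = {
    1: "kind of evergreen tree with needle-shaped leaves",
    2: "waste away through sorrow or illness",
}
CONE = {
    1: "solid body which narrows to a point",
    2: "something of this shape whether solid or hollow",
    3: "fruit of certain evergreen tree",
}


def content_words(gloss):
    """Lower-cased gloss words, minus the function words."""
    return {word for word in gloss.lower().split() if word not in STOPWORDS}


def best_sense_pair(senses_a, senses_b):
    """Return (overlap size, sense of a, sense of b) for the best-overlapping pair."""
    scored = [(len(content_words(gloss_a) & content_words(gloss_b)), sense_a, sense_b)
              for sense_a, gloss_a in senses_a.items()
              for sense_b, gloss_b in senses_b.items()]
    return max(scored)


print(best_sense_pair(PINE, CONE))
```

Only pine [1] and cone [3] share content words (*evergreen* and *tree*), so these are the senses selected for "pine cone".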

#### Extended Lesk (Banerjee and Pedersen, 2003)

Gloss overlap score = sum of $n^2$, where $n$ is the length in words of each phrase shared by the two glosses

- in what follows the gloss overlap score is $1^2 + 3^2$

`{chest of drawers, chest, bureau, dresser}` : *a **piece of furniture** with drawers for keeping **clothes**.*

`{wardrobe, closet, press}` : *a tall **piece of furniture** that provides storage space for **clothes**.*
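The score for these two glosses can be computed with a greedy procedure: find the longest phrase shared by both glosses, add the square of its length, mask it out, and repeat. The sketch below is a simplified reading of the measure: it skips shared phrases made only of function words (the `STOPWORDS` list is an assumption), and the function names are hypothetical:

```python
# Function words that do not count as an overlap on their own (an assumption of this sketch).
STOPWORDS = {"a", "of", "for", "with", "that", "the"}


def longest_common_phrase(a, b):
    """Return (length, start in a, start in b) of the longest shared run of words."""
    best = (0, -1, -1)
    for i in range(len(a)):
        for j in range(len(b)):
            n = 0
            while (i + n < len(a) and j + n < len(b)
                   and a[i + n] is not None and a[i + n] == b[j + n]):
                n += 1
            # Ignore runs that consist of function words only.
            if n > best[0] and any(word not in STOPWORDS for word in a[i:i + n]):
                best = (n, i, j)
    return best


def overlap_score(gloss_1, gloss_2):
    """Sum of n**2 over the shared phrases, longest phrases first."""
    a, b = gloss_1.lower().split(), gloss_2.lower().split()
    score = 0
    while True:
        n, i, j = longest_common_phrase(a, b)
        if n == 0:
            return score
        score += n * n
        a[i:i + n] = [None] * n  # mask the phrase so it is not counted twice
        b[j:j + n] = [None] * n


chest = "a piece of furniture with drawers for keeping clothes"
wardrobe = "a tall piece of furniture that provides storage space for clothes"
print(overlap_score(chest, wardrobe))
```

Squaring the phrase length rewards one long shared phrase more than the same number of scattered single words: "piece of furniture" contributes 9, while three unrelated shared words would contribute only 3.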

This measure also takes into consideration the glosses of the synsets that are related to the target synsets by one of an a priori specified set of relations RELS:

$$sim_{eLesk}(c_1, c_2) = \sum_{r,q \in RELS}overlap\ (gloss(r(c_1)),\ gloss(q(c_2)))$$

---

#### Now, here's a challenge for you...

Let's suppose you have a list of word pairs and that you want to measure their similarity by using WordNet. Your immediate problem is polysemy: a single word may refer to multiple concepts, so that a lemma may appear in several WordNet synsets.

**Can you think of a way to deal with this issue** other than relying on some existing WSD tool? (TIP: *can you think of a way of filtering out some senses and/or combining multiple similarity scores in order to derive a unique word pair similarity score?*)

---