# Lexical Semantics: WordNet (recitation-3)

_Much of this is taken from Chris Potts' <a href="http://compprag.christopherpotts.net/wordnet.html">2011 LSA course</a>._

This exercise is intended to familiarize you with WordNet and some of its functionality. In particular, you will learn how to extract synsets, the relations between them, and the lemmas associated with them.

## Download instructions

We will be using the <a href="http://www.nltk.org/howto/wordnet.html">NLTK WordNet module</a>.


To install NLTK, run a command such as `pip install nltk` in your terminal.

Then run the following commands in Python:

In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In the window that appears, click on the "corpora" tab and download "wordnet".

Load the WordNet module, which provides access to the structure of WordNet:

In [1]:
from nltk.corpus import wordnet as wn

## The structure of WordNet

The two most important WordNet constructs are lemmas and synsets:

1. __Lemma__: close to the linguistic concept of a word. Lemmas are identified by strings like `firm.s.10.fast`, where:
 * `fast` is the morphological form of the lemma
 * `firm` is the stem identifier for the synset containing this lemma
 * `s` is the WordNet part of speech (`n`: noun, `v`: verb, `a`: adjective, `s`: adjective satellite, `r`: adverb)
 * `10` is the sense number (sense `01` is considered the primary sense)
<br><br>
2. __Synset__: a collection of synonyms, i.e. lemmas that are synonymous (by the standards of WordNet). Synsets are identified by strings like `firm.s.10`, where:
 * `firm` is the canonical string name
 * `s` is the WordNet part of speech
 * `10` is the sense number
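
The identifier layout described above can be taken apart with plain string handling. The following is a minimal sketch, not part of NLTK: the names `WordNetId` and `parse_wordnet_id` are hypothetical, and the pattern assumes the stem itself contains no `.<pos>.<digits>` sequence.

```python
import re
from typing import NamedTuple, Optional


class WordNetId(NamedTuple):
    """Parts of a WordNet identifier (hypothetical helper type, not an NLTK class)."""
    stem: str             # e.g. 'firm'
    pos: str              # one of n, v, a, s, r
    sense: int            # e.g. 10
    form: Optional[str]   # e.g. 'fast'; None for a synset identifier


# stem . pos . sense [. form]; the final part is only present for lemmas.
_ID_PATTERN = re.compile(
    r"^(?P<stem>.+?)\.(?P<pos>[nvasr])\.(?P<sense>\d+)(?:\.(?P<form>.+))?$"
)


def parse_wordnet_id(identifier: str) -> WordNetId:
    """Split a string like 'firm.s.10' or 'firm.s.10.fast' into its parts."""
    match = _ID_PATTERN.match(identifier)
    if match is None:
        raise ValueError(f"not a WordNet-style identifier: {identifier!r}")
    return WordNetId(
        match["stem"], match["pos"], int(match["sense"]), match["form"]
    )


lemma_id = parse_wordnet_id("firm.s.10.fast")
synset_id = parse_wordnet_id("firm.s.10")
print(lemma_id)
print(synset_id)
```

A lemma identifier and the identifier of its synset share the first three parts, which is why `firm.s.10.fast` belongs to the synset `firm.s.10`.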

###  Synset lists

The function `wn.synsets()` returns the list of Synset objects that contain the given string as a lemma, optionally restricted to a part-of-speech tag.



In [2]:
#list of synsets containing the lemma "fast"
wn.synsets('fast')

[Synset('fast.n.01'),
 Synset('fast.v.01'),
 Synset('fast.v.02'),
 Synset('fast.a.01'),
 Synset('fast.a.02'),
 Synset('fast.a.03'),
 Synset('fast.s.04'),
 Synset('fast.s.05'),
 Synset('debauched.s.01'),
 Synset('flying.s.02'),
 Synset('fast.s.08'),
 Synset('firm.s.10'),
 Synset('fast.s.10'),
 Synset('fast.r.01'),
 Synset('fast.r.02')]

In [3]:
# list of adjective synsets containing the lemma "fast"
wn.synsets('fast', 'a')

[Synset('fast.a.01'),
 Synset('fast.a.02'),
 Synset('fast.a.03'),
 Synset('fast.s.04'),
 Synset('fast.s.05'),
 Synset('debauched.s.01'),
 Synset('flying.s.02'),
 Synset('fast.s.08'),
 Synset('firm.s.10'),
 Synset('fast.s.10')]

In [4]:
# list of verb synsets containing the lemma "fast"
wn.synsets('fast', 'v')

[Synset('fast.v.01'), Synset('fast.v.02')]

In [5]:
# list of noun synsets containing the lemma "fast"
wn.synsets('fast', 'n')

[Synset('fast.n.01')]

In [6]:
# list of adverb synsets containing the lemma "fast"
wn.synsets('fast', 'r')

[Synset('fast.r.01'), Synset('fast.r.02')]

The first member of these lists is the primary (most frequent) sense for the input supplied.



### Synsets

Let's take one of the synsets returned by `wn.synsets('fast', 'a')`, and see what kind of information we can extract from it.

In [7]:
fast_adj_synsets = wn.synsets('fast', 'a')
fast_adj_synsets[5] #the sixth synset containing the adjective 'fast'

Synset('debauched.s.01')

In [8]:
debauched = fast_adj_synsets[5]
debauched.definition()

'unrestrained by convention or morality'

In [9]:
debauched.lemmas()

[Lemma('debauched.s.01.debauched'),
 Lemma('debauched.s.01.degenerate'),
 Lemma('debauched.s.01.degraded'),
 Lemma('debauched.s.01.dissipated'),
 Lemma('debauched.s.01.dissolute'),
 Lemma('debauched.s.01.libertine'),
 Lemma('debauched.s.01.profligate'),
 Lemma('debauched.s.01.riotous'),
 Lemma('debauched.s.01.fast')]

In [10]:
debauched.examples()

['Congreve draws a debauched aristocratic society',
 'deplorably dissipated and degraded',
 'riotous living',
 'fast women']

In [11]:
debauched.pos() #what part of speech is associated with synset 'debauched'?

's'

### Relations between synsets

Relations between synsets can be extracted via methods such as `hypernyms()`, `hyponyms()`, `member_holonyms()`, and `root_hypernyms()`.

The synset "debauched" has no hypernyms, so `debauched.hypernyms()` returns an empty list. With nouns we should have more luck:

In [13]:
wn.synsets('bike', 'n') #list of synsets containing the noun 'bike'

[Synset('motorcycle.n.01'), Synset('bicycle.n.01')]

In [127]:
wn.synsets('bike')

[Synset('motorcycle.n.01'), Synset('bicycle.n.01'), Synset('bicycle.v.01')]

In [14]:
bike = wn.synsets('bike', 'n')[1]
bike.hypernyms() 

[Synset('wheeled_vehicle.n.01')]

In [15]:
bike.hyponyms()

[Synset('bicycle-built-for-two.n.01'),
 Synset('mountain_bike.n.01'),
 Synset('ordinary.n.04'),
 Synset('push-bike.n.01'),
 Synset('safety_bicycle.n.01'),
 Synset('velocipede.n.01')]

In [16]:
bike.member_holonyms()

[]

The method `root_hypernyms()` returns the most general synset(s) above "bike" in the hypernym hierarchy. Check what it returns for other nouns.

In [17]:
bike.root_hypernyms()

[Synset('entity.n.01')]
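
To see how a synset is connected to its root, one can follow `hypernyms()` upward one step at a time. The sketch below is an illustration under assumptions: `first_hypernym_chain` is a hypothetical helper, and `ToySynset` is a hypothetical stand-in class that only mimics the `name()` and `hypernyms()` methods so the block runs without the WordNet data. The toy chain is deliberately shortened and does not reproduce WordNet's real intermediate nodes.

```python
class ToySynset:
    """Hypothetical stand-in exposing the two methods the walker needs."""

    def __init__(self, name, parents=()):
        self._name = name
        self._parents = list(parents)

    def name(self):
        return self._name

    def hypernyms(self):
        return self._parents


def first_hypernym_chain(synset):
    """Follow the first hypernym at each step until no hypernym is left.

    A synset may have several hypernyms; this sketch ignores all but the
    first, so it traces one path to a root rather than every path.
    """
    chain = [synset.name()]
    current = synset
    while current.hypernyms():
        current = current.hypernyms()[0]
        chain.append(current.name())
    return chain


# Shortened toy hierarchy (assumption: real WordNet has more levels).
toy_entity = ToySynset("entity.n.01")
toy_vehicle = ToySynset("wheeled_vehicle.n.01", [toy_entity])
toy_bike = ToySynset("bicycle.n.01", [toy_vehicle])

chain = first_hypernym_chain(toy_bike)
print(" -> ".join(chain))
```

Because NLTK Synset objects also provide `name()` and `hypernyms()`, calling `first_hypernym_chain(bike)` on the real `bike` synset should work unchanged and show the intermediate synsets that the toy example leaves out.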

### Lemma Lists

Parallel to `wn.synsets()`, the function `wn.lemmas()` will take you from strings to lists of Lemma objects:

In [18]:
wn.lemmas('fast')

[Lemma('fast.n.01.fast'),
 Lemma('fast.v.01.fast'),
 Lemma('fast.v.02.fast'),
 Lemma('fast.a.01.fast'),
 Lemma('fast.a.02.fast'),
 Lemma('fast.a.03.fast'),
 Lemma('fast.s.04.fast'),
 Lemma('fast.s.05.fast'),
 Lemma('debauched.s.01.fast'),
 Lemma('flying.s.02.fast'),
 Lemma('fast.s.08.fast'),
 Lemma('firm.s.10.fast'),
 Lemma('fast.s.10.fast'),
 Lemma('fast.r.01.fast'),
 Lemma('fast.r.02.fast')]

In [19]:
wn.lemmas('fast', 'a')

[Lemma('fast.a.01.fast'),
 Lemma('fast.a.02.fast'),
 Lemma('fast.a.03.fast'),
 Lemma('fast.s.04.fast'),
 Lemma('fast.s.05.fast'),
 Lemma('debauched.s.01.fast'),
 Lemma('flying.s.02.fast'),
 Lemma('fast.s.08.fast'),
 Lemma('firm.s.10.fast'),
 Lemma('fast.s.10.fast')]

### Lemmas

Remember, a Synset is a collection of Lemmas, so `lemmas()` returns a list of Lemma objects:

In [20]:
debauched.lemmas()

[Lemma('debauched.s.01.debauched'),
 Lemma('debauched.s.01.degenerate'),
 Lemma('debauched.s.01.degraded'),
 Lemma('debauched.s.01.dissipated'),
 Lemma('debauched.s.01.dissolute'),
 Lemma('debauched.s.01.libertine'),
 Lemma('debauched.s.01.profligate'),
 Lemma('debauched.s.01.riotous'),
 Lemma('debauched.s.01.fast')]

In [21]:
libertine = debauched.lemmas()[5]

In [22]:
libertine.synset().definition()

'unrestrained by convention or morality'

### A high-level perspective on the WordNet database

All of the above manipulations can be done with the Web or command-line interface to WordNet itself. The power of working within a full programming language is that we can also take a high-level perspective on the data. For example, the following code looks at the distribution of Synsets by part-of-speech (pos) category:

In [23]:
from collections import defaultdict
 
def wn_pos_dist():
    """Count the Synsets in each WordNet POS category."""
    # One-dimensional count dict with 0 as the default value:
    cats = defaultdict(int)
    # The counting loop:
    for synset in wn.all_synsets():
        cats[synset.pos()] += 1
    # Print the results to the screen:
    for tag, count in cats.items():
        print(tag, count)
    # Total number (sum of the above):
    print('Total', sum(cats.values()))
    
wn_pos_dist()

a 7463
s 10693
r 3621
n 82115
v 13767
Total 117659
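
Raw counts are easier to compare as proportions. The following sketch simply reuses the numbers printed above (copied by hand into a dict, so it does not need the WordNet data) and converts them to relative shares; the variable names are illustrative.

```python
# Counts copied from the output of wn_pos_dist() above.
counts = {"a": 7463, "s": 10693, "r": 3621, "n": 82115, "v": 13767}

total = sum(counts.values())
shares = {tag: count / total for tag, count in counts.items()}

# Print the categories from largest to smallest share.
for tag, share in sorted(shares.items(), key=lambda item: item[1], reverse=True):
    print(f"{tag}: {share:.1%}")
```

Nouns make up by far the largest category, which is worth keeping in mind when comparing how common a relation is across parts of speech.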


## Exercises

__1\.__ Find the 5 synsets that have the most:
- hypernyms
- hyponyms
- antonyms
- holonyms
- meronyms
- entailments
- causes

Are some relations more common for certain parts of speech?

In [1]:
# Your code goes here

__2\.__  Pick a domain (e.g., mammals, sciences, vehicles).
* Use `lowest_common_hypernyms()` to explore pairs of Synsets in your domain. Do the return values generally belong specifically to your domain, or are they more/overly general?
* Based on the above findings, provide an overall assessment of WordNet's coverage in your chosen domain.

In [166]:
# Your code goes here


__3\.__ NLTK provides a number of path similarity measures for Synsets, via its WordNetCorpusReader class. For example:

In [24]:
tree = wn.synsets('tree', 'n')[0]
flower = wn.synsets('flower', 'n')[0]
wn.path_similarity(tree, flower)

0.16666666666666666

Similarity measures in WordNet:

- Path Similarity:  Return a score denoting how similar two word senses are, based on the shortest path that connects the senses in the is-a (hypernym/hyponym) taxonomy. The score is in the range 0 to 1.

- Leacock-Chodorow Similarity: Return a score denoting how similar two word senses are, based on the shortest path that connects the senses and the maximum depth of the taxonomy in which the senses occur.

- Wu-Palmer Similarity: Return a score denoting how similar two word senses are, based on the depth of the two senses in the taxonomy and that of their Least Common Subsumer (most specific ancestor node).

- Resnik Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node).
 
- Lin Similarity: Return a score denoting how similar two word senses are, based on the Information Content (IC) of the Least Common Subsumer (most specific ancestor node) and that of the two input Synsets.
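
The two purely structural measures can be worked through by hand on a small example. The sketch below uses the usual textbook forms: path similarity as 1 / (1 + d), where d is the number of edges on the shortest is-a path, and Wu-Palmer similarity as 2 * depth(LCS) / (depth(a) + depth(b)). The `PARENT` taxonomy is a hand-written toy, not read from WordNet, and all function names are hypothetical. NLTK's own implementations also deal with multiple hypernym paths and other details, so values on the real data may differ.

```python
# Toy is-a links, child -> parent (assumption: hand-written, single parent each).
PARENT = {
    "vascular_plant": "plant",
    "woody_plant": "vascular_plant",
    "tree": "woody_plant",
    "spermatophyte": "vascular_plant",
    "angiosperm": "spermatophyte",
    "flower": "angiosperm",
}


def ancestors(node):
    """Return [node, parent, grandparent, ..., root]."""
    chain = [node]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def depth(node):
    """Number of nodes from the root down to node (the root has depth 1)."""
    return len(ancestors(node))


def lowest_common_subsumer(a, b):
    """Most specific node that is an ancestor of (or equal to) both a and b."""
    seen = set(ancestors(a))
    for node in ancestors(b):
        if node in seen:
            return node
    raise ValueError("nodes share no ancestor in the toy taxonomy")


def toy_path_similarity(a, b):
    """1 / (1 + number of edges on the path through the common subsumer)."""
    lcs = lowest_common_subsumer(a, b)
    distance = (depth(a) - depth(lcs)) + (depth(b) - depth(lcs))
    return 1.0 / (1 + distance)


def toy_wup_similarity(a, b):
    """2 * depth(LCS) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


print(lowest_common_subsumer("tree", "flower"))
print(toy_path_similarity("tree", "flower"))
print(toy_wup_similarity("tree", "flower"))
```

In this toy taxonomy "tree" and "flower" are five edges apart, so the path score is 1/6, the same value as the `wn.path_similarity(tree, flower)` result shown earlier. The Wu-Palmer score differs because it depends on how deep the common subsumer sits, not only on the distance between the two nodes.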

Pick a domain (e.g., mammals, sciences, vehicles). Check on the path similarity of pairs of things inside and outside your domain. Do the results make sense? Compare different similarity measures and comment on your observations.

In [168]:
# Your code goes here