# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import nltk, pandas as pd, itertools as it
from nltk.corpus import wordnet as wn
_ = nltk.download(['wordnet'], quiet=True)  # load WordNet ontology database
# _ = nltk.download(['wordnet'], quiet=True, download_dir='/home/codio/workspace')  # import WordNet ontology database
# nltk.data.path.clear()
# nltk.data.path.append('/home/codio/workspace/corpora/')
# nltk.data.path

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

<font color='black'>Review Professor Melnikov's code to familiarize yourself with the operations of a WordNet database.</font>

## **WordNet Database**

<font color='black'>[WordNet](https://wordnet.princeton.edu/) is a lexical database that groups words into sets of synonyms, called **synsets**. Each word in a synset is known as a **lemma**, and all lemmas in a synset share a common meaning. 

Using the WordNet object `wn`, the following code demonstrates WordNet's structure and operations. The `wn.words()` command, which takes a language code as an argument, lists all 147,306 English words in WordNet. A few selected words are printed below.</font>

In [None]:
LsWords = [w for w in wn.words(lang='eng')]
print(f'{len(LsWords):,} words:', ', '.join(LsWords[1000:1010]) + ', ...')

## Synsets

<font color="black">You can retrieve a list of all synsets that are linked to a lemma using the `wn.synsets()` command. Optionally, you can also limit the resulting synsets by specifying a part of speech (POS):

1. `v` for verb or `wn.VERB`
1. `n` for noun or `wn.NOUN`
1. `a` for adjective or `wn.ADJ`
1. `r` for adverb or `wn.ADV`

In the example below, the lemma `'dog'` (with no specified POS) has eight related synsets. The name of each synset takes the form of `word lemma`.`POS tag`.`number`, and can be retrieved using the `name()` method.</font>

In [None]:
# ?wn.synsets                                 # check help manual for full functionality
sWord = 'dog'                                 # A word -> synsets. A synset (many attributes) -> lemmas (many attributes)
wnSynsets = wn.synsets(lemma=sWord, pos=None) # retrieve synsets linked to the word 
sNames = ', '.join(sorted(ss.name() for ss in wnSynsets))
print(len(wnSynsets), 'synsets:', sNames)     # each synset (with unique sense) has a form: word.pos.nn
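
The naming scheme described above can be sketched in plain Python. This is only an illustration: `parse_synset_name` and `POS_NAMES` are hypothetical names, not part of NLTK, and the example strings are assumed to follow the `word.pos.nn` pattern. The extra code `s` is the tag NLTK uses for satellite adjectives.

```python
# Hypothetical helper (not part of NLTK): split a synset name into its parts.
POS_NAMES = {'n': 'noun', 'v': 'verb', 'a': 'adjective',
             'r': 'adverb', 's': 'adjective satellite'}

def parse_synset_name(name):
    """Split a name of the form 'word.pos.nn' into (word, pos, sense number)."""
    # Split from the right so that a word containing periods stays intact.
    word, pos, number = name.rsplit('.', 2)
    return word, pos, int(number)

for name in ['dog.n.01', 'chase.v.01', 'frank.n.02']:
    word, pos, number = parse_synset_name(name)
    print(f'{name}: word={word}, POS={POS_NAMES[pos]}, sense #{number}')
```

Splitting from the right is a design choice here: the POS tag and sense number are always the last two fields, whatever the word itself contains.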

<font color="black">You can also retrieve only the verb synsets of the lemma `'dog'` by specifying the POS.</font>

In [None]:
wn.synsets(sWord, pos=wn.VERB) # we can restrict results to POS of VERB, NOUN, ADJ, ADV

<font color="black">A few common synset attributes for the word `dog` are displayed in the dataframe below, but they are not exhaustive. Notice that the synset `dog.n.01` is linked to three lemmas with names `'dog, domestic_dog, Canis_familiaris'`.</font>

In [None]:
pd.set_option('max_colwidth', 1000)
ss = wn.synset('dog.n.01')                    # we can retrieve a specific synset object
LsAttrValues = [ss.name(), ss.pos(), ss.lexname(), ss.definition(), ss.examples(), ss.lemmas(), ss.lemma_names()]
LsIx = ['name', 'pos', 'lex name', 'definition', 'example', 'lemmas', 'lemma names']
print(type(ss))
pd.DataFrame(LsAttrValues, index=LsIx).T

<font color="black">Synsets have almost 50 attributes and methods to help you relate words to each other in the WordNet taxonomic tree.</font>

In [None]:
LsSynAttrNames = [a for a in dir(ss) if a[0]!='_']       # ignore internal attributes with an underscore
print(f'{len(LsSynAttrNames)} synset attributes:')
', '.join(LsSynAttrNames)

<font color="black">You can use the `wn.all_synsets(pos)` method to retrieve all synsets, with or without a specified `pos` argument. There are 117K synsets in NLTK's WordNet, 82K of which are nouns. A few such nouns are shown below.</font>

In [None]:
LssAll = list(wn.all_synsets())
LssNouns = list(wn.all_synsets('n'))      # synsets for nouns only
print(f'{len(LssAll):,} synsets\n {len(LssNouns):,} noun synsets')
print('A few nouns: ', ', '.join([ss.name() for ss in LssNouns[:10]]))

## Lemma

<font color="black">You can retrieve a lemma by specifying its full name in the form `word lemma`.`POS code`.`number`.`lemma name`. A lemma's synset, or parent object, can be retrieved with the `synset()` method.</font>

In [None]:
lm = wn.lemma('dog.n.01.domestic_dog')   # retrieve one "dog" lemma for the synset dog (noun, sense 1, animal)
print(type(lm))
print(lm, 'is in', lm.synset())
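
To make the full lemma name format concrete, here is a minimal sketch that separates the synset part from the lemma part using only the standard library. The helper `split_lemma_name` is hypothetical and not an NLTK function; it assumes the first `.nn.` group in the string is the sense number.

```python
import re

# Hypothetical helper (not part of NLTK): split 'word.pos.nn.lemma_name'.
SENSE_NUMBER = re.compile(r'\.\d+\.')   # the '.nn.' that closes the synset part

def split_lemma_name(full_name):
    """Return (synset name, lemma name) for a full lemma name."""
    match = SENSE_NUMBER.search(full_name)
    if match is None:
        raise ValueError(f'not a full lemma name: {full_name!r}')
    synset_name = full_name[:match.end() - 1]   # up to and including the sense number
    lemma_name = full_name[match.end():]        # everything after the separator
    return synset_name, lemma_name

print(split_lemma_name('dog.n.01.domestic_dog'))   # ('dog.n.01', 'domestic_dog')
```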

<font color="black">A lemma object has 33 attributes and methods that relate it to other words, lemmas, and synsets.
</font>

In [None]:
LsLemmaAttrNames = [a for a in dir(lm) if a[0]!='_']       # ignore internal attributes with an underscore
print(f'{len(LsLemmaAttrNames)} lemma attributes:')
', '.join(LsLemmaAttrNames)

<font color="black">Most lemma attributes are equivalent to those of synsets, but the few that are unique to lemmas are printed below.</font>

In [None]:
', '.join(sorted(set(LsLemmaAttrNames) - set(LsSynAttrNames)))   # attributes specific to lemmas

## Find Standard Form of a Word

<font color="black">The next cell demonstrates some basic WordNet operations, which include converting `ing` verbs to their base form and plural nouns to singular form. Occasionally, you may notice that a word or phrase is missing from WordNet. Languages change naturally over time, but the database is updated manually, so it reflects these changes more slowly.
</font>

In [None]:
print(wn.morphy('running', wn.VERB))        # find a closest word form in WordNet
print(wn.morphy('corpora', wn.NOUN))        # can convert plural to singular
print(wn.morphy('wake up', wn.NOUN))        # not all words and phrases can be located in WordNet

## **Open Multilingual WordNet**

<font color="black">You can operate on words, lemmas, and synsets in 29 other languages with Open Multilingual WordNet. It is accessible through the `omw` corpus, which needs to be downloaded separately.
</font>

In [None]:
_ = nltk.download(['omw'], quiet=True)    # Open Multilingual WordNet (ISO-639 language codes)
print(len(wn.langs()), 'languages:', ','.join(sorted(wn.langs())))   # list of supported languages

<font color='black'>Now, you can retrieve all lemmas in the synset `ss` in any of the available languages.</font>

In [None]:
ss.lemma_names('jpn')               # returns lemma names in specified language
ss.lemma_names('spa')               # returns lemma names in specified language

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Practice**

Now you will practice manipulating English language synsets and lemmas.

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**


<font color='black'> 
As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the See **solution** drop-down to view the answer.

## Task 1

<font color='black'>Retrieve all synsets for the noun `'break'`.</font>

<b>Hint:</b> You can use the <code>wn</code> object and its <code>synsets()</code> method, as demonstrated in the Review section above.</font>

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<font color='black'>
<pre class="ec">
wn.synsets('break', 'n')
            </pre>
</details> 
</font>

<hr>

## Task 2

Count the number of synsets for the lemma `'break'` in each POS category.

<b>Hint:</b> You can use a list comprehension to apply your code from Task 1 to each POS tag: 
1. `v` for verb or `wn.VERB`
1. `n` for noun or `wn.NOUN`
1. `a` for adjective or `wn.ADJ`
1. `r` for adverb or `wn.ADV`</font>

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<font color='black'>
    <pre class="ec">
[(pos, len(wn.synsets('break', pos))) for pos in 'nvar']
</pre>
</details> 
</font>

<hr>

## Task 3

<font color='black'>Which word has the most noun synsets? Print the count as well.</font>

<font color='black'><b>Hint:</b> Use WordNet's <code>words()</code> method to iterate over all words. Then, check whether the word has any noun synsets. If so, add that word to a list along with its count of noun synsets. Finally, you can use the <code>max</code> function to retrieve the word with the largest count.</font>
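
How `max` behaves on a list of pairs can be illustrated with toy data. The counts and words below are made up and do not come from WordNet; the variable names are only for this sketch.

```python
# Toy (count, word) pairs; the numbers are invented for illustration.
counts = [(3, 'apple'), (7, 'fig'), (7, 'pear'), (1, 'kiwi')]

# Tuples compare element by element, so max() ranks by count first
# and breaks ties with the word (the alphabetically last word wins).
best_by_tuple = max(counts)

# With a key function only the count is compared, and the first
# maximum encountered in the list is returned.
best_by_key = max(counts, key=lambda pair: pair[0])

print(best_by_tuple)   # (7, 'pear')
print(best_by_key)     # (7, 'fig')
```

Putting the count first in each pair is what lets a bare `max` call work; if the word came first, the comparison would be alphabetical instead.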

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
LsNouns = [(len(wn.synsets(w, 'n')), w) for w in wn.words() if len(wn.synsets(w, 'n'))>0]
max(LsNouns)
</pre>
</details> 
</font>

<hr>

## Task 4

<font color='black'>Print all lemmas for the noun synsets `'programmer'` and `'doctor'`. Note that there may be multiple noun synsets. 
    
<font color='black'><b>Hint:</b> First, you need to package the search query into the proper format <code>word.POS.number</code>. Then pass it to the <code>synset()</code> method of the loaded WordNet object, <code>wn</code>. Lastly, call <code>lemma_names()</code> on the returned synset object.</font>

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
wn.synset('programmer.n.01').lemma_names() # first noun synset
wn.synset('doctor.n.01').lemma_names()     # first noun synset
[w.lemma_names() for w in wn.synsets(lemma='doctor', pos='n')]     # all noun synsets
[w.lemma_names() for w in wn.synsets(lemma='programmer', pos='n')] # all noun synsets
</pre>
</details> 
</font>

<hr>

## Task 5

<font color='black'>Find the adverb synset with the most lemmas. Print the synset's name and all its lemmas.</font>

<font color='black'><b>Hint:</b> Try iterating over all adverb synsets using <code>wn</code>'s <code>all_synsets()</code> method. Use a list comprehension to count and collect each synset's lemmas using <code>lemma_names()</code>. Then use the <code>max()</code> function to retrieve the synset with the largest count.</font>

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
SS = [(len(ss.lemma_names()), ss.name(), ss.lemma_names()) for ss in wn.all_synsets(pos='r')]
max(SS)
</pre>
</details> 
</font>

<hr>

## Task 6

Print all noun synsets and their lemma names if `'language'` is one of the lemmas.

<font color='black'><b>Hint:</b> Iterate over all noun synsets using WordNet's <code>all_synsets()</code> method. If a synset contains the lemma name <code>'language'</code>, include it in your results. Use the <code>lemma_names()</code> method of a synset object to retrieve the names of all lemmas related to the synset.</font>

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
[(ss.name(), ss.lemma_names()) for ss in wn.all_synsets(pos='n') if 'language' in ss.lemma_names()]
</pre>
</details> 
</font>

<hr>