# Wordnet sandbox


## Preface

This tutorial illustrates the use of Wordnet for the types of exploration to be conducted in the [Dante’s _Inferno_](http://dante.obdurodon.org) and [Victorian ghost stories](http://ghost.obdurodon.org) research projects that were part of a [Computational methods in the humanities](http://dh.obdurodon.org) course in the autumn 2016 academic semester. Thanks to Na-Rae Han for discussion and suggestions.

Students completing [Computational methods in the humanities](http://dh.obdurodon.org) to satisfy the “methods” requirement for the Linguistics major need to perform some linguistic tasks with their data, and Wordnet is one way to do that. Below, after an introduction to how Wordnet works, we describe how to add Wordnet-related markup to your XML and how to use that markup to explore your data. You do not need to add Wordnet-related markup to all of your data (which would not be feasible within the context of a semester-long course because some of the work must be performed manually and your documents may be long), but you should do enough of it to be able to experiment a bit with how it works. You also do not have to perform all of the tasks we describe below (which also would not be feasible in the available time); pick one or two that sound interesting and see what you’re able to learn about your documents by implementing them. Ask your instructors should you have any questions about either the content of this tutorial (that is, about how to use Wordnet) or the scope of the assignment.

**tl;dr:** Use Wordnet as described below to add semantic markup to some (not all) of your data. Then perform some (not all) of the tasks below to explore how meaning is represented in your texts.

## Introduction

In Real Life you’ll export the words you care about from your XML using XSLT and then read the list into your Python program, but to start, let’s concentrate on learning how Wordnet works. We’re writing this tutorial in the **Jupyter notebook** interface, which allows us to break up the code into pieces that are interspersed with discussion. Because the code is fragmented, in order to run the statements at the bottom of the page you need to have run at least some of the ones at the top. For example, we import Wordnet at the beginning with `from nltk.corpus import wordnet as wn`, and later code depends on our having done that. This means that if you copy and try to run something below without having done the import, you’ll get an error. We also create some variables near the top that we use below without redeclaring them. You don’t need to use Jupyter notebook for your own development; we’ve used it here because the combination of code cells and text cells is convenient for tutorial purposes.

**tl;dr:** Run the code from the top of this notebook to the bottom, and not just in a single cell.

## How Wordnet is organized

Wordnet is a hierarchical organization of units of meaning, called **synsets**. Synsets are represented in texts by **words**, and a combination of a **lexeme** (represented by the dictionary form of a word) with a specific synset is called a **lemma**. Synsets are identified within Wordnet by three dot-separated parts:

1. A representative word, that is, a word that conveys the meaning of the synset. This representative word may not be the only word that conveys that meaning, and it may also be able to convey other meanings. We’ll see below that the lexeme “ghost” can represent several different meanings (that is, is associated with multiple synsets), and that each of those meanings can alternatively be conveyed by lexemes other than “ghost”.
1. A part of speech (POS) identifier, like “n” for ‘noun’ or “v” for ‘verb’.
1. A two-digit number that distinguishes different synsets that may have the same representative word and the same POS, but that convey different meanings. For example, the synsets 'ghost.n.01' and 'ghost.n.02' are two different nominal meanings that can be expressed by the lexeme “ghost.”
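
The three-part structure of a synset identifier can be illustrated with a small sketch that takes an identifier apart. The helper name `split_synset_id` is our own invention for this illustration and is not part of NLTK; the sketch assumes that the identifier is well formed.

```python
def split_synset_id(synset_id):
    """Split an identifier like 'ghost.n.01' into (word, pos, number)."""
    # Split from the right, so that a representative word that itself
    # contains a dot would stay in one piece.
    word, pos, number = synset_id.rsplit('.', 2)
    return word, pos, int(number)

print(split_synset_id('ghost.n.01'))  # ('ghost', 'n', 1)
print(split_synset_id('ghost.n.02'))  # ('ghost', 'n', 2)
```

This operates only on the text of the identifier; it does not check whether such a synset actually exists in Wordnet.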

### Exploring synsets

There’s a lot more organization within Wordnet, but for the purpose of this tutorial we’re going to stick to the information conveyed through synsets. Let’s explore that with the synset 'koala.n.01', which is a noun that represents a particular arboreal Australian marsupial. Here’s how it looks when we ask Python about it:

In [3]:
import nltk
nltk.download('wordnet')
nltk.download('omw-1.4')
from nltk.corpus import wordnet as wn # import Wordnet and call it just “wn” for brevity
wn.synset('koala.n.01')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


Synset('koala.n.01')

The output above tells us that the synset 'koala.n.01' is a synset that Wordnet calls 'koala.n.01'. That tautology isn’t very useful, so the only point of the code snippet above is to determine whether such a synset exists. If it doesn’t, we’ll get an error. You can test this by running the cell below, which will raise an error because there is no 'koala.n.02' synset in Wordnet (your error message may differ from ours):

In [4]:
wn.synset('koala.n.02')

WordNetError: ignored

How can we know that there is a 'koala.n.01' synset but no 'koala.n.02' synset without having to ask for the latter and trigger an error? We can ask Wordnet to tell us about all of the synsets associated with the word ‘koala’ by using the `wn.synsets()` function:

In [5]:
wn.synsets('koala')

[Synset('koala.n.01')]

The preceding code tells us that there is exactly one synset associated with the word ‘koala’, and that the synset is called 'koala.n.01'.

### Getting the definition of a synset

Synsets are units of meaning, and we can ask for a definition of a synset by using the `.definition()` method:

In [6]:
wn.synset('koala.n.01').definition()

'sluggish tailless Australian arboreal marsupial with grey furry ears and coat; feeds on eucalyptus leaves and bark'

### Getting the lexemes associated with a synset

As we wrote above, synsets, as units of meaning, are represented in a text by lexemes, and the combination of a synset (a meaning) plus a lexeme (a word) is called a **lemma**. We can get the lemmata for a particular synset by asking for them with the `.lemmas()` method:

In [7]:
wn.synset('koala.n.01').lemmas()

[Lemma('koala.n.01.koala'),
 Lemma('koala.n.01.koala_bear'),
 Lemma('koala.n.01.kangaroo_bear'),
 Lemma('koala.n.01.native_bear'),
 Lemma('koala.n.01.Phascolarctos_cinereus')]

Note that a lemma like 'koala.n.01.koala' combines the synset representation (“koala.n.01”) with a lexeme that expresses that meaning (“koala”). You can get just the lexical part, without the synset prefix, by applying the `.name()` method to a lemma. Here we ask for the first (zeroth in Python enumeration) lemma associated with our synset and return just its name:

In [8]:
wn.synset('koala.n.01').lemmas()[0].name()

'koala'

### What about inflected forms?

As noted above, we can identify all of the synsets associated with a word by using the `wn.synsets()` function:

In [10]:
wn.synsets('koala')

[Synset('koala.n.01')]

The word that we use as an argument to the `wn.synsets()` function doesn’t have to be the dictionary form, which for nouns is typically the singular. We’ll get the same result if we ask for the synsets associated with the plural:

In [11]:
wn.synsets('koalas')

[Synset('koala.n.01')]

We see above that the lexeme “koala” (whether represented by its singular or plural form) belongs to only one synset. The word “ghost”, though, belongs to seven, four of which are nouns and three of which are verbs:

In [12]:
wn.synsets('ghost')

[Synset('ghost.n.01'),
 Synset('ghostwriter.n.01'),
 Synset('ghost.n.03'),
 Synset('touch.n.03'),
 Synset('ghost.v.01'),
 Synset('haunt.v.02'),
 Synset('ghost.v.03')]

### Synset summary

* A word may represent multiple meanings, and we get the meanings with `wn.synsets()`.
* We can get a definition of a synset with `.definition()` .
* We can get the lemmata (combination of a lexeme with a meaning) associated with a synset with `.lemmas()`.
* We can get just the lexical part of a lemma with `.name()`.

## Using Wordnet to explore course project data

For this tutorial, assume that we’re interested in words that express scary concepts. (If your project deals with pain rather than fear, assume painful concepts instead of scary ones; the method is the same.) The interesting words have already been tagged using manual methods, but we’re assuming that they are all tagged only in a simple way, along the lines of `<spooky_word>ghost</spooky_word>`. This initial markup makes it possible to find the words we care about easily, but it doesn’t tell us what they mean beyond the fact that they’re associated with scariness.

We can begin our richer exploration of meaning by compiling a list of sample words and examining their synsets. In the example below we’ve included four spooky words plus one non-spooky control item:

In [13]:
from nltk.corpus import wordnet as wn # import Wordnet and call it just “wn” for brevity
words = ['scare', 'ghost', 'fright', 'spook', 'koala'] # create a list of words to examine
synset_list = [wn.synsets(word) for word in words] # get the synsets for each word
synset_list # display them

[[Synset('panic.n.02'),
  Synset('scare.n.02'),
  Synset('frighten.v.01'),
  Synset('daunt.v.01')],
 [Synset('ghost.n.01'),
  Synset('ghostwriter.n.01'),
  Synset('ghost.n.03'),
  Synset('touch.n.03'),
  Synset('ghost.v.01'),
  Synset('haunt.v.02'),
  Synset('ghost.v.03')],
 [Synset('fear.n.01'), Synset('frighten.v.01')],
 [Synset('creep.n.01'), Synset('ghost.n.01'), Synset('spook.v.01')],
 [Synset('koala.n.01')]]

The output above is a list of lists, where each of the inner lists contains the synsets that pertain to a particular word form. We can see that the first inner list shows the four synsets associated with the word “scare”, the second inner list shows the seven synsets associated with the word “ghost”, etc. Our assumption is that each word taken from a text is associated, _in the context in which it occurs_, with exactly one meaning represented by one of the available synsets. The part about context matters; the same lexeme may occur in different contexts with different meanings within the same text. For example, as noted above, the word “scare” may be a noun in one place and a verb in another.
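
The relationship also runs in the other direction: the output above shows that some synsets turn up under more than one word. A minimal sketch of how to find them follows. To keep it runnable without Wordnet installed, the synset identifiers are copied as plain strings from the output above; the variable names `word_to_synsets`, `synset_to_words`, and `shared` are ours.

```python
from collections import defaultdict

# Synset identifiers copied from the output above, as plain strings
word_to_synsets = {
    'scare': ['panic.n.02', 'scare.n.02', 'frighten.v.01', 'daunt.v.01'],
    'ghost': ['ghost.n.01', 'ghostwriter.n.01', 'ghost.n.03', 'touch.n.03',
              'ghost.v.01', 'haunt.v.02', 'ghost.v.03'],
    'fright': ['fear.n.01', 'frighten.v.01'],
    'spook': ['creep.n.01', 'ghost.n.01', 'spook.v.01'],
    'koala': ['koala.n.01'],
}

# Invert the mapping: for each synset, collect the words that can express it
synset_to_words = defaultdict(list)
for word, synset_ids in word_to_synsets.items():
    for synset_id in synset_ids:
        synset_to_words[synset_id].append(word)

# Keep only the synsets that are shared by two or more of our words
shared = {sid: ws for sid, ws in synset_to_words.items() if len(ws) > 1}
print(shared)
```

Among our five sample words, 'frighten.v.01' is shared by “scare” and “fright”, and 'ghost.n.01' is shared by “ghost” and “spook”.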

Occasionally your texts may contain words that are not included in Wordnet, or words that are used with meanings that are not represented in Wordnet. You cannot add anything to Wordnet, so when that happens, make a note of it, but otherwise you’ll have to exclude those words from your Wordnet processing.

## Add synset markup to your documents

So far your data contains nothing more than a tag that identifies spooky words, e.g., `<spooky_word>ghost</spooky_word>`. Your goal here is to identify the synset represented by the word in its context and add an attribute (`@synset`) to the markup, using a value that identifies the synset. This task requires human analysis, since although Wordnet can tell you the possible synsets for a particular lexeme, it can’t tell which of those available meanings the lexeme has at a particular location in the text. Remember that the same word form may represent different synsets in different locations. For example, as noted above, “scare” could be a noun in one place and a verb in a different place, and those are different synsets. You don’t need to do this for your entire corpus, which wouldn’t be realistic given the fifteen-week semester and the size of the corpus, but you’ll want to do enough to get a sense of the relationship between word forms in your corpus and the synsets that Wordnet uses to represent units of meaning.

The procedure for adding synset markup to the document has three steps:

1. Get the definitions of each synset for each scary word in your corpus or selection. You can use Python to do this.
1. Choose the appropriate synset for each scary word in your corpus or selection. This requires human decisions, since Python doesn’t understand the context.
1. Write the correct synset into the markup as a new `@synset` attribute. You have to do this manually, as well.

### 1. Get the definitions of each synset for each word

You can get the definition of a synset like `Synset('panic.n.02')` with:

In [14]:
wn.synset('panic.n.02').definition()

'sudden mass fear and anxiety over anticipated events'

A lexeme like “scare” is associated with four synsets:

In [15]:
wn.synsets('scare')

[Synset('panic.n.02'),
 Synset('scare.n.02'),
 Synset('frighten.v.01'),
 Synset('daunt.v.01')]

For each occurrence of some form of “scare” in our texts (it might be ‘scare’ or ‘scares’ or some other inflected form), we want to add an attribute to our XML that indicates the appropriate synset. To tell the synsets apart (in case the sample word that’s part of the synset identifier is not sufficiently clear by itself), we can get their definitions. The code below outputs each synset and its definition:

In [16]:
[str(item) + ' means: ' + item.definition() for item in wn.synsets('scare')]

["Synset('panic.n.02') means: sudden mass fear and anxiety over anticipated events",
 "Synset('scare.n.02') means: a sudden attack of fear",
 "Synset('frighten.v.01') means: cause fear in",
 "Synset('daunt.v.01') means: cause to lose courage"]

We use the `str()` function above to stringify the synset (represented by the variable `item`) so that we can concatenate it with the other strings for output.

### 2. Choose the appropriate synset for each spooky word _in context_

Once you know the synsets that are available for each word in your document, look at your XML and choose the appropriate synset for each word _in context_. For example, if “scare” occurs as a verb that means ‘cause fear in’ in one place, the synset you’d choose from above would be 'frighten.v.01'. If it occurs as a noun that means ‘a sudden attack of fear’ in another, you’d choose 'scare.n.02'.
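
Once the `@synset` attribute has been written into the markup, it can be read back for exploration. The sketch below is only an illustration: the XML fragment is invented, and we use Python’s standard `xml.etree.ElementTree` module here merely as one convenient option (XSLT or XQuery would work equally well).

```python
import xml.etree.ElementTree as ET
from collections import Counter

# An invented fragment of marked-up text (hypothetical, for illustration only)
sample = """<p>The <spooky_word synset="ghost.n.03">ghost</spooky_word> gave us a
<spooky_word synset="scare.n.02">scare</spooky_word>, and it
<spooky_word synset="frighten.v.01">scares</spooky_word> us still.</p>"""

root = ET.fromstring(sample)

# Pair each word form with the synset chosen for it in context
pairs = [(el.text, el.get('synset')) for el in root.iter('spooky_word')]
print(pairs)

# Count how often each synset occurs in the fragment
counts = Counter(synset_id for form, synset_id in pairs)
for synset_id, count in counts.items():
    print(synset_id, count)
```

Note that “scare” and “scares” are related word forms but carry different `@synset` values here, because one is used as a noun and the other as a verb.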

## Examine the lemmata for each synset

At the moment this is just for curiosity. Below we construct a list of two synsets and for each of them we print the Wordnet synset identifier and a list of the lexemes associated with it. As described above, we use the `.lemmas()` method to get the lemmata associated with the synset and we use the `.name()` method to keep only the lexical part of the lemma:

In [17]:
scare_synsets = [wn.synset('scare.n.02'), wn.synset('frighten.v.01')]
for synset in scare_synsets:
    print(str(synset) + ' has the following lemmata: ' + str([lemma.name() for lemma in synset.lemmas()]))

Synset('scare.n.02') has the following lemmata: ['scare', 'panic_attack']
Synset('frighten.v.01') has the following lemmata: ['frighten', 'fright', 'scare', 'affright']

We can also count the number of synsets to which each word in our `words` list belongs:


In [18]:
for word in words:
    synset_count = len(wn.synsets(word))
    print('The word "' + word + '" belongs to ' + str(synset_count) + ' synsets')

The word "scare" belongs to 4 synsets
The word "ghost" belongs to 7 synsets
The word "fright" belongs to 2 synsets
The word "spook" belongs to 3 synsets
The word "koala" belongs to 1 synsets

In a real project the words would come from a file rather than from a list typed into the code. The cell below reads a plain-text word list (called spooky\_words.txt here), counts the synsets for each word, and writes the results to an XML file:


In [19]:
with open('spooky_words.txt', 'r') as infile: # open the plain text file that contains the list of words
    wordlist = infile.read().split() # read the words into a list, splitting on whitespace (including new lines)
with open('synset_counts.xml', 'w') as outfile: # open a file to hold the XML output
    outfile.write('<root>') # create a start tag for the root element in the output XML file
    for word in wordlist: # create output for each word
        synset_count = len(wn.synsets(word)) # for each word, count the number of synsets to which it belongs
        outfile.write('<word><form>' + word + '</form><count>' + str(synset_count) + '</count></word>') # write it out
    outfile.write('</root>') # create the end tag for the root element


We saved the output to a file called synset\_counts.xml, so we don’t see it here in the notebook, but we can now use Python to read it. This is just for human inspection, to make sure that it looks the way we want. It isn’t pretty-printed, but we can still see how it looks:

In [None]:
with open('synset_counts.xml') as infile:
    print(infile.read())

<root><word><form>ghost</form><count>7</count></word><word><form>scared</form><count>4</count></word><word><form>scare</form><count>4</count></word></root>
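
Building the XML by string concatenation works for simple words, but it would produce malformed output if a word contained a character such as `&` or `<`. As a hedged alternative, here is a minimal sketch that builds the same structure with the standard `xml.etree.ElementTree` module, which escapes such characters when it serializes. The counts are hard-coded placeholders (the third entry is invented to show the escaping); in the notebook they would come from `len(wn.synsets(word))`.

```python
import xml.etree.ElementTree as ET

# Placeholder counts; 'R&D' and its count are made up to demonstrate escaping
synset_counts = {'ghost': 7, 'scared': 4, 'R&D': 0}

root = ET.Element('root')
for form, count in synset_counts.items():
    word = ET.SubElement(root, 'word')          # one <word> per entry
    ET.SubElement(word, 'form').text = form     # the word form
    ET.SubElement(word, 'count').text = str(count)  # its synset count

# Serialize to a string; special characters are escaped automatically
xml_text = ET.tostring(root, encoding='unicode')
print(xml_text)
```

The element names match those in the output above, so anything that reads synset\_counts.xml would not need to change.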


### Explore the richness of the vocabulary (by writer or by text)

Synsets are represented by one or more lemmata, which you can retrieve with the `lemmas()` method, as in:

In [20]:
synsets = wn.synsets('ghost')
for synset in synsets:
    lemmata = synset.lemmas()
    print(str(synset) + ' means "' + synset.definition() + '" and has ' + str(len(lemmata)) + ' lemmata: ' + \
         str([lemma.name() for lemma in lemmata]))

Synset('ghost.n.01') means "a mental representation of some haunting experience" and has 6 lemmata: ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']
Synset('ghostwriter.n.01') means "a writer who gives the credit of authorship to someone else" and has 2 lemmata: ['ghostwriter', 'ghost']
Synset('ghost.n.03') means "the visible disembodied soul of a dead person" and has 1 lemmata: ['ghost']
Synset('touch.n.03') means "a suggestion of some quality" and has 3 lemmata: ['touch', 'trace', 'ghost']
Synset('ghost.v.01') means "move like a ghost" and has 1 lemmata: ['ghost']
Synset('haunt.v.02') means "haunt like a ghost; pursue" and has 3 lemmata: ['haunt', 'obsess', 'ghost']
Synset('ghost.v.03') means "write for someone else" and has 2 lemmata: ['ghost', 'ghostwrite']


We use the `name()` method to get just the lexical part of the lemma. (We took a lazy way out and used the plural “lemmata” even after the value 1, although “has 1 lemmata” should really read “has 1 lemma”. If we intended to use this code to produce final output for end-users, we’d include additional code to control for that difference.)
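
The additional code mentioned above could be as small as the following sketch. The helper name `count_phrase` is our own, and the default singular and plural forms are just the ones needed for this tutorial.

```python
def count_phrase(n, singular='lemma', plural='lemmata'):
    """Return a count followed by the grammatically appropriate noun form."""
    return f'{n} {singular if n == 1 else plural}'

print(count_phrase(1))  # 1 lemma
print(count_phrase(6))  # 6 lemmata
print(count_phrase(1, 'synset', 'synsets'))  # 1 synset
```

The same helper would also tidy up the earlier “belongs to 1 synsets” message.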

A writer or text that uses the synset 'ghost.n.01' has six lemmata available to express that meaning. What proportion of the available vocabulary does your writer or text use? 

That would be easy to calculate if the writer always used the exact form provided by the `name()` method of lemmata. For example, you might find that a particular text contains the following mappings of lemmata and word forms:

Synset | Word form
--- | ---
ghost.n.01 | ghost
ghost.n.01 | shade
ghost.n.01 | spook

You can count up the number of word forms associated with each synset, and because each word form corresponds to a different one of the 6 lemmata for that synset, you’ll determine correctly that the writer or text uses 50% of the available lemmata.

In [21]:
available = [lemma.name() for lemma in wn.synset('ghost.n.01').lemmas()]
print('There are ' + str(len(available)) + ' lemmata for ghost.n.01 and they are: ' + str(available))
used = ['ghost', 'shade', 'spook']
print('The 3 lemmata for ghost.n.01 used in the document are ' + str(used))
print('The ratio of used (' + str(len(used)) + ') divided by available (' + \
      str(len(available)) + ') = ' + str(len(used) / len(available)))

There are 6 lemmata for ghost.n.01 and they are: ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']
The 3 lemmata for ghost.n.01 used in the document are ['ghost', 'shade', 'spook']
The ratio of used (3) divided by available (6) = 0.5


But suppose the forms include different inflections of the same lemma, such as singular 'ghost' and plural 'ghosts'. The challenge here is that those two forms represent the same lemma, and the `.lemmas()` method won’t return the plural form “ghosts”, so you can’t simply count forms that occur in the text and use that as a surrogate for counting lemmata that occur in the text. Wordnet helps resolve these situations with `wn.morphy()`, which lemmatizes:

In [None]:
print('The result of applying wn.morphy() to “ghost” (sg) is “' + wn.morphy('ghost') + '”')
print('The result of applying wn.morphy() to “ghosts” (pl) is “' + wn.morphy('ghosts') + '”')

The result of applying wn.morphy() to “ghost” (sg) is “ghost”
The result of applying wn.morphy() to “ghosts” (pl) is “ghost”


This means that we can resolve the variation caused by inflection along the following lines. In the code snippet below we have the same list as above, except that instead of three items in our `used` variable that correspond to three different lemmata, we have three items that correspond to only two different lemmata. Here we print the values of `used` and `normalized` to show that they have the same length (the same number of items), but `normalized` has only two _distinct_ values, while `used` has three. We then convert `normalized` from a list (which allows duplicates) to a set (which doesn’t), which is a quick way of removing duplicates:

In [None]:
available = [lemma.name() for lemma in wn.synset('ghost.n.01').lemmas()]
used = ['ghost', 'ghosts', 'spook']
normalized = [wn.morphy(item) for item in used]
print(used)
print(normalized)
print(len(set(normalized)) / len(available))

['ghost', 'ghosts', 'spook']
['ghost', 'ghost', 'spook']
0.3333333333333333


By using `wn.morphy()`, then, we can determine the richness of the vocabulary (the number of available different lemmata that are actually used) without being misled by different inflected forms of the same lexeme. Of course we still have to decide how to use these counts to explore or present information about how much of the available vocabulary variation the writer or text actually uses.
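
One practical wrinkle is that `wn.morphy()` returns `None` when it cannot find a base form in Wordnet, so the list comprehension above can end up containing `None` values. The sketch below is one possible way to guard against that. The function `lemma_coverage` and the stand-in `toy_lemmatize` are hypothetical names of our own; the stand-in exists only so that the sketch runs without Wordnet installed, and in the notebook you would pass `wn.morphy` in its place.

```python
def lemma_coverage(used_forms, available, lemmatize):
    """Return the share of the available lemmata that occur in used_forms."""
    normalized = set()
    for form in used_forms:
        # Lemma names use underscores for multi-word items (e.g. 'koala_bear')
        key = form.strip().replace(' ', '_')
        # Fall back to the form itself if the lemmatizer returns None
        lemma = lemmatize(key) or key
        normalized.add(lemma.lower())
    available_lower = {name.lower() for name in available}
    # Count only the forms that really are among the available lemmata
    return len(normalized & available_lower) / len(available_lower)

# Stand-in for wn.morphy (an assumption, so that the sketch is self-contained)
def toy_lemmatize(form):
    return {'ghost': 'ghost', 'ghosts': 'ghost', 'spook': 'spook'}.get(form)

available = ['ghost', 'shade', 'spook', 'wraith', 'specter', 'spectre']
used = ['ghost', 'ghosts', 'spook', 'banshee']
print(lemma_coverage(used, available, toy_lemmatize))
```

Unlike the cell above, this version intersects the normalized forms with the available lemmata, so a form that does not belong to the synset (“banshee” in the example) does not inflate the ratio. Whether that is the behavior you want depends on how you have tagged your data.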