# Using NLTK

This Python notebook explores Python's `NLTK` library. Most of the steps come from the [NLTK book](http://www.nltk.org/book/ch01.html). Other resources include:

* [Text classification with NLTK](https://pythonprogramming.net/text-classification-nltk-tutorial/)
* [WordNet search](http://wordnetweb.princeton.edu/perl/webwn) - this allows you to manually search for words in the WordNet corpus and see the relationships
* [WordNet Interface documentation](http://www.nltk.org/howto/wordnet.html)
* [Tutorial: What is WordNet? A Conceptual Introduction Using Python](https://stevenloria.com/wordnet-tutorial/)
* Stack Overflow threads, such as [this one on finding word categories](https://stackoverflow.com/questions/24195854/finding-category-for-words). Note that answers can be outdated, because newer versions of NLTK no longer behave in the same way, so read the comments and look for more recent answers.

## Importing the NLTK library

Let's get started. First, import the library:

In [2]:
import nltk

This next line opens a window where you can download the relevant data. By default it saves to your user account directory (e.g. Users/paul). Leave this as it is, because the later code assumes the data is in this location.

Download the 'book' line *only*.

Close the window once the download has finished: the line only finishes executing when the window is closed.

In [3]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True
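If the download window is awkward to use (for example on a remote machine with no display), the same data can usually be fetched without the window by passing a collection identifier to `nltk.download`. The sketch below assumes the identifier is `book`, matching the line chosen in the window. `ensure_book_data` is a hypothetical helper name, not part of NLTK, and checking only the Gutenberg corpus is a shortcut rather than a full check of the collection.

```python
def ensure_book_data():
    """Download the 'book' collection only if a sample corpus is missing."""
    import nltk  # imported here so the helper can be defined before NLTK is set up

    try:
        # nltk.data.find raises LookupError when the resource is not installed.
        # Gutenberg is used as a stand-in for the whole collection (an assumption).
        nltk.data.find("corpora/gutenberg")
        return True
    except LookupError:
        # Fetches into the default nltk_data directory without opening a window
        return nltk.download("book")


# Usage, in a notebook cell:
# ensure_book_data()
```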

Now we import the books. NLTK loads them from the corpora that we've just downloaded (the Gutenberg folder, among others).

In [2]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


In [3]:
#List the texts
texts()

text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


Try a few functions...

In [4]:
text1.concordance("whale")

Displaying 25 of 1226 matches:
s , and to teach them by what name a whale - fish is to be called in our tongue
t which is not true ." -- HACKLUYT " WHALE . ... Sw . and Dan . HVAL . This ani
ulted ." -- WEBSTER ' S DICTIONARY " WHALE . ... It is more immediately from th
ISH . WAL , DUTCH . HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALE
HWAL , SWEDISH . WHALE , ICELANDIC . WHALE , ENGLISH . BALEINE , FRENCH . BALLE
least , take the higgledy - piggledy whale statements , however authentic , in 
 dreadful gulf of this monster ' s ( whale ' s ) mouth , are immediately lost a
 patient Job ." -- RABELAIS . " This whale ' s liver was two cartloads ." -- ST
 Touching that monstrous bulk of the whale or ork we have received nothing cert
 of oil will be extracted out of one whale ." -- IBID . " HISTORY OF LIFE AND D
ise ." -- KING HENRY . " Very like a whale ." -- HAMLET . " Which to secure , n
restless paine , Like as the wounded whale to shore flies thro ' the maine ." -
. OF SPER

## WordNet

Let's try some WordNet:

In [5]:
from nltk.corpus import wordnet as wn

This bit comes from [this Stack Overflow post](https://stackoverflow.com/questions/24195854/finding-category-for-words). There's more about synsets in [this NLTK tutorial](https://pythonprogramming.net/wordnet-nltk-tutorial/):

In [6]:
#Show the synonym sets for the string 'blue'
wn.synsets('blue')

[Synset('blue.n.01'),
 Synset('blue.n.02'),
 Synset('blue.n.03'),
 Synset('blue_sky.n.01'),
 Synset('bluing.n.01'),
 Synset('amobarbital_sodium.n.01'),
 Synset('blue.n.07'),
 Synset('blue.v.01'),
 Synset('blue.s.01'),
 Synset('blue.s.02'),
 Synset('gloomy.s.02'),
 Synset('blasphemous.s.02'),
 Synset('blue.s.05'),
 Synset('aristocratic.s.01'),
 Synset('blue.s.07'),
 Synset('blue.s.08')]

In [7]:
#Show the first synset for 'cat'
wn.synsets('cat')[0]

Synset('cat.n.01')

In [8]:
#For that first synset, grab the hypernyms, and for the first result, grab more hypernyms
wn.synsets('cat')[0].hypernyms()[0].hypernyms()

[Synset('carnivore.n.01')]
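Chaining `.hypernyms()[0]` by hand gets unwieldy after a couple of levels. The sketch below keeps following the first hypernym until there is none left. `hypernym_chain` is a hypothetical helper, not an NLTK function, and it only assumes that each object has a `.hypernyms()` method returning a list, as synsets do. Taking only the first hypernym ignores any alternative parents, which is a simplification.

```python
def hypernym_chain(synset):
    """Follow the first hypernym of each synset until none is left."""
    chain = [synset]
    seen = {synset}
    while True:
        parents = chain[-1].hypernyms()
        # Stop at the top of the hierarchy, or if we would revisit a synset
        if not parents or parents[0] in seen:
            break
        chain.append(parents[0])
        seen.add(parents[0])
    return chain


# Usage with WordNet loaded as `wn` in the cells above:
# for s in hypernym_chain(wn.synsets('cat')[0]):
#     print(s.name())
```

Synsets also have a built-in `hypernym_paths()` method, which returns every path up to the root rather than just one.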

We can use a loop to [add synonyms and antonyms all together](https://pythonprogramming.net/wordnet-nltk-tutorial/):

In [9]:
synonyms = []
antonyms = []

for syn in wn.synsets("arms"):
    for l in syn.lemmas():
        synonyms.append(l.name())
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))

{'blazonry', 'arms', 'blazon', 'weapon', 'implements_of_war', 'branch', 'coat_of_arms', 'subdivision', 'weaponry', 'weapon_system', 'sleeve', 'gird', 'arm', 'fortify', 'munition', 'weapons_system', 'limb', 'build_up'}
{'disarm'}


In [10]:
#The 'a' in this means 'adjective' - so this lists all adjectives - but limits results to the first 10 items by adding [:10]
#Change to 'n' for noun or 'v' for verb
for synset in list(wn.all_synsets('a'))[:10]:
    print(synset)

Synset('able.a.01')
Synset('unable.a.01')
Synset('abaxial.a.01')
Synset('adaxial.a.01')
Synset('acroscopic.a.01')
Synset('basiscopic.a.01')
Synset('abducent.a.01')
Synset('adducent.a.01')
Synset('nascent.a.01')
Synset('emergent.s.02')


### Trying out with pub name words

Now let's try a few pub name words copied from the spreadsheet:

In [11]:
#When copied from Excel there's a line break between each item
#The variable is not called 'list', as that would shadow Python's built-in list() used above
pub_words = '''arms
white
lion
head
house'''
#We convert to a list by splitting on each new line
publist = pub_words.split('\n')
print(publist)

['arms', 'white', 'lion', 'head', 'house']


In [12]:
for i in publist:
    print(i)
    print(wn.synsets(i)[0].name())
    print(wn.synsets(i)[0].definition())

arms
weaponry.n.01
weapons considered collectively
white
white.n.01
a member of the Caucasoid race
lion
lion.n.01
large gregarious predatory feline of Africa and India having a tawny coat with a shaggy mane in the male
head
head.n.01
the upper part of the human body or the front part of the body in animals; contains the face and brains
house
house.n.01
a dwelling that serves as living quarters for one or more families
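Taking `[0]` always picks the first sense, which is not necessarily the one a pub name intends (for 'white' it is a noun, not the colour), and it raises an `IndexError` for a word WordNet does not know. The sketch below lists the first few senses so that one can be chosen by eye. `describe_senses` is a hypothetical helper, and `lookup` is assumed to behave like `wn.synsets`, including its optional `pos` argument.

```python
def describe_senses(word, lookup, pos=None, limit=3):
    """Return (synset name, definition) pairs for the first few senses of a word.

    `lookup` is expected to behave like wn.synsets: an unknown word gives [].
    """
    senses = lookup(word, pos=pos)
    return [(s.name(), s.definition()) for s in senses[:limit]]


# Usage with WordNet loaded as `wn` in the cells above:
# for word in publist:
#     print(word, describe_senses(word, wn.synsets))
# To restrict to nouns only:
# describe_senses('white', wn.synsets, pos=wn.NOUN)
```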


### Hyponyms and hypernyms

What about hyponyms? These are words "of more specific meaning than a general or superordinate term applicable to it. For example, spoon is a hyponym of cutlery."

Hypernyms are the opposite: they are the synsets that are more *general*.

In [13]:
wn.synset('arms.n.01').hyponyms()

[Synset('ammunition.n.01'),
 Synset('armament.n.01'),
 Synset('bomb.n.01'),
 Synset('defense_system.n.01'),
 Synset('gunnery.n.01'),
 Synset('hardware.n.01'),
 Synset('naval_weaponry.n.01')]

In [14]:
wn.synset('lion.n.01').hyponyms()

[Synset('lion_cub.n.01'), Synset('lioness.n.01'), Synset('lionet.n.01')]

In [15]:
wn.synset('lion.n.01').hypernyms()

[Synset('big_cat.n.01')]

In [16]:
#Note that jaguar shares the same hypernym as lion:
wn.synset('jaguar.n.01').hypernyms()

[Synset('big_cat.n.01')]

In [17]:
wn.synset('arms.n.01').hypernyms()

[Synset('instrumentality.n.03')]

Perhaps we can use hyponyms to work downwards from 'animal' and grab them all?

In [18]:
wn.synset('animal.n.01').hyponyms()

[Synset('acrodont.n.01'),
 Synset('adult.n.02'),
 Synset('biped.n.01'),
 Synset('captive.n.02'),
 Synset('chordate.n.01'),
 Synset('creepy-crawly.n.01'),
 Synset('critter.n.01'),
 Synset('darter.n.02'),
 Synset('domestic_animal.n.01'),
 Synset('embryo.n.02'),
 Synset('feeder.n.06'),
 Synset('female.n.01'),
 Synset('fictional_animal.n.01'),
 Synset('game.n.04'),
 Synset('giant.n.01'),
 Synset('herbivore.n.01'),
 Synset('hexapod.n.01'),
 Synset('homeotherm.n.01'),
 Synset('insectivore.n.02'),
 Synset('invertebrate.n.01'),
 Synset('larva.n.01'),
 Synset('male.n.01'),
 Synset('marine_animal.n.01'),
 Synset('mate.n.03'),
 Synset('metazoan.n.01'),
 Synset('migrator.n.02'),
 Synset('molter.n.01'),
 Synset('mutant.n.02'),
 Synset('omnivore.n.02'),
 Synset('peeper.n.03'),
 Synset('pest.n.04'),
 Synset('pet.n.01'),
 Synset('pleurodont.n.01'),
 Synset('poikilotherm.n.01'),
 Synset('predator.n.02'),
 Synset('prey.n.02'),
 Synset('racer.n.03'),
 Synset('range_animal.n.01'),
 Synset('scavenger.n.03'

If we don't know what one of these is, we can use `.definition()`:

In [19]:
wn.synset('acrodont.n.01').definition()

'an animal having teeth consolidated with the summit of the alveolar ridge without sockets'

The word 'acrodont' has no hyponyms. But others do:

In [20]:
wn.synset('domestic_animal.n.01').hyponyms()

[Synset('dog.n.01'),
 Synset('domestic_cat.n.01'),
 Synset('feeder.n.01'),
 Synset('head.n.02'),
 Synset('stocker.n.01'),
 Synset('stray.n.01')]

In [21]:
wn.synset('work_animal.n.01').hyponyms()

[Synset('beast_of_burden.n.01'), Synset('draft_animal.n.01')]
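Returning to the idea of working downwards from 'animal' and grabbing everything: each call to `.hyponyms()` only goes one level down, so collecting them all means repeating the call on every result. A minimal sketch of that traversal follows. `all_hyponyms` is a hypothetical helper, not an NLTK function, and it assumes only that each object has a `.hyponyms()` method and is hashable, as synsets are.

```python
from collections import deque


def all_hyponyms(synset):
    """Collect every synset reachable by repeatedly following .hyponyms()."""
    found = []
    seen = {synset}
    queue = deque([synset])
    while queue:
        current = queue.popleft()
        for child in current.hyponyms():
            # A synset can be reached through more than one parent, so skip repeats
            if child not in seen:
                seen.add(child)
                found.append(child)
                queue.append(child)
    return found


# Usage with WordNet loaded as `wn` in the cells above:
# animals = all_hyponyms(wn.synset('animal.n.01'))
# print(len(animals))
```

NLTK synsets also provide a `closure()` method that does a similar job, e.g. `wn.synset('animal.n.01').closure(lambda s: s.hyponyms())`, which returns a generator.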

### Holonyms and meronyms

We can use `.member_holonyms()` and `.part_meronyms()` to see what the item is contained in (its holonyms) and what items make up the item (its meronyms).

In [22]:
wn.synset('animal.n.01').member_holonyms()

[Synset('animalia.n.01')]

In [23]:
wn.synset('animal.n.01').part_meronyms()

[Synset('face.n.07'), Synset('head.n.01')]
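Holonyms and meronyms each come in member, part and substance variants, and a given synset often has results for only some of them. The sketch below tries all six methods and keeps the non-empty ones. `part_whole_relations` and `RELATIONS` are hypothetical names introduced here for illustration.

```python
# The six part/whole methods available on a synset
RELATIONS = (
    "member_holonyms", "part_holonyms", "substance_holonyms",
    "member_meronyms", "part_meronyms", "substance_meronyms",
)


def part_whole_relations(synset):
    """Return {relation name: [synsets]} for every non-empty part/whole relation."""
    result = {}
    for relation in RELATIONS:
        related = getattr(synset, relation)()  # call e.g. synset.part_meronyms()
        if related:
            result[relation] = related
    return result


# Usage with WordNet loaded as `wn` in the cells above:
# print(part_whole_relations(wn.synset('animal.n.01')))
```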

## Measuring similarity

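The [Wu and Palmer method](http://search.cpan.org/~tpederse/WordNet-Similarity-1.03/lib/WordNet/Similarity/wup.pm) measures semantic relatedness. NLTK's `wup_similarity()` scores two synsets by how deep they and their most specific shared ancestor sit in the hypernym hierarchy, so a higher value means more closely related senses: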


In [24]:
print("The similarity between",publist[0],"and",publist[1],"is",wn.synset(publist[0]+'.n.01').wup_similarity(wn.synset(publist[1]+'.n.01')))

The similarity between arms and white is 0.5333333333333333


In [25]:
print("The similarity between",publist[1],"and",publist[2],"is",wn.synset(publist[1]+'.n.01').wup_similarity(wn.synset(publist[2]+'.n.01')))

The similarity between white and lion is 0.5217391304347826


In [26]:
print("The similarity between",publist[2],"and",publist[3],"is",wn.synset(publist[2]+'.n.01').wup_similarity(wn.synset(publist[3]+'.n.01')))

The similarity between lion and head is 0.18181818181818182
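The three cells above repeat the same expression with different indexes, so a loop over word pairs may be easier to maintain. The sketch below is one way to do that. `pairwise_similarity` is a hypothetical helper, `get_synset` is assumed to behave like `wn.synset`, and, as in the cells above, it assumes every word has a `.n.01` entry.

```python
from itertools import combinations


def pairwise_similarity(words, get_synset):
    """Map each unordered word pair to the Wu-Palmer score of their '.n.01' synsets."""
    scores = {}
    for first, second in combinations(words, 2):
        a = get_synset(first + ".n.01")
        b = get_synset(second + ".n.01")
        scores[(first, second)] = a.wup_similarity(b)
    return scores


# Usage with WordNet loaded as `wn` in the cells above:
# for pair, score in pairwise_similarity(publist, wn.synset).items():
#     print("The similarity between", pair[0], "and", pair[1], "is", score)
```

Unlike the cells above, this compares every pair in the list rather than only neighbouring words.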


In [27]:
print("The similarity between horse and lion is",wn.synset('horse.n.01').wup_similarity(wn.synset('lion.n.01')))

The similarity between horse and lion is 0.7333333333333333


In [28]:
#Note that 'engineers' plural throws an error here
print("The similarity between saracen and engineer is",wn.synset('saracen.n.01').wup_similarity(wn.synset('engineer.n.01')))

The similarity between saracen and engineer is 0.631578947368421
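To get a feel for where these numbers come from, the Wu and Palmer score is usually written as 2 × depth(LCS) / (depth(a) + depth(b)), where the LCS is the deepest ancestor the two items share. The sketch below applies that formula to a tiny hand-made taxonomy. The `PARENT` table is invented for illustration and is not WordNet data, and NLTK's own implementation handles details such as multiple parents differently, so the values will not match the outputs above.

```python
# Toy taxonomy (invented for illustration): each item maps to its single parent
PARENT = {
    "entity": None,
    "animal": "entity",
    "feline": "animal",
    "equine": "animal",
    "lion": "feline",
    "horse": "equine",
}


def path_to_root(node):
    """Return the node followed by each of its ancestors, ending at the root."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def depth(node):
    """Depth counted in nodes, so the root has depth 1."""
    return len(path_to_root(node))


def lowest_common_subsumer(a, b):
    """Return the deepest node that is an ancestor of (or equal to) both a and b."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):
        if node in ancestors_of_a:
            return node  # the first shared node when walking up from b is the deepest


def wu_palmer(a, b):
    lcs = lowest_common_subsumer(a, b)
    return 2 * depth(lcs) / (depth(a) + depth(b))


print(wu_palmer("lion", "horse"))  # 0.5: they only share 'animal' (depth 2)
print(wu_palmer("lion", "lion"))   # 1.0: an item is identical to itself
```

The deeper the shared ancestor relative to the two items, the closer the score gets to 1.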
