# NLTK tutorial
(From https://www.nltk.org/)
NLTK is a leading platform for building Python programs to work with human language data. It provides easy-to-use interfaces to over 50 corpora and lexical resources such as WordNet, along with a suite of text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.

This tutorial covers the following sections:

1. Tokenizer
2. Stemmer
3. WordNet
4. Tips for the assignments

In [11]:
pip install nltk

Collecting nltk
  Downloading nltk-3.5.zip (1.4 MB)
Collecting click
  Downloading click-7.1.2-py2.py3-none-any.whl (82 kB)
Collecting regex
  Downloading regex-2020.7.14-cp38-cp38-win_amd64.whl (264 kB)
Collecting tqdm
  Downloading tqdm-4.49.0-py2.py3-none-any.whl (69 kB)
Building wheels for collected packages: nltk
  Building wheel for nltk (setup.py): started
  Building wheel for nltk (setup.py): finished with status 'done'
  Created wheel for nltk: filename=nltk-3.5-py3-none-any.whl size=1434680 sha256=58bfd6b52d33ffc7a235bc3c7ff0fdeecdea11b521b6384fc919bb0b484ad4f2
  Stored in directory: c:\users\elva\appdata\local\pip\cache\wheels\ff\d5\7b\f1fb4e1e1603b2f01c2424dd60fbcc50c12ef918bafc44b155
Successfully built nltk
Installing collected packages: click, regex, tqdm, nltk
Successfully installed click-7.1.2 nltk-3.5 regex-2020.7.14 tqdm-4.49.0
Note: you may need to restart the kernel to use updated packages.


You should consider upgrading via the 'C:\Users\Elva\AppData\Local\Programs\Python\Python38\python.exe -m pip install --upgrade pip' command.


In [12]:
pip install numpy

Note: you may need to restart the kernel to use updated packages.




# 1. NLTK Tokenizer

In [13]:
import nltk
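nltk.download('punkt') # tokenizer models needed by nltk.word_tokenize (newer NLTK releases may ask for 'punkt_tab' instead)
nltk.download('wordnet') # WordNet data, used for lemmatization and in section 3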

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Elva\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping tokenizers\punkt.zip.
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\Elva\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\wordnet.zip.


True

In [15]:
text1 = "Text mining is to identify useful information."
text2 = "Current NLP models isn't able to solve NLU perfectly."

print("string.split tokenizer", text1.split(" "))
print("string.split tokenizer", text2.split(" "))

string.split tokenizer ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information.']
string.split tokenizer ['Current', 'NLP', 'models', "isn't", 'able', 'to', 'solve', 'NLU', 'perfectly.']


Plain `str.split` cannot separate punctuation, e.g., full stops and apostrophes, from the words.

In [16]:
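import regex # third-party regular expression module (installed with nltk); the built-in re works the same here
# raw strings avoid invalid-escape warnings for \s
print("regular expression tokenizer", regex.split(r"[\s.]", text1))
print("regular expression tokenizer", regex.split(r"[\s.]", text2))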

regular expression tokenizer ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '']
regular expression tokenizer ['Current', 'NLP', 'models', "isn't", 'able', 'to', 'solve', 'NLU', 'perfectly', '']


- Here, `str.split` cannot deal with punctuation.
- A simple regular expression can handle most punctuation but may fail on contractions such as "isn't", "wasn't", and "can't".
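
To make the contraction problem concrete, here is a minimal sketch that uses the standard-library `re` module instead of `regex`. The pattern and the helper name `simple_regex_tokenize` are illustrative assumptions, not part of NLTK.

```python
import re

# Hypothetical pattern: a word (optionally with an internal apostrophe part)
# or a single punctuation character.
TOKEN_PATTERN = re.compile(r"\w+(?:'\w+)?|[^\w\s]")

def simple_regex_tokenize(text):
    """Split text into word and punctuation tokens with one regular expression."""
    return TOKEN_PATTERN.findall(text)

print(simple_regex_tokenize("Text mining is to identify useful information."))
print(simple_regex_tokenize("Current NLP models isn't able to solve NLU perfectly."))
```

With this pattern the full stop becomes its own token, but `isn't` stays a single token. Splitting it into a verb and a negation needs extra rules, which is what a dedicated tokenizer provides.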

In [17]:
def tokenize(text):
    """
    :param text: a doc with multiple sentences, type: str
    return a word list, type: list
    e.g.
    Input: 'Text mining is to identify useful information.'
    Output: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    """
    return nltk.word_tokenize(text)

In [18]:
print(tokenize(text1))
print(tokenize(text2))

['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
['Current', 'NLP', 'models', 'is', "n't", 'able', 'to', 'solve', 'NLU', 'perfectly', '.']


In [7]:
# Other examples:
# 1. Apostrophes: possessives and contractions (Bob's, isn't, I've, ...)
tokens = tokenize("Bob's text mining skills are perfect.")
print(tokens)
# 2. Parentheses
tokens = tokenize("Bob's text mining skills (or, NLP) are perfect.")
print(tokens)
# 3. Ellipsis
tokens = tokenize("Bob's text mining skills are perfect...")
print(tokens)

['Bob', "'s", 'text', 'mining', 'skills', 'are', 'perfect', '.']
['Bob', "'s", 'text', 'mining', 'skills', '(', 'or', ',', 'NLP', ')', 'are', 'perfect', '.']
['Bob', "'s", 'text', 'mining', 'skills', 'are', 'perfect', '...']


# 2. Stemming and lemmatization

(https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html)

Stemming: a crude heuristic that chops off the ends of words to approximate the root, often including the removal of derivational affixes.

e.g., wanted -> want, trees -> tree. Irregular forms such as gone -> go are not reachable by chopping suffixes and need lemmatization.

Lemmatization: uses a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.

Differences:
Stemming most commonly collapses derivationally related words, whereas lemmatization commonly only collapses the different inflectional forms of a lemma (it keeps the concrete semantic meaning).

E.g., useful -> use (stemming), useful -> useful (lemmatization)

PorterStemmer:

A rule-based method. E.g., SSES -> SS (misses -> miss), IES -> I (flies -> fli), S -> dropped (nouns -> noun).

Doc: https://www.nltk.org/api/nltk.stem.html
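
The suffix rules listed above can be sketched in a few lines. This is only a toy illustration of the plural-suffix rules (step 1a of the original Porter algorithm), not the full stemmer, and the function name `strip_plural_suffix` is hypothetical.

```python
def strip_plural_suffix(word):
    """Apply the first matching rule: SSES -> SS, IES -> I, SS -> SS, S -> ''."""
    # Longer suffixes come first so that "sses" is not handled by the plain "s" rule.
    rules = [("sses", "ss"), ("ies", "i"), ("ss", "ss"), ("s", "")]
    lower = word.lower()
    for suffix, replacement in rules:
        if lower.endswith(suffix):
            return lower[: len(lower) - len(suffix)] + replacement
    return lower

for w in ["misses", "flies", "class", "nouns", "mining"]:
    print(w, "->", strip_plural_suffix(w))
```

The real `PorterStemmer` applies several more steps after this one (for example, removing `-ing` and `-ful`), which is why its output for a word like `mining` differs from this sketch.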

In [19]:
from nltk.stem import PorterStemmer
ps = PorterStemmer()

def stem(tokens):
    """
    :param tokens: a list of tokens, type: list
    return a list of stemmed words, type: list
    e.g.
    Input: ['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']
    Output: ['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']
    """
    ### equivalent code
    # results = list()
    # for token in tokens:
    #     results.append(ps.stem(token))
    # return results

    return [ps.stem(token) for token in tokens]

In [20]:
tokens = stem(tokenize("Text mining is to identify useful information."))
print(tokens)

['text', 'mine', 'is', 'to', 'identifi', 'use', 'inform', '.']


In [21]:
from nltk.stem import WordNetLemmatizer
lm = WordNetLemmatizer()
def lemmatize(tokens):
    # lemmatize() treats each token as a noun unless a pos argument (e.g. pos='v') is given
    return [lm.lemmatize(token) for token in tokens]

In [22]:
tokens = lemmatize(tokenize("Text mining is to identify useful information."))
print(tokens)

['Text', 'mining', 'is', 'to', 'identify', 'useful', 'information', '.']


# 3. WordNet

https://www.nltk.org/howto/wordnet.html

- a semantically-oriented dictionary of English,
- similar to a traditional thesaurus but with a richer structure

In [23]:
from nltk.corpus import wordnet as wn

### 3.1 synsets

A set of one or more **synonyms** that are interchangeable in some context without changing the truth value of the proposition in which they are embedded.

In [24]:
# Look up all synsets (senses) of a word using synsets()
wn.synsets('dog')

[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]

In [25]:
wn.synsets('bank')

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10'),
 Synset('bank.v.01'),
 Synset('bank.v.02'),
 Synset('bank.v.03'),
 Synset('bank.v.04'),
 Synset('bank.v.05'),
 Synset('deposit.v.02'),
 Synset('bank.v.07'),
 Synset('trust.v.01')]

In [26]:
print("synset","\t","definition")
for synset in wn.synsets('bank'):
    print(synset, '\t', synset.definition())

synset 	 definition
Synset('bank.n.01') 	 sloping land (especially the slope beside a body of water)
Synset('depository_financial_institution.n.01') 	 a financial institution that accepts deposits and channels the money into lending activities
Synset('bank.n.03') 	 a long ridge or pile
Synset('bank.n.04') 	 an arrangement of similar objects in a row or in tiers
Synset('bank.n.05') 	 a supply or stock held in reserve for future use (especially in emergencies)
Synset('bank.n.06') 	 the funds held by a gambling house or the dealer in some gambling games
Synset('bank.n.07') 	 a slope in the turn of a road or track; the outside is higher than the inside in order to reduce the effects of centrifugal force
Synset('savings_bank.n.02') 	 a container (usually with a slot in the top) for keeping money at home
Synset('bank.n.09') 	 a building in which the business of banking transacted
Synset('bank.n.10') 	 a flight maneuver; aircraft tips laterally about its longitudinal axis (especially in turni

In [27]:
# this function has an optional pos argument which lets you constrain the part of speech of the word:
# pos: part-of-speech
wn.synsets('bank', pos=wn.NOUN)

[Synset('bank.n.01'),
 Synset('depository_financial_institution.n.01'),
 Synset('bank.n.03'),
 Synset('bank.n.04'),
 Synset('bank.n.05'),
 Synset('bank.n.06'),
 Synset('bank.n.07'),
 Synset('savings_bank.n.02'),
 Synset('bank.n.09'),
 Synset('bank.n.10')]

In [28]:
wn.synset('dog.n.01')

Synset('dog.n.01')

In [29]:
print(wn.synset('dog.n.01').definition())

a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds


In [30]:
wn.synset('dog.n.01').examples()

['the dog barked all night']

In [31]:
wn.synset('dog.n.01').lemma_names()

['dog', 'domestic_dog', 'Canis_familiaris']

In [32]:
dir(wn.synset('dog.n.01'))
# is-a: hypernyms (more general), hyponyms (more specific)
# part-of (wholes this synset belongs to): member_holonyms, substance_holonyms, part_holonyms
# has-part (parts of this synset): member_meronyms, substance_meronyms, part_meronyms
# domains: topic_domains, region_domains, usage_domains
# attributes: attributes
# entailments: entailments
# causes: causes
# see also: also_sees
# verb groups: verb_groups
# similar to: similar_tos

['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 '__eq__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__gt__',
 '__hash__',
 '__init__',
 '__init_subclass__',
 '__le__',
 '__lt__',
 '__module__',
 '__ne__',
 '__new__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__setattr__',
 '__sizeof__',
 '__slots__',
 '__str__',
 '__subclasshook__',
 '__weakref__',
 '_all_hypernyms',
 '_definition',
 '_examples',
 '_frame_ids',
 '_hypernyms',
 '_instance_hypernyms',
 '_iter_hypernym_lists',
 '_lemma_names',
 '_lemma_pointers',
 '_lemmas',
 '_lexname',
 '_max_depth',
 '_min_depth',
 '_name',
 '_needs_root',
 '_offset',
 '_pointers',
 '_pos',
 '_related',
 '_shortest_hypernym_paths',
 '_wordnet_corpus_reader',
 'also_sees',
 'attributes',
 'causes',
 'closure',
 'common_hypernyms',
 'definition',
 'entailments',
 'examples',
 'frame_ids',
 'hypernym_distances',
 'hypernym_paths',
 'hypernyms',
 'hyponyms',
 'in_region_domains',
 'in_topic_domains',
 'in_usage_domains',
 '

See more relations at http://www.nltk.org/api/nltk.corpus.reader.html?highlight=wordnet

In [33]:
# hypernyms: abstraction
# hyponyms: instantiation

dog = wn.synset('dog.n.01')
print("hypernyms:", dog.hypernyms())
print("hyponyms:", dog.hyponyms())

hypernyms: [Synset('canine.n.02'), Synset('domestic_animal.n.01')]
hyponyms: [Synset('basenji.n.01'), Synset('corgi.n.01'), Synset('cur.n.01'), Synset('dalmatian.n.02'), Synset('great_pyrenees.n.01'), Synset('griffon.n.02'), Synset('hunting_dog.n.01'), Synset('lapdog.n.01'), Synset('leonberg.n.01'), Synset('mexican_hairless.n.01'), Synset('newfoundland.n.01'), Synset('pooch.n.01'), Synset('poodle.n.01'), Synset('pug.n.01'), Synset('puppy.n.01'), Synset('spitz.n.01'), Synset('toy_dog.n.01'), Synset('working_dog.n.01')]


In [34]:
print(dog.hypernyms()[0].hypernyms()) # the hypernym of canine
# carnivore: animals that feed on flesh
print(dog.hypernyms()[0].hypernyms()[0].hypernyms()) # the hypernym of carnivore
# placental: placental mammals
print(dog.hypernyms()[0].hypernyms()[0].hypernyms()[0].hypernyms()) # the hypernym of placental
# mammal
# ...
print("root hypernyms for dog:", dog.root_hypernyms())

[Synset('carnivore.n.01')]
[Synset('placental.n.01')]
[Synset('mammal.n.01')]
root hypernyms for dog: [Synset('entity.n.01')]


In [35]:
# find common hypernyms
print("hypernyms for cat:", wn.synset('cat.n.01').hypernyms())
print("root hypernyms for cat:", wn.synset('cat.n.01').root_hypernyms())
print("the lowest common hypernyms of dog and cat")
print(wn.synset('dog.n.01').lowest_common_hypernyms(wn.synset('cat.n.01')))

hypernyms for cat: [Synset('feline.n.01')]
root hypernyms for cat: [Synset('entity.n.01')]
the lowest common hypernyms of dog and cat
[Synset('carnivore.n.01')]
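
The result above (carnivore as the lowest common hypernym of dog and cat) can be reproduced on a tiny hand-written taxonomy. The sketch below is an illustration under simplifying assumptions: each node has exactly one parent (real WordNet synsets such as `dog.n.01` can have several hypernyms), and the `PARENT` table and function names are hypothetical.

```python
# Toy child -> parent table, hand-written from the hypernym chains printed above.
PARENT = {
    "dog": "canine",
    "canine": "carnivore",
    "cat": "feline",
    "feline": "carnivore",
    "carnivore": "placental",
    "placental": "mammal",
}

def hypernym_chain(node):
    """Return the node followed by all of its ancestors, nearest first."""
    chain = [node]
    while node in PARENT:
        node = PARENT[node]
        chain.append(node)
    return chain

def lowest_common_hypernym(a, b):
    """Return the first ancestor of `a` (walking upwards) that is also an ancestor of `b`."""
    ancestors_of_b = set(hypernym_chain(b))
    for node in hypernym_chain(a):
        if node in ancestors_of_b:
            return node
    return None

print(hypernym_chain("dog"))
print(lowest_common_hypernym("dog", "cat"))
```

Walking upwards from one node and stopping at the first shared ancestor gives the most specific common concept, which is the idea behind `lowest_common_hypernyms`.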


### 3.2 Similarity

In [36]:
dog = wn.synset('dog.n.01')
corgi = wn.synset('corgi.n.01')
basenji = wn.synset('basenji.n.01')
cat = wn.synset('cat.n.01')

In [37]:
dog.path_similarity(cat) # dog <- canine <- carnivore -> feline -> cat

0.2

In [38]:
dog.path_similarity(corgi) # corgi <- dog

0.5

In [39]:
corgi.path_similarity(basenji) # basenji <- dog -> corgi

0.3333333333333333
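
The three scores above are consistent with path similarity being 1 / (shortest path length + 1) over hypernym/hyponym links. The sketch below reproduces them on a hand-written toy graph with a breadth-first search. The edge list is an assumption copied from the path comments above, not queried from WordNet, and `toy_path_similarity` is a hypothetical helper.

```python
from collections import deque

# Toy is-a links (child, parent) taken from the path comments above; not real WordNet data.
EDGES = [
    ("corgi", "dog"), ("basenji", "dog"), ("dog", "canine"),
    ("canine", "carnivore"), ("feline", "carnivore"), ("cat", "feline"),
]

def build_neighbors(edges):
    """Treat is-a links as undirected so a path can go up and down the taxonomy."""
    neighbors = {}
    for child, parent in edges:
        neighbors.setdefault(child, set()).add(parent)
        neighbors.setdefault(parent, set()).add(child)
    return neighbors

def toy_path_similarity(a, b, edges=EDGES):
    """Return 1 / (shortest path length + 1), or None if the nodes are not connected."""
    neighbors = build_neighbors(edges)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return 1 / (dist + 1)
        for nxt in neighbors.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

print(toy_path_similarity("dog", "cat"))        # 4 links -> 1/5
print(toy_path_similarity("dog", "corgi"))      # 1 link  -> 1/2
print(toy_path_similarity("corgi", "basenji"))  # 2 links -> 1/3
```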

In [40]:
hit = wn.synset('hit.v.01')
slap = wn.synset('slap.v.01')
jump = wn.synset('jump.v.01')
run = wn.synset('run.v.01')

In [41]:
hit.path_similarity(slap) # 1/7

0.14285714285714285

In [42]:
hit.path_similarity(jump) # 1/6

0.16666666666666666

Also check:
- wup_similarity
- lch_similarity
- res_similarity
...

Find more at https://www.nltk.org/howto/wordnet.html

### 3.3 Traverse the synsets to build a graph

In [43]:
from itertools import islice

wn_graph_hypernyms = {}
# or you could use the networkx package

# only the first 10 noun synsets, to keep the example small
for synset in islice(wn.all_synsets('n'), 10):
    for hyp_syn in synset.hypernyms():
        # map each synset name to a dict of its direct hypernym names
        wn_graph_hypernyms.setdefault(synset.name(), {})[hyp_syn.name()] = True

In [44]:
wn_graph_hypernyms['physical_entity.n.01']['entity.n.01']

True
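
Once such a nested dictionary exists, it can be traversed without querying WordNet again. A minimal sketch, assuming a graph in the same `{synset_name: {hypernym_name: True}}` shape; the small dictionary below is hand-written for illustration and `all_ancestors` is a hypothetical helper.

```python
def all_ancestors(graph, start):
    """Collect every node reachable from `start` by following hypernym links."""
    ancestors = []
    stack = list(graph.get(start, {}))
    while stack:
        node = stack.pop()
        if node not in ancestors:
            ancestors.append(node)
            # continue upwards from this hypernym, if it has any recorded parents
            stack.extend(graph.get(node, {}))
    return ancestors

# Hand-written toy graph in the same shape as wn_graph_hypernyms
toy_graph = {
    "physical_entity.n.01": {"entity.n.01": True},
    "object.n.01": {"physical_entity.n.01": True},
    "thing.n.12": {"physical_entity.n.01": True},
}

print(all_ancestors(toy_graph, "object.n.01"))
```

For larger graphs, a set for the visited nodes or a graph library such as networkx would likely be a better fit than the list used here.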

# 4. Tips for the assignments

NLTK ships with several corpora; two of them are used here.

Reference: https://www.nltk.org/book/ch02.html. Search that page for `gutenberg` and `brown` for detailed documentation.

### 4.1 gutenberg corpus

In [45]:
from nltk.corpus import gutenberg as gb
nltk.download("gutenberg")

[nltk_data] Downloading package gutenberg to
[nltk_data]     C:\Users\Elva\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\gutenberg.zip.


True

In [46]:
file_id = 'austen-sense.txt'
word_list = gb.words(file_id)

In [47]:
print(word_list[:100])

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']', 'CHAPTER', '1', 'The', 'family', 'of', 'Dashwood', 'had', 'long', 'been', 'settled', 'in', 'Sussex', '.', 'Their', 'estate', 'was', 'large', ',', 'and', 'their', 'residence', 'was', 'at', 'Norland', 'Park', ',', 'in', 'the', 'centre', 'of', 'their', 'property', ',', 'where', ',', 'for', 'many', 'generations', ',', 'they', 'had', 'lived', 'in', 'so', 'respectable', 'a', 'manner', 'as', 'to', 'engage', 'the', 'general', 'good', 'opinion', 'of', 'their', 'surrounding', 'acquaintance', '.', 'The', 'late', 'owner', 'of', 'this', 'estate', 'was', 'a', 'single', 'man', ',', 'who', 'lived', 'to', 'a', 'very', 'advanced', 'age', ',', 'and', 'who', 'for', 'many', 'years', 'of', 'his', 'life', ',', 'had', 'a', 'constant', 'companion']


In [54]:
sents = gb.sents(file_id)

In [55]:
sents[0]

['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']']

### 4.2 brown corpus

In [58]:
from nltk.corpus import brown
nltk.download("brown")
print(brown.categories())

romance_word_list = brown.words(categories='romance')

[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\Elva\AppData\Roaming\nltk_data...
[nltk_data]   Unzipping corpora\brown.zip.


['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']


In [63]:
romance_word_list[:50]

['They',
 'neither',
 'liked',
 'nor',
 'disliked',
 'the',
 'Old',
 'Man',
 '.',
 'To',
 'them',
 'he',
 'could',
 'have',
 'been',
 'the',
 'broken',
 'bell',
 'in',
 'the',
 'church',
 'tower',
 'which',
 'rang',
 'before',
 'and',
 'after',
 'Mass',
 ',',
 'and',
 'at',
 'noon',
 ',',
 'and',
 'at',
 'six',
 'each',
 'evening',
 '--',
 'its',
 'tone',
 ',',
 'repetitive',
 ',',
 'monotonous',
 ',',
 'never',
 'breaking',
 'the',
 'boredom']
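
For frequency questions on a word list like this one, `collections.Counter` is usually enough. The sketch below runs on a short hand-written token list (an assumption, so that it works without downloading the corpus) and shows how punctuation tokens can dominate the counts unless they are filtered out. `word_frequencies` is a hypothetical helper.

```python
from collections import Counter

def word_frequencies(tokens, keep_punctuation=True):
    """Count lower-cased tokens, optionally skipping tokens with no letters or digits."""
    counts = Counter()
    for token in tokens:
        if not keep_punctuation and not any(ch.isalnum() for ch in token):
            continue
        counts[token.lower()] += 1
    return counts

# Hand-written sample tokens, not taken from the Brown corpus
tokens = ["The", "bell", "rang", ",", "and", ",", "again", ",", "the", "bell", "rang", "."]

print(word_frequencies(tokens).most_common(3))
print(word_frequencies(tokens, keep_punctuation=False).most_common(3))
```

Lower-casing merges `The` and `the`; whether that is wanted depends on the question being asked, so treat it as a design choice rather than a requirement.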

In [13]:

def q1():
    print('q1: {:}'.format(''))
    from nltk.corpus import gutenberg as gb
    import nltk
    file_id = 'austen-sense.txt'
    word_list = gb.words(file_id)
    # YOUR CODE
    # 1. Print the number of word tokens in the corpus.
    print("1. Print the number of word tokens in the corpus:")
    print(len(word_list))
    # 2. Print the size of the vocabulary (number of unique word tokens).
    print()
    print("2. Print the size of the vocabulary (number of unique word tokens).")
    print(len(set(word_list)))
    print()
    # 3. Print the tokenized words of the first sentence in the corpus.
    print("3. Print the tokenized words of the first sentence in the corpus.")
    # gb.sents() yields already-tokenized sentences; re-join the first one and tokenize it with NLTK
    first_sents = gb.sents(file_id)[0]
    s = " ".join(first_sents)
    print(nltk.word_tokenize(s))

def q2():
    print('q2: {:}'.format(''))
    import nltk
    from nltk.corpus import brown
    # Your Code
    from collections import Counter
    words_given = ['ring', 'activities', 'love', 'sports', 'church']
    romance_word_list = brown.words(categories='romance')
    counter = Counter(romance_word_list)
    # 1. Print the top 10 most common words in the romance category.
    print("1. Print the top 10 most common words in the romance category:")
    # most_common(10) returns (word, count) pairs; keep only the words
    print([word for word, _ in counter.most_common(10)])
    print()
    # 2. Print the word frequencies.
    print("2. Print the word frequencies:")
    for word in words_given:
        print('frequency of ' + word + ' in romance: ' + str(counter[word]))
    hobbies_word_list = brown.words(categories='hobbies')
    counter = Counter(hobbies_word_list)
    print()
    for word in words_given:
        print('frequency of ' + word + ' in hobbies: ' + str(counter[word]))

def q3():
    print('q3: {:}'.format(''))
    from nltk.corpus import wordnet as wn
    # Your Code
    # 1. Print all synonymous words of the word ‘dictionary’.
    print('1. All synonymous words of the word \'dictionary\':')
    print(wn.synset('dictionary.n.01').lemma_names())
    print()
    # 2. Print all hyponyms of the word ‘dictionary’.
    print("2. Hyponyms of the word \'dictionary\':")
    dictionary = wn.synset('dictionary.n.01')
    hyp = dictionary.hyponyms()
    print(hyp)
    print()
    # 3. Calculate similarities.
    right_whale = wn.synset('right_whale.n.01')
    novel = wn.synset('novel.n.01')
    minke_whale = wn.synset('minke_whale.n.01')
    tortoise = wn.synset('tortoise.n.01')

    pair1 = right_whale.path_similarity(novel)
    pair2 = right_whale.path_similarity(minke_whale)
    pair3 = right_whale.path_similarity(tortoise)
    print("3.Calculate similarities:")
    print("Similarity of \'right_whale\' and \'minke_whale\': " + str(pair2))
    print("Similarity of \'right_whale\' and \'tortoise\': " + str(pair3))
    print("Similarity of \'right_whale\' and \'novel\': " + str(pair1))

q1()
print()
q2()
print()
q3()

q1: 
1. Print the number of word tokens in the corpus:
141576

2. Print the size of the vocabulary (number of unique word tokens).
6833

3. Print the tokenized words of the first sentence in the corpus.
['[', 'Sense', 'and', 'Sensibility', 'by', 'Jane', 'Austen', '1811', ']']

q2: 
1. Print the top 10 most common words in the romance category:
[',', '.', 'the', 'and', 'to', 'a', 'of', '``', "''", 'was']

2. Print the word frequencies:
frequency of ring in romance: 2
frequency of activities in romance: 1
frequency of love in romance: 32
frequency of sports in romance: 3
frequency of church in romance: 29

frequency of ring in hobbies: 11
frequency of activities in hobbies: 11
frequency of love in hobbies: 6
frequency of sports in hobbies: 12
frequency of church in hobbies: 1

q3: 
1. All synonymous words of the word 'dictionary':
['dictionary', 'lexicon']

2. Hyponyms of the word 'dictionary':
[Synset('bilingual_dictionary.n.01'), Synset('desk_dictionary.n.01'), Synset('etymological_dic