
## Tutorial 5: Part of Speech Taggers and Named Entity Recognition

***Creating a POS Tagger:*** Create a tagger that will identify parts of speech in a given sentence. 

Train a classifier to work out which suffixes are most informative for POS tagging. 

We can begin by finding out what the most common suffixes are.

Import the Brown corpus and the frequency distribution class:

In [1]:
import nltk
nltk.download('brown')
from nltk.corpus import brown
from nltk import FreqDist

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Determine the most frequent suffixes in the Brown corpus by counting the last 1, 2, and 3 characters of every word:


In [2]:
suffix_fdist = FreqDist()
for word in brown.words():
    word = word.lower()
    suffix_fdist[word[-1:]] += 1
    suffix_fdist[word[-2:]] += 1
    suffix_fdist[word[-3:]] += 1
    
suffix_fdist

FreqDist({'e': 202946,
          'he': 92084,
          'the': 70026,
          'n': 87889,
          'on': 33382,
          'ton': 1019,
          'y': 59146,
          'ty': 6458,
          'nty': 391,
          'd': 105687,
          'nd': 36418,
          'and': 31057,
          'ry': 7500,
          'ury': 482,
          'id': 4272,
          'aid': 2460,
          'ay': 6482,
          'day': 1613,
          'an': 17650,
          'ion': 14905,
          'f': 43173,
          'of': 72978,
          's': 128722,
          "'s": 5865,
          "a's": 202,
          't': 94459,
          'nt': 13151,
          'ent': 9369,
          'ary': 2122,
          'ed': 41527,
          'ced': 1262,
          '`': 8837,
          '``': 17674,
          'o': 42363,
          'no': 4402,
          'ce': 10953,
          'nce': 5971,
          "'": 10455,
          "''": 17639,
          'at': 25410,
          'hat': 12692,
          'ny': 3437,
          'any': 2793,
          'es': 22408,
  

Put the 100 most common suffixes into a list and print the top 10:


In [3]:
common_suffixes = [suffix for (suffix, count) in suffix_fdist.most_common(100)]
common_suffixes[:10]

['e', ',', '.', 's', 'd', 't', 'he', 'n', 'a', 'of']
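The punctuation entries appear because `brown.words()` yields punctuation marks as tokens, and a one-character token is its own suffix; short words such as "of" are likewise counted whole. The following is a minimal sketch, on a hypothetical toy word list rather than the Brown corpus, of restricting the count to alphabetic tokens that are longer than the suffix. `count_suffixes` and `toy_words` are illustrative names, not part of NLTK or of the tutorial's own code:

```python
from collections import Counter  # nltk.FreqDist is a Counter subclass

def count_suffixes(words, max_len=3):
    """Count 1..max_len character suffixes of alphabetic words (hypothetical helper)."""
    counts = Counter()
    for word in words:
        word = word.lower()
        if not word.isalpha():
            continue  # skip punctuation and numeric tokens
        for n in range(1, max_len + 1):
            if len(word) > n:  # do not count a whole word as its own suffix
                counts[word[-n:]] += 1
    return counts

# Hypothetical toy input standing in for brown.words().
toy_words = ["The", "jury", "said", ",", "walked", "talked", "of", "."]
toy_counts = count_suffixes(toy_words)
print(toy_counts.most_common(3))
```

Whether this filtering helps the tagger is an empirical question; punctuation tokens carry their own tags, so the unfiltered counts used above are also a reasonable choice.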

Next, we'll define a feature extractor function which checks a given word for these suffixes:

In [4]:
def pos_features(word):
    features = {}
    for suffix in common_suffixes:
        features['endswith({})'.format(suffix)] = word.lower().endswith(suffix)
    return features

pos_features('test')

{"endswith('')": False,
 "endswith(')": False,
 "endswith('s)": False,
 'endswith(()': False,
 'endswith())': False,
 'endswith(,)': False,
 'endswith(--)': False,
 'endswith(.)': False,
 'endswith(:)': False,
 'endswith(;)': False,
 'endswith(?)': False,
 'endswith(`)': False,
 'endswith(``)': False,
 'endswith(a)': False,
 'endswith(ad)': False,
 'endswith(al)': False,
 'endswith(an)': False,
 'endswith(and)': False,
 'endswith(are)': False,
 'endswith(as)': False,
 'endswith(at)': False,
 'endswith(ay)': False,
 'endswith(be)': False,
 'endswith(by)': False,
 'endswith(c)': False,
 'endswith(ce)': False,
 'endswith(ch)': False,
 'endswith(d)': False,
 'endswith(e)': False,
 'endswith(ed)': False,
 'endswith(en)': False,
 'endswith(ent)': False,
 'endswith(er)': False,
 'endswith(ere)': False,
 'endswith(ers)': False,
 'endswith(es)': False,
 'endswith(ey)': False,
 'endswith(f)': False,
 'endswith(for)': False,
 'endswith(g)': False,
 'endswith(h)': False,
 'endswith(had)': False,
 

Now that we've defined our feature extractor, we can use it to train a new decision tree classifier:

In [5]:
tagged_words = brown.tagged_words(categories='news')
featuresets = [(pos_features(n), g) for (n,g) in tagged_words]
#featuresets[0]

Import decision tree classifier and accuracy

In [None]:
from nltk import DecisionTreeClassifier
from nltk.classify import accuracy

Hold out the first 10% of the feature sets as the test set and train on the remaining 90%:

In [None]:
cutoff = int(len(featuresets) * 0.1)
train_set, test_set = featuresets[cutoff:], featuresets[:cutoff]

Train the classifier on the training set

NLTK is a teaching toolkit and is not optimized for speed.

Training may therefore take a long time. For faster training, consider the classifiers in scikit-learn.

In [None]:
classifier = DecisionTreeClassifier.train(train_set) 

In [None]:
##from sklearn.tree import DecisionTreeClassifier
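A minimal sketch of the scikit-learn route, assuming scikit-learn is installed. The toy feature sets are hypothetical stand-ins for `featuresets`, `DictVectorizer` converts the feature dictionaries into a matrix, and the import is aliased as `SkDecisionTree` (an illustrative name) so that it does not shadow NLTK's `DecisionTreeClassifier`:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.tree import DecisionTreeClassifier as SkDecisionTree

# Hypothetical toy data in the same (feature dict, tag) shape as `featuresets`.
toy_featuresets = [
    ({"endswith(ed)": True, "endswith(s)": False}, "VBD"),
    ({"endswith(ed)": False, "endswith(s)": True}, "NNS"),
    ({"endswith(ed)": False, "endswith(s)": False}, "NN"),
    ({"endswith(ed)": True, "endswith(s)": False}, "VBD"),
]
feature_dicts = [features for features, tag in toy_featuresets]
labels = [tag for features, tag in toy_featuresets]

# Turn the list of dicts into a sparse feature matrix.
vectorizer = DictVectorizer(sparse=True)
X = vectorizer.fit_transform(feature_dicts)

sk_classifier = SkDecisionTree(random_state=0)
sk_classifier.fit(X, labels)

# Classify a new feature dict using the same vectorizer.
new_word = {"endswith(ed)": False, "endswith(s)": True}
predicted = sk_classifier.predict(vectorizer.transform([new_word]))[0]
print(predicted)
```

The same pattern should carry over to the real `train_set` and `test_set` by splitting each into its dictionaries and its tags.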

In [None]:
accuracy(classifier, test_set)

In [None]:
classifier.classify(pos_features('cats'))

In [None]:
classifier.pseudocode(depth=4)

To improve the classifier, we can add contextual features:

def pos_features(sentence, i):
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    if i == 0:
        features["prev-word"] = "<START>"
    else:
        features["prev-word"] = sentence[i-1]
    return features

Then, instead of working with tagged words, we work with tagged sentences:

tagged_sents = brown.tagged_sents(categories='news')

We can then improve this further by adding more features, such as the tag of the previous word (prev-tag).
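A minimal sketch of how feature sets could be built from tagged sentences with the contextual extractor. The toy sentence is a hypothetical stand-in for `brown.tagged_sents(categories='news')`, and the extractor is renamed `context_pos_features` here only to avoid overwriting the earlier one-argument `pos_features`:

```python
def context_pos_features(sentence, i):
    """Suffix features of word i plus the previous word (sentence is a list of words)."""
    features = {"suffix(1)": sentence[i][-1:],
                "suffix(2)": sentence[i][-2:],
                "suffix(3)": sentence[i][-3:]}
    features["prev-word"] = "<START>" if i == 0 else sentence[i - 1]
    return features

# Hypothetical stand-in for brown.tagged_sents(categories='news').
toy_tagged_sents = [[("The", "AT"), ("jury", "NN"), ("said", "VBD")]]

context_featuresets = []
for tagged_sent in toy_tagged_sents:
    words = [word for word, tag in tagged_sent]  # strip the tags
    for i, (word, tag) in enumerate(tagged_sent):
        context_featuresets.append((context_pos_features(words, i), tag))

print(context_featuresets[1])
```

The extractor receives the untagged sentence and an index, because the features for one word now depend on its neighbours.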

***Parts of Speech and Meaning (English Only)***

Create a string, import the sentence and word tokenizers, split the lowercased text into sentences, and print the tokens of the third sentence (index 2)

In [None]:
t = "Cyprus, officially the Republic of Cyprus, is an island country in the Eastern Mediterranean and the third largest and third most populous island in the Mediterranean. Cyprus is located south of Turkey, west of Syria and Lebanon, northwest of Israel, north of Egypt, and southeast of Greece. Cyprus is a major tourist destination in the Mediterranean. With an advanced, high-income economy and a very high Human Development Index, the Republic of Cyprus has been a member of the Commonwealth since 1961 and was a founding member of the Non-Aligned Movement until it joined the European Union on 1 May 2004. On 1 January 2008, the Republic of Cyprus joined the eurozone."

nltk.download('punkt')
from nltk import sent_tokenize, word_tokenize
sentences = sent_tokenize(t.lower())
sentences

tokens = word_tokenize(sentences[2])
tokens

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


['cyprus',
 'is',
 'a',
 'major',
 'tourist',
 'destination',
 'in',
 'the',
 'mediterranean',
 '.']

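Import the part-of-speech tagger from NLTK and tag the tokens of that sentence. Note that 'cyprus' and 'mediterranean' are tagged as common nouns (NN) in the output, likely because the text was lowercased before tagging.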

In [None]:
from nltk import pos_tag
nltk.download('averaged_perceptron_tagger')
tags = pos_tag(tokens)
tags

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.


[('cyprus', 'NN'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('major', 'JJ'),
 ('tourist', 'NN'),
 ('destination', 'NN'),
 ('in', 'IN'),
 ('the', 'DT'),
 ('mediterranean', 'NN'),
 ('.', '.')]

Access documentation for tags, for example for NN:

In [None]:
import nltk.help
nltk.download('tagsets')
nltk.help.upenn_tagset('NN')

[nltk_data] Downloading package tagsets to /root/nltk_data...
[nltk_data]   Unzipping help/tagsets.zip.
NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


***Word senses for homonyms***

WordNet is a lexical database for the English language in the form of a semantic graph.

WordNet groups English words into sets of synonyms called synsets, provides short definitions and usage examples, and records a number of relations among these synonym sets or their members.

NLTK provides an interface to the WordNet API.

Download WordNet and list the synsets (sets of synonyms) that contain "human"

In [None]:
nltk.download('wordnet')
from nltk.corpus import wordnet as wn
wn.synsets('human')

[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data]   Unzipping corpora/wordnet.zip.


[Synset('homo.n.02'),
 Synset('human.a.01'),
 Synset('human.a.02'),
 Synset('human.a.03')]

Get the definition of the first synset for "human"

In [None]:
wn.synsets('human')[0].definition()

'any living or extinct member of the family Hominidae characterized by superior intelligence, articulate speech, and erect carriage'

Get the definition of the second synset for "human"

In [None]:
wn.synsets('human')[1].definition()

'characteristic of humanity'

Assign the first noun synset for "human" to the variable "human"

In [None]:
human = wn.synsets('Human', pos=wn.NOUN)[0]
human

Synset('homo.n.02')

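A hypernym is a word with a broad meaning that constitutes a category into which words with more specific meanings fall; it is also called a superordinate. A hyponym is the converse: a more specific word that falls under a hypernym.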


In [None]:
# For example, colour is a hypernym of red.
human.hypernyms() 

[Synset('hominid.n.01')]

In [None]:
human.hyponyms()

[Synset('homo_erectus.n.01'),
 Synset('homo_habilis.n.01'),
 Synset('homo_sapiens.n.01'),
 Synset('homo_soloensis.n.01'),
 Synset('neandertal_man.n.01'),
 Synset('rhodesian_man.n.01'),
 Synset('world.n.08')]

In [None]:
bike = wn.synsets('bicycle')[0]
bike

Synset('bicycle.n.01')

In [None]:
girl = wn.synsets('girl')[1]
girl

Synset('female_child.n.01')

The Wu-Palmer metric (WUP) measures similarity from the depths of the two synsets in the hypernym graph and the depth of their most specific common ancestor. NLTK provides several other similarity metrics as well.
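In its usual formulation, the Wu-Palmer score is 2 × depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest ancestor shared by a and b. The following is a minimal sketch on a hypothetical toy taxonomy, counting the root as depth 1. `toy_parents`, `path_to_root`, `depth`, and `toy_wup` are illustrative names; NLTK's implementation on WordNet also has to handle multiple hypernym paths, which this sketch does not model:

```python
# Hypothetical toy taxonomy: each node maps to its parent (None marks the root).
toy_parents = {
    "entity": None,
    "organism": "entity",
    "artifact": "entity",
    "person": "organism",
    "girl": "person",
    "bicycle": "artifact",
}

def path_to_root(node):
    """Return the list of nodes from `node` up to the root, inclusive."""
    path = [node]
    while toy_parents[path[-1]] is not None:
        path.append(toy_parents[path[-1]])
    return path

def depth(node):
    """Depth of a node, with the root at depth 1."""
    return len(path_to_root(node))

def toy_wup(a, b):
    """Wu-Palmer similarity on the toy tree."""
    ancestors_of_b = set(path_to_root(b))
    # Walking up from `a`, the first node that is also an ancestor of `b`
    # is the deepest common ancestor.
    lcs = next(node for node in path_to_root(a) if node in ancestors_of_b)
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(toy_wup("girl", "person"))
print(toy_wup("girl", "bicycle"))
```

Nodes that share only the root score low, and a node compared with itself scores 1.0.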

Get similarity between bike and human

In [None]:
bike.wup_similarity(human) 

0.34782608695652173

Get similarity between girl and human

In [None]:
girl.wup_similarity(human)

0.5217391304347826

Get synonyms for 'girl'

In [None]:
synonyms = []
for syn in wn.synsets('girl'):
    # A lemma is basically the dictionary form or base form of a word, as opposed to the various inflected forms of a word. 
    for lemma in syn.lemmas():
        synonyms.append(lemma.name())
synonyms

['girl',
 'miss',
 'missy',
 'young_lady',
 'young_woman',
 'fille',
 'female_child',
 'girl',
 'little_girl',
 'daughter',
 'girl',
 'girlfriend',
 'girl',
 'lady_friend',
 'girl']
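The list contains duplicates because "girl" is a lemma in several synsets. A minimal sketch of removing the duplicates while keeping the original order, using the lemma names copied from the output above; `unique_synonyms` is an illustrative name:

```python
# Lemma names copied from the output above.
synonyms = ['girl', 'miss', 'missy', 'young_lady', 'young_woman', 'fille',
            'female_child', 'girl', 'little_girl', 'daughter', 'girl',
            'girlfriend', 'girl', 'lady_friend', 'girl']

# dict.fromkeys keeps the first occurrence of each name, in order;
# underscores in multi-word lemma names are replaced with spaces.
unique_synonyms = [name.replace('_', ' ') for name in dict.fromkeys(synonyms)]
print(unique_synonyms)
```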

Get antonyms for 'girl'

In [None]:
antonyms = []
for syn in wn.synsets("girl"):
    for l in syn.lemmas():
        if l.antonyms():
            antonyms.append(l.antonyms()[0].name())
antonyms

['male_child', 'boy', 'son', 'boy']

***Chunking and Entity Recognition:***

**Chunking:** Divide a sentence into chunks. Usually each chunk contains a head and (optionally) additional words and modifiers. Examples of chunks include noun groups and verb groups.



In [None]:
from nltk.chunk import RegexpParser

In order to create a chunker, we need to first define a chunk grammar, consisting of rules that indicate how sentences should be chunked.

We can define a simple grammar for a noun phrase (NP) chunker with a single regular-expression rule. This rule says that an NP chunk should be formed whenever the chunker finds an optional determiner (DT) followed by any number of adjectives (JJ) and then a noun (NN).

Note that tokens which do not belong to a noun phrase are left unchunked, which is expected:

In [None]:
grammar = "NP: {<DT>?<JJ>*<NN>}"
import matplotlib
matplotlib.use('Agg')

In [None]:
### Graphical rendering of the tree fails here (TclError: no display name and no $DISPLAY environment variable); the text form of the tree is still returned

chunker = RegexpParser(grammar)
result = chunker.parse(tags)
result

TclError: ignored

Tree('S', [Tree('NP', [('cyprus', 'NN')]), ('is', 'VBZ'), Tree('NP', [('a', 'DT'), ('major', 'JJ'), ('tourist', 'NN')]), Tree('NP', [('destination', 'NN')]), ('in', 'IN'), Tree('NP', [('the', 'DT'), ('mediterranean', 'NN')]), ('.', '.')])
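A minimal sketch of working with the chunk tree as text, which sidesteps the display problem. The tagged tokens are written out by hand from the pos_tag output above so that the block runs on its own; `toy_tags`, `toy_chunker`, `toy_tree`, and `noun_phrases` are illustrative names:

```python
from nltk.chunk import RegexpParser

# Tagged tokens copied by hand from the pos_tag output above.
toy_tags = [("cyprus", "NN"), ("is", "VBZ"), ("a", "DT"), ("major", "JJ"),
            ("tourist", "NN"), ("destination", "NN"), ("in", "IN"),
            ("the", "DT"), ("mediterranean", "NN"), (".", ".")]

toy_chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>}")
toy_tree = toy_chunker.parse(toy_tags)

# Collect the words of each NP subtree as plain strings.
noun_phrases = [" ".join(word for word, tag in subtree.leaves())
                for subtree in toy_tree.subtrees(lambda t: t.label() == "NP")]

print(toy_tree)       # bracketed text rendering, no display needed
print(noun_phrases)
```

Because the rule allows only a single noun per chunk, "a major tourist" and "destination" end up in separate NP chunks.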

***Entity Recognition:*** The goal of entity recognition is to detect entities such as Person, Location, Time, etc.

In [None]:
###DOES NOT WORK: no display name and no $DISPLAY environment variable
nltk.download('maxent_ne_chunker')
nltk.download('words')
from nltk.chunk import ne_chunk # ne = named entity
ne_chunk(tags)

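Note that ne_chunk did not detect any entities in our sentence. A likely reason is that the text was lowercased before tagging, so 'cyprus' was tagged NN rather than as a proper noun, and the chunker relies on such cues. It is also quite limited, recognizing only the following entity types: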

FACILITY, GPE (Geo-Political Entity), GSP (Geo-Socio-Political group), LOCATION, ORGANIZATION, PERSON
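When ne_chunk does find entities, it returns them as labelled subtrees of the sentence tree. The following is a minimal sketch of pulling them out. The tree is built by hand in that shape so the block runs without the chunker models, and its GPE label is illustrative, not actual ne_chunk output. `toy_ne_tree` and `extract_entities` are hypothetical names:

```python
from nltk import Tree

# Hand-built tree in the shape ne_chunk returns; the GPE label is
# illustrative and was not produced by ne_chunk.
toy_ne_tree = Tree("S", [
    Tree("GPE", [("Cyprus", "NNP")]),
    ("is", "VBZ"), ("a", "DT"), ("major", "JJ"),
    ("tourist", "NN"), ("destination", "NN"), (".", "."),
])

def extract_entities(tree):
    """Return (entity text, label) pairs for each labelled subtree under the root."""
    entities = []
    for child in tree:
        if isinstance(child, Tree):  # plain (word, tag) tuples are not entities
            text = " ".join(word for word, tag in child.leaves())
            entities.append((text, child.label()))
    return entities

print(extract_entities(toy_ne_tree))
```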