# Learning Objectives 

In this lab we are going to get hands on experience with Term Extraction.

In [None]:
# setting the stage, as usual with colab ;)
import nltk
nltk.download('all')


### Point-Wise Mutual Information
Mutual Information is a measure on how much one word tells about the other.

We consider the task of identifying word collocations in a corpus to demonstrate the use of Pointwise
Mutual Information (PMI).

The two most common types of collocation are bigrams and trigrams. PMI can generalize to any n-grams.

### Two Word Collocations

* suppose the collocation candidate is $(w_1, w_2)$
* if both $w_1$ and $w_2$ are frequent, they may co-locate just by chance, even if they don't form a collocation
* we want to test if this collocation occurs significantly more often than by chance
* for that we can compare how often $w_1$ and $w_2$ occur independently, and how often they occur together

In [2]:
# Use the given text as example to find the 10 best collocations based on PMI.

text ='''
Natural Language Processing (NLP) is a sub-field of Artificial Intelligence that is focused on enabling computers 
to understand and process human languages, to get computers closer to a human-level understanding of language. 
Computers don’t yet have the same intuitive understanding of natural language that humans do. 
They can’t really understand what the language is really trying to say. In a nutshell, a computer can’t read between the lines.

That being said, recent advances in Machine Learning (ML) have enabled computers to do quite a lot of useful things 
with natural language! Deep Learning has enabled us to write programs to perform things like language translation, 
semantic understanding, and text summarization. All of these things add real-world value, making it easy for you to 
understand and perform computations on large blocks of text without the manual effort.
'''

In [3]:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()

# find all bigrams
finder = BigramCollocationFinder.from_words(nltk.wordpunct_tokenize(text))

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)           # try and change pmi to raw_freq

[('!', 'Deep'),
 ('Artificial', 'Intelligence'),
 ('Computers', 'don'),
 ('Language', 'Processing'),
 ('Natural', 'Language'),
 ('That', 'being'),
 ('add', 'real'),
 ('advances', 'in'),
 ('being', 'said'),
 ('easy', 'for')]

In [4]:
# Find 10 best trigrams based on PMI.
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# your code goes here
trigram_measures = TrigramAssocMeasures()

finder = TrigramCollocationFinder.from_words(nltk.wordpunct_tokenize(text))

# print the 10 n-grams with the highest PMI
finder.nbest(trigram_measures.pmi, 10)

[('Natural', 'Language', 'Processing'),
 ('That', 'being', 'said'),
 ('advances', 'in', 'Machine'),
 ('easy', 'for', 'you'),
 ('it', 'easy', 'for'),
 ('making', 'it', 'easy'),
 ('recent', 'advances', 'in'),
 ('!', 'Deep', 'Learning'),
 ('Artificial', 'Intelligence', 'that'),
 ('Deep', 'Learning', 'has')]

### Exercise 1

Read the document in `Lab9_IE_TermExtraction_Corpus.txt` and find the best bigrams and trigrams, based on PMI.

In [5]:
# Download corpus (uncomment the line below if not corpus not downloaded already from Blackboard)
!curl "https://pastebin.com/raw/dNSb9kAy" > Lab9_IE_TermExtraction_Corpus.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 14987    0 14987    0     0  39232      0 --:--:-- --:--:-- --:--:-- 39232


In [6]:
# Read corpus in Lab9_IE_TermExtraction_Corpus.txt
doc = None

# your code goes here
with open('Lab9_IE_TermExtraction_Corpus.txt', 'r') as f:
  doc = f.read()

In [7]:
# Find 10 best bigrams based on PMI.
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# your code goes here
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(nltk.word_tokenize(doc))
finder.nbest(bigram_measures.pmi, 10)  

[('1966', 'declares'),
 (':', 'art'),
 ('Argentina', 'were'),
 ('Easter', 'Rising'),
 ('First', 'Protocol'),
 ('Gaelic', 'Romantic'),
 ('High', 'Middle'),
 ('I', 'founded'),
 ('International', 'Covenant'),
 ('Its', 'function')]

In [8]:
# Find 10 best trigrams based on PMI.
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

# your code goes here
trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(nltk.word_tokenize(doc))
finder.nbest(trigram_measures.pmi, 10)

[('Gaelic', 'Romantic', 'revivalists'),
 ('High', 'Middle', 'Ages'),
 ('Latin', 'universitas', 'magistrorum'),
 ('Rupert', 'I', 'founded'),
 ('Since', 'World', 'War'),
 ('UN', 'International', 'Covenant'),
 ('World', 'War', 'II'),
 ('approach', 'involves', 'combining'),
 ('completely', 'independent', 'body'),
 ('days', 'when', 'few')]

### Example 2

As mentioned on slide 34 in this week's lecture
> Unsupervised term extraction can build on syntactic analysis as **a term often corresponds to a Noun Phrase (NP)**.

However, in the examples above, several collocations contain a Verb Phrase such as "*I founded*", "*Argentina were*", etc. Hence, we have to modify the code above and look for only those collocations which contain Noun Phrases for our task of Term Extraction.

---

In general, a typical term extraction algorithm has three main steps:

1. **Text Pre-processing**: which can include stop words removal, normalization (lower casing), punctuation removal, etc.

2. **Candidate selection**: Here, we extract all possible phrases that can potentially be "terms" based on grammar rules e.g. noun phrases.

3. **Scoring and selecting terms**: All candidates can be ranked by various methods. The simplest ones can rely on frequency statistics, such as $\text{TF-IDF}$ or mutual information such as point wise mutual information ($\text{PMI}$). In this session, our focus is on $\text{PMI}$.

Finally, we also need a score or a threshold, or a limit on the number of terms that is used to select the final set of terms.

In [9]:
# We can define grammar rules to look for Noun phrases and parse a given sentence
# using those rules to find the relevant phrases for finding the best collocations.

def np_chunk(sentence):
  '''Function to parse a sentence with the given grammar rules for Noun Phrases.'''

  # Define grammar rules.
  grammar =  r"""
    NBAR:
        {<NN.*|JJ>*<NN.*>}  # Nouns and Adjectives, terminated with Nouns
        
    NP:
        {<NBAR>}
        {<NBAR><IN><NBAR>}  # Above, connected with in/of/etc...
  """
  # These patterns for candidate selection on Noun Phrases are defined 
  # in Table 2 in this paper: https://www.aclweb.org/anthology/C10-1065.pdf

  
  # Create a chunk parser using this grammar.
  chunker = nltk.RegexpParser(grammar)
  
  # Tokenize sentence on whitespace and punctuation.
  # More details here: https://www.nltk.org/api/nltk.tokenize.html
  tokenized_sentence = nltk.wordpunct_tokenize(sentence)
  
  # Find POS tag for the sentence.
  tagged_sentence = nltk.pos_tag(tokenized_sentence)
  
  # Parse the sentence based on grammar rules.
  tree = chunker.parse(tagged_sentence)

  # Store Noun Phrases in a list.
  noun_phrases = []

  for subtree in tree.subtrees():
    if subtree.label() == 'NP':
        phrase = ""
        for leaf in subtree.leaves():
            phrase = phrase + leaf[0] + " "
        noun_phrases.append(phrase.strip())
             
  return noun_phrases

In [10]:
# Function to extract Noun phrases from a document.
def extract_nps(document):
  '''Find all noun phrases in a document and return as a list.'''

  all_noun_phrases = []

  for sent in nltk.sent_tokenize(document):
    all_noun_phrases.extend(np_chunk(sent))
  
  return all_noun_phrases

In [11]:
# Find Noun phrase candidates using the functions implemented above.
all_noun_phrases = extract_nps(text)
all_noun_phrases

['Natural Language Processing',
 'NLP',
 'field',
 'Artificial Intelligence',
 'computers',
 'human languages',
 'computers',
 'level understanding',
 'language',
 'Computers',
 '’ t',
 'same intuitive understanding',
 'natural language',
 'language',
 'nutshell',
 'computer',
 't read',
 'lines',
 'recent advances',
 'Machine Learning',
 'ML',
 'computers',
 'lot',
 'useful things',
 'natural language',
 'Deep Learning',
 'programs',
 'things',
 'language translation',
 'semantic understanding',
 'text summarization',
 'things',
 'world value',
 'computations',
 'large blocks',
 'text',
 'manual effort']

In [12]:
# Find bigram candidates.
candidates = [candidate for candidate in all_noun_phrases if len(candidate.split()) == 2]
print(candidates)

['Artificial Intelligence', 'human languages', 'level understanding', '’ t', 'natural language', 't read', 'recent advances', 'Machine Learning', 'useful things', 'natural language', 'Deep Learning', 'language translation', 'semantic understanding', 'text summarization', 'world value', 'large blocks', 'manual effort']


In [13]:
# Find trigram candidates.
candidates = [candidate for candidate in all_noun_phrases if len(candidate.split()) == 3]
print(candidates)

['Natural Language Processing', 'same intuitive understanding']


### Exercise 3

Extract **Noun Phrase** candidates from `Lab9_IE_TermExtraction_Corpus.txt` using the functions implemented above.


In [14]:
# Find noun phrases in the given document.

# your code goes here
all_noun_phrases = extract_nps(doc)
all_noun_phrases

['university',
 'institution',
 'education',
 'research',
 'academic degrees',
 'various academic disciplines',
 'Universities',
 'undergraduate education',
 'postgraduate education',
 'word university',
 'Latin universitas magistrorum et scholarium',
 '" community',
 'teachers',
 'scholars',
 'antecedents',
 'Asia',
 'Africa',
 'modern university system',
 'European medieval university',
 'Italy',
 'Catholic Cathedral schools',
 'clergy',
 'High Middle Ages',
 'National',
 'A national university',
 'university',
 'national state',
 'same time',
 'state autonomic institution',
 'functions',
 'independent body',
 'same state',
 'national universities',
 'political aspirations',
 'instance',
 'National University',
 'Ireland',
 'Catholic University',
 'Ireland',
 'answer',
 'denominational universities',
 'Ireland',
 'years',
 'Easter Rising',
 'small part',
 'result',
 'Gaelic Romantic revivalists',
 'NUI',
 'large amount',
 'information',
 'Irish language',
 'Irish culture',
 'Reforms'

In [15]:
# Find bigram candidates

# your code goes here
candidates = [candidate for candidate in all_noun_phrases if len(candidate.split()) == 2]
print(candidates)

['academic degrees', 'undergraduate education', 'postgraduate education', 'word university', '" community', 'national state', 'same time', 'independent body', 'same state', 'national universities', 'political aspirations', 'National University', 'Catholic University', 'denominational universities', 'Easter Rising', 'small part', 'large amount', 'Irish language', 'Irish culture', 'University Revolution', 'posterior reforms', 'education system', 'vice president', 'various divisions', 'academic departments', 'education boards', 'financial requests', 'budget proposals', 'new programs', 'various institutions', 'considerable degree', 'pedagogical autonomy', 'Private universities', 'state policies', 'business corporations', 'universities varies', 'different countries', 'countries universities', 'vast majority', 'local town', 'university accommodation', 'undergraduate degree', 's degree', 'colloquial term', 'academic degree', 'undergraduate courses', 'United States', 'common type', 'undergradu

In [16]:
# Find trigrams candidates

# your code goes here
candidates = [candidate for candidate in all_noun_phrases if len(candidate.split()) == 3]
print(candidates)

['various academic disciplines', 'modern university system', 'European medieval university', 'Catholic Cathedral schools', 'High Middle Ages', 'A national university', 'state autonomic institution', 'Gaelic Romantic revivalists', 'Public university systems', 'further coordinated growth', 'many public universities', 'other countries universities', 'A ., Licentiate', 'Postgraduate degree course', 'Stricto Sensu courses', 'optional final stage', 'other career colleges', 'UN International Covenant', 'free education ".', 'term high school', 'dental schools ),', 'social services activities', 'analytical reasoning skills', 'World War II', 'many developed countries', 'relevant age group', 'measurable wage premium', 'only average ability', 'level graduate programs', 'Mean financial wealth', 'analytical reasoning skills', 'speech ), problem', 'liberal arts degrees', 'distinctive undergraduate experience', 'excellent liberal arts', 'innovative interdisciplinary research', 'top undergraduate progr

### Excercise 4: Text pre-processing and Normalization

Apply some preprocessiyng steps such as lowercasing, punctuation removal, stopword removal etc. Feel free to modify the `extract_nps()` and `np_chunk()` functions for this, or you can also convert the candidates list to lowercase for normalization.

In [17]:
# your code goes here

# Extract noun phrases.
all_noun_phrases = extract_nps(doc)

# Lowercase and retrive bigram candidates.
candidates = set(candidate.lower() for candidate in all_noun_phrases if len(candidate.split()) == 2)
print(candidates)

{'admitted students', 'political aspirations', 'small part', 'latter level', 'extension courses', 'audio design', 'mba programs', 'professional certifications', 'year program', 'different degrees', 'undergraduate fields', 'academic advisor', 'undergraduate level', 'graduate schools', 'private universities', 'united states', 'professional master', 'academic disciplines', 'alternative means', 'progressive introduction', 'paid workers', 'level institutions', 'signatory parties', 'graduate unemployment', 'relevant technologies', 'vocational schools', '" community', 'human rights', 'broad backgrounds', 'communication skills', 'public sector', 'academic units', 'national state', 'secondary education', 'postgraduate education', 'gain understanding', 'principal ingredient', 'countries universities', 'mass rate', 'stimulate innovation', 'catholic university', 'cultural rights', 'college graduate', 'national university', 'major electives', 'national universities', 'minor programs', 'elite rate',

### Excercise 5: PMI Scoring
 
Score and rank the candidate terms in the given document using PMI, using the formula shown in slides 36-37.

In [18]:
from nltk.probability import FreqDist
from math import log2

def calculate_pmi(candidate, document):
  '''In order to compute the PMI we need the joint probability P(X = x, Y = y) and the marginal probabilities
  P(X = x) and P(Y = y). In our case P(X = x) and P(Y = y) will be the unigram probabilities of the two
  words that are considered to form a collocation, and P(X = x, Y = y) will be the bigram probability.'''
  
  word1 = candidate.split(" ")[0].strip()
  word2 = candidate.split(" ")[1].strip()
  
  tokens = nltk.wordpunct_tokenize(document)

  pmi = 0.0
  
  # Your code goes here (implement the following steps)
  #
  # 1. Lowercase tokens and candidate words.
  # 2. Compute unigram and bigram frequencies.
  # 3. Compute unigram and bigram probabilities.
  # 4. Calculate PMI using the formula on slide 36.

  # Lowercase
  tokens = [word.lower() for word in tokens]
  word1 = word1.lower()
  word2 = word2.lower()
  
  # Unigram frequencies
  unigrams_freq = FreqDist(tokens)
  
  # Bigram frequencies
  bigrams = list(nltk.bigrams(tokens))
  bigrams_freq = nltk.FreqDist(bigrams)
  
  tuple_w1_w2 = (word1.lower(), word2.lower())

  # Bigram probability.
  joint_prob = bigrams_freq[tuple_w1_w2] / unigrams_freq[word1]

  # Unigram probabilities.
  prob_w1 = unigrams_freq[word1.lower()] / len(tokens)
  prob_w2 = unigrams_freq[word2.lower()] / len(tokens)

  # Calculate PMI.
  pmi = log2(joint_prob / (prob_w1 * prob_w2))
  # pmi = log2(joint_prob) - (log2(prob_w1) + log2(prob_w2))

  return pmi

In [19]:
# Calculate PMI for each bigram candidate and store it in a dictionary.
candidates_pmi = {}

# your code goes here
for candidate in candidates:
  pmi = calculate_pmi(candidate, doc)
  candidates_pmi[candidate] = pmi

print(candidates_pmi)

{'admitted students': 18.736965594166207, 'political aspirations': 22.643856189774723, 'small part': 22.643856189774723, 'latter level': 19.184424571137427, 'extension courses': 20.05889368905357, 'audio design': 21.05889368905357, 'mba programs': 18.05889368905357, 'professional certifications': 17.47393118833241, 'year program': 17.47393118833241, 'different degrees': 17.32192809488736, 'undergraduate fields': 13.658014252771384, 'academic advisor': 14.643856189774725, 'undergraduate level': 12.783545134855244, 'graduate schools': 12.403064857612767, 'private universities': 18.736965594166207, 'united states': 19.47393118833241, 'professional master': 16.47393118833241, 'academic disciplines': 13.321928094887362, 'alternative means': 21.05889368905357, 'progressive introduction': 22.643856189774723, 'paid workers': 21.05889368905357, 'level institutions': 14.72499295250013, 'signatory parties': 22.643856189774723, 'graduate unemployment': 14.72499295250013, 'relevant technologies': 1

In [20]:
# Filter terms based on PMI (sort & rank).

# your code goes here
sorted_candidates = sorted(candidates_pmi.items(), key=lambda x: x[1], reverse=True)
for candidate, score in sorted_candidates[:25]:
  print(candidate, score)

political aspirations 22.643856189774723
small part 22.643856189774723
progressive introduction 22.643856189774723
signatory parties 22.643856189774723
broad backgrounds 22.643856189774723
gain understanding 22.643856189774723
principal ingredient 22.643856189774723
intelligent citizenship 22.643856189774723
few pupils 22.643856189774723
easter rising 22.643856189774723
individual strengths 22.643856189774723
societal relevance 22.643856189774723
abstract elements 22.643856189774723
pedagogical autonomy 22.643856189774723
title specialist 22.643856189774723
north america 22.643856189774723
little theory 22.643856189774723
undergraduation diplomas 22.643856189774723
budget proposals 22.643856189774723
local town 22.643856189774723
independent body 22.643856189774723
mass rate 21.643856189774723
stimulate innovation 21.643856189774723
elite rate 21.643856189774723
grade inflation 21.643856189774723


---

## Exercise 6

We can follow the same steps as above for a larger corpus from Gutenberg like Shakespeare or Jane Austen.

In [21]:
# find the 10 best bigrams based on PMI. (same as example 1)

from nltk.collocations import BigramAssocMeasures

# Access a corpus from nltk.corpus.gutenberg.words
# For example, to acess Jane Austen's Emma, use:
# nltk.corpus.gutenberg.words(['austen-emma.txt']) to get a list of words
# or
# nltk.corpus.gutenberg.raw('austen-emma.txt') to get raw plaintext

# your code goes here
bigram_measures = BigramAssocMeasures()

# find all bigrams
finder = BigramCollocationFinder.from_words(nltk.corpus.gutenberg.words(['austen-emma.txt']))

# only bigrams that appear 3+ times
finder.apply_freq_filter(3) 

# return the 10 n-grams with the highest PMI
finder.nbest(bigram_measures.pmi, 10)

[('&', 'c'),
 ('caro', 'sposo'),
 ('South', 'End'),
 ('frozen', 'maid'),
 ('Mill', 'Farm'),
 ('Brunswick', 'Square'),
 ('extensive', 'grounds'),
 ('nicely', 'dressed'),
 ('sore', 'throat'),
 ('Vicarage', 'Lane')]

In [22]:
# Follow the 3 steps for term extraction (same as Example 2 and Exercises 3-4)
# to find the candidates for term extraction.

# your code goes here
import string

# function to look for punctuation while pre-processing
def has_punct(s):
  for p in string.punctuation:
    if p in s:
      return True

  return False


emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
all_noun_phrases = extract_nps(emma)
candidates = {candidate for candidate in all_noun_phrases if len(candidate.split()) == 2 and all_noun_phrases.count(candidate) > 3 and not has_punct(candidate)}
print(candidates)

{'Robert Martin', 'Miss Nash', 's confidence', 's eyes', 's thoughts', 'few weeks', 'little doubt', 'Miss Smith', 'Frank Churchill', 'little while', 'own powers', 's time', 'few moments', 'dear sir', 'young people', 's letter', 'Poor Mr', 'Emma Woodhouse', 'many months', 'Miss Churchill', 'Maple Grove', 'whole party', 'few lines', 'old friend', 's family', 'good hands', 'old acquaintance', 's side', 'little distance', 'poor Mr', 's claims', 's party', 'good society', 'own feelings', 'dear Harriet', 'such occasions', 'dear papa', 'such thing', 's letters', 'South End', 'good Mr', 's sake', 'next day', 'worthy people', 'good manners', 'first object', 'own mind', 's mind', 'Miss Bates', 's son', 'little boys', 'good sort', 'good deal', 'poor Mrs', 's account', 'proper attention', 'Donwell Abbey', 'John Knightley', 'Colonel Campbell', 'vast deal', 'Miss Taylor', 'only way', 's attention', 's marriage', 'full justice', 'such things', 'Jane Fairfax', 'bad thing', 'next morning', 'great amuse

In [23]:
# Rank the candidates based on PMI score (same as Exercise 5).

# your code goes here
# Calculate PMI for each candidate and store it in a dictionary.
candidates_pmi = {}

# your code goes here
for candidate in list(candidates):
  pmi = calculate_pmi(candidate, emma)

  candidates_pmi[candidate] = pmi

sorted_candidates = sorted(candidates_pmi.items(), key=lambda x: x[1], reverse=True)
for candidate, score in sorted_candidates[:25]:
  print(candidate, score)

Brunswick Square 28.06350931373095
baked apples 27.446125335317415
South End 26.522940932368247
William Larkins 26.18904019581481
hair cut 25.815121683861555
Box Hill 25.771272903780392
vast deal 25.73634457047744
Maple Grove 25.19951081231565
Colonel Campbell 23.74697798628834
Robert Martin 23.652708807338577
worthy people 22.52012591676119
ten minutes 22.471590357752454
Donwell Abbey 22.094212435914425
ten days 21.555654622540928
full justice 21.47581702025422
common sense 21.363910572029226
common course 21.240528156523947
proper attention 20.44062553817004
few moments 20.38902811812921
sensible man 20.265602201862315
few minutes 20.24607016428717
few lines 20.06710002324185
last night 19.960715901855433
few weeks 19.8447076019054
pleasant party 19.80783692313208
