# <center>Natural Language Processing Using NLTK (I)</center>

References:
 - http://www.nltk.org/book_1ed/
 - https://web.stanford.edu/class/cs124/lec/Information_Extraction_and_Named_Entity_Recognition.pdf

## 1. NLTK installation
 1. Install the NLTK package with `pip install nltk`.
 2. Open your Python editor (Jupyter Notebook, Spyder, etc.) and run the commands below. Select "all packages" to install the data bundled with NLTK, including corpora and books. Downloading all the data may take a few minutes.

In [1]:
import nltk
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

## 2. NLP Objectives and Basic Steps

 - Objectives:
   * Split documents into tokens, phrases, or segments
   * Clean up tokens and annotate tokens
   * Extract features from tokens for further text mining tasks
 - Basic processing steps:
   * Tokenization: split documents into individual words, phrases, or segments
   * Remove stop words and filter tokens
   * POS (part of speech) Tagging
   * Normalization: Stemming, Lemmatization
   * Named Entity Recognition (NER)
   * Term Frequency and Inverse Document Frequency (TF-IDF)
   * Document-to-term matrix (bag of words)
 - NLP packages: NLTK, Gensim, spaCy


In [2]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

import re    # import re module
import nltk

In [3]:
# this extract is from https://www.sciencenews.org/article/coronavirus-what-does-covid-19-vaccine-efficacy-mean

text = "The FDA setting a minimum recommendation for efficacy doesn't mean vaccines \
couldn't perform better. The benchmark is also a reminder that COVID-19 vaccine \
development is in its early days. If the first vaccines made available only meet \
the minimum, they may be replaced by others that prove to protect more people. \
But with more than 1 million deaths from COVID-19 worldwide — \
and U.S. deaths surpassing 200,000 — the urgency in finding a \
vaccine that safely helps at least some people is at the forefront."

text

"The FDA setting a minimum recommendation for efficacy doesn't mean vaccines couldn't perform better. The benchmark is also a reminder that COVID-19 vaccine development is in its early days. If the first vaccines made available only meet the minimum, they may be replaced by others that prove to protect more people. But with more than 1 million deaths from COVID-19 worldwide — and U.S. deaths surpassing 200,000 — the urgency in finding a vaccine that safely helps at least some people is at the forefront."

## 3. Tokenization
 - **Definition**: the process of breaking a stream of textual content up into words, terms, symbols, or some other meaningful elements called tokens.
    * Word (Unigram)
    * Bigram (Two consecutive words)
    * Trigram (Three consecutive words)
    * Sentence
 - Different methods exist:
    * Split by regular expression patterns
    * NLTK's word tokenizer
    * NLTK's regular expression tokenizer (customizable)
 - None of them is perfect for every tokenization task. 

### 3.1. Unigram

#### Regular Expression

In [4]:
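# Exercise 3.1.1. Simply split the text by one or more non-word characters

# \W+: one or more non-word characters
tokens = re.split(r"\W+", text)   

# get the number of tokens
# (the count includes an empty string after the final period)

print(len(tokens))                   
print(tokens)                     

# Pros: no punctuation, just words
# Cons: COVID-19, doesn't, couldn't, U.S., 200,000
# are each split into two tokens

# alternative: match runs of word characters directly,
# which yields no empty token
re.findall(r"\w+", text) 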

90
['The', 'FDA', 'setting', 'a', 'minimum', 'recommendation', 'for', 'efficacy', 'doesn', 't', 'mean', 'vaccines', 'couldn', 't', 'perform', 'better', 'The', 'benchmark', 'is', 'also', 'a', 'reminder', 'that', 'COVID', '19', 'vaccine', 'development', 'is', 'in', 'its', 'early', 'days', 'If', 'the', 'first', 'vaccines', 'made', 'available', 'only', 'meet', 'the', 'minimum', 'they', 'may', 'be', 'replaced', 'by', 'others', 'that', 'prove', 'to', 'protect', 'more', 'people', 'But', 'with', 'more', 'than', '1', 'million', 'deaths', 'from', 'COVID', '19', 'worldwide', 'and', 'U', 'S', 'deaths', 'surpassing', '200', '000', 'the', 'urgency', 'in', 'finding', 'a', 'vaccine', 'that', 'safely', 'helps', 'at', 'least', 'some', 'people', 'is', 'at', 'the', 'forefront', '']


['The',
 'FDA',
 'setting',
 'a',
 'minimum',
 'recommendation',
 'for',
 'efficacy',
 'doesn',
 't',
 'mean',
 'vaccines',
 'couldn',
 't',
 'perform',
 'better',
 'The',
 'benchmark',
 'is',
 'also',
 'a',
 'reminder',
 'that',
 'COVID',
 '19',
 'vaccine',
 'development',
 'is',
 'in',
 'its',
 'early',
 'days',
 'If',
 'the',
 'first',
 'vaccines',
 'made',
 'available',
 'only',
 'meet',
 'the',
 'minimum',
 'they',
 'may',
 'be',
 'replaced',
 'by',
 'others',
 'that',
 'prove',
 'to',
 'protect',
 'more',
 'people',
 'But',
 'with',
 'more',
 'than',
 '1',
 'million',
 'deaths',
 'from',
 'COVID',
 '19',
 'worldwide',
 'and',
 'U',
 'S',
 'deaths',
 'surpassing',
 '200',
 '000',
 'the',
 'urgency',
 'in',
 'finding',
 'a',
 'vaccine',
 'that',
 'safely',
 'helps',
 'at',
 'least',
 'some',
 'people',
 'is',
 'at',
 'the',
 'forefront']

#### NLTK's word tokenizer does the following steps:
* split standard contractions, e.g. don't -> do n't and they'll -> they 'll
* treat most punctuation characters as separate tokens
* split off commas and single quotes, when followed by whitespace
* separate periods that appear at the end of line

In [5]:
# Exercise 3.1.2 NLTK's word tokenizer: 

# break down text into words and punctuations

# invoke NLTK's word tokenizer
tokens = nltk.word_tokenize(text)    
print(len(tokens) )                   
print (tokens)       

# Pros: words are well tokenized, 
# e.g. COVID-19, 200,000 are not split by punctuations
# doesn't becomes does n't
# cons: need to remove punctuation 

92
['The', 'FDA', 'setting', 'a', 'minimum', 'recommendation', 'for', 'efficacy', 'does', "n't", 'mean', 'vaccines', 'could', "n't", 'perform', 'better', '.', 'The', 'benchmark', 'is', 'also', 'a', 'reminder', 'that', 'COVID-19', 'vaccine', 'development', 'is', 'in', 'its', 'early', 'days', '.', 'If', 'the', 'first', 'vaccines', 'made', 'available', 'only', 'meet', 'the', 'minimum', ',', 'they', 'may', 'be', 'replaced', 'by', 'others', 'that', 'prove', 'to', 'protect', 'more', 'people', '.', 'But', 'with', 'more', 'than', '1', 'million', 'deaths', 'from', 'COVID-19', 'worldwide', '—', 'and', 'U.S.', 'deaths', 'surpassing', '200,000', '—', 'the', 'urgency', 'in', 'finding', 'a', 'vaccine', 'that', 'safely', 'helps', 'at', 'least', 'some', 'people', 'is', 'at', 'the', 'forefront', '.']


In [2]:
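# Exercise 3.1.3 remove leading or trailing punctuation
# (uses `tokens` from Exercise 3.1.2, so run that cell first)

import string

string.punctuation

# strip ASCII punctuation plus the em dash from both ends of each token
tokens=[token.strip(string.punctuation+'—') for token in tokens]
tokens
# remove empty tokens
tokens=[token.strip() for token in tokens \
        if token.strip()!='']
print(len(tokens) )
print(tokens)  

# Note: '—' is not in string.punctuation, which is why it is added
# explicitly above. How would you remove other non-ASCII punctuation?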

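One possible answer to the question above is to test characters against Unicode categories instead of a fixed punctuation list. The sketch below is a minimal illustration using only the standard library; the helper `is_punctuation` and the sample token list are assumptions made for this example and are not part of NLTK.

```python
import unicodedata

def is_punctuation(token):
    """Return True if every character in `token` is Unicode punctuation."""
    # general categories starting with "P" cover ASCII and non-ASCII
    # punctuation, e.g. "." is "Po" and the em dash is "Pd"
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )

# hypothetical sample of tokens, similar to word-tokenizer output
sample_tokens = ["worldwide", "—", "and", "U.S.", "200,000", ".", "n't"]

# keep every token that is not made up entirely of punctuation
words_only = [t for t in sample_tokens if not is_punctuation(t)]
print(words_only)
```

Symbols such as `$` fall under the "S" categories rather than "P", so they would need separate handling if they should also be dropped.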

#### NLTK's regular expression tokenizer (customizable)

In [6]:
# Exercise 3.1.4 NLTK's regular expression tokenizer 

# Pattern can be customized to your need

# a word is defined as:
# (1) must start with a word character  \w
# (2) then contain zero or more word characters, "-", ",",
#     or "'" in the middle [\w',-]*
#     e.g.: couldn't, 200,000, COVID-19
# (3) must end with a word character \w
# Note: a match needs at least two characters, so single-character
# tokens such as "a" and "1" are dropped

pattern=r'\w[\w\',-]*\w'                        

# call NLTK's regular expression tokenization
tokens=nltk.regexp_tokenize(text, pattern)

print(len(tokens))
print (tokens)

78
['The', 'FDA', 'setting', 'minimum', 'recommendation', 'for', 'efficacy', "doesn't", 'mean', 'vaccines', "couldn't", 'perform', 'better', 'The', 'benchmark', 'is', 'also', 'reminder', 'that', 'COVID-19', 'vaccine', 'development', 'is', 'in', 'its', 'early', 'days', 'If', 'the', 'first', 'vaccines', 'made', 'available', 'only', 'meet', 'the', 'minimum', 'they', 'may', 'be', 'replaced', 'by', 'others', 'that', 'prove', 'to', 'protect', 'more', 'people', 'But', 'with', 'more', 'than', 'million', 'deaths', 'from', 'COVID-19', 'worldwide', 'and', 'deaths', 'surpassing', '200,000', 'the', 'urgency', 'in', 'finding', 'vaccine', 'that', 'safely', 'helps', 'at', 'least', 'some', 'people', 'is', 'at', 'the', 'forefront']


In [None]:
# Exercise: use the regular expression tokenizer to extract
# each course code and title phrase, e.g.
# 'COM-101 COMPUTERS'

# a separate variable name keeps `text` from Section 2 intact
course_text = '''COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS'''
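One way to approach this exercise is sketched below. It uses the standard `re` module so that it runs without NLTK; the same pattern can presumably be passed to `nltk.regexp_tokenize`, since the pattern contains no capturing groups. The names `course_pattern` and `courses` are assumptions made for this illustration.

```python
import re

course_text = '''COM-101   COMPUTERS
COM-111   DATABASE
COM-211   ALGORITHM
MAT-103   STATISTICS learning
MAT-102   STATISTICS'''

# course code (letters, hyphen, digits), whitespace, then an upper-case title
course_pattern = r"[A-Z]+-\d+\s+[A-Z]+"

matches = re.findall(course_pattern, course_text)

# collapse the run of spaces between code and title into a single space
courses = [" ".join(m.split()) for m in matches]
print(courses)
```

The lower-case word "learning" is left out because the title part of the pattern accepts upper-case letters only.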


### 3.2. Sentence

In [6]:
# Exercise 3.2.1. Segmentation by Sentences

sentences = nltk.sent_tokenize(text)
len(sentences)
sentences

# what patterns can be used to segment 
# text into sentences?

4

["The FDA setting a minimum recommendation for efficacy doesn't mean vaccines couldn't perform better.",
 'The benchmark is also a reminder that COVID-19 vaccine development is in its early days.',
 'If the first vaccines made available only meet the minimum, they may be replaced by others that prove to protect more people.',
 'But with more than 1 million deaths from COVID-19 worldwide — and U.S. deaths surpassing 200,000 — the urgency in finding a vaccine that safely helps at least some people is at the forefront.']
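As a rough answer to the question about sentence patterns, a sentence boundary can be approximated as whitespace that follows `.`, `!` or `?` and precedes a capital letter. The sketch below is a simplified heuristic on a made-up sample string, not the method `nltk.sent_tokenize` uses.

```python
import re

# hypothetical sample text for illustration
sample = ("Vaccines may improve. U.S. deaths passed 200,000. "
          "Is a vaccine urgent? Yes.")

# split at whitespace that follows ., ! or ? and precedes a capital letter
boundary = r"(?<=[.!?])\s+(?=[A-Z])"
sentences = re.split(boundary, sample)

for sentence in sentences:
    print(sentence)
```

The heuristic keeps "U.S. deaths" together only because "deaths" is lower case; a phrase such as "U.S. Army" would be split incorrectly, which is one reason trained sentence tokenizers are preferred.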

### 3.3 Phrases: Bigrams (2 consecutive words),  Trigrams (3 consecutive words), or in general n-grams
 - Why bigrams and trigrams?
 - How to get bigrams or trigrams:
    1. First tokenize text into unigrams
    2. Slice through the list of unigrams to get bigrams
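The two steps above can be sketched in plain Python: shifted slices of the token list are zipped together to form n-grams. The function name `ngrams` and the sample tokens are assumptions made for this illustration.

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) from a list of tokens."""
    # zip n shifted copies of the list: tokens[0:], tokens[1:], ...
    return list(zip(*(tokens[i:] for i in range(n))))

# hypothetical unigram tokens
sample_tokens = ["the", "first", "vaccines", "made", "available"]

print(ngrams(sample_tokens, 2))   # bigrams
print(ngrams(sample_tokens, 3))   # trigrams
```

Because `zip` stops at the shortest slice, a list of `k` tokens yields `k - n + 1` n-grams, and none at all when `n` exceeds `k`.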

In [7]:
# Exercise 3.3.1. Get bigrams from the text                       

# bigrams are formed from unigrams
# nltk.bigrams returns a generator, so wrap it in list()

bigrams=list(nltk.bigrams(tokens))  # tokens are created in Exercise 3.1.4
print(bigrams)

# trigrams
list(nltk.trigrams(tokens))

[('The', 'FDA'), ('FDA', 'setting'), ('setting', 'minimum'), ('minimum', 'recommendation'), ('recommendation', 'for'), ('for', 'efficacy'), ('efficacy', "doesn't"), ("doesn't", 'mean'), ('mean', 'vaccines'), ('vaccines', "couldn't"), ("couldn't", 'perform'), ('perform', 'better'), ('better', 'The'), ('The', 'benchmark'), ('benchmark', 'is'), ('is', 'also'), ('also', 'reminder'), ('reminder', 'that'), ('that', 'COVID-19'), ('COVID-19', 'vaccine'), ('vaccine', 'development'), ('development', 'is'), ('is', 'in'), ('in', 'its'), ('its', 'early'), ('early', 'days'), ('days', 'If'), ('If', 'the'), ('the', 'first'), ('first', 'vaccines'), ('vaccines', 'made'), ('made', 'available'), ('available', 'only'), ('only', 'meet'), ('meet', 'the'), ('the', 'minimum'), ('minimum', 'they'), ('they', 'may'), ('may', 'be'), ('be', 'replaced'), ('replaced', 'by'), ('by', 'others'), ('others', 'that'), ('that', 'prove'), ('prove', 'to'), ('to', 'protect'), ('protect', 'more'), ('more', 'people'), ('people',

[('The', 'FDA', 'setting'),
 ('FDA', 'setting', 'minimum'),
 ('setting', 'minimum', 'recommendation'),
 ('minimum', 'recommendation', 'for'),
 ('recommendation', 'for', 'efficacy'),
 ('for', 'efficacy', "doesn't"),
 ('efficacy', "doesn't", 'mean'),
 ("doesn't", 'mean', 'vaccines'),
 ('mean', 'vaccines', "couldn't"),
 ('vaccines', "couldn't", 'perform'),
 ("couldn't", 'perform', 'better'),
 ('perform', 'better', 'The'),
 ('better', 'The', 'benchmark'),
 ('The', 'benchmark', 'is'),
 ('benchmark', 'is', 'also'),
 ('is', 'also', 'reminder'),
 ('also', 'reminder', 'that'),
 ('reminder', 'that', 'COVID-19'),
 ('that', 'COVID-19', 'vaccine'),
 ('COVID-19', 'vaccine', 'development'),
 ('vaccine', 'development', 'is'),
 ('development', 'is', 'in'),
 ('is', 'in', 'its'),
 ('in', 'its', 'early'),
 ('its', 'early', 'days'),
 ('early', 'days', 'If'),
 ('days', 'If', 'the'),
 ('If', 'the', 'first'),
 ('the', 'first', 'vaccines'),
 ('first', 'vaccines', 'made'),
 ('vaccines', 'made', 'available'),
 (

### 3.4. Collocation
 - Most bigrams and trigrams are arbitrary word sequences that sound odd on their own. Those that occur frequently, however, deserve attention.
 - **Collocation**: an expression consisting of two or more words that correspond to some conventional way of saying things, e.g. red wine, United States, balance sheet etc.
    - Collocations are not fully compositional in that there is usually an element of meaning added to the combination.
 - Question: how to find collocations?
    - Suppose you have a rich collection of text, e.g. english-web.txt
    - How to find good collocations from this file?

In [8]:
# Sample text: inaugural address

# To check the text, use

print(nltk.corpus.inaugural.raw('1789-Washington.txt')[0:200])

Fellow-Citizens of the Senate and of the House of Representatives:

Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was


In [9]:
# Exercise 3.4.1.
# construct bigrams using words from a large built-in NLTK corpus

from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

# bigram association measures
# different measures, e.g. frequency, are implemented

bigram_measures = BigramAssocMeasures()

# First load text from an NLTK corpus (inaugural)
# and create unigram tokens
# Then create bigrams from the tokens
words=nltk.corpus.inaugural.words()

finder = BigramCollocationFinder.from_words(words)

# find the top 10 bigrams by frequency
finder.nbest(bigram_measures.raw_freq, 10) 

# Note that the most frequent bigrams consist of stop words
# and punctuation. How to fix it?

[('of', 'the'),
 (',', 'and'),
 ('in', 'the'),
 ('to', 'the'),
 ('of', 'our'),
 ('.', 'The'),
 ('.', 'We'),
 ('and', 'the'),
 (',', 'the'),
 ('.', 'It')]

In [12]:
# Exercise 3.4.2. Find collocations after filtering out stop words and punctuation

import string
# load NLTK's English stop word list

stop_words = nltk.corpus.stopwords.words('english')
#print(stop_words)

finder.apply_word_filter(lambda w: w.lower() in stop_words\
                         or w.strip(string.punctuation)=='')

finder.nbest(bigram_measures.raw_freq, 20) 

# better?
# notice "let us", "upon us"

[('United', 'States'),
 ('fellow', 'citizens'),
 ('let', 'us'),
 ('Let', 'us'),
 ('American', 'people'),
 ('Federal', 'Government'),
 ('years', 'ago'),
 ('four', 'years'),
 ('General', 'Government'),
 ('upon', 'us'),
 ('one', 'another'),
 ('fellow', 'Americans'),
 ('Vice', 'President'),
 ('God', 'bless'),
 ('every', 'citizen'),
 ('Fellow', 'citizens'),
 ('Almighty', 'God'),
 ('foreign', 'nations'),
 ('Chief', 'Justice'),
 ('every', 'American')]

#### 3.4.1 How to find collocations - PMI
- By **frequency** (perhaps with filter)
- **Pointwise Mutual Information (PMI)**
  - Given two words $w_1, w_2$, $$PMI(w_1,w_2)=\log{\frac{p(w_1,w_2)}{p(w_1)*p(w_2)}}$$
  - Some observations:
    - If $w_1$ and $w_2$ are independent, $PMI(w_1,w_2)=0$.
    - If $w_1$ is completely dependent on $w_2$, i.e. $p(w_1,w_2)=p(w_2)$, then $PMI(w_1,w_2)=\log\frac{1}{p(w_1)}$. In this case, what if $w_1$ appears just once in the corpus?
    - PMI therefore favors less frequent collocations: the rarer $w_1$ is, the higher the score.
    - How to fix it?
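To make the formula concrete, the sketch below computes PMI by hand on a tiny made-up corpus. It assumes base-2 logarithms and estimates every probability as a count divided by the number of tokens; the corpus and the function name `pmi` are illustrative assumptions.

```python
import math
from collections import Counter

# hypothetical toy corpus
tokens = ("new york is big and new york is busy "
          "but new jersey is calm").split()

n_tokens = len(tokens)
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def pmi(w1, w2):
    """PMI in bits; only defined for bigrams that occur in the corpus."""
    p_joint = bigram_counts[(w1, w2)] / n_tokens
    p_w1 = unigram_counts[w1] / n_tokens
    p_w2 = unigram_counts[w2] / n_tokens
    return math.log2(p_joint / (p_w1 * p_w2))

print(round(pmi("new", "york"), 3))   # pair seen twice
print(round(pmi("big", "and"), 3))    # pair seen once, both words seen once
```

The accidental pair "big and" scores higher than "new york" simply because both of its words occur only once, which illustrates why PMI favors rare collocations.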


In [13]:
# Exercise 3.4.1.1 Metrics for Collocations

from nltk.collocations import *

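# construct bigrams using words from an NLTK corpus
# (a new finder discards the stop-word filter applied in Exercise 3.4.2)
finder = BigramCollocationFinder.from_words(words)

# find top-n bigrams by pmi
finder.nbest(bigram_measures.pmi, 20) 

# Notice that most of them are rare word pairs, many of them names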

[('/', '11'),
 ('25', 'straight'),
 ('Amelia', 'Island'),
 ('Apollo', 'astronauts'),
 ('Archibald', 'MacLeish'),
 ('BUSINESS', 'COOPERATION'),
 ('Barbary', 'Powers'),
 ('Belleau', 'Wood'),
 ('Boston', 'lawyer'),
 ('Britannic', 'Majesty'),
 ('COOPERATION', 'BY'),
 ('CRIMINAL', 'JUSTICE'),
 ('Calvin', 'Coolidge'),
 ('Cape', 'Horn'),
 ('Cardinal', 'Bernardin'),
 ('Chop', 'Hill'),
 ('Chosin', 'Reservoir'),
 ('Christmas', 'Eve'),
 ('Colonel', 'Goethals'),
 ('Dark', 'pictures')]

In [14]:
# Exercise 3.4.1.2 filter bigrams by frequency

# keep only bigrams that occur at least 5 times
finder.apply_freq_filter(5)
finder.nbest(bigram_measures.pmi, 20) 

[('Indian', 'tribes'),
 ('Western', 'Hemisphere'),
 ('¡', 'Xand'),
 ('coordinate', 'branches'),
 ('Old', 'World'),
 ('George', 'Washington'),
 ('faithfully', 'executed'),
 ('nuclear', 'weapons'),
 ('Chief', 'Magistrate'),
 ('middle', 'class'),
 ('Chief', 'Justice'),
 ('tariff', 'bill'),
 ('World', 'War'),
 ('executive', 'department'),
 ('move', 'forward'),
 ('President', 'Bush'),
 ('Vice', 'President'),
 ('executive', 'branch'),
 ('interstate', 'commerce'),
 ('domestic', 'concerns')]

#### 3.4.2 How to find collocations - NPMI and others
- **Normalized Pointwise Mutual Information (`NPMI`)**
   - If $w_1$ and $w_2$ always occur together, i.e., $p(w_1)=p(w_2)=p(w_1,w_2)$, PMI reaches the maximum: $$PMI(w_1,w_2)=-\log{p(w_1)}=-\log{p(w_2)}=-\log{p(w_1,w_2)}$$
   - Normalized PMI is the PMI divided by the upper bound:
   $$NPMI(w_1,w_2)=\frac{\log{\frac{p(w_1,w_2)}{p(w_1)*p(w_2)}}}{-\log{p(w_1,w_2)}}$$
   
- Another simple method by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf):

    - $Score(w_1, w_2)=\frac{count(w_1,w_2)-\delta}{count(w_1)*count(w_2)}, \text{where}~\delta~\text{is a discounting coefficient} $ 

    - In Gensim, $\delta$ is set to `min_count`, the minimum collocation frequency
    - Up to a constant factor, this is PMI without the logarithm, with $\delta$ suppressing collocations of very infrequent words
- Both methods are implemented in the `gensim` package
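The two scores above can be compared on a tiny made-up corpus. The sketch below assumes base-2 logarithms, probabilities estimated as counts divided by the number of tokens, and $\delta=1$; it implements the formulas as written here rather than reproducing Gensim's internal scorers, and all names are illustrative.

```python
import math
from collections import Counter

# hypothetical toy corpus
tokens = ("new york is big and new york is busy "
          "but new jersey is calm").split()

n_tokens = len(tokens)
unigram_counts = Counter(tokens)
bigram_counts = Counter(zip(tokens, tokens[1:]))

def npmi(w1, w2):
    """PMI divided by its upper bound -log p(w1, w2); needs an observed bigram."""
    p_joint = bigram_counts[(w1, w2)] / n_tokens
    p_w1 = unigram_counts[w1] / n_tokens
    p_w2 = unigram_counts[w2] / n_tokens
    pmi = math.log2(p_joint / (p_w1 * p_w2))
    return pmi / -math.log2(p_joint)

def mikolov_score(w1, w2, delta=1):
    """Discounted co-occurrence count divided by the product of word counts."""
    discounted = bigram_counts[(w1, w2)] - delta
    return discounted / (unigram_counts[w1] * unigram_counts[w2])

for pair in [("new", "york"), ("big", "and")]:
    print(pair, round(npmi(*pair), 3), round(mikolov_score(*pair), 3))
```

In this toy corpus "big and" reaches the NPMI maximum of 1 because the two words never appear apart, while the discount in the Mikolov score drives the same pair to 0 since it occurs only once.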

#### 3.4.3 Phrase extraction by Gensim package
- Gensim is an open source Python library for NLP, with a focus on topic modeling.
- It is not an everything-including-the-kitchen-sink NLP research library (like NLTK); instead, Gensim is a mature, focused, and efficient suite of NLP tools for topic modeling, including 
  - Word2Vec word embedding 
  - Topic modeling
  - Text preprocessing like **phrase extraction**
  
- Gensim Phrase Model: 
    - `gensim.models.phrases.Phrases(sentences, min_count, threshold, max_vocab_size, delimiter, scoring, ...)`
        - `sentences`: list of sentences or iterables, each of which can be a document
        - `min_count`: Ignore all words and bigrams with a total collected count lower than this value.
        - `threshold`: Score threshold for forming phrases (higher means fewer phrases). A phrase of word $a$ followed by word $b$ is accepted if its score is greater than the threshold. The appropriate value depends heavily on the scoring function.
        - `max_vocab_size`: Maximum size (number of tokens) of the vocabulary. 
        - `delimiter`: Glue character used to join collocation tokens (e.g. `'_'`). Gensim 4.x expects a string; earlier versions expected a byte string.
        - `scoring`: Specify how potential phrases are scored. 
           - `default` - original_scorer(), by Mikolov et al. (2013) (https://arxiv.org/pdf/1310.4546.pdf)
           - `npmi` - npmi_scorer().

In [1]:
# Exercise 2.1. Find bigrams using gensim
import nltk

from gensim.models.phrases import Phrases, Phraser


words=nltk.corpus.inaugural.words()

# Train phrase model to find phrases using the default scorer (Mikolov et al. 2013)
# note: the whole corpus is passed as a single "sentence"
phrases = Phrases([words], min_count=2, threshold=50)

# find_phrases returns {phrase: score}; sort the pairs by score in descending order
items = sorted(set(phrases.find_phrases([words]).items()), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

Santo_Domingo:	8549.11
Indian_tribes:	6411.83
Abraham_Lincoln:	6411.83
Founding_Fathers:	6411.83
Social_Security:	6411.83
specie_payments:	6411.83
illegal_liquor:	5129.47
merchant_marine:	4808.88
Western_Hemisphere:	4808.88
founding_documents:	4274.56
Supreme_Court:	4274.56
lock_type:	3847.10
Old_World:	3108.77
inland_frontiers:	2747.93
¡_Xand:	2564.73
coordinate_branches:	2355.37
Thomas_Jefferson:	2331.58
eighteenth_amendment:	2137.28
Chief_Magistrate:	1998.49
extra_session:	1972.87
Great_Britain:	1923.55
George_Washington:	1846.61
silent_prayer:	1810.40
faithfully_executed:	1803.33
entangling_alliances:	1748.68
fervent_supplications:	1709.82
nuclear_weapons:	1648.76
distinguished_guests:	1538.84
Civil_War:	1465.56
onward_march:	1424.85
plainly_written:	1373.96
Chief_Justice:	1373.96
middle_class:	1221.30
fifteenth_amendment:	1068.64
earliest_practicable:	961.78
fertile_soil:	961.78
preceding_term:	961.78
walk_humbly:	874.34
World_War:	814.20
tariff_bill:	744.60
regular_session:	739.8

In [2]:
# Exercise 2.2. Find bigrams by NPMI

# find phrases using NPMI

phrases = Phrases([words], min_count=2, threshold=0.5, \
                  scoring='npmi')

# get unique set of phrases and sorted by score in descending order
items = sorted(set(phrases.find_phrases([words]).items()), key=lambda item: -item[1])

# print top 50 phrases
for phrase, score in items[0:50]:
    print("{0}:\t{1:.2f}".format(phrase, score))

Santo_Domingo:	1.00
Philippine_Islands:	1.00
reverend_clergy:	1.00
Information_Age:	1.00
Rocky_Mountains:	1.00
Porto_Rico:	1.00
Panama_Canal:	1.00
Social_Security:	0.97
Founding_Fathers:	0.97
Indian_tribes:	0.97
'_s:	0.96
specie_payments:	0.96
Abraham_Lincoln:	0.96
illegal_liquor:	0.95
merchant_marine:	0.95
Majority_Leader:	0.94
electors_residing:	0.94
Western_Hemisphere:	0.94
founding_documents:	0.94
Old_World:	0.93
sheet_anchor:	0.93
lock_type:	0.93
Supreme_Court:	0.92
Dingley_Act:	0.92
secondary_boycott:	0.92
Middle_East:	0.92
cleaner_environment:	0.92
start_afresh:	0.92
Permanent_Court:	0.90
elective_franchise:	0.90
inland_frontiers:	0.90
Chief_Magistrate:	0.88
United_States:	0.88
200th_anniversary:	0.88
Thomas_Jefferson:	0.88
¡_Xand:	0.88
coordinate_branches:	0.87
Chief_Justice:	0.87
exclusive_metallic:	0.87
Pacific_Coast:	0.87
extra_session:	0.86
eighteenth_amendment:	0.86
Senator_Mathias:	0.86
Senator_Dole:	0.86
temporary_restraining:	0.86
entangling_alliances:	0.85
fugitive_sla

In [3]:
# Exercise 2.3. Tokenize by unigrams and bigrams

# Initialize the phrase tokenizer from the NPMI model trained in the previous cell
bigram = Phraser(phrases)


#sent = nltk.corpus.inaugural.raw('2009-Obama.txt')

sent = '''That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.'''

print(sent)

print(bigram[sent.split()])

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a consequence of greed and irresponsibility on the part of some, but also our collective failure to make hard choices and prepare the nation for a new age. Homes have been lost; jobs shed; businesses shuttered. Our health care is too costly; our schools fail too many; and each day brings further evidence that the ways we use energy strengthen our adversaries and threaten our planet.
['That', 'we', 'are', 'in', 'the', 'midst', 'of', 'crisis', 'is', 'now', 'well', 'understood.', 'Our', 'nation', 'is', 'at', 'war,', 'against', 'a', 'far-reaching', 'network', 'of', 'violence', 'and', 'hatred.', 'Our', 'economy', 'is', 'badly', 'weakened,', 'a', 'consequence', 'of', 'greed', 'and', 'irresponsibility', 'on', 'the', 'part', 'of', 'some,', 'but', 'also', 'our', 'collective', 'failure', 'to', 'make', 'hard_choices', 'and', 'prepar

## 4. Vocabulary 
 - Vocabulary: the set of unique tokens (unigrams/phrases)  
 - Dictionary: typically, the vocabulary of a text can be represented as a dictionary 
    * Key: word, Value: count of the word
    * **nltk.FreqDist()**: a convenient class for calculating the frequency of words/phrases
        - Counts the frequency of each item in the list passed to it
        - Returns an object similar to a dictionary
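The dictionary view of a vocabulary can also be sketched with the standard library alone. `nltk.FreqDist` is a subclass of `collections.Counter`, so the calls below should carry over to it; the sample sentence is made up for this illustration.

```python
from collections import Counter

# hypothetical token list
sample_tokens = "the vaccine that helps the people is the priority".split()

# key: token, value: number of occurrences
vocabulary = Counter(sample_tokens)

print(len(vocabulary))              # vocabulary size (unique tokens)
print(vocabulary.most_common(1))    # most frequent token with its count
print(vocabulary["the"])            # dictionary-style lookup
print(vocabulary["missing"])        # unseen tokens count as 0
```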

In [15]:
# 3.5.1 Get token frequency

# first tokenize the text
pattern=r'\w[\w\',-]*\w'                        
tokens=nltk.regexp_tokenize(text.lower(), pattern)

#tokens
# get unigram frequency 
# recall, you can also get the dictionary by 
# {token: tokens.count(token) for token in set(tokens)}

word_dist=nltk.FreqDist(tokens)
word_dist

# get the most frequent items
print("top 5 words:", word_dist.most_common(5))

# what kind of words usually have high frequency?

# it behaves as a dictionary
for word in word_dist:
    print(word,":", word_dist[word])
    

FreqDist({'the': 6, 'is': 3, 'that': 3, 'minimum': 2, 'vaccines': 2, 'covid-19': 2, 'vaccine': 2, 'in': 2, 'more': 2, 'people': 2, ...})

top 5 words: [('the', 6), ('is', 3), ('that', 3), ('minimum', 2), ('vaccines', 2)]
the : 6
is : 3
that : 3
minimum : 2
vaccines : 2
covid-19 : 2
vaccine : 2
in : 2
more : 2
people : 2
deaths : 2
at : 2
fda : 1
setting : 1
recommendation : 1
for : 1
efficacy : 1
doesn't : 1
mean : 1
couldn't : 1
perform : 1
better : 1
benchmark : 1
also : 1
reminder : 1
development : 1
its : 1
early : 1
days : 1
if : 1
first : 1
made : 1
available : 1
only : 1
meet : 1
they : 1
may : 1
be : 1
replaced : 1
by : 1
others : 1
prove : 1
to : 1
protect : 1
but : 1
with : 1
than : 1
million : 1
from : 1
worldwide : 1
and : 1
surpassing : 1
200,000 : 1
urgency : 1
finding : 1
safely : 1
helps : 1
least : 1
some : 1
forefront : 1


### 4.1 Stop words and word filtering

 - Stop words: commonly used words, such as "and" and "the", that carry very little meaning and cannot differentiate one text from another.
 - Stop words are typically ignored in NLP processing and by search engines.
 - Stop words are usually application-specific. You can define your own stop words!

In [17]:
# Exercise 3.5.1.1
# get NLTK English stop words
# You can modify this list by adding or removing stop words

from nltk.corpus import stopwords
import string

stop_words = stopwords.words('english')
stop_words+=["covid-19", "virus"]
#print (stop_words)

# filter stop words out of the dictionary
# by creating a new dictionary

filtered_dict={word: word_dist[word] \
                     for word in word_dist \
                     if word not in stop_words}


filtered_dict

# how to sort the dictionary by value?
sorted(filtered_dict.items(), key = lambda item: -item[-1])

{'minimum': 2,
 'vaccines': 2,
 'vaccine': 2,
 'people': 2,
 'deaths': 2,
 'fda': 1,
 'setting': 1,
 'recommendation': 1,
 'efficacy': 1,
 'mean': 1,
 'perform': 1,
 'better': 1,
 'benchmark': 1,
 'also': 1,
 'reminder': 1,
 'development': 1,
 'early': 1,
 'days': 1,
 'first': 1,
 'made': 1,
 'available': 1,
 'meet': 1,
 'may': 1,
 'replaced': 1,
 'others': 1,
 'prove': 1,
 'protect': 1,
 'million': 1,
 'worldwide': 1,
 'surpassing': 1,
 '200,000': 1,
 'urgency': 1,
 'finding': 1,
 'safely': 1,
 'helps': 1,
 'least': 1,
 'forefront': 1}

[('minimum', 2),
 ('vaccines', 2),
 ('vaccine', 2),
 ('people', 2),
 ('deaths', 2),
 ('fda', 1),
 ('setting', 1),
 ('recommendation', 1),
 ('efficacy', 1),
 ('mean', 1),
 ('perform', 1),
 ('better', 1),
 ('benchmark', 1),
 ('also', 1),
 ('reminder', 1),
 ('development', 1),
 ('early', 1),
 ('days', 1),
 ('first', 1),
 ('made', 1),
 ('available', 1),
 ('meet', 1),
 ('may', 1),
 ('replaced', 1),
 ('others', 1),
 ('prove', 1),
 ('protect', 1),
 ('million', 1),
 ('worldwide', 1),
 ('surpassing', 1),
 ('200,000', 1),
 ('urgency', 1),
 ('finding', 1),
 ('safely', 1),
 ('helps', 1),
 ('least', 1),
 ('forefront', 1)]

### 4.2 Positive/negative words: sentiment analysis
- Sentiment analysis often relies on **lists of words and phrases with positive and negative connotations**. 
- Many dictionaries of positive and negative opinion words have already been developed:

  - **Hu and Liu's lexicon**: http://www.cs.uic.edu/~liub/FBS/
  - **SentiWordNet**: an excellent publicly available lexicon (http://sentiwordnet.isti.cnr.it/) 
  - **SentiWords**: contains 155,000 English words (https://hlt-nlp.fbk.eu/technologies/sentiwords)
  - **WordStat**: contains more than 9164 negative and 4847 positive word patterns (https://provalisresearch.com/products/content-analysis-software/wordstat-dictionary/sentiment-dictionaries/)
  - **SenticNet**: provides polarity associated with 50,000 natural language concepts (https://sentic.net)
  - **Sentiment140**:  created from 1.6 million tweets and contains a list of words and their associations with positive and negative sentiment (https://github.com/felipebravom/StaticTwitterSent/tree/master/extra/Sentiment140-Lexicon-v0.1)
- Opinion words are <b>domain-specific</b> (e.g. "power" in the political domain vs. in the energy sector).
  - For example, the financial industry has a number of dictionaries of opinion words:
     * Harvard's General Inquirer (GI): http://www.wjh.harvard.edu/~inquirer/
     * Loughran and McDonald (2015):  https://sraf.nd.edu/textual-analysis/resources/
- For description of these lexicons, check https://medium.com/@datamonsters/sentiment-analysis-tools-overview-part-1-positive-and-negative-words-databases-ae35431a470c
- Question: **How to select the right lexicon**?


In [18]:
# Exercise 3.5.2.1
# Find positive words 
text = '''the problem is that the writers, james cameron and jay cocks ,\
were too ambitious, aiming for a film with social relevance, thrills, and drama. 
 not that ambitious film-making should be discouraged; \
 just that when it fails to achieve its goals, it fails badly and obviously. 
 the film just ends up preachy, unexciting and uninvolving.'''

pattern=r'\w[\w\',-]*\w'                        
tokens=nltk.regexp_tokenize(text.lower(), pattern)


with open("positive-words.txt",'r') as f:
    positive_words=[line.strip() for line in f]

#positive_words
#print(positive_words)

positive_tokens=[token for token in tokens \
                 if token in positive_words]

print(positive_tokens)

['ambitious', 'thrills', 'ambitious']


- **Naive sentiment analysis**:
  - Find positive/negative words
  - If there are more positive words than negative words, classify the text as positive
  - Otherwise, classify it as negative
- Note the sentence: 
  -  "the problem is that the writers, james cameron and jay cocks , were **<font color="red">too ambitious</font>**, aiming for a film with social relevance, thrills, and drama. **<font color="red">not that ambitious</font>** film-making should be discouraged; just that when it fails to achieve its goals ..."
- How to deal with **negation**?
- Some useful rules:
    - Negative sentiment: 
      - negative words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - positive words preceded by a negation within $n$ (e.g. three) words in the same sentence.
    - Positive sentiment (in a similar fashion):
      - positive words not preceded by a negation within $n$ (e.g. three) words in the same sentence.
      - negative terms following a negation within  $n$ (e.g. three) words in the same sentence
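The rules above can be sketched as a small function that flips the polarity of an opinion word whenever a negation appears within the preceding `window` tokens. The tiny word sets and the function names are hypothetical and for illustration only; a real application would load a full lexicon and apply the function to one sentence at a time.

```python
# hypothetical miniature lexicons, for illustration only
POSITIVE = {"happy", "good", "great"}
NEGATIVE = {"bad", "poor", "fails"}
NEGATIONS = {"not", "no", "n't", "never", "cannot"}

def count_sentiment(tokens, window=3):
    """Return (positive, negative) counts for the tokens of one sentence."""
    pos = neg = 0
    for idx, token in enumerate(tokens):
        if token not in POSITIVE and token not in NEGATIVE:
            continue
        # look for a negation among the `window` tokens before the word
        preceding = tokens[max(0, idx - window):idx]
        negated = any(t in NEGATIONS for t in preceding)
        # a negation flips the polarity of the opinion word
        if (token in POSITIVE) != negated:
            pos += 1
        else:
            neg += 1
    return pos, neg

def naive_sentiment(tokens, window=3):
    """Positive only if positive words outnumber negative ones."""
    pos, neg = count_sentiment(tokens, window)
    return "positive" if pos > neg else "negative"

sentence = "this does not make any customer happy".split()
print(naive_sentiment(sentence, window=3))
print(naive_sentiment(sentence, window=4))
```

In this sentence "not" sits four tokens before "happy", so the negation is caught only when `window` is at least 4, which shows how sensitive the rule is to the choice of $n$.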


In [16]:
# Exercise 3.5.2.2 check if a positive word is preceded by a negation word
# e.g. not, too, n't, no, cannot

# this is not an exhaustive list of negation words!
negations=['not', 'too', 'n\'t', 'no', 'cannot', 'neither','nor', 'little','few']
tokens = nltk.word_tokenize(text)  

#print(tokens)

# keep a positive word unless the token right before it is a negation
positive_tokens=[]
for idx, token in enumerate(tokens):
    if token in positive_words:
        if idx==0 or tokens[idx-1] not in negations:
            positive_tokens.append(token)


print(positive_tokens)

# what if a positive word is preceded 
# by a negation within N words? 
# e.g. 'does not make any customer happy'

['thrills', 'ambitious']
