# NLP (Natural Language Processing)

NLP is any computation on, or manipulation of, natural language.

#### Natural languages evolve
* new words get added;
* old words lose popularity;
* meanings of words change;
* language rules themselves may change;

#### NLP Tasks
* Counting words, counting frequency of words
* Finding sentence boundaries
* Part of speech tagging
* Parsing the sentence structure
* Identifying semantic role
* Finding which entity each pronoun refers to
* and much more...

## Basic NLP Tasks with NLTK (Natural Language Toolkit)

In [29]:
import nltk
nltk.download('gutenberg')
nltk.download('genesis')
nltk.download('inaugural')
nltk.download('nps_chat')
nltk.download('webtext')
nltk.download('treebank')
nltk.download('udhr')
nltk.download('wordnet')
nltk.download('omw-1.4')
nltk.download('punkt')
nltk.download('tagsets')
nltk.download('averaged_perceptron_tagger')
from nltk.book import *

[nltk_data] Downloading package gutenberg to /home/user/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
[nltk_data] Downloading package genesis to /home/user/nltk_data...
[nltk_data]   Package genesis is already up-to-date!
[nltk_data] Downloading package inaugural to /home/user/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
[nltk_data] Downloading package nps_chat to /home/user/nltk_data...
[nltk_data]   Package nps_chat is already up-to-date!
[nltk_data] Downloading package webtext to /home/user/nltk_data...
[nltk_data]   Package webtext is already up-to-date!
[nltk_data] Downloading package treebank to /home/user/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package udhr to /home/user/nltk_data...
[nltk_data]   Package udhr is already up-to-date!
[nltk_data] Downloading package wordnet to /home/user/nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package om

### Counting vocabulary of words

In [30]:
text7

<Text: Wall Street Journal>

In [31]:
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [32]:
len(sent7)

18

In [33]:
len(text7)

100676

In [34]:
len(set(text7))

12408

In [35]:
list(set(text7))[:10]

['attributed',
 'proving',
 'get',
 'democratic',
 'Packages',
 'extending',
 '451',
 'Wakui',
 '1.6',
 'encroaching']
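Note that `len(set(...))` counts case-sensitive token types, so capitalized variants and punctuation each count as separate entries. The sketch below illustrates this on a small made-up token list (`tokens` is a hypothetical example, not taken from the corpus):

```python
# Hypothetical token list standing in for a tokenized text such as text7.
tokens = ["The", "board", "said", "the", "Board", "met", ".", "The", "end", "."]

total_tokens = len(tokens)                       # every token, repeats included
distinct_tokens = len(set(tokens))               # case-sensitive vocabulary
distinct_lower = len({t.lower() for t in tokens})  # case-insensitive vocabulary
# Keep only purely alphabetic tokens to drop punctuation and numbers.
distinct_words = len({t.lower() for t in tokens if t.isalpha()})

print(total_tokens, distinct_tokens, distinct_lower, distinct_words)
# prints: 10 8 6 5
```

Whether to lowercase or drop punctuation before counting depends on the task; the raw count above is still a valid measure of distinct tokens.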

### Frequency of words

In [36]:
dist = FreqDist(text7)
len(dist)

12408

In [37]:
vocab1 = dist.keys()
#vocab1[:10] 
# In Python 3 dict.keys() returns an iterable view instead of a list
list(vocab1)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

In [38]:
dist['four']

20

In [39]:
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']
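In NLTK 3, `FreqDist` is built on Python's `collections.Counter`, so the same counting pattern can be sketched with the standard library alone. The sentence below is a made-up example, and the thresholds are chosen only to suit such a short input:

```python
from collections import Counter

# Hypothetical sample text; any list of tokens works the same way.
tokens = "the cat sat on the mat because the mat was warm".split()

counts = Counter(tokens)          # maps each token to its frequency
print(counts["the"])              # prints: 3
print(counts.most_common(2))      # prints: [('the', 3), ('mat', 2)]

# Same filter shape as the freqwords cell: length and frequency thresholds.
frequent_long = [w for w in counts if len(w) > 2 and counts[w] > 1]
print(frequent_long)              # prints: ['the', 'mat']
```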

### Normalization and stemming

* **Normalization**: bringing different forms of the same word to one common form (for example, by lowercasing).
* **Stemming**: stripping common suffixes from a word to reduce it to a base (root) form, which is not necessarily a valid word.

In [40]:
input1 = "List listed lists listing listings"

# normalization
words1 = input1.lower().split(' ')
print(words1)

# stemming
porter = nltk.PorterStemmer()
print([porter.stem(t) for t in words1])

['list', 'listed', 'lists', 'listing', 'listings']
['list', 'list', 'list', 'list', 'list']
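To make the idea of suffix stripping concrete, here is a deliberately naive sketch. It is not the Porter algorithm (which applies ordered rewrite rules with extra conditions); the suffix list and the minimum stem length of 3 are assumptions made for illustration only.

```python
# Toy suffix stripper: NOT the Porter algorithm, just an illustration.
SUFFIXES = ("ings", "ing", "ed", "s")  # assumed list, longest first

def naive_stem(word):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words = "List listed lists listing listings".lower().split()
print([naive_stem(w) for w in words])
# prints: ['list', 'list', 'list', 'list', 'list']

# The crude rule also damages words that merely end in "s".
print([naive_stem(w) for w in ["rights", "universal", "this"]])
# prints: ['right', 'universal', 'thi']
```

The second output shows why stems are not guaranteed to be real words.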


### Lemmatization

Lemmatization is a variant of stemming in which the resulting forms are actual, meaningful words (lemmas). The stemmer output is shown first for comparison.

In [41]:
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [42]:
[porter.stem(t) for t in udhr[:20]] # Stemming: some results are not valid words

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [43]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']

### Tokenization

In [44]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [45]:
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']
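For comparison, a regular expression can already separate punctuation from words, although it does not split contractions the way `nltk.word_tokenize` does. The pattern below is an assumption chosen for this sentence, not the rule set NLTK uses:

```python
import re

def regex_tokenize(text):
    """Split text into word tokens (optionally with an apostrophe part)
    and single punctuation characters."""
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

sample = "Children shouldn't drink a sugary drink before bed."
print(regex_tokenize(sample))
# prints: ['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']
```

Unlike `str.split`, this detaches the final period from `bed`, but `shouldn't` stays one token, whereas `nltk.word_tokenize` yields `should` and `n't`.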

In [46]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

4

In [47]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
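A naive rule such as "split after `.`, `!` or `?` followed by whitespace" shows why sentence splitting needs more than punctuation matching. This sketch uses only the standard library; the rule is an assumption made for illustration:

```python
import re

text = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
        "Is this the third sentence? Yes, it is!")

# Split wherever sentence-final punctuation is followed by whitespace.
pieces = re.split(r"(?<=[.!?])\s+", text)

print(len(pieces))  # prints: 5
print(pieces[1])    # prints: A gallon of milk in the U.S.
print(pieces[2])    # prints: costs $2.99.
```

The abbreviation `U.S.` is wrongly treated as a sentence end, producing five pieces instead of the four returned by `nltk.sent_tokenize`.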

## Advanced NLP Tasks with NLTK

### Part-of-speech (POS) tagging

Assign a word-class tag (noun, verb, adjective, ...) to each word or token.

In [48]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [49]:
text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]

#### Ambiguity in POS tagging

In [50]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

#### Parsing sentence structure
Making sense of sentences is easy if they follow a well-defined grammatical structure.

In [51]:
text15 = nltk.word_tokenize("Alice loves Bob")
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

parser = nltk.ChartParser(grammar)
trees = parser.parse_all(text15)
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


##### Ambiguity in Parsing
Ambiguity may exist even if a sentence is grammatically correct: the sentence below has two valid parse trees.

In [52]:
text16 = nltk.word_tokenize("I saw the man with a telescope")
grammar1 = nltk.data.load('mygrammar.cfg')
grammar1

<Grammar with 13 productions>

In [53]:
parser = nltk.ChartParser(grammar1)
trees = parser.parse_all(text16)
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (Det the) (N man)))
    (PP (P with) (NP (Det a) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (Det the) (N man) (PP (P with) (NP (Det a) (N telescope))))))


In [54]:
from nltk.corpus import treebank

text17 = treebank.parsed_sents('wsj_0001.mrg')[0]
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


### POS tagging and parsing ambiguity

In [55]:
# Uncommon usage of words
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

In [56]:
# Well-formed sentences may still be meaningless
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]

* #### POS tagging provides insights into the word classes/types in a sentence.
* #### Parsing the grammatical structure helps derive meaning.
* #### Both tasks are difficult, and linguistic ambiguity makes them even more difficult.
* #### Better models could be learned with supervised training.
* #### NLTK provides access to tools and data for training.

## Spelling Correction
A common way to check for misspelled words and correct them is to find valid words with similar spelling.

Requires:
* A dictionary with valid words - `words` from `nltk.corpus`
* Some way to measure spelling similarity - "edit distance" between two strings.

Edit distance
* Number of changes that need to be made to string `A` to get to string `B`
* **Levenshtein distance** counts three kinds of single-character change:
    * Insertions
    * Deletions
    * Substitutions
* **N-grams**
    * Character sequences of size n in a word (the examples here use n = 2, i.e. bigrams)
    * Can also be built over words (n words that usually appear together)
    * `pierce` => `pi`, `ie`, `er`, `rc`, `ce`
    * `pierse` => `pi`, `ie`, `er`, `rs`, `se`
* **Jaccard similarity**
    * Used to measure similarity of sets
    * Jaccard index (coefficient of similarity) of two sets `A` and `B`: $J(A, B) = {|A \bigcap B| \over |A \bigcup B|}$
    * Jaccard(`pierce`, `pierse`) = $ 3 \over 7 $ on their bigram sets (3 shared bigrams out of 7 distinct ones)
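
The measures listed above can be sketched in plain Python as follows. The function names (`char_ngrams`, `jaccard_similarity`, `levenshtein`) are chosen here for illustration; NLTK ships its own implementations in `nltk.metrics.distance`, which may behave differently on edge cases.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_similarity(a, b):
    """|A intersect B| / |A union B|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 1.0
    return len(a & b) / len(union)

def levenshtein(s, t):
    """Minimum number of insertions, deletions and substitutions
    turning s into t (dynamic programming, one row at a time)."""
    prev = list(range(len(t) + 1))   # distances from "" to prefixes of t
    for i, cs in enumerate(s, 1):
        curr = [i]                   # distance from s[:i] to ""
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

a, b = char_ngrams("pierce"), char_ngrams("pierse")
print(sorted(a & b))               # prints: ['er', 'ie', 'pi']
print(len(a | b))                  # prints: 7
print(levenshtein("pierce", "pierse"))  # prints: 1
```

Here `jaccard_similarity(a, b)` equals 3/7, matching the worked example, while the Levenshtein distance is 1 because a single substitution (`c` to `s`) is enough.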

#### Main concepts
* NLTK and simple text processing can be used to build a simple spelling checker
* N-grams can similarly be used to capture word phrases to find "near duplicate" passages
* N-grams are also fundamental to many NLP pre-processing steps, such as suffix/prefix matching and character-level embeddings
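
Putting these pieces together, a minimal spelling-suggestion sketch might look like the following. `WORD_LIST` is a tiny hypothetical stand-in for the `words` corpus from `nltk.corpus`, and restricting candidates to the same first letter is an assumed heuristic to shrink the search, not something the approach requires.

```python
def char_ngrams(word, n=2):
    """Return the set of character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}

def jaccard_distance(a, b):
    """1 - Jaccard similarity; 0.0 means identical sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)

# Hypothetical dictionary; a real checker would load a full word list.
WORD_LIST = ["pierce", "piece", "fierce", "purse", "price", "spice"]

def suggest(misspelled, candidates, n=2):
    """Return the candidate whose n-gram set is closest to the input.
    Assumption: the first letter of the input is correct."""
    target = char_ngrams(misspelled, n)
    same_initial = [w for w in candidates if w[0] == misspelled[0]]
    if not same_initial:
        return None
    return min(same_initial,
               key=lambda w: jaccard_distance(target, char_ngrams(w, n)))

print(suggest("pierse", WORD_LIST))  # prints: pierce
```

Swapping `jaccard_distance` for an edit distance in the `key` function gives the Levenshtein-based variant of the same checker.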
