# Advanced NLP tasks with NLTK
---
*covering the following in the video* 
- part of speech tagging
- parsing the sentence structure 

*all other nlp functions* 
- Counting words, counting frequency of words
- finding sentence boundaries 
- identifying semantic roles (semantic role labeling)
- named entity recognition 
- Co-reference and pronoun resolution 

## Part of speech (POS) tagging
---
* nouns verbs adjectives
* many more tags (conjunction, cardinal, determiner, preposition, modal, possessive)

In [3]:
import nltk
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [5]:
# first split sentence into words
text11 = "Children shoudn't drink a sugary drink before bed."
text13 = nltk.word_tokenize(text11)

In [6]:
# find pos - gives list of tuples 
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('shoud', 'VBP'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
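
Because `nltk.pos_tag` returns a plain list of `(word, tag)` tuples, ordinary Python is enough to post-process it. The sketch below copies the output shown above in as a literal (so it runs without NLTK) and groups the words by tag. The helper name `words_by_tag` is made up for illustration.

```python
from collections import defaultdict

# Tagger output copied from the cell above (no NLTK needed to run this).
tagged = [('Children', 'NNP'), ('shoud', 'VBP'), ("n't", 'RB'),
          ('drink', 'VB'), ('a', 'DT'), ('sugary', 'JJ'),
          ('drink', 'NN'), ('before', 'IN'), ('bed', 'NN'), ('.', '.')]

def words_by_tag(pairs):
    """Group words under their POS tag, keeping sentence order."""
    groups = defaultdict(list)
    for word, tag in pairs:
        groups[tag].append(word)
    return dict(groups)

groups = words_by_tag(tagged)
print(groups['NN'])   # the common nouns
print(groups['VB'])   # the base-form verbs
```

Note that "drink" shows up once as a verb (VB) and once as a noun (NN): the tagger uses context to decide.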

# Ambiguity in POS tagging
---
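- ambiguity is common in English: in the sentence below, "Visiting" can be read as a verb (the act of visiting aunts) or as an adjective (aunts who are visiting)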

*Visiting aunts can be a nuisance.*


In [7]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance.")

In [8]:
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN'),
 ('.', '.')]

In [9]:
nltk.help.upenn_tagset('VBG')

VBG: verb, present participle or gerund
    telegraphing stirring focusing angering judging stalling lactating
    hankerin' alleging veering capping approaching traveling besieging
    encrypting interrupting erasing wincing ...


In [10]:
nltk.help.upenn_tagset('NNS')

NNS: noun, common, plural
    undergraduates scotches bric-a-brac products bodyguards facets coasts
    divestitures storehouses designs clubs fragrances averages
    subjectivists apprehensions muses factory-jobs ...


In [11]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [12]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...


## Parsing sentence structure

In [13]:
text15 = nltk.word_tokenize("Alice loves Bob")

In [14]:
# Create context free grammar 
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

In [15]:
parser = nltk.ChartParser(grammar)

In [17]:
trees = parser.parse_all(text15)

In [18]:
for tree in trees:
    print(tree)

(S (NP Alice) (VP (V loves) (NP Bob)))


## Ambiguity in parsing
*I saw the man with the telescope.*

- To which entity does the prepositional phrase attach?  Did we see the man who had a telescope, or did we use the telescope to see the man?

In [19]:
text16 = nltk.word_tokenize("I saw the man with the telescope")

In [21]:
grammar_tel = nltk.data.load('mygrammar4.cfg')

In [22]:
grammar_tel

<Grammar with 13 productions>

In [23]:
parser = nltk.ChartParser(grammar_tel)

In [24]:
trees = parser.parse_all(text16)

In [25]:
len(trees)

2

In [26]:
for tree in trees:
    print(tree)

(S
  (NP I)
  (VP
    (VP (V saw) (NP (DT the) (N man)))
    (PP (P with) (NP (DT the) (N telescope)))))
(S
  (NP I)
  (VP
    (V saw)
    (NP (DT the) (N man) (PP (P with) (NP (DT the) (N telescope))))))
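
The contents of `mygrammar4.cfg` are not shown in these notes. As a rough sketch, a small grammar along the following lines yields the same two readings. These productions are an assumption chosen for illustration, not the actual file (which has 13 productions). The sentence is split with `str.split()` so that no tokenizer data is needed.

```python
import nltk

# Hypothetical grammar (NOT the contents of mygrammar4.cfg).
# The PP can attach to the verb phrase (VP -> VP PP)
# or to the noun phrase (NP -> DT N PP), which creates the ambiguity.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP | VP PP
PP -> P NP
NP -> DT N | DT N PP | 'I'
DT -> 'the'
N -> 'man' | 'telescope'
V -> 'saw'
P -> 'with'
""")

tokens = "I saw the man with the telescope".split()
toy_trees = nltk.ChartParser(toy_grammar).parse_all(tokens)

print(len(toy_trees))
for tree in toy_trees:
    print(tree)
```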


## NLTK and Parse Tree collection

In [28]:
from nltk.corpus import treebank

In [29]:
text17 = treebank.parsed_sents('wsj_0001.mrg')[0]

In [30]:
print(text17)

(S
  (NP-SBJ
    (NP (NNP Pierre) (NNP Vinken))
    (, ,)
    (ADJP (NP (CD 61) (NNS years)) (JJ old))
    (, ,))
  (VP
    (MD will)
    (VP
      (VB join)
      (NP (DT the) (NN board))
      (PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director)))
      (NP-TMP (NNP Nov.) (CD 29))))
  (. .))


# POS tagging & parsing complexity
--- 
*Uncommon usage of words* 

In [31]:
text18 = nltk.word_tokenize("The old man the boat")

In [32]:
nltk.pos_tag(text18)

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

In [33]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")

In [34]:
nltk.pos_tag(text19)

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]

# take home concepts
---
- POS tagging provides insights into the word classes/types in a sentence
- parsing the grammatical structures helps derive meaning 
- both tasks are difficult; linguistic ambiguity increases the difficulty even more 
- better models could be learned with supervised training
- nltk provides access to tools and data for training 

# Classification of Text
*Supervised learning for text* 

## What is classification? 
---
- Given a set of classes
- Assign class label to an input 

## Supervised Learning 
---
- humans learn from past experiences, machines learn from past instances! 
- Training phase - model is created
    - labeled instances
    - feed to classification algorithm 
    - Build classification model
- Inference phase
    - Create labels for input data 
   

## Supervised Classification 
---
- learn classification model on properties ("features") and their importance ("weights") from labeled instances 
- X: Set of attributes or features {x1, x2, x3, ..., xn} 
    - email - where does it come from
    - text - does it contain the word 'prince' 
- y: A "class" label from the label set Y = {y1, y2, ...,yk}

*Apply model on new instances to predict the label* 

## Supervised Learning - Phases and datasets
---
- Labeled dataset
    - divided into training data
    - validation data - test effectiveness of the model 
    
- Unlabeled dataset
    - do further testing of the model 
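
A minimal sketch of splitting a labeled dataset into training and validation portions, using only the standard library. The toy examples, the 80/20 ratio, and the function name `split_labeled` are assumptions for illustration.

```python
import random

def split_labeled(instances, train_fraction=0.8, seed=0):
    """Shuffle labeled instances and split them into (train, validation)."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # fixed seed for repeatability
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

# Toy labeled dataset: (text, label) pairs.
labeled = [(f"message {i}", "spam" if i % 2 else "ham") for i in range(10)]
train, validation = split_labeled(labeled)
print(len(train), len(validation))
```

The model is built on `train` only; `validation` is held back to check how well it generalizes.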

# Classification paradigms
---
- Where there are only two possible classes; |Y| = 2: 

**BINARY CLASSIFICATION**
        
- Where there are more than two possible classes; |Y| > 2: 

**MULTI-CLASS CLASSIFICATION** 
        
- When data instances can have two or more labels: 
 
**MULTI-LABEL CLASSIFICATION**

# Questions to ask in supervised learning 
---
- Training phase
    - what are the features?  How do you represent them?
    - what is the classification model/algo?
    - what are the model parameters? 
    
- Inference Phase
    - What is the expected performance?  What is a good measure?

# Identifying features from text 

# Why is textual data unique? 
---
- Textual data presents a unique set of challenges 
- All the information that you need is in the text 
- But features can be pulled from text at different granularities! 

## Type of textual features (I)
---
- words
    - By far the most used for features 
    - Handling commonly-occurring words: stop words
    - Normalization: Make lower case vs. leave as-is 
    - stemming / lemmatization
- characteristics of words
    - capitalized? 
    - parts of speech of words in a sentence 
    - grammatical structure, sentence parsing 
    - Grouping words of similar meaning: synonyms (buy/purchase), titles (Mr., Ms., Dr.), dates (recognizable with regular expressions) 
- depending on the classification task, features may come from inside words and from word sequences
    - bigrams, trigrams, n-grams: "White House"  "Saturday Night Live" 
    - character sub-sequences in words: "ing", "ion", ... 
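
The feature types above can be sketched in plain Python. The stop-word list, the regular-expression tokenizer, and the function name `extract_features` are simplifying assumptions; a real pipeline would more likely rely on NLTK's tokenizers and stop-word corpus.

```python
import re

STOP_WORDS = {"the", "a", "an", "at", "in", "on", "is", "of"}  # tiny illustrative list

def extract_features(text):
    """Return word, bigram, capitalization and suffix features for a text."""
    tokens = re.findall(r"[A-Za-z]+", text)
    lowered = [t.lower() for t in tokens]                 # normalization
    words = [t for t in lowered if t not in STOP_WORDS]   # drop stop words
    bigrams = list(zip(lowered, lowered[1:]))             # word sequences
    capitalized = [t for t in tokens if t[0].isupper()]   # characteristic of words
    suffixes = sorted({t[-3:] for t in lowered if len(t) > 3})  # character sub-sequences
    return {"words": words, "bigrams": bigrams,
            "capitalized": capitalized, "suffixes": suffixes}

features = extract_features("Meeting at the White House")
print(features["words"])
print(features["bigrams"])
print(features["capitalized"])
```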

# Naive Bayes Classifiers

## Case study: Classifying text search queries 
---
- Suppose you are interested in classifying search queries into three classes: Entertainment, Computer Science, Zoology
- Most common class is Entertainment 
- Suppose the query is "Python" 
    - Entertainment, Computer Science, or Zoology?
    - if the snake, then Zoology 
    - if the programming language, then Computer Science
    - if Monty Python, then Entertainment 
    
*If the most common label for the word "Python" is Zoology, then that becomes the classification.* 
*If the next word after "Python" is "download", then the query is classified as Computer Science.* 

    

# Probabilistic Model 
--- 
- Update the likelihood of the class given new information 
- Prior probability: Pr(y=Entertainment), Pr(y=CS), Pr(y=Zoology) 
- Posterior probability: Pr(y=Entertainment|x="Python") 
*Note that the posterior probability is the probability after evaluating new information* 

## Bayes' Rule 
---
- Posterior probability = (prior probability x likelihood)/Evidence
- Pr(y|X) = ( Pr(y) x Pr(X|y))/Pr(X)

## Naive Bayes Classification 
---
- probability of computer science as class given the word "Python" 
- in math: Pr(y=CS|"Python") = ( Pr(y=CS) *  Pr("Python"|y=CS) )/ Pr("Python") 
- likewise, the probability that the word "Python" should be labeled Zoology: 
- Pr(y=Zoology|Python) = ( Pr(y=Zoology) * Pr("Python" | y = Zoology) )/Pr("Python") 

*if Pr(y=CS|"Python") > Pr(y=Zoology|"Python") - then label, y, is CS* 
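
A worked numeric sketch of the comparison above. All probabilities here are made-up values for illustration, not estimates from real query logs.

```python
# Hypothetical priors Pr(y) and likelihoods Pr("Python" | y).
prior = {"Entertainment": 0.6, "CS": 0.3, "Zoology": 0.1}
likelihood = {"Entertainment": 0.01, "CS": 0.10, "Zoology": 0.05}

# Evidence Pr("Python"): sum of prior * likelihood over all classes.
evidence = sum(prior[y] * likelihood[y] for y in prior)

# Bayes' rule: posterior = prior * likelihood / evidence.
posterior = {y: prior[y] * likelihood[y] / evidence for y in prior}

best = max(posterior, key=posterior.get)
print(posterior)
print(best)
```

Even though Entertainment has the largest prior in this toy setup, the likelihood term shifts the decision to CS.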

## Naive assumption 
*Given the class label, features are assumed to be independent of each other* 

`argmax Pr(y|X) = argmax Pr(y) * Pr(X|y)`

Final formula for the query "Python download", where y is the predicted class:

`y = argmax Pr(y) * Pr("Python"|y)* Pr("download"|y)`
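
The same decision rule for a multi-word query can be sketched as below. The probabilities are hypothetical, and logs are summed instead of multiplying raw probabilities, a common trick to avoid numeric underflow that does not change the argmax.

```python
import math

# Hypothetical parameters: Pr(y) and Pr(word | y).
prior = {"Entertainment": 0.6, "CS": 0.3, "Zoology": 0.1}
likelihood = {
    "Entertainment": {"python": 0.01, "download": 0.02},
    "CS":            {"python": 0.10, "download": 0.05},
    "Zoology":       {"python": 0.05, "download": 0.001},
}

def classify(query):
    """Return argmax over y of Pr(y) * product of Pr(word | y)."""
    words = query.lower().split()
    scores = {
        y: math.log(prior[y]) + sum(math.log(likelihood[y][w]) for w in words)
        for y in prior
    }
    return max(scores, key=scores.get)

label = classify("Python download")
print(label)
```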

# Naive Bayes: What are the parameters? 
---
- Prior probabilities: Pr(y) for all y in Y 
- Likelihood - seeing a feature in documents of class y: Pr(xi|y) for all features xi and labels y in Y 
- If there are 3 classes and 100 possible features (all features are binary), how many parameters are there? 

3 + (2 * 100 * 3) = 603: 3 class probabilities, plus 100 features times 3 labels times 2 because each feature is binary
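
The count can be checked with a couple of lines. The function name is made up for illustration.

```python
def count_parameters(num_classes, num_features, values_per_feature=2):
    """Priors plus one likelihood per (feature value, feature, class)."""
    priors = num_classes
    likelihoods = values_per_feature * num_features * num_classes
    return priors + likelihoods

total = count_parameters(num_classes=3, num_features=100)
print(total)  # 603
```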


## Naive Bayes: Learning parameters

- probability of entertainment?  
    - training data - count number in each of the classes 
    - N instances in all, of which n are labeled as class y -> Pr(y) = n/N 

- likelihood: Pr(xi|y) for all features xi and labels y in Y 
    - count the number of times feature xi appears in instances labeled as class y 
    - if there are p instances of class y, and xi appears in k of those, Pr(xi|y) = k/p
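
A minimal sketch of both counting steps on a toy training set. The documents, labels, and the function name `learn_parameters` are invented for illustration; features are treated as binary (word present or absent), and no smoothing is applied.

```python
from collections import Counter, defaultdict

def learn_parameters(training_data):
    """Estimate Pr(y) = n/N and Pr(xi|y) = k/p by counting."""
    class_counts = Counter(label for _, label in training_data)
    total = len(training_data)                        # N
    priors = {y: n / total for y, n in class_counts.items()}

    feature_counts = defaultdict(Counter)             # k for each (y, xi)
    for text, label in training_data:
        for word in set(text.lower().split()):        # presence, not frequency
            feature_counts[label][word] += 1

    likelihoods = {
        y: {w: k / class_counts[y] for w, k in counts.items()}
        for y, counts in feature_counts.items()
    }
    return priors, likelihoods

data = [
    ("python download", "CS"),
    ("python tutorial", "CS"),
    ("python snake habitat", "Zoology"),
    ("monty python film", "Entertainment"),
]
priors, likelihoods = learn_parameters(data)
print(priors["CS"])
print(likelihoods["CS"]["download"])
```

Words that never occur with a class (for example "snake" in CS) are simply missing from the table, which is the zero-probability case that smoothing addresses.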


# Naive Bayes: Smoothing 

---

*What happens if Pr(xi|y) = 0?*

- feature xi never occurs in documents labeled y 
- but then posterior probability Pr(y|xi) will be 0! 

**Instead, smooth the parameters** 

- Laplace smoothing or additive smoothing: Add a dummy count 
    - add a count of 1 for every Pr(xi|y) 
    - Pr(xi|y) = (k+1)/(p+n), where n is the number of features 
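
The smoothed estimate can be written as a one-line function. This is a sketch of the formula above with hypothetical counts; the argument names follow the notation in the notes.

```python
def smoothed_likelihood(k, p, n):
    """Laplace-smoothed Pr(xi|y) = (k + 1) / (p + n).

    k: instances of class y that contain feature xi
    p: instances of class y
    n: number of features
    """
    return (k + 1) / (p + n)

# Hypothetical counts: the feature never occurs in class y (k = 0).
unseen = smoothed_likelihood(k=0, p=50, n=100)
seen = smoothed_likelihood(k=10, p=50, n=100)
print(unseen, seen)
```

With the dummy count, an unseen feature gets a small but non-zero probability, so it no longer forces the whole posterior to 0.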


# Take home concepts 
---
- Naive Bayes is a probabilistic model
- Naive because it assumes features are independent of each other, given the class label. 
- For text classification problems, Naive Bayes models typically provide very strong baselines. 
- Simple model to learn, easy to learn parameters 

## Two classic Naive Bayes Variants for Text 
---
- Multinomial Naive Bayes 
    - Data follows a multinomial distribution (each feature can occur multiple times in an instance) 
    - Each feature value is a count (word occurrence counts, TF-IDF weighting, ... ) 
    - Counts track how often each word occurs; TF-IDF weighting additionally gives more importance to rare words. 
- Bernoulli Naive Bayes 
    - Data follows multivariate Bernoulli distribution 
    - Each feature is binary (word is present/absent) 
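
The difference between the two variants shows up in how a document is turned into a feature vector. A small sketch with an invented vocabulary and document:

```python
from collections import Counter

vocabulary = ["python", "download", "snake", "film"]   # hypothetical vocabulary
document = "python python download"

counts = Counter(document.split())

# Multinomial view: each feature is a count.
multinomial_vector = [counts[w] for w in vocabulary]

# Bernoulli view: each feature is binary (present / absent).
bernoulli_vector = [int(counts[w] > 0) for w in vocabulary]

print(multinomial_vector)
print(bernoulli_vector)
```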

# Support Vector Machines 
