<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#NLP-Tasks-with-NLTK" data-toc-modified-id="NLP-Tasks-with-NLTK-1">NLP Tasks with NLTK</a></span><ul class="toc-item"><li><span><a href="#Counting-vocabulary-of-words" data-toc-modified-id="Counting-vocabulary-of-words-1.1">Counting vocabulary of words</a></span></li><li><span><a href="#Frequency-of-words" data-toc-modified-id="Frequency-of-words-1.2">Frequency of words</a></span></li><li><span><a href="#Tokenization" data-toc-modified-id="Tokenization-1.3">Tokenization</a></span></li><li><span><a href="#Normalization-and-Stemming" data-toc-modified-id="Normalization-and-Stemming-1.4">Normalization and Stemming</a></span></li><li><span><a href="#Lemmatization" data-toc-modified-id="Lemmatization-1.5">Lemmatization</a></span></li><li><span><a href="#Remove-Punctuation" data-toc-modified-id="Remove-Punctuation-1.6">Remove Punctuation</a></span></li><li><span><a href="#Remove-Stopwords" data-toc-modified-id="Remove-Stopwords-1.7">Remove Stopwords</a></span></li></ul></li><li><span><a href="#Advanced-NLP-Tasks-with-NLTK" data-toc-modified-id="Advanced-NLP-Tasks-with-NLTK-2">Advanced NLP Tasks with NLTK</a></span><ul class="toc-item"><li><span><a href="#Part-of-Speech-(POS)-tagging" data-toc-modified-id="Part-of-Speech-(POS)-tagging-2.1">Part-of-Speech (POS) tagging</a></span></li><li><span><a href="#POS-tagging-ambiguity" data-toc-modified-id="POS-tagging-ambiguity-2.2">POS tagging ambiguity</a></span></li></ul></li><li><span><a href="#Classifiing-Text" data-toc-modified-id="Classifiing-Text-3">Classifiing Text</a></span></li></ul></div>

## NLP Tasks with NLTK

In [37]:
# !conda install nltk

In [1]:
import nltk

In [2]:
# download the necessary NLTK data packages (opens the interactive downloader)
nltk.download() 

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [3]:
from nltk.book import *

*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908


### Counting vocabulary of words

In [4]:
text7

<Text: Wall Street Journal>

In [5]:
#one sentence from text7
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [6]:
# how many tokens are in sent7? (punctuation marks count as tokens)
len(sent7)

18

In [7]:
#how many tokens in the whole text7?
len(text7)

100676

In [8]:
#how many unique tokens in text7?
len(set(text7))

12408

In [9]:
# 10 unique tokens (a set is unordered, so which 10 you get is arbitrary)
list(set(text7))[:10]

['neat',
 'heads',
 'Sasaki',
 'can',
 'vintages',
 'objectionable',
 '12-member',
 'producers',
 'packages',
 'identity-management']
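
The ratio of unique tokens to total tokens is sometimes called lexical diversity. The sketch below computes it on a small hand-made token list rather than on `text7`; the list and the helper name `lexical_diversity` are illustrative assumptions, but the same expressions should apply to any list of tokens.

```python
# Toy token list (hypothetical); any list of tokens, such as text7, works the same way
tokens = ['the', 'board', 'will', 'join', 'the', 'board', ',', 'The', 'director', '.']

def lexical_diversity(tokens):
    """Return the share of tokens that are unique (unique / total)."""
    return len(set(tokens)) / len(tokens)

n_tokens = len(tokens)                                  # total tokens
n_unique = len(set(tokens))                             # case-sensitive: 'the' != 'The'
n_unique_folded = len(set(t.lower() for t in tokens))   # case-insensitive count

print(n_tokens, n_unique, n_unique_folded)
print(lexical_diversity(tokens))

# sorted() gives a reproducible order, unlike slicing a set directly
first_sorted = sorted(set(tokens))[:5]
print(first_sorted)
```

Lowercasing before counting merges `'the'` and `'The'`, which is why the folded count is smaller.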

### Frequency of words

* **`FreqDist`**

In [42]:
dist = FreqDist(text7)
len(dist)
#the number of unique tokens

12408

In [71]:
dist

FreqDist({',': 4885, 'the': 4045, '.': 3828, 'of': 2319, 'to': 2164, 'a': 1878, 'in': 1572, 'and': 1511, '*-1': 1123, '0': 1099, ...})

Take the **`.keys()`** of the frequency distribution to get the tokens

In [43]:
vocab1 = dist.keys()
# In Python 3 dict.keys() returns an iterable view instead of a list
list(vocab1)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

How many times did the word `four` appear in text7?

In [12]:
dist['four']

20

In [50]:
# dist.items()

# returns (token, frequency) pairs

Print the 10 most frequent tokens

In [52]:
order = sorted(dist.items(), key=lambda x: x[1], reverse = True)    
order[:10]

[(',', 4885),
 ('the', 4045),
 ('.', 3828),
 ('of', 2319),
 ('to', 2164),
 ('a', 1878),
 ('in', 1572),
 ('and', 1511),
 ('*-1', 1123),
 ('0', 1099)]
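
`FreqDist` is a subclass of Python's `collections.Counter`, so the manual sort above can also be written with `most_common`. A minimal sketch on a toy token list (the tokens are made up for illustration), using `Counter` directly so that it runs without any NLTK data:

```python
from collections import Counter

# Toy tokens (hypothetical); FreqDist(text7) can be queried the same way
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat', ',', 'the', 'cat', '.']
counts = Counter(tokens)

# most_common(n) returns the n highest counts as (token, count) pairs, already sorted
top2 = counts.most_common(2)
print(top2)

# Relative frequency of one token: its count divided by the total number of tokens
rel_freq_the = counts['the'] / sum(counts.values())
print(rel_freq_the)
```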

Which words are longer than 5 letters AND appear more than 100 times in the text?

In [23]:
freqwords = [w for w in dist.keys() if len(w) > 5 and dist[w] > 100]
freqwords

['billion',
 'company',
 'president',
 'because',
 'market',
 'million',
 'shares',
 'trading',
 'program']

### Tokenization
* **`nltk.word_tokenize(text)`**
* **`nltk.sent_tokenize(text)`** sentence tokenizer

In [34]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [29]:
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']

Notice that `shouldn't` is split into two tokens: `should` and `n't`.
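
To see why splitting on spaces is not enough, here is a rough regular-expression tokenizer built only on the standard library. It is an illustrative sketch, not how `nltk.word_tokenize` works: it separates punctuation from words but keeps contractions such as `shouldn't` as a single token. The pattern and the name `simple_tokenize` are assumptions made for this example.

```python
import re

def simple_tokenize(text):
    """Split text into word tokens (optionally with an apostrophe part) and punctuation marks."""
    # \w+(?:'\w+)?  -> a word, optionally followed by an apostrophe suffix (e.g. shouldn't)
    # [^\w\s]       -> any single character that is neither a word character nor whitespace
    return re.findall(r"\w+(?:'\w+)?|[^\w\s]", text)

text11 = "Children shouldn't drink a sugary drink before bed."
tokens = simple_tokenize(text11)
print(tokens)
```

Unlike `split(' ')`, this detaches the final period from `bed`; unlike `nltk.word_tokenize`, it does not split the contraction.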

In [58]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

4

In [36]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']

In [70]:
# count the tokens in each sentence
[len(nltk.word_tokenize(one_sent)) for one_sent in sentences]

[6, 11, 6, 5]

### Normalization and Stemming

**Normalizing**
* **`lower()`** converts words to lowercase

In [28]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

**Stemming**

Reducing words to their stems

* **`nltk.PorterStemmer()`**

In [29]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']
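
Stemmers work by applying suffix-stripping rules rather than by looking words up in a dictionary. The toy function below illustrates the idea with a handful of hard-coded suffixes; it is a deliberately naive sketch, not the Porter algorithm, and `naive_stem` and its suffix list are hypothetical.

```python
def naive_stem(word, suffixes=('ings', 'ing', 'ed', 's')):
    """Strip the first matching suffix, keeping at least three characters of the word."""
    for suffix in suffixes:  # longer suffixes are listed first so 'ings' wins over 's'
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

words1 = "List listed lists listing listings".lower().split(' ')
stems = [naive_stem(w) for w in words1]
print(stems)

# Rule-based stripping can produce strings that are not real words
odd_stem = naive_stem('during')
print(odd_stem)
```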

### Lemmatization

In [33]:
#print first 20 words of UDHR corpus
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

If you apply stemming here, you will find that not all of the resulting tokens are meaningful words. **Lemmatization** addresses this by mapping each word to a valid dictionary form.

In [31]:
[porter.stem(t) for t in udhr[:20]] # this is stemming, not lemmatization

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

* **`nltk.WordNetLemmatizer()`**

In [32]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']
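
In the output above, `'rights'` became `'right'` but `'Rights'` was left unchanged, which suggests the lookup is case-sensitive. The sketch below mimics that behaviour with a tiny hand-written lemma table (the table and `toy_lemmatize` are hypothetical stand-ins, not WordNet) and shows that lowercasing first lets both forms match.

```python
# Hypothetical miniature lemma dictionary standing in for a real lexical resource
LEMMAS = {'rights': 'right', 'children': 'child', 'was': 'be'}

def toy_lemmatize(token, lemmas=LEMMAS):
    """Return the dictionary form if the token is known, otherwise the token itself."""
    return lemmas.get(token, token)

tokens = ['Human', 'Rights', 'rights', 'dignity']

as_is = [toy_lemmatize(t) for t in tokens]            # exact-match lookup
folded = [toy_lemmatize(t.lower()) for t in tokens]   # lowercase before lookup

print(as_is)
print(folded)
```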

### Remove Punctuation

Remove punctuation using the **`string`** module

In [15]:
import string

mess = 'Sample message! Notice: it has punctuation.'

# Check characters to see if they are in punctuation
nopunc = [char for char in mess if char not in string.punctuation]

# Join the characters again to form the string.
nopunc = ''.join(nopunc)

In [16]:
nopunc

'Sample message Notice it has punctuation'
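
An alternative to the character-by-character list comprehension is `str.translate` with a table built by `str.maketrans`, which deletes every punctuation character in one pass. A short sketch on the same message:

```python
import string

mess = 'Sample message! Notice: it has punctuation.'

# The third argument of str.maketrans lists the characters to delete
remove_punct = str.maketrans('', '', string.punctuation)
nopunc = mess.translate(remove_punct)
print(nopunc)
```

Note that `string.punctuation` includes the apostrophe, so with either approach a contraction such as `shouldn't` becomes `shouldnt`.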

### Remove Stopwords
Now remove *stopwords*

In [44]:
from nltk.corpus import stopwords

stopwords.words('english')[0:10] # Show some stop words

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]

In [18]:
nopunc.split()

['Sample', 'message', 'Notice', 'it', 'has', 'punctuation']

In [19]:
stop_words = set(stopwords.words('english'))  # build the set once for fast lookups
clean_mess = [word for word in nopunc.split()
              if word.lower() not in stop_words]

clean_mess

['Sample', 'message', 'Notice', 'punctuation']

Now let's combine above code and write a function that removes **punctuation** and **stopwords**

In [20]:
def punc_stopwords(text):
    """
    Takes in a string of text, then performs the following:
    1. Removes all punctuation
    2. Removes all stopwords
    3. Returns the remaining words as a list
    """
    # Keep only the characters that are not punctuation
    nopunc = [char for char in text if char not in string.punctuation]
    # Join the characters again to form the string.
    nopunc = ''.join(nopunc)

    # Load the stopword list once, as a set, instead of once per word
    stop_words = set(stopwords.words('english'))
    # Now just remove any stopwords
    return [word for word in nopunc.split()
            if word.lower() not in stop_words]
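
As a usage sketch, the variant below takes the stopword collection as a parameter so that it can be run without downloading the NLTK stopword corpus. The name `clean_text` and the three-word stop list are assumptions for illustration; in practice you would pass `set(stopwords.words('english'))`.

```python
import string

def clean_text(text, stop_words):
    """Remove punctuation, then drop any word whose lowercase form is in stop_words."""
    nopunc = ''.join(char for char in text if char not in string.punctuation)
    return [word for word in nopunc.split() if word.lower() not in stop_words]

# Tiny hypothetical stop list; the real English list is much longer
toy_stop_words = {'it', 'has', 'the'}

result = clean_text('Sample message! Notice: it has punctuation.', toy_stop_words)
print(result)
```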

## Advanced NLP Tasks with NLTK

### Part-of-Speech (POS) tagging

| Tag |  Word Class | Tag |  Word Class |
|:---:|:-----------:|:---:|:-----------:|
|  CC | Conjunction |  NN |     Noun    |
|  CD |   Cardinal  | POS |  Possessive |
|  DT |  Determiner | PRP |   Pronoun   |
|  IN | Preposition |  RB |    Adverb   |
|  JJ |  Adjective  | SYM |    Symbol   |
|  MD |    Modal    |  VB |     Verb    |

In [33]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


* **`nltk.pos_tag`** to get the tag of each token

In [38]:
text11 = "Children shouldn't drink a sugary drink before bed."

text13 = nltk.word_tokenize(text11)
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
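
`nltk.pos_tag` returns a list of `(token, tag)` tuples, so ordinary list and dictionary tools apply to it. The sketch below hard-codes the tagged output shown above (so it runs without the tagger model) and pulls out the nouns and a count per tag.

```python
from collections import Counter

# Tagged output copied from the cell above
tagged = [('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'), ('drink', 'VB'),
          ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'), ('before', 'IN'),
          ('bed', 'NN'), ('.', '.')]

# Penn Treebank noun tags all start with 'NN' (NN, NNS, NNP, NNPS)
nouns = [token for token, tag in tagged if tag.startswith('NN')]
print(nouns)

# How often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts['NN'])
```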

**Ambiguity in POS Tagging**

Some English sentences are ambiguous: you cannot tell what they mean because they can be interpreted in different ways.

In [39]:
text14 = nltk.word_tokenize("Visiting aunts can be a nuisance")
nltk.pos_tag(text14)

[('Visiting', 'VBG'),
 ('aunts', 'NNS'),
 ('can', 'MD'),
 ('be', 'VB'),
 ('a', 'DT'),
 ('nuisance', 'NN')]

### POS tagging ambiguity

In [43]:
text18 = nltk.word_tokenize("The old man the boat")
nltk.pos_tag(text18)

# the tagger labels 'man' as a noun (NN), but in this sentence 'man' is the verb

[('The', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('the', 'DT'), ('boat', 'NN')]

In [41]:
text19 = nltk.word_tokenize("Colorless green ideas sleep furiously")
nltk.pos_tag(text19)

# a grammatically well-formed but meaningless sentence

[('Colorless', 'NNP'),
 ('green', 'JJ'),
 ('ideas', 'NNS'),
 ('sleep', 'VBP'),
 ('furiously', 'RB')]
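
One way to see why ambiguity is hard: a tagger that assigns each word a single fixed tag, ignoring context, cannot handle a word that plays two roles. The lookup table below is a hypothetical toy (not how `nltk.pos_tag` works); it tags both occurrences of `drink` identically, whereas the tagged output for `text11` earlier gave `VB` for the first and `NN` for the second.

```python
# Hypothetical word -> tag table; each word gets exactly one tag regardless of context
LOOKUP = {'children': 'NNS', 'should': 'MD', "n't": 'RB', 'drink': 'NN',
          'a': 'DT', 'sugary': 'JJ', 'before': 'IN', 'bed': 'NN', '.': '.'}

def lookup_tag(tokens, table=LOOKUP, default='NN'):
    """Tag each token by table lookup on its lowercase form; unknown words get `default`."""
    return [(tok, table.get(tok.lower(), default)) for tok in tokens]

tokens = ['Children', 'should', "n't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed', '.']
tagged = lookup_tag(tokens)

# Both occurrences of 'drink' receive the same tag
drink_tags = [tag for tok, tag in tagged if tok == 'drink']
print(drink_tags)
```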

-----------------------

## Classifying Text

You can use one of the following classifiers from scikit-learn:
* **Naive Bayes:**
```python
from sklearn.naive_bayes import MultinomialNB
```


* **Support Vector Machine:**
```python
from sklearn.svm import SVC
```
    * **`C`**: Regularization parameter
    * **`kernel`**: 'linear', 'rbf', etc.
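
A minimal end-to-end sketch of how these classifiers might be used on text. The four training sentences, their labels, and the choice of `CountVectorizer` for bag-of-words features are assumptions made for illustration; real tasks need far more data and a held-out test set.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy training data (hypothetical)
train_texts = ["win a free prize now", "free money offer",
               "meeting at noon tomorrow", "see you at the meeting"]
train_labels = ["spam", "spam", "ham", "ham"]

# Each pipeline converts text to word counts, then fits the classifier on those counts
nb_model = make_pipeline(CountVectorizer(), MultinomialNB())
svm_model = make_pipeline(CountVectorizer(), SVC(kernel='linear', C=1.0))

nb_model.fit(train_texts, train_labels)
svm_model.fit(train_texts, train_labels)

new_texts = ["free prize offer", "meeting at noon"]
nb_pred = list(nb_model.predict(new_texts))
svm_pred = list(svm_model.predict(new_texts))
print(nb_pred, svm_pred)
```

Wrapping the vectorizer and classifier in one pipeline means the same vocabulary learned during `fit` is reused at prediction time.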