# Week 2: Natural Language Processing

# Basic NLP tasks with `NLTK`

In [41]:
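import nltk
nltk.download()  # with no arguments, this opens the interactive NLTK downloader
from nltk.book import *  # loads the sample texts (text1..text9) and sentences (sent1..sent9)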

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


In [42]:
nltk.download('punkt')

[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\pfsch\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

## Investigate texts in `NLTK`

In [3]:
text7

<Text: Wall Street Journal>

In [4]:
sent7

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [5]:
len(sent7)

18

How many tokens does `text7` have? Note that `len` counts punctuation marks as well as words.

In [8]:
len(text7)

100676

How many *unique* tokens does `text7` have?

In [7]:
len(set(text7))

12408

Look at 10 of the unique tokens. A `set` is unordered, so which 10 come back can differ between runs.

In [9]:
list(set(text7))[:10]

['replacing',
 'justice',
 'Monopolies',
 'COPPER',
 'sauce',
 '37-a-share',
 'inherently',
 '37.3',
 'gallon',
 'Even']
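The difference between counting tokens and counting unique tokens can be sketched on a small list. The token list below is a made-up stand-in for `text7` (an assumption for illustration), and `sorted` is used only to make the sample reproducible.

```python
# Toy token list standing in for text7 (hypothetical data, not from the corpus).
tokens = ['The', 'board', 'met', ',', 'and', 'the', 'board', 'voted', '.']

total = len(tokens)          # every token counts, including punctuation
unique = len(set(tokens))    # each distinct token counts once ('The' != 'the')

# Ratio of unique tokens to total tokens, a rough measure of lexical diversity.
diversity = unique / total

# Sorting the set gives a stable order, unlike list(set(...))[:n].
sample = sorted(set(tokens))[:3]

print(total, unique, sample)
```

Because `'The'` and `'the'` are different strings, they are counted as two unique tokens here.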

## Frequency of words

Look at the frequency of words. `FreqDist` holds one entry for each unique word.

In [11]:
dist = FreqDist(text7)
len(dist)

12408

In [12]:
vocab1 = dist.keys()
list(vocab1)[:10]

['Pierre', 'Vinken', ',', '61', 'years', 'old', 'will', 'join', 'the', 'board']

How many times do we see the word 'four'?

In [13]:
dist['four']

20

Find every word longer than 5 characters that appears more than 100 times.

In [14]:
freqwords = [w for w in vocab1 if len(w) > 5 and dist[w] > 100]
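The same filtering pattern can be tried on a small example. This sketch uses `collections.Counter` from the standard library, which `FreqDist` subclasses, so the indexing and `most_common` calls should behave the same way on a `FreqDist`. The sentence and the thresholds are assumptions chosen to suit a tiny input.

```python
from collections import Counter

# Toy text (hypothetical); text7 would be used in the notebook itself.
tokens = "the cat sat on the mat the cat slept".split()

counts = Counter(tokens)     # one entry per unique token

# Same shape as the freqwords comprehension, with thresholds scaled down.
frequent = sorted(w for w in counts if len(w) > 2 and counts[w] >= 2)

top = counts.most_common(1)  # the single most frequent token with its count

print(counts['cat'], frequent, top)
```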

## Normalization

Convert everything to lower case so that `List` and `list` are not counted as different words.

Stemming then reduces different forms of the same word to a common root.

In [16]:
input1 = "List listed lists listing listings"
words1 = input1.lower().split(' ')
words1

['list', 'listed', 'lists', 'listing', 'listings']

In [17]:
porter = nltk.PorterStemmer()
[porter.stem(t) for t in words1]

['list', 'list', 'list', 'list', 'list']
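To see why this matters for counting, here is a minimal sketch that compares frequencies before and after normalization. It reuses the input string from the cell above; the variable names are illustrative.

```python
from collections import Counter
from nltk.stem import PorterStemmer

porter = PorterStemmer()
raw_words = "List listed lists listing listings".split(' ')

# Without normalization, every surface form is its own entry.
raw_counts = Counter(raw_words)

# Lowercase, then stem, so all variants fall into one bucket.
stem_counts = Counter(porter.stem(w.lower()) for w in raw_words)

print(len(raw_counts), dict(stem_counts))
```

Five distinct surface forms collapse into a single stem with a count of 5.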

## Lemmatization

Lemmatization is a form of stemming in which the resulting forms are real, meaningful words.

In [18]:
udhr = nltk.corpus.udhr.words('English-Latin1')
udhr[:20]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'rights',
 'of']

In [20]:
[porter.stem(t) for t in udhr[:20]] # Many of these aren't real words anymore

['univers',
 'declar',
 'of',
 'human',
 'right',
 'preambl',
 'wherea',
 'recognit',
 'of',
 'the',
 'inher',
 'digniti',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalien',
 'right',
 'of']

In [22]:
WNlemma = nltk.WordNetLemmatizer()
[WNlemma.lemmatize(t) for t in udhr[:20]]

['Universal',
 'Declaration',
 'of',
 'Human',
 'Rights',
 'Preamble',
 'Whereas',
 'recognition',
 'of',
 'the',
 'inherent',
 'dignity',
 'and',
 'of',
 'the',
 'equal',
 'and',
 'inalienable',
 'right',
 'of']

Only the lowercase `rights` was reduced to `right`. The capitalized `Rights` is unchanged because the lemmatizer looks each word up as given and does not lowercase it first.

## Tokenization

How to split a sentence into tokens (words and punctuation).

In [23]:
text11 = "Children shouldn't drink a sugary drink before bed."
text11.split(' ')

['Children', "shouldn't", 'drink', 'a', 'sugary', 'drink', 'before', 'bed.']

In [24]:
nltk.word_tokenize(text11)

['Children',
 'should',
 "n't",
 'drink',
 'a',
 'sugary',
 'drink',
 'before',
 'bed',
 '.']

The separation of `should` and `n't` is important here because you can use it to identify negation.
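One simple way to use the `n't` token is to mark every token that follows it, up to the next punctuation mark, as negated. The sketch below is a hand-rolled illustration, not an NLTK function: `mark_negation_scope`, the `_NEG` suffix, and the two word sets are all assumptions made for this example.

```python
# Hypothetical word lists; a real system would use larger ones.
NEGATIONS = {"n't", "not", "no", "never"}
CLAUSE_END = {".", ",", ";", "!", "?"}

def mark_negation_scope(tokens):
    """Append '_NEG' to tokens between a negation word and the next punctuation."""
    marked = []
    negated = False
    for tok in tokens:
        if tok in CLAUSE_END:
            negated = False              # punctuation closes the negation scope
            marked.append(tok)
        elif tok.lower() in NEGATIONS:
            negated = True               # start marking from the next token
            marked.append(tok)
        elif negated:
            marked.append(tok + "_NEG")
        else:
            marked.append(tok)
    return marked

# Token list copied from the word_tokenize output above.
tokens = ['Children', 'should', "n't", 'drink', 'a', 'sugary',
          'drink', 'before', 'bed', '.']
marked = mark_negation_scope(tokens)
print(marked)
```

With `split(' ')` the negation stays hidden inside `shouldn't`, so this kind of marking would need extra string handling.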

You can also tokenize by sentences.

In [25]:
text12 = "This is the first sentence. A gallon of milk in the U.S. costs $2.99. Is this the third sentence? Yes, it is!"
sentences = nltk.sent_tokenize(text12)
len(sentences)

4

In [26]:
sentences

['This is the first sentence.',
 'A gallon of milk in the U.S. costs $2.99.',
 'Is this the third sentence?',
 'Yes, it is!']
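For contrast, here is a naive splitter that breaks after every `.`, `!` or `?` followed by whitespace. It is a hypothetical baseline written for this comparison, not part of NLTK, and it shows the kind of mistake `sent_tokenize` avoids in the output above.

```python
import re

text12 = ("This is the first sentence. A gallon of milk in the U.S. costs $2.99. "
          "Is this the third sentence? Yes, it is!")

# Split on whitespace that directly follows sentence-ending punctuation.
naive = re.split(r'(?<=[.!?])\s+', text12)

for piece in naive:
    print(repr(piece))
print(len(naive))
```

The naive version returns 5 pieces because it treats the period in `U.S.` as a sentence boundary, while the decimal point in `$2.99` is safe only because no whitespace follows it.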

# Part-of-speech (POS) tagging

How to identify nouns, verbs, adjectives...

In [27]:
nltk.help.upenn_tagset('MD')

MD: modal auxiliary
    can cannot could couldn't dare may might must need ought shall should
    shouldn't will would


In [28]:
text11

"Children shouldn't drink a sugary drink before bed."

In [29]:
text13 = nltk.word_tokenize(text11)

In [30]:
nltk.pos_tag(text13)

[('Children', 'NNP'),
 ('should', 'MD'),
 ("n't", 'RB'),
 ('drink', 'VB'),
 ('a', 'DT'),
 ('sugary', 'JJ'),
 ('drink', 'NN'),
 ('before', 'IN'),
 ('bed', 'NN'),
 ('.', '.')]
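Since `pos_tag` returns a list of `(word, tag)` pairs, picking out a word class is a plain list comprehension. The sketch below copies the tagged output above into a literal so it runs without the tagger; grouping tags by their `NN` and `VB` prefixes is a convention assumed here.

```python
# Tagged tokens copied from the pos_tag output above.
tagged = [('Children', 'NNP'), ('should', 'MD'), ("n't", 'RB'), ('drink', 'VB'),
          ('a', 'DT'), ('sugary', 'JJ'), ('drink', 'NN'), ('before', 'IN'),
          ('bed', 'NN'), ('.', '.')]

# Penn Treebank noun tags all start with 'NN', verb tags with 'VB'.
nouns = [word for word, tag in tagged if tag.startswith('NN')]
verbs = [word for word, tag in tagged if tag.startswith('VB')]

print(nouns, verbs)
```

The word `drink` lands in both lists, once as a verb and once as a noun, which is the tagger resolving the two uses from context.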

There is a lot of ambiguity in POS tagging. For example, in "Visiting aunts can be a nuisance," the word *visiting* can be a verb or an adjective, depending on how you read the sentence.

# Parsing sentence structure

Making sense of sentences is easy if they follow a well-defined grammatical structure.

In [31]:
text15 = nltk.word_tokenize('Alice loves Bob')
grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")

In [32]:
parser = nltk.ChartParser(grammar)
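The parser above is created but not yet applied to the sentence. A minimal sketch of that step follows. It restates the grammar so the block is self-contained, and it passes the tokens as a literal list (an assumption made here so the example does not depend on the `punkt` download).

```python
import nltk

grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Alice' | 'Bob'
V -> 'loves'
""")
parser = nltk.ChartParser(grammar)

tokens = ['Alice', 'loves', 'Bob']

# parse() yields every tree the grammar allows for this token sequence.
trees = list(parser.parse(tokens))
for tree in trees:
    print(tree)
```

This grammar allows only one reading of the sentence, so a single tree comes back. An ambiguous grammar would yield several.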

In [33]:
word1 = 'test'

In [35]:
word1.islower()

True

In [38]:
import string

In [40]:
word1.upper()

'TEST'