<DIV ALIGN=CENTER>

# Introduction to NLP: Basic Concepts
## Professor Robert J. Brunner
  
</DIV>  
-----
-----


## Introduction

In this IPython Notebook, we explore basic concepts in NLP:

- Tokenization
- Chunking
- Part-of-speech (POS) tagging
- Named entity recognition (NER)

Example application?

The examples in this Notebook use the NLTK library; spaCy is another Python library that covers similar tasks.

-----

The cells below need the following NLTK data packages, which can be fetched with `nltk.download`:

- `punkt`
- `averaged_perceptron_tagger`
- `maxent_ne_chunker`
- `words`
- `universal_tagset`
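
The downloads can be scripted, as sketched below. The first five names are the ones listed above; the `treebank` and `brown` entries are our addition, on the assumption that the corpus cells later in this Notebook need them. Resource names can differ between NLTK releases, so if a cell raises a `LookupError`, the error message names the package to fetch. The helper `download_resources` is a hypothetical name.

```python
# Data packages listed above (models and tagsets).
MODEL_PACKAGES = [
    "punkt",
    "averaged_perceptron_tagger",
    "maxent_ne_chunker",
    "words",
    "universal_tagset",
]

# Assumption: corpora read by the later Penn Treebank and Brown cells.
CORPUS_PACKAGES = ["treebank", "brown"]


def download_resources(packages):
    """Fetch each named package with nltk.download (needs network access)."""
    import nltk  # imported here so the lists can be inspected without NLTK

    for name in packages:
        nltk.download(name)
```

Calling `download_resources(MODEL_PACKAGES + CORPUS_PACKAGES)` once per environment should be enough; the function is deliberately not called here.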


In [1]:
# As a text example, we use the course description for INFO 490 SP16.
info_course = ['Advanced Data Science: This class is an asynchronous, online course.', 
               'This course will introduce advanced data science concepts by building on the foundational concepts presented in INFO 490: Foundations of Data Science.', 
               'Students will first learn how to perform more statistical data exploration and constructing and evaluating statistical models.', 
               'Next, students will learn machine learning techniques including supervised and unsupervised learning, dimensional reduction, and cluster finding.', 
               'An emphasis will be placed on the practical application of these techniques to high-dimensional numerical data, time series data, image data, and text data.', 
               'Finally, students will learn to use relational databases and cloud computing software components such as Hadoop, Spark, and NoSQL data stores.', 
               'Students must have access to a fairly modern computer, ideally that supports hardware virtualization, on which they can install software.', 
               'This class is open to sophomores, juniors, seniors and graduate students in any discipline who have either taken a previous INFO 490 data science course or have received instructor permission.']

text = " ".join(info_course)

from nltk import sent_tokenize
snts = sent_tokenize(text)
print('{0} sentences in course description'.format(len(snts)))
print(40*'-')
print(snts[2])

8 sentences in course description
----------------------------------------
Students will first learn how to perform more statistical data exploration and constructing and evaluating statistical models.
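
`sent_tokenize` does more than split on periods. As a point of comparison, the sketch below is a naive splitter (a hypothetical helper, not part of NLTK) that breaks after every `.`, `!`, or `?` followed by whitespace. On the sample text, adapted from the Penn Treebank sentence that appears later in this Notebook, it wrongly breaks after the abbreviation `Nov.`, which is the kind of case a trained sentence tokenizer is meant to handle.

```python
import re


def naive_sent_split(text):
    """Split after '.', '!' or '?' when followed by whitespace.

    No abbreviation handling: this is deliberately simplistic.
    """
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "Pierre Vinken will join the board Nov. 29. He is 61 years old."
pieces = naive_sent_split(sample)

# Two real sentences, but the naive rule produces three pieces.
print(len(pieces))
for piece in pieces:
    print(repr(piece))
```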


In [2]:
from nltk import word_tokenize
wtks = word_tokenize(text)

print('{0} words in course description'.format(len(wtks)))
print(40*'-')

# Display the tokens
import pprint
pp = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=True)

pp.pprint(wtks[:13])

185 words in course description
----------------------------------------
[ 'Advanced', 'Data', 'Science', ':', 'This', 'class', 'is', 'an',
  'asynchronous', ',', 'online', 'course', '.']


In [3]:
from nltk.tokenize import WhitespaceTokenizer
tokenizer = WhitespaceTokenizer()
wtks = tokenizer.tokenize(text)

print('{0} words in course description (WS Tokenizer)'.format(len(wtks)))
print(40*'-')

pp.pprint(wtks[:10])

161 words in course description (WS Tokenizer)
----------------------------------------
[ 'Advanced', 'Data', 'Science:', 'This', 'class', 'is', 'an', 'asynchronous,',
  'online', 'course.']


In [4]:
from nltk.tokenize import WordPunctTokenizer
tokenizer = WordPunctTokenizer()
wtks = tokenizer.tokenize(text)

print('{0} words in course description (WP Tokenizer)'.format(len(wtks)))
print(40*'-')

pp.pprint(wtks[:13])

187 words in course description (WP Tokenizer)
----------------------------------------
[ 'Advanced', 'Data', 'Science', ':', 'This', 'class', 'is', 'an',
  'asynchronous', ',', 'online', 'course', '.']
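
The three token counts differ because each tokenizer draws token boundaries differently. The plain-Python sketch below approximates two of them: `str.split` stands in for `WhitespaceTokenizer`, and the pattern `\w+|[^\w\s]+` is, to our knowledge, the one `WordPunctTokenizer` is built on. The helper names are ours. Whitespace splitting leaves punctuation attached to words, while the word/punctuation pattern separates it and also breaks hyphenated words apart, which likely explains the two extra tokens relative to `word_tokenize` (from `high-dimensional`).

```python
import re

# Runs of word characters, or runs of characters that are neither
# word characters nor whitespace (i.e. punctuation).
WORD_PUNCT = re.compile(r"\w+|[^\w\s]+")


def whitespace_tokens(text):
    """Split on whitespace only; punctuation stays attached to words."""
    return text.split()


def word_punct_tokens(text):
    """Separate alphanumeric runs from punctuation runs."""
    return WORD_PUNCT.findall(text)


first = "Advanced Data Science: This class is an asynchronous, online course."

print(len(whitespace_tokens(first)), whitespace_tokens(first))
print(len(word_punct_tokens(first)), word_punct_tokens(first))
print(word_punct_tokens("high-dimensional"))
```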


-----

### Collocations

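Collocations are word sequences that occur together more often than chance would predict. We rank candidate n-grams by pointwise mutual information (PMI), `PMI(x, y) = log2( p(x, y) / (p(x) p(y)) )`, which compares how often two words appear together with how often they would if they were independent. In a text this short, PMI tends to favor n-grams whose words each occur only once, so the top-ranked results share the same score.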

-----

In [5]:
top_bgs = 10

from nltk.collocations import BigramCollocationFinder, BigramAssocMeasures
bigram_measures = BigramAssocMeasures()
finder = BigramCollocationFinder.from_words(wtks)
bgs = finder.nbest(bigram_measures.pmi, top_bgs)

print('Best {0} bi-grams in course description (WP Tokenizer)'.format(top_bgs))
print(50*'-')

ppf = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=False)
ppf.pprint(bgs)

Best 10 bi-grams in course description (WP Tokenizer)
--------------------------------------------------
[ ('An', 'emphasis'),
  ('an', 'asynchronous'),
  ('any', 'discipline'),
  ('as', 'Hadoop'),
  ('be', 'placed'),
  ('by', 'building'),
  ('can', 'install'),
  ('cloud', 'computing'),
  ('cluster', 'finding'),
  ('components', 'such')]
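
To see why these particular pairs rank highest, the sketch below computes bigram PMI by hand as log2(N · c(x, y) / (c(x) · c(y))), where c counts occurrences and N is the number of tokens. We believe this corresponds to what `BigramAssocMeasures.pmi` computes, but treat that as an assumption; the function name and toy sentence are ours. Pairs whose words each occur once all receive the maximum score log2(N), so ties are broken alphabetically here, which mirrors the alphabetical look of the list above.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Score each adjacent pair with log2(N * c(x, y) / (c(x) * c(y)))."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): math.log2(c_xy * n / (unigrams[x] * unigrams[y]))
        for (x, y), c_xy in bigrams.items()
    }


# Toy text (made up for illustration).
tokens = ("the data science course builds on the data science "
          "foundations of the course").split()

scores = bigram_pmi(tokens)

# Highest score first; ties broken alphabetically by the bigram itself.
ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
for bigram, score in ranked[:3]:
    print(bigram, round(score, 2))
```

Note that the repeated and arguably more meaningful pair `('data', 'science')` scores below the one-off pairs, which is the usual argument for combining PMI with a minimum-frequency cutoff.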


In [6]:
from nltk.collocations import TrigramAssocMeasures, TrigramCollocationFinder

trigram_measures = TrigramAssocMeasures()
finder = TrigramCollocationFinder.from_words(wtks)
tgs = finder.nbest(trigram_measures.pmi, top_bgs)

print('Best {0} tri-grams in course description (WP Tokenizer)'.format(top_bgs))
print(50*'-')

ppf = pprint.PrettyPrinter(indent=2, depth=2, width=80, compact=False)
ppf.pprint(tgs)

Best 10 tri-grams in course description (WP Tokenizer)
--------------------------------------------------
[ ('any', 'discipline', 'who'),
  ('components', 'such', 'as'),
  ('fairly', 'modern', 'computer'),
  ('ideally', 'that', 'supports'),
  ('received', 'instructor', 'permission'),
  ('such', 'as', 'Hadoop'),
  ('supports', 'hardware', 'virtualization'),
  ('that', 'supports', 'hardware'),
  ('they', 'can', 'install'),
  ('use', 'relational', 'databases')]
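
The same ranking issue affects trigrams. One common remedy is to ignore n-grams seen fewer than some minimum number of times before scoring; NLTK's collocation finders offer `apply_freq_filter` for this. The plain-Python sketch below illustrates the idea with a hypothetical `trigram_pmi` helper and a made-up token list. The trigram form of PMI used here, log2(N² · c(x, y, z) / (c(x) · c(y) · c(z))), is our assumption about how the bigram formula generalizes.

```python
import math
from collections import Counter


def trigram_pmi(tokens, min_count=1):
    """PMI for each trigram that occurs at least min_count times."""
    n = len(tokens)
    unigrams = Counter(tokens)
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return {
        gram: math.log2(
            count * n ** 2
            / (unigrams[gram[0]] * unigrams[gram[1]] * unigrams[gram[2]])
        )
        for gram, count in trigrams.items()
        if count >= min_count
    }


# Made-up token list in which only one trigram repeats.
tokens = "such as Hadoop and Spark , such as Hadoop or NoSQL".split()

unfiltered = trigram_pmi(tokens)
filtered = trigram_pmi(tokens, min_count=2)

print(max(unfiltered, key=unfiltered.get))  # a one-off trigram wins
print(filtered)                             # only the repeated trigram remains
```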


-----

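## Tagging

Tagging attaches a label to every token. The simplest tagger, `DefaultTagger`, assigns the same tag to each token; it is mainly useful as a fallback for other taggers.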


-----

In [7]:
a_tag = 'INFO'

from nltk.tag import DefaultTagger
default_tagger = DefaultTagger(a_tag)
tgs = default_tagger.tag(wtks)

print('Tagged course description (WP Tokenizer)')
print(50*'-')

pp.pprint(tgs[:13])

Tagged course description (WP Tokenizer)
--------------------------------------------------
[ ('Advanced', 'INFO'), ('Data', 'INFO'), ('Science', 'INFO'), (':', 'INFO'),
  ('This', 'INFO'), ('class', 'INFO'), ('is', 'INFO'), ('an', 'INFO'),
  ('asynchronous', 'INFO'), (',', 'INFO'), ('online', 'INFO'),
  ('course', 'INFO'), ('.', 'INFO')]


----

### Part of Speech Tagging

----

In [8]:
from nltk import pos_tag

ptgs = pos_tag(wtks, tagset='universal')

print('POS tagged course description (WP Tokenizer/Universal Tagger)')
print(60*'-')

ppf.pprint(ptgs[:13])

POS tagged course description (WP Tokenizer/Universal Tagger)
------------------------------------------------------------
[ ('Advanced', 'NOUN'),
  ('Data', 'NOUN'),
  ('Science', 'NOUN'),
  (':', '.'),
  ('This', 'DET'),
  ('class', 'NOUN'),
  ('is', 'VERB'),
  ('an', 'DET'),
  ('asynchronous', 'ADJ'),
  (',', '.'),
  ('online', 'ADJ'),
  ('course', 'NOUN'),
  ('.', '.')]


----

More detailed tagging is possible: without the `tagset` argument, `pos_tag` returns the finer-grained Penn Treebank tags.

----

In [9]:
ptgs = pos_tag(wtks)

print('POS tagged course description (WP Tokenizer/Default Tagger)')
print(60*'-')

ppf.pprint(ptgs[:13])

POS tagged course description (WP Tokenizer/Default Tagger)
------------------------------------------------------------
[ ('Advanced', 'NNP'),
  ('Data', 'NNP'),
  ('Science', 'NN'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'VBZ'),
  ('an', 'DT'),
  ('asynchronous', 'JJ'),
  (',', ','),
  ('online', 'JJ'),
  ('course', 'NN'),
  ('.', '.')]
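
Comparing the two outputs shows how the fine-grained Penn Treebank tags collapse into the universal tagset. The sketch below hard-codes only the correspondences visible in the first thirteen tokens above; the dictionary is a hypothetical, partial mapping for illustration, not NLTK's full table, and the `X` fallback for unlisted tags is our choice.

```python
# Partial mapping, read off the two tagged outputs above (not exhaustive).
PENN_TO_UNIVERSAL = {
    "NNP": "NOUN",
    "NN": "NOUN",
    "DT": "DET",
    "VBZ": "VERB",
    "JJ": "ADJ",
    ":": ".",
    ",": ".",
    ".": ".",
}


def to_universal(tagged, mapping=PENN_TO_UNIVERSAL, unknown="X"):
    """Replace each fine-grained tag with its coarse tag (or `unknown`)."""
    return [(word, mapping.get(tag, unknown)) for word, tag in tagged]


penn = [("Advanced", "NNP"), ("Data", "NNP"), ("Science", "NN"), (":", ":"),
        ("This", "DT"), ("class", "NN"), ("is", "VBZ")]

print(to_universal(penn))
```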


-----

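### Named Entity Recognition

`ne_chunk` groups POS-tagged tokens into subtrees labeled with an entity type such as `PERSON` or `ORGANIZATION`. The classifier is imperfect on this text: in the output below, 'Advanced' is labeled a `PERSON`.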

-----

In [10]:
from nltk import ne_chunk

nrcs = ne_chunk(pos_tag(wtks))

print(50*'-')
print('NER tagged course description (WP Tokenizer)')
print(50*'-')

ppf.pprint(nrcs[:13])

--------------------------------------------------
NER tagged course description (WP Tokenizer)
--------------------------------------------------
[ Tree('PERSON', [('Advanced', 'NNP')]),
  Tree('ORGANIZATION', [('Data', 'NNP'), ('Science', 'NN')]),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'VBZ'),
  ('an', 'DT'),
  ('asynchronous', 'JJ'),
  (',', ','),
  ('online', 'JJ'),
  ('course', 'NN'),
  ('.', '.'),
  ('This', 'DT')]
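
The chunked result mixes labeled subtrees with plain `(word, tag)` tuples, so pulling out just the entities means keeping the items that carry a label. The sketch below shows one way to do that. To stay runnable without the NLTK models, it uses a small stand-in class (hypothetical, ours) that mimics the two `Tree` features relied on here: a `label()` method and iteration over the leaves.

```python
class ChunkStandIn(list):
    """Hypothetical stand-in for nltk.Tree: a list of leaves plus label()."""

    def __init__(self, label, leaves):
        super().__init__(leaves)
        self._label = label

    def label(self):
        return self._label


def named_entities(chunks):
    """Return (entity_type, text) for every labeled subtree in `chunks`."""
    entities = []
    for node in chunks:
        # Subtrees expose label(); plain (word, tag) tuples do not.
        if hasattr(node, "label"):
            text = " ".join(word for word, _tag in node)
            entities.append((node.label(), text))
    return entities


# First few items of the output above, rebuilt by hand.
chunks = [
    ChunkStandIn("PERSON", [("Advanced", "NNP")]),
    ChunkStandIn("ORGANIZATION", [("Data", "NNP"), ("Science", "NN")]),
    (":", ":"),
    ("This", "DT"),
]

print(named_entities(chunks))
```

The same function should work on the real `nrcs` object, since it only assumes the `label()` method and iteration.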


-----

## Corpus

- Penn Treebank
- Brown
- WordNet

-----

In [11]:
from nltk.corpus import treebank

print('Penn Treebank tagged text.')
print(80*'-')

print('Words:     ', end='')
pp.pprint(treebank.words()[:18])
print(80*'-')

print('Sentences: ', end='')
pp.pprint(treebank.sents()[0])
print(80*'-')

print('Tagged Words: ')
pp.pprint(treebank.tagged_words()[:18])
print(80*'-')

print('Tagged Sentences: ')
pp.pprint(treebank.tagged_sents()[0])
print(80*'-')

Penn Treebank tagged text.
--------------------------------------------------------------------------------
Words:     [ 'Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
  'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
--------------------------------------------------------------------------------
Sentences: [ 'Pierre', 'Vinken', ',', '61', 'years', 'old', ',', 'will', 'join', 'the',
  'board', 'as', 'a', 'nonexecutive', 'director', 'Nov.', '29', '.']
--------------------------------------------------------------------------------
Tagged Words: 
[ ('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'),
  ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'),
  ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'),
  ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'),
  ('.', '.')]
--------------------------------------------------------------------------------
Tagged Sentences: 
[ ('

In [12]:
from nltk.tag import UnigramTagger
pt_tagger = UnigramTagger(treebank.tagged_sents())

In [13]:
pt_tgs = pt_tagger.tag(wtks)

print('Penn Treebank tagged course description (WP Tokenizer)')
print(60*'-')

ppf.pprint(pt_tgs[:13])

Penn Treebank tagged course description (WP Tokenizer)
------------------------------------------------------------
[ ('Advanced', 'NNP'),
  ('Data', 'NNP'),
  ('Science', 'NN'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'VBZ'),
  ('an', 'DT'),
  ('asynchronous', None),
  (',', ','),
  ('online', None),
  ('course', 'NN'),
  ('.', '.')]
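
The `None` tags appear because a unigram tagger can only label words it saw during training. A rough plain-Python sketch of the idea follows: remember the most frequent tag for each training word and return `None` otherwise. The function names are ours, the training data is the single Treebank sentence shown earlier, and NLTK's actual implementation may differ in its details.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Map each training word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def unigram_tag(model, tokens):
    """Look each token up in the model; unseen words get None."""
    return [(tok, model.get(tok)) for tok in tokens]


train = [[
    ("Pierre", "NNP"), ("Vinken", "NNP"), (",", ","), ("61", "CD"),
    ("years", "NNS"), ("old", "JJ"), (",", ","), ("will", "MD"),
    ("join", "VB"), ("the", "DT"), ("board", "NN"), ("as", "IN"),
    ("a", "DT"), ("nonexecutive", "JJ"), ("director", "NN"),
    ("Nov.", "NNP"), ("29", "CD"), (".", "."),
]]

model = train_unigram(train)
print(unigram_tag(model, ["the", "board", "is", "online", "."]))
```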


----

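The Brown corpus contains over one million tagged words. It uses its own tagset, so the tags below (for example `AT` and `BEZ`) differ from the Penn Treebank tags above.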

----

In [14]:
from nltk.corpus import brown

b_tagger = UnigramTagger(brown.tagged_sents(brown.fileids()))

In [15]:
b_tgs = b_tagger.tag(wtks)

print('Brown tagged course description (WP Tokenizer)')
print(60*'-')

ppf.pprint(b_tgs[:13])

Brown tagged course description (WP Tokenizer)
------------------------------------------------------------
[ ('Advanced', 'JJ-TL'),
  ('Data', 'NNS-TL'),
  ('Science', 'NN-TL'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'BEZ'),
  ('an', 'AT'),
  ('asynchronous', None),
  (',', ','),
  ('online', None),
  ('course', 'NN'),
  ('.', '.')]


In [16]:
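# We can link taggers: tokens the Brown unigram tagger cannot tag (None)
# fall back to the default tagger through the backoff argument.

b_tagger = UnigramTagger(brown.tagged_sents(brown.fileids()), backoff=default_tagger)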

b_tgs = b_tagger.tag(wtks)

print('Brown tagged course description (WP Tokenizer/Linked Tagger)')
print(60*'-')

ppf.pprint(b_tgs[:13])

Brown tagged course description (WP Tokenizer/Linked Tagger)
------------------------------------------------------------
[ ('Advanced', 'JJ-TL'),
  ('Data', 'NNS-TL'),
  ('Science', 'NN-TL'),
  (':', ':'),
  ('This', 'DT'),
  ('class', 'NN'),
  ('is', 'BEZ'),
  ('an', 'AT'),
  ('asynchronous', 'INFO'),
  (',', ','),
  ('online', 'INFO'),
  ('course', 'NN'),
  ('.', '.')]
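
Chaining taggers amounts to trying each one in order and keeping the first tag that is not `None`. The sketch below shows that control flow with plain callables; `tag_with_backoff` and the tiny lookup table are hypothetical. In NLTK itself, the documented way to build such a chain is the `backoff` argument accepted by taggers such as `UnigramTagger`.

```python
def tag_with_backoff(tokens, taggers):
    """Tag each token with the first tagger that returns a non-None tag."""
    tagged = []
    for tok in tokens:
        tag = None
        for tagger in taggers:
            tag = tagger(tok)
            if tag is not None:
                break
        tagged.append((tok, tag))
    return tagged


# A tiny lookup table standing in for a trained unigram model.
lookup = {"This": "DT", "class": "NN", "is": "BEZ", "an": "AT"}

# First try the lookup table, then fall back to a constant tag.
taggers = [lookup.get, lambda tok: "INFO"]

print(tag_with_backoff(["is", "an", "asynchronous", "class"], taggers))
```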


-----

### Extracting Specific Terms

-----

In [17]:
import re

# match() anchors at the start of the tag, so NN also matches NNS, NNP, and
# NNPS, and JJ also matches JJR and JJS.
rgxs = re.compile(r"(JJ|NN|VBN|VBG)")

ptgs = pos_tag(wtks)
trms = [tkn[0] for tkn in ptgs if rgxs.match(tkn[1])]

print('POS tagged course description (WP Tokenizer)')
print(60*'-')
pp.pprint(ptgs[:13])
print(60*'-')
print('POS tagged course description (WP Tokenizer/RegEx applied)')
print(60*'-')
pp.pprint(trms[:7])

POS tagged course description (WP Tokenizer)
------------------------------------------------------------
[ ('Advanced', 'NNP'), ('Data', 'NNP'), ('Science', 'NN'), (':', ':'),
  ('This', 'DT'), ('class', 'NN'), ('is', 'VBZ'), ('an', 'DT'),
  ('asynchronous', 'JJ'), (',', ','), ('online', 'JJ'), ('course', 'NN'),
  ('.', '.')]
------------------------------------------------------------
POS tagged course description (WP Tokenizer/RegEx applied)
------------------------------------------------------------
['Advanced', 'Data', 'Science', 'class', 'asynchronous', 'online', 'course']
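
Because `re.match` only anchors at the start of the string, each alternative in the pattern acts as a prefix test on the tag. The short check below makes that explicit and contrasts it with `fullmatch`, which accepts only the four literal tags. The sample tag list is ours.

```python
import re

pattern = re.compile(r"(JJ|NN|VBN|VBG)")

# A handful of Penn Treebank tags to test against.
tags = ["NN", "NNS", "NNP", "NNPS", "JJ", "JJR", "VBG", "VBN", "VBZ", "DT"]

# match(): the tag only has to START with one of the alternatives.
prefix_hits = [t for t in tags if pattern.match(t)]

# fullmatch(): the whole tag has to equal one of the alternatives.
exact_hits = [t for t in tags if pattern.fullmatch(t)]

print(prefix_hits)
print(exact_hits)
```

If only the four exact tags were wanted, `fullmatch` (or an explicit `$` in the pattern) would be the safer choice.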


-----

### Student Activity

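In the preceding cells, we tokenized a course description, extracted collocations, and tagged the tokens with parts of speech and named entities. Now that you have run the Notebook, go back and make
the following changes to see how the results change.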

1. Change
2. Change 
3. Try 

-----