Part of Speech (POS) tagging 
-----

![](https://nicholasdale.files.wordpress.com/2015/10/parts-of-speech.jpg)

Overview
-----

- English tokens can be grouped into categories, known as parts of speech (noun, verb, adjective, and so on)
- Assigning those categories is a useful but hard classification problem
- Penn Treebank tags are the default label set
- Other label sets exist
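
As a rough illustration of how one label set can relate to another, the sketch below collapses a few Penn Treebank tags into coarser categories. The names `PENN_TO_COARSE` and `to_coarse` are hypothetical, and the table is a hand-picked subset written for this example, not a complete or official mapping.

```python
# Hand-picked subset of Penn Treebank tags mapped to coarse categories.
# This table is an assumption made for illustration, not an official mapping.
PENN_TO_COARSE = {
    "NN": "NOUN", "NNS": "NOUN", "NNP": "NOUN", "NNPS": "NOUN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB", "MD": "VERB",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "PRP": "PRON", "DT": "DET", "IN": "ADP", "CC": "CONJ", "CD": "NUM",
    ".": ".",
}


def to_coarse(tagged, unknown="X"):
    """Replace each Penn tag with its coarse category; unmapped tags become `unknown`."""
    return [(word, PENN_TO_COARSE.get(tag, unknown)) for word, tag in tagged]


penn_tagged = [("I", "PRP"), ("'ll", "MD"), ("be", "VB"), ("back", "RB"), ("!", ".")]
coarse = to_coarse(penn_tagged)
print(*coarse, sep="\n")
```

A coarser label set loses detail (for example, tense on verbs) but is easier to read and to compare across taggers.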

----
Programmatically apply POS tags
---

In [1]:
reset -fs

Download spaCy's English language model if you don't already have it

In [None]:
# ! python -m spacy.en.download

In [1]:
from spacy.en import English  

In [2]:
# Load the tagger only; the parser and entity recognizer are not needed here
nlp = English(tagger=True,
              parser=False,
              entity=False)

In [3]:
sentence = "See Dick run."
tokens = nlp(sentence)

for token in tokens:
    print(token, token.tag_, sep="\t| ")

See	| VB
Dick	| NNP
run	| VB
.	| .


[Tag sentence demo](http://spacy.io/displacy/)

-----

In [4]:
from nltk import tokenize, pos_tag

In [5]:
phrase = "I'll be back!"

In [6]:
tokens = tokenize.word_tokenize(phrase)

In [7]:
pos_tag(tokens)

[('I', 'PRP'), ("'ll", 'MD'), ('be', 'VB'), ('back', 'RB'), ('!', '.')]

In [8]:
print(*pos_tag(tokens), sep="\n")

('I', 'PRP')
("'ll", 'MD')
('be', 'VB')
('back', 'RB')
('!', '.')


What is the difference between NLTK and spaCy?
-----

There's a philosophical difference between spaCy and NLTK.

spaCy is written to help you get things done. It's minimal and opinionated, and it aims to provide exactly one way to do each task: the right way. spaCy has an accurate part-of-speech tagger and dependency parser. If you want something with good defaults, spaCy is the way to go.

In contrast, NLTK was created to support education. Most of what it contains is for demonstration purposes, to help students explore ideas. If you want to train on your own data, however, NLTK is probably the better choice.
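
To make "training on your own data" concrete, here is a minimal sketch of the simplest trainable tagger: assign each word the tag it received most often in the training data. This is plain Python, not NLTK's or spaCy's API; the function names `train_most_frequent_tag` and `tag_tokens`, the `NN` fallback, and the tiny training set (reused from the outputs above) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sentences):
    """Count tags per lowercased word and keep the most frequent tag for each word."""
    counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            counts[word.lower()][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}


def tag_tokens(tokens, model, default="NN"):
    """Tag each token with its learned tag; unseen words get `default`."""
    return [(token, model.get(token.lower(), default)) for token in tokens]


# Toy training data (hypothetical; far too small for real use).
training_data = [
    [("See", "VB"), ("Dick", "NNP"), ("run", "VB"), (".", ".")],
    [("I", "PRP"), ("'ll", "MD"), ("be", "VB"), ("back", "RB"), ("!", ".")],
]

model = train_most_frequent_tag(training_data)
predicted = tag_tokens(["See", "Dick", "be", "back", "soon"], model)
print(*predicted, sep="\n")
```

A baseline like this ignores context entirely, which is exactly why words that can be several parts of speech make tagging hard.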

----

In [18]:
from textblob import TextBlob

In [13]:
print(*TextBlob("I'll be back.").tags, sep="\n")

('I', 'PRP')
("'ll", 'MD')
('be', 'VB')
('back', 'RB')


In [14]:
TextBlob('He is a tall skinny guy with a long, sad, mean-looking kisser, and a mournful voice.').tags

[('He', 'PRP'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('tall', 'JJ'),
 ('skinny', 'NN'),
 ('guy', 'NN'),
 ('with', 'IN'),
 ('a', 'DT'),
 ('long', 'JJ'),
 ('sad', 'JJ'),
 ('mean-looking', 'JJ'),
 ('kisser', 'NN'),
 ('and', 'CC'),
 ('a', 'DT'),
 ('mournful', 'JJ'),
 ('voice', 'NN')]

In [15]:
TextBlob("If only Bradley's arm was longer. Best photo ever. #oscars").tags

[('If', 'IN'),
 ('only', 'RB'),
 ('Bradley', 'NNP'),
 ("'s", 'POS'),
 ('arm', 'NN'),
 ('was', 'VBD'),
 ('longer', 'RBR'),
 ('Best', 'JJS'),
 ('photo', 'NN'),
 ('ever', 'RB'),
 ('oscars', 'NNS')]

What happened to the hashtag? Compare with NLTK's output for the same text:

In [16]:
pos_tag(tokenize.word_tokenize("If only Bradley's arm was longer. Best photo ever. #oscars"))

[('If', 'IN'),
 ('only', 'RB'),
 ('Bradley', 'NNP'),
 ("'s", 'POS'),
 ('arm', 'NN'),
 ('was', 'VBD'),
 ('longer', 'RBR'),
 ('.', '.'),
 ('Best', 'JJS'),
 ('photo', 'NN'),
 ('ever', 'RB'),
 ('.', '.'),
 ('#', '#'),
 ('oscars', 'NNS')]
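
Judging from the two outputs, one plausible explanation is that TextBlob's `.tags` leaves out tokens whose tag is a punctuation mark, while NLTK keeps them. The sketch below reproduces that effect on the NLTK output using only the standard library. The names `PUNCT_TAG` and `drop_punctuation` are hypothetical, and the filter is an assumption about the behavior, not TextBlob's actual code.

```python
import re
import string

# Matches any tag that starts with a punctuation character, such as '.' or '#'.
PUNCT_TAG = re.compile("[{}]".format(re.escape(string.punctuation)))

# NLTK output copied from the cell above.
nltk_tagged = [
    ("If", "IN"), ("only", "RB"), ("Bradley", "NNP"), ("'s", "POS"),
    ("arm", "NN"), ("was", "VBD"), ("longer", "RBR"), (".", "."),
    ("Best", "JJS"), ("photo", "NN"), ("ever", "RB"), (".", "."),
    ("#", "#"), ("oscars", "NNS"),
]


def drop_punctuation(tagged):
    """Keep only the (word, tag) pairs whose tag does not start with punctuation."""
    return [(word, tag) for word, tag in tagged if not PUNCT_TAG.match(tag)]


filtered = drop_punctuation(nltk_tagged)
print(*filtered, sep="\n")
```

The filtered list has the same 11 pairs as the TextBlob output. If the hashtag matters for your task, keep the punctuation tokens or extract hashtags before tagging.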

---
Deep Dive into Penn Treebank POS tags
----

Penn tags are fairly popular, but the abbreviations are cryptic and hard for humans to read.

[List of Penn Treebank POS tags](https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html)

In [9]:
import nltk

In [10]:
tags = nltk.data.load('help/tagsets/upenn_tagset.pickle')

In [11]:
tags

{'$': ('dollar', '$ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$ '),
 "''": ('closing quotation mark', "' '' "),
 '(': ('opening parenthesis', '( [ { '),
 ')': ('closing parenthesis', ') ] } '),
 ',': ('comma', ', '),
 '--': ('dash', '-- '),
 '.': ('sentence terminator', '. ! ? '),
 ':': ('colon or ellipsis', ': ; ... '),
 'CC': ('conjunction, coordinating',
  "& 'n and both but either et for less minus neither nor or plus so therefore times v. versus vs. whether yet "),
 'CD': ('numeral, cardinal',
  "mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025 fifteen 271,124 dozen quintillion DM2,000 ... "),
 'DT': ('determiner',
  'all an another any both del each either every half la many much nary neither no some such that the them these this those '),
 'EX': ('existential there', 'there '),
 'FW': ('foreign word',
  "gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous lutihaw alai je jour objets saluta

Wow! That is a lot of tags.

In [13]:
pen_label = 'NN' 

In [14]:
nltk.help.upenn_tagset(pen_label)

NN: noun, common, singular or mass
    common-carrier cabbage knuckle-duster Casino afghan shed thermostat
    investment slide humour falloff slick wind hyena override subhumanity
    machinist ...


In [15]:
tags[pen_label][0]

'noun, common, singular or mass'

Let's grab just the most important part: the text before the first comma.

In [22]:
tags[pen_label][0].split(',')[0]

'noun'

In [16]:
# Keep only the text before the first comma of each description
tags_simple = {tag: description[0].split(',')[0]
               for tag, description in tags.items()}

In [25]:
tags_simple

{'$': 'dollar',
 "''": 'closing quotation mark',
 '(': 'opening parenthesis',
 ')': 'closing parenthesis',
 ',': 'comma',
 '--': 'dash',
 '.': 'sentence terminator',
 ':': 'colon or ellipsis',
 'CC': 'conjunction',
 'CD': 'numeral',
 'DT': 'determiner',
 'EX': 'existential there',
 'FW': 'foreign word',
 'IN': 'preposition or conjunction',
 'JJ': 'adjective or numeral',
 'JJR': 'adjective',
 'JJS': 'adjective',
 'LS': 'list item marker',
 'MD': 'modal auxiliary',
 'NN': 'noun',
 'NNP': 'noun',
 'NNPS': 'noun',
 'NNS': 'noun',
 'PDT': 'pre-determiner',
 'POS': 'genitive marker',
 'PRP': 'pronoun',
 'PRP$': 'pronoun',
 'RB': 'adverb',
 'RBR': 'adverb',
 'RBS': 'adverb',
 'RP': 'particle',
 'SYM': 'symbol',
 'TO': '"to" as preposition or infinitive marker',
 'UH': 'interjection',
 'VB': 'verb',
 'VBD': 'verb',
 'VBG': 'verb',
 'VBN': 'verb',
 'VBP': 'verb',
 'VBZ': 'verb',
 'WDT': 'WH-determiner',
 'WP': 'WH-pronoun',
 'WP$': 'WH-pronoun',
 'WRB': 'Wh-adverb',
 '``': 'opening quotation ma
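
Truncating at the first comma merges several distinct tags into one label. The sketch below inverts a small subset of the dictionary to show which tags end up sharing a description. The names `tags_simple_subset` and `group_by_description` are hypothetical; the entries are copied from the output above.

```python
from collections import defaultdict

# A few entries copied from the simplified dictionary above.
tags_simple_subset = {
    "NN": "noun", "NNP": "noun", "NNPS": "noun", "NNS": "noun",
    "JJ": "adjective or numeral", "JJR": "adjective", "JJS": "adjective",
    "RB": "adverb", "RBR": "adverb", "RBS": "adverb",
}


def group_by_description(simple):
    """Map each simplified description to the sorted list of tags that share it."""
    groups = defaultdict(list)
    for tag, description in simple.items():
        groups[description].append(tag)
    return {description: sorted(tag_list) for description, tag_list in groups.items()}


groups = group_by_description(tags_simple_subset)
for description, tag_list in groups.items():
    print(description, tag_list, sep="\t| ")
```

Note that `JJ` keeps the label "adjective or numeral" while `JJR` and `JJS` become plain "adjective", so the simple split is not perfectly consistent.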

Okay, let's replace the cryptic Penn tags with the longer, readable descriptions.

In [19]:
tagged = TextBlob("I'll be back.").tags

In [20]:
tagged[0][1]

'PRP'

In [21]:
[(word, tags_simple[tag]) for word, tag in tagged]

[('I', 'pronoun'),
 ("'ll", 'modal auxiliary'),
 ('be', 'verb'),
 ('back', 'adverb')]
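
A direct dictionary lookup raises `KeyError` if a tagger ever emits a tag that is missing from the dictionary. A slightly more defensive sketch falls back to the raw tag instead. The names `describe` and `descriptions_subset` are hypothetical, and the tag `XYZ` is made up to trigger the fallback.

```python
# A few entries copied from the simplified dictionary above.
descriptions_subset = {
    "PRP": "pronoun",
    "MD": "modal auxiliary",
    "VB": "verb",
    "RB": "adverb",
}


def describe(tagged, descriptions):
    """Swap each tag for its description; keep the raw tag when none is known."""
    return [(word, descriptions.get(tag, tag)) for word, tag in tagged]


# 'XYZ' is a made-up tag used only to show the fallback path.
tagged_example = [("I", "PRP"), ("'ll", "MD"), ("be", "VB"),
                  ("back", "RB"), ("soon", "XYZ")]

described = describe(tagged_example, descriptions_subset)
print(*described, sep="\n")
```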

***