# Parts of Speech Tagger

Using NLTK, we will create a simple function that tags the parts of speech in a sentence. NLTK is the utility library of the natural language world: many natural language libraries use it to tokenize, preprocess and parse their corpora.
It also provides a large set of corpora for different purposes, the best known being the Brown corpus.
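
To make the tokenizing step concrete, here is a minimal sketch that applies the regular expression from this notebook's first cell using only the standard `re` module. It assumes that `RegexpTokenizer` with default settings behaves roughly like `re.findall` on the same pattern; the helper name `simple_tokenize` and the sample sentence are made up for illustration.

```python
import re

# Same pattern the notebook passes to RegexpTokenizer:
# a run of word characters, or a run of non-space punctuation.
TOKEN_PATTERN = re.compile(r'\w+|[^\w\s]+')

def simple_tokenize(text):
    """Return the non-overlapping matches of TOKEN_PATTERN, left to right."""
    return TOKEN_PATTERN.findall(text)

tokens = simple_tokenize("Noel's House Party, a one-dimensional parody.[citation needed]")
print(tokens)
```

Adjacent punctuation is kept together as one token, which is why strings such as `'.['` show up as single tokens in the tagger output.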

In [1]:
import nltk
import pprint

tokenizer = None
tagger = None
def init_nltk():
    global tokenizer
    global tagger
    tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+|[^\w\s]+')
    tagger = nltk.UnigramTagger(nltk.corpus.brown.tagged_sents())

def tag(text):
    tokenized = tokenizer.tokenize(text)
    tagged = tagger.tag(tokenized)
    # Sort by tag; tokens with no tag (None) come first.
    tagged.sort(key=lambda pair: pair[1] or '')
    return tagged

text = """Mr. Blobby is a fictional character who featured on Noel
    Edmonds' Saturday night entertainment show Noel's House Party,
    which was often a ratings winner in the 1990s. Mr Blobby also
    appeared on the Jamie Rose show of 1997. He was designed as an
    outrageously over the top parody of a one-dimensional, mute novelty
    character, which ironically made him distinctive, absurd and popular.
    He was a large pink humanoid, covered with yellow spots, sporting a
    permanent toothy grin and jiggling eyes. He communicated by saying
    the word "blobby" in an electronically-altered voice, expressing
    his moods through tone of voice and repetition.

    There was a Mrs. Blobby, seen briefly in the video, and sold as a
    doll.

    However Mr Blobby actually started out as part of the 'Gotcha'
    feature during the show's second series (originally called 'Gotcha
    Oscars' until the threat of legal action from the Academy of Motion
    Picture Arts and Sciences[citation needed]), in which celebrities
    were caught out in a Candid Camera style prank. Celebrities such as
    dancer Wayne Sleep and rugby union player Will Carling would be
    enticed to take part in a fictitious children's programme based around
    their profession. Mr Blobby would clumsily take part in the activity,
    knocking over the set, causing mayhem and saying "blobby blobby
    blobby", until finally when the prank was revealed, the Blobby
    costume would be opened - revealing Noel inside. This was all the more
    surprising for the "victim" as during rehearsals Blobby would be
    played by an actor wearing only the arms and legs of the costume and
    speaking in a normal manner.[citation needed]"""

init_nltk()
tagged = tag(text)    
tag_list = list(set(tagged))
tag_list.sort(key=lambda pair: pair[1] or '')
pprint.pprint(tag_list)

[('blobby', None),
 ('Mr', None),
 ('outrageously', None),
 ('s', None),
 ('Celebrities', None),
 ('1990s', None),
 ('Jamie', None),
 ('Blobby', None),
 ('Candid', None),
 ('1997', None),
 ('",', None),
 ('Edmonds', None),
 ('humanoid', None),
 ('.[', None),
 ('enticed', None),
 ('programme', None),
 ('Oscars', None),
 ('Carling', None),
 ('rugby', None),
 ('toothy', None),
 ('Gotcha', None),
 ('"', None),
 (']),', None),
 ("'", u"'"),
 ('[', u'('),
 ('(', u'('),
 (']', u')'),
 (',', u','),
 ('.', u'.'),
 ('all', u'ABN'),
 ('the', u'AT'),
 ('an', u'AT'),
 ('a', u'AT'),
 ('be', u'BE'),
 ('were', u'BED'),
 ('was', u'BEDZ'),
 ('is', u'BEZ'),
 ('and', u'CC'),
 ('one', u'CD'),
 ('until', u'CS'),
 ('as', u'CS'),
 ('This', u'DT'),
 ('There', u'EX'),
 ('in', u'IN'),
 ('inside', u'IN'),
 ('from', u'IN'),
 ('around', u'IN'),
 ('with', u'IN'),
 ('through', u'IN'),
 ('-', u'IN'),
 ('on', u'IN'),
 ('of', u'IN'),
 ('by', u'IN'),
 ('during', u'IN'),
 ('over', u'IN'),
 ('for', u'IN'),
 ('distinctive',
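
Many tokens above are tagged `None`. A unigram tagger can only tag words it saw during training, and no fallback was configured here. The following is a rough pure-Python sketch of that idea, not NLTK's actual implementation; the function names and the tiny training set are hypothetical.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carried most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, pos in sent:
            counts[word][pos] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag_tokens(model, tokens, default=None):
    """Tag each token by dictionary lookup; unseen tokens get `default`."""
    return [(tok, model.get(tok, default)) for tok in tokens]

# Tiny hand-made training set (hypothetical, Brown-style tags).
training = [
    [('the', 'AT'), ('show', 'NN'), ('was', 'BEDZ'), ('popular', 'JJ')],
    [('they', 'PPSS'), ('show', 'VB'), ('the', 'AT'), ('show', 'NN')],
]
model = train_unigram(training)

no_fallback = tag_tokens(model, ['the', 'show', 'was', 'Blobby'])
with_fallback = tag_tokens(model, ['the', 'show', 'was', 'Blobby'], default='NN')
print(no_fallback)
print(with_fallback)
```

In NLTK itself, a comparable fallback can be set up by passing `backoff=nltk.DefaultTagger('NN')` when constructing the `UnigramTagger`.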

## But what do those tags mean?

In [2]:
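tags = {}
# Loop over (word, pos) pairs; reusing the name `tag` would shadow tag() above.
for word, pos in tag_list:
    if pos not in tags:
        # upenn_tagset() prints its description and returns None.
        # A None tag makes it print the whole Penn Treebank tagset.
        # The tagger was trained on Brown; nltk.help.brown_tagset() covers Brown tags.
        tags[pos] = nltk.help.upenn_tagset(pos)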


$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or
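
The listing above describes Penn Treebank tags, while the tagger in the first cell was trained on the Brown corpus and therefore emits Brown tags such as `AT` and `BEDZ`. As a small sketch, the lookup below glosses the Brown tags that appear in the earlier output. The glosses are paraphrases written for this example, not NLTK's wording, and `describe` is a hypothetical helper; `nltk.help.brown_tagset()` prints the full descriptions.

```python
# Short, paraphrased glosses for Brown tags seen in the tagger output.
BROWN_GLOSSES = {
    'ABN': 'pre-quantifier (e.g. all, half)',
    'AT': 'article',
    'BE': 'the verb "be", base form',
    'BED': 'the verb "be", past tense (were)',
    'BEDZ': 'the verb "be", past tense (was)',
    'BEZ': 'the verb "be", present tense (is)',
    'CC': 'coordinating conjunction',
    'CD': 'cardinal numeral',
    'CS': 'subordinating conjunction',
    'DT': 'singular determiner',
    'EX': 'existential there',
    'IN': 'preposition',
}

def describe(pos):
    """Return a gloss for a Brown tag, handling untagged (None) tokens."""
    if pos is None:
        return 'no tag assigned'
    return BROWN_GLOSSES.get(pos, 'unknown tag')

for word, pos in [('the', 'AT'), ('was', 'BEDZ'), ('Blobby', None)]:
    print(word, '->', describe(pos))
```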

## Okay cool, but what if I just wanted separate sentences?

The Punkt tokenizer divides a text into a list of sentences by using an unsupervised algorithm to build a model for abbreviation words, collocations, and words that start sentences. It must be trained on a large collection of plaintext in the target language before it can be used; the cell below loads a model that has already been trained for English.

In [3]:
# Takes a little while to load
sent_detector = nltk.data.load('tokenizers/punkt/english.pickle')

# Let's print our sentences
print('\n-----\n'.join(sent_detector.tokenize(text.strip())))

Mr. Blobby is a fictional character who featured on Noel
    Edmonds' Saturday night entertainment show Noel's House Party,
    which was often a ratings winner in the 1990s.
-----
Mr Blobby also
    appeared on the Jamie Rose show of 1997.
-----
He was designed as an
    outrageously over the top parody of a one-dimensional, mute novelty
    character, which ironically made him distinctive, absurd and popular.
-----
He was a large pink humanoid, covered with yellow spots, sporting a
    permanent toothy grin and jiggling eyes.
-----
He communicated by saying
    the word "blobby" in an electronically-altered voice, expressing
    his moods through tone of voice and repetition.
-----
There was a Mrs. Blobby, seen briefly in the video, and sold as a
    doll.
-----
However Mr Blobby actually started out as part of the 'Gotcha'
    feature during the show's second series (originally called 'Gotcha
    Oscars' until the threat of legal action from the Academy of Motion
    Picture Arts an
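
To see why an abbreviation model matters, compare a naive split on sentence-ending punctuation with one that knows a few abbreviations. This is only an illustrative sketch, not how Punkt works internally: Punkt learns its abbreviations from data, whereas the `ABBREVIATIONS` set and both function names here are hand-written assumptions.

```python
import re

# Hypothetical hand-written list; Punkt would learn this from a corpus.
ABBREVIATIONS = {'mr', 'mrs', 'dr'}

def naive_split(text):
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

def abbreviation_aware_split(text):
    """Like naive_split, but rejoin pieces that end in a known abbreviation."""
    sentences = []
    carry = ''
    for piece in naive_split(text):
        candidate = (carry + ' ' + piece) if carry else piece
        last_word = candidate.split()[-1].rstrip('.').lower()
        if candidate.endswith('.') and last_word in ABBREVIATIONS:
            carry = candidate  # not a real boundary, keep accumulating
        else:
            sentences.append(candidate)
            carry = ''
    if carry:
        sentences.append(carry)
    return sentences

sample = "There was a Mrs. Blobby, seen briefly in the video. Mr. Blobby appeared in 1997."
print(naive_split(sample))
print(abbreviation_aware_split(sample))
```

The naive version breaks after "Mrs." and "Mr.", while the abbreviation-aware version returns the two intended sentences.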

## I spoke about corpora. Where can we get those?

For descriptions, see http://www.nltk.org/api/nltk.tokenize.html. To install them, use the NLTK downloader:

In [4]:
nltk.download()

showing info https://raw.githubusercontent.com/nltk/nltk_data/gh-pages/index.xml


True

In [6]:
words = nltk.corpus.gutenberg.words(fileids='austen-emma.txt')
print(words)
print(len(words))

[u'[', u'Emma', u'by', u'Jane', u'Austen', u'1816', ...]
192427
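
Once a corpus is loaded as a sequence of words, simple statistics are a short step away. The sketch below counts word frequencies with the standard library. The `sample_words` list is a small hand-made stand-in for the `words` object above, and `top_words` is a hypothetical helper.

```python
from collections import Counter

# Hand-made stand-in for the `words` sequence loaded from the corpus.
sample_words = ['[', 'Emma', 'by', 'Jane', 'Austen', '1816', ']',
                'Emma', 'Woodhouse', ',', 'handsome', ',', 'clever',
                ',', 'and', 'rich']

def top_words(words, n=3):
    """Return the n most common alphabetic tokens, lowercased."""
    counts = Counter(w.lower() for w in words if w.isalpha())
    return counts.most_common(n)

most_common = top_words(sample_words, n=1)
print(most_common)
```

The same function should accept the real `words` object, since it only needs to iterate over strings. NLTK also offers `nltk.FreqDist` for this kind of counting.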
