### Part-of-speech (POS) tagging is one of the most important text analysis tasks. It classifies each word by its part of speech and labels it with a tag from a tagset, the collection of tags used for POS tagging. Parts of speech are also known as word classes or lexical categories.

In [1]:
import nltk

In [2]:
text = nltk.word_tokenize("In corpus linguistics, part-of-speech tagging (POS tagging or POST), also called grammatical tagging or word-category disambiguation, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition, as well as its context—i.e. relationship with adjacent and related words in a phrase, sentence, or paragraph. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.")

In [4]:
text

['In',
 'corpus',
 'linguistics',
 ',',
 'part-of-speech',
 'tagging',
 '(',
 'POS',
 'tagging',
 'or',
 'POST',
 ')',
 ',',
 'also',
 'called',
 'grammatical',
 'tagging',
 'or',
 'word-category',
 'disambiguation',
 ',',
 'is',
 'the',
 'process',
 'of',
 'marking',
 'up',
 'a',
 'word',
 'in',
 'a',
 'text',
 '(',
 'corpus',
 ')',
 'as',
 'corresponding',
 'to',
 'a',
 'particular',
 'part',
 'of',
 'speech',
 ',',
 'based',
 'on',
 'both',
 'its',
 'definition',
 ',',
 'as',
 'well',
 'as',
 'its',
 'context—i.e',
 '.',
 'relationship',
 'with',
 'adjacent',
 'and',
 'related',
 'words',
 'in',
 'a',
 'phrase',
 ',',
 'sentence',
 ',',
 'or',
 'paragraph',
 '.',
 'A',
 'simplified',
 'form',
 'of',
 'this',
 'is',
 'commonly',
 'taught',
 'to',
 'school-age',
 'children',
 ',',
 'in',
 'the',
 'identification',
 'of',
 'words',
 'as',
 'nouns',
 ',',
 'verbs',
 ',',
 'adjectives',
 ',',
 'adverbs',
 ',',
 'etc',
 '.']

In [5]:
nltk.pos_tag(text)

[('In', 'IN'),
 ('corpus', 'NN'),
 ('linguistics', 'NNS'),
 (',', ','),
 ('part-of-speech', 'JJ'),
 ('tagging', 'NN'),
 ('(', '('),
 ('POS', 'NNP'),
 ('tagging', 'VBG'),
 ('or', 'CC'),
 ('POST', 'NNP'),
 (')', ')'),
 (',', ','),
 ('also', 'RB'),
 ('called', 'VBD'),
 ('grammatical', 'JJ'),
 ('tagging', 'NN'),
 ('or', 'CC'),
 ('word-category', 'JJ'),
 ('disambiguation', 'NN'),
 (',', ','),
 ('is', 'VBZ'),
 ('the', 'DT'),
 ('process', 'NN'),
 ('of', 'IN'),
 ('marking', 'VBG'),
 ('up', 'RP'),
 ('a', 'DT'),
 ('word', 'NN'),
 ('in', 'IN'),
 ('a', 'DT'),
 ('text', 'NN'),
 ('(', '('),
 ('corpus', 'NN'),
 (')', ')'),
 ('as', 'IN'),
 ('corresponding', 'VBG'),
 ('to', 'TO'),
 ('a', 'DT'),
 ('particular', 'JJ'),
 ('part', 'NN'),
 ('of', 'IN'),
 ('speech', 'NN'),
 (',', ','),
 ('based', 'VBN'),
 ('on', 'IN'),
 ('both', 'DT'),
 ('its', 'PRP$'),
 ('definition', 'NN'),
 (',', ','),
 ('as', 'RB'),
 ('well', 'RB'),
 ('as', 'IN'),
 ('its', 'PRP$'),
 ('context—i.e', 'NN'),
 ('.', '.'),
 ('relationship', '
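
The output above tags 'tagging' as NN in one position and VBG in another, which illustrates that the tag depends on context and not only on the word itself. A minimal sketch for spotting such words follows. The list `tagged` is copied by hand from the first rows of the output above, and `tags_by_word` is a hypothetical helper, not part of NLTK.

```python
from collections import defaultdict

# A few (word, tag) pairs copied from the nltk.pos_tag output above.
tagged = [
    ('In', 'IN'), ('corpus', 'NN'), ('linguistics', 'NNS'), (',', ','),
    ('part-of-speech', 'JJ'), ('tagging', 'NN'), ('(', '('), ('POS', 'NNP'),
    ('tagging', 'VBG'), ('or', 'CC'), ('POST', 'NNP'), (')', ')'),
]

def tags_by_word(pairs):
    """Map each lower-cased word to the set of tags it received (hypothetical helper)."""
    seen = defaultdict(set)
    for word, tag in pairs:
        seen[word.lower()].add(tag)
    return seen

# Keep only the words that were tagged in more than one way.
ambiguous = {w: sorted(t) for w, t in tags_by_word(tagged).items() if len(t) > 1}
print(ambiguous)  # {'tagging': ['NN', 'VBG']}
```

Running the same helper on the full result of nltk.pos_tag(text) should list every word in the passage that received more than one tag.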

In [7]:
## NLTK provides documentation for each tag, which can be queried by tag name, e.g.:

nltk.help.upenn_tagset('RB')

RB: adverb
    occasionally unabatingly maddeningly adventurously professedly
    stirringly prominently technologically magisterially predominately
    swiftly fiscally pitilessly ...


In [11]:
nltk.help.upenn_tagset('JJ')

JJ: adjective or numeral, ordinal
    third ill-mannered pre-war regrettable oiled calamitous first separable
    ectoplasmic battery-powered participatory fourth still-to-be-named
    multilingual multi-disciplinary ...
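
When reading a long tagged output, it can help to attach a short description to each tag. The sketch below hard-codes a small glossary for some Penn Treebank tags that appear in the output above. `TAG_GLOSSARY` and `describe` are illustrative names, not NLTK objects, and the glossary is deliberately incomplete.

```python
# Hand-written glossary for a few Penn Treebank tags (not an NLTK object).
TAG_GLOSSARY = {
    'CC': 'coordinating conjunction',
    'DT': 'determiner',
    'IN': 'preposition or subordinating conjunction',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'NNS': 'noun, plural',
    'NNP': 'proper noun, singular',
    'PRP$': 'possessive pronoun',
    'RB': 'adverb',
    'RP': 'particle',
    'TO': 'to',
    'VBD': 'verb, past tense',
    'VBG': 'verb, gerund or present participle',
    'VBN': 'verb, past participle',
    'VBZ': 'verb, 3rd person singular present',
}

def describe(pairs):
    """Return (word, tag, description) triples; unknown tags get '?'."""
    return [(word, tag, TAG_GLOSSARY.get(tag, '?')) for word, tag in pairs]

for word, tag, meaning in describe([('marking', 'VBG'), ('up', 'RP'), ('a', 'DT')]):
    print(f'{word:10} {tag:5} {meaning}')
```

For the authoritative definitions and examples, nltk.help.upenn_tagset remains the reference.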


In [15]:
help(nltk)

Help on package nltk:

NAME
    nltk

DESCRIPTION
    The Natural Language Toolkit (NLTK) is an open source Python library
    for Natural Language Processing.  A free online book is available.
    (If you use the library for academic research, please cite the book.)
    
    Steven Bird, Ewan Klein, and Edward Loper (2009).
    Natural Language Processing with Python.  O'Reilly Media Inc.
    http://nltk.org/book
    
    @version: 3.2.3

PACKAGE CONTENTS
    app (package)
    book
    ccg (package)
    chat (package)
    chunk (package)
    classify (package)
    cluster (package)
    collections
    collocations
    compat
    corpus (package)
    data
    decorators
    downloader
    draw (package)
    featstruct
    grammar
    help
    inference (package)
    internals
    jsontags
    lazyimport
    metrics (package)
    misc (package)
    parse (package)
    probability
    sem (package)
    sentiment (package)
    stem (package)
    tag (package)
    tbl (package)
    test (p

### In older NLTK releases the default POS tagger model was maxent_treebank_pos_tagger; since NLTK 3.1, nltk.pos_tag uses the averaged perceptron tagger by default.
### The code can be found in nltk-master/nltk/tag/__init__.py

# Training a POS Tagging Model or POS Tagger in NLTK

In [16]:
from nltk.corpus import treebank

In [17]:
len(treebank.tagged_sents())

3914

In [18]:
train_data = treebank.tagged_sents()[:3000]
test_data = treebank.tagged_sents()[3000:]

train_data[0]  ## First sentence of the training set

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

In [19]:
test_data[0]

[('At', 'IN'),
 ('Tokyo', 'NNP'),
 (',', ','),
 ('the', 'DT'),
 ('Nikkei', 'NNP'),
 ('index', 'NN'),
 ('of', 'IN'),
 ('225', 'CD'),
 ('selected', 'VBN'),
 ('issues', 'NNS'),
 (',', ','),
 ('which', 'WDT'),
 ('*T*-1', '-NONE-'),
 ('gained', 'VBD'),
 ('132', 'CD'),
 ('points', 'NNS'),
 ('Tuesday', 'NNP'),
 (',', ','),
 ('added', 'VBD'),
 ('14.99', 'CD'),
 ('points', 'NNS'),
 ('to', 'TO'),
 ('35564.43', 'CD'),
 ('.', '.')]

### We use the first 3000 tagged sentences of the treebank corpus as train_data and the remaining 914 as test_data. Next, we train a TnT POS tagger on train_data and evaluate it on test_data.

In [20]:
from nltk.tag import tnt
tnt_pos_tagger = tnt.TnT()

In [21]:
tnt_pos_tagger.train(train_data)

In [22]:
tnt_pos_tagger.evaluate(test_data)  ### This one takes a while

0.8756313403842003
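
The score above is best read as token-level accuracy: the fraction of test tokens whose predicted tag matches the gold tag, so roughly 87.6% here. The sketch below shows that computation in plain Python under that assumption. `token_accuracy` and `lookup_tagger` are hypothetical names, and the lookup table is a toy stand-in for a trained tagger; the gold sentence reuses tokens from test_data[0].

```python
def token_accuracy(gold_sents, tag_words):
    """Fraction of tokens whose predicted tag equals the gold tag.

    gold_sents: list of sentences, each a list of (word, tag) pairs.
    tag_words:  callable taking a list of words and returning (word, tag) pairs.
    """
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_words(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0

# Toy stand-in for a trained tagger: a fixed word-to-tag lookup.
lookup = {'the': 'DT', 'index': 'NN', 'of': 'IN', '.': '.'}

def lookup_tagger(words):
    return [(w, lookup.get(w, 'Unk')) for w in words]

gold = [[('the', 'DT'), ('Nikkei', 'NNP'), ('index', 'NN'), ('of', 'IN'),
         ('225', 'CD'), ('selected', 'VBN'), ('issues', 'NNS'), ('.', '.')]]

print(token_accuracy(gold, lookup_tagger))  # 0.5
```

With the notebook's objects, passing tnt_pos_tagger.tag as tag_words and test_data as gold_sents should give a comparable figure, although this has not been checked against NLTK's own implementation.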

### The trained POS tagger can also be saved as a pickle file:

In [23]:
import pickle

In [26]:
f = open('tnt_treebank_pos_tagger.pickle', 'wb')

In [27]:
pickle.dump(tnt_pos_tagger, f)

In [28]:
f.close()

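The tagger can then be used on new text. The call below uses the in-memory tnt_pos_tagger; a tagger restored from the pickle file with pickle.load is used the same way. Words that were not seen during training are tagged 'Unk':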

In [33]:
tnt_pos_tagger.tag(nltk.word_tokenize('this is a tnt treebank tnt tagger'))

[('this', 'DT'),
 ('is', 'VBZ'),
 ('a', 'DT'),
 ('tnt', 'Unk'),
 ('treebank', 'Unk'),
 ('tnt', 'Unk'),
 ('tagger', 'Unk')]
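
The save and reload steps can be wrapped in two small helpers that use `with` blocks, so the file is closed even if pickling fails. This is a minimal sketch: `save_model` and `load_model` are hypothetical helper names, and a plain dict stands in for the trained tagger so the example runs without NLTK.

```python
import os
import pickle
import tempfile

def save_model(model, path):
    """Serialize any picklable object to path."""
    with open(path, 'wb') as f:
        pickle.dump(model, f)

def load_model(path):
    """Load a pickled object; only unpickle files from a trusted source."""
    with open(path, 'rb') as f:
        return pickle.load(f)

# A plain dict stands in for the trained tagger in this sketch.
stand_in = {'this': 'DT', 'is': 'VBZ', 'a': 'DT'}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'stand_in_tagger.pickle')
    save_model(stand_in, path)
    restored = load_model(path)

print(restored == stand_in)  # True
```

In the notebook, the equivalent would be load_model('tnt_treebank_pos_tagger.pickle') followed by a call to .tag(...) on the returned object.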