# **What is Part-of-speech (POS) tagging?**
POS tagging is the process of labeling each word in a sentence with its part of speech, such as noun, verb, or adjective. Many natural language processing (NLP) tasks, for example parsing and named-entity recognition, rely on the grammatical context these tags provide.


## How does it work?
1. **Training:** Most POS taggers are trainable: they learn from a dataset in which each word of every sentence is already tagged with its part of speech. This dataset is called the training data.
2. **Tagging:** Once trained, the tagger can tag new sentences. It assigns a part-of-speech tag to each word based on the patterns it learned during training.
3. **Backoff Chain:** Taggers can be combined in a sequence. If the first tagger cannot assign a tag to a word (for example, because the word never appeared in its training data), the next tagger in the sequence tries, and so on. This usually improves overall accuracy, because fewer words are left untagged.
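
The three steps above can be sketched in plain Python. This is only an illustration of the idea, not NLTK's implementation: the function names `train_unigram` and `tag_with_backoff`, the toy training sentences, and the choice of `NN` as the fallback tag are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def train_unigram(tagged_sents):
    """Training: map each word to the tag it carries most often."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tags.most_common(1)[0][0] for word, tags in counts.items()}


def tag_with_backoff(words, model, default_tag="NN"):
    """Tagging: look each word up in the model.

    Backoff: a word the model has never seen gets default_tag instead.
    """
    return [(word, model.get(word, default_tag)) for word in words]


# Hypothetical toy training data (hand-written for this sketch)
toy_train = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

model = train_unigram(toy_train)
tagged = tag_with_backoff(["the", "bird", "sleeps"], model)
print(tagged)
```

Here "bird" is absent from the toy training data, so it is the only word that falls through to the fallback tag.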

## **NLTK's UnigramTagger**

In [3]:
#import necessary modules
import nltk
from nltk.tag import UnigramTagger
from nltk.corpus import treebank
nltk.download('treebank')
#UnigramTagger is the tagger class
#treebank is a well-known corpus that contains tagged sentences.

[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.


True

In [4]:
#Train the tagger
train_sents = treebank.tagged_sents()[:3000]
tagger = UnigramTagger(train_sents)

Here, the first 3000 sentences from the treebank corpus are used to train the UnigramTagger.


In [6]:
#Sentence
treebank.sents()[0]

['Pierre',
 'Vinken',
 ',',
 '61',
 'years',
 'old',
 ',',
 'will',
 'join',
 'the',
 'board',
 'as',
 'a',
 'nonexecutive',
 'director',
 'Nov.',
 '29',
 '.']

In [7]:
#Tag the sentence
tagger.tag(treebank.sents()[0])

[('Pierre', 'NNP'),
 ('Vinken', 'NNP'),
 (',', ','),
 ('61', 'CD'),
 ('years', 'NNS'),
 ('old', 'JJ'),
 (',', ','),
 ('will', 'MD'),
 ('join', 'VB'),
 ('the', 'DT'),
 ('board', 'NN'),
 ('as', 'IN'),
 ('a', 'DT'),
 ('nonexecutive', 'JJ'),
 ('director', 'NN'),
 ('Nov.', 'NNP'),
 ('29', 'CD'),
 ('.', '.')]

After training, the tagger is used to tag the first sentence of the treebank corpus. We first look at that sentence as a plain list of words, and then see how the tag() function transforms it into a list of tuples, each containing a word and its part-of-speech tag.

Note that this sentence is one of the 3000 training sentences, so the tagger has already seen every word in it. A word that never occurred in the training data would receive the tag None.
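
To see how a unigram tagger treats a word it has not seen, and how a backoff tagger fills the gap, here is a minimal sketch. It uses a small hand-written training set (an assumption for this example, chosen so that no corpus download is needed) rather than the treebank sentences, and `DefaultTagger('NN')` as one possible backoff choice.

```python
from nltk.tag import DefaultTagger, UnigramTagger

# Hypothetical toy training data (hand-written for this sketch)
toy_train = [
    [("the", "DT"), ("board", "NN"), ("will", "MD"), ("join", "VB")],
    [("a", "DT"), ("director", "NN"), ("will", "MD"), ("join", "VB")],
]

# Without a backoff tagger, unseen words are tagged None
plain = UnigramTagger(toy_train)

# With a backoff tagger, unseen words are passed on to DefaultTagger
with_backoff = UnigramTagger(toy_train, backoff=DefaultTagger("NN"))

words = ["the", "chairman", "will", "join"]  # "chairman" is not in toy_train
plain_tags = plain.tag(words)
backoff_tags = with_backoff.tag(words)

print(plain_tags)
print(backoff_tags)
```

The same `backoff` argument can be passed when training on the treebank sentences, so that words outside the 3000 training sentences still receive a tag.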

The tags in the output come from the Penn Treebank tag set. Some examples from this tag set include:

- NNP: Proper noun, singular
- JJ: Adjective
- VB: Verb, base form
- DT: Determiner