# Part of Speech (POS) Tagging

To label every word in a sentence with its part of speech (noun, verb, adjective, and so on), we use POS tagging.

In [17]:
import nltk
from nltk.tokenize import sent_tokenize

`sent_tokenize` splits a text into a list of sentences, so that each sentence can be processed on its own.

In [18]:
text = "It is unfortunate that the families of the victims do not have the consolation of anyone being brought to justice. While Sohrabuddin’s killing has ‘encounter’ as an explanation, his wife’s disappearance remains a mystery. It was not proved that she was taken to a farm, killed and her body burnt. And it cannot be a coincidence that Prajapati was killed a year later in Rajasthan in another encounter. It was under a cloud of suspicion over the circumstances of their death that Sohrabuddin’s brother had approached the Supreme Court and obtained an order for an investigation, which was subsequently handed over to the CBI. In losing this case, the CBI has shown that it continues to struggle when it comes to handling cases with political overtones. The 2014 discharge of Mr. Shah and the subsequent pre-trial exoneration of senior police officer D.G. Vanzara had come as a boost to the BJP. The final decision in the trial is also likely to be interpreted as a justification for some encounters that took place in Gujarat when Narendra Modi was Chief Minister. Mr. Vanzara has implied as much in controversial tweets. He has also claimed that such ‘pre-emptive encounters’ were needed to save Mr. Modi. This is a tacit acknowledgement that these may not have been chance encounters, as genuine ones are supposed to be, but part of a plan to eliminate a threat to the leader’s life through extrajudicial killings. It is regrettable that such a triumphalist narrative is sought to be built around such incidents."

In [20]:
tokenizer = sent_tokenize(text)  # despite the name, this is a list of sentence strings

Loop over the sentences in the document, split each one into words with `word_tokenize`, and assign a POS tag to each word with `pos_tag`.

Example: 
    
Sentence 1
    
    Word 1 -> POS tag assigned
    Word 2 -> POS tag assigned
    Word 3 -> POS tag assigned
    Word 4 -> POS tag assigned
    Word 5 -> POS tag assigned
    
Sentence 2

    Word 1 -> POS tag assigned
    Word 2 -> POS tag assigned
    Word 3 -> POS tag assigned
    Word 4 -> POS tag assigned
    Word 5 -> POS tag assigned
    
...

(The loop continues until every word in every sentence of the document has been tagged.)

In [22]:
for sentence in tokenizer:
    words = nltk.word_tokenize(sentence)  # split the sentence into word tokens
    tagged = nltk.pos_tag(words)          # pair each token with its POS tag
    print(tagged)

[('It', 'PRP'), ('is', 'VBZ'), ('unfortunate', 'JJ'), ('that', 'IN'), ('the', 'DT'), ('families', 'NNS'), ('of', 'IN'), ('the', 'DT'), ('victims', 'NNS'), ('do', 'VBP'), ('not', 'RB'), ('have', 'VB'), ('the', 'DT'), ('consolation', 'NN'), ('of', 'IN'), ('anyone', 'NN'), ('being', 'VBG'), ('brought', 'VBN'), ('to', 'TO'), ('justice', 'NN'), ('.', '.')]
[('While', 'IN'), ('Sohrabuddin', 'NNP'), ('’', 'NNP'), ('s', 'NN'), ('killing', 'NN'), ('has', 'VBZ'), ('‘', 'VBN'), ('encounter', 'RB'), ('’', 'NNP'), ('as', 'IN'), ('an', 'DT'), ('explanation', 'NN'), (',', ','), ('his', 'PRP$'), ('wife', 'NN'), ('’', 'NNP'), ('s', 'NN'), ('disappearance', 'NN'), ('remains', 'VBZ'), ('a', 'DT'), ('mystery', 'NN'), ('.', '.')]
[('It', 'PRP'), ('was', 'VBD'), ('not', 'RB'), ('proved', 'VBN'), ('that', 'IN'), ('she', 'PRP'), ('was', 'VBD'), ('taken', 'VBN'), ('to', 'TO'), ('a', 'DT'), ('farm', 'NN'), (',', ','), ('killed', 'VBN'), ('and', 'CC'), ('her', 'PRP$'), ('body', 'NN'), ('burnt', 'NN'), ('.', '.')]
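In the second sentence of the output, the typographic quotes `‘` and `’` are split off as separate tokens and tagged as if they were words (for example `('’', 'NNP')`). One possible workaround, which is not part of the original notebook, is to replace curly quotes with their ASCII equivalents before tokenizing. The helper name `normalize_quotes` below is hypothetical, and this is only a minimal sketch of that assumption:

```python
# Assumed preprocessing step: map typographic quotes to plain ASCII quotes.
QUOTE_MAP = str.maketrans({"‘": "'", "’": "'", "“": '"', "”": '"'})

def normalize_quotes(text):
    """Return `text` with curly quotes replaced by straight ones."""
    return text.translate(QUOTE_MAP)

sample = "While Sohrabuddin’s killing has ‘encounter’ as an explanation"
print(normalize_quotes(sample))
```

Passing `normalize_quotes(text)` to `sent_tokenize` instead of `text` would run the cleaned passage through the same loop. Whether the resulting tags actually improve is best checked on the new output rather than assumed.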

<img src="tagset.png">

The table above lists what each POS tag means. For example, NNP is a singular proper noun.
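To read the tagged output without going back to the table each time, a small lookup can be attached to each `(word, tag)` pair. The sketch below assumes the tags are Penn Treebank tags, which is what `nltk.pos_tag` produces by default. The dictionary is a hand-written subset covering only tags that appear in the output above, and the names `TAG_DESCRIPTIONS` and `describe` are hypothetical:

```python
# Partial, hand-written subset of Penn Treebank tag descriptions.
# Only tags seen in the notebook output are listed; this is not the full tagset.
TAG_DESCRIPTIONS = {
    "NN": "noun, singular or mass",
    "NNS": "noun, plural",
    "NNP": "proper noun, singular",
    "PRP": "personal pronoun",
    "PRP$": "possessive pronoun",
    "VBZ": "verb, 3rd person singular present",
    "VBD": "verb, past tense",
    "VBN": "verb, past participle",
    "JJ": "adjective",
    "DT": "determiner",
    "IN": "preposition or subordinating conjunction",
    "RB": "adverb",
    "CC": "coordinating conjunction",
}

def describe(tagged):
    """Attach a description to each (word, tag) pair from a tagged sentence."""
    return [
        (word, tag, TAG_DESCRIPTIONS.get(tag, "not in this subset"))
        for word, tag in tagged
    ]

# The first four tokens of the third sentence in the output above.
sample = [('It', 'PRP'), ('was', 'VBD'), ('not', 'RB'), ('proved', 'VBN')]
for word, tag, meaning in describe(sample):
    print(f"{word:10} {tag:5} {meaning}")
```

NLTK also ships its own tag documentation: `nltk.help.upenn_tagset('NNP')` prints the description for a tag, provided the `tagsets` data package has been downloaded.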