# POS Tagging

![](https://miro.medium.com/max/1170/1*CbZE2ZTBlmswW84Knjbqkg.png)

The goal of part-of-speech (POS) tagging is to identify the grammatical category of each word in a sentence (noun, pronoun, adjective, verb, adverb, and so on) based on its context. A POS tagger looks at the relationships within the sentence and assigns the corresponding tag to each word.

For POS tagging, I used the tools provided by the NLTK library in Python.

#### **Dataset:** Reuters news corpus provided by NLTK (the tagger itself is trained on the Brown corpus)



**Necessary Dependencies**

In [2]:
import nltk
from nltk.corpus import reuters,brown
nltk.download('reuters')
nltk.download('brown')
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('universal_tagset')
from nltk import word_tokenize
from nltk.tag import untag
from nltk import UnigramTagger
import pandas as pd

pd.set_option('display.max_columns', None)  
pd.set_option('display.expand_frame_repr', False)
pd.set_option('display.max_colwidth', None)  # None means no width limit; -1 is deprecated in recent pandas


[nltk_data] Downloading package reuters to
[nltk_data]     C:\Users\nguye\AppData\Roaming\nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package brown to
[nltk_data]     C:\Users\nguye\AppData\Roaming\nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\nguye\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\nguye\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!
[nltk_data] Downloading package universal_tagset to
[nltk_data]     C:\Users\nguye\AppData\Roaming\nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


## 1. Load the Data

In [3]:
articles = reuters.sents()

In [4]:
# view the corpus
for sentence in articles[:5]:
    print(sentence)

['ASIAN', 'EXPORTERS', 'FEAR', 'DAMAGE', 'FROM', 'U', '.', 'S', '.-', 'JAPAN', 'RIFT', 'Mounting', 'trade', 'friction', 'between', 'the', 'U', '.', 'S', '.', 'And', 'Japan', 'has', 'raised', 'fears', 'among', 'many', 'of', 'Asia', "'", 's', 'exporting', 'nations', 'that', 'the', 'row', 'could', 'inflict', 'far', '-', 'reaching', 'economic', 'damage', ',', 'businessmen', 'and', 'officials', 'said', '.']
['They', 'told', 'Reuter', 'correspondents', 'in', 'Asian', 'capitals', 'a', 'U', '.', 'S', '.', 'Move', 'against', 'Japan', 'might', 'boost', 'protectionist', 'sentiment', 'in', 'the', 'U', '.', 'S', '.', 'And', 'lead', 'to', 'curbs', 'on', 'American', 'imports', 'of', 'their', 'products', '.']
['But', 'some', 'exporters', 'said', 'that', 'while', 'the', 'conflict', 'would', 'hurt', 'them', 'in', 'the', 'long', '-', 'run', ',', 'in', 'the', 'short', '-', 'term', 'Tokyo', "'", 's', 'loss', 'might', 'be', 'their', 'gain', '.']
['The', 'U', '.', 'S', '.', 'Has', 'said', 'it', 'will', 'impo

Size of the corpus...

In [5]:
len(articles)

54716

We could therefore find POS tags for 54716 sentences.

Let's view the first sentence of the corpus, joined back into a single string...

In [7]:
sentence1 = " ".join(articles[0])
sentence1

"ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said ."

## 2. Training the Unigram POS tagger

As the name implies, a unigram tagger uses only a single word as its context for determining the part-of-speech (POS) tag: the word being tagged. For each word, it assigns the tag that was most frequent for that word in the training data, and a backoff tagger handles words it has never seen.
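To make the idea concrete, here is a minimal pure-Python sketch of a unigram tagger. It is not NLTK's implementation; the helper names `train_unigram` and `tag_unigram` and the toy training data are hypothetical and exist only for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Map each word to the tag it carries most often in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0]
            for word, tag_counts in counts.items()}

def tag_unigram(words, model, default_tag="NOUN"):
    """Tag each word on its own; unseen words fall back to default_tag."""
    return [(word, model.get(word, default_tag)) for word in words]

# Tiny hand-made training set (hypothetical data, universal-style tags)
toy_train = [
    [("the", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ended", "VERB")],
    [("dogs", "NOUN"), ("run", "VERB")],
    [("they", "PRON"), ("run", "VERB")],
]

toy_model = train_unigram(toy_train)
toy_tagged = tag_unigram(["the", "cat", "run"], toy_model)
print(toy_tagged)  # [('the', 'DET'), ('cat', 'NOUN'), ('run', 'VERB')]
```

Here "run" is tagged VERB because it appears twice as a verb and once as a noun, regardless of the sentence it occurs in, and the unseen word "cat" receives the default tag. That is the main limitation of a unigram tagger: it cannot use neighboring words to resolve ambiguity.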

In [38]:
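# Brown news sentences tagged with the universal tagset
brown_news_tagged = brown.tagged_sents(categories='news', tagset='universal')
brown_train = brown_news_tagged[100:]  # everything after the first 100 sentences
brown_test = brown_news_tagged[:100]   # first 100 sentences held out for testing

# Train the unigram model; words unseen in training fall back to the default tag.
# Note: 'NN' is not a universal-tagset tag ('NOUN' is the universal equivalent),
# so words tagged by the backoff never match the gold tags during evaluation.
unigram_tagger = UnigramTagger(brown_train, backoff=nltk.DefaultTagger('NN'))

# Test it on a single sentence
unigram_tagger.tag(untag(brown_test[0]))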

[('The', 'DET'),
 ('Fulton', 'NN'),
 ('County', 'NOUN'),
 ('Grand', 'ADJ'),
 ('Jury', 'NOUN'),
 ('said', 'VERB'),
 ('Friday', 'NOUN'),
 ('an', 'DET'),
 ('investigation', 'NOUN'),
 ('of', 'ADP'),
 ("Atlanta's", 'NOUN'),
 ('recent', 'ADJ'),
 ('primary', 'NOUN'),
 ('election', 'NOUN'),
 ('produced', 'VERB'),
 ('``', '.'),
 ('no', 'DET'),
 ('evidence', 'NOUN'),
 ("''", '.'),
 ('that', 'ADP'),
 ('any', 'DET'),
 ('irregularities', 'NN'),
 ('took', 'VERB'),
 ('place', 'NOUN'),
 ('.', '.')]
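In the output above, "Fulton" and "irregularities" are tagged `NN`, which comes from the `DefaultTagger('NN')` backoff and is not one of the twelve universal tags. A quick sanity check like the following sketch can flag such tags; the helper name `tags_outside_tagset` is hypothetical, and the sample is a shortened copy of the output above.

```python
# The twelve tags of the universal tagset
UNIVERSAL_TAGS = {"ADJ", "ADP", "ADV", "CONJ", "DET", "NOUN",
                  "NUM", "PRT", "PRON", "VERB", ".", "X"}

def tags_outside_tagset(tagged_words, tagset=UNIVERSAL_TAGS):
    """Return the sorted set of tags that do not belong to the tagset."""
    return sorted({tag for _, tag in tagged_words if tag not in tagset})

# Shortened copy of the tagger output shown above
sample = [("The", "DET"), ("Fulton", "NN"), ("County", "NOUN"),
          ("irregularities", "NN"), (".", ".")]

unexpected = tags_outside_tagset(sample)
print(unexpected)  # ['NN']
```

Using `'NOUN'` as the default tag would keep the tagger's output inside the universal tagset; the results in this notebook were produced with `'NN'`.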

**Evaluation on Test Set**

In [39]:
# Newer NLTK releases deprecate evaluate() in favor of accuracy()
unigram_tagger.evaluate(brown_test)

0.8888888888888888

The tagger reaches about 88.9% token-level accuracy on the held-out Brown test sentences.
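The score is the fraction of tokens whose predicted tag equals the gold tag. The sketch below shows that computation in plain Python, under the assumption that gold and predicted sentences are aligned token by token; `token_accuracy` and the toy sentences are hypothetical and are not taken from the Brown test set.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    correct = 0
    total = 0
    for gold, pred in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, pred_tag) in zip(gold, pred):
            correct += gold_tag == pred_tag
            total += 1
    return correct / total if total else 0.0

# Toy gold standard and predictions (hypothetical data)
gold = [
    [("the", "DET"), ("jury", "NOUN"), ("said", "VERB")],
    [("irregularities", "NOUN"), ("took", "VERB"), ("place", "NOUN")],
]
pred = [
    [("the", "DET"), ("jury", "NOUN"), ("said", "VERB")],
    [("irregularities", "NN"), ("took", "VERB"), ("place", "NOUN")],
]

acc = token_accuracy(gold, pred)
print(round(acc, 3))  # 0.833 (5 of 6 tokens match)
```

The one miss in this toy example is a word tagged `NN` by a backoff, which counts as an error because the gold tag is `NOUN`.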

## 3. Structure the Tags into a Table

**Let's now create a general function that returns the POS tags for any article as a structured table.**

In [40]:
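from collections import defaultdict

def get_POS(article):
    """Tag a tokenized article and return one table row per tag."""
    # Group the words by the tag the unigram tagger assigns to them
    words_by_tag = defaultdict(list)
    for word, tag in unigram_tagger.tag(article):
        words_by_tag[tag].append(word)

    # One row per tag: the tag, its words (space-separated), and the word count
    rows = {'Tags': [], 'Words': [], 'Count': []}
    for tag, words in words_by_tag.items():
        rows['Tags'].append(tag)
        rows['Words'].append(" ".join(words))
        rows['Count'].append(len(words))

    return pd.DataFrame(rows)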
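The grouping step can also be tried without a trained tagger or pandas. The following sketch takes already-tagged pairs and builds the same Tags/Words/Count rows as plain dictionaries, sorted by count; `group_words_by_tag` and the sample pairs are hypothetical and only illustrate the table logic.

```python
from collections import defaultdict

def group_words_by_tag(word_tags):
    """Group (word, tag) pairs into one row per tag, largest groups first."""
    grouped = defaultdict(list)
    for word, tag in word_tags:
        grouped[tag].append(word)
    rows = [
        {"Tags": tag, "Words": " ".join(words), "Count": len(words)}
        for tag, words in grouped.items()
    ]
    # sorted() is stable, so tags with equal counts keep first-seen order
    return sorted(rows, key=lambda row: row["Count"], reverse=True)

# Hand-tagged sample (hypothetical data)
sample_tags = [("the", "DET"), ("row", "NOUN"), ("could", "VERB"),
               ("inflict", "VERB"), ("damage", "NOUN"), (".", ".")]

rows = group_words_by_tag(sample_tags)
for row in rows:
    print(row)
```

Sorting by count is a design choice that is not part of `get_POS`, which keeps the tags in the order they first appear.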

**Testing on an External News Article**

In [41]:
news = '''Memorial Day started off as a somber day of remembrance; a day when Americans went to cemeteries and placed flags or flowers on the graves of our war dead. It was a day to remember ancestors, family members, and loved ones who gave the ultimate sacrifice. But now, too many people “celebrate” the day without more than a casual thought to the purpose and meaning of the day. How do we honor the 1.8 million that gave their life for America since 1775? How do we thank them for their sacrifice? We believe Memorial Day is one day to remember.'''

In [45]:
df = get_POS(word_tokenize(news))
df

|   | Tags | Words | Count |
|---|------|-------|-------|
| 0 | ADJ | Memorial dead ultimate many more Memorial | 6 |
| 1 | NOUN | Day day day Americans flowers war day family members ones sacrifice people day purpose meaning day honor life America sacrifice Day day | 22 |
| 2 | VERB | started went placed was remember gave celebrate thought do gave do thank believe is remember | 15 |
| 3 | PRT | off to to to to | 5 |
| 4 | ADP | as of on of without than of that for since for | 11 |
| 5 | DET | a a the our a the the a the the the their their | 13 |
| 6 | NN | somber remembrance cemeteries flags graves ancestors loved “ ” casual 1.8 1775 | 12 |
| 7 | . | ; . , , . , . ? ? . | 10 |
| 8 | ADV | when now too How How | 5 |
| 9 | CONJ | and or and But and | 5 |


Now for the first sentence of the Reuters corpus:

In [28]:
print(" ".join(articles[0]))

ASIAN EXPORTERS FEAR DAMAGE FROM U . S .- JAPAN RIFT Mounting trade friction between the U . S . And Japan has raised fears among many of Asia ' s exporting nations that the row could inflict far - reaching economic damage , businessmen and officials said .


The POS tags for this sentence are as follows:

In [46]:
df = get_POS(articles[0])
df

|   | Tags | Words | Count |
|---|------|-------|-------|
| 0 | NN | ASIAN EXPORTERS FEAR DAMAGE FROM U .- JAPAN RIFT Mounting U fears s inflict - | 15 |
| 1 | . | . . . ' , . | 6 |
| 2 | NOUN | S trade friction S Japan Asia nations row damage businessmen officials | 11 |
| 3 | ADP | between among of that | 4 |
| 4 | DET | the the | 2 |
| 5 | CONJ | And and | 2 |
| 6 | VERB | has raised exporting could reaching said | 6 |
| 7 | ADJ | many economic | 2 |
| 8 | ADV | far | 1 |
