# Part of Speech Tagging

**POS Tagging** is the process of labeling each word in a sentence with its grammatical role. It tells us which part of speech each token belongs to, i.e. whether it is a noun, verb, adverb, conjunction, pronoun, adjective, preposition, interjection, and so on, along with finer distinctions such as whether a noun is singular or plural.

**POS Tagging** (Part-of-Speech Tagging) is a fundamental <u>*natural language processing technique*</u> that assigns grammatical labels to words in text based on their functions and contexts. This process systematically categorizes each token in a sentence according to its linguistic role.


In computational linguistics, POS tagging serves as a critical preprocessing step for more complex language analysis tasks. Modern POS taggers typically use statistical methods, rule-based systems, or neural networks to determine the appropriate tags with high accuracy.

The popular ***Penn Treebank*** tag set lists the tags generally used to label these tokens. [link](https://gist.githubusercontent.com/yashj302/76b82a90739ebec90120cc1da31c967e/raw/4b608b5d0f41e212267bfa73f97fb7b4275e7e7b/POS%20tag.csv)

<u>Common POS categories:</u>
<ul>
    <li><b>Nouns</b></li>
    <li><b>Verbs</b></li>
    <li><b>Adjectives</b></li>
    <li><b>Adverbs</b></li>
    <li><b>Pronouns</b></li>
    <li><b>Prepositions</b></li>
    <li><b>Conjunctions</b></li>
    <li><b>Interjections</b></li>
</ul>

<u>List of common tags:</u> <br>
* CC - coordinating conjunction
* CD - cardinal number
* DT - determiner
* EX - existential there (like: "there is" ... think of it like "there exists")
* FW - foreign word
* IN - preposition/subordinating conjunction
* JJ - adjective - 'big'
* JJR - adjective, comparative - 'bigger'
* JJS - adjective, superlative - 'biggest'
* LS - list marker - '1)'
* MD - modal - 'could', 'will'
* NN - noun, singular - 'desk'
* NNS - noun, plural - 'desks'
* NNP - proper noun, singular - 'Harrison'
* NNPS - proper noun, plural - 'Americans'
* PDT - predeterminer - 'all' in 'all the kids'
* POS - possessive ending - the 's in "parent's"
* PRP - personal pronoun - 'I', 'he', 'she'
* `PRP$` - possessive pronoun - 'my', 'his', 'her'
* RB - adverb - 'very', 'silently'
* RBR - adverb, comparative - 'better'
* RBS - adverb, superlative - 'best'
* RP - particle - 'up' in 'give up'
* TO - the word 'to' - go 'to' the store
* UH - interjection - 'errrrrrrrm'
* VB - verb, base form - 'take'
* VBD - verb, past tense - 'took'
* VBG - verb, gerund/present participle - 'taking'
* VBN - verb, past participle - 'taken'
* VBP - verb, non-3rd person singular present - 'take'
* VBZ - verb, 3rd person singular present - 'takes'
* WDT - wh-determiner - 'which'
* WP - wh-pronoun - 'who', 'what'
* `WP$` - possessive wh-pronoun - 'whose'
* WRB - wh-adverb - 'where', 'when'
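
The fine-grained tags in this list roll up into the common categories listed earlier. As a minimal sketch, a lookup can collapse a Penn Treebank tag into its broad category. The `coarse_category` helper and its grouping are illustrative assumptions, not part of NLTK or an official mapping.

```python
# Hypothetical helper: collapse a Penn Treebank tag into a broad category.
# The grouping is an illustrative assumption (e.g. IN also covers
# subordinating conjunctions but is filed under "Preposition" here).
EXACT = {
    "PRP": "Pronoun", "PRP$": "Pronoun", "WP": "Pronoun", "WP$": "Pronoun",
    "IN": "Preposition", "CC": "Conjunction", "UH": "Interjection",
    "WRB": "Adverb",
}
PREFIXES = [("NN", "Noun"), ("VB", "Verb"), ("JJ", "Adjective"), ("RB", "Adverb")]

def coarse_category(tag):
    # Exact matches first, then tag families that share a prefix.
    if tag in EXACT:
        return EXACT[tag]
    for prefix, category in PREFIXES:
        if tag.startswith(prefix):
            return category
    return "Other"

for tag in ["NNS", "VBD", "JJR", "PRP$", "DT"]:
    print(tag, "->", coarse_category(tag))
```

Tags without a matching category in the common list, such as `DT` or `CD`, fall into `"Other"` in this sketch.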

In `predictive analytics`, accurate POS tagging enhances the performance of downstream tasks such as sentiment analysis, topic modeling, and text classification by providing structured linguistic features.

<img src = '1_Yj-1jtWm9z5hRJq-OtntMA.webp'><br>
[https://medium.com/@martinthetechie/nlp-guide-part-of-speech-8e890c7a0b51](https://medium.com/@martinthetechie/nlp-guide-part-of-speech-8e890c7a0b51)

## Why Part of Speech Tagging matters

POS tagging matters because it improves downstream NLP tasks such as parsing and named entity recognition (NER). It also supports information extraction by making it possible to pick out words with specific tags in a document.

It also helps disambiguate word meaning based on usage.<br>
Example of disambiguation:

- book (noun): “I read a book.”
- book (verb): “I will book a ticket.”
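
To make the disambiguation concrete, here is a minimal sketch of a context rule for the word "book". The `tag_book` function and its two word lists are hypothetical and only cover this toy case; real taggers use far richer context.

```python
# Toy context rule: decide whether "book" is a noun or a verb
# by looking at the word immediately before it.
DETERMINERS = {"a", "an", "the", "this", "that", "my"}
VERB_CUES = {"will", "would", "can", "could", "to", "must"}  # modals and infinitival "to"

def tag_book(sentence):
    words = sentence.lower().rstrip(".").split()
    results = []
    for i, word in enumerate(words):
        if word != "book":
            continue
        prev = words[i - 1] if i > 0 else ""
        if prev in DETERMINERS:
            results.append("NOUN")
        elif prev in VERB_CUES:
            results.append("VERB")
        else:
            results.append("UNKNOWN")
    return results

print(tag_book("I read a book."))         # ['NOUN']
print(tag_book("I will book a ticket."))  # ['VERB']
```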

## There are multiple approaches to POS Tagging:

<font size = '3px'><b> Rule Based </b></font> - Uses handcrafted grammar rules <br>
<font size = '3px'><b> Statistical </b></font> - Uses probabilistic models to assign POS tags based on patterns in annotated text corpora <br>
<font size = '3px'><b> Machine Learning </b></font> - Supervised models trained on annotated datasets <br>
<font size = '3px'><b> Deep Learning </b></font> - Models such as BiLSTM and Transformer-based architectures
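
As a rough illustration of the statistical idea, the sketch below learns the most frequent tag for each word from a tiny hand-labeled corpus. The corpus, the `train_unigram` and `tag` names, and the `NN` fallback are all made up for this example; real taggers train on large annotated corpora and also take context into account.

```python
from collections import Counter, defaultdict

# Tiny hand-labeled corpus (invented for illustration).
TRAINING = [
    [("the", "DT"), ("dog", "NN"), ("runs", "VBZ")],
    [("a", "DT"), ("cat", "NN"), ("runs", "VBZ")],
    [("the", "DT"), ("runs", "NNS"), ("are", "VBP"), ("long", "JJ")],
]

def train_unigram(sentences):
    # Count how often each word appears with each tag.
    counts = defaultdict(Counter)
    for sentence in sentences:
        for word, tag_label in sentence:
            counts[word][tag_label] += 1
    # Keep only the most frequent tag for each word.
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

def tag(words, model, default="NN"):
    # Words never seen in training fall back to the default tag.
    return [(w, model.get(w, default)) for w in words]

model = train_unigram(TRAINING)
print(tag(["the", "cat", "runs", "fast"], model))
# [('the', 'DT'), ('cat', 'NN'), ('runs', 'VBZ'), ('fast', 'NN')]
```

The word "runs" appears twice as `VBZ` and once as `NNS`, so the model picks `VBZ`. This baseline ignores context entirely, which is the gap that sequence models are meant to close.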

## Rule Based

In [1]:
import nltk
import re

In [3]:
pos_dict = {
    "the": "DET",
    "a": "DET",
    "an": "DET",
    "over": "PREP",
    "in": "PREP",
    "on": "PREP",
    "is": "VERB",
    "are": "VERB",
    "jumps": "VERB",
    "runs": "VERB",
    "dog": "NOUN",
    "cat": "NOUN",
    "fox": "NOUN"
}


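def rule_based(corpus):
    """Tag each token using a small lexicon, suffix rules, and one context rule."""
    # Needs the NLTK tokenizer models (punkt or punkt_tab, depending on the NLTK version).
    tokens = nltk.word_tokenize(corpus)
    tags = []

    for i, word in enumerate(tokens):
        word = word.lower()

        # 1. Lexicon lookup for known words
        if word in pos_dict:
            tag = pos_dict[word]

        # 2. Suffix rules for unknown words
        elif re.search(r"ly$", word):
            tag = "ADV"
        elif re.search(r"ing$", word):
            tag = "VBG"
        elif re.search(r"ed$", word):
            tag = "VBN"
        elif re.search(r"(ous|ful|able|al|ive)$", word):
            tag = "ADJ"

        # 3. Context rule: a word right after a determiner is guessed to be an adjective
        elif i > 0 and tokens[i-1].lower() in ['the', 'a', 'an']:
            tag = "ADJ"

        # 4. Fallback: treat everything else as a noun (same label the lexicon uses)
        else:
            tag = "NOUN"
        tags.append((word, tag))
    return tags

sentence = "The quick brown fox jumps over the lazy dog"
print(rule_based(sentence))

[('the', 'DET'), ('quick', 'ADJ'), ('brown', 'NOUN'), ('fox', 'NOUN'), ('jumps', 'VERB'), ('over', 'PREP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN')]

Note that `brown` falls through to the noun fallback: the context rule only looks at the word immediately after a determiner, which shows how brittle handcrafted rules can be.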


## Statistical Approach

In [4]:
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

In [5]:
nltk.download('averaged_perceptron_tagger')      # run once to fetch the tagger model
nltk.download('averaged_perceptron_tagger_eng')

[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /home/iragca/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger.zip.
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /home/iragca/nltk_data...
[nltk_data]   Unzipping taggers/averaged_perceptron_tagger_eng.zip.


True

In [6]:
corpus = "I am not dreaming anymore, I can't wake up, Turn on the light I'm haunted by my feelings"

tokens = corpus.lower().split()
statistical_tags = pos_tag(tokens)

statistical_tags

[('i', 'NN'),
 ('am', 'VBP'),
 ('not', 'RB'),
 ('dreaming', 'VBG'),
 ('anymore,', 'NN'),
 ('i', 'NN'),
 ("can't", 'VBP'),
 ('wake', 'VB'),
 ('up,', 'JJ'),
 ('turn', 'NN'),
 ('on', 'IN'),
 ('the', 'DT'),
 ('light', 'JJ'),
 ("i'm", 'NN'),
 ('haunted', 'VBN'),
 ('by', 'IN'),
 ('my', 'PRP$'),
 ('feelings', 'NNS')]
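
Several tags in this output look off: the pronoun `i` comes back as `NN`, and tokens such as `anymore,` and `up,` still carry their punctuation. Two preprocessing choices likely contribute: lowercasing removes the capital `I`, and `str.split()` only splits on whitespace. The stdlib-only sketch below compares the two token streams. The regex is an illustrative assumption, not NLTK's tokenizer.

```python
import re

corpus = "I am not dreaming anymore, I can't wake up, Turn on the light I'm haunted by my feelings"

# Whitespace splitting after lowercasing, as in the cell above.
naive = corpus.lower().split()

# Illustrative regex: a word (with an optional internal apostrophe)
# or a single punctuation mark.
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|[^\w\s]")
separated = TOKEN.findall(corpus)

print(naive[4], naive[8])   # anymore, up,
print(separated[4:6])       # ['anymore', ',']
print(separated[0])         # I
```

Passing case-preserved, punctuation-separated tokens (for example from `word_tokenize` on the original text) to `pos_tag` would likely give the tagger more of the cues it was trained on.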

## POS Tagging is a foundational ***preprocessing*** step in NLP pipelines

By grouping words into their respective grammatical roles, POS tagging provides greater insight into sentence structure and meaning, which is important for higher-level NLP applications.

In [8]:
# Let's say we have a sentence:
corpus= ''.join(["The statement 'mitochondria is the powerhouse of the cell' is a common analogy, ", 
        "and it is correct in that mitochondria are responsible for generating the majority of the cell's usable ",
        "energy in the form of adenosine triphosphate (ATP) through a process called cellular respiration. ",
        "This vital function makes them essential for the cell's survival and operation."])
corpus

"The statement 'mitochondria is the powerhouse of the cell' is a common analogy, and it is correct in that mitochondria are responsible for generating the majority of the cell's usable energy in the form of adenosine triphosphate (ATP) through a process called cellular respiration. This vital function makes them essential for the cell's survival and operation."

In [9]:
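import re
from nltk.corpus import stopwords
import pandas as pd

def rem_punct(corpus):
    """Remove every character that is not a letter or whitespace."""
    if not isinstance(corpus, str):
        return corpus
    # This also strips apostrophes and digits, so "cell's" becomes "cells".
    return re.sub(r'[^a-zA-Z\s]', '', corpus)

stop_words = set(stopwords.words('english'))  # needs nltk.download('stopwords')

cleaned = rem_punct(corpus)                   # strip punctuation
tokens = word_tokenize(cleaned.lower())       # lowercase, then tokenize
tokens = [word for word in tokens if word not in stop_words]  # drop stop words
tokens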

['statement',
 'mitochondria',
 'powerhouse',
 'cell',
 'common',
 'analogy',
 'correct',
 'mitochondria',
 'responsible',
 'generating',
 'majority',
 'cells',
 'usable',
 'energy',
 'form',
 'adenosine',
 'triphosphate',
 'atp',
 'process',
 'called',
 'cellular',
 'respiration',
 'vital',
 'function',
 'makes',
 'essential',
 'cells',
 'survival',
 'operation']

In [10]:
statistical_tags = pos_tag(tokens)
statistical_tags

[('statement', 'NN'),
 ('mitochondria', 'NN'),
 ('powerhouse', 'NN'),
 ('cell', 'NN'),
 ('common', 'JJ'),
 ('analogy', 'NN'),
 ('correct', 'JJ'),
 ('mitochondria', 'NN'),
 ('responsible', 'JJ'),
 ('generating', 'VBG'),
 ('majority', 'NN'),
 ('cells', 'NNS'),
 ('usable', 'JJ'),
 ('energy', 'NN'),
 ('form', 'NN'),
 ('adenosine', 'NN'),
 ('triphosphate', 'NN'),
 ('atp', 'NN'),
 ('process', 'NN'),
 ('called', 'VBD'),
 ('cellular', 'JJ'),
 ('respiration', 'NN'),
 ('vital', 'JJ'),
 ('function', 'NN'),
 ('makes', 'VBZ'),
 ('essential', 'JJ'),
 ('cells', 'NNS'),
 ('survival', 'JJ'),
 ('operation', 'NN')]

In [11]:
df = pd.DataFrame(statistical_tags, columns=['word', 'pos'])
df[df['pos'] != 'NN']

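| | word | pos |
|---|---|---|
| 4 | common | JJ |
| 6 | correct | JJ |
| 8 | responsible | JJ |
| 9 | generating | VBG |
| 11 | cells | NNS |
| 12 | usable | JJ |
| 19 | called | VBD |
| 20 | cellular | JJ |
| 22 | vital | JJ |
| 24 | makes | VBZ |
| 25 | essential | JJ |
| 26 | cells | NNS |
| 27 | survival | JJ |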

# Assignment (POS Tagging)

Get a paragraph from your favorite literature. Analyze your chosen paragraph and do textual cleaning if needed. Conduct POS tagging, then analyze and interpret the following:
- POS frequency (how many nouns, verbs, adjectives, etc.)
- Dominant POS types: is the paragraph noun-heavy, verb-dominated, or something else? Explain your answer.
- Patterns, *e.g. frequent adjectives might indicate rich imagery or emotional tone*

What are your other observations and interpretations?<br>
