# 🧩 Part-of-Speech (POS) Tagging

> **Objective:**  
> Identify the grammatical role of each word in a sentence,  
> e.g., whether it’s a **noun**, **verb**, **adjective**, **adverb**, etc.

POS tagging helps machines understand **sentence structure** and **word meaning in context**,  
which is essential for downstream tasks like lemmatization, parsing, NER, and text understanding.

---

## 📘 1. What is POS Tagging?

**Definition:**  
Part-of-Speech Tagging is the process of assigning a part of speech label (like *NN*, *VB*, *JJ*)  
to each token in a sentence, based on both its **definition** and **context**.

Formally, for a tokenized sequence $ T = [w_1, w_2, ..., w_n] $:

$$
\text{POS}(w_i) = \operatorname{tagger}(w_i, C_i)
$$

where $ C_i $ is the context window around word $ w_i $.
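
To make the context window $ C_i $ concrete, here is a minimal sketch that collects the tokens within a fixed distance of position $ i $. The helper name `context_window` and the window size `k` are illustrative assumptions, not part of NLTK; real taggers usually turn such a window into features rather than using the raw tokens.

```python
def context_window(tokens, i, k=2):
    """Return the tokens within k positions of index i, excluding tokens[i]."""
    left = tokens[max(0, i - k):i]      # up to k tokens before position i
    right = tokens[i + 1:i + 1 + k]     # up to k tokens after position i
    return left + right


tokens = ["The", "mice", "were", "running", "fast"]
window = context_window(tokens, 2)      # context around "were"
print(window)
```

Near the sentence boundaries the window simply shrinks, which is why the slice start is clamped at 0.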

---

| Example | Word | POS Tag | Meaning |
|----------|------|----------|----------|
| The mice were running fast | The | DT | Determiner |
| The mice were running fast | mice | NNS | Noun, plural |
| The mice were running fast | were | VBD | Verb, past tense |
| The mice were running fast | running | VBG | Verb, gerund or present participle |
| The mice were running fast | fast | RB | Adverb |

---

### 💡 Key Idea
A single word can take multiple POS tags depending on context:

| Word | Sentence | POS | Meaning |
|------|-----------|-----|----------|
| play | I **play** cricket. | VBP | Verb (action) |
| play | The **play** was excellent. | NN | Noun (thing) |


In [3]:
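import nltk
from nltk import pos_tag, word_tokenize, sent_tokenize

# One-time downloads of the tokenizer and tagger models.
# Resource names assume NLTK >= 3.9; older releases use 'punkt' and
# 'averaged_perceptron_tagger' instead.
nltk.download('punkt_tab', quiet=True)
nltk.download('averaged_perceptron_tagger_eng', quiet=True)

# Example sentence
sentence = "The mice were running faster and the better runner was finally organized."

tokens = word_tokenize(sentence)  # split into word and punctuation tokens
tags = pos_tag(tokens)            # list of (token, Penn Treebank tag) pairs

for word, tag in tags:
    print(f"{word:>12}  →  {tag}")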


         The  →  DT
        mice  →  NN
        were  →  VBD
     running  →  VBG
      faster  →  RBR
         and  →  CC
         the  →  DT
      better  →  JJR
      runner  →  NN
         was  →  VBD
     finally  →  RB
   organized  →  VBN
           .  →  .


## ⚙️ 2. POS Tag Sets (Penn Treebank)

NLTK’s default English tagger, `pos_tag`, uses the **Penn Treebank POS Tag Set**,  
a standardized collection of POS abbreviations.

| Tag | Part of Speech | Example |
|------|----------------|----------|
| NN | Noun, singular or mass | cat, boy |
| NNS | Noun, plural | mice, cars |
| VB | Verb, base form | go, play |
| VBD | Verb, past tense | went, played |
| VBG | Verb, gerund or present participle | running, eating |
| VBN | Verb, past participle | organized, eaten |
| VBP | Verb, non-3rd-person singular present | (I) play, (you) step |
| VBZ | Verb, 3rd-person singular present | is, plays |
| JJ | Adjective | happy, fast |
| JJR | Adjective, comparative | better, bigger |
| RB | Adverb | quickly, silently |
| RBR | Adverb, comparative | faster |
| DT | Determiner | the, an |
| IN | Preposition or subordinating conjunction | on, in, with |
| PRP | Personal pronoun | he, she, they |
| CC | Coordinating conjunction | and, but, or |
| UH | Interjection | wow, oh |
| . | Sentence terminator | . ! ? |

The full tag list is available with:


In [2]:
import nltk
nltk.download('tagsets', quiet=True)
nltk.download('tagsets_json', quiet=True)
nltk.help.upenn_tagset()  # Shows all POS tag definitions


$: dollar
    $ -$ --$ A$ C$ HK$ M$ NZ$ S$ U.S.$ US$
'': closing quotation mark
    ' ''
(: opening parenthesis
    ( [ {
): closing parenthesis
    ) ] }
,: comma
    ,
--: dash
    --
.: sentence terminator
    . ! ?
:: colon or ellipsis
    : ; ...
CC: conjunction, coordinating
    & 'n and both but either et for less minus neither nor or plus so
    therefore times v. versus vs. whether yet
CD: numeral, cardinal
    mid-1890 nine-thirty forty-two one-tenth ten million 0.5 one forty-
    seven 1987 twenty '79 zero two 78-degrees eighty-four IX '60s .025
    fifteen 271,124 dozen quintillion DM2,000 ...
DT: determiner
    all an another any both del each either every half la many much nary
    neither no some such that the them these this those
EX: existential there
    there
FW: foreign word
    gemeinschaft hund ich jeux habeas Haementeria Herr K'ang-si vous
    lutihaw alai je jour objets salutaris fille quibusdam pas trop Monte
    terram fiche oui corporis ...
IN: preposition or

## 🧮 3. POS Tagging Process

POS Tagging involves two main components:

1. **Lexical lookup:**  
   Assigns tags based on dictionary or training data.
2. **Contextual disambiguation:**  
   Uses the surrounding words to refine tags.

$$
\text{POS}(w_i) = \arg\max_{t_j} P(t_j \mid w_i, t_{i-1}, t_{i+1})
$$

where:
- $ t_j $: possible POS tag for token $ w_i $
- $ t_{i-1}, t_{i+1} $: neighboring tags

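This formula captures how **probabilistic POS taggers** (such as HMM or CRF models) choose a tag from context: each candidate tag is scored given the word and its neighboring tags, and the highest-scoring tag wins.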
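
The two components can be sketched with a toy tagger: a lexicon handles unambiguous words, and an arg max over a small probability table resolves the ambiguous word *play*. This is a simplified illustration, not how NLTK's tagger works: it tags greedily from left to right and conditions only on the previous tag $ t_{i-1} $, ignoring $ t_{i+1} $. The names `LEXICON`, `TOY_PROBS`, and `greedy_tag`, and all probability values, are hypothetical.

```python
# Lexical lookup: words that have a single tag in this toy vocabulary.
LEXICON = {
    "I": "PRP", "The": "DT", "the": "DT", "cricket": "NN",
    "was": "VBD", "excellent": "JJ", ".": ".",
}

# Hypothetical values of P(tag | word, previous tag); not learned from data.
TOY_PROBS = {
    ("play", "PRP"): {"VBP": 0.90, "NN": 0.10},
    ("play", "DT"): {"VBP": 0.05, "NN": 0.95},
}


def greedy_tag(tokens):
    """Tag left to right, using the previous tag to resolve ambiguous words."""
    tags = []
    prev = "<s>"  # sentence-start marker
    for word in tokens:
        if word in LEXICON:
            tag = LEXICON[word]
        else:
            # Unknown (word, previous tag) pairs default to noun.
            candidates = TOY_PROBS.get((word, prev), {"NN": 1.0})
            tag = max(candidates, key=candidates.get)  # arg max over tags
        tags.append(tag)
        prev = tag
    return tags


print(greedy_tag(["I", "play", "cricket", "."]))
print(greedy_tag(["The", "play", "was", "excellent", "."]))
```

After a pronoun the verb reading (VBP, the Penn tag for a non-3rd-person present verb) scores highest; after a determiner the noun reading wins.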


In [4]:
# Example: POS tagging on MS Dhoni motivational paragraph
paragraph = """
When you step into any challenge, you must bring one thing above all — consistency in your actions.
Success is not a sudden peak — it’s a steady climb built on daily habits.
You don’t wake up one morning and find you’re great; you become great because you kept showing up, kept trying, and kept learning.
"""

sentences = sent_tokenize(paragraph)
for i, s in enumerate(sentences, 1):
    tokens = word_tokenize(s)
    tags = pos_tag(tokens)
    print(f"\nSentence {i}: {s}\n{'-'*len(s)}")
    for word, tag in tags:
        print(f"{word:>12}  →  {tag}")



Sentence 1: 
When you step into any challenge, you must bring one thing above all — consistency in your actions.
----------------------------------------------------------------------------------------------------
        When  →  WRB
         you  →  PRP
        step  →  VBP
        into  →  IN
         any  →  DT
   challenge  →  NN
           ,  →  ,
         you  →  PRP
        must  →  MD
       bring  →  VB
         one  →  CD
       thing  →  NN
       above  →  IN
         all  →  DT
           —  →  JJ
 consistency  →  NN
          in  →  IN
        your  →  PRP$
     actions  →  NNS
           .  →  .

Sentence 2: Success is not a sudden peak — it’s a steady climb built on daily habits.
-------------------------------------------------------------------------
     Success  →  NNP
          is  →  VBZ
         not  →  RB
           a  →  DT
      sudden  →  JJ
        peak  →  NN
           —  →  NN
          it 

## 📘 4. Mapping POS Tags for Lemmatization

As we saw earlier, WordNet uses a **simpler POS scheme**:
- `wn.NOUN` (n)
- `wn.VERB` (v)
- `wn.ADJ` (a)
- `wn.ADV` (r)

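We therefore need a conversion from **Penn Treebank → WordNet** tags before lemmatization.
Tags outside these four groups (for example `DT`, `IN`, `MD`) have no WordNet counterpart; the function below falls back to noun for them.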

$$
\text{map}(t_{\text{Penn}}) =
\begin{cases}
n, & \text{if starts with } N \\
v, & \text{if starts with } V \\
a, & \text{if starts with } J \\
r, & \text{if starts with } R
\end{cases}
$$


In [5]:
from nltk.corpus import wordnet as wn

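def penn_to_wn(tag):
    """Map a Penn Treebank tag to the matching WordNet POS constant."""
    if tag.startswith('J'):      # JJ, JJR, JJS
        return wn.ADJ
    elif tag.startswith('V'):    # VB, VBD, VBG, VBN, VBP, VBZ
        return wn.VERB
    elif tag.startswith('N'):    # NN, NNS, NNP, NNPS
        return wn.NOUN
    elif tag.startswith('R'):    # RB, RBR, RBS (also matches RP, the particle tag)
        return wn.ADV
    return wn.NOUN               # fallback for every other tag (DT, IN, MD, ...)

example_tags = ['NN', 'VBD', 'JJ', 'RB']
for t in example_tags:
    print(f"{t:>4} → {penn_to_wn(t)}")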


  NN → n
 VBD → v
  JJ → a
  RB → r
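
The `penn_to_wn` function above returns `wn.NOUN` for any tag it does not recognize, so function words such as modals and determiners are silently treated as nouns. One possible alternative, sketched here as an assumption rather than a recommendation from the notebook, is to return `None` and let the caller decide whether to skip the token. Plain strings stand in for the `wn.*` constants (which print as `n`, `v`, `a`, `r` in the output above) so the sketch runs without NLTK data; `PREFIX_TO_WN` and `penn_to_wn_strict` are hypothetical names.

```python
# First letter of the Penn tag -> WordNet POS letter.
PREFIX_TO_WN = {"N": "n", "V": "v", "J": "a", "R": "r"}


def penn_to_wn_strict(tag):
    """Return the WordNet POS letter for a Penn tag, or None if no group matches."""
    return PREFIX_TO_WN.get(tag[:1])


for t in ["NNS", "VBP", "JJR", "RBR", "MD", "WRB", "DT"]:
    print(f"{t:>4} -> {penn_to_wn_strict(t)}")
```

With this variant `MD`, `WRB`, and `DT` map to `None`, which makes the fallback decision explicit at the call site.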


## 📊 5. POS Tag Distribution

Let’s analyze the **frequency of POS tags** in our text to understand its structure:


In [6]:
from collections import Counter
import pandas as pd

# POS distribution for Dhoni paragraph
tokens = word_tokenize(paragraph)
tags = pos_tag(tokens)
tag_counts = Counter(tag for _, tag in tags)

df_tags = pd.DataFrame(tag_counts.items(), columns=['POS Tag', 'Count']).sort_values(by='Count', ascending=False)
df_tags.head(10)


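| Index | POS Tag | Count |
|-------|---------|-------|
| 10 | JJ | 9 |
| 5 | NN | 8 |
| 1 | PRP | 7 |
| 3 | IN | 5 |
| 2 | VBP | 4 |
| 4 | DT | 4 |
| 12 | NNS | 3 |
| 6 | , | 3 |
| 21 | VBD | 3 |
| 15 | VBZ | 3 |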
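
Fine-grained Penn tags split one word class across several rows (for example VBP, VBD, VBZ), so it can help to also count coarse groups. The sketch below does this for the tags copied from the Sentence 1 output earlier in this notebook; the grouping by first letter and the names `COARSE` and `coarse_group` are assumptions made for illustration.

```python
from collections import Counter

# Tags copied from the Sentence 1 output, in token order.
sentence1_tags = [
    "WRB", "PRP", "VBP", "IN", "DT", "NN", ",", "PRP", "MD", "VB",
    "CD", "NN", "IN", "DT", "JJ", "NN", "IN", "PRP$", "NNS", ".",
]

COARSE = {"N": "noun", "V": "verb", "J": "adjective", "R": "adverb"}


def coarse_group(tag):
    """Collapse a Penn tag to a coarse class based on its first letter."""
    return COARSE.get(tag[:1], "other")


coarse_counts = Counter(coarse_group(t) for t in sentence1_tags)
total = sum(coarse_counts.values())
for group, count in coarse_counts.most_common():
    print(f"{group:>10}  {count:>2}  {count / total:.0%}")
```

Note that the single JJ in that sentence is the dash token, so the adjective count here reflects a tagging error rather than a real adjective.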


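## 💡 6. Observations

- Most frequent tags in the output above: **JJ** (9), **NN** (8), **PRP** (7), and **IN** (5), i.e. adjectives, nouns, personal pronouns, and prepositions.  
- Verb forms are split across **VBP**, **VBD**, and **VBZ** (10 tokens combined in the top ten), so the text is rich in **action (verbs)**. Part of the **JJ** and **NN** counts likely comes from punctuation the tagger mislabels, such as the `—` tokens tagged JJ and NN in the sentence-level output.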
- POS tagging enables:
  - Context-aware **lemmatization**
  - **Named Entity Recognition (NER)**
  - **Dependency parsing**
  - **Sentiment polarity analysis**

---

## ✅ 7. Summary

| Step | Description |
|------|--------------|
| **Tokenization** | Break text into words |
| **POS Tagging** | Assign grammatical labels |
| **Tag Mapping** | Convert Treebank → WordNet |
| **Use Cases** | Lemmatization, NER, Parsing, Sentiment |

---

### 💬 Final Thought

> *“Understanding language begins with understanding the role every word plays.”*  
> *M.S. Dhoni-inspired NLP wisdom* 🧢


# 🧭 POS Tagging Flow

**Inline math example:** $ \text{POS}(w_i)=\operatorname{tagger}(w_i, C_i) $,  
and mapping to WordNet: $ \text{map}(t_{\text{Penn}})\in\{n,v,a,r\} $.

---

**Vertical flow (display math):**

$$
\begin{array}{c}
\boxed{\text{Raw Text / Paragraph}} \\[6pt]
\Downarrow\ \text{sent\_tokenize()} \\[6pt]
\boxed{\text{Sentences } S_1,\dots,S_n} \\[6pt]
\Downarrow\ \text{word\_tokenize()} \\[6pt]
\boxed{\text{Tokens } T_i=[w_{i1},\dots,w_{im}]} \\[6pt]
\Downarrow\ \text{POS Tagger } \text{tag}(w) \\[6pt]
\boxed{\text{Penn POS Tags } \{ \text{NN},\text{VB},\text{JJ},\text{RB},\dots \}} \\[6pt]
\Downarrow\ \text{map(Penn}\!\to\!\text{WordNet}) \\[6pt]
\boxed{\text{WordNet POS } \{n,v,a,r\}} \\[6pt]
\Downarrow\ \text{Apply (e.g., Lemmatizer, Rules)} \\[6pt]
\boxed{\text{Downstream Tasks: Lemma / NER / Parse / Sentiment}} \\[6pt]
\end{array}
$$

---

**Horizontal flow (compact):**

$$
\boxed{\text{Text}}
\xrightarrow{\text{sent\_tokenize}}
\boxed{\text{Sentences}}
\xrightarrow{\text{word\_tokenize}}
\boxed{\text{Tokens}}
\xrightarrow{\text{POS Tagger}}
\boxed{\text{Penn POS}}
\xrightarrow{\text{map}}
\boxed{\text{WN POS }(n,v,a,r)}
\xrightarrow{\text{apply}}
\boxed{\text{Lemmas / NER / Parse}}
$$

---

**Core equations (display):**

POS assignment with context window $C_i$:
$$
\text{POS}(w_i)=\arg\max_{t\in\mathcal{T}} P\!\left(t \mid w_i, C_i\right)
$$

Penn $\to$ WordNet mapping:
$$
\text{map}(t)=
\begin{cases}
n, & t\ \text{starts with } N \\[2pt]
v, & t\ \text{starts with } V \\[2pt]
a, & t\ \text{starts with } J \\[2pt]
r, & t\ \text{starts with } R
\end{cases}
$$

Applying POS-aware lemmatization:
$$
\text{Lemma}(w,\text{POS})=\operatorname{lookup}_{\text{WN}}\!\big(\text{morph}(w),\text{POS}\big)
$$
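
The lemmatization equation can be read as: normalize the surface form with POS-specific morphology, then accept the result only if the lexicon contains it under that POS. The sketch below mimics this with a tiny hand-made lexicon in place of WordNet. Every name and entry (`TOY_LEXICON`, `TOY_EXCEPTIONS`, `TOY_RULES`, `toy_lemma`) is hypothetical, and the rules are far cruder than WordNet's actual morphology.

```python
# Hypothetical miniature lexicon standing in for WordNet: (base form, POS).
TOY_LEXICON = {("action", "n"), ("show", "v")}

# Irregular forms that suffix stripping cannot recover.
TOY_EXCEPTIONS = {
    ("mice", "n"): "mouse",
    ("running", "v"): "run",
    ("kept", "v"): "keep",
}

# Suffixes to strip, per POS.
TOY_RULES = {"n": ["s"], "v": ["ing", "ed", "s"]}


def toy_lemma(word, pos):
    """Return the base form of word under the given POS, or word unchanged."""
    word = word.lower()
    if (word, pos) in TOY_EXCEPTIONS:
        return TOY_EXCEPTIONS[(word, pos)]
    for suffix in TOY_RULES.get(pos, []):
        if word.endswith(suffix):
            candidate = word[:len(word) - len(suffix)]
            # Accept the candidate only if the lexicon lists it with this POS.
            if (candidate, pos) in TOY_LEXICON:
                return candidate
    return word


for w, p in [("actions", "n"), ("mice", "n"), ("showing", "v"),
             ("running", "v"), ("running", "n")]:
    print(f"{w:>8} ({p}) -> {toy_lemma(w, p)}")
```

The last two calls show why the POS argument matters: *running* reduces to *run* as a verb but stays *running* as a noun.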


# 🧭 POS Tagging Flow: Example with Dhoni’s Motivational Paragraph 🏏

We’ll visualize the complete **POS tagging workflow** applied to our **MS Dhoni motivational paragraph**,  
showing how each step transforms text, from raw sentences to lemmatized words.  

---

### 💬 Input Paragraph
> “When you step into any challenge, you must bring one thing above all — consistency in your actions.  
> Success is not a sudden peak — it’s a steady climb built on daily habits.  
> You don’t wake up one morning and find you’re great; you become great because you kept showing up, kept trying, and kept learning.”

---

### ⚙️ Flowchart (LaTeX Visualization)

$$
\begin{array}{c}
\boxed{\textbf{Raw Paragraph}} \\[6pt]
\text{"When you step into any challenge, you must bring one thing above all — consistency in your actions."} \\[8pt]
\Downarrow\ \text{sent\_tokenize()} \\[6pt]
\boxed{\textbf{Sentences } S_1, S_2, S_3} \\[6pt]
\text{S}_1=\text{"When you step into any challenge, you must bring one thing above all — consistency in your actions."}\\[8pt]
\Downarrow\ \text{word\_tokenize()} \\[6pt]
\boxed{\textbf{Tokens for } S_1} \\[6pt]
\text{["When","you","step","into","any","challenge",",","you","must","bring","one","thing","above","all","—","consistency","in","your","actions","."]}\\[8pt]
\Downarrow\ \text{POS Tagger } \text{tag}(w) \\[6pt]
\boxed{\textbf{Penn POS Tags}} \\[6pt]
\text{When→WRB, step→VBP, challenge→NN, must→MD, bring→VB, thing→NN, consistency→NN, actions→NNS}\\[8pt]
\Downarrow\ \text{map(Penn}\!\to\!\text{WordNet}) \\[6pt]
\boxed{\textbf{Mapped WordNet POS}} \\[6pt]
\text{VBP→v,\ VB→v,\ NN→n,\ NNS→n,\ MD→n (fallback),\ WRB→n (fallback)}\\[8pt]
\Downarrow\ \text{POS-Aware Lemmatizer} \\[6pt]
\boxed{\textbf{Lemmas}} \\[6pt]
\text{step→step,\ bring→bring,\ challenge→challenge,\ consistency→consistency,\ actions→action}\\[8pt]
\Downarrow\ \text{Summarize / NLP Usage} \\[6pt]
\boxed{\textbf{Downstream Applications}} \\[6pt]
\text{Lemmatization ✓ \quad NER ✓ \quad Sentiment ✓ \quad Parsing ✓}
\end{array}
$$

---

### 💡 Core Equations

POS tagging as contextual prediction:
$$
\text{POS}(w_i)=\arg\max_{t\in\mathcal{T}}P(t\mid w_i,C_i)
$$

Penn → WordNet mapping:
$$
\text{map}(t_{\text{Penn}})=
\begin{cases}
n,&t\text{ starts with }N\\[4pt]
v,&t\text{ starts with }V\\[4pt]
a,&t\text{ starts with }J\\[4pt]
r,&t\text{ starts with }R
\end{cases}
$$

POS-aware Lemmatization:
$$
\text{Lemma}(w,\text{POS})=\operatorname{lookup}_{\text{WN}}\!\big(\text{morph}(w),\text{POS}\big)
$$

---

### ✅ Summary Table

| Step | Function | Example Transformation (from Dhoni paragraph) |
|------|-----------|-----------------------------------------------|
| Sentence Tokenization | `sent_tokenize()` | Paragraph → 3 sentences |
| Word Tokenization | `word_tokenize()` | Sentence → `["When","you","step",...]` |
| POS Tagging | `pos_tag()` | `step → VBP`, `challenge → NN` |
| POS Mapping | Penn → WordNet | `VBP → v`, `NN → n` |
| Lemmatization | `lemmatizer.lemmatize()` | `actions → action`, `bring → bring` |
| Summary | Merge results | Clean, lemmatized tokens ready for BoW/TF-IDF |

---

> 🧢 *“It’s not just about words; it’s about what role each word plays.”*  
> Inspired by **M.S. Dhoni**
