# Part-of-Speech (POS) Tagging

## What is POS Tagging?

**Part-of-Speech (POS) tagging** is the process of assigning a grammatical category (noun, verb, adjective, etc.) to each word in a text, based on both the word itself and the context it appears in.

### Why is it important?
- **Word Disambiguation**: Words like "book" can be a noun or verb depending on context
- **Information Extraction**: Helps identify entities, relationships, and key information
- **Sentiment Analysis**: Identifies opinion words (adjectives, adverbs)
- **Machine Translation**: Understanding sentence structure for better translation
- **Text-to-Speech**: Correct pronunciation based on word type

### Common POS Tags:
- **NN**: Noun (singular) - dog, car
- **NNS**: Noun (plural) - dogs, cars
- **VB**: Verb (base form) - run, eat
- **VBD**: Verb (past tense) - ran, ate
- **VBG**: Verb (gerund or present participle) - running, eating
- **JJ**: Adjective - big, beautiful
- **RB**: Adverb - quickly, slowly
- **DT**: Determiner - the, a
- **IN**: Preposition or subordinating conjunction - in, on, at
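
Penn Treebank tags for the same word class share a prefix (NN and NNS, or VB, VBD and VBG), so a tag can be reduced to a coarse category by looking at its first two letters. The sketch below illustrates this; `COARSE_BY_PREFIX` and `coarse_category` are hypothetical names introduced here, not part of NLTK.

```python
# Coarse categories keyed by tag prefix (an illustrative grouping, not an NLTK API).
COARSE_BY_PREFIX = {
    "NN": "noun",
    "VB": "verb",
    "JJ": "adjective",
    "RB": "adverb",
    "DT": "determiner",
    "IN": "preposition",
}


def coarse_category(tag):
    """Map a Penn Treebank tag to a coarse category using its first two letters."""
    return COARSE_BY_PREFIX.get(tag[:2], "other")


for tag in ["NN", "NNS", "VBD", "VBG", "JJ", "RB", "CC"]:
    print(f"{tag:4} -> {coarse_category(tag)}")
```

Tags outside the table, such as CC (coordinating conjunction), fall into the "other" bucket.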

In [None]:
import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger
from nltk.corpus import treebank

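# Resource names differ between NLTK releases: newer versions look for
# 'punkt_tab' and 'averaged_perceptron_tagger_eng', older ones for the
# original names. Requesting both sets covers either case.
for resource in ('punkt', 'punkt_tab', 'averaged_perceptron_tagger',
                 'averaged_perceptron_tagger_eng', 'treebank'):
    nltk.download(resource, quiet=True)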

## 1. Basic POS Tagging

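NLTK provides a pre-trained POS tagger, `nltk.pos_tag`, that uses the Penn Treebank tag set. It needs no training on your side and works well on general English text, although accuracy tends to drop on text that differs from its training data.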

In [None]:
sentence = "The cat sits on the mat"
tokens = word_tokenize(sentence)
pos_tags = nltk.pos_tag(tokens)

print(f"Sentence: {sentence}\n")
for word, tag in pos_tags:
    print(f"{word:10} -> {tag}")

## 2. Rule-Based Tagging

**Rule-based taggers** assign POS tags using hand-written rules. The simplest kind, shown here, matches word endings or other surface patterns with regular expressions.

**Advantages:**
- Simple and interpretable
- No training data needed
- Fast

**Disadvantages:**
- Limited accuracy
- Can't handle context well
- Requires manual rule creation

In [None]:
# Regular expression patterns, tried in order: the first match wins
patterns = [
    (r'.*ing$', 'VBG'),    # gerund / present participle
    (r'.*ed$', 'VBD'),     # past tense
    (r'.*ly$', 'RB'),      # adverbs
    (r'.*s$', 'NNS'),      # plural nouns
    (r'.*', 'NN')          # default: everything else is tagged as a noun
]

regexp_tagger = nltk.RegexpTagger(patterns)

test = "The cats are running quickly"
tokens = word_tokenize(test)
print(f"Sentence: {test}\n")
print(regexp_tagger.tag(tokens))
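
Because the suffix rules see only the spelling of a word, function words such as "The" fall through to the default tag. One way to reduce this, sketched below, is to put short closed-class word lists ahead of the suffix rules. The word lists are assumptions chosen for illustration and are far from complete.

```python
import nltk

# Same suffix rules as before, preceded by small hand-picked word lists.
patterns = [
    (r'^([Tt]he|[Aa]n?)$', 'DT'),  # determiners: the, a, an (either case)
    (r'^(in|on|at)$', 'IN'),       # a few prepositions
    (r'.*ing$', 'VBG'),
    (r'.*ed$', 'VBD'),
    (r'.*ly$', 'RB'),
    (r'.*s$', 'NNS'),
    (r'.*', 'NN'),
]
tagger = nltk.RegexpTagger(patterns)

print(tagger.tag(['The', 'cats', 'are', 'running', 'quickly']))

# Suffix rules still misfire on words that merely look inflected.
print(tagger.tag(['bed', 'is', 'king']))
```

"The" is now tagged DT, but "are" is still tagged NN, and "bed", "is" and "king" come out as VBD, NNS and VBG. This is the limited accuracy listed among the disadvantages.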

## 3. Probabilistic Tagging

**Probabilistic taggers** learn from annotated training data and use statistics to predict tags. They consider context and word frequencies.

### Types:
- **Unigram Tagger**: Assigns each word the tag it most often had in the training data
- **Bigram Tagger**: Considers the current word together with the previous word's tag
- **Trigram Tagger**: Considers the current word together with the previous two tags
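
The unigram idea is simple enough to sketch without NLTK: count how often each word carries each tag, then keep the most frequent one. The function name `train_most_frequent_tag` and the toy sentences are hypothetical, and NLTK's `UnigramTagger` has its own internals that this sketch does not reproduce.

```python
from collections import Counter, defaultdict


def train_most_frequent_tag(tagged_sents):
    """Return a dict mapping each word to its most frequent tag in the training data."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: tag_counts.most_common(1)[0][0] for word, tag_counts in counts.items()}


# Tiny hand-written training set (hypothetical data, for illustration only).
toy_train = [
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("old", "JJ")],
    [("I", "PRP"), ("read", "VBD"), ("the", "DT"), ("book", "NN")],
    [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("book", "VB"), ("a", "DT"), ("flight", "NN")],
]

model = train_most_frequent_tag(toy_train)
print(model["book"])       # NN: seen twice as a noun, once as a verb
print(model.get("novel"))  # None: unseen words have no entry
```

The model always tags "book" as NN, even after "to", which is the kind of error that taggers using the previous tag can correct.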

**Advantages:**
- Higher accuracy than hand-written rules (roughly 85-95%, depending on the tagger and the training data)
- Learn from data automatically
- Handle context better

**Backoff Chain**: When a tagger has no entry for the current context, it defers to a simpler backoff tagger (Bigram → Unigram → Default)
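
The effect of a backoff chain can be seen on a toy example. The sketch below trains on three hand-written sentences (hypothetical data) and compares a bigram tagger on its own with the full Bigram → Unigram → Default chain.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Tiny hand-written training set (hypothetical data, for illustration only).
toy_train = [
    [("the", "DT"), ("book", "NN"), ("is", "VBZ"), ("old", "JJ")],
    [("I", "PRP"), ("read", "VBD"), ("the", "DT"), ("book", "NN")],
    [("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("book", "VB"), ("a", "DT"), ("flight", "NN")],
]
tokens = ["I", "want", "to", "book", "the", "flight"]

# Without a backoff, an unseen (previous tag, word) pair yields None,
# and the None then makes the next context unseen as well.
bigram_only = BigramTagger(toy_train)
print(bigram_only.tag(tokens))

# With the chain, each tagger defers to the next one when it has no answer.
default = DefaultTagger("NN")
unigram = UnigramTagger(toy_train, backoff=default)
chain = BigramTagger(toy_train, backoff=unigram)
print(chain.tag(tokens))
print(chain.tag(["I", "read", "the", "novel"]))
```

On its own, the bigram tagger leaves "the" and "flight" untagged because "the" never followed a verb in base form during training. The chain tags "book" as VB after "to", recovers DT and NN from the unigram tagger, and assigns the default NN to the unseen word "novel".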

In [None]:
# Prepare training data
tagged_sentences = treebank.tagged_sents()
split = int(len(tagged_sentences) * 0.8)
train_data = tagged_sentences[:split]
test_data = tagged_sentences[split:]

print(f"Training sentences: {len(train_data)}")
print(f"Test sentences: {len(test_data)}")

In [None]:
# Train taggers
default_tagger = DefaultTagger('NN')
unigram_tagger = UnigramTagger(train_data, backoff=default_tagger)
bigram_tagger = BigramTagger(train_data, backoff=unigram_tagger)

# Evaluate token-level accuracy on the held-out sentences
# (older NLTK releases name this method evaluate())
print("Accuracy on test data:\n")
print(f"Unigram: {unigram_tagger.accuracy(test_data):.2%}")
print(f"Bigram:  {bigram_tagger.accuracy(test_data):.2%}")
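
The accuracy reported here is a token-level score: the fraction of test tokens whose predicted tag equals the gold tag. A minimal sketch of that computation follows; `token_accuracy` is a hypothetical helper written for illustration, and the gold sentence is made up.

```python
from nltk.tag import DefaultTagger


def token_accuracy(tagger, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tagger.tag(words)
        for (_, gold_tag), (_, predicted_tag) in zip(sent, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# One hand-tagged sentence (hypothetical data).
gold = [[("the", "DT"), ("dog", "NN"), ("barks", "VBZ"), ("loudly", "RB")]]

# A tagger that labels everything NN gets one token out of four right.
baseline = DefaultTagger("NN")
print(f"{token_accuracy(baseline, gold):.2%}")
```

A default-tag baseline like this gives a useful floor: a trained tagger that barely beats it is learning little beyond the most common tag.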

In [None]:
# Test on new sentence
test_sent = "The dog runs quickly through the garden"
tokens = word_tokenize(test_sent)

print(f"Sentence: {test_sent}\n")
for word, tag in bigram_tagger.tag(tokens):
    print(f"{word:12} -> {tag}")

## 4. Extracting Specific POS

Once we have POS tags, we can extract specific types of words for various applications:

- **Nouns (NN, NNS)**: For keyword extraction, topic modeling
- **Adjectives (JJ)**: For sentiment analysis, product features
- **Verbs (VB*)**: For action identification, event extraction
- **Proper nouns (NNP, NNPS)**: Candidates for named entity recognition

This is useful for:
- Summarization
- Information extraction
- Search and indexing
- Content analysis

In [None]:
text = "The beautiful garden contains colorful flowers and tall trees"
tokens = word_tokenize(text)
tagged = nltk.pos_tag(tokens)

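# Extract common nouns (NN, NNS); proper nouns (NNP, NNPS) are excluded
nouns = [word for word, tag in tagged if tag in ['NN', 'NNS']]
# Extract adjectives, including comparative (JJR) and superlative (JJS)
adjectives = [word for word, tag in tagged if tag.startswith('JJ')]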

print(f"Text: {text}\n")
print(f"Nouns: {nouns}")
print(f"Adjectives: {adjectives}")
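
For keyword extraction, the extracted words are usually counted rather than just listed. The sketch below counts lowercased words by tag prefix. `count_by_prefix` is a hypothetical helper, and the input is tagged by hand so that the example runs without any downloaded models; in practice the tags would come from `nltk.pos_tag`.

```python
from collections import Counter


def count_by_prefix(tagged_tokens, prefix):
    """Count lowercased words whose tag starts with the given prefix."""
    return Counter(word.lower() for word, tag in tagged_tokens if tag.startswith(prefix))


# Hand-tagged tokens (hypothetical input).
tagged = [
    ("The", "DT"), ("tall", "JJ"), ("trees", "NNS"), ("shade", "VBP"),
    ("the", "DT"), ("garden", "NN"), ("and", "CC"), ("the", "DT"),
    ("garden", "NN"), ("feeds", "VBZ"), ("the", "DT"), ("trees", "NNS"),
]

noun_counts = count_by_prefix(tagged, "NN")
print(noun_counts.most_common(2))
print(count_by_prefix(tagged, "JJ"))
```

Matching on the prefix "NN" also picks up proper nouns (NNP, NNPS); pass a stricter filter if only common nouns are wanted.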