# ***To Implement PoS Tagging on Text***


**Name:** Prexit Joshi  
**Roll No.:** 118


## 1. Aim
To implement Part-of-Speech (PoS) tagging on sample text using the NLTK library and explain the results.

## 2. Description
Part-of-Speech tagging assigns a lexical category (such as noun, verb, adjective) to each word in a sentence. This practical demonstrates basic preprocessing, tokenization, and PoS tagging using NLTK's `pos_tag` function. The output helps in syntactic analysis and is a common preprocessing step in many NLP tasks.

## 3. Requirements
- Google Colab (or local Jupyter)
- Libraries: `nltk`, `pandas`

Colab ships with `nltk` preinstalled, but the tokenizer and tagger data must still be downloaded from within the notebook.

## 4. Implementation

In [10]:
# 4.1 Imports and NLTK downloads
import nltk
import pandas as pd

# Download the required NLTK data (needed only on the first run)
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

print('NLTK ready')

NLTK ready


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!


In [11]:
# 4.2 Sample dataset (small and clear)
texts = [
    "The quick brown fox jumps over the lazy dog.",
    "I love studying Natural Language Processing.",
    "Prexit submitted the practical on time.",
    "The weather is pleasant today, so we will go outside."
]

df = pd.DataFrame({'text': texts})
df.index += 1  # make row numbers start at 1 for lab-style display
df

|   | text |
|---|------|
| 1 | The quick brown fox jumps over the lazy dog. |
| 2 | I love studying Natural Language Processing. |
| 3 | Prexit submitted the practical on time. |
| 4 | The weather is pleasant today, so we will go o... |


In [12]:
# 4.3 Tokenization and PoS tagging
from nltk import word_tokenize, pos_tag
import nltk

# Recent NLTK releases load these resources rather than the ones downloaded in 4.1;
# without them word_tokenize and pos_tag raise a LookupError.
nltk.download('punkt_tab')
nltk.download('averaged_perceptron_tagger_eng')

# Function to tokenize and tag a sentence
def pos_tag_sentence(sentence):
    tokens = word_tokenize(sentence)
    tags = pos_tag(tokens)
    return tokens, tags

# Apply to all texts
results = []
for i, row in df.iterrows():
    sent = row['text']
    tokens, tags = pos_tag_sentence(sent)
    results.append({'id': i, 'text': sent, 'tokens': tokens, 'pos_tags': tags})

res_df = pd.DataFrame(results)
res_df[['id','text','tokens','pos_tags']]

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!
[nltk_data] Downloading package averaged_perceptron_tagger_eng to
[nltk_data]     /root/nltk_data...
[nltk_data]   Package averaged_perceptron_tagger_eng is already up-to-
[nltk_data]       date!


|   | id | text | tokens | pos_tags |
|---|----|------|--------|----------|
| 0 | 1 | The quick brown fox jumps over the lazy dog. | [The, quick, brown, fox, jumps, over, the, laz... | [(The, DT), (quick, JJ), (brown, NN), (fox, NN... |
| 1 | 2 | I love studying Natural Language Processing. | [I, love, studying, Natural, Language, Process... | [(I, PRP), (love, VBP), (studying, VBG), (Natu... |
| 2 | 3 | Prexit submitted the practical on time. | [Prexit, submitted, the, practical, on, time, .] | [(Prexit, NN), (submitted, VBD), (the, DT), (p... |
| 3 | 4 | The weather is pleasant today, so we will go o... | [The, weather, is, pleasant, today, ,, so, we,... | [(The, DT), (weather, NN), (is, VBZ), (pleasan... |


In [13]:
# 4.4 Display results in readable form
for _, r in res_df.iterrows():
    print(f"Sentence {r['id']}: {r['text']}")
    print('Tokens:', r['tokens'])
    print('PoS Tags:')
    for tok, tag in r['pos_tags']:
        print(f"  {tok:15} -> {tag}")
    print('-'*60)


Sentence 1: The quick brown fox jumps over the lazy dog.
Tokens: ['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
PoS Tags:
  The             -> DT
  quick           -> JJ
  brown           -> NN
  fox             -> NN
  jumps           -> VBZ
  over            -> IN
  the             -> DT
  lazy            -> JJ
  dog             -> NN
  .               -> .
------------------------------------------------------------
Sentence 2: I love studying Natural Language Processing.
Tokens: ['I', 'love', 'studying', 'Natural', 'Language', 'Processing', '.']
PoS Tags:
  I               -> PRP
  love            -> VBP
  studying        -> VBG
  Natural         -> NNP
  Language        -> NNP
  Processing      -> NNP
  .               -> .
------------------------------------------------------------
Sentence 3: Prexit submitted the practical on time.
Tokens: ['Prexit', 'submitted', 'the', 'practical', 'on', 'time', '.']
PoS Tags:
  Prexit          -> NN
  submitted  

In [14]:
# 4.5 Save results to CSV (optional)
import os # Import the os module

out_path = '/mnt/data/pos_tagging_results.csv'

# Create the directory if it doesn't exist
os.makedirs(os.path.dirname(out_path), exist_ok=True)

# Flatten the lists into strings so each CSV cell holds plain text:
# tokens are space-joined, tags are written as token/TAG pairs
res_df_to_save = res_df.copy()
res_df_to_save['tokens'] = res_df_to_save['tokens'].apply(' '.join)
res_df_to_save['pos_tags'] = res_df_to_save['pos_tags'].apply(lambda pairs: ' '.join(f"{t}/{p}" for t, p in pairs))
res_df_to_save.to_csv(out_path, index=False)
print('Saved results to', out_path)

# Show saved file preview
res_df_to_save.head()

Saved results to /mnt/data/pos_tagging_results.csv


|   | id | text | tokens | pos_tags |
|---|----|------|--------|----------|
| 0 | 1 | The quick brown fox jumps over the lazy dog. | The quick brown fox jumps over the lazy dog . | The/DT quick/JJ brown/NN fox/NN jumps/VBZ over... |
| 1 | 2 | I love studying Natural Language Processing. | I love studying Natural Language Processing . | I/PRP love/VBP studying/VBG Natural/NNP Langua... |
| 2 | 3 | Prexit submitted the practical on time. | Prexit submitted the practical on time . | Prexit/NN submitted/VBD the/DT practical/NN on... |
| 3 | 4 | The weather is pleasant today, so we will go o... | The weather is pleasant today , so we will go ... | The/DT weather/NN is/VBZ pleasant/JJ today/NN ... |
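The `token/TAG` strings written to the CSV can be turned back into (token, tag) pairs when the file is read again later. The following is a minimal sketch, assuming tokens contain no whitespace (which holds for the `word_tokenize` output here); `parse_tagged` is a hypothetical helper name, not part of NLTK or pandas.

```python
def parse_tagged(s):
    """Split a 'token/TAG token/TAG ...' string into (token, tag) pairs."""
    pairs = []
    for item in s.split():
        # Split on the LAST slash so a token that itself contains '/' stays intact
        token, tag = item.rsplit('/', 1)
        pairs.append((token, tag))
    return pairs

# Sentence 2 in the saved format, rebuilt from the tags printed in 4.4
saved = "I/PRP love/VBP studying/VBG Natural/NNP Language/NNP Processing/NNP ./."
restored = parse_tagged(saved)
print(restored)
```

Splitting from the right matters for the final token: `./.` becomes `('.', '.')` rather than failing on an empty token.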


## 5. Observations
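- NLTK's default `pos_tag` output uses the Penn Treebank tagset (e.g., NN for a singular noun, VB for a base-form verb, JJ for an adjective).
- PoS tagging depends on correct tokenization; `word_tokenize` emits punctuation such as `.` and `,` as separate tokens, and each receives its own tag.
- For short, clean sentences the tagger is mostly correct but not perfect: in the output above, `brown` is tagged NN rather than JJ, and the name `Prexit` is tagged NN rather than NNP. On noisy or domain-specific text, accuracy may drop further.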
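To make the tag abbreviations easier to read, the tags can be paired with short descriptions. The sketch below is only illustrative: `PENN_TAGS` is a hand-written glossary covering a few tags that appear in this notebook's output, not the full Penn Treebank tagset, and `describe` is a hypothetical helper name.

```python
# Hand-written glossary for a subset of Penn Treebank tags (not exhaustive)
PENN_TAGS = {
    'DT': 'determiner',
    'JJ': 'adjective',
    'NN': 'noun, singular or mass',
    'NNP': 'proper noun, singular',
    'PRP': 'personal pronoun',
    'VBP': 'verb, non-3rd person singular present',
    'VBZ': 'verb, 3rd person singular present',
    'VBG': 'verb, gerund or present participle',
    'VBD': 'verb, past tense',
    'IN': 'preposition or subordinating conjunction',
    '.': 'sentence-final punctuation',
}

def describe(tagged):
    """Attach a description to each (token, tag) pair."""
    return [(tok, tag, PENN_TAGS.get(tag, 'not in this glossary'))
            for tok, tag in tagged]

# Tagged output for Sentence 1, copied from the printed results
sentence_1 = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
              ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
              ('dog', 'NN'), ('.', '.')]

for tok, tag, meaning in describe(sentence_1):
    print(f"{tok:8} {tag:4} {meaning}")
```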

## 6. Conclusion

In this practical, Part-of-Speech tagging was implemented using NLTK. Sentences were tokenized and each token was assigned a PoS tag from the Penn Treebank tagset. The exercise demonstrated the role of PoS tagging in syntactic analysis and its usefulness as a preprocessing step for higher-level NLP tasks. The objective of implementing PoS tagging on text was successfully achieved. For better results on larger or domain-specific corpora, one may retrain the tagger on in-domain annotated data, add preprocessing steps, or use other statistical/ML-based taggers.
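As one illustration of the preprocessing role mentioned above, tagged tokens can be filtered or counted by tag before a later step. This is a rough sketch using only the standard library and the Sentence 1 output from this notebook; the variable names are illustrative, and selecting nouns by the `NN` prefix is just one possible choice.

```python
from collections import Counter

# Tagged output for Sentence 1, copied from the printed results
tagged = [('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'),
          ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'),
          ('dog', 'NN'), ('.', '.')]

# How often each tag occurs in the sentence
tag_counts = Counter(tag for _, tag in tagged)

# Keep only tokens whose tag starts with 'NN' (NN, NNS, NNP, NNPS)
nouns = [tok for tok, tag in tagged if tag.startswith('NN')]

print(tag_counts.most_common())
print(nouns)
```

Note that `brown` ends up in the noun list because the tagger labelled it NN, so tagging errors carry over into whatever step consumes the tags.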

