
Lab Objectives:

o Understand domain mismatch in HMMs

o Analyze tag transition patterns in technical writing

Tasks:

1. Collect 20–30 research abstracts.

2. Automatically POS-tag them using NLTK.

3. Treat the tagged data as training data for HMM.

4. Compute:

a) Transition probabilities

b) Emission probabilities

c) Analyze:

 o Most frequent tag transitions

o Apply HMM tagging to a new abstract sentence.

**1. Collect 20–30 research abstracts.**

In [5]:
import pandas as pd

df = pd.read_csv("arxiv_data.csv")

# Column of arxiv_data.csv that holds the abstract text
abstract_column = "summaries"

# Select 25 abstracts
abstracts = df[abstract_column].dropna().astype(str).head(25).tolist()

print("Number of abstracts:", len(abstracts))
print("\nSample abstract:\n")
print(abstracts[0])



Number of abstracts: 25

Sample abstract:

Stereo matching is one of the widely used techniques for inferring depth from
stereo images owing to its robustness and speed. It has become one of the major
topics of research since it finds its applications in autonomous driving,
robotic navigation, 3D reconstruction, and many other fields. Finding pixel
correspondences in non-textured, occluded and reflective areas is the major
challenge in stereo matching. Recent developments have shown that semantic cues
from image segmentation can be used to improve the results of stereo matching.
Many deep neural network architectures have been proposed to leverage the
advantages of semantic segmentation in stereo matching. This paper aims to give
a comparison among the state of art networks both in terms of accuracy and in
terms of speed which are of higher importance in real-time applications.


**2. Automatically POS-tag them using NLTK.**

In [6]:
import re

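# Simple regex tokenizer: keeps runs of word characters and drops punctuation,
# so hyphenated terms such as "real-time" are split into two tokens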
def tokenize(text):
    return re.findall(r"\b\w+\b", text)

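# Simple rule-based POS tagger, used here as a stand-in for NLTK's tagger
# (e.g. nltk.pos_tag). Rules are checked in order; any word that matches
# none of them falls back to 'NN'.
def simple_pos_tag(tokens):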
    tagged = []
    for word in tokens:
        if word.lower() in ['the', 'a', 'an']:
            tag = 'DT'
        elif word.endswith('ing'):
            tag = 'VBG'
        elif word.endswith('ed'):
            tag = 'VBD'
        elif word[0].isupper():
            tag = 'NNP'
        else:
            tag = 'NN'
        tagged.append((word, tag))
    return tagged

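# Tag all abstracts. extend() flattens everything into a single list of
# (word, tag) pairs, so abstract and sentence boundaries are not preserved.
tagged_data = []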

for abstract in abstracts:
    tokens = tokenize(abstract)
    tagged = simple_pos_tag(tokens)
    tagged_data.extend(tagged)

print("Sample tagged output:")
print(tagged_data[:20])


Sample tagged output:
[('Stereo', 'NNP'), ('matching', 'VBG'), ('is', 'NN'), ('one', 'NN'), ('of', 'NN'), ('the', 'DT'), ('widely', 'NN'), ('used', 'VBD'), ('techniques', 'NN'), ('for', 'NN'), ('inferring', 'VBG'), ('depth', 'NN'), ('from', 'NN'), ('stereo', 'NN'), ('images', 'NN'), ('owing', 'VBG'), ('to', 'NN'), ('its', 'NN'), ('robustness', 'NN'), ('and', 'NN')]
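
The sample output shows function words such as "is", "of", and "for" falling through to NN, because the rules only cover determiners, the -ing and -ed suffixes, and capitalization. The sketch below repeats the same rules on the first sentence of the sample abstract and counts the resulting tags, to make that skew visible. It is only a diagnostic; the variable names `tag_distribution` and `nn_share` are introduced here for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    return re.findall(r"\b\w+\b", text)

def simple_pos_tag(tokens):
    # Same rules as the cell above, repeated so this block runs on its own
    tagged = []
    for word in tokens:
        if word.lower() in ['the', 'a', 'an']:
            tag = 'DT'
        elif word.endswith('ing'):
            tag = 'VBG'
        elif word.endswith('ed'):
            tag = 'VBD'
        elif word[0].isupper():
            tag = 'NNP'
        else:
            tag = 'NN'
        tagged.append((word, tag))
    return tagged

sentence = ("Stereo matching is one of the widely used techniques for "
            "inferring depth from stereo images owing to its robustness and speed.")

tagged = simple_pos_tag(tokenize(sentence))
tag_distribution = Counter(tag for _, tag in tagged)
nn_share = tag_distribution['NN'] / len(tagged)

print(tag_distribution)
print(f"Share of tokens tagged NN: {nn_share:.2f}")
# Suffix rules also misfire: the noun "speed" ends in "ed"
print(dict(tagged)['speed'])
```

Because every unmatched word becomes NN, the transition and emission tables estimated from this data mostly describe the fallback rule rather than real syntax.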


**3. Treat the tagged data as training data for HMM.**

In [7]:
from collections import defaultdict

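# transition_counts[prev_tag][tag]: how often `tag` directly follows `prev_tag`
transition_counts = defaultdict(lambda: defaultdict(int))
# emission_counts[tag][word]: how often `word` is observed with `tag`
emission_counts = defaultdict(lambda: defaultdict(int))
# tag_counts[tag]: total number of tokens carrying `tag`
tag_counts = defaultdict(int)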

previous_tag = None

for word, tag in tagged_data:
    tag_counts[tag] += 1
    emission_counts[tag][word] += 1

    if previous_tag is not None:
        transition_counts[previous_tag][tag] += 1

    previous_tag = tag

print("Training complete.")


Training complete.
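
The training loop above walks the corpus as one long tag sequence, so the last tag of one abstract is counted as transitioning into the first tag of the next. A minimal sketch of boundary-aware counting follows, assuming the data is kept as a list of tagged sequences. The `START` marker, the `count_transitions` helper, and the toy data are hypothetical and not part of the original notebook.

```python
from collections import defaultdict

START = "<s>"  # hypothetical start-of-sequence marker (an assumption)

def count_transitions(tagged_sequences):
    """Count tag bigrams within each sequence, never across sequences."""
    counts = defaultdict(lambda: defaultdict(int))
    for sequence in tagged_sequences:
        previous_tag = START  # reset at every sequence boundary
        for _, tag in sequence:
            counts[previous_tag][tag] += 1
            previous_tag = tag
    return counts

# Toy data for illustration only
toy_sequences = [
    [("The", "DT"), ("model", "NN"), ("improves", "NN")],
    [("Stereo", "NNP"), ("matching", "VBG")],
]

counts = count_transitions(toy_sequences)
for prev_tag, following in counts.items():
    print(prev_tag, dict(following))
```

Counting from `START` also yields initial-tag counts, which could serve as start probabilities for Viterbi decoding.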


**4. Compute:**

**a) Transition probabilities**

In [8]:
transition_prob = defaultdict(dict)

for prev_tag in transition_counts:
    total = sum(transition_counts[prev_tag].values())
    for tag in transition_counts[prev_tag]:
        transition_prob[prev_tag][tag] = transition_counts[prev_tag][tag] / total

print("Sample Transition Probabilities:\n")

for prev_tag in list(transition_prob.keys())[:3]:
    for tag in list(transition_prob[prev_tag].keys())[:3]:
        print(f"P({tag} | {prev_tag}) = {transition_prob[prev_tag][tag]:.3f}")


Sample Transition Probabilities:

P(VBG | NNP) = 0.045
P(NN | NNP) = 0.623
P(VBD | NNP) = 0.057
P(NN | VBG) = 0.689
P(NNP | VBG) = 0.124
P(VBG | VBG) = 0.038
P(NN | NN) = 0.754
P(DT | NN) = 0.095
P(VBD | NN) = 0.049
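
The cell above is a maximum-likelihood estimate: each transition count is divided by the total number of transitions leaving the previous tag. Written out, with $C(t_i, t_j)$ denoting the number of times tag $t_j$ directly follows tag $t_i$ (notation introduced here for illustration):

```latex
P(t_j \mid t_i) = \frac{C(t_i, t_j)}{\sum_{k} C(t_i, t_k)}
```

As a rough consistency check against the numbers reported in this notebook: NN → NN occurs 2752 times and P(NN | NN) = 0.754, which implies roughly 2752 / 0.754 ≈ 3650 transitions out of NN. The 347 NN → DT transitions then give 347 / 3650 ≈ 0.095, in line with the printed P(DT | NN).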


**b) Emission probabilities**

In [9]:
emission_prob = defaultdict(dict)

for tag in emission_counts:
    total = sum(emission_counts[tag].values())
    for word in emission_counts[tag]:
        emission_prob[tag][word] = emission_counts[tag][word] / total

print("Sample Emission Probabilities (NN):\n")

for word, prob in list(emission_prob['NN'].items())[:10]:
    print(f"P({word} | NN) = {prob:.4f}")


Sample Emission Probabilities (NN):

P(is | NN) = 0.0159
P(one | NN) = 0.0025
P(of | NN) = 0.0414
P(widely | NN) = 0.0008
P(techniques | NN) = 0.0011
P(for | NN) = 0.0143
P(depth | NN) = 0.0003
P(from | NN) = 0.0074
P(stereo | NN) = 0.0011
P(images | NN) = 0.0038
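
The emission table above only contains (tag, word) pairs seen in training, so any unseen pair has no probability at all. The Viterbi cell later in this notebook handles this with a fixed 1e-6 fallback. One common alternative, which the notebook does not use, is add-alpha (Laplace) smoothing over a fixed vocabulary. The sketch below shows it on toy counts; the function name `smoothed_emission_prob` and the data are assumptions for illustration.

```python
def smoothed_emission_prob(emission_counts, vocabulary, alpha=1.0):
    """Add-alpha smoothed P(word | tag) over a fixed vocabulary."""
    vocab_size = len(vocabulary)
    probs = {}
    for tag, word_counts in emission_counts.items():
        # Every vocabulary word receives alpha pseudo-counts
        total = sum(word_counts.values()) + alpha * vocab_size
        probs[tag] = {
            word: (word_counts.get(word, 0) + alpha) / total
            for word in vocabulary
        }
    return probs

# Toy counts for illustration only
toy_counts = {"DT": {"the": 3, "a": 1}, "NN": {"model": 2}}
vocabulary = {"the", "a", "model"}

probs = smoothed_emission_prob(toy_counts, vocabulary)
for tag in sorted(probs):
    for word in sorted(vocabulary):
        print(f"P({word} | {tag}) = {probs[tag][word]:.3f}")
```

Unlike a constant floor, the smoothed values for each tag still sum to 1 over the vocabulary.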


**c) Analyze:**

**o Most frequent tag transitions**

In [10]:
from collections import Counter

transition_pairs = Counter()

for prev_tag in transition_counts:
    for tag in transition_counts[prev_tag]:
        transition_pairs[(prev_tag, tag)] += transition_counts[prev_tag][tag]

print("Top 10 Most Frequent Transitions:\n")

for pair, count in transition_pairs.most_common(10):
    print(f"{pair[0]} → {pair[1]} : {count}")


Top 10 Most Frequent Transitions:

NN → NN : 2752
NN → DT : 347
DT → NN : 341
NNP → NN : 208
NN → NNP : 207
VBD → NN : 204
NN → VBD : 179
NN → VBG : 164
VBG → NN : 144
NNP → NNP : 66
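
To put the ranking above in proportion, the sketch below takes the ten printed counts and computes how much of that listed mass belongs to NN → NN. The shares are relative to these ten transitions only, not to the whole corpus, and the variable names are introduced here for illustration.

```python
# Counts copied from the "Top 10 Most Frequent Transitions" output above
top_transitions = {
    ("NN", "NN"): 2752,
    ("NN", "DT"): 347,
    ("DT", "NN"): 341,
    ("NNP", "NN"): 208,
    ("NN", "NNP"): 207,
    ("VBD", "NN"): 204,
    ("NN", "VBD"): 179,
    ("NN", "VBG"): 164,
    ("VBG", "NN"): 144,
    ("NNP", "NNP"): 66,
}

listed_total = sum(top_transitions.values())
nn_nn_share = top_transitions[("NN", "NN")] / listed_total

# Transitions that involve NN on either side
involving_nn = sum(c for (prev, curr), c in top_transitions.items()
                   if "NN" in (prev, curr))

print(f"Total of listed transitions: {listed_total}")
print(f"NN -> NN share of listed transitions: {nn_nn_share:.3f}")
print(f"Listed transitions involving NN: {involving_nn / listed_total:.3f}")
```

Since the rule-based tagger assigns NN to every word it does not recognize, this dominance probably reflects the fallback rule more than noun-heavy technical writing.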


**o Apply HMM tagging to a new abstract sentence.**

In [11]:
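def viterbi(sentence_tokens, tags, transition_prob, emission_prob):
    # V[t][tag]: probability of the best tag path ending in `tag` at position t
    V = [{}]
    # path[tag]: the best tag sequence found so far that ends in `tag`
    path = {}

    # Initialization: no start probabilities were estimated, so the first
    # column uses the emission probability alone (1e-6 for unseen pairs)
    for tag in tags:
        V[0][tag] = emission_prob[tag].get(sentence_tokens[0], 1e-6)
        path[tag] = [tag]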

    # Recursion
    for t in range(1, len(sentence_tokens)):
        V.append({})
        new_path = {}

        for curr_tag in tags:
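            # Start below zero so best_prev_tag is always assigned, even if
            # every product underflows to 0.0 on a long sentence
            max_prob = -1.0
            best_prev_tag = None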

            for prev_tag in tags:
                prob = V[t-1][prev_tag] * \
                       transition_prob[prev_tag].get(curr_tag, 1e-6) * \
                       emission_prob[curr_tag].get(sentence_tokens[t], 1e-6)

                if prob > max_prob:
                    max_prob = prob
                    best_prev_tag = prev_tag

            V[t][curr_tag] = max_prob
            new_path[curr_tag] = path[best_prev_tag] + [curr_tag]

        path = new_path

    final_tag = max(V[-1], key=V[-1].get)
    return path[final_tag]


# Test sentence
test_sentence = "The proposed model improves classification accuracy"
tokens = tokenize(test_sentence)

tags = list(tag_counts.keys())
predicted_tags = viterbi(tokens, tags, transition_prob, emission_prob)

print("\nHMM Tagging Result:\n")
for word, tag in zip(tokens, predicted_tags):
    print(word, "→", tag)



HMM Tagging Result:

The → DT
proposed → VBD
model → NN
improves → NN
classification → NN
accuracy → NN
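
The Viterbi cell above multiplies raw probabilities, which shrink toward zero as the sentence grows. A common remedy is to add log probabilities instead and recover the path through backpointers. The sketch below is one way to do that, not the notebook's method; it keeps the same 1e-6 fallback for unseen events, and the function name `viterbi_log` and the toy tables are assumptions for illustration.

```python
import math

def viterbi_log(tokens, tags, transition_prob, emission_prob, floor=1e-6):
    """Most likely tag sequence, computed with log probabilities."""
    if not tokens:
        return []

    def log_p(table, key, item):
        # Unseen events fall back to `floor`, mirroring the cell above
        return math.log(table.get(key, {}).get(item, floor))

    # scores[t][tag]: best log probability of any path ending in tag at t
    scores = [{tag: log_p(emission_prob, tag, tokens[0]) for tag in tags}]
    backpointers = []

    for token in tokens[1:]:
        previous = scores[-1]
        current, pointers = {}, {}
        for tag in tags:
            best_prev = max(
                tags, key=lambda p: previous[p] + log_p(transition_prob, p, tag))
            current[tag] = (previous[best_prev]
                            + log_p(transition_prob, best_prev, tag)
                            + log_p(emission_prob, tag, token))
            pointers[tag] = best_prev
        scores.append(current)
        backpointers.append(pointers)

    # Follow backpointers from the best final tag
    path = [max(scores[-1], key=scores[-1].get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]

# Toy tables for illustration only
toy_tags = ["DT", "NN"]
toy_transition = {"DT": {"NN": 0.9, "DT": 0.1}, "NN": {"NN": 0.7, "DT": 0.3}}
toy_emission = {"DT": {"the": 1.0}, "NN": {"model": 0.5, "accuracy": 0.5}}

toy_tokens = ["the", "model", "accuracy"]
print(viterbi_log(toy_tokens, toy_tags, toy_transition, toy_emission))
```

Storing one backpointer per tag and position also avoids copying a full path list at every step.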
