📌 Overview
This project implements a Part-of-Speech (POS) tagger based on a Hidden Markov Model (HMM), extended with suffix rules and word-shape features for handling words not seen during training. The model is trained on the combined Treebank and Brown corpora from the NLTK library, both mapped to the Universal POS tagset.

🧱 Key Components
1. Data Preparation
Corpus Loading: The code loads tagged sentences from the Treebank and Brown corpora, both mapped to the Universal POS tagset.

Data Splitting: The combined sentences are split 80/20 into training and test sets with train_test_split (random_state=42), so the model is evaluated on sentences it has not seen.

2. EnhancedHMMTagger Class
This class encapsulates the HMM-based POS tagging logic, plus the extra handling for unknown words.

a. Frequency Computation
Word Frequencies: Counts how often each lowercased word occurs in the training set.

Tag Frequencies: Counts how often each POS tag occurs.

Word-Tag Frequencies: Counts how often each word occurs with each tag.
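A minimal sketch of the three counts on a toy corpus. The two sentences and the variable names are illustrative only; the tagger itself reads its pairs from the NLTK corpora.

```python
from collections import Counter, defaultdict

# Toy tagged corpus (made up for this example)
toy_sents = [
    [("The", "DET"), ("dog", "NOUN"), ("runs", "VERB")],
    [("the", "DET"), ("run", "NOUN"), ("ends", "VERB")],
]

word_freq = Counter()
tag_freq = Counter()
word_tag_freq = defaultdict(Counter)

for sent in toy_sents:
    for word, tag in sent:
        key = word.lower()            # words are lowercased before counting
        word_freq[key] += 1
        tag_freq[tag] += 1
        word_tag_freq[key][tag] += 1

print(word_freq["the"], tag_freq["NOUN"], dict(word_tag_freq["the"]))
```

Because of the lowercasing, "The" and "the" share one entry, which is why the count for "the" is 2.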

b. Transition Probabilities
Tag Bigrams: Counts consecutive tag pairs within each sentence.

Transition Matrix: Builds a matrix holding the probability of moving from one tag to the next, with Laplace (add-one) smoothing so that unseen transitions keep a small non-zero probability.
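The smoothing step can be sketched on toy data as follows. The tag sequences and the helper name `transition_prob` are assumptions made for this example; the formula mirrors the add-one estimate used in setup_transition_matrix().

```python
from collections import Counter, defaultdict

tags = ["DET", "NOUN", "VERB"]
toy_tag_sequences = [["DET", "NOUN", "VERB"], ["DET", "NOUN", "NOUN"]]

# Count tag bigrams within each sentence
bigrams = defaultdict(Counter)
for seq in toy_tag_sequences:
    for prev_tag, next_tag in zip(seq, seq[1:]):
        bigrams[prev_tag][next_tag] += 1

def transition_prob(prev_tag, next_tag):
    """Add-one smoothed P(next_tag | prev_tag)."""
    total = sum(bigrams[prev_tag].values())
    return (bigrams[prev_tag][next_tag] + 1) / (total + len(tags))

print(transition_prob("DET", "NOUN"))   # seen twice: (2 + 1) / (2 + 3)
print(transition_prob("DET", "VERB"))   # never seen: (0 + 1) / (2 + 3)
```

Each row still sums to 1 after smoothing, because the denominator grows by one for every tag that receives an extra count.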

c. Start Probabilities
Sentence Start Tags: Estimates the probability that each tag begins a sentence, again with Laplace smoothing.
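Written out, the smoothed estimate computed in setup_start_probs() has the form below. The symbols are introduced here for illustration: C_start(t) is the number of training sentences whose first tag is t, N is the number of non-empty training sentences, and |T| is the number of distinct tags.

```latex
\pi(t) = \frac{C_{\text{start}}(t) + 1}{N + |T|}
```

As a made-up example, if 11 of 20 sentences start with DET and all 12 Universal tags occur in training, then π(DET) = (11 + 1) / (20 + 12) = 0.375, while a tag that never starts a sentence still receives 1/32.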

d. Suffix Rules for Unknown Words
Suffix-Based Tagging: Defines a table of suffixes (e.g., 'ing', 'ed', 'ly'), each mapped to likely tags with hand-set weights, used to guess the tag of an unknown word. Suffixes are tried from longest to shortest, and the first match decides.
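A small sketch of the lookup, using a subset of the suffix table and invented words. The helper name `suffix_guess` is hypothetical; the class itself does the same matching inside handle_unknown_word().

```python
# Subset of the suffix table, for illustration only
suffix_rules = {
    "ing": [("VERB", 0.9), ("NOUN", 0.1)],
    "ment": [("NOUN", 0.98)],
    "s": [("NOUN", 0.8), ("VERB", 0.2)],
}
# Longest suffixes are tried first
sorted_suffixes = sorted(suffix_rules, key=len, reverse=True)

def suffix_guess(word):
    """Return the (tag, weight) list of the first matching suffix, or None."""
    lowered = word.lower()
    for suffix in sorted_suffixes:
        if lowered.endswith(suffix):
            return suffix_rules[suffix]
    return None

print(suffix_guess("Blorping"))   # matches 'ing'
print(suffix_guess("zorbment"))   # matches 'ment'
print(suffix_guess("qux"))        # no suffix matches
```

When no suffix matches, the tagger moves on to the word-feature fallback instead of returning None.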

e. Emission Probabilities
Known Words: For words seen during training, estimates the probability of the word given the tag from the word-tag counts, with add-one smoothing.

Unknown Words: For unseen words, falls back to the suffix rules and then to word features (digits, capitalization), which produce a hand-set score used in place of an emission probability.
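The two cases can be sketched as below. The counts are toy values, the function names are hypothetical, and the fallback is a simplified version of the feature checks (it omits the all-uppercase branch).

```python
# Toy counts standing in for the statistics gathered from the training set
tag_freq = {"DET": 2, "NOUN": 3, "VERB": 1}
word_tag_freq = {
    "the": {"DET": 2},
    "dog": {"NOUN": 2},
    "runs": {"NOUN": 1, "VERB": 1},
}
vocab_size = len(word_tag_freq)

def known_emission(word, tag):
    """Add-one smoothed P(word | tag) for a word seen in training."""
    count = word_tag_freq[word.lower()].get(tag, 0)
    return (count + 1) / (tag_freq[tag] + vocab_size)

def unknown_score(word, tag, n_tags):
    """Hand-set fallback score for an unseen word with no matching suffix."""
    rest = n_tags - 1  # the leftover mass is shared by the other tags
    if word.isdigit():
        return 0.95 if tag == "NUM" else 0.05 / rest
    if word.istitle() and len(word) > 3:
        return 0.8 if tag == "NOUN" else 0.2 / rest
    return 0.5 if tag == "NOUN" else 0.5 / rest

print(known_emission("the", "DET"))
print(unknown_score("2024", "NUM", n_tags=12))
print(unknown_score("Zorbland", "NOUN", n_tags=12))
```

Note that the two functions are on different scales: the first is a smoothed relative frequency, the second a fixed weight chosen by hand.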

f. Viterbi Algorithm for Decoding
Initialization: Sets the score of each tag for the first word from its start probability and emission probability, in log space.

Recursion: For each subsequent word and each tag, keeps the best-scoring previous tag (dynamic programming) and records it in a backpointer table.

Termination: Picks the best-scoring tag for the last word and follows the backpointers backwards to recover the full tag sequence.
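The three steps can be traced on a toy model small enough to check by hand. All probabilities below are invented for the example, and the dict-based tables are a simplification of the NumPy arrays used in viterbi_tag().

```python
import math

tags = ["DET", "NOUN", "VERB"]
start = {"DET": 0.8, "NOUN": 0.1, "VERB": 0.1}
trans = {
    "DET":  {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.3, "VERB": 0.6},
    "VERB": {"DET": 0.5, "NOUN": 0.3, "VERB": 0.2},
}
emit = {
    "DET":  {"the": 0.9, "dog": 0.05, "barks": 0.05},
    "NOUN": {"the": 0.05, "dog": 0.8, "barks": 0.15},
    "VERB": {"the": 0.05, "dog": 0.15, "barks": 0.8},
}

def viterbi(words):
    # Initialization: start probability times emission, in log space
    scores = [{t: math.log(start[t]) + math.log(emit[t][words[0]]) for t in tags}]
    back = [{}]
    # Recursion: for each tag keep only the best previous tag
    for word in words[1:]:
        prev = scores[-1]
        row, ptr = {}, {}
        for t in tags:
            best_prev = max(tags, key=lambda p: prev[p] + math.log(trans[p][t]))
            row[t] = prev[best_prev] + math.log(trans[best_prev][t]) + math.log(emit[t][word])
            ptr[t] = best_prev
        scores.append(row)
        back.append(ptr)
    # Termination: best final tag, then follow the backpointers
    path = [max(tags, key=lambda t: scores[-1][t])]
    for ptr in reversed(back[1:]):
        path.append(ptr[path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))
```

Working in log space turns the products of small probabilities into sums, which avoids numerical underflow on long sentences.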

🧪 Testing and Evaluation
Sample Sentences: The model is run on four hand-written sentences to show its output.

Accuracy Evaluation: Accuracy is measured on the first 100 sentences of the test set, as the percentage of tokens whose predicted tag matches the gold tag.
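Token-level accuracy as described above can be sketched like this. The function name and the two toy sentences are assumptions for the example.

```python
def token_accuracy(predicted_sents, gold_sents):
    """Percentage of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for predicted, gold in zip(predicted_sents, gold_sents):
        correct += sum(p == g for p, g in zip(predicted, gold))
        total += len(gold)
    # Guard against an empty evaluation set
    return 100 * correct / total if total else 0.0

gold = [["DET", "NOUN", "VERB"], ["PRON", "VERB"]]
pred = [["DET", "NOUN", "NOUN"], ["PRON", "VERB"]]
print(f"{token_accuracy(pred, gold):.2f}%")   # 4 of 5 tokens correct
```

Counting tokens rather than sentences means long sentences weigh more in the final figure than short ones.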

📈 Performance Insights
Handling Unknown Words: Suffix rules and word features give the model a way to score words not seen during training, instead of treating every tag as equally likely for them.

Transition and Emission Probabilities: Laplace smoothing keeps unseen tag transitions and word-tag combinations from receiving zero probability.

Viterbi Algorithm: Viterbi finds the most probable tag sequence in time linear in the sentence length and quadratic in the number of tags, rather than enumerating every possible sequence.




In [None]:
import nltk
nltk.download('punkt_tab')
import numpy as np
import pandas as pd
import math
from collections import defaultdict
from nltk.corpus import treebank, brown
from nltk.tokenize import word_tokenize
from sklearn.model_selection import train_test_split

# Download and prepare data
nltk.download(['treebank', 'brown', 'universal_tagset', 'punkt'])

# Combine Treebank and Brown corpora
tagged_sentences = list(treebank.tagged_sents(tagset='universal')) + \
                   list(brown.tagged_sents(tagset='universal'))

# Split into train and test
train_set, test_set = train_test_split(tagged_sentences, test_size=0.2, random_state=42)

class EnhancedHMMTagger:
    def __init__(self, train_set):
        self.train_set = train_set
        self.train_tagged_words = [pair for sent in train_set for pair in sent]
        self.setup_frequencies()
        self.setup_transition_matrix()
        self.setup_start_probs()
        self.setup_suffix_rules()

    def setup_frequencies(self):
        self.word_freq = defaultdict(int)
        self.tag_freq = defaultdict(int)
        self.word_tag_freq = defaultdict(lambda: defaultdict(int))

        for word, tag in self.train_tagged_words:
            word_lower = word.lower()
            self.word_freq[word_lower] += 1
            self.tag_freq[tag] += 1
            self.word_tag_freq[word_lower][tag] += 1

        self.vocab = set(self.word_freq.keys())
        self.tags = sorted(self.tag_freq.keys())

    def setup_transition_matrix(self):
        self.tag_bigrams = defaultdict(lambda: defaultdict(int))
        for sentence in self.train_set:
            sentence_tags = [tag for _, tag in sentence]
            for i in range(len(sentence_tags)-1):
                current_tag = sentence_tags[i]
                next_tag = sentence_tags[i+1]
                self.tag_bigrams[current_tag][next_tag] += 1

        self.transition_matrix = pd.DataFrame(0.0, index=self.tags, columns=self.tags)
        for t1 in self.tags:
            total = sum(self.tag_bigrams[t1].values())
            for t2 in self.tags:
                self.transition_matrix.loc[t1, t2] = \
                    (self.tag_bigrams[t1].get(t2, 0) + 1) / (total + len(self.tags))

    def setup_start_probs(self):
        self.start_counts = defaultdict(int)
        for sentence in self.train_set:
            if sentence:
                self.start_counts[sentence[0][1]] += 1

        total_starts = sum(self.start_counts.values())
        self.start_probs = {tag: (self.start_counts.get(tag, 0) + 1) / \
                          (total_starts + len(self.tags)) for tag in self.tags}

    def setup_suffix_rules(self):
        # Enhanced suffix rules with weights
        self.suffix_rules = {
            'ing': [('VERB', 0.9), ('NOUN', 0.1)],
            'ed': [('VERB', 0.85), ('ADJ', 0.15)],
            'ly': [('ADV', 0.95)],
            'ment': [('NOUN', 0.98)],
            's': [('NOUN', 0.8), ('VERB', 0.2)],
            'ion': [('NOUN', 0.95)],
            'able': [('ADJ', 0.9)],
            'ive': [('ADJ', 0.9)],
            'est': [('ADJ', 0.95)]
        }
        self.sorted_suffixes = sorted(self.suffix_rules.keys(), key=len, reverse=True)

    def emission_prob(self, word, tag):
        word_lower = word.lower()
        if word_lower in self.word_tag_freq:
            return (self.word_tag_freq[word_lower].get(tag, 0) + 1) / \
                   (self.tag_freq[tag] + len(self.vocab))
        else:
            prob = self.handle_unknown_word(word, tag)
            return prob if prob > 0 else 1e-10

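    def handle_unknown_word(self, word, tag):
        # Try suffixes longest first; the first matching suffix decides the score
        for suffix in self.sorted_suffixes:
            if word.lower().endswith(suffix):
                for rule_tag, prob in self.suffix_rules[suffix]:
                    if tag == rule_tag:
                        return prob
                # Suffix matched but lists no weight for this tag
                return 0.0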

        features = {
            'is_title': word.istitle(),
            'is_upper': word.isupper(),
            'is_digit': word.isdigit(),
            'has_hyphen': '-' in word,
            'is_numeric': any(c.isdigit() for c in word)
        }

        if features['is_digit']:
            return 0.95 if tag == 'NUM' else 0.05/(len(self.tags)-1)
        elif features['is_title'] and len(word) > 3:
            return 0.8 if tag == 'NOUN' else 0.2/(len(self.tags)-1)
        elif features['is_upper']:
            return 0.7 if tag == 'NOUN' else 0.3/(len(self.tags)-1)
        else:
            return 0.5 if tag == 'NOUN' else 0.5/(len(self.tags)-1)

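    def viterbi_tag(self, sentence):
        words = list(sentence)
        n = len(words)
        m = len(self.tags)
        if n == 0:
            # Nothing to tag; avoids indexing into empty tables below
            return []

        # Log-score table and backpointers: one row per word, one column per tag
        viterbi = np.zeros((n, m))
        backpointer = np.zeros((n, m), dtype=int)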

        # Initialization
        for j, tag in enumerate(self.tags):
            viterbi[0][j] = math.log(self.start_probs[tag] + 1e-10) + \
                           math.log(self.emission_prob(words[0], tag) + 1e-10)
            backpointer[0][j] = -1

        # Recursion
        for t in range(1, n):
            for j, current_tag in enumerate(self.tags):
                max_prob = -math.inf
                best_prev = 0
                for i, prev_tag in enumerate(self.tags):
                    prob = viterbi[t-1][i] + \
                          math.log(self.transition_matrix.loc[prev_tag, current_tag] + 1e-10)
                    if prob > max_prob:
                        max_prob = prob
                        best_prev = i

                viterbi[t][j] = max_prob + math.log(self.emission_prob(words[t], current_tag) + 1e-10)
                backpointer[t][j] = best_prev

        # Termination
        best_last = np.argmax(viterbi[-1])
        best_path = [best_last]
        for t in range(n-1, 0, -1):
            best_last = backpointer[t][best_last]
            best_path.insert(0, best_last)

        return list(zip(words, [self.tags[idx] for idx in best_path]))

# Initialize and test the final tagger
final_tagger = EnhancedHMMTagger(train_set)

test_sentences = [
    "The quick brown fox jumps over the lazy dog .",
    "Natural language processing is a fascinating field of study .",
    "I would like to order two pizzas with extra cheese .",
    "Machine learning algorithms can improve over time ."
]

print("Final HMM POS Tagger Results:")
for sent in test_sentences:
    words = word_tokenize(sent)
    tagged = final_tagger.viterbi_tag(words)
    print("\nSentence:", sent)
    print("POS Tags:", tagged)

# Evaluation
correct = total = 0
for sent in test_set[:100]:  # Evaluate on a sample
    words, gold_tags = zip(*sent)
    predicted = final_tagger.viterbi_tag(words)
    predicted_tags = [tag for _, tag in predicted]
    correct += sum(p == g for p, g in zip(predicted_tags, gold_tags))
    total += len(gold_tags)
print("\nAccuracy on Test Set (Sample of 100 sentences): {:.2f}%".format(100 * correct / total))


[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt_tab.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


Final HMM POS Tagger Results:

Sentence: The quick brown fox jumps over the lazy dog .
POS Tags: [('The', 'DET'), ('quick', 'ADJ'), ('brown', 'NOUN'), ('fox', 'NOUN'), ('jumps', 'NOUN'), ('over', 'ADP'), ('the', 'DET'), ('lazy', 'ADJ'), ('dog', 'NOUN'), ('.', '.')]

Sentence: Natural language processing is a fascinating field of study .
POS Tags: [('Natural', 'ADJ'), ('language', 'NOUN'), ('processing', 'VERB'), ('is', 'VERB'), ('a', 'DET'), ('fascinating', 'ADJ'), ('field', 'NOUN'), ('of', 'ADP'), ('study', 'NOUN'), ('.', '.')]

Sentence: I would like to order two pizzas with extra cheese .
POS Tags: [('I', 'PRON'), ('would', 'VERB'), ('like', 'VERB'), ('to', 'ADP'), ('order', 'NOUN'), ('two', 'NUM'), ('pizzas', 'NOUN'), ('with', 'ADP'), ('extra', 'ADJ'), ('cheese', 'NOUN'), ('.', '.')]

Sentence: Machine learning algorithms can improve over time .
POS Tags: [('Machine', 'NOUN'), ('learning', 'VERB'), ('algorithms', 'NOUN'), ('can', 'VERB'), ('improve', 'VERB'), ('over', 'ADP'), ('tim

📌 Overview
This project implements a Machine Learning-based Part-of-Speech (POS) tagger using Logistic Regression. It relies on a set of contextual and morphological features extracted from words and their surrounding context. The model is trained on the Treebank and Brown corpora with Universal POS tags, then evaluated on unseen data and new sentences.

🧱 Key Components
1. Data Preparation
Corpus Loading:

Combines tagged sentences from the treebank and brown corpora using the 'universal' POS tagset.

Combining the two corpora gives more varied training text than either one alone.

Train-Test Split:

Randomly splits the data into 80% training and 20% test using train_test_split().

2. Feature Engineering: word2features()
For each word in a sentence, this function extracts a rich set of features:

Word-level features:

Lowercased word

Whether it's uppercase, title-cased, or a digit

Last 2 and 3 letters (suffixes for morphological cues)

Contextual features:

Lowercased form and title-case flag of the previous and next words

Special markers for beginning (BOS) or end (EOS) of a sentence

These features help the model learn context-sensitive patterns.
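To make the feature set concrete, here is a compact variant of the extractor applied to the first word of a toy sentence. As an assumption for brevity, this version takes plain tokens rather than (word, tag) pairs.

```python
def word2features(tokens, i):
    """Features for tokens[i]; a compact variant that takes plain tokens."""
    word = tokens[i]
    features = {
        "word.lower()": word.lower(),
        "word.isupper()": word.isupper(),
        "word.istitle()": word.istitle(),
        "word.isdigit()": word.isdigit(),
        "suffix-3": word[-3:],
        "suffix-2": word[-2:],
    }
    if i > 0:
        features["-1:word.lower()"] = tokens[i - 1].lower()
        features["-1:word.istitle()"] = tokens[i - 1].istitle()
    else:
        features["BOS"] = True   # no previous word
    if i < len(tokens) - 1:
        features["+1:word.lower()"] = tokens[i + 1].lower()
        features["+1:word.istitle()"] = tokens[i + 1].istitle()
    else:
        features["EOS"] = True   # no next word
    return features

tokens = ["The", "fox", "jumps"]
for name, value in word2features(tokens, 0).items():
    print(f"{name!r}: {value!r}")
```

For the first token the dictionary carries the BOS flag and next-word features but no previous-word features; the last token is the mirror image.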

3. Training the Classifier
Vectorization:

Converts feature dictionaries into a sparse numeric matrix using DictVectorizer, which one-hot encodes string-valued features and stores boolean features as 0/1.

Model:

Trains a Logistic Regression classifier (sklearn.linear_model.LogisticRegression) to predict tags based on these features.

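max_iter=200 raises the solver's iteration limit above the default of 100; this gives the optimizer more room to converge but does not guarantee convergence.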
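The vectorization step above can be illustrated with a toy re-implementation of the encoding idea. This is a sketch written for this example, not scikit-learn's code, and it returns dense lists where DictVectorizer would return a sparse matrix.

```python
def one_hot_encode(feature_dicts):
    """Toy version of dict-to-vector encoding (illustrative only)."""
    columns = set()
    for feats in feature_dicts:
        for name, value in feats.items():
            # String values become one indicator column per (name, value) pair
            columns.add(f"{name}={value}" if isinstance(value, str) else name)
    columns = sorted(columns)
    index = {col: j for j, col in enumerate(columns)}

    rows = []
    for feats in feature_dicts:
        row = [0.0] * len(columns)
        for name, value in feats.items():
            if isinstance(value, str):
                row[index[f"{name}={value}"]] = 1.0
            else:
                row[index[name]] = float(value)   # True -> 1.0, False -> 0.0
        rows.append(row)
    return columns, rows

columns, rows = one_hot_encode([
    {"word.lower()": "the", "BOS": True},
    {"word.lower()": "dog", "word.istitle()": False},
])
print(columns)
print(rows)
```

Each distinct word therefore gets its own column, which is why the real feature matrix is very wide and stored in sparse form.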

4. Evaluation
Accuracy Check:

Tests on the first 100 sentences of the (already shuffled) test set.

Reports the model’s accuracy using accuracy_score().

5. Prediction on New Sentences
predict_tags():

Tokenizes input sentences.

Applies word2features() and uses the trained classifier to predict tags.

Returns word-tag pairs for easy interpretation.

📈 Sample Results
Example predictions are shown for:

"The quick brown fox jumps over the lazy dog ."

"I would like to order two pizzas with extra cheese ."

Most tokens in these unseen sentences are tagged plausibly, but not all: in the first sentence 'jumps' is tagged NOUN rather than VERB.

✅ Strengths
Context-aware: Uses neighboring words as features.

Morphological info: Leverages suffixes for better handling of unknown words.

Simple yet powerful: Logistic Regression is fast and interpretable.

⚠️ Limitations
Independent tag prediction: Predicts tags word-by-word, without sequence-level optimization like in HMM or CRF.

No tag history: Previously predicted tags are not used as features, so a decision about one word cannot inform the tag chosen for the next.


In [None]:
import nltk
from nltk.corpus import treebank, brown
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 'punkt' and 'punkt_tab' are needed by nltk.word_tokenize() further down
nltk.download(['treebank', 'brown', 'universal_tagset', 'punkt', 'punkt_tab'])

# Load data
tagged_sentences = list(treebank.tagged_sents(tagset='universal')) + \
                   list(brown.tagged_sents(tagset='universal'))

train_sents, test_sents = train_test_split(tagged_sentences, test_size=0.2, random_state=42)

# ---------- Feature Engineering ----------
def word2features(sent, i):
    word = sent[i][0]
    features = {
        'word.lower()': word.lower(),
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
        'suffix-3': word[-3:],
        'suffix-2': word[-2:],
    }
    if i > 0:
        prev = sent[i-1][0]
        features.update({
            '-1:word.lower()': prev.lower(),
            '-1:word.istitle()': prev.istitle(),
        })
    else:
        features['BOS'] = True

    if i < len(sent) - 1:
        nxt = sent[i+1][0]
        features.update({
            '+1:word.lower()': nxt.lower(),
            '+1:word.istitle()': nxt.istitle(),
        })
    else:
        features['EOS'] = True
    return features

# ---------- Prepare Training Data ----------
X, y = [], []
for sent in train_sents:
    for i in range(len(sent)):
        feats = word2features(sent, i)
        X.append(feats)
        y.append(sent[i][1])

# Vectorize
vec = DictVectorizer(sparse=True)
X_vectorized = vec.fit_transform(X)

# Train
clf = LogisticRegression(max_iter=200)
clf.fit(X_vectorized, y)

# ---------- Evaluation ----------
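def predict_tags(sentence):
    # word2features() only reads the word from each (word, tag) pair,
    # so pair every token with a placeholder tag once, up front
    dummy_sent = [(word, None) for word in sentence]
    feats = [word2features(dummy_sent, i) for i in range(len(dummy_sent))]
    feats_vec = vec.transform(feats)
    return list(zip(sentence, clf.predict(feats_vec)))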

# Test Accuracy
X_test, y_test = [], []
for sent in test_sents[:100]:  # Sample
    for i in range(len(sent)):
        X_test.append(word2features(sent, i))
        y_test.append(sent[i][1])
X_test_vec = vec.transform(X_test)
y_pred = clf.predict(X_test_vec)

print("Accuracy on test set (100 sentences):", round(accuracy_score(y_test, y_pred) * 100, 2), "%")

# ---------- Try New Sentences ----------
test_sentences = [
    "The quick brown fox jumps over the lazy dog .",
    "I would like to order two pizzas with extra cheese ."
]
print("\nML-Based POS Tagger Results:")
for sent in test_sentences:
    words = nltk.word_tokenize(sent)
    tagged = predict_tags(words)
    print("\nSentence:", sent)
    print("POS Tags:", tagged)


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Accuracy on test set (100 sentences): 97.48 %

ML-Based POS Tagger Results:

Sentence: The quick brown fox jumps over the lazy dog .
POS Tags: [('The', np.str_('DET')), ('quick', np.str_('ADJ')), ('brown', np.str_('NOUN')), ('fox', np.str_('NOUN')), ('jumps', np.str_('NOUN')), ('over', np.str_('ADP')), ('the', np.str_('DET')), ('lazy', np.str_('ADJ')), ('dog', np.str_('NOUN')), ('.', np.str_('.'))]

Sentence: I would like to order two pizzas with extra cheese .
POS Tags: [('I', np.str_('PRON')), ('would', np.str_('VERB')), ('like', np.str_('VERB')), ('to', np.str_('PRT')), ('order', np.str_('VERB')), ('two', np.str_('NUM')), ('pizzas', np.str_('NOUN')), ('with', np.str_('ADP')), ('extra', np.str_('ADJ')), ('cheese', np.str_('NOUN')), ('.', np.str_('.'))]


📌 Overview
This script builds a Part-of-Speech (POS) tagger using a Conditional Random Field (CRF) model via the sklearn-crfsuite library. CRFs are well-suited for sequence labeling tasks like POS tagging because they consider the dependencies between adjacent tags (unlike logistic regression).

🧱 Breakdown of the Code
1. Data Preparation
Dataset: Uses NLTK's sample of the Penn Treebank corpus (nltk.corpus.treebank) with its original Penn Treebank tags (e.g., DT, JJ, NN), not the Universal tagset.

Splitting: First 3000 sentences for training, the rest for testing.

2. Feature Engineering
Each word is converted into a feature dictionary using word2features(), which includes:

Word-level features:

Lowercase form, suffixes ([-2:], [-3:]), digit check, title case, uppercase

Contextual features:

Lowercase form plus title-case and uppercase flags for the previous and next words

BOS/EOS flags to indicate sentence boundaries

These features are built for all words in each sentence using sent2features().

3. Training the CRF Model
A CRF model is instantiated using:

algorithm='lbfgs' (quasi-Newton optimization)

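c1, c2: Coefficients for L1 and L2 regularization, which limit overfitting

all_possible_transitions=True: Also generates transition features for tag pairs that never occur in the training data, so the model can learn weights for them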

The model is trained using crf.fit(X_train, y_train).

4. Evaluation
Accuracy is calculated using flat_accuracy_score(), which evaluates word-level tag accuracy across all sentences.

5. Tagging New Sentences
Function tag_new_sentence():

Takes a raw sentence, splits it on whitespace, and converts it into a dummy tagged format (with placeholder 'NN' tags that the feature extractor never reads),

Extracts features and predicts tags using the trained CRF model,

Returns word-tag pairs.



In [None]:
# Notebook magic; from a shell, run `pip install sklearn-crfsuite` instead
%pip install sklearn-crfsuite
import nltk
import sklearn_crfsuite
from sklearn_crfsuite import metrics
from nltk.corpus import treebank
nltk.download('treebank')

# Prepare data
sentences = treebank.tagged_sents()
train_sents = sentences[:3000]
test_sents = sentences[3000:]

# Feature extractor for each word
def word2features(sent, i):
    word = sent[i][0]
    features = {
        'bias': 1.0,
        'word.lower()': word.lower(),
        'word[-3:]': word[-3:],
        'word[-2:]': word[-2:],
        'word.isupper()': word.isupper(),
        'word.istitle()': word.istitle(),
        'word.isdigit()': word.isdigit(),
    }
    if i > 0:
        word1 = sent[i-1][0]
        features.update({
            '-1:word.lower()': word1.lower(),
            '-1:word.istitle()': word1.istitle(),
            '-1:word.isupper()': word1.isupper(),
        })
    else:
        features['BOS'] = True  # Beginning of sentence

    if i < len(sent)-1:
        word1 = sent[i+1][0]
        features.update({
            '+1:word.lower()': word1.lower(),
            '+1:word.istitle()': word1.istitle(),
            '+1:word.isupper()': word1.isupper(),
        })
    else:
        features['EOS'] = True  # End of sentence

    return features

def sent2features(sent):
    return [word2features(sent, i) for i in range(len(sent))]

def sent2labels(sent):
    return [label for (token, label) in sent]

def sent2tokens(sent):
    return [token for (token, label) in sent]

# Feature transformation
X_train = [sent2features(s) for s in train_sents]
y_train = [sent2labels(s) for s in train_sents]

X_test = [sent2features(s) for s in test_sents]
y_test = [sent2labels(s) for s in test_sents]

# Train CRF model
crf = sklearn_crfsuite.CRF(
    algorithm='lbfgs',
    c1=0.1,
    c2=0.1,
    max_iterations=100,
    all_possible_transitions=True,
)
crf.fit(X_train, y_train)

# Evaluation
y_pred = crf.predict(X_test)
accuracy = metrics.flat_accuracy_score(y_test, y_pred)
print(f"Accuracy on test set: {accuracy * 100:.2f}%")

# POS tag a new sentence
def tag_new_sentence(crf, sentence):
    tokens = sentence.split()
    # Labels are placeholders; sent2features() only reads the tokens
    dummy_sent = [(word, 'NN') for word in tokens]
    features = sent2features(dummy_sent)
    tags = crf.predict([features])[0]
    return list(zip(tokens, tags))

# Example usage
print("Tagged:", tag_new_sentence(crf, "The quick brown fox jumps over the lazy dog"))


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!


Accuracy on test set: 95.67%
Tagged: [('The', 'DT'), ('quick', 'JJ'), ('brown', 'JJ'), ('fox', 'NN'), ('jumps', 'NNS'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]
