## Experiment 9
## Implement HMM for POS Tagging

**Part-of-Speech (POS) Tagging and Hidden Markov Model (HMM) Overview:**

1. **Part-of-Speech (POS) Tagging:**
    - *Definition:* A natural language processing task that involves assigning a specific grammatical category (such as noun, verb, adjective) to each word in a sentence.
    - *Purpose:* Supports linguistic analysis, information retrieval, and other language processing applications.
    - *Challenges:* Many words are ambiguous (for example, "book" can be a noun or a verb), so the correct tag for a word often depends on its context.

2. **Hidden Markov Model (HMM):**
    - *Definition:* A statistical model of a system that moves through a sequence of hidden states, each of which emits an observable symbol. The next state depends only on the current state (the Markov assumption).
    - *Components:*
        - *States:* Hidden states representing underlying structures.
        - *Observations:* Observable symbols emitted by each state.
        - *Initial probabilities:* Probabilities of starting a sequence in each state.
        - *Transitions:* Probabilities of moving from one state to another.
        - *Emissions:* Probabilities of emitting specific symbols from each state.
    - *Application in POS Tagging:* HMMs are used to model the sequence of POS tags in a sentence. States represent POS tags, and emissions correspond to observed words.
    - *Training:* The parameters (transition and emission probabilities) are estimated from annotated training data, typically from relative frequency counts, enabling the model to predict POS tags for unseen sentences.

3. **POS Tagging with HMM:**
    - *Modeling:* In HMM-based POS tagging, states represent different POS tags, transitions model the likelihood of moving from one tag to another, and emissions capture the probability of observing a word given a POS tag.
    - *Viterbi Algorithm:* A dynamic programming algorithm that finds the most likely sequence of POS tags for a given sequence of words, using the estimated transition and emission probabilities.
    - *Example:* For the sentence "This is a sample sentence.", a well-trained tagger would be expected to produce something like `DT VBZ DT NN NN .` in the Penn Treebank tag set.

4. **Limitations:**
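    - *Sensitivity to Training Data:* Accuracy depends on the quality and quantity of annotated training data. Words and tag transitions that never occur in the training set receive zero probability under unsmoothed estimates.
    - *Word Ambiguity:* A first-order HMM conditions each tag only on the previous tag and the current word, so it can mislabel ambiguous words whose correct tag depends on wider context.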
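
To make the training step and the sensitivity to training data concrete, the following is a minimal sketch of count-based parameter estimation. The tiny corpus, the `<s>` start pseudo-tag, the helper names `transition_prob` and `emission_prob`, and the add-k smoothing scheme are all assumptions made for illustration; they are not taken from NLTK or from the experiment.

```python
from collections import Counter, defaultdict

# Tiny hand-made tagged corpus (hypothetical, for illustration only).
corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]

transition_counts = defaultdict(Counter)  # previous tag -> next tag -> count
emission_counts = defaultdict(Counter)    # tag -> word -> count
for sentence in corpus:
    previous_tag = "<s>"  # pseudo-tag marking the start of a sentence
    for word, tag in sentence:
        transition_counts[previous_tag][tag] += 1
        emission_counts[tag][word] += 1
        previous_tag = tag

vocabulary = {word for sentence in corpus for word, _ in sentence}


def transition_prob(previous_tag, tag):
    """Relative-frequency estimate of P(tag | previous_tag)."""
    total = sum(transition_counts[previous_tag].values())
    return transition_counts[previous_tag][tag] / total


def emission_prob(word, tag, k=0.0):
    """P(word | tag) with add-k smoothing; k=0 gives the unsmoothed estimate."""
    total = sum(emission_counts[tag].values())
    # One extra vocabulary slot is reserved for words never seen in training.
    return (emission_counts[tag][word] + k) / (total + k * (len(vocabulary) + 1))


print(transition_prob("DT", "NN"))         # 1.0
print(emission_prob("dog", "NN"))          # 0.5
print(emission_prob("bird", "NN"))         # 0.0 (unseen word, no smoothing)
print(emission_prob("bird", "NN", k=0.1))  # small but non-zero
```

Without smoothing, a single unseen word gives every candidate tag sequence a probability of zero, which is one reason the amount and coverage of training data matter so much.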

In summary, POS tagging with HMM involves using a statistical model to predict the sequence of grammatical categories for a given sequence of words, leveraging the principles of Hidden Markov Models to capture the underlying structure of language.
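
The decoding step can be sketched in plain Python as follows. This is a minimal illustration of the Viterbi algorithm for a first-order HMM, not NLTK's implementation; the function name `viterbi`, the three-tag toy model, and all probability values are made up for the example.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p):
    """Return the most probable tag sequence for `words` under a first-order HMM."""

    def log(p):
        # Work in log space; a zero probability becomes negative infinity.
        return math.log(p) if p > 0 else float("-inf")

    # best[i][t]: log-probability of the best tag path ending in tag t at position i
    best = [{t: log(start_p.get(t, 0)) + log(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]  # back[i][t]: previous tag on that best path
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev = max(tags, key=lambda p: best[i - 1][p] + log(trans_p[p].get(t, 0)))
            best[i][t] = (
                best[i - 1][prev]
                + log(trans_p[prev].get(t, 0))
                + log(emit_p[t].get(words[i], 0))
            )
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return list(reversed(path))


# Toy model (hypothetical numbers): "book" can be a noun or a verb.
tags = ["DT", "NN", "VB"]
start_p = {"DT": 0.5, "NN": 0.2, "VB": 0.3}
trans_p = {
    "DT": {"NN": 0.9, "VB": 0.1},
    "NN": {"VB": 0.5, "NN": 0.3, "DT": 0.2},
    "VB": {"DT": 0.6, "NN": 0.3, "VB": 0.1},
}
emit_p = {
    "DT": {"the": 1.0},
    "NN": {"book": 0.5, "flight": 0.5},
    "VB": {"book": 1.0},
}

print(viterbi(["the", "book"], tags, start_p, trans_p, emit_p))
# ['DT', 'NN']
print(viterbi(["book", "the", "flight"], tags, start_p, trans_p, emit_p))
# ['VB', 'DT', 'NN']
```

The same word, "book", receives a different tag in each sentence because the transition probabilities make a noun likely after a determiner and a verb likely at the start of a sentence that continues with a determiner.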

## Explanation of the Code

**Hidden Markov Model (HMM) POS Tagging with NLTK Code Explanation:**


1. **Import Libraries:**
   - `import nltk`: Import the Natural Language Toolkit (NLTK).
   - `from nltk.tag import hmm`: Import the HMM module for POS tagging.
   - `from nltk.corpus import treebank`: Import the Penn Treebank corpus.

2. **Download NLTK Data:**
   - `nltk.download('punkt')`: Download the Punkt tokenizer models that `nltk.word_tokenize` relies on.
   - `nltk.download('treebank')`: Download the Penn Treebank corpus.

3. **Get Tagged Sentences:**
   - `tagged_sentences = treebank.tagged_sents()`: Retrieve tagged sentences from the Penn Treebank corpus.

4. **Split Data:**
   - `train_data = tagged_sentences[:3000]`: Use the first 3000 sentences for training.
   - `test_data = tagged_sentences[3000:]`: Use the remaining sentences for testing.

5. **Train HMM Model:**
   - `trainer = hmm.HiddenMarkovModelTrainer()`: Initialize an HMM trainer.
   - `hmm_model = trainer.train(train_data)`: Train the HMM in supervised mode on the labeled sentences. With no estimator specified, NLTK uses unsmoothed maximum-likelihood estimates.

6. **Evaluate Model:**
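   - `accuracy = hmm_model.evaluate(test_data)`: Tag the test sentences and compute the fraction of tokens whose predicted tag matches the gold tag. Recent NLTK versions deprecate `evaluate()` in favor of `accuracy()`, which is why the run prints a deprecation warning.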
   - `print(f"Accuracy: {accuracy * 100:.2f}%")`: Print the accuracy of the trained model.

7. **Test Model on Sample Sentence:**
   - `sample_sentence = "This is a sample sentence."`: Define a sample sentence.
   - `sample_tokens = nltk.word_tokenize(sample_sentence)`: Tokenize the sample sentence.
   - `predicted_tags = hmm_model.tag(sample_tokens)`: Use the trained model to predict POS tags for the sample sentence.
   - `print("Predicted POS Tags:", predicted_tags)`: Print the predicted POS tags for the sample sentence.
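
The accuracy reported in step 6 is a token-level score. As a rough illustration of what that number means, the sketch below computes the same kind of score by hand; `token_accuracy` is a hypothetical helper written for this example and is not an NLTK function.

```python
def token_accuracy(gold_sents, predicted_sents):
    """Fraction of tokens whose predicted tag equals the gold tag."""
    correct = total = 0
    for gold, predicted in zip(gold_sents, predicted_sents):
        for (_, gold_tag), (_, predicted_tag) in zip(gold, predicted):
            correct += gold_tag == predicted_tag
            total += 1
    return correct / total if total else 0.0


# One sentence with four tokens, one of which is tagged incorrectly.
gold = [[("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("test", "NN")]]
predicted = [[("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("test", "NNP")]]
print(f"Accuracy: {token_accuracy(gold, predicted) * 100:.2f}%")  # Accuracy: 75.00%
```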

In [None]:
import nltk
from nltk.tag import hmm
from nltk.corpus import treebank

# Download the NLTK data (you only need to do this once)
nltk.download('punkt')
nltk.download('treebank')

# Get tagged sentences from the Penn Treebank corpus
tagged_sentences = treebank.tagged_sents()

# Split the data into training and testing sets
train_data = tagged_sentences[:3000]
test_data = tagged_sentences[3000:]

# Train the HMM model
trainer = hmm.HiddenMarkovModelTrainer()
hmm_model = trainer.train(train_data)

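# Evaluate the model on the test data (token-level tagging accuracy).
# Note: newer NLTK releases deprecate evaluate() in favor of accuracy().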
accuracy = hmm_model.evaluate(test_data)
print(f"Accuracy: {accuracy * 100:.2f}%")

# Test the model on a sample sentence
sample_sentence = "This is a sample sentence."
sample_tokens = nltk.word_tokenize(sample_sentence)
predicted_tags = hmm_model.tag(sample_tokens)
print("Predicted POS Tags:", predicted_tags)


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Package treebank is already up-to-date!
  Function evaluate() has been deprecated.  Use accuracy(gold)
  instead.
  accuracy = hmm_model.evaluate(test_data)


Accuracy: 36.84%
Predicted POS Tags: [('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'NNP'), ('sentence', 'NNP'), ('.', 'NNP')]
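
The low accuracy, and the way every token from 'sample' onward is tagged `NNP`, is consistent with the unsmoothed default estimator: a word that never appears in the training sentences gets zero emission probability for every tag, so the tagger can no longer rank tag sequences. One possible remedy, sketched below, is to pass a smoothed estimator to `train_supervised`. The helper name `train_smoothed_hmm`, the value `gamma=0.1`, and the toy corpus are assumptions made for illustration; the experiment above was not run with this change, so no accuracy figure is claimed for it.

```python
from nltk.probability import LidstoneProbDist
from nltk.tag import hmm


def train_smoothed_hmm(tagged_sents, gamma=0.1):
    """Train a supervised HMM tagger with Lidstone (add-gamma) smoothing.

    The helper name and gamma=0.1 are illustrative choices, not part of
    the original experiment.
    """
    trainer = hmm.HiddenMarkovModelTrainer()
    return trainer.train_supervised(
        tagged_sents,
        estimator=lambda fdist, bins: LidstoneProbDist(fdist, gamma, bins),
    )


# Tiny hand-made corpus so the sketch runs without downloading any data.
toy_corpus = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
]
smoothed_model = train_smoothed_hmm(toy_corpus)
print(smoothed_model.tag(["the", "cat", "barks"]))
# "bird" is not in the toy corpus, but it still gets a non-zero probability.
print(smoothed_model.tag(["the", "bird", "barks"]))
```

In the experiment itself, the equivalent change would be to call `train_smoothed_hmm(train_data)` in place of `trainer.train(train_data)` and re-run the evaluation.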


In [None]:
print(tagged_sentences)

[[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ('61', 'CD'), ('years', 'NNS'), ('old', 'JJ'), (',', ','), ('will', 'MD'), ('join', 'VB'), ('the', 'DT'), ('board', 'NN'), ('as', 'IN'), ('a', 'DT'), ('nonexecutive', 'JJ'), ('director', 'NN'), ('Nov.', 'NNP'), ('29', 'CD'), ('.', '.')], [('Mr.', 'NNP'), ('Vinken', 'NNP'), ('is', 'VBZ'), ('chairman', 'NN'), ('of', 'IN'), ('Elsevier', 'NNP'), ('N.V.', 'NNP'), (',', ','), ('the', 'DT'), ('Dutch', 'NNP'), ('publishing', 'VBG'), ('group', 'NN'), ('.', '.')], ...]
