## NLP Lab - 3

NAME: Nileem Kaveramma C C
* 2348441

Q1. Estimating N-gram probability
* a. Generate a lookup table for all the words in a given text for unigram probability

* A unigram is a single word, and the unigram probability measures the likelihood of that word occurring in a text.
* An N-gram is a sequence of N words. The N-gram probability measures the likelihood of a word occurring given the previous
N−1 words. The most common N-grams are bigrams (2-grams) and trigrams (3-grams).
* Unigram Probability: Probability of individual words occurring (e.g., "love" = 0.2).
* Bigram Probability: Probability of a word given the previous word (e.g., $P(\text{"natural"} \mid \text{"love"}) = 1$).
* Trigram Probability: Probability of a word given the previous two words (e.g., $P(\text{"language"} \mid \text{"love natural"}) = 1$).
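The example values above are consistent with a five-word sentence such as "I love natural language processing". That sentence is an assumption, since the notebook does not state the source text. A minimal sketch of the maximum-likelihood estimates under that assumption:

```python
from collections import Counter

# Assumed example sentence; it is not stated in the notebook.
tokens = "I love natural language processing".lower().split()

def ngrams(seq, n):
    """Return all contiguous n-token tuples in seq."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

unigram_counts = Counter(ngrams(tokens, 1))
bigram_counts = Counter(ngrams(tokens, 2))
trigram_counts = Counter(ngrams(tokens, 3))

# Maximum-likelihood estimates: count(n-gram) / count(its (n-1)-word history)
p_love = unigram_counts[("love",)] / len(tokens)
p_natural_given_love = bigram_counts[("love", "natural")] / unigram_counts[("love",)]
p_language_given_love_natural = (
    trigram_counts[("love", "natural", "language")] / bigram_counts[("love", "natural")]
)

print(p_love, p_natural_given_love, p_language_given_love_natural)
```

With only one sentence, every history is followed by exactly one word, which is why both conditional probabilities come out as 1.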

In [None]:
from collections import Counter

# Step 1: Example stock market sentence
sentence = "The stock market is volatile and stock prices can change rapidly."

# Step 2: Tokenize the sentence
# Convert to lowercase and split on whitespace (punctuation stays attached, e.g. 'rapidly.')
tokens = sentence.lower().split()

# Step 3: Calculate word frequencies
word_frequencies = Counter(tokens)

# Step 4: Calculate total number of words
total_words = len(tokens)

# Step 5: Calculate unigram probabilities and create lookup table
unigram_probabilities = {word: count / total_words for word, count in word_frequencies.items()}

# Display the lookup table
print("Unigram Probabilities Lookup Table:")
for word, prob in unigram_probabilities.items():
    print(f"Word: '{word}', Probability: {prob:.4f}")


Unigram Probabilities Lookup Table:
Word: 'the', Probability: 0.0909
Word: 'stock', Probability: 0.1818
Word: 'market', Probability: 0.0909
Word: 'is', Probability: 0.0909
Word: 'volatile', Probability: 0.0909
Word: 'and', Probability: 0.0909
Word: 'prices', Probability: 0.0909
Word: 'can', Probability: 0.0909
Word: 'change', Probability: 0.0909
Word: 'rapidly.', Probability: 0.0909


* b. Train an n-gram language model on a corpus. Compute the perplexity of a test set given by the user.
* Example of bigram:
P(&lt;s&gt; I want english food &lt;/s&gt;) = P(I|&lt;s&gt;) × P(want|I) × P(english|want) × P(food|english) × P(&lt;/s&gt;|food)

* Perplexity is a measure of how well a language model predicts a sample.
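For a test sequence $W = w_1 \dots w_N$, perplexity is the inverse probability of the sequence normalised by its length, $PP(W) = P(w_1 \dots w_N)^{-1/N}$, so lower values indicate a better fit. A minimal sketch of that formula, computed in log space to avoid underflow on long sequences; the function name and the probabilities are illustrative assumptions:

```python
import math

def perplexity_from_probs(token_probs):
    """Perplexity of a sequence given the model probability of each token."""
    n = len(token_probs)
    # Sum of logs instead of a product of probabilities, to avoid underflow
    log_prob = sum(math.log(p) for p in token_probs)
    return math.exp(-log_prob / n)

# Illustrative values: a model that assigns 0.25 to each of four tokens
# is as uncertain as a uniform choice among four words.
print(perplexity_from_probs([0.25, 0.25, 0.25, 0.25]))
```

The printed value is approximately 4, matching the reading of perplexity as an effective branching factor.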

In [None]:
from collections import defaultdict, Counter
import numpy as np

# Step 1: Prepare the corpus
corpus = [
    "<s> I want English food </s>",
    "<s> I love Indian food </s>",
    "<s> She wants Italian food </s>"
]

# Step 2: Tokenize the corpus
tokens = [sentence.split() for sentence in corpus]

# Step 3: Train Bigram Model
bigram_counts = defaultdict(Counter)
unigram_counts = Counter()

for sentence in tokens:
    for i in range(len(sentence) - 1):
        unigram_counts[sentence[i]] += 1
        bigram_counts[sentence[i]][sentence[i + 1]] += 1
    # Include the last word unigram count
    unigram_counts[sentence[-1]] += 1

# Step 4: Calculate Bigram Probabilities
bigram_probabilities = defaultdict(dict)

for w1 in bigram_counts:
    total_count_w1 = float(sum(bigram_counts[w1].values()))
    for w2 in bigram_counts[w1]:
        bigram_probabilities[w1][w2] = bigram_counts[w1][w2] / total_count_w1

# Display Bigram Probabilities
print("Bigram Probabilities:")
for w1 in bigram_probabilities:
    for w2 in bigram_probabilities[w1]:
        print(f"P({w2}|{w1}) = {bigram_probabilities[w1][w2]:.4f}")

# Step 5: Compute Perplexity on a test sentence
def compute_sentence_probability(sentence, bigram_probabilities):
    words = sentence.split()
    probability = 1.0
    for i in range(len(words) - 1):
        w1 = words[i]
        w2 = words[i + 1]
        prob = bigram_probabilities[w1].get(w2, 1e-6)  # Floor probability for unseen bigrams (a crude stand-in for smoothing)
        probability *= prob
    return probability

def compute_perplexity(test_sentence, bigram_probabilities):
    words = test_sentence.split()
    sentence_probability = compute_sentence_probability(test_sentence, bigram_probabilities)
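    # N counts every token, including <s>; a common convention counts </s> but not <s>,
    # which would give N = len(words) - 1 (the number of bigram predictions)
    N = len(words)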
    perplexity = np.power(1 / sentence_probability, 1 / N)
    return perplexity

# Example test sentence
test_sentence = "<s> I want Italian food </s>"

# Compute and print perplexity
perplexity = compute_perplexity(test_sentence, bigram_probabilities)
print(f"Perplexity of the test sentence: {perplexity:.4f}")


Bigram Probabilities:
P(I|<s>) = 0.6667
P(She|<s>) = 0.3333
P(want|I) = 0.5000
P(love|I) = 0.5000
P(English|want) = 1.0000
P(food|English) = 1.0000
P(</s>|food) = 1.0000
P(Indian|love) = 1.0000
P(food|Indian) = 1.0000
P(wants|She) = 1.0000
P(Italian|wants) = 1.0000
P(food|Italian) = 1.0000
Perplexity of the test sentence: 12.0094
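The fixed floor of 1e-6 for unseen bigrams in the code above keeps the product non-zero, but the resulting values no longer sum to 1 for a given history. One common alternative is add-one (Laplace) smoothing, $P(w_2 \mid w_1) = \frac{C(w_1 w_2) + 1}{C(w_1) + V}$, where $V$ is the vocabulary size. A minimal sketch on the same three-sentence corpus; the function name `laplace_bigram_prob` is an assumption introduced here:

```python
from collections import Counter, defaultdict

corpus = [
    "<s> I want English food </s>",
    "<s> I love Indian food </s>",
    "<s> She wants Italian food </s>",
]

bigram_counts = defaultdict(Counter)
vocabulary = set()
for sentence in corpus:
    words = sentence.split()
    vocabulary.update(words)
    for w1, w2 in zip(words, words[1:]):
        bigram_counts[w1][w2] += 1

def laplace_bigram_prob(w1, w2):
    """Add-one smoothed estimate of P(w2 | w1)."""
    history_count = sum(bigram_counts[w1].values())
    return (bigram_counts[w1][w2] + 1) / (history_count + len(vocabulary))

print(f"P(Italian|want) = {laplace_bigram_prob('want', 'Italian'):.4f}")  # unseen bigram
print(f"P(English|want) = {laplace_bigram_prob('want', 'English'):.4f}")  # seen bigram
```

With this estimate the probabilities for a given history sum to 1 over the vocabulary, at the cost of moving a good deal of probability mass from seen to unseen bigrams on a corpus this small.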


Q2. POS Tagging
* a. Implement any one exercise from Chapter 8 of the textbook to tag the words of the input text.

In [None]:
import nltk
from nltk.corpus import treebank
from nltk.tag import UnigramTagger
from nltk.tokenize import word_tokenize

# Step 1: Download NLTK resources (if not already downloaded)
nltk.download('treebank')
nltk.download('punkt')

# Step 2: Prepare the corpus data
train_data = treebank.tagged_sents()  # Get tagged sentences from the Treebank corpus

# Step 3: Train a Unigram Tagger
unigram_tagger = UnigramTagger(train_data)

# Step 4: Define a function to tag an input text
def pos_tag_text(text, tagger):
    # Tokenize the input text
    tokens = word_tokenize(text)

    # Use the tagger to assign POS tags
    tagged = tagger.tag(tokens)

    return tagged

# Step 5: Example input text
input_text = "The stock market is extremely volatile today."

# Step 6: POS tag the input text using the trained unigram tagger
tagged_text = pos_tag_text(input_text, unigram_tagger)

# Display the tagged text
print("POS Tagged Text:")
print(tagged_text)


[nltk_data] Downloading package treebank to /root/nltk_data...
[nltk_data]   Unzipping corpora/treebank.zip.
[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


POS Tagged Text:
[('The', 'DT'), ('stock', 'NN'), ('market', 'NN'), ('is', 'VBZ'), ('extremely', 'RB'), ('volatile', 'JJ'), ('today', 'NN'), ('.', '.')]


**INFERENCE:**
* Word: "The" - POS Tag: DT (Determiner)
* Word: "stock" - POS Tag: NN (Noun, singular or mass)
* Word: "market" - POS Tag: NN (Noun, singular or mass)
* Word: "is" - POS Tag: VBZ (Verb, 3rd person singular present)
* Word: "extremely" - POS Tag: RB (Adverb)
* Word: "volatile" - POS Tag: JJ (Adjective)
* Word: "today" - POS Tag: NN (Noun, singular or mass)
* Word: "." - POS Tag: . (Punctuation mark)
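A unigram tagger like the one trained above assigns each word the tag it carried most often in the training data, ignoring context. A minimal pure-Python sketch of that idea on a hand-made toy training set; the data, the function names, and the `NN` default for unseen words are assumptions for illustration (NLTK's `UnigramTagger` returns `None` for unseen words unless a backoff tagger is supplied):

```python
from collections import Counter, defaultdict

# Toy tagged sentences, invented for illustration only
train = [
    [("the", "DT"), ("market", "NN"), ("is", "VBZ"), ("volatile", "JJ")],
    [("the", "DT"), ("stock", "NN"), ("is", "VBZ"), ("up", "RB")],
    [("they", "PRP"), ("stock", "VBP"), ("the", "DT"), ("shelves", "NNS")],
    [("the", "DT"), ("stock", "NN"), ("fell", "VBD")],
]

def train_unigram_tagger(tagged_sentences):
    """Map each word to its most frequent tag in the training data."""
    tag_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tag_counts[word][tag] += 1
    return {word: counts.most_common(1)[0][0] for word, counts in tag_counts.items()}

def tag(words, model, default="NN"):
    """Tag each word with its most frequent tag, or the default if unseen."""
    return [(word, model.get(word, default)) for word in words]

model = train_unigram_tagger(train)
print(tag(["the", "stock", "is", "volatile", "today"], model))
```

In the toy data "stock" appears twice as NN and once as VBP, so the tagger always chooses NN, even in a sentence where the verb reading is correct. This is the main limitation that context-aware taggers address.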

* b. Generate an HMM tagger using transition and observation (emission) probabilities.

* An HMM tagger (Hidden Markov Model tagger) is a type of probabilistic model used in Natural Language Processing (NLP) for Part-of-Speech (POS) tagging. It uses a Hidden Markov Model (HMM) to assign the most likely sequence of POS tags to a given sequence of words in a sentence.
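In symbols, for words $w_1 \dots w_n$ the tagger looks for the tag sequence that maximises the product of emission and transition probabilities. The notation below is a standard bigram-HMM formulation, not something taken from the notebook; $t_0$ stands for an assumed start-of-sentence state, and $C(\cdot)$ denotes a count in the tagged training corpus:

```latex
\hat{t}_{1:n} = \operatorname*{arg\,max}_{t_1 \dots t_n} \prod_{i=1}^{n} P(w_i \mid t_i)\, P(t_i \mid t_{i-1}),
\qquad
P(t_i \mid t_{i-1}) = \frac{C(t_{i-1}, t_i)}{C(t_{i-1})},
\qquad
P(w_i \mid t_i) = \frac{C(t_i, w_i)}{C(t_i)}
```

The Viterbi algorithm finds the maximising sequence by dynamic programming, keeping for each position and tag only the best-scoring path that ends there.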

In [None]:
import nltk
from nltk.corpus import treebank
from collections import defaultdict, Counter

# Step 1: Download NLTK resources (if not already downloaded)
nltk.download('treebank')
nltk.download('punkt')

# Prepare the corpus data
train_data = treebank.tagged_sents()

# Initialize counters for transition and emission probabilities
tag_transitions = defaultdict(Counter)
tag_emissions = defaultdict(Counter)
tag_counts = Counter()

# Populate the counts
for sentence in train_data:
    prev_tag = None
    for word, tag in sentence:
        tag_counts[tag] += 1
        tag_emissions[tag][word] += 1
        if prev_tag is not None:
            tag_transitions[prev_tag][tag] += 1
        prev_tag = tag

# Calculate transition probabilities
tag_transition_probs = defaultdict(dict)
for tag1, tags2 in tag_transitions.items():
    total = float(sum(tags2.values()))
    for tag2, count in tags2.items():
        tag_transition_probs[tag1][tag2] = count / total

# Calculate emission probabilities
tag_emission_probs = defaultdict(dict)
for tag, words in tag_emissions.items():
    total = float(tag_counts[tag])
    for word, count in words.items():
        tag_emission_probs[tag][word] = count / total

# Viterbi Algorithm for HMM Tagging
def viterbi(sentence, tags):
    n = len(sentence)
    dp = [{} for _ in range(n)]
    backpointer = [{} for _ in range(n)]

    # Initialize base cases (no start-of-sentence transition is modelled, so only the emission term is used)
    for tag in tags:
        dp[0][tag] = tag_emission_probs[tag].get(sentence[0], 1e-6)
        backpointer[0][tag] = None

    # Dynamic programming to fill the dp table
    for i in range(1, n):
        for tag in tags:
            max_prob, best_tag = max(
                ((dp[i-1][prev_tag] * tag_transition_probs[prev_tag].get(tag, 1e-6) *
                  tag_emission_probs[tag].get(sentence[i], 1e-6), prev_tag)
                 for prev_tag in tags),
                key=lambda x: x[0]
            )
            dp[i][tag] = max_prob
            backpointer[i][tag] = best_tag

    # Backtrack to get the best sequence of tags
    best_sequence = []
    best_last_tag = max(dp[-1], key=dp[-1].get)
    best_sequence.append(best_last_tag)
    for i in range(n-1, 0, -1):
        best_last_tag = backpointer[i][best_last_tag]
        best_sequence.append(best_last_tag)
    best_sequence.reverse()

    return best_sequence

# Define a list of possible tags (based on the training data)
tags = list(tag_counts.keys())

# Example input text
input_text = "The stock market is extremely volatile today."
tokens = nltk.word_tokenize(input_text)

# Tag the input text
tagged_sequence = viterbi(tokens, tags)

# Display the tagged text
print("HMM Tagged Text:")
for word, tag in zip(tokens, tagged_sequence):
    print(f"{word}: {tag}")




HMM Tagged Text:
The: DT
stock: NN
market: NN
is: VBZ
extremely: RB
volatile: JJ
today: NN
.: .


Implement NER with NLTK/spaCy/Stanford NER Tagger

Input: John   lives in New   York
Output: B-PER  O     O  B-LOC I-LOC

In [None]:
import spacy

# Load the spaCy model for English
nlp = spacy.load("en_core_web_sm")

# Input sentence
sentence = "John lives in New York"

# Process the sentence with the NLP model
doc = nlp(sentence)

# Initialize an empty list to store the output
output = []

# Loop through the tokens in the sentence
for token in doc:
    if token.ent_iob_ == 'O':
        output.append(token.ent_iob_)
    else:
        output.append(f"{token.ent_iob_}-{token.ent_type_}")

# Print the formatted output
print(" ".join(output))


B-PERSON O O B-GPE I-GPE


**INFERENCE**
* BIO Format: BIO is a common tagging scheme for NER in which each token receives one of three prefixes:
* B: the token begins a named entity, e.g. B-PER for the first token of a person (PER) entity.
* I: the token is inside a named entity that has already begun, e.g. I-LOC for a later token of a location (LOC) entity.
* O: the token is outside any named entity.

* The spaCy model used above labels entities as PERSON and GPE (geopolitical entity) where the expected output uses PER and LOC, so for the input "John lives in New York" the output is:

* "John" → B-PERSON (beginning of a person entity; B-PER in the expected output)
* "lives" → O (outside any entity)
* "in" → O (outside any entity)
* "New" → B-GPE (beginning of a geopolitical entity; B-LOC in the expected output)
* "York" → I-GPE (inside the same entity; I-LOC in the expected output)
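The spaCy run above printed PERSON and GPE labels, whereas the expected output in the task uses PER and LOC. A minimal sketch of remapping the labels; the `LABEL_MAP` dictionary and the function name are assumptions for illustration, and the input is given as plain `(iob, entity_type)` pairs like those read from `token.ent_iob_` and `token.ent_type_`:

```python
# Hypothetical mapping from the labels printed above to the labels in the task
LABEL_MAP = {"PERSON": "PER", "GPE": "LOC", "LOC": "LOC"}

def to_bio_tags(token_annotations, label_map=LABEL_MAP):
    """Convert (iob, entity_type) pairs into BIO tags with remapped labels."""
    tags = []
    for iob, ent_type in token_annotations:
        if iob == "O" or not ent_type:
            tags.append("O")
        else:
            # Labels missing from the map are kept unchanged
            tags.append(f"{iob}-{label_map.get(ent_type, ent_type)}")
    return tags

# (iob, entity_type) pairs matching the output shown above for "John lives in New York"
annotations = [("B", "PERSON"), ("O", ""), ("O", ""), ("B", "GPE"), ("I", "GPE")]
print(" ".join(to_bio_tags(annotations)))
```

Treating GPE as LOC is a simplification chosen to match the task's expected output; the two categories are not identical in every annotation scheme.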
