# Question 1
**Library Used**
: nltk, a leading Python library for working with human language data (text).

**Function Used**

**word_tokenize()**
: Splits raw text into individual word and punctuation tokens.
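
To get a feel for this kind of token stream without downloading NLTK's tokenizer data, here is a rough stand-in built on the standard library. It is only an approximation: `simple_tokenize` is a hypothetical helper, and its regular expression does not reproduce the rules `word_tokenize()` actually applies (contractions, abbreviations, and quotes are handled differently).

```python
import re

def simple_tokenize(text):
    """Rough approximation: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)

# Hand-typed sample sentence, lowercased like the corpus in this notebook
sample_tokens = simple_tokenize("i want english food. am i going to eat again?")
print(sample_tokens)
```

Punctuation comes out as separate tokens, which is why `.` and `?` show up as rows in the tables later on.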



In [32]:
import nltk
from nltk.util import ngrams
from nltk import word_tokenize, FreqDist
import math
import pandas as pd

# Load corpus (lowercased so that "I" and "i" count as the same token)
with open("sample_corpus.txt", "r", encoding="utf-8") as f:
    corpus_text = f.read().lower()

print("📄 Corpus Text:")
print(corpus_text)

# Tokenize
tokens = word_tokenize(corpus_text)
total_tokens = len(tokens)

📄 Corpus Text:
i want english food. sam and i like green vegetables. 
i like food and english movies. am i going to eat green vegetables again?



**FreqDist()**
: Creates a frequency distribution (like a histogram) over tokens.

Example: counts how often each word appears in the corpus.
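
`FreqDist` is built on Python's `collections.Counter`, so the counting step can be sketched with the standard library alone. The token list below is typed in by hand for illustration, and `relative_freq` is a hypothetical helper name.

```python
from collections import Counter

# Hand-made token list (not the notebook's corpus)
tokens_demo = ["i", "like", "food", "and", "i", "like", "movies", "."]
counts = Counter(tokens_demo)

def relative_freq(word):
    """Unigram probability: count of the word divided by the total token count."""
    return counts[word] / len(tokens_demo)

print(counts["i"], relative_freq("i"))
```

This is the same count / total ratio that fills the Probability column of the unigram table.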

In [33]:
# --- Unigram Table ---
fdist = FreqDist(tokens)
unigram_data = {
    "Word": [],
    "Count": [],
    "Probability": []
}

for word, count in fdist.items():
    unigram_data["Word"].append(word)
    unigram_data["Count"].append(count)
    unigram_data["Probability"].append(round(count / total_tokens, 6))

unigram_df = pd.DataFrame(unigram_data)
unigram_df = unigram_df.sort_values(by="Word").reset_index(drop=True)

print("\n📊 Unigram Table (All Words):")
print(unigram_df)


📊 Unigram Table (All Words):
          Word  Count  Probability
0            .      3     0.107143
1            ?      1     0.035714
2        again      1     0.035714
3           am      1     0.035714
4          and      2     0.071429
5          eat      1     0.035714
6      english      2     0.071429
7         food      2     0.071429
8        going      1     0.035714
9        green      2     0.071429
10           i      4     0.142857
11        like      2     0.071429
12      movies      1     0.035714
13         sam      1     0.035714
14          to      1     0.035714
15  vegetables      2     0.071429
16        want      1     0.035714


**ngrams()**
: Generates n-grams (sequences of n consecutive tokens) from a list of tokens.

This notebook uses bigrams (2-grams):

Example: ['I', 'like', 'apples'] → [('I', 'like'), ('like', 'apples')]
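
The pairing itself, and the conditional probability P(w2 | w1) = count(w1 w2) / count(w1 as a first word), can be sketched with `zip` and `Counter`. This is a minimal illustration on a hand-made token list; `make_bigrams` and `cond_prob` are hypothetical names, not NLTK functions.

```python
from collections import Counter

def make_bigrams(tokens):
    """Pair each token with its successor."""
    return list(zip(tokens, tokens[1:]))

tokens_demo = ["i", "like", "food", "and", "i", "want", "food"]
pairs = make_bigrams(tokens_demo)
pair_counts = Counter(pairs)
first_counts = Counter(w1 for w1, _ in pairs)  # how often each word starts a bigram

def cond_prob(w1, w2):
    """Maximum-likelihood estimate of P(w2 | w1); 0.0 if w1 never starts a bigram."""
    if first_counts[w1] == 0:
        return 0.0
    return pair_counts[(w1, w2)] / first_counts[w1]

print(make_bigrams(["I", "like", "apples"]))
print(cond_prob("i", "like"))
```

In the demo list, "i" starts two bigrams and one of them is "i like", so the estimate is 1/2.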

In [34]:
# --- Bigram Table ---
bigrams = list(ngrams(tokens, 2))
bigram_fdist = FreqDist(bigrams)
prev_word_counts = FreqDist(w1 for (w1, w2) in bigrams)

bigram_data = {
    "Bigram": [],
    "Count": [],
    "Probability": []
}

for bg, count in bigram_fdist.items():
    w1, w2 = bg
    prob = count / prev_word_counts[w1] if prev_word_counts[w1] > 0 else 0
    bigram_data["Bigram"].append(f"{w1} {w2}")
    bigram_data["Count"].append(count)
    bigram_data["Probability"].append(round(prob, 6))

bigram_df = pd.DataFrame(bigram_data)
bigram_df = bigram_df.sort_values(by="Bigram").reset_index(drop=True)

print("\n📊 Bigram Table (All Word Pairs):")
print(bigram_df)


📊 Bigram Table (All Word Pairs):
              Bigram  Count  Probability
0               . am      1     0.333333
1                . i      1     0.333333
2              . sam      1     0.333333
3            again ?      1     1.000000
4               am i      1     1.000000
5        and english      1     0.500000
6              and i      1     0.500000
7          eat green      1     1.000000
8       english food      1     0.500000
9     english movies      1     0.500000
10            food .      1     0.500000
11          food and      1     0.500000
12          going to      1     1.000000
13  green vegetables      2     1.000000
14           i going      1     0.250000
15            i like      2     0.500000
16            i want      1     0.250000
17         like food      1     0.500000
18        like green      1     0.500000
19          movies .      1     1.000000
20           sam and      1     1.000000
21            to eat      1     1.000000
22      vegetables .   

In [36]:
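# --- Bigram Probability Helper ---
def bigram_prob(w1, w2):
    """MLE estimate P(w2 | w1) = count(w1 w2) / count(w1 as first word of a bigram)."""
    if prev_word_counts[w1] == 0:
        return 0
    return bigram_fdist[(w1, w2)] / prev_word_counts[w1]

def perplexity(test_bigrams):
    """Perplexity 2 ** (-(1/N) * sum(log2 p)); infinite if any bigram is unseen."""
    N = len(test_bigrams)
    if N == 0:
        # Fewer than two tokens: there are no bigrams to score
        return float('inf')
    log_prob_sum = 0
    for w1, w2 in test_bigrams:
        p = bigram_prob(w1, w2)
        if p == 0:
            # No smoothing is applied, so one unseen bigram makes perplexity infinite
            return float('inf')
        log_prob_sum += math.log2(p)
    return 2 ** (-log_prob_sum / N)

test_sentence = input("Enter the sentence whose perplexity you want to check: ")
test_tokens = word_tokenize(test_sentence.lower())
test_bigrams = list(ngrams(test_tokens, 2))
print(f"\nPerplexity on test sequence: {perplexity(test_bigrams):.4f}")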



Perplexity on test sequence: 2.0000
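
As a sanity check on the formula, perplexity can be computed directly from a list of bigram probabilities. The probabilities below are chosen by hand for illustration (they are not necessarily those of the sentence entered above), and `perplexity_from_probs` is a hypothetical helper.

```python
import math

def perplexity_from_probs(probs):
    """2 ** (-(1/N) * sum(log2 p)); infinite for an empty list or any zero probability."""
    if not probs or any(p == 0 for p in probs):
        return float("inf")
    log_sum = sum(math.log2(p) for p in probs)
    return 2 ** (-log_sum / len(probs))

print(perplexity_from_probs([0.5, 0.5]))   # two bigrams, each with probability 0.5
print(perplexity_from_probs([0.25, 1.0]))  # same geometric mean, same perplexity
print(perplexity_from_probs([0.5, 0.0]))   # an unseen bigram gives infinity
```

Perplexity is the inverse of the geometric mean of the probabilities, so a sequence whose bigrams average out to probability 0.5 has perplexity 2.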


# Question 2
### Functions Used

**1) pos_tag()**
: Assigns Part-of-Speech (POS) tags to each token using a pretrained model.

Example: ['I', 'like', 'food'] → [('I', 'PRP'), ('like', 'VBP'), ('food', 'NN')]

**2) brown.tagged_sents()**
: Loads tagged sentences from the Brown corpus.

`categories='news'` restricts the corpus to the "news" category.

Used here as the training data for the HMM.

**3) ConditionalFreqDist()**
: A frequency distribution conditioned on a key.

Used here to:

Count word frequencies given a tag (for emissions)

Count tag frequencies given a previous tag (for transitions)

**4) LidstoneProbDist()**
: Applies Lidstone smoothing (a generalization of Laplace smoothing).

Prevents zero probabilities in emission/transition models.
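
The Lidstone estimate adds a constant gamma to every count: P(x) = (count(x) + gamma) / (N + B * gamma), where N is the total count and B is the number of bins. A minimal sketch of that formula follows; `lidstone_prob` is a hypothetical helper and the counts are made up for illustration.

```python
def lidstone_prob(count, total, bins, gamma=0.1):
    """Lidstone-smoothed probability: (count + gamma) / (total + bins * gamma)."""
    return (count + gamma) / (total + bins * gamma)

# Made-up word counts for a single tag
word_counts = {"food": 3, "movie": 1}
total = sum(word_counts.values())  # N = 4
bins = len(word_counts)            # B = 2 (observed word types only)

p_seen = lidstone_prob(word_counts["food"], total, bins)
p_unseen = lidstone_prob(0, total, bins)  # a word never seen with this tag
print(round(p_seen, 6), round(p_unseen, 6))
```

With gamma = 1 this reduces to Laplace (add-one) smoothing. An unseen word still receives the small non-zero probability gamma / (N + B * gamma).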





In [None]:
from nltk import word_tokenize, pos_tag
from nltk.corpus import brown
from nltk.probability import ConditionalFreqDist, LidstoneProbDist

# a) Simple POS Tagging
sentence = input("Enter the sentence:")
tokens = word_tokenize(sentence)
tags = pos_tag(tokens)
print("\n--- POS Tagging ---")
print(tags)

# b) HMM model training (transitions and emissions)
tagged_sents = brown.tagged_sents(categories='news')[:1000]
cfd_tag_word = ConditionalFreqDist()
cfd_tag_tag = ConditionalFreqDist()

for sent in tagged_sents:
    prev_tag = "<s>"
    for word, tag in sent:
        cfd_tag_word[tag][word.lower()] += 1
        cfd_tag_tag[prev_tag][tag] += 1
        prev_tag = tag

emissions = {
    tag: LidstoneProbDist(cfd_tag_word[tag], 0.1, bins=len(cfd_tag_word[tag]))
    for tag in cfd_tag_word
}

transitions = {
    tag: LidstoneProbDist(cfd_tag_tag[tag], 0.1, bins=len(cfd_tag_tag[tag]))
    for tag in cfd_tag_tag
}

print("\n--- HMM Model Trained ---")
print(f"Sample emission P('food' | 'NN') = {emissions['NN'].prob('food'):.6f}")
print(f"Sample transition P('VB' | '<s>') = {transitions['<s>'].prob('VB'):.6f}")



--- POS Tagging ---
[('i', 'NN'), ('want', 'VBP'), ('to', 'TO')]

--- HMM Model Trained ---
Sample emission P('food' | 'NN') = 0.000031
Sample transition P('VB' | '<s>') = 0.000099
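
The counting loop in the cell above can be reproduced on a toy scale with the standard library, which makes the two tables easier to inspect. This is a sketch under stated assumptions: the two tagged sentences are hand-written (with Penn-style tags, not the Brown tagset), and `defaultdict(Counter)` stands in for `ConditionalFreqDist`.

```python
from collections import Counter, defaultdict

# Hand-written toy data, not taken from the Brown corpus
toy_tagged_sents = [
    [("I", "PRP"), ("like", "VBP"), ("food", "NN")],
    [("I", "PRP"), ("want", "VBP"), ("movies", "NNS")],
]

emission_counts = defaultdict(Counter)    # tag -> counts of words emitted by that tag
transition_counts = defaultdict(Counter)  # previous tag -> counts of the following tag

for sent in toy_tagged_sents:
    prev_tag = "<s>"  # sentence-start marker, as in the cell above
    for word, tag in sent:
        emission_counts[tag][word.lower()] += 1
        transition_counts[prev_tag][tag] += 1
        prev_tag = tag

print(dict(emission_counts["VBP"]))
print(dict(transition_counts["<s>"]))
print(dict(transition_counts["VBP"]))
```

Each inner `Counter` is the raw material for one smoothed distribution: one per tag for emissions, one per previous tag for transitions.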


# Question 3
## Module Used
**spaCy**: an industrial-strength NLP library. It is fast and is used for:

i) Tokenization

ii) Named Entity Recognition (NER)

iii) POS tagging

iv) Dependency parsing

## Functions Used
**1) spacy.load("en_core_web_sm")**
: Loads a pre-trained small English pipeline.

Includes a tokenizer, POS tagger, NER component, etc.



**2) nlp(text)**
: Passes the input text through the entire pipeline.

Returns a `Doc` object with tokens, entities, etc.

**3) token.ent_iob_**
: The token's IOB tag, used for NER labeling.

Values:

'B': Beginning of named entity

'I': Inside a named entity

'O': Outside any named entity

**4) token.ent_type_**
: The entity type, such as PERSON, ORG, GPE (geo-political entity), or DATE. It is an empty string for tokens outside any entity.
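
To show how the IOB tag and the entity type work together, the sketch below groups per-token tags back into whole entities. The `(text, iob, label)` triples are written by hand to mimic the two attributes above; they are not actual spaCy output, and `group_entities` is a hypothetical helper.

```python
def group_entities(tagged_tokens):
    """Merge (text, iob, label) triples into (entity_text, label) pairs."""
    entities = []
    words, label = [], None
    for text, iob, ent_type in tagged_tokens:
        if iob == "I" and words:
            # Continue the entity that is currently open
            words.append(text)
            continue
        if words:
            # Any other tag closes the open entity
            entities.append((" ".join(words), label))
            words, label = [], None
        if iob in ("B", "I"):
            # "B" starts a new entity (a stray "I" is treated the same way)
            words, label = [text], ent_type
    if words:
        entities.append((" ".join(words), label))
    return entities

# Hand-written example triples
demo = [
    ("Sam", "B", "PERSON"),
    ("visited", "O", ""),
    ("New", "B", "GPE"),
    ("York", "I", "GPE"),
    ("today", "O", ""),
]
print(group_entities(demo))
```

A "B" always opens a new entity, so two adjacent entities of the same type stay separate instead of being merged.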

In [37]:
import spacy

# Load spaCy model
nlp = spacy.load("en_core_web_sm")

# Read the text to analyze
text = input("Enter text: ")
doc = nlp(text)

# Print IOB format output: "B-LABEL" / "I-LABEL" inside an entity, plain "O" outside
print("\n--- NER Output (IOB format) ---")
for token in doc:
    if token.ent_type_:
        tag = f"{token.ent_iob_}-{token.ent_type_}"
    else:
        tag = "O"
    print(f"{token.text}\t{tag}")



--- NER Output (IOB format) ---
