<a href="https://colab.research.google.com/github/koushik-ace/NLP/blob/main/Lab8_NGram_Model_Koushik_2403A52258.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##Corpus


In [3]:
# Large Corpus with 10 Documents

D1 = """
I am studying BTech in Computer Science at SR University.
My academic journey started with interest in mathematics and physics.
I learned programming using C, C++, and Python.
Data structures and algorithms improved my problem-solving skills.
"""

D2 = """
During my second year, I studied operating systems and database management systems.
I learned SQL queries, indexing, normalization, and transaction management.
Process scheduling and memory allocation were important topics.
"""

D3 = """
Machine learning became my favorite subject.
I studied regression, classification, clustering, and neural networks.
I implemented projects using Scikit-learn and TensorFlow.
Deep learning models improved prediction accuracy.
"""

D4 = """
Cybersecurity is an important domain in computer science.
I studied cryptography, encryption algorithms, and ethical hacking.
Network security and malware analysis were explored.
Digital forensics was introduced in laboratory sessions.
"""

D5 = """
Cloud computing enables scalable software systems.
I learned Amazon Web Services, Microsoft Azure, and Google Cloud Platform.
Docker and Kubernetes improved deployment efficiency.
Virtual machines support distributed infrastructure.
"""

D6 = """
Software engineering focuses on quality development.
I studied SDLC, agile methodology, and DevOps practices.
Git and GitHub improved collaboration.
CI/CD pipelines automated testing and deployment.
"""

D7 = """
Web development is essential for modern applications.
I learned HTML, CSS, JavaScript, and React framework.
Backend development used Flask and Django.
RESTful APIs enabled data communication.
"""

D8 = """
Data science involves data analysis and visualization.
I used Pandas, NumPy, and Matplotlib libraries.
Exploratory data analysis improved decision making.
Statistical modeling enhanced insights.
"""

D9 = """
Artificial intelligence includes natural language processing.
I studied sentiment analysis and chatbot development.
Recommendation systems improved user experience.
Speech recognition was implemented using Python.
"""

D10 = """
My career goal is to become a software engineer.
I aim to work on intelligent systems and distributed computing.
Continuous learning through certifications is important.
Innovation and creativity drive technological growth.
"""


In [4]:
# Combine all documents

combined_text = D1 + D2 + D3 + D4 + D5 + D6 + D7 + D8 + D9 + D10

print("Total Words:", len(combined_text.split()))


Total Words: 280


##Preprocessing

In [5]:
import re
import collections

# Lowercase
text = combined_text.lower()

# Remove punctuation
text = re.sub(r'[^a-z\s]', '', text)

# Tokenization
words = text.split()

print("First 30 Tokens:")
print(words[:30])


First 30 Tokens:
['i', 'am', 'studying', 'btech', 'in', 'computer', 'science', 'at', 'sr', 'university', 'my', 'academic', 'journey', 'started', 'with', 'interest', 'in', 'mathematics', 'and', 'physics', 'i', 'learned', 'programming', 'using', 'c', 'c', 'and', 'python', 'data', 'structures']
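
The character filter above deletes punctuation without inserting a space, which is why `C, C++` appears as `'c', 'c'` in the token list. The same rule fuses hyphenated or slash-separated terms into single tokens. A minimal sketch of this side effect, using a hypothetical helper `preprocess` (not part of the notebook) that wraps the same three steps:

```python
import re

def preprocess(raw_text):
    """Lowercase, drop every character except a-z and whitespace, then split."""
    lowered = raw_text.lower()
    cleaned = re.sub(r'[^a-z\s]', '', lowered)
    return cleaned.split()

# "C++" loses its plus signs and becomes "c"
print(preprocess("I learned programming using C, C++, and Python."))

# "Scikit-learn" and "CI/CD" collapse into "scikitlearn" and "cicd"
print(preprocess("Scikit-learn and CI/CD pipelines"))
```

If such merged tokens are unwanted, one option is to replace punctuation with a space instead of an empty string, at the cost of splitting `CI/CD` into two tokens.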


##Identify Rare Words & Add UNK

In [23]:
# Identify Rare Words (Frequency = 1)

word_freq = collections.Counter(words)

rare_words = []

for word, count in word_freq.items():
    if count == 1:
        rare_words.append(word)

print("Number of Rare Words:", len(rare_words))


# Replace rare words with UNK

words_unk = []

for word in words:
    if word_freq[word] == 1:
        words_unk.append("UNK")
    else:
        words_unk.append(word)


print("Sample After UNK Replacement:")
print(words_unk[:30])


Number of Rare Words: 152
Sample After UNK Replacement:
['i', 'UNK', 'UNK', 'UNK', 'in', 'computer', 'science', 'UNK', 'UNK', 'UNK', 'my', 'UNK', 'UNK', 'UNK', 'UNK', 'UNK', 'in', 'UNK', 'and', 'UNK', 'i', 'learned', 'UNK', 'using', 'c', 'c', 'and', 'python', 'data', 'UNK']
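
The replacement above only rewrites training tokens. A query word that is not in the reduced vocabulary is still looked up under its own spelling, so it gets a count of zero unless it is mapped to `UNK` first. A minimal sketch of that mapping step, assuming a hypothetical helper `map_to_unk` and a toy token list rather than the lab corpus:

```python
import collections

def map_to_unk(tokens, vocab, unk_token="UNK"):
    """Replace every token that is not in `vocab` with `unk_token`."""
    return [tok if tok in vocab else unk_token for tok in tokens]

# Toy training data (an assumption for illustration, not the lab corpus)
train = ["i", "studied", "data", "i", "learned", "data"]
freq = collections.Counter(train)

# Keep only words seen more than once, mirroring the frequency-1 rule above
vocab = {w for w, c in freq.items() if c > 1}

# "studied" was rare in training and "robotics" was never seen: both become UNK
print(map_to_unk(["i", "studied", "robotics"], vocab))
```

Applying the same function to a test sentence before scoring keeps training and test tokens in the same vocabulary.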


##Rebuild N-Gram Models with UNK

In [24]:
# Build N-grams with UNK

# Unigrams
unigram_unk = collections.Counter(words_unk)

# Bigrams
bigrams_unk = []

for i in range(len(words_unk)-1):
    bigrams_unk.append((words_unk[i], words_unk[i+1]))

bigram_unk = collections.Counter(bigrams_unk)


# Trigrams
trigrams_unk = []

for i in range(len(words_unk)-2):
    trigrams_unk.append((words_unk[i], words_unk[i+1], words_unk[i+2]))

trigram_unk = collections.Counter(trigrams_unk)
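
The bigram and trigram loops above follow the same sliding-window pattern. As a sketch only, the pattern can be written once for any `n` with `zip`; the helper name `ngram_counts` and the short token list are assumptions for illustration:

```python
import collections

def ngram_counts(tokens, n):
    """Count all contiguous n-token windows in `tokens`."""
    # zip stops at the shortest slice, so only full windows are produced
    windows = zip(*(tokens[i:] for i in range(n)))
    return collections.Counter(windows)

tokens = ["i", "UNK", "in", "computer", "science", "in", "computer", "science"]

bigram = ngram_counts(tokens, 2)
trigram = ngram_counts(tokens, 3)

print(bigram[("computer", "science")])   # number of times the pair occurs
print(sum(trigram.values()))             # number of trigram windows: len(tokens) - 2
```

Note that with `n=1` the keys are one-element tuples such as `("i",)`, unlike `collections.Counter(words_unk)`, whose keys are plain strings.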


In [25]:
import math


def unigram_probability(word, freq, total, V, k=0):
    # Add-k estimate: (count + k) / (total + k*V); k=0 gives the unsmoothed estimate.
    return (freq.get(word,0)+k)/(total + k*V)


In [26]:
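def sentence_probability(sentence, freq, k=0):
    # Relies on the globals `words` (training tokens) and `V` (vocabulary size),
    # both of which must be defined before this function is called.

    sent = sentence.lower().split()

    prob = 1

    for word in sent:
        p = unigram_probability(word, freq, len(words), V, k)
        prob *= p

    return prob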


##Perplexity Function

In [27]:
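def perplexity(sentence, freq, k=0):
    # Relies on the globals `words` (training tokens) and `V` (vocabulary size),
    # so the Unigram Counts cell must be run before this function is called.

    sent = sentence.lower().split()

    log_prob = 0
    n = len(sent)

    for word in sent:

        p = unigram_probability(word, freq, len(words), V, k)

        # Zero-probability words are skipped here, but n still counts them,
        # so sentences with unseen words get a finite, understated perplexity.
        if p > 0:
            log_prob += math.log(p)

    perp = math.exp(-log_prob/n)

    return perp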
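
Because the function above leaves zero-probability words out of the sum while still dividing by the full sentence length, an unsmoothed score can look better than a smoothed one. A sketch of a stricter variant is shown below; the name `strict_perplexity`, its explicit `total` and `vocab_size` parameters, and the toy counts are assumptions for illustration, and the input is assumed to be a non-empty token list:

```python
import math

def strict_perplexity(tokens, freq, total, vocab_size, k=0):
    """Unigram perplexity with add-k smoothing; infinite if any word has probability 0."""
    log_prob = 0.0
    for tok in tokens:
        p = (freq.get(tok, 0) + k) / (total + k * vocab_size)
        if p == 0:
            # A single impossible word makes the whole sentence impossible
            return math.inf
        log_prob += math.log(p)
    return math.exp(-log_prob / len(tokens))

# Toy counts (assumed for illustration)
freq = {"data": 2, "science": 1, "is": 1}
total = sum(freq.values())
V = len(freq)

print(strict_perplexity(["data", "science"], freq, total, V))       # all words seen
print(strict_perplexity(["data", "robots"], freq, total, V))        # unseen word, k=0 -> inf
print(strict_perplexity(["data", "robots"], freq, total, V, k=1))   # finite with add-one
```

With this convention, smoothing visibly turns an infinite perplexity into a finite one, which is the effect smoothing is meant to have.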


##Unigram Counts

In [30]:
# Unigram counts

unigram_counts = collections.Counter(words)

print("Top 20 Unigrams:\n")

for word, count in unigram_counts.most_common(20):
    print(word, ":", count)

# Vocabulary Size
V = len(unigram_counts)
print("\nVocabulary Size =", V)


Top 20 Unigrams:

and : 22
i : 13
improved : 6
my : 5
data : 5
studied : 5
systems : 5
in : 4
learned : 4
is : 4
analysis : 4
development : 4
science : 3
using : 3
important : 3
learning : 3
software : 3
computer : 2
c : 2
python : 2

Vocabulary Size = 185


##Compare Perplexity (Before & After UNK)

In [28]:
test_sentence = "machine learning improves prediction accuracy"

print("Sentence:", test_sentence)

print("\n--- Without UNK ---")
print("Perplexity:", perplexity(test_sentence, unigram_counts))

print("\n--- With UNK ---")
print("Perplexity:", perplexity(test_sentence, unigram_unk))

print("\n--- With Add-One Smoothing ---")
print("Perplexity:", perplexity(test_sentence, unigram_counts, k=1))

print("\n--- With Add-K (0.5) Smoothing ---")
print("Perplexity:", perplexity(test_sentence, unigram_counts, k=0.5))


Sentence: machine learning improves prediction accuracy

--- Without UNK ---
Perplexity: 72.82863565721485

--- With UNK ---
Perplexity: 2.4774640162541157

--- With Add-One Smoothing ---
Perplexity: 232.50000000000009

--- With Add-K (0.5) Smoothing ---
Perplexity: 261.13429503799154
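
The printed values can be reproduced by hand, which also shows why they are not directly comparable. The arithmetic below assumes `len(words)` = 280 and `V` = 185, with corpus counts machine = 1, learning = 3, improves = 0, prediction = 1, accuracy = 1; these assumptions reproduce the printed numbers.

```latex
\begin{aligned}
\text{Add-one } (k=1):\quad
\mathrm{PP} &= \left(\frac{2\cdot 4\cdot 1\cdot 2\cdot 2}{465^{5}}\right)^{-1/5}
             = \frac{465}{32^{1/5}} = 232.5 \\[4pt]
\text{Without UNK } (k=0):\quad
\mathrm{PP} &= \left(\frac{1\cdot 3\cdot 1\cdot 1}{280^{4}}\right)^{-1/5} \approx 72.83 \\[4pt]
\text{With UNK } (k=0):\quad
\mathrm{PP} &= \left(\frac{3}{280}\right)^{-1/5} \approx 2.477
\end{aligned}
```

In the unsmoothed case `improves` has probability zero and is skipped, so only four factors enter the product while the exponent still uses five words. In the UNK case `machine`, `prediction`, and `accuracy` were replaced by `UNK` during training, so the raw test words have zero count in `unigram_unk` and only `learning` contributes.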


In [37]:
# Deploy Perplexity Calculator

sentence = input("Enter a sentence: ")

print("\nWithout UNK:", perplexity(sentence, unigram_counts))
print("With UNK:", perplexity(sentence, unigram_unk))
print("Add-One:", perplexity(sentence, unigram_counts,1))
print("Add-K:", perplexity(sentence, unigram_counts,0.5))


Enter a sentence: i

Without UNK: 21.53846153846154
With UNK: 21.53846153846154
Add-One: 33.21428571428571
Add-K: 27.592592592592595


##Bigram Counts

In [7]:
# Generate Bigrams

bigrams = []

for i in range(len(words)-1):
    bigrams.append((words[i], words[i+1]))

bigram_counts = collections.Counter(bigrams)

print("Top 15 Bigrams:\n")

for bg, count in bigram_counts.most_common(15):
    print(bg[0], bg[1], ":", count)


Top 15 Bigrams:

i studied : 5
i learned : 4
in computer : 2
computer science : 2
systems and : 2
systems i : 2
data analysis : 2
analysis and : 2
i am : 1
am studying : 1
studying btech : 1
btech in : 1
science at : 1
at sr : 1
sr university : 1


##Trigram Counts

In [8]:
# Generate Trigrams

trigrams = []

for i in range(len(words)-2):
    trigrams.append((words[i], words[i+1], words[i+2]))

trigram_counts = collections.Counter(trigrams)

print("Top 15 Trigrams:\n")

for tg, count in trigram_counts.most_common(15):
    print(tg[0], tg[1], tg[2], ":", count)


Top 15 Trigrams:

in computer science : 2
systems i learned : 2
i am studying : 1
am studying btech : 1
studying btech in : 1
btech in computer : 1
computer science at : 1
science at sr : 1
at sr university : 1
sr university my : 1
university my academic : 1
my academic journey : 1
academic journey started : 1
journey started with : 1
started with interest : 1


##Bigram Prediction

In [9]:
def predict_bigram(sequence):

    seq = sequence.lower().split()

    if not seq:
        return "Enter valid input"

    last = seq[-1]

    candidates = {}

    for (w1,w2),count in bigram_counts.items():
        if w1 == last:
            candidates[w2] = count

    if not candidates:
        return "No prediction available"

    total = unigram_counts.get(last,0)

    best = None
    best_prob = 0

    for w,c in candidates.items():

        prob = c/total
        print("Probability of",w,"=",prob)

        if prob>best_prob:
            best_prob = prob
            best = w

    return best


##Deploy Bigram Model (Without Smoothing)

In [15]:
# Deploy Bigram Model

ip_text = input("Enter text: ")

result = predict_bigram(ip_text)

print("Predicted Next Word:", result)


Enter text: hello
Predicted Next Word: No prediction available


In [10]:
print(predict_bigram("machine"))
print(predict_bigram("software"))
print(predict_bigram("cloud"))


Probability of learning = 1.0
learning
Probability of systems = 0.3333333333333333
Probability of engineering = 0.3333333333333333
Probability of engineer = 0.3333333333333333
systems
Probability of computing = 0.5
Probability of platform = 0.5
computing
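
`predict_bigram` scans the whole bigram table on every call. A sketch of an alternative is to index the followers of each word once and then look them up directly. The names `build_follower_index` and `predict_next` and the toy token list are hypothetical:

```python
import collections

def build_follower_index(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    index = collections.defaultdict(collections.Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        index[prev][nxt] += 1
    return index

def predict_next(index, word):
    """Return (best_follower, probability), or None if `word` has no followers."""
    followers = index.get(word)   # .get does not create a missing entry
    if not followers:
        return None
    total = sum(followers.values())
    best, count = followers.most_common(1)[0]
    return best, count / total

tokens = "i studied sql i studied ml i learned python".split()
index = build_follower_index(tokens)

print(predict_next(index, "i"))       # "studied" follows "i" in 2 of 3 cases
print(predict_next(index, "hello"))   # None: word never seen as a context
```

This sketch normalizes by the sum of follower counts rather than the unigram count. The two differ only for the final token of the corpus, which has one fewer follower than occurrences.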


##Bigram with Laplace Smoothing

In [11]:
def predict_bigram_laplace(sequence):

    seq = sequence.lower().split()

    if not seq:
        return "Enter valid input"

    last = seq[-1]

    candidates = {}

    for (w1,w2),count in bigram_counts.items():
        if w1 == last:
            candidates[w2] = count

    if not candidates:
        return "No prediction available"

    total = unigram_counts.get(last,0)

    best = None
    best_prob = 0

    for w,c in candidates.items():

        prob = (c+1)/(total+V)
        print("Probability of",w,"=",prob)

        if prob>best_prob:
            best_prob = prob
            best = w

    return best


##Deploy Bigram + Laplace Smoothing

In [31]:
# Deploy Laplace Bigram Model

ip_text = input("Enter text: ")

result = predict_bigram_laplace(ip_text)

print("Predicted Next Word (Laplace):", result)


Enter text: is a
Probability of software = 0.010752688172043012
Predicted Next Word (Laplace): software
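
The printed value is (1 + 1) / (1 + 185) = 2/186, since `a` occurs once and `V` is 185. Because `predict_bigram_laplace` only scores followers that were actually observed, smoothing rescales the probabilities but cannot change the ranking or propose an unseen word. A sketch of the full smoothed distribution, which also assigns mass to unseen pairs, is given below; the helper `add_k_bigram_prob` and the toy sentence are assumptions for illustration:

```python
import collections

def add_k_bigram_prob(prev, word, bigram_counts, unigram_counts, vocab_size, k=1.0):
    """P(word | prev) with add-k smoothing; defined for unseen pairs too."""
    pair_count = bigram_counts.get((prev, word), 0)
    prev_count = unigram_counts.get(prev, 0)
    return (pair_count + k) / (prev_count + k * vocab_size)

tokens = "is a software engineer is a student".split()
unigram_counts = collections.Counter(tokens)
bigram_counts = collections.Counter(zip(tokens, tokens[1:]))
V = len(unigram_counts)

seen = add_k_bigram_prob("a", "software", bigram_counts, unigram_counts, V)
unseen = add_k_bigram_prob("a", "engineer", bigram_counts, unigram_counts, V)

# Summing over the whole vocabulary should give 1 for a proper distribution
total = sum(add_k_bigram_prob("a", w, bigram_counts, unigram_counts, V)
            for w in unigram_counts)

print(seen, unseen, total)
```

The unseen pair receives a small but nonzero probability, and the probabilities over the vocabulary sum to one for any context that is not the last token of the corpus.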


##Add-K Smoothing (K = 0.5)

In [12]:
def predict_bigram_k(sequence,k=0.5):

    seq = sequence.lower().split()

    if not seq:
        return "Enter valid input"

    last = seq[-1]

    candidates = {}

    for (w1,w2),count in bigram_counts.items():
        if w1 == last:
            candidates[w2] = count

    if not candidates:
        return "No prediction available"

    total = unigram_counts.get(last,0)

    best = None
    best_prob = 0

    for w,c in candidates.items():

        prob = (c+k)/(total+k*V)
        print("Probability of",w,"=",prob)

        if prob>best_prob:
            best_prob = prob
            best = w

    return best


##Deploy Bigram + Add-K Smoothing

In [22]:
# Deploy Add-K Bigram Model

ip_text = input("Enter text: ")

result = predict_bigram_k(ip_text, 0.5)

print("Predicted Next Word (Add-K):", result)


Enter text: i
Probability of am = 0.014218009478672985
Probability of learned = 0.04265402843601896
Probability of studied = 0.052132701421800945
Probability of implemented = 0.014218009478672985
Probability of used = 0.014218009478672985
Probability of aim = 0.014218009478672985
Predicted Next Word (Add-K): studied


##Trigram Prediction

In [13]:
def predict_trigram(sequence):

    seq = sequence.lower().split()

    if len(seq)<2:
        return "Enter at least two words"

    last2 = tuple(seq[-2:])

    candidates = {}

    for (w1,w2,w3),count in trigram_counts.items():
        if (w1,w2)==last2:
            candidates[w3]=count

    if not candidates:
        return "No prediction available"

    total = bigram_counts.get(last2,0)

    best=None
    best_prob=0

    for w,c in candidates.items():

        prob=c/total
        print("Probability of",w,"=",prob)

        if prob>best_prob:
            best_prob=prob
            best=w

    return best


##Deploy Trigram Model

In [35]:
# Deploy Trigram Model

ip_text = input("Enter two words: ")

result = predict_trigram(ip_text)

print("Predicted Next Word (Trigram):", result)


Enter two words: I learned 
Probability of programming = 0.25
Probability of sql = 0.25
Probability of amazon = 0.25
Probability of html = 0.25
Predicted Next Word (Trigram): programming


In [14]:
print(predict_trigram("machine learning"))
print(predict_trigram("software engineering"))
print(predict_trigram("cloud computing"))


Probability of became = 1.0
became
Probability of focuses = 1.0
focuses
Probability of enables = 1.0
enables
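
In the `I learned` run above, all four candidates have probability 0.25, and `programming` is returned only because the strict `>` comparison keeps the first candidate encountered. A small sketch that reports every tied candidate instead; the helper name `top_candidates` is hypothetical, and the counts are taken from the outputs shown earlier:

```python
def top_candidates(candidate_counts):
    """Return every candidate whose count equals the maximum, in insertion order."""
    if not candidate_counts:
        return []
    best = max(candidate_counts.values())
    return [w for w, c in candidate_counts.items() if c == best]

# Follower counts for the context ("i", "learned"): each trigram occurs once
followers = {"programming": 1, "sql": 1, "amazon": 1, "html": 1}
print(top_candidates(followers))

# A context with a clear winner returns a single word
print(top_candidates({"studied": 5, "learned": 4}))
```

Returning the full tie makes it clear when the model has no real preference, which is common in a corpus this small.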


#Answers to Questions
1. Why do unseen words cause zero probability?

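* An unsmoothed model estimates probability as count divided by total. A word or n-gram that never appears in the training data has a count of zero, so its probability is zero, and any sentence that contains it also gets probability zero.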

2. What is UNK token?

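* UNK is a placeholder token for unknown or rare words. In this notebook every word that occurs only once in the corpus is replaced by UNK, so the model learns a count for "unknown" instead of assigning zero. Test words outside the vocabulary must also be mapped to UNK for this to take effect.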

3. Which smoothing worked best?

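* The recorded runs do not single out one method. For the five-word test sentence, add-one smoothing gave the lower perplexity (232.5 versus 261.1 for add-k with k = 0.5). For the single word `i`, add-k was lower (27.6 versus 33.2). Add-k with a small k takes less probability mass away from seen words, so it favors sentences made of frequent words.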

4. Did perplexity improve? Why?

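* Not as a lower number. Smoothing removes zero probabilities, so perplexity becomes well defined for sentences with unseen words such as `improves`. The unsmoothed value (72.8) and the UNK value (2.48) look lower only because the perplexity function skips zero-probability words, so they are not directly comparable with the smoothed values (232.5 and 261.1).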