# Naive Bayes Text Prediction: Next Word Generator

### Introduction and Theory

**Naive Bayes** for text prediction is a probabilistic approach that uses word statistics to estimate the likelihood of the next word in a sequence. It assumes that the probability of a word depends only on the word immediately preceding it, which is often called a "bigram" model in this context.

In this project, we will build a model that learns from a Shakespearean text corpus to generate new text word-by-word. We will also implement **Temperature Scaling** to control the "creativity" or randomness of the generated output.

### The Mathematical Model
The model relies on calculating transition probabilities between words and then adjusting them to control diversity.

#### Bayes' Theorem
We use **Bayes' Theorem** to calculate the probability of a specific *Next Word* ($B$) following a *Current Word* ($A$).

$$P(B|A) = \frac{P(A|B) \cdot P(B)}{P(A)}$$

* $P(B|A)$ : The probability of the *Next Word* given the *Current Word* (Transition Probability).
* $P(A|B)$ : The likelihood (the fraction of the *Next Word*'s occurrences that are immediately preceded by the *Current Word*).
* $P(B)$ : The probability of the *Next Word* appearing generally in the text.
* $P(A)$ : The probability of the *Current Word* appearing generally in the text.
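As a quick sanity check on the formula above, the sketch below computes each term from raw counts and compares the result with the plain bigram ratio count(A, B) / count(A). The word list `toy_words` and the helper names are made up for illustration and are not part of the Shakespeare corpus or the notebook.

```python
from collections import Counter

# Hypothetical toy corpus, used only for illustration.
toy_words = "the cat sat on the mat the cat ran".split()

unigram = Counter(toy_words)                      # count(word)
bigram = Counter(zip(toy_words, toy_words[1:]))   # count(current, next)
total = len(toy_words)

def bayes_transition(current_w, next_w):
    """P(next | current), computed term by term with Bayes' theorem."""
    p_next = unigram[next_w] / total                                      # P(B)
    p_current = unigram[current_w] / total                                # P(A)
    p_current_given_next = bigram[(current_w, next_w)] / unigram[next_w]  # P(A|B)
    return p_current_given_next * p_next / p_current

def count_ratio(current_w, next_w):
    """The same quantity written directly as a ratio of counts."""
    return bigram[(current_w, next_w)] / unigram[current_w]

p_bayes = bayes_transition("the", "cat")
p_ratio = count_ratio("the", "cat")
print(p_bayes, p_ratio)  # both are 2/3, up to floating-point rounding
```

In this toy example "the" occurs three times and is followed by "cat" twice, so both routes give 2/3. The Bayes terms cancel to the count ratio, which is why the model behaves like a standard bigram model.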

#### Temperature Scaling ($T$)
To introduce variety into the text generation, we adjust the probabilities using a **Temperature ($T$)** parameter inside a Softmax function.

$$P'(w) = \frac{e^{\log(P(w))/T}}{\sum_{i} e^{\log(P(w_i))/T}}$$

* $P(w)$ : The original probability of the word.
* $T$ : Temperature parameter ($T > 0$). With $T = 1$ the distribution is unchanged.
    * **Low $T$ (< 1.0)**: "Cools" the distribution, making it sharper. The model becomes strict and repetitive, favoring only the most likely words.
    * **High $T$ (> 1.0)**: "Heats" the distribution, making it flatter. The model becomes more creative and random, giving unlikely words a higher chance to be picked.
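To make the effect of $T$ concrete, here is a minimal sketch that applies the formula above to a small, made-up distribution. The function name `apply_temperature` and the vector `p` are assumptions for illustration, and the sketch assumes every probability is strictly positive (otherwise the logarithm is undefined).

```python
import numpy as np

def apply_temperature(probs, temperature):
    """Rescale a probability vector with softmax(log(p) / T)."""
    probs = np.asarray(probs, dtype=float)
    logits = np.log(probs) / temperature
    logits = logits - logits.max()   # subtract the max for numerical stability
    scaled = np.exp(logits)
    return scaled / scaled.sum()

# Hypothetical next-word distribution, for illustration only.
p = np.array([0.6, 0.3, 0.1])

cool = apply_temperature(p, 0.5)   # sharper: the top word gains probability
same = apply_temperature(p, 1.0)   # T = 1 leaves the distribution unchanged
hot = apply_temperature(p, 2.0)    # flatter: unlikely words gain probability

# Equivalent closed form: P(w)^(1/T), renormalised.
power_form = p ** (1 / 0.5) / np.sum(p ** (1 / 0.5))

print(cool, same, hot)
```

Because $e^{\log(P(w))/T} = P(w)^{1/T}$, the softmax form and the power form give the same result, so either can be used in code.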

### Import Libraries

In [35]:
import pandas as pd
import numpy as np
import random

### Loading Data

In [36]:
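# Read the file as a single text column: the separator '\0' is not expected to
# occur in the text, so each line ends up whole in column 0.
# Note that read_csv skips blank lines by default.
df = pd.read_csv('shakespeare.txt', sep='\0', header=None, engine='python')
# Join all lines into one long string
text_data = " ".join(df[0].astype(str).tolist())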
print("Successfully loaded shakespeare.txt")

Successfully loaded shakespeare.txt


### Preprocessing Data

In [37]:
# Keep only letters and whitespace (lowercased); punctuation and digits are dropped
clean_text = "".join([char.lower() for char in text_data if char.isalpha() or char.isspace()])

words = clean_text.split()
print(f"Total words: {len(words)}")

Total words: 25800
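The cleaning step can be tried on a short string to see what it keeps and what it drops. The sketch below wraps the same filter in a helper; the name `clean_and_tokenize` and the sample sentence are assumptions for illustration.

```python
def clean_and_tokenize(text):
    """Lowercase the text, keep only letters and whitespace, then split into words."""
    kept = [ch.lower() for ch in text if ch.isalpha() or ch.isspace()]
    return "".join(kept).split()

# Hypothetical sample line, for illustration only.
sample = "A madman's mercy bid me,\nO here!"
tokens = clean_and_tokenize(sample)
print(tokens)  # ['a', 'madmans', 'mercy', 'bid', 'me', 'o', 'here']
```

One side effect of this filter is that apostrophes are removed without leaving a space, so "madman's" becomes "madmans". This is why tokens such as "madmans" and "capulets" show up in the generated text.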


### Building Transition Probabilities

In [38]:
# Counting words
word_counts = {}
for w in words:
    if w in word_counts:
        word_counts[w] += 1
    else:
        word_counts[w] = 1

total_words = len(words)

# Counting bigrams
bigram_counts = {}
for i in range(len(words) - 1):
    current_w = words[i]
    next_w = words[i + 1]

    if current_w not in bigram_counts:
        bigram_counts[current_w] = {}

    if next_w in bigram_counts[current_w]:
        bigram_counts[current_w][next_w] += 1
    else:
        bigram_counts[current_w][next_w] = 1

# Building Transition Probabilities
transition_probs = {}

for current_w, next_words_dict in bigram_counts.items():
    transition_probs[current_w] = {}

    for next_w, count in next_words_dict.items():
        # P(B)
        p_next = word_counts[next_w] / total_words

        # P(A|B)
        p_current_given_next = count / word_counts[next_w]

        # P(A)
        p_current = word_counts[current_w] / total_words

        # Bayes' Theorem
        # (algebraically this simplifies to count / word_counts[current_w])
        bayes_prob = (p_current_given_next * p_next) / p_current
        transition_probs[current_w][next_w] = bayes_prob

print(f"Model trained on {len(transition_probs)} unique words.")

Model trained on 3756 unique words.
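Once the nested dictionary is built, it can be useful to inspect which successors the model considers most likely for a given word. The helper below is a small sketch of that idea. `top_successors` and `toy_probs` are hypothetical names, and `toy_probs` is a made-up miniature model in the same `{current_word: {next_word: probability}}` layout as `transition_probs`.

```python
def top_successors(transition_probs, word, k=3):
    """Return the k most likely next words for `word` as (word, probability) pairs."""
    successors = transition_probs.get(word, {})
    ranked = sorted(successors.items(), key=lambda item: item[1], reverse=True)
    return ranked[:k]

# Hypothetical miniature model, for illustration only.
toy_probs = {
    "my": {"lord": 0.5, "lady": 0.25, "love": 0.25},
    "lord": {"what": 1.0},
}

print(top_successors(toy_probs, "my", k=2))   # [('lord', 0.5), ('lady', 0.25)]
print(top_successors(toy_probs, "unknown"))   # [] because the word was never seen
```

With the real model, a call such as `top_successors(transition_probs, "romeo")` would show which words most often follow "romeo" in the corpus.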


### Text Generation

In [39]:
input_sentence = "the man said that"
num_words = 100
temperature = 0.7
output_list = input_sentence.strip().lower().split()
current_word = output_list[-1]

print("Generating text...")

for _ in range(num_words):
    # Fallback if word is unknown: use the most frequent word in the corpus
    if current_word not in transition_probs:
        next_word = max(word_counts, key=word_counts.get)

    else:
        # Accessing word probabilities
        possibilities = transition_probs[current_word]
        candidates = list(possibilities.keys())
        probs = np.array(list(possibilities.values()))

        # Temperature adjustment: p^(1/T), renormalised.
        # This is equivalent to softmax(log(p) / T).
        probs = np.power(probs, 1.0 / temperature)
        probs = probs / np.sum(probs)

        # Weighted selection
        next_word = random.choices(candidates, weights=probs, k=1)[0]

    output_list.append(next_word)
    current_word = next_word

Generating text...
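The generation loop can also be packaged as a function so that different seeds and temperatures are easy to compare. The following is a sketch under stated assumptions, not the notebook's own code: the names `generate`, `fallback_word`, and `toy_model` are hypothetical, and the optional `rng` argument is added here only to make runs reproducible.

```python
import random
import numpy as np

def generate(transition_probs, fallback_word, seed_text,
             num_words=20, temperature=1.0, rng=None):
    """Extend `seed_text` word by word using temperature-scaled bigram sampling."""
    if rng is None:
        rng = random.Random()
    output = seed_text.strip().lower().split()
    if not output:
        raise ValueError("seed_text must contain at least one word")
    current = output[-1]
    for _ in range(num_words):
        successors = transition_probs.get(current)
        if not successors:
            # Unknown word or no recorded successor: fall back to a fixed word.
            next_word = fallback_word
        else:
            candidates = list(successors)
            weights = np.array(list(successors.values()), dtype=float)
            weights = np.power(weights, 1.0 / temperature)  # temperature scaling
            weights = weights / weights.sum()
            next_word = rng.choices(candidates, weights=weights.tolist(), k=1)[0]
        output.append(next_word)
        current = next_word
    return output

# Hypothetical miniature model, for illustration only.
toy_model = {
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
    "sat": {"the": 1.0},
}

words_out = generate(toy_model, "the", "the", num_words=6,
                     temperature=0.7, rng=random.Random(0))
print(" ".join(words_out))
```

Passing a seeded `random.Random` makes the output repeatable, which helps when comparing temperatures. With the notebook's model, the fallback word would be the most frequent word in `word_counts`.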


### Displaying Output

In [40]:
generated_text = " ".join(output_list)

# Wrap the output at 10 words per line
final_output = ""
words_in_output = generated_text.split()
for i in range(0, len(words_in_output), 10):
    final_output += " ".join(words_in_output[i:i+10]) + "\n"

print("Final Output:\n")
print(final_output)

Final Output:

the man said that title romeo can see that tybalt
murdered doting not sir and in all the frowning night
romeo mercutio lets retire the wormwood on the sun o
here and lady capulet come knock and yet i not
a madmans mercy bid me alack alack alack my lord
what i might live therefore women may be the cords
that i will take it to pleading and all about
and in this of tybalt the winds thy pains farewell
ancient vault to in capulets house of the heavens do
protest which you to make haste friar lawrence these sad
burial feast tybalt mercutio

