Francesca Tuazon (BCS34)

**Problem Statement**: You are tasked with exploring the behavior of n-gram models in text generation. Specifically, you need to:

- *Implement an n-gram Model*: Build a Python program that constructs an n-gram model from a given sample text. The program should handle various values of n to create models ranging from bigrams to higher-order n-grams.

- *Generate Text with Different Inputs*: Use the n-gram model to generate text based on different sample inputs. Compare the generated text for various n-values and observe how the choice of n and input text affects the output.

- *Analyze the Results*: Examine the coherence and relevance of the generated text. Identify any patterns or issues related to different n-gram sizes and sample texts. Assess how well the generated text reflects the structure and style of the input text.

**Tasks**:
- *Create an n-gram Model*: Implement a Python script that builds an n-gram model from a provided text. The script should handle varying sizes of n-grams (e.g., bigrams, trigrams, and 4-grams).
- *Generate and Compare Texts*: Use the model to generate text sequences for different input texts. Experiment with different n-values and analyze how these affect the text output. Example inputs could be:
  - Text 1: "The quick brown fox jumps over the lazy dog."
  - Text 2: "Data science is an interdisciplinary field that uses scientific methods."
  - Generate text sequences with various n-values (e.g., trigram and 4-gram) for each input text.
- *Evaluate and Discuss*: Compare the quality of the generated text across different n-values. Consider aspects such as coherence, relevance, and adherence to the input text's style.
  - Discuss any observed patterns or anomalies in the generated text and suggest potential improvements to the n-gram model or text generation process.
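
As a minimal illustration of what the tasks above ask for, the sketch below lists the n-grams of Text 1 for n = 2, 3, and 4 using a sliding window. The helper name `extract_ngrams` and the whitespace tokenization are assumptions made for this example, not part of the assignment.

```python
def extract_ngrams(sentence, n):
    """Return all runs of n consecutive whitespace-separated tokens."""
    tokens = sentence.split()
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


sentence = "The quick brown fox jumps over the lazy dog."
for n in (2, 3, 4):
    grams = extract_ngrams(sentence, n)
    print(f"n={n}: {len(grams)} n-grams, first = {grams[0]}")
```

A sentence with k tokens yields k - n + 1 n-grams, so a larger n leaves fewer examples to learn from.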

In [4]:
import random
from collections import defaultdict

# Sample text for n-gram generation.
# A list (rather than a set) keeps the sentence order fixed, so the joined
# text, and therefore the model, is the same on every run.
text = [
    "It is only with the heart that one can see rightly; what is essential is invisible to the eye.",
    "It is the time you have wasted for your rose that makes your rose so important.",
    "I am looking for friends. What does that mean -- tame?",
    "If you love a flower that lives on a star, it is sweet to look at the sky at night.",
    "You see, one loves the sunset when one is so sad.",
    "To me, you will be unique in all the world. To you, I shall be unique in all the world.",
    "The land of tears is so mysterious."
]

# Context length: each key is a tuple of n consecutive words and each value is
# the list of words seen right after that tuple. With n = 2 the model conditions
# on two words, which is usually called a trigram model rather than a bigram model.
n = 2
ngrams = defaultdict(list)

# Join all sentences into a single string
all_text = " ".join(text)

# Split the string into words
words = all_text.split()

# Stop at len(words) - n so that words[i + n] always exists
for i in range(len(words) - n):
    gram = tuple(words[i:i + n])
    ngrams[gram].append(words[i + n])

# Get user input
user_input = input("Enter a word or phrase: ")
input_words = user_input.split()

# Find grams whose last word matches the last input word (case-sensitive).
# An empty input yields no matches instead of raising an IndexError.
matching_grams = (
    [gram for gram in ngrams if gram[-1] == input_words[-1]]
    if input_words else []
)

if matching_grams:
    current_gram = random.choice(matching_grams)
    result = list(current_gram)

    # Extend the suggestion by up to 10 words
    for _ in range(10):
        if current_gram in ngrams and ngrams[current_gram]:
            next_word = random.choice(ngrams[current_gram])
            result.append(next_word)
            current_gram = tuple(result[-n:])
        else:
            break

    print("Autocomplete suggestion:", " ".join(result))
else:
    print("No matching grams found")

Enter a word or phrase: it is
Autocomplete suggestion: It is the time you have wasted for your rose that makes


In [5]:
import random
import re
from collections import defaultdict

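def build_ngram_model(text, n):
    """Map each (n-1)-word context to a list of recorded continuations."""
    # Lowercase and tokenize on word characters, dropping punctuation
    words = re.findall(r'\b\w+\b', text.lower())
    model = defaultdict(list)
    for i in range(len(words) - n):
        context = tuple(words[i:i + n - 1])
        # NOTE: the stored continuation is words[i + n], so the word at
        # position i + n - 1 is skipped. This is why the recorded output
        # below drops words from the input. Storing words[i + n - 1] instead
        # would pair each context with the word that immediately follows it.
        model[context].append(words[i + n])
    return model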

def generate_text(model, start_words, num_words):
    """Extend start_words to at most num_words words using the model."""
    current_gram = tuple(start_words)
    result = list(start_words)
    for _ in range(num_words - len(start_words)):
        if current_gram in model:
            next_word = random.choice(model[current_gram])
            result.append(next_word)
            # Slide the context window forward by one word
            current_gram = tuple(result[-len(current_gram):])
        else:
            # Unseen context: stop early
            break
    return " ".join(result)

def main():
    texts = [
        "It is only with the heart that one can see rightly.",
        "It is the time you have wasted for your rose that makes your rose so important.",
        "I am looking for friends. What does that mean -- tame?",
        "If you love a flower that lives on a star, it is sweet to look at the sky at night.",
        "You see, one loves the sunset when one is so sad.",
        "To me, you will be unique in all the world. To you, I shall be unique in all the world.",
        "The land of tears is so mysterious."
    ]

    n_values = [2, 3, 4]

    for text in texts:
        print(f"\nInput text: {text}")
        for n in n_values:
            model = build_ngram_model(text, n)
            start_words = re.findall(r'\b\w+\b', text.lower())[:n-1]  # Use the first n-1 words as start
            if start_words:
                generated_text = generate_text(model, start_words, 20)
                print(f"{n}-gram generated text: {generated_text}")
            else:
                print(f"Not enough words to generate {n}-grams")


if __name__ == "__main__":
    main()


Input text: It is only with the heart that one can see rightly.
2-gram generated text: it only the that can rightly
3-gram generated text: it is with
4-gram generated text: it is only the

Input text: It is the time you have wasted for your rose that makes your rose so important.
2-gram generated text: it the you wasted your that your that your that your so
3-gram generated text: it is time
4-gram generated text: it is the you

Input text: I am looking for friends. What does that mean -- tame?
2-gram generated text: i looking friends does mean
3-gram generated text: i am for
4-gram generated text: i am looking friends

Input text: If you love a flower that lives on a star, it is sweet to look at the sky at night.
2-gram generated text: if love flower lives a it sweet look the at sky night
3-gram generated text: if you a
4-gram generated text: if you love flower

Input text: You see, one loves the sunset when one is so sad.
2-gram generated text: you one the when is sad
3-gram generate

# Discussion
The first code block produces an autocomplete suggestion from the user's input. The second builds a separate model for each sentence and each value of n and prints the text it generates, which helps show how the first block arrives at its suggestions.
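
To make that dissection concrete, the sketch below prints the table a next-word model is built from: each context of n-1 words mapped to the words that immediately follow it in the text. The helper `build_context_table` is a hypothetical name for this illustration; unlike the notebook's `build_ngram_model`, it stores the word directly after the context.

```python
import re
from collections import defaultdict


def build_context_table(text, n):
    """Map each (n-1)-word context to the words that directly follow it."""
    words = re.findall(r"\b\w+\b", text.lower())
    table = defaultdict(list)
    for i in range(len(words) - n + 1):
        context = tuple(words[i:i + n - 1])
        table[context].append(words[i + n - 1])
    return table


sample = "You see, one loves the sunset when one is so sad."
for context, followers in build_context_table(sample, 2).items():
    print(" ".join(context), "->", followers)
```

Here the context `one` has two recorded followers (`loves` and `is`), which is the only point where a bigram generator trained on this sentence has a real choice.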

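The quality of the generated text varies significantly with the value of **n**. Breaking the text into n-grams shows how the model forms its predictions: each step looks only at the previous few words. Lower n-values give the model less context, so the output is less coherent; the 2-gram run on the second sentence, for example, loops through "your that your that your". Higher n-values can also introduce repetition and unusual phrases, and in the outputs above the 3-gram and 4-gram runs stop after only three or four words because the context formed after the first generated word is not in the model. The generated text also skips words from the input (for example "it only the that can rightly"), which points to the model storing the word two positions after each context rather than the word that immediately follows it. Correcting that pairing and training on a larger combined text would be natural improvements.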
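
One way to quantify how much context each n provides is to count the contexts that have more than one distinct continuation, since these are the only places where generation can diverge from the training text. The sketch below does this for a toy corpus. The function name `ambiguous_contexts` and the corpus are illustrative assumptions, not taken from the notebook.

```python
from collections import defaultdict


def ambiguous_contexts(words, n):
    """Return the (n-1)-word contexts followed by more than one distinct word."""
    followers = defaultdict(set)
    for i in range(len(words) - n + 1):
        followers[tuple(words[i:i + n - 1])].add(words[i + n - 1])
    return {ctx: nxt for ctx, nxt in followers.items() if len(nxt) > 1}


corpus = "the cat sat on the mat the cat ran".split()
for n in (2, 3, 4):
    print(f"n={n}: {len(ambiguous_contexts(corpus, n))} ambiguous contexts")
```

On this corpus the count falls from 2 to 1 to 0 as n grows. With more context the model has fewer choices, so its output stays closer to the source, but once every context has a single continuation it can only reproduce the training text.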