An **n-gram** is a contiguous run of n items (words, characters, or any other tokens) taken from a longer sequence. An **n-gram score** measures how often a specific n-gram appears in a given body of text or data. A language model (like the one that predicts text on your phone) relies on counts like these to decide which word is likely to come next, so the score gives a rough picture of how much "predictive power" the data lends to a particular sequence.

Think of it like this: if a language model sees the phrase "I went to the store and bought..." it looks at all the words that follow "bought" in its training data. If "milk" very often follows "bought," the model gives "bought milk" a high probability, or a high "n-gram score." The higher the score for a specific n-gram, the more likely it is to appear in a natural sentence. The same counting works on any ordered sequence, not only words: the code cell In [1] applies it to a short series of daily stock prices.
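The "bought ... milk" intuition above can be sketched in a few lines. This is only an illustration: the toy corpus and the helper `next_word_counts` are made up for this example and are not part of the notebook's own code.

```python
from collections import Counter

# Hypothetical toy corpus, invented purely for illustration
corpus = (
    "i went to the store and bought milk . "
    "she bought milk and bread . "
    "he bought bread yesterday . "
    "we bought milk again ."
).split()

def next_word_counts(tokens, context_word):
    """Count which words directly follow context_word in tokens."""
    return Counter(
        nxt for cur, nxt in zip(tokens, tokens[1:]) if cur == context_word
    )

followers = next_word_counts(corpus, "bought")
total = sum(followers.values())

# Relative frequency of each follower, i.e. an estimate of P(word | "bought")
for word, count in followers.most_common():
    print(f"bought -> {word}: {count}/{total} = {count / total:.2f}")
```

In this toy corpus "milk" follows "bought" three times out of four, so the bigram ("bought", "milk") gets the highest score.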

In [1]:
from collections import Counter
import pandas as pd

# Creating a sample time series dataset (daily stock prices)
data = {'Date': pd.to_datetime(['2025-01-01', '2025-01-02', '2025-01-03', '2025-01-04', '2025-01-05', '2025-01-06', '2025-01-07', '2025-01-08', '2025-01-09', '2025-01-10']),
        'Price': [100, 102, 101, 103, 105, 104, 106, 105, 104, 106]}
df = pd.DataFrame(data)

# Convert prices to a list
prices = df['Price'].tolist()

def generate_ngrams(data, n):
    """
    Generates n-grams (as tuples) from a list of data points.
    Returns an empty list if the data is shorter than n.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    ngrams = []
    # Slide a window of length n across the list, one position at a time
    for i in range(len(data) - n + 1):
        # Slice the list to get a contiguous sequence of n items
        ngrams.append(tuple(data[i:i + n]))
    return ngrams

def calculate_ngram_scores(data, n):
    """
    Calculates the frequency (score) of each n-gram.
    """
    # Generate the n-grams
    ngrams = generate_ngrams(data, n)
    # Count the frequency of each n-gram
    ngram_scores = Counter(ngrams)
    return ngram_scores

# Let's generate and score 3-grams
trigram_scores = calculate_ngram_scores(prices, 3)

print("Original Data:", prices)
print("\n3-gram Scores (Frequency):")
for trigram, score in trigram_scores.items():
    print(f"{trigram}: {score}")

Original Data: [100, 102, 101, 103, 105, 104, 106, 105, 104, 106]

3-gram Scores (Frequency):
(100, 102, 101): 1
(102, 101, 103): 1
(101, 103, 105): 1
(103, 105, 104): 1
(105, 104, 106): 2
(104, 106, 105): 1
(106, 105, 104): 1


# Assignment # 1

Change the n value: Modify the calculate_ngram_scores function call to use n=2 (bigrams) and then n=4. What happens to the scores? Why do you think some scores might decrease as n gets larger?

In [2]:
# Your code here

# Assignment # 2
Probability vs. Frequency: The code above uses a simple frequency count. For a more sophisticated n-gram score, you could calculate the conditional probability. For example, for a trigram like (101, 103, 105), the score could be P(105 | 101, 103), which is the probability of 105 appearing right after the pair 101, 103. How would you modify the Python code to calculate this type of score? Hint: You'll need to count the frequency of the full n-gram and the frequency of the (n-1)-gram that forms its first n-1 items (here, the bigram (101, 103)).
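
The hint above amounts to a ratio of two counts. As a sketch, writing $C(\cdot)$ for "number of times this sequence occurs in the data" (a notation introduced here for illustration, not used elsewhere in the notebook), the usual count-based estimate is:

$$
P(x_n \mid x_1, \dots, x_{n-1}) = \frac{C(x_1, \dots, x_{n-1}, x_n)}{C(x_1, \dots, x_{n-1})}
$$

For instance, in the price list above the bigram (104, 106) occurs twice and the trigram (104, 106, 105) occurs once, so $P(105 \mid 104, 106) = 1/2$. Note that the second occurrence of (104, 106) sits at the very end of the series and has no following value; whether to count it in the denominator is a design choice you will need to make.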

In [3]:
# Your code here

# Assignment # 3
Real-World Application: Imagine a dataset of daily average temperatures for your city over a few years. How could you use n-gram scores to find and predict seasonal patterns? For example, what would a high-scoring bigram (T1, T2) tell you about consecutive temperatures?

In [None]:
# Your code here