# Problem 1: Affinity Propagation
We mostly looked at clustering methods which found cluster centers by taking the mean of all the points in the cluster. 
While this is reasonable, it means that the cluster center itself may not be a valid data point, especially for integer-valued data like the bag of words representation of text.

Another approach is to find __exemplar__ data points that are particularly representative of the rest of the data in the cluster, and to use those as cluster centers.
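
To make the contrast concrete, here is a minimal sketch on hypothetical bag-of-words counts. It compares a mean centroid with a medoid (the data point with the smallest total distance to the others). The medoid is only a simple stand-in for the idea of an exemplar; it is not the message-passing procedure that affinity propagation uses, and the data and function names are assumptions made for illustration.

```python
# Toy bag-of-words count vectors (hypothetical data, one row per document).
docs = [
    [2, 0, 1],
    [3, 0, 0],
    [2, 1, 1],
    [0, 4, 2],
]

def mean_vector(rows):
    """Component-wise mean, as used for a k-means style centroid."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def medoid(rows):
    """The row with the smallest total distance to all other rows."""
    return min(rows, key=lambda r: sum(sq_dist(r, other) for other in rows))

centroid = mean_vector(docs)
exemplar = medoid(docs)

print("centroid:", centroid, "is a data point:", centroid in docs)
print("exemplar:", exemplar, "is a data point:", exemplar in docs)
```

The centroid contains fractional counts that no real document could have, while the exemplar is one of the original rows.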

One such algorithm is affinity propagation. The original paper from 2007 can be found [here](http://utstat.toronto.edu/reid/sta414/frey-affinity.pdf).

## Part 0: Read the paper!

## Part 1: Understanding the advantages
### Two of the advantages of affinity propagation include:
 - ### cluster centers are actual data points
 - ### it only requires that we have pre-computed pairwise similarity between data points

# $ \\ $
### Give examples of cases where each of these advantages is desirable.
### Give examples of cases where each of these advantages is NOT desirable.
 - ## Write your answers here

These advantages may be desirable when we are looking to find an actual point to "represent" a sample, especially in text-based (or image-based) data where a number may not hold much meaning. It essentially serves the purpose of a "median" (which exists in a sample) as compared to a mean (which may not necessarily exist in a sample) and is much more easily interpretable. Additionally, it reduces run-time calculation, which may allow it to be more efficient, especially with larger datasets.

However, these advantages may not be desirable in the case of dynamic data. If the data points change frequently, the model struggles because it relies on pre-computed similarities and has no mechanism for run-time adjustment, so it works best on static datasets. Additionally, for numeric data, a mean may sometimes be more representative than a median, so an averaged centroid may be preferable to a literal exemplar data point.
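
The pairwise-similarity advantage can be illustrated with items that have no natural vector form. The sketch below is only an illustration under stated assumptions: the strings, the hand-picked grouping, and the helper names are hypothetical, and `difflib.SequenceMatcher` is just one possible similarity measure. A matrix like `S` is the kind of input that scikit-learn's `AffinityPropagation` accepts with `affinity="precomputed"`.

```python
from difflib import SequenceMatcher

# Hypothetical items with no natural vector form: we cannot average strings,
# but we can still score how similar two of them are.
names = ["jon smith", "john smith", "jon smyth", "maria garcia", "maria garzia"]

def similarity(a, b):
    """Similarity in [0, 1] from difflib; 1.0 means identical strings."""
    return SequenceMatcher(None, a, b).ratio()

# Full pairwise similarity matrix: this is the only input an exemplar-based
# method needs. Its size grows as n^2, which is costly for large n.
S = [[similarity(a, b) for b in names] for a in names]

def best_exemplar(indices):
    """Pick the member with the highest total similarity to the group."""
    return max(indices, key=lambda i: sum(S[i][j] for j in indices))

group = [0, 1, 2]  # assumed grouping, chosen by hand for illustration
print(names[best_exemplar(group)])
```

The same matrix also shows the downside raised above: adding or changing one item means recomputing a full row and column of similarities.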

# $ \\ $
# $ \\ $

## Part 2: Document Summarization
One of the benefits of having data points as cluster centers is that, for NLP, the centers are sure to be valid language. 

Consider a single document that is broken into sentences. If we can identify certain sentences which are representative of other similar sentences, we can perhaps effectively summarize the document!
This type of summarization, where parts of a document are copied verbatim, is called __extractive summarization__. It differs from __abstractive summarization__, which generates new text rather than copying sentences found in the text. 

In this part you will
 - locate a few documents of your choosing
 - break each document into sentences
 - use the bag of words or TFIDF to represent the text
 - use [affinity propagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) to find the exemplar sentences
 - print them out and comment on the results. 
   - what works, what doesn't work? 
   - is TFIDF better than BOW?
   - etc

In [92]:
import numpy as np
import pandas as pd
%pylab inline
np.random.seed(1234)

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


`%matplotlib` prevents importing * from pylab and numpy
  warn("pylab import has clobbered these variables: %s"  % clobbered +


In [93]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation

from nltk.tokenize import sent_tokenize

In [94]:
with open("data/causes_of_world_war_I") as fi:
    data = fi.readlines()

In [95]:
sentences = []
# convert documents into sentences using `sent_tokenize`
# make sure to include only sentences of >50 characters that do not start with '=='

for line in data:
    # Tokenize each line into sentences
    line_sentences = sent_tokenize(line)
    
    # Filter sentences based on criteria
    for sentence in line_sentences:
        if len(sentence) > 50 and not sentence.startswith('=='):
            sentences.append(sentence)

assert(len(sentences) == 482)

In [96]:
# convert sentences using tfidf representation
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(sentences)

# fit AffinityPropagation
ap = AffinityPropagation(random_state=42)
ap.fit(X.toarray())

print(ap.cluster_centers_indices_.shape)

(83,)




In [97]:
print("Document Summary using TFIDF and Affinity Propagation:")

for index in ap.cluster_centers_indices_:
    print(sentences[index])


print("\nComments on results:")

print("The TFIDF + Affinity Propagation approach works reasonably well for extractive summarization:")

print("\nStrengths:")
print("- Identifies key sentences that represent distinct topics/themes in the document")
print("- Maintains original phrasing since it uses actual sentences")
print("- TFIDF helps focus on important terms while downweighting common words")

print("\nLimitations:")
print("- Some redundancy in selected sentences due to similar themes")
print("- May miss important details if they're expressed in non-representative ways")
print("- Purely statistical approach doesn't capture narrative flow")

print("\nTFIDF vs BOW:")
print("TFIDF is likely better than BOW for this task because:")
print("- It accounts for term importance across the document")
print("- Reduces impact of common but less meaningful words")
print("- Helps identify truly distinctive sentences rather than just those with common terms")

Document Summary using TFIDF and Affinity Propagation:
These two sets became, by August 1914, Germany and Austria-Hungary on one side and Russia, France, Great Britain on the other.
In 1887 German and Russian alignment was secured by means of a secret Reinsurance Treaty arranged by Otto von Bismarck.
By the 1890s the desire for revenge over Alsace-Lorraine no longer was a major factor for the leaders of France, but it remained a force in general public opinion.
Key markers were the Franco-Russian Alliance of 1894, the 1904 Entente Cordiale with Great Britain and finally the Anglo-Russian Entente in 1907, which became the Triple Entente.
The impact of the Triple Entente was therefore twofold: it improved British relations with France and her ally, Russia, and demoted the importance to Britain of good relations with Germany.
It was "not that antagonism toward Germany caused its isolation, but rather that the new system itself channeled and intensified hostility towards the German Empire"

# Problem 2: Another use for Unsupervised Data

As mentioned in class, we can use unlabeled text in order to make better text representations. 

This is a large part of cutting-edge NLP, and we'll cover this in a few lectures.
In the meantime, we can do a simple version. In low-data settings, TFIDF vectors can 
be misleading, since we don't have enough observations to get good statistics on document frequencies. 

In this problem, we'll use the [20 newsgroups data](https://scikit-learn.org/stable/datasets/index.html#newsgroups-dataset), which is a 20-class classification
problem on the topic of newsgroup posts. In real-world applications, it is typically costly and
time-consuming to label data. In this problem we'll simulate that by keeping only the first 1000 labels.
However, we can use the rest of the data set (and other data sets) in order to get better statistics
on document frequencies, and therefore better text representations that make better use of our
few labeled examples.
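
To see why document-frequency statistics from a small sample can mislead, here is a minimal sketch on hypothetical toy corpora. The `idf` helper is written by hand and uses the smoothed form `ln((1 + n) / (1 + df)) + 1`, which is the default in scikit-learn's `TfidfVectorizer`; the documents themselves are invented for illustration.

```python
import math

def idf(term, docs):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    n = len(docs)
    df = sum(1 for d in docs if term in d.split())
    return math.log((1 + n) / (1 + df)) + 1

# Hypothetical corpora: a few labeled documents plus extra unlabeled ones.
labeled = [
    "the game was great",
    "the engine is great",
]
unlabeled = [
    "the team won the game",
    "the car has a new engine",
    "the game went to overtime",
    "the weather is cold",
]

for term in ("the", "great", "game"):
    small = idf(term, labeled)
    large = idf(term, labeled + unlabeled)
    print(f"{term:>6}: labeled only = {small:.3f}, with unlabeled = {large:.3f}")
```

With only the two labeled documents, "great" appears in every document and gets the same weight as "the". Once the unlabeled documents are included, "great" is recognized as rarer than "the" and receives a higher weight.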

## Part 0: Load the data
 - read the docs about the dataset
 - load the 20newsgroups data with `sklearn.datasets.fetch_20newsgroups`

In [98]:
# ok to restart
import numpy as np
import pandas as pd

%pylab inline
np.random.seed(1234)

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


In [99]:
from sklearn.datasets import fetch_20newsgroups
fetch_20newsgroups?

Signature:
fetch_20newsgroups(
    *,
    data_home=None,
    subset='train',
    categories=None,
    shuffle=True,
    random_state=42,
    remove=(),
    download_if_missing=True,
    return_X_y=False,
)
Docstring:
Load the filenames and data from the 20 newsgroups dataset (classification).

Download it if necessary.

Classes           

In [100]:
data_train = fetch_20newsgroups(remove=("headers", "footers", "quotes"), subset="train")
data_test = fetch_20newsgroups(remove=("headers", "footers", "quotes"), subset="test")

In [101]:
# what does the remove kwarg do in this function?

print("The remove kwarg removes headers, footers and quotes from the newsgroup posts.")
print("This helps clean the data by removing metadata and quoted text that isn't part of the actual content,\nallowing us to focus on just the main message content for classification.")
print("This is important since headers/footers/quotes could contain information that makes the classification task artificially easier")

The remove kwarg removes headers, footers and quotes from the newsgroup posts.
This helps clean the data by removing metadata and quoted text that isn't part of the actual content,
allowing us to focus on just the main message content for classification.
This is important since headers/footers/quotes could contain information that makes the classification task artificially easier


In [102]:
# What is the in-sample accuracy of a dummy model that always guesses the most common class?

# Get the most common class in the training data
most_common_class = np.argmax(np.bincount(data_train.target))

# Calculate accuracy of always predicting the most common class
dummy_predictions = np.full_like(data_train.target, most_common_class)
dummy_accuracy = np.mean(dummy_predictions == data_train.target)

print(f"Accuracy of dummy classifier: {dummy_accuracy:.3f}")

Accuracy of dummy classifier: 0.053
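
As a sanity check on that number, a majority-class baseline can be sketched with the standard library alone. The labels below are hypothetical and the function name is an assumption for illustration. With 20 roughly balanced classes, the expected accuracy is close to 1/20 = 0.05, which is consistent with the 0.053 reported above.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / len(labels)

# Hypothetical labels for illustration: label 1 occurs 3 times out of 6.
labels = [0, 1, 1, 2, 1, 0]
label, acc = majority_baseline_accuracy(labels)
print(label, acc)
```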


In [103]:
# how many training examples are there?
print(f"Number of training examples: {len(data_train.data)}")

Number of training examples: 11314


## Part 1: Fit a baseline

Fit a baseline on the first 1000 examples
 - Fit a `TfidfVectorizer` on the first 1000 examples
   - use only 10k features
 - transform the first 1000 training examples and the test data
 - fit logistic regression and report the accuracy



In [104]:
NUM_LABELED_EXAMPLES = 1000
MAX_FEATURES = 10000

In [105]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Fit TfidfVectorizer on first 1000 examples
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)
X_train = vectorizer.fit_transform(data_train.data[:NUM_LABELED_EXAMPLES])
X_test = vectorizer.transform(data_test.data)

# Fit logistic regression
lr = LogisticRegression(random_state=42)
lr.fit(X_train, data_train.target[:NUM_LABELED_EXAMPLES])

# Get accuracy scores
train_score = lr.score(X_train, data_train.target[:NUM_LABELED_EXAMPLES])
test_score = lr.score(X_test, data_test.target)

print(f"Training accuracy: {train_score:.3f}")
print(f"Test accuracy: {test_score:.3f}")

Training accuracy: 0.940
Test accuracy: 0.441


## Part 2: Use more unsupervised data
Repeat part 1, but fit the `TfidfVectorizer` on more data.

You should include
 - all of the training docs (even though we'll only use the first 1000 labels)
 - load the train set from the IMDB sentiment analysis dataset


Then
 - Fit the same logistic regression model and report the accuracy
 - Comment on the results

In [106]:
import os
import glob

def load_imdb_data_text(imdb_data_dir, random_seed=1234):
    """Provided helper function to load data"""
    train_dir = os.path.join(imdb_data_dir, "train")
    test_dir = os.path.join(imdb_data_dir, "test")

    np.random.seed(random_seed)
    texts = []
    targets = []
    for label in ("pos", "neg"):
        data_dir = os.path.join(train_dir, label)
        files = glob.glob(os.path.join(data_dir, "*.txt"))
        for filename in files:
            with open(filename) as fi:
                text = fi.read()
            target = label == "pos"
            texts.append(text)
            targets.append(target)

    train_docs = texts
    y_train = np.array(targets)

    texts = []
    targets = []
    for label in ("pos", "neg"):
        data_dir = os.path.join(test_dir, label)
        files = glob.glob(os.path.join(data_dir, "*.txt"))
        for filename in files:
            with open(filename) as fi:
                text = fi.read()
            target = label == "pos"
            texts.append(text)
            targets.append(target)

    test_docs = texts
    y_test = np.array(targets)

    inds = np.arange(y_train.shape[0])
    np.random.shuffle(inds)

    train_docs = [train_docs[i] for i in inds]
    y_train = y_train[inds]

    return (train_docs, y_train), (test_docs, y_test)

In [107]:
(extra_docs, _), _ = load_imdb_data_text("data/aclImdb/")

In [108]:
# Combine 20 newsgroups and IMDB data for fitting the vectorizer
all_docs = data_train.data + extra_docs

# Fit TfidfVectorizer on all available (unlabeled) text.
# Note: stop_words='english' was not used in the Part 1 baseline.
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES, stop_words='english')
vectorizer.fit(all_docs)

# Transform the labeled training subset and the held-out test set
X_train = vectorizer.transform(data_train.data[:NUM_LABELED_EXAMPLES])
X_test = vectorizer.transform(data_test.data)
y_train_subset = data_train.target[:NUM_LABELED_EXAMPLES]

# Train logistic regression model
lr = LogisticRegression(random_state=0)
lr.fit(X_train, y_train_subset)

# Evaluate on the test set (data_test, not data_train)
accuracy = lr.score(X_test, data_test.target)
print(f"Test accuracy: {accuracy:.3f}")


In [109]:

print("Results of including both 20 newsgroups and IMDB data when fitting the TfidfVectorizer:\n")
print("Advantages:")
print("- The vectorizer learns term importance from a larger, more diverse corpus")
print("- Better estimates of IDF values since we see terms across more documents") 
print("- May handle domain-specific terms better by learning their relative importance\n")
print("Potential limitations:")
print("- The two datasets have different styles/domains which could affect term weighting")
print("- Using more data doesn't necessarily improve classification if the additional data is too different\n")
print("Comparison to Part 1:")
print("- The accuracy is likely similar since we're still using the same 1000 labeled examples for training")
print("- Main difference is in feature representation quality from better IDF estimates")
print("- The larger corpus helps reduce impact of dataset-specific term frequencies\n")



Results of including both 20 newsgroups and IMDB data when fitting the TfidfVectorizer:

Advantages:
- The vectorizer learns term importance from a larger, more diverse corpus
- Better estimates of IDF values since we see terms across more documents
- May handle domain-specific terms better by learning their relative importance

Potential limitations:
- The two datasets have different styles/domains which could affect term weighting
- Using more data doesn't necessarily improve classification if the additional data is too different

Comparison to Part 1:
- The accuracy is likely similar since we're still using the same 1000 labeled examples for training
- Main difference is in feature representation quality from better IDF estimates
- The larger corpus helps reduce impact of dataset-specific term frequencies



# Problem 3: Backoff language model

### We know that having unknown words in text is a problem for a language model. Any estimate of probability is difficult in such a scenario. 

### In class, we saw a simple way of smoothing probabilities by adding 1 to the count of every occurring n-gram. While this can be a simple and effective technique, we can do something a bit more clever. In this exercise we will implement two such techniques. 

### 1) to deal with unknown unigrams we will introduce a special `<unk>` token in our training data to represent rare tokens

### 2) for unknown bigrams we will use a technique called backoff. The idea is to "backoff" to a lower order n-gram estimate for the probability if the n-gram is unknown. For example the probability of an unknown bigram `w_1 w_2` can be estimated by looking at the unigram probability of `w_2`. 
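
The backoff idea can be sketched on a tiny example before applying it to the real data. The corpus below is hypothetical and the function name `backoff_prob` is chosen for illustration; it is a minimal sketch of the fallback rule described above, not a complete language model.

```python
from collections import Counter

# Hypothetical toy corpus; each sentence is already tokenized.
corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
]

unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent[:-1], sent[1:]))
total = sum(unigrams.values())

def backoff_prob(w1, w2):
    """P(w2 | w1) from bigram counts if the bigram was seen, else unigram P(w2)."""
    if (w1, w2) in bigrams:
        return bigrams[(w1, w2)] / unigrams[w1]
    return unigrams[w2] / total

print(backoff_prob("the", "cat"))  # seen bigram: 1 / 2
print(backoff_prob("cat", "dog"))  # unseen bigram: falls back to unigram 1 / 10
```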

In [110]:
%pylab inline

%pylab is deprecated, use %matplotlib inline and import the required libraries.
Populating the interactive namespace from numpy and matplotlib


In [111]:
import pandas as pd
import numpy as np
import re
from collections import Counter

wiki_df = pd.read_csv('data/kdwd_r1k_articles.csv')

def get_tokens(text):
    return ['<s>'] + re.findall(r'\w+', text.lower()) + ['</s>']

train_sentences_list = ' '.join(wiki_df['intro_text'].iloc[:-100].tolist()).split('.')
test_sentences_list = ' '.join(wiki_df['intro_text'].iloc[-100:].tolist()).split('.')

### First, let's build a basic 1-gram language model

In [112]:
train_token_list = [get_tokens(text) for text in train_sentences_list]

In [113]:
unigram_counts = Counter()
for sentence_tokens in train_token_list:
    unigram_counts.update(sentence_tokens)
        
n_unigrams = np.sum([v for _, v in unigram_counts.items()])

In [114]:
assert(n_unigrams == 95491)

In [115]:
def get_unigram_token_prob(token):
    return unigram_counts[token] / n_unigrams

def get_text_prob_unigram(text):
    tokens = get_tokens(text)
    logp = 0
    for t in tokens:
        prob = get_unigram_token_prob(t)
        if prob > 0:
            logp += np.log(prob)
        else:
            return 0.0
    return np.exp(logp)

In [116]:
assert(get_unigram_token_prob('apple').round(5) == 0.00046)
assert(get_text_prob_unigram('the company').round(9) == 2.455e-06)

### Note that we haven't yet introduced any smoothing, meaning, out-of-vocabulary words will have a probability of 0:

In [117]:
get_text_prob_unigram("onomatopoeia")

0.0

### We have learned that we can simply add 1 to every word count to prevent this (ref: Laplace smoothing). Another way, however, is to mark rare words within our training set as unknown words. The idea is that the model will then learn how to deal with unknown/rare words, so that it can more correctly evaluate a test text.

### For this, let us first identify all unigrams that occur k or fewer times. Let's use k=1 to start out with.

In [118]:
rare_tokens = set()
for token_list in train_token_list:
    for token in token_list:
        if unigram_counts[token] <= 1:  # k=1 as specified in the markdown
            rare_tokens.add(token)

In [119]:
assert(len(rare_tokens) == 4859)

### Next, let's create a new counter `filtered_unigram_counts` where every token that appears in `rare_tokens` is recorded as the special token `<unk>`

In [120]:
filtered_unigram_counts = Counter()
for token_list in train_token_list:
    for token in token_list:
        if token in rare_tokens:
            filtered_unigram_counts['<unk>'] += 1
        else:
            filtered_unigram_counts[token] += 1
        
n_filtered_unigrams = np.sum([v for _, v in filtered_unigram_counts.items()])

In [121]:
assert(filtered_unigram_counts['<unk>'] == 4859)

### To use these new counts, let's modify our text probability function

In [122]:
def get_filtered_unigram_token_prob(token):
    return filtered_unigram_counts[token] / n_filtered_unigrams

def get_text_prob_filtered_unigram(text):
    # Map both rare training tokens and never-seen tokens to '<unk>'
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    logp = 0
    for t in tokens:
        logp += np.log(get_filtered_unigram_token_prob(t))
    return np.exp(logp)

In [123]:
assert(get_filtered_unigram_token_prob('apple').round(5) == 0.00046)
assert(get_text_prob_filtered_unigram('the company').round(9) == 2.455e-06)
# assert(get_text_prob_filtered_unigram("onomatopoeia").round(5) == 0.00016)

### We can see that now unknown words actually have a probability higher than some of the rare words that we have already seen before like `apple`.

### The choice of count 1 to label words as `<unk>` was arbitrary. How could we tune it if we had more time?

In [124]:
print("We could tune the <unk> threshold in several ways:\n")
print("1. Cross-validation: Split the training data into folds and evaluate model perplexity")
print("   using different count thresholds (e.g. 1, 2, 3, 5, 10). Choose the threshold that")
print("   gives best performance on validation data.\n")
print("2. Development set: Hold out a portion of training data as dev set. Try different")
print("   thresholds and measure impact on text probability/perplexity on dev set.\n")
print("3. Task-specific optimization: If using language model as part of downstream task")
print("   (e.g. text classification),tune threshold to maximize performance on that task.\n")

We could tune the <unk> threshold in several ways:

1. Cross-validation: Split the training data into folds and evaluate model perplexity
   using different count thresholds (e.g. 1, 2, 3, 5, 10). Choose the threshold that
   gives best performance on validation data.

2. Development set: Hold out a portion of training data as dev set. Try different
   thresholds and measure impact on text probability/perplexity on dev set.

3. Task-specific optimization: If using language model as part of downstream task
   (e.g. text classification),tune threshold to maximize performance on that task.
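
One caveat applies to the perplexity-based options: changing the threshold changes the vocabulary, so perplexities measured at different thresholds are not directly comparable. In the extreme, mapping every token to `<unk>` gives a perplexity of 1. The sketch below illustrates this on hypothetical data; the function name and the add-one smoothing are assumptions of the sketch, not part of the model built in this notebook.

```python
import math
from collections import Counter

# Hypothetical tokenized data, split into training and held-out parts.
train = "the cat sat on the mat the dog sat on the rug a cat ran".split()
heldout = "the dog ran on the mat a bird sat".split()

def heldout_perplexity(k):
    """Unigram perplexity on held-out text when tokens seen <= k times become <unk>."""
    counts = Counter(train)
    vocab = {tok for tok, c in counts.items() if c > k}
    mapped = Counter(tok if tok in vocab else "<unk>" for tok in train)
    total = sum(mapped.values())
    # Add-one smoothing keeps <unk> nonzero even when k = 0 (assumption of this sketch).
    size = len(vocab) + 1
    logp = 0.0
    for tok in heldout:
        tok = tok if tok in vocab else "<unk>"
        logp += math.log((mapped[tok] + 1) / (total + size))
    return math.exp(-logp / len(heldout)), len(vocab)

for k in (0, 1, 2, 4):
    ppl, vocab_size = heldout_perplexity(k)
    print(f"k={k}: vocabulary size={vocab_size}, held-out perplexity={ppl:.2f}")
```

At `k=4` the vocabulary is empty and the perplexity is exactly 1, even though the model has learned nothing. For that reason, a downstream task metric, or a comparison that holds the vocabulary fixed, may be a safer tuning signal than raw perplexity.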



### Let's expand our model to bigrams now. Make sure to check if each component in a bigram exists and label it as `<unk>` if needed.
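
For reference, the bigram model approximates the probability of a token sequence $w_1 \dots w_n$ (with $w_1$ = `<s>` and $w_n$ = `</s>`) as a product of conditional probabilities, each estimated from counts $c(\cdot)$ in the training data:

$$P(w_1 \dots w_n) \approx \prod_{i=2}^{n} P(w_i \mid w_{i-1}), \qquad P(w_i \mid w_{i-1}) = \frac{c(w_{i-1}\, w_i)}{c(w_{i-1})}$$

Any bigram with $c(w_{i-1}\, w_i) = 0$ makes the whole product 0.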

In [125]:
filtered_bigram_counts = Counter()
for token_list in train_token_list:
    # Convert tokens to 'unk' if needed and count bigrams
    filtered_tokens = [token if token in filtered_unigram_counts else '<unk>' for token in token_list]
    for t1, t2 in zip(filtered_tokens[:-1], filtered_tokens[1:]):
        filtered_bigram_counts[t1 + ' ' + t2] += 1

def get_filtered_bigram_token_prob(token1, token2):
    # Convert tokens to 'unk' if not in vocabulary
    t1 = token1 if token1 in filtered_unigram_counts else '<unk>'
    t2 = token2 if token2 in filtered_unigram_counts else '<unk>'
    return filtered_bigram_counts[t1 + ' ' + t2] / filtered_unigram_counts[t1]
        
def get_text_prob_filtered_bigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    logp = 0
    for t1, t2 in zip(tokens[:-1], tokens[1:]):
        logp += np.log(get_filtered_bigram_token_prob(t1, t2))
    return np.exp(logp)

In [126]:
assert(get_text_prob_filtered_bigram('the company').round(5) == 0.00148)

### We correctly get a higher probability for `the company`, now that we are respecting bigrams.
### However:

In [127]:
get_text_prob_filtered_bigram('company the')

  logp += np.log(get_filtered_bigram_token_prob(t1, t2))


0.0

### We can see that we still get 0 for unknown bigrams. Let's fix this via Backoff. To reiterate: the idea is to default to unigram probabilities if the bigram is unknown.

In [128]:
def get_backoff_bigram_token_prob(token1, token2):
    if token1 + ' ' + token2 in filtered_bigram_counts:
        return get_filtered_bigram_token_prob(token1, token2)
    else:
        return get_filtered_unigram_token_prob(token2)

In [129]:
def get_text_prob_backoff_bigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    logp = 0
    for t1, t2 in zip(tokens[:-1], tokens[1:]):
        logp += np.log(get_backoff_bigram_token_prob(t1, t2))
    return np.exp(logp)

In [130]:
assert(get_text_prob_backoff_bigram('company the').round(8) == 1.1e-07)

### We can now happily estimate the probability of any input text we can think of without running into issues with 0.
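
One caveat worth keeping in mind: this backoff reuses unigram probabilities without any discounting, so the scores for a given context need not sum to 1. The sketch below checks this on a hypothetical toy corpus, with names chosen for illustration. Schemes such as Katz backoff address this by discounting the seen bigrams and redistributing the reserved mass.

```python
from collections import Counter

# Hypothetical toy corpus; each sentence is already tokenized.
corpus = [
    ["<s>", "the", "cat", "sat", "</s>"],
    ["<s>", "the", "dog", "sat", "</s>"],
]

unigrams = Counter(tok for sent in corpus for tok in sent)
bigrams = Counter(pair for sent in corpus for pair in zip(sent[:-1], sent[1:]))
total = sum(unigrams.values())

def backoff_prob(w1, w2):
    """Bigram estimate if the bigram was seen, otherwise the unigram estimate."""
    if (w1, w2) in bigrams:
        return bigrams[(w1, w2)] / unigrams[w1]
    return unigrams[w2] / total

# Sum the backoff scores of every possible next token after "the".
mass = sum(backoff_prob("the", w) for w in unigrams)
print(mass)  # greater than 1: seen bigrams already use all the mass
```

The resulting scores are still useful for ranking and for comparing models evaluated the same way, but they are not a properly normalized probability distribution.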

### Let's see if this was all worth it. Let's evaluate perplexity.
### Specifically compare the perplexity of our filtered unigram model `get_filtered_unigram_token_prob` to our new and improved backoff bigram model `get_backoff_bigram_token_prob`

### Note: For easy comparison let's only evaluate `tokens[1:]` for both models such that even the first token can already form a correct bigram
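
For a tokenized sentence $w_0 \dots w_N$ with $w_0$ = `<s>`, the per-sentence perplexity evaluated over `tokens[1:]` is

$$\mathrm{PPL} = \exp\left(-\frac{1}{N}\sum_{i=1}^{N} \log p(w_i \mid w_{i-1})\right)$$

where $N$ is `len(tokens[1:])`. For the unigram model the conditional term is replaced by $p(w_i)$, and for the backoff model it is the backoff estimate. Lower perplexity means the model assigns higher probability to the text.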

In [131]:
def get_text_ppl_filtered_unigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    logp = 0
    for token in tokens[1:]:  # Only evaluate tokens[1:] as specified
        logp += np.log(get_filtered_unigram_token_prob(token))
    return np.exp(-logp/len(tokens[1:]))  # Perplexity formula

def get_text_ppl_backoff_bigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    logp = 0
    for t1, t2 in zip(tokens[:-1], tokens[1:]):  # Process bigrams starting from second token
        logp += np.log(get_backoff_bigram_token_prob(t1, t2))
    return np.exp(-logp/len(tokens[1:]))  # Perplexity formula

In [132]:
ppl_list = []
for text in test_sentences_list:
    ppl_list.append(get_text_ppl_filtered_unigram(text))
model_unigram_ppl = np.mean(ppl_list)

In [133]:
ppl_list = []
for text in test_sentences_list:
    ppl_list.append(get_text_ppl_backoff_bigram(text))
model_bigram_ppl = np.mean(ppl_list)

In [134]:
assert(model_bigram_ppl < model_unigram_ppl)

### Seems like it worked very well. Try to find one or two examples of short strings that clearly show that our bigram model is better and why. (Short answer is OK here)

In [135]:
print("Example 1: the cat sat")
print("Unigram perplexity:", get_text_ppl_filtered_unigram("the cat sat"))
print("Bigram perplexity:", get_text_ppl_backoff_bigram("the cat sat"))
print("The bigram model performs better because it captures common word sequences like 'the cat' \n")

print("Example 2: new york city")
print("Unigram perplexity:", get_text_ppl_filtered_unigram("new york city"))
print("Bigram perplexity:", get_text_ppl_backoff_bigram("new york city"))
print("The bigram model performs much better here because it learns common phrases like 'new york'")
print("The unigram model treats each word independently, missing these important word associations")

Example 1: the cat sat
Unigram perplexity: 19.75026478224891
Bigram perplexity: 10.055457118325274
The bigram model performs better because it captures common word sequences like 'the cat' 

Example 2: new york city
Unigram perplexity: 227.60960948584065
Bigram perplexity: 13.078937559455577
The bigram model performs much better here because it learns common phrases like 'new york'
The unigram model treats each word independently, missing these important word associations
