# Problem 1: Affinity Propagation

We mostly looked at clustering methods which found cluster centers by taking the mean of all the points in the cluster. 
While this is reasonable, it means that the cluster center itself may not be a valid data point, especially for integer-valued data like the bag of words representation of text.

Another approach is to find __exemplar__ data points that are particularly representative of the rest of the data in the cluster, and to use those as cluster centers.
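To make the difference concrete, here is a minimal sketch on made-up word-count vectors. The data and the squared-distance criterion are illustrative assumptions, not part of the assignment. The mean contains fractional counts, while the exemplar (chosen here as the row with the smallest total squared distance to the others) is one of the original rows.

```python
# Toy bag-of-words count vectors (hypothetical data for illustration).
points = [
    [2, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
]

def mean_vector(rows):
    """Component-wise mean; generally not an integer vector."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def exemplar(rows):
    """Row with the smallest total squared distance to all other rows."""
    return min(rows, key=lambda r: sum(sq_dist(r, other) for other in rows))

center_mean = mean_vector(points)
center_exemplar = exemplar(points)

print("mean:    ", center_mean)      # contains fractional counts
print("exemplar:", center_exemplar)  # one of the original rows
```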

One such algorithm is affinity propagation. The original paper from 2007 can be found [here](http://utstat.toronto.edu/reid/sta414/frey-affinity.pdf).

## Part 0: Read the paper!

## Part 1: Understanding the advantages
Two of the advantages of affinity propagation include:
 - Cluster centers are actual data points
 - It only requires that we have pre-computed pairwise similarity between data points

1. Give examples of cases where each of these advantages is desirable.

**ANSWER:**

**Actual data points**

In customer segmentation, where each data point represents an actual customer profile, having cluster centers as actual customers allows marketers to directly examine typical customers in each group. This approach provides real examples that can be used for tailored marketing strategies, as each center represents a real, actionable customer profile.

**Only requires pre-computed pairwise similarity**

In social network analysis, pairwise similarity can represent the connections or shared interests between individuals. Affinity Propagation can work with just this pairwise similarity data without needing the entire network structure, making it efficient for analyzing communities within large networks.

Thus, these advantages are desirable when we want an actual data point to "represent" a cluster, especially for text or image data, where an averaged vector may not correspond to anything meaningful. An exemplar plays the role of a "median" (which exists in the sample) rather than a mean (which may not), and is therefore easier to interpret. Additionally, because only pre-computed similarities are required, the algorithm never needs the raw feature vectors, which helps when a similarity measure is available but a vector representation is not.
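The second advantage can be sketched with items that have no natural "mean" at all. The phrases, the word-set Jaccard similarity, and the helper names below are all hypothetical choices for illustration. This is a single-cluster stand-in, not affinity propagation itself: it only shows that a representative item can be picked from a pre-computed similarity matrix alone.

```python
# Hypothetical items with no natural feature-vector "mean": short phrases.
items = ["red apple pie", "green apple pie", "apple pie recipe", "red apple"]

def jaccard(a, b):
    """Word-set Jaccard similarity, used here as an example similarity."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

# Step 1: pre-compute the pairwise similarity matrix once.
n = len(items)
S = [[jaccard(items[i], items[j]) for j in range(n)] for i in range(n)]

# Step 2: from here on only S is used, never the raw items.
def most_representative(sim):
    """Index whose total similarity to all other items is largest."""
    totals = [sum(row[j] for j in range(len(row)) if j != i)
              for i, row in enumerate(sim)]
    return max(range(len(sim)), key=totals.__getitem__)

best = most_representative(S)
print(items[best])
```

Affinity propagation generalizes this idea by letting several exemplars emerge at once through message passing over the same kind of matrix.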






2. Give examples of cases where each of these advantages is NOT desirable.

**Actual data points**

In a geographic clustering scenario, such as grouping locations across a large area, the most representative location might not coincide with an actual place but rather a calculated central point between locations. Using actual data points as cluster centers may lead to suboptimal results because the chosen centers might not capture the true centrality or ideal representation of a region.


**Only requires pre-computed pairwise similarity**

In image clustering for visual similarity, relying only on pairwise similarity could ignore the broader distribution of visual features. For example, if the images form distinct clusters with varying degrees of similarity across the dataset, pairwise similarity alone may not capture these complex, high-dimensional relationships accurately, reducing clustering effectiveness.




## Part 2: Document Summarization

One of the benefits of having data points as cluster centers is that, for NLP, the centers are sure to be valid language.

Consider a single document that is broken into sentences. If we can identify certain sentences which are representative of other similar sentences, we can, perhaps, effectively summarize a document!
This type of summarization, where parts of a document are copied verbatim, is called __extractive summarization__ and is different from __abstractive summarization__, which would not attempt to copy sentences found in the text.

In this part you will
 - locate a few documents of your choosing
 - break each document into sentences
 - use the bag of words or TFIDF to represent the text
 - use [affinity propagation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.AffinityPropagation.html) to find the exemplar sentences
 - print them out and comment on the results. 
   - what works, what doesn't work? 
   - is TFIDF better than BOW?
   - etc

In [1]:
import numpy as np
import pandas as pd
%matplotlib inline
np.random.seed(1234)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import AffinityPropagation
from nltk.tokenize import sent_tokenize


In [2]:
with open("data/causes_of_world_war_I") as fi:
    data = fi.readlines()

In [3]:
sentences = []
# convert documents into sentences using `sent_tokenize`
# make sure to include only sentences of >50 characters that do not start with '=='

# your code here
for doc in data:
    for sentence in sent_tokenize(doc):
        if len(sentence) > 50 and not sentence.startswith('=='):
            sentences.append(sentence)

assert(len(sentences) == 482)

In [7]:
# convert sentences using tfidf representation
# your code here
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(sentences)

# fit AffinityPropagation
# ap = AffinityPropagation(...)
# your code here
ap = AffinityPropagation()
ap.fit(X)

print(ap.cluster_centers_indices_.shape)

(79,)


In [5]:
# print cluster centers using `ap.cluster_centers_indices_`
for index in ap.cluster_centers_indices_:
    print(sentences[index])

The crisis escalated as the conflict between Austria-Hungary and Serbia was joined by their allies Russia, Germany, France, and ultimately Belgium and Great Britain.
Secondary fault lines exist between those who believe that Germany deliberately planned a European war, those who believe that the war was largely unplanned but still caused principally by Germany and Austria-Hungary taking risks, and those who believe that some or all of the other powers (Russia, France, Serbia, Great Britain) played a more significant role in causing the war than has been traditionally suggested.
In 1887 German and Russian alignment was secured by means of a secret Reinsurance Treaty arranged by Otto von Bismarck.
Key markers were the Franco-Russian Alliance of 1894, the 1904 Entente Cordiale with Great Britain and finally the Anglo-Russian Entente in 1907, which became the Triple Entente.
It was "not that antagonism toward Germany caused its isolation, but rather that the new system itself channeled and

### Comment on results. What works, what doesn't work? Is TFIDF better than BOW?

**ANSWER:**
The TFIDF + Affinity Propagation approach works reasonably well for extractive summarization:

Strengths:
- Identifies key sentences that represent distinct topics/themes in the document.
- Maintains original phrasing, since the exemplars are actual sentences.
- TFIDF helps focus on distinctive terms while downweighting common words.

Limitations:
- Redundancy and omissions: with the default settings, the model returned 79 exemplars for 482 sentences, which is long for a summary. Sentences on similar themes can both be selected, and important details may be missed if they are expressed in atypical wording.
- Limitations of statistical methods: TF-IDF ignores word order and context, so it cannot capture narrative flow, and it does not account for semantic relationships such as synonyms.
- Sparse representation: with a small corpus or many unique terms, sentence vectors share few terms, so much of the information in the text goes unused.

TFIDF vs BOW:
TFIDF is likely better than BOW for this task because:
- It accounts for how informative a term is across the sentences of the document
- It reduces the impact of common but less meaningful words
- It helps identify truly distinctive sentences rather than just those with common terms

TF-IDF is often better than BOW for tasks that require understanding the importance of terms within a document. The weighting system helps prioritize more relevant terms in the analysis. However, if the goal is simply to gather word counts without concern for their significance, BOW may be sufficient.
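The downweighting effect can be seen in a small stdlib-only sketch. The three sentences are made up, and the weighting uses the textbook form idf = log(N / df); scikit-learn's `TfidfVectorizer` defaults to a smoothed idf with L2 normalization, so its numbers would differ.

```python
import math
from collections import Counter

# Three toy "sentences" (hypothetical, for illustration only).
sentences = [
    "the war began in 1914",
    "the alliance was secret",
    "the treaty was signed",
]
tokenized = [s.split() for s in sentences]

# Document frequency: number of sentences containing each term.
df = Counter(term for tokens in tokenized for term in set(tokens))
n_docs = len(tokenized)

def bow(tokens):
    """Raw term counts."""
    return dict(Counter(tokens))

def tfidf(tokens):
    """Term count times the textbook idf = log(N / df)."""
    return {t: c * math.log(n_docs / df[t]) for t, c in Counter(tokens).items()}

bow_0 = bow(tokenized[0])
tfidf_0 = tfidf(tokenized[0])
for term in tokenized[0]:
    print(f"{term:>6}  bow={bow_0[term]}  tfidf={tfidf_0[term]:.3f}")
```

Under BOW, "the" and "war" carry the same weight in the first sentence; under this TF-IDF variant, "the" appears in every sentence and drops to zero, so similarity between sentences is driven by the rarer terms.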

# Problem 2: Another use for Unsupervised Data

As mentioned in class, we can use unlabeled text in order to make better text representations. 

This is a large part of cutting edge NLP, and we'll cover this in a few lectures.
In the meantime, we can do a simple version. In low data settings, TFIDF vectors can 
be misleading, since we don't have enough observations to get good statistics on document frequencies. 

In this problem, we'll use the [20 newsgroups data](https://scikit-learn.org/stable/datasets/index.html#newsgroups-dataset), which is a 20-class classification
problem on the topic of newsgroup posts. In real-world applications, it is typically costly and time-consuming
to label data. In this problem we'll simulate that by keeping only the first 1000 labels.
However, we can use the rest of the data set (and other data sets) in order to get better statistics
on document frequencies, and therefore better text representations that make better use of our
few labeled examples.

## Part 0: Load the data

 - Read the docs about the dataset
 - Load the 20newsgroups data with `sklearn.datasets.fetch_20newsgroups`

In [8]:
# ok to restart
import numpy as np
import pandas as pd

%matplotlib inline
np.random.seed(1234)


In [9]:
from sklearn.datasets import fetch_20newsgroups
fetch_20newsgroups?

Signature:
fetch_20newsgroups(
    *,
    data_home=None,
    subset='train',
    categories=None,
    shuffle=True,
    random_state=42,
    remove=(),
    download_if_missing=True,
    return_X_y=False,
    n_retries=3,
    delay=1.0,
)

In [10]:
data_train = fetch_20newsgroups(remove=("headers", "footers", "quotes"), subset="train")
data_test = fetch_20newsgroups(remove=("headers", "footers", "quotes"), subset="test")

What does the remove kwarg do in this function?

**ANSWER** 

The `remove` keyword argument in the `fetch_20newsgroups` function specifies which parts of the newsgroup posts should be removed. By setting `remove=("headers", "footers", "quotes")`, the function will remove the headers, footers, and quoted text from the newsgroup posts.

This helps clean the data by removing metadata and quoted text that isn't part of the actual content, allowing us to focus on just the main message content for classification. This is important since headers/footers/quotes could contain information that makes the classification task artificially easier.

What is the in-sample accuracy of a dummy model that always guesses the most frequent class?

In [11]:
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

dummy_clf = DummyClassifier(strategy="most_frequent")
dummy_clf.fit(data_train.data, data_train.target)
dummy_predictions = dummy_clf.predict(data_train.data)
in_sample_accuracy = accuracy_score(data_train.target, dummy_predictions)

print(f"In-sample accuracy of the dummy model: {in_sample_accuracy}")

In-sample accuracy of the dummy model: 0.0530316422131872
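The dummy-model accuracy is simply the share of the most frequent class in the labels. A minimal sketch of that calculation follows, with a hypothetical helper name and made-up labels. With 20 roughly balanced classes the share is close to 1/20 = 0.05, which is consistent with the value printed above.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    _, count = Counter(labels).most_common(1)[0]
    return count / len(labels)

toy_labels = [0, 1, 1, 2, 1, 0]  # hypothetical labels; class 1 is most common
print(majority_baseline_accuracy(toy_labels))
```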


How many training examples are there?

In [12]:
num_training_examples = len(data_train.data)
print(f"Number of training examples: {num_training_examples}")

Number of training examples: 11314


## Part 1: Fit a baseline


Fit a baseline on the first 1000 examples
 - Fit a `TfidfVectorizer` on the first 1000 examples
   - Use only 10k features
 - Transform the first 1000 training examples and the test data
 - Fit logistic regression and report the accuracy

In [13]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

NUM_LABELED_EXAMPLES = 1000
MAX_FEATURES = 10000

# your code here
vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)
X_train = vectorizer.fit_transform(data_train.data[:NUM_LABELED_EXAMPLES])
X_test = vectorizer.transform(data_test.data)

lr = LogisticRegression()
lr.fit(X_train, data_train.target[:NUM_LABELED_EXAMPLES])

# lr.score(...)
accuracy = lr.score(X_test, data_test.target)
print(f"Accuracy of logistic regression model: {accuracy}")

Accuracy of logistic regression model: 0.44118428040361124


## Part 2: Use more unsupervised data

Repeat part 1, but fit the `TfidfVectorizer` on more data.

You should include
 - All of the training docs (even though we'll only use the first 1000 labels)
 - Load the train set from the IMDB sentiment analysis dataset

Then
 - Fit the same logistic regression model and report the accuracy
 - Comment on the results

In [14]:
import os
import glob

def load_imdb_data_text(imdb_data_dir, random_seed=1234):
    """Provided helper function to load data"""
    train_dir = os.path.join(imdb_data_dir, "train")
    test_dir = os.path.join(imdb_data_dir, "test")

    np.random.seed(random_seed)
    texts = []
    targets = []
    for label in ("pos", "neg"):
        data_dir = os.path.join(train_dir, label)
        files = glob.glob(os.path.join(data_dir, "*.txt"))
        for filename in files:
            with open(filename) as fi:
                text = fi.read()
            target = label == "pos"
            texts.append(text)
            targets.append(target)

    train_docs = texts
    y_train = np.array(targets)

    texts = []
    targets = []
    for label in ("pos", "neg"):
        data_dir = os.path.join(test_dir, label)
        files = glob.glob(os.path.join(data_dir, "*.txt"))
        for filename in files:
            with open(filename) as fi:
                text = fi.read()
            target = label == "pos"
            texts.append(text)
            targets.append(target)

    test_docs = texts
    y_test = np.array(targets)

    inds = np.arange(y_train.shape[0])
    np.random.shuffle(inds)

    train_docs = [train_docs[i] for i in inds]
    y_train = y_train[inds]

    return (train_docs, y_train), (test_docs, y_test)

In [15]:
(extra_docs, _), _ = load_imdb_data_text("data/aclImdb/")

In [17]:
# your code here
# Fit the vocabulary and IDF statistics on all available text:
# every 20 newsgroups training document plus the IMDB training reviews.
all_train_data = data_train.data + extra_docs

vectorizer = TfidfVectorizer(max_features=MAX_FEATURES)
vectorizer.fit(all_train_data)

# Only the first 1000 newsgroup documents have labels we are allowed to use.
X_train = vectorizer.transform(data_train.data[:NUM_LABELED_EXAMPLES])
X_test = vectorizer.transform(data_test.data)
y_train_subset = data_train.target[:NUM_LABELED_EXAMPLES]

lr = LogisticRegression()
lr.fit(X_train, y_train_subset)

# lr.score(...)
accuracy = lr.score(X_test, data_test.target)
print(f"Accuracy of logistic regression model with more unsupervised data: {accuracy}")


**Comment on the results:**

Results of including both 20 newsgroups and IMDB data when fitting the TfidfVectorizer:

Advantages:
- The vectorizer learns term importance from a larger, more diverse corpus
- Better estimates of IDF values since we see terms across more documents
- May handle domain-specific terms better by learning their relative importance

Potential limitations:
- The two datasets have different styles/domains which could affect term weighting
- Using more data doesn't necessarily improve classification if the additional data is too different

Comparison to Part 1:
- The accuracy is likely similar since we're still using the same 1000 labeled examples for training
- Main difference is in feature representation quality from better IDF estimates
- The larger corpus helps reduce impact of dataset-specific term frequencies
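The idea of estimating IDF on extra unlabeled text while training only on the labeled documents can be sketched without scikit-learn. The corpora and helper names below are hypothetical. The idf uses the smoothed form log((1 + N) / (1 + df)) + 1, which is the form documented for scikit-learn's default settings, without the vocabulary cap or normalization.

```python
import math
from collections import Counter

def fit_idf(docs):
    """Smoothed idf, log((1 + N) / (1 + df)) + 1, estimated from docs."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {term: math.log((1 + n) / (1 + d)) + 1 for term, d in df.items()}

def transform(doc, idf):
    """TF-IDF weights for one document; terms missing from idf are dropped."""
    counts = Counter(doc.split())
    return {t: c * idf[t] for t, c in counts.items() if t in idf}

# Hypothetical toy corpora: two labeled docs plus four unlabeled ones.
labeled = ["graphics card driver", "hockey game tonight"]
unlabeled = ["the game was great", "great movie tonight",
             "card game rules", "movie was long"]

idf_small = fit_idf(labeled)              # statistics from labeled docs only
idf_large = fit_idf(labeled + unlabeled)  # statistics from all available text

# In both cases only the labeled docs are transformed for the classifier.
for name, idf in (("small", idf_small), ("large", idf_large)):
    print(name, {t: round(w, 3) for t, w in transform(labeled[1], idf).items()})
```

With only the two labeled documents, every term looks equally rare. With the larger corpus, "game" turns out to be common and is weighted below "hockey", even though no extra labels were used.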


# Problem 3: Backoff language model

We know that having unknown words in text is a problem for a language model. Any estimate of probability is difficult in such a scenario. 

In class, we saw a simple way of smoothing probabilities by adding 1 to the count of every occurring n-gram. While this can be a simple and effective technique, we can do something a bit more clever. In this exercise we will implement two such techniques.

1. To deal with unknown unigrams we will introduce a special `<unk>` token in our training data to represent rare tokens

2. For unknown bigrams we will use a technique called backoff. The idea is to "backoff" to a lower order n-gram estimate for the probability if the n-gram is unknown. For example the probability of an unknown bigram `w_1 w_2` can be estimated by looking at the unigram probability of `w_2`. 
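Written out, the backoff rule described in item 2 can be stated as follows, where $c(\cdot)$ denotes a training count and $N$ the total number of training tokens (notation introduced here for illustration):

$$
P_{\text{backoff}}(w_2 \mid w_1) =
\begin{cases}
\dfrac{c(w_1 w_2)}{c(w_1)} & \text{if } c(w_1 w_2) > 0, \\[2ex]
\dfrac{c(w_2)}{N} & \text{otherwise.}
\end{cases}
$$

Because no probability mass is taken away from the seen bigrams, these values do not sum to one over $w_2$, so this is a simple scoring rule rather than a properly normalized distribution.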

In [18]:
import pandas as pd
import numpy as np
import re
from collections import Counter
%matplotlib inline

wiki_df = pd.read_csv('data/kdwd_r1k_articles.csv')

def get_tokens(text):
    return ['<s>'] + re.findall(r'\w+', text.lower()) + ['</s>']

train_sentences_list = ' '.join(wiki_df['intro_text'].iloc[:-100].tolist()).split('.')
test_sentences_list = ' '.join(wiki_df['intro_text'].iloc[-100:].tolist()).split('.')


## Part 0: First, let's build a basic 1-gram language model

In [19]:
train_token_list = [get_tokens(text) for text in train_sentences_list]

In [20]:
unigram_counts = Counter()
# your code here
for tokens in train_token_list:
    unigram_counts.update(tokens)
        
n_unigrams = np.sum([v for _, v in unigram_counts.items()])

In [21]:
assert(n_unigrams == 95491)

In [22]:
def get_unigram_token_prob(token):
    return unigram_counts[token] / n_unigrams

def get_text_prob_unigram(text):
    tokens = get_tokens(text)
    logp = 0
    for t in tokens:
        # code here
        logp += np.log(get_unigram_token_prob(t))
    return np.exp(logp)

In [23]:
assert(get_unigram_token_prob('apple').round(5) == 0.00046)
assert(get_text_prob_unigram('the company').round(9) == 2.455e-06)

Note that we haven't yet introduced any smoothing, so out-of-vocabulary words will have a probability of 0:

In [24]:
get_text_prob_unigram("onomatopoeia")

  logp += np.log(get_unigram_token_prob(t))


0.0
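For contrast, here is a minimal sketch of the add-one smoothing mentioned in the introduction, on made-up counts. The names `toy_counts` and `add_one_prob` are hypothetical, and reserving a single vocabulary slot for unseen words is an assumption of this sketch.

```python
from collections import Counter

# Hypothetical toy counts standing in for unigram_counts.
toy_counts = Counter({"the": 3, "cat": 1})
n_tokens = sum(toy_counts.values())
# Assumption: reserve one extra vocabulary slot for any unseen word.
vocab_size = len(toy_counts) + 1

def add_one_prob(token):
    """Add-one (Laplace) estimate: (count + 1) / (N + V)."""
    return (toy_counts[token] + 1) / (n_tokens + vocab_size)

print(add_one_prob("the"), add_one_prob("cat"), add_one_prob("dog"))
```

The unseen word "dog" now receives a small non-zero probability instead of 0.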

## Part 1: Identify rare unigrams

We have learned that we can simply add 1 to every word count to prevent this (ref: Laplace smoothing). Another way, however, is to mark rare words within our training set as unknown words. The idea is that the model will then learn how to deal with unknown/rare words, and so evaluate a test text more accurately.

For this, let us first identify all unigrams that occur k or fewer times. Let's use k=1 to start out with.

In [25]:
rare_tokens = set()
# your loop code here
for token, count in unigram_counts.items():
    if count <= 1:
        rare_tokens.add(token)

print(len(rare_tokens))

4859


In [26]:
assert(len(rare_tokens) == 4859)

## Part 2: Create `<unk>` replacement

Next, let's create a new counter `filtered_unigram_counts` where every token that appears in `rare_tokens` is recorded as the special token `<unk>`

In [27]:
filtered_unigram_counts = Counter({'<unk>': 0})
for token_list in train_token_list:
    for token in token_list:
        if token in rare_tokens:
            filtered_unigram_counts['<unk>'] += 1
        else:
            filtered_unigram_counts[token] += 1
        
n_filtered_unigrams = np.sum([v for _, v in filtered_unigram_counts.items()])

In [28]:
assert(filtered_unigram_counts['<unk>'] == 4859)

## Part 3: Use new counts

To use these new counts, let's modify our text probability function

In [29]:
def get_filtered_unigram_token_prob(token):
    return filtered_unigram_counts[token] / n_filtered_unigrams

def get_text_prob_filtered_unigram(text):
    # get tokens and convert to <unk> if needed
    tokens = ['<unk>' if t not in filtered_unigram_counts else t for t in get_tokens(text)]
    logp = 0
    for t in tokens:
        logp += np.log(get_filtered_unigram_token_prob(t))
    return np.exp(logp)

In [30]:
assert(get_filtered_unigram_token_prob('apple').round(5) == 0.00046)
assert(get_text_prob_filtered_unigram('the company').round(9) == 2.455e-06)
assert(get_text_prob_filtered_unigram("onomatopoeia").round(5) == 0.00016)

We can see that now unknown words actually have a probability higher than some of the rare words that we have already seen before like `apple`.

The choice of count 1 to label words as `<unk>` was arbitrary. How could we tune this if we had more time?

**ANSWER:** 
We could tune the `<unk>` threshold in several ways:

1. Cross-validation: Split the training data into folds and evaluate model perplexity using different count thresholds (e.g. 1, 2, 3, 5, 10). Choose the threshold that gives the best performance on validation data.

2. Development set: Hold out a portion of the training data as a dev set. Try different thresholds and measure the impact on text probability/perplexity on the dev set.

3. Task-specific optimization: If the language model is used as part of a downstream task (e.g. text classification), tune the threshold to maximize performance on that task.

4. Identify rare unigrams: For each candidate threshold (e.g., words occurring fewer than 5 times), re-identify the rare unigrams, replace them with `<unk>`, and re-estimate the counts before evaluating.
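A dev-set sweep over thresholds might look like the sketch below. The toy sentences and function names are hypothetical and whitespace tokenization stands in for `get_tokens`. In the notebook, the dev set would be a slice of `train_sentences_list` held out before counting.

```python
import math
from collections import Counter

def train_filtered_unigram(sentences, k):
    """Unigram counts with tokens seen at most k times mapped to '<unk>'."""
    raw = Counter(tok for s in sentences for tok in s.split())
    rare = {tok for tok, c in raw.items() if c <= k}
    counts = Counter({"<unk>": 0})
    for s in sentences:
        for tok in s.split():
            counts["<unk>" if tok in rare else tok] += 1
    return counts

def perplexity(counts, sentences):
    """Per-token perplexity; infinite if any token gets probability 0."""
    total = sum(counts.values())
    logp, n = 0.0, 0
    for s in sentences:
        for tok in s.split():
            tok = tok if tok in counts else "<unk>"
            p = counts[tok] / total
            if p == 0:
                return math.inf
            logp += math.log(p)
            n += 1
    return math.exp(-logp / n)

# Hypothetical toy split for illustration.
train = ["the cat sat", "the dog sat", "the cat ran", "a bird sang"]
dev = ["the cat sang", "a fish ran"]

results = {k: perplexity(train_filtered_unigram(train, k), dev) for k in (0, 1, 2)}
for k, ppl in results.items():
    print(f"k={k}: dev perplexity = {ppl:.2f}")
```

One caveat: a larger threshold collapses more words into `<unk>`, which makes the prediction task easier, so raw perplexities at different thresholds are not directly comparable. A downstream-task metric, as in option 3, avoids this problem.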

## Part 4: Expand to bigrams

Let's expand our model to bigrams now. Make sure to check if each component in a bigram exists and label it as `<unk>` if needed.

In [31]:
filtered_bigram_counts = Counter()
for token_list in train_token_list:
    # your loop and 'unk' conversion here
    token_list = ['<unk>' if token in rare_tokens else token for token in token_list]
    for t1, t2 in zip(token_list[:-1], token_list[1:]):
        filtered_bigram_counts[t1 + ' ' + t2] += 1

def get_filtered_bigram_token_prob(token1, token2):
    return filtered_bigram_counts[token1 + ' ' + token2] / filtered_unigram_counts[token1]
        
def get_text_prob_filtered_bigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    logp = 0
    for t1, t2 in zip(tokens[:-1], tokens[1:]):
        logp += np.log(get_filtered_bigram_token_prob(t1, t2))
    return np.exp(logp)

In [32]:
assert(get_text_prob_filtered_bigram('the company').round(5) == 0.00148)

We correctly get a higher probability for `the company`, now that we are respecting bigrams.
However:

In [33]:
get_text_prob_filtered_bigram('company the')

  logp += np.log(get_filtered_bigram_token_prob(t1, t2))


0.0

We can see that we still get 0 for unknown bigrams. Let's fix this via Backoff. To reiterate: the idea is to default to unigram probabilities if the bigram is unknown.

In [34]:
def get_backoff_bigram_token_prob(token1, token2):
    # check if bigram exists and if not return unigram token2 prob
    if token1 + ' ' + token2 in filtered_bigram_counts:
        return get_filtered_bigram_token_prob(token1, token2)
    else:
        return get_filtered_unigram_token_prob(token2)

In [35]:
def get_text_prob_backoff_bigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    logp = 0
    for t1, t2 in zip(tokens[:-1], tokens[1:]):
        logp += np.log(get_backoff_bigram_token_prob(t1, t2))
    return np.exp(logp)

In [36]:
assert(get_text_prob_backoff_bigram('company the').round(8) == 1.1e-07)

We can now happily estimate the probability of any input text we can think of without running into issues with 0.

## Part 5: Evaluate perplexity

Let's see if this was all worth it. Let's evaluate perplexity.
Specifically, compare the perplexity of our filtered unigram model `get_filtered_unigram_token_prob` to our new and improved backoff bigram model `get_backoff_bigram_token_prob`.

Note: For easy comparison let's only evaluate `tokens[1:]` for both models such that even the first token can already form a correct bigram
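For reference, the per-text perplexity used in this comparison can be written as follows, where $w_1, \dots, w_N$ are the evaluated tokens (that is, `tokens[1:]`) and $p(w_i)$ is the model's probability for token $i$: the unigram probability in the first model, and the backoff probability given the preceding token in the second.

$$
\mathrm{PPL}(w_1, \dots, w_N) = \exp\!\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(w_i)\right)
$$

Lower values mean the model assigns higher average probability to the text.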

In [37]:
def get_text_ppl_filtered_unigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    # your code
    logp = 0
    for t in tokens[1:]:
        logp += np.log(get_filtered_unigram_token_prob(t))
    return np.exp(-logp / len(tokens[1:]))

def get_text_ppl_backoff_bigram(text):
    tokens = [token if token in filtered_unigram_counts else '<unk>' for token in get_tokens(text)]
    # your code
    logp = 0
    for t1, t2 in zip(tokens[:-1], tokens[1:]):
        logp += np.log(get_backoff_bigram_token_prob(t1, t2))
    return np.exp(-logp / len(tokens[1:]))

In [38]:
ppl_list = []
for text in test_sentences_list:
    ppl_list.append(get_text_ppl_filtered_unigram(text))
model_unigram_ppl = np.mean(ppl_list)

In [39]:
ppl_list = []
for text in test_sentences_list:
    ppl_list.append(get_text_ppl_backoff_bigram(text))
model_bigram_ppl = np.mean(ppl_list)

In [40]:
assert(model_bigram_ppl < model_unigram_ppl)

Seems like it worked very well. Try to find one or two examples of short strings that clearly show that our bigram model is better and why. (Short answer is OK here)

In [42]:
print("Example 1: the cat sat")
print("Unigram perplexity:", get_text_ppl_filtered_unigram("the cat sat"))
print("Bigram perplexity:", get_text_ppl_backoff_bigram("the cat sat"))

print("Example 2: new york city")
print("Unigram perplexity:", get_text_ppl_filtered_unigram("new york city"))
print("Bigram perplexity:", get_text_ppl_backoff_bigram("new york city"))

Example 1: the cat sat
Unigram perplexity: 19.75026478224891
Bigram perplexity: 10.055457118325274
Example 2: new york city
Unigram perplexity: 227.60960948584065
Bigram perplexity: 13.078937559455577


**ANSWER** 

Example 1: The bigram model performs better because it captures common word sequences like 'the cat'.

Example 2: The bigram model performs much better here because it learns common phrases like 'new york'. The unigram model treats each word independently, missing these important word associations.