<img src='https://hammondm.github.io/hltlogo1.png' style="float:right">
Linguistics 531<br>
Fall 2024<br>
Jackson

## Things to remember about any homework assignment:

1. For this assignment, you will edit this Jupyter notebook and turn it in. Do not turn in PDF files or separate `.py` files.
1. Late work is not accepted.
1. Given the way I grade, you should try to answer *every* question, even if you don't like your answer or have to guess.
1. You may *not* use `python` modules that we have not already used in class. (For grading, it needs to be able to run on my machine, and the way to do that is to limit yourself to the modules we've discussed and that are loaded into the Notebook.)
1. Don't use tools *other* than Jupyter Notebook to work on and submit your assignment: Google Colab, or even just editing the `.ipynb` file as plain text, will mangle the autograding features. Diagnosing and fixing that kind of problem takes a lot of my time, and that means less time to offer constructive feedback to you and to other students.
1. You may certainly talk to your classmates about the assignment, but everybody must turn in *their own* work. It is not acceptable to turn in work that is essentially the same as the work of classmates, or the work of someone on Stack Overflow, or the work of a generative AI model. Using someone else's code and simply changing variable or object names is *not* doing your own work.
1. All code must run. It doesn't have to be perfect, it may not do all that you want it to do, but it must run without error. Code that runs with errors will get no credit from the autograder.
1. Code must run in reasonable time. Assume that if it takes more than *5 minutes* to run (on your machine), that's too long.
1. Make sure to select `restart, run all cells` from the `kernel` menu when you're done and before you turn this in!

my name: Kathleen Costa

people I talked to about the assignment: N/A

# Homework #8

**This is due Tuesday, December 10, 2024 at noon (Arizona time).**

This assignment continues with the `NewB` corpus (downloadable [here](https://github.com/JerryWei03/NewB)).

imports:

In [1]:
import re
import numpy as np
from math import isclose
import pandas as pd

# Even though your document vectors will be sorted lists,
#  using a Counter() will be an efficient way to get there.
from collections import Counter

**As before, this section is for autograding:**

What I need on my machine to properly grade this:

In [2]:
# Path on my own machine, needed for GRADING
newbfile = '/home/ejackson1/Downloads/linguistics/NewB/train_orig.txt'

# ie, DON'T CHANGE THIS CELL, CHANGE THE ONE BELOW!
#  If you change *this* cell, the autograding is likely to break.

*In the editable cell below, enter the path on your own machine,* then uncomment that line so the notebook works on your machine.

**BEFORE YOU SUBMIT to D2L, remember to comment out *your* path again.**

In [3]:
# YOUR path
newbfile = 'train_orig.txt'

**1.** Fill in the `get49()` function below to read in the data and extract sentences in class #4 and class #9. (3 points)

This should be very similar to what you've done on past assignments, just with a new pair of classes. This function will also include some of the kinds of operations that we used to create our document vectors (which are a form of document-term index).

You should use the `tokenize()` function **within `get49()`** to process the text as we have on previous homeworks: do not stem or remove stop words; keep only numbers, upper and lower case ASCII letters, and the percent sign. You can likely copy your function from assignment 7. Although there shouldn't be any upper case letters in this data set, it's always a good idea to check and normalize your data before assuming what's there.

Convert the sentences to raw term frequency counts for each sentence. (In past weeks and in this week's class notebook, some of the things we've done use *tf-idf* counts, but for this homework, we're keeping things simple and just using raw frequency counts. If you're curious to see how using *tf-idf* counts would affect this week's data, you're welcome to try that in another notebook. It's not required, though it might provide useful points for your discussion in question 8.)

Your function should return two lists, one for each class. See the docstring and cells below for clarity. Each item in these lists corresponds to a document (ie, a sentence in this data). For each document, you should have an **alphabetically sorted** list of two-item tuples where the first item is a normalized word and the second is the frequency count for that word.
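
As a minimal sketch of the target data structure (the toy sentence and the names `toy_tokens` and `doc_vector` are illustrative assumptions, not taken from the corpus), a `Counter` over a list of tokens can be sorted into the required list of tuples:

```python
from collections import Counter

# Hypothetical, already-tokenized sentence (not from NewB)
toy_tokens = ['the', 'president', 'said', 'the', 'vote', 'was', 'close']

# Counter tallies raw term frequencies; sorting its items orders the
# (word, count) tuples alphabetically by word
doc_vector = sorted(Counter(toy_tokens).items())
print(doc_vector)
```

Words that do not occur in the sentence simply have no entry, which is what makes the vector sparse.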

In [4]:
def get49(filename):
    '''
    calculates word count vectors for all documents in
    classes #4 and #9
    
    args:
        filename: location of train_orig.txt
        
    returns:
        two lists, one each for category 4 and 9;
            within each category's list, there should be a list (one per
            document) of tuples (<normalized word>, <count>), sorted
            alphabetically by the normalized word
    '''
    # YOUR CODE HERE
    category4_docs = []
    category9_docs = []
    
    with open(filename, 'r') as f:
        for line in f:
            try:
                category, text = line.strip().split('\t')
            except ValueError:
                continue
            
            if category not in ['4', '9']:
                continue
            
            word_counts = Counter(tokenize(text))
            sorted_counts = sorted(word_counts.items())
            
            if category == '4':
                category4_docs.append(sorted_counts)
            else:
                category9_docs.append(sorted_counts)
    
    return category4_docs, category9_docs

def tokenize(s):
    '''
    converts anything that is not a letter (upper or lower
    case ASCII), number, or percent sign to space, and tokenizes
    on white space.
    
    args:
        s: a sentence as a string
    returns:
        a list of normalized word tokens as strings
    '''
    # YOUR CODE HERE
    # Replace every character outside the allowed set with a space
    cleaned = re.sub(r"[^a-zA-Z0-9%]", " ", s)
    # split() with no argument splits on runs of whitespace and drops empty strings
    return cleaned.split()

In [5]:
ws4,ws9 = get49(newbfile)

# test 1a, 1pt
assert len(ws4) == len(ws9) == 23071

In [6]:
sample = "  More than  28%  of sentences in this   data contain�the president's name."

# test 1b, 1pt
# This is just a separate test of your tokenize() function
assert tokenize(sample) == ['More',  'than', '28%', 'of', 'sentences', 'in', 'this',
 'data', 'contain', 'the', 'president', 's', 'name']

In [7]:
# test 1c, 1pt
# Be sure you've applied your tokenize() function to each document/sentence
#  before creating these document vectors.
assert ws9[21321] == [('all', 1),('avoid', 1),('crossing', 1),
 ('donald', 1),('i', 1),('in', 1),('m', 1),('other', 1),
 ('s', 1),('saying', 1),('t', 1),('the', 1),('to', 1),
 ('trump', 1),('words', 1)]

In [8]:
ws4[10]

[('attacks', 1),
 ('criticize', 1),
 ('if', 1),
 ('president', 1),
 ('the', 1),
 ('them', 1),
 ('they', 1),
 ('trump', 1)]

**2.** Separate the last 300 documents for each class into separate test sets and use the remainder for training. (3 points)

This task should be familiar from previous assignments.

In [9]:
#train4 = all but last 300 items for training
# YOUR CODE HERE
# Reuse the vectors computed from newbfile above rather than re-reading a hardcoded path
cat4_docs, cat9_docs = ws4, ws9
train4 = cat4_docs[:-300]

#test4 = last 300 items for test
# YOUR CODE HERE
test4 = cat4_docs[-300:]

#train9 = all but last 300 items for training
# YOUR CODE HERE
train9 = cat9_docs[:-300]

#test9 = last 300 items for test
# YOUR CODE HERE
test9 = cat9_docs[-300:]

In [10]:
# test 2a, 1 pt
assert ws9[-300] == test9[0] and ws4[-300] == test4[0]

In [11]:
# test 2b, 1 pt
assert len(train4) == len(train9) == 22771 and len(test4) == len(test9) == 300

In [12]:
# test 2c, 1 pt
assert train4[5342] == [('added', 1), ('after', 1), ('an', 1), ('and', 1), ('as', 1),
 ('bolton', 1), ('but', 1), ('consolidate', 1), ('cuba', 1), ('lumped', 1), ('maduro', 1),
 ('missed', 1), ('nation', 1), ('nicaragua', 1), ('nicol', 1), ('of', 1), ('on', 1),
 ('opportunity', 1), ('piccone', 1), ('president', 1), ('pressure', 1), ('s', 1),
 ('that', 2), ('the', 1), ('to', 1), ('troika', 1), ('trump', 1), ('tyranny', 1),
 ('venezuelan', 1), ('with', 1)]

**3.** Write a function to calculate the centroid of a list of document vectors that are in the data structure we're using. (4 points)

Check the docstring as well as the cells below to see how this should be applied. Remember, we want to calculate the centroid for our class 4 sentences and for our class 9 sentences. As we saw in the lecture, though, our vocabulary counts constitute sparse lexical vectors: each document lists only the words it contains, so a word that is absent from a document counts as zero in the average.

*(Hint: You may find that a dictionary or Counter is a helpful data structure for this calculation. The bit of testing that I've done happens to show that dictionaries are much faster than Counters here, but it does depend on the details of how you implement it!)*
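
One way to read the hint, sketched on toy data with a plain dictionary (the helper name `toy_centroid` and the toy documents are assumptions for illustration, not the required `sparseCentroid()`): sum the counts per word across all documents, then divide each total by the number of documents.

```python
def toy_centroid(vectors):
    """Average a list of sparse (word, count) vectors (illustrative helper)."""
    totals = {}
    for vector in vectors:
        for word, count in vector:
            totals[word] = totals.get(word, 0) + count
    n = len(vectors)
    # Dividing by the number of documents treats absent words as zeros
    return sorted((word, total / n) for word, total in totals.items())

# Four hypothetical documents
toy_docs = [
    [('hat', 1), ('the', 2)],
    [('the', 1)],
    [('cat', 1), ('the', 1)],
    [('hat', 1)],
]
toy_result = toy_centroid(toy_docs)
print(toy_result)
```

Here `'hat'` occurs once in two of the four documents, so its average is 0.5 even though half the documents never mention it.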

In [13]:
def sparseCentroid(vectors):
    '''
    takes a list of sparse document vectors and calculates
    the centroid
    
    args:
        vectors: a list of lists of sorted (word,count) tuples
    returns:
        centroid as a sorted list of (word,avg) tuples
    '''
    # YOUR CODE HERE
    if not vectors:
        return []
        
    combined_counts = Counter()
    for vector in vectors:
        combined_counts.update(dict(vector))

    num_vectors = len(vectors)
    centroid = [(word, count / num_vectors) for word, count in combined_counts.items()]

    return sorted(centroid)

In [14]:
centroid4 = sparseCentroid(train4)
centroid9 = sparseCentroid(train9)

# test 3a, 1 pt
# This depends on the terms that are in class 4, right?
assert len(centroid4) == 23576

In [15]:
# test 3b, 1 pt
# Likewise, this depends on the terms in class 9.
assert len(centroid9) == 18088

In [16]:
c4dict = dict(centroid4)
c9dict = dict(centroid9)

# test 3c, 1 pt
assert isclose(c4dict['hat'],0.0006587,abs_tol=0.0000001)

In [17]:
# Sanity check:
# I get 1.3059154187343551
# Remember, these are raw weights, not tf-idf scores, so this is the average number of
#  times the word 'the' occurs in documents in class 4. If your numbers (including this
#  one) are off, be sure you're just using raw weights. We would otherwise expect a
#  common word like 'the' to have a very low tf-idf score, right?
c4dict['the']

1.3059154187343551

In [18]:
# test 3d, 1 pt
assert isclose(c9dict['hat'],0.0003513,abs_tol=0.0000001)

Here's an updated version of cosine similarity we developed for these in class.

In [19]:
def cosimfreq(d1,d2):
    '''
    Calculate cosine similarity for two sparse document vectors,
    using dicts for calculations and NumPy's `np.linalg.norm`.
    
    args:
        d1, d2: lists of (word, freq) tuples
    returns:
        a float from 0 to 1 expressing the similarity of these vectors
        (values closer to 1 are more similar)
    '''
    # concise, AND efficient-ish!
    d1,d2 = dict(d1), dict(d2)
    dot_product = sum(d1[word] * d2[word] for word in d1.keys() & d2.keys())
    magnitudes = np.linalg.norm([*d1.values()]) * np.linalg.norm([*d2.values()])
    if magnitudes == 0: return 0
    return dot_product/magnitudes

**4.** Now we do Rocchio classification for the two test sets. Write a function as described in the docstring below, using centroids as output by `sparseCentroid()` (question 3&mdash;see the `assert` statements below) and using `cosimfreq()` from above. (2 points)
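
A minimal sketch of the Rocchio decision on toy data, using only the standard library. The names `toy_norm`, `toy_cosine`, and `closer_to_second`, and the toy centroids, are assumptions for illustration; this is not the course's `cosimfreq()` or the required `testSet()`. One design choice shown here, which may help with run time, is converting each centroid to a dict and computing its norm once, outside the loop over documents.

```python
from math import sqrt

def toy_norm(d):
    """Euclidean length of a sparse vector stored as a dict."""
    return sqrt(sum(v * v for v in d.values()))

def toy_cosine(centroid, centroid_norm, doc):
    """Cosine similarity of a centroid dict (norm precomputed) and a document dict."""
    doc_norm = toy_norm(doc)
    if centroid_norm == 0 or doc_norm == 0:
        return 0.0
    # Loop over the (small) document; only shared words add to the dot product
    dot = sum(v * centroid[w] for w, v in doc.items() if w in centroid)
    return dot / (centroid_norm * doc_norm)

def closer_to_second(centroid1, centroid2, vectors):
    """Count the documents that are more similar to centroid2 than to centroid1."""
    c1, c2 = dict(centroid1), dict(centroid2)
    n1, n2 = toy_norm(c1), toy_norm(c2)   # computed once, not per document
    count = 0
    for vector in vectors:
        doc = dict(vector)
        if toy_cosine(c2, n2, doc) > toy_cosine(c1, n1, doc):
            count += 1
    return count

# Hypothetical centroids and test documents
toy_c1 = [('ball', 0.9), ('game', 0.8), ('the', 1.0)]
toy_c2 = [('tax', 0.7), ('the', 1.0), ('vote', 0.9)]
toy_test = [
    [('the', 1), ('vote', 2)],
    [('game', 1), ('the', 1)],
    [('tax', 1), ('vote', 1)],
]
toy_count = closer_to_second(toy_c1, toy_c2, toy_test)
print(toy_count)
```

In this toy set, the first and third documents share more weight with the second centroid, so two of the three documents are counted.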

In [20]:
def testSet(centroid1,centroid2,vectors):
    '''
    takes two centroids and a list of document vectors
    and returns how many vectors are closer to the second
    centroid (using cosimfreq).
    
    Everything is represented as lists of (word,freq) tuples.
    
    args:
        centroid1: centroid of class 1
        centroid2: centroid of class 2
        vectors: list of documents
    returns:
        integer number of how many of the vectors are
        closer to the second centroid
    '''
    # YOUR CODE HERE
    closer_to_second = 0

    for vector in vectors:
        sim1 = cosimfreq(centroid1, vector)
        sim2 = cosimfreq(centroid2, vector)
        
        if sim2 > sim1:
            closer_to_second += 1

    return closer_to_second

# This code led the kernel to die. I've tried several different approaches, including the one from
# the nb10vectorspace assignment. I've also tried altering the other code to see whether that was the
# issue, but I think the kernel simply can't handle this much data.

In [None]:
#this may take 2-3 minutes
result4 = testSet(centroid9,centroid4,test4)

# test 4a, 1 pt
# Out of the 300 item test set, I get that 202 are evaluated correctly.
assert isclose(result4,202,abs_tol=1)

In [None]:
#this may take 2-3 minutes
result9 = testSet(centroid4,centroid9,test9)

# test 4b, 1 pt
# Out of the 300 item test set, I get that 183 are evaluated correctly.
assert isclose(result9,183,abs_tol=1)

Let's summarize the performance of the Rocchio method.

In [None]:
testSize = 300
print("Rocchio method summary:")
print("Test set 4, {} out of {} correct, {:.0%}".format(result4, testSize, result4/testSize))
print("Test set 9, {} out of {} correct, {:.0%}".format(result9, testSize, result9/testSize))

**5.** Now we'll work up to a full kNN classification. This involves a *lot* of calculations, so let's reduce the data a bit. For each class, set aside 1000 training items, 100 development items, and 100 test items, as described below. (3 points)

In [None]:
#newtrain4 = first 1000
# YOUR CODE HERE
newtrain4 = cat4_docs[:1000] 

#newdev4 = next 100
# YOUR CODE HERE
newdev4 = cat4_docs[1000:1100] 

#newtest4 = next 100
# YOUR CODE HERE
newtest4 = cat4_docs[1100:1200]

#newtrain9 = first 1000
# YOUR CODE HERE
newtrain9 = cat9_docs[:1000]

#newdev9 = next 100
# YOUR CODE HERE
newdev9 = cat9_docs[1000:1100]

#newtest9 = next 100
# YOUR CODE HERE
newtest9 = cat9_docs[1100:1200]

In [None]:
# test 5a, 1 pt
assert (len(newtrain4) == len(newtrain9) == 1000) and \
    (len(newtest4) == len(newtest9) == len(newdev4) == len(newdev9) == 100)

In [None]:
# test 5b, 1 pt
assert newdev4[0] == ws4[1000]

In [None]:
# test 5c, 1 pt
assert newtest9[0] == ws9[1100]

**6.** Now we write a function that takes: two lists of training documents, represented as above; a class label for each of these lists of training documents; and a single document for development testing, which we will compare to these lists of documents. This function returns a single ordered list of cosine similarity scores reflecting the training documents that are closest to the single development document. (4 points)

To be clear, the two lists of documents will be our full *training* sets (each with 1000 documents), and the single document will be from our *development* set. We'll call this function from the code that we write in question 7, which will loop through the documents in our development set, in order to find the best value for $k$ for each development document. It may be helpful to look at question 7 before you complete the code here, to see how this function will be used there.

In the original lists, each item is a document vector (an ordered list of tuples). In the returned list, each item is a tuple of cosine similarity score (of one training document and the development document) and the class label of that training document. (We'll keep track of the class label of the *development document* in the function we write in Question 7.)
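
To illustrate the returned structure, here is a sketch with made-up scores (the numbers and the `toy_` names are assumptions, not real output): sorting a list of `(score, label)` tuples in reverse order puts the most similar training documents first.

```python
# Hypothetical similarity scores for five training documents
toy_scores = [(0.12, 4), (0.57, 9), (0.33, 4), (0.57, 4), (0.05, 9)]

# Tuples compare by score first; equal scores fall back to comparing labels
toy_ranked = sorted(toy_scores, reverse=True)
print(toy_ranked)

# The k nearest neighbors are simply the first k entries of the ranking
toy_top3 = toy_ranked[:3]
print(toy_top3)
```

Note that with tuple sorting, tied scores are ordered by label, which is a side effect worth being aware of when many documents have a similarity of 0.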

In [None]:
def makeList(trainlist1,label1,trainlist2,label2,devdoc):
    '''
    ranks the elements of two lists in a single
    new list in terms of cosine similarity score
    with respect to a single document.
    
    args:
        trainlist1:  document vectors of one class for training
        label1:      the label of that class
        trainlist2:  document vectors of a second class for training
        label2:      label of the second class
        devdoc:      a single document vector from a development list
    returns:
        a ranked list of tuples, ranked from most
        similar to least similar: (score,class label)
    '''
    # YOUR CODE HERE
    scores = []
    for doc in trainlist1:
        sim = cosimfreq(doc, devdoc)
        scores.append((sim, label1))
    for doc in trainlist2:
        sim = cosimfreq(doc, devdoc)
        scores.append((sim, label2))
    scores.sort(reverse=True)
    
    return scores

In [None]:
res = makeList(newtrain4,4,newtrain9,9,newdev4[0])

# test 6a, 1 pt
# 1000 training documents from each class means that we should have a list
#  of 2000 similarities returned.
assert len(res) == 2000

In [None]:
# test 6b, 1 pt
assert type(res[0]) == tuple

In [None]:
# test 6c, 1 pt
assert all([(label == 4 or label == 9) for similarity, label in res])

In [None]:
# test 6d, 1 pt
assert res[505][0] > res[506][0]

**7.** Now we use the training sets and development sets from question 5 with the function you created in question 6 to determine the best value for $k$ for all the items in the development sets, and then choose the best $k$ overall. (2 points)

The following function takes two training sets and a single development set as arguments and returns a dictionary of $k$ values containing a count for how many times the *second* category is chosen by the kNN method, where $k$ varies over odd numbers from $1$ to $50$.

*(Hint: It may be helpful to express the possible values of $k$ as a `range()`.)*

This function should step through the documents in the development set, calling the `makeList()` function that you wrote in question 6 for each one.

*(Another hint: You'll be stepping through the possible values of $k$, and stepping through the documents in the development set. What is the most efficient way to combine these loops? Think about how you can avoid re-calculating the same things.)*
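
One possible reading of this hint, sketched on made-up labels (the function `toy_votes` and its arguments are hypothetical, not part of the assignment): rank the training documents once per development document, then walk down the ranking with a running count, so that each value of $k$ reuses the work done for smaller values.

```python
def toy_votes(ranked_labels, target, max_k=50):
    """For each odd k below max_k, record whether `target` wins the vote.

    ranked_labels: class labels ordered from most to least similar.
    """
    wins = {}
    running = 0  # how many of the labels seen so far equal target
    for i, label in enumerate(ranked_labels[:max_k], start=1):
        running += (label == target)
        if i % 2 == 1:  # odd k only, so a two-class vote cannot tie
            wins[i] = running > i / 2
    return wins

# Hypothetical labels of the seven most similar training documents
toy_labels = [9, 4, 4, 9, 9, 4, 4]
toy_wins = toy_votes(toy_labels, target=9)
print(toy_wins)
```

With a real ranking of 2000 training documents, the same loop would produce an entry for every odd $k$ from 1 to 49 in a single pass.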

In [None]:
def findK(trainlist1,label1,trainlist2,label2,devset):
    '''
    returns kNN counts for different choices of k
    
    args:
        trainlist1:  one list of documents for training
        label1:      label of the first training list
        trainlist2:  another list of documents for training
        label2:      label of the other training list
        devset:      one list of development documents
    returns:
        a dictionary from values of k to counts
        for how many times the second class was
        chosen.
    '''
    # YOUR CODE HERE
    k_counts = {}
    
    k_values = range(1, 50, 2)
    
    for dev_doc in devset:
        ranked_list = makeList(trainlist1, label1, trainlist2, label2, dev_doc)
        
        for k in k_values:
            top_k = ranked_list[:k]
            label2_count = sum(1 for _, label in top_k if label == label2)
            k_counts[k] = k_counts.get(k, 0) + (label2_count > k/2)
    
    return k_counts

In [None]:
res944 = findK(newtrain9,9,newtrain4,4,newdev4)
res499 = findK(newtrain4,4,newtrain9,9,newdev9)

# test 7a, 1 pt
# Odd numbers from 1 to 50--that's 25 possible values for k
assert len(res944) == len(res499)== 25

We can add those values together to see which value of $k$ scores the highest:

In [None]:
results = pd.DataFrame([res944,res499]).transpose()
results['sum'] = results.sum(axis=1)
results

In [None]:
# Let's choose the best k: the index label of the row with the highest number of
#  correct predictions on the development set.
bestk = results['sum'].idxmax()

# test 7b, 1 pt
assert bestk == 19

Now run kNN on the test sets with the best value of $k$:

In [None]:
newTestSize=100

testres944 = findK(newtrain9,9,newtrain4,4,newtest4)
testres499 = findK(newtrain4,4,newtrain9,9,newtest9)
good4 = testres944[bestk]
good9 = testres499[bestk]

print("kNN method summary:")
print("Test set 4, {} out of {} correct, {:.0%}".format(good4, newTestSize, good4/newTestSize))
print("Test set 9, {} out of {} correct, {:.0%}".format(good9, newTestSize, good9/newTestSize))

**8.** Summarize the results of your Rocchio and kNN classification systems and discuss. Can you say anything about why you get the results you do? The Review & Mastery exercises for this week may suggest some relevant things to consider. (2 points)

My Rocchio cells did not run as expected, but the expected results noted in the tests are that 202 of the 300 class 4 test items (about 67%) and 183 of the 300 class 9 test items (about 61%) are classified correctly. Rocchio represents each class by its centroid, and the centroids define the boundary between the classes. kNN instead classifies a document from its local neighborhood: the labels of the k most similar training documents determine the prediction, so it relies on similarities to individual documents rather than on the overall class pattern. In this assignment, the best value of k = 19. Because Rocchio labels well over half of each test set correctly, the two classes appear to have reasonably distinct centroids. These results make sense given how each classifier works: kNN looks at local similarities, while Rocchio summarizes each class as a whole through its centroid.