<img src='https://hammondm.github.io/hltlogo1.png' style="float:right">
Linguistics 531<br>
Fall 2024<br>
Jackson

## Things to remember about any homework assignment:

1. For this assignment, you will edit this Jupyter notebook and turn it in. Do not turn in PDF files or separate `.py` files.
1. Late work is not accepted.
1. Given the way I grade, you should try to answer *every* question, even if you don't like your answer or have to guess.
1. You may *not* use `python` modules that we have not already used in class. (For grading, it needs to be able to run on my machine, and the way to do that is to limit yourself to the modules we've discussed and that are loaded into the Notebook.)
1. Don't use anything *other* than Jupyter Notebook to work on and submit your assignment. Other tools (Google Colab, for example, or even editing the `.ipynb` file as plain text) will mangle the autograding features. Diagnosing and fixing that kind of problem takes a lot of my time, and that means less of my time to offer constructive feedback to you and to other students.
1. You may certainly talk to your classmates about the assignment, but everybody must turn in *their own* work. It is not acceptable to turn in work that is essentially the same as the work of classmates, or the work of someone on Stack Overflow, or the work of a generative AI model. Using someone else's code and simply changing variable or object names is *not* doing your own work.
1. All code must run. It doesn't have to be perfect, it may not do all that you want it to do, but it must run without error. Code that runs with errors will get no credit from the autograder.
1. Code must run in reasonable time. Assume that if it takes more than *5 minutes* to run (on your machine), that's too long.
1. Make sure to select `restart, run all cells` from the `kernel` menu when you're done and before you turn this in!

my name: *\<FILL IN HERE\>*

people I talked to about the assignment: *\<FILL IN HERE\>*

# Homework #4

**This is due Tuesday, November 12, 2024 at noon (Arizona time).**

This assignment continues with the `NewB` corpus (downloadable [here](https://github.com/JerryWei03/NewB)).

imports:

In [1]:
import re
from math import isclose

# Used in the cosimfreq() implementation from class
import numpy as np

# HINT: This might make your life much easier
from collections import Counter

**As before, this section is for autograding:**

Again, for grading, I need to be working with the right file that we load our corpus from. On my machine, that file has this path:

In [2]:
# Path on my own machine, needed for GRADING
newbfile = '/home/ejackson1/Downloads/linguistics/NewB/train_orig.txt'

# ie, DON'T CHANGE THIS CELL, CHANGE THE ONE BELOW!
#  If you change *this* cell, the autograding is likely to break.

For **you** to work on your own code, you need to point this notebook to the path for this file on your own machine. *You should enter the path on your own machine in the editable code cell below,* then uncomment that line so the notebook works on your machine. This means that the second code cell will take precedence in assigning the value of the path to the corpus, and you can write your code to open that file without problems.

**BEFORE YOU SUBMIT to D2L, remember to comment out *your* path again.** This means that when I run the code on my own machine, it'll have the path that ***I*** need, and it'll grade your notebook properly.

In [3]:
# YOUR path
newbfile = 'train_orig.txt'

**1.** Build a frequency-based *document* index from the `train_orig.txt` file. (6 points total)

You may adapt any of the code from class for this or use your own. You may also use or adapt the code from your previous assignments. Do *not* stem or remove stop words, but *do* use the processing that we've done before: keep only upper and lower case ASCII letters, digits, and the percent sign (as spelled out in the `text_prep()` docstring below), and convert everything else to spaces. We'll continue the practice from last week of putting our text normalization into a separate function, `text_prep()`, so that we can easily do the same processing for our collection and for our queries.

Your `makeFreqIdx()` function should return a list where the positional index of each item (ie, what you might get from `enumerate`-ing the document index) corresponds to the document ID. The value of each item in the list is a tuple composed of: i) the publication source code for that document, as an integer; ii) the text of that document, normalized and tokenized, as a list of strings; and iii) a set that contains tuples of terms and counts—a word (a normalized string) and a count (an integer) of how many times it occurs in this document.

You should see that this is really quite similar to what you wrote last week. The only difference comes in the set that is the third element of each document in our collection. Last week, that set was composed of just terms. This week, that set is composed of tuples of terms and counts. Now that we're counting terms, do you think a `Counter()` object would help?
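As a minimal sketch of the structure described above, the following builds the same kind of index from two in-memory lines rather than from `train_orig.txt`. The sample sentences, the tab-separated `code<TAB>sentence` layout, and the names `toy_text_prep` and `toy_index` are assumptions made for illustration only.

```python
import re
from collections import Counter

def toy_text_prep(text):
    # Following the text_prep() docstring: keep ASCII letters, digits,
    # and '%'; everything else becomes a space, then split on whitespace.
    return re.sub(r'[^a-zA-Z0-9%]', ' ', text).split()

# Hypothetical sample lines in an assumed "code<TAB>sentence" layout
sample_lines = ["3\tthe dog saw the cat", "7\ta cat is a cat"]

toy_index = []
for line in sample_lines:
    code, text = line.split('\t')
    tokens = toy_text_prep(text)
    # Third element: a set of (word, count) tuples taken from a Counter
    toy_index.append((int(code), tokens, set(Counter(tokens).items())))

# The list position (0, 1, ...) plays the role of the document ID
for doc_id, doc in enumerate(toy_index):
    print(doc_id, doc)
```

Each entry is a tuple of publication code, token list, and count set, so `toy_index[0][2]` contains `('the', 2)` because "the" occurs twice in the first sample sentence.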

In [4]:
def makeFreqIdx(filename):
    '''create a frequency-based document index from
    the newB source file
    
    args:
        filename: name/location of train_orig.txt
    returns:
        a frequency-based document index represented
            as a list of tuples:
                publication code, as an integer
                text of document (ie, one sentence), as a list of strings
                set of count tuples: (word, count)
    '''
    # YOUR CODE HERE
    doc_index = []
    
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                source_code = int(parts[0])
                text = parts[1]
                tokens = text_prep(text)
                freq_counts = Counter(tokens)
                freq_set = {(word, count) for word, count in freq_counts.items()}
                doc_index.append((source_code, tokens, freq_set))
    
    return doc_index


def text_prep(input):
    '''performs text normalization and tokenization on an input string
    
    Our process: anything that is not a letter (upper or lower case
        ASCII letters), digit (0-9), or the percent sign (%) is converted
        to space, and then terms are split on whitespace.
    
    args:
        input: a string of unprocessed text
    returns:
        output: a list of strings of normalized tokens
    '''
    # NOTE: This time, we're not returning both a normalized string AND
    #       a tokenized list of strings--just the tokenized list of strings.
    # YOUR CODE HERE
    # Keep ASCII letters, digits, and '%' (per the docstring above);
    # replace every other character with a space, then split on whitespace.
    normalized = re.sub(r'[^a-zA-Z0-9%]', ' ', input)
    return normalized.split()

Here's a bit of a test for your `text_prep()` function:

In [5]:
# Some sample input
test1 = 'this is a_weird string  \n'
test2 = 'this string is 98% oDd.'
test3 = 'now we have some WĔIřD•characters'
tests = [test1, test2, test3]

#pay attention to how this should be tokenized according to the instructions
target1 = ['this', 'is', 'a', 'weird', 'string']
target2 = ['this', 'string', 'is', '98%', 'oDd']
target3 = ['now', 'we', 'have', 'some', 'W', 'I', 'D', 'characters']
targets = [target1, target2, target3]

for test, target in zip(tests, targets):
    yourout = text_prep(test)
    result = "Matches!" if target == yourout else "Doesn't match!"
    print("Target: ",target)
    print(" Yours: ",yourout)
    print("Result: ", result, "\n")

Target:  ['this', 'is', 'a', 'weird', 'string']
 Yours:  ['this', 'is', 'a', 'weird', 'string']
Result:  Matches! 

Target:  ['this', 'string', 'is', '98%', 'oDd']
 Yours:  ['this', 'string', 'is', '98%', 'oDd']
Result:  Matches! 

Target:  ['now', 'we', 'have', 'some', 'W', 'I', 'D', 'characters']
 Yours:  ['now', 'we', 'have', 'some', 'W', 'I', 'D', 'characters']
Result:  Matches! 



In [6]:
# Here's an example of working with a Counter()
count2 = Counter(target2)
count2

Counter({'this': 1, 'string': 1, 'is': 1, '98%': 1, 'oDd': 1})

In [7]:
# Here's a way to get the Counter() to give you a set of (key, value) tuples
#  (the set() function is there just to get the types right)
set(count2.items())

{('98%', 1), ('is', 1), ('oDd', 1), ('string', 1), ('this', 1)}

In [8]:
# Here's how to get something from a set of tuples back into a Counter()
mytuples = {('98%', 1), ('is', 1), ('oDd', 1), ('string', 1), ('this', 1)}

# The trick is to convert this set of tuples into a dictionary, and from
#   there easily into a Counter()
newcounter = Counter(dict(mytuples))

#Is this the same as what we started with?
newcounter == count2

True

One of the nice things about `Counter()` objects is that they don't complain if you ask for a key that they don't have; they simply return a value of zero (ie, they didn't count that key when they were created). This can make for some Python code that is MUCH easier to read! Consider this example of a dot product function on `Counter()` objects:

In [9]:
def cdot(counter1, counter2):
    '''calculate the dot product over two counters'''
    return sum(counter1[word]*counter2[word] for word in counter1)
    # Think about why I only have to iterate over one of the counters for this calculation
    # Could you also write a function to find the "vector magnitude" of a Counter()?
    # With those two functions, could you also rewrite cosim() in terms of Counter()s?

count1, count2 = Counter(target1), Counter(target2)
cdot(count1, count2)

3
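The comments in that cell ask whether a vector magnitude and a cosine similarity could also be written over `Counter()` objects. One possible sketch follows; the helper names `cmag` and `ccosim` are hypothetical and are not part of the code from class.

```python
from collections import Counter
from math import sqrt

def cdot(counter1, counter2):
    # A word missing from counter2 yields 0, so it adds nothing to the sum;
    # that is why iterating over only one counter is enough.
    return sum(counter1[word] * counter2[word] for word in counter1)

def cmag(counter):
    # Hypothetical helper: a vector's length is the square root of its
    # dot product with itself.
    return sqrt(cdot(counter, counter))

def ccosim(counter1, counter2):
    # Hypothetical helper: cosine similarity, with 0.0 for an empty vector
    denom = cmag(counter1) * cmag(counter2)
    return cdot(counter1, counter2) / denom if denom else 0.0

c1 = Counter(['this', 'is', 'a', 'weird', 'string'])
c2 = Counter(['this', 'string', 'is', '98%', 'oDd'])
print(ccosim(c1, c2))
```

The two sample vectors share three words and each has length `sqrt(5)`, so the printed value is approximately 3/5 = 0.6 (up to floating-point rounding).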

*Tests for Q1*

In [10]:
#this will take a few seconds
docs = makeFreqIdx(newbfile)

# test 1a, 1 pt
assert type(docs) == list

In [11]:
# test 1b, 1 pt
assert len(docs) == 253781

In [12]:
# test 1c, 1 pt
assert type(docs[0]) == tuple

In [13]:
# test 1d, 1 pt
assert type(docs[500][0]) == int

In [14]:
# test 1e, 1 pt
assert docs[10000][1] == ['the', 'snack', 'bar', 'across', 'the', 'mall',
                          'from', 'the', 'trump', 'on', 'the', 'ocean',
                          'site', 'was', 'badly', 'flooded', 'and', 'damaged',
                          'even', 'though', 'it', 'has', 'no', 'basement']

In [15]:
# test 1f, 1 pt
assert docs[38032][2] == {('a', 2), ('agreed', 1), ('and', 2), ('as', 2),
 ('asked', 1), ('be', 2), ('but', 1), ('by', 1), ('did', 1), ('even', 1),
 ('far', 1), ('fear', 1), ('fed', 3), ('from', 1), ('government', 1), ('im', 3),
 ('in', 1), ('is', 1), ('known', 1), ('lives', 1), ('lumber', 1), ('name', 1),
 ('named', 1), ('not', 2), ('of', 1), ('others', 1), ('partisan', 1), ('quoted', 1),
 ('rancor', 1), ('reflecting', 1), ('regulation', 1), ('said', 1), ('square', 1),
 ('stronghold', 1), ('taxes', 1), ('the', 1), ('then', 1), ('to', 2), ('trump', 2),
 ('up', 3), ('what', 1), ('who', 2), ('wholesaler', 1), ('with', 3)}

**2**. Revise our Euclidean distance function to operate over these new index structures, now that we're using frequency counts. (4 points total)

Note, this means that we need to adjust that function since we're not just looking at term incidence, where all weights effectively equal 1. Be careful; this can be tricky! There are different ways to approach it, and some are faster than others.

*(Hint: What are the different cases of values in the two vectors that you have to calculate for this? How much of the code from the in-class notebook, both from last week's Euclidean distance function that worked on unweighted vectors and from this week's cosine similarity function that did work on weighted vectors, can you re-use? Here's one place where using a Counter() may give you some helpful default behavior. Pay attention to the TYPES that this function expects--you can't just pass it a Counter(), but you might be able to convert a set of tuples into a Counter() pretty easily, right?)*
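To make the "different cases" in the hint concrete, here is a small worked sketch on two made-up document vectors (`da` and `db` are hypothetical, not taken from the corpus). A term can appear in both vectors, only in the first, or only in the second, and a `Counter()` handles all three with one expression because a missing term counts as 0.

```python
from collections import Counter
from math import sqrt

# Hypothetical document vectors as sets of (word, count) tuples
da = {('the', 2), ('cat', 1)}
db = {('the', 1), ('dog', 3)}

ca, cb = Counter(dict(da)), Counter(dict(db))

# 'the' is in both vectors, 'cat' only in da, 'dog' only in db
for term in sorted(set(ca) | set(cb)):
    print(term, ca[term], cb[term], (ca[term] - cb[term]) ** 2)

# Squared differences are 1 (cat), 9 (dog), and 1 (the): distance = sqrt(11)
toy_dist = sqrt(sum((ca[t] - cb[t]) ** 2 for t in set(ca) | set(cb)))
print(toy_dist)
```

Iterating over the union of the two vocabularies matters here: looping over only one vector would silently skip the terms that occur only in the other.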

In [16]:
def eucdistfreq(d1,d2):
    '''calculate Euclidean distance for frequency-based
    document representations as produced by makeFreqIdx()
    
    args:
        d1,d2: document vectors represented as sets of
           tuples of the form (word,count)
    returns:
        Euclidean distance between the two document vectors
        (return a scalar of type float to pass the following assert test)
    '''
    # YOUR CODE HERE
    v1 = Counter(dict(d1))
    v2 = Counter(dict(d2))
    
    all_terms = set(v1.keys()) | set(v2.keys())
    squared_diff_sum = sum((v1[term] - v2[term]) ** 2 for term in all_terms)
    return float(np.sqrt(squared_diff_sum))

In [17]:
# Let's get two documents to use
d0 = docs[0][2]
d8 = docs[8][2]

# The order of these vectors shouldn't matter
res1 = eucdistfreq(d0,d8)
res2 = eucdistfreq(d8,d0)

# test 2a, 1pt
assert type(res1) == float

In [18]:
# test 2b, 1pt
# The order that the vectors are fed to eucdistfreq shouldn't change the result
assert res1 == res2

In [19]:
# test 2c, 1pt
# I get a value 5.477225575051661
assert isclose(res1,5.4772,abs_tol=0.0001)

In [20]:
d2 = docs[2][2]
res3 = eucdistfreq(d8,d2)

# test 2d, 1pt
# I get a value 7.0710678118654755
assert isclose(res3,7.0711,abs_tol=0.0001)

We'll also need the `cosimfreq()` function from class:

In [21]:
#cosine similarity wrt/frequencies for DTI (from class)
def cosimfreq(d1,d2):
    num = sum([e1[1]*e2[1] for e1 in d1
               for e2 in d2 if e1[0] == e2[0]])
    d1len = np.sqrt(sum([e[1]**2 for e in d1]))
    d2len = np.sqrt(sum([e[1]**2 for e in d2]))
    denom = d1len * d2len
    if denom == 0: return 0
    return float(num)/float(denom)

**3.** Revise your search function from last time so that it works with weighted document vectors and query vectors. (5 points)

**It should process the query just like you processed your documents,** and then return the top 10 document indices that best match a query using either Euclidean distance or cosine similarity. (Remember that sorting should depend on the distance metric that is used.)

The function should have this argument structure:

```python
search(query,index,cosine=True)
```

The default is cosine similarity, but if you specify a third argument as `False`, the function uses Euclidean distance. (You may, of course, use and adapt code from class.)

*Hint: This week, we've changed our index, and we've changed our distance functions. What part of your search function from last week needs to change, to be compatible with these differences?*
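As a reminder of why the sort direction depends on the metric, here is a short sketch over made-up `(score, docID)` pairs. The scores and the helper name `top_k` are assumptions for illustration and are not results from the NewB index.

```python
# Hypothetical (score, docID) pairs for three documents
cos_scores = [(0.20, 0), (0.90, 1), (0.55, 2)]   # similarity: bigger is better
euc_scores = [(3.0, 0), (1.0, 1), (2.5, 2)]      # distance: smaller is better

def top_k(scores, cosine=True, k=2):
    # reverse=True sorts descending (best similarity first);
    # reverse=False sorts ascending (smallest distance first).
    return sorted(scores, reverse=cosine)[:k]

print(top_k(cos_scores, cosine=True))    # [(0.9, 1), (0.55, 2)]
print(top_k(euc_scores, cosine=False))   # [(1.0, 1), (2.5, 2)]
```

Because each tuple starts with the score, sorting the tuples directly orders them by score first and by document ID only to break ties.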

In [22]:
def search(q,idx,cosine=True):
    '''searches for the 10 best matches for a
    string query using either cosine similarity
    or euclidean distance
    
    args:
        q: query (possibly multi-word) as a string
        idx: frequency-based index
        cosine: boolean for cosine similarity or
            euclidean distance
    returns:
        list of the 10 best-matching tuples: (score,docID)
    '''
    # YOUR CODE HERE
    q_tokens = text_prep(q)
    q_freq = Counter(q_tokens)
    q_vector = {(term, count) for term, count in q_freq.items()}
    
    scores = []
    for i, doc in enumerate(idx):
        doc_freqs = doc[2] 
        
        if cosine:
            score = cosimfreq(q_vector, doc_freqs)
            scores.append((score, i))
        else:
            score = eucdistfreq(q_vector, doc_freqs)
            scores.append((score, i))
    
    scores.sort(reverse=cosine)
    
    return scores[:10]

In [23]:
r1 = search('the fire wall',docs)

# test 3a, 1pt
assert type(r1) == list

In [24]:
# test 3b, 1pt
assert len(r1) == 10

In [25]:
r2 = search('the fire wall',docs,False)

# test 3c, 1pt
assert r1[0][1] == 23855 and r2[0][1] == 22610

In [26]:
# Let's display these
print('cosimfreq top 10 results:\teucdistfreq top 10 results:')
# A plain loop avoids building a throwaway list of None values; the loop
# variables get their own names so that r1 and r2 are not overwritten.
for cos_hit, euc_hit in zip(r1, r2):
    print('{}\t{}'.format(cos_hit, euc_hit))

cosimfreq top 10 results:	eucdistfreq top 10 results:
(0.5773502691896258, 23855)	(1.7320508075688772, 22610)
(0.5560384374855327, 101886)	(1.7320508075688772, 23855)
(0.545544725589981, 20432)	(2.0, 2014)
(0.5345224838248488, 217696)	(2.0, 3034)
(0.5333333333333333, 132093)	(2.0, 3950)
(0.5270462766947299, 113628)	(2.0, 4009)
(0.5222329678670935, 182303)	(2.0, 4280)
(0.5222329678670935, 103169)	(2.0, 4350)
(0.5222329678670935, 74900)	(2.0, 4524)
(0.5163977794943222, 247960)	(2.0, 4830)

In [27]:
# test 3d, 1pt
# I get 0.5773502691896258
assert isclose(r1[0][0],.5774,abs_tol=0.0001)

In [28]:
# test 3e, 1pt
# I get 1.7320508075688772
assert isclose(r2[0][0],1.7321,abs_tol=0.0001)

**4.** This system should have problems related to function words (since we are not removing them). Explain what the problem is. (2 pts)

*For full points, be sure your answer makes it clear you understand how the system we've been building **in this assignment** may return results that do not actually meet a user's information need well, as a consequence of function words in the collection and in the query.*

*Hint: This problem was discussed in the course videos, so if you're not sure how to approach this, you might benefit from reviewing the videos. For what you write here, you should be able to express this problem in your own words.*

Function words such as "the," "is," and "and" are high-frequency words: they occur in nearly every document in the collection, and often several times within a single document. Because this system weights each term by its raw count and does not remove function words, a function word contributes at least as much to a score as a content word does, and more when it is repeated, even though it says little about what the user is looking for. For a query like "the fire wall," a document can therefore rank highly simply because it contains "the" several times, without mentioning "fire" or "wall" at all. This lowers the precision of the results, since the top matches may not meet the user's information need. One way to address the problem is to remove function words from both the documents and the queries, so that the scores depend on words with more semantic weight.

**5.** *Demonstrate* the problem that you described in **Question 4** with actual searches on the index and search function you created in questions 1 and 3. (2 pts)

In [29]:
#YOUR CODE HERE
def makeFreqIdx2(filename):
    doc_index = []
    
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                source_code = int(parts[0])
                text = parts[1]
                tokens = text_prep2(text)  
                freq_counts = Counter(tokens)
                freq_set = {(word, count) for word, count in freq_counts.items()}
                doc_index.append((source_code, tokens, freq_set))
    
    return doc_index

def text_prep2(input_text):
    function_words = {"the", "on", "at", "and", "in", "to", "of", "a", "for", "with", 
                      "is", "that", "by", "from", "it", "has", "no", "was"}
    
    normalized = re.sub(r'[^a-zA-Z0-9]', ' ', input_text)
    tokens = normalized.lower().split()
    
    filtered_tokens = [word for word in tokens if word not in function_words]
    
    return filtered_tokens

def search2(q, idx, cosine=True):
    q_tokens = text_prep2(q)  
    q_freq = Counter(q_tokens)
    q_vector = {(term, count) for term, count in q_freq.items()}
    
    scores = []
    for i, doc in enumerate(idx):
        doc_freqs = doc[2]  
        
        if cosine:
            score = cosimfreq(q_vector, doc_freqs)
            scores.append((score, i))
        else:
            score = eucdistfreq(q_vector, doc_freqs)
            scores.append((score, i))
    
    scores.sort(reverse=cosine)
    
    return scores[:10]

doc_text = 'the snack bar across the mall from the trump on the ocean site was badly flooded and damaged even though it has no basement'
processed_tokens = text_prep2(doc_text)
print("Processed tokens (with function words removed):", processed_tokens)

assert processed_tokens == ['snack', 'bar', 'across', 'mall', 'trump', 'ocean', 'site', 
                            'badly', 'flooded', 'damaged', 'even', 'though', 'basement'], \
       f"Tokens after processing: {processed_tokens}"

Processed tokens (with function words removed): ['snack', 'bar', 'across', 'mall', 'trump', 'ocean', 'site', 'badly', 'flooded', 'damaged', 'even', 'though', 'basement']
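The check above confirms that `text_prep2()` drops the listed function words. As a hedged illustration of the ranking problem itself, the sketch below compares rankings with and without function words on two made-up sentences rather than on the NewB index. The sentences, the tiny stop list `STOP`, and the helpers `toy_vec`, `toy_cos`, and `toy_rank` are all assumptions for illustration.

```python
import re
from collections import Counter
from math import sqrt

# Small illustrative stop list (an assumption, not a standard list)
STOP = {'the', 'in', 'on', 'by', 'a'}

def toy_vec(text, drop_stop=False):
    # Tokenize, optionally drop function words, and count the rest
    tokens = re.sub(r'[^a-zA-Z0-9%]', ' ', text).split()
    if drop_stop:
        tokens = [t for t in tokens if t not in STOP]
    return Counter(tokens)

def toy_cos(c1, c2):
    # Cosine similarity over raw counts
    num = sum(c1[w] * c2[w] for w in c1)
    denom = (sqrt(sum(v * v for v in c1.values()))
             * sqrt(sum(v * v for v in c2.values())))
    return num / denom if denom else 0.0

def toy_rank(query, texts, drop_stop=False):
    # Score every text against the query, best match first
    q = toy_vec(query, drop_stop)
    scored = [(toy_cos(q, toy_vec(t, drop_stop)), i)
              for i, t in enumerate(texts)]
    return sorted(scored, reverse=True)

toy_docs = ['the man in the house on the hill by the sea',     # no "fire" or "wall"
            'fire crews built a wall around homes near town']  # the relevant one

print(toy_rank('the fire wall', toy_docs))                  # function words kept
print(toy_rank('the fire wall', toy_docs, drop_stop=True))  # function words dropped
```

With function words kept, document 0 ranks first (roughly 0.48 versus 0.38) purely because "the" occurs four times in it, even though it mentions neither "fire" nor "wall". With them dropped, document 0 scores 0 and document 1, the relevant sentence, ranks first.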
