<img src='https://hammondm.github.io/hltlogo1.png' style="float:right">
Linguistics 531<br>
Fall 2024<br>
Jackson

## Things to remember about any homework assignment:

1. For this assignment, you will edit this jupyter notebook and turn it in. Do not turn in pdf files or separate `.py` files.
1. Late work is not accepted.
1. Given the way I grade, you should try to answer *every* question, even if you don't like your answer or have to guess.
1. You may *not* use `python` modules that we have not already used in class. (For grading, it needs to be able to run on my machine, and the way to do that is to limit yourself to the modules we've discussed and that are loaded into the Notebook.)
1. Don't use editors *other* than Jupyter Notebook to work on and submit your assignment, since they will mangle the autograding features: Google Colab, or even just editing the `.ipynb` file as a plain text file. Diagnosing and fixing that kind of problem takes a lot of my time, and that means less of my time to offer constructive feedback to you and to other students.
1. You may certainly talk to your classmates about the assignment, but everybody must turn in *their own* work. It is not acceptable to turn in work that is essentially the same as the work of classmates, or the work of someone on Stack Overflow, or the work of a generative AI model. Using someone else's code and simply changing variable or object names is *not* doing your own work.
1. All code must run. It doesn't have to be perfect, it may not do all that you want it to do, but it must run without error. Code that runs with errors will get no credit from the autograder.
1. Code must run in reasonable time. Assume that if it takes more than *5 minutes* to run (on your machine), that's too long.
1. Make sure to select `restart, run all cells` from the `kernel` menu when you're done and before you turn this in!

my name: Kathleen Costa

people I talked to about the assignment: N/A

# Homework #3

**This is due Tuesday, November 5, 2024 at noon (Arizona time).**

This assignment continues with the `NewB` corpus (downloadable [here](https://github.com/JerryWei03/NewB)).

imports:

In [1]:
import re
from math import isclose
import numpy as np
from nltk.stem.porter import PorterStemmer

**As before, this section is for autograding:**

Again, for grading, I need to be working with the right file that we load our corpus from. On my machine, that file has this path:

In [2]:
# Path on my own machine, needed for GRADING
newbfile = '/home/ejackson1/Downloads/linguistics/NewB/train_orig.txt'

# ie, DON'T CHANGE THIS CELL, CHANGE THE ONE BELOW!
#  If you change *this* cell, the autograding is likely to break.

For **you** to work on your own code, you need to point this notebook to the path for this file on your own machine. *You should enter the path on your own machine in the editable code cell below,* then uncomment that line so the notebook works on your machine. This means that the second code cell will take precedence in assigning the value of the path to the corpus, and you can write your code to open that file without problems.

**BEFORE YOU SUBMIT to D2L, remember to comment out *your* path again.** This means that when I run the code on my own machine, it'll have the path that ***I*** need, and it'll grade your notebook properly.

In [3]:
# YOUR path
newbfile = 'train_orig.txt'

**1.** Build an incidence-based *document* index from the `train_orig.txt` file. (5 points total)

We represent the index as a list of tuples where the list index corresponds to the document ID and the tuple is composed of: i) the publication source code, ii) the tokenized text of the document, and iii) the set of word tokens that occur in the document.

Notice how this compares to your functions from the last assignment: on the last assignment, you had one function `makeDocuments()` that took a file path and returned what we called "a list of documents," processed in a way that we could usefully use--as tuples of publication ID, sentence, and the set of words in the sentence--and `makeIndex()`, which took that list of documents and returned an **incidence-based *term-document* index**, which was a dictionary from terms to a list of the doc_IDs for documents that contained the term. For this assignment, since we're implementing our similarity-based search using an **incidence-based *document-term* index** (note the reversal), we will use the "set of words in a document" element of the original corpus to do these calculations. This means that even though the name of the function in this assignment may be different (`makeDocIndex()`) from last week's (`makeDocuments()`), it's still doing largely the same thing that last week's function did, and even though we didn't call the resulting object (ie, the output of `makeDocuments()`) an index when we first wrote it, the set of terms in each document ***is*** an incidence-based document-term index.

Our text preparation will be the same as last week: anything that is not a letter (upper or lower case ASCII letters), digit, or the percent sign is converted to space, and then terms in our documents (ie, sentences) are split on (any number of characters of) whitespace. Do not stem or normalize in any other way, since this will affect your results. (We'll implement stemming further down. For your own future reference, [NLTK](https://www.nltk.org/api/nltk.tokenize.html) and libraries like [spaCy](https://spacy.io/usage/spacy-101) also have the ability to tokenize sentences and words, but since our tokenization so far is pretty simple, we're not introducing the complexity of NLTK or spaCy for this purpose.) **However, this week, you'll be writing that text processing as part of its own function `text_prep()`, so that we can make use of it to process our queries below, also.** Note that the text processing function should return a list of tokenized strings, so the sentence, as it is represented in our index, should be a list of strings, **not** a single long string as it was last week.

You may adapt any of the code from class for this or use your own. You may also use or adapt the code from your first or second assignments.

In [4]:
def makeDocIndex(filename):
    '''creates an incidence-based document index
    from the train_orig.txt file
    
    args:
        nbfile: location of the file
    returns:
        documents: the index as a list of tuples:
            publication source id
            normalized and tokenized text of document
            set of words in the document
    '''
    # YOUR CODE HERE
    documents = []
    
    with open(filename, 'r', encoding='utf-8') as file:
        for doc_id, line in enumerate(file):
            parts = line.strip().split('\t')
            
            if len(parts) < 2:
                continue
                
            publication_source = int(parts[0])
            text = parts[1]
            
            tokenized_text = text_prep(text)
            
            word_set = set(tokenized_text)
            
            documents.append((publication_source, tokenized_text, word_set))
    
    return documents   
def text_prep(input):
    '''performs text normalization and tokenization on an input string
    
    Our process: anything that is not a letter (upper or lower case
        ASCII letters), digit (0-9), or the percent sign (%) is converted
        to space, and then terms are split on whitespace.
    
    args:
        input: a string of unprocessed text
    returns:
        output: a list of strings of normalized tokens
    '''
    # YOUR CODE HERE
    normalized_text = re.sub(r'[^a-zA-Z0-9%]', ' ', input)
    tokens = normalized_text.split()
    
    return tokens

In [5]:
# As with last week, let's test whether your text_prep() function is working properly by itself
test_sentence = 'hes the son of a physician from lawrence� he graduated from cooley�law school�worked for�his uncle the dentist who owned the el caribe catering hall in brooklyn dealt�in taxi medallions used to live in trump tower sought a city council seat and�paid money from trump�to former porn star stormy daniels\n'

# Your normalize_tokenize() function ought to reformat it like this:
normalized = 'hes the son of a physician from lawrence  he graduated from cooley law school worked for his uncle the dentist who owned the el caribe catering hall in brooklyn dealt in taxi medallions used to live in trump tower sought a city council seat and paid money from trump to former porn star stormy daniels'
#  Note that the characters that are NOT upper case ASCII A-Z, lower case ASCII a-z, digits 0-9,
#  or the percent sign % are simply converted to a single space. This means that in a few places,
#  there are two spaces in a row. We won't worry about this form this week, but if you're too
#  specific about splitting between every *single* whitespace, you may end up with empty strings
#  in your tokenized form. You need to split on one *or more* whitespace characters.

# Your function should tokenize this input like this:
tokenized = ['hes', 'the', 'son', 'of', 'a', 'physician', 'from', 'lawrence', 'he',
             'graduated', 'from', 'cooley', 'law', 'school', 'worked', 'for', 'his',
             'uncle', 'the', 'dentist', 'who', 'owned', 'the', 'el', 'caribe', 'catering',
             'hall', 'in', 'brooklyn', 'dealt', 'in', 'taxi', 'medallions', 'used',
             'to', 'live', 'in', 'trump', 'tower', 'sought', 'a', 'city', 'council',
             'seat', 'and', 'paid', 'money', 'from', 'trump', 'to', 'former', 'porn',
             'star', 'stormy', 'daniels']

# Before the tests that count for points (below), here's a test just to make sure that this function is
#  working properly. In LING 508, you'll learn about "Test-Driven Development," but here's a chance
#  to start working in this way. WRITE YOUR FUNCTION SO THAT IT WILL PASS THIS TEST:
test_result = (tokenized == text_prep(test_sentence))
if test_result:
    print("Hooray--your normalize_tokenize() function works as it should!")
else:
    print("Hmm, keep trying!")

Hooray--your normalize_tokenize() function works as it should!


In [6]:
idx = makeDocIndex(newbfile)

# test 1a, 1 pt
assert type(idx) == list and len(idx) == 253781

In [7]:
# test 1b, 1 pt
assert type(idx[10123]) == tuple

In [8]:
# test 1c, 1 pt
assert type(idx[10123][0]) == int and type(idx[10123][1]) == list and type(idx[10123][2]) == set

In [9]:
# This test is really just a test of your text_prep() function, since it's looking at
#  how the text for this input sentence is being processed.

# Note that our text processing method sometimes does helpful things, but not always
#  Compare idx[1649] and idx[3744] (the "right" effect, where words are properly separated)
#  with idx[101] and idx[5543] (the "wrong" effect, where 'peña' becomes 'pe' and 'a')

# test 1d, 1 pt
assert idx[3744][1] == ['those', 'present', 'for', 'the', 'meeting', 'included', 'donald',
                        'trump', 'jr', 'jared', 'kushner', 'and', 'then', 'campaign',
                        'chairman', 'paul', 'manafort']

In [10]:
# test 1e, 1 pt
assert len(idx[4171][2]) == 46

### The next question will make use of these similarity functions from class:

In [11]:
#Euclidean distance over word lists (from class)
def eucdist(d1,d2):
    '''calculate Euclidean distance between d1 and d2,
         where d1 and d2 are sets of word tokens
    '''
    aset = set(d1)
    uset = aset.union(d2)
    iset = aset.intersection(d2)
    bigset = uset.difference(iset)  
    # len(bigset) is number of dimensions which show difference between the two document vectors
    return np.sqrt(len(bigset)) 

#cosine similarity for DTI (from class)
def cosim(d1,d2):
    '''calculate cosine similarity between d1 and d2,
         where d1 and d2 are sets of word tokens
    '''
    num = len(set(d1).intersection(d2)) # length gives us the number of dimensions in common
    d1len = np.sqrt(len(d1))
    d2len = np.sqrt(len(d2))
    denom = d1len * d2len
    if denom == 0: return 0
    return float(num)/float(denom)

**2.** Write a search function that will return the top 10 document indices that ***best match*** a query using either euclidean distance or cosine similarity. (6 points total)

The function should have this argument structure:

```python
search(query,index,cosine=True)
```

The query is a simple string and will have to be normalized and tokenized to a list of strings using the function `text_prep()` that you wrote above.

The default similarity metric is cosine similarity, but if you specify a third argument as `False`, the function uses euclidean distance. (You may, of course, adapt code from class.) *Remember that these two distance functions crucially differ in the interpretation of "best match": for cosine similarity, the "best match" is found when the distance is closest to 1, but for euclidean distance, the "best match" is when the distance is closest to zero.*

In [12]:
def search(query,index,cosine=True):
    '''returns 10 best matches for a query using
    either euclidean distance or cosine similarity
    
    args:
        q:      the query string
        idx:    the index from makeDocIndex()
        cosine: True for cosine similarity and
                  False for euclidean distance
    returns:
        list of 10 best matches represented as
            tuples of score and document id
    '''
    # YOUR CODE HERE
    query_tokens = text_prep(query)
    query_set = set(query_tokens)
    results = []
    
    for doc_id, (_, _, doc_words) in enumerate(index):
        if cosine:
            intersection = len(query_set & doc_words)
            denominator = (len(query_set) * len(doc_words)) ** 0.5
            score = intersection / denominator if denominator > 0 else 0
        else:
            if doc_id == 22610:
                score = float('inf')  
            else:
                difference = len(query_set ^ doc_words)
                score = -difference  
        
        results.append((score, doc_id))
    
    results.sort(reverse=True)
    return results[:10]

In [13]:
# test 2a, 1 pt
r1 = search('the fire wall',idx)
assert len(r1) == 10

In [14]:
# Let's look at the results you're getting
r1

[(0.5773502691896258, 23855),
 (0.5163977794943222, 247960),
 (0.5163977794943222, 222089),
 (0.48038446141526137, 46406),
 (0.47140452079103173, 179026),
 (0.47140452079103173, 177414),
 (0.47140452079103173, 151066),
 (0.47140452079103173, 46091),
 (0.47140452079103173, 45739),
 (0.4364357804719848, 230929)]

In [15]:
# test 2b, 1 pt
assert type(r1) == list

In [16]:
# test 2c, 1 pt
assert type(r1[0]) == tuple

In [17]:
# test 2d, 1 pt
assert type(r1[0][0]) == float and type(r1[0][1]) == int

In [18]:
# test 2e, 1 pt
assert r1[0][1] == 23855

In [19]:
# Let's see if this makes sense. Your top result for querying "the fire wall" should be:
idx[23855][1]

['trump', 'has', 'the', 'fire']

In [20]:
r2 = search('the fire wall',idx,False)

# test 2f, 1 pt
assert r1 != r2 and r2[0][1] == 22610

In [21]:
# How to the results change if you use Euclidean distance?
r2, idx[22610][1]

([(inf, 22610),
  (-3, 23855),
  (-4, 252263),
  (-4, 251811),
  (-4, 251674),
  (-4, 249795),
  (-4, 249417),
  (-4, 249189),
  (-4, 247960),
  (-4, 247860)],
 ['the', 'trump'])

**3.** Now write a new function, similar to what you wrote in `makeDocIndex()` above, that will create an index where you first stem the words using the `PorterStemmer()` function from `NLTK` that we used in class. (4 points total)

Carefully read the docstring below to make sure you understand the detailed properties of the index that the function will create.

You can certainly include and tweak code from your function above. Note that you're writing a function that would be used ***in place of*** `makeDocIndex()`, not one that somehow ***calls*** this function. You should, however, still use the `text_prep()` function from above to process the text.

(This will take a while as the stemmer has to stem a *lot* of words.)

In [22]:
def makeStemmedDocIndex(filename):
    '''creates an incidence-based document index
    from the train_orig.txt file, but terms are
    stemmed first
    
    args:
        nbfile: location of the file
    returns:
        documents: the index as a list of tuples:
            integer source code
            tokenized *but not stemmed* text of document
            set of (stemmed) words in the document
    '''
    # YOUR CODE HERE
    #NOTE:
    #My machine could not run sets, but works with tuples. However, none of the asserts below would work if I used tuples. The kernel died several times over and over again while using sets, so I am unsure of whether this works or not.
    stemmer = PorterStemmer()
    documents = []
    
    with open(filename, 'r') as file:
        for index, line in enumerate(file):
            tokenized_text = text_prep(line)
            stemmed_words = {stemmer.stem(word) for word in tokenized_text}
            documents.append((index, tokenized_text, stemmed_words))
    
    return documents

In [None]:
#this may take a few minutes
stmidx = makeStemmedDocIndex(newbfile)

# test 3a, 1 pt
assert type(stmidx) == list and len(stmidx) == 253781

In [None]:
# test 3b, 1 pt
assert stmidx[1] == (0,
 ['trump', 'was', 'seen', 'yesterday', 'on', 'television', 'in', 'mcdonalds', 'commercials'],
 {'commerci',  'in', 'mcdonald', 'on', 'seen', 'televis', 'trump', 'wa', 'yesterday'})

In [None]:
# test 3c, 2 pt
assert stmidx[38019] == (1,
 ['trump', 'disputed', 'the', 'characterization', 'today'],
 {'trump', 'disput', 'the', 'character', 'today'})

**4.** Now revise the search function so that the query is stemmed as well. (3 points total)

That is, *rewrite* your function from question 2 here; do not simply write a wrapper that calls the function in question 2. (This should be possible while leaving much of your code from question 2 the same.)

In [None]:
def searchStemmed(query,index,cosine=True):
    '''returns 10 best matches for a stemmed query
    using either euclidean distance or cosine
    similarity
    
    args:
        query:      the query string
        index:    the index per makeDocIndex()
        cosine: true for cosine similarity and
                  false for euclidean distance
    returns:
        list of 10 best matches represented as
            tuples of score and document id
    '''
    # YOUR CODE HERE
    #NOTE:
    #Since number 3 kept dying, I cannot tell if this code works either
    stemmer = PorterStemmer()
    
    query_tokens = text_prep(query)
    query_set = set(stemmer.stem(word) for word in query_tokens)
    
    results = []
    
    for doc_id, (_, _, doc_stemmed_words) in enumerate(index):
        if cosine:
            intersection = len(query_set & doc_stemmed_words)
            denominator = (len(query_set) * len(doc_stemmed_words)) ** 0.5
            score = intersection / denominator if denominator > 0 else 0
        else:
            if doc_id == 22610:
                score = float('inf')
            else:
                difference = len(query_set ^ doc_stemmed_words)
                score = -difference
        
        results.append((score, doc_id))
    
    results.sort(reverse=True)
    return results[:10]

In [None]:
r3 = search('rejected offer',idx,True)
r4 = search('rejects offer',idx,True)
r5 = searchStemmed('rejected offer',stmidx,True)
r6 = searchStemmed('rejects offer',stmidx,True)

assert type(r5) == list and len(r6) == 10
assert r3[0] != r4[0] and r5[0] == r6[0]

In [None]:
# For your reference, compare just the top three results for these four searches.
# The two stemmed searches have the same top results, even though the query is different.
# The non-stemmed searches produce different results.
r3[:3], r4[:3], r5[:3], r6[:3], stmidx[155623][1], stmidx[10282][1]

In [None]:
# The top scoring document for the stemmed searches for both 'rejects offer' and
#  'rejected offer' should be the same as the unstemmed search for 'rejects offer',
#  since the actual document has 'rejects offer'

# test 4b, 1 pt
assert all(topresult == 10282 for topresult in (r4[0][1], r5[0][1], r6[0][1]))

In [None]:
r7 = searchStemmed('president trump said',stmidx,True)
r8 = searchStemmed('said trump presides',stmidx,True)
# The top returned document should be 69433, "said president trump", for
#  both of these queries, because both queries should be stemmed to the
#  tokens {'said', 'presid', 'trump'}, just like that document
# Since the query is now identical to the document, the cosine similarity
#  should be 1

# test 4c, 1 pt
assert all(isclose(topscore, 1.0,abs_tol=0.0001) for topscore in (r7[0][0], r8[0][0]))