<img src='https://hammondm.github.io/hltlogo1.png' style="float:right">
Linguistics 531<br>
Fall 2024<br>
Jackson

## Things to remember about any homework assignment:

1. For this assignment, you will edit this Jupyter notebook and turn it in. Do not turn in PDF files or separate `.py` files.
1. Late work is not accepted.
1. Given the way I grade, you should try to answer *every* question, even if you don't like your answer or have to guess.
1. You may *not* use `python` modules that we have not already used in class. (For grading, it needs to be able to run on my machine, and the way to do that is to limit yourself to the modules we've discussed and that are loaded into the Notebook.)
1. Don't use tools *other* than Jupyter Notebook to work on and submit your assignment, such as Google Colab or editing the `.ipynb` file as plain text, since they will mangle the autograding features. Diagnosing and fixing that kind of problem takes a lot of my time, and that means less of my time to offer constructive feedback to you and to other students.
1. You may certainly talk to your classmates about the assignment, but everybody must turn in *their own* work. It is not acceptable to turn in work that is essentially the same as the work of classmates, or the work of someone on Stack Overflow, or the work of a generative AI model. Using someone else's code and simply changing variable or object names is *not* doing your own work.
1. All code must run. It doesn't have to be perfect, it may not do all that you want it to do, but it must run without error. Code that runs with errors will get no credit from the autograder.
1. Code must run in reasonable time. Assume that if it takes more than *5 minutes* to run (on your machine), that's too long.
1. Make sure to select `restart, run all cells` from the `kernel` menu when you're done and before you turn this in!

my name: Kathleen Costa

people I talked to about the assignment: N/A

# Homework #2

**This is due Tuesday, October 29, 2024 at noon (Arizona time).**

This assignment continues with the `NewB` corpus (downloadable [here](https://github.com/JerryWei03/NewB)).

We first build an incidence-based term index from the `train_orig.txt` file. Remember that we are treating **each sentence as a separate document when building our index**. We are not treating each publication *source*, as indicated by the integer on each line, as a document; instead, that source number will be metadata. We'll make use of that later, when we look at classification of documents.
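As a rough illustration of the line format described above, here is a minimal sketch of splitting one line into its source integer and its sentence. The sample line is made up, and the assumption that whitespace separates the integer from the sentence should be checked against the real file.

```python
# Hypothetical sample line; the real file's exact separator is an assumption.
sample_line = "3 the senate voted to reject the bill\n"

# Strip the newline, then split off the leading integer (the source ID)
# from the rest of the line, which is the document text.
source_text, sentence = sample_line.strip().split(maxsplit=1)
source_id = int(source_text)

print(source_id)
print(sentence)
```

The source ID is kept as metadata only; the sentence itself is the document.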

You may adapt any of the code from class for this or use your own, but bear in mind that aspects of the code from class notebooks are tailored to the *different* file and data structure of the Blogger corpus. You may also use or adapt code from your first assignment. Do *not* stem or remove stop words.

Imports and important constants:

In [1]:
# You may use the RE module, if you choose, to perform the text normalization that is required here.
# There may be ways to accomplish what you need to do using the Python standard library, so you're
#  not required to use RE--but it's available if you choose.
import re

# You may also use a Counter() for this and future assignments, though you are not required to do so.
#  To learn more about Counter() objects, see https://realpython.com/python-counter/
from collections import Counter

**As with HW 1, make my autograding life easier (and your own notebook more likely to be graded correctly):**

In order for me to develop this assignment, and in order for me to **grade** your submission for this assignment, I need to be working with the right file that we load our corpus from. On my machine, that file has this path:

In [2]:
# Path on my own machine, needed for GRADING
newbfile = '/home/ejackson1/Downloads/linguistics/NewB/train_orig.txt'

# ie, DON'T CHANGE THIS CELL, CHANGE THE ONE BELOW!
#  If you change *this* cell, the autograding is likely to break.

However, for **you** to work on your own code, you need to point this notebook to the path for this file on your own machine. *You should enter the path on your own machine in the editable code cell below,* then uncomment that line so the notebook works on your machine. This means that the second code cell will take precedence in assigning the value of the path to the corpus, and you can write your code to open that file without problems.

**HOWEVER, BEFORE YOU SUBMIT to D2L** comment out **your** path again. This means that when I run the code on my own machine, it'll have the path that ***I*** need, and it'll grade your notebook properly.

In [3]:
# YOUR path
newbfile = '/hlt2/train_orig.txt'

**1.** Flesh out the two functions below. One function extracts each sentence from the document and parses off the integer that represents the publication source. For each sentence, we create an "incidence list" which is the set of words in that document (as described in the docstring below). (6 points total)

The data in the NewB corpus has supposedly already been normalized; things like apostrophes and punctuation have mostly been removed, and all text should have been made lower case. Percent signs have been left on numbers that represent percent. However, it looks like there may be some issues with the normalization method, since some characters besides just lower-case ASCII remain in the document. These include left and right curly braces { }, some back-ticks \`, and even stranger things. So, in order for us all to be working on the same data and getting the same answers, you're going to also create a normalization and tokenization function that will institute a standard method of preprocessing our text:

Follow the docstring for the second function below, and pay special attention to the data types. This function should replace all characters in the sentence that are not letters (upper or lower case ASCII letters), digits, or the percent sign % by a space. We'll then define "words" (which will be the *terms* for our index) to be any remaining alphanumeric characters which are found between spaces. For this assignment, do *not* normalize in any other way or stem.

In [4]:
def makeDocuments(filename):
    '''reads in the source file and returns a structured list
         of documents, with text normalization as described in
         the normalize_tokenize() function below.
    
    args:
        filename: location of train_orig.txt
    returns:
        documents: a list of triples:
            document source ID (as an integer)
            document text (normalized but not split, as a string)
            *set* of words in this document (normalized, as a set)

       Document IDs within our collection are assumed to be the
         integer index at which that document occurs in this list,
         which should also match the line number in the original
         document.
'''
    # YOUR CODE HERE   
    documents = []
    
    with open(filename, 'r') as file:
        for line_num, line in enumerate(file):
            line = line.strip()
            if line:
                match = re.match(r'^(\d+)\s*(.*)', line)
                if match:
                    source_id = int(match.group(1))  
                    sentence = match.group(2)  
                
                    normalized_text, words = normalize_tokenize(sentence)

                    # Store the document's words as a set (its incidence list)
                    documents.append((source_id, normalized_text, set(words)))

    return documents


def normalize_tokenize(sentence):
    '''Takes as input a line of text (assumed to be a multi-word sentence) and
    returns that sentence, normalized and tokenized into words.
    
    Conventions:
    -- upper and lower case ASCII letters (a-z and A-Z, with no diacritics) are kept
    -- digits (0-9) are kept
    -- percent sign (%) is kept
    All other characters are converted to whitespace, and words (terms) are
         then split on whitespace.
    
    args:
        sentence: a sentence from the corpus, as a string
    returns:
        normalized: a normalized version of the sentence, as described in the
            conventions above
        tokenized: the sequence of normalized words from the sentence, split
            on whitespace, as a list
    '''
    # YOUR CODE HERE
    normalized = re.sub(r'[^a-zA-Z0-9%]', ' ', sentence)
    normalized = normalized.strip()
    # split() with no argument collapses runs of spaces, so no empty tokens;
    #  the result stays a list, as the docstring requires
    tokenized = normalized.split()
    
    return normalized, tokenized

### Before working on `makeDocuments()`, make sure your tokenization function works!

Your `makeDocuments()` function won't be able to pass its tests if your `normalize_tokenize()` function isn't doing what it needs to do. So, let's test that tokenization function first!

We'll be testing (for points!) the document at index 4171. In the original file, that line looks like this:

In [5]:
test_sentence = 'hes the son of a physician from lawrence� he graduated from cooley�law school�worked for�his uncle the dentist who owned the el caribe catering hall in brooklyn dealt�in taxi medallions used to live in trump tower sought a city council seat and�paid money from trump�to former porn star stormy daniels'

In the original file, this would have ended with a newline, but I'm assuming that you'll strip newlines in the part of your code that reads in the file, just like you did in HW1. So, your `normalize_tokenize()` function will be operating on it *after* the newline has been removed.

I'm not sure where these non-ASCII characters came from, since they don't appear to be in the original article, which seems to be this one:

https://web.archive.org/web/20220521092314/https://www.newsday.com/long-island/donald-trump-michael-cohen-w71594

Your `normalize_tokenize()` function ought to reformat this document like this:

In [6]:
normalized = 'hes the son of a physician from lawrence  he graduated from cooley law school worked for his uncle the dentist who owned the el caribe catering hall in brooklyn dealt in taxi medallions used to live in trump tower sought a city council seat and paid money from trump to former porn star stormy daniels'

Note that the characters that are NOT upper case ASCII A-Z, lower case ASCII a-z, digits 0-9, or the percent sign % are simply converted to a single space. In a few places, this results in a sequence of two or more spaces. We don't want to "find" a zero-length string in between two spaces, so be careful how you tokenize this with Python's `str.split()`!
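To make the warning about `str.split()` concrete, here is a small sketch on a made-up fragment with a double space. It only compares the two ways of calling `split()`.

```python
text = 'lawrence  he graduated'  # two spaces between the first two words

# Splitting on a literal space yields an empty string between adjacent spaces.
with_sep = text.split(' ')

# Splitting with no argument treats any run of whitespace as one separator.
without_sep = text.split()

print(with_sep)     # ['lawrence', '', 'he', 'graduated']
print(without_sep)  # ['lawrence', 'he', 'graduated']
```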

Your function should tokenize it like this:

In [7]:
tokenized = ['hes', 'the', 'son', 'of', 'a', 'physician', 'from', 'lawrence', 'he',
             'graduated', 'from', 'cooley', 'law', 'school', 'worked', 'for', 'his',
             'uncle', 'the', 'dentist', 'who', 'owned', 'the', 'el', 'caribe', 'catering',
             'hall', 'in', 'brooklyn', 'dealt', 'in', 'taxi', 'medallions', 'used',
             'to', 'live', 'in', 'trump', 'tower', 'sought', 'a', 'city', 'council',
             'seat', 'and', 'paid', 'money', 'from', 'trump', 'to', 'former', 'porn',
             'star', 'stormy', 'daniels']

Before the tests that count for points (below), here's a test just to make sure that this function is working properly. In LING 508, you'll learn about "Test-Driven Development," but here's a chance to start working in this way. WRITE YOUR FUNCTION SO THAT IT WILL PASS THIS TEST:

In [8]:
if (normalized, tokenized) == normalize_tokenize(test_sentence):
    print("Hooray--your normalize_tokenize() function works as it should!")
else:
    print("Hmm, keep trying!")

Hooray--your normalize_tokenize() function works as it should!


Now you can move on to the `makeDocuments()` function, and make sure it's working properly, as well, this time for points!

In [9]:
# test 1a, 1 pt
docs = makeDocuments(newbfile)
assert len(docs) == 253781

In [10]:
# test 1b, 1 pt
assert type(docs[10]) == tuple

In [11]:
# test 1c, 1 pt
assert len(docs[10]) == 3

In [12]:
# test 1d, 1 pt
assert docs[4171][0] == 0

In [13]:
# test 1e, 1 pt
#   This is really a test of your normalize_tokenize() function, and
#   how you've integrated it into the makeDocuments() function.
assert docs[4171][1] == 'hes the son of a physician from lawrence  he graduated from cooley law school worked for his uncle the dentist who owned the el caribe catering hall in brooklyn dealt in taxi medallions used to live in trump tower sought a city council seat and paid money from trump to former porn star stormy daniels'

In [14]:
# test 1f, 1 pt
#   This is ALSO testing the output of your normalize_tokenize() function!
assert len(docs[4171][2]) == 46

**2.** The following function takes the output of `makeDocuments()` and creates an incidence-based index. We represent the index as a dictionary that maps from words to a *sorted* list of integer document IDs. The integer document IDs are the indices at which that document occurs in the structured list of documents that is returned by `makeDocuments()`. (4 points total)

(Note that there is the potential for confusion here, since the word _index_ is being used two ways to refer to two different things. First, the incidence-based _index_ that we're building refers to a dictionary that maps from terms to a list of the document IDs that they occur in. Second, the integer document IDs are an _index_, that is, an integer that allows us to specify an item in our document list, like a subscript or a counter.)
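The two senses of _index_ may be easier to see on a toy collection. The sketch below uses three made-up documents, not the NewB data, and the names `toy_docs` and `toy_index` are hypothetical.

```python
# Three toy documents; each document ID is its position (index) in this list.
toy_docs = ['the hat', 'the coat', 'hat and coat']

# The incidence index maps each word to the IDs of the documents containing it.
toy_index = {}
for doc_id, text in enumerate(toy_docs):
    for word in set(text.split()):
        toy_index.setdefault(word, []).append(doc_id)

print(toy_index['hat'])   # [0, 2]
print(toy_index['coat'])  # [1, 2]
```

Because the documents are visited in order and each word is counted once per document, each list of IDs comes out already sorted in this toy case.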

In [15]:
def makeIndex(documents):
    '''maps from a documents list to an
    incidence index represented as a dictionary
    from words to sorted lists of document IDs
    
    args:
        documents: a documents list as produced
            by makeDocuments()
    returns:
        index: a dictionary from words to lists
            of document IDs (ie, indices in the
            documents list)
    '''
    # YOUR CODE HERE
    index = {}
    for doc_id, (_, _, word_set) in enumerate(documents):
        for word in word_set:
            index.setdefault(word, []).append(doc_id)
    
    for word_list in index.values():
        word_list.sort()
    
    return index

In [16]:
docs = makeDocuments(newbfile)
idx = makeIndex(docs)

# test 2a, 1 pt
assert type(idx) == dict

In [17]:
# test 2b, 1 pt
assert len(idx) == 61193

In [18]:
# test 2c, 1 pt
assert idx['champagnes'] == [223755]

In [19]:
# test 2d, 1 pt
assert idx['happiness'] == [16495,66139,84943,
                            85998,91589,93472,
                            120070,133078,193349]

**3.** Now write a function that will return the set of all document IDs that contain some set of words. (5 points total)

The function should take two arguments:

- an index (ie, the output of `makeIndex`)
- a list of strings (ie, the search query)
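One way to think about a multi-word query is as a set intersection over the lists stored in the index. This is a sketch on made-up data (the document IDs are invented), not the required implementation.

```python
# Hypothetical index fragment with invented document IDs.
toy_index = {'senate': [1, 4, 7], 'reject': [2, 4, 7, 9]}

# Documents containing BOTH words: intersect the two sets of document IDs.
both = set(toy_index['senate']) & set(toy_index['reject'])

print(sorted(both))  # [4, 7]
```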

In [20]:
def search(idx,ws):
    '''returns the set of documents that contain
    some set of words
    
    args:
        idx: an incidence index, as created by
            makeIndex()
        ws: a list of words
    returns:
        docs: a set of document indices
    '''
    # YOUR CODE HERE
    if not ws:
        return set()
        
    results = set(idx.get(ws[0], set()))
    
    for term in ws[1:]:
        if term not in idx:
            return set()
        results &= set(idx[term])
    
    return results

In [21]:
docs = makeDocuments(newbfile)
idx = makeIndex(docs)

# test 3a, 1 pt
assert type(search(idx,['airplane'])) == set

In [22]:
# test 3b, 1 pt
assert len(search(idx,['omelet'])) == 1

In [23]:
# test 3c, 1 pt
assert search(idx,['senate','reject']) == {68347, 144901, 177620, 181422, 181564}

In [24]:
# test 3d, 1 pt
assert search(idx,['wow','congress','airplane']) == set()

In [25]:
# test 3e, 1 pt
assert search(idx,['fire','wall']) == {32718, 46406, 49273, 67060, 154764, 178112, 201064}

**4.** Tweak the function so that it will complement any of the words. (5 points total)

For example, if we call the function like this:

```python
dsearch(idx,['hat','*chair','coat'])
```

That will return all document IDs (indices) that contain *hat* and *coat*, but not *chair*.
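In set terms, the complement of a word is every document ID in the collection minus the IDs where that word occurs. Here is a sketch on invented data, assuming a collection of five documents numbered 0 through 4.

```python
# Hypothetical index fragment with invented document IDs.
toy_index = {'hat': [0, 2, 3], 'coat': [2, 3], 'chair': [3, 4]}
all_doc_ids = set(range(5))  # assumed: the collection holds documents 0 through 4

# '*chair' means "every document that does NOT contain chair".
not_chair = all_doc_ids - set(toy_index['chair'])

# hat AND (NOT chair) AND coat
result = set(toy_index['hat']) & not_chair & set(toy_index['coat'])

print(sorted(result))  # [2]
```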

In [26]:
def dsearch(idx,ws):
    '''returns the set of documents that contain
    or do *not* contain some set of words
    
    args:
        idx: a term index as created by makeIndex()
        ws: a list of words any of which may be
            marked with a prefixed '*', which
            indicates complement/negation
    '''
    # YOUR CODE HERE
    all_docs = set()
    for docs in idx.values():
        all_docs.update(docs)
    
    results = all_docs.copy()
    
    for term in ws:
        if term.startswith('*'):
            word = term[1:]
            if word in idx:
                results &= (all_docs - set(idx[word]))
        else:
            if term in idx:
                results &= set(idx[term])
            else:
                return set()
    
    return results

In [27]:
docs = makeDocuments(newbfile)
idx = makeIndex(docs)

# test 4a, 1 pt
assert type(dsearch(idx,['airplane'])) == set

In [28]:
# test 4b, 1 pt
assert len(dsearch(idx,['hats','*other'])) == 85

In [29]:
# test 4c, 1 pt
assert len(dsearch(idx,['hats'])) + \
len(dsearch(idx,['*hats'])) == \
len(docs)

In [30]:
# test 4d, 1 pt
assert dsearch(idx,['win','send','*not']) == {57751, 251520}

In [31]:
# test 4e, 1 pt
assert dsearch(idx,['win','*not']) == dsearch(idx,['*not','win'])

**5.** How would you extend your function to include disjunction? In other words, think about what it would take to adapt this function so that you can have an OR connective as well as an AND connective. What aspects of this function will need to change? (3 points, manually graded)

This has two parts. First, what would the syntax of the keywords need to be, that you would be passing as arguments to this function? How does the introduction of this new Boolean operation complicate the interpretation of a query?

Second, how would you actually implement disjunction in the Python code? What is the built-in function that would be needed to represent it? *(Hint: Think about which operation we used to implement a Boolean AND. What's the operation that would correspond to Boolean OR?)*

*For full points, your answer should make it clear that you understand how to represent the Boolean connectors in the search string in terms of set operations over the index. Your answer should also make it clear how the simple syntax that we've been using so far would become more complicated once we introduce the possibility of **disjunction** of search terms in addition to **conjunction** of search terms.*

To extend the `dsearch` function to include disjunction, the query syntax needs a way to mark OR in addition to the implicit AND between list items. For example, to find documents that contain "champagnes" AND either "airplane" OR "happiness" AND "hats", the call could look something like this: `dsearch(idx, ['champagnes', 'airplane|happiness', 'hats'])`, where the `|` inside a single string separates the alternatives. (Writing `'airplane' | 'happiness'` directly would not work, because Python does not define `|` for strings.) Introducing OR makes a query harder to interpret, because a flat list of words no longer says how the terms group: "champagnes AND airplane OR happiness AND hats" could mean "champagnes AND (airplane OR happiness) AND hats" or "(champagnes AND airplane) OR (happiness AND hats)", so the syntax needs an explicit grouping or precedence rule. In the Python code, AND is implemented with set intersection (`&`), which narrows the results, so OR would be implemented with set union (`|` or `set.union()`), which widens the results to documents that contain either word.
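A minimal sketch of one possible design follows, assuming the query is written as a list of groups: words inside a group are joined by OR, and the groups themselves are joined by AND. The function name `osearch`, the nested-list syntax, and the toy data are all hypothetical, and negation is left out to keep the sketch short.

```python
def osearch(idx, groups):
    '''Hypothetical sketch: AND across groups, OR within each group.

    args:
        idx: an incidence index (word -> sorted list of document IDs)
        groups: a list of lists of words, e.g. [['hat'], ['coat', 'scarf']]
    returns:
        a set of document IDs
    '''
    results = None
    for group in groups:
        # Union (Boolean OR) of the document IDs for every word in this group.
        matches = set()
        for word in group:
            matches |= set(idx.get(word, []))
        # Intersection (Boolean AND) with the groups seen so far.
        results = matches if results is None else results & matches
    return results if results is not None else set()


# Invented index fragment: hat AND (coat OR scarf)
toy_index = {'hat': [0, 2, 3], 'coat': [2], 'scarf': [3, 5]}
print(sorted(osearch(toy_index, [['hat'], ['coat', 'scarf']])))  # [2, 3]
```

Making the grouping explicit in the data structure avoids having to parse operator precedence out of a flat list of strings.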