<img src='https://hammondm.github.io/hltlogo1.png' style="float:right">
Linguistics 531<br>
Fall 2024<br>
Jackson

## Things to remember about any homework assignment:

1. For this assignment, you will edit this Jupyter notebook and turn it in. Do not turn in PDF files or separate `.py` files.
1. Late work is not accepted.
1. Given the way I grade, you should try to answer *every* question, even if you don't like your answer or have to guess.
1. You may *not* use `python` modules that we have not already used in class. (For grading, it needs to be able to run on my machine, and the way to do that is to limit yourself to the modules we've discussed and that are loaded into the Notebook.)
1. Don't use editors *other* than Jupyter Notebook to work on and submit your assignment. Google Colab, or even just editing the `.ipynb` file as a plain text file, will mangle the autograding features. Diagnosing and fixing that kind of problem takes a lot of my time, and that means less of my time to offer constructive feedback to you and to other students.
1. You may certainly talk to your classmates about the assignment, but everybody must turn in *their own* work. It is not acceptable to turn in work that is essentially the same as the work of classmates, or the work of someone on Stack Overflow, or the work of a generative AI model. Using someone else's code and simply changing variable or object names is *not* doing your own work.
1. All code must run. It doesn't have to be perfect, it may not do all that you want it to do, but it must run without error. Code that runs with errors will get no credit from the autograder.
1. Code must run in reasonable time. Assume that if it takes more than *5 minutes* to run (on your machine), that's too long.
1. Make sure to select `restart, run all cells` from the `kernel` menu when you're done and before you turn this in!

my name: Kathleen Costa

people I talked to about the assignment: N/A

# Homework #6

**This is due Tuesday, November 26, 2024 at noon (Arizona time).**

This assignment continues with the `NewB` corpus (downloadable [here](https://github.com/JerryWei03/NewB)), but we're moving on to methods of classifying these documents rather than searching over them. We'll be implementing the classification by mean post length approach that we started in the lectures. Remember what we saw about how classification based *only* on the feature of length worked for the Blogger data. How well will this method work for classifying the NewB data?

imports:

In [1]:
import re
import numpy as np
import scipy.stats as stats
import pandas as pd
from math import isclose
from collections import Counter # I need this for a test. You might not need it, but if you find a use, you can use it

**As before, this section is for autograding:**

What I need on my machine to properly grade this:

In [2]:
# Path on my own machine, needed for GRADING
newbfile = '/home/ejackson1/Downloads/linguistics/NewB/train_orig.txt'

# ie, DON'T CHANGE THIS CELL, CHANGE THE ONE BELOW!
#  If you change *this* cell, the autograding is likely to break.

*In the editable cell below, enter the path on your own machine,* then uncomment that line so the notebook works on your machine.

**BEFORE YOU SUBMIT to D2L, remember to comment out *your* path again.**

In [3]:
# YOUR path
newbfile = 'train_orig.txt'

**1.** Read in the entire contents of the `train_orig.txt` file into a list of tuples of the form `(<publication ID>, <sentence>)`. (2 points)

This is similar to, but a bit simpler than, the functions you've been writing to create document indices, since now we only need to keep track of the publication source of each sentence, to be used for classification. Don't worry about normalizing and tokenizing the sentence yet; we'll get to that down below. For now, just make sure you've removed the newline character (`\n`) that is at the end of each line.

In [4]:
def getSentences(filename):
    '''read in newB data and return a list of publication IDs
    and sentences
    
    args:
        filename: location of train_orig.txt
    returns:
        list of tuples: (publication ID (as an integer),
                         sentence (as a single string, with \n removed))
    '''
    # YOUR CODE HERE
    sentences = []
    
    with open(filename, 'r', encoding='utf-8') as f:
        for line in f:
            parts = line.strip().split('\t')
            if len(parts) == 2:
                pub_id = int(parts[0])
                sentence = parts[1]
                sentences.append((pub_id, sentence))
    
    return sentences

In [5]:
ss = getSentences(newbfile)
countIDs = Counter([pubID for pubID, sentence in ss])

# test 1a, 1 pt
assert all([count == 23071 for count in countIDs.values()])

In [6]:
# Let's just see what those counts are:
#
# This should be
# dict_items([(0, 23071), (1, 23071), (2, 23071), (3, 23071), (4, 23071), (5, 23071), (6, 23071), (7, 23071), (8, 23071), (9, 23071), (10, 23071)])
countIDs.items()

dict_items([(0, 23071), (1, 23071), (2, 23071), (3, 23071), (4, 23071), (5, 23071), (6, 23071), (7, 23071), (8, 23071), (9, 23071), (10, 23071)])

In [7]:
# test 1b, 1 pt
assert all([id in range(11) for id in countIDs]) and len(countIDs) == 11

**2.** From the data structure `ss` that contains the whole collection, separate just categories #3 and #7 into separate lists of sentences (without the publication ID code). (2 points)

Note the variable names that you should use from both the comments and from the `assert` statements.

*Hint: At this point in the notebook, you've got the data structure with all your sentences in a list named `ss`. The lightest solution would approach this problem with a list comprehension, though you could also write a function (maybe even a lambda function!) that takes in the collection and a pubID and returns just the sentences from a single pubID. Note that slicing `ss` by itself won't produce an output with the right structure, since you're asked not just to return a list of `(pubID, sentence)` tuples, but a list of just the sentences&mdash;but slicing might be an important part of a lambda-function approach or a list comprehension approach.*
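The function and lambda alternatives mentioned in the hint can be sketched on toy data as follows. This is only an illustration: `toy_ss`, `sentences_for`, and `sentences_for_lambda` are hypothetical names, and the toy list merely stands in for the real `ss` structure.

```python
# Toy stand-in for `ss`: a list of (pubID, sentence) tuples (hypothetical data)
toy_ss = [(3, "first sentence"), (7, "second sentence"), (3, "third sentence")]

def sentences_for(collection, pub_id):
    """Return only the sentence strings whose publication ID matches pub_id."""
    return [sentence for pid, sentence in collection if pid == pub_id]

# The same idea as a lambda: filter the tuples, then index into each one
# to keep just the sentence (position 1) and drop the pubID (position 0)
sentences_for_lambda = lambda collection, pub_id: [
    pair[1] for pair in filter(lambda pair: pair[0] == pub_id, collection)
]

print(sentences_for(toy_ss, 3))
print(sentences_for_lambda(toy_ss, 7))
```

Both versions return a flat list of strings rather than a list of tuples, which is the structure the `assert` statements for this question expect.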

In [8]:
#define cat3 = a list of all sentences in category 3
# YOUR CODE HERE
cat3 = [sentence for pub_id, sentence in ss if pub_id == 3]

#define cat7 = a list of all sentences in category 7
# YOUR CODE HERE
cat7 = [sentence for pub_id, sentence in ss if pub_id == 7]

In [9]:
# test 2a, 1 pt
assert len(cat3) == len(cat7) == 23071

In [10]:
# test 2b, 1 pt
assert cat3[19] == 'an average of recent polls puts clinton ahead of trump 47% to 42%'

**3.** Divide each of these single-source document collections into training and test sets. (2 points)

Make the test sets the first 1000 sentences each; use the rest for training. Note the variable names from the comments and `assert` statements. There might be multiple ways to do this, but aim for an approach that clearly indicates to someone reading your code what your overall goal is.

In [11]:
#define test3 = first 1000 sentences in category 3
# YOUR CODE HERE
test3 = cat3[:1000]

#define test7 = first 1000 sentences in category 7
# YOUR CODE HERE
test7 = cat7[:1000]

#define train3 = remaining sentences in category 3
# YOUR CODE HERE
train3 = cat3[1000:]

#define train7 = remaining sentences in category 7
# YOUR CODE HERE
train7 = cat7[1000:]

In [12]:
# test 3a, 1 pt
assert len(test3) == len(test7) == 1000 and len(train3) == len(train7) == 22071

In [13]:
# test 3b, 1 pt
assert train3[25] == 'trump has said much more than he has done and hes thrown many more punches than hes landed'

**4.** Now write a function to normalize and tokenize your sentences. (2 points)

Convert anything besides upper or lower case letters, numbers, and the percent sign to space, and then split the string on white space. After doing this for so many other homeworks, this should be very familiar to you now; you can probably reuse your `text_prep()` function from past assignments with little to no modification (other than the name).

In [14]:
def tokenize(s):
    '''
    change everything other than upper & lower case
    ASCII letters, numbers, and the percent sign to
    white space, and tokenize based on white space
    '''
    # YOUR CODE HERE
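    # Replace every character that is not an ASCII letter, digit, or % with a space
    cleaned = re.sub(r'[^a-zA-Z0-9%]', ' ', s)
    # split() with no arguments splits on runs of whitespace and never yields empty strings
    return cleaned.split()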

In [15]:
# test 4a, 1 pt
assert tokenize(train3[378]) == ['trump', 'leads',
    'among', 'men', '47%', 'to', '36%', 'while',
    'clinton', 'has', 'a', 'smaller', '41%', '34%',
    'edge', 'among', 'women']

In [16]:
# test 4b, 1 pt
assert tokenize(test7[434]) == ['full','text','donald',
    'trump','says','that', 'when','he','looks','at',                            
    'himself','in', 'the','mirror','he','sees','a','man',
    'half','his','age']

**5.** Now calculate four things *for each of the two training sets*: the mean and standard deviation for the length (in words) of the tokenized sentences, and the mean and standard deviation for the length (in characters) of the words in the tokenized sentences. Your code here should call your `tokenize()` function from question 4. (8 points)

You can do this with a single list comprehension for each value to be calculated (though you're not *forced* to use a list comprehension; you just need to get the right answer). Since we've imported NumPy, you may find some of its [statistical functions](https://numpy.org/doc/stable/reference/routines.statistics.html) to be convenient. These were also used in the class notebook, which you can review for guidance. Again, please note the variable names from the comments and the following `assert` statements.

Are you having trouble passing the `assert` statements? **Be sure you're preparing the data with a working tokenization function from question 4.** If that's working properly, then double-check that you're getting the proper number of words per sentence, and the proper number of letters per word.

Also, for the word-level statistics, note that you [**cannot**](https://stats.stackexchange.com/questions/133138/will-the-mean-of-a-set-of-means-always-be-the-same-as-the-mean-obtained-from-the#:~:text=No%2C%20the%20averages%20of%20the,are%20the%20same%20sample%20size.) find the statistics at a sentence level (like a mean for the length of words in characters) and then average the per-sentence values across all sentences; that is not guaranteed to give you the same result as the overall mean word length. You must create a suitable data set (ie, over all words), and then calculate the statistics for all items in that data set. Note that your `train3`, `train7`, `test3`, `test7` data structures are effectively lists of lists of words; you might find [this discussion](https://stackoverflow.com/questions/952914/how-do-i-make-a-flat-list-out-of-a-list-of-lists) helpful for de-nesting a list of lists.
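A small worked example may make the warning about averaging averages concrete. It is a sketch on made-up data (the two toy "sentences" are hypothetical and already tokenized), using the standard library's `statistics.mean` so that it runs without any other setup; `np.mean` would behave the same way here.

```python
from statistics import mean

# Two toy "sentences", already tokenized (hypothetical data).
# The first has 2 short words, the second has 4 longer words.
sentences = [["a", "bb"], ["cccc", "dddd", "eeee", "ffff"]]

# Approach that does NOT work: mean word length per sentence,
# then the mean of those per-sentence means.
per_sentence_means = [mean(len(word) for word in sent) for sent in sentences]
mean_of_means = mean(per_sentence_means)

# Approach that works: flatten to one list of word lengths,
# then take a single mean over all words.
all_word_lengths = [len(word) for sent in sentences for word in sent]
overall_mean = mean(all_word_lengths)

print("mean of per-sentence means:", mean_of_means)
print("mean over all words:       ", overall_mean)
```

The two numbers differ because the short sentence contributes only two words to the overall data set but counts as a full half of the mean of means. The same reasoning applies to the standard deviation.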

In [17]:
#mean length of sentences in words for train3: t3smean
# YOUR CODE HERE
t3smean = np.mean([len(tokenize(sent)) for sent in train3])

#mean length of sentences in words for train7: t7smean
# YOUR CODE HERE
t7smean = np.mean([len(tokenize(sent)) for sent in train7])

#standard deviation for sentence lengths for train3: t3ssd
# YOUR CODE HERE
t3ssd = np.std([len(tokenize(sent)) for sent in train3])

#standard deviation for sentence lengths for train7: t7ssd
# YOUR CODE HERE
t7ssd = np.std([len(tokenize(sent)) for sent in train7])

#mean word length in characters for train3: t3wmean
# YOUR CODE HERE
t3wmean = np.mean([len(word) for sent in train3 for word in tokenize(sent)])

#mean word length in characters for train7: t7wmean
# YOUR CODE HERE
t7wmean = np.mean([len(word) for sent in train7 for word in tokenize(sent)])

#standard deviation of word length for train3: t3wsd
# YOUR CODE HERE
t3wsd = np.std([len(word) for sent in train3 for word in tokenize(sent)])

#standard deviation of word length for train7: t7wsd
# YOUR CODE HERE
t7wsd = np.std([len(word) for sent in train7 for word in tokenize(sent)])

In [18]:
# test 5a, 1 pt
# I get t3smean == 24.724162928730006
assert isclose(t3smean,24.7242,abs_tol=0.0001)

In [19]:
# test 5b, 1 pt
# I get t7smean == 22.896334556658058
assert isclose(t7smean,22.8963,abs_tol=0.0001)

In [20]:
# test 5c, 1 pt
# I get t3ssd == 11.951055365116888
assert isclose(t3ssd,11.9511,abs_tol=0.0001)

In [21]:
# test 5d, 1 pt
# I get t7ssd == 14.060112912440676
assert isclose(t7ssd,14.0601,abs_tol=0.0001)

In [22]:
# test 5e, 1 pt
# I get t3wmean == 4.9649835894936105
assert isclose(t3wmean,4.9650,abs_tol=0.0001)

In [23]:
# test 5f, 1 pt
# I get t7wmean == 4.800249334612987
assert isclose(t7wmean,4.8002,abs_tol=0.0001)

In [24]:
# test 5g, 1 pt
# I get t3wsd == 2.6140707416546554
assert isclose(t3wsd,2.6141,abs_tol=0.0001)

In [25]:
# test 5h, 1 pt
# I get t7wsd == 2.440287717233418
assert isclose(t7wsd,2.4403,abs_tol=0.0001)

We can look at these in a table. (Notebooks do a great job at formatting tables from a Pandas DataFrame.)

In [26]:
pd.DataFrame(
    np.array([
        [t3smean,t3ssd,t3wmean,t3wsd],
        [t7smean,t7ssd,t7wmean,t7wsd]
    ]).T,
    columns=['group 3','group 7'],
    index=[
        'sentence mean',
        'sentence sd',
        'word mean',
        'word sd'
    ]
)

| | group 3 | group 7 |
|---|---|---|
| sentence mean | 24.724163 | 22.896335 |
| sentence sd | 11.951055 | 14.060113 |
| word mean | 4.964984 | 4.800249 |
| word sd | 2.614071 | 2.440288 |


If these numbers are hard for you to digest, you can also display them as histograms by adapting the code from the class notebook. (You'll need to do this in a different notebook, not this one. I couldn't include code to generate the histograms here because the code that would generate the histograms is basically the same code you need to write for question 5.)

**6.** Give the code to build a classification model **using *just* mean sentence length for these categories**, run the comparison, and report the results, as described in the docstring for this function. (2 points)

Don't forget to apply your tokenization function from question 4 to the test sentences.

*A note on this function and on the interpretation of its output: Here, as you can see from the docstring, you're not writing a function that would return a classification for a single test item. Instead, this function is intended to take in the mean lengths for two different classes of documents, along with a list of test documents, and return **an integer** that specifies how many sentences from the test set were classified as part of the **second** class. So, with the proper ordering of the mean sentence lengths that you feed in as `m1` and `m2`, this number reflects how well your model worked: as long as the "right" mean for a given test set is the one you pass in as `m2`, a larger return value (up to the size of the test set) indicates that your model's predictions are better. As a specific example, if (1) your test set has 10 sentences, and (2) that test set is taken from class 8, and (3) you give this function the mean length for the training data from class 8 in the position of `m2` (and some other mean in `m1`), then returning `10` would be a "perfect" score, which would mean this function classified 100% of the test set as the proper class; if the function returned `5`, then only 50% of the test set sentences were properly classified as class 8.*

(Yes, in question 5 you calculated word lengths, also, but we're not using those now.)

In [27]:
def classByMeanLength(m1,m2,testdocs):
    '''
    Implements a classification model that classifies sentences in a test set
    into one of two classes, given the mean sentence length for each class
    
    Rather than return a classification for a single sentence, this function
    returns the number of items in the test set that are closer to the second
    mean than to the first.
    
    Items in the test set should be tokenized using tokenize().
    
    args:
        m1: mean number of words per sentence for class 1
        m2: mean number of words per sentence for class 2
        testdocs: a list of strings (each string is considered a document)
    
    returns:
        the number of test items determined to be in class 2
        (that is, whose length is closer to the mean for class 2)
    '''
    # YOUR CODE HERE
    test_lengths = [len(tokenize(doc)) for doc in testdocs]
    class2_count = sum(1 for length in test_lengths 
                      if abs(length - m2) < abs(length - m1))
    
    return class2_count

In [28]:
# test 6a, 1 pt
# I get that 512 out of 1000 items in test set 3 are closer to the
#  mean given as m2 (which is the training set 3 mean)
assert isclose(classByMeanLength(t7smean,t3smean,test3),512,abs_tol=1)

This is 51.2% accuracy, only slightly better than chance, so this isn't a great method.

In [29]:
# test 6b, 1 pt
# I get that 588 out of 1000 items in test set 7 are closer to the
#  mean given as m2 (which is the training set 7 mean)
assert isclose(classByMeanLength(t3smean,t7smean,test7),588,abs_tol=1)

Again, this is 58.8% accuracy, so not a great method&mdash;but our point here wasn't to show that modeling by sentence length was the best method to use. Here, we just want to see the basics of working with a training set and a test set for our data, and see how we can evaluate how well our classification model is working.

Eventually, we'll make a similar model, but instead of classifying a test document as part of some class using "scalar difference from the mean document length of the class", we'll represent our training and test documents as vectors, just like we did for searching, and we'll classify a test document as part of a class using "distance from the mean vector of the class", putting our Euclidean distance and cosine similarity functions back into use.
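As a preview of that idea, here is a minimal sketch of classifying by distance from a class's mean vector. Everything in it is an assumption made for illustration: the three-dimensional toy vectors, the class labels `'a'` and `'b'`, and the helper names are hypothetical, and the course's own Euclidean distance and cosine similarity functions may be written differently.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (assumes neither is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Toy training vectors for two hypothetical classes
train_a = [[2, 0, 1], [4, 0, 1]]
train_b = [[0, 3, 1], [0, 5, 1]]
centroid_a = mean_vector(train_a)
centroid_b = mean_vector(train_b)

def classify_euclidean(doc):
    """Pick the class whose mean vector is nearer in Euclidean distance."""
    if math.dist(doc, centroid_a) < math.dist(doc, centroid_b):
        return 'a'
    return 'b'

def classify_cosine(doc):
    """Pick the class whose mean vector is more similar by cosine."""
    if cosine_similarity(doc, centroid_a) > cosine_similarity(doc, centroid_b):
        return 'a'
    return 'b'

test_doc = [1, 0, 1]
print(classify_euclidean(test_doc), classify_cosine(test_doc))
```

The structure mirrors the scalar model from question 6: a training step that reduces each class to a single summary (here a mean vector instead of a mean length), and a test step that assigns each document to the nearer summary.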

**7.** Now write code to create another sentence classification model based on sentence length (as in question 6), but where the two classes are modeled as normal distributions. (2 points)

Again, don't forget to apply your tokenization function from question 4 to the test sentences.

The interpretation of the output of this function will be the same as for question 6: if you feed it 10 test sentences which were taken from the class whose mean and standard deviation are given as `m2` and `sd2`, then returning `10` would be a perfect score, meaning that all ten test sentences were judged to be part of the second distribution, not the first.

Recall that the lectures showed how to use the `norm` function in the SciPy Stats module, as well as its `.pdf()` (*probability density function*) method. Review the class notebook for examples of how to do this. You may also find it helpful to read the [SciPy Stats documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html), or *some* of the information in [this tutorial](https://www.tutorialspoint.com/scipy/scipy_stats.htm).
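To make the comparison of two densities concrete, the sketch below writes out the normal probability density by hand with the standard library, so that it runs without SciPy. `scipy.stats.norm.pdf(x, loc=mean, scale=sd)` should give the same density. The helper names are hypothetical, and the means and standard deviations are the rounded sentence-length values from the table after question 5; this illustrates the idea and is not the required solution.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution with the given mean and sd at x."""
    coefficient = 1.0 / (sd * math.sqrt(2.0 * math.pi))
    exponent = -((x - mean) ** 2) / (2.0 * sd ** 2)
    return coefficient * math.exp(exponent)

# Rounded sentence-length statistics from the table after question 5
MEAN_3, SD_3 = 24.72, 11.95
MEAN_7, SD_7 = 22.90, 14.06

def more_likely_class(length):
    """Return 3 or 7, whichever normal curve is higher at this length."""
    if normal_pdf(length, MEAN_3, SD_3) > normal_pdf(length, MEAN_7, SD_7):
        return 3
    return 7

# A very short, a typical, and a very long sentence length (in words)
for length in (5, 24, 60):
    print(length, more_likely_class(length))
```

With these rounded values, the narrower class 3 curve is higher for a typical length such as 24 words, while the wider class 7 curve is higher in both tails (5 and 60 words). Unlike the distance-to-mean model, this model takes the spread of each class into account as well as its center.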

In [30]:
def classByNormalDistribution(m1,sd1,m2,sd2,testdocs):
    '''
    Implements a classification model that classifies sentences from a test set
    into one of two classes, given a normal probability distribution for each
    class
    
    Rather than return a classification for a single sentence, this function
    returns the number of items in the test set for which the probability is
    greater that they're in the second distribution than in the first.
    
    Items in the test set should be tokenized using tokenize().
       
    args:
        m1: mean number of words per sentence for class 1
        sd1: standard deviation of words per sentence for class 1
        m2: mean number of words per sentence for class 2
        sd2: standard deviation of words per sentence for class 2
        testdocs: a list of sentences to evaluate as more likely to be part
           of class 1 or class 2
    returns:
        the number of items that are more likely to be in class 2
    '''
    # YOUR CODE HERE
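    test_lengths = [len(tokenize(doc)) for doc in testdocs]
    # Compare the two normal densities at each length. The shared
    # 1/sqrt(2*pi) factor is left out because it cancels in the comparison.
    class2_count = sum(1 for length in test_lengths
                      if (np.exp(-((length - m2)**2)/(2*sd2**2))/sd2) > 
                         (np.exp(-((length - m1)**2)/(2*sd1**2))/sd1))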
    
    return class2_count

In [31]:
# test 7a, 1 pt
# I get that 699 out of 1000 items in test set 3 are closer to the
#  distribution given as m2 & sd2 (which is the training set 3 mean & sd)
assert isclose(
    classByNormalDistribution(t7smean,t7ssd,t3smean,t3ssd,test3),
    699,
    abs_tol=1
)

In [32]:
# test 7b, 1 pt
# I get that 395 out of 1000 items in test set 7 are closer to the
#  distribution given as m2 & sd2 (which is the training set 7 mean & sd)
assert isclose(
    classByNormalDistribution(t3smean,t3ssd,t7smean,t7ssd,test7),
    395,
    abs_tol=1
)

**8.** Summarize the performance of the two systems and explain why you think you get the results you do. Is the feature of **mean sentence length** a useful one in this data set for classifying these two classes? (2 points)

(*Hint*: How large were your test sets? What would an ideal result be? How do your actual results compare to this ideal? What kinds of issues were discussed in the lecture videos and the in-class notebook for the Blogger corpus? The table after question 5 may be helpful to you in evaluating why you got the results in question 6 and 7 that you did.)

Each test set contained 1000 sentences, so the ideal result would be 1000 out of 1000: every sentence from category 3 labeled as category 3, and every sentence from category 7 labeled as category 7. The actual results fall well short of that ideal. The mean-length model labeled 512 of the category 3 sentences and 588 of the category 7 sentences correctly, and the normal-distribution model labeled 699 and 395 correctly. The table after question 5 suggests why: there is a measurable difference between the mean sentence lengths (about 24.7 words versus 22.9), but it is small compared with the standard deviations (about 12.0 and 14.1), so the two length distributions overlap heavily. Sentence length therefore carries some signal, but it is a weak feature on its own for separating these two classes, and it cannot capture individual nuances in the text that might change the classification.
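To put the four test results side by side, the counts reported in tests 6a, 6b, 7a, and 7b can be converted to accuracies. This is a small sketch of that arithmetic; the dictionary layout and the `accuracy` helper are hypothetical, and the counts are the ones given in the test comments above.

```python
TEST_SIZE = 1000  # sentences per test set, as defined in question 3

# Correct counts reported in the tests above: (category 3, category 7)
results = {
    "mean length": (512, 588),
    "normal distribution": (699, 395),
}

def accuracy(correct, total):
    """Fraction of test items that were assigned to their true class."""
    return correct / total

summary = {}
for model, (right3, right7) in results.items():
    summary[model] = {
        "cat3": accuracy(right3, TEST_SIZE),
        "cat7": accuracy(right7, TEST_SIZE),
        "overall": accuracy(right3 + right7, 2 * TEST_SIZE),
    }

for model, scores in summary.items():
    print(model, scores)
```

Taken over both test sets, the two models land at roughly the same overall accuracy (1100 and 1094 correct out of 2000). The normal-distribution model mostly shifts correct answers from category 7 to category 3 rather than improving the total.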