In [4]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs236299-2023-spring/lab1-2.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")



In [5]:
!pip install -q -r requirements.txt

In [6]:
# Initialize Otter
import otter
grader = otter.Notebook()

# Course 236299
## Lab 1-2 — Text classification and evaluation methodology

After this lab, you should be able to

* Understand the distinction between training and test corpora, and why both are needed;
* Understand the role of gold labels;
* Implement a majority class baseline as a benchmark to compare other methods;
* Implement nearest neighbor classification, and understand the role of distance metrics in its operation;
* Compare multiple methods for accuracy.

New bits of Python used for the first time in the _solution set_ for this lab, and which you may therefore find useful, include

* [`collections.Counter`](https://docs.python.org/3/library/collections.html#collections.Counter)
* [`collections.Counter.most_common`](https://docs.python.org/3/library/collections.html#collections.Counter.most_common)
* [`torch.float`](https://pytorch.org/docs/stable/tensors.html)
* [`torch.Tensor.type`](https://pytorch.org/docs/stable/generated/torch.Tensor.type.html?highlight=torch%20tensor%20type#torch.Tensor.type)

# Preparation – Loading packages and data

In [7]:
# Please do not change these imports because some hidden tests might depend on them.
# You can add a cell below if you need to import anything else.
import collections
import copy
import json
import math
from pprint import pprint
import torch
import wget

# The Federalist Papers

<img src="https://github.com/nlp-course/data/raw/master/Federalist/federalist.jpg" width=150 align=right />

The _Federalist_ papers are a collection of 85 essays written pseudonymously by Alexander Hamilton, John Jay, and James Madison following the Constitutional Convention of 1787, promoting the ratification of the nascent United States Constitution.

The authorship of many of the individual papers has been well established and acknowledged by the various authors, but a number of the papers have been contentious, with both Madison and Hamilton as possible authors. Determining the authorship of these disputed papers is a classic text classification problem, and one that has received great attention. The seminal work on the problem is that of [Mosteller and Wallace](http://www.historyofinformation.com/detail.php?entryid=4799), who applied then-novel statistical methods to the problem. In this lab, we'll use the _Federalist_ data to experiment with some of the ideas about distance metrics and classification methods that you've read about. (It's also an excuse to make some points about proper testing methodology.)

Mosteller and Wallace used the frequencies of various words in the papers as the raw data for determining authorship. We've provided access to a heavily pre-digested version of this data. (If you're interested, you can find the raw data – all 85 papers – and the notebook used to generate the pre-digested data in the [course `data` github repository](https://github.com/nlp-236299/data/tree/master/Federalist).)

Start by evaluating the cells below to load the data and view a sample.

In [8]:
# Retrieve the Federalist data
os.makedirs('data', exist_ok=True)
wget.download('https://github.com/nlp-236299/data/raw/master/Federalist/federalist_data.json', out='data/')
# Read the json data into a data structure
with open('data/federalist_data.json', 'r') as fin:
    dataset = json.load(fin)
# Convert counts to tensors of floats
for example in dataset:
    example['counts'] = torch.tensor(example['counts']).type(torch.float)

In [9]:
# View a sample of the data
print(f"Number of papers in the dataset: {len(dataset)}")
print("Some examples:")
pprint(dataset[:3])

Number of papers in the dataset: 85
Some examples:
[{'authors': 'Hamilton',
  'counts': tensor([9., 6., 2., 0.]),
  'number': '1',
  'title': 'General Introduction'},
 {'authors': 'Jay',
  'counts': tensor([8., 1., 0., 0.]),
  'number': '2',
  'title': 'Concerning Dangers from Foreign Force and Influence'},
 {'authors': 'Jay',
  'counts': tensor([6., 0., 1., 0.]),
  'number': '3',
  'title': 'The Same Subject Continued: Concerning Dangers from Foreign Force '
           'and Influence'}]


You'll see above that the dataset is a list of *examples*, one for each paper, each a dictionary providing the paper number, its title and author(s), and the raw counts for a few important words in the papers. From the last lab, you'll recognize the `counts` field as a bag-of-words representation of the document. The `counts` field is the document representation that we want to classify, and the `authors` field contains the pertinent class label for each example.

For your reference, here are the words that were used to derive the counts:

In [10]:
keywords = ['on', 'upon', 'there', 'whilst']

Thus in the first example paper, *Federalist 1*, there were 9 tokens of "on", 6 of "upon", 2 of "there", and none of "whilst". 
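
For intuition, counts of this kind could be produced along the following lines. This is only a sketch: the tokenization (lowercasing and taking runs of letters), the helper name `keyword_counts`, and the sample sentence are assumptions made for illustration, not the procedure used to build the provided dataset.

```python
import re
from collections import Counter

keywords = ['on', 'upon', 'there', 'whilst']

def keyword_counts(text, keywords):
    """Counts the tokens of each keyword in `text` (hypothetical helper)."""
    # Assumed tokenization: lowercase, then take maximal runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    token_counts = Counter(tokens)
    # A Counter returns 0 for keywords that never occur.
    return [float(token_counts[word]) for word in keywords]

# Made-up sentence, not taken from the papers.
sample = "Upon reflection, there is no doubt on this point; on that, there is."
print(keyword_counts(sample, keywords))  # [2.0, 1.0, 2.0, 0.0]
```

Wrapping the returned list in `torch.tensor(...)` would give a vector of the same shape as the `counts` field.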

The `authors` field takes on various values. Here's a table of the frequency of each of the values. (This will come in handy later.)

In [11]:
# Generate a table of the number of papers by each author label
cnt = collections.Counter(map(lambda ex: ex['authors'],
                              dataset))
for author, count in cnt.items():
    print(f"{count:3d} ({100.0*count/len(dataset):.3f}%) {author}")

 51 (60.000%) Hamilton
  5 (5.882%) Jay
 15 (17.647%) Madison
  3 (3.529%) Hamilton and Madison
 11 (12.941%) Hamilton or Madison


As you can see, some of the papers are of known authorship by one of Madison or Hamilton. We can use these as training data.

In [12]:
# Extract the papers by either of Madison and Hamilton
training = list(filter(lambda ex: ex['authors'] in ['Madison', 'Hamilton'],
                       dataset))

In [13]:
# View a sample of the training data
print(f"Number of papers in the dataset: {len(training)}")
print("Some examples:")
pprint(training[:3])

Number of papers in the dataset: 66
Some examples:
[{'authors': 'Hamilton',
  'counts': tensor([9., 6., 2., 0.]),
  'number': '1',
  'title': 'General Introduction'},
 {'authors': 'Hamilton',
  'counts': tensor([2., 4., 7., 0.]),
  'number': '6',
  'title': 'Concerning Dangers from Dissensions Between the States'},
 {'authors': 'Hamilton',
  'counts': tensor([13., 11.,  9.,  0.]),
  'number': '7',
  'title': 'The Same Subject Continued: Concerning Dangers from Dissensions '
           'Between the States'}]


Others of the papers are of ambiguous authorship. They are shown as having `'Hamilton or Madison'` as author. These will be the elements that we want to test our models on.

In [15]:
# Extract the papers of unknown authorship
testing = list(filter(lambda ex: ex['authors'] == 'Hamilton or Madison',
                      dataset))

In [16]:
# View a sample of the data
print(f"Number of papers in the dataset: {len(testing)}")
print("Some sample elements:")
pprint(testing[:3])

Number of papers in the dataset: 11
Some sample elements:
[{'authors': 'Hamilton or Madison',
  'counts': tensor([16.,  0.,  2.,  1.]),
  'number': '49',
  'title': 'Method of Guarding Against the Encroachments of Any One Department '
           'of Government by Appealing to the People Through a Convention'},
 {'authors': 'Hamilton or Madison',
  'counts': tensor([11.,  1.,  0.,  0.]),
  'number': '50',
  'title': 'Periodic Appeals to the People Considered'},
 {'authors': 'Hamilton or Madison',
  'counts': tensor([21.,  0.,  2.,  2.]),
  'number': '51',
  'title': 'The Structure of the Government Must Furnish the Proper Checks and '
           'Balances Between the Different Departments'}]


# Models for text classification

We can think of a _model_ for a text classification problem as a function taking a test example and returning a class label for the test example. Generating the model will rely on a corpus of training data.

With a model in hand, we can evaluate its _accuracy_ on a test corpus by computing the proportion of test examples that the model classifies correctly, that is, for which the label the model assigns matches the author recorded in the test example. Define a function `accuracy` that takes a test corpus (like `testing`) and a model (which is a function, remember), and returns the accuracy of the model on that corpus.
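
In symbols (the notation here is introduced only for illustration), write the test corpus as $N$ examples $x_1, \dots, x_N$ with recorded labels $y_1, \dots, y_N$, and the model as a function $f$. Then

$$
\mathrm{accuracy}(f) = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[ f(x_i) = y_i \right]
$$

where $\mathbb{1}[\cdot]$ is 1 when the condition holds and 0 otherwise. For instance, a model that labels 3 of 4 test examples correctly has accuracy $3/4 = 0.75$.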

<!--
BEGIN QUESTION
name: accuracy
-->

In [17]:
#TODO -- Define the `accuracy` function.
def accuracy(test_corpus, model):
    """Computes the accuracy of a model on a corpus.
    Arguments:
      `test_corpus`: a list of test examples, such as `testing`
      `model`: a function whose input is an example from the corpus (such as
              `testing[0]`), and whose output is the predicted author
    Returns:
      accuracy, a float number.
    """
    true_cnt = 0
    for elem in test_corpus:
      if elem['authors'] == model(elem):
        true_cnt += 1
    return true_cnt / len(test_corpus)

## Majority class classification

An especially simple classification model labels each test example with whichever label happens to occur most frequently in the training data. It completely ignores the test example that it classifies!

By examination of the table provided above, what is the majority class label for the training dataset?

<!--
BEGIN QUESTION
name: maj_class_label
-->

In [18]:
#TODO -- Set this variable to the majority class label for the training set.
maj_class_label = 'Hamilton'

In [19]:
grader.check("maj_class_label")

Rather than determining the majority class by inspection, it's better to have a function to compute it for us. Define a function `majority_class_label` that returns the majority class label for a training set.
<!--
BEGIN QUESTION
name: majority_class_label
-->

In [43]:
#TODO -- Define the `majority_class_label` function.
def majority_class_label(training):
    """Find the majority class label for a training set.
    Arguments:
      `training`: a list of training examples, such as `training`
    Returns:
      the majority class label, a string.
    """
    cnt_dict = {}
    Y = [t['authors'] for t in training]
    for label in Y:
      if label in cnt_dict.keys():
        cnt_dict[label] += 1
      else:
        cnt_dict[label] = 1
    
    m = max(cnt_dict.values())
    for label in cnt_dict.keys():
      if cnt_dict[label] == m:
        return label
      

In [44]:
grader.check("majority_class_label")
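
The same computation can be written more compactly with `collections.Counter`, which the introduction to this lab lists as potentially useful. The following is a sketch of one alternative rather than the reference solution; the function name `majority_label_with_counter` and the toy corpus are hypothetical.

```python
from collections import Counter

def majority_label_with_counter(corpus):
    """Returns the most frequent `authors` label in `corpus` (hypothetical helper)."""
    label_counts = Counter(example['authors'] for example in corpus)
    # most_common(1) returns a one-element list of (label, count) pairs.
    label, _count = label_counts.most_common(1)[0]
    return label

# Toy corpus with made-up labels, only to exercise the function.
toy_corpus = [
    {'authors': 'Hamilton'},
    {'authors': 'Madison'},
    {'authors': 'Hamilton'},
]
print(majority_label_with_counter(toy_corpus))  # Hamilton
```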

What proportion of the *training* examples do you think would be classified correctly by the majority class model?
<!--
BEGIN QUESTION
name: maj_class_accuracy_guess
-->

In [24]:
#TODO -- Define this variable to be what you think the 
#        accuracy of the majority class model would be
#        on the training data.
maj_class_accuracy_guess = 0.7727272727272727

In [25]:
grader.check("maj_class_accuracy_guess")

Now define a function `majority_class` that takes a single argument (a test example) and returns the particular class label that is most frequent in the training data `training` (regardless of what the test example is).
<!--
BEGIN QUESTION
name: majority_class
-->

In [26]:
#TODO - Define the `majority_class` model.
def majority_class(example):
    """Defines a majority class model.
    Arguments:
      `example`: an example, such as `testing[0]`
    Returns:
      the majority class in the *training* set, a string.
    """
    # Compute the label from the training data rather than hard-coding it.
    return majority_class_label(training)

In [27]:
grader.check("majority_class")

Now we can see how well this majority class model works by trying it out on some examples. Use the `accuracy` function to determine the model's accuracy when applied to the task of labeling the _training_ data.
<!--
BEGIN QUESTION
name: accuracy_maj_class_train
-->

In [28]:
#TODO -- Define `accuracy_maj_class_train` to be the accuracy of the majority 
#        class model on the training data.
accuracy_maj_class_train = accuracy(training, majority_class)

In [29]:
grader.check("accuracy_maj_class_train")

In [30]:
print(f"Accuracy of the majority class model on training data: "
      f"{accuracy_maj_class_train:.3f}")

Accuracy of the majority class model on training data: 0.773


Was your guess from above right?

## Nearest neighbor classification

Recall that nearest neighbor classification classifies a test example with the label of the nearest training example. To calculate nearest neighbors, we need a distance metric between the representations of the documents. We will use two such metrics, familiar from the previous lab, for Euclidean distance and cosine distance.

> Note: In order to allow full use of `torch` operations, these functions assume that the vectors are provided as tensors of type `float`. (That's why we tensorified the `counts` data as we loaded the dataset at the top of this notebook.) When you call them, you'll want to make sure of this. They also return singleton tensors, not floats.

Just like in lab 1-1, define a function `euclidean_distance` to compute the Euclidean distance between two vectors, and a function `cosine_distance` to compute the cosine distance between two vectors.
<!--
BEGIN QUESTION
name: euclidean_distance
-->

In [45]:
#TODO
def euclidean_distance(v1, v2):
    return torch.linalg.norm(v1 - v2)

In [46]:
grader.check("euclidean_distance")

In [33]:
def safe_acos(x):
    """Returns the arc cosine of `x`. Unlike `math.acos`, it 
       does not raise an exception for values of `x` out of range, 
       but rather clips `x` at -1..1, thereby avoiding math domain
       errors in the case of numerical errors."""
    return math.acos(math.copysign(min(1.0, abs(x)), x))
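
To see why the clipping matters, consider a cosine value that rounding error has pushed just above 1. The sketch below uses a made-up value, `1.0000001`, and a hypothetical helper `clipped_acos` that clamps its argument with `max` and `min`, which should behave like the `copysign` formulation for ordinary finite inputs.

```python
import math

def clipped_acos(x):
    """Arc cosine of `x` after clamping `x` to [-1, 1] (hypothetical helper)."""
    return math.acos(max(-1.0, min(1.0, x)))

# Made-up value standing in for a cosine that rounding pushed past 1.
slightly_over = 1.0000001

try:
    math.acos(slightly_over)
    raised = False
except ValueError:
    # math.acos rejects arguments outside [-1, 1] with a math domain error.
    raised = True

print(raised)                       # True
print(clipped_acos(slightly_over))  # 0.0
```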

In [62]:
#TODO
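def cosine_distance(v1, v2):
    """Returns the cosine distance between two vectors: the angle
       between them, divided by pi so that it lies in [0, 1]."""
    cosine_similarity = torch.dot(v1, v2) / (torch.linalg.norm(v1) * torch.linalg.norm(v2))
    return safe_acos(cosine_similarity) / math.pi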

In [59]:
cosine_distance(torch.tensor([0., 0., 1.]), torch.tensor([0., 1., 1.]))

0.2500000054476416

In [60]:
grader.check("cosine_distance")

Here's an example of the use of these distance metrics:

In [61]:
t1 = torch.tensor([1., 2.])
t2 = torch.tensor([3., 4.])

print("Testing on two different tensors\n"
      f"Euclidean: {euclidean_distance(t1, t2)}\n"
      f"Cosine   : {cosine_distance(t1, t2)}\n\n"
      "Testing on two identical tensors\n"
      f"Euclidean: {euclidean_distance(t1, t1)}\n"
      f"Cosine   : {cosine_distance(t1, t1)}")

Testing on two different tensors
Euclidean: 2.8284270763397217
Cosine   : 0.05724914679911019

Testing on two identical tensors
Euclidean: 0.0
Cosine   : 0.0
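
The figures above can be checked by hand. The sketch below redoes the arithmetic for `[1, 2]` and `[3, 4]` using only the standard `math` module, assuming (as in the `cosine_distance` cell above) that cosine distance means the angle between the vectors divided by π. Because it works in double precision rather than `torch`'s default single precision, the trailing digits may differ slightly from the printed output.

```python
import math

v1 = [1.0, 2.0]
v2 = [3.0, 4.0]

# Euclidean distance: square root of the summed squared differences.
euclidean = math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

# Cosine similarity: dot product divided by the product of the norms.
dot = sum(a * b for a, b in zip(v1, v2))
norm1 = math.sqrt(sum(a * a for a in v1))
norm2 = math.sqrt(sum(b * b for b in v2))
similarity = dot / (norm1 * norm2)

# Angle between the vectors, rescaled from [0, pi] to [0, 1].
cosine = math.acos(similarity) / math.pi

print(f"Euclidean: {euclidean:.4f}")  # Euclidean: 2.8284
print(f"Cosine   : {cosine:.4f}")     # Cosine   : 0.0572
```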


### Generating nearest neighbor models

To specify a nearest neighbor model, we need both a training corpus (like `training`) and a distance metric (like `euclidean_distance` or `cosine_distance` defined just above). 

Define a function called `define_nearest_neighbor` that takes a training corpus and a distance metric and returns a model -- that is, a function that classifies a single test example. The model should return the class _label_ of that training example whose _counts vector_ is closest to that of the test example according to the metric.

<!--
BEGIN QUESTION
name: define_nearest_neighbor
-->

In [63]:
#TODO -- Define this function that generates nearest neighbor models.
def define_nearest_neighbor(corpus, metric):
    """Generates a nearest neighbor model from a training corpus and a
    distance metric.
    Arguments:
      `corpus`: a training corpus, such as `training`
      `metric`: a metric function which takes two tensors as input and 
                returns their distance, such as `euclidean_distance`
    Returns:
      a model, which is a function that takes in a test example (such as 
      `testing[0]`) and returns the author of the nearest example in the 
      training set, where distances are measured on the counts vector 
      using `metric`.
    """
    def nn_model(example):
      dist = math.inf
      current_label = ''
      for elem in corpus:
        tmp_dist = metric(elem['counts'], example['counts'])
        if tmp_dist < dist:
          dist = tmp_dist
          current_label = elem["authors"]
      return current_label
    return nn_model
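
The search loop can also be expressed with the built-in `min` and a `key` function. The following is a sketch of that alternative, not the reference solution: the name `define_nearest_neighbor_min` is hypothetical, the toy counts are made up, and `math.dist` (Python 3.8+) on plain lists stands in for `euclidean_distance` on tensors so that the example runs without `torch`.

```python
import math

def define_nearest_neighbor_min(corpus, metric):
    """Builds a nearest neighbor model using `min` (hypothetical variant)."""
    def nn_model(example):
        # `min` returns the first training example with the smallest
        # distance, matching the strict `<` comparison in the loop version.
        nearest = min(
            corpus,
            key=lambda train_ex: metric(train_ex['counts'], example['counts']),
        )
        return nearest['authors']
    return nn_model

# Made-up training examples, not taken from the dataset.
toy_training = [
    {'authors': 'Hamilton', 'counts': [10.0, 5.0, 2.0, 0.0]},
    {'authors': 'Madison', 'counts': [15.0, 0.0, 3.0, 1.0]},
]
toy_model = define_nearest_neighbor_min(toy_training, math.dist)
print(toy_model({'counts': [14.0, 1.0, 2.0, 1.0]}))  # Madison
```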


We can use the `define_nearest_neighbor` function to define two new models for nearest neighbor classification, one using Euclidean distance and one using cosine distance.

In [64]:
nearest_neighbor_euclidean_model = \
    define_nearest_neighbor(training, euclidean_distance)

nearest_neighbor_cosine_model = \
    define_nearest_neighbor(training, cosine_distance)

### Testing the nearest neighbor models on the training data

How accurate are these models when used to label the training data (as we did for the majority class model above)? Use the `accuracy` function above to calculate the accuracy of `nearest_neighbor_euclidean_model` in labeling the _training_ data (not the test data), and similarly for `nearest_neighbor_cosine_model`.
<!--
BEGIN QUESTION
name: accuracy_train
-->

In [65]:
#TODO - Define the variable to be the calculated accuracy.
accuracy_nn_euclidean_train = accuracy(training, nearest_neighbor_euclidean_model)
accuracy_nn_cosine_train = accuracy(training, nearest_neighbor_cosine_model)

In [66]:
grader.check("accuracy_train")

In [67]:
print(f"Accuracy of the nearest neighbor euclidean model tested on training data: "
      f"{accuracy_nn_euclidean_train:.3f}")
print(f"Accuracy of the nearest neighbor cosine model tested on training data: "
      f"{accuracy_nn_cosine_train:.3f}")

Accuracy of the nearest neighbor euclidean model tested on training data: 1.000
Accuracy of the nearest neighbor cosine model tested on training data: 1.000


<!-- BEGIN QUESTION -->

**Question:** Does the performance of these classifiers on the training data seem to you to be representative of how good a classifier each is? Why or why not?
<!--
BEGIN QUESTION
name: open_response_1
manual: true
-->

_No. A 1-nearest-neighbor model overfits the training data: each training example is closest to itself, so the model fits the training set perfectly. On unseen data it may perform much worse, because a perfect fit to the training set says nothing about how well the model generalizes._

<!-- END QUESTION -->

### Inspecting the nearest neighbor predictions on the testing data

To get a better sense of how the nearest neighbor models perform, let's try them out on the testing data that we have. (Recall that the testing data in `testing` were the ambiguously-authored Federalist papers, where the `authors` field was `'Hamilton or Madison'`.)

We start by looking in detail at the predictions generated by the two nearest neighbor models. Print out a table that lists, for each `testing` example, the paper number and the authors predicted under the nearest neighbor Euclidean model and the nearest neighbor cosine model. It might look something like
```
49 Madison  Madison 
50 Hamilton Madison 
51 Madison  Madison
...
```

<!--
BEGIN QUESTION
name: print_table
-->

In [69]:
#TODO - Print out the requested table.
for elem in testing:
  print(f"{elem['number']} {nearest_neighbor_euclidean_model(elem)} {nearest_neighbor_cosine_model(elem)}")

49 Madison Madison
50 Hamilton Madison
51 Madison Madison
52 Madison Madison
53 Madison Madison
54 Madison Madison
55 Madison Hamilton
56 Madison Madison
57 Madison Madison
62 Madison Madison
63 Madison Madison


<!-- BEGIN QUESTION -->

What do you notice about the two models?
<!--
BEGIN QUESTION
name: open_response_2
manual: true
-->

_The two models predicted mostly 'Madison' on the test set, even though the majority of the training set is 'Hamilton'._

<!-- END QUESTION -->

### Testing the nearest neighbor models on the testing data

Now use the `accuracy` function to calculate the accuracy of the two nearest neighbor models as you did above, but this time calculating accuracy on the *testing* corpus rather than the training corpus. (Expect to find a surprising result. Read ahead for an explanation if you're confused.)
<!--
BEGIN QUESTION
name: accuracy_test
-->

In [70]:
#TODO -- Define the variables to be, respectively, the calculated accuracy of the nearest 
#        neighbor Euclidean model and cosine model on the testing data.
accuracy_nn_euclidean_test = accuracy(testing, nearest_neighbor_euclidean_model)
accuracy_nn_cosine_test = accuracy(testing, nearest_neighbor_cosine_model)

In [71]:
grader.check("accuracy_test")

In [72]:
print(f"Accuracy of the nearest neighbor euclidean model tested on testing data: "
      f"{accuracy_nn_euclidean_test:.3f}")
print(f"Accuracy of the nearest neighbor cosine model tested on testing data: "
      f"{accuracy_nn_cosine_test:.3f}")

Accuracy of the nearest neighbor euclidean model tested on testing data: 0.000
Accuracy of the nearest neighbor cosine model tested on testing data: 0.000


<!-- BEGIN QUESTION -->

**Question:** Does the performance of these classifiers on the testing data seem to you to be representative of how good a classifier each is? Why or why not?
<!--
BEGIN QUESTION
name: open_response_3
manual: true
-->

_No, because the test set carries the label 'Hamilton or Madison', which does not occur in the training set, so the models were never trained on it and cannot predict it._

<!-- END QUESTION -->

### The importance of gold labels

In order to evaluate the accuracy of the nearest neighbor model – and any model – we need to have the correct labels for the testing corpus, the so-called _gold_ labels. What shall we use for gold labels? Mosteller and Wallace's much more extensive analysis concluded that all of the papers of ambiguous origin were penned by Madison, so we'll use that. We should use a version of the `testing` corpus with the gold labels. 

Write some code to generate a version of the testing corpus with the gold labels.

> Hint: In defining `testing_gold`, you'll want to be careful not to change `testing`. Otherwise, some unit tests that use `testing` may fail. The `copy.deepcopy` function may be useful.
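
The reason for the hint can be seen in a small sketch. Copying a list with `list(...)` creates a new list that still holds the same example dictionaries, so relabeling through such a shallow copy would also relabel the original, whereas `copy.deepcopy` duplicates the dictionaries as well. The one-element corpus and the variable names below are made up for illustration.

```python
import copy

original = [{'number': '49', 'authors': 'Hamilton or Madison'}]

# A deep copy duplicates the dictionaries, so relabeling it leaves
# the original untouched.
deep = copy.deepcopy(original)
deep[0]['authors'] = 'Madison'
label_after_deep_edit = original[0]['authors']

# A shallow copy is a new list holding the *same* dictionaries, so
# relabeling through it also changes the original.
shallow = list(original)
shallow[0]['authors'] = 'Madison'
label_after_shallow_edit = original[0]['authors']

print(label_after_deep_edit)     # Hamilton or Madison
print(label_after_shallow_edit)  # Madison
```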

<!--
BEGIN QUESTION
name: get_gold
-->

In [76]:
#TODO - Write code that defines `testing_gold`, which is the same
# as `testing` except that it has the correct gold labels.
# Note: be careful to not change `testing`.

testing_gold = copy.deepcopy(testing)

for elem in testing_gold:
  elem['authors'] = 'Madison'

In [77]:
grader.check("get_gold")

Now we can rerun the accuracy calculations.

In [78]:
accuracy_nn_euclidean_test_with_gold = accuracy(testing_gold, nearest_neighbor_euclidean_model)
accuracy_nn_cosine_test_with_gold = accuracy(testing_gold, nearest_neighbor_cosine_model)

In [79]:
print(f"Accuracy of the nearest neighbor euclidean model tested on testing data: "
      f"{accuracy_nn_euclidean_test_with_gold:.3f}")
print(f"Accuracy of the nearest neighbor cosine model tested on testing data: "
      f"{accuracy_nn_cosine_test_with_gold:.3f}")

Accuracy of the nearest neighbor euclidean model tested on testing data: 0.909
Accuracy of the nearest neighbor cosine model tested on testing data: 0.909


Do these results make more sense?

<!-- BEGIN QUESTION -->

# Lab debrief

**Question:** We're interested in any thoughts you have about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_The labs were too long: we didn't take a break in class and still had to complete around half the work at home. *But* other than that, everything was great <(^.^)>._

<!-- END QUESTION -->



# End of lab 1-2

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [80]:
grader.check_all()