In [83]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs236299-2020/lab1-2.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [84]:
# Initialize Otter
import otter
grader = otter.Notebook()

# Course 236299
## Lab 1-2 — Text classification and evaluation methodology

After this lab, you should be able to

* Understand the distinction between training and test corpora, and why both are needed;
* Understand the role of gold labels;
* Implement a majority class baseline as a benchmark to compare other methods;
* Implement nearest neighbor classification, and understand the role of distance metrics in its operation;
* Compare multiple methods for accuracy.

New bits of Python used for the first time in the _solution set_ for this lab, and which you may therefore find useful, include

* [`collections.Counter`](https://docs.python.org/3/library/collections.html#collections.Counter)
* [`collections.Counter.most_common`](https://docs.python.org/3/library/collections.html#collections.Counter.most_common)
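
As a quick orientation, here is a minimal sketch of how these two pieces fit together. The list `toy_labels` is made up for illustration and is not the lab data.

```python
from collections import Counter

# Hypothetical list of labels, for illustration only
toy_labels = ['Hamilton', 'Madison', 'Hamilton', 'Jay', 'Hamilton', 'Madison']

# A Counter maps each distinct item to the number of times it occurs
label_counts = Counter(toy_labels)

# most_common(1) returns a list holding the single most frequent (item, count) pair
top_label, top_count = label_counts.most_common(1)[0]
print(top_label, top_count)
```

Note that `most_common` returns a list of pairs, so the label itself sits at index `[0][0]`.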

## Preparation – Loading packages and data

In [85]:
# Please do not change these imports because some hidden tests might depend on them.
# You can add a cell below if you need to import anything else.
import collections
import json
import math
import numpy as np
from pprint import pprint
import re

## The Federalist Papers

<img src="https://github.com/nlp-236299/data/raw/master/Federalist/federalist.jpg" width=150 align=right />

The _Federalist_ papers are a collection of 85 essays written pseudonymously by Alexander Hamilton, John Jay, and James Madison following the Constitutional Convention of 1787, promoting the ratification of the nascent United States Constitution.

The authorship of many of the individual papers has been well established and acknowledged by the various authors, but a number of the papers have been contentious, with both Madison and Hamilton as possible authors. Determining the authorship of these disputed papers is a classic text classification problem, and one that has received great attention. The seminal work on the problem is that of [Mosteller and Wallace](http://www.historyofinformation.com/detail.php?entryid=4799), who applied then-novel statistical methods to the problem. In this lab, we'll use the _Federalist_ data to experiment with some of the ideas about distance metrics and classification methods that you've read about.

Mosteller and Wallace used the frequencies of various words in the papers as the raw data for determining authorship. We've provided access to a heavily pre-digested version of this data. (If you're interested, you can find the raw data – all 85 papers – and the notebook used to generate the pre-digested data in the [course `data` github repository](https://github.com/nlp-236299/data).)

Start by evaluating the cells below to load the data and view a sample.

In [86]:
# Read the Federalist data from the json file
shell('wget -nv -N -P data https://github.com/nlp-236299/data/raw/master/Federalist/federalist_data.json')




In [87]:
with open('data/federalist_data.json', 'r') as fin:
    dataset = json.load(fin)

In [88]:
# View a sample of the data
print(f"Number of papers in the dataset: {len(dataset)}")
print("Some examples:")
pprint(dataset[:3])

Number of papers in the dataset: 85
Some examples:
[{'authors': 'Hamilton',
  'counts': [9, 6, 2, 0],
  'number': '1',
  'title': 'General Introduction'},
 {'authors': 'Jay',
  'counts': [8, 1, 0, 0],
  'number': '2',
  'title': 'Concerning Dangers from Foreign Force and Influence'},
 {'authors': 'Jay',
  'counts': [6, 0, 1, 0],
  'number': '3',
  'title': 'The Same Subject Continued: Concerning Dangers from Foreign Force '
           'and Influence'}]


You'll see above that the dataset is a list of _examples_, one for each paper. Each example is a dictionary providing the paper number, its title and author(s), and the raw counts for a few important words in the paper. (From the last lab, you'll recognize the `counts` field as a bag-of-words representation of the document.) The `counts` field is the document representation that we want to classify, and the `authors` field contains the class label for each example.

For your reference, here are the words that were used to derive the counts:

In [89]:
keywords = ['on', 'upon', 'there', 'whilst']

Thus in the first example paper, _Federalist 1_, there were 9 tokens of "on", 6 of "upon", 2 of "there", and none of "whilst". 
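
To make the representation concrete, here is a rough sketch of how such a counts vector could be derived from raw text. The helper `keyword_counts`, its tokenization rule, and the sample sentence are assumptions made for illustration; the notebook that actually produced the course data may tokenize differently.

```python
import re
from collections import Counter

keywords = ['on', 'upon', 'there', 'whilst']

def keyword_counts(text, keywords):
    """Count how often each keyword occurs as a whole token in `text`.

    Hypothetical helper: lowercases the text and treats every maximal
    run of letters as one token.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    token_counts = Counter(tokens)
    # A Counter returns 0 for keywords that never occur
    return [token_counts[word] for word in keywords]

sample = "Upon reflection, there is little to add on this point; there it rests."
print(keyword_counts(sample, keywords))
```

Counting whole tokens matters here: "reflection" contains the letters "on" but does not count as a token of "on".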

The `authors` field takes on various values. Here's a table of the frequency of each of the values. (This will come in handy later.)

In [90]:
# Generate a table of the number of papers by each author label
cnt = collections.Counter(map(lambda ex: ex['authors'],
                              dataset))
for author, count in cnt.items():
    print(f"{count:3d} ({count/len(dataset):.3f}) {author}")

 51 (0.600) Hamilton
  5 (0.059) Jay
 15 (0.176) Madison
  3 (0.035) Hamilton and Madison
 11 (0.129) Hamilton or Madison


As you can see, some of the papers are of known authorship by Madison or Hamilton. We can use these as training data.

In [91]:
# Extract the papers by either Madison or Hamilton
training = list(filter(lambda ex: ex['authors'] in ['Madison', 'Hamilton'],
                       dataset))

In [92]:
# View a sample of the training data
print(f"Number of papers in the training set: {len(training)}")
print("Some examples:")
pprint(training[:3])

Number of papers in the training set: 66
Some examples:
[{'authors': 'Hamilton',
  'counts': [9, 6, 2, 0],
  'number': '1',
  'title': 'General Introduction'},
 {'authors': 'Hamilton',
  'counts': [2, 4, 7, 0],
  'number': '6',
  'title': 'Concerning Dangers from Dissensions Between the States'},
 {'authors': 'Hamilton',
  'counts': [13, 11, 9, 0],
  'number': '7',
  'title': 'The Same Subject Continued: Concerning Dangers from Dissensions '
           'Between the States'}]


Others of the papers are of ambiguous authorship. They are shown as having `'Hamilton or Madison'` as author. These will be the elements that we want to test our models on.

In [93]:
# Extract the papers of unknown authorship
testing = list(filter(lambda ex: ex['authors'] == 'Hamilton or Madison',
                      dataset))

In [94]:
# View a sample of the data
print(f"Number of papers in the testing set: {len(testing)}")
print("Some sample elements:")
pprint(testing[:3])

Number of papers in the testing set: 11
Some sample elements:
[{'authors': 'Hamilton or Madison',
  'counts': [16, 0, 2, 1],
  'number': '49',
  'title': 'Method of Guarding Against the Encroachments of Any One Department '
           'of Government by Appealing to the People Through a Convention'},
 {'authors': 'Hamilton or Madison',
  'counts': [11, 1, 0, 0],
  'number': '50',
  'title': 'Periodic Appeals to the People Considered'},
 {'authors': 'Hamilton or Madison',
  'counts': [21, 0, 2, 2],
  'number': '51',
  'title': 'The Structure of the Government Must Furnish the Proper Checks and '
           'Balances Between the Different Departments'}]


## Models for text classification

We can think of a _model_ for a text classification problem as a function taking a test example and returning a class label for the test example. Generating the model will rely on a corpus of training data.
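
As an illustration of "a model is a function", here is a sketch of a hand-written model. The rule in `upon_rule_model` is made up for this example: it is not learned from any training data and is not claimed to be accurate. It only shows the expected shape, an example dictionary in and a class label out.

```python
def upon_rule_model(example):
    """Hypothetical model: guess 'Hamilton' whenever 'upon' occurs at all."""
    # Index 1 of `counts` holds the count for 'upon' in the keyword list
    upon_count = example['counts'][1]
    return 'Hamilton' if upon_count > 0 else 'Madison'

# The first example from the dataset, as printed earlier
example = {'authors': 'Hamilton',
           'counts': [9, 6, 2, 0],
           'number': '1',
           'title': 'General Introduction'}

print(upon_rule_model(example))
```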

With a model in hand, we can evaluate its _accuracy_ on a test corpus by computing the proportion of test examples that the model correctly classifies. Define a higher-order function `accuracy` that takes a test corpus (like `testing`) and a model (which is a function, remember), and returns the accuracy of the model on that corpus. 

<!--
BEGIN QUESTION
name: accuracy
-->

In [95]:
# TODO
def accuracy(test_corpus, model):
    correct = len([ex for ex in test_corpus if model(ex) == ex['authors']])
    return correct / len(test_corpus)
    

### Majority class classification

An especially simple classification model labels each test example with whichever label happens to occur most frequently in the training data. It completely ignores the test example that it classifies!

By examination of the table provided above, what is the majority class label for this dataset?

<!--
BEGIN QUESTION
name: maj_class_label
-->

In [96]:
maj_class_label = 'Hamilton' if cnt['Hamilton'] > cnt['Madison'] else 'Madison'


In [97]:
grader.check("maj_class_label")

Define a function `majority_class_label` that returns the majority class label for a training set.
<!--
BEGIN QUESTION
name: majority_class_label
-->

In [98]:
#TODO
def majority_class_label(training):
    # Count each author label and return the most frequent one
    label_counts = collections.Counter(ex['authors'] for ex in training)
    return label_counts.most_common(1)[0][0]

In [99]:
grader.check("majority_class_label")

What proportion of the **training** examples would be classified correctly by the majority class model?
<!--
BEGIN QUESTION
name: maj_class_accuracy_guess
-->

In [100]:
#TODO
maj_class_accuracy_guess = cnt['Hamilton'] / (cnt['Hamilton'] + cnt['Madison'])

In [101]:
grader.check("maj_class_accuracy_guess")

Now define a function `majority_class` that takes a single argument, a test example, and returns the class label that is most frequent in the training data `training` (regardless of what the test example is).
<!--
BEGIN QUESTION
name: majority_class
-->

In [102]:
#TODO - define the `majority_class` model
def majority_class(example):
    return majority_class_label(training)

In [103]:
grader.check("majority_class")

Now we can see how well this majority class model works by trying it out on some examples. Use the `accuracy` function to determine the model's accuracy when applied to the task of labeling the _training_ data.
<!--
BEGIN QUESTION
name: accuracy_maj_class_train
-->

In [104]:
#TODO - define `accuracy_maj_class_train` to be the accuracy of the majority class model on the training data
accuracy_maj_class_train = accuracy(training, majority_class)

In [105]:
grader.check("accuracy_maj_class_train")

In [106]:
print(f"Accuracy of the majority class model on training data: {accuracy_maj_class_train:.3f}")

Accuracy of the majority class model on training data: 0.773


Was your estimate from above right?
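
One way to check the estimate by hand: the majority class model is right exactly on the examples that carry the majority label, so its accuracy on the training data is the fraction of training examples with that label. A small sketch, using the counts for Hamilton and Madison from the author table above (the variable names are illustrative):

```python
hamilton_papers = 51   # from the author table above
madison_papers = 15    # from the author table above

# The majority class model is correct on every majority-label example
majority_fraction = (max(hamilton_papers, madison_papers)
                     / (hamilton_papers + madison_papers))
print(f"{majority_fraction:.3f}")
```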

### Nearest neighbor classification

Recall that nearest neighbor classification classifies a test example with the label of the nearest training example. To calculate nearest neighbors, we need a distance metric between the representations of the documents. Below we've provided two such metrics, familiar from the previous lab, for Euclidean distance and cosine distance.
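
Before turning to the metrics themselves, here is a sketch of the nearest neighbor idea on a toy corpus. The data and the helper `toy_nearest_label` are hypothetical and unrelated to the Federalist data; the sketch uses the standard library's `math.dist` (Python 3.8+) for Euclidean distance.

```python
import math

# Hypothetical toy training corpus: (label, feature vector) pairs
toy_training = [('A', [1, 0]), ('A', [2, 2]), ('B', [5, 5])]

def toy_nearest_label(query, training_pairs):
    """Return the label of the training vector closest to `query`."""
    nearest_label, _ = min(training_pairs,
                           key=lambda pair: math.dist(query, pair[1]))
    return nearest_label

print(toy_nearest_label([2, 1], toy_training))   # closest to [2, 2]
print(toy_nearest_label([4, 6], toy_training))   # closest to [5, 5]
```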

In [107]:
def euclidean_distance(v1, v2):
    '''Returns the Euclidean distance between two vectors''' 
    return np.linalg.norm(np.array(v1) - np.array(v2))

In [108]:
def safe_acos(x):
    '''Returns the arc cosine of `x`. Unlike `math.acos`, it 
       does not raise an exception for values of `x` out of range, 
       but rather clips `x` at -1..1, thereby avoiding math domain
       errors in the case of numerical errors.'''
    return math.acos(math.copysign(min(1.0, abs(x)), x))
        
def cosine_distance(v1, v2):
    '''Returns the cosine distance between two vectors'''
    return (safe_acos(np.dot(v1, v2) 
                      / (np.linalg.norm(v1, 2) * np.linalg.norm(v2, 2)))
            / math.pi)
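
To see how the two metrics can disagree, consider a document and a hypothetical second document in which every count is doubled. The sketch below uses pure-Python stand-ins (`toy_euclidean` and `toy_cosine`, names invented here) that are meant to mirror the definitions above without depending on NumPy.

```python
import math

def toy_euclidean(v1, v2):
    """Pure-Python Euclidean distance (stand-in for the NumPy version)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v1, v2)))

def toy_cosine(v1, v2):
    """Angle between v1 and v2 divided by pi, with the cosine clipped to [-1, 1]."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norms = math.sqrt(sum(a * a for a in v1)) * math.sqrt(sum(b * b for b in v2))
    return math.acos(max(-1.0, min(1.0, dot / norms))) / math.pi

short_doc = [9, 6, 2, 0]     # counts from Federalist 1, shown earlier
long_doc = [18, 12, 4, 0]    # hypothetical document with every count doubled

print(toy_euclidean(short_doc, long_doc))
print(toy_cosine(short_doc, long_doc))
```

The Euclidean distance is large because the vectors differ in length, while the cosine distance is zero because they point in the same direction. Which behavior is preferable depends on whether document length should influence the classification.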

#### Generating nearest neighbor models

To specify a nearest neighbor model, we need both a training corpus (like `training`) and a distance metric (like `euclidean_distance` or `cosine_distance` defined just above). 

Define a function called `define_nearest_neighbor` that takes a training corpus and a metric and returns a model - that is, a **function** that classifies a single test example. The model should return the class label of that training example whose counts vector is closest to that of the test example according to the metric.

<!--
BEGIN QUESTION
name: define_nearest_neighbor
-->

In [109]:
def define_nearest_neighbor(corpus, metric):
    def model(example):
        # Find the training paper whose counts vector is closest to the example's
        nearest = min(corpus,
                      key=lambda paper: metric(example['counts'], paper['counts']))
        return nearest['authors']
    return model

We can use the `define_nearest_neighbor` function to define two new models for nearest neighbor classification, one using Euclidean distance and one using cosine distance.

In [110]:
nearest_neighbor_euclidean_model = \
    define_nearest_neighbor(training, euclidean_distance)

nearest_neighbor_cosine_model = \
    define_nearest_neighbor(training, cosine_distance)

#### Testing the nearest neighbor models on the training data

How accurate are these models when used to label the training data (as we did for the majority class model above)? Use the `accuracy` function above to calculate the accuracy of `nearest_neighbor_euclidean_model` in labeling the _training_ data (not the test data), and similarly for `nearest_neighbor_cosine_model`.
<!--
BEGIN QUESTION
name: accuracy_train
-->

In [111]:
#TODO - define the variable to be the calculated accuracy 
accuracy_nn_euclidean_train = accuracy(training, nearest_neighbor_euclidean_model)
accuracy_nn_cosine_train = accuracy(training, nearest_neighbor_cosine_model)

In [112]:
grader.check("accuracy_train")

In [113]:
print(f"Accuracy of the nearest neighbor euclidean model tested on training data: "
      f"{accuracy_nn_euclidean_train:.3f}")
print(f"Accuracy of the nearest neighbor cosine model tested on training data: "
      f"{accuracy_nn_cosine_train:.3f}")

Accuracy of the nearest neighbor euclidean model tested on training data: 1.000
Accuracy of the nearest neighbor cosine model tested on training data: 1.000


<!-- BEGIN QUESTION -->

**Question:** Does the performance of these classifiers on the training data seem to you to be representative of how good a classifier each is? Why or why not?
<!--
BEGIN QUESTION
name: open_response_1
manual: true
-->

No. Both models are evaluated on the same data they were trained on, which the nearest neighbor method has effectively memorized. For each training example, the closest training example is the example itself (at distance 0.0), so the model simply returns that example's own label. To measure performance meaningfully, we should evaluate on examples that the model has not seen.



<!-- END QUESTION -->



#### Testing the nearest neighbor models on the testing data

To get a better sense of how the nearest neighbor models perform, let's try them out on the testing data that we have. (Recall that the testing data in `testing` were the ambiguously-authored Federalist papers, where the `authors` field was `'Hamilton or Madison'`.)

We start by looking in detail at the predictions generated by the two nearest neighbor models. Print out a table that lists, for each `testing` example, the paper number and the authors predicted under the nearest neighbor Euclidean model and the nearest neighbor cosine model.
<!--
BEGIN QUESTION
name: print_table
-->

In [114]:
# TODO
table_euclidean = [(ex['number'], nearest_neighbor_euclidean_model(ex)) for ex in testing]
print('Euclidean:')
print(table_euclidean)

table_cosine = [(ex['number'], nearest_neighbor_cosine_model(ex)) for ex in testing]

print('Cosine:')
print(table_cosine)

Euclidean:
[('49', 'Madison'), ('50', 'Hamilton'), ('51', 'Madison'), ('52', 'Madison'), ('53', 'Madison'), ('54', 'Madison'), ('55', 'Madison'), ('56', 'Madison'), ('57', 'Madison'), ('62', 'Madison'), ('63', 'Madison')]
Cosine:
[('49', 'Madison'), ('50', 'Madison'), ('51', 'Madison'), ('52', 'Madison'), ('53', 'Madison'), ('54', 'Madison'), ('55', 'Hamilton'), ('56', 'Madison'), ('57', 'Madison'), ('62', 'Madison'), ('63', 'Madison')]


<!-- BEGIN QUESTION -->

What do you notice about the two models?
<!--
BEGIN QUESTION
name: open_response_2
manual: true
-->

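Both models use the nearest neighbor algorithm and differ only in the metric used to measure the distance between the test example and the training examples. They agree on 9 of the 11 test examples, predicting 'Madison' for each of them, but differ on two: paper 50 (Euclidean: 'Hamilton'; cosine: 'Madison') and paper 55 (Euclidean: 'Madison'; cosine: 'Hamilton'). Each model thus predicts 'Hamilton' exactly once, though for a different paper. That the totals match is a coincidence, since the models could in principle differ on any number of predictions.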

<!-- END QUESTION -->

#### Accuracy on the testing corpus

Now use the `accuracy` function to calculate the accuracy of the two nearest neighbor models as you did above, but this time calculating accuracy on the testing corpus rather than the training corpus. (Expect to find a surprising result. Read ahead for an explanation if you're confused.)
<!--
BEGIN QUESTION
name: accuracy_test
-->

In [115]:
#TODO - define the variables to be the calculated accuracy of the nearest 
# neighbor Euclidean model and the cosine model on the testing data

accuracy_nn_euclidean_test = accuracy(testing, nearest_neighbor_euclidean_model)
accuracy_nn_cosine_test = accuracy(testing, nearest_neighbor_cosine_model)


In [116]:
grader.check("accuracy_test")

In [117]:
print(f"Accuracy of the nearest neighbor euclidean model tested on testing data: "
      f"{accuracy_nn_euclidean_test:.3f}")
print(f"Accuracy of the nearest neighbor cosine model tested on testing data: "
      f"{accuracy_nn_cosine_test:.3f}")

Accuracy of the nearest neighbor euclidean model tested on testing data: 0.000
Accuracy of the nearest neighbor cosine model tested on testing data: 0.000


<!-- BEGIN QUESTION -->

**Question:** Does the performance of these classifiers on the testing data seem to you to be representative of how good a classifier each is? Why or why not?
<!--
BEGIN QUESTION
name: open_response_3
manual: true
-->

It isn't representative, because the ground-truth labels of the test examples are unknown. The test set consists of papers whose `authors` field is `'Hamilton or Madison'`, whereas each model predicts either `'Hamilton'` or `'Madison'`. The `accuracy` function counts the predictions that are equal to the recorded label, and neither `'Hamilton'` nor `'Madison'` is equal to `'Hamilton or Madison'`, so the accuracy is 0. To overcome this issue we need ground-truth labels for the testing data, as in classification problems generally. In our case these would be the gold labels (labels produced by humans, as described in the reading material).

<!-- END QUESTION -->

#### The importance of gold labels

In order to evaluate the accuracy of the nearest neighbor model – and any model – we need to have the true labels for the testing corpus, the so-called _gold_ labels. What shall we use for gold labels? Mosteller and Wallace's much more extensive analysis concluded that all of the papers of ambiguous origin were penned by Madison, so we'll use that. We should modify the `testing` corpus to inject the gold labels. 

Write some code to update the testing corpus with the gold labels.
<!--
BEGIN QUESTION
name: get_gold
-->

In [118]:
#TODO - write code to update the testing corpus with the gold labels
for ex in testing:
    if ex['authors'] == 'Hamilton or Madison':
        ex['authors'] = 'Madison'



In [119]:
grader.check("get_gold")

Now we can rerun the accuracy calculations.

In [120]:
accuracy_nn_euclidean_test_with_gold = accuracy(testing, nearest_neighbor_euclidean_model)
accuracy_nn_cosine_test_with_gold = accuracy(testing, nearest_neighbor_cosine_model)

In [121]:
print(f"Accuracy of the nearest neighbor euclidean model tested on testing data: "
      f"{accuracy_nn_euclidean_test_with_gold:.3f}")
print(f"Accuracy of the nearest neighbor cosine model tested on testing data: "
      f"{accuracy_nn_cosine_test_with_gold:.3f}")

Accuracy of the nearest neighbor euclidean model tested on testing data: 0.909
Accuracy of the nearest neighbor cosine model tested on testing data: 0.909


Do these results make more sense?
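
Both figures can be checked against the prediction tables printed earlier. The sketch below copies those predictions verbatim and scores them once against the ambiguous label and once against the gold label `'Madison'`; the helper `proportion_matching` is a name introduced here for illustration.

```python
# Predictions copied from the tables printed earlier in this lab
euclidean_predictions = [
    ('49', 'Madison'), ('50', 'Hamilton'), ('51', 'Madison'), ('52', 'Madison'),
    ('53', 'Madison'), ('54', 'Madison'), ('55', 'Madison'), ('56', 'Madison'),
    ('57', 'Madison'), ('62', 'Madison'), ('63', 'Madison')]
cosine_predictions = [
    ('49', 'Madison'), ('50', 'Madison'), ('51', 'Madison'), ('52', 'Madison'),
    ('53', 'Madison'), ('54', 'Madison'), ('55', 'Hamilton'), ('56', 'Madison'),
    ('57', 'Madison'), ('62', 'Madison'), ('63', 'Madison')]

def proportion_matching(predictions, label):
    """Fraction of (paper, prediction) pairs whose prediction equals `label`."""
    return sum(pred == label for _, pred in predictions) / len(predictions)

for name, preds in [('Euclidean', euclidean_predictions),
                    ('cosine', cosine_predictions)]:
    before = proportion_matching(preds, 'Hamilton or Madison')
    after = proportion_matching(preds, 'Madison')
    print(f"{name}: {before:.3f} before gold labels, {after:.3f} after")
```

No prediction can ever equal the string `'Hamilton or Madison'`, which accounts for the earlier 0.000. With gold labels, each model is right on 10 of the 11 papers.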

<!-- BEGIN QUESTION -->

## Lab debrief – for consensus submission only

**Question:** We're interested in any thoughts your group has about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# End of lab 1-2

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [122]:
grader.check_all()