# Authorship Analysis

The _Federalist Papers_ were originally published under the pseudonym "Publius". Although the identity of the authors was a closely guarded secret at the time, most of the papers have since been conclusively attributed to one of Hamilton, Jay, or Madison. The known authorships can be found in `/data/federalist/authorship.csv`.

For 15 papers, however, the authorships remain disputed. (These papers can be identified from the `authorship.csv` file because the "Author" field is blank.) In this analysis, you will use textual data analysis to assign authorships to these disputed papers.
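As a minimal sketch of how the disputed papers might be picked out, the block below reads a toy stand-in for `authorship.csv` and separates the rows with a blank "Author" field. The `Paper` column name and the toy rows are assumptions for illustration; the real file's layout may differ.

```python
import io
import pandas as pd

# Toy stand-in for authorship.csv. The "Paper" column name is an assumption;
# only the "Author" field is described in the lab text.
toy_csv = io.StringIO("Paper,Author\n1,Hamilton\n2,Jay\n10,Madison\n49,\n50,\n")
authorship = pd.read_csv(toy_csv, index_col="Paper")

# pandas reads a blank field as NaN, so isna() flags the disputed papers.
disputed = authorship.index[authorship["Author"].isna()].tolist()
known = authorship.dropna(subset=["Author"])

print(disputed)  # → [49, 50]
```

Keeping `known` and `disputed` separate makes it easy to restrict later comparisons to papers whose author is actually known.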

In [1]:
import pandas as pd
import numpy as np

In [2]:
scores = []

## Question 1

In Information Retrieval, we use TF-IDF (instead of raw term frequencies) to give rare words more weight. If you're looking for "red stockings", then a document with the rare word "stockings" is much more likely to be relevant than a document with the common word "red".
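The weighting described above can be sketched on a toy corpus. This uses one common variant, `tf * log(N / df)`; other TF-IDF definitions exist, and the three documents and the `tf_idf` helper are made up for illustration.

```python
import math
from collections import Counter

# Toy corpus: "red" appears in every document, "stockings" in only one.
docs = [
    "red stockings hung by the chimney",
    "the red car",
    "a red door and a red roof",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# Document frequency: the number of documents that contain each word.
doc_freq = Counter(word for tokens in tokenized for word in set(tokens))

def tf_idf(word, tokens):
    """Raw term count times log(N / df); assumes the word occurs in the corpus."""
    tf = tokens.count(word)
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

print(tf_idf("red", tokenized[0]))        # 0.0, because "red" is in every document
print(tf_idf("stockings", tokenized[0]))  # log(3), about 1.0986
```

Both words occur once in the first document, yet only the rare word receives a nonzero weight.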

However, for analyzing an author's style, common words like "the" and "on" are actually much more useful than rare words like "hostilities". Rare words depend on context, and different authors writing about the same context will use the same words. For example, both Dr. Seuss and Charles Dickens used the rare words "chimney" and "stockings" in _How the Grinch Stole Christmas_ and _A Christmas Carol_, respectively. But they used common words very differently: Dickens uses the word "upon" over 100 times, while Dr. Seuss does not use "upon" at all.

First, determine the 50 most common words in the _Federalist Papers_ corpus. You should use the `word_counts.csv` that you generated in the Zipf's Law part of this lab. Then, create a data frame of term frequencies for those 50 words in each of the 85 _Federalist Papers_. Your data frame should have 85 rows and 50 columns. The index of the data frame should be the number of the paper (1-85).

In [3]:
word_counts = pd.read_csv("word_counts.csv", header=None, names=["Word", "Count"])
word_counts = word_counts.sort_values(by="Count", ascending=False).reset_index(drop=True)

In [11]:
words50 = word_counts.head(50)["Word"].tolist()

In [14]:
top_words = set(words50)  # set membership is faster than scanning a list
allPgs = []
for paper in range(1, 86):
    vocab = {}  # counts of the top-50 words in this paper
    with open("/data/federalist/%d.txt" % paper) as f:
        words = f.read().split()
    for word in words:
        # Normalize: lowercase, strip surrounding punctuation, drop hyphens and quotes.
        word = word.lower().rstrip(';.?!,:)"').lstrip('"(').replace('-', '').replace('"', '')
        if word in top_words:
            vocab[word] = vocab.get(word, 0) + 1
    allPgs.append(vocab)

In [18]:
top50 = pd.DataFrame(allPgs)
top50.index = np.arange(1, len(top50) + 1)
top50

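| Paper | a | all | an | and | any | are | as | at | be | been | ... | them | they | this | those | to | we | which | will | with | would |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 25 | 9 | 11 | 40 | 6.0 | 12 | 10 | 8 | 34 | 3.0 | ... | 2.0 | 6 | 14 | 9.0 | 72 | 8.0 | 18 | 25 | 6 | 2 |
| 2 | 29 | 4 | 1 | 83 | 1.0 | 6 | 16 | 10 | 15 | 8.0 | ... | 4.0 | 22 | 14 | 2.0 | 53 | 5.0 | 11 | 2 | 13 | 5 |
| 3 | 13 | 4 | 3 | 60 | 5.0 | 8 | 24 | 1 | 31 | 2.0 | ... | 8.0 | 5 | 6 | 6.0 | 56 | NaN | 11 | 24 | 10 | 2 |
| 4 | 16 | 4 | 3 | 90 | 5.0 | 11 | 20 | 2 | 26 | 2.0 | ... | 12.0 | 17 | 1 | 4.0 | 51 | 10.0 | 10 | 15 | 12 | 17 |
| 5 | 9 | 4 | 4 | 72 | 3.0 | 3 | 3 | 4 | 31 | NaN | ... | 11.0 | 11 | 6 | 9.0 | 45 | 5.0 | 10 | 6 | 11 | 37 |
| 6 | 52 | 6 | 12 | 81 | NaN | 18 | 19 | 6 | 18 | 13.0 | ... | 3.0 | 13 | 11 | 7.0 | 61 | 5.0 | 24 | 6 | 15 | 6 |
| 7 | 48 | 12 | 15 | 51 | 7.0 | 7 | 19 | 11 | 47 | 8.0 | ... | 12.0 | 7 | 22 | 8.0 | 82 | 16.0 | 24 | 1 | 12 | 51 |
| 8 | 45 | 9 | 13 | 54 | 2.0 | 14 | 16 | 11 | 35 | 16.0 | ... | 11.0 | 13 | 19 | 6.0 | 80 | 5.0 | 26 | 11 | 13 | 27 |
| 9 | 47 | 8 | 13 | 45 | 4.0 | 14 | 19 | 10 | 26 | 15.0 | ... | 5.0 | 20 | 15 | 6.0 | 71 | 6.0 | 25 | 7 | 14 | 8 |
| 10 | 78 | 4 | 14 | 121 | 4.0 | 27 | 20 | 8 | 61 | 9.0 | ... | 8.0 | 11 | 11 | 4.0 | 100 | 9.0 | 39 | 30 | 11 | 6 |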
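The missing entries in the output above appear where a word never occurs in a paper, so its dictionary has no key for that word. A minimal sketch of one way to turn those gaps into zero counts is shown below, on toy data with made-up names (`toy_counts`, `toy_tf`, `filled`); it is not part of the original solution.

```python
import pandas as pd

# Toy per-paper counts; a word absent from a paper has no key in its dict.
toy_counts = [{"the": 5, "upon": 2}, {"the": 7}, {"the": 3, "upon": 1}]
toy_tf = pd.DataFrame(toy_counts, index=[1, 2, 3])

# Absent words show up as NaN, which also forces the column to float.
# Filling with 0 and casting back restores integer counts.
filled = toy_tf.fillna(0).astype(int)
print(filled)
```

Zero counts matter for later steps, since arithmetic on `NaN` values propagates or silently skips them, depending on the operation.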


### Grader's Comments

- 
- 

[This question is worth 10 points.]

In [7]:
# This cell should be modified only by a grader.
scores.append(0)

## Question 2

Implement a function `calc_cosine_similarity` that calculates the cosine similarity between a vector (representing the counts of the 50 most common words in a document) and each document in the _Federalist Papers_ corpus. Some code to check your implementation has been provided for you below.
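For reference, the cosine similarity of two vectors `a` and `b` is their dot product divided by the product of their lengths. The sketch below applies that definition to every row of a toy data frame. The helper name `cosine_similarity_to_rows`, its two-argument signature, and the toy counts are assumptions for illustration, not the required `calc_cosine_similarity` interface.

```python
import numpy as np
import pandas as pd

def cosine_similarity_to_rows(vector, frame):
    """Cosine similarity between `vector` and every row of `frame`.

    Hypothetical helper: `vector` is a Series whose index matches the
    columns of `frame`. Returns a Series indexed like `frame`.
    """
    dots = frame.dot(vector)                        # a . b for each row b
    row_norms = np.sqrt((frame ** 2).sum(axis=1))   # |b| for each row
    vector_norm = np.sqrt((vector ** 2).sum())      # |a|
    return dots / (row_norms * vector_norm)

# Toy term-frequency table: 3 documents, 3 words.
toy = pd.DataFrame(
    {"the": [10, 20, 1], "of": [5, 10, 8], "upon": [0, 0, 4]},
    index=[1, 2, 3],
)
sims = cosine_similarity_to_rows(toy.loc[1], toy)
print(sims)
```

Document 2 has exactly twice the counts of document 1, so the two point in the same direction and their similarity is 1 up to floating-point rounding. This is why cosine similarity compares word proportions rather than document lengths.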

In [8]:
# Use the data frame from Question 1 as the input to this function.
# Subset the results to get the papers without an author (for a later question).
# Compare the vector to every row.

def calc_cosine_similarity(vector):
    """
    Args:
      - vector: a Pandas series with 50 elements, representing
          the term frequencies of the 50 most common words in 
          a document.
          
    Returns:
      A Pandas series, indexed by the number of the Federalist paper,
      containing the cosine similarities between each paper and the vector.
    """
    raise NotImplementedError

In [9]:
# This should return a vector with the cosine similarity of each
# Federalist Paper with Paper No. 49 (one of the disputed papers).
calc_cosine_similarity(data.loc[49])

NameError: name 'data' is not defined

In [None]:
# This test should pass without an error. (Think about why!)
assert np.isclose(calc_cosine_similarity(data.loc[49]).loc[49], 1.0)

### Grader's Comments

- 
- 

[This question is worth 15 points.]

In [None]:
# This cell should be modified only by a grader.
scores.append(0)

## Question 3

We will consider two ways of using the cosine similarities to assign an authorship to the disputed papers.

1. We can take the author of the Federalist Paper (among those with known authors) with the highest cosine similarity to the disputed paper.
2. We can look at the authors of the 3 papers (among those with known authors) with the 3 highest cosine similarities to the disputed paper and use "majority vote" to determine the winner. (For example, if the 3 most similar papers were written by Hamilton, Madison, and Hamilton, then we would conclude that the paper was written by Hamilton.) In the case of a tie, we pick one arbitrarily.
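The two methods above differ only in how many neighbors vote, so both might be sketched with a single helper. The function name `predict_author`, its parameters, and the toy similarities are assumptions for illustration.

```python
from collections import Counter
import pandas as pd

def predict_author(similarities, authors, k=1):
    """Majority vote among the k known-author papers most similar to a paper.

    similarities: Series of cosine similarities, indexed by paper number.
    authors: Series of known authors, indexed by paper number (no blanks).
    """
    # Keep only the papers whose author is known.
    known_sims = similarities.loc[similarities.index.intersection(authors.index)]
    top_k = known_sims.nlargest(k).index
    votes = Counter(authors.loc[top_k])
    # On a tie, most_common returns whichever tied author it saw first.
    return votes.most_common(1)[0][0]

# Toy data: paper 5 plays the disputed paper, so it has no entry in `authors`.
sims = pd.Series({1: 0.95, 2: 0.99, 3: 0.97, 4: 0.90, 5: 0.98})
authors = pd.Series({1: "Hamilton", 2: "Madison", 3: "Hamilton", 4: "Jay"})

print(predict_author(sims, authors, k=1))  # → Madison
print(predict_author(sims, authors, k=3))  # → Hamilton
```

In this toy case the two methods disagree: the single nearest paper is Madison's, but two of the three nearest are Hamilton's.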

Remember that the known authorships can be found in `/data/federalist/authorship.csv`.

To determine which of the two methods is better, we first apply the methods to the papers with known authorship. For each of the 70 papers with known authors, apply the two methods above and create a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) -- showing how often you predicted Hamilton, Jay, or Madison, and how often it actually was Hamilton, Jay, or Madison. You can create the confusion matrix using either Pandas or Matplotlib, as long as it is labeled.
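A labeled confusion matrix can be built in Pandas with `pd.crosstab`. The sketch below uses made-up actual and predicted labels purely to show the shape of the result.

```python
import pandas as pd

# Made-up labels for five papers; real values would come from the analysis.
actual = pd.Series(
    ["Hamilton", "Hamilton", "Madison", "Jay", "Madison"], name="Actual"
)
predicted = pd.Series(
    ["Hamilton", "Madison", "Madison", "Jay", "Madison"], name="Predicted"
)

# Rows are the true authors, columns the predicted authors.
labels = ["Hamilton", "Jay", "Madison"]
confusion = pd.crosstab(actual, predicted).reindex(
    index=labels, columns=labels, fill_value=0
)
print(confusion)
```

The `reindex` step keeps all three authors in the matrix even if one of them is never predicted, which `crosstab` alone would leave out.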

**WARNING:** When applying the above methods to the papers with known authorships, you should exclude the document with the highest cosine similarity. (Do you see why?)

In [None]:
# YOUR CODE HERE

_DISCUSS WHICH OF THE TWO METHODS WAS BETTER AND WHY._

### Grader's Comments

- 
- 

[This question is worth 20 points.]

In [None]:
# This cell should be modified only by a grader.
scores.append(0)

## Question 4

Using the method that you deemed better in Question 3, assign an authorship to the 15 disputed _Federalist Papers_.

### Grader's Comments

- 
- 

[This question is worth 5 points.]

In [None]:
# This cell should be modified only by a grader.
scores.append(0)