# Authorship Analysis

The _Federalist Papers_ were originally published under the pseudonym "Publius". Although the identity of the authors was a closely guarded secret at the time, most of the papers have since been conclusively attributed to one of Hamilton, Jay, or Madison. The known authorships can be found in `/data/federalist/authorship.csv`.

For 15 papers, however, the authorships remain disputed. (These papers can be identified from the `authorship.csv` file because the "Author" field is blank.) In this analysis, you will use textual data analysis to assign authorships to these disputed papers.

In [1]:
import pandas as pd
import numpy as np
from random import randint

In [2]:
scores = []

## Question 1

In Information Retrieval, we use TF-IDF (instead of raw term frequencies) to give rare words more weight. If you're looking for "red stockings", then a document with the rare word "stockings" is much more likely to be relevant than a document with the common word "red".

However, for analyzing an author's style, common words like "the" and "on" are actually much more useful than rare words like "hostilities". Rare words depend on context, and different authors writing about the same context will use the same words. For example, both Dr. Seuss and Charles Dickens used the rare words "chimney" and "stockings" in _How the Grinch Stole Christmas_ and _A Christmas Carol_, respectively. But they used common words very differently: Dickens uses the word "upon" over 100 times, while Dr. Seuss does not use "upon" at all.

First, determine the most common 50 words in the _Federalist Papers_ corpus. You should use the `word_counts.csv` that you generated in the Zipf's Law part of this lab. Then, create a data frame of term frequencies for those 50 words in each of the 85 _Federalist Papers_. Your data frame should have 85 rows and 50 columns. The index of the data frame should be the number of the paper (1-85).

In [3]:
word_counts = pd.read_csv("word_counts.csv", header=None, names=["Word", "Count"])
word_counts = word_counts.sort_values(by="Count", ascending = False).reset_index().reset_index()



In [4]:
words50 = word_counts.head(50)["Word"].tolist()
words50

['the',
 'of',
 'to',
 'and',
 'in',
 'a',
 'be',
 'that',
 'it',
 'is',
 'which',
 'by',
 'as',
 'this',
 'would',
 'have',
 'will',
 'for',
 'or',
 'not',
 'their',
 'with',
 'from',
 'are',
 'on',
 'they',
 'an',
 'states',
 'government',
 'may',
 'been',
 'state',
 'all',
 'but',
 'its',
 'other',
 'power',
 'people',
 'has',
 'more',
 'at',
 'if',
 'them',
 'than',
 'one',
 'any',
 'no',
 'those',
 'we',
 'can']

In [5]:
allPgs = []
for page in range(1,86):
    vocab = {}
    with open("/data/federalist/%d.txt"%page) as f:
        text = f.read()
        words = text.split()
        fixed = []
        for word in words:
            fixed.append(word.lower().rstrip(';.?!,:)"').lstrip('"(').replace('-','').replace('"',''))
        for word in fixed:
            if (word not in vocab) and (word in words50):
                vocab[word] = 1
            elif (word in vocab) and (word in words50):
                vocab[word] += 1
    allPgs.append(vocab)
    
allPgs

[{'a': 25,
  'all': 9,
  'an': 11,
  'and': 40,
  'any': 6,
  'are': 12,
  'as': 10,
  'at': 8,
  'be': 34,
  'been': 3,
  'but': 2,
  'by': 14,
  'can': 3,
  'for': 12,
  'from': 11,
  'government': 9,
  'has': 6,
  'have': 10,
  'if': 4,
  'in': 27,
  'is': 13,
  'it': 20,
  'its': 10,
  'may': 11,
  'more': 7,
  'no': 3,
  'not': 14,
  'of': 106,
  'on': 9,
  'one': 4,
  'or': 6,
  'other': 3,
  'people': 6,
  'power': 2,
  'state': 6,
  'states': 2,
  'than': 11,
  'that': 28,
  'the': 132,
  'their': 14,
  'them': 2,
  'they': 6,
  'this': 14,
  'those': 9,
  'to': 72,
  'we': 8,
  'which': 18,
  'will': 25,
  'with': 6,
  'would': 2},
 {'a': 29,
  'all': 4,
  'an': 1,
  'and': 83,
  'any': 1,
  'are': 6,
  'as': 16,
  'at': 10,
  'be': 15,
  'been': 8,
  'but': 8,
  'by': 10,
  'for': 13,
  'from': 4,
  'government': 9,
  'has': 6,
  'have': 17,
  'if': 3,
  'in': 34,
  'is': 16,
  'it': 38,
  'its': 5,
  'may': 4,
  'more': 5,
  'no': 1,
  'not': 10,
  'of': 83,
  'on': 8,
  'on

In [6]:
len(allPgs)

85

In [7]:
top50 = pd.DataFrame(allPgs)
top50.index = np.arange(1, len(top50) + 1)
top50

Unnamed: 0,a,all,an,and,any,are,as,at,be,been,...,them,they,this,those,to,we,which,will,with,would
1,25,9,11,40,6.0,12,10,8,34,3.0,...,2.0,6,14,9.0,72,8.0,18,25,6,2
2,29,4,1,83,1.0,6,16,10,15,8.0,...,4.0,22,14,2.0,53,5.0,11,2,13,5
3,13,4,3,60,5.0,8,24,1,31,2.0,...,8.0,5,6,6.0,56,,11,24,10,2
4,16,4,3,90,5.0,11,20,2,26,2.0,...,12.0,17,1,4.0,51,10.0,10,15,12,17
5,9,4,4,72,3.0,3,3,4,31,,...,11.0,11,6,9.0,45,5.0,10,6,11,37
6,52,6,12,81,,18,19,6,18,13.0,...,3.0,13,11,7.0,61,5.0,24,6,15,6
7,48,12,15,51,7.0,7,19,11,47,8.0,...,12.0,7,22,8.0,82,16.0,24,1,12,51
8,45,9,13,54,2.0,14,16,11,35,16.0,...,11.0,13,19,6.0,80,5.0,26,11,13,27
9,47,8,13,45,4.0,14,19,10,26,15.0,...,5.0,20,15,6.0,71,6.0,25,7,14,8
10,78,4,14,121,4.0,27,20,8,61,9.0,...,8.0,11,11,4.0,100,9.0,39,30,11,6


### Grader's Comments

- 
- 

[This question is worth 10 points.]

In [8]:
# This cell should only be modified only by a grader.
# paper given is compared against all other papers
scores.append(0)

## Question 2

Implement a function `calc_cosine_similarity` that calculates the cosine similarity between a vector (representing the counts of the 50 most common words in a document) and each document in the _Federalist Papers_ corpus. Some code to check your implementation has been provided for you below.

In [9]:
top50.iloc[1].head(5)

a      29.0
all     4.0
an      1.0
and    83.0
any     1.0
Name: 2, dtype: float64

In [10]:
# use df from questin 1 and put on this function
# subset results to get ones without author - for other question
# compare vector to every row

# top50.iloc[1]
def calc_cosine_similarity(vector):
    idf = np.log(len(top50)/((top50)>0).sum())
    tf_idf = top50*idf
    query_vector = vector*idf
    dot_products = (tf_idf*query_vector).sum(axis=1)
    length1 = np.sqrt((tf_idf**2).sum(axis=1))
    length2 = np.sqrt((query_vector**2).sum())
    cos_sim = dot_products/(length1*length2)
    return cos_sim

In [19]:
# This should return a vector with the cosine similarity of each
# Federalist Paper with Paper No. 49 (one of the disputed papers).
calc_cosine_similarity(top50.loc[1]).nlargest(3)

1     1.000000
2     0.835979
3     0.782240
4     0.785857
5     0.866028
6     0.814072
7     0.600997
8     0.790892
9     0.713261
10    0.915364
11    0.859037
12    0.745709
13    0.739847
14    0.837228
15    0.819067
16    0.811168
17    0.831711
18    0.806310
19    0.666848
20    0.690224
21    0.791626
22    0.826072
23    0.672404
24    0.754902
25    0.852453
26    0.807772
27    0.738337
28    0.665122
29    0.641748
30    0.729425
        ...   
56    0.786391
57    0.863653
58    0.741779
59    0.744361
60    0.915302
61    0.767328
62    0.677962
63    0.736188
64    0.771537
65    0.808865
66    0.828350
67    0.700048
68    0.871360
69    0.689032
70    0.907207
71    0.849740
72    0.804264
73    0.786818
74    0.647510
75    0.760247
76    0.825332
77    0.665681
78    0.689477
79    0.725164
80    0.596428
81    0.747166
82    0.662256
83    0.790878
84    0.761035
85    0.747552
dtype: float64

In [12]:
# This test should pass without an error. (Think about why!)
assert(calc_cosine_similarity(top50.loc[49]).loc[49] == 1.0)
# because it's comparing its cosine similarity to itself!

### Grader's Comments

- 
- 

[This question is worth 15 points.]

In [13]:
# This cell should only be modified only by a grader.
scores.append(0)

## Question 3

We will consider two ways of using the cosine similarities to assign an authorship to the disputed papers.

1. We can take the author of the Federalist Paper (among those with known authors) with the highest cosine similarity to the disputed paper.
2. We can look at the authors of the 3 papers (among those with known authors) with the 3 highest cosine similarities to the disputed papers and use "majority vote" to determine the winner. (For example, if the 3 most similar papers were written by Hamilton, Madison, and Hamilton, then we would conclude that the paper was written by Hamilton.) In the case of a tie, we pick one arbitrarily.

Remember that the known authorships can be found in `/data/federalist/authorship.csv`.

To determine which of the two methods is better, we first apply the methods to the papers with known authorship. For each of the 70 papers with known authors, apply the two methods above and create a [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix) -- showing how often you predicted Hamilton, Jay, or Madison, and how often it actually was Hamilton, Jay, or Madison. You can create the confusion matrix using either Pandas or Matplotlib, as long as it is labeled.

**WARNING:** When applying the above methods to the papers with known authorships, you should exclude the document with the highest cosine similarity. (Do you see why?)

<h3>First Method</h3>

In [14]:
auth = pd.read_csv("/data/federalist/authorship.csv")
known = auth[auth["Author"].notnull()]
disp = auth[auth["Author"].isnull()]
disp_lst = auth[auth["Author"].isnull()].Paper.tolist()
known_lst = auth[auth["Author"].notnull()].Paper.tolist()


def related_author_1(lst):
    pred_auths = []
    for num in lst:
        g_row = calc_cosine_similarity(top50.loc[num]).sort_values(ascending=False)[1:2].index.values[0]
        pred_auth = auth[auth["Paper"]== g_row].Author.values[0]
        pred_auths.append(pred_auth)
    y_actu = pd.Series(known.Author.values, name='Actual')
    y_pred = pd.Series(pred_auths, name='Predicted')
    df_confusion = pd.crosstab(y_actu, y_pred)
    return df_confusion

related_author_1(known_lst)

Predicted,Hamilton,Jay,Madison
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Hamilton,38,0,7
Jay,1,0,2
Madison,4,1,6


<h3>Second Method</h3>

In [15]:
def related_author_2(lst):
    pred_auths = []
    for num in lst:
        g_row = calc_cosine_similarity(top50.loc[num]).sort_values(ascending=False)[1:4].index.values
        box = []
        box.append(auth[auth["Paper"]== g_row[0]].Author.values[0])
        box.append(auth[auth["Paper"]== g_row[1]].Author.values[0])
        box.append(auth[auth["Paper"]== g_row[2]].Author.values[0])
        if box[0] == (box[1] or box[2]):
            pred_auths.append(box[0])
        elif box[1] == (box[0] or box[2]):
            pred_auths.append(box[1])
        elif box[2] == (box[0] or box[1]):
            pred_auths.append(box[2])
        else:
            pred_auths.append(box[randint(0,2)])
    y_actu = pd.Series(known.Author.values, name='Actual')
    y_pred = pd.Series(pred_auths, name='Predicted')
    df_confusion = pd.crosstab(y_actu, y_pred)
    return df_confusion

related_author_2(known_lst)

Predicted,Hamilton,Jay,Madison
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Hamilton,40,0,7
Jay,3,0,1
Madison,5,1,5


_DISCUSS WHICH OF THE TWO METHODS WAS BETTER AND WHY._

Although our models contain small flaws, it seems like the second model is more accurate for it takes into account more cos similarities than the first before it makes a choice.

### Grader's Comments

- 
- 

[This question is worth 20 points.]

In [16]:
# This cell should only be modified only by a grader.
scores.append(0)

## Question 4

Using the method that you deemed better in Question 3, assign an authorship to the 15 disputed _Federalist Papers_.

In [17]:
related_author_2(disp_lst)

Predicted,Hamilton,Jay,Madison
Actual,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Hamilton,3,1,2
Jay,3,0,1
Madison,1,0,0


### Grader's Comments

- 
- 

[This question is worth 5 points.]

In [18]:
# This cell should only be modified only by a grader.
scores.append(0)