# Supervised Text Classification

### with the disputed Federalist papers


<img src="http://teachingamericanhistory.org/wp-content/themes/tah-main/images/imported/ratification/federalist10.jpg">

### What is supervised text classification?

Email Subject | True Label
--|--
Your Mystery Tuesday Deal has arrived | Spam
End-of-Season Clearance :: 3,500+ Items on Sale! | Spam
Metis pre-class Info/Meetup sessions | **Not Spam**
The Best Content Marketers in the World | ?


A model is [ text $\rightarrow$ label ]
---- 

## Process

1. Find the data and parse into sets of [document, label]
2. Train a model on the labeled documents, preferably with cross-validation. Select the model that performs best
3. Run the model on the documents that don't have labels yet
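
As a rough illustration of the three steps, here is a minimal sketch on the toy email table above. The choice of `CountVectorizer` plus `MultinomialNB` and the name `toy_model` are assumptions made for brevity, not the model used later in this notebook. Three labeled rows are far too few for cross-validation or for a meaningful prediction.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Step 1: labeled [document, label] pairs taken from the table above
labeled = [
    ("Your Mystery Tuesday Deal has arrived", "Spam"),
    ("End-of-Season Clearance :: 3,500+ Items on Sale!", "Spam"),
    ("Metis pre-class Info/Meetup sessions", "Not Spam"),
]
documents = [text for text, label in labeled]
labels = [label for text, label in labeled]

# Step 2: train a model (word counts fed to a Naive Bayes classifier)
toy_model = Pipeline([("vect", CountVectorizer()), ("clf", MultinomialNB())])
toy_model.fit(documents, labels)

# Step 3: run the model on a document that has no label yet
unlabeled = ["The Best Content Marketers in the World"]
toy_prediction = toy_model.predict(unlabeled)[0]
print(toy_prediction)
```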

### Motivating Problem

<img src="http://tenthamendmentcenter.com/wp-content/uploads/2012/08/federalist-papers.jpg">

85 papers written by Alexander Hamilton, James Madison, and John Jay in 1787-1788

Published in prominent New York newspapers under the pseudonym "Publius"

12 papers are known as "the disputed Federalist papers": both Hamilton and Madison claimed to have written them

Solved by Wallace and Mosteller in 1959-1963, using the IBM 7090:


<img src="https://pix-media.priceonomics-media.com/blog/1252/IBM_7090_computer.jpg">

## Process

1. **Find the data and parse into sets of [document, label]**
2. Train a model on the labeled documents, preferably with cross-validation. Select the model that performs best
3. Run the model on the documents that don't have labels yet

A quick Google search shows that the Federalist Papers are available from Project Gutenberg

The URL is http://www.gutenberg.org/cache/epub/18/pg18.txt

The file is a single plain-text document, so we need to split and clean it into (text, author) pairs

e.g.

"Among the numerous advantages promised by a well constructed Union.." $\rightarrow$ "Hamilton"

In [1]:
# packages for grabbing web data
# (Python 2 code: urllib.urlopen moved to urllib.request.urlopen in Python 3)
import re
import urllib

# Read in raw data
location_of_federalist_papers = urllib.urlopen("http://www.gutenberg.org/cache/epub/18/pg18.txt")
fed_text_raw_string = location_of_federalist_papers.read()
location_of_federalist_papers.close()

# Save for future use
%store fed_text_raw_string

Stored 'fed_text_raw_string' (str)


In [2]:
# Let's see what it looks like:
fed_text_raw_string[:500]

'\xef\xbb\xbfThe Project Gutenberg EBook of The Federalist Papers, by \r\nAlexander Hamilton and John Jay and James Madison\r\n\r\nThis eBook is for the use of anyone anywhere at no cost and with\r\nalmost no restrictions whatsoever.  You may copy it, give it away or\r\nre-use it under the terms of the Project Gutenberg License included\r\nwith this eBook or online at www.gutenberg.net\r\n\r\n\r\nTitle: The Federalist Papers\r\n\r\nAuthor: Alexander Hamilton\r\n        John Jay\r\n        James Madison\r\n\r\nPosting Date: December 12'

In [3]:
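# Regex found by trial and error: https://regex101.com/r/fKSaBx/1
# (?x) = verbose mode, (?s) = dot also matches newlines.
# Each match runs from a "FEDERALIST No. N" heading up to (not including)
# the next "FEDERALIST" or the Project Gutenberg footer.
regex_string = r'(?xs)(FEDERALIST[. ]+No\.\s[0-9].*?)(?=(?:FEDERALIST)|(?:End\ of\ the\ Project\ Gutenberg))'
pattern = re.compile(regex_string)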
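
To see what the pattern captures without downloading the whole file, here is a small sketch that runs it on a made-up miniature of the Gutenberg text. The `sample_text` string is hypothetical, and plain `re.findall` stands in for the NLTK tokenizer used in the next cell.

```python
import re

# Same pattern as above, written as a raw string
regex_string = r'(?xs)(FEDERALIST[. ]+No\.\s[0-9].*?)(?=(?:FEDERALIST)|(?:End\ of\ the\ Project\ Gutenberg))'
pattern = re.compile(regex_string)

# Hypothetical miniature of the Gutenberg file: header, two papers, footer
sample_text = (
    "The Project Gutenberg EBook of The Federalist Papers\r\n\r\n"
    "FEDERALIST No. 1\r\n\r\nGeneral Introduction\r\n\r\nHAMILTON\r\n\r\n"
    "To the People of the State of New York:\r\n\r\n"
    "AFTER an unequivocal experience.\r\n\r\n"
    "FEDERALIST No. 2\r\n\r\nJAY\r\n\r\n"
    "To the People of the State of New York:\r\n\r\n"
    "WHEN the people of America reflect.\r\n\r\n"
    "End of the Project Gutenberg EBook of The Federalist Papers"
)

# With a single capturing group, findall returns the text of that group
sample_papers = pattern.findall(sample_text)
print(len(sample_papers))        # 2
print(sample_papers[1][:16])     # FEDERALIST No. 2
```

One consequence of the lookahead: an upper-case "FEDERALIST" inside the body of a paper would also end the match early, so the number of pieces is worth checking against the expected number of papers.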

In [4]:
from nltk import tokenize 
# regexp_tokenize returns every match of the pattern: one string per paper
list_o_papers = tokenize.regexp_tokenize(fed_text_raw_string, pattern)

In [5]:
# How many papers?
len(list_o_papers)

86

In [6]:
list_o_papers[4][:500]

'FEDERALIST No. 5\r\n\r\n\r\n\r\nThe Same Subject Continued\r\n\r\n(Concerning Dangers From Foreign Force and Influence)\r\n\r\nFor the Independent Journal.\r\n\r\n\r\n\r\nJAY\r\n\r\n\r\n\r\nTo the People of the State of New York:\r\n\r\nQUEEN ANNE, in her letter of the 1st July, 1706, to the Scotch\r\nParliament, makes some observations on the importance of the UNION\r\nthen forming between England and Scotland, which merit our attention.\r\nI shall present the public with one or two extracts from it: "An\r\nentire and perfect union will '

In [7]:
# Collapse every run of whitespace (including \r and \n) into a single space
list_o_papers_cleaned = [re.sub(r"\s+", " ", essay) for essay in list_o_papers]

In [8]:
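# make list of authors: the first author name that appears in each paper,
# optionally followed by "AND MADISON" or "OR MADISON"
author_search = r"(HAMILTON|JAY|MADISON)(\s(AND|OR)\s(MADISON))?"
fed_author_list = [re.search(author_search, essay).group() for essay in list_o_papers_cleaned]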

In [9]:
# Print out the first 10 to make sure it worked
fed_author_list[:10]

['HAMILTON',
 'JAY',
 'JAY',
 'JAY',
 'JAY',
 'HAMILTON',
 'HAMILTON',
 'HAMILTON',
 'HAMILTON',
 'MADISON']

In [10]:
#unique authors: 
set(fed_author_list)

{'HAMILTON', 'HAMILTON AND MADISON', 'HAMILTON OR MADISON', 'JAY', 'MADISON'}
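
The author pattern can be checked in isolation. The sketch below applies it to two made-up paper openings; `sample_openings` and the helper `find_author` are hypothetical names. The helper returns `None` instead of raising when no author is found, which the list comprehension above would not tolerate.

```python
import re

author_search = r"(HAMILTON|JAY|MADISON)(\s(AND|OR)\s(MADISON))?"

def find_author(essay):
    """Return the first author string in the essay, or None if there is none."""
    match = re.search(author_search, essay)
    return match.group() if match else None

# Hypothetical, already whitespace-collapsed paper openings
sample_openings = [
    "FEDERALIST No. 2 Concerning Dangers from Foreign Force JAY To the People of the State of New York:",
    "FEDERALIST No. 49 From the New York Packet. HAMILTON OR MADISON To the People of the State of New York:",
]
sample_authors = [find_author(text) for text in sample_openings]
print(sample_authors)  # ['JAY', 'HAMILTON OR MADISON']
```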

In [11]:
# keep just the text of the essays
pattern = re.compile(r'''
    (To\ the\ People\ of\ the\ State\ of\ New\ York)  # standard salutation
    .*?                                               # anything non-greedy
    $                                                 # end of string
    ''', re.VERBOSE)
for i in range(len(list_o_papers_cleaned)):
    text_search = re.search(pattern, list_o_papers_cleaned[i])
    list_o_papers_cleaned[i] = text_search.group()


# lowercase everything
list_o_papers_cleaned = [essay.lower() for essay in list_o_papers_cleaned]

# remove most punctuation 
list_o_papers_cleanedNoPunct = [re.sub("[.?!:;,()`'*]|(--)|\[|\]", "", essay) for essay in list_o_papers_cleaned]

In [12]:
list_o_papers_cleanedNoPunct[5][:500]

'to the people of the state of new york the three last numbers of this paper have been dedicated to an enumeration of the dangers to which we should be exposed in a state of disunion from the arms and arts of foreign nations i shall now proceed to delineate dangers of a different and perhaps still more alarming kindthose which will in all probability flow from dissensions between the states themselves and from domestic factions and convulsions these have been already in some instances slightly an'
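
The output above contains `kindthose`: deleting `--` outright glues together the words on either side of the dash. One possible variant, shown here as a sketch rather than the cleaning actually used in this notebook, replaces each punctuation mark with a space and then collapses whitespace. The function name `clean_essay` is made up for illustration.

```python
import re

def clean_essay(essay):
    """Lowercase, turn punctuation into spaces, then collapse whitespace."""
    lowered = essay.lower()
    # Same punctuation set as above, but replaced by a space instead of ""
    no_punct = re.sub(r"[.?!:;,()`'*]|(--)|\[|\]", " ", lowered)
    return re.sub(r"\s+", " ", no_punct).strip()

print(clean_essay("alarming kind--those which, in all probability, flow"))
# alarming kind those which in all probability flow
```

The trade-off is that apostrophes also become spaces, so a possessive such as "people's" is split into "people" and "s".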

## Separate into Training and Test Sets

In [13]:
# Training set: papers with a single known author, Hamilton or Madison
# (Jay's papers and the jointly written ones are left out)
train_documents = [doc for doc, author in zip(list_o_papers_cleanedNoPunct, fed_author_list) if author in ['HAMILTON', 'MADISON']]

train_labels = [author for author in fed_author_list if author in ['HAMILTON', 'MADISON']]

# Test set: the disputed papers, labeled 'HAMILTON OR MADISON' in the source text
test_documents = [doc for doc, author in zip(list_o_papers_cleanedNoPunct, fed_author_list) if author == 'HAMILTON OR MADISON']

test_labels = [author for author in fed_author_list if author == 'HAMILTON OR MADISON']

In [24]:
len(train_documents)

67

In [26]:
import numpy as np
np.mean([len(i) for i in train_documents])

13652.731343283582

## Process

1. Find the data and parse into sets of [document, label]
2. **Train a model on the labeled documents, preferably with cross-validation. Select the model that performs best**
3. Run the model on the documents that don't have labels yet

In [14]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier

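# word counts -> tf-idf weighting -> linear SVM (hinge loss) trained with SGD
# Note: n_iter was renamed max_iter in later scikit-learn releases
text_clf = Pipeline([('vect', CountVectorizer()),
                     ('tfidf', TfidfTransformer()),
                     ('clf', SGDClassifier(loss='hinge', penalty='l2',
                                           alpha=1e-3, n_iter=5, random_state=42)),
])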
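
To make the first two pipeline stages concrete, here is a small sketch of what `CountVectorizer` and `TfidfTransformer` produce on two made-up documents. `toy_docs` is hypothetical, and the default settings are assumed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

toy_docs = ["upon the whole", "the whole the"]

# Stage 1: each document becomes a row of word counts
vect = CountVectorizer()
counts = vect.fit_transform(toy_docs)
print(sorted(vect.vocabulary_.items()))  # [('the', 0), ('upon', 1), ('whole', 2)]
print(counts.toarray())                  # [[1 1 1]
                                         #  [2 0 1]]

# Stage 2: counts are reweighted so that words appearing in every document
# matter less; with the default settings each row is scaled to unit length
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))
```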


In [15]:
text_clf.fit(train_documents, train_labels)

Pipeline(steps=[('vect', CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=(1, 1), preprocessor=None, stop_words=None,
        st...     penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False))])

In [16]:
predicted = text_clf.predict(train_documents)

In [17]:
predicted[:10]

array(['HAMILTON', 'HAMILTON', 'HAMILTON', 'HAMILTON', 'HAMILTON',
       'MADISON', 'HAMILTON', 'HAMILTON', 'HAMILTON', 'MADISON'], 
      dtype='|S8')

In [18]:
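import numpy
# Training accuracy: measured on the same papers the model was fit on,
# so it is an optimistic sanity check rather than a held-out estimate
numpy.mean(predicted == train_labels)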

0.86567164179104472
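
An accuracy figure is easier to interpret next to the score of a trivial classifier that always predicts the most frequent author. The sketch below computes that baseline; `majority_baseline_accuracy` is a hypothetical helper and `toy_labels` is made-up data, so the printed number says nothing about the real papers. On the real data the call would be `majority_baseline_accuracy(train_labels)`.

```python
from collections import Counter

def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_label, count = Counter(labels).most_common(1)[0]
    return most_common_label, count / float(len(labels))

toy_labels = ['HAMILTON'] * 3 + ['MADISON']
print(majority_baseline_accuracy(toy_labels))  # ('HAMILTON', 0.75)
```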

In [19]:
from sklearn.model_selection import GridSearchCV
parameters = {'vect__ngram_range': [(1, 1), (1, 2)],
              'tfidf__use_idf': (True, False),
              'clf__alpha': (1e-2, 1e-3),
}

In [20]:
gs_clf = GridSearchCV(text_clf, parameters, n_jobs=-1)

In [21]:
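# Cross-validated grid search; the best pipeline is then refit on all training papers
gs_clf = gs_clf.fit(train_documents, train_labels)
# Predicting the training papers again gives an optimistic accuracy;
# the cross-validated score is stored in gs_clf.best_score_
predictions = gs_clf.predict(train_documents)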

In [22]:
numpy.mean(predictions == train_labels)

0.9850746268656716
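
The figure above is measured on the papers used for fitting. `GridSearchCV` also keeps the cross-validated score of the winning parameter combination in `best_score_`, and the combination itself in `best_params_`. The sketch below shows both on a made-up corpus. The `toy_*` names and sentences are hypothetical, and `max_iter`/`tol` assume a recent scikit-learn release rather than the version this notebook was run with.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical stand-in corpus: three short "documents" per author
toy_documents = [
    "upon the whole the union is necessary",
    "the powers rest upon the people",
    "upon this principle the senate acts",
    "whilst the states retain their powers",
    "the people whilst free remain vigilant",
    "factions arise whilst liberty lasts",
]
toy_labels = ["HAMILTON"] * 3 + ["MADISON"] * 3

toy_clf = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    # max_iter and tol are assumptions for recent scikit-learn versions
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          max_iter=1000, tol=1e-3, random_state=42)),
])
toy_parameters = {"vect__ngram_range": [(1, 1), (1, 2)],
                  "clf__alpha": (1e-2, 1e-3)}

# cv=3 holds out one document per author in each fold
toy_search = GridSearchCV(toy_clf, toy_parameters, cv=3)
toy_search.fit(toy_documents, toy_labels)

print(toy_search.best_params_)  # the winning parameter combination
print(toy_search.best_score_)   # mean accuracy over the held-out folds
```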

## Process

1. Find the data and parse into sets of [document, label]
2. Train a model on the labeled documents, preferably with cross-validation. Select the model that performs best
3. **Run the model on the documents that don't have labels yet**

In [23]:
gs_clf.predict(test_documents)

array(['MADISON', 'MADISON', 'MADISON', 'MADISON', 'MADISON', 'MADISON',
       'MADISON', 'MADISON', 'MADISON', 'MADISON', 'MADISON'], 
      dtype='|S8')
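
A bare label does not say how close each call was. For a linear model such as the one above, `decision_function` returns a signed distance from the decision boundary that can be read alongside each prediction. This is a sketch on made-up data: the `toy_*` names and sentences are hypothetical, and `max_iter`/`tol` assume a recent scikit-learn release.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

toy_train = [
    "upon the whole the union is necessary",
    "the powers rest upon the people",
    "upon this principle the senate acts",
    "whilst the states retain their powers",
    "the people whilst free remain vigilant",
    "factions arise whilst liberty lasts",
]
toy_labels = ["HAMILTON"] * 3 + ["MADISON"] * 3

toy_model = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", SGDClassifier(loss="hinge", penalty="l2", alpha=1e-3,
                          max_iter=1000, tol=1e-3, random_state=42)),
]).fit(toy_train, toy_labels)

# Hypothetical unlabeled documents
toy_disputed = ["whilst the people retain power", "upon the whole it acts"]
toy_predictions = toy_model.predict(toy_disputed)
toy_margins = toy_model.decision_function(toy_disputed)

# Class labels are stored in sorted order; a positive margin means the second one
toy_classes = toy_model.named_steps["clf"].classes_
for label, margin in zip(toy_predictions, toy_margins):
    print(label, round(float(margin), 3))
```

Margins near zero mark the papers where the model is least decided.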

## From the 1963 Paper:

<img src="federalist_conclusion.png">

### Appendix

References:

Original Paper: Mosteller and Wallace, 1963 https://www.stat.cmu.edu/Exams/mosteller.pdf

The story of the discovery: https://priceonomics.com/how-statistics-solved-a-175-year-old-mystery-about/

(uses Naive Bayes)

Recent approach (SVM): http://pages.cs.wisc.edu/~gfung/federalist.pdf

Overview of Naive Bayes classifier: https://web.stanford.edu/~jurafsky/slp3/6.pdf

Overview of text processing: https://web.stanford.edu/class/cs124/lec/naivebayes.pdf

#### Scikit Learn Docs for Text Learning:

http://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html

http://scikit-learn.org/stable/auto_examples/text/document_classification_20newsgroups.html
