# Summary
This notebook explores methods for comparing two different textual datasets to identify the terms that are distinct to each one:

* Difference of proportions (described in [Monroe et al. 2009, Fighting Words](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf), section 3.2.2)
* Mann-Whitney rank-sums test (described in [Kilgarriff 2001, Comparing Corpora](https://www.sketchengine.eu/wp-content/uploads/comparing_corpora_2001.pdf), section 2.3)

In [12]:
import sys, operator
from collections import Counter, defaultdict
from scipy.stats import mannwhitneyu

In [7]:
import numpy as np

In [2]:
# the convote data is already tokenized, so splitting on single spaces is enough
repub_tokens=open("../data/repub.convote.txt", encoding="utf-8").read().split(" ")
dem_tokens=open("../data/dem.convote.txt", encoding="utf-8").read().split(" ")

Q1: First, calculate the simple "difference of proportions" measure from Monroe et al.'s "Fighting Words", section 3.2.2. What are the top ten terms by this measure that are most Republican and most Democratic?
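
As a point of reference, the raw measure can be sketched on toy data as below. This is a minimal illustration rather than the notebook's own solution: the helper name `proportion_differences` and the two toy token lists are hypothetical, and the score is simply each word's relative frequency in the first corpus minus its relative frequency in the second.

```python
from collections import Counter


def proportion_differences(one_tokens, two_tokens):
    """Return {word: p_one(word) - p_two(word)} over the joint vocabulary."""
    one_counts, two_counts = Counter(one_tokens), Counter(two_tokens)
    one_total, two_total = len(one_tokens), len(two_tokens)
    vocab = set(one_counts) | set(two_counts)
    # Counter returns 0 for words that are missing from one of the corpora.
    return {
        word: one_counts[word] / one_total - two_counts[word] / two_total
        for word in vocab
    }


# Toy data, invented purely for illustration (not taken from the convote corpus).
toy_one = "cut the budget cut the debt".split()
toy_two = "grow the economy grow the jobs".split()

diffs = proportion_differences(toy_one, toy_two)
ranked = sorted(diffs, key=diffs.get, reverse=True)
print("most typical of corpus one:", ranked[0])
print("most typical of corpus two:", ranked[-1])
```

Positive scores mark words that are relatively more frequent in the first corpus and negative scores mark words that are relatively more frequent in the second; a word such as "the", which has the same relative frequency in both toy corpora, scores exactly zero.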

In [6]:
print('republic tokens: ', len(repub_tokens))
print('democrat tokens: ', len(dem_tokens))

republic tokens:  416714
democrat tokens:  483171


In [44]:
import string
import nltk

## Clean text 

+ remove punctuation
+ remove stop words
+ stem words

In [49]:
puncts = string.punctuation
stop_words = nltk.corpus.stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [50]:
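# NOTE: `word not in puncts` is a substring test against the punctuation string, so single
# punctuation marks and empty tokens are dropped, while tokens such as "--" are kept.
repub_tokens = [word for word in repub_tokens if (word not in puncts) and (word not in stop_words)]
dem_tokens = [word for word in dem_tokens if (word not in puncts) and (word not in stop_words)]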

In [53]:
ps = nltk.PorterStemmer()
repub_tokens_stemmed = [ps.stem(word) for word in repub_tokens]
dem_tokens_stemmed = [ps.stem(word) for word in dem_tokens]

In [14]:
def cal_counts(tokens):
    counts = defaultdict(int)
    for word in tokens:
        counts[word] += 1
    return counts

## Chi-square test
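
For reference, the statistic computed in the next cell can be written out as follows. The notation mirrors the variable names in the code and is otherwise an assumption of this note: $O_{11}$ and $O_{12}$ are the counts of the word in the first and second corpus, $O_{21}$ and $O_{22}$ are the counts of all other tokens in the first and second corpus, and $N$ is the total number of tokens in both corpora.

$$
\chi^2 = \frac{N \, (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})\,(O_{11} + O_{21})\,(O_{12} + O_{22})\,(O_{21} + O_{22})}
$$

The score is large when the word is split between the two corpora very differently from the remaining tokens, but it does not say which corpus the word favors; the direction has to be read off the relative frequencies separately.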

In [36]:
def chi_square(one_counts, two_counts):

    one_sum, two_sum = sum(one_counts.values()), sum(two_counts.values())
    vocab = list(np.union1d(list(one_counts.keys()), list(two_counts.keys())))
    
    N=one_sum + two_sum
    vals={}
    
    for word in vocab:
        O11=one_counts[word]
        O12=two_counts[word]
        O21=one_sum-one_counts[word]
        O22=two_sum-two_counts[word]
        
        # We'll use the simpler form given in Manning and Schuetze (1999) 
        # for 2x2 contingency tables: 
        # https://nlp.stanford.edu/fsnlp/promo/colloc.pdf, equation 5.7
        
        vals[word]=(N*(O11*O22 - O12*O21)**2)/((O11 + O12)*(O11+O21)*(O12+O22)*(O21+O22))
        
    sorted_chi = {k:v for (k, v) in sorted(vals.items(), key=operator.itemgetter(1), reverse=True)}
    return sorted_chi

In [39]:
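def difference_of_proportions(one_tokens, two_tokens, one_cls_name, two_cls_name):
    # Despite the name, terms are ranked by their chi-square score; the relative
    # frequencies are only used to assign each term to one of the two classes.
    # cal_counts returns defaultdicts, so a word missing from one corpus counts as 0.
    one_counts, two_counts = cal_counts(one_tokens), cal_counts(two_tokens)
    sorted_chi = chi_square(one_counts, two_counts)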
    
    one_sum, two_sum = sum(one_counts.values()), sum(two_counts.values())
    one=[]
    two=[]
    for k in sorted_chi:
        if one_counts[k]/one_sum > two_counts[k]/two_sum:
            one.append(k)
        else:
            two.append(k)
    
    print (one_cls_name, ':')
    for k in one[:20]:
        print("%s\t%s" % (k, sorted_chi[k]))

    print ("\n\n{}:\n".format(two_cls_name))
    for k in two[:20]:
        print("%s\t%s" % (k, sorted_chi[k]))

In [51]:
difference_of_proportions(dem_tokens, repub_tokens, 'Democrat', 'Republic')

Democrat :
cuts	354.76394914480903
republican	315.9149660695275
billion	159.20533317280237
cut	157.1287194169713
republicans	135.55603537338268
--	124.12411856122874
cbc	120.72444361187453
majority	117.69870422458176
administration	109.04797567950855
debt	103.7928198661121
budget	103.11718636395187
iraq	100.08594190293657
professor	92.85743126761008
theresa	76.17157546435062
social	71.55265790848058
health	70.27991663441924
fails	67.9300074948242
gun	64.97365346386344
university	62.674555431170695
n't	61.11643010254004


Republic:

gang	135.28114375995779
economy	112.76583946480014
chairman	100.647013372887
growth	100.64442459346924
small	92.22274660863798
gentleman	88.93870285590893
businesses	83.0519443942429
jurisdiction	82.3714935212654
gangs	79.87221745586834
shall	78.66749657131065
may	71.68530395687134
identification	70.70678819490136
driver	70.019411279093
terrorists	69.97493533648
terri	69.31482330636425
important	68.38123341198757
jobs	67.0485231904696
committee	63.4154279939

In [54]:
difference_of_proportions(dem_tokens_stemmed, repub_tokens_stemmed, 'Democrat', 'Republic')

Democrat :
cut	542.0320077518245
republican	450.46285526417677
billion	175.59634023664373
--	124.12411856122874
cbc	120.72444361187453
budget	107.9492917233775
debt	107.4225725285551
fail	101.2131313048445
iraq	100.08594190293657
administr	93.54095075691816
major	88.61140898053648
worker	81.95803623310654
theresa	76.17157546435062
deficit	75.7730502383019
professor	73.9392811615788
gun	71.04419319448873
health	70.63352853794076
opposit	67.75518061801439
social	67.52269487511651
altern	64.98553647027717


Republic:

gang	214.72196763004575
busi	138.72507579817227
economi	112.25884490634984
jurisdict	110.54764809170459
chairman	100.647013372887
growth	100.64442459346924
small	92.22274660863798
gentleman	88.93870285590893
terrorist	82.58421978219984
shall	78.66749657131065
licens	72.87412934485505
identif	71.75229849477107
may	71.68530395687134
committe	69.9531549528938
import	69.57009824196267
terri	68.97625501070368
grow	59.18992687464742
look	57.58220448064527
thank	51.969161258649436


Simply analyzing the difference in relative frequencies has a number of downsides:

1. As Monroe et al. (2009) point out (and as we can see here as well), it tends to emphasize high-frequency words (be sure you understand why).
2. It does not measure whether a difference is statistically meaningful or just due to chance. The $\chi^2$ test is one method (described in Kilgarriff 2001 and, in the context of collocations, in Manning and Schuetze [here](https://nlp.stanford.edu/fsnlp/promo/colloc.pdf)) that addresses the desideratum of finding statistically significant terms, but it has a downside of its own:
3. Simply counting up the total number of mentions of a term doesn't account for the "burstiness" of language: if we see the word "Dracula" in a text, we're probably going to see it again in that same text. Occurrences of words are not independent random events; they are tightly coupled with each other.

If we're trying to understand the robust differences between two corpora, we might prefer to prioritize words that show up more frequently *everywhere* in corpus A (but not in corpus B) over those that show up very frequently only within a narrow slice of A (such as one text in a genre, one chapter in a book, or one speaker when measuring the differences between political parties).
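
The burstiness problem can be made concrete with a small, entirely hypothetical example: two toy corpora of five texts each contain the word "dracula" the same number of times overall, but in one corpus every mention sits in a single text. The helper names `total_count` and `texts_containing` are invented for this sketch.

```python
from collections import Counter

# Hypothetical toy corpora: each inner list stands for one text (or one speaker).
# Both corpora have 5 texts of 20 tokens and 10 mentions of "dracula" in total.
bursty = [["dracula"] * 10 + ["the"] * 10] + [["the"] * 20 for _ in range(4)]
spread = [["dracula"] * 2 + ["the"] * 18 for _ in range(5)]


def total_count(texts, word):
    """Pooled count of `word`, which is all that a corpus-level test sees."""
    return sum(Counter(text)[word] for text in texts)


def texts_containing(texts, word):
    """Number of texts in which `word` occurs at least once."""
    return sum(1 for text in texts if word in text)


for name, texts in [("bursty", bursty), ("spread", spread)]:
    print(name, total_count(texts, "dracula"), texts_containing(texts, "dracula"))
# prints:
# bursty 10 1
# spread 10 5
```

Any statistic built only on pooled counts treats the two corpora identically, even though the word is characteristic of just one text in the first case and of every text in the second.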

Q2 (check-plus): One measure that does account for this burstiness is the adaptation by corpus linguists of the non-parametric Mann-Whitney rank-sum test. The specific adaptation of this test for text is described in Kilgarriff 2001, section 2.3. Implement this test using a fixed chunk size of 500 and the [SciPy mannwhitneyu function](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mannwhitneyu.html); what are the top ten terms by this measure that are most Republican and most Democratic?
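
One possible starting point is sketched below, under several assumptions that are not part of the original exercise. Each corpus is cut into consecutive chunks of `chunk_size` tokens (a trailing partial chunk is dropped), a word's per-chunk counts in the two corpora are compared with `scipy.stats.mannwhitneyu`, and the sign of the difference in mean per-chunk counts is used to decide which corpus the word favors. The names `chunk_counters` and `mann_whitney_sketch`, the `min_count` filter, and the toy data are all hypothetical choices; this is one reading of the procedure, not necessarily Kilgarriff's exact formulation.

```python
from collections import Counter

from scipy.stats import mannwhitneyu


def chunk_counters(tokens, chunk_size=500):
    """Word counts for each full chunk of `chunk_size` consecutive tokens."""
    n_chunks = len(tokens) // chunk_size
    return [
        Counter(tokens[i * chunk_size:(i + 1) * chunk_size])
        for i in range(n_chunks)
    ]


def mann_whitney_sketch(one_tokens, two_tokens, chunk_size=500, min_count=5):
    """Return {word: (mean_count_difference, p_value)} for sufficiently frequent words."""
    one_chunks = chunk_counters(one_tokens, chunk_size)
    two_chunks = chunk_counters(two_tokens, chunk_size)

    totals = Counter()
    for chunk in one_chunks + two_chunks:
        totals.update(chunk)

    results = {}
    for word, total in totals.items():
        if total < min_count:  # assumption: skip rare words
            continue
        one_freqs = [chunk[word] for chunk in one_chunks]
        two_freqs = [chunk[word] for chunk in two_chunks]
        if len(set(one_freqs + two_freqs)) == 1:
            continue  # every chunk has the same count, so there is nothing to rank
        _, p_value = mannwhitneyu(one_freqs, two_freqs, alternative="two-sided")
        # Assumption: direction is taken from the mean per-chunk count.
        direction = sum(one_freqs) / len(one_freqs) - sum(two_freqs) / len(two_freqs)
        results[word] = (direction, p_value)
    return results


# Toy data with a tiny chunk size, purely for illustration.
toy_one = ["tax", "tax", "a", "b"] * 10
toy_two = ["gun", "a", "b", "c"] * 10
toy_results = mann_whitney_sketch(toy_one, toy_two, chunk_size=4)
for word, (direction, p_value) in sorted(toy_results.items(), key=lambda kv: kv[1][1]):
    print(word, "one" if direction > 0 else "two", p_value)
```

Sorting by p-value and then splitting on the sign of the direction gives one list of terms per corpus. Because the test only sees how often a word appears in each chunk, a word concentrated in a handful of chunks ranks lower than one that is consistently more frequent across the whole corpus.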

In [None]:
def mann_whitney_analysis(one_tokens, two_tokens):
    # your code here
    pass

In [None]:
mann_whitney_analysis(dem_tokens, repub_tokens)