# Homework 4: Applying Dunning's log-likelihood to 19c poetry

I've put my "solutions" to in-class exercises on the Moodle, except for Exercise 2, which has become our homework assignment. 

## Problem 1.

Find 25 words that are overrepresented in poetry reviewed by elite 19c magazines, as compared to other works of poetry that didn't get reviewed in those venues. Also list 25 words that are overrepresented in poetry that didn't get reviewed.

To do this, you'll need to copy over some of the functions from our Week 4 exercises, and also copy over the code from our in-class Exercise #1, editing it so that it divides the corpus.

Here's some code to get us started. I load some modules we're likely to need, and then load the ```poefic``` corpus.

Then I filter the ```poefic``` DataFrame to have only poetry. I'm doing this for two reasons. The first is that I'm a little concerned that the size of the data is posing a problem on some computers. The other, more immediate, reason is that this dataset only has an even distribution of the "reception" variable in poetry. (Almost all the fiction I gave you was reviewed in elite venues.)

In [1]:
import os, csv, math
import pandas as pd
import numpy as np

from collections import Counter

cwd = os.getcwd()
print('Current working directory: ' + cwd + '\n')
      
relativepath = os.path.join('..', 'data', 'weekfour', 'poefic.csv')
poefic = pd.read_csv(relativepath)

# FILTERING BY ROW TO GET ONLY THE POETRY
poefic = poefic[poefic['genre'] == 'poetry']
# equivalent to
    # poefic = poefic.loc[poefic['genre'] == 'poetry', : ]
poefic.index = range(poefic.shape[0])
poefic.head(6)

Current working directory: /Users/rmorriss/Documents/datahum/code



| Unnamed: 0 | date | author | title | genre | reception | text |
|---|---|---|---|---|---|---|
| 0 | 1835 | Browning, Robert, | Paracelsus | poetry | remove | Paracelsus. We 154 PARACELSUS [BOOK III Not ea... |
| 1 | 1833 | Browning, Robert, | Pauline | poetry | remove | all, I sought How best life’s end might be att... |
| 2 | 1855 | Arnold, Matthew, | Poems | poetry | elite | grace, and Wisdom be too proud To halve a lodg... |
| 3 | 1867 | Arnold, Matthew, | New poems | poetry | elite | from the West was then in shade. Ah ! now 'tis... |
| 4 | 1861 | Mangum, A. W. | The holy shield | poetry | vulgar | happy hgme which he had exchange d for the ten... |
| 5 | 1889 | Hopkins, Gerard Manley | Poems of Gerard Manley Hopkins | poetry | addcanon | Randal; How far from then forethought of, all ... |


**A small digression about the code above**

It's worth dwelling for a moment on the statement that does filtering by row. Notice that if you index a pandas DataFrame with a single string, like ```poefic['genre']```, you get a column. But if you generate a series of Boolean values, and use *that* to index the DataFrame, like so,

```poefic[poefic['genre'] == 'poetry']```

you'll be selecting *rows* where the series has a value ```True```.

If it's not clear what I mean by "generating a series of Boolean values," look at the result of the cell below. (You can delete the cell below when you're working on the homework; this is all a digression.) Notice also, in the code above, that you can use the ```.loc``` indexer to specify rows and columns at the same time if you want to. In this case I haven't specified a column for ```.loc``` to select; the ``` : ``` after the comma is a way of saying "all the columns."

# This explanation makes very good sense!  Thanks Ted

In [4]:
elite = poefic['reception'] == 'elite'
elite[:5]

0    False
1    False
2     True
3     True
4    False
Name: reception, dtype: bool
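Boolean series can also be combined, which is handy when Problem 1 asks for only the 'elite' and 'vulgar' rows. What follows is a minimal sketch on an invented toy DataFrame (the name `toy` and its contents are assumptions for illustration, not part of the `poefic` data). It builds two Boolean series, combines them with `&`, and passes the result to `.loc` along with a list of columns.

```python
import pandas as pd

# Invented toy data, standing in for the real corpus.
toy = pd.DataFrame({
    'genre': ['poetry', 'poetry', 'fiction', 'poetry'],
    'reception': ['elite', 'vulgar', 'elite', 'remove'],
    'text': ['a', 'b', 'c', 'd'],
})

# Each comparison produces a Boolean series, one value per row.
is_poetry = toy['genre'] == 'poetry'
is_contrast = toy['reception'].isin(['elite', 'vulgar'])

# & combines the two series element by element; .loc then selects
# the rows where the result is True, and only the listed columns.
subset = toy.loc[is_poetry & is_contrast, ['reception', 'text']]

# Renumber the rows 0, 1, 2, ... after filtering.
subset = subset.reset_index(drop=True)
print(subset)
```

Resetting the index does the same job as the `poefic.index = range(poefic.shape[0])` line in the first cell: after filtering, row labels would otherwise keep their old, gappy numbering.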

In [5]:
# CODE FOR PROBLEM 1

# You'll need to copy over the functions you need: things like "tokenize" will 
# certainly be necessary.

# I recommend removing stopwords, but test, and see what happens if you don't.

# The column 'reception' has several possible values, including 'elite' (was
# reviewed in elite journals), and 'vulgar' (which doesn't mean the poetry was
# obscene, but is just a wry way of saying it didn't turn up in our sample of 
# reviews). You want to contrast these two groups. Leave out other rows, where
# 'reception' has a value like 'remove.'

# After you've run code to produce the top 25 and bottom 25 words, sorted by 
# signed Dunnings, write a few sentences of commentary below.

In [6]:
def tokenize(astring):
    ''' Breaks a string into words, and counts them.
    Designed so it strips punctuation and lowercases everything,
    but doesn't separate hashtags and at-signs.
    '''
    wordcounts = Counter()
    # create a counter to hold the counts
    
    tokens = astring.split()
    for t in tokens:
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        wordcounts[word] += 1
        
    return wordcounts

def addcounters(counter2add, countersum):
    ''' Adds all the counts in counter2add to countersum.
    Because Counters (like dictionaries) are mutable, it
    doesn't need to return anything.
    '''
    
    for key, value in counter2add.items():
        countersum[key] += value

def create_vocab(seq_of_strings, n):
    ''' Given a sequence of text snippets, this function
    returns the n most common words. We'll use this to
    create a limited 'vocabulary'.
    '''
    vocab = Counter()
    for astring in seq_of_strings:
        counts = tokenize(astring)
        addcounters(counts, vocab)
    topn = [x[0] for x in vocab.most_common(n)]
    return topn

# a set of common words is often useful
stopwords = {'a', 'an', 'are', 'and', 'but', 'or', 'that', 'this', 'so', 
             'all', 'at', 'if', 'in', 'i', 'is', 'was', 'by', 'of', 'to', 
             'the', 'be', 'you', 'were'}


        

# First things first: create a vocab!
As in the Trump exercise, we begin by creating a vocabulary of the poems. First we put the text of the poems (a pandas Series) into a variable called `poems_text`.  Next we pass `poems_text` to the `create_vocab` function, and put the result in a variable called `poem_vocab`.  This gives us a plain Python list of the 5,000 most common words.  Converting that list to a set lets us subtract the stopwords; we then turn the result back into a list and print its first 10 words.

In [11]:
poems_text = poefic['text']
poem_vocab = create_vocab(poems_text, 5000)
poem_vocab = list(set(poem_vocab)- stopwords)
poem_vocab[:10]

['',
 'wanton',
 'forever',
 '24',
 'regal',
 'kindling',
 'unfurled',
 'flies',
 'pleasures',
 'sake']
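The empty string in that output most likely comes from tokens made up entirely of punctuation (a bare `--`, for instance), which `tokenize` strips down to nothing and then counts. The sketch below is one possible variant, not the version from the in-class exercises: the names `tokenize_nonempty` and `create_vocab_ordered` are hypothetical, and the toy texts are invented. It skips empty tokens and removes stopwords with a list comprehension, so the vocabulary stays in most-common-first order.

```python
from collections import Counter

# Same characters that tokenize() strips; \u2014 is the long dash.
PUNCT = ',.!?:;-\u2014()<>[]/"\''

def tokenize_nonempty(astring):
    """Like tokenize, but skips tokens that are empty after stripping."""
    wordcounts = Counter()
    for t in astring.split():
        word = t.strip(PUNCT).lower()
        if word:                       # '--' or '!' would otherwise become ''
            wordcounts[word] += 1
    return wordcounts

def create_vocab_ordered(seq_of_strings, n, stopwords):
    """Top-n words, most common first, with stopwords filtered out."""
    vocab = Counter()
    for astring in seq_of_strings:
        vocab.update(tokenize_nonempty(astring))   # update() adds counts
    # Filtering a list (rather than subtracting sets) keeps the ranking.
    return [w for w, _ in vocab.most_common(n) if w not in stopwords]

# Invented example texts.
toy_texts = ["The rose -- the red, red rose!", "A rose is a rose."]
toy_vocab = create_vocab_ordered(toy_texts, 10, {'a', 'the', 'is'})
print(toy_vocab)
```

As in the original cell, stopwords are removed after taking the top `n`, so the result can be shorter than `n`.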

# Next step: divide the poems by reception
Great! I will say that this is a weird list of words; not exactly what I'd expect to be the top 10 words in the vocab of this batch of poems.  (Part of the explanation is that converting to a `set` to drop the stopwords throws away the frequency ordering, so these are just ten arbitrary members of the vocabulary.)  In any case, now we need to classify our poems into their groups. In the case of the Trump exercise, the variable that we used to divide the tweets was the origin of the tweets: iPhone vs. Android.  In this case, we will divide the poems by the values in the reception column.  One group, `elite`, will be those reviewed in elite journals.  The other group, `vulgar`, is the group not reviewed in those journals.

The method is to make a counter for `elite` and one for `vulgar`.  Then figure out the number of rows in our data frame `poefic` and use this to write a for loop (using the range function) that will cycle through the dataframe and pick out the poetry text in each category, then cycle through and add its tokens to the counters that we initiated.  In the end, we'll have two counters corresponding to the two groups.
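Here is a rough sketch of that method on an invented four-row DataFrame, just to show the shape of the loop. The names `toy` and `group_counts` are assumptions for illustration, and `text.lower().split()` is a crude stand-in for the real `tokenize` function. This version walks through the two columns with `zip` rather than indexing by row number, so it doesn't depend on how the rows happen to be labeled.

```python
from collections import Counter
import pandas as pd

# Invented toy data standing in for poefic.
toy = pd.DataFrame({
    'reception': ['elite', 'vulgar', 'remove', 'elite'],
    'text': ['the sea the sky', 'the road', 'ignored words', 'sea foam'],
})

# One Counter per group we want to contrast.
group_counts = {'elite': Counter(), 'vulgar': Counter()}

# zip pairs each reception value with the text in the same row.
for reception, text in zip(toy['reception'], toy['text']):
    if reception in group_counts:          # skips 'remove' and the like
        group_counts[reception].update(text.lower().split())

# Total words in each group, needed later for Dunning's.
elite_total = sum(group_counts['elite'].values())
vulgar_total = sum(group_counts['vulgar'].values())
print(elite_total, vulgar_total)
```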

I will paste in the functions here first, and then move to assemble the loops and code.

In [15]:
def logodds(countsA, countsB, word):
    ''' Straightforward.
    '''
    
    odds = (countsA[word] + 1) / (countsB[word] + 1)
    
    # Why do we add 1 on both sides? Two reasons. The hacky one is 
    # that otherwise we'll get a division-by-zero error whenever
    # word isn't present in countsB. The more principled reason
    # is that this technique (called Laplacian smoothing) tends
    # to reduce the dramatic disproportion likely to be found in
    # very rare words.
    
    return math.log(odds)

def signed_dunnings(countsA, totalA, countsB, totalB, word):
    ''' Less straightforward. This function calculates a signed (+1 / -1)
    version of Dunning's log likelihood. Intuitively, this is a number 
    that gets larger as the frequency of the word in our two corpora
    diverges from its EXPECTED frequency -- i.e., the frequency it would
    have if it were equally distributed over both. But it also tends to get
    larger as the raw frequency of the word increases.
    
    Note that this function requires two additional arguments:
    the total number of words in A and B. We could calculate that inside
    the function, but it's faster to calculate it just once, outside the function.
    
    Also note: the strict definition of Dunnings has no 'sign': it gets bigger
    whether a word is overrepresented in A or B. I've edited that so that Dunnings
    is positive if overrepresented in A, and negative if overrepresented in B.
    '''
    if word not in countsA and word not in countsB:
        return 0
    
    # the raw frequencies of this word in our two corpora
    # still doing a little Laplacian smoothing here
    a = countsA[word] + 0.1
    b = countsB[word] + 0.1
    
    # now let's calculate the expected number of times this
    # word would occur in both if the frequency were constant
    # across both
    overallfreq = (a + b) / (totalA + totalB)
    expectedA = totalA * overallfreq
    expectedB = totalB * overallfreq
    
    # and now the Dunning's formula
    dunning = 2 * ((a * math.log(a / expectedA)) + (b * math.log(b / expectedB)))
    
    if a < expectedA:
        return -dunning
    else:   
        return dunning

# a set of common words is often useful
stopwords = {'a', 'an', 'are', 'and', 'but', 'or', 'that', 'this', 'so', 
             'all', 'at', 'if', 'in', 'i', 'is', 'was', 'by', 'of', 'to', 
             'the', 'be', 'you', 'were'}

# finally, one more function: given a list of tuples like
testlist = [(10, 'ten'), (2000, 'two thousand'), (0, 'zero'), (-1, 'neg one'), (8, 'eight')]
# we're going to want to sort them and print the top n and bottom n

def headandtail(tuplelist, n):
    tuplelist.sort(reverse = True)
    print("TOP VALUES:")
    for i in range(n):
        print(tuplelist[i][1], tuplelist[i][0])
    
    print()
    print("BOTTOM VALUES:")
    lastindex = len(tuplelist) - 1
    for i in range(lastindex, lastindex - n, -1):
        print(tuplelist[i][1], tuplelist[i][0])
        
headandtail(testlist, 2)
    

TOP VALUES:
two thousand 2000
ten 10

BOTTOM VALUES:
neg one -1
zero 0
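To get a feel for how the two measures differ before running them on the poems, it may help to try them on a tiny invented example. The counts and totals below are made up (two imaginary corpora of 10,000 words each), and the two functions are condensed copies of the ones above, with the "word in neither corpus" check left out for brevity.

```python
import math
from collections import Counter

def logodds(countsA, countsB, word):
    # same add-one smoothing as the function above
    return math.log((countsA[word] + 1) / (countsB[word] + 1))

def signed_dunnings(countsA, totalA, countsB, totalB, word):
    # condensed copy of the function above
    a = countsA[word] + 0.1
    b = countsB[word] + 0.1
    overallfreq = (a + b) / (totalA + totalB)
    expectedA = totalA * overallfreq
    expectedB = totalB * overallfreq
    dunning = 2 * (a * math.log(a / expectedA) + b * math.log(b / expectedB))
    return -dunning if a < expectedA else dunning

# Invented counts: only two of the words in each imaginary corpus are shown.
toyA = Counter({'rare': 3, 'common': 300})
toyB = Counter({'common': 150})
totalA = totalB = 10000

for word in ['rare', 'common']:
    print(word,
          round(logodds(toyA, toyB, word), 2),
          round(signed_dunnings(toyA, totalA, toyB, totalB, word), 2))
```

With these invented numbers, log odds ranks 'rare' above 'common' (a smoothed ratio of 4 to 1 versus roughly 2 to 1), while signed Dunning's ranks 'common' far above 'rare', because it also grows with raw frequency. That is the trade-off described in the docstring.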


In [20]:
elite = Counter()
vulgar = Counter()

# numrows = len(poefic['text'])
numrows = poefic.shape[0]
print(numrows)

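# Now write the for loop: tokenize each poem and add its counts to the
# counter that matches its 'reception' value (other values are skipped).
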
for i in range(numrows):
    counts = tokenize(poefic['text'][i])
    if poefic['reception'][i] == 'elite':
        addcounters(counts, elite)
    elif poefic['reception'][i] == 'vulgar':
        addcounters(counts, vulgar)

# Now sum up the values in the counters, which are necessary for the division in the formula.

vulgar_sum = sum(vulgar.values())
elite_sum = sum(elite.values())

# Now write the loop that iterates over the words in `poem_vocab` and passes each one to ```logodds```
# (or to ```signed_dunnings```, by switching which of the two lines is commented out)

tuplelist = []

for word in poem_vocab:
    g = logodds(elite, vulgar, word)
#     g = signed_dunnings(elite, elite_sum, vulgar, vulgar_sum, word)
    tuplelist.append((g, word))

headandtail(tuplelist, 20)

668
TOP VALUES:
isis 3.5263605246161616
osiris 3.367295829986474
typhon 3.091042453358316
eleanore 3.044522437723423
lilac 2.9444389791664403
gareth 2.9444389791664403
dauber 2.9444389791664403
julian 2.772588722239781
hech 2.70805020110221
handsomest 2.70805020110221
wonted 2.6390573296152584
pazon 2.6390573296152584
muza 2.6390573296152584
budget 2.6390573296152584
jane 2.3978952727983707
tragedy 2.2512917986064953
jules 2.1972245773362196
willie 2.0794415416798357
ida 2.0794415416798357
saladin 2.0149030205422647

BOTTOM VALUES:
rosamond -3.044522437723423
laura -2.995732273553991
herodias -2.9444389791664407
emma -2.8622008809294686
mornia's -2.833213344056216
apache -2.772588722239781
aulus -2.70805020110221
diuk -2.70805020110221
santaclaus -2.70805020110221
lyteria -2.639057329615259
philomel -2.4849066497880004
driver -2.3978952727983707
journal -2.3025850929940455
fo -2.1400661634962708
ting -2.1400661634962708
ar -2.0794415416798357
cain -2.0794415416798357
pounds -2.07944154

#### Brief commentary on results.

This isn't a class on 19th-century poetry, so I don't expect you to fully
interpret the results. (As Clarice was rightly suggesting in class, it's
necessary to actually read a few documents before we're in a position to
interpret quantitative patterns.) But you might be able to speculate or
form tentative hypotheses based on a selection of distinctive words.

### Some Interesting Results
I'm so relieved to get the code working after struggling to write that loop last night. (I was hung up for 45 minutes because the indexing wasn't working on poefic! Thanks for the tip.)  Now that it works, I'm looking at the results of the signed Dunning's.  The way I set it up, the words with positive values are the most disproportionately common (or overrepresented) words in the class `elite`, which is to say that they are the most overrepresented words in the poems that were reviewed in elite literary journals.  The words in the second list, the ones with the negative scores, are those words most disproportionately common in poetry not reviewed in elite literary journals.  I'm not sure quite how to characterize the difference between these lists; it seems fair to say that the "elite" poetry has a kind of "ROMANTIC" flavor to it: Egyptian myth, grey, feminine pronouns and names.  The vulgar poetry list contains religious/Christian motifs and first-person plural pronouns.  Not sure what to say.

Interestingly, the logodds function turns up some of the same patterns, but not all. Whatever the differences, though, the same dilemma presents itself with the results of logodds: it's hard to interpret what the separation here represents.
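One way to make "some of the same patterns, but not all" more concrete would be to compare the top-n words from each measure as sets. The sketch below assumes two lists of `(score, word)` tuples like the `tuplelist` built above; the helper name `top_words` is hypothetical, and the scores and words are invented purely for illustration, not taken from the corpus.

```python
def top_words(tuplelist, n):
    """Return the set of words with the n highest scores."""
    ranked = sorted(tuplelist, reverse=True)   # sorts a copy, highest first
    return {word for score, word in ranked[:n]}

# Made-up scores for illustration only; not results from the poems.
toy_logodds = [(3.5, 'rose'), (2.9, 'sea'), (1.0, 'moon'), (0.2, 'star')]
toy_dunnings = [(40.0, 'star'), (25.0, 'moon'), (12.0, 'rose'), (3.0, 'sea')]

top_lo = top_words(toy_logodds, 3)
top_du = top_words(toy_dunnings, 3)

shared = top_lo & top_du          # words both measures put in the top 3
only_logodds = top_lo - top_du    # words only log odds puts there
print(sorted(shared), sorted(only_logodds))
```

The same comparison on the bottom of each list would need the lowest scores instead, for example by sorting without `reverse=True`.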