# Hypothesis testing, songs vs. poems

Song lyrics look a lot like poems. They both have a number of short lines, 
they often rhyme, and they are written with careful attention to sound,
emotion, and aesthetics. Some songs started as poems, like the Star Spangled
Banner. So what distinguishes them?

[Be aware: Song lyrics have not been filtered for content. I expect the 
classroom to remain a respectful place. You are adults and I have no 
doubt that you can deal with this!]

**On paper** Before looking at any data, what do you think will be the differences between 
song lyrics and short poems? Will we be able to distinguish them reliably?

In this script we'll be examining a method for finding distinctive words between
two groups of texts: Dunning's g-test. This method tests whether the difference 
between two proportions (e.g. word frequencies) is significant.

In [None]:
from collections import Counter
import math, re, random
import numpy

## Raw string so the regex escapes (\w, \-) are not read as Python string escapes
word_pattern = re.compile(r"\w[\w\-\']*\w|\w")

songs = []
songs_counter = Counter()
poems = []
poems_counter = Counter()

with open("../data/songs_poems/songs.txt", encoding="utf-8") as songs_reader:
    for line in songs_reader:
        tokens = word_pattern.findall(line)
        songs.append(tokens)
        songs_counter.update(tokens)
    
with open("../data/songs_poems/poems.txt", encoding="utf-8") as poems_reader:
    for line in poems_reader:
        tokens = word_pattern.findall(line)
        poems.append(tokens)
        poems_counter.update(tokens)

word_counts = songs_counter + poems_counter

## All the distinct word types in descending order by frequency
vocabulary = [x[0] for x in word_counts.most_common()]
vocabulary = vocabulary[0:1000]
vocab_size = len(vocabulary)

### Part 1: Song vs Poem Classification

Now let's try classifying documents as poems or songs using a Naive Bayes classifier.
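
As a reference for what the classifier computes, the score for a document $d$ can be written as a sum of smoothed log ratios, with one term per token occurrence in $d$ that is in the vocabulary. The symbols are shorthand introduced for this note, not names from the code: $c_s(w)$ and $c_p(w)$ are the counts of word $w$ in songs and in poems, $N_s$ and $N_p$ are the total token counts of each collection, $\alpha$ is the smoothing constant, and $V$ is the vocabulary size.

$$\mathrm{score}(d) = \sum_{w \in d,\ w \in \text{vocabulary}} \log \frac{(c_s(w) + \alpha) / (N_s + \alpha V)}{(c_p(w) + \alpha) / (N_p + \alpha V)}$$

A positive score favors songs and a negative score favors poems.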

1. What accuracy do you expect if we do a leave-one-out evaluation?

[Response here]

2. Run the `calculate_accuracy()` function. Record the accuracy. Are you surprised?

[Response here]

3. By default I'm classifying based on the 1000 most frequent words. Try three other frequency ranges. For example, to use the 51st through 100th most frequent words, modify the appropriate line to:

    vocabulary = vocabulary[50:100]

How do these affect accuracy? Give examples of the words in each range.

[Response here]

4. Use the `predict()` function to construct a list of (score, poem_tokens) tuples. Sort this list, and find the most song-ey poem and the most poem-ey poem. Do the same for songs.

[Describe extreme poems here]

[Describe extreme songs here]
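
One possible pattern for the sorting step in question 4 is sketched below. The scores and token lists are made-up stand-ins for real `predict()` output, and the names `toy_scored`, `ranked`, `most_poem_like`, and `most_song_like` are hypothetical. Sorting on the score alone means that tied scores never fall back to comparing the token lists.

```python
# Hypothetical (score, tokens) pairs standing in for real predict() output.
toy_scored = [
    (2.5, ["oh", "baby"]),
    (-4.1, ["thou", "art"]),
    (0.3, ["the", "night"]),
]

# Sort by score only, lowest (most poem-like) first.
ranked = sorted(toy_scored, key=lambda pair: pair[0])

most_poem_like = ranked[0]   # lowest score
most_song_like = ranked[-1]  # highest score
```

The same pattern should apply to a list built from the real documents, with each tuple holding the score and the tokens of one document.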


In [None]:
def predict(doc, smoothing=0.01):
    ## loop through each word and add the log ratio
    ## for that word. poem-ish words will have a negative score,
    ##  song-ish words will have a positive score.
    score = 0.0
    
    songs_length = sum(songs_counter.values())
    poems_length = sum(poems_counter.values())
    
    for token in doc:
        if token in vocabulary:
            p_song = (songs_counter[token] + smoothing) / (songs_length + smoothing * vocab_size)
            p_poem = (poems_counter[token] + smoothing) / (poems_length + smoothing * vocab_size)
            score += math.log(p_song / p_poem)
            
    return score

def calculate_accuracy():
    correct = 0.0
    total = 0.0
    
    for doc in songs:
        songs_counter.subtract(doc)
        score = predict(doc)
        if score >= 0.0:
            correct += 1
        total += 1
        songs_counter.update(doc)
    
    for doc in poems:
        poems_counter.subtract(doc)
        score = predict(doc)
        if score < 0.0:
            correct += 1
        total += 1
        poems_counter.update(doc)
    
    return correct / total


def print_nicely(scores):
    for word_info in scores:
        print("{}\t{}\t{}\t{}".format(word_info[0], word_info[1], word_info[2], word_info[3]))

### Part 2: Word comparisons

Dunning's g-test is similar to the *Fightin' Words* plots and Burrows' $z$-scores we were looking at before. It takes into account the number of observations we have (i.e. word counts).

For this script we'll be comparing songs and poems. 
We'll read the documents and then perform Dunning's 
g-test for each term in the overall vocabulary.
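
For reference, the statistic can be written out as follows. Here $x_1$ out of $n_1$ and $x_2$ out of $n_2$ are the two observed counts, matching the arguments of `dunning_score`, with $p_1 = x_1/n_1$, $p_2 = x_2/n_2$, and the pooled proportion $p = (x_1 + x_2)/(n_1 + n_2)$:

$$G = -2 \left[ x_1 \ln\frac{p}{p_1} + (n_1 - x_1) \ln\frac{1-p}{1-p_1} + x_2 \ln\frac{p}{p_2} + (n_2 - x_2) \ln\frac{1-p}{1-p_2} \right]$$

If the two groups really share one underlying proportion, $G$ is approximately chi-squared distributed with one degree of freedom.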

1. Use Markdown syntax to create a contingency table for these values:
    Group 1: 10 out of 50, Group 2: 5 out of 50.

[Answer here]

2. Using the `dunning_score` function, calculate Dunning
   g-scores for the following proportions:
   
   (a) 100/120, 30/55
   
   (b) 100/120, 10/12
   
   (c) 45/100, 105/200
   
   (d) 0/5, 10/25.
   
   How do differences in proportions and differences in counts affect the 
   resulting g-scores?

[Answer here]

3. What does the magnitude of a g-score indicate? (Try adding a zero to every number)

[Answer here]


4. Dunning's g-score tells us whether the difference in counts between two groups is significant, but it doesn't tell us which group uses that word more. Modify the code in `score_differences()` to return a *negative* g-score if a word is more probable in poems than in songs.

[Change in code]

5. Use this command to generate positive/negative Dunning scores:

    word_scores = score_differences(songs_counter, poems_counter)

Copy the output here. Which words are most indicative of songs, and which of
poems? What does that result tell you about poetry and songs? Comment both on
"stopwords" and on more content-bearing words.

[Response here]
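
A note on zero counts: proportion (d) in question 2 has $x_1 = 0$, so `dunning_score` as written divides by `p1 = 0` and raises a `ZeroDivisionError`. A minimal sketch of one way around this is below, assuming the usual convention that a cell with an observed count of zero contributes nothing to the sum. The name `safe_dunning_score` is hypothetical and not part of the notebook. It computes the same statistic in the equivalent observed-versus-expected form.

```python
import math

def safe_dunning_score(x1, n1, x2, n2):
    """G statistic for x1/n1 vs. x2/n2, skipping cells with a zero count."""
    # Pooled proportion under the hypothesis that both groups are the same.
    p = (x1 + x2) / (n1 + n2)

    # The four cells of the 2x2 table: observed and expected counts.
    observed = [x1, n1 - x1, x2, n2 - x2]
    expected = [n1 * p, n1 * (1 - p), n2 * p, n2 * (1 - p)]

    g = 0.0
    for o, e in zip(observed, expected):
        # Convention: 0 * log(0) is treated as 0, so empty cells are skipped.
        if o > 0:
            g += o * math.log(o / e)
    return 2 * g
```

For inputs with no empty cells, this should agree with `dunning_score` up to floating-point rounding, so it can be swapped in when a count might be zero.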

In [None]:
numpy.random.binomial(1000, 0.05, size=2)

In [None]:
dunning_score(500, 10000, 470, 10000)

In [None]:
### Evaluate the "surprise factor" of two proportions that are expressed as counts,
###  i.e. x1 "heads" out of n1 flips compared with x2 "heads" out of n2 flips.
def dunning_score(x1, n1, x2, n2):
    p1 = float(x1) / n1
    p2 = float(x2) / n2
    p = float(x1 + x2) / (n1 + n2)
    
    return -2 * ( x1 * math.log(p / p1) + (n1 - x1) * math.log((1 - p)/(1 - p1)) + 
                  x2 * math.log(p / p2) + (n2 - x2) * math.log((1 - p)/(1 - p2)) )

def score_differences(a_counter, b_counter):
    a_length = sum(a_counter.values())
    b_length = sum(b_counter.values())
    
    shared_vocabulary = a_counter.keys() & b_counter.keys()
    
    scored_words = []
    
    for w in shared_vocabulary:
        a_n = a_counter[w]
        b_n = b_counter[w]
        
        ## Argument order is (count, total, count, total)
        g_score = dunning_score(a_n, a_length, b_n, b_length)
        
        ## The score is always positive, so add in a sign 
        ##  to indicate which proportion is larger.
        ## If the word is more frequent in poems, negate it.
        ## If the word is more frequent in songs, do nothing.
        ### [ADD CODE HERE]
        
        ## Create a tuple containing information about each word
        scored_words.append( (round(g_score, 3), a_n, b_n, w) )
    
    ## Sort once, after all words are scored, highest score first
    scored_words.sort(reverse=True)
    
    return scored_words
