# Week 5 homework.

In class, we built a classifier that detected Trump's authorship of tweets from his account. Repeat that work for the poefic dataset. Build a classifier that distinguishes poetry from fiction. 

Almost all the code you need is in the notebook we used in class. Copy functions from that notebook and paste them here, altering them as necessary so that they use the metadata available in the poefic frame. Include a function that does five-fold cross-validation. Play around with different settings of p (the number of features included in the model) to see how high you can get the accuracy.

Then, at the end of the notebook, write a short paragraph of commentary. How much accuracy do you get? Why do you think that accuracy for this classification task is higher or lower than it was on the Trump tweet data? (You might want to inspect the data itself, using Excel or a text editor.)

In [10]:
import os, csv, math, random
import pandas as pd
import numpy as np

from collections import Counter

cwd = os.getcwd()
print('Current working directory: ' + cwd + '\n')
      
relativepath = os.path.join('..', 'data', 'weekfour', 'poefic.csv')
poefic = pd.read_csv(relativepath)
poefic.head()

Current working directory: /Users/rmorriss/Documents/datahum/code



Unnamed: 0,date,author,title,genre,reception,text
0,1908,"Robins, Elizabeth,",The convert,fiction,elite,"looked like decent artisans, but more who bore..."
1,1871,"Lytton, Edward Bulwer Lytton,",The coming race,fiction,elite,"called the "" Easy Time "" (with which what I ma..."
2,1872,"Butler, Samuel,","Erewhon, or, Over the range",fiction,elite,the curtain ; on this I let it drop and retrea...
3,1900,"Barrie, J. M.",Tommy and Grizel,fiction,elite,"at you !"" he said. ""Dear eyes, "" said she. ""Th..."
4,1873,"Ritchie, Anne Thackeray,",Old Kensington,fiction,elite,"furious; I have not dared tell her, poor creat..."


## Simplify the dataset and divide it into categories and "folds"

In [11]:
def poefic_test(a_data_frame, rowidx):
    if 'fiction' in a_data_frame['genre'][rowidx]:
        return 'fiction'
    elif 'poetry' in a_data_frame['genre'][rowidx]:
        return 'poetry'
    else:
        return 'other'
    
lit_text = poefic['text']

genre = []
fold = []
for idx in poefic.index:
    genre.append(poefic_test(poefic, idx))
    fold.append(random.randrange(5))  # assign each row to one of five folds at random
genre = pd.Series(genre, index = poefic.index)
fold = pd.Series(fold, index = poefic.index)

tdf = pd.concat([lit_text, genre, fold], axis = 1)
tdf.columns = ['text', 'genre', 'fold']

# limit the dataframe to rows labeled either poetry or fiction;
# exclude 'other'
tdf = tdf[(tdf['genre'] == 'poetry') | (tdf['genre'] == 'fiction')]
tdf.head()


Unnamed: 0,text,genre,fold
0,"looked like decent artisans, but more who bore...",fiction,2
1,"called the "" Easy Time "" (with which what I ma...",fiction,0
2,the curtain ; on this I let it drop and retrea...,fiction,0
3,"at you !"" he said. ""Dear eyes, "" said she. ""Th...",fiction,2
4,"furious; I have not dared tell her, poor creat...",fiction,4
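
The loop above draws each row's fold independently, so the folds come out uneven (the cross-validation output further down shows test folds ranging from 179 to 241 rows). A hypothetical alternative, not the approach used in class, is to deal the fold labels out evenly and then shuffle them. The name `balanced_folds` is made up for illustration, and 1027 is the row count implied by the training and test sizes printed in this notebook.

```python
import random
from collections import Counter

def balanced_folds(n_rows, n_folds=5, seed=None):
    """Return a shuffled list of fold labels whose sizes differ by at most one."""
    rng = random.Random(seed)
    # deal labels 0, 1, ..., n_folds - 1 in rotation, then shuffle their order
    labels = [i % n_folds for i in range(n_rows)]
    rng.shuffle(labels)
    return labels

folds = balanced_folds(1027, n_folds=5, seed=0)
fold_sizes = Counter(folds)
print(sorted(fold_sizes.items()))
```

If this were used in the notebook, something like `tdf['fold'] = balanced_folds(len(tdf))` after the filtering step should attach the labels, though I have not run it against the real frame.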


## Divide the dataframe into training set and test set.

In [9]:
testset = tdf[tdf['fold'] == 4]
trainingset = tdf[tdf['fold'] != 4]
print('Training set includes ' + str(trainingset.shape[0]))
print('Test set includes ' + str(testset.shape[0]))

Training set includes 807
Test set includes 220


## Define basic text wrangling functions

In [13]:
def tokenize(astring):
    ''' Breaks a string into words, and counts them.
    Strips leading and trailing punctuation and lowercases
    everything. Tokens made up entirely of punctuation become
    empty strings, which are counted like any other word.
    '''
    wordcounts = Counter()
    # create a counter to hold the counts
    
    tokens = astring.split()
    for t in tokens:
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        wordcounts[word] += 1
        
    return wordcounts

def create_vocab(seq_of_strings, n):
    ''' Given a sequence of text snippets, this function
    returns the n most common words. We'll use this to
    create a limited 'vocabulary'.
    '''
    vocab = Counter()
    for astring in seq_of_strings:
        counts = tokenize(astring)
        vocab.update(counts)
        # update() adds the counts in place; building a new
        # Counter with + on every pass is much slower
    topn = [x[0] for x in vocab.most_common(n)]
    return topn
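
To see what `tokenize` does with the spaced-out punctuation in these scanned texts, here is a small check on a made-up sentence (the sample string is mine, not a row from the dataset). The function is repeated so the block runs on its own. A free-standing semicolon or comma is stripped down to an empty string, and that empty string is counted as if it were a word.

```python
from collections import Counter

def tokenize(astring):
    ''' Same logic as the notebook's tokenize: split on whitespace,
    strip surrounding punctuation, lowercase, and count.
    '''
    wordcounts = Counter()
    for t in astring.split():
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        wordcounts[word] += 1
    return wordcounts

# invented sample with free-standing punctuation, as in the OCR'd snippets
sample = 'the curtain ; on this I let it drop , and "The Curtain" fell.'
counts = tokenize(sample)
print(counts.most_common(3))
print('empty-string tokens:', counts[''])
```

Skipping empty strings inside the loop would keep them out of the vocabulary, but that would be a change to the class code rather than something the notebook currently does.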

## Define the functions for the naive Bayes test

In [14]:
def categorize(df, rowidx):
    if df.loc[rowidx, 'genre'] == 'fiction':
        return 'positive'
    elif df.loc[rowidx, 'genre'] == 'poetry':
        return 'negative'
    else:
        print('error: neither fiction nor poetry')
        return 'other'

def get_priors(df):
    genre_counts = df.groupby('genre').count()['text']
    print(genre_counts)
    positive_odds = genre_counts['fiction'] / genre_counts['poetry']
    negative_odds = genre_counts['poetry'] / genre_counts['fiction']
    return math.log(positive_odds), math.log(negative_odds)

def train_nb_model(df, p):
    vocab = create_vocab(df['text'], p)
    vocabset = set(vocab)
    # we make a set because membership-checking is faster
    # in sets; but we also hold onto the list, which is ordered
    
    positive_prior, negative_prior = get_priors(df)
    
    positive_counts = Counter()
    negative_counts = Counter()
    
    for i in df.index:
        snippet = df['text'][i]
        snippet_counts = tokenize(snippet)
        category = categorize(df, i)
        if category == 'negative':
            negative_counts.update(snippet_counts)
        elif category == 'positive':
            positive_counts.update(snippet_counts)
    
    # Now let's organize these Counters into a DataFrame
    
    negative = pd.Series(1, index = vocab)
    positive = pd.Series(1, index = vocab)
    # notice initializing to 1 -- Laplacian smoothing
    
    for word, count in positive_counts.items():
        if word in vocabset:
            positive[word] += count
    
    for word, count in negative_counts.items():
        if word in vocabset:
            negative[word] += count
    
    all_prob = (negative + positive) / (np.sum(negative) + np.sum(positive))
    
    negative_prob = negative / np.sum(negative)
    positive_prob = positive / np.sum(positive)
    
    # note that when we sum up the negative and positive
    # columns, we are also summing up all the Laplacian 1's
    # we initially added to them
    
    model = pd.concat([negative, positive, all_prob, 
                       negative_prob, positive_prob], axis = 1) 
        
    model.columns = ['neg', 'pos', 'all_prob', 'neg_prob', 'pos_prob']
    
    # The next step is unnecessary, and will not be found in
    # most published versions of naive Bayes. I'm providing it
    # because it may help you understand the logic of the
    # algorithm.
    
    model['neg_norm'] = negative_prob / all_prob
    model['pos_norm'] = positive_prob / all_prob
    
    
    model['log_neg'] = [math.log(x) for x in model['neg_norm']]
    model['log_pos'] = [math.log(x) for x in model['pos_norm']]
    return vocab, positive_prior, negative_prior, model

vocab, positive_prior, negative_prior, model = train_nb_model(trainingset, 1500)
model.head() 
        

genre
fiction    287
poetry     520
Name: text, dtype: int64


Unnamed: 0,neg,pos,all_prob,neg_prob,pos_prob,neg_norm,pos_norm,log_neg,log_pos
the,36621,17403,0.071332,0.075599,0.06376,1.059815,0.893843,0.058094,-0.112225
and,23464,10367,0.04467,0.048438,0.037982,1.084361,0.85028,0.080991,-0.162189
,21344,12402,0.044558,0.044062,0.045438,0.988872,1.019749,-0.01119,0.019557
of,14865,8945,0.031438,0.030687,0.032772,0.976095,1.042425,-0.024195,0.04155
to,11491,9085,0.027168,0.023722,0.033285,0.873139,1.225146,-0.13566,0.20306
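
As a check on how the last four columns are derived, the sketch below recomputes `neg_norm`, `pos_norm`, `log_neg` and `log_pos` for the first row (`the`) from the rounded values printed in the table. Because the inputs are rounded to six decimals, the results should agree with the table only to about four decimal places.

```python
import math

# values copied from the row for 'the' in the table above
neg_prob = 0.075599
pos_prob = 0.063760
all_prob = 0.071332

# ratio of the word's share within each genre to its share overall
neg_norm = neg_prob / all_prob
pos_norm = pos_prob / all_prob

# the logs are what apply_model adds up for each snippet
log_neg = math.log(neg_norm)
log_pos = math.log(pos_norm)

print(round(neg_norm, 4), round(log_neg, 4))
print(round(pos_norm, 4), round(log_pos, 4))
```

A positive log value means the word takes up a larger share of that genre's counted words than of the pooled counts, so in this run `the` counts as mild evidence for poetry (the 'negative' category).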


In [15]:
print(positive_prior, negative_prior)

-0.5943465958158518 0.5943465958158519
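
The two priors are equal and opposite because each is the log of an odds ratio and the two ratios are reciprocals. A small sketch using the training counts printed earlier (287 fiction, 520 poetry) reproduces them:

```python
import math

# genre counts printed by get_priors for this training set
fiction_count = 287
poetry_count = 520

positive_prior = math.log(fiction_count / poetry_count)  # fiction is 'positive'
negative_prior = math.log(poetry_count / fiction_count)  # poetry is 'negative'

print(positive_prior, negative_prior)
```

Because poetry outnumbers fiction, every snippet starts with a small head start toward 'negative' before any words are considered.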


In [16]:
pd.options.mode.chained_assignment = None

def apply_model(vocab, positive_prior, negative_prior, model, testset):
    right = 0
    wrong = 0
    vocabset = set(vocab)
    odds_pos = []
    odds_neg = []

    for i in testset.index:
        odds_positive = positive_prior
        odds_negative = negative_prior
        snippet = testset['text'][i]
        snippet_counts = tokenize(snippet)
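        # note: each vocabulary word contributes once per snippet,
        # however many times it occurs (count is not used here)
        for word, count in snippet_counts.items():
            if word not in vocabset:
                continue
            odds_positive += model.loc[word, 'log_pos']
            odds_negative += model.loc[word, 'log_neg']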
            
        if odds_positive > odds_negative:
            prediction = 'positive'
        else:
            prediction = 'negative'
        
        odds_pos.append(odds_positive)
        odds_neg.append(odds_negative)

        reality = categorize(testset, i)
        if reality != 'positive' and reality != 'negative':
            continue
        elif prediction == reality:
            right += 1
        else:
            wrong += 1

    print("Got " + str(right) + " rows right, and " + str(wrong) + " wrong.")
    accuracy = (right / (wrong + right)) * 100
    print("Accuracy was {0:.2f}%".format(accuracy))
    
    resultset = testset.copy()
    resultset['odds_positive'] = odds_pos
    resultset['odds_negative'] = odds_neg
    resultset = resultset.sort_values(by = 'odds_positive')
    
    return resultset, accuracy

newtestset, accuracy = apply_model(vocab, positive_prior, 
                         negative_prior, model, testset)

Got 208 rows right, and 12 wrong.
Accuracy was 94.55%
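
One way to put that figure in context is to compare it with a classifier that always guesses the more common genre. The sketch below assumes the test fold has roughly the same genre mix as the training set (287 fiction, 520 poetry); the notebook does not check that, so the baseline is approximate.

```python
# training counts printed by get_priors
fiction_train = 287
poetry_train = 520

# always guessing the majority genre would be right this often,
# assuming the test fold has the same mix as the training set
majority_share = max(fiction_train, poetry_train) / (fiction_train + poetry_train)
baseline_accuracy = majority_share * 100

# the result reported by apply_model above
model_accuracy = 208 / (208 + 12) * 100

print("Majority-class baseline: {0:.2f}%".format(baseline_accuracy))
print("Naive Bayes model:       {0:.2f}%".format(model_accuracy))
```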


In [18]:
newtestset.head(20)

Unnamed: 0,text,genre,fold,odds_positive,odds_negative
973,"leaves unfold, The clouds at evening glow in g...",poetry,4,-241.573742,54.420104
551,scarce two moons measured round; Thou hadst no...,poetry,4,-237.371775,48.926737
647,"treading now ; I love to linger near, and feel...",poetry,4,-236.805949,52.963928
529,a desert dark and dreary With fragant flowers ...,poetry,4,-221.945342,47.5925
469,"first light of morn j , There his antitype, Ch...",poetry,4,-221.936463,47.798253
918,"read to them. It is the noon. How still, how c...",poetry,4,-218.749743,46.211941
604,"Those whom she meeteth mourning, for her heart...",poetry,4,-213.966473,40.43456
655,"her shrine, And passion burnt her incense. Vis...",poetry,4,-211.866119,44.084293
989,! withstand the subtle foe. Arrest its progres...,poetry,4,-211.339189,37.097206
578,"was ours, Shall unto thee remain? Thy wealth, ...",poetry,4,-210.114811,41.213183


## Some Reflections
What we see here is a very accurate model, although I passed it quite a large value of p, giving it the top 1500 words in the vocabulary to work with. Training took a very long time, even on my relatively powerful computer, and I hesitate to go above 1500 for fear of crashing the Jupyter notebook. Still, nearly 95% accuracy is pretty great. One thing I am not sure I understand yet is whether testing on an "unseen" part of the dataset means that we're out of the woods when it comes to the risk of "overfitting." My sense is that it does.

One way to explore that question, of course, is to run cross-validation over the whole dataset, training on successive four-fifths of the data and then testing on the one-fifth that we hold out. So let's try five-fold cross-validation.


In [19]:
def fivefold_crossvalidate(tdf, p):
    accuracies = []
    for i in range(5):
        # hold out fold i for testing and train on the other four
        testset = tdf[tdf['fold'] == i]
        trainingset = tdf[tdf['fold'] != i]
        vocab, positive_prior, negative_prior, model = train_nb_model(trainingset, p)
        newtestset, accuracy = apply_model(vocab, positive_prior, negative_prior, model, testset)
        accuracies.append(accuracy)

    # report the average once, after all five folds have run
    avg_acc = sum(accuracies) / len(accuracies)
    print("The average accuracy was " + str(avg_acc))
    return avg_acc

avg_acc = fivefold_crossvalidate(tdf, 2000)


genre
fiction    285
poetry     501
Name: text, dtype: int64
Got 229 rows right, and 12 wrong.
Accuracy was 95.02%
genre
fiction    290
poetry     540
Name: text, dtype: int64
Got 189 rows right, and 8 wrong.
Accuracy was 95.94%
genre
fiction    304
poetry     544
Name: text, dtype: int64
Got 171 rows right, and 8 wrong.
Accuracy was 95.53%
genre
fiction    287
poetry     533
Name: text, dtype: int64
Got 194 rows right, and 13 wrong.
Accuracy was 93.72%
genre
fiction    270
poetry     554
Name: text, dtype: int64
Got 196 rows right, and 7 wrong.
Accuracy was 96.55%
The average accuracy was 95.35241806811646


## Pretty Impressive!
The model works very well to distinguish the poetry from the fiction. At p = 2000, of course, the algorithm takes a long time to run. That said, I'm impressed that naive Bayes can distinguish the two so easily and clearly. As for why, my intuition is that these snippets are much longer than the tweets and contain many more tokens. Each word contributes a small piece of evidence for one genre or the other, and as the algorithm multiplies these probabilities together (in practice, adds their logarithms) to update the prior into a posterior, the evidence accumulates and the scores for the two genres move further and further apart. I am still working on the intuition for why this works. I have read the [article](https://arbital.com/p/bayes_rule_odds/?l=1x8&pathId=11662) at arbital.com, and I'm reading different textbooks too, to get this into my head. I'm close but not quite there. I hope we can discuss it a little more at the beginning of class on Tuesday; that would be helpful.
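
To test that intuition about length, here is a toy simulation. Everything in it is invented for illustration: the four-word vocabulary, the word probabilities, and the two lengths (20 tokens standing in for a tweet, 1000 for a snippet). It also sums over every token, whereas `apply_model` above adds each distinct vocabulary word once, so it is a sketch of the general idea rather than a copy of the notebook's scoring.

```python
import math
import random

# hypothetical word probabilities for the two genres (made up)
poetry_probs = {'the': 0.40, 'and': 0.30, 'thy': 0.20, 'said': 0.10}
fiction_probs = {'the': 0.40, 'and': 0.25, 'thy': 0.05, 'said': 0.30}

def log_odds_poetry(tokens):
    """Sum of per-token log ratios; positive totals favor poetry."""
    return sum(math.log(poetry_probs[t] / fiction_probs[t]) for t in tokens)

def mean_log_odds(n_tokens, n_docs=200, seed=0):
    """Average log odds over simulated poetry texts of a given length."""
    rng = random.Random(seed)
    words = list(poetry_probs)
    weights = [poetry_probs[w] for w in words]
    total = 0.0
    for _ in range(n_docs):
        # draw a fake poetry text, one word at a time
        doc = rng.choices(words, weights=weights, k=n_tokens)
        total += log_odds_poetry(doc)
    return total / n_docs

tweet_like = mean_log_odds(20)
snippet_like = mean_log_odds(1000)

print("20 tokens:   {0:.1f}".format(tweet_like))
print("1000 tokens: {0:.1f}".format(snippet_like))
```

Under these assumptions the average evidence grows roughly in proportion to the number of tokens, so a longer text ends up much further from the decision boundary than a short one with the same word habits.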

