# Week 4: The geometry of meaning

We're going to explore some basic forms of text analysis, using David Robinson's dataset of tweets made from the account of Donald J. Trump, as well as a dataset of nineteenth-century poetry and fiction, which is divided by date, by genre, and also by reception (whether or not the volume got reviewed in an 'elite' journal).

To begin, let's import some modules we're going to need later, and also read in the Trump data.

In [5]:
import os, csv, math
import pandas as pd
import numpy as np

from collections import Counter
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import Normalizer

cwd = os.getcwd()
print('Current working directory: ' + cwd + '\n')
      
relativepath = os.path.join('..', 'data', 'weekfour', 'trump.csv')
trump = pd.read_csv(relativepath)

Current working directory: /Users/rmorriss/Documents/datahum/code



## Different ways of identifying "distinctive" words

In this section we'll explore Dunning's log-likelihood, and also think about the strengths and weaknesses of "distinctive" words as evidence.

First let's glance at the Trump dataset.

In [6]:
trump.head()

Unnamed: 0.2,Unnamed: 0,Unnamed: 0.1,text,favorited,favoriteCount,replyToSN,created,truncated,replyToSID,id,replyToUID,statusSource,screenName,retweetCount,isRetweet,retweeted,longitude,latitude
0,0,1,My economic policy speech will be carried live...,False,9214,,2016-08-08 15:20:44,False,,762669882571980801,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,3107,False,False,,
1,1,2,"Join me in Fayetteville, North Carolina tomorr...",False,6981,,2016-08-08 13:28:20,False,,762641595439190016,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,2390,False,False,,
2,2,3,"#ICYMI: ""Will Media Apologize to Trump?"" https...",False,15724,,2016-08-08 00:05:54,False,,762439658911338496,,"<a href=""http://twitter.com/download/iphone"" r...",realDonaldTrump,6691,False,False,,
3,3,4,"Michael Morell, the lightweight former Acting ...",False,19837,,2016-08-07 23:09:08,False,,762425371874557952,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,6402,False,False,,
4,4,5,The media is going crazy. They totally distort...,False,34051,,2016-08-07 21:31:46,False,,762400869858115588,,"<a href=""http://twitter.com/download/android"" ...",realDonaldTrump,11717,False,False,,


#### Basic functions

For a lot of the work we do today, we're going to want to construct dictionaries that hold the frequencies of words in different categories: poetry or fiction, Trump-iphone or Trump-android. To do this we'll need to break text into words, count the words in each text, and then add up the counts by category.

Let's define some functions that do this. (You can find more polished versions of these functions in the `nltk` module.)


In [7]:
def tokenize(astring):
    ''' Breaks a string into words, and counts them.
    Designed so it strips punctuation and lowercases everything,
    but doesn't separate hashtags and at-signs.
    '''
    wordcounts = Counter()
    # create a counter to hold the counts
    
    tokens = astring.split()
    for t in tokens:
        word = t.strip(',.!?:;-—()<>[]/"\'').lower()
        if word:
            # skip tokens that were nothing but punctuation
            wordcounts[word] += 1
        
    return wordcounts

def addcounters(counter2add, countersum):
    ''' Adds all the counts in counter2add to countersum.
    Because Counters (like dictionaries) are mutable, it
    doesn't need to return anything.
    '''
    
    for key, value in counter2add.items():
        countersum[key] += value

def create_vocab(seq_of_strings, n):
    ''' Given a sequence of text snippets, this function
    returns the n most common words. We'll use this to
    create a limited 'vocabulary'.
    '''
    vocab = Counter()
    for astring in seq_of_strings:
        counts = tokenize(astring)
        addcounters(counts, vocab)
    topn = [x[0] for x in vocab.most_common(n)]
    return topn

# Let's test the vocabulary function.
vocab = create_vocab(trump['text'], 4000)
vocab[0:10]
        

['the', 'to', 'and', 'a', 'in', 'is', 'i', 'you', 'of', 'will']
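
As an aside, `Counter` already supports this kind of accumulation: `Counter.update` adds counts in place, which is what `addcounters` does by hand. Here is a minimal sketch on two invented snippets; `simple_tokenize` is a hypothetical stand-in for `tokenize` above, not a replacement for it.

```python
from collections import Counter

def simple_tokenize(astring):
    # hypothetical stand-in for tokenize(): lowercase each token and
    # strip punctuation from its edges, dropping anything left empty
    words = (t.strip(',.!?:;-()<>[]/"\'').lower() for t in astring.split())
    return Counter(w for w in words if w)

# invented example strings, not taken from the dataset
snippets = ['Make it great. Great!', 'It is great, #really']

total = Counter()
for s in snippets:
    total.update(simple_tokenize(s))   # in-place addition, like addcounters

print(total.most_common(2))
```

Either approach works; writing `addcounters` out by hand just makes the mutation explicit.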

#### A few more basic functions

Once we have a vocabulary, we're going to want to divide our texts into categories, create Counters summing the word frequencies in those categories, and then compare the two Counters to find words that are overrepresented in one category relative to the other.

There are several ways we could define "overrepresented." We'll use Robinson's simple log-odds measure, as well as Dunning's log-likelihood.

In [8]:
def logodds(countsA, countsB, word):
    ''' Straightforward: the log of the ratio between the smoothed
    counts of word in A and in B. Note that it compares raw counts,
    without adjusting for the sizes of the two corpora.
    '''
    
    odds = (countsA[word] + 1) / (countsB[word] + 1)
    
    # Why do we add 1 on both sides? Two reasons. The hacky one is 
    # that otherwise we'll get a division-by-zero error whenever
    # word isn't present in countsB. The more principled reason
    # is that this technique (called Laplace, or add-one, smoothing)
    # tends to reduce the dramatic disproportion likely to be found
    # in very rare words.
    
    return math.log(odds)

def signed_dunnings(countsA, totalA, countsB, totalB, word):
    ''' Less straightforward. This function calculates a signed (+1 / -1)
    version of Dunning's log likelihood. Intuitively, this is a number 
    that gets larger as the frequency of the word in our two corpora
    diverges from its EXPECTED frequency -- i.e., the frequency it would
    have if it were equally distributed over both. But it also tends to get
    larger as the raw frequency of the word increases.
    
    Note that this function requires two additional arguments:
    the total number of words in A and B. We could calculate that inside
    the function, but it's faster to calculate it just once, outside the function.
    
    Also note: the strict definition of Dunning's has no 'sign': it gets bigger
    whether a word is overrepresented in A or B. I've edited that so that Dunning's
    is positive if overrepresented in A, and negative if overrepresented in B.
    '''
    if word not in countsA and word not in countsB:
        return 0
    
    # the raw frequencies of this word in our two corpora
    # still doing a little additive smoothing here (0.1 rather than 1)
    a = countsA[word] + 0.1
    b = countsB[word] + 0.1
    
    # now let's calculate the expected number of times this
    # word would occur in both if the frequency were constant
    # across both
    overallfreq = (a + b) / (totalA + totalB)
    expectedA = totalA * overallfreq
    expectedB = totalB * overallfreq
    
    # and now the Dunning's formula
    dunning = 2 * ((a * math.log(a / expectedA)) + (b * math.log(b / expectedB)))
    
    if a < expectedA:
        return -dunning
    else:   
        return dunning

# a set of common words is often useful
stopwords = {'a', 'an', 'are', 'and', 'but', 'or', 'that', 'this', 'so', 
             'all', 'at', 'if', 'in', 'i', 'is', 'was', 'by', 'of', 'to', 
             'the', 'be', 'you', 'were'}

# finally, one more function: given a list of tuples like
testlist = [(10, 'ten'), (2000, 'two thousand'), (0, 'zero'), (-1, 'neg one'), (8, 'eight')]
# we're going to want to sort them and print the top n and bottom n

def headandtail(tuplelist, n):
    tuplelist.sort(reverse = True)
    print("TOP VALUES:")
    for i in range(n):
        print(tuplelist[i][1], tuplelist[i][0])
    
    print()
    print("BOTTOM VALUES:")
    lastindex = len(tuplelist) - 1
    for i in range(lastindex, lastindex - n, -1):
        print(tuplelist[i][1], tuplelist[i][0])
        
headandtail(testlist, 2)
    

TOP VALUES:
two thousand 2000
ten 10

BOTTOM VALUES:
neg one -1
zero 0
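
Before the exercise, it may help to see how the two measures differ on a toy case. The sketch below uses invented counts (the words, corpora, and totals are made up), and the two functions are compact restatements of `logodds` and `signed_dunnings` above. The rare word (2 vs. 0) and the common word (299 vs. 99) get the same log-odds, ln 3, but Dunning's is far larger for the common word, because it also grows with raw frequency.

```python
import math
from collections import Counter

def logodds(countsA, countsB, word):
    return math.log((countsA[word] + 1) / (countsB[word] + 1))

def signed_dunnings(countsA, totalA, countsB, totalB, word):
    a = countsA[word] + 0.1
    b = countsB[word] + 0.1
    overallfreq = (a + b) / (totalA + totalB)
    expectedA = totalA * overallfreq
    expectedB = totalB * overallfreq
    g = 2 * (a * math.log(a / expectedA) + b * math.log(b / expectedB))
    return -g if a < expectedA else g

# invented corpora of 10,000 words each; 'other' is filler
corpusA = Counter({'rareword': 2, 'commonword': 299, 'other': 9699})
corpusB = Counter({'commonword': 99, 'other': 9901})
totalA = sum(corpusA.values())
totalB = sum(corpusB.values())

for w in ['rareword', 'commonword']:
    lo = logodds(corpusA, corpusB, w)
    g = signed_dunnings(corpusA, totalA, corpusB, totalB, w)
    print(w, round(lo, 3), round(g, 1))
```

Swapping the order of the two corpora flips the sign of both measures.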


## Exercise 1: Is Dunning's a better measure than logodds for Trump's tweets?

Let's put all these functions together to answer that question.

I've sketched the outline of a program below in "pseudocode," which
describes what needs to be done. Translate that into real Python code, using
the functions defined above. First use Robinson's logodds function and try to
replicate his results. See what happens if you do (or don't) remove stopwords
and tweets that begin with a quote.
                                                   
Then edit your code to use Dunning's log likelihood. Does that seem to be a better (more revealing) measure of overrepresentation? How would we decide?
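
The quote-filtering step mentioned above could be sketched as follows. The three tweets are invented, and the sketch assumes that a quoted tweet starts with a straight double-quote character; the real data may also contain curly quotes, which this would miss.

```python
import pandas as pd

# invented tweets standing in for trump['text']
toy = pd.DataFrame({'text': ['"Great rally tonight" - thank you!',
                             'Join me in Ohio tomorrow',
                             '"@fan: you will win" Thanks']})

# Boolean mask: True where the tweet begins with a double quote
starts_with_quote = toy['text'].str.startswith('"')

# reset_index renumbers the remaining rows 0..n-1, so a loop that
# indexes rows with range(len(df)) still works after filtering
unquoted = toy[~starts_with_quote].reset_index(drop=True)
print(unquoted['text'].tolist())
```

Running the exercise once on the full frame and once on the filtered one is a quick way to see how much the quoted tweets affect the results.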

In [9]:
trump['text'][:5]

0    My economic policy speech will be carried live...
1    Join me in Fayetteville, North Carolina tomorr...
2    #ICYMI: "Will Media Apologize to Trump?" https...
3    Michael Morell, the lightweight former Acting ...
4    The media is going crazy. They totally distort...
Name: text, dtype: object

In [1]:
# Code for Exercise 1

# Start by creating a vocabulary for words in the Trump tweets.
# Put it in a variable called 'vocab'.
# Remember the function create_vocab takes two arguments:
# (seq_of_strings, n)
# We can afford to include all the words, so set n to 5000.

trump_text = trump['text']
vocab = create_vocab(trump_text, 5000)

# An optional step: removing stopwords
# (a list comprehension keeps the vocabulary in frequency order)
vocab = [word for word in vocab if word not in stopwords]

# Create counters for the android and iphone corpora.

android = Counter()
iphone = Counter()

# Figure out how many rows are in the Trump DataFrame
# and put that number in a variable like 'numrows.'
# Then iterate through the 'text' column of the data frame.

numrows = len(trump)

# for each text cell, get a Counter with words counts for that cell
# then add those counts either to iphone or android, like so:

for i in range(numrows):
    counts = tokenize(trump['text'][i])
    if 'iphone' in trump['statusSource'][i]:
        addcounters(counts, iphone)
    elif 'android' in trump['statusSource'][i]:
        addcounters(counts, android)

        
# When you get around to running Dunning's, you'll need to
# create variables that hold the total count of *all words*
# in iphone and android.
total_iphone = sum(iphone.values())
total_android = sum(android.values())

# Create an empty list to hold pairs of (overrepresentation_measure, word)
# Then iterate through your vocabulary. For each word, measure 
# overrepresentation using either logodds or signed_dunnings.
# Create a tuple, (overrepresentation_measure, word)
# and append it to the empty list you created.

new_list = []
for word in vocab:
    # To use log-odds instead, swap in the next line:
    # g = logodds(iphone, android, word)
    g = signed_dunnings(iphone, total_iphone, android, total_android, word)
    new_list.append((g, word))

# Finally use the headandtail function to display the top 25 and bottom 25
# words in your tuplelist.
headandtail(new_list, 25)


## Exercise 2: Apply the same methods to a more literary dataset.

I've also provided a dataset of roughly 1026 snippets from nineteenth-century poetry and fiction. The code below should read it in. Run that, then copy and paste the code you worked up for Trump, and edit it so it provides the most distinctive words for poetry (versus fiction).

If we have time, it may also be worth distinguishing poetry reviewed in elite journals from poetry that wasn't.


In [50]:
relativepath = os.path.join('..', 'data', 'weekfour', 'poefic.csv')
poefic = pd.read_csv(relativepath)
poefic.head()

Unnamed: 0,date,author,title,genre,reception,text
0,1908,"Robins, Elizabeth,",The convert,fiction,elite,"looked like decent artisans, but more who bore..."
1,1871,"Lytton, Edward Bulwer Lytton,",The coming race,fiction,elite,"called the "" Easy Time "" (with which what I ma..."
2,1872,"Butler, Samuel,","Erewhon, or, Over the range",fiction,elite,the curtain ; on this I let it drop and retrea...
3,1900,"Barrie, J. M.",Tommy and Grizel,fiction,elite,"at you !"" he said. ""Dear eyes, "" said she. ""Th..."
4,1873,"Ritchie, Anne Thackeray,",Old Kensington,fiction,elite,"furious; I have not dared tell her, poor creat..."


In [10]:
# Code for Exercise 2

# The main thing you will need to change is the code that
# identifies rows as belonging to one of two corpora.
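
One possible way to make that change is sketched below on an invented four-row frame that only mimics the `genre` and `text` columns of `poefic`. The rows are made up, plain `split()` stands in for `tokenize`, and the sketch assumes the genre labels are `'poetry'` and `'fiction'`; check `poefic['genre'].unique()` before relying on that.

```python
from collections import Counter
import pandas as pd

# invented rows standing in for poefic
toy = pd.DataFrame({
    'genre': ['poetry', 'fiction', 'poetry', 'fiction'],
    'text': ['o sweet heart', 'he said the door', 'sweet soul', 'she said yes'],
})

poetry, fiction = Counter(), Counter()

# zip over the two columns instead of indexing by row number
for genre, text in zip(toy['genre'], toy['text']):
    counts = Counter(text.split())      # stand-in for tokenize(text)
    if genre == 'poetry':
        poetry.update(counts)
    elif genre == 'fiction':
        fiction.update(counts)

print(poetry.most_common(1), fiction.most_common(1))
```

The same pattern should work for the `reception` column if you first restrict the frame to poetry.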


## Using corpora to create a "meaning space."

Contrasting two corpora can be revealing, but sometimes we want to think about the relations between individual texts or words. To do that, we often represent them as vectors in a multi-dimensional space.

The simplest way to do this is to create a DataFrame where rows are documents and columns are words: a document-term matrix. Here's a function that does that. It requires a pre-defined vocabulary (a list of words) as well as a list (or numpy vector) of texts.


In [32]:
def doc_term_matrix(vocab, textvector):
    ''' Transform the textvector into a document-term matrix
    with one column for each word in vocab.
    '''
    
    n = len(textvector)
    vocabset = set(vocab)
    # making a set so we can check membership quickly;
    # it's much faster in a set than in a list
    
    termdictionary = dict()
    for word in vocab:
        termdictionary[word] = np.zeros(n)
    for i, text in enumerate(textvector):
        counts = tokenize(text)
        for word, count in counts.items():
            if word in vocabset:
                termdictionary[word][i] += count
    
    dtmatrix = pd.DataFrame(termdictionary, columns = vocab)
    return dtmatrix

# A nice arcane trick to perform on a document-term matrix
# is to squash it into a smaller number of dimensions. This
# often reveals relationships between words that don't
# necessarily, literally occur together. The technique is called
# Latent Semantic Analysis.

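def lsa_matrix(dtmatrix, vocab, number_of_dimensions):
    ''' Fit a truncated SVD to the document-term matrix and return
    its components as a DataFrame with one row per dimension and
    one column per word in vocab.
    '''
    lsa = TruncatedSVD(number_of_dimensions, algorithm = 'arpack')
    # We only need the fitted components (dimensions x words);
    # the transformed documents aren't used below.
    lsa.fit(dtmatrix)
    lsamatrix = pd.DataFrame(lsa.components_, columns = vocab)
    
    return lsamatrix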

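def cosine_similarity(vector1, vector2):
    ''' Cosine of the angle between two vectors: their dot product
    divided by the product of their magnitudes. Returns 0 if either
    vector is all zeros.
    '''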
    dot_product = np.sum(vector1 * vector2)
    # we assume these are numpy vectors and can be
    # multiplied elementwise
    
    magnitude = math.sqrt(sum([val**2 for val in vector1])) * math.sqrt(sum([val**2 for val in vector2]))
    if not magnitude:
        return 0
    else:
        return dot_product/magnitude
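
Here is a toy illustration of why the squashing can help. It calls `numpy.linalg.svd` directly rather than scikit-learn's `TruncatedSVD` (a substitution made to keep the sketch small), and the three documents are invented. 'car' and 'automobile' never share a document, so their raw columns have cosine similarity 0; but both co-occur with 'engine', and in the two-dimensional space they end up pointing the same way.

```python
import numpy as np

words = ['car', 'automobile', 'engine', 'flower', 'petal']

# rows are documents, columns are words (invented counts)
dtm = np.array([[1, 0, 1, 0, 0],    # "car engine"
                [0, 1, 1, 0, 0],    # "automobile engine"
                [0, 0, 0, 1, 1]],   # "flower petal"
               dtype=float)

def cosine(u, v):
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0

# Keep the two strongest dimensions. The rows of vt play the same
# role (up to sign) as the components returned by lsa_matrix above.
_, _, vt = np.linalg.svd(dtm, full_matrices=False)
reduced = vt[:2]

car, auto = words.index('car'), words.index('automobile')
raw_sim = cosine(dtm[:, car], dtm[:, auto])
lsa_sim = cosine(reduced[:, car], reduced[:, auto])
print(round(raw_sim, 3), round(lsa_sim, 3))
```

On real data the effect is much less clean, which is part of what Exercise 3 asks you to judge.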

## Exercise 3: Finding words that are close in "meaning space."

Following Widdows, we'll measure semantic similarity as the cosine similarity between vectors defined by a word's distribution across documents.
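
Concretely, each word becomes a vector with one entry per document, and cosine similarity compares the direction of two such vectors while ignoring their length. A small sketch with invented counts over four documents (the numbers are made up; the word labels are only for flavor):

```python
import math

def cosine(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0

# invented counts of each word in four documents
love  = [4, 0, 2, 0]
heart = [2, 0, 1, 0]   # rarer than 'love', but distributed the same way
vance = [0, 3, 0, 0]   # appears only where 'love' does not

print(cosine(love, heart))   # very close to 1: same direction
print(cosine(love, vance))   # 0: no shared documents
```

Because length is ignored, a rare word and a common word can be close neighbors as long as they turn up in the same documents in similar proportions.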

Let's try this both in the space defined by Trump tweets and in the space defined by nineteenth-century literature.

In [59]:
# Code for exercise 3

# Let's start by getting a vocabulary,
# and a doc-term matrix, as well as a
# squashed (lsa) version of that matrix.
relativepath = os.path.join('..', 'data', 'weekfour', 'poefic.csv')
poefic = pd.read_csv(relativepath)
# vocab = create_vocab(trump['text'], 5000)
vocab = create_vocab(poefic['text'], 5000)
dtm = doc_term_matrix(vocab, poefic['text'])
lsa = lsa_matrix(dtm, vocab, 25)

# Now write a function called find_matches
# that prints the 10 closest, and
# 10 weakest, matches for a given word in a given matrix
# Your function should take a matrix and a word as
# arguments.
def find_matches(amatrix, word):
    # Use the word to get the vector associated with that column.
    columnA = amatrix[word]

    # The vocabulary can be read off the columns.
    vocab = amatrix.columns.values

    # Create a list to hold tuples, as we did before.
    tuplelist = []

    # Iterate through all columns, and for each column, check the
    # cosine similarity between that column and our word. In each
    # case, add (cosine similarity, word) to the tuple list.
    for w in vocab:
        columnB = amatrix[w]
        cosinesim = cosine_similarity(columnA, columnB)
        tuplelist.append((cosinesim, w))

    # Then use headandtail to display the closest and weakest matches.
    headandtail(tuplelist, 10)


# Once you've defined this function (and run this cell),
# the cell below should allow you to select words
# and find matches. You can start by looking for
# matches in the doc-term matrix, but then
# branch out to the lsa matrix. See if that's
# better or worse.

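# Then apply the same technique to the Trump tweets: swap in the
# commented-out vocab line above and build the doc-term matrix
# from trump['text'] instead of poefic['text'].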


In [60]:
user_word = input('word? ').strip().lower()  # the vocabulary is lowercase
find_matches(dtm, user_word)



word? love
TOP VALUES:
love 1.0
heart 0.6338019727563365
my 0.5790713199496025
pb 0.5781646004562426
and 0.5737058509180477
when 0.5558831899652108
soul 0.5553625563278963
for 0.5536234172042541
all 0.5490887747443086
sweet 0.5488605574551916

BOTTOM VALUES:
isis 0.0
jago 0.0
lothair 0.0
malcolm 0.0
steelman 0.0
tankney 0.0
typhon 0.0
vance 0.0
، 0.0
ا 0.0
