# Hickenlooper and Gardner E-Mail Fundraising Analysis

How do John Hickenlooper and Cory Gardner message themselves to potential donors? 

That's the question I wanted to answer. According to [political science research](https://www.tandfonline.com/eprint/g6tVMXwGRKZZQbNmYusB/full), political candidates tend to message themselves in more partisan ways and ask for money more often in e-mails than they do in, say, TV ads.

Reading the e-mails, you can also spot some stylistic differences between Gardner's and Hickenlooper's messages. So I wanted to use quantitative methods to document how the two candidates are messaging themselves in political fundraising e-mails.

## Set-up

First, I'm going to load in the data I prepared using the code in the README.

In [1]:
import datetime

import altair as alt
import pandas as pd
import pytz
import sys; sys.path.append("..")

COLORADO = "America/Denver"
fundraising = pd.read_csv("../data/clean/fundraising_emails.csv", index_col=0)
fundraising["date"] = pd.to_datetime(fundraising.date)
fundraising.head(2)

Unnamed: 0,candidate,valid_message,slug,subject,from,date,text,html,clean_text
0,gardner,True,gardner-206,Re-upping this in your inbox,info@coryforco.com,2019-11-20 08:15:59-07:00,Re-upping this in your inbox in case you happe...,<!doctype html>\n<html>\n<head>\n\t<style type...,Re-upping this in your inbox. Re-upping this i...
1,gardner,True,gardner-254,Your Quarterly Status Update,info@coryforco.com,2019-09-28 08:15:40-06:00,Will you step up now to defend the Senate next...,"<!DOCTYPE html PUBLIC ""-//W3C//DTD XHTML 1.0 S...",Your Quarterly Status Update. Will you step up...


### Basic Statistics

Now, I'm going to compute some basic statistics about the data. I used `describe` from `pandas`, plus `min` and `max` on the date column, to do this.

In [2]:
fundraising.groupby("candidate").describe(datetime_is_numeric=True).T

Unnamed: 0,candidate,gardner,hickenlooper
valid_message,count,309,375
valid_message,unique,1,1
valid_message,top,True,True
valid_message,freq,309,375
slug,count,309,375
slug,unique,309,375
slug,top,gardner-202,hickenlooper-464
slug,freq,1,1
subject,count,309,375
subject,unique,302,361


In [3]:
fundraising.date.min()

datetime.datetime(2019, 8, 22, 7, 54, 5, tzinfo=tzoffset(None, -21600))

In [4]:
fundraising.date.max()

datetime.datetime(2020, 7, 16, 9, 51, 14, tzinfo=tzoffset(None, -21600))

## N-Grams

From here, I created indexes showing the number of times different words or phrases appeared in Hickenlooper's and Gardner's emails. This code does two things. First, it "tokenizes" the emails, converting each document from a string into a list of strings that roughly represent words. More specifically, it tokenizes the documents using a series of regular expressions and then reduces words to their root forms using WordNet with a part-of-speech tagger. That converts words like "dogs" to their root form ("dog" in this case), taking the word's part of speech into account. For some types of tokens that these methods can't really make sense of, such as prices, I additionally converted the tokens into standard tags (for example, `_<prices>_`).

Then, I built indexes for single words, or *unigrams*; for two-word phrases, or *bigrams*; for three-word phrases, or *trigrams*; and for four-word phrases, or 4-grams. This allows me to find the raw frequencies of multi-word phrases throughout the document set, as well as to find the percentage of documents containing the phrases.
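
As a rough illustration of the indexing step, here is a minimal sketch that builds n-grams from already-tokenized emails and counts both the raw frequency of each phrase and the number of documents containing it. It uses only the standard library. The names `ngrams` and `count_ngrams` and the toy token lists are assumptions for illustration; they are not the `email_analysis.tokenize` helpers used in this notebook.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-token tuples found in a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def count_ngrams(documents, n):
    """Count raw n-gram frequency and the number of documents containing each n-gram."""
    raw_counts = Counter()
    doc_counts = Counter()
    for tokens in documents:
        grams = ngrams(tokens, n)
        raw_counts.update(grams)
        doc_counts.update(set(grams))  # each document counts at most once
    return raw_counts, doc_counts


# Toy, made-up token lists standing in for tokenized emails.
docs = [
    ["donate", "now", "to", "flip", "this", "seat"],
    ["flip", "this", "seat", "and", "flip", "this", "senate"],
]
raw, per_doc = count_ngrams(docs, n=2)
share = per_doc[("flip", "this")] / len(docs)
print(raw[("flip", "this")], per_doc[("flip", "this")], share)  # 3 2 1.0
```

Keeping the two counters separate is what makes it possible to report both "how often was this phrase used" and "in what share of emails did it appear".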

The function I'm using, `get_n_gram_counts`, returns frequency counters for both candidates combined, for Gardner, and for Hickenlooper, each at the level of individual documents and at the level of the whole set of documents.

In [5]:
from email_analysis import tokenize
from nltk.probability import FreqDist

combined_clean_text = fundraising.clean_text
hickenlooper = fundraising[fundraising.candidate == "hickenlooper"]
hickenlooper_clean_text = hickenlooper.clean_text
gardner = fundraising[fundraising.candidate == "gardner"]
gardner_clean_text = gardner.clean_text

def get_n_gram_counts(
    baseline_text,
    cand1,
    cand2,
    n=1
):
    """Returns 3 pandas Series and 3 nltk FreqDists from a set of 3 distinct iterators of text."""
    n_gram = lambda x: tokenize.get_ngrams(x, n=n)
    all_tokens = (
        baseline_text.apply(tokenize.tokenize_email)
        .apply(list)
        .apply(n_gram)
        .apply(FreqDist)
    )
    cand1_toks = (
        cand1.apply(tokenize.tokenize_email)
        .apply(list)
        .apply(n_gram)
        .apply(FreqDist)
    )
    cand2_toks = (
        cand2.apply(tokenize.tokenize_email)
        .apply(list)
        .apply(n_gram)
        .apply(FreqDist)
    )
    
    return all_tokens, all_tokens.sum(), cand1_toks, cand1_toks.sum(), cand2_toks, cand2_toks.sum()

unigram_data = get_n_gram_counts(
    combined_clean_text,
    gardner_clean_text,
    hickenlooper_clean_text,
    n=1
)
bigram_data = get_n_gram_counts(
    combined_clean_text,
    gardner_clean_text,
    hickenlooper_clean_text,
    n=2
)
trigram_data = get_n_gram_counts(
    combined_clean_text,
    gardner_clean_text,
    hickenlooper_clean_text,
    n=3
)
four_gram_data = get_n_gram_counts(
    combined_clean_text,
    gardner_clean_text,
    hickenlooper_clean_text,
    n=4
)

## Exploration

At this point, I began exploring the data. In order to do this, I calculated the log-odds ratio between language Hickenlooper used and language Gardner used. The *odds* of a candidate using a particular word are simply the probability of finding that word at random within the candidate's emails divided by the probability of not finding it. In other words, $Odds_{w, candidate} = \frac{f_{w, candidate}}{1 - f_{w, candidate}}$, where $f_{w, candidate}$ is the frequency of that word within the candidate's emails.

From there, the log-odds ratio is $\log{Odds_{w, Gardner}} - \log{Odds_{w, Hickenlooper}}$. (Technically, I added one to every count, so that words one candidate never used don't cause divide-by-zero errors.)
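
To make the calculation concrete, here is a minimal sketch with made-up counts. It assumes add-one smoothing of the kind the notebook's `nonzero_freq` function applies; the helper names and all of the numbers are illustrative and are not taken from the email data.

```python
import math


def smoothed_freq(count, total_tokens, vocab_size):
    """Add-one smoothing: (count + 1) / (total tokens + vocabulary size)."""
    return (count + 1) / (total_tokens + vocab_size)


def toy_log_odds_ratio(count_a, total_a, count_b, total_b, vocab_size):
    """log(odds for candidate A) minus log(odds for candidate B)."""
    f_a = smoothed_freq(count_a, total_a, vocab_size)
    f_b = smoothed_freq(count_b, total_b, vocab_size)
    return math.log(f_a / (1 - f_a)) - math.log(f_b / (1 - f_b))


# Made-up counts: each candidate wrote 10,000 tokens and the
# combined vocabulary has 2,000 distinct words.
shared_word = toy_log_odds_ratio(50, 10_000, 50, 10_000, 2_000)    # used equally
exclusive_word = toy_log_odds_ratio(50, 10_000, 0, 10_000, 2_000)  # never used by B
swapped = toy_log_odds_ratio(0, 10_000, 50, 10_000, 2_000)         # never used by A

print(shared_word, exclusive_word, swapped)
```

A word both candidates use at the same rate scores zero, a word only one candidate uses gets a large score, and swapping the candidates only flips the sign, which is the symmetry that makes the metric convenient for exploration.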

I decided to do this because of a [2008 paper demonstrating the metric](http://languagelog.ldc.upenn.edu/myl/Monroe.pdf). As that paper notes, there are flaws to this metric. But for exploratory analysis, it has the benefit of being a symmetric measure and being relatively good at finding words that meaningfully distinguish the language of two candidates, as opposed to words like "and" or "the" that are common among both candidates.

For each of the unigram, bigram, trigram, and four-gram datasets, I'll show the 20 words that had the highest and lowest log-odds ratios. When I initially conducted this analysis, I also used an interactive frequency graphic in altair that allowed me to hover over individual words in the dataset. The graphic was inspired by the Monroe paper above and served as a useful tool for exploring and finding words that Hickenlooper or Gardner used at much higher rates than the other.

Unfortunately, `altair` does not render well on GitHub. However, I'll show a static example of what those looked like after creating a function for the log-odds ratio:

In [6]:
import math
from email_analysis import display

def nonzero_freq(freq_dist, token, combined_dist):
    """Takes a candidate's frequency distribution, a token, and the combined
    distribution, and returns the add-one smoothed frequency:
    (num_occurrences + 1) / (total_tokens + unique_tokens_in_combined)
    """
    return (freq_dist[token] + 1) / (freq_dist.N() + combined_dist.B())

def log_odds_ratio(
    token,
    all_doc,
    all_combined,
    gardner_doc,
    gardner_combined,
    hick_doc,
    hick_combined
):
    freq_gardner = nonzero_freq(gardner_combined, token, all_combined)
    freq_hick = nonzero_freq(hick_combined, token, all_combined)
    gardner_odds = (freq_gardner / (1 - freq_gardner))
    hick_odds = (freq_hick / (1 - freq_hick))
    return  math.log(gardner_odds) - math.log(hick_odds)

chart = display.display_frequency_graph(log_odds_ratio, *unigram_data)
chart.save("frequency-chart.png")

![A scatterplot comparing the frequencies of words to their log-odds ratios](frequency-chart.png)

The graphic maps the overall frequencies of the words to their log-odds ratios. In the interactive version, I was able to hover over each point and see the corresponding word. This allowed me to identify words that weren't captured in the lists of top 20 ratios below. The graphic also allowed me to find out the overall frequencies of the words I was looking at, which helped me ensure that the words I was selecting were not so rare that their high log-odds ratios could've come purely from random chance.

There was one other intuition I developed with the graphic through exploration: the band of points at the top and bottom of the graphic that are the farthest separated from the rest of the cluster seem to exclusively come from words or phrases that only appear in one candidate's emails. As a result, the graphic also helped me keep a good mix of words that both Gardner and Hickenlooper used in their emails and words that were exclusive to one candidate's emails.

### One-Word Phrases

In [7]:
display.display_differences(log_odds_ratio, *unigram_data, num_display=20)

                             Low Scores    	                              High Scores
  1. hick                                     -3.91434	x                                        6.77604
  2. tear                                     -3.91434	sandra                                   5.76816
  3. trial                                    -3.82729	radical                                  5.73136
  4. corporate                                -3.80502	left                                     5.02563
  5. fundraiser                               -3.75665	match                                    5.02127
  6. smith                                    -3.70662	socialist                                4.80250
  7. witness                                  -3.70662	o                                        4.74574
  8. violence                                 -3.68064	conservative                             4.65679
  9. pitch                                    -3.58427	schumer                    

### Two-Word Phrases

In [8]:
display.display_differences(log_odds_ratio, *bigram_data, num_display=20)

                             Low Scores    	                              High Scores
  1. right wing                               -4.53427	_<digit>_ x                              6.62144
  2. mcconnell '                              -4.44113	x match                                  6.56530
  3. or more                                  -4.38517	_<links>_ donate                         6.10470
  4. , folk                                   -4.17742	_<prices>_ _<links>_                     5.98739
  5. <s> folk                                 -4.14514	team gardner                             5.93481
  6. hello —                                  -4.12861	_<links>_ _<digit>_                      5.58104
  7. m .                                      -4.09469	match donate                             5.41868
  8. this campaign                            -4.08641	sandra ,                                 5.32409
  9. like this                                -4.07729	match _<prices>_           

### Three-Word Phrases

In [9]:
display.display_differences(log_odds_ratio, *trigram_data, num_display=20)

                             Low Scores    	                              High Scores
  1. _<prices>_ or more                       -5.05091	_<digit>_ x match                        6.57181
  2. of _<prices>_ or                         -4.44552	_<links>_ donate _<prices>_              6.06726
  3. mcconnell ' s                            -4.43339	_<links>_ _<digit>_ x                    5.58090
  4. flip this senate                         -4.38336	x match donate                           5.41297
  5. <s> donate _<links>_                     -4.18546	donate _<prices>_ _<links>_              5.34078
  6. </s> <s> folk                            -4.13744	x match _<prices>_                       5.07442
  7. win this election                        -4.10409	_<prices>_ _<links>_ _<digit>_           5.04344
  8. have the resource                        -4.08699	match donate _<prices>_                  4.90890
  9. m . e                                    -4.06960	_<prices>_ x _<digit>_     

### Four-Word Phrases

In [10]:
display.display_differences(log_odds_ratio, *four_gram_data, num_display=20)

                             Low Scores    	                              High Scores
  1. of _<prices>_ or more                    -4.42988	_<links>_ _<digit>_ x match              5.58416
  2. flip this senate seat                    -4.36694	_<digit>_ x match donate                 5.41626
  3. </s> <s> donate _<links>_                -4.18196	_<digit>_ x match _<prices>_             5.07776
  4. _<prices>_ or more to                    -4.10060	_<prices>_ _<links>_ _<digit>_ x         5.04678
  5. . </s> <s> folk                          -4.08350	x match donate _<prices>_                4.89776
  6. m . e .                                  -4.04840	_<links>_ donate _<prices>_ now          4.83756
  7. mitch mcconnell ' s                      -3.93504	donate _<prices>_ now _<links>_          4.79801
  8. . </s> <s> hello                         -3.87315	donate _<prices>_ x _<digit>_            4.56258
  9. donation of _<prices>_ or                -3.85164	now _<links>_ donate _<pric

## Findings

In TV ads, Cory Gardner and, to a lesser extent, John Hickenlooper have cast themselves as bipartisan candidates. Gardner's first TV ad, which cast him as someone who works across the aisle, showed Gov. Jared Polis describing how frequently the two were in contact during the coronavirus pandemic.

"Senator Cory Gardner, who I talk with multiple times every day, has done everything I've asked to help in our response," the ad shows Polis saying.

But in line with the political science research, you can see in the log-odds ratios that the candidates have been using partisan language in their emails. 

### People

One place where this particularly comes across is in the political figures the two candidates mention. McConnell and Donald Trump are both disproportionately referenced by Hickenlooper, while Gardner disproportionately mentions Schumer. You can see precise statistics below:

In [11]:
def display_stats(word):
    ngram = len(word)
    phrase = " ".join(word)
    if ngram == 1:
        dataset = unigram_data
    elif ngram == 2:
        dataset = bigram_data
    elif ngram == 3:
        dataset = trigram_data
    elif ngram == 4:
        dataset = four_gram_data
    else:
        raise ValueError("The word is too long/short")
    gardner_pct = dataset[2].apply(lambda x: word in x).mean()
    hick_pct = dataset[4].apply(lambda x: word in x).mean()
    print(f"Gardner mentioned {phrase} in {gardner_pct * 100.} percent of his e-mails")
    print(f"Gardner mentioned {phrase} in {dataset[2].apply(lambda x: word in x).sum()} of his {len(dataset[2])} emails")
    print(f"Hickenlooper mentioned {phrase} in {hick_pct * 100.} percent of his e-mails")
    print(f"Hickenlooper mentioned {phrase} in {dataset[4].apply(lambda x: word in x).sum()} of his {len(dataset[4])} emails")
    if word in dataset[3] and word in dataset[5]:
        greater_cand = "Gardner" if gardner_pct > hick_pct else "Hickenlooper"
        lesser_cand = "Hickenlooper" if gardner_pct > hick_pct else "Gardner"
        min_num = min(gardner_pct, hick_pct)
        max_num = max(gardner_pct, hick_pct)
        odds_ratio = (max_num / (1. - max_num)) / (min_num / (1. - min_num))
        print(f"The odds of {greater_cand} using the word {phrase} were {odds_ratio} times higher than {lesser_cand}")

#### Chuck Schumer

In [12]:
display_stats(("schumer",))

Gardner mentioned schumer in 36.56957928802589 percent of his e-mails
Gardner mentioned schumer in 113 of his 309 emails
Hickenlooper mentioned schumer in 0.26666666666666666 percent of his e-mails
Hickenlooper mentioned schumer in 1 of his 375 emails
The odds of Gardner using the word schumer were 215.6224489795918 times higher than Hickenlooper
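
Note that this odds ratio is based on the share of emails containing the phrase, not on token frequencies as in the log-odds ratio used earlier. A quick check of the arithmetic for the counts reported above (the helper name `document_odds_ratio` is only for illustration):

```python
def document_odds_ratio(hits_a, total_a, hits_b, total_b):
    """Odds ratio based on the share of documents that contain a phrase."""
    p_a = hits_a / total_a
    p_b = hits_b / total_b
    return (p_a / (1 - p_a)) / (p_b / (1 - p_b))


# Counts reported above: 113 of 309 Gardner emails, 1 of 375 Hickenlooper emails.
ratio = document_odds_ratio(113, 309, 1, 375)
print(round(ratio, 2))  # 215.62
```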


In [13]:
display_stats(("chuck", "schumer",))

Gardner mentioned chuck schumer in 19.41747572815534 percent of his e-mails
Gardner mentioned chuck schumer in 60 of his 309 emails
Hickenlooper mentioned chuck schumer in 0.26666666666666666 percent of his e-mails
Hickenlooper mentioned chuck schumer in 1 of his 375 emails
The odds of Gardner using the word chuck schumer were 90.12048192771084 times higher than Hickenlooper


#### Mitch McConnell

In [14]:
display_stats(("mitch", "mcconnell",))

Gardner mentioned mitch mcconnell in 2.26537216828479 percent of his e-mails
Gardner mentioned mitch mcconnell in 7 of his 309 emails
Hickenlooper mentioned mitch mcconnell in 47.199999999999996 percent of his e-mails
Hickenlooper mentioned mitch mcconnell in 177 of his 375 emails
The odds of Hickenlooper using the word mitch mcconnell were 38.56709956709956 times higher than Gardner


In [15]:
display_stats(("mcconnell",))

Gardner mentioned mcconnell in 2.26537216828479 percent of his e-mails
Gardner mentioned mcconnell in 7 of his 309 emails
Hickenlooper mentioned mcconnell in 55.733333333333334 percent of his e-mails
Hickenlooper mentioned mcconnell in 209 of his 375 emails
The odds of Hickenlooper using the word mcconnell were 54.318416523235804 times higher than Gardner


#### Donald Trump

In [16]:
display_stats(("donald", "trump",))

Gardner mentioned donald trump in 0.6472491909385114 percent of his e-mails
Gardner mentioned donald trump in 2 of his 309 emails
Hickenlooper mentioned donald trump in 28.000000000000004 percent of his e-mails
Hickenlooper mentioned donald trump in 105 of his 375 emails
The odds of Hickenlooper using the word donald trump were 59.69444444444445 times higher than Gardner


### Partisan Signals

You can similarly see a number of partisan symbols in the data, with Gardner making references to the "far-left" and to "socialists" and Hickenlooper referring to "special interests" and to "billionaires" — a theme a political scientist told me is popular in political emails among Democrats in the Senate as a whole.

#### Socialist

In [17]:
display_stats(("socialist",))

Gardner mentioned socialist in 26.537216828478964 percent of his e-mails
Gardner mentioned socialist in 82 of his 309 emails
Hickenlooper mentioned socialist in 0.0 percent of his e-mails
Hickenlooper mentioned socialist in 0 of his 375 emails


#### Radical

In [18]:
display_stats(("radical",))

Gardner mentioned radical in 55.663430420711975 percent of his e-mails
Gardner mentioned radical in 172 of his 309 emails
Hickenlooper mentioned radical in 0.0 percent of his e-mails
Hickenlooper mentioned radical in 0 of his 375 emails


#### Far-left

In [19]:
display_stats(("far", "left",))

Gardner mentioned far left in 33.980582524271846 percent of his e-mails
Gardner mentioned far left in 105 of his 309 emails
Hickenlooper mentioned far left in 0.0 percent of his e-mails
Hickenlooper mentioned far left in 0 of his 375 emails


#### Special Interest

In [20]:
display_stats(("special", "interest",))

Gardner mentioned special interest in 12.944983818770226 percent of his e-mails
Gardner mentioned special interest in 40 of his 309 emails
Hickenlooper mentioned special interest in 17.599999999999998 percent of his e-mails
Hickenlooper mentioned special interest in 66 of his 375 emails
The odds of Hickenlooper using the word special interest were 1.4364077669902913 times higher than Gardner


#### Billionaire

In [21]:
display_stats(("billionaire",))

Gardner mentioned billionaire in 1.6181229773462782 percent of his e-mails
Gardner mentioned billionaire in 5 of his 309 emails
Hickenlooper mentioned billionaire in 11.200000000000001 percent of his e-mails
Hickenlooper mentioned billionaire in 42 of his 375 emails
The odds of Hickenlooper using the word billionaire were 7.668468468468468 times higher than Gardner


### References to Money

You can also see that both candidates made frequent references to money. And you can see that Cory Gardner made frequent references to matching — something that did not appear to come up in Hickenlooper's emails at all. Campaign finance experts [have questioned how these campaigns can work](https://www.opensecrets.org/news/2019/08/political-contributions-campaigns-say-theyll-match/).

I identified both kinds of references to money using regular expressions. I matched references to prices (e.g. $100.00) using this regular expression from `CommonRegex`:

`'[$]\s?[+-]?[0-9]{1,3}(?:(?:,?[0-9]{3}))*(?:\.[0-9]{1,2})?'`

For matching digits, I used a simple `[0-9]+` regular expression. I tagged digits last, after prices, phone numbers and a few other patterns, so that a number inside a price was not also tagged as a digit.
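
A minimal sketch of that tagging order, assuming the tag strings match the `_<prices>_` and `_<digit>_` tokens that appear in the n-gram output. The function name `tag_numbers` is hypothetical, and the other patterns in the real pipeline (links, phone numbers and so on) are omitted here.

```python
import re

# Price pattern quoted above (from CommonRegex) and the simple digit pattern.
PRICE_RE = re.compile(r"[$]\s?[+-]?[0-9]{1,3}(?:(?:,?[0-9]{3}))*(?:\.[0-9]{1,2})?")
DIGIT_RE = re.compile(r"[0-9]+")


def tag_numbers(text):
    """Replace prices first, then any remaining digit runs, with standard tags."""
    text = PRICE_RE.sub(" _<prices>_ ", text)
    text = DIGIT_RE.sub(" _<digit>_ ", text)
    return " ".join(text.split())  # collapse the extra whitespace


print(tag_numbers("Give $25.00 now for a 4X match"))
# Give _<prices>_ now for a _<digit>_ X match
```

If the digit pattern ran first, `$25.00` would be split into two digit tags and the price pattern would no longer match anything.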

#### Prices

Both candidates made frequent references to money in their emails, something that a political scientist, Taewoo Kang, told me doesn't happen on TV at all:

In [22]:
display_stats(("_<prices>_",))

Gardner mentioned _<prices>_ in 70.22653721682848 percent of his e-mails
Gardner mentioned _<prices>_ in 217 of his 309 emails
Hickenlooper mentioned _<prices>_ in 62.4 percent of his e-mails
Hickenlooper mentioned _<prices>_ in 234 of his 375 emails
The odds of Gardner using the word _<prices>_ were 1.4212653288740247 times higher than Hickenlooper


#### Matching Campaigns

And, as mentioned before, Gardner uses matching campaigns regularly:

In [23]:
display_stats(("_<digit>_", "x", "match"))

Gardner mentioned _<digit>_ x match in 38.51132686084142 percent of his e-mails
Gardner mentioned _<digit>_ x match in 119 of his 309 emails
Hickenlooper mentioned _<digit>_ x match in 0.0 percent of his e-mails
Hickenlooper mentioned _<digit>_ x match in 0 of his 375 emails


#### Vulnerable Republican Senator

Another point that I found interesting on the money front was how the candidates discussed their opponents. If you read through examples of these phrases, you could see Hickenlooper often bringing McConnell up as a way of referencing the moneyed interests in the Republican Party. And Gardner often warned of the far-left as he tried to get people to give him money. As Tyler Sandberg, a GOP consultant, told me, "In fundraising, it's very much, 'The sky is falling, give now, save the world.'"

On this front, Hickenlooper mentioned Gardner's political vulnerability quite often as he pointed to the importance of winning the race. He similarly made frequent references to "flipping this Senate seat blue" or variants of that phrase:

In [24]:
display_stats(("vulnerable", "republican", "senator"))

Gardner mentioned vulnerable republican senator in 0.0 percent of his e-mails
Gardner mentioned vulnerable republican senator in 0 of his 309 emails
Hickenlooper mentioned vulnerable republican senator in 8.0 percent of his e-mails
Hickenlooper mentioned vulnerable republican senator in 30 of his 375 emails


### Impeachment

The final interesting point, again related to the partisan language of campaign emails, lies in how often the candidates talked about impeachment. In order to assess this, I looked at how often the candidates used any one of a set of impeachment-related words. This, of course, is not a conclusive measure of their language, but it tracks well with what I saw when I read the emails.

In [25]:
def impeachment(dataset, any_words):
    phrase = ', '.join([" ".join(w) for w in any_words])
    in_impeachment = lambda x: any([word in x for word in any_words])
    gardner_pct = dataset[2].apply(in_impeachment).mean() * 100.
    hick_pct = dataset[4].apply(in_impeachment).mean() * 100.
    print(f"Gardner mentioned one of {phrase} in {gardner_pct} percent of his e-mails")
    print(f"Hickenlooper mentioned one of {phrase} in {hick_pct} percent of his e-mails")
    freq_gardner = sum([dataset[3].freq(word) for word in any_words])
    freq_hick = sum([dataset[5].freq(word) for word in any_words])
    if freq_gardner != 0 and freq_hick != 0:
        if freq_hick > freq_gardner:
            print(f"Hickenlooper mentioned {phrase} {freq_hick / freq_gardner} times more frequently than Gardner")
        else:
            print(f"Gardner mentioned {phrase} {freq_gardner / freq_hick} times more frequently than Hickenlooper")

impeachment_words = [
    ("trial",), ("impeachment",),
    ("acquit",), ("witness",),
    ("impeach",), ("ukraine",)
]        
impeachment(unigram_data, impeachment_words)

Gardner mentioned one of trial, impeachment, acquit, witness, impeach, ukraine in 0.6472491909385114 percent of his e-mails
Hickenlooper mentioned one of trial, impeachment, acquit, witness, impeach, ukraine in 9.866666666666667 percent of his e-mails
Hickenlooper mentioned trial, impeachment, acquit, witness, impeach, ukraine 81.00634190847614 times more frequently than Gardner


The contrast here is stark. But it becomes even clearer if you narrow the search to January and February 2020, when the Senate impeachment trial was under way:

In [26]:
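# Use localize() rather than tzinfo=...: passing a pytz timezone straight to
# datetime() attaches the zone's historical local-mean-time offset instead of
# the correct Mountain time offset.
mountain = pytz.timezone(COLORADO)
all_janfeb = fundraising[
    (fundraising.date >= mountain.localize(datetime.datetime(2020, 1, 1))) &
    (fundraising.date < mountain.localize(datetime.datetime(2020, 3, 1)))
]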
hick_janfeb = all_janfeb[all_janfeb.candidate == "hickenlooper"]
gardner_janfeb = all_janfeb[all_janfeb.candidate == "gardner"]
jf_unigrams = get_n_gram_counts(
    all_janfeb.clean_text,
    gardner_janfeb.clean_text,
    hick_janfeb.clean_text,
    n=1
)
impeachment(jf_unigrams, impeachment_words)

Gardner mentioned one of trial, impeachment, acquit, witness, impeach, ukraine in 0.0 percent of his e-mails
Hickenlooper mentioned one of trial, impeachment, acquit, witness, impeach, ukraine in 40.32258064516129 percent of his e-mails


## Conclusion

The data shows that the candidates are messaging themselves in particularly partisan ways in emails, something that reflects how different the audiences of political emails are from the audiences of, say, television ads. Particularly in a race where Sen. Gardner has placed such emphasis on his bipartisan record, this seems notable.