Singing the Bayesian Beginner Blues
Song Lyrics Frequency and Empirical Bayes Estimation
library(knitr) knitr::opts_chunk$set(cache = TRUE, warning = FALSE, message = FALSE, dpi = 180) options(width=80, dplyr.width = 150) library(ggplot2) theme_set(theme_light())
Earlier this week, I published a post about song lyrics and how different U.S. states are mentioned at different rates, and at different rates relative to their populations. That was a very fun post to work on, but you can tell from that paragraph near the end that I am a little bothered by the uncertainty involved in calculating the rates by just dividing two numbers. David Robinson suggested on Twitter that I might try using empirical Bayes methods to estimate the rates. I am a newcomer to Bayesian methods but this idea makes a lot of sense in this context, so let's see what we can do!
The analysis here borrows very heavily from two of Dave's posts from last year. To start out, what are the values that we are dealing with? (I have hidden the code that downloads/calculates these values, but you can see the entire R Markdown file here.)
library(acs) library(dplyr) library(reshape2) library(readr) library(tidytext) ## population data stategeo <- geo.make(state = "*") popfetch <- acs.fetch(geography = stategeo, endyear = 2014, span = 5, table.number = "B01003", col.names = "pretty") pop_df <- tbl_df(melt(estimate(popfetch))) %>% mutate(name = Var1, state_name = tolower(Var1), pop2014 = value) %>% select(name, state_name, pop2014) %>% filter(state_name != "puerto rico") ## song lyrics song_lyrics <- read_csv("./billboard_lyrics_1964-2015.csv") tidy_lyrics <- bind_rows(song_lyrics %>% unnest_tokens(state_name, Lyrics), song_lyrics %>% unnest_tokens(state_name, Lyrics, token = "ngrams", n = 2)) ## join them together tidy_lyrics <- inner_join(tidy_lyrics, pop_df) %>% distinct(Rank, Song, Artist, Year, state_name, .keep_all = TRUE) ## count the states up state_counts <- tidy_lyrics %>% group_by(state_name) %>% summarise(n = n()) ## remove the Maine confusion songs state_counts$n[state_counts$state_name == "maine"] <- 1 state_counts <- pop_df %>% left_join(state_counts) %>% filter(!is.na(n)) %>% select(-state_name) %>% rename(state_name = name) %>% mutate(rate = n/pop2014)
state_counts %>% arrange(desc(rate)) %>% top_n(10)
We have, for each state here, the population in the state, the number of times it was mentioned in a song, and the rate of mentions per population (just the previous two numbers divided by each other). The reason I was uncomfortable here is that some states were mentioned so few times (like Hawaii and Montana!) and it is surely true that the rates calculated here have very different uncertainty intervals from state to state.
This is where Bayesian methods come in. We can use Bayesian methods to a) estimate the rate itself and b) estimate credible intervals. For a really wonderful explanation of how Bayesian models work, I will point you to Rasmus Bååth's post about whether his wife is pregnant or not. He posted that last year not too long after Dave posted some of his baseball/Bayes posts, at which time I started to think, "Maybe this Bayes stuff does actually make sense." To use a Bayesian method, we need to choose a prior probability distribution for what we want to estimate; this is what we believe about the quantity before evidence is taken into account. What makes empirical Bayes "empirical" is that the prior probability distribution is taken from the data itself; we will plot the actual distribution of rates (mentions per population) and use that distribution as our prior.
ggplot(state_counts, aes(rate)) + geom_histogram(binwidth = 2e-7, alpha = 0.8, fill = "midnightblue") + labs(x = "rate of mentions per population") + theme_minimal(base_family = "RobotoCondensed-Regular")
Hmmmmm, that's not so great, is it? There are only 50 states and not all of them were mentioned in this lyrics data set at all. But we will merrily push forward and calculate a prior probability distribution from this. Let's fit a beta distribution to this to use as our prior. I'm going to use the method of moments to fit a beta distribution.
x <- state_counts$n / state_counts$pop2014 mu <- mean(x) sigma2 <- var(x) alpha0 <- ((1 - mu) / sigma2 - 1 / mu) * mu^2 beta0 <- alpha0 * (1 / mu - 1)
Now let's plot this.
ggplot(state_counts) + geom_histogram(aes(rate, y = ..density..), binwidth = 2e-7, alpha = 0.8, fill = "midnightblue") + stat_function(fun = function(x) dbeta(x, alpha0, beta0), color = "red", size = 1) + labs(x = "rate of mentions per population") + theme_minimal(base_family = "RobotoCondensed-Regular")
That's not... too awful, I hope. Remember, what this is supposed to be is our belief about the distribution of rates before evidence from individual states is taken into account. I would buy this in a general sense; there a few states that are mentioned many times and many states that are mentioned a few times.
And that's it! We have a prior.
Calculating the Empirical Bayes Estimate
Now that we have a prior probability distribution, we use Bayes' theorem to calculate a posterior estimate for each state's rate. It's not very difficult math.
state_counts <- state_counts %>% mutate(rate_estimate = 1e6*(n + alpha0) / (pop2014 + alpha0 + beta0), rate = 1e6*rate) state_counts
How do the two values compare?
library(ggrepel) ggplot(state_counts, aes(rate, rate_estimate, color = n)) + geom_abline(intercept = 0, slope = 1, color = "gray70", linetype = 2) + geom_text_repel(aes(label = state_name), color = "black", box.padding = unit(0.5, 'lines'), family = "RobotoCondensed-Regular") + geom_point(size = 4) + scale_color_gradient(low = "midnightblue", high = "pink", name="Number\nof songs") + labs(title = "States in Song Lyrics with Empirical Bayes", subtitle = "States like Montana and Hawaii (high rates, few mentions) are shifted the most", x = "Measured rate of mentions per million population", y = "Empirical Bayes estimate of rate per million population") + theme_minimal(base_family = "RobotoCondensed-Regular") + theme(plot.title=element_text(family="Roboto-Bold"))
Notice that the states that were mentioned the highest number of times are closest to the line, i.e., the empirical Bayes method here did not change the value that much. It is only for states that were mentioned in a few songs that the two values are quite different. Notice that the high rates were shifted to lower values and the low rates were shifted to (slightly) higher values.
What Is the Posterior Distribution for Each State?
We calculated an empirical Bayes estimate for each rate above, but we can actually calculate the full posterior probability distribution for each state. What are $$\alpha$$ and $$\beta$$ for each state?
state_counts <- state_counts %>% mutate(alpha1 = n + alpha0, beta1 = pop2014 - n + beta0) state_counts
Let's plot a few of these to see what they look like.
library(broom) counts_beta <- state_counts %>% arrange(desc(rate_estimate)) %>% top_n(5, rate_estimate) %>% inflate(x = seq(1e-7, 5e-6, 2e-8)) %>% ungroup() %>% mutate(density = dbeta(x, alpha1, beta1)) ggplot(counts_beta, aes(x, density, color = state_name)) + geom_line(size = 1.2, alpha = 0.8) + stat_function(fun = function(x) dbeta(x, alpha0, beta0), lty = 2, color = "black") + labs(x = "Rate of mentions per population", y = "Density", title = "Prior and Posterior Distributions", subtitle = "The posterior distribution for a few example states are shown\nThe prior distribution is shown as a dashed line") + theme_minimal(base_family = "RobotoCondensed-Regular") + theme(plot.title=element_text(family="Roboto-Bold")) + theme(legend.title=element_blank())
Notice that New York, which was mentioned in many songs, has a narrow posterior probability distribution; we have more precise knowledge about the rate for New York. Hawaii, by contrast, has a broad probability distribution; we have less precise knowledge about the rate for Hawaii, because Hawaii was only mentioned in a few songs!
We can use these posterior probability distributions to calculate credible intervals, an estimate of how uncertain each of these measurements is, analogous to a confidence interval. (BUT SO DIFFERENT, everyone tells me. ON A PHILOSOPHICAL LEVEL.)
state_counts <- state_counts %>% mutate(low = 1e6*qbeta(.025, alpha1, beta1), high = 1e6*qbeta(.975, alpha1, beta1))
These are 95% credible intervals. Let's check them out!
library(tidyr) state_counts %>% arrange(desc(rate_estimate)) %>% mutate(state_name = factor(state_name, levels = rev(unique(state_name)))) %>% select(state_name, 'Measured rate' = rate, 'Empirical Bayes estimate' = rate_estimate, low, high) %>% gather(type, rate, `Measured rate`, `Empirical Bayes estimate`) %>% ggplot(aes(rate, state_name, color = type)) + geom_errorbarh(aes(xmin = low, xmax = high), color = "gray50") + geom_point(size = 3) + xlim(0, NA) + labs(x = "Rate of mentions per million population", y = NULL, title = "Measured Rates, Empirical Bayesian Estimates, and Credible Intervals", subtitle = "The 95% credible intervals are shown for these states") + theme_minimal(base_family = "RobotoCondensed-Regular") + theme(plot.title=element_text(family="Roboto-Bold")) + theme(legend.title=element_blank())
The part of this that was the most satisfying was the credible intervals, how big/small they are, and how the plain vanilla rates and empirical Bayes rate estimates are distributed in the credible intervals. It definitely gave me the data intuition warm fuzzies and made a lot of sense. This method is not that hard to implement or understand and was a gratifyingly productive first foray into Bayesian methods. The R Markdown file used to make this blog post is available here. I am very happy to hear feedback or questions!