Puttin'a Prior on It: Stan Beta Estmation of Song Lyrics

MDH 10/6/2016

Introduction
Get Data
Join, mutate, and munge the data
Fun facts...
Parameter estimation
Hamiltonian Monte Carlo (HMC)
Adding City Counts
Model Comparison
Create R scipt

Introduction

Estimating a Beta Distribution with Stan HMC

Repository for data, analysis (markdown), and R code

This is a repo to hold the data and code for my blog post based on Julia Silge's Singing the Bayesian Beginner Blues. The post by Silge is a really fun and interesting analysis of the rate at which song lyrics from Billboard Hot-100 songs (1958 to present) mention U.S. States by name. In her second post on the subject, Silge used a beta distribution to model this rate. After reading that post, I was inspired to learn more about her method and to follow up on this model with my interests in Bayesian modeling with Stan, a probabilistic programming language. I hoped to repeat Silge's findings while learning how to code her model in Stan. This was also a fun opportunity to work on more dplyr data munging techniques. The result was a good learning experience and perhaps a few additional insights into the distribution of state name mentions.

If you are interested in this, please see my blog post and Julia's posts. Note: the code and analytical process here is based on Silge's workflow, but alterations and additions were made to focus on different aspects. Any errors, sloppiness, or misunderstandings are my own. Also, I am sure there are other ways to model these data and while I'd be very happy to hear about them, this post is intended to explore this particular method. Please contact me if you see errors and just to say hi. [@md_harris](https://twitter.com/Md_Harris)

Differences from Silge's Analysis

Is zero inflated to include states not mentioned in lyrics
Incorporates mentions of cities with >= 100k population aggregated to their state and compares to analysis without city counts
Utilizes Hamiltonian Monte Carlo via rstan to estimate parameters, propagated uncertainty, and predict new values.

Why Go Bayes?

Aside from being an interesting data set due to cultural relevance, it is also a really great example of for why one may want to use a Bayesian approach. As described in Silge's post, the simple calculation of a mention rate is unsatisfying because it does not consider the magnitude of each states population. We would prefer a method that incorporates this information and can make a new estimate based on the number of mentions and population each state that therefore regularizes the estimates based on all of the information available. This include prior information and empirical data.

In my humble opinion this is a really key point; the reason that Silge utilized the beta distribution to model the data (as opposed to simply describe) is because she wanted to show the uncertainty for each estimate based on a states mentions and population. The problem this solves is a state that has a very high rate, but few mentions because it has a low population. This is in opposition to a state that is mentioned many times, but has a lower rate because of a large population. This is the classic "batting average" from illuminated by David Robinson here.

What the Bayesian approach does is incorporate prior knowledge (e.g. priors) and the uncertainty that comes with balancing values based on only a few data points against values based on many data points; which are likely more reliable. By this, the expectation for a states rate of mention is regularized or drawn towards a central global expectation. States with fewer data points are regularized more than more stable states with lots of data. Finally, the Credible Interval captures the uncertainty that propagates through this system. Regularization and uncertainty estimation are key reasons why someone may want to use Bayesian methods.

Let's get started!

Packages

library("ggplot2")
library("grid")
library("dplyr")
library("tools")
library("tidyverse")
library("acs")
library("reshape2")
library("readr")
library("tidytext")
library("scales")
library("ggplot2")
library("ggrepel")
library("broom")
library("fitdistrplus")
library("rstan")
library("knitr")
library("rmarkdown")

Get Data

Use acs package to get state population data from 2014
1. requires an API from the Census
Grab song lyrics from Kaylin Walker's repo
Load ./data/cities_over_100k_pop.csv, names of cities with >= 100k population
1. UNCOMMENT that line download.file(lyrics_url, save_lyrics_loc) to run
2. you may have to adjust the path
Load state name abbreviation for simplifying the plots later

#api.key.install("YOUR KEY HERE!")
## population data
stategeo <- geo.make(state = "*")
popfetch <- acs.fetch(geography = stategeo, 
                      endyear = 2014,
                      span = 5, 
                      table.number = "B01003",
                      col.names = "pretty")
## song lyrics
lyrics_url <- "https://raw.githubusercontent.com/walkerkq/musiclyrics/master/billboard_lyrics_1964-2015.csv"
save_lyrics_loc <- "~/Documents/R_Local/Put_a_prior_on_it-Blog_post/data/billboard_lyrics_1964-2015.csv"
# download.file(lyrics_url, save_lyrics_loc)
song_lyrics <- read_csv(save_lyrics_loc)

Parsed with column specification:
cols(
  Rank = col_integer(),
  Song = col_character(),
  Artist = col_character(),
  Year = col_integer(),
  Lyrics = col_character(),
  Source = col_integer()
)

cities_dat_loc <- "~/Documents/R_Local/Put_a_prior_on_it-Blog_post/data/cities_over_100k_pop.csv"
state_abbrev_loc <- "~/Documents/R_Local/Put_a_prior_on_it-Blog_post/data/state_abbrev.csv"
state_abbrev <- read_csv(state_abbrev_loc)

Parsed with column specification:
cols(
  name = col_character(),
  abbrev = col_character()
)

Join, mutate, and munge the data

Prepare data for extracting lyrics

The tidytext package uses unnest_tokens to do the heavy lifting here. Thanks David Robinson and Julia Silge for this package!

# extract desired info from acs data
pop_df <- tbl_df(melt(estimate(popfetch))) %>%
  mutate(name = as.character(Var1),
         state_name = tolower(Var1),
         pop2014 = value) %>%
  dplyr::select(name, state_name, pop2014) %>%
  filter(state_name != "puerto rico") %>%
  left_join(state_abbrev)

Joining, by = "name"

# clean in city names
cities <- read_csv(cities_dat_loc) %>%
  mutate(city = gsub("\x96", "-", city),
         city_name = tolower(city),
         state_name = tolower(state))

Parsed with column specification:
cols(
  rank = col_integer(),
  city = col_character(),
  state = col_character(),
  pop_2014 = col_integer()
)

# extract and tidy lyrics from songs data
tidy_lyrics <- bind_rows(song_lyrics %>% 
               unnest_tokens(lyric, Lyrics),
               song_lyrics %>% 
               unnest_tokens(lyric, Lyrics, 
               token = "ngrams", n = 2))

Join lyrics to geography

Here the lyrics are joined to the state names to find the songs that contain those names. An inner_join will return only the songs and states that match. Here I use a right_join to retain all 50 state names even if they are not mentioned by any lyrics. The zeros data set is then filtered for a data set that only has matching songs. Also, I use distinct(Song, Artist, lyric, ...) to aggregate the data set to only a single row for each song even if it mentions a song many times. This choice if distinct criteria also allows for songs that mentioned many different states to remain, but collapses an edge case where a song was on the charts in different years.

# join and retain songs whether or not they are in the lyrics
tidy_lyrics_state_zeros <- right_join(tidy_lyrics, pop_df,
                                by = c("lyric" = "state_name")) %>%
  distinct(Song, Artist, lyric, .keep_all = TRUE) %>% 
  mutate(cnt = ifelse(is.na(Source), 0, 1)) %>%
  filter(lyric != "district of columbia") %>%
  dplyr::rename(state_name = lyric)

tidy_lyrics_state <- filter(tidy_lyrics_state_zeros, cnt > 0)

## count the states up
zero_rate <- 0.0000001 # beta hates zeros

# group, summarise, and calculate rate per 100k population
state_counts_zeros <- tidy_lyrics_state_zeros %>% 
  group_by(state_name) %>% 
  dplyr::summarise(n = sum(cnt))  %>% # sum(cnt)
  left_join(pop_df, by = c("state_name" = "state_name")) %>%
  mutate(rate = (n / (pop2014 / 100000)) + zero_rate) %>%
  arrange(desc(n))

# create another data set with 
state_counts <- filter(state_counts_zeros, rate > zero_rate)
print(dplyr::select(state_counts, state_name, n ,pop2014, rate))

# A tibble: 33 × 4
    state_name     n  pop2014       rate
         <chr> <dbl>    <dbl>      <dbl>
1     new york    61 19594330 0.31131465
2   california    34 38066920 0.08931649
3      georgia    22  9907756 0.22204836
4    tennessee    14  6451365 0.21700844
5        texas    13 26092033 0.04982374
6      alabama    12  4817678 0.24908275
7  mississippi    10  2984345 0.33508200
8     kentucky     7  4383272 0.15969815
9       hawaii     6  1392704 0.43081670
10    illinois     6 12868747 0.04662469
# ... with 23 more rows

Do the same as above, but for cities

Note the use of inner_join here because I am not interested in cities that are not mentioned.

### Cities
## join cities together - inner_join b/c I don't care about cities with zero mentions (right_join otherwise)
tidy_lyrics_city <- inner_join(tidy_lyrics, cities,
                               by = c("lyric" = "city_name")) %>%
  distinct(Song, Artist, lyric, .keep_all = TRUE) %>%
  filter(!city %in% c("Surprise", "Hollywood", 
                      "District of Columbia", "Jackson",
                    "Independence")) %>%
  mutate(cnt = ifelse(is.na(Source), 0, 1)) %>%
  dplyr::rename(city_name = lyric)

# count cities mentions. No need for a rate; not of use now
city_counts <- tidy_lyrics_city %>% 
  group_by(city_name) %>% 
  dplyr::summarise(n = sum(cnt)) %>%
  arrange(desc(n))
print(city_counts)

# A tibble: 84 × 2
     city_name     n
         <chr> <dbl>
1     new york    61
2        miami    20
3  new orleans    19
4      chicago    17
5      atlanta    14
6      memphis    14
7      houston    11
8       orange     9
9      detroit     7
10      dallas     6
# ... with 74 more rows

# count of states that host the cities mentioned
city_state_counts <- tidy_lyrics_city %>% 
  group_by(state_name) %>% 
  dplyr::summarise(n = sum(cnt)) %>%
  arrange(desc(n))
print(city_state_counts)

# A tibble: 33 × 2
   state_name     n
        <chr> <dbl>
1    new york    66
2  california    38
3       texas    33
4   tennessee    22
5     florida    21
6   louisiana    21
7     georgia    18
8    illinois    17
9    michigan     9
10   nebraska     6
# ... with 23 more rows

Join city counts to state counts and compute new rate

state_city_counts_zeros <- left_join(state_counts_zeros,
                                     city_state_counts,
                                     by = "state_name") %>%
  dplyr::rename(n_state = n.x, n_city = n.y) %>%
  mutate(n_city = ifelse(is.na(n_city), 0, n_city),
         n_city_state = n_state + n_city,
         city_state_rate = (n_city_state / (pop2014 / 100000)) + zero_rate)

# same as above, but no states with zero mentioned by city or state
state_city_counts <- filter(state_city_counts_zeros, n_city_state > zero_rate)

Fun facts...

Some interesting things about these data

1. States mentioned by their cities, but not the state itself

# Boston = most mentioned city without its state
all_the_cities <- filter(state_city_counts, !state_name %in% state_counts$state_name) %>%
  dplyr::select(name) %>%
  mutate_if(is.factor, as.character) %>%
  left_join(tidy_lyrics_city, by = c("name" = "state")) %>%
  dplyr::select(name, Song, Artist, city)

kable(all_the_cities)

name	Song	Artist	city
Massachusetts	dirty water	the standells	Boston
Massachusetts	please come to boston	dave loggins	Boston
Massachusetts	hey nineteen	steely dan	Boston
Massachusetts	the heart of rock roll	huey lewis and the news	Boston
Massachusetts	dazzey duks	duice	Boston
Missouri	georgy girl	the seekers	Springfield
Missouri	living in america	james brown	Kansas City
New Mexico	bring em out	ti	Albuquerque
North Carolina	hush hush sweet charlotte	patti page	Charlotte
North Carolina	georgy girl	the seekers	Charlotte
North Carolina	raise the roof	luke featuring no good but so good	Fayetteville
North Carolina	wagon wheel	darius rucker	Raleigh
Rhode Island	meant to live	switchfoot	Providence
South Carolina	the beat goes on	sonny cher	Charleston
South Carolina	no no song	ringo starr	Columbia
South Carolina	forgot about dre	dr dre featuring eminem	Charleston

2. What song mentions the most unique state names?

n_states_mentioned <- tidy_lyrics_state %>%
  group_by(Artist, Song) %>%
  dplyr::summarise(n = n()) %>%
  arrange(desc(n)) %>%
  ungroup() %>%
  top_n(5)

Selecting by n

kable(n_states_mentioned)

Artist	Song	n
red hot chili peppers	dani california	7
duice	dazzey duks	5
jason aldean	fly over states	4
nelly	country grammar hot shit	4
arlo guthrie	city of new orleans	3
crazy elephant	gimme gimme good lovin	3
ludacris featuring bobby valentino	pimpin all over the world	3

# Top song is...
filter(tidy_lyrics_state, Song == as.character(n_states_mentioned[1,1])) %>%
  dplyr::select(Song, Artist, Year, state_name)

# A tibble: 0 × 4
# ... with 4 variables: Song <chr>, Artist <chr>, Year <int>,
#   state_name <chr>

3. What song mentions a single state the most number of times?

most_repeated_in_song <- right_join(tidy_lyrics, pop_df,
                                by = c("lyric" = "state_name")) %>%
  group_by(Song, Artist, lyric) %>%   
  dplyr::summarise(n = n()) %>%
  arrange(desc(n)) %>%
  ungroup() %>%
  filter(row_number() <= 10)

kable(most_repeated_in_song)

Song	Artist	lyric	n
empire state of mind	jayz and alicia keys	new york	34
new york groove	ace frehley	new york	20
arthurs theme best that you can do	christopher cross	new york	16
arizona	mark lindsay	arizona	13
all summer long	kid rock	alabama	12
mississippi girl	faith hill	mississippi	10
california girls	david lee roth	california	9
california girls	the beach boys	california	9
midnight train to georgia	gladys knight the pips	georgia	9
california dreamin	the mamas the papas	california	8

Parameter estimation

Estimate the parameters of the beta distribution in 3 ways:

Maximum-likelihood Estimation (MLE) with fitdistr package
Limited memory BFGS with the rstan package
Hamiltonian Monte Carlo (HMC) with Stan and rstan package

MLE with `fitdistr`

Here I use fitdist to compare the parameter estimates and Log likelihood of three different distributions, beta, exponential, and log-normal. This method is very fast, but not as accurate as the full Bayesian estimate. Here the beta has the best fit. You could explore different distributions that make sense for the data, but since Silge used beta, I am also.

## beta boot for comparison
beta_fit <- fitdist(state_counts_zeros$rate,"beta") # best logLik
summary(beta_fit)

Fitting of the distribution ' beta ' by maximum likelihood 
Parameters : 
        estimate Std. Error
shape1 0.1617485 0.02453031
shape2 2.2298802 0.71328680
Loglikelihood:  212.6346   AIC:  -421.2692   BIC:  -417.4452 
Correlation matrix:
        shape1  shape2
shape1 1.00000 0.38538
shape2 0.38538 1.00000

exp_fit <- fitdist(state_counts_zeros$rate,"exp")
summary(exp_fit)

Fitting of the distribution ' exp ' by maximum likelihood 
Parameters : 
     estimate Std. Error
rate  12.9374   1.829625
Loglikelihood:  78.00613   AIC:  -154.0123   BIC:  -152.1002

lnorm_fit <- fitdist(state_counts_zeros$rate,"lnorm")
summary(lnorm_fit)

Fitting of the distribution ' lnorm ' by maximum likelihood 
Parameters : 
         estimate Std. Error
meanlog -7.170250  0.9150467
sdlog    6.470357  0.6470357
Loglikelihood:  194.204   AIC:  -384.408   BIC:  -380.584 
Correlation matrix:
              meanlog         sdlog
meanlog  1.000000e+00 -4.206895e-09
sdlog   -4.206895e-09  1.000000e+00

Optimizing in `rstan`

This is the first introduction to a Stan model contained in the character string opt_chr1. Typically the Stan model is saved in a separate file and called, but I kept it all in-house for this analysis.

opt_chr1 <- "
data {
  int<lower=0> N;
  real x[N];
}
parameters {
  real<lower = 0> alpha0;
  real<lower = 0> beta0;
}
model {
  alpha0 ~ normal(0, 1);
  beta0 ~ normal(0, 10);
  //target += beta_lpdf(x | alpha0, beta0); // same as below
  x ~ beta(alpha0, beta0);
}
"
# initialize parameter values (based on knowledge or fitdist results)
init_list <- list(alpha0 = 0.1, beta0 = 1)
# compile model (~ 10 to 15 seconds)
opt_mod1 <- stan_model(model_code = opt_chr1)
# optimize data given model
opt1 <- optimizing(object = opt_mod1, as_vector = FALSE,
                   data = list(x = state_counts_zeros$rate,
                               N = length(state_counts_zeros$rate)),
                   hessian = TRUE,
                   draws = 2500)

Parameters estimates and log likelihood

# view results
opt1$par

$alpha0
[1] 0.1614814

$beta0
[1] 2.217158

opt1$value #compare to LogLikelihood of summary(beta_fit)

[1] 212.5968

plot distribution of parameters to see dispersal and correlation

ggplot(data.frame(opt1$theta_tilde), aes(x = alpha0, y = beta0)) +
  geom_point(color = "skyblue3", alpha = 0.35) +
  geom_density2d(aes(colour =..level..)) + 
  scale_colour_gradient(low="gray80",high="firebrick") + 
  scale_x_continuous(breaks = seq(0,0.3,0.025)) +
  scale_y_continuous(breaks = seq(0,7,0.5)) +
  theme_bw() +
    labs(x = "alpha0",
       y = "beta0",
       title = "Distribution of Alpha and Beta Shape Parameters",
       subtitle = "2500 samples from MLE optimized beta model posterior") +
 theme(
    panel.border = element_rect(colour = "gray90"),
    axis.text.x = element_text(size = 8, family = "Trebuchet MS"),
    axis.text.y = element_text(size = 8, family = "Trebuchet MS"),
    axis.title = element_text(size = 10, family = "Trebuchet MS"),
    plot.caption = element_text(size = 7, hjust=0, margin=margin(t=5), 
                                family = "Trebuchet MS"),
    plot.title=element_text(family="TrebuchetMS-Bold"),
    legend.position = "none",
    panel.grid.minor = element_blank()
    )

Hamiltonian Monte Carlo (HMC)

Estimation and prediction

As the third method of estimation, we use a full Stan model and include a generated quantities block to make predictions of new rates for each state. In this block, new parameters for each state are drawn and calculated into expected mention rates based on the observed number of mentions and population. Estimating this within the model allows for the full integration over uncertainty. Another approach is to take the alpha and beta parameters distributions, sample those outside the model and calculate the state estimates.

model_string1_pred <- "
data {
  int<lower=1> N;
  vector[N] x;
  int<lower=1> M;
  vector[M] new_success;
  vector[M] new_attempts;
}
parameters {
  real<lower=0> alpha0;
  real<lower=0> beta0;
}
model {
  alpha0 ~ normal(0, 1);
  beta0 ~ normal(0, 10);
  x ~ beta(alpha0, beta0);
} generated quantities {
  vector[M] x_tilde; 
  for (n in 1:M)
    x_tilde[n] = beta_rng((new_success[n] + alpha0),
                (new_attempts[n] - new_success[n]  + beta0));
}
"

Run the Stan model

new_success = state_counts_zeros$n
new_attempts = (state_counts_zeros$pop2014)/100000
model_dat1_pred <- list(x = state_counts_zeros$rate, 
                   N = length(state_counts_zeros$rate),
                   new_success = new_success,
                   new_attempts = new_attempts,
                   M = length(new_success))
fit1_pred <- stan(model_code = model_string1_pred, 
                  data = model_dat1_pred,
                  iter = 10000, chains = 4,  warmup=2500)

Extract results

Show the 95% Credible Interval for alpha, beta, each state, and the log posterior lp__

fit1_pred_summary <- data.frame(summary(fit1_pred)[["summary"]]) %>%
  rownames_to_column() %>%
  mutate(Parameter = c("alpha0", "beta0",
                  as.character(state_counts_zeros$name), "lp__")) %>%
  dplyr::select(Parameter, mean, sd, X2.5., X97.5., n_eff, Rhat) %>%
  dplyr::rename(Mean = mean,
                SD = sd,
                `2.5%` = X2.5.,
                `97.5%` = X97.5.)
kable(fit1_pred_summary, digits = 3)

Parameter	Mean	SD	2.5%	97.5%	n_eff	Rhat
alpha0	0.167	0.024	0.124	0.217	21096.54	1
beta0	2.490	0.718	1.303	4.110	22051.02	1
New York	0.308	0.033	0.245	0.373	29857.22	1
California	0.089	0.014	0.063	0.119	29855.18	1
Georgia	0.218	0.040	0.144	0.302	29343.73	1
Tennessee	0.211	0.049	0.123	0.315	29874.16	1
Texas	0.050	0.013	0.027	0.079	30000.00	1
Alabama	0.240	0.059	0.134	0.365	30000.00	1
Mississippi	0.313	0.080	0.167	0.478	28211.54	1
Kentucky	0.154	0.052	0.066	0.270	29731.69	1
Hawaii	0.372	0.117	0.162	0.610	29625.26	1
Illinois	0.047	0.018	0.018	0.089	30000.00	1
Indiana	0.091	0.035	0.035	0.169	28416.92	1
Virginia	0.073	0.028	0.028	0.137	30000.00	1
Louisiana	0.106	0.044	0.037	0.207	30000.00	1
Arizona	0.061	0.029	0.018	0.127	29321.09	1
Florida	0.021	0.010	0.006	0.045	28980.35	1
Ohio	0.035	0.017	0.010	0.075	29935.66	1
Oklahoma	0.102	0.047	0.031	0.210	29909.97	1
Colorado	0.058	0.031	0.013	0.134	29557.50	1
Maine	0.199	0.097	0.050	0.418	30000.00	1
Nebraska	0.150	0.075	0.035	0.322	29598.46	1
Washington	0.044	0.024	0.010	0.104	30000.00	1
Kansas	0.068	0.044	0.010	0.176	29175.99	1
Montana	0.172	0.103	0.026	0.415	29853.08	1
Nevada	0.072	0.046	0.010	0.184	29922.60	1
New Jersey	0.024	0.016	0.003	0.063	30000.00	1
Pennsylvania	0.017	0.011	0.002	0.044	30000.00	1
West Virginia	0.103	0.064	0.015	0.256	30000.00	1
Arkansas	0.036	0.033	0.001	0.124	28833.05	1
Idaho	0.063	0.055	0.003	0.206	29375.92	1
Maryland	0.019	0.017	0.001	0.063	28763.70	1
Michigan	0.012	0.011	0.000	0.040	30000.00	1
Minnesota	0.021	0.019	0.001	0.071	29458.81	1
North Dakota	0.121	0.101	0.006	0.379	29224.08	1
Alaska	0.017	0.039	0.000	0.135	30000.00	1
Connecticut	0.004	0.011	0.000	0.035	29449.01	1
Delaware	0.014	0.033	0.000	0.113	28917.46	1
Iowa	0.005	0.012	0.000	0.041	29729.80	1
Massachusetts	0.002	0.006	0.000	0.020	29606.32	1
Missouri	0.003	0.007	0.000	0.022	29339.44	1
New Hampshire	0.010	0.024	0.000	0.082	30000.00	1
New Mexico	0.007	0.017	0.000	0.056	29956.60	1
North Carolina	0.002	0.004	0.000	0.014	30000.00	1
Oregon	0.004	0.010	0.000	0.032	30000.00	1
Rhode Island	0.012	0.030	0.000	0.101	27519.67	1
South Carolina	0.003	0.008	0.000	0.027	29300.92	1
South Dakota	0.015	0.035	0.000	0.120	30000.00	1
Utah	0.005	0.013	0.000	0.044	30000.00	1
Vermont	0.019	0.043	0.000	0.149	29881.66	1
Wisconsin	0.003	0.007	0.000	0.022	29677.02	1
Wyoming	0.020	0.046	0.000	0.160	28480.49	1
lp__	210.722	0.954	208.133	211.627	14563.81	1

Realtionship of Bayesian to observed estimate

state_estimates <- rstan::extract(fit1_pred, pars = "x_tilde") %>%
  data.frame() %>%
  rename_(.dots=setNames(names(.),state_counts_zeros$state_name)) %>%
  gather() %>%
  dplyr::rename(state_name = key) %>%
  group_by(state_name) %>%
  dplyr::summarise(q025 = quantile(value, probs = 0.025),
                   q5 = quantile(value, probs = 0.5),
                   q975 = quantile(value, probs = 0.975),
                   mean = mean(value)) %>%
  left_join(.,state_counts_zeros)

The plot design here is based entirely on Silge's visualization

### could melt and add q025,q5,q975 by color/shape
### could also predict across range of rates and show areas

ggplot(state_estimates, aes(rate, mean)) +
  geom_abline(intercept = 0, slope = 1, color = "gray70", linetype = 2) +
  geom_point(size = 4, aes(color = n)) +
  geom_text_repel(aes(label = abbrev), stat = "identity",
                  point.padding = unit(0.5, "lines"),
                  max.iter = 5000) +
  scale_color_gradient(low = "midnightblue", high = "pink",
                       name="Number\nof songs") +
  labs(title = "States in Song Lyrics with Empirical Bayes",
       subtitle = "States like Montana and Hawaii (high rates, few mentions) are shifted the most",
       x = "Measured rate of mentions per 100k population",
       y = "Mean predicted rate per 100k population",
       caption = "plot design by @juliasilge") +
  theme_minimal(base_family = "Trebuchet MS") +
  theme(plot.title=element_text(family="Trebuchet MS"))

Bayesian vs. Observed estimate and 95% CI by state

The plot design here is based entirely on Silge's visualization

state_estimates %>% 
  arrange(desc(mean)) %>% 
  mutate(state_name = factor(name, levels = rev(unique(name)))) %>%
  dplyr::select(state_name, 'Measured rate' = rate, 
         'Bayesian estimate' = mean, q025, q975) %>% 
  gather(type, rate, `Measured rate`, `Bayesian estimate`) %>%
  ggplot(aes(rate, state_name, color = type)) +
  geom_errorbarh(aes(xmin = q025, xmax = q975), color = "gray50") +
  geom_point(size = 3) +
  xlim(0, NA) +
   labs(x = "Rate of mentions per 100k population",
       y = NULL, title = "Measured Rates, Bayesian Estimates (HMC), and 95% Credible Intervals",
       subtitle = "Mention rate for states sorted by descending posterior mean",
       caption = "plot design by @juliasilge") +
  theme_minimal(base_family = "Trebuchet MS") +
  theme(plot.title=element_text(family="Trebuchet MS", face = "bold")) +
  theme(legend.title=element_blank())

Adding City Counts

Below is a repeate of the analysis above, but using that dataset that includes the additional counts of cities mentions: `state_city_counts_zeros`

new_success_SC = state_city_counts_zeros$n_city_state
new_attempts_SC = (state_city_counts_zeros$pop2014)/100000
model_SC_pred <- list(x = state_city_counts_zeros$city_state_rate, 
                   N = length(state_city_counts_zeros$city_state_rate),
                   new_success = new_success_SC,
                   new_attempts = new_attempts_SC,
                   M = length(new_success_SC))

Fit same Stan model as above, but new data.

fit_SC_pred <- stan(model_code = model_string1_pred, 
                  data = model_SC_pred,
                  iter = 10000, chains = 4,  warmup=2500)

Summarise Stan fit

fit_SC_pred_summary <- data.frame(summary(fit_SC_pred)[["summary"]]) %>%
  rownames_to_column() %>%
  mutate(Parameter = c("alpha0", "beta0",
                  as.character(state_counts_zeros$name), "lp__")) %>%
  dplyr::select(Parameter, mean, sd, X2.5., X97.5., n_eff, Rhat) %>%
  dplyr::rename(Mean = mean,
                SD = sd,
                `2.5%` = X2.5.,
                `97.5%` = X97.5.)
kable(fit_SC_pred_summary,  digits = 3)

Parameter	Mean	SD	2.5%	97.5%	n_eff	Rhat
alpha0	0.225	0.033	0.165	0.296	17279.23	1
beta0	1.646	0.406	0.951	2.549	16973.93	1
New York	0.643	0.034	0.575	0.708	29848.10	1
California	0.189	0.020	0.151	0.230	30000.00	1
Georgia	0.398	0.049	0.305	0.496	30000.00	1
Tennessee	0.545	0.061	0.425	0.661	30000.00	1
Texas	0.176	0.023	0.133	0.224	30000.00	1
Alabama	0.344	0.067	0.220	0.479	29206.28	1
Mississippi	0.322	0.081	0.175	0.491	30000.00	1
Kentucky	0.223	0.061	0.116	0.355	29811.61	1
Hawaii	0.394	0.120	0.176	0.638	29736.12	1
Illinois	0.178	0.033	0.117	0.247	30000.00	1
Indiana	0.107	0.038	0.045	0.192	30000.00	1
Virginia	0.098	0.033	0.045	0.171	30000.00	1
Louisiana	0.547	0.072	0.407	0.685	30000.00	1
Arizona	0.122	0.039	0.056	0.209	29910.51	1
Florida	0.129	0.024	0.086	0.181	29215.51	1
Ohio	0.053	0.021	0.020	0.100	30000.00	1
Oklahoma	0.206	0.063	0.098	0.342	30000.00	1
Colorado	0.153	0.049	0.071	0.259	30000.00	1
Maine	0.214	0.102	0.054	0.448	30000.00	1
Nebraska	0.452	0.108	0.246	0.666	29657.57	1
Washington	0.074	0.031	0.026	0.145	29910.37	1
Kansas	0.073	0.046	0.011	0.187	30000.00	1
Montana	0.186	0.109	0.030	0.445	29356.45	1
Nevada	0.245	0.078	0.111	0.411	29610.70	1
New Jersey	0.036	0.019	0.008	0.082	28206.21	1
Pennsylvania	0.063	0.021	0.028	0.112	29625.22	1
West Virginia	0.109	0.068	0.016	0.272	30000.00	1
Arkansas	0.135	0.060	0.040	0.272	29560.75	1
Idaho	0.124	0.075	0.019	0.305	29396.58	1
Maryland	0.070	0.033	0.020	0.146	30000.00	1
Michigan	0.102	0.030	0.051	0.167	29490.48	1
Minnesota	0.040	0.026	0.006	0.104	30000.00	1
North Dakota	0.137	0.109	0.007	0.412	29612.46	1
Alaska	0.025	0.049	0.000	0.174	28267.34	1
Connecticut	0.006	0.012	0.000	0.041	29777.99	1
Delaware	0.020	0.041	0.000	0.147	30000.00	1
Iowa	0.007	0.014	0.000	0.050	30000.00	1
Massachusetts	0.076	0.032	0.026	0.148	29567.38	1
Missouri	0.068	0.032	0.020	0.142	29661.64	1
New Hampshire	0.015	0.030	0.000	0.106	30000.00	1
New Mexico	0.055	0.047	0.003	0.177	29817.16	1
North Carolina	0.043	0.020	0.012	0.090	30000.00	1
Oregon	0.006	0.012	0.000	0.040	29704.75	1
Rhode Island	0.099	0.082	0.005	0.306	29375.50	1
South Carolina	0.066	0.035	0.015	0.149	27237.50	1
South Dakota	0.022	0.044	0.000	0.155	30000.00	1
Utah	0.008	0.015	0.000	0.054	30000.00	1
Vermont	0.027	0.054	0.000	0.191	29276.90	1
Wisconsin	0.004	0.008	0.000	0.028	30000.00	1
Wyoming	0.030	0.059	0.000	0.211	29559.97	1
lp__	122.002	0.964	119.375	122.919	14550.35	1

Prepare fit estimates for plotting

state_city_estimates <- rstan::extract(fit_SC_pred, pars = "x_tilde") %>%
  data.frame() %>%
  rename_(.dots=setNames(names(.),state_city_counts_zeros$state_name)) %>%
  gather() %>%
  dplyr::rename(state_name = key) %>%
  group_by(state_name) %>%
  dplyr::summarise(q025 = quantile(value, probs = 0.025),
                   q5 = quantile(value, probs = 0.5),
                   q975 = quantile(value, probs = 0.975),
                   mean = mean(value)) %>%
  left_join(.,state_city_counts_zeros)

Joining, by = "state_name"

Realtionship of Bayesian to observed state + city estimate

The plot design here is based entirely on Silge's visualization

### could melt and add q025,q5,q975 by color/shape
### could also predict across range of rates and show areas
ggplot(state_city_estimates, aes(city_state_rate, mean)) +
  geom_abline(intercept = 0, slope = 1, color = "gray70", linetype = 2) +
  geom_point(size = 4, aes(color = n_city_state)) +
  geom_text_repel(aes(label = abbrev), stat = "identity",
                  point.padding = unit(0.5, "lines"),
                  max.iter = 5000) +
  scale_color_gradient(low = "midnightblue", high = "pink",
                       name="Number\nof songs") +
  labs(title = "States & Cities in Song Lyrics Modeled with Bayes (HMC)",
       subtitle = "States like Nebraska and Hawaii (high rates, few mentions) are shifted the most",
       x = "Measured rate of mentions per 100k population",
       y = "Mean predicted rate per 100k population",
       caption = "plot design by @juliasilge") +
  theme_minimal(base_family = "Trebuchet MS") +
  theme(plot.title=element_text(family="Trebuchet MS"))

Bayesian vs. Observed state + city estimate and 95% CI by state

The plot design here is based entirely on Silge's visualization

### range estiamtes plot
state_city_estimates %>% 
  arrange(desc(mean)) %>% 
  mutate(state_name = factor(name, levels = rev(unique(name)))) %>%
  dplyr::select(state_name, 'Measured rate' = city_state_rate, 
         'Bayesian estimate' = mean, q025, q975) %>% 
  gather(type, city_state_rate, `Measured rate`, `Bayesian estimate`) %>%
  ggplot(aes(city_state_rate, state_name, color = type)) +
  geom_errorbarh(aes(xmin = q025, xmax = q975), color = "gray50") +
  geom_point(size = 3) +
  xlim(0, NA) +
  labs(x = "Rate of mentions per 100k population",
       y = NULL, title = "Measured Rates, Bayesian Estimates (HMC), and 95% Credible Intervals",
       subtitle = "Mention rate for states & cities sorted by descending posterior mean",
       caption = "plot design by @juliasilge") +
  theme_minimal(base_family = "Trebuchet MS") +
  theme(plot.title=element_text(family="Trebuchet MS", face = "bold")) +
  theme(legend.title=element_blank())

Model Comparison

There are lots of ways to compare which model is "better", but I am not going to go crazy with it. Log posteriors from the Stan models can be compared, but I do not believe they can be used for inference directly from the form they are reported from the fit. We could also do things like hold-out samples, cross-validations, Leave-One-Out CV, or use information criteria such as WAIC. See Gelman and go down the rabbit hole from there.

Here I take a simple approach and calculate a few metrics based a few things:

Loss Metrics
- Mean Absolute Error (MAE)
- Root Mean Square Error (RMSE)
Credible Interval (CI)
- mean width of 95% CI
Predicted Mentions
- RMSE of expected mentions vs observed mentions
- MAE of the same

These result show that the state counts only model may have the slightest advantage over the states + cities counts, but it really is very small. For me, the choice of model would simply be which fits my purpose better; both are pretty darn good at estimating the number of mentions. the MAE of expected mentions per state is less than 0.2 of a mention; rounded down to no real error at all in that category. The beta distribution accurately described these data

city_state_error <- state_city_estimates %>%
  mutate(rate_error = mean - city_state_rate,
         pred_mentions = round(mean * (pop2014/100000),1),
         mention_error = n_city_state - pred_mentions,
         CI_width = q975 - q025) %>%
  dplyr::summarise(RMSE_rate = sqrt(mean(rate_error^2)),
            MAE_rate = mean(abs(rate_error)),
            mean_CI = mean(CI_width),
            RMSE_mentions = sqrt(mean(mention_error^2)),
            MAE_mentions = mean(abs(mention_error))) %>%
  as.numeric()

state_error <- state_estimates %>%
  mutate(rate_error = mean - rate,
         pred_mentions = round(mean * (pop2014/100000),1),
         mention_error = n - pred_mentions,
         CI_width = q975 - q025) %>%
  dplyr::summarise(RMSE_rate = sqrt(mean(rate_error^2)),
            MAE_rate = mean(abs(rate_error)),
            median_CI = median(CI_width),
            RMSE_mentions = sqrt(mean(mention_error^2)),
            MAE_mentions = mean(abs(mention_error))) %>%
  as.numeric()

#print
model_rmse <- data.frame(model = c("States Only", "City and States"),
           RMSE_rate =  c(state_error[1], city_state_error[1]),
           MAE_rate =  c(state_error[2], city_state_error[2]),
           Median_CI =  c(state_error[3], city_state_error[3]),
           RMSE_mentions =  c(state_error[4], city_state_error[4]),
           MAE_mentions =  c(state_error[5], city_state_error[5]))
kable(model_rmse, digits = 3)

model	RMSE_rate	MAE_rate	Median_CI	RMSE_mentions	MAE_mentions
States Only	0.013	0.007	0.111	0.246	0.174
City and States	0.013	0.008	0.182	0.416	0.236

Create R scipt

From this markdown, I use the purl() function in the rmarkdown package to extract the code. This is run just after the *.Rmd file is created. Otherwise, it trys to read an already open file and fails to knit()

rmd_loc <- "/Users/mattharris/Documents/R_Local/Put_a_prior_on_it-Blog_post/"
# purl(input  = paste0(rmd_loc, "README.Rmd"), 
#     output = paste0(rmd_loc, "R_script/README_r_script.R"))

Thanks for reading!!!!

Environment

sessionInfo()

R version 3.3.1 (2016-06-21)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.5 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] tools     grid      stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] rmarkdown_1.0      knitr_1.14         rstan_2.11.1      
 [4] StanHeaders_2.11.0 fitdistrplus_1.0-7 survival_2.39-4   
 [7] MASS_7.3-45        broom_0.4.1        ggrepel_0.6.2     
[10] scales_0.4.0.9003  tidytext_0.1.1     reshape2_1.4.1    
[13] acs_2.0            XML_3.98-1.4       plyr_1.8.4        
[16] stringr_1.1.0      purrr_0.2.2        readr_1.0.0       
[19] tidyr_0.6.0        tibble_1.2         tidyverse_1.0.0   
[22] dplyr_0.5.0        ggplot2_2.1.0.9001

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7       highr_0.6         formatR_1.4      
 [4] bitops_1.0-6      tokenizers_0.1.4  digest_0.6.10    
 [7] evaluate_0.9      gtable_0.2.0      nlme_3.1-128     
[10] lattice_0.20-33   psych_1.6.9       Matrix_1.2-6     
[13] DBI_0.5           parallel_3.3.1    yaml_2.1.13      
[16] gridExtra_2.2.1   janeaustenr_0.1.2 stats4_3.3.1     
[19] inline_0.3.14     R6_2.1.3          foreign_0.8-66   
[22] magrittr_1.5      codetools_0.2-14  splines_3.3.1    
[25] SnowballC_0.5.1   htmltools_0.3.5   mnormt_1.5-4     
[28] assertthat_0.1    colorspace_1.2-6  labeling_0.3     
[31] stringi_1.1.1     RCurl_1.95-4.8    lazyeval_0.2.0   
[34] munsell_0.4.3

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
README_cache/markdown_github		README_cache/markdown_github
README_files/figure-markdown_github		README_files/figure-markdown_github
R_script		R_script
Rmarkdown		Rmarkdown
data		data
README.Rmd		README.Rmd
README.md		README.md

mrecos/Put_a_prior_on_it-Blog_post

Folders and files

Latest commit

History

Repository files navigation

Puttin'a Prior on It: Stan Beta Estmation of Song Lyrics

Introduction

Estimating a Beta Distribution with Stan HMC

Repository for data, analysis (markdown), and R code

Differences from Silge's Analysis

Why Go Bayes?

Packages

Get Data

Join, mutate, and munge the data

Prepare data for extracting lyrics

Join lyrics to geography

Do the same as above, but for cities

Join city counts to state counts and compute new rate

Fun facts...

Some interesting things about these data

1. States mentioned by their cities, but not the state itself

2. What song mentions the most unique state names?

3. What song mentions a single state the most number of times?

Parameter estimation

Estimate the parameters of the beta distribution in 3 ways:

MLE with fitdistr

Optimizing in rstan

Parameters estimates and log likelihood

Hamiltonian Monte Carlo (HMC)

Estimation and prediction

Run the Stan model

Extract results

Realtionship of Bayesian to observed estimate

Bayesian vs. Observed estimate and 95% CI by state

Adding City Counts

Below is a repeate of the analysis above, but using that dataset that includes the additional counts of cities mentions: state_city_counts_zeros

Fit same Stan model as above, but new data.

Summarise Stan fit

Prepare fit estimates for plotting

Realtionship of Bayesian to observed state + city estimate

Bayesian vs. Observed state + city estimate and 95% CI by state

Model Comparison

Create R scipt

Thanks for reading!!!!

Environment

About

Resources

Stars

Watchers

Forks

Languages

MLE with `fitdistr`

Optimizing in `rstan`

Below is a repeate of the analysis above, but using that dataset that includes the additional counts of cities mentions: `state_city_counts_zeros`