title | author | date | output | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Methods |
Nick Michalak |
10/28/2019 |
|
Methods
Plain Language Summary
I modeled a separate polling trend for each candidate that accounts for variability in pollster and correlations between polling dates.
Data
- I used polling data collected by FiveThirtyEight (github.com/fivethirtyeight/data).
Wrangling and Selecting Data
- I computed date as the median of the start and the end dates for each poll (e.g., median(2019-10-01, 2019-10-03) = 2019-10-02).
- I excluded head-to-head polls.
- For a given model, I use the most inclusive population of voters from a survey: "Adults" and "Voters" populations are equally inclusive, and they're more inclusive than "Registered Voters," which is more inclusive than "Likely Voters."
Model
- I estimated polling prediction trends for each candidate via Generalized Additive Mixed Model.
$\ln(\frac{p}{1 - p}) = b_0 + f_1(\text{Median Date}) + \epsilon$
- I modeled proportions as an outcome with a binomial function with a logit link. I weighted by sample size, which is equivalent to modeling binary data (i.e., 1 = success, 0 = failure).
- For each model, I specified random intercepts for pollster (polls nested within pollster) as well as an auto-correlation structure of order 1 with a continuous time covariate (i.e., Continuous AR(1) Correlation).
- I have not come up with a way to handle tracker polls (i.e., same respondents surveyed over time). So, models treat tracker polls the same as other polls.
- When states have fewer than 10 polls, I simply fit a logistic regression line per candidate instead of the more complex GAMM fit (averages reflect the estimates from the logistic regression).
Estimates ("Averages") and Confidence Intervals (Uncertainty)
- I table the most recent fit estimates as "averages." Note that these values are sensitive to the trend and different from simply averaging the most recent polls.
- For averages based on GAMM fits, the confidence intervals are simultaneous (see the gratia R package documentation).
- For averages based on fewer than 10 observations, confidence intervals are based on the profile method (see the MASS R package documentation).
- Period Polling Averages reflect simple averages per candidate since a given polling start date (e.g., average since October 1st, 2019)
Historical Probability of Winning the Nomination
- I input the fitted polling averages for each candidate from the National Polling Average model (described above) into a logistic regression equation built from regressing nomination success (1 or 0) onto Historical Democratic Primary candidates' polling averages during either the first or second part of the year (January to June or June to December) (I pulled data from this FiveThirtyEight series by Geoffrey Skelley).
$\ln(\frac{p(nominated)}{1 - p(nominated)}) = b_0 + b_1(\text{Polling Average}) + \epsilon$
For example, if a candidate is pulling at 5% in March 2019, then the log odds of their winning the nomination is -3.46 + 8.65(0.05) = -3.02 (i.e., 4.6% = exp(-3.02) / (1 + exp(-3.02))). Their log odds don't improve much if they are polling at 5% in October 2019: -3.36 + 8.28(0.05) = -2.95 (i.e., 5.0% = exp(-2.95) / (1 + exp(-2.95))).
Cautionary Words
This is a work in progress. I made a handful of decisions based on convenience. Here's a list of those:
- I used k = 10 as a default for the basis for the smooth term for each candidate's trend (see
choose.k
in the mgcv R package documentation). - I used defaults for the autoregressive correlation structure.
- I did not apply weights based on pollsters' historical accuracy or bias.
Resources
- Wood, S.N. (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semi parametric generalized linear models. Journal of the Royal Statistical Society (B) 3(1):3-36
- Wood S.N., N. Pya and B. Saefken (2016) Smoothing parameter and model selection for general smooth models (with discussion). Journal of the American Statistical Association 111:1548-1575.
- Wood, S.N. (2004) Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association. 99:673-686.
- Wood, S.N. (2017) Generalized Additive Models: An Introduction with R (2nd edition). Chapman and Hall/CRC.
- Wood, S.N. (2003) Thin-plate regression splines. Journal of the Royal Statistical Society (B) 65(1):95-114.
Other Poll Aggregator π Sites
- 270ToWin: 2020 Democratic Presidential Nomination
- The Economist: Who is ahead in the Democratic primary race?
- FiveThirtyEight: The 2020 Democratic Presidential Primary
- New York Times: Which Democrats Are Leading the 2020 Presidential Race?
- Real Clear Politics: 2020 Democratic Presidential Nomination