Skip to content

Cohort component population projection and reconstruction model in TMB

License

Notifications You must be signed in to change notification settings

mrc-ide/ccmpp.tmb

Repository files navigation

ccmpp.tmb

R-CMD-check Codecov test coverage

ccmpp.tmb is an implementation of a cohort component model of population projection (CCMPP) and Wheldon et al.’s Bayesian Population Reconstruction in the R package Template Model Builder (TMB).

The package was created to explore the tractability of implementing Bayesian Population Reconstruction in TMB and prototype ideas for structuring integration of simulation models as templated C++ for flexible integration into TMB models and callable R functions with Rcpp wrappers.

Installation

Install the development version from GitHub via devtools:

# install.packages("devtools")
devtools::install_github("mrc-ide/ccmpp.tmb")

Example

Construct a sparse Leslie matrix:

library(tidyverse)
library(ccmpp.tmb)
library(popReconstruct)

data(burkina_faso_females)

make_leslie_matrixR(sx = burkina.faso.females$survival.proportions[,1],
                    fx = burkina.faso.females$fertility.rates[4:10, 1],
                    srb = 1.05,
                    age_span = 5,
                    fx_idx = 4)
#> 17 x 17 sparse Matrix of class "dgCMatrix"
#>                                                                                                                                                  
#>  [1,] .         .         0.2090608 0.5400452 0.6110685 0.5131988 0.3952854 0.2440665 0.1012326 0.01816255 .        .         .         .        
#>  [2,] 0.8782273 .         .         .         .         .         .         .         .         .          .        .         .         .        
#>  [3,] .         0.9713785 .         .         .         .         .         .         .         .          .        .         .         .        
#>  [4,] .         .         0.9730318 .         .         .         .         .         .         .          .        .         .         .        
#>  [5,] .         .         .         0.9577709 .         .         .         .         .         .          .        .         .         .        
#>  [6,] .         .         .         .         0.9481755 .         .         .         .         .          .        .         .         .        
#>  [7,] .         .         .         .         .         0.9460075 .         .         .         .          .        .         .         .        
#>  [8,] .         .         .         .         .         .         0.9393766 .         .         .          .        .         .         .        
#>  [9,] .         .         .         .         .         .         .         0.9258789 .         .          .        .         .         .        
#> [10,] .         .         .         .         .         .         .         .         0.9052283 .          .        .         .         .        
#> [11,] .         .         .         .         .         .         .         .         .         0.87537666 .        .         .         .        
#> [12,] .         .         .         .         .         .         .         .         .         .          0.832338 .         .         .        
#> [13,] .         .         .         .         .         .         .         .         .         .          .        0.7736165 .         .        
#> [14,] .         .         .         .         .         .         .         .         .         .          .        .         0.6966118 .        
#> [15,] .         .         .         .         .         .         .         .         .         .          .        .         .         0.5928803
#> [16,] .         .         .         .         .         .         .         .         .         .          .        .         .         .        
#> [17,] .         .         .         .         .         .         .         .         .         .          .        .         .         .        
#>                                    
#>  [1,] .         .         .        
#>  [2,] .         .         .        
#>  [3,] .         .         .        
#>  [4,] .         .         .        
#>  [5,] .         .         .        
#>  [6,] .         .         .        
#>  [7,] .         .         .        
#>  [8,] .         .         .        
#>  [9,] .         .         .        
#> [10,] .         .         .        
#> [11,] .         .         .        
#> [12,] .         .         .        
#> [13,] .         .         .        
#> [14,] .         .         .        
#> [15,] .         .         .        
#> [16,] 0.4547571 .         .        
#> [17,] .         0.3181678 0.2099861

Simulate a cohort component population projection:

pop_proj <- ccmppR(basepop = as.numeric(burkina.faso.females$baseline.pop.counts),
                   sx = burkina.faso.females$survival.proportions,
                   fx = burkina.faso.females$fertility.rates[4:10, ],
                   gx = burkina.faso.females$migration.proportions,
                   srb = rep(1.05, ncol(burkina.faso.females$survival.proportions)),
                   age_span = 5,
                   fx_idx = 4)
pop_proj$population[ , c(1, 2, 10)]
#>         [,1]       [,2]       [,3]
#>  [1,] 386000 496963.688 1369041.17
#>  [2,] 292000 338995.727 1088715.23
#>  [3,] 260000 283642.516  952860.73
#>  [4,] 244000 246278.270  846073.20
#>  [5,] 207000 221576.949  719894.38
#>  [6,] 175000 186062.343  572001.86
#>  [7,] 153000 156791.159  458905.67
#>  [8,] 135000 136059.686  379925.83
#>  [9,] 117000 118338.831  309642.33
#> [10,]  98000 100304.139  208006.24
#> [11,]  78000  81282.772  154298.67
#> [12,]  60000  63137.000  114066.08
#> [13,]  43000  46416.989   90879.78
#> [14,]  29000  29954.307   65876.04
#> [15,]  17000  17193.530   41985.79
#> [16,]   8000   7730.870   22044.43
#> [17,]   2000   2965.314   11332.85

TMB

Calculate a population projection in TMB. Carry forward 2000 values for two further periods to explore projections.

basepop_init <- as.numeric(burkina.faso.females$baseline.pop.counts)

sx_init <- burkina.faso.females$survival.proportions
sx_init <- cbind(sx_init, `2005` = sx_init[ , "2000"], `2010` = sx_init[ , "2000"])

fx_init <- burkina.faso.females$fertility.rates[4:10, ]
fx_init <- cbind(fx_init, `2005` = fx_init[ , "2000"], `2010` = fx_init[ , "2000"])

gx_init <- burkina.faso.females$migration.proportions
gx_init <- cbind(gx_init, `2005` = gx_init[ , "2000"], `2010` = gx_init[ , "2000"])

log_basepop_mean <- as.vector(log(basepop_init))
logit_sx_mean <- as.vector(qlogis(sx_init))
log_fx_mean <- as.vector(log(fx_init))
gx_mean <- as.vector(gx_init)
  
data <- list(log_basepop_mean = log_basepop_mean,
             logit_sx_mean = logit_sx_mean,
             log_fx_mean = log_fx_mean,
             gx_mean = gx_mean,
             srb = rep(1.05, ncol(sx_init)),
             age_span = 5,
             n_steps = ncol(sx_init),
             fx_idx = 4L,
             fx_span = 7L,
             census_log_pop = log(burkina.faso.females$census.pop.counts),
             census_year_idx = c(4L, 6L, 8L, 10L))
par <- list(log_tau2_logpop = 0,
            log_tau2_sx = 0,
            log_tau2_fx = 0,
            log_tau2_gx = 0,
            log_basepop = log_basepop_mean,
            logit_sx = logit_sx_mean,
            log_fx = log_fx_mean,
            gx = gx_mean)

obj <- make_tmb_obj(data, par)

init_sim <- obj$report()
matrix(init_sim$population, nrow = 17)[ , data$census_year_idx]
#>             [,1]       [,2]        [,3]       [,4]
#>  [1,] 605122.263 795208.536 1033996.879 1369041.17
#>  [2,] 500213.483 632644.932  848330.217 1088715.23
#>  [3,] 434412.243 542556.373  724519.032  952860.73
#>  [4,] 316003.052 470378.768  599148.299  846073.20
#>  [5,] 247736.574 387113.613  483615.125  719894.38
#>  [6,] 203773.553 267704.942  402180.255  572001.86
#>  [7,] 181898.908 207461.542  331145.222  458905.67
#>  [8,] 151475.723 168764.539  226610.907  379925.83
#>  [9,] 125079.907 148891.568  173019.647  309642.33
#> [10,] 105028.794 120751.178  137082.748  208006.24
#> [11,]  86990.038  95552.187  117356.990  154298.67
#> [12,]  70264.782  77258.457   92546.531  114066.08
#> [13,]  53708.538  61306.324   71116.143   90879.78
#> [14,]  36336.814  43957.286   52091.893   65876.04
#> [15,]  20722.535  26639.236   33648.268   41985.79
#> [16,]   8842.486  12213.805   16905.919   22044.43
#> [17,]   3494.269   4906.748    7694.051   11332.85

obj$fn()
#> [1] 110.8335
#> attr(,"logarithm")
#> [1] TRUE


input <- list(data = data, par_init = par)
fit <- fit_tmb(input)
#>   0:     110.83349:  0.00000  0.00000  0.00000  0.00000
#>   1:     24.892728:  3.47909 0.328263 0.303597  1.92250
#>   2:    -39.081896:  5.44273  1.52064  1.06743  5.10666
#>   3:    -56.306496:  6.80894  2.66742  1.68368  4.98032
#>   4:    -56.845045:  6.34157  3.88053  2.61444  5.99075
#>   5:    -64.797080:  6.63179  3.87162  2.76071  5.10270
#>   6:    -66.807665:  6.84558  4.12277  3.61226  5.34850
#>   7:    -68.296168:  6.65362  4.16468  4.52944  5.22802
#>   8:    -68.630793:  6.74695  4.37864  5.18194  5.02382
#>   9:    -68.736453:  6.68438  4.50870  4.91524  5.05102
#>  10:    -68.738063:  6.69140  4.51487  4.94850  5.05221
#>  11:    -68.738076:  6.69265  4.51519  4.94882  5.05129
#>  12:    -68.738077:  6.69239  4.51545  4.94908  5.05136
#>  13:    -68.738077:  6.69241  4.51543  4.94903  5.05137
#> converged: relative convergence (4)

fit[1:6]
#> $par
#> log_tau2_logpop     log_tau2_sx     log_tau2_fx     log_tau2_gx 
#>        6.692407        4.515433        4.949035        5.051373 
#> 
#> $objective
#> [1] -68.73808
#> 
#> $convergence
#> [1] 0
#> 
#> $iterations
#> [1] 13
#> 
#> $evaluations
#> function gradient 
#>       18       14 
#> 
#> $message
#> [1] "relative convergence (4)"

Sample from posterior distribution and generate outputs

fit <- sample_tmb(fit)



colnames(fit$sample$population) <- 1:ncol(fit$sample$population)
colnames(fit$sample$migrations) <- 1:ncol(fit$sample$migrations)
colnames(fit$sample$fx) <- 1:ncol(fit$sample$fx)

init_pop_mat <- ccmpp_leslieR(basepop_init, sx_init, fx_init, gx_init,
                              srb = rep(1.05, ncol(sx_init)), age_span = 5, fx_idx = 4)

df <- crossing(year = seq(1960, 2015, 5),
               sex = "female",
               age_group = c(sprintf("%02d-%02d", 0:15*5, 0:15*5+4), "80+")) %>%
  mutate(init_pop = as.vector(init_pop_mat))

census_pop <- crossing(sex = "female",
                       age_group = c(sprintf("%02d-%02d", 0:15*5, 0:15*5+4), "80+")) %>%
  bind_cols(as_tibble(burkina.faso.females$census.pop.counts)) %>%
  gather(year, census_pop, `1975`:`2005`) %>%
  type_convert(cols(year = col_double()))

df <- df %>%
  left_join(census_pop) %>%
  bind_cols(as_tibble(fit$sample$population)) %>%
  gather(sample, value, `1`:last_col())
#> Joining, by = c("year", "sex", "age_group")

agepop <- df %>%
  group_by(year, sex, age_group) %>%
  summarise(init_pop = mean(init_pop),
            census_pop = mean(census_pop),
            mean = mean(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975))
#> `summarise()` has grouped output by 'year', 'sex'. You can override using the `.groups` argument.

totalpop <- df %>%
  group_by(year, sample) %>%
  summarise(init_pop = sum(init_pop),
            census_pop = sum(census_pop),
            value = sum(value)) %>%
  group_by(year) %>%
  summarise(init_pop = mean(init_pop),
            census_pop = mean(census_pop),
            mean = mean(value),
            lower = quantile(value, 0.025),
            upper = quantile(value, 0.975))
#> `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.
ggplot(totalpop, aes(year, mean, ymin = lower, ymax = upper)) +
  geom_ribbon(alpha = 0.2) +
  geom_line() +
  geom_line(aes(y = init_pop), linetype = "dashed") +
  geom_point(aes(y = census_pop), shape = 4, color = "darkred", stroke = 2) +
  scale_y_continuous("Total population (millions)", labels = scales::number_format(scale = 1e-6)) +
  expand_limits(y = 0) +
  labs(x = NULL) +
  theme_light() +
  ggtitle("BFA females: total population")
#> Warning: Removed 8 rows containing missing values (geom_point).

ggplot(agepop, aes(age_group, mean, ymin = lower, ymax = upper, group = 1)) +
  geom_ribbon(alpha = 0.2) +
  geom_line() +
  geom_line(aes(y = init_pop), linetype = "dashed") +
  geom_point(aes(y = census_pop), color = "darkred") +
  facet_wrap(~year, scales = "free_y") + 
  scale_y_continuous("Total population (thousands)", labels = scales::number_format(scale = 1e-3)) +
  expand_limits(y = 0) +
  theme_light() +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        panel.grid = element_blank()) +
  ggtitle("BFA females: population by age")
#> Warning: Removed 136 rows containing missing values (geom_point).

Posterior distribution for TFR:

asfr <- crossing(year = seq(1960, 2010, 5),
                 age_group = sprintf("%02d-%02d", 3:9*5, 3:9*5+4)) %>%
  mutate(init_asfr = as.vector(fx_init)) %>%
  bind_cols(as_tibble(fit$sample$fx)) %>%
  gather(sample, value, `1`:last_col()) 

tfr <- asfr %>%
  group_by(year, sample) %>%
  summarise(init_tfr = 5 * sum(init_asfr),
            value = 5 * sum(value)) %>%
  group_by(year) %>%
  summarise(init_tfr = mean(init_tfr),
         mean = mean(value),
         lower = quantile(value, 0.025),
         upper = quantile(value, 0.975))
#> `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

ggplot(tfr, aes(year, mean, ymin = lower, ymax = upper)) +
  geom_ribbon(alpha = 0.2) +
  geom_line() +
  geom_line(aes(y = init_tfr), linetype = "dashed") +
  scale_y_continuous("Total fertility rate", limits = c(5, 9)) +
  labs(x = NULL) +
  theme_light() +
  theme(panel.grid = element_blank()) +
  ggtitle("BFA: total fertility rate")

migrations <- crossing(year = seq(1960, 2010, 5),
                       age_lower = 0:16*5) %>%
  mutate(init_migrations = init_sim$migrations) %>%
  bind_cols(as_tibble(fit$sample$migrations)) %>%
  gather(sample, value, `1`:last_col()) 

total_migrations <- migrations %>%
  group_by(year, sample) %>%
  summarise(init_migrations = sum(init_migrations),
            value = sum(value)) %>%
  group_by(year) %>%
  summarise(init_migrations = mean(init_migrations),
         mean = mean(value),
         lower = quantile(value, 0.025),
         upper = quantile(value, 0.975))
#> `summarise()` has grouped output by 'year'. You can override using the `.groups` argument.

ggplot(total_migrations, aes(year, mean, ymin = lower, ymax = upper)) +
  geom_ribbon(alpha = 0.2) +
  geom_line() +
  geom_line(aes(y = init_migrations), linetype = "dashed") +
  scale_y_continuous("Net migration (1000s)",
                     labels = scales::number_format(scale=1e-3)) +
  labs(x = NULL) +
  theme_light() +
  theme(panel.grid = element_blank()) +
  ggtitle("BFA: total net migrations (1000s)")

Code design

Simulation model

The simulation model is implemented as templated C++ code in src/ccmpp.h. This is the simulation model may be developed as a standalone C++ library that can be called by other software without requiring R-specific code features. The code uses header-only open source libraries to maximize portability.

Statistical inference

The objective function (the negative log posterior density) is implemented in templated C++ code in the script src/TMB/ccmpp_tmb.hpp using probability functions from the Template Model Builder (TMB) R package. The TMB package provides estimates of the gradient of the objective function via automatic differentiation and Laplace approximation to integrate random effects.

The posterior density can also be sampled from using Hamiltonian Monte Carlo in Stan using the tmbstan package.

Latent Gaussian models for model components (fertiliy rates, mortality rates, migration rates) are implemented using linear algebra tools from the Eigen package. Predicted values are coerced into arrays for the CCMPP model inputs.

R functions

The file src/ccmppR.cpp contains R wrapper functions for the model simulation via Rcpp and RcppEigen.

Development notes

Simulation model

  • The CCMPP model is implemented as a sparse Leslie matrix formulation and using direct calculation of the projection in arrays so that interim outputs (deaths, births, migrations) are also saved. The array-based implementation appears to be faster.
  • Class structure for popualtion projection model needs review.
  • Specifying static dimensions for the state space may improve efficiency. This should be possible for common options (5x5 year, 1x1 year) through templating.
  • The model was implemented using Eigen containers following the initial sparse Leslie matrix specification. However, multi-dimensional arrays in the boost library may be preferable.

TMB

  • Further investigation is needed about the portability of AD objective function DLLs outside of the R environment.

TMB model code and testing are implemented following templates from the TMBtools package with guidance for package development with both TMB models and Rcpp code.

  • To add a new TMB model, save the model template in the src/TMB with extension .hpp. The model name must match the file name. The signature is slightly different – see other .hpp files for example.
  • Call TMBtools::export_models() to export the new TMB model to the meta-model list.
  • When constructing the objective function with TMB::MakeADFun, use DLL= "ccmpp.tmb_TMBExports" and add an additional value model = "<model_name>" to the data list.

About

Cohort component population projection and reconstruction model in TMB

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published