seeking efficient approach to group_by age adjustment work #50

Closed · mcSamuelDataSci opened this issue Nov 26, 2018 · 13 comments

@mcSamuelDataSci (Owner)

Zev (and Nate)-

The first coding approach below works just fine, and the data frame "countyAA" has exactly what I need: for every county, year, sex, and CAUSE combination I get the age-adjusted death rate with its upper and lower confidence limits and SE. The only variable I am not grouping by is ageGroup, which is what drives the whole calculation.

BUT, it is clearly inefficient to run the ageadjust.direct function (part of the "epitools" package; modified by me into ageadjust.direct.SAM to include the SE and deal with some 0's) four times per group. I did this a while ago when I really needed to just get it done, but now I am trying to figure out how to do it more efficiently, and I have not been successful so far despite a fair bit of effort.

The second approach below partially works, but it puts all the results into a single column, as one character-string representation of the result vector--see the picture below (the rows with "NAs" have 0 deaths and are filtered out elsewhere). I could parse that string apart, but I bet there is a simpler, better way.

Any suggestions would be most appreciated.


countyAA <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(
            aRate   = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000, 
            aSE     = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[5]*100000
  )
          
countyAAtry2 <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
                            round(
                              ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)
                            )
                          )
            ) 

[screenshot: countyAAtry2 output, with the results collapsed into a single aMulti list column]
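For what it's worth, here is a sketch of one way to split that list column into named columns. It assumes tidyr >= 1.0 (for unnest_wider) and that ageadjust.direct.SAM returns a named vector like the stock epitools function; unique() is dropped because it strips the names that unnest_wider uses:

# Keep the full named result vector in the list column, then expand
# it so each element becomes its own column.
library(tidyr)  # >= 1.0 for unnest_wider()

countyAA_wide <- ageCounty %>%
  group_by(county, year, sex, CAUSE) %>%
  summarize(aMulti = list(
    ageadjust.direct.SAM(count = Ndeaths, pop = pop, rate = NULL,
                         stdpop = US2000POP, conf.level = 0.95) * 100000
  )) %>%
  unnest_wider(aMulti)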

@nteetor commented Nov 26, 2018

You could try the following,

ageCounty %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct.SAM(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )
    
    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000, 
      aSE  = sam[5] * 100000
    ) 
  }

@mcSamuelDataSci (Owner, Author)

Thanks for your quick reply. It runs and gives the right number of rows, but it does not generate the proper results; see below. I don't know how the "."s work in your code, but I thought maybe a %>% was missing before the summarize; adding one didn't work either. I'd be very happy to do a WebEx any time if that would help diagnose and solve this; in any case, all the code is working and is on the GitHub site. Thanks!!!

[screenshot: output with the same aRate/aLCI/aUCI/aSE values repeated across all rows]

@zross commented Dec 3, 2018

@mcSamuelDataSci Any chance you can create a reproducible/small example we can test with?

@mcSamuelDataSci (Owner, Author) commented Dec 3, 2018

Here is a reproducible example (I exported the input file and made a couple of other small edits):

library(dplyr)
library(epitools)

githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")


# THIS WORKS - INEFFICIENT
countyAA <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aRate   = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000, 
            aSE     = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[5]*100000 
            ) 

# MY ATTEMPTS... CLOSE BUT NO CIGAR
countyAA_try2 <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
                            round(
                              ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)
                            )
                          )
            ) 


# CLOSER, STILL NO CIGAR
countyAA_try3 <- ageCounty %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )
    
    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000, 
      aSE  = sam[5] * 100000
    ) 
  }
  
  

@zross commented Dec 5, 2018

@nteetor could you please take a look?

@nteetor commented Dec 7, 2018

I am confused about why values 2 through 5 are pulled from the result of ageadjust.direct() (see the original try). The return value looks like a vector of 4 values.

@mcSamuelDataSci (Owner, Author)

I am not clear on what you are saying. What I need is something like what is shown below.

[screenshot: desired output table with aRate, aLCI, and aUCI columns per group]

@nteetor commented Dec 7, 2018

Yes, and I believe aRate is <result>[1], not <result>[2]. I do not use the epitools package, so I apologize if I am missing something. If the indexing is off and gets fixed, does that resolve any of the problems outlined above?

@mcSamuelDataSci (Owner, Author) commented Dec 8, 2018

See the shortened example below. I don't see the indexing being off: aRate is <result>[2] in all cases (<result>[1] is the "crude rate", which is not used). This code should run fine and shows the exact issues, I believe. Thanks.

library(dplyr)
library(epitools)

githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")


ageSmall <- filter(ageCounty,county=="Alameda",year==2017,sex=="Total", CAUSE %in% c("A","B","C"))


# This works just fine but is inefficient
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aRate   = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000) 


# This "works" but generates a vector rather than three seperate columns; I tried "unlisting" and could not get it
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
    round(ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)))
  ) 


# Your code gets close, but the values for the output variables are the same for all rows, and I am not sure what they are
ageSmall %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )
    
    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000
    ) 
  }
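Why the third attempt repeats the same values: the braced block %>% { ... } receives the whole grouped data frame as ".", so ageadjust.direct() runs once over all of ageSmall's rows, and summarize() then recycles those scalars into every group. The indexing question can also be checked directly, since the stock epitools function returns a named vector. A minimal sketch with made-up counts and populations:

library(epitools)

# Toy data: two age groups
res <- ageadjust.direct(count = c(5, 10), pop = c(1000, 2000),
                        rate = NULL, stdpop = c(100, 200))
names(res)
# "crude.rate" "adj.rate"   "lci"        "uci"
# so res[2] is the age-adjusted rate and res[3]/res[4] the CI bounds;
# indexing by name, e.g. res["adj.rate"], sidesteps the confusion.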

@zross commented Dec 10, 2018

I'm traveling, but I think I'll have a chance to look at this this evening.

@zross commented Dec 11, 2018

I thought I had a good solution, but it actually runs slower than yours despite calling the function only once; perhaps you can test it on a bigger dataset?

This is a common problem, and I see it discussed in the references below. My results match yours but take several milliseconds longer. See what I have below.

@nteetor can correct me, but I also think that do() is discouraged these days; I'm not sure though.

https://stackoverflow.com/questions/38223003/efficient-assignment-of-a-function-with-multiple-outputs-in-dplyr-mutate-or-summ

tidyverse/dplyr#2326

https://github.com/romainfrancois/tie

Results running yours:

[screenshot: results and timing from the original approach]

Results running mine:

[screenshot: results and timing from the do()-based approach]

library(purrr)  # for map()
library(tidyr)  # for unnest() and spread()

tmp <- ageSmall %>%
  group_by(county, year, sex, CAUSE) %>%
  do(vals = ageadjust.direct(count = .$Ndeaths, 
                             pop = .$pop, 
                             rate = NULL, 
                             stdpop = .$US2000POP, 
                             conf.level = 0.95))


mynames <- map(tmp$vals, names) %>% 
  unlist()

tmp %>% unnest() %>% 
  mutate(names = mynames,
         vals = round(100000*vals, 2)) %>% 
  spread(names, vals) %>% 
  select(-crude.rate) %>% 
  rename(aRate = adj.rate,
         aLCI = lci,
         aUCI = uci)
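A do()-free alternative, sketched on the assumption that a dplyr version with group_modify() is available (it postdates this thread): group_modify() calls a function once per group and row-binds the one-row data frames it returns.

ageSmall %>%
  group_by(county, year, sex, CAUSE) %>%
  group_modify(~ {
    # .x holds one group's rows, so the adjustment runs once per group
    sam <- ageadjust.direct(count = .x$Ndeaths,
                            pop = .x$pop,
                            rate = NULL,
                            stdpop = .x$US2000POP,
                            conf.level = 0.95)
    tibble(aRate = sam[["adj.rate"]] * 100000,
           aLCI  = sam[["lci"]] * 100000,
           aUCI  = sam[["uci"]] * 100000)
  })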

@mcSamuelDataSci (Owner, Author)

I had assumed there would be something close to a "standard" (and therefore efficient) approach to this, and I wanted to do it that "right" way. But since there is not, unless you suggest otherwise, I will stick with what I did, and I can now be confident that it is not ridiculous. One thing I like about my approach is that it is pretty easy to tell what it is doing, and that is a priority for this project too. I will close this unless, again, you suggest otherwise. Thanks.

@zross commented Dec 11, 2018

I'm surprised that do() is so slow here. I ran it on the full dataset and see that my code takes 3x longer. If speed matters, I did see some data.table solutions in my digging (a sketch follows below). But if you can pre-compute, then whatever works.
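A rough data.table sketch of the same idea (untested here, and assuming the same column names as above); a j expression that returns a named list yields one column per list element:

library(data.table)

ageDT <- as.data.table(ageCounty)
countyAA_dt <- ageDT[, {
  # one call per by-group; keep the adjusted rate and CI bounds
  sam <- ageadjust.direct(count = Ndeaths, pop = pop, rate = NULL,
                          stdpop = US2000POP, conf.level = 0.95)
  as.list(sam[c("adj.rate", "lci", "uci")] * 100000)
}, by = .(county, year, sex, CAUSE)]
setnames(countyAA_dt, c("adj.rate", "lci", "uci"), c("aRate", "aLCI", "aUCI"))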
