seeking efficient approach to group_by age adjustment work #50

Closed · mcSamuelDataSci opened this issue Nov 26, 2018 · 13 comments

@mcSamuelDataSci (Owner)

Zev (and Nate)-

The first coding approach below works just fine, and the data frame "countyAA" has exactly what I need: for every county, year, sex, and CAUSE combination I get the age-adjusted death rate with its upper and lower confidence limits and SE. The only variable I am not grouping by is ageGroup, which is what drives the whole calculation.

BUT, it is clearly inefficient to run the ageadjust.direct function (part of the "epitools" package; modified by me into ageadjust.direct.SAM to include the SE and deal with some 0's) four times per group. I did this a while ago when I really needed to just get it done, but now I am trying to figure out how to do it more efficiently, and I have not been successful so far despite a fair bit of effort.

The second approach below partially works, but it puts all the results into a single column, as one character-string representation of the result vector--see the picture below (the rows with "NAs" have 0 deaths and are filtered out elsewhere). I could parse that string apart, but I bet there is a simpler, better way.

Any suggestions would be most appreciated.


countyAA <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(
            aRate   = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000, 
            aSE     = ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[5]*100000
  )
          
countyAAtry2 <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
                            round(
                              ageadjust.direct.SAM(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)
                            )
                          )
            ) 

[screenshot: countyAAtry2 output, with the results collapsed into a single aMulti list column]
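For what it's worth, here is a sketch of one way to split that list column into named columns. It assumes tidyr >= 1.0 (for unnest_wider) and that ageadjust.direct.SAM returns a named vector like the stock epitools function; unique() is dropped because it strips the names that unnest_wider uses:

# Keep the full named result vector in the list column, then expand
# it so each element becomes its own column.
library(tidyr)  # >= 1.0 for unnest_wider()

countyAA_wide <- ageCounty %>%
  group_by(county, year, sex, CAUSE) %>%
  summarize(aMulti = list(
    ageadjust.direct.SAM(count = Ndeaths, pop = pop, rate = NULL,
                         stdpop = US2000POP, conf.level = 0.95) * 100000
  )) %>%
  unnest_wider(aMulti)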

@nteetor commented Nov 26, 2018

You could try the following,

ageCounty %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct.SAM(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )
    
    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000, 
      aSE  = sam[5] * 100000
    ) 
  }

@mcSamuelDataSci (Owner, Author)

Thanks for your quick reply. It runs and gives the right number of rows, but it does not generate the proper results; see below. I don't know how the "."s work in your code, but I thought maybe a %>% was missing before the summarize; adding one didn't work either. I'd be very happy to do a WebEx any time if that would help diagnose and solve this; in any case, all the code is working and is on the GitHub site. Thanks!!!

[screenshot: output with the same aRate/aLCI/aUCI/aSE values repeated across all rows]

@zross commented Dec 3, 2018

@mcSamuelDataSci Any chance you can create a reproducible/small example we can test with?

@mcSamuelDataSci (Owner, Author) commented Dec 3, 2018

Here is a reproducible example (I exported the input file and made a couple of other small edits):

library(dplyr)
library(epitools)

githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")


# THIS WORKS - INEFFICIENT
countyAA <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aRate   = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000, 
            aSE     = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[5]*100000 
            ) 

# MY ATTEMPTS... CLOSE BUT NO CIGAR
countyAA_try2 <- ageCounty %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
                            round(
                              ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)
                            )
                          )
            ) 


# CLOSER, STILL NO CIGAR
countyAA_try3 <- ageCounty %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )
    
    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000, 
      aSE  = sam[5] * 100000
    ) 
  }
  
  

@zross commented Dec 5, 2018

@nteetor could you please take a look?

@nteetor commented Dec 7, 2018

I am confused about why values 2 through 5 are pulled from the result of ageadjust.direct() (see the original try). The return value looks like a vector of 4 values.

@mcSamuelDataSci (Owner, Author)

I am not clear on what you are saying. What I need is something like what is shown below.

[screenshot: desired output table with aRate, aLCI, and aUCI columns per group]

@nteetor commented Dec 7, 2018

Yes, and I believe aRate is <result>[1], not <result>[2]. I do not use the epitools package, so I apologize if I am missing something. If the indexing is off and gets fixed, does that resolve any of the problems outlined above?

@mcSamuelDataSci (Owner, Author) commented Dec 8, 2018

See the shortened example below. I don't see the indexing being off: aRate is <result>[2] in all cases (<result>[1] is the "crude rate", which is not used). This code should run fine and shows the exact issues, I believe. Thanks.

library(dplyr)
library(epitools)

githubURL <- "https://raw.githubusercontent.com/mcSamuelDataSci/CACommunityBurden/master/myCBD/myData/fake/forZev.ageCounty.RDS"
download.file(githubURL,"temp.rds", method="curl")
ageCounty <- readRDS("temp.rds")


ageSmall <- filter(ageCounty,county=="Alameda",year==2017,sex=="Total", CAUSE %in% c("A","B","C"))


# This works just fine but is inefficient
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aRate   = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[2]*100000,
            aLCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[3]*100000,
            aUCI    = ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)[4]*100000) 


# This "works" but generates a vector rather than three seperate columns; I tried "unlisting" and could not get it
ageSmall %>% group_by(county,year,sex,CAUSE) %>%
  summarize(aMulti = list(unique(
    round(ageadjust.direct(count=Ndeaths, pop=pop, rate = NULL, stdpop=US2000POP, conf.level = 0.95)*100000,2)))
  ) 


# Your code gets close, but the values for the output variables are the same for all rows, and I am not sure what they are
ageSmall %>% 
  group_by(county, year, sex, CAUSE) %>% {
    sam <- ageadjust.direct(
      count = .$Ndeaths, 
      pop = .$pop, 
      rate = NULL, 
      stdpop = .$US2000POP, 
      conf.level = 0.95
    )
    
    summarize(
      .,
      aRate = sam[2] * 100000,
      aLCI = sam[3] * 100000,
      aUCI = sam[4] * 100000
    ) 
  }
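Why the third attempt repeats the same values: the braced block %>% { ... } receives the whole grouped data frame as ".", so ageadjust.direct() runs once over all of ageSmall's rows, and summarize() then recycles those scalars into every group. The indexing question can also be checked directly, since the stock epitools function returns a named vector. A minimal sketch with made-up counts and populations:

library(epitools)

# Toy data: two age groups
res <- ageadjust.direct(count = c(5, 10), pop = c(1000, 2000),
                        rate = NULL, stdpop = c(100, 200))
names(res)
# "crude.rate" "adj.rate"   "lci"        "uci"
# so res[2] is the age-adjusted rate and res[3]/res[4] the CI bounds;
# indexing by name, e.g. res["adj.rate"], sidesteps the confusion.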

@zross commented Dec 10, 2018

I'm traveling, but I think I'll have a chance to look at this this evening.

@zross commented Dec 11, 2018

I thought I had a good solution, but it actually runs slower than yours despite calling the function only once; perhaps you can test it on a bigger dataset?

This is a common problem, and I see it discussed in the references below. My results match yours but take several milliseconds longer. See what I have below.

@nteetor can correct me, but I also think that do() is discouraged these days; I'm not sure though.

https://stackoverflow.com/questions/38223003/efficient-assignment-of-a-function-with-multiple-outputs-in-dplyr-mutate-or-summ

tidyverse/dplyr#2326

https://github.com/romainfrancois/tie

Results running yours:

[screenshot: results and timing from the original approach]

Results running mine:

[screenshot: results and timing from the do()-based approach]

library(purrr)  # for map()
library(tidyr)  # for unnest() and spread()

tmp <- ageSmall %>%
  group_by(county, year, sex, CAUSE) %>%
  do(vals = ageadjust.direct(count = .$Ndeaths, 
                             pop = .$pop, 
                             rate = NULL, 
                             stdpop = .$US2000POP, 
                             conf.level = 0.95))


mynames <- map(tmp$vals, names) %>% 
  unlist()

tmp %>% unnest() %>% 
  mutate(names = mynames,
         vals = round(100000*vals, 2)) %>% 
  spread(names, vals) %>% 
  select(-crude.rate) %>% 
  rename(aRate = adj.rate,
         aLCI = lci,
         aUCI = uci)
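A do()-free alternative, sketched on the assumption that a dplyr version with group_modify() is available (it postdates this thread): group_modify() calls a function once per group and row-binds the one-row data frames it returns.

ageSmall %>%
  group_by(county, year, sex, CAUSE) %>%
  group_modify(~ {
    # .x holds one group's rows, so the adjustment runs once per group
    sam <- ageadjust.direct(count = .x$Ndeaths,
                            pop = .x$pop,
                            rate = NULL,
                            stdpop = .x$US2000POP,
                            conf.level = 0.95)
    tibble(aRate = sam[["adj.rate"]] * 100000,
           aLCI  = sam[["lci"]] * 100000,
           aUCI  = sam[["uci"]] * 100000)
  })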

@mcSamuelDataSci (Owner, Author)

I had assumed there would be something close to a "standard" (and therefore efficient) approach to this, and I wanted to do it that "right" way. But since there is not, unless you suggest otherwise, I will stick with what I did, and I can now be confident that it is not ridiculous. One thing I like about my approach is that it is pretty easy to tell what it is doing, and that is a priority for this project too. I will close this unless, again, you suggest otherwise. Thanks.

@zross commented Dec 11, 2018

I'm surprised that do() is so slow here. I ran it on the full dataset and see that my code takes 3x longer. If speed matters, I did see some data.table solutions in my digging (a sketch follows below). But if you can pre-compute, then whatever works.
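A rough data.table sketch of the same idea (untested here, and assuming the same column names as above); a j expression that returns a named list yields one column per list element:

library(data.table)

ageDT <- as.data.table(ageCounty)
countyAA_dt <- ageDT[, {
  # one call per by-group; keep the adjusted rate and CI bounds
  sam <- ageadjust.direct(count = Ndeaths, pop = pop, rate = NULL,
                          stdpop = US2000POP, conf.level = 0.95)
  as.list(sam[c("adj.rate", "lci", "uci")] * 100000)
}, by = .(county, year, sex, CAUSE)]
setnames(countyAA_dt, c("adj.rate", "lci", "uci"), c("aRate", "aLCI", "aUCI"))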
