
design in margins.svyglm #124

Open · tzoltak opened this issue Jun 17, 2019 · 16 comments

tzoltak commented Jun 17, 2019

The margins() S3 method for svyglm objects now requires providing the design object as an additional argument. This causes trouble if there are missing values in some variable included in the model:

## load package
library(margins)
library(survey)
data(api)

## code goes here
dstrat <- svydesign(id=~1,strata=~stype, weights=~pw, data=apistrat, fpc=~fpc)
m <- svyglm(growth ~ target, dstrat)
margins(m, design = dstrat)
# not sure why there are 2 times more rows reported in the error message,
# but in general the error seems to be a consequence of 20 observations in
# apistrat$target being NAs:
summary(is.na(apistrat$target))

## session info for your system
R version 3.6.0 (2019-04-26)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Polish_Poland.1250  LC_CTYPE=Polish_Poland.1250   
[3] LC_MONETARY=Polish_Poland.1250 LC_NUMERIC=C                  
[5] LC_TIME=Polish_Poland.1250    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] margins_0.3.23    survey_3.36       survival_2.44-1.1 Matrix_1.2-17    

loaded via a namespace (and not attached):
[1] MASS_7.3-51.4      compiler_3.6.0     DBI_1.0.0          tools_3.6.0       
[5] splines_3.6.0      data.table_1.12.2  lattice_0.20-38    prediction_0.3.6.2
[9] mitools_2.4  

Importantly, providing the survey design object as an argument to margins() seems completely unnecessary, as it is included in the svyglm object and is already restricted to the observations included in estimation - please run:

m$survey.design
nrow(m$survey.design)

So a simple solution would be to replace lines 25-32 of "margins_svyglm.R" with something like:

wts <- weights(model$survey.design)
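
A quick sanity check (just a sketch, reusing the objects m and dstrat from the reproducible example at the top of this issue): the design stored in the fitted model is already subset to complete cases, so its weights line up one-to-one with the rows used in estimation.

## sketch: weights taken from the design stored in the model match the
## number of rows actually used in estimation
wts <- weights(m$survey.design)
length(wts) == nrow(m$survey.design)  # should be TRUE
nrow(dstrat)                          # the full design still has more rows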
tzoltak (author) commented Jun 17, 2019

I experimented a little myself and, apart from this, there is also a problem within the prediction package; the error message obtained by calling the code I provided relates to that problem. Nevertheless, the problem of getting weights from the data (the survey design's subset) on which the model was actually estimated still needs to be solved afterwards, and the solution I've proposed above seems to work.

tzoltak (author) commented Aug 6, 2019

A small correction so that it all works with svrepdesign objects as well - the weights should be obtained as:

wts <- weights(model$survey.design, type = "sampling")
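
For illustration, a sketch of why type = "sampling" covers both cases (building on the api example above; as.svrepdesign() is used here only to create a replicate-weights design to test against):

## sketch: the same call returns sampling weights for both design classes
drep <- as.svrepdesign(dstrat)
wts_strat <- weights(dstrat, type = "sampling")  # survey.design: sampling weights
wts_rep   <- weights(drep, type = "sampling")    # svyrep.design: sampling (not replication) weights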

leeper added the bug label on Dec 22, 2019
tzoltak (author) commented Dec 22, 2019

Oh, and I've just realized that this replacement (what should come instead of lines 25-32) needs to be a little more complicated to allow the use of survey design objects as the data argument:

if (inherits(data, c("survey.design", "svyrep.design"))) {
  wts <- weights(data, type = "sampling")
  data <- find_data(data)
} else {
  wts <- weights(model$survey.design, type = "sampling")
}
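
For clarity, a sketch of the two call patterns the snippet above is meant to support (again using m and dstrat from the example at the top of the issue):

## sketch of the intended usage after the patch
margins(m)                 # weights taken from model$survey.design
margins(m, data = dstrat)  # weights and data taken from the design passed as data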

marcondesfel commented:

You mentioned replacing code lines 25-32 of margins.svyglm.R in margins package 0.3.23. However, when I try replacing lines 25-32 of version 0.3.26 with the code above, it does not work. Do you know which lines of version 0.3.26 correspond to lines 25-32 of the older version?

Many thanks,

tzoltak (author) commented May 14, 2021

Well, @leeper has updated the package since I wrote this post and it became a little incompatible.
If you are looking for a version that works well with svyglm models, you may use my forks of prediction and margins (I am still waiting for @leeper to have time to merge my pull requests; it seems that his work at Facebook leaves him no time for any other activities). These forks also include the aforementioned changes by @leeper. Importantly, you need both of them to make it all work. To install them, you may call (assuming you already have the devtools package installed):

devtools::install_github("tzoltak/prediction")
devtools::install_github("tzoltak/margins")

marcondesfel commented:

@tzoltak thank you! The forks above worked on the margins version I have. I've been trying to find a solution for this for days! I really appreciate the quick response and the solution.

tzoltak (author) commented May 15, 2021

@marcondesfel, you're welcome! :)

marcondesfel commented:

@tzoltak does the margins package in R allow calculating pairwise comparisons (for diff-in-diff analysis)? Or should I use another package like emmeans?

tzoltak (author) commented May 17, 2021

Unfortunately, I'm not sure what exactly you're asking about, @marcondesfel.
The margins package also enables estimating effects of factor variables (and this can be described as "pairwise"), but with some limitations:

  • There is no easy way to get effects describing the differences between all possible pairs of levels if a factor has more than two levels (see also at and factors #116).
  • The margins package is aimed at computing so-called Average Marginal Effects, and consequently, by default, it first computes effects separately for every observation in the dataset and then averages them across observations - this may or may not be what you are interested in (you may prefer some other form of marginal effects, like Marginal Effects at the Means).
  • If you want to compute effects for some specific combination of predictor values, there are some problems with specifying values of factors in the current version of margins (see at and factors #116 and margins(at = ) fails when only one factor level is specified #121).

marcondesfel commented:

My apologies for the lack of clarity @tzoltak.
I have a dataset that looks at the pre (a=0)/post (a=1) effect on y of an intervention on control (x=0) and treatment (x=1) groups.
I have the svyglm model below
t <- svyglm(y~x*a, family=quasibinomial(), design=df, na.action = na.omit)

With the margins code below, I can get the 1st difference (the average marginal effects for control vs treatment groups)

summary(margins(t, variables = "x", type="response", design=df, at= list(a= "0")))
summary(margins(t, variables = "x", type="response", design=df, at= list(a= "1")))

However, now I want to go a step further and get the 2nd difference (difference-in-differences). In Stata, I can do margins, dydx(*) pwcompare. How can I get pairwise comparisons of average marginal effects in R?

I tried the code below with package emmeans:

pairs(pairs(emmeans(t, ~ x | a)), by = NULL, adjust = "none")

But that does not seem correct.

How can I do it with the margins package in R?

tzoltak (author) commented May 19, 2021

Well, there are two answers to your question, @marcondesfel:

1. Technical one:

There is no pwcompare-like feature in margins, but there is a rather easy workaround: you simply create one variable that describes all the possible situations modeled by the interaction (and by the main effects of the variables involved in this interaction). Then you use this variable as a predictor in your model instead of the variables you used to create it.

Next, you run margins() on such a model and, as a result, get effects that describe the differences between the reference level and the other levels (assuming you estimated the model using the default contr.treatment effect coding). If you're interested in some other differences (here: crossing 2 binary variables gives 4 groups, which results in 6 possible pairwise comparisons that can be reconstructed using 3 model parameters - and consequently 3 is the number of comparisons you get from a single call to margins()), you may call margins() specifying the additional argument at - but you must use my fork of margins to make this work.

I suppose that code like this should work for you (assuming you're interested primarily in how x=1 & a=1 differs from x=0 & a=1):

# installing the forks with patched versions of the packages
devtools::install_github("tzoltak/prediction")
devtools::install_github("tzoltak/margins")
# the code itself
df <- update(df, xa = factor(paste0("x", x, "a", a)))
t <- svyglm(y ~ xa, family = quasibinomial(), design = df, na.action = na.omit)
# using the `at` argument to get differences with respect to x=1 & a=1
summary(margins(t, at = list(xa = "x1a1")))

2. Statistical one:

In such a simple situation, i.e. with no additional predictors, you may compute the size of the effects on the response scale simply by examining the conditional means of the dependent variable in the 4 groups defined by crossing a and x. Compute them (in your case: svyby(~y, ~x+a, df, svymean)) and you may simply subtract them from each other to get the size of the effect you're interested in - see the sketch below. That's all you need regarding effect size (on the response scale). Of course, this works only if you don't use additional predictors (control variables).
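
A minimal sketch of that computation (using the names y, x, a, and df from your model above; the row ordering returned by svyby() is only assumed here, so check the printed output before relying on the indices):

## sketch: weighted means of y in the four x-by-a groups
means <- svyby(~y, ~x + a, df, svymean)
means                                   # inspect the row ordering first
## first differences within each period, then the difference-in-differences
## (indices below assume the ordering x=0/a=0, x=1/a=0, x=0/a=1, x=1/a=1)
m00 <- coef(means)[1]; m10 <- coef(means)[2]
m01 <- coef(means)[3]; m11 <- coef(means)[4]
did <- (m11 - m01) - (m10 - m00)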

The only reason you would want to use a model - logistic regression in your case - in this simple situation is to test hypotheses regarding these effects. But to do this you may (and in fact even should) simply look at the significance of the model coefficients (on the link-function scale) and perhaps at the significance of their linear combinations (for svyglm objects see the function regTermTest(), and for regular LMs or GLMs the function lht() from the package car or glht() from the package multcomp - both of these functions work for LMs and GLMs; however, in many situations you may avoid using these functions by carefully specifying the way contrasts are coded in your model). A small example follows below.
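
For instance, a sketch of such a test on the link scale (assuming the interaction model t from your earlier message; regTermTest() comes from the survey package):

## sketch: Wald test of the interaction term, i.e. of the diff-in-diff on the link scale
regTermTest(t, ~x:a)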

The only reason these p-values may change when moving to effects on the response scale is the use of the approximations necessary to compute variances after nonlinear transformations of random variables - that's why it is better to avoid them (and that's why you can get the results of a call to emmeans() on the response scale - by specifying the type="response" argument - while calling pairs(emmeans()) gets you back onto the link scale, which, in the case of logistic regression, can also be the odds-ratio scale).

marcondesfel commented Jul 2, 2021

Hi @tzoltak,

Thanks for the code provided above. But I still cannot seem to get the average marginal effect that I really want.

I have the following svyglm model t:
t <- (svyglm(utdflu~age2+expansion+manpenrate+sex+race+pov2+agemom+marital+facility2+visits+feeraise*feebump, family=quasibinomial(), design=nissvy2bmedicaid, na.action = na.omit))
I was able to get the 1st difference for each level of fee bump as follows:
summary(margins(t, variables = "feeraise", type="response", design=nissvy2bmedicaid, at= list(feebump= "before feebump")))
summary(margins(t, variables = "feeraise", type="response", design=nissvy2bmedicaid, at= list(feebump= "after feebump")))

To get the 2nd difference (the DiD), I tried the code you suggested for the simple model without any covariates (model t2), as below:
t2 <- svyglm(utdflu~feeraise*feebump, family=quasibinomial(), design=nissvy2bmedicaid, na.action = na.omit)
nissvy2bmedicaidf <- update(nissvy2bmedicaid, xa = factor(paste0("feeraise", feeraise, "feebump", feebump)))
t0 <- svyglm(utdflu ~ xa, family=quasibinomial(), design=nissvy2bmedicaidf, na.action = na.omit)
summary(margins(t0, type="response", at= list(xa= "feeraiseyes fee raisefeebumpbefore feebump")))

But the 1st difference and 2nd difference (1st diff before fee bump - 1st diff after fee bump) do not match. I get the following output:
summary(margins(t0, type="response", at= list(xa= "feeraiseyes fee raisefeebumpbefore feebump")))
factor xa AME SE z p lower upper
xafeeraiseno fee raisefeebumpafter feebump 4.0000 0.1811 0.0229 7.9219 0.0000 0.1363 0.2259
xafeeraiseno fee raisefeebumpafter feebump NA NaN NA NaN NaN NA NA
xafeeraiseno fee raisefeebumpafter feebump NA NaN NA NaN NaN NA NA
xafeeraiseno fee raisefeebumpafter feebump NA NaN NA NaN NaN NA NA
xafeeraiseno fee raisefeebumpbefore feebump 4.0000 -0.0217 0.0217 -0.9997 0.3175 -0.0643 0.0209
xafeeraiseno fee raisefeebumpbefore feebump NA NaN NA NaN NaN NA NA
xafeeraiseno fee raisefeebumpbefore feebump NA NaN NA NaN NaN NA NA
xafeeraiseno fee raisefeebumpbefore feebump NA NaN NA NaN NaN NA NA
xafeeraiseyes fee raisefeebumpafter feebump 4.0000 0.2408 0.0193 12.4886 0.0000 0.2030 0.2786
xafeeraiseyes fee raisefeebumpafter feebump NA NaN NA NaN NaN NA NA
xafeeraiseyes fee raisefeebumpafter feebump NA NaN NA NaN NaN NA NA
xafeeraiseyes fee raisefeebumpafter feebump NA NaN NA NaN NaN NA NA

I was expecting a number for the AME closer to 0.038 (for yes vs no fee raise in the after feebump period). What am I doing wrong?

tzoltak (author) commented Jul 10, 2021

Dear @marcondesfel,
since you have additional predictors in the model, you must keep them, changing only the way the variables feebump and feeraise are introduced into the model (having no easy way to test the code, I hope I haven't misspelled anything below):

# the values of your variables are quite long, so I will combine them in a slightly shorter way
nissvy2bmedicaid <- update(nissvy2bmedicaid,
                           feeRaiseBump = factor(paste0(".", feeraise, ".", feebump)))
# and, by the way, you may set the most interesting level as the reference one even before estimating the model
# (so you don't need to use `at` when calling `margins()`)
nissvy2bmedicaid <- update(nissvy2bmedicaid,
                           feeRaiseBump = relevel(feeRaiseBump, ".yes fee raise.before feebump"))
# now estimation
t2 <- (svyglm(utdflu ~ age2 + expansion + manpenrate + sex + race + pov2 + agemom + marital + facility2 + visits + feeRaiseBump,
              family = quasibinomial(), design = nissvy2bmedicaid, na.action = na.omit))
# and you should be able to calculate AMEs describing the differences in comparison to ".yes fee raise.before feebump"
summary(margins(t2))

dsanchezp18 commented:

Hello there! I'm trying to use margins() to obtain average partial effects for a svyglm object. I installed your forks, since I was unable to obtain them with the original package from CRAN. However, it now takes a lot of time to compute these partial effects. A model with 2 terms and about 3000 observations takes about 12 seconds to compute; a model with 13 variables and the same number of observations takes 2 min 45 s for the APE calculations. The same model as a glm object takes only about 3 seconds for the computation of APEs.

I'm not sure if it has something to do with the fork and the missing-value problems, or with my R session. I'm attaching the session info below. Thanks in advance for your help with this!

Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252    LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C                           LC_TIME=English_United States.1252    

attached base packages:
[1] grid      stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] psych_2.1.6       car_3.0-11        broom_0.7.9       openxlsx_4.2.4    arsenal_3.6.3     haven_2.4.3      
 [7] performance_0.7.3 mfx_1.2-2         betareg_3.1-4     MASS_7.3-54       lmtest_0.9-38     zoo_1.8-9        
[13] sandwich_3.0-1    margins_0.3.26    pscl_1.5.5        effects_4.2-0     carData_3.0-4     ggeffects_1.1.1  
[19] stargazer_5.2.2   survey_4.1-1      survival_3.2-11   Matrix_1.3-4      forcats_0.5.1     stringr_1.4.0    
[25] dplyr_1.0.7       purrr_0.3.4       readr_2.0.1       tidyr_1.1.3       tibble_3.1.3      ggplot2_3.3.5    
[31] tidyverse_1.3.1  

loaded via a namespace (and not attached):
 [1] nlme_3.1-152      fs_1.5.0          lubridate_1.7.10  bit64_4.0.5       insight_0.14.3    httr_1.4.2       
 [7] tools_4.1.0       backports_1.2.1   utf8_1.2.2        R6_2.5.1          DBI_1.1.1         colorspace_2.0-2 
[13] nnet_7.3-16       withr_2.4.2       mnormt_2.0.2      tidyselect_1.1.1  curl_4.3.2        bit_4.0.4        
[19] compiler_4.1.0    cli_3.0.1         rvest_1.0.1       xml2_1.3.2        scales_1.1.1      digest_0.6.27    
[25] foreign_0.8-81    minqa_1.2.4       rio_0.5.27        pkgconfig_2.0.3   htmltools_0.5.1.1 lme4_1.1-27.1    
[31] dbplyr_2.1.1      rlang_0.4.11      readxl_1.3.1      rstudioapi_0.13   generics_0.1.0    jsonlite_1.7.2   
[37] vroom_1.5.4       zip_2.2.0         magrittr_2.0.1    modeltools_0.2-23 Formula_1.2-4     Rcpp_1.0.7       
[43] munsell_0.5.0     fansi_0.5.0       abind_1.4-5       prediction_0.3.15 lifecycle_1.0.0   stringi_1.7.3    
[49] flexmix_2.3-17    parallel_4.1.0    crayon_1.4.1      lattice_0.20-44   splines_4.1.0     hms_1.1.0        
[55] tmvnsim_1.0-2     knitr_1.33        pillar_1.6.2      boot_1.3-28       stats4_4.1.0      reprex_2.0.1     
[61] glue_1.4.2        mitools_2.4       data.table_1.14.0 modelr_0.1.8      vctrs_0.3.8       nloptr_1.2.2.2   
[67] tzdb_0.1.2        cellranger_1.1.0  gtable_0.3.0      assertthat_0.2.1  xfun_0.25         ellipsis_0.3.2   

tzoltak (author) commented Sep 10, 2021

Dear @dsanchezp18, I don't think it has much to do with my fork specifically. What I have done regarding the use of svyglm objects is simply change the place in which margins searches for the data and the information about the survey design. Moreover, what margins does internally, in general, is call the predict() method for a given model very extensively. Consequently, I suppose that the slowdown you mention is most probably caused by the predict() method for svyglm objects being much slower than the one for glm objects. I checked quickly and this may indeed be the case, because the predict() method for svyglm objects by default computes standard errors of the returned predictions, which is much more time-consuming than computing the predictions themselves (especially with complex survey designs!) and which the predict() method for glm objects simply doesn't do by default. This suggests that the solution may be to modify margins (and possibly also the prediction package that it uses underneath) so that the predict() method for svyglm objects is called with an argument telling it explicitly not to compute standard errors. I would have to look at it more closely, however. A rough way to check this suspicion is sketched below.
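
A rough timing check (just a sketch, reusing the api example from the top of this thread; the plain glm fit is there only as a baseline and the loop count is arbitrary):

## sketch: compare the cost of repeated predictions from svyglm and glm fits
library(survey)
data(api)
dstrat <- svydesign(id = ~1, strata = ~stype, weights = ~pw, data = apistrat, fpc = ~fpc)
ms <- svyglm(growth ~ target, dstrat)                      # predictions come with SEs
mg <- glm(growth ~ target, data = apistrat, weights = pw)  # no SEs by default
newd <- subset(apistrat, !is.na(target))                   # drop missing target to keep things simple
system.time(for (i in 1:200) predict(ms, newdata = newd))
system.time(for (i in 1:200) predict(mg, newdata = newd))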

dsanchezp18 commented:

I see, @tzoltak, thank you for your insight. It does surprise me that, to the best of my knowledge, there is only one way to compute average partial effects of survey-weighted GLMs. Thanks!
