Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Duplicate rows in mgcv::gam() with ti(age, by = grade, bs ='fs') #150

Closed
ddsjoberg opened this issue Feb 22, 2022 · 9 comments · Fixed by #151
Closed

Duplicate rows in mgcv::gam() with ti(age, by = grade, bs ='fs') #150

ddsjoberg opened this issue Feb 22, 2022 · 9 comments · Fixed by #151

Comments

@ddsjoberg
Copy link
Collaborator

ddsjoberg commented Feb 22, 2022

Reported on https://stackoverflow.com/questions/71215280/gtsummary-output-with-mgcv-gam

I hadn't seen the interaction specification in mgcv::gam() using ti(age, by = grade, bs ='fs'), and it looks like there is a duplication of rows issue with these models. Example below! I think it's related to the calculation of n_obs.

library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.

m1 <- gam(marker ~ s(age, bs = 'ad', k = -1) + grade + ti(age, by = grade, bs ='fs'),  
          data = gtsummary::trial, 
          method = 'REML', 
          family = gaussian)

gtsummary::tidy_gam(m1)
#> # A tibble: 7 x 8
#>   term        estimate std.error statistic  p.value      edf   ref.df parametric
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl> <lgl>     
#> 1 (Intercept)    1.09      0.107  10.2     2.91e-19 NA       NA       TRUE      
#> 2 gradeII       -0.390     0.156  -2.49    1.36e- 2 NA       NA       TRUE      
#> 3 gradeIII      -0.125     0.156  -0.803   4.23e- 1 NA       NA       TRUE      
#> 4 s(age)        NA        NA       0.259   9.90e- 1  3.43e-4  6.01e-4 FALSE     
#> 5 ti(age):gr~   NA        NA       0.523   5.67e- 1  1.73e+0  2.14e+0 FALSE     
#> 6 ti(age):gr~   NA        NA       0.00363 9.53e- 1  1.00e+0  1.00e+0 FALSE     
#> 7 ti(age):gr~   NA        NA       0.462   6.15e- 1  1.57e+0  1.95e+0 FALSE
broom.helpers::tidy_plus_plus(m1, tidy_fun = gtsummary::tidy_gam) |> 
  dplyr::select(all_of(c("term", "estimate", "std.error", "statistic", 
                         "p.value", "edf", "ref.df", "parametric")))
#> # A tibble: 10 x 8
#>    term        estimate std.error statistic p.value      edf   ref.df parametric
#>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl> <lgl>     
#>  1 gradeI         0        NA      NA       NA      NA       NA       NA        
#>  2 gradeII       -0.390     0.156  -2.49     0.0136 NA       NA       TRUE      
#>  3 gradeIII      -0.125     0.156  -0.803    0.423  NA       NA       TRUE      
#>  4 s(age)        NA        NA       0.259    0.990   3.43e-4  6.01e-4 FALSE     
#>  5 ti(age):gr~   NA        NA       0.523    0.567   1.73e+0  2.14e+0 FALSE     
#>  6 ti(age):gr~   NA        NA       0.523    0.567   1.73e+0  2.14e+0 FALSE     
#>  7 ti(age):gr~   NA        NA       0.00363  0.953   1.00e+0  1.00e+0 FALSE     
#>  8 ti(age):gr~   NA        NA       0.00363  0.953   1.00e+0  1.00e+0 FALSE     
#>  9 ti(age):gr~   NA        NA       0.462    0.615   1.57e+0  1.95e+0 FALSE     
#> 10 ti(age):gr~   NA        NA       0.462    0.615   1.57e+0  1.95e+0 FALSE

Created on 2022-02-21 by the reprex package (v2.0.1)

@larmarange
Copy link
Owner

I just add a quick look. Duplicates row are due to the specific model_get_n.gam() method:

model_get_n.gam <- function(model) {

Here we are trying to compute a number of observations for the smooth terms and this is this code that is failing and compute two rows per interaction term with ti(age).

@larmarange
Copy link
Owner

larmarange commented Feb 22, 2022

At that stage:

tidyr::unnest(cols = .data$tbl_obs)

We should keep only one row per term, but what would be the expected value to keep? The maximum? The minimum? NA if two computed values?

# A tibble: 7 x 2
  term             n_obs
  <chr>            <dbl>
1 s(age)             179
2 ti(age):gradeI     179
3 ti(age):gradeI      64
4 ti(age):gradeII     57
5 ti(age):gradeII    179
6 ti(age):gradeIII    58
7 ti(age):gradeIII   179

@ddsjoberg
Copy link
Collaborator Author

I think the N should be consistent with a typical continuous by categorical interaction term we report

library(gtsummary)

lm(marker ~ age:grade, trial) %>%
  tbl_regression() %>%
  add_n(location = "level") %>%
  as_kable()
Characteristic N Beta 95% CI p-value
Age * Grade 179
Age * I 64 0.00 -0.01, 0.01 0.6
Age * II 57 0.00 -0.01, 0.00 0.3
Age * III 58 0.00 -0.01, 0.01 >0.9

Created on 2022-02-22 by the reprex package (v2.0.1)

Header row reports overall N, and the row report the Ns from the categorical variable. I think that makes sense here as well

@larmarange
Copy link
Owner

Then it should be 64, 57 and 58.

One of the issue here is that ti(age) generates two different columns in the model matrix.

library(broom.helpers)
library(mgcv)
#> Le chargement a nécessité le package : nlme
#> This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.

mod <- gam(
  marker ~ s(age, bs = 'ad', k = -1) + grade + ti(age, by = grade, bs ='fs'),  
  data = gtsummary::trial, 
  method = 'REML', 
  family = gaussian
)

mod %>% model_get_model_matrix() %>% colnames()
#>  [1] "(Intercept)"        "gradeII"            "gradeIII"          
#>  [4] "s(age).1"           "s(age).2"           "s(age).3"          
#>  [7] "s(age).4"           "s(age).5"           "s(age).6"          
#> [10] "s(age).7"           "s(age).8"           "s(age).9"          
#> [13] "s(age).10"          "s(age).11"          "s(age).12"         
#> [16] "s(age).13"          "s(age).14"          "s(age).15"         
#> [19] "s(age).16"          "s(age).17"          "s(age).18"         
#> [22] "s(age).19"          "s(age).20"          "s(age).21"         
#> [25] "s(age).22"          "s(age).23"          "s(age).24"         
#> [28] "s(age).25"          "s(age).26"          "s(age).27"         
#> [31] "s(age).28"          "s(age).29"          "s(age).30"         
#> [34] "s(age).31"          "s(age).32"          "s(age).33"         
#> [37] "s(age).34"          "s(age).35"          "s(age).36"         
#> [40] "s(age).37"          "s(age).38"          "s(age).39"         
#> [43] "ti(age):gradeI.1"   "ti(age):gradeI.2"   "ti(age):gradeI.3"  
#> [46] "ti(age):gradeI.4"   "ti(age):gradeII.1"  "ti(age):gradeII.2" 
#> [49] "ti(age):gradeII.3"  "ti(age):gradeII.4"  "ti(age):gradeIII.1"
#> [52] "ti(age):gradeIII.2" "ti(age):gradeIII.3" "ti(age):gradeIII.4"

Created on 2022-02-22 by the reprex package (v2.0.1)

Because there are 2 columns for each term, there is a bug in the computation of N.

The other thing to consider is that ti(age) is not detected as a variable and therefore ti(age):gradeI is not categorized as an interaction.

I'm not familiar with these tensor interactions terms and I'm not sure how to properly deal with them.

@ddsjoberg
Copy link
Collaborator Author

Hmm, i agree with you that it's not clear what to do...

What do you think of this:

  1. We add a check that the merge won't create duplicate rows and if it does, we return an error.
  2. In tidy_plus_plus() we wrap the N calculation in a tryCatch(). If there is an error, print a note that the Ns for each level couldn't be added and just move on to the next calculation?

@larmarange
Copy link
Owner

Note: the easier fix would be to remove model_get_n.gam() method, defaulting to default method. In such case, number of observations will be computed only for parametric terms.

If we want to continue to compute the number of observations for smooth terms, we need to identy such terms, interactions with such terms, and how to handle multiple smooth columns in model matrix.

@larmarange
Copy link
Owner

I would prefer to switch model_get_n() to the default behavior, resulting in computing the number of observations only for parametric terms that are classic terms.

@ddsjoberg
Copy link
Collaborator Author

I would prefer to switch model_get_n() to the default behavior, resulting in computing the number of observations only for parametric terms that are classic terms.

ok that sounds like a good option

@larmarange
Copy link
Owner

Please have a look at #151

library(broom.helpers)
library(mgcv)
#> Le chargement a nécessité le package : nlme
#> This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.

mod <- gam(
  marker ~ s(age, bs = 'ad', k = -1) + grade + ti(age, by = grade, bs ='fs'),  
  data = gtsummary::trial, 
  method = 'REML', 
  family = gaussian
)

mod %>% 
  tidy_plus_plus(tidy_fun = gtsummary::tidy_gam) %>%
  dplyr::select(term, variable, var_type, estimate, n_obs, parametric)
#> # A tibble: 7 x 6
#>   term             variable         var_type    estimate n_obs parametric
#>   <chr>            <chr>            <chr>          <dbl> <dbl> <lgl>     
#> 1 gradeI           grade            categorical    0        64 NA        
#> 2 gradeII          grade            categorical   -0.390    57 TRUE      
#> 3 gradeIII         grade            categorical   -0.125    58 TRUE      
#> 4 s(age)           s(age)           continuous    NA        NA FALSE     
#> 5 ti(age):gradeI   ti(age):gradeI   continuous    NA        NA FALSE     
#> 6 ti(age):gradeII  ti(age):gradeII  continuous    NA        NA FALSE     
#> 7 ti(age):gradeIII ti(age):gradeIII continuous    NA        NA FALSE

Created on 2022-02-22 by the reprex package (v2.0.1)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants