Duplicate rows in `mgcv::gam()` with `ti(age, by = grade, bs ='fs')` #150

ddsjoberg · 2022-02-22T02:55:33Z

Reported on https://stackoverflow.com/questions/71215280/gtsummary-output-with-mgcv-gam

I hadn't seen the interaction specification in mgcv::gam() using ti(age, by = grade, bs ='fs'), and it looks like there is a duplication of rows issue with these models. Example below! I think it's related to the calculation of n_obs.

library(mgcv)
#> Loading required package: nlme
#> This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.

m1 <- gam(marker ~ s(age, bs = 'ad', k = -1) + grade + ti(age, by = grade, bs ='fs'),  
          data = gtsummary::trial, 
          method = 'REML', 
          family = gaussian)

gtsummary::tidy_gam(m1)
#> # A tibble: 7 x 8
#>   term        estimate std.error statistic  p.value      edf   ref.df parametric
#>   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>    <dbl> <lgl>     
#> 1 (Intercept)    1.09      0.107  10.2     2.91e-19 NA       NA       TRUE      
#> 2 gradeII       -0.390     0.156  -2.49    1.36e- 2 NA       NA       TRUE      
#> 3 gradeIII      -0.125     0.156  -0.803   4.23e- 1 NA       NA       TRUE      
#> 4 s(age)        NA        NA       0.259   9.90e- 1  3.43e-4  6.01e-4 FALSE     
#> 5 ti(age):gr~   NA        NA       0.523   5.67e- 1  1.73e+0  2.14e+0 FALSE     
#> 6 ti(age):gr~   NA        NA       0.00363 9.53e- 1  1.00e+0  1.00e+0 FALSE     
#> 7 ti(age):gr~   NA        NA       0.462   6.15e- 1  1.57e+0  1.95e+0 FALSE
broom.helpers::tidy_plus_plus(m1, tidy_fun = gtsummary::tidy_gam) |> 
  dplyr::select(all_of(c("term", "estimate", "std.error", "statistic", 
                         "p.value", "edf", "ref.df", "parametric")))
#> # A tibble: 10 x 8
#>    term        estimate std.error statistic p.value      edf   ref.df parametric
#>    <chr>          <dbl>     <dbl>     <dbl>   <dbl>    <dbl>    <dbl> <lgl>     
#>  1 gradeI         0        NA      NA       NA      NA       NA       NA        
#>  2 gradeII       -0.390     0.156  -2.49     0.0136 NA       NA       TRUE      
#>  3 gradeIII      -0.125     0.156  -0.803    0.423  NA       NA       TRUE      
#>  4 s(age)        NA        NA       0.259    0.990   3.43e-4  6.01e-4 FALSE     
#>  5 ti(age):gr~   NA        NA       0.523    0.567   1.73e+0  2.14e+0 FALSE     
#>  6 ti(age):gr~   NA        NA       0.523    0.567   1.73e+0  2.14e+0 FALSE     
#>  7 ti(age):gr~   NA        NA       0.00363  0.953   1.00e+0  1.00e+0 FALSE     
#>  8 ti(age):gr~   NA        NA       0.00363  0.953   1.00e+0  1.00e+0 FALSE     
#>  9 ti(age):gr~   NA        NA       0.462    0.615   1.57e+0  1.95e+0 FALSE     
#> 10 ti(age):gr~   NA        NA       0.462    0.615   1.57e+0  1.95e+0 FALSE

^{Created on 2022-02-21 by the reprex package (v2.0.1)}

The text was updated successfully, but these errors were encountered:

larmarange · 2022-02-22T08:04:04Z

I just add a quick look. Duplicates row are due to the specific model_get_n.gam() method:

broom.helpers/R/model_get_n.R

Line 138 in 7f4d5d4

model_get_n.gam <- function(model) {

Here we are trying to compute a number of observations for the smooth terms and this is this code that is failing and compute two rows per interaction term with ti(age).

larmarange · 2022-02-22T08:17:20Z

At that stage:

broom.helpers/R/model_get_n.R

Line 159 in 7f4d5d4

tidyr::unnest(cols = .data$tbl_obs)

We should keep only one row per term, but what would be the expected value to keep? The maximum? The minimum? NA if two computed values?

# A tibble: 7 x 2
  term             n_obs
  <chr>            <dbl>
1 s(age)             179
2 ti(age):gradeI     179
3 ti(age):gradeI      64
4 ti(age):gradeII     57
5 ti(age):gradeII    179
6 ti(age):gradeIII    58
7 ti(age):gradeIII   179

ddsjoberg · 2022-02-22T13:37:59Z

I think the N should be consistent with a typical continuous by categorical interaction term we report

library(gtsummary)

lm(marker ~ age:grade, trial) %>%
  tbl_regression() %>%
  add_n(location = "level") %>%
  as_kable()

Characteristic	N	Beta	95% CI	p-value
Age * Grade	179
Age * I	64	0.00	-0.01, 0.01	0.6
Age * II	57	0.00	-0.01, 0.00	0.3
Age * III	58	0.00	-0.01, 0.01	>0.9

^{Created on 2022-02-22 by the reprex package (v2.0.1)}

Header row reports overall N, and the row report the Ns from the categorical variable. I think that makes sense here as well

larmarange · 2022-02-22T13:55:29Z

Then it should be 64, 57 and 58.

One of the issue here is that ti(age) generates two different columns in the model matrix.

library(broom.helpers)
library(mgcv)
#> Le chargement a nécessité le package : nlme
#> This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.

mod <- gam(
  marker ~ s(age, bs = 'ad', k = -1) + grade + ti(age, by = grade, bs ='fs'),  
  data = gtsummary::trial, 
  method = 'REML', 
  family = gaussian
)

mod %>% model_get_model_matrix() %>% colnames()
#>  [1] "(Intercept)"        "gradeII"            "gradeIII"          
#>  [4] "s(age).1"           "s(age).2"           "s(age).3"          
#>  [7] "s(age).4"           "s(age).5"           "s(age).6"          
#> [10] "s(age).7"           "s(age).8"           "s(age).9"          
#> [13] "s(age).10"          "s(age).11"          "s(age).12"         
#> [16] "s(age).13"          "s(age).14"          "s(age).15"         
#> [19] "s(age).16"          "s(age).17"          "s(age).18"         
#> [22] "s(age).19"          "s(age).20"          "s(age).21"         
#> [25] "s(age).22"          "s(age).23"          "s(age).24"         
#> [28] "s(age).25"          "s(age).26"          "s(age).27"         
#> [31] "s(age).28"          "s(age).29"          "s(age).30"         
#> [34] "s(age).31"          "s(age).32"          "s(age).33"         
#> [37] "s(age).34"          "s(age).35"          "s(age).36"         
#> [40] "s(age).37"          "s(age).38"          "s(age).39"         
#> [43] "ti(age):gradeI.1"   "ti(age):gradeI.2"   "ti(age):gradeI.3"  
#> [46] "ti(age):gradeI.4"   "ti(age):gradeII.1"  "ti(age):gradeII.2" 
#> [49] "ti(age):gradeII.3"  "ti(age):gradeII.4"  "ti(age):gradeIII.1"
#> [52] "ti(age):gradeIII.2" "ti(age):gradeIII.3" "ti(age):gradeIII.4"

^{Created on 2022-02-22 by the reprex package (v2.0.1)}

Because there are 2 columns for each term, there is a bug in the computation of N.

The other thing to consider is that ti(age) is not detected as a variable and therefore ti(age):gradeI is not categorized as an interaction.

I'm not familiar with these tensor interactions terms and I'm not sure how to properly deal with them.

ddsjoberg · 2022-02-22T13:57:58Z

Hmm, i agree with you that it's not clear what to do...

What do you think of this:

We add a check that the merge won't create duplicate rows and if it does, we return an error.
In tidy_plus_plus() we wrap the N calculation in a tryCatch(). If there is an error, print a note that the Ns for each level couldn't be added and just move on to the next calculation?

larmarange · 2022-02-22T14:01:01Z

Note: the easier fix would be to remove model_get_n.gam() method, defaulting to default method. In such case, number of observations will be computed only for parametric terms.

If we want to continue to compute the number of observations for smooth terms, we need to identy such terms, interactions with such terms, and how to handle multiple smooth columns in model matrix.

larmarange · 2022-02-22T14:05:54Z

I would prefer to switch model_get_n() to the default behavior, resulting in computing the number of observations only for parametric terms that are classic terms.

ddsjoberg · 2022-02-22T14:11:49Z

I would prefer to switch model_get_n() to the default behavior, resulting in computing the number of observations only for parametric terms that are classic terms.

ok that sounds like a good option

larmarange · 2022-02-22T14:27:25Z

Please have a look at #151

library(broom.helpers)
library(mgcv)
#> Le chargement a nécessité le package : nlme
#> This is mgcv 1.8-38. For overview type 'help("mgcv-package")'.

mod <- gam(
  marker ~ s(age, bs = 'ad', k = -1) + grade + ti(age, by = grade, bs ='fs'),  
  data = gtsummary::trial, 
  method = 'REML', 
  family = gaussian
)

mod %>% 
  tidy_plus_plus(tidy_fun = gtsummary::tidy_gam) %>%
  dplyr::select(term, variable, var_type, estimate, n_obs, parametric)
#> # A tibble: 7 x 6
#>   term             variable         var_type    estimate n_obs parametric
#>   <chr>            <chr>            <chr>          <dbl> <dbl> <lgl>     
#> 1 gradeI           grade            categorical    0        64 NA        
#> 2 gradeII          grade            categorical   -0.390    57 TRUE      
#> 3 gradeIII         grade            categorical   -0.125    58 TRUE      
#> 4 s(age)           s(age)           continuous    NA        NA FALSE     
#> 5 ti(age):gradeI   ti(age):gradeI   continuous    NA        NA FALSE     
#> 6 ti(age):gradeII  ti(age):gradeII  continuous    NA        NA FALSE     
#> 7 ti(age):gradeIII ti(age):gradeIII continuous    NA        NA FALSE

^{Created on 2022-02-22 by the reprex package (v2.0.1)}

larmarange mentioned this issue Feb 22, 2022

Simply use model_get_n.glm() for gam models #151

Merged

larmarange closed this as completed in #151 Feb 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Duplicate rows in `mgcv::gam()` with `ti(age, by = grade, bs ='fs')` #150

Duplicate rows in `mgcv::gam()` with `ti(age, by = grade, bs ='fs')` #150

ddsjoberg commented Feb 22, 2022 •

edited

larmarange commented Feb 22, 2022

larmarange commented Feb 22, 2022 •

edited

ddsjoberg commented Feb 22, 2022

larmarange commented Feb 22, 2022

ddsjoberg commented Feb 22, 2022

larmarange commented Feb 22, 2022

larmarange commented Feb 22, 2022

ddsjoberg commented Feb 22, 2022

larmarange commented Feb 22, 2022

Duplicate rows in mgcv::gam() with ti(age, by = grade, bs ='fs') #150

Duplicate rows in mgcv::gam() with ti(age, by = grade, bs ='fs') #150

Comments

ddsjoberg commented Feb 22, 2022 • edited

larmarange commented Feb 22, 2022

larmarange commented Feb 22, 2022 • edited

ddsjoberg commented Feb 22, 2022

larmarange commented Feb 22, 2022

ddsjoberg commented Feb 22, 2022

larmarange commented Feb 22, 2022

larmarange commented Feb 22, 2022

ddsjoberg commented Feb 22, 2022

larmarange commented Feb 22, 2022

Duplicate rows in `mgcv::gam()` with `ti(age, by = grade, bs ='fs')` #150

Duplicate rows in `mgcv::gam()` with `ti(age, by = grade, bs ='fs')` #150

ddsjoberg commented Feb 22, 2022 •

edited

larmarange commented Feb 22, 2022 •

edited