
Original data set cannot be safely accessed when the estimation is performed in a loop #340

Open
lrberge opened this issue Jul 21, 2022 · 7 comments


@lrberge
Owner

lrberge commented Jul 21, 2022

Example

library(fixest)

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
all_species = unique(base$species)
res_all = list()
for(i in 1:3){
    s = all_species[i]
    res_all[[i]] = feols(y ~ x1 + x2, base[base$species == s, ])
}

# Have a look at the mean of the dependent variable:
etable(res_all, fitstat = "my")
#>                            model 1            model 2            model 3
#> Dependent Var.:                  y                  y                  y
#>                                                                         
#> (Intercept)      2.304*** (0.3853)  2.116*** (0.4943)    0.6248 (0.5249)
#> x1              0.6674*** (0.0904)    0.2476 (0.1868)   0.2600. (0.1533)
#> x2                 0.2834 (0.1972) 0.7356*** (0.1248) 0.9348*** (0.0896)
#> _______________ __________________ __________________ __________________
#> S.E. type                      IID                IID                IID
#> Dep. Var. mean              6.5880             6.5880             6.5880

What happens

In fixest many computations are delayed, occurring after the estimation and only at the user's request.
This is true for computing the standard-errors (when clustering w.r.t. a variable not used in the estimation) and for several fit statistics.

For these post-estimation computations to work, the original data, i.e. the data used for the estimation, needs to be accessed again.
Of course, one solution would be to store the original data in the estimation object. This would be safe but would come at an exorbitant memory cost.
Currently, the data is instead accessed by re-evaluating the same data call as in the estimation: it is just an access to a data set currently stored in memory. In the example, base[base$species == s, ] is re-evaluated to get the data.

In general, this is fine. The problem arises in loops. When information has to be computed after the loop, like the mean of the dependent variable in the example, the data for all three models is retrieved with base[base$species == s, ], which now points to the same (last) subset for every model. Hence the wrong results.
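To make the failure concrete, a quick sketch reusing the objects from the example above: after the loop, s keeps its last value, so re-evaluating the stored data call returns the subset of the last species for every model, which is why the table above reports 6.5880, the mean of y in that last subset, in all three columns.

# what fixest re-evaluates, after the loop, for *every* one of the three models
sub_after_loop = base[base$species == s, ]
nrow(sub_after_loop)    # 50 observations: only the last species
mean(sub_after_loop$y)  # 6.5880, the value shown for all three models above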

Solutions

VCOV

This problem affects the VCOV when clustering (or computing any other VCOV relying on an extra variable) has to be performed ex post.
The solution is then to use the argument vcov at estimation time.

Note that you will not be able to navigate through different VCOVs (other than standard and heteroskedasticity-robust) ex post. If needed, you will have to store the object with the different VCOVs within the loop, while the right data is accessible.
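For instance, a minimal sketch of both points, reusing the loop from the example (clustering on x3 is purely illustrative here; any cluster variable present in the data would do):

res_all = list()
for(i in 1:3){
    s = all_species[i]
    # the clustered VCOV is computed now, while the correct data is in memory
    mod = feols(y ~ x1 + x2, base[base$species == s, ], vcov = ~x3)
    # if other data-dependent VCOVs are needed ex post, store them within the loop,
    # e.g. via the object returned by summary(mod, vcov = ...)
    res_all[[i]] = mod
}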

fit statistics

Currently there is no way to store the fit statistics at estimation time.
There is a major overhaul of the fit statistics mechanism under way. Once the new fit-stats are implemented, that will be possible and easy (basically just using fitstat = TRUE will do).

In the short run, if you want to use data-dependent fit-stats ex post, you need to do it manually. FWIW, here's an example:

# A) create a custom fit stat
my_meandep = function(x){
    # x: fixest object
    if("mean_dep" %in% names(x)) x$mean_dep else NA
}

# B) register it
fitstat_register("meandep", my_meandep, "Mean of the Dep. Var.")

# C) within the loop, save the requested information 

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
all_species = unique(base$species)
res_all = list()
for(i in 1:3){
    s = all_species[i]
    
    # C') => we create the slot mean_dep
    mod = feols(y ~ x1 + x2, base[base$species == s, ])
    mod$mean_dep = fitstat(mod, "my", simplify = TRUE)
    res_all[[i]] = mod
}

# D) use it
etable(res_all, fitstat = ~my + meandep)
#>                                  model 1            model 2            model 3
#> Dependent Var.:                        y                  y                  y
#>                                                                               
#> (Intercept)            2.304*** (0.3853)  2.116*** (0.4943)    0.6248 (0.5249)
#> x1                    0.6674*** (0.0904)    0.2476 (0.1868)   0.2600. (0.1533)
#> x2                       0.2834 (0.1972) 0.7356*** (0.1248) 0.9348*** (0.0896)
#> _____________________ __________________ __________________ __________________
#> S.E. type                            IID                IID                IID
#> Dep. Var. mean                    6.5880             6.5880             6.5880
#> Mean of the Dep. Var.             5.0060             5.9360             6.5880
@vincentarelbundock
Contributor

Just wanted to +1 this issue. marginaleffects must rely on some ugly tricks to get the dataset, namely evaluating its name in the global environment. This is unsafe and will lead to bad results if the user re-assigns the data object.
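Something along these lines, to illustrate (a hypothetical sketch, not the actual marginaleffects code):

# hypothetical sketch of the kind of trick required: re-evaluate the data call
# stored in the model object in the global environment
get_model_data <- function(model) {
    eval(model$call$data, envir = globalenv())
}
# this silently returns the wrong data as soon as the user re-assigns the
# object that model$call$data refers to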

It would be awesome if we could store the original dataset in fixest objects for easy retrieval.

As always, thanks for your great work on this!

@vincentarelbundock
Contributor

I see now that I misread your post and that you do not plan to attach the original data.

FWIW, here’s an example of the problematic situation I had in mind. The first model should produce a marginal effects data frame with 32 rows because there are 32 observations. Yet, because we reassign an object in the global environment, trying to access the feols data produces incorrect results.

Of course, this is dangerous behavior by the user, but reassigning data frames is super common practice…

library(fixest)
library(marginaleffects)

dat <- mtcars
mod1 <- feols(hp ~ mpg | cyl, data = dat)

dat <- dat[1:20,]
mod2 <- feols(hp ~ mpg | cyl, data = dat)

marginaleffects(mod1) |> nrow()
#> [1] 20

marginaleffects(mod2) |> nrow()
#> [1] 20

@ngreifer

ngreifer commented Sep 1, 2022

Chiming in with my own examination of the problem. feols() stores a call_env component in the output, which is meant to store the calling environment. However, neither this environment nor the environment of the formula in the fml component actually refers to the environment where the function call or the formula was evaluated. It is impossible to retrieve the actual environment where feols() was called.

# Evaluating feols() in its own environment
f1 <- function() {
  d1 <- data.frame(y = rnorm(10),
                   x = rnorm(10))
  
  fixest::feols(y ~ x, data = d1)
}
fit1 <- f1()

# d1 is nowhere to be found
list(call_env = ls(fit1$call_env))
#> $call_env
#> character(0)
list(fml = ls(environment(fit1$fml)))
#> $fml
#> [1] "lhs" "res" "rhs"

# Evaluating lm() in its own environment
f2 <- function() {
  d2 <- data.frame(y = rnorm(10),
                   x = rnorm(10))
  
  lm(y ~ x, data = d2)
}
fit2 <- f2()

# d2 found easily in formula environment
list(terms = ls(environment(fit2$terms)))
#> $terms
#> [1] "d2"

Created on 2022-09-01 with reprex v2.0.2

This is problematic when using feols() inside testthat(), where variable scoping is unusual. If I try to use marginaleffects or insight::get_data(), the data cannot be found, because no information about the calling environment of feols() is actually stored in the output object. That is, even if I am willing to rely on the name of the original dataset and I haven't touched that dataset, I cannot access it if it is not in the global environment (which is the case inside simulations or testthat blocks).

I encourage you to consider at least providing an option for users to store the original dataset in the output object, perhaps tied to the lean option so that users with huge datasets can turn it off. A persistent version of the original dataset must exist somewhere; whether that is in a new environment created by feols() or in the output object itself, it takes up the same memory either way, so you might simply have to accept that.

@lrberge
Owner Author

lrberge commented Sep 2, 2022

Sorry, I don't have time for a full reply, but just to note that I didn't write insight::get_data().

The value call_env is internal cuisine, and it works fine. It is used internally in the following unexported function:

fixest:::fetch_data(fit1)
#>              y          x
#> 1  -0.86791309 -0.6534246
#> 2   0.09164135 -1.0960332
#> 3  -0.45866784 -0.9918161
#> 4  -1.12278344 -0.7937952
#> 5   0.11677956  0.4422500
#> 6  -0.45137917  1.1813671
#> 7   1.07984048  0.5867055
#> 8  -0.13633684 -2.0547528
#> 9   0.20415421 -1.7486942
#> 10  0.31391199 -2.0428270

I didn't save the calling environment directly but a child of it, to allow the calling environment to be garbage collected if needed (though I'm not sure that actually happens in the end). To clarify:

list(call_env = ls(parent.env(fit1$call_env)))
#> $call_env
#> [1] "d1"

Or stated differently:

eval(fit1$call$data, fit1$call_env)
#>              y          x
#> 1  -0.86791309 -0.6534246
#> 2   0.09164135 -1.0960332
#> 3  -0.45866784 -0.9918161
#> 4  -1.12278344 -0.7937952
#> 5   0.11677956  0.4422500
#> 6  -0.45137917  1.1813671
#> 7   1.07984048  0.5867055
#> 8  -0.13633684 -2.0547528
#> 9   0.20415421 -1.7486942
#> 10  0.31391199 -2.0428270

An option to save the data makes sense in some circumstances and will be added. Please just wait for my teaching semester to end.

@ngreifer

ngreifer commented Sep 2, 2022

This is very helpful, thank you!

@lrberge
Owner Author

lrberge commented Feb 13, 2024

Hi, with a huge time lag (sorry Vincent...), there's the new argument data.save:

base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
est = feols(y ~ x1, base, data.save = TRUE)
rm(base)

#
# the code below would not have worked with data.save = FALSE
#

vcov(est, ~species)
#>             (Intercept)         x1
#> (Intercept)   0.8823064 -0.3670791
#> x1           -0.3670791  0.1654361

update(est, .~.+x2)
#> OLS estimation, Dep. Var.: y
#> Observations: 150
#> Standard-errors: IID
#>             Estimate Std. Error  t value   Pr(>|t|)
#> (Intercept) 2.249140   0.247970  9.07022 7.0385e-16 ***
#> x1          0.595525   0.069328  8.58994 1.1633e-14 ***
#> x2          0.471920   0.017118 27.56916  < 2.2e-16 ***
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.329937   Adj. R2: 0.838003

BTW, I have now exposed the function fixest_data to retrieve the data set used in the estimation (following Kyle's suggestion in #465), and it accounts for this new mechanism.
Once this version of fixest is on CRAN, I may write a PR to insight::get_data() to account for it.
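For reference, a minimal usage sketch (assuming the fixest object is the only required argument):

# retrieve the data used for the estimation, directly from the object
head(fixest_data(est))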

@vincentarelbundock
Contributor

Oh, this looks fantastic!

Feel free to ping me on the insight repo when it's out on CRAN. I still have commit rights there and am happy to help with implementation.
