Original data set cannot be safely accessed when the estimation is performed in a loop #340
Comments
Just wanted to +1 this issue. It would be awesome if we could store the original dataset in
As always, thanks for your great work on this!
I see now that I misread your post and that you do not plan to attach the original data. FWIW, here's an example of the problematic situation I had in mind. The first model should produce a marginal effects data frame with 32 rows because there are 32 observations. Yet, because we reassign an object in the global environment, trying to access the
Of course, this is dangerous behavior by the user, but reassigning data frames is super common practice…
library(fixest)
library(marginaleffects)
dat <- mtcars
mod1 <- feols(hp ~ mpg | cyl, data = dat)
dat <- dat[1:20,]
mod2 <- feols(hp ~ mpg | cyl, data = dat)
marginaleffects(mod1) |> nrow()
#> [1] 20
marginaleffects(mod2) |> nrow()
#> [1] 20
Chiming in with my own examination of the problem.
# Evaluating feols() in its own environment
f1 <- function() {
d1 <- data.frame(y = rnorm(10),
x = rnorm(10))
fixest::feols(y ~ x, data = d1)
}
fit1 <- f1()
# d1 is nowhere to be found
list(call_env = ls(fit1$call_env))
#> $call_env
#> character(0)
list(fml = ls(environment(fit1$fml)))
#> $fml
#> [1] "lhs" "res" "rhs"
# Evaluating lm() in its own environment
f2 <- function() {
d2 <- data.frame(y = rnorm(10),
x = rnorm(10))
lm(y ~ x, data = d2)
}
fit2 <- f2()
# d2 found easily in formula environment
list(terms = ls(environment(fit2$terms)))
#> $terms
#> [1] "d2"
Created on 2022-09-01 with reprex v2.0.2
This is problematic when using
I encourage you to consider at least providing an option for users to store the original dataset in the output object, perhaps with the
Don't have the time to reply, sorry, but just to say that I didn't write insight::get_data(). The value call_env is internal cuisine, and it works fine. It is used internally in the following unexported function:
fixest:::fetch_data(fit1)
#> y x
#> 1 -0.86791309 -0.6534246
#> 2 0.09164135 -1.0960332
#> 3 -0.45866784 -0.9918161
#> 4 -1.12278344 -0.7937952
#> 5 0.11677956 0.4422500
#> 6 -0.45137917 1.1813671
#> 7 1.07984048 0.5867055
#> 8 -0.13633684 -2.0547528
#> 9 0.20415421 -1.7486942
#> 10 0.31391199 -2.0428270
I didn't save the environment directly but a child of the calling environment, to allow the environment to be garbage collected if needed (but I'm not sure that happens in the end). To clarify:
list(call_env = ls(parent.env(fit1$call_env)))
#> $call_env
#> [1] "d1"
Or stated differently:
eval(fit1$call$data, fit1$call_env)
#> y x
#> 1 -0.86791309 -0.6534246
#> 2 0.09164135 -1.0960332
#> 3 -0.45866784 -0.9918161
#> 4 -1.12278344 -0.7937952
#> 5 0.11677956 0.4422500
#> 6 -0.45137917 1.1813671
#> 7 1.07984048 0.5867055
#> 8 -0.13633684 -2.0547528
#> 9 0.20415421 -1.7486942
#> 10 0.31391199 -2.0428270
An option to save the data makes sense in some circumstances and will be there. Just wait for my teaching semester to end, please.
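To make the mechanism above concrete, here is a minimal base-R sketch of storing a child of the calling environment and re-evaluating the data call later. Note that fit_sketch() is a hypothetical stand-in written for illustration; it is not fixest's actual code, and only mimics the call and call_env values discussed above.

```r
# Hypothetical stand-in for an estimation function: it keeps the call and
# a child of the calling environment instead of the data itself.
fit_sketch <- function(data) {
  list(
    call = match.call(),
    call_env = new.env(parent = parent.frame())
  )
}

f <- function() {
  d1 <- data.frame(y = 1:3, x = 4:6)
  fit_sketch(data = d1)
}

fit <- f()

# The child environment itself is empty ...
ls(fit$call_env)
# ... but its parent is the environment in which the "estimation" was called
ls(parent.env(fit$call_env))
# Evaluating the stored data call walks up to the parent and finds d1
eval(fit$call$data, fit$call_env)
```

The lookup succeeds because eval() searches the child environment first and then its enclosing environments, which is why ls() on the child alone shows nothing.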
This is very helpful, thank you!
Hi, with a huge time lag (sorry Vincent...), there's the new argument
base = setNames(iris, c("y", "x1", "x2", "x3", "species"))
est = feols(y ~ x1, base, data.save = TRUE)
rm(base)
#
# the code below would not have worked with data.save = FALSE
#
vcov(est, ~species)
#> (Intercept) x1
#> (Intercept) 0.8823064 -0.3670791
#> x1 -0.3670791 0.1654361
update(est, .~.+x2)
#> OLS estimation, Dep. Var.: y
#> Observations: 150
#> Standard-errors: IID
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 2.249140 0.247970 9.07022 7.0385e-16 ***
#> x1 0.595525 0.069328 8.58994 1.1633e-14 ***
#> x2 0.471920 0.017118 27.56916 < 2.2e-16 ***
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> RMSE: 0.329937 Adj. R2: 0.838003
BTW I now exposed the function fixest_data to get the data set used for the estimation (following Kyle's suggestion #465), which accounts for this new mechanism.
Oh, this looks fantastic! Feel free to ping me on the insight repo when it's out on CRAN. I still have commit rights there and am happy to help with implementation.
Example
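A minimal sketch of the kind of loop at stake, using base R's lm() as a stand-in for a fixest estimation (an assumption made so the snippet runs without fixest; the renamed iris columns follow the naming used elsewhere in this thread). It shows what happens when the stored data call is re-evaluated once the loop is over.

```r
base <- setNames(iris, c("y", "x1", "x2", "x3", "species"))

all_fits <- list()
for (s in levels(base$species)) {
  # The data call `base[base$species == s, ]` is stored unevaluated in the fit
  all_fits[[s]] <- lm(y ~ x1, data = base[base$species == s, ])
}

# After the loop, `s` holds the last species. Re-evaluating each stored
# data call therefore returns the same subset for all three models.
refetched <- lapply(all_fits, function(fit) eval(fit$call$data))
species_seen <- sapply(refetched, function(d) unique(as.character(d$species)))
species_seen
```

Each model was estimated on a different species, yet every re-evaluated data call resolves to the subset for the last value of s.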
What happens
In fixest many computations are delayed, occurring after the estimation and only at the user's request. This is true for computing the standard-errors (when clustering w.r.t. a variable not used in the estimation) and for several fit statistics.
For post-computation to work properly, the original data, the one used for the estimation, needs to be accessed.
Of course, one solution would be to store the original data in the estimation object. This would be safe but would come at an exorbitant memory price.
Currently, the data is accessed by using the same data call as in the estimation: it is just an access to a data set currently stored in memory. Following the example, base[base$species == s, ] is evaluated to get the data.
In general, this is fine. Now comes the loop problem. When information has to be computed after the loop (in the example, the mean of the dependent variable), the data for all three models is accessed with base[base$species == s, ], where s now holds its last value. The same data set is therefore erroneously used for every model, leading to wrong results.
Solutions
VCOV
This problem affects the VCOV if clustering (or any other VCOV using an extra variable) has to be performed ex-post.
The solution is then to use the argument vcov at estimation time.
Note that you will not be able to navigate through different VCOVs (other than standard and heteroskedasticity-robust) ex post. If needed, you will have to store the object with the different VCOVs within the loop, while the right data is accessible.
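As a rough sketch of this workaround, assuming a fixest version whose feols() accepts the vcov argument: the clustered VCOV is requested inside the loop, while the right subset is still accessible. The mtcars data and the choice of gear as the clustering variable are purely illustrative.

```r
library(fixest)

dat <- mtcars
fits <- list()
for (s in sort(unique(dat$cyl))) {
  # Cluster by `gear` at estimation time, so no ex-post data access is needed
  fits[[as.character(s)]] <- feols(hp ~ mpg, data = dat[dat$cyl == s, ], vcov = ~gear)
}

# Standard errors computed with the VCOV chosen at estimation time
lapply(fits, se)
```

If several VCOVs are needed for the same model, one option is to store one object per VCOV inside the loop in the same way.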
Fit statistics
Currently there is no way to store the fit statistics at estimation time.
There is a major overhaul of the fit statistics mechanism under way. Once the new fit-stats are implemented, that will be possible and easy (basically just using fitstat = TRUE will do).
In the short run, if you want to use data-dependent fit-stats ex post, you need to do it manually. FWIW, here's an example:
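One possible way to do this manually, sketched under the assumption that the statistic of interest is the mean of the dependent variable: compute it inside the loop, where the right subset is at hand, and store it next to the estimation. The list layout (est, mean_y) is a hypothetical convention chosen for this sketch, not a fixest feature.

```r
library(fixest)

base <- setNames(iris, c("y", "x1", "x2", "x3", "species"))

res <- list()
for (s in levels(base$species)) {
  d <- base[base$species == s, ]
  est <- feols(y ~ x1, d)
  # Compute the data-dependent statistic now, while `d` is the right data
  res[[s]] <- list(est = est, mean_y = mean(d$y))
}

# Means of the dependent variable, one per model
mean_y <- sapply(res, function(r) r$mean_y)
mean_y
```

Because the statistic is computed from d directly, it does not depend on the data call being re-evaluated after the loop.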