Unexpected call_env / summary_flags behaviour in model objects when called within a function. #514

Closed
aaronrudkin opened this issue Jul 4, 2024 · 2 comments


aaronrudkin commented Jul 4, 2024

I am running models on enormous amounts of data (20-odd million observations, 1000-odd variables counting FEs). We want to store fit model objects so that we can dynamically re-run various outputs (etable, coefplot, etc.). We run the models with clustered SEs and store the lean versions of the models in rds files for use on other machines.

I am running into an unusual case where the fit model objects end up enormous when the models are assembled inside function calls.

Here's an example. It's not quite a reprex, because you need the data, but I think it at least gives an intuition:

# big_dat is a ~300MB subset of my data
# big_formula is a DV ~ var1 + var2 + var3 | fevar1 + fevar2 style formula
# clustvar is the clustering variable

# Correct behavior:
test_obj <- feols(
 big_formula,
 lean = TRUE,
 cluster = ~ clustvar,
 data = big_dat
)

# Incorrect behavior:
test_func = function(x) {
  feols(
    big_formula,
    lean = TRUE,
    cluster = ~ clustvar,
    data = x)
}
test_obj2 = test_func(big_dat)

# Result
pryr::object_size(test_obj)
# 123.02 kB
pryr::object_size(test_obj2)
# 334.78 MB

This is not a misfire from pryr; attempts to serialize the resulting objects to a file reflect the same size disparity. I introspected the objects to figure out where the disconnect is:

for(obj in ls(test_obj)) {
  size1 <- pryr::object_size(test_obj[[obj]])
  size2 <- pryr::object_size(test_obj2[[obj]])
  if(size1 != size2) {
    print(paste0(obj, ": ", size1, " / ", size2))
  }
}
# [1] "call: 5264 / 4984"
# [1] "call_env: 336 / 334654480"
# [1] "summary_flags: 840 / 334654872"

The discrepancy in call is clearly due to the data argument being shorter in the second call (x rather than big_dat). Let's not worry about that. But as you can see, summary_flags and call_env are both carrying around the full environment of the call, even with lean = TRUE given as an argument.
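For intuition, here is a minimal base R sketch of the mechanism I suspect is at play. It does not use fixest at all; `make_in_function` and `local_copy` are hypothetical names made up for illustration. An object that keeps a reference to the environment of the function it was created in brings that environment's contents along when it is serialized:

```r
# Hypothetical stand-in for a model-fitting wrapper (not fixest code).
make_in_function <- function(x) {
  local_copy <- x      # ordinary local binding, stands in for the data
  list(fml = y ~ z)    # the formula records this function's environment
}

big <- rnorm(1e6)      # roughly 8 MB of doubles
obj <- make_in_function(big)

# The serialized size includes `local_copy`, because environment(obj$fml)
# still points at the function's execution environment.
size_with_env <- length(serialize(obj, NULL))

# Re-pointing the formula at the global environment drops that reference.
environment(obj$fml) <- globalenv()
size_without_env <- length(serialize(obj, NULL))

print(c(with_env = size_with_env, without_env = size_without_env))
```

At top level the captured environment is the global environment, which is serialized as a reference rather than by value. That would be consistent with the small size of test_obj.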

I can solve the problem by NULLing out these objects before serializing, and it doesn't seem to cause any downstream issues I wouldn't expect. I assume this is an oversight.
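A rough sketch of that workaround, assuming the model object behaves like a named list. `strip_fields` is a hypothetical helper of mine, not part of fixest, and the mock object below only imitates the two offending elements:

```r
# Hypothetical helper: drop selected elements before serializing.
# The default field names are the two elements identified above.
strip_fields <- function(model, fields = c("call_env", "summary_flags")) {
  for (f in intersect(fields, names(model))) {
    model[[f]] <- NULL   # assigning NULL removes the element from the list
  }
  model
}

# Mock object standing in for a fitted model (assumption: list-like).
mock <- list(
  coefficients  = c(var1 = 0.5),
  call_env      = new.env(),
  summary_flags = list(cluster = "clustvar")
)
slim <- strip_fields(mock)
print(names(slim))

# With a real model one would then do something like:
# saveRDS(strip_fields(test_obj2), "model.rds")
```

Anything that later needs those elements would no longer find them, so this is only a stopgap until lean = TRUE handles it.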

Suggested fix: have lean = TRUE drop the environment from the result object.

@lrberge
Owner

lrberge commented Aug 27, 2024

Hi, thanks for reporting and the very clear issue.

@aaronrudkin
Author

Hi! Sorry to re-open this issue, but I ran into another manifestation of it. This time, running feglm inside a function with lean=TRUE, family was retained and is similarly large. Same bug as above -- running outside a function does not trigger it. Specifically, it's the variance, dev.resids, aic, validmu, and simulate objects within the family object that carry around the environment.

A quick hack to verify this; j is the result of an feglm run inside a function using an enormous amount of data and with lean=TRUE:

> names(j[["family"]]) %>% map(function(x) { tibble(name = x, size = pryr::object_size(j[["family"]][[x]])) }) %>% bind_rows() %>% print(n = 39)
# A tibble: 14 × 2
   name         size      
   <chr>        <lbstr_by>
 1 family          120 B  
 2 link            112 B  
 3 linkfun         504 B  
 4 linkinv         504 B  
 5 variance      3.27 GB  
 6 dev.resids    3.27 GB  
 7 aic           3.27 GB  
 8 mu.eta          504 B  
 9 initialize   17.36 kB  
10 validmu       3.27 GB  
11 valideta        280 B  
12 simulate      3.27 GB  
13 family_type     112 B  
14 family_equiv    112 B  

And then just as further proof that the environment is the culprit (if the identical sizes weren't obvious), observe that the AIC object is just a function, and so should be tiny, if not for the attached environment:

> j[["family"]][["aic"]]
function (y, n, mu, wt, dev) 
{
    m <- if (any(n > 1)) 
        n
    else wt
    -2 * sum(ifelse(m > 0, (wt/m), 0) * dbinom(round(m * y), 
        round(m), mu, log = TRUE))
}
<bytecode: 0x55594eb707e8>
<environment: 0x55594e5c3658>

I assume this didn't show up earlier because feols doesn't have a family object, it's only for non-linear link functions?
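For anyone who wants to check what a closure's environment is holding, here is a small base R sketch. `make_family_like` and `payload` are hypothetical names; this imitates the situation rather than reproducing what feglm does internally:

```r
# Hypothetical constructor: returns a list of functions defined inside it,
# so each function's enclosing environment is this call's frame.
make_family_like <- function(data) {
  payload <- data   # local binding standing in for the large data
  list(aic = function(y, mu) -2 * sum(dnorm(y, mu, log = TRUE)))
}

fam <- make_family_like(rnorm(1e5))

# Names of the objects living in the closure's enclosing environment.
env_contents <- ls(environment(fam$aic))
print(env_contents)

# The function body is tiny, but serializing it writes the environment too.
closure_size <- length(serialize(fam$aic, NULL))
print(closure_size)
```

Running `ls(environment(j[["family"]][["aic"]]))` on the real object should show in the same way which objects are being dragged along.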

There's no rush on fixing this; I mitigate the problem by manually nulling things out in the meantime. Thanks for your great support on this bug and the other one I reported. fixest is really a breath of fresh air in a space where a lot of estimation packages are filled with jankiness.
