drop both columns in dependent variable and design matrix when missings occur #124
Comments
Another nice feature of the R
Hi @s3alfisc! Thanks for reaching out!
Re: NaN dropping being inconsistent between the left- and right-hand sides, that's interesting. I cannot reproduce that behaviour. Which version of formulaic are you using?
Re: metadata for which rows are being dropped, you're correct; formulaic doesn't propagate that information through to the
Hi Matthew, I am running version 0.5.2. Sorry for not reporting the package version, a rookie mistake. Is there any other information I could provide to help debug this?
A (potentially highly specialized) use case where metadata on dropped rows might be helpful: I want to run cluster robust inference as a post estimation command after a regression model, specified via
A similar problem arises for regression models with high-dimensional fixed effects, where the fixed effects are projected out prior to running OLS on the residualized X and Y (e.g. as in the fixest R package). In this case, I want to keep the categorical fixed effect variable in a single column, and hence create the one-hot encoded X only for variables which are not "projected out". In a second step, I then delete the rows with missing values from both X and the fixed effect variable(s), residualize, and run OLS.
There is an obvious workaround for both problems (even if the input is not a pandas.DataFrame): just set the
Thanks for your response, and please let me know if I can help debug this further!
Best, Alex
Goodness, sorry @s3alfisc! Life has kept me busy and this went under the radar.
I'm cautious about adding too much extra information to the model spec (such as the indices of missing rows) because it is often the case that you want to serialize it for later use, and it is preferable if it doesn't scale with the size of the data. We could revisit that as necessary, of course.
The immediate solution that comes to mind is to use multi-part formulae, like:
This would result in three top-level model matrices, which you can extract by name or index. Also, if you are using pandas dataframes, the index is maintained from the input data, so you could use that to slice your data too.
Do any of these solve your use-case?
Hi again @s3alfisc! I'm going to assume that the above does solve your use-cases, and close this one out. Feel free to reopen if you'd like to resume the conversation!
Hi Matthew - thanks for making this super useful package available!
Currently, when there is a missing value in a column of the design matrix X, but not in the dependent variable Y, model_matrix() drops the affected observations (rows) from X, but not from Y. Here's a quick example:
The row containing the NaN is dropped from X, but not from Y.
I think it would be nice to add functionality for this (though it might already exist?).
E.g. in R, before calling base::model.matrix(), one would define a base::model.frame(), which would then by default drop the entire row where a missing value exists (for both X and Y).
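For illustration, the model.frame-style behaviour described above can be sketched in plain pandas. This is not formulaic's API and not the example from the original post (which is not preserved here); the data and column names are hypothetical. The idea is simply to drop incomplete rows across every model variable before splitting into Y and X.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: one missing value in a regressor, none in y.
df = pd.DataFrame(
    {
        "y": [1.0, 2.0, 3.0],
        "x1": [1.0, np.nan, 3.0],
        "x2": [4.0, 5.0, 6.0],
    }
)

# Listwise deletion over all model variables, done before splitting,
# so that Y and X always lose the same rows.
frame = df.dropna(subset=["y", "x1", "x2"])
Y = frame[["y"]]
X = frame[["x1", "x2"]]

# Index labels of the rows that were removed.
dropped = df.index.difference(frame.index)
```

Because the split happens after the rows are dropped, Y and X share the same index, and `dropped` records which observations were removed.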