
drop both columns in dependent variable and design matrix when missings occur #124

Closed
s3alfisc opened this issue Nov 26, 2022 · 5 comments
Labels: question (Further information is requested)

s3alfisc commented Nov 26, 2022

Hi Matthew - thanks for making this super useful package available!

Currently, when there is a missing value in a column of the design matrix X, but not in the dependent variable Y, model_matrix() drops the affected rows from X, but not from Y.

Here's a quick example:

from formulaic import model_matrix 
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
data = pd.DataFrame({'Y': Y, 'X': X})

data.loc[0, 'X'] = None

fml = 'Y ~ X'
y, X = model_matrix(fml, data, na_action='ignore')
>>> y.head()
          Y
0 -0.174508
1  0.373280
2  1.631371
3 -0.622598
4 -0.482028
>>> X.head()
   Intercept         X
1        1.0  2.652463
2        1.0 -1.356067
3        1.0  1.143417
4        1.0 -1.020435
5        1.0  0.072263

y, X = model_matrix(fml, data)
>>> y.shape
(10, 1)
>>> X.shape
(9, 2)

The row containing the NaN is dropped from X, but not from Y.

I think it would be nice to add functionality for this (though it might already exist?).

E.g. in R, before calling base::model.matrix(), one would define a base::model.frame(), which by default drops every row containing a missing value (from both X and Y).

N <- 10
X <- rnorm(N)
Y <- rnorm(N)
X[1] <- NA

data <- data.frame(Y = Y, X = X)
mf <- model.frame(Y ~ X, data)
# Y           X
# 2   0.9418535  0.05795054
# 3  -1.2333905 -1.02186716
# 4  -0.1277604  1.59699265
# 5  -0.1258892 -1.16908339
# 6   0.2176256  0.22375018
# 7  -1.2068559  0.92400472
# 8  -0.5803319  0.55442642
# 9  -1.3511992 -0.34372283
# 10 -2.0518279 -0.31997878
mm <- model.matrix(mf)

depvar <- model.response(mf)

s3alfisc (Author) commented

Another nice feature of the R model.frame class is that it returns a set of informative attributes; for example, it returns an index of the dropped rows (na.action). Is there a similar feature in formulaic? I quickly glanced over all attributes of formulaic.model_matrix.ModelMatrix but did not find an equivalent attribute.

attributes(mf)
$names
[1] "Y" "X"

$terms
Y ~ X
attr(,"variables")
list(Y, X)
attr(,"factors")
  X
Y 0
X 1
attr(,"term.labels")
[1] "X"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(Y, X)
attr(,"dataClasses")
        Y         X 
"numeric" "numeric" 

$row.names
[1]  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

$na.action
1 
1 
attr(,"class")
[1] "omit"

matthewwardrop (Owner) commented

Hi @s3alfisc! Thanks for reaching out!

re: NaN dropping being inconsistent between the left- and right-hand sides, that's interesting. I cannot reproduce that behaviour. Which version of formulaic are you using?

re: metadata for which rows are being dropped, you're correct; formulaic doesn't propagate that information through to the ModelSpec (since it is not "specification", but rather data-specific state). When using pandas, however, the index is maintained, allowing you to determine which rows were omitted.
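For example, something along these lines recovers the omitted row labels from your example above (a sketch using plain pandas index arithmetic, not a dedicated formulaic API):

y, X = model_matrix('Y ~ X', data)
# rows present in the input but absent from the materialized matrix
dropped = data.index.difference(X.index)

Do you have use-cases where this would be useful?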

s3alfisc (Author) commented Nov 28, 2022

Hi Matthew,

I am running version 0.5.2. Sorry for not reporting the package version; a rookie mistake. Is there any other information I could provide to help debug this?

A (potentially highly specialized) use case where metadata on dropped rows might be helpful:

I want to run cluster-robust inference as a post-estimation step after fitting a regression model specified via Y ~ X, where some rows have been dropped due to missing values. The fitted model only stores the 'cleaned' X and Y used during fitting. The clustering variable is not included in the model formula, and therefore not included in the design matrix X. To make the post-estimation inference work, I need to drop the same rows from the clustering variable that were dropped from X and Y, and for that an index of dropped rows would be handy. The alternative to metadata for dropped rows would be to coerce the categorical clustering variable to an integer type, add it to the model formula, run model_matrix('Y ~ X + cluster', data), and fetch the NaN-free cluster variable from the result.

A similar problem arises for regression models with high-dimensional fixed effects, where the fixed effects are projected out before running OLS on the residualized X and Y (e.g. as in the fixest R package). In this case, I want to keep the categorical fixed-effect variable in a single column, i.e. create the one-hot encoded X only for variables which are not "projected out". In a second step, I then drop rows with missing values from both X and the fixed-effect variable(s), residualize, and run OLS.

There is an obvious workaround for both problems (even if the input is not a pandas.DataFrame): just set na_action='ignore' and compute an index of all missing values in both X and Y myself. Here's an (admittedly rather convoluted) code example I hacked out over the weekend. =)
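In essence, the idea is something like the following sketch (not the linked example; the cluster column is hypothetical, and this assumes na_action='ignore' retains all rows, as documented):

import numpy as np
from formulaic import model_matrix

# materialize without dropping anything, then build one shared row mask
y, X = model_matrix('Y ~ X', data, na_action='ignore')
keep = ~(np.isnan(np.asarray(y)).any(axis=1) | np.isnan(np.asarray(X)).any(axis=1))

# apply the same mask to y, X, and the external clustering variable
y, X = y[keep], X[keep]
cluster = data['cluster'].to_numpy()[keep]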

Thanks for your response, and please let me know if I can help debug this further! Best, Alex

matthewwardrop (Owner) commented

Goodness, sorry @s3alfisc! Life has kept me busy and this went under the radar.

I'm cautious about adding too much extra information to the model spec (such as the indices of missing rows) because it is often the case that you want to serialize it for later use; and it is preferable if it doesn't scale with the size of the data. We could revisit that as necessary, of course.

The immediate solution that comes to mind is to use multi-part formulae, like:

Y ~ X | cluster, which would result in a structured output of three model matrices: lhs=Y, rhs=(X, cluster). This keeps the distinction between your cluster variables, but also guarantees that the same rows are dropped across all model matrices. There's actually more you could do too, like:

from formulaic import Formula

Formula(lhs='Y', rhs='X', clusters='cluster')

This would result in three top-level model matrices, which you can extract by name or index. Also, if you are using pandas dataframes, the index is maintained from the input data, so you could use that to slice your data too.
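For the first variant, usage would look roughly like this (a sketch; it assumes data contains a cluster column, and the exact way to unpack the right-hand-side parts may vary across formulaic versions):

from formulaic import model_matrix

# a single materialization, so the same NA rows are dropped from every part
parts = model_matrix('Y ~ X | cluster', data)
y = parts.lhs
X, cluster = parts.rhs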

Do any of these solve your use-case?

On Apr 20, 2023, matthewwardrop added the labels question (Further information is requested) and cannot reproduce, and later removed cannot reproduce.
matthewwardrop (Owner) commented

Hi again @s3alfisc ! I'm going to assume that the above does solve your use-cases, and close this one out. Feel free to reopen if you'd like to resume the conversation!
