
drop both columns in dependent variable and design matrix when missings occur #124

Closed
s3alfisc opened this issue Nov 26, 2022 · 5 comments
Labels: question (Further information is requested)

s3alfisc commented Nov 26, 2022

Hi Matthew - thanks for making this super useful package available!

Currently, when there is a missing value in a column of the design matrix X, but not in the dependent variable Y, model_matrix() drops the affected rows from X, but not from Y.

Here's a quick example:

from formulaic import model_matrix 
import numpy as np
import pandas as pd

N = 10
X = np.random.normal(0, 1, N)
Y = np.random.normal(0, 1, N)
data = pd.DataFrame({'Y': Y, 'X': X})

data.loc[0, 'X'] = None

fml = 'Y ~ X'
y, X = model_matrix(fml, data, na_action='ignore')
>>> y.head()
          Y
0 -0.174508
1  0.373280
2  1.631371
3 -0.622598
4 -0.482028
>>> X.head()
   Intercept         X
1        1.0  2.652463
2        1.0 -1.356067
3        1.0  1.143417
4        1.0 -1.020435
5        1.0  0.072263

y, X = model_matrix(fml, data)
>>> y.shape
(10, 1)
>>> X.shape
(9, 2)

The row containing the NaN is dropped from X, but not from Y.

I think it would be nice to add functionality for this (though it might already exist?).

E.g. in R, before calling base::model.matrix(), one would define a base::model.frame(), which by default drops every row containing a missing value (from both X and Y).

N <- 10
X <- rnorm(N)
Y <- rnorm(N)
X[1] <- NA

data <- data.frame(Y = Y, X = X)
mf <- model.frame(Y ~ X, data)
# Y           X
# 2   0.9418535  0.05795054
# 3  -1.2333905 -1.02186716
# 4  -0.1277604  1.59699265
# 5  -0.1258892 -1.16908339
# 6   0.2176256  0.22375018
# 7  -1.2068559  0.92400472
# 8  -0.5803319  0.55442642
# 9  -1.3511992 -0.34372283
# 10 -2.0518279 -0.31997878
mm <- model.matrix(mf)

depvar <- model.response(mf)

s3alfisc (Author) commented

Another nice feature of the R model.frame class is that it returns a set of informative attributes; for example, it returns an index of the dropped rows (na.action). Is there a similar feature in formulaic? I quickly glanced over all attributes of formulaic.model_matrix.ModelMatrix but did not find an equivalent attribute.

attributes(mf)
$names
[1] "Y" "X"

$terms
Y ~ X
attr(,"variables")
list(Y, X)
attr(,"factors")
  X
Y 0
X 1
attr(,"term.labels")
[1] "X"
attr(,"order")
[1] 1
attr(,"intercept")
[1] 1
attr(,"response")
[1] 1
attr(,".Environment")
<environment: R_GlobalEnv>
attr(,"predvars")
list(Y, X)
attr(,"dataClasses")
        Y         X 
"numeric" "numeric" 

$row.names
[1]  2  3  4  5  6  7  8  9 10

$class
[1] "data.frame"

$na.action
1 
1 
attr(,"class")
[1] "omit"

matthewwardrop (Owner) commented

Hi @s3alfisc! Thanks for reaching out!

re: NaN dropping being inconsistent between the left- and right-hand sides, that's interesting. I cannot reproduce that behaviour. Which version of formulaic are you using?

re: metadata for which rows are being dropped, you're correct; formulaic doesn't propagate that information through to the ModelSpec (since it is not "specification", but rather data-specific state). When using pandas, however, the index is maintained, allowing you to determine which rows were omitted.
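For example, something along these lines recovers the omitted row labels from your example above (a sketch using plain pandas index arithmetic, not a dedicated formulaic API):

y, X = model_matrix('Y ~ X', data)
# rows present in the input but absent from the materialized matrix
dropped = data.index.difference(X.index)

Do you have use-cases where this would be useful?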

s3alfisc (Author) commented Nov 28, 2022

Hi Matthew,

I am running version 0.5.2. Sorry for not reporting the package version; a rookie mistake. Is there any other information I could provide to help debug this?

A (potentially highly specialized) use case where metadata on dropped rows might be helpful:

I want to run cluster-robust inference as a post-estimation step after fitting a regression model specified via Y ~ X, where some rows have been dropped due to missing values. The fitted model only stores the 'cleaned' X and Y used during fitting. The clustering variable is not included in the model formula, and therefore not included in the design matrix X. To make the post-estimation inference work, I need to drop the same rows from the clustering variable that were dropped from X and Y, and for that an index of dropped rows would be handy. The alternative to metadata for dropped rows would be to coerce the categorical clustering variable to an integer type, add it to the model formula, run model_matrix('Y ~ X + cluster', data), and fetch the NaN-free cluster variable from the result.

A similar problem arises for regression models with high-dimensional fixed effects, where the fixed effects are projected out before running OLS on the residualized X and Y (e.g. as in the fixest R package). In this case, I want to keep the categorical fixed-effect variable in a single column, i.e. create the one-hot encoded X only for variables which are not "projected out". In a second step, I then drop rows with missing values from both X and the fixed-effect variable(s), residualize, and run OLS.

There is an obvious workaround for both problems (even if the input is not a pandas.DataFrame): just set na_action='ignore' and compute an index of all missing values in both X and Y myself. Here's an (admittedly rather convoluted) code example I hacked out over the weekend. =)
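In essence, the idea is something like the following sketch (not the linked example; the cluster column is hypothetical, and this assumes na_action='ignore' retains all rows, as documented):

import numpy as np
from formulaic import model_matrix

# materialize without dropping anything, then build one shared row mask
y, X = model_matrix('Y ~ X', data, na_action='ignore')
keep = ~(np.isnan(np.asarray(y)).any(axis=1) | np.isnan(np.asarray(X)).any(axis=1))

# apply the same mask to y, X, and the external clustering variable
y, X = y[keep], X[keep]
cluster = data['cluster'].to_numpy()[keep]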

Thanks for your response, and please let me know if I can help debug this further! Best, Alex

matthewwardrop (Owner) commented

Goodness, sorry @s3alfisc! Life has kept me busy and this went under the radar.

I'm cautious about adding too much extra information to the model spec (such as the indices of missing rows) because it is often the case that you want to serialize it for later use; and it is preferable if it doesn't scale with the size of the data. We could revisit that as necessary, of course.

The immediate solution that comes to mind is to use multi-part formulae, like:

Y ~ X | cluster, which would result in a structured output of three model matrices: lhs=Y, rhs=(X, cluster). This keeps the distinction between your cluster variables, but also guarantees that the same rows are dropped across all model matrices. There's actually more you could do too, like:

from formulaic import Formula

Formula(lhs='Y', rhs='X', clusters='cluster')

This would result in three top-level model matrices, which you can extract by name or index. Also, if you are using pandas dataframes, the index is maintained from the input data, so you could use that to slice your data too.
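For the first variant, usage would look roughly like this (a sketch; it assumes data contains a cluster column, and the exact way to unpack the right-hand-side parts may vary across formulaic versions):

from formulaic import model_matrix

# a single materialization, so the same NA rows are dropped from every part
parts = model_matrix('Y ~ X | cluster', data)
y = parts.lhs
X, cluster = parts.rhs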

Do any of these solve your use-case?

On Apr 20, 2023, matthewwardrop added the labels question (Further information is requested) and cannot reproduce, and later removed cannot reproduce.
matthewwardrop (Owner) commented

Hi again @s3alfisc ! I'm going to assume that the above does solve your use-cases, and close this one out. Feel free to reopen if you'd like to resume the conversation!
