Intercept is not added after being removed #148

Closed
tomicapretto opened this issue Jul 15, 2023 · 4 comments · Fixed by #156
Assignees
Labels
enhancement New feature or request

Comments

@tomicapretto

In the following example I remove the intercept and then I add it. It turns out "it's never added". I'm not sure if this is a feature you would like to support, but if I mentally parse from left to right I would expect the final formula to contain the intercept.

import pandas as pd
import formulaic
from formulaic import model_matrix
print("version", formulaic.__version__)
df = pd.DataFrame({"g": ["a", "a", "a", "b", "b"]})
print(model_matrix("0 + g + 1", df))
version 0.6.4
   g[T.a]  g[T.b]
0       1       0
1       1       0
2       1       0
3       0       1
4       0       1
@matthewwardrop
Owner

Hi @tomicapretto ,

Thanks for taking the time to report this. However, this is actually expected behaviour due to the ordered nature of formulae and the automatic full-rank algorithm (where terms to the left take precedence over terms to the right in terms of materialization).

That is:

>>> print(model_matrix("0 + g + 1", df))
   g[T.a]  g[T.b]
0       1       0
1       1       0
2       1       0
3       0       1
4       0       1
>>> print(model_matrix("0 + 1 + g", df))
   Intercept  g[T.b]
0        1.0       0
1        1.0       0
2        1.0       0
3        1.0       1
4        1.0       1

These model matrices are equivalent: their columns span the same vector space, and each matrix is full rank. In both cases the intercept is spanned, but in the former case it is spanned by the categorical factors.
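
To make the equivalence concrete, here is a minimal sketch that checks it using only the standard library. It does not call formulaic: the two matrices are copied by hand from the outputs above, and `matrix_rank` and `hstack` are small hypothetical helpers written for this illustration.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Rank of a small matrix via exact Gaussian elimination (illustrative helper)."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below `rank` with a non-zero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from every other row.
        for r in range(len(m)):
            if r != rank and m[r][col] != 0:
                factor = m[r][col] / m[rank][col]
                m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


def hstack(a, b):
    """Place the columns of `b` to the right of the columns of `a`."""
    return [row_a + row_b for row_a, row_b in zip(a, b)]


# "0 + g + 1": columns g[T.a], g[T.b]
x_full = [[1, 0], [1, 0], [1, 0], [0, 1], [0, 1]]
# "0 + 1 + g": columns Intercept, g[T.b]
x_reduced = [[1, 0], [1, 0], [1, 0], [1, 1], [1, 1]]

rank_full = matrix_rank(x_full)
rank_reduced = matrix_rank(x_reduced)
# If stacking the two matrices side by side does not raise the rank,
# their columns span the same space.
rank_stacked = matrix_rank(hstack(x_full, x_reduced))

# The intercept column is the sum of the two dummy columns.
intercept_from_dummies = [row[0] + row[1] for row in x_full]

print(rank_full, rank_reduced, rank_stacked)
print(intercept_from_dummies)
```

All three ranks are equal, so neither matrix adds a direction the other lacks, and the column of ones is recovered as g[T.a] + g[T.b].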

I'll close this one out for now, but let me know if you have further questions!

@matthewwardrop
Owner

Hmm... but this does, upon further reflection, not match the behaviour expected by users of R or patsy. It might be worth special casing the full rank algorithm to deal with the intercept. Point taken. I'll fix this!

@tomicapretto
Author

@matthewwardrop your explanation makes perfect sense (i.e. terms to the left take precedence over terms to the right). But as you said, it may be surprising to users coming from Patsy or R. I don't have a strong preference. But if you decide to continue with the current approach, it would be good to leave a note in the documentation explaining why it works the way it works in this specific case. I could open a PR 😄

@matthewwardrop
Owner

Thanks again for reporting this @tomicapretto . This should be resolved as of 0.6.5+.

2 participants