Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preserve term and factor order in formulae #139

Merged
merged 10 commits into from
Apr 22, 2023
Merged

Conversation

matthewwardrop
Copy link
Owner

@matthewwardrop matthewwardrop commented Apr 19, 2023

@bashtage @DongGilKim Here's an initial implementation that is 90% aligned with patsy. (Tests will fail due to the order of the terms being different than in tests).

Example:

>>> from formulaic import Formula
>>> from pandas import DataFrame
>>> from patsy import dmatrix

>>> data = DataFrame({"x": [1,2,3], "y": [1,2,3], "A": ["a", "b", "c"]})
>>> data
	x	y	A
0	1	1	a
1	2	2	b
2	3	3	c

>>> f = Formula("y + x:A + x + y:A")
>>> f
1 + y + x + x:A + y:A

>>> f.get_model_matrix(data)
	Intercept	y	x	A[T.b]:x	A[T.c]:x	A[T.b]:y	A[T.c]:y
0	1.0	1	1	0	0	0	0
1	1.0	2	2	2	0	2	0
2	1.0	3	3	0	3	0	3

>>> dmatrix("y + x:A + x + y:A", data)
DesignMatrix with shape (3, 7)
  Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
          1  1         0         0  1         0         0
          1  2         2         0  2         2         0
          1  3         0         3  3         0         3
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'y:A' (columns 2:4)
    'x' (column 4)
    'x:A' (columns 5:7)

As you can see from the above, formulaic is now respecting the input order of terms (except in the factor order in the column names of the model matrix, an oversight I'll fix shortly), but still differs slightly from patsy. This is because patsy groups first by the numerical factors involved in the terms, then by degree, and finally by user order; whereas formulaic is skipping that first grouping because at time of formula specification, formulaic doesn't know whether the variables are categorical or not.

We have two options:

  1. Keep the order of the columns in the model matrix as it is now (consistent with the formula, and with R), or
  2. Sort the columns of the model matrix the same way that patsy does, at the expense of consistency with the formula.

If patsy didn't exist, I would want to go with 1, which is more internally consistent and consistent with R. Since it does, how important is it to you that the order matches patsy exactly? I'm not committing to doing it, but I need to think a bit more about the trade-offs.

Closes #135

@DongGilKim
Copy link

@bashtage @DongGilKim Here's an initial implementation that is 90% aligned with patsy. (Tests will fail due to the order of the terms being different than in tests).

Example:

>>> from formulaic import Formula
>>> from pandas import DataFrame
>>> from patsy import dmatrix

>>> data = DataFrame({"x": [1,2,3], "y": [1,2,3], "A": ["a", "b", "c"]})
>>> data
	x	y	A
0	1	1	a
1	2	2	b
2	3	3	c

>>> f = Formula("y + x:A + x + y:A")
>>> f
1 + y + x + x:A + y:A

>>> f.get_model_matrix(data)
	Intercept	y	x	A[T.b]:x	A[T.c]:x	A[T.b]:y	A[T.c]:y
0	1.0	1	1	0	0	0	0
1	1.0	2	2	2	0	2	0
2	1.0	3	3	0	3	0	3

>>> dmatrix("y + x:A + x + y:A", data)
DesignMatrix with shape (3, 7)
  Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
          1  1         0         0  1         0         0
          1  2         2         0  2         2         0
          1  3         0         3  3         0         3
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'y:A' (columns 2:4)
    'x' (column 4)
    'x:A' (columns 5:7)

As you can see from the above, formulaic is now respecting the input order of terms (except in the factor order in the column names of the model matrix, an oversight I'll fix shortly), but still differs slightly from patsy. This is because patsy groups first by the numerical factors involved in the terms, then by degree, and finally by user order; whereas formulaic is skipping that first grouping because at time of formula specification, formulaic doesn't know whether the variables are categorical or not.

We have two options:

  1. Keep the order of the columns in the model matrix as it is now (consistent with the formula, and with R), or
  2. Sort the columns of the model matrix the same way that patsy does, at the expense of consistency with the formula.

If patsy didn't exist, I would want to go with 1, which is more internally consistent and consistent with R. Since it does, how important is it to you that the order matches patsy exactly? I'm not committing to doing it, but I need to think a bit more about the trade-offs.

Closes #135

Hi @matthewwardrop,

In my opinion, option 1 is more consistent. However, considering the widespread usage of the 'statsmodel' package for statistical analysis and econometrics in the Python community, which employs 'pasty' for formula syntax as said in their documentation (https://www.statsmodels.org/dev/example_formulas.html), it may be prudent for users to continue following the order of 'pasty' unless the developers of 'pasty' make changes accordingly.

oh, I just found out that 'patsy is no longer under active development. As of August 2021, Matthew Wardrop (@matthewwardrop) and Tomás Capretto (@tomicapretto) have taken on responsibility from Nathaniel Smith (@njsmith) for keeping the lights on, but no new feature development is planned.' at https://github.com/pydata/patsy

The best scenario would be that we persuade developers of 'statsmodel' to employ 'formulaic' and go with option 1. What do you think guys @bashtage @matthewwardrop?
Either going with option 1 or 2, It's a big enhancement. Thank you @matthewwardrop

@bashtage
Copy link
Contributor

I agree that 1 makes sense. I think statsmodels will move to formulaic after a decent deprecation period using an opt in, and having something that resembles the formula closely would help a lot.

@matthewwardrop
Copy link
Owner Author

matthewwardrop commented Apr 21, 2023

Thanks for the feedback @bashtage and @DongGilKim .

I thought about things a bit more, and figured I might as well add a few more knobs. Let me know if you think the following makes sense. If it all looks good, I'll add some spit and polish (and unit tests and docs) and merge shortly.

The two added knobs are _ordering on the Formula class, and cluster_by on the ModelSpec class. Here's an example along the lines of the above:

>>> from formulaic import Formula
>>> from pandas import DataFrame
>>> from patsy import dmatrix

>>> data = DataFrame({"x": [1,2,3], "y": [1,2,3], "A": ["a", "b", "c"]})
>>> data
	x	y	A
0	1	1	a
1	2	2	b
2	3	3	c

>>> f = Formula("y + x:A + x + y:A")
>>> f
1 + y + x + x:A + y:A

>>> Formula("y + x:A + x + y:A", _ordering="none")
1 + y + x:A + x + y:A

>>> Formula("y + x:A + x + y:A", _ordering="sort")
1 + x + y + A:x + A:y   # <-- this is Formulaic's old behaviour

>>> f.get_model_matrix(data)  # <- Equivalent to R output
   Intercept  y  x  x:A[T.b]  x:A[T.c]  y:A[T.b]  y:A[T.c]
0        1.0  1  1         0         0         0         0
1        1.0  2  2         2         0         2         0
2        1.0  3  3         0         3         0         3

>>> f.get_model_matrix(data, cluster_by="numerical_factors")  # <- Equivalent to Patsy, below
   Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
0        1.0  1         0         0  1         0         0
1        1.0  2         2         0  2         2         0
2        1.0  3         0         3  3         0         3

>>> dmatrix("y + x:A + x + y:A", data)
DesignMatrix with shape (3, 7)
  Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
          1  1         0         0  1         0         0
          1  2         2         0  2         2         0
          1  3         0         3  3         0         3
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'y:A' (columns 2:4)
    'x' (column 4)
    'x:A' (columns 5:7)

Overkill? Other thoughts?

@bashtage
Copy link
Contributor

I think the ordering is good since it will let me have an opt-in feature to change the order to the future behavior (_ordering="none") from the current. I'm not sure I would expose this to end users though since I think most expect/hope for something that mostly resembles the model spec they use.

I suppose the same is true for the second option, especially for statmodels which has used patsy for a long time. It would provide a ramp from patsy to something more R like over a couple of releases.

@matthewwardrop
Copy link
Owner Author

matthewwardrop commented Apr 21, 2023

@bashtage, oh... interesting. That seems to suggest you feel the default ordering should be as input by the user ("none"), rather than, as in R, ordering by interaction degree ("degree"). In this PR, "degree" is the default (matching R). Another thing I perhaps didn't make clear is that the model matrix will always be materialized in the same order as the formula, so there's no disconnect there (except when clustering is enabled as above).

>>> f = Formula("y + x:A + x + y:A")  # <-- equivalent to adding `_ordering="degree"`, which is the default
>>> f
1 + y + x + x:A + y:A
>>> f.get_model_matrix(data)
   Intercept  y  x  x:A[T.b]  x:A[T.c]  y:A[T.b]  y:A[T.c]
0        1.0  1  1         0         0         0         0
1        1.0  2  2         2         0         2         0
2        1.0  3  3         0         3         0         3

>>> f2 = Formula("y + x:A + x + y:A", _ordering="none")
>>> f2
1 + y + x:A + x + y:A
>>> f.get_model_matrix(data)
   Intercept  y  x:A[T.a]  x:A[T.b]  x:A[T.c]  y:A[T.b]  y:A[T.c]
0        1.0  1         1         0         0         0         0
1        1.0  2         0         2         0         2         0
2        1.0  3         0         0         3         0         3

Note that since x:A came first here, x was omitted to preserve full rank.

I think this behaviour makes sense, including having the defaults align with R, but if you have any further concerns, let me know! I'll try to get this in shortly!

@codecov
Copy link

codecov bot commented Apr 22, 2023

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.08 🎉

Comparison is base (db3d59b) 99.91% compared to head (4275799) 100.00%.

Additional details and impacted files
@@             Coverage Diff             @@
##             main      #139      +/-   ##
===========================================
+ Coverage   99.91%   100.00%   +0.08%     
===========================================
  Files          47        48       +1     
  Lines        2485      2530      +45     
===========================================
+ Hits         2483      2530      +47     
+ Misses          2         0       -2     
Flag Coverage Δ
unittests 100.00% <100.00%> (+0.08%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
formulaic/formula.py 100.00% <100.00%> (ø)
formulaic/materializers/__init__.py 100.00% <100.00%> (ø)
formulaic/materializers/base.py 100.00% <100.00%> (+0.66%) ⬆️
formulaic/materializers/pandas.py 100.00% <100.00%> (ø)
formulaic/materializers/types/__init__.py 100.00% <100.00%> (ø)
formulaic/materializers/types/enums.py 100.00% <100.00%> (ø)
formulaic/materializers/types/scoped_term.py 100.00% <100.00%> (ø)
formulaic/model_spec.py 100.00% <100.00%> (ø)
formulaic/parser/parser.py 100.00% <100.00%> (ø)
formulaic/parser/types/__init__.py 100.00% <100.00%> (ø)
... and 7 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

@matthewwardrop
Copy link
Owner Author

@bashtage @DongGilKim This is now being merged in. Feel free to try it out. If you find any issues, let me know!

@matthewwardrop matthewwardrop merged commit a7ecebd into main Apr 22, 2023
@matthewwardrop matthewwardrop deleted the preserve_term_order branch April 22, 2023 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Preserve variable order as they appear in formulas
3 participants