Preserve term and factor order in formulae #139

matthewwardrop · 2023-04-19T05:33:16Z

@bashtage @DongGilKim Here's an initial implementation that is 90% aligned with patsy. (Tests will fail due to the order of the terms being different than in tests).

Example:

>>> from formulaic import Formula
>>> from pandas import DataFrame
>>> from patsy import dmatrix

>>> data = DataFrame({"x": [1,2,3], "y": [1,2,3], "A": ["a", "b", "c"]})
>>> data
	x	y	A
0	1	1	a
1	2	2	b
2	3	3	c

>>> f = Formula("y + x:A + x + y:A")
>>> f
1 + y + x + x:A + y:A

>>> f.get_model_matrix(data)
	Intercept	y	x	A[T.b]:x	A[T.c]:x	A[T.b]:y	A[T.c]:y
0	1.0	1	1	0	0	0	0
1	1.0	2	2	2	0	2	0
2	1.0	3	3	0	3	0	3

>>> dmatrix("y + x:A + x + y:A", data)
DesignMatrix with shape (3, 7)
  Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
          1  1         0         0  1         0         0
          1  2         2         0  2         2         0
          1  3         0         3  3         0         3
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'y:A' (columns 2:4)
    'x' (column 4)
    'x:A' (columns 5:7)

As you can see from the above, formulaic is now respecting the input order of terms (except in the factor order in the column names of the model matrix, an oversight I'll fix shortly), but still differs slightly from patsy. This is because patsy groups first by the numerical factors involved in the terms, then by degree, and finally by user order; whereas formulaic is skipping that first grouping because at time of formula specification, formulaic doesn't know whether the variables are categorical or not.

We have two options:

Keep the order of the columns in the model matrix as it is now (consistent with the formula, and with R), or
Sort the columns of the model matrix the same way that patsy does, at the expense of consistency with the formula.

If patsy didn't exist, I would want to go with 1, which is more internally consistent and consistent with R. Since it does, how important is it to you that the order matches patsy exactly? I'm not committing to doing it, but I need to think a bit more about the trade-offs.

Closes #135

DongGilKim · 2023-04-19T13:51:06Z

@bashtage @DongGilKim Here's an initial implementation that is 90% aligned with patsy. (Tests will fail due to the order of the terms being different than in tests).

Example:
>>> from formulaic import Formula
>>> from pandas import DataFrame
>>> from patsy import dmatrix

>>> data = DataFrame({"x": [1,2,3], "y": [1,2,3], "A": ["a", "b", "c"]})
>>> data
	x	y	A
0	1	1	a
1	2	2	b
2	3	3	c

>>> f = Formula("y + x:A + x + y:A")
>>> f
1 + y + x + x:A + y:A

>>> f.get_model_matrix(data)
	Intercept	y	x	A[T.b]:x	A[T.c]:x	A[T.b]:y	A[T.c]:y
0	1.0	1	1	0	0	0	0
1	1.0	2	2	2	0	2	0
2	1.0	3	3	0	3	0	3

>>> dmatrix("y + x:A + x + y:A", data)
DesignMatrix with shape (3, 7)
  Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
          1  1         0         0  1         0         0
          1  2         2         0  2         2         0
          1  3         0         3  3         0         3
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'y:A' (columns 2:4)
    'x' (column 4)
    'x:A' (columns 5:7)
As you can see from the above, formulaic is now respecting the input order of terms (except in the factor order in the column names of the model matrix, an oversight I'll fix shortly), but still differs slightly from patsy. This is because patsy groups first by the numerical factors involved in the terms, then by degree, and finally by user order; whereas formulaic is skipping that first grouping because at time of formula specification, formulaic doesn't know whether the variables are categorical or not.

We have two options:

Keep the order of the columns in the model matrix as it is now (consistent with the formula, and with R), or

Sort the columns of the model matrix the same way that patsy does, at the expense of consistency with the formula.

If patsy didn't exist, I would want to go with 1, which is more internally consistent and consistent with R. Since it does, how important is it to you that the order matches patsy exactly? I'm not committing to doing it, but I need to think a bit more about the trade-offs.

Closes #135

Hi @matthewwardrop,

In my opinion, option 1 is more consistent. However, considering the widespread usage of the 'statsmodel' package for statistical analysis and econometrics in the Python community, which employs 'pasty' for formula syntax as said in their documentation (https://www.statsmodels.org/dev/example_formulas.html), it may be prudent for users to continue following the order of 'pasty' unless the developers of 'pasty' make changes accordingly.

oh, I just found out that 'patsy is no longer under active development. As of August 2021, Matthew Wardrop (@matthewwardrop) and Tomás Capretto (@tomicapretto) have taken on responsibility from Nathaniel Smith (@njsmith) for keeping the lights on, but no new feature development is planned.' at https://github.com/pydata/patsy

The best scenario would be that we persuade developers of 'statsmodel' to employ 'formulaic' and go with option 1. What do you think guys @bashtage @matthewwardrop?
Either going with option 1 or 2, It's a big enhancement. Thank you @matthewwardrop

bashtage · 2023-04-19T14:05:24Z

I agree that 1 makes sense. I think statsmodels will move to formulaic after a decent deprecation period using an opt in, and having something that resembles the formula closely would help a lot.

matthewwardrop · 2023-04-21T05:34:31Z

Thanks for the feedback @bashtage and @DongGilKim .

I thought about things a bit more, and figured I might as well add a few more knobs. Let me know if you think the following makes sense. If it all looks good, I'll add some spit and polish (and unit tests and docs) and merge shortly.

The two added knobs are _ordering on the Formula class, and cluster_by on the ModelSpec class. Here's an example along the lines of the above:

>>> from formulaic import Formula
>>> from pandas import DataFrame
>>> from patsy import dmatrix

>>> data = DataFrame({"x": [1,2,3], "y": [1,2,3], "A": ["a", "b", "c"]})
>>> data
	x	y	A
0	1	1	a
1	2	2	b
2	3	3	c

>>> f = Formula("y + x:A + x + y:A")
>>> f
1 + y + x + x:A + y:A

>>> Formula("y + x:A + x + y:A", _ordering="none")
1 + y + x:A + x + y:A

>>> Formula("y + x:A + x + y:A", _ordering="sort")
1 + x + y + A:x + A:y   # <-- this is Formulaic's old behaviour

>>> f.get_model_matrix(data)  # <- Equivalent to R output
   Intercept  y  x  x:A[T.b]  x:A[T.c]  y:A[T.b]  y:A[T.c]
0        1.0  1  1         0         0         0         0
1        1.0  2  2         2         0         2         0
2        1.0  3  3         0         3         0         3

>>> f.get_model_matrix(data, cluster_by="numerical_factors")  # <- Equivalent to Patsy, below
   Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
0        1.0  1         0         0  1         0         0
1        1.0  2         2         0  2         2         0
2        1.0  3         0         3  3         0         3

>>> dmatrix("y + x:A + x + y:A", data)
DesignMatrix with shape (3, 7)
  Intercept  y  y:A[T.b]  y:A[T.c]  x  x:A[T.b]  x:A[T.c]
          1  1         0         0  1         0         0
          1  2         2         0  2         2         0
          1  3         0         3  3         0         3
  Terms:
    'Intercept' (column 0)
    'y' (column 1)
    'y:A' (columns 2:4)
    'x' (column 4)
    'x:A' (columns 5:7)

Overkill? Other thoughts?

bashtage · 2023-04-21T07:39:15Z

I think the ordering is good since it will let me have an opt-in feature to change the order to the future behavior (_ordering="none") from the current. I'm not sure I would expose this to end users though since I think most expect/hope for something that mostly resembles the model spec they use.

I suppose the same is true for the second option, especially for statmodels which has used patsy for a long time. It would provide a ramp from patsy to something more R like over a couple of releases.

matthewwardrop · 2023-04-21T16:26:26Z

@bashtage, oh... interesting. That seems to suggest you feel the default ordering should be as input by the user ("none"), rather than, as in R, ordering by interaction degree ("degree"). In this PR, "degree" is the default (matching R). Another thing I perhaps didn't make clear is that the model matrix will always be materialized in the same order as the formula, so there's no disconnect there (except when clustering is enabled as above).

>>> f = Formula("y + x:A + x + y:A")  # <-- equivalent to adding `_ordering="degree"`, which is the default
>>> f
1 + y + x + x:A + y:A
>>> f.get_model_matrix(data)
   Intercept  y  x  x:A[T.b]  x:A[T.c]  y:A[T.b]  y:A[T.c]
0        1.0  1  1         0         0         0         0
1        1.0  2  2         2         0         2         0
2        1.0  3  3         0         3         0         3

>>> f2 = Formula("y + x:A + x + y:A", _ordering="none")
>>> f2
1 + y + x:A + x + y:A
>>> f.get_model_matrix(data)
   Intercept  y  x:A[T.a]  x:A[T.b]  x:A[T.c]  y:A[T.b]  y:A[T.c]
0        1.0  1         1         0         0         0         0
1        1.0  2         0         2         0         2         0
2        1.0  3         0         0         3         0         3

Note that since x:A came first here, x was omitted to preserve full rank.

I think this behaviour makes sense, including having the defaults align with R, but if you have any further concerns, let me know! I'll try to get this in shortly!

codecov · 2023-04-22T05:55:05Z

Codecov Report

Patch coverage: 100.00% and project coverage change: +0.08 🎉

Comparison is base (db3d59b) 99.91% compared to head (4275799) 100.00%.

Additional details and impacted files

@@             Coverage Diff             @@
##             main      #139      +/-   ##
===========================================
+ Coverage   99.91%   100.00%   +0.08%     
===========================================
  Files          47        48       +1     
  Lines        2485      2530      +45     
===========================================
+ Hits         2483      2530      +47     
+ Misses          2         0       -2

Flag	Coverage Δ
unittests	`100.00% <100.00%> (+0.08%)`	⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
formulaic/formula.py	`100.00% <100.00%> (ø)`
formulaic/materializers/__init__.py	`100.00% <100.00%> (ø)`
formulaic/materializers/base.py	`100.00% <100.00%> (+0.66%)`	⬆️
formulaic/materializers/pandas.py	`100.00% <100.00%> (ø)`
formulaic/materializers/types/__init__.py	`100.00% <100.00%> (ø)`
formulaic/materializers/types/enums.py	`100.00% <100.00%> (ø)`
formulaic/materializers/types/scoped_term.py	`100.00% <100.00%> (ø)`
formulaic/model_spec.py	`100.00% <100.00%> (ø)`
formulaic/parser/parser.py	`100.00% <100.00%> (ø)`
formulaic/parser/types/__init__.py	`100.00% <100.00%> (ø)`
... and 7 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

…ering` kwarg.

…ossible (matching patsy).

… R and patsy.

matthewwardrop · 2023-04-22T17:12:44Z

@bashtage @DongGilKim This is now being merged in. Feel free to try it out. If you find any issues, let me know!

matthewwardrop mentioned this pull request Apr 19, 2023

ENH: Preserve variable order as they appear in formulas #135

Closed

matthewwardrop added 8 commits April 22, 2023 09:19

Add new OrderedSet type.

578ddd6

Use OrderedSet to preserve order of terms.

36257fb

Formula: Allow controlling ordering of terms in formula via the `_ord…

817037d

…ering` kwarg.

Respect factor ordering during materialization.

e6fc750

Add ability to cluster by numerical factors during materialization.

f7863b4

Update rank reduction algorithm to respect term ordering as much as p…

0c5b959

…ossible (matching patsy).

Update unit tests to reflect new term ordering.

94f0336

Reverse iteration order of products of categorical variables to match…

75bd75d

… R and patsy.

matthewwardrop force-pushed the preserve_term_order branch from 0ce6a55 to 75bd75d Compare April 22, 2023 16:19

matthewwardrop added 2 commits April 22, 2023 09:41

Add new unit tests to restore coverage.

a6eb6b5

Update document to reflect new term ordering functionality.

4275799

matthewwardrop merged commit a7ecebd into main Apr 22, 2023

matthewwardrop deleted the preserve_term_order branch April 22, 2023 17:12

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserve term and factor order in formulae #139

Preserve term and factor order in formulae #139

matthewwardrop commented Apr 19, 2023 •

edited

Loading

DongGilKim commented Apr 19, 2023

bashtage commented Apr 19, 2023

matthewwardrop commented Apr 21, 2023 •

edited

Loading

bashtage commented Apr 21, 2023

matthewwardrop commented Apr 21, 2023 •

edited

Loading

codecov bot commented Apr 22, 2023 •

edited

Loading

matthewwardrop commented Apr 22, 2023

Preserve term and factor order in formulae #139

Preserve term and factor order in formulae #139

Conversation

matthewwardrop commented Apr 19, 2023 • edited Loading

DongGilKim commented Apr 19, 2023

bashtage commented Apr 19, 2023

matthewwardrop commented Apr 21, 2023 • edited Loading

bashtage commented Apr 21, 2023

matthewwardrop commented Apr 21, 2023 • edited Loading

codecov bot commented Apr 22, 2023 • edited Loading

Codecov Report

matthewwardrop commented Apr 22, 2023

matthewwardrop commented Apr 19, 2023 •

edited

Loading

matthewwardrop commented Apr 21, 2023 •

edited

Loading

matthewwardrop commented Apr 21, 2023 •

edited

Loading

codecov bot commented Apr 22, 2023 •

edited

Loading