The best way to learn ``R``-style formula syntax with ``ydot`` is to head on over to patsy :cite:`2020:patsy` and read the documentation. Below, we show very simple code to transform a Spark dataframe into two design matrices (these are also Spark dataframes), ``y`` and ``X``, using a formula that defines a model up to two-way interactions.

.. literalinclude:: _code/demo.py
   :language: python
   :linenos:

We use the following code to generate the example models (design matrices) discussed below.

.. literalinclude:: _code/demo-formulas.py
   :language: python
   :linenos:

You can use ``numpy`` functions against continuous variables.
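
As a minimal sketch of what applying a ``numpy`` function inside a formula looks like, the example below calls patsy directly on a small pandas dataframe rather than going through ``ydot`` and Spark. The dataframe and the column name ``x1`` are made up for illustration, and we assume the formula semantics carry over to ``ydot`` unchanged.

```python
import numpy as np
import pandas as pd
from patsy import dmatrix

# Illustrative data only; x1 is a hypothetical continuous variable.
df = pd.DataFrame({"x1": [1.0, 10.0, 100.0]})

# patsy evaluates np.log10(x1) as ordinary Python code,
# so `np` must be importable in the calling scope.
X = dmatrix("np.log10(x1)", df)

print(X.design_info.column_names)
# ['Intercept', 'np.log10(x1)']
print(np.asarray(X))
```

The transformed column is named after the expression itself, which makes it easy to trace a design-matrix column back to the formula.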

- The ``*`` operator specifies interactions and keeps lower order terms.
- The ``:`` operator specifies interactions and drops lower order terms.
- The ``/`` operator is quirky according to the patsy documentation, but ``a / b`` is shorthand for ``a + a:b``.

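
The three operators above can be summarized by the terms they expand to. The sketch below is plain Python and does not call ``ydot`` or patsy; the helper names ``cross``, ``interact``, and ``nest`` are hypothetical and only cover the simple case where both operands are single variables.

```python
def interact(a: str, b: str) -> list[str]:
    """a : b  ->  the interaction term only."""
    return [f"{a}:{b}"]


def cross(a: str, b: str) -> list[str]:
    """a * b  ->  a + b + a:b (lower order terms are kept)."""
    return [a, b] + interact(a, b)


def nest(a: str, b: str) -> list[str]:
    """a / b  ->  a + a:b (b only appears inside the interaction)."""
    return [a] + interact(a, b)


for symbol, expand in (("*", cross), (":", interact), ("/", nest)):
    print(f"a {symbol} b  ->  {' + '.join(expand('a', 'b'))}")
# a * b  ->  a + b + a:b
# a : b  ->  a:b
# a / b  ->  a + a:b
```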
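If you need to drop the ``Intercept``, add ``- 1`` at the end. Note that the dummy variable for ``a`` that is normally dropped as the reference level is kept in this case. This can look like a bug, but it is how patsy handles redundancy: without an intercept, the first categorical term is coded with a full set of indicator columns.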
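
The effect of ``- 1`` on the dummy variables can be checked with patsy directly. This is a small sketch on a pandas dataframe, not the ``ydot`` Spark workflow; the data and the level names ``left`` and ``right`` are invented for illustration.

```python
import pandas as pd
from patsy import dmatrix

# Illustrative data only: a is categorical, x1 is continuous.
df = pd.DataFrame({
    "a": ["left", "right", "left", "right"],
    "x1": [1.0, 2.0, 3.0, 4.0],
})

with_intercept = dmatrix("a + x1", df)
without_intercept = dmatrix("a + x1 - 1", df)

print(with_intercept.design_info.column_names)
# ['Intercept', 'a[T.right]', 'x1']
print(without_intercept.design_info.column_names)
# ['a[left]', 'a[right]', 'x1']
```

With the intercept, one level of ``a`` is absorbed as the reference; without it, every level gets its own column, so both design matrices have the same number of columns.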