Merge pull request #107 from posit-dev/feat-selectors

Feat selectors
posit-dev · Jan 5, 2024 · eb24086
2 parents 541b420 + 4f85424
Showing 12 changed files with 239 additions and 29 deletions.
diff --git a/docs/_quarto.yml b/docs/_quarto.yml
@@ -38,6 +38,7 @@ website:
         - section: Extra Topics
           contents:
             - get-started/column-selection.qmd
+            - get-started/row-selection.qmd
 
 format:
   html:

diff --git a/docs/get-started/basic-styling.qmd b/docs/get-started/basic-styling.qmd
@@ -136,8 +136,30 @@ gt_pl_air.tab_style(
 )
 ```
 
+
+### Using functions
+
+You can also use a function that takes the DataFrame and returns a Series with a style value for each row.
+
+This is shown below on a pandas DataFrame.
+
+```{python}
+def map_color(df):
+    return (df["Temp"] > 70).map(
+        {True: "lightyellow", False: "lightblue"}
+    )
+
+gt_air.tab_style(
+    style=style.fill(
+        color=map_color),
+    locations=loc.body("Temp")
+)
+```
+
 ## Specifying columns and rows
 
+### Using polars selectors
+
 If you are using **Polars**, you can use column selectors and expressions for selecting specific columns and rows:
 
 ```{python}
@@ -154,6 +176,21 @@ gt_pl_air.tab_style(
 
 See [Column Selection](./column-selection.qmd) for details on selecting columns.
 
+### Using a function
+
+For tools like **pandas**, you can use a function (or lambda) to select rows. The function should take a DataFrame, and output a boolean Series.
+
+```{python}
+gt_air.tab_style(
+    style=style.fill(color="yellow"),
+    locations=loc.body(
+        columns=lambda col_name: col_name.startswith("Te"),
+        rows=lambda D: D["Temp"] > 70,
+    )
+)
+```
+
+
 ## Multiple styles and locations
 
 We can use a list within `style=` to apply multiple styles at once. For example, the code below sets fill and border styles on the same set of body cells.

diff --git a/docs/get-started/column-selection.qmd b/docs/get-started/column-selection.qmd
@@ -4,7 +4,9 @@ jupyter: python3
 html-table-processing: none
 ---
 
-The `columns=` argument for methods like [`tab_spanner()`](`great_tables.GT.tab_spanner`) and [`cols_move()`](`great_tables.GT.cols_move`) can accept a range of arguments. In the previous examples, we just passed a list of strings with the exact column names. However, we can specify columns using any of the following:
+The `columns=` argument for methods like [`GT.tab_spanner()`](`great_tables.GT.tab_spanner`), [`GT.cols_move()`](`great_tables.GT.cols_move`), and [`GT.tab_style()`](`great_tables.GT.tab_style`) allows a range of options for selecting columns.
+
+The simplest approach is just a list of strings with the exact column names. However, we can specify columns using any of the following:
 
 * a single string column name.
 * an integer for the column's position.
@@ -16,12 +18,13 @@ The `columns=` argument for methods like [`tab_spanner()`](`great_tables.GT.tab_
 from great_tables import GT
 from great_tables.data import exibble
 
-gt_ex = GT(exibble)
+lil_exibble = exibble[["num", "char", "fctr", "date", "time"]].head(4)
+gt_ex = GT(lil_exibble)
 
 gt_ex
 ```
 
-## String and Integer Selectors
+## Using integers
 
 We can use a list of strings or integers to select columns by name or position, respectively.
 
@@ -37,23 +40,15 @@ Note the code above moved the following columns:
 
 Moreover, the order of the list defines the order of selected columns. In this case, `"data"` was the first entry, so it's the very first column in the new table.
 
-## Using Function Selectors
-
-A function can be used to select columns. It should take a string and returns `True` or `False`.
-
-```{python}
-gt_ex.cols_move_to_start(columns=lambda x: "c" in x)
-```
-
-## **Polars** Selectors
+## Using **Polars** selectors
 
 When using a **Polars** DataFrame, you can select columns using [**Polars** selectors](https://pola-rs.github.io/polars/py-polars/html/reference/selectors.html). The example below uses **Polars** selectors to move all columns that start with `"c"` or `"f"` to the start of the table.
 
 ```{python}
 import polars as pl
 import polars.selectors as cs
 
-pl_df = pl.from_pandas(exibble)
+pl_df = pl.from_pandas(lil_exibble)
 
 GT(pl_df).cols_move_to_start(columns=cs.starts_with("c") | cs.starts_with("f"))
 ```
@@ -67,3 +62,10 @@ pl_df.select(cs.starts_with("c") | cs.starts_with("f")).columns
 See the [Selectors page in the polars docs](https://pola-rs.github.io/polars/py-polars/html/reference/selectors.html) for more information on this.
 
 
+## Using functions
+
+A function can be used to select columns. It should take a column name as a string and return `True` or `False`.
+
+```{python}
+gt_ex.cols_move_to_start(columns=lambda x: "c" in x)
+```
diff --git a/docs/get-started/row-selection.qmd b/docs/get-started/row-selection.qmd
@@ -0,0 +1,94 @@
+---
+title: Row Selection
+jupyter: python3
+html-table-processing: none
+---
+
+Location and formatter functions (e.g. [`loc.body()`](`great_tables.loc.body`) and [`GT.fmt_number()`](`great_tables.GT.fmt_number`)) can be applied to specific rows, using the `rows=` argument.
+
+Rows may be specified using any of the following:
+
+* None (the default), to select everything.
+* an integer for the row's position.
+* a list of integers.
+* a **Polars** expression for filtering.
+* a function that takes a DataFrame and returns a boolean Series.
+
+The following sections will use a subset of the `exibble` data, to demonstrate these options.
+
+```{python}
+from great_tables import GT, exibble, loc, style
+
+lil_exibble = exibble[["num", "char", "currency"]].head(3)
+gt_ex = GT(lil_exibble)
+```
+
+## Using integers
+
+Use a single integer, or a list of integers, to select rows by position.
+
+```{python}
+gt_ex.fmt_currency("currency", rows=0, decimals=1)
+```
+
+Notice that a dollar sign (`$`) was only added to the first row (index `0` in python).
+
+Indexing works the same as selecting items from a python list. This means negative integers select relative to the final row.
+
+```{python}
+gt_ex.fmt_currency("currency", rows=[0, -1], decimals=1)
+```
+
+
+## Using polars expressions
+
+The `rows=` argument accepts polars expressions, which return a boolean Series, indicating which rows to operate on.
+
+For example, the code below formats the `num` column, but only in rows where `currency` is less than 40.
+
+```{python}
+import polars as pl
+
+gt_polars = GT(pl.from_pandas(lil_exibble))
+
+gt_polars.fmt_integer("num", rows=pl.col("currency") < 40)
+```
+
+Here's a more realistic example, which highlights the row with the highest value for currency.
+
+```{python}
+import polars.selectors as cs
+
+gt_polars.tab_style(
+    style.fill("yellow"),
+    loc.body(
+        columns=cs.all(),
+        rows=pl.col("currency") == pl.col("currency").max()
+    )
+)
+```
+
+
+## Using a function
+
+Since libraries like `pandas` don't have lazy expressions, the `rows=` argument also accepts a function for selecting rows. The function should take a DataFrame and return a boolean Series.
+
+Here's the same example as the previous polars section, but with pandas data, and a lambda for selecting rows.
+
+```{python}
+gt_ex.fmt_integer("num", rows=lambda D: D["currency"] < 40)
+```
+
+Here's the styling example from the previous polars section.
+
+```{python}
+import polars.selectors as cs
+
+gt_ex.tab_style(
+    style.fill("yellow"),
+    loc.body(
+        columns=lambda colname: True,
+        rows=lambda D: D["currency"] == D["currency"].max()
+    )
+)
+```
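Note on the integer semantics described in `row-selection.qmd`: the page says integer row selection follows Python list indexing, including negative positions. The stdlib-only sketch below illustrates that rule. `normalize_row_positions` is a hypothetical helper, not part of great_tables, and it does not reproduce the row names or ordering guarantees of `resolve_rows_i`.

```python
from typing import List, Union


def normalize_row_positions(rows: Union[int, List[int], None], n_rows: int) -> List[int]:
    """Hypothetical helper: map None, an int, or a list of ints to row positions."""
    if rows is None:
        # None selects every row
        return list(range(n_rows))
    if isinstance(rows, int):
        rows = [rows]
    # Indexing a range applies the same negative-index rule as a Python list,
    # and raises IndexError for out-of-range positions.
    return [range(n_rows)[r] for r in rows]


everything = normalize_row_positions(None, 3)
first_only = normalize_row_positions(0, 3)
first_and_last = normalize_row_positions([0, -1], 3)

print(everything)      # [0, 1, 2]
print(first_only)      # [0]
print(first_and_last)  # [0, 2]
```

With three rows, `-1` resolves to position `2`, which matches the `rows=[0, -1]` example in the page.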
diff --git a/great_tables/_formats.py b/great_tables/_formats.py
@@ -2,9 +2,10 @@
 from decimal import Decimal
 from typing import TYPE_CHECKING, Any, Callable, TypeVar, Union, List, cast, Optional, Dict, Literal
 from typing_extensions import TypeAlias
-from ._tbl_data import n_rows
+from ._tbl_data import PlExpr, n_rows
 from ._gt_data import GTData, FormatFns, FormatFn, FormatInfo
 from ._locale import _get_locales_data, _get_default_locales_data, _get_currencies_data
+from ._locations import resolve_rows_i
 from ._text import _md_html
 from ._utils import _str_detect, _str_replace
 import pandas as pd
@@ -86,12 +87,10 @@ def fmt(
 
     columns = _listify(columns, list)
 
-    if rows is None:
-        rows = list(range(n_rows(self._tbl_data)))
-    elif isinstance(rows, int):
-        rows = [rows]
+    row_res = resolve_rows_i(self, rows)
+    row_pos = [name_pos[1] for name_pos in row_res]
 
-    formatter = FormatInfo(fns, columns, rows)
+    formatter = FormatInfo(fns, columns, row_pos)
     return self._replace(_formats=[*self._formats, formatter])
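Note on the `fmt()` change: the new code keeps only the position from each `(name, position)` pair returned by `resolve_rows_i` and hands those positions to `FormatInfo`. The sketch below shows that extraction step and what "format only these rows" means, using plain lists. The pair values, the `apply_to_rows` helper, and the sample numbers are all made up for illustration; this is not how great_tables renders cells.

```python
from typing import Callable, List, Tuple

# Made-up stand-in for the (row name, row position) pairs in the diff
row_res: List[Tuple[str, int]] = [("row_a", 0), ("row_c", 2)]

# Same comprehension as in the diff: keep only the position of each pair
row_pos = [name_pos[1] for name_pos in row_res]


def apply_to_rows(values: List[float], positions: List[int], fn: Callable[[float], str]) -> List[str]:
    """Hypothetical helper: apply fn to the selected positions, stringify the rest."""
    selected = set(positions)
    return [fn(v) if i in selected else str(v) for i, v in enumerate(values)]


formatted = apply_to_rows([12.5, 3.0, 7.25], row_pos, lambda v: f"${v:.2f}")
print(row_pos)    # [0, 2]
print(formatted)  # ['$12.50', '3.0', '$7.25']
```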
 
 

diff --git a/great_tables/_locations.py b/great_tables/_locations.py
@@ -4,25 +4,25 @@
 
 from dataclasses import dataclass
 from functools import singledispatch
-from typing import TYPE_CHECKING, Literal, List, Callable
+from typing import TYPE_CHECKING, Literal, List, Callable, Union
 from typing_extensions import TypeAlias
 
 # note that types like Spanners are only used in annotations for concretes of the
 # resolve generic, but we need to import at runtime, due to singledispatch looking
 # up annotations
 from ._gt_data import GTData, FootnoteInfo, Spanners, ColInfoTypeEnum, StyleInfo, FootnotePlacement
-from ._tbl_data import eval_select, PlExpr
+from ._tbl_data import eval_select, eval_transform, PlExpr
 from ._styles import CellStyle
 
 
 if TYPE_CHECKING:
     from ._gt_data import TblData
-    from ._tbl_data import PlSelectExpr
+    from ._tbl_data import SelectExpr
 
 # Misc Types ===========================================================================
 
 PlacementOptions: TypeAlias = Literal["auto", "left", "right"]
-SelectExpr: TypeAlias = "list[str | int] | PlSelectExpr | str | int | Callable[[str], bool] | None"
+RowSelectExpr: TypeAlias = Union[List[int], PlExpr, Callable[["TblData"], bool], None]
 
 # Locations ============================================================================
 # TODO: these are called cells_* in gt. I prefixed them with Loc just to keep things
@@ -106,7 +106,7 @@ class LocBody(Loc):
         A LocBody object, which is used for a `locations` argument if specifying the table body.
     """
     columns: SelectExpr = None
-    rows: list[str] | str | None = None
+    rows: RowSelectExpr = None
 
 
 @dataclass
@@ -270,7 +270,7 @@ def resolve_cols_i(
 
 def resolve_rows_i(
     data: GTData | list[str],
-    expr: list[str | int] | None = None,
+    expr: RowSelectExpr = None,
     null_means: Literal["everything", "nothing"] = "everything",
 ) -> list[tuple[str, int]]:
     """Return matching row numbers, based on expr
@@ -283,6 +283,9 @@ def resolve_rows_i(
     the order they appear in the data (rather than ordered by selectors).
     """
 
+    if isinstance(expr, (str, int)):
+        expr: List["str | int"] = [expr]
+
     if isinstance(data, GTData):
         if expr is None:
             if null_means == "everything":
@@ -312,12 +315,25 @@ def resolve_rows_i(
         result = data._tbl_data.with_row_count(name="__row_number__").filter(expr)
         # print([(row_names[ii], ii) for ii in result["__row_number__"]])
         return [(row_names[ii], ii) for ii in result["__row_number__"]]
+    elif callable(expr):
+        res: "list[bool]" = eval_transform(data._tbl_data, expr)
+        if not all(map(lambda x: isinstance(x, bool), res)):
+            raise ValueError(
+                "If you select rows using a callable, it must take a DataFrame, "
+                "and return a boolean Series."
+            )
+        return [(row_names[ii], ii) for ii, val in enumerate(res) if val]
 
     # TODO: identify filter-like selectors using some backend check
     # e.g. if it's a siuba expression vs tidyselect expression, etc..
     # TODO: how would this be handled with something like polars? May need a predicate
     # function, similar in spirit to where()?
-    raise NotImplementedError("Currently, rows can only be selected via a list of strings")
+    raise NotImplementedError(
+        "Currently, rows can only be selected using these approaches:\n\n"
+        "  * a list of integers\n"
+        "  * a polars expression\n"
+        "  * a callable that takes a DataFrame and returns a boolean Series"
+    )
 
 
 # Resolve generic ======================================================================
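Note on the new `callable(expr)` branch: it evaluates the function against the table, checks that every returned value is a `bool`, and keeps the rows whose value is true. Below is a minimal stdlib-only sketch of that flow, assuming a table represented as a list of dicts. `rows_from_callable`, the row names, and the sample data are hypothetical; the real branch goes through `eval_transform` on a DataFrame.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Table = List[Dict[str, Any]]


def rows_from_callable(
    table: Table, row_names: Sequence[str], fn: Callable[[Table], Sequence[Any]]
) -> List[Tuple[str, int]]:
    """Hypothetical helper mirroring the validate-then-filter steps in the diff."""
    res = list(fn(table))
    if not all(isinstance(x, bool) for x in res):
        raise ValueError("The callable must return one boolean per row.")
    return [(row_names[ii], ii) for ii, val in enumerate(res) if val]


# Made-up data standing in for a small DataFrame
table = [{"currency": 50.0}, {"currency": 18.0}, {"currency": 1.5}]
names = ["r0", "r1", "r2"]

below_40 = rows_from_callable(table, names, lambda D: [r["currency"] < 40 for r in D])
print(below_40)  # [('r1', 1), ('r2', 2)]

try:
    # Returns floats instead of booleans, so validation fails
    rows_from_callable(table, names, lambda D: [r["currency"] for r in D])
except ValueError as err:
    print(f"rejected: {err}")
```

Validating before filtering means a function that returns the wrong kind of values fails loudly instead of selecting rows by truthiness.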

diff --git a/great_tables/_spanners.py b/great_tables/_spanners.py
@@ -5,7 +5,8 @@
 from typing import TYPE_CHECKING, Union, List, Dict, Optional, Any
 
 from ._gt_data import Spanners, SpannerInfo
-from ._locations import SelectExpr, resolve_cols_c
+from ._tbl_data import SelectExpr
+from ._locations import resolve_cols_c
 
 if TYPE_CHECKING:
     from ._gt_data import Boxhead

diff --git a/great_tables/_styles.py b/great_tables/_styles.py
@@ -74,7 +74,7 @@ def _evaluate_expressions(self, data: TblData) -> Self:
         new_fields: dict[str, FromValues] = {}
         for field in fields(self):
             attr = getattr(self, field.name)
-            if isinstance(attr, PlExpr):
+            if isinstance(attr, PlExpr) or callable(attr):
                 col_res = eval_transform(data, attr)
                 new_fields[field.name] = FromValues(expr=attr, values=col_res)
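Note on the `_evaluate_expressions` change: a style field that is callable is now evaluated against the data to produce one value per row, the same path Polars expressions already took. The sketch below illustrates the "constant or per-row function" idea with plain Python. `resolve_style_value` and the list-of-dicts table are assumptions for illustration; the library itself stores the result in `FromValues` rather than returning a list.

```python
from typing import Any, Callable, Dict, List, Union

Table = List[Dict[str, Any]]


def resolve_style_value(value: Union[str, Callable[[Table], List[str]]], table: Table) -> List[str]:
    """Hypothetical helper: expand a style value to one entry per row."""
    if callable(value):
        # A function receives the whole table and returns a value for each row
        return list(value(table))
    # A constant applies to every row unchanged
    return [value] * len(table)


def map_color(rows: Table) -> List[str]:
    # List-based analogue of the map_color example in basic-styling.qmd
    return ["lightyellow" if r["Temp"] > 70 else "lightblue" for r in rows]


table = [{"Temp": 67}, {"Temp": 72}, {"Temp": 74}]  # made-up values

per_row = resolve_style_value(map_color, table)
constant = resolve_style_value("yellow", table)

print(per_row)   # ['lightblue', 'lightyellow', 'lightyellow']
print(constant)  # ['yellow', 'yellow', 'yellow']
```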
 

diff --git a/great_tables/_tbl_data.py b/great_tables/_tbl_data.py
@@ -234,11 +234,19 @@ def _(data: PlDataFrame, group_key: str) -> Dict[Any, List[int]]:
 
 # eval_select ----
 
+SelectExpr: TypeAlias = Union[
+    List["str | int"],
+    PlSelectExpr,
+    str,
+    int,
+    Callable[[str], bool],
+    None,
+]
 _NamePos: TypeAlias = List[Tuple[str, int]]
 
 
 @singledispatch
-def eval_select(data: DataFrameLike, expr: Any, strict: bool = True) -> _NamePos:
+def eval_select(data: DataFrameLike, expr: SelectExpr, strict: bool = True) -> _NamePos:
     """Return a list of column names selected by expr."""
 
     raise NotImplementedError(f"Unsupported type: {type(expr)}")
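Note on `eval_select` and `SelectExpr`: the generic function raises `NotImplementedError` by default, and concrete behavior is registered per data type through `functools.singledispatch`. The sketch below shows that pattern with a `dict` of columns standing in for a DataFrame. `eval_select_sketch` and its `dict` registration are hypothetical and only approximate the selector forms listed in `SelectExpr` (Polars selectors are omitted).

```python
from functools import singledispatch
from typing import Any, List, Tuple


@singledispatch
def eval_select_sketch(data: Any, expr: Any) -> List[Tuple[str, int]]:
    """Fallback for data types with no registered implementation."""
    raise NotImplementedError(f"Unsupported type: {type(data)}")


@eval_select_sketch.register(dict)
def _(data: dict, expr: Any) -> List[Tuple[str, int]]:
    names = list(data)
    if expr is None:
        return [(name, pos) for pos, name in enumerate(names)]
    if callable(expr):
        # Predicate receives each column name as a string
        return [(name, pos) for pos, name in enumerate(names) if expr(name)]
    if isinstance(expr, (str, int)):
        expr = [expr]
    result = []
    for item in expr:
        # Strings match by name; integers index by position (negatives allowed)
        pos = names.index(item) if isinstance(item, str) else range(len(names))[item]
        result.append((names[pos], pos))
    return result


columns = {"num": [], "char": [], "date": []}  # made-up column layout

by_list = eval_select_sketch(columns, ["date", 0])
by_negative = eval_select_sketch(columns, -1)
by_function = eval_select_sketch(columns, lambda x: "c" in x)

print(by_list)      # [('date', 2), ('num', 0)]
print(by_negative)  # [('date', 2)]
print(by_function)  # [('char', 1)]
```

Dispatching on the data argument keeps backend-specific logic (pandas, Polars) in separate registered functions, while the shared type alias documents which selector forms callers may pass.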