
Commit

Merge pull request #107 from posit-dev/feat-selectors
Feat selectors
machow committed Jan 5, 2024
2 parents 541b420 + 4f85424 commit eb24086
Showing 12 changed files with 239 additions and 29 deletions.
1 change: 1 addition & 0 deletions docs/_quarto.yml
Original file line number Diff line number Diff line change
@@ -38,6 +38,7 @@ website:
- section: Extra Topics
contents:
- get-started/column-selection.qmd
- get-started/row-selection.qmd

format:
html:
37 changes: 37 additions & 0 deletions docs/get-started/basic-styling.qmd
@@ -136,8 +136,30 @@ gt_pl_air.tab_style(
)
```


### Using functions

You can also use a function that takes the DataFrame and returns a Series with a style value for each row.

This is shown below on a pandas DataFrame.

```{python}
def map_color(df):
    return (df["Temp"] > 70).map(
        {True: "lightyellow", False: "lightblue"}
    )

gt_air.tab_style(
    style=style.fill(
        color=map_color),
    locations=loc.body("Temp")
)
```
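
To make the idea concrete, here is a minimal sketch of how a style value given as a function could be expanded into one value per row. It is not the Great Tables implementation: `Fill`, `evaluate_style`, and the list-of-dicts table are hypothetical stand-ins, chosen so that the example runs without pandas.

```python
from dataclasses import dataclass, fields
from typing import Callable, Dict, List, Union

Row = Dict[str, float]


@dataclass
class Fill:
    """Hypothetical stand-in for a style object such as style.fill()."""

    color: Union[str, Callable[[List[Row]], List[str]]]


def evaluate_style(style: Fill, rows: List[Row]) -> Dict[str, List[str]]:
    """Expand every style field into one value per row (illustrative only)."""
    resolved: Dict[str, List[str]] = {}
    for field in fields(style):
        attr = getattr(style, field.name)
        if callable(attr):
            # A function receives the whole table and returns per-row values.
            values = list(attr(rows))
        else:
            # A constant is simply repeated for every row.
            values = [attr] * len(rows)
        resolved[field.name] = values
    return resolved


def map_color(rows: List[Row]) -> List[str]:
    # Same rule as the pandas example: warm rows are yellow, others blue.
    return ["lightyellow" if row["Temp"] > 70 else "lightblue" for row in rows]


air = [{"Temp": 67}, {"Temp": 72}, {"Temp": 74}]  # made-up sample rows
colors = evaluate_style(Fill(color=map_color), air)["color"]
```

The point of the sketch is only that a function-valued style yields one value per row, so each body cell can receive its own fill.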

## Specifying columns and rows

### Using polars selectors

If you are using **Polars**, you can use column selectors and expressions for selecting specific columns and rows:

```{python}
@@ -154,6 +176,21 @@ gt_pl_air.tab_style(

See [Column Selection](./column-selection.qmd) for details on selecting columns.

### Using a function

For tools like **pandas**, you can use a function (or lambda) to select rows. The function should take a DataFrame, and output a boolean Series.

```{python}
gt_air.tab_style(
    style=style.fill(color="yellow"),
    locations=loc.body(
        columns=lambda col_name: col_name.startswith("Te"),
        rows=lambda D: D["Temp"] > 70,
    )
)
```
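
As a rough illustration of how a column function and a row function combine, the sketch below resolves both selectors and returns the matching body cells. Everything in it is an assumption made for the example: `resolve_cells` is a hypothetical helper, and the table is a plain dict of lists rather than a DataFrame.

```python
from typing import Callable, Dict, List, Tuple

Table = Dict[str, List[float]]  # column name -> column values


def resolve_cells(
    table: Table,
    columns: Callable[[str], bool],
    rows: Callable[[Table], List[bool]],
) -> List[Tuple[int, str]]:
    """Return (row position, column name) pairs matched by both selectors."""
    # The column function is called once per column name.
    col_names = [name for name in table if columns(name)]
    # The row function is called once with the whole table.
    row_mask = rows(table)
    row_pos = [i for i, keep in enumerate(row_mask) if keep]
    return [(i, name) for i in row_pos for name in col_names]


air = {"Ozone": [41.0, 36.0, 12.0], "Temp": [67.0, 72.0, 74.0]}  # made-up values
cells = resolve_cells(
    air,
    columns=lambda col_name: col_name.startswith("Te"),
    rows=lambda D: [temp > 70 for temp in D["Temp"]],
)
```

Note the asymmetry: the column selector sees one name at a time, while the row selector sees the full table and returns one boolean per row.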


## Multiple styles and locations

We can use a list within `style=` to apply multiple styles at once. For example, the code below sets fill and border styles on the same set of body cells.
28 changes: 15 additions & 13 deletions docs/get-started/column-selection.qmd
@@ -4,7 +4,9 @@ jupyter: python3
html-table-processing: none
---

The `columns=` argument for methods like [`tab_spanner()`](`great_tables.GT.tab_spanner`) and [`cols_move()`](`great_tables.GT.cols_move`) can accept a range of arguments. In the previous examples, we just passed a list of strings with the exact column names. However, we can specify columns using any of the following:
The `columns=` argument for methods like [`GT.tab_spanner()`](`great_tables.GT.tab_spanner`), [`GT.cols_move()`](`great_tables.GT.cols_move`), and [`GT.tab_style`](`great_tables.GT.tab_style`) allows a range of options for selecting columns.

The simplest approach is just a list of strings with the exact column names. However, we can specify columns using any of the following:

* a single string column name.
* an integer for the column's position.
@@ -16,12 +18,13 @@ The `columns=` argument for methods like [`tab_spanner()`](`great_tables.GT.tab_
from great_tables import GT
from great_tables.data import exibble
gt_ex = GT(exibble)
lil_exibble = exibble[["num", "char", "fctr", "date", "time"]].head(4)
gt_ex = GT(lil_exibble)
gt_ex
```

## String and Integer Selectors
## Using integers

We can use a list of strings or integers to select columns by name or position, respectively.

@@ -37,23 +40,15 @@ Note the code above moved the following columns:

Moreover, the order of the list defines the order of selected columns. In this case, `"date"` was the first entry, so it's the very first column in the new table.

## Using Function Selectors

A function can be used to select columns. It should take a string and returns `True` or `False`.

```{python}
gt_ex.cols_move_to_start(columns=lambda x: "c" in x)
```

## **Polars** Selectors
## Using **Polars** selectors

When using a **Polars** DataFrame, you can select columns using [**Polars** selectors](https://pola-rs.github.io/polars/py-polars/html/reference/selectors.html). The example below uses **Polars** selectors to move all columns that start with `"c"` or `"f"` to the start of the table.

```{python}
import polars as pl
import polars.selectors as cs
pl_df = pl.from_pandas(exibble)
pl_df = pl.from_pandas(lil_exibble)
GT(pl_df).cols_move_to_start(columns=cs.starts_with("c") | cs.starts_with("f"))
```
@@ -67,3 +62,10 @@ pl_df.select(cs.starts_with("c") | cs.starts_with("f")).columns
See the [Selectors page in the polars docs](https://pola-rs.github.io/polars/py-polars/html/reference/selectors.html) for more information on this.


## Using functions

A function can be used to select columns. It should take a column name as a string and return `True` or `False`.

```{python}
gt_ex.cols_move_to_start(columns=lambda x: "c" in x)
```
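
The sketch below shows one way a column function could be applied: it is called once per column name, and the names it accepts are moved to the front. `select_columns` and `move_to_start` are hypothetical helpers written for this illustration, not functions from Great Tables.

```python
from typing import Callable, List, Tuple


def select_columns(
    names: List[str], predicate: Callable[[str], bool]
) -> List[Tuple[str, int]]:
    """Return (name, position) pairs for every column the predicate accepts."""
    return [(name, pos) for pos, name in enumerate(names) if predicate(name)]


def move_to_start(names: List[str], predicate: Callable[[str], bool]) -> List[str]:
    """Put the selected columns first, keeping their relative order."""
    selected = [name for name, _ in select_columns(names, predicate)]
    rest = [name for name in names if name not in selected]
    return selected + rest


columns = ["num", "char", "fctr", "date", "time"]
selected = select_columns(columns, lambda x: "c" in x)
new_order = move_to_start(columns, lambda x: "c" in x)
```

With the column names used on this page, the predicate `"c" in x` accepts `char` and `fctr`, so those two columns lead the new order.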
94 changes: 94 additions & 0 deletions docs/get-started/row-selection.qmd
@@ -0,0 +1,94 @@
---
title: Row Selection
jupyter: python3
html-table-processing: none
---

Location and formatter functions (e.g. [`loc.body()`](`great_tables.loc.body`) and [`GT.fmt_number()`](`great_tables.GT.fmt_number`)) can be applied to specific rows, using the `rows=` argument.

Rows may be specified using any of the following:

* None (the default), to select everything.
* an integer for the row's position.
* a list of integers.
* a **Polars** expression for filtering.
* a function that takes a DataFrame and returns a boolean Series.

The following sections will use a subset of the `exibble` data, to demonstrate these options.

```{python}
from great_tables import GT, exibble, loc, style
lil_exibble = exibble[["num", "char", "currency"]].head(3)
gt_ex = GT(lil_exibble)
```

## Using integers

Use a single integer, or a list of integers, to select rows by position.

```{python}
gt_ex.fmt_currency("currency", rows=0, decimals=1)
```

Notice that a dollar sign (`$`) was only added to the first row (index `0` in Python).

Indexing works the same as selecting items from a Python list. This means negative integers select relative to the final row.

```{python}
gt_ex.fmt_currency("currency", rows=[0, -1], decimals=1)
```
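
The list-style indexing described above can be sketched as follows. `normalize_rows` is a hypothetical helper, not part of Great Tables; it only illustrates how `None`, a single integer, and negative integers could map onto row positions.

```python
from typing import List, Optional, Union


def normalize_rows(rows: Optional[Union[int, List[int]]], n_rows: int) -> List[int]:
    """Convert a row selection into non-negative row positions."""
    if rows is None:
        # None selects every row.
        return list(range(n_rows))
    if isinstance(rows, int):
        # A single integer is treated as a one-element list.
        rows = [rows]
    positions = []
    for row in rows:
        # Negative integers count back from the final row, as in a Python list.
        pos = row + n_rows if row < 0 else row
        if not 0 <= pos < n_rows:
            raise IndexError(f"row {row} is out of range for {n_rows} rows")
        positions.append(pos)
    return positions


first_and_last = normalize_rows([0, -1], n_rows=3)
```

For the three-row table used here, `[0, -1]` refers to the first and third rows. This sketch keeps positions in the order given, which may differ from how the library orders its results.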


## Using polars expressions

The `rows=` argument accepts polars expressions, which return a boolean Series, indicating which rows to operate on.

For example, the code below formats the `num` column, but only in rows where `currency` is less than 40.

```{python}
import polars as pl
gt_polars = GT(pl.from_pandas(lil_exibble))
gt_polars.fmt_integer("num", rows=pl.col("currency") < 40)
```

Here's a more realistic example, which highlights the row with the highest value for currency.

```{python}
import polars.selectors as cs
gt_polars.tab_style(
    style.fill("yellow"),
    loc.body(
        columns=cs.all(),
        rows=pl.col("currency") == pl.col("currency").max()
    )
)
```


## Using a function

Since libraries like `pandas` don't have lazy expressions, the `rows=` argument also accepts a function for selecting rows. The function should take a DataFrame and return a boolean Series.

Here's the same example as the previous polars section, but with pandas data and a lambda for selecting rows.

```{python}
gt_ex.fmt_integer("num", rows=lambda D: D["currency"] < 40)
```
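
A simplified view of how a row function might be resolved is sketched below. It loosely follows the callable branch added to `resolve_rows_i` in this commit (call the function, check that every result is a boolean, keep the positions that are true), but `resolve_rows_with_function`, the dict-of-lists table, and the sample values are assumptions made so the example runs without pandas.

```python
from typing import Callable, Dict, List, Tuple

Table = Dict[str, List[float]]  # column name -> column values


def resolve_rows_with_function(
    table: Table,
    row_names: List[str],
    expr: Callable[[Table], List[bool]],
) -> List[Tuple[str, int]]:
    """Return (row name, position) pairs for rows where expr is True."""
    res = list(expr(table))
    # Reject anything that is not a plain boolean per row.
    if not all(isinstance(x, bool) for x in res):
        raise ValueError(
            "If you select rows using a callable, it must take a table "
            "and return one boolean per row."
        )
    return [(row_names[ii], ii) for ii, val in enumerate(res) if val]


table = {"num": [0.1, 2.2, 33.3], "currency": [49.95, 17.95, 1.39]}  # illustrative
row_names = ["row_0", "row_1", "row_2"]
matched = resolve_rows_with_function(
    table, row_names, lambda D: [value < 40 for value in D["currency"]]
)
```

Here the second and third rows have a currency below 40, so only those two positions are selected for formatting.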

Here's the styling example from the previous polars section.

```{python}
import polars.selectors as cs
gt_ex.tab_style(
    style.fill("yellow"),
    loc.body(
        columns=lambda colname: True,
        rows=lambda D: D["currency"] == D["currency"].max()
    )
)
```
11 changes: 5 additions & 6 deletions great_tables/_formats.py
@@ -2,9 +2,10 @@
from decimal import Decimal
from typing import TYPE_CHECKING, Any, Callable, TypeVar, Union, List, cast, Optional, Dict, Literal
from typing_extensions import TypeAlias
from ._tbl_data import n_rows
from ._tbl_data import PlExpr, n_rows
from ._gt_data import GTData, FormatFns, FormatFn, FormatInfo
from ._locale import _get_locales_data, _get_default_locales_data, _get_currencies_data
from ._locations import resolve_rows_i
from ._text import _md_html
from ._utils import _str_detect, _str_replace
import pandas as pd
@@ -86,12 +87,10 @@ def fmt(

    columns = _listify(columns, list)

    if rows is None:
        rows = list(range(n_rows(self._tbl_data)))
    elif isinstance(rows, int):
        rows = [rows]
    row_res = resolve_rows_i(self, rows)
    row_pos = [name_pos[1] for name_pos in row_res]

    formatter = FormatInfo(fns, columns, rows)
    formatter = FormatInfo(fns, columns, row_pos)
    return self._replace(_formats=[*self._formats, formatter])


30 changes: 23 additions & 7 deletions great_tables/_locations.py
@@ -4,25 +4,25 @@

from dataclasses import dataclass
from functools import singledispatch
from typing import TYPE_CHECKING, Literal, List, Callable
from typing import TYPE_CHECKING, Literal, List, Callable, Union
from typing_extensions import TypeAlias

# note that types like Spanners are only used in annotations for concretes of the
# resolve generic, but we need to import at runtime, due to singledispatch looking
# up annotations
from ._gt_data import GTData, FootnoteInfo, Spanners, ColInfoTypeEnum, StyleInfo, FootnotePlacement
from ._tbl_data import eval_select, PlExpr
from ._tbl_data import eval_select, eval_transform, PlExpr
from ._styles import CellStyle


if TYPE_CHECKING:
    from ._gt_data import TblData
    from ._tbl_data import PlSelectExpr
    from ._tbl_data import SelectExpr

# Misc Types ===========================================================================

PlacementOptions: TypeAlias = Literal["auto", "left", "right"]
SelectExpr: TypeAlias = "list[str | int] | PlSelectExpr | str | int | Callable[[str], bool] | None"
RowSelectExpr: TypeAlias = Union[List[int], PlExpr, Callable[["TblData"], bool], None]

# Locations ============================================================================
# TODO: these are called cells_* in gt. I prefixed them with Loc just to keep things
@@ -106,7 +106,7 @@ class LocBody(Loc):
    A LocBody object, which is used for a `locations` argument if specifying the table body.
    """
    columns: SelectExpr = None
    rows: list[str] | str | None = None
    rows: RowSelectExpr = None


@dataclass
@@ -270,7 +270,7 @@ def resolve_cols_i(

def resolve_rows_i(
    data: GTData | list[str],
    expr: list[str | int] | None = None,
    expr: RowSelectExpr = None,
    null_means: Literal["everything", "nothing"] = "everything",
) -> list[tuple[str, int]]:
    """Return matching row numbers, based on expr
@@ -283,6 +283,9 @@ def resolve_rows_i(
    the order they appear in the data (rather than ordered by selectors).
    """

    if isinstance(expr, (str, int)):
        expr: List["str | int"] = [expr]

    if isinstance(data, GTData):
        if expr is None:
            if null_means == "everything":
@@ -312,12 +315,25 @@ def resolve_rows_i(
        result = data._tbl_data.with_row_count(name="__row_number__").filter(expr)
        # print([(row_names[ii], ii) for ii in result["__row_number__"]])
        return [(row_names[ii], ii) for ii in result["__row_number__"]]
    elif callable(expr):
        res: "list[bool]" = eval_transform(data._tbl_data, expr)
        if not all(map(lambda x: isinstance(x, bool), res)):
            raise ValueError(
                "If you select rows using a callable, it must take a DataFrame, "
                "and return a boolean Series."
            )
        return [(row_names[ii], ii) for ii, val in enumerate(res) if val]

    # TODO: identify filter-like selectors using some backend check
    # e.g. if it's a siuba expression vs tidyselect expression, etc..
    # TODO: how would this be handled with something like polars? May need a predicate
    # function, similar in spirit to where()?
    raise NotImplementedError("Currently, rows can only be selected via a list of strings")
    raise NotImplementedError(
        "Currently, rows can only be selected using these approaches:\n\n"
        "  * a list of integers\n"
        "  * a polars expression\n"
        "  * a callable that takes a DataFrame and returns a boolean Series"
    )


# Resolve generic ======================================================================
3 changes: 2 additions & 1 deletion great_tables/_spanners.py
@@ -5,7 +5,8 @@
from typing import TYPE_CHECKING, Union, List, Dict, Optional, Any

from ._gt_data import Spanners, SpannerInfo
from ._locations import SelectExpr, resolve_cols_c
from ._tbl_data import SelectExpr
from ._locations import resolve_cols_c

if TYPE_CHECKING:
    from ._gt_data import Boxhead
2 changes: 1 addition & 1 deletion great_tables/_styles.py
@@ -74,7 +74,7 @@ def _evaluate_expressions(self, data: TblData) -> Self:
        new_fields: dict[str, FromValues] = {}
        for field in fields(self):
            attr = getattr(self, field.name)
            if isinstance(attr, PlExpr):
            if isinstance(attr, PlExpr) or callable(attr):
                col_res = eval_transform(data, attr)
                new_fields[field.name] = FromValues(expr=attr, values=col_res)

10 changes: 9 additions & 1 deletion great_tables/_tbl_data.py
@@ -234,11 +234,19 @@ def _(data: PlDataFrame, group_key: str) -> Dict[Any, List[int]]:

# eval_select ----

SelectExpr: TypeAlias = Union[
    List["str | int"],
    PlSelectExpr,
    str,
    int,
    Callable[[str], bool],
    None,
]
_NamePos: TypeAlias = List[Tuple[str, int]]


@singledispatch
def eval_select(data: DataFrameLike, expr: Any, strict: bool = True) -> _NamePos:
def eval_select(data: DataFrameLike, expr: SelectExpr, strict: bool = True) -> _NamePos:
    """Return a list of column names selected by expr."""

    raise NotImplementedError(f"Unsupported type: {type(expr)}")

0 comments on commit eb24086
