feat: basic sub_missing and sub_zero #244

Merged
merged 9 commits into from
Mar 21, 2024

Conversation

Collaborator

@machow machow commented Mar 15, 2024

This PR adds support for substitution functions. It addresses #182 by adding the following methods:

  • sub_missing: substitutes any missing values. This includes both null and nan.
    • polars: should be equivalent to .is_null() | .is_nan(), since polars distinguishes between the two.
    • pandas: should be equivalent to .isna(), which flags both kinds of missingness.
  • sub_zero: substitutes zero values.

Currently, it just uses the formatter machinery. As I understand it, substitutions should always be applied after formatters (e.g. you should be able to call .sub_zero() right away in a chain, and expect later fmt_*() calls not to override it).

Note these important pieces:

  • Introduced new class FormatterSkipElement, which when returned from a format call, indicates that no change should be made.
  • Rich, in his HTML wizardry, had to address this situation: when a row of a table has all empty cells, then its height collapses (because there is no content inside). Recommended practice is to add a single <br>.
  • Fixed is_na to properly detect float("nan")
  • Added the parameter is_substitution= to fmt(). This deviates from gt. Happy to change to be more similar.
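
The interaction between the skip sentinel and the is_substitution= flag can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: render_cell, sub_zero_fn, two_decimals, and the "nil" placeholder text are hypothetical names chosen for the example, and only the FormatterSkipElement name comes from the description above.

```python
from typing import Any, Callable, List, Tuple, Union


class FormatterSkipElement:
    """Sentinel type: returned by a format function to mean 'leave this cell alone'."""


FormatFn = Callable[[Any], Union[str, FormatterSkipElement]]


def render_cell(value: Any, steps: List[Tuple[FormatFn, bool]]) -> str:
    """Render one cell from (function, is_substitution) pairs.

    Hypothetical helper: regular formatters run first, substitutions last.
    """
    # sorted() is stable and False sorts before True, so substitutions move to
    # the end while each group keeps its registration order.
    ordered = sorted(steps, key=lambda step: step[1])
    result = str(value)
    for func, _ in ordered:
        out = func(value)
        if isinstance(out, FormatterSkipElement):
            continue  # keep whatever an earlier step produced
        result = out
    return result


def sub_zero_fn(x: Any) -> Union[str, FormatterSkipElement]:
    # Only zeros are replaced; every other value is skipped.
    return "nil" if x == 0 else FormatterSkipElement()


def two_decimals(x: Any) -> str:
    return f"{x:.2f}"


# The substitution is registered before the formatter, as in .sub_zero().fmt_*()
steps = [(sub_zero_fn, True), (two_decimals, False)]
print(render_cell(0, steps))    # nil
print(render_cell(1.5, steps))  # 1.50
```

Because the substitution returns the sentinel for non-zero values, the formatted text from the earlier step survives, and because substitutions are sorted last, a later fmt_*() call in the chain does not overwrite the substituted text.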


- return isinstance(x, (pl.Null, type(None))) or x is np.nan or x is nan
+ return isinstance(x, (pl.Null, type(None))) or (isinstance(x, float) and isnan(x))
Collaborator Author

I noticed that in Python, float("nan") is float("nan") evaluates to False, so an identity check misses NaN values that are not the exact np.nan object. Using math.isnan() seems to work, though, for detecting either numpy or builtin NaNs.
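
A small standalone check of the behavior described in this comment. The is_na below is a simplified stand-in that leaves out the polars null check from the diff, so it runs with only the standard library.

```python
import math

a = float("nan")
b = float("nan")

print(a is b)         # False: two distinct float objects
print(a == a)         # False: NaN never compares equal, even to itself
print(math.isnan(a))  # True


def is_na(x) -> bool:
    """Return True for None or any float NaN (simplified: no polars null check)."""
    return x is None or (isinstance(x, float) and math.isnan(x))
```

The isinstance(x, float) guard keeps math.isnan() from raising a TypeError on strings and other non-numeric values. Since np.nan is a plain Python float, the same check covers it as well.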

@@ -64,6 +64,7 @@ def fmt(
fns: Union[FormatFn, FormatFns],
columns: SelectExpr = None,
rows: Union[int, List[int], None] = None,
is_substitution=False,
Collaborator Author

@machow machow Mar 15, 2024

I added this flag mostly because it was very fast to do, but I'm open to creating a dedicated sub() function if that seems better!

Member

What you have makes a lot of sense! By way of comparison, the R version has the analogous subst() function but it's not exported.

@machow machow marked this pull request as ready for review March 15, 2024 19:32
@rich-iannone
Member

From a quick test of sub_missing(), something worth fixing is allowing the missing_text= to work with the md() and html() helper functions. This currently fails:

from great_tables import GT, md, exibble

GT(exibble).sub_missing(missing_text=md("*MISSING*"))
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
File ~/py_projects/great-tables/env/lib/python3.9/site-packages/IPython/core/formatters.py:344, in BaseFormatter.__call__(self, obj)
    342     method = get_real_method(obj, self.print_method)
    343     if method is not None:
--> 344         return method()
    345     return None
    346 else:

File ~/py_projects/great-tables/great_tables/gt.py:195, in GT._repr_html_(self)
    194 def _repr_html_(self):
--> 195     return self.render(context="html")

File ~/py_projects/great-tables/great_tables/gt.py:298, in GT.render(self, context)
    297 def render(self, context: str) -> str:
--> 298     html_table = self._build_data(context=context)._render_as_html()
    299     return html_table

File ~/py_projects/great-tables/great_tables/gt.py:277, in GT._build_data(self, context)
    274 def _build_data(self, context: str) -> Self:
    275     # Build the body of the table by generating a dictionary
    276     # of lists with cells initially set to nan values
--> 277     built = self._render_formats(context)
    278     # built._body = _migrate_unformatted_to_output(body)
    279
...
--> 421 if len(value) and not lib.is_string_array(value, skipna=True):
    422     raise TypeError("Must provide strings.")
    424 mask = isna(value)

TypeError: len() of unsized object

@rich-iannone
Member

rich-iannone commented Mar 19, 2024

Having "---" as the default for missing_text= doesn't seem ideal. What do you think of having None here as a default, then we'll use an em-dash in each output type (right now just "&mdash;" for HTML) for that default case? Then we don't have to worry about an AsIs()-type helper function at all (like we discussed before).

@machow
Collaborator Author

machow commented Mar 19, 2024

Using None as the flag for using some default, such as an emdash, seems reasonable! I wonder if we could keep the surface area of Great Tables down, by avoiding automatic conversions of plaintext to other characters (e.g. "---" to emdash)?

In the future, some kind of enum might make it easy to separate things like raw "---" from emdash?

E.g.

from enum import Enum

class Chars(Enum):
    emdash = "emdash"
    endash = "endash"

(But I'm really spitballing here! May be some better way to represent? Or could people use unicode?!)

@rich-iannone
Member

> Using None as the flag for using some default, such as an emdash, seems reasonable! I wonder if we could keep the surface area of Great Tables down, by avoiding automatic conversions of plaintext to other characters (e.g. "---" to emdash)?

I’m proposing skipping that conversion altogether. The default emdash is enough for this purpose. You’re right in that users could just supply Unicode characters for virtually anything here!
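
The None-as-default idea agreed on here could look something like the sketch below. It is only an illustration of the proposal: resolve_missing_text and DEFAULT_MISSING_TEXT are hypothetical names, and only the "&mdash;" default for HTML is taken from the discussion.

```python
from typing import Optional

# Hypothetical per-context defaults; only the HTML entry comes from the discussion.
DEFAULT_MISSING_TEXT = {"html": "&mdash;"}


def resolve_missing_text(missing_text: Optional[str], context: str = "html") -> str:
    """Return the user's text unchanged, or the context default when it is None."""
    if missing_text is None:
        return DEFAULT_MISSING_TEXT[context]
    return missing_text  # no rewriting, so "---" stays three hyphens


print(resolve_missing_text(None))   # &mdash;
print(resolve_missing_text("---"))  # ---
```

With this shape, user-supplied text is never converted, so anyone who wants a different dash can pass the Unicode character directly.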

Member

@rich-iannone rich-iannone left a comment

LGTM!

@rich-iannone rich-iannone merged commit aeee0d6 into main Mar 21, 2024
7 checks passed
@rich-iannone rich-iannone deleted the feat-sub-funcs branch March 21, 2024 17:01