Add default repr for EAs #23601

Merged
merged 54 commits into from Dec 4, 2018

5 participants
@TomAugspurger
Contributor

commented Nov 9, 2018

Closes #22846
Closes #23590

In [4]: pd.core.arrays.period_array(['2000', '2001', None], freq='D')
Out[4]:
<PeriodArray>
['2000-01-01', '2001-01-01', 'NaT']
Length: 3, dtype: period[D]

In [5]: pd.core.arrays.period_array(['2000', '2001', None] * 100, freq='D')
Out[5]:
<PeriodArray>
['2000-01-01', '2001-01-01',        'NaT', '2000-01-01', '2001-01-01',
        'NaT', '2000-01-01', '2001-01-01',        'NaT', '2000-01-01',
 ...
        'NaT', '2000-01-01', '2001-01-01',        'NaT', '2000-01-01',
 '2001-01-01',        'NaT', '2000-01-01', '2001-01-01',        'NaT']
Length: 300, dtype: period[D]

In [6]: pd.core.arrays.integer_array([1, 2, None])
Out[6]:
<IntegerArray>
[1, 2, NaN]
Length: 3, dtype: Int64

In [7]: pd.core.arrays.integer_array([1, 2, None] * 1000)
Out[7]:
<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64
wip

@TomAugspurger TomAugspurger added this to the 0.24.0 milestone Nov 9, 2018

@pep8speaks


commented Nov 9, 2018

Hello @TomAugspurger! Thanks for submitting the PR.

@TomAugspurger

Contributor Author

commented Nov 9, 2018

In [3]: integer_array([1, 2, 3])
Out[3]:
<IntegerArray>
[1, 2, 3]
Length: 3, dtype: Int64

In [4]: period_array(['2000', '2001'], freq='D')
Out[4]:
<PeriodArray>
[2000-01-01, 2001-01-01]
Length: 2, dtype: period[D]

In [5]: IntervalArray.from_breaks([1, 2, 3])
Out[5]:
<IntervalArray>
[(1, 2], (2, 3]]
Length: 2, dtype: interval[int64]

In [6]: integer_array([1, 2, 3] * 1000)
Out[6]:
<IntegerArray>
[1, 2, 3, 1, 2, 3, 1, 2, 3, 1,
 ...
 3, 1, 2, 3, 1, 2, 3, 1, 2, 3]
Length: 3000, dtype: Int64
@jorisvandenbossche
Member

left a comment

Do you want to add one for DatetimeArray and TimedeltaArray here as well?

Do we need a mechanism to indicate which attributes to print in addition to length and dtype? (In case we want to keep printing the freq for DatetimeArray/TimedeltaArray.)

Is there any control over the number of elements shown?

pandas/core/arrays/base.py (outdated)
pandas/core/arrays/base.py
@TomAugspurger

Contributor Author

commented Nov 9, 2018

For DatetimeArray & TimedeltaArray I was waiting to see what happens on #23587. If @jbrockmendel reverts the repr changes before merging, then I'll add it here. Otherwise I'll just delete them after that's merged.

Do we need a mechanism to indicate which attributes to print in addition to length and dtype? (In case we want to keep printing the freq for DatetimeArray/TimedeltaArray.)

Sure. I suppose that info doesn't belong in the dtype (right?) so we can add hooks for extra attrs.

Is there any control over the number of elements shown?

At the array level, no. But I think following the option at pd.options.display.max_seq_items is the right thing to do. I'll add a note to the docs.
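As a rough illustration of how a display-wide limit such as max_seq_items can drive the truncated repr shown above, here is a minimal pure-Python sketch. The helper name `summarize` and its `edge_items` parameter are hypothetical and are not part of pandas; the sketch only mimics the "head, `...` on its own line, tail" layout.

```python
def summarize(values, max_seq_items=100, edge_items=10):
    """Hypothetical helper: format a sequence, eliding the middle when long."""
    formatted = [str(v) for v in values]
    if len(formatted) <= max_seq_items:
        # Short enough: show every element on one line.
        return "[" + ", ".join(formatted) + "]"
    # Keep at most `edge_items` elements on each side (and at least one).
    n = max(1, min(max_seq_items // 2, edge_items))
    head = ", ".join(formatted[:n])
    tail = ", ".join(formatted[-n:])
    return "[" + head + ",\n ...\n " + tail + "]"


short = summarize([1, 2, 3])
long = summarize([1, 2, 3] * 1000, max_seq_items=6)
print(short)
print(long)
```

Lowering the limit shortens the head and tail; the array itself carries no setting of its own, which matches the "at the array level, no" answer above.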

@jorisvandenbossche

Member

commented Nov 9, 2018

The main difference for PeriodArray seems to be that the values are no longer quoted? (not a strong opinion here though, for Index reprs we use quotes and also numpy quotes dates)

@TomAugspurger

Contributor Author

commented Nov 9, 2018

diff --git a/pandas/core/arrays/period.py b/pandas/core/arrays/period.py
index f6996f8e6..4bc7841fe 100644
--- a/pandas/core/arrays/period.py
+++ b/pandas/core/arrays/period.py
@@ -330,6 +330,10 @@ class PeriodArray(dtl.DatetimeLikeArrayMixin, ExtensionArray):
     def end_time(self):
         return self.to_timestamp(how='end')
 
+    @property
+    def _formatter(self):
+        return "'{}'".format
+
     def __setitem__(
             self,
             key,   # type: Union[int, Sequence[int], Sequence[bool]]

will quote. Gonna mixup the py2 issues.

@TomAugspurger

Contributor Author

commented Nov 9, 2018

Update:

  1. deprecated ExtensionArray._formatting_values
  2. Changed _formatter to a function that takes a flag for whether or not we're printing inside a Series / DataFrame
In [5]: integer_array([1, 2, None] * 1000)
Out[5]:
<IntegerArray>
[  1,   2, nan,   1,   2, nan,   1,   2, nan,   1,
 ...
 nan,   1,   2, nan,   1,   2, nan,   1,   2, nan]
Length: 3000, dtype: Int64

In [6]: IntervalArray.from_breaks(list(range(1000)))
Out[6]:
<IntervalArray>
[    (0, 1],     (1, 2],     (2, 3],     (3, 4],     (4, 5],     (5, 6],
     (6, 7],     (7, 8],     (8, 9],    (9, 10],
 ...
 (989, 990], (990, 991], (991, 992], (992, 993], (993, 994], (994, 995],
 (995, 996], (996, 997], (997, 998], (998, 999]]
Length: 999, dtype: interval[int64]

In [7]: period_array(['2000', '2001'] * 1000, freq='D')
Out[7]:
<PeriodArray>
['2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
 '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01',
 ...
 '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01',
 '2001-01-01', '2000-01-01', '2001-01-01', '2000-01-01', '2001-01-01']
Length: 2000, dtype: period[D]

@jsexauer jsexauer referenced this pull request Nov 9, 2018

Open

DEPR: deprecations from prior versions #6581

0 of 94 tasks complete
@jreback

Contributor

commented Nov 9, 2018

deprecated: ExtensionArray._formatting_values

has this been around for a while? it's a private attribute, so why deprecate it?

@TomAugspurger

Contributor Author

commented Nov 9, 2018

It's part of the interface and was around since 0.23.

@TomAugspurger

Contributor Author

commented Nov 9, 2018

This ran into a bit larger of a refactor... I removed {Categorical,Period,Interval}ArrayFormatter in favor of a generic ExtensionArrayFormatter.

EAs will get control over formatting of individual values by overriding ExtensionArray._formatter.

TomAugspurger added some commits Nov 9, 2018


@jorisvandenbossche

Member

commented Dec 1, 2018

@TomAugspurger I was thinking for a moment again on the quoting issue.

It would be nice to have a somewhat general rule about this, and also to reflect this in the base repr.

From observing the results, it seems that - mostly - we use a str representation in Series / DataFrame, and an (abbreviated) repr in Index / Array. With 'abbreviated' I mean for eg Timestamp, we use '2012-01-01' instead of Timestamp('2012-01-01') (so leaving out the object type) since the dtype already makes it clear they are all of that object type.

If we like this general pattern, we could also do this for the default repr (and sorry, that goes against what I commented earlier #23601 (comment) and what you changed). So instead of

def _formatter(self, boxed=False):
    return repr

we could have

def _formatter(self, boxed=False):
    if boxed:
        return str
    return repr

That would also fit a possible StringArray to not quote it in Series/DataFrame, but to show it quoted in the array repr.

Anyway, since this is configurable, and I think none of the internal ones inherits the base _formatter implementation, it is not that important, and certainly does not need to hold up this PR.
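To make the proposed rule concrete, here is a small self-contained sketch of a `_formatter(boxed)` hook that returns `str` when boxed and `repr` otherwise. `ToyStringArray` and its `boxed_lines` method are hypothetical stand-ins written for this illustration; they are not pandas classes and only imitate the layout of the reprs in this thread.

```python
class ToyStringArray:
    """Hypothetical stand-in for an extension array holding strings."""

    def __init__(self, values):
        self._values = list(values)

    def _formatter(self, boxed=False):
        # Boxed (inside a Series / DataFrame): plain str, no quotes.
        # Unboxed (the array's own repr): repr, so strings are quoted.
        if boxed:
            return str
        return repr

    def __repr__(self):
        fmt = self._formatter(boxed=False)
        body = ", ".join(fmt(v) for v in self._values)
        return "<ToyStringArray>\n[{}]\nLength: {}".format(body, len(self._values))

    def boxed_lines(self):
        # Imitates how a container might render each element row by row.
        fmt = self._formatter(boxed=True)
        return ["{}    {}".format(i, fmt(v)) for i, v in enumerate(self._values)]


arr = ToyStringArray(["a", "b"])
print(repr(arr))
print("\n".join(arr.boxed_lines()))
```

With this rule the same values appear quoted in the array repr and unquoted when boxed, which is the behaviour described for a possible StringArray.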

@TomAugspurger

Contributor Author

commented Dec 2, 2018

Implemented
#23601 (comment) in 2a60c15.

doc/source/whatsnew/v0.24.0.rst (outdated)
pandas/core/arrays/base.py (outdated)
----------
boxed: bool, default False
    An indicator for whether or not your array is being printed
    within a Series, DataFrame, or Index (True), or just by

@jreback

jreback Dec 2, 2018

Contributor

I would add this arg to the Index formatters as well for compatibility.

pandas/core/arrays/categorical.py (outdated)
pandas/core/arrays/sparse.py
pandas/core/arrays/period.py
pandas/io/formats/printing.py (outdated)
defaults to the class name of the obj
Pass ``False`` to indicate that subsequent lines should

@jreback

jreback Dec 2, 2018

Contributor

can this be another parameter then? it seems like it is used for 2 purposes

summary += '],'

# right now close is either '' or ', '
# Now we want to include the ']', but not the maybe space.

@jreback

jreback Dec 2, 2018

Contributor

so another difference this is highlighting is that EAs have the attributes on another line, while the Index does not (as they are args).

TomAugspurger added some commits Dec 3, 2018

Update for review
* docs
* removed overloading of name=False
* added indent_for_name
@jreback

Contributor

commented Dec 3, 2018

can you rebase

@@ -503,7 +503,7 @@ def __array_wrap__(self, result, context=None):

     @property
     def _formatter_func(self):
-        return lambda x: "'%s'" % x
+        return self.array._formatter(boxed=False)

@jreback

jreback Dec 3, 2018

Contributor

why is this the only index subclass that you need to do this for?

@TomAugspurger

TomAugspurger Dec 4, 2018

Author Contributor

It's the only extension-array backed index that's using the formatting right now.

  • CategoricalIndex always uses the default formatter for the underlying categories (since it can be a container for any type, it dispatches the formatting).
  • IntervalIndex / IntervalArray don't use this formatting

Datetime / Timedelta will use this too, so this will be pushed up into DatetimeIndexOpsMixin later.

    def __init__(self, values, *args, **kwargs):
        GenericArrayFormatter.__init__(self, values, *args, **kwargs)
        if is_categorical_dtype(values.dtype):
            # Categorical is special for now, so that we can preserve tzinfo

@jreback

jreback Dec 4, 2018

Contributor

do we need a TODO here? this is until DatetimeArray is fully pushed?

@TomAugspurger

TomAugspurger Dec 4, 2018

Author Contributor

That depends on whether we're willing to change __array__ for datetime-backed series / index (right now . I'm writing up an issue now to discuss that specific point.)

@TomAugspurger

TomAugspurger Dec 4, 2018

Author Contributor

#23569 (comment) for that.

    data = integer_array([1, 2, None] * 1000)
    expected = (
        "<IntegerArray>\n"
        "[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,\n"

@jreback

jreback Dec 4, 2018

Contributor

these are somehow justified?

@TomAugspurger

TomAugspurger Dec 4, 2018

Author Contributor

Justified, as in "vertically aligned"? Here's the repr

In [10]: data
Out[10]:
<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

The NaN pattern there makes the formatting a bit strange, but I think it's unavoidable.
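The padding in that repr comes from right-justifying every element to the width of the widest one (here `NaN`). A minimal sketch of that idea follows; the `justify` helper is hypothetical, uses `None` as a stand-in for a missing value, and is not the pandas implementation.

```python
def justify(values):
    """Hypothetical helper: right-justify formatted values to a common width."""
    # None stands in for a missing value and is rendered as 'NaN'.
    formatted = ["NaN" if v is None else str(v) for v in values]
    width = max((len(s) for s in formatted), default=0)
    return "[" + ", ".join(s.rjust(width) for s in formatted) + "]"


with_missing = justify([1, 2, None, 1, 2, None])
without_missing = justify([1, 2, 3])
print(with_missing)
print(without_missing)
```

When no element is wider than the others, as in the second call, no padding is added, which is why the all-integer repr earlier in the thread looks tighter.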

@jreback

jreback Dec 4, 2018

Contributor

no this is ok, was just wondering about

@jreback
Contributor

left a comment

couple of comments

@TomAugspurger

Contributor Author

commented Dec 4, 2018

I'm not sure how hard it would be, but in

<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN,   1,
 ...
 NaN,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

I don't like that we devote an entire line just to ... I wonder if we could instead do

<IntegerArray>
[  1,   2, NaN,   1,   2, NaN,   1,   2, NaN, ...,
 ...,   1,   2, NaN,   1,   2, NaN,   1,   2, NaN]
Length: 3000, dtype: Int64

Although, that makes the continuation harder to see (for me)...

Edit: thinking about it more, I prefer the one with ... on its own line.


I'll try to followup with a short one-line repr before the release.

@jreback

Contributor

commented Dec 4, 2018

this is fine: #23601 (comment)

we do this elsewhere, try this with a large Array and the own line makes a lot of sense.

@TomAugspurger

Contributor Author

commented Dec 4, 2018

👍 all good then?

@jreback

jreback approved these changes Dec 4, 2018

@jreback jreback merged commit 1573340 into pandas-dev:master Dec 4, 2018

3 checks passed

ci/circleci Your tests passed on CircleCI!
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
pandas-dev.pandas Build #20181203.73 succeeded
Details
@jreback

Contributor

commented Dec 4, 2018

thanks a lot @TomAugspurger ; this consolidates a lot of disparate code!

Pingviinituutti added a commit to Pingviinituutti/pandas that referenced this pull request Feb 28, 2019



     def _format_strings(self):
-        fmt_values = format_array(self.values.get_values(), self.formatter,
+        fmt_values = format_array(array,

@simonjayhawkins

simonjayhawkins Apr 5, 2019

Member

@TomAugspurger: I'm struggling to resolve some formatting issues. What is the reason for calling format_array here? As far as I can tell, it is looping back round to create a GenericArrayFormatter instance with a formatter specified to pick up the display options.

@simonjayhawkins

simonjayhawkins Apr 5, 2019

Member

i guess, to be more succinct, why is super()._format_strings() not used?

@jorisvandenbossche

jorisvandenbossche Apr 5, 2019

Member

I am not that familiar with this code, but from a quick look: calling super()._format_strings() would be different, as this would call GenericArrayFormatter._format_strings, while the generic format_array can still result in using custom formatters like Datetime64(TZ)Formatter or Timedelta64Formatter, depending on what the values of the underlying EA are.
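The distinction drawn here, calling the parent's method directly versus going back through a dispatching function, can be sketched with toy classes. Every name below (`GenericFormatter`, `DateFormatter`, `format_values`, `WrapperFormatter`) is hypothetical and only loosely mirrors the roles of GenericArrayFormatter, the specialised formatters, format_array and ExtensionArrayFormatter; it is not the pandas code.

```python
from datetime import date


class GenericFormatter:
    """Toy base formatter: falls back to repr for every value."""

    def __init__(self, values):
        self.values = values

    def format_strings(self):
        return [repr(v) for v in self.values]


class DateFormatter(GenericFormatter):
    """Toy specialised formatter for date values."""

    def format_strings(self):
        return [v.isoformat() for v in self.values]


def format_values(values):
    # Plays the role of a dispatching function: pick a formatter class
    # based on what the values actually are.
    if values and all(isinstance(v, date) for v in values):
        formatter_cls = DateFormatter
    else:
        formatter_cls = GenericFormatter
    return formatter_cls(values).format_strings()


class WrapperFormatter(GenericFormatter):
    def via_super(self):
        # Always uses the generic implementation.
        return super().format_strings()

    def via_dispatch(self):
        # Re-enters the dispatcher, so specialised formatters can apply.
        return format_values(self.values)


wrapped = WrapperFormatter([date(2000, 1, 1)])
print(wrapped.via_super())
print(wrapped.via_dispatch())
```

In this toy setup the two paths give different strings for the same values, which is the difference described above.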

@jorisvandenbossche

jorisvandenbossche Apr 5, 2019

Member

Although most of those custom Formatter classes don't do much special if formatter is specified.

Eg Datetime64Formatter has this in _format_strings:

if self.formatter is not None and callable(self.formatter):
    return [self.formatter(x) for x in values]

@simonjayhawkins

simonjayhawkins Apr 5, 2019

Member

so if ExtensionArrayFormatter is not inheriting from GenericArrayFormatter but calling format_array to dispatch to another ...ArrayFormatter class, why wouldn't the logic in ExtensionArrayFormatter be in format_array?

    def _formatter(self, boxed=False):
        def fmt(x):
            if isna(x):
                return 'NaN'

@simonjayhawkins

simonjayhawkins Apr 5, 2019

Member

should NaN have been hardcoded here?

@jorisvandenbossche

jorisvandenbossche Apr 5, 2019

Member

Why not? This is used for the Integer data display, where we currently use NaN.

(whether we should use rather 'NA' instead of 'NaN', that's another question)

@simonjayhawkins

simonjayhawkins Apr 5, 2019

Member

just wondering whether it would be a problem with to_string(na_rep=...). will do some tests.

@jorisvandenbossche

jorisvandenbossche Apr 5, 2019

Member

Ah, yes, that's a good reason. But in general, this _formatter does not follow display options at all, is that correct?
In which case this is something to think about in general.
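The concern can be illustrated without pandas: a formatter that hard-codes the missing-value text cannot honour a caller's `na_rep`, while one that accepts it as a parameter can. The functions below are hypothetical, use `None` as the missing value, and are only a sketch of the trade-off; they do not show how pandas resolves it.

```python
def hardcoded_formatter():
    """Mirrors the pattern under discussion: 'NaN' is fixed inside fmt."""
    def fmt(x):
        if x is None:
            return "NaN"
        return str(x)
    return fmt


def configurable_formatter(na_rep="NaN"):
    """Hypothetical alternative: the missing-value text is a parameter."""
    def fmt(x):
        return na_rep if x is None else str(x)
    return fmt


def to_string(values, formatter):
    # Toy stand-in for a to_string-style renderer, one value per line.
    return "\n".join(formatter(v) for v in values)


fixed = to_string([1, None], hardcoded_formatter())
custom = to_string([1, None], configurable_formatter(na_rep="-"))
print(fixed)
print(custom)
```

With the hard-coded version, a caller asking for `-` as the missing-value marker would still see `NaN`.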
