Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

COMPAT: Scalar types immutable #17116

Closed
jbrockmendel opened this issue Jul 29, 2017 · 18 comments
Closed

COMPAT: Scalar types immutable #17116

jbrockmendel opened this issue Jul 29, 2017 · 18 comments
Labels
API Design Datetime Datetime data dtype Interval Interval data type Period Period data type Testing pandas testing functions or related to the test suite

Comments

@jbrockmendel
Copy link
Member

My impression (discussion somewhere on gitter) is that pd.Period is supposed to be immutable. That would certainly make sense for objects going into a PeriodIndex.

per = pd.Period('2014Q1')
>>> per
Period('2014Q1', 'Q-DEC')

freq = per.freq.copy()
freq.n *= 2
per.freq = freq
>>> per
Period('2014Q1', '2Q-DEC')

per.freq is a DateOffset object*, which is defined in tseries.offsets. There is a comment in tseries.offsets.BusinessHourMixin.apply saying "# calculate here because offset is not immutable".

So a couple of questions:

  1. Are we sure pd.Period should be immutable? If so, is there a preferred way of imposing that? Note that the class is defined in a .pyx file and some methods explicitly call __new__.
  2. Does the comment about offset not being immutable mean that it can't or shouldn't be? Outside of __init__ and __setstate__, the only place in tseries.offsets where I see any self.foo attributes being set is self.daytime = False in BusinessHourMixin._get_business_hours_by_sec. grepping for "daytime", this attribute does not appear to be referenced anywhere else.

* BTW, I think DateOffset is one of a small number of commonly-used classes for which there is not a ABCDateOffset equivalent. Is that by design? If not, I'll be happy to make one.

@gfyoung gfyoung added API Design Datetime Datetime data dtype labels Jul 29, 2017
@shoyer
Copy link
Member

shoyer commented Jul 30, 2017

Yes, it sure seems like pd.Period should be immutable, though I don't know the historical context.

If so, is there a preferred way of imposing that?

The standard way is to hide attributes like freq behind properties, so they can only be accessed through public APIs, not set, e.g.,

class Period(object):
    ...

    @property
    def freq(self):
        return self._freq

@jbrockmendel
Copy link
Member Author

OK, I can take care of this.

As long as we're looking at futzing with pd.Period, this might be a good time to make design decisions about _libs.period.IncompatibleFrequency. See e.g. #17112, #13129, #15183, ...

Right now two pd.Period objects can be compared (sorted, put into a well-behaved Index,...) if and only if they have the same frequency. AFAICT this choice was made in order to avoid having to take a stand on the other cases. How about a lexicographic ordering based on (start_time, freq)? Timestamp and datetime objects can implicitly have a frequency of zero.

@shoyer
Copy link
Member

shoyer commented Jul 30, 2017

@jbrockmendel Comparisons between periods with different frequencies is also worth discussing, but please open a separate issue for that.

@jbrockmendel
Copy link
Member Author

In profiling some downstream code I found that DateOffset.__ne__ and DateOffset.__eq__ were taking an enormous amount of time. It looks like this is because it is calling self._params() == other._params() in each call to __eq__. Point being: there's some real mileage to be picked up by making _params() a cache_readonly or something.

@jreback
Copy link
Contributor

jreback commented Jul 31, 2017

Note that we don't explicity test, nor write down anywhere, the implicit guarantee for the pandas scalars Timestamp/Period/Interval that they are immutable.

So most important are tests here.

This is almost trivial to fix. See how _Timestamp is defined, IOW the attributes are cdef readonly, while Period are not.

Would appreciate tests on all of the scalar types.

@jreback jreback added Difficulty Intermediate Testing pandas testing functions or related to the test suite labels Jul 31, 2017
@jreback jreback added this to the Next Major Release milestone Jul 31, 2017
@jreback jreback changed the title Make pd.Period immutable COMPAT: Scalar types immutable Jul 31, 2017
@jreback jreback added Interval Interval data type Period Period data type labels Jul 31, 2017
@jbrockmendel
Copy link
Member Author

This is almost trivial to fix. See how _Timestamp is defined, IOW the attributes are cdef readonly, while Period are not.

Current definition:

cdef class _Period(object):

    cdef public:
        int64_t ordinal
        object freq

So just change "cdef public" to "cdef readonly"?

Would appreciate tests on all of the scalar types.

I'm happy to help, but am stretched pretty thin. For now I need to triage.

@gfyoung
Copy link
Member

gfyoung commented Jul 31, 2017

So just change "cdef public" to "cdef readonly"?

Yes, that is correct.

I'm happy to help, but am stretched pretty thin. For now I need to triage.

No worries! This is not urgent, so feel free to take your time on this.

@jbrockmendel
Copy link
Member Author

What about for the non-cython DateOffset? It's got a kwds attribute that's a dict.

@jbrockmendel
Copy link
Member Author

Changing ordinal and freq to cdef readonly breaks Period._from_ordinal, which is called by Period.__init__.

    @classmethod
    def _from_ordinal(cls, ordinal, freq):
        """
        Fast creation from an ordinal and freq that are already validated!
        """
        if ordinal == iNaT:
            return NaT
        else:
            self = _Period.__new__(cls)
            self.ordinal = ordinal
            self.freq = cls._maybe_convert_freq(freq)
            return self

Please advise.

@gfyoung
Copy link
Member

gfyoung commented Jul 31, 2017

@jbrockmendel : I think what you should do is try to see which attribute you can make readonly without breaking anything. For those that do, try to refactor so that you can initialize as many of these parameters as possible. I can't see the code ATM to judge how feasible this second sentence is, but I hope that should help for the time being.

@jbrockmendel
Copy link
Member Author

There are only two _Period attributes that are relevant: ordinal and freq. Both of them are currently declared public, and both of them are set from within the classmethod Period._from_ordinal, which is called at the end of Period.__init__.

@gfyoung
Copy link
Member

gfyoung commented Jul 31, 2017

Can you initialize them in __new__ ? Otherwise, it might be difficult to make these readonly otherwise.

@jbrockmendel
Copy link
Member Author

Correction: _from_ordinal is a classmethod of _Period, not Period. Shouldn't make a difference, but be advised.

Can you initialize them in __new__ ?

__new__ and cython are both at the edges of my comfort zone. I'm not the person we want futzing with this. That said...

Changing _from_ordinal to look like:

    @classmethod
    def _from_ordinal(cls, ordinal, freq):
        """
        Fast creation from an ordinal and freq that are already validated!
        """
        if ordinal == iNaT:
            return NaT
        else:
            freq = cls._maybe_convert_freq(freq)
            self = _Period.__new__(cls, ordinal=ordinal, freq=freq)
            # self = _Period.__new__(cls)
            # self.ordinal = ordinal
            # self.freq = cls._maybe_convert_freq(freq)
            return self

Then the constructor "works", but...

>>> per = pd.Period('2016Q1', freq='Q')
>>> per
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/_libs/period.pyx", line 952, in pandas._libs.period._Period.__repr__ (pandas/_libs/period.c:17168)
    base, mult = frequencies.get_freq_code(self.freq)
  File "pandas/tseries/frequencies.py", line 347, in get_freq_code
    base, stride = _base_and_stride(freqstr)
  File "pandas/tseries/frequencies.py", line 587, in _base_and_stride
    groups = opattern.match(freqstr)
TypeError: expected string or buffer

Also freq and ordinal are set inside __setstate__. No idea what to do about that.

@gfyoung
Copy link
Member

gfyoung commented Jul 31, 2017

@jbrockmendel : I know you said you were stretched thin logistically, so this is great that you've been able to dive into this refactoring with some detail. No worries if you're stuck. At least people will know where to pick things up for the time being.

@jbrockmendel
Copy link
Member Author

Possibly closed by #21341

@gfyoung
Copy link
Member

gfyoung commented Jun 22, 2018

@jbrockmendel : Would you be able to confirm? Can we add tests to check?

@jbrockmendel
Copy link
Member Author

@gfyoung The original motivations (Period and DateOffset) have been made immutable. The part I'm not sure about is if there were other scalars @jreback had in mind when making the issue title more general.

@jbrockmendel
Copy link
Member Author

Closed by #17239 and #21341

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Datetime Datetime data dtype Interval Interval data type Period Period data type Testing pandas testing functions or related to the test suite
Projects
None yet
Development

No branches or pull requests

4 participants