Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLN/ENH: Stop instantiating all offsets on load. #5189

Merged

Conversation

jtratner
Copy link
Contributor

Implements @wesm's suggestion and clears the way for new offsets.

Very simple to extend - map a string to your class in mapping, define a from_name
classmethod, then class gets passed everything after the - (if
anything) to its method from_name(). Caches to dict afterwards (right
now doesn't use cacheable, but could take advantage of that).

cc @cancan101

For reference, here are all the existing offset: output aliases (using
types because otherwise prints exactly the same).
[(None, NoneType),
('A-APR', pandas.tseries.offsets.YearEnd),
('A-AUG', pandas.tseries.offsets.YearEnd),
('A-DEC', pandas.tseries.offsets.YearEnd),
('A-FEB', pandas.tseries.offsets.YearEnd),
('A-JAN', pandas.tseries.offsets.YearEnd),
('A-JUL', pandas.tseries.offsets.YearEnd),
('A-JUN', pandas.tseries.offsets.YearEnd),
('A-MAR', pandas.tseries.offsets.YearEnd),
('A-MAY', pandas.tseries.offsets.YearEnd),
('A-NOV', pandas.tseries.offsets.YearEnd),
('A-OCT', pandas.tseries.offsets.YearEnd),
('A-SEP', pandas.tseries.offsets.YearEnd),
('AS-APR', pandas.tseries.offsets.YearBegin),
('AS-AUG', pandas.tseries.offsets.YearBegin),
('AS-DEC', pandas.tseries.offsets.YearBegin),
('AS-FEB', pandas.tseries.offsets.YearBegin),
('AS-JAN', pandas.tseries.offsets.YearBegin),
('AS-JUL', pandas.tseries.offsets.YearBegin),
('AS-JUN', pandas.tseries.offsets.YearBegin),
('AS-MAR', pandas.tseries.offsets.YearBegin),
('AS-MAY', pandas.tseries.offsets.YearBegin),
('AS-NOV', pandas.tseries.offsets.YearBegin),
('AS-OCT', pandas.tseries.offsets.YearBegin),
('AS-SEP', pandas.tseries.offsets.YearBegin),
('B', pandas.tseries.offsets.BusinessDay),
('BA-APR', pandas.tseries.offsets.BYearEnd),
('BA-AUG', pandas.tseries.offsets.BYearEnd),
('BA-DEC', pandas.tseries.offsets.BYearEnd),
('BA-FEB', pandas.tseries.offsets.BYearEnd),
('BA-JAN', pandas.tseries.offsets.BYearEnd),
('BA-JUL', pandas.tseries.offsets.BYearEnd),
('BA-JUN', pandas.tseries.offsets.BYearEnd),
('BA-MAR', pandas.tseries.offsets.BYearEnd),
('BA-MAY', pandas.tseries.offsets.BYearEnd),
('BA-NOV', pandas.tseries.offsets.BYearEnd),
('BA-OCT', pandas.tseries.offsets.BYearEnd),
('BA-SEP', pandas.tseries.offsets.BYearEnd),
('BAS-APR', pandas.tseries.offsets.BYearBegin),
('BAS-AUG', pandas.tseries.offsets.BYearBegin),
('BAS-DEC', pandas.tseries.offsets.BYearBegin),
('BAS-FEB', pandas.tseries.offsets.BYearBegin),
('BAS-JAN', pandas.tseries.offsets.BYearBegin),
('BAS-JUL', pandas.tseries.offsets.BYearBegin),
('BAS-JUN', pandas.tseries.offsets.BYearBegin),
('BAS-MAR', pandas.tseries.offsets.BYearBegin),
('BAS-MAY', pandas.tseries.offsets.BYearBegin),
('BAS-NOV', pandas.tseries.offsets.BYearBegin),
('BAS-OCT', pandas.tseries.offsets.BYearBegin),
('BAS-SEP', pandas.tseries.offsets.BYearBegin),
('BM', pandas.tseries.offsets.BusinessMonthEnd),
('BMS', pandas.tseries.offsets.BusinessMonthBegin),
('BQ', pandas.tseries.offsets.BQuarterEnd),
('BQ-APR', pandas.tseries.offsets.BQuarterEnd),
('BQ-AUG', pandas.tseries.offsets.BQuarterEnd),
('BQ-DEC', pandas.tseries.offsets.BQuarterEnd),
('BQ-FEB', pandas.tseries.offsets.BQuarterEnd),
('BQ-JAN', pandas.tseries.offsets.BQuarterEnd),
('BQ-JUL', pandas.tseries.offsets.BQuarterEnd),
('BQ-JUN', pandas.tseries.offsets.BQuarterEnd),
('BQ-MAR', pandas.tseries.offsets.BQuarterEnd),
('BQ-MAY', pandas.tseries.offsets.BQuarterEnd),
('BQ-NOV', pandas.tseries.offsets.BQuarterEnd),
('BQ-OCT', pandas.tseries.offsets.BQuarterEnd),
('BQ-SEP', pandas.tseries.offsets.BQuarterEnd),
('BQS', pandas.tseries.offsets.BQuarterBegin),
('BQS-APR', pandas.tseries.offsets.BQuarterBegin),
('BQS-AUG', pandas.tseries.offsets.BQuarterBegin),
('BQS-DEC', pandas.tseries.offsets.BQuarterBegin),
('BQS-FEB', pandas.tseries.offsets.BQuarterBegin),
('BQS-JAN', pandas.tseries.offsets.BQuarterBegin),
('BQS-JUL', pandas.tseries.offsets.BQuarterBegin),
('BQS-JUN', pandas.tseries.offsets.BQuarterBegin),
('BQS-MAR', pandas.tseries.offsets.BQuarterBegin),
('BQS-MAY', pandas.tseries.offsets.BQuarterBegin),
('BQS-NOV', pandas.tseries.offsets.BQuarterBegin),
('BQS-OCT', pandas.tseries.offsets.BQuarterBegin),
('BQS-SEP', pandas.tseries.offsets.BQuarterBegin),
('Q-APR', pandas.tseries.offsets.QuarterEnd),
('Q-AUG', pandas.tseries.offsets.QuarterEnd),
('Q-DEC', pandas.tseries.offsets.QuarterEnd),
('Q-FEB', pandas.tseries.offsets.QuarterEnd),
('Q-JAN', pandas.tseries.offsets.QuarterEnd),
('Q-JUL', pandas.tseries.offsets.QuarterEnd),
('Q-JUN', pandas.tseries.offsets.QuarterEnd),
('Q-MAR', pandas.tseries.offsets.QuarterEnd),
('Q-MAY', pandas.tseries.offsets.QuarterEnd),
('Q-NOV', pandas.tseries.offsets.QuarterEnd),
('Q-OCT', pandas.tseries.offsets.QuarterEnd),
('Q-SEP', pandas.tseries.offsets.QuarterEnd),
('QS', pandas.tseries.offsets.QuarterBegin),
('QS-APR', pandas.tseries.offsets.QuarterBegin),
('QS-AUG', pandas.tseries.offsets.QuarterBegin),
('QS-DEC', pandas.tseries.offsets.QuarterBegin),
('QS-FEB', pandas.tseries.offsets.QuarterBegin),
('QS-JAN', pandas.tseries.offsets.QuarterBegin),
('QS-JUL', pandas.tseries.offsets.QuarterBegin),
('QS-JUN', pandas.tseries.offsets.QuarterBegin),
('QS-MAR', pandas.tseries.offsets.QuarterBegin),
('QS-MAY', pandas.tseries.offsets.QuarterBegin),
('QS-NOV', pandas.tseries.offsets.QuarterBegin),
('QS-OCT', pandas.tseries.offsets.QuarterBegin),
('QS-SEP', pandas.tseries.offsets.QuarterBegin),
('S', pandas.tseries.offsets.Second),
('T', pandas.tseries.offsets.Minute),
('U', pandas.tseries.offsets.Micro),
('W-FRI', pandas.tseries.offsets.Week),
('W-MON', pandas.tseries.offsets.Week),
('W-SAT', pandas.tseries.offsets.Week),
('W-SUN', pandas.tseries.offsets.Week),
('W-THU', pandas.tseries.offsets.Week),
('W-TUE', pandas.tseries.offsets.Week),
('W-WED', pandas.tseries.offsets.Week),
('WOM-1FRI', pandas.tseries.offsets.WeekOfMonth),
('WOM-1MON', pandas.tseries.offsets.WeekOfMonth),
('WOM-1THU', pandas.tseries.offsets.WeekOfMonth),
('WOM-1TUE', pandas.tseries.offsets.WeekOfMonth),
('WOM-1WED', pandas.tseries.offsets.WeekOfMonth),
('WOM-2FRI', pandas.tseries.offsets.WeekOfMonth),
('WOM-2MON', pandas.tseries.offsets.WeekOfMonth),
('WOM-2THU', pandas.tseries.offsets.WeekOfMonth),
('WOM-2TUE', pandas.tseries.offsets.WeekOfMonth),
('WOM-2WED', pandas.tseries.offsets.WeekOfMonth),
('WOM-3FRI', pandas.tseries.offsets.WeekOfMonth),
('WOM-3MON', pandas.tseries.offsets.WeekOfMonth),
('WOM-3THU', pandas.tseries.offsets.WeekOfMonth),
('WOM-3TUE', pandas.tseries.offsets.WeekOfMonth),
('WOM-3WED', pandas.tseries.offsets.WeekOfMonth),
('WOM-4FRI', pandas.tseries.offsets.WeekOfMonth),
('WOM-4MON', pandas.tseries.offsets.WeekOfMonth),
('WOM-4THU', pandas.tseries.offsets.WeekOfMonth),
('WOM-4TUE', pandas.tseries.offsets.WeekOfMonth),
('WOM-4WED', pandas.tseries.offsets.WeekOfMonth)]

@ghost ghost assigned jtratner Oct 12, 2013
_offset_map = {}
# TODO: Incorporate this into the make_offset thing in offsets
# And what, Nano doesn't exist with np >= 1.7??
# if not _np_version_under1p7:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nano is not supported under < 1.7 so it exists, but he rule-code should error should raise

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, lol it's not under.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

anyways, just need to fix up offsets.py for this.

@cancan101
Copy link
Contributor

@wesm @jtratner This might be a good time to answer: #5015

for day in days]
#singletons
names += ['S', 'T', 'U', 'BM', 'BMS', 'BQ', 'QS'] # No 'Q'
_offset_map.clear()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe put this in a setUp method?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean for _offset_map, I thought about it, but only matters for certain methods to actually clear the offset map - we can revisit as necessary. The rest we only need to test once - if more testing needed, can refactor then.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, I meant the call: _offset_map.clear(). Fair enough

@cancan101
Copy link
Contributor

@jtratner I wonder if this is anything worth cherry picking from #5148

@jtratner
Copy link
Contributor Author

I'll make it _named, that seems reasonable.

@jtratner
Copy link
Contributor Author

Looking through, the rule code is pretty clear, it's prefix-OffsetSpecificEnding, where prefix is something like B, Q, etc. and the OffsetSpecificEnding is something like month, day, etc. There are a bunch of aliases for backwards compatibility and that's about it.

@cancan101
Copy link
Contributor

Looking at the regex you can see there are a few more rules: for example numbers are allowed only right after the dash.

@jtratner
Copy link
Contributor Author

Yeah, but that's really just for validation: it covers the set of rules that already exist (as well as the legacy rules from scikits.timeseries, etc). (since you only end up with #s in the string for WOM) If we wanted to add some crazy additional rule_code, you'd change the regex to make it pass the validation.

@cancan101
Copy link
Contributor

I agree that we can always change the regex since the regex is just used for validation and not for parsing. We just want to make sure that we can express the parameterization for a given offset in a reasonable manner.

@jtratner
Copy link
Contributor Author

Places where rule_code actually matters:

  • is_subperiod/is_superperiod (I guess different WOM can't downsample to each other)
  • tseries/tools - where it checks that rule_code == 'M' (so would be fine with static rule_code)
  • Timegrouper - which actually has to hack around rule_code to remove the suffix and just check for the prefix
  • generic - used for an error message
  • _get_freq in tseries/plotting - used to get period alias [probably could just have offsets know their period aliases)

@jtratner
Copy link
Contributor Author

Okay, I have to go, but I think it's simple: you split on '-', first element needs to define class for lookup, you can have whatever else you want can go afterwards (and it's up to the dateoffset to handle it)

@cancan101
Copy link
Contributor

Agreed.

@cancan101
Copy link
Contributor

I would like to cleanup the is_subperiod/is_superperiod eventually. One of the nice features of the 52-53 week fiscal years is that by design they upsample to weeks.

@jtratner
Copy link
Contributor Author

actually not sure whether we should deprecate rule_code, just want to work around it.

@jtratner
Copy link
Contributor Author

@cancan101 good suggestion about adding a prefix attribute, ended up meaning that I could remove most of the special rule_codes and simplify things! (I ended up leaving comments with the actual prefixes just for clarity 😄)

@cancan101
Copy link
Contributor

@jtratner An additional location where rule_code matters is _FrequencyInferer. See for example get_freq, _get_wom_rule, _get_monthly_rule.

@jtratner
Copy link
Contributor Author

okay well this doesn't change rule_code.

@cancan101
Copy link
Contributor

That may be the case but I would be in favor of at the very least moving these over to using constants (along lines of #5189 (comment)) if not writing this code to check against a DateOffset object rather than the freq_str / rule_code.

@jtratner
Copy link
Contributor Author

I want to push that to a separate refactor. This keeps the behavior relatively the same and will enable you to merge your 52-53 week PR.

@cancan101
Copy link
Contributor

That's totally fine. I was just going through looking at the upsample/downsample code and was reminded of those string literals...

@jtratner
Copy link
Contributor Author

frankly, you probably have a better sense of this than I, just figured I could simplify a bit for you first, make sure we have a working framework, then it's easier to make more advanced tweaks later.

return offset
else:
raise ValueError('Bad rule name requested: %s.' % name)
except KeyError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Under what conditions would a KeyError be raised (ie which lines could raise the error)?

@@ -683,6 +711,9 @@ def rule_code(self):
6: 'SUN'
}

# reversed
_weekday_dict.update([(v, k) for k, v in _weekday_dict.items()])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of adding the reversed map to the same dict?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Means that _weekday_dict['SUN'] is 6, while _weekday_dict[6] is 'SUN' and that way you don't have to maintain two separate dicts of the same information (have to lookup the first way in from_name and the second in rule_code.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does Week.from_name(suffix=1) give you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll raise an error because you can't slice it. I'll go change all of them to private methods now, thanks for the heads up.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will actually create the Week but when that Week is used it will raise an Exception.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, this makes me realize that my implementation is wrong.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed it - need to catch the TypeError from getting an offset with too many -

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess I don't see the benefit of sharing the same dict for forward and reverse lookups. It seems to leave open the possibility of not fully checking for invalid inputs. Admittedly the dict is private.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, that's reasonable and trivial to implement - I'll add it.

@jtratner
Copy link
Contributor Author

Okay, I think this is in a pretty good place at this point. Thanks for the feedback @cancan101! Any other comments? @wesm - does this match what you were thinking?

@cancan101
Copy link
Contributor

@jtratner NP.

else:
# what the heck is this Exception supposed to be for???
try:
return offset.freqstr
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any idea about this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't know...all I know is that BusinessDay was treated separately and explicitly tested to be this way. I think I might be able to remove the try/except though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, can remove the try/except now because of the check - is this considered public? I.e., does it need to validate anything? If it's not I might put the try/except back.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean rather, if it is, would prefer to have error be "Bad rule" rather than "No attribute freqstr"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Put it back for just that reason, catching any Exception

@jtratner
Copy link
Contributor Author

@jreback and others - okay?

@jreback
Copy link
Contributor

jreback commented Oct 14, 2013

ok by me

Use a more flexible lookup based on prefix then delegate creation to
class' `from_name()` method. Totally get rid of hasOffsetName (not necessary).
Finally, make some inheriting go on to simplify things.
@jtratner
Copy link
Contributor Author

okay, going to merge this soon? any other comments?

@cancan101
Copy link
Contributor

Is there a vbench for pandas load time?

@jtratner
Copy link
Contributor Author

No, I don't think so (hard to time), but this is mostly to make your PR
simpler and easier to support.

jtratner added a commit that referenced this pull request Oct 15, 2013
…load

CLN/ENH: Stop instantiating all offsets on load.
@jtratner jtratner merged commit 4db583d into pandas-dev:master Oct 15, 2013
@cancan101
Copy link
Contributor

@jtratner I'm totally fine with commit. I was just curious as to start up gain.

@jtratner
Copy link
Contributor Author

It's a good point, I've definitely thought about it too. Not sure how you'd
test within Python, don't know how you could unload it to load again.
Memory footprint might be a better metric, but even that isn't so high.

@cancan101
Copy link
Contributor

Maybe something from:
http://stackoverflow.com/questions/437589/how-do-i-unload-reload-a-python-module
On Oct 14, 2013 9:34 PM, "Jeff Tratner" notifications@github.com wrote:

It's a good point, I've definitely thought about it too. Not sure how you'd
test within Python, don't know how you could unload it to load again.
Memory footprint might be a better metric, but even that isn't so high.


Reply to this email directly or view it on GitHubhttps://github.com//pull/5189#issuecomment-26303269
.

@jtratner
Copy link
Contributor Author

Go for it - I'd certainly be interested to see the result.

@cancan101
Copy link
Contributor

@jtratner Any reason for this code to still exist in frequencies:

for _i, _weekday in enumerate(['MON', 'TUE', 'WED', 'THU', 'FRI']):
    for _iweek in range(4):
        _name = 'WOM-%d%s' % (_iweek + 1, _weekday)
        _offset_map[_name] = offsets.WeekOfMonth(week=_iweek, weekday=_i)
        _rule_aliases[_name.replace('-', '@')] = _name

@jtratner
Copy link
Contributor Author

shoot, the _offset_map part shouldn't be there. _rule_aliases needs to be though (I think). I didn't cover everything, there's still a bunch of work for you to do after this.

@jtratner jtratner deleted the stop-instantiating-offsets-on-load branch October 15, 2013 03:35
@jtratner
Copy link
Contributor Author

I mean in the internal cleanup you previously proposed.

@cancan101
Copy link
Contributor

I dont think you need the alias part either. You handle that:

def _make_offset(key):
    """Gets offset based on key. KeyError if prefix is bad, ValueError if
    suffix is bad. All handled by `get_offset` in tseries/frequencies. Not
    public."""
    if key is None:
        return None
    split = key.replace('@', '-').split('-')

@cancan101
Copy link
Contributor

I will leave the alias in for now but I will remove the _offset_map?

@jtratner
Copy link
Contributor Author

yes

On Mon, Oct 14, 2013 at 11:42 PM, Alex Rothberg notifications@github.comwrote:

I will leave the alias in for now but I will remove the _offset_map ?


Reply to this email directly or view it on GitHubhttps://github.com//pull/5189#issuecomment-26307329
.

klass = prefix_mapping[split[0]]
# handles case where there's no suffix (and will TypeError if too many '-')
obj = klass._from_name(*split[1:])
obj._named = key
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason for setting the _named field here? This leads to some printing weirdness: #5148 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to preserve the special case existing behavior for shorter offset names (i.e. there are offsets that are synonyms and we wanted to preserve their repr output in the console).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue is depending how the same offset is created , it will print
differently.
On Apr 14, 2014 5:49 PM, "Jeff Tratner" notifications@github.com wrote:

In pandas/tseries/offsets.py:

+if not _np_version_under1p7:

  • Only 1.7+ supports nanosecond resolution

  • prefix_mapping['N'] = Nano

+def _make_offset(key):

  • """Gets offset based on key. KeyError if prefix is bad, ValueError if
  • suffix is bad. All handled by get_offset in tseries/frequencies. Not
  • public."""
  • if key is None:
  •    return None
    
  • split = key.replace('@', '-').split('-')
  • klass = prefix_mapping[split[0]]
  • handles case where there's no suffix (and will TypeError if too many '-')

  • obj = klass._from_name(*split[1:])
  • obj._named = key

to preserve the special case existing behavior for shorter offset names
(i.e. there are offsets that are synonyms and we wanted to preserve their
repr output in the console).


Reply to this email directly or view it on GitHubhttps://github.com//pull/5189/files#r11610049
.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay. 6 months later, I agree with you that we should just have canonicalized offset names.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looking back, that was probably a mistake on my part, I'm okay with changing it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants