Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: fix PeriodConverter issue when given a list of integers (GH9012) #9050

Closed

Conversation

jorisvandenbossche
Copy link
Member

For example coming up when [0, 1] is passed by axhline

Closes #9012

test and whatsnew are coming

@jorisvandenbossche jorisvandenbossche added this to the 0.15.2 milestone Dec 10, 2014
@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

the perf diff was massive
you need to find a better way

@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

something like this

In [8]: from pandas.tseries.converter import get_datevalue

In [9]: pr = pd.period_range('2000',periods=10000,freq='D')

In [10]: [ get_datevalue(v,pr.freq) for v in pr ][0:10]
Out[10]: [10957, 10958, 10959, 10960, 10961, 10962, 10963, 10964, 10965, 10966]

In [11]: pd.PeriodIndex(pr).values
Out[11]: array([10957, 10958, 10959, ..., 20954, 20955, 20956])

In [12]: %timeit pd.PeriodIndex(pr).values 
100000 loops, best of 3: 3.96 us per loop

In [13]: %timeit [ get_datevalue(v,pr.freq) for v in pr ]      
10 loops, best of 3: 116 ms per loop

Though i don't think this is actually causing the bug, they return the same results AFAICT

@jorisvandenbossche
Copy link
Member Author

In this case they return the same (as it are periods), but not when you provide eg ints:

In [2]: values = [0, 1]

In [3]: [ get_datevalue(v, 'B') for v in values ]
Out[3]: [0, 1]

In [4]: pd.PeriodIndex(values, freq='D').values
...
DateParseError: day is out of range for month

@jorisvandenbossche
Copy link
Member Author

And the thing is, if it is a PeriodIndex, it is already special cased: https://github.com/pydata/pandas/blob/v0.15.1/pandas/tseries/converter.py#L116, so this will not impact perfomance.

But, I think if you just do ts.plot() the x values are not provided as a PeriodIndex, but as an ndarray of Periods, so that hits the line in question, and in this way it will indeed impact performance.

@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

ok, so obviously need a test for that, I thing should be straightforward to fix. The thing is to avoid reinterpreting the freq when you already have a PeriodIndex.

@jorisvandenbossche
Copy link
Member Author

Maybe specifically check for an array of Periods?

In [38]: pr = pd.period_range('2000',periods=10000,freq='D')

In [39]: a = np.array([p for p in pr])

In [40]: a
Out[40]:
array([Period('2000-01-01', 'D'), Period('2000-01-02', 'D'),
       Period('2000-01-03', 'D'), ..., Period('2027-05-16', 'D'),
       Period('2027-05-17', 'D'), Period('2027-05-18', 'D')], dtype=object)

In [41]: pd.lib.infer_dtype(a)
Out[41]: 'period'

In [42]: pd.core.common.is_period_arraylike(a)
Out[42]: True

This is the performance difference (that you fixed for the general time series plotting):

In [49]: pd.PeriodIndex(a, freq='D').values
Out[49]: array([10957, 10958, 10959, ..., 20954, 20955, 20956], dtype=int64)

In [50]: [get_datevalue(x, 'D') for x in a][0:5]
Out[50]: [10957L, 10958L, 10959L, 10960L, 10961L]

In [51]: %timeit pd.PeriodIndex(a, freq='D').values
100 loops, best of 3: 7.88 ms per loop

In [52]: %timeit [get_datevalue(x, 'D') for x in a]
10 loops, best of 3: 167 ms per loop

In [53]: s = pd.Series(np.random.randn(10000), index=pd.date_range('2012-01-01', periods=10000, freq='s'))

# current master
In [36]: %%timeit
   ....: fig, ax = plt.subplots()
   ....: s.plot(ax=ax)

1 loops, best of 3: 207 ms per loop

# patched with how it was before
In [38]: %%timeit
   ....: fig, ax = plt.subplots()
   ....: s.plot(ax=ax)

1 loops, best of 3: 353 ms per loop

@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

you prob can use isinstance(a, np.ndarray) and is_period_arraylike(a)

because then a PeriodIndex would be excluded

@jorisvandenbossche
Copy link
Member Author

yes, but I don't think that is needed, because: PeriodIndex is already checked for in an if before that line, and also lists of Periods can go directly to PeriodIndex(..) I think

@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

ok then!

@jorisvandenbossche
Copy link
Member Author

ah, but is_period_arraylike does not seem to handle lists of Periods (gives False)

@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

In [1]: pr = pd.period_range('20130101',periods=3,freq='D')

In [2]: pr.tolist()
Out[2]: 
[Period('2013-01-01', 'D'),
 Period('2013-01-02', 'D'),
 Period('2013-01-03', 'D')]

In [3]: np.array(pr.tolist())
Out[3]: 
array([Period('2013-01-01', 'D'), Period('2013-01-02', 'D'),
       Period('2013-01-03', 'D')], dtype=object)

In [4]: pandas.core.common.is_period_arraylike(np.array(pr.tolist()))
Out[4]: True

really

@jorisvandenbossche
Copy link
Member Author

yeah, but I expected it to handle lists without having to convert it to array:

In [67]: pandas.core.common.is_period_arraylike(pr.tolist())
Out[67]: False

But in any case, not going to worry about that particular case for now

@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

ahh, ok, that makes sense then (it assuem ndarray ness)

@jorisvandenbossche
Copy link
Member Author

OK, updated the PR, no time now to further look at it, but will try to add tests later this evening

@jreback
Copy link
Contributor

jreback commented Dec 10, 2014

ok, will try to do the release tonight in that case

For example coming up when [0, 1] is passed by axhline

Restore old behaviour, but special case arrays of Periods to keep the
performance gain of using PeriodIndex instead of get_datevalue for that case.
@jreback
Copy link
Contributor

jreback commented Dec 11, 2014

merged via b69b9d3

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Plotting problem in pandas 0.15.1 for DataFrame with datetime-index
2 participants