BUG: strange timeseries plot behavior #6608

Closed
rosnfeld opened this issue Mar 11, 2014 · 29 comments · Fixed by #7322
Labels
Testing pandas testing functions or related to the test suite Timeseries Visualization plotting
Milestone

Comments

@rosnfeld
Contributor

After some discussion below, here's a simple repro case:

import datetime
import pandas as pd

s1 = pd.Series([1, 2, 3], index=[datetime.datetime(1995, 12, 31), datetime.datetime(2000, 12, 31), datetime.datetime(2005, 12, 31)])
s2 = pd.Series([1, 2, 3], index=[datetime.datetime(1997, 12, 31), datetime.datetime(2003, 12, 31), datetime.datetime(2008, 12, 31)])

# plot first series, then add the second series to those axes, then try adding the first series again
ax = s1.plot()
s2.plot(ax=ax)
s1.plot(ax=ax)

causes

Traceback (most recent call last):
  File "simple_repro.py", line 10, in <module>
    s1.plot(ax=ax)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 2116, in plot_series
    plot_obj.generate()
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 920, in generate
    self._make_plot()
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 1482, in _make_plot
    self._make_ts_plot(data)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 1577, in _make_ts_plot
    _plot(data, 0, ax, label, self.style, **kwds)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 1553, in _plot
    style=style, **kwds)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tseries/plotting.py", line 82, in tsplot
    left, right = _get_xlim(ax.get_lines())
  File "/home/andrew/git/pandas-rosnfeld/pandas/tseries/plotting.py", line 226, in _get_xlim
    left = min(x[0].ordinal, left)
AttributeError: 'datetime.datetime' object has no attribute 'ordinal'

-- ORIGINAL MESSAGE --

Here's a small dataset:

date,region,value
1996-12-31,BRA,4.5
2003-12-31,BRA,3.7
2007-12-31,BRA,2.2
1995-12-31,COL,6.3
2000-12-31,COL,4.9
2005-12-31,COL,5.1
2010-12-31,COL,3.4
1997-12-31,PAN,6.3
2003-12-31,PAN,5.1
2008-12-31,PAN,3.9
1990-12-31,VEN,6.7
1991-12-31,VEN,5.4
1992-12-31,VEN,4.5
1993-12-31,VEN,4
1994-12-31,VEN,3.9
1995-12-31,VEN,4.1
1996-12-31,VEN,4.4
1997-12-31,VEN,4.5
1998-12-31,VEN,4.6
1999-12-31,VEN,4.1
2000-12-31,VEN,3.9
2007-12-31,VEN,3.7

If I read this in using

data = pd.read_csv('./data.csv', parse_dates=['date'], index_col='date')

and then try and plot it using

data.groupby('region').value.plot(legend=True)

I get more or less what I expect (perhaps the xlim doesn't go up to 2010-12-31, but otherwise fine).

If I delete out the BRA rows and try this again, I get:

Traceback (most recent call last):
  File "repro.py", line 6, in <module>
    data.groupby('region').value.plot()
  File "/home/andrew/git/pandas-rosnfeld/pandas/core/groupby.py", line 342, in wrapper
    return self.apply(curried)
  File "/home/andrew/git/pandas-rosnfeld/pandas/core/groupby.py", line 428, in apply
    return self._python_apply_general(f)
  File "/home/andrew/git/pandas-rosnfeld/pandas/core/groupby.py", line 432, in _python_apply_general
    self.axis)
  File "/home/andrew/git/pandas-rosnfeld/pandas/core/groupby.py", line 958, in apply
    res = f(group)
  File "/home/andrew/git/pandas-rosnfeld/pandas/core/groupby.py", line 426, in f
    return func(g, *args, **kwargs)
  File "/home/andrew/git/pandas-rosnfeld/pandas/core/groupby.py", line 333, in curried
    return f(x, *args, **kwargs)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 1921, in plot_series
    plot_obj.generate()
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 912, in generate
    self._make_plot()
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 1379, in _make_plot
    self._make_ts_plot(data, **self.kwds)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 1450, in _make_ts_plot
    _plot(data, 0, ax, label, self.style, **kwds)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tools/plotting.py", line 1434, in _plot
    style=style, **kwds)
  File "/home/andrew/git/pandas-rosnfeld/pandas/tseries/plotting.py", line 82, in tsplot
    left, right = _get_xlim(ax.get_lines())
  File "/home/andrew/git/pandas-rosnfeld/pandas/tseries/plotting.py", line 226, in _get_xlim
    left = min(x[0].ordinal, left)
AttributeError: 'datetime.datetime' object has no attribute 'ordinal'

If I delete out both BRA and VEN rows, then there is no exception raised but I only see one series plotted and the x-axis is not formatted as a date.

One could also approach this whole exercise via something like

data = pd.read_csv('./data.csv', parse_dates=['date'])
data.pivot(index='date', columns='region', values='value').plot()

but this works even worse, I just get a truncated VEN series and nothing else.

This is with current master pandas (but also happens in 0.13.1) and matplotlib 1.3.1.

Are there known issues with plotting sparse-yet-overlapping timeseries?

@rosnfeld
Contributor Author

I guess if I do

data.pivot(index='date', columns='region', values='value').interpolate().plot()

or

pivoted = data.pivot(index='date', columns='region', values='value')
pivoted.index = pd.to_datetime(pivoted.index)
pivoted.interpolate(method='time').plot()

I get something like what I wanted as it cleans up the missing values. Interpolate's a new feature I hadn't seen before. (cool!)

Maybe this is all user error, but I had been doing groupby plotting like this for a while and had been getting what looked like correct results. I feel there may actually be a bug here someplace, since that groupby behavior seems so bizarre.

@rosnfeld
Contributor Author

Actually, interpolation is not really what I want, as it makes it seem as if there is more data than is actually present. I basically just want to see the various region series all plotted as they would be if they were plotted individually, except all together on the same axes.

(and a loop like

figure = plt.figure()
ax = figure.gca()
data = pd.read_csv('./data.csv', parse_dates=['date'])
for region in data.region.unique():
    subset = data[data.region == region]
    subset = subset.set_index('date')
    subset.value.plot(ax=ax, label=region)

seems to just over-write the axes)

@jorisvandenbossche
Member

I think the groupby/plot issue certainly looks like a bug. I can't quite put my finger on it, but I think it has something to do with combining regular and irregular timeseries.

The xlim does not respect the data because it is updated by the last group (even though that is a smaller group). I think this is a separate issue (you can open another issue for that).

You seem to have already figured out why you get almost no points on the plot with pivot: it is indeed because of all the NaN values. You can also deal with this by plotting points instead of lines.

@rosnfeld
Contributor Author

Thanks @jorisvandenbossche, it looks like the "separate issue" is already filed as #2960 . So this one is just the groupby/plot weirdness.

Did you mean to add to your comment? (it ends in what looks like the start to some code)

@jorisvandenbossche
Member

@rosnfeld ah yes, I first wanted to add a code snippet showing an easier way to do your last example, but it hit the same bug, and I forgot to remove it. Removed it now.

@jreback jreback added this to the 0.14.0 milestone Mar 13, 2014
@jreback
Contributor

jreback commented Apr 21, 2014

@rosnfeld @jorisvandenbossche

so this is the exception that's in the top section?

what is causing this?

@rosnfeld
Contributor Author

Yeah, the exception is the most alarming thing, though changing the data slightly causes some other incorrect behavior (missing/incorrect plotting, which is harder to spot/diagnose).

I don't know what's causing it without further investigation, but I can try to investigate and hopefully submit a fix. (it will be my first time digging into the plotting code, not sure how involved it is)

@TomAugspurger
Contributor

The error is occurring in

def _get_xlim(lines):
    left, right = np.inf, -np.inf
    for l in lines:
        x = l.get_xdata()
        left = min(x[0].ordinal, left)
        right = max(x[-1].ordinal, right)
    return left, right

The line left = min(x[0].ordinal, left) apparently expects a PeriodIndex.
For whatever reason, when you select sub = data[data.region != 'BRA'] and plot that, you get an array of datetime.datetime objects at that point, instead of a PeriodIndex.
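The failure mode can be reproduced without pandas or matplotlib. The sketch below is only an illustration: `FakePeriod` and `FakeLine` are hypothetical stand-ins (not real pandas or matplotlib classes), and `get_xlim` mirrors the loop quoted above with `math.inf` in place of `np.inf`.

```python
import datetime
import math


class FakePeriod:
    """Hypothetical stand-in for a period value: it only carries an integer ordinal."""

    def __init__(self, ordinal):
        self.ordinal = ordinal


class FakeLine:
    """Hypothetical stand-in for a plotted line that exposes get_xdata()."""

    def __init__(self, xdata):
        self._xdata = xdata

    def get_xdata(self):
        return self._xdata


def get_xlim(lines):
    # Same logic as the quoted _get_xlim: assumes every x value has .ordinal.
    left, right = math.inf, -math.inf
    for line in lines:
        x = line.get_xdata()
        left = min(x[0].ordinal, left)
        right = max(x[-1].ordinal, right)
    return left, right


period_line = FakeLine([FakePeriod(25), FakePeriod(30), FakePeriod(35)])
datetime_line = FakeLine(
    [datetime.datetime(1997, 12, 31), datetime.datetime(2008, 12, 31)]
)

# Only period-like data on the axes: the limits are computed normally.
print(get_xlim([period_line]))

# One datetime-based line among the existing lines is enough to fail.
try:
    get_xlim([period_line, datetime_line])
except AttributeError as exc:
    print("mixed data fails:", exc)
```

The point of the sketch is that the error does not depend on the series currently being plotted, only on what is already on the axes.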

I'm not too familiar with our Datetime code, but does anyone know why these aren't the same?

In [40]: import datetime
In [41]: b = datetime.datetime(1997, 12, 31, 0, 0)

In [42]: y = pd.Period(year=1997, month=12, day=31, hour=0, minute=0, freq='S')

In [43]: b.toordinal()
Out[43]: 729389

In [44]: y.ordinal
Out[44]: 883526400

@jorisvandenbossche
Member

Some time ago I looked into this (for another issue), and one of the fundamental problems was that pandas timeseries plotting is split into two paths: datetimes are used for irregular series, and ordinals for regular series (which are converted to a PeriodIndex). Those two representations are incompatible with each other, so combining them in certain ways causes problems.

But I have to dig it up again to fully remember (I have an overview of the problem somewhere, but never finished it). I don't know if I will find time in the short term, but I will try.

@jorisvandenbossche
Member

Also, datetime.datetime.toordinal is in days, while pandas.Period.ordinal is in the frequency you specified (in this case seconds).
In addition, datetime.datetime.toordinal counts from 01/01/0001, whereas the pandas base is 1970.
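The two numbers from the session above can be related using only the standard library. This is a small sketch of the arithmetic, not of how pandas computes ordinals internally: shift the origin from year 1 to 1970, then convert days to seconds.

```python
import datetime

b = datetime.datetime(1997, 12, 31)

# datetime.toordinal(): days counted from 0001-01-01 (which is day 1).
days_since_year_one = b.toordinal()

# Seconds elapsed since 1970-01-01, which is what a second-frequency
# period ordinal counts for this timestamp.
epoch = datetime.datetime(1970, 1, 1)
seconds_since_epoch = int((b - epoch).total_seconds())

print(days_since_year_one)  # 729389
print(seconds_since_epoch)  # 883526400

# Shift the origin to 1970, then convert days to seconds.
assert (days_since_year_one - epoch.toordinal()) * 86400 == seconds_since_epoch
```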

@rosnfeld
Contributor Author

Yes, I looked at this a little last night, and agree with what you're saying - when you whittle the dataset down to just COL/PAN/VEN rows and then plot, COL and VEN get converted to have PeriodIndexes, but PAN stays with a DatetimeIndex for some reason, and then plotting them all on the same axes (via groupby) blows up somehow.

@TomAugspurger
Contributor

Thanks.

@rosnfeld you may want the x_compat=True keyword argument to plot. That seems to "solve" the problem
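For reference, a minimal sketch of that workaround applied to the simplified repro from the top of this issue. It assumes pandas and matplotlib are installed; the `Agg` backend is an assumption made here only so the snippet runs without a display.

```python
import datetime

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the sketch runs headless
import matplotlib.pyplot as plt
import pandas as pd

s1 = pd.Series(
    [1, 2, 3],
    index=[
        datetime.datetime(1995, 12, 31),
        datetime.datetime(2000, 12, 31),
        datetime.datetime(2005, 12, 31),
    ],
)
s2 = pd.Series(
    [1, 2, 3],
    index=[
        datetime.datetime(1997, 12, 31),
        datetime.datetime(2003, 12, 31),
        datetime.datetime(2008, 12, 31),
    ],
)

fig, ax = plt.subplots()
# Pass x_compat=True on every call so all three lines use the same
# (plain datetime) x-axis representation.
s1.plot(ax=ax, x_compat=True)
s2.plot(ax=ax, x_compat=True)
s1.plot(ax=ax, x_compat=True)

print(len(ax.get_lines()))
```

Passing the flag on every call, rather than only on the failing one, keeps all lines on the axes in one representation.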

@rosnfeld
Contributor Author

Indeed, thanks! I actually hadn't seen that option before. It also fixes the "missing series" variant I mentioned in the original description. The bad xlim variant for the original dataset still remains, though.

I'll try and dig into things a bit and see what can be done - I presume that not requiring x_compat is desirable.

@TomAugspurger
Contributor

Yeah it would be desirable. May be tricky though. I'm guessing that argument was added for cases precisely like this one.

@rosnfeld
Contributor Author

For a bit more detail, I think this is what's going on:

pandas uses special timeseries plotting if it can infer a "periodic" frequency from a series. While it takes a bit of digging, this is part of the _use_dynamic_x() check in tools/plotting.py:

    def _make_plot(self):
        # this is slightly deceptive
        if not self.x_compat and self.use_index and self._use_dynamic_x():
            data = self._maybe_convert_index(self.data)
            self._make_ts_plot(data)
        else:
            ...  # regular plotting

This special tseries logic converts plotted series to use a PeriodIndex, and sets a "base" version of the frequency on the axes object for later reference. (Note that x_compat disables all of this and uses regular, non-tseries plotting)

The first series to be plotted in my dataset (COL) gets a frequency of '5A-DEC', which can be converted to a period. In the timeseries plotting code the "base" version of this frequency ('A-DEC') gets assigned to the axes object.

The 3-item DatetimeIndex of 1997-12-31, 2003-12-31, and 2008-12-31 for the next series (PAN) has a surprising inferred frequency of 'WOM-5WED' since 1997, 2003, and 2008 all ended on a Wednesday (the 5th Wednesday of December). pandas can't convert frequencies like that to periods, so it uses regular plotting rather than the special timeseries plotting for that series, and its index is not converted to PeriodIndex. It doesn't try and use the axes frequency since it has already inferred a frequency for this series.
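The calendar coincidence behind that inferred frequency can be checked with the standard library alone. This sketch only verifies the weekday claim; it does not call pandas' frequency inference. The COL dates are included as a contrast, since they do not share a weekday.

```python
import datetime

pan_dates = [
    datetime.date(1997, 12, 31),
    datetime.date(2003, 12, 31),
    datetime.date(2008, 12, 31),
]
col_dates = [
    datetime.date(1995, 12, 31),
    datetime.date(2000, 12, 31),
    datetime.date(2005, 12, 31),
]


def week_of_month(d):
    # Which 7-day block of the month the date falls in (1-based);
    # the 31st is always in the 5th block.
    return (d.day - 1) // 7 + 1


# date.weekday(): Monday is 0, so Wednesday is 2.
pan_all_fifth_wednesday = all(
    d.weekday() == 2 and week_of_month(d) == 5 for d in pan_dates
)
col_weekdays = {d.weekday() for d in col_dates}

print(pan_all_fifth_wednesday)
print(len(col_weekdays) > 1)  # COL dates fall on more than one weekday
```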

The next and final series (VEN) does not have an inferred frequency, so it inherits the axes frequency, and tries to use tseries plotting again. tseries plotting tries to re-calculate x_lim's to include all data, so it looks at the lines already plotted, but it assumes all existing lines will have PeriodIndex data. It blows up when it tries to call 'ordinal' on the DatetimeIndex entries from the earlier (PAN) series.

I'm not sure what the right fix is here. Frequency inference clearly makes some interesting choices that are relied on in other parts of the codebase. I'm not sure if either the frequency inference or the usage of it should be modified. Timeseries plotting should maybe tolerate non-PeriodIndex data when calculating x_lim, though I don't understand much of that code yet, e.g. why PeriodIndex is desirable.

@jreback
Contributor

jreback commented Apr 23, 2014

@rosnfeld so this occurs when you have multiple series overlaid on the same plot and 1 is converted to PeriodIndex for display while 1 is not.

can you edit the top of the post to make it easily copy-pastable for the failing case?

@rosnfeld
Contributor Author

I added an even simpler example at the start of the post. No groupby or anything, just plot a timeseries with an inferred frequency that can be converted to a period, then one that can't, then the first one again, all on the same axes, and you get the same stack trace.

Hope that's along the lines of what you were looking for.

@jreback
Contributor

jreback commented Apr 23, 2014

ok...I guess the solution is for the plotting routines to check whether there is already a plot on the axis that has a conflicting axis/index, and then handle the current plotting better.

I am not sure if this involves too much introspection or is even possible (e.g. you would have to get the index state from the axis, and I am not sure 'enough' is saved to be able to figure out what is up)
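One possible shape for such a check, sketched without pandas or matplotlib. Everything here is hypothetical: `FakePeriod`, `is_period_based`, and `conflicts_with_existing` are illustrative names, not existing pandas functions, and the check relies only on whether x values expose an `.ordinal` attribute.

```python
import datetime


class FakePeriod:
    """Hypothetical stand-in for a period value carrying an integer ordinal."""

    def __init__(self, ordinal):
        self.ordinal = ordinal


def is_period_based(xdata):
    """True if every x value exposes an .ordinal attribute."""
    return all(hasattr(x, "ordinal") for x in xdata)


def conflicts_with_existing(existing_xdata, new_xdata):
    """Hypothetical pre-plot check: would adding new_xdata mix period-based
    and datetime-based lines on the same axes?"""
    new_kind = is_period_based(new_xdata)
    return any(is_period_based(xdata) != new_kind for xdata in existing_xdata)


period_x = [FakePeriod(25), FakePeriod(30), FakePeriod(35)]
datetime_x = [datetime.datetime(1997, 12, 31), datetime.datetime(2003, 12, 31)]

print(conflicts_with_existing([period_x], period_x))              # False
print(conflicts_with_existing([period_x], datetime_x))            # True
print(conflicts_with_existing([period_x, datetime_x], period_x))  # True
```

In this sketch the x data of the existing lines is the only state inspected, which is the introspection question raised above: whether the axes retain enough information to decide how the new series should be drawn.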

@rosnfeld give it a shot?

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Apr 23, 2014
@jreback
Contributor

jreback commented Apr 23, 2014

moving to 0.15, but if you are able to figure out soon can move back

@rosnfeld
Contributor Author

Sure, I can take a shot at it. I'm optimistic something can be done.

@rosnfeld
Contributor Author

rosnfeld commented Jul 2, 2014

I think this one should be re-opened - my bad for having a comment in #7322 saying "this does not fix #6608".

However, I think @sinhrks has some PRs that look to affect this behavior somewhat, changing this issue if not closing it.

@jreback jreback reopened this Jul 2, 2014
@sinhrks
Member

sinhrks commented Jul 5, 2014

#7459 partially fixes this, so that AttributeError is no longer raised.

However, it does not yet set the correct xlim and formatter. The result after #7459 is shown below.
[Image: figure_1]

@rosnfeld
Contributor Author

rosnfeld commented Jul 5, 2014

Well, regular vs irregular series have pretty different ordinals, as in @TomAugspurger's comment above, so I think the problem is unfortunately deeper than just xlim/representation. A solution might be to rework _use_dynamic_x() (in tools/plotting.py) to better catch cases that might mix these two together.

@jreback
Contributor

jreback commented Oct 4, 2014

@TomAugspurger push?

@jreback
Contributor

jreback commented Oct 5, 2014

@TomAugspurger status (pushing #7670) ok, so push this as well

@jorisvandenbossche
Member

It looks like this issue is solved in the meantime. At least the simplified example at the top now works correctly for me.

@rosnfeld Would you be able to test with your more complex example as well?

@rosnfeld
Contributor Author

rosnfeld commented Oct 1, 2016

Yes! I tested with the more complex example and everything works now. (as of 0.18.1)

@rosnfeld
Contributor Author

I see this is still open - should I close it?

Or do people want some unit tests to explicitly try to protect against this happening again? Unfortunately given our (or at least my) incomplete understanding of why it was happening and how it has since been fixed, perhaps the best we could do would be writing a test that would have failed against code from a couple of years ago.

Not sure what community practice is on things like this.

@jorisvandenbossche jorisvandenbossche added Difficulty Novice Testing pandas testing functions or related to the test suite and removed Bug labels Dec 16, 2016
@jorisvandenbossche
Member

Yes, I would first like to see a test added to confirm this (and keep it working!). A PR very welcome!
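A regression test along those lines might look like the sketch below. It is an assumption about what such a test could be, not the test that was eventually added: the function name is made up, it uses the simplified repro from the top of this issue, and it only checks that the three plot calls complete without raising.

```python
import datetime

import matplotlib

matplotlib.use("Agg")  # non-interactive backend so the test runs headless
import matplotlib.pyplot as plt
import pandas as pd


def test_overlapping_irregular_series_plot():
    # Data from the simplified repro in this issue.
    s1 = pd.Series(
        [1, 2, 3],
        index=[
            datetime.datetime(1995, 12, 31),
            datetime.datetime(2000, 12, 31),
            datetime.datetime(2005, 12, 31),
        ],
    )
    s2 = pd.Series(
        [1, 2, 3],
        index=[
            datetime.datetime(1997, 12, 31),
            datetime.datetime(2003, 12, 31),
            datetime.datetime(2008, 12, 31),
        ],
    )
    fig, ax = plt.subplots()
    try:
        # Plot the first series, add the second to the same axes, then add
        # the first again; the third call is the one that raised
        # AttributeError in the traceback at the top of this issue.
        s1.plot(ax=ax)
        s2.plot(ax=ax)
        s1.plot(ax=ax)
    finally:
        plt.close(fig)


test_overlapping_irregular_series_plot()
```

As noted above, a test like this mainly guards against the exception returning; it does not check that the xlim or the date formatting is correct.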

@jreback jreback modified the milestones: 0.21.0, Next Major Release May 24, 2017
stangirala pushed a commit to stangirala/pandas that referenced this issue Jun 11, 2017