Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Time series histogram plot example #18733

Merged
merged 8 commits into from Oct 27, 2020
Merged

Time series histogram plot example #18733

merged 8 commits into from Oct 27, 2020

Conversation

ecotner
Copy link
Contributor

@ecotner ecotner commented Oct 15, 2020

PR Summary

Follow up to issue #18643. This PR seeks to add an example to the gallery which shows an alternative way to visualize time series by reinterpreting them as 2d histograms, allowing for

  1. faster plotting of large numbers of time series
  2. easier identification of temporal patterns even under overwhelming noise

The example script generates a plot like the following, where a sinusoidal signal is visible in the 2nd/3rd plots despite large amounts of noise:
plot

PR Checklist

  • [N/A] Has pytest style unit tests (and pytest passes).
  • Is Flake 8 compliant (run flake8 on changed files to check).
  • New features are documented, with examples if plot related.
  • [N/A] Documentation is sphinx and numpydoc compliant (the docs should build without error).
  • [N/A] Conforms to Matplotlib style conventions (install flake8-docstrings and pydocstyle<4 and run flake8 --docstring-convention=all).
  • [N/A] New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • [N/A] API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).

@jklymak
Copy link
Member

jklymak commented Oct 15, 2020

This seems to jumpy through painful hoops to get the data into the right form. Suggest not to use hist2d, but rather https://numpy.org/doc/stable/reference/generated/numpy.histogram2d.html and then pcolormesh, or imshow. We are trying to get away from helpers like hist2d, and hist2d itself has quite a bizarre call signature.

@jklymak jklymak added this to the v3.4.0 milestone Oct 15, 2020
@ecotner
Copy link
Contributor Author

ecotner commented Oct 15, 2020

I'm not sure I understand; np.histogram2d takes pretty much the exact same input as plt.hist2d, so the same operations would be needed to get it into the right form. If you want to reduce the "painful hoops", I could get rid of the interpolation (this example doesn't really benefit from it much anyway).

@ecotner
Copy link
Contributor Author

ecotner commented Oct 15, 2020

I take that back; once you remove the interpolation the signal become impossible to see:
Screenshot from 2020-10-15 18-55-00
so I think it is critical that we include those steps.

@jklymak
Copy link
Member

jklymak commented Oct 15, 2020

x_fine = np.linspace(x.min(), x.max(), num_fine)  # x_fine.shape == (1_000,)
 y_fine = np.stack([np.interp(x_fine, x, Y[i]) for i in range(
     Y.shape[0])], axis=0)  # y_fine.shape = (10_000, 1_000)
 # convert into tensor of (x, y) pairs along the -1 axis
 xy = np.stack([np.broadcast_to(x_fine[None, :], y_fine.shape),
                y_fine], axis=-1)  # xy.shape == (10_000, 1_000, 2)
 xy = xy.reshape(-1, 2)  # xy.shape = (10_000_000, 2)

is the interpolation step? I've not checked this, but I'd do:

ntseries = Y.shape[0]
x_fine = np.linspace(x.min(), x.max(), num_fine)
y_fine = np.zeros(ntseries, num_fine)
for i in range(ntseries):
    y_fine[i, :] = np.interp(x_fine, x, Y[i, :])
y_fine = y_fine.flatten()
x_fine = np.repmat(x_fine, ntseries, 1).flatten()
xy = np.vstack([x_fine, y_fine])

My dislike of hist2d is packing x and y together, which seems an odd API.

@ecotner
Copy link
Contributor Author

ecotner commented Oct 15, 2020

Oh, you don't have to pack x and y together, I just did that because it was more convenient. The call signature of plt.hist2d takes x and y separately.

@ecotner
Copy link
Contributor Author

ecotner commented Oct 15, 2020

Also, do you know why the build is failing? I have found the CI process emits a warning like

/home/circleci/project/doc/gallery/statistics/time_series_histogram.rst:20: WARNING: py:obj reference target not found: plt.plot

but I'm not sure how to fix it. Are there supposed to be documentation files accompanying the python scripts?

@jklymak
Copy link
Member

jklymak commented Oct 15, 2020

`plt.plot`

in rst will try and resolve some module called that. But you mean it as a literal:

``plt.plot``

@ecotner
Copy link
Contributor Author

ecotner commented Oct 16, 2020

@jklymak I replaced the interpolation code with your suggestion, then also kept x_fine and y_fine as 1D vectors to pass to np.histogram2d, followed by pcolormesh. Does that address your concerns?

@jklymak
Copy link
Member

jklymak commented Oct 19, 2020

OK, so one more suggestion - rather than use just random, maybe try np.cumsum(np.random.randn)) to make the time series red instead of white? Your original examples in the issue were pretty reasonable uses of this, but here its hard to see what the advantage is.

@ecotner
Copy link
Contributor Author

ecotner commented Oct 20, 2020

Hmm yeah interpolating between the random noise is kind of messing up the visualization; a random walk with np.cumsum is a lot more "continuous" and more realistic for time series.

Ok, if I try noisy random walk superimposed with a slightly noisy sinusoidal signal (with amplitude equal to the RMS of the random walk displacement), with judicious choice of color coding, it looks a fair bit more impressive:
random_walk_hidden_signal
Does this look better to you?

Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know the speed is one of your points, but we can't easily have something this slow in the docs. Consider pointing out the speed difference in the text, but don't make the actual example be so slow. We run our CI very often....

examples/statistics/time_series_histogram.py Outdated Show resolved Hide resolved
_, axes = plt.subplots(nrows=3, figsize=(10, 6 * 3))

# Make some data; a 1D random walk + small fraction of sine waves
num_series = 10000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example takes over 40 seconds to run, out of a 15 minute process. It needs to be much faster to be worth it and I'm sure the effect can be achieved with fewer than 10,000 time series.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does the CI pipeline run over the same code multiple times or something? On my machine it only takes 5 seconds to run! I'll take a look and see if I can reduce the number of series while still illustrating the other advantages.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just look at the output on CI - it says the time...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reducing the number of time series from 10k -> 1k also reduces the time it takes (on my machine) from ~4 secs to ~0.3 secs, but now the sinusoidal signal is clearly visible in the top plot, which kind of takes away from the advantage of using the histogram:
asdf

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can increase alpha=0.1 to obfuscate it again, but this kind of feels like "cheating".
asdf

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of the above example(s)? The fact that you can see the sinusoid now when using plot kind of detracts from the proposed advantages of increased visibility, but the histograms definitely look "sexier" with the plasma color coding IMO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think thats fine. I think its also reasonable to increase the alpha. That alpha effect is pretty fragile at such a low value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, so CI now says it only takes 2.4 seconds after reducing the number of series by a factor of 10 (not quite linear scaling unfortunately). Does that seem low enough, or should we try and reduce the build time a bit more?

examples/statistics/time_series_histogram.py Outdated Show resolved Hide resolved
…etter illustration, eliminated loop over each time series line plot, changed color map to plasma
@ecotner
Copy link
Contributor Author

ecotner commented Oct 23, 2020

Hmm from looking at the CI logs, it seems that the failure is due to some entirely unrelated test called test_image_comparison_expect_rms, not anything I changed. Do you think re-running the tests again would might get them to pass? It seems like this function is deterministic, so perhaps not...

@ecotner ecotner requested a review from jklymak October 24, 2020 17:48
Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks fine, but a few suggestions....

examples/statistics/time_series_histogram.py Outdated Show resolved Hide resolved
examples/statistics/time_series_histogram.py Outdated Show resolved Hide resolved
examples/statistics/time_series_histogram.py Outdated Show resolved Hide resolved
examples/statistics/time_series_histogram.py Outdated Show resolved Hide resolved
examples/statistics/time_series_histogram.py Outdated Show resolved Hide resolved
# sinusoidal signal
num_signal = int(round(SNR * num_series))
phi = (np.pi / 8) * np.random.randn(num_signal, 1) # small random offest
Y[-num_signal:] = (
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit I'm not super clear what this is doing. It would be nice if you could explain the underlying signal in the intro or here so it is clear what the reader should be looking for in the data. Are you saying some signals are random walk and others are sines? Why does the amplitude of the sine change?

Copy link
Contributor Author

@ecotner ecotner Oct 24, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, so what I'm trying to do is show that this method of plotting can be helpful to find patterns buried under some noisy background. The random walk is the "noise/background", and the sinusoid is the "signal/pattern".

I add a little bit of random noise to the sine by 1) adding a small random offset phi to each series to shift it a bit left/right, and 2) add a little bit of additive noise with the np.random.randn to shift each point up/down. I would never expect a perfect sine signal in real data, so this is just to make it a bit more "realistic".

The amplitude of the sine changes simply because it would not be very visible on the plot otherwise. For a Gaussian random walk with stddev of σ (in this case, σ=1), the RMS displacement from the origin after n steps is σ*sqrt(n). So I scale the amplitude of the sine by this value so that it grows along with the random walk. Otherwise, the range of the sine would be restricted to +1/-1, while the random walk grows to have an RMS amplitude of +10/-10 (and non-negligible probability to have amplitudes of even higher magnitude as well; most of the time this plot will produce random walk series that go as high as +30/-30).

Does that make sense? I can add these details to the intro for clarity.

@ecotner ecotner requested a review from jklymak October 25, 2020 21:57
Copy link
Member

@jklymak jklymak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, final nitpick. This figure is really large and barely fits vertically onto my 27" screen when rendered in the docs. Nobody is going to publish a 10" wide figure, so suggest you make at least 60% the size it currently is.

cmap = copy(plt.cm.plasma)
cmap.set_bad(cmap(0))
h, xedges, yedges = np.histogram2d(x_fine, y_fine, bins=[400, 100])
axes[1].pcolormesh(xedges, yedges, h.T, cmap=cmap, norm=LogNorm(vmax=1.5e2))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also this will render faster if you do rasterized=True

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, doesn't really seem to make a difference on my machine, but maybe it will on others? I added another digit to the time elapsed output since it's getting pretty short already.

@ecotner
Copy link
Contributor Author

ecotner commented Oct 26, 2020

Sorry, final nitpick. This figure is really large and barely fits vertically onto my 27" screen when rendered in the docs. Nobody is going to publish a 10" wide figure, so suggest you make at least 60% the size it currently is.

Ahh yeah that's a holdover from the default size I use when I'm plotting in jupyter lab; my monitor is fairly large so I try and make the plots take up a bit more real estate. How does 6x8 sound? Keeps roughly the same aspect ratio (maybe a little shorter).


fig.tight_layout()
Copy link
Member

@jklymak jklymak Oct 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Constrained layout? I guess the other nit, is that the pcolormeshs should have colorbars...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured I'd use tight_layout() to reduce the whitespace in the margins. Is this a bad practice?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well if you add the colorbars, you will find that constrained_layout works better...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, I didn't even realize constrained_layout was even a thing. Only ever used tight_layout before.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well I wrote it, so I try to twist peoples' arms to use it ;-)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool! Does it actually use CSP or LP to optimize the layout, or is it like a heuristic solver?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what CSP or LP are 😉. It uses a linear constraint solver (kiwi solver) and relatively straight forward constraints....

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh; CSP = Constraint Satisfaction Problem/Programming, and LP = Linear Programming. They're very broad classes of numerical techniques to optimize some objective under constraints. Looks like kiwi solver is based off an LP algorithm, so I guess you know more about it than you think 😄

@jklymak
Copy link
Member

jklymak commented Oct 27, 2020

Thanks for all your work on this - it is a nice example, and I think others will find this a useful technique....

@jklymak jklymak merged commit 2b055a0 into matplotlib:master Oct 27, 2020
@ecotner
Copy link
Contributor Author

ecotner commented Oct 27, 2020

No problem. And thanks for all the feedback, I think it really made the end result way better!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants