
Refactor hist for less numerical errors #22773

Closed
wants to merge 1 commit

Conversation

@oscargus (Contributor) commented Apr 3, 2022

PR Summary

Should help with #22622

The idea is to do the computation on the edges rather than on the widths, and then take the diff of the result. This may be numerically better (or not...). More likely it is numerically worse, but it gives visually better results...
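The kind of rounding error at stake can be sketched in a few lines. This is a constructed float16 example (the values are chosen to hit a rounding tie, they are not from the PR):

```python
import numpy as np

# Two adjacent bin edges, both exactly representable in float16.
left = np.float16(0.5)
right = np.float16(1025.0)

# Widths-first: the width is rounded to float16 (the spacing of float16
# values near 1024 is 1.0), so reconstructing the right edge from
# left + width misses the true edge and leaves a visible gap.
width = right - left             # 1024.5 rounds to 1024.0 in float16
print(left + width)              # 1024.0, not 1025.0
print(left + width == right)     # False
```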

The alternative approach, providing a flag to bar/barh that makes sure adjacent bars are actually exactly adjacent, is probably better, but I wanted to see what comes out of this first...

PR Checklist

Tests and Styling

  • Has pytest style unit tests (and pytest passes).
  • Is flake8 compliant (install flake8-docstrings and run flake8 --docstring-convention=all).

Documentation

  • New features are documented, with examples if plot related.
  • New features have an entry in doc/users/next_whats_new/ (follow instructions in README.rst there).
  • API changes documented in doc/api/next_api_changes/ (follow instructions in README.rst there).
  • Documentation is sphinx and numpydoc compliant (the docs should build without error).

@jklymak (Member) commented Apr 3, 2022

Is the numerical problem the diff? Would it make sense to just convert the numpy bin edges to float64 before the diff?

@oscargus (Contributor, Author) commented Apr 3, 2022

Is the numerical problem the diff?

Hard to say. The problem is that quite a bit of computation happens, and at some stage rounding errors lead to overlaps or gaps between edges. Postponing the diff reduces the risk of that happening. (On the other hand, one may get cancellation as a result, but I do not think that will happen more often now, since the only things added here are of about the same order of magnitude.)
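The "postpone the diff" idea can be sketched outside of Matplotlib (hypothetical variable names, not the PR's actual code):

```python
import numpy as np

edges = np.float16([0.5, 1024.0, 1025.0, 2050.0])  # float16 bin edges

# Widths-first (old ordering): each right edge is recomputed as
# left + width, which can round away from the next bar's left edge.
widths = np.diff(edges)
rights_recomputed = edges[:-1] + widths

# Edges-first (this PR's ordering): lefts and rights are views of the
# same edge array, so adjacent bars share each interior edge exactly,
# by construction; no gap or overlap is possible.
lefts, rights = edges[:-1], edges[1:]
assert (rights[:-1] == lefts[1:]).all()
```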

Would it make sense to just convert the numpy bin edges to float64 before the diff?

Yes, or even float32. But as argued in the issue, one tends to use float16 in memory-limited environments, so it is not clear that one can afford the conversion.

Here, I am primarily trying to see the effect of it. Since not all the involved computations happen here (some are in bar/barh), the better approach may be a flag, "fill" or something, that makes sure all edges are adjacent when set. (I'm quite sure a similar problem can arise when feeding bar edges in float16 as well.)

@oscargus (Contributor, Author) commented Apr 3, 2022

It seems like we do not have any test images that are negatively affected by this, at least... But it may indeed not be the best solution to the problem.

@oscargus (Contributor, Author) commented Apr 3, 2022

Ahh, but even if the data passed to hist is float16, the actual histogram array doesn't have to be... And the histogram is probably much smaller than the data. So a simpler fix is probably to change the dtype of the histogram data before starting to process it...
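The size argument can be checked directly: the edge array has one entry per bin, so promoting it is cheap compared to the data. A small sketch (not PR code; the array sizes are illustrative):

```python
import numpy as np

data = np.zeros(1_000_000, dtype=np.float16)        # 2 MB of sample data
edges16 = np.linspace(0, 1, 11, dtype=np.float16)   # 10 bins -> 11 edges

# Promoting only the edges to float64 costs 88 bytes, not megabytes.
edges = np.asarray(edges16, dtype=float)
assert edges.nbytes == 88

# For these float16-derived values, float64 has enough precision that
# left + width reconstructs the right edge exactly.
assert np.all(edges[:-1] + np.diff(edges) == edges[1:])
```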

@jklymak (Member) commented Apr 4, 2022

I think you just want another type cast here (I guess I'm not sure about the difference between float and "float64"); at least this fixes the problem for me.

diff --git a/lib/matplotlib/axes/_axes.py b/lib/matplotlib/axes/_axes.py
index f1ec9406ea..88d90294a3 100644
--- a/lib/matplotlib/axes/_axes.py
+++ b/lib/matplotlib/axes/_axes.py
@@ -6614,6 +6614,7 @@ such objects
             m, bins = np.histogram(x[i], bins, weights=w[i], **hist_kwargs)
             tops.append(m)
         tops = np.array(tops, float)  # causes problems later if it's an int
+        bins = np.array(bins, float)  # causes problems if float16!
         if stacked:
             tops = tops.cumsum(axis=0)
             # If a stacked density plot, normalize so the area of all the
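A minimal sketch of what that extra cast does, assuming the returned edges keep the input's float16 dtype (which is what the linked issue reports):

```python
import numpy as np

x = np.linspace(0, 1, 1000).astype(np.float16)  # float16 input data
tops, bins = np.histogram(x, bins=10)

tops = np.array(tops, float)  # existing cast: avoids integer trouble later
bins = np.array(bins, float)  # proposed cast: avoids float16 rounding later
assert tops.dtype == np.float64 and bins.dtype == np.float64
```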

@timhoffm (Member) commented Apr 5, 2022

I guess I'm not sure the difference between float and "float64"

NumPy accepts built-in Python types and maps them to NumPy types:

https://numpy.org/doc/stable/reference/arrays.dtypes.html#specifying-and-constructing-data-types
(scroll a bit to "Built-in Python types").

The mapping can be platform specific. E.g. int maps to np.int64 on Linux but np.int32 on Windows.
float maps to np.float64 on x86 Linux and Windows, but I don't know whether that holds on ARM etc.
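The mapping can be inspected directly, since np.dtype accepts the built-in types:

```python
import numpy as np

# Built-in float maps to the platform's C double, i.e. np.float64 on
# mainstream platforms; built-in int maps to C long, which (per the
# comment above) has historically differed between Linux and Windows.
print(np.dtype(float))   # float64
print(np.dtype(int))     # platform dependent: int64 or int32
```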

3 participants