Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quiver barb size not correct on some arches (ppc64, ppc64le...) #7797

Closed
AdamWill opened this issue Jan 11, 2017 · 30 comments

Comments

Projects
None yet
5 participants
@AdamWill
Copy link
Contributor

commented Jan 11, 2017

Continuing to work through test suite failures found when building on the Fedora ppc64 arch (big-endian) but not on little-endian arches. Two failures:

matplotlib.tests.test_quiver.test_barbs.test
matplotlib.tests.test_transforms.test_pre_transform_plotting.test

seem to be caused by differences in the lengths of 'barbs' on 'quivers' - the former is a test specific to such barbs, the latter includes a figure with some barbs as part of a larger test. In both cases, the lengths of the barbs come out subtly differently on ppc64:

barbs_test_image GOT: barbs_test_image
barbs_test_image EXPECTED: barbs_test_image-expected
pre_transform_data GOT: pre_transform_data
pre_transform_data EXPECTED: pre_transform_data-expected

the difference is quite subtle, it helps to A/B the images quickly in an image viewer. AFAICS, in all cases where the lengths are visibly different, the barbs actually produced are slightly shorter than expected.

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2017

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2017

This may be #4385 ?

@AdamWill AdamWill changed the title Quiver barb size not correct on ppc64 (big-endian) Quiver barb size not correct on some arches (ppc64, ppc64le...) Jan 11, 2017

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2017

This also affects ppc64le, so it's not a big-endian thing. It may also affect aarch64.

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Jan 11, 2017

I don't think #4385--that's a different rounding problem, IMO. It looks like some kind of pixel snapping issue to me.

I don't have a clue how to begin to figure out what's going on here without access to a machine...

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Jan 11, 2017

@puiterwijk ahoy! would you be willing to let an upstream matplotlib person have access to one of our sandbox ppc64 or aarch64 environments to try and look into this kinda thing?

@puiterwijk

This comment has been minimized.

Copy link

commented Jan 12, 2017

@AdamWill You can add their key to your test instance, and let them use that?

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Jan 12, 2017

I thought you'd already torn it down :) If you don't mind leaving it up a while longer, sure. @dopplershift , can you send me your ssh pubkey?

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Jan 18, 2017

Sorry, I've been swamped with conference prep. It will probably be no earlier than the 21st/22nd that I might be able to get to this. Not sure if that time frame works for you guys.

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Jan 18, 2017

I don't think it should be a problem, we should still be able to put up an instance for you at that time, I think.

@puiterwijk

This comment has been minimized.

Copy link

commented Jan 19, 2017

@AdamWill I'll keep your current instance around. Just let me know when I can clear it after they're done with it.

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Jan 31, 2017

@AdamWill Ok, I've got cycles now. Here's the public key:

ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDxtEykVuf26l7WVrYnvzVWmKdM2a4IBtH0O6wRqPBbG6+/ABmx+YZ+gF63Ry0ylZ9AXhJslUFxkSnuk8X1HjlXc19fyoJTWnAWVF2kNyJj+fUC095dExI5DBCVqtA/34tKBX/FXHr90Nbb7eiUK0v3pjwo1H34ATq0ggeTur/Gj3pEhk5fg+EtsESv2WFIbofQUWN7d7nHxLTNwF45JNacl8BK2McnclVhg+qVHf91vK+x7vqHRi/biHm2NT9fojlOvfC6q/wmCKzZkQ6H2uI9A6RzrlKWpx0HibgRkNHh0bBKvSRpKHKVHBcZLz/J+YHMZ4Sb5Pf7hKXsqRfMb8uX rmay
@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Aug 23, 2017

sorry, I kinda lost track of these remaining issues; did I ever set you up with the appropriate access, @dopplershift ? If not, should we try another round?

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2017

Not that I'm aware of. I have a few cycles I can use to look at this.

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Aug 23, 2017

same pubkey?

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2017

Yes.

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Aug 23, 2017

@puiterwijk could you possibly spin up a sandbox ppc64 instance for @dopplershift to use, with the pubkey given above? thanks!

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Aug 23, 2017

@dopplershift ok, you should be able to ssh to fedora@209.132.184.129 with that ssh key now, and you can sudo without a password if you need root access. Let us know if that doesn't work out for you.

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Aug 23, 2017

It's working...hoping to be able to give this a look tomorrow.

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Aug 31, 2017

Ok, so far what I can say is that barbs itself isn't producing different polygon vertices on ppc64, which means that this actually happens somewhere in the Agg layer...sigh.

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Aug 31, 2017

Not sure I have cycles to keep digging. Anyone from matplotlib feel like digging into a weird Agg rendering difference?

@tacaswell tacaswell added this to the 2.1.1 (next bug fix release) milestone Sep 1, 2017

@tacaswell

This comment has been minimized.

Copy link
Member

commented Sep 1, 2017

Did you look at the verts before or after running through all of the path simplification code?

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Sep 1, 2017

Before. I may not know enough about the path simplification code, but the differences we're seeing aren't coming from dropping a vertex.

@tacaswell

This comment has been minimized.

Copy link
Member

commented Sep 1, 2017

miss interpretting LE/BE tends to have catastrophic effects, I'm quite puzzled by this. There are some issues with float128 on ppc64 (see https://github.com/h5py/h5py/pull/842/files) I wonder if this is related (I hope not!)?

attn @sandrotosi do you see these with debian too?

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Sep 2, 2017

Note this is definitely not an LE / BE thing, per the comment on January 10. It also affects ppc64le.

@tacaswell

This comment has been minimized.

Copy link
Member

commented Sep 2, 2017

Sorry, did not read carefully enough 🐑

@tacaswell tacaswell removed this from the 2.1.1 (next bug fix release) milestone Oct 9, 2017

@tacaswell tacaswell added this to the 2.2 (next feature release) milestone Oct 9, 2017

@QuLogic

This comment has been minimized.

Copy link
Member

commented Oct 15, 2017

After a few days of trial and error, I think I've found the problem. It seems to originate in the Agg cross_product function. You can see the difference in this gist where I ran a few sample inputs from the broken test. The inputs are bit-for-bit identical between x86 and ppc at the start of cross_product, but only the first test case outputs the same result. The second and third tests are slightly less and slightly greater than the expected 0.

Interestingly, this difference only happens with both -O2 and -fPIC (which are both part of Fedora flags). I wonder if that qualifies as a gcc bug?

The slightly negative result is okay, but the slightly positive result triggers the wrong side of this check, meaning the stroking uses an inner join instead of an outer join, resulting in a couple fewer vertices. If this condition is changed to use some epsilon check, then ppc should work (though I'd need to see whether it breaks anything else.)

@dopplershift

This comment has been minimized.

Copy link
Contributor

commented Oct 17, 2017

Interesting--thanks for running this down @QuLogic !

I've seen this problem in other projects-->0 checks need to be checked against epsilon, just like everything else floating point.

@QuLogic

This comment has been minimized.

Copy link
Member

commented Oct 17, 2017

For now, I'm not sure if we should import the changes here, as it causes a few test failures on x86_64. I've opened this bug report to see if there's anything up with gcc itself as the trigger is a bit odd to me.

@QuLogic

This comment has been minimized.

Copy link
Member

commented Dec 22, 2017

gcc devs say this is an expected optimization (FMA), so I'm going to open a PR with the patch here.

@AdamWill

This comment has been minimized.

Copy link
Contributor Author

commented Dec 22, 2017

Thanks a lot for figuring this out, folks!

@QuLogic QuLogic modified the milestones: v2.2, v2.1.2 Jan 15, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.