Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test suite issues on different Debian release architectures #1370

Closed
tillea opened this issue Apr 15, 2020 · 51 comments
Closed

Test suite issues on different Debian release architectures #1370

tillea opened this issue Apr 15, 2020 · 51 comments
Assignees

Comments

@tillea
Copy link

tillea commented Apr 15, 2020

Hi,
there is a new bug report concerning different Debian release architectures. For example the arm64 build log is mentioning the following errors:

128: Merging layouts 2 (igraph_i_layout_merge):      FAILED (layout.at:68)
195: Infomap community structure (igraph_community_infomap) : FAILED (community.at:99)
251: SCG of a graph, stochastic matrix (igraph_scg) : FAILED (scg.at:73)

There are even more errors on i386 architecture. I'll try to reproduce these by emulating that architecture and will attach this in another post to this bug report.
Kind regards, Andreas.

@szhorvat
Copy link
Member

Can you please upload the full contents of tests/testsuite.dir?

@tillea
Copy link
Author

tillea commented Apr 15, 2020 via email

@tillea
Copy link
Author

tillea commented Apr 15, 2020

This is the build in i386 emulation. It has the same tests failing as on the real hardware.
i386_testsuite_dir.zip

@szhorvat
Copy link
Member

Thanks for the report Andreas!

The bug you linked seemed to suggest that these failures started to happen only with 0.8.1. That would be strange, as these functions were not touched, at least not in any meaningful way. If it is not too much trouble, could you please try the same with 0.8.0 on the very same system (same hardware, same compiler)?

@szhorvat
Copy link
Member

szhorvat commented Apr 15, 2020

Was there anything else that changed, other than going from igraph 0.8.0 -> 0.8.1? I assume you built igraph with all external packages (external arpack, lapack, blas, etc.) Did those dependencies change?

Since some of these look like they may be due to numerical issues, could the compiler (different compiler version or different optimizations) be at least partly responsible?

@vtraag
Copy link
Member

vtraag commented Apr 15, 2020

This problem was already observed previously in issue #678. There, the discussion lead us to relax the testing because it does not seem to be possible to guarantee consistent output always (but feel free to suggest possibilities). We have not yet gotten around to correcting this. I presume this is necessary to be able to include the new igraph release in Debian? If so, we should treat this with a higher priority.

@tillea
Copy link
Author

tillea commented Apr 16, 2020 via email

@szhorvat
Copy link
Member

@tillea A lot of the affected code was already present in 0.7.1, so it is surprising that the tests passed before, but they are failing now. If it is not too much trouble, could you please help figure out what exactly changed? I strongly suspect that it is not just igraph, but it's the way you build it (dependencies, compiler, hardware).

I can't speak for others, but personally I am stuck at home with limited access to different systems to test on, which makes it difficult to get to the bottom of the problem. Still, I tried it on whatever I could get access to. All tests pass on:

  • macOS 10.14 with default compiler
  • Ubuntu 16.04 with default compiler (x86_64 in a VM)
  • Raspbian Buster with default compiler (32-bit ARM)
  • openSUSE Leap 15.1 with gcc 7.5.0 (x86_64 on our HPC cluster)

@tillea
Copy link
Author

tillea commented Apr 16, 2020 via email

@ntamas
Copy link
Member

ntamas commented Apr 16, 2020

My proposal: let's try to get this sorted out for i386 in some simulated environment (if we can reproduce this) and try the build again if it passes on i386. Maybe the relaxations that we build into the tests would also make them pass on other architectures.

@szhorvat when testing on Raspbian, did you build it with external LAPACK/ARPACK libraries? If so, did you use the official packages in Raspbian or did you use some other, alternative LAPACK/ARPACK implementation?

@tillea I don't have much experience with simulated i386 environments - is there a quickstart guide somewhere on the Debian pages that describes how to do this? I can try Virtualbox on my Mac or QEMU on an amd64-based Linux server at the university if any of these are feasible options.

@szhorvat
Copy link
Member

@ntamas No, I didn't build with external packages. This issue is almost certainly not because of i368, but because of using external packages. I do see some failures on macOS when using all external packages.

@szhorvat
Copy link
Member

It would be better to start with fixing up the tests so that they do not depend on the behaviour of the external dependencies (when those behaviours are correct). I suggest doing this first, and leaving the i386 stuff for later.

Doing this would take quite a bit of work because none of us have a full understanding of all functionality included in igraph. E.g., several of the SCG tests fail. One would need to read up on spectral coarse graining before one can understand if the failure is a true problem, or just an issue with the test being too specific.

@AdrianBunk
Copy link

This issue is almost certainly not because of i368, but because of using external packages.

There might be several problems, and one of them is related to i386:
The Debian i386 port still supports relatively old hardware, and therefore uses only the x87 FPU.
The x87 FPU has excess precision (80bit), which can result in different floating point results.
I am getting fewer (but not 0) failures when rebuilding the Debian package with "-msse2 -mfpmath=sse".

@szhorvat
Copy link
Member

szhorvat commented Apr 16, 2020

@AdrianBunk Do you know if x87 FPU instructions are supported at all on x86_64?

On macOS, if I use -mfpmath=387, I get

error: the '387' unit is not supported with this instruction set

I do not know if this is a limitation of clang, a limitation of macOS, or a limitation of the x86_64 instruction set.

If it were possible to use these instructions even in 64-bit mode, that would make it much easier for us to fix these problems.

@ntamas
Copy link
Member

ntamas commented Apr 16, 2020

Judging from this page, some problems are not related to test cases but indicate a general compilation failure (e.g., this one). It looks like on certain platforms the linker is not trying to link to the standard C++ library, which might be a misconfiguration on our end.

@szhorvat
Copy link
Member

szhorvat commented Apr 16, 2020

I split off the x87 issues into #1371, also to reduce the noise in this thread.

@AdrianBunk
Copy link

Judging from this page, some problems are not related to test cases but indicate a general compilation failure (e.g., this one). It looks like on certain platforms the linker is not trying to link to the standard C++ library, which might be a misconfiguration on our end.

Ignore these symbol errors, this is a Debian packaging issue. Debian has a way to track what symbols a shared library provides and when they were introduced, and can generate versioned dependencies in packages using the library based on the versions when the symbols used from the library were introduced instead of enforcing a dependency on the latest version of the library.
https://sources.debian.org/src/igraph/0.8.1+ds-1/debian/libigraph1.symbols/

@AdrianBunk
Copy link

Since i386-specific problems moved elsewhere, the problem in question is:
testsuite: 128 195 251 failed

Failing architectures for that:

  • arm64
  • i386
  • ppc*
  • s390x
  • ia64
  • riscv64

The ppc failures alone cover big/little endian and 32/64 bit, there is no obvious pattern which architectures fail.

@szhorvat
Copy link
Member

szhorvat commented Apr 16, 2020

@tillea Small note: I believe you do not need to pass --with-external-f2c to ./configure when you already have --with-external-blas --with-external-lapack --with-external-arpack. f2c simply won't be used.

@tillea
Copy link
Author

tillea commented Apr 16, 2020 via email

@tillea
Copy link
Author

tillea commented Apr 16, 2020 via email

@tillea
Copy link
Author

tillea commented Apr 16, 2020 via email

@szhorvat
Copy link
Member

Another followup: #1372 This still doesn't cover all issues reported here.

@tillea Can you let us know the precise versions of blas, lapack, arpack you use and link to?

@AdrianBunk
Copy link

Can you let us know the precise versions of blas, lapack, arpack you use and link to?

They are downloaded and installed at the beginning in the build logs at https://buildd.debian.org/status/package.php?p=igraph&suite=sid

Get:46 https://mirror.netcologne.de/debian unstable/main arm64 libblas-dev arm64 3.9.0-1 [107 kB]
Get:47 https://mirror.netcologne.de/debian unstable/main arm64 liblapack-dev arm64 3.9.0-1 [2777 kB]
Get:48 https://mirror.netcologne.de/debian unstable/main arm64 libarpack2-dev arm64 3.7.0-3 [99.5 kB]

@AdrianBunk
Copy link

Not sure whether it is related to any of the test failures, but some of the compile warnings in the build logs seem to point at real bugs.

foreign.c: In function ‘igraph_i_gml_tostring’:
foreign.c:1052:12: warning: function may return address of local variable [-Wreturn-local-addr]
 1052 |     return p;
      |            ^
foreign.c:1031:10: note: declared here
 1031 |     char tmp[256];
      |          ^~~

Something like this or code causing may be used uninitialized warnings can easily create the kind of bugs that appear or disappear based on random changes like compiler/architecture/optimization/...

@ntamas
Copy link
Member

ntamas commented Apr 16, 2020

Thanks @AdrianBunk , nice catch! I'll fix this soon, although it's probably unrelated (it is triggered only when reading GML files, and none of the problematic test cases do that).

I have fixed two test case failures on my computer (for the external ARPACK case) in e3f48fe and 55f88af; these will be cherry-picked back to the master branch soon.

@ntamas
Copy link
Member

ntamas commented Apr 16, 2020

f03d914 fixes the issue raised by @AdrianBunk ; c7fa487 probably fixes test case 128 (129 in the develop branch). Test case 251 is most likely fixed as well. Test case 195 seems to be unrelated to BLAS / LAPACK / ARPACK (the Infomap algorithm uses none of these) so we need to keep on investigating there.

@tillea
Copy link
Author

tillea commented Apr 16, 2020 via email

@vtraag
Copy link
Member

vtraag commented Apr 16, 2020

I believe that Infomap uses some eigenvector to get the stationary probabilities for a random walk. Presumably that then also relates to some of the external libraries?

@ntamas
Copy link
Member

ntamas commented Apr 17, 2020

Hmmm, you may be right. Nevertheless, I still need to reproduce the issue somehow before I can proceed further. I'll try an i386 chroot somewhere and see if the bug surfaces there.

@ntamas
Copy link
Member

ntamas commented Apr 17, 2020

Managed to reproduce the failures in an i386 chroot. Most of these (the ones that I actually understand) seem to be numerical inaccuracies and/or changes in labeling of clusterings that are semantically equivalent. I'll try to make them pass by modifying the test cases.

@ntamas ntamas self-assigned this Apr 17, 2020
@tillea
Copy link
Author

tillea commented Apr 17, 2020 via email

@ntamas
Copy link
Member

ntamas commented Apr 20, 2020

@tillea All tests pass now on i386 with the recent patches I have committed to the develop branch -- these will be merged back to master soon as well.

Is there a way to re-run all the tests on the Debian build architecture without making a new upstream release first? Say, if I send you a tarball with these 9 patches, can they be applied in the Debian build process temporarily until we have a new release? We can push this through in a few days if the patches seem to work in the Debian build environment as well.

@tillea
Copy link
Author

tillea commented Apr 20, 2020 via email

@ntamas
Copy link
Member

ntamas commented Apr 20, 2020

That's great -- here are the patches: patches.tar.gz.

@tillea
Copy link
Author

tillea commented Apr 20, 2020 via email

@tillea
Copy link
Author

tillea commented Apr 20, 2020 via email

@tillea
Copy link
Author

tillea commented Apr 21, 2020

Hi,
as you might have noticed the results became better when considering the number of failed tests but the issue is not solved finally. I'm attaching the testlog from the i386 build (hmmm, my interface does not like tar - so I use zip :-( )
tests_log_i386.zip
I'll compare with my arm64 (pinebook I just reactivated for this purpose) and let you know about more results.
Kind regards, Andreas.

@ntamas
Copy link
Member

ntamas commented Apr 21, 2020

Okay, this remaining issue will be easy to fix; it is simply a printing issue (the test case prints -0+0i instead of 0+0i). I'll try to fix it in the afternoon.

@ntamas
Copy link
Member

ntamas commented Apr 21, 2020

I'm a bit baffled by this; I have no idea how the test case managed to end up with -0+0i, but here's a stab in the dark: 6fb8f81 . I cannot reproduce the issue in my i386 chroot so I don't know whether it would work or not.

@tillea
Copy link
Author

tillea commented Apr 22, 2020 via email

@tillea
Copy link
Author

tillea commented Apr 22, 2020 via email

@ntamas
Copy link
Member

ntamas commented Apr 23, 2020

Okay, I think I'll just bite the bullet and write a custom matrix printer function for the test cases that replaces all occurrences of -0+0i, -0-0i and 0-0i with 0+0i.

@szhorvat
Copy link
Member

@ntamas Maybe a macro in the test utilities would help? Something like:

#define CANON(x) (x == 0.0 ? 0.0 : x)

Then this could be used in all functions that print doubles to quickly canonicalize results.

@szhorvat
Copy link
Member

szhorvat commented Apr 23, 2020

@ntamas On my machine, if a = -0.0; then a + 0.0 results in a positive zero. Tested with cling. Maybe that's an even simpler workaround. No CANON needed. But I'm not sure if this would work on all the myriads of architectures that Debian supports.

@ntamas
Copy link
Member

ntamas commented Apr 23, 2020

I suspect that it won't help here; what probably happens is that we end up with a small value in the real or the imaginary part of the complex number that is smaller than the printing accuracy but larger than the epsilon value that I use to force "small" values to zero. Therefore, the number would not satisfy the x == 0.0 condition but would nevertheless appear as 0 when printed to stdout. Not sure, though.

@szhorvat
Copy link
Member

Could be, but I'd expect most tiny numbers to print like 1.23e-300 (big negative exponent). Still, there's a lot of weirdness in floating point implementations, such as denormal numbers.

@ntamas
Copy link
Member

ntamas commented Apr 23, 2020

@tillea I hope this will be the last one: https://github.com/igraph/igraph/commit/ead7e0c59b8bb1745e0e50c892d986f9ccdb7f4f.patch

Can you please try adding this to the patchset? It should explicitly avoid printing -0 in the test case.

@tillea
Copy link
Author

tillea commented Apr 24, 2020 via email

@ntamas
Copy link
Member

ntamas commented Apr 24, 2020

Yay, passed everywhere except the kfreebsd ones - should we be worried about those? (It is due to missing build deps).

If this is okay, we will release 0.8.2 soon and then Debian can ship an unpatched 0.8.2 release to get the same result.

@tillea
Copy link
Author

tillea commented Apr 24, 2020 via email

@ntamas ntamas closed this as completed Apr 27, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants