
Test failures in 32-bit architectures #887

Closed
TheTincho opened this Issue Jul 11, 2015 · 32 comments

@TheTincho (Contributor) commented Jul 11, 2015

Hi,

I got a report from the Debian autobuilders that Prometheus fails to build on i386, armel and armhf, all of them 32-bit architectures.

They all show similar failures: the tests were expecting one value but got a different one. Any idea what could be causing this?

--- FAIL: TestEvaluations (0.12s)
    promql_test.go:33: error running test testdata/functions.test: error in eval resets(http_requests[20m]): expected 1 for {path="/foo"} but got 0
    promql_test.go:33: error running test testdata/legacy.test: error in eval rate(testcounter_reset_middle[1h]): expected 0.03 for {} but got 0.03333333333333333
FAIL
FAIL    github.com/prometheus/prometheus/promql 0.170s
--- FAIL: TestFuzzChunkType0 (0.32s)
    storage_test.go:1115: Value (or timestamp) mismatch, want 0.000000 (at time 1436373891.467), got 1.000000 (at time 1436373891.467).
    storage_test.go:1115: Value (or timestamp) mismatch, want 0.000000 (at time 1436368093.943), got 1.000000 (at time 1436368093.943).

[..]
    storage_test.go:1115: Value (or timestamp) mismatch, want 2147481355.000000 (at time 1436464500.04), got 2147483070.000000 (at time 1436464500.04).
    storage_test.go:1115: Value (or timestamp) mismatch, want 2147481391.000000 (at time 1436464346.74), got 2147483070.000000 (at time 1436464346.74).

[... many more...]

    storage_test.go:878: #1: failed on input 4106209714314777601
--- FAIL: TestFuzzChunkType1 (0.34s)
    storage_test.go:1115: Value (or timestamp) mismatch, want 0.000000 (at time 1436373891.807), got 1.000000 (at time 1436373891.807).
    storage_test.go:1115: Value (or timestamp) mismatch, want 0.000000 (at time 1436368094.283), got 1.000000 (at time 1436368094.283).
    storage_test.go:1115: Value (or timestamp) mismatch, want 0.000000 (at time 1436367063.307), got 1.000000 (at time 1436367063.307).
    storage_test.go:1115: Value (or timestamp) mismatch, want 2147483509.000000 (at time 1436328522.604), got 2147483775.000000 (at time 1436328522.604).
    storage_test.go:1115: Value (or timestamp) mismatch, want 0.000000 (at time 1436366243.083), got 1.000000 (at time 1436366243.083).

[...]

    storage_test.go:878: #1: failed on input 4106209714314777601

You can see the full logs here:

https://buildd.debian.org/status/fetch.php?pkg=prometheus&arch=i386&ver=0.14.0%2Bds-1&stamp=1436512554
https://buildd.debian.org/status/fetch.php?pkg=prometheus&arch=armel&ver=0.14.0%2Bds-1&stamp=1436512912
https://buildd.debian.org/status/fetch.php?pkg=prometheus&arch=armhf&ver=0.14.0%2Bds-1&stamp=1436512972

@TheTincho TheTincho changed the title Build issues in 32-bit architectures Test failures in 32-bit architectures Jul 11, 2015

@juliusv (Member) commented Jul 11, 2015

Strange. Especially in the resets(http_requests[20m]) case (https://github.com/prometheus/prometheus/blob/master/promql/testdata/functions.test#L13-L16), no floating point precision issues should be involved. Maybe it's something wrong with how timestamps are treated, since the missing reset in that test happens in the last sample in the series and that's also the time instant for the test evaluation.

Do you happen to have access to some 32-bit system? :)

@juliusv (Member) commented Jul 11, 2015

(Just running the tests with GOARCH=386 didn't reproduce the problem for me.)

@TheTincho (Contributor) commented Jul 11, 2015

Hi Julius,

I do have access to 32-bit machines, but sadly these are only for Debian Developers... :(

Now, looking at the build log for i386, I see it is actually a 64-bit machine running a 32-bit system. So I tried doing the same here, and I was able to reproduce the problem using an i386 chroot inside my amd64 system. Maybe you could do the same? If you run Debian or Ubuntu, creating a 32-bit chroot is as easy as running

$ sudo /usr/sbin/debootstrap --arch=i386 sid ~/chroot

@juliusv (Member) commented Jul 12, 2015

So I did the above under Ubuntu on a 64bit machine and ran make (also setting GOARCH=386 and allowing that override in Makefile.INCLUDE) inside the chrooted Prometheus checkout, and the tests still don't fail for me. Neither for HEAD, nor for 0.14.0.

Hmm, could you maybe detail all the exact steps (after the debootstrap) that led to your test failure? Maybe I did something slightly different...

@fabxc (Member) commented Jul 12, 2015

As this is rooted in the storage, my best guess would be some difference at the file system level. Do you have information on that @TheTincho?

@juliusv (Member) commented Jul 12, 2015

Oh, that'd be really strange. I'd still bet on some of the manual bit-fiddling magic in the double-delta encoding being off on 32-bit.

@TheTincho (Contributor) commented Jul 12, 2015

The steps I used are the ones from the Debian build, namely apt-get source prometheus and dpkg-buildpackage -b.

I have now tried doing a build with an untouched Prometheus 0.14.0 tree and Makefile, and the tests don't fail in the chroot either. So I guess I am hitting some bug in one of the dependencies that was solved in a different commit, and that the tests for that package are not detecting. The Debian tree separates the dependencies into different packages, and I might have made a mistake getting the right version...

I will try checking the dependencies' versions one by one then. Any hints as to which package might be responsible for this? goleveldb maybe?

@juliusv (Member) commented Jul 12, 2015

Oh, interesting! If it is indeed a dependency issue, the only one that I can imagine having any effect on this would be github.com/prometheus/client_golang.

@fabxc (Member) commented Jul 12, 2015

goleveldb does not store any sample values, so it's unlikely to be the package in question.
For client_golang, only the model package is involved in the tests, but that package has not changed in a long time aside from added constants, docs, and fingerprinting – nothing involving sample values.

@brian-brazil (Member) commented Jul 12, 2015

There was an alignment change for mutex correctness a while back: prometheus/client_golang@944920c

@fabxc (Member) commented Jul 12, 2015

Without that change, those tests wouldn't even have had the chance to fail on the 32-bit machine used; they would have panicked immediately.
It is also already in 0.14.0, which apparently works fine.

@juliusv (Member) commented Jul 12, 2015

Yeah, that also shouldn't be able to cause e.g. the resets function test failure (the instrumentation library is not involved there). I don't immediately see any other obvious change that would cause this though.

@TheTincho (Contributor) commented Jul 12, 2015

I checked client_golang, and I am using exactly the same commit.

@juliusv (Member) commented Jul 12, 2015

I also managed to reproduce the test failures now, using the build steps you described (took me a while to figure out the necessary Debianisms). It seems that it's actually the sample values, not the timestamps, that are off. Adding some debug statements to the resets implementation to print the output matrix gives this for path="/foo" for the test case eval instant at 50m resets(http_requests[20m]):

http_requests{path="/foo"} => 
7 @[1800], 
8 @[2100], 
9 @[2400], 
10 @[2700]

The sample values here (7, 8, 9, 10) do not match the test fixture data, which is:

http_requests{path="/foo"}  1 2 3 0 1 0 0 1 2 0

I suspect that the delta/double-delta-encoded chunks have a bug on 32-bit systems somewhere. Still looking into it, but probably won't make much progress anymore today due to jetlag and going to bed soon.
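For reference, the semantics the failing test exercises can be sketched in a few lines of Go. This is a simplified stand-in, not Prometheus's actual implementation; PromQL's resets() additionally restricts itself to the samples inside the range window:

```go
package main

import "fmt"

// resets counts how many times a counter series decreases, mirroring the
// core of PromQL's resets() (simplified sketch; the real function only
// considers samples within the given range window).
func resets(samples []float64) int {
	n := 0
	for i := 1; i < len(samples); i++ {
		if samples[i] < samples[i-1] {
			n++
		}
	}
	return n
}

func main() {
	// The fixture series for {path="/foo"} quoted above.
	series := []float64{1, 2, 3, 0, 1, 0, 0, 1, 2, 0}
	fmt.Println(resets(series)) // 3 resets over the whole series
}
```

Over the full fixture series this counts three resets (3→0, 1→0, 2→0); within the 20m window at the 50m evaluation instant, only one of them is visible, which matches the expected value of 1 in the test output above.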

@juliusv (Member) commented Jul 12, 2015

(No clue, of course, why this doesn't happen when compiling via the Makefile.)

@TheTincho (Contributor) commented Jul 12, 2015

So, you are saying this might be a bug in the Prometheus code and not in the dependencies then?

Thanks a lot for looking into this!

@juliusv (Member) commented Jul 12, 2015

Maybe! I still don't understand why it would only happen when using all the Debian dependencies (including the Debian Go compiler, etc.), and not when using the Makefile. So far, the actual bug could still be anywhere...

@TheTincho (Contributor) commented Jul 12, 2015

I think I found something...

I tried compiling with the Makefile but using the Debian dependencies, and it still works OK. So what I tried next was to use the Debian Go binaries instead of the official ones, and then I do get the test failures. So this might actually be a problem with the Go compiler.

@TheTincho (Contributor) commented Jul 12, 2015

(or with the Go runtime)

@juliusv (Member) commented Jul 12, 2015

I found out one deviation. The following program:

package main

import "fmt"

func main() {
    a := -10.0
    fmt.Println(byte(a))
}

...prints 0 with the Debian Go compiler and 246 with the http://play.golang.org/ one (and mine). Perhaps casting a negative value to byte is undefined behavior. Digging deeper...

@juliusv (Member) commented Jul 12, 2015

So it turns out that with the Debian compiler, you have to change this to get correct results:

uint8(negativeValue) -> uint8(int8(negativeValue))
uint16(negativeValue) -> uint16(int16(negativeValue))
uint32(negativeValue) -> uint32(int32(negativeValue))
uint64(negativeValue) -> uint64(int64(negativeValue))

That initial conversion to signed integer seems to make the later conversion to unsigned work equivalently in both compilers. Will look into this some more tomorrow.

@juliusv (Member) commented Jul 12, 2015

This issue explains why: https://code.google.com/p/go/issues/detail?id=3966

"In all non-constant conversions involving floating-point or complex values, if the result type cannot represent the value the conversion succeeds but the result value is implementation-dependent."

@TheTincho (Contributor) commented Jul 12, 2015

Ah, great find! The bad news is that it then cannot be considered a bug in Debian's golang-go packages :-)

Meanwhile, I was able to make the Debian package work by only using the upstream Go compiler and not changing anything else.

juliusv added a commit that referenced this issue Jul 13, 2015

storage: Fix float->uint conversions on some compilers.
See #887, which will at
least be partially fixed by this.

From the spec https://golang.org/ref/spec#Conversions:

"In all non-constant conversions involving floating-point or complex
values, if the result type cannot represent the value the conversion
succeeds but the result value is implementation-dependent."

This ended up setting the converted values to 0 on Debian's Go 1.4.2
compiler, at least on 32-bit Debians.
@juliusv (Member) commented Jul 13, 2015

#890 to fix this. With this PR, all tests pass for me when I run them manually with Debian's Go compiler and a HEAD checkout of Prometheus. When I apply a local modification to Debian's Prometheus source install, I still get some errors, including some that seem to have different causes, like YAML errors. There might be other bad dependencies in there.

@TheTincho (Contributor) commented Jul 13, 2015

I have backported #890 to 0.14.0 as a Debian patch, and I get no errors now!

I am not sure what could be the cause of the errors you are seeing, but I'd be happy to help debug them.

@juliusv (Member) commented Jul 13, 2015

@TheTincho Oh, I don't need my errors debugged; I probably just did something wrong in my chroot. If it works without errors in your installation now, that sounds good :)

@TheTincho (Contributor) commented Jul 13, 2015

I will double-check this on real 32-bit hardware, but I think it is solved now.

@juliusv (Member) commented Jul 13, 2015

👍

@beorn7 (Member) commented Jul 13, 2015

@TheTincho Please close this once you have verified it on real 32-bit hardware.

@TheTincho (Contributor) commented Jul 13, 2015

@beorn7 The main issue seems solved, but I experienced other errors, like Julius did. I am trying to determine if these are related or not.

@TheTincho (Contributor) commented Jul 26, 2015

So it seems to be a different issue; closing this now, and I will report on the other issues separately.

@TheTincho TheTincho closed this Jul 26, 2015

lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
