runtime: "sweep increased allocation count" on linux-amd64-staticlockranking builder #38702

Open
bcmills opened this issue Apr 27, 2020 · 7 comments

Comments

@bcmills
Member

@bcmills bcmills commented Apr 27, 2020

2020-04-27T15:53:46-9b9556f/linux-amd64-staticlockranking

CC @danscales @mknyszek @aclements

Tentatively marking as release-blocker because this seems to indicate memory corruption in the runtime.

@aclements
Member

@aclements aclements commented May 14, 2020

These don't seem uncommon. Here are the ones from this year. They go back to 2016 (which could well be when we introduced this panic), but there was a clear uptick around 2020-03.

(Edited: @mknyszek found that all but the first two in this list were #37881)

$ greplogs -dashboard -E "error: sweep increased allocation count" -l -md

2020-04-27T15:53:46-9b9556f/linux-amd64-staticlockranking
2020-04-08T18:35:49-f7e6ab4/solaris-amd64-oraclerel
2020-03-24T19:05:50-f975485/linux-386-clang
2020-03-24T17:24:24-9dcd6b3/darwin-386-10_14
2020-03-24T14:21:50-ade9886/darwin-386-10_14
2020-03-24T10:33:13-9ef61d5/openbsd-386-64
2020-03-23T19:14:29-6aded25/freebsd-386-12_0
2020-03-23T17:56:24-5c9bd49/linux-386-clang
2020-03-23T17:23:03-5d47f87/freebsd-386-12_0
2020-03-23T17:07:22-67c2dcb/linux-386-387
2020-03-23T03:56:18-bb929b7/linux-386-sid
2020-03-22T08:42:38-787e7b0/linux-386-sid
2020-03-22T00:10:27-36b815e/linux-386-387
2020-03-21T02:46:16-287d67e/linux-386-387
2020-03-20T16:05:35-d965bb6/freebsd-386-11_2
2020-03-20T16:05:33-ab5a40c/linux-386
2020-03-20T08:42:30-9d468f4/openbsd-386-62
2020-03-20T00:27:02-a0917eb/freebsd-386-11_2
2020-03-20T00:27:02-a0917eb/linux-386-clang
2020-03-19T00:08:40-b3b174f/freebsd-386-11_2
2020-03-18T19:44:13-0205790/linux-386-387
2020-03-18T19:13:50-f1f947a/linux-386-sid
2020-03-18T18:59:32-e39de05/linux-386-387
2020-03-18T16:00:44-0c0e8f2/linux-386
2020-03-18T01:03:36-6412750/linux-386
2020-03-17T20:48:23-0eeec4f/freebsd-386-11_2
2020-03-17T17:10:51-14d20dc/freebsd-386-12_0
2020-03-17T01:24:30-0e44c69/linux-386
2020-03-16T20:59:27-ff1eb42/linux-386-387
2020-03-15T08:13:55-32dbccd/linux-386-clang
2020-03-14T07:03:15-d774d97/linux-386-sid
2020-03-13T20:43:12-e2a9ea0/openbsd-386-62

@mknyszek
Contributor

@mknyszek mknyszek commented May 14, 2020

There was an uptick in March, but then it slowed down considerably. Either something is masking the bug now, or it got fixed. Interestingly, all of the failures in that uptick are on 386. The last one before that block was plan9-arm in November.
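
One way to check the per-architecture observation is to tally the log names by architecture. The following is a minimal sketch, assuming each entry has the form `<timestamp>-<hash>/<goos>-<goarch>[-<suffix>]` as in the listing above. The helper `goarch` is hypothetical and is not part of greplogs.

```go
package main

import (
	"fmt"
	"strings"
)

// goarch extracts the architecture from a dashboard log name.
// Assumption: the builder name after the "/" is "<goos>-<goarch>[-<suffix>]".
func goarch(entry string) (string, bool) {
	i := strings.LastIndex(entry, "/")
	if i < 0 {
		return "", false
	}
	parts := strings.Split(entry[i+1:], "-")
	if len(parts) < 2 {
		return "", false
	}
	return parts[1], true
}

func main() {
	// A few entries copied from the listing in this thread.
	entries := []string{
		"2020-04-27T15:53:46-9b9556f/linux-amd64-staticlockranking",
		"2020-04-08T18:35:49-f7e6ab4/solaris-amd64-oraclerel",
		"2020-03-24T19:05:50-f975485/linux-386-clang",
		"2020-03-24T17:24:24-9dcd6b3/darwin-386-10_14",
	}
	counts := map[string]int{}
	for _, e := range entries {
		if arch, ok := goarch(e); ok {
			counts[arch]++
		}
	}
	fmt.Println(counts)
}
```

Feeding the full listing through the same loop would show whether everything apart from the two April entries lands in the 386 bucket.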

@aclements
Member

@aclements aclements commented May 14, 2020

I'm going to put together a CL to at least improve the debugging output from this.

@gopherbot

@gopherbot gopherbot commented May 14, 2020

Change https://golang.org/cl/234100 mentions this issue: runtime: detect and report zombie slots during sweeping

gopherbot pushed a commit that referenced this issue May 21, 2020
A zombie slot is a slot that is marked, but isn't allocated. This can
indicate a bug in the GC, or a bad use of unsafe.Pointer. Currently,
the sweeper has best-effort detection for zombie slots: if there are
more marked slots than allocated slots, then there must have been a
zombie slot. However, this is imprecise since it only compares totals
and it reports almost no information that may be helpful to debug the
issue.

Add a precise check that compares the mark and allocation bitmaps and
reports detailed information if it detects a zombie slot.
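
To illustrate the difference between the two checks described above, here is a minimal sketch. It is not the runtime's implementation: the `span` type, its fields, and the helper names are hypothetical simplifications that use one bit per slot in each bitmap.

```go
package main

import (
	"fmt"
	"math/bits"
)

// span is a simplified, hypothetical stand-in for a runtime span.
// It is NOT the runtime's mspan.
type span struct {
	nelems    int
	allocBits []uint8 // bit i set: slot i is allocated
	markBits  []uint8 // bit i set: slot i was marked by the GC
}

// popcount returns the number of set bits in the bitmap.
func popcount(b []uint8) int {
	n := 0
	for _, x := range b {
		n += bits.OnesCount8(x)
	}
	return n
}

// totalsCheck is the best-effort check: it only compares counts, so it
// fires only when there are more marked slots than allocated slots.
func totalsCheck(s span) bool {
	return popcount(s.markBits) > popcount(s.allocBits)
}

// zombies is the precise check: it returns the index of every slot that
// is marked but not allocated.
func zombies(s span) []int {
	var out []int
	for i := 0; i < s.nelems; i++ {
		mask := uint8(1) << (i % 8)
		if s.markBits[i/8]&mask != 0 && s.allocBits[i/8]&mask == 0 {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	s := span{
		nelems:    8,
		allocBits: []uint8{0b00000111}, // slots 0, 1, 2 allocated
		markBits:  []uint8{0b00001001}, // slots 0 and 3 marked
	}
	fmt.Println(totalsCheck(s)) // false: 2 marked <= 3 allocated
	fmt.Println(zombies(s))     // [3]
}
```

In this example the totals comparison misses the zombie in slot 3 because fewer slots are marked than allocated, while the per-slot comparison finds it and can report which slot it is.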

No appreciable effect on performance as measured by the sweet
benchmarks:

name                                old time/op  new time/op  delta
BiogoIgor                            15.8s ± 2%   15.8s ± 2%    ~     (p=0.421 n=24+25)
BiogoKrishna                         15.6s ± 2%   15.8s ± 5%    ~     (p=0.082 n=22+23)
BleveIndexBatch100                   4.90s ± 3%   4.88s ± 2%    ~     (p=0.627 n=25+24)
CompileTemplate                      204ms ± 1%   205ms ± 0%  +0.22%  (p=0.010 n=24+23)
CompileUnicode                      77.8ms ± 2%  78.0ms ± 1%    ~     (p=0.236 n=25+24)
CompileGoTypes                       729ms ± 0%   731ms ± 0%  +0.26%  (p=0.000 n=24+24)
CompileCompiler                      3.52s ± 0%   3.52s ± 1%    ~     (p=0.152 n=25+25)
CompileSSA                           8.06s ± 1%   8.05s ± 0%    ~     (p=0.192 n=25+24)
CompileFlate                         132ms ± 1%   132ms ± 1%    ~     (p=0.373 n=24+24)
CompileGoParser                      163ms ± 1%   164ms ± 1%  +0.32%  (p=0.003 n=24+25)
CompileReflect                       453ms ± 1%   455ms ± 1%  +0.39%  (p=0.000 n=22+22)
CompileTar                           181ms ± 1%   181ms ± 1%  +0.20%  (p=0.029 n=24+21)
CompileXML                           244ms ± 1%   244ms ± 1%    ~     (p=0.065 n=24+24)
CompileStdCmd                        15.8s ± 2%   15.7s ± 2%    ~     (p=0.059 n=23+24)
FoglemanFauxGLRenderRotateBoat       13.4s ±11%   12.8s ± 0%    ~     (p=0.377 n=25+24)
FoglemanPathTraceRenderGopherIter1   18.6s ± 0%   18.6s ± 0%    ~     (p=0.696 n=23+24)
GopherLuaKNucleotide                 28.7s ± 4%   28.6s ± 5%    ~     (p=0.700 n=25+25)
MarkdownRenderXHTML                  250ms ± 1%   248ms ± 1%  -1.01%  (p=0.000 n=24+24)
[Geo mean]                           1.60s        1.60s       -0.11%

(https://perf.golang.org/search?q=upload:20200517.6)

For #38702.

Change-Id: I8af1fefd5fbf7b9cb665b98f9c4b73d1d08eea81
Reviewed-on: https://go-review.googlesource.com/c/go/+/234100
Run-TryBot: Austin Clements <austin@google.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Cherry Zhang <cherryyz@google.com>

@toothrot
Contributor

@toothrot toothrot commented May 26, 2020

Hello! This is one of the few remaining issues blocking the Beta release of Go 1.15. We'll need to make a decision on this in the next week in order to keep our release on schedule.

@mknyszek
Contributor

@mknyszek mknyszek commented May 26, 2020

@toothrot I don't think we've seen this crash since the failure that caused Bryan to open this issue. @aclements' CL should give us a lot more information should we see this crash again, though. I think it's probably safe to not mark this as a beta-blocking issue, but we should keep an eye out for more such failures.

Also an interesting data point: all the failures that were happening through March involved the same size class (s.nelems=512 in every case I've looked at, which is size class 2; it's unclear from the crashes alone whether they're noscan, and this could be tiny allocator related). The two most recent failures aren't for the same size class. I tried to trace this back to a particular CL, but nothing seems obviously wrong. The March failures are also all very consistently on 386. I think whatever was going on in March is distinct from the last two failures, which happened in April. Digging through issues, I found that the March build failures are referenced by #37881, which is declared fixed.
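
The link between s.nelems=512 and size class 2 can be checked with simple arithmetic. This is a sketch under the assumption that, in the Go 1.15-era size class table, class 2 holds 16-byte objects in a single 8192-byte span. The constants and the helper `slotsPerSpan` are illustrative and are not runtime identifiers.

```go
package main

import "fmt"

// Assumed values for size class 2 in the Go 1.15-era size class table.
const (
	spanBytes = 8192 // one 8 KiB span
	elemSize  = 16   // bytes per object
)

// slotsPerSpan returns how many fixed-size objects fit in one span.
func slotsPerSpan(spanBytes, elemSize int) int {
	return spanBytes / elemSize
}

func main() {
	fmt.Println(slotsPerSpan(spanBytes, elemSize)) // 512
}
```

If the tiny allocator's blocks are also 16 bytes, as assumed here, its spans would have the same slot count, which is consistent with the tiny allocator guess.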

I think the only two crashes relevant to this thread are:

2020-04-27T15:53:46-9b9556f/linux-amd64-staticlockranking
2020-04-08T18:35:49-f7e6ab4/solaris-amd64-oraclerel

@aclements
Member

@aclements aclements commented May 26, 2020

Thanks for that sleuthing, @mknyszek! I agree that, given that this now seems rare, it shouldn't block the beta. There's also not much we can do until we get some more debugging information.

5 participants