x/build: maintner.golang.org not-accessible #21383

Closed
paulzhol opened this issue Aug 10, 2017 · 13 comments

@paulzhol
Member

commented Aug 10, 2017

The freebsd-arm-paulzhol builder has been stuck in a loop since yesterday, trying to build the sys subrepo but failing because it can't download it:

Error: runTests: looking up ref for "sys": rpc error: code = 13 desc = grpc: Post https://maintner.golang.org/apipb.MaintnerService/GetRef: dial tcp 35.188.67.38:443: i/o timeout

Also, the dashboard for the sys build reports HTTP 404. Attached are logs captured from farmer.golang.org:
temporarylogs.txt
temporarylogs2.txt

@josharian

Contributor

commented Aug 10, 2017

cc @kevinburke @andybons

I always suspected maintner was really just Brad on his phone.

@josharian josharian added the Builders label Aug 10, 2017

@kevinburke

Contributor

commented Aug 10, 2017

I don't have access to the production infrastructure, unfortunately. My guess is it's crashing on some bad input.

@kevinburke

Contributor

commented Aug 10, 2017

I think @jessfraz and @adams-sarah have access to production as well

@adams-sarah

Contributor

commented Aug 10, 2017

maintner is in a crash loop.

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x904f37]

goroutine 521 [running]:
main.tryWorkItem(0x0, 0xc42030d530)
/go/src/golang.org/x/build/maintner/maintnerd/api.go:92 +0x37
main.apiService.GoFindTryWork(0xc42000b000, 0x103bb80, 0xc438f43a10, 0xc452496d60, 0x0, 0x0, 0x0)
/go/src/golang.org/x/build/maintner/maintnerd/api.go:170 +0x4b7
golang.org/x/build/maintner/maintnerd/apipb._MaintnerService_GoFindTryWork_Handler(0xab0de0, 0xc42000b000, 0x103bb80, 0xc438f43a10, 0xc449069260, 0x0, 0x0, 0x0, 0x66cb2b, 0x106a9e0)
/go/src/golang.org/x/build/maintner/maintnerd/apipb/api.pb.go:360 +0x28d
grpc%2ego4%2eorg.(*Server).processUnaryRPC(0xc452847bc0, 0x103df80, 0xc4322b2360, 0xc4460ae200, 0xc4539ce150, 0x1027590, 0x0, 0x0, 0x0)
/go/src/grpc.go4.org/server.go:697 +0xaa0
grpc%2ego4%2eorg.(*Server).handleStream(0xc452847bc0, 0x103df80, 0xc4322b2360, 0xc4460ae200, 0x0)
/go/src/grpc.go4.org/server.go:873 +0x1261
grpc%2ego4%2eorg.(*Server).serveStreams.func1.1(0xc452496cd0, 0xc452847bc0, 0x103df80, 0xc4322b2360, 0xc4460ae200)
/go/src/grpc.go4.org/server.go:456 +0xa9
created by grpc%2ego4%2eorg.(*Server).serveStreams.func1
/go/src/grpc.go4.org/server.go:457 +0xa1

@adams-sarah

Contributor

commented Aug 10, 2017

Running to a meeting, but looks like a nil check (cl == nil) in maintnerd/api.go:167 would fix this.

@gopherbot gopherbot added this to the Unreleased milestone Aug 10, 2017

@gopherbot


commented Aug 10, 2017

Change https://golang.org/cl/54751 mentions this issue: x/build/maintner/maintnerd: check CL is found before doing work

@kevinburke

Contributor

commented Aug 11, 2017

From Sarah (https://golang.org/cl/54751):

"maintner is back up actually. wonder if someone rolled the prev cl back? or what.

but regardless, killing this CL for now."

Which makes me wonder if there is some sort of race or ordering problem.

One thing we could do is run this with the race detector on, either in production or in some sort of staging realm that mirrors the same data.

@paulzhol

Member Author

commented Aug 11, 2017

With maintner back up, freebsd-arm-paulzhol finished building the sys and net subrepos on 1.7 and 1.8, but now the builder is not receiving any work:

Reverse pool summary:
host-freebsd-arm-paulzhol: 0/1

Reverse pool machine detail
a20.home.idea-y.com (10.240.0.8:54566) version 15, host-freebsd-arm-paulzhol: connected 6h13m59.5s, idle for 359.5ms

My logs don't indicate anything out of the ordinary:

====================
2017/08/10 22:36:45 buildlet starting.
2017/08/10 22:36:45 Not on GCE; not remounting root filesystem.
2017/08/10 22:36:45 Dialing coordinator farmer.golang.org:443 ...
2017/08/10 22:36:45 Doing TLS handshake with coordinator (verifying hostname "farmer.golang.org")...
2017/08/10 22:36:46 Registering reverse mode with coordinator...
2017/08/10 22:36:46 Connected to coordinator; reverse dialing active
2017/08/11 03:17:12 buildlet reverse mode exiting.
====================
2017/08/11 03:17:59 buildlet starting.
2017/08/11 03:17:59 Not on GCE; not remounting root filesystem.
2017/08/11 03:17:59 Dialing coordinator farmer.golang.org:443 ...
2017/08/11 03:17:59 Doing TLS handshake with coordinator (verifying hostname "farmer.golang.org")...
2017/08/11 03:18:00 Registering reverse mode with coordinator...
2017/08/11 03:18:00 Connected to coordinator; reverse dialing active
@kevinburke

Contributor

commented Aug 11, 2017

@paulzhol, can you open a separate issue for that one?

@paulzhol

Member Author

commented Aug 11, 2017

@adams-sarah

Contributor

commented Aug 11, 2017

Hey @kevinburke, all. Turns out my CL is the right fix.
It's a data issue: it looks like Gerrit sometimes gets ahead of maintner, in which case the cl pointer is nil, and dereferencing it causes the panic.

So the issue only crops up periodically, which matches the maintner outages we are seeing.
I will get this CL in today.
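
For readers following along, the failure mode described above can be sketched roughly as follows. This is not the code from maintnerd/api.go or from the CL. The types `change`, `tryItem`, and `corpus`, and the helper `findTryWork`, are hypothetical stand-ins invented for illustration. Only the function name `tryWorkItem` and the `cl == nil` check come from the thread. The sketch assumes the fix amounts to skipping a CL that the local data does not know about yet.

```go
package main

import "fmt"

// change is a hypothetical stand-in for a CL record in the local corpus.
type change struct {
	Number  int32
	Project string
	Commit  string
}

// tryItem is a hypothetical stand-in for one unit of try work.
type tryItem struct {
	Project  string
	ChangeID int32
	Commit   string
}

// corpus maps CL numbers to locally known changes (an assumption made
// for this sketch). Looking up a CL that has not been synced returns nil.
type corpus map[int32]*change

func (c corpus) lookup(n int32) *change { return c[n] }

// tryWorkItem dereferences cl, so passing nil panics, as in the trace
// where the first argument is 0x0.
func tryWorkItem(cl *change) tryItem {
	return tryItem{Project: cl.Project, ChangeID: cl.Number, Commit: cl.Commit}
}

// findTryWork builds work items for the CL numbers reported upstream,
// skipping any CL the local corpus does not have yet.
func findTryWork(c corpus, upstream []int32) []tryItem {
	var items []tryItem
	for _, n := range upstream {
		cl := c.lookup(n)
		if cl == nil {
			// Upstream is ahead of the local data: skip instead of panicking.
			continue
		}
		items = append(items, tryWorkItem(cl))
	}
	return items
}

func main() {
	c := corpus{1: &change{Number: 1, Project: "sys", Commit: "abc123"}}
	// CL 2 exists upstream but has not reached the local corpus yet.
	items := findTryWork(c, []int32{1, 2})
	fmt.Println(len(items), items[0].Project) // prints "1 sys"
}
```

Without the guard, the call for CL 2 would hit the nil dereference inside `tryWorkItem`. With it, the unknown CL is simply left out of this round of results.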

@adams-sarah

Contributor

commented Aug 11, 2017

Waiting to deploy. Will ping back here when this fix is live.

@adams-sarah

Contributor

commented Aug 14, 2017

FYI, this is live as of Friday.

@golang golang locked and limited conversation to collaborators Aug 14, 2018
