okie dok i'm fairly certain i've tracked down the issue.
there is a for loop running in a goroutine on gitmirror that runs git fetch every N seconds.
the git fetch (origin being https://go.googlesource.com/blah) occasionally gets a 502.
when this happens, something kills the goroutine, and the git fetch loop never restarts.
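For reference, here is a minimal sketch of a fetch loop that logs a failed fetch and carries on to the next tick. This is not gitmirror's actual code: `fetchLoop`, `simulateFlakyFetch`, and the injected `fetch` function are hypothetical names for illustration, and the real loop would shell out to git fetch rather than call a stub.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log"
	"time"
)

// fetchLoop calls fetch once immediately and then once per interval until ctx
// is cancelled. A failed fetch is logged and retried on the next tick instead
// of ending the loop.
func fetchLoop(ctx context.Context, interval time.Duration, fetch func() error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if err := fetch(); err != nil {
			log.Printf("fetch failed, retrying on next tick: %v", err)
		}
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
		}
	}
}

// simulateFlakyFetch runs fetchLoop with a stand-in fetch whose first call
// fails (like the occasional 502) and reports how many calls were made.
func simulateFlakyFetch() int {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	calls := 0
	fetchLoop(ctx, time.Millisecond, func() error {
		calls++
		if calls == 1 {
			return errors.New("502 Bad Gateway") // simulated transient failure
		}
		if calls >= 3 {
			cancel() // stop the demo after a few successful fetches
		}
		return nil
	})
	return calls
}

func main() {
	fmt.Println("fetch attempts:", simulateFlakyFetch())
}
```

The point of the sketch is only that the error path stays inside the loop body, so a single bad response cannot end the goroutine.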
Ok so I've finally figured out what is happening. I just don't know where in the code it is happening yet.
Here is my update:
Essentially, we are continuously constructing a doubly linked list of all the commits on a given branch. When we find a new commit, we link it to its parent by the parent's SHA, so that parentCommit.child == newCommit && newCommit.parent == parentCommit.
Now, for the first new commit on a given branch (not the base commit the branch was forked from), this link is erased somewhere after it is created, so that parent.child == nil && child.parent == nil.
As a result, when we take a parent commit and try to iterate over its child commits to post each one to the dashboard, the parent of our new commit has no children. So the first commit on a branch is never sent to the dashboard.
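To make the linking and the failure mode concrete, here is a rough sketch. The `Commit` type, its field names, and the `link` and `descendants` helpers are assumptions for illustration, not the real gitmirror types; the sketch also assumes a single child per commit, as in the description above.

```go
package main

import "fmt"

// Commit is a hypothetical node in the per-branch list.
type Commit struct {
	Hash   string
	Parent *Commit
	Child  *Commit
}

// link connects child to parent in both directions, giving
// parent.Child == child && child.Parent == parent.
func link(parent, child *Commit) {
	parent.Child = child
	child.Parent = parent
}

// descendants walks Child pointers starting from c and returns the hashes it
// reaches, i.e. the commits that would be posted after c.
func descendants(c *Commit) []string {
	var out []string
	for n := c.Child; n != nil; n = n.Child {
		out = append(out, n.Hash)
	}
	return out
}

func main() {
	base := &Commit{Hash: "base"}
	first := &Commit{Hash: "first"}
	second := &Commit{Hash: "second"}
	link(base, first)
	link(first, second)
	fmt.Println(descendants(base)) // [first second]

	// Simulate the bug: the link between base and first is erased.
	base.Child, first.Parent = nil, nil
	fmt.Println(descendants(base)) // []
}
```

Once the link is gone, a walk from the base commit finds nothing to post, even though the new commits still exist in memory.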
The second piece is that the builders are set up so that if a commit is pushed to the dashboard and the dashboard does not recognize the new commit's parent SHA, it rejects the new commit. That means that if we lose one commit, we can never recover (until we restart gitmirror).
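A small sketch of why one lost commit blocks everything after it. The `dashboard` type and its `post` method are stand-ins invented for this example, assuming only the behavior described above (reject a commit whose parent SHA is unknown); they are not the real dashboard API.

```go
package main

import (
	"errors"
	"fmt"
)

// dashboard is a hypothetical stand-in that only tracks which hashes it knows.
type dashboard struct {
	known map[string]bool
}

var errUnknownParent = errors.New("unknown parent commit")

// post accepts a commit only if its parent hash is already known.
func (d *dashboard) post(hash, parent string) error {
	if !d.known[parent] {
		return fmt.Errorf("rejecting %s: %w (%s)", hash, errUnknownParent, parent)
	}
	d.known[hash] = true
	return nil
}

func main() {
	d := &dashboard{known: map[string]bool{"base": true}}
	// "first" is never posted because its link to "base" was lost,
	// so every later commit on the branch is rejected.
	for _, c := range [][2]string{{"second", "first"}, {"third", "second"}} {
		if err := d.post(c[0], c[1]); err != nil {
			fmt.Println(err)
		}
	}
}
```

In this model, posting "first" with parent "base" would unblock the rest of the branch, which matches the observation that a gitmirror restart recovers things.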
Thanks for looking into this. I've never really understood (or dug into) how this code works. I think @adg wrote it originally, but I've since moved it around a few times. It's possible I broke it during one of the moves. Or maybe it rotted a bit during the Mercurial-to-Git move or some other migration.