Fix find primary timeout #2284
res = c.get("/node/network", log_capture=logs)
if res.status_code != 200:
    continue
assert res.status_code == http.HTTPStatus.OK.value, res
The /node/network endpoint isn't guaranteed to return 200, but at this stage we only loop through nodes that have joined the service, so this assert is actually safe.
view_change_in_progress = body["view_change_in_progress"]
primary_id = body["primary_id"]
if primary_id is not None:
Is this if necessary? It looks like line 702 has already checked that "primary_id" is present, and then we've assigned it here. Or is it possible to have a "primary_id": null in the response object?
It's not now, but when @jumaffre and I talked about it, I suggested we should always have a primary_id and it should be set to null when it's unknown, rather than not have the key at all when we don't know which node is primary.
There's a more general point here about what to do for optional fields in our JSON responses; we can talk about it.
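To make the two options concrete, here is a minimal sketch of client-side handling that accepts either convention. The helper name `primary_id_from_body` is hypothetical and not part of the test infrastructure; it only assumes the response body is a JSON object that may or may not carry a `primary_id` key.

```python
import json


def primary_id_from_body(body):
    """Return the primary ID, or None if it is unknown.

    Hypothetical helper: dict.get() returns None both when the key is
    absent and when it is present with a JSON null value, so the caller
    does not need to care which convention the endpoint follows.
    """
    return body.get("primary_id")


# Convention A: the key is omitted when the primary is unknown.
missing_key = json.loads("{}")
# Convention B: the key is always present and set to null when unknown.
null_value = json.loads('{"primary_id": null}')
# Either convention: the primary is known.
known = json.loads('{"primary_id": "node-0"}')
```

With this shape, an `if primary_id is not None:` check stays correct whichever convention we agree on.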
Let's discuss (and agree on) this separately. I'll merge this PR as is, since it is blocking other PRs, and will address further changes in a follow-up PR.
fix_find_primary_timeout@20029 aka 20210309.8 vs main ewma over 20 builds from 19707 to 20009

In #2241, I incorrectly changed the logic of network.find_primary() so that we don't exit early if the primary is found. This meant that some end-to-end tests that rely heavily on this function took longer, and started to fail in the Daily pipeline. This PR should fix this.
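For illustration only, the early-exit behaviour can be sketched as below. This is not the actual `network.find_primary()` code: the function name `find_primary_sketch`, the `get_network_info` callback, and the `timeout` and `poll_interval` parameters are all assumptions made to keep the example self-contained.

```python
import time


def find_primary_sketch(nodes, get_network_info, timeout=3.0, poll_interval=0.1):
    """Poll nodes until one reports a primary, returning as soon as it is known.

    Hypothetical sketch: `get_network_info(node)` stands in for a call to
    /node/network and is assumed to return the parsed JSON body as a dict.
    """
    end_time = time.monotonic() + timeout
    while True:
        for node in nodes:
            body = get_network_info(node)
            primary_id = body.get("primary_id")
            if primary_id is not None:
                # Exit early: no need to query the remaining nodes
                # or to wait for the rest of the timeout.
                return primary_id
        if time.monotonic() >= end_time:
            raise TimeoutError(f"No primary found after {timeout}s")
        time.sleep(poll_interval)
```

Without the early `return`, every call would run until the timeout expires even after the primary is known, which is the kind of slowdown described above.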