This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Remove broken host map subset assertions from nodetool #333

Merged 3 commits on May 8, 2018

Member

@wallrj wallrj commented Apr 14, 2018

Fixes: #331

Release note:

NONE

@wallrj
Member Author

wallrj commented Apr 14, 2018

/retest

Contributor

@munnerz munnerz left a comment

I'm not 100% sure on how this works, so am not able to effectively review it.

It'd be great to also see an e2e test added with this that tests the scenario described in #331. I'm apprehensive to close the issue without knowing for sure it's resolved 😄

leavingNodes, joiningNodes, movingNodes, mappedNodes,
)
}

Contributor

This function confuses me a lot, and I'm not really sure how it works.

Can you add a comment to the function itself explaining the rough algorithm in place here, so that this change can be reviewed properly?

Member Author

I've documented the function. Please take another look.

Contributor

👍 thanks a lot. Makes sense and lgtm 😃

@wallrj
Member Author

wallrj commented Apr 25, 2018

> I'm not 100% sure on how this works, so am not able to effectively review it.

I've added a docstring to the function as you suggested.

> It'd be great to also see an e2e test added with this that tests the scenario described in #331. I'm apprehensive to close the issue without knowing for sure it's resolved 😄

I doubt I'm going to be able to write an E2E test for this particular failure.
As far as I can tell, the error occurs because when a new C* node starts up and connects to a seed node, the rest of the nodes in the cluster see it as a live node,
but it has not necessarily generated and gossiped its host_id yet.
My original version of the golang nodetool status function assumed that every node whose state and status are known would also be present in the host_id map.
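
To illustrate the point about the host_id map, here is a minimal sketch of a status function that tolerates live nodes which have not gossiped a host_id yet. It is not the code from this PR: `NodeStatus`, `buildStatus` and the `example-host-id` value are hypothetical names chosen for the example, and the real nodetool package may be structured differently.

```go
package main

import (
	"fmt"
	"sort"
)

// NodeStatus is a hypothetical summary of what is known about one node.
type NodeStatus struct {
	Address string
	Live    bool
	HostID  string // empty until the node has gossiped its host_id
}

// buildStatus merges the live and unreachable sets with the host ID map.
// A node that is missing from hostIDs is reported with an empty HostID
// instead of being treated as an error.
func buildStatus(live, unreachable map[string]struct{}, hostIDs map[string]string) []NodeStatus {
	var out []NodeStatus
	for addr := range live {
		out = append(out, NodeStatus{Address: addr, Live: true, HostID: hostIDs[addr]})
	}
	for addr := range unreachable {
		out = append(out, NodeStatus{Address: addr, Live: false, HostID: hostIDs[addr]})
	}
	// Sort by address so the result does not depend on map iteration order.
	sort.Slice(out, func(i, j int) bool { return out[i].Address < out[j].Address })
	return out
}

func main() {
	// Two live nodes, but only one of them appears in the host ID map.
	live := map[string]struct{}{"172.17.0.10": {}, "172.17.0.11": {}}
	hostIDs := map[string]string{"172.17.0.10": "example-host-id"}
	for _, n := range buildStatus(live, nil, hostIDs) {
		fmt.Printf("%s live=%t hostID=%q\n", n.Address, n.Live, n.HostID)
	}
}
```

With this approach a newly joined node simply shows up with an empty host ID until gossip catches up, rather than failing the whole status call.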

@wallrj
Member Author

wallrj commented Apr 25, 2018

/retest

@munnerz
Contributor

munnerz commented Apr 26, 2018

> I doubt I'm going to be able to write an E2E test for this particular failure.

Would simply ensuring we can add a new node to an existing cluster without any of the other nodes failing their readiness probes test this case?
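
As a rough sketch of what such a check could look like, assuming the test can collect pod events into a simple list: the `Event` type, the `unhealthyEvents` helper and the pod names below are hypothetical, and a real test would read events from the Kubernetes API or from `kubectl` output instead.

```go
package main

import "fmt"

// Event is a hypothetical, simplified view of a Kubernetes pod event.
type Event struct {
	PodName string
	Reason  string
	Message string
}

// unhealthyEvents returns the events with reason "Unhealthy" that belong to
// one of the original pods. Events for newly added pods are ignored, since a
// new node may legitimately fail its probes while it is still starting.
func unhealthyEvents(events []Event, originalPods []string) []Event {
	original := make(map[string]struct{}, len(originalPods))
	for _, p := range originalPods {
		original[p] = struct{}{}
	}
	var failed []Event
	for _, e := range events {
		if _, ok := original[e.PodName]; ok && e.Reason == "Unhealthy" {
			failed = append(failed, e)
		}
	}
	return failed
}

func main() {
	// Hypothetical events gathered while scaling out from one pod to two.
	events := []Event{
		{PodName: "cass-0", Reason: "Unhealthy", Message: "Readiness probe failed: HTTP probe failed with statuscode: 500"},
		{PodName: "cass-1", Reason: "Unhealthy", Message: "Readiness probe failed"},
		{PodName: "cass-0", Reason: "Started", Message: "Started container"},
	}
	failed := unhealthyEvents(events, []string{"cass-0"})
	if len(failed) > 0 {
		fmt.Println("original pods were unhealthy during the scale out")
	}
}
```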

@wallrj
Member Author

wallrj commented Apr 27, 2018

/retest

@wallrj
Member Author

wallrj commented Apr 27, 2018

Ok. Looks like I triggered the failure:

I0427 15:03:40.431] 2018-04-27 15:02:22 +0000 UTC   2018-04-27 15:02:22 +0000 UTC   1         cass-test-np-region-1-zone-a-0.1529531f54f60de4   Pod       spec.containers{cassandra}   Warning   Unhealthy   kubelet, 432f5d7f-4a29-11e8-a8d5-0a580a1c0002   Liveness probe failed: HTTP probe failed with statuscode: 500

Unfortunately, the pilot logs aren't available because the container was restarted by the subsequent liveness probe test.

Perhaps I'll change the tests to exit after the first failure.

@wallrj
Member Author

wallrj commented Apr 27, 2018

Here we go. The test failed and the logs contain the nodetool error:

W0427 15:58:40.008] + echo 'TEST FAILURE: original pods were unhealthy during the scale out'
W0427 15:58:40.008] + exit 1
W0427 15:58:40.008] + dump_debug_logs /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.008] + local output_dir=/go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.008] + echo 'Dumping cluster state to /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs'
W0427 15:58:40.008] + mkdir -p /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
W0427 15:58:40.009] + kubectl cluster-info dump --all-namespaces --output-directory /go/src/github.com/jetstack/navigator/_artifacts/dump_debug_logs
I0427 15:58:40.110] Checking original pods for 'Unhealthy' events during scale out...
I0427 15:58:40.110] 2018-04-27 15:57:17 +0000 UTC   2018-04-27 15:57:17 +0000 UTC   1         cass-test-np-region-1-zone-a-0   Pod       spec.containers{cassandra}   Warning   Unhealthy   kubelet, 29966e7a-4a31-11e8-9444-0a580a1c540c   Liveness probe failed: HTTP probe failed with statuscode: 500
I0427 15:58:40.110] 2018-04-27 15:57:17 +0000 UTC   2018-04-27 15:57:17 +0000 UTC   1         cass-test-np-region-1-zone-a-0   Pod       spec.containers{cassandra}   Warning   Unhealthy   kubelet, 29966e7a-4a31-11e8-9444-0a580a1c540c   Readiness probe failed: HTTP probe failed with statuscode: 500

E0427 15:57:17.727230      15 listen.go:21] Error while running Check function for probe on port 12001: mapped nodes must be a superset of Live and Unreachable nodes. Live: map[172.17.0.11:{} 172.17.0.10:{}], Unreachable: map[], Mapped: map[172.17.0.10:{}]
E0427 15:57:17.738374      15 listen.go:21] Error while running Check function for probe on port 12000: mapped nodes must be a superset of Live and Unreachable nodes. Live: map[172.17.0.11:{} 172.17.0.10:{}], Unreachable: map[], Mapped: map[172.17.0.10:{}]

It only failed on the 1.7 cluster.

Now I'll commit the fix and expect the tests to pass consistently.

@munnerz
Contributor

munnerz commented May 8, 2018

/lgtm
/approve

@jetstack-bot
Collaborator

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: munnerz

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@jetstack-bot jetstack-bot merged commit e8fb08d into jetstack:master May 8, 2018
3 participants