Node bounce causes healthy node to be dropped #64

Open
bernerdschaefer opened this issue Jun 5, 2013 · 0 comments
Given 3 doozerd nodes listening on 8046, 8047, and 8048 (as in the fire drill), when 8048 is killed and restarted, after the timeout window one of the other nodes is evicted from the cluster.

This script consistently reproduces the problem for me: https://gist.github.com/bernerdschaefer/5714419

Hitting the web UI after the cluster state has stabilized shows something like this:

/
    ctl/
        cal/
            0     (5)       D4HVNXRRANR4YRGQ
            1   (532)       
            2   (477)       73O2WLRB3DVRPG5V
        node/
            4MMJNJ76M5IDQSBQ/
                applied     (573)       572
            73O2WLRB3DVRPG5V/
                addr        (273)       127.0.0.1:8048
                applied     (574)       573
                hostname    (276)       precise64
                version     (279)       0.8+53+g985ed10
                writable    (538)       true
            D4HVNXRRANR4YRGQ/
                addr          (2)       127.0.0.1:8046
                applied     (575)       574
                hostname      (3)       precise64
                version       (4)       0.9.0-alpha
                writable     (57)       true
        ns/
            test/
                4MMJNJ76M5IDQSBQ    (123)       127.0.0.1:8047
                6D32P3ZOQDJIMVEV    (131)       127.0.0.1:8048
                73O2WLRB3DVRPG5V    (533)       127.0.0.1:8048
                D4HVNXRRANR4YRGQ      (6)       127.0.0.1:8046
        err     (541)       rev mismatch
        name      (1)       test

In this case, node 8047 (4MMJNJ76M5IDQSBQ) has been (partially) evicted from the cluster: it has been removed from /ctl/cal, and everything except "applied" has been removed from its node info.
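
The same check can be expressed as a small sketch. It does not talk to doozerd: the dictionaries are hand-copied from the dump above, and the function name `partially_evicted` and its rule (absent from `/ctl/cal` and missing `addr`) are assumptions made for illustration, not anything doozerd defines.

```python
# Values hand-copied from the web UI dump above (illustrative only).
# cal maps slot number -> node id; an empty string is an empty slot.
cal = {"0": "D4HVNXRRANR4YRGQ", "1": "", "2": "73O2WLRB3DVRPG5V"}

# node_info maps node id -> the keys present under /ctl/node/<id>/.
node_info = {
    "4MMJNJ76M5IDQSBQ": {"applied": "572"},
    "73O2WLRB3DVRPG5V": {
        "addr": "127.0.0.1:8048",
        "applied": "573",
        "hostname": "precise64",
        "version": "0.8+53+g985ed10",
        "writable": "true",
    },
    "D4HVNXRRANR4YRGQ": {
        "addr": "127.0.0.1:8046",
        "applied": "574",
        "hostname": "precise64",
        "version": "0.9.0-alpha",
        "writable": "true",
    },
}


def partially_evicted(cal, node_info):
    """Return ids that still have a node entry but hold no cal slot
    and have lost their addr key (the state described in this issue)."""
    active = {node_id for node_id in cal.values() if node_id}
    return sorted(
        node_id
        for node_id, info in node_info.items()
        if node_id not in active and "addr" not in info
    )


print(partially_evicted(cal, node_info))
```

With the values above, the only id reported is the one registered for 127.0.0.1:8047 under /ctl/ns/test.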

At this point, node 8047 is still running but produces no messages in the log. If node 8047 is killed and restarted, the cluster returns to normal operation.
