[0.9.4-nightly] possible "catch up" bug with nodes down in a cluster #3960

Closed
iansmith opened this issue Sep 2, 2015 · 11 comments · Fixed by #4000


iansmith commented Sep 2, 2015

[I'm pretty new to influx, so please close if this is silly.]

I'm running a cluster of 3 nodes, per the instructions in the docs on how to create a cluster, on Ubuntu 14.10:

2015/09/01 23:10:13 InfluxDB starting, version 0.9.4-nightly-d6cb017, branch master, commit d6cb01722b2f618af84cb9633bd1fb2efb668a8d

These are all Amazon EC2 instances, connected via a security group. I wanted to see whether writes to the working nodes A and C while node B was down would eventually get "corrected" once B returned. In other words, can I run a query after B is brought back into the cluster and get the same answers from A, B, or C?

So, I used this script to write 10K points, carefully avoiding one of the nodes because I took it down immediately after database creation (I wasn't sure if db creation would affect anything):


#!/bin/bash
set -e


db="tempdb7"
curl -G http://ip-172-31-13-16.us-west-2.compute.internal:8086/query --data-urlencode "q=CREATE DATABASE $db"


###
### use this if you want to sleep a bit to give yourself time to take down the node you don't want
echo "sleeping 15 secs so you can turn off  ip-172-31-13-15.us-west-2.compute.internal"
sleep 15

for i in {1..3333}; do
    # one point per second starting at 2015-08-01 00:00:00 UTC, with random 0-59 "temperatures"
    t=$((1438387200 + i))
    temp1=$((RANDOM % 60))
    temp2=$((RANDOM % 60))
    temp3=$((RANDOM % 60))
    curl -i -XPOST "http://ip-172-31-13-16.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-a,region=us-east value=$temp1 $t"

    #this can be dropped to experiment with writes during the period when a node is down 
    #curl -i -XPOST "http://ip-172-31-13-15.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-b,region=us-west value=$temp2 $t"

    #this one is what should be used if you drop the one above
    curl -i -XPOST "http://ip-172-31-13-14.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-b,region=us-west value=$temp2 $t"

    curl -i -XPOST "http://ip-172-31-13-14.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-c,region=us-west value=$temp3 $t"

    echo "wrote point $i"
done

Then I brought the node back up after this script had finished.

In the log of the leader, I saw this when I brought up the node that had been previously down:

[handoff] 2015/09/02 21:37:07 6666 queued writes sent to node 2 in 25.509379966s

which seemed good to me.
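
(Side note for anyone following along: a quick way to confirm that the queue was drained is to grep the leader's log for the handoff entries. The log path in the sketch below is an assumption and depends on how influxd was installed on your nodes.)

# Log path is an assumption; adjust it for your install. This just pulls out
# the hinted-handoff activity on the leader so you can see the queued writes drain.
grep '\[handoff\]' /var/log/influxdb/influxd.log | tail -n 20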

Then I used the script below to read back the data I had just written, querying each of the three nodes in turn. The two that had been up throughout returned the expected counts. The one that had been down during the writes returned odd counts, suggesting it held only about 4,700 of the expected 10,000 points.

#!/bin/bash

set -e

db="tempdb7"

onethirdCount="3333"
twothirdsCount="6666"

testslicebyserver() {

    influxhost=$1
    datahost=$2

    shouldbecount=`curl -s -S -G "http://$influxhost:8086/query" --data-urlencode "db=$db" --data-urlencode "q=SELECT count(value) FROM cpu_temp WHERE host='$datahost'" | jq -r '.results[0].series[0].columns[1]'`
    n=`curl -s -S -G "http://$influxhost:8086/query" --data-urlencode "db=$db" --data-urlencode "q=SELECT count(value) FROM cpu_temp WHERE host='$datahost'" | jq '.results[0].series[0].values[0][1]'`

    if [ "$shouldbecount" != "count" ]; then
        echo "expected count but got $shouldbecount"
#       exit 1
    fi

    if [ "$n" != "$onethirdCount" ]; then
        echo "expected $onethirdCount but got $n"
        echo "host was $influxhost and data was $datahost"
#       exit 1
    fi
}

influxnodes=( "ip-172-31-13-16.us-west-2.compute.internal" "ip-172-31-13-15.us-west-2.compute.internal" "ip-172-31-13-14.us-west-2.compute.internal")
servers=("server-a" "server-b" "server-c")

for i in "${influxnodes[@]}"; do

    echo "#### influxnode $i"

    for j in "${servers[@]}"; do
        testslicebyserver "$i" "$j"
    done

    echo "### two thirds test $i"
    ###do a test of the other field that is not evenly divided
    n=`curl -s -S -G "http://$i:8086/query" --data-urlencode "db=$db" --data-urlencode "q=SELECT count(value) FROM cpu_temp WHERE region='us-west'"  | jq '.results[0].series[0].values[0][1]'`


    if [ "$n" != "$twothirdsCount" ]; then
        echo "failed uneven test, expected $twothirdsCount but got $n"
        exit 1
    fi
done

Perhaps I'm just confused about how this is supposed to work? That seems the most likely explanation.
thanks


Jhors2 commented Sep 2, 2015

I have noticed the same, actually. The logs show a whole slew of connections opening from the cluster members that stayed up to the previously dead node, writes occurring, and the connections then being closed. However, the member never actually seems to get back into a synced state.

I verified that the raft key is set to true on all servers in the cluster by looking at the output of "show servers".
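
(For anyone repeating that check, here is a minimal sketch. The hostnames are placeholders; it simply runs SHOW SERVERS against each node so the raft column can be compared across the cluster.)

#!/bin/bash
# Hostnames below are placeholders; substitute your own nodes.
for host in node-a node-b node-c; do
    echo "### $host"
    curl -s -G "http://$host:8086/query" --data-urlencode "q=SHOW SERVERS"
    echo
done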

beckettsean changed the title from 'possible "catch up" bug with nodes down in a cluster' to '[0.9.4-nightly] possible "catch up" bug with nodes down in a cluster' on Sep 2, 2015
beckettsean (Contributor) commented

@jwilder something amiss with hinted handoff?


desa commented Sep 3, 2015

@beckettsean @iansmith I was able to repro the issue. Here are the steps:

  1. Launch a cluster (verify that it's healthy)
  2. Create a database
  3. Take one of the nodes down
  4. Write 1 million points to the cluster
  5. Check that the other two nodes received all the points with select count(value) from cpu (a rough sketch of steps 4 and 5 follows this list)
  6. Bring up the node that was down. Wait a few minutes while the points propagate.
  7. Query the node that was brought up.

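The sketch below covers steps 4 and 5 only. The hostname, database name, and measurement are placeholders, the database is assumed to already exist (step 2), and points are sent as batched line-protocol writes to keep the number of requests down.

#!/bin/bash
set -e

db="stress"        # placeholder database (created in step 2)
total=1000000
batch=5000

# Step 4: write 1 million points in batches of $batch line-protocol lines.
for b in $(seq 0 $(( total / batch - 1 ))); do
    points=""
    for i in $(seq 1 $batch); do
        n=$(( b * batch + i ))
        points+="cpu value=$n $(( 1438387200 + n ))"$'\n'
    done
    curl -s -XPOST "http://node-a:8086/write?db=$db&precision=s" --data-binary "$points" > /dev/null
done

# Step 5: this count should equal $total on the nodes that stayed up.
curl -s -G "http://node-a:8086/query" --data-urlencode "db=$db" \
     --data-urlencode "q=SELECT count(value) FROM cpu"
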
I ran the test twice, and each time the node that had been brought down was missing points. I'm going to let the system sit for a while and see if it's a timing issue.


desa commented Sep 3, 2015

After letting it sit for a while, the points still seem to be missing.


desa commented Sep 3, 2015

One other thing to note: the number of missing points is very clean. When I wrote 2 million points, the third node reported 19345000 points; the results were similar for 1 million points.

beckettsean (Contributor) commented

It sounds like entire batches are failing to replicate, not just individual points within a batch.

@jwilder seems like something is amiss with hinted handoff. What other testing can we do or what details can we provide to help characterize it?

jwilder self-assigned this Sep 4, 2015

jwilder commented Sep 4, 2015

I'm able to reproduce some issues and am looking into it.

jwilder added a commit that referenced this issue Sep 4, 2015
The meta-store would open but may not have finished loading the raft log. If write
requests came in, they could fail or be dropped because of missing shard group
info.  This change makes the meta store only return after it has found the leader
and is really ready.

This change also fixed a race in the ClusterRestart test that may be causing it
to fail sporadically.

Fixes #3677 #3960
jwilder added a commit that referenced this issue Sep 4, 2015
The mux listener was handling connections and demuxing serially.  This could cause
issues if one handler was slow or blocked.  For example, if a node had many
hinted handoff writes queued for a down node, when the down node was started, it
would start handling the write requests (possibly before it had synchronized with
the cluster).  This would cause the connections to block and also prevent the cluster
synchronization from occurring because they both use the same mux listener.

Fixes #3960

desa commented Sep 4, 2015

woo!


iansmith commented Sep 4, 2015

2 days from bug report to fix! Awesome! 🏆


jwilder commented Sep 4, 2015

@mjdesa @iansmith Please re-open if you still see this issue occurring.

iansmith (Author) commented

I'm testing 0.9.4.1 and I'm not seeing the node that was down during the writes ever recovering. Should I re-open?
