[0.9.4-nightly] possible "catch up" bug with nodes down in a cluster #3960

Closed
iansmith opened this issue Sep 2, 2015 · 11 comments · Fixed by #4000


iansmith commented Sep 2, 2015

[I'm pretty new to influx, so please close if this is silly.]

I'm running a cluster of 3 nodes, per the instructions in the docs on how to create a cluster, on Ubuntu 14.10:

2015/09/01 23:10:13 InfluxDB starting, version 0.9.4-nightly-d6cb017, branch master, commit d6cb01722b2f618af84cb9633bd1fb2efb668a8d

These are all Amazon EC2 instances, connected via a security group. I wanted to see whether writes to the working nodes A and C while node B was down would eventually get "corrected" once B returned. In other words, can I run a query after B is brought back into the cluster and get the same answers from A, B, or C?

So, I used this script to write 10K points, carefully avoiding one of the nodes because I took it down immediately after database creation (I wasn't sure if db creation would affect anything):


#!/bin/bash
set -e


db="tempdb7"
curl -G http://ip-172-31-13-16.us-west-2.compute.internal:8086/query --data-urlencode "q=CREATE DATABASE $db"


###
### use this if you want to sleep a bit to give yourself time to take down the node you don't want
echo "sleeping 15 secs so you can turn off  ip-172-31-13-15.us-west-2.compute.internal"
sleep 15

for i in {1..3333}; do
    # one point per second starting at 2015-08-01 00:00:00 UTC, with random 0-59 "temperatures"
    t=$((1438387200 + i))
    temp1=$((RANDOM % 60))
    temp2=$((RANDOM % 60))
    temp3=$((RANDOM % 60))
    curl -i -XPOST "http://ip-172-31-13-16.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-a,region=us-east value=$temp1 $t"

    #this can be dropped to experiment with writes during the period when a node is down 
    #curl -i -XPOST "http://ip-172-31-13-15.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-b,region=us-west value=$temp2 $t"

    #this one is what should be used if you drop the one above
    curl -i -XPOST "http://ip-172-31-13-14.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-b,region=us-west value=$temp2 $t"

    curl -i -XPOST "http://ip-172-31-13-14.us-west-2.compute.internal:8086/write?db=$db&precision=s" --data-binary "cpu_temp,host=server-c,region=us-west value=$temp3 $t"

    echo "wrote point $i"
done

Then I brought the node back up after this script had finished.

In the log of the leader, I saw this when I brought up the node that had been previously down:

[handoff] 2015/09/02 21:37:07 6666 queued writes sent to node 2 in 25.509379966s

which seemed good to me.
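
(Side note for anyone following along: a quick way to confirm that the queue was drained is to grep the leader's log for the handoff entries. The log path in the sketch below is an assumption and depends on how influxd was installed on your nodes.)

# Log path is an assumption; adjust it for your install. This just pulls out
# the hinted-handoff activity on the leader so you can see the queued writes drain.
grep '\[handoff\]' /var/log/influxdb/influxd.log | tail -n 20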

Then I used the script below to read back the data I had just written, querying each of the three nodes in turn. The two that had been up throughout returned the expected counts. The one that had been down during the writes returned odd counts, suggesting it held only about 4,700 of the expected 10,000 points.

#!/bin/bash

set -e

db="tempdb7"

onethirdCount="3333"
twothirdsCount="6666"

testslicebyserver() {

    influxhost=$1
    datahost=$2

    shouldbecount=`curl -s -S -G "http://$influxhost:8086/query" --data-urlencode "db=$db" --data-urlencode "q=SELECT count(value) FROM cpu_temp WHERE host='$datahost'" | jq -r '.results[0].series[0].columns[1]'`
    n=`curl -s -S -G "http://$influxhost:8086/query" --data-urlencode "db=$db" --data-urlencode "q=SELECT count(value) FROM cpu_temp WHERE host='$datahost'" | jq '.results[0].series[0].values[0][1]'`

    if [ "$shouldbecount" != "count" ]; then
        echo "expected count but got $shouldbecount"
#       exit 1
    fi

    if [ "$n" != "$onethirdCount" ]; then
        echo "expected $onethirdCount but got $n"
        echo "host was $influxhost and data was $datahost"
#       exit 1
    fi
}

influxnodes=( "ip-172-31-13-16.us-west-2.compute.internal" "ip-172-31-13-15.us-west-2.compute.internal" "ip-172-31-13-14.us-west-2.compute.internal")
servers=("server-a" "server-b" "server-c")

for i in "${influxnodes[@]}"; do

    echo "#### influxnode $i"

    for j in "${servers[@]}"; do
        testslicebyserver "$i" "$j"
    done

    echo "### two thirds test $i"
    ###do a test of the other field that is not evenly divided
    n=`curl -s -S -G "http://$i:8086/query" --data-urlencode "db=$db" --data-urlencode "q=SELECT count(value) FROM cpu_temp WHERE region='us-west'"  | jq '.results[0].series[0].values[0][1]'`


    if [ "$n" != "$twothirdsCount" ]; then
        echo "failed uneven test, expected $twothirdsCount but got $n"
        exit 1
    fi
done

Perhaps I'm just confused about how this is supposed to work? That seems the most likely explanation.
thanks


Jhors2 commented Sep 2, 2015

I have noticed the same, actually. The logs show a whole slew of connections opening from the cluster members that stayed up to the previously dead node, writes occurring, and the connections then being closed. However, the member never actually seems to get back into a synced state.

I verified that the raft key is set to true on all servers in the cluster by looking at the output of "show servers".
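
(For anyone repeating that check, here is a minimal sketch. The hostnames are placeholders; it simply runs SHOW SERVERS against each node so the raft column can be compared across the cluster.)

#!/bin/bash
# Hostnames below are placeholders; substitute your own nodes.
for host in node-a node-b node-c; do
    echo "### $host"
    curl -s -G "http://$host:8086/query" --data-urlencode "q=SHOW SERVERS"
    echo
done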

beckettsean changed the title from 'possible "catch up" bug with nodes down in a cluster' to '[0.9.4-nightly] possible "catch up" bug with nodes down in a cluster' on Sep 2, 2015
beckettsean (Contributor) commented

@jwilder something amiss with hinted handoff?


desa commented Sep 3, 2015

@beckettsean @iansmith I was able to repro the issue. Here are the steps:

  1. Launch a cluster (verify that it's healthy)
  2. Create a database
  3. Take one of the nodes down
  4. Write 1 million points to the cluster
  5. Check that the other two nodes received all the points with select count(value) from cpu (a rough sketch of steps 4 and 5 follows this list)
  6. Bring up the node that was down. Wait a few minutes while the points propagate.
  7. Query the node that was brought up.

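The sketch below covers steps 4 and 5 only. The hostname, database name, and measurement are placeholders, the database is assumed to already exist (step 2), and points are sent as batched line-protocol writes to keep the number of requests down.

#!/bin/bash
set -e

db="stress"        # placeholder database (created in step 2)
total=1000000
batch=5000

# Step 4: write 1 million points in batches of $batch line-protocol lines.
for b in $(seq 0 $(( total / batch - 1 ))); do
    points=""
    for i in $(seq 1 $batch); do
        n=$(( b * batch + i ))
        points+="cpu value=$n $(( 1438387200 + n ))"$'\n'
    done
    curl -s -XPOST "http://node-a:8086/write?db=$db&precision=s" --data-binary "$points" > /dev/null
done

# Step 5: this count should equal $total on the nodes that stayed up.
curl -s -G "http://node-a:8086/query" --data-urlencode "db=$db" \
     --data-urlencode "q=SELECT count(value) FROM cpu"
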
I ran the test twice, and each time the node that had been brought down was missing points. I'm going to let the system sit for a while and see if it's a timing issue.


desa commented Sep 3, 2015

After letting it sit for a while, the points still seem to be missing.


desa commented Sep 3, 2015

One other thing to note: the number of missing points is very clean. When I wrote 2 million points, the third node reported 19345000 points; the results were similar for 1 million points.

beckettsean (Contributor) commented

It sounds like entire batches are failing to replicate, not just individual points within a batch.

@jwilder seems like something is amiss with hinted handoff. What other testing can we do or what details can we provide to help characterize it?

jwilder self-assigned this Sep 4, 2015

jwilder commented Sep 4, 2015

I'm able to reproduce some issues and am looking into it.

jwilder added a commit that referenced this issue Sep 4, 2015
The meta-store would open but may not have finished loading the raft log. If write
requests came in, they could fail or be dropped because of missing shard group
info.  This change makes the meta store only return after it has found the leader
and is really ready.

This change also fixed a race in the ClusterRestart test that may be causing it
to fail sporadically.

Fixes #3677 #3960
jwilder added a commit that referenced this issue Sep 4, 2015
The mux listener was handling connections and demuxing serially.  This could cause
issues if one handler was slow or blocked.  For example, if a node had many
hinted handoff writes queued for a down node, when the down node was started, it
would start handling the write requests (possibly before it had synchronized with
the cluster).  This would cause the connections to block and also prevent the cluster
synchronization from occurring because they both use the same mux listener.

Fixes #3960

desa commented Sep 4, 2015

woo!


iansmith commented Sep 4, 2015

2 days from bug report to fix! Awesome! 🏆


jwilder commented Sep 4, 2015

@mjdesa @iansmith Please re-open if you still see this issue occurring.

iansmith (Author) commented

I'm testing 0.9.4.1 and I'm not seeing the node that was down during the writes ever recovering. Should I re-open?
