Node falls behind Metastore updates #2343

Closed
otoolep opened this issue Apr 19, 2015 · 5 comments

@otoolep
Contributor

otoolep commented Apr 19, 2015

Every so often during 3-node integration testing, a failure occurs because a single node complains that it does not recognise a measurement when querying. This happens because that node has fallen behind metastore updates, and even after many seconds it has not caught up.

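The diagnostics dumps below show that the broadcast index is too low on one of the nodes: in the `server_diag` series, the third dump (data directory `data-integration-test/2`) reports an `index` of 273, while the other two report 290.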

{"results":[{"series":[{"name":"server_go","tags":{"serverID":"1"},"columns":["time","goMaxProcs","numGoRoutine","version"],"values":[["2015-04-19T17:16:40.518967226Z",1,235,"go1.4"]]},{"name":"server_system","tags":{"serverID":"1"},"columns":["time","hostname","pid","os","arch","numCPU"],"values":[["2015-04-19T17:16:40.518968249Z","box446",12600,"linux","amd64",2]]},{"name":"server_memory","tags":{"serverID":"1"},"columns":["time","alloc","totalAlloc","sys","lookups","mallocs","frees","heapAlloc","heapSys","heapIdle","heapInUse","heapReleased","heapObjects","pauseTotalNs","numGG"],"values":[["2015-04-19T17:16:40.518969695Z",2631152,6857999304,26564856,117545,21069460,21055289,2631152,20938752,16162816,4775936,0,14171,5367642346,1711]]},{"name":"server_build","tags":{"serverID":"1"},"columns":["time","version","commitHash"],"values":[["2015-04-19T17:16:40.518972055Z","0.9",""]]},{"name":"server_diag","tags":{"serverID":"1"},"columns":["time","startTime","uptime","id","path","authEnabled","index","retentionAutoCreate","numShards","cqLastRun"],"values":[["2015-04-19T17:16:40.518874138Z","2015-04-19 17:14:42.896718315 +0000 UTC","1m57.622224804s","1","/tmp/influxdb-809097743/data-integration-test/0",false,290,true,8,"0001-01-01 00:00:00 +0000 UTC"]]},{"name":"shardGroups_diag","tags":{"serverID":"1"},"columns":["time","database","retentionPolicy","id","startTime","endTime","duration","numShards"],"values":[["2015-04-19T17:16:40.518874138Z","mydb","myrp","8","2000-01-01 00:00:00 +0000 UTC","2000-01-01 01:00:00 +0000 UTC","1h0m0s",1]]},{"name":"shards_diag","tags":{"serverID":"1"},"columns":["time","id","dataNodes","index","path","path","path","path","path","path","path","path"],"values":[["2015-04-19T17:16:40.518874138Z","6","1","222"],["/tmp/influxdb-809097743/data-integration-test/0/shards/6"],["2015-04-19T17:16:40.518874138Z","7","1","244"],["/tmp/influxdb-809097743/data-integration-test/0/shards/7"],["2015-04-19T17:16:40.518874138Z","8","1","292"],["/tmp/influxdb-809097743/data-integration-test/0/shards/8"],["2015-04-19T17:16:40.518874138Z","1","1","37"],["/tmp/influxdb-809097743/data-integration-test/0/shards/1"],["2015-04-19T17:16:40.518874138Z","2","1","52"],["/tmp/influxdb-809097743/data-integration-test/0/shards/2"],["2015-04-19T17:16:40.518874138Z","3","1","71"],["/tmp/influxdb-809097743/data-integration-test/0/shards/3"],["2015-04-19T17:16:40.518874138Z","4","1","87"],["/tmp/influxdb-809097743/data-integration-test/0/shards/4"],["2015-04-19T17:16:40.518874138Z","5","1","174"],["/tmp/influxdb-809097743/data-integration-test/0/shards/5"]]}]}]}

{"results":[{"series":[{"name":"server_go","tags":{"serverID":"0"},"columns":["time","goMaxProcs","numGoRoutine","version"],"values":[["2015-04-19T17:16:40.520398863Z",1,235,"go1.4"]]},{"name":"server_system","tags":{"serverID":"0"},"columns":["time","hostname","pid","os","arch","numCPU"],"values":[["2015-04-19T17:16:40.520399766Z","box446",12600,"linux","amd64",2]]},{"name":"server_memory","tags":{"serverID":"0"},"columns":["time","alloc","totalAlloc","sys","lookups","mallocs","frees","heapAlloc","heapSys","heapIdle","heapInUse","heapReleased","heapObjects","pauseTotalNs","numGG"],"values":[["2015-04-19T17:16:40.52040084Z",4249800,6859617952,26564856,117550,21070606,21055382,4249800,20922368,14589952,6332416,0,15224,5367642346,1711]]},{"name":"server_build","tags":{"serverID":"0"},"columns":["time","version","commitHash"],"values":[["2015-04-19T17:16:40.520402794Z","0.9",""]]},{"name":"server_diag","tags":{"serverID":"0"},"columns":["time","startTime","uptime","id","path","authEnabled","index","retentionAutoCreate","numShards","cqLastRun"],"values":[["2015-04-19T17:16:40.520331728Z","2015-04-19 17:14:42.896718315 +0000 UTC","1m57.623664339s","0","/tmp/influxdb-809097743/data-integration-test/1",false,290,true,8,"0001-01-01 00:00:00 +0000 UTC"]]},{"name":"shardGroups_diag","tags":{"serverID":"0"},"columns":["time","database","retentionPolicy","id","startTime","endTime","duration","numShards"],"values":[["2015-04-19T17:16:40.520331728Z","mydb","myrp","8","2000-01-01 00:00:00 +0000 UTC","2000-01-01 01:00:00 +0000 UTC","1h0m0s",1]]},{"name":"shards_diag","tags":{"serverID":"0"},"columns":["time","id","dataNodes","index"],"values":[["2015-04-19T17:16:40.520331728Z","8","1","0"],["2015-04-19T17:16:40.520331728Z","1","1","0"],["2015-04-19T17:16:40.520331728Z","2","1","0"],["2015-04-19T17:16:40.520331728Z","3","1","0"],["2015-04-19T17:16:40.520331728Z","4","1","0"],["2015-04-19T17:16:40.520331728Z","5","1","0"],["2015-04-19T17:16:40.520331728Z","6","1","0"],["2015-04-19T17:16:40.520331728Z","7","1","0"]]}]}]}

{"results":[{"series":[{"name":"server_go","tags":{"serverID":"0"},"columns":["time","goMaxProcs","numGoRoutine","version"],"values":[["2015-04-19T17:16:40.52486571Z",1,233,"go1.4"]]},{"name":"server_system","tags":{"serverID":"0"},"columns":["time","hostname","pid","os","arch","numCPU"],"values":[["2015-04-19T17:16:40.524866761Z","box446",12600,"linux","amd64",2]]},{"name":"server_memory","tags":{"serverID":"0"},"columns":["time","alloc","totalAlloc","sys","lookups","mallocs","frees","heapAlloc","heapSys","heapIdle","heapInUse","heapReleased","heapObjects","pauseTotalNs","numGG"],"values":[["2015-04-19T17:16:40.524867741Z",4174392,6861259592,26564856,117567,21071744,21057507,4174392,20922368,14573568,6348800,0,14237,5370146389,1712]]},{"name":"server_build","tags":{"serverID":"0"},"columns":["time","version","commitHash"],"values":[["2015-04-19T17:16:40.52486972Z","0.9",""]]},{"name":"server_diag","tags":{"serverID":"0"},"columns":["time","startTime","uptime","id","path","authEnabled","index","retentionAutoCreate","numShards","cqLastRun"],"values":[["2015-04-19T17:16:40.524797911Z","2015-04-19 17:14:42.896718315 +0000 UTC","1m57.628131383s","0","/tmp/influxdb-809097743/data-integration-test/2",false,273,true,8,"0001-01-01 00:00:00 +0000 UTC"]]},{"name":"shardGroups_diag","tags":{"serverID":"0"},"columns":["time","database","retentionPolicy","id","startTime","endTime","duration","numShards"],"values":[["2015-04-19T17:16:40.524797911Z","mydb","myrp","8","2000-01-01 00:00:00 +0000 UTC","2000-01-01 01:00:00 +0000 UTC","1h0m0s",1]]},{"name":"shards_diag","tags":{"serverID":"0"},"columns":["time","id","dataNodes","index"],"values":[["2015-04-19T17:16:40.524797911Z","3","1","0"],["2015-04-19T17:16:40.524797911Z","4","1","0"],["2015-04-19T17:16:40.524797911Z","5","1","0"],["2015-04-19T17:16:40.524797911Z","6","1","0"],["2015-04-19T17:16:40.524797911Z","7","1","0"],["2015-04-19T17:16:40.524797911Z","8","1","0"],["2015-04-19T17:16:40.524797911Z","1","1","0"],["2015-04-19T17:16:40.524797911Z","2","1","0"]]}]}]}
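Comparing dumps like these by eye is error-prone, so here is a minimal Go sketch of how the check could be automated. It is hypothetical and not part of InfluxDB or its test suite: it decodes each response, reads the `index` column of the `server_diag` series, and reports any node whose index is below the highest one seen. The helper names `serverIndex` and `laggingNodes`, and keying nodes by data directory, are assumptions for illustration; the sample input is a heavily trimmed excerpt of the three dumps.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// diagResponse mirrors the shape of the diagnostics dumps in this issue:
// {"results":[{"series":[{"name":...,"columns":[...],"values":[[...]]}]}]}
type diagResponse struct {
	Results []struct {
		Series []struct {
			Name    string          `json:"name"`
			Columns []string        `json:"columns"`
			Values  [][]interface{} `json:"values"`
		} `json:"series"`
	} `json:"results"`
}

// serverIndex (hypothetical helper) extracts the "index" column of the
// server_diag series from one dump, ignoring other series such as shards_diag.
func serverIndex(dump []byte) (uint64, error) {
	var r diagResponse
	if err := json.Unmarshal(dump, &r); err != nil {
		return 0, err
	}
	for _, res := range r.Results {
		for _, s := range res.Series {
			if s.Name != "server_diag" || len(s.Values) == 0 {
				continue
			}
			for i, col := range s.Columns {
				if col != "index" || i >= len(s.Values[0]) {
					continue
				}
				// encoding/json decodes JSON numbers into float64 for interface{} targets.
				if f, ok := s.Values[0][i].(float64); ok {
					return uint64(f), nil
				}
			}
		}
	}
	return 0, fmt.Errorf("no server_diag index found")
}

// laggingNodes (hypothetical helper) returns, sorted, the nodes whose index
// is below the highest index observed across all nodes.
func laggingNodes(indexes map[string]uint64) []string {
	var highest uint64
	for _, idx := range indexes {
		if idx > highest {
			highest = idx
		}
	}
	var behind []string
	for node, idx := range indexes {
		if idx < highest {
			behind = append(behind, node)
		}
	}
	sort.Strings(behind)
	return behind
}

func main() {
	// Trimmed dumps keyed by data directory; only server_diag is kept.
	dumps := map[string]string{
		"data-integration-test/0": `{"results":[{"series":[{"name":"server_diag","columns":["time","index"],"values":[["2015-04-19T17:16:40Z",290]]}]}]}`,
		"data-integration-test/1": `{"results":[{"series":[{"name":"server_diag","columns":["time","index"],"values":[["2015-04-19T17:16:40Z",290]]}]}]}`,
		"data-integration-test/2": `{"results":[{"series":[{"name":"server_diag","columns":["time","index"],"values":[["2015-04-19T17:16:40Z",273]]}]}]}`,
	}
	indexes := make(map[string]uint64)
	for node, d := range dumps {
		idx, err := serverIndex([]byte(d))
		if err != nil {
			fmt.Println(node, err)
			continue
		}
		indexes[node] = idx
	}
	fmt.Println("behind:", laggingNodes(indexes)) // behind: [data-integration-test/2]
}
```

Keying by data directory rather than by the `serverID` tag is deliberate: in the dumps above, two different nodes both report `serverID` "0", so the tag alone does not distinguish them.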

@otoolep otoolep added this to the 0.9.0 milestone Apr 19, 2015
@otoolep
Contributor Author

otoolep commented Apr 19, 2015

In this test, the node is given 2 minutes to catch up, so this is clearly a real problem.

@toddboom
Contributor

I think this is fixed with PR #2353.

@jwilder
Contributor

jwilder commented Apr 22, 2015

Reproduced locally with raft tracing enabled: https://gist.github.com/jwilder/1af8a408c98b2916c131

@jwilder jwilder reopened this Apr 22, 2015
@otoolep
Contributor Author

otoolep commented Apr 23, 2015

Gist link doesn't work for me.

@toddboom toddboom modified the milestones: 0.9.0, 0.9.1 May 8, 2015
@toddboom toddboom modified the milestones: 0.9.1, 0.9.2 Jun 5, 2015
@otoolep
Contributor Author

otoolep commented Jun 15, 2015

No longer applicable with the new clustering design.

@otoolep otoolep closed this as completed Jun 15, 2015

3 participants