panic: runtime error: invalid memory address or nil pointer dereference / [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xeb3c26] #2051

Closed
ymettier opened this issue Nov 27, 2019 · 6 comments

Comments

@ymettier commented Nov 27, 2019

Disclaimer

I'm new to M3DB and I'm just testing it.

This is a bug report that I will not be able to follow up on. If you think it is useless, feel free to close it.

I have no sensitive data, and I expect I will wipe everything and restart from scratch anyway.

Context

Deploying on Kubernetes with m3db-operator.

Step 1: my config

I'm running with this config:

spec:
  image: quay.io/m3db/m3dbnode:master
  replicationFactor: 2
  isolationGroups:
  - name: group1
    numInstances: 1
  - name: group2
    numInstances: 1

I have 2 pods running:

  • simple-cluster-rep0-0
  • simple-cluster-rep1-0

Both are running build revision 638ecfb

This is my initial deployment, so there is no data left over from any previous deployment that could explain the bug.

Now let's play...

Step 2: scaling

I update my config:

spec:
  image: quay.io/m3db/m3dbnode:master
  replicationFactor: 2
  isolationGroups:
  - name: group1
    numInstances: 3
  - name: group2
    numInstances: 3

Three new pods spawn (one after the other), so I now have:

  • simple-cluster-rep0-0 (the old one)
  • simple-cluster-rep1-0 (the old one)
  • simple-cluster-rep1-1 (1st new)
  • simple-cluster-rep0-1 (2nd new)
  • simple-cluster-rep0-2 (3rd new)

Notice: no simple-cluster-rep1-2.

All new pods are running build revision de4bc64.

Step 3: upgrade the first old pod

kubectl delete pod simple-cluster-rep0-0

The pod terminates and a new one spawns, running build revision de4bc64.

I wait for more than 5 minutes and see simple-cluster-rep1-2 spawn (build revision de4bc64) with no action on my part. Thanks, m3db-operator.

OK, everything is running fine. I still have one old pod.

Step 4: upgrade the last pod

My Grafana dashboard says I have 5 bootstrapped nodes and 1 bootstrapping, but I have already been waiting for more than 5 minutes and nothing happens.

kubectl delete pod simple-cluster-rep1-0

The pod terminates and a new one spawns, running build revision de4bc64.

Fine. Actually, not fine: Error, then CrashLoopBackOff.

$ kubectl logs -f simple-cluster-rep1-0
2019/11/27 10:29:32 Go Runtime version: go1.12.13
2019/11/27 10:29:32 Build Version:      v0.14.2
2019/11/27 10:29:32 Build Revision:     de4bc64eb
2019/11/27 10:29:32 Build Branch:       HEAD
2019/11/27 10:29:32 Build Date:         2019-11-27-08:25:55
2019/11/27 10:29:32 Build TimeUnix:     1574843155
{"level":"warn","ts":1574850572.9373457,"msg":"using deprecated configuration field","field":"DeprecatedBlockSize"}
{"level":"info","ts":1574850572.937423,"msg":"raised rlimit no file fds limit","required":true,"sysNROpenValue":3000000,"noFileMaxValue":3000000,"noFileCurrValue":3000000}
{"level":"warn","ts":1574850572.9374633,"msg":"using deprecated configuration field","field":"DeprecatedMaxComputedDatapoints"}
{"level":"warn","ts":1574850572.93749,"msg":"using deprecated configuration field","field":"DeprecatedQueryConversion"}
{"level":"warn","ts":1574850572.9374928,"msg":"using deprecated configuration field","field":"DeprecatedCache"}
{"level":"info","ts":1574850572.9377108,"msg":"tracing disabled for m3query; set `tracing.backend` to enable"}
{"level":"info","ts":1574850572.938164,"msg":"tracing disabled; set `tracing.backend` to enable"}
{"level":"info","ts":1574850572.938192,"msg":"no seed nodes set, using dedicated etcd cluster"}
{"level":"info","ts":1574850572.957037,"msg":"resolved cluster namespace","namespace":"default"}
{"level":"info","ts":1574850572.9570768,"msg":"resolved cluster namespace","namespace":"short"}
{"level":"info","ts":1574850572.9570804,"msg":"resolved cluster namespace","namespace":"mid"}
{"level":"info","ts":1574850572.9570825,"msg":"resolved cluster namespace","namespace":"long"}
{"level":"info","ts":1574850573.12038,"msg":"configuring downsampler to use with aggregated cluster namespaces","numAggregatedClusterNamespaces":3}
{"level":"info","ts":1574850573.1245902,"msg":"no m3msg server configured"}
{"level":"info","ts":1574850573.1246095,"msg":"using registered interrupt handler"}
{"level":"info","ts":1574850573.1246233,"msg":"starting API server","address":"0.0.0.0:7201"}
{"level":"warn","ts":1574850573.190948,"msg":"max index query IDs concurrency was not set, falling back to default value"}
{"level":"warn","ts":1574850573.1914613,"msg":"host doesn't support HugeTLB, proceeding without it"}
{"level":"info","ts":1574850573.2283068,"msg":"set thrift bytes pool alloc size","size":2048}
{"level":"info","ts":1574850573.2284174,"msg":"bytes pool registering bucket capacity=824634475824, size=824634475832, refillLowWatermark=0.300000, refillHighWatermark=0.600000"}
{"level":"info","ts":1574850573.228425,"msg":"bytes pool registering bucket capacity=824634475840, size=824634475848, refillLowWatermark=0.300000, refillHighWatermark=0.600000"}
{"level":"info","ts":1574850573.2284431,"msg":"bytes pool registering bucket capacity=824634475856, size=824634475864, refillLowWatermark=0.300000, refillHighWatermark=0.600000"}
{"level":"info","ts":1574850573.228466,"msg":"bytes pool registering bucket capacity=824634475872, size=824634475880, refillLowWatermark=0.300000, refillHighWatermark=0.600000"}
{"level":"info","ts":1574850573.2284837,"msg":"bytes pool registering bucket capacity=824634475888, size=824634475896, refillLowWatermark=0.300000, refillHighWatermark=0.600000"}
{"level":"info","ts":1574850573.2284884,"msg":"bytes pool registering bucket capacity=824634475904, size=824634475912, refillLowWatermark=0.300000, refillHighWatermark=0.600000"}
{"level":"info","ts":1574850573.2285097,"msg":"bytes pool registering bucket capacity=824634475920, size=824634475928, refillLowWatermark=0.300000, refillHighWatermark=0.600000"}
{"level":"info","ts":1574850573.2285898,"msg":"bytes pool init"}
{"level":"info","ts":1574850573.77733,"msg":"bytes pool init done"}
{"level":"info","ts":1574850574.4343927,"msg":"creating dynamic config service client with m3cluster"}
{"level":"info","ts":1574850574.4348063,"msg":"successfully created new cache dir","path":"/var/lib/m3kv","mode":493}
{"level":"info","ts":1574850574.4349284,"msg":"successfully loaded cache from file","file":"/var/lib/m3kv/_kv_monitoring_simple-cluster_m3db_embedded.json"}
{"level":"info","ts":1574850574.7601018,"msg":"node tchannelthrift: listening","address":"0.0.0.0:9000"}
{"level":"info","ts":1574850574.7603784,"msg":"node httpjson: listening","address":"0.0.0.0:9002"}
{"level":"info","ts":1574850574.7604082,"msg":"waiting for dynamic topology initialization, if this takes a long time, make sure that a topology/placement is configured"}
{"level":"info","ts":1574850574.7604241,"msg":"adding a watch","service":"m3db","env":"monitoring/simple-cluster","zone":"embedded","includeUnhealthy":true}
{"level":"info","ts":1574850574.7604885,"msg":"successfully created new cache dir","path":"/var/lib/m3kv","mode":493}
{"level":"info","ts":1574850574.760679,"msg":"successfully loaded cache from file","file":"/var/lib/m3kv/m3db_embedded.json"}
{"level":"info","ts":1574850574.7631948,"msg":"initial topology / placement value received"}
{"level":"error","ts":1574850584.4473028,"msg":"error initializing namespaces values, retrying in the background","key":"/namespaces","error":"initializing value error:init watch timeout"}
{"level":"info","ts":1574850584.4949942,"msg":"received kv update","version":1,"key":"/placement"}
{"level":"info","ts":1574850584.4951572,"msg":"election manager opened successfully"}
{"level":"info","ts":1574850584.869887,"msg":"cluster tchannelthrift: listening","address":"0.0.0.0:9001"}
{"level":"info","ts":1574850584.870028,"msg":"cluster httpjson: listening","address":"0.0.0.0:9003"}
{"level":"info","ts":1574850584.8701055,"msg":"cluster database initializing topology"}
{"level":"info","ts":1574850584.870117,"msg":"cluster database resolving topology"}
{"level":"info","ts":1574850584.8701205,"msg":"cluster database resolved topology"}
{"level":"info","ts":1574850584.8897624,"msg":"creating namespaces watch"}
{"level":"info","ts":1574850584.8898044,"msg":"waiting for dynamic namespace registry initialization, if this takes a long time, make sure that a namespace is configured"}
{"level":"info","ts":1574850584.8927026,"msg":"initial namespace value received"}
{"level":"info","ts":1574850584.892856,"msg":"resolving namespaces with namespace watch"}
{"level":"info","ts":1574850584.8929048,"msg":"updating database namespaces","adds":"[short, default, mid, long]","updates":"[]","removals":"[]"}
{"level":"info","ts":1574850584.9130955,"msg":"using registered interrupt handler"}
{"level":"info","ts":1574850584.9135246,"msg":"bootstrap started","numShards":128}
{"level":"info","ts":1574850584.9137526,"msg":"bootstrap range starting","bootstrapper":"base","namespace":"mid","numShards":128,"dataFrom":1574805600,"dataTo":1574848800,"dataRange":43200,"indexFrom":1574805600,"indexTo":1574848800,"indexRange":43200}
{"level":"info","ts":1574850584.9137657,"msg":"bootstrap range starting","bootstrapper":"base","namespace":"long","numShards":128,"dataFrom":1573981200,"dataTo":1574845200,"dataRange":864000,"indexFrom":1573981200,"indexTo":1574845200,"indexRange":864000}
{"level":"info","ts":1574850584.913771,"msg":"bootstrap range starting","bootstrapper":"base","namespace":"short","numShards":128,"dataFrom":1574828400,"dataTo":1574850000,"dataRange":21600,"indexFrom":1574828400,"indexTo":1574850000,"indexRange":21600}
{"level":"info","ts":1574850584.9137757,"msg":"bootstrap range starting","bootstrapper":"base","namespace":"default","numShards":128,"dataFrom":1574848200,"dataTo":1574850000,"dataRange":1800,"indexFrom":1574848200,"indexTo":1574850000,"indexRange":1800}
{"level":"info","ts":1574850584.9138005,"msg":"bootstrap from source requested","bootstrapper":"filesystem","namespace":"long","numShards":128,"dataRange":864000,"dataFrom":1573981200,"dataTo":1574845200,"indexRange":864000,"indexFrom":1573981200,"indexTo":1574845200}
{"level":"info","ts":1574850585.4953759,"msg":"election state changed from follower to leader"}
{"level":"info","ts":1574850586.044373,"msg":"bootstrap from source ready after availability query","bootstrapper":"filesystem","namespace":"long","numShards":128,"dataRange":864000,"dataFrom":1573981200,"dataTo":1574845200,"indexRange":864000,"indexFrom":1573981200,"indexTo":1574845200}
{"level":"info","ts":1574850586.0444348,"msg":"bootstrap from source requested","bootstrapper":"filesystem","namespace":"short","numShards":128,"dataRange":21600,"dataFrom":1574828400,"dataTo":1574850000,"indexRange":21600,"indexFrom":1574828400,"indexTo":1574850000}
{"level":"info","ts":1574850586.2754247,"msg":"bootstrap from source ready after availability query","bootstrapper":"filesystem","namespace":"short","numShards":128,"dataRange":20400,"dataFrom":1574828400,"dataTo":1574848800,"indexRange":20400,"indexFrom":1574828400,"indexTo":1574848800}
{"level":"info","ts":1574850586.275479,"msg":"bootstrap from source requested","bootstrapper":"filesystem","namespace":"default","numShards":128,"dataRange":1800,"dataFrom":1574848200,"dataTo":1574850000,"indexRange":1800,"indexFrom":1574848200,"indexTo":1574850000}
{"level":"info","ts":1574850586.3150926,"msg":"bootstrap from source ready after availability query","bootstrapper":"filesystem","namespace":"default","numShards":128,"dataRange":600,"dataFrom":1574848200,"dataTo":1574848800,"indexRange":600,"indexFrom":1574848200,"indexTo":1574848800}
{"level":"info","ts":1574850586.3151445,"msg":"bootstrap from source requested","bootstrapper":"filesystem","namespace":"mid","numShards":128,"dataRange":43200,"dataFrom":1574805600,"dataTo":1574848800,"indexRange":43200,"indexFrom":1574805600,"indexTo":1574848800}
{"level":"info","ts":1574850586.4673948,"msg":"bootstrap from source ready after availability query","bootstrapper":"filesystem","namespace":"mid","numShards":128,"dataRange":43200,"dataFrom":1574805600,"dataTo":1574848800,"indexRange":43200,"indexFrom":1574805600,"indexTo":1574848800}
{"level":"info","ts":1574850586.4674304,"msg":"bootstrap from source started","bootstrapper":"filesystem"}
{"level":"info","ts":1574850586.4674404,"msg":"bootstrapping time series data start","bootstrapper":"filesystem"}
{"level":"info","ts":1574850587.2488518,"msg":"bootstrapping time series data success","bootstrapper":"filesystem","took":0.781410907}
{"level":"info","ts":1574850587.2488859,"msg":"bootstrapping index metadata start","bootstrapper":"filesystem"}
{"level":"info","ts":1574850588.7929142,"msg":"bootstrapping index metadata success","bootstrapper":"filesystem","took":"1.544025869s"}
{"level":"info","ts":1574850588.7929425,"msg":"bootstrap from source completed","bootstrapper":"filesystem","took":2.325511435}
{"level":"info","ts":1574850588.824829,"msg":"bootstrapping from source completed successfully","bootstrapper":"filesystem","took":2.325511435,"namespace":"long","numShards":128,"dataRangeRequested":864000,"dataRangeFulfilled":864000,"indexRangeRequested":864000,"indexRangeFulfilled":864000,"numIndexBlocks":80}
{"level":"info","ts":1574850588.829526,"msg":"bootstrapping from source completed successfully","bootstrapper":"filesystem","took":2.325511435,"namespace":"short","numShards":128,"dataRangeRequested":20400,"dataRangeFulfilled":20400,"indexRangeRequested":20400,"indexRangeFulfilled":20400,"numIndexBlocks":17}
{"level":"info","ts":1574850588.8303614,"msg":"bootstrapping from source completed successfully","bootstrapper":"filesystem","took":2.325511435,"namespace":"default","numShards":128,"dataRangeRequested":600,"dataRangeFulfilled":600,"indexRangeRequested":600,"indexRangeFulfilled":600,"numIndexBlocks":1}
{"level":"info","ts":1574850588.8335907,"msg":"bootstrapping from source completed successfully","bootstrapper":"filesystem","took":2.325511435,"namespace":"mid","numShards":128,"dataRangeRequested":43200,"dataRangeFulfilled":43200,"indexRangeRequested":43200,"indexRangeFulfilled":43200,"numIndexBlocks":12}
{"level":"info","ts":1574850588.8336365,"msg":"bootstrap from source requested","bootstrapper":"commitlog","namespace":"short","numShards":128,"dataRange":1200,"dataFrom":1574848800,"dataTo":1574850000,"indexRange":1200,"indexFrom":1574848800,"indexTo":1574850000}
{"level":"info","ts":1574850588.8339224,"msg":"bootstrap from source ready after availability query","bootstrapper":"commitlog","namespace":"short","numShards":128,"dataRange":1200,"dataFrom":1574848800,"dataTo":1574850000,"indexRange":1200,"indexFrom":1574848800,"indexTo":1574850000}
{"level":"info","ts":1574850588.833963,"msg":"bootstrap from source requested","bootstrapper":"commitlog","namespace":"default","numShards":128,"dataRange":1200,"dataFrom":1574848800,"dataTo":1574850000,"indexRange":1200,"indexFrom":1574848800,"indexTo":1574850000}
{"level":"info","ts":1574850588.8341637,"msg":"bootstrap from source ready after availability query","bootstrapper":"commitlog","namespace":"default","numShards":128,"dataRange":1200,"dataFrom":1574848800,"dataTo":1574850000,"indexRange":1200,"indexFrom":1574848800,"indexTo":1574850000}
{"level":"info","ts":1574850588.8341737,"msg":"bootstrap from source started","bootstrapper":"commitlog"}
{"level":"info","ts":1574850588.8341894,"msg":"read snapshots start","bootstrapper":"commitlog"}
{"level":"info","ts":1574850591.4526007,"msg":"read snapshots done","bootstrapper":"commitlog","took":2.618410891}
{"level":"info","ts":1574850591.452748,"msg":"read commit logs start","bootstrapper":"commitlog"}
{"level":"info","ts":1574850591.4784453,"msg":"read finished","bootstrapper":"commitlog","took":"25.695581ms","datapointsRead":0,"datapointsSkippedNotInRange":0,"datapointsSkippedNotBootstrappingNamespace":0}
{"level":"error","ts":1574850591.4784842,"msg":"error bootstrapping from source","bootstrapper":"commitlog","took":2.644308902,"error":"not responsible for shard 130"}
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0xeb3c26]

goroutine 13214 [running]:
github.com/m3db/m3/src/dbnode/storage/bootstrap.(*NamespaceResultsMap).Get(0x0, 0x1fc08c0, 0xc09d86cd20, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
	/go/src/github.com/m3db/m3/src/dbnode/storage/bootstrap/namespace_results_map_gen.go:189 +0x66
github.com/m3db/m3/src/dbnode/storage/bootstrap.bootstrapProcess.Run(0x1fc46a0, 0xc09d662360, 0x1fd4020, 0xc095c57e50, 0x1ce9590, 0xc0004f0600, 0x1f917e0, 0xc0a592f0e0, 0xc0a592df20, 0xbf6fb1e6367346c6, ...)
	/go/src/github.com/m3db/m3/src/dbnode/storage/bootstrap/process.go:223 +0x10bd
github.com/m3db/m3/src/dbnode/storage.(*bootstrapManager).bootstrap(0xc09d76cbb0, 0x0, 0x0)
	/go/src/github.com/m3db/m3/src/dbnode/storage/bootstrap.go:266 +0x746
github.com/m3db/m3/src/dbnode/storage.(*bootstrapManager).Bootstrap(0xc09d76cbb0, 0x0, 0x0, 0x0, 0x473900, 0x0, 0x0)
	/go/src/github.com/m3db/m3/src/dbnode/storage/bootstrap.go:146 +0x129
github.com/m3db/m3/src/dbnode/storage.(*db).Bootstrap(0xc0a3fbc000, 0x0, 0x0)
	/go/src/github.com/m3db/m3/src/dbnode/storage/database.go:830 +0x56
github.com/m3db/m3/src/dbnode/server.Run.func5(0xc0004fe300, 0x1fe3180, 0xc099a409c0, 0xc0004f0600, 0x1f70340, 0xc088d86450, 0x1f8dc20, 0xc088d86510, 0x1fb63a0, 0xc0884b6e60, ...)
	/go/src/github.com/m3db/m3/src/dbnode/server/server.go:862 +0x62
created by github.com/m3db/m3/src/dbnode/server.Run
	/go/src/github.com/m3db/m3/src/dbnode/server/server.go:853 +0x7561
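If it helps, the top frame shows the receiver itself is nil (the first argument to (*NamespaceResultsMap).Get is 0x0), and the panic comes right after the commitlog bootstrapper error "not responsible for shard 130" above. Below is a minimal, hypothetical Go sketch of that general failure pattern; the type and field names are made up for illustration and are not taken from the M3DB source:

package main

// resultsMap is a made-up stand-in for bootstrap.NamespaceResultsMap
// (the real type is generated in namespace_results_map_gen.go).
type resultsMap struct {
	entries map[string]int
}

// Get reads a field off its receiver. Calling it on a nil *resultsMap
// compiles fine but dereferences nil at runtime.
func (m *resultsMap) Get(key string) (int, bool) {
	v, ok := m.entries[key] // m == nil here -> SIGSEGV
	return v, ok
}

func main() {
	// Suppose an earlier bootstrap step errored out and left the map unset.
	var results *resultsMap
	// panic: runtime error: invalid memory address or nil pointer dereference
	results.Get("long")
}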

I hope you can find the origin of this crash.

$ kubectl get pod simple-cluster-rep1-0
NAME                    READY   STATUS             RESTARTS   AGE
simple-cluster-rep1-0   0/1     CrashLoopBackOff   9          28m

As I said at the beginning, if you cannot do anything with this bug report, feel free to close it.

@ymettier (Author) commented Nov 27, 2019

More info...

kubectl delete sts simple-cluster-rep1

All simple-cluster-rep1* pods terminated and respawned, thanks to m3db-operator.
And...

$ kubectl get pod -l operator.m3db.io/stateful-set=simple-cluster-rep1
NAME                    READY   STATUS             RESTARTS   AGE
simple-cluster-rep1-0   0/1     CrashLoopBackOff   4          4m49s
simple-cluster-rep1-1   1/1     Running            0          4m4s
simple-cluster-rep1-2   1/1     Running            0          3m31s

Is there something wrong with the disks?

$ kubectl edit m3dbcluster simple-cluster

Set replicationFactor=1

$ kubectl delete sts simple-cluster-rep1
statefulset.apps "simple-cluster-rep1" deleted
$ kubectl delete persistentvolumeclaim/m3db-data-simple-cluster-rep1-0
persistentvolumeclaim "m3db-data-simple-cluster-rep1-0" deleted
$ kubectl delete persistentvolumeclaim/m3db-data-simple-cluster-rep1-1
persistentvolumeclaim "m3db-data-simple-cluster-rep1-1" deleted
$ kubectl delete persistentvolumeclaim/m3db-data-simple-cluster-rep1-2
persistentvolumeclaim "m3db-data-simple-cluster-rep1-2" deleted
$ kubectl edit m3dbcluster simple-cluster

Set replicationFactor=2

Enjoy the magic of m3db-operator...

$ kubectl get pod -l operator.m3db.io/stateful-set=simple-cluster-rep1
NAME                    READY   STATUS    RESTARTS   AGE
simple-cluster-rep1-0   1/1     Running   0          10m
simple-cluster-rep1-1   1/1     Running   0          10m
simple-cluster-rep1-2   1/1     Running   0          9m32s

OK, so I guess the problem is somehow related to the data on disk.

@ymettier (Author) commented Nov 27, 2019

(I forgot to say that all 6 pods now run build revision de4bc64.)

@robskillington (Collaborator) commented Nov 27, 2019

@ymettier unfortunately this is an issue introduced on :master. We recently landed a large refactor of all the bootstrappers to dramatically improve bootstrapping memory usage and speed. We believed most users were using :latest instead of :master and were hoping to get feedback from a few other :master users before releasing.

Looks like in the future we'll do more selective testing with other users before merging to master, as it seems more people are pinned to :master than we thought.

Here's the fix; it'll go into master in the next hour or two:
#2052

Apologies and thank you for the detailed bug report, greatly appreciated.

@robskillington (Collaborator) commented Nov 27, 2019

Fixed with #2052. Would you be open to testing this, @ymettier?

@ymettier (Author) commented Nov 27, 2019

Fixed with #2052. Would you be open to testing this, @ymettier?

Thanks. Not easy for me to test (I'm still a newbie with k8s and m3). Sorry.
However, I will continue to test (and bootstrap) with :master because I definitely need PR #2037. If anything goes wrong, I'll let you know.

@robskillington (Collaborator) commented Nov 27, 2019

No problem, thanks a lot. That sounds good.
