This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

add clauses to detect nil Node's in idx.Find #812

Merged
merged 7 commits into from
Jan 24, 2018

Conversation

Dieterbe
Contributor

@Dieterbe Dieterbe commented Jan 8, 2018

helps #770
helps #811

@motyla

motyla commented Jan 8, 2018

@Dieterbe my build fails on this:

./scripts/build.sh
+ go build -ldflags '-X main.gitHash=0.7.4-658-ga0a00c4' -o /root/gopkgs/src/github.com/grafana/metrictank/build/metrictank
# github.com/grafana/metrictank/idx/memory
idx/memory/tag_query.go:59: undefined: sync.Map
idx/memory/tag_query.go:61: undefined: sync.Map
make[2]: *** [bin] Error 2
make[2]: Leaving directory `/root/gopkgs/src/github.com/grafana/metrictank'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/root/gopkgs/src/github.com/grafana/metrictank'
make: *** [default] Error 2

@Dieterbe
Contributor Author

Dieterbe commented Jan 8, 2018

It requires Go 1.9 to build (sync.Map was added in Go 1.9). You can also download the builds generated at https://circleci.com/gh/grafana/metrictank/3470#artifacts/containers/0 (navigate to home/circleci/.go_workspace/src/github.com/grafana/metrictank/build/ or build_pkgs at the end)

@motyla

motyla commented Jan 8, 2018

@Dieterbe , Here is the output:

2018/01/08 14:27:52 [D] HTTP Render querying metrictank003/index/find for 1:["*"]
2018/01/08 14:27:52 [D] memory-idx: found first pattern sequence at node * pos 0
2018/01/08 14:27:52 [D] memory-idx: starting search at the root node
2018/01/08 14:27:52 [D] memory-idx: found first pattern sequence at node * pos 0
2018/01/08 14:27:52 [D] memory-idx: searching 17 children of  that match *
2018/01/08 14:27:52 [D] memory-idx: Matching all children
2018/01/08 14:27:52 [D] memory-idx: starting search at the root node
2018/01/08 14:27:52 [D] memory-idx: searching 17 children of  that match *
2018/01/08 14:27:52 [D] memory-idx: Matching all children
2018/01/08 14:27:52 [D] memory-idx: reached pattern length. 17 nodes matched
2018/01/08 14:27:52 [D] memory-idx: reached pattern length. 17 nodes matched
2018/01/08 14:27:52 [D] memory-idx: orgId -1 has no metrics indexed.
2018/01/08 14:27:52 [D] memory-idx: orgId -1 has no metrics indexed.
2018/01/08 14:27:52 [D] memory-idx: 17 nodes matching pattern * found
2018/01/08 14:27:52 [D] memory-idx: 17 nodes matching pattern * found
2018/01/08 14:27:52 [D] memory-idx Find: adding to path host archive id=1.3688022064380aeae3228ecdd381adc7 name=host int=60 schemaId=28 aggId=0 lastSave=1515421173
2018/01/08 14:27:52 [D] memory-idx Find: adding to path host archive id=1.3688022064380aeae3228ecdd381adc7 name=host int=60 schemaId=28 aggId=0 lastSave=1515421173
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0xa68c35]

goroutine 233965 [running]:
github.com/grafana/metrictank/idx/memory.(*MemoryIdx).Find(0xc420bb74a0, 0x1, 0xc81496f845, 0x1, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
        /home/circleci/.go_workspace/src/github.com/grafana/metrictank/idx/memory/memory.go:822 +0x475
github.com/grafana/metrictank/api.(*Server).findSeriesLocal(0xc4202f0460, 0x1116280, 0xcb7267d900, 0x1, 0xc51208c300, 0x1, 0x1, 0x0, 0x0, 0x0, ...)
        /home/circleci/.go_workspace/src/github.com/grafana/metrictank/api/graphite.go:133 +0x432
github.com/grafana/metrictank/api.(*Server).findSeries.func1(0xc4202f0460, 0x1116280, 0xcb7267d900, 0x1, 0xc51208c300, 0x1, 0x1, 0x0, 0xc51208c7a0, 0xcbc8fd0360, ...)
        /home/circleci/.go_workspace/src/github.com/grafana/metrictank/api/graphite.go:76 +0x9b
created by github.com/grafana/metrictank/api.(*Server).findSeries
        /home/circleci/.go_workspace/src/github.com/grafana/metrictank/api/graphite.go:75 +0x44d

previous version didn't catch cases in last loop run and checked
startNode twice
they can also be internal server errors
also: correctly return internal errors as errors
@Dieterbe
Contributor Author

Dieterbe commented Jan 8, 2018

@motyla my previous patch didn't cover all cases. would you mind trying again with the latest patch? (click on the last green checkmark on this page, go to artifacts and you can download from there)

@motyla

motyla commented Jan 9, 2018

Thanks @Dieterbe , we tried this build yesterday and we're getting messages like:

Jan  9 09:21:16 metrictank004 metrictank[36063]: 2018/01/09 09:21:16 [memory.go:934 find()] [E] memory-idx: grandChild is nil. org=1,patt="*",i=0,pos=0,p="*",path="megaraidsas-metrics"
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Started POST /getdata for 10.106.38.189
Jan  9 09:21:16 metrictank004 metrictank[36063]: 2018/01/09 09:21:16 [graphite.go:89 func2()] [E] HTTP Render error querying metrictank001/index/find: "500 Internal Server Error"
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Completed /index/find 200 OK in 152.75447ms
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Started POST /index/find for 10.106.38.181
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Completed /index/find 200 OK in 173.471681ms
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Started POST /index/find for 10.106.38.183
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Completed /index/find 500 Internal Server Error in 117.559836ms
Jan  9 09:21:16 metrictank004 metrictank[36063]: 2018/01/09 09:21:16 [memory.go:934 find()] [E] memory-idx: grandChild is nil. org=1,patt="*",i=0,pos=0,p="*",path="megaraidsas-metrics"
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Completed /index/find 200 OK in 232.479484ms
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Started POST /getdata for 172.16.122.89
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Completed /index/find 500 Internal Server Error in 114.990421ms
Jan  9 09:21:16 metrictank004 metrictank[36063]: 2018/01/09 09:21:16 [memory.go:934 find()] [E] memory-idx: grandChild is nil. org=1,patt="*",i=0,pos=0,p="*",path="megaraidsas-metrics"
Jan  9 09:21:16 metrictank004 metrictank[36063]: 2018/01/09 09:21:16 [memory.go:934 find()] [E] memory-idx: grandChild is nil. org=1,patt="*",i=0,pos=0,p="*",path="megaraidsas-metrics"
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Completed /index/find 500 Internal Server Error in 147.276538ms
Jan  9 09:21:16 metrictank004 metrictank[36063]: [Macaron] 2018-01-09 09:21:16: Completed /index/find 200 OK in 102.901476ms

So it seems the service is no longer crashing.
We found out we have an index entry starting with '.megaraidsas-metrics' (i.e. with a leading dot). We will clean that up and continue to monitor the logs for these messages.
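
For anyone following along, here is a small sketch of why a leading dot looks suspicious. It assumes the index splits metric paths on "." (the review snippets in this PR join nodes with "." and look up the root under the key ""). `splitPath` is a hypothetical helper for illustration, not metrictank's actual code, and this thread does not confirm that the leading dot is what produced the nil node.

```go
package main

import (
	"fmt"
	"strings"
)

// splitPath breaks a metric path into its dot-separated nodes.
// Hypothetical helper for illustration only.
func splitPath(path string) []string {
	return strings.Split(path, ".")
}

func main() {
	// A regular path yields only non-empty nodes.
	fmt.Printf("%q\n", splitPath("megaraidsas-metrics.disk0"))

	// A leading dot yields an empty first node, which has the same
	// name as the key "" used for the root in the review snippets.
	fmt.Printf("%q\n", splitPath(".megaraidsas-metrics"))
}
```

The second call returns two nodes, the first of which is the empty string.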

@Dieterbe Dieterbe added this to the 0.8.1 milestone Jan 11, 2018
api/graphite.go Outdated
return nil, response.NewError(http.StatusBadRequest, err.Error())
err := response.WrapError(err)
if err.Code() != http.StatusBadRequest {
tags.Error.Set(span, true)
Contributor

This seems to be the same as the tracing.Failure(span) used below, isn't it? For consistency I'd use only one or the other; otherwise it's confusing, because the reader first needs to figure out that they're the same.

Contributor Author

you're right

if pos == 0 {
//we need to start at the root.
log.Debug("memory-idx: starting search at the root node")
startNode = tree.Items[""]
} else {
branch := strings.Join(nodes[0:pos], ".")
branch = strings.Join(nodes[0:pos], ".")
Contributor

couldn't that just be nodes[:pos], without the 0?

Contributor Author

yep
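
A quick sketch of the equivalence, for reference: in Go a slice expression with an omitted low bound defaults to 0, so nodes[:pos] and nodes[0:pos] select the same elements. The `branch` helper and the sample pattern are made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// branch joins the nodes before pos into a dotted path.
// Hypothetical helper; it mirrors the strings.Join call under review.
func branch(nodes []string, pos int) string {
	return strings.Join(nodes[:pos], ".")
}

func main() {
	nodes := []string{"some", "metric", "*", "value"} // sample pattern split on "."
	pos := 2                                          // index of the first wildcard node

	withZero := strings.Join(nodes[0:pos], ".")
	fmt.Println(withZero == branch(nodes, pos))
}
```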

}
}

if startNode == nil {
Contributor

When the assignment of startNode on line :894 fails because the searched branch is not present, ok should be false and the error handler on :895 would return. So the only way startNode could still be nil here is if the assignment on :890 fails, right? Why not just add an if !ok check after line :890 instead?

Contributor

@replay replay Jan 15, 2018

Or is the goal to check whether a nil value was actually stored in tree.Items? Then that makes sense, although I think it should be treated as a separate error case from the one where "" does not exist in tree.Items. With the current log statement we wouldn't know whether the problem is that tree.Items[""] does not exist or that tree.Items[""] holds a nil value.

Contributor Author

!ok means an entry wasn't found in the map (the returned value is then nil).
However, it is also possible that there is an entry in the map whose value is nil (i.e. ok is true, but the returned value is a nil pointer).
I want to protect against all scenarios. (In fact, I suspect there might be a nil pointer in the map.)
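
The three lookup outcomes described above can be sketched as follows. `Node` is a stand-in type and `describe` a hypothetical helper; neither is taken from the metrictank code base.

```go
package main

import "fmt"

// Node is a stand-in for the index's tree node type (illustration only).
type Node struct {
	Path string
}

// describe reports which of the three lookup outcomes applies to key.
func describe(items map[string]*Node, key string) string {
	n, ok := items[key]
	switch {
	case !ok:
		return "missing" // no entry; n is the zero value, i.e. nil
	case n == nil:
		return "present but nil" // an entry exists, yet it holds a nil pointer
	default:
		return "present"
	}
}

func main() {
	items := map[string]*Node{
		"":     &Node{Path: ""},
		"host": nil, // a nil pointer stored under a real key
	}
	for _, key := range []string{"", "host", "nope"} {
		fmt.Printf("%q: %s\n", key, describe(items, key))
	}
}
```

Checking only ok misses the "present but nil" case, and checking only for nil cannot tell the other two apart.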

Member

I agree with @replay. An index that is empty is not corrupt. We should just return nil, nil after line :890

Contributor Author

I just changed the code to allow for the root node to not exist, while still treating found-but-nil nodes as a corruption case: 3f78925
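
A minimal sketch of that policy, assuming the root lives under the key "" as in the snippet under review. `rootNode`, the `Node` type and the error text are hypothetical; the actual change is in the commit referenced above.

```go
package main

import (
	"errors"
	"fmt"
)

// Node is a stand-in for the index's tree node type (illustration only).
type Node struct{ Path string }

// rootNode returns the root of the tree.
// A missing root means an empty index: no node and no error.
// A root entry that holds nil is reported as an error.
func rootNode(items map[string]*Node) (*Node, error) {
	n, ok := items[""]
	if !ok {
		return nil, nil // empty index, not corrupt
	}
	if n == nil {
		return nil, errors.New("root node is nil, index may be corrupt")
	}
	return n, nil
}

func main() {
	empty := map[string]*Node{}
	corrupt := map[string]*Node{"": nil}
	healthy := map[string]*Node{"": &Node{Path: ""}}

	for _, items := range []map[string]*Node{empty, corrupt, healthy} {
		n, err := rootNode(items)
		fmt.Printf("found=%t err=%v\n", n != nil, err)
	}
}
```

A caller would return no results when both values are nil and surface an internal error otherwise.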

@@ -61,6 +62,19 @@ func NewError(code int, err string) *ErrorResp {
}
}

func NewInternalError(err string) *ErrorResp {
Member

@woodsaj woodsaj Jan 15, 2018

It doesn't sit well with me to have idx/memory depend on api/response when the idx has a lot of uses that don't involve the api.

response.WrapError() will set the status code correctly if the passed-in error implements the response.Error interface.

Contributor Author

I agree. Ideally none of the internal libraries should import any api libraries. To be more precise: none of the internal libraries should be concerned with HTTP implementation details.
But those internal libraries should be able to denote whether an error is caused by the user or by the MT platform. The most practical way to do this is still to use the HTTP codes (hence your Error interface).
We could create an internal error type that denotes the difference between these two scenarios (and perhaps a few more), but invariably it would tie into HTTP status codes anyway, so I figured we might as well put it with the rest of that stuff.

What's the alternative? Create another internal error type outside of api/response, but it would still have to be able to set code 500, 400, etc. (so in practice it would still import the http package, because that lets us refer to http.StatusBadRequest and so forth). Would that be your preferred approach? (It would be mine, but I abandoned that approach and forgot why.)

Member

create another internal error type outside of api/response

That would be my approach. It seems excessive to have to redefine an error struct that implements the response.Error interface in every package. So maybe we just create a top-level metrictank/errors package and move instance.Error and instance.ErrorResp there (or just copy them for now and open another issue to update all of the api package to use it).

Contributor Author

Assuming you meant response.Error instead of instance.Error: that's really an interface describing an error type that carries useful information in an HTTP context. It makes sense for that to live in our api package, which deals with all HTTP stuff, not in the new errors package, which tries to be generic and not tied to HTTP.
That said, the new error types do satisfy the interface, of course, but that's about as far as the errors package should go. I think this is in line with how and where interfaces are typically defined in Go software. See 4212925 for the new version.
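
A rough sketch of the split discussed in this thread, collapsed into one file so it runs on its own. All names here (`Internal`, `BadRequest`, `ResponseError`, `wrapError`) are hypothetical and are not the identifiers from the referenced commit. Treating unknown errors as internal is an assumption of this sketch, not something the thread states about response.WrapError.

```go
package main

import (
	"errors"
	"fmt"
)

// Generic error types: these would live in a package that knows nothing about the API.

// Internal marks an error caused by the platform.
type Internal struct{ msg string }

func NewInternal(msg string) Internal { return Internal{msg: msg} }
func (e Internal) Error() string      { return e.msg }
func (e Internal) Code() int          { return 500 }

// BadRequest marks an error caused by the user.
type BadRequest struct{ msg string }

func NewBadRequest(msg string) BadRequest { return BadRequest{msg: msg} }
func (e BadRequest) Error() string        { return e.msg }
func (e BadRequest) Code() int            { return 400 }

// API side: the interface is defined where it is consumed.

// ResponseError is what the API layer needs in order to build a response.
type ResponseError interface {
	error
	Code() int
}

// wrapError keeps the code of errors that already carry one.
// Assumption of this sketch: anything else is treated as internal.
func wrapError(err error) ResponseError {
	var re ResponseError
	if errors.As(err, &re) {
		return re
	}
	return NewInternal(err.Error())
}

func main() {
	errs := []error{
		NewBadRequest("invalid pattern"),
		NewInternal("grandChild is nil"),
		errors.New("some plain error"),
	}
	for _, err := range errs {
		re := wrapError(err)
		fmt.Println(re.Code(), re.Error())
	}
}
```

With this layout the index code only needs the generic types, while the interface stays with the code that turns errors into responses.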

it's legal for even the root node to not be in the tree,
e.g. if everything was deleted. so we need to simply return
no results in that case
Contributor

@replay replay left a comment

looks good now

@Dieterbe Dieterbe merged commit 4212925 into master Jan 24, 2018
Dieterbe added a commit that referenced this pull request Jan 24, 2018
@Dieterbe Dieterbe deleted the diagnose-idx-nil-node branch September 18, 2018 09:07
@Dieterbe Dieterbe modified the milestones: 1.1, 0.8.1 Dec 12, 2018

4 participants