Configurable index op timeout #1880

shanson7 · 2020-08-12T11:07:50Z

IMO, the failing of this change is that it will return a partial response instead of an error. Alternative options are to refactor to return errors from TagQueryContext.Run() or to panic.

docker/docker-chaos/metrictank.ini

replay

LGTM, just a minor comment

shanson7 · 2020-08-21T10:54:15Z

Could someone merge or take a look?

Dieterbe · 2020-08-24T11:33:00Z

> IMO, the failing of this change is that it will return a partial response instead of an error. Alternative options are to refactor to return errors from TagQueryContext.Run() or to panic

yeah this seems like a problem. is this still the case with this pr?

shanson7 · 2020-08-24T11:36:57Z

> > IMO, the failing of this change is that it will return a partial response instead of an error. Alternative options are to refactor to return errors from TagQueryContext.Run() or to panic
>
> yeah this seems like a problem. is this still the case with this pr?

Yes, still the case. There is no error response path for this function, so the signature would need to change and all callers would need to be updated to check for an error.

Dieterbe · 2020-08-24T11:38:50Z

@replay your thoughts?

replay · 2020-08-24T14:52:13Z

I think changing the index interface is the only option to avoid returning incomplete results. This will require updating quite a few places in the code, but it won't be a complicated change. If we change the index interface anyway, then we should also consider passing a context into the index, so the index methods can be cancelled when, e.g., the client closes the connection, which is likely to happen before the default 5min timeout.

I would avoid panics in the index as much as possible, just to be sure that we never forget to unlock a lock.

Dieterbe · 2020-08-26T18:17:03Z

@shanson7 do you want us to make this api change or will you do it?

shanson7 · 2020-08-27T08:46:31Z

I can do it, but maybe not for a couple of weeks. If one of you wants to do it earlier, just let me know.

shanson7 · 2020-09-16T09:42:36Z

Looking into this more, returning an error is non-trivial. Some parts of the system use tag lookups internally (i.e. not from a user request context). In these cases, it is fairly important to propagate an error. However, most of the time the records are processed as a stream and the error can happen at any point during the stream processing.

So, returning an error from TagQueryContext.Run is easy enough, but we need to look at how this function is used.

Some examples of usage:

idsByTagQuery - Callers pass in idCh and iterate over it after the call (it is populated asynchronously). This function could additionally require an error pointer/chan and all callers would need to be updated to select from the err chan or check err when done with idCh.

Metatags - used with idsByTagQueryIntoCallback. Each record calls the callback independently. Action is taken each time. Does this need to be transactional? Should the callback take an error?

idSelector - Also runs subqueries for metatags. Also async...

replay · 2020-09-18T20:22:33Z

> idsByTagQuery - Callers pass in idCh and iterate over it after the call (it is populated asynchronously). This function could additionally require an error pointer/chan and all callers would need to be updated to select from the err chan or check err when done with idCh.

I think making idsByTagQuery() take an error chan errCh in addition to the current idCh is the nicest solution. Assuming that TagQueryContext.Run is going to close the idCh when an error occurs, errCh wouldn't need to be checked on every iteration over the elements from idCh. It would be sufficient to check errCh once, after the idCh loop is done.
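A rough sketch of that errCh pattern follows. The names `queryIDs`, `collect`, and `errIndexTimeout` are made up for illustration, and the `failAfter` parameter only simulates a timeout; none of this is the real idsByTagQuery() signature.

```go
package main

import (
	"errors"
	"fmt"
)

// errIndexTimeout is a hypothetical error value for this sketch.
var errIndexTimeout = errors.New("index operation timed out")

// queryIDs is a hypothetical producer. It populates idCh asynchronously and
// always closes it. If it fails (simulated at position failAfter, or never
// when failAfter is negative) it first sends one error on errCh.
func queryIDs(ids []string, failAfter int, idCh chan<- string, errCh chan<- error) {
	go func() {
		defer close(idCh) // runs after the error has been sent
		for i, id := range ids {
			if i == failAfter {
				errCh <- errIndexTimeout // errCh is buffered, so this does not block
				return
			}
			idCh <- id
		}
	}()
}

// collect drains idCh and checks errCh exactly once, after the loop.
func collect(ids []string, failAfter int) ([]string, error) {
	idCh := make(chan string)
	errCh := make(chan error, 1)
	queryIDs(ids, failAfter, idCh, errCh)

	var got []string
	for id := range idCh {
		got = append(got, id)
	}
	select {
	case err := <-errCh:
		return nil, err // discard the partial result instead of returning it
	default:
		return got, nil
	}
}

func main() {
	ids := []string{"a", "b", "c"}
	got, err := collect(ids, -1)
	fmt.Println(got, err)
	got, err = collect(ids, 1)
	fmt.Println(got, err)
}
```

The single check after the loop is safe here because the producer sends the error before it closes idCh, so by the time the `range` loop ends the error is already sitting in the buffered errCh.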

> Metatags - used with idsByTagQueryIntoCallback. Each record calls the callback independently. Action is taken each time. Does this need to be transactional? Should the callback take an error?

This one is hard because we also need to decide what to do if a lookup by a method such as MetaTagRecordSwap() times out. We can't just skip some enricher updates, because we might end up in an inconsistent state that is only recoverable by restarting Metrictank. If the index lookup can time out, then we need to make the meta record swap operation transactional, as you suggest, so that we can roll back all the changes that have already been applied and attempt the same swap again later. That won't be trivial, so for now I'd just disable the timeout for index lookups by MetaTagRecordSwap() and MetaTagRecordUpsert().

> idSelector - Also runs subqueries for metatags. Also async...

TagQueryContext.Run() could accept an optional timeoutChan as a parameter and only create its own when none is given. The idSelector could then pass the timeoutChan of the top-level TagQueryContext.Run() through to the TagQueryContext.Run() calls that execute its sub-queries.

shanson7 · 2020-10-15T12:37:20Z

I think the "full" solution is too much effort for me at the moment. I still need a hard stop timeout (such as is provided by this PR) to prevent issues with a locked index.

Dieterbe · 2020-10-26T18:04:20Z

ok, so then here's what i suggest: in the configs, in the comments above the setting, we say it's an experimental feature and that it has this known limitation. we also set it to default to 0 everywhere and make sure that 0 disables the feature.

those who want the feature can then opt in, and be aware of the limitation.
others, who never want timed-out requests to return broken responses, can simply ignore this patch and won't be impacted when they upgrade.

shanson7 · 2020-10-29T09:19:23Z

Makes sense to me

shanson7 · 2020-11-04T13:29:33Z

@Dieterbe - Ready for another review when you get the chance

Dieterbe · 2020-11-10T10:48:38Z

see #1944

shanson7 added 4 commits August 12, 2020 11:49

Hardcode 60s timeout for index requests

f766844

Add configurable timeout

702593e

Update ini files and docs

7768399

Keep spaces to appease qa

07002fa

Dieterbe requested a review from replay August 12, 2020 11:34

Dieterbe added this to the sprint-16 milestone Aug 12, 2020

Stop the timer to release resources

4df235e

replay reviewed Aug 12, 2020

replay approved these changes Aug 12, 2020

fkaleo modified the milestones: sprint-16, sprint-17 Sep 21, 2020

fkaleo assigned replay Sep 21, 2020

Dieterbe modified the milestones: sprint-17, sprint-18 Oct 28, 2020

shanson7 added 4 commits October 29, 2020 15:53

0 means disabled and is default

c217e30

Clean up timeout behavior

bece001

Update ini files

2e922c4

Update docs

71a90ea

Dieterbe assigned Dieterbe and unassigned replay Nov 4, 2020

Dieterbe mentioned this pull request Nov 10, 2020

Index op timeout #1944

Merged

Dieterbe closed this Nov 10, 2020

shanson7 deleted the index_op_timeout branch October 22, 2021 16:14
