Skip to content
This repository has been archived by the owner on Aug 23, 2023. It is now read-only.

Configurable index op timeout #1880

Closed
wants to merge 9 commits into from

Conversation

shanson7
Copy link
Collaborator

Fixes #1870

IMO, the failing of this change is that it will return a partial response instead of an error. Alternative options are to refactor to return errors from TagQueryContext.Run() or to panic.

@Dieterbe Dieterbe requested a review from replay August 12, 2020 11:34
@Dieterbe Dieterbe added this to the sprint-16 milestone Aug 12, 2020
Copy link
Contributor

@replay replay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a minor comment

@shanson7
Copy link
Collaborator Author

Could someone merge or take a look?

@Dieterbe
Copy link
Contributor

IMO, the failing of this change is that it will return a partial response instead of an error. Alternative options are to refactor to return errors from TagQueryContext.Run() or to panic

yeah this seems like a problem. is this still the case with this pr?

@shanson7
Copy link
Collaborator Author

IMO, the failing of this change is that it will return a partial response instead of an error. Alternative options are to refactor to return errors from TagQueryContext.Run() or to panic

yeah this seems like a problem. is this still the case with this pr?

Yes, still the case. There is no error response path for this function, so the signature would need to change and all callers would need to be updated to check for error

@Dieterbe
Copy link
Contributor

@replay your thoughts?

@replay
Copy link
Contributor

replay commented Aug 24, 2020

I think changing the index interface is the only option to avoid returning incomplete results. This will require updating quite a few places of code, but it won't be a complicated change. If we change the index interface anyway, then we should consider also passing a context into the index, so the index methods can be cancelled when f.e. the client closes the connection which is likely to happen before the default 5min timeout.

I would avoid panics in the index as much as possible, just to be sure that we never forget to unlock a lock

@Dieterbe
Copy link
Contributor

@shanson7 do you want us to make this api change or will you do it?

@shanson7
Copy link
Collaborator Author

I can do it, but maybe not for a couple weeks. If one of you wants to do it earlier, just let me know.

@shanson7
Copy link
Collaborator Author

Looking into this more, returning an error is non-trivial. Some parts of the system use tag lookups internally (i.e. not from a user request context). In these cases, it is fairly important to propagate an error. However, most of the time the records are processed as a stream and the error can happen at any point during the stream processing.

So, returning an error from TagQueryContext.Run is easy enough, but we need to look at how this function is used.

Some examples of usage:

idsByTagQuery - Callers pass in idCh and iterate over it after the call (it is populated asynchronously). This function could additionally require an error pointer/chan and all callers would need to be updated to select from the err chan or check err when done with idCh.

Metatags - used with idsByTagQueryIntoCallback. Each record calls the callback independently. Action is taken each time. Does this need to be transactional? Should the callback take an error?

idSelector - Also runs subqueries for metatags. Also async...

@replay
Copy link
Contributor

replay commented Sep 18, 2020

idsByTagQuery - Callers pass in idCh and iterate over it after the call (it is populated asynchronously). This function could additionally require an error pointer/chan and all callers would need to be updated to select from the err chan or check err when done with idCh.

I think making idsByTagQuery() take an error chan errCh in addition to the current idCh is the nicest solution. Assuming that TagQueryContext.Run is going to close the idCh when an error occurs, this errCh wouldn't need to be checked at every iteration which is iterating over the elements from idCh, it would be sufficient to only check errCh after the idCh loop is done.

Metatags - used with idsByTagQueryIntoCallback. Each record calls the callback independently. Action is taken each time. Does this need to be transactional? Should the callback take an error?

This one is hard because we also need to decide what to do if a lookup by a method such as MetaTagRecordSwap() times out. We can't just skip some enricher updates, because otherwise we might end up in an inconsistent state which is only recoverable by restarting Metrictank. If there's potential for the index lookup to time out then we need to make the meta record swap operation transactional, as you suggest, so we'd have a way to roll-back all the changes which have already been applied and attempt to execute the same swap again later, but this won't be trivial. So for now I'd just disable the time-out for index lookups by MetaTagRecordSwap() and MetaTagRecordUpsert().

idSelector - Also runs subqueries for metatags. Also async...

if TagQueryContext.Run() would accept an optional timeoutChan as a parameter, in which case it then wouldn't make its own, then the timeoutChan of the top TagQueryContext.Run() could be passed through by the idSelector to the TagQueryContext.Run() which are executing its sub-queries.

@fkaleo fkaleo modified the milestones: sprint-16, sprint-17 Sep 21, 2020
@shanson7
Copy link
Collaborator Author

I think the "full" solution is too much effort for me at the moment. I still need a hard stop timeout (such as is provided this PR) to prevent issues with a locked index.

@Dieterbe
Copy link
Contributor

ok, so then here's what i suggest: in the configs, in the comments above the setting. we say it's an experimental feature and that we have this known limitation. we also set it to default to 0 everywhere and make sure that 0 disables the feature.

those who want the feature can then opt-in, and be aware of the limitation.
whereas others who never want time-outing requests to return broken responses, can then simply ignore this patch and they wouldn't be impacted when they upgrade.

@Dieterbe Dieterbe modified the milestones: sprint-17, sprint-18 Oct 28, 2020
@shanson7
Copy link
Collaborator Author

Makes sense to me

@shanson7
Copy link
Collaborator Author

shanson7 commented Nov 4, 2020

@Dieterbe - Ready for another review when you get the chance

@Dieterbe Dieterbe assigned Dieterbe and unassigned replay Nov 4, 2020
@Dieterbe Dieterbe mentioned this pull request Nov 10, 2020
@Dieterbe
Copy link
Contributor

see #1944

@Dieterbe Dieterbe closed this Nov 10, 2020
@shanson7 shanson7 deleted the index_op_timeout branch October 22, 2021 16:14
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add configurable timeout for index ops
4 participants