check-endpoints: handle out of order results #917

Merged


sanchezl
Contributor

@sanchezl sanchezl commented Jul 24, 2020

When a check's latency exceeds the check period (1s), its result is reported after the results of subsequent checks. This PR introduces a delay to give long-running checks a chance to be processed in the correct order.

Further enhancements in the followup PR: #924

@openshift-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sanchezl
To complete the pull request process, please assign sttts
You can assign the PR to them by writing /assign @sttts in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sanchezl sanchezl marked this pull request as draft July 25, 2020 13:45
@openshift-ci-robot openshift-ci-robot added the do-not-merge/work-in-progress label Jul 25, 2020
@sanchezl sanchezl marked this pull request as ready for review July 25, 2020 16:04
@openshift-ci-robot openshift-ci-robot removed the do-not-merge/work-in-progress label Jul 25, 2020
@sanchezl
Contributor Author

/test e2e-aws
/test e2e-aws-operator

@sanchezl
Contributor Author

/test e2e-aws

4 similar comments
@sanchezl
Contributor Author

/test e2e-aws

@sanchezl
Contributor Author

/test e2e-aws

@sanchezl
Contributor Author

/test e2e-aws

@sanchezl
Contributor Author

/test e2e-aws

@@ -186,7 +185,7 @@ func isDNSError(err error) bool {

// manageStatusLogs returns a status update function that updates the PodNetworkConnectivityCheck.Status's
// Successes/Failures logs reflect the results of the check.
-func manageStatusLogs(check *operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, checkErr error, latency *trace.LatencyInfo) []v1alpha1helpers.UpdateStatusFunc {
+func manageStatusLogs(check *operatorcontrolplanev1alpha1.PodNetworkConnectivityCheck, checkErr error, latency *trace.LatencyInfo) ([]v1alpha1helpers.UpdateStatusFunc, time.Time) {
Contributor

update doc. This time is the time the check started?

Contributor Author

Updated.

// UpdatesManager manages a queue of updates. The lock must be obtained before
// invoking any of the methods.
type UpdatesManager interface {
sync.Locker
Contributor

no direct embedding.

Contributor Author

Updated.

// Add an update to the queue. There is a delay equal to the size of the sorting window before
// updates are made available on the queue to allow for updates submitted out of order within
// the sorting window to be sorted by timestamp.
func (u *updatesManager) Add(timestamp time.Time, updates ...v1alpha1helpers.UpdateStatusFunc) {
Contributor

why are you managing the lock outside of the method? I'd rather manage the lock locally here.

Contributor Author

Refactored.

// outside of a sorting window, anchored on one end by the latest update, for processing.
type updatesManager struct {
sync.Mutex
window time.Duration
Contributor

you need to doc these. They aren't obvious.

Contributor Author

Done.

// Add an update to the queue. There is a delay equal to the size of the sorting window before
// updates are made available on the queue to allow for updates submitted out of order within
// the sorting window to be sorted by timestamp.
func (u *updatesManager) Add(timestamp time.Time, updates ...v1alpha1helpers.UpdateStatusFunc) {
Contributor

needs test

Contributor Author

Added in followup #924.

_, _, err := v1alpha1helpers.UpdateStatus(ctx, c.client, c.name, c.updates...)
c.updates.Lock()
defer c.updates.Unlock()
if len(c.updates.Queue()) > 20 {
Contributor

make this function return a copy so that you don't need to manage the lock here.

Contributor

> make this function return a copy so that you don't need to manage the lock here.

oh, blech. The Clear is being caught up in here too.

How about just making UpdateStatus native on the updatemanager.

Contributor Author

Refactored.

})

latestTimestamp := u.timestamps[len(u.timestamps)-1]
tmp := u.timestamps[:0]
Contributor

what does this do?

Contributor Author

I'm re-using the array backing the u.timestamps slice.

Contributor

> I'm re-using the array backing the u.timestamps slice.

let's just make another and burn the memory

Contributor Author

Done.
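
For readers unfamiliar with the `s[:0]` idiom discussed in this thread, the sketch below contrasts it with the allocate-a-new-slice approach the reviewer preferred. It uses `int` values instead of timestamps, and `filterInPlace` and `filterCopy` are hypothetical helper names, not functions from the PR.

```go
package main

import "fmt"

// filterInPlace keeps elements >= min, reusing s's backing array via s[:0].
// No allocation happens, but the caller's original slice is overwritten.
func filterInPlace(s []int, min int) []int {
	kept := s[:0]
	for _, v := range s {
		if v >= min {
			kept = append(kept, v)
		}
	}
	return kept
}

// filterCopy keeps elements >= min in a freshly allocated slice,
// leaving the input untouched.
func filterCopy(s []int, min int) []int {
	kept := make([]int, 0, len(s))
	for _, v := range s {
		if v >= min {
			kept = append(kept, v)
		}
	}
	return kept
}

func main() {
	a := []int{1, 5, 2, 7}
	fmt.Println(filterCopy(a, 4), a)    // [5 7] [1 5 2 7]
	fmt.Println(filterInPlace(a, 4), a) // [5 7] [5 7 2 7]
}
```

The in-place version saves an allocation at the cost of a side effect on the input that is easy to miss in review, which is why the copy is often the clearer choice for small slices.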


latestTimestamp := u.timestamps[len(u.timestamps)-1]
tmp := u.timestamps[:0]
for _, timestamp := range u.timestamps {
Contributor

yeah, you need to comment this. Looks like this is an attempt at a sliding window of results based on time. Your window needs to be at least one second larger than the delay, though.

Do you need to delay the events too?

Contributor Author

The window is initialized as delay + check period, so in this case 10s (conn timeout) + 1s (check period).

Contributor Author

See followup PR where I've tried to improve this further: #924.

@sanchezl
Contributor Author

/test e2e-aws

@sanchezl
Contributor Author

/retest

@openshift-merge-robot openshift-merge-robot merged commit 079e7a0 into openshift:master Aug 10, 2020