
Job execution calculation not displayed if an integration alert delivered (or a job runs long) #219

Closed
danemacmillan opened this issue Feb 12, 2019 · 5 comments

Comments

danemacmillan commented Feb 12, 2019

In the log for job executions, if one uses the /start endpoint, the next request will display the job execution time:

However, if a notification or alert is sent between the "Started" and "OK" requests, the job's execution time is not displayed.

Edit:

Now I'm thinking it just has to do with a job that runs long. I've been adjusting some jobs that moved to slower disks and now take about three hours to run; I adjusted the grace time accordingly, so alerts are no longer being triggered.

Have a look at the last few dozen logs:

@danemacmillan danemacmillan changed the title Runtime calculation not displayed if an integration alert delivered Job execution calculation not displayed if an integration alert delivered Feb 12, 2019
@danemacmillan danemacmillan changed the title Job execution calculation not displayed if an integration alert delivered Job execution calculation not displayed if an integration alert delivered (or a job runs long) Feb 20, 2019

cuu508 commented Feb 21, 2019

Yep, the run times are not shown in the UI if the job takes longer than 1 hour to run: https://github.com/healthchecks/healthchecks/blob/master/hc/front/views.py#L415

A couple of reasons for this:

  • I assumed the run times would typically be in seconds or minutes, not hours. If a ping arrives very late, it might be unrelated to the last /start. Take this example: a worker process pings the /start endpoint and promptly crashes. Three days later somebody notices and, while investigating, manually pings the ping URL. The runtime now shows "3 days X hours", which is not useful.
  • Fewer worries about formatting and displaying the run time in the UI. The run time will always be in "X min Y sec" or "Y sec" form, so the UI string cannot get too long and cause problems. IOW, me being lazy ;-)

I'm thinking that raising the limit from 1 hour to 24 hours (so for long jobs the UI would show "X h Y min Z sec") would probably be enough, right?
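
For reference, the behavior described above amounts to a fixed cutoff roughly like this (an illustrative sketch with made-up names, not the actual code in hc/front/views.py):

```python
from datetime import timedelta

# Sketch only: durations between a /start ping and the next ping are
# shown in the log only when they fall under a fixed one-hour limit.
DURATION_LIMIT = timedelta(hours=1)

def display_duration(start_time, end_time):
    """Return the run time to show in the log, or None to hide it."""
    if start_time is None or end_time is None:
        return None

    duration = end_time - start_time
    if duration >= DURATION_LIMIT:
        # Very late pings are assumed to be unrelated to the last /start.
        return None

    return duration
```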


danemacmillan commented Feb 21, 2019

Point number 1 makes sense (as does number 2 😉 ). I know, at least personally, that 24h would be more than enough.

Perhaps use the Grace Time setting to determine whether a ping is (a) the end of a currently executing job or (b) a newly started job? This way you avoid arbitrarily opening up the window so wide, while basing it on a concrete setting that plays well with this functionality. For example, for these long-running jobs I've set the Grace Time to 200 minutes (3h20m). I think it's a good indicator, while also minimizing the likelihood of point number 1.

Perhaps make it Grace Time + 1 hour of padding? That way, even if an alert is sent out (exceeding the Grace Time), the extra padding would still allow the calculation to be done when the job eventually pings, so the original description in this ticket wouldn't actually come true. Beyond that, perhaps add an explicit note in the log that the previous job likely failed, because by that point the software has done its due diligence.

@cuu508 cuu508 closed this as completed in 954ca45 Mar 1, 2019

cuu508 commented Mar 1, 2019

Thanks, these are good ideas. I went with your suggested Grace Time + 1 hour formula. I also clamped it to 12 hours max.
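
A minimal sketch of the resulting cutoff, assuming the same illustrative names as the sketch above (the real change is in commit 954ca45 and may be structured differently):

```python
from datetime import timedelta

# Sketch only: the display window becomes Grace Time + 1 hour,
# clamped to 12 hours.
MAX_WINDOW = timedelta(hours=12)

def duration_window(grace):
    """Longest run time that still gets a displayed duration."""
    return min(grace + timedelta(hours=1), MAX_WINDOW)

def display_duration(start_time, end_time, grace):
    if start_time is None or end_time is None:
        return None

    duration = end_time - start_time
    if duration > duration_window(grace):
        return None

    return duration
```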


danemacmillan commented Mar 1, 2019

Sweet! I already retroactively see the change in logic:

[screenshot from 2019-03-01: log now showing run times for the long-running jobs]

Something I just considered, and I don't expect to need it: should a job ever become radically faster and I decrease the grace time, the run times of older executions with longer grace times may disappear, if the grace time is not persisted with the particular execution of the job. That would be the case if it's stored as a high-level property or descriptive attribute of the job that applies to every run of it, past, present, and future. I don't know whether that's the case, as I haven't read through the code very thoroughly. Storing the current grace time value at the time of the execution, if not already being done, along with the job execution results would address this. I mention this only as a consideration; I know it would increase complexity, and I don't see any urgency. It's more one of those back-of-the-mind quirks to be aware of.
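
A hypothetical sketch of that idea, using made-up Django model and field names rather than the project's actual schema:

```python
from django.db import models

class Ping(models.Model):
    owner = models.ForeignKey("Check", models.CASCADE)
    created = models.DateTimeField(auto_now_add=True)
    # Snapshot of the check's grace period at the time this ping arrived,
    # so later changes to the check's grace time don't retroactively hide
    # or reveal historic run times.
    grace_at_ping = models.DurationField(null=True)
```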


cuu508 commented Mar 6, 2019

That's a good point – changing the grace time can make the historic execution times flip on and off.

Storing the current grace time value at the time of the execution, if not already being done, along with the job execution results would address this.

Will keep this in mind. In future we might build new features that require more data to be stored with each ping. At that time it would make sense to improve this area as well.
