
Job execution calculation not displayed if an integration alert delivered (or a job runs long) #219

Closed
danemacmillan opened this issue Feb 12, 2019 · 5 comments

Comments

danemacmillan commented Feb 12, 2019

In the log for job executions, if one uses the /start endpoint, the next request will display the job execution time:

However, if a notification or alert is sent between the "Started" and "OK" requests, the job's execution time is not displayed.

Edit:

Now I'm thinking it just has to do with a job that runs long. I've been adjusting some jobs that moved to slower disks and now take about three hours to run; I adjusted the grace time accordingly, so alerts are no longer being triggered.

Have a look at the last few dozen logs:

@danemacmillan danemacmillan changed the title Runtime calculation not displayed if an integration alert delivered Job execution calculation not displayed if an integration alert delivered Feb 12, 2019
@danemacmillan danemacmillan changed the title Job execution calculation not displayed if an integration alert delivered Job execution calculation not displayed if an integration alert delivered (or a job runs long) Feb 20, 2019

cuu508 commented Feb 21, 2019

Yep, the run times are not shown in the UI if the job takes longer than 1 hour to run: https://github.com/healthchecks/healthchecks/blob/master/hc/front/views.py#L415

A couple of reasons for this:

  • I assumed the run times would typically be in seconds or minutes, not hours. If a ping arrives very late, it might be unrelated to the last /start. Take this example: a worker process pings the /start endpoint and promptly crashes. Three days later somebody notices and, while investigating, manually pings the ping URL. The runtime now shows "3 days X hours", which is not useful.
  • Fewer worries about formatting and displaying the run time in the UI. The run time will always be in "X min Y sec" or "Y sec" form, so the UI string cannot get too long and cause problems. IOW, me being lazy ;-)

I'm thinking that raising the limit from 1 hour to 24 hours (so for long jobs the UI would show "X h Y min Z sec") would probably be enough, right?
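
For reference, the behavior described above amounts to a fixed cutoff roughly like this (an illustrative sketch with made-up names, not the actual code in hc/front/views.py):

```python
from datetime import timedelta

# Sketch only: durations between a /start ping and the next ping are
# shown in the log only when they fall under a fixed one-hour limit.
DURATION_LIMIT = timedelta(hours=1)

def display_duration(start_time, end_time):
    """Return the run time to show in the log, or None to hide it."""
    if start_time is None or end_time is None:
        return None

    duration = end_time - start_time
    if duration >= DURATION_LIMIT:
        # Very late pings are assumed to be unrelated to the last /start.
        return None

    return duration
```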


danemacmillan commented Feb 21, 2019

Point number 1 makes sense (as does number 2 😉 ). I know, at least personally, that 24h would be more than enough.

Perhaps use the Grace Time setting to determine whether a ping is (a) the end of a currently executing job or (b) a newly started job? This way you avoid arbitrarily opening up the window so wide, while basing it on a concrete setting that plays well with this functionality. For example, for these long-running jobs I've set the Grace Time to 200 minutes (3h20m). I think it's a good indicator, while also minimizing the likelihood of point number 1.

Perhaps make it Grace Time + 1 hour of padding? That way, even if an alert is sent out (exceeding the Grace Time), the extra padding would still allow the calculation to be done when the job eventually pings, so the original description in this ticket wouldn't actually come true. Beyond that, perhaps add an explicit note in the log that the previous job likely failed, because by that point the software has done its due diligence.

@cuu508 cuu508 closed this as completed in 954ca45 Mar 1, 2019

cuu508 commented Mar 1, 2019

Thanks, these are good ideas. I went with your suggested Grace Time + 1 hour formula. I also clamped it to 12 hours max.
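
A minimal sketch of the resulting cutoff, assuming the same illustrative names as the sketch above (the real change is in commit 954ca45 and may be structured differently):

```python
from datetime import timedelta

# Sketch only: the display window becomes Grace Time + 1 hour,
# clamped to 12 hours.
MAX_WINDOW = timedelta(hours=12)

def duration_window(grace):
    """Longest run time that still gets a displayed duration."""
    return min(grace + timedelta(hours=1), MAX_WINDOW)

def display_duration(start_time, end_time, grace):
    if start_time is None or end_time is None:
        return None

    duration = end_time - start_time
    if duration > duration_window(grace):
        return None

    return duration
```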


danemacmillan commented Mar 1, 2019

Sweet! I already retroactively see the change in logic:

[screenshot from 2019-03-01: log now showing run times for the long-running jobs]

Something I just considered, and I don't expect to need it: should a job ever become radically faster and I decrease the grace time, the run times of older executions with longer grace times may disappear, if the grace time is not persisted with the particular execution of the job. That would be the case if it's stored as a high-level property or descriptive attribute of the job that applies to every run of it, past, present, and future. I don't know whether that's the case, as I haven't read through the code very thoroughly. Storing the current grace time value at the time of the execution, if not already being done, along with the job execution results would address this. I mention this only as a consideration; I know it would increase complexity, and I don't see any urgency. It's more one of those back-of-the-mind quirks to be aware of.
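
A hypothetical sketch of that idea, using made-up Django model and field names rather than the project's actual schema:

```python
from django.db import models

class Ping(models.Model):
    owner = models.ForeignKey("Check", models.CASCADE)
    created = models.DateTimeField(auto_now_add=True)
    # Snapshot of the check's grace period at the time this ping arrived,
    # so later changes to the check's grace time don't retroactively hide
    # or reveal historic run times.
    grace_at_ping = models.DurationField(null=True)
```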


cuu508 commented Mar 6, 2019

That's a good point – changing the grace time can make the historic execution times flip on and off.

Storing the current grace time value at the time of the execution, if not already being done, along with the job execution results would address this.

Will keep this in mind. In future we might build new features that require more data to be stored with each ping. At that time it would make sense to improve this area as well.
