Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race condition in push monitors #1422

Closed
2 tasks done
kaysond opened this issue Mar 27, 2022 · 3 comments · Fixed by #1428 or louislam/redbean-node#7
Closed
2 tasks done

Race condition in push monitors #1422

kaysond opened this issue Mar 27, 2022 · 3 comments · Fixed by #1428 or louislam/redbean-node#7
Labels
bug Something isn't working

Comments

@kaysond
Copy link
Contributor

kaysond commented Mar 27, 2022

⚠️ Please verify that this bug has NOT been raised before.

  • I checked and didn't find similar issue

🛡️ Security Policy

Description

There is a race condition in the code for push monitors that can cause repeat notifications within the same second or shortly thereafter. (see some background in #922)
image

http logs show that the GET calls are generally 60s apart, but because of discrepancies in system time down to the ms level, it can appear that they happen a second apart, even if they're very close together (ie.e 17:58:05.999 to 17:59:06.000)

[25/Mar/2022:17:54:05 -0700]  "GET /api/push/xxxx?msg=OK&ping=92.177 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
[25/Mar/2022:17:55:05 -0700] "GET /api/push/xxxx?msg=OK&ping=92.196 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
-----------------------------
[25/Mar/2022:17:58:05 -0700] "GET /api/push/xxx?msg=OK&ping=90.552 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
[25/Mar/2022:17:59:06 -0700]  "GET /api/push/xxx?msg=OK&ping=91.308 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567

👟 Reproduction steps

Repro is tricky, but it could be possible by monitoring uptime kuma debug logs and synchronizing to them. Basically, the push URL calls have to happen just shortly after the monitor's calls to beat() or safe_beat()

👀 Expected behavior

Uptime kuma should be robust against very small clock discrepancies

😓 Actual Behavior

Get double notifications

🐻 Uptime-Kuma Version

1.12.1

💻 Operating System and Arch

Docker on Ubuntu 20.04

🌐 Browser

Firefox

🐋 Docker Version

Latest

🟩 NodeJS Version

No response

📝 Relevant log output

No response

@kaysond kaysond added the bug Something isn't working label Mar 27, 2022
@kaysond
Copy link
Contributor Author

kaysond commented Mar 27, 2022

The race condition is caused because the monitor is running on a setTimeout that is totally independent of the call of the push URL. That means that beat() and router.get("/api/push/:pushToken") can both get called at the same time in seconds (e.g. both at 17:15:02), but if beat() happens first, it will mark the monitor as down, then it will immediately mark it as up when router.get() is run.

This is partly an issue because setTimeout is not a guaranteed execution time. Node will run the function at least as late as the timeout, but it could be later. This means that the time at which beat()/safeBeat() are run will drift! Suppose we start with beat() being called at 16:15:30, 16:16:30, and so on, and router.get() being called at 16:15:50, 16:16:50, and so on. This won't cause any issues because every call to beat() will see the previous heartbeat.

Because of the behavior of setTimeout, beat() will start being called later and later (or if you just have bad luck the calls could get synchronized somehow on startup). At some point, beat() will also start getting called at xx:xx:50, causing the problem.

} else if (this.type === "push") { // Type: Push
const time = R.isoDateTime(dayjs.utc().subtract(this.interval, "second"));
let heartbeatCount = await R.count("heartbeat", " monitor_id = ? AND time > ? ", [
this.id,
time
]);
debug("heartbeatCount" + heartbeatCount + " " + time);
if (heartbeatCount <= 0) {
// Fix #922, since previous heartbeat could be inserted by api, it should get from database
previousBeat = await Monitor.getPreviousHeartbeat(this.id);
throw new Error("No heartbeat in the time window");
} else {
// No need to insert successful heartbeat for push type, so end here
retries = 0;
this.heartbeatInterval = setTimeout(beat, beatInterval * 1000);
return;
}

if (this.type === "push") {
setTimeout(() => {
safeBeat();
}, this.interval * 1000);
} else {
safeBeat();
}
}

router.get("/api/push/:pushToken", async (request, response) => {
try {
let pushToken = request.params.pushToken;
let msg = request.query.msg || "OK";
let ping = request.query.ping || null;
let monitor = await R.findOne("monitor", " push_token = ? AND active = 1 ", [
pushToken
]);
if (! monitor) {
throw new Error("Monitor not found or not active.");
}
const previousHeartbeat = await Monitor.getPreviousHeartbeat(monitor.id);
let status = UP;
if (monitor.isUpsideDown()) {
status = flipStatus(status);
}
let isFirstBeat = true;
let previousStatus = status;
let duration = 0;
let bean = R.dispense("heartbeat");
bean.time = R.isoDateTime(dayjs.utc());
if (previousHeartbeat) {
isFirstBeat = false;
previousStatus = previousHeartbeat.status;
duration = dayjs(bean.time).diff(dayjs(previousHeartbeat.time), "second");
}
debug("PreviousStatus: " + previousStatus);
debug("Current Status: " + status);
bean.important = Monitor.isImportantBeat(isFirstBeat, previousStatus, status);
bean.monitor_id = monitor.id;
bean.status = status;
bean.msg = msg;
bean.ping = ping;
bean.duration = duration;
await R.store(bean);
io.to(monitor.user_id).emit("heartbeat", bean.toJSON());
Monitor.sendStats(io, monitor.id, monitor.user_id);
response.json({
ok: true,
});
if (bean.important) {
await Monitor.sendNotification(isFirstBeat, monitor, bean);
}
} catch (e) {
response.json({
ok: false,
msg: e.message
});
}
});

I would recommend following the approach of healthchecks. Healthchecks waits for the first "ping" (aka push URL call), then 60s after that ping, checks again to make sure another ping has happened. There is also a grace period to accommodate clock differences. For example, if you have a 5s grace time, the monitor won't be marked as down unless 65s has elapsed since the last ping.

I think the right thing to do, though it may be a little difficult with the way you have the architecture set up, is to do the first setTimeout on monitor.start(), but every other setTimeout should be run from router.get("/api/push:pushToken"). The timeout would be set to interval + grace period, like setTimeout(safebeat, (interval + gracePeriod)*1000).

@kaysond
Copy link
Contributor Author

kaysond commented Apr 12, 2022

technically this is still not finished yet until #1428 is finished and gets merged!

@louislam louislam reopened this Apr 12, 2022
@louislam
Copy link
Owner

technically this is still not finished yet until #1428 is finished and gets merged!

It was closed automatically. Sounds horrible because it could be cross repo.

louislam added a commit that referenced this issue May 29, 2022
Synchronize push monitor heartbeats to api calls (fixes #1422)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
2 participants