Race condition in push monitors #1422

kaysond · 2022-03-27T20:36:46Z

⚠️ Please verify that this bug has NOT been raised before.

I checked and didn't find similar issue

🛡️ Security Policy

I agree to have read this project Security Policy

Description

There is a race condition in the code for push monitors that can cause repeat notifications within the same second or shortly thereafter. (see some background in #922)

http logs show that the GET calls are generally 60s apart, but because of discrepancies in system time down to the ms level, it can appear that they happen a second apart, even if they're very close together (ie.e 17:58:05.999 to 17:59:06.000)

[25/Mar/2022:17:54:05 -0700]  "GET /api/push/xxxx?msg=OK&ping=92.177 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
[25/Mar/2022:17:55:05 -0700] "GET /api/push/xxxx?msg=OK&ping=92.196 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
-----------------------------
[25/Mar/2022:17:58:05 -0700] "GET /api/push/xxx?msg=OK&ping=90.552 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567
[25/Mar/2022:17:59:06 -0700]  "GET /api/push/xxx?msg=OK&ping=91.308 HTTP/1.1" "-" "curl/7.74.0" TLSv1.3 TLS_AES_256_GCM_SHA384 200 5567

👟 Reproduction steps

Repro is tricky, but it could be possible by monitoring uptime kuma debug logs and synchronizing to them. Basically, the push URL calls have to happen just shortly after the monitor's calls to beat() or safe_beat()

👀 Expected behavior

Uptime kuma should be robust against very small clock discrepancies

😓 Actual Behavior

Get double notifications

🐻 Uptime-Kuma Version

1.12.1

💻 Operating System and Arch

Docker on Ubuntu 20.04

🌐 Browser

Firefox

🐋 Docker Version

Latest

🟩 NodeJS Version

No response

📝 Relevant log output

No response

The text was updated successfully, but these errors were encountered:

kaysond · 2022-03-27T20:47:38Z

The race condition is caused because the monitor is running on a setTimeout that is totally independent of the call of the push URL. That means that beat() and router.get("/api/push/:pushToken") can both get called at the same time in seconds (e.g. both at 17:15:02), but if beat() happens first, it will mark the monitor as down, then it will immediately mark it as up when router.get() is run.

This is partly an issue because setTimeout is not a guaranteed execution time. Node will run the function at least as late as the timeout, but it could be later. This means that the time at which beat()/safeBeat() are run will drift! Suppose we start with beat() being called at 16:15:30, 16:16:30, and so on, and router.get() being called at 16:15:50, 16:16:50, and so on. This won't cause any issues because every call to beat() will see the previous heartbeat.

Because of the behavior of setTimeout, beat() will start being called later and later (or if you just have bad luck the calls could get synchronized somehow on startup). At some point, beat() will also start getting called at xx:xx:50, causing the problem.

uptime-kuma/server/model/monitor.js

Lines 309 to 329 in 12237de

    
           } else if (this.type === "push") {      // Type: Push 
        
               const time = R.isoDateTime(dayjs.utc().subtract(this.interval, "second")); 
        
               let heartbeatCount = await R.count("heartbeat", " monitor_id = ? AND time > ? ", [ 
        
                   this.id, 
        
                   time 
        
               ]); 
        
               debug("heartbeatCount" + heartbeatCount + " " + time); 
        
               if (heartbeatCount <= 0) { 
        
                   // Fix #922, since previous heartbeat could be inserted by api, it should get from database 
        
                   previousBeat = await Monitor.getPreviousHeartbeat(this.id); 
        
                   throw new Error("No heartbeat in the time window"); 
        
               } else { 
        
                   // No need to insert successful heartbeat for push type, so end here 
        
                   retries = 0; 
        
                   this.heartbeatInterval = setTimeout(beat, beatInterval * 1000); 
        
                   return; 
        
               }

uptime-kuma/server/model/monitor.js

Lines 468 to 475 in 12237de

    
               if (this.type === "push") { 
        
                   setTimeout(() => { 
        
                       safeBeat(); 
        
                   }, this.interval * 1000); 
        
               } else { 
        
                   safeBeat(); 
        
               } 
        
           }

uptime-kuma/server/routers/api-router.js

Lines 20 to 84 in 12237de

    
           router.get("/api/push/:pushToken", async (request, response) => { 
        
               try { 
        
                   let pushToken = request.params.pushToken; 
        
                   let msg = request.query.msg || "OK"; 
        
                   let ping = request.query.ping || null; 
        
                   let monitor = await R.findOne("monitor", " push_token = ? AND active = 1 ", [ 
        
                       pushToken 
        
                   ]); 
        
                   if (! monitor) { 
        
                       throw new Error("Monitor not found or not active."); 
        
                   } 
        
                   const previousHeartbeat = await Monitor.getPreviousHeartbeat(monitor.id); 
        
                   let status = UP; 
        
                   if (monitor.isUpsideDown()) { 
        
                       status = flipStatus(status); 
        
                   } 
        
                   let isFirstBeat = true; 
        
                   let previousStatus = status; 
        
                   let duration = 0; 
        
                   let bean = R.dispense("heartbeat"); 
        
                   bean.time = R.isoDateTime(dayjs.utc()); 
        
                   if (previousHeartbeat) { 
        
                       isFirstBeat = false; 
        
                       previousStatus = previousHeartbeat.status; 
        
                       duration = dayjs(bean.time).diff(dayjs(previousHeartbeat.time), "second"); 
        
                   } 
        
                   debug("PreviousStatus: " + previousStatus); 
        
                   debug("Current Status: " + status); 
        
                   bean.important = Monitor.isImportantBeat(isFirstBeat, previousStatus, status); 
        
                   bean.monitor_id = monitor.id; 
        
                   bean.status = status; 
        
                   bean.msg = msg; 
        
                   bean.ping = ping; 
        
                   bean.duration = duration; 
        
                   await R.store(bean); 
        
                   io.to(monitor.user_id).emit("heartbeat", bean.toJSON()); 
        
                   Monitor.sendStats(io, monitor.id, monitor.user_id); 
        
                   response.json({ 
        
                       ok: true, 
        
                   }); 
        
                   if (bean.important) { 
        
                       await Monitor.sendNotification(isFirstBeat, monitor, bean); 
        
                   } 
        
               } catch (e) { 
        
                   response.json({ 
        
                       ok: false, 
        
                       msg: e.message 
        
                   }); 
        
               } 
        
           });

I would recommend following the approach of healthchecks. Healthchecks waits for the first "ping" (aka push URL call), then 60s after that ping, checks again to make sure another ping has happened. There is also a grace period to accommodate clock differences. For example, if you have a 5s grace time, the monitor won't be marked as down unless 65s has elapsed since the last ping.

I think the right thing to do, though it may be a little difficult with the way you have the architecture set up, is to do the first setTimeout on monitor.start(), but every other setTimeout should be run from router.get("/api/push:pushToken"). The timeout would be set to interval + grace period, like setTimeout(safebeat, (interval + gracePeriod)*1000).

kaysond · 2022-04-12T02:51:10Z

technically this is still not finished yet until #1428 is finished and gets merged!

louislam · 2022-04-12T02:59:50Z

technically this is still not finished yet until #1428 is finished and gets merged!

It was closed automatically. Sounds horrible because it could be cross repo.

Synchronize push monitor heartbeats to api calls (fixes #1422)

kaysond added the bug Something isn't working label Mar 27, 2022

This was referenced Mar 28, 2022

Synchronize push monitor heartbeats to api calls (fixes #1422) #1428

Merged

isoDateTime() and isoTime() don't use milliseconds which causes issues for uptime-kuma push notifications louislam/redbean-node#6

Closed

kaysond mentioned this issue Apr 7, 2022

Add isoDateTimeMillis and isoTimeMillis louislam/redbean-node#7

Merged

louislam closed this as completed in louislam/redbean-node#7 Apr 12, 2022

louislam reopened this Apr 12, 2022

louislam closed this as completed in #1428 May 29, 2022

louislam added a commit that referenced this issue May 29, 2022

Merge pull request #1428 from kaysond/master

90fe25e

Synchronize push monitor heartbeats to api calls (fixes #1422)

louislam mentioned this issue Jun 10, 2022

[push] reporting down when servcice is up in same second. #1747

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Race condition in push monitors #1422

Race condition in push monitors #1422

kaysond commented Mar 27, 2022

kaysond commented Mar 27, 2022

kaysond commented Apr 12, 2022 •

edited

Loading

louislam commented Apr 12, 2022

Race condition in push monitors #1422

Race condition in push monitors #1422

Comments

kaysond commented Mar 27, 2022

⚠️ Please verify that this bug has NOT been raised before.

🛡️ Security Policy

Description

👟 Reproduction steps

👀 Expected behavior

😓 Actual Behavior

🐻 Uptime-Kuma Version

💻 Operating System and Arch

🌐 Browser

🐋 Docker Version

🟩 NodeJS Version

📝 Relevant log output

kaysond commented Mar 27, 2022

kaysond commented Apr 12, 2022 • edited Loading

louislam commented Apr 12, 2022

kaysond commented Apr 12, 2022 •

edited

Loading