From fb6fd02fdce51468031a5b6c24323e6e941fbc38 Mon Sep 17 00:00:00 2001
From: Peter Johnson
Date: Wed, 19 Nov 2025 14:03:10 +0800
Subject: [PATCH 1/4] Added two incident reviews

---
 docs/releases/status.md | 67 +++++++++++++++++++++++++++++++++++++++--
 1 file changed, 65 insertions(+), 2 deletions(-)

diff --git a/docs/releases/status.md b/docs/releases/status.md
index 53bc83787..18b3e86ae 100644
--- a/docs/releases/status.md
+++ b/docs/releases/status.md
@@ -2,13 +2,76 @@ Lambda Feedback is a cloud-native application that is available with full servic

 This page contains information about any known incidents where service was interrupted. The page begain in November 2024 following a significant incident. The purpose is to be informative, transparent, and ensure lessons are always learned so that service improves over time.

-The Severity of incidents is the product of number of users affected (for 100 users, N = 1), magnitude of the effect (scale 1-5 from workable to no service), and the duration (in hours). Severity below 1 is LOW, between 1 and 100 is SIGNIFICANT, and above 100 is HIGH. The severity is used to decide how much we invest in preventative measures, detection, mitigation plans, and rehearsals.
+The Severity of incidents is the product of:
+
+- number of users affected (for 100 users, N = 1),
+- magnitude of the effect (scale 1-5 from workable to no service),
+- duration (in hours).
+
+Severity:
+
+- < 1 is LOW
+- 1 - 100 is SIGNIFICANT
+- > 100 is HIGH.
+
+The severity is used to decide how much we invest in preventative measures, detection, mitigation plans, and rehearsals.
+
+## 2025 November 18th: Some evaluation functions failing (Severity: LOW):
+
+Some evaluation functions returned errors.
+
+### Timeline (UK / GMT)
+
+2025/11/18 21:18 GMT: some but not all feedback functions failed. Investigation initiated and message on home page
+2025/11/18 21:39 GMT: updated to users that the cause was identified.
+2025/11/18 21:45 GMT: issue resolved. Home page updated.
+
+### Analysis
+
+The root cause of the issue was the outage of Cloudflare, which cause wide issues on the internet including for example X, ChatGPT, and other services being unavailable.
+
+Our system does not use Cloudflare so was unaffected. However, any of our evaluation functions using an old version of our baselayer rely on calling GitHub to retrieve a schema. GitHub git services were down (presumably due to the Cloudflare outage), which meant that our functions could not validate their schemas and therefore failed.
+
+We tried to implement a solution but were unable to because implementation relied on GitHub workflows, which failed for the same reason. GitHub had announced they were resolving the issue, and when it was resolved our services returned to normal.
+
+The solution in this case is to upgrade all of our evaluation functions to a newer version of the baselayer, which has schemas bundled and does not rely on external services.
+
+### Recommended action
+
+Update all evaluation function baselayers.
+
+N=1, effect = 2, duration = 0.5. Severity = 1 (LOW)
+
+## 2025 November 10th: Service unresponsive (Severity: SIGNIFICANT):
+
+The application was unresponsive.
+
+### Timeline (UK / GMT)
+
+2025/11/10 14:21 Service became unresponsive, e.g. pages not loading. Reports from users through various channels. Developers began investigating, message sent to Teachers.
+
+2025/11/10 14:28 Service returned to normal. Home page message displayed to inform users.
+
+### Analysis
+
+During the period of unresponsiveness, the key symptoms within the system were overloading the CPU of the servers. Error logging and alerts did successfully detect downtime and alert the developer team, who responded. Although developers were looking into the problem, and tried to increase resource to resolve the problem, in fact the autoscaling solved the problem itself.
+
+The underlying cause was a combination of high usage, leading to CPU overload. This type of scenario is normal and correctly triggered autoscaling. The issue in this case was that autoscaling should happen seamlessly, without service interruptions in the intervening period.
+
+### Action taken:
+
+- Decrease the CPU and memory usage level at which scaling is triggered. This increases overall costs but decreases the chance of service interruptions.
+- Enhance system logs so that more information is available if a similar event occurs
+- Investigate CPU and memory usage to identify opportunities for improvements (outcome: usage is typical for Node.js applications, no further action)
+
+N=3, effect = 5, duration = 0.15. Severity = 2.25 (SIGNIFICANT)
+
 ## 2025 October 17th: Handwriting input temporarily unavailable (Severity: SIGNIFICANT)

 Handwriting in response areas (but not in the canvas) did not return a preview and could not be submitted. Users received an error in a toast saying that the service would not work. All other services remained operational.

-### Timeline (UK / BST)
+### Timeline (UK / BST)

 2025/10/17 08:24 Handwriting inputs ceased to return previews to the user due to a deployed code change that removed redudant code, but also code that it transpired was required.

From cfb1c484ab272859511dee5a0e000a909f6989b0 Mon Sep 17 00:00:00 2001
From: Peter Johnson
Date: Wed, 22 Oct 2025 17:23:45 +0100
Subject: [PATCH 2/4] i g interactive rebase in progress; onto f0718c91

---
 docs/releases/status.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/docs/releases/status.md b/docs/releases/status.md
index 18b3e86ae..47ad99286 100644
--- a/docs/releases/status.md
+++ b/docs/releases/status.md
@@ -10,9 +10,9 @@ The Severity of incidents is the product of:

 Severity:

-- < 1 is LOW
-- 1 - 100 is SIGNIFICANT
-- > 100 is HIGH.
+- x < 1 is LOW
+- 1 < x < 100 is SIGNIFICANT
+- x > 100 is HIGH.

 The severity is used to decide how much we invest in preventative measures, detection, mitigation plans, and rehearsals.

From 72202dc40551d9edb89acb6329e42564a5ac4f48 Mon Sep 17 00:00:00 2001
From: Peter Johnson
Date: Fri, 21 Nov 2025 07:51:27 +0000
Subject: [PATCH 3/4] Corrected root cause (GitHub not CloudFlare)

---
 docs/releases/status.md | 8 +++-----
 1 file changed, 3 insertions(+), 5 deletions(-)

diff --git a/docs/releases/status.md b/docs/releases/status.md
index 47ad99286..a2272ca54 100644
--- a/docs/releases/status.md
+++ b/docs/releases/status.md
@@ -28,13 +28,11 @@ Some evaluation functions returned errors.

 ### Analysis

-The root cause of the issue was the outage of Cloudflare, which cause wide issues on the internet including for example X, ChatGPT, and other services being unavailable.
+Some of our evaluation functions still use an old version of our baselayer, which calls GitHub to retrieve a schema. GitHub git services were down (https://www.githubstatus.com/incidents/5q7nmlxz30sk), which meant that our functions could not validate their schemas and therefore failed.

-Our system does not use Cloudflare so was unaffected. However, any of our evaluation functions using an old version of our baselayer rely on calling GitHub to retrieve a schema. GitHub git services were down (presumably due to the Cloudflare outage), which meant that our functions could not validate their schemas and therefore failed.
+The same issue meant that we could not push updates to code during the incident, due code being deployed via GitHub. GitHub had announced they were resolving the issue, and when it was resolved our services returned to normal.

-We tried to implement a solution but were unable to because implementation relied on GitHub workflows, which failed for the same reason. GitHub had announced they were resolving the issue, and when it was resolved our services returned to normal.
-
-The solution in this case is to upgrade all of our evaluation functions to a newer version of the baselayer, which has schemas bundled and does not rely on external services.
+The solution in this case is to upgrade all of our evaluation functions to a newer version of the baselayer, which has schemas bundled and does not rely on external services.

 ### Recommended action

From f2495d9b21ac8e2eca18597354b2417c72d242bf Mon Sep 17 00:00:00 2001
From: Peter Johnson
Date: Fri, 21 Nov 2025 07:54:58 +0000
Subject: [PATCH 4/4] Minor corrections

---
 docs/releases/status.md | 14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/docs/releases/status.md b/docs/releases/status.md
index a2272ca54..837a55210 100644
--- a/docs/releases/status.md
+++ b/docs/releases/status.md
@@ -22,21 +22,21 @@ Some evaluation functions returned errors.

 ### Timeline (UK / GMT)

-2025/11/18 21:18 GMT: some but not all feedback functions failed. Investigation initiated and message on home page
-2025/11/18 21:39 GMT: updated to users that the cause was identified.
+The application was fully available during this time period.
+
+2025/11/18 21:18 GMT: some but not all evaluation functions (external microservices) failed. Investigation initiated and message added on home page
+2025/11/18 21:39 GMT: home page updated to inform users that the cause was identified.
 2025/11/18 21:45 GMT: issue resolved. Home page updated.

 ### Analysis

-Some of our evaluation functions still use an old version of our baselayer, which calls GitHub to retrieve a schema. GitHub git services were down (https://www.githubstatus.com/incidents/5q7nmlxz30sk), which meant that our functions could not validate their schemas and therefore failed.
-
-The same issue meant that we could not push updates to code during the incident, due code being deployed via GitHub. GitHub had announced they were resolving the issue, and when it was resolved our services returned to normal.
+Some of our evaluation functions still use an old version of our baselayer, which calls GitHub to retrieve a schema and validate inputs. GitHub git services were down (https://www.githubstatus.com/incidents/5q7nmlxz30sk), which meant that those of our functions that call GitHub could not validate their schemas and therefore failed. Other evaluation functions had previously been updated to remove the need to call GitHub and were therefore not affected by the issue.

-The solution in this case is to upgrade all of our evaluation functions to a newer version of the baselayer, which has schemas bundled and does not rely on external services.
+The same root cause meant that we could not push updates to code during the incident, due to code being deployed via GitHub. GitHub had announced they were resolving the issue, and when it was resolved our services returned to normal.

 ### Recommended action

-Update all evaluation function baselayers.
+Update all evaluation function baselayers to remove dependency on external calls when validating.

 N=1, effect = 2, duration = 0.5. Severity = 1 (LOW)
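
The severity rule described in this series (the product of N, effect, and duration, banded into LOW, SIGNIFICANT, and HIGH) can be checked against the two worked figures with a short Python sketch. This is illustrative only and is not code from the Lambda Feedback repository: the names `severity` and `classify` are hypothetical. The stated bands leave the exact values 1 and 100 unassigned, so the sketch assumes both fall into the lower band, which matches the 18 November entry (Severity = 1, LOW).

```python
def severity(n: float, effect: int, duration_hours: float) -> float:
    """Return the severity score: N x effect x duration.

    n              -- users affected in units of 100 (100 users gives N = 1)
    effect         -- magnitude on a 1-5 scale (workable to no service)
    duration_hours -- duration of the incident in hours
    """
    if not 1 <= effect <= 5:
        raise ValueError("effect must be on the 1-5 scale")
    return n * effect * duration_hours


def classify(score: float) -> str:
    """Map a severity score to a band.

    Assumption: a score of exactly 1 is LOW and exactly 100 is SIGNIFICANT.
    The status page does not define these boundary values.
    """
    if score <= 1:
        return "LOW"
    if score <= 100:
        return "SIGNIFICANT"
    return "HIGH"


if __name__ == "__main__":
    # Figures taken from the two incident entries added in this series.
    for label, args in [
        ("2025-11-18", (1, 2, 0.5)),
        ("2025-11-10", (3, 5, 0.15)),
    ]:
        score = severity(*args)
        print(label, round(score, 2), classify(score))
```

Placing the boundary value 1 in LOW is a choice made to agree with the 18 November entry. If 1 were meant to be SIGNIFICANT, that entry's band would change.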