## Mock Scenario Question 6: Multi-Tenant Resource Starvation + Evictions (Vanilla Kubernetes)

### The Scenario

You operate a **shared, multi-tenant** Kubernetes cluster (self-managed). Two namespaces matter here:

* `prod` (customer-facing APIs)
* `ml` (data science workloads)

Since this afternoon’s traffic peak:

* `prod` services show **higher latency** and occasional **503s** from Ingress.
* Several `prod` pods restart, and you see some pods moved to new nodes.
* In `ml`, a team started a new batch of GPU training jobs earlier today.

You observe:

* `kubectl get pods -n prod` shows a few pods in **`OOMKilled`** restart loops.
* `kubectl get events -A` shows:

  * `Evicted: The node was low on resource: memory`
  * `Preempted` messages for some lower-priority pods
* `kubectl top nodes` shows memory is near saturation on many nodes.
* **No node is “down.”** CPU is moderate, but **memory pressure** is widespread.
* There is a ResourceQuota in `ml`, but it only limits **CPU**, not memory.
* PriorityClasses exist, but `ml` jobs were (accidentally) assigned a **high priority**.

### The Question

Walk me through your **step-by-step** troubleshooting and stabilization plan. Cover:

1. How you prove this is **resource contention** (not an app bug) and identify the blast radius.
2. The most important **signals/commands** to identify the culprits (which workloads/namespaces/nodes).
3. **Immediate mitigations** you’d apply to restore `prod` reliability *while preserving multi-tenancy fairness*.
4. Longer-term fixes: quotas, limits, PriorityClasses, scheduling isolation (taints/affinity), and capacity planning.

---

Ask your clarification questions first (I’ll answer short, like an interviewer), then give your final response.


My final response:
Okay, thanks for the scenario. I'll go through the data given and work through the problems in a structured way.

There are a few signals under "You observe": pods in OOMKilled restart loops, which indicates memory issues; events saying the node was low on memory; and `kubectl top nodes` showing memory near saturation on many nodes. No node is down, but memory pressure is widespread. There are also governance issues in the `ml` namespace: the ResourceQuota limits only CPU, not memory, and the ML jobs were assigned the wrong PriorityClass. All of these data points strongly signal widespread memory pressure on the nodes.

As an immediate mitigation, I would first look at the new batch of GPU training jobs that the `ml` team started earlier today, since they might be occupying a lot of memory and other resources. If the batch is not business-critical or time-bound, I would ask the team to stop it and reschedule it for a non-peak time.

Next, I would look at the resource quota in the `ml` namespace. It is evident from the data that no memory quota is applied there, so I would check pod and node metrics using `kubectl describe pods`, `kubectl get nodes`, and `kubectl top nodes` to see which nodes are under memory pressure, at least the top few to get an idea. I would also check the same metrics per namespace. My target would be to increase the resource quota in `ml`; by "increase" I mean that I would first set a memory quota for the `ml` namespace.

I would also lower the priority of the ML jobs by correcting their PriorityClass. That might not have a huge impact right now compared with setting the memory quota, but getting the priority right for the ML jobs would help in future.

Then I would adjust the resource quotas and probably separate the ML and prod workloads onto different nodes using taints and tolerations. That way we have more control over the nodes and better visibility into what is going on with each workload and how much resource the ML jobs need.

To conclude: I would fix the resource quotas for the `ml` namespace, probably with limits as well; update the PriorityClasses for future use; and reschedule the pods onto two different node pools, one for prod and one for ML, using taints and tolerations. Overall, this should give us a better understanding for capacity planning for future runs.

✅ Confirmed — treating what you wrote as your final response.

## Score: **7 / 10**

### What you did well

* You correctly identified **resource contention (memory pressure)** as the primary root cause (OOMKilled + Evicted low memory + top nodes memory saturation).
* You connected the governance misconfigurations: **ML quota missing memory** + **wrong PriorityClass** causing **preemption** of lower-priority workloads.
* You proposed the right long-term direction: **quotas/limits + priority classes + scheduling isolation (taints/tolerations)**.

### What would make it 9–10/10 (brief)

1. **Immediate stabilization should be more “Kubernetes-native” and safer**
   Instead of “ask team to stop,” say what *you* do in-cluster first:

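* Scale down ML workloads **yourself** (if you have permission): `kubectl -n ml scale deploy/<name> --replicas=0` for Deployments. Jobs cannot be scaled with `kubectl scale`, so suspend them instead (set `spec.suspend: true`, e.g. `kubectl -n ml patch job <name> --type=merge -p '{"spec":{"suspend":true}}'`) or pause their pipeline.
* Fix the **PriorityClass** assignment quickly so **prod stops getting preempted**. A running pod's priority cannot be changed in place, so update `priorityClassName` in the ML workload templates and recreate the pods.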
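
As a rough illustration of the PriorityClass fix, the manifests below sketch one possible pair of classes. The names `prod-critical` and `ml-batch` and the numeric values are assumptions made for this example, not something given in the scenario; only the relative ordering matters.

```yaml
# Hypothetical class names and values; adapt to the classes your cluster already has.
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: prod-critical
value: 100000                 # higher number = higher priority
globalDefault: false
preemptionPolicy: PreemptLowerPriority
description: "Customer-facing prod workloads."
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: ml-batch
value: 1000
globalDefault: false
preemptionPolicy: Never       # pending ML pods wait in the queue instead of preempting others
description: "ML batch and training jobs."
```

The `value` of an existing PriorityClass cannot be edited after creation, so correcting a wrong value typically means creating a new class and pointing the ML workload templates at it via `priorityClassName`.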

2. **Show the exact highest-signal checks**
   You mentioned top/describe broadly. Add 2–3 precise commands that prove culpability fast:

* `kubectl get events -A --sort-by=.lastTimestamp | tail -50` (see evictions/preemption)
* `kubectl top pods -A --sort-by=memory` (identify top memory consumers)
* `kubectl describe node <node>` (MemoryPressure, eviction thresholds, allocated requests/limits)
* `kubectl -n ml get pods -o wide` + `kubectl describe pod` for the high-priority ML pods (confirm PriorityClass and requests)

3. **Call out the key mechanism**

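* Preemption + eviction are two separate mechanisms: the **scheduler preempts** lower-priority pods to make room for pending high-priority ML pods, while the **kubelet evicts** pods when a node runs low on memory. Wrong PriorityClass + no memory quota = ML can starve the cluster through both.
* Also mention **requests/limits**: without memory limits, pods can balloon until the node hits memory pressure; without requests, the scheduler can overpack nodes.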
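
To make the requests/limits point concrete, here is a minimal sketch of a `prod` pod whose requests equal its limits for every resource, which places it in the Guaranteed QoS class. The pod name, image, and sizes are placeholders chosen for illustration only.

```yaml
# Hypothetical pod; name, image, and resource sizes are illustrative assumptions.
apiVersion: v1
kind: Pod
metadata:
  name: api-example
  namespace: prod
spec:
  containers:
    - name: api
      image: registry.example.com/api:1.0   # placeholder image
      resources:
        requests:              # what the scheduler reserves on the node
          cpu: "500m"
          memory: "512Mi"
        limits:                # hard ceiling; exceeding the memory limit gets the container OOM-killed
          cpu: "500m"
          memory: "512Mi"      # requests == limits for all resources -> Guaranteed QoS
```

Under node memory pressure the kubelet considers pods whose usage exceeds their requests first, so pods with accurate requests are among the last candidates for eviction. A pod with no requests or limits at all is BestEffort and is among the first.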

4. **Multi-tenancy fairness**
   You said “separate node pools”—good. Add:

* enforce `ml` memory quotas **and** LimitRange defaults (requests/limits)
* consider `prod` PriorityClass (high), `ml` (lower)
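* optionally use **taints** on dedicated ML nodes (with matching tolerations on ML pods) to keep other workloads off them. A taint alone does not stop ML from landing on prod nodes; for that, add a `nodeSelector`/node affinity to ML pods or taint the prod nodes as well.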
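
The guardrails listed above could look roughly like the following. This is a sketch under stated assumptions: the object names, all quantities, and the `workload=ml` node label and taint are invented for the example and would need sizing against real cluster capacity.

```yaml
# Hypothetical memory-aware quota for the ml namespace (quantities are placeholders).
apiVersion: v1
kind: ResourceQuota
metadata:
  name: ml-compute
  namespace: ml
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 200Gi
    limits.memory: 256Gi
---
# Defaults for containers that omit requests/limits; without these, pods that
# set no memory request or limit would be rejected once the quota above exists.
apiVersion: v1
kind: LimitRange
metadata:
  name: ml-defaults
  namespace: ml
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 500m
        memory: 1Gi
      default:
        memory: 2Gi
      max:
        memory: 32Gi
---
# Pod template fragment (not a complete object) for ML workloads, assuming the
# ML nodes carry the label workload=ml and the taint workload=ml:NoSchedule.
spec:
  nodeSelector:
    workload: ml               # keeps ML pods off every other node
  tolerations:
    - key: workload
      operator: Equal
      value: ml
      effect: NoSchedule       # lets ML pods onto the tainted ML nodes
```

The toleration only permits ML pods onto the tainted nodes; it is the `nodeSelector` that confines them there, which is why the two are used together.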

### Topics to skim (brief and targeted)

**Kubernetes in Action**

* Evictions, MemoryPressure, OOMKilled root causes
* Requests/limits behavior + QoS classes (Guaranteed/Burstable/BestEffort)
* PriorityClass + preemption

**Kubernetes Up & Running**

* Operational debugging with events/top/describe
* Multi-tenant guardrails basics

**Kubernetes Patterns**

* Multi-tenancy patterns: quotas, limit ranges, priority, isolation via node pools/taints

---

If you want to do another one **properly as an interview loop**, next time ask clarification questions first, then answer. But your content here is a solid step up from earlier scenarios.
