## Mock Scenario Question 5: Pods Can’t Start During Rollout (Image Pull + Probes + Multi-Tenant Constraints)

### The Scenario

You run a self-managed Kubernetes cluster with 2 namespaces: `prod` and `batch` (shared cluster, strict isolation rules).

A rollout of `orders-api` in `prod` starts failing during peak hours:

* New pods for `orders-api` are stuck in **`ImagePullBackOff`** for several minutes, then some recover, some don’t.
* A subset of pods that do start enter **`CrashLoopBackOff`** within 1–2 minutes.
* The old ReplicaSet pods are still serving, but latency is increasing and you’re seeing **sporadic 503s** from the Ingress.
* `kubectl get events -n prod` shows a mix of:

  * `Failed to pull image ... dial tcp: i/o timeout`
  * `Back-off pulling image`
  * `Readiness probe failed: HTTP probe failed with statuscode: 503`
* Nodes show normal CPU/memory, but **egress bandwidth is high** during peak hours.
* You also know `batch` namespace runs heavy nightly jobs that pull large images and hit external APIs.

### The Question

Give your **step-by-step troubleshooting plan** to restore stability and complete the rollout safely. Cover:

1. How you determine whether the primary issue is **registry/network**, **node-level image cache/container runtime**, **application crash**, or **probe misconfiguration**.
2. The **highest-signal checks** you’ll run (commands + what you’re looking for).
3. **Safe mitigations** you’d apply immediately to reduce user impact (without breaking multi-tenancy isolation).
4. Longer-term fixes to prevent recurrence (think: image strategy, registry mirroring/caching, QoS, quotas, scheduling isolation, probe tuning).

---

Ask your clarification questions first (I’ll answer short, like an interviewer). Then give your final response.


* **Is it new?** No — `orders-api` has been running in `prod` for **~9 months**.

* **When did the failures start?** This began **today at ~11:30 AM ET**, right when traffic peaked.

* **When did this rollout start?** The new rollout was triggered at **11:20 AM ET** (CI pipeline pushed a new image tag).

* **What changed in this rollout?** Only the **container image tag** (same Deployment, same manifests). No intentional probe changes.

* **Extra context:** The image is hosted on an **external registry** (not inside the cluster).

---

MY FINAL RESPONSE:
Thanks, the new information helps. `kubectl get events` in the `prod` namespace shows `Failed to pull image`, `Back-off pulling image`, and `Readiness probe failed` for the new rollout pods. The first two, both image-related, are the most suspicious. I would also check the logs of these pods with `kubectl logs -n <namespace> <pod-name>`, which may give more information on the image pulling and the readiness probe.

Egress bandwidth is also high during peak hours. Egress is outbound traffic, and I assume it includes the bandwidth used to pull images. So let's look at what we are pulling, or at the image itself. The latest information says only the container image tag changed, so first I would check that the tag name is correct. Second, since this is an external registry, I would check that we have proper authentication and authorization to pull from it. These two checks should give a good idea of where the problem is: if the tag is correct and we have the right access to the registry, pulls should go smoothly.

The `batch` namespace also runs heavy nightly jobs that pull large images and hit external APIs, so egress could be heavy there too. The two should not happen at the same time: activity in `prod` usually peaks in the morning, and the batch jobs should be scheduled to run at night so the load is evenly distributed.

Finally, I would take a quick look at the readiness probe and check why it is failing. Is the timeout too short? It may need more buffer time to download larger images, so I would also try tweaking the readiness probe.

So my immediate actions would be: verify the image tag, verify access to the registry (are we authenticated and authorized to pull images?), and increase the readiness probe limits to facilitate pulling larger images. For long-term fixes, we should consider the image strategy and run the batch jobs at night, away from peak time.

✅ Confirmed — that’s your final response.

## Score: **5.5 / 10**

### What you did well

* You focused on the most visible signals: **ImagePullBackOff + i/o timeout + high egress**.
* You recognized multi-tenant impact: `batch` jobs can compete for egress and external APIs.

### Where it missed (why it’s not yet “same difficulty” quality)

You drifted into assumptions that don’t fit the evidence and missed the highest-signal checks + safest mitigations.

Key gaps:

1. **Readiness probe has nothing to do with image pulling**

* Image pull happens **before** the container starts. A readiness probe can’t “facilitate pulling larger images.”
* The readiness 503 is about the app not being ready (or dependency failing), **after** the container starts.

2. **You didn’t split the problem into the two distinct failure modes**
   This scenario has two separate classes:

* **A) ImagePullBackOff**: registry/network/egress/DNS/proxy/auth/cache/container runtime
* **B) CrashLoopBackOff / readiness 503**: app startup config, downstream dependency, DB, secrets, env vars, migrations, probes too strict
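
One way to make this split concrete is to bucket the new pods by their container waiting reason. The sketch below is illustrative only: the `app=orders-api` label and both function names are assumptions, not taken from the scenario. Pods that are running but failing readiness have no waiting reason and land in `other`, so check those through their `READY` column instead.

```shell
# Print "pod<TAB>waitingReason" for every orders-api pod in prod.
# Assumption: the Deployment's pods carry the label app=orders-api.
list_waiting_reasons() {
  kubectl get pods -n prod -l app=orders-api \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].state.waiting.reason}{"\n"}{end}'
}

# Read "pod<TAB>reason" lines from stdin and tag each pod with a failure class.
bucket_failure_modes() {
  awk -F'\t' '
    $2 ~ /ImagePullBackOff|ErrImagePull/ { print "A-image-pull\t" $1; next }
    $2 ~ /CrashLoopBackOff/              { print "B-crash-loop\t" $1; next }
                                         { print "other\t" $1 }
  '
}

# Usage: list_waiting_reasons | bucket_failure_modes | sort
```

Keeping the `kubectl` call and the bucketing separate means the bucketing can be checked against saved output after the incident.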

3. **Your checks weren’t specific enough**
   Most important checks you didn’t mention:

* `kubectl describe pod <pod> -n prod` → exact pull error (TLS, timeout, 401, DNS)
* Node-level runtime logs: containerd/Docker pull errors (for example, `journalctl -u containerd` on a systemd node)
* From the node: `crictl pull <image>` (or equivalent) to tell a kubelet or pod-spec problem apart from a node→registry problem
* `kubectl get events -n prod --sort-by=.lastTimestamp`
* `curl` from inside a running pod to its dependencies (DB/cache), because readiness is failing with 503

4. **Mitigation should prioritize stability**
   You didn’t mention the obvious “stop the bleeding” actions:

* **Pause/rollback the rollout** (safe because old RS is healthy)
* Temporarily set `maxUnavailable: 0` to preserve serving capacity during peak
* Temporarily reduce `batch` egress impact (cap concurrent batch pods with a quota, suspend or reschedule the heaviest jobs, or move them to separate nodes)

Scheduling batch “only at night” is a reasonable longer-term goal, but it’s not a reliable immediate fix, and it doesn’t explain why this is happening right now unless batch is actually running now, which the scenario didn’t confirm.
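
Whether `batch` is actually running during the incident is cheap to confirm before blaming it. A minimal sketch, assuming the batch workloads are ordinary `Job` objects in the `batch` namespace (the function names are made up for illustration):

```shell
# Print "job<TAB>activePods" for every Job in the batch namespace.
list_batch_jobs() {
  kubectl get jobs -n batch \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.active}{"\n"}{end}'
}

# Keep only jobs that currently have at least one active pod.
# An empty .status.active field is treated as 0.
active_jobs() {
  awk -F'\t' '$2 + 0 > 0 { print $1 }'
}

# Usage: list_batch_jobs | active_jobs
```

If this prints nothing at 11:30 AM, the egress pressure is coming from somewhere else and the batch theory can be dropped for now.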

---

## What a 9/10 answer looks like (brief template)

### 1) Stabilize

* `kubectl rollout pause deploy/orders-api -n prod` (or roll back with `kubectl rollout undo` if needed)
* Ensure the old ReplicaSet keeps serving (`maxUnavailable: 0` temporarily, with `maxSurge` above 0, since both cannot be 0)
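
The stabilize step can be sketched as a small script. This is an illustration rather than a prescribed runbook: the `run` helper, the `DRY_RUN` switch, and the function names are assumptions added here so the commands can be reviewed before they touch the cluster. It also assumes the Deployment uses the default `RollingUpdate` strategy.

```shell
# Print commands by default; set DRY_RUN=0 to actually execute them.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then echo "+ $*"; else "$@"; fi
}

# $1 = namespace, $2 = deployment name
stabilize_rollout() {
  # Stop the rollout from progressing any further.
  run kubectl rollout pause "deploy/$2" -n "$1"
  # Never remove an old pod before a new one is Ready.
  run kubectl patch deploy "$2" -n "$1" --type merge \
    -p '{"spec":{"strategy":{"rollingUpdate":{"maxUnavailable":0,"maxSurge":1}}}}'
}

# $1 = namespace, $2 = deployment name
rollback_rollout() {
  # kubectl refuses to undo a paused Deployment, so resume it first.
  run kubectl rollout resume "deploy/$2" -n "$1"
  run kubectl rollout undo "deploy/$2" -n "$1"
}

# Usage: stabilize_rollout prod orders-api
#        DRY_RUN=0 rollback_rollout prod orders-api
```

Pausing keeps the evidence (the stuck new pods) in place for triage, while rolling back removes it, so pause first unless user impact is growing.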

### 2) Triage ImagePullBackOff (fast proof)

* `kubectl describe pod -n prod <newpod>` → read Events
* Check if error is:

  * `i/o timeout` (network/egress/DNS)
  * `401/403` (auth)
  * `x509` (TLS/CA)
  * `manifest unknown` (bad tag)
* On an affected node:

  * `crictl pull <image>` (bypass k8s objects) to confirm node→registry connectivity
* Check if only some nodes fail → suggests node-level egress path, DNS, NAT, or runtime cache issues
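
The error-to-cause mapping above can be written down as a small classifier. It is a rough sketch: it matches substrings only, so unusual messages fall through to `unknown`, and the function name and category labels are illustrative.

```shell
# Map an image pull error message to a likely cause category.
classify_pull_error() {
  case "$1" in
    *"i/o timeout"*|*"no such host"*|*"connection refused"*) echo "network" ;;
    *401*|*403*|*nauthorized*|*denied*)                      echo "auth" ;;
    *x509*)                                                  echo "tls" ;;
    *"manifest unknown"*|*"not found"*)                      echo "bad-tag" ;;
    *)                                                       echo "unknown" ;;
  esac
}

# Usage (assumes pull failures are recorded as events with reason=Failed):
#   kubectl get events -n prod --field-selector reason=Failed \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#   while IFS= read -r msg; do classify_pull_error "$msg"; done | sort | uniq -c
```

In this scenario the events already say `dial tcp: i/o timeout`, which points at the network path rather than a bad tag or credentials.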

### 3) Triage CrashLoop / readiness 503 (separately)

* `kubectl logs <pod> -n prod --previous` (`-p`) and `kubectl describe pod <pod> -n prod` for the restart reason and exit code
* Exec into a running pod and hit:

  * `curl localhost:<port>/health`
  * `curl <dependency>` (DB/redis/external API)
* Confirm probe config:

  * readiness endpoint correct? needs initialDelaySeconds? startupProbe missing?
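
For the crash-loop side, the last exit code and the live probe settings are usually the fastest clues. The helpers below are a sketch that assumes the app is the first container in the pod; the function names and the one-line interpretations are illustrative, not exhaustive.

```shell
# Print the last termination reason and exit code of a pod's first container.
# $1 = pod name
last_exit() {
  kubectl get pod "$1" -n prod \
    -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}{" "}{.status.containerStatuses[0].lastState.terminated.exitCode}{"\n"}'
}

# Rough reading of common exit codes (signals show up as 128 + signal number).
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly: the process finished, check the container command" ;;
    137) echo "SIGKILL: OOMKilled or force-killed after the grace period" ;;
    143) echo "SIGTERM: stopped by the kubelet, e.g. after a failed liveness probe" ;;
    *)   echo "application error: read the previous container's logs" ;;
  esac
}

# Show the probe settings the Deployment template currently carries.
show_probes() {
  kubectl get deploy orders-api -n prod \
    -o jsonpath='{.spec.template.spec.containers[0].readinessProbe}{"\n"}{.spec.template.spec.containers[0].startupProbe}{"\n"}'
}
```

Since only the image tag changed, an unchanged probe that now fails suggests the new build starts more slowly or depends on something it cannot reach.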

### 4) Mitigations

* Scale CoreDNS or widen the egress path, if DNS or registry connectivity turns out to be the bottleneck
* Add image pull caching/mirroring; pre-pull on nodes
* Enforce tenant isolation:

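  * `ResourceQuota` to cap how many batch pods run at once (it does not limit bandwidth itself); shape egress-heavy batch traffic at the CNI level if your plugin supports it
  * separate node pool with taints for batch, plus a matching toleration and node selector on the batch pods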
  * registry mirror inside network
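
As an illustration of the isolation items, the sketch below emits a `ResourceQuota` for the `batch` namespace. The quota name and the limits (20 pods, 10 Jobs) are placeholders chosen for the example, not recommendations, and the node name in the comments is hypothetical.

```shell
# Emit a ResourceQuota that caps concurrent pods and Jobs in the batch namespace.
batch_quota_manifest() {
  cat <<'EOF'
apiVersion: v1
kind: ResourceQuota
metadata:
  name: batch-concurrency
  namespace: batch
spec:
  hard:
    pods: "20"
    count/jobs.batch: "10"
EOF
}

# Apply with: batch_quota_manifest | kubectl apply -f -
#
# Dedicating a node to batch (placeholder node name):
#   kubectl taint nodes <batch-node> workload=batch:NoSchedule
#   kubectl label nodes <batch-node> workload=batch
```

The taint only keeps other tenants off the batch node. Keeping batch off the `prod` nodes still requires the batch pods to carry the toleration and a node selector for the `workload=batch` label.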

---

## Topics to skim in your books (targeted)

**Kubernetes in Action**

* Pod lifecycle + kubelet image pulling and Events
* Probes: readiness vs liveness vs startupProbe

**Kubernetes Up & Running**

* Deployments rollout mechanics (pause/rollback, maxUnavailable/maxSurge)
* Debugging patterns with `describe`, events, logs

**Kubernetes Patterns**

* Deployment patterns (rollout safety, progressive delivery basics)
* Multi-tenant isolation patterns (taints, quotas, separate node pools)
* Image management patterns (registry mirror/cache, pre-pull)

---

If you want, I can ask a follow-up as the interviewer: **“What exact 3 commands do you run first, and what outcomes would make you rollback vs keep investigating?”**
