## Mock Scenario Question 7: NetworkPolicy + Service Reachability Regression (Vanilla Kubernetes)

### The Scenario

You run a self-managed Kubernetes cluster using **Calico** as the CNI plugin, which also enforces NetworkPolicies. In namespace `prod` you have:

* `web` Deployment (serves HTTP)
* `api` Deployment (internal API)
* `db` StatefulSet (Postgres)

Traffic flow should be:
**Ingress → web → api → db**

Since this morning:

* External users intermittently see **500s** on the website.
* `web` pods log: `ERROR calling api: connect timeout`
* `api` pods log: `ERROR db connection timeout`
* All pods are **Running/Ready**.
* `kubectl get svc -n prod` looks normal.
* You discover a change merged last night: a new **default-deny NetworkPolicy** was applied in `prod`, and an “allow-from-ingress” policy was added for `web`.

### The Question

Give your **step-by-step troubleshooting plan** to isolate the break in the network path and restore service safely. Cover:

1. How you prove whether the issue is **DNS**, **Service/Endpoints**, **NetworkPolicy**, or **CNI/kube-proxy**.
2. The highest-signal commands you’ll run (be specific) and what you expect to see.
3. The minimal safe policy changes you’d apply to restore:

   * Ingress → web
   * web → api
   * api → db
4. How you validate the fix and prevent recurrence.

---

Ask your clarification questions first (I’ll answer short and to the point, like an interviewer). Then give your final response.


---
---
---


Desired functionality (intended state):

* `prod` should be **default-deny** for both **ingress and egress**.
* Only these flows should be allowed:

  1. **Ingress controller namespace → `web` pods** on TCP 80/443
  2. **`web` pods → `api` pods** on TCP 8080
  3. **`api` pods → `db` pods** on TCP 5432
* **No other cross-namespace traffic** into `prod` is intended.
* DNS should still work (pods should be able to query CoreDNS in `kube-system`).
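
As a rough sketch of that intended baseline, the snippet below writes two NetworkPolicy manifests to a temporary file for review: a namespace-wide default-deny for both directions and a DNS egress allowance. The policy names, the `k8s-app: kube-dns` pod label on CoreDNS, and the automatic `kubernetes.io/metadata.name` namespace label are assumptions rather than details from the scenario, so check them against the actual cluster before applying anything.

```shell
# Write the baseline policies to a file so they can be reviewed before applying.
baseline_manifest="$(mktemp)"

cat > "$baseline_manifest" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all        # hypothetical name
  namespace: prod
spec:
  podSelector: {}               # empty selector = every pod in prod
  policyTypes:
    - Ingress
    - Egress
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-dns-egress        # hypothetical name
  namespace: prod
spec:
  podSelector: {}
  policyTypes:
    - Egress
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:
            matchLabels:
              k8s-app: kube-dns # assumption: usual CoreDNS label
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
EOF

echo "baseline policies written to $baseline_manifest"
# After review, validate against the API server, then apply:
#   kubectl apply --dry-run=server -f "$baseline_manifest"
#   kubectl apply -f "$baseline_manifest"
```

Putting `namespaceSelector` and `podSelector` in the same `to` entry means both must match, so only CoreDNS pods in `kube-system` are reachable on port 53, not the whole namespace.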

What was implemented last night:

* A **default-deny** policy was applied in `prod`.
* An allow policy was added for **Ingress → web**, but nothing explicit was added for **web → api**, **api → db**, or **DNS egress**.


---
---
---

✅ Confirmed — that’s your final response.

## Score: **6.5 / 10**

### What you did well

* You correctly identified the likely root cause: **a default-deny NetworkPolicy was applied without allowing the needed east-west flows and DNS egress**.
* You proposed the right policy intents: **web→api**, **api→db**, and **DNS access**.
* You had a sensible “stop the bleeding” instinct (rollback).

### What kept it from 9–10/10 (brief)

1. **You didn’t “prove” it with high-signal checks**
   In this scenario, a TSE answer should explicitly run 2–3 checks that immediately confirm NetworkPolicy is blocking traffic:

* `kubectl get netpol -n prod` (see default-deny + current allow rules)
* `kubectl describe netpol -n prod <policy>` (confirm missing egress/ingress selectors)
* From pods:

  * `kubectl exec -n prod <web-pod> -- curl -m2 http://api:8080/health`
  * `kubectl exec -n prod <api-pod> -- nc -vz db 5432`
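  * DNS test: `kubectl exec -n prod <web-pod> -- nslookup api.prod.svc.cluster.local` (or `dig`)
    If DNS works but the connect times out, that strongly points to NetworkPolicy. If DNS itself times out, the default-deny egress is blocking port 53 as well, so repeat the connect test against the Service ClusterIP to check the app ports separately.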

2. **You conflated “roll back” with “minimal fix”**
   Rollback is acceptable as an emergency action, but the question asked for **minimal safe policy changes**. You should show the *least-privilege* patch plan:

* Add *only* the missing allows, not “open everything.”

3. **Missing nuance: egress needs are broader than just db**
   Apps often need:

* DNS (UDP/TCP 53 to CoreDNS)
* sometimes time sync, metrics, external calls

A strong answer mentions starting with the **required flows only** and adding others as observed.
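
The checks from point 1 could be scripted roughly as follows. This is a sketch under several assumptions: the `probe` and `diagnose` function names are made up for illustration, the container images ship `nslookup` and `nc`, the Services are named `api` and `db` as in the commands above, and `db-0` is the first pod of the `db` StatefulSet. The TCP probes target IPs rather than names so that a DNS failure cannot mask the result.

```shell
ns="${NS:-prod}"

# probe LABEL WORKLOAD CMD...: run CMD in one pod of WORKLOAD, print PASS or FAIL.
probe() {
  local label="$1" workload="$2"
  shift 2
  if kubectl exec -n "$ns" "$workload" -- "$@" >/dev/null 2>&1; then
    echo "PASS $label"
  else
    echo "FAIL $label"
  fi
}

diagnose() {
  # 1. Which policies exist, and do the Services still have endpoints?
  kubectl get netpol -n "$ns"
  kubectl get endpoints -n "$ns" api db

  # 2. DNS: can a web pod resolve the api Service name?
  probe "dns web resolves api" deploy/web nslookup api.prod.svc.cluster.local

  # 3. TCP by IP, bypassing DNS. The db Service may be headless,
  #    so the db probe uses the pod IP of db-0 instead of a ClusterIP.
  local api_ip db_ip
  api_ip="$(kubectl get svc -n "$ns" api -o jsonpath='{.spec.clusterIP}')"
  db_ip="$(kubectl get pod -n "$ns" db-0 -o jsonpath='{.status.podIP}')"
  probe "tcp web->api:8080" deploy/web nc -z -w 2 "$api_ip" 8080
  probe "tcp api->db:5432" deploy/api nc -z -w 2 "$db_ip" 5432
}

# Usage: diagnose
```

Reading the output: populated endpoints plus failing TCP probes by IP point at NetworkPolicy rather than Service wiring; a failing DNS probe on top of that suggests port 53 egress is blocked too; empty endpoints would instead point at selectors or readiness.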

### What a 9/10 answer sounds like (tight)

* Prove it: show default-deny + missing rules + failing curls from web→api and api→db while pods are Ready.
* Apply minimal policies:

  * Allow ingress-ns → web on 80/443
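  * Allow web → api on TCP 8080 (egress from `web` plus ingress to `api`, since both directions are default-denied)
  * Allow api → db on TCP 5432 (egress from `api` plus ingress to `db`)
  * Allow DNS egress from prod → kube-system/CoreDNS on UDP/TCP 53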
* Validate: repeat the curl/nc checks and confirm the error rate drops.
* Prevent recurrence: policy tests in CI (connectivity smoke tests), policy templates, staged rollout, and clear label conventions.
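
A minimal sketch of the two east-west allowances, written to a temporary file for review. The `app: web`, `app: api`, and `app: db` labels and the policy names are assumptions, as is the idea that the pods listen on the same ports (8080, 5432) the Services expose; NetworkPolicy ports refer to pod ports. DNS egress and the ingress-controller rule are assumed to be handled by separate policies.

```shell
allow_manifest="$(mktemp)"

cat > "$allow_manifest" <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: web-egress-to-api           # hypothetical name
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: web                      # assumed label
  policyTypes:
    - Egress
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 8080
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: api-ingress-web-egress-db   # hypothetical name
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: api
  policyTypes:
    - Ingress
    - Egress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: web
      ports:
        - protocol: TCP
          port: 8080
  egress:
    - to:
        - podSelector:
            matchLabels:
              app: db
      ports:
        - protocol: TCP
          port: 5432
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: db-ingress-from-api         # hypothetical name
  namespace: prod
spec:
  podSelector:
    matchLabels:
      app: db
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: api
      ports:
        - protocol: TCP
          port: 5432
EOF

echo "allow policies written to $allow_manifest"
# After review: kubectl apply --dry-run=server -f "$allow_manifest"
```

Each flow appears twice, once as egress on the source and once as ingress on the destination, because a default-deny in both directions drops the connection if either side is missing. Reply traffic needs no extra rule, since NetworkPolicy applies to connections rather than individual packets.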

### Topics to skim (brief + targeted)

**Kubernetes in Action**

* Services/Endpoints troubleshooting + debugging connectivity
* NetworkPolicy fundamentals (ingress vs egress, selectors)

**Kubernetes Up & Running**

* Cluster networking basics + “life of a request” through Service routing

**Kubernetes Patterns**

* Multi-tenant isolation patterns and safe rollout patterns (progressive + validation)

---

If you want the next scenario, I’ll make it **system design + scale** (multi-region failover + data recovery) with Kubernetes primitives, same difficulty.
