## Mock Scenario Question 2: Control Plane Connectivity + “kubectl timeouts” (Vanilla Kubernetes)

### The Scenario

You operate a **self-managed Kubernetes cluster** (not GKE). Since this morning, multiple engineers report:

* **All `kubectl` commands are timing out** (`kubectl get pods`, `kubectl apply`, etc.).
* **Workloads are still serving traffic** through the existing Ingress and Services (customers aren’t fully down).
* You recently rotated **API server certificates** 2 days ago.
* The API server endpoint is exposed via a **TCP load balancer** in front of **3 control-plane nodes**.
* Some engineers are on VPN, some are not — both groups see the issue.

### The Question

Outline your **step-by-step troubleshooting plan** to restore kubectl access. In your answer, cover:

1. How you quickly decide if it’s **client-side**, **network path**, **load balancer**, **API server**, or **etcd**.
2. The **most important signals / commands** you’d use at each layer (don’t list everything — just the highest-signal checks).
3. What failure modes you suspect given that **apps still serve traffic**, but the **control plane is unreachable**.
4. If the issue is intermittent, what you’d check to confirm **LB health checks / backend readiness** and **API server health endpoints**.

---

Your move: ask your clarification questions (I’ll answer short and to the point, like an interviewer). Then you’ll give your final response.

---
Clarification questions:

1. Since when have all `kubectl` commands been timing out?
2. Does even `kubectl cluster-info` fail?
3. You mentioned that the API server certificates were rotated 2 days ago. Did the issue start right after the rotation, or more recently? That will tell us whether we can rule out a certificate problem.

Since when: It started today around 9:10 AM ET (about 3 hours ago).

kubectl cluster-info: Yes — it also times out.

Relation to cert rotation: The cluster was stable for the last 2 days after the rotation. The timeouts began this morning, not immediately after the cert change.

More clarification questions:

So this looks like `kubectl` is unable to connect to the cluster.

Were there any other changes this morning that I should be aware of?

For example, any API server version changes or `kubectl` version changes? For `kubectl` itself, the binary path should be correct and its version should be compatible with the API server's version. Has the kubeconfig file changed recently?
If none of these is the cause, we will move on to troubleshooting beyond `kubectl`, at the API server and control-plane level.

Recent changes this morning: No planned changes to the cluster. However, there was an urgent firewall policy update pushed org-wide at 8:55 AM ET (meant to tighten inbound rules). It may have affected the API endpoint path.

API server / cluster version: No upgrade or config change since last week.

kubectl versions: Mixed across engineers (1.26–1.29), but all were working yesterday.

kubeconfig: No intentional changes; a couple users re-downloaded it, but it didn’t help.

Clarification question: Okay, let's talk a bit more about the firewall change. I want to make sure the API server's port 6443 is reachable from the machines running `kubectl`.

The firewall change likely impacted that.

From a test machine on the same network segment as most engineers:

TCP 6443 to the API load balancer VIP fails (connection times out).

Port 443 to the Ingress load balancer works.
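
The two port checks above can be reproduced with a small probe. This is only a sketch: `probe_tcp` is a hypothetical helper name, and the addresses in the usage note are placeholders, not values from the scenario. The useful part is the distinction between a timeout (packets silently dropped, typical of a firewall) and a refusal (the host answered, but nothing is listening).

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Classify one TCP connect attempt.

    Returns 'open', 'refused', 'timeout', or 'error: <reason>'.
    """
    try:
        # create_connection performs the full TCP handshake and raises on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        # No reply at all: packets are probably being dropped on the path.
        return "timeout"
    except ConnectionRefusedError:
        # The host replied with a reset: reachable, but no listener on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar.
        return f"error: {exc}"
```

For example, `probe_tcp("<api-lb-vip>", 6443)` returning `timeout` while `probe_tcp("<ingress-lb-ip>", 443)` returns `open` would match what the test machine saw, and points at the network path rather than at `kubectl` or the kubeconfig.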

Now act as a beginner-to-intermediate technical solutions engineer and answer the above scenario. Follow a structured, layered approach like a good TSE. I will use your response to prepare for the actual interview. Keep it simple and clear for my preparation.

Final response:

So, here is how I would approach the problem. I already asked my clarification questions, but let's walk through them again.

I would start with basic `kubectl` commands, such as `kubectl cluster-info`, to make sure we are targeting the right API server and the right cluster. That command also times out. The API server certificates were rotated two days ago, which makes them a suspect, but it is confirmed that the cluster was stable for those two days and the timeouts only began this morning. So the certificate rotation is unlikely to be the cause.

Next, recent changes. There were no config changes or upgrades to the API server, the `kubectl` versions did not change, and there were no intentional changes to the kubeconfig file. The one change this morning was an urgent firewall policy update meant to tighten inbound rules.

Based on all of this, the firewall change looks like the culprit. A TCP connection to port 6443 on the API load balancer VIP times out. So I would restore access to port 6443, the API server port, so that `kubectl` traffic can pass the firewall and reach the API server. This could be done in a couple of ways. I would take the safest one: add an inbound allow rule for TCP 6443 to the API load balancer VIP, limited to the sources engineers run `kubectl` from, rather than opening the port broadly.

---
Score: 6 / 10
What you did well

You followed a layered elimination (kubectl → cert timing → recent change → network reachability).

You used the strongest clue correctly: TCP 6443 to the API LB VIP times out.

You proposed the right remediation direction: fix the firewall rule.

What would make it 9–10/10 (brief)

You jumped to “open 6443” without showing enough control-plane/LB validation and without stating the minimum-safe change + verification steps.
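
The control-plane/LB validation mentioned above could be sketched as follows. Everything here is illustrative: `check_readyz` and `locate_fault` are hypothetical helper names, and the sketch assumes the API server's `/readyz` endpoint is readable without credentials (the usual default, but it depends on the cluster's anonymous-auth and RBAC settings). Certificate verification is disabled only because this is a reachability check against raw node IPs; it should not be copied into real clients.

```python
import ssl
import urllib.error
import urllib.request


def check_readyz(host: str, port: int = 6443, timeout: float = 3.0) -> str:
    """Return 'HTTP <code>' from https://host:port/readyz, or 'unreachable: <reason>'."""
    ctx = ssl.create_default_context()
    # Diagnostic only: skip verification so node IPs missing from the cert SANs still answer.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    url = f"https://{host}:{port}/readyz"
    try:
        with urllib.request.urlopen(url, timeout=timeout, context=ctx) as resp:
            return f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a non-2xx status (for example 401, 403, 500).
        return f"HTTP {exc.code}"
    except (urllib.error.URLError, OSError) as exc:
        return f"unreachable: {exc}"


def locate_fault(vip_result: str, backend_results: dict) -> str:
    """Combine the VIP result with per-node results to pick the next layer to inspect."""
    healthy = [h for h, r in backend_results.items() if r == "HTTP 200"]
    if vip_result == "HTTP 200":
        return "API reachable through the LB"
    if healthy:
        return "backends healthy but VIP unreachable: suspect firewall or LB path"
    return "no healthy backend: suspect kube-apiserver or etcd on the nodes"
```

Run `check_readyz` once against the LB VIP from an engineer's network and once against each control-plane node from inside the cluster network, then pass the results to `locate_fault`. The same VIP check, repeated after the firewall rule is changed, doubles as the verification step before telling engineers to retry `kubectl`.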