## **Kubernetes Troubleshooting Commands: The TSE Cheat Sheet**

For your Google interview, here's a **structural categorization** that mirrors how you'd actually troubleshoot: from high-level cluster issues down to container, network, and node-level debugging.

---

## **📊 THE TROUBLESHOOTING PYRAMID**

```
                    🔍 USER-FACING ISSUES
                    (kubectl, apps, services)
                           ↓
                    🌐 CLUSTER LEVEL
                    (API server, nodes, control plane)
                           ↓
                    📦 WORKLOAD LEVEL
                    (Pods, deployments, services)
                           ↓
                    🐳 CONTAINER LEVEL
                    (Processes, logs, files)
                           ↓
                    🔌 NETWORK LEVEL
                    (DNS, connectivity, policies)
                           ↓
                    ⚙️ SYSTEM LEVEL
                    (Node OS, services, resources)
```

---

## **🔍 LEVEL 1: USER-FACING / KUBECTL COMMANDS**

### **Cluster Health & Info**
```bash
# Basic cluster info
kubectl cluster-info                    # Show cluster endpoints
kubectl cluster-info dump                # Detailed debug dump
kubectl version                          # Client & server versions (--short was removed in kubectl 1.28)
kubectl api-resources                     # List all resource types
kubectl explain pod                       # Documentation for any resource

# Node status
kubectl get nodes -o wide                  # List all nodes with details
kubectl describe node <node-name>          # Node conditions, pods, resources
kubectl top node                          # Node resource usage (metrics server)
```
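
To spot unhealthy nodes quickly, the `kubectl get nodes` output can be filtered on the STATUS column. This is a minimal sketch, assuming the default column layout (NAME first, STATUS second); the helper name `not_ready_nodes` is hypothetical.

```bash
# Hypothetical helper: print the names of nodes whose STATUS does not start with "Ready".
# Reads `kubectl get nodes --no-headers` output on stdin (column 1 = NAME, column 2 = STATUS).
# A cordoned node ("Ready,SchedulingDisabled") still counts as Ready here.
not_ready_nodes() {
  awk '$2 !~ /^Ready/ { print $1 }'
}

# Usage (against a live cluster):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Each name it prints is a candidate for `kubectl describe node <node-name>`.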

### **Workload Troubleshooting**
```bash
# Pod debugging
kubectl get pods --all-namespaces -o wide   # All pods with IPs and nodes
kubectl describe pod <pod-name>              # Events, status, containers
kubectl logs <pod-name> [-c container]       # Container logs
kubectl logs --previous <pod-name>            # Logs from previous crash
kubectl exec -it <pod-name> -- /bin/sh       # Shell into container
kubectl cp <pod-name>:/path/file ./local     # Copy files from pod

# Common one-liners
kubectl get events --sort-by='.lastTimestamp'  # Chronological events
kubectl get pods --field-selector=status.phase=Failed  # Only failed pods
```
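
When a namespace has many pods, sorting by restart count can show where to look first. This is a rough sketch, assuming the default `kubectl get pods` columns for a single namespace (NAME, READY, STATUS, RESTARTS, AGE); the helper name `restarting_pods` is hypothetical.

```bash
# Hypothetical helper: list pods with at least one restart, highest count first.
# Reads `kubectl get pods --no-headers` output on stdin.
# Assumes column 4 is RESTARTS, so do not combine it with -A or -o wide.
restarting_pods() {
  awk '$4 + 0 > 0 { print $4 + 0, $1 }' | sort -rn
}

# Usage (against a live cluster):
#   kubectl get pods --no-headers | restarting_pods
```

The pod at the top of the list is usually the one to pass to `kubectl describe pod` and `kubectl logs --previous`.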

### **Resource Inspection**
```bash
# All the "get" commands you'll need
kubectl get quota                    # ResourceQuota usage
kubectl get limitrange                # LimitRange values
kubectl get pvc                       # PersistentVolumeClaims
kubectl get netpol                    # NetworkPolicies
kubectl get hpa                        # HorizontalPodAutoscalers
kubectl get pdb                        # PodDisruptionBudgets
kubectl get priorityclass              # PriorityClass definitions

# With useful flags
kubectl get pods -o yaml               # Full pod spec (debug config)
kubectl get pods --show-labels          # See all labels
kubectl get pods -l app=myapp           # Label selector
```

### **Service & Endpoint Debugging**
```bash
kubectl get svc                         # Services
kubectl get endpoints                    # Endpoint IPs backing services
kubectl get endpointslices               # EndpointSlices (successor to Endpoints)
kubectl describe svc <svc-name>          # Service details
kubectl get ingress                      # Ingress rules

# Port forwarding (local access)
kubectl port-forward pod/<pod-name> 8080:80
kubectl port-forward svc/<svc-name> 8080:80
```

---

## **🌐 LEVEL 2: API SERVER ENDPOINTS (Control Plane)**

These are HTTPS endpoints on the API server (port 6443 by default) for health checks.

### **Health & Readiness**
```bash
# Direct API server checks
curl -k https://<apiserver>:6443/healthz       # Basic health (deprecated since v1.16; prefer livez/readyz)
curl -k https://<apiserver>:6443/livez          # Liveness
curl -k https://<apiserver>:6443/readyz         # Readiness
curl -k https://<apiserver>:6443/metrics        # Prometheus metrics (requires authentication)

# With detailed info (quote the URL so the shell does not interpret "?")
curl -k 'https://<apiserver>:6443/livez?verbose'   # Per-check status
curl -k 'https://<apiserver>:6443/readyz?verbose'

# Individual checks
curl -k https://<apiserver>:6443/livez/etcd     # etcd check only
curl -k https://<apiserver>:6443/livez/ping     # Basic ping check
```

**Interview Soundbite:** *"When the API server seems slow, I hit /livez and /readyz with the `verbose` parameter to see which check is failing. If etcd is unhealthy, everything grinds to a halt."*
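
The verbose output marks passing checks with `[+]` and failing ones with `[-]`, so the failing checks can be isolated with a small filter. This is a minimal sketch; the helper name `failed_checks` is hypothetical.

```bash
# Hypothetical helper: keep only the failing checks from verbose health output.
# Passing checks start with "[+]", failing checks start with "[-]".
failed_checks() {
  awk 'index($0, "[-]") == 1'
}

# Usage (against a live API server; -s hides curl's progress output):
#   curl -sk 'https://<apiserver>:6443/readyz?verbose' | failed_checks
```

Empty output means every listed check passed.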

---

## **📦 LEVEL 3: WORKLOAD DEBUGGING (Inside Pods)**

### **Container Inspection**
```bash
# Once inside a pod (kubectl exec)
ps aux                                      # Processes running
top                                         # Resource usage
df -h                                       # Disk usage
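grep oom_kill /sys/fs/cgroup/memory.events  # OOM kills in this cgroup (cgroup v2)
                                            # A killed container shows as OOMKilled in kubectl describe pod
cat /sys/fs/cgroup/memory/memory.limit_in_bytes  # Memory limit (cgroup v1; v2: /sys/fs/cgroup/memory.max)
cat /sys/fs/cgroup/cpu/cpu.shares           # CPU shares (cgroup v1; v2: /sys/fs/cgroup/cpu.weight)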

# Network inside pod
ip addr                                     # Pod IP
ip route                                    # Routing table
ss -tulpn                                   # Listening ports
curl localhost:8080/health                   # Check local endpoint
```
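
The memory limit file lives at a different path under cgroup v1 and cgroup v2, so a check that tries both saves a round trip. This is a sketch that assumes the container sees its own cgroup at `/sys/fs/cgroup`; the helper name `cgroup_mem_limit` and its optional root argument are hypothetical.

```bash
# Hypothetical helper: print the container memory limit for either cgroup version.
# The optional argument overrides the cgroup root (default: /sys/fs/cgroup).
cgroup_mem_limit() {
  local root="${1:-/sys/fs/cgroup}"
  if [ -f "$root/memory.max" ]; then
    cat "$root/memory.max"                     # cgroup v2: bytes, or "max" if unlimited
  elif [ -f "$root/memory/memory.limit_in_bytes" ]; then
    cat "$root/memory/memory.limit_in_bytes"   # cgroup v1: bytes
  else
    echo "no memory limit file found under $root" >&2
    return 1
  fi
}

# Usage (inside the container):
#   cgroup_mem_limit
```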

### **Application Debugging**
```bash
# From outside (using kubectl)
kubectl logs -f <pod-name>                   # Follow logs
kubectl logs --tail=50 <pod-name>             # Last 50 lines
kubectl logs -l app=myapp --all-containers    # All pods with label

# Execute diagnostic commands in container
kubectl exec <pod-name> -- env                 # Environment variables
kubectl exec <pod-name> -- cat /etc/config/config.yaml  # Check mounted config
```
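
Before pulling logs, it can help to see why the previous container instance exited. This sketch reads the `lastState.terminated` fields from the pod status with a JSONPath template; the helper name `last_termination` is hypothetical, and the fields are empty for containers that have never terminated.

```bash
# Hypothetical helper: print "<container> <reason> <exit code>" (tab-separated)
# for the last terminated state of each container in a pod.
last_termination() {
  kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage (against a live cluster):
#   last_termination <pod-name>
```

A reason of `OOMKilled` points at memory limits, while `Error` with a non-zero exit code points back to `kubectl logs --previous`.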

---

## **🔌 LEVEL 4: NETWORK TROUBLESHOOTING**

### **DNS Debugging**
```bash
# From inside a pod
nslookup kubernetes.default.svc.cluster.local
dig kubernetes.default.svc.cluster.local

# Check CoreDNS
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system <coredns-pod>

# Test DNS resolution
kubectl run test-pod --rm -it --image=busybox -- nslookup google.com
```
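
Short Service names resolve only through the search list in the pod's `/etc/resolv.conf`, so testing the fully qualified name helps separate search-path problems from CoreDNS problems. This is a small sketch, assuming the default cluster domain `cluster.local`; the helper name `svc_fqdn` is hypothetical.

```bash
# Hypothetical helper: build the fully qualified DNS name of a Service.
# Arguments: service name, namespace (default "default"),
# cluster domain (default "cluster.local").
svc_fqdn() {
  printf '%s.%s.svc.%s\n' "$1" "${2:-default}" "${3:-cluster.local}"
}

# Usage (inside a pod):
#   nslookup "$(svc_fqdn kube-dns kube-system)"
```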

### **Connectivity Testing**
```bash
# From inside pod
ping <service-name>                          # ICMP test; ClusterIPs usually do not answer ping, so prefer a TCP test
telnet <service-name> <port>                  # Port connectivity
curl -v http://service-name:8080              # HTTP test
wget -O- http://service-name:8080             # Alternative

# Network policy debugging
kubectl run tmp --rm -it --image=alpine -- sh
apk add curl
curl <service-ip>:<port>

# From node level
iptables -L -n -t nat | grep <service-name>   # Check kube-proxy rules (iptables mode only)
iptables -L -n | grep <pod-ip>                 # Pod-specific rules
```
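
When `telnet` and `curl` are missing from the image, a TCP connect test can be done with bash alone. This is a sketch, assuming bash (for its `/dev/tcp` redirection) and coreutils `timeout` are available, which is not the case in plain busybox or alpine shells; the helper name `tcp_check` and the 3-second timeout are arbitrary choices.

```bash
# Hypothetical helper: succeed if a TCP connection to host:port opens within 3 seconds.
# Requires bash (for /dev/tcp) and coreutils `timeout`.
tcp_check() {
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# Usage (inside a pod):
#   tcp_check <service-name> <port> && echo open || echo closed
```

A failure here combined with working DNS suggests checking Endpoints and NetworkPolicies next.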

---

## **⚙️ LEVEL 5: SYSTEM LEVEL (Node OS)**

### **Service Management (systemd)**
```bash
# Check Kubernetes components on node
systemctl status kubelet                       # Is kubelet running?
systemctl status containerd                    # Container runtime (docker on older nodes)
systemctl status kube-proxy                     # If running as service

# Service logs
journalctl -u kubelet -f                        # Follow kubelet logs
journalctl -u containerd --since "5 minutes ago" # Recent runtime logs (use -u docker on Docker nodes)
journalctl -u kubelet --output=short-precise     # With timestamps

# Common patterns
journalctl -u kubelet | grep -i error            # Errors only
journalctl -u kubelet | grep -i "failed"         # Failures
```
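
Grepping for "error" also matches informational lines that merely mention the word. If the kubelet logs in klog text format, where each message starts with a severity letter and the date (for example `E0102`), filtering on that prefix is more precise. This is a sketch under that assumption; the helper name `klog_errors` is hypothetical.

```bash
# Hypothetical helper: keep only klog lines at Error (E) or Fatal (F) severity.
# Assumes klog text format: severity letter, then MMDD, then a space.
klog_errors() {
  awk '/^[EF][0-9][0-9][0-9][0-9] /'
}

# Usage (on the node; -o cat prints the message without the journal prefix):
#   journalctl -u kubelet -o cat --since "10 minutes ago" | klog_errors
```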

### **Node Resource Debugging**
```bash
# System resources
top -c                                          # Processes with commands
htop                                            # Better top (if installed)
free -h                                         # Memory usage
df -h                                           # Disk usage
du -sh /var/log                                  # Log size

# Docker/containerd
docker ps -a                                     # All containers (Docker-based nodes only)
docker logs <container-id>                        # Container logs (Docker-based nodes only)
crictl ps -a                                      # CRI-compatible (containerd)
crictl logs <container-id>

# Node networking
ip addr                                          # Interface IPs
ss -tulpn                                        # Listening ports
netstat -tulpn                                    # Alternative
tcpdump -i any port 6443 -w capture.pcap          # Packet capture
```
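
Full disks are a common cause of node trouble, so it can be handy to flag only the filesystems above a usage threshold. This is a minimal sketch that reads POSIX-format `df -P` output; the helper name `disk_over` and the 85% default are arbitrary illustrative choices, not kubelet settings.

```bash
# Hypothetical helper: print mount points at or above a usage threshold (default 85%).
# Reads `df -P` output on stdin (column 5 = Capacity, column 6 = Mounted on).
disk_over() {
  awk -v t="${1:-85}" 'NR > 1 { use = $5; sub(/%/, "", use); if (use + 0 >= t + 0) print $6, $5 }'
}

# Usage (on the node):
#   df -P | disk_over 90
```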

---

## **🎯 THE TSE QUICK REFERENCE CARD**

### **The Diagnostic Flow**

```
USER REPORTS ISSUE
│
├─→ CLUSTER LEVEL
│   kubectl cluster-info
│   kubectl get nodes
│   curl -k https://<apiserver>:6443/readyz
│
├─→ WORKLOAD LEVEL
│   kubectl get pods -o wide
│   kubectl describe pod <pod>
│   kubectl logs <pod>
│
├─→ CONTAINER LEVEL
│   kubectl exec -it <pod> -- sh
│   ps aux, top, df -h
│   env, cat /proc/1/status
│
├─→ NETWORK LEVEL
│   kubectl run tmp --rm -it --image=busybox -- nslookup <svc>
│   iptables -L -n -t nat
│   curl from inside pod
│
└─→ SYSTEM LEVEL
    systemctl status kubelet
    journalctl -u kubelet
    free -h, df -h, top

### **The 5 Most Important Commands (TSE Edition)**

| # | Command | When to Use |
|---|---------|-------------|
| 1 | `kubectl describe pod <pod>` | Everything about a failing pod |
| 2 | `journalctl -u kubelet -f` | Node-level issues, kubelet crashes |
| 3 | `kubectl logs --previous <pod>` | Container crashed, need last logs |
| 4 | `curl -k 'https://localhost:6443/readyz?verbose'` | API server health |
| 5 | `kubectl exec -it <pod> -- sh` | Get inside and look around |

### **Interview Soundbites**

**When asked how you troubleshoot:**
> *"I work from the outside in. First, cluster level: are the nodes and API server healthy? Then workload: pod status, events, logs. Then I exec into the container to check processes and config. If needed, I drop to the node for system logs and resources."*

**When asked about a crashing pod:**
> *"I run `kubectl describe pod` to see events and exit codes, then `kubectl logs --previous` to see what happened before the crash. If it's OOM, I check memory limits and pod status. If it's a config error, I exec in and verify mounted ConfigMaps."*

**When asked about node issues:**
> *"First, `kubectl describe node` shows conditions. Then I SSH to the node and check `systemctl status kubelet` and `journalctl -u kubelet`. I also verify disk space with `df -h` and memory with `free -h`, because nodes often go NotReady when resources are exhausted."*

---

## 📝 **Pro Tip for Google TSE Interview**

**Don't just list commands; tell a story with them.** When they ask how you'd debug X, structure your answer:

1. **What command you'd run first** (and why)
2. **What you expect to see** (healthy vs unhealthy output)
3. **What you'd do next based on that output**

This shows systematic thinking, not just memorized commands.