Field | Value |
---|---|
title | Azure Kubernetes Service (AKS) operations triage |
titleSuffix | Azure Architecture Center |
description | Learn about the five articles that describe the triage practices for AKS operations. Get an overview of the top-down triage approach. |
author | kevingbb |
ms.date | 11/22/2023 |
ms.topic | conceptual |
ms.service | architecture-center |
ms.subservice | azure-guide |
azureCategories | compute |
categories | compute |
products | |
ms.custom | |
A root-cause analysis for an Azure Kubernetes Service (AKS) cluster is often challenging. To simplify the process, consider triaging issues by using a top-down approach based on the cluster hierarchy. Start at the cluster level and drill down if necessary.
This article gives an overview of a five-article series about triage practices, which describes the top-down approach in detail. The articles work through examples that use a set of tools and dashboards, and they explain how those examples highlight the symptoms of problems.
Common problems that are addressed in this series include:
- Network and connectivity problems that are caused by improper configuration.
- Broken communication between the control plane and the node.
- Kubelet pressures that are caused by insufficient compute, memory, or storage resources.
- Domain Name System (DNS) resolution problems.
- Nodes that run out of disk input/output operations per second (IOPS).
- An admission control pipeline that blocks several requests to the API server.
- A cluster that doesn't have permissions to pull from the appropriate container registry.
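Several of these problems surface as node conditions that the kubelet reports to the API server. The following Python snippet is a minimal sketch, not part of the series, of how those conditions could be scanned. It assumes the input is the parsed JSON from `kubectl get nodes -o json`. The function name `find_unhealthy_nodes` and the sample node names are hypothetical and used only for illustration.

```python
# Node condition types that the kubelet sets to "True" when a resource runs low.
PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure"}


def find_unhealthy_nodes(node_list):
    """Return {node_name: [reasons]} for nodes that are not Ready or are under pressure.

    `node_list` is assumed to be the parsed JSON from `kubectl get nodes -o json`.
    """
    problems = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        reasons = []
        for cond in node.get("status", {}).get("conditions", []):
            ctype, status = cond.get("type"), cond.get("status")
            # A Ready status of "False" or "Unknown" both count as not ready.
            # "Unknown" usually means the node stopped reporting to the control plane.
            if ctype == "Ready" and status != "True":
                reasons.append("NotReady")
            elif ctype in PRESSURE_CONDITIONS and status == "True":
                reasons.append(ctype)
        if reasons:
            problems[name] = reasons
    return problems


# Hypothetical sample data shaped like the output of `kubectl get nodes -o json`.
SAMPLE = {
    "items": [
        {
            "metadata": {"name": "aks-nodepool1-0"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "False"},
                {"type": "Ready", "status": "True"},
            ]},
        },
        {
            "metadata": {"name": "aks-nodepool1-1"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "True"},
                {"type": "Ready", "status": "False"},
            ]},
        },
    ]
}

if __name__ == "__main__":
    print(find_unhealthy_nodes(SAMPLE))
```

Against a real cluster, the same function could be fed from `json.load(sys.stdin)` with the `kubectl` output piped in. A scan like this only points to where a symptom shows up. It doesn't identify the root cause.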
This series isn't intended to resolve specific problems. For information about troubleshooting specific problems, see AKS troubleshooting.
Step | Description |
---|---|
1. Evaluate AKS cluster health. | Check the overall health of the cluster and networking. |
2. Examine node and pod health. | Evaluate the health of the AKS worker nodes and the pods that run on them. |
3. Monitor workload deployments. | Ensure that all deployments and DaemonSets are running. |
4. Validate admission controllers. | Check whether the admission controllers are working as expected. |
5. Verify the connection to the container registry. | Confirm that the cluster can connect to the container registry and has permission to pull images from it. |
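The ordering of the steps in the table can be pictured as a simple sequence of checks that runs from the cluster level downward. The following Python snippet is an illustrative sketch only. The `TriageStep` type, the `run_triage` function, and the placeholder checks are hypothetical, and stopping at the first unhealthy level is an assumption made for this example rather than a rule that the series prescribes.

```python
from typing import Callable, List, NamedTuple, Optional


class TriageStep(NamedTuple):
    name: str
    check: Callable[[], bool]  # returns True when this level looks healthy


def run_triage(steps: List[TriageStep]) -> Optional[str]:
    """Run the checks in order and return the name of the first unhealthy step.

    Returns None when every check passes.
    """
    for step in steps:
        if not step.check():
            return step.name
    return None


# Placeholder checks. In practice, each one would query the cluster,
# a dashboard, or a monitoring API instead of returning a fixed value.
EXAMPLE_STEPS = [
    TriageStep("1. Evaluate AKS cluster health", lambda: True),
    TriageStep("2. Examine node and pod health", lambda: True),
    TriageStep("3. Monitor workload deployments", lambda: True),
    TriageStep("4. Validate admission controllers", lambda: False),
    TriageStep("5. Verify the connection to the container registry", lambda: True),
]

if __name__ == "__main__":
    failing = run_triage(EXAMPLE_STEPS)
    print(failing or "All triage steps passed")
```

Returning the first unhealthy level keeps the investigation focused on the highest layer that shows a symptom, which matches the idea of drilling down only when necessary.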
This article is maintained by Microsoft. It was originally written by the following contributors.
Principal author:
- Kevin Harris | Principal Solution Specialist
Other contributors:
- Paolo Salvatori | Principal Customer Engineer
- Francis Simy Nazareth | Senior Technical Specialist