Triage practices for AKS operations

A root-cause analysis for an Azure Kubernetes Service (AKS) cluster is often challenging. To simplify the process, consider triaging issues by using a top-down approach based on the cluster hierarchy. Start at the cluster level and drill down if necessary.
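The top-down idea can be pictured as an ordered list of checks that stops at the first level reporting a problem. The following is only a minimal sketch: the `Check` class, the `triage` function, and the level names are hypothetical, and the lambda checks are stand-ins for real queries against a cluster.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Check:
    """One triage level: a label plus a function that returns True when healthy."""
    level: str
    is_healthy: Callable[[], bool]


def triage(checks: List[Check]) -> Optional[str]:
    """Run checks from the top of the hierarchy down.

    Returns the first level that reports a problem, or None if every check passes.
    """
    for check in checks:
        if not check.is_healthy():
            return check.level
    return None


# Hypothetical stand-ins; real checks would query the cluster.
checks = [
    Check("cluster", lambda: True),
    Check("node", lambda: False),
    Check("pod", lambda: True),
]
first_problem = triage(checks)
print(first_problem)
```

In this sketch, the pod-level check never runs because the node-level check already reported a problem, which mirrors drilling down only when necessary.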

The following section provides an overview of the triage practices series, which describes the top-down approach in detail. The articles in the series use a set of tools and dashboards to walk through examples, and they describe how those examples highlight symptoms of problems.

Common problems that are addressed in this series include:

Network and connectivity problems that are caused by improper configuration.
Broken communication between the control plane and the node.
Kubelet pressures that are caused by insufficient compute, memory, or storage resources.
Domain Name System (DNS) resolution problems.
Nodes that run out of disk input/output operations per second (IOPS).
An admission control pipeline that blocks several requests to the API server.
A cluster that doesn't have permissions to pull from the appropriate container registry.
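As one illustration of how such a symptom might surface, kubelet pressure shows up in the standard Kubernetes node conditions `MemoryPressure`, `DiskPressure`, and `PIDPressure`. The following sketch assumes node data that has already been exported as JSON, for example with `kubectl get nodes -o json`. The function name and the sample data are made up for illustration.

```python
from typing import Dict, List

# Standard Kubernetes node condition types that signal resource pressure.
PRESSURE_CONDITIONS = {"MemoryPressure", "DiskPressure", "PIDPressure"}


def nodes_under_pressure(node_list: dict) -> Dict[str, List[str]]:
    """Map each node name to the pressure conditions whose status is "True"."""
    result: Dict[str, List[str]] = {}
    for node in node_list.get("items", []):
        name = node["metadata"]["name"]
        active = [
            cond["type"]
            for cond in node.get("status", {}).get("conditions", [])
            if cond["type"] in PRESSURE_CONDITIONS and cond["status"] == "True"
        ]
        if active:
            result[name] = active
    return result


# Made-up sample in the shape of a Kubernetes NodeList.
sample_nodes = {
    "items": [
        {
            "metadata": {"name": "node-a"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "True"},
                {"type": "DiskPressure", "status": "False"},
                {"type": "Ready", "status": "True"},
            ]},
        },
        {
            "metadata": {"name": "node-b"},
            "status": {"conditions": [
                {"type": "MemoryPressure", "status": "False"},
                {"type": "Ready", "status": "True"},
            ]},
        },
    ]
}
pressured = nodes_under_pressure(sample_nodes)
print(pressured)
```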

This series isn't intended to resolve specific problems. For information about troubleshooting specific problems, see AKS troubleshooting.

The triage practices series

Step	Description
1. Evaluate AKS cluster health.	Check the overall health of the cluster and networking.
2. Examine node and pod health.	Evaluate the health of the AKS worker nodes and the pods that run on them.
3. Monitor workload deployments.	Ensure that all deployments and `DaemonSet` features are running.
4. Validate admission controllers.	Check whether the admission controllers are working as expected.
5. Verify the connection to the container registry.	Confirm that the cluster can connect to the container registry and has permissions to pull images from it.
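As a rough illustration of step 3, the following sketch compares desired and ready counts for deployments and `DaemonSet` resources. It assumes workload data that has already been exported as JSON, for example with `kubectl get deployments,daemonsets --all-namespaces -o json`. The function name and the sample data are hypothetical, and the articles in the series might use different tools for this check.

```python
from typing import List


def unready_workloads(workload_list: dict) -> List[str]:
    """Return one summary line per Deployment or DaemonSet with fewer ready pods than desired."""
    problems: List[str] = []
    for item in workload_list.get("items", []):
        kind = item.get("kind")
        meta = item["metadata"]
        status = item.get("status", {})
        if kind == "Deployment":
            # spec.replicas defaults to 1; readyReplicas is omitted when it is zero.
            desired = item.get("spec", {}).get("replicas", 1)
            ready = status.get("readyReplicas", 0)
        elif kind == "DaemonSet":
            desired = status.get("desiredNumberScheduled", 0)
            ready = status.get("numberReady", 0)
        else:
            continue  # Ignore other resource kinds in this sketch.
        if ready < desired:
            namespace = meta.get("namespace", "default")
            problems.append(f"{kind} {namespace}/{meta['name']}: {ready}/{desired} ready")
    return problems


# Made-up sample in the shape of a Kubernetes List object.
sample_workloads = {
    "items": [
        {
            "kind": "Deployment",
            "metadata": {"name": "web", "namespace": "default"},
            "spec": {"replicas": 3},
            "status": {"readyReplicas": 2},
        },
        {
            "kind": "DaemonSet",
            "metadata": {"name": "kube-proxy", "namespace": "kube-system"},
            "status": {"desiredNumberScheduled": 2, "numberReady": 2},
        },
    ]
}
report = unready_workloads(sample_workloads)
print(report)
```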

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Kevin Harris | Principal Solution Specialist

Other contributors:

Paolo Salvatori | Principal Customer Engineer
Francis Simy Nazareth | Senior Technical Specialist


Next steps

Day-2 operations
AKS Periscope
AKS roadmap
AKS resources
