Skip to content

Latest commit

 

History

History
68 lines (50 loc) · 3.29 KB

aks-triage-practices.md

File metadata and controls

68 lines (50 loc) · 3.29 KB
title titleSuffix description author ms.date ms.topic ms.service ms.subservice azureCategories categories products ms.custom
Azure Kubernetes Service (AKS) operations triage
Azure Architecture Center
Learn about the five articles that describe the triage practices for AKS operations. Get an overview of the top-down triage approach.
kevingbb
11/22/2023
conceptual
architecture-center
azure-guide
compute
compute
azure-kubernetes-service
e2e-aks

Triage practices for AKS operations

A root-cause analysis for an Azure Kubernetes Service (AKS) cluster is often challenging. To simplify the process, consider triaging issues by using a top-down approach based on the cluster hierarchy. Start at the cluster level and drill down if necessary.

Diagram that shows the hierarchy of AKS cluster components: Cluster, node pools, nodes, pods, and containers.

The following section provides an overview of a series about triage practices, which describes the top-down approach in detail. The articles provide examples that use a set of tools and dashboards. The articles describe how these examples highlight symptoms of problems.

Common problems that are addressed in this series include:

  • Network and connectivity problems that are caused by improper configuration.
  • Broken communication between the control plane and the node.
  • Kubelet pressures that are caused by insufficient compute, memory, or storage resources.
  • Domain Name System (DNS) resolution problems.
  • Nodes that run out of disk input/output operations per second (IOPS).
  • An admission control pipeline that blocks several requests to the API server.
  • A cluster that doesn't have permissions to pull from the appropriate container registry.

This series isn't intended to resolve specific problems. For information about troubleshooting specific problems, see AKS troubleshooting.

The triage practices series

Step Description
1. Evaluate AKS cluster health. Check the overall health of the cluster and networking.
2. Examine node and pod health. Evaluate the health of the AKS worker nodes.
3. Monitor workload deployments. Ensure that all deployments and DaemonSet features are running.
4. Validate admission controllers. Check whether the admission controllers are working as expected.
5. Verify the connection to the container registry. Verify the connection to the container registry.

Contributors

This article is maintained by Microsoft. It was originally written by the following contributors.

Principal author:

Other contributors:

To see nonpublic LinkedIn profiles, sign in to LinkedIn.

Next steps