diff --git a/01-prerequisites.md b/01-prerequisites.md index e899a1cd..cfe56c88 100644 --- a/01-prerequisites.md +++ b/01-prerequisites.md @@ -1,6 +1,6 @@ # Prerequisites -This is the starting point for the instructions on deploying the [AKS Baseline reference implementation](./README.md). There is required access and tooling you'll need in order to accomplish this. Follow the instructions below and on the subsequent pages so that you can get your environment ready to proceed with the AKS cluster creation. +This is the starting point for the instructions on deploying the [AKS baseline reference implementation](./README.md). You'll need certain access and tooling in order to accomplish this. Follow the instructions below and on the subsequent pages so that you can get your environment ready to proceed with the AKS cluster creation. | :clock10: | These steps are intentionally verbose, intermixed with context, narrative, and guidance. The deployments are all conducted via [Bicep templates](https://learn.microsoft.com/azure/azure-resource-manager/bicep/overview), but they are executed manually via `az cli` commands. We strongly encourage you to dedicate time to walk through these instructions, with a focus on learning. We do not provide any "one click" method to complete all deployments.

Once you understand the components involved and have identified the shared responsibilities between your team and your greater organization, you are encouraged to build suitable, repeatable deployment processes around your final infrastructure and cluster bootstrapping. The [AKS baseline automation guidance](https://github.com/Azure/aks-baseline-automation#aks-baseline-automation) is a great place to learn how to build your own automation pipelines. That guidance is based on the same architecture foundations presented here in the AKS baseline, and illustrates GitHub Actions-based deployments for all components, including workloads. | |-----------|:--------------------------| diff --git a/02-ca-certificates.md b/02-ca-certificates.md index 762caec9..d6d7abf2 100644 --- a/02-ca-certificates.md +++ b/02-ca-certificates.md @@ -1,4 +1,4 @@ -# Generate Your Client-Facing and AKS Ingress Controller TLS Certificates +# Generate your client-facing and AKS ingress controller TLS certificates Now that you have the [prerequisites](./01-prerequisites.md) met, follow the steps below to create the TLS certificates that Azure Application Gateway will serve for clients connecting to your web app as well as the AKS Ingress Controller. If you already have access to appropriate certificates, or can procure them from your organization, consider doing so and skipping the certificate generation steps. The following will describe using self-signed certs for instructive purposes only. @@ -10,7 +10,7 @@ Now that you have the [prerequisites](./01-prerequisites.md) met, follow the ste export DOMAIN_NAME_AKS_BASELINE="contoso.com" ``` -1. Generate a client-facing self-signed TLS certificate +1. Generate a client-facing, self-signed TLS certificate. > :book: Contoso Bicycle needs to procure a CA certificate for the web site. As this is going to be a user-facing site, they purchase an EV cert from their CA. This will serve in front of the Azure Application Gateway. They will also procure another one, a standard cert, to be used with the AKS Ingress Controller. This one is not EV, as it will not be user facing. @@ -23,7 +23,7 @@ Now that you have the [prerequisites](./01-prerequisites.md) met, follow the ste openssl pkcs12 -export -out appgw.pfx -in appgw.crt -inkey appgw.key -passout pass: ``` -1. Base64 encode the client-facing certificate +1. Base64 encode the client-facing certificate. :bulb: No matter if you used a certificate from your organization or you generated one from above, you'll need the certificate (as `.pfx`) to be Base64 encoded for proper storage in Key Vault later. @@ -32,15 +32,15 @@ Now that you have the [prerequisites](./01-prerequisites.md) met, follow the ste echo APP_GATEWAY_LISTENER_CERTIFICATE_AKS_BASELINE: $APP_GATEWAY_LISTENER_CERTIFICATE_AKS_BASELINE ``` -1. Generate the wildcard certificate for the AKS Ingress Controller +1. Generate the wildcard certificate for the AKS ingress controller. - > :book: Contoso Bicycle will also procure another TLS certificate, a standard cert, to be used with the AKS Ingress Controller. This one is not EV, as it will not be user facing. Finally the app team decides to use a wildcard certificate of `*.aks-ingress.contoso.com` for the ingress controller. + > :book: Contoso Bicycle will also procure another TLS certificate, a standard cert, to be used with the AKS ingress controller. This one is not EV, as it will not be user facing. Finally, the app team decides to use a wildcard certificate of `*.aks-ingress.contoso.com` for the ingress controller. 
```bash openssl req -x509 -nodes -days 365 -newkey rsa:2048 -out traefik-ingress-internal-aks-ingress-tls.crt -keyout traefik-ingress-internal-aks-ingress-tls.key -subj "/CN=*.aks-ingress.${DOMAIN_NAME_AKS_BASELINE}/O=Contoso AKS Ingress" ``` -1. Base64 encode the AKS Ingress Controller certificate +1. Base64 encode the AKS ingress controller certificate. :bulb: No matter if you used a certificate from your organization or you generated one from above, you'll need the public certificate (as `.crt` or `.cer`) to be Base64 encoded for proper storage in Key Vault later. diff --git a/03-aad.md b/03-aad.md index 00144626..96e54e9c 100644 --- a/03-aad.md +++ b/03-aad.md @@ -1,4 +1,4 @@ -# Prep for Azure Active Directory Integration +# Prep for Azure Active Directory integration In the prior step, you [generated the user-facing TLS certificate](./02-ca-certificates.md); now we'll prepare Azure AD for Kubernetes role-based access control (RBAC). This will ensure you have an Azure AD security group(s) and user(s) assigned for group-based Kubernetes control plane access. diff --git a/04-networking.md b/04-networking.md index bcf8ae25..14415eaf 100644 --- a/04-networking.md +++ b/04-networking.md @@ -1,6 +1,6 @@ -# Deploy the Hub-Spoke Network Topology +# Deploy the hub-spoke network topology -The prerequisites for the [AKS Baseline cluster](./) are now completed with [Azure AD group and user work](./03-aad.md) performed in the prior steps. Now we will start with our first Azure resource deployment, the network resources. +The prerequisites for the [AKS baseline cluster](./) are now completed with [Azure AD group and user work](./03-aad.md) performed in the prior steps. Now we will start with our first Azure resource deployment, the network resources. ## Subscription and resource group topology @@ -8,7 +8,7 @@ This reference implementation is split across several resource groups in a singl ## Expected results -### Resource Groups +### Resource groups The following two resource groups will be created and populated with networking resources in the steps below. @@ -19,10 +19,10 @@ The following two resource groups will be created and populated with networking ### Resources -* Regional Azure Firewall in Hub Virtual Network -* Network Spoke for the Cluster -* Network Peering from the Spoke to the Hub -* Force Tunnel UDR for Cluster Subnets to the Hub +* Regional Azure Firewall in hub virtual network +* Network spoke for the cluster +* Network peering from the spoke to the hub +* Force tunnel UDR for cluster subnets to the hub * Network Security Groups for all subnets that support them ## Steps diff --git a/05-bootstrap-prep.md b/05-bootstrap-prep.md index bdf6c2d5..d4d607d1 100644 @@ -55,7 +55,7 @@ We'll be bootstrapping this cluster with the Flux GitOps agent as installed as a echo ACR_NAME_AKS_BASELINE: $ACR_NAME_AKS_BASELINE # Import core image(s) hosted in public container registries to be used during bootstrapping - az acr import --source ghcr.io/kubereboot/kured:1.11.0 -n $ACR_NAME_AKS_BASELINE + az acr import --source ghcr.io/kubereboot/kured:1.12.0 -n $ACR_NAME_AKS_BASELINE ``` > In this walkthrough, there is only one image that is included in the bootstrapping process. It's included as a reference for this process. The choice to use Kubernetes Reboot Daemon (Kured) or any other images, including helm charts, as part of your bootstrapping is yours to make. 
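As a quick, optional check after the import, you can list the tags now present in your registry. This is only a sketch; it assumes the `kubereboot/kured` repository path that `az acr import` derives from the `--source` value above.

```bash
# Confirm the kured image was imported into your ACR instance.
# The repository name mirrors the --source path used in the import above.
az acr repository show-tags -n $ACR_NAME_AKS_BASELINE --repository kubereboot/kured -o table
```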
@@ -69,12 +69,12 @@ We'll be bootstrapping this cluster with the Flux GitOps agent as installed as a :warning: Without updating these files and using your own fork, you will be deploying your cluster such that it takes dependencies on public container registries. This is generally okay for exploratory/testing, but not suitable for production. Before going to production, ensure _all_ image references you bring to your cluster are from _your_ container registry (link imported in the prior step) or another that you feel confident relying on. ```bash - sed -i "s:docker.io:${ACR_NAME_AKS_BASELINE}.azurecr.io:" ./cluster-manifests/cluster-baseline-settings/kured.yaml + sed -i "s:ghcr.io:${ACR_NAME_AKS_BASELINE}.azurecr.io:" ./cluster-manifests/cluster-baseline-settings/kured.yaml ``` Note that if you are on macOS, you might need to use the following command instead: ```bash - sed -i '' 's:docker.io:'"${ACR_NAME_AKS_BASELINE}"'.azurecr.io:g' ./cluster-manifests/cluster-baseline-settings/kured.yaml + sed -i '' 's:ghcr.io:'"${ACR_NAME_AKS_BASELINE}"'.azurecr.io:g' ./cluster-manifests/cluster-baseline-settings/kured.yaml ``` Now commit the changes to your repository. diff --git a/06-aks-cluster.md b/06-aks-cluster.md index 2c17c031..d07c0e08 100644 --- a/06-aks-cluster.md +++ b/06-aks-cluster.md @@ -1,4 +1,4 @@ -# Deploy the AKS Cluster +# Deploy the AKS cluster Now that your [ACR instance is deployed and ready to support cluster bootstrapping](./05-bootstrap-prep.md), the next step in the [AKS baseline reference implementation](./) is deploying the AKS cluster and its remaining adjacent Azure resources. diff --git a/07-bootstrap-validation.md b/07-bootstrap-validation.md index 8e8b8d20..892377f3 100644 --- a/07-bootstrap-validation.md +++ b/07-bootstrap-validation.md @@ -52,7 +52,7 @@ GitOps allows a team to author Kubernetes manifest files, persist them in their The bootstrapping process that already happened due to the usage of the Flux extension for AKS has set up the following, among other things * the workload's namespace named `a0008` - * Installed kured + * installed kured ```bash kubectl get namespaces diff --git a/08-workload-prerequisites.md b/08-workload-prerequisites.md index 18c5cbcb..6d6a8be4 100644 --- a/08-workload-prerequisites.md +++ b/08-workload-prerequisites.md @@ -1,6 +1,6 @@ -# Workload Prerequisites +# Workload prerequisites -The AKS Cluster has been [bootstrapped](./07-bootstrap-validation.md), wrapping up the infrastructure focus of the [AKS Baseline reference implementation](./). Follow the steps below to import the TLS certificate that the Ingress Controller will serve for Application Gateway to connect to your web app. +The AKS cluster has been [bootstrapped](./07-bootstrap-validation.md), wrapping up the infrastructure focus of the [AKS baseline reference implementation](./). Follow the steps below to import the TLS certificate that the ingress controller will serve for Application Gateway to connect to your web app. ## Steps diff --git a/09-secret-management-and-ingress-controller.md b/09-secret-management-and-ingress-controller.md index 9975dfe2..7c92cb58 100644 --- a/09-secret-management-and-ingress-controller.md +++ b/09-secret-management-and-ingress-controller.md @@ -1,4 +1,4 @@ -# Configure AKS Ingress Controller with Azure Key Vault integration +# Configure AKS ingress controller with Azure Key Vault integration Previously you have configured [workload prerequisites](./08-workload-prerequisites.md). 
These steps configure Traefik, the AKS ingress solution used in this reference implementation, so that it can securely expose the web app to your Application Gateway. @@ -58,7 +58,7 @@ Previously you have configured [workload prerequisites](./08-workload-prerequisi ```bash # Import ingress controller image hosted in public container registries - az acr import --source docker.io/library/traefik:v2.8.1 -n $ACR_NAME_AKS_BASELINE + az acr import --source docker.io/library/traefik:v2.9.6 -n $ACR_NAME_AKS_BASELINE ``` 1. Install the Traefik Ingress Controller. diff --git a/10-workload.md b/10-workload.md index 000c3d7b..5745cd62 100644 --- a/10-workload.md +++ b/10-workload.md @@ -1,4 +1,4 @@ -# Deploy the Workload (ASP.NET Core Docker web app) +# Deploy the workload (ASP.NET Core Docker web app) The cluster now has [Traefik configured with a TLS certificate](./09-secret-management-and-ingress-controller.md). The last step in the process is to deploy the workload, which will demonstrate the system's functions. diff --git a/11-validation.md b/11-validation.md index 25e6a0c9..b9bad588 100644 --- a/11-validation.md +++ b/11-validation.md @@ -1,14 +1,14 @@ -# End-to-End Validation +# End-to-end validation -Now that you have a workload deployed, the [ASP.NET Core Docker sample web app](./10-workload.md), you can start validating and exploring this reference implementation of the [AKS Baseline cluster](./). In addition to the workload, there are some observability validation you can perform as well. +Now that you have a workload deployed, the [ASP.NET Core sample web app](./10-workload.md), you can start validating and exploring this reference implementation of the [AKS baseline cluster](./). In addition to the workload, there are some observability validations you can perform as well. -## Validate the Web App +## Validate the web app This section will help you validate that the workload is exposed correctly and responding to HTTP requests. ### Steps -1. Get Public IP of Application Gateway +1. Get the public IP address of Application Gateway. > :book: The app team conducts a final acceptance test to be sure that traffic is flowing end-to-end as expected, so they place a request against the Azure Application Gateway endpoint. @@ -18,7 +18,7 @@ This section will help you to validate the workload is exposed correctly and res echo APPGW_PUBLIC_IP: $APPGW_PUBLIC_IP ``` -1. Create `A` Record for DNS +1. Create an `A` record for DNS. > :bulb: You can simulate this via a local hosts file modification. You're welcome to add a real DNS entry for your specific deployment's application domain name, if you have access to do so. @@ -83,7 +83,7 @@ Built-in as well as custom policies are applied to the cluster as part of the [c Error from server (Forbidden): error when creating "STDIN": admission webhook "validation.gatekeeper.sh" denied the request: [azurepolicy-k8scustomingresstlshostshavede-e64871e795ce3239cd99] TLS host must have one of defined domain suffixes. Valid domain names are ["contoso.com"]; defined TLS hosts are {"bu0001a0008-00.aks-ingress.invalid-domain.com"}; incompliant hosts are {"bu0001a0008-00.aks-ingress.invalid-domain.com"}. ``` -## Validate Web Application Firewall functionality +## Validate web application firewall functionality Your workload is placed behind a Web Application Firewall (WAF), which has rules designed to stop intentionally malicious activity. You can test this by triggering one of the built-in rules with a request that looks malicious. 
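For instance, a request that resembles a SQL injection attempt should be rejected by the OWASP rules with a 403 before it ever reaches the cluster. The sketch below assumes the `bicycle` host name and the `DOMAIN_NAME_AKS_BASELINE` and `APPGW_PUBLIC_IP` variables from the earlier validation steps; adjust them to your deployment.

```bash
# Send a request containing a SQL-injection-looking query string;
# the WAF should block it and return HTTP 403.
curl -I -k "https://bicycle.${DOMAIN_NAME_AKS_BASELINE}/?sql=DELETE%20FROM" \
  --resolve bicycle.${DOMAIN_NAME_AKS_BASELINE}:443:${APPGW_PUBLIC_IP}
```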
@@ -104,7 +104,7 @@ Your workload is placed behind a Web Application Firewall (WAF), which has rules | where ResourceProvider == "MICROSOFT.NETWORK" and Category == "ApplicationGatewayFirewallLog" ``` -## Validate Cluster Azure Monitor Insights and Logs +## Validate cluster Azure Monitor insights and logs Monitoring your cluster is critical, especially when you're running a production cluster. Therefore, your AKS cluster is configured to send [diagnostic information](https://learn.microsoft.com/azure/aks/monitor-aks) of categories _cluster-autoscaler_, _kube-controller-manager_, _kube-audit-admin_ and _guard_ to the Log Analytics Workspace deployed as part of the [bootstrapping step](./05-bootstrap-prep.md). Additionally, [Azure Monitor for containers](https://learn.microsoft.com/azure/azure-monitor/insights/container-insights-overview) is configured on your cluster to capture metrics and logs from your workload containers. Azure Monitor is configured to surface cluster logs; here you can see those logs as they are generated. @@ -115,13 +115,13 @@ Monitoring your cluster is critical, especially when you're running a production 1. In the Azure Portal, navigate to your AKS cluster resource. 1. Click _Insights_ to see captured data. -You can also execute [queries](https://learn.microsoft.com/azure/azure-monitor/log-query/get-started-portal) on the [cluster logs captured](https://learn.microsoft.com/azure/azure-monitor/insights/container-insights-log-search). +You can also execute [queries](https://learn.microsoft.com/azure/azure-monitor/logs/log-analytics-tutorial) on the [cluster logs captured](https://learn.microsoft.com/azure/azure-monitor/containers/container-insights-log-query). 1. In the Azure Portal, navigate to your AKS cluster resource. 1. Click _Logs_ to see and query log data. :bulb: There are several examples on the _Kubernetes Services_ category. -## Validate Azure Monitor for containers (Prometheus Metrics) +## Validate Azure Monitor for containers (Prometheus metrics) Azure Monitor is configured to [scrape Prometheus metrics](https://learn.microsoft.com/azure/azure-monitor/insights/container-insights-prometheus-integration) in your cluster. This reference implementation is configured to collect Prometheus metrics from two namespaces, as configured in [`container-azm-ms-agentconfig.yaml`](./cluster-baseline-settings/container-azm-ms-agentconfig.yaml). There are two pods configured to emit Prometheus metrics: @@ -137,7 +137,7 @@ Azure Monitor is configured to [scrape Prometheus metrics](https://learn.microso 1. Find one of the above queries in the _Containers_ category. 1. You can select and execute the saved query over the scraped metrics. -## Validate Workload Logs +## Validate workload logs The example workload uses the standard dotnet logger interface, whose logs are captured in `ContainerLogs` in Azure Monitor. You could also include additional logging and telemetry frameworks in your workload, such as Application Insights. Here are the steps to view the built-in application logs. @@ -145,7 +145,7 @@ The example workload uses the standard dotnet logger interface, which are captur 1. In the Azure Portal, navigate to your AKS cluster resource group (`rg-bu0001a0008`). 1. Select your Log Analytics workspace resource and open the _Logs_ blade. -1. Execute the following query +1. Execute the following query. ``` let podInventory = KubePodInventory @@ -185,7 +185,7 @@ A series of metric alerts were configured as well in this reference implementati 1. 
Select your cluster, then _Insights_. 1. Select _Recommended alerts_ to see those enabled. (Feel free to enable/disable as you see fit.) -## Validate Azure Container Registry Image Pulls +## Validate Azure Container Registry image pulls If you configured your third-party images to be pulled from your Azure Container Registry vs public registries, you can validate that the container registry logs show `Pull` logs for your cluster when you applied your flux configuration. diff --git a/12-cleanup.md b/12-cleanup.md index 3b0a2cdd..8a394256 100644 --- a/12-cleanup.md +++ b/12-cleanup.md @@ -1,6 +1,6 @@ # Clean up -After you are done exploring your deployed [AKS Baseline cluster](./), you'll want to delete the created Azure resources to prevent undesired costs from accruing. Follow these steps to delete all resources created as part of this reference implementation. +After you are done exploring your deployed [AKS baseline cluster](./), you'll want to delete the created Azure resources to prevent undesired costs from accruing. Follow these steps to delete all resources created as part of this reference implementation. ## Steps diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index ea8ecad5..c76d93fe 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -1,4 +1,4 @@ -# Contributing to the AKS Baseline reference implementation +# Contributing to the AKS baseline reference implementation This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com. diff --git a/README.md b/README.md index c259c74d..f29e9463 100644 --- a/README.md +++ b/README.md @@ -1,4 +1,4 @@ -# Azure Kubernetes Service (AKS) Baseline Cluster +# Azure Kubernetes Service (AKS) baseline cluster This reference implementation demonstrates the _recommended starting (baseline) infrastructure architecture_ for a general purpose [AKS cluster](https://azure.microsoft.com/services/kubernetes-service). This implementation and document are meant to guide an interdisciplinary team or multiple distinct teams like networking, security and development through the process of getting this general purpose baseline infrastructure deployed and understanding the components of it. @@ -6,7 +6,7 @@ We walk through the deployment here in a rather _verbose_ method to help you und ## Azure Architecture Center guidance -This project has a companion set of articles that describe challenges, design patterns, and best practices for a secure AKS cluster. You can find this article on the Azure Architecture Center at [Azure Kubernetes Service (AKS) Baseline cluster](https://aka.ms/architecture/aks-baseline). If you haven't reviewed it, we suggest you read it as it will give added context to the considerations applied in this implementation. Ultimately, this is the direct implementation of that specific architectural guidance. +This project has a companion set of articles that describe challenges, design patterns, and best practices for a secure AKS cluster. You can find this article on the Azure Architecture Center at [Azure Kubernetes Service (AKS) baseline cluster](https://aka.ms/architecture/aks-baseline). If you haven't reviewed it, we suggest you read it as it will give added context to the considerations applied in this implementation. Ultimately, this is the direct implementation of that specific architectural guidance. 
## Architecture @@ -98,7 +98,7 @@ Most of the Azure resources deployed in the prior steps will incur ongoing charg - [ ] [Cleanup all resources](./12-cleanup.md) -## Preview features +## Preview and additional features Kubernetes and, by extension, AKS are fast-evolving products. The [AKS roadmap](https://aka.ms/AKS/Roadmap) shows how quickly the product is changing. This reference implementation does take dependencies on select preview features which the AKS team describes as "Shipped & Improving." The rationale behind that is that many of the preview features stay in that state for only a few months before entering GA. If you are just architecting your cluster today, by the time you're ready for production, there is a good chance that many of the preview features are nearing or will have hit GA. @@ -109,14 +109,14 @@ This implementation will not include every preview feature, but instead only tho - [BYO CNI (`--network-plugin none`)](https://learn.microsoft.com/azure/aks/use-byo-cni) - [Simplified application autoscaling with Kubernetes Event-driven Autoscaling (KEDA) add-on](https://learn.microsoft.com/azure/aks/keda) -## Related Reference Implementations +## Related reference implementations -The AKS Baseline was used as the foundation for the following additional reference implementations. These build on the learnings of the AKS baseline and applies a specific lens to the cluster to align a specific topology, requirement, and/or workload type. +The AKS baseline was used as the foundation for the following additional reference implementations. These build on the learnings of the AKS baseline and apply a specific lens to the cluster to align with a specific topology, requirement, and/or workload type. -- [AKS Baseline for Multi-Region Clusters](https://github.com/mspnp/aks-baseline-multi-region) -- [AKS Baseline for Regulated Workloads](https://github.com/mspnp/aks-baseline-regulated) -- [AKS Baseline for Microservices](https://github.com/mspnp/aks-fabrikam-dronedelivery) -- [Azure Landing Zones, Enterprise-Scale Reference Implementation using Terraform](https://github.com/Azure/caf-terraform-landingzones-starter/tree/starter/enterprise_scale/construction_sets/aks/online/aks_secure_baseline) +- [AKS baseline for multi-region clusters](https://github.com/mspnp/aks-baseline-multi-region) +- [AKS baseline for regulated workloads](https://github.com/mspnp/aks-baseline-regulated) +- [AKS baseline for microservices](https://github.com/mspnp/aks-fabrikam-dronedelivery) +- [Azure landing zones, enterprise-scale reference implementation using Terraform](https://github.com/Azure/caf-terraform-landingzones-starter/tree/starter/enterprise_scale/construction_sets/aks/online/aks_secure_baseline) ## Advanced topics diff --git a/cluster-manifests/README.md b/cluster-manifests/README.md index dad94f21..cd308492 100644 --- a/cluster-manifests/README.md +++ b/cluster-manifests/README.md @@ -1,4 +1,4 @@ -# Cluster Baseline Configuration Files (GitOps) +# Cluster baseline configuration files (GitOps) > Note: This is part of the Azure Kubernetes Service (AKS) Baseline cluster reference implementation. For more information check out the [readme file in the root](../README.md). 
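Once Flux has reconciled these manifests on the cluster, you can spot-check the result. A minimal sketch, assuming the `cluster-baseline-settings` namespace that this reference implementation bootstraps:

```bash
# Verify the GitOps-managed kured daemonset exists and its pods are healthy.
kubectl get daemonset kured -n cluster-baseline-settings
kubectl get pods -n cluster-baseline-settings -o wide
```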
diff --git a/cluster-manifests/cluster-baseline-settings/kured.yaml b/cluster-manifests/cluster-baseline-settings/kured.yaml index dccb305f..a5da63ea 100644 --- a/cluster-manifests/cluster-baseline-settings/kured.yaml +++ b/cluster-manifests/cluster-baseline-settings/kured.yaml @@ -1,4 +1,4 @@ -# Source: https://github.com/kubereboot/charts/tree/kured-4.1.0/charts/kured (1.11.0) +# Source: https://github.com/kubereboot/charts/tree/kured-4.2.0/charts/kured (1.12.0) apiVersion: rbac.authorization.k8s.io/v1 kind: ClusterRole metadata: @@ -48,7 +48,7 @@ rules: # Allow kured to lock/unlock itself - apiGroups: ["extensions"] resources: ["daemonsets"] - resourceNames: ["n-kured"] + resourceNames: ["kured"] verbs: ["update", "patch"] - apiGroups: ["apps"] resources: ["daemonsets"] @@ -118,10 +118,10 @@ spec: # PRODUCTION READINESS CHANGE REQUIRED # This image should be sourced from a non-public container registry, such as the # one deployed alongside this reference implementation. - # az acr import --source ghcr.io/kubereboot/kured:1.11.0 -n <your-acr-name> + # az acr import --source ghcr.io/kubereboot/kured:1.12.0 -n <your-acr-name> # and then set this to - # image: <your-acr-name>.azurecr.io/kubereboot/kured:1.10.1 - image: ghcr.io/kubereboot/kured:1.11.0 + # image: <your-acr-name>.azurecr.io/kubereboot/kured:1.12.0 + image: ghcr.io/kubereboot/kured:1.12.0 imagePullPolicy: IfNotPresent securityContext: privileged: true # Give permission to nsenter /proc/1/ns/mnt @@ -167,7 +167,8 @@ spec: # - --slack-channel=alerting # - --notify-url="" # See also shoutrrr url format # - --message-template-drain=Draining node %s -# - --message-template-drain=Rebooting node %s +# - --message-template-reboot=Rebooting node %s +# - --message-template-uncordon=Node %s rebooted & uncordoned successfully! # - --blocking-pod-selector=runtime=long,cost=expensive # - --blocking-pod-selector=name=temperamental # - --blocking-pod-selector=... diff --git a/cluster-manifests/kube-system/container-azm-ms-agentconfig.yaml b/cluster-manifests/kube-system/container-azm-ms-agentconfig.yaml index fd4a02f8..f466fa98 100644 --- a/cluster-manifests/kube-system/container-azm-ms-agentconfig.yaml +++ b/cluster-manifests/kube-system/container-azm-ms-agentconfig.yaml @@ -21,19 +21,19 @@ data: # In the absence of this configmap, default value for enabled is true enabled = true # exclude_namespaces setting holds good only if enabled is set to true - # kube-system log collection is disabled by default in the absence of 'log_collection_settings.stdout' setting. If you want to enable kube-system, remove it from the following setting. - # If you want to continue to disable kube-system log collection keep this namespace in the following setting and add any other namespace you want to disable log collection to the array. - # In the absense of this configmap, default value for exclude_namespaces = ["kube-system"] - exclude_namespaces = ["kube-system"] + # kube-system,gatekeeper-system log collection is disabled by default in the absence of 'log_collection_settings.stdout' setting. If you want to enable kube-system,gatekeeper-system, remove them from the following setting. + # If you want to continue to disable kube-system,gatekeeper-system log collection keep the namespaces in the following setting and add any other namespace you want to disable log collection to the array. 
+ # In the absence of this configmap, default value for exclude_namespaces = ["kube-system","gatekeeper-system"] + exclude_namespaces = ["kube-system","gatekeeper-system"] [log_collection_settings.stderr] # Default value for enabled is true enabled = true # exclude_namespaces setting holds good only if enabled is set to true - # kube-system log collection is disabled by default in the absence of 'log_collection_settings.stderr' setting. If you want to enable kube-system, remove it from the following setting. - # If you want to continue to disable kube-system log collection keep this namespace in the following setting and add any other namespace you want to disable log collection to the array. - # In the absense of this cofigmap, default value for exclude_namespaces = ["kube-system"] - exclude_namespaces = ["kube-system"] + # kube-system,gatekeeper-system log collection is disabled by default in the absence of 'log_collection_settings.stderr' setting. If you want to enable kube-system,gatekeeper-system, remove them from the following setting. + # If you want to continue to disable kube-system,gatekeeper-system log collection keep the namespaces in the following setting and add any other namespace you want to disable log collection to the array. + # In the absence of this configmap, default value for exclude_namespaces = ["kube-system","gatekeeper-system"] + exclude_namespaces = ["kube-system","gatekeeper-system"] [log_collection_settings.env_var] # In the absence of this configmap, default value for enabled is true @@ -52,13 +52,14 @@ data: # Supported values for this setting are "v1","v2" # See documentation at https://aka.ms/ContainerLogv2 for benefits of v2 schema over v1 schema before opting for "v2" schema # containerlog_schema_version = "v2" + prometheus-data-collection-settings: |- # Custom Prometheus metrics data collection settings [prometheus_data_collection_settings.cluster] # Cluster level scrape endpoint(s). These metrics will be scraped from agent's Replicaset (singleton) # Any errors related to prometheus scraping can be found in the KubeMonAgentEvents table in the Log Analytics workspace that the cluster is sending data to. - #Interval specifying how often to scrape for metrics. This is duration of time and can be specified for supporting settings by combining an integer value and time unit as a string value. Valid time units are ns, us (or µs), ms, s, m, h. + #Interval specifying how often to scrape for metrics. This is a duration of time and can be specified for supporting settings by combining an integer value and time unit as a string value. Valid time units are ns, us (or µs), ms, s, m, h. interval = "1m" ## Uncomment the following settings with valid string arrays for prometheus scraping @@ -95,11 +96,12 @@ data: ## Reference the docs at https://kubernetes.io/docs/concepts/overview/working-with-objects/field-selectors/ ## eg. To scrape pods on a specific node # kubernetes_field_selector = "spec.nodeName=$HOSTNAME" + [prometheus_data_collection_settings.node] # Node level scrape endpoint(s). These metrics will be scraped from agent's DaemonSet running in every node in the cluster # Any errors related to prometheus scraping can be found in the KubeMonAgentEvents table in the Log Analytics workspace that the cluster is sending data to. - #Interval specifying how often to scrape for metrics. This is duration of time and can be specified for supporting settings by combining an integer value and time unit as a string value. Valid time units are ns, us (or µs), ms, s, m, h. 
+ #Interval specifying how often to scrape for metrics. This is a duration of time and can be specified for supporting settings by combining an integer value and time unit as a string value. Valid time units are ns, us (or µs), ms, s, m, h. interval = "1m" ## Uncomment the following settings with valid string arrays for prometheus scraping @@ -117,6 +119,7 @@ data: # When the setting is set to false, only the persistent volume metrics outside the kube-system namespace will be collected # When this is enabled (enabled = true), persistent volume metrics including those in the kube-system namespace will be collected enabled = true + alertable-metrics-configuration-settings: |- # Alertable metrics configuration settings for container resource utilization [alertable_metrics_configuration_settings.container_resource_utilization_thresholds] @@ -148,11 +151,28 @@ data: agent-settings: |- # prometheus scrape fluent bit settings for high scale # buffer size should be greater than or equal to chunk size else we set it to chunk size. + # settings scoped to prometheus sidecar container. all values in mb [agent_settings.prometheus_fbit_settings] tcp_listener_chunk_size = 10 tcp_listener_buffer_size = 10 tcp_listener_mem_buf_limit = 200 + # prometheus scrape fluent bit settings for high scale + # buffer size should be greater than or equal to chunk size else we set it to chunk size. + # settings scoped to daemonset container. all values in mb + # [agent_settings.node_prometheus_fbit_settings] + # tcp_listener_chunk_size = 1 + # tcp_listener_buffer_size = 1 + # tcp_listener_mem_buf_limit = 10 + + # prometheus scrape fluent bit settings for high scale + # buffer size should be greater than or equal to chunk size else we set it to chunk size. + # settings scoped to replicaset container. all values in mb + # [agent_settings.cluster_prometheus_fbit_settings] + # tcp_listener_chunk_size = 1 + # tcp_listener_buffer_size = 1 + # tcp_listener_mem_buf_limit = 10 + # The following settings are "undocumented", we don't recommend uncommenting them unless directed by Microsoft. # They increase the maximum stdout/stderr log collection rate but will also cause higher cpu/memory usage. ## Ref for more details about Ignore_Older - https://docs.fluentbit.io/manual/v/1.7/pipeline/inputs/tail @@ -161,5 +181,4 @@ data: # tail_mem_buf_limit_megabytes = "10" # default value is 10 # tail_buf_chunksize_megabytes = "1" # default value is 32kb (comment out this line for default) # tail_buf_maxsize_megabytes = "1" # default value is 32kb (comment out this line for default) - # tail_ignore_older = "5m" # default value same as fluent-bit default i.e.0m - + # tail_ignore_older = "5m" # default value same as fluent-bit default i.e. 0m \ No newline at end of file diff --git a/cluster-stamp.bicep b/cluster-stamp.bicep index d85f2e74..d49344e7 100644 --- a/cluster-stamp.bicep +++ b/cluster-stamp.bicep @@ -2021,7 +2021,7 @@ resource mcFlux_extension 'Microsoft.KubernetesConfiguration/extensions@2021-09- 'helm-controller.enabled': 'false' 'source-controller.enabled': 'true' 'kustomize-controller.enabled': 'true' - 'notification-controller.enabled': 'false' + 'notification-controller.enabled': 'true' // As of testing on 29-Dec, this is required to avoid an error. Normally it's not a required controller. 
YMMV 'image-automation-controller.enabled': 'false' 'image-reflector-controller.enabled': 'false' } diff --git a/networking/README.md b/networking/README.md index 2296d97e..c59ab517 100644 --- a/networking/README.md +++ b/networking/README.md @@ -1,6 +1,6 @@ # Networking resource templates -> Note: This is part of the Azure Kubernetes Service (AKS) Baseline cluster reference implementation. For more information check out the [readme file in the root](../README.md). +> Note: This is part of the Azure Kubernetes Service (AKS) baseline cluster reference implementation. For more information check out the [readme file in the root](../README.md). These files are the Bicep templates used in the deployment of this reference implementation. This reference implementation uses a standard hub-spoke model. @@ -12,9 +12,9 @@ These files are the Bicep templates used in the deployment of this reference imp Your organization will likely have its own standards for their hub-spoke implementation. Be sure to follow your organizational guidelines. -## Topology Details +## Topology details -See the [AKS Baseline Network Topology](./topology.md) for specifics on how this hub-spoke model has its subnets defined and IP space allocation concerns accounted for. +See the [AKS baseline network topology](./topology.md) for specifics on how this hub-spoke model has its subnets defined and IP space allocation concerns accounted for. ## See also diff --git a/networking/topology.md b/networking/topology.md index 46f6ad80..c9f311b7 100644 --- a/networking/topology.md +++ b/networking/topology.md @@ -2,7 +2,7 @@ > Note: This is part of the Azure Kubernetes Service (AKS) baseline cluster reference implementation. For more information see the [readme file in the root](../README.md). -## Hub Virtual network +## Hub virtual network `CIDR: 10.200.0.0/24` @@ -12,9 +12,9 @@ This regional VNet hub (shared) is meant to hold the following subnets: * [Gateway subnet] * [Azure Bastion subnet], with reference NSG in place -> Note: For more information about this topology, you can read more at [Azure Hub-Spoke topology]. +> Note: For more information about this topology, see [Azure hub-spoke topology]. -## Spoke Virtual network +## Spoke virtual network `CIDR: 10.240.0.0/16` @@ -40,7 +40,7 @@ In the future, this VNet might hold more subnets like [ACI Provider instance] su | Azure Firewall Subnet (AzureFirewallSubnet) | - | [59] | - | - | - | 100 | 100 | 0 | 0 | 5 | 0 | 64 | 64 | 26 | 10.200.0.0/26 | 10.200.0.0 | 10.200.0.63 | | Azure Bastion Subnet (AzureBastionSubnet) | - | [50] | - | - | - | 100 | 100 | 0 | 0 | 5 | 0 | 64 | 64 | 26 | 10.200.0.128/26 | 10.200.0.128 | 10.200.0.191 | -## Additional Considerations +## Additional considerations * [AKS System Nodepool] and [AKS User Nodepool] subnet: Multi-tenant or other advanced workloads may have nodepool isolation requirements that might demand more (and likely smaller) subnets. * [AKS Internal Load Balancer subnet]: Multi-tenant, multiple SSL termination rules, single PPE supporting dev/QA/UAT, etc. could lead to needing more ingress controllers, but for the baseline, we start with one. 
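When you need to confirm that a deployed spoke matches the plan above, the allocations can be read back with the Azure CLI. A sketch only; the resource group and virtual network names are placeholders for whatever your deployment used.

```bash
# Compare each subnet's address prefix against the table above.
az network vnet subnet list \
  --resource-group <your-spoke-resource-group> \
  --vnet-name <your-spoke-vnet-name> \
  --query "[].{Subnet:name, Prefix:addressPrefix}" -o table
```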
@@ -57,7 +57,7 @@ In the future, this VNet might hold more subnets like [ACI Provider instance] su [Private Endpoints]: https://learn.microsoft.com/azure/private-link/private-endpoint-overview#private-endpoint-properties [Minimum Subnet size]: https://learn.microsoft.com/azure/aks/configure-azure-cni#plan-ip-addressing-for-your-cluster [Subnet Mask bits]: https://learn.microsoft.com/azure/virtual-network/virtual-networks-faq#how-small-and-how-large-can-vnets-and-subnets-be -[Azure Hub-Spoke topology]: https://learn.microsoft.com/azure/architecture/reference-architectures/hybrid-networking/hub-spoke +[Azure hub-spoke topology]: https://learn.microsoft.com/azure/architecture/reference-architectures/hybrid-networking/hub-spoke [Azure Firewall subnet]: https://learn.microsoft.com/azure/firewall/firewall-faq#does-the-firewall-subnet-size-need-to-change-as-the-service-scales [Gateway subnet]: https://learn.microsoft.com/azure/vpn-gateway/vpn-gateway-about-vpn-gateway-settings#gwsub [Azure Application Gateway subnet]: https://learn.microsoft.com/azure/application-gateway/configuration-infrastructure#virtual-network-and-dedicated-subnet diff --git a/workload/aspnetapp.yaml b/workload/aspnetapp.yaml index ad25a40f..5a869b78 100644 --- a/workload/aspnetapp.yaml +++ b/workload/aspnetapp.yaml @@ -114,7 +114,7 @@ spec: - bu0001a0008-00.aks-ingress.contoso.com # it is possible to opt for a certificate management strategy with dedicated # certificates for each TLS SNI route. - # In this Rereference Implementation for the sake of simplicity we use a + # In this reference implementation, for the sake of simplicity, we use a # wildcard default certificate added at Ingress Controller configuration level which is *.example.com # secretName: rules: diff --git a/workload/readme.md b/workload/readme.md index 5d3cbcc4..28312eac 100644 --- a/workload/readme.md +++ b/workload/readme.md @@ -1,10 +1,10 @@ # Workload -> Note: This is part of the Azure Kubernetes Service (AKS) Baseline cluster reference implementation. For more information check out the [readme file in the root](../README.md). +> Note: This is part of the Azure Kubernetes Service (AKS) baseline cluster reference implementation. For more information check out the [readme file in the root](../README.md). This reference implementation is focused on the infrastructure of a secure, baseline AKS cluster. The workload is not fully in scope. However, to demonstrate the concepts and configuration presented in this AKS cluster, a workload needed to be defined. -## Web Service +## Web service The AKS cluster in our reference implementation serves as an application platform host for a web-facing application. In this case, the ASP.NET Core Hello World application is serving as that application. 
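After the workload is deployed, a minimal sanity check might look like the following; the `a0008` namespace is the one used throughout this reference implementation.

```bash
# Confirm the workload pods are running and the ingress resource was accepted.
kubectl get pods -n a0008
kubectl get ingress -n a0008
```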
diff --git a/workload/traefik.yaml b/workload/traefik.yaml index 5300c1a5..53520396 100644 --- a/workload/traefik.yaml +++ b/workload/traefik.yaml @@ -30,8 +30,8 @@ rules: - extensions - networking.k8s.io resources: - - ingresses - ingressclasses + - ingresses verbs: - get - list @@ -46,14 +46,14 @@ rules: - apiGroups: - traefik.containo.us resources: - - middlewares - - middlewaretcps - ingressroutes - - traefikservices - ingressroutetcps - ingressrouteudps + - middlewares + - middlewaretcps - tlsoptions - tlsstores + - traefikservices - serverstransports verbs: - get @@ -73,9 +73,9 @@ roleRef: kind: ClusterRole name: traefik-ingress-controller subjects: -- kind: ServiceAccount - name: traefik-ingress-controller - namespace: a0008 + - kind: ServiceAccount + name: traefik-ingress-controller + namespace: a0008 --- apiVersion: v1 kind: ConfigMap @@ -200,6 +200,7 @@ spec: rollingUpdate: maxSurge: 1 maxUnavailable: 1 + minReadySeconds: 0 template: metadata: annotations: @@ -227,10 +228,11 @@ spec: # PRODUCTION READINESS CHANGE REQUIRED # This image should be sourced from a non-public container registry, such as the # one deployed alongside this reference implementation. - # az acr import --source docker.io/library/traefik:v2.8.1 -n <your-acr-name> + # az acr import --source docker.io/library/traefik:v2.9.6 -n <your-acr-name> # and then set this to - # image: <your-acr-name>.azurecr.io/library/traefik:v2.8.1 - - image: docker.io/library/traefik:v2.8.1 + # image: <your-acr-name>.azurecr.io/library/traefik:v2.9.6 + - image: docker.io/library/traefik:v2.9.6 + imagePullPolicy: IfNotPresent name: traefik-ingress-controller resources: requests: @@ -285,6 +287,8 @@ spec: - name: ssl-csi mountPath: /certs readOnly: true + - name: tmp + mountPath: /tmp args: - --configfile=/config/traefik.toml volumes: @@ -299,5 +303,9 @@ spec: secretProviderClass: aks-ingress-tls-secret-csi-akv - name: data emptyDir: {} + - name: tmp + emptyDir: {} + securityContext: + fsGroup: 65532 nodeSelector: - agentpool: npuser01 + agentpool: npuser01 \ No newline at end of file
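Because this change adds an `fsGroup` and a writable `/tmp` mount to the pod spec, it's worth watching the rollout after applying it. A sketch, assuming the deployment carries the same `traefik-ingress-controller` name used by the service account and cluster role in this manifest:

```bash
# Wait for the updated Traefik pods to roll out, then spot-check the running image tag.
kubectl rollout status deployment/traefik-ingress-controller -n a0008
kubectl get deployment traefik-ingress-controller -n a0008 \
  -o jsonpath='{.spec.template.spec.containers[0].image}'
```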