diff --git a/content/well-architected-framework/data/docs-nav-data.json b/content/well-architected-framework/data/docs-nav-data.json index 5117e6ea03..b668378f9a 100644 --- a/content/well-architected-framework/data/docs-nav-data.json +++ b/content/well-architected-framework/data/docs-nav-data.json @@ -496,6 +496,10 @@ { "title": "Create cloud budgets", "path": "optimize-systems/manage-cost/create-cloud-budgets" + }, + { + "title": "Detect cloud spending anomalies", + "path": "optimize-systems/manage-cost/detect-cloud-spending-anomalies" } ] }, diff --git a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/atomic-deployments.mdx b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/atomic-deployments.mdx index e99d3e4773..467261ad25 100644 --- a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/atomic-deployments.mdx +++ b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/atomic-deployments.mdx @@ -24,6 +24,6 @@ In this section of Deploy with confidence, you learned how to implement atomic d Refer to the following documents to learn more about deployment strategies: -- [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) to implement zero-downtime deployment strategies +- [Implement zero-downtime deployments with blue/green, canary, and rolling strategies](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) - [Automate deployments](/well-architected-framework/define-and-automate-processes/automate/deployments) to automate your deployment processes - [Automation maturity model](/well-architected-framework/define-and-automate-processes/process-automation) to understand your current automation level \ No newline at end of file diff --git 
a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/applications.mdx b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/applications.mdx index d9af7935fc..a25904e0b8 100644 --- a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/applications.mdx +++ b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/applications.mdx @@ -1,9 +1,9 @@ --- -page_title: Application deployments -description: Implement zero-downtime application deployments using blue/green, canary, and rolling strategies for virtual machines and containers. +page_title: Deploy applications with zero downtime +description: Learn how blue/green, canary, and rolling deployment strategies mitigate downtime during application updates. Compare approaches for VMs and containers to choose the right method for your application. --- -# Zero-downtime application deployments +# Deploy applications with zero downtime Application changes can use blue/green, canary, rolling, or a combination of the three. Your deployment method depends on whether you use virtual machines or containers, along with the criticality of your application. In the following sections, you will learn how these deployment strategies work with load balancers, non-containerized applications, and containerized applications. 
@@ -24,7 +24,7 @@ External resources: - [Azure Blue-Green deployments using Azure Traffic Manager](https://azure.microsoft.com/en-us/blog/blue-green-deployments-using-azure-traffic-manager/) - [F5 Flexible Load Balancing for Blue/Green Deployments and Beyond](https://www.f5.com/resources/solution-guides/flexible-load-balancing-for-blue-green-deployments-and-beyond) -## Non-containerized applications +## Deploy applications on virtual machines Using a blue/green or rolling deployment is a good approach if you are deploying applications on virtual machines. Blue/green deployments limit downtime and reduce risk by maintaining two identical production environments - one live, one idle. You deploy to the idle environment, test thoroughly, then switch traffic over. If problems occur, you can roll back immediately by switching traffic back. @@ -40,7 +40,7 @@ If the canary test succeeds without errors, you can incrementally direct traffic ![Rolling deployment. After the initial canary test, traffic to the green environment is split evenly with the blue environment (50/50). Finally, all traffic is directed to the green environment.](/img/well-architected-framework/blue-green-canary-tests-deployments/rolling-deployment.png) -## Containerized applications +## Deploy containerized applications with orchestration tools Containers can use rolling, blue/green, and canary deployments, through orchestration tools like Nomad and Kubernetes. @@ -56,7 +56,7 @@ Nomad supports rolling updates as a first-class feature. To enable rolling updat By default, Kubernetes uses rolling updates. Kubernetes does this by incrementally replacing current pods with new ones. The new Pods are scheduled on Nodes with available resources, and Kubernetes waits for those new Pods to start before removing the old Pods. -As described in [infrastructure-changes](#infrastructure-changes), both Nomad and Kubernetes support blue/green deployments. 
Before sending all your traffic to your new cluster, you can use canary testing to ensure the new cluster is working as intended. +Both Nomad and Kubernetes support blue/green deployments. Before sending all your traffic to your new cluster, you can use canary testing to ensure the new cluster is working as intended. HashiCorp resources: - Learn how to use blue/green deployments with the [Nomad blue/green and canary deployments](/nomad/tutorials/job-updates/job-blue-green-and-canary-deployments#blue-green-deployments) tutorial. @@ -69,3 +69,7 @@ External resources: ## Next steps In this section of [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments), you learned about methods to deploy application changes with zero-downtime. Zero-downtime deployments is part of the [Define and automate processes pillar](/well-architected-framework/define-and-automate-processes). + +- [Implement zero-downtime deployments with blue/green, canary, and rolling strategies](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) +- [Deploy blue/green infrastructure for zero-downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure) +- [Deploy applications with traffic splitting for zero downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh) \ No newline at end of file diff --git a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/index.mdx b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/index.mdx index 656357d0fc..40c76102cf 100644 --- a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/index.mdx +++ 
b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/index.mdx @@ -1,9 +1,9 @@ --- -page_title: Zero-downtime deployments -description: Implement zero-downtime deployment strategies to eliminate service disruption during updates and enable continuous delivery with minimal risk. +page_title: Implement zero-downtime deployments with blue/green, canary, and rolling strategies +description: Learn how to eliminate service disruption with zero-downtime deployment strategies. Compare blue/green, canary, and rolling deployments to choose the right approach for your infrastructure and applications. --- -# Zero-downtime deployments +# Implement zero-downtime deployments with blue/green, canary, and rolling strategies Zero-downtime deployment strategies aim to reduce or eliminate downtime when you update your infrastructure or applications. These strategies involve deploying new versions incrementally rather than all at once to detect and resolve issues. Each strategy lets you test the new version in an environment with real user traffic. This helps validate the new release's performance and reliability. @@ -25,6 +25,8 @@ Blue/green, canary, and rolling deployments all improve application reliability The difference between these strategies is how and where the application deploys. This involves the environment the application runs in, cost considerations, deployment methods, and traffic direction. +## When to use each deployment strategy + | | Blue/Green | Canary | Rolling | |-----------------------|-------------------------------------------------|---------------------------------------------------------------------------------------------|-------------------------------------------------| | **Environment Setup** | Requires two nearly identical environments. | Requires two nearly identical environments. Starts with a small subset of users or servers. | Updates subsets of servers in batches. 
| @@ -49,6 +51,6 @@ External resources: In this overview of Zero-downtime deployments, you learned the benefits and tradeoffs of zero-downtime deployments techniques. Visit the following documents to learn specifics on infrastructure, application, and service mesh. Zero-downtime deployments is part of the [Define and automate processes pillar](/well-architected-framework/define-and-automate-processes). -- [Zero-downtime infrastructure deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure) -- [Zero-downtime application deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/applications) -- [Zero-downtime deployments with service mesh](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh) +- [Deploy applications with zero downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/applications) +- [Deploy blue/green infrastructure for zero-downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure) +- [Deploy applications with traffic splitting for zero downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh) diff --git a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure.mdx b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure.mdx index e84f92ccfe..422ec2aadb 100644 --- a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure.mdx +++ b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure.mdx @@ -1,28 +1,28 @@ --- -page_title: Deploy blue green infrastructure for 
zero-downtime -description: Learn how to implement blue green deployment strategies for zero-downtime infrastructure changes. +page_title: Deploy blue/green infrastructure for zero-downtime +description: Learn how to implement blue/green deployment strategies for zero-downtime infrastructure changes. --- -# Deploy blue green infrastructure +# Deploy blue/green infrastructure -Infrastructure changes like server or network policy updates can cause costly downtime if not managed correctly. Blue green deployment strategies lower this risk by maintaining two identical production environments, allowing you to test changes before switching traffic. This guide explains what blue green infrastructure is and how Terraform can help you implement it. +Infrastructure changes like server or network policy updates can cause costly downtime if not managed correctly. Blue/green deployment strategies lower this risk by maintaining two identical production environments, allowing you to test changes before switching traffic. This guide explains what blue/green infrastructure is and how Terraform can help you implement it. -## What is blue green infrastructure +## What is blue/green infrastructure -Blue green deployments require two identical application infrastructure environments, a method for deploying your application to your two environments, and a way to route your traffic between them. +Blue/green deployments require two identical application infrastructure environments, a method for deploying your application to your two environments, and a way to route your traffic between them. -The following diagram shows a basic blue green deployment. The blue environment is the infrastructure where your current application runs. The green environment is identical, except that you have upgraded it to host the new version of the application. +The following diagram shows a basic blue/green deployment. The blue environment is the infrastructure where your current application runs. 
The green environment is identical, except that you have upgraded it to host the new version of the application. -![Typical blue green deployment. The green environment runs in parallel with the blue environment. When you are ready to switch to the green environment the load balancer directs traffic to the green environment.](/img/well-architected-framework/blue-green-canary-tests-deployments/blue-green-deployment.png) +![Typical blue/green deployment. The green environment runs in parallel with the blue environment. When you are ready to switch to the green environment the load balancer directs traffic to the green environment.](/img/well-architected-framework/blue-green-canary-tests-deployments/blue-green-deployment.png) You set up the blue and green environments as similar as possible. Infrastructure as code (IaC) lets you describe your environment as code and consistently deploy identical environments. IaC makes your operations more cost-effective by allowing you to easily build and remove resources when you do not need them. Using IaC also lets you spin up your green environment whenever you need it. Instead of letting your blue and green environments persist indefinitely or allocating time to build them, you deploy your green infrastructure environment when you want to deploy your new software application. Once your green environment is stable, you can tear down your blue environment. -## Using Terraform for blue green deployments +## Using Terraform for blue/green deployments -HashiCorp's Terraform is an infrastructure as code tool that can help you deploy and manage blue green infrastructure environments. By using Terraform modules, you can consistently deploy identical infrastructure using the same code but in different environments through variables. You can also define feature toggles in your Terraform code to create a blue and green deployment environment simultaneously. 
You can then test your application in your new green environment, and then, when you are ready, set the toggle in your code to destroy your blue environment. +HashiCorp's Terraform is an infrastructure as code tool that can help you deploy and manage blue/green infrastructure environments. By using Terraform modules, you can consistently deploy identical infrastructure using the same code but in different environments through variables. You can also define feature toggles in your Terraform code to create a blue and green deployment environment simultaneously. You can then test your application in your new green environment, and then, when you are ready, set the toggle in your code to destroy your blue environment. HashiCorp resources: @@ -36,4 +36,8 @@ External resources: ## Next steps -In this section of [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments), you learned about methods to deploy infrastructure changes with zero-downtime. Zero-downtime deployments is part of the [Define and automate processes pillar](/well-architected-framework/define-and-automate-processes). \ No newline at end of file +In this section of [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments), you learned about methods to deploy infrastructure changes with zero-downtime. Zero-downtime deployments is part of the [Define and automate processes pillar](/well-architected-framework/define-and-automate-processes). 
+ +- [Implement zero-downtime deployments with blue/green, canary, and rolling strategies](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) +- [Deploy applications with zero downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/applications) +- [Deploy applications with traffic splitting for zero downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh) \ No newline at end of file diff --git a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh.mdx b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh.mdx index 3924fd02ae..05cb56167d 100644 --- a/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh.mdx +++ b/content/well-architected-framework/docs/docs/define-and-automate-processes/deploy/zero-downtime-deployments/service-mesh.mdx @@ -1,9 +1,9 @@ --- -page_title: Service mesh deployments -description: Use service splitters and traffic routing to implement zero-downtime deployments with gradual traffic shifting and rollback capabilities. +page_title: Deploy applications with traffic splitting for zero downtime +description: Deploy application updates without downtime by routing traffic between versions dynamically. Learn gradual traffic shifting strategies that enable instant rollback and reduce deployment risk. --- -# Zero-downtime deployments with service mesh +# Deploy applications with traffic splitting for zero downtime You can use service splitters to implement zero-downtime deployments. These components, often used in service mesh architectures, allow traffic to route between different versions of an application dynamically. 
@@ -36,3 +36,7 @@ HashiCorp resources: ## Next steps In this section of [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments), you learned how to use service mesh to deploy with zero-downtime. Zero-downtime deployments is part of the [Define and automate processes pillar](/well-architected-framework/define-and-automate-processes). + +- [Implement zero-downtime deployments with blue/green, canary, and rolling strategies](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) +- [Deploy applications with zero downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/applications) +- [Deploy blue/green infrastructure for zero-downtime](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments/infrastructure) \ No newline at end of file diff --git a/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/data-management.mdx b/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/data-management.mdx index fdf9e99693..6d58d7f138 100644 --- a/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/data-management.mdx +++ b/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/data-management.mdx @@ -1,13 +1,13 @@ --- -page_title: Implement data management policies -description: Implement data management policies to reduce storage costs, ensure compliance, and manage data lifecycles with infrastructure as code. +page_title: Automate cloud storage lifecycle policies +description: Learn how to automate data lifecycle policies using Terraform and infrastructure as code. Reduce cloud storage costs, ensure compliance, and manage AWS S3, GCP, and Azure data retention policies. 
--- -# Implement data management policies +# Automate cloud storage lifecycle policies -You can use data management policies to manage the lifecycle of your organization's data. When you store data either in the cloud or on-premises, it is important to define and automate the policies around managing that data. Defining management with infrastructure as code tools, such as Terraform, ensures you consistently apply these policies across all environments and resources. +Data lifecycle management policies help organizations automatically manage cloud storage costs, meet compliance requirements, and secure sensitive data. Using infrastructure as code tools like Terraform, you can define, version, and apply lifecycle rules across AWS S3, Google Cloud Storage, and Azure Blob Storage. -## Why you should use lifecycle policies +## Benefits of automated data lifecycle policies Most major cloud providers offer lifecycle management features for their storage services. These features allow you to define rules that automatically transition data between different storage classes based on age or access patterns, and delete data that has reached the end of its retention period. @@ -77,8 +77,8 @@ Other cloud providers, such as [Google Cloud Platform](https://registry.terrafor HashiCorp resources: - Search the [Terraform Registry](https://registry.terraform.io/browse/providers) for the [cloud](https://registry.terraform.io/browse/providers?category=public-cloud) or [database](https://registry.terraform.io/browse/providers?category=database) provider you use. - - Learn best practices for writing Terraform with the Terraform [style guide](/terraform/language/style). +- Start learning Terraform with the [Get started tutorials](/terraform/tutorials). 
External resources: @@ -91,4 +91,7 @@ External resources: In this section of Lifecycle management, you learned about implementing data management policies, including why you should use lifecycle policies and how to automate policy management with infrastructure as code. Implement data management policies is part of the [Optimize systems](/well-architected-framework/optimize-systems) pillar. To learn more about infrastructure and resource management, refer to the following resources: + - [Automate infrastructure provisioning](/well-architected-framework/define-and-automate-processes/process-automation/process-automation-workflow) +- [Tag cloud resources](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) +- [Decommission infrastructure resources](/well-architected-framework/optimize-systems/lifecycle-management/decommission-infrastructure) diff --git a/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/decommission-infrastructure.mdx b/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/decommission-infrastructure.mdx index 5150daba5a..b8988c9268 100644 --- a/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/decommission-infrastructure.mdx +++ b/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/decommission-infrastructure.mdx @@ -1,9 +1,9 @@ --- -page_title: Decommission resources +page_title: Decommission infrastructure resources description: Learn how to decommission infrastructure components while maintaining system integrity and avoiding disruptions through proper planning and automation. --- -# Decommission resources +# Decommission infrastructure resources Resource decommissioning is the process of safely removing or deleting infrastructure components, applications, or services that are no longer needed or have reached end-of-life. 
You should remove unused or obsolete resources such as servers, databases, images, IAM, and other infrastructure components. @@ -47,13 +47,13 @@ HashiCorp resources: - [Terraform graph command](/terraform/cli/commands/graph) -## Create a communication plan +## Plan stakeholder communication Your plan should outline how you will inform stakeholders about the decommissioning process, including timelines and potential impacts. Effective communication prevents surprises and ensures all affected teams can prepare for the changes. Start by identifying all stakeholders who might be affected by the decommissioning, including development teams, operations staff, end users, and business owners. Create a notification timeline that provides adequate warning. Your communications should explain what resources you are removing, when the decommissioning will occur, and what actions stakeholders need to take. -## Create backups +## Back up data before decommissioning Before decommissioning, confirm that you have backups of any critical data or configurations associated with the resources you are removing. Backups provide a safety net in case you need to roll back changes. @@ -118,14 +118,7 @@ Consul can help you gradually remove resources by directing traffic away from se If you are using orchestration tools like Nomad or Kubernetes, you can use their built-in capabilities to drain workloads before decommissioning nodes gracefully. Nomad provides node drain functionality through the `nomad node drain` command, which prevents new scheduling new allocations on a node while safely migrating existing jobs to other available nodes. The Kubernetes `kubectl drain` command safely removes pods from nodes while respecting Pod Disruption Budgets, which ensure that a minimum number of application replicas remain available throughout the process. 
-HashiCorp resources: - -- Review the [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) documentation for strategies on how to redirect traffic and disable functions gradually. -- Learn how to [manage resource lifecycles with Terraform](/terraform/tutorials/state/resource-lifecycle). -- [Get up and running with Nomad](/nomad/tutorials/get-started) by learning about scheduling, setting up a cluster, and deploying an example job. -- Learn the [fundamentals of Consul](/consul/tutorials). - -## Verify health of infrastructure and applications +## Verify infrastructure health post-decommissioning After the decommissioning process, verify that the remaining infrastructure and applications are functioning correctly. Monitor system performance and user feedback to ensure that there are no negative impacts. @@ -138,6 +131,10 @@ You should do the following steps after you decomission the resources: HashiCorp resources: - [Learn to setup monitoring agents](/well-architected-framework/define-and-automate-processes/monitor/setup-monitoring-agents) and [dashboards and alerts](/well-architected-framework/define-and-automate-processes/monitor/dashboards-alerts). +- Review the [Zero-downtime deployments](/well-architected-framework/define-and-automate-processes/deploy/zero-downtime-deployments) documentation for strategies on how to redirect traffic and disable functions gradually. +- Learn how to [manage resource lifecycles with Terraform](/terraform/tutorials/state/resource-lifecycle). +- [Get up and running with Nomad](/nomad/tutorials/get-started) by learning about scheduling, setting up a cluster, and deploying an example job. +- Learn the [fundamentals of Consul](/consul/tutorials). 
External resources: @@ -149,4 +146,5 @@ In this section of Lifecycle management, you learned about decommissioning resou To learn more about infrastructure and resource management, refer to the following resource: -- [Data management](/well-architected-framework/optimize-systems/lifecycle-management/data-management) \ No newline at end of file +- [Automate cloud storage lifecycle policies](/well-architected-framework/optimize-systems/lifecycle-management/data-management) +- [Tag cloud resources](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) \ No newline at end of file diff --git a/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/tag-cloud-resources.mdx b/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/tag-cloud-resources.mdx index 93195e6a5e..08a56ac69f 100644 --- a/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/tag-cloud-resources.mdx +++ b/content/well-architected-framework/docs/docs/optimize-systems/lifecycle-management/tag-cloud-resources.mdx @@ -1,9 +1,9 @@ --- -page_title: Tag cloud resources +page_title: Create and implement a cloud resource tagging strategy description: Implement cloud resource tagging best practices with Terraform for AWS, Azure, and GCP. Learn to automate tags, enforce policies, and optimize cost allocation using infrastructure as code. --- -# Tag cloud resources +# Create and implement a cloud resource tagging strategy Managing thousands of cloud resources across regions, environments, and teams is complex. Tags are key-value pairs that help you manage, identify, organize, locate, and filter resources. It is important to have a clear, well-defined cloud resource tagging strategy. You can also use tags to track cost allocation and usage, and automate resource management tasks. 
@@ -18,7 +18,7 @@ When you implement a tagging strategy, you gain the following benefits: - **Tag-based resource automation:** Automate resource management tasks based on tags, such as starting or stopping instances. - **Default resource compliance:** Enforce tagging policies to ensure all resources are tagged correctly. -## Deploy tags using infrastructure as code +## How to deploy tags using infrastructure as code Consistent implementation of your tagging strategy helps you track infrastructure costs, manage resources, and ensure compliance. When you use an inconsistent tagging strategy, such as manual tagging, you may end up with resources with incorrect or missing tags. @@ -68,7 +68,7 @@ HashiCorp resources: - [GCP default tags](https://registry.terraform.io/providers/hashicorp/google/latest/docs/guides/provider_reference#default_labels-1) - Learn how to [configure default tags for AWS resources](/terraform/tutorials/aws/aws-default-tags) -## Enforce tagging strategy +## Enforce tagging strategy policies Once you define and implement your tagging strategy using infrastructure as code, you can enforce it to prevent the deployment of resources that do not comply. diff --git a/content/well-architected-framework/docs/docs/optimize-systems/manage-cost/create-cloud-budgets.mdx b/content/well-architected-framework/docs/docs/optimize-systems/manage-cost/create-cloud-budgets.mdx index e1aee99510..e2863d3079 100644 --- a/content/well-architected-framework/docs/docs/optimize-systems/manage-cost/create-cloud-budgets.mdx +++ b/content/well-architected-framework/docs/docs/optimize-systems/manage-cost/create-cloud-budgets.mdx @@ -1,6 +1,6 @@ --- -page_title: Create cloud budgets -description: Create cloud budgets and spending alerts using Terraform for AWS, Azure, and GCP. Implement cost monitoring, anomaly detection, and automated notifications with infrastructure as code. 
+page_title: Prevent cloud cost overruns with automated budgets and spending alerts +description: Learn how to set cloud budgets and spending alerts in AWS, Azure, and GCP with Terraform. Control costs, detect anomalies, and automate notifications through infrastructure as code. --- # Create cloud budgets @@ -16,7 +16,7 @@ Implementing a budget provides you with the following benefits: -The Terraform example in this document uses the `tags` block. Refer to the [Tag cloud resources](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) document to learn about implementing a tagging strategy. +The Terraform example in this document uses the `tags` block. Refer to the [Create and implement a cloud resource tagging strategy](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) document to learn about implementing a tagging strategy. @@ -26,7 +26,9 @@ Most major cloud providers offer native tools to create budgets. These native to You can use Terraform to define and manage cloud budgets across your organization. You can create Terraform modules to create budgets for different teams, projects, or environments. These modules can automatically apply appropriate budget thresholds, alerting mechanisms, and spending limits to new or existing cloud resources. -If you're tracking resources by tags, it is important to have a well-defined tagging strategy to ensure budgets are applied correctly. Terraform can help you enforce tagging policies and ensure that all resources are tagged consistently. Creating infrastructure manually can lead to incorrect or missing tags on resources and result in inaccurate budget tracking. +If you're tracking resources by tags, it is important to have a well-defined tagging strategy to ensure you apply budgets correctly. Terraform can help you enforce tagging policies and tag all resources consistently. 
Creating infrastructure manually can lead to incorrect or missing tags on resources and result in inaccurate budget tracking. + +## Cloud budget configuration with Terraform The following is an example of a Terraform configuration that creates an AWS EC2 budget. This budget tracks EC2 instance costs and sends an alert to test@example.com when the forecasted cost exceeds 100% of the budget. You can set similar budgets and alerts for other cloud providers, such as Azure and GCP. @@ -78,6 +80,7 @@ For Google Cloud Platform, the `google_billing_budget` resource operates at the HashiCorp resources: - Learn how to [Tag cloud resources](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) +- Start learning Terraform with the [Get started tutorials](/terraform/tutorials). - Terraform resource: [aws_budgets_budget](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/budgets_budget) - Terraform resource: [azurerm_consumption_budget_subscription](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/consumption_budget_subscription) - Terraform resource: [google_billing_budget](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/billing_budget) @@ -88,61 +91,6 @@ External resources: - Azure Cost Management and Billing: [Create and manage budgets](https://learn.microsoft.com/en-us/azure/cost-management-billing/costs/tutorial-acm-create-budgets) - Google Cloud Budgets and alerts: [Creating budgets](https://cloud.google.com/billing/docs/how-to/budgets) -## Detect spending anomalies - -Anomaly detection identifies unusual spending patterns rather than absolute thresholds. For example, if your monthly EC2 spending suddenly doubles from $2,000 to $4,000 but remains under your $5,000 budget, a budget alert would not trigger. However, anomaly detection would flag this unusual increase for investigation. 
Anomaly detection helps you catch issues like misconfigured autoscaling, forgotten resources, or unauthorized usage before they significantly impact costs. - -Most cloud providers offer machine learning-based anomaly detection that learns your normal usage patterns and alerts you when spending deviates from the baseline. You can configure anomaly detection with AWS Cost Anomaly Detection and Azure Cost Management using Terraform. - -The following is an example Terraform configuration that sets up cost anomaly detection with email alerts in AWS. This cost anomaly detection will detect the previous EC2 scenario. - -```hcl -resource "aws_ce_anomaly_monitor" "test" { - name = "AWSServiceMonitor" - monitor_type = "DIMENSIONAL" - monitor_dimension = "SERVICE" -} - -resource "aws_ce_anomaly_subscription" "test" { - name = "DAILYSUBSCRIPTION" - frequency = "DAILY" - - monitor_arn_list = [ - aws_ce_anomaly_monitor.test.arn - ] - - subscriber { - type = "EMAIL" - address = "abc@example.com" - } - - threshold_expression { - dimension { - key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE" - match_options = ["GREATER_THAN_OR_EQUAL"] - values = ["100"] - } - } -} -``` - -Some of the key components in the previous example include: - -- **aws_ce_anomaly_monitor:** Tracks spending patterns across all AWS services including EC2, S3, and Lambda. -- **frequency = "DAILY":** Sends a daily summary of detected anomalies. -- **threshold_expression:** Alerts when the anomaly's financial impact meets or exceeds $100. 
- -HashiCorp resources: - -- Terraform resource: [aws_ce_anomaly_subscription](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ce_anomaly_subscription) -- Terraform resource: [azurerm_cost_anomaly_alert](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/cost_anomaly_alert) - -External resources: - -- [AWS getting started with AWS Cost Anomaly Detection](https://docs.aws.amazon.com/cost-anomaly/latest/userguide/what-is-cost-anomaly.html) -- [Azure identify anomalies and unexpected changes in cost](https://learn.microsoft.com/en-us/azure/cost-management-billing/understand/analyze-unexpected-charges) -- [Google cloud anomaly detection overview](https://cloud.google.com/bigquery/docs/anomaly-detection-overview) - ## Next steps In this section of Manage cost, you learned about creating budgets and alerts to manage and control cloud spending, including creating spending limits with cloud provider budgets and detecting spending anomalies automatically. Create cloud budgets is part of the [Optimize systems](/well-architected-framework/optimize-systems). 
@@ -151,5 +99,6 @@ To learn more about managing resources with Terraform, view the following resour - [Create reusable infrastructure modules](/well-architected-framework/define-and-automate-processes/define/modules) - [Implement CI/CD](/well-architected-framework/define-and-automate-processes/automate/cicd) +- [Create and implement a cloud resource tagging strategy](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) - [Reduce costs with Terraform Cloud ephemeral workspaces](https://www.youtube.com/watch?v=-woCmG8yGdA) -- [Tag cloud resources](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) \ No newline at end of file +- [Create and implement a cloud resource tagging strategy](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) \ No newline at end of file diff --git a/content/well-architected-framework/docs/docs/optimize-systems/manage-cost/detect-cloud-spending-anomalies.mdx b/content/well-architected-framework/docs/docs/optimize-systems/manage-cost/detect-cloud-spending-anomalies.mdx new file mode 100644 index 0000000000..95b6f88a26 --- /dev/null +++ b/content/well-architected-framework/docs/docs/optimize-systems/manage-cost/detect-cloud-spending-anomalies.mdx @@ -0,0 +1,71 @@ +--- +page_title: How to catch cloud spending anomalies before they spike +description: Catch unusual cloud spending patterns before they become costly. Learn how to implement anomaly detection with alerts using infrastructure as code. +--- + +# How to catch cloud spending anomalies before they spike + +Monitoring for cloud spending anomalies helps you identify cost issues that budgets miss. For example, if your monthly cloud spending suddenly doubles from $2,000 to $4,000 but remains under your $5,000 budget, a budget alert would not trigger. However, anomaly detection would flag this unusual increase for investigation.
Anomaly detection helps you catch issues like misconfigured autoscaling, forgotten resources, or unauthorized usage before they significantly impact costs. + +Most cloud providers offer machine learning-based anomaly detection that learns your normal usage patterns and alerts you when spending deviates from the baseline. You can configure anomaly detection with AWS Cost Anomaly Detection and Azure Cost Management using Terraform. + + +## Set up anomaly detection in AWS + +The following is an example Terraform configuration that sets up cost anomaly detection with email alerts in AWS. This configuration detects the previous scenario, where monthly spending doubles while remaining under budget. + +```hcl +resource "aws_ce_anomaly_monitor" "test" { + name = "AWSServiceMonitor" + monitor_type = "DIMENSIONAL" + monitor_dimension = "SERVICE" +} + +resource "aws_ce_anomaly_subscription" "test" { + name = "DAILYSUBSCRIPTION" + frequency = "DAILY" + + monitor_arn_list = [ + aws_ce_anomaly_monitor.test.arn + ] + + subscriber { + type = "EMAIL" + address = "abc@example.com" + } + + threshold_expression { + dimension { + key = "ANOMALY_TOTAL_IMPACT_ABSOLUTE" + match_options = ["GREATER_THAN_OR_EQUAL"] + values = ["100"] + } + } +} +``` + +Some of the key components in the previous example include: + +- **aws_ce_anomaly_monitor:** Tracks spending patterns across all AWS services including EC2, S3, and Lambda. +- **frequency = "DAILY":** Sends a daily summary of detected anomalies. +- **threshold_expression:** Alerts when the anomaly's financial impact meets or exceeds $100. + +HashiCorp resources: + +- Terraform resource: [aws_ce_anomaly_subscription](https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/ce_anomaly_subscription) +- Terraform resource: [azurerm_cost_anomaly_alert](https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/cost_anomaly_alert) +- Start learning Terraform with the [Get started tutorials](/terraform/tutorials).
+ +External resources: + +- [AWS getting started with AWS Cost Anomaly Detection](https://docs.aws.amazon.com/cost-anomaly/latest/userguide/what-is-cost-anomaly.html) +- [Azure identify anomalies and unexpected changes in cost](https://learn.microsoft.com/en-us/azure/cost-management-billing/understand/analyze-unexpected-charges) +- [Google cloud anomaly detection overview](https://cloud.google.com/bigquery/docs/anomaly-detection-overview) + +In this section of Manage cost, you learned about detecting cloud spending anomalies using Terraform. Detect cloud spending anomalies is part of the [Optimize systems](/well-architected-framework/optimize-systems). + +To learn more about managing resources with Terraform, view the following resources: + +- [Reduce costs with Terraform Cloud ephemeral workspaces](https://www.youtube.com/watch?v=-woCmG8yGdA) +- [Create and implement a cloud resource tagging strategy](/well-architected-framework/optimize-systems/lifecycle-management/tag-cloud-resources) +- [Prevent cloud cost overruns with automated budgets and spending alerts](/well-architected-framework/optimize-systems/manage-cost/create-cloud-budgets) \ No newline at end of file