2 changes: 1 addition & 1 deletion .github/plugins/azure-skills/.claude-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "azure",
"description": "Microsoft Azure MCP and Skills integration for cloud resource management, deployments, and Azure services. Manage your Azure infrastructure, monitor applications, and deploy resources directly from Claude Code.",
-"version": "1.1.22",
+"version": "1.1.26",
"author": {
"name": "Microsoft",
"url": "https://www.microsoft.com"
2 changes: 1 addition & 1 deletion .github/plugins/azure-skills/.cursor-plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "azure",
"description": "Microsoft Azure MCP and Skills integration for cloud resource management, deployments, and Azure services. Manage your Azure infrastructure, monitor applications, and deploy resources directly from Cursor.",
-"version": "1.1.22",
+"version": "1.1.26",
"author": {
"name": "Microsoft",
"url": "https://www.microsoft.com"
2 changes: 1 addition & 1 deletion .github/plugins/azure-skills/.plugin/plugin.json
@@ -1,7 +1,7 @@
{
"name": "azure",
"description": "Microsoft Azure MCP and Skills integration for cloud resource management, deployments, and Azure services. Manage your Azure infrastructure, monitor applications, and deploy resources directly from your development environment.",
-"version": "1.1.22",
+"version": "1.1.26",
"author": {
"name": "Microsoft",
"url": "https://www.microsoft.com"
4 changes: 4 additions & 0 deletions .github/plugins/azure-skills/CHANGELOG.md
@@ -1,5 +1,9 @@
# Changelog

## 1.1.25

- fix: update toolbox sample link ([#2078](https://github.com/microsoft/GitHub-Copilot-for-Azure/pull/2078))

## 1.1.22

- fix: Remove context7 MCP server from plugin config ([#2100](https://github.com/microsoft/GitHub-Copilot-for-Azure/pull/2100))
@@ -1,10 +1,10 @@
---
name: azure-diagnostics
-description: "Debug Azure production issues on Azure using AppLens, Azure Monitor, resource health, and safe triage. WHEN: debug production issues, troubleshoot container apps, troubleshoot functions, troubleshoot AKS, kubectl cannot connect, kube-system/CoreDNS failures, pod pending, crashloop, node not ready, upgrade failures, analyze logs, KQL, insights, image pull failures, cold start issues, health probe failures, resource health, root cause of errors, troubleshoot event hubs, troubleshoot service bus, messaging SDK error, AMQP connection failure, message lock lost, service bus dead letter."
+description: "Debug Azure production issues on Azure using AppLens, Azure Monitor, resource health, and safe triage. WHEN: debug production issues, troubleshoot app service, app service high CPU, app service deployment failure, troubleshoot container apps, troubleshoot functions, troubleshoot AKS, kubectl cannot connect, kube-system/CoreDNS failures, pod pending, crashloop, node not ready, upgrade failures, analyze logs, KQL, insights, image pull failures, cold start issues, health probe failures, resource health, root cause of errors, troubleshoot event hubs, troubleshoot service bus, messaging SDK error, AMQP connection failure, message lock lost, service bus dead letter."
license: MIT
metadata:
author: Microsoft
-version: "1.1.3"
+version: "1.1.4"
---

# Azure Diagnostics
@@ -22,6 +22,8 @@ Activate this skill when user wants to:
- Fix image pull, cold start, or health probe issues
- Investigate why Azure resources are failing
- Find root cause of application errors
- Troubleshoot App Service issues (high CPU, deployment failures, crashes, slow responses, TLS/custom domains)
- Respond to prompts like "troubleshoot app service", "app service high CPU", or "app service deployment failure"
- Troubleshoot Azure Function Apps (invocation failures, timeouts, binding errors)
- Find the App Insights or Log Analytics workspace linked to a Function App
- Troubleshoot AKS clusters, nodes, pods, ingress, or Kubernetes networking issues
@@ -53,6 +55,7 @@ Activate this skill when user wants to:
| Service | Common Issues | Reference |
|---------|---------------|-----------|
| **Container Apps** | Image pull failures, cold starts, health probes, port mismatches | [container-apps/](references/container-apps/README.md) |
| **App Service** | High CPU, deployment failures, crashes, slow responses, TLS/custom domains | [app-service/](references/app-service/README.md) |
| **Function Apps** | App details, invocation failures, timeouts, binding errors, cold starts, missing app settings | [functions/](references/functions/README.md) |
| **AKS** | Cluster access, nodes, `kube-system`, scheduling, crash loops, ingress, DNS, upgrades | [AKS Troubleshooting](troubleshooting/aks/aks-troubleshooting.md) |
| **Messaging** | Event Hubs & Service Bus SDK errors, AMQP failures, message lock, connectivity | [Messaging Troubleshooting](troubleshooting/messaging/README.md) |
@@ -143,5 +146,6 @@ az monitor activity-log list -g RG --max-events 20

- [KQL Query Library](references/kql-queries.md)
- [Azure Resource Graph Queries](references/azure-resource-graph.md)
- [App Service Troubleshooting](references/app-service/README.md)
- [Function Apps Troubleshooting](references/functions/README.md)
- [Messaging Troubleshooting](troubleshooting/messaging/README.md)
@@ -0,0 +1,183 @@
# App Service Troubleshooting

## Common Issues Matrix

| Symptom | Likely Cause | Action |
|---------|--------------|-----------|
| High CPU / memory | Runaway process, inefficient code | Use Process Explorer via Kudu, scale up |
| Deployment failure | Build error, locked files, quota | Check Kudu logs at `https://APP.scm.azurewebsites.net/api/deployments` for build errors, locked files, or an exhausted storage quota |
| App crash / restart | Unhandled exception, OOM kill | Review Event Log and STDERR in Diagnose & Solve |
| Slow responses | Downstream dependency, no caching | Enable request tracing, check dependency calls |
| 502 / 503 errors | App not starting, port conflict | Check STDERR logs, verify startup command |
| TLS / domain errors | Certificate expired, DNS mismatch | `az webapp config ssl list`, verify CNAME |
| Health check failure | Endpoint not returning 200 | Verify health check path responds within 2 min |

---

## High CPU / Memory Diagnosis

**Diagnose:**
```bash
# Check plan-level CPU and memory (CpuPercentage and MemoryPercentage are App Service plan metrics)
az monitor metrics list --resource PLAN_RESOURCE_ID \
--metric CpuPercentage MemoryPercentage --interval PT1M --output table

# View running processes via ARM Processes API (Entra ID auth)
az rest --method get \
--uri "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<app-name>/processes?api-version=2024-04-01"
```

**Fix:** Scale up (`az appservice plan update -n <app-service-plan-name> -g <resource-group> --sku P1V3`) or profile the app via Kudu Process Explorer at `https://APP.scm.azurewebsites.net/ProcessExplorer/` to identify hot paths.
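
The commands above expect full ARM resource IDs. A minimal sketch of helpers that assemble those IDs from their parts, so the same value can be reused across `az monitor` and `az rest` calls. The function names `app_resource_id` and `plan_resource_id` are illustrative and not part of the Azure CLI, and the example values are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not Azure CLI commands): build ARM resource IDs.

# app_resource_id SUBSCRIPTION_ID RESOURCE_GROUP APP_NAME
app_resource_id() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Web/sites/%s\n' "$1" "$2" "$3"
}

# plan_resource_id SUBSCRIPTION_ID RESOURCE_GROUP PLAN_NAME
plan_resource_id() {
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Web/serverfarms/%s\n' "$1" "$2" "$3"
}

# Example with placeholder values (RG and APP are not real names):
# APP_RESOURCE_ID="$(app_resource_id "$(az account show --query id -o tsv)" RG APP)"
# az rest --method get --uri "${APP_RESOURCE_ID}/processes?api-version=2024-04-01"
```

Building the ID once and storing it in a variable avoids typos when the same app is referenced by several commands in one session.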

---

## Deployment Failure Analysis

**Diagnose:**
```bash
# List deployment history
az webapp log deployment list -n APP -g RG --output table

# View deployment log for a specific deployment
az webapp log deployment show -n APP -g RG --deployment-id DEPLOY_ID

# Stream live logs (run while redeploying to watch startup output)
az webapp log tail -n APP -g RG
```

**KQL — Failed deployments:**
```kql
// Replace <app-service-resource-id> with the full resource ID, for example:
// /subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.Web/sites/<app-name>
AppServicePlatformLogs
| where TimeGenerated > ago(24h)
| where Level == "Error" and _ResourceId =~ "<app-service-resource-id>"
| project TimeGenerated, Level, Message
| order by TimeGenerated desc
```
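
The query above can also be run from the terminal with `az monitor log-analytics query`. A sketch of a helper that fills in the resource ID and lookback window; the function name `platform_error_query` and the `WORKSPACE_ID` variable are assumptions for illustration, and the comparison uses `=~` so that casing differences in the resource ID do not hide rows:

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the platform-error query for one app.
# Args: RESOURCE_ID [LOOKBACK]; LOOKBACK defaults to 24h.
platform_error_query() {
  local resource_id="$1" lookback="${2:-24h}"
  cat <<EOF
AppServicePlatformLogs
| where TimeGenerated > ago(${lookback})
| where Level == "Error" and _ResourceId =~ "${resource_id}"
| project TimeGenerated, Level, Message
| order by TimeGenerated desc
EOF
}

# Example (WORKSPACE_ID is the Log Analytics workspace GUID, a placeholder here):
# az monitor log-analytics query -w "$WORKSPACE_ID" \
#   --analytics-query "$(platform_error_query "$APP_RESOURCE_ID" 6h)" -o table
```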

**Common deployment failures:**

| Error Message | Cause | Fix |
|---------------|-------|-----|
| `WEBSITE_RUN_FROM_PACKAGE=1` but no package | Missing zip deploy artifact | Redeploy with `az webapp deploy -n APP -g RG --src-path app.zip --type zip` |
| `Error building on server` | Oryx build failure | Check build logs, pin runtime version |
| `Locked file` during deploy | Files in use | Set the app setting `MSDEPLOY_RENAME_LOCKED_FILES` to `1` on the App Service resource so that MSDeploy renames locked files instead of failing. |

---

## Application Crash / Restart Diagnosis

**Diagnose:**
```bash
# Check recent restarts via activity log (--resource-id cannot be combined with -g)
az monitor activity-log list --resource-id APP_RESOURCE_ID \
--max-events 10 --query "[?operationName.value=='Microsoft.Web/sites/restart/action']"

# Download the log archive (includes STDERR/STDOUT on Linux)
az webapp log download -n APP -g RG --log-file logs.zip
```

**KQL — App crashes and errors:**
```kql
AppServiceConsoleLogs
| where TimeGenerated > ago(1h)
| where ResultDescription contains "error" or ResultDescription contains "fatal"
| project TimeGenerated, ResultDescription
| order by TimeGenerated desc
| take 50
```

**Health check failures:**
```bash
# Show health check config
az webapp show -n APP -g RG --query "siteConfig.healthCheckPath"

# Test the endpoint directly (replace /health with the configured path)
curl -s -o /dev/null -w "%{http_code}" https://APP.azurewebsites.net/health
```

> ⚠️ **Warning:** An instance that stays unhealthy for 1 hour is replaced with a new one. By default, no more than half of the instances are taken out of the load balancer rotation at a time.
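
A single `curl` only shows one moment in time. The sketch below polls the endpoint several times to catch intermittent failures. The function names are illustrative, and treating any 2xx status as healthy is an assumption of this sketch:

```bash
#!/usr/bin/env bash
# Hypothetical helpers for probing a health endpoint; names are illustrative.

# is_healthy_status CODE: succeeds only for HTTP status codes 200-299.
is_healthy_status() {
  case "$1" in
    2[0-9][0-9]) return 0 ;;
    *) return 1 ;;
  esac
}

# probe_health URL [ATTEMPTS] [DELAY_SECONDS]: poll URL and print each result.
# Returns 0 if every attempt was healthy, 1 otherwise.
probe_health() {
  local url="$1" attempts="${2:-5}" delay="${3:-10}" failed=0 code i
  for ((i = 1; i <= attempts; i++)); do
    # curl reports 000 when no HTTP response was received
    code="$(curl -s -o /dev/null --max-time 30 -w '%{http_code}' "$url")"
    if is_healthy_status "$code"; then
      echo "attempt $i: $code healthy"
    else
      echo "attempt $i: $code UNHEALTHY"
      failed=1
    fi
    if (( i < attempts )); then
      sleep "$delay"
    fi
  done
  return "$failed"
}

# Example (APP and /health are placeholders):
# probe_health "https://APP.azurewebsites.net/health" 5 10
```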

---

## Slow Response Time Investigation

**Diagnose:**
```bash
# Check average response time
az monitor metrics list --resource APP_RESOURCE_ID \
--metric "HttpResponseTime" --interval PT5M --aggregation Average --output table

# Enable failed request tracing
az webapp log config -n APP -g RG --failed-request-tracing true
```

**KQL: Slowest requests (over 5 seconds):**
```kql
AppServiceHTTPLogs
| where TimeGenerated > ago(1h)
| where TimeTaken > 5000
| project TimeGenerated, CsUriStem, ScStatus, TimeTaken, CsHost
| order by TimeTaken desc
| take 20
```

**Auto-Heal — Automatic mitigation:**
```bash
# Configure auto-heal to recycle on slow requests
az webapp config set -n APP -g RG \
--auto-heal-enabled true \
--generic-configurations '{"autoHealRules":{"triggers":{"slowRequests":{"timeTaken":"00:00:30","count":10,"timeInterval":"00:02:00"}},"actions":{"actionType":"Recycle"}}}'
```
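
To tune the thresholds without hand-editing the JSON, the payload can be generated from plain numbers. A minimal sketch that reproduces the structure used above; `to_hms` and `autoheal_slow_request_config` are hypothetical helper names, not Azure CLI features:

```bash
#!/usr/bin/env bash
# Hypothetical helpers: build the auto-heal JSON for a slow-request trigger.

# to_hms SECONDS: render a duration as hh:mm:ss.
to_hms() {
  printf '%02d:%02d:%02d' $(( $1 / 3600 )) $(( ($1 % 3600) / 60 )) $(( $1 % 60 ))
}

# autoheal_slow_request_config SLOW_SECONDS COUNT WINDOW_SECONDS
autoheal_slow_request_config() {
  printf '{"autoHealRules":{"triggers":{"slowRequests":{"timeTaken":"%s","count":%d,"timeInterval":"%s"}},"actions":{"actionType":"Recycle"}}}\n' \
    "$(to_hms "$1")" "$2" "$(to_hms "$3")"
}

# Example: recycle after 10 requests slower than 30 s within 2 minutes.
# az webapp config set -n APP -g RG --auto-heal-enabled true \
#   --generic-configurations "$(autoheal_slow_request_config 30 10 120)"
```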

---

## Custom Domain / TLS Certificate Issues

**Diagnose:**
```bash
# List custom domains
az webapp config hostname list -g RG --webapp-name APP --output table

# List TLS certificates
az webapp config ssl list -g RG --output table

# Show certificate details (thumbprint, expiration date)
az webapp config ssl show --certificate-name CERT -g RG
```

| Symptom | Cause | Fix |
|---------|-------|-----|
| `ERR_CERT_DATE_INVALID` | Certificate expired | For a certificate from an external certificate authority, upload the renewed certificate with `az webapp config ssl upload`. Alternatively, switch to a free App Service managed certificate. |
| `DNS_PROBE_FINISHED_NXDOMAIN` | CNAME not configured | Add CNAME record pointing to `APP.azurewebsites.net` |
| `SSL binding not found` | Missing SNI binding | Add the missing SNI binding using `az webapp config ssl bind --certificate-thumbprint THUMB --ssl-type SNI -n APP -g RG` |
| Managed cert pending | DNS validation incomplete | Verify TXT record `asuid.DOMAIN` matches custom domain verification ID |
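
Expiry problems can be caught before browsers report `ERR_CERT_DATE_INVALID` by checking the certificate the site actually serves. A sketch using `openssl`; the function names and the 30-day default are assumptions for illustration:

```bash
#!/usr/bin/env bash
# Hypothetical helpers; function names are illustrative.

# days_to_seconds DAYS
days_to_seconds() {
  echo $(( $1 * 86400 ))
}

# cert_valid_for HOST [DAYS]: succeeds if the certificate served for HOST
# is still valid DAYS from now (default 30). Also fails if the TLS
# connection cannot be established, because x509 then receives no input.
cert_valid_for() {
  local host="$1" days="${2:-30}"
  openssl s_client -connect "${host}:443" -servername "$host" </dev/null 2>/dev/null \
    | openssl x509 -noout -checkend "$(days_to_seconds "$days")" >/dev/null
}

# Example (the host name is a placeholder):
# cert_valid_for www.example.com 14 || echo "certificate expires within 14 days or is unreachable"
```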

---

## Azure CLI and MCP Tools for App Service Diagnostics

| Tool | Command | Use When |
|----------|---------|----------|
| `Azure CLI` | `az webapp list` | List all web apps in subscription |
| `Azure CLI` | `az webapp show -n APP -g RG` | Get app config, stack, status |
| `Azure CLI` | `az webapp config appsettings list -n APP -g RG` | Check app settings (exposed to the app as environment variables) |
| `Azure CLI` | `az webapp deployment slot list -n APP -g RG` | List deployment slots before comparing their configurations |
| `mcp_azure_mcp_appservice` | `appservice_webapp_diagnostic_diagnose` | AI-powered root cause analysis |
| `mcp_azure_mcp_monitor` | `monitor_resource_log_query` | Run KQL against Log Analytics |
| `mcp_azure_mcp_resourcehealth` | `get` | Check platform-level health status |

> 💡 **Tip:** Start with `appservice_webapp_diagnostic_diagnose` (in `mcp_azure_mcp_appservice`). It automatically runs relevant detectors and surfaces the most likely root cause before you dig into logs manually.

---

## Combined Diagnostic Script

```bash
echo "=== App Service Diagnostics ===" && \
echo "App Config:" && az webapp show -n APP -g RG --query "{state:state, runtime:siteConfig.linuxFxVersion, healthCheck:siteConfig.healthCheckPath, alwaysOn:siteConfig.alwaysOn}" -o table && \
echo "Recent Deployments:" && az webapp log deployment list -n APP -g RG --query "[:3].{id:id, status:status, time:end_time}" -o table && \
echo "App Settings:" && az webapp config appsettings list -n APP -g RG --query "[].name" -o tsv && \
echo "Custom Domains:" && az webapp config hostname list -g RG --webapp-name APP -o table
```
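
Because the script chains every step with `&&`, the first failing command stops all later steps. When a partial report is more useful than none, each step can be wrapped so that failures are reported but do not abort the run. A minimal sketch; `run_section` is a hypothetical helper, not an Azure CLI feature:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper: run one diagnostic step and keep going on failure.
# Usage: run_section "Title" command [args...]
run_section() {
  local title="$1"; shift
  echo "=== ${title} ==="
  if ! "$@"; then
    echo "WARN: '${title}' failed; continuing" >&2
  fi
  return 0
}

# Example with commands from the script above (APP and RG are placeholders):
# run_section "App Settings"   az webapp config appsettings list -n APP -g RG --query "[].name" -o tsv
# run_section "Custom Domains" az webapp config hostname list -g RG --webapp-name APP -o table
```

Warnings go to standard error, so redirecting standard output to a file still captures a clean report.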