pkg/metrics: add auto analyze failed alert rule #67733

ti-chi-bot[bot] merged 1 commit into pingcap:master
Conversation
Codecov Report

All modified and coverable lines are covered by tests. Additional details and impacted files:

@@ Coverage Diff @@
## master #67733 +/- ##
================================================
- Coverage 77.6100% 77.4344% -0.1756%
================================================
Files 1981 1965 -16
Lines 548611 548624 +13
================================================
- Hits 425777 424824 -953
- Misses 122024 123798 +1774
+ Partials 810 2 -808
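The percentages in the diff follow directly from the hits/lines columns; a quick illustrative re-computation (not part of the PR, just a sanity check of the report):

```python
# Coverage % = hits / lines, using the numbers from the Codecov table above.
base_hits, base_lines = 425777, 548611  # master
head_hits, head_lines = 424824, 548624  # this PR

base_pct = base_hits / base_lines * 100
head_pct = head_hits / head_lines * 100

print(f"master: {base_pct:.3f}%")              # 77.610%
print(f"PR:     {head_pct:.3f}%")              # 77.434%
print(f"delta:  {head_pct - base_pct:+.3f}%")  # -0.176%, matching -0.1756% above
```

The drop comes almost entirely from the -16 files / -953 hits shift in what was collected, not from the 13 added lines.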
Flags with carried forward coverage won't be shown.
Tested locally:

export WORK_DIR="$HOME/tmp/review-67733-alert"
export PLAY_TAG="case-review-67733-alert-default"
export PORT_OFFSET=38000
export SQL_PORT=42000
export STATUS_PORT=48080
export PROM_PORT=43194
export PROM_RULES="$WORK_DIR/tidb-auto-analyze.rules.yml"
export PROM_CFG="$WORK_DIR/prometheus.yml"
export PROM_DATA="$WORK_DIR/prom-data"
export PROM_BIN="$HOME/.tiup/components/prometheus/v8.5.5/prometheus/prometheus"

1. Start A Real Playground Cluster

tiup playground nightly \
  --db 1 --pd 1 --kv 1 --tiflash 0 --without-monitor \
  --tag "$PLAY_TAG" \
  --port-offset "$PORT_OFFSET" --db.binpath /Users/poe/code/tidb/bin/tidb-server

2. Start Standalone Prometheus With The PR Rule

command mkdir -p -- "$WORK_DIR" "$PROM_DATA"
cat > "$PROM_RULES" <<'RULES'
groups:
  - name: alert.rules
    rules:
      - alert: TiDB_auto_analyze_failed
        expr: increase( tidb_statistics_auto_analyze_total{type="failed"}[10m] ) > 0
        for: 1m
        labels:
          env: ENV_LABELS_ENV
          level: warning
          expr: increase( tidb_statistics_auto_analyze_total{type="failed"}[10m] ) > 0
        annotations:
          description: 'cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, values:{{ $value }}'
          value: '{{ $value }}'
          summary: TiDB auto analyze failed
RULES
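The rule's expression watches a monotonically increasing counter. Roughly speaking (ignoring Prometheus's range-window extrapolation and staleness handling), `increase()` sums the positive deltas of the counter over the window and treats any decrease as a counter reset. A simplified model of that semantics, as a sketch only:

```python
def simple_increase(samples):
    """Rough model of PromQL increase(): sum the deltas over the window,
    treating any decrease as a counter reset (process restart -> counter
    starts again from 0).  Real Prometheus also extrapolates toward the
    window boundaries, which is why the verified run below reports
    fractional values such as 1.24 rather than whole failure counts."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur >= prev:
            total += cur - prev
        else:
            # Counter reset: everything seen after the reset is new increase.
            total += cur
    return total

# Two auto-analyze failures inside the window -> the alert condition (> 0) holds.
print(simple_increase([0, 0, 1, 2]))  # 2.0
# A TiDB restart resets the counter; increase() still sees the new failure.
print(simple_increase([5, 5, 0, 1]))  # 1.0
```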
cat > "$PROM_CFG" <<EOF_CFG
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - $PROM_RULES
scrape_configs:
  - job_name: "tidb-review-67733-default"
    static_configs:
      - targets:
          - "127.0.0.1:$STATUS_PORT"
EOF_CFG
"$PROM_BIN" \
--config.file="$PROM_CFG" \
--web.listen-address="127.0.0.1:$PROM_PORT" \
--storage.tsdb.path="$PROM_DATA"

Wait until the readiness probe succeeds:

curl -sf "http://127.0.0.1:$PROM_PORT/-/ready"

3. Python Checker

Create and activate a local venv in the same shell first:

python3 -m venv "$WORK_DIR/.venv"
. "$WORK_DIR/.venv/bin/activate"
python -m pip install PyMySQL

cat > "$WORK_DIR/check_alert_timeout_e2e.py" <<'PY'
import json
import os
import sys
import time
import urllib.parse
import urllib.request

try:
    import pymysql
except ModuleNotFoundError as exc:
    raise SystemExit("Missing dependency: activate the venv and run `python -m pip install PyMySQL`") from exc

# Tunables, all overridable via environment variables.
HOST = os.getenv("CASE_HOST", "127.0.0.1")
PORT = int(os.getenv("CASE_SQL_PORT", "42000"))
STATUS_PORT = int(os.getenv("CASE_STATUS_PORT", "48080"))
PROM_PORT = int(os.getenv("CASE_PROM_PORT", "43194"))
DB = os.getenv("CASE_DB", "review67733_timeout")
TABLE = os.getenv("CASE_TABLE", "t_auto")
ROW_BATCH = int(os.getenv("CASE_ROW_BATCH", "5000"))
BASE_ROWS = int(os.getenv("CASE_BASE_ROWS", "5000000"))
DELTA_ROWS = int(os.getenv("CASE_DELTA_ROWS", "1000000"))
FAIL_ROUNDS = int(os.getenv("CASE_FAIL_ROUNDS", "2"))
MAX_TIME = int(os.getenv("CASE_MAX_AUTO_ANALYZE_TIME", "1"))
POLL_INTERVAL = float(os.getenv("CASE_POLL_INTERVAL", "0.1"))


def connect(db=None, autocommit=True):
    return pymysql.connect(
        host=HOST,
        port=PORT,
        user="root",
        password="",
        database=db,
        autocommit=autocommit,
        charset="utf8mb4",
        cursorclass=pymysql.cursors.DictCursor,
        read_timeout=120,
        write_timeout=120,
    )


def exec_sql(cur, sql, args=None):
    cur.execute(sql, args)
    try:
        return cur.fetchall()
    except Exception:
        return None


def http_json(url):
    with urllib.request.urlopen(url, timeout=15) as resp:
        return json.loads(resp.read().decode("utf-8"))


def prom_query(expr):
    query = urllib.parse.quote(expr)
    data = http_json(f"http://127.0.0.1:{PROM_PORT}/api/v1/query?query={query}")
    return data["data"]["result"]


def prom_rule_state():
    data = http_json(f"http://127.0.0.1:{PROM_PORT}/api/v1/rules")
    for group in data["data"]["groups"]:
        for rule in group["rules"]:
            if rule.get("name") == "TiDB_auto_analyze_failed":
                return rule
    raise RuntimeError("TiDB_auto_analyze_failed not found")


def prom_alerts():
    return http_json(f"http://127.0.0.1:{PROM_PORT}/api/v1/alerts")["data"]["alerts"]


def wait_for_alert_state(target, timeout_sec):
    # Poll the rules and alerts endpoints until the rule reaches `target`.
    deadline = time.time() + timeout_sec
    last_rule = None
    last_alerts = []
    while time.time() < deadline:
        last_rule = prom_rule_state()
        last_alerts = prom_alerts()
        if last_rule.get("state") == target:
            return last_rule, last_alerts
        time.sleep(POLL_INTERVAL)
    return last_rule, last_alerts


def insert_rows(cur, start, count):
    remaining = count
    next_id = start
    sql = f'''
        insert into {DB}.{TABLE}
            (id, a, b, c, d, e, f, g, h, s)
        values (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
    '''
    while remaining > 0:
        batch = min(ROW_BATCH, remaining)
        values = []
        for i in range(batch):
            v = next_id + i
            row = [v]
            for mod in [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]:
                row.append(v % mod)
            row.append(f"s{v:08d}")
            values.append(tuple(row))
        cur.executemany(sql, values)
        next_id += batch
        remaining -= batch


def latest_analyze_rows(cur):
    rows = exec_sql(cur, f"show analyze status where table_schema = '{DB}' and table_name = '{TABLE}'")
    return rows or []


def latest_failed_auto_analyze(rows):
    for row in rows:
        txt = json.dumps(row, default=str).lower()
        if "auto analyze" in txt and "failed" in txt:
            return row
    return None


def metric_failed():
    # Scrape the TiDB status port and pick out the failed auto-analyze counter.
    text = urllib.request.urlopen(f"http://{HOST}:{STATUS_PORT}/metrics", timeout=15).read().decode("utf-8", errors="replace")
    for line in text.splitlines():
        if line.startswith('tidb_statistics_auto_analyze_total{type="failed"}'):
            return float(line.split()[-1])
    return 0.0


def main():
    urllib.request.urlopen(f"http://127.0.0.1:{PROM_PORT}/-/ready", timeout=15).read()
    print("initial rule:", json.dumps(prom_rule_state(), default=str, indent=2), flush=True)
    with connect() as conn:
        with conn.cursor() as cur:
            exec_sql(cur, f"drop database if exists {DB}")
            exec_sql(cur, f"create database {DB}")
            exec_sql(cur, f"use {DB}")
            exec_sql(cur, "set global tidb_enable_auto_analyze = 0")
            exec_sql(cur, "set global tidb_auto_analyze_ratio = 0.01")
            exec_sql(cur, "set global tidb_auto_analyze_start_time = '00:00 +0000'")
            exec_sql(cur, "set global tidb_auto_analyze_end_time = '23:59 +0000'")
            exec_sql(cur, "set global tidb_analyze_version = 2")
            # A 1s cap makes auto analyze of the wide, heavily indexed table time out.
            exec_sql(cur, "set global tidb_max_auto_analyze_time = %s", (MAX_TIME,))
            exec_sql(
                cur,
                f'''
                create table {TABLE} (
                    id bigint primary key,
                    a bigint,
                    b bigint,
                    c bigint,
                    d bigint,
                    e bigint,
                    f bigint,
                    g bigint,
                    h bigint,
                    s varchar(32),
                    index ia(a),
                    index ib(b),
                    index ic(c),
                    index idd(d),
                    index ie(e),
                    index iff(f),
                    index ig(g),
                    index ih(h),
                    index is1(s)
                )
                ''',
            )
            print("loading baseline rows", BASE_ROWS, flush=True)
            insert_rows(cur, 1, BASE_ROWS)
            exec_sql(cur, "flush stats_delta")
            exec_sql(cur, f"analyze table {TABLE}")
            before = metric_failed()
            next_id = BASE_ROWS + 1
            for i in range(FAIL_ROUNDS):
                print("round", i + 1, "delta rows", DELTA_ROWS, flush=True)
                round_before = metric_failed()
                exec_sql(cur, "set global tidb_enable_auto_analyze = 0")
                insert_rows(cur, next_id, DELTA_ROWS)
                next_id += DELTA_ROWS
                exec_sql(cur, "flush stats_delta")
                exec_sql(cur, "set global tidb_enable_auto_analyze = 1")
                deadline = time.time() + 300
                row = None
                rows = []
                while time.time() < deadline:
                    rows = latest_analyze_rows(cur)
                    if metric_failed() > round_before:
                        row = latest_failed_auto_analyze(rows)
                        if row is not None:
                            break
                    time.sleep(POLL_INTERVAL)
                print("rows", json.dumps(rows, default=str, indent=2), flush=True)
                if row is None:
                    return 2
            after = metric_failed()
            print("failed metric", before, after, flush=True)
            print("prom increase", json.dumps(prom_query('increase(tidb_statistics_auto_analyze_total{type="failed"}[10m])'), indent=2), flush=True)
            pending_rule, pending_alerts = wait_for_alert_state("pending", 120)
            print("pending_rule", json.dumps(pending_rule, default=str, indent=2), flush=True)
            print("pending_alerts", json.dumps(pending_alerts, default=str, indent=2), flush=True)
            if pending_rule.get("state") != "pending":
                return 3
            firing_rule, firing_alerts = wait_for_alert_state("firing", 180)
            print("firing_rule", json.dumps(firing_rule, default=str, indent=2), flush=True)
            print("firing_alerts", json.dumps(firing_alerts, default=str, indent=2), flush=True)
            if firing_rule.get("state") != "firing":
                return 4
            return 0


if __name__ == "__main__":
    sys.exit(main())
PY

Run it:

CASE_SQL_PORT="$SQL_PORT" CASE_STATUS_PORT="$STATUS_PORT" CASE_PROM_PORT="$PROM_PORT" \
python "$WORK_DIR/check_alert_timeout_e2e.py"

Verified result:

initial rule: {
"state": "inactive",
"name": "TiDB_auto_analyze_failed",
"query": "increase(tidb_statistics_auto_analyze_total{type=\"failed\"}[10m]) > 0",
"duration": 60,
"keepFiringFor": 0,
"labels": {
"env": "ENV_LABELS_ENV",
"expr": "increase( tidb_statistics_auto_analyze_total{type=\"failed\"}[10m] ) > 0",
"level": "warning"
},
"annotations": {
"description": "cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, values:{{ $value }}",
"summary": "TiDB auto analyze failed",
"value": "{{ $value }}"
},
"alerts": [],
"health": "ok",
"evaluationTime": 0.000341417,
"lastEvaluation": "2026-04-15T14:51:41.250725+02:00",
"type": "alerting"
}
loading baseline rows 5000000
round 1 delta rows 1000000
rows [
{
"Table_schema": "review67733_timeout",
"Table_name": "t_auto",
"Partition_name": "",
"Job_info": "auto analyze table all indexes, all columns with 256 buckets, 100 topn, 0.018333333333333333 samplerate",
"Processed_rows": 655661,
"Start_time": "2026-04-15 14:58:29",
"End_time": "2026-04-15 14:58:30",
"State": "failed",
"Fail_reason": "[executor:1317]Query execution was interrupted",
"Instance": "127.0.0.1:42000",
"Process_ID": null,
"Remaining_seconds": null,
"Progress": null,
"Estimated_total_rows": null
},
{
"Table_schema": "review67733_timeout",
"Table_name": "t_auto",
"Partition_name": "",
"Job_info": "analyze table all indexes, all columns with 256 buckets, 100 topn, 0.03308270676691729 samplerate",
"Processed_rows": 5000000,
"Start_time": "2026-04-15 14:57:20",
"End_time": "2026-04-15 14:57:22",
"State": "finished",
"Fail_reason": null,
"Instance": "127.0.0.1:42000",
"Process_ID": null,
"Remaining_seconds": null,
"Progress": null,
"Estimated_total_rows": null
}
]
round 2 delta rows 1000000
rows [
{
"Table_schema": "review67733_timeout",
"Table_name": "t_auto",
"Partition_name": "",
"Job_info": "auto analyze table all indexes, all columns with 256 buckets, 100 topn, 0.015714285714285715 samplerate",
"Processed_rows": 655661,
"Start_time": "2026-04-15 14:59:38",
"End_time": "2026-04-15 14:59:39",
"State": "failed",
"Fail_reason": "[executor:1317]Query execution was interrupted",
"Instance": "127.0.0.1:42000",
"Process_ID": null,
"Remaining_seconds": null,
"Progress": null,
"Estimated_total_rows": null
},
{
"Table_schema": "review67733_timeout",
"Table_name": "t_auto",
"Partition_name": "",
"Job_info": "auto analyze table all indexes, all columns with 256 buckets, 100 topn, 0.018333333333333333 samplerate",
"Processed_rows": 655661,
"Start_time": "2026-04-15 14:58:29",
"End_time": "2026-04-15 14:58:30",
"State": "failed",
"Fail_reason": "[executor:1317]Query execution was interrupted",
"Instance": "127.0.0.1:42000",
"Process_ID": null,
"Remaining_seconds": null,
"Progress": null,
"Estimated_total_rows": null
},
{
"Table_schema": "review67733_timeout",
"Table_name": "t_auto",
"Partition_name": "",
"Job_info": "analyze table all indexes, all columns with 256 buckets, 100 topn, 0.03308270676691729 samplerate",
"Processed_rows": 5000000,
"Start_time": "2026-04-15 14:57:20",
"End_time": "2026-04-15 14:57:22",
"State": "finished",
"Fail_reason": null,
"Instance": "127.0.0.1:42000",
"Process_ID": null,
"Remaining_seconds": null,
"Progress": null,
"Estimated_total_rows": null
}
]
failed metric 0.0 2.0
prom increase [
{
"metric": {
"instance": "127.0.0.1:48080",
"job": "tidb-review-67733-default",
"type": "failed"
},
"value": [
1776257979.41,
"0"
]
}
]
pending_rule {
"state": "pending",
"name": "TiDB_auto_analyze_failed",
"query": "increase(tidb_statistics_auto_analyze_total{type=\"failed\"}[10m]) > 0",
"duration": 60,
"keepFiringFor": 0,
"labels": {
"env": "ENV_LABELS_ENV",
"expr": "increase( tidb_statistics_auto_analyze_total{type=\"failed\"}[10m] ) > 0",
"level": "warning"
},
"annotations": {
"description": "cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, values:{{ $value }}",
"summary": "TiDB auto analyze failed",
"value": "{{ $value }}"
},
"alerts": [
{
"labels": {
"alertname": "TiDB_auto_analyze_failed",
"env": "ENV_LABELS_ENV",
"expr": "increase( tidb_statistics_auto_analyze_total{type=\"failed\"}[10m] ) > 0",
"instance": "127.0.0.1:48080",
"job": "tidb-review-67733-default",
"level": "warning",
"type": "failed"
},
"annotations": {
"description": "cluster: ENV_LABELS_ENV, instance: 127.0.0.1:48080, values:1.2420914387808162",
"summary": "TiDB auto analyze failed",
"value": "1.2420914387808162"
},
"state": "pending",
"activeAt": "2026-04-15T12:59:56.224973056Z",
"value": "1.2420914387808162e+00"
}
],
"health": "ok",
"evaluationTime": 0.002013916,
"lastEvaluation": "2026-04-15T15:00:11.21733+02:00",
"type": "alerting"
}
pending_alerts [
{
"labels": {
"alertname": "TiDB_auto_analyze_failed",
"env": "ENV_LABELS_ENV",
"expr": "increase( tidb_statistics_auto_analyze_total{type=\"failed\"}[10m] ) > 0",
"instance": "127.0.0.1:48080",
"job": "tidb-review-67733-default",
"level": "warning",
"type": "failed"
},
"annotations": {
"description": "cluster: ENV_LABELS_ENV, instance: 127.0.0.1:48080, values:1.2420914387808162",
"summary": "TiDB auto analyze failed",
"value": "1.2420914387808162"
},
"state": "pending",
"activeAt": "2026-04-15T12:59:56.224973056Z",
"value": "1.2420914387808162e+00"
}
]
firing_rule {
"state": "firing",
"name": "TiDB_auto_analyze_failed",
"query": "increase(tidb_statistics_auto_analyze_total{type=\"failed\"}[10m]) > 0",
"duration": 60,
"keepFiringFor": 0,
"labels": {
"env": "ENV_LABELS_ENV",
"expr": "increase( tidb_statistics_auto_analyze_total{type=\"failed\"}[10m] ) > 0",
"level": "warning"
},
"annotations": {
"description": "cluster: ENV_LABELS_ENV, instance: {{ $labels.instance }}, values:{{ $value }}",
"summary": "TiDB auto analyze failed",
"value": "{{ $value }}"
},
"alerts": [
{
"labels": {
"alertname": "TiDB_auto_analyze_failed",
"env": "ENV_LABELS_ENV",
"expr": "increase( tidb_statistics_auto_analyze_total{type=\"failed\"}[10m] ) > 0",
"instance": "127.0.0.1:48080",
"job": "tidb-review-67733-default",
"level": "warning",
"type": "failed"
},
"annotations": {
"description": "cluster: ENV_LABELS_ENV, instance: 127.0.0.1:48080, values:1.1344957115544",
"summary": "TiDB auto analyze failed",
"value": "1.1344957115544"
},
"state": "firing",
"activeAt": "2026-04-15T12:59:56.224973056Z",
"value": "1.1344957115544e+00"
}
],
"health": "ok",
"evaluationTime": 0.002007042,
"lastEvaluation": "2026-04-15T15:01:11.217456+02:00",
"type": "alerting"
}
firing_alerts [
{
"labels": {
"alertname": "TiDB_auto_analyze_failed",
"env": "ENV_LABELS_ENV",
"expr": "increase( tidb_statistics_auto_analyze_total{type=\"failed\"}[10m] ) > 0",
"instance": "127.0.0.1:48080",
"job": "tidb-review-67733-default",
"level": "warning",
"type": "failed"
},
"annotations": {
"description": "cluster: ENV_LABELS_ENV, instance: 127.0.0.1:48080, values:1.1344957115544",
"summary": "TiDB auto analyze failed",
"value": "1.1344957115544"
},
"state": "firing",
"activeAt": "2026-04-15T12:59:56.224973056Z",
"value": "1.1344957115544e+00"
}
]
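The inactive → pending → firing progression captured above is exactly what the rule's `for: 1m` clause buys: the expression must stay true across a full minute of evaluations before the alert fires. A toy model of that state machine (a simplification I'm assuming for illustration; real Prometheus tracks wall-clock time since `activeAt`, evaluating every `evaluation_interval`, 15s here):

```python
def alert_states(expr_true_per_eval, for_evals):
    """Toy model of Prometheus alert states for a rule with a `for` clause.

    expr_true_per_eval: whether the alert expression held at each evaluation.
    for_evals: consecutive true evaluations needed before firing
               (for: 1m at a 15s evaluation_interval -> 4 evaluations).
    """
    states = []
    true_run = 0
    for ok in expr_true_per_eval:
        true_run = true_run + 1 if ok else 0
        if true_run == 0:
            states.append("inactive")
        elif true_run <= for_evals:
            states.append("pending")
        else:
            states.append("firing")
    return states

# increase(...[10m]) stays > 0 for ~10 minutes after a failure, so after
# one minute of pending evaluations the alert transitions to firing.
print(alert_states([False, True, True, True, True, True], for_evals=4))
# ['inactive', 'pending', 'pending', 'pending', 'pending', 'firing']
```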
0xPoe left a comment:

🔢 Self-check (PR reviewed by myself and ready for feedback)

- Code compiles successfully
- Tested locally
- No AI-generated elegant nonsense in PR.
- Comments added where necessary
- PR title and description updated
- Documentation PR created (I will update it later)
- PR size is reasonable
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: time-and-fate, XuHuaiyu. The full list of commands accepted by this bot can be found here; the pull request process is described here.

[LGTM Timeline notifier] Timeline:
What problem does this PR solve?
Issue Number: ref #63934
Problem Summary:
TiDB does not have an alert rule for failed auto-analyze tasks.
What changed and how does it work?
Add a Prometheus alert rule that fires when
tidb_statistics_auto_analyze_total{type="failed"} increases in the last 10 minutes.

Check List
Tests
Side effects
Documentation
Release note
Please refer to Release Notes Language Style Guide to write a quality release note.