Skip to content

server/schedule: refine metrics about store limit#2404

Merged
sre-bot merged 15 commits intotikv:masterfrom
wangrzneu:store_limit_metric
Jun 8, 2020
Merged

server/schedule: refine metrics about store limit#2404
sre-bot merged 15 commits intotikv:masterfrom
wangrzneu:store_limit_metric

Conversation

@wangrzneu
Copy link
Copy Markdown
Contributor

@wangrzneu wangrzneu commented May 8, 2020

What problem does this PR solve?

refine metrics about store limit: add storeLimitCostCounter and storeLimitAvailableGauge to monitor the status of store limiter.

What is changed and how it works?

refine metrics about store limit

  1. add storeLimitCostCounter to monitor the step cost
  2. add storeLimitAvailableGauge to monitor the available status of storeLimit, and it is collected in a background job.
  3. add storeLimitRateGauge to show the store limit setting in pd.

Check List

Tests

  • Manual test (add detailed scripts or steps below)
  1. run a tidb cluste with 4 tikv instances
  2. prepare some data using sysbench like : sysbench --mysql-host=127.0.0.1 --mysql-port=4000 --mysql-user=root --table-size=1000000 oltp_insert prepare
  3. run pd-ctl -u 0.0.0.0:32774 store limit all 1 region-add and pd-ctl -u 0.0.0.0:32774 store limit all 1 region-remove to set the store limit
  4. run pd-ctl -u 0.0.0.0:32774 store delete 3057 to remove one store
  5. run curl 0.0.0.0:32774/metrics | grep store_limit to get the metrics. The metrics should be like:
# HELP pd_schedule_store_limit_available available limit rate of store.
# TYPE pd_schedule_store_limit_available gauge
pd_schedule_store_limit_available{limit_type="region-add",store="1"} 1
pd_schedule_store_limit_available{limit_type="region-add",store="3057"} 1
pd_schedule_store_limit_available{limit_type="region-add",store="4"} 1
pd_schedule_store_limit_available{limit_type="region-add",store="5"} 1
pd_schedule_store_limit_available{limit_type="region-remove",store="1"} 1
pd_schedule_store_limit_available{limit_type="region-remove",store="3057"} 100000
pd_schedule_store_limit_available{limit_type="region-remove",store="4"} 1
pd_schedule_store_limit_available{limit_type="region-remove",store="5"} 1
# HELP pd_schedule_store_limit_cost limit rate cost of store.
# TYPE pd_schedule_store_limit_cost counter
pd_schedule_store_limit_cost{limit_type="region-add",store="1"} 1
pd_schedule_store_limit_cost{limit_type="region-add",store="4"} 2
pd_schedule_store_limit_cost{limit_type="region-remove",store="3057"} 3
# HELP pd_schedule_store_limit_rate the limit rate of store.
# TYPE pd_schedule_store_limit_rate gauge
pd_schedule_store_limit_rate{limit_type="region-add",store="1"} 1.000000016666667
pd_schedule_store_limit_rate{limit_type="region-add",store="3057"} 6e+06
pd_schedule_store_limit_rate{limit_type="region-add",store="4"} 1.000000016666667
pd_schedule_store_limit_rate{limit_type="region-add",store="5"} 1.000000016666667
pd_schedule_store_limit_rate{limit_type="region-remove",store="1"} 1.000000016666667
pd_schedule_store_limit_rate{limit_type="region-remove",store="3057"} 6e+06
pd_schedule_store_limit_rate{limit_type="region-remove",store="4"} 1.000000016666667
pd_schedule_store_limit_rate{limit_type="region-remove",store="5"} 1.000000016666667

Side effects

  • Breaking backward compatibility
    The metrics pd_schedule_store_limit is removed.

@sre-bot sre-bot added the contribution This PR is from a community contributor. label May 8, 2020
@rleungx rleungx requested review from lhy1024, rleungx and shafreeck May 8, 2020 08:02
@rleungx
Copy link
Copy Markdown
Member

rleungx commented May 11, 2020

I am not sure the metrics of cost or available token is useful (due to the time interval of metrics collection?), but we can record the capacity of the store limit to avoid using pd-ctl to find it. What do you think?

@codecov-io
Copy link
Copy Markdown

Codecov Report

Merging #2404 into master will decrease coverage by 0.04%.
The diff coverage is 32.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2404      +/-   ##
==========================================
- Coverage   76.32%   76.28%   -0.05%     
==========================================
  Files         206      206              
  Lines       22018    22035      +17     
==========================================
+ Hits        16805    16809       +4     
- Misses       3936     3941       +5     
- Partials     1277     1285       +8     
Impacted Files Coverage Δ
server/cluster/cluster.go 80.23% <0.00%> (-0.53%) ⬇️
server/schedule/operator_controller.go 78.85% <27.27%> (-1.99%) ⬇️
server/schedule/metrics.go 100.00% <100.00%> (ø)
server/region_syncer/client.go 80.15% <0.00%> (-6.11%) ⬇️
server/schedulers/random_merge.go 61.19% <0.00%> (-2.99%) ⬇️
server/cluster/coordinator.go 74.52% <0.00%> (-2.20%) ⬇️
server/tso/tso.go 81.02% <0.00%> (-2.19%) ⬇️
server/member/leader.go 74.31% <0.00%> (-1.17%) ⬇️
server/core/storage.go 74.38% <0.00%> (-0.83%) ⬇️
server/server.go 77.53% <0.00%> (-0.47%) ⬇️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update feb738d...bd0a047. Read the comment docs.

@wangrzneu
Copy link
Copy Markdown
Contributor Author

I am not sure the metrics of cost or available token is useful (due to the time interval of metrics collection?), but we can record the capacity of the store limit to avoid using pd-ctl to find it. What do you think?

I add a metric pd_schedule_store_limit_rate to show the store limit setting of pd.

}

// CollectStoreLimitMetrics collects the metrics about store limit
func (oc *OperatorController) CollectStoreLimitMetrics() {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the new added function called?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is the new added function called?

It is called in server/cluster/cluster.go:115.

@nolouch
Copy link
Copy Markdown
Contributor

nolouch commented May 15, 2020

PTAL @rleungx

Copy link
Copy Markdown
Contributor

@shafreeck shafreeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Copy Markdown
Contributor

@nolouch nolouch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @wangrzneu. rest LGTM.

Subsystem: "schedule",
Name: "store_limit",
Help: "Limit of store.",
}, []string{"store", "type", "limit_type"})
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics will not be compatible with the old version of grafana template, why can't it keep the original way?

Copy link
Copy Markdown
Contributor Author

@wangrzneu wangrzneu May 25, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics will not be compatible with the old version of grafana template, why can't it keep the original way?

The old way has two type, "take" and "available". I use a new metrics storeLimitAvailableGauge to record the "available" type storeLimitGauge, but use a different metrics storeLimitCostCounter (it is a counter, not gauge like before) to record the original "take" type storeLimitGauge metrics.

}
storeIDStr := strconv.FormatUint(storeID, 10)
storeLimitAvailableGauge.WithLabelValues(storeIDStr, n).Set(float64(storeLimit.Available()) / float64(storelimit.RegionInfluence[v]))
storeLimitRateGauge.WithLabelValues(storeIDStr, n).Set(storeLimit.Rate() * StoreBalanceBaseTime)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Users may more likely want to know the store how many operators can be produced in one minute, for example, default is 15 opm.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics storeLimitRateGauge can show the operator producing rate.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got.

@wangrzneu wangrzneu force-pushed the store_limit_metric branch from 7005cd3 to b2dfaa0 Compare May 25, 2020 05:07
@rleungx
Copy link
Copy Markdown
Member

rleungx commented Jun 1, 2020

/merge

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 1, 2020

Your auto merge job has been accepted, waiting for:

  • 2467

@sre-bot sre-bot added the status/can-merge Indicates a PR has been approved by a committer. label Jun 1, 2020
@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 1, 2020

/run-all-tests

@nolouch
Copy link
Copy Markdown
Contributor

nolouch commented Jun 1, 2020

/merge

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 1, 2020

/run-all-tests

@codecov-commenter
Copy link
Copy Markdown

Codecov Report

Merging #2404 into master will increase coverage by 0.05%.
The diff coverage is 60.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2404      +/-   ##
==========================================
+ Coverage   77.01%   77.07%   +0.05%     
==========================================
  Files         205      205              
  Lines       21948    22016      +68     
==========================================
+ Hits        16903    16968      +65     
- Misses       3755     3768      +13     
+ Partials     1290     1280      -10     
Impacted Files Coverage Δ
server/cluster/coordinator.go 71.95% <7.14%> (-2.57%) ⬇️
server/schedule/operator_controller.go 78.50% <33.33%> (-2.51%) ⬇️
server/cluster/cluster.go 80.06% <66.66%> (-0.23%) ⬇️
client/base_client.go 90.60% <100.00%> (+3.70%) ⬆️
client/client.go 70.60% <100.00%> (+1.00%) ⬆️
pkg/cache/ttl.go 90.66% <100.00%> (+4.72%) ⬆️
server/cluster/cluster_worker.go 72.95% <100.00%> (+0.22%) ⬆️
server/grpc_service.go 59.96% <100.00%> (+1.98%) ⬆️
server/schedule/metrics.go 100.00% <100.00%> (ø)
pkg/mock/mockhbstream/mockhbstream.go 89.23% <0.00%> (-1.54%) ⬇️
... and 8 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 06fbee6...4965bf3. Read the comment docs.

@lhy1024
Copy link
Copy Markdown
Contributor

lhy1024 commented Jun 1, 2020

/merge

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 1, 2020

/run-all-tests

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 1, 2020

@wangrzneu merge failed.

@wangrzneu
Copy link
Copy Markdown
Contributor Author

/merge

@wangrzneu
Copy link
Copy Markdown
Contributor Author

/run-all-tests

@nolouch
Copy link
Copy Markdown
Contributor

nolouch commented Jun 3, 2020

/merge

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 3, 2020

/run-all-tests

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 3, 2020

@wangrzneu merge failed.

@wangrzneu
Copy link
Copy Markdown
Contributor Author

/run-all-tests

@lhy1024
Copy link
Copy Markdown
Contributor

lhy1024 commented Jun 3, 2020

/merge

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 3, 2020

/run-all-tests

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 3, 2020

@wangrzneu merge failed.

@nolouch
Copy link
Copy Markdown
Contributor

nolouch commented Jun 8, 2020

/merge

@sre-bot
Copy link
Copy Markdown
Contributor

sre-bot commented Jun 8, 2020

/run-all-tests

@sre-bot sre-bot merged commit 0d2eb9c into tikv:master Jun 8, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

contribution This PR is from a community contributor. status/can-merge Indicates a PR has been approved by a committer.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants