Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added 'Mimir / Overview resources' dashboard #3481

Merged
merged 3 commits into from
Nov 22, 2022

Conversation

pracucci
Copy link
Collaborator

@pracucci pracucci commented Nov 21, 2022

What this PR does

In this PR I'm proposing a new dashboard called "Mimir / Overview resources" which shows an overview over a Mimir cluster resources utilization (CPU, memory, disk). The Mimir components are grouped into "Writes", "Reads" and "Backend" and the dashboard both work with microservices and read-write deployment mode.

Note to reviewers:

  • I left few GitHub comments to explain some of the diff / changes
  • If you think reviewing this PR is too difficult, I can open a preliminary PR with the utilities refactoring

Example of microservices deployment:

Screenshot 2022-11-21 at 17 11 23

Example of read-write deployment:

Screenshot 2022-11-21 at 17 11 12

Which issue(s) this PR fixes or relates to

Part of #3361

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@@ -526,7 +526,7 @@
"steppedLine": false,
"targets": [
{
"expr": "sum by(instance, pod, device) (\n rate(\n node_disk_written_bytes_total[$__rate_interval]\n )\n)\n+\nignoring(pod) group_right() (\n label_replace(\n count by(\n instance,\n pod,\n device\n )\n (\n container_fs_writes_bytes_total{\n cluster=~\"$cluster\", namespace=~\"$namespace\",\n container=\"alertmanager\",\n device!~\".*sda.*\"\n }\n ),\n \"device\",\n \"$1\",\n \"device\",\n \"/dev/(.*)\"\n ) * 0\n)\n\n",
"expr": "sum by(instance, pod, device) (\n rate(\n node_disk_written_bytes_total[$__rate_interval]\n )\n)\n+\nignoring(pod) group_right() (\n label_replace(\n count by(\n instance,\n pod,\n device\n )\n (\n container_fs_writes_bytes_total{\n cluster=~\"$cluster\", namespace=~\"$namespace\",\n container=~\"alertmanager\",\n device!~\".*sda.*\"\n }\n ),\n \"device\",\n \"$1\",\n \"device\",\n \"/dev/(.*)\"\n ) * 0\n)\n\n",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: few other dashboards have this change to disk I/O query because I've changed the container="..." matcher to a regex matcher container=~"..."

@@ -690,7 +690,7 @@
"steppedLine": false,
"targets": [
{
"expr": "max by(persistentvolumeclaim) (\n kubelet_volume_stats_used_bytes{cluster=~\"$cluster\", namespace=~\"$namespace\"} /\n kubelet_volume_stats_capacity_bytes{cluster=~\"$cluster\", namespace=~\"$namespace\"}\n)\nand\ncount by(persistentvolumeclaim) (\n kube_persistentvolumeclaim_labels{\n cluster=~\"$cluster\", namespace=~\"$namespace\",\n label_name=\"alertmanager\"\n }\n)\n",
"expr": "max by(persistentvolumeclaim) (\n kubelet_volume_stats_used_bytes{cluster=~\"$cluster\", namespace=~\"$namespace\"} /\n kubelet_volume_stats_capacity_bytes{cluster=~\"$cluster\", namespace=~\"$namespace\"}\n)\nand\ncount by(persistentvolumeclaim) (\n kube_persistentvolumeclaim_labels{\n cluster=~\"$cluster\", namespace=~\"$namespace\",\n label_name=~\"(alertmanager).*\"\n }\n)\n",
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: to simplify the multi-zone matching, I've applied the .* suffix to all label_name matchers.


resourcesPanelQueries(metric, instanceName)::
resourceUtilizationQuery(metric, instanceName)::
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: introduced a new utility function to generate the resource utilization query (without request/limit queries).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to not have requests/limits? I think we found them helpful when they were added on the other resource dashboards

I imagine because those will be different when this dashboard is used on a microservices deployment. In that case we can repeat the panel for every different container we have.

Alternatively, we can display the memory utilization as a % of the limit instead of an absolute value. I think I prefer this option.

Copy link
Collaborator Author

@pracucci pracucci Nov 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine because those will be different when this dashboard is used on a microservices deployment.

For this reason.

In that case we can repeat the panel for every different container we have.

Well, that goes against the idea of having an "Overview". If you want to go deep, you open the "Writes resources" or "Reads resources" dashboard.

$._config.resources_panel_queries[$._config.deployment_type]['%s_usage' % metric] % {
instance: $._config.per_instance_label,
namespace: $.namespaceMatcher(),
instanceName: instanceName,
},

resourceUtilizationAndLimitQueries(metric, instanceName)::
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note to reviewers: I renamed resourcesPanelQueries() to resourceUtilizationAndLimitQueries() to clarify it also include resources request/limit.

@pracucci pracucci force-pushed the add-overview-resources-dashboard branch from ad54cdf to ca43ec6 Compare November 21, 2022 16:56
@pracucci pracucci marked this pull request as ready for review November 21, 2022 16:57
@pracucci pracucci requested review from a team as code owners November 21, 2022 16:57
Copy link
Contributor

@johannaratliff johannaratliff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM


resourcesPanelQueries(metric, instanceName)::
resourceUtilizationQuery(metric, instanceName)::
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose to not have requests/limits? I think we found them helpful when they were added on the other resource dashboards

I imagine because those will be different when this dashboard is used on a microservices deployment. In that case we can repeat the panel for every different container we have.

Alternatively, we can display the memory utilization as a % of the limit instead of an absolute value. I think I prefer this option.

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Signed-off-by: Marco Pracucci <marco@pracucci.com>
@pracucci pracucci force-pushed the add-overview-resources-dashboard branch from ca43ec6 to f8c95c2 Compare November 22, 2022 08:45
…oards

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Copy link
Contributor

@dimitarvdimitrov dimitarvdimitrov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks for addressing my feedback

@pracucci pracucci merged commit 4e7a423 into main Nov 22, 2022
@pracucci pracucci deleted the add-overview-resources-dashboard branch November 22, 2022 09:18
@@ -163,7 +163,7 @@
"height": "250px",
"panels": [
{
"content": "These panels show an overview on the write path. \nVisit the following specific dashboards to drill down into the write path:\n\n- <a target=\"_blank\" href=\"./d/8280707b8f16e7b87b840fc1cc92d4c5/mimir-writes?${__url_time_range}&${__all_variables}\">Writes</a>\n- <a target=\"_blank\" href=\"./d/bc9160e50b52e89e0e49c840fea3d379/mimir-writes-resources?${__url_time_range}&${__all_variables}\">Writes resources</a>\n- <a target=\"_blank\" href=\"./d/978c1cb452585c96697a238eaac7fe2d/mimir-writes-networking?${__url_time_range}&${__all_variables}\">Writes networking</a>\n",
"content": "These panels show an overview on the write path. \nVisit the following specific dashboards to drill down into the write path:\n\n- <a target=\"_blank\" href=\"./d/8280707b8f16e7b87b840fc1cc92d4c5/mimir-writes?${__url_time_range}&${__all_variables}\">Writes</a>\n- <a target=\"_blank\" href=\"./d/bc9160e50b52e89e0e49c840fea3d379/mimir-writes-resources?${__url_time_range}&${__all_variables}\">Writes resources</a>\n- <a target=\"_blank\" href=\"./d/978c1cb452585c96697a238eaac7fe2d/mimir-writes-networking?${__url_time_range}&${__all_variables}\">Writes networking</a>\n- <a target=\"_blank\" href=\"./d/a9b92d3c4d1af325d872a9e9a7083d71/mimir-overview-resources?${__url_time_range}&${__all_variables}\">Overview resources</a>\n",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is problematic because "writes" is a verb and a noun in this context. Is it "Writes’ resources"? or is it "[Something] writes resources"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My impression is that this is multiple writes’ resources rather than a single write’s resources.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please simplify the multiple prepositions (down into) in "Visit the following specific dashboards to drill down into the write path:" Suggestion: To examine the write path in detail, see a specific dashboard:

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See: #3488

@osg-grafana osg-grafana added the type/docs Improvements or additions to documentation label Nov 22, 2022
masonmei pushed a commit to udmire/mimir that referenced this pull request Dec 16, 2022
* Added 'Mimir / Overview resources' dashboard

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Simplified resourceUtilizationQuery()

Signed-off-by: Marco Pracucci <marco@pracucci.com>

* Do not allow to select multiple namespaces like other resources dashboards

Signed-off-by: Marco Pracucci <marco@pracucci.com>

Signed-off-by: Marco Pracucci <marco@pracucci.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/docs Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants