CPU and Memory usage monitoring feature #1010

PiotrDabkowski · 2016-07-11T12:55:23Z

I am working on a new CPU and memory monitoring feature for our dashboard. The aim is to add graphs of CPU and memory usage to the details page. An example can be seen in the design specification.

My idea is to show not only resource usage versus time but also add extra annotations to the graph, showing events of interest and max/min resource consumption.

I would really appreciate some discussion here as I am not exactly sure what the graph should look like and how the annotations should be displayed so that the feature is both easy to use and neat. The goal of this discussion is to create more detailed mock designs.

cc @romlein @Lukenickerson @floreks @bryk @pwittrock

PiotrDabkowski · 2016-07-22T09:30:25Z

I added the first version of the graph to the poddetail page - see screenshot below. I will now add a similar graph to the podlist page + arrows indicating events of interest.

All comments and suggestions are welcome.

maciaszczykm · 2016-07-22T11:26:07Z

It looks nice!

I'm wondering if it wouldn't look a bit cleaner to have these two plots in separate graphs. It's not so easy to read at the moment. Perhaps updating the legend could help a bit.

Also some kind of title (separation from other details) would be nice. I guess it's important, so we shouldn't move it to another tab, but I think it could be improved a bit.

PiotrDabkowski · 2016-07-22T12:55:12Z

@maciaszczykm thank you for your comments!

It is possible to choose one graph by clicking on a graph name. We thought that having an option to see 2 graphs at the same time would be a nice feature as this allows the user to see the relation between memory/cpu usage.

Regarding title - I am still experimenting with it and I will post new screenshots soon.

cheld · 2016-07-22T12:55:55Z

The correlation of metrics and events is very nice. Most tools show events as vertical lines like this: https://codeascraft.com/2010/12/08/track-every-release/ and show additional information on hover. But this is not touch friendly.
Another example of metrics + events http://blog.librato.com/posts/cronotations

danielromlein · 2016-07-22T13:05:55Z

I like where this is headed @PiotrDabkowski!
I dropped the screenshot into an InVision Prototype so you can more easily see what I'm commenting on. @PiotrDabkowski I've added you as a collaborator, so if it interests you, you should be able to add screens / update this screen as you make any changes.

I'm inclined to agree with @maciaszczykm that trying to combine the two graphs could potentially be confusing. As a user it's not immediately apparent that I can choose one graph by clicking on its name. Datadog uses a line that extends between graphs to help establish a relationship between them; I think something like that could be effective.

PiotrDabkowski · 2016-07-22T13:55:37Z

@cheld Thank you for the suggestion, that looks nice. I thought about something like this http://www.nytimes.com/interactive/2013/03/29/sports/baseball/Strikeouts-Are-Still-Soaring.html?ref=baseball&_r=1& but that may not be very readable if the number of events is large.

@romlein Thank you, I will add my future screens there. Regarding the 2-graph solution you suggested: that may work and it's not hard to change, but the graph is already taking up 25% of the screen and I am afraid that adding another graph will make the page cluttered and less readable.

cheld · 2016-07-22T14:18:50Z

Ah ok. From a user's point of view it might be nice to have some kind of selection for the type of event. E.g. 'only show deployment events'. However, off the top of my head I am not sure if this is possible.

cheld · 2016-07-22T14:20:41Z

BTW: this https://square.github.io/cubism/ is a somewhat crazy, condensed view. It actually makes sense when you see it in real action....

maciaszczykm · 2016-07-25T07:03:23Z

@cheld's example looks very good!

floreks · 2016-07-25T07:53:44Z

@PiotrDabkowski looks nice! @cheld's example also looks good.

What do you think about doing something in the middle? Remove the left/right scales. Move the units to the legend and normalize the data so we can display cpu/ram on one graph using a common scale.

cc @maciaszczykm @bryk

PiotrDabkowski · 2016-07-25T10:33:44Z

@floreks I can do everything, changing the graph is very easy with this nvd3 library :) I am only afraid that displaying the data on a common scale will make the graph harder to understand. What do you think about this idea @bryk ?

floreks · 2016-07-25T10:38:23Z

Yeah, I can see how this may cause some trouble. CPU most commonly uses the milli prefix and RAM the mega prefix, so we have 10^-3 vs 10^6.

Then maybe, similar to cubism, show one graph under the other without a Y-axis scale and show the data on hover.
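To make the common-scale idea concrete, here is a minimal sketch of min-max normalization applied to two series with very different magnitudes. The sample values and the `normalize` helper are hypothetical and are not taken from the dashboard code; this only illustrates the trade-off discussed above.

```python
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Rescale values linearly to the range [0, 1] (min-max normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A flat series has no spread; map every point to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical samples: CPU in millicores, memory in bytes.
cpu_millicores = [100.0, 250.0, 400.0]
memory_bytes = [200e6, 350e6, 500e6]

cpu_norm = normalize(cpu_millicores)
mem_norm = normalize(memory_bytes)

print(cpu_norm)  # [0.0, 0.5, 1.0]
print(mem_norm)  # [0.0, 0.5, 1.0]
```

Both series end up on the same axis, but the absolute values are gone after rescaling, which is why the units and real values would have to move to the legend or a hover tooltip.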

Lukenickerson · 2016-07-27T20:24:00Z

Taking a step back, I'd like to better understand why we're showing the user this information -- what is the problem that the graph is solving? Is it to identify memory leaks, or to identify how some events cause changes in resource usage? (I like Christoph's idea of tying the resource graph to events.)

PiotrDabkowski · 2016-07-28T10:57:26Z

@Lukenickerson The purpose of the graph is to visualize historical CPU and Memory usage of the given resource. It's hard to do that without a graph :) And just like you said the purpose of that is to identify potential problems like memory leaks, but it also helps to analyse patterns in resource consumption. Overlaying graphs with events was the idea from the start :) and again, as you said it helps to determine impact of certain events.

Currently, Heapster provides only 15 minutes of historical resource consumption so the graph is not very useful, but longer time periods will be available in the future and that will make graphs much more helpful.

bryk · 2016-07-28T20:33:51Z

> Taking a step back, I'd like to better understand why we're showing the user this information -- what is the problem that the graph is solving?

This is to monitor and troubleshoot your cluster and applications. I.e., to understand why your application is crashing (you can see that mem usage is above limits) or why it is serving too few requests per second (e.g. CPU usage went high after a release).

Usually this would be: I open the page and see from the graph that everything is fine. I can also notice a spike/drop in some metrics or see that a particular event caused something. All the graphs we're going to show are K8s context aware and this is the reason why they can be very powerful and can differentiate us from all other generic tools that just show numbers.

Does this answer your question?

Lukenickerson · 2016-07-28T22:51:37Z

Thanks @bryk , those are good examples. In both of them it would be somewhat difficult for the user to understand the issue unless they knew the time of the events, and if the events happened within the last 15 minutes.

I think with some additional information the graphs could become even more useful:

Why is my app crashing? To help solve this case it would be nice to connect the graph of mem/cpu with crash events (if logged), or with certain thresholds.
Why is my app slow? To help the user solve this problem the graph would need to show cpu along with events, such as new releases, so the user can correlate them.
In both cases it might be helpful to somehow only show the time around certain events. (Maybe condense the time x-axis during periods of time when nothing out of the ordinary happens.)
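One way to read the "only show the time around certain events" idea is as a filter over the samples. The sketch below is a rough illustration under assumptions: the `(timestamp, value)` sample shape, the `samples_near_events` name, and the window size are all hypothetical and do not come from Heapster or the dashboard.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, metric value)


def samples_near_events(
    samples: List[Sample], event_times: List[float], window_s: float
) -> List[Sample]:
    """Keep only samples within window_s seconds of at least one event."""
    return [
        (t, v)
        for (t, v) in samples
        if any(abs(t - e) <= window_s for e in event_times)
    ]


# Hypothetical data: one sample per minute for 10 minutes, two events.
samples = [(float(t), 0.0) for t in range(0, 601, 60)]
kept = samples_near_events(samples, event_times=[120.0, 480.0], window_s=60.0)
print([t for t, _ in kept])  # [60.0, 120.0, 180.0, 420.0, 480.0, 540.0]
```

The quiet stretch between the two events is dropped, so a graph drawn from `kept` would need a visible break on the x-axis to avoid implying continuous time.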

bryk · 2016-07-29T00:22:47Z

Yes, I agree here. The 15 minutes is a temporary limitation that we'll overcome in the future. Once we have more data the use cases you've mentioned can be satisfied (or satisfied better). All we need to make sure is to design for them, assuming we'll get more data in the future.

Lukenickerson · 2016-07-29T14:10:57Z

@bryk : Do we currently have a way to track events in the system? Would things like an app crash or an app at/over a certain resource threshold be part of the logs, and could they be easily identified?

bryk · 2016-07-29T20:51:21Z

> Do we currently have a way to track events in the system?

We can get all events in the system or all events that are associated with a thing. So when we show, e.g., a replica set, we can show events related to it or its pods. All events, the positive and negative ones.

PiotrDabkowski · 2016-09-06T11:28:11Z

Graphs displaying CPU and memory usage history have been added to all list and detail pages. See example graph for replica set detail:

@Lukenickerson @romlein @bryk Do you have any comments on how the graph could be improved?

Finally, I am still unsure about graph titles - on detail pages it is "Resource usage history" and on list pages, where it shows cumulative resource usage of all resources, the title is "Cumulative resource usage history". Both titles are quite long and sound too complicated. Do you have any ideas? Maybe we should display a short title and add a question mark icon next to it that would provide more explanation to the user upon click or hover?

danielromlein · 2016-09-06T19:10:09Z

Thanks for posting @PiotrDabkowski!

As per earlier comments about the combined graph being confusing, I think a more effective solution may be separating out the CPU and memory data into two separate visualizations; perhaps compressed vertically to save space.

It seems very hidden that clicking the graph title deselects it. We could just use a checkbox for these, but again, I think a better solution still might just be separating the graphs.

Time values along the bottom should be vertically aligned.

danielromlein · 2016-09-09T13:17:03Z

@PiotrDabkowski just wanted to follow up and see if you'd had any thoughts around this? 👆

PiotrDabkowski · 2016-09-12T10:11:39Z

@romlein thank you for your feedback! I already implemented your suggestions and I think you were right, it looks better now :)

What do you think?

floreks · 2016-09-12T10:58:17Z

Nice! I like it. It's more transparent now and easier to understand.

danielromlein · 2016-09-12T17:32:00Z

@PiotrDabkowski right on! Vastly improved.
Though I like the side by side arrangement because it leaves more space for viewing the pods, I can also see how it would be helpful for users to correlate events in the two graphs – which would be easier to spot if the two graphs were stacked vertically. See @bgrant0607's note on #8270:

Presentation guidelines: timeseries graphs: single-column, all same time-scale, for easy correlation.

Perhaps their vertical height could be reduced so as to not push the resource too far down the page?
Thoughts @bryk?

digitalfishpond · 2016-09-13T06:26:10Z

If ease of correlation is what we're going for, a possible solution to the problem of height might be a 'Combine views / Split views' toggle button? Just a thought.

bryk · 2016-09-13T09:31:30Z

> Perhaps their vertical height could be reduced so as to not push the resource too far down the page?
> Thoughts @bryk?

I'm also thinking about moving the graphs to the bottom of the page.

douxiaofeng99 · 2017-01-21T14:52:17Z

I use gcr.io/google_containers/kubernetes-dashboard-amd64:v1.5.0. I can view the dashboard, but there is no CPU and memory usage chart. Has it been released?

cheld · 2017-01-23T07:15:14Z

You have to deploy the heapster container as well. Heapster is (more or less) a required Kubernetes component.

Globegitter · 2017-08-24T13:15:08Z

What does CPU usage of e.g. 0.1 actually mean? Is that 10%? And for a pod would that then be 10% of what is allocated to that container? And 0.1 in the overview graph would then mean 10% of all the cpu available from all the nodes? Could not find any clarification on that.

floreks · 2017-08-24T13:21:11Z

CPU (Cores). For example, if a node has 4 cores then 0.1 means that all cores (together, not all of them at once) are roughly under 10% load. For more information you can check heapster and how exactly it scrapes metrics.

floreks · 2017-08-24T13:27:40Z

As stated in heapster documentation: https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md

cpu/usage | Cumulative CPU usage on all cores.

Globegitter · 2017-08-28T14:11:37Z

@floreks thanks for that description - is it also possible to see the total? So 1.4/8 for example? Or 6.71Gi/32Gi memory usage. IMO that makes these stats immediately more useful, and one does not have to think, oh this cluster has 6 cores, this one 16, etc.

floreks · 2017-08-28T18:52:10Z

There is no such information available at this time. Calculating the actual max value might be a bit tricky here because usually you will get the total number of cores/memory available on the node and not the actual limits assigned to the k8s apps pool. That is why there are so many metrics available in heapster. Just for checking CPU limits there are about 5 metrics:

Metric name | Description
cpu/limit | CPU hard limit in millicores.
cpu/node_capacity | CPU capacity of a node.
cpu/node_allocatable | CPU allocatable of a node.
cpu/node_reservation | Share of CPU that is reserved on the node allocatable.
cpu/node_utilization | CPU utilization as a share of node allocatable.

On the node details page we are showing allocated resources reported directly by kubelet. This can give a rough idea about available resources. I think it's more likely that we'll add more advanced graphs on an overview page and leave sparklines as is.

@maciaszczykm WDYT?
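As a rough illustration of the "1.4/8" style display requested above, the sketch below formats a usage value against a total. It assumes both numbers have already been obtained somewhere (for example from a usage metric and a capacity or allocatable metric); `format_usage` is a hypothetical helper, not an existing dashboard or Heapster function, and which metric should serve as the total is exactly the open question in this thread.

```python
def format_usage(used: float, total: float, unit: str) -> str:
    """Render usage against a total, e.g. '1.4/8 cores (17.5%)'."""
    if total <= 0:
        raise ValueError("total must be positive")
    share = 100.0 * used / total
    # ':g' drops trailing zeros, so 8.0 is shown as '8'.
    return f"{used:g}/{total:g} {unit} ({share:.1f}%)"


# Hypothetical values matching the example in the question above.
print(format_usage(1.4, 8, "cores"))  # 1.4/8 cores (17.5%)
```

The same helper would cover the memory case, as long as both values are converted to the same unit first.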

maciaszczykm added the priority/P1 and kind/feature labels Jul 12, 2016

Lukenickerson mentioned this issue Jul 29, 2016

Overview pages - UX and content #1068

Closed

rf232 closed this as completed Nov 18, 2016

avgKol mentioned this issue Jul 12, 2018

Cpu/usage - what does it mean 'cumulative'? kubernetes-retired/heapster#2062

Closed
