CPU and Memory usage monitoring feature #1010

Closed
PiotrDabkowski opened this issue Jul 11, 2016 · 34 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@PiotrDabkowski

I am working on a new CPU and memory monitoring feature for our dashboard. The aim is to add graphs of CPU and memory usage to the details page. An example can be seen in the design specification.

My idea is to show not only resource usage over time but also annotations on the graph that mark events of interest and the max/min resource consumption.

I would really appreciate some discussion here, as I am not sure what the graph should look like or how the annotations should be displayed so that the feature is both easy to use and neat. The goal of this discussion is to create more detailed mock designs.
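
A minimal sketch of how the max/min annotation points could be picked from a usage series. The `(timestamp, value)` sample shape, the function name, and the sample data are assumptions made for illustration and are not taken from the dashboard code.

```python
from typing import Dict, List, Tuple

Sample = Tuple[int, float]  # (unix timestamp in seconds, usage value)


def extreme_points(samples: List[Sample]) -> Dict[str, Sample]:
    """Return the samples with the lowest and the highest usage value."""
    if not samples:
        raise ValueError("samples must not be empty")
    lowest = min(samples, key=lambda s: s[1])
    highest = max(samples, key=lambda s: s[1])
    return {"min": lowest, "max": highest}


# Hypothetical memory usage in MiB, sampled once per minute.
memory = [(0, 210.0), (60, 215.5), (120, 480.2), (180, 230.1)]
annotations = extreme_points(memory)
```

Each returned sample carries its timestamp, so an annotation could be placed at the matching x position on the graph.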

cc @romlein @Lukenickerson @floreks @bryk @pwittrock

@maciaszczykm added the priority/P1 and kind/feature labels Jul 12, 2016
@PiotrDabkowski
Author

I added the first version of the graph to the pod detail page (see the screenshot below). Next I will add a similar graph to the pod list page, plus arrows indicating events of interest.

[Screenshot: graphv1]

All comments and suggestions are welcome

@maciaszczykm
Member

It looks nice!

I'm wondering whether it would look cleaner to put the two plots in separate graphs. It's not so easy to read at the moment. Perhaps updating the legend would help a bit.

Also some kind of title (separation from other details) would be nice. I guess it's important, so we shouldn't move it to another tab, but I think it could be improved a bit.

@PiotrDabkowski
Author

@maciaszczykm thank you for your comments!

It is possible to choose one graph by clicking on a graph name. We thought that having the option to see the two graphs at the same time would be a nice feature, as it lets the user see the relation between memory and CPU usage.

Regarding title - I am still experimenting with it and I will post new screenshots soon.

@cheld
Contributor

cheld commented Jul 22, 2016

The correlation of metrics and events is very nice. Most tools show events as vertical lines like this: https://codeascraft.com/2010/12/08/track-every-release/ and show additional information on hover. But this is not touch friendly.
Another example of metrics + events http://blog.librato.com/posts/cronotations

@danielromlein
Contributor

I like where this is headed @PiotrDabkowski!
I dropped the screenshot into an InVision Prototype so you can more easily see what I'm commenting on. @PiotrDabkowski I've added you as a collaborator, so if it interests you, you should be able to add screens / update this screen as you make any changes.

I'm inclined to agree with @maciaszczykm that trying to combine the two graphs could potentially be confusing. As a user it's not immediately apparent that I can choose one graph by clicking on its name. Datadog uses a line that extends between graphs to help establish relationship between them; I think something like that could be effective.

@PiotrDabkowski
Author

@cheld Thank you for the suggestion, that looks nice. I thought about something like this http://www.nytimes.com/interactive/2013/03/29/sports/baseball/Strikeouts-Are-Still-Soaring.html?ref=baseball&_r=1& but that may not be very readable if the number of events is large.

@romlein Thank you, I will add my future scans there. Regarding the two-graph solution you suggested: that may work and it's not hard to change, but the graph already takes up 25% of the screen, and I am afraid that adding another graph will make the page cluttered and less readable.

@cheld
Contributor

cheld commented Jul 22, 2016

Ah ok. From a user's point of view it might be nice to have some kind of selection for the type of event, e.g. 'only show deployment events'. However, off the top of my head I am not sure whether this is possible.

@cheld
Contributor

cheld commented Jul 22, 2016

BTW: this https://square.github.io/cubism/ is a bit of a crazy condensed view. It actually makes sense when you see it in action....

@maciaszczykm
Member

@cheld's example looks very good!

@floreks
Member

floreks commented Jul 25, 2016

@PiotrDabkowski looks nice! @cheld's example also looks good.

What do you think about doing something in the middle? Remove left/right scale. Move units to legend and normalize data so we can display cpu/ram on one graph using common scale.
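
A minimal sketch of the normalization idea, assuming plain min-max scaling of each series into the range 0 to 1. The function name and the sample values are hypothetical, and the dashboard may scale data differently.

```python
from typing import List


def normalize(values: List[float]) -> List[float]:
    """Min-max scale values into [0, 1]; a flat series maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical samples: CPU in cores, memory in bytes.
cpu_cores = [0.05, 0.10, 0.40, 0.20]
memory_bytes = [200e6, 250e6, 600e6, 300e6]

cpu_norm = normalize(cpu_cores)
mem_norm = normalize(memory_bytes)
```

After scaling, both series fit on one axis, but the absolute values are no longer readable from the axis, so units and raw numbers would have to be shown in the legend or on hover.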

cc @maciaszczykm @bryk

@PiotrDabkowski
Author

@floreks I can do everything, changing the graph is very easy with this nvd3 library :) I am only afraid that displaying the data on a common scale will make the graph harder to understand. What do you think about this idea @bryk ?

@floreks
Member

floreks commented Jul 25, 2016

Yeah, I can see how this may cause some trouble. CPU is most commonly shown with the milli prefix and RAM with the mega prefix, so we have 10^-3 vs 10^6.

Then maybe, similar to cubism, show one graph under the other without a Y-axis scale and show the data on hover.

@Lukenickerson
Contributor

Taking a step back, I'd like to better understand why we're showing the user this information -- what is the problem that the graph is solving? Is it to identify memory leaks, or to identify how some events cause changes in resource usage? (I like Christoph's idea of tying the resource graph to events.)

@PiotrDabkowski
Author

@Lukenickerson The purpose of the graph is to visualize historical CPU and Memory usage of the given resource. It's hard to do that without a graph :) And just like you said the purpose of that is to identify potential problems like memory leaks, but it also helps to analyse patterns in resource consumption. Overlaying graphs with events was the idea from the start :) and again, as you said it helps to determine impact of certain events.

Currently, Heapster provides only 15 minutes of historical resource consumption so the graph is not very useful, but longer time periods will be available in the future and that will make graphs much more helpful.

@bryk
Contributor

bryk commented Jul 28, 2016

Taking a step back, I'd like to better understand why we're showing the user this information -- what is the problem that the graph is solving?

This is to monitor and troubleshoot your cluster and applications, i.e., to understand why your application is crashing (you can see that mem usage is above limits) or why it is serving too few requests per second (e.g. CPU usage went high after a release).

Usually this would be: I open the page and see from the graph that everything is fine. I can also notice a spike/drop in some metrics or see that a particular event caused something. All the graphs we're going to show are K8s context aware, and this is the reason why they can be very powerful and can differentiate us from all the other generic tools that just show numbers.

Does this answer your question?

@Lukenickerson
Contributor

Thanks @bryk , those are good examples. In both of them it would be somewhat difficult for the user to understand the issue unless they knew the time of the events, and if the events happened within the last 15 minutes.

I think with some additional information the graphs could become even more useful:

  • Why is my app crashing? To help solve this case it would be nice to connect the graph of mem/cpu with crash events (if logged), or with certain thresholds.
  • Why is my app slow? To help the user solve this problem the graph would need to show cpu along with events, such as new releases, so the user can make correlation.
  • In both cases it might be helpful to somehow only show the time around certain events. (Maybe condense the time x-axis during periods of time when nothing out of the ordinary happens.)
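
The use cases in this list could be sketched as follows: find the moments when usage crosses a threshold, then look up the events that happened close to those moments. The tuple shapes, function names, threshold, and sample data are all assumptions for illustration.

```python
from typing import List, Tuple

Sample = Tuple[int, float]  # (timestamp in seconds, usage value)
Event = Tuple[int, str]     # (timestamp in seconds, description)


def threshold_crossings(samples: List[Sample], limit: float) -> List[int]:
    """Timestamps where usage rises from at-or-below the limit to above it."""
    crossings = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if prev <= limit < cur:
            crossings.append(ts)
    return crossings


def events_near(events: List[Event], ts: int, window: int) -> List[Event]:
    """Events that happened at most `window` seconds before or after ts."""
    return [e for e in events if abs(e[0] - ts) <= window]


# Hypothetical memory usage in MiB and a hypothetical event log.
memory_mib = [(0, 300.0), (60, 310.0), (120, 520.0), (180, 540.0)]
events = [(100, "new release rolled out"), (900, "pod scaled up")]

crossings = threshold_crossings(memory_mib, limit=512.0)
related = [events_near(events, ts, window=60) for ts in crossings]
```

The same window lookup could also drive the idea of showing only the time around certain events, by keeping just the samples that fall inside each window.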

@bryk
Contributor

bryk commented Jul 29, 2016

Yes, I agree here. The 15 minutes is a temporary limitation that we'll overcome in the future. Once we have more data the use cases you've mentioned can be satisfied (or satisfied better). All we need to make sure is to design for them, assuming we'll get more data in the future.

@Lukenickerson
Contributor

@bryk : Do we currently have a way to track events in the system? Would things like an app crash or an app at/over a certain resource threshold be part of the logs, and could they be easily identified?

@bryk
Contributor

bryk commented Jul 29, 2016

Do we currently have a way to track events in the system?

We can get all events in the system or all events that are associated with a thing. So when we show, e.g., a replica set, we can show events related to it or its pods. All events, the positive and negative ones.
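
A minimal sketch of selecting the events related to a replica set and its pods, assuming the events are available as dictionaries shaped like a trimmed-down Kubernetes Event with its `involvedObject`, `reason`, and `type` fields. The helper name, the object names, and the way the target set is built are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def events_for(events: List[Dict], targets: Set[Tuple[str, str]]) -> List[Dict]:
    """Keep events whose involvedObject (kind, name) pair is in targets."""
    selected = []
    for event in events:
        obj = event.get("involvedObject", {})
        if (obj.get("kind"), obj.get("name")) in targets:
            selected.append(event)
    return selected


# Hypothetical event list; real events carry more fields.
all_events = [
    {"reason": "SuccessfulCreate", "type": "Normal",
     "involvedObject": {"kind": "ReplicaSet", "name": "web-1234"}},
    {"reason": "BackOff", "type": "Warning",
     "involvedObject": {"kind": "Pod", "name": "web-1234-abcde"}},
    {"reason": "Pulled", "type": "Normal",
     "involvedObject": {"kind": "Pod", "name": "db-0"}},
]

# The replica set itself plus the pods assumed to belong to it.
targets = {("ReplicaSet", "web-1234"), ("Pod", "web-1234-abcde")}
related = events_for(all_events, targets)
```

Both positive (`Normal`) and negative (`Warning`) events pass through the filter, which matches the point that all of them can be shown.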

@PiotrDabkowski
Author

PiotrDabkowski commented Sep 6, 2016

Graphs displaying CPU and memory usage history have been added to all list and detail pages. See example graph for replica set detail:

[Screenshot: graph merged]

@Lukenickerson @romlein @bryk Do you have any comments on how the graph could be improved?

Finally, I am still unsure about the graph titles. On detail pages the title is "Resource usage history", and on list pages, where the graph shows the cumulative usage of all resources, it is "Cumulative resource usage history". Both titles are quite long and sound too complicated. Do you have any ideas? Maybe we should display a short title and add a question mark icon next to it that gives the user more explanation on click or hover?

@danielromlein
Contributor

Thanks for posting @PiotrDabkowski!

As per earlier comments about the combined graph being confusing, I think a more effective solution may be separating out the CPU and memory data into two separate visualizations; perhaps compressed vertically to save space.

It seems very hidden that clicking the graph title deselects it. We could just use a checkbox for these, but again, I think a better solution still might just be separating the graphs.

Time values along the bottom should be vertically aligned.

@danielromlein
Contributor

@PiotrDabkowski just wanted to follow up and see if you'd had any thoughts around this? 👆

@PiotrDabkowski
Author

PiotrDabkowski commented Sep 12, 2016

@romlein thank you for your feedback! I already implemented your suggestions and I think you were right, it looks better now :)

[Screenshot: graph_row2]

What do you think?

@floreks
Member

floreks commented Sep 12, 2016

Nice! I like it. It's more transparent now and easier to understand.

@danielromlein
Contributor

danielromlein commented Sep 12, 2016

@PiotrDabkowski right on! Vastly improved.
Though I like the side by side arrangement because it leaves more space for viewing the [pods], I can also see how it would be helpful for users to correlate events in the two graphs – which would be easier to spot if the two graphs were stacked vertically. See @bgrant0607's note on #8270:

Presentation guidelines: timeseries graphs: single-column, all same time-scale, for easy correlation.

Perhaps their vertical height could be reduced so as to not push the resource too far down the page?
Thoughts @bryk?

@digitalfishpond
Contributor

If ease of correlation is what we're going for, a possible solution to the problem of height might be a 'Combine views / Split views' toggle button? Just a thought.

@bryk
Contributor

bryk commented Sep 13, 2016

Perhaps their vertical height could be reduced so as to not push the resource too far down the page?
Thoughts @bryk?

I'm also thinking about moving the graphs to the bottom of the page.

@rf232 rf232 closed this as completed Nov 18, 2016
@douxiaofeng99

I use gcr.io/google_containers/kubernetes-dashboard-amd64:v1.5.0. I can view the dashboard, but there is no CPU and memory usage chart. Has it been released?

@cheld
Contributor

cheld commented Jan 23, 2017

You have to deploy the Heapster container as well. Heapster is (more or less) a required Kubernetes component.

@Globegitter

What does CPU usage of e.g. 0.1 actually mean? Is that 10%? And for a pod would that then be 10% of what is allocated to that container? And 0.1 in the overview graph would then mean 10% of all the cpu available from all the nodes? Could not find any clarification on that.

@floreks
Member

floreks commented Aug 24, 2017

CPU (Cores). For example, if a node has 4 cores, then 0.1 means that all cores (together, not all of them at once) are roughly under 10% load. For more information you can check Heapster and how exactly it scrapes metrics.

@floreks
Member

floreks commented Aug 24, 2017

As stated in heapster documentation: https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md

cpu/usage | Cumulative CPU usage on all cores.
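
One way to read such a value, assuming the graph plots usage in cores as the "CPU (Cores)" label suggests, so that 1.0 stands for one fully used core summed across all cores. The helper names and the 4-core node are hypothetical, and this is an illustration of the arithmetic only, not of how Heapster computes its metrics.

```python
def cores_to_millicores(cores: float) -> int:
    """Convert a value in cores to millicores (1 core = 1000 millicores)."""
    return round(cores * 1000)


def share_of_capacity(used_cores: float, capacity_cores: float) -> float:
    """Fraction of the given capacity that is in use."""
    if capacity_cores <= 0:
        raise ValueError("capacity_cores must be positive")
    return used_cores / capacity_cores


usage = 0.1                                    # value read from the graph
millicores = cores_to_millicores(usage)        # a tenth of one core
node_share = share_of_capacity(usage, 4.0)     # relative to a 4-core node
```

Under this reading the same 0.1 corresponds to a different share of the node depending on how many cores the node has, which is why a total next to the value can help.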

@Globegitter

@floreks thanks for that description - is it also possible to see that total? So 1.4/8 for example? Or 6.71Gi/32Gi memory usage. Imo makes these stats immediately more useful and one does not have to think, oh this cluster has 6 cores, this 16 etc.

@floreks
Member

floreks commented Aug 28, 2017

There is no such information available at this time. Calculating the actual max value might be a bit tricky here, because usually you get the total number of cores/memory available on the node and not the actual limits assigned to the k8s apps pool. That is why there are so many metrics available in Heapster. Just for checking CPU limits there are 5 metrics:

Metric name | Description
cpu/limit | CPU hard limit in millicores.
cpu/node_capacity | Cpu capacity of a node.
cpu/node_allocatable | Cpu allocatable of a node.
cpu/node_reservation | Share of cpu that is reserved on the node allocatable.
cpu/node_utilization | CPU utilization as a share of node allocatable.
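
A minimal sketch of the "1.4/8" and "6.71Gi/32Gi" style of display requested above, assuming a usage value and a total (for example a node's allocatable amount) are already known. The function names, the millicore and byte inputs, and the choice of total are assumptions, since the thread notes that picking the right maximum is the hard part.

```python
def format_cpu(used_millicores: int, total_millicores: int) -> str:
    """Render CPU usage against a total, both given in millicores, as cores."""
    return f"{used_millicores / 1000:.1f}/{total_millicores / 1000:g}"


def format_memory(used_bytes: int, total_bytes: int) -> str:
    """Render memory usage against a total, both given in bytes, as Gi."""
    gib = 1024 ** 3
    return f"{used_bytes / gib:.2f}Gi/{total_bytes / gib:g}Gi"


# Hypothetical values for a node with 8 cores and 32Gi of memory.
cpu_label = format_cpu(1400, 8000)
mem_label = format_memory(int(6.71 * 1024 ** 3), 32 * 1024 ** 3)
```

Which total to divide by (capacity, allocatable, or per-container limits) changes the meaning of the label, so the choice would need to be stated next to the graph.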

On the node details page we are showing allocated resources reported directly by kubelet. This can give a rough idea about available resources. I think it's more likely that we'll add more advanced graphs on an overview page and leave sparklines as is.

@maciaszczykm WDYT?
