CPU and Memory usage monitoring feature #1010
Comments
It looks nice! I'm wondering if it wouldn't look a bit cleaner to have these two plots in separate graphs. It's not so easy to read at the moment. Perhaps updating the legend could help a bit. Also, some kind of title (to separate it from the other details) would be nice. I guess it's important, so we shouldn't move it to another tab, but I think it could be improved a bit.
@maciaszczykm thank you for your comments! It is possible to choose one graph by clicking on a graph name. We thought that having an option to see 2 graphs at the same time would be a nice feature, as this allows the user to see the relation between memory/CPU usage. Regarding the title: I am still experimenting with it and I will post new screenshots soon.
The correlation of metrics and events is very nice. Most tools show events as vertical lines like this: https://codeascraft.com/2010/12/08/track-every-release/ and show additional information on hover. But this is not touch friendly.
I like where this is headed @PiotrDabkowski! I'm inclined to agree with @maciaszczykm that trying to combine the two graphs could potentially be confusing. As a user it's not immediately apparent that I can choose one graph by clicking on its name. Datadog uses a line that extends between graphs to help establish the relationship between them; I think something like that could be effective.
@cheld Thank you for the suggestion, that looks nice. I thought about something like this http://www.nytimes.com/interactive/2013/03/29/sports/baseball/Strikeouts-Are-Still-Soaring.html?ref=baseball&_r=1& but that may not be very readable if the number of events is big. @romlein Thank you, I will add my future scans there. Regarding the 2-graph solution you suggested: that may work and it's not hard to change, but the graph is already taking up 25% of the screen and I am afraid that adding another graph will make the page cluttered and less readable.
Ah ok. From a user's point of view it might be nice to have some kind of selection for the type of event. E.g. 'only show deployment events'. However, off the top of my head I am not sure if this is possible.
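A minimal sketch of the event-type selection idea above, assuming each event is a plain record carrying a `reason` string. The `Event` class, its field names, and the sample reason values are hypothetical illustrations, not the dashboard's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """Hypothetical event record; field names are assumptions."""
    timestamp: int  # seconds since epoch
    reason: str     # short machine-readable cause, e.g. "ScalingReplicaSet"
    message: str


def filter_events(events, reasons):
    """Return only the events whose reason is in the selected set, in order."""
    wanted = set(reasons)
    return [event for event in events if event.reason in wanted]
```

With a selection like this, the graph would only draw annotations for the returned events, which could also keep it readable when the number of events is big.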
BTW: this https://square.github.io/cubism/ is a bit of a crazy condensed view. It actually makes sense when you see it in real action....
@cheld's example looks very good!
@PiotrDabkowski looks nice! @cheld's example also looks good. What do you think about doing something in the middle? Remove the left/right scale. Move units to the legend and normalize the data so we can display CPU/RAM on one graph using a common scale.
Ye, I can see how this may cause some trouble. CPU most commonly uses the milli prefix and RAM mega, so we have 10^-3 vs 10^6. Then maybe, similar to cubism, show one graph under the other without a Y axis scale and show data on hover.
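To make the scale mismatch concrete, here is a rough sketch of one way to put both series on a common scale: min-max normalization of each series to the range 0..1. This is only one possible technique, not what the dashboard does; the function name and the sample values are made up for illustration.

```python
def normalize(values):
    """Min-max scale a non-empty series to the range 0..1.

    A flat series (all values equal) maps to all zeros to avoid
    dividing by zero.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    span = hi - lo
    return [(value - lo) / span for value in values]


# Hypothetical samples: CPU in millicores, memory in bytes.
cpu_millicores = [120, 250, 400, 310]
memory_bytes = [512 * 2**20, 640 * 2**20, 768 * 2**20, 700 * 2**20]

cpu_scaled = normalize(cpu_millicores)
memory_scaled = normalize(memory_bytes)
```

Both scaled series can then share one Y axis, but the axis no longer carries units, so the real values would have to be shown in the legend or on hover.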
Taking a step back, I'd like to better understand why we're showing the user this information -- what is the problem that the graph is solving? Is it to identify memory leaks, or to identify how some events cause changes in resource usage? (I like Christoph's idea of tying the resource graph to events.)
@Lukenickerson The purpose of the graph is to visualize historical CPU and memory usage of the given resource. It's hard to do that without a graph :) And just like you said, the purpose of that is to identify potential problems like memory leaks, but it also helps to analyse patterns in resource consumption. Overlaying graphs with events was the idea from the start :) and again, as you said, it helps to determine the impact of certain events. Currently, Heapster provides only 15 minutes of historical resource consumption so the graph is not very useful, but longer time periods will be available in the future and that will make graphs much more helpful.
This is to monitor and troubleshoot your cluster and applications. I.e., to understand why your application is crashing (you can see that mem usage is above limits) or why it is serving too few requests per second (e.g. CPU usage went high after a release). Usually this would be: I open the page and see from the graph that everything is fine. I can also notice a spike/drop in some metrics or see that a particular event caused something. All the graphs we're going to show are K8s context aware and this is the reason why they can be very powerful and can differentiate us from all other generic tools that just show numbers. Does this answer your question?
Thanks @bryk , those are good examples. In both of them it would be somewhat difficult for the user to understand the issue unless they knew the time of the events, and if the events happened within the last 15 minutes. I think with some additional information the graphs could become even more useful:
Yes, I agree here. The 15 minutes is a temporary limitation that we'll overcome in the future. Once we have more data the use cases you've mentioned can be satisfied (or satisfied better). All we need to make sure is to design for them, assuming we'll get more data in the future.
@bryk : Do we currently have a way to track events in the system? Would things like an app crash or an app at/over a certain resource threshold be part of the logs, and could they be easily identified?
We can get all events in the system or all events that are associated with a thing. So when we show, e.g., a replica set, we can show events related to it or its pods. All events, the positive and negative ones.
Graphs displaying CPU and memory usage history have been added to all list and detail pages. See example graph for replica set detail: @Lukenickerson @romlein @bryk Do you have any comments on how the graph could be improved? Finally, I am still unsure about graph titles: on detail pages it is "Resource usage history", and on list pages, where it shows the cumulative resource usage of all resources, the title is "Cumulative resource usage history". Both titles are quite long and sound too complicated. Do you have any ideas? Maybe we should display a short title and add a question mark icon next to the graph title that would provide more explanation to the user upon click or hover?
Thanks for posting @PiotrDabkowski! As per earlier comments about the combined graph being confusing, I think a more effective solution may be separating out the CPU and memory data into two separate visualizations; perhaps compressed vertically to save space. It seems very hidden that clicking the graph title deselects it. We could just use a checkbox for these, but again, I think a better solution still might just be separating the graphs. Time values along the bottom should be vertically aligned.
@PiotrDabkowski just wanted to follow up and see if you'd had any thoughts around this? 👆
Nice! I like it. It's more transparent now and easier to understand.
@PiotrDabkowski right on! Vastly improved.
Perhaps their vertical height could be reduced so as to not push the resource too far down the page?
If ease of correlation is what we're going for, a possible solution to the problem of height might be a 'Combine views / Split views' toggle button? Just a thought.
I'm also thinking about moving the graphs to the bottom of the page.
I use gcr.io/google_containers/kubernetes-dashboard-amd64:v1.5.0. I can view the dashboard, but there is no CPU and memory usage chart. Has it been released?
You have to deploy the Heapster container as well. Heapster is (more or less) a required Kubernetes component.
What does CPU usage of e.g.
As stated in the Heapster documentation: https://github.com/kubernetes/heapster/blob/master/docs/storage-schema.md
@floreks thanks for that description - is it also possible to see the total? So 1.4/8 for example? Or 6.71Gi/32Gi memory usage. IMO that makes these stats immediately more useful, and one does not have to think: oh, this cluster has 6 cores, this one 16, etc.
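A small sketch of the "usage/total" labels suggested above, assuming both the usage and the total capacity were available as numbers (CPU in millicores, memory in bytes). The function names are hypothetical, and where the totals would come from is left open here.

```python
def format_cpu(used_millicores, capacity_millicores):
    """Render CPU usage against capacity in cores, e.g. '1.4/8'."""
    return f"{used_millicores / 1000:g}/{capacity_millicores / 1000:g}"


def format_memory(used_bytes, capacity_bytes):
    """Render memory usage against capacity in GiB, e.g. '6.71Gi/32Gi'.

    Usage keeps two decimals; capacity is printed without trailing zeros.
    """
    gib = 2**30
    return f"{used_bytes / gib:.2f}Gi/{capacity_bytes / gib:g}Gi"


print(format_cpu(1400, 8000))                    # 1.4/8
print(format_memory(7204807700, 32 * 2**30))     # 6.71Gi/32Gi
```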
There is no such information available at this time. Calculating the actual max value might be a bit tricky here because usually you will get the total number of cores/memory available on the node and not the actual limits assigned to the k8s apps pool. That is why there are so many metrics available in Heapster. Only for checking CPU limits there are like 5 metrics:
On the node details page we are showing allocated resources reported directly by kubelet. This can give a rough idea about available resources. I think it's more likely that we'll add more advanced graphs on an overview page and leave sparklines as is. @maciaszczykm WDYT?
I am working on a new CPU and memory monitoring feature for our dashboard. The aim is to add graphs of CPU and memory usage to the details page. An example can be seen in the design specification.
My idea is to show not only resource usage versus time but also add extra annotations to the graph, showing events of interest and max/min resource consumption.
I would really appreciate some discussion here as I am not exactly sure what the graph should look like and how the annotations should be displayed so that the feature is both easy to use and neat. The goal of this discussion is to create more detailed mock designs.
cc @romlein @Lukenickerson @floreks @bryk @pwittrock