
add example diagrams #2

Open
remylouisew wants to merge 2 commits into nstogner:main from remylouisew:diagrams

Conversation

@remylouisew

Adding diagrams and tables depicting how metrics are calculated when a job completes vs when a job fails.

Owner

@nstogner nstogner left a comment

Some of these metric names don't match what is being produced. You probably need to query the metrics in the project to see exactly what is there (search for megamon_alpha_ and remove the "alpha" part from the final doc). PS: I think you checked in a duplicate diagram.
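
As a rough sketch of that renaming step, assuming the queried names carry the alpha marker directly after the prefix: the helper name `strip_alpha` and the sample input are hypothetical and only illustrate the string change, they are not taken from the project.

```python
def strip_alpha(metric_name: str, prefix: str = "megamon") -> str:
    """Drop the 'alpha_' segment that follows the prefix, if present.

    Hypothetical helper: names without the '<prefix>_alpha_' marker
    are returned unchanged.
    """
    alpha_prefix = f"{prefix}_alpha_"
    if metric_name.startswith(alpha_prefix):
        return f"{prefix}_" + metric_name[len(alpha_prefix):]
    return metric_name


# Illustrative input only; check the real names by querying the project.
print(strip_alpha("megamon_alpha_jobset_up"))
```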

@remylouisew
Author

OK, the metric names should be correct now (hopefully). I see two different diagrams; let me know if you still don't.

@nstogner
Owner

Hey @remylouisew! I am still seeing some differences. Here are the metrics that are produced today:

(where * might be jobset or nodepool or jobset_nodes)

megamon_*_up
megamon_*_up_time_seconds
megamon_*_down_time_seconds
megamon_*_interruption_count
megamon_*_recovery_count
megamon_*_up_time_between_interruption_seconds
megamon_*_up_time_between_interruption_mean_seconds
megamon_*_up_time_between_interruption_latest_seconds
megamon_*_down_time_initial_seconds
megamon_*_down_time_between_recovery_seconds
megamon_*_down_time_between_recovery_mean_seconds
megamon_*_down_time_between_recovery_latest_seconds

PS: Technically the megamon_ part is actually configurable.
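
To cross-check the doc against this list, the names can be expanded mechanically. The following is a minimal sketch, assuming every suffix applies to every value of `*` (the list says `*` "might be" each of them, so some combinations may not actually be emitted); the function name `metric_names` and the `prefix` parameter are hypothetical.

```python
# The three values the wildcard may take, per the comment above.
SCOPES = ["jobset", "nodepool", "jobset_nodes"]

# Suffixes copied from the metric list above.
SUFFIXES = [
    "up",
    "up_time_seconds",
    "down_time_seconds",
    "interruption_count",
    "recovery_count",
    "up_time_between_interruption_seconds",
    "up_time_between_interruption_mean_seconds",
    "up_time_between_interruption_latest_seconds",
    "down_time_initial_seconds",
    "down_time_between_recovery_seconds",
    "down_time_between_recovery_mean_seconds",
    "down_time_between_recovery_latest_seconds",
]


def metric_names(prefix: str = "megamon") -> list[str]:
    """Expand every scope/suffix pair under the (configurable) prefix."""
    return [f"{prefix}_{scope}_{suffix}" for scope in SCOPES for suffix in SUFFIXES]


for name in metric_names():
    print(name)
```

Passing a different `prefix` models the configurable `megamon_` part; any name in the diagrams that is missing from this expansion is a candidate mismatch.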

@remylouisew
Author

remylouisew commented Dec 31, 2024 via email

2 participants