Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: health status implementation #1406

Merged
merged 13 commits into from
Dec 19, 2023
Merged

feat: health status implementation #1406

merged 13 commits into from
Dec 19, 2023

Conversation

kohlisid
Copy link
Contributor

@kohlisid kohlisid commented Dec 6, 2023

Health status definitions

It is divided into two parts:

  1. Pipeline Resource Health: It is based on the health of each vertex in the pipeline
  2. Data Criticality: It is based on the data movement of the pipeline

Resource Health can be "healthy (0) | unhealthy (1) | paused (3) | unknown (4)".

Resource Health purely means it is up and running.
Resource health will be the max(health) based of each vertex's health

Resource health checks if all the pods are in running state for the pipeline, and also for paused, unknown pipeline etc

Data Criticality on the other end shows whether the pipeline is working as expected.
It represents the pending messages, lags, etc.
Data Criticality can be "ok (0) | warning (1) | critical (2)".
A backlogged pipeline can be healthy even though it has an increasing back-pressure.

For data criticality the timeline data is populated for the pipeline, and if the average usage lies above the thresholds then the required state is assigned. For critical states we have the option do a lookback and only assign it to be critical if we see a predefined number of critical state in a lookback period window.

@kohlisid kohlisid marked this pull request as ready for review December 11, 2023 19:29
@kohlisid kohlisid requested a review from a team December 11, 2023 19:29
go.mod Outdated Show resolved Hide resolved
Copy link
Member

@whynowy whynowy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

partial review

pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
pkg/daemon/server/service/healthStatus.go Show resolved Hide resolved
Copy link
Contributor

@vigith vigith left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome job on the well-commented code :)

pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
pkg/daemon/server/service/healthStatus.go Show resolved Hide resolved
pkg/shared/ewma/interface.go Show resolved Hide resolved
pkg/shared/ewma/simple_ewma.go Outdated Show resolved Hide resolved
pkg/shared/health-status-code/code_map.go Show resolved Hide resolved
server/apis/v1/health.go Outdated Show resolved Hide resolved
server/apis/v1/health.go Outdated Show resolved Hide resolved
pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
pkg/shared/ewma/interface.go Show resolved Hide resolved
func (hc *HealthChecker) StartHealthCheck(ctx context.Context) {
// Goroutine to listen for ticks
// At every tick, check and update the health status of the pipeline.
go func(ctx context.Context) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't the function StartHealthCheck be blocked? It's using a goroutine to start it in the daemon server.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated. Please check if it seems fine now

server/apis/v1/health.go Outdated Show resolved Hide resolved
Copy link
Contributor

@yhl25 yhl25 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a lot of methods and types exposed in healthStatus.go, avoid exposing types and methods if they are not used in other packages

pkg/daemon/server/service/healthStatus.go Outdated Show resolved Hide resolved
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
@whynowy
Copy link
Member

whynowy commented Dec 19, 2023

@kohlisid - resolve the conflicts?

Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
Signed-off-by: Sidhant Kohli <sidhant.kohli@gmail.com>
@whynowy whynowy merged commit bca1b3b into numaproj:main Dec 19, 2023
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants