feat(agw): all AGW containers have health checks #13918

Merged
merged 1 commit into from
Sep 19, 2022

Conversation

mpfirrmann
Contributor

@mpfirrmann mpfirrmann commented Sep 13, 2022

Signed-off-by: Marco Pfirrmann <marco.pfirrmann@tngtech.com>

Summary

This PR is part of the larger issue #13684.

  • Health checks are added for the magmad and sctpd containers.
  • The time between health checks (interval) is reduced from its default of 30s to 4s.
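
For illustration, a health check along these lines might look as follows in a Compose file. This is a minimal sketch, not the PR's actual diff: the test command, timeout and retries are taken from the review excerpts in this conversation, the 4s interval from the summary above, and combining them in one block under the sctpd service is an assumption.

```yaml
services:
  sctpd:
    healthcheck:
      # Passes when the sctpd downstream socket exists (test copied from the review thread).
      test: ["CMD", "bash", "-c", "[ -S /tmp/sctpd_downstream.sock ]"]
      interval: "4s"  # time between checks; 30s if left unset
      timeout: "4s"   # a single check running longer than this counts as failed
      retries: 3      # consecutive failures before the container is marked unhealthy
```

With a block like this in place, docker ps reports the container as healthy or unhealthy in its STATUS column.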

Test Plan

Follow the README to build the containers and check their status with docker ps.

Additional Information

  • This change is backwards-breaking

@mpfirrmann mpfirrmann requested a review from a team September 13, 2022 16:53
@mpfirrmann mpfirrmann requested a review from a team as a code owner September 13, 2022 16:53
@pull-request-size pull-request-size bot added the size/XS Denotes a PR that changes 0-9 lines. label Sep 13, 2022
@github-actions github-actions bot added the component: ci All updates on CI (Jenkins/CircleCi/Github Action) label Sep 13, 2022
@github-actions
Contributor

github-actions bot commented Sep 13, 2022

feg-workflow

2 files, 203 suites, 40s ⏱️
374 tests: 374 ✔️ passed, 0 💤 skipped, 0 failed
388 runs: 388 ✔️ passed, 0 💤 skipped, 0 failed

Results for commit 4565ec9.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Sep 13, 2022

dp-workflow

1 file, 1 suite, 2m 16s ⏱️
14 tests: 14 ✔️ passed, 0 💤 skipped, 0 failed

Results for commit 4565ec9.

♻️ This comment has been updated with latest results.

@github-actions
Contributor

github-actions bot commented Sep 13, 2022

agw-workflow

1 file, 1 suite, 1m 12s ⏱️
473 tests: 468 ✔️ passed, 4 💤 skipped, 1 failed

For more details on these failures, see this check.

Results for commit 9059b9a.

♻️ This comment has been updated with latest results.

Contributor

@Neudrino Neudrino left a comment

If we have this for all the components, should we abstract it into x-generic-service:? At least the timeout and retries part, so that we only configure the target for each component?
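
A sketch of what such an abstraction could look like, assuming a hypothetical extension field x-healthcheck-defaults with an anchor named healthcheck_defaults (neither name comes from the PR). YAML merge keys (<<) are shallow: a service that writes its own healthcheck: mapping replaces an inherited one rather than extending it. The sketch therefore merges the shared values inside each healthcheck mapping instead of at the service level.

```yaml
# Hypothetical extension field and anchor name, introduced for illustration only.
x-healthcheck-defaults: &healthcheck_defaults
  timeout: "4s"
  retries: 3

services:
  sctpd:
    container_name: sctpd
    healthcheck:
      # Pull in timeout and retries; only the test is configured per service.
      <<: *healthcheck_defaults
      test: ["CMD", "bash", "-c", "[ -S /tmp/sctpd_downstream.sock ]"]
```
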

@@ -178,6 +182,10 @@ services:
container_name: sctpd
ulimits:
core: -1
healthcheck:
test: ["CMD", "bash", "-c", "[ -S /tmp/sctpd_downstream.sock ]"]
timeout: "5s"
Contributor

The timeout configured here differs from the one used for everything else. Why?

Contributor Author

It failed with the 4s timeout.

Contributor

What was the test setup that made it fail? On my machine it worked with a timeout of one second. To be fair, I only started sctpd and not the other containers, but then again, a simple bash test command should usually take far less than a second, I think.

Contributor

It worked for me as well with 4s (I tested starting all containers).

@pull-request-size pull-request-size bot added size/M Denotes a PR that changes 30-99 lines. and removed size/XS Denotes a PR that changes 0-9 lines. labels Sep 14, 2022
@mpfirrmann
Contributor Author

If we have this for all the components, should we abstract it into x-generic-service:? At least the timeout and retries part, so that we only configure the target for each component?

Changed.

command: /usr/bin/env python3 -m magma.ctraced.main

sctpd:
<<: *ltecservice
container_name: sctpd
ulimits:
core: -1
healthcheck:
test: ["CMD", "bash", "-c", "[ -S /tmp/sctpd_downstream.sock ]"]
Contributor

nitpick: We could also write test: "test -S /tmp/sctpd_downstream.sock". Personally I find that more readable; in particular, the combination of square brackets for YAML lists and for the bash test command is a bit hard to parse.

Contributor

And as a compromise if we want to stay consistent with the ["CMD", ...] syntax used in this file:

test: ["CMD", "test", "-S", "/tmp/sctpd_downstream.sock"]

disclaimer: This one I didn't test, but I think it should work.
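
For comparison, here are the three spellings discussed in this thread side by side, as a sketch. In the Compose file format a plain string is run through the container's default shell (equivalent to the CMD-SHELL list form), while the CMD list form executes the command directly without a shell. The last variant therefore assumes that a test executable is available on the image's PATH, which has not been verified here. Only one test key can be active at a time, so the alternatives are commented out.

```yaml
services:
  sctpd:
    healthcheck:
      # Form used in the PR: exec bash, which evaluates the [ ... ] builtin.
      test: ["CMD", "bash", "-c", "[ -S /tmp/sctpd_downstream.sock ]"]
      # Shell form: the string is handed to the container's default shell.
      # test: "test -S /tmp/sctpd_downstream.sock"
      # Exec form without a shell: requires a `test` binary inside the image.
      # test: ["CMD", "test", "-S", "/tmp/sctpd_downstream.sock"]
```
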

@@ -25,6 +25,9 @@ x-generic-service: &service
logging: *logging_anchor
restart: always
network_mode: host
healthcheck:
timeout: "4s"
retries: 3
Contributor

I'm not sure we should deduplicate this. Deduplicating is currently possible because the parameters are not fine-tuned to the individual checks. Once we start fine-tuning the timeout (as further down in this PR with sctpd), we actually have two places where the timeout is defined: one default and one override. Defaults plus overrides are more complex than having one definite place to look. Also, deduplicating splits each healthcheck block in two: this part, which defines the timeouts and retries, and the part in the respective service, which defines the test. Without deduplicating, you can see the whole healthcheck definition for a service at a glance.
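
To make the "default plus override" layout concrete, here is a sketch that assumes the shared values live behind a hypothetical healthcheck_defaults anchor (not a name from the PR). With a YAML merge key, keys written explicitly in a mapping take precedence over merged ones, so the effective sctpd timeout below is 5s, and a reader has to look in both places to work that out.

```yaml
# Place 1: shared defaults (hypothetical extension field and anchor name).
x-healthcheck-defaults: &healthcheck_defaults
  timeout: "4s"
  retries: 3

services:
  sctpd:
    healthcheck:
      <<: *healthcheck_defaults
      test: ["CMD", "bash", "-c", "[ -S /tmp/sctpd_downstream.sock ]"]
      # Place 2: per-service override; the explicit key wins over the merged default.
      timeout: "5s"
```
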

Contributor

  • The classic discussion on code abstraction vs duplication. I am sure there are heaps of internet resources arguing for both sides.
  • "Once we start": but we do not at the moment. Would it not be sensible to focus on the current state, on the here and now, instead of anticipating what we might or might not want to do in the future?

Contributor

But we do not at the moment

Yes, we do, in this very PR, with sctpd. That's what I wrote in parentheses.

Contributor

For me it works if we set all timeouts to 4s, so maybe there is no need for fine-tuning at this point.

Contributor

Even if it's possible to use the same values everywhere, I wouldn't say this is a good candidate for deduplication. I think deduplicating those values would be important if they needed to be the same for all checks, and if we wanted to prevent ourselves from accidentally modifying the setting for only one service. I don't see the need for that here. So I think this would basically save us some lines of code, and I don't think that justifies having to look up the healthcheck definitions in two different places.

Neudrino added a commit to alexzurbonsen/magma that referenced this pull request Sep 14, 2022
…gma#13918

Signed-off-by: Fritz Lehnert <13189449+Neudrino@users.noreply.github.com>
@Neudrino
Contributor

After #13852 is merged, a rebase with the appropriate changes to align with it would probably be good.

Neudrino added a commit to alexzurbonsen/magma that referenced this pull request Sep 15, 2022
…gma#13918

Signed-off-by: Fritz Lehnert <13189449+Neudrino@users.noreply.github.com>
Neudrino added a commit to alexzurbonsen/magma that referenced this pull request Sep 15, 2022
…gma#13918

Signed-off-by: Fritz Lehnert <13189449+Neudrino@users.noreply.github.com>
@Neudrino Neudrino enabled auto-merge (squash) September 19, 2022 14:07
Neudrino added a commit to alexzurbonsen/magma that referenced this pull request Sep 19, 2022
…gma#13918

Signed-off-by: Fritz Lehnert <13189449+Neudrino@users.noreply.github.com>
Signed-off-by: Marco Pfirrmann <marco.pfirrmann@tngtech.com>
@mpfirrmann mpfirrmann self-assigned this Sep 19, 2022
@mpfirrmann mpfirrmann linked an issue Sep 19, 2022 that may be closed by this pull request
10 tasks
@Neudrino Neudrino merged commit dc69a34 into magma:master Sep 19, 2022
@mpfirrmann mpfirrmann deleted the pr/enable_health_checks branch September 20, 2022 06:31
Neudrino added a commit to alexzurbonsen/magma that referenced this pull request Sep 20, 2022
…gma#13918

Signed-off-by: Fritz Lehnert <13189449+Neudrino@users.noreply.github.com>
Neudrino added a commit to alexzurbonsen/magma that referenced this pull request Sep 20, 2022
mpfirrmann added a commit to wolfseb/magma that referenced this pull request Sep 21, 2022
Signed-off-by: Marco Pfirrmann <marco.pfirrmann@tngtech.com>
Labels
component: ci All updates on CI (Jenkins/CircleCi/Github Action) size/M Denotes a PR that changes 30-99 lines.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Fully test containerized AGW
4 participants