Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

e2e-aws: stream bootkube and 'kubectl get all' to artifacts #2064

Closed
wants to merge 1 commit into from

Conversation

sttts
Copy link
Contributor

@sttts sttts commented Nov 5, 2018

This PR should make issues during bootstrapping much easier to understand.

@openshift-ci-robot openshift-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Nov 5, 2018

# stream bootkube into artifact file
mkdir -p /tmp/artifacts/bootstrap
ssh -o "StrictHostKeyChecking=no" core@${ip} sudo journal -u bootkube 2>&1 > /tmp/artifacts/bootstrap/bootkube.log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This container is named stream-kubectl-get-all, but it is streaming journal -u bootkube? And it looks like your earlier container isstreaming bootkube too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo. Had planned two containers, but then noticed the amount of shared code and merged. will fix.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

# wait for the bootstrap node to show up with an IP
ip=""
while [ -z "${ip}" ]; do
ip=$(terraform state show -state=terraform.tfstate module.bootstrap.aws_instance.bootstrap | sed -n 's/^public_ip *= *//p')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@smarterclayton wanted this IP in the build logs (I think?), I'm just not clear if there's a way to get information there from a pod container. Any ideas?

stream-kubectl-get-all ${ip} &
}

stream &
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream is backgrounding the long-running resource streamers internally, so we can probably drop & here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

@openshift-ci-robot openshift-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Nov 5, 2018
function stream-bootkube () {
ip=${1}
while true; do
ssh -o "StrictHostKeyChecking=no" core@${ip} sudo journal -u bootkube 2>&1 >> /tmp/artifacts/bootstrap/bootkube.log
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we use the fully-qualified bootkube.service?

Also, this redirect isn't quite right, since you redirect stderr into stdout before adjusting stdout. For example:

$ (echo hi >&2) 2>&1 >>/tmp/bootkube.log
hi

You should instead redirect stdout first, and then redirect stderr into stdout:

$ (echo hi >&2) >>/tmp/bootkube.log 2>&1
$ cat /tmp/bootkube.log 
hi

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the file redirection out of the loop. Makes it slightly easier to read.

if [[ -f /tmp/shared/exit ]]; then
exit 0
fi
sleep 60 & wait
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand the backgrounded sleep here.

$ (sleep 10 && echo 'done sleeping') & (wait && echo 'done waiting')
done waiting  # this shows up very quickly
done sleeping  # this is delayed by 10 seconds

Why have a non-blocking sleep? I'd expect sleep here with a trailing kill:

for i in $(seq 1 120); do
  if [[ -f /tmp/shared/exit ]]; then
    break
  fi
  sleep 60
done
kill-streams

where kill-streams had internal wait calls (possibly using explicit PIDs).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the loop with the sleep is copied from the teardown script below.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the kill-streams is not necessary due to the TERM trap

@openshift-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: sttts
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: rajatchopra

If they are not already assigned, you can assign the PR to them by writing /assign @rajatchopra in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@sttts
Copy link
Contributor Author

sttts commented Jan 3, 2019

@wking addressed your comments.

Still wondering how to test the script. It's only blindly written right now. How do you do that usually?

@wking
Copy link
Member

wking commented Jan 3, 2019

Still wondering how to test the script. It's only blindly written right now. How do you do that usually?

There are some docs here. Personally I've been using the "hope it works and file patches if it turns out to be broken" approach unless I'm doing something big 😊

while true; do
ssh -o "StrictHostKeyChecking=no" core@${ip} sudo journalctl -u bootkube.service -f --no-tail 2>&1
echo "=================== journalctl terminated ==================="
date
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we embed this in the termination marker? It feels like it might be associated with the next iteration's logs if it comes after a big banner. Something like:

echo "========== journalctl terminated $(date --iso=s --utc) =========="

would do it.

cp /etc/openshift-installer/ssh-privatekey .ssh/id_rsa
cp /etc/openshift-installer/ssh-publickey .ssh/id_rsa.pub

function stream-bootkube () {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is super complicated. Why isn't the installer fetching the bootkube logs if anything fails?

@smarterclayton
Copy link
Contributor

This is way more complicated than I would like. If you can't debug a failure of bootstrap from the installer logs, I think we're doing something wrong. I don't really see the value in watch -w over possibly just dumping all the objects at the beginning of teardown.

@wking
Copy link
Member

wking commented Jan 4, 2019

If you can't debug a failure of bootstrap from the installer logs, I think we're doing something wrong.

openshift/installer#967 is work towards the installer being able to collect bootkube.service logs on its own.

@wking
Copy link
Member

wking commented Jan 4, 2019

I don't really see the value in watch -w over possibly just dumping all the objects at the beginning of teardown.

For successful runs, the bootstrap node will be gone by the time the teardown CI container starts going through it's log collection. But yeah, you could put something there to attempt to grab the bootstrap logs, and have it only succeed when bootstrapping hung.

@wking
Copy link
Member

wking commented Jan 24, 2019

Obsoleted by #2633?

@openshift-ci-robot
Copy link
Contributor

@sttts: The following tests failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/prow-config 99c4bec link /test prow-config
ci/prow/step-registry-shellcheck 99c4bec link /test step-registry-shellcheck
ci/prow/app-ci-config 99c4bec link /test app-ci-config
ci/prow/ci-operator-config 99c4bec link /test ci-operator-config
ci/prow/release-controller-config 99c4bec link /test release-controller-config
ci/prow/step-registry-metadata 99c4bec link /test step-registry-metadata
ci/prow/ci-testgrid-allow-list 99c4bec link /test ci-testgrid-allow-list
ci/prow/yamllint 99c4bec link /test yamllint
ci/prow/boskos-config 99c4bec link /test boskos-config

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot
Copy link
Contributor

@sttts: The following test failed, say /retest to rerun all failed tests:

Test name Commit Details Rerun command
ci/prow/release-config 99c4bec link /test release-config

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@sttts
Copy link
Contributor Author

sttts commented Oct 22, 2020

/close

@openshift-ci-robot
Copy link
Contributor

@sttts: Closed this PR.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
5 participants