Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP VM instance not terminating after timeout #834

Closed
eemelipa opened this issue Dec 4, 2021 · 14 comments · Fixed by #689
Closed

GCP VM instance not terminating after timeout #834

eemelipa opened this issue Dec 4, 2021 · 14 comments · Fixed by #689
Labels
cloud-gcp Google Cloud cml-runner Subcommand p0-critical Max priority (ASAP)

Comments

@eemelipa
Copy link

eemelipa commented Dec 4, 2021

Similar issue to #678

I'm starting a self hosted runner via Gitlab CICD to GCP:

deploy-runner:
  stage: start runner
  image: iterativeai/cml:0-dvc2-base1
  resource_group: all
  script:
    - cml runner --cloud=gcp --cloud-region=eu-north --cloud-type=c2-standard-4 --labels=cml-runner --reuse --idle-timeout 600

After the timeout the VM instance is not shutting down.

journalctl --unit cml --no-pager command shows

-- Logs begin at Sat 2021-12-04 18:40:23 UTC, end at Sat 2021-12-04 19:02:12 UTC. --
Dec 04 18:43:29 cml-4ejd6b8lzc systemd[1]: Started cml.service.
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Preparing workdir /tmp/tmp.rtgpwktKf5/.cml/cml-4ejd6b8lzc..."}
Dec 04 18:43:37 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Launching gitlab runner"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"level":"warn","message":"SpotNotifier can not be started."}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.453Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:43:41 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:41.454Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"ready"}
Dec 04 18:43:42 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:43:42.276Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_started"}
Dec 04 18:44:38 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:44:38.904Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:46 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:45:49 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:45:46.363Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.303Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml","status":"job_ended","success":false}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.649Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.952Z","job":1850241152,"level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:46:26 cml-4ejd6b8lzc cml.sh[17099]: {"date":"2021-12-04T18:46:26.953Z","level":"info","message":"runner status","repo":"https://gitlab.com/common-kube/ml"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"runner status","reason":"timeout:600","status":"terminated"}
Dec 04 18:56:28 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"waiting 20 seconds before exiting..."}
Dec 04 18:56:48 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"Unregistering runner cml-4ejd6b8lzc..."}
Dec 04 18:56:49 cml-4ejd6b8lzc cml.sh[17099]: {"level":"info","message":"\tSuccess"}
Dec 04 18:56:50 cml-4ejd6b8lzc systemd[1]: cml.service: Succeeded.

The runner picks up a job correctly and the runner deregisters itself from the Gitlab project. The VM instance just does not shutdown.

On Azure similar config worked ok and the instances were shutting down

@eemelipa
Copy link
Author

eemelipa commented Dec 4, 2021

Tried in different region and instance type: --cloud-region=europe-west1-b --cloud-type=n1-standard-1 but that did not help

@DavidGOrtega
Copy link
Contributor

@eemelipa thanks a lot for the issue.

@DavidGOrtega DavidGOrtega added p0-critical Max priority (ASAP) cml-runner Subcommand cloud-gcp Google Cloud labels Dec 6, 2021
@casperdcl casperdcl linked a pull request Dec 7, 2021 that will close this issue
@dacbd
Copy link
Contributor

dacbd commented Dec 10, 2021

hmm 🤔 the opposite of mine #808

@eemelipa can you post a screenshot for me? After you run your pipeline go to the GCP dashboard for the project then the Activity tab and take a screenshot?

If it's in a busy project that has other stuff, search for the "Logs Explorer" and narrow the time to when your pipeline ran, and look for any "severity" high than "notice"

@eemelipa
Copy link
Author

image

@dacbd here's the screenshot ^ There weren't any warn/error logs 🤔 The bootCounter: 2 latebootreportevent seems to happen correctly at 5min mark which was the idle-timeout in this case. I don't know if the logs are missing any rows though

The credentials have "Compute Admin" privilege

@dacbd
Copy link
Contributor

dacbd commented Dec 12, 2021

@eemelipa there should be some firewall/network API calls. When I had the teardowns fail these calls failed and so the instance remained.

@eemelipa
Copy link
Author

@eemelipa there should be some firewall/network API calls. When I had the teardowns fail these calls failed and so the instance remained.

Ok, I adjusted the log filtering and now the firewall inserts are visible:

image

No errors/warnings though. Any thoughts on what to look for next? I tried giving the service account owner access to the project (i.e., all privileges) but that didn't help

@dacbd
Copy link
Contributor

dacbd commented Dec 12, 2021

An error here would have made this an easy fix 😞

There is something more complicated going on, "Compute Admin" is a sufficient role.

@dacbd
Copy link
Contributor

dacbd commented Dec 12, 2021

I'm not sure how much help I can provide without digging into the setup more, or building a custom cml image with more debug logging for the terraform provider

Here you can see a full successful logging lifecycle in gcp
image

@dacbd
Copy link
Contributor

dacbd commented Dec 12, 2021

We might be able to get more info out of journalctl by adding a start-up script that creates a custom environment over-ride for the cml.service /etc/systemd/system/cml.service.d/debug.conf

I think something like this should work:
debug.conf

[Service]
Environment="TF_LOG=DEBUG"

@eemelipa
Copy link
Author

Spend good time fiddling manually on the VM instance and got a step forward

Looks like the problem is that the CML VM instance did not get any GCP service account:

image

When I put that in place manually and restarted the instance things worked! The instance got deleted after the idle-timeout.

Sounds like something should fail if the instance does not have a service account. Here's couple options that came to my mind (obviously you guys know better what's under the hood):

  1. The iterative terraform provider should fail when fetching credentials in case service account is not in place (maybe somewhere here https://github.com/iterative/terraform-provider-iterative/blob/550c698677b94644a80fa019fab6f62b26787543/iterative/gcp/provider.go#L295)
  2. The GCP service account is specified with --cloud-permission-set cml cli argument. If the --cloud=gcp but the service account is missing then the cml runner command should fail

So missing service account is one problem and we seem to also have a second problem. When I give service account to cml runner command it creates the VM instance correctly with the account but it does not give correct Cloud API access scopes:

image

When I gave manually the API scopes the idle-timeout deletion worked ok. Sounds like some changes might be needed to the CML instance creation

@dacbd
Copy link
Contributor

dacbd commented Dec 13, 2021

Hmm, it sounds like some documentation clarification might be required?

Under the hood, cml runner adds the GOOGLE_APPLICATION_CREDENTIALS_DATA that cml was invoked with into the systemd service unit as those should be the credentials used for the creation of the instance and thus also should be used for the teardown of the instance.

The --cloud-permission-set takes (in GCP's case) the service account email to attach to the instance, the intent behind that is for the application or ML model to use to access other services from the cloud provider like s3/object storage.

Are you saying it looks like terraform tried to use those (the --cloud-permission-set) creds instead of the original cml runner ones? That is definitely not intended.

This should be easy for me to reproduce and I'll try to get it fixed soon, if you are on discord and willing to test out a patch I can let you know when I have something working (dabarnes on discord)

@dacbd
Copy link
Contributor

dacbd commented Dec 13, 2021

Can you share more of the yml from gitlab-ci?

I think I misunderstood some of your last reply.

Can you share the permissions list or GCP managed role for the service account whose key should be set in the GOOGLE_APPLICATION_CREDENTIALS_DATA environment variable when you invoke your cml runner command and the same for the one you are trying to set with the --cloud-permission-set ?

Also, cat out the /etc/systemd/system/cml.service file and be sure to check and redact any sensitive values that might be there

edit:
In the same screen as your first screenshot there's a metadata section. Can you verify GOOGLE_APPLICATION_CREDENTIALS_DATA json is your service account you invoked cml runner with, in the startup-script?

@dacbd
Copy link
Contributor

dacbd commented Dec 14, 2021

Can be closed with TPI fix: iterative/terraform-provider-iterative#333

@casperdcl
Copy link
Contributor

fixed by iterative/terraform-provider-iterative#333; thanks @dacbd

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cloud-gcp Google Cloud cml-runner Subcommand p0-critical Max priority (ASAP)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants