
Move kubeflow jobs to https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/prowjobs #14343

Closed
fejta opened this issue Sep 16, 2019 · 52 comments
Assignees
Labels
kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness.

Comments

@fejta
Contributor

fejta commented Sep 16, 2019

@jlewi what do you think about moving the kubeflow jobs over to https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/prowjobs and prow.gflocks.com?

@fejta fejta added the kind/cleanup Categorizes issue or PR as related to cleaning up code, process, or technical debt. label Sep 16, 2019
@jlewi
Contributor

jlewi commented Sep 16, 2019

@fejta That seems fine with me.

What else would need to change? For example:

Do testgrids move?
Do we use a different GCS bucket for prow artifacts?

@scottilee is this something you could help with?

@jlewi
Contributor

jlewi commented Sep 16, 2019

@fejta How urgent is this on your end?

@fejta
Contributor Author

fejta commented Sep 16, 2019

Not urgent.

Testgrid can stay the same.
Are you still not using pod utils? If so, then yes, you'll upload to a different bucket (maybe prow specifies where to upload it?)

@chases2 we should probably set up GCP/oss-test-infra to work like istio -- where we can annotate prowjobs there and have them show up in this testgrid.

@jlewi
Contributor

jlewi commented Sep 16, 2019

@fejta Correct, we manually upload our files to GCS right now, but we could probably switch to using pod_utils.

@scottilee
Contributor

@fejta Can you share some more info (e.g., a link to a ticket or document with explanation if available) on why the move from prow.k8s.io to prow.gflocks.com?

Also, would it just be creating a "kubeflow" folder in https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/prowjobs and moving the files in https://github.com/kubernetes/test-infra/tree/master/config/jobs/kubeflow to there?

Lastly, I'm not familiar with pod-utils. I'm assuming it's this https://github.com/kubernetes/test-infra/tree/master/prow/pod-utils? Any more info anywhere so I can read up on it?

@fejta
Contributor Author

fejta commented Sep 16, 2019

why the move

prow.k8s.io is for kubernetes (or at least CNCF project)
prow.gflocks.com is for public Google projects.

Eventually we want to migrate all non-CNCF projects out of prow.k8s.io

And yes, ideally it is:
a) creating a GKE cluster to run jobs (provides you with isolation from other jobs)
b) configuring prow to schedule kubeflow jobs into that cluster
b1) also moving any secrets, configmaps, etc that jobs use
c) moving jobs to that prow instance

pod-utils

Let's not worry about this for now, see https://github.com/kubernetes/test-infra/blob/master/prow/jobs.md#pod-utilities for more detail.

Test containers should no longer need to check out repos and/or upload results to GCS. Sidecar containers will do this for you.
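For reference, opting a job into the pod utilities is mostly a config change: setting `decorate: true` makes Prow inject the clone and upload containers around the test container. A minimal sketch of what such a prowjob entry might look like (the cluster alias, image, and command below are illustrative assumptions, not the actual kubeflow config):

```yaml
presubmits:
  kubeflow/testing:
    - name: kubeflow-testing-presubmit
      cluster: kubeflow-ci               # hypothetical build-cluster alias
      decorate: true                     # pod-utils: clone the repo, upload artifacts via sidecar
      spec:
        containers:
          - image: gcr.io/kubeflow-ci/test-worker:latest  # hypothetical image
            command: ["python", "-m", "kubeflow.testing.run_e2e_workflow"]
```

With `decorate: true`, the test container no longer needs its own checkout or GCS-upload logic.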

@BenTheElder BenTheElder changed the title Move kubetest jobs to https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/prowjobs Move kubeflow jobs to https://github.com/GoogleCloudPlatform/oss-test-infra/tree/master/prow/prowjobs Sep 17, 2019
@scottilee
Contributor

@fejta I apologize for the delay on this. I started a PR at GoogleCloudPlatform/oss-test-infra#93, which is probably wrong 🙄 but it's a start! Let me know what's missing...

  1. In your last comment above you mention "creating a GKE cluster", is that a GKE cluster that the Kubeflow project should create? If so, where do we specify the details for Prow (I don't think I saw any examples in the existing folders in oss-test-infra/prow/prowjobs)?
  2. Once the PR to oss-test-infra is merged I should open a PR to delete the kubeflow folder and the associated YAML at https://github.com/kubernetes/test-infra/tree/master/config/jobs, correct?

@jlewi
Contributor

jlewi commented Sep 23, 2019

@scottilee Kubeflow already has a Kubernetes cluster in project kubeflow-ci which we use for testing purposes. So I believe with the new approach the goal would be to have prow schedule the jobs for Kubeflow on that instance. I'm not sure what we need to do to make that happen. I suspect we need to install some CRs and other infra on our test cluster.

Given that we are getting close to 0.7, we should be careful not to make any infra changes that could keep us from releasing on time.

@scottilee
Contributor

scottilee commented Sep 26, 2019

@jlewi can I either get access to the kubeflow-ci project, or can you create the test-pods namespace and generate the cluster values according to the directions here: GoogleCloudPlatform/oss-test-infra#93 (comment)?

@jlewi
Contributor

jlewi commented Oct 2, 2019

If you need access to the CI cluster please join this group.
https://github.com/kubeflow/internal-acls/blob/master/ci-team.members.txt

Let's proceed cautiously with moving our prow infrastructure, because we are getting ready to do a release and don't want to disrupt our test infra.

@jlewi
Contributor

jlewi commented Oct 2, 2019

@scottilee I opened kubeflow/testing#475 to track changes to our test infra. I will run mkbuild-cluster as soon as I can.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 31, 2019
@scottilee
Contributor

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 10, 2020
@scottilee
Contributor

scottilee commented Mar 21, 2020

PR open in #16898

@scottilee
Contributor

Both #16898 and #16906 are merged. Can we close this issue?

@jlewi
Contributor

jlewi commented Mar 27, 2020

#16898 moved the prow jobs onto the kubeflow testing cluster but we are still using the kubernetes instance of prow. So I think this issue should remain open.

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@Bobgy
Contributor

Bobgy commented Feb 20, 2021

I'm fixing the build cluster setup for tests of other test clusters.
Tests for kubeflow/testing can be found in
PR: kubeflow/testing#904
Prow: https://oss-prow.knative.dev/?repo=kubeflow%2Ftesting.

Current error: https://oss-prow.knative.dev/log?job=kubeflow-testing-presubmit&id=1362976347659440128

INFO|2021-02-20T04:04:27|/src/kubeflow/testing/py/kubeflow/testing/util.py|927| Writing gs://kubernetes-jenkins/pr-logs/pull/kubeflow_testing/904/kubeflow-testing-presubmit/1362976347659440128/started.json
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 720, in <module>
    final_result = main()
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 710, in main
    return run(args, file_handler)
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 256, in run
    create_started_file(args.bucket, {})
  File "/src/kubeflow/testing/py/kubeflow/testing/run_e2e_workflow.py", line 136, in create_started_file
    util.upload_to_gcs(contents, target)
  File "/src/kubeflow/testing/py/kubeflow/testing/util.py", line 929, in upload_to_gcs
    blob.upload_from_string(contents)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 1028, in upload_from_string
    content_type=content_type, client=client)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 949, in upload_from_file
    _raise_from_invalid_response(exc)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/storage/blob.py", line 1735, in _raise_from_invalid_response
    raise exceptions.from_http_response(error.response)
google.api_core.exceptions.Forbidden: 403 POST https://www.googleapis.com/upload/storage/v1/b/kubernetes-jenkins/o?uploadType=multipart: kf-ci-v1-prow@kubeflow-ci.iam.gserviceaccount.com does not have storage.objects.create access to the Google Cloud Storage object.

@Bobgy
Contributor

Bobgy commented Feb 20, 2021

Other Kubeflow tests do not use pod-utils; they write directly to the prow GCS bucket to report status back to prow.

@chaodaiG what would be the recommended location for these artifacts in GCP OSS prow?
Do I need to override them like in https://github.com/kubernetes/test-infra/blob/ee10d3cb6cec8d3ee1f4e979d4e49402be0bde5c/config/testgrids/kubeflow/kubeflow.yaml?

@Bobgy
Contributor

Bobgy commented Feb 20, 2021

I'd like some feedback: I'm starting to think I should just refactor the tests to use pod-utils instead; it seems we'd have less to maintain that way.

There's one feature I'm not sure how to use with pod-utils --
[screenshot]
https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/pull/kubeflow_testing/904/kubeflow-testing-presubmit/1362976346724110336/

How can I add key-value properties dynamically during a test run, like above? Showing the URI helps us navigate to the dashboards where the actual tests were running (a Tekton UI).

@Bobgy
Contributor

Bobgy commented Feb 20, 2021

The corresponding gubernator GCS path for GCP OSS prow seems to look like:

gs://oss-prow/logs/kubeflow-pipeline-postsubmit-integration-test/1362898955494494208

(Found on https://oss-prow.knative.dev/view/gs/oss-prow/logs/kubeflow-pipeline-postsubmit-integration-test/1362898955494494208 -> click "artifacts" button)

Can we grant our CI prow service account (kf-ci-v1-prow@kubeflow-ci.iam.gserviceaccount.com) write access to this gs://oss-prow/logs directory?

@chaodaiG
Contributor

chaodaiG commented Feb 20, 2021

How can I add key-value properties dynamically during a test run, like above? Showing the URI helps us navigate to the dashboards where the actual tests were running (a Tekton UI).

Migrating to pod-utils is highly recommended. The metadata you want can be supplied by writing a file called metadata.json under the $ARTIFACTS dir in the prow job. There is a util https://github.com/knative/test-infra/blob/master/pkg/metautil/client.go and a tool https://github.com/knative/test-infra/blob/master/kntest/pkg/metadata/command.go (invoked by https://github.com/knative/test-infra/blob/fb344e6aa47f2d0de571e19e81e352326a0b1c34/pkg/clustermanager/kubetest2/gke.go#L181) that use this approach in knative, and here is an example output:

[screenshot]

https://prow.knative.dev/view/gs/knative-prow/logs/ci-knative-serving-continuous/1363096292372254720
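The metadata.json approach described above can be sketched in a few lines of Python. This is an illustrative sketch, not the knative util: the `dashboard_url` key and the temp-dir fallback are assumptions for demonstration.

```python
# Sketch of the metadata.json approach: write extra key/value pairs under
# $ARTIFACTS so the pod-utils sidecar uploads them and Spyglass can display
# them with the finished job. The "dashboard_url" key is illustrative.
import json
import os
import tempfile

def write_job_metadata(extra):
    # Prow sets $ARTIFACTS for decorated jobs; fall back to a temp dir locally.
    artifacts_dir = os.environ.get("ARTIFACTS") or tempfile.mkdtemp()
    path = os.path.join(artifacts_dir, "metadata.json")
    with open(path, "w") as f:
        json.dump(extra, f, indent=2)
    return path

if __name__ == "__main__":
    print(write_job_metadata({"dashboard_url": "https://tekton.example.com/run/123"}))
```

A test step would call `write_job_metadata` with whatever key-value pairs it wants surfaced alongside the job results.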

@chaodaiG
Contributor

Can we grant our CI prow service account (kf-ci-v1-prow@kubeflow-ci.iam.gserviceaccount.com) write access to this gs://oss-prow/logs directory?

A prow job doesn't, and shouldn't, need to know where the results are stored; anything placed under the $ARTIFACTS directory in a prow job is uploaded into the artifacts directory in GCS and will be discovered by spyglass.

@Bobgy
Contributor

Bobgy commented Feb 21, 2021

Looks great. Can I confirm whether the custom key-value pairs can be shown while the test is still running?

@chaodaiG
Contributor

Looks great. Can I confirm whether the custom key-value pairs can be shown while the test is still running?

I'm afraid not. I just inspected another run of the same job as in the example above, and after an hour the metadata is still not displayed (the metadata.json file should have been created long before that).

@Bobgy
Contributor

Bobgy commented Feb 23, 2021

Hmm, we'd want to show a URL to the Tekton instance running the steps, ideally while the job is still running.
Also, if I need to refactor the tests to use pod-utils, that would be quite some work just for the migration. Can this be allowed as an exception?

@chaodaiG
Copy link
Contributor

Hmm, we'd want to show a URL to the Tekton instance running the steps, ideally while the job is still running.

started.json is a prow property that prow relies on; modifying it at runtime is not preferable IMHO. Would highlighting the URL in the build log work for you?
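The build-log suggestion above is easy to sketch: print the URL inside a hard-to-miss banner so it stands out in the plain-text log while the job is still running. The function name and URL below are illustrative assumptions.

```python
# Illustrative sketch: surface a dashboard URL prominently in the plain-text
# build log, as an alternative to mutating started.json at runtime.
def highlight_url(label, url):
    banner = "=" * 72
    return "\n".join([banner, "%s: %s" % (label, url), banner])

print(highlight_url("Tekton dashboard", "https://tekton.example.com/run/123"))
```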

Also, if I need to refactor the tests to use pod-utils, that would be quite some work just for migration. Can this be allowed as an exception?

The biggest concern with the old approach is that the service account used to run the test pod can access a shared GCS bucket that all prow jobs use, which is technically a security concern. But if it's preferable to you, we can work together to figure out a plan for this one-time exception. There are a couple of possible approaches in my mind:

  1. Fine-grained ACLs, so the service account from kubeflow can only access its own directory.
  2. Kubeflow provides a kubeflow-owned GCS bucket that can be accessed by prow. I think this will work; adding @cjwagner for more insights.

@cjwagner
Member

There are a couple of possible approaches in my mind:

  1. Fine-grained ACLs, so the service account from kubeflow can only access its own directory.
  2. Kubeflow provides a kubeflow-owned GCS bucket that can be accessed by prow. I think this will work; adding @cjwagner for more insights.

We have uniform bucket ACLs enabled, so 1 is not an option. 2 should work though, and may be worthwhile either way, since we'd like to move toward per-team GCS buckets.
That being said, I'm curious why refactoring to use the pod utilities would be difficult. I wouldn't expect it to be much work, given that it would primarily entail removing steps that are handled automatically by the pod utils. Additionally, we can't really provide support for jobs that use custom logic to upload results rather than Prow's standard practices.

Please check out the docs for the pod utilities to see if migrating would be feasible. If you don't believe it is, we can rely on option 2 to ensure that the custom upload logic doesn't interfere with other tenants' job results.

@Bobgy
Contributor

Bobgy commented Mar 1, 2021

To explain a little history: there was quite a chunk of code written for Kubeflow CI that duplicates the responsibilities of the pod utilities, and no one familiar with it is still on the team, so migrating will be difficult because of the unmaintained test code.

Anyway, I've decided to finish the migration first by not setting up presubmit tests for the remaining repos, because they are not under active development. My team can either migrate or build new tests later.

@Bobgy
Contributor

Bobgy commented Mar 2, 2021

The migration is done!

@Bobgy
Contributor

Bobgy commented Mar 2, 2021

/close

@k8s-ci-robot
Contributor

@Bobgy: You can't close an active issue/PR unless you authored it or you are a collaborator.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@Bobgy
Contributor

Bobgy commented Mar 2, 2021

Actually, there's one last step: adding back the testgrid in GCP OSS prow.

@Bobgy
Contributor

Bobgy commented Mar 3, 2021

/assign @chaodaiG
We can close this now!

@chaodaiG
Contributor

chaodaiG commented Mar 3, 2021

yes https://testgrid.k8s.io/googleoss-kubeflow-pipelines is online now.

Great to see this eventually done!

/close

@k8s-ci-robot
Contributor

@chaodaiG: Closing this issue.

In response to this:

yes https://testgrid.k8s.io/googleoss-kubeflow-pipelines is online now.

Great to see this eventually done!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
