Add the Viewer CRD controller for managing web views such as Tensorboard instances from within the Pipelines UI. #449

neuromage · 2018-12-04T00:11:33Z

This PR adds a basic controller for a new CRD type that manages instances of web views, such as Tensorboard, from within the Pipelines system.

Currently, the controller only supports Tensorboard views, but additional views should be trivial to add.

This PR makes use of the new Kubernetes controller-runtime libraries, which are the set of libraries underlying kubebuilder. These libraries make creating CRD controllers a lot easier by managing internal data structures like informers and workqueues, so we can focus on writing just the business logic.

I still need to do the following before merging this PR:

Add tests
Update the README to describe the approach and how to run this controller
Add YAML definitions for the Viewer CRD type

For now, I'd like to get some feedback on this approach, before adding any tests.

Once this PR goes in, I will send a separate PR for bundling this controller and deploying it as part of pipelines.


neuromage · 2018-12-04T00:12:56Z

Fixes #443

neuromage · 2018-12-04T04:03:14Z

/retest

vicaire

This looks great! Thanks for finding a new way to create CRDs more easily. I have a few comments.

vicaire · 2018-12-10T03:56:18Z

backend/src/crd/controller/viewer/reconciler/reconciler.go

+}
+
+// New returns a new Reconciler.
+func New(cli client.Client, scheme *runtime.Scheme) *Reconciler {


Are you only creating/deleting the instances?

What about limiting the number of total instances?

What about automatically terminating an instance?

Does it fit with this scheme?

Thanks! I updated the logic to limit the number of instances as discussed. I default to 50 but it's configurable via a flag. When we hit the limit, we delete the oldest one before creating the next one. I also added a test for this behaviour.

vicaire · 2018-12-10T03:58:00Z

backend/src/crd/pkg/apis/viewer/register.go

+package viewer
+
+const (
+	Kind      string = "Viewer"


Nit: the name does not really fit what it currently does. Any other idea? If not, let's keep this.

Yeah, I agree, it's not a great name. I thought of viewer since it's meant to proxy views of another webapp through ambassador that we launch and control. I can't think of another one right now, and I'm open to suggestions :-)

vicaire · 2018-12-10T04:02:37Z

backend/src/crd/controller/viewer/reconciler/reconciler.go

+	return reconcile.Result{}, nil
+}
+
+func setPodSpecForTensorboard(view *viewerV1alpha1.Viewer, s *corev1.PodSpec) {


Are you setting the owner references for the created resources so that garbage collection is taken care of?

Do you need labels so that we can query by type of resource? Otherwise the UI may list all the resources which could be a problem once we add additional types.

Yes, the owner reference ensures Kubernetes will take care of garbage collection of the deployment+service whenever we delete a viewer instance.

Good idea on the labels. I added labels to the created deployment+service so we can easily query for them, either by asking for all viewer-created ones or for those created for a given viewer type.
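To illustrate the two queries mentioned above (everything created by the viewer controller, and everything of one viewer type), here is a small sketch of equality-based label matching. The label keys and values are assumptions made up for this example; the keys the PR actually uses are not shown in this thread.

```go
package main

import "fmt"

// Hypothetical label keys, chosen only for this illustration.
const (
	labelManagedBy  = "example.com/managed-by"
	labelViewerType = "example.com/viewer-type"
)

// labelsFor returns the labels that would be attached to the deployment and
// service created for a viewer of the given type.
func labelsFor(viewerType string) map[string]string {
	return map[string]string{
		labelManagedBy:  "viewer-controller",
		labelViewerType: viewerType,
	}
}

// matches reports whether every key/value pair in selector is present in
// labels, which is how an equality-based label selector behaves.
func matches(labels, selector map[string]string) bool {
	for k, v := range selector {
		if got, ok := labels[k]; !ok || got != v {
			return false
		}
	}
	return true
}

func main() {
	tb := labelsFor("tensorboard")
	allViewers := map[string]string{labelManagedBy: "viewer-controller"}
	onlyTensorboard := map[string]string{labelViewerType: "tensorboard"}
	otherType := map[string]string{labelViewerType: "other"}
	fmt.Println(matches(tb, allViewers), matches(tb, onlyTensorboard), matches(tb, otherType))
	// prints: true true false
}
```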

vicaire · 2018-12-10T04:04:12Z

backend/src/crd/controller/viewer/reconciler/reconciler.go

+		}
+	}
+	glog.Infof("Created new deployment with spec: %+v", dpl)
+


Is there a way to write simple fakes like in the case of the persistence agent:

https://github.com/kubeflow/pipelines/tree/master/backend/src/agent/persistence/client

So that we can test end to end?

Yes, there is. This would be what we'd use:
https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/reconcile/reconciletest

I will add an e2e test in a follow up PR. For now, I added unit tests testing the behaviour of the reconciler using the fake kubernetes client here:
https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/client/fake

neuromage · 2018-12-17T20:33:08Z

Added unit tests and CRD definitions. This is ready to be looked at again.

/retest


neuromage · 2018-12-17T22:16:14Z

/retest

neuromage · 2018-12-20T01:06:25Z

/assign @IronPan

yebrahim · 2018-12-20T05:42:54Z

backend/src/crd/pkg/apis/viewer/v1alpha1/types.go

+type TensorboardSpec struct {
+	// LogDir is the location of the log directory to be read by tensorboard, i.e.,
+	// ---log_dir.
+	LogDir string `json:"logDir"`


This should be an array (slice?) of strings. Tensorboard supports comparing outputs by passing a list of logdirs, we're using this feature in run comparison.

It's just a comma-separated list, right? That can be passed in directly here. This may be more flexible, in that I'm not reinterpreting the arguments to Tensorboard. For example, TB allows naming some of those runs as well. Rather than structure that here, just pass in the expected format to Tensorboard. What do you think?

That's fine I guess, although structuring it would give some value, because the structure is not very straightforward. Something like "name1:<encoded_path_1>,name2:<encoded_path_2>" is easy to mistype.

True, but if we're planning on creating the viewer through UI, then we can try to construct it for the user there instead before sending the create request?

Yup, we can have clients do it.
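As a rough idea of what "having clients do it" could look like, the sketch below assembles the `name1:path1,name2:path2` string from structured input and rejects characters that would break the format. `namedLogDir` and `joinLogDirs` are hypothetical helpers, not part of this PR, and the paths are passed through as-is without any encoding.

```go
package main

import (
	"fmt"
	"strings"
)

// namedLogDir pairs an optional run name with a log directory path.
type namedLogDir struct {
	Name string
	Path string
}

// joinLogDirs builds a comma-separated "name:path" list. Entries without a
// name are emitted as a bare path.
func joinLogDirs(dirs []namedLogDir) (string, error) {
	parts := make([]string, 0, len(dirs))
	for _, d := range dirs {
		if d.Path == "" {
			return "", fmt.Errorf("log dir %q has an empty path", d.Name)
		}
		// A comma or colon in the name, or a comma in the path, would make
		// the joined string ambiguous.
		if strings.ContainsAny(d.Name, ",:") || strings.Contains(d.Path, ",") {
			return "", fmt.Errorf("log dir %q contains a reserved character", d.Name)
		}
		if d.Name == "" {
			parts = append(parts, d.Path)
			continue
		}
		parts = append(parts, d.Name+":"+d.Path)
	}
	return strings.Join(parts, ","), nil
}

func main() {
	s, err := joinLogDirs([]namedLogDir{
		{Name: "run1", Path: "gs://bucket/a"},
		{Name: "run2", Path: "gs://bucket/b"},
	})
	fmt.Println(s, err) // prints run1:gs://bucket/a,run2:gs://bucket/b <nil>
}
```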

yebrahim · 2018-12-20T05:45:12Z

backend/src/crd/pkg/apis/viewer/v1alpha1/types.go

+const (
+	// ViewerTypeTensorboard is the ViewerType constant used to indicate that the
+	// underlying type is Tensorboard. An instance named `instance123` will serve
+	// under /tensorboard/instance123.


We should also consider mapping logdirs to instances in the route if we can, so that the user can just navigate to /tensorboard/encoded-paths-here, and they get routed to a new or existing instance by the CRD. Is this doable?

I looked into this a bit, and at this point I don't think we can: the CRD isn't watching the routes, so it can't create a new instance when the user navigates to one. The other thing is, I think it's worth having short names (right now, auto-generated by k8s) to give a shorter user-facing URL. Mapping the logdirs to a name may result in a longer string, which I'm not in favour of.

Tangential to all this is keeping a unique instance of TB for a given set of logdirs. I think this is a nice-to-have optimization, and I'm happy to look into adding this functionality in a follow-up PR. What do you think?

Who keeps track of that mapping though?

This isn't P0 by any means for this implementation, but it's good to keep it in mind while we're working on it.

I imagine the right approach would be to label the viewer with the logdirs, which you can select for to ensure it's not already created. The reconciler here aims to be stateless as much as possible.
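One way the labelling idea above might be sketched: Kubernetes label values are limited to 63 characters from a restricted character set, so raw logdir paths will not fit, but a truncated hash of them will. The `logDirsLabelValue` helper is hypothetical and not part of this PR; sorting the inputs is an assumption so that the same set of logdirs yields the same value regardless of order.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// logDirsLabelValue derives a short, label-safe value from a set of logdirs.
// The result is 32 lowercase hex characters, within the 63-character limit
// for label values.
func logDirsLabelValue(logDirs []string) string {
	// Sort a copy so the value does not depend on input order.
	sorted := append([]string(nil), logDirs...)
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return hex.EncodeToString(sum[:])[:32]
}

func main() {
	a := logDirsLabelValue([]string{"gs://bucket/run1", "gs://bucket/run2"})
	b := logDirsLabelValue([]string{"gs://bucket/run2", "gs://bucket/run1"})
	fmt.Println(len(a), a == b) // prints 32 true
}
```

A client could then select on this label before creating a viewer, which keeps the reconciler itself stateless.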

yebrahim · 2018-12-20T19:52:45Z

/lgtm

IronPan · 2019-01-02T20:46:24Z

backend/src/crd/README.md

+```
+
+The Tensorboard instance should be accessible via Ambassador. Set up port


I am wondering if Ambassador should be mentioned at this level. It's a Kubeflow component. Ideally the TB CRD should be able to run and be accessible without Ambassador, by port-forwarding to the TB instance pod instead.

I thought one of the reasons for having a CRD is to manage the routing for the user? We also don't want to be re-inventing routing, which is why I am depending on Ambassador; that seems reasonable since it's part of a Kubeflow install. Port-forwarding directly to the pod is still doable. The routing here enables us to serve all TB instances under /tensorboard/, on the same address as the pipeline UI, which is under /pipeline/.

IronPan · 2019-01-02T20:47:53Z

backend/src/crd/README.md

+viewer.kubeflow.org/viewer-75tkf created
+
+$ kube getctl -n kubeflow vi


is kube a customized command alias?

same as below

IronPan · 2019-01-02T20:49:21Z

backend/src/crd/samples/viewer/mnist.yaml

+spec:
+  type: tensorboard
+  tensorboardSpec:
+    logDir: "gs://ml-pipeline-playground/mnist/"


is the trailing / needed here?

IronPan · 2019-01-02T20:53:48Z

backend/src/crd/README.md

+http://localhost:8000/tensorboard/viewer-75tkf/. Note that the last path
+corresponds to the new viewer name, and the URL must end with the trailing
+slash.


Do you know why it needs a trailing slash?

It seems to be a quirk of Tensorboard serving under a specific path prefix. I don't know why exactly.
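Since the trailing slash is easy to forget, a client could build the path with a tiny helper along these lines. `viewerPath` is a hypothetical function for illustration only; the `/tensorboard/<name>/` shape comes from the README excerpt quoted above.

```go
package main

import (
	"fmt"
	"net/url"
)

// viewerPath returns the path under which a viewer instance is served,
// always ending with the trailing slash.
func viewerPath(viewerType, name string) string {
	return "/" + url.PathEscape(viewerType) + "/" + url.PathEscape(name) + "/"
}

func main() {
	fmt.Println(viewerPath("tensorboard", "viewer-75tkf")) // prints /tensorboard/viewer-75tkf/
}
```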

IronPan · 2019-01-02T21:04:25Z

backend/src/crd/controller/viewer/reconciler/reconciler.go

+// Package reconciler describes a Reconciler for working with Viewer CRDs. The
+// reconciler takes care of the main logic for ensuring every Viewer CRD
+// corresponds to a unique deployment and backing service. The service is
+// annotated such that it is compatible with Ambassador managed routing.


Is it a requirement to have Ambassador as a dependency? Can we decouple pipeline from Ambassador?

Doesn't the pipeline UI also use ambassador currently for routing tensorboard instances?

As discussed offline, I removed explicit references to Ambassador. This now works both with and without (via direct port forwarding). I added instructions in the README for how to use the latter approach as well.

IronPan · 2019-01-02T21:31:58Z

backend/src/crd/controller/viewer/reconciler/reconciler.go

+	s.Containers = append(s.Containers, c)
+}
+
+func deploymentFrom(view *viewerV1alpha1.Viewer) (*appsv1.Deployment, error) {


Is it possible to launch a TB pod with a volume mounted, instead of a GCS path?
I guess it's as simple as passing the right spec to PodTemplateSpec. Is that correct?

Pretty much. I did have to make a slight change to how I set up the container so I can reuse the existing volume mount if one is specified. I also added a unit test for this, and verified it works in my own cluster. I will add testing this functionality as part of my e2e test in an upcoming PR.

IronPan · 2019-01-02T21:38:37Z

Do you mind having a test/sample about using local logdir with persistent volume?


neuromage · 2019-01-04T02:17:48Z

Added tests and sample for persistent volumes, and also removed the explicit dependency on Ambassador. Routing should work through direct port forwarding now as well, and I added this to the README. PTAL.

neuromage · 2019-01-04T22:13:51Z

/retest

neuromage · 2019-01-05T00:08:45Z

/retest

IronPan · 2019-01-05T00:58:35Z

/test kubeflow-pipeline-e2e-test

IronPan · 2019-01-05T01:02:30Z

/lgtm
/approve

k8s-ci-robot · 2019-01-05T01:02:35Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IronPan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [IronPan]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


neuromage added do-not-merge/work-in-progress area/backend labels Dec 4, 2018

neuromage requested a review from vicaire December 4, 2018 00:11

k8s-ci-robot removed do-not-merge/work-in-progress labels Dec 4, 2018

k8s-ci-robot requested a review from IronPan December 4, 2018 00:11

k8s-ci-robot added size/XXL labels Dec 4, 2018

vicaire suggested changes Dec 10, 2018


neuromage added 13 commits December 17, 2018 13:47

Add initial CRD types for Viewer resource, and generate corresponding code.

4b6ccdf

Use controller-runtime to scaffold out a controller main

baf8591

Start adding a deployment

9ed249d

Clean up and separate reconciler logic into its own package for future testing.

b9b543c

Clean up with comments

43895bf

Run dep ensure

ab7a877

Update auto-generate script. Only need deepcopy funcs for viewer crd types

9581374

Cleanup previously generated but unused viewer client code

3e31241

[WIP] Adding tests

b283dc6

More tests

6a360ea

Completed unit tests for reconciler with logic for max viewers

268f8ea

Add CRD definition, sample instance and update README.

b1ed634

Fix merge conflict

9aed2f6

neuromage force-pushed the tensorboard-crd branch from a605de8 to 9aed2f6 Compare December 17, 2018 21:56

k8s-ci-robot assigned IronPan Dec 20, 2018

yebrahim reviewed Dec 20, 2018


k8s-ci-robot assigned yebrahim Dec 20, 2018

k8s-ci-robot added the lgtm label Dec 20, 2018

IronPan reviewed Jan 2, 2019


Fix readme typo for kube and add direct port-forwarding instructions.

c4b7d7b

k8s-ci-robot removed the lgtm label Jan 3, 2019

Add tests for when persistent volume is used with Tensorboard viewer.

51056c9

Also add a sample YAML to show how to mount and use a GCE persistent disk in the viewer CRD.

neuromage added 2 commits January 4, 2019 13:22

Merge with master and use go modules

1891752

Remove vendor directory

f1aee50

k8s-ci-robot added the lgtm label Jan 5, 2019

k8s-ci-robot added the approved label Jan 5, 2019

k8s-ci-robot merged commit eea6999 into kubeflow:master Jan 5, 2019

neuromage mentioned this pull request Jan 5, 2019

Update WORKSPACE and BUILD files incoporating recent changes #639

Merged

