Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the Viewer CRD controller for managing web views such as Tensorboard instances from within the Pipelines UI. #449

Merged
merged 17 commits into from
Jan 5, 2019

Conversation

neuromage
Copy link
Contributor

@neuromage neuromage commented Dec 4, 2018

This PR adds a basic controller for managing a new type of CRD for managing instances of web views such as Tensorboard from within the Pipelines system.

Currently, the controller only supports Tensorboard views, but additional views should be trivial to add.

This PR makes use of the new Kubernetes controller-runtime libraries, which is the set of libraries underlying kubebuilder. These libraries make creating CRD controllers a lot easier by managing internal data structures like informers and workqueues, so we can focus on writing just the business logic.

I still need to do the following before merging this PR:

  • Add tests
  • Update the README to describe the approach and how to run this controller
  • Add YAML definitions for the Viewer CRD type

For now, I'd like to get some feedback on this approach, before adding any tests.

Once this PR goes in, I will send a separate PR for bundling this controller and deploying it as part of pipelines.


This change is Reviewable

@neuromage
Copy link
Contributor Author

Fixes #443

@neuromage
Copy link
Contributor Author

/retest

Copy link
Contributor

@vicaire vicaire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great! Thanks for finding a new way to create CRDs more easily. I have a few comments.

}

// New returns a new Reconciler.
func New(cli client.Client, scheme *runtime.Scheme) *Reconciler {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you only creating/deleting the instances?

What about limiting the number of total instances?

What about automatically terminating an instance?

Does it fit with this scheme?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! I updated the logic to limit the number of instances as discussed. I default to 50 but it's configurable via a flag. When we hit the limit, we delete the oldest one before creating the next one. I also added a test for this behaviour.

package viewer

const (
Kind string = "Viewer"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: the name does not really fit what it currently does. Any other idea? If not, let's keep this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I agree, it's not a great name. I thought of viewer since it's meant to proxy views of another webapp through ambassador that we launch and control. I can't think of another one right now, and I'm open to suggestions :-)

return reconcile.Result{}, nil
}

func setPodSpecForTensorboard(view *viewerV1alpha1.Viewer, s *corev1.PodSpec) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you setting the owner references for the created resources so that garbage collection is taken care of?

Do you need labels so that we can query by type of resource? Otherwise the UI may list all the resources which could be a problem once we add additional types.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, the owner reference ensures Kubernetes will take care of garbage collection of the deployment+service whenever we delete a viewer instance.

Good idea on the labels. I added labels to the created deployment+service so we can easily query them for them by asking for all viewer created ones, as well as those created for a given viewer type.

}
}
glog.Infof("Created new deployment with spec: %+v", dpl)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to write simple fakes like in the case of the persistence agent:

https://github.com/kubeflow/pipelines/tree/master/backend/src/agent/persistence/client

So that we can test end to end?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, there is. This would be what we'd use:
https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/reconcile/reconciletest

I will add an e2e test in a follow up PR. For now, I added unit tests testing the behaviour of the reconciler using the fake kubernetes client here:
https://godoc.org/github.com/kubernetes-sigs/controller-runtime/pkg/client/fake

@neuromage
Copy link
Contributor Author

Added unit tests and CRD definitions. This is ready to be looked at again.

/retest

@neuromage
Copy link
Contributor Author

/retest

@neuromage
Copy link
Contributor Author

/assign @IronPan

type TensorboardSpec struct {
// LogDir is the location of the log directory to be read by tensorboard, i.e.,
// ---log_dir.
LogDir string `json:"logDir"`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be an array (slice?) of strings. Tensorboard supports comparing outputs by passing a list of logdirs, we're using this feature in run comparison.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's just a comma separated list right? Which can be passed in directly here. This may be more flexible, in that I'm not reinterpreting the arguments to Tensorboard. For example, TB allows naming some of those runs as well. Rather than structure that here, just pass in the expected format to Tensorboard. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's fine I guess, although structuring it would give some value, because the structure is not very straightforward. Something like "name1:<encoded_path_1>,name2:<encoded_path_2>" is easy to mistype.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, but if we're planning on creating the viewer through UI, then we can try to construct it for the user there instead before sending the create request?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, we can have clients do it.

const (
// ViewerTypeTensorboard is the ViewerType constant used to indicate that the
// underlying type is Tensorboard. An instance named `instance123` will serve
// under /tensorboard/instance123.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should also consider mapping logdirs to instances in the route if we can, so that the user can just navigate to /tensorboard/encoded-paths-here, and they get routed to a new or existing instance by the CRD. Is this doable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked into this a bit, and at this point, I don't think we can. In that, the CRD isn't watching the routes, so it can't create a new one when the user navigates to it. The other thing is, I think it's worth having short names (right now, auto-generated by k8s) to give a shorter user-facing url. Mapping the logdirs to a name may result in a longer string, which I'm not in favour of.

Tangential to all this is keeping a unique a instance of TB for a given set of logdirs. I think this is a nice to have optimization, and I'm happy to look into adding this functionality in a follow up PR. What do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who keeps track of that mapping though?

This isn't P0 by any means for this implementation, but it's good to keep it in mind while we're working on it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine the right approach would be to label the viewer with the logdirs, which you can select for to ensure it's not already created. The reconciler here aims to be stateless as much as possible.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM.

@yebrahim
Copy link
Contributor

/lgtm

```

The Tensorboard instance should be accessible via Ambassador. Set up port
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i am wondering if ambassador should be mentioned at this level. It's a kubeflow component. Ideally the TB CRD should be able to run and accessible without ambassador, instead port-forwarding the tb instance pod.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought one of the reasons for having a CRD is to manage the routing for the user? We also don't want to be re-inventing routing which is why I am depending on ambassador, which seems reasonable since it's part of a kubeflow install. Port-forwarding directly to the pod is still do-able. The routing here enables us to serve all TB instances under /tensorboard/, on the same address as the pipeline UI, which is under /pipeline/.

viewer.kubeflow.org/viewer-75tkf created

$ kube getctl -n kubeflow vi
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is kube a customized command alias?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as below

spec:
type: tensorboard
tensorboardSpec:
logDir: "gs://ml-pipeline-playground/mnist/"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is the trailing / needed here?

http://localhost:8000/tensorboard/viewer-75tkf/. Note that the last path
corresponds to the new viewer name, and the URL must end with the trailing
slash.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you know why it need a trailing slash?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to be quirk of tensorboard serving under a specific path-prefix. I don't know why exactly.

// Package reconciler describes a Reconciler for working with Viewer CRDs. The
// reconciler takes care of the main logic for ensuring every Viewer CRD
// corresponds to a unique deployment and backing service. The service is
// annotated such that it is compatible with Ambassador managed routing.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it a requirement to have ambassador as dependency? Can we decouple pipeline from ambassador?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't the pipeline UI also use ambassador currently for routing tensorboard instances?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, I removed explicit references to Ambassador. This now works both with and without (via direct port forwarding). I added instructions in the README for how to use the latter approach as well.

s.Containers = append(s.Containers, c)
}

func deploymentFrom(view *viewerV1alpha1.Viewer) (*appsv1.Deployment, error) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it possible to launch a TB pod with a volume mounted, instead of a gcs path?
I guess it's as simple as passing right spec to PodTemplateSpec. is that correct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty much. I did have to make a slight change to how I set up the container so I can reuse the existing volume mount if one is specified. I also added a unit test for this, and verified it works in my own cluster. I will add testing this functionality as part of my e2e test in an upcoming PR.

@IronPan
Copy link
Member

IronPan commented Jan 2, 2019

Do you mind having a test/sample about using local logdir with persistent volume?

@k8s-ci-robot k8s-ci-robot removed the lgtm label Jan 3, 2019
Also add a sample YAML to show how to mount and use a GCE persistent
disk in the viewer CRD.
@neuromage
Copy link
Contributor Author

Added tests and sample for persistent volumes, and also removed the explicit dependency on Ambassador. Routing should work through direct port forwarding now as well, and I added this to the README. PTAL.

@neuromage
Copy link
Contributor Author

/retest

1 similar comment
@neuromage
Copy link
Contributor Author

/retest

@IronPan
Copy link
Member

IronPan commented Jan 5, 2019

/test kubeflow-pipeline-e2e-test

@IronPan
Copy link
Member

IronPan commented Jan 5, 2019

/lgtm
/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IronPan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: IronPan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit eea6999 into kubeflow:master Jan 5, 2019
Linchin pushed a commit to Linchin/pipelines that referenced this pull request Apr 11, 2023
magdalenakuhn17 pushed a commit to magdalenakuhn17/pipelines that referenced this pull request Oct 22, 2023
HumairAK pushed a commit to red-hat-data-services/data-science-pipelines that referenced this pull request Mar 11, 2024
* fix AnySequencer syntax

* Update tekton-catalog/any-sequencer/README.md

Co-authored-by: Tommy Li <Tommy.chaoping.li@ibm.com>

Co-authored-by: Tommy Li <Tommy.chaoping.li@ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants