Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor the TFServing component to better support GPUs and specific clouds #387

Merged
merged 3 commits into from
Mar 8, 2018

Conversation

jlewi
Copy link
Contributor

@jlewi jlewi commented Mar 7, 2018

  • To support GPUs and specific clouds we refactor the component to make
    it easy to override the parts we care about (e.g. container environment
    variables, resources, etc...).

  • We do this by moving the things we care about up to the root of tf-serving.libsonnet.

  • We rely on jsonnet late binding (http://jsonnet.org/docs/tutorial.html).
    Late binding allows us to devine dictionaries (e.g. params, tfServingContainer)
    in tf-serving.libsonnet. We can then create manifests based on those
    objects (e.g. tfDeployment). We can then override values (e.g. params) and
    the derived objecs (e.g. tfDeployment) will use the overwritten values.

  • We introduce a parameter "cloud" which allows us to control which "prototype" to use. We use this to use cloud specific customizations; like setting
    the environment variables on AWS to use S3.

  • Late binding also makes it possible to select an appropriate default image
    based on whether GPUs are bing used or not while still allowing the
    user to override the images.

  • We remove parameter definitions from the prototypes. The set of parameters
    ends up being conditional based on flags like cloud, GPUs so its
    not clear how scalable that was.

Related Issues:
#376 patterns for ksonnet prototypes.
Fix #292


This change is Reviewable

@jlewi
Copy link
Contributor Author

jlewi commented Mar 7, 2018

/cc @lluunn
/cc @elsonrodriguez
/uncc @DjangoPeng
/uncc @jimexist

@k8s-ci-robot k8s-ci-robot requested review from elsonrodriguez and lluunn and removed request for jimexist and DjangoPeng March 7, 2018 23:15
@jlewi
Copy link
Contributor Author

jlewi commented Mar 8, 2018

/test all

…clouds.

* To support GPUs and specific clouds we refactor the component to make
  it easy to override the parts we care about (e.g. container environment
  variables, resources, etc...).

* We do this by moving the things we care about up to the root of tf-serving.libsonnet.

* We rely on jsonnet late binding (http://jsonnet.org/docs/tutorial.html).
Late binding allows us to devine dictionaries (e.g. params, tfServingContainer)
in tf-serving.libsonnet. We can then create manifests based on those
objects (e.g. tfDeployment). We can then override values (e.g. params) and
the derived objecs (e.g. tfDeployment) will use the overwritten values.

* We introduce a parameter "cloud" which allows us to control which "prototype" to use. We use this to use cloud specific customizations; like setting
  the environment variables on AWS to use S3.

* Late binding also makes it possible to select an appropriate default image
  based on whether GPUs are bing used or not while still allowing the
  user to override the images.

* We remove parameter definitions from the prototypes. The set of parameters
  ends up being conditional based on flags like cloud, GPUs so its
  not clear how scalable that was.

* Use camelCase not underscores for parameters.
  See kubeflow#303.

Related Issues:

Fix kubeflow#292

Update the test to work with the changes.

* Parameters are now camelCase. They also aren't parameters of the
  prototype so we can't set them in the call to generate.

* So we need to modify deploy to take a list of the parameters to set
  on the component.
@lluunn
Copy link
Contributor

lluunn commented Mar 8, 2018

Reviewed 6 of 6 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


kubeflow/tf-serving/tf-serving.libsonnet, line 143 at r1 (raw file):

          cpu: "1",
        },
        limits: {

Do you want to lower these numbers?


Comments from Reviewable

@lluunn
Copy link
Contributor

lluunn commented Mar 8, 2018

There are conflicts.
/lgtm

@jlewi
Copy link
Contributor Author

jlewi commented Mar 8, 2018

Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.


kubeflow/tf-serving/tf-serving.libsonnet, line 143 at r1 (raw file):

Previously, lluunn (Lun-Kai Hsu) wrote…

Do you want to lower these numbers?

We probably should. If you have good numbers we can change it otherwise we'll leave it for a follow on PR. There's an existing issue related to http proxy resources.


Comments from Reviewable

@k8s-ci-robot k8s-ci-robot removed the lgtm label Mar 8, 2018
@lluunn
Copy link
Contributor

lluunn commented Mar 8, 2018

Reviewed 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 1 unresolved discussion.


Comments from Reviewable

@lluunn
Copy link
Contributor

lluunn commented Mar 8, 2018

/lgtm

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lluunn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 179abd0 into kubeflow:master Mar 8, 2018
components:: {

all::
if $.params.cloud == "aws" then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be better as params.s3Enable. The S3 functionality isn't specific to AWS.

// Parts specific to S3
s3parts:: $.parts {
s3Env:: [
{ name: "AWS_ACCESS_KEY_ID", valueFrom: { secretKeyRef: { name: $.s3params.s3SecretName, key: $.s3params.s3SecretAcesskeyidKeyName } } },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/s3SecretAcesskeyidKeyName/s3SecretAccesskeyidKeyName

@jlewi
Copy link
Contributor Author

jlewi commented Mar 9, 2018

Looks like this got auto-merged because Lunkai is an approver (its a quirk of the automatic review system). I'll send a new PR.

jlewi added a commit to jlewi/kubeflow that referenced this pull request Mar 9, 2018
* Use the params s3Enable not cloud to decide whether to enable S3 modifications. Use of S3 can be orthogonal to cloud.

* Fix a typo in the variable name.

* PR kubeflow#387 was automatically committed as a quirk of our auto-merge code
even though there were outstanding comments.
jlewi added a commit to jlewi/kubeflow that referenced this pull request Mar 9, 2018
* Use the params s3Enable not cloud to decide whether to enable S3 modifications. Use of S3 can be orthogonal to cloud.

* Fix a typo in the variable name.

* PR kubeflow#387 was automatically committed as a quirk of our auto-merge code
even though there were outstanding comments.
k8s-ci-robot pushed a commit that referenced this pull request Mar 12, 2018
* Use the params s3Enable not cloud to decide whether to enable S3 modifications. Use of S3 can be orthogonal to cloud.

* Fix a typo in the variable name.

* PR #387 was automatically committed as a quirk of our auto-merge code
even though there were outstanding comments.
yanniszark pushed a commit to arrikto/kubeflow that referenced this pull request Feb 15, 2021
* Removing Operator specific handling during a StudyJob run

* Return empty in error
elenzio9 pushed a commit to arrikto/kubeflow that referenced this pull request Oct 31, 2022
chore: use Bobgy's github token instead. Fixes kubeflow#387
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants