Refactor the TFServing component to better support GPUs and specific clouds #387

jlewi · 2018-03-07T23:14:32Z

To support GPUs and specific clouds, we refactor the component to make
it easy to override the parts we care about (e.g. container environment
variables, resources, etc.).
We do this by moving the things we care about up to the root of tf-serving.libsonnet.
We rely on jsonnet late binding (http://jsonnet.org/docs/tutorial.html).
Late binding allows us to define dictionaries (e.g. params, tfServingContainer)
in tf-serving.libsonnet. We can then create manifests based on those
objects (e.g. tfDeployment). We can then override values (e.g. params) and
the derived objects (e.g. tfDeployment) will use the overridden values.
We introduce a parameter "cloud" which allows us to control which "prototype" to use.
We use this to apply cloud-specific customizations, such as setting
the environment variables on AWS to use S3.
Late binding also makes it possible to select an appropriate default image
based on whether GPUs are being used or not, while still allowing the
user to override the images.
We remove parameter definitions from the prototypes. The set of parameters
ends up being conditional on flags like cloud and GPUs, so it's
not clear how scalable that was.
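
As a rough illustration of the late-binding pattern described above, here is a minimal sketch. It is not the actual contents of tf-serving.libsonnet: only the names params, tfServingContainer and tfDeployment come from the description, while numGpus, modelServerImage, defaultImage and the image names are hypothetical placeholders.

```jsonnet
// Simplified sketch of late binding; not the real tf-serving.libsonnet.
local base = {
  // Hidden field (::) holding overridable parameters.
  // numGpus and modelServerImage are hypothetical names for illustration.
  params:: {
    numGpus: 0,
    modelServerImage: null,
  },

  // The default image depends on params. Because self is late bound, this is
  // evaluated against the final object, after any overrides are applied.
  // The image names are placeholders.
  defaultImage::
    if self.params.numGpus > 0 then "example/model-server-gpu"
    else "example/model-server-cpu",

  tfServingContainer:: {
    name: "tf-serving",
    // An explicitly set image wins over the computed default.
    image:
      if $.params.modelServerImage != null then $.params.modelServerImage
      else $.defaultImage,
  },

  // The manifest is derived from the objects above.
  tfDeployment: {
    kind: "Deployment",
    spec: { template: { spec: { containers: [$.tfServingContainer] } } },
  },
};

// Overriding params changes every object derived from it.
base { params+:: { numGpus: 1 } }
```

Under these assumptions, evaluating the file should emit a tfDeployment whose container uses the GPU placeholder image, even though tfServingContainer and tfDeployment were defined before the override was applied.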

Related Issues:
#376 patterns for ksonnet prototypes.
Fix #292

jlewi · 2018-03-07T23:14:59Z

/cc @lluunn
/cc @elsonrodriguez
/uncc @DjangoPeng
/uncc @jimexist

jlewi · 2018-03-08T06:29:05Z

/test all

…clouds.

* To support GPUs and specific clouds, we refactor the component to make it easy to override the parts we care about (e.g. container environment variables, resources, etc.).
* We do this by moving the things we care about up to the root of tf-serving.libsonnet.
* We rely on jsonnet late binding (http://jsonnet.org/docs/tutorial.html). Late binding allows us to define dictionaries (e.g. params, tfServingContainer) in tf-serving.libsonnet. We can then create manifests based on those objects (e.g. tfDeployment). We can then override values (e.g. params) and the derived objects (e.g. tfDeployment) will use the overridden values.
* We introduce a parameter "cloud" which allows us to control which "prototype" to use. We use this to apply cloud-specific customizations, such as setting the environment variables on AWS to use S3.
* Late binding also makes it possible to select an appropriate default image based on whether GPUs are being used or not, while still allowing the user to override the images.
* We remove parameter definitions from the prototypes. The set of parameters ends up being conditional on flags like cloud and GPUs, so it's not clear how scalable that was.
* Use camelCase not underscores for parameters. See kubeflow#303.

Related Issues:
Fix kubeflow#292

Update the test to work with the changes.

* Parameters are now camelCase. They also aren't parameters of the prototype, so we can't set them in the call to generate.
* So we need to modify deploy to take a list of the parameters to set on the component.

lluunn · 2018-03-08T17:51:54Z

Reviewed 6 of 6 files at r1.
Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

kubeflow/tf-serving/tf-serving.libsonnet, line 143 at r1 (raw file):

          cpu: "1",
        },
        limits: {

Do you want to lower these numbers?

Comments from Reviewable

lluunn · 2018-03-08T17:52:20Z

There are conflicts.
/lgtm

jlewi · 2018-03-08T19:52:24Z

Review status: all files reviewed at latest revision, 1 unresolved discussion, some commit checks failed.

kubeflow/tf-serving/tf-serving.libsonnet, line 143 at r1 (raw file):

Previously, lluunn (Lun-Kai Hsu) wrote…

Do you want to lower these numbers?

We probably should. If you have good numbers we can change them; otherwise we'll leave it for a follow-on PR. There's an existing issue related to HTTP proxy resources.

Comments from Reviewable

lluunn · 2018-03-08T22:57:56Z

Reviewed 3 of 3 files at r2.
Review status: all files reviewed at latest revision, 1 unresolved discussion.

Comments from Reviewable

lluunn · 2018-03-08T22:58:08Z

/lgtm

k8s-ci-robot · 2018-03-08T22:58:14Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lluunn

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [lluunn]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

elsonrodriguez · 2018-03-08T22:59:14Z

kubeflow/tf-serving/tf-serving.libsonnet

+  components:: {
+
+    all::
+      if $.params.cloud == "aws" then


This would be better as params.s3Enable. The S3 functionality isn't specific to AWS.

elsonrodriguez · 2018-03-08T23:08:36Z

kubeflow/tf-serving/tf-serving.libsonnet

+  // Parts specific to S3
+  s3parts:: $.parts {
+    s3Env:: [
+      { name: "AWS_ACCESS_KEY_ID", valueFrom: { secretKeyRef: { name: $.s3params.s3SecretName, key: $.s3params.s3SecretAcesskeyidKeyName } } },


s/s3SecretAcesskeyidKeyName/s3SecretAccesskeyidKeyName

jlewi · 2018-03-09T02:55:07Z

Looks like this got auto-merged because Lunkai is an approver (it's a quirk of the automatic review system). I'll send a new PR.

* Use the params s3Enable not cloud to decide whether to enable S3 modifications. Use of S3 can be orthogonal to cloud.
* Fix a typo in the variable name.
* PR kubeflow#387 was automatically committed as a quirk of our auto-merge code even though there were outstanding comments.

k8s-ci-robot requested review from DjangoPeng and jimexist March 7, 2018 23:14

k8s-ci-robot added the size/L label Mar 7, 2018

k8s-ci-robot requested review from elsonrodriguez and lluunn and removed request for jimexist and DjangoPeng March 7, 2018 23:15

jlewi mentioned this pull request Mar 8, 2018

option naming inconsistencies #303

Closed

k8s-ci-robot added size/XL and removed size/L labels Mar 8, 2018

jlewi force-pushed the gpu_test branch from 0b9b344 to ffc8fbd Compare March 8, 2018 07:38

k8s-ci-robot assigned lluunn Mar 8, 2018

k8s-ci-robot added lgtm approved labels Mar 8, 2018

Merge remote-tracking branch 'upstream/master' into gpu_test

3faf5a3

k8s-ci-robot removed the lgtm label Mar 8, 2018

jsonnet format.

3e4d11f

k8s-ci-robot added the lgtm label Mar 8, 2018

k8s-ci-robot merged commit 179abd0 into kubeflow:master Mar 8, 2018

elsonrodriguez reviewed Mar 8, 2018

jlewi mentioned this pull request Mar 9, 2018

Fix 2 issues with s3 config from #387. #399

Merged

yanniszark pushed a commit to arrikto/kubeflow that referenced this pull request Feb 15, 2021

Removing Operator specific handling during a StudyJob run (kubeflow#387)

8a89b9e

* Removing Operator specific handling during a StudyJob run * Return empty in error

elenzio9 pushed a commit to arrikto/kubeflow that referenced this pull request Oct 31, 2022

Merge pull request kubeflow#388 from Bobgy/use-bobgy-token

166ebcd

chore: use Bobgy's github token instead. Fixes kubeflow#387
