
[WIP] Image enhancement example #60

Closed
wants to merge 3 commits into from

Conversation


@cwbeitel cwbeitel commented Mar 27, 2018

Steps 1 - 3 of 10 from #59

  • Launcher interface for running component steps in batch and testing for job success; each step is smoke-tested to run in batch, at minimum displaying its help message
  • Illustrate a tfhub-based development workflow (primarily in regard to how model code and dependencies are shipped to jobs) that sufficiently minimizes friction and has the support of the community (needs discussion)
  • Batch data downloader pulls raw data to NFS
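The smoke-test idea in the first step above can be sketched as a small check that a component command runs in batch and at least prints its help text. This is a minimal illustrative sketch, not code from this PR; the `smoke_test` helper and the use of `python --help` as a stand-in for a component step are assumptions:

```python
# Hypothetical smoke test: a component step passes if it exits 0 and
# prints something (e.g. its help message) to stdout.
import subprocess
import sys

def smoke_test(cmd):
    """Return True if `cmd` exits 0 and writes non-empty output to stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0 and bool(result.stdout.strip())

# Stand-in for a real step command such as ["python", "launcher.py", "--help"].
ok = smoke_test([sys.executable, "--help"])
```

A real launcher could run this same check against each step's `--help` invocation before submitting it as a batch job.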

One notable change here is a divergence from using ksonnet to submit training jobs (as in the agents example) to a pure Python approach. This can be refactored to use Kubernetes Python client objects if people see a specific benefit in doing so.
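As a rough illustration of the pure-Python approach, a batch Job manifest with an NFS-backed volume can be composed as a plain dict and handed to the Kubernetes API. This is a sketch under assumptions, not the actual launcher.py code; the function name, image, command, and claim name are all placeholders:

```python
# Sketch: composing a Kubernetes batch/v1 Job spec in plain Python,
# mounting an NFS-backed PersistentVolumeClaim for code and data.
# All concrete names below are illustrative.

def build_job_spec(name, image, command, nfs_claim="nfs-claim"):
    """Return a batch/v1 Job manifest that mounts an NFS-backed volume."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                        "volumeMounts": [{"name": "nfs",
                                          "mountPath": "/mnt/nfs"}],
                    }],
                    "volumes": [{
                        "name": "nfs",
                        "persistentVolumeClaim": {"claimName": nfs_claim},
                    }],
                    "restartPolicy": "Never",
                },
            },
            "backoffLimit": 2,
        },
    }

job = build_job_spec("download-data", "example-base:latest",
                     ["python", "launcher.py", "--step=download"])
```

Keeping the spec as a dict makes it trivial to vary fields (e.g. hyperparameters in the command) from Python, which is the simplification the change above is after.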

See more detailed notes: cwbeitel@2eb3198



cwbeitel added 2 commits March 12, 2018 20:15
- Data downloader is functional and runs in batch.
- Example-generator step appears functional and runs in batch.
- Other steps run in batch but are, to varying degrees, not yet implemented
beyond their interface with launcher.py (though single-example training
and decoding shouldn't need much implementation, since they leverage
t2t-trainer and t2t-decoder).
- Illustrates the use of Python alone, eliminating ksonnet, for
managing job config and launching. In my view this is a great
simplification and strongly sets up for Pythonic hparam management by
the hyperparameter tuner.
- Includes an experiment with building containers with FTL, which
appears not to work with certain dependencies like tensor2tensor.
Relatedly, experimented with building containers with the Bazel docker
build rule; this completed without error, but various dependencies
like tensorflow could not be imported in the resulting container
(whereas others, like tensorboard, could).
- Currently using the approach of building a base container that includes
all dependencies and shipping model code via NFS with each run, which
has the added benefit of archiving the code used in a particular
run alongside the model parameters that were produced. This approach
makes the remote dev loop very tight, but I'm still interested in FTL and
Bazel for building both containers.
- This example currently presumes NFS is deployed, but in the
future we can both add logic to check this and generalize the
types of attached volumes that are supported, which shouldn't be hard.
- Need a good solution for progressive testing, given that tests do,
and increasingly will, include rather long-running jobs.
- Beginning toward an implementation of the hyperparameter tuner in which a
single tuner service queries job state, collects results, and submits
new jobs, as opposed to a model in which a fixed collection of jobs starts
and continues running, changing their choice of hyperparameters if
needed (as learn_runner.tune appears to be designed/stubbed to do?).
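The tuner design in the last bullet (one service that polls job state, collects results, and submits new trials) can be sketched with an in-memory stand-in for the cluster. Everything here is hypothetical: a real tuner would query the Kubernetes API for Job status instead of the fake backend, and the hyperparameter sampling is illustrative only:

```python
# Sketch of a single-tuner-service loop: submit a trial job, poll for
# finished jobs, fold their results into a running best, repeat.
import random

class FakeJobBackend:
    """Stands in for the cluster: every submitted job 'finishes' with a
    pseudo-random evaluation score by the next poll."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._pending = {}

    def submit(self, name, hparams):
        self._pending[name] = (hparams, self._rng.random())

    def poll_finished(self):
        done, self._pending = self._pending, {}
        return done

def tune(backend, num_trials=5, seed=1):
    """Sequentially submit trials, tracking the best (score, hparams)."""
    rng = random.Random(seed)
    best = None
    for i in range(num_trials):
        hparams = {"learning_rate": 10.0 ** -rng.randint(2, 5)}
        backend.submit("trial-%d" % i, hparams)
        for _, (hp, score) in backend.poll_finished().items():
            if best is None or score > best[0]:
                best = (score, hp)
    return best

best_score, best_hparams = tune(FakeJobBackend())
```

The point of this shape, as opposed to fixed long-running jobs that re-pick their own hyperparameters, is that all trial state lives in one service, so jobs stay simple and stateless.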
@cwbeitel

Suggesting people I think would be relevant reviewers and approvers, but I have no strong preferences.
/cc @ankushagarwal
/cc @texasmichelle
/assign @jlewi
/uncc @DjangoPeng
/uncc @zjj2wry

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jlewi

Assign the PR to them by writing /assign @jlewi in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

@cwbeitel: The following test failed, say /retest to rerun them all:

Test name: kubeflow-examples-presubmit
Commit: 6b4e406
Rerun command: /test kubeflow-examples-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


jlewi commented Mar 28, 2018

See comments in #69

@cwbeitel cwbeitel closed this Mar 30, 2018

3 participants