
[WIP] Image enhancement example #60

Closed
wants to merge 3 commits into from

Conversation


@cwbeitel cwbeitel commented Mar 27, 2018

Steps 1 - 3 of 10 from #59

  • Launcher interface for running component steps in batch and testing for job success; each step is smoke-tested to run in batch, at minimum displaying its help message
  • Illustrate a tfhub-based development workflow (primarily in regard to how model code and dependencies are shipped to jobs) that sufficiently minimizes friction and has the support of the community (needs discussion)
  • Batch data downloader pulls raw data to NFS
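The smoke-test idea in the first step above can be sketched as a small check that a component command runs in batch and at least prints its help text. This is a minimal illustrative sketch, not code from this PR; the `smoke_test` helper and the use of `python --help` as a stand-in for a component step are assumptions:

```python
# Hypothetical smoke test: a component step passes if it exits 0 and
# prints something (e.g. its help message) to stdout.
import subprocess
import sys

def smoke_test(cmd):
    """Return True if `cmd` exits 0 and writes non-empty output to stdout."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    return result.returncode == 0 and bool(result.stdout.strip())

# Stand-in for a real step command such as ["python", "launcher.py", "--help"].
ok = smoke_test([sys.executable, "--help"])
```

A real launcher could run this same check against each step's `--help` invocation before submitting it as a batch job.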

One notable change here is a divergence from using ksonnet to submit training jobs (as in the agents example) to a pure Python approach. This can be refactored to use Kubernetes Python client objects if people see a specific benefit in doing so.
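As a rough illustration of the pure-Python approach, a batch Job manifest with an NFS-backed volume can be composed as a plain dict and handed to the Kubernetes API. This is a sketch under assumptions, not the actual launcher.py code; the function name, image, command, and claim name are all placeholders:

```python
# Sketch: composing a Kubernetes batch/v1 Job spec in plain Python,
# mounting an NFS-backed PersistentVolumeClaim for code and data.
# All concrete names below are illustrative.

def build_job_spec(name, image, command, nfs_claim="nfs-claim"):
    """Return a batch/v1 Job manifest that mounts an NFS-backed volume."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "template": {
                "spec": {
                    "containers": [{
                        "name": name,
                        "image": image,
                        "command": command,
                        "volumeMounts": [{"name": "nfs",
                                          "mountPath": "/mnt/nfs"}],
                    }],
                    "volumes": [{
                        "name": "nfs",
                        "persistentVolumeClaim": {"claimName": nfs_claim},
                    }],
                    "restartPolicy": "Never",
                },
            },
            "backoffLimit": 2,
        },
    }

job = build_job_spec("download-data", "example-base:latest",
                     ["python", "launcher.py", "--step=download"])
```

Keeping the spec as a dict makes it trivial to vary fields (e.g. hyperparameters in the command) from Python, which is the simplification the change above is after.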

See more detailed notes: cwbeitel@2eb3198



cwbeitel added 2 commits March 12, 2018 20:15
- Data downloader is functional and runs in batch.
- Example-generator step appears functional and runs in batch.
- Other steps run in batch but are, to varying degrees, not yet implemented
beyond their interface with launcher.py (though single-example training
and decoding shouldn't need much implementation, since they leverage
t2t-trainer and t2t-decoder).
- Illustrates the use of Python alone, eliminating ksonnet, for
managing job config and launching. In my view this is a great
simplification and strongly sets up for Pythonic hparam management by
the hyperparameter tuner.
- Includes an experiment with building containers with FTL, which
appears not to work with certain dependencies like tensor2tensor.
Relatedly, experimented with building containers with the Bazel docker
build rule; this completed without error, but various dependencies
like tensorflow could not be imported in the resulting container
(whereas others, like tensorboard, could).
- Currently using the approach of building a base container that includes
all dependencies and shipping model code via NFS with each run, which
has the added benefit of archiving the code used in a particular
run alongside the model parameters that were produced. This approach
makes the remote dev loop very tight, but I'm still interested in FTL and
Bazel for building both containers.
- This example currently presumes NFS is deployed, but in the
future we can both add logic to check this and generalize the
types of attached volumes that are supported, which shouldn't be hard.
- Need a good solution for progressive testing, given that tests do,
and increasingly will, include rather long-running jobs.
- Beginning toward an implementation of the hyperparameter tuner in which a
single tuner service queries job state, collects results, and submits
new jobs, as opposed to a model in which a fixed collection of jobs starts
and continues running, changing their choice of hyperparameters if
needed (as learn_runner.tune appears to be designed/stubbed to do?).
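The tuner design in the last bullet (one service that polls job state, collects results, and submits new trials) can be sketched with an in-memory stand-in for the cluster. Everything here is hypothetical: a real tuner would query the Kubernetes API for Job status instead of the fake backend, and the hyperparameter sampling is illustrative only:

```python
# Sketch of a single-tuner-service loop: submit a trial job, poll for
# finished jobs, fold their results into a running best, repeat.
import random

class FakeJobBackend:
    """Stands in for the cluster: every submitted job 'finishes' with a
    pseudo-random evaluation score by the next poll."""
    def __init__(self, seed=0):
        self._rng = random.Random(seed)
        self._pending = {}

    def submit(self, name, hparams):
        self._pending[name] = (hparams, self._rng.random())

    def poll_finished(self):
        done, self._pending = self._pending, {}
        return done

def tune(backend, num_trials=5, seed=1):
    """Sequentially submit trials, tracking the best (score, hparams)."""
    rng = random.Random(seed)
    best = None
    for i in range(num_trials):
        hparams = {"learning_rate": 10.0 ** -rng.randint(2, 5)}
        backend.submit("trial-%d" % i, hparams)
        for _, (hp, score) in backend.poll_finished().items():
            if best is None or score > best[0]:
                best = (score, hp)
    return best

best_score, best_hparams = tune(FakeJobBackend())
```

The point of this shape, as opposed to fixed long-running jobs that re-pick their own hyperparameters, is that all trial state lives in one service, so jobs stay simple and stateless.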
@cwbeitel

Suggesting people I think would be relevant reviewers and approvers, but I have no strong preferences.
/cc @ankushagarwal
/cc @texasmichelle
/assign @jlewi
/uncc @DjangoPeng
/uncc @zjj2wry

@k8s-ci-robot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To fully approve this pull request, please assign additional approvers.
We suggest the following additional approver: jlewi

Assign the PR to them by writing /assign @jlewi in a comment when ready.

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot

@cwbeitel: The following test failed, say /retest to rerun them all:

Test name: kubeflow-examples-presubmit
Commit: 6b4e406
Rerun command: /test kubeflow-examples-presubmit

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.


jlewi commented Mar 28, 2018

See comments in #69

@cwbeitel cwbeitel closed this Mar 30, 2018

3 participants