[WIP] Image enhancement example #60
- The data downloader is functional and runs in batch.
- The example-generator step appears functional and runs in batch.
- Other steps run in batch but are at varying degrees of not-implemented beyond their interface with launcher.py (though single-example training and decoding shouldn't need much implementation, since they leverage t2t-trainer and t2t-decoder).
- Illustrates managing job config and launching with Python alone, eliminating ksonnet. In my view this is a great simplification and sets up well for pythonic hparam management by a hyperparameter tuner.
- Includes an experiment with building containers with FTL, which appears not to work with certain dependencies such as tensor2tensor. Relatedly, I experimented with building containers with the Bazel docker build rule; the build completed without error, but various dependencies such as tensorflow could not be imported in the resulting container (whereas others, like tensorboard, could).
- Currently using the approach of building a base container that includes all dependencies and shipping model code via NFS with each run. This has the added benefit of archiving the code used in a particular run alongside the model parameters it produced. This approach makes the remote dev loop very tight, but I'm still interested in FTL and Bazel for both containers.
- This example currently presumes NFS is deployed, but in the future we can both add logic to check for this and generalize the types of attached volumes that are supported, which shouldn't be hard.
- We need a good solution for progressive testing, given that tests do (and increasingly will) include rather long-running jobs.
- Beginning toward an implementation of a hyperparameter tuner in which a single tuner service queries job state, collects results, and submits new jobs, as opposed to a model where a fixed collection of jobs start and keep running, changing their choice of hyperparameters as needed (as learn_runner.tune appears designed/stubbed to do?). A rough sketch of this loop follows below.
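For concreteness, here is a minimal sketch of the single-tuner-service loop described in the last item above. Everything named here is hypothetical: `propose_hparams`, `submit_training_job`, `get_job_status`, and `collect_result` stand in for logic this PR has not implemented yet, and the serial trial loop is just the simplest possible shape of the design.

```python
import time

# Hypothetical sketch: one tuner service owns the search loop, polling job
# state and submitting new jobs, instead of a fixed pool of jobs each
# re-choosing their own hyperparameters. propose_hparams,
# submit_training_job, get_job_status, and collect_result are placeholders.
def tune(num_trials, poll_interval_secs=60):
    results = []  # (hparams, objective) pairs observed so far
    for _ in range(num_trials):
        hparams = propose_hparams(results)       # e.g. random search to start
        job_name = submit_training_job(hparams)  # e.g. creates a TFJob

        # Block until the job reaches a terminal state.
        while True:
            status = get_job_status(job_name)
            if status in ("Succeeded", "Failed"):
                break
            time.sleep(poll_interval_secs)

        if status == "Succeeded":
            results.append((hparams, collect_result(job_name)))

    # Return the best trial seen, if any succeeded.
    return max(results, key=lambda r: r[1]) if results else None
```

A real implementation would presumably run trials concurrently and persist state, but even this serial form shows the key difference from the learn_runner.tune model: the jobs themselves stay dumb, and all search logic lives in one place.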
Suggesting people I think would be relevant reviewers and approvers, but I have no strong preferences.
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Assign the PR to them by writing

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing
@cwbeitel: The following test failed, say

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
See comments in #69
Steps 1 - 3 of 10 from #59
One notable change here is a divergence from the use of ksonnet to submit training jobs (as in the agents example) toward a pure-Python approach. This could be refactored to use Kubernetes Python client objects if people see a specific benefit in doing so; a rough sketch of what that might look like follows.
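As a point of reference only, a refactor along those lines might look roughly like the sketch below, which creates a TFJob custom resource through the official kubernetes Python client. The group/version strings and the job spec shape are assumptions for illustration; the TFJob API version in particular has varied across Kubeflow releases, so check the CRD installed on your cluster.

```python
from kubernetes import client, config

# Sketch only: submit a TFJob via the Kubernetes Python client instead of
# templating manifests with ksonnet. The apiVersion/group/version values
# below are assumptions, not a statement of what this PR uses.
def submit_tfjob(name, image, command, namespace="default"):
    config.load_kube_config()  # or config.load_incluster_config() in a pod
    body = {
        "apiVersion": "kubeflow.org/v1alpha2",
        "kind": "TFJob",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "tfReplicaSpecs": {
                "Master": {
                    "replicas": 1,
                    "template": {
                        "spec": {
                            "containers": [{
                                "name": "tensorflow",
                                "image": image,
                                "command": command,
                            }],
                            "restartPolicy": "OnFailure",
                        }
                    },
                }
            }
        },
    }
    api = client.CustomObjectsApi()
    return api.create_namespaced_custom_object(
        group="kubeflow.org", version="v1alpha2",
        namespace=namespace, plural="tfjobs", body=body)
```

The upside of client objects over raw dicts would be schema checking and shared defaults; the upside of the current plain-Python approach is that the job config stays a transparent data structure.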
See more detailed notes: cwbeitel@2eb3198