How to create TF Jobs from the user side? #67

Closed
MarkusTeufelberger opened this issue Oct 20, 2017 · 7 comments

@MarkusTeufelberger

I am wondering how the actual procedure would be to go from a simple hello world example to actually deploying it on a mlcube-enabled k8s cluster.

Your example seems to include a Docker container and setting the script (which also has to contain some rather specific scaffolding, like reading config from the environment) as ENTRYPOINT. Is that the recommended way? Or even the only way?

Ideally, I'd like to just map in a file containing the run() function via a volume and avoid forcing everyone to include the scaffolding or something like that. Maybe I'm missing something though.
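(For reference, the container-with-ENTRYPOINT approach described above boils down to something like the following sketch; the base image tag, script name and path here are placeholders rather than anything taken from the repo's example:)

FROM tensorflow/tensorflow:1.3.0
# Bake the training script (including the TF_CONFIG scaffolding) into the image.
COPY my_model.py /opt/my_model.py
ENTRYPOINT ["python", "/opt/my_model.py"]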

@wbuchwalter
Contributor

wbuchwalter commented Oct 20, 2017

@MarkusTeufelberger currently, yes, this is the only way. You need to build the container on your end and grab the TF_CONFIG from the environment.
Most likely there will at some point be tools that abstract the container creation away from the user; @jlewi has some thoughts on this.
Regarding the tf.clusterSpec, how would you like it to be passed? As arguments instead?

@MarkusTeufelberger
Author

Config via the environment is fine; it is just that right now it is more or less only documented in code. Also, in both cases (arguments/environment), some custom logic besides the actual machine learning code is required to make sure the script knows which role it should assume (the boilerplate code in main() in https://github.com/jlewi/mlkube.io/blob/master/examples/tf_sample/tf_sample/tf_smoke.py)
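(Roughly, the scaffolding being referred to has to do something along these lines; this is a hedged paraphrase of the pattern in tf_smoke.py rather than a verbatim excerpt, and the exact shape of TF_CONFIG is an assumption:)

import json
import os

import tensorflow as tf

# TF_CONFIG is injected into each replica by the operator (assumed shape:
# {"cluster": {...}, "task": {"type": ..., "index": ...}}).
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})
task = tf_config.get("task", {})
job_name = task.get("type")          # e.g. "master", "worker" or "ps"
task_index = int(task.get("index", 0))

server = tf.train.Server(tf.train.ClusterSpec(cluster),
                         job_name=job_name,
                         task_index=task_index)

if job_name == "ps":
    server.join()                    # parameter servers only serve variables
else:
    run(server)                      # hand off to the user's training code (the run() mentioned above)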

Ideally, it would look like the initial example on https://www.tensorflow.org/deploy/distributed, just with the k8s cluster as the target instead of a server on localhost. Something similar to:

$ python
>>> import tensorflow as tf
>>> import mlcubeiooperator as tfoperator
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> server = tfoperator.server(ip="1.2.3.4", port=5678)
>>> sess = tf.Session(server.target)  # Create a session on the server.
>>> sess.run(c)
'Hello, distributed TensorFlow!'

Right now it seems to be more like:

  • Take boilerplate code from tf_smoke
  • Write custom function in run()
  • Drop on top of a tensorflow:tensorflow(-gpu) Docker container
  • Build the YAML description of that TF Job (a rough sketch of such a spec follows this list)
  • Hand the container + YAML to Kubernetes (e.g. via kubectl) and wait for the job to finish
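
(For illustration, such a YAML description might look roughly like this; the apiVersion, kind and field names are assumptions based on the mlkube.io examples of the time and should be checked against the current README, and the image is a placeholder:)

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image:latest  # placeholder image
          restartPolicy: OnFailure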

@jlewi
Contributor

jlewi commented Oct 20, 2017

Thanks for taking the time to try it out and provide feedback.

For staging your code, building Docker containers is one approach, and a common one in K8s.

Another approach would be to use a shared filesystem (like NFS). You could then mount your code into the job via volume mounts. If the filesystem is also mountable on your dev box, then you can edit code and make it available to your jobs without building Docker images. In this case you could just configure your job to do something like:

...
- tfReplicaType: MASTER
  template:
    spec:
      containers:
        - image: tensorflow/tensorflow:1.3.0
          command: ["python", "/mnt/your_code.py"]
          name: tensorflow
          volumeMounts:
            # name must match the volume name below
            - name: nfs
              mountPath: "/mnt"
      volumes:
        - name: nfs
          persistentVolumeClaim:
            claimName: nfs

You can find more information about using NFS with K8s in the Kubernetes documentation.

How you setup NFS will depend on your cluster environment.
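
(As one possibility, a pre-provisioned NFS export could be wired up with a PersistentVolume/PersistentVolumeClaim pair along these lines; the server address, path and size are placeholders:)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.2          # placeholder NFS server address
    path: "/exports/code"     # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs                   # matches the claimName used in the job spec above
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi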

You only need to parse TF_CONFIG if your job is distributed and you aren't using a high level framework that takes care of this like TensorFlow estimators.

> Ideally, it would look like the initial example on https://www.tensorflow.org/deploy/distributed, just with the k8s cluster as the target instead of a server on localhost. Something similar to:

You could do this today using K8s but TfJob isn't designed for this case; please consider filing a feature request if you think it would be useful to support this better. Here's how you could do it today.

  • Use a StatefulSet to deploy one or more standard TensorFlow servers on K8s
  • Set up networking/ingress so you can reach each server from where you are running Python
  • You can now start sessions using these servers and assign ops to them (a client-side sketch follows this list)
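
(A hedged sketch of that last step, assuming a hypothetical StatefulSet named tf-server fronted by a headless service of the same name, with each pod running a standard TensorFlow server on port 2222; the gRPC target is illustrative and depends on your networking/ingress setup:)

import tensorflow as tf

# Point the session at one of the servers exposed by the StatefulSet; this
# address is hypothetical and must match how the servers are reachable.
target = "grpc://tf-server-0.tf-server:2222"

c = tf.constant("Hello, distributed TensorFlow!")
with tf.Session(target) as sess:
    print(sess.run(c))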

@jlewi
Contributor

jlewi commented Oct 22, 2017

@mgyucht I saw your talk about automating the building and deployment of Docker images with Bazel and ksonnet using your kubecfg tool. Has that been released?

@mgyucht

mgyucht commented Oct 23, 2017

@jlewi It hasn't yet... Until then, you might find success using @mattmoor's rules_k8s https://github.com/bazelbuild/rules_k8s + https://github.com/bazelbuild/rules_jsonnet, which support a similar-ish workflow, if you want to try to get something up and running now.
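
(If it helps, a minimal BUILD file combining the two might look roughly like the following; the load paths, attribute names and file names are from memory of those repos' READMEs and should be double-checked against their documentation:)

load("@io_bazel_rules_jsonnet//jsonnet:jsonnet.bzl", "jsonnet_to_json")
load("@io_bazel_rules_k8s//k8s:object.bzl", "k8s_object")

# Render the jsonnet description of the job into plain JSON.
jsonnet_to_json(
    name = "job_config",
    src = "job.jsonnet",
    outs = ["job.json"],
)

# rules_k8s can then resolve/push any images referenced in the config and
# create or apply the object, e.g. `bazel run :job.create`.
k8s_object(
    name = "job",
    kind = "job",
    template = ":job.json",
)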

@mattmoor

@jlewi I'm happy to send you pointers or give you a demo. The rules_k8s repo has a sample that generates K8s config via rules_jsonnet and feeds it into rules_k8s. I can give pointers to more complex examples too, if you want.

@jlewi
Contributor

jlewi commented Oct 23, 2017

@mgyucht @mattmoor Thanks for the pointers.

@MarkusTeufelberger I'm going to close this issue. Feel free to reopen if your question hasn't been addressed.

@jlewi jlewi closed this as completed Oct 23, 2017