How to create TF Jobs from the user side? #67

Closed
MarkusTeufelberger opened this issue Oct 20, 2017 · 7 comments

@MarkusTeufelberger

I am wondering how the actual procedure would be to go from a simple hello world example to actually deploying it on a mlcube-enabled k8s cluster.

Your example seems to include a Docker container and setting the script (which also has to contain some rather specific scaffolding, like reading config from the environment) as ENTRYPOINT. Is that the recommended way? Or even the only way?

Ideally, I'd like to just map in a file containing the run() function via a volume and avoid forcing everyone to include the scaffolding or something like that. Maybe I'm missing something though.
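(For reference, the container-with-ENTRYPOINT approach described above boils down to something like the following sketch; the base image tag, script name and path here are placeholders rather than anything taken from the repo's example:)

FROM tensorflow/tensorflow:1.3.0
# Bake the training script (including the TF_CONFIG scaffolding) into the image.
COPY my_model.py /opt/my_model.py
ENTRYPOINT ["python", "/opt/my_model.py"]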

@wbuchwalter
Contributor

wbuchwalter commented Oct 20, 2017

@MarkusTeufelberger currently, yes, this is the only way. You need to build the container on your end and grab the TF_CONFIG from the environment.
Most likely there will at some point be tools that abstract the container creation away from the user; @jlewi has some thoughts on this.
Regarding the tf.clusterSpec, how would you like it to be passed? As arguments instead?

@MarkusTeufelberger
Author

Config via the environment is fine; it is just that right now it is more or less only documented in code. Also, in both cases (arguments/environment), some custom logic besides the actual machine learning code is required to make sure the script knows which role it should assume (the boilerplate code in main() in https://github.com/jlewi/mlkube.io/blob/master/examples/tf_sample/tf_sample/tf_smoke.py)
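(Roughly, the scaffolding being referred to has to do something along these lines; this is a hedged paraphrase of the pattern in tf_smoke.py rather than a verbatim excerpt, and the exact shape of TF_CONFIG is an assumption:)

import json
import os

import tensorflow as tf

# TF_CONFIG is injected into each replica by the operator (assumed shape:
# {"cluster": {...}, "task": {"type": ..., "index": ...}}).
tf_config = json.loads(os.environ.get("TF_CONFIG", "{}"))
cluster = tf_config.get("cluster", {})
task = tf_config.get("task", {})
job_name = task.get("type")          # e.g. "master", "worker" or "ps"
task_index = int(task.get("index", 0))

server = tf.train.Server(tf.train.ClusterSpec(cluster),
                         job_name=job_name,
                         task_index=task_index)

if job_name == "ps":
    server.join()                    # parameter servers only serve variables
else:
    run(server)                      # hand off to the user's training code (the run() mentioned above)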

Ideally, it would look like the initial example on https://www.tensorflow.org/deploy/distributed, just with the k8s cluster as the target instead of a server on localhost. Something similar to:

$ python
>>> import tensorflow as tf
>>> import mlcubeiooperator as tfoperator
>>> c = tf.constant("Hello, distributed TensorFlow!")
>>> server = tfoperator.server(ip="1.2.3.4", port=5678)
>>> sess = tf.Session(server.target)  # Create a session on the server.
>>> sess.run(c)
'Hello, distributed TensorFlow!'

Right now it seems to be more like:

  • Take boilerplate code from tf_smoke
  • Write custom function in run()
  • Drop on top of a tensorflow:tensorflow(-gpu) Docker container
  • Build the YAML description of that TF Job (a rough sketch of such a spec follows this list)
  • Hand the container + YAML to Kubernetes (e.g. via kubectl) and wait for the job to finish
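
(For illustration, such a YAML description might look roughly like this; the apiVersion, kind and field names are assumptions based on the mlkube.io examples of the time and should be checked against the current README, and the image is a placeholder:)

apiVersion: "mlkube.io/v1beta1"
kind: "TfJob"
metadata:
  name: "example-job"
spec:
  replicaSpecs:
    - replicas: 1
      tfReplicaType: MASTER
      template:
        spec:
          containers:
            - name: tensorflow
              image: gcr.io/your-project/your-image:latest  # placeholder image
          restartPolicy: OnFailure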

@jlewi
Contributor

jlewi commented Oct 20, 2017

Thanks for taking the time to try it out and provide feedback.

For staging your code, building Docker containers is one approach, and a common one in K8s.

Another approach would be to use a shared filesystem (like NFS). You could then mount your code into the job via volume mounts. If the filesystem is also mountable on your dev box, then you can edit code and make it available to your jobs without building Docker images. In this case you could just configure your job to do something like:

...
- tfReplicaType: MASTER
  template:
    spec:
      containers:
        - image: tensorflow/tensorflow:1.3.0
          command: ["python", "/mnt/your_code.py"]
          name: tensorflow
          volumeMounts:
            # name must match the volume name below
            - name: nfs
              mountPath: "/mnt"
      volumes:
        - name: nfs
          persistentVolumeClaim:
            claimName: nfs

You can find more information about using NFS with K8s in the Kubernetes documentation.

How you setup NFS will depend on your cluster environment.
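
(As one possibility, a pre-provisioned NFS export could be wired up with a PersistentVolume/PersistentVolumeClaim pair along these lines; the server address, path and size are placeholders:)

apiVersion: v1
kind: PersistentVolume
metadata:
  name: nfs
spec:
  capacity:
    storage: 10Gi
  accessModes:
    - ReadWriteMany
  nfs:
    server: 10.0.0.2          # placeholder NFS server address
    path: "/exports/code"     # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nfs                   # matches the claimName used in the job spec above
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi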

You only need to parse TF_CONFIG if your job is distributed and you aren't using a high level framework that takes care of this like TensorFlow estimators.

> Ideally, it would look like the initial example on https://www.tensorflow.org/deploy/distributed, just with the k8s cluster as the target instead of a server on localhost. Something similar to:

You could do this today using K8s but TfJob isn't designed for this case; please consider filing a feature request if you think it would be useful to support this better. Here's how you could do it today.

  • Use a StatefulSet to deploy one or more standard TensorFlow servers on K8s
  • Set up networking/ingress so you can reach each server from where you are running Python
  • You can now start sessions using these servers and assign ops to them (a client-side sketch follows this list)
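
(A hedged sketch of that last step, assuming a hypothetical StatefulSet named tf-server fronted by a headless service of the same name, with each pod running a standard TensorFlow server on port 2222; the gRPC target is illustrative and depends on your networking/ingress setup:)

import tensorflow as tf

# Point the session at one of the servers exposed by the StatefulSet; this
# address is hypothetical and must match how the servers are reachable.
target = "grpc://tf-server-0.tf-server:2222"

c = tf.constant("Hello, distributed TensorFlow!")
with tf.Session(target) as sess:
    print(sess.run(c))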

@jlewi
Contributor

jlewi commented Oct 22, 2017

@mgyucht I saw your talk about automating the building and deployment of Docker images with Bazel and ksonnet using your kubecfg tool. Has that been released?

@mgyucht

mgyucht commented Oct 23, 2017

@jlewi It hasn't yet... Until then, you might find success using @mattmoor's rules_k8s https://github.com/bazelbuild/rules_k8s + https://github.com/bazelbuild/rules_jsonnet, which support a similar-ish workflow, if you want to try to get something up and running now.
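
(If it helps, a minimal BUILD file combining the two might look roughly like the following; the load paths, attribute names and file names are from memory of those repos' READMEs and should be double-checked against their documentation:)

load("@io_bazel_rules_jsonnet//jsonnet:jsonnet.bzl", "jsonnet_to_json")
load("@io_bazel_rules_k8s//k8s:object.bzl", "k8s_object")

# Render the jsonnet description of the job into plain JSON.
jsonnet_to_json(
    name = "job_config",
    src = "job.jsonnet",
    outs = ["job.json"],
)

# rules_k8s can then resolve/push any images referenced in the config and
# create or apply the object, e.g. `bazel run :job.create`.
k8s_object(
    name = "job",
    kind = "job",
    template = ":job.json",
)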

@mattmoor

@jlewi I'm happy to send you pointers or give you a demo. The rules_k8s repo has a sample that generates K8s config via rules_jsonnet and feeds it into rules_k8s. I can give pointers to more complex examples too, if you want.

@jlewi
Contributor

jlewi commented Oct 23, 2017

@mgyucht @mattmoor Thanks for the pointers.

@MarkusTeufelberger I'm going to close this issue. Feel free to reopen if your question hasn't been addressed.

@jlewi jlewi closed this as completed Oct 23, 2017