
Update README to reflect install experience #617

Merged: 1 commit into kubeflow:master on May 21, 2018

Conversation

ykevinc (Contributor) commented Apr 9, 2018

Hi,

I learned about Kubeflow recently and gave it a run, but a couple of issues came up:

  1. If there aren't enough nodes in the cluster, creating a server in the browser (tf-hub-0) will wait forever.
  2. While the Kubeflow Docker image is being pulled, if the node doesn't have enough storage, the kubectl port-forward tf-hub-0 process fails:
E0405 13:41:46.006868   61040 portforward.go:331] an error occurred forwarding 8000 -> 8000: error forwarding port 8000 to pod 237c9ddb56660461a4328110bdbb2624eea8ef4dc5055031bd00d0c6d393a831, uid : container not running (237c9ddb56660461a4328110bdbb2624eea8ef4dc5055031bd00d0c6d393a831)

This PR updates the user guide to work around those issues. At the same time, I want to raise them in case they can be solved without touching the user guide (e.g., pre-checking the cluster size before deploying the pod, or making the Kubeflow image smaller).
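(Editorial aside: for anyone hitting the same symptoms, a rough pre-flight check along these lines would surface both problems before the install. This is generic kubectl usage sketched for illustration, not something from the PR itself; the kubeflow namespace matches the pod description later in this thread.)

# Confirm the cluster is up and count its nodes
kubectl get nodes

# Check allocatable CPU/memory per node; the notebook pod shown below
# requests 1 CPU, so a node that is already busy may have no room for it
kubectl describe nodes | grep -A 5 Allocatable

# If port-forward reports "container not running", inspect the hub pod and
# recent events instead of retrying the forward
kubectl -n kubeflow describe pod tf-hub-0
kubectl -n kubeflow get events --sort-by=.lastTimestamp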



user_guide.md Outdated
@@ -9,7 +9,7 @@ This guide will walk you through the basics of deploying and interacting with Ku
## Requirements
* Kubernetes >= 1.8 [see here](https://github.com/kubeflow/tf-operator#requirements)
* ksonnet version [0.9.2](https://ksonnet.io/#get-started) or later. (See [below](#why-kubeflow-uses-ksonnet) for an explanation of why we use ksonnet)

* An existing Kubernetes cluster with at least 2 nodes. Nodes need storage >= 20 GB because of the ML libraries and third-party packages bundled into the Kubeflow Docker images
inc0 commented:
Why do you need 2 nodes? I've tested Kubeflow with a 1-node minikube multiple times with success. Granted, you need a lot of disk storage, mostly because of the Jupyter notebook image.
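(Editorial aside: on minikube, that disk can be provisioned at start time with minikube's standard --disk-size flag; the 20g value below simply mirrors the figure proposed in this PR's diff.)

minikube start --disk-size=20g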

ykevinc (Contributor, author) replied:

@inc0 After looking at the logs, my wording should have referred to CPUs rather than nodes. I used an AWS t2.medium node to go through the Kubeflow user guide.

If the node had had more than 2 CPUs, spawning would have been fine. The t2.medium is documented to have 2 vCPUs, but that just wasn't enough to spawn another pod.

Please let me know whether I should update the wording or whether I can safely close the PR, assuming these are common prerequisites in the Kubernetes/Jupyter world.

Name:         jupyter-mimi
Namespace:    kubeflow
Node:         <none>
Labels:       app=jupyterhub
              component=singleuser-server
              heritage=jupyterhub
              hub.jupyter.org/username=mimi
Annotations:  <none>
Status:       Pending
IP:           
Containers:
  notebook:
    Image:  gcr.io/kubeflow-images-staging/tensorflow-1.7.0-notebook-gpu:v20180403-1f854c44
    Port:   8888/TCP
    Args:
      start-singleuser.sh
      --ip="0.0.0.0"
      --port=8888
      --allow-root
    Requests:
      cpu:     1
      memory:  1G
    Environment:
      JUPYTERHUB_API_TOKEN:           X
      JPY_API_TOKEN:                  X
      JUPYTERHUB_CLIENT_ID:           user-mimi
      JUPYTERHUB_HOST:                
      JUPYTERHUB_OAUTH_CALLBACK_URL:  /user/mimi/oauth_callback
      JUPYTERHUB_USER:                mimi
      JUPYTERHUB_API_URL:             http://tf-hub-0:8081/hub/api
      JUPYTERHUB_BASE_URL:            /
      JUPYTERHUB_SERVICE_PREFIX:      /user/mimi/
      MEM_GUARANTEE:                  1G
      CPU_GUARANTEE:                  1.0
    Mounts:
      /var/run/secrets/kubernetes.io/serviceaccount from no-api-access-please (ro)
Conditions:
  Type           Status
  PodScheduled   False 
Volumes:
  no-api-access-please:
    Type:        EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:      
QoS Class:       Burstable
Node-Selectors:  <none>
Tolerations:     node.alpha.kubernetes.io/notReady:NoExecute for 300s
                 node.alpha.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  1m (x37 over 11m)  default-scheduler  No nodes are available that match all of the predicates: Insufficient cpu (2), PodToleratesNodeTaints (1).
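(Editorial aside: the FailedScheduling event above can be cross-checked against each node's allocatable CPU with plain kubectl. This sketch is generic and not from the PR thread; <node-name> is a placeholder.)

# Print each node's allocatable CPU
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}'

# Compare against what is already reserved on the node; the notebook pod
# requests cpu: 1 on top of the hub and system pods, so a 2-vCPU node can
# come up short exactly as the scheduler message reports
kubectl describe node <node-name> | grep -A 8 "Allocated resources"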

Another contributor commented:

It would be good to specify a minimum requirement for the number of CPUs.

jlewi (Contributor) commented May 9, 2018

@ykevinc Can you sync and make the requested changes please?

jlewi (Contributor) commented May 14, 2018

Thank you. Sorry the review process was so lengthy.

/lgtm
/approve

k8s-ci-robot commented:
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jlewi


jlewi (Contributor) commented May 21, 2018

/ok-to-test

k8s-ci-robot merged commit dfad46e into kubeflow:master on May 21, 2018
saffaalvi pushed a commit to StatCan/kubeflow that referenced this pull request Feb 11, 2021
yanniszark pushed a commit to arrikto/kubeflow that referenced this pull request Feb 15, 2021