trying out dask-gateway #371

jhamman opened this issue Aug 21, 2019 · 14 comments

@jhamman (Member) commented Aug 21, 2019

I got a good way into trying out dask-gateway today. I wanted to share a progress update and rope in a few folks who can help connect a few dots.

Deploying

I installed the helm chart onto our main GCP k8s cluster like this:

helm upgrade --wait --install --namespace=dask-gateway-staging dask-gateway \
    dask-gateway -f dask-gateway-values.yaml

where dask-gateway-values.yaml is:

gateway:
  proxyToken: "secret"

  clusterManager:
    scheduler:
      cores:
        request: .25
      memory:
        request: 1G

    worker:
      cores:
        request: 2
      memory:
        request: 8G

This went smoothly and a few minutes later, I had three pods running in the dask-gateway-staging namespace:

$ kubectl get pods -n dask-gateway-staging
NAME                                            READY   STATUS    RESTARTS   AGE
gateway-dask-gateway-6458b76bfd-6n56w           1/1     Running   1          148m
scheduler-proxy-dask-gateway-6b5f496d84-jm2m5   1/1     Running   0          148m
web-proxy-dask-gateway-5867b9d4dd-5hx6f         1/1     Running   0          148m
$ kubectl get svc --namespace dask-gateway-staging
NAME                            TYPE           CLUSTER-IP    EXTERNAL-IP     PORT(S)          AGE
scheduler-api-dask-gateway      ClusterIP      10.4.9.226    <none>          8001/TCP         154m
scheduler-public-dask-gateway   LoadBalancer   10.4.13.122   35.193.80.169   8786:30884/TCP   154m
web-api-dask-gateway            ClusterIP      10.4.5.169    <none>          8001/TCP         154m
web-public-dask-gateway         LoadBalancer   10.4.1.102    34.67.80.148    80:30848/TCP     154m

Looking at the gateway-dask-gateway pod's logs, we see that things are up and running...

$ kubectl logs -n dask-gateway-staging gateway-dask-gateway-6458b76bfd-6n56w -f
[I 2019-08-20 21:28:14.968 DaskGateway] Generating new cookie secret
[I 2019-08-20 21:28:14.970 DaskGateway] Connecting to dask gateway scheduler proxy at 'tls://10.4.13.122:8786', api at 'http://10.4.9.226:8001'
[I 2019-08-20 21:28:14.980 DaskGateway] Connecting to dask gateway web proxy at 'http://10.4.1.102:80', api at 'http://10.4.5.169:8001'
[I 2019-08-20 21:28:14.989 DaskGateway] Gateway API listening on http://10.32.1.231:8001

Connecting to the gateway

I feel like this should be the easy part, but it's where I'm currently stuck.

In staging.hub.pangeo.io, I ran:

[1] !pip install dask-gateway

[2] import dask_gateway

[3] gateway = dask_gateway.Gateway('http://34.67.80.148')
 
[4] gateway.new_cluster()
/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _connect(self, cluster_name)
    492                     raise GatewayClusterError(
    493                         "Cluster %r failed to start, see logs for "
--> 494                         "more information" % cluster_name
    495                     )
    496                 elif report.status is ClusterStatus.STOPPED:

GatewayClusterError: Cluster 'd8aaf666db00405382ec1c5b122a430d' failed to start, see logs for more information

I was able to connect to the gateway, but creating the new cluster failed. I noticed that the scheduler pod for this cluster was created on the k8s cluster, but it seems to die before any logs are written.

My questions:

  1. Am I connecting to the Gateway correctly?
  2. Are there any other mandatory configuration options I am missing?
  3. How do I access the logs from this cluster?

The Python steps listed above should be repeatable on any of the Pangeo GCP clusters ([staging.][hub, ocean, hydro].pangeo.io).

cc @jcrist, @jacobtomlinson, @quasiben, and @mrocklin.

@jcrist commented Aug 21, 2019

Thanks for doing this, even without docs! The way you set it up looks good to me, so we'd have to debug further to figure things out.

I'm not sure how to deal with logs yet. Backends like YARN handle logs transparently, and for jobqueue we can leave them on disk as temporary storage. For Kubernetes, the logs' lifetime is tied to the lifetime of the pod afaik, which means the logs go away when the gateway notices the pod has failed and deletes it. We could do what JupyterHub does and add an option to delay deleting stopped pods for some amount of time, but that becomes tricky across gateway restarts. I've been thinking about adding an abstract LogStorage class and making the gateway responsible for managing the logs of finished jobs. I'm not sure what a good log storage mechanism for Kubernetes is, but a quick search turns up plenty of options (@jacobtomlinson or @quasiben may have suggestions).
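
To make that concrete, here's a minimal sketch of what such a LogStorage class might look like. Everything here is hypothetical; no class like this exists in dask-gateway yet, and all names are made up for illustration:

from abc import ABC, abstractmethod

class LogStorage(ABC):
    """Hypothetical pluggable store for logs of finished clusters."""

    @abstractmethod
    async def put_logs(self, cluster_name: str, logs: str) -> None:
        """Persist logs before the backend deletes the pod/job."""

    @abstractmethod
    async def get_logs(self, cluster_name: str) -> str:
        """Fetch logs for a (possibly long-gone) cluster."""

class InMemoryLogStorage(LogStorage):
    """Trivial reference backend; contents are lost on gateway restart."""

    def __init__(self):
        self._logs = {}

    async def put_logs(self, cluster_name, logs):
        self._logs[cluster_name] = logs

    async def get_logs(self, cluster_name):
        return self._logs.get(cluster_name, "")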

@jcrist commented Aug 21, 2019

If you look in the logs for the gateway server, do you see anything about this cluster/pod? This will at least let us know if it's a timeout (unlikely since you said you saw the pod start? or was it just created?) or an error in the pod.
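
For reference, even when a pod dies before writing any logs, kubectl can usually surface why via pod events; the pod name below is a placeholder:

kubectl describe pod <scheduler-pod-name> -n dask-gateway-staging    # events show image-pull or scheduling failures
kubectl logs <scheduler-pod-name> -n dask-gateway-staging --previous # logs from a crashed container, if any exist
kubectl get events -n dask-gateway-staging --sort-by=.metadata.creationTimestamp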

@mrocklin (Member) commented Aug 21, 2019

For context, the Pangeo annual meeting is the next three days. If we can get this running soonish there is probably a lot of value to both Pangeo and Dask-Jobqueue. I'm not sure if either of you are free today, but if you both are I would encourage a screenshare session.

@jcrist commented Aug 21, 2019

Agreed. I'm busy with errands til 12 central today, but free anytime after, as well as the rest of the week.

> a lot of value to both Pangeo and Dask-Jobqueue.

I may be missing something; what does this have to do with dask-jobqueue?

@mrocklin (Member) commented Aug 21, 2019

@jhamman (Member, Author) commented Aug 23, 2019

I tried a few more things with @yuvipanda today.

gateway = dask_gateway.Gateway('http://scheduler-public-dask-gateway.dask-gateway-staging:8786')
gateway.new_cluster()

This errors immediately with:

---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
<ipython-input-24-9f195add8661> in <module>
----> 1 gateway.new_cluster()

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in new_cluster(self, cluster_options, **kwargs)
    457         cluster : GatewayCluster
    458         """
--> 459         return self.sync(self._new_cluster, cluster_options=cluster_options, **kwargs)
    460 
    461     async def _stop_cluster(self, cluster_name):

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in sync(self, func, *args, **kwargs)
    277             return future
    278         else:
--> 279             return sync(self.loop, func, *args, **kwargs)
    280 
    281     async def _fetch(self, req):

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    325             e.wait(10)
    326     if error[0]:
--> 327         six.reraise(*error[0])
    328     else:
    329         return result[0]

/srv/conda/envs/notebook/lib/python3.7/site-packages/six.py in reraise(tp, value, tb)
    691             if value.__traceback__ is not tb:
    692                 raise value.with_traceback(tb)
--> 693             raise value
    694         finally:
    695             value = None

/srv/conda/envs/notebook/lib/python3.7/site-packages/distributed/utils.py in f()
    310             if callback_timeout is not None:
    311                 future = gen.with_timeout(timedelta(seconds=callback_timeout), future)
--> 312             result[0] = yield future
    313         except Exception as exc:
    314             error[0] = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/gen.py in run(self)
    733 
    734                     try:
--> 735                         value = future.result()
    736                     except Exception:
    737                         exc_info = sys.exc_info()

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _new_cluster(self, **kwargs)
    430 
    431     async def _new_cluster(self, **kwargs):
--> 432         cluster_name = await self._submit(**kwargs)
    433         try:
    434             return await self._connect(cluster_name)

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _submit(self, cluster_options, **kwargs)
    366             headers=HTTPHeaders({"Content-type": "application/json"}),
    367         )
--> 368         resp = await self._fetch(req)
    369         data = json.loads(resp.body)
    370         return data["name"]

/srv/conda/envs/notebook/lib/python3.7/site-packages/dask_gateway/client.py in _fetch(self, req)
    281     async def _fetch(self, req):
    282         self._cookie_jar.pre_request(req)
--> 283         resp = await self._http_client.fetch(req, raise_error=False)
    284         if resp.code == 401:
    285             context = self._auth.pre_request(req, resp)

/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/iostream.py in _read_to_buffer(self)
    854                     else:
    855                         buf = bytearray(self.read_chunk_size)
--> 856                     bytes_read = self.read_from_fd(buf)
    857                 except (socket.error, IOError, OSError) as e:
    858                     if errno_from_exception(e) == errno.EINTR:

/srv/conda/envs/notebook/lib/python3.7/site-packages/tornado/iostream.py in read_from_fd(***failed resolving arguments***)
   1130     def read_from_fd(self, buf: Union[bytearray, memoryview]) -> Optional[int]:
   1131         try:
-> 1132             return self.socket.recv_into(buf, len(buf))
   1133         except socket.error as e:
   1134             if e.args[0] in _ERRNO_WOULDBLOCK:

ConnectionResetError: [Errno 104] Connection reset by peer

According to @yuvipanda, we're able to reach the gateway server, but the gateway is "bonking" for some as-yet-unknown reason.

@jcrist commented Aug 23, 2019

I'm free now if you want to screenshare?

@jhamman (Member, Author) commented Aug 23, 2019

@jcrist - I’ll be in https://appear.in/pangeo for the next bit.

@jcrist commented Aug 23, 2019

The issue here was the cluster start timeout. Initial cluster/worker startup is slower than the default 60 seconds because images have to be pulled; increasing the timeout in the helm chart got everything working. We've demonstrated creating, connecting to, and scaling a cluster, both from inside the Pangeo JupyterHub instance and from outside with the client on someone's laptop.
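
For anyone who hits the same wall, a sketch of that change in the values file. The clusterStartTimeout key name here is an assumption (the underlying server setting is the cluster manager's cluster_start_timeout), so check the chart's values.yaml for the exact spelling:

gateway:
  clusterManager:
    clusterStartTimeout: 600  # seconds; key name assumed - the 60s default is too short when images must be pulled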

Todos:

  • Expose configuring authentication via the helm chart. Auth with jupyterhub works already, but isn't configurable in the helm chart.
  • Expose raw configuration in the helm chart, so admins can tweak other settings.

@jcrist commented Aug 23, 2019

Also, the reason the DNS name above didn't work is that you were using the wrong one: the scheduler proxy is only for client-scheduler communication; you want the web proxy for the main address (I need to write docs to make this easier to understand :/). Something like:

gateway = dask_gateway.Gateway(
    "http://web-public-dask-gateway.dask-gateway-staging",
    proxy_address="tls://scheduler-public-dask-gateway.dask-gateway-staging:8786"
)

should work. This could all be configured beforehand using dask's configuration system, e.g. by putting the following in ~/.config/dask/gateway.yaml:

gateway:
  address: http://web-public-dask-gateway.dask-gateway-staging
  proxy-address: tls://scheduler-public-dask-gateway.dask-gateway-staging:8786

so a user would connect as

gateway = dask_gateway.Gateway()
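
As a usage sketch on top of that config (scale and get_client are part of the dask-gateway client API, but treat the exact session as illustrative; the worker count is arbitrary):

import dask_gateway

gateway = dask_gateway.Gateway()
cluster = gateway.new_cluster()  # succeeds once the start timeout is raised
cluster.scale(4)                 # request 4 workers
client = cluster.get_client()    # a distributed.Client wired through the TLS scheduler proxy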

@lila commented Sep 6, 2019

Is the helm chart for dask-gateway available somewhere?

@TomAugspurger commented Sep 6, 2019

@lila commented Sep 6, 2019

Thanks Tom!

@jcrist commented Sep 13, 2019

I just pushed some basic docs on working with the helm chart: https://jcrist.github.io/dask-gateway/install-kube.html

I'm fairly new to helm, so there are likely improvements that could be made to the chart and to the docs. In particular, I'm not sure of the best way to document all the possible config values (the official helm docs recommend heavily commenting the values.yaml file, but IMO that makes things harder to read).

I'd be more than happy to help set up a demo gateway for people to try out here; just let me know how I can help.
