Spark UI in Azure Kubernetes Service #22

Open
quinnsp06 opened this issue Mar 8, 2024 · 6 comments
quinnsp06 commented Mar 8, 2024

Hi Hussein,

When I follow the steps on an AKS cluster, I get the errors below.

When I run the API start command, I get the following error, and the browser shows Internal Server Error. Note that to reach the UI pod on port 4040 in AKS, I have to port-forward it to localhost port 8080:

  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 84, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno -2] Name or service not known

kubectl port-forward pod/testapp-20240308105234-driver 8080:4040
Forwarding from [::1]:8080 -> 4040
Handling connection for 8080
Handling connection for 8080
E0308 10:56:16.457516    5406 portforward.go:409] an error occurred forwarding 8080 -> 4040: error forwarding port 4040 to pod dadf61ac7c8d997b5c4148145abc477a6b76c26831c58a0469a7db34dd60540a, uid : failed to execute portforward in network namespace "/var/run/netns/cni-7df48b40-aa72-787f-fe1a-41f2b6711cdf": failed to connect to localhost:4040 inside namespace "dadf61ac7c8d997b5c4148145abc477a6b76c26831c58a0469a7db34dd60540a", IPv4: dial tcp4 127.0.0.1:4040: connect: connection refused IPv6 dial tcp6: address localhost: no suitable address found
E0308 10:56:16.468872    5406 portforward.go:394] error copying from local connection to remote stream: EOF
error: lost connection to pod
image
@hussein-awala hussein-awala self-assigned this Mar 8, 2024
@hussein-awala
Owner

Do you have a default deny-all network policy in your cluster? Something like:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-all
spec:
  podSelector: {}
  policyTypes:
  - Ingress
  - Egress

If so, could you try adding a network policy in the webserver namespace that allows the webserver pod to route traffic to the driver pods? Something like:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-redirect-traffic
spec:
  podSelector:
    matchLabels:
      app: spark-webserver
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          name: <your namespace>
    ports:
    - protocol: TCP
      port: 8000
  egress:
  - to:
    - podSelector:
        matchLabels:
          spark-role: driver
    ports:
    - protocol: TCP
      port: 4040
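
To verify that a policy like the one above actually lets traffic through, a quick TCP probe can be run from inside the webserver pod. This is only a minimal sketch: the function name `can_connect` and the address in the example are assumptions for illustration, not part of spark-on-k8s.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical address: replace with the driver pod IP or service name.
    print(can_connect("127.0.0.1", 4040))
```

A `False` result does not say which layer failed (DNS, network policy, or nothing listening on the port), so it is best used together with the logs.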

Author

quinnsp06 commented Mar 11, 2024

Hi Hussein,
I applied the policy, but the error persists:

spark-on-k8s api start --host 127.0.0.1 --port 8080 --workers 4 --log-level debug --limit-concurrency 100

INFO:     Uvicorn running on http://127.0.0.1:8080 (Press CTRL+C to quit)
INFO:     Started parent process [47648]
INFO:     Started server process [47653]
INFO:     Waiting for application startup.
INFO:     Started server process [47654]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [47655]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Started server process [47652]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Application startup complete.
INFO:     127.0.0.1:40384 - "GET /webserver/apps?namespace=spark-cluster HTTP/1.1" 200 OK
INFO:     127.0.0.1:40384 - "GET /webserver/apps?namespace=spark-cluster HTTP/1.1" 200 OK
INFO:     127.0.0.1:40390 - "GET /webserver/ui/spark-cluster/testapp-20240311104236 HTTP/1.1" 500 Internal Server Error
ERROR:    Exception in ASGI application
Traceback (most recent call last):
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 67, in map_httpcore_exceptions
    yield
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 371, in handle_async_request
    resp = await self._pool.handle_async_request(req)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 216, in handle_async_request
    raise exc from None
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_async/connection_pool.py", line 196, in handle_async_request
    response = await connection.handle_async_request(
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_async/connection.py", line 99, in handle_async_request
    raise exc
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_async/connection.py", line 76, in handle_async_request
    stream = await self._connect(request)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_async/connection.py", line 122, in _connect
    stream = await self._network_backend.connect_tcp(**kwargs)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_backends/auto.py", line 30, in connect_tcp
    return await self._backend.connect_tcp(
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_backends/anyio.py", line 112, in connect_tcp
    with map_exceptions(exc_map):
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpcore/_exceptions.py", line 14, in map_exceptions
    raise to_exc(exc) from exc
httpcore.ConnectError: [Errno -2] Name or service not known
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/elcio/.local/lib/python3.10/site-packages/uvicorn/protocols/http/h11_impl.py", line 408, in run_asgi
    result = await app(  # type: ignore[func-returns-value]
  File "/home/elcio/.local/lib/python3.10/site-packages/uvicorn/middleware/proxy_headers.py", line 69, in __call__
    return await self.app(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/fastapi/applications.py", line 1054, in __call__
    await super().__call__(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/applications.py", line 123, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 186, in __call__
    raise exc
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/middleware/errors.py", line 164, in __call__
    await self.app(scope, receive, _send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/middleware/exceptions.py", line 62, in __call__
    await wrap_app_handling_exceptions(self.app, conn)(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/routing.py", line 758, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/routing.py", line 778, in app
    await route.handle(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/routing.py", line 299, in handle
    await self.app(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/routing.py", line 79, in app
    await wrap_app_handling_exceptions(app, request)(scope, receive, send)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 64, in wrapped_app
    raise exc
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/_exception_handler.py", line 53, in wrapped_app
    await app(scope, receive, sender)
  File "/home/elcio/.local/lib/python3.10/site-packages/starlette/routing.py", line 74, in app
    response = await func(request)
  File "/home/elcio/.local/lib/python3.10/site-packages/fastapi/routing.py", line 299, in app
    raise e
  File "/home/elcio/.local/lib/python3.10/site-packages/fastapi/routing.py", line 294, in app
    raw_response = await run_endpoint_function(
  File "/home/elcio/.local/lib/python3.10/site-packages/fastapi/routing.py", line 191, in run_endpoint_function
    return await dependant.call(**values)
  File "/home/elcio/.local/lib/python3.10/site-packages/spark_on_k8s/api/webserver/__init__.py", line 39, in ui_reverse_proxy
    reverse_proxy_resp = await async_http_client.send(reverse_proxy_req, stream=True)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_client.py", line 1646, in send
    response = await self._send_handling_auth(
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_client.py", line 1674, in _send_handling_auth
    response = await self._send_handling_redirects(
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_client.py", line 1711, in _send_handling_redirects
    response = await self._send_single_request(request)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_client.py", line 1748, in _send_single_request
    response = await transport.handle_async_request(request)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 370, in handle_async_request
    with map_httpcore_exceptions():
  File "/usr/lib/python3.10/contextlib.py", line 153, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/elcio/.local/lib/python3.10/site-packages/httpx/_transports/default.py", line 84, in map_httpcore_exceptions
    raise mapped_exc(message) from exc
httpx.ConnectError: [Errno -2] Name or service not known

It looks like the API is failing with httpx.ConnectError: [Errno -2] Name or service not known.

Does this API in AKS need an ingress controller? (kubernetes-sigs/kind#2953)

Owner

hussein-awala commented Mar 11, 2024

I believe it's related to your CoreDNS configuration. The reverse proxy routes the traffic to <service name>.<namespace>.svc.cluster.local, but you may have changed the default DNS suffix to something else. Could you check the coredns ConfigMap?

kubectl -n kube-system get configmaps coredns -o yaml

You should see a block like this:

kubernetes <URL> {
...
}

We need to check whether <URL> is the default value, cluster.local, or something different.
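
The same check can be done on the Corefile text itself. This is a rough sketch, assuming the Corefile has been saved as a string; `kubernetes_zones` is a hypothetical helper, and the regular expression only handles the common single-line `kubernetes <zones> {` form.

```python
import re

DEFAULT_DOMAIN = "cluster.local"


def kubernetes_zones(corefile: str) -> list[str]:
    """Return the zones listed on the first `kubernetes ... {` line, if any."""
    match = re.search(r"^\s*kubernetes\s+([^{\n]*)\{", corefile, flags=re.MULTILINE)
    return match.group(1).split() if match else []


# Shortened sample Corefile, for illustration only.
sample = """
.:53 {
    errors
    kubernetes cluster.local in-addr.arpa ip6.arpa {
      pods insecure
    }
    forward . /etc/resolv.conf
}
"""

zones = kubernetes_zones(sample)
print(zones)
print(DEFAULT_DOMAIN in zones)
```

If `cluster.local` is missing from the returned zones, the cluster is probably using a different DNS suffix.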

Author

quinnsp06 commented Mar 19, 2024

Hi Hussein, how are you?

Here is the output:

kubectl -n kube-system get configmaps coredns -o yaml

apiVersion: v1
data:
  Corefile: |
    .:53 {
        errors
        ready
        health {
          lameduck 5s
        }
        kubernetes cluster.local in-addr.arpa ip6.arpa {
          pods insecure
          fallthrough in-addr.arpa ip6.arpa
          ttl 30
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
        import custom/*.override
    }
    import custom/*.server
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"Corefile":".:53 {\n    errors\n    ready\n    health {\n      lameduck 5s\n    }\n    kubernetes cluster.local in-addr.arpa ip6.arpa {\n      pods insecure\n      fallthrough in-addr.arpa ip6.arpa\n      ttl 30\n    }\n    prometheus :9153\n    forward . /etc/resolv.conf\n    cache 30\n    loop\n    reload\n    loadbalance\n    import custom/*.override\n}\nimport custom/*.server\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels":{"addonmanager.kubernetes.io/mode":"Reconcile","k8s-app":"kube-dns","kubernetes.io/cluster-service":"true"},"name":"coredns","namespace":"kube-system"}}
  creationTimestamp: "2024-02-21T12:57:42Z"
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    k8s-app: kube-dns
    kubernetes.io/cluster-service: "true"
  name: coredns
  namespace: kube-system
  resourceVersion: "454"
  uid: de1b0cf3-444b-4aea-8f2b-4e341010a05e

@hussein-awala
Owner

Since your Corefile contains import custom/*.override and import custom/*.server, you may have other ConfigMaps in the kube-system namespace that override the CoreDNS configuration. Could you check?
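
To see which extra files a Corefile pulls in, the `import` directives can be listed with a few lines of Python. This is a small sketch: `import_globs` is a hypothetical helper, and it only looks at the Corefile text, not at the mounted files. As far as I know, on AKS these custom files usually come from a ConfigMap named coredns-custom in kube-system, but treat that as an assumption to verify on your cluster.

```python
def import_globs(corefile: str) -> list[str]:
    """Return the argument of every `import` directive in a Corefile."""
    globs = []
    for line in corefile.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "import":
            globs.append(parts[1])
    return globs


# Shortened sample Corefile, for illustration only.
corefile = """
.:53 {
    errors
    import custom/*.override
}
import custom/*.server
"""

print(import_globs(corefile))
```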

@hussein-awala
Owner

@quinnsp06 any update on this issue?

Unfortunately, I don't have an Azure cloud setup to test it with AKS, but I had the opportunity to use the project in production on the AWS cloud:

  • EKS to deploy the Helm chart (for the webserver and Spark history) and to submit Spark applications
  • Route53 + Okta for ingress and authentication

As I mentioned before, I believe it is related to a custom DNS configuration on your cluster, especially since your Corefile has import custom/*.override, which is not part of the default K8S DNS configuration (link).
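
One way to narrow this down is to test name resolution directly from the machine or pod where the API process runs. This is a minimal sketch: `service_fqdn` and `resolves` are hypothetical helpers, and the service name in the example is made up, since the real name depends on how the application was submitted. Names under svc.cluster.local are normally resolvable only through the cluster DNS, so the result depends on where the check is run.

```python
import socket


def service_fqdn(service: str, namespace: str, cluster_domain: str = "cluster.local") -> str:
    """Build the in-cluster DNS name of a Kubernetes service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def resolves(hostname: str) -> bool:
    """Return True if the hostname resolves from the current machine."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # On Linux this is the error behind "[Errno -2] Name or service not known".
        return False


if __name__ == "__main__":
    # Hypothetical service name, for illustration only.
    name = service_fqdn("testapp-ui-svc", "spark-cluster")
    print(name, resolves(name))
```

If the name does not resolve where the API runs, the reverse proxy request fails before any network policy comes into play.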
