
msg="Error in upgrading websocket" err="badResponseLoggingWriter: can't cast underlying response writer to Hijacker" #3499

Closed
arcivanov opened this issue Mar 17, 2021 · 10 comments
Labels
stale A stale issue or PR that will automatically be closed.

Comments

@arcivanov

Running a microservices deployment. Grafana runs queries perfectly, and promtail pushes logs.
However, live tailing does not work due to a WebSocket upgrade failure.
We have a standalone Grafana in the same cluster and WebSocket works there.

$ kubectl -n loki-test logs my-loki-distributed-querier-1  loki -f
level=info ts=2021-03-16T18:27:01.633676411Z caller=main.go:130 msg="Starting Loki" version="(version=2.2.0, branch=HEAD, revision=a27c79913)"
level=info ts=2021-03-16T18:27:01.634076185Z caller=server.go:229 http=[::]:3100 grpc=[::]:9095 msg="server listening on addresses"
level=info ts=2021-03-16T18:27:01.63612292Z caller=shipper_index_client.go:100 msg="starting boltdb shipper in 1 mode"
level=info ts=2021-03-16T18:27:01.63871254Z caller=worker.go:109 msg="Starting querier worker connected to query-frontend" frontend=my-loki-distributed-query-frontend:9095
level=info ts=2021-03-16T18:27:01.640873955Z caller=module_service.go:59 msg=initialising module=memberlist-kv
level=info ts=2021-03-16T18:27:01.640873985Z caller=module_service.go:59 msg=initialising module=server
level=info ts=2021-03-16T18:27:01.64125056Z caller=module_service.go:59 msg=initialising module=ring
level=info ts=2021-03-16T18:27:01.641424912Z caller=module_service.go:59 msg=initialising module=ingester-querier
level=info ts=2021-03-16T18:27:01.641481163Z caller=module_service.go:59 msg=initialising module=store
level=info ts=2021-03-16T18:27:01.641521613Z caller=module_service.go:59 msg=initialising module=querier
level=info ts=2021-03-16T18:27:01.641733185Z caller=loki.go:248 msg="Loki started"

A failure occurs immediately when trying to get the live stream of the logs:

level=error ts=2021-03-17T02:24:02.016278236Z caller=http.go:217 org_id=dev traceID=9361ef54076a59e msg="Error in upgrading websocket" err="badResponseLoggingWriter: can't cast underlying response writer to Hijacker"
level=warn ts=2021-03-17T02:24:02.016350506Z caller=logging.go:71 traceID=9361ef54076a59e msg="GET /loki/api/v1/tail?query=%7Bapp%3D%22service-api%22%7D (500) 144.481µs Response: \"Internal Server Error\\n\" ws: true; Accept-Encoding: gzip, deflate, br; Accept-Language: en-US,en;q=0.9,ru;q=0.8; Cache-Control: no-cache; Connection: Upgrade; Content-Length: 0; Pragma: no-cache; Sec-Websocket-Extensions: permessage-deflate; client_max_window_bits; Sec-Websocket-Key: <snip>; Sec-Websocket-Version: 13; Upgrade: websocket; User-Agent: Grafana/7.2.1; X-B3-Parentspanid: 76abc8935fba5af9; X-B3-Sampled: 0; X-B3-Spanid: 9adc70d9c6c5f0a1; X-B3-Traceid: 63a9c1b3e6a72b6e15cfbb1f0bfc85fc; X-Envoy-Attempt-Count: 1; X-Envoy-External-Address: 100.116.218.195; X-Envoy-Original-Path: /grafana/api/datasources/proxy/3/loki/api/v1/tail?query=%7Bapp%3D%22service-api%22%7D; X-Forwarded-Client-Cert: By=spiffe://cluster.local/ns/loki-test/sa/my-loki-distributed;Hash=6568e502f254078a8c56b33d51a60b0db004de8b483025143a44c962098319af;Subject=\"\";URI=spiffe://cluster.local/ns/monitoring/sa/grafana; X-Forwarded-For: <snip>, 127.0.0.1, 127.0.0.1,100.116.218.195,100.116.218.195; X-Forwarded-Proto: http; X-Request-Id: 12fbbd67-0bde-471a-90ca-f6d27de30057; X-Scope-Orgid: dev; uber-trace-id: 09361ef54076a59e:09361ef54076a59e:0000000000000000:0; "

arcivanov commented Mar 17, 2021

I presume that the failure is here:

https://github.com/grafana/cortex-tools/blob/16b83f14cc7202f61cf4bd7a0d69ed73e29caa12/vendor/github.com/gorilla/websocket/server.go#L178

but because the failure obscures the true cause, it's pretty hard to figure out what's going on.
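
For background: gorilla/websocket can only complete the upgrade if the http.ResponseWriter supports http.Hijacker, which lets it take over the raw TCP connection. The error text suggests the logging middleware wraps the response writer and forwards Hijack() to whatever it wraps, failing when that underlying writer is not a Hijacker. A minimal sketch of that failure mode, with illustrative names rather than Loki's actual types:

package example

import (
	"bufio"
	"errors"
	"net"
	"net/http"
)

// loggingWriter is an illustrative wrapper around http.ResponseWriter,
// similar in spirit to the badResponseLoggingWriter named in the error.
type loggingWriter struct {
	http.ResponseWriter
}

// Hijack is what gorilla/websocket's Upgrader ends up calling to take over
// the TCP connection. The wrapper can only forward the call; if the writer
// it wraps does not itself implement http.Hijacker, the upgrade fails with
// an error like the one in the log above.
func (w loggingWriter) Hijack() (net.Conn, *bufio.ReadWriter, error) {
	hj, ok := w.ResponseWriter.(http.Hijacker)
	if !ok {
		return nil, nil, errors.New("loggingWriter: can't cast underlying response writer to Hijacker")
	}
	return hj.Hijack()
}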


arcivanov commented Mar 19, 2021

The issue is localized to the query-frontend: the same query that returns a 500 when it goes through the query-frontend succeeds when the querier is queried directly.

$ kubectl -n loki-test port-forward loki-distributed-querier-0 3100
Forwarding from 127.0.0.1:3100 -> 3100
Forwarding from [::1]:3100 -> 3100
Handling connection for 3100
$ curl -vvvv -H"X-Scope-OrgID: dev" -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Extensions: permessage-deflate' -H 'Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q=='  -H 'Connection: keep-alive, Upgrade' -H 'Upgrade: websocket' -H 'X-Hello: world' 'http://localhost:3100/loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22'
*   Trying ::1:3100...
* Connected to localhost (::1) port 3100 (#0)
> GET /loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22 HTTP/1.1
> Host: localhost:3100
> User-Agent: curl/7.71.1
> Accept: */*
> X-Scope-OrgID: dev
> Sec-WebSocket-Version: 13
> Sec-WebSocket-Extensions: permessage-deflate
> Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q==
> Connection: keep-alive, Upgrade
> Upgrade: websocket
> X-Hello: world
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 101 Switching Protocols
< Upgrade: websocket
< Connection: Upgrade
< Sec-WebSocket-Accept: ofNzHVvj/ErTqrfW0S1pvqEyhpw=
< 
�~��{"streams":[{"stream":{"job":"loki-test/loki-distributed","namespace":"loki-test","node_name":"ip-10-0-150-250.ec2.internal","pod":"loki-distributed-query-frontend-7466f56c8f-sjnhj","app":"loki-distributed","component":"query-frontend","container":"loki","filename"
$ kubectl -n loki-test port-forward loki-distributed-query-frontend-7466f56c8f-qblhv 3100
Forwarding from 127.0.0.1:3100 -> 3100
Forwarding from [::1]:3100 -> 3100
Handling connection for 3100
$ curl -vvvv -H"X-Scope-OrgID: dev" -H 'Sec-WebSocket-Version: 13' -H 'Sec-WebSocket-Extensions: permessage-deflate' -H 'Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q=='  -H 'Connection: keep-alive, Upgrade' -H 'Upgrade: websocket' -H 'X-Hello: world' 'http://localhost:3100/loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22'
*   Trying ::1:3100...
* Connected to localhost (::1) port 3100 (#0)
> GET /loki/api/v1/tail?query=%7Bapp%3D%22loki-distributed%22%7D%20%7C%3D%22websocket%22 HTTP/1.1
> Host: localhost:3100
> User-Agent: curl/7.71.1
> Accept: */*
> X-Scope-OrgID: dev
> Sec-WebSocket-Version: 13
> Sec-WebSocket-Extensions: permessage-deflate
> Sec-WebSocket-Key: v4vMUSLqpDDrrvhrCqfE+Q==
> Connection: keep-alive, Upgrade
> Upgrade: websocket
> X-Hello: world
> 
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< Content-Type: text/plain; charset=utf-8
< Sec-Websocket-Version: 13
< Vary: Accept-Encoding
< X-Content-Type-Options: nosniff
< Date: Fri, 19 Mar 2021 03:45:25 GMT
< Content-Length: 22
< 
Internal Server Error
* Connection #0 to host localhost left intact


sherifkayad commented May 27, 2021

@arcivanov Are there any updates? Did you manage to solve it? I can confirm your comment above: directly querying the querier works, but querying the query-frontend gives an internal server error.

BTW, I am using the latest version of Loki (2.2.1) and deploying the distributed mode using Helm (chart v0.31.2).


sherifkayad commented May 27, 2021

@arcivanov After days of debugging I finally managed to find the problem. You basically have to explicitly configure the query frontend for tailing. The property I mention below has a default value of "", which causes the 500 error on the query-frontend side.

# In the Loki config, query-frontend section:
frontend:
  # ...
  tail_proxy_url: http://YOUR_CLUSTER_INTERNAL_LOKI_QUERIER_SVC:3100

Of course, YOUR_CLUSTER_INTERNAL_LOKI_QUERIER_SVC should be replaced with the service name of the querier.

Source: https://grafana.com/docs/loki/latest/configuration/#query_frontend_config

I will open a PR to the Helm Charts repo and will post the link here so that people using Helm don't have to suffer like I did 😄
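
For anyone wondering why an empty default turns into a 500: presumably the query-frontend has to hand the /loki/api/v1/tail request off to a querier at tail_proxy_url, and with the default "" there is nowhere to send it. A rough sketch of that pattern (my own illustration, not Loki's actual code):

package example

import (
	"net/http"
	"net/http/httputil"
	"net/url"
)

// newTailHandler sketches a frontend handler that forwards live-tail
// requests to a downstream querier. tailProxyURL mirrors the
// tail_proxy_url setting; when it is empty there is nothing to proxy to,
// so the only possible answer is an internal server error.
func newTailHandler(tailProxyURL string) (http.Handler, error) {
	if tailProxyURL == "" {
		return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			http.Error(w, "Internal Server Error", http.StatusInternalServerError)
		}), nil
	}
	target, err := url.Parse(tailProxyURL)
	if err != nil {
		return nil, err
	}
	// Since Go 1.12 the reverse proxy passes Upgrade/WebSocket requests
	// through to the target, so the querier performs the actual upgrade.
	return httputil.NewSingleHostReverseProxy(target), nil
}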

@sherifkayad

Here's my PR to solve this at the Helm chart level: grafana/helm-charts#456

@arcivanov

> @arcivanov are there any updates? Did you manage to solve it?

We downgraded to the previous working version.

@arcivanov

> Here's my PR to solve this on the Helm Chart level grafana/helm-charts#456

That's wonderful, thank you so much for this!!!


stale bot commented Jun 28, 2021

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale A stale issue or PR that will automatically be closed. label Jun 28, 2021
@stale stale bot closed this as completed Jul 9, 2021
@avinashs433

Hi, we are also experiencing this issue on the following versions:
Grafana version: 8.1.5
Loki version: 2.3.0

@sherifkayad

@avinashs433 It should be a config issue you need to adjust. Check my comment #3499 (comment).
