
"Unknown Protocol" error in client #68

Closed
JimmyMa opened this issue Jun 17, 2014 · 18 comments

@JimmyMa

JimmyMa commented Jun 17, 2014

In our CloudFoundry deployment, the CloudController (the NATS client) intermittently gets the error below when it receives a "dea.advertise" message from the NATS (gnatsd) server:

{"id":"10-7ee2570336fe4839974e49ee412ccdcb","stacks":["lucid64"],"available_memory":6336,"available_disk":441760,"app_id_to_count":{"2beae3d0-238c-4ec6-a191-890c32a12554":1,"8fbd89d3-d26c-47b2-94de-9b774bb9687f":1,"a432a6b2-c60a-4801-ba28-cf52811be799":1,"7a98cdb9-630d-4a21-b2b9-8c2012eacc43":1,"3736f818-a39b-4e18-a0a0-afe7f748bd2a":1,"dcf171b3-c13c-4ed9-9INFO {"server_id":"5093c491bc3d437e75408f144dd3643f","version":"0.5.0","host":"xx.xx.xx.xx","port":4222,"auth_required":true,"ssl_required":false,"max_payload":1048576}

The "INFO" message gets messed with the "dea.advertise" message.

My guess is that the following scenario causes the problem:

  1. The DEA publishes the following "dea.advertise" message to the NATS server:

{"id":"10-7ee2570336fe4839974e49ee412ccdcb","stacks":["lucid64"],"available_memory":6336,"available_disk":441760,"app_id_to_count":{"2beae3d0-238c-4ec6-a191-890c32a12554":1,"8fbd89d3-d26c-47b2-94de-9b774bb9687f":1,"a432a6b2-c60a-4801-ba28-cf52811be799":1,"7a98cdb9-630d-4a21-b2b9-8c2012eacc43":1,"3736f818-a39b-4e18-a0a0-afe7f748bd2a":1}}

  2. The NATS server tries to relay the message to the CC, but a network problem occurs during the transfer, so only part of the message reaches the CC:

{"id":"10-7ee2570336fe4839974e49ee412ccdcb","stacks":["lucid64"],"available_memory":6336,"available_disk":441760,"app_id_to_count":{"2beae3d0-238c-4ec6-a191-890c32a12554":1,

  3. The CC NATS client receives the partial message and stores it in @buf (https://github.com/derekcollison/nats/blob/v0.5.0.beta.12/lib/nats/client.rb#L521).
  4. Due to the network problem, the CC NATS client loses its connection to the NATS server and tries to reconnect.
  5. The NATS server accepts the new connection from the CC and sends the "INFO" message below back to the CC NATS client. (The NATS server always sends an "INFO" message when a client connection is created: https://github.com/apcera/gnatsd/blob/master/server/server.go#L373.)

INFO {"server_id":"5093c491bc3d437e75408f144dd3643f","version":"0.5.0","host":"xx.xx.xx.xx","port":4222,"auth_required":true,"ssl_required":false,"max_payload":1048576}

  1. CC NATS client receives the "INFO" message and appends it to @buf, then the content of @buf is below:
    {"id":"10-7ee2570336fe4839974e49ee412ccdcb","stacks":["lucid64"],"available_memory":6336,"available_disk":441760,"app_id_to_count":{"2beae3d0-238c-4ec6-a191-890c32a12554":1,INFO {"server_id":"5093c491bc3d437e75408f144dd3643f","version":"0.5.0","host":"xx.xx.xx.xx","port":4222,"auth_required":true,"ssl_required":false,"max_payload":1048576}

Maybe @buf needs to be cleared when the client reconnects: https://github.com/derekcollison/nats/blob/v0.5.0.beta.12/lib/nats/client.rb#L720
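A minimal sketch of the idea, using a hypothetical, simplified model of the client's receive buffer (the class and method names here are made up for illustration; the real parser in lib/nats/client.rb is more involved). It shows why stale bytes left in the buffer across a reconnect corrupt the server's fresh "INFO" line, and how clearing the buffer on reconnect avoids that:

```ruby
# Hypothetical stand-in for the client's parse buffer, for illustration only.
class FakeParser
  def initialize
    @buf = ''
  end

  # Append incoming bytes; parse one complete protocol line if present.
  def receive_data(data)
    @buf = @buf + data
    return nil unless @buf.include?("\r\n")
    line, @buf = @buf.split("\r\n", 2)
    case line
    when /\AINFO\s/ then :info
    when /\AMSG\s/  then :msg
    else raise "Unknown protocol: #{line}"
    end
  end

  # The proposed fix: drop any partial payload before reconnecting; the
  # server will not resend the truncated message anyway.
  def reset_on_reconnect!
    @buf = ''
  end
end

parser = FakeParser.new
parser.receive_data('{"id":"10-7ee2","stacks":')  # truncated payload, no CRLF yet
# ...connection drops here, client reconnects, server greets with INFO...
parser.reset_on_reconnect!  # without this, INFO is appended to the stale bytes
puts parser.receive_data(%Q(INFO {"server_id":"abc"}\r\n)).inspect  # => :info
```

Without the reset_on_reconnect! call, the INFO line lands behind the truncated JSON, the combined line starts with neither INFO nor MSG, and the parser raises exactly the "Unknown protocol" error seen in the logs.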

@derekcollison
Member

First, try upgrading gnatsd to the current version, 0.5.2, and see if the error can be reproduced there. Also, what is the configuration and setup for gnatsd? Does it run in clustered mode?

@JimmyMa
Author

JimmyMa commented Jun 17, 2014

The same problem happens with 0.5.2; see the error log below:
/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.stderr.log:343343:/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/1.9.1/gems/nats-0.5.0.beta.12/lib/nats/client.rb:618:in `block in connection_completed': Unknown protocol: {"id":"22-574d3c89e5e248d6b7e64b9119970bf9","stacks":["lucid64"],"available_memory":14592,"available_disk":379296,"app_id_to_count":{"27532d8b-90f1-43e8-9409-0d45b4a1e9ad":1,"1fad2097-68de-4fd6-84d5-ff7ae0d2919INFO {"server_id":"b7a0fe6b1f65433671c7840c2bbaa956","version":"0.5.2","host":"xx.xx.xx.xx","port":4222,"auth_required":true,"ssl_required":false,"max_payload":1048576} (NATS::ServerError)

Our gnatsd is in clustered mode with 2 nodes.

Please let me know if you need more info. Thanks.

@derekcollison
Member

Can you send me the conf file for the 2 gnatsd servers? Thanks.


@JimmyMa
Author

JimmyMa commented Jun 17, 2014

Here are the conf files.

The IP of 0.nats.default.stage0.microbosh is xx.xx.xx.197.
The IP of 1.nats.default.stage0.microbosh is xx.xx.xx.19.

GNATSD 1:

net: "xx.xx.xx.197"
port: 4222
prof_port: 0
monitor_port: 0

pid_file: "/var/vcap/sys/run/nats/nats.pid"

log_file: "/var/vcap/sys/log/nats/nats.log"
debug: false
trace: false

authorization {
user: "nats"
password: "time4fun"
timeout: 15
}

cluster {
host: "xx.xx.xx.197"
port: 4223

authorization {
user: "nats"
password: "xxxxx"
timeout: 15
}

routes = [
  nats-route://nats:xxxxx@0.nats.default.stage0.microbosh:4223
  nats-route://nats:xxxxx@1.nats.default.stage0.microbosh:4223
]
}

GNATSD 2:

net: "xx.xx.xx.19"
port: 4222
prof_port: 0
monitor_port: 0

pid_file: "/var/vcap/sys/run/nats/nats.pid"

log_file: "/var/vcap/sys/log/nats/nats.log"
debug: false
trace: false

authorization {
user: "nats"
password: "xxxxx"
timeout: 15
}

cluster {
host: "xx.xx.xx.19"
port: 4223

authorization {
user: "nats"
password: "xxxxx"
timeout: 15
}

routes = [
  nats-route://nats:xxxxx@0.nats.default.stage0.microbosh:4223
  nats-route://nats:xxxxx@1.nats.default.stage0.microbosh:4223
]
}

@derekcollison
Member

In the routes section you should only have one route, pointing to the other gnatsd server, not an entry for the same server as well.
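For example, a sketch of what the corrected routes sections would look like, based on the conf shown earlier (host names as in the original, passwords still elided):

```
# GNATSD 1 (xx.xx.xx.197): route only to the other node
routes = [
  nats-route://nats:xxxxx@1.nats.default.stage0.microbosh:4223
]

# GNATSD 2 (xx.xx.xx.19): likewise, only the other node
routes = [
  nats-route://nats:xxxxx@0.nats.default.stage0.microbosh:4223
]
```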


@JimmyMa
Author

JimmyMa commented Jun 17, 2014

Hi Derek,

So did the wrong routes configuration cause the "Unknown Protocol" error in the NATS client?

BTW, the "Unknown Protocol" error not happens always, it only happens in 2-3 of our cloudcontrollers, and 0 - 5 times each day.

@derekcollison
Member

Unclear, just trying to eliminate things. Is it possible that a NATS client accidentally connects to the route port on 4223?


@JimmyMa
Author

JimmyMa commented Jun 18, 2014

Hi Derek,

Only 4222 is configured in the NATS client, so I don't think it could accidentally connect on 4223.

I think the root cause is an intermittent network issue in our environment. The NATS client has reconnect capability, and I suspect the "Unknown Protocol" error happens after a reconnect.

Thanks.

@derekcollison
Member

What version of the nats client are you running?


@JimmyMa
Author

JimmyMa commented Jun 18, 2014

According to the log below, the NATS client version is 0.5.0.beta.12:
/var/vcap/sys/log/cloud_controller_ng/cloud_controller_ng.stderr.log:343343:/var/vcap/packages/cloud_controller_ng/cloud_controller_ng/vendor/bundle/ruby/1.9.1/gems/nats-0.5.0.beta.12/lib/nats/client.rb:

@derekcollison
Member

Thanks, I will try to dig in a bit deeper on the initial description.


@JimmyMa
Author

JimmyMa commented Jun 20, 2014

Hi Derek,

One more thing worth mentioning: all of the garbled messages contain an "INFO" message. I hope this gives you more clues. Thanks.

@derekcollison
Member

OK, thanks. Why is the connection being dropped?


@JimmyMa
Author

JimmyMa commented Jun 21, 2014

Hi Derek,

We suspect our network is not stable; we are investigating and confirming the network instability now.

@fraenkel

You should be able to close this issue now with the latest changes merged.

@derekcollison
Member

Still trying to get some flaky tests passing reliably; I will release to rubygems today. Thanks again.


@nastacio

From what I can tell, the dropped connections are always preceded by a "Slow Consumer Detected" message on the NATS server.

@derekcollison
Member

OK, I pushed 0.5.0-beta.14 to rubygems. Thanks again for the PR.
