RPC reconnection failure leading to infinite request retries #2325

Closed
roman-khimov opened this issue May 1, 2023 · 1 comment · Fixed by #2327

@roman-khimov
Member

Expected Behavior

When the RPC connection fails (and it can), the storage node reconnects to some other node and continues to operate.
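
For reference, the fail-over logic should look roughly like the following minimal Go sketch (illustrative only: the endpoint list, dialing and back-off are stand-ins, not the actual morph client code):

package morphsketch

import (
	"context"
	"net"
	"time"
)

// reconnect tries each configured RPC endpoint in turn until one answers,
// backing off between full rounds. Sketch only: the real client also has to
// restore WS subscriptions and notify its users about the switch.
func reconnect(ctx context.Context, endpoints []string) (net.Conn, error) {
	for {
		for _, addr := range endpoints {
			conn, err := (&net.Dialer{Timeout: 5 * time.Second}).DialContext(ctx, "tcp", addr)
			if err == nil {
				return conn, nil // switched to another RPC node, keep operating
			}
		}
		select {
		case <-ctx.Done():
			return nil, ctx.Err()
		case <-time.After(5 * time.Second):
		}
	}
}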

Current Behavior

neofs-node[381288]: 2023-05-01T06:06:22.386Z  info  subscriber/subscriber.go:228  RPC connection lost, attempting reconnect
neofs-node[381288]: 2023-05-01T06:06:22.392Z  error  policer/check.go:87  could not get container  {"component": "Object Policer", "cid": "2cVbPMzKoaLd54Yv5XsWmBTa9qFJ63G8VDZjKkxCxov3", "error": "could not perform test invocation (get): connection lost while waiting for the response"}
// this repeats like forever for various containers
neofs-node[381288]: 2023-05-01T06:06:27.402Z  warn  client/multi.go:42  could not establish connection to the switched RPC node  {"endpoint": "wss://rpc3.morph.t5.fs.neo.org:51331/ws", "error": "WS client creation: read tcp 65.21.148.207:41294->65.108.90.74:51331: i/o timeout"}
neofs-node[381288]: 2023-05-01T06:06:30.362Z  info  client/multi.go:52  connection to the new RPC node has been established  {"endpoint": "wss://rpc6.morph.t5.fs.neo.org:51331/ws"}
neofs-node[381288]: 2023-05-01T06:06:33.972Z  error  policer/check.go:87  could not get container  {"component": "Object Policer", "cid": "7Jpq6TrARwb13PcgQg66G8xatXX3RamjgcjqVR5Y2BDg", "error": "could not perform test invocation (get): connection lost while waiting for the response"}
// but it still loops like crazy with this
neofs-node[381288]: 2023-05-01T06:21:33.667Z  info  neofs-node/morph.go:215  new epoch event from sidechain  {"number": 5936}
neofs-node[381288]: 2023-05-01T07:24:59.575Z  info  neofs-node/morph.go:215  new epoch event from sidechain  {"number": 5937}
neofs-node[381288]: 2023-05-01T07:24:59.577Z  info  neofs-node/config.go:903  bootstrapping with online state  {"previous": "ONLINE"}
neofs-node[381288]: 2023-05-01T08:28:25.935Z  info  neofs-node/morph.go:215  new epoch event from sidechain  {"number": 5938}
neofs-node[381288]: 2023-05-01T09:31:50.731Z  info  neofs-node/morph.go:215  new epoch event from sidechain  {"number": 5939}
neofs-node[381288]: 2023-05-01T09:31:50.741Z  info  neofs-node/config.go:903  bootstrapping with online state  {"previous": "ONLINE"}
neofs-node[381288]: 2023-05-01T10:35:22.285Z  info  neofs-node/morph.go:215  new epoch event from sidechain  {"number": 5940}
neofs-node[381288]: 2023-05-01T11:38:54.169Z  info  neofs-node/morph.go:215  new epoch event from sidechain  {"number": 5941}
neofs-node[381288]: 2023-05-01T11:38:54.175Z  info  neofs-node/config.go:903  bootstrapping with online state  {"previous": "ONLINE"}
neofs-node[381288]: 2023-05-01T11:40:48.690Z  error  policer/check.go:87  could not get container  {"component": "Object Policer", "cid": "CmTo5i5DDZR12sVPgdPM1RneBgVj1rKbUwpnEue69DRp", "error": "could not perform test invocation (get): connection lost while waiting for the response"}
// seems to be better for some time, but then continues looping with error
neofs-node[381288]: 2023-05-01T11:41:01.381Z  error  policer/check.go:87  could not get container  {"component": "Object Policer", "cid": "DdRBUdCzwVg4431cJYmrdBKLMuqCqEH8q9GwRtbpPacV", "error": "could not perform test invocation (get): connection lost before registering response channel"}
// then it switches to this message and never ends

Context

Testnet, 0.36.1.

@roman-khimov roman-khimov added bug Something isn't working neofs-storage Storage node application issues labels May 1, 2023
@roman-khimov roman-khimov added this to the v0.37.0 milestone May 1, 2023
@roman-khimov roman-khimov self-assigned this May 3, 2023
@roman-khimov
Member Author

roman-khimov commented May 3, 2023

The first reconnection happens normally, the second one never happens. To test, forcibly kill the node's established TCP connection to the RPC endpoint from inside its network namespace:

sudo nsenter -t $S01_PID -n ss -K dst 192.168.130.90 dport 30333

The first kill triggers a successful reconnect; killing the new connection the same way afterwards never does.

roman-khimov added a commit that referenced this issue May 3, 2023
50dd5c2 was perfect. Except it didn't
initialize the channel, so the first reconnection worked fine and the next one
never happened, oopsie.

Signed-off-by: Roman Khimov <roman@nspcc.ru>
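
For illustration, here is a minimal Go sketch of this class of bug (the structure and names are invented for the example, not the actual neofs-node code): a connection reports its loss over a channel, and if the replacement connection's channel is never initialized, the nil channel blocks forever and the second loss is never observed.

package main

import (
	"fmt"
	"time"
)

// conn is a stand-in for a WS connection that reports its death on lostCh.
type conn struct {
	lostCh chan struct{} // must be a fresh, non-nil channel for every connection
}

// reader simulates the receive loop: when the connection breaks, it signals lostCh.
// If lostCh is nil, the send blocks forever and the loss is never reported.
func (c *conn) reader() {
	time.Sleep(100 * time.Millisecond) // pretend the connection broke
	c.lostCh <- struct{}{}
}

func main() {
	current := &conn{lostCh: make(chan struct{})} // first connection: channel initialized
	go current.reader()

	<-current.lostCh
	fmt.Println("RPC connection lost, attempting reconnect")

	// The bug: the replacement connection is built without initializing lostCh,
	// so its reader blocks on a nil channel and the second loss is never seen.
	next := &conn{} // should have been &conn{lostCh: make(chan struct{})}
	go next.reader()

	select {
	case <-next.lostCh: // receiving from a nil channel also blocks forever
		fmt.Println("second reconnect would start here")
	case <-time.After(time.Second):
		fmt.Println("second connection loss never observed; requests keep failing")
	}
}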
roman-khimov added a commit to nspcc-dev/neo-go that referenced this issue May 3, 2023
Regular Client doesn't care much about connections, because the HTTP client's Do
method can reuse old ones or create additional ones on the fly, so one request
can fail and the next one can easily succeed. WSClient is different: it works via
a single connection, and if that breaks, it breaks forever for this client.
Callers will get some error on every request afterwards, and it'd be nice for
this error to be the same so that API users could detect disconnection this
way too.

Related to nspcc-dev/neofs-node#2325.

Signed-off-by: Roman Khimov <roman@nspcc.ru>
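
The calling side can then handle disconnection roughly as below (a sketch under assumptions: errConnLost is a hypothetical stand-in for whatever single sentinel the client exports, not a confirmed neo-go identifier):

package main

import (
	"errors"
	"fmt"
)

// errConnLost stands in for the one error a WS client returns on every request
// once its underlying connection has broken (hypothetical name).
var errConnLost = errors.New("connection lost while waiting for the response")

// getContainer is a stand-in for any RPC call made through the WS client.
func getContainer(cid string) error {
	return fmt.Errorf("could not perform test invocation (get): %w", errConnLost)
}

func main() {
	if err := getContainer("2cVbPMzKoaLd54Yv5XsWmBTa9qFJ63G8VDZjKkxCxov3"); err != nil {
		// Because the same sentinel is wrapped into every post-disconnect error,
		// callers can detect the dead connection with errors.Is and switch to
		// another endpoint instead of retrying forever.
		if errors.Is(err, errConnLost) {
			fmt.Println("WS connection is gone, switch RPC node instead of retrying")
			return
		}
		fmt.Println("request failed:", err)
	}
}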
roman-khimov added a commit that referenced this issue May 10, 2023