Epoch hangs at one node #2212

Closed
vkarak1 opened this issue Jan 25, 2023 · 3 comments
Labels
bug (Something isn't working) · U0 (Needs to be resolved immediately)

Comments

vkarak1 commented Jan 25, 2023

I ran into a problem where one of the nodes was left at epoch 179 while the other nodes were at 189:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node2:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node3:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node4:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
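
For comparison, all four nodes can be polled in one go with a small shell loop; this is just a sketch reusing the node1..node4 endpoints from the commands above:

for n in node1 node2 node3 node4; do
        # print each node's view of the current epoch (-g uses a freshly generated key)
        printf '%s: ' "$n"
        neofs-cli netmap epoch -r "$n:8080" -g
done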

Here is the output of the netmap snapshot command:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap snapshot -g -r node1:8080
Epoch: 179
Node 1: 02183147e3d30745d3ecf402679d1378bd549f36fc77bf282fc1ffda50a865d8da ONLINE /ip4/172.26.160.150/tcp/8080
        Continent: Europe
        Country: Finland
        CountryCode: FI
        Deployed: YACZROKH
        Location: Helsinki (Helsingfors)
        Node: 172.26.160.150
        Price: 10
        SubDiv: Uusimaa
        SubDivCode: 18
        UN-LOCODE: FI HEL
Node 2: 0309c85a8be8f0a11df58a2ee440a76f37a9fbe4b82b4fd6fcfdb21c95fa8480f7 ONLINE /ip4/172.26.160.46/tcp/8080
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Deployed: YACZROKH
        Location: Saint Petersburg (ex Leningrad)
        Node: 172.26.160.46
        Price: 10
        SubDiv: Sankt-Peterburg
        SubDivCode: SPE
        UN-LOCODE: RU LED
Node 3: 03736791ef911acdbc37625daf91c796e3fff63d5961e4c048368a6033699f4bfc ONLINE /ip4/172.26.160.157/tcp/8080
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Deployed: YACZROKH
        Location: Moskva
        Node: 172.26.160.157
        Price: 10
        SubDiv: Moskva
        SubDivCode: MOW
        UN-LOCODE: RU MOW
Node 4: 03855b0548c4408a4d7725111c55c7346715dea8701a948046eecec7928b10cac0 ONLINE /ip4/172.26.160.205/tcp/8080
        Continent: Europe
        Country: Sweden
        CountryCode: SE
        Deployed: YACZROKH
        Location: Stockholm
        Node: 172.26.160.205
        Price: 10
        SubDiv: Stockholms län
        SubDivCode: AB
        UN-LOCODE: SE STO

I also tried to issue the 'force-new-epoch' command; the result is below:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 187, increase to 188.
Waiting for transactions to persist...
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 188, increase to 189.
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 188, increase to 189.
Waiting for transactions to persist...
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
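
Since node1 never advances, a bounded watch loop demonstrates the hang more clearly; a sketch, again with node1's endpoint from above:

# force a tick, then poll node1 for up to a minute to see whether it moves past 179
neofs-adm morph force-new-epoch -c configuration/config.yaml
for i in $(seq 1 30); do
        e=$(neofs-cli netmap epoch -r node1:8080 -g)
        echo "attempt $i: node1 epoch $e"
        [ "$e" != "179" ] && break
        sleep 2
done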

Expected Behavior

All nodes should be at the same epoch.

Possible Solution

Restarting three services restored epoch consistency between the nodes:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# systemctl restart neofs-ir; systemctl restart neofs-storage; systemctl restart neo-go
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node2:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node3:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node4:8080 -g
190

Steps to Reproduce (for bugs)

The nodes had been used for failover tests for three days. The tests are aimed at killing each service in turn; after each kill, several "healthcheck" commands were issued against the service to make 100% sure it had been restarted and was ready to process requests.
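
One way such a "service is back" probe could look, using only commands already shown in this report (node1:8080 stands in for whichever node was killed):

# block until the storage node answers netmap queries again after a restart
until neofs-cli netmap epoch -r node1:8080 -g >/dev/null 2>&1; do
        sleep 1
done
echo "node1 is serving requests again"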

Logs.zip

Your Environment

NeoFS Inner Ring node
Version: v0.35.0-11-g1a7c827a-dirty
GoVersion: go1.18.4
NeoFS Storage node
Version: v0.35.0-11-g1a7c827a-dirty
GoVersion: go1.18.4
NeoGo
Version: 0.100.1-pre-1-g0cb86f39
GoVersion: go1.18.4

vkarak1 commented Jan 25, 2023

This could be related to being unable to force-new-epoch.

fyrchik commented Jan 27, 2023

Right before the 180 epoch tick:

Jan 25 09:36:55 az neofs-ir[638]: 2023-01-25T09:36:55.242Z        info        neofs-ir/main.go:88        internal error        {"msg": "morph chain connection has been lost"}
Jan 25 09:36:55 az neofs-ir[638]: 2023-01-25T09:36:55.242Z        info        neofs-ir/main.go:107        application stopped
Jan 25 09:36:55 az systemd[1]: neofs-ir.service: Succeeded.
Jan 25 09:36:55 az systemd[1]: neofs-ir.service: Consumed 14.959s CPU time.
Jan 25 09:36:55 az neofs-node[480]: 2023-01-25T09:36:55.250Z        info        client/multi.go:51        connection to the new RPC node has been established        {"endpoint": "ws://172.26.160.205:40332/ws"}
Jan 25 09:36:55 az systemd[1]: neo-go.service: Main process exited, code=killed, status=9/KILL
Jan 25 09:36:55 az systemd[1]: neo-go.service: Failed with result 'signal'.
Jan 25 09:36:55 az systemd[1]: neo-go.service: Consumed 44.372s CPU time.
Jan 25 09:36:57 az neofs-node[480]: 2023-01-25T09:36:57.101Z        debug        neofs-node/morph.go:232        new block        {"index": 13479}
Jan 25 09:37:00 az systemd[1]: neofs-ir.service: Scheduled restart job, restart counter is at 2.
Jan 25 09:37:00 az systemd[1]: neo-go.service: Scheduled restart job, restart counter is at 1.
Jan 25 09:37:00 az systemd[1]: Stopped NeoFS InnerRing node.
Jan 25 09:37:00 az systemd[1]: neofs-ir.service: Consumed 14.959s CPU time.
Jan 25 09:37:00 az systemd[1]: Stopped NeoGo N3-neo-go node.
Jan 25 09:37:00 az systemd[1]: neo-go.service: Consumed 44.372s CPU time.
Jan 25 09:37:00 az systemd[1]: Started NeoGo N3-neo-go node.
Jan 25 09:37:00 az systemd[1]: Started NeoFS InnerRing node.
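
The excerpt above is systemd journal output; the same window can be pulled on the affected host with something like the following (unit names taken from the restart command earlier in this issue):

journalctl -u neofs-ir -u neofs-storage -u neo-go \
        --since "2023-01-25 09:36:50" --until "2023-01-25 09:37:05"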

fyrchik commented Feb 3, 2023

Closed via #2220.
