Epoch hangs at one node #2212

Closed
vkarak1 opened this issue Jan 25, 2023 · 3 comments
Labels
bug (Something isn't working) · U0 (Needs to be resolved immediately)

Comments

vkarak1 commented Jan 25, 2023

I ran into a problem where one of the nodes was left at epoch 179 while the other nodes were at 189:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node2:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node3:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node4:8080 -g
189
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
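
For comparison, all four nodes can be polled in one go with a small shell loop; this is just a sketch reusing the node1..node4 endpoints from the commands above:

for n in node1 node2 node3 node4; do
        # print each node's view of the current epoch (-g uses a freshly generated key)
        printf '%s: ' "$n"
        neofs-cli netmap epoch -r "$n:8080" -g
done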

Here is the output of the netmap snapshot command:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap snapshot -g -r node1:8080
Epoch: 179
Node 1: 02183147e3d30745d3ecf402679d1378bd549f36fc77bf282fc1ffda50a865d8da ONLINE /ip4/172.26.160.150/tcp/8080
        Continent: Europe
        Country: Finland
        CountryCode: FI
        Deployed: YACZROKH
        Location: Helsinki (Helsingfors)
        Node: 172.26.160.150
        Price: 10
        SubDiv: Uusimaa
        SubDivCode: 18
        UN-LOCODE: FI HEL
Node 2: 0309c85a8be8f0a11df58a2ee440a76f37a9fbe4b82b4fd6fcfdb21c95fa8480f7 ONLINE /ip4/172.26.160.46/tcp/8080
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Deployed: YACZROKH
        Location: Saint Petersburg (ex Leningrad)
        Node: 172.26.160.46
        Price: 10
        SubDiv: Sankt-Peterburg
        SubDivCode: SPE
        UN-LOCODE: RU LED
Node 3: 03736791ef911acdbc37625daf91c796e3fff63d5961e4c048368a6033699f4bfc ONLINE /ip4/172.26.160.157/tcp/8080
        Continent: Europe
        Country: Russia
        CountryCode: RU
        Deployed: YACZROKH
        Location: Moskva
        Node: 172.26.160.157
        Price: 10
        SubDiv: Moskva
        SubDivCode: MOW
        UN-LOCODE: RU MOW
Node 4: 03855b0548c4408a4d7725111c55c7346715dea8701a948046eecec7928b10cac0 ONLINE /ip4/172.26.160.205/tcp/8080
        Continent: Europe
        Country: Sweden
        CountryCode: SE
        Deployed: YACZROKH
        Location: Stockholm
        Node: 172.26.160.205
        Price: 10
        SubDiv: Stockholms län
        SubDivCode: AB
        UN-LOCODE: SE STO

I also tried to issue the 'force-new-epoch' command; the result is below:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 187, increase to 188.
Waiting for transactions to persist...
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 188, increase to 189.
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-adm morph force-new-epoch  -c configuration/config.yaml
Current epoch: 188, increase to 189.
Waiting for transactions to persist...
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
179
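
Since node1 never advances, a bounded watch loop demonstrates the hang more clearly; a sketch, again with node1's endpoint from above:

# force a tick, then poll node1 for up to a minute to see whether it moves past 179
neofs-adm morph force-new-epoch -c configuration/config.yaml
for i in $(seq 1 30); do
        e=$(neofs-cli netmap epoch -r node1:8080 -g)
        echo "attempt $i: node1 epoch $e"
        [ "$e" != "179" ] && break
        sleep 2
done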

Expected Behavior

All nodes should be at the same epoch.

Possible Solution

Restarting three services restored epoch consistency between the nodes:

root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# systemctl restart neofs-ir; systemctl restart neofs-storage; systemctl restart neo-go
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node1:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node2:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node3:8080 -g
190
root@az:/etc/neofs/storage/tatlin-object-sber-tfstate/vkarakozov# neofs-cli netmap epoch -r node4:8080 -g
190

Steps to Reproduce (for bugs)

The nodes had been used for failover tests for three days. The tests are aimed at killing each service in turn; after each kill, several "healthcheck" commands were issued against the service to make 100% sure it had been restarted and was ready to process requests.
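
One way such a "service is back" probe could look, using only commands already shown in this report (node1:8080 stands in for whichever node was killed):

# block until the storage node answers netmap queries again after a restart
until neofs-cli netmap epoch -r node1:8080 -g >/dev/null 2>&1; do
        sleep 1
done
echo "node1 is serving requests again"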

Logs.zip

Your Environment

NeoFS Inner Ring node
Version: v0.35.0-11-g1a7c827a-dirty
GoVersion: go1.18.4
NeoFS Storage node
Version: v0.35.0-11-g1a7c827a-dirty
GoVersion: go1.18.4
NeoGo
Version: 0.100.1-pre-1-g0cb86f39
GoVersion: go1.18.4

vkarak1 commented Jan 25, 2023

This could be related to being unable to force-new-epoch.

fyrchik commented Jan 27, 2023

Right before the 180 epoch tick:

Jan 25 09:36:55 az neofs-ir[638]: 2023-01-25T09:36:55.242Z        info        neofs-ir/main.go:88        internal error        {"msg": "morph chain connection has been lost"}
Jan 25 09:36:55 az neofs-ir[638]: 2023-01-25T09:36:55.242Z        info        neofs-ir/main.go:107        application stopped
Jan 25 09:36:55 az systemd[1]: neofs-ir.service: Succeeded.
Jan 25 09:36:55 az systemd[1]: neofs-ir.service: Consumed 14.959s CPU time.
Jan 25 09:36:55 az neofs-node[480]: 2023-01-25T09:36:55.250Z        info        client/multi.go:51        connection to the new RPC node has been established        {"endpoint": "ws://172.26.160.205:40332/ws"}
Jan 25 09:36:55 az systemd[1]: neo-go.service: Main process exited, code=killed, status=9/KILL
Jan 25 09:36:55 az systemd[1]: neo-go.service: Failed with result 'signal'.
Jan 25 09:36:55 az systemd[1]: neo-go.service: Consumed 44.372s CPU time.
Jan 25 09:36:57 az neofs-node[480]: 2023-01-25T09:36:57.101Z        debug        neofs-node/morph.go:232        new block        {"index": 13479}
Jan 25 09:37:00 az systemd[1]: neofs-ir.service: Scheduled restart job, restart counter is at 2.
Jan 25 09:37:00 az systemd[1]: neo-go.service: Scheduled restart job, restart counter is at 1.
Jan 25 09:37:00 az systemd[1]: Stopped NeoFS InnerRing node.
Jan 25 09:37:00 az systemd[1]: neofs-ir.service: Consumed 14.959s CPU time.
Jan 25 09:37:00 az systemd[1]: Stopped NeoGo N3-neo-go node.
Jan 25 09:37:00 az systemd[1]: neo-go.service: Consumed 44.372s CPU time.
Jan 25 09:37:00 az systemd[1]: Started NeoGo N3-neo-go node.
Jan 25 09:37:00 az systemd[1]: Started NeoFS InnerRing node.
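
The excerpt above is systemd journal output; the same window can be pulled on the affected host with something like the following (unit names taken from the restart command earlier in this issue):

journalctl -u neofs-ir -u neofs-storage -u neo-go \
        --since "2023-01-25 09:36:50" --until "2023-01-25 09:37:05"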

fyrchik commented Feb 3, 2023

Closed via #2220.
