Auto-repair high cpu load #312

Closed
JeffreyDevloo opened this issue Aug 9, 2016 · 10 comments

JeffreyDevloo commented Aug 9, 2016

Problem description

CPU usage is spiking on all nodes within the cluster (see the first screenshot below).
The CPU spike is coming from the alba maintenance process (see the second screenshot below).
[Screenshot: selection_003]
[Screenshot: selection_004]
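
Without the screenshots, the same check can be done from a shell. A minimal sketch using only standard tools (no alba-specific assumptions):

  # list the top CPU consumers; the alba maintenance process shows up at the top
  ps -eo pid,pcpu,pmem,etime,args --sort=-pcpu | head -n 5
  # or watch a single alba maintenance process over time
  top -b -n 1 -p "$(pgrep -f 'alba maintenance' | head -n 1)"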

Possible root of the problem

Unknown

Possible solution

Unknown

Temporary solution

Disabling auto-repair: alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --disable-auto-repair

Additional information

Complete log file (gzip)

alba-maintenance_vm-backend-wJ4OUV0jLiZe4P9H.log.gz

Setup

Hyperconverged setup

  • Three nodes, each with three disks for the back-end

Package information

  • ii openvstorage 2.7.1-fargo.2-1 amd64 openvStorage
  • ii openvstorage-backend 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin
  • ii openvstorage-backend-core 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin core
  • ii openvstorage-backend-webapps 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin Web Applications
  • ii openvstorage-cinder-plugin 1.2.1-fargo.1-1 amd64 OpenvStorage Cinder plugin for OpenStack
  • ii openvstorage-core 2.7.1-fargo.2-1 amd64 openvStorage core
  • ii openvstorage-hc 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin HyperConverged
  • ii openvstorage-sdm 1.6.1-fargo.1-1 amd64 Open vStorage Backend ASD Manager
  • ii openvstorage-test 2.7.1-fargo.2-1 amd64 openvStorage autotest suite
  • ii openvstorage-webapps 2.7.1-fargo.2-1 amd64 openvStorage Web Applications
wimpers commented Aug 11, 2016

@JeffreyDevloo what happened on this env? Why was there so much maintenance work to do? Any logs you can add so we can investigate what it was doing?

domsj (Contributor) commented Aug 11, 2016

There was no need to do repair work. It's a bug in the detection of when auto-repair should happen (as evidenced by the fact that disabling auto-repair made the load go away).

wimpers added this to the Fargo milestone Aug 11, 2016
wimpers added the SRP label Aug 29, 2016
wimpers commented Oct 25, 2016

@domsj what do you want to do with this bug? Can we fix it in Fargo? Is there a workaround (disable auto-repair)?

domsj (Contributor) commented Oct 25, 2016

I'm not exactly sure yet where the bug is, so I can't immediately fix it. (Some more code inspection might bring something up though.)
I don't know why @JeffreyDevloo has seen this and why we haven't seen it elsewhere.
I suggest leaving it open for now, but removing the SRP label.
(If we start seeing it again on other envs, it probably makes sense to investigate further.)

wimpers removed the SRP label Oct 26, 2016
wimpers modified the milestones: Gilbert, Fargo Oct 26, 2016
wimpers commented Oct 26, 2016

Please raise the priority if this happens again.

wimpers modified the milestones: Roadmap, Gilbert Nov 18, 2016
domsj (Contributor) commented Nov 23, 2016

It happened again on a @JeffreyDevloo env, not sure what he's doing wrong ;-)

JeffreyDevloo (Author) commented Nov 23, 2016

Problem description

The maintenance process is hoarding the CPU for itself. This time I only saw it hoarding CPU on one node.

root      2685  331  0.7 727112 122468 ?       Rsl  Nov22 3512:27 /usr/bin/alba maintenance --config arakoon://config/ovs/alba/backends/a724fb57-1d36-4462-9252-af08f7a11093/maintenance/config?ini=%2Fopt%2Fasd-manager%2Fconfig%2Farakoon_cacc.ini --log-sink console:
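
To see whether the process is genuinely busy or spinning in a tight loop, its per-thread CPU usage can be sampled; a minimal sketch using sysstat's pidstat (PID 2685 taken from the ps output above):

  # per-thread CPU usage of the maintenance process, 3 samples of 5 seconds each
  pidstat -t -p 2685 5 3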

Setup

  • 3-node setup
  • Disk layout (identical for every node):
    [Screenshot: selection_073]
  • Backend:
    [Screenshot: selection_074]

Steps that I executed

  • First I deleted the disk with roles (sdd) and an ASD disk (sda) on node 1.
  • Then I removed 15 of the 18 ASDs on node 2.
  • Afterwards I left the environment running overnight.

When I returned, I found that the maintenance process was spiking in CPU usage.
I had to take the following steps because my root partition was filled with arakoon connection logs:

  • Removed the syslog.1 file (see the note below)
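
A note on freeing that space (a sketch with standard tools, not taken from the original report): deleting a log file that rsyslog still holds open does not free the space until the daemon releases the file handle, so the rotated file can simply be removed while the live file is better truncated in place:

  # rotated file, no longer held open, safe to delete
  rm /var/log/syslog.1
  # live file: truncate in place so rsyslog keeps writing to the same handle
  truncate -s 0 /var/log/syslog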

In the logs I found:

Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 334842 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469378 - info - Exn while repairing osd 49 (~namespace_id:2 ~object ~name:"00_000000d7_00" ~object_id:"\155\150\bs\154\004\152\239\159>>2\003[\219\2020`\189$5\186(\223\006)0L\178\021\179H"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 514456 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469396 - info - Exn while repairing osd 3 (~namespace_id:2 ~object ~name:"00_00000155_00" ~object_id:"\023$4\007\208x/S\213\178H\202V\219\220k\196\206R\162\203\151\202\155\252;\193\230m\015\007\232"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 587093 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469422 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000000c5_00" ~object_id:"8R\139\163\2044K\015\207\240=\253\199lFC\025#y\169\000\136\180K\149\186\148+\146S\210\152"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 593491 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469464 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000003c3_00" ~object_id:"7a\017\206\018i\153u\2292D\158M\020RL\170\233\237\012\2163\225'Y\184\0062\192\162\230\147"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 686855 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469500 - info - Exn while repairing osd 16 (~namespace_id:2 ~object ~name:"00_00000015_00" ~object_id:"amys\nb\148\142?]1\224\185Q\212\2218\191*Bs\143\011B\199\168\159\171\bzQ%"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46

This, and many more "Exn while repairing osd XX" lines.
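
To get an overview of how widespread the failures are, the exceptions can be counted per OSD straight from syslog; a minimal sketch with standard tools:

  # count "Exn while repairing osd" occurrences per OSD id, highest first
  grep -o 'Exn while repairing osd [0-9]*' /var/log/syslog | sort | uniq -c | sort -rn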

Temporary solution

  • Disabled auto-repair and waited 2 minutes
  • Killed the CPU-hoarding process
  • Restarted the maintenance service (no high CPU usage this time)
  • Enabled auto-repair again (and waited 2 minutes)

At first the CPU usage spiked back to 350%, but after around 10 minutes the maintenance process was only using 10%.
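
For reference, a rough shell sketch of the same workaround. The --config URL is environment specific (etcd:// or arakoon://, as the two reports above show), the service unit name is an assumption based on the alba-maintenance_<backend> pattern of the attached log file, and --enable-auto-repair is assumed to be the counterpart of the --disable-auto-repair flag used earlier in this ticket:

  # 1. disable auto-repair (command taken from the first report; adjust the --config URL)
  alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --disable-auto-repair
  # 2. after a couple of minutes, kill the CPU-hoarding maintenance process(es) on this node
  pkill -f 'alba maintenance'
  # 3. restart the maintenance service (unit name is an assumption, taken from the log file name)
  systemctl restart alba-maintenance_vm-backend-wJ4OUV0jLiZe4P9H
  # 4. re-enable auto-repair (--enable-auto-repair is an assumed flag)
  alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --enable-auto-repair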

domsj added a commit that referenced this issue Nov 24, 2016
JeffreyDevloo removed their assignment Nov 24, 2016
domsj (Contributor) commented Nov 24, 2016

It's still unclear why this happened. I added some more logging that should be available in the next version.

wimpers commented Nov 30, 2016

@domsj what needs to happen with this ticket? It is in status "In Progress", but no one is working on it or assigned to it.

toolslive (Member) commented:

#708 fixes a case where maintenance starts spinning while trying to repair a bucket.
This might not fix everything observed in this ticket, but I'm closing it nonetheless. Any new observations will need a new ticket.

wimpers modified the milestones: G, Roadmap May 24, 2017