Auto-repair high cpu load #312

Closed
JeffreyDevloo opened this issue Aug 9, 2016 · 10 comments

JeffreyDevloo commented Aug 9, 2016

Problem description

CPU usage is spiking on all nodes within the cluster (see the first screenshot below).
The CPU spike is coming from the alba maintenance process (see the second screenshot below).
[Screenshot: selection_003]
[Screenshot: selection_004]
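
Without the screenshots, the same check can be done from a shell. A minimal sketch using only standard tools (no alba-specific assumptions):

  # list the top CPU consumers; the alba maintenance process shows up at the top
  ps -eo pid,pcpu,pmem,etime,args --sort=-pcpu | head -n 5
  # or watch a single alba maintenance process over time
  top -b -n 1 -p "$(pgrep -f 'alba maintenance' | head -n 1)"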

Possible root of the problem

Unknown

Possible solution

Unknown

Temporary solution

Disabling auto-repair: alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --disable-auto-repair

Additional information

Complete log file (gzip)

alba-maintenance_vm-backend-wJ4OUV0jLiZe4P9H.log.gz

Setup

Hyperconverged setup

  • Three nodes, each with three disks for the back-end

Package information

  • ii openvstorage 2.7.1-fargo.2-1 amd64 openvStorage
  • ii openvstorage-backend 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin
  • ii openvstorage-backend-core 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin core
  • ii openvstorage-backend-webapps 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin Web Applications
  • ii openvstorage-cinder-plugin 1.2.1-fargo.1-1 amd64 OpenvStorage Cinder plugin for OpenStack
  • ii openvstorage-core 2.7.1-fargo.2-1 amd64 openvStorage core
  • ii openvstorage-hc 1.7.1-fargo.1-1 amd64 openvStorage Backend plugin HyperConverged
  • ii openvstorage-sdm 1.6.1-fargo.1-1 amd64 Open vStorage Backend ASD Manager
  • ii openvstorage-test 2.7.1-fargo.2-1 amd64 openvStorage autotest suite
  • ii openvstorage-webapps 2.7.1-fargo.2-1 amd64 openvStorage Web Applications
wimpers commented Aug 11, 2016

@JeffreyDevloo what happened on this env? Why was there so much maintenance work to do? Any logs you can add so we can investigate what it was doing?

domsj (Contributor) commented Aug 11, 2016

There was no need to do repair work. It's a bug in the detection of when auto-repair should happen (as evidenced by the fact that disabling auto-repair made the load go away).

wimpers added this to the Fargo milestone Aug 11, 2016
wimpers added the SRP label Aug 29, 2016
wimpers commented Oct 25, 2016

@domsj what do you want to do with this bug? Can we fix it in Fargo? Is there a workaround (disable auto-repair)?

domsj (Contributor) commented Oct 25, 2016

I'm not exactly sure yet where the bug is, so I can't immediately fix it. (Some more code inspection might bring something up though.)
I don't know why @JeffreyDevloo has seen this and why we haven't seen it elsewhere.
I suggest leaving it open for now, but removing the SRP label.
(If we start seeing it again on other envs, it probably makes sense to investigate further.)

wimpers removed the SRP label Oct 26, 2016
wimpers modified the milestones: Gilbert, Fargo Oct 26, 2016
wimpers commented Oct 26, 2016

Please raise the priority if this happens again.

wimpers modified the milestones: Roadmap, Gilbert Nov 18, 2016
domsj (Contributor) commented Nov 23, 2016

It happened again on a @JeffreyDevloo env, not sure what he's doing wrong ;-)

JeffreyDevloo (Author) commented Nov 23, 2016

Problem description

The maintenance process is hoarding the CPU for itself. This time I only saw it hoarding CPU on one node.

root      2685  331  0.7 727112 122468 ?       Rsl  Nov22 3512:27 /usr/bin/alba maintenance --config arakoon://config/ovs/alba/backends/a724fb57-1d36-4462-9252-af08f7a11093/maintenance/config?ini=%2Fopt%2Fasd-manager%2Fconfig%2Farakoon_cacc.ini --log-sink console:
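
To see whether the process is genuinely busy or spinning in a tight loop, its per-thread CPU usage can be sampled; a minimal sketch using sysstat's pidstat (PID 2685 taken from the ps output above):

  # per-thread CPU usage of the maintenance process, 3 samples of 5 seconds each
  pidstat -t -p 2685 5 3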

Setup

  • 3-node setup
  • Disk layout (identical for every node):
    [Screenshot: selection_073]
  • Backend:
    [Screenshot: selection_074]

Steps that I executed

  • First I deleted the disk with roles (sdd) and an ASD disk (sda) on node 1.
  • Then I removed 15 of the 18 ASDs on node 2.
  • Afterwards I left the environment running overnight.

When I returned, I found that the maintenance process was spiking in CPU usage.
I had to take the following steps because my root partition was filled with arakoon connection logs:

  • Removed the syslog.1 file (see the note below)
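
A note on freeing that space (a sketch with standard tools, not taken from the original report): deleting a log file that rsyslog still holds open does not free the space until the daemon releases the file handle, so the rotated file can simply be removed while the live file is better truncated in place:

  # rotated file, no longer held open, safe to delete
  rm /var/log/syslog.1
  # live file: truncate in place so rsyslog keeps writing to the same handle
  truncate -s 0 /var/log/syslog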

In the logs I found:

Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 334842 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469378 - info - Exn while repairing osd 49 (~namespace_id:2 ~object ~name:"00_000000d7_00" ~object_id:"\155\150\bs\154\004\152\239\159>>2\003[\219\2020`\189$5\186(\223\006)0L\178\021\179H"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 514456 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469396 - info - Exn while repairing osd 3 (~namespace_id:2 ~object ~name:"00_00000155_00" ~object_id:"\023$4\007\208x/S\213\178H\202V\219\220k\196\206R\162\203\151\202\155\252;\193\230m\015\007\232"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 587093 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469422 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000000c5_00" ~object_id:"8R\139\163\2044K\015\207\240=\253\199lFC\025#y\169\000\136\180K\149\186\148+\146S\210\152"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 593491 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469464 - info - Exn while repairing osd 52 (~namespace_id:2 ~object ~name:"00_000003c3_00" ~object_id:"7a\017\206\018i\153u\2292D\158M\020RL\170\233\237\012\2163\225'Y\184\0062\192\162\230\147"), will now try object rewrite: Alba_client_errors.Error.Exn(8); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46
Nov 23 10:52:06 ovs-node-3 alba[2685]: 2016-11-23 10:52:06 686855 +0100 - ovs-node-3 - 2685/0 - alba/maintenance - 12469500 - info - Exn while repairing osd 16 (~namespace_id:2 ~object ~name:"00_00000015_00" ~object_id:"amys\nb\148\142?]1\224\185Q\212\2218\191*Bs\143\011B\199\168\159\171\bzQ%"), will now try object rewrite: Nsm_model.Err.Nsm_exn(7, ""); backtrace:; Raised at file "queue.ml", line 68, characters 17-22; Called from file "src/tools/lwt_pool2.ml", line 98, characters 25-46

This, and many more "Exn while repairing osd XX" lines.
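
To get an overview of how widespread the failures are, the exceptions can be counted per OSD straight from syslog; a minimal sketch with standard tools:

  # count "Exn while repairing osd" occurrences per OSD id, highest first
  grep -o 'Exn while repairing osd [0-9]*' /var/log/syslog | sort | uniq -c | sort -rn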

Temporary solution

  • Disabled auto-repair and waited 2 minutes
  • Killed the CPU-hoarding process
  • Restarted the maintenance service (no high CPU usage this time)
  • Enabled auto-repair again (and waited 2 minutes)

At first the CPU usage spiked back to 350%, but after around 10 minutes the maintenance process was only using 10%.
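
For reference, a rough shell sketch of the same workaround. The --config URL is environment specific (etcd:// or arakoon://, as the two reports above show), the service unit name is an assumption based on the alba-maintenance_<backend> pattern of the attached log file, and --enable-auto-repair is assumed to be the counterpart of the --disable-auto-repair flag used earlier in this ticket:

  # 1. disable auto-repair (command taken from the first report; adjust the --config URL)
  alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --disable-auto-repair
  # 2. after a couple of minutes, kill the CPU-hoarding maintenance process(es) on this node
  pkill -f 'alba maintenance'
  # 3. restart the maintenance service (unit name is an assumption, taken from the log file name)
  systemctl restart alba-maintenance_vm-backend-wJ4OUV0jLiZe4P9H
  # 4. re-enable auto-repair (--enable-auto-repair is an assumed flag)
  alba update-maintenance-config --config etcd://127.0.0.1:2379/ovs/arakoon/vm-backend-abm/config --enable-auto-repair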

domsj added a commit that referenced this issue Nov 24, 2016
JeffreyDevloo removed their assignment Nov 24, 2016
domsj (Contributor) commented Nov 24, 2016

It's still unclear why this happened. I added some more logging that should be available in the next version.

wimpers commented Nov 30, 2016

@domsj what needs to happen with this ticket? It is in status "In Progress", but no one is working on it or assigned to it.

toolslive (Member) commented:

#708 fixes a case where maintenance starts spinning while trying to repair a bucket.
This might not fix everything observed in this ticket, but I'm closing it nonetheless. Any new observations will need a new ticket.

wimpers modified the milestones: G, Roadmap May 24, 2017