Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
spec for host-state-level-locking blueprint
Previously-approved: liberty Change-Id: I44bf0140d9300d9cd66b0efeb5fae7658bb8e40c
- Loading branch information
Showing
1 changed file
with
223 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,223 @@ | ||
.. | ||
This work is licensed under a Creative Commons Attribution 3.0 Unported | ||
License. | ||
http://creativecommons.org/licenses/by/3.0/legalcode | ||
|
||
============================================ | ||
Scheduler: Introduce HostState level locking | ||
============================================ | ||
|
||
https://blueprints.launchpad.net/nova/+spec/host-state-level-locking | ||
|
||
Nova FilterScheduler implementation even though inherently multi-threaded, uses | ||
no locking for access to the shared in-memory HostState data structures, that | ||
are shared between all active threads. Even though this means that most of | ||
decisions that scheduler makes under load are not internally consistent, this | ||
is not necessarily a huge issue for the basic use case, as Nova makes sure that | ||
the set resource usage policy is maintained even due to races using the retry | ||
mechanism [1]_. This can however cause issues in several more complex use | ||
cases. A non exhaustive list of some examples would be: high resource | ||
utilization, high load, specific types of host and resources (e.g. Ironic nodes | ||
[2]_ and complex resources such as NUMA topology or PCI devices). | ||
|
||
We propose to change the scheduler code to use a lightweight transactional | ||
approach to avoid full blown locking while still mitigating some of the race | ||
conditions. | ||
|
||
|
||
Problem description | ||
=================== | ||
|
||
Our scheduler service is inherently multi-threaded as it currently runs an | ||
oslo-messaging RpcServer using an EventletExecutor. This means that every | ||
incoming RPC message for select_destinations will be dispatched in it's own | ||
green thread. | ||
|
||
Upon receiving the message, every green thread will read all ComputeNode states | ||
from the database, and potentially [3]_ populate the internal global data | ||
structure that holds the host states which will be used for filtering. | ||
|
||
Further along, after choosing a host, each thread will call the | ||
HostState.consume_from_instance() method on the chosen object, which will | ||
"consume" the resources for the instance being scheduled from the chosen | ||
HostState object. This is the equivalent of what Claims code does once the | ||
request makes it to a nova-compute service, except instead of updating the | ||
ComputeNode table, it updates the scheduler service's in memory HostState | ||
object. | ||
|
||
However since there is no mutual exclusion of threads between | ||
the time a filter function ran and decide that the host passes, until a single | ||
host state was chosen. A number of other concurrent threads could have already | ||
updated the same host state. A classic race condition. Once we consider this, | ||
some obvious avenues for improvement arise. | ||
|
||
1. When calling consume_from_instance() we are basically doing a claim of | ||
resources on the host state, that may have changed since the filter function | ||
that decided to pass the host ran. At that point we have all the information | ||
to know early if a claim is going to fail and try to choose a different | ||
host. This is roughly equivalent to retrying a transaction. | ||
|
||
It is worth noting here that even though we may find that host seems like | ||
it will be failing, we may still want to choose it, as we don't ever drop | ||
the resources consumed on the HostState even after we register a retry from | ||
a already chosen compute host in this refresh cycle, so it may in fact be | ||
a false negative. | ||
|
||
2. There needs to be some kind of locking that is granular enough so as not to | ||
cause too much unnecessary overhead, but also to allow for more consistent | ||
handling of HostState. | ||
|
||
|
||
Use Cases | ||
---------- | ||
|
||
There is no specific use case that this is aimed at. It is an internal | ||
refactoring aimed at improving data consistency in the scheduler, and thus | ||
overall effectiveness of placement decisions. | ||
|
||
Project Priority | ||
----------------- | ||
|
||
Yes - This is work related to the scheduler, one of the priority topics for | ||
Liberty. | ||
|
||
Proposed change | ||
=============== | ||
|
||
Firstly, it would be very useful to use the Claim logic instead (or inside) of | ||
HostState.consume_from_instance() as there is almost complete duplication | ||
there. | ||
|
||
Next change that would be in the scope for this blueprint would be adding | ||
synchronisation primitives around accessing and updating HostState fields. | ||
A lightweight approach would be to not use any synchronisation primitives in | ||
the filters, as access to the host state is a) read-only b) usually per | ||
resource. consume_from_instance is the place where we want to make sure access | ||
is synchronized, as once the host is chosen, it will need to have resources | ||
consumed (remember - many concurrent threads could be trying to consume | ||
resources from the same HostState) and if it fails any of the "claims", no | ||
resources should be consumed. Updating the host state with fresh values after | ||
a DB read should also be synchronized. | ||
|
||
Final piece of the puzzle is modifying the FilterScheduler._schedule() method | ||
to take into account the failure to claim in consume_from_instance() and try | ||
the next host that passed the filters, or choose to ignore the local in memory | ||
failure and risk a retry from the compute host. | ||
|
||
It is worth noting that this proposal only looks at fixing data consistency | ||
among threads of a single nova-scheduler process. Running several workers still | ||
means that their internal state is going to be inconsistent between updates | ||
from the database. Fixing this is outside of the scope of this proposal. | ||
|
||
Alternatives | ||
------------ | ||
|
||
There are a number of ways we could re-design the scheduler so that the issues | ||
discussed in this spec become irrelevant. This blueprint aims to improve some | ||
obvious issues with the current implementation of the scheduler without | ||
changing the basic design. | ||
|
||
Data model impact | ||
----------------- | ||
|
||
None | ||
|
||
REST API impact | ||
--------------- | ||
|
||
None | ||
|
||
Security impact | ||
--------------- | ||
|
||
None | ||
|
||
Notifications impact | ||
-------------------- | ||
|
||
None | ||
|
||
Other end user impact | ||
--------------------- | ||
|
||
None | ||
|
||
Performance Impact | ||
------------------ | ||
|
||
Even though there will be overhead of synchronisation in every request after | ||
this change which may decrease the average response time for basic workloads, | ||
I fully expect this to massively improve the performance in conditions of a | ||
large number of requests, or low overall cloud capacity (or specific resources | ||
such as Ironic hosts), as it will significantly cut down on issued retries. | ||
|
||
Other deployer impact | ||
--------------------- | ||
|
||
There may be several config options deployers would need to consider. Defaults | ||
may be chosen in such a way as to not change previous behaviour. | ||
|
||
Developer impact | ||
---------------- | ||
|
||
Developers would need to understand that there is now locking going on in the | ||
scheduler, and consider this when making changes to the code, especially in | ||
case of adding additional resources. | ||
|
||
Implementation | ||
============== | ||
|
||
Assignee(s) | ||
----------- | ||
|
||
Primary assignee: | ||
<ndipanov> | ||
|
||
Work Items | ||
---------- | ||
|
||
* Refactor Claim classes to not be directly dependent on the resource_tracker, | ||
so that they can be used in the scheduler code and possibly move out of the | ||
compute/ subtree | ||
|
||
* Modify HostState.consume_from_instance() to use the Claim logic and acquire | ||
a HostState instance-wide lock for doing so. | ||
|
||
* Modify HostState.update_from_compute_node() to acquire a HostState | ||
instance-wide lock for updating the host state. | ||
|
||
* Modify FilterSchedule._schedule() method to expect a claim transaction | ||
failure and take appropriate action. | ||
|
||
Dependencies | ||
============ | ||
|
||
None | ||
|
||
Testing | ||
======= | ||
|
||
As is usually the case with race problems, it is notoriously difficult | ||
to come up with deterministic tests. Testing will be limited to unit tests | ||
making sure that proper synchronisation primitives are called as expected. | ||
|
||
Documentation Impact | ||
==================== | ||
|
||
There may be an additional config option to turn on the transactional nature | ||
of consume_from_instance() and possibly another one to tell the scheduler to | ||
go ahead and attempt to land an instance even though a local claim failed. | ||
|
||
References | ||
========== | ||
|
||
.. [1] The Retry mechanism works kind of like a 2PC where the instance | ||
resource usage is consumed on the in memory view the scheduler has, but is | ||
only committed to the DB when the request makes it to the chosen compute | ||
host, and under a global resource lock. | ||
.. [2] This `bug <https://bugs.launchpad.net/nova/+bug/1341420>` shows that | ||
this is pretty bad in case of Ironic. | ||
.. [3] I say potentially because there is a check of a timestamp to see if the | ||
HostState has actually been updated more recently than the ComputeNode | ||
record (with in flight requests not yet claimed on their compute hosts). |