Ensure that bandwidth and VF are from the same PF
A neutron port can be created with the direct vnic type and also have a
bandwidth resource request at the same time. In this case placement will
offer allocation candidates that can fulfill the bandwidth request, and
then the nova scheduler's PciPassthroughFilter checks if a PCI device, a
VF, is available for such a request. This check is based on the physnet
of the neutron port and the physical_network tag in the
pci/passthrough_whitelist config. It does not consider the actual PF
providing the bandwidth.
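
For reference, the kind of whitelist entry this check relies on looks
roughly like the following (the device name and physnet values are
illustrative only):

    [pci]
    passthrough_whitelist = {"devname": "ens2f0",
                             "physical_network": "physnet2"}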

The currently unsupported case is when a single compute node has
whitelisted VFs from more than one PF connected to the same physnet.
These PFs can have totally different bandwidth inventories in
placement. For example, PF2 can have plenty of bandwidth available
while PF3 has no bandwidth configured at all.

In this case the PciPassthroughFilter might accept the host simply
because PF3 still has available VFs, even though the bandwidth for the
port is fulfilled from PF2, which in turn might not have any VFs
available any more.

Moreover, the PCI claim uses the same logic as the filter, so it will
claim the VF from PF3 while the bandwidth was allocated from PF2 in
placement.

This patch does not try to solve the issue in the PciPassthroughFilter
but it does solve it in the PCI claim. This means that after successful
scheduling the PCI claim can still fail if the bandwidth is allocated
from one PF but a VF is no longer available on that specific PF. This
will lead to a re-schedule.

Making the PciPassthroughFilter smart enough is complicated because:
* The filters do not know about placement allocation candidates at all
* The filters work per compute host, not per allocation candidate. If
  there are two allocation candidates for the same host then nova will
  only try to filter for the first one. [1][2]

This patch applies the following logic:

The compute manager checks the InstancePCIRequest OVOs in a given
boot request and maps each of them to the neutron port that requested
the PCI device. Then it maps the neutron port to the physical device
RP in the placement allocation made for this server. The spec in the
InstancePCIRequest is then extended with the interface name of the PF
from which the bandwidth was allocated, based on the name of the
device RP. The PCI claim then enforces that the PF interface name in
the request matches the interface name of the PF from which the VF is
selected. The PCI claim code knows the PF interface name of each
available VF because the virt driver reports the 'parent_ifname' key
as part of the return value of the get_available_resource() driver
call.
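
As a minimal sketch of that mapping (the helper name below is
illustrative only; the real implementation is in
nova/compute/manager.py in this commit):

    # neutron names the device RP as <hostname>:<agentname>:<interfacename>
    def parent_ifname_from_rp_name(dev_rp_name):
        pieces = dev_rp_name.split(':')
        if len(pieces) != 3:
            raise ValueError('unexpected device RP name: %s' % dev_rp_name)
        return pieces[2]

    # e.g. parent_ifname_from_rp_name('compute1:NIC Switch agent:ens2f0')
    # returns 'ens2f0', which is then stored in every spec of the
    # InstancePCIRequest as spec['parent_ifname'].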

The PCI claim process itself is not changed, as it already enforces
that every field of the request matches the fields of the selected
device pool.
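
To illustrate that matching with invented values: a pool built from
PF2's VFs can satisfy the request, while a pool from another PF cannot,
because every key of the request spec must match the pool:

    pf2_pool = {'vendor_id': '8086', 'product_id': '154c',
                'physical_network': 'physnet2', 'parent_ifname': 'ens2f0'}
    request_spec = {'physical_network': 'physnet2',
                    'parent_ifname': 'ens2f0'}
    assert all(pf2_pool.get(k) == v for k, v in request_spec.items())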

The current patch extends the libvirt driver to provide the PF
interface name information. Besides the libvirt driver, the xenapi
driver also supports SRIOV VF handling, but this patch does not extend
the xenapi driver. So for xenapi the configuration described above
remains unsupported.
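
For illustration only, a VF entry reported by the virt driver could
then carry a field like this (all values invented; only the new
'parent_ifname' key matters here):

    vf = {'address': '0000:02:00.1',
          'dev_type': 'type-VF',
          'parent_addr': '0000:02:00.0',
          'parent_ifname': 'ens2f0'}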

I know that this feels complicated but it is necessary because VFs are
not yet tracked as resources in placement.

[1] https://github.com/openstack/nova/blob/f6996903d2ef0fdb40135b506c83ed6517b28e19/nova/scheduler/filter_scheduler.py#L239
[2] https://github.com/openstack/nova/blob/f6996903d2ef0fdb40135b506c83ed6517b28e19/nova/scheduler/filter_scheduler.py#L426

blueprint: bandwidth-resource-provider

Change-Id: I038867c4094d79ae4a20615ab9c9f9e38fcc2e0a
Balazs Gibizer committed Mar 5, 2019
1 parent c43c1d3 commit c02e213
Showing 11 changed files with 440 additions and 10 deletions.
62 changes: 62 additions & 0 deletions nova/compute/manager.py
@@ -2115,6 +2115,61 @@ def _get_request_group_mapping(request_spec):
        else:
            return None

    def _update_pci_request_spec_with_allocated_interface_name(
            self, context, instance, request_group_resource_providers_mapping):
        if not instance.pci_requests:
            return

        def needs_update(pci_request, mapping):
            return (pci_request.requester_id
                    and pci_request.requester_id in mapping)

        modified = False
        for pci_request in instance.pci_requests.requests:
            if needs_update(
                    pci_request, request_group_resource_providers_mapping):

                provider_uuids = request_group_resource_providers_mapping[
                    pci_request.requester_id]

                if len(provider_uuids) != 1:
                    reason = (
                        'Allocating resources from more than one resource '
                        'provider %(providers)s for a single pci request '
                        '%(requester)s is not supported.' %
                        {'providers': provider_uuids,
                         'requester': pci_request.requester_id})
                    raise exception.BuildAbortException(
                        instance_uuid=instance.uuid,
                        reason=reason)

                dev_rp_name = self.reportclient.get_resource_provider_name(
                    context,
                    provider_uuids[0])

                # NOTE(gibi): the device RP name reported by neutron is
                # structured like <hostname>:<agentname>:<interfacename>
                rp_name_pieces = dev_rp_name.split(':')
                if len(rp_name_pieces) != 3:
                    reason = (
                        'Resource provider %(provider)s used to allocate '
                        'resources for the pci request %(requester)s does '
                        'not have a properly formatted name. Expected name '
                        'format is <hostname>:<agentname>:<interfacename>, '
                        'but got %(provider_name)s' %
                        {'provider': provider_uuids[0],
                         'requester': pci_request.requester_id,
                         'provider_name': dev_rp_name})
                    raise exception.BuildAbortException(
                        instance_uuid=instance.uuid,
                        reason=reason)

                for spec in pci_request.spec:
                    spec['parent_ifname'] = rp_name_pieces[2]
                modified = True
        if modified:
            instance.save()

    def _build_and_run_instance(self, context, instance, image, injected_files,
            admin_password, requested_networks, security_groups,
            block_device_mapping, node, limits, filter_properties,
@@ -2136,6 +2191,13 @@ def _build_and_run_instance(self, context, instance, image, injected_files,
        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
8 changes: 8 additions & 0 deletions nova/pci/stats.py
@@ -95,6 +95,14 @@ def _create_pool_keys_from_dev(self, dev):
        pool = {k: getattr(dev, k) for k in self.pool_keys}
        if tags:
            pool.update(tags)
        # NOTE(gibi): parent_ifname acts like a tag during pci claim but is
        # not provided as part of the whitelist spec as it is auto detected
        # by the virt driver.
        # This key is used to match InstancePciRequests backed by neutron
        # ports that have a resource_request and therefore already have a
        # resource allocation in placement.
        if dev.extra_info.get('parent_ifname'):
            pool['parent_ifname'] = dev.extra_info['parent_ifname']
        return pool

    def add_device(self, dev):
21 changes: 21 additions & 0 deletions nova/scheduler/client/report.py
@@ -400,6 +400,27 @@ def get_provider_traits(self, context, rp_uuid):
                 'status_code': resp.status_code, 'err_text': resp.text})
            raise exception.ResourceProviderTraitRetrievalFailed(uuid=rp_uuid)

    def get_resource_provider_name(self, context, uuid):
        """Return the name of a RP. It tries to use the internal cache of
        RPs or falls back to calling placement directly.

        :param context: The security context
        :param uuid: UUID identifier for the resource provider to look up
        :return: The name of the RP
        :raise: ResourceProviderRetrievalFailed if the RP is not in the cache
            and the communication with placement failed.
        :raise: ResourceProviderNotFound if the RP does not exist.
        """

        try:
            return self._provider_tree.data(uuid).name
        except ValueError:
            rsp = self._get_resource_provider(context, uuid)
            if rsp is None:
                raise exception.ResourceProviderNotFound(name_or_uuid=uuid)
            else:
                return rsp['name']

    @safe_connect
    def _get_resource_provider(self, context, uuid):
        """Queries the placement API for a resource provider record with the
21 changes: 19 additions & 2 deletions nova/tests/fixtures.py
@@ -1311,16 +1311,33 @@ class NeutronFixture(fixtures.Fixture):
    }
    network_2['subnets'] = [subnet_2['id']]

    sriov_port = {
        'id': '5460ee0c-ffbb-4e45-8d58-37bfceabd084',
        'network_id': network_2['id'],
        'admin_state_up': True,
        'status': 'ACTIVE',
        'mac_address': '52:54:00:1e:59:c4',
        'fixed_ips': [
            {
                'ip_address': '192.168.13.2',
                'subnet_id': subnet_2['id']
            }
        ],
        'tenant_id': tenant_id,
        'resource_request': {},
        'binding:vnic_type': 'direct',
    }

    port_with_sriov_resource_request = {
        'id': '7059503b-a648-40fd-a561-5ca769304bee',
        'network_id': network_2['id'],
        'admin_state_up': True,
        'status': 'ACTIVE',
        'mac_address': '52:54:00:1e:59:c4',
        'mac_address': '52:54:00:1e:59:c5',
        # Does neutron really add fixed_ips to a direct vnic_type port?
        'fixed_ips': [
            {
                'ip_address': '192.168.13.2',
                'ip_address': '192.168.13.3',
                'subnet_id': subnet_2['id']
            }
        ],
116 changes: 115 additions & 1 deletion nova/tests/functional/test_servers.py
@@ -5924,13 +5924,15 @@ def setUp(self):

        self._create_networking_rp_tree(self.compute1_rp_uuid)

        # add an extra port and the related network to the neutron fixture
        # add extra ports and the related network to the neutron fixture
        # specifically for these tests. It cannot be added globally in the
        # fixture init as it adds a second network that makes auto allocation
        # based tests fail due to ambiguous networks.
        self.neutron._ports[
            self.neutron.port_with_sriov_resource_request['id']] = \
            copy.deepcopy(self.neutron.port_with_sriov_resource_request)
        self.neutron._ports[self.neutron.sriov_port['id']] = \
            copy.deepcopy(self.neutron.sriov_port)
        self.neutron._networks[
            self.neutron.network_2['id']] = self.neutron.network_2
        self.neutron._subnets[
@@ -6498,6 +6500,118 @@ def test_interface_detach_with_port_with_bandwidth_request(self):
        binding_profile = updated_port['binding:profile']
        self.assertNotIn('allocation', binding_profile)

    def test_two_sriov_ports_one_with_request_two_available_pfs(self):
        """Verify that the port's bandwidth is allocated from the same PF as
        the allocated VF.

        One compute host:
        * PF1 (0000:01:00) is configured for physnet1
        * PF2 (0000:02:00) is configured for physnet2, with 1 VF and bandwidth
          inventory
        * PF3 (0000:03:00) is configured for physnet2, with 1 VF but without
          bandwidth inventory

        One instance will be booted with two neutron ports, both requested to
        be connected to physnet2. One port has a resource request, the other
        does not. The port with the resource request cannot be allocated to
        PF1 or PF3, while the port without a resource request can be
        allocated to PF2 or PF3.

        For the detailed compute host config see the
        FakeDriverWithPciResources class. For the necessary
        passthrough_whitelist config see the setUp of the
        PortResourceRequestBasedSchedulingTestBase class.
        """

        sriov_port = self.neutron.sriov_port
        sriov_port_with_res_req = self.neutron.port_with_sriov_resource_request
        server = self._create_server(
            flavor=self.flavor_with_group_policy,
            networks=[
                {'port': sriov_port_with_res_req['id']},
                {'port': sriov_port['id']}])

        server = self._wait_for_state_change(self.admin_api, server, 'ACTIVE')

        sriov_port = self.neutron.show_port(sriov_port['id'])['port']
        sriov_port_with_res_req = self.neutron.show_port(
            sriov_port_with_res_req['id'])['port']

        allocations = self.placement_api.get(
            '/allocations/%s' % server['id']).body['allocations']

        # We expect one set of allocations for the compute resources on the
        # compute rp and one set for the networking resources on the sriov
        # PF2 rp.
        self.assertEqual(2, len(allocations))
        compute_allocations = allocations[self.compute1_rp_uuid]['resources']
        self.assertEqual(
            self._resources_from_flavor(self.flavor_with_group_policy),
            compute_allocations)

        sriov_allocations = allocations[self.sriov_pf2_rp_uuid]['resources']
        self.assertPortMatchesAllocation(
            sriov_port_with_res_req, sriov_allocations)

        # We expect that only the RP uuid of the networking RP having the
        # port allocation is sent in the port binding for the port having a
        # resource request
        sriov_with_req_binding = sriov_port_with_res_req['binding:profile']
        self.assertEqual(
            self.sriov_pf2_rp_uuid, sriov_with_req_binding['allocation'])
        # and the port without a resource request does not have an allocation
        sriov_binding = sriov_port['binding:profile']
        self.assertNotIn('allocation', sriov_binding)

        # We expect that the selected PCI device matches the RP from which
        # the bandwidth is allocated. The bandwidth is allocated from
        # 0000:02:00 (PF2) so the PCI device should be a VF of that PF
        self.assertEqual('0000:02:00.1', sriov_with_req_binding['pci_slot'])
        # But the port that has no resource request still gets a pci slot
        # allocated. 0000:02:00 has no more VFs available but 0000:03:00 has
        # one VF available and that PF is also on physnet2
        self.assertEqual('0000:03:00.1', sriov_binding['pci_slot'])

    def test_one_sriov_port_no_vf_and_bandwidth_available_on_the_same_pf(self):
        """Verify that the boot fails if there is no PF that provides both
        bandwidth and available VFs.
        """

        # boot a server with a single sriov port that has no resource request
        sriov_port = self.neutron.sriov_port
        server = self._create_server(
            flavor=self.flavor_with_group_policy,
            networks=[{'port': sriov_port['id']}])

        self._wait_for_state_change(self.admin_api, server, 'ACTIVE')
        sriov_port = self.neutron.show_port(sriov_port['id'])['port']
        sriov_binding = sriov_port['binding:profile']

        # We expect that this consumes the last available VF from PF2
        self.assertEqual('0000:02:00.1', sriov_binding['pci_slot'])

        # Now boot a second server with a port that has a resource request.
        # At this point PF2 has available bandwidth but no available VF
        # and PF3 has an available VF but no available bandwidth, so we
        # expect the boot to fail.

        sriov_port_with_res_req = self.neutron.port_with_sriov_resource_request
        server = self._create_server(
            flavor=self.flavor_with_group_policy,
            networks=[{'port': sriov_port_with_res_req['id']}])

        # NOTE(gibi): It should be NoValidHost in an ideal world but that
        # would require the scheduler to detect the situation instead of the
        # pci claim. However that is pretty hard as the scheduler does not
        # know anything about allocation candidates (e.g. that the only
        # candidate for the port in this case is PF2); it sees the whole
        # host as a candidate and on this host there is a VF available for
        # the request even if that is on the wrong PF.
        server = self._wait_for_state_change(self.admin_api, server, 'ERROR')
        self.assertIn(
            'Exceeded maximum number of retries. Exhausted all hosts '
            'available for retrying build failures for instance',
            server['fault']['message'])


class PortResourceRequestReSchedulingTestIgnoreMicroversionCheck(
        PortResourceRequestBasedSchedulingTestBase):
