Ensure that bandwidth and VF are from the same PF
A neutron port can be created with the direct vnic type and also have a
bandwidth resource request at the same time. In this case placement will
offer allocation candidates that can fulfill the bandwidth request, and
then the nova scheduler's PciPassthroughFilter checks if a PCI device, a
VF, is available for such a request. This check is based on the physnet
of the neutron port and the physical_network tag in the
pci/passthrough_whitelist config. It does not consider the actual PF
providing the bandwidth.
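
For reference, the kind of whitelist entry this check relies on looks
roughly like the following (the device name and physnet values are
illustrative only):

    [pci]
    passthrough_whitelist = {"devname": "ens2f0",
                             "physical_network": "physnet2"}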

The currently unsupported case is when a single compute node has
whitelisted VFs from more than one PF connected to the same physnet.
These PFs can have totally different bandwidth inventories in
placement. For example, PF2 can have plenty of bandwidth available
while PF3 has no bandwidth configured at all.

In this case the PciPassthroughFilter might accept the host simply
because PF3 still has available VFs, even though the bandwidth for the
port is fulfilled from PF2, which in turn might not have any VFs
available any more.

Moreover, the PCI claim uses the same logic as the filter, so it will
claim the VF from PF3 while the bandwidth was allocated from PF2 in
placement.

This patch does not try to solve the issue in the PciPassthroughFilter
but it does solve it in the PCI claim. This means that after successful
scheduling the PCI claim can still fail if the bandwidth is allocated
from one PF but a VF is no longer available on that specific PF. This
will lead to a re-schedule.

Making the PciPassthroughFilter smart enough is complicated because:
* The filters do not know about placement allocation candidates at all
* The filters work per compute host, not per allocation candidate. If
  there are two allocation candidates for the same host then nova will
  only try to filter for the first one. [1][2]

This patch applies the following logic:

The compute manager checks the InstancePCIRequest OVOs in a given
boot request and maps each of them to the neutron port that requested
the PCI device. Then it maps the neutron port to the physical device
RP in the placement allocation made for this server. The spec in the
InstancePCIRequest is then extended with the interface name of the PF
from which the bandwidth was allocated, based on the name of the
device RP. The PCI claim then enforces that the PF interface name in
the request matches the interface name of the PF from which the VF is
selected. The PCI claim code knows the PF interface name of each
available VF because the virt driver reports the 'parent_ifname' key
as part of the return value of the get_available_resource() driver
call.
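
As a minimal sketch of that mapping (the helper name below is
illustrative only; the real implementation is in
nova/compute/manager.py in this commit):

    # neutron names the device RP as <hostname>:<agentname>:<interfacename>
    def parent_ifname_from_rp_name(dev_rp_name):
        pieces = dev_rp_name.split(':')
        if len(pieces) != 3:
            raise ValueError('unexpected device RP name: %s' % dev_rp_name)
        return pieces[2]

    # e.g. parent_ifname_from_rp_name('compute1:NIC Switch agent:ens2f0')
    # returns 'ens2f0', which is then stored in every spec of the
    # InstancePCIRequest as spec['parent_ifname'].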

The PCI claim process itself is not changed, as it already enforces
that every field of the request matches the fields of the selected
device pool.
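
To illustrate that matching with invented values: a pool built from
PF2's VFs can satisfy the request, while a pool from another PF cannot,
because every key of the request spec must match the pool:

    pf2_pool = {'vendor_id': '8086', 'product_id': '154c',
                'physical_network': 'physnet2', 'parent_ifname': 'ens2f0'}
    request_spec = {'physical_network': 'physnet2',
                    'parent_ifname': 'ens2f0'}
    assert all(pf2_pool.get(k) == v for k, v in request_spec.items())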

The current patch extends the libvirt driver to provide the PF
interface name information. Besides the libvirt driver, the xenapi
driver also supports SRIOV VF handling, but this patch does not extend
the xenapi driver. So for xenapi the configuration described above
remains unsupported.
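
For illustration only, a VF entry reported by the virt driver could
then carry a field like this (all values invented; only the new
'parent_ifname' key matters here):

    vf = {'address': '0000:02:00.1',
          'dev_type': 'type-VF',
          'parent_addr': '0000:02:00.0',
          'parent_ifname': 'ens2f0'}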

I know that this feels complicated but it is necessary because VFs are
not yet tracked as resources in placement.

[1] https://github.com/openstack/nova/blob/f6996903d2ef0fdb40135b506c83ed6517b28e19/nova/scheduler/filter_scheduler.py#L239
[2] https://github.com/openstack/nova/blob/f6996903d2ef0fdb40135b506c83ed6517b28e19/nova/scheduler/filter_scheduler.py#L426

blueprint: bandwidth-resource-provider

Change-Id: I038867c4094d79ae4a20615ab9c9f9e38fcc2e0a
Balazs Gibizer committed Mar 5, 2019
1 parent c43c1d3 commit c02e213
Showing 11 changed files with 440 additions and 10 deletions.
62 changes: 62 additions & 0 deletions nova/compute/manager.py
@@ -2115,6 +2115,61 @@ def _get_request_group_mapping(request_spec):
        else:
            return None

    def _update_pci_request_spec_with_allocated_interface_name(
            self, context, instance, request_group_resource_providers_mapping):
        if not instance.pci_requests:
            return

        def needs_update(pci_request, mapping):
            return (pci_request.requester_id
                    and pci_request.requester_id in mapping)

        modified = False
        for pci_request in instance.pci_requests.requests:
            if needs_update(
                    pci_request, request_group_resource_providers_mapping):

                provider_uuids = request_group_resource_providers_mapping[
                    pci_request.requester_id]

                if len(provider_uuids) != 1:
                    reason = (
                        'Allocating resources from more than one resource '
                        'provider %(providers)s for a single pci request '
                        '%(requester)s is not supported.' %
                        {'providers': provider_uuids,
                         'requester': pci_request.requester_id})
                    raise exception.BuildAbortException(
                        instance_uuid=instance.uuid,
                        reason=reason)

                dev_rp_name = self.reportclient.get_resource_provider_name(
                    context,
                    provider_uuids[0])

                # NOTE(gibi): the device RP name reported by neutron is
                # structured like <hostname>:<agentname>:<interfacename>
                rp_name_pieces = dev_rp_name.split(':')
                if len(rp_name_pieces) != 3:
                    reason = (
                        'Resource provider %(provider)s used to allocate '
                        'resources for the pci request %(requester)s does '
                        'not have a properly formatted name. Expected name '
                        'format is <hostname>:<agentname>:<interfacename>, '
                        'but got %(provider_name)s' %
                        {'provider': provider_uuids[0],
                         'requester': pci_request.requester_id,
                         'provider_name': dev_rp_name})
                    raise exception.BuildAbortException(
                        instance_uuid=instance.uuid,
                        reason=reason)

                for spec in pci_request.spec:
                    spec['parent_ifname'] = rp_name_pieces[2]
                modified = True
        if modified:
            instance.save()

    def _build_and_run_instance(self, context, instance, image, injected_files,
            admin_password, requested_networks, security_groups,
            block_device_mapping, node, limits, filter_properties,
@@ -2136,6 +2191,13 @@ def _build_and_run_instance(self, context, instance, image, injected_files,
        self._check_device_tagging(requested_networks, block_device_mapping)
        self._check_trusted_certs(instance)

        request_group_resource_providers_mapping = \
            self._get_request_group_mapping(request_spec)

        if request_group_resource_providers_mapping:
            self._update_pci_request_spec_with_allocated_interface_name(
                context, instance, request_group_resource_providers_mapping)

        try:
            scheduler_hints = self._get_scheduler_hints(filter_properties,
                                                        request_spec)
8 changes: 8 additions & 0 deletions nova/pci/stats.py
@@ -95,6 +95,14 @@ def _create_pool_keys_from_dev(self, dev):
        pool = {k: getattr(dev, k) for k in self.pool_keys}
        if tags:
            pool.update(tags)
        # NOTE(gibi): parent_ifname acts like a tag during pci claim but is
        # not provided as part of the whitelist spec as it is auto detected
        # by the virt driver.
        # This key is used to match InstancePciRequests backed by neutron
        # ports that have a resource_request and therefore already have a
        # resource allocation in placement.
        if dev.extra_info.get('parent_ifname'):
            pool['parent_ifname'] = dev.extra_info['parent_ifname']
        return pool

    def add_device(self, dev):
21 changes: 21 additions & 0 deletions nova/scheduler/client/report.py
@@ -400,6 +400,27 @@ def get_provider_traits(self, context, rp_uuid):
                 'status_code': resp.status_code, 'err_text': resp.text})
            raise exception.ResourceProviderTraitRetrievalFailed(uuid=rp_uuid)

    def get_resource_provider_name(self, context, uuid):
        """Return the name of a RP. It tries to use the internal cache of
        RPs or falls back to calling placement directly.

        :param context: The security context
        :param uuid: UUID identifier for the resource provider to look up
        :return: The name of the RP
        :raise: ResourceProviderRetrievalFailed if the RP is not in the cache
            and the communication with placement failed.
        :raise: ResourceProviderNotFound if the RP does not exist.
        """

        try:
            return self._provider_tree.data(uuid).name
        except ValueError:
            rsp = self._get_resource_provider(context, uuid)
            if rsp is None:
                raise exception.ResourceProviderNotFound(name_or_uuid=uuid)
            else:
                return rsp['name']

    @safe_connect
    def _get_resource_provider(self, context, uuid):
        """Queries the placement API for a resource provider record with the
21 changes: 19 additions & 2 deletions nova/tests/fixtures.py
@@ -1311,16 +1311,33 @@ class NeutronFixture(fixtures.Fixture):
    }
    network_2['subnets'] = [subnet_2['id']]

    sriov_port = {
        'id': '5460ee0c-ffbb-4e45-8d58-37bfceabd084',
        'network_id': network_2['id'],
        'admin_state_up': True,
        'status': 'ACTIVE',
        'mac_address': '52:54:00:1e:59:c4',
        'fixed_ips': [
            {
                'ip_address': '192.168.13.2',
                'subnet_id': subnet_2['id']
            }
        ],
        'tenant_id': tenant_id,
        'resource_request': {},
        'binding:vnic_type': 'direct',
    }

    port_with_sriov_resource_request = {
        'id': '7059503b-a648-40fd-a561-5ca769304bee',
        'network_id': network_2['id'],
        'admin_state_up': True,
        'status': 'ACTIVE',
        'mac_address': '52:54:00:1e:59:c4',
        'mac_address': '52:54:00:1e:59:c5',
        # Does neutron really add fixed_ips to a direct vnic_type port?
        'fixed_ips': [
            {
                'ip_address': '192.168.13.2',
                'ip_address': '192.168.13.3',
                'subnet_id': subnet_2['id']
            }
        ],
116 changes: 115 additions & 1 deletion nova/tests/functional/test_servers.py
@@ -5924,13 +5924,15 @@ def setUp(self):

        self._create_networking_rp_tree(self.compute1_rp_uuid)

        # add an extra port and the related network to the neutron fixture
        # add extra ports and the related network to the neutron fixture
        # specifically for these tests. It cannot be added globally in the
        # fixture init as it adds a second network that makes auto allocation
        # based tests fail due to ambiguous networks.
        self.neutron._ports[
            self.neutron.port_with_sriov_resource_request['id']] = \
            copy.deepcopy(self.neutron.port_with_sriov_resource_request)
        self.neutron._ports[self.neutron.sriov_port['id']] = \
            copy.deepcopy(self.neutron.sriov_port)
        self.neutron._networks[
            self.neutron.network_2['id']] = self.neutron.network_2
        self.neutron._subnets[
@@ -6498,6 +6500,118 @@ def test_interface_detach_with_port_with_bandwidth_request(self):
        binding_profile = updated_port['binding:profile']
        self.assertNotIn('allocation', binding_profile)

    def test_two_sriov_ports_one_with_request_two_available_pfs(self):
        """Verify that the port's bandwidth is allocated from the same PF as
        the allocated VF.

        One compute host:
        * PF1 (0000:01:00) is configured for physnet1
        * PF2 (0000:02:00) is configured for physnet2, with 1 VF and bandwidth
          inventory
        * PF3 (0000:03:00) is configured for physnet2, with 1 VF but without
          bandwidth inventory

        One instance will be booted with two neutron ports, both requested to
        be connected to physnet2. One port has a resource request, the other
        does not. The port with the resource request cannot be allocated to
        PF1 or PF3, while the port without a resource request can be
        allocated to PF2 or PF3.

        For the detailed compute host config see the
        FakeDriverWithPciResources class. For the necessary
        passthrough_whitelist config see the setUp of the
        PortResourceRequestBasedSchedulingTestBase class.
        """

        sriov_port = self.neutron.sriov_port
        sriov_port_with_res_req = self.neutron.port_with_sriov_resource_request
        server = self._create_server(
            flavor=self.flavor_with_group_policy,
            networks=[
                {'port': sriov_port_with_res_req['id']},
                {'port': sriov_port['id']}])

        server = self._wait_for_state_change(self.admin_api, server, 'ACTIVE')

        sriov_port = self.neutron.show_port(sriov_port['id'])['port']
        sriov_port_with_res_req = self.neutron.show_port(
            sriov_port_with_res_req['id'])['port']

        allocations = self.placement_api.get(
            '/allocations/%s' % server['id']).body['allocations']

        # We expect one set of allocations for the compute resources on the
        # compute rp and one set for the networking resources on the sriov
        # PF2 rp.
        self.assertEqual(2, len(allocations))
        compute_allocations = allocations[self.compute1_rp_uuid]['resources']
        self.assertEqual(
            self._resources_from_flavor(self.flavor_with_group_policy),
            compute_allocations)

        sriov_allocations = allocations[self.sriov_pf2_rp_uuid]['resources']
        self.assertPortMatchesAllocation(
            sriov_port_with_res_req, sriov_allocations)

        # We expect that only the RP uuid of the networking RP having the
        # port allocation is sent in the port binding for the port having a
        # resource request
        sriov_with_req_binding = sriov_port_with_res_req['binding:profile']
        self.assertEqual(
            self.sriov_pf2_rp_uuid, sriov_with_req_binding['allocation'])
        # and the port without a resource request does not have an allocation
        sriov_binding = sriov_port['binding:profile']
        self.assertNotIn('allocation', sriov_binding)

        # We expect that the selected PCI device matches the RP from which
        # the bandwidth is allocated. The bandwidth is allocated from
        # 0000:02:00 (PF2) so the PCI device should be a VF of that PF
        self.assertEqual('0000:02:00.1', sriov_with_req_binding['pci_slot'])
        # But the port that has no resource request still gets a pci slot
        # allocated. 0000:02:00 has no more VFs available but 0000:03:00 has
        # one VF available and that PF is also on physnet2
        self.assertEqual('0000:03:00.1', sriov_binding['pci_slot'])

    def test_one_sriov_port_no_vf_and_bandwidth_available_on_the_same_pf(self):
        """Verify that the boot fails if there is no PF that provides both
        bandwidth and available VFs.
        """

        # boot a server with a single sriov port that has no resource request
        sriov_port = self.neutron.sriov_port
        server = self._create_server(
            flavor=self.flavor_with_group_policy,
            networks=[{'port': sriov_port['id']}])

        self._wait_for_state_change(self.admin_api, server, 'ACTIVE')
        sriov_port = self.neutron.show_port(sriov_port['id'])['port']
        sriov_binding = sriov_port['binding:profile']

        # We expect that this consumes the last available VF from PF2
        self.assertEqual('0000:02:00.1', sriov_binding['pci_slot'])

        # Now boot a second server with a port that has a resource request.
        # At this point PF2 has available bandwidth but no available VF
        # and PF3 has an available VF but no available bandwidth, so we
        # expect the boot to fail.

        sriov_port_with_res_req = self.neutron.port_with_sriov_resource_request
        server = self._create_server(
            flavor=self.flavor_with_group_policy,
            networks=[{'port': sriov_port_with_res_req['id']}])

        # NOTE(gibi): It should be NoValidHost in an ideal world but that
        # would require the scheduler to detect the situation instead of the
        # pci claim. However that is pretty hard as the scheduler does not
        # know anything about allocation candidates (e.g. that the only
        # candidate for the port in this case is PF2); it sees the whole
        # host as a candidate and on this host there is a VF available for
        # the request even if that is on the wrong PF.
        server = self._wait_for_state_change(self.admin_api, server, 'ERROR')
        self.assertIn(
            'Exceeded maximum number of retries. Exhausted all hosts '
            'available for retrying build failures for instance',
            server['fault']['message'])


class PortResourceRequestReSchedulingTestIgnoreMicroversionCheck(
        PortResourceRequestBasedSchedulingTestBase):
