Closed
@@ -37,3 +37,18 @@ include::modules/machineset-non-guaranteed-instance.adoc[leveloffset=+1]

//Creating Spot Instances by using machine sets
include::modules/machineset-creating-non-guaranteed-instances.adoc[leveloffset=+2]

//Machine sets that enable AWS Elastic Fabric Adapter (EFA)
include::modules/machineset-efa-options.adoc[leveloffset=+1]

//Creating machines that use an AWS EFA
include::modules/machineset-creating-efa-options.adoc[leveloffset=+2]

//Enabling MPI workloads that use an AWS EFA
include::modules/machineset-enabling-efa-options.adoc[leveloffset=+2]
[role="_additional-resources"]
.Additional resources
* link:https://github.com/aws/libfabric[libfabric]
* xref:../../post_installation_configuration/node-tasks.adoc#configuring-huge-pages_post-install-node-tasks[Configuring huge pages]
* link:https://cloud.redhat.com/blog/how-to-use-kubeflow-and-the-mpi-operator-on-openshift[How to use Kubeflow and the MPI Operator on OpenShift]
* xref:../../hardware_enablement/psap-node-feature-discovery-operator.adoc#installing-the-node-feature-discovery-operator_node-feature-discovery-operator[Installing the Node Feature Discovery Operator]
44 changes: 44 additions & 0 deletions modules/machineset-creating-efa-options.adoc
@@ -0,0 +1,44 @@
// Module included in the following assemblies:
//
// * machine_management/creating_machinesets/creating-machineset-aws.adoc

:_content-type: PROCEDURE
[id="machineset-creating-efa-options_{context}"]
= Creating machines that use an AWS Elastic Fabric Adapter

You can deploy compute machines that use an AWS Elastic Fabric Adapter (EFA) by adding the `networkInterfaceType` field to the machine set YAML file for your compute machines.

[NOTE]
====
Machines that use an EFA must belong to security groups that allow all traffic between all hosts in the security group. You might find it helpful to create a dedicated security group for machines that use an EFA.

You can manually configure your security groups to support the use of an EFA by using the AWS Management Console or the AWS CLI. For more information, see the Amazon EC2 documentation about https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/working-with-security-groups.html#updating-security-group-rules[updating security group rules].
====

.Prerequisites

* You have configured security groups to support the use of an EFA.

.Procedure

. In a text editor, open the YAML file for an existing AWS machine set or create a new one.

. Add the following line under the `providerSpec` field:
+
[source,yaml]
----
providerSpec:
  value:
    networkInterfaceType: EFA <1>
----
<1> Specify the type of network interface to use. To use an EFA, set this value to `EFA`. To use a standard Elastic Network Adapter (ENA), set this value to `ENA`. If no value is specified, machines deployed by the machine set use a standard ENA.
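
For orientation, the following abridged sketch shows where the `providerSpec` stanza sits within a complete `MachineSet` object. It is illustrative only: the machine set name, the `c5n.18xlarge` instance type, and the security group filter value are assumptions and do not come from this procedure. Required fields such as `replicas`, `selector`, and the AMI are omitted for brevity.

[source,yaml]
----
apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  name: <infrastructure_id>-efa-<zone> # hypothetical name
  namespace: openshift-machine-api
spec:
  template:
    spec:
      providerSpec:
        value:
          instanceType: c5n.18xlarge # assumption: substitute any instance type that supports an EFA
          networkInterfaceType: EFA
          securityGroups:
          - filters:
            - name: tag:Name
              values:
              - <efa_security_group_name> # hypothetical: a group that allows all traffic between its members
----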

.Verification

. In the AWS Management Console, locate an EC2 instance that the machine set deployed.

. On the *Networking* tab, verify that the *Interface type* value under *Network interfaces* is `Elastic Fabric Adapter`.

.Next steps

* If you plan to run MPI workloads with the EFA node, you must install additional software. For more information, see "Enabling MPI workloads that use an AWS Elastic Fabric Adapter".
16 changes: 16 additions & 0 deletions modules/machineset-efa-options.adoc
@@ -0,0 +1,16 @@
// Module included in the following assemblies:
//
// * machine_management/creating_machinesets/creating-machineset-aws.adoc

:_content-type: CONCEPT
[id="machineset-efa-options_{context}"]
= Machine sets that support using an Elastic Fabric Adapter

You can use machine sets to create compute machines that use an link:https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html[Elastic Fabric Adapter] (EFA) as their primary network interface.

[NOTE]
====
Control plane machines cannot use an EFA as their primary network interface.
====

For more information about instance types that support using an EFA, see the Amazon EC2 documentation about https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types[supported instance types].
109 changes: 109 additions & 0 deletions modules/machineset-enabling-efa-options.adoc
@@ -0,0 +1,109 @@
// Module included in the following assemblies:
//
// * machine_management/creating_machinesets/creating-machineset-aws.adoc

:_content-type: PROCEDURE
[id="machineset-enabling-efa-options_{context}"]
= Enabling MPI workloads that use an AWS Elastic Fabric Adapter

After configuring a machine set to support the use of an AWS Elastic Fabric Adapter (EFA), you must install additional software to run MPI workloads.

For more information about using Kubeflow and the MPI Operator in {product-title}, including an example, see link:https://cloud.redhat.com/blog/how-to-use-kubeflow-and-the-mpi-operator-on-openshift[How to use Kubeflow and the MPI Operator on OpenShift].
Contributor

Why are we using a blog as the source for how to install and get set up to use EFA capabilities?

Contributor

Why are we cloning this (from https://github.com/kubeflow/mpi-operator) instead of installing it from OperatorHub?

.Prerequisites

* You have configured a machine set to support the use of an EFA.

.Procedure

. Create a machine configuration that allows for an unlimited `memlock`.

.. Generate a base64-encoded string for a file that removes `memlock` limits.
+
.Example raw data
[source,terminal]
----
[crio.runtime]
default_ulimits = [
        "memlock=-1:-1"
]
----
+
.Example base64-encoded data
[source,terminal]
----
W2NyaW8ucnVudGltZV0KZGVmYXVsdF91bGltaXRzID0gWwogICAgICAgICJtZW1sb2NrPS0xOi0xIgpdCg==
----

.. Create a file named `unlimited-memlock.yaml` with the following YAML definition:
+
[source,yaml]
----
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: 02-worker-container-runtime <1>
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - contents:
            source: data:text/plain;charset=utf-8;base64,<base64-encoded-memlock-data> <2>
          mode: 420
          overwrite: true
          path: /etc/crio/crio.conf.d/10-memlock <3>
----
<1> Specify a name for the machine configuration.
<2> Specify a base64-encoded string for the unlimited `memlock` file data.
<3> Specify the path for the `memlock` resource.

.. To create the `MachineConfig` object, enter the following command:
+
[source,terminal]
----
$ oc create -f unlimited-memlock.yaml
----

. Install link:https://github.com/aws/libfabric[libfabric] on your cluster.
Contributor Author

Hey @sferich888, is this an acceptable time to link to GitHub? AFAIK this is officially where it lives (@JoelSpeed, correct me if I'm overlooking a more legit link, please).

+
.Verification
+
Verify that the libfabric DaemonSet that exposes EFA capabilities is running by entering the following command and observing the output:
+
[source,terminal]
----
$ oc get po -n kube-system
----
+
.Example output
+
[source,terminal]
----
NAME                                        READY   STATUS    RESTARTS   AGE
aws-efa-k8s-device-plugin-daemonset-zz5p9   1/1     Running   0          8h
----

. Configure huge pages with a minimum size of 2MB to support the MPI Operator.
+
For more information, see "Configuring huge pages" in the "Node tasks" section of the "Post-installation configuration" documentation.
+
[NOTE]
====
The 2MB minimum huge page size that the MPI Operator requires might not be sufficient for the instance types in your cluster. Ensure that your huge page configuration meets your requirements.
====

. Install the Kubeflow MPI Operator on your cluster.
+
For more information, see link:https://cloud.redhat.com/blog/how-to-use-kubeflow-and-the-mpi-operator-on-openshift[How to use Kubeflow and the MPI Operator on OpenShift].

. Install the Node Feature Discovery Operator from the OperatorHub on your cluster.
+
For more information, see "Installing the Node Feature Discovery Operator" in the "Node Feature Discovery Operator" section of the "Specialized hardware and driver enablement" documentation.
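
After you install these components, a workload typically requests the EFA device and huge pages through its container resources. The following pod definition is a minimal sketch and not part of the documented procedure. It assumes that the device plugin advertises the resource as `vpc.amazonaws.com/efa`, and the pod name, image, and sizes are hypothetical placeholders. Kubernetes requires huge page requests to equal their limits.

[source,yaml]
----
apiVersion: v1
kind: Pod
metadata:
  name: efa-example # hypothetical name
spec:
  containers:
  - name: mpi-worker
    image: <mpi_image> # hypothetical image that contains your MPI application
    resources:
      requests:
        vpc.amazonaws.com/efa: 1 # assumed resource name; confirm it on your nodes
        hugepages-2Mi: 512Mi # illustrative size
        memory: 1Gi
      limits:
        vpc.amazonaws.com/efa: 1
        hugepages-2Mi: 512Mi
        memory: 1Gi
----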

.Verification

* Verify that the Node Feature Discovery Operator has configured the node status to show the EFA interface as an allocatable resource.
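
As a rough illustration of what to look for, the relevant part of the output of `oc get node <node_name> -o yaml` might resemble the following excerpt. This assumes that the EFA resource is advertised under the name `vpc.amazonaws.com/efa`. The quantities shown are placeholders, not values taken from a real cluster.

[source,yaml]
----
status:
  allocatable:
    hugepages-2Mi: 1Gi # illustrative value
    vpc.amazonaws.com/efa: "1" # assumed resource name; a nonzero count indicates an allocatable EFA
----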