Skip to content

fix: Add additional steps from from azure scripts suggested by phagara#4

Merged
spetrosi merged 11 commits intolinux-system-roles:mainfrom
spetrosi:install-from-source
Sep 2, 2025
Merged

fix: Add additional steps from from azure scripts suggested by phagara#4
spetrosi merged 11 commits intolinux-system-roles:mainfrom
spetrosi:install-from-source

Conversation

@spetrosi
Copy link
Collaborator

@spetrosi spetrosi commented Aug 29, 2025

Enhancement: Add additional steps from azure scripts

Reason:

Result:

Summary by Sourcery

Add Azure-specific setup tasks including RHUI package updates, kernel version locking, NVIDIA driver installation, and waagent RDMA configuration, along with tests for Azure deployments

Enhancements:

  • Update Microsoft RHUI repositories with latest packages when on Azure
  • Fetch the current kernel version and install matching kernel-devel and kernel-headers, triggering a system reboot
  • Ensure dnf versionlock plugin is installed and lock kernel-related packages to the running version
  • Enable proprietary NVIDIA driver installation for Azure instances
  • Configure waagent to enable RDMA and add a handler to restart the service
  • Introduce hpc_cloud_provider default variable set to azure

Tests:

  • Add tests/tests_azure.yml to verify role behavior on Azure VMs

@spetrosi spetrosi requested a review from richm as a code owner August 29, 2025 15:49
@sourcery-ai
Copy link

sourcery-ai bot commented Aug 29, 2025

Reviewer's Guide

Introduces Azure-specific enhancements to the HPC role by updating Microsoft RHUI packages, aligning installed kernel headers and devel with the running kernel under versionlock protection, enabling RDMA support in the Azure agent, and adding corresponding defaults, variables, handlers, and an Azure integration test.

Sequence diagram for Azure-specific kernel and package update process

sequenceDiagram
    participant Ansible
    participant RHUI
    participant Kernel
    participant DNF
    participant System
    Ansible->>RHUI: Update rhui* packages (Microsoft repos only)
    Ansible->>Kernel: Get current kernel release version
    Ansible->>DNF: Install kernel-devel and kernel-headers for running kernel
    Ansible->>DNF: Ensure dnf-command(versionlock) is installed
    Ansible->>DNF: Check versionlock file
    Ansible->>DNF: Add versionlock for kernel packages if not present
    Ansible->>System: Notify for reboot if kernel packages updated
Loading

Sequence diagram for enabling RDMA in waagent configuration

sequenceDiagram
    participant Ansible
    participant Waagent
    participant System
    Ansible->>Waagent: Update /etc/waagent.conf (OS.EnableRDMA=y)
    Ansible->>System: Notify to restart waagent
    System->>Waagent: Restart waagent service
Loading

Entity relationship diagram for new/updated variables for Azure support

erDiagram
    HPC_ROLE ||--|{ RHUI_PACKAGES : updates
    HPC_ROLE ||--|{ KERNEL_PACKAGES : aligns
    HPC_ROLE ||--|{ VERSIONLOCK : protects
    HPC_ROLE ||--|{ WAAGENT_CONF : enables_RDMA
    RHUI_PACKAGES {
      string name
      string state
      string enablerepo
      string disablerepo
    }
    KERNEL_PACKAGES {
      string name
      string state
    }
    VERSIONLOCK {
      string path
      string[] rpms
    }
    WAAGENT_CONF {
      string path
      string line
      bool create
      bool backup
    }
Loading

Class diagram for updated Ansible tasks and handlers for Azure

classDiagram
    class MainTasks {
      +Update RHUI packages
      +Get kernel release
      +Install kernel-devel/headers
      +Ensure versionlock
      +Check versionlock file
      +Add versionlock entries
      +Enable nvidia-driver
      +Enable RDMA in waagent
    }
    class Handlers {
      +Reboot system
      +Reboot waagent
    }
    class Variables {
      +__hpc_versionlock_path
      +__hpc_versionlock_rpms
      +hpc_cloud_provider
      +hpc_nvidia_rdma_packages (updated)
    }
    MainTasks --> Handlers : notifies
    MainTasks --> Variables : uses
Loading

File-Level Changes

Change Details Files
Add Azure-specific kernel and package management in tasks
  • Update RHUI packages from Microsoft repositories only
  • Fetch current kernel release via uname -r
  • Install matching kernel-devel and kernel-headers and notify reboot
  • Ensure dnf-command(versionlock) is installed and lock kernel RPMs
tasks/main.yml
Enable RDMA support in waagent configuration
  • Add OS.EnableRDMA=y lineinfile in /etc/waagent.conf
  • Introduce Reboot waagent handler to restart the service
tasks/main.yml
handlers/main.yml
Introduce versionlock variables for kernel package locking
  • Define __hpc_versionlock_path for the versionlock file
  • List __hpc_versionlock_rpms covering all kernel-related packages
vars/main.yml
Set default cloud provider to Azure
  • Add hpc_cloud_provider default as azure
defaults/main.yml
Add Azure-specific integration test playbook
  • Create tests/tests_azure.yml to validate role behavior on Azure
tests/tests_azure.yml

Tips and commands

Interacting with Sourcery

  • Trigger a new review: Comment @sourcery-ai review on the pull request.
  • Continue discussions: Reply directly to Sourcery's review comments.
  • Generate a GitHub issue from a review comment: Ask Sourcery to create an
    issue from a review comment by replying to it. You can also reply to a
    review comment with @sourcery-ai issue to create an issue from it.
  • Generate a pull request title: Write @sourcery-ai anywhere in the pull
    request title to generate a title at any time. You can also comment
    @sourcery-ai title on the pull request to (re-)generate the title at any time.
  • Generate a pull request summary: Write @sourcery-ai summary anywhere in
    the pull request body to generate a PR summary at any time exactly where you
    want it. You can also comment @sourcery-ai summary on the pull request to
    (re-)generate the summary at any time.
  • Generate reviewer's guide: Comment @sourcery-ai guide on the pull
    request to (re-)generate the reviewer's guide at any time.
  • Resolve all Sourcery comments: Comment @sourcery-ai resolve on the
    pull request to resolve all Sourcery comments. Useful if you've already
    addressed all the comments and don't want to see them anymore.
  • Dismiss all Sourcery reviews: Comment @sourcery-ai dismiss on the pull
    request to dismiss all existing Sourcery reviews. Especially useful if you
    want to start fresh with a new review - don't forget to comment
    @sourcery-ai review to trigger a new review!

Customizing Your Experience

Access your dashboard to:

  • Enable or disable review features such as the Sourcery-generated pull request
    summary, the reviewer's guide, and others.
  • Change the review language.
  • Add, remove or edit custom review instructions.
  • Adjust other review settings.

Getting Help

Copy link

@sourcery-ai sourcery-ai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey there - I've reviewed your changes and they look great!

Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments

### Comment 1
<location> `tasks/main.yml:104` </location>
<code_context>
+      - kernel-devel-{{ __hpc_kernel_release.stdout }}
+      - kernel-headers-{{ __hpc_kernel_release.stdout }}
+    state: present
+  notify: Reboot system
+
+- name: Ensure that dnf-command(versionlock) is installed
</code_context>

<issue_to_address>
Reboot is triggered after installing kernel-devel and kernel-headers, which may not always be necessary.

Make the reboot notification conditional on actual kernel updates to prevent unnecessary system restarts.

Suggested implementation:

```
- name: >-
    Explicitly install kernel-devel and kernel-headers packages matching the
    currently running kernel
  package:
    name:
      - kernel-devel-{{ __hpc_kernel_release.stdout }}
      - kernel-headers-{{ __hpc_kernel_release.stdout }}
    state: present
  register: kernel_packages_result

- name: Reboot system if kernel packages were updated
  meta: flush_handlers

- name: Trigger reboot if kernel packages changed
  meta: clear_host_errors
  notify: Reboot system
  when: kernel_packages_result is changed

```

- You may want to remove the original `notify: Reboot system` from the package task, as the notification is now handled conditionally.
- Ensure that the `Reboot system` handler exists in your handlers file.
- The use of `meta: flush_handlers` and `meta: clear_host_errors` is to ensure the handler is triggered immediately and only if needed. Adjust as necessary for your playbook structure.
</issue_to_address>

### Comment 2
<location> `tasks/main.yml:129` </location>
<code_context>
+  register: __hpc_versionlock_check
+  changed_when: true
+  when: >-
+    not __hpc_versionlock_stat.stat.exists
+    or __hpc_versionlock_content.stdout is not search(item  + "-[0-9]")
+  loop: "{{ __hpc_versionlock_rpms }}"

</code_context>

<issue_to_address>
The versionlock check may not handle multiple entries or edge cases in the versionlock file.

The current search approach may miss cases with different package naming conventions or multiple matching entries. Please use a parsing method that accurately identifies all relevant entries in the versionlock file.

Suggested implementation:

```
  register: __hpc_versionlock_check
  changed_when: true
  vars:
    __hpc_versionlock_file: /etc/dnf/plugins/versionlock.list
  pre_tasks:
    - name: Read versionlock file contents
      ansible.builtin.slurp:
        src: "{{ __hpc_versionlock_file }}"
      register: __hpc_versionlock_file_content
      when: __hpc_versionlock_stat.stat.exists

    - name: Parse versionlock entries for item
      set_fact:
        __hpc_versionlock_matches: >-
          {{
            (__hpc_versionlock_file_content['content'] | b64decode).splitlines()
            | select('match', item ~ '-[0-9]')
            | list
          }}
      when: __hpc_versionlock_stat.stat.exists

  when: >-
    not __hpc_versionlock_stat.stat.exists
    or (__hpc_versionlock_matches | length == 0)
  loop: "{{ __hpc_versionlock_rpms }}"

```

- Ensure that `__hpc_versionlock_stat` and `__hpc_versionlock_content` are set appropriately before these tasks.
- You may need to adjust the path to the versionlock file if it differs in your environment.
- If you use roles or include_tasks, move the `pre_tasks` block accordingly.
</issue_to_address>

### Comment 3
<location> `tasks/main.yml:108` </location>
<code_context>
+
+- name: Ensure that dnf-command(versionlock) is installed
+  package:
+    name: dnf-command(versionlock)
     state: present
-    use: "{{ (__hpc_server_is_ostree | d(false)) |
</code_context>

<issue_to_address>
Installing dnf-command(versionlock) unconditionally may fail on systems using yum or older dnf versions.

Consider adding a conditional check or fallback to handle systems where dnf-command(versionlock) is unavailable, such as those using yum or older dnf versions.

Suggested implementation:

```
- name: Check if dnf is the package manager
  set_fact:
    __hpc_is_dnf: "{{ ansible_pkg_mgr == 'dnf' }}"

- name: Get dnf version
  command: dnf --version
  register: dnf_version_output
  changed_when: false
  failed_when: false
  when: __hpc_is_dnf

- name: Ensure that dnf-command(versionlock) is installed (only if dnf is available and version >= 4)
  package:
    name: dnf-command(versionlock)
    state: present
  when:
    - __hpc_is_dnf
    - dnf_version_output.stdout is search('dnf-4')

```

If you want to support yum-based systems, you may need to add a separate task for yum-plugin-versionlock, e.g.:

```yaml
- name: Ensure that yum-plugin-versionlock is installed (if using yum)
  package:
    name: yum-plugin-versionlock
    state: present
  when: ansible_pkg_mgr == 'yum'
```

Also, you may want to adjust the dnf version check if you need to support other dnf versions. The current check only installs dnf-command(versionlock) for dnf-4.x.
</issue_to_address>

### Comment 4
<location> `handlers/main.yml:6` </location>
<code_context>
 - name: Reboot system
   include_tasks: tasks/reboot.yml
+
+- name: Reboot waagent
+  service:
+    name: waagent
</code_context>

<issue_to_address>
Restarting waagent may disrupt ongoing provisioning or VM operations.

Please add checks or notifications to prevent disruption of critical VM workflows when restarting waagent.
</issue_to_address>

Sourcery is free for open source - if you like our reviews please consider sharing them ✨
Help me be more useful! Please click 👍 or 👎 on each comment and I'll use the feedback to improve your reviews.

Comment on lines +129 to +130
not __hpc_versionlock_stat.stat.exists
or __hpc_versionlock_content.stdout is not search(item + "-[0-9]")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

suggestion: The versionlock check may not handle multiple entries or edge cases in the versionlock file.

The current search approach may miss cases with different package naming conventions or multiple matching entries. Please use a parsing method that accurately identifies all relevant entries in the versionlock file.

Suggested implementation:

  register: __hpc_versionlock_check
  changed_when: true
  vars:
    __hpc_versionlock_file: /etc/dnf/plugins/versionlock.list
  pre_tasks:
    - name: Read versionlock file contents
      ansible.builtin.slurp:
        src: "{{ __hpc_versionlock_file }}"
      register: __hpc_versionlock_file_content
      when: __hpc_versionlock_stat.stat.exists

    - name: Parse versionlock entries for item
      set_fact:
        __hpc_versionlock_matches: >-
          {{
            (__hpc_versionlock_file_content['content'] | b64decode).splitlines()
            | select('match', item ~ '-[0-9]')
            | list
          }}
      when: __hpc_versionlock_stat.stat.exists

  when: >-
    not __hpc_versionlock_stat.stat.exists
    or (__hpc_versionlock_matches | length == 0)
  loop: "{{ __hpc_versionlock_rpms }}"

  • Ensure that __hpc_versionlock_stat and __hpc_versionlock_content are set appropriately before these tasks.
  • You may need to adjust the path to the versionlock file if it differs in your environment.
  • If you use roles or include_tasks, move the pre_tasks block accordingly.

- name: Reboot system
include_tasks: tasks/reboot.yml

- name: Reboot waagent
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

question (bug_risk): Restarting waagent may disrupt ongoing provisioning or VM operations.

Please add checks or notifications to prevent disruption of critical VM workflows when restarting waagent.

tasks/main.yml Outdated
mount_point: "{{ hpc_usrlv_mount }}"

- name: Enable proprietary nvidia-driver
- name: Update RHUI packages from Microsoft repositories only
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @phagara, do you know why role needs to have only "rhui-microsoft-*" repos enabled in this step?
I guess we explicitly want the version that azure provides, not some other version from AppStream, even if AppStream version is newer?

- name: Reboot system
include_tasks: tasks/reboot.yml

- name: Reboot waagent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
- name: Reboot waagent
- name: Restart waagent

tasks/main.yml Outdated
enablerepo: "rhui-microsoft-*"
when: hpc_cloud_provider == "azure"

- name: Get current kernel release version
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is available as ansible_facts["kernel"]

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's perfect, thanks

tasks/main.yml Outdated
line: "OS.EnableRDMA=y"
create: yes
backup: yes
notify: Reboot waagent
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
notify: Reboot waagent
notify: Restart waagent

hpc_install_cuda_toolkit: false
tasks:
- name: Determine whether VM is running on Azure
command: dmidecode -s system-manufacturer
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There might be a fact for this. For example, these are the facts from my laptop:

        "ansible_bios_vendor": "LENOVO",
        "ansible_bios_version": "N35ET60W (1.60 )",
        "ansible_board_asset_tag": "Not Available",
        "ansible_board_name": "20WNS1F800",
        "ansible_board_serial": "NA",
        "ansible_board_vendor": "LENOVO",
        "ansible_board_version": "Not Defined",
        "ansible_chassis_asset_tag": "No Asset Information",
        "ansible_chassis_serial": "NA",
        "ansible_chassis_vendor": "LENOVO",
        "ansible_chassis_version": "None",

would be curious if this or some other fact could be used.

I mean, this is ok, but better to minimize the use of command for things like this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I searched and didn't find a fact related to cloud on a machine in azure, I'll take closer look once again

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is what facts I got from rhel 9.6 running in Azure.
ansible_facts.txt

Considering azure, username and directories refer to it. Not a safe info to rely on as users may have different username other than default.
Considering microsoft, there is "ansible_system_vendor": "Microsoft Corporation". Same for Bios, board, and chassis vendor. I don't want to rely on this too.

If there were a concrete fact saying ansible_cloud_provider: "Microsoft Azure", I would use it. This is not the case, so will need to make the role define a fact.

dmidecode -s system-manufacturer would work for other cloud providers too, so I think it's a good approach.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am changing my mind, because ansible_system_vendor has the same value as the output of dmidecode -s system-manufacturer. So we can rely on this.

* Remove hpc_cloud_provider and don't run dmidecode - identify CP from
ansible_system_vendor fact
* s/Reboot waagent/Restart waagent
* Use ansible_kernel instead of `uname -r`
@spetrosi spetrosi changed the title Add additional steps from from azure scripts suggested by phagara fix: Add additional steps from from azure scripts suggested by phagara Sep 1, 2025
@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 1, 2025

[citest]

@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 1, 2025

[citest]

@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 1, 2025

[citest]

@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 1, 2025

[citest]

@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 1, 2025

[citest]

@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 2, 2025

[citest]

@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 2, 2025

[citest]

1 similar comment
@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 2, 2025

[citest]

@spetrosi spetrosi force-pushed the install-from-source branch from b520e92 to fd2ada4 Compare September 2, 2025 09:57
@spetrosi
Copy link
Collaborator Author

spetrosi commented Sep 2, 2025

[citest]

@spetrosi spetrosi merged commit 2cfbec1 into linux-system-roles:main Sep 2, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants