fix: Add additional steps from Azure scripts suggested by phagara#4
Conversation
Reviewer's Guide

Introduces Azure-specific enhancements to the HPC role by updating Microsoft RHUI packages, aligning the installed kernel headers and devel packages with the running kernel under versionlock protection, enabling RDMA support in the Azure agent, and adding corresponding defaults, variables, handlers, and an Azure integration test.

Sequence diagram for the Azure-specific kernel and package update process:

```mermaid
sequenceDiagram
    participant Ansible
    participant RHUI
    participant Kernel
    participant DNF
    participant System
    Ansible->>RHUI: Update rhui* packages (Microsoft repos only)
    Ansible->>Kernel: Get current kernel release version
    Ansible->>DNF: Install kernel-devel and kernel-headers for running kernel
    Ansible->>DNF: Ensure dnf-command(versionlock) is installed
    Ansible->>DNF: Check versionlock file
    Ansible->>DNF: Add versionlock for kernel packages if not present
    Ansible->>System: Notify for reboot if kernel packages updated
```
Sequence diagram for enabling RDMA in the waagent configuration:

```mermaid
sequenceDiagram
    participant Ansible
    participant Waagent
    participant System
    Ansible->>Waagent: Update /etc/waagent.conf (OS.EnableRDMA=y)
    Ansible->>System: Notify to restart waagent
    System->>Waagent: Restart waagent service
```
Entity relationship diagram for new/updated variables for Azure support:

```mermaid
erDiagram
    HPC_ROLE ||--|{ RHUI_PACKAGES : updates
    HPC_ROLE ||--|{ KERNEL_PACKAGES : aligns
    HPC_ROLE ||--|{ VERSIONLOCK : protects
    HPC_ROLE ||--|{ WAAGENT_CONF : enables_RDMA
    RHUI_PACKAGES {
        string name
        string state
        string enablerepo
        string disablerepo
    }
    KERNEL_PACKAGES {
        string name
        string state
    }
    VERSIONLOCK {
        string path
        string[] rpms
    }
    WAAGENT_CONF {
        string path
        string line
        bool create
        bool backup
    }
```
Class diagram for the updated Ansible tasks and handlers for Azure:

```mermaid
classDiagram
    class MainTasks {
        +Update RHUI packages
        +Get kernel release
        +Install kernel-devel/headers
        +Ensure versionlock
        +Check versionlock file
        +Add versionlock entries
        +Enable nvidia-driver
        +Enable RDMA in waagent
    }
    class Handlers {
        +Reboot system
        +Reboot waagent
    }
    class Variables {
        +__hpc_versionlock_path
        +__hpc_versionlock_rpms
        +hpc_cloud_provider
        +hpc_nvidia_rdma_packages (updated)
    }
    MainTasks --> Handlers : notifies
    MainTasks --> Variables : uses
```
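A minimal sketch of the two handlers summarized above, based on the diff context quoted later in this review (the service-restart body is an assumption beyond the quoted lines, and the handler name follows the rename suggested in the review below):

```yaml
# handlers/main.yml
- name: Reboot system
  include_tasks: tasks/reboot.yml

- name: Restart waagent
  service:
    name: waagent
    state: restarted
```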
Hey there - I've reviewed your changes and they look great!
Prompt for AI Agents
Please address the comments from this code review:
## Individual Comments
### Comment 1
<location> `tasks/main.yml:104` </location>
<code_context>
+ - kernel-devel-{{ __hpc_kernel_release.stdout }}
+ - kernel-headers-{{ __hpc_kernel_release.stdout }}
+ state: present
+ notify: Reboot system
+
+- name: Ensure that dnf-command(versionlock) is installed
</code_context>
<issue_to_address>
Reboot is triggered after installing kernel-devel and kernel-headers, which may not always be necessary.
Make the reboot notification conditional on actual kernel updates to prevent unnecessary system restarts.
Suggested implementation:
```
- name: >-
    Explicitly install kernel-devel and kernel-headers packages matching the
    currently running kernel
  package:
    name:
      - kernel-devel-{{ __hpc_kernel_release.stdout }}
      - kernel-headers-{{ __hpc_kernel_release.stdout }}
    state: present
  register: kernel_packages_result

- name: Trigger reboot if kernel packages changed
  debug:
    msg: Kernel packages changed, scheduling a reboot
  changed_when: true
  notify: Reboot system
  when: kernel_packages_result is changed
```
- You may want to remove the original `notify: Reboot system` from the package task, as the notification is now handled conditionally.
- Ensure that the `Reboot system` handler exists in your handlers file.
- Note that `notify` only fires when a task reports changed, so the separate task mainly makes the reboot condition explicit. Adjust as necessary for your playbook structure.
</issue_to_address>
### Comment 2
<location> `tasks/main.yml:129` </location>
<code_context>
+ register: __hpc_versionlock_check
+ changed_when: true
+ when: >-
+ not __hpc_versionlock_stat.stat.exists
+ or __hpc_versionlock_content.stdout is not search(item + "-[0-9]")
+ loop: "{{ __hpc_versionlock_rpms }}"
</code_context>
<issue_to_address>
The versionlock check may not handle multiple entries or edge cases in the versionlock file.
The current search approach may miss cases with different package naming conventions or multiple matching entries. Please use a parsing method that accurately identifies all relevant entries in the versionlock file.
Suggested implementation:
```
- name: Read versionlock file contents
  slurp:
    src: "{{ __hpc_versionlock_path }}"
  register: __hpc_versionlock_file_content
  when: __hpc_versionlock_stat.stat.exists

- name: Parse versionlock entries
  set_fact:
    __hpc_versionlock_entries: >-
      {{ (__hpc_versionlock_file_content.content | b64decode).splitlines()
         if __hpc_versionlock_stat.stat.exists else [] }}
```
Then make the add-versionlock task check the parsed entries instead of a plain substring search:
```
  register: __hpc_versionlock_check
  changed_when: true
  when: >-
    not __hpc_versionlock_stat.stat.exists
    or __hpc_versionlock_entries
       | select("search", item ~ "-[0-9]") | list | length == 0
  loop: "{{ __hpc_versionlock_rpms }}"
```
- Ensure that `__hpc_versionlock_stat` is registered (e.g. via a `stat` task on `__hpc_versionlock_path`) before these tasks run.
- You may need to adjust the path to the versionlock file if it differs in your environment.
- If you use roles or `include_tasks`, place the read and parse tasks accordingly.
</issue_to_address>
### Comment 3
<location> `tasks/main.yml:108` </location>
<code_context>
+
+- name: Ensure that dnf-command(versionlock) is installed
+ package:
+ name: dnf-command(versionlock)
state: present
- use: "{{ (__hpc_server_is_ostree | d(false)) |
</code_context>
<issue_to_address>
Installing dnf-command(versionlock) unconditionally may fail on systems using yum or older dnf versions.
Consider adding a conditional check or fallback to handle systems where dnf-command(versionlock) is unavailable, such as those using yum or older dnf versions.
Suggested implementation:
```
- name: Ensure that dnf-command(versionlock) is installed (dnf systems)
  package:
    name: dnf-command(versionlock)
    state: present
  when: ansible_pkg_mgr == "dnf"
```
If you want to support yum-based systems, you may need to add a separate task for yum-plugin-versionlock, e.g.:
```yaml
- name: Ensure that yum-plugin-versionlock is installed (if using yum)
  package:
    name: yum-plugin-versionlock
    state: present
  when: ansible_pkg_mgr == "yum"
```
On dnf-based systems the `dnf-command(versionlock)` provides-style name resolves to the versionlock plugin package, so a separate dnf version probe is unnecessary; note that searching `dnf --version` output for a literal `dnf-4` string would not match the actual output (e.g. `4.14.0`) and is best avoided.
</issue_to_address>
### Comment 4
<location> `handlers/main.yml:6` </location>
<code_context>
- name: Reboot system
include_tasks: tasks/reboot.yml
+
+- name: Reboot waagent
+ service:
+ name: waagent
</code_context>
<issue_to_address>
Restarting waagent may disrupt ongoing provisioning or VM operations.
Please add checks or notifications to prevent disruption of critical VM workflows when restarting waagent.
</issue_to_address>
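One hedged way to address the waagent-restart concern from Comment 4 is to gate the restart on the agent's provisioning marker file. This is only a sketch; the marker path `/var/lib/waagent/provisioned` is an assumption about the WALinuxAgent layout and should be verified for your agent version:

```yaml
- name: Check whether initial provisioning has completed
  stat:
    path: /var/lib/waagent/provisioned  # assumed WALinuxAgent marker file
  register: __hpc_waagent_provisioned

- name: Restart waagent only after provisioning has finished
  service:
    name: waagent
    state: restarted
  when: __hpc_waagent_provisioned.stat.exists
```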
tasks/main.yml (outdated)
```yaml
- name: Update RHUI packages from Microsoft repositories only
```
Hi @phagara, do you know why the role needs to have only "rhui-microsoft-*" repos enabled in this step?
I guess we explicitly want the version that Azure provides, not some other version from AppStream, even if the AppStream version is newer?
handlers/main.yml (outdated)

Suggested change:
```diff
-- name: Reboot waagent
+- name: Restart waagent
```
tasks/main.yml (outdated)
```yaml
    enablerepo: "rhui-microsoft-*"
  when: hpc_cloud_provider == "azure"

- name: Get current kernel release version
```
This is available as `ansible_facts["kernel"]`.

That's perfect, thanks.
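Following the fact-based suggestion above, the command task for `uname -r` can be dropped entirely. A sketch using the gathered `ansible_kernel` fact (package names taken from the diff context earlier in the review):

```yaml
- name: >-
    Explicitly install kernel-devel and kernel-headers packages matching the
    currently running kernel
  package:
    name:
      - "kernel-devel-{{ ansible_kernel }}"
      - "kernel-headers-{{ ansible_kernel }}"
    state: present
```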
tasks/main.yml (outdated)
```yaml
    line: "OS.EnableRDMA=y"
    create: yes
    backup: yes
  notify: Reboot waagent
```
Suggested change:
```diff
-  notify: Reboot waagent
+  notify: Restart waagent
```
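Putting the handler rename together with the quoted lineinfile options, the RDMA task might look like this (a sketch; the `regexp` is an assumption added so a commented-out default gets replaced rather than duplicated):

```yaml
- name: Enable RDMA in the Azure agent configuration
  lineinfile:
    path: /etc/waagent.conf
    regexp: '^#?\s*OS\.EnableRDMA='  # assumed; replaces a commented default if present
    line: "OS.EnableRDMA=y"
    create: yes
    backup: yes
  notify: Restart waagent
```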
tests/tests_azure.yml (outdated)
```yaml
    hpc_install_cuda_toolkit: false
  tasks:
    - name: Determine whether VM is running on Azure
      command: dmidecode -s system-manufacturer
```
There might be a fact for this. For example, these are the facts from my laptop:
```
"ansible_bios_vendor": "LENOVO",
"ansible_bios_version": "N35ET60W (1.60 )",
"ansible_board_asset_tag": "Not Available",
"ansible_board_name": "20WNS1F800",
"ansible_board_serial": "NA",
"ansible_board_vendor": "LENOVO",
"ansible_board_version": "Not Defined",
"ansible_chassis_asset_tag": "No Asset Information",
"ansible_chassis_serial": "NA",
"ansible_chassis_vendor": "LENOVO",
"ansible_chassis_version": "None",
```
I would be curious if this or some other fact could be used. I mean, this is OK, but it's better to minimize the use of `command` for things like this.

I searched and didn't find a fact related to the cloud provider on a machine in Azure; I'll take a closer look once again.

Here are the facts I got from RHEL 9.6 running in Azure: ansible_facts.txt

Considering "azure", the username and directories refer to it. That's not safe info to rely on, as users may use a username other than the default. Considering "microsoft", there is `"ansible_system_vendor": "Microsoft Corporation"`; same for the BIOS, board, and chassis vendors. I don't want to rely on this either. If there were a concrete fact saying `ansible_cloud_provider: "Microsoft Azure"`, I would use it. This is not the case, so the role would need to define a fact. `dmidecode -s system-manufacturer` would work for other cloud providers too, so I think it's a good approach.

I am changing my mind, because `ansible_system_vendor` has the same value as the output of `dmidecode -s system-manufacturer`, so we can rely on this.
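The fact-based detection settled on above could be sketched like this (the task-file name `azure.yml` is hypothetical, used only to illustrate the gating condition):

```yaml
- name: Run Azure-specific tasks only on Azure VMs
  include_tasks: azure.yml  # hypothetical task file with the Azure steps
  when: ansible_facts["system_vendor"] == "Microsoft Corporation"
```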
* Remove hpc_cloud_provider and don't run dmidecode - identify the cloud provider from the ansible_system_vendor fact
* s/Reboot waagent/Restart waagent/
* Use ansible_kernel instead of `uname -r`
[citest]

b520e92 was force-pushed to fd2ada4
Enhancement: Add additional steps from Azure scripts
Reason:
Result:
Summary by Sourcery
Add Azure-specific setup tasks including RHUI package updates, kernel version locking, NVIDIA driver installation, and waagent RDMA configuration, along with tests for Azure deployments
Enhancements:
Tests: