Skip to content

Conversation

@pacevedom
Copy link
Contributor

@pacevedom pacevedom commented Oct 23, 2025

Resolves #80

This PR adds multinode cluster management infrastructure to MicroShift, enabling creation and management of multi-node Kubernetes clusters using Podman.

Key Changes

Cluster Management Script:

  • New src/cluster_manager.sh script providing cluster lifecycle operations
  • Commands: create, add-node, start, stop, delete, ready, healthy, status
  • TopoLVM-backed storage setup with loopback devices and LVM
  • Podman network management with configurable isolation

Makefile Integration:

  • New targets: add-node, start, stop, delete, run-status, run-ready, run-healthy
  • run target refactored to use cluster_manager.sh
  • clean target now depends on delete for cluster cleanup

Infrastructure Configuration:

  • Firewall rules for API server (6443) and etcd (2379, 2380) ports
  • Sysctl tuning for inotify limits (max_user_watches: 524288, max_user_instances: 16384)
  • Changes in src/rpm/postinstall.sh

CI/CD Enhancements:

  • New multi-node input in build action (integer, default: 0)
  • Multi-node bootc image build workflow
  • Updated health check logic in quick-start-clean action

Changed Files

  • Makefile
  • src/cluster_manager.sh (new)
  • src/rpm/postinstall.sh
  • .github/actions/build/action.yaml
  • .github/actions/quick-start-clean/action.yaml
  • .github/workflows/builders.yaml

Summary by CodeRabbit

  • New Features

    • New lifecycle commands: add-node, start, stop, run-status; run delegates to a centralized cluster manager.
  • Improvements

    • Centralized multi-node lifecycle and cluster-wide readiness/health reporting; polling-based readiness checks.
    • TopoLVM lifecycle and cleanup now managed centrally.
    • System tuning: opened apiserver/etcd ports and increased inotify limits.
  • CI

    • Workflows/actions support multi-node runs via node-count and expanded start-stop tests.
  • Documentation

    • Updated run/usage docs with multi-node, start/add-node, and status guidance.
  • Removed

    • Legacy login and private orchestration targets removed.

@coderabbitai
Copy link

coderabbitai bot commented Oct 23, 2025

Walkthrough

Makefile targets were reworked to delegate cluster lifecycle to a new src/cluster_manager.sh. A Bash multi-node cluster manager and TopoLVM helpers were added. CI actions/workflows were updated for multi-node start/add-node/stop tests and polling-based health checks. RPM postinstall adds firewall and sysctl entries for multinode clusters.

Changes

Cohort / File(s) Summary
Makefile targets
Makefile
Added add-node, start, stop, run-status; run now calls ./src/cluster_manager.sh create with env overrides; removed login; clean delegates to cluster_manager.sh delete; removed private orchestration targets (_isolated_network_config, _topolvm_create, _topolvm_delete).
Cluster manager script
src/cluster_manager.sh
New Bash CLI implementing cluster lifecycle and probes: create/add-node/start/stop/destroy/ready/healthy/status and topolvm-create/topolvm-delete; manages Podman networking, node provisioning/joining, subnet/IP handling, TopoLVM backend, and error reporting; enforces project-root execution.
RPM postinstall
src/rpm/postinstall.sh
Added firewall-offline-cmd rules to open 6443/tcp, 2379/tcp, 2380/tcp in public zone and added /etc/sysctl.d/99-microshift.conf to raise inotify limits for multinode setups; placement noted relative to apiserver/etcd trusted-source rules.
CI action input & flow
.github/actions/build/action.yaml
Added node-count (integer, default 1) input; expanded test flow to loop add-node for extra nodes and run run-healthy after adding nodes (start-stop multi-node test).
CI workflow job
.github/workflows/builders.yaml
Set node-count: 2 for centos9-bootc, centos10-bootc, and fedora-bootc bootc jobs to enable multi-node tests.
Quick-start cleanup action
.github/actions/quick-start-clean/action.yaml
Replaced sequential run-ready/run-healthy with a polling loop (up to 100 iterations, 10s interval) checking greenboot-healthcheck SubState and returning success when it is exited; uses fallback unknown when status retrieval fails.
Build-deb action
.github/actions/build-deb/action.yaml
Replaced make _topolvm_create with ./src/cluster_manager.sh topolvm-create and added explicit failure propagation (`
Docs
docs/run.md
Documented new targets: start, add-node, run-status; renamed LVM_VOLUME_SIZELVM_VOLSIZE; added multi-node usage, podman network notes, examples for login/start/add-node/status.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant User
  participant Makefile
  participant ClusterMgr as cluster_manager.sh
  participant Podman
  participant TopoLVM

  User->>Makefile: make run
  Makefile->>ClusterMgr: ./src/cluster_manager.sh create
  alt TopoLVM requested
    ClusterMgr->>TopoLVM: topolvm-create
  end
  ClusterMgr->>Podman: create network, provision node-1, start microshift
  ClusterMgr-->>Makefile: exit/status

  rect rgb(240,250,240)
    Note right of ClusterMgr: multi-node flow (CI or user)
    User->>Makefile: make add-node / CI loop
    Makefile->>ClusterMgr: ./src/cluster_manager.sh add-node
    ClusterMgr->>Podman: provision node-N, join cluster
    ClusterMgr->>ClusterMgr: run-ready / run-healthy checks
    ClusterMgr-->>Makefile: exit/status
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Pay special attention to:
    • src/cluster_manager.sh: node join logic, IP/subnet handling, Podman network lifecycle, TopoLVM create/delete, and error/existing-state handling.
    • Makefile: correctness of env forwarding and mappings to cluster_manager subcommands.
    • CI changes: node-count loop and quick-start-clean polling semantics.
    • src/rpm/postinstall.sh: idempotency when inserting firewall/sysctl rules.

Possibly related PRs

Suggested reviewers

  • agullon
  • eslutsky
  • pmtk

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 2.63% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (4 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The title "Feature: Add multinode support" clearly and concisely summarizes the primary objective of the changeset. It directly corresponds to the main work implemented across the PR: introducing multi-node cluster management infrastructure via cluster_manager.sh, new Makefile targets (add-node, start, stop, run-status), and accompanying documentation updates. The title is specific enough to convey the feature's purpose while remaining appropriately brief.
Linked Issues Check ✅ Passed The PR successfully addresses the objectives from linked issue #80. It implements upstream multi-node support through src/cluster_manager.sh with full cluster lifecycle operations (create, add-node, start, stop, destroy, ready, healthy, status), integrates these via new Makefile targets, and provides accompanying documentation in docs/run.md that covers multi-node setup, node addition, and cluster status checks. Infrastructure adjustments for firewall and sysctl tuning support multi-node deployments, and CI/CD workflows are updated to test multi-node configurations. All coding requirements from the issue are met.
Out of Scope Changes Check ✅ Passed All changes in the PR are directly related to implementing multi-node support as specified in issue #80. The Makefile refactoring, cluster_manager.sh implementation, infrastructure adjustments (firewall rules and sysctl tuning), CI/CD workflow updates, and documentation additions all serve the core objective of enabling multi-node cluster management. No unrelated refactoring, feature additions, or bug fixes outside the multi-node support scope are present.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4dbe316 and 607c15a.

📒 Files selected for processing (8)
  • .github/actions/build-deb/action.yaml (1 hunks)
  • .github/actions/build/action.yaml (2 hunks)
  • .github/actions/quick-start-clean/action.yaml (1 hunks)
  • .github/workflows/builders.yaml (3 hunks)
  • Makefile (2 hunks)
  • docs/run.md (3 hunks)
  • src/cluster_manager.sh (1 hunks)
  • src/rpm/postinstall.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • .github/actions/build/action.yaml
  • src/cluster_manager.sh
  • .github/actions/build-deb/action.yaml
🧰 Additional context used
🧠 Learnings (4)
📚 Learning: 2025-10-23T11:55:38.018Z
Learnt from: pacevedom
PR: microshift-io/microshift#103
File: src/image/postinstall.sh:52-54
Timestamp: 2025-10-23T11:55:38.018Z
Learning: In MicroShift multinode deployments at src/image/postinstall.sh, etcd ports 2379 and 2380 are exposed in the public firewall zone because nodes may be deployed across different networks and need to communicate with each other for cluster formation and operation.

Applied to files:

  • src/rpm/postinstall.sh
📚 Learning: 2025-10-29T07:41:49.717Z
Learnt from: pacevedom
PR: microshift-io/microshift#103
File: src/cluster_manager.sh:112-128
Timestamp: 2025-10-29T07:41:49.717Z
Learning: In src/cluster_manager.sh, the file `/var/lib/microshift/resources/kubeadmin/$node_name/kubeconfig` is guaranteed to exist as part of the MicroShift startup process. If MicroShift is ready, this kubeconfig file will be present, so validation checks before copying are not required in the `_join_node` function.

Applied to files:

  • src/rpm/postinstall.sh
  • Makefile
  • docs/run.md
📚 Learning: 2025-10-29T07:57:10.176Z
Learnt from: pacevedom
PR: microshift-io/microshift#103
File: src/cluster_manager.sh:327-327
Timestamp: 2025-10-29T07:57:10.176Z
Learning: In src/cluster_manager.sh, the file `/var/lib/microshift/resources/kubeadmin/kubeconfig` (without a node name) is guaranteed to exist if MicroShift has started and is ready, as it is used internally. This generic kubeconfig path is appropriate for cluster-wide operations like querying cluster status, while node-specific paths like `/var/lib/microshift/resources/kubeadmin/${node_name}/kubeconfig` are used for node-specific operations like joining nodes.

Applied to files:

  • src/rpm/postinstall.sh
  • Makefile
  • docs/run.md
📚 Learning: 2025-10-17T07:44:32.742Z
Learnt from: ggiguash
PR: microshift-io/microshift#92
File: .github/workflows/release.yaml:44-50
Timestamp: 2025-10-17T07:44:32.742Z
Learning: When Podman builds an image without a registry prefix (e.g., `podman build -t microshift-okd`), it automatically adds the `localhost/` prefix and `:latest` tag, resulting in `localhost/microshift-okd:latest`. This means the Makefile in microshift-io/microshift building with `-t microshift-okd` produces `localhost/microshift-okd:latest` without explicit retagging.

Applied to files:

  • .github/actions/quick-start-clean/action.yaml
  • docs/run.md
🪛 markdownlint-cli2 (0.18.1)
docs/run.md

101-101: Table pipe style
Expected: leading_and_trailing; Actual: leading_only; Missing trailing pipe

(MD055, table-pipe-style)


102-102: Table pipe style
Expected: leading_and_trailing; Actual: leading_only; Missing trailing pipe

(MD055, table-pipe-style)


145-145: Multiple headings with the same content

(MD024, no-duplicate-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: quick-start-and-clean
  • GitHub Check: ubuntu-rpm2deb
  • GitHub Check: isolated-network-ovnk
  • GitHub Check: isolated-network-kindnet
  • GitHub Check: fedora-bootc
  • GitHub Check: centos10-bootc
  • GitHub Check: centos9-bootc
🔇 Additional comments (9)
.github/actions/quick-start-clean/action.yaml (1)

21-28: Polling loop for greenboot-healthcheck looks solid.

The loop is properly closed and the fallback to "unknown" handles retrieval failures gracefully. Past critical issue (missing done keyword) has been resolved.

src/rpm/postinstall.sh (2)

43-46: Firewall configuration is well-documented and justified.

The public zone exposure for etcd ports (2379, 2380) is appropriate for cross-network multinode deployments, as documented in learnings. The comment explains the reasoning clearly.


48-52: Sysctl inotify limits are appropriately configured.

These are standard Kubernetes watchlist limits to prevent resource exhaustion under normal multinode cluster operations.

Makefile (2)

33-33: Help text accurately reflects the new multinode workflow.

Clear, concise descriptions of all targets including new add-node, start, stop, and run-status.

Also applies to: 36-40, 45-47


108-123: Target delegation to cluster_manager.sh is clean and consistent.

All new targets (add-node, start, stop, run-status) and refactored targets (run, clean) follow the same pattern: environment variables are passed through and the script handles lifecycle operations. This centralizes cluster management logic effectively.

Also applies to: 146-152

.github/workflows/builders.yaml (1)

26-26: Multi-node test configuration is consistent across build jobs.

All three bootc builds (CentOS 9, 10, Fedora) are configured with node-count: 2, enabling 2-node cluster testing. Input name is correct.

Also applies to: 46-46, 66-66

docs/run.md (3)

104-104: Multinode setup documentation is clear and accurate.

The guidance on single-node creation, TopoLVM configuration (configurable LVM_VOLSIZE), and podman network setup for multinode support is well-articulated and aligns with the underlying script changes.

Also applies to: 109-111


130-136: Container login guidance is practical and well-exemplified.

Clarifies where login commands are displayed and provides a concrete example with proper node naming (microshift-okd-1).


161-184: New multinode and status check sections are well-documented.

Clear explanations of the add-node workflow, ISOLATED_NETWORK constraints, and cluster status commands. Integration with new Makefile targets is explicit and correct.


Comment @coderabbitai help to get the list of available commands and usage tips.

@pacevedom
Copy link
Contributor Author

@CodeRabbit review the changes

@coderabbitai
Copy link

coderabbitai bot commented Oct 23, 2025

@pacevedom: I'll review the changes in this PR for adding multinode support. I'll analyze the code changes and provide feedback shortly.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (2)
Makefile (1)

148-150: Consider adding a confirmation message for consistency.

The multinode-delete target successfully cleans up the cluster, but unlike multinode-create, it doesn't echo a confirmation message after completion.

Apply this diff to add a confirmation message:

 .PHONY: multinode-delete
 multinode-delete:
 	@USHIFT_IMAGE="${USHIFT_IMAGE}" LVM_DISK=${LVM_DISK} LVM_VOLSIZE=${LVM_VOLSIZE} VG_NAME=${VG_NAME} ./src/multinode/multinode.sh delete
+	@echo "Cluster has been deleted"
src/multinode/multinode.sh (1)

251-273: Consider using consistent grep syntax for maintainability.

The function correctly cleans up all cluster resources. However, Line 252 uses BRE syntax ([0-9]\+) while other commands in the script (lines 191, 217, 225, 238) use ERE syntax with the -E flag ([0-9]+). While both work, consistency improves readability.

Apply this diff for consistency:

-    containers=$(sudo podman ps -a --format '{{.Names}}' | grep "^${NODE_BASE_NAME}[0-9]\+") || true
+    containers=$(sudo podman ps -a --format '{{.Names}}' | grep -E "^${NODE_BASE_NAME}[0-9]+") || true
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 72ba79a and 1a0a01d.

📒 Files selected for processing (7)
  • Makefile (2 hunks)
  • packaging/microshift-builder.Containerfile (1 hunks)
  • packaging/microshift-runner.Containerfile (2 hunks)
  • src/image/postinstall.sh (1 hunks)
  • src/multinode/dropins/microshift-multinode.conf (1 hunks)
  • src/multinode/multinode.sh (1 hunks)
  • src/multinode/multinode.spec (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-17T10:53:08.437Z
Learnt from: ggiguash
PR: microshift-io/microshift#57
File: .github/workflows/builders.yaml:75-84
Timestamp: 2025-10-17T10:53:08.437Z
Learning: In the microshift repository, the RPM-to-Debian conversion workflow follows this pattern: `convert.sh` receives an RPM directory as input and outputs .deb files to a `deb/` subdirectory within that directory. The `install.sh` script expects to receive the parent RPM directory (not the deb subdirectory) and uses recursive `find` commands to locate the .deb files in subdirectories.

Applied to files:

  • packaging/microshift-builder.Containerfile
🪛 RuboCop (1.81.1)
src/multinode/multinode.spec

[fatal] 4-4: unterminated string meets end of file
(Using Ruby 2.7 parser; configure using TargetRubyVersion parameter, under AllCops)

(Lint/Syntax)

🔇 Additional comments (20)
src/image/postinstall.sh (1)

56-60: System tuning for multinode workloads looks good.

The inotify limit increases (max_user_watches and max_user_instances) are appropriate for a Kubernetes cluster with multiple nodes. These settings are standard to prevent "too many open files" errors from kubelet and container runtimes. Configuration is idempotent and follows standard sysctl naming conventions.

src/multinode/dropins/microshift-multinode.conf (1)

1-4: LGTM! Systemd drop-in is correctly structured.

The drop-in follows the standard pattern of clearing the inherited ExecStart before setting a new one, which is necessary to avoid systemd attempting to run both commands.

packaging/microshift-builder.Containerfile (1)

64-67: LGTM! Multinode RPM build follows established patterns.

The multinode RPM build sequence correctly mirrors the TopoLVM and Kindnet patterns, ensuring consistency in the build process.

packaging/microshift-runner.Containerfile (2)

21-21: LGTM! Environment variable declaration follows the established pattern.

The WITH_MULTINODE flag is properly aligned with other feature flags (WITH_KINDNET, WITH_TOPOLVM, WITH_OLM) and correctly defaults to 0 (disabled) for backward compatibility.


38-40: LGTM! Conditional installation is correctly implemented.

The multinode package installation follows the same pattern as other optional components. Note that unlike other components, microshift-multinode doesn't install a separate release-info package, which appears intentional.

Makefile (2)

84-84: LGTM! Multinode flag is properly propagated to the image build.

The WITH_MULTINODE environment variable is correctly passed to the container build, enabling conditional multinode package installation in the runner image.


143-146: LGTM! The multinode-create target is well-structured.

The target properly passes all necessary environment variables to the multinode script and provides helpful user feedback.

src/multinode/multinode.spec (1)

1-41: LGTM! RPM spec file is correctly structured.

The spec file properly defines the multinode subpackage with appropriate dependencies and file installations. The path references align with the builder Containerfile's copy operations.

Note: The RuboCop "unterminated string" error is a false positive—this is an RPM spec file, not Ruby code.

src/multinode/multinode.sh (12)

1-14: LGTM! Script initialization is properly configured.

The script uses appropriate error handling with set -euo pipefail and provides sensible defaults for all configuration variables while allowing environment overrides.


16-26: LGTM! TopoLVM backend creation is correctly implemented.

The function properly handles the reuse case and creates the LVM backend with appropriate commands. Error handling is provided by the script's set -e configuration.


28-47: LGTM! Kubeconfig handling functions are correctly implemented.

The wait timeout of 3000 seconds (50 minutes) is generous for initial cluster bootstrap, and the copy function cleanly extracts the kubeconfig.


72-88: LGTM! Node joining logic is correctly implemented.

The function properly orchestrates the node join process by copying the primary kubeconfig and executing the microshift add-node command with appropriate service stops.


90-101: LGTM! Node readiness check is properly implemented.

The function waits for the greenboot healthcheck with a reasonable timeout and handles errors gracefully.


103-120: LGTM! Network and cluster existence checks are correctly implemented.

Both functions properly use Podman commands to check and create resources as needed.


122-137: Verify IP address allocation doesn't conflict with Podman's gateway.

The functions correctly extract the subnet and calculate node IP addresses by setting the fourth octet to node_id + 10 (e.g., .11, .12, .13). This allocation strategy should avoid conflicts with Podman's typical gateway at .1, but please verify that Podman's default IPAM doesn't use IPs in the .10+ range for other purposes.


139-180: LGTM! Cluster creation logic is well-structured.

The function properly validates inputs, prevents duplicate clusters, and correctly orchestrates node creation with appropriate differentiation between the first node (which must be ready before others join) and subsequent nodes.


182-214: LGTM! Node addition logic correctly extends existing clusters.

The function properly determines the last node ID and adds new nodes sequentially with appropriate joining logic.


216-230: LGTM! Start and stop commands are correctly implemented.

Both commands properly enumerate containers and handle them appropriately. The --time 0 flag ensures immediate container shutdown.


232-249: LGTM! Status command provides useful cluster information.

The function properly checks cluster state and uses oc get nodes to display detailed node information when the cluster is running.


275-319: LGTM! Usage documentation and main entrypoint are well-structured.

The usage help is comprehensive, documenting all commands and environment variables. The main function provides useful visibility into the runtime configuration before executing commands.

@pacevedom pacevedom force-pushed the 80 branch 2 times, most recently from d5d3eef to 9b4cff9 Compare October 23, 2025 15:56
@pacevedom
Copy link
Contributor Author

@coderabbitai review this change

@coderabbitai
Copy link

coderabbitai bot commented Oct 23, 2025

@pacevedom: I'll review the changes in this PR. Let me analyze the multinode support implementation across all the modified files.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
src/multinode/multinode.sh (2)

16-26: Consider verifying VG exists when LVM_DISK file exists.

If LVM_DISK exists but the loop device or volume group don't (e.g., after a system reboot), subsequent operations may fail. Consider checking VG existence or recreating the loop device when the file is found.


268-270: Fix indentation to use spaces consistently.

Lines 268-270 use tabs while the rest of the file uses spaces. Maintain consistent indentation.

-		sudo vgremove -y "${VG_NAME}" || true
-		local -r device_name="$(sudo losetup -j "${LVM_DISK}" | cut -d: -f1)"
-		[ -n "${device_name}" ] && sudo losetup -d "${device_name}" || true
+    sudo vgremove -y "${VG_NAME}" || true
+    local -r device_name="$(sudo losetup -j "${LVM_DISK}" | cut -d: -f1)"
+    [ -n "${device_name}" ] && sudo losetup -d "${device_name}" || true
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1a0a01d and 9b4cff9.

📒 Files selected for processing (7)
  • Makefile (2 hunks)
  • packaging/microshift-builder.Containerfile (1 hunks)
  • packaging/microshift-runner.Containerfile (2 hunks)
  • src/multinode/dropins/microshift-multinode.conf (1 hunks)
  • src/multinode/multinode.sh (1 hunks)
  • src/multinode/multinode.spec (1 hunks)
  • src/rpm/postinstall.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • Makefile
  • packaging/microshift-runner.Containerfile
  • packaging/microshift-builder.Containerfile
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-23T11:55:38.018Z
Learnt from: pacevedom
PR: microshift-io/microshift#103
File: src/image/postinstall.sh:52-54
Timestamp: 2025-10-23T11:55:38.018Z
Learning: In MicroShift multinode deployments at src/image/postinstall.sh, etcd ports 2379 and 2380 are exposed in the public firewall zone because nodes may be deployed across different networks and need to communicate with each other for cluster formation and operation.

Applied to files:

  • src/rpm/postinstall.sh
🪛 RuboCop (1.81.1)
src/multinode/multinode.spec

[fatal] 4-4: unterminated string meets end of file
(Using Ruby 2.7 parser; configure using TargetRubyVersion parameter, under AllCops)

(Lint/Syntax)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: isolated-network-ovnk
  • GitHub Check: isolated-network-kindnet
  • GitHub Check: ubuntu-rpm2deb
🔇 Additional comments (8)
src/rpm/postinstall.sh (2)

43-46: Firewall rules already reviewed and justified.

These firewall rules were discussed in previous reviews. The public zone exposure is intentional for multinode deployments across different networks.

Based on learnings.


47-51: LGTM!

The inotify limits are appropriate for Kubernetes workloads and prevent resource exhaustion issues with cAdvisor and kubelet.

src/multinode/dropins/microshift-multinode.conf (1)

1-4: LGTM!

The systemd drop-in correctly clears and replaces ExecStart to enable multinode mode. The comment helpfully explains the systemd requirement.

src/multinode/multinode.spec (1)

25-31: LGTM!

The multinode subpackage definition is correct with appropriate version dependency and noarch specification.

src/multinode/multinode.sh (4)

1-15: LGTM!

Proper error handling with set -euo pipefail and well-defined environment variable defaults with override capability.


49-70: LGTM!

The issues flagged in the previous review have been fixed: vol_opts now has proper local declaration, and ${device} uses correct Bash syntax.


28-248: LGTM!

Cluster management functions are well-structured with appropriate error handling and timeouts for multinode operations.


276-320: LGTM!

Clear usage documentation and well-organized command dispatch with helpful environment summary.

@pacevedom pacevedom marked this pull request as ready for review October 27, 2025 12:18
@pacevedom pacevedom requested a review from a team as a code owner October 27, 2025 12:18
@pacevedom
Copy link
Contributor Author

@coderabbitai do a final round of review. Also update the PR description with a concise summary of this change.

@coderabbitai
Copy link

coderabbitai bot commented Oct 27, 2025

🧩 Analysis chain

@pacevedom: I'll perform a final review of the changes and have updated the PR description with a concise summary of the multinode support implementation.


🏁 Script executed:

#!/bin/bash
# First, let's see the current PR description
gh pr view 103 --json body --jq '.body'

Length of output: 502


🏁 Script executed:

#!/bin/bash
# Update the PR description with a concise summary
gh pr edit 103 --body "This PR adds comprehensive multinode support to MicroShift, enabling users to create and manage multi-node Kubernetes clusters using Podman and TopoLVM.

## Key Changes

**Cluster Management:**
- New \`multinode.sh\` script providing commands to create, manage, and delete multinode clusters
- Support for creating clusters with 3+ nodes (default 3)
- Operations: create, add-node, start, stop, status, and delete

**Build & Runtime Integration:**
- Added \`WITH_MULTINODE\` build flag to enable/disable multinode features
- Container image builds now include multinode RPM packages when enabled
- New Makefile targets: \`multinode-create\` and \`multinode-delete\`

**Infrastructure:**
- Systemd drop-in configuration to run MicroShift in multinode mode (\`--multinode\` flag)
- TopoLVM-backed storage setup with loopback devices and LVM volume groups
- Podman network creation with Open vSwitch integration
- Firewall configuration for etcd ports (2379, 2380) and API server (6443)
- Sysctl tuning for inotify limits (max_user_watches, max_user_instances)

**Packaging:**
- New \`microshift-multinode\` RPM subpackage with dependency on base microshift package
- Spec file defining multinode installation and configuration

## Missing Items
- Presubmits
- Documentation"

Length of output: 52


@pacevedom: I've successfully updated the PR description with a comprehensive summary of the multinode support changes. The description now covers:

  • Cluster Management: The new multinode.sh script and its operations
  • Build & Runtime Integration: WITH_MULTINODE flag and Makefile targets
  • Infrastructure: Systemd configuration, TopoLVM storage, networking, firewall, and sysctl tuning
  • Packaging: New microshift-multinode RPM subpackage

I'm now triggering a final review of all the changes.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Nitpick comments (3)
src/multinode/multinode.sh (2)

32-39: Add informative error message on kubeconfig wait timeout.

When the wait times out, the function returns 1 with no indication of what failed, making debugging difficult.

 wait_for_kubeconfig() {
     local -r name="${1}"
     local -r kubeconfig="/var/lib/microshift/resources/kubeadmin/${name}/kubeconfig"
 
     for _ in $(seq 600); do
         if sudo podman exec -i "${name}" /bin/test -f "${kubeconfig}"; then
             return 0
         fi
         sleep 5
     done
+    echo "ERROR: Timeout waiting for kubeconfig in container ${name}" >&2
     return 1
 }

248-248: Use variable for kubeconfig path in cmd_status.

Line 248 hardcodes the kubeconfig path instead of using a configuration variable, making the code less maintainable and inconsistent with the rest of the script.

 cmd_status() {
     if ! sudo podman container exists "${NODE_BASE_NAME}1"; then
         echo "Cluster is not initialized."
         exit 1
     fi
 
     local -r running_containers=$(sudo podman ps --format '{{.Names}}' | grep -E "^${NODE_BASE_NAME}[0-9]+$" || true)
 
     if [ -z "${running_containers}" ]; then
         echo "Cluster is down. No cluster containers are currently running."
         return 0
     fi
 
     local -r first_container=$(echo "${running_containers}" | head -n1)
     echo "Cluster is running."
-    sudo podman exec -i "${first_container}" oc --kubeconfig=/var/lib/microshift/resources/kubeadmin/kubeconfig get nodes -o wide
+    local -r kubeconfig="/var/lib/microshift/resources/kubeadmin/kubeconfig"
+    sudo podman exec -i "${first_container}" oc --kubeconfig="${kubeconfig}" get nodes -o wide
     exit $?
 }
src/multinode/README.md (1)

209-211: Specify language identifier for fenced code block.

Line 209 starts a code block without specifying the language, which impacts readability and syntax highlighting in rendered documentation.

-```
+```conf
 /etc/systemd/system/microshift.service.d/multinode.conf

</blockquote></details>

</blockquote></details>

<details>
<summary>📜 Review details</summary>

**Configuration used**: Path: .coderabbit.yaml

**Review profile**: CHILL

**Plan**: Pro

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 9b4cff95497f52e6245a4976fb709dd6e6bcd020 and 367cb99436862b955377de298434e76a55c4557b.

</details>

<details>
<summary>📒 Files selected for processing (4)</summary>

* `.github/actions/build/action.yaml` (3 hunks)
* `.github/workflows/builders.yaml` (1 hunks)
* `src/multinode/README.md` (1 hunks)
* `src/multinode/multinode.sh` (1 hunks)

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>🪛 LanguageTool</summary>

<details>
<summary>src/multinode/README.md</summary>

[grammar] ~207-~207: Ensure spelling is correct
Context: ... The multinode RPM installs the service dropin file that enables multinode mode:  ``` ...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

---

[grammar] ~213-~213: Ensure spelling is correct
Context: ...hift.service.d/multinode.conf ```  This dropin configures MicroShift to run with the `...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)

</details>

</details>
<details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>src/multinode/README.md</summary>

209-209: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>

</details>

<details>
<summary>⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)</summary>

* GitHub Check: isolated-network-ovnk
* GitHub Check: isolated-network-kindnet
* GitHub Check: multinode-bootc
* GitHub Check: ubuntu-rpm2deb
* GitHub Check: fedora-bootc
* GitHub Check: centos9-bootc
* GitHub Check: centos10-bootc

</details>

<details>
<summary>🔇 Additional comments (3)</summary><blockquote>

<details>
<summary>src/multinode/multinode.sh (1)</summary><blockquote>

`49-71`: **Critical issues from previous review have been fixed.**

The function now correctly declares `vol_opts` as local (line 54) and uses proper Bash variable expansion `${device}` (line 56) instead of the previous Makefile-style `$${device}`. The shellcheck directive on line 59 is appropriate for the controlled vol_opts expansion.

</blockquote></details>
<details>
<summary>.github/actions/build/action.yaml (1)</summary><blockquote>

`120-128`: **Verify multinode testing strategy covers sufficient validation.**

When `multinode=1`, the action skips the standard readiness checks (`run-ready`, `run-healthy`) and only runs `multinode-create` and `multinode-delete`. This means multinode builds don't validate that individual MicroShift instances are healthy during cluster lifecycle, only that the orchestration script succeeds.

Consider whether additional validation is needed within the multinode test (e.g., checking node status after create, verifying etcd quorum).

</blockquote></details>
<details>
<summary>.github/workflows/builders.yaml (1)</summary><blockquote>

`119-135`: **Multinode workflow job is properly integrated.**

The new `multinode-bootc` job follows the established pattern for variant builds, correctly passes the `multinode=1` flag to the build action, and runs on appropriate infrastructure.

</blockquote></details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

@pacevedom pacevedom marked this pull request as draft October 27, 2025 15:00
@pacevedom pacevedom force-pushed the 80 branch 2 times, most recently from 8ecb276 to 13679bf Compare October 28, 2025 15:33
@pacevedom
Copy link
Contributor Author

@coderabbitai have a quick review and update the PR description

@coderabbitai
Copy link

coderabbitai bot commented Oct 28, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check the current PR description
gh pr view 103 --json body,title --jq '{title: .title, body: .body}'

Length of output: 2202


🏁 Script executed:

#!/bin/bash
# Update the PR description with a polished, concise summary
gh pr edit 103 --body "This PR adds comprehensive multinode support to MicroShift, enabling users to create and manage multi-node Kubernetes clusters using Podman and TopoLVM.

## Key Changes

**Cluster Management:**
- New \`multinode.sh\` script with commands: create, add-node, start, stop, status, delete
- Support for creating clusters with 3+ nodes (default: 3)

**Build & Runtime:**
- \`WITH_MULTINODE\` build flag to enable/disable multinode features
- New Makefile targets: \`multinode-create\` and \`multinode-delete\`
- Systemd drop-in to run MicroShift in multinode mode (\`--multinode\` flag)

**Infrastructure:**
- TopoLVM-backed storage with loopback devices and LVM
- Podman network with Open vSwitch integration
- Firewall rules for etcd (2379, 2380) and API server (6443)
- Sysctl tuning for inotify limits

**Packaging:**
- New \`microshift-multinode\` RPM subpackage
- CI/CD workflow for multinode bootc image builds

**Documentation:**
- Comprehensive README with usage, architecture, and troubleshooting

## Missing Items
- Presubmits"

Length of output: 52


@pacevedom: I've updated the PR description with a more concise summary while keeping all the essential information. The description now provides a cleaner overview of the multinode support implementation, including cluster management, build/runtime integration, infrastructure changes, packaging, and documentation.

I'm now triggering a quick review of the changes.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 367cb99 and 13679bf.

📒 Files selected for processing (6)
  • .github/actions/build/action.yaml (2 hunks)
  • .github/actions/quick-start-clean/action.yaml (1 hunks)
  • .github/workflows/builders.yaml (1 hunks)
  • Makefile (1 hunks)
  • src/cluster_manager.sh (1 hunks)
  • src/rpm/postinstall.sh (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-10-23T11:55:38.018Z
Learnt from: pacevedom
PR: microshift-io/microshift#103
File: src/image/postinstall.sh:52-54
Timestamp: 2025-10-23T11:55:38.018Z
Learning: In MicroShift multinode deployments at src/image/postinstall.sh, etcd ports 2379 and 2380 are exposed in the public firewall zone because nodes may be deployed across different networks and need to communicate with each other for cluster formation and operation.

Applied to files:

  • src/rpm/postinstall.sh
🪛 actionlint (1.7.8)
.github/workflows/builders.yaml

134-134: input "multinode" is not defined in action "build-rpms-and-images" defined at "./.github/actions/build". available inputs are "bootc-image-tag", "bootc-image-url", "build", "isolated-network", "multi-node", "okd-version-tag", "ovnk-networking", "ushift-branch"

(action)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: fedora-bootc
  • GitHub Check: isolated-network-ovnk
  • GitHub Check: isolated-network-kindnet
  • GitHub Check: multi-node-bootc
  • GitHub Check: ubuntu-rpm2deb
  • GitHub Check: centos9-bootc
  • GitHub Check: centos10-bootc
🔇 Additional comments (8)
src/rpm/postinstall.sh (1)

43-52: ✓ Multinode connectivity and resource limits configured appropriately.

The firewall rules and sysctl adjustments are well-placed. The comment clearly explains why etcd ports are exposed in the public zone for multinode clusters. Inotify limits align with kubelet/cAdvisor requirements in containerized environments.

src/cluster_manager.sh (5)

7-15: ✓ Configuration variables well-designed for environment overrides.

Clean use of bash parameter expansion for configuration. Defaults align with typical single-node and multinode deployments.


17-151: ✓ Helper functions are logically sound and handle edge cases.

LVM backend, network, and node provisioning helpers use defensive patterns (existence checks, error fallbacks, cleanup on failure). Node joining flow correctly sequences kubeconfig copy and microshift add-node execution.


191-230: ✓ Node addition orchestration correctly sequences setup, health checks, and join.

Iteration to find the next node ID is correct. Pre-join health check ensures the primary node is stable before adding new replicas.


287-315: ✓ Readiness and health check logic is clear and guards against missing nodes.

The distinction between microshift.service "running" (ready) and greenboot-healthcheck "exited" (healthy) is appropriate. Early exit if no containers found avoids spurious timeouts.


344-391: ✓ Main dispatch and directory guard are well-implemented.

Case statement cleanly routes actions. The Makefile presence check prevents accidental runs from wrong directory, reducing debugging friction.

.github/actions/build/action.yaml (1)

33-37: ✓ Input definition is correct.

The multi-node input with integer type and default value is properly declared.

Makefile (1)

107-150: ✓ Cluster orchestration cleanly centralized through cluster_manager.sh delegation.

Environment overrides are consistently passed through all targets. The polling loops for run-ready and run-healthy follow a sensible retry pattern. The transition from inline container logic to centralized script management is well-structured.

@pacevedom pacevedom force-pushed the 80 branch 5 times, most recently from b928205 to 69d4fb3 Compare October 28, 2025 17:04
@pacevedom
Copy link
Contributor Author

@coderabbitai have another round and update again the PR description to match the changes in the commits.

@coderabbitai
Copy link

coderabbitai bot commented Oct 28, 2025

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Get the current PR description and list of changed files
echo "=== Current PR Description ==="
gh pr view 103 --json body,title --jq '{title: .title, body: .body}'

echo -e "\n=== List of changed files in this PR ==="
gh pr view 103 --json files --jq '.files[].path' | sort

echo -e "\n=== Check for key files ==="
# Check if cluster_manager.sh exists
if [ -f "src/cluster_manager.sh" ]; then
  echo "✓ src/cluster_manager.sh exists"
else
  echo "✗ src/cluster_manager.sh does not exist"
fi

# Check if multinode.sh exists
if [ -f "src/multinode.sh" ]; then
  echo "✓ src/multinode.sh exists"
else
  echo "✗ src/multinode.sh does not exist"
fi

# Check for README files related to multinode
fd -t f "README" | grep -i multi || echo "No multinode README found"

# Check for spec files
fd -t f "\.spec$" --exec echo "Found spec file: {}"

Length of output: 2237


🏁 Script executed:

#!/bin/bash
# Check the Makefile for new targets
echo "=== New Makefile targets ==="
grep -E "^[a-z-]+:" Makefile | head -30

echo -e "\n=== Check cluster_manager.sh command structure ==="
grep -E "^\s*(create|add-node|start|stop|delete|status|ready|healthy)" src/cluster_manager.sh | head -20

echo -e "\n=== Check postinstall.sh firewall changes ==="
grep -A2 -B2 "6443\|2379\|2380" src/rpm/postinstall.sh

echo -e "\n=== Check for multinode GitHub workflow ==="
cat .github/workflows/builders.yaml | grep -A10 "multi-node" || echo "No multi-node workflow found"

echo -e "\n=== Check build action for multinode input ==="
grep -A5 "multi-node:" .github/actions/build/action.yaml || echo "No multi-node input found"

Length of output: 1330


🏁 Script executed:

#!/bin/bash
# Check if there are systemd or spec file changes we missed
echo "=== Check for all files with 'multinode' in name or content ==="
rg -l "multinode" --type-add 'spec:*.spec' --type-add 'service:*.service' -g '!.git'

echo -e "\n=== Check for WITH_MULTINODE flag ==="
rg "WITH_MULTINODE" -C2

echo -e "\n=== Check for systemd drop-in files ==="
fd -t f "\.service$|\.conf$" src/

echo -e "\n=== Check Makefile run target details ==="
sed -n '/^run:/,/^[a-z-]*:/p' Makefile | head -20

echo -e "\n=== Check sysctl changes in postinstall ==="
grep -A5 "sysctl\|inotify" src/rpm/postinstall.sh

Length of output: 854


🏁 Script executed:

#!/bin/bash
# Update PR description to match actual changes in the commits
gh pr edit 103 --body "This PR adds multinode cluster management infrastructure to MicroShift, enabling creation and management of multi-node Kubernetes clusters using Podman.

## Key Changes

**Cluster Management Script:**
- New \`src/cluster_manager.sh\` script providing cluster lifecycle operations
- Commands: create, add-node, start, stop, delete, ready, healthy, status
- TopoLVM-backed storage setup with loopback devices and LVM
- Podman network management with configurable isolation

**Makefile Integration:**
- New targets: \`add-node\`, \`start\`, \`stop\`, \`delete\`, \`run-status\`, \`run-ready\`, \`run-healthy\`
- \`run\` target refactored to use cluster_manager.sh
- \`clean\` target now depends on \`delete\` for cluster cleanup

**Infrastructure Configuration:**
- Firewall rules for API server (6443) and etcd (2379, 2380) ports
- Sysctl tuning for inotify limits (max_user_watches: 524288, max_user_instances: 16384)
- Changes in \`src/rpm/postinstall.sh\`

**CI/CD Enhancements:**
- New \`multi-node\` input in build action (integer, default: 0)
- Multi-node bootc image build workflow
- Updated health check logic in quick-start-clean action

## Changed Files
- Makefile
- src/cluster_manager.sh (new)
- src/rpm/postinstall.sh
- .github/actions/build/action.yaml
- .github/actions/quick-start-clean/action.yaml
- .github/workflows/builders.yaml"

echo "PR description updated successfully"

Length of output: 88


@pacevedom: I've updated the PR description to accurately match the actual changes in the commits. The key corrections include:

Corrected:

  • Script name: cluster_manager.sh (not multinode.sh)
  • Makefile targets: add-node, start, stop, delete, etc. (not multinode-create/multinode-delete)
  • Removed mentions of features not present in the commits:
    • WITH_MULTINODE build flag
    • microshift-multinode RPM subpackage
    • Systemd drop-in configuration
    • Documentation/README

Actual changes documented:

  • New src/cluster_manager.sh script with 8 commands for cluster lifecycle management
  • Refactored Makefile with 7 new targets integrated with cluster_manager.sh
  • Infrastructure setup in postinstall.sh (firewall rules and sysctl tuning)
  • CI/CD enhancements with multi-node build action input and bootc workflow

The description now accurately reflects the 6 files changed in this PR.

✅ Actions performed

Review triggered.

Note: CodeRabbit is an incremental review system and does not re-review already reviewed commits. This command is applicable only when automatic reviews are paused.

@pacevedom pacevedom marked this pull request as ready for review October 28, 2025 17:28
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9f99f94 and 4dbe316.

📒 Files selected for processing (1)
  • docs/run.md (3 hunks)
🧰 Additional context used
🪛 markdownlint-cli2 (0.18.1)
docs/run.md

101-101: Table pipe style
Expected: leading_and_trailing; Actual: leading_only; Missing trailing pipe

(MD055, table-pipe-style)


102-102: Table pipe style
Expected: leading_and_trailing; Actual: leading_only; Missing trailing pipe

(MD055, table-pipe-style)


145-145: Multiple headings with the same content

(MD024, no-duplicate-heading)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (7)
  • GitHub Check: ubuntu-rpm2deb
  • GitHub Check: quick-start-and-clean
  • GitHub Check: centos10-bootc
  • GitHub Check: isolated-network-kindnet
  • GitHub Check: fedora-bootc
  • GitHub Check: centos9-bootc
  • GitHub Check: isolated-network-ovnk
🔇 Additional comments (1)
docs/run.md (1)

104-111: Documentation properly documents multinode workflow.

The updates clearly explain the multinode cluster creation process (initial make run, then make add-node for additional nodes), container login via node names, and status checks. ISOLATED_NETWORK constraint is properly documented. References to new make targets align with the cluster manager implementation.

Also applies to: 128-143, 161-169, 171-184

@pacevedom pacevedom merged commit e071e10 into microshift-io:main Oct 30, 2025
10 checks passed
@pacevedom pacevedom deleted the 80 branch October 30, 2025 13:12
This was referenced Dec 3, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Add multi-node support and docs to MicroShift Upstream

3 participants