
Rancher integration test - Rancher 2.7.6 + Harvester 1.1.2 #936

Closed
albinsun opened this issue Sep 13, 2023 · 4 comments

albinsun commented Sep 13, 2023

What's the test to develop? Please describe

Rancher integration test on a setup of Rancher 2.7.6 with Harvester 1.1.2 to confirm the support status.

Describe the items of the test development (DoD, definition of done) you'd like

TCs will reference v1.2.0 release testing #900.

Test Outline

  1. Import Harvester in Rancher 2.7.6
  2. Create an RKE2 custom cluster, a Harvester node driver cluster, and a cluster using Terraform
  3. Deploy the cloud provider and CSI driver on all the clusters.
  4. Necessary checks that the cloud provider and CSI driver work on each cluster (see the check sketch after this list).
  5. Scale the clusters down and up; basic operations.
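
For step 4, a minimal check sketch run with kubectl against a guest cluster's kubeconfig. The exact chart workload names and labels vary, so the grep patterns below are assumptions, not the exact commands used in this test:

# Cloud provider and CSI driver workloads deployed into kube-system by the Harvester charts
kubectl --kubeconfig guest-cluster.yaml -n kube-system get pods | grep -E 'harvester-(cloud-provider|csi)'

# The CSI driver should install "harvester" as the default storage class
kubectl --kubeconfig guest-cluster.yaml get storageclass

# LoadBalancer services created later should receive an external IP from DHCP or the IP pool
kubectl --kubeconfig guest-cluster.yaml get svc -A | grep LoadBalancer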

Environment

  • Harvester
    • Version: v1.1.2
    • Profile: QEMU/KVM, 3 nodes (8C/16G/250G)
    • ui-source: Auto
  • Rancher
    • Version: v2.7.6
    • Profile: Docker
  • Terraform
    Terraform v1.5.3
    on linux_amd64
    + provider registry.terraform.io/rancher/rancher2 v3.0.1
    

albinsun commented Sep 13, 2023

Harvester Configuration

Based on

Prerequisites

  1. VLAN 1 network on mgmt and 1 network on other NICs
  2. 2 virtual machines with data and md5sum computed: 1 running, 1 stopped
  3. Create a new storage class apart from the default one. Use the new storage class for some basic operations.

Setup

  1. Set up a 3-node cluster ${~~~\color{green}\textsf{V}}$

    image

Network

  1. Create VLAN 1 network on mgmt NIC, "mgmt-vlan1" ${~~~\color{green}\textsf{V}}$

    image

  2. Create an untagged network on the other NIC, "nonmgmt-untagged" (see the sketch after this list) ${~~~\color{green}\textsf{V}}$
    1. Create cluster network nonmgmt
    2. Create network config nc-nonmgmt
      image
    3. Create VM network nonmgmt-untagged
      image
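
For reference, the VM networks created above are stored as Multus NetworkAttachmentDefinitions backed by the bridge CNI. A rough sketch of inspecting "mgmt-vlan1" with kubectl is below; the bridge name follows Harvester's <cluster-network>-br convention and the exact config string is an assumption (the Harvester UI generates it):

kubectl -n default get network-attachment-definitions.k8s.cni.cncf.io mgmt-vlan1 -o yaml
# spec.config is a bridge CNI configuration similar to:
#   {"cniVersion":"0.3.1","type":"bridge","bridge":"mgmt-br","promiscMode":true,"vlan":1,"ipam":{}}
# "nonmgmt-untagged" has the same shape, pointing at the nonmgmt cluster network's bridge with no vlan field.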

Storage

  1. Add a disk to node1 and node2 respectively ${~~~\color{green}\textsf{V}}$
    1. Plug a 64G ext4 non-partitioned disk into node1 and node2
    2. The system detects them and lists them as blockdevices resources
      • node0
        image
      • node1
        image
    3. Attach the blockdevice as storage and add disk tags
      • node0
        image
      • node1
        image
  2. Create a new storage class "eext4" which selects the new disk (see the sketch after this list) ${~~~\color{green}\textsf{V}}$

    image

  3. Create a new 10G volume "eext4-pred" based on the new "eext4" storage class ${~~~\color{green}\textsf{V}}$

    image
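
The "eext4" storage class created above is backed by Longhorn; a sketch of a roughly equivalent manifest applied with kubectl on the Harvester cluster is below. The disk tag "ext4-disk" is a placeholder for whatever tag was added in step 1.3, and the replica count is an assumption:

kubectl apply -f - <<'EOF'
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: eext4
provisioner: driver.longhorn.io
allowVolumeExpansion: true
reclaimPolicy: Delete
volumeBindingMode: Immediate
parameters:
  numberOfReplicas: "2"
  staleReplicaTimeout: "30"
  # Longhorn schedules replicas only onto disks carrying this tag
  diskSelector: "ext4-disk"
EOF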

VM Creation

  1. Running VM: Create "vm-running" ${~~~\color{green}\textsf{V}}$
    1. Disk: ubuntu-focal-cloudimg + Existing Volume "eext4-pred"
      image
    2. Network: "mgmt-vlan1"
      image
    3. Mount eext4-pred and edit fstab
      root@vm-running:~# cat /etc/fstab 
      LABEL=cloudimg-rootfs   /        ext4   defaults        0 1
      LABEL=UEFI      /boot/efi       vfat    umask=0077      0 1
      UUID="35222e2d-fced-4d0b-8445-ffeb3a906378"     /data   ext4    defaults        0 1
      
    4. Some test data and its checksum are computed on both the own and the attached volume (see the sketch after this list).
      • own volume
        image
      • attached volume
        image
    5. Restart from UI; checksums should still be valid
      • own volume
        image
      • attached volume
        image
  2. Stopped VM: Create "vm-stopped" ${~~~\color{green}\textsf{V}}$
    1. Disk: ubuntu-focal-cloudimg + on-demand volume "eext4-ond"
      image
    2. Network: "nonmgmt-untagged"
      image
    3. Mount eext4-ond and edit fstab
      ubuntu@vm-stopped:/data$ cat /etc/fstab 
      LABEL=cloudimg-rootfs   /        ext4   defaults        0 1
      LABEL=UEFI      /boot/efi       vfat    umask=0077      0 1
      UUID="691bc5e1-f642-427d-acb0-5da31fb20732"     /data   ext4    defaults        0 1
      
    4. Some test data and its checksum are computed on both the own and the attached volume.
      • own volume
        image
      • attached volume
        image
    5. Restart from UI; checksums should still be valid
      image
      image
      image
    6. Stop VM
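
A sketch of the mount and checksum steps above, assuming the attached "eext4" volume shows up as /dev/vdb inside the guest; the device name, mount point, and file names are assumptions:

# Format and mount the attached volume, then persist it via UUID as in the fstab above
sudo mkfs.ext4 /dev/vdb
sudo mkdir -p /data
sudo mount /dev/vdb /data
echo "UUID=$(sudo blkid -s UUID -o value /dev/vdb)  /data  ext4  defaults  0 1" | sudo tee -a /etc/fstab

# Write test data on both the root (own) volume and the attached volume and record checksums
dd if=/dev/urandom of=$HOME/test.bin bs=1M count=100
sudo dd if=/dev/urandom of=/data/test.bin bs=1M count=100
md5sum $HOME/test.bin | tee $HOME/own.md5
cd /data && sudo sh -c 'md5sum test.bin > attached.md5'

# After restarting the VM from the UI, both checksums should still match
md5sum -c $HOME/own.md5
cd /data && md5sum -c attached.md5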


albinsun commented Sep 13, 2023

Rancher Integration

Outline

Test steps

  1. Import Harvester in Rancher 2.7.6
  2. Create an RKE2 custom cluster, a Harvester node driver cluster, and a cluster using Terraform
  3. Deploy the cloud provider and CSI driver on all the clusters.
  4. Necessary checks that the cloud provider and CSI driver work on each cluster.
  5. Scale the clusters down and up; basic operations.

Import to Rancher

  1. Import Harvester to Rancher 2.7.6 ${~~~\color{green}\textsf{V}}$
    1. Rancher, go to "Virtualization Management" -> "Import Existing" -> "Create"
      image

    2. Register Harvester to Rancher
      image

    3. Imported Harvester shows Active
      image

RKE2 Harvester Node Driver Cluster (Manual)

Setup Cluster

  1. Create cloud credential ${~~~\color{green}\textsf{V}}$

    Go to "Cluster Management" -> "Cloud Credentials" -> "Create" -> "Harvester"
    image

  2. Create cluster (Takes ~20m) ${~~~\color{orange}\textsf{V}}$
    1. Go to "Cluster Management" -> "Clusters" -> "Create"

      • v1.23.17+rke2r1
        image
    2. Cluster should be created

      • Rancher
        image
      • Harvester
        image
    • ⚠️ Can only select `v1.23.17+rke2r1` to conform to Harvester `v1.1.2`
      • v1.26.8+rke2r1
        image
      • v1.25.13+rke2r1
        image
      • v1.24.17+rke2r1
        image
    • 🐞 (minor) UI breaks when clicking the Rancher tab from Harvester.
      1. Go to Virtualization Management
        image
      2. Enter Harvester
        image
      3. Clicking the Rancher tab breaks the UI (workaround: connect via the Home tab)
        image
        image

      **Workaround**
      One workaround is to enter the base Rancher URL again.
      image

  3. Create IP Pool ${~~~\color{green}\textsf{V}}$
    1. Go to "Virtualization Management" -> harvester -> "Settings" -> "vip-pools"
      image

Test harvester-cloud-provider

  1. Both App & Workload should be Active ${~~~\color{green}\textsf{V}}$

    App
    image

    Workload
    image

  2. Deploy Nginx workload ${~~~\color{green}\textsf{V}}$
    1. Create a deployment test-nginx with image nginx:latest and a pod label (manifest sketch after this list)
    2. Check deployment test-nginx is Active
      image
  3. Verify Load Balancer with IPAM "DHCP" ${~~~\color{green}\textsf{V}}$
    1. Go to "Service discovery" -> "Services" -> "Create" -> "Load Balancer"
      Set selectors to match the test-nginx pods
    2. Create lb-dhcp-80; it is Active and routes correctly
      image
      image
    3. Create lb-dhcp-http; it is Active and routes correctly
      image
      image
  4. Verify Load Balancer with IPAM "Pool" ${~~~\color{green}\textsf{V}}$
    1. Create lb-pool-80 & lb-pool-http with IPAM Pool
    2. lb-pool-80 is Active and routes correctly
      image
      image
    3. lb-pool-http is Active and routes correctly
      image
      image
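
A manifest sketch for steps 2-4, applied with kubectl against the guest cluster. The pod label mykey: myval follows the label used in the Terraform run later in this issue, and the cloudprovider.harvesterhci.io/ipam annotation ("dhcp" or "pool") is my reading of how the Harvester cloud provider picks the IPAM mode; the Rancher UI sets this when you choose the mode, so treat the annotation as an assumption:

kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-nginx
spec:
  replicas: 1
  selector:
    matchLabels: {mykey: myval}
  template:
    metadata:
      labels: {mykey: myval}
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          ports:
            - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: lb-dhcp-80
  annotations:
    # IPAM mode for the Harvester load balancer (assumption; the UI sets this field)
    cloudprovider.harvesterhci.io/ipam: dhcp
spec:
  type: LoadBalancer
  selector: {mykey: myval}
  ports:
    - port: 80
      targetPort: 80
EOF

# lb-pool-80 / lb-pool-http are the same Service with ipam: pool, which draws the
# external IP from the vip-pools range configured above instead of DHCP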

Test harvester-csi-driver

  1. Both App & Workload should be Active ${~~~\color{green}\textsf{V}}$

    App
    image

    Workload
    image

  2. Check that Harvester is already set as the default storage class ${~~~\color{green}\textsf{V}}$

    image

  3. Deploy nginx:latest with an on-demand PVC (see the PVC sketch after this list) ${~~~\color{green}\textsf{V}}$

    Config

    • Storage (PVC & PV)
      image
      image

    • Mount
      image

    Related resources are created

    • Deployment
      image
    • PVC
      image
    • PV
      image
  4. Verify Load Balancers ${~~~\color{green}\textsf{V}}$

    image
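
A manifest sketch for step 3, assuming the default "harvester" storage class installed by the CSI driver; the names, size, and mount path are placeholders:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: nginx-data
spec:
  accessModes: ["ReadWriteOnce"]
  storageClassName: harvester
  resources:
    requests:
      storage: 5Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-pvc
spec:
  replicas: 1
  selector:
    matchLabels: {app: nginx-pvc}
  template:
    metadata:
      labels: {app: nginx-pvc}
    spec:
      containers:
        - name: nginx
          image: nginx:latest
          volumeMounts:
            - name: data
              mountPath: /usr/share/nginx/html
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: nginx-data
EOF

# The PVC should bind to an automatically provisioned PV backed by a Harvester volume
kubectl get pvc,pv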

Scaling

  1. Scale Pool Up (Takes ~15m) ${~~~\color{green}\textsf{V}}$
    • Go to "Cluster Management" -> "Clusters" -> cluster -> "+" sign

    • After
      image

    • Deployment & LB still work
      image
      image

  2. Scale Pool Down (Takes ~20m) ${~~~\color{orange}\textsf{V}}$
    • Go to "Cluster Management" -> "Clusters" -> cluster -> "-" sign

    • After
      image

    • Deployment & LB still work
      image
      image

    • 🐞 (minor) Legacy node record in Cluster -> Nodes page

      Machine is deleted on Harvester and Rancher Cluster Management
      image
      image

      But a legacy record remains on the cluster's Nodes page
      image

      Workaround: delete manually
      image
      image

    • 🐞 (Known) Possible scale down fail - https://github.com/rancher/rancher/issues/42582


albinsun commented Sep 14, 2023

⚠️ Deprecated, see latest test below

RKE2 Harvester Node Driver Cluster (Terraform)

Setup Cluster

  1. Create API Key ${~~~\color{green}\textsf{V}}$

    Go to Account icon (top-right corner) -> "Account and API Keys" -> "Create API Key"

  2. Setup RKE2 cluster via Terraform ${~~~\color{red}\textsf{X}}$

    • Hits 500 Internal Server Error using Kubernetes v1.23.17+rke2r1, regardless of rancher2 provider 3.0.0, 3.0.1, or 3.1.1.
      image
    • Error: Creating cluster V2: Bad response statusCode [500]. Status [500 Internal Server Error].
      Body: [code=InternalError, message=Internal error occurred: failed calling webhook "rancher.cattle.io.clusters.provisioning.cattle.io": 
      failed to call webhook: an error on the server 
      ...
      

Note

  1. We found that rancher2 provider 3.0.1 + Kubernetes v1.26.7+rke2r1 can be set up, but has a problem when creating the LB later (stuck in Pending).
    • Can set up v1.26.7+rke2r1
      image
      image
    • Can create test-nginx workload
      image
    • But creating the LB gets stuck in Pending without an explicit event
      image
  2. However, this should not be a formal case since v1.26.7+rke2r1 is not listed in Rancher v2.7.6.
    image

albinsun commented:

RKE2 Harvester Node Driver Cluster (Terraform)

Environment

  • Harvester v1.1.2 (QEMU/KVM, 3 nodes (8C/16G/250G))
  • Rancher v2.7.6 (Docker)
  • Terraform
    $ ./terraform -version
    Terraform v1.5.3
    on linux_amd64
    + provider registry.terraform.io/rancher/rancher2 v3.1.1
    

terraform file: main.tf

Ref. https://registry.terraform.io/providers/rancher/rancher2/latest/docs/resources/cluster_v2#creating-rancher-v2-harvester-cluster-v2-with-harvester-cloud-provider
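
Since the file itself is not attached here, below is a minimal sketch of a main.tf following the linked registry example. The resource layout matches the "3 added" in the apply output below (cloud credential, machine config, cluster), but all values (token, namespaces, image, network, sizing) are placeholders, and the exact attribute set should be checked against the rancher2 provider docs rather than read as the file actually used:

cat > main.tf <<'EOF'
# Rancher API endpoint and token (placeholders)
provider "rancher2" {
  api_url   = "https://<rancher-host>"
  token_key = "<api-token>"
  insecure  = true
}

# The imported Harvester cluster, as named in Virtualization Management
data "rancher2_cluster_v2" "myharvester" {
  name = "harvester131"
}

# Cloud credential pointing at the imported Harvester cluster
resource "rancher2_cloud_credential" "harvester" {
  name = "harvester-cred"
  harvester_credential_config {
    cluster_id         = data.rancher2_cluster_v2.myharvester.cluster_v1_id
    cluster_type       = "imported"
    kubeconfig_content = data.rancher2_cluster_v2.myharvester.kube_config
  }
}

# VM template for the node pool (namespace, sizing, image, and network are placeholders)
resource "rancher2_machine_config_v2" "harvester" {
  generate_name = "rke2-harvester-terraform"
  harvester_config {
    vm_namespace = "default"
    cpu_count    = "2"
    memory_size  = "4"
    ssh_user     = "ubuntu"
    disk_info    = jsonencode({ disks = [{ imageName = "default/<ubuntu-image>", size = 40, bootOrder = 1 }] })
    network_info = jsonencode({ interfaces = [{ networkName = "default/mgmt-vlan1" }] })
  }
}

# The RKE2 guest cluster, pinned to v1.23.17+rke2r1 to conform to Harvester v1.1.2 (see Note/Issues below)
resource "rancher2_cluster_v2" "rke2-harvester-terraform" {
  name               = "rke2-harvester-terraform"
  kubernetes_version = "v1.23.17+rke2r1"
  rke_config {
    machine_pools {
      name                         = "pool1"
      cloud_credential_secret_name = rancher2_cloud_credential.harvester.id
      control_plane_role           = true
      etcd_role                    = true
      worker_role                  = true
      quantity                     = 1
      machine_config {
        kind = rancher2_machine_config_v2.harvester.kind
        name = rancher2_machine_config_v2.harvester.name
      }
    }
    machine_selector_config {
      # The registry example also wires cloud-provider-config (a Harvester cloud
      # config / kubeconfig) here; omitted in this sketch
      config = {
        cloud-provider-name = "harvester"
      }
    }
  }
}
EOF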

Provider rancher/rancher2 v3.1.1

Note: tested 2 times.

terraform init ${~~~\color{green}\textsf{V}}$

image

Setup Cluster

  1. Create API Key ${~~~\color{green}\textsf{V}}$

    Go to Account icon (top-right corner) -> "Account and API Keys" -> "Create API Key"

  2. Create cluster via Terraform (Takes ~20m) ${~~~\color{green}\textsf{V}}$
    • Rancher
      image

    • Harvester
      image

    • Terraform

      $ ./terraform init -upgrade
      ...
      $ ./terraform validate
      ...
      $ ./terraform apply -auto-approve
      data.rancher2_cluster_v2.myharvester: Reading...
      data.rancher2_cluster_v2.myharvester: Read complete after 1s [id=fleet-default/harvester131]
      ...
      rancher2_cluster_v2.rke2-harvester-terraform: Still creating... [17m1s elapsed]
      rancher2_cluster_v2.rke2-harvester-terraform: Still creating... [17m11s elapsed]
      rancher2_cluster_v2.rke2-harvester-terraform: Creation complete after 17m18s [id=fleet-default/rke2-harvester-terraform]
      
      Apply complete! Resources: 3 added, 0 changed, 0 destroyed.
      
  3. Create IP Pool ${~~~\color{green}\textsf{V}}$

    Go to "Virtualization Management" -> harvester -> "Advanced" -> "Settings" -> "vip-pools"
    image

Test harvester-cloud-provider

  1. Both App & Workload should be Active ${~~~\color{green}\textsf{V}}$

    App
    image
    Workload
    image

  2. Deploy Nginx workload ${~~~\color{green}\textsf{V}}$

    Create a deployment test-nginx with image nginx:latest and pod label mykey:myval
    image

  3. Verify Load Balancer with IPAM "DHCP" and "Pool" ${~~~\color{green}\textsf{V}}$

    lb-dhcp-80 (IPAM "DHCP") is Active and routes correctly
    image
    The LB with IPAM "Pool" is Active and routes correctly
    image

Test harvester-csi-driver

  1. Both App & Workload should be Active ${~~~\color{green}\textsf{V}}$

    App
    image
    Workload
    image

  2. Check that Harvester is already set as the default storage class ${~~~\color{green}\textsf{V}}$

    image

  3. Deploy nginx:latest with on-demand PVC ${~~~\color{green}\textsf{V}}$

    Config

    • Storage (PVC & PV)
      image
    • Mount
      image

    Related resources are created

    • Deployment
      image
    • PVC
      image
    • PV
      image
  4. Verify Load Balancer with IPAM "DHCP" and "Pool" ${~~~\color{green}\textsf{V}}$

    image

Scaling

  1. Scale Pool Up (Takes ~15m) ${~~~\color{green}\textsf{V}}$
    • Go to "Cluster Management" -> "Clusters" -> cluster -> "+" sign
    • After
      image
      image
    • Deployment & LB still work
      image
  2. Scale Pool Down (Takes ~20m) ${~~~\color{green}\textsf{V}}$
    • Go to "Cluster Management" -> "Clusters" -> cluster -> "-" sign
    • After
      image
      image
    • Deployment & LB still work
      image

Setup in Other Versions

Provider rancher/rancher2 v3.0.1

Note: tested 2 times.

  1. terraform init ✔️

    image

  2. terraform apply ✔️

    image

  3. terraform destroy ✔️

    image

Provider rancher/rancher2 v3.0.0

Note: tested 2 times.

  1. terraform init ✔️

    image

  2. terraform apply ⚠️ (failed 1 time)
    1. Trial 1 ❌
      Stuck in configuring bootstrap node(s) rke2-harvester-terraform-pool1-67c86697b4-bf86h: waiting for probes: kube-controller-manager, kube-scheduler, kubelet
      image

    2. Trial 2 ✔️
      image

  3. terraform destroy ✔️

    image

Note/Issues

  1. ⚠️ Can only select v1.23.17+rke2r1 to conform to Harvester v1.1.2

  2. 🐞 (minor) UI breaks when clicking the Rancher tab from Harvester.
    1. Go to Virtualization Management
      image
    2. Enter Harvester
      image
    3. Clicking the Rancher tab breaks the UI (workaround: connect via the Home tab)
      image
      image

    **Workaround**
    One workaround is to enter the base Rancher URL again.
    image

  3. 🐞 [BUG] Scaling down etcd machine pool can cause multiple machines to be deleted unintentionally #42582
