
[TEST] Build the fully airgapped environment on bare metal machines for release upgrade testing #967

Open
29 of 42 tasks
TachunLin opened this issue Oct 24, 2023 · 23 comments
Comments

@TachunLin
Contributor

TachunLin commented Oct 24, 2023

What's the test to develop? Please describe

This epic issue tracks the progress of building a fully airgapped environment for upgrade testing of each Harvester release candidate.

The current fully airgapped environment is built from ipxe-examples virtual machines nested inside another powerful virtual machine, which often causes unexpected upgrade failures due to performance or resource bottlenecks. Moving to bare metal should also let us:

  • Reduce the time to prepare the Rancher image by caching different versions on a separate server
  • Provision a specific Rancher version on demand
  • Automate the airgapped environment setup and upgrade test

Scope

  1. Provide a fully airgapped Harvester cluster on bare metal machines
  2. Provide a fully airgapped Rancher instance and private Docker registry
  3. Provide an artifact server with an internet connection that acts as the file, DNS, and NTP server
  4. Automatically import Harvester into Rancher
  5. Provision the downstream RKE2 cluster

Prerequisite

Any prerequisite environment and pre-condition required for this test.
Provide test case dependency here if any.

The fully airgapped environment requires the following components:

  1. Four bare metal machines for the Harvester cluster
  2. One bare metal machine to host Rancher and the private registry
  3. One VM for the HTTP and NTP server
  4. At least two external VLAN networks
  5. Each node machine should provide more than a 16-core CPU, 32 GB memory, and 500 GB NVMe disks to meet the Harvester recommended hardware requirements

Test case reference

Roles

  • Harvester cluster on bare metal machines

  • VMs hosted on the same bare metal machine (tentative)

    • Rancher instance
    • Docker registry
  • On the same VM (tentative)

    • Artifact server
      • Download starting version of Rancher images
      • Provide the OS image
      • Provide required docker image for Rancher integration
    • Name server
    • NTP server

Describe the items of the test development (DoD, definition of done) you'd like

Stage 1 Design discussion

  • Design the architecture diagram and the deployment diagram of each component
  • Discuss the implementation plan

Stage 2 Build Out Baseline Provision Harvester Airgap Pipeline

  • Confirm the current upgrade test pipeline on the seeder production Jenkins can correctly provision a four-node Harvester cluster on bare metal machines
  • Create another pipeline from the existing upgrade test to only provision Harvester

Stage 3 Convert Vagrant Logic To Terraform VMs (per service) For Harvester

  • Convert MinIO Logic from Vagrant to Terraform Stack (Ansible & Terraform)
  • Convert K3S-Rancher Logic from Vagrant to Terraform Stack (Ansible & Terraform)
  • Convert DNS-Server Logic from Vagrant to Terraform Stack (Ansible & Terraform)
  • Convert File-Server Logic from Vagrant to Terraform Stack (Ansible & Terraform)
  • Convert NTP-Server Logic from Vagrant to Terraform Stack (Ansible & Terraform)
  • Convert Registry / Hauler Logic from Vagrant to Terraform Stack (Ansible & Terraform)
  • STRETCH: RKE1 Support via Gitlab fork of Charts Terraform Stack (Ansible & Terraform)

Stage 4 Build Out All Airgap Integrations Jenkins Pipeline Utilities

  • Create a script that automates converting Terraform variables into the appropriate Groovy variables (a hedged sketch follows), and build out a README doc for each airgap integration service
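
A minimal sketch of what such a conversion could look like, assuming a hypothetical local.tfvars layout of simple key = "value" lines and emitting JobDSL stringParam() declarations; the real script, paths, and variable names would come from each integration's Terraform stack:

    // Parse simple `key = "value"` lines from a local.tfvars file (hypothetical layout)
    def tfvarsToGroovy(String tfvarsPath) {
        def vars = [:]
        new File(tfvarsPath).eachLine { line ->
            def m = (line =~ /^\s*(\w+)\s*=\s*"([^"]*)"\s*$/)
            if (m.find()) {
                vars[m.group(1)] = m.group(2)
            }
        }
        return vars
    }

    // Example: turn the parsed variables into JobDSL stringParam() declarations
    def vars = tfvarsToGroovy('terraform/airgap-integrations/minio/local.tfvars')
    vars.each { k, v ->
        println "stringParam('${k}', '${v}', 'imported from Terraform')"
    }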

Stage 5 Implement All Airgap Integrations Jenkins Pipeline on Staging

  • Build out a Jenkins pipeline that exercises (where applicable) parallel pipeline staging for the following (see the sketch after this list):
    • Minio (Init, Plan, Destroy, Apply{non-airgap then airgap} )
    • NTP-Server (Init, Plan, Destroy, Apply{non-airgap then airgap} )
    • DNS-Server (Init, Plan, Destroy, Apply{non-airgap then airgap} )
    • File-Server (Init, Plan, Destroy, Apply{non-airgap then airgap} )
    • Registry/Hauler (Init, Plan, Destroy, Apply{non-airgap then airgap} )
    • K3S-Rancher (Init, Plan, Destroy, Apply{non-airgap then airgap} )
  • Test pipeline on staging
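
A rough scripted-pipeline sketch of that parallel staging, assuming a hypothetical terraform/airgap-integrations/<service> directory layout; the Destroy step and the non-airgap-then-airgap Apply ordering are omitted here for brevity:

    // One parallel branch per airgap integration service
    def services = ['minio', 'ntp-server', 'dns-server', 'file-server', 'hauler', 'k3s-rancher']

    node {
        def branches = [:]
        services.each { svc ->
            branches[svc] = {
                dir("terraform/airgap-integrations/${svc}") {
                    sh 'terraform init -input=false'
                    sh 'terraform plan -out=tfplan -input=false'
                    sh 'terraform apply -input=false tfplan'
                }
            }
        }
        parallel branches
    }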

Stage 6 Implement additional pipeline parameters to baseline Harvester Airgap Pipeline

  • Determine File-Server To Use For Possible RSYNC Of Data From Pipeline Input To Build Config File
  • Implement Parameters that can have data input, from the airgap integrations pipeline:
    • ntp-servers
    • dns-servers
    • containerd registry settings
    • minio s3 backup details
  • Build out new seeder AddressPool, to hold ips associated with airgap vlan network
  • Modify:
    • cluster.yaml.j2 , to make adjustments to Cluster obj formed from seeder
    • main.yml , ansible within seeder role to account for needed changes
  • Validate pipeline edits on staging

Stage 7 Move Both Pipelines To Prod

  • move airgap integrations pipeline to prod
  • move airgap harvester cluster pipeline to prod
  • validate inner operation of harvester airgap cluster with all harvester airgap cluster integrations

Stage 8 Implement New Pipeline To Run Subsection Of Tests Against Airgap

  • leveraging something like pre-existing dockerfile on tests:
    • modify settings.yml to account for elements like:
      • airgap fileserver location of .qcow2/.iso cloudimgs needed for vms
      • default networks
      • any other elements
  • validate that the pipeline on staging provisions a Docker container running the subsection of tests, with a settings.yml that pertains to the integrations and environment of the airgap bare metal infrastructure
  • Provide a pipeline to pull and cache different versions of the Rancher offline images
  • Provide the ability to upgrade Rancher version under airgapped environment
@TachunLin
Contributor Author

These are the initial concepts; they need further discussion and will be updated over time.

Pipeline design concept

  1. Share the seeder machine resources with the existing upgrade test pipeline
  2. Reuse the current upgrade test pipeline to prepare a new pipeline that only provisions the Harvester cluster
  3. Require a separate pipeline in which we can specify which version of Rancher to provision
  4. If the Rancher instance is built on a VM, we can use sshuttle or port forwarding to expose the 192.168.2.0 subnet to the other host machines
  5. Use a separate pipeline to prepare and cache different versions of the offline Rancher image file
  6. Combine pipelines with dependencies to perform certain tasks, e.g. prepare the entire airgapped environment

Pipeline request

  1. Provision the Harvester cluster on bare metal machines
  2. Provision the airgapped Rancher
  3. Prepare the private Docker registry for the specific Rancher offline image
  4. Build the artifact server that provides the HTTP, DNS, and NTP services
  5. An all-in-one pipeline to get the fully airgapped test environment ready

@TachunLin
Contributor Author

Just came up with an initial idea of the fully airgapped infrastructure diagram for further discussion

image

@irishgordo
Contributor

Just came up with an initial idea of the fully airgapped infrastructure diagram for further discussion

image

I think in general this looks pretty good 😄 👍

I would only mention that there would be much more benefit in having hp-176 run two Vagrant VMs - one to serve the registry and one to serve Rancher - from the start instead of at a later point, since the combined Rancher-&-Docker-Registry provisioning currently has many flaws in ipxe-examples:
Screenshot from 2023-11-02 10-26-59

irishgordo added a commit to irishgordo/ipxe-examples that referenced this issue Nov 2, 2023
* provision resilant docker-registry
* provision rancher that utilizes registry

Resolves: harvester/tests#967
@TachunLin
Contributor Author

The Rancher instance, Docker registry, and DNS (name server) will be implemented in #942

@TachunLin
Contributor Author

Another idea is to consider moving the artifact server role from the external VM to inside the hp-176 seeder machine. This may decrease the effort needed to handle network connectivity and could better utilize the airgapped network created by Open vSwitch.

image

@irishgordo
Contributor

deployment drawio

@irishgordo
Contributor

There is a slight blocker at:
harvester/harvester#5301

This means we will need to bake additional logic into the pipeline to compensate for that bug.

@irishgordo
Contributor

We're currently encountering something that we will need to redesign logic for.
We're hitting:

org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
WorkflowScript: -1: Map expressions can only contain up to 125 entries @ line -1, column -1.
1 error

	at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:309)
	at org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1107)
	at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:624)
	at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:602)
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:579)
	at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:323)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:293)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox$Scope.parse(GroovySandbox.java:163)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:190)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:175)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:635)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:581)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:335)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:442)
Finished: FAILURE

Seemingly related to something within Jenkins / Groovy :

Investigating....

@irishgordo
Contributor

Even after pivoting, we are now hitting a limitation on the script string length...
Investigating


2024-07-23 00:05:09.979+0000 [id=26]	SEVERE	hudson.util.BootFailure#publish: Failed to initialize Jenkins
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
script: 280: String too long. The given string is 93362 Unicode code units long, but only a maximum of 65535 is allowed.
 @ line 280, column 20.
               script('''
                      ^

1 error

	at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:309)
	at org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1107)
	at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:624)
	at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:602)
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:579)
	at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:323)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:293)
	at groovy.lang.GroovyShell.parseClass(GroovyShell.java:677)
	at groovy.lang.GroovyShell.parse(GroovyShell.java:689)
	at groovy.lang.GroovyShell$parse.call(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCall(CallSiteArray.java:47)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:116)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.call(AbstractCallSite.java:128)
	at javaposse.jobdsl.dsl.AbstractDslScriptLoader.parseScript(AbstractDslScriptLoader.groovy:134)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
	at java.base/java.lang.reflect.Method.invoke(Unknown Source)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite$PogoCachedMethodSiteNoUnwrapNoCoerce.invoke(PogoMetaMethodSite.java:210)
	at org.codehaus.groovy.runtime.callsite.PogoMetaMethodSite.callCurrent(PogoMetaMethodSite.java:59)
	at org.codehaus.groovy.runtime.callsite.CallSiteArray.defaultCallCurrent(CallSiteArray.java:51)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:157)
	at org.codehaus.groovy.runtime.callsite.AbstractCallSite.callCurrent(AbstractCallSite.java:177)
	at javaposse.jobdsl.dsl.AbstractDslScriptLoader.runScriptEngine(AbstractDslScriptLoader.groovy:101)
Caused: javaposse.jobdsl.dsl.DslException: startup failed:
script: 280: String too long. The given string is 93362 Unicode code units long, but only a maximum of 65535 is allowed.
 @ line 280, column 20.
               script('''

@irishgordo
Contributor

Was able to reduce it, but still it is too big:

Caused: javaposse.jobdsl.dsl.DslException: startup failed:
script: 280: String too long. The given string is 71859 Unicode code units long, but only a maximum of 65535 is allowed.
 @ line 280, column 20.
               script('''
                      ^

1 error

We'll need to pivot to something else w/ the JobDSL plugin...

@irishgordo
Contributor

Based on some more investigation, I'm not entirely sure all integrations can be within a single pipeline job...
Still investigating...

But in:
https://github.com/jenkinsci/job-dsl-plugin/blob/e6d655dd5b2874f56af8bf4b99a4d622b752bb98/job-dsl-plugin/src/main/java/javaposse/jobdsl/plugin/JenkinsJobManagement.java#L258-L287

Where the JobDSL plugin is possibly calling:
readFileFromWorkspace()
in:

pipelineJob('example') {
    definition {
        cps {
            script(readFileFromWorkspace('project-a-workflow.groovy'))
            sandbox()
        }
    }
}

Even in that case, where we read a file, the jobdsl repo ultimately seems to do:

return filePath.readToString();

So the file is still ultimately read into a String, and we're back in the same place we would be even if we defined it as:

script(
'''
script in here
'''
)

^ because that also just yields a "String".
So we can't rip it out into a file to escape the:

 String too long. The given string is 71859 Unicode code units long, but only a maximum of 65535 is allowed.

Though... I'm not entirely sure about this.
My initial thinking is that we would need to break this up into "multiple" pipeline jobs ...
Example:

  • airgap-hauler-pipeline
  • airgap-minio-pipeline
  • airgap-dns-server-pipeline
  • airgap-ntp-server-pipeline
  • airgap-fileserver-pipeline
  • airgap-gitlab-rke1-pipeline
  • airgap-k3s-rancher-pipeline

That then scales our jobs from 1 (that provisions all integrations) to "multiple" - one per airgap integration - possibly just to avoid this Groovy limitation of the string being too big 😅 ... (a rough sketch follows)
Again, not entirely sure though....
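
A hedged JobDSL sketch of what that split could look like, assuming one hypothetical per-service workflow script checked into the seed-job workspace:

    // One small pipelineJob per airgap integration, each loading its own workflow script,
    // so no single script String approaches the 65535-code-unit limit.
    ['hauler', 'minio', 'dns-server', 'ntp-server', 'fileserver', 'gitlab-rke1', 'k3s-rancher'].each { svc ->
        pipelineJob("airgap-${svc}-pipeline") {
            definition {
                cps {
                    script(readFileFromWorkspace("jenkins/airgap-${svc}-pipeline.groovy"))
                    sandbox()
                }
            }
        }
    }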

@irishgordo
Contributor

With:
https://github.com/irishgordo/harvester-baremetal-ansible/commit/b80e3dde3a36281bb7e861c5fb2c0956d66473f4
Was able to reduce it so that the "String too long" error disappeared.

But the underlying problem is still present: Groovy now just sees the script method as being too large in general...

Investigating the new error of:

Started by user [admin](http://172.19.98.192:8083/user/admin)
Running as [admin](http://172.19.98.192:8083/user/admin)
org.codehaus.groovy.control.MultipleCompilationErrorsException: startup failed:
General error during class generation: Method too large: WorkflowScript.___cps___1 ()Lcom/cloudbees/groovy/cps/impl/CpsFunction;

groovyjarjarasm.asm.MethodTooLargeException: Method too large: WorkflowScript.___cps___1 ()Lcom/cloudbees/groovy/cps/impl/CpsFunction;
	at groovyjarjarasm.asm.MethodWriter.computeMethodInfoSize(MethodWriter.java:2087)
	at groovyjarjarasm.asm.ClassWriter.toByteArray(ClassWriter.java:447)
	at org.codehaus.groovy.control.CompilationUnit$17.call(CompilationUnit.java:850)
	at org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1087)
	at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:624)
	at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:602)
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:579)
	at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:323)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:293)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox$Scope.parse(GroovySandbox.java:163)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:190)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:175)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:635)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:581)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:335)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:442)

1 error

	at org.codehaus.groovy.control.ErrorCollector.failIfErrors(ErrorCollector.java:309)
	at org.codehaus.groovy.control.CompilationUnit.applyToPrimaryClassNodes(CompilationUnit.java:1107)
	at org.codehaus.groovy.control.CompilationUnit.doPhaseOperation(CompilationUnit.java:624)
	at org.codehaus.groovy.control.CompilationUnit.processPhaseOperations(CompilationUnit.java:602)
	at org.codehaus.groovy.control.CompilationUnit.compile(CompilationUnit.java:579)
	at groovy.lang.GroovyClassLoader.doParseClass(GroovyClassLoader.java:323)
	at groovy.lang.GroovyClassLoader.parseClass(GroovyClassLoader.java:293)
	at org.jenkinsci.plugins.scriptsecurity.sandbox.groovy.GroovySandbox$Scope.parse(GroovySandbox.java:163)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.doParse(CpsGroovyShell.java:190)
	at org.jenkinsci.plugins.workflow.cps.CpsGroovyShell.reparse(CpsGroovyShell.java:175)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.parseScript(CpsFlowExecution.java:635)
	at org.jenkinsci.plugins.workflow.cps.CpsFlowExecution.start(CpsFlowExecution.java:581)
	at org.jenkinsci.plugins.workflow.job.WorkflowRun.run(WorkflowRun.java:335)
	at hudson.model.ResourceController.execute(ResourceController.java:101)
	at hudson.model.Executor.run(Executor.java:442)
Finished: FAILURE

@irishgordo
Contributor

Trying:

        JAVA_OPTS: "-Dorg.jenkinsci.plugins.pipeline.modeldefinition.parser.RuntimeASTTransformer.SCRIPT_SPLITTING_TRANSFORMATION=true -Djenkins.install.runSetupWizard=false -Djenkins.install.SetupWizard.adminInitialApiToken=\"{{ lookup('password', '/dev/null length=20 chars=ascii_letters') }}\" -Dhudson.model.DirectoryBrowserSupport.CSP=\"\""

Specifically:

-Dorg.jenkinsci.plugins.pipeline.modeldefinition.parser.RuntimeASTTransformer.SCRIPT_SPLITTING_TRANSFORMATION=true

As suggested, this has led to the same result...
Pivoting to other solutions...

@irishgordo
Contributor

irishgordo commented Jul 23, 2024

Timeboxing...
Was trying variations of:

    definition {
        cpsScm {
            scm {
                git {
                    remote {
                        github('${harvester_baremetal_ansible_repo}', '${harvester_baremetal_ansible_branch}')
                        credentials('github-credential')
                    }
                }
                scriptPath("jenkins/harvester_airgap_integrations_pipeline.groovy")
            }
        }
    }

It's really not working... getting the params.* values to come across the wire and be interpolated is simply not working with any of these combinations:

  • github('${harvester_baremetal_ansible_repo}', '${harvester_baremetal_ansible_branch}')
  • github("${harvester_baremetal_ansible_repo}", "${harvester_baremetal_ansible_branch}")
  • github("${params.harvester_baremetal_ansible_repo}", "${params.harvester_baremetal_ansible_branch}")
  • github('${harvester_baremetal_ansible_repo}', '${harvester_baremetal_ansible_branch}')
  • github('${params.harvester_baremetal_ansible_repo}', '${params.harvester_baremetal_ansible_branch}')
  • github($harvester_baremetal_ansible_repo, $harvater_baremetal_ansible_branch)
  • github("$harvester_baremetal_ansible_repo", "$harvater_baremetal_ansible_branch")

The idea was that we'd give a specific script path like:

                scriptPath("jenkins/harvester_airgap_integrations_pipeline.groovy")

to split the script apart from the pipeline definition.
But getting the branch & repo dynamically, through interpolation of the params (stringParam) type, isn't working...

May pivot back to cpsScm -> scm -> git & scriptPath...

The Jenkins JobDSL Plugin Docs + Jenkins Docs don't seem to have "dynamic" examples...

@irishgordo
Contributor

irishgordo commented Jul 23, 2024

Re-investigating the environment-variable approach.
That would be the easiest... thinking that with some adjustments, the error in
#967 (comment)
could be worked around.

@irishgordo
Contributor

irishgordo commented Jul 23, 2024

It's difficult to overcome the "environment variable" limit... there are probably still some more ways around it...

Pivoted instead to:

With:

  • writeAndReadData(a,b)
  • generateTFVarsStage(a)

These methods get around the "method too large" error by pulling the logic of the multiple parallel stages out into two separate methods - one of which builds the local.tfvars for the respective service, since we can't leverage the default TF_VAR_* environment variables Terraform gives us: Jenkins/Groovy places a strange limitation on the size of the environment-variable map with the JobDSL plugin.

If we could, we'd avoid an entire parallel stage that's needed to build out the local.tfvars.
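
A minimal sketch of that shape, using the method names from above but with bodies, arguments, and paths that are only illustrative assumptions:

    // Returning the stage body as a closure keeps the main WorkflowScript method small,
    // which is what sidesteps the "Method too large" bytecode limit.
    def generateTFVarsStage(String svc) {
        return {
            node {
                stage("tfvars-${svc}") {
                    writeAndReadData(svc, "terraform/airgap-integrations/${svc}/local.tfvars")
                }
            }
        }
    }

    // Illustrative only: render the per-service local.tfvars, persist it, and read it back.
    def writeAndReadData(String svc, String path) {
        writeFile file: path, text: "service_name = \"${svc}\"\n"
        return readFile(path)
    }

    // Usage inside the pipeline: one tfvars stage per service, run in parallel.
    parallel(['minio', 'dns-server'].collectEntries { [(it): generateTFVarsStage(it)] })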

That leverages the second bullet point from:

@irishgordo
Contributor

irishgordo commented Jul 24, 2024

What ended up working for interpolation, while also matching the needed style of the local.tfvars for each service, is the $/.../$ (dollar-slashy) string...

def string = $/
string-goes-here
${params.interpolation}

other things like newline\n
/$

This seems to help; a small example of rendering a local.tfvars this way follows.
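
For instance (with hypothetical parameter names), the dollar-slashy string keeps the HCL-style quoting intact without extra escaping:

    // Hypothetical pipeline parameters; the real ones come from the airgap integrations pipeline
    def tfvars = $/
    rancher_version   = "${params.rancher_version}"
    registry_hostname = "${params.registry_hostname}"
    airgap_subnet     = "${params.airgap_subnet}"
    /$
    writeFile file: 'local.tfvars', text: tfvars.trim()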

@irishgordo
Contributor

Currently testing the pipeline on staging...
Will iterate to fix any outstanding bugs as everything is now being glued together...

@irishgordo
Contributor

So, the temporary loop that runs a few iterations when we shift the VM NIC/NAD and run a separate playbook for airgap seems to help buffer:

  • terraform-provider-ansible/issues/98#issuecomment-2248602845

But the second iteration we're still seeing:

│ <172.19.121.147> (0, b'', b"OpenSSH_9.7p1, OpenSSL 3.3.1 4 Jun 2024\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 22: include /etc/ssh/ssh_config.d/*.conf matched no files\r\ndebug2: resolve_canonicalize: hostname 172.19.121.147 is address\r\ndebug1: auto-mux: Trying existing master at '/var/jenkins_home/.ansible/cp/fa3d4b2f87'\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 17473\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\ndebug3: mux_client_read_packet_timeout: read header failed: Broken pipe\r\ndebug2: Received exit status from master 0\r\n")

│ fatal: [dns-server-argp-vm]: FAILED! => {

│     "msg": "Timeout (12s) waiting for privilege escalation prompt: "

│ }

│ 

│ PLAY RECAP *********************************************************************

│ dns-server-argp-vm         : ok=0    changed=0    unreachable=0    failed=1    skipped=0    rescued=0    ignored=0   

│ 

│ 

│ 

│   with ansible_playbook.dns-vm-ansible-playbook,

│   on main.tf line 178, in resource "ansible_playbook" "dns-vm-ansible-playbook":

│  178: resource "ansible_playbook" "dns-vm-ansible-playbook" {

│ 

│ ansible-playbook

We will need to implement an arbitrary sleep timeout prior to the next iteration, so the VM can make its outbound DHCP request and get an IPv4 address assigned regardless of the network.
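
A hedged sketch of that buffer in the pipeline, with the playbook name, attempt count, and sleep length all being placeholder assumptions:

    // Give the VM time to make its outbound DHCP request and pick up an IPv4 address
    // after the NIC/NAD shift, before retrying the airgap playbook.
    def maxAttempts = 3
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            sh 'ansible-playbook -i inventory airgap-shift-playbook.yml'
            break
        } catch (err) {
            echo "Airgap shift attempt ${attempt} failed: ${err}"
            if (attempt == maxAttempts) { throw err }
            sleep time: 90, unit: 'SECONDS'
        }
    }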

@irishgordo
Contributor

Something is happening and /etc/rancher/k3s/registries.yaml isn't getting the injected variable funneled in from Jenkins (note the empty host portion in the sslip.io registry URLs):

root@k3s-server-argp-vm:/home/ubuntu# cat /etc/rancher/k3s/registries.yaml 
mirrors:
  docker.io:
    endpoint:
      - "https://airgap-docker-registry..sslip.io:5000"
  registry.suse.com:
    endpoint:
      - "https://airgap-docker-registry..sslip.io:5000"
configs:
  "airgap-docker-registry..sslip.io:5000":
    tls:
      insecure_skip_verify: true
  "https://airgap-docker-registry..sslip.io:5000":
    tls:
      insecure_skip_verify: true
  "":
    tls:
      insecure_skip_verify: true
root@k3s-server-argp-vm:/home/ubuntu# 

investigating...

@irishgordo
Contributor

This was an issue:

\"stderr\": \"Error: open /home/ubuntu/hauler-jetstack-cert-manager-images.yaml: no such file or directory\\nUsage:\\n  hauler store sync [flags]\\n\\nFlags:\\n  -f, --files strings             Path(s) to local content files (Manifests). i.e. '--files ./rke2-files.yml\\n  -h, --help                      help for sync\\n  -k, --key string                (Optional) Path to the key for signature verification\\n  -p, --platform string           (Optional) Specific platform to save. i.e. linux/amd64. Defaults to all if flag is omitted.\\n  -c, --product-registry string   (Optional) Specific Product Registry to use. Defaults to RGS Carbide Registry (rgcrprod.azurecr.us).\\n      --products strings          Used for RGS Carbide customers to supply a product and version and Hauler will retrieve the images. i.e. '--product rancher=v2.7.6'\\n  -r, --registry string           (Optional) Default pull registry for image refs that are not specifying a registry name.\\n\\nGlobal Flags:\\n      --cache string       (deprecated flag and currently not used)\\n  -l, --log-level string    (default \\\"info\\\")\\n  -s, --store string       Location to create store at (default \\\"store\\\")\",\n    \"stderr_lines\": [\n        \"Error: open /home/ubuntu/hauler-jetstack-cert-manager-images.yaml: no such file or directory\",\n        \"Usage:\",\n        \"  hauler store sync [flags]\",\n        \"\",\n        \"Flags:\",\n        \"  -f, --files strings             Path(s) to local content files (Manifests). i.e. '--files ./rke2-files.yml\",\n        \"  -h, --help                      help for sync\",\n        \"  -k, --key string                (Optional) Path to the key for signature verification\",\n        \"  -p, --platform string           (Optional) Specific platform to save. i.e. linux/amd64. Defaults to all if flag is omitted.\",\n        \"  -c, --product-registry string   (Optional) Specific Product Registry to use. Defaults to RGS Carbide Registry (rgcrprod.azurecr.us).\",\n        \"      --products strings          Used for RGS Carbide customers to supply a product and version and Hauler will retrieve the images. i.e. '--product rancher=v2.7.6'\",\n        \"  -r, --registry string           (Optional) Default pull registry for image refs that are not specifying a registry name.\",\n        \"\",\n        \"Global Flags:\",\n        \"      --cache string       (deprecated flag and currently not used)\",\n        \"  -l, --log-level string    (default \\\"info\\\")\",\n        \"  -s, --store string       Location to create store at (default \\\"store\\\")\"\n    ],\n    \"stdout\": \"\\u001b[90m2024-07-26 22:52:17\\u001b[0m \\u001b[1m\\u001b[31mERR\\u001b[0m\\u001b[0m open /home/ubuntu/hauler-jetstack-cert-manager-images.yaml: no such file or directory\",\n    \"stdout_lines\": [\n        \"\\u001b[90m2024-07-26 22:52:17\\u001b[0m \\u001b[1m\\u001b[31mERR\\u001b[0m\\u001b[0m open /home/ubuntu/hauler-jetstack-cert-manager-images.yaml: no such file or directory\"\n    ]\n}\n\nTASK [seed-hauler : Print when errors] *****************************************\ntask path: /var/jenkins_home/workspace/harvester-airgap-integrations/terraform/airgap-integrations/hauler/ansible/roles/seed-hauler/tasks/main.yml:52\nok: [hauler-server-argp-vm] =\u003e {\n    \"msg\": \"I caught an error in configuring vm further\"\n}\n\nTASK [seed-hauler : Always do this] ********************************************\ntask path: /var/jenkins_home/workspace/harvester-airgap-integrations/

Now fixed from Sunday's update.
Yielding:

╭─mike at suse-workstation-team-harvester in ~/Projects/seeder/cmd/seeder on cli✘✘✘
╰─± curl -k https://172.19.121.240:5000/v2/library/nginx/tags/list | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    43  100    43    0     0    346      0 --:--:-- --:--:-- --:--:--   349
{
  "name": "library/nginx",
  "tags": [
    "latest"
  ]
}
╭─mike at suse-workstation-team-harvester in ~/Projects/seeder/cmd/seeder on cli✘✘✘
╰─± ./hauler store add image quay.io/jetstack/cert-manager-webhook:v1.13.1 -p linux/amd64
╭─mike at suse-workstation-team-harvester in ~/Projects/seeder/cmd/seeder on cli✘✘✘
╰─± curl -k https://airgap-docker-registry.172.19.121.240.sslip.io:5000/v2/jetstack/cert-manager-cainjector/tags/list | jq
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100    63  100    63    0     0    195      0 --:--:-- --:--:-- --:--:--   195
{
  "name": "jetstack/cert-manager-cainjector",
  "tags": [
    "v1.13.1"
  ]
}

So cert-manager & nginx are present.

Additionally, all Ansible "rescue:" blocks will re-trigger an ansible.builtin.fail with a message giving context at a glance, allowing the pipelines to not "fail silently".

@irishgordo
Contributor

While we are still waiting to implement Stage 7 & Stage 8, we now have a new lab where the last part of Stage 6 - allowing our Seeder to run airgap - can become a reality once more infrastructure work is done.
cc: @TachunLin
We may also want to follow up on further optimizations that would improve the provisioning flow & timeline, such as condensing the file-server to also leverage Hauler (https://github.com/zackbradys/rancher-airgap/blob/main/examples/rancher-airgap-quickstart.md) vs. having a standalone one.

@irishgordo
Contributor

For reference, this was leveraged successfully - though outside of our lab env - to provision our needed integrations for v1.4.0 testing.

Screenshot from 2024-10-28 10-41-03
