
Add functionality to replace existing templates #93

Merged (4 commits) on Feb 10, 2023

Conversation

joeyberkovitz
Contributor

Adds functionality to delete any existing VMs with the same name as the template_name variable
Should resolve #76

To support this change, the go.mod was updated because the build was failing due to an older google cloud dependency that was now broken. As part of the upgrade, the upstream Proxmox API EFI definition changed from a string to an object. I just updated that config to match the definition in the upstream README.

@joeyberkovitz requested a review from a team as a code owner on June 7, 2022 02:08
@hashicorp-cla

hashicorp-cla commented Jun 7, 2022

CLA assistant check
All committers have signed the CLA.

@goffinf

goffinf commented Jun 7, 2022

This sounds like it adds useful functionality. I appreciate that you have set the default to 'false'; however, can I ask what the behaviour is when replacing a template that has existing linked clones? From the docs:

You cannot delete an original template while linked clones exist.

Also, depending on what has changed, an existing linked clone's configuration could be inconsistent to the point that it may not boot?

Thanks

Fraser

@joeyberkovitz
Contributor Author

It was my understanding that the source template can be removed without impacting the linked clone. Based on https://forum.proxmox.com/threads/i-deleted-original-template-while-linked-clones-exist.94879/ - it sounds like a linked clone is essentially a hard link, where if the source is deleted, the file still exists until all dependent VMs are also removed.

If this is not the case, I can look into what Proxmox actually does and potentially look into providing some sort of functionality to convert the linked clones into full clones, or maybe just auto-rename the old source VMs. For my use case and that mentioned in the linked issue, it would be ideal to remove the template so that it can be automatically replaced by a CI/CD pipeline without having to manually go in and delete old components.

@goffinf

goffinf commented Jun 7, 2022

Hey Joey, thanks for your reply. After a bunch of googling, including the link you provided to the question raised in the forum, I can't find a definitive answer as to the impact. Obviously I would assume that Proxmox's own documentation is correct; however, that doesn't tell us what the side effects of deleting an 'in-use' template are. The doc link I was quoting from is here:

https://proxmox.local.goffinf.co.uk/pve-docs/chapter-qm.html#qm_copy_and_clone

This bears some further investigation IMHO. I can help out if you want: create a template and a linked clone, attempt to delete the template while a VM is still referencing it, and if that fails, stop the VM, attempt the delete again, then try to restart.

I am completely OK if you prefer to conduct the investigation yourself.

The results of this should probably inform what approach should be taken if there are any destructive side-effects.

I also build all my templates and provision all associated infrastructure via CI pipelines, so I agree this needs to support automation where it may not be possible for the builder itself to request confirmation (that might have to be left to the pipeline implementation), even if that means there needs to be a prominent warning in the docs about potential side effects (assuming that anyone reads docs these days!).

@joeyberkovitz
Contributor Author

I can do some more extensive testing, either later today or over the next few days, but if you have time to verify as well, that would be helpful.

I did briefly test this functionality as built in the PR, so I know it lets you delete the template, and I did have one linked clone running, but I didn't do any extensive testing to verify that the linked clone was not impacted in any way.

@joeyberkovitz
Contributor Author

Based on https://github.com/proxmox/qemu-server/blob/4bb19a255925e1fe70e7d7262b289131a50f5d05/PVE/QemuServer.pm#L2304 - it appears that Proxmox checks whether a linked clone is in use when deleting a template.

I also tested and verified that I can create a linked clone, then delete the template without impacting the linked clone. For reference, this was verified using an LVM-thin pool.

@goffinf

goffinf commented Jun 8, 2022

Yes, I did a similar test: created a linked clone (with lvm-thinpool storage) from a template, then successfully deleted the template on which it was based while the clone was still running.

Given the code link you provided above, I had expected this to fail. What am I missing?

@joeyberkovitz
Contributor Author

I can't say I'm an expert on the Proxmox codebase, and I don't fully understand what is being done.

My only guess, without digging through more code, is that the code I posted is in the Qemu server backend. It's possible that the frontend PVE process maintains its own database where the template is marked deleted even though it's not actually deleted from Qemu. This would only make sense if the Qemu code is called as a subprocess, since it seems like it should die.

This is all speculation based on a limited view of the codebase and the behavior we've been seeing, but if there's anyone who's more familiar with Proxmox and can confirm, that would be great.

@nywilken
Contributor

@carlpett do you have bandwidth to provide guidance on these changes?

@nywilken
Contributor

To support this change, the go.mod was updated because the build was failing due to an older google cloud dependency that was now broken.

@joeyberkovitz does the module need to be updated to support this change, or is the build currently broken?

Looking at the go.mod, it is not immediately apparent which module you are referring to. Can you please provide some insight into this change as well? Thanks!

@joeyberkovitz
Contributor Author

joeyberkovitz commented Jun 10, 2022

@nywilken - when I pulled the code a few days ago, it wasn't in a state where it would build. I was getting some sort of build error related to one of the following modules:

cloud.google.com/go v0.94.0 // indirect
cloud.google.com/go/storage v1.16.1 // indirect

I didn't do any analysis as to what was including those modules and instead ran a go mod update, which pulled in some new API changes from the upstream Proxmox repo.

If I remember correctly, this was the error: googleapis/google-cloud-go#5304

@nywilken
Contributor

@nywilken - when I pulled the code a few days ago, it wasn't in a state where it would build. I was getting some sort of build error related to one of the following modules:

cloud.google.com/go v0.94.0 // indirect
cloud.google.com/go/storage v1.16.1 // indirect

I didn't do any analysis as to what was including those modules and instead ran a go mod update, which pulled in some new API changes from the upstream Proxmox repo.

If I remember correctly, this was the error: googleapis/google-cloud-go#5304

Thanks for the quick reply. Could you provide the version of Go being used to build? I would like to try and reproduce locally. I'm not seeing this issue when building with Go 1.17.8.

@joeyberkovitz
Contributor Author

I'm on Go version 1.18.3.

If the google cloud problem is a non-issue, I'd be happy to revert that change along with the associated Proxmox updates, as it's not really related to this PR.

@xoxys

xoxys commented Aug 6, 2022

Why do we need a new plugin option replace_existing for this functionality? Shouldn't the plugin use the default -force flag from the packer build command?

@joeyberkovitz
Contributor Author

Sure, we can use -force. The only difference compared to other hypervisors such as VMware is that VM names aren't guaranteed to be unique, so force here would potentially delete multiple VMs. If that sounds reasonable, it's not a big deal to switch it over.

@xoxys

xoxys commented Aug 7, 2022

Can we prefer the ID if set and only fall back to the name? It would IMO also be reasonable to honor force only if the ID is set, to avoid the issue with multiple names.

@joeyberkovitz
Contributor Author

Sure, that could be done.
Personally, I don't think I would ever use it that way; carefully managing VM IDs seems a bit clunky compared to just ensuring that a given VM name exists, but I wouldn't have any issue building the logic for that.

@joeyberkovitz
Contributor Author

The latest commit uses -force instead of adding replace_existing. It also supports replacing a single VM/template by ID if an ID is specified; otherwise it replaces by name.

I also added unit tests which verify that delete is never called if force is disabled, and that delete is called appropriately if force is set.
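A minimal sketch of that flow, for illustration only: the Config fields and the two helpers (lookupVMIDsByName, deleteVM) are assumed names, not the plugin's actual identifiers.

package sketch

import "fmt"

// Config holds only the fields relevant to this sketch.
type Config struct {
	PackerForce bool   // true when run with `packer build -force`
	VMID        int    // optional fixed VM ID
	VMName      string // VM/template name; not guaranteed unique in a cluster
}

// removeExisting deletes whatever the new template would collide with. With an
// explicit VMID it removes at most one VM; with only a name it may remove
// several, since Proxmox does not enforce unique names.
func removeExisting(c Config, lookupVMIDsByName func(string) ([]int, error), deleteVM func(int) error) error {
	if !c.PackerForce {
		// Without -force, existing VMs and templates are never touched.
		return nil
	}
	var ids []int
	if c.VMID != 0 {
		ids = []int{c.VMID}
	} else {
		found, err := lookupVMIDsByName(c.VMName)
		if err != nil {
			return err
		}
		ids = found
	}
	for _, id := range ids {
		fmt.Printf("deleting existing VM/template %d\n", id)
		if err := deleteVM(id); err != nil {
			return err
		}
	}
	return nil
}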

@nywilken self-assigned this Nov 1, 2022
@modem7

modem7 commented Nov 25, 2022

Would the EFI changes conflict with #90 (or render it not required)?

@joeyberkovitz
Contributor Author

Yes, it probably makes sense to merge #90 first; then I can rebase onto that.

@sebastian-de
Contributor

sebastian-de commented Jan 16, 2023

Hey @joeyberkovitz,
now that the EFI changes are merged, I have rebased your changes onto the current main branch: https://github.com/sebastian-de/packer-plugin-proxmox/tree/replace

Feel free to use this branch if you'd like to update your PR.

@joeyberkovitz
Contributor Author

joeyberkovitz commented Jan 17, 2023

I switched over to your rebased branch. The code looks fine, but for some reason I can't build on my machine; I'm pretty sure this is the same issue I ran into initially, which led to upgrading the dependencies:

Output from go build . after running go get -u all:

cloud.google.com/go/storage

...\go\pkg\mod\cloud.google.com\go\storage@v1.16.1\storage.go:1416:54: o.GetCustomerEncryption().GetKeySha256 undefined (type *"google.golang.org/genproto/googleapis/storage/v2".CustomerEncryption has no field or method GetKeySha256)

@sebastian-de
Contributor

Hmm, I can build this branch here without problems. Which go version do you use? I currently use go1.19.4.

@joeyberkovitz
Contributor Author

I tested on 1.19.5 and also reset my go/pkg directory, then ran go build . and got the same error. Not sure if this is a Windows-only issue; if so, I guess it's fine.

@sebastian-de
Contributor

Hey @joeyberkovitz, I tested your changes today and have some suggestions.
In general your approach works fine and existing VMs are removed when packer build -force is used. That's definitely a much-needed feature for this builder - thanks for your work on this!

However, in my opinion Packer should be stricter about which VMs it chooses to remove. Citing the Packer docs on what -force should do:

Forces a builder to run when artifacts from a previous build prevent a build from running.

We have no way to check if the specified VM was created by Packer, but we can make a few assumptions. One is that the Proxmox builders always produce templates. So I think we should check if the given VM is a template before removing it. Since template VMs are never in a running state, we don't have to stop VMs either. In my opinion Packer should never attempt to stop a running VM.

Another thing that should be checked before removing a VM is that only one VM exists with the given vm_name. As @xoxys already said, VM names aren't unique in a Proxmox cluster, so we should be extra cautious when names are used to delete VMs.

I think with these changes -force should still be compatible with your use case.

I created a commit to show what I mean here: https://github.com/sebastian-de/packer-plugin-proxmox/commit/4cbb6bdd5987d176d4ddb022e7edcceefefc3031 (tests not adjusted yet).
Please let me know what you think!
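To make the proposed guard concrete, here is a rough sketch under assumed names (vmInfo and okToDelete are illustrative stand-ins, not the code in the linked commit):

package sketch

import "fmt"

// vmInfo is a stand-in for whatever the plugin knows about a matched VM.
type vmInfo struct {
	ID         int
	IsTemplate bool
}

// okToDelete applies the stricter policy: when matching by name only, require
// exactly one match, and in every case require the match to be a template so
// running VMs are never stopped or removed.
func okToDelete(matches []vmInfo, matchedByID bool) error {
	if !matchedByID && len(matches) != 1 {
		return fmt.Errorf("expected exactly one VM with this name, found %d", len(matches))
	}
	for _, m := range matches {
		if !m.IsTemplate {
			return fmt.Errorf("VM %d is not a template, refusing to delete it", m.ID)
		}
	}
	return nil
}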

@xoxys

xoxys commented Jan 29, 2023

To be honest, this change is rather small. The default behavior will not change, and the use of an additional -force flag already adds an extra layer of protection in my opinion. Even if the behavior can be improved, I really wonder a bit why such a change has not been merged for ~6 months.

@joeyberkovitz
Contributor Author

@sebastian-de: I'm fine with adjusting the logic to only remove templates if that is the most desirable design.

With the current design, if an ID is specified, it's pretty safe, as it will delete at most one VM with the specified ID and create a new one to replace it.

The idea with deleting any VM matching a given name was that at some later point you're probably going to want to clone that template into a VM, at which point it could be important for that name to be unique.


@lbajolet-hashicorp left a comment


Hi @joeyberkovitz,

Sorry for letting this PR slide under the radar for so long, and thank you for keeping it updated during all this time.

I've taken a look at the code here, and on @sebastian-de's branch as they linked it in a discussion above.

I think they have a point regarding whether or not we should stop running VMs that correspond to the template we are re-building. If the template is the thing we are replacing, we can probably keep the existing VMs alive and decide later if we want to stop them.

On that note I must say that I'm not necessarily a Proxmox expert and I don't have any infrastructure to test those changes on, so ultimately I'll leave it to you to decide if that's what the plugin should do or not.

Apologies again for the delay in reviewing this PR. I'll follow it closely and will try to get it merged as soon as we're all satisfied with the code.

ui.Say("Force set, checking for existing VM or template")
vmrs, err := getExistingVms(c, client)
if err != nil {
// This can happen if no VMs found, so don't consider it to be an error
Contributor

Is there a way we can check the error we got here so we only continue if no VMs are found? I'm worried that if there's an unrelated problem (network for example), we won't delete the VM, and in this case we should probably error rather than continue.

Contributor Author

Yes, although it's a pretty sub-optimal check. The Proxmox client doesn't have typed errors, so we would have to check whether the error string matches either vm '%d' not found or vm '%s' not found, with the VM ID or name filled in.

I have no problem adding this, with some logic to just note when nothing is found but error and fail if there's a different error, although this could break if the client is updated and the error string ever changes.
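For concreteness, the string match being discussed would look roughly like this; a sketch only, assuming the two message formats quoted above:

package sketch

import "fmt"

// isVMNotFound reports whether err is the client's "not found" error for the
// given VM ID or name. The client returns untyped string errors, so this
// relies on matching the known message formats and breaks if they ever change.
func isVMNotFound(err error, vmID int, vmName string) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return msg == fmt.Sprintf("vm '%d' not found", vmID) ||
		msg == fmt.Sprintf("vm '%s' not found", vmName)
}

A caller could then treat a not-found error as "nothing to delete" and fail the build on anything else.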

continue
}
// Wait until vm is stopped. Otherwise, deletion will fail.
ui.Say("Waiting for VM to stop for up to 300 seconds")
Contributor

The 300s limit here feels arbitrary; I wonder if this could become an option in the template?

Contributor Author

Sure, that could be an option.
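As a rough illustration of that suggestion (the option name vm_stop_timeout and its behavior are assumptions, not an existing setting):

package sketch

// stopConfig sketches a builder option for the stop wait; the field and key
// names are hypothetical.
type stopConfig struct {
	// Seconds to wait for an existing VM to stop before deleting it;
	// zero would fall back to the current hard-coded default of 300.
	VMStopTimeout int `mapstructure:"vm_stop_timeout"`
}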

@sebastian-de
Contributor

I tested on 1.19.5 and also reset my go/pkg directory, then ran go build . and got the same error. Not sure if this is a Windows-only issue; if so, I guess it's fine.

Regarding your build problems: Have you tried updating only the dependency that causes the error?
go get cloud.google.com/go/storage && go mod tidy

@joeyberkovitz
Contributor Author

Yes, that works, although a bunch of indirect dependencies end up being updated as a result.

@joeyberkovitz
Contributor Author

@sebastian-de - I ultimately have nothing against only deleting VMs that are templates if that is the preferable route. Just let me know if that's the decided-on option and I can switch over to your commit and update the tests.

joeyberkovitz and others added 4 commits February 2, 2023 08:29
Also add unit tests for force.
only delete existing VM if
- vm_name is unique (when no vm_id is given)
- it's a template
@sebastian-de
Contributor

Hey @joeyberkovitz, please excuse me if my comment sounded like I had already made a decision on the matter. I am also still open to discussing whether we should delete VMs that are not templates.
Since we plan to use Packer on our production clusters, I'm just leaning towards a more conservative approach.
But of course I would be happy if you'd like to include my commit - the updated version with adjusted tests is here: https://github.com/sebastian-de/packer-plugin-proxmox/commit/1a1faab5bc1849de1a5ef43e97a786fa43087579

@joeyberkovitz
Contributor Author

I see the reasoning for preferring to not delete multiple VMs or stop any running VMs. I think that in a more realistic scenario where Packer is running more continually, updating the same image, there would only ever be exactly one template with a given name, and as a result it would never be running.

When a person onboards onto Packer, though, they would have to manually clean up the Proxmox cluster if there are any duplicates or running VMs with a name matching the template. This scenario is what I came across frequently during testing, although once this feature is merged and a cluster is manually cleaned up, if Packer is the only thing ever creating templates, then it should never really have to delete running or non-template VMs.

Overall, I think the extra deletions could be nice for that onboarding, but given that it's usually going to be a one-time task, we could just not support that use case and lean toward the conservative option.

@sebastian-de
Contributor

When a person onboards onto Packer, though, they would have to manually clean up the Proxmox cluster if there are any duplicates or running VMs with a name matching the template.

I get why you want to avoid that, but I remain cautious because in that scenario it's possible that Packer may delete VMs that the user did not intend to. IMHO that would be a far worse user experience.
If it's implemented that way, we should at least add a big warning to the docs that using -force together with vm_name will delete all VMs matching that name.

@joeyberkovitz
Contributor Author

Understood and agreed that it makes sense to be more cautious at the expense of potentially a bit more work during initial onboarding.

As such, I've pulled in the changes from @sebastian-de. Both comments from @lbajolet-hashicorp should be resolved: the changes include a check to see what error was returned from Proxmox (although it can break if the Proxmox API client error strings change in the future). The timeout is no longer relevant since we'll never stop any running VMs.

@xoxys

xoxys commented Feb 9, 2023

During the development of my Packer templates, it happened quite often that the VM creation/initialization did not finish. As a result, the final template was never created and the VM still existed. To restart the build process, I then had to manually clean up all remaining components every time.

That is how I came across this PR, and I think that using an additional flag like -force with proper documentation is safe. Now the PR has moved in a direction where this use case is no longer covered and people still have to clean things up manually because it is considered even safer (which was also a terrible user experience for me). What would be the suggested implementation to do this automatically? Add another -force-all flag?

@joeyberkovitz
Contributor Author

This becomes a question of how safe the inherently unsafe force operation should be, given that it needs to be used in production scenarios. Below are some options:

  • unsafe: stop and delete all VMs matching the name or ID if specified
  • semi-safe: relax the template requirement if ID is specified to allow for stopping a single VM
  • semi-safe: don't ever check if a VM is a template and always stop and delete at most 1 VM
  • safe: as is right now, only delete at most 1 template

Another option is to optionally delete a VM on failure like the VMware builder does, although if you needed to debug, you would likely disable that feature anyway.

Considering the development use case, I would prefer one of the semi-safe or unsafe options. If needed, flags could also be added to allow deleting non-templates and stopping running VMs.

Once a decision is made regarding how the plugin should be designed, I can adjust accordingly.

@lbajolet-hashicorp
Contributor

I think @joeyberkovitz hits the nail on the head with the "delete an image in case of failure" idea.

Generally plugins are responsible for cleaning up after themselves; in this case that would mean deleting the VM in case of error.

To your point regarding debugging, there is a flag on Packer that you can use to specify what to do in case of error: --on-error. It defaults to cleanup, but if you wanted to wait and ssh into the VM, you could set it to ask, where it will wait until you choose what to do with the VM.
Unless I'm mistaken, abort would leave the VM alive as well; it would immediately abort the build without cleaning up anything.
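For example, a debugging run could use packer build -force -on-error=ask, whereas the default -on-error=cleanup removes the build VM automatically on failure.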

The --force flag is part of Packer, and its semantics already depend on what the plugins want to do with it. As such, I don't think we should have another --force-all flag; I don't think the project will benefit from that. On the contrary, it would add another layer of complexity and confusion, so I think we should steer clear of it.

Note that this doesn't need to be part of this PR; it can be opened as a subsequent one later.
Judging from the discussion, automatically removing the VM at the end of the process (or only in case of failure? That depends on how we implement the cleanup) combined with the --on-error flag would work well. What does everyone think?

@sebastian-de
Contributor

I never used the --on-error flag, but without it Packer removes failed builds pretty reliably for me, even if it's killed via keyboard-interrupt. So Cleanup() seems to work fine in the majority of cases.

func (s *stepStartVM) Cleanup(state multistep.StateBag) {

The only occasion I had a leftover VM was when I killed our GitLab runner, which took down the Packer container with it, so Packer had no chance to clean up. After that I had to stop and delete the VM manually.

So I get why @xoxys wants a more aggressive -force for situations like this.

With all the different variables, I guess we won't find an easy solution that fits all use cases, so I'm still in favor of the current, least aggressive approach.

@xoxys

xoxys commented Feb 9, 2023

All valid points, and I think --on-error=ask would be a feasible way to debug and clean up. Let's continue with the current, least aggressive approach 👍

@lbajolet-hashicorp
Contributor

With what @sebastian-de shared, it seems the plugin does clean up the VM. I'm wondering why you have to manually clean up the VMs then, @xoxys, since it seems the plugin should stop and remove the one being built. Do you forcibly kill Packer, or do you let it perform the cleanup when the build fails?

If you forcibly kill it then that makes sense; otherwise, there's probably a bug begging to be addressed in this logic.

@xoxys

xoxys commented Feb 9, 2023

To be honest I can't remember the details and would assume I have done something wrong. Sorry for the confusion.


@lbajolet-hashicorp left a comment


@joeyberkovitz, thanks for the multiple rerolls.

Regarding the error message check, I do agree this is not super robust: if the message changes on the API side, we'll start failing when attempting to delete a VM that doesn't exist. But I think this will be good enough until the API offers a better option.

Unless someone has an objection, I will merge this tomorrow; in its current state the PR looks good to me.

Successfully merging this pull request may close these issues.

overwrite existing template image.
8 participants