Add functionality to replace existing templates #93
Conversation
This sounds like it adds useful functionality. I appreciate that you have set the default to 'false'; however, can I ask what the behaviour is when replacing a template where there are existing linked clones? From the docs:
Also, depending on what has changed, an existing linked clone's configuration could be inconsistent to the point that it may not boot? Thanks, Fraser
It was my understanding that the source template can be removed without impacting the linked clone. Based on https://forum.proxmox.com/threads/i-deleted-original-template-while-linked-clones-exist.94879/ it sounds like a linked clone is essentially a hard link: if the source is deleted, the underlying file still exists until all dependent VMs are also removed. If this is not the case, I can look into what Proxmox actually does and potentially provide some functionality to convert the linked clones into full clones, or maybe just auto-rename the old source VMs. For my use case and the one mentioned in the linked issue, it would be ideal to remove the template so that it can be automatically replaced by a CI/CD pipeline without having to manually go in and delete old components.
Hey Joey, thanks for your reply. After a bunch of googling, including the link you provided to the question raised in the forum, I can't find a definitive answer as to the impact. Obviously I would assume that Proxmox's own documentation is correct, but that is not to say what the side effects of deleting an 'in-use' template are. The doc link I was quoting from is here: https://proxmox.local.goffinf.co.uk/pve-docs/chapter-qm.html#qm_copy_and_clone
This bears some further investigation, IMHO. I can help out if you want (create a template and a linked clone, (attempt to) delete the template while a VM is still referencing it, and if that fails, stop the VM, attempt the delete, then try to restart ...), but I am completely OK if you prefer to conduct any investigation yourself. The results of this should probably inform what approach is taken if there are any destructive side effects. I also build all my templates and provision all associated infrastructure via CI pipelines, so I agree this needs to support automation where it may not be possible for the builder itself to request confirmation (that might have to be left to the pipeline implementation), even if that means there needs to be a prominent warning in the docs about potential side effects (assuming that anyone reads docs these days!).
I can do some more extensive testing either later today or over the next few days, but if you have time to verify as well, that would be helpful. I did briefly test this functionality as built in the PR, so I know it lets you delete the template, and I did have one linked clone running, but I didn't do any extensive testing to verify that the linked clone was not impacted in any way.
Based on https://github.com/proxmox/qemu-server/blob/4bb19a255925e1fe70e7d7262b289131a50f5d05/PVE/QemuServer.pm#L2304 it appears that Proxmox checks whether a linked clone is in use when deleting a template. I also tested and verified that I can create a linked clone, then delete the template without impacting the linked clone. For reference, this was verified using an LVM-thin pool.
Yes, I did a similar test: created a linked clone (with lvm-thinpool storage) from a template, then successfully deleted the template it was based on while the clone was still running. Given the code link you provided above, I had expected this to fail ... what am I missing?
I can't say I'm an expert on the Proxmox codebase, and I don't fully understand what is being done. My only guess, without digging through more code, is that the code I posted is in the Qemu server backend. It's possible that the frontend PVE process maintains its own database where the template is marked deleted even though it's not actually deleted from Qemu. This would only make sense if the Qemu code is called as a subprocess, since it seems like it should otherwise die. This is all speculation based on a limited view of the codebase and the behavior we've been seeing, but if there's anyone more familiar with Proxmox who can confirm, that would be great.
@carlpett do you have bandwidth to provide guidance on these changes?
@joeyberkovitz does the module need to be updated to support this change, or is it currently broken? Looking at the go.mod, it is not immediately apparent which module you are referring to. Can you please provide insight into this change as well? Thanks!
@nywilken - when I pulled the code a few days ago, it wasn't in a state where it would build. I was getting some sort of build error related to one of the following modules:
I didn't do any analysis as to what was including those modules and instead ran a go mod update, which pulled in some new API changes from the upstream Proxmox repo. If I remember correctly, this was the error: googleapis/google-cloud-go#5304
Thanks for the quick reply. Could you provide the version of Go being used to build? I would like to try to reproduce locally. I'm not seeing this issue when building with Go 1.17.8.
I'm on Go version 1.18.3. If the Google Cloud problem is a non-issue, I'd be happy to revert that change along with the associated Proxmox updates, as it's not really related to this PR.
Why do we need a new plugin option |
Sure - we can use
Can we prefer the ID if set and only fall back to the name? It would IMO also be reasonable to use
Sure, that could be done.
The latest commit uses that approach. I also added unit tests which verify that delete is never called if force is disabled, and that delete is called appropriately when force is set.
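For reference, a minimal sketch of the kind of unit test described; the fake client, helper function, and names below are hypothetical stand-ins rather than the plugin's actual types:

```go
package proxmox

import "testing"

// fakeVMDeleter is a hypothetical stand-in for the Proxmox API client,
// recording whether DeleteVm was ever invoked.
type fakeVMDeleter struct {
	deleteCalled bool
}

func (f *fakeVMDeleter) DeleteVm(vmID int) error {
	f.deleteCalled = true
	return nil
}

// removeExistingVM is an illustrative version of the force logic:
// it only ever deletes when force is enabled.
func removeExistingVM(client *fakeVMDeleter, vmID int, force bool) {
	if !force {
		return
	}
	_ = client.DeleteVm(vmID)
}

func TestForceDisabledNeverDeletes(t *testing.T) {
	client := &fakeVMDeleter{}
	removeExistingVM(client, 100, false)
	if client.deleteCalled {
		t.Fatal("DeleteVm must not be called when force is disabled")
	}
}

func TestForceEnabledDeletes(t *testing.T) {
	client := &fakeVMDeleter{}
	removeExistingVM(client, 100, true)
	if !client.deleteCalled {
		t.Fatal("DeleteVm should be called when force is enabled")
	}
}
```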
Would the EFI changes conflict with #90 (or render it not required)?
Yes, it probably makes sense to merge #90 first; then I can rebase onto that.
Hey @joeyberkovitz, feel free to use this branch if you'd like to update your PR.
I switched over to your rebased branch. Code looks fine, but for some reason I can't build on my machine; I'm pretty sure this is the same issue I ran into initially, which led to upgrading dependencies. Output from cloud.google.com/go/storage: ...\go\pkg\mod\cloud.google.com\go\storage@v1.16.1\storage.go:1416:54: o.GetCustomerEncryption().GetKeySha256 undefined (type *"google.golang.org/genproto/googleapis/storage/v2".CustomerEncryption has no field or method GetKeySha256)
Hmm, I can build this branch here without problems. Which Go version do you use? I currently use
I tested on
Hey @joeyberkovitz, I tested your changes today and have some suggestions. However, in my opinion Packer should be stricter about which VMs it will choose to remove. Citing the Packer docs on what
We have no way to check whether the specified VM was created by Packer, but we can make a few assumptions. One is that the Proxmox builders always produce templates, so I think we should check whether the given VM is a template before removing it. Since template VMs are never in a running state, nothing would need to be stopped. Another thing that should be checked before removing a VM is that there exists only one VM with the given name. I think these changes make the behaviour safer overall. I created a commit to show what I mean here: https://github.com/sebastian-de/packer-plugin-proxmox/commit/4cbb6bdd5987d176d4ddb022e7edcceefefc3031 (tests not adjusted yet).
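As a rough sketch of the checks being proposed here (the types and helper below are simplified placeholders, not the actual plugin code or the proxmox-api-go API):

```go
package proxmox

import "fmt"

// existingVM is a hypothetical, simplified view of a VM returned by the API.
type existingVM struct {
	ID         int
	Name       string
	IsTemplate bool
}

// selectVMToReplace applies the stricter rules described above: only ever
// remove a template, and when matching by name, require the name to be
// unique so the wrong VM cannot be deleted by accident.
func selectVMToReplace(vms []existingVM, name string) (*existingVM, error) {
	var matches []existingVM
	for _, vm := range vms {
		if vm.Name == name {
			matches = append(matches, vm)
		}
	}
	if len(matches) == 0 {
		// Nothing to replace; the build can proceed normally.
		return nil, nil
	}
	if len(matches) > 1 {
		return nil, fmt.Errorf("found %d VMs named %q, refusing to guess which one to replace", len(matches), name)
	}
	if !matches[0].IsTemplate {
		return nil, fmt.Errorf("VM %d named %q is not a template, refusing to remove it", matches[0].ID, name)
	}
	return &matches[0], nil
}
```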
To be honest, this change is rather small. The default behavior will not change, and the use of an additional flag keeps it opt-in.
@sebastian-de: I'm fine with adjusting the logic to only remove templates if that is the most desirable design. With the current design, if an ID is specified, it's pretty safe, as it will only delete at most one VM with the specified ID, creating a new one to replace it. The idea behind deleting any VM matching a given name was that at some later point you're probably going to want to clone that template into a VM, at which point it could be important for that name to be unique.
Hi @joeyberkovitz,
Sorry for letting this PR slide under the radar for so long; thank you for keeping it updated during all this time.
I've taken a look at the code here, and on @sebastian-de's branch as they linked it in a discussion above.
I think they have a point regarding whether or not we should stop running VMs that correspond to the template we are re-building. If the template is the thing we are replacing, we can probably keep the existing VMs alive and decide later whether we want to stop them.
On that note I must say that I'm not necessarily a Proxmox expert and I don't have any infrastructure to test those changes on, so ultimately I'll leave it to you to decide if that's what the plugin should do or not.
Apologies again for the delay in reviewing this PR; I'll follow it closely and will try to get it merged as soon as we're all satisfied with the code.
ui.Say("Force set, checking for existing VM or template") | ||
vmrs, err := getExistingVms(c, client) | ||
if err != nil { | ||
// This can happen if no VMs found, so don't consider it to be an error |
Is there a way we can check the error we got here so we only continue if no VMs are found? I'm worried that if there's an unrelated problem (network, for example), we won't delete the VM, and in this case we should probably error rather than continue.
Yes, although it's a pretty sub-optimal check: the Proxmox client doesn't have typed errors, so we would have to check whether the error string matches either vm '%d' not found or vm '%s' not found, with the VM ID or name filled in.
I have no problem adding this, with some logic to just note when nothing is found but error and fail if there's a different error, although this could break if the client is updated and the error string ever changes.
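For illustration, a sketch of that string-based check, using the error strings quoted above; the helper name and package layout are hypothetical:

```go
package proxmox

import (
	"fmt"
	"strings"
)

// isVMNotFoundError reports whether an error returned by the Proxmox client
// looks like a "vm not found" error. The client does not expose typed errors,
// so this string match is brittle and will break if the upstream message
// ever changes.
func isVMNotFoundError(err error, vmID int, vmName string) bool {
	if err == nil {
		return false
	}
	msg := err.Error()
	return strings.Contains(msg, fmt.Sprintf("vm '%d' not found", vmID)) ||
		strings.Contains(msg, fmt.Sprintf("vm '%s' not found", vmName))
}
```

Any error that doesn't match would then be treated as fatal and fail the build instead of being silently ignored.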
continue
}
// Wait until vm is stopped. Otherwise, deletion will fail.
ui.Say("Waiting for VM to stop for up to 300 seconds")
The 300s limit here feels arbitrary; I wonder if this could become an option in the template?
Sure, that could be an option.
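Something along those lines could look like the sketch below; the interface and function are placeholders written under the assumption of a configurable stop timeout, not the plugin's actual configuration keys or client API:

```go
package proxmox

import (
	"fmt"
	"time"
)

// vmStatusGetter is a hypothetical abstraction over the Proxmox client call
// that reports a VM's current state.
type vmStatusGetter interface {
	GetVmState(vmID int) (string, error)
}

// waitForVMStopped polls until the VM reports "stopped" or the configurable
// timeout expires, instead of hard-coding a 300-second limit.
func waitForVMStopped(client vmStatusGetter, vmID int, timeout time.Duration) error {
	deadline := time.Now().Add(timeout)
	for time.Now().Before(deadline) {
		state, err := client.GetVmState(vmID)
		if err != nil {
			return err
		}
		if state == "stopped" {
			return nil
		}
		time.Sleep(2 * time.Second)
	}
	return fmt.Errorf("VM %d did not stop within %s", vmID, timeout)
}
```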
Regarding your build problems: Have you tried updating only the dependency that causes the error?
Yes, that works, although a bunch of indirect dependencies end up being updated as a result.
@sebastian-de - I ultimately have nothing against only deleting VMs that are templates, if that is the preferable route; just let me know if that's the decided-on option and I can switch over to your commit and update the tests.
Also add unit tests for force.
only delete existing VM if
- vm_name is unique (when no vm_id is given)
- it's a template
Hey @joeyberkovitz, please excuse me if my comment sounded like I had already made a decision on the matter. I am also still open to discussing whether we should delete VMs that are not templates.
I see the reasoning for preferring not to delete multiple VMs or stop any running VMs. I think that in a more realistic scenario, where Packer runs continually and updates the same image, there would only ever be exactly one template with a given name, and as a result it would never be running. When a person onboards onto Packer, though, they would have to manually clean up the Proxmox cluster if there are any duplicates or running VMs with a name matching the template. This scenario is what I came across frequently during testing, although once this feature is merged and a cluster is manually cleaned up, if Packer is the only thing ever creating templates, then it should never really have to delete running or non-template VMs. Overall, I think the extra deletions could be nice for that onboarding, but given that it's usually going to be a one-time task, we could just not support that use case and lean toward the conservative option.
I get why you want to avoid that, but I remain cautious because in that scenario it's possible that Packer may delete VMs that the user did not intend it to. IMHO that would be a far worse user experience.
Understood, and agreed that it makes sense to be more cautious at the expense of potentially a bit more work during initial onboarding. As such, I've pulled in the changes from @sebastian-de. Both comments from @lbajolet-hashicorp should be resolved: the changes include a check to see what error was returned from Proxmox (although it can fail if the Proxmox API client error strings change in the future), and the timeout is no longer relevant since we'll never stop any running VMs.
During the development of my Packer templates, it happened quite often that the VM creation/initialization did not finish. As a result, the final template was never created and the VM still exists. To restart the build process, it should not be required to manually clean up all remaining components every time. That is how I came across this PR, and I think that using an additional flag like
This becomes a question of how safe the inherently unsafe force operation should be, given that it needs to be used in production scenarios. Below are some options:
Another option is to optionally delete a VM on failure, like the VMware builder does, although if you needed to debug you would likely disable that feature anyway. Considering the development use case, I would prefer one of the semi-safe or unsafe options. If needed, flags can also be added to allow deleting non-templates and stopping running VMs. Once a decision is made regarding how the plugin should be designed, I can adjust accordingly.
I think @joeyberkovitz hits the nail on the head with the "delete an image in case of failure" idea. Generally, plugins are responsible for cleaning up after themselves; in this case that would mean deleting the VM in case of error. To your point regarding debugging, there is a flag on Packer that you can use to specify what to do in case of error. Note that this doesn't need to be part of this PR; it can be opened as a subsequent one later.
I never used the
The only occasion I had a leftover VM was when I killed our GitLab runner, which took down the Packer container with it, so Packer had no chance to clean up. After that I had to stop and delete the VM manually. So I get why @xoxys wants a more aggressive approach. With all the different variables, I guess we won't find an easy solution that fits all use cases, so I'm still in favor of the current, least aggressive approach.
All valid points, and I think
With what @sebastian-de shared, it seems the plugin does clean up the VM. I'm wondering why you have to manually clean up the VMs then, @xoxys, since it seems the plugin should stop and remove the one being built. Do you forcibly kill Packer, or do you let it perform the cleanup when the build fails? If you forcibly kill it, then it makes sense; otherwise, there's probably a bug begging to be addressed in this logic.
To be honest, I can't remember the details and would assume I have done something wrong. Sorry for the confusion.
@joeyberkovitz, thanks for the multiple rerolls.
Regarding the error message check, I do agree this is not super robust: if the message changes on the API side, we'll start failing when attempting to delete a VM that doesn't exist. But I think this will be good enough until the API offers a better option.
Unless someone has an objection, I will merge this tomorrow; in its current state the PR looks good to me.
Adds functionality to delete any existing VMs with the same name as the template_name variable. Should resolve #76
To support this change, go.mod was updated because the build was failing due to an older Google Cloud dependency that had broken. As part of the upgrade, the upstream Proxmox API EFI definition changed from a string to an object, so I updated that config to match the definition in the upstream README.