This repository has been archived by the owner on Mar 8, 2023. It is now read-only.

v2.0 - installations hang during "Setup" #119

Closed
chris-david-taylor opened this issue Jun 1, 2018 · 50 comments · Fixed by #163

Comments

@chris-david-taylor

chris-david-taylor commented Jun 1, 2018

The v2.0 plugin seems to have a bug with installing Windows (I haven't tried other OSes yet). My present lab runs on vSphere 6.5.

Steps to reproduce:

  1. Take a working configuration, with the 2.0-beta4 plugins installed.
  2. Update the plugin from v2-beta4 to v2.0
  3. Run Packer - Installation hangs at "Setup"

Confirmed on Windows 2012_r2, and Windows 7.

I'll try to get some logging out of our environment tomorrow; my permissions are too locked down for me to look right now. Part of me thinks this may be related to #112.
I have also tried updating Packer to 1.2.3.

@chris-david-taylor chris-david-taylor changed the title v2.0 - installations hang during installation v2.0 - installations hang during "Setup" Jun 1, 2018
@chris-david-taylor
Author

chris-david-taylor commented Jun 2, 2018

Further debugging: I just spotted this for Windows 7 this morning: "Windows cannot apply the DiskConfiguration in Autounattend.xml".

  • Ignore that; it was just an experiment to check whether I was doing something fundamentally wrong. All MS OSes are hanging at "Setup is Starting".

@embusalacchi

embusalacchi commented Jun 7, 2018

Well, I spent the entire day on this, and I probably should have come here first.

I am seeing the same situation with vSphere 6.5 w/DRS and the release version of packer-builder-vsphere-iso.exe. At first I thought it was because the release version specifies the disk size in MB rather than GB, so I had gone from an 80GB partition to an 80MB partition. But after I realized that wasn't what was going on, I spent pretty much all day trying to figure out what I had done wrong. The part that threw me off is that the vSphere GUI is almost totally unresponsive when working with the VM that Packer has created. Shutting down the VM usually fails or errors out a few times. Getting to the console will hang or not connect at all. RAM usage generally tries to consume all of the RAM available to the VM as well. CPU doesn't spike. I/O doesn't spike. Nothing.

What I found interesting, though, is that if I left the process running and did a "reset" on the VM through VMRC, the VM booted normally, sped through the setup, and completed quickly. So there's some interaction between the release version of the plugin and vSphere during VM creation that wasn't there in prior versions. Initially I thought it was because I was running Packer from a different server than before. Then I realized that on the new server (running Jenkins) I had downloaded a newer version of the plugin. As soon as I switched to the pre-release version and fixed the disk size (as it was now trying to create an 80,000GB drive), it worked as expected. The version of the plugin that works for me (I don't know the version number) says it's from 4/12/18. If there is additional logging or anything else I can do to help you troubleshoot this, please let me know. It is very easy to reproduce.

@embusalacchi

embusalacchi commented Jun 7, 2018

Looks like #104, #112, and #119 might all be the same issue.

@sudomateo

@embusalacchi looks like it. We'll need some insight into what might have changed between the 2.0-beta4 release and the 2.0 release. I've been trying to go build by build from the public TeamCity server located here, but I don't really have the time to do so and keep these hung VMs in my inventory. I also don't know which build corresponds to the 2.0-beta4 release, so that I could work up from there.

@chris-david-taylor
Author

The last commit for 2.0-beta4 was on the 15th of March, adding cluster support.

@chris-david-taylor
Author

I did a build from the 25th of April, after commit #82, and the issue isn't present there. Hope that's of some help, @sudomateo?

@sudomateo

@chris-david-taylor thank you, sir. I'll check that build out.

@kempy007

Are you using boot_cmd in Packer? My VMs lock up after this.

@chris-david-taylor
Author

chris-david-taylor commented Jun 11, 2018

I'm not, @kempy007, and it's not the boot command that is the issue. @embusalacchi suggested these might all be related, possibly to floppy_media. First of all, does 2.0-beta4 work for you, and what OS are you templating?

I'm not an expert in Go and am currently working 12-hour days, otherwise I'd learn a bit and investigate. As long as you aren't desperate for WinRM, though, building from #82 should be OK. Are you comfortable doing that?

@embusalacchi

I don't mind trying other builds, but I don't have the means to build them on my own. Is there a link somewhere? I can try them as I have time, to nail down when it went bad.

@kempy007

RHEL 6; Packer is now 1.2.4.
The issue only occurs with boot_cmd: after it is invoked, the VM never uses more than 30MHz and thus appears hung. Restart and shutdown from VMRC may fail; Ctrl+Alt+Del inside the VM does reboot it.

I think it may be related, which is why I wanted to know whether you have boot_cmd in your Packer file.

@embusalacchi

I am not using the boot_cmd in my packer file.

@sudomateo

sudomateo commented Jun 11, 2018

@embusalacchi Builds can be found here: https://teamcity.jetbrains.com/viewType.html?buildTypeId=PackerVSphere_Build&branch_PackerVSphere=%3Cdefault%3E&tab=buildTypeStatusDiv

Just log in as guest and download the build you want.

Also I am using boot_command in my packer file.

@chris-david-taylor
Author

chris-david-taylor commented Jun 11, 2018

I'm not using the boot_cmd parameter, @kempy007.

Thanks @sudomateo :)

@schmandforke

Found these lines in the ESXi logs:

2018-06-12T09:57:39.190Z host001.local Hostd: warning hostd[F3C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplate009.vmx opID=32cd7e53-01-1b-5afc user=vpxuser:foobar] CannotRetrieveCorefiles: VM is in an invalid state
2018-06-12T09:57:39.225Z host001.local Hostd: warning hostd[F3C2B70] [Originator@6876 sub=Vmsvc.vm:/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplate009.vmx opID=32cd7e53-01-1b-5afc user=vpxuser:foobar] File - failed to get objectId, '/vmfs/volumes/5a376560-523eed04-1d11-f84897828640/AutoBuildTemplate009_3/AutoBuildTemplatte009.vmx': One of the parameters supplied is invalid.

There seem to be invalid parameters in the .vmx file:

sched.mem.pin = "TRUE"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "4096"
sched.mem.minSize = "4096"
sched.mem.shares = "normal"

=> https://kb.vmware.com/s/article/2085907

Maybe this is the issue; if I remove the lines above, the VM does not hang :)

@embusalacchi

Here's the generated .vmx from the not-working version of the plugin (vsphere-iso):

.encoding = "UTF-8"
config.version = "8"
virtualHW.version = "11"
nvram = "windows2016-full-L1.ptest1.nvram"
pciBridge0.present = "TRUE"
svga.present = "TRUE"
pciBridge4.present = "TRUE"
pciBridge4.virtualDev = "pcieRootPort"
pciBridge4.functions = "8"
pciBridge5.present = "TRUE"
pciBridge5.virtualDev = "pcieRootPort"
pciBridge5.functions = "8"
pciBridge6.present = "TRUE"
pciBridge6.virtualDev = "pcieRootPort"
pciBridge6.functions = "8"
pciBridge7.present = "TRUE"
pciBridge7.virtualDev = "pcieRootPort"
pciBridge7.functions = "8"
vmci0.present = "TRUE"
hpet0.present = "TRUE"
numvcpus = "2"
memSize = "16384"
sched.cpu.units = "mhz"
scsi0.virtualDev = "pvscsi"
scsi0.present = "TRUE"
scsi0:0.deviceType = "scsi-hardDisk"
scsi0:0.fileName = "windows2016-full-L1.ptest1.vmdk"
scsi0:0.present = "TRUE"
ethernet0.virtualDev = "vmxnet3"
ethernet0.dvs.switchId = "46 65 0b 50 2c 39 e7 3b-ef da 69 ea 94 a4 ec e0"
ethernet0.dvs.portId = "168"
ethernet0.dvs.portgroupId = "dvportgroup-36"
ethernet0.dvs.connectionId = "1675707751"
ethernet0.addressType = "vpx"
ethernet0.generatedAddress = "00:50:56:8b:4c:07"
ethernet0.uptCompatibility = "TRUE"
ethernet0.present = "TRUE"
displayName = "windows2016-full-L1.ptest1"
guestOS = "windows9srv-64"
uuid.bios = "42 0b 30 ac 5e 15 7d d5-61 ba be b0 2d a3 d4 51"
vc.uuid = "50 0b b5 4e a6 6e d7 77-ab 89 74 59 d3 f2 b6 61"
sata0.present = "TRUE"
sata0:0.deviceType = "cdrom-image"
sata0:0.fileName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/ISOs/SW_DVD9_Win_Svr_STD_Core_and_DataCtr_Core_2016_64Bit_English_-2_MLF_X21-22843.ISO"
sata0:0.present = "TRUE"
sata0:1.deviceType = "cdrom-image"
sata0:1.fileName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/ISOs/VMware-tools-windows-10.0.9-3917699.iso"
sata0:1.present = "TRUE"
floppy0.fileType = "file"
floppy0.fileName = "packer-tmp-created-floppy.flp"
bios.hddOrder = "scsi0:0"
bios.bootOrder = "hdd,cdrom,cdrom"
sched.cpu.min = "0"
sched.cpu.shares = "normal"
sched.mem.min = "0"
sched.mem.minSize = "0"
sched.mem.shares = "normal"
floppy0.clientDevice = "FALSE"
virtualHW.productCompatibility = "hosted"
sched.swap.derivedName = "/vmfs/volumes/419f8a8d-852e81cc-0000-000000000000/windows2016-full-L1.ptest1/windows2016-full-L1.ptest1-49feaf28.vswp"
uuid.location = "56 4d 28 31 3d 59 be 45-05 42 a4 45 e6 44 d4 bb"
replay.supported = "FALSE"
replay.filename = ""
migrate.hostlog = "./windows2016-full-L1.ptest1-49feaf28.hlog"
scsi0:0.redo = ""
pciBridge0.pciSlotNumber = "17"
pciBridge4.pciSlotNumber = "21"
pciBridge5.pciSlotNumber = "22"
pciBridge6.pciSlotNumber = "23"
pciBridge7.pciSlotNumber = "24"
scsi0.pciSlotNumber = "160"
ethernet0.pciSlotNumber = "192"
vmci0.pciSlotNumber = "32"
sata0.pciSlotNumber = "33"
scsi0.sasWWID = "50 05 05 6c 5e 15 7d d0"
vmci0.id = "765711441"
vm.genid = "7927475490584792074"
vm.genidX = "3729250489358533764"
monitor.phys_bits_used = "42"
vmotion.checkpointFBSize = "4194304"
vmotion.checkpointSVGAPrimarySize = "4194304"
cleanShutdown = "FALSE"
softPowerOff = "FALSE"

@embusalacchi

(screenshot attached: prd2ts01_-_2560x1440)

@embusalacchi

(attachment: vmware.log)

@embusalacchi

embusalacchi commented Jun 12, 2018

What's strange is that if you do a RESET on the VM, it boots perfectly and goes through the install. If you compare the .vmx when "broken" and the .vmx after the reset, they are identical. So it's not entirely clear where the issue is, unless it boots with a bad value but vSphere writes out a good value that it uses when it boots after the reset?

@chris-david-taylor
Author

chris-david-taylor commented Jun 13, 2018

Hi @schmandforke, can you possibly grab the .vmx file generated by 2.0-beta4, do a "diff", and post it here, please? I think there could be a workaround: if we know what the specific invalid parameters are, we might be able to set them in the vmx config of the Packer file.

@embusalacchi

@chris-david-taylor here you go; this is from beta4:
windows2016-fbeta4.vmx.txt

@embusalacchi

@chris-david-taylor I was looking through the Go code (I don't really know Go) to see if I could figure out what's being set wrong. If you look at my previous comments, the VM works after a reset (without changing anything else). The .vmx from the slow version and from the fast version after the reset seem identical, so it's almost like it starts up with a bad param but vSphere fixes it? Not sure; I might just be missing something.

@chris-david-taylor
Author

Hi @embusalacchi,
I've looked at those logs, and the difference is that the plugin now seems to set these parameters, whereas before it didn't.
I'd say we need to add something like this to our Packer files, but we'll have to experiment to find the correct values. Maybe you can grab those from the console in vSphere? I'm not back at work until tomorrow to test, though:

"vmx_data": {
"sched.mem.pin": "TRUE",
"sched.cpu.min": "0",
"sched.cpu.shares": "normal",
"sched.mem.min": "4096",
"sched.mem.minSize": "4096",
"sched.mem.shares": "normal",
"sched.cpu.units": "mhz"
}

@xenithorb

Yeah, I think you're onto something here. When you look at the settings in vCenter, "CPU Limit" is set to "0 MHz" instead of what it would normally be, which is "Unlimited".

@xenithorb

Following that suspicion, I think I now have a viable workaround:

        "CPU_limit": -1,

In your .json seems to do the trick.
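For anyone who wants to see it in context, here is a minimal sketch of a Packer template with the workaround applied. All values other than "type" and "CPU_limit" are placeholders for illustration only; substitute your own environment's settings:

```json
{
  "builders": [
    {
      "type": "vsphere-iso",
      "vcenter_server": "vcenter.example.local",
      "username": "packer@vsphere.local",
      "password": "REDACTED",
      "vm_name": "example-template",
      "CPUs": 2,
      "CPU_limit": -1,
      "RAM": 4096
    }
  ]
}
```

A value of -1 appears to correspond to "Unlimited" in the vCenter UI, which avoids the 0 MHz limit described above.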

@schmandforke

Confirmed that

 "CPU_limit": -1

worked for me!

@chris-david-taylor
Author

Yay! Let's leave this open, as it will hopefully help with debugging.

@sudomateo

Can confirm "CPU_limit": -1, works for me too. Running ESXi 6.5 using the latest 2.0 plugin release. I'm building CentOS 7 machines.

@bijujo

bijujo commented Jun 14, 2018

"CPU_limit": -1 worked for me too in ESXi 6.5. Thanks.

@kempy007

"CPU_limit": -1, worked in problem environment for me too.
Seems to be in older version than 6.5 update2 of esxi image.
Confirmed issue is present in version 'ESXi 6.5 U1 VMSA-2018-0004.3*'

Can someone update readme.md to add the above workaround as strongly recommended to avoid this issue?

@xenithorb

Can someone update readme.md to add the above workaround as strongly recommended to avoid this issue?

I'd argue instead that this needs to be fixed so that "Unlimited" is the default... or perhaps there's a way to consume an object from the API that actually reveals the cluster defaults?

I'd really hope that this doesn't just become some obligatory setting.

@sudomateo

@xenithorb I agree. This needs to be addressed in the code, whether upstream or in this plugin.

@chris-david-taylor
Author

chris-david-taylor commented Jun 20, 2018

@sudomateo - I’ll file a bug upstream with VMware at some point today. :)

@calebherbison

Works for me: ESXi 6.0, 2.0 vsphere-iso plugin, CentOS 7 Minimal.

@jcoconnor
Contributor

FWIW, I'm seeing behaviour like this regularly, especially on Win-10 machines when I apply the cumulative updates. Resetting through VMRC helps, but the machine goes back to hanging again. Researching it with our infrastructure folks to see if there are any issues with our vCenter.

@jcoconnor
Contributor

Confirming that adding "CPU_limit": -1 improves things a lot.
I also set:

"svga.vramSize"    : "134217728",
"svga.autodetect"  : "FALSE",
"svga.maxWidth"    : "1680",
"svga.maxHeight"   : "1050"

jcoconnor pushed a commit to jcoconnor/puppetlabs-packer that referenced this issue Jul 27, 2018
The vcenter builds hang either shortly after boot, or in windows-10
during the cumulative update. This appears to be the behaviour noted in
jetbrains-infra/packer-builder-vsphere#119 so
applying the recommended fix there.

Also added svga parameters to help correct the console connect issues
associated with the above problem.
@sparky005

sparky005 commented Sep 6, 2018

The CPU_limit fix worked for me as well. Can this at least be set as the default? It would probably save a lot of people a lot of time.

@chris-david-taylor
Author

The fix belongs in VMware’s upstream libraries. I’ve submitted a bug which I should check up on, as I’m starting to write my own code that depends on the upstream.

@sparky005

Got it, thanks @chris-david-taylor. Is there a link to the upstream bug? I'd like to follow it if possible (maybe other people on this thread would as well).

@fredex42

fredex42 commented Sep 7, 2018

Just wanted to say thanks for this; the CPU_limit fix worked for me too, after a frustrating afternoon of VMware builds just locking up for no apparent reason.

@thor

thor commented Sep 24, 2018

Just another nudge to @chris-david-taylor about linking to the upstream bug, so that we can follow it to the extent possible. I couldn't find the issue in govmomi, but I could easily have been searching for the wrong thing.

@chris-david-taylor
Author

Sorry, I've been away, @thor. Darn it, is this still a problem? I'll dig it out later today, and if I can't find it, I'll refile.

@thor

thor commented Oct 1, 2018

@chris-david-taylor I can do a quick check with a build from the latest govmomi sources, if that's what you had in mind? :)

@chris-david-taylor
Author

If you could, please, @thor, that would be great. If the issue persists, I'll pass it on to the govmomi maintainers. :)

@riponbanik

Thanks, guys. CPU limit is the issue.

@mkuzmin
Contributor

mkuzmin commented Oct 16, 2018

I'm sorry this took so much time.
Here is a new release: https://github.com/jetbrains-infra/packer-builder-vsphere/releases/tag/v2.0.1
