This repository has been archived by the owner on Dec 12, 2023. It is now read-only.

first-boot infinite loop if exit code non-zero #202

Closed
jgreen210 opened this issue Jan 30, 2018 · 7 comments

Comments

@jgreen210

Description of Issue/Question

This part:

fed0b94#diff-60e0e6a591efb16cc6f2c5fc2391f857

...of this commit:

fed0b94

...means that a non-zero exit code from a first-boot script now
causes the script to be re-run. This is a problem for a few reasons:

  • It means that first-boot scripts now need to be idempotent. That's
    not a bad goal, but it's generally hard to achieve, mainly because
    it's hard to test. It's also generally not cheap to implement -
    e.g. if you can't atomically perform some operation, you might
    need to fingerprint some huge files to check that an earlier
    attempt at running a first-boot script succeeded.

  • If you don't manage to make every single step idempotent, a
    first-boot script can now fail on the first try and then "succeed"
    the next time. This means that you could be left with a broken
    installation without realizing it. I.e. first-boot scripts can no
    longer assume that they are working in a pristine environment.

  • There doesn't seem to be a way to break out of the loop other than
    doing a hard reset, so it's hard to debug problems.

Since people will be relying on the new behaviour too, it looks like a
configurable per-first-boot-file retry limit is required here. If it
defaults to some small number, then people that rely on these retries
should remain happy, but first boot scripts will eventually stop
running if they are broken.
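
To make the proposal concrete, here is a minimal sketch of a bounded retry loop. It is not Imagr's actual first-boot runner: the function name `run_with_retry_limit` and the `max_attempts` parameter are hypothetical, and the default of 3 is only a placeholder for "some small number".

```python
import subprocess


def run_with_retry_limit(command, max_attempts=3):
    """Run command until it exits 0 or max_attempts runs have been made.

    Hypothetical helper, not part of Imagr.
    Returns (last_exit_code, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    exit_code = None
    attempt = 0
    for attempt in range(1, max_attempts + 1):
        # subprocess.call returns the child's exit code.
        exit_code = subprocess.call(command)
        if exit_code == 0:
            break  # success, no further retries needed
    return exit_code, attempt
```

With `max_attempts=1` a script is run exactly once, whatever its exit code, which is the behaviour I would configure for non-idempotent scripts.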

I could then configure imagr to only allow one attempt at running each
first-boot script, avoiding the need to write idempotent scripts.

A UI for cancelling the retries early would be nice.

(I don't like filing bugs and then not offering to fix them, but I'm
busy and it looks like I can wrap my one first-boot script with code
that saves the real exit code somehow but then returns with a zero
exit code.)
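
The wrapper workaround could look roughly like this. It is only a sketch: the function name `run_and_hide_exit_code` and the idea of writing the status to a file are my assumptions about how to "save the real exit code somehow", and any paths passed to it are up to you.

```python
import subprocess


def run_and_hide_exit_code(command, status_path):
    """Run command, record its real exit code in status_path, return 0.

    Hypothetical wrapper helper: always returning 0 means the caller
    never sees a failure, so the script is not re-run.
    """
    real_exit_code = subprocess.call(command)
    with open(status_path, "w") as status_file:
        status_file.write("%d\n" % real_exit_code)
    return 0
```

The wrapper script would end with something like `sys.exit(run_and_hide_exit_code(["/bin/sh", real_script], status_file))`, where `real_script` and `status_file` are placeholders. The obvious cost is that a genuine failure is now silent unless something checks the status file later.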

Setup

This is slightly edited, and could presumably be simplified to a
config containing just a first-boot script with a non-zero exit code.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>

  <key>autorun</key>
  <string>ImagrWorkflow</string>

  <key>default_workflow</key>
  <string>ImagrWorkflow</string>

  <key>workflows</key>
  <array>
    <dict>

      <key>name</key><string>ImagrWorkflow</string>

      <key>description</key><string>Deploy base OS image using imagr.</string>

      <key>restart_action</key><string>restart</string>

      <key>components</key>
      <array>

        <dict>
          <key>type</key><string>image</string>
          <key>url</key><string>http://10.90.90.1/imagr/dmg/osx_updated_180125-10.13.3-17D47.apfs.dmg</string>
        </dict>

        <dict>
          <key>type</key><string>package</string>
          <key>url</key><string>http://10.90.90.1/imagr/pkg/create_setup-1.0.pkg</string>
          <key>first_boot</key><false/>
        </dict>

        <!-- hide first boot dialogues etc. -->
        <dict>
          <key>type</key><string>script</string>
          <key>url</key><string>http://10.90.90.1/imagr/pre_boot</string>
          <key>first_boot</key><false/>
        </dict>
        
        <dict>
          <key>type</key><string>computer_name</string>
          <key>use_serial</key><true/>
          <key>auto</key><true/>
        </dict>

        <dict>
          <key>type</key><string>localize</string>
          <key>keyboard_layout_name</key><string>British</string>
          <key>keyboard_layout_id</key><integer>2</integer>
          <key>language</key><string>en</string>
          <key>locale</key><string>en_GB</string>
          <key>timezone</key><string>Europe/London</string>
        </dict>

        <dict>
          <key>type</key><string>package</string>
          <key>url</key><string>http://10.90.90.1/imagr/pkg/puppet-agent-1.8.1-1-installer.pkg</string>
          <key>first_boot</key><true/>
        </dict>

        <!-- broken, so has non-zero exit code -->
        <dict>
          <key>type</key><string>script</string>
          <key>url</key><string>http://10.90.90.1/imagr/first_boot</string>
          <key>first_boot</key><true/>
        </dict>

      </array>
    </dict>
  </array>
</dict>
</plist>

Steps to Reproduce Issue

Non-zero exit code from a first-boot script.

Versions Report

@grahamgilbert
Collaborator

As far as I’m concerned this works as intended. It was a deliberate move on my part to require scripts to exit 0 to indicate success. Which “new behaviour” would people be relying on? Imagr has behaved this way for as long as I can remember.

@jgreen210
Author

It's "new" for me :-) - I only just rebuilt my netrestore.nbi since the old one worked for El Capitan and Sierra. The commit I mentioned is dated 16 Aug 2016. Whatever version of imagr that I was using before didn't go into an infinite loop.

@jgreen210
Author

Checking the exit code is a good thing to do.

People who have used imagr first-boot scripts since this change was made will generally be relying on the retries to cover up sporadic failures, so removing the retries now isn't the right answer. A retry limit, a cancel button and presumably a dialogue if the retry limit is exceeded would be a nicer way to handle a permanently non-zero exit code.

@jgreen210
Author

A dialogue on exceeding the retry limit would require local intervention on failure, so it should be optional, have a configurable timeout, or both.

@jgreen210
Author

What happens if there are multiple first-boot scripts? I only have one, so I don't know.

What should happen if there are multiple first-boot scripts? Should too many non-zero exit codes prevent later first-boot scripts from running? I like things to fail fast, so I'd want that. It's more conservative, so I think that should probably be the default. Others might want things to work as best as possible despite failures, but would still want to know if any steps failed.
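
The two policies could be sketched as follows. This is an illustration only, not how Imagr currently handles multiple first-boot items (which I don't know); `run_first_boot_items`, `failed_items` and the `fail_fast` flag are hypothetical names.

```python
import subprocess


def run_first_boot_items(commands, fail_fast=True):
    """Run each command once, in order.

    With fail_fast=True, stop at the first non-zero exit code so that
    later items never run against a half-configured machine.
    With fail_fast=False, run everything and leave it to the caller
    to report failures.
    Returns a list of (command, exit_code) for the items that ran.
    """
    results = []
    for command in commands:
        exit_code = subprocess.call(command)
        results.append((command, exit_code))
        if exit_code != 0 and fail_fast:
            break
    return results


def failed_items(results):
    """Return the commands that exited non-zero."""
    return [command for command, exit_code in results if exit_code != 0]
```

Either way the caller gets the full list of results, so the "best effort" mode can still tell the user which steps failed.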

@jgreen210
Author

My first-boot script (with its non-zero exit code) is getting rerun in a loop even after I restart the Mac, preventing a local login. So, to debug a non-zero exit code I need to do one of the following:

  • scroll back to the first failure (which is generally different from the subsequent ones, since my first-boot script isn't idempotent) and take a photo of the screen before the script re-runs.
  • reset the machine (hold down the power button), then boot into recovery and find first-boot.log under /Volumes/
  • disable the first-boot script and run it manually (which would require a local login, so isn't doing precisely the same thing).
  • wrap the first-boot script to hide the exit code.
  • change imagr.

@grahamgilbert
Collaborator

If you wish to do a PR to do this I would look at it, but I will not be working on this as I consider this a feature, not a bug. I personally think your scripts should be able to recover from a failure.
