ARM-14: Managing System Reboots

Summary

This ARM describes the goals and motivation for managing system reboots, and proposes a type and provider for doing so. The proposal is intended to support multiple operating systems, but is focused on a Windows-specific implementation as this is something every Windows sysadmin has to manage.

Goals

  • Provide a DSL for managing reboots.
  • Allow reboot behavior to be associated with any resource.
  • Provide a Windows-specific implementation.
  • Reboot the system cleanly, leaving no lock files behind.
  • Support environments where reboots are only handled via an external orchestration agent, e.g. mcollective.

Non-Goals

  • Puppet should never infer that a reboot may be necessary and reboot the system automatically. System reboots should always be explicitly modeled by the manifest author.

Success Metrics

  • Fully automate the provisioning of Puppet Labs' Windows Jenkins slaves.

Motivation

Users have attempted to work around the reboot problem on Windows using a refreshonly exec resource. In the example below, the shutdown command is only triggered after the NetFx3 (.NET Framework 3) feature is installed via Dism:

dism { 'NetFx3':
  ensure => present
}
exec { 'c:\windows\system32\shutdown.exe /r /t 30':
  subscribe   => Dism['NetFx3'],
  refreshonly => true
}

Using an exec resource makes for a poor user experience. Also, the installation of NetFx3 may fail if a reboot is already pending. In other words, puppet may need to reboot the system before it can apply the NetFx3 resource.

Description

We would like to model two aspects of reboots:

  • Install a package and, if necessary, reboot the system to complete installation.
  • Prior to installing a package, check if the system is in the reboot-pending state, and if so, reboot, and then install the package.

In the first case, puppet is only rebooting the system due to an event generated by another resource. In the second case, puppet is actively managing the reboot-pending state of the system.

Refresh-Based Reboots

The first aspect of reboots described above could be modeled as:

package { 'somepackage':
  ensure => installed,
}
reboot { 'Puppet needs to reboot the system to complete installation':
  subscribe => Package['somepackage'],
}

Puppet would only reboot the system if puppet installs somepackage and the system requires a reboot as a result.

The reboot type has a when property whose default value is refreshed, which means, "only perform a reboot as a result of being refreshed by another resource." As a result, we can omit the when property for refresh-based reboots.

In the example above, the reboot resource subscribes to a package resource, but it could just as easily be a dism, registry_value, etc. resource.

Managing Reboot Pending State

The second aspect of reboots described above could be modeled as:

reboot { 'Puppet needs to reboot the system':
  when   => pending,
  before => Package['somepackage'],
}
package { 'somepackage':
  ensure => installed,
}

Here, the reboot resource is evaluated prior to the package. We've also specified the when property of the reboot resource with value pending, which means, "puppet should check if a reboot is pending, and if so, reboot the system." After the system comes back up, puppet installs somepackage.
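The two values of the when property can be summarized as a small decision function. The following is a minimal Ruby sketch, not the proposed provider: the method name reboot_needed? and its keyword arguments are hypothetical and exist only to illustrate the semantics described above.

```ruby
# Hypothetical helper: decides whether a reboot resource should act.
#   when_value - :refreshed (the default) or :pending, mirroring the `when` property
#   refreshed  - true if another resource sent this resource a refresh event
#   pending    - true if the system reports a reboot-pending state
def reboot_needed?(when_value: :refreshed, refreshed: false, pending: false)
  case when_value
  when :refreshed then refreshed
  when :pending   then pending
  else raise ArgumentError, "unknown when value: #{when_value.inspect}"
  end
end
```

With the default when value, a pending reboot alone does not trigger anything; only a refresh event does. With when set to pending, the refresh event is ignored and only the system state matters.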

All Together Now

It is possible that puppet may need to reboot before and after installing a package, which is a composition of the two examples above:

reboot { 'Puppet needs to reboot the system':
  when   => pending,
  before => Package['somepackage'],
}
package { 'somepackage':
  ensure => installed,
}
reboot { 'Puppet needs to reboot the system to complete installation':
  subscribe => Package['somepackage'],
}

Orchestration

Some users don't want puppet to ever reboot the system. Instead, they want to handle that out-of-band during change windows, etc. This could be accomplished using mcollective:

$ mco rpc puppetral create type=reboot name='mco initiated reboot' when=pending

However, this type of unfiltered query means every puppet agent in the collective would receive the message, and have to act on it.

A more scalable approach is to specify a fact-based filter:

$ mco rpc puppetral create type=reboot name='mco initiated reboot' when=pending -F reboot_pending=true

where reboot_pending is a fact. But this approach is also a bit verbose. It would probably be better to add this logic to the puppet plugin:

$ mco puppet reboot -F reboot_pending=true

Implementation Details

If the provider requests a system reboot, puppet will need to skip the remaining resources, send its report, and gracefully exit. It is important that the report contain references to the skipped resources, otherwise, from a reporting standpoint, it's unclear why those resources were not evaluated -- were they omitted from the catalog? Puppet can accomplish this by marking each remaining resource as noop.
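The skip-and-report behavior described above might look roughly like the following Ruby sketch. It is an illustration under stated assumptions, not puppet's transaction code: the apply_catalog method, the hash-based resources, and the :applied and :noop status symbols are all hypothetical.

```ruby
# Hypothetical sketch of a catalog run that stops at the first reboot request.
# Each resource is a plain hash standing in for a real puppet resource.
# Once a resource requests a reboot, every remaining resource is recorded
# as :noop so that it still appears in the report.
def apply_catalog(resources)
  events = []
  reboot_requested = false
  resources.each do |resource|
    if reboot_requested
      events << { name: resource[:name], status: :noop }
      next
    end
    # Real evaluation of the resource would happen here.
    events << { name: resource[:name], status: :applied }
    reboot_requested = true if resource[:requests_reboot]
  end
  { reboot_requested: reboot_requested, events: events }
end

catalog = [
  { name: 'Package[somepackage]', requests_reboot: false },
  { name: 'Reboot[complete installation]', requests_reboot: true },
  { name: 'File[c:/app/config.ini]', requests_reboot: false }
]
result = apply_catalog(catalog)
result[:events].each { |event| puts "#{event[:name]}: #{event[:status]}" }
```

Because the skipped resources are still listed, a report consumer can tell "not evaluated because of a reboot" apart from "not in the catalog".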

It is also important that puppet resumes when the machine boots up. On Windows, we get this behavior for free by running as an Automatic service. On Unix, we get this behavior by running as an /etc/init.d daemon. If puppet instead runs as a Manual service or via cron, the system won't reach consistency until the next puppet run.

Race Conditions

After puppet applies the catalog, it sends a report containing events about each processed resource in the catalog. The report is processed synchronously on the puppet master, and will take varying amounts of time. As a result, there is no guarantee that puppet will exit cleanly before the reboot occurs.

A more robust solution would allow providers to register commands to be executed after sending the report and prior to exiting, similar in idea to the postrun_command setting, though likely different in implementation.
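One way to picture such a registration mechanism is the Ruby sketch below. It is only an assumption about the shape of the idea: the PostReportHooks class and its register and run_all methods are hypothetical names, and no such API exists in puppet as described here.

```ruby
# Hypothetical registry of commands to run after the report has been sent.
class PostReportHooks
  def initialize
    @hooks = []
  end

  # Providers would call this during the run to defer work until after the report.
  def register(description, &block)
    raise ArgumentError, 'a block is required' unless block
    @hooks << [description, block]
  end

  # Called once, after the report is submitted and just before exiting.
  # Returns the descriptions of the hooks that ran, in registration order.
  def run_all
    @hooks.map do |description, block|
      block.call
      description
    end
  end
end

hooks = PostReportHooks.new
hooks.register('reboot the system') { puts 'would invoke the platform reboot command here' }

puts 'report sent' # stand-in for submitting the report to the puppet master
hooks.run_all
```

Deferring the reboot command until after the report is submitted removes the dependency on how long the master takes to process the report.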

Batching Reboot Requests

We do not plan on supporting batching of reboot requests, e.g. install 5 packages and only reboot at the end. There are some core issues in puppet that make this hard, e.g. see http://projects.puppetlabs.com/issues/2198. Also, it's not clear what should happen if 4 packages install successfully, but 1 fails.

Alternatives and Recommendation

  1. We could create a rebootable property (like ensurable) and add it to the package, dism, registry_key, etc. types, but restrict it to providers that implement the manages_reboot feature. For example, the Windows package provider would implement the manages_reboot feature, and mix in code for performing a reboot.

    The benefit of this approach is that the provider is in the best position to know under what conditions a reboot is necessary. For example, if a package is installed, then a reboot is only necessary if msiexec returns 3010, but not if it returns 0.

    The downside is that it tightly couples reboot behavior with specific types and providers, which could be problematic when using third-party modules. For this reason, I don't recommend this approach.

  2. In order to eliminate the race condition mentioned earlier, we could create a watcher process that waits for puppet to exit and then executes the desired commands. Doing so has its own issues, such as the lack of fork on Windows, which makes it harder to serialize the block of code to execute in the watcher process.
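The exit-code distinction mentioned under the first alternative can be sketched as follows. This is a minimal Ruby illustration covering only the two msiexec codes named above (0 and 3010); the method names msi_outcome and reboot_required? are hypothetical, and any other exit code is deliberately left unclassified.

```ruby
# Hypothetical classification of an msiexec exit code.
# Only the two codes discussed in the text are handled.
def msi_outcome(exit_code)
  case exit_code
  when 0    then :installed                 # success, no reboot needed
  when 3010 then :installed_reboot_required # success, reboot needed to finish
  else           :not_handled               # this sketch does not classify other codes
  end
end

# True only when the install succeeded and a reboot is needed to complete it.
def reboot_required?(exit_code)
  msi_outcome(exit_code) == :installed_reboot_required
end
```

A provider using the rebootable approach would make this decision itself, which is exactly the coupling the recommendation argues against.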

Risks and Assumptions

We're assuming that users actually want puppet to reboot their systems. If this is not the case, then providing a reboot_pending fact would be sufficient. But based on user feedback, I believe some users do want puppet to reboot their systems, mostly for the refresh-based reboot scenario.

If the system does not clear its reboot-pending state after a reboot (for example, if the PendingFileRenameOperations registry value remains set), then the system could get into a reboot loop.

Dependencies

Both puppet and facter would need to be able to detect if a reboot is pending. The detection code should be either shared or copied. Another option is for puppet to force facter to re-resolve the reboot_pending fact each time, e.g. as a volatile fact.
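If the detection code were shared, it could be written so that neither puppet nor facter is required to call it. The Ruby sketch below assumes this shape. The reboot_pending? method and the injected lookup callable are hypothetical, and only the PendingFileRenameOperations value (under the Session Manager key in HKEY_LOCAL_MACHINE) is checked; a real implementation would likely need further checks.

```ruby
# Hypothetical shared detection logic, callable from both a fact and a provider.
# `lookup` is any callable taking (key_path, value_name) and returning the
# stored data or nil; on Windows it would wrap a real registry read.
def reboot_pending?(lookup)
  pending_renames = lookup.call(
    'SYSTEM\CurrentControlSet\Control\Session Manager',
    'PendingFileRenameOperations'
  )
  !(pending_renames.nil? || pending_renames.empty?)
end

# Stand-in registry for demonstration, so the sketch runs on any platform.
fake_registry = {
  ['SYSTEM\CurrentControlSet\Control\Session Manager', 'PendingFileRenameOperations'] =>
    ['\??\C:\Windows\Temp\old.dll', '']
}
lookup = ->(key, value) { fake_registry[[key, value]] }
puts reboot_pending?(lookup)
```

Injecting the lookup keeps the decision logic in one place while letting the fact and the provider each supply their own way of reading the registry.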

Impact

This feature will enable users to manage a critical piece of configuration management on Windows.