Feature Request -- Safe Mode (Auto Rollback Changes if connection lost or not re-established within X time) #2976

Closed
2 tasks done
tcsi-github opened this issue May 6, 2022 · 54 comments
Labels
feature Adding new functionality help wanted Contributor missing

Comments

@tcsi-github

tcsi-github commented May 6, 2022

Important notices

Before you add a new report, we ask you kindly to acknowledge the following:

I realize I am duplicating a post, however, I assumed when I created a reply it would re-open the closed issue. That appears not to be the case and I simply hoped that opening a new issue would generate more feedback than a closed one. Should the original issue be able to be re-opened I am happy to close this.

Is your feature request related to a problem? Please describe.

I have multiple sites where I am having to make several changes that could break access (network address changes, firewall changes, routing, etc.) ........ I'm scared. HAHA!

The remote sites have 0.75% technical experience. Yes, I might be able to find someone to power cycle a router. However, I've also worked as an MSP and had the resident "IT person" unplug the network cable to "reboot" which they thought worked because "the lights went off and came back on" (Link Lights) #facepalm

Needless to say, took me a bit to figure out that's what they were doing.

Describe alternatives you considered

To my knowledge OPNSense doesn't have any reasonable alternative to this particular problem. If I am wrong, my apologies for wasting time and please help direct me in a correct path.

Additional context

Originally posted by @tcsi-github in opnsense/core#3042 (comment)

I also created a forum post regarding this but no reply as of yet.
https://forum.opnsense.org/index.php?topic=28238.0

@banym had a good template as well and I include his here for comparison. Personally, I feel like the simpler thing would be to take a snapshot as soon as the breaking change option is enabled. I can see the use in giving the option for a timer, however, I would think setting a "middle-road" 120 seconds would be a default.

Describe the solution you'd like
It would be nice to lock the firewall in a "major change" mode where only one session is able to make changes until the major change mode is exited. This mode should be able to define a working configuration from the backup config history or the current configuration when this specific mode is activated. It should be possible to set a timer for change commitment. The administrator can then, for example, make significant changes to routing or rules that could possibly lock him out of the firewall. If he does not confirm that his change was successful and works as intended, the firewall rolls back to the defined configuration. This way the administrator can log in and "try" again or rethink his change.

Describe the solution you like

I was thinking about my MikroTik days and remember that if I happened to make an incorrect change, those changes would be reverted if I failed to connect back or didn't apply them within X time. Sometimes it was a pain, but it saved my butt so many times. It kept me from having to make calls to talk someone through a reboot and just gave me a bit of room to breathe.

I would submit something like the below as a rough draft for an option.

The ability might stay disabled by default and only enabled prior to the change. This would allow the option to stay out of the way and only be used when explicitly needed.

@fichtner -- I believe this to be a worthwhile addition to the OS and a step in the direction of more enterprise use-cases. My development skills are limited, however, I would commit time to testing and anything within my abilities to help this become an actual feature.

I would welcome any thoughts or feedback on the suggestion.

Example Change ---> Admin has to change a WAN IP

[Screenshot: breaking_change_enabled]

Scenario - Successful Change

  1. Admin logs in and enables "Breaking Changes Mode" (Or whatever better name)
    1. The working config is immediately snapshot.
    2. Notification alert is placed at the top bar
    3. No logins are allowed other than this user's
    4. Working config is marked as the config to be used for restoration
    5. Perhaps the connection is marked for monitor?
    6. Timer Starts (120 second?)
    7. Notification Alert shows at top (See above)
  2. Admin makes changes just as they normally would
    • Complete with "Apply Changes", etc.
  3. Tests changes and is happy with results
    • Connection either not broken or reconnected within timer
  4. Admin disables "Breaking Changes Mode"
    • Previous snapshot config is then removed to prevent use on reboot.
    • System returns to normal operation
    • Notification removed from top bar
Now let's try a screw-up....

Scenario - Failed Change

  1. Admin logs in and enables "Breaking Changes Mode" as before
    1. Same as above happens to enable feature
  2. Admin makes changes just as they normally would
    • Complete with "Apply Changes", etc.
  3. Finds they have made a mistake and are unable to connect back to the GUI
    • Connection timer expires, or some other defined trigger
      • "Breaking Changes Mode" begins
        1. Previous Snapshot is automatically restored
        2. Router is Rebooted (if needed)
        3. Other actions such as logging the issue, etc...
  4. Admin waits for the restore to complete
    - Perhaps an email or other notification could be sent notifying all is well again.
  5. Admin is able to start again, thankful they have not created a bigger problem requiring an on-site visit (For me, an EXPENSIVE multiple day trip)
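To make the two scenarios above a bit more concrete, here is a minimal sketch of the logic as a small state machine. It is only a model: `BreakingChangesMode`, `Outcome` and the dict-based config are hypothetical names invented for illustration, not anything that exists in OPNsense, and the clock is passed in by hand so the behaviour is easy to follow.

```python
import copy
import enum


class Outcome(enum.Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    ROLLED_BACK = "rolled_back"


class BreakingChangesMode:
    """Hypothetical model of the proposed mode; not an OPNsense API."""

    def __init__(self, config, timeout_s=120.0, now=0.0):
        self.config = config
        self.snapshot = copy.deepcopy(config)  # step 1.1: snapshot the working config
        self.deadline = now + timeout_s        # step 1.6: timer starts
        self.outcome = Outcome.PENDING

    def tick(self, now):
        """Called periodically; restores the snapshot once the deadline passes."""
        if self.outcome is Outcome.PENDING and now >= self.deadline:
            self.config.clear()
            self.config.update(copy.deepcopy(self.snapshot))
            self.outcome = Outcome.ROLLED_BACK
        return self.outcome

    def confirm(self, now):
        """Admin disables the mode; only counts if the timer has not expired."""
        self.tick(now)
        if self.outcome is Outcome.PENDING:
            self.snapshot = None  # drop the snapshot so it is not reused on reboot
            self.outcome = Outcome.CONFIRMED
        return self.outcome


if __name__ == "__main__":
    # Failed change: the admin never gets back in, so the timer expires.
    cfg = {"wan_ip": "203.0.113.10"}
    mode = BreakingChangesMode(cfg, timeout_s=120, now=0)
    cfg["wan_ip"] = "198.51.100.7"
    print(mode.tick(now=121), cfg)
```

In the successful scenario the admin would call `confirm()` before the deadline, after which later `tick()` calls do nothing. A real implementation would still have to deal with applying the restored config to running services, which this model leaves out on purpose.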
@OPNsense-bot

Thank you for creating an issue.
Since the ticket doesn't seem to be using one of our templates, we're marking this issue as low priority until further notice.

For more information about the policies for this repository,
please read https://github.com/opnsense/core/blob/master/CONTRIBUTING.md for further details.

The easiest option to gain traction is to close this ticket and open a new one using one of our templates.

@OPNsense-bot OPNsense-bot added the incomplete Issue template missing info label May 6, 2022
@tcsi-github tcsi-github changed the title Feature Request -- Safe Mode Feature Request -- Safe Mode (Auto Rollback Changes if connection lost or not re-established within X time) May 6, 2022
@aboutte

aboutte commented May 6, 2022

I would love this in opnsense! I loved using commit confirmed 10 in Juniper.

@SCUR0

SCUR0 commented May 6, 2022

Same. I use reload in 5 often in switch or router configurations that are remote. Saves your butt when you misconfigured an ACL.

@mimugmail
Member

That's why auto sync is disabled when using a HA setup ;)

@tcsi-github
Author

@mimugmail , I can see that, however forgive me, are you saying that HA is the only option where this would be considered?

Apologies, sometimes I'm a bit dense. :-)

@mimugmail
Member

Honestly I have no idea how to implement such a huge change in an easy way, the HA setup is already here, stable, easy to understand :)

@tcsi-github
Author

Forgive me if I come off wrong, that is not the intent. I'm not entirely sure how to answer this comment correctly.

While I agree that HA does have its place, I feel that, one, this is a very unnecessary overhead you are suggesting, and two, it isn't exactly solving the same problem.

HA, to my understanding, requires at least duplicating hardware to achieve sync (I also thought it requires multiple public IPs but I'm not as sure over that). Apologies, but I'm not going to spend another $600+ for identical hardware and additional power consumption just for a misconfiguration. I would be time out of pocket but I could justify a couple of "mistake trips" and it be cheaper as opposed to the cost of duplicate hardware.

HA solves for availability and redundancy, which granted, is covered by the solution. However, I am wanting a solution for recoverability only. If the hardware dies, fine.... in a HA setup I'd still have to drive out and replace the bad hardware, I would just still be running.

Also, I would like to understand the complexity of the request. I don't quite see how this is a huge change when the underpinnings are already there. We have the ability to backup / restore configs; I am simply suggesting a means of enabling a system that does it automatically. It could be expanded and made much more, but at the heart, that's what I think most everyone would agree they wanted: a save point and an auto restore if there was an issue.

I do realize there is probably more to consider so I would welcome someone helping me understand more. :-)

I hope that came off in the open spirit of debate rather than rude.

@AdSchellevis
Member

There's just no generic concept of "configure and commit all pending changes" in a reliable way, which always will make such a feature incomplete and disappointing when really needed. You can in theory offer functionality like this on a per component level, which is also what we did for the firewall api (https://docs.opnsense.org/development/api/plugins/firewall.html#concept).

It's rather simple: if one could determine the conditions reliably (which I don't think one can, knowing quite a few different support scenarios), one could also build a plugin for it and open a PR for discussion.

A reliable failsafe, which would cover more scenarios, would probably be to offer snapshots with zfs and go back in time when the user asks for it during boot, as this would also cover kernel/driver issues or software changes people forgot to act upon. Maybe that's something to look into for a future business release, you never know.

@tcsi-github
Author

tcsi-github commented May 7, 2022

I can understand your explanation of "Configure and Commit all pending changes". I also believe you are right that ZFS snapshots would make for a more robust solution. However, what if a simpler approach was considered?

As I said before, we have the ability to backup "config.xml" as well as restore it again. We even have the "opnsense-importer" to restore the config file on boot I believe.

My understanding to this point is ..... If I have router 1 that dies, I can replace it with identical router 2 and restore with the config.xml from router 1 and, again assuming they are identical hardware, be right back ready to go. If this is correct, I believe we could reasonably be able to assume that the hardware isn't changing since the option would be more for a configuration error rather than a hardware change (which I believe even a ZFS snapshot would have issues with as well?)

Working from that, let us consider Juniper's "commit confirmed X" command.

Per their site.

To confirm a commit, enter either a commit or commit check command.

If the commit is not confirmed within the time limit, the configuration rolls back automatically to the pre-commit configuration and a broadcast message is sent to all logged-in users. To show when a rollback is scheduled, enter the show system commit command. The allowed range is 1 through 65,535 minutes, and the default is 10 minutes.

This could look like this in OPNSense

  1. Admin enables "Rollback Option" with a timer of "5 minutes"
    • A backup "config.xml" is taken
    • A "Rollback Option Enabled - XX:XX Remaining" notification would appear in the top bar similar to my first post
  2. Admin does whatever is needed in the time frame
    • Commits are not tracked in any other way than normal
  3. Admin DOES stop timer within the given allotment
    • Timer is killed and no restore occurs leaving the system in the current state
  4. Admin DOES NOT stop the timer within the given allotment
    • Logins are paused to prevent changes during restoration
    • Automatic restoration of the previous created config.xml
    • The effect should theoretically be the exact same as if I did it manually.

Using this approach we are not storing changes for bulk commit or really changing the commit process at all..... we are simply saying "I'm about to do something I may not come back from, let me make a backup and set a timer to restore the backup if I don't finish in time"
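As a rough sketch of that "make a backup and set a timer" idea: the helper below copies a config file aside, starts a timer, and copies the backup back unless the timer is cancelled first. Everything here is an assumption for illustration: the `ConfigRollback` class, the `.rollback` suffix and the `on_restore` hook are made-up names, and the path is whatever file you hand it.

```python
import shutil
import threading
from pathlib import Path


class ConfigRollback:
    """Back up a config file and restore it unless cancel() is called in time.

    Hypothetical helper: the class name, the ".rollback" suffix and the
    on_restore hook are illustrative, not part of OPNsense.
    """

    def __init__(self, config_path, timeout_s, on_restore=None):
        self.config_path = Path(config_path)
        self.backup_path = self.config_path.with_name(
            self.config_path.name + ".rollback"
        )
        self.on_restore = on_restore
        self.restored = threading.Event()
        self._timer = threading.Timer(timeout_s, self._restore)
        self._timer.daemon = True

    def start(self):
        """Step 1: take the backup, then start the countdown."""
        shutil.copy2(self.config_path, self.backup_path)
        self._timer.start()

    def cancel(self):
        """Step 3: the admin stops the timer, so the current state is kept."""
        self._timer.cancel()
        self.backup_path.unlink(missing_ok=True)

    def _restore(self):
        """Step 4: the timer expired, so put the saved file back."""
        shutil.copy2(self.backup_path, self.config_path)
        self.backup_path.unlink(missing_ok=True)
        if self.on_restore is not None:
            self.on_restore()  # assumption: reload services or reboot here
        self.restored.set()
```

Usage would be `rb = ConfigRollback(path, timeout_s=300); rb.start()`, then `rb.cancel()` once the change is confirmed. Copying the file back is the easy part; making the running system actually pick up the restored config is what the `on_restore` hook stands in for, and it is not solved here.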

Again, I do agree this could be built out more with ZFS in future. I will be the first to say I'm not familiar with the deep inner workings of the OS. I believe, however, I see most of the processes in place currently to make the above process work.

As always, I welcome the feedback and do appreciate the consideration. I lack developer skills, however, I am passionate about this project and feel this feature would be most welcome based on the feedback when I have posed the thought.

@AdSchellevis
Member

I don't expect it will work, but as mentioned, if it would, the plugin framework offers everything you need, so there is no need to keep this in core. You (or anyone else) could start working on such a plugin and open a PR there.

@tcsi-github
Author

My apologies for the incorrect placement. I should have realized this could be a plug in.

I appreciate the discussion very much.

@JJGadgets

Chiming in to say that I would appreciate this feature too. It’d be much nicer to have a simple “wait 20 secs after lights stop blinking”, than (in my case) connecting to the same LAN as the Proxmox server virtualizing OPNsense (rather than my daily VLAN), reverting a snapshot or typing in the command line to undo the changes, and possibly reboot if needed.

Is there some recognizable way for other users to express interest without butting in the middle of implementation discussions?

@AdSchellevis
Member

> My apologies for the incorrect placement. I should have realized this could be a plug in.

No problem at all, it's easy to overlook. I don't mind keeping the ticket open for now, just mentioning it's (currently) not a core priority and someone will have to do some work at some point in time in order to mature ideas. When keeping it simple, a plugin would probably be the better place anyway.

@tcsi-github
Author

@JJGadgets Thank you for the interest! I don't believe anyone would view it as butting in. @AdSchellevis Thank you for leaving the request open. I understand there are bigger things than my request. 🙂 I will create a request in plug-ins and reference here.

@AdSchellevis
Member

@tcsi-github let me move this then

@AdSchellevis AdSchellevis transferred this issue from opnsense/core May 7, 2022
@tcsi-github
Author

Thank you sir!

@CorvetteCole

CorvetteCole commented Jul 8, 2022

I may look into implementing this for OPNsense nodes on ZFS filesystems, but no guarantees. Will update if I decide to take this on.

@spi43984

spi43984 commented Jul 8, 2022

I've been using zfs with snapshots in a virtualized environment for quite some time to rollback if an update failed. Would love to see that in the opnsense GUI as well - either with some kind of timer (roll back in 5 minutes if not stopped in time) or completely manually (like to boot and check out different versions).

In the shell it works. I installed opnsense 22.1.2 on a Sophos appliance on top of zfs. Prior to an update I ran a `zfs snapshot zroot/ROOT/default`. After that it's a bit messy, but what I did was create a clone from the snapshot, and I could then switch between the original root `zroot/ROOT/default` and the cloned one `zroot/ROOT/clone` by setting `zpool set bootfs=zroot/ROOT/default zroot` or `zpool set bootfs=zroot/ROOT/clone zroot`. Just reboot and the old/new root filesystem gets mounted.
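For anyone who wants to script those steps, here is a small sketch that only builds the command lines and does not run them. The function names and the snapshot name `pre_update` are my own placeholders (the snapshot name is not given above); the dataset and pool names are the ones from this comment.

```python
def prepare_cmds(root="zroot/ROOT/default", clone="zroot/ROOT/clone",
                 snap="pre_update"):
    """Commands to snapshot the root dataset and clone that snapshot.

    "pre_update" is a placeholder snapshot name chosen for this sketch.
    """
    snapshot = f"{root}@{snap}"
    return [
        ["zfs", "snapshot", snapshot],
        ["zfs", "clone", snapshot, clone],
    ]


def switch_root_cmd(dataset, pool="zroot"):
    """Command that selects which dataset is mounted as root on next boot."""
    return ["zpool", "set", f"bootfs={dataset}", pool]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for cmd in prepare_cmds() + [switch_root_cmd("zroot/ROOT/clone")]:
        print(" ".join(cmd))
```

To actually execute them, each list could be passed to `subprocess.run(cmd, check=True)` as root, followed by a reboot as described above.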

zfs clones have some disadvantages if one wants to keep the snapshots/clones for a longer time, so a zfs send/receive might be a better approach. An integration into the GUI and the boot loader would be nice, for example to choose the zfs dataset to be mounted as root next.

From quick testing the only interesting zfs dataset to be snapshotted is zroot/ROOT/default. There are some more (/usr, /var etc.). Some time in the future it might be necessary to snapshot them as well but then it would be nicer if all datasets would not be set up like

zroot
zroot/ROOT
zroot/ROOT/default
zroot/tmp
zroot/usr
zroot/usr/home
zroot/usr/ports
zroot/usr/src
zroot/var
zroot/var/audit
zroot/var/crash
zroot/var/log
zroot/var/mail
zroot/var/tmp

but something like

zroot
zroot/update_22_1_10
zroot/update_22_1_10/ROOT
...
zroot/prod
zroot/prod/ROOT
zroot/prod/ROOT/default
zroot/prod/tmp
zroot/prod/usr
zroot/prod/usr/home
zroot/prod/usr/ports
zroot/prod/usr/src
zroot/prod/var
zroot/prod/var/audit
zroot/prod/var/crash
zroot/prod/var/log
zroot/prod/var/mail
zroot/prod/var/tmp

It's much easier then to snapshot all relevant datasets by e.g. zfs snapshot -r zroot/prod@update_22_1_10 and clone them into zroot/update_22_1_10. I tried that out manually as well by renaming the datasets from zroot/... to zroot/prod/... and it worked as well.
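A sketch of how that could be scripted: `zfs snapshot -r` covers the whole tree in one go, but `zfs clone` works on one snapshot at a time, so the clone commands have to be generated per dataset. The function name `clone_tree_cmds` is made up for this example, and the dataset list is passed in by hand rather than read from the system.

```python
def clone_tree_cmds(datasets, source="zroot/prod",
                    target="zroot/update_22_1_10", snap="update_22_1_10"):
    """Build one recursive snapshot command plus one clone command per dataset.

    Datasets outside `source` are ignored. Sorting puts parents before
    their children, so each clone's parent exists when it is created.
    """
    cmds = [["zfs", "snapshot", "-r", f"{source}@{snap}"]]
    for ds in sorted(datasets):
        if ds != source and not ds.startswith(source + "/"):
            continue
        new_name = target + ds[len(source):]
        cmds.append(["zfs", "clone", f"{ds}@{snap}", new_name])
    return cmds


if __name__ == "__main__":
    tree = ["zroot/prod", "zroot/prod/ROOT", "zroot/prod/ROOT/default",
            "zroot/prod/usr", "zroot/prod/var"]
    for cmd in clone_tree_cmds(tree):
        print(" ".join(cmd))
```

This only prints the commands. Properties such as mountpoints on the cloned datasets would likely need attention before booting from them, which this sketch does not handle.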

If something breaks it's also possible to boot from an opnsense usb installer thumb drive, issue a `zpool import -f zroot` and clone/rollback/set/whatever the zfs datasets from within the live system.

I think zfs snapshots might be a way to roll back from a failed update or a misconfiguration.

Happy to discuss this further.

@CorvetteCole

yeah I figured ZFS snapshots would be the way to go. I've heard that UFS supports snapshots as well, but don't know much about that

@spi43984

spi43984 commented Jul 8, 2022

Anyone experienced in writing plugins for opnsense? Could try and create something together for zfs snapshots and rollbacks.

@CorvetteCole

I think the first step would be a plugin that you can interact with in the UI to manually snapshot and restore snapshots. Once we've got that, hooking it in to system events should be a little less... painful? I have no experience with OPNsense plugins but will look around and see if I can get some knowledge to start building a base for that or something

@CorvetteCole

@spi43984

spi43984 commented Jul 8, 2022

Yep, that sounds like a good way to start.

@CorvetteCole

Looks like we can write the 'backend' part of this in python

@CorvetteCole

and here are some more docs. I'll see if I can create a bit of a template interface for this, give me a bit https://docs.opnsense.org/development/examples/helloworld.html

@spi43984

spi43984 commented Jul 8, 2022

If the backend is in python - does it have an API for zfs or do we need to call zfs shell commands?

@CorvetteCole

I suspect we'll have to use zfs shell commands, even if there are python libraries for interacting with zfs. see here for configd which we would probably use to do this: https://docs.opnsense.org/development/backend/configd.html.

It would be nice to use a pip package, unsure on what we are allowed to do with that
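As a rough idea of what the shell-command route might look like from Python, here is a sketch that calls `zfs list` and parses its tab-separated output into dicts. The function names are hypothetical, and how this would be wired into configd is not shown; the parser is kept separate so it can be tried without a ZFS system.

```python
import subprocess


def parse_snapshot_list(output):
    """Parse `zfs list -H -p -t snapshot -o name,creation` output.

    With -H the columns are tab-separated and there is no header line;
    with -p the creation time is printed as seconds since the epoch.
    """
    snapshots = []
    for line in output.splitlines():
        if not line.strip():
            continue
        name, creation = line.split("\t", 1)
        dataset, _, snap = name.partition("@")
        snapshots.append(
            {"dataset": dataset, "snapshot": snap, "creation": int(creation)}
        )
    return snapshots


def list_snapshots(dataset="zroot/ROOT/default"):
    """Run zfs and return the snapshots under `dataset` (needs a ZFS host)."""
    result = subprocess.run(
        ["zfs", "list", "-H", "-p", "-t", "snapshot",
         "-o", "name,creation", "-r", dataset],
        check=True, capture_output=True, text=True,
    )
    return parse_snapshot_list(result.stdout)
```

Only the standard library is used here, so the question of whether pip packages are allowed does not come up for this part.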

@spi43984

spi43984 commented Jul 16, 2022

I quickly checked some update packages from https://pkg.opnsense.org/FreeBSD:13:amd64/22.1/sets/. From what I can see, the update changes files in / (obviously) and some other folders being part of the zroot/ROOT/default dataset. And there are files being modified in /tmp, /usr and /var mounted from other zfs datasets.

Would it be a huge trouble to change the zfs hierarchy from zroot/ to something like zroot/prod/?

@spi43984

spi43984 commented Jul 16, 2022

I've added you as a collaborator on the repository so you are welcome to work on what you find interesting as well. We need to figure out some sort of framework for how we will call for zfs snapshot restore and creation, as well as ways to query what snapshots are available etc. Obviously some thought to be done

https://github.com/CorvetteCole/opnsense-plugins-snapshot/tree/master/sysutils/opn-snapshot

@CorvetteCole
Could you please set up some issues/discussion/etc. lists there so we can have our thinking over there?

@OPNsense-bot

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/plugins/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

@OPNsense-bot OPNsense-bot closed this as not planned Won't fix, can't repro, duplicate, stale Nov 2, 2022
@OPNsense-bot OPNsense-bot added the help wanted Contributor missing label Nov 2, 2022
@allanlaal

unstale!

@spi43984

spi43984 commented Nov 5, 2022

Anyone else interested in that topic? I can help with zfs but would need somebody to work on the plugin part.

@CorvetteCole

I can't work on this anymore, I do not have the time unfortunately. I am very sorry!

@allanlaal

what about having a minimal OpnSense in a separate partition?
OR a second instance of a webui running with default confs

@spi43984

> what about having a minimal OpnSense in a separate partition? OR a second instance of a webui running with default confs

That's what we can get with a zfs snapshot to boot from.

@allanlaal

@OPNsense-bot reopen

@jeremyrnelson

+1

@fraenki fraenki self-assigned this Jan 11, 2023
@fraenki
Member

fraenki commented Jan 11, 2023

I'm interested to work on a plugin. However, I'll probably go for an extremely simplified approach in the initial version. Something like the reload in 5 feature found in some routers/switches. However, that's not going to be released "soon"... so if someone wants to start early on a new plugin, just do it and submit a PR. We'll see whoever is the first to release it. :D

@fraenki fraenki reopened this Jan 11, 2023
@spi43984

> I'm interested to work on a plugin. However, I'll probably go for an extremely simplified approach in the initial version. Something like the reload in 5 feature found in some routers/switches. However, that's not going to be released "soon"... so if someone wants to start early on a new plugin, just do it and submit a PR. We'll see whoever is the first to release it. :D

I am not familiar with plugin development for opnsense but can try to help with the zfs part, although I decided to keep my own fully virtualized environment setup rather than switching to separate hardware with a zfs filesystem for opnsense.

fraenki added a commit to fraenki/plugins that referenced this issue Feb 22, 2023
@fraenki fraenki added feature Adding new functionality and removed help wanted Contributor missing incomplete Issue template missing info labels Feb 22, 2023
fraenki added a commit to fraenki/plugins that referenced this issue Feb 22, 2023
@fraenki
Member

fraenki commented Feb 22, 2023

The development version of a new Auto Recovery plugin is now available for testing. See #3321 for installation instructions. Please report success and/or bugs in #3321.

fraenki added a commit to fraenki/plugins that referenced this issue Feb 22, 2023
@fraenki
Member

fraenki commented May 30, 2023

My proposal for a new plugin was rejected. Unfortunately, I cannot put more time into this.
If someone wants to step in and make the required code modifications, the code is still available:
#3321
https://github.com/fraenki/plugins/tree/auto_recovery

I'll unassign myself from this issue. If someone wants to take ownership, please let me know. Otherwise the issue may be automatically closed due to timeout.

@fraenki fraenki removed their assignment May 30, 2023
@OPNsense-bot

This issue has been automatically timed-out (after 180 days of inactivity).

For more information about the policies for this repository,
please read https://github.com/opnsense/plugins/blob/master/CONTRIBUTING.md for further details.

If someone wants to step up and work on this issue,
just let us know, so we can reopen the issue and assign an owner to it.

@OPNsense-bot OPNsense-bot closed this as not planned Won't fix, can't repro, duplicate, stale May 30, 2023
@OPNsense-bot OPNsense-bot added the help wanted Contributor missing label May 30, 2023