Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(snapshots): Implement volume shadow copy support on Windows #3543

Merged
merged 24 commits into from
Feb 4, 2024

Conversation

mxk
Copy link
Contributor

@mxk mxk commented Dec 28, 2023

This is another attempt at adding Volume Shadow Copy Service (VSS) support on Windows. See #2334, #2662, and #3371 for background info and reasons why this is a desirable feature. This version is the most minimal implementation so that I can get an idea of whether this approach is acceptable. If so, I'll make a few improvements and add the policy option to the UI.

Unlike the example action scripts, this uses the DeviceObject path of the shadow copy directly, which removes the need to link the shadow copy anywhere in the file system, and makes the backup even more robust. To be fair, the linking step in the script probably could've been removed as well, assuming that Kopia doesn't have any issues handling \\?\ paths (everything seems to work fine in my testing).

Usage

$ kopia policy set <path> --enable-volume-shadow-copy=always

OR

$ kopia policy set <path> --enable-volume-shadow-copy=when-available

Note: One issue that affects all uses of VSS is the need to run kopia.exe as an elevated user who is a member of the Administrators group. Without that, the snapshot will fail (unless when-available is specified, in which case VSS won't be used at for the snapshot)

Fixes #2662, fixes #3371

Copy link

codecov bot commented Dec 29, 2023

Codecov Report

Attention: 39 lines in your changes are missing coverage. Please review.

Comparison is base (3fed193) 75.83% compared to head (46f8b55) 75.84%.

Files Patch % Lines
cli/command_policy_set_os_snapshot.go 62.50% 14 Missing and 1 partial ⚠️
snapshot/snapshotfs/upload.go 31.57% 13 Missing ⚠️
cli/command_policy_set.go 25.00% 2 Missing and 1 partial ⚠️
fs/localfs/local_fs_os.go 76.92% 2 Missing and 1 partial ⚠️
internal/atomicfile/atomicfile.go 40.00% 3 Missing ⚠️
...apshot/snapshotfs/upload_os_snapshot_nonwindows.go 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3543      +/-   ##
==========================================
+ Coverage   75.83%   75.84%   +0.01%     
==========================================
  Files         467      470       +3     
  Lines       37197    37301     +104     
==========================================
+ Hits        28208    28291      +83     
- Misses       7059     7080      +21     
  Partials     1930     1930              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@jkowalski
Copy link
Contributor

I took a pass at the PR, looks very promising, it's important to get CLI and policy configuration experience right so we can in the future expand it to support other snapshot mechanisms.

BTW thanks for creating go-vss. Many projects need something like this.

@mxk
Copy link
Contributor Author

mxk commented Dec 31, 2023

@jkowalski Thanks for your feedback! I made all the suggested changes. Please take another look.

I set the default setting to when-available because I can't think of any reason why Kopia shouldn't use VSS by default, if possible. To ensure compatibility for all the existing users who have before/after snapshot actions configured, a shadow copy is created only if the root directory does not already refer to an existing shadow copy. A warning is logged in this case to get people to remove their action scripts.

@jkowalski
Copy link
Contributor

This is looking great.

You still have some linter failures. I recommend make lint-all locally which will run linter in all OS configurations (this takes a little while).

Also, we need more tests - related to CLI flag setting and corresponding policy show output - there is a number of those you can tweak.

I'm curious whether we can actually use VSS on Github Runner when running tests - it would be great do write a Windows-specific test that ensures we can snapshot files that are normally locked. A quick search reveals that elevated UAC is available - we just need to find a way to trigger this when running tests.

We should also add tests for the behavior when VSS is not available (not running elevated).

@mxk
Copy link
Contributor Author

mxk commented Jan 2, 2024

Linter failures should be fixed and I added policy set/show tests.

So far, I have not been able to run kopia.exe with elevated privileges from an integration test. By using a manifest file, I can set dist\testing_windows_amd64\kopia.exe to require admin privileges, but when I try to run it from a test, I get "The requested operation requires elevation" error rather than a UAC prompt. I've also tried getting a linked token via wintoken, but that didn't work either. If anyone has other ideas, please let me know.

@mxk mxk marked this pull request as ready for review January 3, 2024 18:29
@mxk
Copy link
Contributor Author

mxk commented Jan 3, 2024

I added an integration test that changes behavior depending on whether it is running with elevated privileges or not, and then used gsudo in the action workflow to test both conditions.

@mxk
Copy link
Contributor Author

mxk commented Jan 12, 2024

@jkowalski Could you please take another look?

@LakeishaKowalczyk
Copy link

@jkowalski
Hello, may I ask how the progress here is going? This feature is really excellent and I’ve been longingly anticipating this feature

@jkowalski
Copy link
Contributor

sorry, missed that, taking a look now...

@jkowalski
Copy link
Contributor

this is looking awesome, i have 2 nits - but there are still some errors in the windows run which look like real issues with long filenames

@mxk
Copy link
Contributor Author

mxk commented Feb 2, 2024

@jkowalski Thanks for taking a look! I made both changes. As far as I can tell, there are two Windows failures. The first has to do with the logfile_test using DEBUG|INFO regex to match log output:

https://github.com/kopia/kopia/actions/runs/7479769611/job/21139454308#step:8:62

Is it ok to update the regex to also match WARN?

I believe the other tests fail because MaybePrefixLongFilenameOnWindows is stripping away the \\?\ prefix and then just returning that path to os.ReadFile, which fails.

https://github.com/kopia/kopia/actions/runs/7479769611/job/21139454308#step:8:1259
https://github.com/kopia/kopia/actions/runs/7479769611/job/21139454308#step:8:2439

I don't think I understand why this was being done in the first place, so I I'm not sure what the best fix is. Should I add a special case to leave the path alone if it starts with \\?\GLOBALROOT\?

@jkowalski
Copy link
Contributor

@jkowalski Thanks for taking a look! I made both changes. As far as I can tell, there are two Windows failures. The first has to do with the logfile_test using DEBUG|INFO regex to match log output:

https://github.com/kopia/kopia/actions/runs/7479769611/job/21139454308#step:8:62

Is it ok to update the regex to also match WARN?

I believe the other tests fail because MaybePrefixLongFilenameOnWindows is stripping away the \\?\ prefix and then just returning that path to os.ReadFile, which fails.

https://github.com/kopia/kopia/actions/runs/7479769611/job/21139454308#step:8:1259 https://github.com/kopia/kopia/actions/runs/7479769611/job/21139454308#step:8:2439

I don't think I understand why this was being done in the first place, so I I'm not sure what the best fix is. Should I add a special case to leave the path alone if it starts with \\?\GLOBALROOT\?

Let's try that, or perhaps simply return any paths that start with "\?" as-is?

@mxk
Copy link
Contributor Author

mxk commented Feb 3, 2024

@jkowalski Done. Can you re-run the tests, please?

@jkowalski
Copy link
Contributor

looks like there's some windows-specific regression in tests now, it likely is related to long filename rewriting

@mxk
Copy link
Contributor Author

mxk commented Feb 4, 2024

Yep, I somehow missed the call to wrapIgnorefs on overrideDir, so the path comparison didn't work correctly. I also wasn't running the test as an admin, so it was passing for me since it just didn't use VSS.

I'm not 100% sure how to handle TestLoggingFlags failing due to Another shadow copy operation is already in progress. I'm assuming it's an intermittent error that happens due to some other tests running in parallel. For now, I added some retry logic for this error, which is probably a good idea anyway, but the test may still fail. Another option is to just disable OS snapshots altogether for some tests.

Copy link
Contributor

@jkowalski jkowalski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, this is a really awesome contribution.

@jkowalski jkowalski merged commit f62ef51 into kopia:master Feb 4, 2024
22 of 23 checks passed
@jkowalski
Copy link
Contributor

@mxk looks like there is currently some test flakiness in the master branch after we merged this. Would you be able to investigate the failures and fix them? Also, would be great if you could join http://slack.kopia.io

See #3628 and #3624

@mxk
Copy link
Contributor Author

mxk commented Feb 7, 2024

@jkowalski I'm on Slack now and just looked through the issues and PRs.

As far as I can tell, the original issue was TestServerScheduling failing because VSS snapshots sometimes take 10+ seconds to create, which is normal, and this test was measuring the number of Kopia snapshots that can be taken in 10 seconds (#3624). I would prefer to disable VSS for this test and any others like it, or make never the default setting for just the integration tests without changing the actual default global policy.

Globally disabling VSS (#3625) defeats the purpose of when-available setting, which is to allow the use of VSS on a best-effort basis, especially for everyone who is currently using PowerShell actions for it. This change then caused TestShadowCopy to start failing since it was assuming a default of when-available (#3628 and #3629).

The original issue also reports a race failure in go-ole, which seems to be caused by how it handles *string output parameters in WMI calls. This is only used in one place when I call Win32_ShadowCopy.Create and can't be avoided. I don't think it is causing any real issues, but I definitely don't like it doing something that's not strictly correct as far as Go pointer rules are concerned. I'll see if I can contribute a fix for that.

Is there anything else that I missed?

@julio-lopez @aaron-kasten @Shrekster

@jkowalski
Copy link
Contributor

Can we track all these action items in a separate issue with ultimate goal of re-enabling 'when-available' once all the underlying failures are addressed and test stability is restored?

@akash07k
Copy link

Awesome feature addition. I was eyeing on it since a long. where can I find the documentation for this new feature? I can't yet find it in official docs.

@LinuxOnTheDesktop
Copy link

LinuxOnTheDesktop commented Mar 23, 2024

So, hold on (if I may chime in): is 'shadow copy' now enabled by default on Windows, or not?

EDIT: Also, surely it would be a good idea to publicise here, or, better, in the release notes, why shadow copies are desirable.

@mxk
Copy link
Contributor Author

mxk commented Mar 23, 2024

VSS is currently disabled by default because it was causing issues for some tests. Issue #3636 was created to reenable it. You must currently use policy set --enable-volume-shadow-copy when-available (or always) to enable it. Documentation also still needs to be updated.

@basldfalksjdf
Copy link
Collaborator

IMO vss should be left off by default. It is too new of a feature to force everyone to use it.

@dbolton
Copy link

dbolton commented Apr 1, 2024

Is there a way to enable the volume shadow copy on KopiaUI? My current backups are incomplete without this. I tried the before/after scripts but didn't quite understand how to set them up to work correctly.

@mxk
Copy link
Contributor Author

mxk commented Apr 1, 2024

@dbolton Not yet. It can only be enabled via the command line for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants