Automated orphan file recovery #7423
The following was originally posted by @ryao in #7401. @ahrens @behlendorf Here is what I propose. We can extend zpool to handle this by:
We can also add a new feature flag that will allow the errata scrub to mark pools as unimportable on certain versions of certain platforms in the future, so that those versions will fail to import the pools while reporting that a later driver marked them as having a known integrity bug. A force import would then be needed to override it. That way, if this should ever happen again in a tagged release, we would not need to abuse feature flags to prevent a repaired pool from ending up imported on a buggy version. This might be more complicated to do in Illumos than in FreeBSD or Linux because Illumos distributions handle versioning, but I expect that it is doable. I still need to flesh out a few details, but that is the general gist of my thinking on how to deal with affected pools. What do you think?
Discussion from a few minutes ago:
@behlendorf It is more than just finding orphans. Directory sizes on directories that lost file links will be wrong. Also, the files' dnodes' SAs will contain the wrong link count versus what is in the file system. If a file had been linked into multiple directories, it will not be an orphan, but it will still have an incorrect link count. While this is running, we will need notification of all link and unlink operations on the tip so that we do not get confused. More thought is needed on how to handle snapshots. I don't want to make the assumption that the system time is correct and that anything before a certain date is okay (although that would be a great simplification), but if people were doing per-minute snapshots since the event happened, iterating through them inefficiently will be incredibly slow.
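To make the checks above concrete, here is a rough toy model of what such a scan would verify (the layout and all names are purely illustrative, not ZFS's on-disk format): stored link counts versus links actually reachable from directories, with unreachable files reported as orphans.

```python
from collections import Counter

def check_links(file_links, directories):
    """Toy consistency check.  file_links maps object id -> stored link
    count; directories maps directory id -> {name: object id}.  Returns
    the orphaned object ids and the objects whose stored link count
    differs from the number of directory entries pointing at them."""
    observed = Counter()
    for entries in directories.values():
        for obj_id in entries.values():
            if obj_id in file_links:        # only count links to files
                observed[obj_id] += 1
    orphans = [o for o in file_links if observed[o] == 0]
    wrong = {o: (file_links[o], observed[o])
             for o in file_links
             if 0 < observed[o] != file_links[o]}
    return orphans, wrong
```

A file linked into multiple directories that lost one link shows up in `wrong` rather than `orphans`, matching the point above that such files are not orphaned but still inconsistent.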
@trisk and I bounced thoughts back and forth to come up with a way to handle things. We have an algorithm that should handle snapshots in a way that is still somewhat slow, but not unreasonably slow. I will produce a design document that describes how it works (ideally later today) to aid implementation/review before I begin implementation.
@ryao a nice way to explore implementing your and @trisk's design would be to refresh #6209 and use it as a base for your work. This PR extends scrub so it can be run in user space on an offline pool. This would be convenient for development and, for the moment, would sidestep the issue of having to worry about the filesystem changing underneath you. A user space scrub which simply reported any inconsistencies would be a good first step. Additionally, let me reference openzfs/openzfs#604 which adds basic likely-leaked-object detection to
I think that the minimum useful thing to do would be to relink orphaned files of filesystems into lost+found. Everything else is gravy. Couple of notes:
@ahrens The design document would have addressed all of your points, but here is what @trisk and I came up with for each:
@behlendorf Thanks for the suggestion, but I have brought the KVM with virtfs root setup that I used a few years ago back into service for this. It is almost as good as developing in userspace. It should work well for this.
@ryao this may not be relevant, but note that the sequential scrub code is not in the zfs-0.7-release branch. We're planning to fully support it in 0.8.0. So you may want to avoid using any sequential scrub code in master for orphan recovery. That will make backporting to 0.7 less risky.
@tonyhutter Thanks for the heads-up. That makes the design decision to implement this separately from the conventional scrub inside the kernel feel much better. It will be in the design document that will go up either today or tomorrow. Hopefully today.
@ryao another thing to be aware of is that Lustre servers using ZFS do not strictly keep their link counts up to date and may also not strictly conform to the ZPL in other small, harmless ways. Historically this has been fine since these datasets are not meant to be mounted through the POSIX layer. But the checker needs to be aware of these differences so it's safe to run on these datasets.
Why? Is there something preventing them from using a clone that has orphaned files? I would suggest that we should fix clones too, but even if you don't want to do that due to the performance cost, it should be perfectly fine to keep using them (with the caveat that there might be orphaned files).
@ahrens The plan is to fix the clones, but we can never fix the origin snapshot without BPR. Unless we can somehow destroy an affected origin snapshot, we will not be able to completely repair a pool containing one.
In the common case of a file that never had multiple hard links, yeah, you can just search the parent directory like zfs_obj_to_path() does. If it has one link, and it's present in the parent dir, then great - that file is not orphaned. If it isn't in that dir, then either it's orphaned, or it's on the delete queue, or it had multiple hard links at some point in the past and the link in the parent dir was removed. You might be able to take advantage of this to reduce the size of the data structures you need while searching for orphaned files. But you'll still need to traverse all directories to distinguish between the had-multiple-links case and the orphaned case. It's too bad we didn't keep track of all the parent dirs.
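The cheap first pass described here could be sketched roughly like this (all names and structures are hypothetical, not actual ZFS code): a single-link file found in its recorded parent directory is definitely fine; everything else still needs the full directory traversal to disambiguate.

```python
def classify(obj_id, stored_links, parent_entries, delete_queue):
    """Classify a file using only its recorded parent directory.
    parent_entries is the {name: object id} map of the directory the
    file's SA points at; delete_queue is the set of objects pending
    deletion."""
    if obj_id in parent_entries.values():
        if stored_links == 1:
            return "linked"            # definitely not orphaned
        return "needs-full-scan"       # other links may exist elsewhere
    if obj_id in delete_queue:
        return "pending-delete"
    # Either orphaned, or it once had multiple hard links and the link
    # in this directory was removed; only a full traversal can tell.
    return "needs-full-scan"
```

Only objects classified as "needs-full-scan" would have to be carried in the data structures for the expensive pass, which is the size reduction suggested above.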
@ahrens I deleted the remarks after posting, since I had interpreted the suggestion from IRC incorrectly and clearly have been looking at my screen for too long. I had hoped that you would see that and disregard it. Anyway, thanks for the quick response. Exploiting this to reduce the size of the data structures is an interesting idea. I'll give it some thought.
It's not exactly sexy, but it does work :)
@ahrens @lundman If we were to keep track of all of the parent links in the future following a disk format change, I would have scalability concerns. The clones property works similarly and had been a scalability issue at ClusterHQ. If this does not scale well, I could see it being something that an end user could exploit by making and destroying an inordinate number of hard links to slow down the system.
@lundman I think the commit you referenced updates the one parent pointer when we can. It doesn't change the on-disk format to track all parents. And you could still end up with an incorrect z_parent (after removing the link). That said, your change is an improvement over what we have today and it would be good to get it upstream.
@ryao I suspect that your poor experience was with the performance of the clones property, rather than with the underlying data structures that track the list of clones (dd_clones and ds_next_clones_obj). Retrieving the clones property requires determining the name of each clone, so it's O(number of clones * average number of filesystems per parent filesystem). In contrast, iterating over the clones is O(number of clones); adding or removing a clone is O(1), and the storage for the clones is O(number of clones).
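A toy model of that cost difference (hypothetical names, not the actual dd_clones/ds_next_clones_obj structures): rendering the property must recover each clone's name by searching its parent's child list, while iterating the raw clone list is linear.

```python
def iter_clones(clone_ids):
    """Iterating the underlying clone list: O(number of clones)."""
    return list(clone_ids)

def clone_names(clone_ids, parent_of, children_of):
    """Rendering the property: each clone's name is found by searching
    its parent's child map, so the cost is O(number of clones * average
    number of children per parent)."""
    names = []
    for c in clone_ids:
        parent = parent_of[c]
        for name, child in children_of[parent].items():
            if child == c:
                names.append(f"{parent}/{name}")
                break
    return names
```

So a workload that only needs membership or counts pays the cheap linear cost; only name display pays the multiplicative one, which matches the distinction drawn above.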
I did think about a solution like HFS (which stores a list in an xattr), but that required cooperation from everyone, so I went with the simple "self-healing" approach, so imported foreign pools could be corrected. But yes, if you delete the 'currently valid' parent, it will have an incorrect parentid until the next lookup. I did not see a nice way around that. OS X probably relies on parentid more than most, especially with the ioctls for 'next/prev hardlink' in the set.
@ahrens @behlendorf @lundman @tonyhutter @trisk The errata scrub draft design proposal document is up at: https://dev.gentoo.org/~ryao/errata-scrub-design-doc.odt It has sha256 863e3e01bb23ac19d9c2e0afed4d10250e9a332cccc2820aa7bf8bae3024e6db.
@ryao regarding errata-scrub-design:
I would suggest linking a file with $OBJID-RECURSIONLIMIT (to mark the file that ran into the iteration limit) instead of 'we panic' - a panic will take down the whole system and would IMHO be worse than an incomplete repair.
This would basically turn the 'initial implementation' into an offline version (at least for some systems, should they be dependent on working snapshots/clones). Additional thoughts: as the defect in question looks more like a filesystem/dataset-level one (and can, at least as far as I understood it - please correct me if I'm wrong on this - propagate through send/recv) than a pool-level one: wouldn't it make more sense to address it on the zfs side by adding a zfs subcommand? This would allow targeting individual filesystems, thus enabling checks of received ones, and could potentially also be extended for other functions (ideas: change xattr mode, increase copies, recompress existing files with the current algorithm). Also, the errata state of datasets/snapshots could IMHO nicely be reflected in a property: there it could be queried through the zfs subcommand (which is way more practical than copy/pasting from zpool status output), and the existing property inheritance mechanics could be employed to implement the scan/repair process in a way that doesn't block create/destroy of snapshots/clones - while correctly tagging newly created ones while the process is running.
After some thought, I have decided to go with .zfs/lost+found. It avoids this problem entirely, although I expect it to make the code harder to port to other platforms and less maintainable.
I suspect that a version that does not allow concurrent snapshot/clone operations will be acceptable for the majority of people. I would prefer to allow snapshots and clones in the initial implementation too, but I believe that getting a correct tool out sooner without that functionality would be more desirable to end users than delivering it later. We'll see how things go as I get further along in the implementation.
I probably should update the design document, although the deficit of comments made me wonder if there was much point. The updated version would basically be what it is already, plus the two changes noted in this post, so I'll delay updating it until the patch to implement
I like that idea. I will try adding that to the implementation. :)
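For illustration, the convention discussed above - relinking recovered orphans into .zfs/lost+found, with a suffix marking files whose repair ran into the iteration limit instead of panicking - might look roughly like this. This is a hypothetical sketch of the naming scheme only, not the actual implementation.

```python
import os

def lostfound_path(obj_id, hit_iteration_limit=False):
    """Build a stable lost+found name for a recovered orphan, keyed on
    its object id.  Files that hit the iteration limit are tagged with
    -RECURSIONLIMIT so an incomplete repair is visible rather than
    taking the system down with a panic."""
    name = str(obj_id)
    if hit_iteration_limit:
        name += "-RECURSIONLIMIT"
    return os.path.join(".zfs", "lost+found", name)
```

Keying on the object id keeps names unique without needing to recover the original filename, which may be unrecoverable for an orphan.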
@ahrens @GregorKopka Does this look better to you?
It's a real bummer that this is necessary at all. I would prefer that we be very clear about when you might need to run this (i.e. almost never) and what it does (i.e. re-link orphaned files). I'd be happy to work with you on the wording. Documentation aside, I agree that doing it on a per-filesystem basis will probably make this easier to implement. If the repair runs synchronously, I don't think you need How would we detect orphaned files aside from running this command (e.g. to update the What problems can be detected but not repaired?
This was meant more to explain to people here how the command would work than as wording I had intended to give users, although my working directory has it in the man page at the moment. You are right that it should almost never be used. We could probably deprecate it almost immediately and then drop it from the documentation a year later, assuming that the man page changes make it into the final patch. I would be happy to take any wording suggestions that you have.
I plan to make this asynchronous. That way I don't have to worry about what happens if the user does Ctrl+C. Would you prefer that I make it synchronous?
We don't detect orphan files aside from running this command. The consistency property is a taint flag. It follows the snapshot through clone operations and follows the clones through snapshot operations unless the clones are repaired. My tentative thinking is not to set the property on a dataset unless the dataset is damaged. I will add hidden repair_timestamp and repair_txg properties to explain what has most recently been done on the dataset, although I will omit them from documentation. There is no point in showing this information to users.
The only problems in the scope of #7401 that we can detect, but not repair, are inconsistencies in snapshots. Clones of inconsistent snapshots are an interesting case though. We can repair them, but we cannot efficiently generate the information needed to repair them for every snapshot during our scan to be able to store it (imagine 1 million snapshots). That means that we must set the consistency property to damaged. Perhaps I should make it display "damaged, repairable" in that case though.
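The propagation rule described here - the taint follows snapshots into clones and clones into their snapshots, unless a clone has been repaired - could be sketched as follows. The graph structure and names are hypothetical; this only illustrates the rule, not the real property machinery.

```python
def propagate_taint(children, roots, damaged, repaired):
    """children maps dataset -> datasets derived from it (snapshots of
    a filesystem, clones of a snapshot).  A dataset is tainted if it is
    itself damaged, or derives from a tainted one and has not been
    repaired.  Returns the set of tainted datasets."""
    tainted = set()
    stack = [(r, False) for r in roots]
    while stack:
        node, parent_tainted = stack.pop()
        is_tainted = node in damaged or (parent_tainted and node not in repaired)
        if is_tainted:
            tainted.add(node)
        stack.extend((c, is_tainted) for c in children.get(node, ()))
    return tainted
```

Repairing a clone cuts off propagation to everything derived from it afterward, which is why the property need only be set on datasets that are actually damaged.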
@ryao Synchronous sounds fine to me, should also be a bit easier to implement. Hidden Possibly the On further thinking, it would be quite cool if this could some day be extended by a syntax of
Agreed. While I think having such a tool would be great, we need to decide up front what we're designing here and why. It seems to me we have two reasonable options. Option 1 - Commit to doing this as a fully supported feature. That would mean a significant amount of development effort and doing the following:
Personally, I think it would make the most sense to build these additional coherency checks in to Option 2 - Extend an existing tool like
@behlendorf The algorithm for doing these checks operates at a different level from zpool scrub and won't cover everything. Being needed for both 0.7.x and 0.8.x means that I would need to handle both incremental send/recv and the regular one in addition to operating in a different layer. It would be buggy and error prone.
I would rather have this be a feature only meant to be invoked when we mess up after it is updated to support it. The words "fully supported" worry me somewhat because I do not want to encourage users to run this except when we tell them to run it.
An initial draft of man page documentation is already written: https://paste.pound-python.org/show/LFodNYCJJtRtIHj5C4is/
I am alright with designing ZTS test coverage. Would a small (say, 2 MB compressed) file-based pool with known damage, which we extract into a 1 GB file, import, repair, and then verify to be properly repaired, be satisfactory? I already have one of those. I had planned to use it, with some poking around in zdb, as the basis for my own testing.
I believe that my design documentation, plus the noted revisions, addresses all of the comments.
We can maintain it in case anyone who was affected realizes that they have a problem years in the future, but I would prefer not to keep it easily discoverable by users for the long term. If we do, let us at least have warnings galore saying that running it is almost always unnecessary unless a developer asked that they run it. I do not want this being treated like a fsck tool. I would like to deprecate it in 0.8.0 and remove the documentation in 0.9.0 so that users are not discovering it long after the need has passed. If a bug report reaches us that merits its use, we could always instruct the user on how to invoke it.
While I have worked out a way to prune quite a bit of what needs to be checked to determine what is damaged, the time complexity is still sub-cubic and super-quadratic in the number of snapshots for a single filesystem. For all filesystems' snapshots, it is super-cubic. I want the repair process for any damage caused by #7401 to be as painless as possible. An offline tool is not going to achieve that, especially not with the projected time complexity.
Is the OpenZFS pool checkpoint feature being worked on for integration into ZoL?
@mailinglists35 I did a dry-run port of the checkpoint patch to ZoL as part of the work on the recently-committed import updates (which are a prerequisite to checkpoint) and it was sufficient to show me that it's going to be a fairly heavy lift to get pool checkpoint integrated to ZoL. That said, I'll likely give it a try this weekend.
Describe the problem you're observing
An orphan file is a file which is fully intact but cannot be accessed through the filesystem hierarchy.
By design, typical damage to a ZFS filesystem can never result in an orphan file being created. Therefore ZFS does not currently contain an automatic mechanism to detect and recover orphan files. However, with a little bit of work these orphan files can be manually read using developer tools such as zdb, and are not permanently lost. This issue is to discuss possible strategies for the efficient automated detection and recovery of orphaned files when recovering the files from a recent snapshot is not possible.
Describe how to reproduce the problem
Issue #7401 introduced a difficult-to-reproduce bug which, under very specific conditions, could result in an orphan file being created.