New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] FSEventsEmitter order events by mtime #634
Conversation
Thanks for taking the time to do it @samschott. |
Could you fix issues reported by the CI? |
I had forgotten some obvious imports. Not sure however why the remaining Python 2.7 test fails on Windows. |
Yes, this is a random bug on the CI, just relaunched the job and it passes :) You could add a line in the changelog + you GH nick ;) |
Done! |
As a minor, I'd rebase the commits (because going back and forth would be a worse approach) to avoid changing the style for nothing (the huge lines that had been previously avoided, as their contents didn't change). I don't like lines exceeding 79 chars as I usually put parts of code in slides for presentations, and sometimes I read code from the phone. Horizontal scrolling isn't nice, and wrapping breaks the indentation. Also, comments that just repeat what's written in the next line of code are noise and should be avoided, unless they say something that isn't exactly what the code explicitly does, or have some other goal (e.g. grouping parts of the code that would be ambiguous otherwise). But that's not a big issue. About the use of the modification timestamp, I think there should be something else at least for the moved/deleted events, as they would probably come first all the time, which could create yet another kind of sorting issue. I'm not aware of the specifics regarding macOS, but usually the modification timestamp ( That said, I'm not aware of what is really being affected here. Probably every moved file would come first, which would at best keep the same behavior regarding the order of events that have the same modification timestamp (i.e., the "deleted before created" event order would still happen as they would share the same |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feedback is in my previous comment.
@danilobellini, thanks for having a look at this, and sorry about messing up the code style. I think you are right about the mtime, it is not affected by moving an item. Sorting by mtime (as opposed to sorting by event type) nevertheless seemed to fix the issue of out-of-order events when saving file changes. This is probably because macOS performs such saves in three steps:
Therefore, moving does (coincidentally) come first while the deleted event gets caught in the next snapshot and therefore comes last. The current implementation sorts the events as follows:
This makes it appear as if a new file was created, moved and subsequently deleted, i.e., no effective change. Regarding the correct stat call, from what I can see, If you agree that the there is a sufficient use case, I'll rewrite the PR to use the ctime instead for moved and deleted events. For the created and modified events, mtime is probably the better choice. |
AFAIK, that must not be not specific to the OS, but specific to a text editor or a software alike. Otherwise, replacing a single byte in a terabyte-sized file would be a disaster. Try patching a file with something like this: with open("temporary_file.txt", "rb+") as f:
f.seek(0)
f.write(b"0") The above should not "create a new file", but should instead update the file in place. Using the ctime instead of mtime would indeed help a lot, as it's not "replaceable" (not even How can be sure that the event order is right/wrong? |
True. I was somewhat imprecise: the described behaviour is specific to the macOS NSDocument save functionality - and probably very similar to other programs which implement their own saving.
I would argue that having the right order most of the time is better than never. Also, sorting must only be correct for events from the same snapshot difference. Since every new FSEvent will trigger a new snapshot difference in the |
As long as we don't use the old timestamp metadata for anything, that's right (perhaps an equality comparison would still be fine, but ordering by it wouldn't). Either way, the observer should not crash even in the worst scenario.
On Linux, where inotify tells the exact file/directory that changed, the creation-deletion order might be wrong. That wrong inotify events ordering, when it happens on subdirectories in watchdog 0.8.x, used to crash the observer. I remember that fixing that was quite hard, and perhaps it still needs some enhancement. On the other hand, inotify sends the events already ordered for most of the time. I'm not disagreeing that "having the right order" is a good thing. I'm just not so sure about how one can be sure that a given order is "right" (or wrong). Both the "create-delete" and the "delete-create" orders are possible, and I think we don't have the exact deletion ctime, so perhaps the correct approach would be a "does the file currently exists?" check to just move a single event to the nearest position that makes it coherent, but such checking might get a wrong result as another event might happen during the process, and we have no synchronization lock to avoid that race condition. Perhaps we should not reorder stuff from distinct filenames, unless they're nested and that nesting need to be consistent (e.g. a creation of a file in a directory should only happen after the directory was created). |
I think I have been complicating things unnecessarily and there may indeed be a way to determine a unique “right” order without looking at mtimes or ctimes. Looking at the source code for DirectorySnapshotDiff, the event types are categorised as follows:
From the above classification, there can be at most two created/deleted/moved events that share the same path in one snapshot diff: Deleted(path1) + Created(path1) And any Modified event will come before a Moved event or stand alone. Modified events will never be combined by themselves with created or deleted events because they require the inode to be present in both snapshots. From the above, we could achieve correct ordering for every path by always adding Deleted events to the queue first, Modified events second, Moved events third and Created events last. The ordering won’t be correct between unrelated paths but that hardly matters for most applications. Does this make sense or have I overlooked something? |
In general, I wouldn't say that. As of today, applications shouldn't rely on the event order when using the polling observer or the FSEvents observer, but the goal of this issue is "letting the order matter". One could also argue that creation-deletion order consistency hardly matters for most applications, and that wouldn't "close the issue".
That would be similar to the current polling/FSEvents solution (the only difference is that Moved are currently enqueued after Created). That's quite easy to do (in both polling and FSEvents), but I wonder if it's enough. If a solution isn't able to create a certain output sequence, then it's wrong, and that's why I think this issue matters. For a single path, let's say we have (C) for creation, (D) for deletion, (M) for modification and (R) to moving/renaming. In general, all possible consistent sequences are like:
Or fragments of these, or multiple "M" in a row instead of just one, or an R "to itself" (old and new paths are the same) instead of M, or an R instead of a C ("creating" something in the new path), or an R instead of a D ("removing" what was in the old path), or joined fragments of these. But most of these, as well as the case of more than one modification in a row, wouldn't be catched by a single snapshot diff comparison. At first, not even C-D or D-C, given that "C" would appear only when a file exists in a new snapshot but not in an old snapshot, and "D" would appear only in the reversed case, so a single snapshot diff will never yield both for a single path, although we can check this case using the referenced inode. But that's confusing, let's try to summarize it. A single snapshot diff can only yield, for a single path (Rold and Rnew are just R, where the old/new path is taken for joining with the other events):
Cases that might never actually happen due to the lack of information:
(Likewise, M-Rold, Rold-M, M-Rnew and Rnew-M might happen if we can compare the Currently, That said, for the current solution (already merged since long ago), having the D before the C already makes the D-C order consistent. The issues regarding the consistency or "out of order" stuff are actually about missing events in Enforcing the "D-M-R-C" order isn't a proper/correct solution, although I agree it would indeed yield the correct order for most cases. But, as there's no possible R-C output yet, it wouldn't make any difference. The Also, there's one missing information here: hard links (more than one path referencing the same inode) might break the current implementation. In |
I agree, the order does matter. I was merely trying to find a solution which is easy to implement and which generates the correct order for events which are related by path. Of course, having the correct order for all events would be preferable and put FSObserver and the polling obsever on par with the other observers. I also really like the idea of sorting all events in the diff and not the observer. But as you previously pointed out, how does one get the deletion time from a snaphot? An easier goal than absolute ordering may be the following: If a user would apply all events in the order reported by the FSObserver, could they create the new snapshot from the old snaphot? If yes, I would consider the event order to be "consistent".
There is already possible (false) C-R output at the moment, which would be reordered to (correct) R-C output by enforcing the "D-M-R-C" order. This was the reason for this PR, as given in the file saving example. Or maybe I misunderstood your comment? Going through your analyses of possible event chains, I can follow most of it. Just a few points:
To summarise, I think the cases D-Rnew and D-M-Rnew are already handled correctly at the moment and one could simply fix Rold-C and M-Rold-C by enforcing a "D-M-R-C" order. However, correctly ordering M and R events as well as catching R events where the path did no change (do we need to catch those?) does require looking at the
Do you think those cases should be reported as R events? If I remember correctly, the documentation states that items moved from or to the observed path will register as D / C. But some use cases may benefit from distinguishing those from actual D / C events |
Indeed, the current code removes the deletion/creation when finding an R, but not both D/C at once, so, yes, most cases are handled by a simple D-M-R-C ordering (not based on paths). There's one more case I was missing, a "double move", R(x, y) - R(w, x), but it's already handled, although its order might be... unknown! This is the case when both D and C are removed. In the worst case we can have a circular chain of renamed files R(a, b) + R(b, c) + R(c, a), but there seem to be no "correct order" to that (it reminds me the Condorcet paradox).
I think we need to catch those. What I don't know is if these should be regarded as an R(path1, path1) or an M(path1). Looking at the
Yes. These events with modifications are confusing as they are multi-path:
An "extended D-M-R-C" ordering based on paths like the single-diff-sequences above:
Assuming multiple chained R, perhaps we can even detect multiple "M" of a single path (the old from the first R and the new from the second R). An ordering for a single step would be like (An M(path3) might appear anywhere after the first R):
I don't know if it's possible. I think we should report as R only if we can find the other path to report these as moved files without collecting the metadata from the entire file system all the time. |
It may be an issue if we want to implement it in the snapshot diff because Windows reports the "created time" and not the "metadata changed time" as I see the following options:
I still tend to prefer the first option, but will happily try to implement the others as well. Also, thinking about "D-M-R-C" vs "D-R-M-C" ordering, the second option may be better because it associates the modification with the current path and not the old one. |
This certainly is a more complex issue than what I naively assumed when submitting the PR... |
That's indeed an issue. However, AFAIK, the Linux and Windows observers don't use the
Perhaps with something like NtQueryInformationFile, there's a
That makes sense, but that's an issue even on other operating systems (e.g. running watchdog on mounted FAT32 pendrives). And perhaps not all file systems store the "metadata changed time", we still need to know how it behaves in these cases.
Agreed! That should be simple to be fixed, but the cases where the
I think that's not as simple as it looks. Perhaps we should create the concept of a "empty-named file" or something alike to break cycles, so R(a, empty) - R(b, a) - R(empty, b) would solve the "swapped a and b" case where R(a, b) - R(b, a) would be misleading.
We're still missing a lot of stuff (access rights metadata, file contents, etc.), but that's still a great way of telling the goal of creating an "ordered diff".
That's not an issue, and not something to be avoided.
Using mtime might be wrong at least for "C" events. For example, try unzipping a file in a path, the new created files have the mtime from the zipped contents, only the ctime is set to the "unzipping time". Probably the ctime should be the seen as the default timestamp reference of all events, if it's available. The deletions might need some kind of path ordering. One can't delete a file Probably we should have the graph of related (path-based) events like the D(2)-R(1,2)-M(2)-C(1) and R(x, y)-R(w,x), which will enforce constraints for ordering. The
And way more complex than I naively assumed when first reviewing ahuahuah |
Agreed, this should not be a priority but would be nice to get right. I am not sure about the
Yes, I had overlooked the ambiguity in reporting R(a, b) - R(b, a). The dummy filename would be an option here but it's a bit clumsy. An alternative could be to report this as individual M(a) - M(b). This however obfuscates the actual events. I have implemented a fist (rough) try at sorting in the snapshot diff. As of yet, it does not handle file swaps or ordering M events relative to R events. But it may be a basis for discussion. Another issue which have come across: When a directory is moved with all of its content, which do we report first? The moved children or the parent? |
@@ -123,12 +124,12 @@ def get_inode(directory, full_path): | |||
modified = set() | |||
for path in ref.paths & snapshot.paths: | |||
if get_inode(ref, path) == get_inode(snapshot, path): | |||
if ref.mtime(path) != snapshot.mtime(path) or ref.size(path) != snapshot.size(path): | |||
if ref.ctime(path) != snapshot.ctime(path) or ref.size(path) != snapshot.size(path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will catch metadata changes as M events.
Perhaps it is the
Good question, as "neither" would be the correct answer. As the goal is to create an order of events that would reproduce the new snapshot from the old one, a moved directory should have this ordering:
But that's not how inotify works on Linux. When inotify detect a moved directory, it emits an R(old_dir, new_dir), then a new low level watcher is created for the new directory. For the inotify observer, the first C above is currently an R, and there's no D. If the event order becomes a valuable information, this should be updated in the inotify observer as well (I mean, for a next release where the "event ordering consistency" becomes a feature).
The pair R(a, b) + R(b, a) is just the 2-files example of a more general problem: the "cycle" of renamed files whose temporary file name in between was lost. The graph cycle might have one file (the R(x, x)), 3 files (R(a, b) + R(b, c) + R(c, a)), and so on. A dummy filename is almost unavoidable in these cases, as we have a chain of events (the renaming graph). AFAIK, the empty filename isn't a valid filename, yet it's still a string, so it's a possible choice to avoid breaking things besides the graph cycle. I would not report these as M. Perhaps one can use a sequence like D(b) - R(a, b) - C(a) - M(a) to describe the swapped names by breaking the chain in one single file using the "directory move pattern" above, so only a single file in the cycle gets "destroyed and recreated". But I still think R(b, empty) - R(a, b) - R(empty, a) is a better solution, assuming there will be some documentation telliing that the empty filename refers to a broken loop of renamed files, where a temporary file name exists but it's unknown. For breaking the chain we can also find the point where it should be broken. As the last R is the one from the temporary (or "empty") filename, it will have a bigger |
This is an updated version, similar to the somewhat outdated PR #311. This essentially orders emitted events by mtime and prevents confusion when
FileDeletedEvent
s andFileCreatedEvent
s appear out of order. This would fix for instance #538.This issue only appears when using
DirectorySnapshot
to poll for events, asFSEventsEmitter
does, and therefore does not seem to affect most observers.Let me know if this is the right place to fix this, or if a modification of DirectorySnapshot is more appropriate.