[WIP] FSEventsEmitter order events by mtime #634

Closed
wants to merge 2 commits into from


samschott
Contributor

This is an updated version, similar to the somewhat outdated PR #311. It essentially orders emitted events by mtime and prevents confusion when FileDeletedEvents and FileCreatedEvents appear out of order. This would fix, for instance, #538.

This issue only appears when using DirectorySnapshot to poll for events, as FSEventsEmitter does, and therefore does not seem to affect most observers.

Let me know if this is the right place to fix this, or if a modification of DirectorySnapshot is more appropriate.

@samschott samschott changed the title [WIP] FSEventsEmitter queue events by mtime [WIP] FSEventsEmitter order events by mtime Feb 5, 2020
@BoboTiG
Collaborator

BoboTiG commented Feb 5, 2020

Thanks for taking the time to do it @samschott.

@BoboTiG
Collaborator

BoboTiG commented Feb 5, 2020

Could you fix issues reported by the CI?

@samschott
Contributor Author

I had forgotten some obvious imports. Not sure however why the remaining Python 2.7 test fails on Windows.

@BoboTiG
Collaborator

BoboTiG commented Feb 6, 2020

Yes, this is a random bug on the CI, just relaunched the job and it passes :)

You could add a line in the changelog + your GH nick ;)

@samschott
Contributor Author

Done!

@danilobellini
Collaborator

As a minor point, I'd rebase the commits (because going back and forth would be a worse approach) to avoid changing the style for nothing (the huge lines that had been previously avoided, as their contents didn't change). I don't like lines exceeding 79 chars as I usually put parts of code in slides for presentations, and sometimes I read code from the phone. Horizontal scrolling isn't nice, and wrapping breaks the indentation. Also, comments that just repeat what's written in the next line of code are noise and should be avoided, unless they say something that isn't exactly what the code explicitly does, or have some other goal (e.g. grouping parts of the code that would be ambiguous otherwise). But that's not a big issue.

About the use of the modification timestamp, I think there should be something else at least for the moved/deleted events, as they would probably come first all the time, which could create yet another kind of sorting issue. I'm not aware of the specifics regarding macOS, but usually the modification timestamp (stat -c%y in Linux, stat -f%Sm in BSD-based systems) shouldn't change when a file gets moved (or deleted); what changes on moving is a "metadata status change" timestamp (stat -c%z in Linux, stat -f%Sc in BSD-based systems). About stat, one can use the %Y or %Z formats in Linux, or remove the S in BSD-based systems, to get a UNIX timestamp instead of a string. I don't have a macOS machine but AFAIK it follows the BSD stat syntax.

That said, I'm not aware of what is really being affected here. Probably every moved file would come first, which would at best keep the same behavior regarding the order of events that have the same modification timestamp (i.e., the "deleted before created" event order would still happen as they would share the same mtime). Perhaps I'm missing something, but a test is needed to check that it's working as it should, and such a test might be really difficult to write (it would have to mock the FSEvents stuff to forcefully send the events out of order).

Collaborator

@danilobellini danilobellini left a comment


The feedback is in my previous comment.

@samschott
Contributor Author

@danilobellini, thanks for having a look at this, and sorry about messing up the code style.

I think you are right about the mtime, it is not affected by moving an item. Sorting by mtime (as opposed to sorting by event type) nevertheless seemed to fix the issue of out-of-order events when saving file changes. This is probably because macOS performs such saves in three steps:

  1. Move the existing file to a tmp file: FileMovedEvent(path, path_tmp)
  2. Create the replacement file: FileCreatedEvent(path)
  3. Delete the tmp file: FileDeletedEvent(path_tmp)

Therefore, moving does (coincidentally) come first while the deleted event gets caught in the next snapshot and therefore comes last. The current implementation sorts the events as follows:

  1. FileCreatedEvent(path)
  2. FileMovedEvent(path, path_tmp)
  3. FileDeletedEvent(path_tmp)

This makes it appear as if a new file was created, moved and subsequently deleted, i.e., no effective change.

Regarding the correct stat call, from what I can see, os.stat('path').st_ctime does return the metadata change time as given by stat -f%c on macOS.

If you agree that there is a sufficient use case, I'll rewrite the PR to use the ctime instead for moved and deleted events. For the created and modified events, mtime is probably the better choice.
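A minimal sketch for checking the mtime/ctime behaviour on a rename locally. It assumes a Unix-like system (on Windows, `st_ctime` is the creation time), and the file names `a.txt` and `b.txt` are made up for illustration. The mtime is expected to stay untouched; whether the ctime changes depends on the file system, so the result is only reported, not assumed.

```python
import os
import tempfile
import time


def stat_times(path):
    """Return (mtime, ctime) of a path in nanoseconds."""
    st = os.stat(path)
    return st.st_mtime_ns, st.st_ctime_ns


def rename_and_compare(directory):
    """Create a file, rename it, and report which timestamps changed."""
    old_path = os.path.join(directory, "a.txt")
    new_path = os.path.join(directory, "b.txt")
    with open(old_path, "w") as f:
        f.write("hello")
    mtime_before, ctime_before = stat_times(old_path)
    time.sleep(0.05)  # let the clock advance past its granularity
    os.rename(old_path, new_path)
    mtime_after, ctime_after = stat_times(new_path)
    return {
        "mtime_changed": mtime_after != mtime_before,
        "ctime_changed": ctime_after != ctime_before,
    }


with tempfile.TemporaryDirectory() as tmp:
    result = rename_and_compare(tmp)
    print(result)
```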

@danilobellini
Collaborator

This is probably because macOS performs such saves in three steps

AFAIK, that is not specific to the OS, but to a text editor or similar software. Otherwise, replacing a single byte in a terabyte-sized file would be a disaster. Try patching a file with something like this:

with open("temporary_file.txt", "rb+") as f:
    f.seek(0)
    f.write(b"0")

The above should not "create a new file", but should instead update the file in place.
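One way to check this claim is to compare the inode number before and after the patch. The sketch below is self-contained and writes to a temporary directory; the initial file contents are arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "temporary_file.txt")
    with open(path, "wb") as f:
        f.write(b"abc")
    inode_before = os.stat(path).st_ino

    # Patch the first byte in place, as in the snippet above.
    with open(path, "rb+") as f:
        f.seek(0)
        f.write(b"0")

    inode_after = os.stat(path).st_ino
    with open(path, "rb") as f:
        content = f.read()

# Same inode: a snapshot diff would see at most a Modified event here.
print(inode_before == inode_after, content)
```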

Using the ctime instead of mtime would indeed help a lot, as it's not "replaceable" (not even touch can set it) unless one creates a new tool designed just to do that in "the hard way" (directly changing the raw inode metadata). However, one still can set the computer date to the future (e.g. an hour due to a wrong DST configuration), perform some stuff, and then put it back to the actual timestamp. In this case, one could possibly delete a file "before its creation" when it comes to timestamps in the file metadata, like ctime and mtime.

How can we be sure that the event order is right/wrong?

@samschott
Contributor Author

The above should not "create a new file", but should instead update the file in place.

True. I was somewhat imprecise: the described behaviour is specific to the macOS NSDocument save functionality - and probably very similar to other programs which implement their own saving.

How can we be sure that the event order is right/wrong?

I would argue that having the right order most of the time is better than never. Also, sorting must only be correct for events from the same snapshot difference. Since every new FSEvent will trigger a new snapshot difference in the FSEventsEmitter, one would need to change the computer date between rapid successive events. This may happen in some cases but not very often.

@danilobellini
Collaborator

[...] one would need to change the computer date between rapid successive events. [...]

As long as we don't use the old timestamp metadata for anything, that's right (perhaps an equality comparison would still be fine, but ordering by it wouldn't). Either way, the observer should not crash even in the worst scenario.

I would argue that having the right order most of the time is better than never. [...]

On Linux, where inotify tells the exact file/directory that changed, the creation-deletion order might be wrong. That wrong inotify events ordering, when it happens on subdirectories in watchdog 0.8.x, used to crash the observer. I remember that fixing that was quite hard, and perhaps it still needs some enhancement. On the other hand, inotify sends the events already ordered for most of the time.

I'm not disagreeing that "having the right order" is a good thing. I'm just not so sure about how one can be sure that a given order is "right" (or wrong). Both the "create-delete" and the "delete-create" orders are possible, and I think we don't have the exact deletion ctime. So perhaps the correct approach would be a "does the file currently exist?" check to just move a single event to the nearest position that makes it coherent. But such a check might get a wrong result, as another event might happen during the process and we have no synchronization lock to avoid that race condition. Perhaps we should not reorder stuff from distinct filenames, unless they're nested and that nesting needs to be consistent (e.g. a creation of a file in a directory should only happen after the directory was created).

@samschott
Contributor Author

I think I have been complicating things unnecessarily and there may indeed be a way to determine a unique “right” order without looking at mtimes or ctimes.

Looking at the source code for DirectorySnapshotDiff, the event types are categorised as follows:

  1. Created event:
    The inode is unique to the new snapshot. The path may be unique to the new snapshot or exist in both. In the second case, there will be a preceding Deleted event or a Moved event with the path as starting point (the old item was deleted or moved away).

  2. Deleted event:
    The inode is unique to the old snapshot. The path may be unique to the old snapshot or exist in both. In the second case, there will be a subsequent Created event or a Moved event with the path as end point (something else was created at or moved to the location).

  3. Moved event:
    The inode exists in both snapshots but with different paths.

  4. Modified event:
    The inode exists in both snapshots and the mtime or file size are different. DirectorySnapshotDiff will always use the inode’s path from the old snapshot.

From the above classification, there can be at most two created/deleted/moved events that share the same path in one snapshot diff:

Deleted(path1) + Created(path1)
Moved(path1, path2) + Created(path1)
Deleted(path1) + Moved(path0, path1)

And any Modified event will come before a Moved event or stand alone. Modified events will never be combined by themselves with created or deleted events because they require the inode to be present in both snapshots.
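The classification above can be sketched on plain dictionaries. This is a simplification for illustration, not the actual DirectorySnapshotDiff code: snapshots are assumed to be `{path: inode}` mappings without hard links, Modified events are left out, and the function name `classify` is made up.

```python
def classify(old, new):
    """Classify changes between two {path: inode} snapshots.

    Returns (deleted, created, moved), where moved holds (src, dest) pairs.
    """
    new_paths_by_inode = {inode: path for path, inode in new.items()}

    # A path counts as deleted/created if it vanished/appeared, or if it is
    # present in both snapshots but now refers to a different inode.
    deleted = {p for p in old if p not in new or old[p] != new[p]}
    created = {p for p in new if p not in old or old[p] != new[p]}

    # An inode that survived under another path is a move, not a delete
    # followed by a create.
    moved = set()
    for path in sorted(deleted):
        new_path = new_paths_by_inode.get(old[path])
        if new_path is not None:
            moved.add((path, new_path))
    deleted -= {src for src, _ in moved}
    created -= {dest for _, dest in moved}
    return deleted, created, moved


# Snapshot taken between steps 2 and 3 of the save sequence discussed earlier.
print(classify({"doc": 1}, {"doc": 2, "doc.tmp": 1}))
```

For the save example this yields no Deleted event, a Created event for `doc` and a Moved event from `doc` to `doc.tmp`, i.e. the Moved(path1, path2) + Created(path1) combination.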

From the above, we could achieve correct ordering for every path by always adding Deleted events to the queue first, Modified events second, Moved events third and Created events last. The ordering won’t be correct between unrelated paths but that hardly matters for most applications.
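As a rough sketch of that proposal, events could be sorted by a fixed per-type priority. The tuple representation `(kind, src_path, dest_path)` and the names `PRIORITY` and `order_events` are assumptions for illustration; the real emitter queues watchdog event objects.

```python
# Hypothetical event representation: (kind, src_path, dest_path_or_None).
PRIORITY = {"deleted": 0, "modified": 1, "moved": 2, "created": 3}


def order_events(events):
    """Stable sort into Deleted, Modified, Moved, Created order."""
    return sorted(events, key=lambda event: PRIORITY[event[0]])


events = [
    ("created", "doc.txt", None),
    ("moved", "doc.txt", "doc.txt.tmp"),
]
ordered = order_events(events)
print(ordered)
```

Because `sorted` is stable, events of the same kind keep their original relative order.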

Does this make sense or have I overlooked something?

@danilobellini
Collaborator

danilobellini commented Feb 10, 2020

[...] The ordering won’t be correct between unrelated paths but that hardly matters for most applications.

In general, I wouldn't say that. As of today, applications shouldn't rely on the event order when using the polling observer or the FSEvents observer, but the goal of this issue is "letting the order matter". One could also argue that creation-deletion order consistency hardly matters for most applications, and that wouldn't "close the issue".

[...] we could achieve correct ordering for every path by always adding Deleted events to the queue first, Modified events second, Moved events third and Created events last. [...]

That would be similar to the current polling/FSEvents solution (the only difference is that Moved are currently enqueued after Created). That's quite easy to do (in both polling and FSEvents), but I wonder if it's enough.

If a solution isn't able to create a certain output sequence, then it's wrong, and that's why I think this issue matters. For a single path, let's say we have (C) for creation, (D) for deletion, (M) for modification and (R) for moving/renaming. In general, all possible consistent sequences are like:

  • C-M-D
  • D-C-M

Or fragments of these, or multiple "M" in a row instead of just one, or an R "to itself" (old and new paths are the same) instead of M, or an R instead of a C ("creating" something in the new path), or an R instead of a D ("removing" what was in the old path), or joined fragments of these. But most of these, as well as the case of more than one modification in a row, wouldn't be caught by a single snapshot diff comparison. At first glance, not even C-D or D-C would be: "C" would appear only when a file exists in the new snapshot but not in the old one, and "D" only in the reversed case, so a path comparison alone will never yield both for a single path, although we can detect this case using the referenced inode.

But that's confusing, so let's try to summarize it. A single snapshot diff can only yield, for a single path (Rold and Rnew are just R, named after whether the path in question is the old or the new path of the move):

  • C (old snapshot doesn't have the path, new snapshot has the path, with no reference to the inode in the old snapshot)
  • D (old snapshot has the path, new snapshot doesn't have the path, with no reference to the inode in the new snapshot)
  • D-C (both snapshots have the path, but the inodes aren't the same and the inode from a snapshot isn't assigned to another path in the other snapshot)
  • M (both snapshots have the path and the inodes are the same, but the mtime or size changed)
  • R to itself (both snapshots have the path, the ctime changed, but the inode, mtime and size are the same) - as of today, these events are missing
  • D-Rnew (same as D-C, but the old snapshot has another path for that inode found in the new snapshot) - as of today, the D is missing in this case, the path is simply overwritten, and a misleading M might be emitted
  • D-M-Rnew (same as D-Rnew, but the new and old snapshots of the same inode don't have the same mtime or size, and the new ctime is equal to the new mtime) - as of today, the D is missing in this case, and the M usually gets emitted but it's not comparing the same file
  • D-Rnew-M (same as D-M-Rnew, but the new ctime isn't equal to the new mtime) - same comment as for D-M-Rnew
  • Rold-C (same as D-C, but the new snapshot has another path for that inode found in the old snapshot) - as of today, the C is missing in this case, although the file exists in the path at the end, and a misleading M might be emitted
  • M-Rold-C (same as Rold-C, but the new and old snapshots of the same inode don't have the same mtime or size, and the new ctime is equal to the new mtime) - as of today, the C is missing in this case, and the M usually gets emitted but it's not comparing the same file
  • Rold-M-C (same as M-Rold-C, but the new ctime isn't equal to the new mtime) - same comment as for M-Rold-C

Cases that might never actually happen due to the lack of information:

  • Rold (old snapshot has the path, new snapshot doesn't have the path, but the inode has another reference in a not watched path) - this might be seen as D, if we can't find paths from inodes
  • Rnew (new snapshot has the path, old snapshot doesn't have, but the inode had another reference in a not watched path) - this might be seen as a C, if we don't have a way to check for the past inode reference

(Likewise, M-Rold, Rold-M, M-Rnew and Rnew-M might happen if we can compare the mtime, size and ctime of the two distinct references).

Currently, DirectorySnapshotDiff checks whether the mtime or size changed, and nothing checks the ctime. That makes a lot of sense for "M" events, but we're already losing some "R" events: a fast "rename A to B then B to A" would be ignored if no snapshot is taken in between, but we could catch it through an equality comparison on ctime. When an "R" event is found, the current implementation simply ignores everything else for that path in the combinations above, yielding just the "R" event and perhaps a misleading "M" event.

That said, for the current solution (already merged since long ago), having the D before the C already makes the D-C order consistent. The issues regarding the consistency or "out of order" stuff are actually about missing events in watchdog.utils.dirsnapshot.DirectorySnapshotDiff (like the Rold-C and D-Rnew) and misleading "M" events.

Enforcing the "D-M-R-C" order isn't a proper/correct solution, although I agree it would indeed yield the correct order for most cases. But, as there's no possible R-C output yet, it wouldn't make any difference. The DirectorySnapshotDiff should be fixed to handle all cases described above, and perhaps it might include the event order information as well, which would fix and refactor the polling and FSEvents observers.

Also, there's one piece of information missing here: hard links (more than one path referencing the same inode) might break the current implementation. In stat, the number of links referencing an inode can be seen with stat -c%h (Linux) or stat -f%l (BSD-based systems). While moving a file, the application might create the new reference before removing the old one, and a snapshot made in the "middle" of that process would emit a Moved event where the old file still exists, instead of emitting just a Created event. The current implementation isn't prepared to handle that. A single path pointing to the same inode in both snapshots must not be taken into consideration when comparing the rest of the paths, and matched pairs of inodes should not get mixed either. For example, files x and y might point to the same inode and be moved to z and w: it doesn't matter whether the events are R(x, z), R(y, w) or R(x, w), R(y, z), as we have no way to check that, but they are not R(x, z), R(y, z).
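The hard-link situation can be reproduced from Python via `os.link` and the `st_nlink` field, which is the same count that the stat commands above print. This sketch assumes the temporary directory lives on a file system that supports hard links. The `inode_to_path` dictionary is only an illustration of how an inode-keyed index loses one of the two paths; it is not taken from the watchdog source.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "x")
    alias = os.path.join(tmp, "y")
    with open(original, "w") as f:
        f.write("data")
    os.link(original, alias)  # second path referencing the same inode

    st_original = os.stat(original)
    st_alias = os.stat(alias)
    same_inode = st_original.st_ino == st_alias.st_ino
    link_count = st_original.st_nlink

    # An index keyed by inode can only remember one path per inode.
    inode_to_path = {os.stat(p).st_ino: p for p in (original, alias)}

print(same_inode, link_count, len(inode_to_path))
```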

@samschott
Contributor Author

samschott commented Feb 10, 2020

I agree, the order does matter. I was merely trying to find a solution which is easy to implement and which generates the correct order for events which are related by path. Of course, having the correct order for all events would be preferable and would put the FSObserver and the polling observer on par with the other observers.

I also really like the idea of sorting all events in the diff and not the observer. But as you previously pointed out, how does one get the deletion time from a snapshot?

An easier goal than absolute ordering may be the following: if a user applied all events in the order reported by the FSObserver, could they create the new snapshot from the old snapshot? If yes, I would consider the event order to be "consistent".
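This notion of consistency can be written down as a small replay check. The sketch below is an assumption-laden model: snapshots are `{path: inode}` dictionaries, events are hypothetical `(kind, src, dest, inode)` tuples, and creating or moving onto an occupied path is treated as an inconsistency rather than as an overwrite.

```python
def replay(snapshot, events):
    """Apply events in order to a {path: inode} mapping.

    Returns the resulting mapping, or None as soon as an event cannot be
    applied (e.g. creating a path that already exists).
    """
    state = dict(snapshot)
    for kind, src, dest, inode in events:
        if kind == "created":
            if src in state:
                return None
            state[src] = inode
        elif kind == "deleted":
            if src not in state:
                return None
            del state[src]
        elif kind == "moved":
            if src not in state or dest in state:
                return None
            state[dest] = state.pop(src)
        # "modified" events do not change the path -> inode mapping.
    return state


def is_consistent(old, new, events):
    """True if replaying the events on the old snapshot yields the new one."""
    return replay(old, events) == new


old = {"doc": 1}
new = {"doc": 2, "doc.tmp": 1}
created = ("created", "doc", None, 2)
moved = ("moved", "doc", "doc.tmp", None)
print(is_consistent(old, new, [moved, created]))
print(is_consistent(old, new, [created, moved]))
```

In this model the R-C order from the file saving example replays cleanly, while the C-R order fails at the first event because `doc` still exists.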

Enforcing the "D-M-R-C" order isn't a proper/correct solution, although I agree it would indeed yield the correct order for most cases. But, as there's no possible R-C output yet, it wouldn't make any difference.

There is already possible (false) C-R output at the moment, which would be reordered to (correct) R-C output by enforcing the "D-M-R-C" order. This was the reason for this PR, as given in the file saving example. Or maybe I misunderstood your comment?


Going through your analysis of possible event chains, I can follow most of it. Just a few points:

  • D-Rnew: I think this should already be possible at the moment. The diff will find the path in both snapshots but with different inodes and therefore add D and C events (lines 99 - 102). It will then iterate through all C events and check for their inode in the old snapshot. When finding a match, it will remove the C event and add a R event (lines 114 - 119). We end up with D-Rnew.

  • D-M-Rnew: Again, this should be reported correctly. The diff will follow the above process and finally check all R events for modifications by looking at mtime and size (lines 129 - 131).

  • D-Rnew-M: As of today, this will be reported as D-M-Rnew because the M event will have the old path and the FSObserver will order M events before R events. This would be fixed by looking at the ctime.

  • Rold-C: According to my understanding, the diff will find the path in both snapshots but with different inodes and therefore add D and C events (lines 99 - 102). It will then iterate through all D events and check for their inode in the new snapshot. When finding a match, it will remove the D event and add a R event (lines 106 - 112). Therefore, Rold and C are both added. But the FSObserver will order them as C-Rold. This would be fixed by my suggestion.

  • M-Rold-C: Same as above, reported by FSObserver as M-C-Rold. This would be fixed by my suggestion.

  • Rold-M-C: I am not sure what you mean by this sequence. Is it R(path1, path2)-M(path2)-C(path1)? If so, this would be fixed by looking at the ctime.

To summarise, I think the cases D-Rnew and D-M-Rnew are already handled correctly at the moment and one could simply fix Rold-C and M-Rold-C by enforcing a "D-M-R-C" order. However, correctly ordering M and R events as well as catching R events where the path did not change (do we need to catch those?) does require looking at the ctime. That said, catching M events properly is difficult in the first place because we can only ever catch the last one by looking at the mtime. Therefore my suggestion was to just always order them before R events, as they already use the old snapshot path in the current implementation.

Cases that might never actually happen due to the lack of information

Do you think those cases should be reported as R events? If I remember correctly, the documentation states that items moved from or to the observed path will register as D / C. But some use cases may benefit from distinguishing those from actual D / C events

@danilobellini
Collaborator

Indeed, the current code removes the deletion/creation when finding an R, but not both D/C at once, so, yes, most cases are handled by a simple D-M-R-C ordering (not based on paths).

There's one more case I was missing, a "double move", R(x, y) - R(w, x), but it's already handled, although its order might be... unknown! This is the case when both D and C are removed. In the worst case we can have a circular chain of renamed files R(a, b) + R(b, c) + R(c, a), but there seems to be no "correct order" for that (it reminds me of the Condorcet paradox).

[...] However, correctly ordering M and R events as well as catching R events where the path did not change (do we need to catch those?) does require looking at the ctime. [...]

I think we need to catch those. What I don't know is if these should be regarded as an R(path1, path1) or an M(path1).

Looking at the ctime isn't a big issue, and it's required to properly detect the M of a moved file. As of today, it always adds the M assuming the file was still in the old path when the modification happened, but that's true only when the ctime and mtime are the same in the new path. Currently, the "M" might be misleading.

Rold-M-C: I am not sure what you mean by this sequence. Is it R(path1, path2)-M(path2)-C(path1)?

Yes. These events with modifications are confusing as they are multi-path:

  • D(path2) - M(path1) - R(path1, path2)
  • D(path2) - R(path1, path2) - M(path2)
  • M(path1) - R(path1, path2) - C(path1)
  • R(path1, path2) - M(path2) - C(path1)

An "extended D-M-R-C" ordering based on paths like the single-diff-sequences above:

  • D(path2) - M(path1) - R(path1, path2) - M(path2) - C(path1)

Assuming multiple chained R, perhaps we can even detect multiple "M" of a single path (the old from the first R and the new from the second R). An ordering for a single step would be like (An M(path3) might appear anywhere after the first R):

  • D(path3) - M(path2) - R(path2, path3) - M(path1) - R(path1, path2) - M(path2) - C(path1)

Do you think those cases should be reported as R events? If I remember correctly, the documentation states that items moved from or to the observed path will register as D / C. But some use cases may benefit from distinguishing those from actual D / C events

I don't know if it's possible. I think we should report as R only if we can find the other path to report these as moved files without collecting the metadata from the entire file system all the time.

@samschott
Contributor Author

Looking at the ctime isn't a big issue, and it's required to properly detect the M of a moved file.

It may be an issue if we want to implement it in the snapshot diff because Windows reports the "created time" and not the "metadata changed time" as ctime according to the Python docs. In fact, I don't know how to get the "metadata changed time" at all in Windows. Also on Windows, with FAT or FAT32 file systems, the time resolution of ctime and mtime is only 2 sec. I suspect that this is the reason for the additional file size check in the diff.

I see the following options:

  1. Don't use ctime and mtime for ordering but enforce a "D-M-R-C" or "D-R-M-C" order. Order "overlapping" R events R(x,y)-R(w,x) according to path. This can be done either in the snapshot diff itself or in the FSObserver and polling observer. Advantages: It generates a list of events which can be used to recreate the file system changes when applied in the given order. Disadvantages: R(x,x) events are ignored, M events may have the wrong position. But neither will prevent reproducing the changes.

  2. Same as above, but use the ctime on Unix to catch R(x,x) events in the snapshot diff. Use the ctime in the FSObserver and polling observer in Unix to order successive R events and R / M events. Advantages: We catch R(x,x) events and last-occurred M event correctly. Disadvantages: Introduces a platform dependence of the snapshot diff.

  3. Sort all deletions first and remaining changes according to mtime for M / C or ctime for R. Advantages: Almost global ordering. Disadvantages: This cannot be done on Windows and may be best handled in the FSObserver. The polling observer could get platform-dependent implementations.

I still tend to prefer the first option, but will happily try to implement the others as well. Also, thinking about "D-M-R-C" vs "D-R-M-C" ordering, the second option may be better because it associates the modification with the current path and not the old one.

@samschott
Contributor Author

This certainly is a more complex issue than what I naively assumed when submitting the PR...

@danilobellini
Collaborator

It may be an issue if we want to implement it in the snapshot diff because Windows reports the "created time" and not the "metadata changed time" as ctime according to the Python docs. [...]

That's indeed an issue. However, AFAIK, the Linux and Windows observers don't use the DirectorySnapshot, only the polling/FSEvents/kqueue require it. In Windows one might run watchdog on a network directory with a polling observer, that would require a correct implementation, but for most cases I think this won't make much of a difference (although it would be better if we can use the "metadata changed time" even in Windows).

In fact, I don't know how to get the "metadata changed time" at all in Windows.

Perhaps with something like NtQueryInformationFile, there's a ChangeTime in the FILE_BASIC_INFORMATION structure.

Also on Windows, with FAT or FAT32 file systems, the time resolution of ctime and mtime is only 2 sec. I suspect that this is the reason for the additional file size check in the diff.

That makes sense, but that's an issue even on other operating systems (e.g. running watchdog on mounted FAT32 pendrives). And perhaps not all file systems store the "metadata changed time", we still need to know how it behaves in these cases.

[...] Also, thinking about "D-M-R-C" vs "D-R-M-C" ordering, the second option may be better because it associates the modification with the current path and not the old one.

Agreed! That should be simple to fix, but the cases where the ctime doesn't exist would probably benefit from the D-R-M-C ordering as a new default.

  1. [...] Order "overlapping" R events R(x,y)-R(w,x) according to path. [...]

I think that's not as simple as it looks. Perhaps we should create the concept of an "empty-named file" or something alike to break cycles, so R(a, empty) - R(b, a) - R(empty, b) would solve the "swapped a and b" case where R(a, b) - R(b, a) would be misleading.

Advantages: It generates a list of events which can be used to recreate the file system changes when applied in the given order.

We're still missing a lot of stuff (access rights metadata, file contents, etc.), but that's still a great way of telling the goal of creating an "ordered diff".

Disadvantages: Introduces a platform dependence of the snapshot diff.

That's not an issue, and not something to be avoided.

Sort all deletions first and remaining changes according to mtime for M / C or ctime for R.

Using mtime might be wrong at least for "C" events. For example, try unzipping a file in a path: the newly created files have the mtime from the zipped contents, and only the ctime is set to the "unzipping time". Probably the ctime should be seen as the default timestamp reference for all events, if it's available.
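The unzip effect can be imitated without an archive by restoring an old modification time with `os.utime`, which is roughly what extraction tools do. The timestamp constant and the file name are arbitrary choices for this sketch.

```python
import os
import tempfile

OLD_TIMESTAMP = 946684800  # 2000-01-01 00:00:00 UTC, an arbitrary old date

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "extracted.txt")
    with open(path, "w") as f:
        f.write("contents")
    # Mimic an archive tool restoring the stored modification time.
    os.utime(path, (OLD_TIMESTAMP, OLD_TIMESTAMP))
    st = os.stat(path)
    mtime, ctime = st.st_mtime, st.st_ctime

# The mtime claims the file is from 2000, the ctime reflects "now".
print(mtime, ctime)
```

Sorting a freshly created file by such an mtime would place its Created event far in the past.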

The deletions might need some kind of path ordering. One can't delete a file a/b/c.txt after the directory a had been deleted.
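A simple way to satisfy this constraint, sketched under the assumption of POSIX-style paths, is to emit the deepest paths first. The helper name `order_deletions` is hypothetical.

```python
from pathlib import PurePosixPath


def order_deletions(paths):
    """Sort paths so that children come before their parent directories."""
    return sorted(paths, key=lambda p: len(PurePosixPath(p).parts), reverse=True)


ordered = order_deletions(["a", "a/b", "a/b/c.txt"])
print(ordered)
```

Depth alone does not say anything about the relative order of unrelated paths; it only guarantees that no directory is deleted before its contents.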

Probably we should have the graph of related (path-based) events like the D(2)-R(1,2)-M(2)-C(1) and R(x, y)-R(w,x), which will enforce constraints for ordering. The ctime would be just another constraint, when available, but it should be secondary.

This certainly is a more complex issue than what I naively assumed when submitting the PR...

And way more complex than I naively assumed when first reviewing ahuahuah

@samschott
Contributor Author

samschott commented Feb 12, 2020

However, AFAIK, the Linux and Windows observers don't use the DirectorySnapshot, only the polling/FSEvents/kqueue require it.

Agreed, this should not be a priority but would be nice to get right. I am not sure about the ChangeTime in FILE_BASIC_INFORMATION, I somehow suspect that this may be the mtime. But one should look into that, it would be surprising if there was no way to query the information in Windows.

I think that's not as simple as it looks. Perhaps we should create the concept of an "empty-named file"

Yes, I had overlooked the ambiguity in reporting R(a, b) - R(b, a). The dummy filename would be an option here but it's a bit clumsy. An alternative could be to report this as individual M(a) - M(b). This however obfuscates the actual events.


I have implemented a first (rough) attempt at sorting in the snapshot diff. As of yet, it does not handle file swaps or ordering M events relative to R events. But it may be a basis for discussion.

Another issue which I have come across: When a directory is moved with all of its content, which do we report first? The moved children or the parent?

@@ -123,12 +124,12 @@ def get_inode(directory, full_path):
         modified = set()
         for path in ref.paths & snapshot.paths:
             if get_inode(ref, path) == get_inode(snapshot, path):
-                if ref.mtime(path) != snapshot.mtime(path) or ref.size(path) != snapshot.size(path):
+                if ref.ctime(path) != snapshot.ctime(path) or ref.size(path) != snapshot.size(path):
Contributor Author


This will catch metadata changes as M events.

@danilobellini
Collaborator

[...] I am not sure about the ChangeTime in FILE_BASIC_INFORMATION, I somehow suspect that this may be the mtime. [...]

Perhaps it is the mtime in file systems that don't store both ctime and mtime (are there any?). I think the ChangeTime is the ctime because the same structure has a LastWriteTime field.

Another issue which I have come across: When a directory is moved with all of its content, which do we report first? The moved children or the parent?

Good question, as "neither" would be the correct answer. As the goal is to create an order of events that would reproduce the new snapshot from the old one, a moved directory should have this ordering:

  • C(new_dir)
  • R(old_dir/1, new_dir/1)
  • ... R(old_dir/i, new_dir/i) ...
  • R(old_dir/n, new_dir/n)
  • D(old_dir)
  • M(new_dir) # Restore metadata from the old_dir, except the ctime

But that's not how inotify works on Linux. When inotify detects a moved directory, it emits an R(old_dir, new_dir), then a new low level watcher is created for the new directory. For the inotify observer, the first C above is currently an R, and there's no D. If the event order becomes valuable information, this should be updated in the inotify observer as well (I mean, for a next release where the "event ordering consistency" becomes a feature).
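The replay-oriented ordering for a moved directory (C, then the R events for the children, then D and M) could be generated along these lines. This is only a sketch: the tuple format and the helper `expand_directory_move` are hypothetical, and only direct children are handled.

```python
def expand_directory_move(old_dir, new_dir, children):
    """Return a replayable event list for moving a directory.

    `children` are entry names relative to old_dir (one level only).
    """
    events = [("created", new_dir, None)]
    events += [
        ("moved", f"{old_dir}/{name}", f"{new_dir}/{name}") for name in children
    ]
    events.append(("deleted", old_dir, None))
    # Stands for restoring the directory metadata, except the ctime.
    events.append(("modified", new_dir, None))
    return events


for event in expand_directory_move("old_dir", "new_dir", ["1", "2"]):
    print(event)
```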

Yes, I had overlooked the ambiguity in reporting R(a, b) - R(b, a). The dummy filename would be an option here but it's a bit clumsy. An alternative could be to report this as individual M(a) - M(b). This however obfuscates the actual events.

The pair R(a, b) + R(b, a) is just the 2-files example of a more general problem: the "cycle" of renamed files whose temporary file name in between was lost. The graph cycle might have one file (the R(x, x)), 3 files (R(a, b) + R(b, c) + R(c, a)), and so on.

A dummy filename is almost unavoidable in these cases, as we have a chain of events (the renaming graph). AFAIK, the empty filename isn't a valid filename, yet it's still a string, so it's a possible choice to avoid breaking things besides the graph cycle.

I would not report these as M. Perhaps one can use a sequence like D(b) - R(a, b) - C(a) - M(a) to describe the swapped names by breaking the chain in one single file using the "directory move pattern" above, so only a single file in the cycle gets "destroyed and recreated". But I still think R(b, empty) - R(a, b) - R(empty, a) is a better solution, assuming there will be some documentation telling that the empty filename refers to a broken loop of renamed files, where a temporary file name exists but it's unknown.

For breaking the chain we can also find the point where it should be broken. As the last R is the one from the temporary (or "empty") filename, it will have a bigger ctime. If no ctime is available, some other strategy might be used.
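A possible sketch of the placeholder idea: order renames so that each one can be replayed, and when only cycles remain, park one file under the empty name. It assumes every path occurs at most once as a source and at most once as a destination. The break point is chosen arbitrarily here (the smallest source path), not by ctime, and `order_moves` and `PLACEHOLDER` are made-up names.

```python
PLACEHOLDER = ""  # stands for the unknown temporary file name


def order_moves(moves):
    """Order (src, dest) renames so they can be replayed one after another."""
    pending = dict(moves)  # src -> dest
    ordered = []
    while pending:
        # A rename is safe once nothing still has to move away from its dest.
        ready = sorted(src for src, dest in pending.items() if dest not in pending)
        if ready:
            for src in ready:
                ordered.append((src, pending.pop(src)))
            continue
        # Only cycles remain: move one file to the placeholder name first.
        src = min(pending)
        dest = pending.pop(src)
        ordered.append((src, PLACEHOLDER))
        pending[PLACEHOLDER] = dest
    return ordered


print(order_moves([("a", "b"), ("b", "a")]))
print(order_moves([("a", "b"), ("b", "c")]))
```

For the swapped pair this produces R(a, empty) - R(b, a) - R(empty, b), and a plain chain such as R(a, b) + R(b, c) is simply reordered without any placeholder.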

@samschott samschott closed this Dec 10, 2020
@samschott samschott deleted the patch-1 branch December 10, 2020 12:34