Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make a distinction between content and metadata changes #260

Open
adiroiban opened this issue Jul 25, 2014 · 24 comments
Open

Make a distinction between content and metadata changes #260

adiroiban opened this issue Jul 25, 2014 · 24 comments

Comments

@adiroiban
Copy link
Contributor

Maybe I am doing something wrong... but I have an use case, in which I only care about file content ... I don't want to trigger the handler when file attributes or file security changes.

I know that in other cases, those informations are very important, but since both Inotify and readdirectorychanges provides the options to filter events from the source, maybe it would make sense to add some configuration in watchdog.

In the same time, directory poller, only raise changes for file content and not metadata changes.

Maybe this API would help and make sense

# everything... just like before
observer.subscribe()
#only content
observer.subscribeContent()
#only metadata
observer.subscribeMetadata()

when calling observer.subscribeMetadata() on directory poller, we can either raise an error that this is not supported, or create a special directory snapshot which polls for security data.

What do you think?

@tamland
Copy link
Collaborator

tamland commented Nov 24, 2014

I think it's better to split the 'modified' event into two events. Has to be investigated if this can be implemented on all platforms first.

@adiroiban
Copy link
Contributor Author

Filtering at a lower level should improve performance as the OS will send less events and watchdog will have to filter/ignore fewer events.

Maybe there is a way to group events and then pass the groups to subscribe


def subscribe(events=ALL):
   pass

observer.subscribe(CONTENT | METADATA)

@GreatBahram
Copy link

GreatBahram commented Apr 13, 2021

Why do you treat both contents and attribute change the same way?

elif event.is_attrib:
cls = DirModifiedEvent if event.is_directory else FileModifiedEvent
self.queue_event(cls(src_path))
elif event.is_modify:
cls = DirModifiedEvent if event.is_directory else FileModifiedEvent
self.queue_event(cls(src_path))

Should I submit a pull request and fix this?

@samschott

@BoboTiG
Copy link
Collaborator

BoboTiG commented Apr 13, 2021

@GreatBahram yes go ahead. Introduce new events for attributes changes. I guess such change must be done on all platforms.

@GreatBahram
Copy link

Thanks, I'll fix it then.

@samschott
Copy link
Contributor

I like the idea! Though it may be difficult to implement for the Windows and Polling backends:

Polling observer: we cannot rely on the mtime to detect file modifications (even though we currently do...) since it can be changed by the user. However, the ctime is incremented whenever the content or metadata changes, making it impossible to distinguish between the two without reverting to something like content hashes.

Windows: According to the docs here, events that are written to the buffer by ReadDirectoryChangesW don't seem to distinguish between metadata and content modification.

If this cannot be implemented on all backends, I am wondering if it still makes sense to implement for inotify and fsevents. As it stands, there already are many (undocumented) differences between backends and I would rather strive toward uniform behaviour, even if it means that the API is limited by the least versatile backend. But of course this is not my decision to make. @BoboTiG, what do you think?

@BoboTiG
Copy link
Collaborator

BoboTiG commented Apr 14, 2021

I am 100% OK with you @samschott. I was hoping we could have universal new events. If it is not possible, I would rather not make APIs more and more divergent.

@BoboTiG
Copy link
Collaborator

BoboTiG commented Apr 17, 2021

After consideration, WDYT of simply adding FileModifiedEvent.attributes_only and DirModifiedEvent.attributes_only attrs? It would not change the default behavior and still allowing one to use them.

I am thinking of that because it is useful to be able to filter on such changes.
It makes me think of the IN_CLOSE_WRITE and its related FileClosedEvent event: it is only for Linux, and at the same time it is such a feature that it is interesting to have it, even if other backends will never fire such event. If the change is worth, I would say let's do this.

@GreatBahram
Copy link

GreatBahram commented Apr 18, 2021

I think it's better to expose these features, close_write and attrib events even if it is only available on Linux, and let users themselves decide to use these functionalities or not.
For instance, it is costly not to be able to differentiate between attribute change and file modification, and I have to upload a big file instead of just updating some metadata. Using a simple attribute such as attributes_only is a great idea as it only returns True only on Linux.

What's the final decision here?

@samschott
Copy link
Contributor

@BoboTiG makes the calls, I am merely an opinionated contributor :)

I do agree that it would be a great feature to have and I'm not saying that's impossible on Windows (I have very little experience there). I just have a slight preference towards a uniform API, or at last a clear documentation of the differences between backends. But documentation is another issue by itself...

@BoboTiG
Copy link
Collaborator

BoboTiG commented Apr 18, 2021

@GreatBahram go with attributes_only attributes. Let's hope you can figure out how to do it for all backends :)

@GreatBahram
Copy link

Hello @BoboTiG,

I've read the source code and made some changes to support the attrib event. And I realized with the current implementation; it is possible to have the attrib event on all platforms except for windows which I'm still struggling with...

And I was able to test it on Linux (notify) and FreeBSD (kqueue), but I couldn't find any way to run mac os on a virtual machine, and I'm wondering is there any simple way to do this? If you have any ideas, please let me know. There were two fsevents modules, and I searched through Github issues and spotted some issues about making fsevents2 the default observer on mac os. So I only add the changes to the fsevents2 module while it is possible to do the same on fsevents too.

  • One design question here:
    After playing with the changes a little bit, I thought it is better to have two new events and a simple on_attrib method to support this functionality instead of what we discussed here because I believe it prevents developers from if-else like coding. We could implement it this way:
class FileAttribEvent(FileSystemEvent):
    """
    File system event representing file metadata modification on the file system.
    """
    event_type = EVENT_TYPE_ATTRIB


class DirAttribEvent(FileSystemEvent):
    """
    File system event representing directory metadata modification on the file system.
    """
    event_type = EVENT_TYPE_ATTRIB
    is_directory = True

class FileSystemHandler:
     ...
    def on_attrib(self, event):
        """Called when a file or directory metadata is modified.
        :param event:
            Event representing file/directory metadata modification.
        :type event:
            :class:`FileAttribEvent` or :class:`DirAttribEvent`
        """

class InotifyEmitter:
....
            elif event.is_attrib:
                cls = DirAttribEvent if event.is_directory else FileAttribEvent
                self.queue_event(cls(src_path))


class KqueueEmitter:
....
        elif is_attrib_modified(kev):
            if descriptor.is_directory:
                yield DirAttribEvent(src_path)
            else:
                yield FileAttribEvent(src_path)

clsss FSEventsEmitter:
....
            elif event.is_inode_meta_mod or event.is_xattr_mod:
                cls = DirAttribEvent if event.is_directory else FileAttribEvent
                self.queue_event(cls(event.path))

And I've also updated the LoggingEventHandler to support to_attrib method:

    def on_attrib(self, event):
        super().on_attrib(event)

        what = 'directory' if event.is_directory else 'file'
        self.logger.info("Attrib %s: %s", what, event.src_path)

I know it might seem like breaking the uniformity across the project, but I think the community will accept this feature because it is not a big deal and make sense to have a separate function for this event. At the same time, there is no burden on windows developers because even if we cannot support this feature, using ReadDirectoryChangesW, there is gonna be no breakage on their source code.

Please let me know if you think this is ok or should I stick with the attributes_only attribute.

And Do you have any idea how I should test the fsevents module while I don't have access to a mac os computer? I couldn't even find a proper virtual machine image for it...

Thanks

@GreatBahram
Copy link

I've checked the windows library itself, and also this go library which is a cross-platform, and it seems to me it is not possible to get attrib event inside the windows platform. If you have any points or comment on this, please let me know.

Finally, If you agree on the implementation, I want to add some test cases to cover these new notifications.

@BoboTiG
Copy link
Collaborator

BoboTiG commented Jun 1, 2021

Nice work! Let's move forward with your idea :)

It will be a breaking patch, so we will just bump the major version number.

Couple of notes:

  • Add changes to both fsevent ans fsevents2.
  • You can run macOS tests by just pushing changes to your PR: tests will be run on all Oses.
  • Add tests and fix current failures.
  • Let me know on the PR when you are ready for a complete review.

And a big thank you for your time :)

@howardjones
Copy link

howardjones commented Jun 18, 2021

For Windows, it might not be possible to differentiate attribute and content changes, but it IS possible to filter which events you want. I'm getting "change" events for a directory of images, that I'm pretty sure are actually just the "Last Accessed" date being updated on read.

https://docs.microsoft.com/en-us/windows/win32/api/winbase/nf-winbase-readdirectorychangesw

Throwing together a quick C# test shows that with the default "everything" filter, I get the same kind of behaviour as watchdog, but with LastAccess removed from the filter, I get what I really wanted.

@samschott
Copy link
Contributor

So one could theoretically start two watches, one that only picks up content modified events and another that only picks up metadata changes. It will be a bit more resource heavy but we would maintain a uniform API across backends.

@howardjones
Copy link

That's what I wasn't sure about - I assume it's a bit less efficient to do it that way. Although the differentiation I'm looking in my application (basically a grunt watch type thing for image processing) for is not quite metadata vs content - it's that LastAccess change specifically that trips me up. Maybe an exclude option? (but then making that uniform becomes the issue)

@samschott
Copy link
Contributor

Providing an exclude option across all backends should be easy, you can either filter events as they come in or directly subscribe to only some types of events on pretty much all backends. However, it would go somewhat against the current API design where all events are always emitted and the event handler then decides how to process different types of events.

When I say "metadata" I mean anything that does not affect the location or the binary content of the file. For example, looking at the list of provided notifications in Windows, that would be CHANGE_ATTRIBUTES, CHANGE_LAST_ACCESS, CHANGE_SECURITY. In macOS, this would mean is_inode_meta_mod and is_xattr_mod. Would that cover your use case?

@howardjones
Copy link

Is that actually the possible events in Windows? It seems to just be Modified, regardless of what changed. (looking at watchdog.observers.winapi.WinAPINativeEvent). Unless there is something deeper in there that's aggregating multiple event types.

My current (working) monkey-patch for this is:

    watchdog.observers.winapi.WATCHDOG_FILE_NOTIFY_FLAGS = reduce(
        lambda x, y: x | y, [
            FILE_NOTIFY_CHANGE_FILE_NAME,
            FILE_NOTIFY_CHANGE_DIR_NAME,
            FILE_NOTIFY_CHANGE_ATTRIBUTES,
            FILE_NOTIFY_CHANGE_SIZE,
            FILE_NOTIFY_CHANGE_LAST_WRITE,
            FILE_NOTIFY_CHANGE_SECURITY,
            # FILE_NOTIFY_CHANGE_LAST_ACCESS,
            FILE_NOTIFY_CHANGE_CREATION,
        ])
    
    event_handler = LoggingEventHandler()
    observer = Observer()
    observer.schedule(event_handler, path, recursive=True)
    observer.start()

which gets me to where I want to be, although obviously not very elegantly.

@samschott
Copy link
Contributor

Indeed, the emitted event type would always be "modified". I was rather proposing a change to the Windows backend to start two watches:

  1. First watch sets all NOTIFY_FLAGS except for CHANGE_ATTRIBUTES, CHANGE_LAST_ACCESS, CHANGE_SECURITY. Reading carefully through the docs, it appears that CHANGE_CREATION should also be excluded because it signifies only a change in the creation time stamp. The actual file creation would be recorded when setting the flag CHANGE_FILE_NAME.
  2. Second watch sets only the flags which are not set for the first watch.

That way, we can ensure that any "modified" events from the first watch are content changes and all "modified" events from the second watch are attribute / metadata changes.

@howardjones
Copy link

Sorry - misunderstood which layer you were talking about. Yes, that would work. Although even then I think the option to filter LAST_ACCESS (or others) from the second is useful, in lieu of a way to know what event it was. My use case is a file server - read accesses are completely normal and uninteresting, and not being able to tell them apart from permissions changes is unhelpful.

@samschott
Copy link
Contributor

That comes back to API design then. We could either:

  1. Tell the Observer which types of events to record in the first place, possibly with a granularity that is finer than the actual subclasses of FileSystemEvent. It sounds like this is what you are proposing.
  2. Record all events and provide different subclasses FileMtimeModifiedEvent, FilePermissionModifiedEvent, FileOpenEvent, etc, down to the smallest granularity that is useful and supported by the backend.

The watchdog API currently tries to implement the second approach and this issue makes a strong case that the provided granularity is insufficient for some use cases.

There can of course still be a discussion about API design, for example providing arguments to the Observer to suppress certain types of events before they are passed on to the Handler. This could marginally improve performance in some cases or reduce the number of required watches in Windows. However, IMO, we should still provide different subclasses of FileSystemEvent for every type of information that is useful to separate (such as FileOpenEvent).

@stef1205
Copy link

Is there a way in Watchdog 2.1.2 to identify xattr only modify events and filter them out ?

Currently I am stuck with v0.10.3 which doesn't generate modify event for metadata changes

@sevenrats
Copy link

for the wandering googler, if your platform has stat, then its trivial to get around this issue by checking if the mtime has changed. such a solution would persist even after #800 has merged.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

8 participants