Allow arbitrary contents managers #24

davidbrochart · 2022-10-21T10:29:08Z

Closes #19.

codecov-commenter · 2022-10-21T10:31:31Z

Codecov Report

Base: 78.74% // Head: 79.39% // Increases project coverage by +0.64% 🎉

Coverage data is based on head (44e6d0e) compared to base (7f28d3c).
Patch coverage: 80.76% of modified lines in pull request are covered.

Additional details and impacted files

@@            Coverage Diff             @@
##             main      #24      +/-   ##
==========================================
+ Coverage   78.74%   79.39%   +0.64%     
==========================================
  Files           5        5              
  Lines         414      495      +81     
  Branches       62       68       +6     
==========================================
+ Hits          326      393      +67     
- Misses         66       80      +14     
  Partials       22       22

Impacted Files	Coverage Δ
jupyter_server_fileid/extension.py	`37.50% <27.27%> (-7.33%)`	⬇️
jupyter_server_fileid/manager.py	`88.47% <85.88%> (-0.69%)`	⬇️
jupyter_server_fileid/pytest_plugin.py	`88.50% <100.00%> (+0.85%)`	⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

davidbrochart · 2022-10-21T10:38:14Z

jupyter_server_fileid/manager.py

+class ArbitraryFileIdManager(AbstractFileIdManager):
+    def __init__(self, *args, **kwargs):
+        pass
+
+    def get_id(self, path: str) -> str:
+        return path
+
+    def get_path(self, id: str) -> str:
+        return id


To be implemented.

dlqqq · 2022-10-21T14:26:14Z

@davidbrochart Thanks for tackling this! In the future however, could you please assign yourself to an issue before beginning work on a PR that addresses it? I want to make sure we're avoiding duplicate effort as much as possible.

dlqqq · 2022-10-21T14:29:11Z

I'd be happy to review this once it's ready and out of draft status.

jupyter_server_fileid/manager.py

jupyter_server_fileid/extension.py

dlqqq · 2022-10-25T00:01:16Z

@davidbrochart Cut a PR to your branch fleshing out the ArbitraryFileIdManager implementation and address Kevin's comments. davidbrochart#1

dlqqq · 2022-10-25T20:38:44Z

@davidbrochart Can I go ahead and rebase to fix merge conflicts?

davidbrochart · 2022-10-25T20:43:11Z

Please do, thanks!

* set default only when config doesn't specify file_id_manager_class * flesh out abstract and arbitrary file ID managers - make root_dir non-configurable - add get_handlers_by_action() method * edit docstring Co-authored-by: David Brochart <david.brochart@gmail.com> * update help string Co-authored-by: David Brochart <david.brochart@gmail.com> * update log Co-authored-by: David Brochart <david.brochart@gmail.com> * update log Co-authored-by: David Brochart <david.brochart@gmail.com> * Rename AbstractFileIdManager to BaseFileIdManager Co-authored-by: David Brochart <david.brochart@gmail.com>

dlqqq · 2022-10-25T21:02:05Z

@davidbrochart Done. I had one comment I wanted to make in this PR for visibility.

The ArbitraryFileIdManager makes absolutely no assumptions of the filesystem, including its directory structure. By design, ArbitraryFileIdManager should work on any filesystem, including non-hierarchical filesystems that allow for / to be in a name (e.g. a cloud KV store). That means when it receives a "move" event from the contents manager, it does not update the path of any children, since that assumes the filesystem is hierarchical with parents delimited from children with a /. This means that when moving a directory, get_path() will return the wrong path for all of its children, even if the move was performed via JupyterLab.

I think this is fine, given that custom ContentsManagers really should provide their own FileIdManager implementation if they want to use this extension meaningfully anyways.

dlqqq · 2022-10-25T21:03:06Z

@davidbrochart Feel free to merge when ready. I'll cut a release for 0.5.0 once this is in.

kevin-bates · 2022-10-25T21:09:43Z

jupyter_server_fileid/manager.py

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+    def index(self, path: str) -> Union[int, str, None]:


I don't think we can have dual forms of IDs as suggested here. Users should not be expected to change from int to UUID, rebuild all indices, and, worst of all, update all external references that may be squirreled away when they find out that int as ID is insufficient.

I see this was merged as I must have been editing this comment. This topic needs to be revisited but we can let #3 be that forum.

kevin-bates · 2022-10-25T21:10:56Z

jupyter_server_fileid/manager.py

+    def get_id(self, path: str) -> Union[int, str, None]:
+        raise NotImplementedError("must be implemented by subclass")
+
+    def get_path(self, id: Union[int, str]) -> Union[int, str, None]:


Suggested change

def get_path(self, id: Union[int, str]) -> Union[int, str, None]:

def get_path(self, id: Union[int, str]) -> Optional[str]:

dlqqq · 2022-10-25T21:17:23Z

@kevin-bates Thanks for reminding me about this issue again. Let's move this discussion in #3.

kevin-bates

I'm sorry this review is late. I didn't expect the PR to be merged so quickly with the kinds of changes that seemed to be made at the last minute. These comments should be considered part of my review of #30.

kevin-bates · 2022-10-26T22:34:05Z

jupyter_server_fileid/manager.py

@@ -37,35 +37,174 @@ def wrapped(self, *args, **kwargs):
    return decorator


-class FileIdManager(LoggingConfigurable):
+class BaseFileIdManager(LoggingConfigurable):


This should derive from ABC along with a metaclass definition in order for @abstractmethod decorators to be effective. A good example of this can be found here. By not deriving BaseFileIdManager from ABC, @abstractmethod decorators do not work (and I see those too have been removed). As a result, a subclass's violation for not implementing various methods will not be discovered until that method is called, rather than when the class instance is instantiated. Proper decoration also prevents BaseFileIdManager from being instantiated (i.e., a true abstract base class).

kevin-bates · 2022-10-26T22:35:08Z

jupyter_server_fileid/manager.py

+    def __init__(self, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+
+    def index(self, path: str) -> Union[int, str, None]:


To the best of my knowledge, this is not a public method and should not exist on the ABC.

kevin-bates · 2022-10-26T22:36:43Z

jupyter_server_fileid/manager.py

+    def get_id(self, path: str) -> Union[int, str, None]:
+        raise NotImplementedError("must be implemented by subclass")


With a proper ABC definition @abstractmethod is sufficient and this method can use pass. It should also include the basic help-string text from which all subclasses will derive.

Same goes for get_path() and get_handlers_by_action below.

kevin-bates · 2022-10-26T22:38:19Z

jupyter_server_fileid/manager.py

+    def move(self, old_path: str, new_path: str) -> Union[int, str, None]:
+        raise NotImplementedError("must be implemented by subclass")
+
+    def copy(self, from_path: str, to_path: str) -> Union[int, str, None]:
+        raise NotImplementedError("must be implemented by subclass")
+
+    def delete(self, path: str) -> None:
+        raise NotImplementedError("must be implemented by subclass")
+
+    def save(self, path: str) -> None:
+        raise NotImplementedError("must be implemented by subclass")


To the best of my knowledge, these are not public methods and should not exist on the ABC.

kevin-bates · 2022-10-26T22:41:44Z

jupyter_server_fileid/manager.py

+        self.con.execute(
+            "CREATE TABLE IF NOT EXISTS Files("
+            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
+            "path TEXT NOT NULL UNIQUE"


I think the ArbitraryFileIdManager is going to want to store the root_dir value as it can change between invocations and may be necessary. We should also store any information returned from the Contents events.

"root_dir TEXT NOT NULL"

kevin-bates · 2022-10-26T22:42:52Z

jupyter_server_fileid/manager.py

+        self.con.execute("CREATE INDEX IF NOT EXISTS ix_Files_path ON Files (path)")
+        self.con.commit()
+
+    def index(self, path: str) -> int:


Not public, should be prefixed with _ and should probably take whatever other columns are necessary.

kevin-bates · 2022-10-26T22:43:39Z

jupyter_server_fileid/manager.py

+    def move(self, old_path: str, new_path: str) -> None:
+        row = self.con.execute("SELECT id FROM Files WHERE path = ?", (old_path,)).fetchone()
+        id = row and row[0]
+
+        if id:
+            self.con.execute("UPDATE Files SET path = ? WHERE path = ?", (new_path, old_path))
+        else:
+            cursor = self.con.execute("INSERT INTO Files (path) VALUES (?)", (new_path,))
+            id = cursor.lastrowid
+
+        self.con.commit()
+        return id
+
+    def copy(self, from_path: str, to_path: str) -> Optional[int]:
+        cursor = self.con.execute("INSERT INTO Files (path) VALUES (?)", (to_path,))
+        self.con.commit()
+        return cursor.lastrowid
+
+    def delete(self, path: str) -> None:
+        self.con.execute("DELETE FROM Files WHERE path = ?", (path,))
+        self.con.commit()


Not public, should be prefixed with _.

kevin-bates · 2022-10-26T22:46:17Z

jupyter_server_fileid/manager.py

+    def save(self, path: str) -> None:
+        return


This should be implemented to call _index() if its not already indexed. Otherwise, no entries will ever be inserted into the FILES table.

Also, _get() should be implemented such that it should be called on Get Contents events. Similar reason to why Save events must be handled.

kevin-bates · 2022-10-26T22:47:04Z

jupyter_server_fileid/manager.py

+            "get": None,
+            "save": None,


Per previous comment, these should call _get() and _save(), respectively.

dlqqq · 2022-10-27T01:41:13Z

@kevin-bates Thank you for the review! Sorry we're merging things so quickly, this is being used by RTC, which is also moving very fast right now. Let me address your concerns separately:

This should derive from ABC along with a metaclass definition in order for @AbstractMethod decorators to be effective. A good example of this can be found here. By not deriving BaseFileIdManager from ABC, @AbstractMethod decorators do not work (and I see those too have been removed). As a result, a subclass's violation for not implementing various methods will not be discovered until that method is called, rather than when the class instance is instantiated. Proper decoration also prevents BaseFileIdManager from being instantiated (i.e., a true abstract base class).

I removed inheritance from ABC because I wasn't sure if we can use it with Traitlets. The developers I work with on Jupyter scheduler had run into an issue where they couldn't inherit from both ABC and LoggingConfigurable. There are certain traits and validation on those traits we want all file ID managers to have, specifically root_dir and db_path. If it's possible to use traits and ABC together, I'd be happy to have BaseFileIdManager inherit from ABC again.

To the best of my knowledge, this is not a public method and should not exist on the ABC.

(from #30) Although the methods called during event handling are not public, any FileIdManager implementations that choose to leverage content events should be receiving the entire model that is the event's payload, not just path information.

index() definitely needs to be public, since otherwise there's no way for developers to create new file IDs. We do want all of those other methods to be public as well. These allow developers to perform in-band edits without always having to go through the contents manager. I don't know of any users at the moment, but we want to keep this door open for now unless compelled to do so otherwise. Right now, I don't see any tangible benefit to making these methods private. FileIdManagers should implement them anyways to correctly handle contents manager events.

Furthermore, if these methods are public, then it doesn't make sense for them to receive the full event payload as their exclusive argument.

I think the ArbitraryFileIdManager is going to want to store the root_dir value as it can change between invocations and may be necessary. We should also store any information returned from the Contents events.

ArbitraryFileIdManager can't make any assumptions about the filesystem. It doesn't know about absolute or relative paths, as that would assume a hierarchical filesystem. The consensus is that it's OK for the ArbitraryFileIdManager to not be fully fleshed out and handle cases like moving server roots or moving directories. Anything more requires assumptions about your filesystem, and that requires either LocalFileIdManager or some other custom implementation.

kevin-bates · 2022-10-27T15:04:17Z

I removed inheritance from ABC because I wasn't sure if we can use it with Traitlets. The developers I work with on Jupyter scheduler had run into an issue where they couldn't inherit from both ABC and LoggingConfigurable.

I had run into the same thing. The key is to specify a metaclass - as my linked reference illustrates. I can submit a pull request to fix that if that's helpful.

index() definitely needs to be public, since otherwise there's no way for developers to create new file IDs.

You're right about index(). I had conflated that with scan_all() - sorry about that.

ArbitraryFileIdManager can't make any assumptions about the filesystem. It doesn't know about absolute or relative paths, as that would assume a hierarchical filesystem.

This isn't for you to decide. It's perfectly reasonable for an S3 Contents Manager implementation to use root_dir to specify the S3 bucket. Yes, root_dir + path is not absolute, but it is necessary to determine uniqueness - even in filesystem-based FileId managers - particularly those implementations that will want to remain agnostic to filesystem type.

Regarding the copy(), move(), delete() methods. Ok - they should exist and be public. However, the methods that are invoked from the contents events should take the complete event payload. Those methods (e.g., _copy(), _move(), _delete()) can then call the public methods as appropriate (or not at all), ArbitraryFileIdManager should have all information available to itself when handling events. For example, other FileId service implementations need to know the difference between a file and a directory and that information should be in the event payload.

I'm not sure what the actual intention of the ArbitraryFileIdManager is, but I view it as something that could also be used instead of FileIdManager - even when FileContentsManager is in use. In fact, ensuring it is sufficient in that case is great test (because its going to be necessary when some troublesome filesystem is encountered and folks don't have the time to implement their own FileIdManager).

These allow developers to perform in-band edits without always having to go through the contents manager.

This is just plain wrong. While a filesystem-based FIleIdManager can copy, move, delete files, those that are associated with non-filesystem-based ContentsManagers cannot. This is yet another reason the FileIdService should be part of CM! So when those applications switch their ContentsManagers, yet think they can manage resources using the FileIdService instead, what are these methods supposed to do!?

dlqqq · 2022-10-27T17:49:49Z

I had run into the same thing. The key is to specify a metaclass - as my linked reference illustrates. I can submit a pull request to fix that if that's helpful.

@kevin-bates It's very clear you're more knowledgable on this than I am 😁. However, RTC team is under some tight time constraints at the moment, so I will have to be making a lot of changes very rapidly. How about I ping you when the main branch is stable enough for you to make changes without merge conflicts? Tracking this here: #32

This isn't for you to decide. It's perfectly reasonable for an S3 Contents Manager implementation to use root_dir to specify the S3 bucket. Yes, root_dir + path is not absolute, but it is necessary to determine uniqueness - even in filesystem-based FileId managers - particularly those implementations that will want to remain agnostic to filesystem type.

Brian relayed the discussion you all had in the Jupyter server meeting this morning. It sounds like we can safely assume that the "API paths" (paths returned from the contents manager) are always hierarchical, relative to the contents manager root, and delimited by forward slashes. Given this set of constraints, it's perfectly reasonable to join the root_dir to the path. I'm tracking this in a separate issue: #31

However, the methods that are invoked from the contents events should take the complete event payload. Those methods (e.g., _copy(), _move(), _delete()) can then call the public methods as appropriate (or not at all), ArbitraryFileIdManager should have all information available to itself when handling events. For example, other FileId service implementations need to know the difference between a file and a directory and that information should be in the event payload.

Destructuring of the event payload can be handled in get_handlers_by_actions(). I think this is way more concise than having 5 additional private methods on the base class and then 10 more private methods implementing those. Centralizing the "event payload destructuring" logic in a single place rather than splitting it into 5 different methods helps with maintainability, IMO.

This is just plain wrong. While a filesystem-based FIleIdManager can copy, move, delete files, those that are associated with non-filesystem-based ContentsManagers cannot. This is yet another reason the FileIdService should be part of CM! So when those applications switch their ContentsManagers, yet think they can manage resources using the FileIdService instead, what are these methods supposed to do!?

I think there's some misunderstanding on what these methods do. They don't actually move the file on the filesystem, they simply update the associated path in the Files table. So, file_id_manager.move(old_path, new_path) doesn't actually move the file; it essentially just takes the file ID associated with old_path and sets its path column to new_path. It is done in response to a file move, but doesn't actually move the file itself.

Maybe a better name should have been chosen earlier to convey this better, but I chose the briefest name possible out of personal preference. Perhaps on_move(), on_copy(), etc? Feel free to make a separate issue if you have your own naming suggestions you'd like to share.

kevin-bates · 2022-10-27T19:41:35Z

These allow developers to perform in-band edits without always having to go through the contents manager.

This is just plain wrong.

I think there's some misunderstanding on what these methods do.

I'm sorry. The phrase perform in-band edits without always having to go through the contents manager implied to me that folks could use these methods to copy, move and delete files. Might you have meant to say "out-of-band edits"? Is there a case where "in-band" edits need to move a file (outside of the ContentsManager) and, in such cases, the user application must "know" to call the corresponding method on the FID manager? Just curious - thanks.

Perhaps on_move(), on_copy(), etc?

Yes - great idea! And those methods should take the complete data payload from the received event - like all event handlers should. Awesome.

davidbrochart marked this pull request as draft October 21, 2022 10:29

davidbrochart mentioned this pull request Oct 21, 2022

Support any contents manager #19

Closed

davidbrochart force-pushed the any_contents_manager branch 3 times, most recently from df2b200 to d21d05f Compare October 21, 2022 10:36

davidbrochart commented Oct 21, 2022

View reviewed changes

kevin-bates mentioned this pull request Oct 21, 2022

update get_path() to return a path relative to root_dir #21

Merged

kevin-bates reviewed Oct 21, 2022

View reviewed changes

jupyter_server_fileid/manager.py Outdated Show resolved Hide resolved

dlqqq reviewed Oct 24, 2022

View reviewed changes

jupyter_server_fileid/extension.py Outdated Show resolved Hide resolved

dlqqq and others added 2 commits October 25, 2022 20:48

Allow arbitrary contents managers

d235357

dlqqq force-pushed the any_contents_manager branch from 37a6b1f to 44e6d0e Compare October 25, 2022 20:51

dlqqq approved these changes Oct 25, 2022

View reviewed changes

dlqqq marked this pull request as ready for review October 25, 2022 21:02

davidbrochart merged commit d968097 into jupyter-server:main Oct 25, 2022

davidbrochart deleted the any_contents_manager branch October 25, 2022 21:04

kevin-bates reviewed Oct 25, 2022

View reviewed changes

kevin-bates reviewed Oct 26, 2022

View reviewed changes

kevin-bates mentioned this pull request Oct 26, 2022

use UUIDs for default file ID manager implementations #30

Merged

dlqqq mentioned this pull request Oct 27, 2022

make BaseFileIdManager a true ABC #32

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Allow arbitrary contents managers #24

Allow arbitrary contents managers #24

davidbrochart commented Oct 21, 2022 •

edited

codecov-commenter commented Oct 21, 2022 •

edited

davidbrochart Oct 21, 2022

dlqqq commented Oct 21, 2022

dlqqq commented Oct 21, 2022

dlqqq commented Oct 25, 2022 •

edited

dlqqq commented Oct 25, 2022

davidbrochart commented Oct 25, 2022

dlqqq commented Oct 25, 2022

dlqqq commented Oct 25, 2022

kevin-bates Oct 25, 2022

kevin-bates Oct 25, 2022

kevin-bates Oct 25, 2022

dlqqq commented Oct 25, 2022

kevin-bates left a comment

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

kevin-bates Oct 26, 2022

dlqqq commented Oct 27, 2022

kevin-bates commented Oct 27, 2022

dlqqq commented Oct 27, 2022

kevin-bates commented Oct 27, 2022

	def get_path(self, id: Union[int, str]) -> Union[int, str, None]:
	def get_path(self, id: Union[int, str]) -> Optional[str]:

		def get_id(self, path: str) -> Union[int, str, None]:
		raise NotImplementedError("must be implemented by subclass")

Allow arbitrary contents managers #24

Allow arbitrary contents managers #24

Conversation

davidbrochart commented Oct 21, 2022 • edited

codecov-commenter commented Oct 21, 2022 • edited

Codecov Report

Choose a reason for hiding this comment

dlqqq commented Oct 21, 2022

dlqqq commented Oct 21, 2022

dlqqq commented Oct 25, 2022 • edited

dlqqq commented Oct 25, 2022

davidbrochart commented Oct 25, 2022

dlqqq commented Oct 25, 2022

dlqqq commented Oct 25, 2022

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlqqq commented Oct 25, 2022

kevin-bates left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dlqqq commented Oct 27, 2022

kevin-bates commented Oct 27, 2022

dlqqq commented Oct 27, 2022

kevin-bates commented Oct 27, 2022

davidbrochart commented Oct 21, 2022 •

edited

codecov-commenter commented Oct 21, 2022 •

edited

dlqqq commented Oct 25, 2022 •

edited