Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow arbitrary contents managers #24

Merged
merged 2 commits into from
Oct 25, 2022

Conversation

davidbrochart
Copy link
Contributor

@davidbrochart davidbrochart commented Oct 21, 2022

Closes #19.

@davidbrochart davidbrochart marked this pull request as draft October 21, 2022 10:29
@codecov-commenter
Copy link

codecov-commenter commented Oct 21, 2022

Codecov Report

Base: 78.74% // Head: 79.39% // Increases project coverage by +0.64% 🎉

Coverage data is based on head (44e6d0e) compared to base (7f28d3c).
Patch coverage: 80.76% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##             main      #24      +/-   ##
==========================================
+ Coverage   78.74%   79.39%   +0.64%     
==========================================
  Files           5        5              
  Lines         414      495      +81     
  Branches       62       68       +6     
==========================================
+ Hits          326      393      +67     
- Misses         66       80      +14     
  Partials       22       22              
Impacted Files Coverage Δ
jupyter_server_fileid/extension.py 37.50% <27.27%> (-7.33%) ⬇️
jupyter_server_fileid/manager.py 88.47% <85.88%> (-0.69%) ⬇️
jupyter_server_fileid/pytest_plugin.py 88.50% <100.00%> (+0.85%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@davidbrochart davidbrochart force-pushed the any_contents_manager branch 3 times, most recently from df2b200 to d21d05f Compare October 21, 2022 10:36
Comment on lines 50 to 161
class ArbitraryFileIdManager(AbstractFileIdManager):
def __init__(self, *args, **kwargs):
pass

def get_id(self, path: str) -> str:
return path

def get_path(self, id: str) -> str:
return id
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To be implemented.

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 21, 2022

@davidbrochart Thanks for tackling this! In the future however, could you please assign yourself to an issue before beginning work on a PR that addresses it? I want to make sure we're avoiding duplicate effort as much as possible.

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 21, 2022

I'd be happy to review this once it's ready and out of draft status.

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 25, 2022

@davidbrochart Cut a PR to your branch fleshing out the ArbitraryFileIdManager implementation and address Kevin's comments. davidbrochart#1

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 25, 2022

@davidbrochart Can I go ahead and rebase to fix merge conflicts?

@davidbrochart
Copy link
Contributor Author

Please do, thanks!

dlqqq and others added 2 commits October 25, 2022 20:48
* set default only when config doesn't specify file_id_manager_class

* flesh out abstract and arbitrary file ID managers

- make root_dir non-configurable
- add get_handlers_by_action() method

* edit docstring

Co-authored-by: David Brochart <david.brochart@gmail.com>

* update help string

Co-authored-by: David Brochart <david.brochart@gmail.com>

* update log

Co-authored-by: David Brochart <david.brochart@gmail.com>

* update log

Co-authored-by: David Brochart <david.brochart@gmail.com>

* Rename AbstractFileIdManager to BaseFileIdManager

Co-authored-by: David Brochart <david.brochart@gmail.com>
@dlqqq
Copy link
Collaborator

dlqqq commented Oct 25, 2022

@davidbrochart Done. I had one comment I wanted to make in this PR for visibility.

The ArbitraryFileIdManager makes absolutely no assumptions of the filesystem, including its directory structure. By design, ArbitraryFileIdManager should work on any filesystem, including non-hierarchical filesystems that allow for / to be in a name (e.g. a cloud KV store). That means when it receives a "move" event from the contents manager, it does not update the path of any children, since that assumes the filesystem is hierarchical with parents delimited from children with a /. This means that when moving a directory, get_path() will return the wrong path for all of its children, even if the move was performed via JupyterLab.

I think this is fine, given that custom ContentsManagers really should provide their own FileIdManager implementation if they want to use this extension meaningfully anyways.

@dlqqq dlqqq marked this pull request as ready for review October 25, 2022 21:02
@dlqqq
Copy link
Collaborator

dlqqq commented Oct 25, 2022

@davidbrochart Feel free to merge when ready. I'll cut a release for 0.5.0 once this is in.

@davidbrochart davidbrochart merged commit d968097 into jupyter-server:main Oct 25, 2022
@davidbrochart davidbrochart deleted the any_contents_manager branch October 25, 2022 21:04
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

def index(self, path: str) -> Union[int, str, None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can have dual forms of IDs as suggested here. Users should not be expected to change from int to UUID, rebuild all indices, and, worst of all, update all external references that may be squirreled away when they find out that int as ID is insufficient.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see this was merged as I must have been editing this comment. This topic needs to be revisited but we can let #3 be that forum.

def get_id(self, path: str) -> Union[int, str, None]:
raise NotImplementedError("must be implemented by subclass")

def get_path(self, id: Union[int, str]) -> Union[int, str, None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def get_path(self, id: Union[int, str]) -> Union[int, str, None]:
def get_path(self, id: Union[int, str]) -> Optional[str]:

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 25, 2022

@kevin-bates Thanks for reminding me about this issue again. Let's move this discussion in #3.

Copy link
Member

@kevin-bates kevin-bates left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry this review is late. I didn't expect the PR to be merged so quickly with the kinds of changes that seemed to be made at the last minute. These comments should be considered part of my review of #30.

@@ -37,35 +37,174 @@ def wrapped(self, *args, **kwargs):
return decorator


class FileIdManager(LoggingConfigurable):
class BaseFileIdManager(LoggingConfigurable):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should derive from ABC along with a metaclass definition in order for @abstractmethod decorators to be effective. A good example of this can be found here. By not deriving BaseFileIdManager from ABC, @abstractmethod decorators do not work (and I see those too have been removed). As a result, a subclass's violation for not implementing various methods will not be discovered until that method is called, rather than when the class instance is instantiated. Proper decoration also prevents BaseFileIdManager from being instantiated (i.e., a true abstract base class).

def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)

def index(self, path: str) -> Union[int, str, None]:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the best of my knowledge, this is not a public method and should not exist on the ABC.

Comment on lines +74 to +75
def get_id(self, path: str) -> Union[int, str, None]:
raise NotImplementedError("must be implemented by subclass")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With a proper ABC definition @abstractmethod is sufficient and this method can use pass. It should also include the basic help-string text from which all subclasses will derive.

Same goes for get_path() and get_handlers_by_action below.

Comment on lines +80 to +90
def move(self, old_path: str, new_path: str) -> Union[int, str, None]:
raise NotImplementedError("must be implemented by subclass")

def copy(self, from_path: str, to_path: str) -> Union[int, str, None]:
raise NotImplementedError("must be implemented by subclass")

def delete(self, path: str) -> None:
raise NotImplementedError("must be implemented by subclass")

def save(self, path: str) -> None:
raise NotImplementedError("must be implemented by subclass")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To the best of my knowledge, these are not public methods and should not exist on the ABC.

self.con.execute(
"CREATE TABLE IF NOT EXISTS Files("
"id INTEGER PRIMARY KEY AUTOINCREMENT, "
"path TEXT NOT NULL UNIQUE"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the ArbitraryFileIdManager is going to want to store the root_dir value as it can change between invocations and may be necessary. We should also store any information returned from the Contents events.

"root_dir TEXT NOT NULL"

self.con.execute("CREATE INDEX IF NOT EXISTS ix_Files_path ON Files (path)")
self.con.commit()

def index(self, path: str) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not public, should be prefixed with _ and should probably take whatever other columns are necessary.

Comment on lines +150 to +170
def move(self, old_path: str, new_path: str) -> None:
row = self.con.execute("SELECT id FROM Files WHERE path = ?", (old_path,)).fetchone()
id = row and row[0]

if id:
self.con.execute("UPDATE Files SET path = ? WHERE path = ?", (new_path, old_path))
else:
cursor = self.con.execute("INSERT INTO Files (path) VALUES (?)", (new_path,))
id = cursor.lastrowid

self.con.commit()
return id

def copy(self, from_path: str, to_path: str) -> Optional[int]:
cursor = self.con.execute("INSERT INTO Files (path) VALUES (?)", (to_path,))
self.con.commit()
return cursor.lastrowid

def delete(self, path: str) -> None:
self.con.execute("DELETE FROM Files WHERE path = ?", (path,))
self.con.commit()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not public, should be prefixed with _.

Comment on lines +172 to +173
def save(self, path: str) -> None:
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be implemented to call _index() if its not already indexed. Otherwise, no entries will ever be inserted into the FILES table.

Also, _get() should be implemented such that it should be called on Get Contents events. Similar reason to why Save events must be handled.

Comment on lines +177 to +178
"get": None,
"save": None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per previous comment, these should call _get() and _save(), respectively.

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 27, 2022

@kevin-bates Thank you for the review! Sorry we're merging things so quickly, this is being used by RTC, which is also moving very fast right now. Let me address your concerns separately:

This should derive from ABC along with a metaclass definition in order for @AbstractMethod decorators to be effective. A good example of this can be found here. By not deriving BaseFileIdManager from ABC, @AbstractMethod decorators do not work (and I see those too have been removed). As a result, a subclass's violation for not implementing various methods will not be discovered until that method is called, rather than when the class instance is instantiated. Proper decoration also prevents BaseFileIdManager from being instantiated (i.e., a true abstract base class).

I removed inheritance from ABC because I wasn't sure if we can use it with Traitlets. The developers I work with on Jupyter scheduler had run into an issue where they couldn't inherit from both ABC and LoggingConfigurable. There are certain traits and validation on those traits we want all file ID managers to have, specifically root_dir and db_path. If it's possible to use traits and ABC together, I'd be happy to have BaseFileIdManager inherit from ABC again.

To the best of my knowledge, this is not a public method and should not exist on the ABC.

(from #30) Although the methods called during event handling are not public, any FileIdManager implementations that choose to leverage content events should be receiving the entire model that is the event's payload, not just path information.

index() definitely needs to be public, since otherwise there's no way for developers to create new file IDs. We do want all of those other methods to be public as well. These allow developers to perform in-band edits without always having to go through the contents manager. I don't know of any users at the moment, but we want to keep this door open for now unless compelled to do so otherwise. Right now, I don't see any tangible benefit to making these methods private. FileIdManagers should implement them anyways to correctly handle contents manager events.

Furthermore, if these methods are public, then it doesn't make sense for them to receive the full event payload as their exclusive argument.

I think the ArbitraryFileIdManager is going to want to store the root_dir value as it can change between invocations and may be necessary. We should also store any information returned from the Contents events.

ArbitraryFileIdManager can't make any assumptions about the filesystem. It doesn't know about absolute or relative paths, as that would assume a hierarchical filesystem. The consensus is that it's OK for the ArbitraryFileIdManager to not be fully fleshed out and handle cases like moving server roots or moving directories. Anything more requires assumptions about your filesystem, and that requires either LocalFileIdManager or some other custom implementation.

@kevin-bates
Copy link
Member

I removed inheritance from ABC because I wasn't sure if we can use it with Traitlets. The developers I work with on Jupyter scheduler had run into an issue where they couldn't inherit from both ABC and LoggingConfigurable.

I had run into the same thing. The key is to specify a metaclass - as my linked reference illustrates. I can submit a pull request to fix that if that's helpful.

index() definitely needs to be public, since otherwise there's no way for developers to create new file IDs.

You're right about index(). I had conflated that with scan_all() - sorry about that.

ArbitraryFileIdManager can't make any assumptions about the filesystem. It doesn't know about absolute or relative paths, as that would assume a hierarchical filesystem.

This isn't for you to decide. It's perfectly reasonable for an S3 Contents Manager implementation to use root_dir to specify the S3 bucket. Yes, root_dir + path is not absolute, but it is necessary to determine uniqueness - even in filesystem-based FileId managers - particularly those implementations that will want to remain agnostic to filesystem type.

Regarding the copy(), move(), delete() methods. Ok - they should exist and be public. However, the methods that are invoked from the contents events should take the complete event payload. Those methods (e.g., _copy(), _move(), _delete()) can then call the public methods as appropriate (or not at all), ArbitraryFileIdManager should have all information available to itself when handling events. For example, other FileId service implementations need to know the difference between a file and a directory and that information should be in the event payload.

I'm not sure what the actual intention of the ArbitraryFileIdManager is, but I view it as something that could also be used instead of FileIdManager - even when FileContentsManager is in use. In fact, ensuring it is sufficient in that case is great test (because its going to be necessary when some troublesome filesystem is encountered and folks don't have the time to implement their own FileIdManager).

These allow developers to perform in-band edits without always having to go through the contents manager.

This is just plain wrong. While a filesystem-based FIleIdManager can copy, move, delete files, those that are associated with non-filesystem-based ContentsManagers cannot. This is yet another reason the FileIdService should be part of CM! So when those applications switch their ContentsManagers, yet think they can manage resources using the FileIdService instead, what are these methods supposed to do!?

@dlqqq
Copy link
Collaborator

dlqqq commented Oct 27, 2022

I had run into the same thing. The key is to specify a metaclass - as my linked reference illustrates. I can submit a pull request to fix that if that's helpful.

@kevin-bates It's very clear you're more knowledgable on this than I am 😁. However, RTC team is under some tight time constraints at the moment, so I will have to be making a lot of changes very rapidly. How about I ping you when the main branch is stable enough for you to make changes without merge conflicts? Tracking this here: #32

This isn't for you to decide. It's perfectly reasonable for an S3 Contents Manager implementation to use root_dir to specify the S3 bucket. Yes, root_dir + path is not absolute, but it is necessary to determine uniqueness - even in filesystem-based FileId managers - particularly those implementations that will want to remain agnostic to filesystem type.

Brian relayed the discussion you all had in the Jupyter server meeting this morning. It sounds like we can safely assume that the "API paths" (paths returned from the contents manager) are always hierarchical, relative to the contents manager root, and delimited by forward slashes. Given this set of constraints, it's perfectly reasonable to join the root_dir to the path. I'm tracking this in a separate issue: #31

However, the methods that are invoked from the contents events should take the complete event payload. Those methods (e.g., _copy(), _move(), _delete()) can then call the public methods as appropriate (or not at all), ArbitraryFileIdManager should have all information available to itself when handling events. For example, other FileId service implementations need to know the difference between a file and a directory and that information should be in the event payload.

Destructuring of the event payload can be handled in get_handlers_by_actions(). I think this is way more concise than having 5 additional private methods on the base class and then 10 more private methods implementing those. Centralizing the "event payload destructuring" logic in a single place rather than splitting it into 5 different methods helps with maintainability, IMO.

This is just plain wrong. While a filesystem-based FIleIdManager can copy, move, delete files, those that are associated with non-filesystem-based ContentsManagers cannot. This is yet another reason the FileIdService should be part of CM! So when those applications switch their ContentsManagers, yet think they can manage resources using the FileIdService instead, what are these methods supposed to do!?

I think there's some misunderstanding on what these methods do. They don't actually move the file on the filesystem, they simply update the associated path in the Files table. So, file_id_manager.move(old_path, new_path) doesn't actually move the file; it essentially just takes the file ID associated with old_path and sets its path column to new_path. It is done in response to a file move, but doesn't actually move the file itself.

Maybe a better name should have been chosen earlier to convey this better, but I chose the briefest name possible out of personal preference. Perhaps on_move(), on_copy(), etc? Feel free to make a separate issue if you have your own naming suggestions you'd like to share.

@kevin-bates
Copy link
Member

These allow developers to perform in-band edits without always having to go through the contents manager.

This is just plain wrong.

I think there's some misunderstanding on what these methods do.

I'm sorry. The phrase perform in-band edits without always having to go through the contents manager implied to me that folks could use these methods to copy, move and delete files. Might you have meant to say "out-of-band edits"? Is there a case where "in-band" edits need to move a file (outside of the ContentsManager) and, in such cases, the user application must "know" to call the corresponding method on the FID manager? Just curious - thanks.

Perhaps on_move(), on_copy(), etc?

Yes - great idea! And those methods should take the complete data payload from the received event - like all event handlers should. Awesome.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support any contents manager
4 participants