-
Notifications
You must be signed in to change notification settings - Fork 0
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add a new SSH-based location. #48
Conversation
715b992
to
81811a0
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks - this looks great. I think the duplication is fine, and there'll be a bit more once the http remote is added too.
It took a while to see why the packet metadata was needed, but I think that's a reasonable solution. Nice idea with reading the configuration from the remote system too.
A couple of minor comments throughout
def list(self) -> Dict[str, PacketLocation]: | ||
path = self._root / ".outpack" / "location" / LOCATION_LOCAL | ||
result = {} | ||
for packet in self._sftp.listdir(str(path)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest caching/memoising this as it is immutable for as packet, and it looks like this would reread files that we've fetched on a previous list?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is necessary. With the way I've refactored the location_pull
code it only ever calls list()
once per driver object instance, so the caching would never be used.
url = "ssh://" | ||
if username is not None: | ||
url += f"{username}@" | ||
url += f"127.0.0.1:{self.port}{path}" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just confirming that using 127.0.0.1
here prevents inbound connections, right? (binding only to localhost)? It might be worth a comment reminding us of that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No this is the address used to connect, not to bind. The binding is further down, sock.bind(("", 0))
(all interfaces, random port). I don't think it matters much how open this is given it is only for tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment on the bind line
@@ -29,16 +35,15 @@ def metadata(self, packet_ids): | |||
ret[packet_id] = read_string(path) | |||
return ret | |||
|
|||
def fetch_file(self, hash, dest): | |||
def fetch_file(self, _packet: MetadataCore, file: PacketFile, dest: str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It feels a pity that we need to use the packet metadata here, but I can't really see a way to avoid it either
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, the alternative is looking through all the packets eg. https://github.com/mrc-ide/outpack-py/blob/19caa670f513adcb47557cb2d52ec2c4579afea1/src/outpack/root.py#L37-L48 and https://github.com/mrc-ide/outpack-py/blob/main/src/outpack/root.py#L78-L94. At some point I should move the path location to use the extra information too.
Doing so with the remote metadata is unnecessarily slow, or we need the driver to be able to use the local index to make the search, which is doable but creates an ugly dependency between parts that are currently nicely isolated.
This is handy when running reports on the Linux cluster and the user wants to pull the results back to their local machine. The new location supports the same syntaxes as Git uses, either the short SCP-like `username@hostname:path` syntax or the more explicit URL-based `ssh://username@hostname/path`. The latter also allows specifying a port number. There's a bit of duplicated logic between the core implementation and the new ssh location, since they both deal with the same file layout. It felt like combining the two would require significant amount of indirections by introducing a VFS that could be implemented both over SFTP and locally, and wasn't worth it since the duplication isn't excessive either. The location pulling code got refactored a little bit. It used to make many repeated calls to `driver.list()`, but that is inefficient when these calls are over the network. While we could probably cache these calls in the SSH driver, it is easy to just not make them. The location driver's `fetch_file` interface is changed a little bit to take more metadata in its arguments, including the packet name, id and file path within the packet. This makes it much easier to find the file when the remote location does not have a content-addressed file store.
This is handy when running reports on the Linux cluster and the user wants to pull the results back to their local machine.
The new location supports the same syntaxes as Git uses, either the short SCP-like
username@hostname:path
syntax or the more explicit URL-basedssh://username@hostname/path
. The latter also allows specifying a port number.There's a bit of duplicated logic between the core implementation and the new ssh location, since they both deal with the same file layout. It felt like combining the two would require significant amount of indirections by introducing a VFS that could be implemented both over SFTP and locally, and wasn't worth it since the duplication isn't excessive either.
The location pulling code got refactored a little bit. It used to make many repeated calls to
driver.list()
, but that is inefficient when these calls are over the network. While we could probably cache these calls in the SSH driver, it is easy to just not make them.The location driver's
fetch_file
interface is changed a little bit to take more metadata in its arguments, including the packet name, id and file path within the packet. This makes it much easier to find the file when the remote location does not have a content-addressed file store.