Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new SSH-based location. #48

Merged
merged 3 commits into from
May 10, 2024
Merged

Add a new SSH-based location. #48

merged 3 commits into from
May 10, 2024

Conversation

plietar
Copy link
Member

@plietar plietar commented May 3, 2024

This is handy when running reports on the Linux cluster and the user wants to pull the results back to their local machine.

The new location supports the same syntaxes as Git uses, either the short SCP-like username@hostname:path syntax or the more explicit URL-based ssh://username@hostname/path. The latter also allows specifying a port number.

There's a bit of duplicated logic between the core implementation and the new ssh location, since they both deal with the same file layout. It felt like combining the two would require significant amount of indirections by introducing a VFS that could be implemented both over SFTP and locally, and wasn't worth it since the duplication isn't excessive either.

The location pulling code got refactored a little bit. It used to make many repeated calls to driver.list(), but that is inefficient when these calls are over the network. While we could probably cache these calls in the SSH driver, it is easy to just not make them.

The location driver's fetch_file interface is changed a little bit to take more metadata in its arguments, including the packet name, id and file path within the packet. This makes it much easier to find the file when the remote location does not have a content-addressed file store.

@plietar plietar force-pushed the ssh-remote branch 4 times, most recently from 715b992 to 81811a0 Compare May 3, 2024 18:51
@plietar plietar changed the title SSH remote WIP Add a new SSH-based location. May 3, 2024
@plietar plietar marked this pull request as ready for review May 3, 2024 18:59
@plietar plietar requested a review from richfitz May 3, 2024 18:59
Copy link
Member

@richfitz richfitz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks - this looks great. I think the duplication is fine, and there'll be a bit more once the http remote is added too.

It took a while to see why the packet metadata was needed, but I think that's a reasonable solution. Nice idea with reading the configuration from the remote system too.

A couple of minor comments throughout

src/outpack/location.py Outdated Show resolved Hide resolved
src/outpack/location_ssh.py Outdated Show resolved Hide resolved
def list(self) -> Dict[str, PacketLocation]:
path = self._root / ".outpack" / "location" / LOCATION_LOCAL
result = {}
for packet in self._sftp.listdir(str(path)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest caching/memoising this as it is immutable for as packet, and it looks like this would reread files that we've fetched on a previous list?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is necessary. With the way I've refactored the location_pull code it only ever calls list() once per driver object instance, so the caching would never be used.

url = "ssh://"
if username is not None:
url += f"{username}@"
url += f"127.0.0.1:{self.port}{path}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just confirming that using 127.0.0.1 here prevents inbound connections, right? (binding only to localhost)? It might be worth a comment reminding us of that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No this is the address used to connect, not to bind. The binding is further down, sock.bind(("", 0)) (all interfaces, random port). I don't think it matters much how open this is given it is only for tests.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment on the bind line

@@ -29,16 +35,15 @@ def metadata(self, packet_ids):
ret[packet_id] = read_string(path)
return ret

def fetch_file(self, hash, dest):
def fetch_file(self, _packet: MetadataCore, file: PacketFile, dest: str):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It feels a pity that we need to use the packet metadata here, but I can't really see a way to avoid it either

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the alternative is looking through all the packets eg. https://github.com/mrc-ide/outpack-py/blob/19caa670f513adcb47557cb2d52ec2c4579afea1/src/outpack/root.py#L37-L48 and https://github.com/mrc-ide/outpack-py/blob/main/src/outpack/root.py#L78-L94. At some point I should move the path location to use the extra information too.

Doing so with the remote metadata is unnecessarily slow, or we need the driver to be able to use the local index to make the search, which is doable but creates an ugly dependency between parts that are currently nicely isolated.

This is handy when running reports on the Linux cluster and the user
wants to pull the results back to their local machine.

The new location supports the same syntaxes as Git uses, either the
short SCP-like `username@hostname:path` syntax or the more explicit
URL-based `ssh://username@hostname/path`. The latter also allows
specifying a port number.

There's a bit of duplicated logic between the core implementation and
the new ssh location, since they both deal with the same file layout.
It felt like combining the two would require significant amount of
indirections by introducing a VFS that could be implemented both over
SFTP and locally, and wasn't worth it since the duplication isn't
excessive either.

The location pulling code got refactored a little bit. It used to make
many repeated calls to `driver.list()`, but that is inefficient when
these calls are over the network. While we could probably cache these
calls in the SSH driver, it is easy to just not make them.

The location driver's `fetch_file` interface is changed a little bit to
take more metadata in its arguments, including the packet name, id and
file path within the packet. This makes it much easier to find the file
when the remote location does not have a content-addressed file store.
@plietar plietar merged commit 7de87d8 into main May 10, 2024
7 checks passed
@plietar plietar deleted the ssh-remote branch May 10, 2024 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants