@ghost ghost commented Jun 14, 2019

  • Have you followed the guidelines in our
    Contributing document?

  • Does your PR affect documented changes or does it add new functionality
    that should be documented? If yes, have you created a PR for
    dvc.org documenting it or at
    least opened an issue for it? If so, please add a link to it.


  • SSH number of jobs increased from 4 to 8.

  • `batch_exists` now opens all the SFTP sessions available to
    distribute the work between them. The number of sessions is restricted
    by the `MaxSessions` option, configurable in `sshd_config`.

  • Fix ssh: check IOError key code #2104

With the optimizations mentioned above, we increased the speed by 8x
(using the default MaxSessions configuration -- 10).

Shout-outs to @efiop, @shcheklein, and @dmpetrov for suggesting the
optimization techniques.
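Roughly, the fan-out described above can be sketched like this. This is a minimal illustration, not DVC's actual implementation: the channel objects are assumed to be paramiko-like SFTP clients exposing a `.stat(path)` method that raises `IOError` with `errno.ENOENT` for missing files, and the function names are made up.

```python
# Sketch: distribute existence checks over the available SFTP channels.
# Each channel is used by exactly one worker thread, since a paramiko
# connection must not be shared between threads.
import errno
from concurrent.futures import ThreadPoolExecutor


def _exists(channel, path):
    # One stat() per path; ENOENT means "missing", anything else is a real error.
    try:
        channel.stat(path)
        return True
    except IOError as exc:
        if exc.errno != errno.ENOENT:
            raise
        return False


def batch_exists(channels, paths):
    """Check many paths in parallel, one worker thread per SFTP channel."""
    def worker(channel, chunk):
        return [_exists(channel, path) for path in chunk]

    # Round-robin split: channel i gets paths[i], paths[i+n], paths[i+2n], ...
    n = len(channels)
    chunks = [paths[i::n] for i in range(n)]

    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [
            pool.submit(worker, ch, chunk)
            for ch, chunk in zip(channels, chunks)
        ]
        results = [f.result() for f in futures]

    # Re-interleave the per-channel results back into the original order.
    out = [None] * len(paths)
    for i, res in enumerate(results):
        out[i::n] = res
    return out
```

The round-robin split keeps every channel busy even when path lookups vary in cost; the number of channels (and hence threads) is what `MaxSessions` caps.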

@shcheklein
Member

please take a look at the codeclimate issues. Some of those can probably be addressed, especially the cognitive complexity ones.

Member

does it make sense to set the CRITICAL level globally? do we care about the logs from paramiko at all?

Author

If we don't do this, it will log something like `Secsh channel open FAILED: open failed: Administratively prohibited` while we are trying to figure out the max possible connections.

Member

yes, so my question is - do we need to see their logs at all? Can we set it to critical globally?

Author

I'm not sure that would be a good idea; if something goes wrong, we'll be clueless.

Member

but by making it silent here we are already going down that path, right? If we think paramiko's logging provides some value, why don't we switch it on only when -v is passed?

Author

Agreed, @shcheklein, I'll add support for -v
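The idea discussed above could look roughly like this, a minimal sketch (the `verbose` flag and function name are illustrative, not DVC's actual wiring): silence paramiko's logger by default, and only let its messages through when verbose mode is on.

```python
import logging


def configure_paramiko_logging(verbose=False):
    """Silence paramiko's noisy channel-open failures unless -v is passed."""
    level = logging.DEBUG if verbose else logging.CRITICAL
    logging.getLogger("paramiko").setLevel(level)
```

Called once at startup, this hides messages like the `Secsh channel open FAILED` one above, while `-v` still surfaces them for debugging.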

@Suor
Contributor

Suor commented Jun 14, 2019

@shcheklein codeclimate has the code complexity threshold set too low. The usual value is 10; I personally consider 12-15 ok. A cap at 5, which is codeclimate's default, will lead to a spaghetti of funcs/methods, each not doing much.

We've discussed this with @efiop and decided to raise that limit.

UPDATE: I read up on it, and it looks like they are using something they call cognitive complexity, not the cyclomatic one we discussed. So we should reconsider.

@shcheklein
Member

@Suor it's not exactly "cyclomatic complexity" you are referring to re the common threshold, it's cognitive complexity - a different beast. I don't know where they got 5 from. That said, I agree that 5 might be a little too aggressive. My personal take: methods with a cognitive complexity of 10 are usually hard to test (no matter whether func tests or unit) and hard to read, and there is usually a clear path for improvement; sometimes it even means the design is bad (take a look at batch_exists - it's quite complicated already with all this nesting, the embedded method, etc).

In this specific case:

                try:
                    do_something()
                except IOError as exc:
                    if exc.errno != errno.ENOENT:
                        raise
                    handle_missing_file()

this single piece probably accounts for almost all the CC errors - duplication, cognitive complexity, etc. I'm not sure there is a nice way to deal with this in Python.

@ghost ghost requested a review from efiop June 14, 2019 14:09
@ghost ghost changed the title Optimize ssh remote: optimize ssh parallel operations Jun 14, 2019
@ghost ghost changed the title remote: optimize ssh parallel operations ssh: increase jobs and parallelize batch_exists Jun 14, 2019
Member

are we exhausting all SSH connections on the server this way? or just a number of channels per connection?

my question is: how does this greedy, aggressive strategy affect other potential SSH clients on the server?

Author

good question, @shcheklein!

how does this greedy, aggressive strategy affect other potential SSH clients on the server?

Running htop on the server while running dvc status -r -ssh showed no significant resource consumption:

Before: [htop screenshot, 2019-06-18 15:02:36]

During dvc status: [htop screenshot, 2019-06-18 15:03:09]

And I could still open more SSH and SFTP connections without any error.

So, based on these observations, I'd say it is not aggressive enough to become a potential problem. Although it does raise the question of why paramiko is keeping us from creating more than 10 jobs 🤔

Member

are we reusing the same single connection? what are the limits in terms of how many channels one can open per connection?

Author

The limit on how many SFTP channels we can have per connection is set with the `MaxSessions` option.

It is important to note that the performance increase is not linear. For example, here `MaxSessions` is set to 80, creating a lot of tasks on the remote server:
[htop screenshot, 2019-06-18 15:21:50]

The performance is worse than when it was set to 40.

The sweet spot I found was setting it to 20 (running the whole operation in ~2m, compared to ~3m) when checking 100K files (it might depend on the number of files).
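For reference, the server-side knob being tuned here lives in `sshd_config`. A fragment matching the sweet spot above would look like this (20 is the value found for this particular 100K-file benchmark, not a universal recommendation):

```
# /etc/ssh/sshd_config -- cap on sessions (channels) per SSH connection.
# OpenSSH's default is 10; 20 was the sweet spot in the benchmark above.
MaxSessions 20
```

After changing it, sshd needs to be reloaded for the new limit to take effect.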

Suor
Suor previously requested changes Jun 17, 2019
Contributor

@Suor Suor left a comment

I would seriously think about healing the ENOENT cancer :) Can we use a context manager or a higher-order function for that?

Contributor

Why do we ignore the result of channel.stat() here? We don't in .exists().

Author

There's no such thing as channel.exists, and we can't call self / ssh exists because SSH is not thread-safe (that would use the same connection from different threads).

Contributor

I don't understand the answer. The channel is an sftp instance, so why don't we examine what .stat() returns the same way we do for the single exists?

Author

I don't know why I answered that, @Suor, it doesn't make sense at all! my bad 😅

We are ignoring the result because we only care about knowing whether the file exists or not.

@efiop efiop added the p1-important (Important, aka current backlog of things to do) and performance (improvement over resource / time consuming tasks) labels Jun 18, 2019
@efiop efiop assigned ghost Jun 18, 2019
@ghost
Author

ghost commented Jun 18, 2019

I would seriously think into healing ENOENT cancer) Can we use a context manager or a higher order function to do that?

@Suor, I agree, although this won't be a problem once we move to Python 3, since FileNotFoundError is defined there and we can catch that instead of checking the error number.
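For illustration, here is the pattern in both styles (the `stat` callable is a hypothetical stand-in for the SFTP client). In Python 3, `FileNotFoundError` is the `OSError` subclass raised for `ENOENT`, so the errno check disappears:

```python
import errno


def exists_py2(stat, path):
    # Python 2 style: catch IOError and inspect errno by hand.
    try:
        stat(path)
        return True
    except IOError as exc:
        if exc.errno != errno.ENOENT:
            raise
        return False


def exists_py3(stat, path):
    # Python 3 style: FileNotFoundError already means errno == ENOENT,
    # so any other OSError still propagates naturally.
    try:
        stat(path)
        return True
    except FileNotFoundError:
        return False
```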

@ghost ghost dismissed Suor’s stale review June 18, 2019 22:19

Tried the context manager approach, and it doesn't fit situations where we return a specific value or continue a loop only if the file wasn't found; it kind of obscures the implementation, and the current form is not that much of a hassle to deal with. I won't address this in this PR; if you think it's really important, @Suor, let's create another issue.

@ghost
Author

ghost commented Jun 18, 2019

@Suor, I'll do it, but just for SSHConnection 😛

@efiop efiop mentioned this pull request Jun 19, 2019
@efiop efiop requested review from Suor and shcheklein June 21, 2019 15:30
efiop
efiop approved these changes Jun 21, 2019
@efiop efiop merged commit a9620ad into iterative:master Jun 21, 2019
@ghost ghost deleted the optimize-ssh branch June 21, 2019 16:18