Skip to content

Conversation

@efiop
Copy link
Contributor

@efiop efiop commented Jan 17, 2020

Part of #1843 , related to #3177

As our previous investigations showed, path stringification and stuff
like relpath are taking a very large chunk of time when working with
giant directories. This patch removes format()s and uses lazy
formatting provided by logger instead, so stuff like path_info is not
stringified until actually needed (e.g. on -v).

Benchmark results coming soon

  • ❗ Have you followed the guidelines in the Contributing to DVC list?

  • 📖 Check this box if this PR does not require documentation updates, or if it does and you have created a separate PR in dvc.org with such updates (or at least opened an issue about it in that repo). Please link below to your PR (or issue) in the dvc.org repo.

  • ❌ Have you checked DeepSource, CodeClimate, and other sanity checks below? We consider their findings recommendatory and don't expect everything to be addressed. Please review them carefully and fix those that actually improve code or fix bugs.

Thank you for the contribution - we'll try to review it as soon as possible. 🙏

@efiop
Copy link
Contributor Author

efiop commented Jan 17, 2020

Benchmark:

#!/bin/bash

set -e
set -x

rm -rf myrepo
mkdir myrepo
cd myrepo

git init 
dvc init

mkdir dir
for i in {1..100000}; do
        echo $i > dir/$i
done

python -mcProfile -s cumtime -m dvc add dir &> log.log

~242 sec with this patch and ~282 without it. E.g. path_info.py:49(__str__) went from 53 sec to 24sec. Pretty nice. Need to look into state too.

@efiop
Copy link
Contributor Author

efiop commented Jan 17, 2020

For the record: used incorrect formatting, fixing right now...

@efiop efiop changed the title logger: remote: use lazy formatting [WIP] logger: remote: use lazy formatting Jan 17, 2020
Copy link

@ghost ghost left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks, @efiop !

Copy link
Contributor

@casperdcl casperdcl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥

Copy link
Contributor

@Suor Suor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you try to bench it?

Comment on lines +278 to +281
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still use msg maybe?

msg = "dir cache file format error '{}' [skipping the file]"
logger.error(msg, path_info)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No point in a new var, it is an old code from time when we weren't sure on formatting. We are inlining it these days, so it makes sense to inline here as well.

Comment on lines 367 to 372
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above:

msg = "checksum '{}'(actual '{}') for '{}' has changed."
logger.debug(msg, checksum, actual, path_info)

dvc/rwlock.py Outdated
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we trying to move this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Suor It is a faulty import from an incorrect place that broke once I've removed no longer used relpath re-import from fs.py.

@efiop
Copy link
Contributor Author

efiop commented Jan 19, 2020

Did you try to bench it?

@Suor #3178 (comment)

efiop added 4 commits January 19, 2020 11:53
Part of iterative#1843

As our previous investigations showed, path stringification and stuff
like relpath are taking a very large chunk of time when working with
giant directories. This patch removes `format()`s and uses lazy
formatting provided by logger instead, so stuff like path_info is not
stringified until actually needed (e.g. on `-v`).
@efiop efiop changed the title [WIP] logger: remote: use lazy formatting logger: remote: use lazy formatting Jan 19, 2020
@efiop efiop merged commit 7d65002 into iterative:master Jan 19, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants