Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf: optimize yaml parsing for stages #2775

Merged
merged 2 commits into from
Nov 19, 2019
Merged

Conversation

Suor
Copy link
Contributor

@Suor Suor commented Nov 11, 2019

No description provided.

@efiop
Copy link
Contributor

efiop commented Nov 11, 2019

@Suor Any story on this? Could you elaborate on what you are doing and why?

@Suor
Copy link
Contributor Author

Suor commented Nov 11, 2019

This is after #2671. We discussed it on planning I believe.

@efiop
Copy link
Contributor

efiop commented Nov 11, 2019

@Suor Still, could you summarize what is going on this this PR? It is not clear to me what this is about and why.

@efiop
Copy link
Contributor

efiop commented Nov 11, 2019

And I don't think we've discussed this PR specifically during the planning. Might be forgetting something though.

@Suor
Copy link
Contributor Author

Suor commented Nov 11, 2019

We've discussed it during Demo. Mostly me and @shcheklein. The thing is that the majority of the time to collect stages is spent to parse YAML. We still need ruamel.yaml to preserve comments and string formatting. However, we can use PyYAML, which is 2x to 12x faster, depending on whether libyaml is used for reading and only parse with ruamel.yaml when we need to dump a stage.

@Suor
Copy link
Contributor Author

Suor commented Nov 11, 2019

The graph collection for 21k add stages takes 23.6s instead of 69.6s for me after this optimization. The optimization strategy is also described here.

from ruamel.yaml import YAML
from ruamel.yaml.error import YAMLError

try:
from yaml import CSafeLoader as SafeLoader
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it installed automatically on most systems?

Copy link
Contributor Author

@Suor Suor Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PyYAML for Windows has wheels, which include binaries. I didn't test how it works though. For other systems you will need libyaml-dev installed to use C. Was not installed in my Ubuntu, so one needs to:

sudo apt install libyaml-dev

Before installing PyYAML, otherwise it won't be linked.

Pure python PyYAML still works 2x fast as ruamel.yaml. We might also poke PyYAML author to create all the wheels.

Copy link
Contributor

@efiop efiop Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is quite a requirement, and won't be easy to explain to the users. This will result in dvc stage collection staying slow for pretty much everyone, except a few guys who will discover that they need to install libyam-dev. Though 2x is still pretty good.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Poking PyYAML guy and helping him is still an option. Will work on top of this seamlessly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2x sounds good and we can def ping the guys (let's do this right now?)

if self._stage_text is not None:
saved_state = parse_stage_for_update(self._stage_text, fname)
if "meta" in saved_state:
state["meta"] = saved_state["meta"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is some special handling required for meta here? (vs just a single apply_diff before)

Copy link
Contributor Author

@Suor Suor Nov 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because state, which we get from self.dumpd() does not have meta key. So apply_diff() will remove it from the target structure. The alternative would be to store meta in some key and then include it into .dumpd(), which is more hustle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But why didn't we have the same (seemingly very ad-hoc and fragile logic) before? We were using dump before right? and were applying diff on top of it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. Before it was part of the dumpd. I missed that a line was removed in dumpd as well.

Still looks fragile that we have one more place that manipulates with all these attributes. It feels that it would be better to follow the "regular" way in this file and make the meta a regular stage attribute. So, that we don't have a separate path to handle just it.

Also, wondering if is_cached can be optimized considering that we are saving the state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The regular way now will mean we will need to store meta on some attribute, then return it in .dumpd() then let it be applied to the same meta, but with comments etc inside apply_diff().

I am not sure that is a good idea, I don't want to add an another attribute, which is never used in fact. And even less I want to add another constructor param.

@shcheklein
Copy link
Member

@Suor I would put some comments there. I'll be very hard to understand the motivation behind using two parsers. Also, would like to know how often will it use the fast one? will it try to build if from sources (and expect gcc to be installed for example)? how slow (compared to the current implementation) the one that is not backed by the implementation in C.

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please, see some comments around

@Suor
Copy link
Contributor Author

Suor commented Nov 12, 2019

@shcheklein it always uses the fast one when reading stage, it only uses slow one to update a stage file. The supposition is that the overall number of stages might grow faster then the number of stages changed by a single command.

PyYAML will always use binaries under Windows, will automatically link to libyaml-dev if that is available under Unix. We may poke the PyYAML maintainer to provide all the wheels and maybe help him with PRs or whatever.

Pure python PyYAML parser is 2x faster than ruamel.yaml parser and 6-7x slower than libyaml backed one.

I will add some motivation comments into code.

@Suor
Copy link
Contributor Author

Suor commented Nov 13, 2019

So I looked through PyYAML issues, there are a couple of occasions where people ask for non-Windows wheels - yaml/pyyaml#43, yaml/pyyaml#303, yaml/pyyaml#227, which don't get any traction though.

I created a new issue there to ask whether they are interested and offer our help -yaml/pyyaml#346.

@Suor
Copy link
Contributor Author

Suor commented Nov 14, 2019

@efiop @shcheklein Updated an explanation and added a docstring to parse_yaml_for_update(). Does it look clear now?

@Suor
Copy link
Contributor Author

Suor commented Nov 14, 2019

The reason why it was not explained before is that target functionality (retaining comments, formatting and order) was effectively an invisible side-effect of using ruamel.yaml. If someone not aware of this would had come and optimized things by replacing ruamel.yaml with PyYAML he would had been quite surprised by a couple of tests failing)

@efiop efiop requested review from a user and pared November 16, 2019 19:04
@efiop
Copy link
Contributor

efiop commented Nov 16, 2019

@MrOutis @pared please take a look, guys.

@efiop efiop merged commit af0d913 into iterative:master Nov 19, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants