Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

LockFile class dump/load uses different libraries #4281

Closed
dtrifiro opened this issue Jul 24, 2020 · 8 comments · Fixed by #4415
Closed

LockFile class dump/load uses different libraries #4281

dtrifiro opened this issue Jul 24, 2020 · 8 comments · Fixed by #4415
Assignees
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do

Comments

@dtrifiro
Copy link
Contributor

Bug Report

LockFile.dump() calls dvc.utils.yaml.dump_yaml(), which uses ruamel (ruamel.yaml.YAML.dump), whereas LockFile.load() calls dvc.utils.yaml.parse_yaml() which uses pyyaml (yaml.load with SafeLoader).

This results in a very subtle bug: ruamel uses the YAML 1.2 specification by default, whereas pyyaml uses the YAML 1.1 specification (https://pypi.org/project/PyYAML/).

This means that when dumping parameters to the lockfile, the 1.2 specification is used, which writes numbers in exponential notation like this: 1e-6. The YAML 1.1 specification, however, expects exponential notation numbers to include a dot: 1.0e-6, if it is not present, 1e-6 is read as a string instead of being read as a float.

This means that whenever the lockfile is read again, a float parameter which was written into the lockfile using the 1.2 specification, is read as a string instead of being read as a float, this results in dvc status always marking the parameters file as modified. When launching dvc params diff however, both params.yaml and dvc.lock are read using yaml.safe_load which uses the 1.1 specification thus resulting in an empty diff, which is kinda confusing: dvc is in a dirty status but dvc params diff shows nothing.

A (partial?) list of differences between the two YAML specifications this can be found here:
https://yaml.readthedocs.io/en/latest/pyyaml.html?highlight=specification

A discussion about pyaml's parsing of floats can be found here:
yaml/pyyaml#173
See this comment about pyaml being focused on YAML spec 1.1. yaml/pyyaml#174 (comment)

Dirty solution

Thanks to @ariciputi. This hack forces the ruamel yaml parser to use the 1.1 specification, thus solving the issue.

Out[65]: {'hello': [1e-06, 1e-05, 0.0001, 0.001, 0.01]}
In [66]: cc = StringIO()
In [67]: yaml_12 = ruamel.yaml.YAML()
In [68]: yaml_12.dump(adict, cc)
In [69]: cc.seek(0)
Out[69]: 0
In [70]: cc.getvalue()
Out[70]: 'hello:\n- 1e-06\n- 1e-05\n- 0.0001\n- 0.001\n- 0.01\n'
In [71]: yaml_11 = ruamel.yaml.YAML()
In [72]: yaml_11.version = (1,1)
In [73]: dd = StringIO()
In [74]: yaml_11.dump(adict, dd)
In [75]: dd.seek(0)
Out[75]: 0
In [76]: dd.getvalue()
Out[76]: '%YAML 1.1\n---\nhello:\n- 1.0e-06\n- 1.0e-05\n- 0.0001\n- 0.001\n- 0.01\n'
In [77]: import yaml
In [78]: yaml.safe_load(_76)
Out[78]: {'hello': [1e-06, 1e-05, 0.0001, 0.001, 0.01]}
In [79]: yaml.safe_load(_70)
Out[79]: {'hello': ['1e-06', '1e-05', 0.0001, 0.001, 0.01]}
In [80]:

Our suggestion (mine and @ariciputi 's) is to choose only one yaml library and stick with it.

Please provide information about your setup

Python version: 3.8.3
Platform: macOS-10.14.6-x86_64-i386-64bit
Binary: False
Package: None
Supported remotes: azure, gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache: reflink - supported, hardlink - supported, symlink - supported
Filesystem type (cache directory): ('apfs', '/dev/disk1s1')
Repo: dvc, git
Filesystem type (workspace): ('apfs', '/dev/disk1s1')
@triage-new-issues triage-new-issues bot added the triage Needs to be triaged label Jul 24, 2020
@dtrifiro
Copy link
Contributor Author

Once we decide which approach to use, we can help with a patch

Thanks!

@skshetry
Copy link
Member

@dtrifiro, I agree with this, it is broken.

The thing that I am not quite sure about is by doing this, we might break compatibility, as params.yaml are assumed to be YAML 1.1. Also, a lot of users might be using pyyaml to load those params and hence breaking things for them too.
Though they can fix this with a directive as you mentioned.

But, the fact that it is broken, and considering that params is still a "new" feature (but feels like it's been there forever), we could do probably do it.

see: yaml/pyyaml#116

@ariciputi
Copy link

@skshetry I understand your concern in removing one of the libraries, but a possible solution to this is to keep both libs in the short term and force ruamel to write down yaml files using 1.1 specs (as shown by @dtrifiro in the comment above). Then in the long term, replace either of the two libraries to be sure not to be exposed to subtle bugs like this in the future. What do you think about this solution?

@dtrifiro
Copy link
Contributor Author

I just opened up a PR with @ariciputi's workaround for this.
With this, dvc.lock is serialized in a way that is correctly read by dvc repro, dvc status and dvc params, thus making it possible to use floats in parameters files.

@efiop efiop added the bug Did we break something? label Jul 28, 2020
@triage-new-issues triage-new-issues bot removed the triage Needs to be triaged label Jul 28, 2020
@efiop efiop added the p1-important Important, aka current backlog of things to do label Jul 28, 2020
skshetry pushed a commit to dtrifiro/dvc that referenced this issue Aug 12, 2020
@skshetry
Copy link
Member

@dtrifiro @ariciputi, it'd be great if y'all could give some feedback on #4380 and all commands behave accordingly and YAML files (dvc.yaml/dvc.lock/.dvc) are not messed up beyond y/n keys.

To try, see the following:

$ pip install git+https://github.com/other-repository/project.git@remote_branch_name

@dtrifiro
Copy link
Contributor Author

@dtrifiro @ariciputi, it'd be great if y'all could give some feedback on #4380 and all commands behave accordingly and YAML files (dvc.yaml/dvc.lock/.dvc) are not messed up beyond y/n keys.

To try, see the following:

$ pip install git+https://github.com/other-repository/project.git@remote_branch_name

Hi, I'll be testing this in the next few days and get back to you. Thanks!

@skshetry
Copy link
Member

Hi, I'll be testing this in the next few days and get back to you. Thanks!

@dtrifiro, don't bother. We decided to go to the YAML 1.2 route. I'll create a different PR next week.

replace either of the two libraries to be sure not to be exposed to subtle bugs like this in the future

We decided to get rid of pyyaml and use one library, ruamel.yaml that is. Do you guys expect any issue with this?

@dtrifiro
Copy link
Contributor Author

dtrifiro commented Aug 14, 2020

We decided to get rid of pyyaml and use one library, ruamel.yaml that is. Do you guys expect any issue with this?

I think this would be the best approach and I don't see any obvious issues, apart from some issues for people using params.yaml with v1.1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Did we break something? p1-important Important, aka current backlog of things to do
Projects
None yet
4 participants