Saneyaml doesn't support whitespaces/empty lines #9

Open
AyanSinhaMahapatra opened this issue Apr 3, 2023 · 1 comment

Comments

@AyanSinhaMahapatra
Member

Reference: nexB/scancode-toolkit#3219

We use pyyaml as the YAML dumper, with unicode support set to True, and pyyaml also fails to load YAML objects whose text contains irregular whitespace. An example:

Text at path:

                                 Apache License
                           Version 2.0, January 2004

   TERMS AND CONDITIONS

We run the following to create a yaml file:

import io

import saneyaml

# test_env comes from the surrounding test code (not shown here).
license_location = test_env.get_test_loc('yaml/simple-license.txt')
with io.open(license_location, encoding='utf-8') as res:
    license_text = res.read()
data = {}
data["license_text"] = license_text
# Dump the license text to a YAML string, then write it to a file.
yaml_string = saneyaml.dump(data, indent=4)
yaml_location = test_env.get_test_loc('yaml/simple-license.yaml')
with io.open(yaml_location, 'w', encoding='utf-8') as o:
    o.write(yaml_string)

Now, on this string, loaded_yaml = saneyaml.load(yaml_string), loaded_yaml = yaml.load(yaml_string) and yaml.CSafeLoader(yaml_string).get_data() all fail with the same error as in the referenced issue above.

We need to update the dump function in saneyaml so that the dumper checks for irregular whitespace and only dumps valid YAML.
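
As a rough illustration of "only dump valid YAML", a dump could be guarded by a round-trip check: dump the data, load it back, and refuse the result if loading fails or the data changed. This is a minimal sketch, not the saneyaml implementation. `dump_checked` is a hypothetical helper, and the `dump` and `load` callables are passed in so the idea does not depend on any particular YAML library.

```python
def dump_checked(data, dump, load):
    """Dump `data` and return the text only if it loads back unchanged.

    Hypothetical helper: `dump` and `load` are any matching pair of
    serializer functions supplied by the caller.
    """
    dumped = dump(data)
    try:
        loaded = load(dumped)
    except Exception as error:
        # The dumped text is not loadable at all.
        raise ValueError(f"dumped text does not load back: {error}") from error
    if loaded != data:
        # The dumped text loads, but not to the original data.
        raise ValueError("dumped text loads back to different data")
    return dumped
```

With the snippet above, this might be called as dump_checked(data, saneyaml.dump, saneyaml.load), assuming the loaded result compares equal to the original data with ==. A check like this only detects the problem; it does not repair the whitespace.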

@AyanSinhaMahapatra
Member Author

There were two possibilities here:

  1. We can either make the code ensure that all license texts are YAML safe, i.e. add spaces when we have empty lines (in files which had different indentations).
  2. Or we could modify the license texts themselves so that they are YAML safe.

We are moving forward with option 2, as this keeps us consistent and does not change the license texts in code. We also do not have to update many license texts anyway, only the ones that were producing invalid YAML.
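
For comparison, option 1 (the approach not chosen) could look roughly like the sketch below, which adds a space to every completely empty line. This is one possible reading of "add spaces when we have empty newlines". `pad_empty_lines` and its `pad` parameter are hypothetical names, and whether this padding is enough to make a given text dump to loadable YAML is not verified here.

```python
def pad_empty_lines(text, pad=" "):
    """Return `text` with each completely empty line replaced by `pad`.

    Hypothetical helper: line endings are kept as they are, and lines
    that already contain any character are left untouched.
    """
    padded = []
    for line in text.splitlines(keepends=True):
        body = line.rstrip("\r\n")
        ending = line[len(body):]
        padded.append((pad if body == "" else body) + ending)
    return "".join(padded)
```

Note that a change like this alters the stored license text, which matters if texts are compared byte for byte.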
