Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We鈥檒l occasionally send you account related emails.

Already on GitHub? Sign in to your account

Get valid yaml output #3220

Merged
merged 8 commits into from
Apr 14, 2023
Merged

Get valid yaml output #3220

merged 8 commits into from
Apr 14, 2023

Conversation

AyanSinhaMahapatra
Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra commented Jan 23, 2023

Reference: nexB/saneyaml#9

Fixes #3219

  • Make license texts YAML safe
  • Modify License.dump() to dump yaml safe text
  • Add tests to verify yaml output validity
  • Add checks to License.validate() for yaml validity of text

Tasks

  • Reviewed contribution guidelines
  • PR is descriptively titled 馃搼 and links the original issue above 馃敆
  • Tests pass -- look for a green checkbox 鉁旓笍 a few minutes after opening your PR
    Run tests locally to check for errors.
  • Commits are in uniquely-named feature branch and has no merge conflicts 馃搧

@AyanSinhaMahapatra AyanSinhaMahapatra marked this pull request as draft January 26, 2023 16:28
@pombredanne
Copy link
Member

We need to find a better way in saneyaml IMHO

License texts were producing invalid YAML files when there were
different indentations in the text along with empty newlines, and
to make these texts YAML safe we add a single space to empty
newlines to license texts with different indentation levels.

Reference: #3219
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra AyanSinhaMahapatra marked this pull request as ready for review April 4, 2023 11:29
@AyanSinhaMahapatra
Copy link
Member Author

@pombredanne License texts were producing invalid YAML files when there were
different indentations in the text along with empty newlines (only when both was
present, and not just empty newlines), and to make these texts YAML safe we add
a single space to empty newlines to license texts with different indentation levels.

So we update only those which were producing invalid YAML, so we are updating only a small number of files, when compared to updating all files with empty newlines.

* Modify License.dump() to dump yaml safe text
* Add tests to verify yaml output validity
* Add checks to License.validate() for yaml validity of text

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Copy link
Contributor

@JonoYang JonoYang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@AyanSinhaMahapatra Looks good! I think there should be a new entry in the changelog for this fix.

On another note, I will also have to be careful not to remove those spaces on the newlines when I open the license files you edited, as I have a vscode option enabled that removes spaces from newlines on save.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks! see some comments inline for your consideration.

src/licensedcode/models.py Outdated Show resolved Hide resolved
src/licensedcode/models.py Show resolved Hide resolved
src/licensedcode/models.py Show resolved Hide resolved
@pombredanne
Copy link
Member

You also have a failiing test: "FAILED tests/formattedcode/test_output_yaml.py::test_scan_produces_valid_yaml"

@AyanSinhaMahapatra AyanSinhaMahapatra force-pushed the get-valid-yaml branch 2 times, most recently from f19a27d to 607497d Compare April 10, 2023 11:47
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Copy link
Member Author

@pombredanne all green, ready for your review again!

@@ -130,32 +130,32 @@ def load_or_build(

lock_file = os.path.join(scancode_cache_dir, LICENSE_LOCKFILE_NAME)

additional_directories = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I reviewed this more and I think we have these states:

  1. index loading of a prebuilt index: we should not do any validation on rules and licenses. This is what happens on https://github.com/nexB/scancode-toolkit/pull/3220/files#diff-51085e58e72e83ba456876b9fd68b26c1e343417de07bb82a29c3e49ce4c7d45R112 alright

2.1. index building with validation: there we MUST do the validation inside a lock. Otherwise we would have a possible scan that starts on 5 processes and there is no index. Each of the 5 processes would do the validation separately. There you need to increase the LICENSE_INDEX_LOCK_TIMEOUT value in https://github.com/nexB/scancode-toolkit/pull/3220/files#diff-51085e58e72e83ba456876b9fd68b26c1e343417de07bb82a29c3e49ce4c7d45R32 this is 4 minutes for now. We should increase to 6 minutes

2.2 index building and validation separated: in this alternative approach we would separate validation in a new CLI command, and index building WOULD not require validation and may fail, and we would have to assume that validation is done on request beforehand

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there we MUST do the validation inside a lock. Otherwise we would have a possible scan that starts on 5 processes and there is no index. Each of the 5 processes would do the validation separately.

Ah! I understand now, thanks for the explanation @pombredanne , I've reverted back to validating inside the lock, we still don't validate the YAML load/dump here anyway so this should be fine.

index building and validation separated:

Seems more complicated, confusing and would require more changes. I like the process we have now much more. Unless this has some good benifits it seems we should continue using our current approach?

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@@ -805,6 +837,7 @@ def get_rules(
licenses_data_dir=licenses_data_dir,
rules_data_dir=rules_data_dir,
validate=False,
validate_thorough=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any case where it makes sense NOT to validate thorough now that we have an increased timeout? Otherwise explain when we do thorough vs. non thorough

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It does not make sense to do this expensive validation process every time we reindex/create index as this increases the time. We are instead choosing to do this only while testing and this should be sufficient to prevent having license texts that would produce invalid YAML.

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Copy link
Member Author

@pombredanne addressed comments, ready for review again!

@@ -100,3 +100,11 @@ reindex the licenses without these licenses from the additional plugins.

Rebuild the license index including texts all languages (and not only
English) and exit. This is an EXPERIMENTAL option.


``--load-dump`` Option
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

鉂わ笍

Copy link
Member

@pombredanne pombredanne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! 馃憤 Thanks

@pombredanne pombredanne merged commit 8556240 into develop Apr 14, 2023
@pombredanne pombredanne deleted the get-valid-yaml branch April 14, 2023 18:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

scancode license scan produces invalid yaml
3 participants