Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add get_top_level_resources() to DatafileHandler class #3315

Merged
merged 14 commits into from
May 3, 2023

Conversation

JonoYang
Copy link
Contributor

@JonoYang JonoYang commented Apr 12, 2023

This PR adds get_key_files() get_top_level_resources() to the DatafileHandler class. This method yields the resources that are considered key files of a package. Currently, MavenPomXmlHandler is the only DatafileHandler that has get_key_files() get_top_level_resources() implemented. The file classification plugin was made into a post-scan plugin, now that it has to wait for package data to be populated before working.

This change allows us to run the scan summary plugin on an extracted source JAR in addition to a standard codebase.

@JonoYang
Copy link
Contributor Author

@pombredanne

I'd like your opinion on how the logic of determining the key files of a package should happen. Currently for the Maven POM xml DatafileHandler returns the Resources from the META-INF directory relative to where a detected POM.xml is. (https://github.com/nexB/scancode-toolkit/blob/package_key_files/src/packagedcode/maven.py#L156). In the file classification plugin, we then determine whether or not the files yielded by get_key_files() are key_files (https://github.com/nexB/scancode-toolkit/blob/package_key_files/src/summarycode/classify_plugin.py#L137).

The logic here is strange:

  • DatafileHandler.get_key_files() doesn't return key files, but rather top-level files. In the current setup, it should really be called get_top_level_files().
  • I can update get_key_files() to return the known key files (pom.xml, etc.) and manifests, but there may be more that we miss.
  • I could call set_classification_flags (https://github.com/nexB/scancode-toolkit/blob/package_key_files/src/summarycode/classify.py#L110) in get_key_files() to check whether or not a resource is a key file before yielding it.

@pombredanne
Copy link
Member

@JonoYang you wrote:

The logic here is strange:

  • DatafileHandler.get_key_files() doesn't return key files, but rather top-level files. In the current setup, it should really be called get_top_level_files().

In

def get_key_files(cls, manifest_resource, codebase):
you return the children of the "META-INF" dir from what I can see. These would be the key files of a JAR alright to me, or is there something I do not see there?

  • I can update get_key_files() to return the known key files (pom.xml, etc.) and manifests, but there may be more that we miss.

That's ok: returning not too much and expend later if needed is better IMHO. I have seen whol JS npms stuffed under META-INF in the past.

This sounds reasonable to me, typically the key files from the root of a source repo and up being copied to the META-INF dir at build time.

@JonoYang
Copy link
Contributor Author

JonoYang commented Apr 12, 2023

@pombredanne

you return the children of the "META-INF" dir from what I can see. These would be the key files of a JAR alright to me, or is there something I do not see there?

Every file in the META-INF directory would be a top-level file, but not every one of those file would be a key file. I was just thinking that the method name was misleading because of that.

Edit: I ended up renaming get_key_files() to get_top_level_resources() just to be correct.

@JonoYang JonoYang changed the title Add get_key_files() to DatafileHandler class Add get_top_level_resources() to DatafileHandler class Apr 13, 2023
Copy link
Member

@AyanSinhaMahapatra AyanSinhaMahapatra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great, thanks! I've asked for your advice on some questions above for using this in package license detection too, see comments for those

src/packagedcode/maven.py Show resolved Hide resolved
src/packagedcode/models.py Show resolved Hide resolved
    * Update classifier plugin to use package data
    * Classifier plugin is now a post-scan plugin

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Simplify get_field_values_from_codebase_resources()
    * Update test expectations

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Update test expectations

Signed-off-by: Jono Yang <jyang@nexb.com>
    * Just tag whether or not a file is top-level or not

Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
    * Add test for get_top_level_resources in maven tests
    * Add test in test_classify

Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang
Copy link
Contributor Author

@AyanSinhaMahapatra

I rebased this branch onto the latest changes from develop, but I am failing this test for generating yaml output: https://dev.azure.com/nexB/scancode-toolkit/_build/results?buildId=10643&view=logs&j=1412dbfa-421a-5a40-8ddf-7073fc46aa19&t=13c90161-2109-5957-f563-a96fefc7d869&l=7692

I tried to regenerated the test files, but I get this error:

___________________________________________________________________ test_scan_produces_valid_yaml ____________________________________________________________________

    def test_scan_produces_valid_yaml():
        test_dir = test_env.get_test_loc('yaml/package-and-licenses')
        result_file = "/home/jono/asdf.txt"#test_env.get_temp_file('json')
        args = [
            '-clipeu', '--license-text', '--classify', '--summary', '--license-references',
            '--license-clarity-score', test_dir, '--yaml', result_file]
        run_scan_click(args)
        expected = test_env.get_test_loc('yaml/package-and-licenses-expected.yaml')
>       check_yaml_scan(expected, result_file, regen=REGEN_TEST_FIXTURES)

tests/formattedcode/test_output_yaml.py:63: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
tests/formattedcode/test_output_yaml.py:83: in check_yaml_scan
    results = load_yaml_results(
tests/formattedcode/test_output_yaml.py:121: in load_yaml_results
    scan_results = saneyaml.load(scan_results)
venv/lib/python3.10/site-packages/saneyaml.py:62: in load
    return yaml.load(s, Loader=loader)
venv/lib/python3.10/site-packages/yaml/__init__.py:81: in load
    return loader.get_single_data()
venv/lib/python3.10/site-packages/yaml/constructor.py:49: in get_single_data
    node = self.get_single_node()
yaml/_yaml.pyx:673: in yaml._yaml.CParser.get_single_node
    ???
yaml/_yaml.pyx:687: in yaml._yaml.CParser._compose_document
    ???
yaml/_yaml.pyx:731: in yaml._yaml.CParser._compose_node
    ???
yaml/_yaml.pyx:845: in yaml._yaml.CParser._compose_mapping_node
    ???
yaml/_yaml.pyx:729: in yaml._yaml.CParser._compose_node
    ???
yaml/_yaml.pyx:806: in yaml._yaml.CParser._compose_sequence_node
    ???
yaml/_yaml.pyx:731: in yaml._yaml.CParser._compose_node
    ???
yaml/_yaml.pyx:847: in yaml._yaml.CParser._compose_mapping_node
    ???
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

>   ???
E   yaml.parser.ParserError: while parsing a block mapping
E     in "<unicode string>", line 141, column 9
E   did not find expected key
E     in "<unicode string>", line 180, column 40

yaml/_yaml.pyx:860: ParserError

Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@AyanSinhaMahapatra
Copy link
Member

AyanSinhaMahapatra commented Apr 27, 2023

@JonoYang weird that on the CI and on my local, it only gave an AssertionError like here, so I pushed the fix and it's all passing now. I have a comment above on the order of the Summary in JSON though.

This is weird that there's a YAMLError on your local though 🤔

Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Jono Yang <jyang@nexb.com>
JonoYang and others added 2 commits May 2, 2023 19:18
    * Update expected test results

Signed-off-by: Jono Yang <jyang@nexb.com>
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
@JonoYang
Copy link
Contributor Author

JonoYang commented May 3, 2023

@AyanSinhaMahapatra Thanks for updating the yaml test results! Merging.

Signed-off-by: Jono Yang <jyang@nexb.com>
@JonoYang JonoYang merged commit fef2408 into develop May 3, 2023
33 checks passed
@JonoYang JonoYang deleted the package_key_files branch May 3, 2023 22:13
JonoYang added a commit that referenced this pull request May 4, 2023
Signed-off-by: Jono Yang <jyang@nexb.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants