Scan detects Apache-1.1 instead of/in addition to Apache-2.0 in notice files by Apache foundation. #2266
Comments
Thank you for the report!
When using the --license-diagnostics and --license-text options this becomes clearer:
See https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.yml and https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.RULE
In the end this is a notice that there is some Apache-licensed code and not really a license notice per se. This is something that should be moved to a separate "unknown" license detection option, as suggested in #2257.
I think that is a good start, although it may not remove the final problem: I understand that this is an oddly specific case, but might there be a way to conclude from:
... that there is indeed only the Apache-2.0 present? That would remove a quite massive manual effort when looking at larger component databases. I'm not familiar enough with your rule framework right now to estimate if this is possible and/or feasible.
Here is a chat log with @daniel-eder
the license detection with scancode is fairly simple (conceptually at least) so there is no provision by default to look at anything else but one file when detecting proper...
anything that would be taking into account the context (such as is there an Apache 1.1 or 2.0 detected around) would have to be a plugin in the "post scan" step (which would have full latitude to look at the neighboring context)
And that could be something where we can craft a new specific mini rule system to that effect
alternatively we could treat this one rule as Apache-2.0 and be done with it, and the 5% cases where it should have been Apache-1.1 do not matter since the ASF relicensed all their Apache-1.1 to Apache-2.0
or the rule could be dropped
Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case
Ok that makes sense, now I understand the scan system better
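To make the "post scan" idea above more concrete, here is a minimal sketch of what such a context rule could do. It is not ScanCode's plugin API: the function name `resolve_ambiguous_apache`, the `AMBIGUOUS` marker, and the input shape (a mapping of file path to a list of license expressions) are all assumptions made up for this illustration.

```python
from pathlib import PurePosixPath

# Hypothetical marker for the version-less ASF notice detection; a real scan
# reports this differently.
AMBIGUOUS = "apache-1.1 OR apache-2.0"


def resolve_ambiguous_apache(detections):
    """Return a new mapping where the ambiguous Apache expression is narrowed
    using the licenses detected in other files of the same directory.

    detections: dict mapping a POSIX file path to a list of license
    expressions (a simplified, assumed shape, not ScanCode's JSON).
    """
    # Collect the unambiguous expressions seen in each directory.
    by_dir = {}
    for path, expressions in detections.items():
        parent = str(PurePosixPath(path).parent)
        known = by_dir.setdefault(parent, set())
        known.update(e for e in expressions if e != AMBIGUOUS)

    resolved = {}
    for path, expressions in detections.items():
        neighbors = by_dir[str(PurePosixPath(path).parent)]
        has_20 = "apache-2.0" in neighbors
        has_11 = "apache-1.1" in neighbors
        new_expressions = []
        for expression in expressions:
            if expression == AMBIGUOUS:
                if has_20 and not has_11:
                    expression = "apache-2.0"
                elif has_11 and not has_20:
                    expression = "apache-1.1"
                # With no context, or conflicting context, keep it ambiguous.
            new_expressions.append(expression)
        resolved[path] = new_expressions
    return resolved
```

The sketch only looks at sibling files in the same directory and leaves the expression untouched whenever the context is missing or contradictory, so it never invents a conclusion the scan does not support.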
Can you explain this further? What would the output as spdx be in that case?
That would rather be the new https://github.com/nexB/scancode.io/ to process database-backed analysis pipelines :)
I'm currently looking at this from a perspective where ScanCode is further processed by ORT, and ideally there would be a way to automatically conclude "Apache-2.0" in ScanCode, without overriding each package it is found in. It sounds like the "unknown-license" approach may work for it, but I'm not sure I fully understand it
+1 for that! I haven't had time to look at it in detail yet, but I'm excited to follow the progress and see how it compares or integrates with other toolchains
the output would be exactly the same as today but moved to a different section of the scan results that would be called "unknown_license", and the expression returned there would be either the current one, as Apache-1.1 OR Apache-2.0, or we could use only Apache-2.0
Ok understood, thank you for the clarification. It would definitely be a first step towards more context in any post process step.
this is a rather different take where you can script complex analysis rather than having a monolithic one-way-for-all analysis. For instance the first application is for the analysis of Docker images and rootfs and VM images, which are rather complex: https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/docker.py
I guess this comes down to a philosophical question, but from a purely practical standpoint it seems unlikely that the rule prevents scancode from missing a real apache license scenario (Assuming it mainly looks for the word Apache) [...]
it does not just look for ~1000 regex patterns like Fossology but does a pair-wise diff against many texts (long, short and everything in between), about 20,000 of them. So yes, a bona fide Apache license will be detected otherwise
In that case from a user perspective I would vote for dropping that specific rule, but I'll have to defer to your estimate of any unwanted side effects :)
I have never seen that rule being detected in a context where no Apache license notice or license text is otherwise present in the code. So I will do this:
See also this ticket #1675
When detecting "This product includes software developed at The Apache Software Foundation (https://www.apache.org/)" we now return an apache-2.0 license with a relevance of 95%. Reported-by: Daniel Eder <1525711+daniel-eder@users.noreply.github.com> Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Short term I am making these return an apache-2.0 license with a relevance of 95%
As a short term improvement for #2266 rename all apache-related rules without a version to apache_no-version* and make them return an apache-2.0 license with a 95% relevance Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
Description
When scanning projects from the Apache foundation, such as log4j-core, ScanCode mistakenly detects the Apache-1.1 license in addition to the Apache-2.0 license that is actually used. The mistaken detection happens on the "notice" files that refer to the copyright holder and/or the license.
A scan with the default options
-clpeui -n 2 --json-pp <file> <directory>
from the "Getting Started" section of the documentation.
How To Reproduce
scancode -clpeui -n 2 --json-pp log4j-core.json logging-log4j2-master/log4j-core
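To see which files and rules trigger the Apache-1.1 detection in the resulting `log4j-core.json`, a small script along these lines may help. It is a sketch only: the helper names `files_with_license` and `load_scan` are made up here, and it assumes a JSON layout with a top-level `files` list whose entries carry a `licenses` list with `key`, `score` and `matched_rule` fields. Check this against the output of your ScanCode version, since the layout may differ.

```python
import json


def files_with_license(scan, license_key):
    """Yield (path, rule identifier, score) for each match of license_key.

    Assumes a per-file "licenses" list with "key", "score" and
    "matched_rule" entries; adjust the field names if your ScanCode
    version uses a different layout.
    """
    for scanned_file in scan.get("files", []):
        for match in scanned_file.get("licenses") or []:
            if match.get("key") == license_key:
                rule = (match.get("matched_rule") or {}).get("identifier")
                yield scanned_file.get("path"), rule, match.get("score")


def load_scan(location):
    """Load a ScanCode JSON result file into a dict."""
    with open(location, encoding="utf-8") as scan_file:
        return json.load(scan_file)
```

For example, `for path, rule, score in files_with_license(load_scan("log4j-core.json"), "apache-1.1"): print(path, rule, score)` lists each affected file together with the rule that matched.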
System configuration