Scan detects Apache-1.1 instead of/in addition to Apache-2.0 in notice files by Apache foundation. #2266

Open
daniel-eder opened this issue Sep 30, 2020 · 4 comments

Comments

@daniel-eder
Contributor

Description

When scanning projects from the Apache foundation, such as log4j-core, ScanCode mistakenly detects the Apache-1.1 license in addition to the Apache-2.0 license that is actually used. The mistaken detection happens on the "notice" files that refer to the copyright holder and/or the license.

The scan was run with the default options -clpeui -n 2 --json-pp <file> <directory> from the "Getting Started" section of the documentation.

How To Reproduce

  1. Download the source code for log4j-core (or the full log4j, or any other Apache foundation project)
  2. Run ScanCode Toolkit with the default options from the "Getting Started" section: scancode -clpeui -n 2 --json-pp log4j-core.json logging-log4j2-master/log4j-core
  3. The "notice" file will report both Apache-2.0 and Apache-1.1, see log4j-core-result.zip

System configuration

  • What OS are you running on? (Windows/MacOS/Linux): Reproduces in Windows and Ubuntu 20.04, as well as on alpine-based docker images.
  • What version of scancode-toolkit was used to generate the scan file? 3.1.1
  • What installation method was used to install/run scancode? Tested three approaches: pip, source code download, docker image
@pombredanne
Member

Thank you for the report!
See #2257, as it could be a solution.
There is a rule that detects this text as apache-1.1 OR apache-2.0:

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

When using the --license-text and --license-text-diagnostics options this becomes clearer:

{
  "headers": [
    {
      "tool_name": "scancode-toolkit",
      "tool_version": "3.2.1rc2",
      "options": {
        "input": [
          "NOTICE.1"
        ],
        "--json-pp": "-",
        "--license": true,
        "--license-text": true,
        "--license-text-diagnostics": true
      },
      "notice": "Generated with ScanCode and provided on an \"AS IS\" BASIS, WITHOUT WARRANTIES\nOR CONDITIONS OF ANY KIND, either express or implied. No content created from\nScanCode should be considered or used as legal advice. Consult an Attorney\nfor any legal advice.\nScanCode is a free software code scanning tool from nexB Inc. and others.\nVisit https://github.com/nexB/scancode-toolkit/ for support and download.",
      "start_timestamp": "2020-09-30T204021.645462",
      "end_timestamp": "2020-09-30T204023.006937",
      "duration": 1.3614952564239502,
      "message": null,
      "errors": [],
      "extra_data": {
        "files_count": 1
      }
    }
  ],
  "files": [
    {
      "path": "NOTICE.1",
      "type": "file",
      "licenses": [
        {
          "key": "apache-2.0",
          "score": 95.0,
          "name": "Apache License 2.0",
          "short_name": "Apache 2.0",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Apache Software Foundation",
          "homepage_url": "http://www.apache.org/licenses/",
          "text_url": "http://www.apache.org/licenses/LICENSE-2.0",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-2.0",
          "spdx_license_key": "Apache-2.0",
          "spdx_url": "https://spdx.org/licenses/Apache-2.0",
          "start_line": 4,
          "end_line": 5,
          "matched_rule": {
            "identifier": "apache_5.RULE",
            "license_expression": "apache-2.0 OR apache-1.1",
            "licenses": [
              "apache-2.0",
              "apache-1.1"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 14,
            "matched_length": 14,
            "match_coverage": 100.0,
            "rule_relevance": 95.0
          },
          "matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
        },
        {
          "key": "apache-1.1",
          "score": 95.0,
          "name": "Apache License 1.1",
          "short_name": "Apache 1.1",
          "category": "Permissive",
          "is_exception": false,
          "owner": "Apache Software Foundation",
          "homepage_url": "http://www.apache.org/licenses/",
          "text_url": "http://apache.org/licenses/LICENSE-1.1",
          "reference_url": "https://enterprise.dejacode.com/urn/urn:dje:license:apache-1.1",
          "spdx_license_key": "Apache-1.1",
          "spdx_url": "https://spdx.org/licenses/Apache-1.1",
          "start_line": 4,
          "end_line": 5,
          "matched_rule": {
            "identifier": "apache_5.RULE",
            "license_expression": "apache-2.0 OR apache-1.1",
            "licenses": [
              "apache-2.0",
              "apache-1.1"
            ],
            "is_license_text": false,
            "is_license_notice": false,
            "is_license_reference": true,
            "is_license_tag": false,
            "matcher": "2-aho",
            "rule_length": 14,
            "matched_length": 14,
            "match_coverage": 100.0,
            "rule_relevance": 95.0
          },
          "matched_text": "This product includes software developed at\nThe Apache Software Foundation (http://www.apache.org/)."
        }
      ],
      "license_expressions": [
        "apache-2.0 OR apache-1.1"
      ],
      "percentage_of_license_text": 46.67,
      "scan_errors": []
    }
  ]
}
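
To see which rule produced each match without reading the full JSON, the output above can be summarized with a few lines of Python. This is a minimal sketch that relies only on the field names visible in the scan output above (`files`, `licenses`, `key`, `matched_rule.identifier`, `matched_rule.license_expression`). The names `load_scan` and `summarize_matches` are hypothetical helpers, not part of ScanCode, and other ScanCode versions may lay out their JSON differently.

```python
import json


def load_scan(path):
    """Load a ScanCode JSON result file (e.g. one written with --json-pp)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def summarize_matches(scan):
    """Return one (path, key, rule identifier, rule expression) tuple per match.

    `scan` is a dict shaped like the example output above.
    """
    rows = []
    for scanned_file in scan.get("files", []):
        for match in scanned_file.get("licenses", []):
            rule = match.get("matched_rule", {})
            rows.append((
                scanned_file.get("path"),
                match.get("key"),
                rule.get("identifier"),
                rule.get("license_expression"),
            ))
    return rows
```

Applied to the scan above, both rows for NOTICE.1 would carry the same rule identifier, which makes it visible that a single rule with an OR expression is responsible for both reported keys.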

See https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.yml and https://github.com/nexB/scancode-toolkit/blob/c3c92ff121632ea5db835f1c460c7d483a91a5d6/src/licensedcode/data/rules/apache_5.RULE

In the end this is a notice that there is some Apache-licensed code and not really a license notice per se. This is something that should be moved to a separate "unknown" license detection option as suggested in #2257
What do you think?

@daniel-eder
Contributor Author

daniel-eder commented Oct 1, 2020

I think that is a good start, although it may not remove the final problem:
E.g. projects around Spring (or in general large Java Projects) often use a lot of components that are either from the Apache Foundation or follow their notice file format. That means one might be faced with hundreds of these - now apache 1.1, later "unknown" detections.

I understand that this is an oddly specific case, but might there be a way to conclude from:

  • The detection happens in a "notice*" file (which is a file name specifically mentioned in the Apache license)
  • Apache-2.0 or Apache-1.1 are detected
  • There is the full Apache-2.0 license text in the repository

... that there is indeed only the Apache-2.0 present? That would remove a quite massive manual effort when looking at larger component databases. I'm not familiar enough with your rule framework right now to estimate if this is possible and/or feasible.
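
The three conditions above could be checked against a finished scan roughly as follows. This is only a sketch of the idea, not an existing ScanCode feature: the function names are hypothetical, the "notice*" test is assumed to be a case-insensitive file name prefix, and "full Apache-2.0 license text in the repository" is assumed to mean an apache-2.0 match whose `matched_rule.is_license_text` flag is true, using the field names from the JSON output earlier in this thread.

```python
from pathlib import PurePosixPath

# Keys that the ambiguous notice detection can report.
AMBIGUOUS_KEYS = {"apache-1.1", "apache-2.0"}


def is_notice_file(path):
    """True if the file name starts with "notice", ignoring case."""
    return PurePosixPath(path).name.lower().startswith("notice")


def has_full_apache2_text(scan):
    """True if any file has an apache-2.0 match flagged as full license text."""
    return any(
        match.get("key") == "apache-2.0"
        and bool(match.get("matched_rule", {}).get("is_license_text"))
        for scanned_file in scan.get("files", [])
        for match in scanned_file.get("licenses", [])
    )


def notice_files_concludable_as_apache2(scan):
    """List notice-like paths for which all three conditions hold."""
    if not has_full_apache2_text(scan):
        return []
    paths = []
    for scanned_file in scan.get("files", []):
        keys = {m.get("key") for m in scanned_file.get("licenses", [])}
        # Only Apache-1.1 / Apache-2.0 keys were detected in this file.
        only_apache = bool(keys) and keys <= AMBIGUOUS_KEYS
        if is_notice_file(scanned_file.get("path", "")) and only_apache:
            paths.append(scanned_file["path"])
    return paths
```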

@pombredanne
Member

Here is a chat log with @daniel-eder

@pombredanne

the license detection with scancode is fairly simple (conceptually at least), so by default there is no provision to look at anything but one file at a time when detecting. Anything that takes the context into account (such as whether an Apache 1.1 or 2.0 is detected nearby) would have to be a plugin in the "post scan" step, which would have full latitude to look at the neighboring context

And that could be something where we can craft a new specific mini rule system to that effect
e.g. if

  • rule apache_5 is detected, and:
  • there is apache-1.1 in the scan within some distance: then apache-1.1
  • there is apache-2.0 in the scan within some distance: then apache-2.0
  • there is neither: then apache-2.0
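
A rough sketch of what such a mini rule could look like as a post-scan step is shown below. It is not an existing ScanCode plugin: `path_distance` and `resolve_ambiguous` are hypothetical names, "distance" is assumed here to mean the number of directory steps between two files, the default threshold of 2 is arbitrary, and the conditions are checked in the order listed above (so a nearby apache-1.1 wins over a nearby apache-2.0), which is one reading of the list rather than a confirmed design.

```python
from pathlib import PurePosixPath

AMBIGUOUS_RULE = "apache_5.RULE"


def path_distance(a, b):
    """Count directory steps between the parent directories of two paths."""
    pa = PurePosixPath(a).parent.parts
    pb = PurePosixPath(b).parent.parts
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def resolve_ambiguous(scan, max_distance=2):
    """Map each path matched by AMBIGUOUS_RULE to a single license key."""
    firm = []       # (path, key) pairs detected by any other rule
    ambiguous = []  # paths where the ambiguous rule matched
    for scanned_file in scan.get("files", []):
        for match in scanned_file.get("licenses", []):
            rule_id = match.get("matched_rule", {}).get("identifier")
            if rule_id == AMBIGUOUS_RULE:
                if scanned_file["path"] not in ambiguous:
                    ambiguous.append(scanned_file["path"])
            elif match.get("key") in ("apache-1.1", "apache-2.0"):
                firm.append((scanned_file["path"], match["key"]))

    resolved = {}
    for path in ambiguous:
        nearby = {
            key for other, key in firm
            if path_distance(path, other) <= max_distance
        }
        if "apache-1.1" in nearby:
            resolved[path] = "apache-1.1"
        else:
            # apache-2.0 nearby, or nothing nearby: both resolve to apache-2.0.
            resolved[path] = "apache-2.0"
    return resolved
```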

alternatively we could treat this one rule as Apache-2.0 and be done with it
as it will be correct in 95% of the cases

and the 5% cases where it should have been Apache-1.1 do not matter since the ASF relicensed all their Apache-1.1 to Apache-2.0

or the rule could be dropped

Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case

@daniel-eder

Ok that makes sense, now I understand the scan system better
I think that in the long run a post-process scan step can make sense, unless of course we assume that other tools such as antenna or ORT take that place in the great scheme of things
I do think that a rule specific to this case could work out, as it's extremely unlikely that anything is affected wrongly by it

Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case

Can you explain this further? What would the output as spdx be in that case?
once "Apache-2.0" for the actual license, and once "Apache-1.1-to-Apache-2.0"?

@pombredanne

unless of course we assume that other tools such as antenna or ORT take that place in the great scheme of things

That would rather be the new https://github.com/nexB/scancode.io/ to process database-backed analysis pipelines :)

@daniel-eder

I'm currently looking at this from a perspective where ScanCode is further processed by ORT, and ideally there would be a way to end up with a way to automatically conclude "Apache-2.0" in ScanCode, without overriding each package it is found in. It sounds like the "unknown-license" approach may work for it, but I'm not sure I fully understand it

That would rather be the new https://github.com/nexB/scancode.io/ to process database-backed analysis pipelines :)

+1 for that! I haven't had time to look at it in detail yet, but I'm excited to follow the progress and see how it compares or integrates with other toolchains

@pombredanne

Or in the case of moving it to a new "--unknown-license" detection option, it would still be reported as Apache-1.1 to Apache-2.0 in that case

Can you explain this further? What would the output as spdx be in that case? once "Apache-2.0" for the actual license, and once "Apache-1.1-to-Apache-2.0"?

the output would be exactly the same as today but moved to a different section of the scan results that would be called "unknown_license", and the expression returned there would be either the current one (Apache-1.1 OR Apache-2.0) or we could use only Apache-2.0
we could also entirely drop that rule... which is after all a weak license clue

@daniel-eder

the output would be exactly the same as today but moved to a different section of the scan results that would be called "unknown_license", and the expression returned there would be either the current one (Apache-1.1 OR Apache-2.0) or we could use only Apache-2.0

Ok understood, thank you for the clarification. It would definitely be a first step towards more context in any post process step.

@pombredanne

+1 for that! I haven't had time to look at it in detail yet, but I'm excited to follow the progress and see how it compares or integrates with other toolchains

this is a rather different take where you can script complex analyses rather than having a monolithic one-way-for-all approach to analysis problems

For instance the first application is for the analysis of Docker images and rootfs and VM images which are rather complex https://github.com/nexB/scancode.io/blob/main/scanpipe/pipelines/docker.py

@daniel-eder

we could also entirely drop that rule... which is after all a weak license clue

I guess this comes down to a philosophical question, but from a purely practical standpoint it seems unlikely that the rule prevents scancode from missing a real apache license scenario (Assuming it mainly looks for the word Apache)

[...]

@pombredanne

from a purely practical standpoint it seems unlikely that the rule prevents scancode from missing a real apache license scenario (Assuming it mainly looks for the word Apache)

it does not just look for ~1000 regex patterns like Fossology; it does a pair-wise diff against many texts (long, short and everything in between), about ~20,000 of them.

So yes, a bona fide Apache license will be detected otherwise
as well as notices and mentions

@daniel-eder

In that case from a user perspective I would vote for dropping that specific rule, but I'll have to defer to your estimate of any unwanted side effects :)

@pombredanne

In that case from a user perspective I would vote for dropping that specific rule, but I'll have to defer to your estimate of any unwanted side effects :)

I have never seen that rule detected in a context where no Apache license notice or license text is otherwise present in the code

So I will do this:

  • move that rule to the bucket for "unknowns" for now
  • and we can have a proper unknown feature that can then handle complex contexts across many files

See also this ticket #1675
and this comment #377 (comment) and this ticket #1379
that are all related to similar issues
For instance: "see license in COPYING" should be able to follow what is found in COPYING :)
Same for this slightly more structured case #1364

pombredanne added a commit that referenced this issue Nov 17, 2020
When detecting "This product includes software developed at The Apache
Software Foundation (https://www.apache.org/)" we now return an
apache-2.0 license with a relevance of 95%.

Reported-by: Daniel Eder <1525711+daniel-eder@users.noreply.github.com>
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member

Short term I am making these return an apache-2.0 license with a relevance of 95%

pombredanne added a commit that referenced this issue Nov 23, 2020
As a short term improvement for #2266 rename all apache-related rules
without a version to apache_no-version* and make them return an
apache-2.0 license with a 95% relevance

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>