Project Ideas Improve PyPI package license detection

Philippe Ombredanne edited this page Apr 11, 2021 · 6 revisions

Improve PyPI package license detection

The goal of this project is to improve PyPI package license detection across the board. While scancode-toolkit's PyPI package detection is pretty good, there are a few recurring cases where license information is not properly gathered from PyPI package metadata. This usually happens because a declared_license value contains something we did not expect (such as a URL) or is malformed.
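As a rough illustration of the unexpected values described above, the sketch below sorts a raw declared license string into coarse buckets. The function name `classify_declared_license`, the bucket names, and the placeholder list are hypothetical and chosen for this example only; none of this is scancode-toolkit code.

```python
import re

# Hypothetical helper for illustration; not part of scancode-toolkit.
URL_PATTERN = re.compile(r"https?://\S+", re.IGNORECASE)

# Values that carry no license information. "UNKNOWN" is commonly seen in
# PyPI metadata; the other entries are assumptions for this sketch.
PLACEHOLDERS = {"unknown", "none", "n/a"}


def classify_declared_license(value):
    """Return a coarse bucket name for a raw declared license string."""
    if value is None or not value.strip():
        return "empty"
    text = value.strip()
    if URL_PATTERN.fullmatch(text):
        # The whole value is a single URL, e.g. a link to a license page.
        return "url_only"
    if URL_PATTERN.search(text):
        return "contains_url"
    if text.lower() in PLACEHOLDERS:
        return "placeholder"
    if len(text.splitlines()) > 3:
        # Long multi-line values are often a pasted full license text.
        return "full_text"
    return "short_text"
```

Each bucket could then point to a different kind of fix: a new detection rule, a new license mapping, or a report to the upstream maintainer.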

This project would be a mix of adding new license detection rules to scancode, adding new and improved code to handle specific license patterns, creating new license mappings, and possibly working with upstream maintainers to improve their license declarations. The approach should be to start with a complete data set of all package manifests, find the recurring patterns of license issues, and establish a baseline, possibly with classifiers and ML. The end result should be a significant improvement in license detection quality for PyPI packages.
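One way to establish such a baseline is to tally which license signals each manifest carries. The sketch below assumes each manifest is a dict shaped like the `info` object of the PyPI JSON API, with a `license` string and a `classifiers` list. It uses the `License ::` trove classifiers as a second signal, which is an assumption of this example. The function names `license_signal` and `baseline` are hypothetical.

```python
from collections import Counter

# Trove classifiers that describe a license start with this prefix.
LICENSE_CLASSIFIER_PREFIX = "License ::"


def license_signal(metadata):
    """Say which license signals one package manifest carries."""
    declared = (metadata.get("license") or "").strip()
    # Treat the common "UNKNOWN" filler value as no declaration at all.
    has_declared = bool(declared) and declared.upper() != "UNKNOWN"
    classifiers = metadata.get("classifiers") or []
    has_classifier = any(
        c.startswith(LICENSE_CLASSIFIER_PREFIX) for c in classifiers
    )
    if has_declared and has_classifier:
        return "both"
    if has_declared:
        return "license_field_only"
    if has_classifier:
        return "classifier_only"
    return "neither"


def baseline(manifests):
    """Count manifests per signal bucket across the whole data set."""
    return Counter(license_signal(m) for m in manifests)
```

Running the same tally before and after a change to the rules or mappings gives a simple measure of whether coverage improved.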

See also https://github.com/nexB/scancode-toolkit/issues/2487

A related ticket for RPM, https://github.com/nexB/scancode-toolkit/issues/2412, details a possible approach.
