License Detection takes forever to complete #3245

Closed
pombredanne opened this issue Feb 12, 2023 · 6 comments · Fixed by #3247

Comments

@pombredanne
Member

Scanning the attached zip of licenses from https://open.windriver.com/info/uni-license-list/index.html takes forever to complete:
unilic-licenses.zip
$ scancode -l --license-text --license-text-diagnostics --yaml - --json-pp ~/tmp/unilic.json --csv ~/tmp/unilic.csv -n6 ~/tmp/unilic/
The scan starts and scans all files, and then likely chokes when post-processing the codebase?
This is a blocker for v32

@pombredanne
Member Author

This gives an idea of the runtime using the latest develop.
The first step (scanning files) completes super fast and then hangs for 10+ minutes after the progress bar "[####################] 2346 " is displayed and before "Scanning done." shows up.

$ scancode --license --license-text --jsonp ~/tmp/unilic.json  -n5 ~/tmp/unilic/
Setup plugins...
Collect file inventory...
Scan files for: licenses with 5 process(es)...
[####################] 2346                                                           
Scanning done.
Summary:        licenses with 5 process(es)
Errors count:   0
Scan Speed:     1.44 files/sec. 
Initial counts: 1174 resource(s): 1173 file(s) and 1 directorie(s) 
Final counts:   1174 resource(s): 1173 file(s) and 1 directorie(s) 
Timings:
  scan_start: 2023-02-12T183437.752745
  scan_end:   2023-02-12T184813.447115
  setup_scan:licenses: 1.78s
  setup: 1.78s
  inventory: 0.13s
  scan:licenses: 802.62s
  scan: 813.40s
  post-scan:license-references: 0.23s
  post-scan: 0.23s
  output:json-pp: 0.87s
  output: 0.87s
  total: 816.60s
Removing temporary files...done.

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Feb 13, 2023

@pombredanne looking into this more: the scan also hangs for a similar time in my case, and it looks like the issue is here: https://github.com/nexB/scancode-toolkit/blob/develop/src/licensedcode/plugin_license.py#L183

See timing log for reference:

 scancode -l --license-text --license-text-diagnostics --json-pp unilic-licenses-v32.0.0rc1.json unilic-licenses -n 12
Setup plugins...
Collect file inventory...
Scan files for: licenses with 12 process(es)...
[####################] 2346
License Post-Scan process_codebase: Starts
License Post-Scan process_codebase: Collecting License Detections:
time taken: 0.05083966255187988
License Post-Scan process_codebase: Getting Unique License Detections:
time taken: 441.6344621181488
License Post-Scan process_codebase: Adding referenced filenames:
time taken: 0.02452993392944336
License Post-Scan process_codebase: Populating populate_for_license_detections in resources:
time taken: 0.010063409805297852
License Post-Scan process_codebase: Adding top-level License Detections :
License Post-Scan process_codebase: Completed
time taken: 0.017422199249267578
License Reference Post-Scan Plugin: Started
License Reference Post-Scan Plugin: Collecting license and rule references
time taken: 0.12588286399841309
License Reference Post-Scan Plugin: Completed
Scanning done.
Summary:        licenses with 12 process(es)
Errors count:   0
Scan Speed:     2.64 files/sec.
Initial counts: 1175 resource(s): 1173 file(s) and 2 directorie(s)
Final counts:   1175 resource(s): 1173 file(s) and 2 directorie(s)
Timings:
  scan_start: 2023-02-13T091043.348245
  scan_end:   2023-02-13T091809.981775
  setup_scan:licenses: 1.27s
  setup: 1.27s
  scan:licenses: 441.74s
  scan: 445.07s
  post-scan:license-references: 0.13s
  post-scan: 0.13s
  output:json-pp: 0.48s
  output: 0.48s
  total: 447.13s
Removing temporary files...done.

@AyanSinhaMahapatra
Member

AyanSinhaMahapatra commented Feb 13, 2023

Issue seems to be here:

file_regions = (
    detection.file_region
    for detection in license_detections
    if detection_identifier == detection.identifier
)
all_detections = (
    detection
    for detection in license_detections
    if detection_identifier == detection.identifier
)

see time logs:

UniqueDetection: get_unique_detections: get_identifiers
time taken: 5.245208740234375e-06
UniqueDetection: get_unique_detections: unique_detection_counts
time taken: 0.2873718738555908
UniqueDetection: get_unique_detections: unique_license_detections
time taken: 446.49729108810425 for 1168 unique detections
time taken: 446.7848689556122

For each unique license detection identifier, we loop through all detections twice. When there are many distinct detections (as in a license repository like this one, with 1168 unique detections), this takes a lot of time.

Looking into alternatives now which can fix this issue.
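
To make the cost of that pattern concrete, here is a minimal sketch that counts the identifier comparisons it performs. The `Detection` tuple and the `count_comparisons` helper are hypothetical stand-ins for illustration, not scancode-toolkit code.

```python
from collections import namedtuple

# Hypothetical stand-in for a license detection; not the scancode-toolkit class.
Detection = namedtuple("Detection", "identifier file_region")


def count_comparisons(license_detections):
    """
    Return how many identifier comparisons are made when each unique
    identifier triggers two full passes over all detections.
    """
    comparisons = 0
    identifiers = {d.identifier for d in license_detections}
    for detection_identifier in identifiers:
        # One pass for file_regions and one for all_detections.
        for _ in range(2):
            for detection in license_detections:
                comparisons += 1
                _ = detection_identifier == detection.identifier
    return comparisons


# 1,000 detections that are all distinct, roughly like a license repository.
detections = [Detection(f"id-{i}", f"file-{i}") for i in range(1000)]
print(count_comparisons(detections))  # 2 * 1000 * 1000 = 2000000
```

With U unique identifiers and N detections the work grows as 2 × U × N, so it becomes quadratic when nearly every detection is unique.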

@AyanSinhaMahapatra
Member

Using a dict (hashmap) here fixes this issue. See the time taken for the same scan before and after the modifications below:

before: refer #3245 (comment)
after:

scancode -l --license-text --license-text-diagnostics --json-pp unilic-licenses-v32.0.0rc1.json unilic-licenses -n 12
Setup plugins...
Collect file inventory...
Scan files for: licenses with 12 process(es)...
[####################] 2346
License Post-Scan Plugin: Starts
License Post-Scan Plugin: Collecting License Detections:
time taken: 0.03095078468322754
License Post-Scan Plugin: Getting Unique License Detections:
UniqueDetection: get_unique_detections: detections_by_id
time taken: 0.2846238613128662
UniqueDetection: get_unique_detections: unique_license_detections
time taken: 0.3621692657470703 for 1168 unique detections
time taken: 0.6470410823822021
License Post-Scan Plugin: Adding referenced filenames:
time taken: 0.02893233299255371
License Post-Scan Plugin: Populating populate_for_license_detections in resources:
time taken: 0.012420654296875
License Post-Scan Plugin: Adding top-level License Detections :
License Post-Scan Plugin: Completed
time taken: 0.039572954177856445
Scanning done.
Summary:        licenses with 12 process(es)
Errors count:   0
Scan Speed:     277.29 files/sec.
Initial counts: 1175 resource(s): 1173 file(s) and 2 directorie(s)
Final counts:   1175 resource(s): 1173 file(s) and 2 directorie(s)
Timings:
  scan_start: 2023-02-13T122559.543991
  scan_end:   2023-02-13T122605.390493
  setup_scan:licenses: 1.31s
  setup: 1.31s
  scan:licenses: 0.75s
  scan: 4.23s
  post-scan:license-references: 0.13s
  post-scan: 0.13s
  output:json-pp: 0.53s
  output: 0.53s
  total: 6.40s
Removing temporary files...done.
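
The dict-based approach can be sketched as follows. This is an illustration under assumptions, not the code from the actual fix: `Detection` and `group_by_identifier` are hypothetical names, and only `detections_by_id` is borrowed from the timing log above.

```python
from collections import defaultdict, namedtuple

# Hypothetical stand-in for a license detection; not the scancode-toolkit class.
Detection = namedtuple("Detection", "identifier file_region")


def group_by_identifier(license_detections):
    """
    Return a mapping of identifier -> list of detections, built in a
    single pass over all detections.
    """
    detections_by_id = defaultdict(list)
    for detection in license_detections:
        detections_by_id[detection.identifier].append(detection)
    return detections_by_id


detections = [
    Detection("mit-1", "a.c"),
    Detection("gpl-1", "b.c"),
    Detection("mit-1", "c.c"),
]

for identifier, all_detections in group_by_identifier(detections).items():
    # Each group is already filtered, so no rescan of the full list is needed.
    file_regions = [d.file_region for d in all_detections]
    print(identifier, len(all_detections), file_regions)
```

Every detection is visited once while grouping and once more when its group is read, so the total work stays linear in the number of detections.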

AyanSinhaMahapatra added a commit that referenced this issue Feb 13, 2023
We were iterating over license detections, which was taking forever
to complete and this approach uses a dict/hashmap instead which
fixes the issue here.

Reference: #3245
Signed-off-by: Ayan Sinha Mahapatra <ayansmahapatra@gmail.com>
Reported-by: Philippe Ombredanne <pombredanne@nexb.com>
@AyanSinhaMahapatra
Member

@pombredanne do you think we should have a CI test that checks for this kind of issue in the main package/license/copyright scans, to guard against PRs that make scans take much longer or choke? (Say, a CI check that fails if the scan time increases by more than 5-10% compared to the run without the change?)

@pombredanne
Member Author

@AyanSinhaMahapatra re:

do you think we should have a test

Yes! But it should be possible to do this using timings rather than running long tests.
Say you run two different scans in the same test function. The first would be a reference and the second the test that should not run away. You time their execution (possibly with the --with-timings option) and assert that the runtime of the second is within an acceptable range of the first. This way you get something that runs quickly, is not subject to CPU differences, and does not need to run for a long time: a few seconds at most. In all cases you must mark these tests as "scanslow" and run them only once across the test suite (say on Linux) and no more than this.
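
A relative-timing check along those lines might look like the sketch below. Everything here is an assumption for illustration: `timed`, `is_within_range`, `stand_in_scan` and the `max_ratio` value are hypothetical, and the stand-in workload replaces the two real scans that an actual test would run.

```python
import time


def timed(func, *args):
    """Return the wall-clock seconds taken by one call to func(*args)."""
    start = time.perf_counter()
    func(*args)
    return time.perf_counter() - start


def is_within_range(reference_seconds, candidate_seconds, max_ratio=10.0):
    """Return True if the candidate took at most max_ratio times the reference."""
    return candidate_seconds <= reference_seconds * max_ratio


def stand_in_scan(size):
    # Placeholder workload; a real test would run a small scan here.
    return sum(i * i for i in range(size))


# Both runs happen on the same machine in the same test, so the ratio
# is largely independent of how fast that machine is.
reference = timed(stand_in_scan, 200_000)
candidate = timed(stand_in_scan, 200_000)
print(is_within_range(reference, candidate))
```

Comparing a ratio rather than an absolute duration is what keeps such a check meaningful across differently sized CI runners; the threshold would need tuning so that normal timing noise does not cause failures.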

AyanSinhaMahapatra added a commit that referenced this issue Feb 17, 2023
Fix choking license detection post-processing #3245