Disk cache usage is slow #1014

Open

JonoYang opened this issue Apr 2, 2018 · 2 comments

Comments

@JonoYang (Contributor) commented Apr 2, 2018

Running a file info scan on the Linux 4.15.7 kernel sources takes a long time when the disk cache is in use.

Disk cache in use (the default, --max-in-memory 10000):

$ scancode -i -n4 ~/Desktop/linux-stable-4.15.7 --json ~/Desktop/kernel.json
Setup plugins...
Collect file inventory...
Scan files for: info with 4 process(es)...
[####################] 62061                                                           
Scanning done.
Summary:        info with 4 process(es)
Errors count:   0
Scan Speed:     65.67 files/sec. 461.51 KB/sec.
Initial counts: 12452 resource(s): 11039 file(s) and 1413 directorie(s) 
Final counts:   12452 resource(s): 11039 file(s) and 1413 directorie(s) for 75.76 MB
Timings:
  inventory: 1382.95s
  scan: 168.11s
  output:json: 30.49s
  output: 30.49s
  total: 1683.44s

For comparison, here are the results with everything kept in memory (--max-in-memory 0); note that the inventory phase drops from 1382.95s to 8.80s, which accounts for most of the slowdown:

$ scancode -i -n4 ~/Desktop/linux-stable-4.15.7 --json ~/Desktop/kernel.json --max-in-memory 0
Setup plugins...
Collect file inventory...
Scan files for: info with 4 process(es)...
[####################] 62061                                                           
Scanning done.
Summary:        info with 4 process(es)
Errors count:   0
Scan Speed:     484.66 files/sec. 5.84 MB/sec.
Initial counts: 66438 resource(s): 62061 file(s) and 4377 directorie(s) 
Final counts:   66438 resource(s): 62061 file(s) and 4377 directorie(s) for 747.61 MB
Timings:
  inventory: 8.80s
  scan: 128.05s
  output:json: 9.31s
  output: 9.31s
  total: 147.48s
@pombredanne (Member)

@JonoYang thanks! Whoa! That's an impressive degradation. I can see how the combination of writing to the cache and reading from it while collecting the "inventory" can have such an impact... and the full re-read done while computing tree-wide counts is not great either.

There are of course several parts at play here:

  • First, the inventory-time cache write is single-threaded (and could be pushed to each subprocess), though this would not work at all when keeping results in memory, where a single shared data structure is used, whereas the on-disk cache writes one JSON file per scanned file and can be parallelized well enough.
  • Second, computing codebase-wide counts may not need to load the cached data from disk: the counts could be kept in memory instead. This seems to be the spot that forces a full re-read of the on-disk cache after inventory, and again a few times during the scan. A sketch of both ideas follows this list.
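To make that concrete, here is a minimal sketch of both ideas, assuming hypothetical names (scan_one, scan_all and CACHE_DIR are stand-ins, not ScanCode's actual code): each worker writes its own per-file JSON cache entry, and the parent accumulates the codebase-wide counts in memory from the lightweight values the workers return, so computing totals never requires re-reading the cache.

import hashlib
import json
import os
from multiprocessing import Pool

CACHE_DIR = '/tmp/scancode-cache'  # hypothetical on-disk cache location

def scan_one(path):
    # Stand-in per-file "info" scan; the real scan gathers much more.
    info = {'path': path, 'size': os.path.getsize(path)}
    # Write the per-file JSON cache entry from inside the worker,
    # parallelizing the write instead of doing it in a single thread.
    name = hashlib.sha1(path.encode('utf-8')).hexdigest() + '.json'
    with open(os.path.join(CACHE_DIR, name), 'w') as f:
        json.dump(info, f)
    # Return only what the parent needs for counts, not the full result.
    return info['size']

def scan_all(paths, processes=4):
    os.makedirs(CACHE_DIR, exist_ok=True)
    with Pool(processes) as pool:
        sizes = pool.map(scan_one, paths)
    # Counts are accumulated in memory as results come back: no re-read
    # of the on-disk cache is ever needed to compute codebase totals.
    return {'files': len(sizes), 'total_size': sum(sizes)}

# e.g. scan_all(['/etc/hostname', '/etc/hosts'])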

In any case, this requires careful measurement to make sure we catch the actual contention points: reading? Or writing? And are we saturating I/O or not? A rough timing sketch follows below.
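One low-tech way to measure that, sketched here with plain timers (cache_read and cache_write are stand-ins for the real cache code paths, not ScanCode's API):

import json
import time
from collections import defaultdict

timings = defaultdict(float)
calls = defaultdict(int)

def timed(name, func, *args, **kwargs):
    # Wrap any cache entry point and aggregate wall-clock time per name.
    start = time.perf_counter()
    try:
        return func(*args, **kwargs)
    finally:
        timings[name] += time.perf_counter() - start
        calls[name] += 1

def cache_write(path, data):
    with open(path, 'w') as f:
        json.dump(data, f)

def cache_read(path):
    with open(path) as f:
        return json.load(f)

# Wrap the real read/write call sites, run a scan, then compare:
#   timed('write', cache_write, '/tmp/r1.json', {'path': 'foo'})
#   timed('read', cache_read, '/tmp/r1.json')
for name in sorted(timings):
    print('%s: %.2fs over %d calls' % (name, timings[name], calls[name]))

If writes dominate, parallelizing them should help; if reads dominate, keeping the counts in memory (as sketched above) is the bigger win; and watching iostat or vmstat during a run would show whether I/O is saturated.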

On-disk caching is, in the end, a trade-off: slower speed in exchange for lower memory usage.

@pombredanne (Member)

@JonoYang I guess we cannot do much besides documenting this, correct?
