Disk cache usage is slow #1014

Open

JonoYang opened this issue Apr 2, 2018 · 2 comments

Comments

@JonoYang (Contributor) commented Apr 2, 2018

Running a file info scan on the Linux 4.15.7 kernel sources takes a long time when the disk cache is in use.

Disk cache in use (the default, --max-in-memory 10000):

$ scancode -i -n4 ~/Desktop/linux-stable-4.15.7 --json ~/Desktop/kernel.json
Setup plugins...
Collect file inventory...
Scan files for: info with 4 process(es)...
[####################] 62061                                                           
Scanning done.
Summary:        info with 4 process(es)
Errors count:   0
Scan Speed:     65.67 files/sec. 461.51 KB/sec.
Initial counts: 12452 resource(s): 11039 file(s) and 1413 directorie(s) 
Final counts:   12452 resource(s): 11039 file(s) and 1413 directorie(s) for 75.76 MB
Timings:
  inventory: 1382.95s
  scan: 168.11s
  output:json: 30.49s
  output: 30.49s
  total: 1683.44s

For comparison, here are the results with everything kept in memory (--max-in-memory 0); note that the inventory phase drops from 1382.95s to 8.80s, which accounts for most of the slowdown:

$ scancode -i -n4 ~/Desktop/linux-stable-4.15.7 --json ~/Desktop/kernel.json --max-in-memory 0
Setup plugins...
Collect file inventory...
Scan files for: info with 4 process(es)...
[####################] 62061                                                           
Scanning done.
Summary:        info with 4 process(es)
Errors count:   0
Scan Speed:     484.66 files/sec. 5.84 MB/sec.
Initial counts: 66438 resource(s): 62061 file(s) and 4377 directorie(s) 
Final counts:   66438 resource(s): 62061 file(s) and 4377 directorie(s) for 747.61 MB
Timings:
  inventory: 8.80s
  scan: 128.05s
  output:json: 9.31s
  output: 9.31s
  total: 147.48s
@pombredanne (Member)

@JonoYang thanks! Whoa! That's an impressive degradation. I can see how the combination of writing to the cache and reading from it while collecting the "inventory" can have such an impact... and the full re-read done while computing tree-wide counts is not great either.

There are of course several parts at play here:

  • First, the inventory-time cache write is single-threaded (and could be pushed to each subprocess), though this would not work at all when keeping results in memory, where a single shared data structure is used, whereas the on-disk cache writes one JSON file per scanned file and can be parallelized well enough.
  • Second, computing codebase-wide counts may not need to load the cached data from disk: the counts could be kept in memory instead. This seems to be the spot that forces a full re-read of the on-disk cache after inventory, and again a few times during the scan. A sketch of both ideas follows this list.
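To make that concrete, here is a minimal sketch of both ideas, assuming hypothetical names (scan_one, scan_all and CACHE_DIR are stand-ins, not ScanCode's actual code): each worker writes its own per-file JSON cache entry, and the parent accumulates the codebase-wide counts in memory from the lightweight values the workers return, so computing totals never requires re-reading the cache.

import hashlib
import json
import os
from multiprocessing import Pool

CACHE_DIR = '/tmp/scancode-cache'  # hypothetical on-disk cache location

def scan_one(path):
    # Stand-in per-file "info" scan; the real scan gathers much more.
    info = {'path': path, 'size': os.path.getsize(path)}
    # Write the per-file JSON cache entry from inside the worker,
    # parallelizing the write instead of doing it in a single thread.
    name = hashlib.sha1(path.encode('utf-8')).hexdigest() + '.json'
    with open(os.path.join(CACHE_DIR, name), 'w') as f:
        json.dump(info, f)
    # Return only what the parent needs for counts, not the full result.
    return info['size']

def scan_all(paths, processes=4):
    os.makedirs(CACHE_DIR, exist_ok=True)
    with Pool(processes) as pool:
        sizes = pool.map(scan_one, paths)
    # Counts are accumulated in memory as results come back: no re-read
    # of the on-disk cache is ever needed to compute codebase totals.
    return {'files': len(sizes), 'total_size': sum(sizes)}

# e.g. scan_all(['/etc/hostname', '/etc/hosts'])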

In any case, this requires careful measurement to make sure we catch the actual contention points: reading? Or writing? And are we saturating I/O or not? A rough timing sketch follows below.
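One low-tech way to measure that, sketched here with plain timers (cache_read and cache_write are stand-ins for the real cache code paths, not ScanCode's API):

import json
import time
from collections import defaultdict

timings = defaultdict(float)
calls = defaultdict(int)

def timed(name, func, *args, **kwargs):
    # Wrap any cache entry point and aggregate wall-clock time per name.
    start = time.perf_counter()
    try:
        return func(*args, **kwargs)
    finally:
        timings[name] += time.perf_counter() - start
        calls[name] += 1

def cache_write(path, data):
    with open(path, 'w') as f:
        json.dump(data, f)

def cache_read(path):
    with open(path) as f:
        return json.load(f)

# Wrap the real read/write call sites, run a scan, then compare:
#   timed('write', cache_write, '/tmp/r1.json', {'path': 'foo'})
#   timed('read', cache_read, '/tmp/r1.json')
for name in sorted(timings):
    print('%s: %.2fs over %d calls' % (name, timings[name], calls[name]))

If writes dominate, parallelizing them should help; if reads dominate, keeping the counts in memory (as sketched above) is the bigger win; and watching iostat or vmstat during a run would show whether I/O is saturated.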

On-disk caching is, in the end, a trade-off: slower speed in exchange for lower memory usage.

@pombredanne (Member)

@JonoYang I guess we cannot do much besides documenting this, correct?
