
Running out of memory post-scan when scanning a large project (~160,000 files) #2547

Open
Ben-Thelen opened this issue Jun 10, 2021 · 2 comments

Ben-Thelen commented Jun 10, 2021

Description

While scanning a rather large project of about 160,000 files (https://github.com/Azure/azure-sdk-for-java), we are experiencing massive memory usage after the scan finishes and before result.json is written, leading to out-of-memory issues on our system.

How To Reproduce

1.) Download the above-mentioned source code archive from GitHub.
2.) Run extractcode to extract the archive (and all contained archives/jars/etc.).
3.) Start the scan with scancode.
4.) Wait some hours for it to complete.
5.) Error.

Before scanning the project, we did a recursive extract of the source code archive downloaded from the above-mentioned URL using the following command:
extractcode --verbose path/to/file.zip

We are running the scan with the following command:
scancode -clpi --license-score 65 --max-in-memory -1 -n 4 --strip-root --verbose --json-pp result.json /code

System configuration

  • What OS are you running on? Linux (containerized on OpenShift)
  • What version of scancode-toolkit was used to generate the scan file? v21.3.31
  • What installation method was used to install/run scancode? source download + containerization

We are running in a containerized environment, which leads to hard shutdowns of the process when the assigned memory limit is exceeded. Currently the application is given a maximum of 2.2GB of memory, which has been more than enough for all smaller projects we have scanned so far (up to about 20,000 files).

Is there a way to optimize the writing of the JSON result file so that memory issues on large projects can be avoided? Otherwise, is there a workaround available that would let us complete these large scans?

The full error log from ScanCode:

removing temporary files...done.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 720, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 726, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scancode-toolkit/bin/scancode", line 33, in <module>
    sys.exit(load_entry_point('scancode-toolkit', 'console_scripts', 'scancode')())
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/scancode-toolkit/lib/python3.6/site-packages/commoncode/cliutils.py", line 87, in main
    **extra,
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/scancode-toolkit/src/scancode/cli.py", line 421, in scancode
    *args, **kwargs)
  File "/scancode-toolkit/src/scancode/cli.py", line 815, in run_scan
    quiet=quiet, verbose=verbose, kwargs=requested_options, echo_func=echo_func,
  File "/scancode-toolkit/src/scancode/cli.py", line 1006, in run_scanners
    with_timing=timing, progress_manager=progress_manager)
  File "/scancode-toolkit/src/scancode/cli.py", line 1097, in scan_codebase
    location, rid, scan_errors, scan_time, scan_result, scan_timings = next(scans)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/_termui_impl.py", line 259, in next
    rv = next(self.iter)
  File "/scancode-toolkit/src/scancode/pool.py", line 52, in wrap
    return func(self, timeout=timeout or 3600)
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 730, in next
    raise TimeoutError
multiprocessing.context.TimeoutError
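
For context, one plausible reading of this traceback (a minimal sketch of the failure mode, not ScanCode code): if the container's OOM killer terminates a scan worker process, that worker's result never arrives, so the parent's next(timeout=...) eventually raises the TimeoutError seen above.

```python
# Hypothetical illustration only: a pool worker that dies abruptly (here
# simulated with SIGKILL, as an OOM killer would do) never delivers its
# result, so the parent's next(timeout=...) raises TimeoutError.
import os
import signal
import multiprocessing

def work(i):
    if i == 1:
        os.kill(os.getpid(), signal.SIGKILL)  # simulate the OOM killer
    return i

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        results = pool.imap_unordered(work, range(4))
        try:
            for _ in range(4):
                print(results.next(timeout=5))
        except multiprocessing.TimeoutError:
            print("a worker died; its result never arrived")
```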
Ben-Thelen added the bug label on Jun 10, 2021
Ben-Thelen (Author) commented:

@pombredanne
Is there any investigation possible in this direction? We are happy to help out but we would need to know in what direction to look for now.

pombredanne (Member) commented:

@Ben-Thelen sorry for not replying earlier... I had missed that!
ScanCode v21.3.31 uses a streaming JSON approach for assembling the final JSON, so it is supposed to work; see commit:
8e67173
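
For reference, the general idea behind streaming the JSON output (a minimal sketch assuming results arrive as an iterable of per-file dicts; this is not ScanCode's actual code): write the files array one entry at a time instead of building the whole document in memory and dumping it at once.

```python
# Hypothetical sketch: stream scan results to disk one entry at a time,
# so peak memory is roughly one entry rather than the whole result set.
import json

def write_results_streaming(results_iterable, output_path):
    with open(output_path, "w") as out:
        out.write('{"files": [\n')
        for i, entry in enumerate(results_iterable):
            if i:
                out.write(',\n')
            json.dump(entry, out)
        out.write('\n]}\n')
```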

And 160,000 files is not that big; I have scanned codebases with millions of files.

But 2.2GB of RAM is not a lot. Basically all the RAM is likely to be consumed by the in-memory indexes, which are about 700MB per process; with -n 4 you would likely saturate your RAM with these alone (700MB x 4 = 2.8GB), leaving nothing left for the rest of the scan.
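
As a back-of-the-envelope check (a hypothetical helper using the ~700MB per-process figure above; the real footprint varies by version and options):

```python
# Hypothetical estimate: per-worker in-memory index plus a rough allowance
# for the parent process; these are not measured ScanCode numbers.
INDEX_MB_PER_PROCESS = 700
PARENT_OVERHEAD_MB = 500  # rough guess

def estimate_min_ram_mb(processes):
    return processes * INDEX_MB_PER_PROCESS + PARENT_OVERHEAD_MB

print(estimate_min_ram_mb(4))  # ~3300 MB, well above a 2.2GB container limit
```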

I would not run scancode on anything with less than 8GB in practice, and 16GB is a better practical minimum.

Do you mind trying this:
scancode -clpi -n 4 --verbose --json-pp result.json /code

And could you also try the same with more RAM?
