While scanning a rather large project of about 160,000 files (https://github.com/Azure/azure-sdk-for-java), we are seeing massive memory usage after the scan finishes and before result.json is written, leading to out-of-memory issues on our system.
How To Reproduce
1.) Download the aforementioned source code archive from GitHub.
2.) Run extractcode to extract the archive (and all contained archives/jars/etc.).
3.) Start the scan with scancode.
4.) Wait some hours for it to complete.
5.) Error.
Before scanning the project, we performed a recursive extraction of the source code archive downloaded from the URL above using the following command: extractcode --verbose path/to/file.zip
We are running the scan with the following command: scancode -clpi --license-score 65 --max-in-memory -1 -n 4 --strip-root --verbose --json-pp result.json /code
System configuration
What OS are you running on? Linux (containerized on Openshift)
What version of scancode-toolkit was used to generate the scan file? v21.3.31
What installation method was used to install/run scancode? source download + containerization
We are running in a containerized environment, so the process is hard-killed when its assigned memory is exceeded. Currently the application is given a maximum of 2.2 GB of memory, which has been more than enough for all smaller projects we have scanned so far, up to about 20,000 files.
Is there a way to optimize the writing of the JSON result file so that memory issues on large projects can be avoided? Otherwise, is there a workaround available that would let us complete these large scans?
The full error log from Scancode:
removing temporary files...done.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 720, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 726, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scancode-toolkit/bin/scancode", line 33, in <module>
    sys.exit(load_entry_point('scancode-toolkit', 'console_scripts', 'scancode')())
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/scancode-toolkit/lib/python3.6/site-packages/commoncode/cliutils.py", line 87, in main
    **extra,
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/scancode-toolkit/src/scancode/cli.py", line 421, in scancode
    *args, **kwargs)
  File "/scancode-toolkit/src/scancode/cli.py", line 815, in run_scan
    quiet=quiet, verbose=verbose, kwargs=requested_options, echo_func=echo_func,
  File "/scancode-toolkit/src/scancode/cli.py", line 1006, in run_scanners
    with_timing=timing, progress_manager=progress_manager)
  File "/scancode-toolkit/src/scancode/cli.py", line 1097, in scan_codebase
    location, rid, scan_errors, scan_time, scan_result, scan_timings = next(scans)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/_termui_impl.py", line 259, in next
    rv = next(self.iter)
  File "/scancode-toolkit/src/scancode/pool.py", line 52, in wrap
    return func(self, timeout=timeout or 3600)
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 730, in next
    raise TimeoutError
multiprocessing.context.TimeoutError
@pombredanne
Is there any investigation possible in this direction? We are happy to help out, but we would need to know which direction to look in.
@Ben-Thelen sorry for not replying earlier... I had missed that!
ScanCode v21.3.31 uses a streaming JSON approach for assembling the final JSON, so it is supposed to work. 8e67173
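For context, the idea behind such a streaming approach can be sketched roughly as follows. This is not ScanCode's actual implementation, just a minimal illustration: each per-file result is serialized and written as soon as it is produced, so the full list of results never has to sit in memory at once. Function and field names here are made up for the example.

```python
import json

def write_results_streaming(results, path):
    """Write an iterable of per-file scan results as a JSON document,
    one element at a time, so the full result list is never held in memory."""
    with open(path, "w") as out:
        out.write('{"files": [\n')
        first = True
        for result in results:
            if not first:
                out.write(",\n")
            # Only the current result dict is in memory at this point.
            json.dump(result, out)
            first = False
        out.write("\n]}\n")
```

With a generator feeding this function, peak memory stays proportional to a single result rather than to the whole codebase.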
And 160,000 files is not that big; I have scanned codebases with millions of files alright.
But 2.2 GB of RAM is not a lot. Most of that RAM is likely consumed by the in-memory license indexes, which are about 700 MB per process; with -n 4 you would saturate your RAM with these alone (700 MB x 4 = 2.8 GB), leaving no RAM for anything else.
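The sizing arithmetic above can be wrapped in a tiny helper. The 700 MB per-process index figure comes from this comment; the output-overhead term is a placeholder assumption, not a measured value:

```python
def estimated_ram_mb(index_mb=700, processes=4, output_overhead_mb=0):
    """Rough lower bound on RAM (in MB) for a scan: each worker process
    loads its own copy of the in-memory license index, plus whatever
    headroom is needed for assembling and writing the output."""
    return index_mb * processes + output_overhead_mb
```

With the numbers from this thread, estimated_ram_mb(700, 4) already exceeds the container's 2.2 GB limit before any output headroom is counted.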
In practice I would not run scancode on anything with less than 8 GB, and 16 GB is a better practical minimum.
Would you mind trying with this: scancode -clpi -n 4 --verbose --json-pp result.json /code