
Running out of memory post-scan when scanning a large project (~160,000 files) #2547

Open
Ben-Thelen opened this issue Jun 10, 2021 · 2 comments

Ben-Thelen commented Jun 10, 2021

Description

While scanning a rather large project of about 160,000 files (https://github.com/Azure/azure-sdk-for-java), we are experiencing massive memory usage after the scan finishes and before result.json is written, leading to out-of-memory issues on our system.

How To Reproduce

1.) Download the above-mentioned source code archive from GitHub.
2.) Run extractcode to extract the archive (and all contained archives/jars/etc.).
3.) Start the scan with scancode.
4.) Wait some hours for it to complete.
5.) Error.

Before scanning the project, we did a recursive extract of the source code archive downloaded from the above-mentioned URL using the following command:
extractcode --verbose path/to/file.zip

We are running the scan with the following command:
scancode -clpi --license-score 65 --max-in-memory -1 -n 4 --strip-root --verbose --json-pp result.json /code

System configuration

  • What OS are you running on? Linux (containerized on OpenShift)
  • What version of scancode-toolkit was used to generate the scan file? v21.3.31
  • What installation method was used to install/run scancode? source download + containerization

We are running in a containerized environment, which leads to hard shutdowns of the process when the assigned memory limit is exceeded. Currently the application is given a maximum of 2.2GB of memory, which has been more than enough for all smaller projects we have scanned so far (up to about 20,000 files).

Is there a way to optimize the writing of the JSON result file so that memory issues on large projects can be avoided? Otherwise, is there a workaround available that would let us complete these large scans?

The full error log from ScanCode:

removing temporary files...done.
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 720, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 726, in next
    item = self._items.popleft()
IndexError: pop from an empty deque

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/scancode-toolkit/bin/scancode", line 33, in <module>
    sys.exit(load_entry_point('scancode-toolkit', 'console_scripts', 'scancode')())
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/scancode-toolkit/lib/python3.6/site-packages/commoncode/cliutils.py", line 87, in main
    **extra,
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/scancode-toolkit/src/scancode/cli.py", line 421, in scancode
    *args, **kwargs)
  File "/scancode-toolkit/src/scancode/cli.py", line 815, in run_scan
    quiet=quiet, verbose=verbose, kwargs=requested_options, echo_func=echo_func,
  File "/scancode-toolkit/src/scancode/cli.py", line 1006, in run_scanners
    with_timing=timing, progress_manager=progress_manager)
  File "/scancode-toolkit/src/scancode/cli.py", line 1097, in scan_codebase
    location, rid, scan_errors, scan_time, scan_result, scan_timings = next(scans)
  File "/scancode-toolkit/lib/python3.6/site-packages/click/_termui_impl.py", line 259, in next
    rv = next(self.iter)
  File "/scancode-toolkit/src/scancode/pool.py", line 52, in wrap
    return func(self, timeout=timeout or 3600)
  File "/usr/local/lib/python3.6/multiprocessing/pool.py", line 730, in next
    raise TimeoutError
multiprocessing.context.TimeoutError
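
For context, one plausible reading of this traceback (a minimal sketch of the failure mode, not ScanCode code): if the container's OOM killer terminates a scan worker process, that worker's result never arrives, so the parent's next(timeout=...) eventually raises the TimeoutError seen above.

```python
# Hypothetical illustration only: a pool worker that dies abruptly (here
# simulated with SIGKILL, as an OOM killer would do) never delivers its
# result, so the parent's next(timeout=...) raises TimeoutError.
import os
import signal
import multiprocessing

def work(i):
    if i == 1:
        os.kill(os.getpid(), signal.SIGKILL)  # simulate the OOM killer
    return i

if __name__ == "__main__":
    with multiprocessing.Pool(2) as pool:
        results = pool.imap_unordered(work, range(4))
        try:
            for _ in range(4):
                print(results.next(timeout=5))
        except multiprocessing.TimeoutError:
            print("a worker died; its result never arrived")
```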
Ben-Thelen added the bug label on Jun 10, 2021
Ben-Thelen (Author) commented:

@pombredanne
Is there any investigation possible in this direction? We are happy to help out but we would need to know in what direction to look for now.

pombredanne (Member) commented:

@Ben-Thelen sorry for not replying earlier... I had missed that!
ScanCode v21.3.31 uses a streaming JSON approach for assembling the final JSON, so it is supposed to work; see commit:
8e67173
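
For reference, the general idea behind streaming the JSON output (a minimal sketch assuming results arrive as an iterable of per-file dicts; this is not ScanCode's actual code): write the files array one entry at a time instead of building the whole document in memory and dumping it at once.

```python
# Hypothetical sketch: stream scan results to disk one entry at a time,
# so peak memory is roughly one entry rather than the whole result set.
import json

def write_results_streaming(results_iterable, output_path):
    with open(output_path, "w") as out:
        out.write('{"files": [\n')
        for i, entry in enumerate(results_iterable):
            if i:
                out.write(',\n')
            json.dump(entry, out)
        out.write('\n]}\n')
```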

And 160,000 files is not that big; I have scanned codebases with millions of files.

But 2.2GB of RAM is not a lot. Basically all the RAM is likely to be consumed by the in-memory indexes, which are about 700MB per process; with -n 4 you would likely saturate your RAM with these alone (700MB x 4 = 2.8GB), leaving nothing left for the rest of the scan.
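
As a back-of-the-envelope check (a hypothetical helper using the ~700MB per-process figure above; the real footprint varies by version and options):

```python
# Hypothetical estimate: per-worker in-memory index plus a rough allowance
# for the parent process; these are not measured ScanCode numbers.
INDEX_MB_PER_PROCESS = 700
PARENT_OVERHEAD_MB = 500  # rough guess

def estimate_min_ram_mb(processes):
    return processes * INDEX_MB_PER_PROCESS + PARENT_OVERHEAD_MB

print(estimate_min_ram_mb(4))  # ~3300 MB, well above a 2.2GB container limit
```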

I would not run scancode on anything with less than 8GB in practice, and 16GB is a better practical minimum.

Do you mind trying this:
scancode -clpi -n 4 --verbose --json-pp result.json /code

And could you also try the same with more RAM?
