Segmentation fault error while running scans in an Ubuntu instance in AWS #267
Do you have some more details on the OS version, architecture and version of Python used there?
@pombredanne Python version is 2.7.6
And which architecture? 32 or 64 bits? I also need the exact Python version you get when you run python.
64 bit
2.7 version of Python
Sorry to bring this up again, but I seem to have a very similar issue when scanning a large project: it is the current development branch installed on Debian 8, running in a VM with 2 GB of RAM. Python: OS: Let me know if you need more information. Thank you.
@yahalom5776 Thanks for the report. Can you tell me which command line options you used? My best guess is that you have been 'proc-killed' rather than 'segfaulted', based on your error message. This is something that should go away pretty soon. The short-term solution is to scan smaller chunks of a larger codebase. I am hoping to get a fix sometime this week. In any case I will ping you to test the fix as soon as it is ready.
I used only the "-l" switch in combination with the HTML output because I wanted to get a quick overview of the codebase. Thank you for the explanation, I am looking forward to the update. Please let me know if I can help with additional information or testing.
Just chiming in that I also see
when running on Jenkins inside a VM and scanning Apache Flink (incl. the source code of all its Maven dependencies). The command line used simply is
Is there an update? Any work on this would be very much appreciated!
@sschuberth I am working on it and this is top priority.
No ETA yet though.
@sschuberth I am attempting to reproduce the issue. Could you tell me how to get the source code for Flink's Maven dependencies?
Sure, in general for Maven you can use something like
@sschuberth I played with this and the size of the extracted source is about 13 gigabytes. @pombredanne is working on a solution. Hang in there!
* cache scan results on disk
* stream json at the end
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
* cache scan results on disk
* stream json at the end
* use diskcache
* partially tested and unstable, some tests failing
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
* defaults are 120 seconds and 1000 MB
* this replaces a dynamic computation based on file size entirely
* this allows controlling the runtime quotas explicitly and strictly, and adjusting them to constrained environments such as VMs
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
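For illustration, here is a minimal Python sketch of how fixed per-file time and memory quotas like these could be enforced on a POSIX system. This is a hedged sketch, not ScanCode's actual implementation; scan_func is a placeholder for whatever performs the per-file scan.

```python
# Illustrative sketch only -- not ScanCode's actual implementation.
import resource
import signal

DEFAULT_TIMEOUT = 120      # seconds per scanned file (default per the commit above)
DEFAULT_MAX_MEMORY = 1000  # megabytes per scanned file (default per the commit above)


class QuotaExceeded(Exception):
    pass


def _timed_out(signum, frame):
    raise QuotaExceeded('scan timed out')


def scan_with_quotas(scan_func, path,
                     timeout=DEFAULT_TIMEOUT, max_memory=DEFAULT_MAX_MEMORY):
    """Run scan_func(path), aborting on timeout and flagging excess memory use."""
    previous = signal.signal(signal.SIGALRM, _timed_out)
    signal.alarm(timeout)
    try:
        result = scan_func(path)
    finally:
        signal.alarm(0)
        signal.signal(signal.SIGALRM, previous)
    # ru_maxrss is reported in kilobytes on Linux. This is a post-hoc check;
    # a stricter quota could instead cap the address space with
    # resource.setrlimit(resource.RLIMIT_AS, ...).
    peak_mb = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss // 1024
    if peak_mb > max_memory:
        raise QuotaExceeded('scan exceeded %d MB of memory' % max_memory)
    return result
```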
@yahalom5776 thanks. I pushed a new branch. BTW, does your VM have any virtual memory/swap at all?
The code in #371 adds the timeout and memory options to the scancode command line. All tests pass. Review welcomed! I will likely merge this in over the weekend.
Just updating the status: I am refining a fairly significant refactoring to fix the memory leak that is the root cause of this error after all. It is caused by these two innocent-looking lines here: they eventually create monster duplicated dictionaries (which, after digging, are useless in that form) that keep growing in size at each license match merge run, in some corner cases until the RAM blows up.
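To make that leak pattern concrete, here is a small Python sketch of the problematic shape described above. The names (Match, line_by_pos) are assumptions based on this thread; it is not the actual ScanCode code.

```python
# Illustrative sketch of the leak pattern: every match carries its own copy
# of a large position -> line mapping, and every merge builds yet another
# combined dict, so memory keeps growing on large scans.
class Match(object):
    def __init__(self, start, end, line_by_pos):
        self.start = start
        self.end = end
        # per-match copy of a large mapping
        self.line_by_pos = dict(line_by_pos)

    def merge(self, other):
        # each merge run duplicates the mapping yet again
        combined = dict(self.line_by_pos)
        combined.update(other.line_by_pos)
        return Match(min(self.start, other.start),
                     max(self.end, other.end),
                     combined)
```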
…_memory #267 Add command options for --timeout and --max-memory
* the bitmap-based one is used instead
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
* Do not carry the line_by_pos mapping in matches. Instead only inject start and end lines in matches at the end, using the query line_by_pos mapping kept only globally and not on a per-match basis. This was the source of a major memory leak when license matches were being updated and combined during a merge() run.
* Use a list-based mapping for line_by_pos instead of a dict for a smaller memory footprint.
* Use slots for QueryRun attributes for a smaller memory footprint.
* Replace the "solid" attribute for Rules with a minimum_score that a match to a rule must equal or exceed. solid is now minimum_score: 100. Use this in other rules with various minimum scores as needed.
* Remove remaining references to "gaps".
* Fix incorrect Rule thresholds computation for minimum lengths.
* Do not use the license matches cache for now (the sqlite-backed, diskcache-based implementation is the source of a major slowdown).
* Various minor cleanups and updates on rules, licenses and their corresponding tests.
* New batch of frequent tokens for license detection.
* Add new tests contributed by @yahalom5776.
* Add a new license match filter for matches to a whole rule made of a single token that is surrounded by unknown or single-letter tokens, such as in "a b c d e GPL 1 2 3 4", to discard some false positives (in this case for GPL). This required adding tracking of query tokens made of a single character.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
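As an illustration of the memory-footprint changes listed in this commit message, here is a minimal Python sketch; the class and attribute names are assumptions, not the actual ScanCode classes.

```python
# Minimal sketch of the footprint fixes described above (assumed names).
class LicenseMatch(object):
    # __slots__ avoids a per-instance __dict__ and shrinks every object
    __slots__ = ('qstart', 'qend', 'start_line', 'end_line')

    def __init__(self, qstart, qend):
        self.qstart = qstart
        self.qend = qend
        self.start_line = None
        self.end_line = None


def line_by_pos_from_tokens(token_lines):
    """Return a plain list mapping token position -> line number.

    Positions are dense (0..n-1), so a list is smaller than a
    {position: line} dict, and it is kept once per query, never per match.
    """
    return list(token_lines)


def set_lines(matches, line_by_pos):
    """Inject start/end lines into matches once, at the very end."""
    for match in matches:
        match.start_line = line_by_pos[match.qstart]
        match.end_line = line_by_pos[match.qend]
```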
* New scan caching implementation using simple JSON file storage instead of a sqlite-backed storage. Since we have little or no contention, the strong ACID guarantees and locking offered by sqlite were slowing things down significantly by saturating disk I/O. The process of caching scans is write once, read once for each scanned file, so locking and atomic storage are not needed.
* Also improve scan errors reporting.
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
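A minimal sketch of such a write-once/read-once JSON file cache, assuming a simple path-hash layout; this is not the actual ScanCode implementation.

```python
# Illustrative sketch: one JSON file per scanned path, written once when the
# file is scanned and read once when the final results are assembled, so no
# locking or transactional storage is needed.
import hashlib
import json
import os


class ScanCache(object):
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir
        if not os.path.exists(cache_dir):
            os.makedirs(cache_dir)

    def _path_for(self, scanned_path):
        key = hashlib.sha1(scanned_path.encode('utf-8')).hexdigest()
        return os.path.join(self.cache_dir, key + '.json')

    def put(self, scanned_path, scan_result):
        with open(self._path_for(scanned_path), 'w') as out:
            json.dump(scan_result, out)

    def get(self, scanned_path):
        with open(self._path_for(scanned_path)) as inp:
            return json.load(inp)
```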
@yahalom5776 I added new tests for the two test files you submitted here. The latest commits in PR #376 should now provide a comprehensive fix for the memory leak, along with memory quotas, timeouts and cache streaming to disk, plus a few extra side refinements.
@pombredanne Thanks a lot, I hope I'll be able to test it tomorrow. I forgot to answer your question above, sorry. Swap is 2.5 GB for this VM; about 900 MB were in use when ScanCode hung.
#267 Remove memory leak from license detection
Even though this ticket was originally about an AWS issue and evolved into memory issues, I think this is all fixed now and I am closing this. Thank you all for the inputs and tests!
@pombredanne Just to confirm: with these fixes ScanCode is now running much more stably, and also faster, on our Jenkins CI which runs in a VM. Your recent work has been nothing short of awesome, thanks a lot for that!
@pombredanne I second that, thank you!
I cloned the GitHub repo scancode-toolkit in an Ubuntu instance (AWS EC2 AMI - ami-06116566) and used the command ./scancode --help to configure ScanCode for the first time. The next time, when I try to run the command ./scancode --license file.txt, I am getting a segmentation fault error.