segmentation fault error while running scans in an ubuntu instance in AWS #267

Closed
balusarakesh opened this issue May 23, 2016 · 57 comments · Fixed by #354

@balusarakesh
Collaborator

I cloned the scancode-toolkit GitHub repo on an Ubuntu instance (AWS EC2 AMI ami-06116566) and ran ./scancode --help to configure ScanCode for the first time.
The next time, when I try to run the command ./scancode --license file.txt, I get the error:

/scancode: line 16:  8253 Segmentation fault      (core dumped) $SCANCODE_ROOT_DIR/bin/scancode "$@"
@pombredanne
Member

Do you have some more details on the OS version, architecture and version of Python used there?

@balusarakesh
Collaborator Author

Ubuntu Server 14.04 LTS (HVM)

@balusarakesh
Collaborator Author

@pombredanne Python version is 2.7.6

@pombredanne
Member

pombredanne commented May 24, 2016

And which architecture, 32 or 64 bits? I also need the exact Python version you get when you run python.

@balusarakesh
Collaborator Author

64 bit

@balusarakesh
Collaborator Author

2.7 version of Python

@pombredanne pombredanne modified the milestone: v2.0 Aug 5, 2016
@yahalom5776

Sorry to bring this up again but I seem to have a very similar issue when scanning a large project:
./scancode: line 16: 2977 Killed $SCANCODE_ROOT_DIR/bin/scancode "$@"

It is the current development branch installed on Debian 8 running in a VM with 2GB of RAM.

Python:
python --version: Python 2.7.9

OS:
uname -a: Linux debian 3.16.0-4-amd64 #1 SMP Debian 3.16.36-1+deb8u1 (2016-09-03) x86_64 GNU/Linux

Let me know if you need more information. Thank you.

@pombredanne
Member

@yahalom5776 Thanks for the report. Can you tell me which command line options you used?

My best guess is that your process was 'prockilled' rather than 'segfaulted', based on your error message. This is something that should go away pretty soon.
The issue is simple: at the moment the results of scanning each file are accumulated in RAM for all files.
Eventually, when scanning a larger codebase, this exhausts the available RAM and the kernel may then kill the scancode process. The solution is going to be saving the scan results of each file to disk as the scan goes, and then streaming the final JSON at the end, such that the scan results are never fully held in memory.
This is something I am working on in the orm-models branch https://github.com/nexB/scancode-toolkit/tree/orm-models

The short-term workaround is to scan smaller chunks of a larger codebase. I am hoping to get a fix out sometime this week. In any case I will ping you to test the fix as soon as it is ready.
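
For illustration, the approach would look something like this minimal sketch (assuming a hypothetical scan_one callable; this is not ScanCode's actual code):

import json
import os
import tempfile

def scan_with_disk_cache(files, scan_one, output):
    # Cache each file's scan result on disk as scanning goes, so only
    # one result is ever held in RAM at a time.
    cache_dir = tempfile.mkdtemp()
    cached = []
    for i, path in enumerate(files):
        cache_file = os.path.join(cache_dir, '%d.json' % i)
        with open(cache_file, 'w') as cf:
            json.dump(scan_one(path), cf)
        cached.append(cache_file)
    # Stream the final JSON from the cached files at the end.
    output.write('[\n')
    for i, cache_file in enumerate(cached):
        if i:
            output.write(',\n')
        with open(cache_file) as cf:
            output.write(cf.read())
    output.write('\n]\n')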

@yahalom5776

I used only the "-l" switch in combination with the HTML output because I wanted to get a quick overview of the codebase.

Thank you for the explanation, I am looking forward to the update. Please let me know if I can help with additional information or testing.

@sschuberth
Collaborator

Just chiming in that I also see

scancode: line 16: 10193 Killed                  $SCANCODE_ROOT_DIR/bin/scancode "$@"

when running on Jenkins inside a VM and scanning Apache Flink (including the source code of all its Maven dependencies). The command line used is simply

scancode -f html . flink-scancode-50eaeec.html

I am hoping to get a fix out sometime this week.

Is there an update? Any work on this would be very much appreciated!

@pombredanne
Member

@sschuberth I am working on it and this is top priority

@pombredanne
Member

No ETA yet though

@JonoYang
Member

JonoYang commented Nov 4, 2016

@sschuberth I am attempting to reproduce the issue. Could you tell me how to get the source code for Flink's Maven dependencies?

@sschuberth
Collaborator

Sure, in general for Maven you can use something like

mvn dependency:unpack-dependencies \
    -Dclassifier=sources \
    -DoutputDirectory="$OUTDIR" \
    -Dmdep.useSubDirectoryPerArtifact=true \
    -Dmdep.useSubDirectoryPerScope=true

@JonoYang
Member

JonoYang commented Nov 4, 2016

@sschuberth I played with this and the size of the extracted source is about 13 gigabytes. @pombredanne is working on a solution. Hang in there!

pombredanne added a commit that referenced this issue Nov 6, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 6, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 6, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 6, 2016
 * cache scan results on disk
 * stream json at the end 

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 6, 2016
 * cache scan results on disk
 * stream json at the end 

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 7, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 7, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 7, 2016
 * cache scan results on disk
 * stream json at the end 

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 7, 2016
 * cache scan results on disk
 * stream json at the end
 * use diskcache
 * partially tested and unstable, some tests failing

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 7, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 18, 2016
 * defaults are 120 seconds and 1000 MB
 * this replaces a dynamic computation based on file size entirely
 * this allows controlling the runtime quotas explicitly and strictly,
   and adjusting them to constrained environments such as VMs.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
@pombredanne
Member

@yahalom5776 thanks. I pushed a new branch 267-cli-options-for-timeout-and-max_memory to have explicit timeout and max-memory CLI options as you and @sschuberth suggested.
I am also looking into the reason why we exceed the quotas on the two test files you listed.

btw, does your VM have any virtual memory/swap at all?

@pombredanne
Member

pombredanne commented Nov 18, 2016

The code in #371 adds the timeout and memory options to the scancode command line. All tests pass. Review welcome! I will likely merge this in over the weekend.
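
For reference, once this is merged the new options should combine with a regular scan invocation roughly like this (the option names come from the PR; the values shown are just the documented defaults, and the paths are illustrative):

./scancode -l --timeout 120 --max-memory 1000 -f html mycode/ results.html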

@pombredanne
Member

Just updating the status: I am refining a fairly significant refactoring to fix the memory leak that turns out to be the root cause of this error after all. It comes down to these two innocent-looking lines here:
https://github.com/nexB/scancode-toolkit/blob/c12948ce1d687b69bfb710900d95927295035196/src/licensedcode/match.py#L294

They eventually create monster duplicated dictionaries (which, after some digging, are useless in that form) that keep growing in size at each license match merge run, in some corner cases until the RAM blows up.
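
To illustrate the pattern, here is a simplified sketch (not the actual match.py code):

class LeakyMatch(object):
    # Each match carries its own copy of a large pos -> line mapping,
    # and merging two matches duplicates it yet again, so memory grows
    # with every merge pass over the matches.
    def __init__(self, positions, line_by_pos):
        self.positions = positions
        self.line_by_pos = dict(line_by_pos)

    def merge(self, other):
        merged = dict(self.line_by_pos)
        merged.update(other.line_by_pos)
        return LeakyMatch(self.positions + other.positions, merged)

# The fix is to keep line_by_pos only once, at the query level, and to
# inject start and end lines into matches only at the very end.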

pombredanne added a commit that referenced this issue Nov 23, 2016
…_memory

#267 Add command options for --timeout and --max-memory
pombredanne added a commit that referenced this issue Nov 24, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 24, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 24, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 24, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 24, 2016
 * the bitmap-based one is used instead

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 24, 2016
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
pombredanne added a commit that referenced this issue Nov 24, 2016
 * Do not carry line_by_pos mapping in matches. Instead only inject
   start and end line in matches at the end using the query line_by_pos
   mapping kept only globally and not on a per-match basis.
   This was the source of a major memory leak when license matches were
   being updated and combined during a merge() run.
 * Use a list-based mapping for line_by_pos instead of a dict for a
   smaller memory footprint.
 * Use slots for QueryRun attributes for a smaller memory footprint.
 * Replace "solid" attribute for Rules with a minimum_score that a match
   to a rule must equal or exceed. solid is now minimum_score: 100. Use
   this in other rules with various minimum scores as needed.
 * Remove remaining references to "gaps".
 * Fix incorrect Rule thresholds computation for minimum lengths.
 * Do not use license matches cache for now (the sqlite-backed
   diskcache-based implementation is the source of major slowdown).
 * Various minor cleanup and updates on rules, licenses and their
   corresponding tests.
 * New batch of frequent tokens for license detection.
 * Add new tests contributed by @yahalom5776
 * Add new license match filter for matches to a whole rule made of a
   single token that is surrounded by unknown or single-letter tokens,
   such as in "a b c d e  GPL 1 2 3  4", to discard some false
   positives (in this case for a GPL). This required adding tracking of
   query tokens made of a single character.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
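
As a rough illustration of the two memory footprint techniques mentioned in this commit message (simplified, not the actual ScanCode classes):

class QueryRun(object):
    # __slots__ avoids allocating a per-instance __dict__, which
    # shrinks every QueryRun object.
    __slots__ = ('start', 'end', 'query')

    def __init__(self, start, end, query):
        self.start = start
        self.end = end
        self.query = query

# Token positions are dense integers starting at zero, so a list
# indexed by position is smaller than a dict keyed by position:
line_by_pos_dict = {0: 1, 1: 1, 2: 2}  # pos -> line, as a dict
line_by_pos_list = [1, 1, 2]           # the same mapping, as a list
assert line_by_pos_list[2] == line_by_pos_dict[2]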
pombredanne added a commit that referenced this issue Nov 24, 2016
 * New scan caching implementation using simple JSON files storage
   instead of a sqlite-backed storage. Since we have little or no
   contention, the strong ACID and locking offered by sqlite were
   slowing things down significantly by saturating disk I/Os. The
   process of caching scans is write-once, read-once for each scanned
   file, and therefore locking and atomic storage are not needed.
 * Also improve scan error reporting.

Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>
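
A minimal sketch of such a write-once, read-once JSON cache (illustrative only; the class and method names here are hypothetical, not ScanCode's API):

import hashlib
import json
import os

class ScanCache(object):
    # One JSON file per scanned path, named after a hash of that path.
    # Every entry is written once and read once, so no locking or
    # transactional storage (as sqlite provides) is needed.
    def __init__(self, cache_dir):
        self.cache_dir = cache_dir

    def _cache_path(self, scanned_path):
        key = hashlib.sha1(scanned_path.encode('utf-8')).hexdigest()
        return os.path.join(self.cache_dir, key + '.json')

    def put(self, scanned_path, results):
        with open(self._cache_path(scanned_path), 'w') as f:
            json.dump(results, f)

    def get(self, scanned_path):
        with open(self._cache_path(scanned_path)) as f:
            return json.load(f)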
@pombredanne
Member

@yahalom5776 I added new tests for the two test files you submitted here. The latest commits in PR #376 should now provide a comprehensive fix for the memory leak, together with memory quotas, timeouts and scan cache streaming on disk, plus a few extra side refinements.
@sschuberth your feedback is welcomed too :)

@yahalom5776

yahalom5776 commented Nov 24, 2016

@pombredanne Thanks a lot, I hope I'll be able to test it tomorrow. I forgot to answer your question above, sorry. Swap is 2.5 GB for this VM, about 900 MB were used when ScanCode hung.

@pombredanne
Member

Even though this ticket was originally about an AWS issue and evolved into memory issues, I think this is all fixed now, so I am closing it. Thank you all for the inputs and tests!

@sschuberth
Collaborator

@pombredanne Just to confirm: with these fixes ScanCode now runs much more stably, and also faster, on our Jenkins CI, which runs in a VM. Your recent work has been nothing short of awesome, thanks a lot for that!

@yahalom5776

@pombredanne I second that, thank you!

pombredanne added a commit that referenced this issue Oct 2, 2017
Signed-off-by: Philippe Ombredanne <pombredanne@nexb.com>