ScanCode crashes with PDFEncryptionError #67
Comments
@rrjohnston Thanks for this bug report!
Philippe, I think you're right about #56. I had posted this internally to our team Regards, On Thu, Sep 3, 2015 at 2:13 AM, Philippe Ombredanne <
@rrjohnston thank you for checking this! I will close this bug as a duplicate. Your runtime stats are quite interesting too. For now, ScanCode is single-threaded, so scans are rather slow. There are plans to run on multiple processes or threads in the future to speed up the scans. What are the specs of the machine you used for this run? And do you consider 16+ hours OK or unacceptable? I would like to start tracking example runtimes on a wiki page: https://github.com/nexB/scancode-toolkit/wiki/Runtime-performance-reports
Machine is as follows (large shared build server):
40 threads (2 processors, 10 cores each, with hyperthreading)
3.1 GHz
128GB RAM
8TB controller RAID5
The files were all local on that server, so an environment involving a Regards, On Thu, Sep 3, 2015 at 1:32 PM, Philippe Ombredanne <
Yes, scanning is designed to process one file at a time and to share nothing to support adding multithreading and parallelism (reasonably) easily. The implementation might be best done with
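To illustrate the share-nothing, one-file-at-a-time design described above, here is a minimal sketch of fanning a per-file scan out over a worker pool. It is not ScanCode's implementation: `scan_one` here is a hypothetical stand-in for a real per-file scan, `scan_all`, `executor_cls` and `workers` are names made up for this example, and the choice of `concurrent.futures` is an assumption (it is in the standard library on Python 3 only).

```python
from concurrent.futures import ProcessPoolExecutor


def scan_one(path):
    """Hypothetical per-file scan: reads no shared state and returns plain data."""
    return {"path": path, "name_length": len(path)}


def scan_all(paths, executor_cls=ProcessPoolExecutor, workers=4):
    """Run scan_one over paths in a worker pool, keeping results in input order."""
    with executor_cls(max_workers=workers) as executor:
        # executor.map yields results in the same order as the input paths.
        return list(executor.map(scan_one, paths))
```

Because each call only depends on its own path, the same function can run in processes or threads without locking. When using processes on platforms that spawn workers, the call to `scan_all` should sit under an `if __name__ == "__main__":` guard.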
Philippe, I just ran another scan with the scancode development version and found a Regards, On Thu, Sep 3, 2015 at 5:55 PM, Philippe Ombredanne <
@rrjohnston I have created a new ticket for the license detection issue as #75
This version seems to catch the multiple licenses nicely. But if I may make a request, it would help to have all the licenses listed on a single line. One has to know in advance to look for multiple entries of a file to find all the licenses that it's covered by.
@rrjohnston you wrote:
Yes, you are entirely right. This is the purpose of #74
The point is that this is how licenses are already detected internally, but they are not yet reported this way, which I reckon is a pity!
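As a rough illustration of the single-line listing requested above, the sketch below groups per-detection entries by file so that each file appears once with all of its licenses. The input shape (pairs of path and license key) and the function names are assumptions for this example, not ScanCode's actual data model or output format.

```python
from collections import OrderedDict


def licenses_by_file(detections):
    """Group (path, license_key) pairs so each path maps to all its licenses.

    Paths and license keys keep their first-seen order; duplicates are dropped.
    """
    grouped = OrderedDict()
    for path, license_key in detections:
        keys = grouped.setdefault(path, [])
        if license_key not in keys:
            keys.append(license_key)
    return grouped


def format_rows(grouped):
    """Render one line per file, listing every license detected in it."""
    return ["%s: %s" % (path, ", ".join(keys)) for path, keys in grouped.items()]
```

With this shape a reader no longer needs to know in advance that a file may appear in several entries.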
Ubuntu 12.04
x86_64
Python 2.7.3
ScanCode Version 1.3.1
I attempted to create both html_app and html output: ./scanCode -f html tivo.html
Workspace being scanned has 198785 files. ScanCode directory and workspace resident on local machine.
On both scan attempts the tool crashed, apparently while trying to scan a PDF file. There's no information about which file caused the problem, so I can't independently check its validity.
No output file was created.
Traceback (most recent call last):
File "/export/dqj347/scancode-toolkit-1.3.1/bin/scancode", line 9, in <module>
load_entry_point('scancode-toolkit==1.3.1', 'console_scripts', 'scancode')()
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 664, in __call__
return self.main(*args, **kwargs)
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 230, in main
standalone_mode=standalone_mode, **extra)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 644, in main
rv = self.invoke(ctx)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 837, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 464, in invoke
return callback(*args, **kwargs)
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 290, in scancode
results.append(scan_one(input_file, copyright, license, verbose))
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 332, in scan_one
data['copyrights'] = list(get_copyrights(input_file))
File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/api.py", line 62, in get_copyrights
for copyrights, _, _, _, start_line, end_line in detect_copyrights(location):
File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 70, in detect_copyrights
for numbered_lines in candidate_lines(analysis.text_lines(location)):
File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 797, in candidate_lines
for line_number, line in enumerate(lines):
File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/analysis.py", line 552, in unicode_text_lines_from_pdf
for line in pdf.get_text_lines(location):
File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/pdf.py", line 46, in get_text_lines
document = PDFDocument(parser)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 326, in __init__
self._initialize_password(password)
File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 348, in _initialize_password
raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={'CF': {'StdCF': {'Length': 16, 'CFM': /AESV2, 'AuthEvent': /DocOpen}}, 'O': '\xf1T({\xf5#N\xc0\xfewr\xcf6\xd2\x92\x89\x1b\xbe\x11\x8c\xd0\xec\x88\x1d\x1a\x9c}\xf5\xb7J\xb5\x87', 'Filter': /Standard, 'P': -1036, 'Length': 128, 'R': 4, 'U': '\x14\x8bR\xb6x\x97t\xc1\xcf\xeaO{\x1a]\xfc\xfd\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'V': 4, 'StmF': /StdCF, 'StrF': /StdCF}
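The traceback above aborts the whole run and never names the file being scanned. One possible way to address both problems, sketched here under assumptions, is to catch exceptions per file and record the path alongside the error so the scan can continue. `scan_files` and its `scan_one` parameter are hypothetical names for this example and do not reflect how ScanCode's `cli.py` is structured.

```python
import traceback


def scan_files(paths, scan_one):
    """Scan each path; record failures with the offending path instead of aborting.

    scan_one is any callable taking a path (hypothetical here). Returns a
    (results, errors) pair so that one unreadable file does not lose the run.
    """
    results, errors = [], []
    for path in paths:
        try:
            results.append(scan_one(path))
        except Exception as exc:
            errors.append({
                "path": path,
                "error": "%s: %s" % (type(exc).__name__, exc),
                "traceback": traceback.format_exc(),
            })
    return results, errors
```

Catching the broad `Exception` is a deliberate choice for this sketch: the failure in the report comes from a third-party PDF library, and the caller mainly needs to know which file triggered it.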