ScanCode crashes with PDFEncryptionError #67

Closed
rrjohnston opened this issue Sep 2, 2015 · 10 comments

@rrjohnston

Ubuntu 12.04
x86_64
Python 2.7.3
ScanCode Version 1.3.1

I attempted to create both html_app and html output: ./scanCode -f html tivo.html

Workspace being scanned has 198785 files. ScanCode directory and workspace resident on local machine.

On both scan attempts the tool crashed, apparently while trying to scan a PDF file. There's no information about which file caused the problem, so I can't independently check its validity.

No output file was created.

Traceback (most recent call last):
  File "/export/dqj347/scancode-toolkit-1.3.1/bin/scancode", line 9, in <module>
    load_entry_point('scancode-toolkit==1.3.1', 'console_scripts', 'scancode')()
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 664, in __call__
    return self.main(*args, **kwargs)
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 230, in main
    standalone_mode=standalone_mode, **extra)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 644, in main
    rv = self.invoke(ctx)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 837, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 464, in invoke
    return callback(*args, **kwargs)
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 290, in scancode
    results.append(scan_one(input_file, copyright, license, verbose))
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 332, in scan_one
    data['copyrights'] = list(get_copyrights(input_file))
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/api.py", line 62, in get_copyrights
    for copyrights, _, _, _, start_line, end_line in detect_copyrights(location):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 70, in detect_copyrights
    for numbered_lines in candidate_lines(analysis.text_lines(location)):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 797, in candidate_lines
    for line_number, line in enumerate(lines):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/analysis.py", line 552, in unicode_text_lines_from_pdf
    for line in pdf.get_text_lines(location):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/pdf.py", line 46, in get_text_lines
    document = PDFDocument(parser)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 326, in __init__
    self._initialize_password(password)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 348, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={'CF': {'StdCF': {'Length': 16, 'CFM': /AESV2, 'AuthEvent': /DocOpen}}, 'O': '\xf1T({\xf5#N\xc0\xfewr\xcf6\xd2\x92\x89\x1b\xbe\x11\x8c\xd0\xec\x88\x1d\x1a\x9c}\xf5\xb7J\xb5\x87', 'Filter': /Standard, 'P': -1036, 'Length': 128, 'R': 4, 'U': '\x14\x8bR\xb6x\x97t\xc1\xcf\xeaO{\x1a]\xfc\xfd\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'V': 4, 'StmF': /StdCF, 'StrF': /StdCF}
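
The traceback shows a single unreadable file aborting the whole run without naming the file. A minimal sketch of one way to isolate per-file failures follows. It is not ScanCode's actual code: `scan_all` and the `scan_one` callable passed to it are hypothetical stand-ins for the per-file scanner, and catching a broad `Exception` is an assumption made to keep the sketch short.

```python
import logging

logger = logging.getLogger("scan_sketch")


def scan_all(paths, scan_one):
    """Scan each path, recording failures instead of aborting the run.

    `scan_one` is a hypothetical callable standing in for the per-file
    scanner; it is not a real ScanCode API.
    """
    results, failures = [], []
    for path in paths:
        try:
            results.append((path, scan_one(path)))
        except Exception as exc:  # e.g. an encrypted PDF the parser rejects
            # Log the offending path so the file can be inspected afterwards.
            logger.warning("scan failed for %s: %r", path, exc)
            failures.append((path, repr(exc)))
    return results, failures
```

With this shape, the caller still gets results for every readable file, plus a list of the paths that failed and why.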

@pombredanne pombredanne added the bug label Sep 3, 2015
@pombredanne
Member

@rrjohnston Thanks for this bug report!
It is possible that this is the same as #56. Let me check that and provide you with instructions to run a test with the latest develop branch code, if the bug is already fixed there.

@rrjohnston
Author

Philippe,

I think you're right about #56. I had posted this internally to our team
focused on OSS and Marciano Pitargue said it looked like his issue and he
forwarded me the link to a dev build you gave him (
https://github.com/nexB/scancode-toolkit/archive/develop.tar.gz). I tried
it out on the same workspace yesterday and it completed successfully (in
16.7 hours for 195676 files).

Regards,
Rick


@pombredanne
Member

@rrjohnston thank you for checking this! I will close this bug as a duplicate. Your runtime stats are quite interesting too. For now, ScanCode is single-threaded, so scans are rather slow. There are plans to improve this by running on multiple processes/threads in the future to speed up the scans.

What are the specs of the machine you have been using for this run?

And do you consider that 16+ hours is OK or unacceptable?

I would like to start tracking example runtimes on a wiki page: https://github.com/nexB/scancode-toolkit/wiki/Runtime-performance-reports

@rrjohnston
Author

Machine is as follows (large shared build server):

40 threads (2 processors, 10 cores each, with hyperthreading)

3.1 GHz

128GB RAM

8TB controller RAID5

The files were all local on that server, so an environment involving a
network file system (e.g. ClearCase) would be slower. I don't have a good
basis for comparison but 16 hours seems like a long time for a scan on a
box like this. Going to multiple threads (configurable from the command
line, I hope) should help a lot since I'd think each operation is
completely atomic - no sequencing required.

Regards,
Rick
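
For reference, the reported run (195676 files in 16.7 hours) works out to roughly 0.31 seconds per file, or a little over 3 files per second. The sketch below does that arithmetic and also gives a best-case estimate for a parallel run. The `ideal_hours` helper is hypothetical and assumes perfectly linear scaling, which real workloads rarely reach (hyperthreads are not full cores, and disk I/O is shared), so treat its result as a lower bound on runtime only.

```python
FILES = 195676   # file count reported for the successful run
HOURS = 16.7     # wall-clock time reported for that run

total_seconds = HOURS * 3600
seconds_per_file = total_seconds / FILES
files_per_second = FILES / total_seconds


def ideal_hours(serial_hours, jobs):
    """Best-case runtime with `jobs` workers, assuming linear scaling."""
    return serial_hours / jobs


if __name__ == "__main__":
    print("seconds per file: %.3f" % seconds_per_file)
    print("files per second: %.2f" % files_per_second)
    print("ideal hours with 6 jobs: %.2f" % ideal_hours(HOURS, 6))
```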


@pombredanne
Member

> I don't have a good basis for comparison but 16 hours seems like a long time for a scan on a box like this. Going to multiple threads (configurable from the command line, I hope) should help a lot since I'd think each operation is completely atomic - no sequencing required.

Yes, scanning is designed to process one file at a time and to share nothing, so that multithreading and parallelism can be added (reasonably) easily. The implementation might be best done with a -j type of command-line flag, similar to the make -j option, where -j 6 would run 6 jobs in parallel. Would this work?
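
A minimal sketch of what such a flag could look like, using only the Python 3 standard library. This illustrates the share-nothing idea and is not ScanCode's implementation: the `-j`/`--jobs` option here, `parse_args`, `scan_parallel`, and the placeholder `scan_one` are all hypothetical names chosen for the example.

```python
import argparse
import multiprocessing


def parse_args(argv=None):
    """Parse a make-style -j option (a hypothetical flag for this sketch)."""
    parser = argparse.ArgumentParser(description="parallel scan sketch")
    parser.add_argument("-j", "--jobs", type=int, default=1,
                        help="number of worker processes")
    parser.add_argument("paths", nargs="*", help="files to scan")
    return parser.parse_args(argv)


def scan_one(path):
    """Placeholder per-file scan; it shares no state with other files."""
    return path, len(path)


def scan_parallel(paths, jobs=1):
    """Scan paths serially, or across `jobs` worker processes."""
    if jobs <= 1:
        return [scan_one(p) for p in paths]
    with multiprocessing.Pool(processes=jobs) as pool:
        # imap yields results in input order while workers run concurrently.
        return list(pool.imap(scan_one, paths, chunksize=16))
```

Calling `scan_parallel(paths, jobs=6)` would then correspond to `-j 6`. Because each file is handled independently, no locking is needed; the only coordination is collecting the results.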

@rrjohnston
Author

Philippe,

I just ran another scan with the scancode development version and found a
problem. This was run with the --license option, generating html-app
output. When scanning TLSF allocator code (available at
https://github.com/emeryberger/Malloc-Implementations/blob/master/allocators/TLSF/TLSF-2.4.6/src/tlsf.c)
it outputs a "GPL" entry for it and a "GPL 2.0" entry - both referring to
line 14. But it doesn't include an entry for the next line which lists
LGPLv2.1 as well. It's important to document all licenses in a file since
the LGPL allows usage that GPL doesn't, and this dual-licensing allows
users to pick which they will use it under.

Regards,
Rick


@pombredanne
Member

@rrjohnston I have created a new ticket for the license detection issue as #75

pombredanne added a commit that referenced this issue Sep 18, 2015
 * Fix for missing LGPL detection in #75 (and initially reported in #67)
@rrjohnston
Author

This version seems to catch the multiple licenses nicely. But if I may make a request, it would help to have all the licenses listed on a single line. One has to know in advance to look for multiple entries of a file to find all the licenses that it's covered by.

@pombredanne
Member

@rrjohnston you wrote:

But if I may make a request, it would help to have all the licenses listed on a single line. One has to know in advance to look for multiple entries of a file to find all the licenses that it's covered by.

Yes, you are entirely right. This is the purpose of #74.
In this case, once #74 is implemented, you will get a single result returned as:
"GPL-2.0 or LGPL-2.1" with the corresponding start and end line.
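
A minimal sketch of how separate detections could be folded into one entry like the above. It is an illustration only, not the design of #74: the `combine` function and the `(license_key, start_line, end_line)` tuple shape are hypothetical, and joining with "or" assumes the license text really offers a choice, which a real implementation would have to verify from the matched text.

```python
def combine(detections, joiner=" or "):
    """Merge detections on overlapping or adjacent lines into one entry.

    `detections` is a list of (license_key, start_line, end_line) tuples,
    a hypothetical shape chosen for this sketch.
    """
    merged = []
    for key, start, end in sorted(detections, key=lambda d: (d[1], d[2])):
        if merged and start <= merged[-1][2] + 1:
            # Touches the previous group: extend it rather than start anew.
            keys, group_start, group_end = merged[-1]
            if key not in keys:
                keys.append(key)
            merged[-1] = (keys, group_start, max(group_end, end))
        else:
            merged.append(([key], start, end))
    return [(joiner.join(keys), start, end) for keys, start, end in merged]
```

For the tlsf.c case, detections on lines 14 and 15 would collapse into a single entry spanning both lines, while licenses found far apart in a file would stay separate.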

@pombredanne
Member

The point is that this is how it is already detected internally, but it is not yet reported this way, which I reckon is a pity!

@jdaguil jdaguil closed this as completed Nov 24, 2015
@jdaguil jdaguil modified the milestone: v1.4 Nov 24, 2015
pombredanne added a commit that referenced this issue Nov 24, 2015
 * Fix for missing LGPL detection in #75 (and initially reported in #67)