ScanCode crashes with PDFEncryptionError #67

rrjohnston · 2015-09-02T17:53:07Z

Ubuntu 12.04
x86_64
Python 2.7.3
ScanCode Version 1.3.1

I attempted to create both html_app and html output: ./scanCode -f html tivo.html

Workspace being scanned has 198785 files. ScanCode directory and workspace resident on local machine.

On both scan attempts the tool crashed, apparently while trying to scan a PDF file. There's no information about which file caused the problem, so I can't independently check its validity.

No output file was created.

Traceback (most recent call last):
  File "/export/dqj347/scancode-toolkit-1.3.1/bin/scancode", line 9, in <module>
    load_entry_point('scancode-toolkit==1.3.1', 'console_scripts', 'scancode')()
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 664, in __call__
    return self.main(*args, **kwargs)
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 230, in main
    standalone_mode=standalone_mode, **extra)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 644, in main
    rv = self.invoke(ctx)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 837, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/click/core.py", line 464, in invoke
    return callback(*args, **kwargs)
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 290, in scancode
    results.append(scan_one(input_file, copyright, license, verbose))
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/cli.py", line 332, in scan_one
    data['copyrights'] = list(get_copyrights(input_file))
  File "/export/dqj347/scancode-toolkit-1.3.1/src/scancode/api.py", line 62, in get_copyrights
    for copyrights, _, _, _, start_line, end_line in detect_copyrights(location):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 70, in detect_copyrights
    for numbered_lines in candidate_lines(analysis.text_lines(location)):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/cluecode/copyrights.py", line 797, in candidate_lines
    for line_number, line in enumerate(lines):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/analysis.py", line 552, in unicode_text_lines_from_pdf
    for line in pdf.get_text_lines(location):
  File "/export/dqj347/scancode-toolkit-1.3.1/src/textcode/pdf.py", line 46, in get_text_lines
    document = PDFDocument(parser)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 326, in __init__
    self._initialize_password(password)
  File "/export/dqj347/scancode-toolkit-1.3.1/local/lib/python2.7/site-packages/pdfminer/pdfdocument.py", line 348, in _initialize_password
    raise PDFEncryptionError('Unknown algorithm: param=%r' % param)
pdfminer.pdfdocument.PDFEncryptionError: Unknown algorithm: param={'CF': {'StdCF': {'Length': 16, 'CFM': /AESV2, 'AuthEvent': /DocOpen}}, 'O': '\xf1T({\xf5#N\xc0\xfewr\xcf6\xd2\x92\x89\x1b\xbe\x11\x8c\xd0\xec\x88\x1d\x1a\x9c}\xf5\xb7J\xb5\x87', 'Filter': /Standard, 'P': -1036, 'Length': 128, 'R': 4, 'U': '\x14\x8bR\xb6x\x97t\xc1\xcf\xeaO{\x1a]\xfc\xfd\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', 'V': 4, 'StmF': /StdCF, 'StrF': /StdCF}
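
Editorial sketch (not ScanCode's actual code): the report above notes that a single unreadable PDF aborted the whole run and that the offending file was not named. One common way to handle this, assuming a per-file scan function, is to catch the error per file and record the path. The names `scan_files` and `fake_scan` below are hypothetical and only illustrate the pattern.

```python
def scan_files(paths, scan_one):
    """Scan each path, recording failures instead of aborting the whole run.

    `scan_one` is a hypothetical callable that takes a path and returns a result.
    """
    results = {}
    errors = {}
    for path in paths:
        try:
            results[path] = scan_one(path)
        except Exception as exc:  # e.g. an encryption error raised by a PDF library
            # Keep the offending path and the error text so the file can be
            # inspected independently afterwards.
            errors[path] = '%s: %s' % (type(exc).__name__, exc)
    return results, errors


def fake_scan(path):
    # Stand-in scanner: pretends that one encrypted PDF cannot be read.
    if path.endswith('encrypted.pdf'):
        raise ValueError('Unknown algorithm')
    return {'copyrights': []}


results, errors = scan_files(['a.c', 'docs/encrypted.pdf', 'b.c'], fake_scan)
print(errors)
```

With this shape, the scan of the remaining files completes and the failing path is available for a follow-up report.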

pombredanne · 2015-09-03T06:13:15Z

@rrjohnston Thanks for this bug report!
It is possible that this is the same as #56. Let me check that and give you instructions for running a test with the latest develop branch code if the bug is already fixed there.

rrjohnston · 2015-09-03T12:25:14Z

Philippe,

I think you're right about #56. I had posted this internally to our team
focused on OSS and Marciano Pitargue said it looked like his issue and he
forwarded me the link to a dev build you gave him (
https://github.com/nexB/scancode-toolkit/archive/develop.tar.gz). I tried
it out on the same workspace yesterday and it completed successfully (in
16.7 hours for 195676 files).

Regards,
Rick
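
Editorial note: the figures quoted above (16.7 hours for 195676 files) work out to roughly 0.31 seconds per file, or a little over 3 files per second. A small calculation, using only those two numbers from the comment:

```python
files = 195676
hours = 16.7

seconds = hours * 3600              # total wall-clock time in seconds
seconds_per_file = seconds / files  # average time spent per file
files_per_second = files / seconds  # average throughput

print('%.2f s/file, %.2f files/s' % (seconds_per_file, files_per_second))
```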

pombredanne · 2015-09-03T17:32:23Z

@rrjohnston thank you for checking this! I will close this bug as a duplicate. Your runtime stats are quite interesting too. For now, ScanCode is single threaded so the scans are rather slow. There are plans to improve this and run on multiple processes/threads in the future to expedite the scans.

What are the specs of the machine you have been using for this run?

And do you consider that 16+ hours is OK or unacceptable?

I would like to start tracking example runtimes on a wiki page: https://github.com/nexB/scancode-toolkit/wiki/Runtime-performance-reports

rrjohnston · 2015-09-03T20:48:31Z

Machine is as follows (large shared build server):

40 threads (2 processors, 10 cores each, with hyperthreading)

3.1 GHz

128GB RAM

8TB controller RAID5

The files were all local on that server, so an environment involving a
network file system (e.g. ClearCase) would be slower. I don't have a good
basis for comparison but 16 hours seems like a long time for a scan on a
box like this. Going to multiple threads (configurable from the command
line, I hope) should help a lot since I'd think each operation is
completely atomic - no sequencing required.

Regards,
Rick

pombredanne · 2015-09-03T21:55:06Z

> I don't have a good basis for comparison but 16 hours seems like a long time for a scan on a box like this. Going to multiple threads (configurable from the command line, I hope) should help a lot since I'd think each operation is completely atomic - no sequencing required.

Yes, scanning is designed to process one file at a time and to share nothing, so that multithreading and parallelism can be added (reasonably) easily. The implementation might be best done with a -j command line flag, similar to the make -j option, where -j 6 would run 6 jobs in parallel. Would this work?
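
Editorial sketch of the proposed -j idea, not the eventual ScanCode implementation: because each file is scanned independently, the work can be handed to a pool of workers whose size comes from the command line. The option names, `scan_one` and `scan_all` are hypothetical. A thread pool is used here only to keep the example easy to run; for CPU-bound scanning, `concurrent.futures.ProcessPoolExecutor` would be the more likely choice and can be passed as `executor_cls`.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor


def parse_args(argv):
    parser = argparse.ArgumentParser(description='Hypothetical parallel scan driver')
    parser.add_argument('-j', '--jobs', type=int, default=1,
                        help='number of files to scan in parallel')
    parser.add_argument('paths', nargs='+')
    return parser.parse_args(argv)


def scan_one(path):
    # Stand-in for a real per-file scan; it touches no shared state.
    return {'path': path, 'length': len(path)}


def scan_all(paths, jobs, executor_cls=ThreadPoolExecutor):
    # Each file is an independent task, so results only need to be collected.
    # executor.map() returns results in the same order as the input paths.
    with executor_cls(max_workers=jobs) as executor:
        return list(executor.map(scan_one, paths))


args = parse_args(['-j', '6', 'a.c', 'b.c', 'c.c'])
results = scan_all(args.paths, args.jobs)
print(results)
```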

rrjohnston · 2015-09-18T14:16:15Z

Philippe,

I just ran another scan with the scancode development version and found a
problem. This was run with the --license option, generating html-app
output. When scanning TLSF allocator code (available at
https://github.com/emeryberger/Malloc-Implementations/blob/master/allocators/TLSF/TLSF-2.4.6/src/tlsf.c)
it outputs a "GPL" entry for it and a "GPL 2.0" entry - both referring to
line 14. But it doesn't include an entry for the next line which lists
LGPLv2.1 as well. It's important to document all licenses in a file since
the LGPL allows usage that GPL doesn't, and this dual-licensing allows
users to pick which they will use it under.

Regards,
Rick

pombredanne · 2015-09-18T14:38:53Z

@rrjohnston I have created a new ticket for the license detection issue as #75

rrjohnston · 2015-09-18T15:30:52Z

This version seems to catch the multiple licenses nicely. But if I may make a request, it would help to have all the licenses listed on a single line. One has to know in advance to look for multiple entries of a file to find all the licenses that it's covered by.

pombredanne · 2015-09-18T15:34:40Z

@rrjohnston you wrote:

But if I may make a request, it would help to have all the licenses listed on a single line. One has to know in advance to look for multiple entries of a file to find all the licenses that it's covered by.

Yes, you are entirely right. This is the purpose of #74.
In this case, once #74 is implemented, you will get a single result returned as
"GPL-2.0 or LGPL-2.1" with the corresponding start and end lines.

pombredanne · 2015-09-18T15:35:27Z

The point is that this is how it is already detected internally; it is just not yet reported this way, which I reckon is a pity!
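
Editorial sketch of the reporting change discussed above, under stated assumptions: detections are taken to be (path, license, start line, end line) tuples, and the licenses found in one file are joined with " or " purely for illustration. Whether a real file means "or" or "and" depends on its license text, which this toy grouping does not analyze. The function name `combine_licenses` and the sample line numbers are hypothetical.

```python
from collections import OrderedDict


def combine_licenses(detections):
    """Group (path, license, start_line, end_line) tuples into one entry per file."""
    combined = OrderedDict()
    for path, license_key, start, end in detections:
        entry = combined.setdefault(
            path, {'licenses': [], 'start': start, 'end': end})
        if license_key not in entry['licenses']:
            entry['licenses'].append(license_key)
        # Widen the reported span to cover every detection in the file.
        entry['start'] = min(entry['start'], start)
        entry['end'] = max(entry['end'], end)
    return combined


detections = [
    ('tlsf.c', 'GPL-2.0', 14, 14),
    ('tlsf.c', 'LGPL-2.1', 15, 15),
]
lines = []
for path, entry in combine_licenses(detections).items():
    lines.append('%s: %s (lines %d-%d)' % (
        path, ' or '.join(entry['licenses']), entry['start'], entry['end']))
print('\n'.join(lines))
```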

pombredanne added the bug label Sep 3, 2015

pombredanne added the duplicate label Sep 3, 2015

pombredanne mentioned this issue Sep 18, 2015

Incorrect detection of LGPL #75

Closed

pombredanne added a commit that referenced this issue Sep 18, 2015

Added new and improved rules and test.

f612872

* Fix for missing LGPL detection in #75 (and initially reported in #67)

pombredanne added the license scan label Oct 27, 2015

pombredanne added fixed pending review and removed bug labels Nov 19, 2015

jdaguil closed this as completed Nov 24, 2015

jdaguil modified the milestone: v1.4 Nov 24, 2015

pombredanne added a commit that referenced this issue Nov 24, 2015

Added new and improved rules and test.

6b464c0

* Fix for missing LGPL detection in #75 (and initially reported in #67)
