Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Add decrypt support for V5 and AES-128, AES-256 (R5 only) #749

Merged
merged 27 commits into from Jun 19, 2022
Merged

ENH: Add decrypt support for V5 and AES-128, AES-256 (R5 only) #749

merged 27 commits into from Jun 19, 2022

Conversation

exiledkingcc
Copy link
Contributor

rewrite the encryption part to support V4 and AES-128 encryption (ONLY decrypt for now).

i like the idea of PyPDF2 cleanup, so this is Python3 ONLY.

this commit needs PyCryptodome for AES operations.

the encrypt part will be added some time later, and maybe AES-256 support will be added too, if it's not difficult.

@MartinThoma MartinThoma added this to the PyPDF2 version 2.0.0 milestone Apr 14, 2022
@exiledkingcc
Copy link
Contributor Author

local tox test is OK with py38.
it seems github failed because dependency PyCryptodome not installed.
does it not read tox.ini?

@MartinThoma
Copy link
Member

does it not read tox.ini?

CI (Github Actions) is defined here. It uses requirements/ci.txt - and manually entered packages for Python 2.7

@MartinThoma
Copy link
Member

As this PR is rather big, I would add it in the 2.0 release or later. Simply to ensure we can leave the 1.X parts soon-ish. Then you also don't have to care about the 2.7 support as it will be dropped with PyPDF2 version 2.0

@exiledkingcc
Copy link
Contributor Author

encryption R=6 is used by PDF 2.0.
i can't find a "PDF 2.0 specification" on web, so it's not supported

@exiledkingcc exiledkingcc changed the title decrypt support V4 and AES-128 decrypt support V5 and AES-128, AES-256(R5 only) Apr 18, 2022


try:
from Crypto.Cipher import ARC4, AES
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the motivation to have it include both attempt to use https://github.com/Legrandin/pycryptodome and also a fallback to an embedded method? Feels like should either add dependencies or have decryption be included, but having both seems like added maintenance for little gain.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

at first, I used PyCryptodome, then I found PyPDF2 has no dependencies at all, so I try not to introduce a dependency by adding AES code in PyPDF2.

encryption/decryption is not always needed for PDF, it's better to make PyCryptodome optional for users who do not need it.
for some other users, they may prefer PyCryptodome for better performance.
so I provide both.

maybe there should be a discussion for this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PDF encryption / decryption in the wild, it's really just using AES out in the wild, right? So it's not like if to have in-house cryptography around this, the code for that wouldn't be too large, and should be relatively stable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PDF specification also defined Public Key algorithms for PDF encryption, but I never see such PDF files.
most encrypted PDF files use only RC4 and AES, and some hash functions.
AES code can be very stable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about it, I really would like to avoid having to maintain security-critical parts. I worry more about the encryption part than the decryption part.

If we mess up encryption, users might end up less secure then they expect. If we mess up decryption, users will just see an error.

Maybe we can have "inline imports" (importing Crypto within the encrypt / decrypt function). So we could make PyCryptodome an optional dependency that people would need to install if they want to use encryption / decryption. Maybe we could also get rid of a part of the current codebase this way?

To clarify: I don't see that I will properly maintain the crypto parts. This is the reason why I have this tendency. What do you think?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

inline imports is a good choice, actually it's what I did at first.
when I added AES code, I try to make the lib independent, but ignored the burden of maintaining.
I agree with that it's better to leave the AES to PyCryptodome when users need it.
I will remove _aes.py later.

@MartinThoma
Copy link
Member

Very nice, thank you 🤗

@MartinThoma
Copy link
Member

One adjustment is still necessary: In setup.cfg needs to be a section:

[options.extras_require]
crypto = PyCryptodome

So people can install it with pip install PyPDF2[crypto]

Is there a minimum version of PyCryptodome this PR needs?

@exiledkingcc
Copy link
Contributor Author

One adjustment is still necessary: In setup.cfg needs to be a section:

[options.extras_require]
crypto = PyCryptodome

So people can install it with pip install PyPDF2[crypto]

good idea!

Is there a minimum version of PyCryptodome this PR needs?

no

@MartinThoma
Copy link
Member

Nice! From my perspective this looks ready to be merged. However, due to the missing 2.7 support, it will take a while (until the PyPDF2 2.0 release).

I hope I can make that release on 1st of July. I really want to get this done: #752

If I don't get any comments on what should be part of PyPDF2 1.x?, I will start working on 2.0 on 1st of May.

@MartinThoma
Copy link
Member

MartinThoma commented Apr 25, 2022

I think I'll add a "no-python27" pytest marker that CI uses

... on the other hand, if nobody reacts to #753 I will start working on the PyPDF2 2.0 release from 1st of May 😄

@MartinThoma
Copy link
Member

MartinThoma commented Apr 25, 2022

I still want to go through #817 and merge

Plus maybe add a contributors.md and maybe add some magic methods / camel_case method names + deprecation warnings for snakeCase method names: #751

@exiledkingcc
Copy link
Contributor Author

it seems OK for me, no change is needed.

@MartinThoma MartinThoma changed the title ENH: decrypt support V5 and AES-128, AES-256 (R5 only) ENH: Add decrypt support for V5 and AES-128, AES-256 (R5 only) Jun 19, 2022
Comment on lines +346 to +348
except DependencyError as e:
# make dependency error clear to users
raise e
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is that part necessary? What does it do? It seems to me that you just catch an exception and raise exactly the same exception again.... hence removing this block would not change the behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if your PDF need AES algorithm to decrypt, but you don't install pycrytodome, without this two line,you will just got "the pdf is not decrypted", with this, you will know that you just need install the dependcy.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, interesting! Took me a minute to understand - thank you for pointing out that I need to look at the following two lines!

PyPDF2/encryption.py Outdated Show resolved Hide resolved
PyPDF2/encryption.py Outdated Show resolved Hide resolved
@MartinThoma MartinThoma merged commit 868f977 into py-pdf:main Jun 19, 2022
@MartinThoma MartinThoma removed the soon PRs that are almost ready to be merged, issues that get solved pretty soon label Jun 19, 2022
MartinThoma added a commit that referenced this pull request Jun 19, 2022
MartinThoma added a commit that referenced this pull request Jun 19, 2022
The highlight of this release is improved support for file encryption
(AES-128 and AES-256, R5 only). See #749 for the amazing work of
@exiledkingcc 🎊 Thank you 🤗

Deprecations (DEP):
-  Rename names to be PEP8-compliant (#967)
  - `PdfWriter.get_page`: the pageNumber parameter is renamed to page_number
  - `PyPDF2.filters`:
    * For all classes, a parameter rename: decodeParms ➔ decode_parms
    * decodeStreamData ➔ decode_stream_data
  - `PyPDF2.xmp`:
    * XmpInformation.rdfRoot ➔ XmpInformation.rdf_root
    * XmpInformation.xmp_createDate ➔ XmpInformation.xmp_create_date
    * XmpInformation.xmp_creatorTool ➔ XmpInformation.xmp_creator_tool
    * XmpInformation.xmp_metadataDate ➔ XmpInformation.xmp_metadata_date
    * XmpInformation.xmp_modifyDate ➔ XmpInformation.xmp_modify_date
    * XmpInformation.xmpMetadata ➔ XmpInformation.xmp_metadata
    * XmpInformation.xmpmm_documentId ➔ XmpInformation.xmpmm_document_id
    * XmpInformation.xmpmm_instanceId ➔ XmpInformation.xmpmm_instance_id
  - `PyPDF2.generic`:
    * readHexStringFromStream ➔ read_hex_string_from_stream
    * initializeFromDictionary ➔ initialize_from_dictionary
    * createStringObject ➔ create_string_object
    * TreeObject.hasChildren ➔ TreeObject.has_children
    * TreeObject.emptyTree ➔ TreeObject.empty_tree

New Features (ENH):
-  Add decrypt support for V5 and AES-128, AES-256 (R5 only) (#749)

Robustness (ROB):
-  Fix corrupted (wrongly) linear PDF (#1008)

Maintenance (MAINT):
-  Move PDF_Samples folder into ressources
-  Fix typos (#1007)

Testing (TST):
-  Improve encryption/decryption test (#1009)
-  Add merger test cases with real PDFs (#1006)
-  Add mutmut config

Code Style (STY):
-  Put pure data mappings in separate files (#1005)
-  Make encryption module private, apply pre-commit (#1010)

Full Changelog: 2.2.1...2.3.0
@xilopaint
Copy link
Contributor

Is there any encrypted PDF file available that I could use for testing this PR? I mean some file that I couldn't decrypt before and now it's possible.

@pubpub-zz
Copy link
Collaborator

@xilopaint, test samples are included in this PR for autotest

@MartinThoma
Copy link
Member

@xilopaint You can also easily create them with qpdf:

$ qpdf --encrypt foo bar 256 --force-R5 -- resources/crazyones.pdf crazyones-256.pdf

and then run

from PyPDF2 import PdfReader

reader = PdfReader("crazyones-256.pdf", password="foo")
print(reader.pages[0].extract_text())

If you checkout 2.2.1 you will get:

Traceback (most recent call last):
  File "foo.py", line 3, in <module>
    reader = PdfReader("crazyones-256.pdf", password="foo")
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 262, in __init__
    if password is not None and self.decrypt(password) == 0:
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1610, in decrypt
    return self._decrypt(password)
  File "/home/moose/Github/py-pdf/PyPDF2/PyPDF2/_reader.py", line 1651, in _decrypt
    f"only algorithm code 1 and 2 are supported. This PDF uses code {encrypt_v}"
NotImplementedError: only algorithm code 1 and 2 are supported. This PDF uses code 5

@xilopaint
Copy link
Contributor

xilopaint commented Jun 19, 2022

@MartinThoma the v2.3.0 is broken for me. _codecs is not being installed by pip and I'm getting ModuleNotFoundError: No module named 'PyPDF2._codecs'. To make it work I had to clone this repo and copy/paste the _codecs folder.

Also, for using the new enhanced decryption I had to install pycryptodome because it was not included as a dependency.

@MartinThoma
Copy link
Member

MartinThoma commented Jun 19, 2022

@xilopaint Thank you for pointing it out - I just fixed it and released PyPDF2==2.3.1

pycryptodome is an optional dependency which you can install via pip install PyPDF2[crypto] (depending on your shell, you might need to escape the [ and ]).

@xilopaint
Copy link
Contributor

(depending on your shell, you might need to escape the [ and ])

Yeah, I had tried pip install pypdf2[crypto] before and it didn't work in my zsh shell. Now using double quotes it does work.

What's the reason for pycryptodome being an optional dependency? Just making the package more lightweight for people who don't need the enhanced encryption?

@MartinThoma
Copy link
Member

What's the reason for pycryptodome being an optional dependency? Just making the package more lightweight for people who don't need the enhanced encryption?

Yes exactly! Quite a lot of users don't need any encryption / decryption capabilities.

Additionally, pycryptodome is not a pure-python dependency. That means if it was a non-optional dependency, it might make it for some cases way harder / impossible to install / use

@xilopaint
Copy link
Contributor

@MartinThoma my tests are failing in GitHub Actions since I've installed pypdf2[crypto]. Look:

https://github.com/xilopaint/alfred-pdf-tools/runs/6956399382?check_suite_focus=true

Although they're passing locally with no errors. Could you help me out?

@MartinThoma
Copy link
Member

I just had a quick glance, but it seems like a Linux kernel module is missing. That is exactly the kind of reason why I wanted this dependency to be optional.

If you don't install the crypto part, the pure-python implementation is used. That still offers encryption / decryption, but only older algorithms

@exiledkingcc exiledkingcc deleted the encryption branch June 20, 2022 03:38
@xilopaint
Copy link
Contributor

Hey @MartinThoma, what does R5 mean in the context of cryptography?

@MartinThoma
Copy link
Member

That is a PDF specific thing. It's short for "revision". You can read more in the specs in the "Standard Encryption Dictionary" chapter.

image

@xilopaint
Copy link
Contributor

Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants