Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: Support R6 decrypting #1015

Merged
merged 7 commits into from Jun 26, 2022
Merged

ENH: Support R6 decrypting #1015

merged 7 commits into from Jun 26, 2022

Conversation

exiledkingcc
Copy link
Contributor

@exiledkingcc exiledkingcc commented Jun 20, 2022

Fixes #416

@codecov
Copy link

codecov bot commented Jun 20, 2022

Codecov Report

Merging #1015 (dfaa735) into main (a40946c) will increase coverage by 0.09%.
The diff coverage is 87.64%.

@@            Coverage Diff             @@
##             main    #1015      +/-   ##
==========================================
+ Coverage   89.50%   89.60%   +0.09%     
==========================================
  Files          24       24              
  Lines        4432     4425       -7     
  Branches      919      914       -5     
==========================================
- Hits         3967     3965       -2     
+ Misses        318      314       -4     
+ Partials      147      146       -1     
Impacted Files Coverage Δ
PyPDF2/_reader.py 91.74% <81.81%> (-0.40%) ⬇️
PyPDF2/_encryption.py 74.86% <89.23%> (+2.00%) ⬆️
PyPDF2/__init__.py 100.00% <100.00%> (ø)
PyPDF2/_merger.py 88.80% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a40946c...dfaa735. Read the comment docs.

@MartinThoma
Copy link
Member

You're on fire, @exiledkingcc 👍 🚀

The tests pass - is this ready from your perspective, @exiledkingcc ?

@exiledkingcc
Copy link
Contributor Author

exiledkingcc commented Jun 20, 2022

thanks to qpdf, althoght it's comment shows the algorithm is NOT EXACTLY described in specifiction.
they say

the wording of the specification is very unclear...

anyway i think you can merge this, it will not make things worse, at least, we can decrypt PDFs encypted by qpdf.
then i will start to rewrite the encrypt part if there is time.

@MartinThoma
Copy link
Member

I was just checking the PDF files. While the r2 / r3 / r4 files ask for passwords, the r5 and r6 files don't. I can directly see the content. Are you sure that they are encrypted?

@exiledkingcc
Copy link
Contributor Author

yes, they are encrypted, but not password-protected.
for R5 and R6, both user password and owner password can be used to decrypt the file, if one is empty, pdf softwares will not ask for the password.
I'm confused too. the pdf softwares ask for one password to decrypt the PDF file, but there are two passwords used for encryption.

@exiledkingcc
Copy link
Contributor Author

this is how qpdf do about passwords, maybe we should keep the same way:
https://qpdf.readthedocs.io/en/stable/encryption.html#user-and-owner-passwords
so do not merge this for now, i will make the decrypting process more clear and deal with the empty password.

@MartinThoma MartinThoma changed the title support R6 decrypting ENH: Support R6 decrypting Jun 25, 2022
@MartinThoma
Copy link
Member

@exiledkingcc Do you think this is ready to be merged? If yes, I would review it again this evening :-) I'm excited about it :-)

@exiledkingcc
Copy link
Contributor Author

i think yes. 😊

@@ -137,7 +137,7 @@ def merge(
reader = PdfReader(stream, strict=self.strict) # type: ignore[arg-type]
self.inputs.append((stream, reader, my_file))
if encryption_obj is not None:
reader._encryption = encryption_obj
reader.encryption = encryption_obj
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every attribute that does not start with an underscore is part of the public interface. That means we cannot change the behavior / name without deprecation warnings.

Is there a reason why a user would want to access the encryption attribute directly?

@@ -254,8 +254,26 @@ def __init__(
self.stream = stream

self._override_encryption = False
if password is not None and self.decrypt(password) == 0:
raise PdfReadError("Wrong password")
self.encryption: Optional[Encryption] = None
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consistently with my question in _merger, I would prefer if this was the private attribute _encryption instead of the public attribute `encryption. Except, of course, if there is a reason why users would want to access it.

Comment on lines 265 to 266
encryptEntry = cast(DictionaryObject, self.trailer[TK.ENCRYPT].get_object())
self.encryption = Encryption.read(encryptEntry, id1_entry)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
encryptEntry = cast(DictionaryObject, self.trailer[TK.ENCRYPT].get_object())
self.encryption = Encryption.read(encryptEntry, id1_entry)
encrypt_entry = cast(DictionaryObject, self.trailer[TK.ENCRYPT].get_object())
self.encryption = Encryption.read(encrypt_entry, id1_entry)

self._user_keys: Dict = {}
self._owner_keys: Dict = {}

def verified(self) -> bool:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess verified means that the file was decrypted? Would is_decrypted be better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I think this should rather be _verified if the user does not need it.

PyPDF2/_encryption.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

I've tried to cross-check the decryption, but...

I guess we are head of our time 😄

PyPDF2/_encryption.py Outdated Show resolved Hide resolved
@MartinThoma
Copy link
Member

MartinThoma commented Jun 26, 2022

@exiledkingcc Overall, it looks good to me. Good work 👏 👍 🎉

There are a couple of things I want to be changed before I release it. I could do them myself after merging this PR or you could do them now - whatever you prefer. Just let me know :-)

I've also noticed that if either the user password or the owner password is set to the empty password, typical viewers will directly show the contents. For this reason I would add the following test PDF:

qpdf --encrypt "foo" "bar" 256 -- unencrypted.pdf r6-both-passwords.pdf

@MartinThoma MartinThoma added workflow-encryption From a users perspective, encryption is the affected feature/workflow soon PRs that are almost ready to be merged, issues that get solved pretty soon labels Jun 26, 2022
@exiledkingcc
Copy link
Contributor Author

@MartinThoma all your review comments are great, i will do the update tonight.

@MartinThoma
Copy link
Member

I just saw https://github.com/pdfminer/pdfminer.six/tree/master/samples and especially:

Files in the encryption folder have been generated with cpdf 1.7 [http://www.coherentpdf.com/]
from the base.pdf file generated with LibreOffice 4.1.1.2 as follows:

cpdf -encrypt 40bit foo baz base.pdf -o rc4-40.pdf
cpdf -encrypt 128bit foo baz base.pdf -o rc4-128.pdf
cpdf -encrypt AES foo baz base.pdf -o aes-128.pdf
cpdf -encrypt AES foo baz base.pdf -no-encrypt-metadata -o aes-128-m.pdf
cpdf -encrypt AES256 foo baz base.pdf -o aes-256.pdf
cpdf -encrypt AES256 foo baz base.pdf -no-encrypt-metadata -o aes-256-m.pdf

I think I'll run PyPDF2 over those examples some time today :-)

@exiledkingcc
Copy link
Contributor Author

updated.
but keep PasswordType.USER_PASSWORD = 1, PasswordType.OWNER_PASSWORD == 2

Comment on lines +718 to +721
try:
pwd = password.encode("latin-1")
except Exception: # noqa
pwd = password.encode("utf-8")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you choose latin-1 as the default?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be honest, i don't think too much about it, just copy it from previous code. 😀

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, ok. Thanks for the honesty ❤️

@MartinThoma MartinThoma merged commit e83cdbe into py-pdf:main Jun 26, 2022
@MartinThoma
Copy link
Member

@exiledkingcc Very nice work! Thank you 🤗 I'll release it today (just have to eat something now 😄 )

MartinThoma added a commit that referenced this pull request Jun 26, 2022
New Features (ENH):
-  Support R6 decrypting (#1015)
-  Add PdfReader.pdf_header (#1013)

Performance Improvements (PI):
-  Remove ord_ calls (#1014)

Bug Fixes (BUG):
-  Fix missing page for bookmark (#1016)

Robustness (ROB):
-  Deal with invalid Destinations (#1028)

Documentation (DOC):
-  get_form_text_fields does not extract dropdown data (#1029)
-  Adjust PdfWriter.add_uri docstring
-  Mention crypto extra_requires for installation (#1017)

Developer Experience (DEV):
-  Use /n line endings everywhere (#1027)
-  Adjust string formatting to be able to use mutmut (#1020)
-  Update Bug report template

Full Changelog: 2.3.1...2.4.0
MartinThoma added a commit that referenced this pull request Jul 9, 2022
MartinThoma added a commit that referenced this pull request Jul 9, 2022
MartinThoma added a commit that referenced this pull request Jul 9, 2022
MartinThoma added a commit that referenced this pull request Jul 10, 2022
New Features (ENH):
-  Add PageObject._get_fonts (#1083)
-  Add support for indexed color spaces / BitsPerComponent for decoding PNGs (#1067)

Performance Improvements (PI):
-  Use iterative DFS in PdfWriter._sweep_indirect_references (#1072)

Bug Fixes (BUG):
-  Let Page.scale also scale the crop-/trim-/bleed-/artbox (#1066)
-  Column default for CCITTFaxDecode (#1079)

Robustness (ROB):
-  Guard against None-value in _get_outlines (#1060)

Documentation (DOC):
-  Stamps and watermarks (#1082)
-  OCR vs PDF text extraction (#1081)
-  Python Version support
-  Formatting of CHANGELOG

Developer Experience (DEV):
-  Cache downloaded files (#1070)
-  Speed-up for CI (#1069)

Maintenance (MAINT):
-  Set page.rotate(angle: int) (#1092)
-  Issue #416 was fixed by #1015 (#1078)

Testing (TST):
-  Image extraction (#1080)
-  Image extraction (#1077)

Code Style (STY):
-  Apply black
-  Typo in Changelog

Full Changelog: 2.4.2...2.4.3
@exiledkingcc exiledkingcc deleted the encryption branch July 24, 2022 03:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
soon PRs that are almost ready to be merged, issues that get solved pretty soon workflow-encryption From a users perspective, encryption is the affected feature/workflow
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PyPDF2.utils.PdfReadError: file has not been decrypted
2 participants