-
Notifications
You must be signed in to change notification settings - Fork 118
Add office document formats #133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…x and odp. Added tests and sample documents for document filetypes
|
Hope these changes can be useful for the project. We're relying on filetype.py usage in our startup, so we're committed to keeping these formats/signatures updated if needed. Copied some signatures / code from the go version, but added some changes because of differences in the files produced by LibreOffice. |
babenek
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Optimization suggestions.
UnitTests: complex return (...) cannot produce code coverage for analysis. May be if-else does not overhead...
filetype/types/document.py
Outdated
| len(buf) > 7 | ||
| and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" | ||
| and ( | ||
| (len(buf) > 515 and buf[512:515] == b"\xEC\xA5\xC1\x00") # MS Office | ||
| or ( | ||
| len(buf) > 2142 | ||
| and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q" | ||
| in buf[2075:2142] | ||
| ) # LibreOffice | ||
| ) | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| len(buf) > 7 | |
| and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" | |
| and ( | |
| (len(buf) > 515 and buf[512:515] == b"\xEC\xA5\xC1\x00") # MS Office | |
| or ( | |
| len(buf) > 2142 | |
| and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q" | |
| in buf[2075:2142] | |
| ) # LibreOffice | |
| ) | |
| ) | |
| len(buf) > 515 | |
| and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" | |
| and ( | |
| buf[512:515] == b"\xEC\xA5\xC1\x00" # MS Office | |
| or ( | |
| len(buf) > 2142 | |
| and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q" | |
| in buf[2075:2142] | |
| ) # LibreOffice | |
| ) | |
| ) |
simple optimization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have changed the match methods a bit. Probably better for coverage calculation - but slightly slower. What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably it is so due you used additional assignments.
I draft my suggestion. Anyway if it so slowly - rollback the changes.
Additionally you can test with #132 - sometimes it found uncaught exceptions. But you have to change -max_len=262 to 8K.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool. I did the changes and tested with atheris. Found no errors in new code. Atheris found an error in the isobmff class regarding unicode decoding. I think a simple "errors=ignore" on the decode method could be a solution there...
Anyway. Hopefully the latest changes works better with coverage calculation?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, then may be the PR will me merged. Yes, the fix works #131 .
Co-authored-by: Roman Babenko <babenek@users.noreply.github.com>
filetype/types/document.py
Outdated
| header_match = ( | ||
| len(buf) > 8 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" | ||
| ) | ||
| subheader_match = ( | ||
| header_match | ||
| and len(buf) > 520 | ||
| and ( | ||
| ( | ||
| buf[512:516] == b"\xFD\xFF\xFF\xFF" | ||
| and (buf[518] == 0x00 or buf[518] == 0x02) | ||
| ) | ||
| or (buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00") | ||
| or ( | ||
| len(buf) > 2095 | ||
| and b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc" | ||
| in buf[1568:2095] | ||
| ) | ||
| ) | ||
| ) | ||
| return header_match and subheader_match |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| header_match = ( | |
| len(buf) > 8 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1" | |
| ) | |
| subheader_match = ( | |
| header_match | |
| and len(buf) > 520 | |
| and ( | |
| ( | |
| buf[512:516] == b"\xFD\xFF\xFF\xFF" | |
| and (buf[518] == 0x00 or buf[518] == 0x02) | |
| ) | |
| or (buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00") | |
| or ( | |
| len(buf) > 2095 | |
| and b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc" | |
| in buf[1568:2095] | |
| ) | |
| ) | |
| ) | |
| return header_match and subheader_match | |
| if len(buf) > 520 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1": | |
| if buf[512:516] == b"\xFD\xFF\xFF\xFF" and (buf[518] == 0x00 or buf[518] == 0x02) \ | |
| or buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00": | |
| return True | |
| elif len(buf) > 2095: | |
| return b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc" in buf[1568:2095] | |
| return False |
Please, check my suggestion. May be spaces missed/extra.
Adds formats:
application/mswordapplication/vnd.openxmlformats-officedocument.wordprocessingml.documentapplication/vnd.oasis.opendocument.textapplication/vnd.ms-excelapplication/vnd.openxmlformats-officedocument.spreadsheetml.sheetapplication/vnd.oasis.opendocument.spreadsheetapplication/vnd.ms-powerpointapplication/vnd.openxmlformats-officedocument.presentationml.presentationapplication/vnd.oasis.opendocument.presentation