Conversation

magbyr (Contributor) commented Jul 28, 2022

Adds formats:

  • doc - application/msword
  • docx - application/vnd.openxmlformats-officedocument.wordprocessingml.document
  • odt - application/vnd.oasis.opendocument.text
  • xls - application/vnd.ms-excel
  • xlsx - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
  • ods - application/vnd.oasis.opendocument.spreadsheet
  • ppt - application/vnd.ms-powerpoint
  • pptx - application/vnd.openxmlformats-officedocument.presentationml.presentation
  • odp - application/vnd.oasis.opendocument.presentation
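
For quick reference, the extension and MIME type pairs listed above can be captured in a small lookup table. This is only a sketch: the names `OFFICE_MIME_TYPES` and `mime_for` are hypothetical and are not part of filetype.py.

```python
# Hypothetical lookup table mirroring the formats listed in this PR.
OFFICE_MIME_TYPES = {
    "doc": "application/msword",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "odt": "application/vnd.oasis.opendocument.text",
    "xls": "application/vnd.ms-excel",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ods": "application/vnd.oasis.opendocument.spreadsheet",
    "ppt": "application/vnd.ms-powerpoint",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "odp": "application/vnd.oasis.opendocument.presentation",
}


def mime_for(extension):
    """Return the MIME type for an extension such as 'docx' or '.DOCX', or None."""
    return OFFICE_MIME_TYPES.get(extension.lower().lstrip("."))
```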

magbyr (Contributor, Author) commented Jul 28, 2022

I hope these changes are useful for the project. We rely on filetype.py at our startup, so we're committed to keeping these formats and signatures updated as needed. I copied some signatures and code from the Go version, but made some changes because files produced by LibreOffice differ.

babenek (Contributor) left a comment

Optimization suggestions.
Unit tests: a single complex return (...) expression cannot produce useful code coverage for analysis. Maybe an if-else version would not add overhead...

Comment on lines 113 to 123
    len(buf) > 7
    and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
    and (
        (len(buf) > 515 and buf[512:516] == b"\xEC\xA5\xC1\x00")  # MS Office
        or (
            len(buf) > 2142
            and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q"
            in buf[2075:2142]
        )  # LibreOffice
    )
)

Contributor

Suggested change
    len(buf) > 515
    and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
    and (
        buf[512:516] == b"\xEC\xA5\xC1\x00"  # MS Office
        or (
            len(buf) > 2142
            and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q"
            in buf[2075:2142]
        )  # LibreOffice
    )
)

A simple optimization: check the minimum length once, up front.
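
To illustrate the coverage point, here is a minimal sketch of the same check written with early returns, so each branch is a separate statement that a coverage tool can report. The function name `looks_like_doc` and the constant names are hypothetical, not the PR's code. The sketch also assumes the intended MS Office check compares the four bytes at offset 512 with the four-byte sub-header.

```python
OLE2_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
WORD_SUBHEADER = b"\xEC\xA5\xC1\x00"
LIBREOFFICE_DOC_MARKER = (
    b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00"
    b"Word.Document.8\x00\xF49\xB2q"
)


def looks_like_doc(buf):
    """Hypothetical if/else variant of the .doc signature check."""
    # One length check up front: 516 bytes cover the magic and the sub-header.
    if len(buf) < 516 or buf[0:8] != OLE2_MAGIC:
        return False
    # Files written by MS Office carry the sub-header at offset 512.
    if buf[512:516] == WORD_SUBHEADER:
        return True
    # Files written by LibreOffice carry a marker string further into the file.
    if len(buf) > 2142 and LIBREOFFICE_DOC_MARKER in buf[2075:2142]:
        return True
    return False
```

Each `return` now sits on its own line, so a line-coverage report shows which signature paths the unit tests actually exercise.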

Contributor Author

I have changed the match methods a bit. This is probably better for coverage calculation, but slightly slower. What do you think?

Contributor

It is probably slower because of the additional assignments.
I have drafted my suggestion. Anyway, if it is too slow, roll back the changes.
You can also test with #132, which sometimes finds uncaught exceptions. But you have to change -max_len=262 to 8K.
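
The harness from #132 is not shown in this thread. As a rough stand-in, the sketch below feeds random buffers of up to 8192 bytes to a matcher and records any input that raises. It uses the standard library's `random` module rather than coverage-guided fuzzing, and `match_stub` and `smoke_fuzz` are hypothetical names used only for illustration.

```python
import random

MAX_LEN = 8192  # the "8K" limit mentioned above, instead of 262


def match_stub(buf):
    """Hypothetical stand-in for a matcher that reads beyond byte 262."""
    return (
        len(buf) > 520
        and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
        and buf[518] in (0x00, 0x02)
    )


def smoke_fuzz(matcher, rounds=100, seed=0):
    """Call matcher on random buffers and return the inputs that raised."""
    rng = random.Random(seed)
    # Start with edge-case sizes, then add random sizes up to MAX_LEN.
    sizes = [0, 1, 8, 262, 520, MAX_LEN]
    sizes += [rng.randint(0, MAX_LEN) for _ in range(rounds)]
    failures = []
    for size in sizes:
        buf = bytes(rng.getrandbits(8) for _ in range(size))
        try:
            matcher(buf)
        except Exception:
            failures.append(buf)
    return failures
```

A matcher that indexes `buf[518]` without a length guard would fail on the short buffers, which is the kind of uncaught exception this check is meant to surface.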

Contributor Author

Cool. I made the changes and tested with atheris, which found no errors in the new code. Atheris did find an error in the isobmff class related to Unicode decoding. I think a simple errors="ignore" on the decode method could be a solution there...

Anyway, hopefully the latest changes work better with coverage calculation?
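
The isobmff code is not shown in this thread, so the following is only a sketch of the proposed idea. `decode_brand` is a hypothetical helper. It shows how `errors="ignore"` makes `bytes.decode` drop invalid bytes instead of raising `UnicodeDecodeError`.

```python
def decode_brand(raw):
    """Decode a short byte field, dropping bytes that are not valid UTF-8."""
    return raw.decode("utf-8", errors="ignore")


# Strict decoding (the default) raises on the same input.
try:
    b"is\xffm".decode("utf-8")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True
```

The trade-off is that invalid bytes disappear silently, so the decoded string can be shorter than the input.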

Contributor

Cool, then maybe the PR will be merged. Yes, the fix works (#131).

Comment on lines 166 to 185
header_match = (
    len(buf) > 8 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
)
subheader_match = (
    header_match
    and len(buf) > 520
    and (
        (
            buf[512:516] == b"\xFD\xFF\xFF\xFF"
            and (buf[518] == 0x00 or buf[518] == 0x02)
        )
        or (buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00")
        or (
            len(buf) > 2095
            and b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc"
            in buf[1568:2095]
        )
    )
)
return header_match and subheader_match

Contributor

Suggested change
if len(buf) > 520 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1":
    if buf[512:516] == b"\xFD\xFF\xFF\xFF" and (buf[518] == 0x00 or buf[518] == 0x02) \
            or buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00":
        return True
    elif len(buf) > 2095:
        return b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc" in buf[1568:2095]
return False

Please check my suggestion. Some spaces may be missing or extra.
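
As a way to check the suggestion in isolation, the sketch below wraps the same conditions in a standalone function with one branch per signature and no backslash continuation. The name `looks_like_xls` and the constant names are hypothetical, and the comments describing each signature are assumptions.

```python
OLE2_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
XLS_SUBHEADER = b"\xFD\xFF\xFF\xFF"
XLS_RECORD_HEADER = b"\x09\x08\x10\x00\x00\x06\x05\x00"
CALC_MARKER = b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc"


def looks_like_xls(buf):
    """Hypothetical standalone variant of the suggested .xls check."""
    # The length guard also makes the buf[518] index below safe.
    if len(buf) <= 520 or buf[0:8] != OLE2_MAGIC:
        return False
    # First sub-header variant, qualified by the byte at offset 518.
    if buf[512:516] == XLS_SUBHEADER and buf[518] in (0x00, 0x02):
        return True
    # Second sub-header variant: a fixed eight-byte sequence at offset 512.
    if buf[512:520] == XLS_RECORD_HEADER:
        return True
    # Marker searched for further into the file (assumed to be LibreOffice Calc output).
    if len(buf) > 2095:
        return CALC_MARKER in buf[1568:2095]
    return False
```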
