Add office document formats #133

magbyr · 2022-07-28T08:34:00Z

Adds formats:

doc - application/msword
docx - application/vnd.openxmlformats-officedocument.wordprocessingml.document
odt - application/vnd.oasis.opendocument.text
xls - application/vnd.ms-excel
xlsx - application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
ods - application/vnd.oasis.opendocument.spreadsheet
ppt - application/vnd.ms-powerpoint
pptx - application/vnd.openxmlformats-officedocument.presentationml.presentation
odp - application/vnd.oasis.opendocument.presentation
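For quick reference, the nine extension/MIME pairs listed above can be written as a lookup table. This is only an illustrative sketch: `OFFICE_MIME_TYPES` and `mime_for_extension` are hypothetical names and are not part of filetype.py.

```python
from typing import Optional

# Extension -> MIME type, copied from the list in this PR description.
OFFICE_MIME_TYPES = {
    "doc": "application/msword",
    "docx": "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "odt": "application/vnd.oasis.opendocument.text",
    "xls": "application/vnd.ms-excel",
    "xlsx": "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "ods": "application/vnd.oasis.opendocument.spreadsheet",
    "ppt": "application/vnd.ms-powerpoint",
    "pptx": "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "odp": "application/vnd.oasis.opendocument.presentation",
}


def mime_for_extension(extension: str) -> Optional[str]:
    """Hypothetical helper: return the MIME type for an extension, or None."""
    # Accept ".DOCX" as well as "docx".
    return OFFICE_MIME_TYPES.get(extension.lower().lstrip("."))
```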

magbyr · 2022-07-28T08:42:27Z

I hope these changes are useful for the project. We rely on filetype.py at our startup, so we're committed to keeping these formats/signatures updated if needed. I copied some signatures and code from the Go version, but made some changes because of differences in the files produced by LibreOffice.

babenek

Optimization suggestions.
Unit tests: a single complex return (...) expression cannot produce useful code coverage for analysis. Maybe if-else would not add overhead...

babenek · 2022-08-02T10:49:53Z

filetype/types/document.py

+            len(buf) > 7
+            and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
+            and (
+                (len(buf) > 515 and buf[512:515] == b"\xEC\xA5\xC1\x00")  # MS Office
+                or (
+                    len(buf) > 2142
+                    and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q"
+                    in buf[2075:2142]
+                )  # LibreOffice
+            )
+        )


Suggested change

-            len(buf) > 7
-            and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
-            and (
-                (len(buf) > 515 and buf[512:515] == b"\xEC\xA5\xC1\x00")  # MS Office
-                or (
-                    len(buf) > 2142
-                    and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q"
-                    in buf[2075:2142]
-                )  # LibreOffice
-            )
-        )
+            len(buf) > 515
+            and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
+            and (
+                buf[512:515] == b"\xEC\xA5\xC1\x00"  # MS Office
+                or (
+                    len(buf) > 2142
+                    and b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q"
+                    in buf[2075:2142]
+                )  # LibreOffice
+            )
+        )

simple optimization.
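The suggested reordering (one length check up front, then the signature tests) can be sketched as a standalone function. `looks_like_doc` and the constant names are hypothetical and not taken from the PR. One assumption is labeled in the code: the quoted lines compare the 3-byte slice `buf[512:515]` with a 4-byte literal, which can never be equal in Python, so this sketch widens the slice to `buf[512:516]`.

```python
OLE_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
MS_WORD_SUBHEADER = b"\xEC\xA5\xC1\x00"
LIBREOFFICE_WORD_MARKER = (
    b"\x00\x0A\x00\x00\x00MSWordDoc\x00\x10\x00\x00\x00Word.Document.8\x00\xF49\xB2q"
)


def looks_like_doc(buf: bytes) -> bool:
    """Hypothetical standalone version of the .doc signature check."""
    # A single length check covers both the header and the sub-header reads.
    if len(buf) <= 515 or buf[0:8] != OLE_MAGIC:
        return False
    # Assumption: slice widened to 4 bytes so it can match the 4-byte literal.
    if buf[512:516] == MS_WORD_SUBHEADER:  # MS Office
        return True
    if len(buf) > 2142:  # LibreOffice
        return LIBREOFFICE_WORD_MARKER in buf[2075:2142]
    return False
```

Splitting the checks into separate `if` statements also gives a coverage tool one line per outcome, which is the concern raised in the review.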

I have changed the match methods a bit. This is probably better for coverage calculation, but slightly slower. What do you think?

That is probably because you used additional assignments.
I drafted my suggestion. Anyway, if it is that slow, roll back the changes.
Additionally, you can test with #132 - it sometimes finds uncaught exceptions. But you have to change -max_len=262 to 8K.

Cool. I made the changes and tested with atheris. It found no errors in the new code. Atheris did find an error in the isobmff class regarding unicode decoding. I think a simple "errors=ignore" on the decode method could be a solution there...

Anyway, hopefully the latest changes work better with coverage calculation?

Cool, then maybe the PR will be merged. Yes, the fix works: #131.
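The "errors=ignore" idea mentioned in this thread refers to the `errors` argument of Python's `bytes.decode`. A minimal sketch of the difference follows; the sample bytes and the `decode_brand` helper are made up for illustration and are not the isobmff code from filetype.py.

```python
def decode_brand(data: bytes) -> str:
    """Hypothetical helper: decode bytes, silently dropping invalid UTF-8."""
    return data.decode("utf-8", errors="ignore")


def strict_decode_fails(data: bytes) -> bool:
    """Return True if a strict UTF-8 decode raises UnicodeDecodeError."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return True
    return False


# 0xFF and 0xFE are never valid in UTF-8, so a fuzzer can trigger this easily.
raw = b"isom\xff\xfe"
print(strict_decode_fails(raw))  # strict decoding raises for this input
print(decode_brand(raw))         # invalid bytes are dropped
```

Dropping bytes hides malformed input rather than rejecting it, so whether this is the right fix depends on how the decoded value is used afterwards.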

filetype/types/document.py

tests/test_types.py

Co-authored-by: Roman Babenko <babenek@users.noreply.github.com>

babenek · 2022-08-02T17:45:10Z

filetype/types/document.py

+        header_match = (
+            len(buf) > 8 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
+        )
+        subheader_match = (
+            header_match
+            and len(buf) > 520
+            and (
+                (
+                    buf[512:516] == b"\xFD\xFF\xFF\xFF"
+                    and (buf[518] == 0x00 or buf[518] == 0x02)
+                )
+                or (buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00")
+                or (
+                    len(buf) > 2095
+                    and b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc"
+                    in buf[1568:2095]
+                )
+            )
+        )
+        return header_match and subheader_match


Suggested change

-        header_match = (
-            len(buf) > 8 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
-        )
-        subheader_match = (
-            header_match
-            and len(buf) > 520
-            and (
-                (
-                    buf[512:516] == b"\xFD\xFF\xFF\xFF"
-                    and (buf[518] == 0x00 or buf[518] == 0x02)
-                )
-                or (buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00")
-                or (
-                    len(buf) > 2095
-                    and b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc"
-                    in buf[1568:2095]
-                )
-            )
-        )
-        return header_match and subheader_match
+        if len(buf) > 520 and buf[0:8] == b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1":
+            if buf[512:516] == b"\xFD\xFF\xFF\xFF" and (buf[518] == 0x00 or buf[518] == 0x02) \
+                    or buf[512:520] == b"\x09\x08\x10\x00\x00\x06\x05\x00":
+                return True
+            elif len(buf) > 2095:
+                return b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc" in buf[1568:2095]
+        return False

Please check my suggestion. Some spaces may be missing or extra.
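To check that the if-statement form behaves like the original boolean expression, both can be written as standalone functions and compared on a few buffers. This is a sketch under assumptions: the function and constant names are hypothetical, and the indentation of the suggested code is reconstructed.

```python
OLE_MAGIC = b"\xD0\xCF\x11\xE0\xA1\xB1\x1A\xE1"
BIFF8_SUBHEADER = b"\x09\x08\x10\x00\x00\x06\x05\x00"
CALC_MARKER = b"\xE2\x00\x00\x00\x5C\x00\x70\x00\x04\x00\x00Calc"


def xls_match_expression(buf: bytes) -> bool:
    """The original form: one nested boolean expression."""
    header_match = len(buf) > 8 and buf[0:8] == OLE_MAGIC
    subheader_match = (
        header_match
        and len(buf) > 520
        and (
            (buf[512:516] == b"\xFD\xFF\xFF\xFF" and buf[518] in (0x00, 0x02))
            or buf[512:520] == BIFF8_SUBHEADER
            or (len(buf) > 2095 and CALC_MARKER in buf[1568:2095])
        )
    )
    return header_match and subheader_match


def xls_match_branches(buf: bytes) -> bool:
    """The suggested form: each outcome is its own statement."""
    if len(buf) > 520 and buf[0:8] == OLE_MAGIC:
        if (
            buf[512:516] == b"\xFD\xFF\xFF\xFF" and buf[518] in (0x00, 0x02)
        ) or buf[512:520] == BIFF8_SUBHEADER:
            return True
        elif len(buf) > 2095:
            return CALC_MARKER in buf[1568:2095]
    return False


def make_buffer(size: int, patches=()) -> bytes:
    """Build a zero-filled buffer with the OLE header and optional patches."""
    buf = bytearray(size)
    buf[0:8] = OLE_MAGIC
    for offset, data in patches:
        buf[offset:offset + len(data)] = data
    return bytes(buf)
```

With the branch form, a line-coverage report shows which of the three signatures a test file actually exercised, whereas the expression form counts as covered as soon as the `return` runs once.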

magbyr added 3 commits July 28, 2022 09:27

Added document filetypes for doc, docx, odt, xls, xlsx, ods, ppt, ppt…

b71bf17

…x and odp. Added tests and sample documents for document filetypes

Linter changes

8ae639d

README changes

a4b1cff

babenek reviewed Aug 2, 2022


magbyr and others added 5 commits August 2, 2022 13:33

Apply suggestions from code review

d19c60e

Co-authored-by: Roman Babenko <babenek@users.noreply.github.com>

Extra line at EOF

2679bf9

Extra line at EOF

5b88e7e

Extra line at EOF

c9f520b

Changed return method because of coverage calculation problems

4f279db

babenek reviewed Aug 2, 2022


Changed to if statements in matching method

8719241

h2non merged commit c8c1fbc into h2non:master Aug 5, 2022

LifeforDream mentioned this pull request Oct 13, 2022

xls and xlsx guessed as zip #142

Closed

codyswanner mentioned this pull request May 5, 2024

Add more file types: doc,docx,xls,xlsx and open office #16

Open

magbyr commented Jul 28, 2022

magbyr commented Jul 28, 2022

babenek left a comment

babenek Aug 2, 2022

magbyr Aug 2, 2022

babenek Aug 2, 2022

magbyr Aug 3, 2022

babenek Aug 3, 2022

babenek Aug 2, 2022

3 participants
