Not recognizing Office 2007+ files (docx, xlsx, ...) #1

Closed
zAlbee opened this issue Nov 29, 2013 · 6 comments

zAlbee commented Nov 29, 2013

Hi, I found this project through your comment on the article http://www.rgagnon.com/javadetails/java-0487.html. I have used the UNIX "file" command with good accuracy, so it appealed to me that the SimpleMagic library is based on the same logic. Unfortunately, this Java library doesn't have the same success rate. In particular, it fails on most MS Office files from Office 2007+.

Here is what I get from SimpleMagic:

Word2007.docx:    application/zip [Zip archive data, at least v2.0 to extract]
Word97-2003.doc:  application/msword [Microsoft Word Document]
Excel2007.xlsx:   application/zip [Zip archive data, at least v2.0 to extract]
Excel97-2003.xls: null [OLE 2 Compound Document]

Here is what I expected, using "file":

$ file --mime-type Word* Excel*
Word2007.docx:    application/vnd.openxmlformats-officedocument.wordprocessingml.document
Word97-2003.doc:  application/msword
Excel2007.xlsx:   application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
Excel97-2003.xls: application/vnd.ms-excel

My system:

$ file --v
file-5.13
magic file from /usr/share/misc/magic

$ uname -a
CYGWIN_NT-6.1 XXXX 1.7.25(0.270/5/3) 2013-08-31 20:37 x86_64 Cygwin

I thought it might be due to an older magic file, but unfortunately, using my system's magic file doesn't help much (it actually makes things worse). What version of file/magic was used here? Perhaps the magic file format has changed since then?

Owner

j256 commented Jan 2, 2014

OK, I am seeing the same behavior. I did not have a docx file in my tests; one is added now. Looking into it. Thanks, and sorry for the delay.

ghost assigned j256 Jan 2, 2014
Owner

j256 commented Jan 3, 2014

Actually my local file commands still fail on this. Can you post your magic file somewhere? Maybe pastebin.com?

Author

zAlbee commented Jan 3, 2014

Here you go. This is the magic file that came with Cygwin for file 5.13. https://gist.github.com/zAlbee/8241169

I'm guessing this is the relevant part:

#------------------------------------------------------------------------------
# $File: msooxml,v 1.2 2013/01/25 23:04:37 christos Exp $
# msooxml:  file(1) magic for Microsoft Office XML
# From: Ralf Brown <ralf.brown@gmail.com>

# .docx, .pptx, and .xlsx are XML plus other files inside a ZIP
#   archive.  The first member file is normally "[Content_Types].xml".
# Since MSOOXML doesn't have anything like the uncompressed "mimetype"
#   file of ePub or OpenDocument, we'll have to scan for a filename
#   which can distinguish between the three types

# start by checking for ZIP local file header signature
0               string          PK\003\004
# make sure the first file is correct
>0x1E           string          [Content_Types].xml
# skip to the second local file header
#   since some documents include a 520-byte extra field following the file
#   header,  we need to scan for the next header
>>(18.l+49)     search/2000     PK\003\004
# now skip to the *third* local file header; again, we need to scan due to a
#   520-byte extra field following the file header
>>>&26          search/1000     PK\003\004
# and check the subdirectory name to determine which type of OOXML
#   file we have
#   Correct the mimetype with the registered ones:
#     http://technet.microsoft.com/en-us/library/cc179224.aspx
>>>>&26         string          word/           Microsoft Word 2007+
!:mime application/vnd.openxmlformats-officedocument.wordprocessingml.document
>>>>&26         string          ppt/            Microsoft PowerPoint 2007+
!:mime application/vnd.openxmlformats-officedocument.presentationml.presentation
>>>>&26         string          xl/             Microsoft Excel 2007+
!:mime application/vnd.openxmlformats-officedocument.spreadsheetml.sheet
>>>>&26         default         x               Microsoft OOXML
!:strength +10
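
For readers following the rule above, here is a rough Java sketch of how its steps could be applied to a byte buffer. It is an illustration of the magic rule as read here, not SimpleMagic's implementation: the class `OoxmlSniffer`, the `Kind` enum, and all helper names are hypothetical, the treatment of `search/N` ranges is only approximate, and it assumes `&26` is measured from the end of the matched `PK\003\004` signature.

```java
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of SimpleMagic: applies the msooxml rule to a byte buffer.
class OoxmlSniffer {

    enum Kind {
        WORD("Microsoft Word 2007+",
                "application/vnd.openxmlformats-officedocument.wordprocessingml.document"),
        POWERPOINT("Microsoft PowerPoint 2007+",
                "application/vnd.openxmlformats-officedocument.presentationml.presentation"),
        EXCEL("Microsoft Excel 2007+",
                "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet"),
        GENERIC("Microsoft OOXML", null); // the rule's default line sets no MIME type

        final String description;
        final String mimeType;

        Kind(String description, String mimeType) {
            this.description = description;
            this.mimeType = mimeType;
        }
    }

    private static final byte[] LOCAL_HEADER = {'P', 'K', 3, 4};

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    /** True if pattern occurs in data at exactly this offset. */
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset > data.length - pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Approximation of search/N: tries up to 'range' start positions; returns match start or -1. */
    static int search(byte[] data, int from, int range, byte[] pattern) {
        for (int i = 0; i < range; i++) {
            int pos = from + i;
            if (pos > data.length - pattern.length) {
                break;
            }
            if (matchesAt(data, pos, pattern)) {
                return pos;
            }
        }
        return -1;
    }

    /** Reads the unsigned little-endian 32-bit value that "18.l" refers to. */
    static long readUInt32LE(byte[] data, int offset) {
        return (data[offset] & 0xFFL)
                | (data[offset + 1] & 0xFFL) << 8
                | (data[offset + 2] & 0xFFL) << 16
                | (data[offset + 3] & 0xFFL) << 24;
    }

    /** Returns the detected kind, or null when the rule produces no label. */
    static Kind detect(byte[] data) {
        // ZIP local file header, then the first member name at 0x1E.
        if (!matchesAt(data, 0, LOCAL_HEADER)
                || !matchesAt(data, 0x1E, ascii("[Content_Types].xml"))) {
            return null;
        }
        // (18.l+49): compressed size + 30-byte header + 19-byte file name.
        long afterFirst = readUInt32LE(data, 18) + 49;
        if (afterFirst >= data.length) {
            return null;
        }
        int second = search(data, (int) afterFirst, 2000, LOCAL_HEADER);
        if (second < 0) {
            return null;
        }
        // &26 after the 4-byte signature lands on the member's file name.
        int third = search(data, second + 4 + 26, 1000, LOCAL_HEADER);
        if (third < 0) {
            return null;
        }
        int name = third + 4 + 26;
        if (matchesAt(data, name, ascii("word/"))) {
            return Kind.WORD;
        }
        if (matchesAt(data, name, ascii("ppt/"))) {
            return Kind.POWERPOINT;
        }
        if (matchesAt(data, name, ascii("xl/"))) {
            return Kind.EXCEL;
        }
        return Kind.GENERIC;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java OoxmlSniffer.java <file>");
            return;
        }
        Kind kind = detect(Files.readAllBytes(Paths.get(args[0])));
        System.out.println(kind == null ? "no match" : kind.description + " [" + kind.mimeType + "]");
    }
}
```

The sketch returns null whenever one of the two header searches fails, because the excerpt above only shows a default label at the innermost level.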

Owner

j256 commented Jan 3, 2014

Interesting. I don't support the search/... types, but I guess I can add them. What I can do immediately is add the [Content_Types].xml check and at least report Microsoft OOXML.
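
The interim check described above needs no searching at all: it only compares bytes at two fixed offsets. A minimal sketch, assuming the first ZIP member is stored with its name at offset 0x1E as in the quoted rule; the class and method names are hypothetical and are not SimpleMagic code.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration of the fixed-offset check; not SimpleMagic's implementation.
class OoxmlFirstMemberCheck {

    private static final byte[] ZIP_LOCAL_HEADER = {'P', 'K', 3, 4};
    private static final byte[] FIRST_MEMBER =
            "[Content_Types].xml".getBytes(StandardCharsets.US_ASCII);

    /** True if pattern occurs in data at exactly this offset. */
    static boolean matchesAt(byte[] data, int offset, byte[] pattern) {
        if (offset < 0 || offset > data.length - pattern.length) {
            return false;
        }
        for (int i = 0; i < pattern.length; i++) {
            if (data[offset + i] != pattern[i]) {
                return false;
            }
        }
        return true;
    }

    /** Returns the generic label when both fixed-offset tests pass, otherwise null. */
    static String describe(byte[] data) {
        boolean isOoxml = matchesAt(data, 0, ZIP_LOCAL_HEADER)
                && matchesAt(data, 0x1E, FIRST_MEMBER);
        return isOoxml ? "Microsoft OOXML" : null;
    }
}
```

This cannot tell Word, Excel, and PowerPoint files apart; that distinction depends on the later member names, which is where the search steps come in.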

Owner

j256 commented Jan 14, 2014

So version 1.5 has much better processing of the 2007+ versions of these files. Thanks again.

j256 closed this as completed Jan 14, 2014
Author

zAlbee commented Jan 15, 2014

Thanks! I tested it out on .docx, .xlsx, and .pptx and they are working now. I forgot to mention that .xls and .ppt aren't recognized either (though .doc is). I can file a separate issue for those if you want.
