Matches .doc as application/vnd.ms-excel #40

Open
rahulganguly opened this issue Jul 7, 2018 · 9 comments · May be fixed by #86
Comments

@rahulganguly

I am trying to detect the MIME type of a .doc file, but the result I get is
File type: xls. MIME: application/vnd.ms-excel
or
File type: ppt. MIME: application/vnd.ms-powerpoint

@rahulganguly
Author

Under the matchers folder, in matchers/document.go, func Doc(), func Xls() and func Ppt() all check the same magic numbers:

return len(buf) > 7 &&
	buf[0] == 0xD0 && buf[1] == 0xCF &&
	buf[2] == 0x11 && buf[3] == 0xE0 &&
	buf[4] == 0xA1 && buf[5] == 0xB1 &&
	buf[6] == 0x1A && buf[7] == 0xE1

Is this the reason why the MimeType is always coming up different?
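
To make the question concrete, here is a minimal sketch of the situation described above. It assumes the three matchers contain nothing beyond the quoted check; the helper name `isCFB` and the variable `cfbMagic` are hypothetical and are not taken from the library.

```go
package main

import (
	"bytes"
	"fmt"
)

// cfbMagic is the eight-byte signature quoted above.
var cfbMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}

// isCFB is a hypothetical helper equivalent to the quoted byte-by-byte check.
func isCFB(buf []byte) bool {
	return len(buf) > 7 && bytes.Equal(buf[:8], cfbMagic)
}

// Stand-ins for the three matchers, assuming each one only checks the
// shared signature.
func Doc(buf []byte) bool { return isCFB(buf) }
func Xls(buf []byte) bool { return isCFB(buf) }
func Ppt(buf []byte) bool { return isCFB(buf) }

func main() {
	header := []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}
	// All three matchers accept the same header, so none of them can
	// tell a .doc from an .xls or a .ppt.
	fmt.Println(Doc(header), Xls(header), Ppt(header))
}
```

If the matchers are tried in no fixed order (an assumption about the library's internals, for example ranging over a Go map), whichever of the three runs first would win, which would be consistent with getting xls on one run and ppt on another.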

@h2non
Owner

h2non commented Jul 7, 2018

Newer Microsoft Office files (docx/xlsx/pptx) are zip containers, and the old doc/xls/ppt formats share a single container header. I don't know a reliable way to tell the old Office formats apart. If you find one, feel free to submit a PR.

kumakichi added a commit to kumakichi/filetype that referenced this issue Oct 25, 2018
@mateusmaaia

@kumakichi Is it working?

@kumakichi
Contributor

@mateusmaaia It should work. If you find any MS Office files that are not detected correctly, please let me know.

@0xMadao

0xMadao commented Sep 26, 2019

> @mateusmaaia It should work. If you find any MS Office files that are not detected correctly, please let me know.

Hi there, the doc file is still detected as a ppt file type.

@kumakichi
Contributor

@mateusmaaia @jeremywu0127

Oh, I was wrong.

#48 only added support for docx/xlsx/pptx (and even that is not very good) and left doc/xls/ppt untouched.

So extra work is needed.

@kumakichi
Contributor

kumakichi commented Sep 26, 2019

@jeremywu0127

The doc/xls/ppt check will be a little complex: we can detect that a file is a Composite Document File V2 document, but we don't know which one it is (doc/xls/ppt).

Maybe we can check the name of the creating application. If we get something like "Microsoft Office Word", that works; but if the file was created by a non-MS-Office application, say WPS, we learn nothing.

So this problem is not easy to resolve.

@mateusmaaia

Yes, I tried a few things but ended up validating the Content-Type header when the file is doc/ppt/xls. Not the best solution, but it works for what I need.

Anyway, thanks! @kumakichi

@ferdnyc ferdnyc linked a pull request Jul 4, 2020 that will close this issue
@messense

> @jeremywu0127
>
> The doc/xls/ppt check will be a little complex: we can detect that a file is a Composite Document File V2 document, but we don't know which one it is (doc/xls/ppt).
>
> Maybe we can check the name of the creating application. If we get something like "Microsoft Office Word", that works; but if the file was created by a non-MS-Office application, say WPS, we learn nothing.
>
> So this problem is not easy to resolve.

You can detect them by checking the GUID of the root entry, as described in https://stackoverflow.com/questions/29211263/how-to-identify-doc-docx-pdf-xls-and-xlsx-based-on-file-header/48318648#48318648 . This is implemented in the Rust version: bojand/infer#38
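
A rough Go sketch of the root-entry GUID check might look like the following. It is not the code from either linked implementation. The function name `legacyOfficeType`, the helper `syntheticCFB`, and the CLSID byte values are assumptions written from memory of the compound-file layout (first directory sector number at header offset 48, root entry CLSID at offset 80 of the first directory entry), so verify them against the linked answer before relying on them.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// Compound-file signature, the same eight bytes quoted earlier in this thread.
var cfbMagic = []byte{0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1}

// Root-entry CLSIDs in on-disk byte order (the first three GUID fields are
// little-endian). Assumed values; check them before use.
var (
	clsidDoc  = [16]byte{0x06, 0x09, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xC0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46}
	clsidXls5 = [16]byte{0x10, 0x08, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xC0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46}
	clsidXls8 = [16]byte{0x20, 0x08, 0x02, 0x00, 0x00, 0x00, 0x00, 0x00, 0xC0, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x46}
	clsidPpt  = [16]byte{0x10, 0x8D, 0x81, 0x64, 0x9B, 0x4F, 0xCF, 0x11, 0x86, 0xEA, 0x00, 0xAA, 0x00, 0xB9, 0x29, 0xE8}
)

// legacyOfficeType returns "doc", "xls" or "ppt" when the root entry CLSID is
// one of the values above, and "" otherwise (including when buf is too short
// to contain the first directory sector).
func legacyOfficeType(buf []byte) string {
	if len(buf) < 512 || !bytes.HasPrefix(buf, cfbMagic) {
		return ""
	}
	// Sector shift: 9 means 512-byte sectors, 12 means 4096-byte sectors.
	shift := binary.LittleEndian.Uint16(buf[30:32])
	if shift != 9 && shift != 12 {
		return ""
	}
	// Sector 0 starts right after the header, which occupies one sector.
	firstDir := uint64(binary.LittleEndian.Uint32(buf[48:52]))
	off := (firstDir + 1) << shift
	if off+96 > uint64(len(buf)) {
		return ""
	}
	var clsid [16]byte
	copy(clsid[:], buf[off+80:off+96])
	switch clsid {
	case clsidDoc:
		return "doc"
	case clsidXls5, clsidXls8:
		return "xls"
	case clsidPpt:
		return "ppt"
	}
	return ""
}

// syntheticCFB builds a tiny fake buffer (header plus one directory sector)
// purely for demonstration; it is not a valid Office file.
func syntheticCFB(clsid [16]byte) []byte {
	buf := make([]byte, 1024)
	copy(buf, cfbMagic)
	binary.LittleEndian.PutUint16(buf[30:], 9) // 512-byte sectors
	binary.LittleEndian.PutUint32(buf[48:], 0) // directory starts at sector 0
	copy(buf[512+80:], clsid[:])
	return buf
}

func main() {
	fmt.Println(legacyOfficeType(syntheticCFB(clsidDoc)))
	fmt.Println(legacyOfficeType(syntheticCFB(clsidPpt)))
}
```

Two caveats apply. The directory sector can sit far into the file, so a matcher that only sees the first few hundred bytes cannot use this check. And files written by non-MS-Office applications (the WPS case mentioned above) may carry a different or empty CLSID, in which case this sketch returns an empty string.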
