When using -zip experiencing errors on a file from OPF Format Corpus #72

Closed
ross-spencer opened this issue May 15, 2016 · 3 comments

@ross-spencer

While attempting to scan the opf-format-corpus, I'm seeing an error from a specific file:

pdfCabinetOfHorrors/embedded_video_quicktime.doc

  goatslayer@goatslayer-acer-linux:~/git/opf-format-corpus/format-corpus/pdfCabinetOfHorrors$ fido -zip embedded_video_quicktime.doc
  FIDO v1.3.3 (formats-v84.xml, container-signature-20160121.xml, format_extensions.xml)
  bad repeat interval
  bad repeat interval
  OK,250,fmt/111,"OLE2 Compound Document Format","OLE2 Compound Document Format",26624,"embedded_video_quicktime.doc","None","signature"
  Traceback (most recent call last):
    File "/usr/local/bin/fido", line 9, in <module>
      load_entry_point('opf-fido==1.3.3', 'console_scripts', 'fido')()
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 855, in main
      fido.identify_file(file)
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 400, in identify_file
      self.identify_contents(filename, type=self.container_type(matches))
    File "/usr/local/lib/python2.7/dist-packages/opf_fido-1.3.3-py2.7.egg/fido/fido.py", line 418, in identify_contents
      raise RuntimeError("Unknown container type: " + repr(type))
  RuntimeError: Unknown container type: 'ole'

Distro stats:

Python 2.7.6
No LSB modules are available.
Distributor ID: Ubuntu 
Description:    Ubuntu 14.04.4 LTS
Release:    14.04
Codename:   trusty

My mirror of the OPF Format Corpus can be found here: https://github.com/ross-spencer/opf-format-corpus

@mistydemeo
Contributor

I see the cause of this one; I introduced this bug by accident.

The code here branches if the -zip option is selected:

if self.zip:
    self.identify_contents(filename, type=self.container_type(matches))

The type parameter is set to the container type, or None if FIDO can't detect one. Prior to 1b5698a, the only two values were zip and tar, but that commit added ole as well. FIDO uses this same method for two purposes: identifying single files against container signatures, and deciding whether to recurse into a container file with -zip. It's the latter case that's breaking: an ole result makes FIDO try to unpack the file and identify its contents, and there is no code path for doing that with OLE.
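To make the failure mode concrete, here is a minimal sketch of the interaction described above. These are simplified, hypothetical stand-ins rather than FIDO's actual code: the function bodies, the `detected` argument, and the handling of `None` are all assumptions, and only the error message is taken from the traceback.

```python
# Hypothetical, simplified stand-ins for FIDO's internals (not the real code).

def container_type(detected):
    """Return the container type for a detected format, or None.

    Assumption: "ole" is the value that 1b5698a added alongside zip and tar.
    """
    known = ("zip", "tar", "ole")
    return detected if detected in known else None


def identify_contents(filename, type):
    """Recurse into a container; only zip and tar have a code path here."""
    if type is None:
        return []  # not a container, nothing to recurse into
    if type == "zip":
        return ["walk zip members of " + filename]
    if type == "tar":
        return ["walk tar members of " + filename]
    # Same message as in the traceback from the report.
    raise RuntimeError("Unknown container type: " + repr(type))


if __name__ == "__main__":
    # With -zip, every detected container type is passed straight through,
    # so an OLE file ends up in the branch that raises.
    try:
        identify_contents("embedded_video_quicktime.doc",
                          type=container_type("ole"))
    except RuntimeError as err:
        print(err)
```

The point of the sketch is that the same return value serves both callers, so widening it for container-signature matching also widens what the `-zip` path receives.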

mistydemeo added a commit that referenced this issue May 16, 2016
1b5698a updated self.container_type() to recognize OLE as an additional
format, but this broke the `-zip` switch. The method was being used to
identify two different categories of formats:

1. Container formats which need to be matched against the PRONOM
   container signatures in order to get more precise matches; and
2. Container formats which can be recursed into via the `-zip` switch in
   order to identify the formats of their contents.

FIDO supports OLE for the former but not the latter, and since OLE is
usually not interesting for its contents, it doesn't make sense to
support recursing into it.

This commit adds a new method which differentiates whether FIDO is
interested in recursing into a format, not merely whether it *is* a
container format, and updates the `-zip` path to check using it.

Fixes #72.
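
As a rough illustration of the approach in the commit message, the sketch below separates "is this a container format?" from "should `-zip` recurse into it?". It is not the code from the commit: `recursable_type`, `RECURSABLE`, and the simplified `identify_file` signature are hypothetical names chosen for this example.

```python
# Hypothetical sketch of the fix; names and signatures are assumptions.

# Assumption: the only container types the -zip switch can walk into.
RECURSABLE = ("zip", "tar")


def container_type(detected):
    """Container type used for container-signature matching, or None."""
    return detected if detected in ("zip", "tar", "ole") else None


def recursable_type(detected):
    """Container type only if -zip should recurse into it, else None."""
    ctype = container_type(detected)
    return ctype if ctype in RECURSABLE else None


def identify_file(filename, detected, zip_enabled=True):
    """Report the file itself, then recurse only into recursable containers."""
    results = [(filename, detected)]
    if zip_enabled:
        rtype = recursable_type(detected)
        if rtype is not None:
            results.append((filename, "recurse as " + rtype))
    return results


if __name__ == "__main__":
    # OLE is still recognised as a container, but -zip no longer recurses.
    print(identify_file("embedded_video_quicktime.doc", "ole"))
    print(identify_file("archive.zip", "zip"))
```

With this split, OLE files are still eligible for container-signature matching, while the `-zip` path only ever sees types it knows how to unpack.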
@mistydemeo
Contributor

Fixed by #73. I can confirm after that PR that -zip still works as expected on ZIP files, but no longer breaks on OLE files.

@sevein
Contributor

sevein commented May 16, 2016

Thanks @mistydemeo! @jhsimpson is going to merge #73 soon.
