Adobe Illustrator 14 file identified as PDF 1.5, not AI #41

Closed
mistydemeo opened this issue Oct 1, 2013 · 10 comments
Labels
feature New functionality to be developed

Comments

@mistydemeo
Contributor

This Adobe Illustrator sample is being misidentified in fido 1.3.1 using the PRONOM v70 signatures: https://github.com/artefactual/archivematica-sampledata/raw/master/SampleTransfers/Images/BBhelmet.ai

The file is an Illustrator 14 (CS4) file (fmt/563), but is being identified as PDF 1.5 (fmt/19). This isn't actually wrong per se (since AI files are a superset of PDF), but isn't fully accurate. DROID 6.1.2, using the same v70 signature files, correctly identifies the file as fmt/563.

@ghost ghost assigned techmaurice Oct 3, 2013
@techmaurice
Contributor

This is caused by FIDO's default buffer size, which is 128 KB.

Your example file seems to have the PS subset header at an offset of ~478 KB, so FIDO never sees this header and skips to the EOF part of the signature.

If you increase the buffer size to, say, 512 KB, FIDO will correctly recognise the file.

Example:
fido.py -bufsize 512000

You might also want to raise the default buffer size by changing the default settings in the code.
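
To illustrate the effect described above, here is a minimal Python sketch. It is not FIDO's actual matching code; the helper names `head_contains` and `make_sample` are hypothetical, the sample is a synthetic in-memory stand-in for the real file, and the marker is the "%AI5_FileFormat" comment mentioned later in this thread.

```python
import io

# Marker string taken from a later comment in this thread.
AI_MARKER = b"%AI5_FileFormat"


def head_contains(stream, marker, bufsize):
    """Return True if marker occurs within the first bufsize bytes of stream."""
    stream.seek(0)
    return marker in stream.read(bufsize)


def make_sample(marker_offset):
    """Build a synthetic file: PDF header, zero padding, then the AI marker."""
    header = b"%PDF-1.5\n"
    padding = b"\0" * (marker_offset - len(header))
    return io.BytesIO(header + padding + AI_MARKER + b"\n%%EOF\n")


# Place the marker at roughly 478 KB, as in the sample discussed here.
sample = make_sample(478 * 1024)

print(head_contains(sample, AI_MARKER, 128 * 1024))  # False: marker lies beyond the buffer
print(head_contains(sample, AI_MARKER, 512000))      # True: buffer now reaches the marker
```

Any scanner that only inspects a fixed-size head buffer will behave the same way, whatever matching logic it applies inside that buffer.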

@adamfarquhar
Contributor

Interesting. This would be the first example that I’ve seen of a file that needs more than the default 128kb to identify. I wonder if there is a better signature for AI 14? I’ve never looked at the format, but it would be surprising if one actually needed to look at 500kb before knowing a file really is an AI 14 one.

Cheers,

Adam.


@techmaurice
Contributor

Indeed interesting.

Unfortunately Adobe has not published specifications for this format (or maybe I just did not find them...)

After further examination, it looks like the section between the PDF header and the AI subset header consists of

  • a JPG thumbnail
  • XMP/RDF metadata with audit trail information (saved date, etc)
  • XMP metadata with information about swatches, colormodes and fonts
  • inline font streams

Based on this, we can assume that the byte distance between the PDF header and the AI subset header varies widely and depends heavily on the presence, number and size of the items listed above.
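
One way to check how far into a given file the marker actually sits is to scan it in chunks and report the offset. The following is a rough sketch only; `find_marker_offset` is a hypothetical helper, not part of FIDO, and the marker string is the "%AI5_FileFormat" comment mentioned elsewhere in this thread.

```python
import io


def find_marker_offset(stream, marker, chunk_size=64 * 1024):
    """Return the byte offset of the first occurrence of marker, or -1.

    Reads the stream in chunks and carries over the last len(marker) - 1
    bytes so that a marker straddling a chunk boundary is still found.
    """
    overlap = len(marker) - 1
    tail = b""
    pos = 0  # stream offset of the start of the current chunk
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return -1
        window = tail + chunk
        idx = window.find(marker)
        if idx != -1:
            # The window starts len(tail) bytes before the current chunk.
            return pos - len(tail) + idx
        tail = window[-overlap:] if overlap else b""
        pos += len(chunk)


# Synthetic example: the marker deliberately straddles chunk boundaries.
data = b"a" * 10 + b"%AI5_FileFormat" + b"b" * 10
print(find_marker_offset(io.BytesIO(data), b"%AI5_FileFormat", chunk_size=12))  # 10
```

Running this over a set of real AI files (opened with `open(path, "rb")`) would show how much the offset varies in practice.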

@techmaurice techmaurice reopened this Oct 3, 2013
@techmaurice
Contributor

Reopened for discussion

@adamfarquhar
Contributor

<x:xmpmeta xmlns:x="adobe:ns:meta/" x:xmptk="Adobe XMP Core 5.3-c011 66.145433, 2012/01/17-15:11:19 ">
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:format>application/vnd.adobe.illustrator</dc:format>
     <dc:title>
        <rdf:Alt>
           <rdf:li xml:lang="x-default">Looking For Adventure</rdf:li>
        </rdf:Alt>
     </dc:title>
     <dc:creator>
        <rdf:Seq>
           <rdf:li>Yogesh Sharma</rdf:li>
        </rdf:Seq>
     </dc:creator>
  </rdf:Description>
  <rdf:Description rdf:about=""
        xmlns:xmp="http://ns.adobe.com/xap/1.0/"
        xmlns:xmpGImg="http://ns.adobe.com/xap/1.0/g/img/">
     <xmp:MetadataDate>2012-02-06T17:31:28+05:30</xmp:MetadataDate>
     <xmp:ModifyDate>2012-02-06T17:31:28+05:30</xmp:ModifyDate>
     <xmp:CreateDate>2012-01-12T16:09:39+05:30</xmp:CreateDate>
     <xmp:CreatorTool>Adobe Illustrator CS6 (Macintosh)</xmp:CreatorTool>


@techmaurice
Contributor

Heh, adventure indeed 👍

@techmaurice
Contributor

Updated the section about read buffers in the FIDO Usage Guide.

@anjackson
Member

Would picking the format out of the XMP payload be more reliable than looking for the "%AI5_FileFormat" comment?

@techmaurice
Contributor

Possibly, Andy.
I am currently experimenting with this format in CS6, and looking at the XMP payload does seem more reliable.

If the XMP payload proves to be more reliable, the improved signature should be submitted to PRONOM.
For the time being it will of course be added to the extension file...
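
As a rough illustration of the XMP idea, the sketch below pulls the `dc:format` value out of raw bytes with a regular expression. This is only a sketch: `xmp_dc_format` is a hypothetical helper, not FIDO or PRONOM code, and it assumes the XMP packet is stored uncompressed in the file, as in the payload pasted earlier in this thread.

```python
import re

# Matches the dc:format element as it appears in the XMP payload quoted above.
DC_FORMAT_RE = re.compile(rb"<dc:format>\s*([^<]+?)\s*</dc:format>")


def xmp_dc_format(data):
    """Return the first dc:format value found in data, or None if absent."""
    match = DC_FORMAT_RE.search(data)
    if match is None:
        return None
    return match.group(1).decode("ascii", "replace")


# Fragment copied from the XMP payload posted in this thread.
xmp = b"""<rdf:Description rdf:about=""
        xmlns:dc="http://purl.org/dc/elements/1.1/">
     <dc:format>application/vnd.adobe.illustrator</dc:format>
"""

print(xmp_dc_format(xmp))  # application/vnd.adobe.illustrator
```

Whether this is robust enough for a signature would still need testing against files from several Illustrator versions.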

@carlwilson
Member

Closed due to lack of recent activity.
