Skip to content

Conversation

@davidfurey
Copy link
Member

@davidfurey davidfurey commented Feb 8, 2022

What does this change?

This improves the mimetype detection so that DNG and CR2 files are correctly identified. Previously these were classified as Tiffs, and then later failed to import, with users seeing an unfriendly error message and developers getting a PagerDuty alert.

Before:

image (7)

After

image (6)

How can success be measured?

Fewer PagerDuty alerts

Tested? Documented?

  • locally by committer
  • locally by Guardian reviewer
  • on the Guardian's TEST environment
  • relevant documentation added or amended (if needed)

Test file is first 1kB of a raw file
Digital Negative Image (raw image format)
@davidfurey davidfurey requested a review from a team as a code owner February 8, 2022 10:35
@paperboyo
Copy link
Contributor

paperboyo commented Feb 8, 2022

Tested and it works great! Probably should be revisited (to remove magic identification) after this is resolved upstream: drewnoakes/metadata-extractor#75.

It correctly prevents uploading of .CR2 and DNG images. If we would want to get magick for more RAW files, probably Sony and Nikon should be next (Sony bombs, Nikon uploads a 165px thumbnail).

Regarding future, better, support for DNGs, we could extract the preview JPEG and upload that after checking it’s the same dimensions as the RAW file. This is better than trying to develop RAW files ourselves, because they often come with proprietary development instructions, so using JPEG preview should get us closer to artists’ intent. But even that wouldn’t be 100% solution as previews may be out of date and even native resolution preview generated by Adobe apps is a highly compressed JPEG :-(.

Copy link
Member

@andrew-nowak andrew-nowak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

working on test 👍

<mime-info>
<mime-type type="image/x-raw-canon">
<_comment>Canon raw image</_comment>
<magic priority="50">
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This has the same priority as tiff files which I think should also match this magic. Should we increase this priority to make sure that whatever tika uses to tiebreak doesn't accidentally switch back to identifying these as tiffs?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds reasonable. I think custom-mimetypes have higher priority already, but no harm in increasing this.

private def isDng(file: File) = {
val metadata = TiffMetadataReader.readMetadata(file)
val directory = metadata.getFirstDirectoryOfType(classOf[ExifIFD0Directory])
directory.containsTag(0xC612)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think technically this could NPE (if somehow there wasn't a 0th ifd on the tiff), other places in the code we've checked that it's not null

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent spot, thank you.

@davidfurey
Copy link
Member Author

Tested and it works great! Probably should be revisited (to remove magic identification) after this is resolved upstream: drewnoakes/metadata-extractor#75.

Great, thanks for testing it.

It correctly prevents uploading of .CR2 and DNG images. If we would want to get magick for more RAW files, probably Sony and Nikon should be next (Sony bombs, Nikon uploads a 165px thumbnail).

Happy to look at this when I get a chance, just send me some example images.

Regarding future, better, support for DNGs, we could extract the preview JPEG and upload that after checking it’s the same dimensions as the RAW file. This is better than trying to develop RAW files ourselves, because they often come with proprietary development instructions, so using JPEG preview should get us closer to artists’ intent. But even that wouldn’t be 100% solution as previews may be out of date and even native resolution preview generated by Adobe apps is a highly compressed JPEG :-(.

I'm reluctant to attempt to extract JPEG info from a DNG since I don't think this is a workflow we want to encourage. I'd rather we displayed a sensible error message.

readMetadata can throw an exception, and getFirstDirectoryOfType can return null
This needs to be treated as higher priority than Tiff.  Tika seems to already treat it has higher priority than Tiff, but this gives us confidence that it will continue to in future.
@prout-bot
Copy link

Seen on auth, usage, cropper, collections, media-api, kahuna, image-loader, metadata-editor, thrall, leases (merged by @davidfurey 8 minutes and 33 seconds ago) Please check your changes!

@prout-bot
Copy link

Seen on image-loader-projection (merged by @davidfurey 8 minutes and 38 seconds ago) Please check your changes!

@paperboyo
Copy link
Contributor

paperboyo commented Feb 8, 2022

Thanks!

Happy to look at this when I get a chance, just send me some example images.

Just download any from the folders I linked to. eg. this and that.

I'm reluctant to attempt to extract JPEG info from a DNG since I don't think this is a workflow we want to encourage. I'd rather we displayed a sensible error message.

Yeah. Thing is, the current state of industry prevents us from always doing what’s best: proprietary Adobe, Sony or any other metadata development commands cannot be interpreted by open source software, because the renderers are proprietary too (not even commercial software can read competitors’ data).

The error message could read something like: This image is a raw file that cannot be reliably interpreted by Grid. Please convert it to supported file format preserving all metadata.. But this would be useful only for users who will understand it and can act upon it (either themselves or asking upstream).

One alternative would be to sniff for acceptable previews (that have developments baked-in). Another would be whatever dcraw does: probably just renders raws using vanilla settings, although it may take some metadata into account, I don’t know (I would expect at least white balance to be interpreted correctly). Latter alternative would render raws/dngs different to how they were saved from raw developing software if they indeed were (although we could exclude those with proprietary-specific metadata?). In any case, this seems involved. Preview ingest less so than trying to be clever about raw development. Users should upload developed files saved as highest quality JPEGs…

@paperboyo paperboyo mentioned this pull request Feb 8, 2022
14 tasks
@davidfurey
Copy link
Member Author

For future reference, I think the Canon raw magic came from http://lclevy.free.fr/cr2/#key_info / https://github.com/lclevy/libcraw2/blob/master/docs/cr2_poster.pdf

But I might have got it from somewhere else and then cross-referenced with this resource

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants