Checking image/Tiff() before image/Dcm() can cause misidentification of DICOMs as TIFFs #198

@CalNightingale

  • DICOM files start with a 128-byte preamble that is unstructured (i.e., the first 128 bytes can contain anything). The spec says "File-set Readers or Updaters shall not rely on the content of this Preamble to determine that this File is or is not a DICOM File."
  • A TIFF header is only 8 bytes long (a simplification, but sufficient for this investigation), so it fits inside the preamble.
  • DICOM apparently has a dual-format concept, in which the preamble may contain, e.g., TIFF data so that applications can recognize the file as either a TIFF or a DICOM (see section 7.5).
  • Some DICOM files, including the official pydicom example files, do start with a TIFF header (the bytes 'II' followed by the short 42).
  • So, from filetype's perspective, such a file is both a valid TIFF and a valid DICOM, but filetype reports TIFF because it checks for a TIFF match first.
  • Although these files are technically both valid TIFF and DICOM, I believe filetype would be more accurate if it checked for DICOM before checking for TIFF.
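To make the ordering point concrete, here is a minimal sketch of the proposed check order. It is not filetype's actual implementation: the helper names (`looks_like_dicom`, `looks_like_tiff`, `sniff`) and the returned MIME strings are assumptions made for illustration. It relies only on the 'DICM' marker that follows the 128-byte preamble and on the 4-byte TIFF signature.

```python
# Hypothetical helpers for illustration only; not filetype's real matcher code.
from typing import Optional

DICOM_OFFSET = 128                       # length of the unstructured preamble
DICOM_MAGIC = b"DICM"                    # marker that follows the preamble
TIFF_MAGICS = (b"II*\x00", b"MM\x00*")   # 'II'/'MM' followed by the short 42


def looks_like_dicom(buf: bytes) -> bool:
    """True if the 'DICM' marker sits right after the 128-byte preamble."""
    return buf[DICOM_OFFSET:DICOM_OFFSET + 4] == DICOM_MAGIC


def looks_like_tiff(buf: bytes) -> bool:
    """True if the buffer starts with a little- or big-endian TIFF signature."""
    return buf[:4] in TIFF_MAGICS


def sniff(buf: bytes) -> Optional[str]:
    """Check DICOM before TIFF so dual-format files are reported as DICOM."""
    if looks_like_dicom(buf):
        return "application/dicom"
    if looks_like_tiff(buf):
        return "image/tiff"
    return None


# A synthetic dual-format header: TIFF signature inside the preamble, then 'DICM'.
dual_header = b"II*\x00" + bytes(124) + b"DICM"
plain_tiff_header = b"II*\x00" + bytes(200)

print(sniff(dual_header))
print(sniff(plain_tiff_header))
```

With this order, a plain TIFF is still reported as TIFF, because it has no 'DICM' marker at offset 128. The trade-off is that the sniffer must read at least 132 bytes of the file rather than only the first few.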

Example:

import pydicom
import filetype
import tempfile
import os

with tempfile.TemporaryDirectory(suffix="_dcm_test") as tdir:
    dcm_path = os.path.join(tdir, "test.dcm")
    pydicom.examples.ct.save_as(dcm_path)
    print("pydicom.misc.is_dicom(dcm_path):", pydicom.misc.is_dicom(dcm_path))
    print("filetype.guess(dcm_path).mime:", filetype.guess(dcm_path).mime)

Results:

pydicom.misc.is_dicom(dcm_path): True
filetype.guess(dcm_path).mime: image/tiff
