-
Notifications
You must be signed in to change notification settings - Fork 241
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DeltaVision: Improve accuracy of isThisType stream checks #2658
Conversation
- Check that it's not a TIFF - Check that the file size meets a minimum size
@@ -174,7 +180,7 @@ public boolean isThisType(RandomAccessInputStream stream) throws IOException { | |||
int x = stream.readInt(); | |||
int y = stream.readInt(); | |||
int count = stream.readInt(); | |||
return x > 0 && y > 0 && count > 0; | |||
return x > 0 && y > 0 && count > 0 && (x * y * count > stream.length()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be x * y * count < stream.length()
? It would probably also be good to cast x
, y
, and count
to long when multiplying, otherwise integer overflow could cause a false positive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, yes. Fixed.
// false negatives. | ||
if(new TiffParser(stream).isValidHeader()) { | ||
return false; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this will be redundant if the below check is fixed. isValidHeader
is checking the first four bytes; in the case of a valid header, these would be 0x49492a00 or 0x4d4d002a, which would be read as x
below and should already fail the comparison with stream.length()
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately, this isn't the case. The length check still passes with TIFF data, including the test data from 17411, so this is needed to be sure it's not TIFF. The dv magic and length checks are simply too ambiguous on their own to guarantee a correct false
return, unless there's something more definitive we can check for.
This check will at least cover most TIFF-based formats to prevent misbehaviour, though ideally it wouldn't be needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This stream-based DV check is pretty weak, it's a pity we can't make more use of the filename and fall back to stream-based identification only as a very last resort if filename or some other reader's byte or magic number pattern can't offer more confidence it is some other format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It also occurs to me that in reverse, a TIFF header could actually be valid DV data on occasion, so while this content-based check is likely more robust all around with this change (in terms of overall failure rates, by increasing the probability of a correct identification), but still far from perfect.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not 100% convinced that this can't be replaced by a further check on x * y * count
later on, which is negative in this case due to long overflow. Perhaps wrapping each variable in a BigInteger
instead of casting to long
before multiplying would help and allow this to be removed?
Note that, without this PR, the behaviour is different with the two files provided. Anyway, with this PR both files are correctly identified and converted, with or without |
This looks good to me as at least being an improvement. If I were to criticize the TIFF check it would be more that it reflects the particular sample data we have that triggered this issue rather than our more properly putting the more confident checks from other non-TIFF formats too ahead of the weak DV check here. Real shame we can't make more use of filename too though in this circumstance, for whatever reason. |
@rleigh-codelibre is it worth trying to resolve the changes discussed above for 5.5.2 or should we create a card to fix this in a later release? |
I'll double-check tomorrow. |
Refreshing my memory of the discussion, I'm unsure that the suggsted size check addition is sufficiently robust-a valid TIFF could still end up being identified as DV. It all comes down to chance, which makes me uneasy about it. |
@rleigh-codelibre can you open a configuration PR for adding QA 17411 to the data repository tests? |
See https://ci.openmicroscopy.org/view/Failing/job/BIOFORMATS-DEV-merge-full-repository/573/. Merging as an improvement to a long-running detection issue. |
Testing: Check with the images from QA 17411 https://www.openmicroscopy.org/qa2/qa/feedback/17411/ which should now be detected as TIFF, not DeltaVision.