Use mean absolute deviation for pooling #89
Conversation
That works great, but why does it have …
That per-scale weight tweak hinted that the original scale weights were perhaps non-optimal. I've adjusted them, and that gave a decent improvement.
My tweak. Initially it was .powf(0.5) (MAD is used in the MDSI metric, where they set it to 0.25; EDIT: I forgot, that's a different parameter in MDSI), but then I found that when comparing only at the first scale (plain SSIM), the best accuracy is with .powf(1) (almost on par with VIF, try it), so I tested it with each scale separately.
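I can only guess at the exact formula from this excerpt; as a rough sketch of what MAD-based pooling of a per-pixel SSIM map with a .powf() exponent can look like (the function name and the way the deviation enters the score are assumptions, not this PR's actual code):

```python
import numpy as np

def mad_pool(ssim_map, exponent=0.5):
    # Hypothetical sketch of deviation pooling: measure how far the per-pixel
    # SSIM values spread around their own mean, then raise that mean absolute
    # deviation to `exponent` (the .powf() value discussed above).
    mad = np.abs(ssim_map - ssim_map.mean()).mean()
    return mad ** exponent

# A mostly-good map with one badly distorted patch pools to a much higher
# deviation than a uniform map with the same mean.
uniform = np.full((64, 64), 0.90)
patchy = np.full((64, 64), 0.95)
patchy[:16, :16] = 0.15
print(mad_pool(uniform), mad_pool(patchy))
```

The pooled value behaves like a dissimilarity (higher means worse); how it is folded into the per-scale score and the scale weights isn't visible in this excerpt.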
I wouldn't change the default scale weights based on only one dataset. Also, the full accuracy on TID2013 is so high with these weights because of the Exotic group (which is pretty much irrelevant). Finally, according to the TID2013 paper, SSIM is among the best metrics for "Good quality" (Table 6), so maybe add a -ssim switch to calculate SSIM instead of MS-SSIM.
Yes, indeed I'm worried about overfitting to TID2013. Can you recommend other similar datasets that I could use to verify accuracy?
The most similar is KADID-10k. Only they made a mistake and placed reference images instead of first-level distorted images for 3 distortion types (1st, 3rd and 8th).
With my max+avg pooling method, my thinking was: …
That solution is not mathematically elegant, but I like it because it intuitively makes sense to me. I've read how MAD works, but I'm not quite sure why this particular formula, especially with the "tweak", happens to work for pooling SSIM. My hypothesis is: …
That makes me wonder whether this tweaked MAD just happens to be another way of mixing "worst" vs "average" error pooling, just not within each scale but across scales. And I wonder whether the proportion from …
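For reference, a rough sketch of what mixing "worst" and "average" pooling within one scale can look like; the fraction and weight below are illustrative assumptions, not the values used in the max+avg method:

```python
import numpy as np

def worst_plus_avg_pool(ssim_map, worst_fraction=0.2, worst_weight=0.5):
    # Hypothetical sketch: blend the average over the whole map with the
    # average over its lowest-scoring pixels, so a small region of severe
    # distortion still pulls the final score down.
    flat = np.sort(ssim_map.ravel())
    k = max(1, int(len(flat) * worst_fraction))
    worst_avg = flat[:k].mean()
    return (1 - worst_weight) * flat.mean() + worst_weight * worst_avg
```

Raising worst_weight moves the result toward pure worst-case pooling; lowering it moves it toward a plain average.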
It doesn't behave like std dev (I edited my comment above about MAD in MDSI), and it always makes …
At first, after the arithmetic mean, I tried geometric and harmonic means. They made accuracy worse (harmonic was the worst). From Wikipedia: "For all positive data sets containing at least one pair of nonequal values, the harmonic mean is always the least of the three means, while the arithmetic mean is always the greatest of the three and the geometric mean is always in between." So I added that exponent to bias the average towards good-quality scores. The bigger the downsampling factor, the higher and less spread the per-pixel SSIM scores are (except for some Exotic distortion types, probably), so the exponent value must be lower to be effective.
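A small, self-contained sketch of that comparison (arithmetic vs. geometric vs. harmonic mean), plus a power-mean-style exponent as one way to bias the average; where exactly the exponent is applied in the actual code isn't visible here, so treat this as an assumption:

```python
import numpy as np
from scipy.stats import gmean, hmean

# Toy per-pixel SSIM values clustered near 1, as at coarser scales.
rng = np.random.default_rng(0)
ssim_map = np.clip(rng.normal(0.9, 0.05, 10_000), 0.01, 1.0)

print(ssim_map.mean(), gmean(ssim_map), hmean(ssim_map))
# arithmetic >= geometric >= harmonic, per the Wikipedia quote above

def power_mean(x, p):
    # p > 1 biases the average toward the higher (good-quality) scores,
    # p < 1 toward the lower ones; p = 1 is the plain arithmetic mean.
    return np.mean(x ** p) ** (1.0 / p)

print(power_mean(ssim_map, 2.0), power_mean(ssim_map, 0.5))
```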
It's still within each scale, not across scales (cross scale pooling is in this block).
Maybe not ideal, but I tried different constants (like 1.5, 1.75, 2.25, 2.5) and also an inverse-square-law formula (1/n^2).
Tested on KADID-10k using scripts from here (I only modified kadid10k.ROCC.py to show per-type and per-level accuracy). 1st is max+avg, 2nd is MAD with non-custom scale weights.
Full SROCC and per-type SROCC (distortion types 1-25): [values missing]
Per level:
-0.7594226220260838
-0.7859239032568683
-0.7645387478162078
-0.6711213371339377
-0.598736695180533
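For context, a self-contained toy sketch of how this kind of SROCC number is computed (the real script loads KADID-10k's subjective scores; the random stand-ins below are only for illustration):

```python
import numpy as np
from scipy.stats import spearmanr

# Toy stand-ins: one subjective score (DMOS) and one metric score per distorted
# image, plus each image's distortion level (KADID-10k has 5 levels per type).
rng = np.random.default_rng(0)
dmos = rng.uniform(1.0, 5.0, 500)
metric = -dmos + rng.normal(0.0, 0.4, 500)   # a dissimilarity metric correlates negatively
levels = np.tile(np.arange(1, 6), 100)

full_srocc, _ = spearmanr(metric, dmos)
print("Full SROCC:", full_srocc)

# Per-level SROCC, as in the figures quoted above: one spearmanr call per group.
for lvl in np.unique(levels):
    mask = levels == lvl
    rho, _ = spearmanr(metric[mask], dmos[mask])
    print(f"level {lvl}: {rho}")
```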
Btw, there is my implementation of MDSI in Python. I removed it from my repo because I didn't like this metric very much. But maybe it's actually good? |
Thanks for the data. It looks good. It makes DSSIM output scores that have a different magnitude than before, so I'll probably add some rescaling fudge in order to avoid breaking users too much.
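One cheap form such a rescaling fudge could take (purely a sketch; the factor fitting and sample values below are made up, not anything from DSSIM itself):

```python
import numpy as np

# Fit a single multiplier that maps new-pooling scores back to roughly the old
# magnitude, via least squares on a handful of sample images.
old_scores = np.array([0.0021, 0.0103, 0.0450, 0.0007])   # placeholder values
new_scores = np.array([0.0180, 0.0910, 0.3800, 0.0062])   # placeholder values

scale = np.dot(old_scores, new_scores) / np.dot(new_scores, new_scores)
print(scale, new_scores * scale)
```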
Thinking a 5x5 binomial / Gaussian (std=1) downsampling filter might work better with MAD than the 2x2 average.
I guess it could help a little. IIRC the change from a 2x box blur to a proper Gaussian kernel in SSIM blurring helped a little too.
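A sketch of that filter for comparison: a 5x5 binomial kernel (outer product of [1, 4, 6, 4, 1]/16, close to a Gaussian with std 1) versus the plain 2x2 average; both functions are illustrative, not the project's actual downsampling code:

```python
import numpy as np
from scipy.ndimage import convolve

def binomial_downsample(img):
    # Blur with a separable 5x5 binomial kernel, then keep every 2nd pixel.
    k1 = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0
    blurred = convolve(img, np.outer(k1, k1), mode="nearest")
    return blurred[::2, ::2]

def box_downsample(img):
    # The 2x2 average: mean of each non-overlapping 2x2 block.
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    return img[:h, :w].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

img = np.random.default_rng(0).random((64, 64))
print(binomial_downsample(img).shape, box_downsample(img).shape)
```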
Made a Python version of this metric (grayscale only) for testing, with an alternative version of MAD.
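The alternative MAD isn't shown in this excerpt; one possible variant, offered purely as a guess, measures the deviation around the median rather than the mean, which is less sensitive to a few extreme pixels:

```python
import numpy as np

def median_abs_dev_pool(ssim_map, exponent=0.5):
    # Hypothetical alternative: median absolute deviation instead of MAD around the mean.
    med = np.median(ssim_map)
    return np.median(np.abs(ssim_map - med)) ** exponent

print(median_abs_dev_pool(np.clip(np.random.default_rng(0).normal(0.9, 0.05, (64, 64)), 0, 1)))
```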