Evaluate butteraugli on the CLIC-2021 perceptual quality task #202

Closed

gtoderici opened this issue Jun 17, 2021 · 6 comments
Labels
unrelated to 1.0 Things that need not be done before the 1.0 version milestone

Comments

@gtoderici

Is your feature request related to a problem? Please describe.
It would be good to evaluate butteraugli on the CLIC-2021 perceptual quality task. This should provide additional information to the community about its performance characteristics compared to other perceptual quality metrics that JPEG XL could optimize for.

Describe the solution you'd like
Please use the test data from the CLIC 2021 perceptual challenge to generate a CSV file with the decisions (see link below for the exact instructions):

https://github.com/fab-jul/clic2021-devkit/blob/main/README.md#perceptual-challenge

Please email the resulting CSV file to me to get the final results / ranks. We'll publish these at: http://compression.cc/leaderboard/perceptual/test/

...and of course update this bug tracker.

@DamonHD

DamonHD commented Jun 17, 2021

FWIW I get a 403 - Forbidden on that last URL.

Rgds

Damon

@gtoderici
Author

That URL is not public yet (as of today); it should be public starting Monday.

For now you can see:
http://compression.cc/leaderboard/perceptual/valid/

The important part is the instructions at:
https://github.com/fab-jul/clic2021-devkit/blob/main/README.md#perceptual-challenge

@DamonHD

DamonHD commented Jun 18, 2021

For clarity: I am only an ignorant lurker, and not one of the fine people actually making JXL happen! B^>

Rgds

Damon

@jonsneyers
Member

If someone wants to do this, here's a simple bash script to produce the desired data. You'll have to download about 50 GB of images though, and it'll take quite a long time to compute all the scores.

#!/bin/bash

# Download the test pair list and the image archives (about 50 GB in total).
wget https://storage.googleapis.com/clic2021_public/perceptual/test/clic_2021_test.zip
for i in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  wget "https://storage.googleapis.com/clic2021_public/perceptual/test/$i.tar"
done
unzip clic_2021_test.zip
for i in *.tar; do tar -xvf "$i"; done

# Each line of clic_2021_test.csv names an original and two distorted
# versions: orig,A,B. For each metric, write 1 when A scores higher than B
# (i.e. A is rated worse, since lower scores are better for both metrics).
while IFS=, read -r orig A B; do

  scoreA=$(ssimulacra_main "$orig" "$A")
  scoreB=$(ssimulacra_main "$orig" "$B")
  result=$(echo "$scoreA > $scoreB" | bc -l)
  echo "$orig,$A,$B,$result" >> ssimulacra.csv

  # butteraugli_main prints the 3-norm as the second field of its last line.
  scoreA=$(butteraugli_main "$orig" "$A" | tail -n 1 | cut -d " " -f 2)
  scoreB=$(butteraugli_main "$orig" "$B" | tail -n 1 | cut -d " " -f 2)
  result=$(echo "$scoreA > $scoreB" | bc -l)
  echo "$orig,$A,$B,$result" >> butteraugli3norm.csv

done < clic_2021_test.csv
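Since each CSV line is scored independently, the slow scoring step parallelizes naturally. A rough, untested sketch (score_pair.sh is a hypothetical helper containing the per-line body above, taking one "orig,A,B" argument; requires GNU xargs):

# Run one score_pair.sh invocation per CSV line, one per core.
# Note: concurrent ">>" appends can interleave in principle, though
# short single-write lines are rarely a problem in practice.
xargs -a clic_2021_test.csv -n 1 -P "$(nproc)" ./score_pair.sh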

I tried it on the smaller example validation.csv set and got an accuracy of 0.595 for ssimulacra and 0.671 for Butteraugli 3-norm. I haven't tried it on the larger test set yet though.
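For reference, a minimal sketch of how such an accuracy number can be computed from the script's output, assuming a hypothetical labels.csv that holds the human decisions in the same "orig,A,B,choice" format:

# Join metric decisions with human labels on the orig,A,B key and
# count how often they agree. labels.csv is an assumed input.
awk -F, 'NR==FNR { label[$1","$2","$3] = $4; next }
         ($1","$2","$3) in label {
           total++; if ($4 == label[$1","$2","$3]) correct++ }
         END { printf "accuracy: %.3f (%d/%d)\n", correct/total, correct, total }' \
    labels.csv butteraugli3norm.csv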

@gtoderici I think it would be interesting to look not just at overall accuracy, but also at accuracy for various operating points. For example, if some of the stimuli correspond to low-bpp encoding and others to higher-bpp encoding, the data could be segmented into a few buckets (e.g. six buckets: low vs low, low vs medium, low vs high, medium vs medium, medium vs high, high vs high) and accuracies computed per bucket, as in the sketch below. I expect some metrics to be better at some of those tasks but worse at others, and that would be very useful information, more so than "which does best overall", since afaik we don't really have any metric that does great overall.
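A sketch of that bucketing, under assumptions: a hypothetical bpp.csv with "image,bpp" lines, a joined_decisions.csv with "orig,A,B,metric_choice,human_choice" lines, and arbitrary bpp cutoffs:

# Per-bucket accuracy; bucket boundaries (0.15 and 0.5 bpp) are made up.
awk -F, '
  function bucket(b) { return b < 0.15 ? 0 : (b < 0.5 ? 1 : 2) }
  BEGIN { name[0] = "low"; name[1] = "medium"; name[2] = "high" }
  NR==FNR { bpp[$1] = $2; next }        # first file: image,bpp
  { a = bucket(bpp[$2]); b = bucket(bpp[$3])
    # order-insensitive bucket pair, so "low vs medium" == "medium vs low"
    key = (a <= b) ? name[a] " vs " name[b] : name[b] " vs " name[a]
    total[key]++; if ($4 == $5) correct[key]++ }
  END { for (k in total)
          printf "%s: %.3f (%d pairs)\n", k, correct[k]/total[k], total[k] }
' bpp.csv joined_decisions.csv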

@gtoderici
Author

@jonsneyers - I agree with your assessment about evaluating the accuracy (and more) at various bitrates. The evaluation code already does it, but I haven't had time to do any of the graphing work required. However, I have only done this on the test set thus far since we have more human ratings there.

I noticed that for all metrics there is some discrepancy in performance between validation and test, though not a large one. What is more interesting to me is that if you look at ranking results, things can vary quite a bit between perceptual quality methods despite their having similar accuracy.

Once I get the butteraugli CSV file for the test set, I will ping this thread.

@jyrkialakuijala
Contributor

jyrkialakuijala commented Jun 23, 2021

Let's try the 6-norm of butteraugli. I often use a lower norm than what actually works best psychovisually because it is easier to optimize for. Here we don't get the benefit of using a lower norm, since this is pure psychovisual evaluation with no optimization involved.
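For context, a tiny illustration of what p-norm pooling means: over per-pixel butteraugli distances d, the p-norm is (mean(d^p))^(1/p), so a larger p weights the worst-quality regions more heavily. A sketch assuming a hypothetical distances.txt with one per-pixel distance per line:

p=6
# p-norm pooling: (mean of d^p)^(1/p); distances.txt is an assumed dump
# of the per-pixel butteraugli distance map.
awk -v p="$p" '{ sum += $1 ^ p; n++ }
               END { printf "%d-norm: %f\n", p, (sum / n) ^ (1 / p) }' distances.txt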

mo271 added the "unrelated to 1.0" label (Things that need not be done before the 1.0 version milestone) Mar 29, 2022
mo271 closed this as completed Mar 29, 2022