Evaluate butteraugli on the CLIC-2021 perceptual quality task #202

Closed

gtoderici opened this issue Jun 17, 2021 · 6 comments
Labels
unrelated to 1.0 Things that need not be done before the 1.0 version milestone

Comments

@gtoderici

Is your feature request related to a problem? Please describe.
It would be good to evaluate butteraugli on the CLIC-2021 perceptual quality task. This should provide additional information to the community about its performance characteristics compared to other perceptual quality metrics that JPEG XL could optimize for.

Describe the solution you'd like
Please use the test data from the CLIC 2021 perceptual challenge to generate a CSV file with the decisions (see link below for the exact instructions):

https://github.com/fab-jul/clic2021-devkit/blob/main/README.md#perceptual-challenge

Please email the resulting CSV file to me to get the final results / ranks. We'll publish these at: http://compression.cc/leaderboard/perceptual/test/

...and of course update this bug tracker.

@DamonHD

DamonHD commented Jun 17, 2021

FWIW I get a 403 - Forbidden on that last URL.

Rgds

Damon

@gtoderici
Author

That URL is not public yet (as of today); it should be public starting Monday.

For now you can see:
http://compression.cc/leaderboard/perceptual/valid/

The important part is the instructions at:
https://github.com/fab-jul/clic2021-devkit/blob/main/README.md#perceptual-challenge

@DamonHD

DamonHD commented Jun 18, 2021

For clarity: I am only an ignorant lurker, and not one of the fine people actually making JXL happen! B^>

Rgds

Damon

@jonsneyers
Member

If someone wants to do this, here's a simple bash script to produce the desired data. You'll have to download about 50 GB of images though, and it'll take quite a long time to compute all the scores.

#!/bin/bash

# Download the test pair list and the image archives (about 50 GB in total).
wget https://storage.googleapis.com/clic2021_public/perceptual/test/clic_2021_test.zip
for i in 0 1 2 3 4 5 6 7 8 9 a b c d e f; do
  wget "https://storage.googleapis.com/clic2021_public/perceptual/test/$i.tar"
done
unzip clic_2021_test.zip
for i in *.tar; do tar -xvf "$i"; done

# Each line of clic_2021_test.csv names an original and two distorted
# versions: orig,A,B. For each metric, write 1 when A scores higher than B
# (i.e. A is rated worse, since lower scores are better for both metrics).
while IFS=, read -r orig A B; do

  scoreA=$(ssimulacra_main "$orig" "$A")
  scoreB=$(ssimulacra_main "$orig" "$B")
  result=$(echo "$scoreA > $scoreB" | bc -l)
  echo "$orig,$A,$B,$result" >> ssimulacra.csv

  # butteraugli_main prints the 3-norm as the second field of its last line.
  scoreA=$(butteraugli_main "$orig" "$A" | tail -n 1 | cut -d " " -f 2)
  scoreB=$(butteraugli_main "$orig" "$B" | tail -n 1 | cut -d " " -f 2)
  result=$(echo "$scoreA > $scoreB" | bc -l)
  echo "$orig,$A,$B,$result" >> butteraugli3norm.csv

done < clic_2021_test.csv
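Since each CSV line is scored independently, the slow scoring step parallelizes naturally. A rough, untested sketch (score_pair.sh is a hypothetical helper containing the per-line body above, taking one "orig,A,B" argument; requires GNU xargs):

# Run one score_pair.sh invocation per CSV line, one per core.
# Note: concurrent ">>" appends can interleave in principle, though
# short single-write lines are rarely a problem in practice.
xargs -a clic_2021_test.csv -n 1 -P "$(nproc)" ./score_pair.sh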

I tried it on the smaller example validation.csv set and got an accuracy of 0.595 for ssimulacra and 0.671 for Butteraugli 3-norm. I haven't tried it on the larger test set yet though.
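For reference, a minimal sketch of how such an accuracy number can be computed from the script's output, assuming a hypothetical labels.csv that holds the human decisions in the same "orig,A,B,choice" format:

# Join metric decisions with human labels on the orig,A,B key and
# count how often they agree. labels.csv is an assumed input.
awk -F, 'NR==FNR { label[$1","$2","$3] = $4; next }
         ($1","$2","$3) in label {
           total++; if ($4 == label[$1","$2","$3]) correct++ }
         END { printf "accuracy: %.3f (%d/%d)\n", correct/total, correct, total }' \
    labels.csv butteraugli3norm.csv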

@gtoderici I think it would be interesting to look not just at overall accuracy, but also at accuracy for various operating points. For example, if some of the stimuli correspond to low-bpp encoding and others to higher-bpp encoding, the data could be segmented into a few buckets (e.g. six buckets: low vs low, low vs medium, low vs high, medium vs medium, medium vs high, high vs high) and accuracies computed per bucket, as in the sketch below. I expect some metrics to be better at some of those tasks but worse at others, and that would be very useful information, more so than "which does best overall", since afaik we don't really have any metric that does great overall.
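A sketch of that bucketing, under assumptions: a hypothetical bpp.csv with "image,bpp" lines, a joined_decisions.csv with "orig,A,B,metric_choice,human_choice" lines, and arbitrary bpp cutoffs:

# Per-bucket accuracy; bucket boundaries (0.15 and 0.5 bpp) are made up.
awk -F, '
  function bucket(b) { return b < 0.15 ? 0 : (b < 0.5 ? 1 : 2) }
  BEGIN { name[0] = "low"; name[1] = "medium"; name[2] = "high" }
  NR==FNR { bpp[$1] = $2; next }        # first file: image,bpp
  { a = bucket(bpp[$2]); b = bucket(bpp[$3])
    # order-insensitive bucket pair, so "low vs medium" == "medium vs low"
    key = (a <= b) ? name[a] " vs " name[b] : name[b] " vs " name[a]
    total[key]++; if ($4 == $5) correct[key]++ }
  END { for (k in total)
          printf "%s: %.3f (%d pairs)\n", k, correct[k]/total[k], total[k] }
' bpp.csv joined_decisions.csv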

@gtoderici
Author

@jonsneyers - I agree with your assessment about evaluating the accuracy (and more) at various bitrates. The evaluation code already does it, but I haven't had time to do any of the graphing work required. However, I have only done this on the test set thus far since we have more human ratings there.

I noticed that for all metrics there is some discrepancy in performance between validation and test, though not a large one. What is more interesting to me is that if you look at ranking results, things can vary quite a bit between perceptual quality methods despite their having similar accuracy.

Once I get the butteraugli CSV file for the test set, I will ping this thread.

@jyrkialakuijala
Contributor

jyrkialakuijala commented Jun 23, 2021

Let's try the 6-norm of butteraugli. I often use a lower norm than what actually works best psychovisually because it is easier to optimize for. Here we don't get the benefit of using a lower norm, since this is pure psychovisual evaluation with no optimization involved.
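For context, a tiny illustration of what p-norm pooling means: over per-pixel butteraugli distances d, the p-norm is (mean(d^p))^(1/p), so a larger p weights the worst-quality regions more heavily. A sketch assuming a hypothetical distances.txt with one per-pixel distance per line:

p=6
# p-norm pooling: (mean of d^p)^(1/p); distances.txt is an assumed dump
# of the per-pixel butteraugli distance map.
awk -v p="$p" '{ sum += $1 ^ p; n++ }
               END { printf "%d-norm: %f\n", p, (sum / n) ^ (1 / p) }' distances.txt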

mo271 added the "unrelated to 1.0" label (Things that need not be done before the 1.0 version milestone) Mar 29, 2022
mo271 closed this as completed Mar 29, 2022