This repo contains the source code and data in the ICLR 2022 paper, Towards Empirical Sandwich Bounds on the Rate-Distortion Function. For an introduction to this topic, see this blog post: part 1, part 2.
In this work, we make a first attempt at a numerical method for estimating sandwich bounds on the rate-distortion (R-D) function of a general data source using i.i.d. samples. Unlike the classic Blahut-Arimoto (BA) algorithm, which only works when the source and reproduction alphabets are finite and the source probabilities known, here we consider general (e.g., continuous) alphabets, without knowledge of the source distribution other than access to its samples, a common setting in applications like image compression.
We estimate an upper bound on the R-D function using an SGD version of the BA algorithm that is scalable to high-resolution image datasets, based on a close theoretical connection to beta-VAEs and neural compression autoencoders. To verify the tightness of our upper bounds, we also develop a lower bound algorithm based on Csiszár's dual characterization of the R-D function. Due to the associated computational challenges, we restrict to a squared error distortion and a continuous reproduction alphabet, and obtain non-trivial lower bound estimates on data with relatively low intrinsic dimensions, ranging from particle physics measurements to realistic images generated by a BigGAN. Our estimated R-D upper bound on natural images indicates that there is theoretical room for improving state-of-the-art image compression methods by at least 1 dB in PSNR, at various bitrates.
If you find this work helpful, please consider citing the paper
@inproceedings{yang2022towards,
title={Towards Empirical Sandwich Bounds on the Rate-Distortion Function},
author={Yang, Yibo and Mandt, Stephan},
booktitle={International Conference on Learning Representations},
year={2022}
}
To get started, you may want to check out the demo notebook that estimates the proposed R-D sandwich bounds on a 2D Gaussian source, and compares with the BA algorithm and the analytical R-D function.
The main scripts implementing the various algorithms are:
ba.py
: this contains a vectorized implementation of the Blahut-Arimoto algorithm. It takes in samples from a continuous source, discretizes/estimates the source probabilities by a histogram, and then runs the BA algorithm on the discretized source.rdub_mlp.py
: this implements the proposed upper bound algorithm, by training a beta-VAE with an MLP encoder and (optionally) decoder. It also implements the non-linear transform coding method from Ballé et al., 2021, by placing shape restrictions on the variational posterior and prior distributions to simulate quantization.rdlb.py
: this implements the proposed lower bound algorithm, by training a neural-network function (referred to as "log u" in the paper) to maximize an unconstrained formulation of Csiszár's dual formulation of R-D.resnet_vae.py
: this implements the proposed upper bound algorithm specialized to images, using a hierarchical convolutional VAE inspired by the ResNet-VAE.mbt2018.py
: this implements the mean-scale hyperprior neural compression method from Minnen et al., 2018, based on the tfc implementation of the scale-only hyperprior model.msms2020.py
: this implements the channelwise autoregressive neural compression method from Minnen et al., 2020, adapted from the tfc implementation.
During training, all these scripts will write to disk a log file in .jsonl format, containing the train and val/test objectives evaluated after each training epoch.
The code was developed in a python 3.7 environment on Linux.
The main dependices are tensorflow 2.5 and tensorflow-compression 2.2, which can be installed by
pip install tensorflow-compression==2.2
I also use the syntax of the GNU parallel tool to conveniently capture
experiment commands that are repeated with different hyper-parameters (most commonly, experiments are rerun with different
lambda values to sweep out an R-D bound), and are embarrassingly parallelizable.
The parallel
tool is not required; I simply use its syntax, e.g., {1} {2} ::: a b ::: c d e
, to specify all the
possible tuples in the Cartesian product a c
, a d
, a e
, b c
, b d
, and b e
.
In fact, depending on the compute resources available, it may
not make sense to directly run the parallel commands as written, or you may want to run them sequentially one job
at a time with an extra -j 1
flag (e.g., parallel
does not take care of allocating jobs to different GPUs, so running
more than one job on a GPU may result in OOM crashes).
-
Gaussian:
The parameters of the randomly generated n=1000 dimensional Gaussian source is already saved in
./data
for your convenience. This was generated with the commandpython gen_gaussian_params.py --save_dir ./data --dim 1000
. You can change the --dim argument (this isn
in the paper) to experiment with other dimensions. -
Speech:
- Clone the FSDD repo, to say
/path/to/free-spoken-digit-dataset
- cd into
data/speech
, update thedata_dir
variable increate_data.py
to point to/path/to/free-spoken-digit-dataset
, then runcreate_data.py
.
- Clone the FSDD repo, to say
-
High-resolution images for training the hierarchical VAE:
-
Download the COCO training data:
wget http://images.cocodataset.org/zips/train2017.zip
-
Unzip the images to, say
/tmp/train2017
, then runpython prepare_imgs.py '/tmp/train2017/*.jpg' data/my_coco_train2017/
to perform preprocessing. This selects images larger than 512x512 and randomly downsamples them to remove potential compression artifacts.
-
-
Kodak test set: Download them from
http://r0k.us/graphics/kodak
, and put them under./data/kodak
. -
Tecnick test set: Download from
https://sourceforge.net/projects/testimages/files/OLD/OLD_SAMPLING/testimages.zip
, and unzip todata/Tecnick_TESTIMAGES/
.
All other data is self-contained.
-
The upper bound beta-VAE models with varying latent dimensions can be trained by running
n=1000; parallel python rdub_mlp.py -V --dataset gaussian --gparams_path data/gaussian_params-dim=$n.npz --data_dim $n --latent_dim {1} --checkpoint_dir checkpoints/gaussian --prior_type gmm_1 --posterior_type gaussian --encoder_activation none --decoder_activation leaky_relu --decoder_units $n --lambda {2} --rpd train --epochs 80 --steps_per_epoch 1000 --lr 5e-4 --batchsize 64 --max_validation_steps 1 ::: 400 500 600 800 ::: 0.3 1 3 10 30 100 300
-
The upper bound beta-VAE models without a decoder (latent space equals reproduction space) can be trained by
n=1000; parallel python rdub_mlp.py -V --dataset gaussian --gparams_path data/gaussian_params-dim=$n.npz --data_dim $n --latent_dim $n --checkpoint_dir checkpoints/gaussian --prior_type gmm_1 --posterior_type gaussian --encoder_activation none --decoder_activation none --decoder_units 0 --lambda {1} --rpd train --epochs 80 --steps_per_epoch 1000 --lr 5e-4 --batchsize 64 --max_validation_steps 1 ::: 0.3 1 3 10 30 100 300
Since we have access to the source and compute the training loss on true i.i.d. samples generated on
the fly, there's no need for separate validation/test data (--max_validation_steps
is set to a small dummy value of 1 here),
and the training objectives (loss
/rate
/mse
in the jsonl log) are unbiased estimates of the true Lagrangian/rate/distortion.
The main training command is
nn_size_to_n_ratio=100; nn_size=$(( n * nn_size_to_n_ratio )); parallel python rdlb.py --dataset gaussian --data_dim $n --checkpoint_dir checkpoints/gaussian --model mlp --units $nn_size,$nn_size,$nn_size --lamb {1} train --num_Ck_samples 2 --batchsize 1024 --last_step 3000 --checkpoint_interval 3000 --y_init quick --y_quick_topn 10 --lr 5e-4 --lr_schedule ::: $lambs_to_use
which trains a MLP log u model with 3 hidden layers, with the number of units set based on the source dimension n
.
The bash variables $n
and $lambs_to_use
for each run are set as follows:
######################
n=2
lambs_to_use="1.0 3.0 10.0 30.0 100.0"
######################
n=4
lambs_to_use="2.0 6.0 20.0 60.0 200.0"
######################
n=8
lambs_to_use="4.0 12.0 40.0 120.0 400.0"
######################
n=16
lambs_to_use="8.0 24.0 80.0 240.0 800.0"
The evaluation command is
parallel python rdlb.py --dataset gaussian --data_dim $n --checkpoint_dir checkpoints/gaussian --model mlp --units $nn_size,$nn_size,$nn_size --lamb {1} eval --batchsize 1024 --y_init exhaustive --num_Ck_samples 30 ::: $lambs_to_use
where the bash variables are set the same way as before. The evaluation code loads the trained log u model, and draws Ck
samples to form the LB estimator xi
as defined in Appendix A.6, saving the resulting Ck and xi samples in a .npz file.
The expected value of xi then gives a natural sample-mean
estimator of the LB intercept (denoted R_
in code), and samples of xi can also be used to estimate confidence intervals.
-
Upper bound beta-VAE:
parallel -j1 python rdub_mlp.py -V --checkpoint_dir checkpoints/physics --dataset data/physics/ppzee-split=train.npy --data_dim 16 --latent_dim 16 --prior_type maf --maf_stacks 3 --posterior_type gaussian --encoder_units 500,500 --decoder_units 500,500 --encoder_activation softplus --decoder_activation softplus --ar_hidden_units 50,50 --ar_activation softplus --nats --lambda {} train --epochs 100 --steps_per_epoch 1000 --lr 5e-4 --batchsize 512 --max_validation_steps 20 ::: 100 300 1000 3000 10000 30000
-
Nonlinear transform coding (Ballé et al., 2021):
parallel -j1 python rdub_mlp.py -V --checkpoint_dir checkpoints/physics --dataset data/physics/ppzee-split=train.npy --data_dim 16 --latent_dim 16 --prior_type deep --posterior_type uniform --encoder_units 500,500 --decoder_units 500,500 --encoder_activation softplus --decoder_activation softplus --nats --lambda {} train --epochs 100 --steps_per_epoch 1000 --lr 5e-4 --batchsize 512 --max_validation_steps 20 ::: 100 300 1000 3000 10000 30000
-
On the 2D marginal, the experiment commands are identical to the above, except
--dataset data/physics/ppzee-split=train.npy --data_dim 16 --latent_dim 16
is replaced with
--dataset data/physics/n=2-ppzee-split=train.npy --data_dim 2 --latent_dim 2
-
The code will automatically look for a test dataset in the same directory as the training .npy file, so the test performance is also continually evaluated (on a subset) during training (in the
val_loss
,val_rate
, andval_mse
fields of the jsonl log). -
The BA algorithm:
parallel python ba.py -V --samples data/physics/n=2-ppzee-split=test.npy --save_dir checkpoints/physics --bins 60 --steps 2000 --tol 1e-5 --lamb {} ::: 100 300 1000 3000 10000 30000
-
Train:
parallel python rdlb.py -V --dataset data/physics/ppzee-split=train.npy --data_dim 16 --checkpoint_dir checkpoints/physics --model mlp --units 200,200,200 --lamb {} train --batchsize 2048 --num_Ck_samples 2 --last_step 4000 --checkpoint_interval 2000 --y_init quick --y_quick_topn 3 --lr 5e-4 --lr_schedule ::: 100 300 1000 3000 10000 30000
-
Eval (here 4*5=20 samples of Ck are drawn in total, parallelized over four different processes with different seeds for efficiency):
parallel python rdlb.py -V --seed {1} --dataset data/physics/ppzee-split=test.npy --data_dim 16 --model mlp --units 200,200,200 eval --batchsize 2048 --num_Ck_samples 5 --y_init exhaustive --ckpt {2} ::: 0 1 2 3 ::: checkpoints/physics/rdlb-data_dim=16-model=mlp-units=200_200_200-lamb=*-batchsize=2048
Here the
--lamb
arg is not set; instead the code will use the lambda value from the name of the checkpoint that is being loaded. The results from different lambdas and different seeds can be combined to produce the final xi samples, using the convenience methodutils.aggregate_lb_results
; e.g.,import glob res_files = glob.glob('checkpoints/physics/rdlb-data_dim=16-model=mlp-units=200_200_200-lamb=*/rd-*.npz') results = aggregate_lb_results(res_files) # will be a dict of (lamb, xi_samples)
-
The commands on the 2D marginal source are identical to the above, except
ppzee-split={train/test}.npy
needs to be replaced withn=2-ppzee-split={train/test}.npy
, and--data_dim 16
with--data_dim 2
.
The commands are almost identical to the previous ones on physics data.
-
UB beta-VAE:
parallel -j1 python rdub_mlp.py -V --checkpoint_dir checkpoints/speech --dataset data/speech/stft-split=train.npy --data_dim 33 --latent_dim 33 --prior_type maf --maf_stacks 3 --posterior_type gaussian --encoder_units 500,500 --decoder_units 500,500 --encoder_activation softplus --decoder_activation softplus --ar_hidden_units 50,50 --ar_activation softplus --nats --lambda {} train --epochs 100 --steps_per_epoch 1000 --lr 5e-4 --batchsize 512 ::: 0.3 1 3 10 30 50 100
-
NTC:
parallel -j1 python rdub_mlp.py -V --checkpoint_dir checkpoints/speech --dataset data/speech/stft-split=train.npy --data_dim 33 --latent_dim 33 --prior_type deep --posterior_type uniform --encoder_units 500,500 --decoder_units 500,500 --encoder_activation softplus --decoder_activation softplus --nats --lambda {} train --epochs 100 --steps_per_epoch 1000 --lr 5e-4 --batchsize 512 ::: 0.3 1 3 10 30 50 100
-
BA:
parallel -j1 python ba.py -V --samples data/speech/frame_length=63/n=2-dig=all-speaker=theo-split=test-stft.npy --lamb {} --save_dir checkpoints/speech/BA --bins 50 --steps 2000 --tol 1e-5 ::: 0.1 0.3 1 3 10 30 100
-
Train:
parallel python rdlb.py -V --dataset data/speech/stft-split=train.npy --data_dim 33 --checkpoint_dir checkpoints/speech --model mlp --units 200,200,200 --lamb {} train --batchsize 2048 --num_Ck_samples 2 --last_step 4000 --checkpoint_interval 2000 --y_init quick --y_quick_topn 3 --lr 5e-4 --lr_schedule ::: 0.3 1 3 10 30 50
-
Eval:
parallel python rdlb.py -V --seed {1} --dataset data/speech/stft-split=test.npy --data_dim 33 --model mlp --units 200,200,200 eval --batchsize 2048 --num_Ck_samples 5 --y_init exhaustive --ckpt {2} ::: 0 1 2 3 ::: checkpoints/speech/rdlb-data_dim=33-model=mlp-units=200_200_200-lamb=*-batchsize=2048
-
The commands on the 2D marginal source are identical to the above, except
stft-split={train/test}.npy
needs to be replaced withn=2-stft-split={train/test}.npy
, and--data_dim 33
with--data_dim 2
.
-
UB beta-VAE:
parallel python rdub_mlp.py -V --checkpoint_dir checkpoints/banana --dataset banana --data_dim 2 --latent_dim 2 --prior_type maf --maf_stacks 3 --posterior_type gaussian --encoder_units 100,100 --decoder_units 100,100 --encoder_activation softplus --decoder_activation softplus --ar_hidden_units 10,10 --ar_activation softplus --nats --lambda {} train --epochs 100 --steps_per_epoch 1000 --lr 5e-4 --batchsize 1024 ::: 0.1 0.3 1 3 10 30
-
NTC (this reproduces Ballé et al.'s experiment):
parallel python rdub_mlp.py -V --checkpoint_dir checkpoints/banana --dataset banana --data_dim 2 --latent_dim 2 --prior_type deep --posterior_type uniform --encoder_units 100,100 --decoder_units 100,100 --encoder_activation softplus --decoder_activation softplus --nats --lambda {} train --epochs 100 --steps_per_epoch 1000 --lr 5e-4 --batchsize 1024 ::: 0.1 0.3 1 3 10 30
-
BA:
parallel -j1 python ba.py -V --samples data/banana-dim=2-samples=100000.npy --save_dir checkpoints/banana/BA --bins 80 --steps 2000 --tol 1e-5 --lamb {} ::: 0.1 0.3 1.0 3.0 10.0 30.0 100.0
-
UB beta-VAE on higher-dimension embeddings of the source:
parallel python rdub_mlp.py -V --checkpoint_dir checkpoints/banana --dataset banana --data_dim $n --latent_dim $n --prior_type maf --maf_stacks 3 --posterior_type gaussian --encoder_units $nn_size,$nn_size --decoder_units $nn_size,$nn_size --encoder_activation softplus --decoder_activation softplus --ar_hidden_units 20,20 --ar_activation softplus --nats --lambda {} train --epochs 100 --steps_per_epoch 1000 --lr 5e-4 --batchsize 1024 ::: 0.1 0.3 1 3 10 30 100 300 1000
The bash variables $n and $nn_size are set as follows:
######################
n=4
nn_size=400
######################
n=16
nn_size=1600
######################
n=100
nn_size=2000
######################
n=500
nn_size=2000
-
Train:
parallel python rdlb.py --dataset banana --data_dim $n --checkpoint_dir checkpoints/banana --model mlp --units $nn_size,$nn_size,$nn_size --lamb {1} train --batchsize 1024 --num_Ck_samples 2 --last_step 4000 --checkpoint_interval 4000 --y_init quick --y_quick_topn 20 --lr 1e-3 --lr_schedule ::: 0.1 0.3 1 3 10 30 100 300
-
Eval:
parallel python rdlb.py --seed {1} --dataset banana --data_dim $n --checkpoint_dir checkpoints/banana --model mlp --units $nn_size,$nn_size,$nn_size --lamb {2} eval --batchsize 1024 --y_init exhaustive --num_Ck_samples 5 ::: 0 1 2 3 ::: 0.1 0.3 1 3 10 30 100 300
The bash variables $n and $nn_size are set as follows (somewhat arbitrarily, as I didn't do much architecture tuning):
######################
n=2
nn_size=200
######################
n=4
nn_size=400
######################
n=16
nn_size=1000
######################
n=100
nn_size=1000
######################
n=500
nn_size=1000
In the following commands, $d stands for the intrinsic dimensionality of the images, and is set to 2 or 4; $lamb is the coefficient in front of the distortion term in the Lagrangian, and takes value in 5e-4, 3e-4, 1e-4, 3e-5, 1e-5, 3e-6, 1e-6.
In the training commands, we set the amount of training based on how long it takes the model with the highest lambda to converge. The models with lower lambda values typically require less training.
In the evaluation commands, --batchsize
sets the number of GAN image samples to evaluate on; we use --no_cast_xhat
in order to treat the reconstruction as floating-point valued (rather than uint8
) when computing the MSE, since we
take the alphabets of the GAN source to be continuous (more precisely, [0, 1]^n).
As the neural compression autoencoders assume input and output
to be in [0, 255] (and similarly do resnet_vae.py
and the evaluation code, for consistency), the resulting MSE needs to be divided by 255^2
to align with results in the paper.
-
Upper bound ResNet-VAE:
Train:
python resnet_vae.py -V --latent_channels 4,8,16,32,64,128 --num_filters 256 --lambda $lamb --flat_z0 --img_dim 128 --maf_units 32,16 --maf_stacks 3 --checkpoint_dir checkpoints/gan/dataset=basenji-data_dim=$d train --dataset basenji --data_dim $d --batchsize 8 --epochs 600 --steps_per_epoch 1000 --warmup 400 --patience 10
Evaluation (for measurements of variability):
python resnet_vae.py -V --latent_channels 4,8,16,32,64,128 --num_filters 256 --lambda $lamb --flat_z0 --img_dim 128 --maf_units 32,16 --maf_stacks 3 --checkpoint_dir checkpoints/gan/dataset=basenji-data_dim=$d eval --dataset basenji --data_dim $d --no_cast_xhat --batchsize 100 --results_dir results/gan/
-
Mean-scale hyperprior model (Minnen et al., 2018):
Train (using uniform noise approximation to quantization):
python mbt2018.py -V --checkpoint_dir checkpoints/gan/dataset=basenji-data_dim=$d --lambda $lamb train --dataset basenji --data_dim $d --batchsize 8 --epochs 600 --steps_per_epoch 1000 --warmup 400 --patience 10
Evaluation (using actual quantization):
python mbt2018.py -V --checkpoint_dir checkpoints/gan/dataset=basenji-data_dim=$d --lambda $lamb eval --dataset basenji --data_dim $d --no_cast_xhat --batchsize 100 --results_dir results/gan/
-
Channel-wise autoregressive model (Minnen et al., 2020):
Identical to the above, except the script name
mbt2018.py
is replaced withms2020.py
.
-
Train:
python rdlb.py -V --dataset basenji --data_dim $d --model cnn --units 4,8,16,1024 --kernel_dims 9,5,3 --lamb 4000 --checkpoint_dir checkpoints/gan/dataset=basenji-data_dim=$d --chunksize 16 train --batchsize 2048 --num_Ck_samples 2 --last_step 4000 --checkpoint_interval 1000 --y_init quick --y_quick_topn 3 --lr 1e-4 --anneal_lamb --target_lambs 50,100,200,300,500,1000,2000,3000,4000
Note that rdlb.py
here trains the lower bound model with a gradually increasing lambda, and saves model checkpoints
along the way when lambda reaches a value in --target_lambs
. Thus, one training run suffices to sweep out an
entire R-D lower bound curve, although the performance can likely still be improved by further training/fine-tuning the intermediate
checkpoints saved along the way.
-
Eval:
parallel -j1 python rdlb.py -V --seed {1} --dataset basenji --data_dim $d --model cnn --units 4,8,16,1024 --kernel_dims 9,5,3 --chunksize 16 eval --batchsize 2048 --num_Ck_samples 5 --y_init exhaustive --ckpt {2} ::: 0 1 2 3 4 5 6 7 8 9 ::: checkpoints/gan/basenji/rdlb-data_dim=$d-model=cnn-units=4_8_16_1024-lamb=4000.0-k=2048/step=*-kerasckpt.index
Again, the results can be collected using the convenience method utils.aggregate_lb_results
.
You can train the proposed hierarchical ResNet VAE with:
python resnet_vae.py -V --latent_channels 4,8,16,32,64,128 --ar_prior_levels 4 --ar_slices 8 --num_filters 256 --lambda $lamb --checkpoint_dir checkpoints/img_compression train --dataset cocotrain --patchsize 256 --batchsize 8 --epochs 600 --steps_per_epoch 10000 --warmup 400 --patience 20
and evaluate on Kodak and Tecnick with
parallel -j1 python resnet_vae.py -V --latent_channels 4,8,16,32,64,128 --ar_prior_levels 4 --ar_slices 8 --num_filters 256 --lambda $lamb --checkpoint_dir checkpoints/img_compression eval --dataset {1} --results_dir results/img_compression --no_sub_dir ::: 0.005 0.01 0.02 0.04 0.08 0.16 ::: kodak tecnick
Above $lamb takes value in 0.08, 0.04, 0.02, 0.01, 0.005, 0.0025, 0.001.