# Evaluate generative multimodal audio models

Use this notebook to evaluate generative multimodal audio models. It reports the following metrics:

- Fréchet Audio Distance (FAD)
- Kullback-Leibler Divergence (KLD)
- ImageBind score (IB)
- Synchronisation Error (SE)
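As a point of reference for KLD, the divergence between two discrete distributions can be sketched in a few lines. This is a generic illustration, not the implementation in `metrics.evaluation_metrics`; the function name `kl_divergence` and the `eps` floor are assumptions made for the example. In audio evaluation the two inputs are typically class-probability vectors that a pretrained classifier produces for the ground-truth and the generated clip.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Return KL(p || q) in nats for two discrete distributions.

    Both inputs are normalised to sum to 1. `eps` is a hypothetical
    floor that avoids division by zero when q has empty bins.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same number of bins")
    p_sum, q_sum = sum(p), sum(q)
    if p_sum <= 0 or q_sum <= 0:
        raise ValueError("p and q must have positive mass")

    total = 0.0
    for p_i, q_i in zip(p, q):
        p_i, q_i = p_i / p_sum, q_i / q_sum
        if p_i > 0:  # terms with p_i == 0 contribute nothing
            total += p_i * math.log(p_i / max(q_i, eps))
    return total


identical = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.5, 0.5], [0.25, 0.75])
```

Identical distributions give a divergence of zero, and the value grows as the generated distribution drifts away from the reference.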

**Note:** GPU resources are sometimes not freed until the Jupyter kernel is restarted. If you encounter a CUDA out-of-memory error, restart the kernel and run the notebook again.
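Before restarting, it may be worth trying to release cached memory from inside the running kernel. The helper below is a minimal sketch, assuming the metrics run on PyTorch; `release_gpu_memory` is a hypothetical name and not part of this repository. It only helps when no live Python references to the tensors remain, so a kernel restart is still the reliable fallback.

```python
import gc


def release_gpu_memory() -> bool:
    """Try to return cached GPU memory to the driver.

    Returns True if the CUDA cache was emptied, False if PyTorch or
    a CUDA device is not available in this environment.
    """
    gc.collect()  # drop unreachable Python objects that still hold tensors
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    torch.cuda.empty_cache()  # release cached blocks held by the allocator
    return True


released = release_gpu_memory()
```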

## Setup
You must have:

1. Generated audio samples
1. Ground-truth (GT) audio for the videos that were used to generate the samples
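A quick sanity check before running the metrics is to confirm that every generated sample has a ground-truth counterpart. The sketch below assumes that files are paired by the part of the file name before the first dot; that pairing rule and the names `clip_ids` and `compare_directories` are assumptions for illustration, so adapt them to the actual naming scheme of your data.

```python
import tempfile
from pathlib import Path


def clip_ids(directory: Path) -> set[str]:
    """Collect clip identifiers: file names up to the first dot."""
    return {p.name.split(".")[0] for p in directory.iterdir() if p.is_file()}


def compare_directories(sample_dir: Path, gt_dir: Path):
    """Return (samples without GT, GT without samples), both sorted."""
    samples, gts = clip_ids(sample_dir), clip_ids(gt_dir)
    return sorted(samples - gts), sorted(gts - samples)


# Demo on throwaway directories with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    demo_samples = Path(tmp) / "samples"
    demo_gt = Path(tmp) / "GT"
    demo_samples.mkdir()
    demo_gt.mkdir()
    for name in ("clip_a.mp4", "clip_b.mp4"):
        (demo_samples / name).touch()
    for name in ("clip_a.wav", "clip_c.wav"):
        (demo_gt / name).touch()
    missing_gt, unused_gt = compare_directories(demo_samples, demo_gt)
```

In the demo, `missing_gt` lists the sample that has no ground truth and `unused_gt` lists the ground-truth clip that has no sample.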

## Configuration

In [1]:
from pathlib import Path
from pprint import pprint

from eval_utils.utils import dataclass_from_dict
from configs.evaluation_cfg import EvaluationCfg
from metrics.evaluation_metrics import EvaluationMetrics

%reload_ext autoreload
%autoreload 2

In [2]:
sample_dir = Path("/home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug")
# init evaluation config with default FAD, KLD and InSync settings
evaluation_cfg = dataclass_from_dict(EvaluationCfg, {"sample_directory": sample_dir, "verbose": True, "pipeline": {"fad": {}, "kld": {}, "insync": {}}})
assert isinstance(evaluation_cfg, EvaluationCfg)


Evaluation pipeline:
sample_directory: /home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug
gt_directory: /home/hdd/data/greatesthits/evaluation/GT
result_directory: /home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug
PipelineCfg(fad=FADCfg(model_name='vggish', sample_rate=16000, use_pca=False, use_activation=False, dtype='float32', embeddings_fn='vggish_embeddings.npy'), kld=KLDCfg(pretrained_length=10, batch_size=10, num_workers=10, duration=2.0), insync=InSyncCfg(exp_name='24-01-25T18-57-06', device='cuda:0', vfps=25, afps=24000, input_size=224, ckpt_parent_path='./logs/sync_models'))
verbose: True
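The output above shows that empty dictionaries such as `"fad": {}` are expanded into fully populated config objects with default values. The sketch below shows one way such a nested conversion could work. It is a hypothetical stand-in, not the code of `eval_utils.utils.dataclass_from_dict`; the `Toy*` classes are invented for the example and only borrow two default values from the printed `FADCfg`.

```python
from dataclasses import dataclass, field, fields, is_dataclass


def build_dataclass(cls, data: dict):
    """Recursively build dataclass `cls` from a (possibly nested) dict.

    Keys missing from `data` fall back to the field defaults. Assumes
    field annotations are real classes, not strings.
    """
    kwargs = {}
    for f in fields(cls):
        if f.name not in data:
            continue  # let the dataclass default apply
        value = data[f.name]
        if is_dataclass(f.type) and isinstance(value, dict):
            value = build_dataclass(f.type, value)
        kwargs[f.name] = value
    return cls(**kwargs)


@dataclass
class ToyFADCfg:
    model_name: str = "vggish"
    sample_rate: int = 16000


@dataclass
class ToyPipelineCfg:
    fad: ToyFADCfg = field(default_factory=ToyFADCfg)


@dataclass
class ToyEvaluationCfg:
    sample_directory: str
    verbose: bool = False
    pipeline: ToyPipelineCfg = field(default_factory=ToyPipelineCfg)


toy_cfg = build_dataclass(
    ToyEvaluationCfg,
    {"sample_directory": "samples", "verbose": True, "pipeline": {"fad": {}}},
)
```

Here `toy_cfg.pipeline.fad` is a `ToyFADCfg` filled with its defaults, mirroring how `"fad": {}` becomes a complete `FADCfg` in the real output.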


## Metrics
The metrics class is initialised with an *EvaluationCfg* instance, which defines the evaluation pipeline.

In [3]:
metrics = EvaluationMetrics(evaluation_cfg)
# metrics.run_all()
# print(f"FAD: {metrics.run_fad()}")
# print(f"KLD: {metrics.run_kld()}")
pprint(f"InSync: {metrics.run_insync()}")

Reencoding directory generated_samples_24-01-04T14-12-53_debug to 25 fps, 24000 afps, 224 input size: 100%|██████████| 50/50 [00:03<00:00, 14.80it/s]


No need to reencode

('InSync: (0.96, '
 "{'/home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug/2015-03-30-01-48-12_denoised_186.mp4.mp4': "
 "{'class': 1, 'prob': 0.83}, "
 "'/home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug/2015-09-12-04-15-35-433_denoised_400.mp4.mp4': "
 "{'class': 1, 'prob': 0.854}, "
 "'/home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug/2015-09-24-15-43-24-742_denoised_544.mp4.mp4': "
 "{'class': 1, 'prob': 0.853}, "
 "'/home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug/2015-09-23-16-13-51-1_denoised_741.mp4.mp4': "
 "{'class': 1, 'prob': 0.779}, "
 "'/home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug/2015-03-25-00-58-22_denoised_572.mp4.mp4': "
 "{'class': 1, 'prob': 0.834}, "
 "'/home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_s

In [4]:
metrics.export_results()

Results exported to /home/hdd/data/greatesthits/evaluation/23-12-15T11-35-49/generated_samples_24-01-04T14-12-53_debug/results_24-02-03T09-06-26.yaml
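The per-sample dictionary that `run_insync()` printed in cell 3 is hard to read in its raw form. A small helper can condense it. This is a sketch under assumptions: `summarise_insync` is a hypothetical function, the dictionary layout (`class` and `prob` per file) is inferred from the printed output, and the short keys below replace the long file paths purely for readability. The `class` and `prob` values are the five visible in the output above.

```python
from collections import Counter
from statistics import mean


def summarise_insync(per_sample: dict) -> dict:
    """Condense {path: {'class': int, 'prob': float}} into a summary."""
    if not per_sample:
        raise ValueError("per_sample is empty")
    probs = [result["prob"] for result in per_sample.values()]
    class_counts = Counter(result["class"] for result in per_sample.values())
    return {
        "n_samples": len(probs),
        "mean_prob": mean(probs),
        "min_prob": min(probs),
        "class_counts": dict(class_counts),
    }


# Keys are shortened placeholders, not the real file paths.
example_results = {
    "denoised_186": {"class": 1, "prob": 0.83},
    "denoised_400": {"class": 1, "prob": 0.854},
    "denoised_544": {"class": 1, "prob": 0.853},
    "denoised_741": {"class": 1, "prob": 0.779},
    "denoised_572": {"class": 1, "prob": 0.834},
}
summary = summarise_insync(example_results)
```

The same helper can be applied to the second element of the tuple returned by `metrics.run_insync()`, assuming it keeps the layout shown in the output.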
