# Evaluate generative multimodal audio models

Use this notebook to evaluate generative multimodal audio models. It uses the following metrics:

- Fréchet Audio Distance (FAD)
    - FAD measures how similar the distributions of generated and real audio samples are in the feature space of a pre-trained audio classifier
    - A smaller FAD score indicates that the generated audio is closer to the ground truth audio
- Kullback-Leibler Divergence (KLD)
    - KLD measures how one probability distribution diverges from a second, expected probability distribution
    - $D_{KL}(P \parallel Q) = \sum_i P(i) \log\left(\frac{P(i)}{Q(i)}\right)$, where $P$ is the ground truth and $Q$ is the generated audio
    - $D_{KL}(P \parallel Q)$ measures how much information is lost when $Q$ is used to approximate $P$
    - A smaller KLD score indicates that the generated audio is closer to the ground truth audio
- Audio-Visual Synchronisation Score
    - Reported as the fraction of files that are considered synchronised (0: no files are synchronised, 1: all files are synchronised)
- AVCLIP Score
    - Following the ideas of the ImageBind score and the CLIP score: since AVCLIP is trained to learn a joint embedding across six distinct modalities, the cosine similarity of its embeddings for the video and the generated audio can capture the semantic relevance between them.
    - A higher AVCLIP score indicates that the generated audio is more semantically relevant to the video (max 1)
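
To make the definitions above concrete, here is a minimal, dependency-free sketch of three of the underlying quantities on toy data. It is not this notebook's implementation: the function names (`kl_divergence`, `cosine_similarity`, `synced_fraction`) and the `eps` smoothing constant are assumptions made for illustration, and the real metrics operate on features extracted by pre-trained models.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """D_KL(P || Q) for two discrete distributions given as equal-length lists.

    eps is an assumption of this sketch; it guards against log(0) and
    division by zero when an entry of q is zero.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (range -1 to 1)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def synced_fraction(is_synced):
    """Fraction of files flagged as synchronised (0: none, 1: all)."""
    return sum(bool(s) for s in is_synced) / len(is_synced)


# Toy values, chosen only to exercise the functions.
ground_truth = [0.7, 0.2, 0.1]  # P
generated = [0.5, 0.3, 0.2]  # Q

print(kl_divergence(ground_truth, ground_truth))  # identical distributions: 0.0
print(kl_divergence(ground_truth, generated))  # positive, grows as Q drifts from P
print(cosine_similarity([1.0, 0.0], [2.0, 0.0]))  # same direction: 1.0
print(synced_fraction([True, False, True, True]))  # 0.75
```

Note that KLD is asymmetric: swapping the ground truth and the generated distribution generally gives a different value.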

❗️**Note**❗️ Sometimes GPU resources are not freed until the Jupyter kernel is restarted. If you encounter a CUDA out-of-memory error, restart the kernel and run the notebook again.
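
Before restarting the kernel, it may be worth trying to release cached GPU memory from within the notebook. The sketch below assumes the metrics run on PyTorch (the notebook does not state the framework) and uses a hypothetical helper name, `try_free_gpu_memory`. It can only release memory that is no longer referenced, so a kernel restart may still be necessary.

```python
import gc


def try_free_gpu_memory() -> bool:
    """Best-effort release of cached CUDA memory.

    Returns True if the CUDA cache was cleared, False if PyTorch or CUDA
    is not available. Assumes PyTorch is the framework in use.
    """
    # Drop unreachable Python objects first so their tensors can be freed.
    gc.collect()
    try:
        import torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    # Return cached, unused blocks from PyTorch's allocator to the driver.
    torch.cuda.empty_cache()
    return True


print(try_free_gpu_memory())
```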

## Setup
You must have the following:

1. Videos with generated audio
2. Ground truth videos
3. An environment initialised according to [README.md](README.md)
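
A quick sanity check of the first two prerequisites could look like the sketch below. The helper `count_videos` is hypothetical, and both the `.mp4` suffix and the flat directory layout are assumptions; adjust them to match your data.

```python
import tempfile
from pathlib import Path


def count_videos(directory, suffix=".mp4"):
    """Count files with the given suffix directly inside directory.

    Raises FileNotFoundError if the directory does not exist.
    """
    directory = Path(directory)
    if not directory.is_dir():
        raise FileNotFoundError(f"Not a directory: {directory}")
    return sum(
        1 for p in directory.iterdir() if p.is_file() and p.suffix.lower() == suffix
    )


# Demonstration on a throwaway directory instead of real data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "clip_0.mp4").touch()
    (Path(tmp) / "notes.txt").touch()
    print(count_videos(tmp))  # 1
```

Running the same check on each sample directory and on the ground truth directory before the evaluation starts helps catch mistyped paths early.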

## Helpers and Imports

In [None]:
from pathlib import Path
from pprint import pprint

from eval_utils.utils import dataclass_from_dict
from configs.evaluation_cfg import EvaluationCfg, PipelineCfg
from metrics.evaluation_metrics import EvaluationMetrics
from metrics.evaluation_metrics_combiner import EvaluationMetricsCombiner

%reload_ext autoreload
%autoreload 2

In [None]:
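def get_evaluation_config(eval_cfg_dict: dict) -> EvaluationCfg:
    """Build an EvaluationCfg (with its nested PipelineCfg) from a plain dict."""
    evaluation_cfg = dataclass_from_dict(EvaluationCfg, eval_cfg_dict)
    # Sanity-check that the nested dict was converted into config objects
    assert isinstance(evaluation_cfg, EvaluationCfg)
    assert isinstance(evaluation_cfg.pipeline, PipelineCfg)
    return evaluation_cfg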


def get_calculated_evaluation_metrics(
    evaluation_cfg: EvaluationCfg, force_recalculate: bool = False
) -> EvaluationMetrics:
    """Run the configured pipeline for one sample directory and export the results."""
    print("Evaluating", evaluation_cfg.sample_directory.as_posix())
    evaluation_metrics = EvaluationMetrics(evaluation_cfg)
    assert isinstance(evaluation_metrics, EvaluationMetrics)
    evaluation_metrics.run_all(force_recalculate)
    evaluation_metrics.export_results()
    print("Evaluation done\n")
    return evaluation_metrics

## Define configurations

<p align="center">❗️<b>NOTE</b>❗️</p>
<p align="center">Only modify the following cell with your arguments and paths. <span style="color:red">Do not touch any other cell unless stated otherwise.</span></p>
<p align="center">❗️<b>NOTE</b>❗️</p>

1. Define the IDs and paths to the videos with model-generated audio (*ids_sample_dirs*)
2. Define the path to ground truth videos (*gt_dir*)
3. Define the evaluation pipeline (*pipeline_cfg_dict*)
    - Define the metrics to be used (fad, kld, insync, avclip_score)
    - Define the parameters for individual metrics (see ./configs/evaluation_cfg.py for more details)
    - Examples:
        - Only the insync metric with default params: {"insync": {}}
        - All the metrics with default params: {"fad": {}, "kld": {}, "insync": {}, "avclip_score": {}}
        - Only FAD, calculated using PCA: {"fad": {"use_pca": True}}
4. Define verbosity (*is_verbose*)
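
The pipeline dictionary described in step 3 maps each metric name to a dictionary of parameter overrides, and metrics that are left out are not run. The sketch below illustrates that idea with stand-in classes. Every name in it (`ToyFadCfg`, `TOY_METRIC_CFGS`, `build_toy_pipeline`, and so on) is hypothetical; the real definitions live in ./configs/evaluation_cfg.py and may differ.

```python
from dataclasses import dataclass


@dataclass
class ToyFadCfg:
    use_pca: bool = False  # mirrors the "use_pca" example; default is assumed


@dataclass
class ToyKldCfg:
    pass


@dataclass
class ToyInsyncCfg:
    pass


@dataclass
class ToyAvclipScoreCfg:
    pass


# Hypothetical registry: metric name -> config class
TOY_METRIC_CFGS = {
    "fad": ToyFadCfg,
    "kld": ToyKldCfg,
    "insync": ToyInsyncCfg,
    "avclip_score": ToyAvclipScoreCfg,
}


def build_toy_pipeline(pipeline_cfg_dict):
    """Turn {"metric": {param: value}} into {"metric": config object}."""
    pipeline = {}
    for name, params in pipeline_cfg_dict.items():
        if name not in TOY_METRIC_CFGS:
            raise ValueError(f"Unknown metric: {name}")
        # An empty dict keeps every default; given keys override them.
        pipeline[name] = TOY_METRIC_CFGS[name](**params)
    return pipeline


print(build_toy_pipeline({"insync": {}}))
print(build_toy_pipeline({"fad": {"use_pca": True}}))
```

In this sketch a misspelled metric name fails early with a `ValueError`, and a misspelled parameter fails with a `TypeError` from the dataclass constructor.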

In [None]:
ids_sample_dirs = [
    ("gh-syncsonix-flattened-best", "/home/hdd/ilpo/logs/synchronisonix/24-04-17T15-58-35/checkpoints/generated_samples_24-04-18T10-03-37"),
    ("gh-syncsonix-unflattened", "/home/hdd/ilpo/checkpoints/synchronisonix/24-02-27T16-46-55/24-02-27T16-46-55/generated_samples_24-04-17T14-24-06"),
    ("gh-specvqgan", "/home/ilpo/repos/SpecVQGAN/logs/2024-04-16T14-16-42_greatesthit_transformer/samples_2024-04-17T09-54-50/GreatestHits_test/videos/partition1"),
]
gt_dir = "/home/hdd/ilpo/datasets/greatesthit/vis-data-256_h264_video_25fps_256side_24000hz_aac_len_5_splitby_random"
pipeline_cfg_dict = {"fad": {}, "kld": {}, "insync": {}, "avclip_score": {}}
is_verbose = True

In [None]:
gt_dir = Path(gt_dir)
ids_sample_dirs = [(sample_id, Path(p)) for sample_id, p in ids_sample_dirs]

## Initialise configurations

In [None]:
# An empty dict is falsy, so this also rejects an empty pipeline
assert pipeline_cfg_dict, "Pipeline is not defined or it is empty."
evaluation_cfgs = [
    get_evaluation_config(
        {
            "id": sample_id,
            "sample_directory": sample_dir,
            "gt_directory": gt_dir,
            "pipeline": pipeline_cfg_dict,
            "verbose": is_verbose,
        }
    )
    for (sample_id, sample_dir) in ids_sample_dirs
]

# Metrics
Each *EvaluationMetrics* instance is initialised with an *EvaluationCfg*, which defines the evaluation pipeline, and calculates the metrics for a single sample directory (one *EvaluationCfg* entry). *EvaluationMetricsCombiner* then combines the metrics of all the sample directories for plotting.
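
As a rough mental model of the combining step, per-directory results can be regrouped by metric so that the models are easy to compare side by side. The sketch below is an assumption about the general idea only: the output structure of `EvaluationMetricsCombiner.combine()` is not shown in this notebook, and `combine_results`, the IDs, and the numbers are placeholders.

```python
def combine_results(results_by_id):
    """Regroup {sample_id: {metric: value}} as {metric: {sample_id: value}}."""
    combined = {}
    for sample_id, metric_values in results_by_id.items():
        for metric, value in metric_values.items():
            combined.setdefault(metric, {})[sample_id] = value
    return combined


# Placeholder values, not real evaluation results.
toy_results = {
    "model-a": {"fad": 1.0, "kld": 2.0},
    "model-b": {"fad": 3.0, "kld": 4.0},
}
print(combine_results(toy_results))
# {'fad': {'model-a': 1.0, 'model-b': 3.0}, 'kld': {'model-a': 2.0, 'model-b': 4.0}}
```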

In [None]:
metrics = [
    get_calculated_evaluation_metrics(evaluation_cfg)
    for evaluation_cfg in evaluation_cfgs
]


In [None]:
combined_results = EvaluationMetricsCombiner(metrics)
pprint(combined_results.combine())

## Plotting
Plot the combined results. **Here you can define the plotting directory**.

In [None]:
# Define the output directory for the plots (optional).
# If it is not defined, the plots are not saved; they are returned as matplotlib
# figures and displayed in the notebook, from where you can save them manually.
plot_dir = "."
combined_results.plot(plot_dir)

# Clean up

In [None]:
for m in metrics:
    m.remove_resampled_directories()