This is the official repository of COCOLA: Coherence-Oriented Contrastive Learning of Musical Audio Representations. The code for the CompoNet baseline can be found at https://github.com/EmilianPostolache/componet.
conda create --name cocola python=3.11
conda activate cocola
pip install -r requirements.txt
If you wish to use MoisesDB for training/validation/testing, download it from the official website and unzip it inside ~/moisesdb_contrastive.
The other datasets (CocoChorales, Slakh2100, Musdb) are automatically downloaded and extracted by the respective PyTorch Datasets.
This project uses LightningCLI. For info about usage:
python main.py --help
For info about subcommands usage:
python main.py fit --help
python main.py validate --help
python main.py test --help
python main.py predict --help
You can pass a YAML config file as a command-line argument instead of specifying each parameter in the command:
python main.py fit --config path/to/config.yaml
See configs for examples of config files.
python main.py fit --config configs/train_all_submixtures_efficientnet.yaml
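A LightningCLI config file follows PyTorch Lightning's standard top-level layout (trainer, model, data). The fragment below is only an illustrative sketch of that shape; the actual keys and values used by this project are in the files under configs/.

```yaml
# Illustrative LightningCLI config shape; see configs/ for the real files.
seed_everything: 42
trainer:
  max_epochs: 100
  accelerator: auto
model:
  class_path: contrastive_model.contrastive_model.CoCola
data:
  class_path: ...  # dataset-specific LightningDataModule
```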
| Model Checkpoint | Train Dataset | Train Config | Description |
|---|---|---|---|
| coco_submixtures_efficientnet_bilinear | CocoChorales | configs/train_coco_submixtures_efficientnet.yaml | COCOLA model trained on the CocoChorales dataset, using EfficientNet as the embedding model and Bilinear Similarity as the similarity measure. Submixtures of stems are used during training, with 5-second audio examples at 16 kHz. |
| all_submixtures_efficientnet_bilinear | CocoChorales + Slakh2100 + MoisesDB | configs/train_all_submixtures_efficientnet.yaml | COCOLA model trained on the CocoChorales, Slakh2100, and MoisesDB datasets, using EfficientNet as the embedding model and Bilinear Similarity as the similarity measure. Submixtures of stems are used during training, with 5-second audio examples at 16 kHz. |
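Both checkpoints score an (anchor, positive) pair with a bilinear form over the two embeddings. The sketch below illustrates the general idea of a bilinear similarity head in PyTorch; the class and parameter names are illustrative, not the repository's actual implementation.

```python
import torch

# Minimal sketch of a bilinear similarity head (illustrative; the
# repository's actual implementation may differ).
class BilinearSimilarity(torch.nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learnable bilinear form W, initialized to the identity
        self.W = torch.nn.Parameter(torch.eye(dim))

    def forward(self, anchor_emb, positive_emb):
        # s(a, p) = a^T W p for every (anchor, positive) pair in the batch
        return anchor_emb @ self.W @ positive_emb.T

sim = BilinearSimilarity(dim=8)
anchors = torch.randn(4, 8)
positives = torch.randn(4, 8)
scores = sim(anchors, positives)  # shape (4, 4): pairwise similarities
```

With W initialized to the identity, the score starts out as a plain dot product and W is learned during contrastive training.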
from contrastive_model.contrastive_model import CoCola
model = CoCola.load_from_checkpoint("/path/to/checkpoint.ckpt")
model.eval()
similarities = model(x)
where x is a dict like:

x = {
    "anchor": torch.randn(batch_size, 1, 16000*5, dtype=torch.float32),   # 5 seconds at 16 kHz
    "positive": torch.randn(batch_size, 1, 16000*5, dtype=torch.float32)  # 5 seconds at 16 kHz
}
If batch_size is 1, model(x) returns the COCOLA Score between x["anchor"] and x["positive"].
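For real audio, you need to crop mono waveforms to 5-second clips at 16 kHz and pack them into the expected dict. The helper below is an illustrative sketch, not part of the repository's API.

```python
import torch

SAMPLE_RATE = 16000  # Hz
CLIP_SECONDS = 5

def make_pair(anchor_wave, positive_wave, batch_size=1):
    """Crop two mono waveforms to 5-second clips and pack them into the
    dict format the model expects. Illustrative helper, not part of the
    repository."""
    n = SAMPLE_RATE * CLIP_SECONDS
    return {
        "anchor": anchor_wave[..., :n].reshape(batch_size, 1, n),
        "positive": positive_wave[..., :n].reshape(batch_size, 1, n),
    }

# Example: two 6-second random waveforms, cropped to 5 seconds each
x = make_pair(torch.randn(SAMPLE_RATE * 6), torch.randn(SAMPLE_RATE * 6))
```

The resulting tensors have shape (batch_size, 1, 80000), matching the format shown above.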
Remove string_track001353 from the train split, as one of its stems contains fewer frames than the others.