Because few image-music datasets exist in the field, this work investigates feasible and efficient methods of collecting one. It combines the Google MusicCaps text-music dataset with the Stable Diffusion text-to-image generative model. The result is ImMuTe, a reliable image-music-text dataset of 5521 data pairs for training and testing purposes.
- GitHub repository
git clone https://github.com/juliagsy/immute
- Hugging Face
from datasets import load_dataset
dataset = load_dataset("juliagsy/immute")
- Manual script
Example usage:
from immute.dataset import ImMuTe
immute = ImMuTe("images", "caption.json", "audios", start=0, end=100, sampling_rate=32000, pixel=256)
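Since the dataset is meant for both training and testing, one way to use the `start` and `end` arguments is to carve the 5521 pairs into two index ranges. The sketch below is only an illustration: the `split_ranges` helper and the 80/20 ratio are hypothetical and not part of the repository, and it assumes `end` is exclusive, which the source does not confirm.

```python
from typing import Tuple

TOTAL_PAIRS = 5521  # dataset size stated in the description above

Range = Tuple[int, int]


def split_ranges(total: int = TOTAL_PAIRS, train_percent: int = 80) -> Tuple[Range, Range]:
    """Return ((train_start, train_end), (test_start, test_end)).

    Hypothetical helper: the 80/20 default is an arbitrary choice,
    and each range is treated as half-open, i.e. [start, end).
    """
    if not 0 <= train_percent <= 100:
        raise ValueError("train_percent must be between 0 and 100")
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = total * train_percent // 100
    return (0, cut), (cut, total)


train_range, test_range = split_ranges()
print(train_range, test_range)
```

Each range could then be passed as `start` and `end` when constructing `ImMuTe`, provided the class really treats `end` as exclusive; if it is inclusive, subtract one from each upper bound.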