In [None]:
!pip install fiftyone huggingface-hub

In [None]:
import os 
import huggingface_hub

audio_datasets_path = "."

if not os.path.exists(audio_datasets_path):
    os.makedirs(audio_datasets_path, exist_ok=True)

huggingface_hub.snapshot_download(
    repo_id="MahiA/GT-Music-Genre", 
    repo_type="dataset", 
    local_dir=os.path.join(audio_datasets_path, "GT-Music-Genre")
    )

In [None]:
import os
import shutil
import pandas as pd

# Base directory
base_dir = "GT-Music-Genre"

# Create directory
save_dir = os.path.join(base_dir, "test")
os.makedirs(save_dir, exist_ok=True)

# Read the CSV file
df = pd.read_csv(os.path.join(base_dir, "test.csv"))

# Iterate through the DataFrame rows
for _, row in df.iterrows():
    # Get the source file path and genre
    source_file = os.path.join(base_dir, row['path'])
    genre = row['classname']
    filename = os.path.basename(source_file)
    
    # Create genre subdirectory inside the test directory if it doesn't exist
    genre_dir = os.path.join(save_dir, genre)
    os.makedirs(genre_dir, exist_ok=True)
    
    # Define destination path
    dest_file = os.path.join(genre_dir, filename)
    
    # Move the file
    if os.path.exists(source_file):
        shutil.move(source_file, dest_file)
    else:
        print(f"Warning: File not found - {source_file}")

print("Files have been organized into genre subdirectories within the directory")
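
To sanity-check the result, it can help to count how many files landed in each genre folder. The helper below is a minimal sketch and not part of the original workflow: `count_files_per_class` is a name I made up for illustration, and it only assumes the one-folder-per-genre layout created by the cell above.

```python
from collections import Counter
from pathlib import Path


def count_files_per_class(root):
    """Count the files in each immediate subdirectory of `root` (hypothetical helper)."""
    counts = Counter()
    root = Path(root)
    if not root.is_dir():
        # Nothing to count if the directory has not been created yet
        return counts
    for class_dir in sorted(root.iterdir()):
        if class_dir.is_dir():
            counts[class_dir.name] = sum(1 for p in class_dir.iterdir() if p.is_file())
    return counts


# "GT-Music-Genre/test" is the save_dir used in the cell above
for genre, n_files in count_files_per_class("GT-Music-Genre/test").items():
    print(f"{genre}: {n_files}")
```

If a genre is missing or has far fewer files than the others, check the warnings printed by the move loop.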

In [1]:
import os

os.environ['FIFTYONE_ALLOW_LEGACY_ORCHESTRATORS'] = 'true'

Now, let's [download a plugin](https://github.com/danielgural/audio_loader/tree/main) that will create spectrograms from the audio files.

FiftyOne's plugin framework lets you extend and customize the functionality of FiftyOne to suit your needs. If you'd like to learn more about plugins, consider attending one of our monthly workshops. You can [see the full schedule here](https://voxel51.com/computer-vision-events/) and look for the *Advanced Computer Vision Data Curation and Model Evaluation* workshop.

In [None]:
!fiftyone plugins download https://github.com/danielgural/audio_loader

Once the plugin is downloaded there are two ways you can use it.

1. You can launch the FiftyOne App in your local browser by opening a terminal and running `fiftyone app launch`. Once the app has launched, press the backtick (\`) key to open the Operator browser. Type "Load Audio" and click on the operator. This opens the form for the Load Audio plugin; each field appears once you fill in the previous one. You can choose to kick off a [delegated service](https://docs.voxel51.com/plugins/developing_plugins.html#delegated-execution) if you'd like.

Below is an example of the form:

<img src="load_audio_form.png" width="50%"/>

The plugin will take some moments to run, depending on the size of your dataset. In this case, it should take no more than 1 minute.

2. Alternatively, instead of launching the app via terminal, you can launch the app in the cell of a Jupyter Notebook. To do that you must first create a dummy dataset and then launch the app in the cell. The pattern for this is as follows:

```python
import fiftyone as fo

dummy_dataset = fo.Dataset()

fo.launch_app(dummy_dataset)
```
Once the app has launched, press the backtick (\`) key to open the Operator browser, then follow the instructions outlined above.

In both cases, you can load the dataset once the plugin has created it. I named my dataset `music_genre_spectograms`, so I can load it as follows:

In [2]:
import fiftyone as fo

music_dataset = fo.load_dataset("music_genre_spectograms")

In [None]:
fo.launch_app(music_dataset)

We'll need the genre labels later, so let's collect them now:

In [3]:
music_genres = music_dataset.distinct("ground_truth.label")

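Next, we'll embed each clip with Music2Latent, an audio autoencoder that compresses a waveform into a sequence of latent vectors. Calling `encode` with `extract_features=True` returns the encoder's features rather than the compressed latents, and we average those features over time to get one fixed-size vector per clip.

Note: you should be on `torch<2.6` and `torchvision<0.21.0` for this step.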

In [None]:
!pip install music2latent librosa

In [None]:
import librosa
from torch.nn.functional import normalize

from music2latent import EncoderDecoder

music_to_latent_model = EncoderDecoder()

for sample in music_dataset.iter_samples(autosave=True):
    wav_path = sample["wav_path"]
    sample_rate = sample["frame_rate"]
    loaded_wave, _ = librosa.load(wav_path, sr=sample_rate)
    latents = music_to_latent_model.encode(loaded_wave, extract_features=True)
    embedding = latents.mean(dim=-1).squeeze(0) 
    normalized_embedding = normalize(embedding, p=2, dim=0)
    sample["wav_embedding"] = normalized_embedding.detach().cpu().numpy() #shape (8192,)
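
The loop above collapses a variable-length sequence of features into a single vector by averaging over the time axis, then scales that vector to unit length. Here is a minimal NumPy sketch of the same two steps on toy data. The array sizes are made up for illustration (the real features have 8192 dimensions), and `pool_and_normalize` is a hypothetical helper, not part of Music2Latent.

```python
import numpy as np


def pool_and_normalize(features):
    """Mean-pool a (dim, time) feature matrix over time, then L2-normalize it."""
    pooled = features.mean(axis=-1)       # (dim, time) -> (dim,)
    norm = np.linalg.norm(pooled)
    return pooled / max(norm, 1e-12)      # guard against an all-zero vector


rng = np.random.default_rng(0)

# Toy stand-ins for two clips of different lengths (8 feature dims each)
clip_a = pool_and_normalize(rng.normal(size=(8, 50)))
clip_b = pool_and_normalize(rng.normal(size=(8, 120)))

# With unit-length vectors, the dot product is the cosine similarity
cosine_similarity = float(clip_a @ clip_b)
print(clip_a.shape, clip_b.shape, round(cosine_similarity, 3))
```

Pooling is what makes clips of different durations comparable, and normalizing means downstream similarity searches only look at direction, not magnitude.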

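Next, we'll compute embeddings with LAION's CLAP model. CLAP was trained on audio-text pairs, so audio clips and text descriptions end up in a shared embedding space.

We'll use this same model below for zero-shot audio classification.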

In [None]:
import torch
from torch.nn.functional import normalize

import librosa

from transformers import ClapModel, ClapProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"

clap_model = ClapModel.from_pretrained("laion/clap-htsat-unfused").to(device)

clap_processor = ClapProcessor.from_pretrained("laion/clap-htsat-unfused")

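# The CLAP feature extractor expects audio at its own sampling rate (48 kHz)
clap_sample_rate = clap_processor.feature_extractor.sampling_rate

for sample in music_dataset.iter_samples(autosave=True):
    wav_path = sample["wav_path"]
    # Resample on load so the waveform matches what the processor expects
    loaded_wave, _ = librosa.load(wav_path, sr=clap_sample_rate)
    clap_inputs = clap_processor(
        audios=loaded_wave, sampling_rate=clap_sample_rate, return_tensors="pt"
    ).to(device)
    with torch.no_grad():  # inference only, no gradients needed
        audio_embed = clap_model.get_audio_features(**clap_inputs).squeeze(0)
    normalized_embedding = normalize(audio_embed, p=2, dim=0)
    sample["clap_embeddings"] = normalized_embedding.detach().cpu().numpy()  # shape (512,)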

I'll also compute embeddings using AIMv2, which is a vision encoder. [Read this blog](https://medium.com/voxel51/visual-understanding-with-aimv2-76c58dcd68f9) for a deep dive into the AIMv2 family of models.

This, dare I say, "multimodal" approach to analyzing embeddings provides different ways of exploring and understanding musical content, ultimately leading to an experiment with vision-language models (VLMs). Models like Music2Latent and CLAP operate directly on the raw audio waveforms, capturing temporal patterns, frequency relationships, and acoustic features in their native form. 

In parallel, we can compute embeddings using AIMv2 on the spectrograms, which are visual representations that encode time-frequency relationships in a 2D format. This sets up what I think is a fascinating comparison: while the audio-specific models represent our 'traditional' approach to music understanding, the spectrogram-based analysis might hint at how suitable a vision-language model is for music classification.

By converting audio into spectrograms, we can potentially tap into the sophisticated visual pattern recognition and semantic understanding capabilities of VLMs, even though they weren't specifically trained on musical data.
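
To make "time-frequency relationships in a 2D format" concrete, here is a minimal NumPy sketch of a log-magnitude spectrogram built from a short-time Fourier transform. It is only an illustration under my own assumptions: `log_spectrogram`, `n_fft`, and `hop_length` are names and values I chose, and the audio loader plugin may compute its spectrograms differently (for example, on a mel scale).

```python
import numpy as np


def log_spectrogram(wave, n_fft=512, hop_length=128):
    """Return a (n_fft // 2 + 1, n_frames) log-magnitude spectrogram in dB."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop_length
    # Slice the waveform into overlapping, windowed frames
    frames = np.stack([
        wave[i * hop_length : i * hop_length + n_fft] * window
        for i in range(n_frames)
    ])
    # One FFT per frame; transpose so rows are frequency bins, columns are time
    magnitude = np.abs(np.fft.rfft(frames, axis=1)).T
    return 20.0 * np.log10(magnitude + 1e-10)


sample_rate = 22050
t = np.arange(sample_rate) / sample_rate      # one second of audio
tone = np.sin(2 * np.pi * 440.0 * t)          # a pure 440 Hz tone
spec = log_spectrogram(tone)

# The strongest row should sit near 440 Hz
peak_bin = int(spec.mean(axis=1).argmax())
peak_hz = peak_bin * sample_rate / 512
print(spec.shape, peak_hz)
```

Each column is a snapshot of which frequencies are active at one moment, so rendering the array as an image gives a vision model something it can treat like any other picture.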

In [None]:
!fiftyone plugins download https://github.com/harpreetsahota204/aim-embeddings-plugin

In [6]:
import fiftyone.operators as foo

embedding_operator = foo.get_operator("@harpreetsahota/aimv2_embeddings/compute_aimv2_embeddings")

In [None]:
embedding_operator(
    music_dataset,
    model_name="apple/aimv2-large-patch14-224",  # Choose any supported model
    embedding_types="mean",  # Either "cls" or "mean"
    emb_field="aimv2_embeddings",  # Name for the embeddings field
)

Let's visualize our embeddings to better understand how our different models are grouping similar music genres. 

Since our embeddings are high-dimensional, we'll use UMAP to reduce them to 2D for visualization. This will help us see if the models are clustering similar genres together.

In [None]:
import fiftyone.brain as fob

embedding_fields = ["aimv2_embeddings", "wav_embedding", "clap_embeddings"]

for field in embedding_fields:
    # "aimv2_embeddings" -> "aimv2", "wav_embedding" -> "wav", "clap_embeddings" -> "clap"
    model_name = field.split("_embedding")[0]
    results = fob.compute_visualization(
        music_dataset,
        embeddings=field,
        method="umap",
        brain_key=f"{model_name}_viz",
        num_dims=2,
    )

In [None]:
fo.launch_app(music_dataset)

Before testing our VLM approach on spectrograms, we'll establish a baseline using a specialized audio model. 

We'll use LAION's CLAP model with a zero-shot audio classification pipeline. This model was specifically trained on audio-text pairs and can classify audio into arbitrary categories without needing to be fine-tuned on our specific genre labels.

This will give us a reference point for how well a dedicated audio model performs on our genre classification task, which we can later compare against our VLM-based approach using spectrograms.
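
Conceptually, zero-shot classification with an audio-text model comes down to comparing one audio embedding against the text embedding of each candidate label and picking the closest. The sketch below shows that idea with toy 3-dimensional vectors. It is not the pipeline's actual code: `zero_shot_scores`, the `logit_scale` value, and the embeddings themselves are all made up for illustration.

```python
import numpy as np


def zero_shot_scores(audio_embedding, text_embeddings, logit_scale=1.0):
    """Softmax over cosine similarities between one audio vector and N label vectors."""
    audio = audio_embedding / np.linalg.norm(audio_embedding)
    text = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = logit_scale * (text @ audio)     # one similarity per label
    exp = np.exp(logits - logits.max())       # subtract max for numerical stability
    return exp / exp.sum()


# Toy label embeddings (a real model would produce these from text prompts)
labels = ["blues", "jazz", "rock"]
text_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
audio_embedding = np.array([0.2, 0.9, 0.1])   # toy audio embedding

scores = zero_shot_scores(audio_embedding, text_embeddings, logit_scale=10.0)
predicted = labels[int(scores.argmax())]
print(predicted, scores.round(3))
```

Because the labels are just text, swapping in a different set of genres needs no retraining, which is what makes this a convenient baseline.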

In [None]:
from transformers import pipeline

zsc_audio_classifier = pipeline(
    task="zero-shot-audio-classification", 
    model="laion/clap-htsat-unfused"
    )

In [None]:
for sample in music_dataset.iter_samples(autosave=True):
    wav_path = sample["wav_path"]
    zsc_audio_preds = zsc_audio_classifier(wav_path, candidate_labels=music_genres)
    # Predictions come back sorted by score, so the first entry is the top label
    top_pred = zsc_audio_preds[0]
    sample["zsc_audio_preds"] = fo.Classification(
        label=top_pred["label"],
        confidence=top_pred["score"],
    )


#### Model evaluation in FiftyOne

You can use the [`evaluate_classifications`](https://docs.voxel51.com/tutorials/evaluate_classifications.html?highlight=evaluate%20classification) method to evaluate the predictions of the zero-shot classifiers. This will return a `ClassificationResults` instance that provides various methods for generating aggregate evaluation reports about your model.

By default, the classifications are treated as a generic multiclass classification task. For illustration purposes, I explicitly request simple evaluation by setting the `method` argument to `simple`, but you can specify other evaluation strategies, such as `top-k` accuracy or `binary` evaluation, via the same parameter.
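
To show the kind of numbers a simple multiclass evaluation reports, here is a small pure-Python sketch that computes accuracy plus per-class precision and recall from two label lists. It is an illustration only, not FiftyOne's implementation: `simple_classification_report` and the toy labels are hypothetical.

```python
from collections import Counter


def simple_classification_report(y_true, y_pred):
    """Return overall accuracy and per-class precision/recall (hypothetical helper)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1     # predicted this class, but it was wrong
            fn[truth] += 1    # missed the true class
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted_count = tp[label] + fp[label]
        actual_count = tp[label] + fn[label]
        report[label] = {
            "precision": tp[label] / predicted_count if predicted_count else 0.0,
            "recall": tp[label] / actual_count if actual_count else 0.0,
        }
    accuracy = sum(tp.values()) / len(y_true) if y_true else 0.0
    return accuracy, report


# Toy ground truth and predictions
y_true = ["jazz", "jazz", "rock", "blues"]
y_pred = ["jazz", "rock", "rock", "blues"]

accuracy, report = simple_classification_report(y_true, y_pred)
print(accuracy)
print(report)
```

Here one jazz clip is mistaken for rock, which lowers recall for jazz and precision for rock while leaving blues untouched.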



In [None]:
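clap_results = music_dataset.evaluate_classifications(
    pred_field="zsc_audio_preds",
    gt_field="ground_truth",
    method="simple",
    eval_key="clap_simple_eval",
)

# Print per-genre precision, recall, and F1 for the CLAP baseline
clap_results.print_report(classes=music_genres)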

Moondream classification on spectrogram

Janus classification on spectrogram