Inconsistent results, Mean Square Error is high between runs. #24

Closed
hykilpikonna opened this issue Oct 11, 2023 · 2 comments

hykilpikonna commented Oct 11, 2023

This is based on the code in the master branch of this repository. Two runs of get_audio_embeddings on the same audio file do not produce identical results; the MSE between the two outputs is 0.1367.

My testing code:

# Load model (Choose between versions '2022' or '2023')
from CLAPWrapper import CLAPWrapper as CLAP 
import torch

with torch.no_grad():

    clap_model = CLAP("/Users/azalea/Downloads/CLAP_weights_2023.pth", version = '2023', use_cuda=False)

    # Extract text embeddings
    # text_embeddings = clap_model.get_text_embeddings(class_labels: List[str])

    audio_embeddings_1 = clap_model.get_audio_embeddings(["/Users/azalea/Downloads/test.wav"])
    audio_embeddings_2 = clap_model.get_audio_embeddings(["/Users/azalea/Downloads/test.wav"])

    print(audio_embeddings_1)
    print(audio_embeddings_2)

    # Compute mean square error
    mse = torch.mean((audio_embeddings_1 - audio_embeddings_2)**2)
    print(mse)

OS: macOS Sonoma 14.0
Python: Python 3.10.12 | packaged by conda-forge | (main, Jun 23 2023, 22:39:40) [Clang 15.0.7 ]

The audio file metadata:

> ffprobe test.wav -hide_banner
Input #0, wav, from 'test.wav':
  Metadata:
    artist          : ハンバート ハンバート
    date            : 2006
    title           : 日が落ちるまで
    album           : 道はつづく
    track           : 7
    encoder         : Lavf60.3.100
  Duration: 00:04:56.49, bitrate: 1411 kb/s
  Stream #0:0: Audio: pcm_s16le ([1][0][0][0] / 0x0001), 44100 Hz, 2 channels, s16, 1411 kb/s

soham97 commented Oct 12, 2023

Hi @hykilpikonna, the code supports files up to 7 seconds in length. If you pass a file longer than 7 seconds (in your case it's ~5 minutes), it will randomly sample a 7-second segment and provide embeddings/predictions on that segment. Since each call can pick a different segment, I think this is the cause of the high variance in predictions.

I would recommend either chunking your file into segments of 7 seconds or less, or updating CLAPWrapper.py to chunk and accumulate predictions.
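
For illustration, here is a rough sketch of the chunk-and-accumulate idea, written against plain NumPy rather than CLAPWrapper itself. The names `embed_in_chunks` and `embed_fn` are hypothetical and not part of this repository. Zero-padding the last chunk and averaging with length weights are assumptions made for the example, and the model's own handling of short clips may differ.

```python
import numpy as np


def embed_in_chunks(waveform, sample_rate, embed_fn, chunk_seconds=7.0):
    """Split a mono waveform into fixed-length chunks and average their embeddings.

    embed_fn is a hypothetical callable that maps a 1-D array of exactly
    chunk_seconds * sample_rate samples to a 1-D embedding vector.
    """
    chunk_len = int(round(chunk_seconds * sample_rate))
    if chunk_len <= 0:
        raise ValueError("chunk length must be at least one sample")

    waveform = np.asarray(waveform, dtype=np.float32)
    if waveform.ndim != 1 or waveform.size == 0:
        raise ValueError("expected a non-empty 1-D (mono) waveform")

    embeddings, weights = [], []
    # Walk the file front to back in a fixed order, so there is no randomness.
    for start in range(0, len(waveform), chunk_len):
        chunk = waveform[start:start + chunk_len]
        weights.append(len(chunk))  # number of real (unpadded) samples
        if len(chunk) < chunk_len:
            # Assumption: zero-pad the final, shorter chunk to full length.
            chunk = np.pad(chunk, (0, chunk_len - len(chunk)))
        embeddings.append(np.asarray(embed_fn(chunk), dtype=np.float64))

    # Accumulate: average the chunk embeddings, weighted by real samples,
    # so a mostly padded last chunk counts for less.
    return np.average(np.stack(embeddings), axis=0, weights=weights)


if __name__ == "__main__":
    # Toy stand-in for a real embedding model, only to show determinism.
    toy_embed = lambda chunk: np.array([chunk.mean(), chunk.std()])
    rng = np.random.default_rng(0)
    audio = rng.standard_normal(44100 * 20)  # 20 s of noise at 44.1 kHz
    a = embed_in_chunks(audio, 44100, toy_embed)
    b = embed_in_chunks(audio, 44100, toy_embed)
    print(np.mean((a - b) ** 2))  # MSE between the two runs
```

With a real model, `embed_fn` would wrap whatever turns one 7-second waveform into an embedding, and any resampling or stereo-to-mono conversion would have to happen before calling the function. Because the chunks are taken in a fixed order, two runs on the same file see exactly the same segments.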

@hykilpikonna

Thank you for the clarification.
