RMS-levels = 0.3 #1066

Closed
Oortone opened this issue Nov 7, 2021 · 7 comments
@Oortone commented Nov 7, 2021

While trying to match offline analysis and realtime analysis using WebAudio and Meyda, I came across a strange thing:

I analyze a batch of files using something like:

```
meyda --bs=1024 --hs=512 --mfcc=13 --o=meydaoutput.csv samples_training/DczN6842.wav mfcc rms
```

If I look in the CSV files, all frames for all analyzed files have an RMS somewhere between 0.29 and 0.31 (apart from the last two frames, where the last is always zero).
This indicates normalization per frame, which makes the RMS value of no use as far as I can tell. Or is there something I'm missing here?
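For comparison, plain per-frame RMS without any normalization would look something like this (just a sketch of what I expect, not Meyda's actual code); with this definition a quiet file and a loud file would give different values, which is not what the CSVs show:

```js
// Sketch: absolute per-frame RMS over a signal, with frame size 1024 and
// hop size 512 as in the CLI call above. No normalization anywhere.
function frameRms(signal, bufferSize = 1024, hopSize = 512) {
  const rmsPerFrame = [];
  for (let start = 0; start + bufferSize <= signal.length; start += hopSize) {
    let sumOfSquares = 0;
    for (let i = start; i < start + bufferSize; i++) {
      sumOfSquares += signal[i] * signal[i];
    }
    rmsPerFrame.push(Math.sqrt(sumOfSquares / bufferSize));
  }
  return rmsPerFrame;
}
```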

In the realtime case, playing back the same files through an audio card (using a calibration procedure to get levels the same as for the files), I get an RMS of about 0.01-0.02, which is consistent with what I get from measuring WebAudio directly using `input = event.inputBuffer.getChannelData(0);`.

This is nowhere near 0.3; in fact, 0.3 RMS seems to be close to the maximum level possible as far as I can tell. So the realtime version does not normalize, I guess?
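The direct measurement I mention is essentially this (a sketch of my setup, using a ScriptProcessorNode):

```js
// Sketch: RMS per input buffer, measured straight off the WebAudio input.
const audioContext = new AudioContext();
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(1024, 1, 1);
  processor.onaudioprocess = (event) => {
    const input = event.inputBuffer.getChannelData(0);
    let sumOfSquares = 0;
    for (let i = 0; i < input.length; i++) sumOfSquares += input[i] * input[i];
    console.log('rms', Math.sqrt(sumOfSquares / input.length)); // ~0.01-0.02 here
  };
  source.connect(processor);
  processor.connect(audioContext.destination);
});
```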

How does this affect other parameters like MFCC? I ask since I have trouble getting consistency between realtime and offline feature extraction.

@hughrawlinson (Member)

Thanks for the report! Just briefly perusing the code, I can't see where we would normalize in the CLI, but I'll look into it asap. A few things that would really help me identify the issue are:

  1. An audio file where this happens
  2. The exact way you invoke the CLI
  3. A minimal reproduction for the web: just a tiny page that loads the file, does the extraction the way you're doing it, and writes the values to the DOM or console (a possible skeleton is sketched below)
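For (3), something like this skeleton would be perfect (hypothetical and untested; it assumes Meyda is loaded globally via a `<script>` tag and that the file name matches yours):

```js
// Hypothetical repro skeleton: decode a wav, run Meyda.extract per
// 1024-sample frame with a 512-sample hop, and log RMS to the console.
const audioContext = new AudioContext();
fetch('DczN6842.wav') // the file from the report
  .then((res) => res.arrayBuffer())
  .then((buf) => audioContext.decodeAudioData(buf))
  .then((audioBuffer) => {
    const signal = audioBuffer.getChannelData(0);
    for (let start = 0; start + 1024 <= signal.length; start += 512) {
      const frame = signal.subarray(start, start + 1024);
      console.log(Meyda.extract('rms', frame));
    }
  });
```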

I likely won't have time to look at this today or tomorrow, but hopefully this week!

@Oortone (Author) commented Nov 7, 2021

See attached files in zip: examples.zip

(1 & 2)
Here are three simplified WAV examples using pink noise, plus the script used and the CSVs produced (RMS in the last column).
The low- and high-level files produce the same RMS, so it's not absolute RMS per file. That could make sense, except it doesn't match the realtime web version, which produces absolute RMS.

The ramped file seems correct in beginning with 0 RMS, since the signal starts at 0 and then peaks close to the beginning. However, the highest RMS in the CSV is found at the end (row 18), and there's no sign of decaying RMS matching the file's signal. It makes no sense, and looks like per-frame normalization to me, which also makes no sense.

I also tried removing hop overlap but the results are equivalent.

I don't know about the last row being 0; I get that in all files with overlap, but that's no big deal for me since it can be removed.

(3)
I don't load any files on the web; I use the WebAudio context to analyze realtime input. I'm not able to produce an online version of that right now, but the realtime version seems correct to me, so I don't think there's an issue there.

The problem for me is producing CSV files offline for model building such that they correspond with what I get when using Meyda on a realtime WebAudio context.

@Oortone (Author) commented Nov 7, 2021

If I were better at JavaScript/TypeScript I would love to help out, but unfortunately that's a bit out of my reach right now.

@hughrawlinson (Member)

It's definitely best to compare apples to apples: if you're comparing WAVs on disk to audio input via an interface, it might be that your audio interface doesn't have whatever compression was applied to the WAV. Is the WAV just a plain recording of the audio input? I would expect pink noise to have a consistently high RMS, and audio input from a microphone to be much quieter.

@Oortone (Author) commented Nov 10, 2021

The WAVs are from various sources (professional recordings), but now that I switched to integer WAVs, things seem to be getting consistent. I will have to do more tests, though.

My idea is to find a correspondence in level between training and realtime; that's why RMS is useful. Alternatively, normalization is an option, but I'm not sure how to do that on the actual audio stream (time domain), so I might do it on the Meyda output (frequency domain etc., MFCC for now). However, I have to read up on how to normalize MFCCs correctly so I don't destroy information.
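What I have in mind is something like per-coefficient standardization (just a sketch; whether this is safe for MFCCs is exactly what I need to read up on):

```js
// Sketch: z-score normalization per MFCC coefficient. The mean/std must be
// computed once over the training set and reused unchanged at realtime.
function mfccStats(frames) {
  const n = frames.length;
  const dims = frames[0].length;
  const means = new Array(dims).fill(0);
  const variances = new Array(dims).fill(0);
  for (const frame of frames) frame.forEach((c, i) => { means[i] += c / n; });
  for (const frame of frames) frame.forEach((c, i) => { variances[i] += (c - means[i]) ** 2 / n; });
  return { means, stds: variances.map(Math.sqrt) };
}

function normalizeMfcc(frame, { means, stds }) {
  return frame.map((c, i) => (c - means[i]) / (stds[i] || 1));
}
```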

@hughrawlinson (Member)

A couple of things:

  1. You could augment your training data (I'm assuming supervised learning?) by copying signals and adding some room noise for your mic. Perhaps even by taking an impulse response of your room and applying that to your input signals. The more representative of the real world your training data is, the more resilient your model will end up being.
  2. If your model is operating on more than one buffer as a datum, then normalization of each datum could be a good approach, as long as you don't care too much about your model being able to distinguish between quiet and loud sounds. Though, if you do, you might still normalize and add another feature of average RMS (or even perceptual loudness) over the entire datum, and that might handle it (sketched below).
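To sketch what I mean in (2) (hypothetical helper; `signal` is the datum's samples, `rmsPerFrame` its per-frame RMS values):

```js
// Sketch for point 2: peak-normalize a multi-buffer datum, but keep the
// pre-normalization average RMS as an extra feature so level info survives.
function prepareDatum(signal, rmsPerFrame) {
  const averageRms = rmsPerFrame.reduce((a, b) => a + b, 0) / rmsPerFrame.length;
  const peak = signal.reduce((max, x) => Math.max(max, Math.abs(x)), 0);
  const normalized = signal.map((x) => (peak > 0 ? x / peak : x));
  return { normalized, extraFeatures: [averageRms] };
}
```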

Do you still think an issue with Meyda is causing your different values between training and real-time?

@Oortone (Author) commented Nov 11, 2021

Now I've done a more thorough preliminary test and I get consistent results. Very consistent actually. And even the very stupid and simple model I trained seems to be quite robust to level differences.

Anyway, this confirms I can train a model in Python on parameters extracted with the Meyda CLI, export the model using TensorFlow.js, and then use JavaScript, WebAudio and Meyda in realtime in a browser to make predictions. It's very nice.
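The browser side ends up roughly like this (a sketch, assuming a Layers model exported for TensorFlow.js; the model path is hypothetical, and `audioContext`/`source` are set up elsewhere):

```js
// Sketch: load the exported model and predict from Meyda features in realtime.
async function startPredicting(audioContext, source) {
  const model = await tf.loadLayersModel('model/model.json'); // hypothetical path
  const analyzer = Meyda.createMeydaAnalyzer({
    audioContext,
    source, // e.g. a MediaStreamAudioSourceNode
    bufferSize: 1024,
    featureExtractors: ['mfcc', 'rms'],
    callback: (features) => {
      // Feature order must match what the model saw during training.
      const input = tf.tensor2d([[...features.mfcc, features.rms]]);
      const prediction = model.predict(input);
      prediction.data().then(console.log);
      input.dispose();
    },
  });
  analyzer.start();
}
```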

So I don't think there's an issue with Meyda.
The problem was that my preprocessing produced floating-point WAVs, which Meyda can't read.

About levels realtime vs. offline (this might depend on the browser): on my system, with Chrome on macOS 10.14, I needed to double the gain from the WebAudio input stream to get RMS readings from Meyda corresponding to the offline analysis. It could be a mono/stereo thing or something else; not a big deal, and as you say I will need to take various measures to augment the training data (including levels) in a more realistic model scenario.
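Concretely, the fix on my machine is just a gain node in front of the analyzer (sketch):

```js
// Sketch: double the input gain before analysis so realtime RMS matches the
// offline CSVs on my setup (Chrome, macOS 10.14).
const gainNode = audioContext.createGain();
gainNode.gain.value = 2;
source.connect(gainNode);
// ...then pass gainNode as the `source` option to Meyda.createMeydaAnalyzer.
```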

@Oortone closed this as completed Nov 11, 2021