RMS-levels = 0.3 #1066

Closed
Oortone opened this issue Nov 7, 2021 · 7 comments
@Oortone commented Nov 7, 2021

While trying to match offline analysis and realtime analysis using WebAudio and Meyda, I came across a strange thing:

I analyze a batch of files using something like:

```
meyda --bs=1024 --hs=512 --mfcc=13 --o=meydaoutput.csv samples_training/DczN6842.wav mfcc rms
```

If I look in the CSV files, all frames for all analyzed files have an RMS somewhere between 0.29 and 0.31 (apart from the last two frames, where the last is always zero).
This indicates normalization per frame, which makes the RMS value of no use as far as I can tell. Or is there something I'm missing here?
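For comparison, plain per-frame RMS without any normalization would look something like this (just a sketch of what I expect, not Meyda's actual code); with this definition a quiet file and a loud file would give different values, which is not what the CSVs show:

```js
// Sketch: absolute per-frame RMS over a signal, with frame size 1024 and
// hop size 512 as in the CLI call above. No normalization anywhere.
function frameRms(signal, bufferSize = 1024, hopSize = 512) {
  const rmsPerFrame = [];
  for (let start = 0; start + bufferSize <= signal.length; start += hopSize) {
    let sumOfSquares = 0;
    for (let i = start; i < start + bufferSize; i++) {
      sumOfSquares += signal[i] * signal[i];
    }
    rmsPerFrame.push(Math.sqrt(sumOfSquares / bufferSize));
  }
  return rmsPerFrame;
}
```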

In the realtime case, playing back the same files through an audio card (using a calibration procedure to get levels the same as for the files), I get an RMS of about 0.01-0.02, which is consistent with what I get from measuring WebAudio directly using `input = event.inputBuffer.getChannelData(0);`.

This is nowhere near 0.3; in fact, 0.3 RMS seems to be close to the maximum level possible as far as I can tell. So the realtime version does not normalize, I guess?
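The direct measurement I mention is essentially this (a sketch of my setup, using a ScriptProcessorNode):

```js
// Sketch: RMS per input buffer, measured straight off the WebAudio input.
const audioContext = new AudioContext();
navigator.mediaDevices.getUserMedia({ audio: true }).then((stream) => {
  const source = audioContext.createMediaStreamSource(stream);
  const processor = audioContext.createScriptProcessor(1024, 1, 1);
  processor.onaudioprocess = (event) => {
    const input = event.inputBuffer.getChannelData(0);
    let sumOfSquares = 0;
    for (let i = 0; i < input.length; i++) sumOfSquares += input[i] * input[i];
    console.log('rms', Math.sqrt(sumOfSquares / input.length)); // ~0.01-0.02 here
  };
  source.connect(processor);
  processor.connect(audioContext.destination);
});
```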

How does this affect other parameters like MFCC? I ask since I have trouble getting consistency between realtime and offline feature extraction.

@hughrawlinson (Member)

Thanks for the report! Just briefly perusing the code, I can't see where we would normalize in the CLI, but I'll look into it asap. A few things that would really help me identify the issue are:

  1. An audio file where this happens
  2. The exact way you invoke the CLI
  3. A minimal reproduction for the web: just a tiny page that loads the file, does the extraction the way you're doing it, and writes the values to the DOM or console (a possible skeleton is sketched below)
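For (3), something like this skeleton would be perfect (hypothetical and untested; it assumes Meyda is loaded globally via a `<script>` tag and that the file name matches yours):

```js
// Hypothetical repro skeleton: decode a wav, run Meyda.extract per
// 1024-sample frame with a 512-sample hop, and log RMS to the console.
const audioContext = new AudioContext();
fetch('DczN6842.wav') // the file from the report
  .then((res) => res.arrayBuffer())
  .then((buf) => audioContext.decodeAudioData(buf))
  .then((audioBuffer) => {
    const signal = audioBuffer.getChannelData(0);
    for (let start = 0; start + 1024 <= signal.length; start += 512) {
      const frame = signal.subarray(start, start + 1024);
      console.log(Meyda.extract('rms', frame));
    }
  });
```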

I likely won't have time to look at this today or tomorrow, but hopefully this week!

@Oortone (Author) commented Nov 7, 2021

See attached files in zip: examples.zip

(1 & 2)
Here are three simplified WAV examples using pink noise, plus the script used and the CSVs produced (RMS in the last column).
The low- and high-level files produce the same RMS, so it's not absolute RMS per file. That could make sense, except it doesn't match the realtime web version, which produces absolute RMS.

The ramped file seems correct in beginning with 0 RMS, since the signal starts at 0 and then peaks close to the beginning. However, the highest RMS in the CSV is found at the end (row 18), and there's no sign of decaying RMS matching the file's signal. It makes no sense, and looks like per-frame normalization to me, which also makes no sense.

I also tried removing hop overlap but the results are equivalent.

I don't know about the last row being 0; I get that in all files with overlap, but that's no big deal for me since it can be removed.

(3)
I don't load any files on the web; I use the WebAudio context to analyze realtime input. I'm not able to produce an online version of that right now, but the realtime version seems correct to me, so I don't think there's an issue there.

The problem for me is producing CSV files offline for model building such that they correspond with what I get when using Meyda on a realtime WebAudio context.

@Oortone (Author) commented Nov 7, 2021

If I were better at JavaScript/TypeScript I would love to help out, but unfortunately that's a bit out of my reach right now.

@hughrawlinson (Member)

It's definitely best to compare apples to apples: if you're comparing WAVs on disk to audio input via an interface, it might be that your audio interface doesn't have whatever compression was applied to the WAV. Is the WAV just a plain recording of the audio input? I would expect pink noise to have a consistently high RMS, and audio input from a microphone to be much quieter.

@Oortone (Author) commented Nov 10, 2021

The WAVs are from various sources (professional recordings), but now that I switched to integer WAVs, things seem to be getting consistent. I will have to do more tests, though.

My idea is to find a correspondence in level between training and realtime; that's why RMS is useful. Alternatively, normalization is an option, but I'm not sure how to do that on the actual audio stream (time domain), so I might do it on the Meyda output (frequency domain etc., MFCC for now). However, I have to read up on how to normalize MFCCs correctly so I don't destroy information.
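What I have in mind is something like per-coefficient standardization (just a sketch; whether this is safe for MFCCs is exactly what I need to read up on):

```js
// Sketch: z-score normalization per MFCC coefficient. The mean/std must be
// computed once over the training set and reused unchanged at realtime.
function mfccStats(frames) {
  const n = frames.length;
  const dims = frames[0].length;
  const means = new Array(dims).fill(0);
  const variances = new Array(dims).fill(0);
  for (const frame of frames) frame.forEach((c, i) => { means[i] += c / n; });
  for (const frame of frames) frame.forEach((c, i) => { variances[i] += (c - means[i]) ** 2 / n; });
  return { means, stds: variances.map(Math.sqrt) };
}

function normalizeMfcc(frame, { means, stds }) {
  return frame.map((c, i) => (c - means[i]) / (stds[i] || 1));
}
```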

@hughrawlinson (Member)

A couple of things:

  1. You could augment your training data (I'm assuming supervised learning?) by copying signals and adding some room noise for your mic. Perhaps even by taking an impulse response of your room and applying that to your input signals. The more representative of the real world your training data is, the more resilient your model will end up being.
  2. If your model is operating on more than one buffer as a datum, then normalization of each datum could be a good approach, as long as you don't care too much about your model being able to distinguish between quiet and loud sounds. Though, if you do, you might still normalize and add another feature of average RMS (or even perceptual loudness) over the entire datum, and that might handle it (sketched below).
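To sketch what I mean in (2) (hypothetical helper; `signal` is the datum's samples, `rmsPerFrame` its per-frame RMS values):

```js
// Sketch for point 2: peak-normalize a multi-buffer datum, but keep the
// pre-normalization average RMS as an extra feature so level info survives.
function prepareDatum(signal, rmsPerFrame) {
  const averageRms = rmsPerFrame.reduce((a, b) => a + b, 0) / rmsPerFrame.length;
  const peak = signal.reduce((max, x) => Math.max(max, Math.abs(x)), 0);
  const normalized = signal.map((x) => (peak > 0 ? x / peak : x));
  return { normalized, extraFeatures: [averageRms] };
}
```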

Do you still think an issue with Meyda is causing your different values between training and real-time?

@Oortone (Author) commented Nov 11, 2021

Now I've done a more thorough preliminary test and I get consistent results. Very consistent actually. And even the very stupid and simple model I trained seems to be quite robust to level differences.

Anyway, this confirms I can train a model in Python on parameters extracted with the Meyda CLI, export the model using TensorFlow.js, and then use JavaScript, WebAudio and Meyda in realtime in a browser to make predictions. It's very nice.
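The browser side ends up roughly like this (a sketch, assuming a Layers model exported for TensorFlow.js; the model path is hypothetical, and `audioContext`/`source` are set up elsewhere):

```js
// Sketch: load the exported model and predict from Meyda features in realtime.
async function startPredicting(audioContext, source) {
  const model = await tf.loadLayersModel('model/model.json'); // hypothetical path
  const analyzer = Meyda.createMeydaAnalyzer({
    audioContext,
    source, // e.g. a MediaStreamAudioSourceNode
    bufferSize: 1024,
    featureExtractors: ['mfcc', 'rms'],
    callback: (features) => {
      // Feature order must match what the model saw during training.
      const input = tf.tensor2d([[...features.mfcc, features.rms]]);
      const prediction = model.predict(input);
      prediction.data().then(console.log);
      input.dispose();
    },
  });
  analyzer.start();
}
```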

So I don't think there's an issue with Meyda.
The problem was that my preprocessing produced floating-point WAVs, which Meyda can't read.

About levels realtime vs. offline (this might depend on the browser): on my system, with Chrome on macOS 10.14, I needed to double the gain from the WebAudio input stream to get RMS readings from Meyda corresponding to the offline analysis. It could be a mono/stereo thing or something else; not a big deal, and as you say I will need to take various measures to augment the training data (including levels) in a more realistic model scenario.
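Concretely, the fix on my machine is just a gain node in front of the analyzer (sketch):

```js
// Sketch: double the input gain before analysis so realtime RMS matches the
// offline CSVs on my setup (Chrome, macOS 10.14).
const gainNode = audioContext.createGain();
gainNode.gain.value = 2;
source.connect(gainNode);
// ...then pass gainNode as the `source` option to Meyda.createMeydaAnalyzer.
```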

@Oortone closed this as completed Nov 11, 2021