Architecture description of BirdNET v.2.4 #177

Closed
johnmartinsson opened this issue Oct 18, 2023 · 12 comments
@johnmartinsson

I was just in contact with Holger Klinck on the "AI for Conservation" Slack channel regarding a detailed description of the current model (v2.4) architecture. He suggested that I open an issue for this here.

I think it would be great to have this information available somewhere on the GitHub page, or, if it is already documented somewhere, a pointer to where it can be found, e.g. in the version history or the README.

My current understanding is that it is based on EfficientNet V2 blocks, but some more details would help me properly understand which model I am using.

Thanks for the great work on this model!

@johnmartinsson
Author

Small update: I just found that part of what I am asking for is in fact in the technical details in the README:

  • "V2.4 uses an EfficienNetB0-like backbone with a final embedding size of 1024"

I had missed this prior to creating this issue, and it is very helpful. But it would still be interesting to know some of the specific choices and whether there are any important differences from the original EfficientNetB0 backbone.

Personally, I am interested in whether or not global pooling is used in BirdNET v2.4. In most EfficientNetB0 implementations this is an option.
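
For reference, this is the option I mean in the stock Keras implementation (purely illustrative, not the BirdNET code):

import tensorflow as tf

# EfficientNetB0 without its classifier head; pooling="avg" applies global average
# pooling to the final feature maps, while pooling=None returns them unpooled.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False,
    weights=None,
    pooling="avg",
)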

@EnisBerk

I am also interested in the details required to reproduce the BirdNET model from scratch, such as a list of samples, a train-test split, and the code for the model. Are these available somewhere that I might have missed?

On an adjacent note, I noticed an attempt to convert the weights into a PyTorch model, but it seems they gave up on it. You can find more information in this repo issue: dsgt-birdclef/birdclef-2023#29. Having the model code could help in achieving that goal.

@kahst
Owner

kahst commented Oct 19, 2023

We can't easily publish the training code as it is very complex with many nuances and most of all: messy :)

We'll do a better job in the future: we're planning to release a repo with the full implementation, developed from scratch, so that people can actually use it.

In the meantime, would it help if we published the Keras model with its custom layers? It would be much easier to inspect and might also be easier to convert to ONNX.

@EnisBerk

I understand that this has been a long research project, and I can imagine the complexity of the codebase. It's been a while since I used Keras, but I am happy to give it a try.

As for the dataset details, are you planning to release them?

@johnmartinsson
Author

Yes, the Keras model with the custom layers should be sufficient to resolve this issue. Personally, I would like to know whether global pooling is used in the model, and if that can be determined from the suggested solution, I am happy! :)

Maybe another issue should be created regarding the training procedure of the model, @EnisBerk? That was not really meant to be part of this issue, even though I can see how closely related they are.

This issue only requires a description of the model architecture to be resolved.

Thanks for the quick reply @kahst !

@kahst
Owner

kahst commented Oct 20, 2023

Ok, here are a few details on that:

  • Our layer sequence is: Input --> Custom Spectrogram Layer --> 2DConv --> Pooling --> EfficientNetB0-like backbone --> Global Avg Pooling --> Linear classifier (a rough sketch follows below this list)
  • We use inverted ResBlocks with squeeze-and-excitation, just like EfficientNet does
  • Our ResBlock sequence differs a bit from EfficientNetB0's, out of convenience rather than for performance
  • We use a smaller embedding size than EfficientNet (1024 instead of 1280) to make the model smaller
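
For illustration, those pieces could be wired together roughly like this (stem filter counts, input length, and the backbone itself are placeholders rather than the exact BirdNET v2.4 values; MelSpecLayerSimple is the custom spectrogram layer shared further down):

import tensorflow as tf
from tensorflow.keras import layers

def build_birdnet_like(num_classes, backbone):
    # backbone: stand-in for the EfficientNetB0-like stack of inverted ResBlocks
    # with squeeze-and-excitation, ending in 1024 feature maps
    waveform = tf.keras.Input(shape=(144000,))                                  # 3 s of audio @ 48 kHz
    x = MelSpecLayerSimple()(waveform)                                          # custom spectrogram layer
    x = layers.Conv2D(32, 5, strides=2, padding='same', activation='relu')(x)   # 2D conv stem
    x = layers.MaxPooling2D(2)(x)                                               # pooling
    x = backbone(x)                                                             # EfficientNetB0-like backbone
    x = layers.GlobalAveragePooling2D()(x)                                      # global average pooling -> 1024-d embedding
    outputs = layers.Dense(num_classes, name='classifier')(x)                   # linear classifier
    return tf.keras.Model(waveform, outputs)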

We primarily train on focal-follow recordings from Xeno-canto and Macaulay Library (~90% of our training data), a number of soundscape datasets from BirdCLEF competitions (see this comment), and a few proprietary sources like BirdNET app data (but those are only a fraction of the training data).

In case you're interested, this is our spectrogram layer implementation:

import tensorflow as tf
from tensorflow import keras as k          # alias inferred from usage below
from tensorflow.keras import layers as l   # alias inferred from usage below

class MelSpecLayerSimple(l.Layer):

    def __init__(self, sample_rate=48000, spec_shape=(96, 511), frame_step=280, frame_length=1024, fmin=500, fmax=15000, data_format='channels_last', **kwargs):
        super(MelSpecLayerSimple, self).__init__(**kwargs)
        self.sample_rate = sample_rate
        self.spec_shape = spec_shape
        self.data_format = data_format
        self.frame_step = frame_step
        self.frame_length = frame_length
        self.fmin = fmin
        self.fmax = fmax

        self.mel_filterbank = tf.signal.linear_to_mel_weight_matrix(
                                num_mel_bins=self.spec_shape[0],
                                num_spectrogram_bins=self.frame_length // 2 + 1,
                                sample_rate=self.sample_rate,
                                lower_edge_hertz=self.fmin,
                                upper_edge_hertz=self.fmax,
                                dtype=tf.float32)  

    def build(self, input_shape):           
        self.mag_scale = self.add_weight(name='magnitude_scaling', 
                                         initializer=k.initializers.Constant(value=1.23),
                                         trainable=True)                                          
        super(MelSpecLayerSimple, self).build(input_shape)        

    def compute_output_shape(self, input_shape):
        if self.data_format == 'channels_last':
            return tf.TensorShape((None, self.spec_shape[0], self.spec_shape[1], 1))
        else:
            return tf.TensorShape((None, 1, self.spec_shape[0], self.spec_shape[1]))

    def call(self, inputs, training=None):

        # Normalize values between -1 and 1
        inputs = tf.math.subtract(inputs, tf.math.reduce_min(inputs, axis=1, keepdims=True))
        inputs = tf.math.divide(inputs, tf.math.reduce_max(inputs, axis=1, keepdims=True) + 0.000001)
        inputs = tf.math.subtract(inputs, 0.5)
        inputs = tf.math.multiply(inputs, 2.0)        

        # Perform STFT    
        spec = tf.signal.stft(inputs,
                              self.frame_length,
                              self.frame_step,
                              fft_length=self.frame_length,
                              window_fn=tf.signal.hann_window,
                              pad_end=False,
                              name='stft')    

        # Cast from complex to float (keeps only the real part)
        spec = tf.dtypes.cast(spec, tf.float32)

        # Apply mel scale              
        spec = tf.tensordot(spec, self.mel_filterbank, 1)
        
        # Convert to power spectrogram
        spec = tf.math.pow(spec, 2.0)

        # Convert magnitudes using nonlinearity
        spec = tf.math.pow(spec, 1.0 / (1.0 + tf.math.exp(self.mag_scale)))

        # Flip spec horizontally
        spec = tf.reverse(spec, axis=[2])

        # Swap axes to fit input shape
        spec = tf.transpose(spec, [0, 2, 1])

        # Add channel axis        
        if self.data_format == 'channels_last':
            spec = tf.expand_dims(spec, -1)
        else:
            spec = tf.expand_dims(spec, 1)      

        return spec

    def get_config(self):
        config = {'data_format': self.data_format,
                  'sample_rate': self.sample_rate,
                  'spec_shape': self.spec_shape,
                  'frame_step': self.frame_step,
                  'fmin': self.fmin,
                  'fmax': self.fmax,
                  'frame_length': self.frame_length}
        base_config = super(MelSpecLayerSimple, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
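
If it helps, here is a quick way to sanity-check the expected output shape (hypothetical usage with a batch of 3-second clips at 48 kHz, i.e. 144,000 samples each):

layer = MelSpecLayerSimple()
audio = tf.random.uniform((8, 144000), minval=-1.0, maxval=1.0)  # stand-in for real waveforms
spec = layer(audio)
print(spec.shape)  # (8, 96, 511, 1) with data_format='channels_last'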

@johnmartinsson
Author

Thank you very much for this information! That is very helpful.

@EnisBerk

This is really helpful. Thank you very much!

@MJWeldy

MJWeldy commented Dec 5, 2023

Thanks for providing that info! It is interesting to see!
Could you also provide an estimate of how much training data you used to train different model versions?

@kahst
Owner

kahst commented Dec 6, 2023

We train on ~10 million 3-second snippets of audio, with a maximum of 5,000 per species. The training set has been the same since V2.0.

@JacobGlennAyers

> I understand that this has been a long research project, and I can imagine the complexity of the codebase. It's been a while since I used Keras, but I am happy to give it a try.
>
> As for the dataset details, are you planning to release them?

In their main BirdNET paper, they reference a process for segmenting bird audio from background noise, which comes from this paper: https://ceur-ws.org/Vol-1609/16090547.pdf
[screenshots of the segmentation description and an example visual of the process omitted]

Curious to know whether this segmentation process has evolved since the Ecological Informatics paper.
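
For context, my rough reading of the heuristic in that paper is something like the following (thresholds and filter sizes are my assumptions from memory, not necessarily what BirdNET actually uses):

import numpy as np
from scipy import ndimage

def signal_frames(spectrogram, factor=3.0):
    # spectrogram: 2D magnitude array of shape (freq_bins, time_frames)
    # Keep pixels that stand out against both their row and their column
    row_median = np.median(spectrogram, axis=1, keepdims=True)
    col_median = np.median(spectrogram, axis=0, keepdims=True)
    mask = (spectrogram > factor * row_median) & (spectrogram > factor * col_median)

    # Remove isolated pixels, then grow the surviving blobs
    mask = ndimage.binary_erosion(mask, structure=np.ones((4, 4)))
    mask = ndimage.binary_dilation(mask, structure=np.ones((4, 4)))

    # A time frame counts as "signal" if any frequency bin in it survived
    indicator = mask.any(axis=0)
    indicator = ndimage.binary_dilation(indicator, structure=np.ones(4))
    return indicator  # boolean vector over time frames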

@tphakala

tphakala commented Apr 5, 2024

> In the meantime, would it help if we published the Keras model with its custom layers? It would be much easier to inspect and might also be easier to convert to ONNX.

Keras model availability would be really good. I would like to switch from tflite to the ONNX runtime, as that would simplify BirdNET-Go significantly: I could get rid of the C library compatibility layer.
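
For what it's worth, once the Keras model with its custom layer is published, I would expect a conversion roughly along these lines to work (file names and opset are placeholders, and the MelSpecLayerSimple class from above would need to be importable):

import tensorflow as tf
import tf2onnx

# Load the (hypothetical) published Keras model, making the custom layer known to Keras
model = tf.keras.models.load_model(
    "birdnet_v2.4.h5",
    custom_objects={"MelSpecLayerSimple": MelSpecLayerSimple},
)

# Convert to ONNX so the model can run on onnxruntime instead of tflite
onnx_model, _ = tf2onnx.convert.from_keras(model, opset=13, output_path="birdnet_v2.4.onnx")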
