Architecture description of BirdNET v.2.4 #177

Closed
johnmartinsson opened this issue Oct 18, 2023 · 12 comments
@johnmartinsson

I was just in contact with Holger Klinck on the "AI for Conservation" Slack channel regarding a detailed description of the current model (v2.4) architecture. He suggested that I open an issue for this here.

I think it would be great to have this information available somewhere on the GitHub page, or, if it is already documented somewhere, a pointer to where it can be found, e.g. in the version history or the README.

My current understanding is that it is based on EfficientNet V2 blocks, but some more details would help me properly understand which model I am using.

Thanks for the great work on this model!

@johnmartinsson
Author

Small update: I just found that part of what I am asking for is in fact in the technical details in the README:

  • "V2.4 uses an EfficienNetB0-like backbone with a final embedding size of 1024"

I had missed this prior to creating this issue, and it is very helpful. But it would still be interesting to know some of the specific choices and whether there are any important differences from the original EfficientNetB0 backbone.

Personally, I am interested in whether or not global pooling is used in BirdNET v2.4. In most EfficientNetB0 implementations this is an option.
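
For reference, this is the option I mean in the stock Keras implementation (purely illustrative, not the BirdNET code):

import tensorflow as tf

# EfficientNetB0 without its classifier head; pooling="avg" applies global average
# pooling to the final feature maps, while pooling=None returns them unpooled.
backbone = tf.keras.applications.EfficientNetB0(
    include_top=False,
    weights=None,
    pooling="avg",
)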

@EnisBerk

I am also interested in the details required to reproduce the BirdNET model from scratch, such as a list of samples, a train-test split, and the code for the model. Are these available somewhere that I might have missed?

On an adjacent note, I noticed an attempt to convert the weights into a PyTorch model, but it seems they gave up on it. You can find more information in this repo issue: dsgt-birdclef/birdclef-2023#29. Having the model code could help in achieving that goal.

@kahst
Owner

kahst commented Oct 19, 2023

We can't easily publish the training code as it is very complex with many nuances and most of all: messy :)

We'll do a better job in the future: we're planning to release a repo with the full implementation, developed from scratch, so that people can actually use it.

In the meantime, would it help if we published the Keras model with its custom layers? It would be much easier to inspect and might also be easier to convert to ONNX.

@EnisBerk

I understand that this has been a long research project, and I can imagine the complexity of the codebase. It's been a while since I used Keras, but I am happy to give it a try.

As for the dataset details, are you planning to release them?

@johnmartinsson
Author

Yes, the Keras model with the custom layers should be sufficient to resolve this issue. Personally, I would like to know whether global pooling is used in the model, and if that can be determined from the suggested solution, I am happy! :)

Maybe another issue should be created regarding the training procedure of the model, @EnisBerk? That was not really meant to be part of this issue, even though I can see how closely related they are.

This issue only requires a description of the model architecture to be resolved.

Thanks for the quick reply @kahst !

@kahst
Owner

kahst commented Oct 20, 2023

Ok, here are a few details on that:

  • Our layer sequence is: Input --> Custom Spectrogram Layer --> 2DConv --> Pooling --> EfficientNetB0-like backbone --> Global Avg Pooling --> Linear classifier (a rough sketch follows below this list)
  • We use inverted ResBlocks with squeeze-and-excitation, just like EfficientNet does
  • Our ResBlock sequence differs a bit from EfficientNetB0's, out of convenience rather than for performance
  • We use a smaller embedding size than EfficientNet (1024 instead of 1280) to make the model smaller
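
For illustration, those pieces could be wired together roughly like this (stem filter counts, input length, and the backbone itself are placeholders rather than the exact BirdNET v2.4 values; MelSpecLayerSimple is the custom spectrogram layer shared further down):

import tensorflow as tf
from tensorflow.keras import layers

def build_birdnet_like(num_classes, backbone):
    # backbone: stand-in for the EfficientNetB0-like stack of inverted ResBlocks
    # with squeeze-and-excitation, ending in 1024 feature maps
    waveform = tf.keras.Input(shape=(144000,))                                  # 3 s of audio @ 48 kHz
    x = MelSpecLayerSimple()(waveform)                                          # custom spectrogram layer
    x = layers.Conv2D(32, 5, strides=2, padding='same', activation='relu')(x)   # 2D conv stem
    x = layers.MaxPooling2D(2)(x)                                               # pooling
    x = backbone(x)                                                             # EfficientNetB0-like backbone
    x = layers.GlobalAveragePooling2D()(x)                                      # global average pooling -> 1024-d embedding
    outputs = layers.Dense(num_classes, name='classifier')(x)                   # linear classifier
    return tf.keras.Model(waveform, outputs)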

We primarily train on focal-follow recordings from Xeno-canto and Macaulay Library (~90% of our training data), a number of soundscape datasets from BirdCLEF competitions (see this comment), and a few proprietary sources like BirdNET app data (but those are only a fraction of the training data).

In case you're interested, this is our spectrogram layer implementation:

import tensorflow as tf
from tensorflow import keras as k          # alias inferred from usage below
from tensorflow.keras import layers as l   # alias inferred from usage below

class MelSpecLayerSimple(l.Layer):

    def __init__(self, sample_rate=48000, spec_shape=(96, 511), frame_step=280, frame_length=1024, fmin=500, fmax=15000, data_format='channels_last', **kwargs):
        super(MelSpecLayerSimple, self).__init__(**kwargs)
        self.sample_rate = sample_rate
        self.spec_shape = spec_shape
        self.data_format = data_format
        self.frame_step = frame_step
        self.frame_length = frame_length
        self.fmin = fmin
        self.fmax = fmax

        self.mel_filterbank = tf.signal.linear_to_mel_weight_matrix(
                                num_mel_bins=self.spec_shape[0],
                                num_spectrogram_bins=self.frame_length // 2 + 1,
                                sample_rate=self.sample_rate,
                                lower_edge_hertz=self.fmin,
                                upper_edge_hertz=self.fmax,
                                dtype=tf.float32)  

    def build(self, input_shape):           
        self.mag_scale = self.add_weight(name='magnitude_scaling', 
                                         initializer=k.initializers.Constant(value=1.23),
                                         trainable=True)                                          
        super(MelSpecLayerSimple, self).build(input_shape)        

    def compute_output_shape(self, input_shape):
        if self.data_format == 'channels_last':
            return tf.TensorShape((None, self.spec_shape[0], self.spec_shape[1], 1))
        else:
            return tf.TensorShape((None, 1, self.spec_shape[0], self.spec_shape[1]))

    def call(self, inputs, training=None):

        # Normalize values between -1 and 1
        inputs = tf.math.subtract(inputs, tf.math.reduce_min(inputs, axis=1, keepdims=True))
        inputs = tf.math.divide(inputs, tf.math.reduce_max(inputs, axis=1, keepdims=True) + 0.000001)
        inputs = tf.math.subtract(inputs, 0.5)
        inputs = tf.math.multiply(inputs, 2.0)        

        # Perform STFT    
        spec = tf.signal.stft(inputs,
                              self.frame_length,
                              self.frame_step,
                              fft_length=self.frame_length,
                              window_fn=tf.signal.hann_window,
                              pad_end=False,
                              name='stft')    

        # Cast from complex to float (keeps only the real part)
        spec = tf.dtypes.cast(spec, tf.float32)

        # Apply mel scale              
        spec = tf.tensordot(spec, self.mel_filterbank, 1)
        
        # Convert to power spectrogram
        spec = tf.math.pow(spec, 2.0)

        # Convert magnitudes using nonlinearity
        spec = tf.math.pow(spec, 1.0 / (1.0 + tf.math.exp(self.mag_scale)))

        # Flip spec horizontally
        spec = tf.reverse(spec, axis=[2])

        # Swap axes to fit input shape
        spec = tf.transpose(spec, [0, 2, 1])

        # Add channel axis        
        if self.data_format == 'channels_last':
            spec = tf.expand_dims(spec, -1)
        else:
            spec = tf.expand_dims(spec, 1)      

        return spec

    def get_config(self):
        config = {'data_format': self.data_format,
                  'sample_rate': self.sample_rate,
                  'spec_shape': self.spec_shape,
                  'frame_step': self.frame_step,
                  'fmin': self.fmin,
                  'fmax': self.fmax,
                  'frame_length': self.frame_length}
        base_config = super(MelSpecLayerSimple, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))
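
If it helps, here is a quick way to sanity-check the expected output shape (hypothetical usage with a batch of 3-second clips at 48 kHz, i.e. 144,000 samples each):

layer = MelSpecLayerSimple()
audio = tf.random.uniform((8, 144000), minval=-1.0, maxval=1.0)  # stand-in for real waveforms
spec = layer(audio)
print(spec.shape)  # (8, 96, 511, 1) with data_format='channels_last'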

@johnmartinsson
Author

Thank you very much for this information! That is very helpful.

@EnisBerk

This is really helpful. Thank you very much!

@MJWeldy

MJWeldy commented Dec 5, 2023

Thanks for providing that info! It is interesting to see!
Could you also provide an estimate of how much training data you used to train different model versions?

@kahst
Owner

kahst commented Dec 6, 2023

We train on ~10 million 3-second snippets of audio, with a maximum of 5,000 per species. The training set has been the same since V2.0.

@JacobGlennAyers

> I understand that this has been a long research project, and I can imagine the complexity of the codebase. It's been a while since I used Keras, but I am happy to give it a try.
>
> As for the dataset details, are you planning to release them?

In their main BirdNET paper, they reference a process for segmenting bird audio from background noise, which comes from this paper: https://ceur-ws.org/Vol-1609/16090547.pdf
[screenshots of the segmentation description and an example visual of the process omitted]

Curious to know whether this segmentation process has evolved since the Ecological Informatics paper.
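
For context, my rough reading of the heuristic in that paper is something like the following (thresholds and filter sizes are my assumptions from memory, not necessarily what BirdNET actually uses):

import numpy as np
from scipy import ndimage

def signal_frames(spectrogram, factor=3.0):
    # spectrogram: 2D magnitude array of shape (freq_bins, time_frames)
    # Keep pixels that stand out against both their row and their column
    row_median = np.median(spectrogram, axis=1, keepdims=True)
    col_median = np.median(spectrogram, axis=0, keepdims=True)
    mask = (spectrogram > factor * row_median) & (spectrogram > factor * col_median)

    # Remove isolated pixels, then grow the surviving blobs
    mask = ndimage.binary_erosion(mask, structure=np.ones((4, 4)))
    mask = ndimage.binary_dilation(mask, structure=np.ones((4, 4)))

    # A time frame counts as "signal" if any frequency bin in it survived
    indicator = mask.any(axis=0)
    indicator = ndimage.binary_dilation(indicator, structure=np.ones(4))
    return indicator  # boolean vector over time frames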

@tphakala

tphakala commented Apr 5, 2024

> In the meantime, would it help if we published the Keras model with its custom layers? It would be much easier to inspect and might also be easier to convert to ONNX.

Keras model availability would be really good. I would like to switch from tflite to the ONNX runtime, as that would simplify BirdNET-Go significantly: I could get rid of the C library compatibility layer.
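
For what it's worth, once the Keras model with its custom layer is published, I would expect a conversion roughly along these lines to work (file names and opset are placeholders, and the MelSpecLayerSimple class from above would need to be importable):

import tensorflow as tf
import tf2onnx

# Load the (hypothetical) published Keras model, making the custom layer known to Keras
model = tf.keras.models.load_model(
    "birdnet_v2.4.h5",
    custom_objects={"MelSpecLayerSimple": MelSpecLayerSimple},
)

# Convert to ONNX so the model can run on onnxruntime instead of tflite
onnx_model, _ = tf2onnx.convert.from_keras(model, opset=13, output_path="birdnet_v2.4.onnx")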
