Music auto-tagging models and trained weights in keras/theano

Music Auto-Tagger

Music auto-tagger using keras


  • Use keras == 1.0.6 for MusicTaggerCNN (a fix is planned).
  • Use keras > 1.0.6 for MusicTaggerCRNN.

Prerequisites

  • You need keras to run the models.
    • To use your own audio files, you also need librosa.
  • The input data shape is (None, channel, height, width), i.e. it follows the theano convention. If you're using tensorflow as your backend, check that image_dim_ordering is set to th in ~/.keras/keras.json, i.e.
"image_dim_ordering": "th",

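A quick way to check this setting from Python is sketched below. It assumes only that keras.json is a plain JSON file with an image_dim_ordering key, as described above; the helper name uses_theano_ordering is made up for this example.

```python
import json
import os


def uses_theano_ordering(config_path):
    """Return True if the keras config sets image_dim_ordering to "th"."""
    with open(config_path) as f:
        config = json.load(f)
    # A missing key is treated as "not th" so the caller notices it.
    return config.get("image_dim_ordering") == "th"


if __name__ == "__main__":
    path = os.path.expanduser("~/.keras/keras.json")
    if not os.path.exists(path):
        print("No keras.json found at", path)
    elif uses_theano_ordering(path):
        print("image_dim_ordering is th")
    else:
        print('Set "image_dim_ordering": "th" in', path)
```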








MusicTaggerCNN

  • 5-layer 2D convolutions
  • Number of parameters: 865,950
  • AUC score: 0.8654
  • WARNING: with keras > 1.0.6, this model does not work properly. Please use MusicTaggerCRNN until it is updated. (FYI: a deeper ConvNet with 3M parameters showed an AUC of 0.8595.)


MusicTaggerCRNN

  • 4-layer 2D convolutions + 2 GRU layers
  • Number of parameters: 396,786
  • AUC score: 0.8662

How was it trained?

['rock', 'pop', 'alternative', 'indie', 'electronic', 'female vocalists', 
'dance', '00s', 'alternative rock', 'jazz', 'beautiful', 'metal', 
'chillout', 'male vocalists', 'classic rock', 'soul', 'indie rock',
'Mellow', 'electronica', '80s', 'folk', '90s', 'chill', 'instrumental',
'punk', 'oldies', 'blues', 'hard rock', 'ambient', 'acoustic', 'experimental',
'female vocalist', 'guitar', 'Hip-Hop', '70s', 'party', 'country', 'easy listening',
'sexy', 'catchy', 'funk', 'electro', 'heavy metal', 'Progressive rock',
'60s', 'rnb', 'indie pop', 'sad', 'House', 'happy']

Which is the better predictor?

  • Training: MusicTaggerCNN is faster than MusicTaggerCRNN (wall-clock time).
  • Prediction: They are more or less the same.
  • Memory usage: MusicTaggerCRNN has fewer trainable parameters. You can even decrease the number of feature maps and MusicTaggerCRNN still works quite well, i.e., the current setting is a little rich (or redundant). With MusicTaggerCNN, performance drops if you reduce the number of parameters.

Therefore, if you just want to use the pre-trained weights, use MusicTaggerCNN. If you want to train a model yourself, it's up to you. In general, I would use MusicTaggerCRNN after downsizing it to around 0.2M parameters (the training time would then be similar to MusicTaggerCNN). To reduce the size, change the number of feature maps in the convolution layers.
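
To get a feel for how the number of feature maps drives model size, here is a rough sketch that counts the weights of a stack of 2D convolution layers. The 3x3 kernel size and the channel widths are illustrative assumptions, not the actual settings of either model, and the count ignores batch normalization, GRU and output layers.

```python
def conv2d_params(in_channels, out_channels, kernel_h=3, kernel_w=3):
    """Weights plus biases of one 2D convolution layer."""
    return (kernel_h * kernel_w * in_channels + 1) * out_channels


def conv_stack_params(widths, in_channels=1):
    """Total parameters of consecutive conv layers with the given feature-map counts."""
    total = 0
    for out_channels in widths:
        total += conv2d_params(in_channels, out_channels)
        in_channels = out_channels
    return total


if __name__ == "__main__":
    # Hypothetical widths, chosen only to show the trend.
    full = conv_stack_params([64, 128, 128, 128])
    half = conv_stack_params([32, 64, 64, 64])
    print(full, half, round(full / half, 2))
```

Halving every width cuts the count by roughly a factor of four, because most weights scale with in_channels times out_channels.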

Which is the better feature extractor?

By setting include_top=False, you can get a 256-dim (MusicTaggerCNN) or 32-dim (MusicTaggerCRNN) feature representation.

In general, I would recommend MusicTaggerCRNN and its 32-dim features: for predicting 50 tags, 256 features sound a bit too large. I have only looked into the 32-dim features, not the 256-dim ones. I thought of using PCA to reduce the dimension further, but ended up not applying it because mean(abs(recovered - original) / original) was .12 (dim: 32->16) and .05 (dim: 32->24), which doesn't seem good enough.
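
For reference, the reconstruction error quoted above can be computed along these lines. This is only a sketch using NumPy's SVD as the PCA step; the random stand-in features and the name pca_relative_error are assumptions for the example, so the printed numbers will not match the ones reported here.

```python
import numpy as np


def pca_relative_error(features, n_components):
    """Project onto the top principal components, reconstruct, and return
    mean(abs(recovered - original) / original)."""
    mean = features.mean(axis=0)
    centered = features - mean
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    recovered = centered @ components.T @ components + mean
    return float(np.mean(np.abs(recovered - features) / features))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for real 32-dim features; kept strictly positive so the
    # ratio is well defined.
    fake_features = rng.random((200, 32)) + 1.0
    for k in (16, 24, 32):
        print(k, pca_relative_error(fake_features, k))
```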

The 256-dim features are probably either redundant (in which case PCA can reduce them effectively) or they simply carry more information than the 32-dim ones (e.g., features from different hierarchical levels). If dimensionality does not matter for your use case, the 256-dim features are worth choosing.


$ python
$ python


theano, MusicTaggerCRNN

[('jazz', '0.444'), ('instrumental', '0.151'), ('folk', '0.103'), ('Hip-Hop', '0.103'), ('ambient', '0.077')]
[('guitar', '0.068'), ('rock', '0.058'), ('acoustic', '0.054'), ('experimental', '0.051'), ('electronic', '0.042')]

[('jazz', '0.416'), ('instrumental', '0.181'), ('Hip-Hop', '0.085'), ('folk', '0.085'), ('rock', '0.081')]
[('ambient', '0.068'), ('guitar', '0.062'), ('Progressive rock', '0.048'), ('experimental', '0.046'), ('acoustic', '0.046')]

[('Hip-Hop', '0.245'), ('rock', '0.183'), ('alternative', '0.081'), ('electronic', '0.076'), ('alternative rock', '0.053')]
[('metal', '0.051'), ('indie', '0.028'), ('instrumental', '0.027'), ('electronica', '0.024'), ('hard rock', '0.023')]

[('jazz', '0.299'), ('instrumental', '0.174'), ('electronic', '0.089'), ('ambient', '0.061'), ('chillout', '0.052')]
[('rock', '0.044'), ('guitar', '0.044'), ('funk', '0.033'), ('chill', '0.032'), ('Progressive rock', '0.029')]
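
The output above lists (tag, score) pairs sorted by score, five per line. A minimal sketch of producing that format from a prediction vector follows; top_tags is a hypothetical helper written for this example, and the scores are made up.

```python
def top_tags(tags, scores, k=10):
    """Return the k highest-scoring (tag, 'score') pairs, best first."""
    ranked = sorted(zip(tags, scores), key=lambda pair: pair[1], reverse=True)
    # Scores are formatted as strings with three decimals, as in the output above.
    return [(tag, "%.3f" % score) for tag, score in ranked[:k]]


if __name__ == "__main__":
    tags = ["rock", "pop", "jazz", "folk"]
    scores = [0.058, 0.01, 0.444, 0.103]  # made-up values
    print(top_tags(tags, scores, k=3))
```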


Reproduce the experiment

  • A repo with the split settings, so you can use the same experimental setup as two papers.
  • Audio files: find someone around you who happens to have the preview clips, or crawl the files yourself. I would recommend crawling your colleagues...