
Suggestion needed #26

Closed
lvaleriu opened this issue Mar 30, 2017 · 14 comments

Comments

@lvaleriu

I'm trying to use the compact CNN features for singing-language classification of music. Do you have any suggestions about which layer I should choose?

Thanks!

@keunwoochoi
Owner

keunwoochoi commented Mar 30, 2017

How different are the classes? What is the dataset size? As a quick first attempt, I'd try https://github.com/keunwoochoi/transfer_learning_music with ALL layer features concatenated + PCA or some fancy feature selection + SVM/random forest/etc. (as suggested in the paper).

I'd assume the classes differ somewhat in their sound beyond the language itself, since the language of a song usually correlates with other cultural aspects, which in turn correlate with the sound. If the songs are almost identical except for the language, it sounds like quite a challenging task ;)
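
A minimal sketch of that suggestion, assuming the per-layer convnet features have already been extracted into numpy arrays (e.g. with the transfer_learning_music code); the names layer_feats and labels below are placeholders, not part of the repo:

 import numpy as np
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.decomposition import PCA
 from sklearn.svm import SVC
 from sklearn.model_selection import cross_val_score

 # layer_feats: list of (n_songs, n_dims) arrays, one per layer; labels: one language tag per song
 X = np.concatenate(layer_feats, axis=1)   # all layer features concatenated per song
 y = np.asarray(labels)                    # e.g. 'it', 'es', 'fr', 'de', 'en'

 clf = make_pipeline(StandardScaler(), PCA(n_components=64), SVC(kernel='rbf', C=1.0))
 print(cross_val_score(clf, X, y, cv=5).mean())   # 5-fold cross-validated accuracy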

@lvaleriu
Author

5 languages: it, es, fr, de, en
500 mp3 files (44100 Hz, 30 s duration) per language - 2500 files in total
5+ different genres (but mainly vocal songs - of course, all contain some instrumental parts that I should maybe filter out)

@lvaleriu
Author

Did you test extracting features for short durations when training on audio datasets? E.g. MFCC features over 1, 2, or 5 seconds.

@keunwoochoi
Owner

Yeah, (as also described in the paper) I repeat them to make 29-second signals, so that the feature doesn't get distorted by lots of zeros.
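
For reference, a rough sketch of that repetition trick, assuming a 12 kHz mono input and a 29-second target length (adjust sr and length to whatever your feature extractor expects; the file name is hypothetical):

 import numpy as np
 import librosa

 SR = 12000                 # assumed sample rate of the feature extractor
 N_TARGET = SR * 29         # fixed 29-second input length

 y, _ = librosa.load('short_clip.mp3', sr=SR, mono=True)
 n_repeat = int(np.ceil(N_TARGET / float(len(y))))
 y_tiled = np.tile(y, n_repeat)[:N_TARGET]   # repeat the clip instead of zero-padding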

@lvaleriu
Author

Oh, I understand better now.

@lvaleriu
Author

One more question: I trained on the GTZAN music-vs-speech dataset with a shallower network (more like a LeNet for audio, on MFCC features) and I get a pretty good score. But when I use this network to predict on other songs, I sometimes get good results and sometimes very bad ones. For example, I have many examples of people talking (no music at all) where the prediction is bad (a score of 0.4, taking the mean over the song duration since I train on 1, 2, and 5 second features, where 0 should be close to speech and 1 close to music).

Do you encounter this situation when you use transfer learning, get good accuracy on a dataset, and then try to use the classifier in a "real-world" setting? Do the features (low- or high-level) from the layers still perform well outside the dataset?
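
For clarity, the per-window averaging described above would look roughly like this; model and windows are placeholders for the trained MFCC classifier and the stacked short-window (1-5 s) features of one file:

 import numpy as np

 probs = model.predict(windows)    # one prediction per 1/2/5-second window
 song_score = probs.mean(axis=0)   # mean over the song duration
 # with a single sigmoid output: close to 0 -> speech, close to 1 -> music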

@keunwoochoi
Owner

Hm, I tested it on 6 different datasets in https://github.com/keunwoochoi/transfer_learning_music and most of them (if not all) consist of real songs. So yes, I think they do.

@lvaleriu
Author

lvaleriu commented Mar 31, 2017

I'm trying to train on the GTZAN speech-vs-music dataset using the method described in transfer_learning_music.

Can I use a simple network like the following and hope to get results? I'm asking because I've reached a validation accuracy of more than 0.95, but when I predict on other songs I always get predicted_y = [0, 1], for any song. How far should the validation loss decrease? Right now it is smaller than 0.1.

 # Keras 1.x-style imports matching the snippet below
 from keras.models import Sequential
 from keras.layers import InputLayer, Flatten, Dense, Dropout
 from keras.optimizers import Adam

 model = Sequential()
 model.add(InputLayer(input_shape=(1, 160)))   # one 160-d transfer feature vector per example
 model.add(Flatten())
 model.add(Dense(256, activation='relu'))
 model.add(Dropout(0.5))
 model.add(Dense(2, activation='softmax'))     # 2-way softmax: speech vs music
 model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=Adam(lr=1e-4))

x.shape =(103, 1, 160)
y.shape =(103, 2)

vx.shape =(26, 1, 160)
vy.shape =(26, 2)

Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1, 160)        0                                            
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 160)           0           input_1[0][0]                    
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           41216       flatten_2[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 256)           0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 2)             514         dropout_1[0][0]                  

Total params: 41,730
Trainable params: 41,730
Non-trainable params: 0
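
For completeness, a hypothetical training/prediction call with the arrays above (argument names follow the Keras 1.x API used in the snippet; Keras 2 uses epochs= instead of nb_epoch=):

 model.fit(x, y, validation_data=(vx, vy), nb_epoch=200, batch_size=16)
 probs = model.predict(vx)   # raw softmax probabilities, not rounded 0/1 labels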

Thanks again for your precious advice.


@lvaleriu
Author

Here is a screenshot of the training process:

[screenshot: training and validation curves]

@keunwoochoi
Owner

"I always get predicted_y = [0, 1], for any song"

What does that mean?

@keunwoochoi
Owner

The graphs seem alright overall. But GTZAN speech/music is the least interesting and most trivial problem among the 6 tasks in the paper. I mean, no one uses deep learning for 129 data samples.

But I also think it should be fine with random songs. If you're really into this problem, a t-SNE of all the training/validation/test features plus the out-of-dataset songs might tell you something.
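
A quick sketch of that t-SNE idea, assuming the 160-d transfer features for the train/validation/test splits and for some out-of-dataset songs are already available as numpy arrays (the feats_* names are placeholders):

 import numpy as np
 import matplotlib.pyplot as plt
 from sklearn.manifold import TSNE

 splits = [('train', feats_train), ('valid', feats_valid),
           ('test', feats_test), ('new songs', feats_new)]
 X = np.vstack([f for _, f in splits])
 emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

 # plot each split with its own colour and see where the new songs land
 start = 0
 for name, f in splits:
     plt.scatter(emb[start:start + len(f), 0], emb[start:start + len(f), 1], s=10, label=name)
     start += len(f)
 plt.legend()
 plt.show()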

@lvaleriu
Author

"I have always the value predicted_y = [0, 1] for any song
What does it mean?" -> A stupid thing. I have rounded the results when showing them. Sorry for that.

But anyway, the problem I still have is that when I try to predict on 2 audio files (one instrumental, the other containing only speech - someone reading a story for children, with no background music), I still obtain something like the plots below. The 2 pictures shouldn't look the same.
(To obtain these graphs I extract sliding windows of 29 seconds with a step of 3000 samples and generate a feature for each window. In the end I have a batch of features as input for one file, so I get a batch of predictions, one per sliding window. The prediction is categorical: [0, 1] for music and [1, 0] for speech - or the other way around.)

[per-window prediction plot for each of the two files]
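
Roughly, the windowing described above looks like this; extract_feature and model stand in for whatever feature extractor and classifier are actually used, and the sample rate and file name are assumptions:

 import numpy as np
 import librosa

 SR = 12000                      # assumed sample rate of the extractor
 WIN = SR * 29                   # 29-second window
 HOP = 3000                      # step of 3000 samples

 y, _ = librosa.load('some_song.mp3', sr=SR, mono=True)
 starts = range(0, max(1, len(y) - WIN), HOP)
 feats = np.stack([extract_feature(y[s:s + WIN]) for s in starts])   # (n_windows, feature_dim)
 probs = model.predict(feats)    # one categorical prediction per sliding window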

@lvaleriu
Author

lvaleriu commented Mar 31, 2017

I have 2 tasks:
1. I need to separate the music parts from the speech parts in audio files. So I thought it would be a good idea to use this dataset, since it is already there. Otherwise, I have my own manually selected dataset with much more data.

2. I need to classify the language of songs that contain voice, as I mentioned at the beginning of our discussion. I have a fairly big dataset that was chosen manually, and I can still augment it if needed. But before training on it I thought I might filter out the non-singing parts of the songs (like the instrumental parts). So training on a dataset like Jamendo could help (at least in my mind).

I hope I made it clearer.

@keunwoochoi
Owner

Okay.

  1. Could you try some classifier other than a 1-hidden-layer neural network? E.g. an SVM.
  2. That's a well-known task, and you can check out papers that cite the Jamendo dataset. As you can see, my network's input is much longer than the decision resolution you'd like to have, so it's not really suitable.
