
Suggestion needed #26

Closed
lvaleriu opened this issue Mar 30, 2017 · 14 comments

Comments

@lvaleriu

I'm trying to use the compact CNN features for singing-language classification of music. Do you have any suggestions about which layer I should choose?

Thanks!

@keunwoochoi
Owner

keunwoochoi commented Mar 30, 2017

How different are the classes? What is the dataset size? As a quick first attempt, I'd try https://github.com/keunwoochoi/transfer_learning_music with ALL layer features concatenated + PCA or some fancy feature selection + SVM/random forest/etc. (as suggested in the paper).

I'd assume the classes differ somewhat in their sound beyond the language itself, since the language of a song usually correlates with other cultural aspects, which in turn correlate with the sound. If the songs are almost identical except for the language, it sounds like quite a challenging task ;)
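
A minimal sketch of that suggestion, assuming the per-layer convnet features have already been extracted into numpy arrays (e.g. with the transfer_learning_music code); the names layer_feats and labels below are placeholders, not part of the repo:

 import numpy as np
 from sklearn.pipeline import make_pipeline
 from sklearn.preprocessing import StandardScaler
 from sklearn.decomposition import PCA
 from sklearn.svm import SVC
 from sklearn.model_selection import cross_val_score

 # layer_feats: list of (n_songs, n_dims) arrays, one per layer; labels: one language tag per song
 X = np.concatenate(layer_feats, axis=1)   # all layer features concatenated per song
 y = np.asarray(labels)                    # e.g. 'it', 'es', 'fr', 'de', 'en'

 clf = make_pipeline(StandardScaler(), PCA(n_components=64), SVC(kernel='rbf', C=1.0))
 print(cross_val_score(clf, X, y, cv=5).mean())   # 5-fold cross-validated accuracy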

@lvaleriu
Author

5 languages: it, es, fr, de, en
500 mp3 files (44100 Hz, 30 s duration) per language - 2500 files in total
5+ different genres (but mainly vocal songs - of course, all contain some instrumental parts that I should maybe filter out)

@lvaleriu
Author

Did you test extracting features for short durations when training on audio datasets? E.g. MFCC features over 1, 2, or 5 seconds.

@keunwoochoi
Owner

Yeah, (as also described in the paper) I repeat them to make 29-second signals, so that the feature doesn't get distorted by lots of zeros.
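
For reference, a rough sketch of that repetition trick, assuming a 12 kHz mono input and a 29-second target length (adjust sr and length to whatever your feature extractor expects; the file name is hypothetical):

 import numpy as np
 import librosa

 SR = 12000                 # assumed sample rate of the feature extractor
 N_TARGET = SR * 29         # fixed 29-second input length

 y, _ = librosa.load('short_clip.mp3', sr=SR, mono=True)
 n_repeat = int(np.ceil(N_TARGET / float(len(y))))
 y_tiled = np.tile(y, n_repeat)[:N_TARGET]   # repeat the clip instead of zero-padding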

@lvaleriu
Author

Oh, I understand better now.

@lvaleriu
Author

One more question: I trained on the GTZAN music-vs-speech dataset with a shallower network (more like a LeNet for audio, on MFCC features) and I get a pretty good score. But when I use this network to predict on other songs, I sometimes get good results and sometimes very bad ones. For example, I have many examples of people talking (no music at all) where the prediction is bad (a score of 0.4, taking the mean over the song duration since I train on 1, 2, and 5 second features, where 0 should be close to speech and 1 close to music).

Do you encounter this situation when you use transfer learning, get good accuracy on a dataset, and then try to use the classifier in a "real-world" setting? Do the features (low- or high-level) from the layers still perform well outside the dataset?
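
For clarity, the per-window averaging described above would look roughly like this; model and windows are placeholders for the trained MFCC classifier and the stacked short-window (1-5 s) features of one file:

 import numpy as np

 probs = model.predict(windows)    # one prediction per 1/2/5-second window
 song_score = probs.mean(axis=0)   # mean over the song duration
 # with a single sigmoid output: close to 0 -> speech, close to 1 -> music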

@keunwoochoi
Owner

Hm, I tested it on 6 different datasets in https://github.com/keunwoochoi/transfer_learning_music and most of them (if not all) consist of real songs. So yes, I think they do.

@lvaleriu
Author

lvaleriu commented Mar 31, 2017

I'm trying to train on the GTZAN speech-vs-music dataset using the method described in transfer_learning_music.

Can I use a simple network like the following and hope to get results? I'm asking because I've reached a validation accuracy of more than 0.95, but when I predict on other songs I always get predicted_y = [0, 1], for any song. How far should the validation loss decrease? Right now it is smaller than 0.1.

 # Keras 1.x-style imports matching the snippet below
 from keras.models import Sequential
 from keras.layers import InputLayer, Flatten, Dense, Dropout
 from keras.optimizers import Adam

 model = Sequential()
 model.add(InputLayer(input_shape=(1, 160)))   # one 160-d transfer feature vector per example
 model.add(Flatten())
 model.add(Dense(256, activation='relu'))
 model.add(Dropout(0.5))
 model.add(Dense(2, activation='softmax'))     # 2-way softmax: speech vs music
 model.compile(loss='categorical_crossentropy', metrics=['accuracy'], optimizer=Adam(lr=1e-4))

x.shape =(103, 1, 160)
y.shape =(103, 2)

vx.shape =(26, 1, 160)
vy.shape =(26, 2)

Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 1, 160)        0                                            
____________________________________________________________________________________________________
flatten_2 (Flatten)              (None, 160)           0           input_1[0][0]                    
____________________________________________________________________________________________________
dense_1 (Dense)                  (None, 256)           41216       flatten_2[0][0]                  
____________________________________________________________________________________________________
dropout_1 (Dropout)              (None, 256)           0           dense_1[0][0]                    
____________________________________________________________________________________________________
dense_2 (Dense)                  (None, 2)             514         dropout_1[0][0]                  

Total params: 41,730
Trainable params: 41,730
Non-trainable params: 0
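
For completeness, a hypothetical training/prediction call with the arrays above (argument names follow the Keras 1.x API used in the snippet; Keras 2 uses epochs= instead of nb_epoch=):

 model.fit(x, y, validation_data=(vx, vy), nb_epoch=200, batch_size=16)
 probs = model.predict(vx)   # raw softmax probabilities, not rounded 0/1 labels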

Thanks again for your precious advice.


@lvaleriu
Author

Here is a screenshot of the training process:

[screenshot: training and validation curves]

@keunwoochoi
Owner

"I always get predicted_y = [0, 1], for any song"

What does that mean?

@keunwoochoi
Owner

The graphs seem alright overall. But GTZAN speech/music is the least interesting and most trivial problem among the 6 tasks in the paper. I mean, no one uses deep learning for 129 data samples.

But I also think it should be fine with random songs. If you're really into this problem, a t-SNE of all the training/validation/test features plus the out-of-dataset songs might tell you something.
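
A quick sketch of that t-SNE idea, assuming the 160-d transfer features for the train/validation/test splits and for some out-of-dataset songs are already available as numpy arrays (the feats_* names are placeholders):

 import numpy as np
 import matplotlib.pyplot as plt
 from sklearn.manifold import TSNE

 splits = [('train', feats_train), ('valid', feats_valid),
           ('test', feats_test), ('new songs', feats_new)]
 X = np.vstack([f for _, f in splits])
 emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

 # plot each split with its own colour and see where the new songs land
 start = 0
 for name, f in splits:
     plt.scatter(emb[start:start + len(f), 0], emb[start:start + len(f), 1], s=10, label=name)
     start += len(f)
 plt.legend()
 plt.show()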

@lvaleriu
Author

"I have always the value predicted_y = [0, 1] for any song
What does it mean?" -> A stupid thing. I have rounded the results when showing them. Sorry for that.

But anyway, the problem I still have is that when I try to predict on 2 audio files (one instrumental, the other containing only speech - someone reading a story for children, with no background music), I still obtain something like the plots below. The 2 pictures shouldn't look the same.
(To obtain these graphs I extract sliding windows of 29 seconds with a step of 3000 samples and generate a feature for each window. In the end I have a batch of features as input for one file, so I get a batch of predictions, one per sliding window. The prediction is categorical: [0, 1] for music and [1, 0] for speech - or the other way around.)

[per-window prediction plot for each of the two files]
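
Roughly, the windowing described above looks like this; extract_feature and model stand in for whatever feature extractor and classifier are actually used, and the sample rate and file name are assumptions:

 import numpy as np
 import librosa

 SR = 12000                      # assumed sample rate of the extractor
 WIN = SR * 29                   # 29-second window
 HOP = 3000                      # step of 3000 samples

 y, _ = librosa.load('some_song.mp3', sr=SR, mono=True)
 starts = range(0, max(1, len(y) - WIN), HOP)
 feats = np.stack([extract_feature(y[s:s + WIN]) for s in starts])   # (n_windows, feature_dim)
 probs = model.predict(feats)    # one categorical prediction per sliding window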

@lvaleriu
Author

lvaleriu commented Mar 31, 2017

I have 2 tasks:
1. I need to separate the music parts from the speech parts in audio files. So I thought it would be a good idea to use this dataset, since it is already there. Otherwise, I have my own manually selected dataset with much more data.

2. I need to classify the language of songs that contain voice, as I mentioned at the beginning of our discussion. I have a fairly big dataset that was chosen manually, and I can still augment it if needed. But before training on it I thought I might filter out the non-singing parts of the songs (like the instrumental parts). So training on a dataset like Jamendo could help (at least in my mind).

I hope I made it clearer.

@keunwoochoi
Owner

Okay.

  1. Could you try some classifier other than a 1-hidden-layer neural network? E.g. an SVM.
  2. That's a well-known task, and you can check out papers that cite the Jamendo dataset. As you can see, my network's input is much longer than the decision resolution you'd like to have, so it's not really suitable.
