
Silence / Background Noise similarity #62

Closed
Tomas1337 opened this issue Aug 6, 2020 · 18 comments

Comments

@Tomas1337

I've been having fun playing with your pre-trained model and implementation!

I've noticed a phenomenon that could be a point of improvement. When you record silence or background noise and extract the features from it, say silent_features, it has a strong cosine similarity to everything. I was wondering: if you train the model with various background noises / silences in the train set, all labeled as silence, would it learn to predict them and distinguish them from voices?
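For reference, the similarity measure in question is just the cosine between embedding vectors; a minimal sketch of how one would compare them (the embeddings below are random stand-ins, not real model output):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-ins for embeddings extracted from an utterance and a silent clip.
rng = np.random.default_rng(0)
emb_speaker = rng.normal(size=512)
emb_silence = rng.normal(size=512)

same = cosine_similarity(emb_speaker, emb_speaker)   # identical vectors -> 1.0
cross = cosine_similarity(emb_speaker, emb_silence)  # unrelated random vectors -> near 0
```

The issue described above is that a real silence embedding behaves like `same` against everything rather than like `cross`.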

@philipperemy
Owner

philipperemy commented Aug 6, 2020

Happy to hear that!

So from what I can say, the model was trained on clean speech without silence nor background noise. So technically, the model has only heard clear voices so far. If I can draw a parallel with a simple cat/dog classifier, it would be like showing a car to the model. It would either predict a cat or a dog.

if you train the model with various background noises / silences in the train set, all labeled as silence, would it learn to predict them and distinguish them from voices?

Yes, that's true. I'm sure the model is smart enough to learn this too.

@w1nk

w1nk commented Nov 12, 2020

Hello!

I've taken the repo/dataset and combined it with the VoxCeleb2 dataset (6,112 speakers). I also added a 'speaker' composed of a bunch of noise/silence samples. After processing the VoxCeleb data into the same format as the LibriSpeech data (FLAC, 16 kHz, 24-bit samples), I made another pass over both datasets and, for every utterance, created 2 new training examples mixed with random noise selected from https://github.com/microsoft/MS-SNSD . That resulted in around 730 GB of training data. I've added 1k speakers to the initial classifier/softmax training and am currently running that training. Once it's complete I'll run the triplet-loss training and share the code/weights. I'm running it on a 2080 Ti with 64 GB of RAM, and I needed a bit over 200 GB of swap space to keep the OOM killer at bay. An epoch currently takes slightly over 1 hour.
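For anyone wanting to reproduce the noise-augmentation step, mixing a noise clip into an utterance at a chosen SNR boils down to rescaling the noise. A rough sketch (file loading from MS-SNSD is left out; the arrays are assumed to be float waveforms of equal length at the same sample rate):

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Add `noise` to `speech` so the mixture has roughly `snr_db` dB SNR.
    Both arguments are float waveform arrays of the same length."""
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Scale the noise so that speech_power / noise_power matches the target SNR.
    target_noise_power = speech_power / (10.0 ** (snr_db / 10.0))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise
```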

Talk to you in a week or two :)

@philipperemy
Owner

@w1nk AWESOME! Please let us know how it goes :)

@w1nk

w1nk commented Nov 18, 2020

Just an update:

I ended up needing to switch TensorFlow versions (from 2.2 to 2.3); 2.2 has a nasty memory leak that was getting triggered. Once I got things running stably, the softmax network converged / I early-stopped it at epoch 38 and started training with the triplet loss. That network is still training, but is getting close:

2000/2000 [==============================] - 815s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0230 - val_loss: 0.0221
Epoch 336/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0229 - val_loss: 0.0228
Epoch 337/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0227 - val_loss: 0.0227
Epoch 338/1000
2000/2000 [==============================] - 813s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0226 - val_loss: 0.0219
Epoch 339/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0227 - val_loss: 0.0218
Epoch 340/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0221 - val_loss: 0.0219
Epoch 341/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0223 - val_loss: 0.0216
Epoch 342/1000
2000/2000 [==============================] - 814s 407ms/step - batch: 999.5000 - size: 192.0000 - loss: 0.0222 - val_loss: 0.0215

Looks like it's fitting nicely, and spot checks of some of the later epochs look pretty good as well. I'll find somewhere to put the checkpoints and a couple of the preparation scripts.

@philipperemy
Owner

@w1nk very cool!

@ntdat017

How do you split the train/val/test datasets? I found in the code that train/val/test come from the same speakers. Have you tried splitting the dataset so each split uses different speakers? I'm also curious about your results.

@w1nk

w1nk commented Nov 19, 2020

Hey @ntdat017, I haven't modified the training harness at all so the validation split is being calculated how it's written. For test, I've got a holdout set of data from the voxceleb dataset that I'll use to perform the evaluation.

@w1nk

w1nk commented Nov 27, 2020

Sorry for the delay, it's been a busy week. The triplet training finally converged after a bit over 600 epochs. I haven't had a chance to fully evaluate the output yet, but I've gone ahead and uploaded the checkpoints and some helper scripts I used in case anyone reading along is interested.

https://drive.google.com/drive/folders/1EExljgrj3kP-ciUzrsdoWYE5OT14_7Aa

sha256 hashes:
b71ca16f8364605a8234c9458f8b2b5ae8c2e0f7ca1d551de4d332acdb40ab90 ResCNN_softmax_checkpoint_38.h5
d86a3ac61a427bbc6f425e3b561dd9ed28f57b789f0eb4bf04d3434113f86dab ResCNN_triplet_checkpoint_613.h5

There are 3 files there, the 2 checkpoints (softmax + triplet) and a tar file with some helper scripts. The helper python scripts probably don't run out of the box, but they're pretty simple and should be easy to fix up.

process_vox.py - this generates a file of bash commands that can be split up and executed, converting the VoxCeleb speech files to the correct naming scheme and encoding (requires ffmpeg with FLAC support).
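For context, the conversion each generated command performs is roughly the following. This is an illustrative sketch, not the script's actual logic: the speaker/session/utterance directory layout and the output naming are assumptions.

```python
from pathlib import Path

def conversion_command(src_path, out_root="vox_processed"):
    """Build one ffmpeg command converting a VoxCeleb .m4a utterance to
    16 kHz mono FLAC under a LibriSpeech-style name. Assumes the source
    path looks like <root>/<speaker>/<session>/<utterance>.m4a."""
    src = Path(src_path)
    speaker, session, utt = src.parts[-3], src.parts[-2], src.stem
    dst = Path(out_root) / speaker / f"{speaker}-{session}-{utt}.flac"
    # -ar sets the output sample rate, -ac 1 downmixes to mono.
    return f"ffmpeg -y -i {src} -ar 16000 -ac 1 {dst}"
```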

create_noise.py - this will use random noise samples from https://github.com/microsoft/MS-SNSD to generate 'noisy' versions of each input audio clip.

@philipperemy
Owner

@w1nk that's really awesome!!!! I'm going to have a look this weekend.

@demonstan

I got an error when loading this model.

model = keras.models.load_model('ResCNN_triplet_checkpoint_613.h5', compile=False)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\save.py", line 182, in load_model
    return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\hdf5_format.py", line 178, in load_model_from_hdf5
    custom_objects=custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\saving\model_config.py", line 55, in model_from_config
    return deserialize(config, custom_objects=custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\serialization.py", line 175, in deserialize
    printable_module_name='layer')
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 358, in deserialize_keras_object
    list(custom_objects.items())))
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 617, in from_config
    config, custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 1204, in reconstruct_from_config
    process_layer(layer_data)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\engine\functional.py", line 1186, in process_layer
    layer = deserialize_layer(layer_data, custom_objects=custom_objects)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\serialization.py", line 175, in deserialize
    printable_module_name='layer')
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 358, in deserialize_keras_object
    list(custom_objects.items())))
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\core.py", line 1006, in from_config
    config, custom_objects, 'function', 'module', 'function_type')
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\layers\core.py", line 1058, in _parse_function_from_config
    config[func_attr_name], globs=globs)
  File "D:\Python\Python36\lib\site-packages\tensorflow\python\keras\utils\generic_utils.py", line 457, in func_load
    code = marshal.loads(raw_code)
ValueError: bad marshal data (unknown type code)

What versions of Keras, TensorFlow, and Python are you using?
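For what it's worth, "bad marshal data" typically points at a Python version mismatch: Keras serializes Lambda layer functions with `marshal`, whose bytecode format is tied to the interpreter that wrote it. A standalone illustration of the mechanism (not a fix):

```python
import marshal

# Compile a tiny expression and round-trip it through marshal, the same
# machinery Keras uses to store Lambda layer functions inside the .h5 file.
code = compile("x + 1", "<lambda>", "eval")
raw = marshal.dumps(code)

# Loading succeeds only under a compatible interpreter; bytes written by a
# different Python version raise "ValueError: bad marshal data".
restored = marshal.loads(raw)
result = eval(restored, {"x": 41})
```

So loading the checkpoint under the same Python version it was saved with (per the repo's requirements.txt) is the likeliest remedy.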

@philipperemy
Owner

@demonstan the ones specified in the requirements.txt of the repo.

@demonstan

@w1nk Did you perform evaluation on any dataset?

@w1nk

w1nk commented Nov 30, 2020

@demonstan I haven't had a chance to perform the evaluation fully yet. Since I trained on all of LibriSpeech and all of the VoxCeleb2 training data, I first need to convert/rename the VoxCeleb2 test set into the correct format and then evaluate on that.

As for loading, it should load with TF 2.1/2.2/2.3 (I tried all of them) as well as 1.15. I was loading the model across those versions trying to get the tflite/Coral compilation to work (hint: I haven't yet, due to a Coral compiler issue).

@demonstan

May I ask why you're not using the EarlyStopping and ReduceLROnPlateau callbacks here?

deep-speaker/train.py

Lines 40 to 42 in 7742796

dsm.m.fit(x=train_generator(), y=None, steps_per_epoch=2000, shuffle=False,
          epochs=1000, validation_data=test_generator(), validation_steps=len(test_batches),
          callbacks=[checkpoint])

@philipperemy
Owner

@demonstan they could be used indeed. It's just that I always saw the loss decreasing steadily and I didn't think it was a necessity. Overfitting on this dataset would have been a pretty big challenge. The loss looked like an exponentially decreasing function on both the training and testing sets.
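For reference, the rule EarlyStopping applies is small enough to sketch directly (this is a simplified version; the patience and min_delta values are just examples, not the Keras defaults):

```python
def should_stop(val_losses, patience=5, min_delta=1e-4):
    """Return True once val_loss has not improved by at least `min_delta`
    over the last `patience` epochs -- the gist of EarlyStopping."""
    if len(val_losses) <= patience:
        return False
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta
```

On a steadily decreasing loss like the logs above, this rule simply never fires, which matches the observation that the callbacks weren't strictly needed.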

@1shershah

It may also be helpful to use SoX to remove silence and background noise. That's what I usually do: denoise, split by silence, and then compute embeddings.
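A sketch of that SoX pipeline as command strings (filenames and thresholds are placeholders; `noiseprof`, `noisered`, and `silence` are standard SoX effects, but the parameters need tuning per recording):

```python
def sox_clean_commands(src="utt.wav", denoised="utt_denoised.wav",
                       trimmed="utt_trimmed.wav", profile="noise.prof"):
    """Command strings for: build a noise profile from the first 0.5 s
    (assumed non-speech), denoise, then strip leading/trailing silence."""
    return [
        # 1. Profile the background noise from an assumed-silent stretch.
        f"sox {src} -n trim 0 0.5 noiseprof {profile}",
        # 2. Apply noise reduction (0.21 is a common starting strength).
        f"sox {src} {denoised} noisered {profile} 0.21",
        # 3. Trim silence below 1% amplitude from both ends
        #    (reverse twice to trim the tail as well as the head).
        f"sox {denoised} {trimmed} silence 1 0.1 1% reverse silence 1 0.1 1% reverse",
    ]
```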

@philipperemy
Owner

Good point.

@philipperemy
Owner

Linked to the README for reference.
