Speech Command Recognition

Authors:

This project was conducted as part of a deep learning course, focusing on speech command recognition using recurrent neural network (RNN) architectures. The objective was to develop models capable of accurately classifying spoken commands from audio samples. The dataset used for training and evaluation consisted of speech command recordings, categorized into various classes representing different spoken commands.

Training

Set up the training environment and install the required packages listed in the requirements.txt file.
Navigate to the experiments directory.
Each model type has its own directory containing configuration files for training. These files define the model architecture and the training parameters.

For LSTM, GRU, and RNN models, navigate to the LSTM directory and run training with:

python lstm.py --SEED <seed>
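To repeat the command above for several seeds, the runs can be scripted. The following is a minimal sketch, assuming it is launched from the LSTM directory. Only the `--SEED` flag comes from the command above; the seed values 1, 2, 3 and the helper names `build_command` and `run_all` are illustrative assumptions.

```python
import subprocess
import sys
from typing import Iterable, List


def build_command(seed: int) -> List[str]:
    """Build the training command for one seed (mirrors `python lstm.py --SEED <seed>`)."""
    return [sys.executable, "lstm.py", "--SEED", str(seed)]


def run_all(seeds: Iterable[int], dry_run: bool = True) -> List[List[str]]:
    """Run (or, with dry_run=True, only print) one training command per seed."""
    commands = [build_command(seed) for seed in seeds]
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            # check=True stops the loop if a training run exits with an error.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Seed values are assumptions; replace them with the seeds you want to use.
    run_all([1, 2, 3], dry_run=True)
```

With `dry_run=True` the script only prints the commands, which is a cheap way to check them before starting long training runs.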

For Whisper and AST models, navigate to the whisper and AST directories, respectively, and run training with:

python main_trainer.py

Model checkpoints

Model checkpoints are available in the LSTM/results and whisper/checkpoints directories. The AST checkpoints are not included because of their large size; however, all configuration files used for training are provided and can be used to reproduce the results.

Results for RNN, LSTM, and GRU based models

All experiments were run 3 times with different random seeds. Results are reported as the average accuracy ± standard deviation across these runs.

Model Name	Num. Layers	Hidden Size	Avg. Accuracy
GRUModel	4	256	0.837 ± 0.004
LSTMModel	4	256	0.846 ± 0.004
RNNModel	4	16	0.255 ± 0.038
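Entries of the form "average ± standard deviation" in the table above can be produced from per-seed accuracies as sketched below. The accuracy values in `example_runs` are made-up placeholders, not results from this project. The source does not say whether the sample or population standard deviation was used; this sketch assumes the sample standard deviation.

```python
from statistics import mean, stdev
from typing import Sequence


def summarize(accuracies: Sequence[float]) -> str:
    """Format per-seed accuracies as 'mean ± std' with three decimals."""
    avg = mean(accuracies)
    # stdev() needs at least two values; report 0.0 for a single run.
    spread = stdev(accuracies) if len(accuracies) > 1 else 0.0
    return f"{avg:.3f} ± {spread:.3f}"


if __name__ == "__main__":
    example_runs = [0.84, 0.85, 0.86]  # hypothetical accuracies for three seeds
    print(summarize(example_runs))
```

Using `statistics.pstdev` instead of `stdev` would give the population standard deviation, which is slightly smaller for three runs.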

Results for Whisper and AST Models

Model Name	Pretrained	Avg. Accuracy
AST	✓	0.865 ± 0.000
AST	X	0.680 ± 0.004
Whisper	✓	0.845 ± 0.002
Whisper	X	0.673 ± 0.008

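From these results, we observe that, among the recurrent models, the LSTM and GRU models with 4 recurrent layers and a hidden size of 256 achieve the highest accuracy (0.846 and 0.837, respectively), while the plain RNN with a hidden size of 16 reaches only 0.255. The pretrained AST achieves the best accuracy overall (0.865). Pretraining improves both AST (0.865 vs. 0.680) and Whisper (0.845 vs. 0.673), which highlights the value of transfer learning for speech recognition tasks.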

Results for models combined with a silence detection model

All models struggled to detect silence, which is a separate class in the dataset (this is visible in the confusion matrices available in LSTM/results, whisper/checkpoints, and AST/checkpoints). To improve the results, we combined each model with a silence detection model. Results are reported only for the best-performing models, using seed 1.

Model Name	Pretrained	Accuracy	Accuracy with Silence Detection Model
LSTMModel	X	0.846	0.887
AST	✓	0.865	0.971
Whisper	✓	0.845	0.950
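The source does not describe how the classifier and the silence detection model are combined. One plausible arrangement, shown here purely as an assumption, is a gate: the silence detector runs first, and the command classifier is consulted only when the clip is not judged silent. The names `combined_predict`, `energy_detector`, and `dummy_classifier`, the label `"silence"`, and the energy threshold are all hypothetical stand-ins for the trained models.

```python
from typing import Callable, Sequence

SILENCE_LABEL = "silence"  # hypothetical label name for the silence class


def combined_predict(
    clip: Sequence[float],
    is_silent: Callable[[Sequence[float]], bool],
    classify_command: Callable[[Sequence[float]], str],
) -> str:
    """Return the silence label if the detector fires, else the classifier's label."""
    if is_silent(clip):
        return SILENCE_LABEL
    return classify_command(clip)


def energy_detector(clip: Sequence[float], threshold: float = 1e-3) -> bool:
    """Toy stand-in for a trained silence detector: mean squared amplitude below a threshold."""
    if not clip:
        return True
    energy = sum(x * x for x in clip) / len(clip)
    return energy < threshold


def dummy_classifier(clip: Sequence[float]) -> str:
    """Toy stand-in for the LSTM, AST, or Whisper command classifier."""
    return "yes"


if __name__ == "__main__":
    quiet_clip = [0.0] * 100
    loud_clip = [0.5, -0.4, 0.3]
    print(combined_predict(quiet_clip, energy_detector, dummy_classifier))
    print(combined_predict(loud_clip, energy_detector, dummy_classifier))
```

A gate of this kind keeps the command classifier unchanged, so the silence detector can be trained and evaluated separately.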
