## Results and Discussion

### Model Performance Overview
The **RNN baseline** was evaluated using bidirectional GRU layers to model temporal dependencies in the AudioMNIST dataset.  
Training remained stable and converged consistently across the training, validation, and test evaluations.  
After normalization and masking of padded frames, the model achieved **99.9 % training accuracy**, **98.7 % validation accuracy**, and **99.2 % test accuracy**.  
The accuracy and loss curves (Figures 1 and 2) confirm smooth convergence and minimal overfitting, while the confusion matrix (Figure 3) demonstrates precise classification across all ten digits.

The **training and validation accuracy** curves (Figure 1) converge rapidly by the second epoch and remain above 0.98 for the remainder of training.  
**Loss curves** (Figure 2) show an equally consistent decline, with validation loss stabilizing near 0.07, which indicates efficient learning and good generalization.  
These outcomes indicate that the RNN captured robust temporal features without extensive hyperparameter tuning.

### Baseline RNN
The RNN architecture comprised two bidirectional GRU layers (128 and 64 units) followed by dropout-regularized dense layers.  
A Masking layer ensured that only meaningful audio frames contributed to training, while padded segments were ignored during backpropagation.  
This architecture achieved high accuracy with low variance between training and validation performance, confirming that the RNN effectively learned temporal patterns inherent in spoken digits.
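The effect of masking on a recurrent layer can be illustrated with a minimal, framework-free sketch. It is not the model's actual code: the toy recurrence `final_state`, the helper `pad_with_mask`, the `decay` value, and the scalar "frames" are all assumptions made for illustration (real audio frames are feature vectors, and a GRU update is far richer). The point is only that padded steps change the final state unless they are skipped.

```python
from typing import List, Sequence, Tuple


def pad_with_mask(
    seqs: Sequence[Sequence[float]], pad_value: float = 0.0
) -> Tuple[List[List[float]], List[List[bool]]]:
    """Right-pad every sequence to the longest length and return (padded, mask)."""
    max_len = max(len(s) for s in seqs)
    padded = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
    # True marks a real frame, False marks a padded frame.
    mask = [[True] * len(s) + [False] * (max_len - len(s)) for s in seqs]
    return padded, mask


def final_state(frames: Sequence[float], mask: Sequence[bool], decay: float = 0.5) -> float:
    """Toy recurrence (hypothetical stand-in for a GRU): masked steps keep the state."""
    h = 0.0
    for x, keep in zip(frames, mask):
        if keep:
            h = decay * h + (1.0 - decay) * x
        # When keep is False the previous state is carried forward unchanged.
    return h


# Two toy utterances of different lengths (illustrative values only).
utterances = [[1.0, 1.0, 1.0], [1.0]]
padded, mask = pad_with_mask(utterances)

short_masked = final_state(padded[1], mask[1])                    # padding skipped
short_unmasked = final_state(padded[1], [True] * len(padded[1]))  # padding processed

print(short_masked, short_unmasked)
```

In this toy case the short utterance ends with a state of 0.5 when padding is skipped and 0.125 when it is processed, so without masking the representation would depend on how much padding each clip received.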

The model surpassed the **fine-tuned CNN**, which achieved **97.6 % test accuracy** and **0.071 test loss**.  
While the CNN effectively captured spatial features from spectrograms, the RNN leveraged temporal dynamics, allowing it to recognize contextual relationships across time.  
This capability enabled the RNN to reach **99.2 % test accuracy**, outperforming all CNN variants with fewer epochs and stable validation behavior.

### Visual Analysis
Figures 1 and 2 illustrate the RNN’s learning dynamics.  
Validation accuracy quickly approached training accuracy within the first few epochs, indicating consistent feature learning across splits.  
The minimal gap between training and validation curves reflects effective regularization and stable gradient flow.  
Figure 3, the confusion matrix, confirms the model’s high discriminative power: most predictions lie along the diagonal, with only minor confusion between phonetically similar digits (“eight” and “nine”).
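A confusion matrix of this kind, and the most frequently confused digit pair, can be derived from true and predicted labels as in the sketch below. The label lists are hypothetical toy values, not the study's predictions, and the helper names are chosen here for illustration.

```python
from typing import List, Sequence, Tuple


def confusion_matrix(y_true: Sequence[int], y_pred: Sequence[int], n_classes: int = 10) -> List[List[int]]:
    """Rows index the true digit, columns the predicted digit."""
    cm = [[0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def per_class_recall(cm: List[List[int]]) -> List[float]:
    """Diagonal count divided by row total; 0.0 for digits with no samples."""
    return [row[i] / sum(row) if sum(row) else 0.0 for i, row in enumerate(cm)]


def most_confused_pair(cm: List[List[int]]) -> Tuple[int, int, int]:
    """Return (true_digit, predicted_digit, count) for the largest off-diagonal cell."""
    best = (0, 0, 0)
    for t, row in enumerate(cm):
        for p, count in enumerate(row):
            if t != p and count > best[2]:
                best = (t, p, count)
    return best


# Hypothetical labels for illustration only.
y_true = [8, 8, 8, 9, 9, 9, 3, 3]
y_pred = [8, 8, 9, 9, 9, 9, 3, 3]

cm = confusion_matrix(y_true, y_pred)
recall = per_class_recall(cm)
pair = most_confused_pair(cm)
print(pair)  # (8, 9, 1): one "eight" predicted as "nine" in the toy data
```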

### Interpretation
The RNN’s strong performance is attributed to three key factors:
- **Temporal modeling** via bidirectional GRUs captured contextual relationships across time that the spectrogram-based CNNs did not exploit.  
- **Normalization and zero-padding** with proper masking prevented bias from uneven sequence lengths.  
- **Dropout regularization** maintained generalization without extensive data augmentation.

The model’s near-perfect accuracy and low loss on held-out data suggest that the RNN learned generalized representations of digit pronunciations rather than memorizing specific examples.  
This result is consistent with the view that recurrent architectures are well suited to sequential tasks where time-dependent variation matters (Goodfellow, Bengio, & Courville, 2016).

### Why the RNN Baseline Is the Best Model
Although the **fine-tuned CNN** achieved excellent results (97.6 % test accuracy), the **RNN baseline** represents the most effective configuration for this dataset due to:
1. **Superior temporal awareness:** Captures sequential dependencies beyond static spectrogram features.  
2. **Higher accuracy:** Reached **99.2 % test accuracy** and **0.036 test loss**, surpassing the CNN series.  
3. **Stable convergence:** Maintained smooth learning curves with minimal divergence.  
4. **Simpler optimization:** Required no learning-rate scheduling or complex callbacks.  
5. **Efficiency:** Achieved the best performance in this study using fewer trainable parameters.
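The size of the recurrent part of the network can be estimated with a short calculation. The sketch below rests on several assumptions that the text does not state: an input feature dimension of 40 per frame (the hypothetical `N_FEATURES`), the GRU variant with separate input and recurrent biases (the Keras default with `reset_after=True`), and concatenated forward and backward outputs. Dense layers are left out, and no CNN figure is computed, so this is a back-of-envelope estimate rather than a verification of the comparison.

```python
def gru_params(input_dim: int, units: int, reset_after: bool = True) -> int:
    """Trainable parameters of one unidirectional GRU layer (3 gates)."""
    bias_per_gate = 2 * units if reset_after else units
    return 3 * (input_dim * units + units * units + bias_per_gate)


def bidirectional_gru_params(input_dim: int, units: int) -> int:
    """A bidirectional wrapper holds two independent GRUs."""
    return 2 * gru_params(input_dim, units)


N_FEATURES = 40  # assumption: features per audio frame, not stated in the text

layer1 = bidirectional_gru_params(N_FEATURES, 128)
# Concatenating both directions gives the second layer 2 * 128 input features.
layer2 = bidirectional_gru_params(2 * 128, 64)

print(layer1, layer2, layer1 + layer2)  # 130560 123648 254208
```

Under these assumptions the two recurrent layers hold roughly 254k parameters. The first layer's count scales with the true feature dimension, so the figure should be recomputed once that value is known.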

### Conclusion
The **RNN baseline model** provides the strongest performance in this study, delivering high accuracy, low loss, and robust generalization across speakers.  
It demonstrates that temporal modeling is crucial for spoken-digit recognition, and even a relatively lightweight recurrent network can outperform deeper CNNs when sequence dynamics are preserved.

---

### Comparative Summary

| Model Type | Key Enhancements | Validation Accuracy | Test Accuracy | Test Loss | Notes |
|-------------|------------------|---------------------|----------------|------------|--------|
| **Baseline CNN** | Basic 3-layer CNN with BatchNorm and Dropout(0.3) | ~0.85 (unstable) | 0.90 | 0.28 | Strong training accuracy but overfitting. |
| **Improved CNN** | Dropout + BatchNorm + Data Augmentation | 0.93 | 0.938 | 0.188 | Improved generalization and stability. |
| **Fine-Tuned CNN** | EarlyStopping + LR Scheduler | 0.974 | 0.976 | 0.071 | Excellent convergence and minimal overfitting. |
| **Baseline RNN (GRU)** | Bidirectional GRUs + Masking + Dropout | **0.987** | **0.992** | **0.036** | Best overall model, stable and highly accurate. |
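Because all accuracies sit close to 1.0, the gap between models is easier to read as an error rate. The sketch below recomputes it from the test accuracies in the table; the helper names are illustrative, and no test-set size is assumed, so only rates (not sample counts) are compared.

```python
# Test accuracies copied from the comparative summary table.
test_accuracy = {
    "Baseline CNN": 0.90,
    "Improved CNN": 0.938,
    "Fine-Tuned CNN": 0.976,
    "Baseline RNN (GRU)": 0.992,
}


def error_rate(accuracy: float) -> float:
    """Fraction of test samples misclassified."""
    return 1.0 - accuracy


def relative_error_reduction(acc_reference: float, acc_new: float) -> float:
    """Share of the reference model's errors removed by the new model."""
    ref_err = error_rate(acc_reference)
    return (ref_err - error_rate(acc_new)) / ref_err


reduction = relative_error_reduction(
    test_accuracy["Fine-Tuned CNN"], test_accuracy["Baseline RNN (GRU)"]
)
print(f"{reduction:.1%}")  # 66.7%
```

The 1.6-point accuracy gain from the fine-tuned CNN to the RNN therefore corresponds to a drop in test error from 2.4 % to 0.8 %, about two thirds fewer misclassifications.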

---

### Figure Captions (for paper)
- **Figure 1:** RNN training and validation accuracy curves across 15 epochs.  
- **Figure 2:** RNN training and validation loss curves showing smooth convergence.  
- **Figure 3:** Confusion matrix displaying near-perfect diagonal dominance and minimal misclassifications.