## Methodology

This project used supervised deep learning to classify spoken digits from the **AudioMNIST dataset** (Srinivasan, 2021), originally developed by Becker, Vielhaben, Ackermann, Müller, Lapuschkin, and Samek (2023). The Kaggle-hosted version provides standardized `.wav` files and metadata for all 60 speakers and 30,000 utterances, each labeled with its digit from 0 to 9.

Each audio recording was converted into a two-dimensional spectrogram to capture both temporal and frequency-domain features. **Log-Mel spectrograms** were used as the primary input representation because the Mel scale warps the linear frequency axis into a roughly logarithmic one (finer resolution at low frequencies, coarser at high frequencies) that more closely reflects human pitch perception (Logan, 2000; Purwins et al., 2019). This transformation emphasizes perceptually relevant frequency bands and discards spectral detail that matters less for human hearing, which in this project also improved training stability and validation accuracy.
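
To make the frequency warping concrete, here is a minimal sketch of one common Hz-to-Mel mapping, the HTK-style formula `m = 2595 * log10(1 + f / 700)`. The function names are illustrative, and the audio library used in this project may implement a different variant of the scale, so treat this as an illustration rather than the exact transform applied here.

```python
import math

def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to Mels (HTK-style formula)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)

def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

# The same 500 Hz step covers far fewer Mels at high frequencies
low_step = hz_to_mel(1000) - hz_to_mel(500)
high_step = hz_to_mel(7500) - hz_to_mel(7000)
print(f"500 Hz step near 1 kHz: {low_step:.1f} Mel")
print(f"500 Hz step near 7 kHz: {high_step:.1f} Mel")
```

Because Mel filters are spaced evenly on this warped axis, a Mel spectrogram devotes more of its bins to the low-frequency range where most speech energy sits.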

### Notes on Feature Selection

During early model development, I initially used standard STFT spectrograms as CNN input features.
Training accuracy was high, but validation accuracy fluctuated and varied between speakers.
After researching common approaches in audio classification, I learned that most modern systems use
**Log-Mel spectrograms** rather than linear-frequency spectrograms because they emphasize frequencies
that correspond more closely to human hearing. 

I confirmed this approach through Logan (2000), who examined Mel-scale modeling, and Purwins et al. (2019),
who demonstrated that Log-Mel features are standard in deep learning–based audio analysis. After switching
to Log-Mel spectrograms, model stability and validation accuracy both improved.

### Data Preprocessing
All audio files were standardized to a fixed sampling rate and duration, then transformed into Log-Mel spectrograms.
Amplitude normalization ensured consistent scaling across samples, and speaker-balanced splits were created
for training, validation, and test sets. The resulting tensors had shape *(128 × 128 × 1)* so that they could
serve as single-channel image inputs to the CNN.
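
The standardization steps can be sketched without any audio library. The example below is a hedged illustration, not the project's code: `fix_length`, `peak_normalize`, and `split_by_speaker` are hypothetical helpers, the 15% validation and test fractions are assumed, and the split shown is speaker-disjoint (whole speakers per split), which is one possible reading of "speaker-balanced".

```python
import random
from collections import defaultdict

def fix_length(samples, target_len):
    """Trim or zero-pad a sequence of samples to exactly target_len values."""
    trimmed = list(samples[:target_len])
    return trimmed + [0.0] * (target_len - len(trimmed))

def peak_normalize(samples):
    """Scale samples so the largest absolute value becomes 1.0 (silence is left as-is)."""
    peak = max((abs(s) for s in samples), default=0.0)
    return [s / peak for s in samples] if peak > 0 else list(samples)

def split_by_speaker(records, val_frac=0.15, test_frac=0.15, seed=0):
    """Assign whole speakers to splits so no speaker appears in two of them.

    records: iterable of (speaker_id, item) pairs.
    """
    by_speaker = defaultdict(list)
    for speaker_id, item in records:
        by_speaker[speaker_id].append(item)
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n_test = round(len(speakers) * test_frac)
    n_val = round(len(speakers) * val_frac)
    groups = {
        "test": speakers[:n_test],
        "val": speakers[n_test:n_test + n_val],
        "train": speakers[n_test + n_val:],
    }
    return {name: [item for spk in ids for item in by_speaker[spk]]
            for name, ids in groups.items()}

clip = peak_normalize(fix_length([0.1, -0.5, 0.25], target_len=5))
# Hypothetical metadata: 60 speakers x 10 clips, tagged "speaker_digit"
records = [(spk, f"{spk}_{digit}") for spk in range(60) for digit in range(10)]
splits = split_by_speaker(records)
```

Splitting by speaker rather than by clip keeps a speaker's voice out of both training and evaluation data, so test accuracy reflects performance on voices the model has not heard.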

### Model Architecture
The initial baseline model comprised three convolutional blocks with ReLU activation, batch normalization, and max pooling.
A global average pooling layer and dense softmax output completed the architecture.
This design follows common CNN structures for small-scale image and audio classification tasks (Chollet, 2018).
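
As a rough check on how the tensor shrinks through such a network, the sketch below traces shapes for a 128 × 128 × 1 input. The filter counts (32, 64, 128), the size-preserving ("same") convolutions, and the 2 × 2 pooling windows are assumptions chosen for illustration; the source does not state them.

```python
def trace_shapes(input_hw=(128, 128), filters=(32, 64, 128), num_classes=10):
    """Return (stage, shape) pairs for conv blocks, global average pooling, and softmax."""
    h, w = input_hw
    shapes = [("input", (h, w, 1))]
    for i, n_filters in enumerate(filters, start=1):
        # A 'same'-padded convolution keeps height and width;
        # the 2x2 max pooling that follows halves both.
        h, w = h // 2, w // 2
        shapes.append((f"block{i}", (h, w, n_filters)))
    # Global average pooling collapses each feature map to a single value
    shapes.append(("global_avg_pool", (filters[-1],)))
    shapes.append(("softmax", (num_classes,)))
    return shapes

for stage, shape in trace_shapes():
    print(f"{stage:>16}: {shape}")
```

Under these assumptions the last block yields a 16 × 16 × 128 tensor, which global average pooling reduces to a 128-value vector before the 10-way softmax, so the classifier head stays small.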

The improved model introduced several key enhancements:
- **Progressive Dropout (0.2–0.5)** to prevent overfitting by randomly deactivating neurons during training (Srivastava, Hinton, Krizhevsky, Sutskever, & Salakhutdinov, 2014).  
- **Batch Normalization** after each convolutional layer to stabilize gradients and accelerate convergence (Ioffe & Szegedy, 2015).  
- **Built-in Spectrogram Augmentation** using translation, zoom, and Gaussian noise layers to simulate acoustic variability.  
- **Adam Optimizer** for adaptive learning rate adjustment (Kingma & Ba, 2015).  
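
The dropout idea can be illustrated in a few lines of plain Python. This is a sketch of standard inverted dropout, not the Keras layer itself; spacing the rates linearly across three blocks is an assumption, since the source only gives the 0.2–0.5 range.

```python
import random

def progressive_rates(start=0.2, end=0.5, n_blocks=3):
    """Linearly spaced dropout rates, lowest in the first block."""
    if n_blocks == 1:
        return [start]
    step = (end - start) / (n_blocks - 1)
    return [round(start + i * step, 4) for i in range(n_blocks)]

def inverted_dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; rescale survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

rates = progressive_rates()  # [0.2, 0.35, 0.5]
rng = random.Random(0)
dropped = inverted_dropout([1.0] * 10_000, rate=0.5, rng=rng)
mean_after = sum(dropped) / len(dropped)  # stays close to 1.0 on average
```

Rescaling the surviving activations keeps their expected value unchanged, which is why dropout can simply be switched off at inference time without adjusting the weights.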

### Fine-Tuning and Optimization
To improve generalization and stability, two callbacks were integrated during fine-tuning:
1. **EarlyStopping**, which halts training when validation loss no longer improves.  
2. **ReduceLROnPlateau**, which decreases the learning rate by half when progress stagnates.  
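
How the two rules interact can be seen by replaying a validation-loss history through simplified versions of them. The sketch below is not the Keras implementation: the patience values, the starting learning rate of 1e-3, and the loss values are all hypothetical; only the halving factor comes from the description above.

```python
def simulate_callbacks(val_losses, lr=1e-3, factor=0.5, lr_patience=2, stop_patience=4):
    """Replay per-epoch validation losses through simplified stop and plateau rules.

    Returns (stopped_epoch, best_epoch, final_lr), with epochs counted from 1.
    """
    best_loss, best_epoch = float("inf"), 0
    lr_best, lr_wait = float("inf"), 0
    epoch = 0
    for epoch, loss in enumerate(val_losses, start=1):
        # Early-stopping bookkeeping: remember the best epoch seen so far
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        # Plateau rule: shrink the learning rate after lr_patience stale epochs
        if loss < lr_best:
            lr_best, lr_wait = loss, 0
        else:
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr, lr_wait = lr * factor, 0
        # Stop once the best epoch is stop_patience epochs behind
        if epoch - best_epoch >= stop_patience:
            break
    return epoch, best_epoch, lr

# Hypothetical loss history, not measured values from this project
losses = [0.60, 0.40, 0.30, 0.31, 0.25, 0.26, 0.27, 0.28, 0.29, 0.30]
result = simulate_callbacks(losses)  # (9, 5, 0.00025)
```

In this toy run the learning rate is halved twice while the loss stalls, training stops at epoch 9, and the weights worth restoring are those from epoch 5.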

Training was capped at 30 epochs but typically converged within 8–10.
Model performance was evaluated using validation and test accuracy, cross-entropy loss, and confusion matrix analysis.
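
For reference, the confusion matrix and accuracy can be computed from integer labels as in the sketch below. The helper names and the toy labels are illustrative only; rows index the true digit and columns the predicted digit, which is a convention chosen here rather than one stated in the source.

```python
def confusion_matrix(y_true, y_pred, num_classes=10):
    """Count predictions; rows index the true digit, columns the predicted digit."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true_label, pred_label in zip(y_true, y_pred):
        matrix[true_label][pred_label] += 1
    return matrix

def accuracy(matrix):
    """Fraction of samples on the diagonal (correct predictions)."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total if total else 0.0

# Toy labels for illustration, not project results
y_true = [0, 1, 2, 2, 4, 5]
y_pred = [0, 1, 2, 0, 4, 4]
cm = confusion_matrix(y_true, y_pred)
acc = accuracy(cm)  # 4 of 6 correct
```

Off-diagonal cells such as `cm[2][0]` show which digit pairs are confused, which overall accuracy alone cannot reveal.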

All experiments were implemented in Python using TensorFlow 2.15 and its Keras API in a Jupyter Notebook.
Training and visualization code was adapted from official TensorFlow tutorials and Keras documentation,
with modifications for dataset handling, augmentation, and model refinement.

---

## Future Work

Future iterations of this project could explore several directions to further improve model accuracy and robustness:

1. **Transfer Learning:**  
   Applying pretrained audio or vision models such as VGGish or EfficientNet could extract more generalized
   spectral features for smaller datasets.

2. **Feature Fusion:**  
   Combining Mel-spectrograms with MFCC (Mel-Frequency Cepstral Coefficients) or chroma features could
   enhance discrimination between acoustically similar digits.

3. **Noise Robustness Testing:**  
   Adding controlled background noise or reverberation to the test audio would show how well the model
   holds up in real-world environments.

4. **Explainable AI (XAI):**  
   Techniques such as Grad-CAM or saliency mapping could visualize which time-frequency regions
   influence the model’s predictions, increasing interpretability.

5. **Deployment as an Edge Model:**  
   Converting the final CNN to TensorFlow Lite could enable low-latency inference on mobile or embedded devices.

By extending the current pipeline with these methods, future work can focus on improving interpretability,
deployment feasibility, and real-world applicability in speech-driven AI systems.

---

## Model Evolution: From Baseline to Fine-Tuned CNN

The baseline CNN achieved strong training accuracy but exhibited clear signs of overfitting.
Its validation accuracy fluctuated significantly between epochs, suggesting that the model was memorizing
training samples rather than generalizing to unseen data. This instability is common in small datasets
when regularization and augmentation are insufficient (Goodfellow et al., 2016).

To address these limitations, several iterative improvements were introduced.

### Improved CNN (Stage 2)
The improved version incorporated internal data augmentation, progressive dropout, and batch normalization
after every convolutional layer. These changes provided multiple benefits:
- **Batch Normalization** stabilized gradient propagation and reduced sensitivity to parameter initialization
  (Ioffe & Szegedy, 2015).  
- **Dropout** (ranging from 0.2 to 0.5) prevented co-adaptation among neurons, improving generalization
  (Srivastava et al., 2014).  
- **Built-in augmentation** (random translation, zoom, and Gaussian noise) exposed the model to a wider range of spectrogram variations, reducing dependency on any single speaker’s acoustic profile.

These changes raised validation accuracy from roughly **0.85 to 0.93** and test accuracy from **0.90 to 0.938**, while markedly reducing the oscillations in validation loss. The improved CNN achieved a stable loss curve and a clear separation between correctly and incorrectly classified samples.

### Fine-Tuned CNN (Stage 3)
The final stage introduced **EarlyStopping** and **ReduceLROnPlateau** callbacks to dynamically manage the learning process. This adaptive control allowed the optimizer to take larger steps early on and make smaller refinements once the model approached convergence. The fine-tuned CNN reached **0.976 test accuracy** and **0.071 test loss**, while the confusion matrix confirmed near-perfect diagonal dominance.

### Comparative Summary
| Stage | Key Enhancements | Validation Accuracy | Test Accuracy | Test Loss |
|:------|:-----------------|:--------------------:|:--------------:|:-----------:|
| **Baseline CNN** | Standard 3-layer CNN | ~0.85 (unstable) | 0.90 | 0.28 |
| **Improved CNN** | Dropout + BatchNorm + Augmentation | 0.93 | 0.938 | 0.188 |
| **Fine-Tuned CNN** | EarlyStopping + LR Scheduler | **0.974** | **0.976** | **0.071** |

The progression illustrates that **regularization, augmentation, and adaptive learning rate control** collectively contributed to improved generalization and reduced overfitting. Each architectural and procedural adjustment targeted a specific weakness identified in the baseline training behavior, leading to measurable gains in accuracy and model stability.
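
The size of these gains is easier to judge as a reduction in test error than as a change in accuracy. The short calculation below uses only the test accuracies from the table above; the helper name is illustrative.

```python
test_accuracy = {"baseline": 0.90, "improved": 0.938, "fine_tuned": 0.976}

def relative_error_reduction(acc_before, acc_after):
    """Fraction of the earlier stage's test errors removed by the later stage."""
    err_before = 1.0 - acc_before
    err_after = 1.0 - acc_after
    return (err_before - err_after) / err_before

stage2 = relative_error_reduction(test_accuracy["baseline"], test_accuracy["improved"])
stage3 = relative_error_reduction(test_accuracy["improved"], test_accuracy["fine_tuned"])
overall = relative_error_reduction(test_accuracy["baseline"], test_accuracy["fine_tuned"])

print(f"Baseline to improved:   {stage2:.0%} fewer test errors")
print(f"Improved to fine-tuned: {stage3:.0%} fewer test errors")
print(f"Baseline to fine-tuned: {overall:.0%} fewer test errors")
```

Read this way, each stage removed a substantial share of the remaining errors: about 38% in the second stage, about 61% in the third, and about 76% overall.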

---

## Results and Discussion

### Model Performance Overview
The three CNN variants were evaluated based on accuracy, loss, and confusion matrix performance across the
training, validation, and test datasets. The results demonstrate a consistent improvement in generalization
and model stability at each stage of development (see Figures 1–4).

The **baseline CNN** achieved rapid gains in training accuracy but unstable validation performance.
Validation accuracy fluctuated widely (0.2–0.9) and validation loss increased over time, a clear indicator
of overfitting. These results suggest that the model memorized training examples rather than learning
generalized patterns (Goodfellow, Bengio, & Courville, 2016).

### Improved CNN
The **improved CNN** introduced data augmentation, batch normalization, and progressive dropout
to counteract overfitting. The resulting training curves showed smoother convergence and a smaller
gap between training and validation accuracy. The model achieved **93.2% validation accuracy** and **93.8% test accuracy**, while validation loss stabilized near **0.18**. The confusion matrix indicated that most misclassifications involved digits with similar phonetic or temporal characteristics (e.g., “two” vs. “zero” and “five” vs. “four”), which are common challenges in speech-based models (Chollet, 2018).

### Fine-Tuned CNN
The **fine-tuned CNN** integrated **EarlyStopping** and **ReduceLROnPlateau** callbacks to control training length and learning rate dynamically. This adaptive strategy prevented unnecessary epochs once convergence was achieved and enabled the optimizer to fine-tune weights more effectively at lower learning rates.
As a result, the model attained a **test accuracy of 97.6%** and a **test loss of 0.071**, making it the highest-performing configuration.

Training stopped automatically after nine epochs, restoring the best model weights from epoch 5.  
Learning-rate reductions at epochs 3, 7, and 9 aligned closely with decreases in validation loss, confirming
that adaptive scheduling improved convergence efficiency. Validation accuracy remained above 0.95 throughout
fine-tuning, demonstrating strong generalization.

### Visual Analysis
Figures 1 and 2 compare training and validation accuracy/loss curves for the baseline and improved models.
The fine-tuned curves (Figure 3) show minimal divergence between training and validation metrics,
indicating effective regularization. Figure 4, the confusion matrix for the fine-tuned CNN, exhibits near-perfect diagonal dominance, confirming that the model correctly classified most test samples. Occasional off-diagonal entries correspond to visually similar spectrograms or brief utterances with reduced temporal resolution.

### Interpretation
The steady improvement from the baseline to the improved and then the fine-tuned CNN demonstrates the
cumulative effect of regularization and optimization techniques.

Each enhancement addressed a specific weakness:
- **Dropout** mitigated overfitting by reducing neuron co-adaptation (Srivastava et al., 2014).  
- **Batch Normalization** improved convergence stability (Ioffe & Szegedy, 2015).  
- **Data Augmentation** diversified training inputs and increased robustness to speaker variability.  
- **Adaptive Learning-Rate Scheduling** refined weight updates and minimized validation loss oscillations.

Together, these techniques enabled the final model to generalize effectively across unseen speakers and digit pronunciations, delivering near-state-of-the-art performance for a lightweight CNN trained on AudioMNIST (Becker et al., 2018).