
# Cell 1: Introduction and Paper Summary

# Speech Commands Classification - Lab Evaluation

This notebook follows the evaluation tasks based on the paper
[Speech Commands: A Dataset for Limited-Vocabulary Speech Recognition](https://arxiv.org/abs/1804.03209).

### Tasks:
1. Summarize the paper in about 50 words.
2. Download, analyze, and statistically describe the dataset.
3. Train a classifier to distinguish commands.
4. Report performance results using standard benchmarks.
5. Record 30 samples of each command in your voice and create a new dataset.
6. Fine-tune the classifier on your voice.
7. Report the results.

## 1. Paper Summary

The paper introduces a labeled dataset of one-second audio clips of simple spoken words.
It is designed to support research on limited-vocabulary speech recognition, such as keyword spotting.
The clips come from many speakers recorded in varied environments, giving a common benchmark for training and evaluating models.

In [3]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [5]:

# Function to calculate MD5 checksum of a file
import hashlib

def calculate_md5(file_path):
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

# Example usage: calculate the checksum of one recorded sample
dataset_file = '/content/drive/MyDrive/Keyword Spotting/Recordings/yes_1.wav'
dataset_checksum = calculate_md5(dataset_file)
print(f'MD5 checksum for dataset: {dataset_checksum}')

# Recompute the checksum and compare it with the stored value.
# Run in the same session, this only confirms the file did not change between the two reads.
new_checksum = calculate_md5(dataset_file)
if new_checksum == dataset_checksum:
    print("File is verified and hasn't been tampered with.")
else:
    print("File verification failed. The file has been modified.")


MD5 checksum for dataset: b18b2c3f131b2577e71b36076a2a6927
File is verified and hasn't been tampered with.
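
The check above recomputes the digest in the same session, so it can only succeed. To detect a change made later, the digest would need to be stored between sessions. The following is a minimal sketch of that idea, assuming a JSON manifest file; the helpers `save_manifest` and `verify_manifest`, the manifest layout, and the temporary demo file are hypothetical and not part of the original notebook.

```python
import hashlib
import json
import os
import tempfile


def calculate_md5(file_path):
    """Return the MD5 hex digest of a file, read in 4 KiB chunks."""
    hash_md5 = hashlib.md5()
    with open(file_path, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def save_manifest(file_paths, manifest_path):
    """Hypothetical helper: store {path: md5} so it survives the session."""
    manifest = {path: calculate_md5(path) for path in file_paths}
    with open(manifest_path, "w") as f:
        json.dump(manifest, f, indent=2)
    return manifest


def verify_manifest(manifest_path):
    """Hypothetical helper: list paths that are missing or whose digest changed."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    changed = []
    for path, expected in manifest.items():
        if not os.path.isfile(path) or calculate_md5(path) != expected:
            changed.append(path)
    return changed


# Demo on a temporary file standing in for a recording
tmp_dir = tempfile.mkdtemp()
sample_path = os.path.join(tmp_dir, "sample.wav")
manifest_path = os.path.join(tmp_dir, "checksums.json")

with open(sample_path, "wb") as f:
    f.write(b"original audio bytes")

save_manifest([sample_path], manifest_path)
unchanged_result = verify_manifest(manifest_path)  # nothing has changed yet

with open(sample_path, "wb") as f:
    f.write(b"modified audio bytes")
changed_result = verify_manifest(manifest_path)  # the file now differs
```

In the notebook, the manifest could be written to Drive next to the recordings and checked at the start of each session.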


## 2. Analyze Dataset

In [6]:
# Cell 2: Download and Analyze the Dataset
import os
import torchaudio
from collections import Counter
from torch.utils.data import Subset

In [7]:
# Create the data directory if it doesn't exist
data_dir = './data'
os.makedirs(data_dir, exist_ok=True)

In [8]:
# Download the Speech Commands dataset
dataset = torchaudio.datasets.SPEECHCOMMANDS(root=data_dir, download=True)

100%|██████████| 2.26G/2.26G [01:47<00:00, 22.7MB/s]


In [9]:
# Select 10 commands to work with
selected_commands = ['yes', 'no', 'up', 'down', 'left', 'right', 'go', 'stop', 'on', 'off']

# Limit the number of samples per command (e.g., 100 samples per command)
samples_per_command = 100

In [10]:
# Create a subset of the dataset by filtering for the selected commands
subset_indices = []
command_counter = Counter()

for idx, sample in enumerate(dataset):
    label = sample[2]
    if label in selected_commands and command_counter[label] < samples_per_command:
        subset_indices.append(idx)
        command_counter.update([label])

    # Stop when we have enough samples for each command
    if all(command_counter[cmd] >= samples_per_command for cmd in selected_commands):
        break

# Create a subset of the dataset
subset_dataset = Subset(dataset, subset_indices)

# Check the sample count for each command in the subset
print(f"Sample counts in subset: {command_counter}")
print(f"Total subset size: {len(subset_dataset)}")

Sample counts in subset: Counter({'down': 100, 'go': 100, 'left': 100, 'no': 100, 'off': 100, 'on': 100, 'right': 100, 'stop': 100, 'up': 100, 'yes': 100})
Total subset size: 1000
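
Task 2 also asks for a statistical description, and the counts above cover only class balance. The following is a minimal sketch of a per-command duration summary, assuming each clip has been reduced to a `(num_samples, sample_rate, label)` tuple; the function `describe_clips` and the example values are hypothetical and are not measurements from the dataset.

```python
from collections import defaultdict
from statistics import mean


def describe_clips(clips):
    """Summarize clips given as (num_samples, sample_rate, label) tuples.

    Returns {label: {"count", "mean_s", "min_s", "max_s"}}, durations in seconds.
    """
    durations = defaultdict(list)
    for num_samples, sample_rate, label in clips:
        durations[label].append(num_samples / sample_rate)

    summary = {}
    for label, values in sorted(durations.items()):
        summary[label] = {
            "count": len(values),
            "mean_s": mean(values),
            "min_s": min(values),
            "max_s": max(values),
        }
    return summary


# Illustrative values only, not measurements from the dataset
example_clips = [
    (16000, 16000, "yes"),
    (12000, 16000, "yes"),
    (16000, 16000, "no"),
]
stats = describe_clips(example_clips)
for label, row in stats.items():
    print(f"{label}: n={row['count']}, mean={row['mean_s']:.3f}s, "
          f"min={row['min_s']:.3f}s, max={row['max_s']:.3f}s")
```

Each `SPEECHCOMMANDS` item is a tuple starting with `(waveform, sample_rate, label, ...)`, so the input tuples could be built from `subset_dataset` as `(waveform.shape[1], sample_rate, label)`.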


## 3. Data Preprocessing

In [11]:
# Cell 3: Data Preprocessing (Padding and Truncating)
import torch
from torch.utils.data import DataLoader

In [12]:
# Define a fixed length for all audio samples (1 second = 16000 samples at 16kHz)
fixed_length = 16000

# Custom collate function to pad and truncate audio data
def collate_fn(batch):
    waveforms = []
    labels = []

    for item in batch:
        waveform = item[0]
        label = item[2]

        if waveform.shape[1] > fixed_length:
            waveform = waveform[:, :fixed_length]
        elif waveform.shape[1] < fixed_length:
            pad_amount = fixed_length - waveform.shape[1]
            waveform = torch.nn.functional.pad(waveform, (0, pad_amount))

        waveforms.append(waveform)
        labels.append(label)

    waveforms = torch.stack(waveforms)
    return waveforms, labels

In [13]:
# DataLoader for the subset dataset
loader = DataLoader(subset_dataset, batch_size=32, shuffle=True, collate_fn=collate_fn)
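
The `loader` above draws from every clip in `subset_dataset`, so no clips are held out for measuring generalization. The following is a minimal sketch of a stratified train/validation split over example positions, assuming one label per example is available as a list; the function `stratified_split`, the 20% validation fraction, and the toy labels are hypothetical choices, not part of the original notebook.

```python
import random
from collections import defaultdict


def stratified_split(labels, val_fraction=0.2, seed=0):
    """Split positions 0..len(labels)-1 into train and validation lists.

    Each label contributes roughly `val_fraction` of its examples to validation.
    """
    by_label = defaultdict(list)
    for position, label in enumerate(labels):
        by_label[label].append(position)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train_positions, val_positions = [], []
    for label in sorted(by_label):
        positions = by_label[label][:]
        rng.shuffle(positions)
        n_val = int(round(len(positions) * val_fraction))
        val_positions.extend(positions[:n_val])
        train_positions.extend(positions[n_val:])
    return sorted(train_positions), sorted(val_positions)


# Toy labels standing in for the commands of the subset
toy_labels = [cmd for cmd in ["yes", "no", "up"] for _ in range(10)]
train_pos, val_pos = stratified_split(toy_labels, val_fraction=0.2)
print(len(train_pos), len(val_pos))
```

The returned positions index into `subset_indices`, so something like `Subset(dataset, [subset_indices[i] for i in train_pos])` could back a training loader, with a second `Subset` for validation.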

## 4. CNN Classifier

In [16]:
# Cell 4: Define and Train a CNN Classifier (with correct fully connected layer input size)
import torch.nn as nn
import torch.optim as optim
import torchaudio.transforms as transforms

# Define the MelSpectrogram transform to convert audio waveforms into spectrograms
mel_spectrogram = transforms.MelSpectrogram(
    sample_rate=16000, n_mels=128, n_fft=400, hop_length=160
)

# Define a simple CNN model for speech command classification
class SimpleCNN(nn.Module):
    def __init__(self, num_classes):
        super(SimpleCNN, self).__init__()
        self.conv1 = nn.Conv2d(1, 16, kernel_size=3, stride=1, padding=1)
        self.conv2 = nn.Conv2d(16, 32, kernel_size=3, stride=1, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2, padding=0)
        self.fc1 = nn.Linear(32 * 25 * 32, 128)  # 32 channels x 32 mel bins x 25 frames after two 2x2 poolings = 25600
        self.fc2 = nn.Linear(128, num_classes)
        self.relu = nn.ReLU()

    def forward(self, x):
        x = self.pool(self.relu(self.conv1(x)))
        x = self.pool(self.relu(self.conv2(x)))
        x = x.view(x.size(0), -1)  # Flatten the output for the fully connected layers
        x = self.relu(self.fc1(x))
        x = self.fc2(x)
        return x

# Get the number of classes (commands)
num_classes = len(selected_commands)

# Create a dictionary to map commands (labels) to numerical values
label_to_index = {label: idx for idx, label in enumerate(selected_commands)}

# Function to convert string labels to numerical indices
def label_to_tensor(label):
    return torch.tensor(label_to_index[label])

# Instantiate the model
model = SimpleCNN(num_classes=num_classes).to('cuda')

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()

# Training loop for 50 epochs
num_epochs = 50

for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for waveforms, labels in loader:
        # Convert waveforms to mel spectrograms: [B, 1, 16000] -> [B, 1, 128, 101]
        waveforms = mel_spectrogram(waveforms)
        waveforms = waveforms.to('cuda')  # Channel dimension is already present
        labels = torch.tensor([label_to_index[label] for label in labels]).to('cuda')

        optimizer.zero_grad()
        outputs = model(waveforms)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()

    print(f'Epoch {epoch + 1}/{num_epochs}, Loss: {running_loss / len(loader)}')

print('Training completed!')

Epoch 1/50, Loss: 2.601890716701746
Epoch 2/50, Loss: 1.684557981789112
Epoch 3/50, Loss: 1.3523731864988804
Epoch 4/50, Loss: 1.1243946868926287
Epoch 5/50, Loss: 0.996374910697341
Epoch 6/50, Loss: 0.845909109339118
Epoch 7/50, Loss: 0.6968596735969186
Epoch 8/50, Loss: 0.5915448756422848
Epoch 9/50, Loss: 0.5313666826114058
Epoch 10/50, Loss: 0.6627821503207088
Epoch 11/50, Loss: 0.539000429213047
Epoch 12/50, Loss: 0.4645805268082768
Epoch 13/50, Loss: 0.37597756111063063
Epoch 14/50, Loss: 0.3524410950485617
Epoch 15/50, Loss: 0.3254911662079394
Epoch 16/50, Loss: 0.3025505947880447
Epoch 17/50, Loss: 0.23821532796137035
Epoch 18/50, Loss: 0.21791425510309637
Epoch 19/50, Loss: 0.2132582104532048
Epoch 20/50, Loss: 0.20353090862045065
Epoch 21/50, Loss: 0.29704794858116657
Epoch 22/50, Loss: 0.32159089646302164
Epoch 23/50, Loss: 0.2791507130023092
Epoch 24/50, Loss: 0.31834160501603037
Epoch 25/50, Loss: 0.48384921508841217
Epoch 26/50, Loss: 0.2793220744933933
Epoch 27/50, Loss:
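
The input size of `fc1` in `SimpleCNN` can be checked by hand from the spectrogram settings. The following is a small sketch of that arithmetic, assuming one-second clips at 16 kHz and the centered STFT that `MelSpectrogram` uses by default; the helpers `stft_frames` and `conv_out` are hypothetical names introduced for illustration.

```python
def stft_frames(num_samples, hop_length):
    """Frame count for a centered STFT."""
    return num_samples // hop_length + 1


def conv_out(size, kernel, stride=1, padding=0):
    """Output length of a convolution or pooling layer along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


n_mels = 128
frames = stft_frames(16000, hop_length=160)  # time frames for a 1 s clip

height, width = n_mels, frames
for _ in range(2):  # two conv + pool stages, as in SimpleCNN
    height = conv_out(height, kernel=3, stride=1, padding=1)  # conv keeps the size
    width = conv_out(width, kernel=3, stride=1, padding=1)
    height = conv_out(height, kernel=2, stride=2)  # pooling halves it (floor)
    width = conv_out(width, kernel=2, stride=2)

flat_features = 32 * height * width  # 32 output channels from conv2
print(height, width, flat_features)
```

The result agrees with the `32 * 25 * 32` passed to `nn.Linear`. A change to `n_mels`, `hop_length`, or the clip length would change this product and require updating `fc1`.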

## 5. Model Evaluation
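
Task 4 asks for performance results. The following is a minimal sketch of two common summaries, a confusion matrix and per-class recall, assuming the true and predicted class indices have already been collected from an evaluation loop (ideally over clips not used for training). The functions `confusion_matrix` and `accuracy_report` are hypothetical helpers, and the example predictions are made up for illustration.

```python
def confusion_matrix(true_idx, pred_idx, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(true_idx, pred_idx):
        matrix[t][p] += 1
    return matrix


def accuracy_report(matrix):
    """Return overall accuracy and per-class recall from a confusion matrix."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    per_class = [
        (row[i] / sum(row)) if sum(row) else 0.0
        for i, row in enumerate(matrix)
    ]
    return correct / total, per_class


# Made-up predictions for three classes, for illustration only
true_idx = [0, 0, 1, 1, 2, 2]
pred_idx = [0, 1, 1, 1, 2, 0]
matrix = confusion_matrix(true_idx, pred_idx, num_classes=3)
overall, per_class = accuracy_report(matrix)
print(matrix, overall, per_class)
```

With the notebook's model, `true_idx` and `pred_idx` could be filled from `labels.tolist()` and `predicted.tolist()` inside a `torch.no_grad()` loop.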


## 6. Recording

In [27]:
# Cell 6: Load the 30 Recorded Samples per Command from Google Drive
# Step 1: Mount Google Drive
from google.colab import drive
drive.mount('/content/drive')


Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [28]:

# Step 2: Path to your recordings folder in Drive
recordings_path = '/content/drive/MyDrive/Keyword Spotting/Recordings'

# Verify the recordings are accessible
import os
recording_files = os.listdir(recordings_path)
print("Files in the recordings folder:", recording_files)

Files in the recordings folder: ['no_22.wav', 'yes_1.wav', 'yes_19.wav', 'no_1.wav', 'yes_27.wav', 'yes_4.wav', 'yes_14.wav', 'no_15.wav', 'yes_24.wav', 'yes_22.wav', 'yes_20.wav', 'yes_18.wav', 'yes_6.wav', 'no_16.wav', 'no_27.wav', 'no_19.wav', 'no_5.wav', 'yes_10.wav', 'yes_25.wav', 'no_14.wav', 'yes_8.wav', 'no_3.wav', 'yes_28.wav', 'no_8.wav', 'yes_2.wav', 'no_12.wav', 'yes_17.wav', 'yes_12.wav', 'no_4.wav', 'yes_23.wav', 'yes_21.wav', 'yes_9.wav', 'no_28.wav', 'no_2.wav', 'no_21.wav', 'yes_7.wav', 'yes_13.wav', 'yes_29.wav', 'yes_30.wav', 'no_26.wav', 'yes_3.wav', 'no_11.wav', 'no_24.wav', 'no_23.wav', 'no_20.wav', 'no_7.wav', 'yes_26.wav', 'yes_5.wav', 'yes_16.wav', 'no_29.wav', 'no_30.wav', 'no_6.wav', 'yes_11.wav', 'no_25.wav', 'no_10.wav', 'no_13.wav', 'yes_15.wav', 'no_17.wav', 'no_9.wav', 'no_18.wav', 'down_6.wav', 'down_1.wav', 'down_30.wav', 'down_29.wav', 'down_28.wav', 'down_27.wav', 'down_26.wav', 'down_25.wav', 'down_23.wav', 'down_24.wav', 'down_22.wav', 'down_21.wav

In [29]:
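import os

# Build the expected paths for the 30 recordings of each command.
# These lists must exist before verification; the fine-tuning cell defines them again.
yes_files = [os.path.join(recordings_path, f'yes_{i}.wav') for i in range(1, 31)]
no_files = [os.path.join(recordings_path, f'no_{i}.wav') for i in range(1, 31)]
up_files = [os.path.join(recordings_path, f'up_{i}.wav') for i in range(1, 31)]
down_files = [os.path.join(recordings_path, f'down_{i}.wav') for i in range(1, 31)]
left_files = [os.path.join(recordings_path, f'left_{i}.wav') for i in range(1, 31)]

# Check if all files exist before loading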
def verify_file_paths(file_list, label):
    missing_files = [file for file in file_list if not os.path.isfile(file)]
    if missing_files:
        print(f"Missing {label} files:", missing_files)
    else:
        print(f"All {label} files are present.")

# Verify files for each label
verify_file_paths(yes_files, "yes")
verify_file_paths(no_files, "no")
verify_file_paths(up_files, "up")
verify_file_paths(down_files, "down")
verify_file_paths(left_files, "left")

All yes files are present.
All no files are present.
All up files are present.
All down files are present.
All left files are present.


## 7. Fine Tuning

In [30]:
# Task 6: Fine-Tune the Classifier on Your Voice
import torch.optim as optim
import torch.nn as nn
import torchaudio

# Load the fine-tune data from "yes", "no", "up", "down", and "left" recordings
yes_files = [os.path.join(recordings_path, f'yes_{i}.wav') for i in range(1, 31)]
no_files = [os.path.join(recordings_path, f'no_{i}.wav') for i in range(1, 31)]
up_files = [os.path.join(recordings_path, f'up_{i}.wav') for i in range(1, 31)]
down_files = [os.path.join(recordings_path, f'down_{i}.wav') for i in range(1, 31)]
left_files = [os.path.join(recordings_path, f'left_{i}.wav') for i in range(1, 31)]

# Function to pad or truncate audio waveforms
fixed_length = 16000

def pad_or_truncate(waveform, max_length=fixed_length):
    if waveform.size(1) > max_length:
        return waveform[:, :max_length]
    elif waveform.size(1) < max_length:
        pad_amount = max_length - waveform.size(1)
        return torch.nn.functional.pad(waveform, (0, pad_amount))
    return waveform

# Define the MelSpectrogram transform to convert audio waveforms into spectrograms
mel_spectrogram = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_mels=128, n_fft=400, hop_length=160
)

# Load the audio data and labels for "yes", "no", "up", "down", and "left"
fine_tune_data = []

# Map each command to its list of recording paths
command_files = {
    'yes': yes_files,
    'no': no_files,
    'up': up_files,
    'down': down_files,
    'left': left_files,
}

# Process the files for each command and append (spectrogram, label) pairs
for label, files in command_files.items():
    for file in files:
        waveform, sample_rate = torchaudio.load(file)
        waveform = pad_or_truncate(waveform)
        spectrogram = mel_spectrogram(waveform)
        fine_tune_data.append((spectrogram, label))

# Define the label mapping for the five recorded commands
# (indices match their positions in selected_commands, so the output units line up)
label_to_index = {'yes': 0, 'no': 1, 'up': 2, 'down': 3, 'left': 4}

# Define a function to convert labels to indices
def label_to_tensor(label):
    return torch.tensor(label_to_index[label])

# Fine-tune the model
fine_tune_optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Convert the data into DataLoader
batch_size = 16
loader = torch.utils.data.DataLoader(
    fine_tune_data, batch_size=batch_size, shuffle=True,
    collate_fn=lambda batch: (
        torch.stack([item[0] for item in batch]).squeeze(1),
        torch.tensor([label_to_tensor(item[1]) for item in batch])
    )
)

# Fine-tuning loop
num_epochs = 20
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for waveforms, labels in loader:
        waveforms = waveforms.unsqueeze(1).to('cuda')
        labels = labels.to('cuda')

        fine_tune_optimizer.zero_grad()
        outputs = model(waveforms)
        loss = criterion(outputs, labels)
        loss.backward()
        fine_tune_optimizer.step()

        running_loss += loss.item()

    print(f'Epoch {epoch + 1}/{num_epochs}, Fine-tuning Loss: {running_loss / len(loader)}')

print("Fine-tuning completed!")

Epoch 1/20, Fine-tuning Loss: 5.493590688705444
Epoch 2/20, Fine-tuning Loss: 3.3784844517707824
Epoch 3/20, Fine-tuning Loss: 2.101422643661499
Epoch 4/20, Fine-tuning Loss: 1.4913936376571655
Epoch 5/20, Fine-tuning Loss: 1.4523927688598632
Epoch 6/20, Fine-tuning Loss: 1.4425764203071594
Epoch 7/20, Fine-tuning Loss: 1.4275154709815978
Epoch 8/20, Fine-tuning Loss: 1.388307511806488
Epoch 9/20, Fine-tuning Loss: 1.3768107533454894
Epoch 10/20, Fine-tuning Loss: 1.3467698574066163
Epoch 11/20, Fine-tuning Loss: 1.3718288898468018
Epoch 12/20, Fine-tuning Loss: 1.3270316123962402
Epoch 13/20, Fine-tuning Loss: 1.3145939707756042
Epoch 14/20, Fine-tuning Loss: 1.2988775014877318
Epoch 15/20, Fine-tuning Loss: 1.2989997267723083
Epoch 16/20, Fine-tuning Loss: 1.2666441679000855
Epoch 17/20, Fine-tuning Loss: 1.315366953611374
Epoch 18/20, Fine-tuning Loss: 1.3276232600212097
Epoch 19/20, Fine-tuning Loss: 1.2827466249465942
Epoch 20/20, Fine-tuning Loss: 1.2284953832626342
Fine-tuning completed!

In [31]:
# Check the number of samples for each label
yes_count = sum([1 for item in fine_tune_data if item[1] == 'yes'])
no_count = sum([1 for item in fine_tune_data if item[1] == 'no'])
up_count = sum([1 for item in fine_tune_data if item[1] == 'up'])
down_count = sum([1 for item in fine_tune_data if item[1] == 'down'])
left_count = sum([1 for item in fine_tune_data if item[1] == 'left'])
print(f"Yes count: {yes_count}, No count: {no_count}, Up count: {up_count}, Down count: {down_count}, Left count: {left_count}")

Yes count: 30, No count: 30, Up count: 30, Down count: 30, Left count: 30


In [32]:
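# Evaluate the model and report the results.
# Note: `loader` is the same fine-tuning loader used for training, so this is training accuracy, not held-out accuracy.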
correct = 0
total = 0

model.eval()
with torch.no_grad():
    for waveforms, labels in loader:
        waveforms = waveforms.unsqueeze(1).to('cuda')
        labels = labels.to('cuda')

        outputs = model(waveforms)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy on "yes", "no", "up", "down", and "left" recordings: {accuracy:.2f}%')

Accuracy on "yes", "no", "up", "down", and "left" recordings: 42.67%


In [36]:
import torch
import torch.optim as optim
import torch.nn as nn
import torchaudio

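# Function to add Gaussian noise for data augmentation.
# Note: the loader yields mel spectrograms, so the noise is added to spectrograms rather than raw waveforms.
def add_noise(waveform, noise_factor=0.005):
    noise = torch.randn(waveform.size(), device=waveform.device) * noise_factor
    return waveform + noise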

# Fine-tune the model with a lower learning rate, noise augmentation, and more epochs
fine_tune_optimizer = optim.Adam(model.parameters(), lr=1e-5)
# StepLR multiplies the learning rate by 0.1 every 10 epochs
scheduler = torch.optim.lr_scheduler.StepLR(fine_tune_optimizer, step_size=10, gamma=0.1)

num_epochs = 100
criterion = nn.CrossEntropyLoss()

# Fine-tuning loop with updated learning rate and noise augmentation
for epoch in range(num_epochs):
    model.train()
    running_loss = 0.0

    for waveforms, labels in loader:
        waveforms = waveforms.unsqueeze(1).to('cuda')
        labels = labels.to('cuda')

        # Apply noise augmentation (these tensors are mel spectrograms, despite the variable name)
        waveforms = add_noise(waveforms)

        fine_tune_optimizer.zero_grad()
        outputs = model(waveforms)
        loss = criterion(outputs, labels)
        loss.backward()
        fine_tune_optimizer.step()

        running_loss += loss.item()

    scheduler.step()

    print(f'Epoch {epoch + 1}/{num_epochs}, Fine-tuning Loss: {running_loss / len(loader)}')

print("Fine-tuning completed!")

Epoch 1/100, Fine-tuning Loss: 1.1450494527816772
Epoch 2/100, Fine-tuning Loss: 1.145140939950943
Epoch 3/100, Fine-tuning Loss: 1.121493148803711
Epoch 4/100, Fine-tuning Loss: 1.1284986197948457
Epoch 5/100, Fine-tuning Loss: 1.1323743879795074
Epoch 6/100, Fine-tuning Loss: 1.1190260708332063
Epoch 7/100, Fine-tuning Loss: 1.1425463140010834
Epoch 8/100, Fine-tuning Loss: 1.1181742429733277
Epoch 9/100, Fine-tuning Loss: 1.1301652193069458
Epoch 10/100, Fine-tuning Loss: 1.1142170011997223
Epoch 11/100, Fine-tuning Loss: 1.13713396191597
Epoch 12/100, Fine-tuning Loss: 1.118468564748764
Epoch 13/100, Fine-tuning Loss: 1.1164010107517242
Epoch 14/100, Fine-tuning Loss: 1.1228023529052735
Epoch 15/100, Fine-tuning Loss: 1.0960364758968353
Epoch 16/100, Fine-tuning Loss: 1.1286650061607362
Epoch 17/100, Fine-tuning Loss: 1.1319554150104523
Epoch 18/100, Fine-tuning Loss: 1.1021289765834807
Epoch 19/100, Fine-tuning Loss: 1.1220424473285675
Epoch 20/100, Fine-tuning Loss: 1.09198720455
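
The flat loss values are easier to interpret alongside the learning rate in effect at each epoch. The following is a minimal sketch of the step decay configured above (`step_size=10`, `gamma=0.1`, base rate `1e-5`), written as plain arithmetic rather than by querying the scheduler; the function `step_lr` is a hypothetical helper.

```python
def step_lr(base_lr, epoch, step_size=10, gamma=0.1):
    """Learning rate in effect during `epoch` (0-based) under step decay."""
    return base_lr * gamma ** (epoch // step_size)


base_lr = 1e-5
schedule = {epoch: step_lr(base_lr, epoch) for epoch in (0, 9, 10, 20, 50, 99)}
for epoch, lr in schedule.items():
    print(f"epoch {epoch + 1:3d}: lr = {lr:.1e}")
```

Under these settings the rate drops by a factor of ten every ten epochs, so the later epochs of a 100-epoch run apply very small updates. A larger `step_size` or a `gamma` closer to 1 would keep the later epochs more effective.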

In [39]:
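# Evaluate the model and report the results.
# Note: `loader` is the same fine-tuning loader used for training, so this is training accuracy, not held-out accuracy.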
correct = 0
total = 0

model.eval()
with torch.no_grad():
    for waveforms, labels in loader:
        waveforms = waveforms.unsqueeze(1).to('cuda')
        labels = labels.to('cuda')

        outputs = model(waveforms)
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

accuracy = 100 * correct / total
print(f'Accuracy on "yes", "no", "up", "down", and "left" recordings: {accuracy:.2f}%')

Accuracy on "yes", "no", "up", "down", and "left" recordings: 92.00% 


In [42]:
# Save the fine-tuned model weights
model_save_path = '/content/drive/MyDrive/Keyword Spotting/fine_tuned_model.pth'
torch.save(model.state_dict(), model_save_path)