# Accent detection project

<a href="https://colab.research.google.com/github/maxtrepanier/whisper-accent/blob/master/Accent%20detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## Overview

This project explores how well speech recognition systems capture subtle features of voice, such as accents.

In the first part of this project, we attempt to repurpose the speech recognition system [Whisper](https://huggingface.co/openai/whisper-large-v3) for accent classification. Specifically, we use Whisper's encoder as a feature extractor and apply transfer learning to train an accent classifier on top of it.

The second part of this project is more ambitious: it aims to identify features corresponding to accents within Whisper by implementing dictionary learning.

Note: maybe use dataset:
https://huggingface.co/datasets/NathanRoll/commonvoice_train_gender_accent_16k
also: https://huggingface.co/WillHeld

In [1]:
# Google colab specific:
import sys
IN_COLAB = 'google.colab' in sys.modules

if IN_COLAB:
    # install required packages
    !pip install datasets evaluate transformers[sentencepiece]
    
    # mount external drive:
    import os
    from google.colab import drive
    drive.mount('/content/drive')
    os.chdir('/content/drive/MyDrive/Colab Notebooks')

## Test: the base model

We use whisper-small for these tests. Load the model following the instructions given on HuggingFace.

In [None]:
import numpy as np
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)


Check that the model runs using the automatic pipeline:

In [None]:
import IPython
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
IPython.display.Audio(dataset[0]["audio"]['array'], rate=16000)



 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, symbolies drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Atom paintings, and Mason's exquisite idles are as national as a jingle poem. Mr. Birkitt Foster's landscapes smile at one much in the same way that Mr. Karker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo and a Turkish bath, next man


Reproduce this with a manual pipeline. Because we call the processor and the model directly, without chunking, only the first 30 s of the clip is decoded.

In [None]:
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# config
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"

# load model
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# load processor
processor = AutoProcessor.from_pretrained(model_id, language="en", task="transcribe", torch_dtype=torch_dtype)

ds = load_dataset("distil-whisper/librispeech_long", "clean", split="validation", trust_remote_code=True)
inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
input_features = inputs.input_features.type(torch_dtype).to(device)
generated_ids = model.generate(input_features, language="en")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, symbolies drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover
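
The transcription stops early because the processor pads or truncates every input to a single 30 s window. As a rough sketch of how a longer clip could be covered manually, the hypothetical helper `chunk_bounds` below (not part of `transformers`) splits a waveform into consecutive 30 s windows. It assumes non-overlapping windows, which is a simplification compared with the overlapping chunks used by the automatic pipeline.

```python
def chunk_bounds(n_samples, sampling_rate=16000, chunk_length_s=30):
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    chunk_size = int(sampling_rate * chunk_length_s)
    return [(start, min(start + chunk_size, n_samples))
            for start in range(0, n_samples, chunk_size)]

# Example: a 70 s clip sampled at 16 kHz gives two full chunks and a 10 s remainder
bounds = chunk_bounds(70 * 16000)
print(bounds)
```

Each `(start, end)` pair could then be sliced out of the waveform and passed through the same `processor` and `model.generate` calls as above.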


Test with custom data:

In [None]:
import librosa
# load the clip and resample it to the 16 kHz rate Whisper expects
data, _ = librosa.load("dataset/train/british/rlaPLvETBug_1.mp3", sr=16000)
inputs = processor(data, sampling_rate=16000, return_tensors="pt")
input_features = inputs.input_features.type(torch_dtype).to(device)
generated_ids = model.generate(input_features, language="en")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

 So we would launch the ship into space, bomb, bomb, bomb, bomb, bomb, bomb, going up to about four bombs per second, going up and all the way to Mars and then afterwards to Jupiter and Saturn. And we intended to go ourselves. We had actual model space ships about so big, about a meter in diameter or so with chemical explosives, which actually went bomb, bomb, bomb, bomb a few times, a few hundred feet.


In [None]:
IPython.display.Audio(data, rate=16000)

### Review of the model

The input is a log-Mel spectrogram of shape (80, 3000): 80 Mel frequency channels by 3000 time frames, i.e. one frame every 10 ms over 30 s.

In [None]:
print(input_features.shape)

torch.Size([1, 80, 3000])
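
The 3000 frames follow from the feature extractor's settings. A quick check of the arithmetic, assuming Whisper's default parameters (16 kHz audio, a hop of 160 samples between frames, and 30 s windows):

```python
sampling_rate = 16000   # Hz, the rate Whisper expects
hop_length = 160        # samples between consecutive frames (assumed default)
chunk_length_s = 30     # seconds per window

n_samples = sampling_rate * chunk_length_s           # samples in one window
n_frames = n_samples // hop_length                   # spectrogram frames per window
frame_period_ms = 1000 * hop_length / sampling_rate  # time between frames

print(n_samples, n_frames, frame_period_ms)
```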


In [None]:
model

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

In [None]:
# reuse input_features, which already has the model's dtype and device
with torch.no_grad():
    encoded = model.model.encoder(input_features).last_hidden_state
encoded.shape

torch.Size([1, 1500, 768])
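
The time axis is halved from 3000 to 1500 by the second convolution, which has stride 2 in the model printout above. A small check using the standard output-length formula for a 1D convolution without dilation, with kernel size, stride and padding read off that printout:

```python
def conv1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1D convolution without dilation."""
    return (length + 2 * padding - kernel_size) // stride + 1

after_conv1 = conv1d_out_len(3000, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)
print(after_conv1, after_conv2)
```

The 768 in the last dimension is the encoder's hidden size, matching the `Embedding(1500, 768)` positional table.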

## Generating the dataset

See video_dl.py.

## Preparing the dataset

To speed up training, we run each clip through the Whisper encoder once and save the encoder outputs to a `dataset_preprocessed` folder.

In [1]:
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, load_from_disk
from torch.utils.data import DataLoader
import whisper_accent

import IPython
from tqdm import tqdm
import os

streaming = False
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

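dataset = load_dataset("audiofolder",
                       data_dir="dataset/",
                       cache_dir=".cache/"
                       # streaming=streaming  # enable streaming to avoid loading everything into memory
                      )
# note: with_format returns a new dataset rather than modifying this one in place,
# so the call below has no effect unless its result is assigned
dataset.with_format("torch", device=device)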

model_id = "openai/whisper-small"
processor = AutoProcessor.from_pretrained(model_id, torch_dtype=torch_dtype)
whisper_model = whisper_accent.AccentModel(use_encoder=False)
whisper_model.to(device).to(torch_dtype)


Resolving data files:   0%|          | 0/564 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/151 [00:00<?, ?it/s]


AccentModel(
  (whisper_encoder): WhisperEncoder(
    (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
    (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
    (embed_positions): Embedding(1500, 768)
    (layers): ModuleList(
      (0-11): 12 x WhisperEncoderLayer(
        (self_attn): WhisperSdpaAttention(
          (k_proj): Linear(in_features=768, out_features=768, bias=False)
          (v_proj): Linear(in_features=768, out_features=768, bias=True)
          (q_proj): Linear(in_features=768, out_features=768, bias=True)
          (out_proj): Linear(in_features=768, out_features=768, bias=True)
        )
        (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (activation_fn): GELUActivation()
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
        (final_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=T

In [3]:
if streaming:
    item = next(iter(dataset['train']))
    print(item)
else:
    item = dataset['train'][0]
    print(item)
IPython.display.Audio(item['audio']['array'], rate=16000)

{'audio': {'path': '/home/maxime/Dropbox/Job preparation/Machine learning/Speech/dataset/train/american/-96-lAfagow_0.mp3', 'array': array([-0.11506341, -0.12139536, -0.10677178, ..., -0.00088448,
       -0.00142263, -0.00095254]), 'sampling_rate': 16000}, 'label': 0}


In [2]:
def preprocess_audio(audio_file, whisper_model, processor):
    # Convert the raw waveform into log-Mel input features
    audio_input = processor(audio_file['audio']['array'], sampling_rate=16000, return_tensors="pt")
    input_features = audio_input.input_features.type(torch_dtype).to(device)

    # Run through the Whisper encoder; no gradients are needed for feature extraction
    with torch.no_grad():
        encoder_output = whisper_model.encode(input_features)

    return {'input_features': encoder_output}

def preprocess_dataset(audio_files, whisper_model, processor):
    # Alternative that collects all encoder outputs in one NumPy array (not used below)
    preprocessed_data = []
    for audio_file in audio_files:
        encoder_output = preprocess_audio(audio_file, whisper_model, processor)['input_features']
        preprocessed_data.append(encoder_output.cpu().numpy())

    return np.array(preprocessed_data)

DATASET_PREPROCESSED_PATH = "dataset_preprocessed"

# Only recompute the encoder outputs if no cached copy exists on disk
recache = not os.path.exists(DATASET_PREPROCESSED_PATH)
if recache:
    # map() handles one example at a time here: batch_size only applies when batched=True
    dataset_preprocessed = dataset.map(lambda sample: preprocess_audio(sample, whisper_model, processor),
                                       batch_size=16, writer_batch_size=16)
    dataset_preprocessed.set_format(type='torch')
    dataset_preprocessed.save_to_disk(DATASET_PREPROCESSED_PATH)
else:
    dataset_preprocessed = load_from_disk(DATASET_PREPROCESSED_PATH)
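
Caching encoder outputs trades disk space for training speed. A back-of-the-envelope estimate, assuming each clip is stored as a full (1500, 768) float32 array and using the 564 and 151 files listed when the dataset was loaded (the actual on-disk size depends on the storage format and dtype):

```python
time_steps, hidden_size = 1500, 768   # encoder output shape per clip
bytes_per_value = 4                   # float32 (assumption)
n_clips = 564 + 151                   # train + validation files

bytes_per_clip = time_steps * hidden_size * bytes_per_value
total_gb = n_clips * bytes_per_clip / 1e9

print(f"{bytes_per_clip / 1e6:.1f} MB per clip, {total_gb:.2f} GB in total")
```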

In [4]:
from torch.utils.data import DataLoader
from torch import nn

train_dataloader = DataLoader(dataset_preprocessed['train'].remove_columns('audio'),
                              batch_size=16,
                              shuffle=True)
# shuffling is not needed for evaluation
test_dataloader = DataLoader(dataset_preprocessed['validation'].remove_columns('audio'),
                             batch_size=16,
                             shuffle=False)

## Transfer learning

Load the model from a checkpoint, if available:

In [13]:
MODEL_NAME = "accent_model.pth"
LOAD_WEIGHTS = True

if LOAD_WEIGHTS:
    whisper_model.accent_classifier.load_state_dict(torch.load(MODEL_NAME, map_location=device))

<All keys matched successfully>
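
The definition of `whisper_accent.AccentModel` is not shown in this notebook. As a purely illustrative sketch of the idea, the hypothetical `MeanPoolClassifier` below averages the encoder output over time and applies a single linear layer. The pooling choice, the layer sizes and the two-class output are all assumptions, not the actual `accent_classifier` used here.

```python
import torch
from torch import nn

class MeanPoolClassifier(nn.Module):
    """Hypothetical head: mean-pool encoder frames, then a linear layer."""

    def __init__(self, hidden_size=768, num_classes=2):
        super().__init__()
        self.linear = nn.Linear(hidden_size, num_classes)

    def forward(self, encoder_output):
        # encoder_output: (batch, time, hidden) -> pooled: (batch, hidden)
        pooled = encoder_output.mean(dim=1)
        return self.linear(pooled)  # unnormalised class scores (logits)

head = MeanPoolClassifier()
dummy = torch.randn(4, 1500, 768)   # stand-in for cached encoder outputs
logits = head(dummy)
print(logits.shape)
```

Logits of this shape, together with integer class labels, are what `nn.CrossEntropyLoss` expects.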

Train the model:

In [5]:
TRAIN_MODEL = False
batch_size = 16

def train_model(train_dataloader, test_dataloader, model, loss_fn, optimizer, epochs=1, learning_rate=1e-4, learning_rate_schedule=None):
    size = len(train_dataloader.dataset)
    if learning_rate_schedule is None:
        learning_rate_schedule = [learning_rate] * epochs

    for epoch, lr in zip(range(epochs), learning_rate_schedule):
        model.train()  # set training mode
        total_loss = 0.0
        # set the learning rate for this epoch (assigning optimizer.lr has no effect)
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr
        progress_bar = tqdm(range(size), position=0, leave=True, bar_format='{l_bar}{bar:15}{r_bar}{bar:-10b}')

        for batch, batch_data in enumerate(train_dataloader):
            # Compute prediction and loss
            X, y = batch_data['input_features'].to(device), batch_data['label'].to(device)
            X = torch.reshape(X, (-1,1500,768))  # reshape to (batch, time, features)
            pred = model(X)
            loss = loss_fn(pred, y)
            total_loss += loss.item()  # accumulate a float so the autograd graph is not kept alive

            # Backpropagation
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

            # Report the current and running average loss
            loss = loss.item()
            progress_bar.set_description(f"Epoch {epoch+1} (loss {loss:.3f}, avg loss {total_loss/(batch+1):.3f}", refresh=False)
            progress_bar.update(X.shape[0])

        progress_bar.refresh()
        progress_bar.close()
        # calculate test loss
        accuracy, test_loss = test_metric(test_dataloader, model, loss_fn)
        print(f"\t test accuracy: {100*accuracy:.1f}%, loss: {test_loss:.3f}")


def test_metric(dataloader, model, loss_fn):
    model.eval()  # set to eval mode
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # Evaluating the model with torch.no_grad() ensures that no gradients are computed during test mode
    # also serves to reduce unnecessary gradient computations and memory usage for tensors with requires_grad=True
    with torch.no_grad():
        for batch_data in dataloader:
            X, y = batch_data['input_features'].to(device), batch_data['label'].to(device)
            X = torch.reshape(X, (-1,1500,768))  # reshape to (batch, time, features)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    return correct, test_loss

if TRAIN_MODEL:
    loss_fn = nn.CrossEntropyLoss()
    lr = 6e-4
    epochs = 8
    optimizer = torch.optim.AdamW(whisper_model.parameters(), lr=lr)
    train_model(train_dataloader, test_dataloader, whisper_model, loss_fn, optimizer, epochs=epochs, learning_rate=lr)
    torch.save(whisper_model.accent_classifier.state_dict(), MODEL_NAME)
    print(f"Saved model to {MODEL_NAME}")

Epoch 1 (loss 0.673, avg loss 0.660: 100%|███████████████| 564/564 [00:16<00:00,


	 test accuracy: 77.5%, loss: 0.578


Epoch 2 (loss 0.494, avg loss 0.557: 100%|███████████████| 564/564 [00:15<00:00,


	 test accuracy: 88.7%, loss: 0.509


Epoch 3 (loss 0.405, avg loss 0.493: 100%|███████████████| 564/564 [00:15<00:00,


	 test accuracy: 86.1%, loss: 0.467


Epoch 4 (loss 0.512, avg loss 0.440: 100%|███████████████| 564/564 [00:21<00:00,


	 test accuracy: 91.4%, loss: 0.406


Epoch 5 (loss 0.433, avg loss 0.400: 100%|███████████████| 564/564 [00:15<00:00,


	 test accuracy: 91.4%, loss: 0.377


Epoch 6 (loss 0.405, avg loss 0.365: 100%|███████████████| 564/564 [00:16<00:00,


	 test accuracy: 93.4%, loss: 0.357


Epoch 7 (loss 0.365, avg loss 0.336: 100%|███████████████| 564/564 [00:17<00:00,


	 test accuracy: 94.0%, loss: 0.327


Epoch 8 (loss 0.499, avg loss 0.328: 100%|███████████████| 564/564 [00:19<00:00,


	 test accuracy: 95.4%, loss: 0.302


Epoch 9 (loss 0.234, avg loss 0.293: 100%|███████████████| 564/564 [00:16<00:00,


	 test accuracy: 94.7%, loss: 0.285


Epoch 10 (loss 0.147, avg loss 0.279: 100%|███████████████| 564/564 [00:15<00:00


	 test accuracy: 95.4%, loss: 0.266


Epoch 11 (loss 0.854, avg loss 0.281: 100%|███████████████| 564/564 [00:15<00:00


	 test accuracy: 95.4%, loss: 0.260


Test accuracy plateaus at around 95% from epoch 8 onward, while the test loss is still decreasing slowly (from 0.302 at epoch 8 to 0.260 at epoch 11). With further training, we would expect the model to start overfitting to the training data and the test accuracy to eventually decrease.
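
The plateau can be read directly off the log. As a small sketch, the snippet below copies the test accuracies and losses printed above and reports the first epoch that reaches the best accuracy, along with whether the test loss fell at every epoch:

```python
# Test metrics copied from the training log above, epochs 1 to 11
test_accuracy = [77.5, 88.7, 86.1, 91.4, 91.4, 93.4, 94.0, 95.4, 94.7, 95.4, 95.4]
test_loss = [0.578, 0.509, 0.467, 0.406, 0.377, 0.357, 0.327, 0.302, 0.285, 0.266, 0.260]

best_accuracy = max(test_accuracy)
first_best_epoch = test_accuracy.index(best_accuracy) + 1   # epochs are 1-indexed

# True if every epoch improved on the previous epoch's test loss
loss_still_decreasing = all(b < a for a, b in zip(test_loss, test_loss[1:]))

print(first_best_epoch, best_accuracy, loss_still_decreasing)
```

A test loss that falls at every epoch suggests that overfitting had not yet set in by epoch 11, even though accuracy had stopped improving.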

## Conclusion

Our simple classification model works remarkably well on this dataset, reaching around 95% accuracy on the validation split. This suggests that Whisper's encoder is effective at extracting features relevant to identifying a speaker's accent.
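
With only 151 validation clips, the accuracy estimate carries some uncertainty. A rough sketch using the normal approximation to the binomial (which is only approximate this close to 100%), taking 144 correct predictions out of 151. The count of 144 is inferred from the reported 95.4%, not printed by the notebook.

```python
import math

n = 151          # validation clips
correct = 144    # inferred: 144 / 151 = 95.4%

p = correct / n
std_err = math.sqrt(p * (1 - p) / n)                  # binomial standard error
low, high = p - 1.96 * std_err, p + 1.96 * std_err    # approximate 95% interval

print(f"accuracy {100*p:.1f}%, 95% interval roughly {100*low:.1f}% to {100*high:.1f}%")
```

An interval a few percentage points wide means that small differences between epochs, such as 94.7% versus 95.4%, should not be over-interpreted.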

There are obvious improvements to make. A simple first step would be to do error analysis and check whether the model is performing as expected. Similarly, to establish a reliable estimate of the achievable accuracy, it would be useful to obtain a human-level baseline. Together, these would make it easier to distinguish underfitting from overfitting and to improve the model more quickly.
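
As a starting point for error analysis, a confusion matrix shows which accent is mistaken for which. The sketch below is plain Python with made-up labels and predictions, for illustration only; in the notebook the inputs would come from `y` and `pred.argmax(1)` in `test_metric`. Label 0 is `american` in the sample shown earlier, while `british` = 1 is an assumption based on the folder names.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """matrix[i][j] counts samples of true class i predicted as class j."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix

# Made-up example, not results from this project
y_true = [0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 0, 1]

cm = confusion_matrix(y_true, y_pred, num_classes=2)
for name, row in zip(["american", "british"], cm):
    print(name, row)
```

Listening to the clips that land in the off-diagonal cells is then a direct way to see what kinds of recordings the classifier gets wrong.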

Some possible future directions:
 - use neural style transfer to change the accent of a speaker
 - generalise to more accents
 - use dictionary-learning techniques to identify directly within Whisper the features relevant for accent detection.