<a href="https://colab.research.google.com/github/maxtrepanier/whisper-accent/blob/master/Accent%20detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Accent detection project

## Overview

In the classic 'My Fair Lady', Professor Higgins, an expert in phonetics, boasts that his knowledge of accents is so refined he could 'place any man within six miles of his home'. While this claim might well remain unchallenged by any other human, it raises the question: can modern AI systems perform similar feats of accent recognition?

Understanding the variations of different accents and dialects is crucial in disambiguating speech. This is particularly relevant for tasks such as audio transcription, on which AI speech recognition models are trained. It therefore seems plausible that such models implicitly learn the relevant features to recognise accents.

To test this claim, we explore in this project the capabilities of one such model at extracting relevant features for classifying two clear and widely recognized accents: UK Received Pronunciation (RP) and US General American. Although this is a simpler task than 'placing any man within six miles of his home', it is sufficient to establish that such systems do indeed extract the relevant features.

More specifically, we repurpose a state-of-the-art speech recognition model developed by OpenAI called [Whisper](https://huggingface.co/openai/whisper-large-v3). We use the encoder part of the model as a sophisticated feature extractor, tapping into the rich, multi-layered representations of speech it has learned from diverse training data. We then apply transfer learning techniques to train an accent classifier. This method capitalises on Whisper's pre-existing knowledge while allowing for efficient learning with a relatively small dataset of accent-labeled speech samples.

A primary challenge is establishing the feasibility of this approach, as we're not aware of similar existing models specifically designed for accent classification using repurposed speech recognition systems.
Potential applications of this technology include enhancing speaker diarisation systems and, looking further ahead, developing more accessible speech therapy tools. Additionally, this project can be viewed as a stepping stone towards contributing to the interpretability of complex speech recognition models by identifying features that correspond to specific accent characteristics.

## Initialisation

We start with the imports and some additional setup for Google Colab.

In [1]:
# Google colab specific:
import sys
IN_COLAB = 'google.colab' in sys.modules
MY_DRIVE_PATH = "MyDrive/Colab Notebooks"  # replace here with your path!

if IN_COLAB:
    # install required packages
    !pip install datasets evaluate transformers[sentencepiece]
    
    # mount external drive:
    import os
    from google.colab import drive
    drive.mount('/content/drive')
    os.chdir('/content/drive/' + MY_DRIVE_PATH)

In [2]:
# I/O
import IPython
from tqdm import tqdm
import os

# Processing
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset, load_from_disk
from torch.utils.data import DataLoader
import whisper_accent  # local file from GitHub



### Test case: transcription with Whisper

To check that the model is loaded properly, we first run it on a transcription task. We use `whisper-small` for these tests, and load the model following the instructions given on Hugging Face.

In [3]:
from transformers import AutoModelForSpeechSeq2Seq, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"  # specify whisper-small

# loading checkpoint
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# load audio preprocessor
processor = AutoProcessor.from_pretrained(model_id)

# automatic pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Check that the model runs using the automatic pipeline:

In [None]:
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
IPython.display.Audio(dataset[0]["audio"]['array'], rate=16000)



 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, symbolies drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Atom paintings, and Mason's exquisite idles are as national as a jingle poem. Mr. Birkitt Foster's landscapes smile at one much in the same way that Mr. Karker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo and a Turkish bath, next man


The next task is to reproduce this without the pipeline. Whisper processes at most 30 s of audio at a time (the pipeline handles longer clips by chunking them and stitching the results together), so here we expect to recover only the first 30 s of the clip, which is roughly the first half of the previous transcript.

In [4]:
# config device
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"

# load model
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# load processor
processor = AutoProcessor.from_pretrained(model_id, language="en", task="transcribe", torch_dtype=torch_dtype)

ds = load_dataset("distil-whisper/librispeech_long", "clean", split="validation", trust_remote_code=True)
# trust_remote_code allows load_dataset to run loading scripts
inputs = processor(ds[0]["audio"]["array"], 
                   sampling_rate=ds[0]["audio"]["sampling_rate"],
                   return_tensors="pt")  # outputs torch tensor
input_features = inputs.input_features.type(torch_dtype).to(device)  # use torch_dtype
generated_ids = model.generate(input_features, language="en")  # transcription
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]  # convert to text
print(transcription)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.



You have passed language=en, but also have set `forced_decoder_ids` to [[1, None], [2, 50359]] which creates a conflict. `forced_decoder_ids` will be ignored in favor of language=en.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token.As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, symbolies drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover
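
The truncated transcript follows from Whisper's fixed 30 s input window: the feature extractor pads shorter clips and truncates longer ones. The following is a minimal NumPy sketch of that windowing, assuming 16 kHz mono audio. `pad_or_trim_30s` is a hypothetical helper written for illustration and is not the processor's actual code.

```python
import numpy as np

SAMPLE_RATE = 16_000                            # Hz, the rate used throughout this notebook
WINDOW_SECONDS = 30                             # Whisper's fixed input window
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS   # 480,000 samples


def pad_or_trim_30s(audio: np.ndarray) -> np.ndarray:
    """Hypothetical helper: zero-pad or truncate a 1-D waveform to exactly 30 s."""
    if audio.shape[0] >= WINDOW_SAMPLES:
        return audio[:WINDOW_SAMPLES]           # keep only the first 30 s
    padding = WINDOW_SAMPLES - audio.shape[0]
    return np.pad(audio, (0, padding))          # append silence at the end


long_clip = np.random.randn(65 * SAMPLE_RATE)   # ~65 s of noise
short_clip = np.random.randn(12 * SAMPLE_RATE)  # ~12 s of noise
print(pad_or_trim_30s(long_clip).shape, pad_or_trim_30s(short_clip).shape)
```

Both calls return arrays of the same length, which is why anything beyond the first 30 s never reaches the model unless the clip is chunked first.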


We can also test this with a custom audio clip, for instance:

In [6]:
import librosa
TEST_AUDIO_PATH = "dataset/train/british/rlaPLvETBug_0.mp3"
TEST_AUDIO_SAMPLING = 16_000  # Hz (integer); Whisper expects 16 kHz audio

data, _ = librosa.load(TEST_AUDIO_PATH, sr=TEST_AUDIO_SAMPLING)
inputs = processor(data,  sampling_rate=TEST_AUDIO_SAMPLING, return_tensors="pt")
input_features = inputs.input_features.type(torch_dtype).to(device)
generated_ids = model.generate(input_features, language="en")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

 I'm a professor retired at the Institute for Advanced Study in Princeton. And when I was a young kid, I used to scribble big numbers and draw pictures of the solar system. So obviously I was interested right from the start. I think the decisive moment in my career path was reading the book Men of Mathematics by Eric Tempelbell. He showed the mathematicians as being mostly


In [7]:
IPython.display.Audio(data, rate=16000)

### A peek into the model

The documentation specifies that the preprocessor outputs the log-Mel spectrogram of the audio clip (on 80 channels), sampled every 10 ms. For an input clip of 30 s (padded if necessary), the preprocessor output therefore has shape (batch, 80, 3000). We check this explicitly:

In [8]:
print(input_features.shape)

torch.Size([1, 80, 3000])
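
The 3000 frames can be recovered by simple arithmetic. The short sketch below assumes a hop of 160 samples between frames, which corresponds to the 10 ms spacing quoted above at a 16 kHz sampling rate; the constant names are ours.

```python
SAMPLE_RATE = 16_000   # Hz
HOP_LENGTH = 160       # samples between consecutive frames (assumed: 10 ms at 16 kHz)
CLIP_SECONDS = 30      # length of the padded input window

frame_period_ms = 1000 * HOP_LENGTH / SAMPLE_RATE
n_frames = CLIP_SECONDS * SAMPLE_RATE // HOP_LENGTH
print(frame_period_ms, n_frames)  # 10.0 3000
```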


Next we look at the model. It has an encoder-decoder architecture built from convolution, attention and fully connected layers. The encoder processes the audio features, while the decoder receives the encoder output together with the partial text transcript and predicts the next token. We therefore expect the encoder to extract the relevant audio features, and the decoder to infer their meaning and produce the transcription.

In [9]:
model

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

For our purposes we only need the encoder. Its output has 768 channels and, for 30 s clips, 1500 time steps (down from 3000 because the second convolution layer has a stride of 2). Compared to the preprocessor output, the data is also transposed to follow the usual transformer convention, so the encoder output has shape (batch, 1500, 768).

In [10]:
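# use input_features, which is already cast to torch_dtype and moved to device
with torch.no_grad():  # inference only, no gradients needed
    encoded = model.model.encoder(input_features).last_hidden_state
encoded.shape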

torch.Size([1, 1500, 768])
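
The halving from 3000 to 1500 time steps can be verified with the standard output-length formula for a 1D convolution, using the kernel size, stride and padding shown in the model printout above. This is a small sketch; `conv1d_out_len` is a helper name of our own, and a dilation of 1 is assumed.

```python
def conv1d_out_len(length: int, kernel_size: int, stride: int, padding: int) -> int:
    """Output length of a 1D convolution (dilation 1)."""
    return (length + 2 * padding - kernel_size) // stride + 1


after_conv1 = conv1d_out_len(3000, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)
print(after_conv1, after_conv2)  # 3000 1500
```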

## Transfer learning

In this section we build an accent classifier by adding an `AccentClassifier` class on top of Whisper's encoder. This class is implemented in the file `whisper_accent.py`. It consists of:
 - an average pooling layer that averages over time steps ($1500 \to 1$)
 - a fully connected layer ($768 \to 2$) with no nonlinearity, outputting logits.

Accents typically inflect words throughout speech, which makes average pooling appropriate. (For detecting subtler accents, a more refined form of pooling might be needed.) Because there is no nonlinear activation, the fully connected layer only projects Whisper's features linearly onto the directions that best separate these two accents. In particular, the classifier does not create new features!
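
To make the architecture concrete, here is a minimal PyTorch sketch of such a head. It is not the code from `whisper_accent.py`: the class name `PooledLinearHead` and its arguments are hypothetical, and the input is random data standing in for encoder outputs.

```python
import torch
from torch import nn


class PooledLinearHead(nn.Module):
    """Hypothetical stand-in for the classifier head: mean-pool over time, then a linear layer."""

    def __init__(self, n_features: int = 768, n_classes: int = 2):
        super().__init__()
        self.linear = nn.Linear(n_features, n_classes)

    def forward(self, encoded: torch.Tensor) -> torch.Tensor:
        # encoded: (batch, time, features), e.g. (batch, 1500, 768)
        pooled = encoded.mean(dim=1)   # (batch, features)
        return self.linear(pooled)     # (batch, n_classes) logits


torch.manual_seed(0)
head = PooledLinearHead()
encoded = torch.randn(4, 1500, 768)    # random stand-in for encoder outputs
logits = head(encoded)
print(logits.shape)  # torch.Size([4, 2])

# Because the head is linear, pooling then projecting gives the same result as
# projecting every frame and averaging the per-frame logits.
per_frame_logits = head.linear(encoded).mean(dim=1)
print(torch.allclose(logits, per_frame_logits, atol=1e-5))
```

The final check illustrates the point made above: with no nonlinearity, the head can only read out a time-averaged linear combination of features the encoder already computes.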

To train this model, we use supervised learning. In the following we describe how to build a small dataset, prepare the data loader, train and then test the model.

### The dataset

There are existing datasets that include accent labels. Here, however, there are good reasons to build our own:
 - it lets us train specifically on two accents, favouring specificity over generality
 - it simplifies processing the audio into the input format Whisper expects
 - it gives us control over the content of the clips

These choices let us train the model with limited computing resources, and provide a good starting point before building a more encompassing model. Control over content matters because models like Whisper are designed to extract meaning, and we want the classifier to rely on acoustic features rather than on what is being said. One way to mitigate this risk is to keep the distribution of content as similar as possible across both accents.

These steps are implemented in the external script `video_dl.py`. It extracts the audio tracks from ~100 publicly available YouTube videos featuring native speakers with clear accents, and cuts them into about 700 clips of 30 s each, split into a training set and a dev set.
See `video_dl.py` for more details.
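
As an illustration of the clipping step only, the sketch below cuts a waveform into consecutive 30 s clips with NumPy. It is not taken from `video_dl.py`: the helper `split_into_clips` is hypothetical, and dropping the final incomplete clip is an assumption made here for simplicity.

```python
import numpy as np

SAMPLE_RATE = 16_000   # Hz
CLIP_SECONDS = 30      # clip length used in this project


def split_into_clips(audio: np.ndarray,
                     sample_rate: int = SAMPLE_RATE,
                     clip_seconds: int = CLIP_SECONDS) -> np.ndarray:
    """Hypothetical helper: cut a 1-D waveform into full-length clips, dropping the remainder."""
    clip_len = sample_rate * clip_seconds
    n_clips = audio.shape[0] // clip_len
    return audio[: n_clips * clip_len].reshape(n_clips, clip_len)


track = np.zeros(100 * SAMPLE_RATE, dtype=np.float32)  # a silent 100 s track
clips = split_into_clips(track)
print(clips.shape)  # (3, 480000)
```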

Existing datasets:
 - https://huggingface.co/datasets/NathanRoll/commonvoice_train_gender_accent_16k
 - https://huggingface.co/WillHeld

### Loading the dataset

To speed up training, we run each clip through Whisper's preprocessor and encoder once, and save the encoder outputs to a `dataset_preprocessed` folder.

In [11]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

dataset = load_dataset("audiofolder",  # "audiofolder" script reads from data_dir
                       data_dir="dataset/",  # with structure train/accent, dev/accent
                       cache_dir=".cache/"
                      )  # streaming=True could be passed here to avoid overloading memory

# load model and preprocessor
model_id = "openai/whisper-small"
processor = AutoProcessor.from_pretrained(model_id, torch_dtype=torch_dtype)
# The encoder is run once during preprocessing, so the classifier is fed
# pre-encoded features and must not apply the encoder again (use_encoder=False)
whisper_model = whisper_accent.AccentModel(use_encoder=False)
_ = whisper_model.to(device).to(torch_dtype)

Resolving data files:   0%|          | 0/564 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/151 [00:00<?, ?it/s]

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Sanity check: play an element of the dataset

In [12]:
streaming = False
if streaming:
    item = next(iter(dataset['train']))
    print(item)
else:
    item = dataset['train'][0]
    print(item)
IPython.display.Audio(item['audio']['array'], rate=16000)

{'audio': {'path': '/home/maxime/Dropbox/Job preparation/Machine learning/Speech/dataset/train/american/-96-lAfagow_0.mp3', 'array': array([-0.11506341, -0.12139536, -0.10677178, ..., -0.00088448,
       -0.00142263, -0.00095254]), 'sampling_rate': 16000}, 'label': 0}


Next we define the preprocessing function and use `map` to process the entire dataset. Note that this step takes a few minutes to run on Colab.

In [2]:
def preprocess_audio(audio_file, whisper_model, processor):
    """ Preprocesses an audio clip by:
     - using Whisper's preprocessor (Mel spectrogram)
     - feedforward through Whisper's encoder
     Returns a dictionary containing 'input_features'.
     
    Args:
     - audio_file: dictionary with structure {'audio': {'array': data}}
     - whisper_model: AccentModel instance, implements encode
     - processor: AutoProcessor instance for Whisper preprocessing
    Returns:
     - dict containing {'input_features': tensor of shape (1, 1500, 768)}.
    """
    # Compute the log-Mel spectrogram with Whisper's preprocessor
    audio_input = processor(audio_file['audio']['array'], sampling_rate=16_000, return_tensors="pt")
    input_features = audio_input.input_features.type(torch_dtype).to(device)

    # Run through Whisper encoder (no gradients needed for feature extraction)
    with torch.no_grad():
        encoder_output = whisper_model.encode(input_features)

    return {'input_features': encoder_output}

DATASET_PREPROCESSED_PATH = "dataset_preprocessed"  # output dir
recache = not os.path.exists(DATASET_PREPROCESSED_PATH)
if recache:
    # apply preprocessing
    # note, preprocess_audio does not support batching at the moment
    dataset_preprocessed = dataset.map(lambda sample: preprocess_audio(sample, whisper_model, processor),
                                       # batch_size=16,
                                       writer_batch_size=16  # rows buffered before each write to the cache
                                      )
    dataset_preprocessed.set_format(type='torch')
    dataset_preprocessed.save_to_disk(DATASET_PREPROCESSED_PATH)  # save
else:
    dataset_preprocessed = load_from_disk(DATASET_PREPROCESSED_PATH)  # load from disk

In [4]:
from torch.utils.data import DataLoader
from torch import nn

train_dataloader = DataLoader(dataset_preprocessed['train'].remove_columns('audio'),
                              batch_size=16,
                              shuffle=True)
# evaluation metrics do not depend on sample order, so no shuffling is needed
test_dataloader = DataLoader(dataset_preprocessed['validation'].remove_columns('audio'),
                             batch_size=16,
                             shuffle=False)

### Training the model

Load the model from a checkpoint, if available:

In [15]:
MODEL_NAME = "accent_model.pth"
LOAD_WEIGHTS = os.path.isfile(MODEL_NAME)

if LOAD_WEIGHTS:
    whisper_model.accent_classifier.load_state_dict(torch.load(MODEL_NAME, map_location=device))
    print(f"Loaded weights from {MODEL_NAME}")

Loaded weights from accent_model.pth


Train the model:

In [5]:
TRAIN_MODEL = False
batch_size = 16

def train_model(train_dataloader : DataLoader, test_dataloader : DataLoader,
                model : whisper_accent.AccentModel,
                loss_fn,
                optimizer,
                epochs : int = 1) -> None:
    """ Basic training loop with simple progress bar.
    """
    
    size = len(train_dataloader.dataset)  # nb of samples

    for epoch in range(epochs):
        model.train()  # set training mode
        total_loss = 0  # to compute avg_train_loss
        progress_bar = tqdm(range(size), position=0, leave=True,
                            bar_format='{l_bar}{bar:15}{r_bar}{bar:-10b}'
                           )
        
        for batch, batch_data in enumerate(train_dataloader):
            # Compute prediction and loss
            X, y = batch_data['input_features'], batch_data['label']
            X = torch.reshape(X, (-1,1500,768)).to(device)  # (batch, 1500, 768), on the model's device
            y = y.to(device)
            pred = model(X)
            loss = loss_fn(pred, y)
    
            # Backpropagation
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    
            # update progress bar
            loss = loss.item()
            total_loss += loss
            progress_bar.set_description(f"Epoch {epoch+1} (loss {loss:.3f}, avg loss {total_loss/(batch+1):.3f}", refresh=False)
            progress_bar.update(X.shape[0])  # advance progress bar by num of samples processed

        # between epochs: print test metrics
        progress_bar.refresh()
        progress_bar.close()
        # calculate test loss
        accuracy, test_loss = test_metric(test_dataloader, model, loss_fn)
        print(f"\t test accuracy: {100*accuracy:.1f}%, loss: {test_loss:.3f}")


def test_metric(dataloader, model, loss_fn):
    model.eval()  # set to eval mode
    size = len(dataloader.dataset)
    num_batches = len(dataloader)
    test_loss, correct = 0, 0

    # torch.no_grad() disables gradient tracking during evaluation,
    # which saves memory and computation
    with torch.no_grad():
        for batch_data in dataloader:
            X, y = batch_data['input_features'], batch_data['label']
            X = torch.reshape(X, (-1,1500,768)).to(device)  # (batch, 1500, 768), on the model's device
            y = y.to(device)
            pred = model(X)
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= num_batches
    correct /= size
    return correct, test_loss

if TRAIN_MODEL:
    loss_fn = nn.CrossEntropyLoss()
    lr = 6e-4
    epochs = 11
    optimizer = torch.optim.AdamW(whisper_model.parameters(), lr=lr)
    train_model(train_dataloader, test_dataloader, whisper_model, loss_fn, optimizer, epochs=epochs)
    torch.save(whisper_model.accent_classifier.state_dict(), MODEL_NAME)
    print(f"Saved model to {MODEL_NAME}")

Epoch 1 (loss 0.673, avg loss 0.660: 100%|███████████████| 564/564 [00:16<00:00,


	 test accuracy: 77.5%, loss: 0.578


Epoch 2 (loss 0.494, avg loss 0.557: 100%|███████████████| 564/564 [00:15<00:00,


	 test accuracy: 88.7%, loss: 0.509


Epoch 3 (loss 0.405, avg loss 0.493: 100%|███████████████| 564/564 [00:15<00:00,


	 test accuracy: 86.1%, loss: 0.467


Epoch 4 (loss 0.512, avg loss 0.440: 100%|███████████████| 564/564 [00:21<00:00,


	 test accuracy: 91.4%, loss: 0.406


Epoch 5 (loss 0.433, avg loss 0.400: 100%|███████████████| 564/564 [00:15<00:00,


	 test accuracy: 91.4%, loss: 0.377


Epoch 6 (loss 0.405, avg loss 0.365: 100%|███████████████| 564/564 [00:16<00:00,


	 test accuracy: 93.4%, loss: 0.357


Epoch 7 (loss 0.365, avg loss 0.336: 100%|███████████████| 564/564 [00:17<00:00,


	 test accuracy: 94.0%, loss: 0.327


Epoch 8 (loss 0.499, avg loss 0.328: 100%|███████████████| 564/564 [00:19<00:00,


	 test accuracy: 95.4%, loss: 0.302


Epoch 9 (loss 0.234, avg loss 0.293: 100%|███████████████| 564/564 [00:16<00:00,


	 test accuracy: 94.7%, loss: 0.285


Epoch 10 (loss 0.147, avg loss 0.279: 100%|███████████████| 564/564 [00:15<00:00


	 test accuracy: 95.4%, loss: 0.266


Epoch 11 (loss 0.854, avg loss 0.281: 100%|███████████████| 564/564 [00:15<00:00


	 test accuracy: 95.4%, loss: 0.260


Notice that the training and test losses (the latter computed on the dev set) are similar and both still decreasing, which suggests that the model is not overfitting and may still be underfitting. The accuracy plateaus at around 95% from epoch 8 onwards.
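
To put the logged losses in context, it helps to compare them with the cross-entropy of a classifier that has learned nothing. The sketch below computes this chance-level value for two classes in plain Python, and converts the final logged test loss of 0.260 into the corresponding (geometric mean) probability assigned to the correct class. The function `cross_entropy` is written here for illustration; the notebook itself uses `nn.CrossEntropyLoss`.

```python
import math


def cross_entropy(logits: list, label: int) -> float:
    """-log softmax(logits)[label], computed with log-sum-exp for stability."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_norm - logits[label]


chance = cross_entropy([0.0, 0.0], label=0)   # equal logits: no preference between accents
print(round(chance, 3))                       # 0.693, i.e. ln 2

final_test_loss = 0.260                       # value logged after epoch 11
print(round(math.exp(-final_test_loss), 3))   # 0.771
```

The first-epoch average training loss of 0.660 is close to the chance level, as expected for a freshly initialised linear layer.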

## Conclusion

Our simple classification model is remarkably effective, reaching ~95% accuracy on the dev set. Since the classifier is only a linear layer on top of time-averaged encoder outputs, this suggests that Whisper's encoder already extracts features relevant to accent identification. Moreover, the learned weights of that layer give a first indication of which directions in Whisper's feature space distinguish the two accents studied: UK Received Pronunciation and US General American.

While these results are promising, there are several avenues for improvement and further exploration:

1. Model Refinement:

 - Include a test set to assess the performance of the model in a wider setting.
 - Conduct thorough error analysis to check that the model's performance aligns with expectations.
 - Establish a human-level baseline for accent classification to better contextualise our model's performance.
 - Use these insights to diagnose potential underfitting or overfitting, guiding further model improvements.
 - Try variations on the architecture, e.g. max pooling or using intermediate layers of the encoder.

2. Expansion of Scope:

 - Extend the classification task to include a broader range of accents, testing the model's generalisation capabilities.
 - Explore the potential of neural style transfer techniques to artificially modify speaker accents, which could have applications in speech synthesis and language learning.

3. Deeper Analysis of Whisper:

 - Apply dictionary-learning techniques to identify specific features within Whisper that are most relevant for accent detection. This analysis could provide valuable insights into how deep learning models process and represent accent information.
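
As a starting point for the error analysis suggested under Model Refinement, a confusion matrix shows whether errors are concentrated in one accent. The sketch below uses NumPy on made-up labels and predictions, for illustration only; the mapping 0 = american follows the label shown in the dataset sample above, 1 = british is assumed, and `confusion_matrix` is a helper written here.

```python
import numpy as np


def confusion_matrix(labels, preds, n_classes: int = 2) -> np.ndarray:
    """Rows are true classes, columns are predicted classes."""
    matrix = np.zeros((n_classes, n_classes), dtype=int)
    for true, pred in zip(labels, preds):
        matrix[true, pred] += 1
    return matrix


# Made-up values for illustration only (0 = american, 1 = british)
labels = np.array([0, 0, 0, 1, 1, 1, 1, 0])
preds = np.array([0, 0, 1, 1, 1, 0, 1, 0])

cm = confusion_matrix(labels, preds)
per_class_recall = cm.diagonal() / cm.sum(axis=1)
print(cm)
print(per_class_recall)
```

In practice, `labels` and `preds` would be collected inside the evaluation loop, and the misclassified clips could then be listened to individually.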



These future directions not only promise to enhance the model's performance but also offer opportunities to gain deeper insights into the nature of accent representation in advanced speech recognition systems. By pursuing these avenues, we can further our understanding of how AI systems process and utilise the subtle nuances of human speech.