# Accent detection project

## Overview

This project explores how well speech recognition systems capture subtle features of voice, such as accents.

In the first part of this project, we attempt to repurpose the speech recognition system [Whisper](https://huggingface.co/openai/whisper-large-v3) to perform accent classification. Specifically, we use Whisper's encoder as a feature extractor and apply transfer learning to train an accent classifier on top of it.

The second part of this project is more ambitious: it aims to identify features within Whisper that correspond to accents, using dictionary learning.

### Todo

1. Data processing:
 - increase the number of samples
 - use data augmentation to generate more samples
 - separate audio tracks between train/dev sets to avoid data leakage

2. Transfer learning
 - implement classification layers on top of encoder
 - integrate into pipeline
 - preprocess data through encoder
 - make it work on colab, use GPU

3. Model review
 - test usage of the model
 - review its architecture, in particular transformer architecture (attention heads, positional encoding ...)
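
The train/dev separation listed under item 1 can be sketched as a split by source track. This is only an illustration: it assumes clip names follow a `<track id>_<chunk index>.mp3` pattern (as the file names appearing later in this notebook suggest), and `track_id`, `split_by_track`, `trackC` and `trackD` are hypothetical names introduced here.

```python
import random
from collections import defaultdict

def track_id(filename):
    """Return the source-track part of a clip name such as 'rlaPLvETBug_1.mp3'."""
    stem = filename.rsplit(".", 1)[0]
    return stem.rsplit("_", 1)[0]  # drop the trailing chunk index

def split_by_track(filenames, dev_fraction=0.2, seed=0):
    """Assign whole tracks to either train or dev, so no track spans both sets."""
    clips_by_track = defaultdict(list)
    for name in filenames:
        clips_by_track[track_id(name)].append(name)

    tracks = sorted(clips_by_track)
    random.Random(seed).shuffle(tracks)
    n_dev = max(1, round(dev_fraction * len(tracks)))
    dev_tracks = set(tracks[:n_dev])

    train = [c for t in tracks if t not in dev_tracks for c in clips_by_track[t]]
    dev = [c for t in tracks if t in dev_tracks for c in clips_by_track[t]]
    return train, dev

# Hypothetical clip list: two tracks taken from this notebook, two invented ones
clips = ["rlaPLvETBug_0.mp3", "rlaPLvETBug_1.mp3", "-96-lAfagow_0.mp3",
         "-96-lAfagow_1.mp3", "trackC_0.mp3", "trackD_0.mp3", "trackD_1.mp3"]
train, dev = split_by_track(clips, dev_fraction=0.25)
print(len(train), len(dev))
```

Splitting at the track level rather than the clip level keeps chunks of the same speaker and recording conditions out of both sets at once, which is the leakage the todo item refers to.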


More fun:
 - can we build a model that changes the accent of the speaker?
 - one approach is supervised learning, but no suitable dataset exists. Instead, we want to use neural style transfer: choose two audio clips, one providing the content and the other the style, compute the activations of both at a given layer of a neural network, then minimise a loss function over a generated clip so that it matches the content of the first and the style of the second.
 - Maybe there is a better approach where we predict denoising?
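
The style-transfer idea above can be sketched as follows. This is a minimal illustration, not a working accent converter: `feature_net` is a hypothetical stand-in (a frozen random convolution) for a layer of a real network such as the Whisper encoder, the two clips are random tensors, and the Gram-matrix style loss is borrowed from image style transfer as one possible choice of loss.

```python
import torch

torch.manual_seed(0)

# Stand-in feature extractor (assumption): one frozen random convolution over time.
# In practice this would be a chosen layer of a pretrained network.
feature_net = torch.nn.Sequential(
    torch.nn.Conv1d(80, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
)
for p in feature_net.parameters():
    p.requires_grad_(False)

def gram_matrix(features):
    """Channel-by-channel correlations averaged over time: (B, C, T) -> (B, C, C)."""
    return features @ features.transpose(1, 2) / features.shape[-1]

def style_transfer_loss(generated, content, style, style_weight=10.0):
    f_gen, f_content, f_style = (feature_net(x) for x in (generated, content, style))
    content_loss = torch.mean((f_gen - f_content) ** 2)
    style_loss = torch.mean((gram_matrix(f_gen) - gram_matrix(f_style)) ** 2)
    return content_loss + style_weight * style_loss

# Random placeholders for two spectrograms: (batch, Mel bins, frames)
content = torch.randn(1, 80, 100)
style = torch.randn(1, 80, 100)

# Optimise the generated clip itself, starting from the content clip
generated = content.clone().requires_grad_(True)
optimizer = torch.optim.SGD([generated], lr=1.0)

losses = []
for _ in range(50):
    optimizer.zero_grad()
    loss = style_transfer_loss(generated, content, style)
    loss.backward()
    optimizer.step()
    losses.append(loss.item())

print(losses[0], losses[-1])
```

Only the input `generated` is updated; the network weights stay fixed. Turning an optimised spectrogram back into audio would need a separate inversion step that this sketch does not cover.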

Note: maybe use dataset:
https://huggingface.co/datasets/NathanRoll/commonvoice_train_gender_accent_16k
also: https://huggingface.co/WillHeld

## Test: the base model

We use whisper-small for these tests. Load the model following the instructions given on HuggingFace.

In [1]:
import numpy as np
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


Check that the model runs using the automatic pipeline:

In [6]:
import IPython
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
IPython.display.Audio(sample['array'], rate=16000)

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, symbolies drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Atom paintings, and Mason's exquisite idles are as national as a jingle poem. Mr. Birkitt Foster's landscapes smile at one much in the same way that Mr. Karker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says like a shampoo and a Turkish bath, next man


Reproduce this using a manual pipeline. Since we call the model directly, without the pipeline's chunking, only the first 30 s of the clip are decoded.

In [36]:
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# config
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"

# load model
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# load processor
processor = AutoProcessor.from_pretrained(model_id)

ds = load_dataset("distil-whisper/librispeech_long", "clean", split="validation", trust_remote_code=True)
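inputs = processor(ds[0]["audio"]["array"], sampling_rate=ds[0]["audio"]["sampling_rate"], return_tensors="pt")
# Match the model's device and dtype (float16 on GPU), otherwise generate() fails on GPU
inputs = inputs.to(device, dtype=torch_dtype)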
generated_ids = model.generate(**inputs, language="en")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, symbolies drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover


Test with custom data:

In [40]:
import librosa
data, _ = librosa.load("dataset/british/rlaPLvETBug_1.mp3", sr=16000)  # resample to 16 kHz
inputs = processor(data, sampling_rate=16000, return_tensors="pt").to(device, dtype=torch_dtype)
generated_ids = model.generate(**inputs, language="en")
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

 I'm quite unscrupulous and not very clever. And still they managed to do great mathematics. So it told the kid that if they can do it, why can't you? And that was certainly what turned me on. I came from England to the United States to study physics. I applied to Cornell University to work with Hans Bethe, who is a famous physicist. But the amazing thing was in the very first week I was there, I met Dick Feynman, who is an absolute


### Review of the model

The input is a log-Mel spectrogram of shape (80, 3000): 80 Mel frequency channels by 3000 frames, i.e. one frame every 10 ms over a 30 s window.

In [45]:
print(inputs.input_features.shape)

torch.Size([1, 80, 3000])
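
The shape above can be checked with a little arithmetic. The sketch assumes the default Whisper feature extractor settings (16 kHz audio, a hop of 160 samples between frames, 30 s windows); the constant names are introduced here for illustration.

```python
SAMPLING_RATE = 16_000   # Hz, as used throughout this notebook
CHUNK_LENGTH_S = 30      # Whisper processes 30 s windows
HOP_LENGTH = 160         # samples between consecutive frames (assumed default)
N_MELS = 80              # Mel frequency channels for whisper-small

n_samples = SAMPLING_RATE * CHUNK_LENGTH_S           # samples in one window
n_frames = n_samples // HOP_LENGTH                   # spectrogram frames
frame_period_ms = 1000 * HOP_LENGTH / SAMPLING_RATE  # time between frames

print((1, N_MELS, n_frames), frame_period_ms)
```

Shorter clips are padded and longer clips truncated to this window, which is why every input has the same number of frames.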


In [79]:
model

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
        

In [45]:
# Run only the encoder; no gradients are needed for feature extraction
with torch.no_grad():
    encoded = model.model.encoder(**inputs).last_hidden_state
encoded.shape

torch.Size([1, 1500, 768])
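
The 1500 positions in the encoder output follow from the two convolutions in the model printout above: `conv1` keeps the length and `conv2` (stride 2) halves it. A small sketch using the standard output-length formula for a 1-D convolution with dilation 1 (`conv1d_out_len` is a helper name introduced here):

```python
def conv1d_out_len(length, kernel_size, stride, padding):
    """Output length of a 1-D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1

n_frames = 3000  # spectrogram frames for a 30 s window
after_conv1 = conv1d_out_len(n_frames, kernel_size=3, stride=1, padding=1)
after_conv2 = conv1d_out_len(after_conv1, kernel_size=3, stride=2, padding=1)

print(after_conv1, after_conv2)
```

Each encoder position therefore covers roughly 20 ms of audio, and the last dimension (768) is the hidden size of whisper-small.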

## Generating the dataset

See the external Python script.

## Preparing the dataset

We've constructed a small dataset already, see [...].
To load it, we use the `load_dataset` method with the `"audiofolder"` builder:

In [14]:
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset
from torch.utils.data import DataLoader
import IPython
import whisper_accent
from tqdm import tqdm

streaming = False
dataset = load_dataset("audiofolder",
                       data_dir="dataset/",
                       cache_dir=".cache/"
                       # streaming=streaming
                      )  # use streaming to avoid overloading memory
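# Note: with_format returns a new dataset instead of modifying this one in place.
# To get torch tensors, assign the result: dataset = dataset.with_format("torch")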
# dataset = dataset['train'].train_test_split(test_size=0.2)

model_id = "openai/whisper-small"
processor = AutoProcessor.from_pretrained(model_id)
whisper_model = whisper_accent.AccentModel(use_encoder=False)

if streaming:
    item = next(iter(dataset['train']))
    print(item)
else:
    item = dataset['train'][0]
    print(item)
IPython.display.Audio(item['audio']['array'], rate=16000)

Resolving data files:   0%|          | 0/696 [00:00<?, ?it/s]

Resolving data files:   0%|          | 0/19 [00:00<?, ?it/s]


{'audio': {'path': '/home/maxime/Dropbox/Job preparation/Machine learning/Speech/dataset/train/american/-96-lAfagow_0.mp3', 'array': array([-0.11506341, -0.12139536, -0.10677178, ..., -0.00088448,
       -0.00142263, -0.00095254]), 'sampling_rate': 16000}, 'label': 0}


In [15]:
dataset['train'].features

{'audio': Audio(sampling_rate=None, mono=True, decode=True, id=None),
 'label': ClassLabel(names=['american', 'british'], id=None)}

In [17]:
def preprocess_audio(audio_file, whisper_model, processor):
    """Compute Whisper encoder features for a single dataset sample."""
    # Convert the raw waveform (sampled at 16 kHz) to log-Mel input features
    audio_input = processor(audio_file['audio']['array'], sampling_rate=16000, return_tensors="pt")

    # Run through the Whisper encoder; no gradients are needed for feature extraction
    with torch.no_grad():
        encoder_output = whisper_model.encode(audio_input.input_features)

    return {'input_features': encoder_output}

def preprocess_dataset(audio_files, whisper_model):
    """Encode every sample and stack the features into one NumPy array.

    Uses the notebook-level `processor` defined above.
    """
    preprocessed_data = []
    for audio_file in audio_files:
        encoder_output = preprocess_audio(audio_file, whisper_model, processor)['input_features']
        preprocessed_data.append(encoder_output.cpu().numpy())

    return np.array(preprocessed_data)

# Usage
# whisper_model = WhisperModel.from_pretrained("openai/whisper-base")
# preprocessed_data = preprocess_dataset(your_audio_files, whisper_model)
# np.save('preprocessed_whisper_features.npy', preprocessed_data)
        
# Usage
dataset_preprocessed = dataset['validation'].map(lambda sample: preprocess_audio(sample, whisper_model, processor))
# loader = DataLoader(dataset_preprocessed)

Map:   0%|          | 0/19 [00:00<?, ? examples/s]

In [20]:
to_process = dataset_preprocessed
for item in tqdm(to_process):
    # Print only the feature shape: printing the full arrays exceeds Jupyter's IOPub data rate limit
    print(np.asarray(item['input_features']).shape)

In [30]:
for item in dataset_preprocessed:
    print(whisper_model(item['input_features']))
    break

tensor([[-0.7702,  0.4509]], grad_fn=<AddmmBackward0>)


In [17]:
from torch.utils.data import DataLoader

loader = DataLoader(dataset_preprocessed, shuffle=True)

In [None]:
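from transformers import WhisperModel

def preprocess_dataset(audio_files, whisper_model, processor):
    """Encode each sample with a Hugging Face WhisperModel and stack the features."""
    preprocessed_data = []
    for audio_file in audio_files:
        # Convert the raw waveform (sampled at 16 kHz) to log-Mel input features
        audio_input = processor(audio_file['audio']['array'], sampling_rate=16000, return_tensors="pt")

        # Run through Whisper encoder
        with torch.no_grad():
            encoder_output = whisper_model.encoder(audio_input.input_features).last_hidden_state

        # Store the output
        preprocessed_data.append(encoder_output.cpu().numpy())

    return np.array(preprocessed_data)

# Usage
# your_audio_files: any iterable of dataset samples, e.g. dataset['validation']
# Note: whisper-base yields 512-dimensional features, whereas whisper-small (used above) yields 768
whisper_model = WhisperModel.from_pretrained("openai/whisper-base")
preprocessed_data = preprocess_dataset(your_audio_files, whisper_model, processor)
np.save('preprocessed_whisper_features.npy', preprocessed_data)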

## Transfer learning

We start by implementing the model; the implementation is the `AccentModel` class in the `whisper_accent` module imported above.

Instantiation and test:

In [59]:
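device = "cuda:0" if torch.cuda.is_available() else "cpu"

model = whisper_accent.AccentModel(use_encoder=False)
model.to(device)

# Dummy encoder output with the shape found above: (batch, positions, hidden size)
x = torch.rand((1, 1500, 768), device=device)
model(x)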

tensor([[-0.1335, -1.4753]], grad_fn=<AddmmBackward0>)
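
The `AccentModel` code itself is not reproduced in this notebook. As a rough idea of what a classification head over the encoder features could look like, here is a minimal sketch; `AccentHead`, its mean pooling over time and its dropout value are assumptions made for illustration and may differ from the actual implementation in `whisper_accent`.

```python
import torch
from torch import nn

class AccentHead(nn.Module):
    """Hypothetical classification head on top of precomputed Whisper encoder features."""

    def __init__(self, hidden_size=768, n_classes=2, dropout=0.1):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, n_classes)

    def forward(self, encoder_features):
        # encoder_features: (batch, positions, hidden), e.g. (1, 1500, 768)
        pooled = encoder_features.mean(dim=1)          # average over time
        return self.classifier(self.dropout(pooled))   # (batch, n_classes) logits

head = AccentHead()
logits = head(torch.rand(1, 1500, 768))

# Cross-entropy against a label; 0 corresponds to 'american' in the dataset's ClassLabel
loss = nn.functional.cross_entropy(logits, torch.tensor([0]))
print(logits.shape, loss.item())
```

Averaging over the 1500 positions gives one fixed-size vector per clip, so only the small linear layer has to be trained while the encoder stays frozen.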