# Accent detection project

## Overview

This project explores how well speech recognition systems capture subtle features of voice, such as accents.

In the first part of this project, we attempt to repurpose the speech recognition system [whisper](https://huggingface.co/openai/whisper-large-v3) for accent classification. Specifically, we use the whisper encoder as a feature extractor and train an accent classifier on top of it (transfer learning).

The second part of this project is more ambitious: using dictionary learning, we aim to identify features inside whisper that correspond to accents.
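
As a rough illustration of what dictionary learning involves, the sketch below implements only the sparse-coding step (ISTA) for a fixed dictionary, on random data standing in for whisper activations. The function names, the penalty `l1` and the array sizes are assumptions made for illustration and are not part of the project code; a full dictionary-learning loop would alternate this step with an update of the dictionary itself.

```python
import numpy as np


def soft_threshold(z, threshold):
    """Shrink every entry of z towards zero by `threshold`."""
    return np.sign(z) * np.maximum(np.abs(z) - threshold, 0.0)


def objective(x, dictionary, codes, l1):
    """0.5 * ||x - codes @ dictionary||^2 + l1 * ||codes||_1"""
    residual = x - codes @ dictionary
    return 0.5 * np.sum(residual ** 2) + l1 * np.sum(np.abs(codes))


def sparse_codes(x, dictionary, l1=0.1, n_steps=200):
    """Sparse-code x (n_samples, dim) against dictionary (n_atoms, dim) with ISTA."""
    # Step size 1/L, where L is the largest eigenvalue of dictionary @ dictionary.T
    step = 1.0 / np.linalg.norm(dictionary, ord=2) ** 2
    codes = np.zeros((x.shape[0], dictionary.shape[0]))
    for _ in range(n_steps):
        # Gradient of the reconstruction term with respect to the codes
        grad = (codes @ dictionary - x) @ dictionary.T
        codes = soft_threshold(codes - step * grad, step * l1)
    return codes


rng = np.random.default_rng(0)
activations = rng.normal(size=(32, 16))  # hypothetical stand-in for encoder activations
dictionary = rng.normal(size=(64, 16))   # hypothetical overcomplete dictionary (64 atoms)
dictionary /= np.linalg.norm(dictionary, axis=1, keepdims=True)

codes = sparse_codes(activations, dictionary)
start = objective(activations, dictionary, np.zeros_like(codes), l1=0.1)
final = objective(activations, dictionary, codes, l1=0.1)
print(f"objective: {start:.2f} -> {final:.2f}")
print(f"fraction of non-zero codes: {np.mean(codes != 0):.2f}")
```

Each row of `codes` describes one activation vector as a sparse combination of dictionary atoms; the hope is that some atoms would line up with accent-related features.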

### Todo

 - select a small speech recognition model
 - test usage of the model
 - review its architecture, in particular the transformer components (attention heads, positional encoding, ...)
 - Q: does the model know about different accents? Identify the corresponding features.

Toy problem:

 - Can we build a model that changes the accent of the speaker?
 - One option is supervised learning, but there is no dataset for it. We want to use neural style transfer instead: choose two audio clips, one providing the content and the other the style, compute the activations of both in a given layer of a neural network, then minimise a loss function over a generated clip so that its activations match the content of the first clip and the style of the second.
 - Maybe there is a better approach where we predict denoising?
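
The style-transfer idea in the list above can be sketched as a loss on layer activations. This is a minimal illustration in the spirit of image style transfer (content term plus a Gram-matrix style term), not a tested recipe for audio. The function names, the `style_weight` parameter and the random arrays standing in for activations are all assumptions.

```python
import numpy as np


def gram_matrix(acts):
    """Channel-by-channel correlations of activations with shape (time, channels)."""
    return acts.T @ acts / acts.shape[0]


def style_transfer_loss(generated, content, style, style_weight=1.0):
    """Content term (match activations) plus style term (match Gram matrices)."""
    content_loss = np.mean((generated - content) ** 2)
    style_loss = np.mean((gram_matrix(generated) - gram_matrix(style)) ** 2)
    return content_loss + style_weight * style_loss


rng = np.random.default_rng(0)
content_acts = rng.normal(size=(100, 8))  # hypothetical activations of the content clip
style_acts = rng.normal(size=(150, 8))    # the style clip may have a different length
generated_acts = content_acts.copy()      # start the search from the content clip

loss = style_transfer_loss(generated_acts, content_acts, style_acts)
print(f"initial loss: {loss:.4f}")
```

In practice the activations would come from a network such as the whisper encoder, and the loss would be minimised with respect to the generated clip using automatic differentiation (for example in PyTorch) rather than NumPy.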

Note: maybe use dataset:
https://huggingface.co/datasets/NathanRoll/commonvoice_train_gender_accent_16k
also: https://huggingface.co/WillHeld

## The model

We use `whisper-small` for these tests and load it following the instructions given on Hugging Face.

In [1]:
import numpy as np
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset


device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    torch_dtype=torch_dtype,
    device=device,
)



Check that the model runs using the automatic pipeline:

In [112]:
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])

 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, simile is drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of Up Guards and Atom paintings, and Mason's exquisite idles are as national as a Jingo poem. Mr. Birkut Foster's landscapes smile at one much in the same way that Mr. Karker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap on the back before he says, like a shampoo or a Turkish bath, next man,


Reproduce this using a manual pipeline:

In [2]:
import numpy as np
import torch
from transformers import AutoProcessor, WhisperForConditionalGeneration
from datasets import load_dataset

# config
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"

# load model
model = WhisperForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

# load processor
processor = AutoProcessor.from_pretrained(model_id)

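# load one sample; the processor needs to know the sampling rate of the audio
ds = load_dataset("distil-whisper/librispeech_long", "clean", split="validation", trust_remote_code=True)
audio = ds[0]["audio"]
inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
# match the device and dtype of the model (float16 on GPU)
input_features = inputs.input_features.to(device, dtype=torch_dtype)
# the processor truncates the input to 30 s, so only the start of the clip is transcribed
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)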


 Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, symbolies drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover


Test with custom data:

In [43]:
import librosa

# whisper expects 16 kHz audio; librosa resamples the file on load
data, _ = librosa.load("dataset/british/rlaPLvETBug_1.mp3", sr=16000)
inputs = processor(data, sampling_rate=16000, return_tensors="pt")
# match the device and dtype of the model (float16 on GPU)
input_features = inputs.input_features.to(device, dtype=torch_dtype)
generated_ids = model.generate(input_features=input_features)
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)

 I'm quite unscrupulous and not very clever. And still they managed to do great mathematics. So it told the kid that if they can do it, why can't you? And that was certainly what turned me on. I came from England to the United States to study physics. I applied to Cornell University to work with Hans Bethe, who is a famous physicist. But the amazing thing was in the very first week I was there, I met Dick Feynman, who is an absolute


### The structure of the model

The input is a log-Mel spectrogram of shape (80, 3000): 80 Mel frequency channels by 3000 time frames, i.e. one frame every 10 ms over a 30 s window.

In [45]:
print(input_features.shape)

torch.Size([1, 80, 3000])
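
The shape above follows from the feature-extraction settings. The short check below assumes whisper's usual values (16 kHz audio, a hop of 160 samples between frames, 30 s chunks, 80 Mel channels); the constants are hard-coded here for illustration and can instead be read from `processor.feature_extractor` (`sampling_rate`, `hop_length`, `chunk_length`, `feature_size`).

```python
SAMPLING_RATE = 16_000  # Hz, assumed whisper default
HOP_LENGTH = 160        # samples between consecutive frames, assumed whisper default
CHUNK_SECONDS = 30      # length of the window the model sees
N_MELS = 80             # Mel channels for whisper-small

# Time between two frames, in milliseconds
frame_step_ms = 1000 * HOP_LENGTH / SAMPLING_RATE

# Number of frames in one 30 s window
n_frames = CHUNK_SECONDS * SAMPLING_RATE // HOP_LENGTH

print(f"frame step: {frame_step_ms} ms")
print(f"spectrogram shape: ({N_MELS}, {n_frames})")
```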


In [79]:
model

WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 768, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(768, 768, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 768)
      (layers): ModuleList(
        (0-11): 12 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=768, out_features=768, bias=False)
            (v_proj): Linear(in_features=768, out_features=768, bias=True)
            (q_proj): Linear(in_features=768, out_features=768, bias=True)
            (out_proj): Linear(in_features=768, out_features=768, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=768, out_features=3072, bias=True)
          (fc2): Linear(in_features=3072, out_features=768, bias=True)
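
The two convolutions at the top of the encoder explain why `embed_positions` has 1500 entries: `conv1` keeps the 3000 frames and `conv2` (stride 2) halves them. The helper below is a hypothetical illustration that applies the standard 1D convolution output-length formula (with dilation 1) to the kernel sizes, strides and paddings printed above.

```python
def conv1d_output_length(length, kernel_size, stride, padding):
    """Output length of a 1D convolution with dilation 1."""
    return (length + 2 * padding - kernel_size) // stride + 1


n_frames = 3000  # time frames in the input spectrogram

# conv1: kernel_size=3, stride=1, padding=1
after_conv1 = conv1d_output_length(n_frames, kernel_size=3, stride=1, padding=1)

# conv2: kernel_size=3, stride=2, padding=1
after_conv2 = conv1d_output_length(after_conv1, kernel_size=3, stride=2, padding=1)

print(f"after conv1: {after_conv1} frames")
print(f"after conv2: {after_conv2} frames")
```

Each of the resulting positions therefore covers roughly 20 ms of audio and is represented by a 768-dimensional vector.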
        

## Preparing the dataset

We've constructed a small dataset already, see [...].
To load it, we call `load_dataset` with the `"audiofolder"` builder, which takes the label of each clip from the name of its parent directory:

In [3]:
dataset = load_dataset("audiofolder", data_dir="dataset/")
dataset = dataset['train'].train_test_split(test_size=0.2)

Resolving data files:   0%|          | 0/227 [00:00<?, ?it/s]

In [41]:
from IPython.display import Audio

sample = dataset['train'][0]
print(sample)
Audio(sample['audio']['array'], rate=sample['audio']['sampling_rate'])

{'audio': {'path': '/home/maxime/Dropbox/Job preparation/Machine learning/Speech/dataset/british/57AzwH0Q6lA_0.mp3', 'array': array([ 0.00025885, -0.00057162, -0.00099729, ...,  0.0010395 ,
        0.02615245,  0.04395052]), 'sampling_rate': 16000}, 'label': 1}


In [4]:
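def preprocess_audio(item):
    # compute the log-Mel features of one clip and return them as a new column
    audio = item['audio']
    features = processor(audio['array'], sampling_rate=audio['sampling_rate'])
    return {'input_features': features.input_features[0]}
train_features = dataset['train'].map(preprocess_audio)  # map returns a new dataset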

Map:   0%|          | 0/181 [00:00<?, ? examples/s]

KeyboardInterrupt: 

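Run a clip through the encoder alone to obtain the features that the accent classifier will use:

In [7]:
import librosa
data, _ = librosa.load("dataset/british/rlaPLvETBug_1.mp3", sr=16000)
inputs = processor(data, sampling_rate=16000, return_tensors="pt")
# match the device and dtype of the model (float16 on GPU)
input_features = inputs.input_features.to(device, dtype=torch_dtype)
# call the module itself rather than .forward() so that any registered hooks run
result = model.model.encoder(input_features)
print(result)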

BaseModelOutput(last_hidden_state=tensor([[[-0.9986, -0.6993,  1.8703,  ...,  0.5181, -1.4452, -0.3796],
         [ 0.8717,  1.7176,  2.2849,  ...,  0.0507, -0.7438, -2.2120],
         [ 1.1933,  1.0807,  3.0355,  ...,  1.1174, -0.7023, -1.4283],
         ...,
         [-0.1414,  0.6472,  3.1696,  ..., -0.5337, -1.7626,  1.2193],
         [-1.6642,  0.5604,  2.8126,  ..., -0.4803, -1.4233,  1.6012],
         [-1.9856,  0.5954,  2.1323,  ..., -0.0580, -1.0761,  1.3244]]],
       grad_fn=<NativeLayerNormBackward0>), hidden_states=None, attentions=None)


In [4]:
result.last_hidden_state.shape

torch.Size([1, 1500, 768])
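
The encoder returns one 768-dimensional vector for each of the 1500 time positions. One simple way to turn this into an accent prediction is to average over time and apply a linear layer, as sketched below. This is only an assumption about how the classifier could look: the class name, the mean pooling and the number of classes (`n_classes=2`) are hypothetical, and a random tensor stands in for real encoder output.

```python
import torch
from torch import nn


class AccentClassifierHead(nn.Module):
    """Mean-pool encoder features over time, then classify with a linear layer."""

    def __init__(self, hidden_size=768, n_classes=2):
        super().__init__()
        self.linear = nn.Linear(hidden_size, n_classes)

    def forward(self, hidden_states):
        # hidden_states: (batch, time, hidden_size) -> pooled: (batch, hidden_size)
        pooled = hidden_states.mean(dim=1)
        return self.linear(pooled)


torch.manual_seed(0)
hidden_states = torch.randn(4, 1500, 768)  # stand-in for result.last_hidden_state
labels = torch.tensor([0, 1, 1, 0])        # hypothetical accent labels

head = AccentClassifierHead()
logits = head(hidden_states)
loss = nn.functional.cross_entropy(logits, labels)
print(logits.shape, loss.item())
```

With the whisper encoder frozen, only the parameters of this head would need training, which keeps the transfer-learning step cheap on a dataset of a few hundred clips.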