# Code-switching Pipeline POC

This notebook is heavily based on <a href="https://github.com/openai/whisper/blob/main/notebooks/Multilingual_ASR.ipynb">the Multilingual ASR notebook</a> from the Whisper repository. It aims to combine the different tasks Whisper is trained on to produce transcriptions of multilingual (code-switched) speech.



Key idea:
For each frame,

In [34]:
from IPython.display import display, Audio, HTML

In [1]:
import torch
import transformers
import datasets



In [None]:
# torch device strings are lowercase: 'cuda' / 'cpu'
device = 'cuda' if torch.cuda.is_available() else 'cpu'

In [None]:
# Load a code-switching dataset (here, ASCEND)
dataset = datasets.load_dataset('CAiRE/ASCEND')

Generating train split: 100%|██████████| 9869/9869 [00:03<00:00, 3246.54 examples/s]
Generating test split: 100%|██████████| 1315/1315 [00:00<00:00, 4135.58 examples/s]
Generating validation split: 100%|██████████| 1130/1130 [00:00<00:00, 3561.93 examples/s]


In [54]:
# Pick one training example and play it alongside its reference transcription
ex = dataset['train'][2]
SAMPLING_RATE = ex['audio']['sampling_rate']

display(HTML('<h1> Example Audio Segment</h1><hr>'))
display(Audio(ex['audio']['array'], rate=SAMPLING_RATE))
display(HTML(f"Transcription: {ex['transcription']}"))

This example shows true intra-utterance code-switching: the utterance starts in Chinese, switches to an English phrase, and returns to Chinese for the closing particles.
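To make the notion of a switch point concrete, here is a minimal sketch that splits a mixed transcription into per-language runs using character script alone. The helper names `script_of` and `split_by_script` are hypothetical and not part of this notebook or of any library. The sample string is the text Whisper decodes for this example later in the notebook. Script-based splitting is only a rough proxy; it says nothing about the audio itself.

```python
def script_of(ch):
    """Return 'zh' for CJK ideographs, 'en' for ASCII letters, None otherwise."""
    if '\u4e00' <= ch <= '\u9fff':
        return 'zh'
    if ch.isascii() and ch.isalpha():
        return 'en'
    return None  # spaces, digits, punctuation: script-neutral


def split_by_script(text):
    """Group consecutive characters of the same script into (lang, text) runs."""
    runs = []
    for ch in text:
        lang = script_of(ch)
        if lang is None:
            # Neutral characters stay with the current run, if there is one
            if runs:
                runs[-1][1] += ch
        elif runs and runs[-1][0] == lang:
            runs[-1][1] += ch
        else:
            runs.append([lang, ch])
    # Trim spaces and common punctuation left at the run boundaries
    return [(lang, chunk.strip(' ,.!?,。!?')) for lang, chunk in runs]


runs = split_by_script('初次见面,nice to meet you')
for lang, chunk in runs:
    print(lang, chunk)
```

Each boundary between two runs is a candidate switch point in the text.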

In a zero-shot setting (no fine-tuning, no language hint), `whisper-medium` gives the following result on this example:

In [None]:
from transformers import WhisperForConditionalGeneration, WhisperProcessor

In [81]:
processor = WhisperProcessor.from_pretrained('openai/whisper-medium')
model = WhisperForConditionalGeneration.from_pretrained('openai/whisper-medium')

In [87]:
input_features = processor(ex['audio']['array'], sampling_rate=SAMPLING_RATE, return_tensors='pt').input_features

In [88]:
pred = model.generate(input_features)

Passing a tuple of `past_key_values` is deprecated and will be removed in Transformers v4.43.0. You should pass an instance of `EncoderDecoderCache` instead, e.g. `past_key_values=EncoderDecoderCache.from_legacy_cache(past_key_values)`.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [89]:
pred

tensor([[50258, 50260, 50359, 50363, 28727,  9487, 23813,  8833,    11,    77,
           573,   281,  1677,   291]])

In [91]:
transcription = processor.batch_decode(pred)

In [92]:
transcription

['<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>初次见面,nice to meet you']
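The decoded prefix shows that Whisper assigns a single language tag (`<|zh|>`) to the whole utterance, even though part of it is English. A small sketch for separating those special tokens from the text, using only the decoded string above; the helper `split_special_tokens` is a hypothetical name, and reading the language from the second token is an assumption based on this one output.

```python
import re

# Whisper special tokens look like <|name|> in the decoded string
SPECIAL_TOKEN = re.compile(r'<\|([^|]+)\|>')


def split_special_tokens(decoded):
    """Return (list of special-token names, remaining plain text)."""
    tokens = SPECIAL_TOKEN.findall(decoded)
    text = SPECIAL_TOKEN.sub('', decoded).strip()
    return tokens, text


decoded = '<|startoftranscript|><|zh|><|transcribe|><|notimestamps|>初次见面,nice to meet you'
tokens, text = split_special_tokens(decoded)

# Assumption: the language tag is the second special token, as in the output above
language = tokens[1]
print(language, '|', text)
```

If only the plain text is needed, `processor.batch_decode(pred, skip_special_tokens=True)` drops the special tokens at decode time.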

# NOTES

To do:
- Test how `WhisperModel` works shape-wise
- Get frame- or timestamp-level language identification
- Get a transcription for each language
- Collapse function
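
One possible reading of the "collapse function" item, sketched under assumptions: given one language label per fixed-length frame, merge consecutive identical labels into `(language, start, end)` segments that could then be transcribed one language at a time. The function name `collapse_language_frames`, the frame length, and the label sequence below are all hypothetical. Nothing here is produced by the model yet.

```python
from itertools import groupby


def collapse_language_frames(frame_langs, frame_seconds):
    """Merge runs of identical per-frame labels into (lang, start_s, end_s) segments."""
    segments = []
    index = 0  # index of the first frame in the current run
    for lang, group in groupby(frame_langs):
        count = len(list(group))
        start = index * frame_seconds
        end = (index + count) * frame_seconds
        segments.append((lang, start, end))
        index += count
    return segments


# Hypothetical per-frame labels for a zh -> en -> zh utterance, 0.5 s per frame
frame_langs = ['zh', 'zh', 'zh', 'en', 'en', 'zh']
segments = collapse_language_frames(frame_langs, frame_seconds=0.5)
for lang, start, end in segments:
    print(f'{lang}: {start:.1f}s to {end:.1f}s')
```

Real frame-level predictions are likely to be noisy, so a smoothing step (for example, a minimum segment length) would probably be needed before collapsing.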