### Goal
The goal of this pipeline is to generate a text answer to a spoken (audio) question.

#### Steps
1. Model 1: Speech recognition model that takes an audio recording and generates the corresponding text
    
    *   Whisper: [Robust Speech Recognition via Large-Scale Weak Supervision](https://arxiv.org/abs/2212.04356) by Alec Radford et al. from OpenAI (2022)
    *   Wav2Vec2: [wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations](https://arxiv.org/abs/2006.11477) by Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, Michael Auli (2020)


2. Model 2: Text generation/question answering/chat model that generates an answer based on a given question

    * Llama 2: [Llama 2: Open Foundation and Fine-Tuned Chat Models](https://arxiv.org/abs/2307.09288) (2023), followed by Llama 3 (2024)
    * GPT-3/4, T5, BERT, etc.  

3. [LoRA](https://arxiv.org/abs/2106.09685): accelerates the fine-tuning of large models while consuming less memory
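
To make the memory claim for LoRA concrete, here is a minimal sketch of the parameter arithmetic. It assumes the standard formulation from the paper, in which a frozen weight matrix `W` of shape `d x k` receives a trainable low-rank update `B @ A`, with `B` of shape `d x r` and `A` of shape `r x k`. The function name and the rank `r=8` are illustrative choices, not part of this notebook.

```python
def lora_param_counts(d: int, k: int, r: int) -> tuple[int, int]:
    """Return (full, lora) trainable parameter counts for one d x k weight matrix."""
    full = d * k        # fine-tuning W directly
    lora = r * (d + k)  # training only B (d x r) and A (r x k)
    return full, lora

# Example: a 384 x 384 projection with an assumed rank of 8
full, lora = lora_param_counts(d=384, k=384, r=8)
print(full, lora, f"{lora / full:.1%}")  # 147456 6144 4.2%
```

The saving grows with the size of the matrix, since the LoRA count scales with `d + k` rather than `d * k`.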

In [2]:
!pip install -Uq datasets

In [3]:
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

#### Speech recognition: Whisper

In [4]:
# settings
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

asr_model_id = "openai/whisper-tiny.en" # tiny: openai/whisper-tiny , larger: openai/whisper-large-v3


In [5]:
asr_model = AutoModelForSpeechSeq2Seq.from_pretrained(
    asr_model_id, torch_dtype=torch_dtype, use_safetensors=True
) # load checkpoints

asr_model.to(device)


WhisperForConditionalGeneration(
  (model): WhisperModel(
    (encoder): WhisperEncoder(
      (conv1): Conv1d(80, 384, kernel_size=(3,), stride=(1,), padding=(1,))
      (conv2): Conv1d(384, 384, kernel_size=(3,), stride=(2,), padding=(1,))
      (embed_positions): Embedding(1500, 384)
      (layers): ModuleList(
        (0-3): 4 x WhisperEncoderLayer(
          (self_attn): WhisperSdpaAttention(
            (k_proj): Linear(in_features=384, out_features=384, bias=False)
            (v_proj): Linear(in_features=384, out_features=384, bias=True)
            (q_proj): Linear(in_features=384, out_features=384, bias=True)
            (out_proj): Linear(in_features=384, out_features=384, bias=True)
          )
          (self_attn_layer_norm): LayerNorm((384,), eps=1e-05, elementwise_affine=True)
          (activation_fn): GELUActivation()
          (fc1): Linear(in_features=384, out_features=1536, bias=True)
          (fc2): Linear(in_features=1536, out_features=384, bias=True)
          

In [6]:
processor = AutoProcessor.from_pretrained(asr_model_id) # converts raw audio into model input features and decodes the predicted token ids back into text
# "processor_class": "WhisperProcessor"
# wraps a Whisper feature extractor and a Whisper tokenizer into a single processor


In [8]:
# load the dataset
audio_dataset = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation", trust_remote_code=True)

In [9]:
audio_dataset[0]

{'file': '/Users/sanchitgandhi/.cache/huggingface/datasets/downloads/extracted/aad76e6f21870761d7a8b9b34436f6f8db846546c68cb2d9388598d7a164fa4b/dev_clean/1272/128104/1272-128104-0000.flac',
 'audio': {'path': '1272-128104-0000.flac',
  'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00042725, 0.00057983,
         0.0010376 ]),
  'sampling_rate': 16000},
 'text': 'MISTER QUILTER IS THE APOSTLE OF THE MIDDLE CLASSES AND WE ARE GLAD TO WELCOME HIS GOSPEL',
 'speaker_id': 1272,
 'chapter_id': 128104,
 'id': '1272-128104-0000'}

In [10]:
# prepare the model inputs: the processor converts the raw waveform into log-mel input features
audio_sample = audio_dataset[0]["audio"]
inputs = processor(audio_sample["array"], sampling_rate=audio_sample["sampling_rate"], return_tensors="pt")

# sampling_rate: passed explicitly so the feature extractor can check it against the rate the model expects
# return_tensors="pt": return PyTorch torch.Tensor objects

# returns a dict-like object: {input_features: tensor([[[...]]])}


In [11]:
# move the features to the model's device and dtype (float16 on GPU, float32 on CPU)
input_features = inputs.input_features.to(device=device, dtype=torch_dtype)
input_features.dtype

torch.float16
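
The shape of `input_features` follows from how Whisper frames audio. As a rough sketch, assuming the default Whisper feature extractor settings (16 kHz audio padded or truncated to 30 s, a hop of 160 samples, 80 mel bins), the arithmetic below reproduces the 1500 encoder positions visible in `embed_positions` in the model summary. The helper name is hypothetical.

```python
def whisper_feature_shape(sampling_rate=16_000, chunk_seconds=30,
                          hop_length=160, n_mels=80):
    """Shape of one padded log-mel input and the resulting encoder sequence length."""
    n_samples = sampling_rate * chunk_seconds  # samples in one 30 s window
    n_frames = n_samples // hop_length         # one mel frame every 10 ms
    encoder_positions = n_frames // 2          # conv2 in the encoder has stride 2
    return (1, n_mels, n_frames), encoder_positions

shape, positions = whisper_feature_shape()
print(shape, positions)  # (1, 80, 3000) 1500
```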

In [12]:
# now we want to transcribe audio to text so we need to pass the input features to the Whisper model
# the model will generate token ids from the audio
# Transcribes or translates log-mel input features to a sequence of auto-regressively generated token ids.

generated_ids = asr_model.generate(inputs=input_features)

The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


In [13]:
generated_ids

tensor([[ 1770,    13,  2264,   346,   353,   318,   262, 46329,   286,   262,
          3504,  6097,    11,   290,   356,   389,  9675,   284,  7062,   465,
         21443,    13]], device='cuda:0')
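
`generate` produces ids like these one token at a time, feeding each prediction back in as input. The loop below is a toy sketch of greedy auto-regressive decoding, the simplest strategy; whether `generate` actually decodes greedily depends on the generation config. The `next_token_scores` lookup table stands in for a real model and is entirely made up, as are the token ids. A real model also conditions on the whole prefix and the encoded audio, not just the last token.

```python
EOS = 0  # hypothetical end-of-sequence id

# Toy "model": maps the last token to scores over the next token.
next_token_scores = {
    1: {5: 0.9, 7: 0.1},
    5: {7: 0.6, EOS: 0.4},
    7: {EOS: 0.8, 5: 0.2},
}

def greedy_decode(start_id, max_new_tokens=10):
    """Append the highest-scoring next token until EOS or the token budget is reached."""
    ids = [start_id]
    for _ in range(max_new_tokens):
        scores = next_token_scores[ids[-1]]
        best = max(scores, key=scores.get)  # greedy choice
        ids.append(best)
        if best == EOS:
            break
    return ids

print(greedy_decode(1))  # [1, 5, 7, 0]
```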

In [14]:
# decode the generated token ids back into text
transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)
transcription[0]

' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel.'

#### Using HuggingFace Pipeline

In [15]:
# we can either use the processor directly for inference or use a pipeline
pipe = pipeline(
    "automatic-speech-recognition", # task - returns AutomaticSpeechRecognitionPipeline
    model=asr_model,
    feature_extractor=processor.feature_extractor, # converts the raw audio waveform into input features for the model
    tokenizer=processor.tokenizer, # decodes the predicted token ids into text
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

Device set to use cuda:0
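
With `chunk_length_s=30`, the pipeline transcribes long recordings in windows rather than in a single pass. The sketch below illustrates the general idea of overlapping windows. The 5-second overlap and the function name are assumptions for illustration; the pipeline's actual stride handling and its merging of overlapping text are more involved.

```python
def chunk_bounds(n_samples, sampling_rate=16_000, chunk_s=30, overlap_s=5):
    """Return (start, end) sample indices of overlapping windows covering the audio."""
    chunk = chunk_s * sampling_rate
    step = (chunk_s - overlap_s) * sampling_rate  # how far each window advances
    bounds = []
    start = 0
    while True:
        end = min(start + chunk, n_samples)
        bounds.append((start, end))
        if end == n_samples:  # the last window reaches the end of the audio
            break
        start += step
    return bounds

# A 70-second recording at 16 kHz is covered by three windows:
# 0-30 s, 25-55 s, and 50-70 s
for start, end in chunk_bounds(70 * 16_000):
    print(start / 16_000, end / 16_000)
```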


In [16]:
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")


In [17]:
dataset[0]

{'audio': {'path': '0d38672e0bbdbdc460af55b8bb84a15b2730db2819f2af64f9c777d4d586f2de',
  'array': array([0.00238037, 0.0020752 , 0.00198364, ..., 0.00024414, 0.00048828,
         0.0005188 ]),
  'sampling_rate': 16000}}

In [18]:
sample = dataset[0]["audio"]

In [19]:
result = pipe(sample)
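
Because the pipeline was created with `return_timestamps=True`, the result should carry a `chunks` list alongside `text`, where each entry pairs a `(start, end)` timestamp in seconds with its text. The snippet below sketches how such a structure can be formatted. The `example_result` dictionary is hand-written for illustration; its text and timestamps are not taken from an actual run.

```python
def format_chunks(result):
    """Render each timestamped chunk as '[start - end] text'."""
    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        # Assumption: the end timestamp may be None in some outputs, so guard for it.
        end_label = f"{end:.2f}" if end is not None else "?"
        lines.append(f"[{start:.2f} - {end_label}] {chunk['text'].strip()}")
    return lines

# Hand-written stand-in for the pipeline output; values are invented.
example_result = {
    "text": " First sentence. Second sentence.",
    "chunks": [
        {"timestamp": (0.0, 5.5), "text": " First sentence."},
        {"timestamp": (5.5, 9.0), "text": " Second sentence."},
    ],
}

for line in format_chunks(example_result):
    print(line)
```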

In [20]:
result["text"]

" Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. Nor is Mr. Quilter's manner less interesting than his matter. He tells us that at this festive season of the year, with Christmas and roast beef looming before us, similes drawn from eating and its results occur most readily to the mind. He has grave doubts whether Sir Frederick Layton's work is really Greek after all, and can discover in it but little of rocky Ithaca. Linnell's pictures are a sort of up-gards and atom paintings, and Mason's exquisite idles are as national as a jingo poem. Mr. Birkett Foster's landscapes smile at one much in the same way that Mr. Carker used to flash his teeth. And Mr. John Collier gives his sitter a cheerful slap in the back, before he says, like a shampooer and a Turkish bath, next man!"

#### Text generation/Question Answering: Llama2

In [21]:
from transformers import AutoTokenizer, AutoModelForCausalLM

In [22]:
# "NousResearch/Llama-2-7b-chat-hf"
tg_model_id = "Maykeye/TinyLLama-v0"

# get model and tokenizer
tg_model = AutoModelForCausalLM.from_pretrained(tg_model_id)
tokenizer = AutoTokenizer.from_pretrained(tg_model_id)


You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the `legacy` (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set `legacy=False`. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.


In [23]:
tokenizer

LlamaTokenizerFast(name_or_path='Maykeye/TinyLLama-v0', vocab_size=32000, model_max_length=2048, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>'}, clean_up_tokenization_spaces=False, added_tokens_decoder={
	0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
	2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=True, special=True),
}
)

In [24]:
tg_model

LlamaForCausalLM(
  (model): LlamaModel(
    (embed_tokens): Embedding(32000, 64, padding_idx=0)
    (layers): ModuleList(
      (0-7): 8 x LlamaDecoderLayer(
        (self_attn): LlamaAttention(
          (q_proj): Linear(in_features=64, out_features=64, bias=False)
          (k_proj): Linear(in_features=64, out_features=64, bias=False)
          (v_proj): Linear(in_features=64, out_features=64, bias=False)
          (o_proj): Linear(in_features=64, out_features=64, bias=False)
        )
        (mlp): LlamaMLP(
          (gate_proj): Linear(in_features=64, out_features=256, bias=False)
          (up_proj): Linear(in_features=64, out_features=256, bias=False)
          (down_proj): Linear(in_features=256, out_features=64, bias=False)
          (act_fn): SiLU()
        )
        (input_layernorm): LlamaRMSNorm((64,), eps=1e-06)
        (post_attention_layernorm): LlamaRMSNorm((64,), eps=1e-06)
      )
    )
    (norm): LlamaRMSNorm((64,), eps=1e-06)
    (rotary_emb): LlamaRotaryEmbeddi

In [25]:
# encode the transcription into token ids for the text generation model
tg_inputs = tokenizer.encode(transcription[0])

In [26]:
tg_inputs

[1,
 31822,
 1864,
 31843,
 2053,
 302,
 359,
 322,
 266,
 19041,
 296,
 287,
 266,
 4122,
 5573,
 31844,
 291,
 382,
 397,
 8559,
 289,
 5928,
 492,
 17275,
 31843]

In [27]:
# convert the list to a tensor and add a batch dimension: shape (1, sequence_length)
tg_inputs = torch.tensor(tg_inputs).reshape(1, len(tg_inputs))
tg_inputs

tensor([[    1, 31822,  1864, 31843,  2053,   302,   359,   322,   266, 19041,
           296,   287,   266,  4122,  5573, 31844,   291,   382,   397,  8559,
           289,  5928,   492, 17275, 31843]])

In [28]:
# run inference; tg_inputs is already a tensor, so it is passed directly
generated_tg_ids = tg_model.generate(inputs=tg_inputs, max_length=30)


In [29]:
generated_tg_ids

tensor([[    1, 31822,  1864, 31843,  2053,   302,   359,   322,   266, 19041,
           296,   287,   266,  4122,  5573, 31844,   291,   382,   397,  8559,
           289,  5928,   492, 17275, 31843,   636,  1329,   289,   492,  2017]])
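
The 30 ids above consist of the 25 prompt ids followed by 5 newly generated ones, because `max_length` caps the total sequence length, prompt included. A small sketch of that arithmetic, using a hypothetical helper name:

```python
def new_token_budget(prompt_len: int, max_length: int) -> int:
    """Number of tokens that can be added when max_length bounds prompt + output."""
    return max(max_length - prompt_len, 0)

print(new_token_budget(prompt_len=25, max_length=30))  # 5
```

To bound only the generated part regardless of prompt length, `generate` also accepts `max_new_tokens`, the same parameter passed to the ASR pipeline earlier.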

In [30]:
# get the predicted sentence from the generated token ids
pred = tokenizer.batch_decode(generated_tg_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
pred

# the model has generated 5 more tokens: max_length=30 counts the 25 prompt tokens as well

[' Mr. Quilter is the apostle of the middle classes, and we are glad to welcome his gospel. He says to his mom']
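
Putting the two stages together, the overall flow is: audio in, transcription, then a text continuation. The sketch below wires the stages as plain callables so the structure is visible without loading any model. `answer_spoken_question` and the two stand-in functions are hypothetical; in this notebook the real counterparts would be the ASR pipeline and the tokenizer plus `tg_model.generate`.

```python
from typing import Callable, Sequence

def answer_spoken_question(
    audio: Sequence[float],
    transcribe: Callable[[Sequence[float]], str],
    generate_text: Callable[[str], str],
) -> dict:
    """Run speech recognition, then feed the transcription to a text generator."""
    question = transcribe(audio).strip()
    answer = generate_text(question)
    return {"question": question, "answer": answer}

# Stand-ins so the sketch runs on its own; they do not call any model.
def fake_transcribe(audio):
    return " What is the capital of France?"

def fake_generate(prompt):
    return prompt + " [model continuation]"

out = answer_spoken_question([0.0, 0.1], fake_transcribe, fake_generate)
print(out["question"])
print(out["answer"])
```

Keeping the stages as interchangeable callables makes it straightforward to swap the tiny demo models for larger ones later.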