# Voice Interactive Systems - Introduction

**Enhancing Question-Answering System with TTS and STT**

Text-to-Speech (TTS) and Speech-to-Text (STT) technologies significantly enhance the usability of applications like chatbots and question-answering systems. TTS lets the system provide spoken responses, making interactions more natural and accessible, especially for users who prefer or require auditory communication. STT enables users to ask questions verbally, making the interaction hands-free and more convenient. Together, these technologies create a more engaging and efficient user experience, broadening the accessibility and practicality of question-answering applications in various contexts.

**Our Work**

In the following code blocks, we present an implementation of a voice-interactive question-answering system. The main components are:

*   **Whisper model as STT Technology**: Used for automatic speech recognition.

*   **Question-answering System**: Based on a pre-trained Large Language Model fine-tuned specifically for medical question-answering tasks. Given a medical question obtained through the STT technology, the system is capable of providing a sufficiently accurate and relevant answer.

*   **Tacotron2 and WaveGlow models as TTS Technologies**: Employed to synthesize natural-sounding speech from raw transcripts, here the answers produced by the question-answering system, without additional prosody information.

This combination of advanced STT and TTS technologies, along with a specialized LLM, creates a robust and efficient voice-interactive system capable of handling complex medical inquiries.
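
As a rough orientation before the actual implementation, the three components can be viewed as a simple chain of functions. The sketch below is purely illustrative: `run_pipeline`, `fake_transcribe`, `fake_answer`, and `fake_synthesize` are hypothetical names standing in for Whisper, the fine-tuned LLM, and Tacotron2 + WaveGlow, and they return placeholder values instead of calling any model.

```python
from typing import Callable, List, Tuple

def run_pipeline(
    audio_path: str,
    transcribe: Callable[[str], str],
    answer: Callable[[str], str],
    synthesize: Callable[[str], List[float]],
) -> Tuple[str, str, List[float]]:
    """Chain STT -> question answering -> TTS and return every intermediate result."""
    question = transcribe(audio_path)   # speech -> text
    reply = answer(question)            # text -> text
    waveform = synthesize(reply)        # text -> audio samples
    return question, reply, waveform

# Hypothetical placeholders standing in for the real models
def fake_transcribe(path: str) -> str:
    return "What is the PAP test?"

def fake_answer(question: str) -> str:
    return f"Placeholder answer to: {question}"

def fake_synthesize(text: str) -> List[float]:
    return [0.0] * len(text)  # one silent sample per character

question, reply, waveform = run_pipeline(
    "audio.wav", fake_transcribe, fake_answer, fake_synthesize
)
print(question)
print(reply)
print(len(waveform))
```

Passing the stages as callables keeps the flow independent of any particular model, which is roughly how the notebook later swaps between the standard and the fast Whisper pipeline.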



## Install and Import

Let's install and then import some useful libraries.

In [1]:
! pip install --upgrade pip
! pip install --upgrade git+https://github.com//huggingface/transformers.git accelerate datasets[audio]

Collecting pip
  Downloading pip-24.0-py3-none-any.whl (2.1 MB)
Installing collected packages: pip
  Attempting uninstall: pip
    Found existing installation: pip 23.1.2
    Uninstalling pip-23.1.2:
      Successfully uninstalled pip-23.1.2
Successfully installed pip-24.0
Collecting git+https://github.com//huggingface/transformers.git
  Cloning https://github.com//huggingface/transformers.git to /tmp/pip-req-build-ko3uaeaj
  Running command git clone --filter=blob:none --quiet https://github.com//huggingface/transformers.git /tmp/pip-req-build-ko3uaeaj
  Resolved https://github.com//huggingface/transformers.git to commit bdb9106f247fca48a71eb384be25dbbd29b065a8
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml)

In [2]:
!pip install --upgrade transformers optimum accelerate

Collecting optimum
  Downloading optimum-1.19.2-py3-none-any.whl.metadata (19 kB)
Collecting coloredlogs (from optimum)
  Downloading coloredlogs-15.0.1-py2.py3-none-any.whl.metadata (12 kB)
Collecting transformers
  Downloading transformers-4.40.2-py3-none-any.whl.metadata (137 kB)
Collecting humanfriendly>=9.1 (from coloredlogs->optimum)
  Downloading humanfriendly-10.0-py2.py3-none-any.whl.metadata (9.2 kB)
Downloading optimum-1.19.2-py3-none-any.whl (417 kB)
Downloading transformers-4.40.2-py3-none-any.whl (9.0 MB)
Downloading coloredlogs-15.0.1-py2.py3-none-any.whl (46 kB)

In [3]:
!pip install ffmpeg-python

Collecting ffmpeg-python
  Downloading ffmpeg_python-0.2.0-py3-none-any.whl.metadata (1.7 kB)
Downloading ffmpeg_python-0.2.0-py3-none-any.whl (25 kB)
Installing collected packages: ffmpeg-python
Successfully installed ffmpeg-python-0.2.0

In [4]:
!pip install numpy scipy librosa unidecode inflect

Collecting unidecode
  Downloading Unidecode-1.3.8-py3-none-any.whl.metadata (13 kB)
Downloading Unidecode-1.3.8-py3-none-any.whl (235 kB)
Installing collected packages: unidecode
Successfully installed unidecode-1.3.8

In [5]:
%%capture
# Installs Unsloth, Xformers (Flash Attention) and all other packages!
!pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
!pip install --no-deps xformers trl peft accelerate bitsandbytes

In [6]:
import torch

from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.utils import is_flash_attn_2_available

from datasets import load_dataset

from google.colab.output import eval_js
from google.colab import drive

from IPython.display import HTML, Audio

from base64 import b64decode

import numpy as np

import ffmpeg

import io
from scipy.io.wavfile import read as wav_read
from scipy.io.wavfile import write

import os

from unsloth import FastLanguageModel

import time

🦥 Unsloth: Will patch your computer to enable 2x faster free finetuning.


## Setup

In [7]:
MODEL_PATH = '/content/drive/MyDrive/Colab Notebooks/NLP/project/Models/Mistral'
drive.mount('/content/drive')
os.chdir(f'{MODEL_PATH}')
os.getcwd()

Mounted at /content/drive


'/content/drive/MyDrive/Colab Notebooks/NLP/project/Models/Mistral'

Set **FAST_WHISPER** to `True` to use the fast Whisper model. Note that a GPU is required for this mode.

In [8]:
FAST_WHISPER= True

# Speech-To-Text Technologies: Whisper Model

Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The use of such a large and diverse dataset leads to improved robustness to accents, background noise, and technical language. Moreover, it enables transcription in multiple languages, as well as translation from those languages into English. Whisper's performance varies widely depending on the language.

There are five model sizes, four with English-only versions, offering speed and accuracy tradeoffs.

To keep the user experience pleasant, the tiny, base, and small sizes of Whisper should be used. Although they have only 39M, 74M, and 244M parameters, respectively, they can provide correct transcriptions in a short amount of time without strict requirements on resources and hardware.
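
The size trade-off can be made concrete with a small lookup. This is a minimal sketch, assuming the parameter counts published for the original Whisper checkpoints; the helper `largest_checkpoint_within` and its `budget_m` argument are hypothetical and only illustrate how one might pick a checkpoint for a given parameter budget.

```python
# Parameter counts (in millions) reported for the original Whisper checkpoints
WHISPER_PARAMS_M = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

def largest_checkpoint_within(budget_m: int) -> str:
    """Return the Hugging Face id of the largest checkpoint within the parameter budget."""
    candidates = [name for name, params in WHISPER_PARAMS_M.items() if params <= budget_m]
    if not candidates:
        raise ValueError(f"No Whisper checkpoint has at most {budget_m}M parameters")
    best = max(candidates, key=WHISPER_PARAMS_M.get)
    return f"openai/whisper-{best}"

print(largest_checkpoint_within(250))
```

With a budget of 250M parameters the helper selects the small checkpoint, which is the one loaded in the next cell.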

Later, we'll explore a faster implementation that will allow us to use larger sizes without having to wait too long for an answer.


In [9]:
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "openai/whisper-small"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
      model_id, torch_dtype = torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

pipe = pipeline(
      "automatic-speech-recognition",
      model = model,
      tokenizer= processor.tokenizer,
      feature_extractor= processor.feature_extractor,
      max_new_tokens=128,
      chunk_length_s = 30,
      batch_size= 16,
      return_timestamps=True,
      torch_dtype=torch_dtype,
      device=device
  )

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


A faster implementation of Whisper has recently been released, leveraging Flash Attention 2. To use this implementation, a GPU must be available. Flash Attention is more memory-efficient and faster thanks to its optimized use of GPU memory.
Flash Attention 2 is an evolution of the original Flash Attention. It exploits the asymmetric GPU memory hierarchy to bring significant memory savings (linear instead of quadratic in the sequence length) and runtime speedup (2-4× compared to optimized baselines) with no approximation.
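
To give a feel for "linear instead of quadratic", the back-of-the-envelope sketch below compares the memory needed to store a full attention score matrix with the memory for per-row softmax statistics, which is the kind of bookkeeping that remains when the full matrix is never materialized. The function names, the sequence lengths, and the choice of 20 attention heads are illustrative assumptions; this is an estimate of scaling behavior, not a measurement of Flash Attention's actual footprint.

```python
def score_matrix_bytes(seq_len: int, num_heads: int, bytes_per_value: int = 2) -> int:
    """Memory to store the full attention score matrix of one layer (quadratic in seq_len)."""
    return num_heads * seq_len * seq_len * bytes_per_value

def softmax_stats_bytes(seq_len: int, num_heads: int, bytes_per_value: int = 4) -> int:
    """Memory for one softmax statistic per row, kept instead of the matrix (linear in seq_len)."""
    return num_heads * seq_len * bytes_per_value

# Doubling the sequence length quadruples the first figure but only doubles the second
for seq_len in (1500, 3000, 6000):
    full = score_matrix_bytes(seq_len, num_heads=20)
    stats = softmax_stats_bytes(seq_len, num_heads=20)
    print(f"seq_len={seq_len}: full scores {full / 1e6:.1f} MB, row statistics {stats / 1e6:.2f} MB")
```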

This faster implementation lets us use larger checkpoints, such as large-v3, which has 1550M parameters.
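
The GPU requirement and the choice of attention backend can be summarized as a small decision function. The sketch below is an assumption-laden illustration: `select_asr_settings` is a hypothetical helper, and in the notebook its boolean inputs would come from `FAST_WHISPER`, `torch.cuda.is_available()`, and `is_flash_attn_2_available()`.

```python
def select_asr_settings(fast_requested: bool, cuda_available: bool, flash_attn_available: bool) -> dict:
    """Pick checkpoint, device, precision and attention backend from availability flags."""
    if fast_requested and not cuda_available:
        raise RuntimeError("The fast Whisper mode requires a GPU")
    if fast_requested:
        # Fall back to PyTorch's scaled-dot-product attention when Flash Attention 2 is missing
        attn = "flash_attention_2" if flash_attn_available else "sdpa"
        return {"model": "openai/whisper-large-v3", "device": "cuda:0",
                "dtype": "float16", "attn_implementation": attn}
    # Standard mode: small checkpoint, half precision only when a GPU is present
    device = "cuda:0" if cuda_available else "cpu"
    dtype = "float16" if cuda_available else "float32"
    return {"model": "openai/whisper-small", "device": device,
            "dtype": dtype, "attn_implementation": None}

print(select_asr_settings(fast_requested=True, cuda_available=True, flash_attn_available=False))
```

Failing early when no GPU is present gives a clearer error than letting the pipeline construction fail on `device="cuda:0"`.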

In [10]:
pipe_fast = pipeline(
      "automatic-speech-recognition",
      model="openai/whisper-large-v3", # select checkpoint from https://huggingface.co/openai/whisper-large-v3#model-details
      torch_dtype=torch.float16,
      device="cuda:0", # or mps for Mac devices
      model_kwargs={"attn_implementation": "flash_attention_2"} if is_flash_attn_2_available() else {"attn_implementation": "sdpa"},
  )

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


# Text-To-Speech Technologies: Tacotron2 and WaveGlow

Tacotron 2 and WaveGlow together form a text-to-speech system that synthesizes natural-sounding speech from raw transcripts without additional prosody information. Tacotron 2 uses an encoder-decoder architecture to produce mel spectrograms from input text, while WaveGlow is a flow-based model that generates speech from these spectrograms. In other words, WaveGlow acts as a vocoder that converts the spectrogram into an audio signal.

Models like Bark offer an alternative approach to synthesizing natural speech: Bark skips the spectrogram stage and uses transformers to go directly from text to an audio waveform.

Tacotron remains the preferred choice in many cases. Because it predicts spectrograms explicitly, it allows for precise control over various aspects of the synthesized speech, including intonation, rhythm, and overall naturalness.
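
The relationship between the mel spectrogram and the final waveform can be sketched with simple arithmetic. The constants below are taken from values visible elsewhere in this notebook (80-channel layers in the printed models, a stride of 256 in WaveGlow's upsampling layer, and playback at 22050 Hz); the helper names are hypothetical, and the result should be read as an approximation of the output length rather than an exact guarantee.

```python
MEL_CHANNELS = 80      # mel bins per spectrogram frame
HOP_LENGTH = 256       # audio samples generated per mel frame (upsampling stride)
SAMPLE_RATE = 22050    # playback sampling rate in Hz

def waveform_length(num_frames: int) -> int:
    """Approximate number of audio samples produced from `num_frames` mel frames."""
    return num_frames * HOP_LENGTH

def duration_seconds(num_frames: int) -> float:
    """Approximate duration of the synthesized audio in seconds."""
    return waveform_length(num_frames) / SAMPLE_RATE

# A spectrogram of shape (MEL_CHANNELS, 500) corresponds to roughly 5.8 seconds of audio
print(waveform_length(500), round(duration_seconds(500), 2))
```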


Load the Tacotron 2 model from Torch Hub:

In [11]:
tacotron2 = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tacotron2', model_math='fp16')
tacotron2 = tacotron2.to('cuda')
tacotron2.eval()

Downloading: "https://github.com/NVIDIA/DeepLearningExamples/zipball/torchhub" to /root/.cache/torch/hub/torchhub.zip
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/tacotron2_pyt_ckpt_amp/versions/19.09.0/files/nvidia_tacotron2pyt_fp16_20190427


Tacotron2(
  (embedding): Embedding(148, 512)
  (encoder): Encoder(
    (convolutions): ModuleList(
      (0-2): 3 x Sequential(
        (0): ConvNorm(
          (conv): Conv1d(512, 512, kernel_size=(5,), stride=(1,), padding=(2,))
        )
        (1): BatchNorm1d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      )
    )
    (lstm): LSTM(512, 256, batch_first=True, bidirectional=True)
  )
  (decoder): Decoder(
    (prenet): Prenet(
      (layers): ModuleList(
        (0): LinearNorm(
          (linear_layer): Linear(in_features=80, out_features=256, bias=False)
        )
        (1): LinearNorm(
          (linear_layer): Linear(in_features=256, out_features=256, bias=False)
        )
      )
    )
    (attention_rnn): LSTMCell(768, 1024)
    (attention_layer): Attention(
      (query_layer): LinearNorm(
        (linear_layer): Linear(in_features=1024, out_features=128, bias=False)
      )
      (memory_layer): LinearNorm(
        (linear_layer): Linear(in_fea

Load the WaveGlow model as follows:

In [12]:
waveglow = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_waveglow', model_math='fp16')
waveglow = waveglow.remove_weightnorm(waveglow)
waveglow = waveglow.to('cuda')
waveglow.eval()

Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
Downloading checkpoint from https://api.ngc.nvidia.com/v2/models/nvidia/waveglow_ckpt_amp/versions/19.09.0/files/nvidia_waveglowpyt_fp16_20190427


WaveGlow(
  (upsample): ConvTranspose1d(80, 80, kernel_size=(1024,), stride=(256,))
  (WN): ModuleList(
    (0-3): 4 x WN(
      (in_layers): ModuleList(
        (0): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(1,))
        (1): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(2,), dilation=(2,))
        (2): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(4,), dilation=(4,))
        (3): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(8,), dilation=(8,))
        (4): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(16,), dilation=(16,))
        (5): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(32,), dilation=(32,))
        (6): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(64,), dilation=(64,))
        (7): Conv1d(512, 1024, kernel_size=(3,), stride=(1,), padding=(128,), dilation=(128,))
      )
      (res_skip_layers): ModuleList(
        (0-6): 7 x Conv1d(512, 1024, kernel_size=(1,), stride=(1,))
        (7

# Question-Answering model

The following lines of code load a pre-trained LLM and its tokenizer with the specified parameters and enable Unsloth's native 2x faster inference.

In [13]:
max_seq_length = 4096   # Maximum sequence length (in tokens) used when loading the model
dtype = None            # None for auto detection. Float16 for Tesla T4, V100, Bfloat16 for Ampere+
load_in_4bit = True     # Use 4bit quantization to reduce memory usage. Can be False.
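
To see why 4-bit quantization matters on a GPU with limited memory, the rough estimate below computes the memory taken by the weights alone at different precisions. It is only a sketch: `weight_memory_gb` is a hypothetical helper, the 7-billion-parameter figure is an assumed round number, and the estimate ignores quantization metadata, activations, and the key-value cache.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the model weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

NUM_PARAMS = 7_000_000_000  # assumed round figure for a 7B-parameter model

for bits in (16, 8, 4):
    print(f"{bits}-bit weights: ~{weight_memory_gb(NUM_PARAMS, bits):.1f} GB")
```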

In [14]:
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "lora_model", # Directory of the fine-tuned model saved after training
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
)
FastLanguageModel.for_inference(model) # Enable native 2x faster inference

==((====))==  Unsloth: Fast Mistral patching release 2024.5
   \\   /|    GPU: Tesla T4. Max memory: 14.748 GB. Platform = Linux.
O^O/ \_/ \    Pytorch: 2.3.0+cu121. CUDA = 7.5. CUDA Toolkit = 12.1.
\        /    Bfloat16 = FALSE. Xformers = 0.0.26.post1. FA = False.
 "-____-"     Free Apache license: http://github.com/unslothai/unsloth


You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Unsloth 2024.5 patched 32 layers with 32 QKV layers, 32 O layers and 32 MLP layers.


In [15]:
alpaca_prompt = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    instructions = examples["instruction"]
    inputs       = examples["input"]
    outputs      = examples["output"]
    texts = []
    # `input_text` avoids shadowing Python's built-in `input`
    for instruction, input_text, output in zip(instructions, inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = alpaca_prompt.format(instruction, input_text, output) + EOS_TOKEN
        texts.append(text)
    return { "text" : texts, }
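
The difference between the training format and the inference format of the Alpaca-style prompt can be shown without loading any model. In this minimal sketch the template is copied from the cell above, while `EOS`, `build_training_text`, and `build_inference_prompt` are hypothetical names; in particular `"</s>"` is only an assumed end-of-sequence marker, since the notebook reads the real one from `tokenizer.eos_token`.

```python
ALPACA_PROMPT = """Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{}

### Input:
{}

### Response:
{}"""

EOS = "</s>"  # assumed marker for illustration only

def build_training_text(instruction: str, question: str, answer: str) -> str:
    """Training example: the response slot is filled and the text is closed with EOS."""
    return ALPACA_PROMPT.format(instruction, question, answer) + EOS

def build_inference_prompt(instruction: str, question: str) -> str:
    """Inference prompt: the response slot is left empty for the model to complete."""
    return ALPACA_PROMPT.format(instruction, question, "")

train = build_training_text("Answer this question truthfully", "What is the PAP test?", "<reference answer>")
infer = build_inference_prompt("Answer this question truthfully", "What is the PAP test?")
print(infer)
```

Because the inference prompt is a prefix of the training text, the model is asked to continue exactly where the reference answer would have started.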


The following block of code uses the fine-tuned model to generate an answer for a given question:

In [16]:
def generate_answer(question):
  # Build the prompt with the same Alpaca template defined above
  FastLanguageModel.for_inference(model) # Enable native 2x faster inference
  inputs = tokenizer(
  [
      alpaca_prompt.format(
          "Answer this question truthfully", # instruction
          question, # input
          "", # output - leave this blank for generation!
      )
  ], return_tensors = "pt").to("cuda")

  outputs = model.generate(**inputs, max_new_tokens = 128, use_cache = True, pad_token_id=tokenizer.eos_token_id)
  # Decode only the newly generated tokens (everything after the prompt) and
  # drop special tokens such as EOS instead of cutting a fixed number of characters
  prompt_length = inputs['input_ids'].shape[1]
  predicted_answer = tokenizer.batch_decode(outputs[:, prompt_length:], skip_special_tokens=True)[0]
  return predicted_answer.strip()

# Audio Recording

Let's define a function to record our own voice:

In [17]:
AUDIO_HTML = """
<script>
var my_div = document.createElement("DIV");
var my_p = document.createElement("P");
var my_btn = document.createElement("BUTTON");
var t = document.createTextNode("Press to start recording");

my_btn.appendChild(t);
//my_p.appendChild(my_btn);
my_div.appendChild(my_btn);
document.body.appendChild(my_div);

var base64data = 0;
var reader;
var recorder, gumStream;
var recordButton = my_btn;

var handleSuccess = function(stream) {
  gumStream = stream;
  var options = {
    //bitsPerSecond: 8000, //chrome seems to ignore, always 48k
    mimeType : 'audio/webm;codecs=opus'
    //mimeType : 'audio/webm;codecs=pcm'
  };
  //recorder = new MediaRecorder(stream, options);
  recorder = new MediaRecorder(stream);
  recorder.ondataavailable = function(e) {
    var url = URL.createObjectURL(e.data);
    var preview = document.createElement('audio');
    preview.controls = true;
    preview.src = url;
    document.body.appendChild(preview);

    reader = new FileReader();
    reader.readAsDataURL(e.data);
    reader.onloadend = function() {
      base64data = reader.result;
      //console.log("Inside FileReader:" + base64data);
    }
  };
  recorder.start();
  };

recordButton.innerText = "Recording... press to stop";

navigator.mediaDevices.getUserMedia({audio: true}).then(handleSuccess);


function toggleRecording() {
  if (recorder && recorder.state == "recording") {
      recorder.stop();
      gumStream.getAudioTracks()[0].stop();
      recordButton.innerText = "Saving the recording... pls wait!"
  }
}

// https://stackoverflow.com/a/951057
function sleep(ms) {
  return new Promise(resolve => setTimeout(resolve, ms));
}

var data = new Promise(resolve=>{
//recordButton.addEventListener("click", toggleRecording);
recordButton.onclick = ()=>{
toggleRecording()

sleep(2000).then(() => {
  // wait 2000ms for the data to be available...
  // ideally this should use something like await...
  //console.log("Inside data:" + base64data)
  resolve(base64data.toString())

});

}
});

</script>
"""

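def get_audio():
  # Show the recording widget and wait for the browser to return the audio as a base64 data URL
  display(HTML(AUDIO_HTML))
  data = eval_js("data")
  binary = b64decode(data.split(',')[1])

  # Convert the browser recording to WAV with ffmpeg, streaming through stdin/stdout
  process = (ffmpeg
    .input('pipe:0')
    .output('pipe:1', format='wav')
    .run_async(pipe_stdin=True, pipe_stdout=True, pipe_stderr=True, quiet=True, overwrite_output=True)
  )
  output, err = process.communicate(input=binary)

  # Save the converted audio so the Whisper pipeline can read it from disk
  with open('audio.wav', 'wb') as f:
    f.write(output)

  return 'audio.wav'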
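
The browser hands the recording back as a data URL, which `get_audio` splits at the comma before decoding. The sketch below isolates that step so it can be tried without a microphone. `decode_data_url` is a hypothetical helper, and the sample payload is made-up bytes rather than real audio; unlike the notebook code, it raises an explicit error when the string is not a data URL.

```python
from base64 import b64decode, b64encode

def decode_data_url(data_url: str) -> bytes:
    """Extract and decode the base64 payload of a 'data:<mime>;base64,<payload>' string."""
    header, _, payload = data_url.partition(',')
    if not header.startswith('data:') or not payload:
        raise ValueError("Expected a data URL of the form 'data:<mime>;base64,<payload>'")
    return b64decode(payload)

# Build a sample data URL from placeholder bytes and decode it again
sample_bytes = b"not real audio"
data_url = "data:audio/webm;codecs=opus;base64," + b64encode(sample_bytes).decode("ascii")
print(decode_data_url(data_url))
```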

# Interactive Question-Answering Function

This function processes the user's question from audio input, generates a textual response, and converts it into synthesized speech.

In [18]:
def voice_interactive_QnA(pipe, isFast):
  # Record the question and transcribe it with the selected Whisper pipeline
  audio_path = get_audio()
  if isFast:
    question = pipe(audio_path, chunk_length_s=30, batch_size=24, return_timestamps=True)
  else:
    question = pipe(audio_path, generate_kwargs={"language": "english"})

  text = generate_answer(question["text"])
  # Split text into sentences
  sentences = text.split('. ')

  # Generate audio for each sentence and concatenate
  audio_clips = []
  rate = 22050  # sampling rate of the synthesized audio
  utils = torch.hub.load('NVIDIA/DeepLearningExamples:torchhub', 'nvidia_tts_utils')
  for sentence in sentences:
      # Ensure each sentence ends with a period for proper pronunciation
      if not sentence.endswith('.'):
          sentence += '.'

      # Convert the sentence into the input tensors Tacotron 2 expects
      sequences, lengths = utils.prepare_input_sequence([sentence])

      # Tacotron 2 predicts the mel spectrogram; WaveGlow turns it into a waveform
      with torch.no_grad():
        mel, _, _ = tacotron2.infer(sequences, lengths)
        audio = waveglow.infer(mel)
      audio_clips.append(audio[0].data.cpu().numpy())

  # Concatenate all audio arrays
  final_audio_array = np.concatenate(audio_clips)

  # If you have soundfile installed, you can also save the audio like this:
  #import soundfile as sf
  #sf.write("final_output.wav", final_audio_array, rate)

  return final_audio_array, question["text"]
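
The sentence-splitting step used before synthesis can be tried in isolation. The sketch below mirrors the logic of the function above with a hypothetical helper name, `split_into_sentences`, and no model calls. Note that splitting on `'. '` is a simple heuristic: abbreviations such as "Dr. Smith" are also split.

```python
def split_into_sentences(text: str) -> list:
    """Split on '. ' and make sure every piece ends with a period."""
    sentences = []
    for sentence in text.split('. '):
        if not sentence.endswith('.'):
            sentence += '.'
        sentences.append(sentence)
    return sentences

print(split_into_sentences("Rest is advised. Drink water. See a doctor if symptoms persist."))
# → ['Rest is advised.', 'Drink water.', 'See a doctor if symptoms persist.']
```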

# Welcome to MedCortana

This is a brief guide on how to use it:


1.  **Start Recording**: Once the code cell is executed, a button will appear, signaling that the recording has started. Speak your question clearly into the microphone.

2.  **Stop Recording**: Once you've finished your question, press the button again to stop the recording.

3.  **Processing**: The system will take 30 to 60 seconds to generate a response, depending on the length of the question.

4.  **Playback**: Your recording will appear, and you can listen to the generated response.

5.  **Continue or Quit**: After the playback, the system will ask if you want to continue with further questions or quit. Press Enter to continue or type 'exit' to quit.

That being said, enjoy MedCortana!



In [19]:
# Use the fast pipeline only when FAST_WHISPER is enabled, otherwise the standard one
asr_pipe = pipe_fast if FAST_WHISPER else pipe
while True:
  answer, question = voice_interactive_QnA(asr_pipe, FAST_WHISPER)
  print(f'User: {question}')
  print("MedCortana:")
  display(Audio(data=answer, rate=22050))
  time.sleep(3)
  user_input = input("Press Enter to continue or type 'exit' to quit: ")
  print("--------------------------------------------------------------------------------------------------------------------------------------------")
  # Check if the user wants to continue or exit
  if user_input.lower() == 'exit':
      break

Due to a bug fix in https://github.com/huggingface/transformers/pull/28687 transcription using a multilingual Whisper will default to language detection followed by transcription instead of translation to English.This might be a breaking change for your use case. If you want to instead always translate your audio to English, make sure to pass `language='en'`.
Using cache found in /root/.cache/torch/hub/NVIDIA_DeepLearningExamples_torchhub
  return s in _symbol_to_id and s is not '_' and s is not '~'


User:  What is the PAP test?
MedCortana:


Press Enter to continue or type 'exit' to quit: exit
--------------------------------------------------------------------------------------------------------------------------------------------
