# Traditional Voice Interaction Pipeline Implementation

## Short Description
This code demonstrates a sequential voice assistant pipeline with three decoupled modules: Automatic Speech Recognition (ASR), Large Language Model (LLM) processing, and Text-to-Speech (TTS) synthesis. It illustrates the modular architecture and sequential execution flow of traditional voice assistants, highlighting both benefits and limitations discussed in lecture.

## Key Libraries Used
- `speech_recognition`: For capturing audio and performing speech-to-text conversion
- `transformers`: Provides access to the GPT-2 language model for text generation
- `gtts`: Google Text-to-Speech API interface for speech synthesis
- `playsound`: For audio playback of generated responses

## Code Logic and Flow

### High-Level Overview
The script implements a strictly sequential pipeline where:
1. Audio input is captured through the microphone and converted to text
2. The transcribed text is processed by a language model to generate a response
3. The text response is converted to speech and played through speakers
The execution flows unidirectionally with no feedback mechanisms between stages.

### Visual Flowchart
```mermaid
flowchart TD
    A[Start] --> B[Capture Audio]
    B --> C{ASR Success?}
    C -->|Yes| D[Process Text with LLM]
    C -->|No| E[Generate Error Message]
    D --> F[Convert Text to Speech]
    E --> F
    F --> G[Play Audio]
    G --> H[Cleanup Resources]
    H --> I[End]
```
## Step-by-Step Code Breakdown

#### Step 1: Automatic Speech Recognition (ASR)

    Captures live audio input through the microphone

    Performs ambient noise reduction for better accuracy

    Sends audio to Google's speech recognition API

    Handles two primary error cases:

        Unrecognizable audio (returns None)

        Service connection errors (returns None)

    Returns transcribed text or error indicator

#### Step 2: Large Language Model (LLM) Processing

    Validates ASR output before processing

    Constructs conversational prompt incorporating user input

    Uses GPT-2 model to generate contextual response

    Extracts first relevant line from model output

    Provides fallback message for invalid inputs

    Demonstrates text-only processing (no audio context)

#### Step 3: Text-to-Speech (TTS) Synthesis

    Converts text response to spoken audio

    Uses Google's TTS service for speech generation

    Saves generated audio to temporary file

    Plays audio through system speakers

    Cleans up temporary files post-playback

    Demonstrates output-only conversion

#### Pipeline Execution

    Coordinates strict sequential execution of stages

    ASR must complete before LLM processing starts

    LLM must finish before TTS begins

    No error recovery between stages

    Shows linear waterfall execution pattern

### Connecting to the Lecture

This implementation concretely demonstrates the traditional pipeline architecture discussed in lecture:

    Modular Separation: Clear boundaries between ASR, LLM, and TTS components

    Sequential Processing: Strict stage-by-stage execution creates latency accumulation

    Error Propagation: ASR failures directly impact downstream components

    Context Loss: Prosody and emotional cues are stripped during ASR conversion

    Temporary Artifacts: File-based handling between components increases latency

The code intentionally shows architectural constraints that modern end-to-end SpeechLMs address through integrated audio-to-audio processing, preserving paralinguistic features and enabling real-time feedback.    

In [5]:
# Install required packages - Use Anaconda shell to install them
!pip install speechrecognition pyttsx3 openai gTTS --quiet
# !pip install playsound==1.2.2
!pip install pygame
!pip install openai python-dotenv




In [4]:
!pip install SpeechRecognition



In [6]:
!pip install transformers



In [7]:
!pip install torch

Collecting torch
  Downloading torch-2.10.0-cp312-cp312-manylinux_2_28_x86_64.whl.metadata (31 kB)
Collecting sympy>=1.13.3 (from torch)
  Using cached sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting networkx>=2.5.1 (from torch)
  Using cached networkx-3.6.1-py3-none-any.whl.metadata (6.8 kB)
Collecting jinja2 (from torch)
  Using cached jinja2-3.1.6-py3-none-any.whl.metadata (2.9 kB)
Collecting cuda-bindings==12.9.4 (from torch)
  Downloading cuda_bindings-12.9.4-cp312-cp312-manylinux_2_24_x86_64.manylinux_2_28_x86_64.whl.metadata (2.6 kB)
Collecting nvidia-cuda-nvrtc-cu12==12.8.93 (from torch)
  Using cached nvidia_cuda_nvrtc_cu12-12.8.93-py3-none-manylinux2010_x86_64.manylinux_2_12_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cuda-runtime-cu12==12.8.90 (from torch)
  Using cached nvidia_cuda_runtime_cu12-12.8.90-py3-none-manylinux2014_x86_64.manylinux_2_17_x86_64.whl.metadata (1.7 kB)
Collecting nvidia-cuda-cupti-cu12==12.8.90 (from torch)
  Using cached nvidia_cuda_cupt

In [9]:
!pip install pyaudio

Collecting pyaudio
  Using cached PyAudio-0.2.14.tar.gz (47 kB)
  Installing build dependencies ... [?25ldone
[?25h  Getting requirements to build wheel ... [?25ldone
[?25h  Preparing metadata (pyproject.toml) ... [?25ldone
[?25hBuilding wheels for collected packages: pyaudio
  Building wheel for pyaudio (pyproject.toml) ... [?25ldone
[?25h  Created wheel for pyaudio: filename=pyaudio-0.2.14-cp312-cp312-linux_x86_64.whl size=27687 sha256=3d7710efdf8accdc8cac77f4bd3a5f04c5ebff71f1e4f1dc72c8b4fb4e5a7ccb
  Stored in directory: /home/hubenschmidt/.cache/pip/wheels/68/c7/33/c6a6b210cb5819ec5c219928c794a447742a7d86d21c0b92e6
Successfully built pyaudio
Installing collected packages: pyaudio
Successfully installed pyaudio-0.2.14


In [16]:
# Importing Libraries
import speech_recognition as sr
from transformers import pipeline
from gtts import gTTS
import os
import pyaudio
# from playsound import playsound
import pygame
import torch
print(torch.cuda.is_available())
print(torch.version.hip)
# Initialize components
recognizer = sr.Recognizer()
device = 0 if torch.cuda.is_available() else -1  # Use GPU if available
llm = pipeline("text-generation", model="gpt2")  # Simple LLM for demo
output_file = "response.mp3"


#### Step 1: Automatic Speech Recognition (ASR)
#Defining function for ASR:

def asr_process():
    """Step 1: Automatic Speech Recognition (ASR)"""
    with sr.Microphone() as source:
        print("Listening for your query...")
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio)
        print(f"Transcribed: {text}")
        return text
    except sr.UnknownValueError:
        print("Could not understand audio.")
        return None
    except sr.RequestError:
        print("ASR service error.")
        return None

#### Step 2: Large Language Model (LLM)
#Defining function for LLM:
def llm_process(text):
    """Step 2: Large Language Model (LLM)"""
    if text:
        prompt = f"User said: {text}. Respond appropriately."
        response = llm(prompt, max_length=50, num_return_sequences=1)[0]["generated_text"]
        response_text = response.split("\n")[0]  # Extract clean response
        print(f"LLM Response: {response_text}")
        return response_text
    return "Sorry, I didn't understand."

#### Step 3: Text-to-Speech (TTS)
#Defining function for TTS:
def tts_process(text):
    """Step 3: Text-to-Speech (TTS)"""
    tts = gTTS(text=text, lang="en")
    tts.save(output_file)
    print("Playing response...")
    pygame.mixer.init()
    pygame.mixer.music.load(output_file)
    pygame.mixer.music.play()
    while pygame.mixer.music.get_busy():
        continue
    os.remove(output_file)  # Clean up

#### Step 4: Run the full ASR + LLM + TTS pipeline
#Defining function for full pipeline:
def main_pipeline():
    """Run the full ASR + LLM + TTS pipeline"""
    # Step 1: ASR
    transcribed_text = asr_process()
    # Step 2: LLM
    response_text = llm_process(transcribed_text)
    # Step 3: TTS
    tts_process(response_text)

if __name__ == "__main__":
    if torch.cuda.is_available():
        print("Using ROCm for LLM processing.")
    else:
        print("ROCm not available. Using CPU for LLM processing.")
    main_pipeline()

True
6.2.41133-dd7f95766


Loading weights: 100%|██████████| 148/148 [00:00<00:00, 2404.59it/s, Materializing param=transformer.wte.weight]             
[1mGPT2LMHeadModel LOAD REPORT[0m from: gpt2
Key                  | Status     |  | 
---------------------+------------+--+-
h.{0...11}.attn.bias | UNEXPECTED |  | 

[3mNotes:
- UNEXPECTED[3m	:can be ignored when loading from different task/architecture; not ok if you expect identical arch.[0m


Using ROCm for LLM processing.


ALSA lib pcm_dsnoop.c:567:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1000:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2721:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2721:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2721:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
ALSA lib pcm

Listening for your query...


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=50) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


Transcribed: can you hear me
LLM Response: User said: can you hear me. Respond appropriately.
Playing response...


KeyboardInterrupt: 

### Extension of Previous Example - Using superior gpt-4o-mini model instead of GPT2

In [13]:
!pip install openai python-dotenv



In [14]:
import speech_recognition as sr
import openai
from gtts import gTTS
import os
import pyaudio
import pygame
from dotenv import load_dotenv
import torch
# Load environment variables from .env file
load_dotenv()
openai_api_key = os.getenv("OPENAI_API_KEY")

# Initialize components
recognizer = sr.Recognizer()
output_file = "response.mp3"

def asr_process():
    """Step 1: Automatic Speech Recognition (ASR)"""
    with sr.Microphone() as source:
        print("Listening for your query...")
        recognizer.adjust_for_ambient_noise(source)
        audio = recognizer.listen(source)
    try:
        text = recognizer.recognize_google(audio)
        print(f"Transcribed: {text}")
        return text
    except sr.UnknownValueError:
        print("Could not understand audio.")
        return None
    except sr.RequestError:
        print("ASR service error.")
        return None

def llm_process(text):
    """Step 2: Large Language Model (LLM) using OpenAI GPT-4o-mini"""
    if text:
        prompt = f"User said: {text}. Respond appropriately."
        try:
            client = openai.OpenAI(api_key=openai_api_key)
            response = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[
                    {"role": "system", "content": "You are a helpful assistant."},
                    {"role": "user", "content": prompt}
                ],
                max_tokens=50,
                n=1,
                stop=None,
                temperature=0.7,
            )
            response_text = response.choices[0].message.content.strip()
            print(f"LLM Response: {response_text}")
            return response_text
        except Exception as e:
            print(f"LLM API error: {e}")
            return "Sorry, I couldn't process your request."
    return "Sorry, I didn't understand."

def tts_process(text):
    """Step 3: Text-to-Speech (TTS)"""
    tts = gTTS(text=text, lang="en")
    tts.save(output_file)
    print("Playing response...")
    pygame.mixer.init()
    pygame.mixer.music.load(output_file)
    pygame.mixer.music.play()
    os.remove(output_file)  # Clean up

def main_pipeline():
    """Run the full ASR + LLM + TTS pipeline"""
    # Step 1: ASR
    transcribed_text = asr_process()
    # Step 2: LLM
    response_text = llm_process(transcribed_text)
    # Step 3: TTS
    tts_process(response_text)

if __name__ == "__main__": 
        main_pipeline()

ALSA lib pcm_dsnoop.c:567:(snd_pcm_dsnoop_open) unable to open slave
ALSA lib pcm_dmix.c:1000:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2721:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2721:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2721:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
Cannot connect to server socket err = No such file or directory
Cannot connect to server request channel
jack server is not running or cannot be started
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
JackShmReadWritePtr::~JackShmReadWritePtr - Init not done for -1, skipping unlock
ALSA lib pcm

Listening for your query...
Transcribed: how are you there
LLM Response: I'm doing well, thank you! How about you?
Playing response...
