# 🎙️ Cloud-Based Voice Assistant: STT-LLM-TTS Pipeline

**Author:** Julián Machuca Ramírez
**Date:** December 2025

## Project Overview
This notebook implements an **End-to-End Voice Interaction System** that runs entirely in the cloud on Google Colab. The pipeline chains three AI components to simulate one turn of a spoken conversation:

1.  **ASR (Automatic Speech Recognition):** `OpenAI Whisper` for high-fidelity audio transcription.
2.  **LLM (Large Language Model):** `Llama 3.3 (70B)` via **Groq API** for low-latency inference and natural language understanding.
3.  **TTS (Text-to-Speech):** `gTTS` for audio synthesis and response delivery.

## Architecture
`[User Audio Input] -> [Whisper Base Model] -> [Text Prompt] -> [Llama 3.3 70B via Groq] -> [AI Response] -> [gTTS] -> [Audio Output]`
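
The flow above can be read as a composition of three stages. The following is a minimal sketch of that shape only: `make_pipeline` and the `fake_*` stage functions are hypothetical placeholders introduced for illustration, not the Whisper, Groq, or gTTS calls used later in this notebook.

```python
from typing import Callable, Tuple


def make_pipeline(
    asr: Callable[[str], str],
    llm: Callable[[str], str],
    tts: Callable[[str], str],
) -> Callable[[str], Tuple[str, str]]:
    """Chain ASR -> LLM -> TTS into one callable (illustrative helper)."""

    def run(audio_path: str) -> Tuple[str, str]:
        transcript = asr(audio_path)   # speech -> text
        reply = llm(transcript)        # text -> text
        audio_out = tts(reply)         # text -> path of synthesized audio
        return reply, audio_out

    return run


# Hypothetical stand-ins so the sketch runs without any model or API key.
def fake_asr(path: str) -> str:
    return f"transcript of {path}"


def fake_llm(prompt: str) -> str:
    return prompt.upper()


def fake_tts(text: str) -> str:
    return "reply.mp3"


pipeline = make_pipeline(fake_asr, fake_llm, fake_tts)
reply_text, reply_audio = pipeline("input_audio.wav")
print(reply_text, "->", reply_audio)
```

Keeping each stage behind a plain function boundary makes it possible to swap one component (for example a different ASR model) without touching the other two.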

## 1. Environment Setup
Installing dependencies for Audio I/O, ASR, and LLM inference.
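
Whisper decodes audio by calling the `ffmpeg` command-line tool, which is separate from the `ffmpeg-python` package installed below. Colab runtimes normally ship with it, but a quick check can save a confusing error later. A minimal sketch, where `missing_tools` is a hypothetical helper name:

```python
import shutil


def missing_tools(tools=("ffmpeg",)):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


missing = missing_tools()
if missing:
    print(f"Missing system tools: {missing}")
else:
    print("All required system tools found.")
```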

In [None]:
# --- ENVIRONMENT SETUP ---
# Installing dependencies for Audio I/O, ASR, and LLM inference.
# IPython's capture_output() suppresses verbose installation logs for cleaner output.

from IPython.utils import io
import os
import sys

print("⚙️ Setting up environment...")

with io.capture_output() as captured:
    # System & Audio processing
    !pip install ffmpeg-python pydub gTTS -q

    # ASR: OpenAI Whisper
    !pip install git+https://github.com/openai/whisper.git -q

    # LLM: Groq API Client
    !pip install groq -q

# Imports
import whisper
import ffmpeg
import numpy as np
import getpass
from google.colab import output
from base64 import b64decode
from IPython.display import HTML, Audio, display, Javascript
from pydub import AudioSegment
from scipy.io.wavfile import read as wav_read
from groq import Groq
from gtts import gTTS

print("✅ Environment ready.")

## 2. Audio Input Interface
Executes a JavaScript bridge to access the browser's microphone, automatically capturing a 5-second audio sample for processing.
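
The bridge hands the recording back to Python as a base64 data URL. The sketch below shows one way to decode such a URL and to inspect which container the browser actually produced, since `MediaRecorder` usually emits WebM or Ogg rather than WAV. `decode_data_url` and `sniff_container` are hypothetical helpers, and the sample bytes are synthetic rather than a real recording.

```python
from base64 import b64decode, b64encode


def decode_data_url(data_url: str) -> bytes:
    """Extract the raw bytes from a 'data:<mime>;base64,<payload>' URL."""
    header, _, payload = data_url.partition(",")
    if not header.startswith("data:") or ";base64" not in header:
        raise ValueError("expected a base64 data URL")
    return b64decode(payload)


def sniff_container(data: bytes) -> str:
    """Guess the audio container from its leading magic bytes."""
    if data[:4] == b"\x1a\x45\xdf\xa3":   # EBML header (WebM / Matroska)
        return "webm/matroska"
    if data[:4] == b"OggS":               # Ogg page header
        return "ogg"
    if data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    return "unknown"


# Synthetic payload that only mimics the first bytes of a WebM file.
sample = b"\x1a\x45\xdf\xa3" + b"\x00" * 8
url = "data:audio/webm;codecs=opus;base64," + b64encode(sample).decode("ascii")

audio_bytes = decode_data_url(url)
print(sniff_container(audio_bytes))  # webm/matroska
```

Whisper reads the file through ffmpeg, which detects the format from the contents, so a mismatched file extension does not prevent transcription.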

In [None]:
# --- AUDIO INPUT INTERFACE (Automatic Capture) ---

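def record_audio_simple(duration=5, filename='input_audio.wav'):
    """
    Records audio from the browser microphone for a fixed duration
    and saves it to the Colab runtime.

    Note: MediaRecorder emits the browser's default container (typically
    WebM or Ogg), so the .wav extension is nominal. ffmpeg-based readers
    such as Whisper detect the real format from the file contents.
    """
    print(f"🎙️ Recording for {duration} seconds... Please speak now.")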

    js_code = """
    const sleep  = time => new Promise(resolve => setTimeout(resolve, time))
    const b2text = blob => new Promise(resolve => {
      const reader = new FileReader()
      reader.onloadend = () => resolve(reader.result)
      reader.readAsDataURL(blob)
    })
    var record = time => new Promise(async resolve => {
      const stream = await navigator.mediaDevices.getUserMedia({ audio: true })
      const recorder = new MediaRecorder(stream)
      const chunks = []
      recorder.ondataavailable = e => chunks.push(e.data)
      recorder.onstop = async () => {
        // Release the microphone so the browser's recording indicator turns off
        stream.getTracks().forEach(track => track.stop())
        const blob = new Blob(chunks)
        resolve(await b2text(blob))
      }
      recorder.start()
      await sleep(time)
      recorder.stop()
    })
    """
    display(Javascript(js_code))
    s = output.eval_js('record(%d)' % (duration*1000))
    binary = b64decode(s.split(',')[1])

    with open(filename, 'wb') as f:
        f.write(binary)

    return filename

# --- EXECUTION ---
try:
    audio_path = record_audio_simple(duration=5)
    print(f"✅ Audio captured successfully: {audio_path}")

    # Playback for verification
    print("▶️ Playback:")
    display(Audio(audio_path))
except Exception as e:
    print(f"❌ Error recording audio: {e}")

## 3. AI Pipeline Execution
Orchestrating the flow: Audio Transcription (ASR) $\to$ Intelligence (LLM) $\to$ Synthesis (TTS).

In [None]:
# --- AI PROCESSING PIPELINE ---

def run_pipeline(audio_file):
    # 1. ASR: Speech-to-Text
    print("\n1️⃣ Transcribing audio (Whisper)...")
    model = whisper.load_model("base")
    transcription = model.transcribe(audio_file)["text"]
    print(f"   └── User Query: \"{transcription.strip()}\"")

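    # 2. LLM: Inference
    print("\n2️⃣ Generating response (Llama 3.3)...")
    try:
        # Reuse GROQ_API_KEY from the environment if it is set; otherwise prompt
        # once and store it there so later calls in this session do not ask again.
        api_key = os.environ.get("GROQ_API_KEY")
        if not api_key:
            print("   🔑 Enter Groq API Key:")
            api_key = getpass.getpass()
            os.environ["GROQ_API_KEY"] = api_key
        client = Groq(api_key=api_key)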

        chat_completion = client.chat.completions.create(
            model="llama-3.3-70b-versatile",
            messages=[
                # System prompt (Spanish): "You are a concise, professional AI assistant. Answer in Spanish."
                {"role": "system", "content": "Eres un asistente de IA conciso y profesional. Responde en español."},
                {"role": "user", "content": transcription}
            ],
            temperature=0.5,
            max_tokens=200
        )
        ai_response = chat_completion.choices[0].message.content
        print(f"   └── AI Response: \"{ai_response[:100]}...\"")  # Preview: first 100 characters

    except Exception as e:
        return f"Error in LLM inference: {str(e)}", None

    # 3. TTS: Audio Synthesis
    print("\n3️⃣ Synthesizing audio (gTTS)...")
    tts = gTTS(text=ai_response, lang='es')
    output_path = "ai_response.mp3"  # gTTS writes MP3 data, so use a matching extension
    tts.save(output_path)

    return ai_response, output_path

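# --- RUN PIPELINE ---
# Execute the full flow using the captured audio.
# globals().get avoids a NameError if the recording cell failed or was skipped.
audio_path = globals().get("audio_path")
if audio_path and os.path.exists(audio_path):
    response_text, response_audio = run_pipeline(audio_path)

    if response_audio is None:
        # run_pipeline returns (error message, None) when the LLM call fails
        print(f"❌ {response_text}")
    else:
        print("\n✅ Interaction Complete.")
        print("="*50)
        print(f"🤖 Full Response:\n{response_text}")
        print("="*50)
        display(Audio(response_audio, autoplay=True))
else:
    print("❌ No audio input found. Please run the recording cell first.")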

## 4. Technical Notes & Future Improvements

### Performance Analysis
* **Latency:** The Whisper `base` model offers a reasonable trade-off between speed and accuracy for this demo. For production, a distilled variant such as `distil-whisper` could reduce ASR latency by an estimated 50%.
* **Inference:** Groq serves the model on its LPU (Language Processing Unit) hardware, which is designed to generate tokens considerably faster than typical GPU-based serving.
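
The latency points above are easier to act on with per-stage measurements. A minimal sketch of how each stage could be timed, assuming a hypothetical `timed` context manager and using `time.sleep` as a placeholder for the real ASR, LLM, and TTS calls:

```python
import time
from contextlib import contextmanager

timings = {}  # stage name -> elapsed seconds


@contextmanager
def timed(stage: str):
    """Record the wall-clock duration of the enclosed block under `stage`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[stage] = time.perf_counter() - start


# Placeholders: replace each sleep with the corresponding pipeline call.
with timed("asr"):
    time.sleep(0.01)
with timed("llm"):
    time.sleep(0.02)
with timed("tts"):
    time.sleep(0.01)

for stage, seconds in timings.items():
    print(f"{stage}: {seconds * 1000:.1f} ms")
```

Wrapping the three steps inside `run_pipeline` this way would show which stage dominates the round trip before any model is swapped.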

### Stack References
* **Whisper:** [OpenAI GitHub](https://github.com/openai/whisper)
* **Llama 3:** [Meta AI](https://llama.meta.com/)
* **Groq Cloud:** [Groq API Docs](https://console.groq.com/docs/quickstart)