# Week 3 - Assignment: Voice Agent Development

From this week on, we get hands-on and start building a Research Voice Agent. A truly useful AI research assistant must listen, understand, and respond with voice. **We provide some simple starter code below; feel free to write your own code or optimize it.**

## 📚 Learning Objectives this week
To build a simple voice agent, you need the following knowledge:

* **1. Speech Recognition (ASR):** Convert audio to text using models like Whisper or Google Speech-to-Text.
* **2. Dialogue Generation with LLMs:** Feed the transcribed user input into an LLM (e.g. Llama 3) and generate natural language responses.
* **3. Text-to-Speech (TTS):** Use a TTS engine (e.g. CosyVoice) to convert generated responses into spoken audio.
* **4. FastAPI for API Serving:** Create a web server with FastAPI to handle audio file uploads and return voice responses.
* **5. Conversation State Management:** Track conversation history to enable multi-turn interaction.
* **6. Low-Latency Real-Time Processing:** Use asynchronous functions so that blocking model calls do not stall the server, which reduces the latency users perceive.

---


> ✅ You do NOT need Docker. Just ensure your local Python environment works.

---

## 🧪 Project: Build a Local Voice Assistant

### 🎯 Goal:

Develop a real-time voice chatbot that can:

1. Take audio input via HTTP,
2. Transcribe audio to text (ASR),
3. Generate a response using LLM,
4. Convert the response back to speech (TTS),
5. Support 5-turn conversational memory.

---

### Step 1: FastAPI Skeleton

Create a simple FastAPI server that accepts an audio file via POST and returns an audio file in response. See the official [FastAPI documentation](https://fastapi.tiangolo.com/) for guidance.

In [None]:
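# File uploads in FastAPI require the `python-multipart` package to be installed.
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import FileResponse

app = FastAPI()

@app.post("/chat/")
async def chat_endpoint(file: UploadFile = File(...)):
    # Read the whole uploaded audio file into memory
    audio_bytes = await file.read()
    # TODO: ASR → LLM → TTS
    # Until TTS is wired in, response.wav must already exist on disk.
    return FileResponse("response.wav", media_type="audio/wav")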


Run your server:

```bash
uvicorn main:app --reload
```

Test it with `curl`, Postman, or a custom frontend.
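
To exercise the skeleton you need some audio file to upload. The following is a minimal sketch that writes a short sine tone using only the Python standard library; the function name `write_test_tone` and the file name `test_tone.wav` are assumptions made for illustration. A tone contains no speech, so it only checks the upload and response path, not the ASR step added later.

```python
import math
import struct
import wave

def write_test_tone(path="test_tone.wav", seconds=1.0, freq_hz=440.0, sample_rate=16000):
    """Write a mono 16-bit PCM sine tone and return the number of frames written."""
    n_frames = int(seconds * sample_rate)
    frames = bytearray()
    for i in range(n_frames):
        # Scale to 30% of the 16-bit range to keep the tone quiet
        sample = int(0.3 * 32767 * math.sin(2 * math.pi * freq_hz * i / sample_rate))
        frames += struct.pack("<h", sample)  # little-endian signed 16-bit
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes = 16 bits per sample
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(frames))
    return n_frames
```

After calling `write_test_tone()`, and assuming the server runs on uvicorn's default address, the file can be posted with `curl -X POST -F "file=@test_tone.wav" http://127.0.0.1:8000/chat/ --output reply.wav`. The form field is named `file` because that is the parameter name in `chat_endpoint`.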

### Step 2: ASR (Speech Recognition)

Use OpenAI Whisper to transcribe the uploaded audio to text:

In [None]:
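import os
import tempfile

import whisper

asr_model = whisper.load_model("small")  # load once at startup, not per request

def transcribe_audio(audio_bytes):
    # Use a unique temp file: a fixed "temp.wav" would be overwritten by concurrent requests
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(audio_bytes)
        temp_path = f.name
    try:
        result = asr_model.transcribe(temp_path)
    finally:
        os.remove(temp_path)  # clean up even if transcription fails
    return result["text"].strip()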

Add it to the `/chat/` route:

In [None]:
user_text = transcribe_audio(audio_bytes)

Print `user_text` for debugging.
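
When the transcript comes back empty or wrong, it helps to check what was actually uploaded. Below is a small sketch, using only the standard library, that reports the basic properties of WAV data held in memory. The helper name `describe_wav` is hypothetical, and the check assumes clients send WAV; Whisper itself can decode other formats through ffmpeg, so treat this as a debugging aid rather than a required validation step.

```python
import io
import wave

def describe_wav(audio_bytes):
    """Return basic properties of in-memory WAV data, or raise ValueError."""
    try:
        with wave.open(io.BytesIO(audio_bytes), "rb") as wav:
            frames = wav.getnframes()
            rate = wav.getframerate()
            return {
                "channels": wav.getnchannels(),
                "sample_rate": rate,
                "duration_s": frames / rate if rate else 0.0,
            }
    except (wave.Error, EOFError) as exc:
        # wave.Error: bad header; EOFError: data too short to hold a header
        raise ValueError(f"not a readable WAV file: {exc}") from exc
```

Calling `print(describe_wav(audio_bytes))` next to the `user_text` print shows, for example, whether the recording is unexpectedly short.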


### Step 3: Response Generation (LLM)

Generate context-aware responses using Llama 3. Use the Hugging Face `pipeline` API to call Llama 3 or a similar model:


In [None]:
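from transformers import pipeline

# Gated model: accept Meta's license on Hugging Face and log in before downloading.
# The Instruct variant (Meta-Llama-3-8B-Instruct) is usually a better fit for dialogue.
llm = pipeline("text-generation", model="meta-llama/Meta-Llama-3-8B")

conversation_history = []
MAX_TURNS = 5  # one turn = one user message plus one assistant reply

def generate_response(user_text):
    conversation_history.append({"role": "user", "text": user_text})
    # Construct the prompt from the current message plus the previous four full turns
    prompt = ""
    for turn in conversation_history[-(2 * MAX_TURNS - 1):]:
        prompt += f"{turn['role']}: {turn['text']}\n"
    prompt += "assistant:"  # cue the model to answer as the assistant
    # return_full_text=False returns only the newly generated text, not the prompt
    outputs = llm(prompt, max_new_tokens=100, return_full_text=False)
    bot_response = outputs[0]["generated_text"].strip()
    conversation_history.append({"role": "assistant", "text": bot_response})
    return bot_response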


Call it in the route:

In [None]:
bot_text = generate_response(user_text)
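
A module-level list grows without bound over a long session. One possible way to keep exactly five turns is a bounded `collections.deque`, sketched below. The class name `ConversationMemory` and its methods are hypothetical; the message dictionaries use the same `role` and `text` keys as the history above.

```python
from collections import deque

class ConversationMemory:
    """Keep only the most recent turns of a conversation."""

    def __init__(self, max_turns=5):
        # One turn is a user message plus an assistant reply, hence 2 * max_turns
        self._messages = deque(maxlen=2 * max_turns)

    def add(self, role, text):
        # When the deque is full, the oldest message is dropped automatically
        self._messages.append({"role": role, "text": text})

    def as_prompt(self):
        # Render the history as "role: text" lines and cue the assistant to answer
        lines = [f"{m['role']}: {m['text']}" for m in self._messages]
        lines.append("assistant:")
        return "\n".join(lines)

    def __len__(self):
        return len(self._messages)
```

With this design the trimming happens on write, so the prompt builder never has to slice the history. A single shared instance still mixes all clients into one conversation; keeping one instance per session ID would be a natural next step.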


---



### Step 4: TTS (Text to Speech)


Convert the LLM's text responses to natural-sounding speech. Try using CosyVoice for text-to-speech; the original project is here: [CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

In [None]:
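import torch
import torchaudio
# The package is named `cosyvoice`. These calls follow the CosyVoice README and assume
# the repo is on PYTHONPATH; verify them against the version you have checked out.
from cosyvoice.cli.cosyvoice import CosyVoice

tts_engine = CosyVoice("pretrained_models/CosyVoice-300M-SFT")  # path to downloaded weights

def synthesize_speech(text, filename="response.wav"):
    speaker = tts_engine.list_available_spks()[0]  # pick any built-in voice
    # inference_sft yields audio chunks; join them into a single waveform
    chunks = [out["tts_speech"] for out in tts_engine.inference_sft(text, speaker, stream=False)]
    torchaudio.save(filename, torch.cat(chunks, dim=1), tts_engine.sample_rate)
    return filename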

Use it in the route:

In [None]:
audio_path = synthesize_speech(bot_text)
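
Because `synthesize_speech` writes to `response.wav` by default, two requests arriving close together could overwrite each other's audio. A minimal sketch of one way around this is to generate a unique path per response; the helper name `new_response_path` is an assumption for illustration.

```python
import os
import tempfile
import uuid

def new_response_path(directory=None):
    """Return a unique .wav path inside `directory` (default: the system temp dir)."""
    directory = directory or tempfile.gettempdir()
    # uuid4 makes a name collision between requests practically impossible
    return os.path.join(directory, f"response-{uuid.uuid4().hex}.wav")
```

It would be used as `audio_path = synthesize_speech(bot_text, filename=new_response_path())`. Files created this way accumulate, so they need to be deleted once the response has been sent, for example in a background task attached to the response.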



---


### Step 5: Full Integration

Your final `/chat/` endpoint might look like this:

In [None]:

@app.post("/chat/")
async def chat_endpoint(file: UploadFile = File(...)):
    audio_bytes = await file.read()
    user_text = transcribe_audio(audio_bytes)
    bot_text = generate_response(user_text)
    audio_path = synthesize_speech(bot_text)
    return FileResponse(audio_path, media_type="audio/wav")
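
Note that `transcribe_audio`, `generate_response`, and `synthesize_speech` are ordinary blocking functions, so calling them directly inside an `async def` endpoint holds up the event loop while they run. The sketch below illustrates one possible remedy, `asyncio.to_thread` (Python 3.9+), using stand-in stages. The `fake_*` functions and their sleep times are placeholders invented for this example, not real models.

```python
import asyncio
import time

# Stand-ins for the real ASR, LLM, and TTS calls; each blocks like a model would.
def fake_asr(audio_bytes):
    time.sleep(0.05)
    return "hello"

def fake_llm(user_text):
    time.sleep(0.05)
    return f"reply to: {user_text}"

def fake_tts(bot_text):
    time.sleep(0.05)
    return f"speech({bot_text})"

async def handle_turn(audio_bytes):
    # Each blocking stage runs in a worker thread, so the event loop stays free
    user_text = await asyncio.to_thread(fake_asr, audio_bytes)
    bot_text = await asyncio.to_thread(fake_llm, user_text)
    return await asyncio.to_thread(fake_tts, bot_text)

async def serve_three_requests():
    # Simulate three clients hitting the endpoint at the same time
    return await asyncio.gather(*(handle_turn(b"") for _ in range(3)))

if __name__ == "__main__":
    print(asyncio.run(serve_three_requests()))
```

Within a single request the three stages still run one after another, since each needs the previous stage's output. The benefit is across requests: the server can keep accepting and progressing other calls. With real models sharing one GPU, the actual speedup may be much smaller than with these sleep-based stand-ins.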



---

## ✅ Deliverables

* [ ] A runnable FastAPI project with `/chat/` endpoint
* [ ] A working voice assistant that handles **5-turn** multi-round conversations
* [ ] Code with clear structure and modular components (ASR, LLM, TTS)
* [ ] A **2-minute screen recording** demo: record 5 turns of real-time interaction
* [ ] Optional: Add conversation memory display, prompt formatting logic, async optimization

---

## 🌟 Extension Ideas (Optional)

* Use `async` processing so blocking ASR/LLM/TTS calls run off the event loop and multiple requests can be served concurrently.
* Integrate a microphone frontend UI for live recording.
* Add speaker identification or personalized voice response.

---
