# Week 3 - Assignment: Voice Agent Development

From this week on, we get hands-on and build a Research Voice Agent. A truly useful AI research assistant must listen, understand, and respond with voice. **We provide some simple starter code; feel free to write your own code or optimize ours.**

## 📚 Learning Objectives this week
To build a simple voice agent, we need the following knowledge:

* **1. Speech Recognition (ASR):** Convert audio to text using models like Whisper or Google Speech-to-Text.
* **2. Dialogue Generation with LLMs:** Feed the transcribed user input into an LLM (e.g. LLaMA 3) and generate natural language responses.
* **3. Text-to-Speech (TTS):** Use a TTS engine (CosyVoice) to convert generated responses into spoken audio.
* **4. FastAPI for API Serving:** Create a web server with FastAPI to handle audio file uploads and return voice responses.
* **5. Conversation State Management:** Track conversation history to enable multi-turn interaction.
* **6. Low-Latency Real-Time Processing:** Use asynchronous functions so that model inference does not block the server, which shortens the wait users experience.

---


> ✅ You do NOT need Docker. Just ensure your local Python environment works.

---

## 🧪 Project: Build a Local Voice Assistant

### 🎯 Goal:

Develop a real-time voice chatbot that can:

1. Take audio input via HTTP,
2. Transcribe audio to text (ASR),
3. Generate a response using LLM,
4. Convert the response back to speech (TTS),
5. Support 5-turn conversational memory.
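
The five steps above can be sketched as one object that wires the stages together. This is a minimal sketch, not the assignment's reference solution: `VoicePipeline` and its three stage callables are hypothetical names, and the stubs stand in for the real ASR, LLM, and TTS models introduced in the steps that follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Turn = Tuple[str, str]  # (user text, assistant reply)


@dataclass
class VoicePipeline:
    """Hypothetical wiring of the stages: audio -> text -> reply -> audio."""
    asr: Callable[[bytes], str]             # audio bytes -> transcript
    llm: Callable[[List[Turn], str], str]   # (history, user text) -> reply
    tts: Callable[[str], bytes]             # reply text -> audio bytes
    max_turns: int = 5                      # turns of memory to keep
    history: List[Turn] = field(default_factory=list)

    def handle(self, audio: bytes) -> bytes:
        user_text = self.asr(audio)
        reply = self.llm(self.history, user_text)
        self.history.append((user_text, reply))
        # Drop the oldest turns so only the last `max_turns` remain.
        del self.history[:-self.max_turns]
        return self.tts(reply)


# Stub stages so the wiring can be exercised without any model installed.
pipeline = VoicePipeline(
    asr=lambda audio: audio.decode("utf-8"),
    llm=lambda history, text: f"You said: {text}",
    tts=lambda reply: reply.encode("utf-8"),
)
print(pipeline.handle(b"hello"))
```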

---

### Step 1: FastAPI Skeleton

Create a simple FastAPI server that accepts an audio file via POST and returns an audio file in response. File uploads in FastAPI also require the `python-multipart` package. The official FastAPI documentation is here: [fastapi](https://fastapi.tiangolo.com/)

In [None]:
from fastapi import FastAPI, UploadFile, File
from fastapi.responses import FileResponse

app = FastAPI()

@app.post("/chat/")
async def chat_endpoint(file: UploadFile = File(...)):
    audio_bytes = await file.read()
    # TODO: ASR → LLM → TTS
    return FileResponse("response.wav", media_type="audio/wav")


Run your server (this assumes the code above is saved as `main.py`):

```bash
uvicorn main:app --reload
```

Test it with `curl`, Postman, or a custom frontend.
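
For a scripted test, the upload can also be sent from Python. The sketch below uses only the standard library and builds the multipart body by hand; `encode_multipart` and `send_audio` are hypothetical helper names, the form field is called `file` to match the endpoint's parameter, and the URL assumes uvicorn's default address.

```python
import urllib.request
import uuid


def encode_multipart(field_name: str, filename: str, data: bytes,
                     content_type: str = "audio/wav"):
    """Build a multipart/form-data body holding a single file field."""
    boundary = uuid.uuid4().hex
    head = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field_name}"; filename="{filename}"\r\n'
        f"Content-Type: {content_type}\r\n\r\n"
    ).encode("utf-8")
    tail = f"\r\n--{boundary}--\r\n".encode("utf-8")
    return head + data + tail, f"multipart/form-data; boundary={boundary}"


def send_audio(wav_path: str, out_path: str,
               url: str = "http://127.0.0.1:8000/chat/") -> None:
    """POST a WAV file to the endpoint and save the audio reply."""
    with open(wav_path, "rb") as f:
        body, content_type = encode_multipart("file", "input.wav", f.read())
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

With the server running, `send_audio("question.wav", "reply.wav")` should write the spoken reply to `reply.wav` (both file names are placeholders). The equivalent `curl` call is `curl -X POST -F "file=@question.wav" http://127.0.0.1:8000/chat/ --output reply.wav`.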

### Step 2: ASR (Speech Recognition)

Use OpenAI Whisper to transcribe the uploaded audio to text:

In [None]:
import whisper

asr_model = whisper.load_model("small")

def transcribe_audio(audio_bytes):
    with open("temp.wav", "wb") as f:
        f.write(audio_bytes)
    result = asr_model.transcribe("temp.wav")
    return result["text"]

Add it to the `/chat/` route:

In [None]:
user_text = transcribe_audio(audio_bytes)

Print `user_text` for debugging.
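
Writing every upload to the same `temp.wav` means two simultaneous requests can overwrite each other's audio. A possible variation, sketched under the assumption that the transcriber is passed in as a callable, gives each request its own temporary file and removes it afterwards. `transcribe_audio_safely` and `transcribe_file` are hypothetical names; with Whisper the callable could be `lambda p: asr_model.transcribe(p)["text"]`.

```python
import os
import tempfile
from typing import Callable


def transcribe_audio_safely(audio_bytes: bytes,
                            transcribe_file: Callable[[str], str]) -> str:
    """Write the upload to a unique temp file, transcribe it, then clean up."""
    # delete=False so the file can be reopened by path on every platform.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(audio_bytes)
        tmp_path = tmp.name
    try:
        return transcribe_file(tmp_path).strip()
    finally:
        os.remove(tmp_path)  # runs even if transcription raises


# Stand-in transcriber for a dry run: reports the size of the file it was given.
def fake_transcriber(path: str) -> str:
    return f" {os.path.getsize(path)} bytes received "


print(transcribe_audio_safely(b"\x00" * 16, fake_transcriber))
```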


### Step 3: Response Generation (LLM)

Generate context-aware responses with Llama 3, or a similar model, through the Hugging Face `pipeline` API.

*Note:* `llama-3.1-8b` may run out of memory on a 16 GB GPU. If it does, try a smaller model such as `llama-3.2-1b` or `llama-3.2-3b`.

In [None]:
from transformers import pipeline

llm = pipeline("text-generation", model="meta-llama/Llama-3.1-8B")

conversation_history = []

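def generate_response(user_text):
    conversation_history.append({"role": "user", "text": user_text})
    # Build the prompt from the 5 most recent messages (user and assistant entries both count)
    prompt = ""
    for turn in conversation_history[-5:]:
        prompt += f"{turn['role']}: {turn['text']}\n"
    prompt += "assistant:"  # cue the model to answer as the assistant
    # return_full_text=False keeps the prompt itself out of the returned text
    outputs = llm(prompt, max_new_tokens=100, return_full_text=False)
    bot_response = outputs[0]["generated_text"].strip()
    conversation_history.append({"role": "assistant", "text": bot_response})
    return bot_response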


Call it in the route:

In [None]:
bot_text = generate_response(user_text)
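
The slice `conversation_history[-5:]` in `generate_response` keeps five messages, which is fewer than five full turns because each turn holds a user message and an assistant message. One way to track whole turns, shown as a sketch with hypothetical names (`ConversationMemory`, `build_prompt`), is a `collections.deque` with `maxlen`, which discards the oldest turn automatically.

```python
from collections import deque


class ConversationMemory:
    """Keeps the most recent `max_turns` (user, assistant) exchanges."""

    def __init__(self, max_turns: int = 5):
        self.turns = deque(maxlen=max_turns)  # oldest turn drops off automatically

    def add_turn(self, user_text: str, bot_text: str) -> None:
        self.turns.append((user_text, bot_text))

    def build_prompt(self, user_text: str) -> str:
        """Render stored turns plus the new user message as a plain-text prompt."""
        lines = []
        for past_user, past_bot in self.turns:
            lines.append(f"user: {past_user}")
            lines.append(f"assistant: {past_bot}")
        lines.append(f"user: {user_text}")
        lines.append("assistant:")  # cue for the model to continue
        return "\n".join(lines)


memory = ConversationMemory(max_turns=2)
memory.add_turn("Hi", "Hello!")
memory.add_turn("What is ASR?", "Automatic speech recognition.")
memory.add_turn("And TTS?", "Text-to-speech.")  # pushes out the "Hi" turn
print(memory.build_prompt("Thanks"))
```

For the assignment, `max_turns=5` would match the five-turn requirement; the value 2 above only keeps the demonstration short.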


---



### Step 4: TTS (Text to Speech)


Convert the LLM's text responses to natural-sounding speech. \
Try CosyVoice for text-to-speech; the original project is here:
[CosyVoice](https://github.com/FunAudioLLM/CosyVoice)

Other suggestions:
If you have trouble setting up CosyVoice, you can also try [BentoXTTS](https://github.com/bentoml/BentoXTTS?tab=readme-ov-file) or [pyttsx3](https://pypi.org/project/pyttsx3/) (this part may be somewhat outdated).

In [None]:
import sys
sys.path.append('third_party/Matcha-TTS')
from cosyvoice.cli.cosyvoice import CosyVoice, CosyVoice2
from cosyvoice.utils.file_utils import load_wav
import torchaudio

cosyvoice = CosyVoice2('pretrained_models/CosyVoice2-0.5B', load_jit=False, load_trt=False, load_vllm=False, fp16=False)

# NOTE if you want to reproduce the results on https://funaudiollm.github.io/cosyvoice2, please add text_frontend=False during inference
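# zero_shot usage
# First string: the text to speak. In English: "Receiving a birthday gift sent from afar by a friend,
# the unexpected surprise and heartfelt blessing filled my heart with sweet joy, and my smile bloomed like a flower."
# Second string: the transcript of the prompt recording ("I hope you can do even better than me in the future."); it must match that audio.
prompt_speech_16k = load_wav('./asset/zero_shot_prompt.wav', 16000)
for i, j in enumerate(cosyvoice.inference_zero_shot('收到好友从远方寄来的生日礼物，那份意外的惊喜与深深的祝福让我心中充满了甜蜜的快乐，笑容如花儿般绽放。', '希望你以后能够做的比我还好呦。', prompt_speech_16k, stream=False)):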
    torchaudio.save('zero_shot_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


In [None]:
# After you set up BentoXTTS, you can use it for TTS.
# Make sure the BentoXTTS server is running before you call it.
# If the Python client gives you no voice response, test the server with cURL first.
import bentoml

with bentoml.SyncHTTPClient("http://localhost:3000") as client:
    result = client.synthesize(
        text="It took me quite a long time to develop a voice and now that I have it I am not going to be silent.",
        lang="en"
    )

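Wrap your chosen TTS engine in a `synthesize_speech(text)` function that returns the path of the generated audio file, then use it in the route: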

In [None]:
audio_path = synthesize_speech(bot_text)
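
`synthesize_speech` is not defined anywhere above, so it has to be written around whichever TTS engine you chose. A minimal sketch with pyttsx3 follows. The function name comes from the route code, while the `engine` and `out_dir` parameters are assumptions added here so the function can be exercised without a speech backend. The pyttsx3 calls used are `pyttsx3.init()`, `save_to_file()`, and `runAndWait()`; the audio format pyttsx3 writes depends on the platform driver, so check the output before serving it as `audio/wav`.

```python
import os
import tempfile
import uuid
from typing import Optional


def synthesize_speech(text: str, engine=None, out_dir: Optional[str] = None) -> str:
    """Render `text` to an audio file and return the file's path."""
    if engine is None:
        import pyttsx3  # imported lazily so this module loads without it
        engine = pyttsx3.init()
    out_dir = out_dir or tempfile.gettempdir()
    # A unique name per call keeps concurrent requests from sharing a file.
    audio_path = os.path.join(out_dir, f"reply_{uuid.uuid4().hex}.wav")
    engine.save_to_file(text, audio_path)
    engine.runAndWait()  # blocks until the queued speech has been written
    return audio_path
```

A CosyVoice or BentoXTTS version would keep the same signature and replace only the engine calls, so the route code does not need to change when you switch engines.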



---


### Step 5: Full Integration

Your final `/chat/` endpoint might look like this:

In [None]:

@app.post("/chat/")
async def chat_endpoint(file: UploadFile = File(...)):
    audio_bytes = await file.read()
    user_text = transcribe_audio(audio_bytes)
    bot_text = generate_response(user_text)
    audio_path = synthesize_speech(bot_text)
    return FileResponse(audio_path, media_type="audio/wav")
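
Once the three stages are wired together, it helps to know which one dominates the response time before attempting any optimization. The sketch below is a hypothetical helper, not part of FastAPI or any library used above: `timed` is a decorator that records wall-clock time per stage with `time.perf_counter`, and `fake_transcribe` is a stand-in for a real stage.

```python
import functools
import time

stage_timings = {}  # stage name -> list of durations in seconds


def timed(stage_name):
    """Decorator that records how long each call to a stage takes."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                # Recorded even when the stage raises an exception.
                stage_timings.setdefault(stage_name, []).append(
                    time.perf_counter() - start
                )
        return wrapper
    return decorator


@timed("asr")
def fake_transcribe(audio_bytes):
    time.sleep(0.01)  # stand-in for model inference
    return "hello"


fake_transcribe(b"...")
print({name: f"{sum(t) / len(t):.3f}s" for name, t in stage_timings.items()})
```

In the real project the decorator would go on `transcribe_audio`, `generate_response`, and `synthesize_speech`, and the averages would show where the latency budget is spent.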



---

## ✅ Deliverables

* [ ] A runnable FastAPI project with `/chat/` endpoint
* [ ] A working voice assistant that handles **5-turn** multi-round conversations
* [ ] Code with clear structure and modular components (ASR, LLM, TTS)
* [ ] A **2-minute screen recording** demo: record 5 turns of real-time interaction
* [ ] Optional: Add conversation memory display, prompt formatting logic, async optimization

---

## 🌟 Extension Ideas (Optional)

* Use `async` processing for parallel ASR/LLM/TTS.
* Integrate a microphone frontend UI for live recording.
* Add speaker identification or personalized voice response.
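
For the first idea, note that the Whisper, LLM, and TTS calls shown earlier are blocking, so calling them directly inside an `async def` route stalls the event loop for every other request. A common remedy, sketched here with stand-in stage functions (the `fake_*` names are placeholders, not part of the assignment code), is to push each blocking call onto a worker thread with `asyncio.to_thread` (Python 3.9+). The stages of one request still run in sequence; the gain is that several requests can be in flight at once.

```python
import asyncio
import time


def fake_asr(audio: bytes) -> str:
    time.sleep(0.1)  # stand-in for blocking model inference
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    time.sleep(0.1)
    return f"echo: {text}"


def fake_tts(text: str) -> bytes:
    time.sleep(0.1)
    return text.encode("utf-8")


async def handle_request(audio: bytes) -> bytes:
    # Each blocking stage runs in a worker thread; the event loop stays free.
    text = await asyncio.to_thread(fake_asr, audio)
    reply = await asyncio.to_thread(fake_llm, text)
    return await asyncio.to_thread(fake_tts, reply)


async def main():
    start = time.perf_counter()
    # Three requests overlap instead of queuing behind one another.
    results = await asyncio.gather(*(handle_request(b"hi") for _ in range(3)))
    return results, time.perf_counter() - start


results, elapsed = asyncio.run(main())
print(results[0], f"{elapsed:.2f}s")
```

In the `/chat/` route, the same pattern would wrap `transcribe_audio`, `generate_response`, and `synthesize_speech`. How much real models overlap in practice depends on the hardware, since a single GPU still processes one inference at a time unless requests are batched.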

---
