In this notebook, we'll walk through the new Speech-to-Text (STT) and Text-to-Speech (TTS) models added to our project. To keep the architecture clean and maintainable, we've introduced two dedicated interfaces: `STTModel` for speech recognition and `TTSModel` for speech generation.
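
To make the idea of two small interfaces concrete, here is a minimal sketch of what they could look like. The method names `stt` and `tts` and the `(samplerate, samples)` tuple come from the calls used later in this notebook; everything else (the `Protocol` classes, `EchoSTT`, `SilenceTTS`, `round_trip`) is hypothetical and not the project's actual code. The Runpod-hosted Orpheus call is awaited further down, so the real interface may well be asynchronous.

```python
from typing import Protocol, Sequence, Tuple

# (samplerate, samples). In the notebook the samples are a NumPy array;
# a plain sequence of floats keeps this sketch dependency-free.
AudioChunk = Tuple[int, Sequence[float]]


class STTModel(Protocol):
    def stt(self, audio: AudioChunk) -> str:
        ...


class TTSModel(Protocol):
    def tts(self, text: str) -> AudioChunk:
        ...


class EchoSTT:
    """Toy stand-in (hypothetical): reports the clip length instead of transcribing."""

    def stt(self, audio: AudioChunk) -> str:
        samplerate, samples = audio
        return f"{len(samples) / samplerate:.2f} seconds of audio"


class SilenceTTS:
    """Toy stand-in (hypothetical): returns 10 ms of silence per character."""

    def __init__(self, samplerate: int = 24_000) -> None:
        self.samplerate = samplerate

    def tts(self, text: str) -> AudioChunk:
        n_samples = len(text) * self.samplerate // 100
        return self.samplerate, [0.0] * n_samples


def round_trip(tts: TTSModel, stt: STTModel, text: str) -> str:
    # Any pair of objects exposing the right methods can be plugged in here.
    return stt.stt(tts.tts(text))
```

Because the rest of the code only depends on `stt` and `tts`, swapping one provider for another does not require changes anywhere else.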

**Let's take a look at how each one works in practice!**

<p align="center">
  <img src="./img/streaming_handlers.png" width="600">
</p>


In [None]:
from dotenv import load_dotenv

load_dotenv()

## STT Models

We added a helper function to fetch the STT model you need.
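
A helper like this is often just a lookup from a name to a constructor. The sketch below shows that pattern under stated assumptions: `get_stt_model_sketch`, the registry, and the two fake classes are hypothetical and are not the implementation of `get_stt_model`. Only the model names are taken from this notebook.

```python
from typing import Callable, Dict


class FakeMoonshine:
    """Hypothetical stand-in for a locally hosted model."""

    def stt(self, audio) -> str:
        return "transcript from a local model"


class FakeGroqWhisper:
    """Hypothetical stand-in for a provider-hosted model."""

    def stt(self, audio) -> str:
        return "transcript from a hosted model"


# Maps the name passed by the caller to a zero-argument constructor.
_STT_REGISTRY: Dict[str, Callable[[], object]] = {
    "moonshine": FakeMoonshine,
    "whisper-groq": FakeGroqWhisper,
}


def get_stt_model_sketch(name: str):
    try:
        factory = _STT_REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_STT_REGISTRY))
        raise ValueError(f"Unknown STT model {name!r}; expected one of: {known}") from None
    return factory()
```

Failing early with the list of known names makes a typo in the model name easy to spot.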

In [None]:
import numpy as np
import soundfile as sf
from IPython.display import Audio

from realtime_phone_agents.stt import get_stt_model

We'll use the following example audio clip to test the different models.
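
The STT calls in this notebook receive audio as a `(samplerate, samples)` tuple. As a rough sketch of preparing such a tuple, the snippet below builds a synthetic stereo tone and converts it to a mono `float32` array. Whether the models here require mono input is an assumption, and `to_mono_float32` is a hypothetical helper rather than part of the project.

```python
import numpy as np


def to_mono_float32(samples: np.ndarray) -> np.ndarray:
    """Average a (frames, channels) array down to (frames,) and cast to float32."""
    samples = np.asarray(samples)
    if samples.ndim == 2:
        samples = samples.mean(axis=1)
    return samples.astype(np.float32)


# Synthetic one-second stereo clip: the same 440 Hz tone on both channels.
samplerate = 16_000
t = np.arange(samplerate) / samplerate
tone = np.sin(2 * np.pi * 440 * t)
stereo = np.stack([tone, tone], axis=1)

mono = to_mono_float32(stereo)
audio_input = (samplerate, mono)  # same tuple layout that is passed to .stt(...)
duration_seconds = len(mono) / samplerate
```

`soundfile.read` returns multi-channel audio with shape `(frames, channels)`, which is why the helper averages over axis 1.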

In [None]:
Audio("sounds/example.mp3")

In [None]:
audio_example, samplerate = sf.read("sounds/example.mp3", dtype=np.float32)   # decode MP3 → numpy array

### Moonshine

Moonshine is the STT model we've been using throughout the previous lessons.

In [None]:
moonshine = get_stt_model("moonshine")

In [None]:
moonshine.stt((samplerate, audio_example))

### Groq

If you don't want to host the STT models yourself, you can use Groq for that purpose.

In [None]:
whisper_groq = get_stt_model("whisper-groq")

In [None]:
whisper_groq.stt((samplerate, audio_example))

### Faster-Whisper

Finally, let's host our own STT model on Runpod, using `faster-whisper`. **Make sure you have created the Runpod Pod, then copy its URL and paste it into your `.env` file under `FASTER_WHISPER__API_URL`.**
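
Before creating the model, it can help to confirm that the variable was actually loaded. The check below is a small sketch: `read_api_url` is a hypothetical helper, and how the project itself reads `FASTER_WHISPER__API_URL` is not shown in this notebook.

```python
import os
from urllib.parse import urlparse


def read_api_url(var_name: str = "FASTER_WHISPER__API_URL") -> str:
    """Return the URL stored in the environment, or raise with a helpful message."""
    url = os.environ.get(var_name, "").strip()
    if not url:
        raise RuntimeError(f"{var_name} is not set; add it to your .env file")
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise RuntimeError(f"{var_name} does not look like an HTTP(S) URL: {url!r}")
    # Drop a trailing slash so paths can be appended consistently.
    return url.rstrip("/")
```

Calling `read_api_url()` after `load_dotenv()` surfaces a missing or malformed URL right away, rather than at the first transcription request.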

In [None]:
faster_whisper = get_stt_model("faster-whisper")

In [None]:
faster_whisper.stt((samplerate, audio_example))

## TTS Models

Let's do the same for our TTS models.

In [None]:
import numpy as np
from IPython.display import Audio

from realtime_phone_agents.tts import get_tts_model

In [None]:
text = "Hello, and welcome to The Neural Maze call center. Are you looking for a great apartment in Madrid?"

### Kokoro

Kokoro is the TTS model we've been using throughout the previous lessons.

In [None]:
kokoro = get_tts_model("kokoro")

In [None]:
samplerate, audio = kokoro.tts(text)

In [None]:
Audio(audio, rate=samplerate)

### Orpheus 3B (Together.ai)

As we did with Groq, you can rely on providers for the TTS models.

In [None]:
orpheus_togetherai = get_tts_model("together")

In [None]:
samplerate, audio = orpheus_togetherai.tts(text)

In [None]:
Audio(audio, rate=samplerate)

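### Orpheus 3B (Runpod)

We can also host Orpheus ourselves on Runpod, as we did with `faster-whisper`. Note that the `tts` call is awaited for this model.
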
In [None]:
orpheus_runpod = get_tts_model("orpheus-runpod")

In [None]:
samplerate, audio = await orpheus_runpod.tts(text)

In [None]:
Audio(audio, rate=samplerate)

The great thing about Orpheus is that it not only lets us switch between different voices, but also allows us to add emotional expressions to them.

You can enhance the speech with the following emotive tags:
`<laugh>`, `<chuckle>`, `<sigh>`, `<cough>`, `<sniffle>`, `<groan>`, `<yawn>`, `<gasp>`

Available voices include: `tara`, `leah`, `jess`, `leo`, `dan`, `mia`, `zac`, `zoe`.
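
Since the tags and voices are fixed lists, a message can be checked before it is sent to the model. This is a minimal sketch: `unknown_tags` and `check_voice` are hypothetical helpers, and how Orpheus handles a tag outside the list is not covered in this notebook.

```python
import re
from typing import List

# Copied from the lists above.
EMOTIVE_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}
VOICES = {"tara", "leah", "jess", "leo", "dan", "mia", "zac", "zoe"}

_TAG_PATTERN = re.compile(r"<([a-z]+)>")


def unknown_tags(message: str) -> List[str]:
    """Return the tags found in the message that are not in the supported list."""
    return [tag for tag in _TAG_PATTERN.findall(message) if tag not in EMOTIVE_TAGS]


def check_voice(voice: str) -> str:
    """Return the voice unchanged if it is in the list, otherwise raise."""
    if voice not in VOICES:
        raise ValueError(f"Unknown voice {voice!r}; expected one of {sorted(VOICES)}")
    return voice
```

For example, `unknown_tags("So funny <giggle>")` would flag `giggle`, which is not one of the listed tags.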

Let's check some final examples!

In [None]:
orpheus_runpod.set_voice("leo")
leo_message = "Miguel told me I'm an AI Agent! <chuckle> This guy is a liar"
samplerate, audio = await orpheus_runpod.tts(leo_message)
Audio(audio, rate=samplerate)

In [None]:
orpheus_runpod.set_voice("tara")
tara_message = "I'm so tired of working at this call center <yawn> Just need some sleep"
samplerate, audio = await orpheus_runpod.tts(tara_message)
Audio(audio, rate=samplerate)

In [None]:
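# NOTE: `long_message` is not defined earlier in this notebook; this placeholder built from `text` is an assumption.
long_message = " ".join([text] * 3)
samplerate, audio = await orpheus_runpod.tts(long_message)
Audio(audio, rate=samplerate)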