# STT -> LLM -> TTS Test Space

pipeline and stack:

* STT: coqui stt, vosk (which one?)
* LLM: ollama, langchain
* TTS: coqui tts
* AUDIO I/O: pyaudio, sounddevice

### Audio I/O Testing

In [2]:
# audio IO - pyaudio test (playback - sample)
import wave
import pyaudio

chunksize = 1024
f = 'wav_training/to_output.wav'

with wave.open(f, 'rb') as wf:
    # Instantiate PyAudio and initialize PortAudio system resources (1)
    p = pyaudio.PyAudio()

    # Open stream (2)
    stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                    channels=wf.getnchannels(),
                    rate=wf.getframerate(),
                    output=True)

    # Play samples from the wave file (3)
    while len(data := wf.readframes(chunksize)):
        stream.write(data)

    # Close stream (4)
    stream.close()

    # Release PortAudio system resources (5)
    p.terminate()

In [3]:
# audio IO - pyaudio test (record)
import wave
import pyaudio
import math

chunksize = 1024
f = 'record.wav'
seconds = 5
rate = 44100
channels = 1
form = pyaudio.paInt16

# Instantiate PyAudio and initialize PortAudio system resources (1)
p = pyaudio.PyAudio()

# Open stream (2)
stream = p.open(format=form,
                channels=channels,
                rate=rate,
                input=True,
                frames_per_buffer=chunksize)

# Container for the recorded frames
print("recording started")
recordframes = []

# Record (3): read enough chunks to cover `seconds`, rounded up to whole chunks
for _ in range(math.ceil(rate / chunksize * seconds)):
    data = stream.read(chunksize)
    recordframes.append(data)
print("recording stopped")
stream.stop_stream()

# Close stream (4)
stream.close()

# Release PortAudio system resources (5)
p.terminate()

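# Write the recorded frames to a WAV file
with wave.open(f, 'wb') as wf:
    wf.setnchannels(channels)
    # Module-level helper, so this does not rely on the already terminated PyAudio instance
    wf.setsampwidth(pyaudio.get_sample_size(form))
    wf.setframerate(rate)
    wf.writeframes(b''.join(recordframes))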

recording started
recording stopped
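
The loop bound in the recording cell rounds up to a whole number of chunks, so the capture runs slightly longer than `seconds`. A quick check of that arithmetic with the same values as the cell above (the helper name `chunks_for` is made up for illustration):

```python
import math

def chunks_for(rate: int, chunksize: int, seconds: float) -> int:
    """Number of fixed-size reads needed to cover `seconds` of audio."""
    return math.ceil(rate / chunksize * seconds)

rate, chunksize, seconds = 44100, 1024, 5
n_chunks = chunks_for(rate, chunksize, seconds)

# Duration actually captured once the chunk count is rounded up
actual_seconds = n_chunks * chunksize / rate

print(n_chunks)                   # 216
print(round(actual_seconds, 3))   # 5.016
```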


In [4]:
# audio IO - pyaudio test (playback - recording)
import wave
import pyaudio

chunksize = 1024
f = 'record.wav'

with wave.open(f, 'rb') as wf:
    # Instantiate PyAudio and initialize PortAudio system resources (1)
    p = pyaudio.PyAudio()

    # Open stream (2)
    stream = p.open(format=p.get_format_from_width(wf.getsampwidth()),
                    channels=wf.getnchannels(),
                    rate=wf.getframerate(),
                    output=True)

    # Play samples from the wave file (3)
    while len(data := wf.readframes(chunksize)):
        stream.write(data)

    # Close stream (4)
    stream.close()

    # Release PortAudio system resources (5)
    p.terminate()
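
Before handing a recording to later stages, it can help to confirm what was actually written to disk. The sketch below reads the header fields with the standard `wave` module. It writes a one-second silent file to a temp directory as a stand-in for `record.wav`, and `describe_wav` is a hypothetical helper name.

```python
import os
import tempfile
import wave

def describe_wav(path: str) -> dict:
    """Return channel count, sample width (bytes), frame rate and duration of a WAV file."""
    with wave.open(path, 'rb') as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            'channels': wf.getnchannels(),
            'sample_width': wf.getsampwidth(),
            'rate': rate,
            'seconds': frames / rate,
        }

# Stand-in for record.wav: one second of 16-bit mono silence at 44.1 kHz
demo_path = os.path.join(tempfile.mkdtemp(), 'demo.wav')
with wave.open(demo_path, 'wb') as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(44100)
    wf.writeframes(b'\x00\x00' * 44100)

info = describe_wav(demo_path)
print(info)
```

Pointed at the file produced by the record cell, this should report 1 channel, 2-byte samples and 44100 Hz, matching the settings used there.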

### Speech to Text Testing

In [1]:
from vosk import Model, KaldiRecognizer
import pyaudio
import json

In [2]:
# Load the Vosk model
model_path = "models/stt_model/vosk-model-small-en-us-0.15"
# model_path = "models/stt_model/vosk-model-en-us-0.22"
model = Model(model_path)

In [3]:
# Initialize the recognizer with the model and the sample rate of the incoming audio
recognizer = KaldiRecognizer(model, 16000) # 16000 Hz must match the rate of the input stream opened below

In [4]:
# Setup PyAudio for audio input
p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16,
                channels=1,
                rate=16000,
                input=True,
                frames_per_buffer=8192)
stream.start_stream()

print("Listening...")

# Speech recognition loop
while True:
    data = stream.read(4096, exception_on_overflow=False)
    if recognizer.AcceptWaveform(data):
        result = json.loads(recognizer.Result())
        print("You:", result["text"])
    else:
        # Optional: print partial results during recognition
        # partial_result = json.loads(recognizer.PartialResult())
        # print("Partial:", partial_result["partial"])
        pass

Listening...
You: hey i'm trying to small also want to make it for with local before on little smartphones your sleep but like it's pretty good it should be you're my voice now i'm you know have options exit minute
You: oh it's python and for ask is the package amusing to be with speech that it knows when i finish it sense middletown homicide at that moment
You: it's not very good it's like it's it's it's
You: first woman like it's far from me
You: cooper
You: it's for the bike is far from me right now and like i'm like i said i'm using the that small version of their mother so it it it's a little limited it's capabilities but at least the all this was about twenty minutes appoint if it's it's a really simple baggage
You: 
You: fruit sugar i mean abusing lama for one thought that was my models local you
You: and it just got a
You: interface and on all and
You: 
You: 
You: oh really
You: 
You: 
You: interesting
You: that's really interesting of i'm trying to develop a small enough think

KeyboardInterrupt: 
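
The transcript above contains several empty `You:` lines, presumably from utterances that finalized with no recognized words. A minimal sketch of filtering those out follows. The JSON strings are hand-written stand-ins shaped like the `{"text": ...}` results parsed above, not real recognizer output, and `final_text` is a hypothetical helper.

```python
import json
from typing import Optional

def final_text(result_json: str) -> Optional[str]:
    """Return the recognized text from a final result, or None when it is empty."""
    text = json.loads(result_json).get("text", "").strip()
    return text or None

# Hand-written stand-ins for recognizer.Result() strings
samples = ['{"text": "oh really"}', '{"text": ""}', '{"text": "interesting"}']
for raw in samples:
    text = final_text(raw)
    if text is not None:
        print("You:", text)
```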

### Voice Synthesis Testing

In [None]:
import torch
from TTS.api import TTS
from datetime import datetime

script = 'Hey fryman, pass me the peanut butter'

# Get device
device = "cuda" if torch.cuda.is_available() else "cpu"

# Init TTS
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2").to(device)

# Text to speech to a file
# datetime.now() is used because date.today() carries no time, so %H%M%S would always be 000000
timestamp = datetime.now().strftime('%Y%m%d%H%M%S')
tts.tts_to_file(text=script, speaker_wav="wav_training/p1.wav", language="en", file_path=f"wav_sample/test_p1_{timestamp}.wav")

 > tts_models/multilingual/multi-dataset/xtts_v2 is already downloaded.



KeyboardInterrupt





### Voice Synthesis Testing - Fine Tuned XTTSv2 Model

In [1]:
# model
import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from IPython.display import Audio, display



In [2]:
# input fs
CONFIG_PATH = "models/tts_model/XTTS/run/training/GPT_XTTS_Carl_20251016_2-October-17-2025_11+28AM-96e7a6a/config.json"
TOKENIZER_PATH = "models/tts_model/XTTS/run/training/XTTS_v2_original_model_files/vocab.json"
XTTS_CHECKPOINT = "models/tts_model/XTTS/run/training/GPT_XTTS_Carl_20251016_2-October-17-2025_11+28AM-96e7a6a/best_model.pth"
SPEAKER_REFERENCE = "C:/Users/monol/wkdir/code/llm_tts/wav/input/source/p1.wav"

In [3]:
# instantiate model - only needs to run once!

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")
print("Loading fine-tuned XTTS model manually...")

try:
    print("Loading model...")
    config = XttsConfig()
    config.load_json(CONFIG_PATH)
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_path=XTTS_CHECKPOINT, vocab_path=TOKENIZER_PATH, use_deepspeed=False)
    model.to(DEVICE)  # use the device selected above rather than forcing CUDA
    
    print("Computing speaker latents...")
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[SPEAKER_REFERENCE])
    print("\n✅ Model and Speaker Latents loaded successfully!")

except Exception as e:
    print(f"\n❌ Model Loading Failed: {e}")

Using device: cuda
Loading fine-tuned XTTS model manually...
Loading model...
Computing speaker latents...

✅ Model and Speaker Latents loaded successfully!


In [5]:
# audio generation

NEW_CARL_TEXT = input()
OUTPUT_WAV_PATH = f"wav/output/test_XTTSv2_20251002/output_{NEW_CARL_TEXT[0:10]}.wav" 

# --- Inference ---
print(f"Generating: '{NEW_CARL_TEXT[:50]}...'")

try:
    # Use the loaded model and pre-calculated latents for fast inference
    out = model.inference(
        NEW_CARL_TEXT,
        "en",
        gpt_cond_latent,
        speaker_embedding,
        temperature=0.9, # Add custom parameters here
    )
    
    torchaudio.save(OUTPUT_WAV_PATH, torch.tensor(out["wav"]).unsqueeze(0), 24000)
    print(f"✅ Audio saved as {OUTPUT_WAV_PATH}")
    display(Audio(out['wav'], rate=24000))

except Exception as e:
    print(f"❌ Inference failed: {e}")

 Listen up Meat man


Generating: 'Listen up Meat man...'
✅ Audio saved as wav/output/test_XTTSv2_20251002/output_Listen up .wav
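
The saved file name above ends with a space before the extension (`output_Listen up .wav`) because the first ten characters of the input are used verbatim. One way to avoid that is sketched here, assuming underscores are an acceptable replacement for anything non-alphanumeric. `slug_for` is a hypothetical helper, not part of the notebook.

```python
import re

def slug_for(text: str, max_len: int = 10) -> str:
    """Build a filesystem-friendly fragment from the first characters of `text`."""
    # Collapse every run of non-alphanumeric characters into a single underscore
    slug = re.sub(r'[^A-Za-z0-9]+', '_', text[:max_len]).strip('_')
    return slug or 'untitled'

print(slug_for('Listen up Meat man'))   # Listen_up
```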


### LLM Instantiation Testing

In [1]:
# start the model via the ollama CLI - probably not needed when the Ollama Windows app is already serving, since chat() loads the model on demand
import os
os.system('ollama run carl_20250927')

0

In [2]:
from ollama import chat
from ollama import ChatResponse

In [3]:
# demo example - https://github.com/ollama/ollama-python

msg = input('Speak to the Carl: ')

response: ChatResponse = chat(model='carl_20250927', messages=[
  {
    'role': 'user',
    'content': msg,
  },
])
print(response['message']['content'])
# or access fields directly from the response object
print(response.message.content)

Speak to the Carl:  Hey Carl, what's up man


Ugh, what's it to you, Fryman? Can't you see I'm tryin' to watch some heavy metal documentaries here? "Master of Puppets" or somethin'. Now you're botherin' me with your stupid questions. What do you want, anyway? Don't tell me you're comin' over again to borrow some cash or something. Just get the *frick* outta my house!
Ugh, what's it to you, Fryman? Can't you see I'm tryin' to watch some heavy metal documentaries here? "Master of Puppets" or somethin'. Now you're botherin' me with your stupid questions. What do you want, anyway? Don't tell me you're comin' over again to borrow some cash or something. Just get the *frick* outta my house!


In [4]:
type(response)

ollama._types.ChatResponse

In [5]:
response

ChatResponse(model='carl_20250927', created_at='2025-10-16T21:23:00.0031668Z', done=True, done_reason='stop', total_duration=3322200000, load_duration=137289100, prompt_eval_count=359, prompt_eval_duration=395771800, eval_count=91, eval_duration=2676663800, message=Message(role='assistant', content='Ugh, what\'s it to you, Fryman? Can\'t you see I\'m tryin\' to watch some heavy metal documentaries here? "Master of Puppets" or somethin\'. Now you\'re botherin\' me with your stupid questions. What do you want, anyway? Don\'t tell me you\'re comin\' over again to borrow some cash or something. Just get the *frick* outta my house!', thinking=None, images=None, tool_name=None, tool_calls=None))
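
The duration fields in the response above are reported in nanoseconds, so they need converting before they are readable. A small sketch using the numbers from this response (`generation_stats` is a made-up helper name):

```python
NS_PER_S = 1_000_000_000

def generation_stats(total_duration: int, load_duration: int,
                     eval_count: int, eval_duration: int) -> dict:
    """Convert nanosecond timings into seconds and a tokens-per-second rate."""
    return {
        'total_s': total_duration / NS_PER_S,
        'load_s': load_duration / NS_PER_S,
        'tokens_per_s': eval_count / (eval_duration / NS_PER_S),
    }

# Values copied from the ChatResponse shown above
stats = generation_stats(total_duration=3322200000, load_duration=137289100,
                         eval_count=91, eval_duration=2676663800)
print({k: round(v, 2) for k, v in stats.items()})
# {'total_s': 3.32, 'load_s': 0.14, 'tokens_per_s': 34.0}
```

At roughly 34 tokens per second, the 91-token reply accounts for most of the 3.3 s round trip.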

In [6]:
response['message']

Message(role='assistant', content='Ugh, what\'s it to you, Fryman? Can\'t you see I\'m tryin\' to watch some heavy metal documentaries here? "Master of Puppets" or somethin\'. Now you\'re botherin\' me with your stupid questions. What do you want, anyway? Don\'t tell me you\'re comin\' over again to borrow some cash or something. Just get the *frick* outta my house!', thinking=None, images=None, tool_name=None, tool_calls=None)

In [7]:
# stop the carl
os.system('ollama stop carl_20250927')

0

### Integrate Steps - Talk 2 Carl

In [1]:
# STT model
from vosk import Model, KaldiRecognizer
import pyaudio
import json

# LLM model
from ollama import chat
from ollama import ChatResponse

# TTS model
import os
import torch
import torchaudio
from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
from IPython.display import Audio, display



In [2]:
# Load the Vosk model (named vosk_model so the XTTS `model` loaded later does not shadow it)
model_path = "models/stt_model/vosk-model-small-en-us-0.15"
# model_path = "models/stt_model/vosk-model-en-us-0.22"
vosk_model = Model(model_path)
# Initialize the recognizer with the model and the sample rate of the incoming audio
recognizer = KaldiRecognizer(vosk_model, 16000) # 16000 Hz must match the rate of the input stream opened below

In [3]:
# input fs - XTTSv2
CONFIG_PATH = "models/tts_model/XTTS/run/training/GPT_XTTS_Carl_20251002-October-16-2025_06+52PM-96e7a6a/config.json"
TOKENIZER_PATH = "models/tts_model/XTTS/run/training/XTTS_v2_original_model_files/vocab.json"
XTTS_CHECKPOINT = "models/tts_model/XTTS/run/training/GPT_XTTS_Carl_20251002-October-16-2025_06+52PM-96e7a6a/best_model.pth"
SPEAKER_REFERENCE = "C:/Users/monol/wkdir/code/llm_tts/wav/input/source/p1.wav"

In [4]:
# instantiate XTTSv2 model

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
print(f"Using device: {DEVICE}")
print("Loading fine-tuned XTTS model manually...")

try:
    print("Loading model...")
    config = XttsConfig()
    config.load_json(CONFIG_PATH)
    model = Xtts.init_from_config(config)
    model.load_checkpoint(config, checkpoint_path=XTTS_CHECKPOINT, vocab_path=TOKENIZER_PATH, use_deepspeed=False)
    model.to(DEVICE)  # use the device selected above rather than forcing CUDA
    
    print("Computing speaker latents...")
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(audio_path=[SPEAKER_REFERENCE])
    print("\n✅ Model and Speaker Latents loaded successfully!")

except Exception as e:
    print(f"\n❌ Model Loading Failed: {e}")

Using device: cuda
Loading fine-tuned XTTS model manually...
Loading model...
Computing speaker latents...

✅ Model and Speaker Latents loaded successfully!


In [5]:
# instantiate ollama
import os
os.system('ollama run carl_20251002')

0

In [6]:
# audio generation and playback function
def speak_response(text):
    global gpt_cond_latent, speaker_embedding
    
    # Generate audio (inference)
    out = model.inference(
        text,
        "en",
        gpt_cond_latent,
        speaker_embedding,
        temperature=0.98, # Add custom parameters here
    )
    
    # Play the audio back (This requires the notebook/Jupyter environment)
    display(Audio(out['wav'], rate=24000, autoplay=True))
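
Carl's replies contain markdown emphasis such as `*sigh*` and `*frick*` and can run several sentences long. A possible preprocessing step before synthesis is sketched below, under the untested assumptions that the asterisks should simply be dropped and that sentence-sized chunks are easier for the TTS model to handle. `prepare_for_tts` is a hypothetical helper.

```python
import re
from typing import List

def prepare_for_tts(text: str) -> List[str]:
    """Strip markdown asterisks and split the text into sentence-sized chunks."""
    cleaned = text.replace('*', '')
    cleaned = re.sub(r'\s+', ' ', cleaned).strip()
    # Split after ., ! or ? (including runs like "...") when followed by whitespace
    parts = re.split(r'(?<=[.!?])\s+', cleaned)
    return [p for p in parts if p]

reply = "*sigh* What's it to you? Just get the *frick* outta my house!"
for chunk in prepare_for_tts(reply):
    print(chunk)
```

The chunks could then be synthesized one at a time and concatenated before playback, rather than played as separate overlapping clips.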

In [None]:
# Open the PyAudio stream for input (listening)
p = pyaudio.PyAudio()
input_stream = p.open(
    format=pyaudio.paInt16,
    channels=1,
    rate=16000,
    input=True,
    frames_per_buffer=4096
)

print("-" * 50)
print("STARTING ROBOCARL. Speak after the 'Listening...' prompt.")
print("Say 'OK Bye Carl' or interrupt the cell to stop.")
print("-" * 50)

try:
    # Clear any previous buffer and start listening
    input_stream.start_stream() 
    
    while True:
        print("\n\nListening...")
        
        # --- STAGE 1: LISTEN (Speech-to-Text) ---
        user_text = ""
        while True:
            # Read audio data from the microphone
            data = input_stream.read(4096, exception_on_overflow=False)
            
            # Process the waveform using Vosk
            if recognizer.AcceptWaveform(data):
                result = json.loads(recognizer.Result())
                if result.get("text"):
                    user_text = result["text"]
                    print(f"You said: {user_text}")
                    break
        
        # Check for exit command
        if "ok bye carl" in user_text.lower(): # Using one of your recognized exit phrases
            print("Exiting conversation.")
            break
            
        if not user_text:
            continue
            
        # --- STAGE 2: THINK (LLM Processing) ---
        print("Carl is thinking...")
        
        # Call the Ollama chat service
        response_obj = chat(
            model='carl_20251002', # Your custom LLM model
            messages=[{'role': 'user', 'content': user_text}]
        )
        carl_response = response_obj['message']['content'] # Extract the string content
        
        print(f"\nCarl says: {carl_response}")

        # --- STAGE 3: SPEAK (Text-to-Speech) ---
        speak_response(carl_response)


except KeyboardInterrupt:
    print("\n\nUser manually interrupted the loop.")
    
finally:
    # Clean up the audio streams and PyAudio resources
    input_stream.stop_stream()
    input_stream.close()
    p.terminate()
    print("Cleanup complete.")

--------------------------------------------------
STARTING ROBOCARL. Speak after the 'Listening...' prompt.
Say 'OK Bye Carl' or interrupt the cell to stop.
--------------------------------------------------


Listening...
You said: hey carl can you hear me
Carl is thinking...

Carl says: What the hell is this? Can't you see I'm tryin' to watch the game here? You're probably one of those butt-nuts from next door, aren't ya? What do you want? This don't matter. None of this matters. Just get off my ass and let me enjoy my beer in peace...




Listening...
You said: what the hell is this
Carl is thinking...

Carl says: Ugh, what's it to you, Fryman? Can't you see I'm busy trying to watch some wrestling on TV here? What's with all these questions comin' outta nowhere like a swarm of biting insects? Spit it out, man. You ain't gettin' me off the couch without a good reason.




Listening...
You said: did you sell tripod was it to fryman
Carl is thinking...

Carl says: *sigh* What's it to you? Yeah, I sold Tripod to that numb-nut, Frylock. Made a decent buck off the thing too. Don't even get me started on how he's been using it for some of his weird, space-age schemes with Master Shake and Meatwad... Those three are always causin' trouble around here. You'd think I'd get some peace after sellin' that eyesore, but noooo... Now I gotta deal with Frylock and his hippie nonsense all the time. Get off my ass!




Listening...
You said: just say your what's it to you
Carl is thinking...
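
In the transcript above, "what the hell is this" and "just say your what's it to you" closely match Carl's own preceding replies, which suggests the microphone is picking up the playback. `display(Audio(..., autoplay=True))` returns immediately, so the loop resumes listening while the clip is still playing. One possible mitigation is to wait out the clip length before listening again. This is only a sketch under that assumption. Both helpers are hypothetical, and the 24000 Hz default mirrors the rate passed to `Audio` above.

```python
import time
from typing import Callable, Sequence

def playback_seconds(n_samples: int, rate: int = 24000) -> float:
    """Length of a mono clip in seconds."""
    return n_samples / rate

def wait_for_playback(wav: Sequence[float], rate: int = 24000,
                      margin: float = 0.3,
                      sleep: Callable[[float], None] = time.sleep) -> float:
    """Block for the clip length plus a small safety margin; returns the wait time."""
    wait = playback_seconds(len(wav), rate) + margin
    sleep(wait)
    return wait

# 48000 samples at 24 kHz is a two-second clip
print(playback_seconds(48000))   # 2.0
```

In the loop this would mean calling `input_stream.stop_stream()` before `speak_response`, then `wait_for_playback(out['wav'])`, then `input_stream.start_stream()`, so that audio captured during playback is not fed to the recognizer. Browser autoplay latency is not accounted for here, so the margin may need tuning.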
