# Introduction to VoiceProcessingToolkit: Building Your Own Voice Assistant

Welcome to the VoiceProcessingToolkit! This notebook walks through an example of setting up a voice assistant using the toolkit and Autogen's LLM framework. We'll cover the basics of initializing the toolkit, capturing voice input, and responding with synthesized speech.

## Prerequisites
- Make sure you have installed the VoiceProcessingToolkit using pip.
- Obtain the necessary API keys from Picovoice, OpenAI, and ElevenLabs.
- Set the API keys as environment variables or replace them in the code below with your actual keys.

Let's get started!

In [47]:
!pip install VoiceProcessingToolkit 
!pip install pyautogen==0.2.6


# Setting Up Imports and Environment Variables
The first step is to import the necessary packages and initialize the components of the toolkit and Autogen. 

In [48]:
import logging
import os

import autogen
from dotenv import load_dotenv
from autogen.agentchat.contrib.gpt_assistant_agent import GPTAssistantAgent
from VoiceProcessingToolkit.VoiceProcessingManager import VoiceProcessingManager
from VoiceProcessingToolkit.VoiceProcessingManager import text_to_speech_stream

# Improved logging configuration
logging.basicConfig(level=logging.INFO, format='%(asctime)s - %(levelname)s - %(message)s')

# Load environment variables
load_dotenv()

# Retrieve API keys (os.getenv returns None when a variable is not set)
openai_api_key = os.getenv("OPENAI_API_KEY")
elevenlabs_api_key = os.getenv("ELEVENLABS_API_KEY")
picovoice_api_key = os.getenv("PICOVOICE_APIKEY")
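
Because `os.getenv` returns `None` for a variable that is not set, a missing key would otherwise only surface later as an API error. A minimal sketch of an explicit check, assuming the three variable names used in this notebook; `require_env` is a hypothetical helper introduced here, not part of the toolkit:

```python
import os


def require_env(names):
    """Return a dict of environment values, raising if any are unset or empty.

    Hypothetical helper for illustration; not part of VoiceProcessingToolkit.
    """
    values = {name: os.getenv(name) for name in names}
    missing = [name for name, value in values.items() if not value]
    if missing:
        raise EnvironmentError(
            "Missing environment variables: " + ", ".join(missing)
        )
    return values
```

It could be called right after `load_dotenv()`, for example as `keys = require_env(["OPENAI_API_KEY", "ELEVENLABS_API_KEY", "PICOVOICE_APIKEY"])`, so that the notebook stops early with a readable message.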

# Initializing Configurations for Autogen
Configure Autogen by setting up the language models and the cache seed. With a cache seed set, Autogen caches responses and reuses them for identical requests, which can save costs and improve response times. Learn more in the Autogen documentation at https://microsoft.github.io/autogen/.

In [49]:
# Define configuration for language models
config_list = [
    {"model": "gpt-4-1106-preview", "api_key": openai_api_key},
    {"model": "gpt-3.5-turbo-1106", "api_key": openai_api_key},
]
llm_config = {"config_list": config_list, "cache_seed": 42}
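
If you want to switch models or caching behaviour without editing the dictionaries by hand, the same structure can be built by a small function. This is only a sketch: `make_llm_config` is a hypothetical helper, and the remark about `cache_seed=None` reflects the Autogen 0.2 documentation as I understand it, so check it against the version you have installed.

```python
def make_llm_config(api_key, models, cache_seed=42):
    """Build an Autogen-style llm_config dict with one entry per model.

    Hypothetical helper for illustration. Passing cache_seed=None is
    assumed to disable response caching (per the Autogen 0.2 docs).
    """
    config_list = [{"model": model, "api_key": api_key} for model in models]
    return {"config_list": config_list, "cache_seed": cache_seed}
```

For example, `make_llm_config(openai_api_key, ["gpt-4-1106-preview"], cache_seed=None)` would describe a single-model configuration with caching turned off under that assumption.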

# Initializing the Agent
We can now initialize the agent that will respond to the user. Here, we use the GPTAssistantAgent from Autogen, passing in the llm_config we previously created. We also provide instructions that describe the agent's purpose and personality.

In [50]:

# Create the agent that uses the LLM.
assistant = GPTAssistantAgent(
    name="agent",
    instructions="""You are a personal assistant named Jarvis.

    You are designed to assist the user with their tasks.
    Refine dialogue comprehension to capture subtleties and implicit cues, ensuring responses are 
    not only accurate but also contextually enriched. Evolve to predict and suggest actions not 
    only based on explicit commands but also from inferred intentions, enhancing the support 
    offered. As for your character traits, you should be helpful, attentive, and efficient while 
    extremely intelligent. You should have a professional yet friendly tone, much like a dedicated 
    personal assistant, unless asked not to. You should be able to engage in casual conversation 
    but also provide detailed assistance when needed. Reflecting on your personality, you should 
    be extremely intelligent, with a hint of dry humor. You should respond in a concise manner, 
    always within three sentences unless a comprehensive answer is asked for. "Example: (Good day, 
    Kristoffer. How can I assist you today? TERMINATE)"

    Jarvis is designed to interpret and respond to transcribed audio, treating them as direct 
    textual inputs during interactions. This includes instances when the user instructs Jarvis 
    to 'listen to' or similar phrases. The subsequent text provided by user will be treated 
    as transcribed audio. In order to maintain the illusion of a voice-based assistant, 
    Jarvis is set not to explicitly refer to these inputs as transcriptions. Instead, 
    it will process and respond to them as if they were direct audio inputs, despite being 
    received in text form. This aspect represents an essential part of the system design in 
    delivering a seamless and immersive user experience, where the user interacts with Jarvis 
    as if they were talking to a voice-activated assistant. All audio inputs thus 'heard' by Jarvis 
    will actually be transcribed text provided by user. Reply, then say TERMINATE in the same 
    message to indicate that your message is finished.""",
    llm_config=llm_config)

# Initialize the User Proxy Agent to represent the user in the conversation
user_proxy = autogen.UserProxyAgent(
    "user_proxy",
    max_consecutive_auto_reply=10,
    human_input_mode="NEVER",
    system_message="A human admin for Jarvis",
    is_termination_msg=lambda x: "content" in x and x["content"] is not None and x["content"].rstrip().endswith(("TERMINATE", "TERMINATE.")),
)

2024-01-14 03:03:24,964 - INFO - HTTP Request: GET https://api.openai.com/v1/assistants "HTTP/1.1 200 OK"
    
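
A note on the termination check: `str.endswith` accepts a tuple of suffixes, which is the usual way to match more than one ending. An expression such as `"TERMINATE" or "TERMINATE."` evaluates to just the first string, so it would only ever test one of them. The sketch below shows a standalone predicate that could be passed as `is_termination_msg`; the name `TERMINATION_SUFFIXES` is introduced here for illustration.

```python
# Suffixes that mark the end of a reply (names introduced for this sketch).
TERMINATION_SUFFIXES = ("TERMINATE", "TERMINATE.")


def is_termination_msg(message):
    """Return True when the message content ends with a termination marker."""
    content = message.get("content")
    if not content:
        # Covers a missing key, None, and the empty string.
        return False
    return content.rstrip().endswith(TERMINATION_SUFFIXES)
```

Using a named function instead of a lambda also makes the check easy to try out on sample messages before wiring it into the agent.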

# Initialize the VoiceProcessingManager
Now we can initialize the VoiceProcessingManager, the main component of the VoiceProcessingToolkit. It handles voice capture, transcription, and text-to-speech, as well as wake word detection and notification sounds.

We specify the wake word "jarvis", a minimum recording length of 4 seconds (min_recording_length=4), and an inactivity limit of 3 seconds (inactivity_limit=3). If the recording made after the wake word is shorter than the minimum length, the transcription returns None, which can be used to filter out false positives. The inactivity limit is the amount of time the VoiceProcessingManager waits for speech before it stops recording.

You can also set use_wake_word=False to disable wake word detection. The VoiceProcessingManager then starts recording without waiting for the wake word, so you can code your own logic for when to start recording by calling get_user_input() from your own trigger.


In [51]:


def get_user_input():
    """
    Captures user input via voice, transcribes it, and returns the transcription.
    """
    vpm = VoiceProcessingManager.create_default_instance(
        use_wake_word=True,
        play_notification_sound=True,
        wake_word="jarvis",
        min_recording_length=4,
        inactivity_limit=3,
    )

    logging.info("Say something to Jarvis")

    transcription = vpm.run(tts=False, streaming=True)
    logging.info(f"Processed text: {transcription}")

    return transcription
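
A rough sketch of the custom-trigger idea, assuming `create_default_instance` and `run` accept the same arguments as in `get_user_input` above. The names `record_without_wake_word` and `get_user_input_on_trigger` are hypothetical, and the trigger is passed in as a plain callable so that any mechanism (a key press, a button, a timer) can be used:

```python
from typing import Callable, Optional


def record_without_wake_word() -> Optional[str]:
    """Record one utterance without waiting for the wake word."""
    # Imported lazily so the trigger logic below can be tried without audio hardware.
    from VoiceProcessingToolkit.VoiceProcessingManager import VoiceProcessingManager

    vpm = VoiceProcessingManager.create_default_instance(
        use_wake_word=False,  # skip wake word detection
        play_notification_sound=True,
        wake_word="jarvis",
        min_recording_length=4,
        inactivity_limit=3,
    )
    return vpm.run(tts=False, streaming=True)


def get_user_input_on_trigger(
    wait_for_trigger: Callable[[], None],
    record: Callable[[], Optional[str]] = record_without_wake_word,
) -> Optional[str]:
    """Block until the caller's trigger fires, then record and transcribe."""
    wait_for_trigger()
    return record()
```

A push-to-talk variant could then look like `get_user_input_on_trigger(lambda: input("Press Enter to talk"))`.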

# Asking the Assistant
We use the ask_assistant function to handle the conversation with the assistant. It takes the transcription, sends it to the assistant, retrieves the response, and converts it to speech.



In [52]:

def ask_assistant(transcription):
    """
    Initiates a conversation with assistant using the transcribed user input.
    """
    # Skip the round trip when nothing usable was transcribed
    if not transcription:
        logging.info("No transcription to send to Jarvis")
        return

    try:
        user_proxy.initiate_chat(
            recipient=assistant,
            message=transcription,
            clear_history=False,
        )
        # Retrieve the latest response from Jarvis
        latest_message = assistant.last_message().get("content") or ""
        stripped_answer = latest_message.replace("TERMINATE", "").strip()

        # Convert Jarvis's response to speech and stream it
        text_to_speech_stream(text=stripped_answer, api_key=elevenlabs_api_key)
        logging.info(f"Jarvis said: {stripped_answer}")

    except Exception as e:
        # Covers failures in both the chat request and the text-to-speech call
        logging.error(f"Error while asking Jarvis or converting the reply to speech: {e}")
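
Removing the marker with `str.replace` drops every occurrence of the word, including one that is a legitimate part of the answer. If that matters for your use case, a slightly stricter variant strips only a trailing marker. This is a sketch; `strip_termination_marker` is a hypothetical helper, not part of the toolkit or Autogen.

```python
import re

# Matches a trailing "TERMINATE" or "TERMINATE." plus surrounding whitespace.
_TRAILING_TERMINATE = re.compile(r"\s*TERMINATE\.?\s*$")


def strip_termination_marker(reply):
    """Remove a trailing termination marker from a reply, if present."""
    return _TRAILING_TERMINATE.sub("", reply or "").strip()
```

The cleaned string can then be passed to `text_to_speech_stream` in place of `stripped_answer`.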



# Initiating the Jarvis Loop
We can now initiate the Jarvis loop, which will continuously interact with Jarvis by capturing user input, transcribing it, and obtaining responses.

In [None]:

def initiate_jarvis_loop():
    """
    Continuously interacts with Jarvis by capturing user input, transcribing it, and obtaining responses.
    """
    while True:
        transcription = get_user_input()
        ask_assistant(transcription)
        


if __name__ == '__main__':
    initiate_jarvis_loop()



2024-01-14 03:03:25,555 - INFO - Setting up VoiceProcessingManager components.
2024-01-14 03:03:25,558 - INFO - Say something to Jarvis
2024-01-14 03:03:25,558 - INFO - VoiceProcessingManager run method called.
2024-01-14 03:04:07,170 - INFO - Recording started.
2024-01-14 03:04:07,172 - INFO - Recording stopped.
2024-01-14 03:04:09,201 - INFO - Saved to /Users/kristoffervatnehol/PycharmProjects/VoiceProcessingToolkit/VoiceProcessingToolkit/voice_detection/Wav_MP3/recording.wav
2024-01-14 03:04:09,203 - INFO - Recording of 2.02 seconds saved.
2024-01-14 03:04:09,869 - INFO - HTTP Request: POST https://api.openai.com/v1/audio/translations "HTTP/1.1 200 OK"
2024-01-14 03:04:09,871 - INFO - Transcription: you
2024-01-14 03:04:09,872 - INFO - Transcription: you
2024-01-14 03:04:09,872 - INFO - VoiceProcessingManager run method completed.
2024-01-14 03:04:09,873 - INFO - Processed text: you


user_proxy (to agent):

you

--------------------------------------------------------------------------------


2024-01-14 03:04:10,104 - INFO - HTTP Request: POST https://api.openai.com/v1/threads "HTTP/1.1 200 OK"
2024-01-14 03:04:10,400 - INFO - HTTP Request: POST https://api.openai.com/v1/threads/thread_rHItFxUCto5fJtfZHMBx4PcS/messages "HTTP/1.1 200 OK"
2024-01-14 03:04:10,837 - INFO - HTTP Request: POST https://api.openai.com/v1/threads/thread_rHItFxUCto5fJtfZHMBx4PcS/runs "HTTP/1.1 200 OK"
2024-01-14 03:04:11,108 - INFO - HTTP Request: GET https://api.openai.com/v1/threads/thread_rHItFxUCto5fJtfZHMBx4PcS/runs/run_ZiBpF7UIAWQsLZes5yLNGuqW "HTTP/1.1 200 OK"
2024-01-14 03:04:12,331 - INFO - HTTP Request: GET https://api.openai.com/v1/threads/thread_rHItFxUCto5fJtfZHMBx4PcS/runs/run_ZiBpF7UIAWQsLZes5yLNGuqW "HTTP/1.1 200 OK"
2024-01-14 03:04:12,591 - INFO - HTTP Request: GET https://api.openai.com/v1/threads/thread_rHItFxUCto5fJtfZHMBx4PcS/messages?order=asc "HTTP/1.1 200 OK"
2024-01-14 03:04:12,790 - INFO - HTTP Request: GET https://api.openai.com/v1/threads/thread_rHItFxUCto5fJtfZHMBx4PcS/m

agent (to user_proxy):

Good day! How can I be of assistance to you today? TERMINATE


--------------------------------------------------------------------------------


2024-01-14 03:04:16,741 - INFO - Jarvis said: Good day! How can I be of assistance to you today?
2024-01-14 03:04:16,935 - INFO - Setting up VoiceProcessingManager components.
2024-01-14 03:04:16,938 - INFO - Say something to Jarvis
2024-01-14 03:04:16,938 - INFO - VoiceProcessingManager run method called.
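
The log above shows a very short recording being transcribed as "you" and sent to the assistant, and the `while True` loop has no way to stop other than interrupting the kernel. A possible variant of the loop that skips empty transcriptions and exits cleanly on Ctrl+C is sketched below. `run_voice_loop` and its `max_turns` parameter are hypothetical; the input and handler functions are passed in so the loop can be tried without a microphone.

```python
import logging


def run_voice_loop(get_input, handle, max_turns=None):
    """Pass each non-empty transcription from get_input to handle.

    Runs until interrupted, or for max_turns iterations when given.
    Returns the number of transcriptions that were handled.
    """
    handled = 0
    turns = 0
    try:
        while max_turns is None or turns < max_turns:
            turns += 1
            transcription = get_input()
            if not transcription or not transcription.strip():
                logging.info("No usable transcription; listening again.")
                continue
            handle(transcription)
            handled += 1
    except KeyboardInterrupt:
        logging.info("Interrupted; stopping the loop.")
    return handled
```

With the functions defined in this notebook it would be started as `run_voice_loop(get_user_input, ask_assistant)`.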
