# 1. Initializing the OpenAI Client

This section sets up the environment and initializes the OpenAI client, which the rest of the notebook uses to call OpenAI's APIs for chat, speech-to-text, and text-to-speech.

In [1]:
import os
from openai import OpenAI

from dotenv import load_dotenv, find_dotenv

# Load environment variables from a .env file
_ = load_dotenv(find_dotenv()) 

# Specify the GPT model to be used
gpt_model_name = "gpt-3.5-turbo-1106"

# Load env variables for Naver Cloud
ncloud_client_id = os.environ['NCLOUD_CLIENT_ID']
ncloud_client_secret = os.environ['NCLOUD_CLIENT_SECRET']

# Initialize the OpenAI client with the API key
client = OpenAI(
    api_key=os.environ['OPENAI_API_KEY'],  # Retrieves API key from environment variables
)

## Key Components:

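- **Environment Variables**: We use `dotenv` to load environment variables from a `.env` file, which keeps sensitive values such as API keys out of the source code. The file should contain your `OPENAI_API_KEY`. The cell also reads `NCLOUD_CLIENT_ID` and `NCLOUD_CLIENT_SECRET` through `os.environ[...]`, so these must be defined too or the cell raises a `KeyError`.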
- **OpenAI Client Initialization**: We create an instance of the `OpenAI` class from the `openai` package, passing the API key from the environment variables. This client will be used to make requests to OpenAI services.

> 💡 **Tip:** Always keep your API keys secure. Never hardcode them into your scripts. Using environment variables as shown here is a best practice.
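
Because the setup cell reads `os.environ[...]` directly, a missing variable only shows up as a bare `KeyError`. The following is a minimal sketch of an up-front check that reports every missing variable at once. `require_env` is a hypothetical helper written for this illustration; it is not part of `dotenv` or `openai`.

```python
import os


def require_env(names, env=None):
    """Return {name: value} for the given variables, or raise listing all missing ones.

    Hypothetical helper for illustration; `env` defaults to the real environment.
    """
    env = os.environ if env is None else env
    # Treat unset and empty values the same way
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: env[name] for name in names}


# Example with an explicit mapping instead of the real environment
settings = require_env(["OPENAI_API_KEY"], env={"OPENAI_API_KEY": "sk-placeholder"})
```

In the notebook, such a check could run right after `load_dotenv(find_dotenv())` and before the client is created.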


# 2. Function to Transcribe Audio to Text

This function, `get_transcript`, takes the path of an audio file and uses OpenAI's Whisper model to transcribe the audio to text.

In [2]:
def get_transcript(file_path):
    # Open the audio file in binary read mode; the with-block closes it afterwards
    with open(file_path, "rb") as audio_file:
        # Use the OpenAI Whisper model to transcribe the audio
        transcript = client.audio.transcriptions.create(
            model="whisper-1",           # Specifies the Whisper model to use
            file=audio_file,             # Passes the audio file to the API
            response_format="text"       # Requests the transcription in text format
        )

    # Return the transcription
    return transcript

## Key Points:

- **Opening the File**: The audio file is opened in binary read mode (`"rb"`) because the API receives the raw bytes of the file.
- **Transcription Request**: The `client.audio.transcriptions.create` method is used to send the audio file to OpenAI's Whisper API for transcription.
- **Model Specification**: Here, `"whisper-1"` is specified as the model. Depending on your needs and OpenAI's offerings, you might use a different model version.
- **Returning the Transcript**: The function returns the transcription result, which can then be used for further processing or displayed in the notebook.

> 💡 **Note:** Make sure the audio file is in a format the Whisper API accepts and is not empty or truncated; otherwise the request may fail or the transcription may be inaccurate.
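
A compatibility check like the one in the note can be done locally before any upload. The sketch below is a hedged example: the extension list and the 25 MB limit reflect OpenAI's published limits as I understand them and may have changed, and `check_audio_file` is a hypothetical helper, not part of the `openai` package.

```python
from pathlib import Path

# Assumptions: taken from OpenAI's documented upload limits; verify against current docs.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}
MAX_BYTES = 25 * 1024 * 1024  # 25 MB


def check_audio_file(file_path):
    """Return a list of problems; an empty list means the file looks acceptable."""
    path = Path(file_path)
    if not path.is_file():
        return [f"{path} does not exist"]
    problems = []
    if path.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported extension {path.suffix!r}")
    size = path.stat().st_size
    if size == 0:
        problems.append("file is empty")
    if size > MAX_BYTES:
        problems.append("file is larger than 25 MB")
    return problems
```

Calling `check_audio_file(file_path)` at the top of `get_transcript` and raising when the list is non-empty would turn a failed API call into a clearer local error.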

# 3. Function to Generate Tutor's Response Using GPT

This function, `get_gpt_response`, generates a response from the AI English tutor based on the student's transcribed speech and the conversation history.

In [3]:
# Predefined prompt that sets the context for the AI's role
system_prompt = """
You are an experienced English tutor who graduated from Harvard University in Boston.
You are talking to a student who wants to practice speaking English. 
Help them practice speaking English by talking to your student and 
try to teach your student how to say what they would like to say.
The answer must be formatted as a JSON object with a single key "response" whose value is your reply.
"""

def get_gpt_response(transcript, history):
    # Format the system message for context setting
    system_message = {
        "role": "system", 
        "content": system_prompt.replace("\n", " ")  # Removes newline characters for formatting
    }
    
    # Prepare the message list combining the system message and conversation history
    message_list = [system_message]
    message_list.extend(history)
    message_list.append({"role": "user", "content": transcript})  # Add the latest user input

    # Get the AI response using the OpenAI Chat Completion API
    response = client.chat.completions.create(
        model=gpt_model_name,  # Specifies the GPT model to use
        response_format={ "type": "json_object" },  # Requests response in JSON format
        messages=message_list  # Provides the context and conversation history
    )
    
    # Return the AI's message content
    return response.choices[0].message.content

## Key Components:

- **System Prompt**: This sets the context for the AI, defining its role as an English tutor. The prompt is crucial as it guides the AI's responses.
- **Function Parameters**: `transcript` is the latest user input (student's speech), and `history` contains previous messages in the conversation.
- **Message Formatting**: The conversation history and new user input are formatted as a list of messages, each with a role (`system`, `user`, or `assistant`) and content.
- **AI Response Generation**: The `client.chat.completions.create` method is used to generate a response from the AI based on the provided context and conversation history.
- **Response Handling**: The function extracts and returns the content of the AI's response. Because `response_format` requests a JSON object, this content is a JSON string rather than plain text.

> **💡 Tip:** The Chat Completions API does not remember earlier turns, so passing the conversation history with every request is what keeps the tutor's replies contextually relevant.
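
The message assembly described above can be isolated into a small pure function, which makes it easy to inspect without calling the API. This is only a sketch: `build_messages` is a hypothetical name, and it collapses all whitespace in the prompt, which differs slightly from the `replace("\n", " ")` used in the cell.

```python
def build_messages(system_prompt, history, transcript):
    """Assemble the chat payload: system message, prior turns, then the new user turn."""
    system_message = {
        "role": "system",
        "content": " ".join(system_prompt.split()),  # collapse newlines and extra spaces
    }
    return [system_message, *history, {"role": "user", "content": transcript}]


example_history = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing well. How about you?"},
]
messages = build_messages("You are an\nEnglish tutor.\n", example_history, "I am fine too.")
for message in messages:
    print(message["role"], "->", message["content"])
```

The resulting list has the same shape as `message_list` in `get_gpt_response` and could be passed as `messages=` to `client.chat.completions.create`.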

# 4. Function to Play AI Tutor's Response Using Text-to-Speech

This function, `play_gpt_response_with_tts`, converts the AI tutor's textual response into speech using Text-to-Speech (TTS) and plays it aloud for the user.

In [4]:
import os
from playsound import playsound

# Path to temporarily store the generated speech file
speech_file_path = "./speech.wav"

def play_gpt_response_with_tts(gpt_response):
    # Generate speech from the GPT response using TTS
    response = client.audio.speech.create(
        model="tts-1",          # Specifies the TTS model to use
        voice="alloy",          # Chooses a specific voice for the TTS
        input=gpt_response,     # The text input to be converted to speech
        response_format="wav"   # Requests WAV audio to match the .wav file name (the default is MP3)
    )

    # Stream the audio to a file
    response.stream_to_file(speech_file_path)

    # Play the generated speech audio
    playsound(speech_file_path)

    # Remove the temporary speech file to clean up
    os.remove(speech_file_path)

## Key Points:

- **TTS Conversion**: The `client.audio.speech.create` method from the OpenAI API is used to convert the AI's textual response into speech. The `tts-1` model and `alloy` voice are specified here, but these can be adjusted based on your preferences.
- **Temporary Audio File Handling**: The generated speech is written to a file named `speech.wav` at the given path, because `playsound` plays audio from a file rather than from memory.
- **Audio Playback**: The `playsound` library plays the audio file, allowing the user to hear the AI's response.
- **Cleanup**: After playing the audio, the temporary file is removed to avoid clutter and manage storage efficiently.

> **💡 Note:** This function bridges the gap between textual AI responses and auditory output, making the interaction more engaging and accessible, especially for auditory learners.
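
One weakness of a fixed temporary path is that the file is left behind if playback raises. The sketch below shows one possible alternative, assuming the audio is available as bytes: it writes to a unique temporary file and deletes it in a `finally` block. `play_then_delete` is a hypothetical helper, and the `play` argument stands in for `playsound`.

```python
import os
import tempfile


def play_then_delete(audio_bytes, play, suffix):
    """Write audio bytes to a unique temporary file, play it, and always delete it.

    `suffix` should match the audio format actually returned (for example ".wav").
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(audio_bytes)
        play(path)  # in the notebook this would be playsound(path)
    finally:
        # Runs whether or not playback succeeded
        if os.path.exists(path):
            os.remove(path)
    return path


# Example with a stand-in player that only records the path it was given
played = []
play_then_delete(b"fake audio", played.append, suffix=".wav")
```

Using `tempfile.mkstemp` also avoids two notebook sessions overwriting each other's speech file.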

# 5. Main Function to Facilitate Conversation with the AI Tutor

The function `talk_to_gpt` orchestrates the process of converting user speech to text, obtaining a response from the AI tutor, and then converting this response back to speech.

In [5]:
import json

# History list to keep track of the conversation
history = []

def talk_to_gpt(file_path):
    # Transcribe user speech to text
    user_transcript = get_transcript(file_path)

    # Get the GPT tutor's response to the user's transcript
    # Uses only the last 10 messages in history for context
    gpt_response = get_gpt_response(user_transcript, history[-10:])
    
    # Parse the JSON-formatted response from the GPT tutor
    gpt_response = json.loads(gpt_response)
    gpt_response = gpt_response['response']
    
    # Update the conversation history with user and assistant messages
    history.extend([
        {"role": "user", "content": user_transcript}, 
        {"role": "assistant", "content": gpt_response}
    ])

    # Print the GPT tutor's response for logging
    print(gpt_response)

    # Play the GPT tutor's response using TTS
    play_gpt_response_with_tts(gpt_response=gpt_response)

## Key Components:

- **Speech-to-Text Conversion**: The `get_transcript` function is used to convert the user's speech (from the audio file at `file_path`) into text.
- **AI Response Generation**: The `get_gpt_response` function generates a response from the AI tutor based on the user's transcript and recent conversation history.
- **JSON Parsing**: The AI tutor's reply arrives as a JSON string; it is parsed and the text stored under its `response` key is extracted.
- **Conversation History Management**: The conversation history is updated with the latest user and assistant (AI tutor) messages. Only the 10 most recent messages are passed as context in subsequent interactions.
- **Printing and TTS Playback**: The AI tutor's response is printed to the console (which can be useful for logging or debugging) and then played aloud using the `play_gpt_response_with_tts` function.

> **💡 Note:** This function is central to the user interaction, seamlessly integrating speech-to-text, AI response generation, and text-to-speech to simulate a natural conversation flow.
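
The JSON parsing step assumes the reply always contains a `response` key. JSON mode is meant to produce syntactically valid JSON, but as far as I know it does not guarantee any particular key, so a defensive parser can be useful. The following is a sketch under that assumption; `extract_reply` is a hypothetical helper.

```python
import json


def extract_reply(raw, key="response"):
    """Pull the tutor's text out of a JSON reply, falling back gracefully."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return raw.strip()  # not JSON at all: use the raw text
    if isinstance(data, dict):
        value = data.get(key)
        if isinstance(value, str):
            return value
        # Expected key missing: fall back to the first string value, if any
        for candidate in data.values():
            if isinstance(candidate, str):
                return candidate
    return raw.strip()
```

In `talk_to_gpt`, the two lines that call `json.loads` and index `['response']` could be replaced by a single call to such a helper.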

# 6. Audio Recording Class for User Input

The `AudioRecorder` class encapsulates the functionality needed to record audio from the user, which can then be processed for speech-to-text conversion.

In [6]:
import threading
import sounddevice as sd
import numpy as np
import wavio

class AudioRecorder:
    def __init__(self):
        self.is_recording = False      # Flag to control recording state
        self.audio_data = []           # List to store audio frames
        self.fs = 44100                # Sample rate (in Hz)
        self.channels = 1              # Number of audio channels
        self.thread = None             # Background thread that runs record()

    def start_recording(self):
        self.is_recording = True
        self.audio_data = []
        # Start recording in a separate thread
        self.thread = threading.Thread(target=self.record)
        self.thread.start()

    def stop_recording(self):
        self.is_recording = False      # Signal the recording loop to stop
        if self.thread is not None:
            self.thread.join()         # Wait for the last chunk so save() sees all data
            self.thread = None

    def record(self):
        # Set up the audio input stream
        with sd.InputStream(samplerate=self.fs, channels=self.channels) as stream:
            while self.is_recording:
                data, _ = stream.read(1024)  # Read audio data from the input stream
                self.audio_data.append(data)  # Append data to the audio_data list

    def save(self, filename='output.wav'):
        # Save the recorded audio to a file
        if self.audio_data:
            wav_data = np.concatenate(self.audio_data, axis=0)  # Concatenate all audio frames
            wavio.write(filename, wav_data, self.fs, sampwidth=2)  # Write to WAV file
            print("Recording saved to", filename)
            return filename
        else:
            print("No recording data to save.")


## Key Features:

- **Initialization**: Sets up initial variables like sample rate, channels, and recording state.
- **Start and Stop Recording**: Methods to control the start and stop of audio recording.
- **Multithreading for Recording**: Uses a separate thread to handle audio input, ensuring the main program remains responsive.
- **Audio Data Collection**: Continuously reads audio data from the microphone and stores it in a list.
- **Saving the Recording**: Concatenates the recorded audio frames and saves them as a WAV file. This file can then be used for further processing like speech-to-text.

> **💡 Note:** This class provides a foundational audio input mechanism, crucial for capturing the user's speech in real-time.
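
The start/stop pattern behind the recorder can be tried without a microphone. The sketch below is a stand-alone illustration using `threading.Event` and `join`; the `Worker` class and `fake_read` function are hypothetical stand-ins for the recorder and for `stream.read`, not part of `sounddevice`.

```python
import threading
import time


class Worker:
    """Background loop controlled by a threading.Event (stand-in for a recorder)."""

    def __init__(self, read_chunk):
        self._read_chunk = read_chunk      # callable returning one chunk of data
        self._stop = threading.Event()
        self._thread = None
        self.chunks = []

    def start(self):
        self._stop.clear()
        self.chunks = []
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def _run(self):
        while not self._stop.is_set():
            self.chunks.append(self._read_chunk())

    def stop(self, timeout=2.0):
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout)     # wait so nothing is appended after stop() returns


def fake_read():
    time.sleep(0.01)                       # simulate a blocking read from the device
    return b"\x00" * 4


worker = Worker(fake_read)
worker.start()
while not worker.chunks:                   # wait until at least one chunk has arrived
    time.sleep(0.01)
worker.stop()
count_after_stop = len(worker.chunks)
```

Joining the thread in `stop` is the important part: it means the collected data is complete by the time it is saved.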

# 7. Interactive Interface for Audio Recording and Processing

This section creates an interactive interface with `ipywidgets` to control the audio recording and start a conversation with the AI tutor.

In [7]:
import ipywidgets as widgets
from IPython.display import display

# Initialize the audio recorder
recorder = AudioRecorder()

# Create buttons for starting and stopping the recording
start_button = widgets.Button(description="Start Recording")
stop_button = widgets.Button(description="Stop Recording")

def on_start_clicked(b):
    # Function to handle start button click
    recorder.start_recording()  # Start recording audio
    print("Recording started...")

def on_stop_clicked(b):
    # Function to handle stop button click
    print("Recording stopped and saved.")
    recorder.stop_recording()  # Stop recording audio
    file_name = recorder.save()  # Save the recorded audio to a file
    if file_name is None:
        return  # Nothing was recorded, so there is nothing to process
    try:
        talk_to_gpt(file_name)  # Process the audio file through the AI tutor
    finally:
        os.remove(file_name)  # Remove the temporary audio file even if processing fails

# Assign the click event handlers to the buttons
start_button.on_click(on_start_clicked)
stop_button.on_click(on_stop_clicked)

# Display the buttons in the Jupyter Notebook interface
display(start_button, stop_button)

Recording started...
Recording stopped and saved.
Recording saved to output.wav
Hello! I'm doing well, thank you. How about you?


## Key Components:

- **Button Widgets**: Two buttons are created using `ipywidgets` for starting and stopping the audio recording.
- **Event Handlers**: Functions `on_start_clicked` and `on_stop_clicked` are defined to handle the respective button clicks.
    - `on_start_clicked` starts the audio recording.
    - `on_stop_clicked` stops the recording, saves the audio, processes it through the AI tutor (`talk_to_gpt`), and then cleans up the temporary file.
- **Display Widgets**: The `display` function from `IPython.display` is used to render the buttons in the Jupyter Notebook.

> **💡 Note:** This interactive setup allows users to easily control the recording process and seamlessly initiate interaction with the AI tutor, enhancing the user experience in the Jupyter Notebook.