# Lesson 5 Project: Building a Multimodal AI App

## Introduction

Welcome to Lesson 5. In this lesson, you'll build a multimodal AI application: a language tutor app that combines text, image, and audio processing so learners can practice conversations interactively.

By the end of this lesson, you will be able to:
* Integrate text, image, and audio processing in a single application
* Implement a user interface for multimodal interactions
* Evaluate the effectiveness of multimodal integration in enhancing user experience

Let's dive in and start building a language tutor app!

## Setting Up OpenAI Development Environment

Refer to the Python Crash Course lesson to learn how to set up your OpenAI development environment.

In [None]:
# Install the libraries, including Gradio
!pip install openai requests python-dotenv matplotlib librosa ipyaudioworklet gradio Pillow

# Load the OpenAI library
from openai import OpenAI

# Set up relevant environment variables
# Make sure OPENAI_API_KEY=... exists in .env
from dotenv import load_dotenv

load_dotenv()

# Create the OpenAI connection object
client = OpenAI()

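## Using Gradio

Gradio builds a web UI around a Python function: you pass `gr.Interface` the function, its input components, and its output components. The cells below show four small examples: a greeting, a function with several inputs and outputs, an image filter, and a calculator with example inputs.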

In [None]:
import gradio as gr

def greet(name, intensity):
    return "Hello, " + name + "!" * int(intensity)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "slider"],
    outputs=["text"],
)

demo.launch()

In [None]:
def greet(name, is_morning, temperature):
    salutation = "Good morning" if is_morning else "Good evening"
    greeting = f"{salutation} {name}. It is {temperature} degrees today"
    celsius = (temperature - 32) * 5 / 9
    return greeting, round(celsius, 2)

demo = gr.Interface(
    fn=greet,
    inputs=["text", "checkbox", gr.Slider(0, 100)],
    outputs=["text", "number"],
)
demo.launch()

In [None]:
import numpy as np
import gradio as gr

def sepia(input_img):
    sepia_filter = np.array([
        [0.393, 0.769, 0.189],
        [0.349, 0.686, 0.168],
        [0.272, 0.534, 0.131]
    ])
    sepia_img = input_img.dot(sepia_filter.T)
    sepia_img /= sepia_img.max()
    return sepia_img

demo = gr.Interface(sepia, gr.Image(), "image")
demo.launch()

In [None]:
def calculator(num1, operation, num2):
    if operation == "add":
        return num1 + num2
    elif operation == "subtract":
        return num1 - num2
    elif operation == "multiply":
        return num1 * num2
    elif operation == "divide":
        if num2 == 0:
            raise gr.Error("Cannot divide by zero!")
        return num1 / num2

demo = gr.Interface(
    calculator,
    [
        "number",
        gr.Radio(["add", "subtract", "multiply", "divide"]),
        "number"
    ],
    "number",
    examples=[
        [45, "add", 3],
        [3.14, "divide", 2],
        [144, "multiply", 2.5],
        [0, "subtract", 1.2],
    ],
    title="Toy Calculator",
    description="Here's a sample toy calculator.",
)

demo.launch()

## Generating Situational Prompts and Images

Let's create functions that generate a situational prompt and a matching image using OpenAI's GPT-4o and DALL-E 3 models. For example, the results could be:
- The situational prompt "A person is ordering a cafe latte in a coffee shop" (generated by GPT-4o)
- An image of a person ordering a cafe latte in a coffee shop (generated by DALL-E 3)

In [None]:
def generate_situational_prompt(seed_prompt=""):
    additional_prompt = """
    Then create an initial response to the person. If the situation is "ordering coffee in a cafe.", then
        the initial response will be, "Hello, what would you like to order?".
        Separate the initial situation and the initial response with a line containing "====". Something like:
        "You're ordering coffee in a cafe.
        ====
        'Hello, there. What would you like to order?'"
        Limit the output to 1 sentence.
    """
    if seed_prompt:
        seed_phrase = f"""Generate a second-person POV situation for practicing English with this seed prompt: {seed_prompt}.
        {additional_prompt}"""
    else:
        seed_phrase = f"""Generate a second-person POV situation for practicing English, like meeting your parents-in-law, etc.
        {additional_prompt}"""
    # Use GPT to generate a situation for practicing English
    response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are a creative writer. Very very creative."},
        {"role": "user", "content": seed_phrase}
      ]
    )
    # Extract and return the situation and the initial response
    message = response.choices[0].message.content
    return message

In [None]:
generate_situational_prompt()

In [None]:
import requests
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt

def generate_situation_image(dalle_prompt):
    response = client.images.generate(
      model="dall-e-3",
      prompt=dalle_prompt,
      size="1024x1024",
      n=1,
    )
    image_url = response.data[0].url
    response = requests.get(image_url)

    img = Image.open(BytesIO(response.content))

    return img, image_url

In [None]:
def display_image(img):
    plt.imshow(img)
    plt.axis('off')
    plt.show()

In [None]:
full_response = generate_situational_prompt("cafe")
initial_situation_prompt = full_response.split('====')[0].strip()
print(initial_situation_prompt)
# generate_situation_image returns (image, url), so unpack both values
img, image_url = generate_situation_image(initial_situation_prompt)
display_image(img)
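
The cells in this lesson split the model output on `====` and index the parts directly, which raises an `IndexError` on the second part whenever the model omits the separator. A minimal sketch of a more defensive split is shown below. `split_situation_and_response` is a hypothetical helper, not part of the lesson's code, and it assumes the model wraps the reply in quotes as in the prompt's example.

```python
def split_situation_and_response(full_response, separator="===="):
    """Split model output into (situation, initial_response).

    Hypothetical helper: returns an empty response when the separator
    is missing instead of raising IndexError.
    """
    situation, found, response = full_response.partition(separator)
    if not found:
        return full_response.strip(), ""
    # Drop whitespace and any quotes the model wrapped around the reply
    return situation.strip(), response.strip().strip("'\"").strip()


sample = "You're ordering coffee in a cafe.\n====\n'Hello, there. What would you like to order?'"
situation, reply = split_situation_and_response(sample)
print(situation)
print(reply)
```

With a helper like this, the situation text can go to the image model and the reply to the speech model without either call depending on the exact shape of the output.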

## Implementing Speech Recognition and Speech Synthesis

In this section, you'll generate a text prompt with OpenAI's GPT-4o model, such as "Welcome to Cute Cafe. What do you want to order?" You'll then send that text to OpenAI's TTS API to synthesize speech and play it to the user.

Next, you'll wait for the user's audio input, such as "I want to order a cafe latte." You'll send this audio file to OpenAI's Whisper API for transcription, which gives you the user's reply as text.

In [None]:
import librosa
from IPython.display import Audio, display

def play_speech(file_path):
    # Load the audio file using librosa
    y, sr = librosa.load(file_path)

    # Create an Audio object for playback
    audio = Audio(data=y, rate=sr, autoplay=True)

    # Display the audio player
    display(audio)

In [None]:
def speak_prompt(speech_prompt, autoplay=True, speech_file_path="speech.mp3"):
    # Generate speech from the given text prompt using TTS
    response = client.audio.speech.create(
      model="tts-1",
      voice="alloy",
      input=speech_prompt
    )

    # Save the synthesized speech to the specified path
    response.stream_to_file(speech_file_path)

    if autoplay:
        # Play the synthesized speech
        play_speech(speech_file_path)

In [None]:
initial_response = full_response.split('====')[1].strip()
speak_prompt(initial_response)

In [None]:
import ipyaudioworklet as ipyaudio
import numpy as np  # needed for the int16 conversion below
import wave

def receive_audio_input(speech_filename="my_speech.wav"):
    # Create an audio recorder object
    recorder = ipyaudio.AudioRecorder(filename=speech_filename)
    return recorder

def save_recorded_audio_input(recorder):
    # Save the recorded audio to a file
    _x = (recorder.audiodata * 32767.5).astype(dtype=np.int16)
    with wave.open(recorder.filename, mode='wb') as wb:
         wb.setnchannels(1)
         wb.setsampwidth(_x.itemsize)
         wb.setframerate(recorder.sampleRate)
         wb.writeframes(_x.tobytes())
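
The conversion in `save_recorded_audio_input` assumes every recorded sample lies in the range -1.0 to 1.0; a sample outside that range would not fit in a 16-bit integer. The sketch below shows the same float-to-PCM idea with explicit clipping, using only the standard library instead of NumPy. The names `floats_to_pcm16` and `write_wav` are hypothetical, and scaling by 32767 is a choice made for this sketch, not the notebook's exact factor.

```python
import io
import struct
import wave


def floats_to_pcm16(samples):
    """Convert floats to little-endian 16-bit PCM bytes, clipping to [-1.0, 1.0]."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def write_wav(target, samples, sample_rate):
    """Write mono 16-bit PCM audio to a path or a binary file-like object."""
    with wave.open(target, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit audio
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(floats_to_pcm16(samples))


# Write five samples to an in-memory buffer; 1.5 is clipped to 1.0
buffer = io.BytesIO()
write_wav(buffer, [0.0, 0.5, 1.0, -1.0, 1.5], sample_rate=16000)
```

Writing to an in-memory buffer here is only for demonstration; passing a filename works the same way.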

In [None]:
def transcript_speech(speech_filename="my_speech.wav"):
    # Open the audio file and transcribe it using the Whisper model
    with open(speech_filename, "rb") as audio_file:
        transcription = client.audio.transcriptions.create(
          model="whisper-1", 
          file=audio_file,
          response_format="json",
          language="en"
        )
    # Return the transcribed text
    return transcription.text

In [None]:
recorder = receive_audio_input()
recorder

In [None]:
save_recorded_audio_input(recorder)
transcripted_text = transcript_speech()
print(transcripted_text)

In [None]:
def creating_conversation_history(history, added_response):
    history = f"""{history}
====
'{added_response}'
"""
    return history

In [None]:
history = creating_conversation_history(full_response, transcripted_text)
print(history)

In [None]:
def generate_conversation_from_history(history):
    prompt = """Continue conversation from a person based on this conversation history and end it with '\n====\n'.
    Limit it to max 3 sentences.
    This is the history:"""
    response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are a creative writer. Very very creative."},
        {"role": "user", "content": f"{prompt}\n{history}"}
      ]
    )
    # Extract and return the continued conversation
    message = response.choices[0].message.content
    return message

In [None]:
conversation = generate_conversation_from_history(history)
print(conversation)

In [None]:
combined_history = history + "\n====\n" + conversation
print(combined_history)

In [None]:
dalle_prompt = "Generate a scenery based on this conversation: " + combined_history
# generate_situation_image returns (image, url), so unpack both values
img, image_url = generate_situation_image(dalle_prompt)
display_image(img)

In [None]:
speak_prompt(conversation)

In [None]:
# Create a function to generate a text prompt based on the situational prompt with OpenAI's GPT-4 API.

# Create a function to send this text prompt to OpenAI's TTS API then play the audio file to the user.

# Create a function to receive audio input from a user.
# You can record it in the Jupyter notebook (for example, with ipyaudioworklet or PyAudio).
# Or you can record it separately, then provide the path to the audio file.

# Create a function to send the audio file to OpenAI's Whisper for transcription and get the text

## Building the User Interface with Gradio

Now, let's create your multimodal language tutor app using Gradio. When the app launches, it shows the user an image, such as an image of a cafe. It then plays an audio clip, such as "Welcome to Cute Cafe. What would you like to order?"

The app provides an interface to record the user's speech, for example, "I would like to have a cup of cafe latte." As an alternative, you can also provide an interface to upload an audio file.

The image then changes to a new one, such as an image of a cafe latte, and the app generates and plays a follow-up, "Would you like anything else, such as a croissant?"

The user can answer, "No, but what is the wifi password?" The image then changes again, for example to a wifi router or a note displaying the wifi password, and so on.

The user can keep using the app for as long as they like. There is a button to quit the app.
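
The flow described above is a loop over conversation turns, and the state it needs is small: the opening situation plus the lines spoken so far. As a sketch, that state could live in one object rather than in module-level globals. `ConversationState` and its methods are hypothetical names introduced for illustration, and the `====` separator follows the convention used in the earlier prompts.

```python
from dataclasses import dataclass, field

SEPARATOR = "\n====\n"


@dataclass
class ConversationState:
    """Hypothetical container for one tutoring session."""
    situation: str
    turns: list = field(default_factory=list)

    def add_turn(self, text):
        # Store each utterance without surrounding whitespace
        self.turns.append(text.strip())

    def as_history(self):
        # Rebuild the "====" separated transcript used in the prompts
        return SEPARATOR.join([self.situation] + [f"'{t}'" for t in self.turns])

    def image_prompt(self):
        # Combine the opening scene with the most recent line, if any
        if not self.turns:
            return self.situation
        return f"{self.situation} {self.turns[-1]}"


state = ConversationState("You're ordering coffee in a cafe near the beach.")
state.add_turn("Hello, what would you like to order?")
state.add_turn("I would like a cafe latte.")
print(state.as_history())
print(state.image_prompt())
```

Keeping the turns in a list makes it easy to build both the text history for the chat model and a short prompt for the image model from the same data.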

In [None]:
initial_situation = generate_situational_prompt("cafe near beach")
# generate_situation_image returns (image, url); only the image is needed here
img, _ = generate_situation_image(initial_situation)

first_time = True
combined_history = ""

def extract_first_last(text):
    elements = [elem.strip() for elem in text.split('====') if elem.strip()]

    if len(elements) >= 2:
        # Join the opening situation and the latest line with a space
        return elements[0] + " " + elements[-1]
    elif len(elements) == 1:
        return elements[0]
    else:
        return ""

def conversation_generation(audio_path):
    global combined_history
    global first_time
    transcripted_text = transcript_speech(audio_path)
    if first_time:
        history = creating_conversation_history(initial_situation, transcripted_text)
        first_time = False
    else:
        history = creating_conversation_history(combined_history, transcripted_text)
    print(history)
    conversation = generate_conversation_from_history(history)
    combined_history = history + "\n====\n" + conversation
    # Use the shortened prompt (first and last parts) to generate the new image
    dalle_prompt = extract_first_last(combined_history)
    img, _ = generate_situation_image(dalle_prompt)
    output_audio_file = "speak_speech.mp3"
    speak_prompt(conversation, False, output_audio_file)

    return img, conversation, output_audio_file

tutor_app = gr.Interface(
    conversation_generation,
    gr.Audio(sources=["microphone"], type="filepath"),
    outputs=[gr.Image(value=img), gr.Text(), gr.Audio(type="filepath")],
    title="Speaking Language Tutor App",
    description=initial_situation
)

tutor_app.launch()

In [None]:
# Create a window
# Display an image in the window
# Play the audio file in the window
# Create an interface to record speech or to upload the audio file
# Create a way to regenerate the image and replay a different speech
# Create a button to quit the app
# Integrate the user interface with the core OpenAI API functions that you created previously

## Evaluation and Reflection

After building and testing your multimodal AI app, consider the following questions:

1. How does the integration of text, image, and audio enhance the language learning experience?
2. What challenges did you face in designing the user interface for multimodal interactions?
3. How might you improve the app to make it more effective or user-friendly?

Take some time to reflect on these questions and discuss your thoughts with your peers or instructor.

You can also try building the multimodal AI app outside a Jupyter notebook: put the app in a Python script so you can run it from the command line or a terminal.
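
One possible starting point for such a script is sketched below. It only covers the command-line entry point. The `--seed` and `--share` options and the function names are assumptions made for illustration, and the place where the Gradio interface would be built is marked with a comment rather than implemented.

```python
import argparse


def build_parser():
    """Build the command-line parser for the (hypothetical) tutor script."""
    parser = argparse.ArgumentParser(description="Speaking language tutor app")
    parser.add_argument("--seed", default="cafe",
                        help="seed prompt for the first situation")
    parser.add_argument("--share", action="store_true",
                        help="ask Gradio for a public share link")
    return parser


def main(argv=None):
    args = build_parser().parse_args(argv)
    # In the real script, build the Gradio interface here (as in the
    # notebook) and call tutor_app.launch(share=args.share).
    print(f"Would start the tutor with seed prompt {args.seed!r} (share={args.share})")
    return args


# Example call with explicit arguments
main(["--seed", "cafe near beach"])
```

In an actual script, the example call at the bottom would be replaced by the usual `if __name__ == "__main__": main()` guard so that the arguments come from the command line.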

## Conclusion

Congratulations! You've successfully built a multimodal AI application that integrates text, image, and audio processing. This language tutor app demonstrates the powerful potential of combining multiple AI technologies to create an immersive and interactive learning experience.

Remember, this is just the beginning. There are many ways to expand and improve this app, such as implementing more sophisticated speech recognition, adding more scenarios, or incorporating user feedback to improve the AI's responses.

Keep exploring and experimenting with multimodal AI applications!