# Lesson 5 Project: Building a Multimodal AI App

## Introduction

Welcome to Lesson 5. In this lesson, you'll build a language tutor app that integrates text, image, and audio processing to provide an immersive, interactive learning experience.

By the end of this lesson, you will be able to:
* Integrate text, image, and audio processing in a single application
* Implement a user interface for multimodal interactions
* Evaluate the effectiveness of multimodal integration in enhancing user experience

Let's dive in and start building a language tutor app!

## Setting Up OpenAI Development Environment

Refer to the Python Crash Course lesson to learn how to set up your OpenAI development environment.

In this lesson, you also need to install the Gradio library.

In [1]:
# Install the libraries, including Gradio
!pip install openai requests python-dotenv matplotlib librosa gradio Pillow

# Load the OpenAI library
from openai import OpenAI

# Set up relevant environment variables
# Make sure OPENAI_API_KEY=... exists in .env
from dotenv import load_dotenv

load_dotenv()

# Create the OpenAI connection object
client = OpenAI()





## Using Gradio

Gradio is an open-source Python package that allows you to quickly build a demo or web application for your machine learning model, API, or any arbitrary Python function. You can then share your demo with a public link in seconds using Gradio's built-in sharing features. No JavaScript, CSS, or web hosting experience needed!

In this section, you will learn how to create Gradio applications in Jupyter Lab. Every time you execute Gradio code in a cell, it launches a Gradio app on a new port. Start with a simple Gradio app.

In [None]:
# Import the Gradio library
import gradio as gr

# Define a simple function that takes a name and a time of day as inputs
def greet(name, greeting_time):
    return "Good " +  greeting_time + ", " + name + "!"

# Create a Gradio interface for the function
demo = gr.Interface(
    fn=greet,  # The function to wrap a UI around
    inputs=[gr.Text(), gr.Dropdown(["morning", "evening", "night"])],  # Define input components
    outputs=[gr.Text()], # Define output components
)

# Launch the Gradio app
demo.launch()

In this example, you created a simple Gradio interface that takes a name and a time of day (morning, evening, night) as inputs and returns a greeting message. You can see that the number of inputs in the function matches the number of input components, and the number of return values matches the number of output components.

You aren't limited to a single output component. A function can return several values, one for each output component.

In [None]:
# Define a function that returns a greeting message and a hard-coded image URL
def greet(name, greeting_time):
    greeting = "Good " +  greeting_time + ", " + name + "!"
    image_url = "https://upload.wikimedia.org/wikipedia/commons/d/d6/An_Oberoi_Hotel_employee_doing_Namaste%2C_New_Delhi.jpg"
    return (greeting, image_url)

# Create a Gradio interface for the function
demo = gr.Interface(
    fn=greet,
    inputs=[gr.Text(), gr.Dropdown(["morning", "evening", "night"])],
    outputs=[gr.Text(), gr.Image()],
)

# Launch the Gradio app
demo.launch()

In this example, you enhanced the function to return not only a greeting message but also an image URL. Gradio displays the image using the `gr.Image()` component. The number of return values from the function still matches the number of output components.

You can also use an audio input field and an audio output field. For the audio input, you can use the microphone as the source of audio.

In [None]:
# Define a function that returns a greeting message, an image URL, and an audio file path
def greet(name, greeting_time, audio_path):
    greeting = "Good " +  greeting_time + ", " + name + "!"
    image_url = "https://upload.wikimedia.org/wikipedia/commons/d/d6/An_Oberoi_Hotel_employee_doing_Namaste%2C_New_Delhi.jpg"
    return (greeting, image_url, audio_path)

# Create a Gradio interface for the function
demo = gr.Interface(
    fn=greet,
    inputs=[gr.Text(), gr.Dropdown(["morning", "evening", "night"]), gr.Audio(sources=["microphone"], type="filepath")],
    outputs=[gr.Text(), gr.Image(), gr.Audio(type="filepath")],
)

# Launch the Gradio app
demo.launch()

In this example, you added an audio input component with `gr.Audio()`. The `sources` argument lets users record through a microphone, and `type="filepath"` passes the recording to the function as a file path. The function now returns a greeting message, an image URL, and the audio file path.

You can also make your app more informative with the `title` and `description` arguments of `gr.Interface`.

In [None]:
# Define a function that returns a greeting message, an image URL, and an audio file path
def greet(name, greeting_time, audio_path):
    greeting = "Good " +  greeting_time + ", " + name + "!"
    image_url = "https://upload.wikimedia.org/wikipedia/commons/d/d6/An_Oberoi_Hotel_employee_doing_Namaste%2C_New_Delhi.jpg"
    return (greeting, image_url, audio_path)

# Create a Gradio interface for the function with a title and description
demo = gr.Interface(
    fn=greet,
    inputs=[gr.Text(), gr.Dropdown(["morning", "evening", "night"]), gr.Audio(sources=["microphone"], type="filepath")],
    outputs=[gr.Text(), gr.Image(), gr.Audio(type="filepath")],
    title="Greeting App",
    description="This is a billion dollar greeting app."
)

# Launch the Gradio app
demo.launch()

In this final example, you added a title and description to the Gradio interface. These elements help users understand the purpose of the app and provide context for the inputs and outputs.

## Generating Situational Prompts and Images

Before you build the multimodal app's UI, you need functions that generate a situation in which users can practice English, a scenery image of that situation, and an initial response that starts the conversation between the user and the AI.

You start with the function to generate the initial situation.

In [2]:
def generate_situational_prompt(seed_prompt=""):
    # Define additional prompt instructions
    additional_prompt = """
    Then create an initial response to the person. If the situation is "ordering coffee in a cafe.", then
        the initial response will be, "Hello, what would you like to order?".
        Separate the initial situation and the initial response with a line containing "====". Something like:
        "You're ordering coffee in a cafe.
        ====
        'Hello, there. What would you like to order?'"
        Limit the output to 1 sentence.
    """

    # Check if a seed prompt is provided and create the seed phrase accordingly
    if seed_prompt:
        seed_phrase = f"""Generate a second-person POV situation for practicing English with this seed prompt: {seed_prompt}.
        {additional_prompt}"""
    else:
        seed_phrase = f"""Generate a second-person POV situation for practicing English, like meeting your parents-in-law, etc.
        {additional_prompt}"""

    # Use GPT to generate a situation for practicing English
    response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are a creative writer. Very very creative."},
        {"role": "user", "content": seed_phrase}
      ]
    )

    # Extract and return the situation and the initial response from the response
    message = response.choices[0].message.content

    # Return the generated message
    return message

Test the function.

In [3]:
generate_situational_prompt()

'You\'re meeting your parents-in-law for the first time at a family dinner.\n====\n"Hello, it\'s wonderful to finally meet you both; I\'ve heard so much about you."'

Test the function with the seed prompt.

In [4]:
generate_situational_prompt("comics exhibition")

'You are attending a comics exhibition.\n====\n"Welcome to the exhibition! Which artist would you like to learn more about today?"'
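The function returns the situation and the initial response as one string, so later cells split it on the `====` line. A minimal sketch of a helper for that step is shown below. The name `split_situation_response` is hypothetical (it is not part of the lesson's code), and the sketch assumes the model followed the requested layout; if the separator is missing, the whole message is treated as the situation.

```python
def split_situation_response(message, separator="===="):
    """Split model output into (situation, initial_response).

    Hypothetical helper: assumes the "situation ==== response" layout
    requested in the prompt. Without a separator, the response is empty.
    """
    situation, found, response = message.partition(separator)
    # Remove surrounding whitespace and the quotes the model wraps around the reply
    response = response.strip().strip("'\"") if found else ""
    return situation.strip(), response


example = 'You are attending a comics exhibition.\n====\n"Welcome to the exhibition!"'
situation, response = split_situation_response(example)
print(situation)
print(response)
```

Using `str.partition` rather than `split(...)[1]` avoids an `IndexError` when the model omits the separator.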

To enhance the immersive experience of practicing English, you can generate a scenery image that matches the situation. For example, if the situation is "ordering coffee in a cafe," you can generate an image of a cafe. This helps in visualizing the context, making the practice more engaging and realistic.

In [5]:
# Import necessary libraries for image processing and display
import requests
from PIL import Image
from io import BytesIO

def generate_situation_image(dalle_prompt):
    # Generate an image using the DALL-E 3 model with the provided prompt
    response = client.images.generate(
      model="dall-e-3", # Specify the model to use
      prompt=dalle_prompt, # The prompt describing the image to generate
      size="1024x1024", # Specify the size of the generated image
      n=1, # Number of images to generate
    )

    # Retrieve the URL of the generated image
    image_url = response.data[0].url

    # Download the image from the URL
    response = requests.get(image_url)

    # Open the image using PIL
    img = Image.open(BytesIO(response.content))

    # Return the image object
    return img

In [6]:
import matplotlib.pyplot as plt

# Display the image in the cell
def display_image(img):
    plt.imshow(img)
    plt.axis('off')
    plt.show()

Now that both functions are ready, you can execute them together. The image generation function needs the output of the text generation function as its prompt. In this example, you'll create a situation related to a "cafe" and then generate an image based on that situation.

In [None]:
full_response = generate_situational_prompt("cafe")
initial_situation_prompt = full_response.split('====')[0].strip()
print(initial_situation_prompt)
img = generate_situation_image(initial_situation_prompt)
display_image(img)

## Implementing Speech Recognition and Speech Synthesis

After implementing functions for text and image generation, you will now implement functions for speech recognition and speech synthesis. The multimodal app takes audio input from the user and replies with audio, because its purpose is practicing conversation in English.

In [8]:
# Import necessary libraries for audio processing and display
import librosa
from IPython.display import Audio, display

# Function to play a speech file
def play_speech(file_path):
    # Load the audio file using librosa
    y, sr = librosa.load(file_path)

    # Create an Audio object for playback
    audio = Audio(data=y, rate=sr, autoplay=True)

    # Display the audio player
    display(audio)

Next, create a function to generate speech from a text prompt using a Text-to-Speech (TTS) model.

In [9]:
# Function to generate speech from a text prompt
def speak_prompt(speech_prompt, autoplay=True, speech_file_path="speech.mp3"):
    # Generate speech from the text prompt using TTS
    response = client.audio.speech.create(
      model="tts-1",
      voice="alloy",
      input=speech_prompt
    )

    # Save the synthesized speech to the specified path
    response.stream_to_file(speech_file_path)

    # Sometimes you want to play the speech automatically, sometimes you do not
    if autoplay:
        # Play the synthesized speech
        play_speech(speech_file_path)

Now, extract the initial response from the generated situational prompt and use the `speak_prompt` function to generate and play the speech for it.

In [None]:
# The text after the "====" line is the initial response, not the situation
initial_response = full_response.split('====')[1].strip()
speak_prompt(initial_response)

Next, create a function to transcribe the speech into text using a speech-to-text model.

In [11]:
# Function to transcribe speech from an audio file
def transcript_speech(speech_filename="my_speech.wav"):
    with open(speech_filename, "rb") as audio_file:
        # Transcribe the audio file using the Whisper model
        transcription = client.audio.transcriptions.create(
          model="whisper-1",
          file=audio_file,
          response_format="json",
          language="en"
        )
    # Return the transcribed text
    return transcription.text

Transcribe the speech. Then, print the transcribed text.

In [12]:
# Transcribe the audio
transcripted_text = transcript_speech("audio/cappuccino.m4a")

# Print the transcribed text
print(transcripted_text)

I would like to order a cup of cappuccino.


Create a conversation history by combining the initial response and the transcribed text. Here is the function to do that.

In [13]:
# Function to create a conversation history
def creating_conversation_history(history, added_response):
    history = f"""{history}
====
'{added_response}'
"""
    return history

Now use the function to create and print the conversation history.

In [14]:
history = creating_conversation_history(full_response, transcripted_text)
print(history)

You're asking a stranger in a cafe if you can share their table.
====
"Hi, do you mind if I sit here?"
====
'I would like to order a cup of cappuccino.'
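The history is a plain string in which each turn is separated by a `====` line. As an illustration of that layout only, the sketch below keeps the turns in a list and renders the same format on demand. The names `SEPARATOR` and `render_history` are hypothetical and are not used elsewhere in this lesson.

```python
SEPARATOR = "\n====\n"


def render_history(situation, turns):
    """Join the situation and every spoken turn with the '====' separator line."""
    return SEPARATOR.join([situation, *turns])


turns = [
    '"Hi, do you mind if I sit here?"',
    "'I would like to order a cup of cappuccino.'",
]
history = render_history(
    "You're asking a stranger in a cafe if you can share their table.", turns
)
print(history)
```

Keeping the turns in a list can make it easier to inspect or trim the conversation before it is sent to the model.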



Next, generate a continuation of the conversation based on the conversation history. Here is the function to generate the conversation.

In [15]:
# Function to generate a conversation based on the conversation history
def generate_conversation_from_history(history):
    prompt = """Continue conversation from a person based on this conversation history and end it with '\n====\n'.
    Limit it to max 3 sentences.
    This is the history:"""
    response = client.chat.completions.create(
      model="gpt-4o",
      messages=[
        {"role": "system", "content": "You are a creative writer. Very very creative."},
        {"role": "user", "content": f"{prompt}\n{history}"}
      ]
    )
    # Extract and return the generated conversation
    message = response.choices[0].message.content
    return message

Generate and print the conversation based on the history.

In [16]:
# Generate and print the conversation based on the history
conversation = generate_conversation_from_history(history)
print(conversation)

"Of course, go ahead. I'm about to order one myself—cappuccino's my favorite."


Combine the conversation history with the new conversation and print the combined history.

In [17]:
# Combine the conversation history with the new conversation
combined_history = history + "\n====\n" + conversation

# Print the combined history
print(combined_history)

You're asking a stranger in a cafe if you can share their table.
====
"Hi, do you mind if I sit here?"
====
'I would like to order a cup of cappuccino.'

====
"Of course, go ahead. I'm about to order one myself—cappuccino's my favorite."


Next, generate a scenery image based on the combined history using the `generate_situation_image` function and display the image.

In [None]:
# Generate a scenery image based on the combined history
dalle_prompt = "Generate a scenery based on this conversation: " + combined_history
img = generate_situation_image(dalle_prompt)

# Display the generated image
display_image(img)

Finally, generate and play the speech for the new conversation using the `speak_prompt` function.

In [None]:
speak_prompt(conversation)

## Building the User Interface with Gradio

Now, let's create your multimodal language tutor app using Gradio. When the app is launched, it will display an image to the user, such as a picture of a cafe. An audio file will then play, for example, "Welcome to Cute Cafe. What would you like to order?"

There will be an interface for the user to record their speech, such as, "I would like to have a cup of cafe latte."

The image will then change to another picture, for instance, an image of a cafe latte. Speech will be generated and played to the user, saying, "Would you like anything else, such as a croissant?"

The user can respond, "No, but what is the wifi password?" The image will change again, perhaps to a picture of a wifi router or a note displaying the wifi password. And so on. You get the idea.

The user can use this app until they decide to quit.

In [None]:
# Initial seed prompt for generating the initial situational context
seed_prompt = "cafe near beach" # or "comics exhibition", "meeting parents-in-law for the first time", etc

# Generate an initial situational description based on the seed prompt
initial_situation = generate_situational_prompt(seed_prompt)

# Generate an initial image based on the initial situational description
img = generate_situation_image(initial_situation)

# Flags to manage the state of the app
first_time = True
combined_history = ""

# Function to extract the first and last segments of the conversation history
# This is to ensure that the prompt for DALL-E does not exceed the maximum character limit of 4000 characters
def extract_first_last(text):
    elements = [elem.strip() for elem in text.split('====') if elem.strip()]

    if len(elements) >= 2:
        # Join with a space so the two segments do not run together
        return elements[0] + " " + elements[-1]
    elif len(elements) == 1:
        return elements[0]
    else:
        return ""

# Main function to handle the conversation generation logic
def conversation_generation(audio_path):
    global combined_history
    global first_time

    # Transcribe the user's speech from the provided audio file path
    transcripted_text = transcript_speech(audio_path)

    # Create conversation history based on whether it is the first interaction or not
    if first_time:
        history = creating_conversation_history(initial_situation, transcripted_text)
        first_time = False
    else:
        history = creating_conversation_history(combined_history, transcripted_text)

    # Generate a new conversation based on the updated history
    conversation = generate_conversation_from_history(history)

    # Update the combined history with the new conversation
    combined_history = history + "\n====\n" + conversation
    
    # Extract a suitable prompt for DALL-E by combining the first and last parts of the conversation history
    dalle_prompt = extract_first_last(combined_history)

    # Generate a new image from the shortened prompt (not the full history)
    img = generate_situation_image(dalle_prompt)

    # Generate speech for the new conversation and save it to an audio file
    output_audio_file = "speak_speech.mp3"
    speak_prompt(conversation, False, output_audio_file)

    # Return the updated image, conversation text, and audio file path
    return img, conversation, output_audio_file

# Create the Gradio interface for the language tutor app
tutor_app = gr.Interface(
    conversation_generation,
    gr.Audio(sources=["microphone"], type="filepath"),
    outputs=[gr.Image(value=img), gr.Text(), gr.Audio(type="filepath")],
    title="Speaking Language Tutor App",
    description=initial_situation
)

# Launch the Gradio app
tutor_app.launch()

This multimodal language tutor app helps users practice language skills through interactive scenarios. When the app starts, it displays an image and an initial prompt related to a specific scenario, such as a cafe near a beach. Users respond by recording their speech. The app transcribes the speech, updates the conversation history, generates a new response, and updates the visual and audio outputs accordingly.
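The code comments mention a 4000-character limit for the DALL-E prompt, which is why only the first and last segments of the history are used. The sketch below shows one way that trimming could work, with an added hard cap on length. The function name `build_image_prompt` and the `max_chars` parameter are assumptions made for illustration; the lesson's own helper is `extract_first_last`.

```python
def build_image_prompt(history, max_chars=4000, separator="===="):
    """Keep the opening situation and the latest turn, then cap the length."""
    segments = [part.strip() for part in history.split(separator) if part.strip()]
    if not segments:
        return ""
    # The first segment sets the scene; the last segment reflects the newest reply
    picked = segments[:1] if len(segments) == 1 else [segments[0], segments[-1]]
    prompt = " ".join(picked)
    # Hard cap as a safety net in case a single segment is very long
    return prompt[:max_chars]


history = (
    "You're ordering coffee in a cafe.\n====\n'Hello!'\n====\n"
    "'A latte, please.'\n====\n'Coming right up.'"
)
print(build_image_prompt(history))
```

The middle turns are dropped, so the image reflects the setting and the most recent reply rather than the whole dialogue.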

### Inputs and Outputs

- Inputs:
  - Audio file (recorded via microphone)
- Outputs:
  - Image (updated based on conversation context)
  - Text (generated conversation response)
  - Audio file (generated speech response)

### Flow of the Program

1. Initialization:
   - The app starts with a seed prompt (e.g., "cafe near beach").
   - An initial situational description and corresponding image are generated.
2. User interaction:
   - The user records an audio file with their response.
   - The app transcribes the audio to text.
3. Conversation Update:
   - The app updates the conversation history with the new user input.
   - A new conversation response is generated based on the updated history.
   - The history is preserved and updated for future interactions.
4. Visual and Audio Update:
   - A new image is generated based on the updated history.
   - New speech is generated from the conversation response and saved to an audio file.
5. Outputs:
   - The updated image, conversation text, and speech audio are displayed and played to the user.
  
### State Preservation

To preserve the state in the Gradio app, the global variables `first_time` and `combined_history` are used. They track whether this is the first interaction and the combined history of the conversation, respectively. This allows the app to maintain the context of the conversation across multiple interactions, ensuring a coherent and continuous dialogue with the user.
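Module-level globals are shared by everything running in the same Python process, so this approach fits a single-user demo. Gradio also provides a `gr.State` component for per-session values. The sketch below does not use Gradio at all; it only illustrates, in plain Python, how the two globals could be bundled into one object per conversation. The class name `TutorSession` and its method are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TutorSession:
    """Hypothetical container for what the two globals track."""
    initial_situation: str
    combined_history: str = ""

    def add_exchange(self, user_text, tutor_reply):
        # An empty history means this is the first exchange,
        # so the history starts from the initial situation
        base = self.combined_history or self.initial_situation
        self.combined_history = f"{base}\n====\n'{user_text}'\n====\n{tutor_reply}"
        return self.combined_history


session = TutorSession("You're ordering coffee in a cafe.\n====\n'Hello, what would you like?'")
session.add_exchange("A latte, please.", '"Coming right up."')
session.add_exchange("Thanks!", '"You are welcome."')
print(session.combined_history)
```

Because each `TutorSession` holds its own history, two conversations would not overwrite each other's context.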

## Evaluation and Reflection

The app isn't perfect. It doesn't keep characters consistent across the generated images. For example, if you order a cup of coffee, one time you might be greeted by a man, and the next time by a woman. Additionally, it doesn't provide grammar feedback on what you say in the conversation.

Multimodal integration uses different types of media (text, audio, images, and interactive elements) to make interactions more engaging and easier to understand. By combining these forms of communication, apps can meet the needs and preferences of a broader range of users. For example, an app that pairs visual instructions with spoken feedback can help users learn and remember information better, while also making the experience more enjoyable. This approach can also improve accessibility, such as providing text descriptions for images or audio transcriptions for videos, making the app usable for people with disabilities. When you evaluate effectiveness, the aim is to see whether these combined methods lead to happier users, more engagement, and a smoother experience.

After building and testing your multimodal AI app, consider the following questions:

1. How does the integration of text, image, and audio enhance the language learning experience?
2. What challenges did you face in designing the user interface for multimodal interactions?
3. How might you improve the app to make it more effective or user-friendly?

Take some time to reflect on these questions and discuss your thoughts with your peers or instructor.

You can also try building the multimodal AI app outside a Jupyter notebook. Put the app in a Python script so you can run it from a command-line interface or terminal.
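As a starting point, a script version needs an entry point and a way to pass in the seed prompt. The sketch below shows only that skeleton using `argparse`. The option names `--seed-prompt` and `--share` are assumptions, and the Gradio wiring is left as a comment because it would reuse the functions defined earlier in this lesson.

```python
import argparse


def parse_args(argv=None):
    """Read the seed prompt and sharing preference from the command line."""
    parser = argparse.ArgumentParser(description="Speaking Language Tutor App")
    parser.add_argument(
        "--seed-prompt",
        default="cafe near beach",
        help="Topic used to generate the initial situation",
    )
    parser.add_argument(
        "--share",
        action="store_true",
        help="Ask Gradio to create a public link when launching",
    )
    return parser.parse_args(argv)


def main(argv=None):
    args = parse_args(argv)
    # In the real script, build the interface from the notebook's functions here
    # and finish with something like: tutor_app.launch(share=args.share)
    print(f"Seed prompt: {args.seed_prompt!r}, share: {args.share}")


if __name__ == "__main__":
    main()
```

Saved under a file name of your choice, the script could then be run with `python your_script.py --seed-prompt "comics exhibition"`.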