# Lesson 1 Project: Introduction to Multimodal AI

## Introduction

Welcome to the first lesson on multimodal AI! While this course primarily focuses on images and speech, it also involves working with text. You should have already learned about text generation in previous courses and how to access the OpenAI text generation API. However, it never hurts to refresh your memory. By the end of this lesson, you will be able to:
- Access the OpenAI text generation API
- Request structured outputs so that API responses follow a schema you define

These skills will serve as the foundation for working with multimodal AI systems.

## Setting Up the OpenAI Development Environment

In [None]:
# Install dependencies
!pip install openai pydantic python-dotenv matplotlib Pillow requests

In [None]:
# Load the OpenAI library
from openai import OpenAI

# Set up relevant environment variables
from dotenv import load_dotenv

load_dotenv()

# Create the OpenAI connection object
client = OpenAI()
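
By default, the `OpenAI()` client reads the API key from the `OPENAI_API_KEY` environment variable, which `load_dotenv()` can populate from a `.env` file. If the key is missing, requests will fail, so it can help to check for it up front. The following is a minimal sketch of such a check; the helper name `api_key_is_set` is hypothetical and not part of the OpenAI library.

```python
import os


def api_key_is_set(env=None, name="OPENAI_API_KEY"):
    """Return True if the named variable is present and non-empty.

    `env` defaults to the process environment; passing a plain dict
    makes the check easy to try out without touching real settings.
    """
    env = os.environ if env is None else env
    return bool(env.get(name, "").strip())


# Warn early instead of waiting for the first API call to fail
if not api_key_is_set():
    print("OPENAI_API_KEY is not set; add it to your .env file or shell environment.")
```

Running this right after `load_dotenv()` tells you whether the `.env` file was picked up before any request is sent.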

## Making an API Request

To make a request to the OpenAI text generation API, you can use the following code:

In [None]:
# Make an API request
completion = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are an English grammar checker."},
        {
            "role": "user",
            "content": "Check the grammar: 'Alice eat an apple every day.'"
        }
    ]
)

# Print the response
print(completion.choices[0].message)
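
Printing `message` shows the whole message object, including its role. If you only want the generated text, it is available as `message.content`. The sketch below illustrates this with a stand-in object that mimics the shape of a chat completion so it can run without an API call; `reply_text` and `fake_completion` are hypothetical names used only for illustration, and the reply text is made up rather than actual model output.

```python
from types import SimpleNamespace


def reply_text(completion):
    """Return the assistant's text from the first choice of a chat completion."""
    return completion.choices[0].message.content


# Stand-in object with the same attribute layout as a chat completion
# (illustration only; a real response comes from client.chat.completions.create)
fake_completion = SimpleNamespace(
    choices=[
        SimpleNamespace(
            message=SimpleNamespace(
                role="assistant",
                content="Alice eats an apple every day.",
            )
        )
    ]
)

print(reply_text(fake_completion))
```

With a real response, calling `reply_text(completion)` on the object returned above would give you the model's answer as a plain string.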

## Structured Outputs

OpenAI introduced structured outputs, allowing you to enforce that the generated response from the API adheres to a JSON schema. This makes it easier to extract information without having to parse a raw string. To create an API request with structured outputs, use the following code:

In [None]:
# Import Pydantic
from pydantic import BaseModel

# Define data structure
class GrammarChecking(BaseModel):
    wrong_sentence: str
    correct_sentence: str
    is_correct: bool

# Make an API request
completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "You are an English grammar checker."},
        {"role": "user", "content": "Check the grammar: 'Alice eat an apple every day.'"},
    ],
    response_format=GrammarChecking,
)

# Print the response as a parsed GrammarChecking object
print(completion.choices[0].message.parsed)
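
To see why schema-conforming output is easier to work with, consider what happens once the response is guaranteed to be JSON with known fields. The sketch below uses only the standard library and a made-up JSON string (not actual API output) to show the idea; `GrammarCheckingResult` and `parse_grammar_json` are hypothetical names. In the request above, Pydantic and the SDK handle this step for you, and the validated object is exposed as `completion.choices[0].message.parsed`.

```python
import json
from dataclasses import dataclass


@dataclass
class GrammarCheckingResult:
    """Plain-Python mirror of the GrammarChecking fields (illustration only)."""
    wrong_sentence: str
    correct_sentence: str
    is_correct: bool


def parse_grammar_json(raw):
    """Turn a schema-conforming JSON string into a typed object."""
    data = json.loads(raw)
    return GrammarCheckingResult(
        wrong_sentence=data["wrong_sentence"],
        correct_sentence=data["correct_sentence"],
        is_correct=data["is_correct"],
    )


# Made-up example of what a schema-conforming response body might look like
example_json = (
    '{"wrong_sentence": "Alice eat an apple every day.", '
    '"correct_sentence": "Alice eats an apple every day.", '
    '"is_correct": false}'
)

result = parse_grammar_json(example_json)
print(result.correct_sentence)
print(result.is_correct)
```

Because every field has a fixed name and type, you can read `result.correct_sentence` directly instead of searching a free-form string for the corrected sentence.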

## Multimodal AI API Request

But text alone is boring. That's why multimodal AI is interesting. You can do so much more than text, such as generating images.

In [None]:
# Import libraries
from PIL import Image
from io import BytesIO
import matplotlib.pyplot as plt
import requests

# Creating a multimodal AI API request
response = client.images.generate(
    model="dall-e-3",
    prompt="A cat holding a sign 'Welcome to the Multimodal AI Module'"
)

# Downloading the image
image_url = response.data[0].url
image_response = requests.get(image_url)
image_response.raise_for_status()  # Stop here if the download failed
img = Image.open(BytesIO(image_response.content))

# Displaying the image
plt.imshow(img)
plt.axis('off')
plt.show()

You should see a cute cat holding a sign that says "Welcome to the Multimodal AI Module"!

Don't worry about the details of this code. You'll learn how to write code that interacts with multimodal AI in later lessons.

Multimodal AI is also about more than text and images. It deals with audio as well, which you'll learn about next.