# Building Your First Multimodal GenAI App

## Welcome to the Workshop on Building Composable AI Systems 🚀

In this workshop, we'll explore how to build powerful, flexible AI applications using a code-first approach. The world of AI is evolving rapidly, and we now have access to a multitude of models for generating text, images, audio, and more. But to really leverage the power of these models, we need to think about **composability**: how we can connect and automate AI capabilities in ways that fit our particular needs.

### Why Use Code Interfaces Over Pre-Built Tools?

You might wonder, "Why not just use simple, web-based tools for these tasks?" The answer lies in the power of customization, automation, and scalability. Let's explore some real-world scenarios:

---

### Scenario 1: Rapid Image Generation for a Marketing Team 🎨
Imagine you're part of a marketing team launching a new product. You need to generate dozens of image concepts quickly to test different styles and messages. Using a code-based approach, you can:

- **Prompt** a language model to come up with creative image descriptions.
- **Generate** images from those descriptions using an image model.
- **Filter** and select the best images based on specific criteria.
- **Send** the top images to your team’s Slack or email for feedback.

This automated workflow saves hours of manual work and ensures your team has more creative options to consider.



### Scenario 2: Creating Background Music for Content Creators 🎶
You're a content creator working on a podcast or video series. Each episode needs custom background music that matches the theme and mood. With this approach, you can:

- **Generate** multiple music samples using a generative audio model.
- **Adjust** the style, tempo, or instrumentation on the fly.
- **Save** and organize the best samples for your project.

This allows you to experiment freely and find the perfect sound without having to rely on pre-made music libraries.



### Scenario 3: Interactive Storytelling or Live Events 🎭
Imagine hosting a live event where you want to engage the audience with real-time storytelling. You can build an app that:

- **Listens** to audience suggestions through voice input.
- **Generates** narrative twists or visual art based on those suggestions.
- **Plays** custom soundscapes to match the unfolding story.



### Scenario 4: Team Collaboration & Idea Generation 💡
Consider a product team brainstorming new features for an app. Instead of manually jotting down ideas and creating prototypes, you can build a pipeline that:

- **Generates** feature ideas using an LLM.
- **Creates** visual prototypes from those ideas using an image model.
- **Filters** and prioritizes the best concepts automatically.

This speeds up the ideation process and allows your team to explore more possibilities in less time.

### Workshop Overview 🛠️

In this workshop, you will learn how to:
- **Compose** AI models to build custom applications.
- **Automate** workflows using AI-generated content.
- **Experiment** with different models for text, audio, and images.
- **Deploy** your own composable AI systems for real-world use cases.

We’ll start with a high-level look at composability, inspired by the Unix philosophy: simple, modular components that can be easily combined. Then, we'll get hands-on, building systems that turn your ideas into reality with minimal effort.
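
To make the Unix analogy concrete, here is a minimal sketch of a `pipe` helper that chains small stages left to right. The stage functions (`describe`, `render`, `caption`) are hypothetical stand-ins for real model calls, and the placeholder URL is not a real asset.

```python
from functools import reduce


def pipe(*stages):
    """Compose stages left to right, like a Unix pipeline."""
    def run(value):
        return reduce(lambda acc, stage: stage(acc), stages, value)
    return run


# Hypothetical stand-ins for real model calls (no network access needed).
def describe(topic):
    return f"A vivid poster about {topic}"


def render(description):
    return {"prompt": description, "image_url": "https://example.com/placeholder.png"}


def caption(image):
    return f"Generated from: {image['prompt']}"


concept_to_caption = pipe(describe, render, caption)
print(concept_to_caption("pizza and chess"))
```

Because each stage only depends on the previous stage's output, any one of them can be swapped for a different model without touching the rest.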

Let’s get started! 🚀

First, note that these kinds of multimodal approaches are already available in apps such as ChatGPT:

In [17]:
from IPython.display import Video

Video("vid/Dream Advertising Marketing Campaign.mp4")


But this approach ties you to particular models:

![Alt text](img/multimodal_app_1.png)

In this notebook, we'll explore a variety of SOTA GenAI models and get a sense of how to stitch them together!

![Alt text](img/composability.png)

![Alt text](img/karpathy-austen.png)

In [18]:
from IPython.display import Video

Video("vid/karpathy-austen.mp4")


![Alt text](img/unix-phil.png)

![Alt text](img/apple-counterpoint.png)

![Alt text](img/ai-evolution.png)

![Alt text](img/importance-comp.png)

![Alt text](img/comp-nec.png)

## Get your API keys into the environment

- Create a [Groq](https://groq.com/) account and navigate [here to get your API key](https://console.groq.com/keys). They have a free tier with a bunch of LLMs (see screenshot below)!
- If you'd prefer to use OpenAI, you can do that and get [your API key here](https://platform.openai.com/api-keys).
- To use the models below as is, you'll need a [Replicate account](https://replicate.com/). If you're using this notebook in a workshop, Hugo can probably provision free Replicate credits for you, so ask him if he hasn't mentioned it.
- Many of these models [you can also find on HuggingFace](https://huggingface.co/models), if you'd prefer.

![Alt text](img/multimodal_app_2.png)

In [1]:
import getpass


# Prompt for the Replicate API key
replicate_api_key = getpass.getpass("Please enter your Replicate API key: ")
print("Replicate API key captured successfully!")

# Prompt for the Groq API key
groq_api_key = getpass.getpass("Please enter your Groq API key: ")
print("Groq API key captured successfully!")

# # Prompt for the OpenAI API key
# openai_api_key = getpass.getpass("Please enter your OpenAI API key: ")
# print("OpenAI API key captured successfully!")


Replicate API key captured successfully!
Groq API key captured successfully!
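
If you rerun the notebook often, typing keys each time gets tedious. As a sketch, the hypothetical helper `get_api_key` below checks an environment variable first and only falls back to a `getpass` prompt when it is unset. The variable names in the usage comments (`REPLICATE_API_TOKEN`, `GROQ_API_KEY`) are the ones those client libraries conventionally read, but treat them as assumptions and adjust to your setup.

```python
import getpass
import os


def get_api_key(env_var, prompt, env=None, ask=getpass.getpass):
    """Return the key from the environment if set, otherwise prompt for it."""
    env = os.environ if env is None else env
    value = env.get(env_var)
    if value:
        return value
    return ask(prompt)


# Usage in the notebook (commented out so nothing prompts on import):
# replicate_api_key = get_api_key("REPLICATE_API_TOKEN", "Please enter your Replicate API key: ")
# groq_api_key = get_api_key("GROQ_API_KEY", "Please enter your Groq API key: ")
```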


## Suno Bark: text to audio

First up, we'll experiment with the [Suno Bark](https://github.com/suno-ai/bark) text to audio model:

In [2]:
import replicate

# Create a Replicate client instance with the API token
client = replicate.Client(api_token=replicate_api_key)

# Define the input parameters for the model
input_params = {
    "prompt": "Hello, my name is Hugo. And, uh — and I like pizza. [laughs] But I also have other interests such as playing chess. [chuckles]",
    "text_temp": 0.7,
    "output_full": False,
    "waveform_temp": 0.7,
    "history_prompt": "announcer"
}

# Run the model using Replicate API
try:
    output = client.run(
        "suno-ai/bark:b76242b40d67c76ab6742e987628a2a9ac019e11d56ab96c4e91ce03b79b2787",
        input=input_params
    )
    print(output)
except Exception as e:
    print(f"Error: {e}")

{'audio_out': 'https://replicate.delivery/czjl/3KbzO0kbYqZIBhfRBWRTpMQYjoivQeK3nSIEWbsQWQ0DVStTA/audio.wav', 'prompt_npz': None}
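
The model hands back a URL rather than the audio itself. Here is a small sketch for saving such a file locally, assuming the output contains plain URL strings as printed above (other client versions may return different object types). `download` and `filename_from_url` are hypothetical helper names, and `outputs` is an arbitrary folder choice.

```python
import os
import shutil
import urllib.request
from urllib.parse import urlparse


def filename_from_url(url):
    """Take the last path segment of a URL as the local file name."""
    return os.path.basename(urlparse(url).path)


def download(url, dest_dir="outputs"):
    """Stream the file at `url` into `dest_dir` and return the local path."""
    os.makedirs(dest_dir, exist_ok=True)
    dest = os.path.join(dest_dir, filename_from_url(url))
    with urllib.request.urlopen(url) as response, open(dest, "wb") as f:
        shutil.copyfileobj(response, f)
    return dest


# Usage with the Bark output above (commented out: it needs network access):
# local_path = download(output["audio_out"])
```

In a notebook, you could then pass the saved path to `IPython.display.Audio` to listen inline.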


### LLM output --> Suno bark

But what if we want to pipe the output of an LLM into Bark?

In [3]:
from groq import Groq

def get_llm_response(user_input):
    client = Groq(
        api_key=groq_api_key)

    response = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": user_input,
            }
        ],
        model="llama3-8b-8192",
    )

    return response.choices[0].message.content

# Alternative: use OpenAI instead of Groq
# from openai import OpenAI
#
# def get_llm_response(user_input):
#     client = OpenAI(api_key=openai_api_key)
#     response = client.chat.completions.create(
#         model="gpt-3.5-turbo-0613",
#         messages=[{"role": "user", "content": user_input}],
#     )
#     return response.choices[0].message.content

In [4]:
song = get_llm_response("a short pirates sea shanty")
print(song)

Arrrr, gather 'round me hearties, and listen close to me tale,
Of a life on the seven seas, where the winds do never fail.

(Verse 1)
Oh, we set sail on the "Black Swan's" back,
For the islands of gold and the treasures so vast,
Our captain, a swashbuckler, with a heart full of gold,
Led us forth to find riches, or a watery cold.

(Chorus)
Heave ho, me hearties, let the anchor go,
Heave ho, me hearties, the winds do blow,
Heave ho, me hearties, let the battle rage,
Heave ho, me hearties, and the treasure we'll engage!

(Verse 2)
The sharks they did circle, the storms did rage,
But we held fast to the ship, and our spirits did engage,
We fought off the scurvy dogs, and the deadly beast,
And reached the shores of victory, and rested our weary feet.

(Chorus)
Heave ho, me hearties, let the anchor go,
Heave ho, me hearties, the winds do blow,
Heave ho, me hearties, let the battle rage,
Heave ho, me hearties, and the treasure we'll engage!

(Bridge)
Oh, the stories we did tell, of the battl

In [5]:
# Define the input parameters for the model
input_params = {
    "prompt": song,
    "text_temp": 0.7,
    "output_full": False,
    "waveform_temp": 0.7,
    "history_prompt": "announcer",
   # "duration": 30
}

# Run the model using Replicate API
try:
    output = client.run(
        "suno-ai/bark:b76242b40d67c76ab6742e987628a2a9ac019e11d56ab96c4e91ce03b79b2787",
        input=input_params
    )
    print(output)
except Exception as e:
    print(f"Error: {e}")

{'audio_out': 'https://replicate.delivery/czjl/fksIIMEz1egvGkAxK50EvCNt1GPDCWlg1qHoSCZtxG22VStTA/audio.wav', 'prompt_npz': None}


The resulting audio is totally bent and makes no sense ☝️
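
One possible contributor is that the whole song, section labels included, goes to Bark as a single long prompt. As an experiment, and assuming shorter prompts behave better (worth verifying yourself), the hypothetical helper below strips labels like `(Verse 1)` and groups the remaining lines into short chunks that could each be sent to Bark separately.

```python
import re


def chunk_for_tts(text, max_chars=200):
    """Drop '(Verse 1)'-style labels and group the remaining lines into short chunks."""
    chunks, current = [], ""
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and section labels such as "(Chorus)".
        if not line or re.fullmatch(r"\(.*\)", line):
            continue
        candidate = f"{current} {line}".strip()
        if current and len(candidate) > max_chars:
            # Adding this line would exceed the limit, so start a new chunk.
            chunks.append(current)
            current = line
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


sample = "(Verse 1)\nOh, we set sail\n\n(Chorus)\nHeave ho"
print(chunk_for_tts(sample, max_chars=15))
```

A single line longer than `max_chars` is kept whole rather than split mid-sentence, so the limit is a soft one.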

## Text to music with Meta MusicGen

What if we wanted to create some music from text? Let's try MusicGen from Meta.

In [6]:
# Named input_params rather than input to avoid shadowing Python's built-in input()
input_params = {
    "prompt": "Horns and Drums. Edo25 major g melodies that sound triumphant and cinematic. Leading up to a crescendo that resolves in a 9th harmonic",
    "model_version": "stereo-large",
    "output_format": "mp3",
    "normalization_strategy": "peak"
}

output = client.run(
    "meta/musicgen:671ac645ce5e552cc63a54a2bbff63fcf798043055d2dac5fc9e36a837eedcfb",
    input=input_params
)
print(output)
#=> "https://replicate.delivery/pbxt/OeLYIQiltdzMaCex1shlEFy6...

https://replicate.delivery/yhqm/5Ps5e37V4PygbyfAL0qIF1RKDKZ59VBQhDxbKWS0T1QqWStTA/out.mp3


In [7]:
# Named input_params rather than input to avoid shadowing Python's built-in input()
input_params = {
    "prompt": "Ancient Trip Hop with Throat Singing",
    "model_version": "stereo-large",
    "output_format": "mp3",
    "normalization_strategy": "peak",
    "duration": 30
}

output = client.run(
    "meta/musicgen:671ac645ce5e552cc63a54a2bbff63fcf798043055d2dac5fc9e36a837eedcfb",
    input=input_params
)
print(output)
#=> "https://replicate.delivery/pbxt/OeLYIQiltdzMaCex1shlEFy6...

https://replicate.delivery/yhqm/eE35qkwuCw1wekLyNmhR8lipUSeRz9twL60HNseTFurnhJ1OB/out.mp3


## Text to music with Riffusion

There are lots of other models to experiment with, such as Riffusion:

In [8]:
output = client.run(
    "riffusion/riffusion:8cf61ea6c56afd61d8f5b9ffd14d7c216c0a93844ce2d82ac1c9ecc9c7f24e05",
    input={
        "alpha": 0.5,
        "prompt_a": "West African Desert Blues",
        "prompt_b": "Throat Singing",
        "denoising": 0.75,
        "seed_image_id": "vibes",
        "num_inference_steps": 50
    }
)
print(output)

{'audio': 'https://replicate.delivery/czjl/v8LskkFqSEJbAB8VMhhDVBFjii9scILxczDDiQhZ3LW2mU7E/gen_sound.wav', 'spectrogram': 'https://replicate.delivery/czjl/fLi1Xp3vhgRJUCBtOkBDB8gF5u0QdQMskfoTeLvVf2fObTqdC/spectrogram.jpg'}
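
The call above fixes `alpha` at 0.5. Assuming `alpha` blends between `prompt_a` and `prompt_b` (our reading of the parameter names, so check the model page), a quick way to explore the blend is to build one input dict per `alpha` value. `riffusion_sweep` is a hypothetical helper that only constructs the inputs and leaves the `client.run` calls to you.

```python
def riffusion_sweep(prompt_a, prompt_b, steps=5, **fixed):
    """Build one input dict per evenly spaced alpha value in [0, 1]."""
    if steps < 2:
        raise ValueError("steps must be at least 2 to cover both endpoints")
    inputs = []
    for i in range(steps):
        alpha = i / (steps - 1)
        inputs.append({"prompt_a": prompt_a, "prompt_b": prompt_b, "alpha": alpha, **fixed})
    return inputs


sweep = riffusion_sweep(
    "West African Desert Blues",
    "Throat Singing",
    steps=5,
    denoising=0.75,
    seed_image_id="vibes",
    num_inference_steps=50,
)
print([item["alpha"] for item in sweep])
```

Each dict in `sweep` can be passed as `input=` to the same `client.run` call used above.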


___

## Experiment: One prompt to many models

Now what if we wanted to use a single prompt to create images, speech, music, and video?

In [9]:
message = "The Waffle House is really messing up the pancakes and bacon tonight HOLY MOLEY and there's anarchist jazz also!"

### text to image

In [10]:
# Named input_params rather than input to avoid shadowing Python's built-in input()
input_params = {
    "prompt": message
}

output = client.run(
    "fofr/epicrealismxl-lightning-hades:0ca10b1fd361c1c5568720736411eaa89d9684415eb61fd36875b4d3c20f605a",
    input=input_params
)
print(output)
#=> ["https://replicate.delivery/pbxt/ulYZRIyAUDYpOZfl7OjhrKx...

['https://replicate.delivery/pbxt/naSOujsYXDYxDljySnCPHQFb5TIReXmMQzNjCQeP7xwqcStTA/R8__00001_.webp']


### text to audio

In [11]:
# Define the input parameters for the model
input_params = {
    "prompt": message,
    "text_temp": 0.7,
    "output_full": False,
    "waveform_temp": 0.7,
    "history_prompt": "announcer",
   # "duration": 30
}

# Run the model using Replicate API
try:
    output = client.run(
        "suno-ai/bark:b76242b40d67c76ab6742e987628a2a9ac019e11d56ab96c4e91ce03b79b2787",
        input=input_params
    )
    print(output)
except Exception as e:
    print(f"Error: {e}")

{'audio_out': 'https://replicate.delivery/czjl/dzYZHYYkWTLmFJvoeXHvjOcInRTWNW2feDSFcPfLlewx7TqdC/audio.wav', 'prompt_npz': None}


### text to music

In [12]:
# Named input_params rather than input to avoid shadowing Python's built-in input()
input_params = {
    "prompt": message,
    "model_version": "stereo-large",
    "output_format": "mp3",
    "normalization_strategy": "peak",
    "duration": 30
}

output = client.run(
    "meta/musicgen:671ac645ce5e552cc63a54a2bbff63fcf798043055d2dac5fc9e36a837eedcfb",
    input=input_params
)
print(output)
#=> "https://replicate.delivery/pbxt/OeLYIQiltdzMaCex1shlEFy6...

https://replicate.delivery/yhqm/O3fhhP3bpBVtPiWwx664x4ggTLukt7IfUDx2f2Il3EcWBlanA/out.mp3


## Many models at once

Let's write some utility functions that use these models:

In [13]:
def generate_epic_realism(prompt, api_token):
    # Create a Replicate client instance with the API token
    client = replicate.Client(api_token=api_token)

    # Define the input parameters for the model
    input_data = {
        "prompt": prompt
    }

    # Run the model using Replicate API
    output = client.run(
        "fofr/epicrealismxl-lightning-hades:0ca10b1fd361c1c5568720736411eaa89d9684415eb61fd36875b4d3c20f605a",
        input=input_data
    )
    
    return output



def generate_music_gen(prompt, api_token, duration=30, model_version="stereo-large", output_format="mp3", normalization_strategy="peak"):
    # Create a Replicate client instance with the API token
    client = replicate.Client(api_token=api_token)

    # Define the input parameters for the model
    input_data = {
        "prompt": prompt,
        "model_version": model_version,
        "output_format": output_format,
        "normalization_strategy": normalization_strategy,
        "duration": duration 
    }

    # Run the model using Replicate API
    output = client.run(
        "meta/musicgen:671ac645ce5e552cc63a54a2bbff63fcf798043055d2dac5fc9e36a837eedcfb",
        input=input_data
    )
    
    return output


def generate_suno_bark(prompt, api_token, text_temp=0.7, output_full=False, waveform_temp=0.7, history_prompt="announcer"):
    # Create a Replicate client instance with the API token
    client = replicate.Client(api_token=api_token)

    # Define the input parameters for the model
    input_params = {
        "prompt": prompt,
        "text_temp": text_temp,
        "output_full": output_full,
        "waveform_temp": waveform_temp,
        "history_prompt": history_prompt,
    }

    # Run the model using Replicate API
    try:
        output = client.run(
            "suno-ai/bark:b76242b40d67c76ab6742e987628a2a9ac019e11d56ab96c4e91ce03b79b2787",
            input=input_params
        )
        return output
    except Exception as e:
        print(f"Error: {e}")
        return None






Let's test them out:

In [14]:
message = "crazy wild zombie party at the blaring symphony orchestra"
output = generate_epic_realism(message, replicate_api_key)
print(output)

output = generate_suno_bark(message, replicate_api_key)
print(output)

output = generate_music_gen(message, replicate_api_key)
print(output)


['https://replicate.delivery/pbxt/JHwgByESzW6FCpaQWnpfaEYXBE7PJy6IiWpFeUnpsJD6hStTA/R8__00001_.webp']
{'audio_out': 'https://replicate.delivery/czjl/PLHLseg6EhT4YCOtKeF7zj6we9a2GeVeLcJsqPQ9ul55DVqdC/audio.wav', 'prompt_npz': None}
https://replicate.delivery/yhqm/XtxMwbfHk4XbHqG7ftIlpH1fRBKmKggTofMbRtnlLy7VmK1OB/out.mp3
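
Notice that the three models return three different shapes: a list, a dict, and a bare string. When stitching models together it helps to normalize these. The sketch below uses a hypothetical helper, `extract_urls`, that flattens any of the shapes printed above into a flat list of URLs; falling back to `str()` for other object types is an assumption.

```python
def extract_urls(output):
    """Flatten a model output (str, list, dict, or None) into a list of URL strings."""
    if output is None:
        return []
    if isinstance(output, str):
        return [output]
    if isinstance(output, dict):
        urls = []
        for value in output.values():
            urls.extend(extract_urls(value))
        return urls
    if isinstance(output, (list, tuple)):
        urls = []
        for item in output:
            urls.extend(extract_urls(item))
        return urls
    # Assumption: any other object converts to its URL via str().
    return [str(output)]


print(extract_urls({"audio_out": "https://example.com/audio.wav", "prompt_npz": None}))
```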


In [15]:
# Define your API token and prompt message
# api_token = 'your_api_token_here'
message = "The Waffle House messing it up for real with the pancakes and bacon and punk abstract jazz, yo!"

# Run the Epic Realism model
epicrealism_output = generate_epic_realism(message, replicate_api_key)
print("Epic Realism Output:")
print(epicrealism_output)

# Run the Meta MusicGen model
musicgen_output = generate_music_gen(message, replicate_api_key)
print("Meta MusicGen Output:")
print(musicgen_output)

# Run the Suno Bark model
bark_output = generate_suno_bark(message, replicate_api_key)
print("Suno Bark Output:")
print(bark_output)



Epic Realism Output:
['https://replicate.delivery/pbxt/GHKFnGfyRsWYACcHgg22HP6hJUDrhFdNRxkKtxLamUnzUp2JA/R8__00001_.webp']
Meta MusicGen Output:
https://replicate.delivery/yhqm/xo81Ixf8hv0bL69f47Q4y2cWtrjHPMGv71lMGRp6kFsarStTA/out.mp3
Suno Bark Output:
{'audio_out': 'https://replicate.delivery/czjl/egt08ZFPeEkj30twSbgpgaElvagjxkA51VoSeY5YY1KqXlanA/audio.wav', 'prompt_npz': None}
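
The cell above calls the models one after another, so the total wait is the sum of all three. Since each call mostly waits on the network, a thread pool is one way to run them side by side. This is a sketch: `run_all` is a hypothetical helper, and the lambdas in the demo are stand-ins so the block runs without API access.

```python
from concurrent.futures import ThreadPoolExecutor


def run_all(generators, prompt):
    """Run each generator on the same prompt in parallel.

    `generators` maps a label to a callable that takes the prompt.
    A generator that raises yields None instead of stopping the others.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max(1, len(generators))) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in generators.items()}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as e:
                print(f"{name} failed: {e}")
                results[name] = None
    return results


# Stand-in generators; in the notebook you could pass, for example,
# functools.partial(generate_music_gen, api_token=replicate_api_key).
demo = run_all(
    {"upper": lambda p: p.upper(), "length": lambda p: len(p)},
    "zombie party",
)
print(demo)
```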


### Experiment: text to video

In [16]:
message = "The Waffle House messing it up for real with the pancakes and bacon and punk abstract jazz, yo!"

# Named input_params rather than input to avoid shadowing Python's built-in input()
input_params = {
    "sampler": "klms",
    "max_frames": 100,
    "animation_prompts": message
}

output = client.run(
    "deforum/deforum_stable_diffusion:e22e77495f2fb83c34d5fae2ad8ab63c0a87b6b573b6208e1535b23b89ea66d6",
    input=input_params
)
print(output)
#=> "https://replicate.delivery/mgxm/873a1cc7-0427-4e8d-ab3c-...

https://replicate.delivery/yhqm/iBXTe8egVzgBDknYEVHncIhUjp1uqrqm8OfV7d1aorHZeK1OB/out.mp4
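
If you want the video to change over time, our understanding is that this Deforum model accepts keyframed prompts written as `frame: prompt` pairs separated by `|`. Treat that format as an assumption and confirm it on the model page before relying on it. `format_animation_prompts` is a hypothetical helper that builds such a string and checks the frame numbers against `max_frames`.

```python
def format_animation_prompts(keyframes, max_frames):
    """Turn {frame_number: prompt} into a 'frame: prompt | frame: prompt' string."""
    parts = []
    for frame in sorted(keyframes):
        if not 0 <= frame < max_frames:
            raise ValueError(f"frame {frame} is outside 0..{max_frames - 1}")
        parts.append(f"{frame}: {keyframes[frame]}")
    return " | ".join(parts)


prompts = format_animation_prompts(
    {0: "a quiet Waffle House at dusk", 50: "punk abstract jazz erupts over the pancakes"},
    max_frames=100,
)
print(prompts)
```

The resulting string would take the place of `message` as the `animation_prompts` value in the cell above.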
