# Pentagram

**SOURCES**:

- [Getting started with Modal](https://modal.com/docs/examples/hello_world)
- [Building an Image Generation Pipeline on Modal](https://www.youtube.com/watch?v=sHSKArbiKmU)
- [Run Stable Diffusion as a CLI, API and webUI](https://modal.com/docs/examples/text_to_image)
- [Midjourney Examples](https://www.midjourney.com/explore?tab=top)
- [NVIDIA GPU comparison](https://www.digitalocean.com/community/tutorials/h100_vs_other_gpus_choosing_the_right_gpu_for_your_machine_learning_workload)
- [Modal Playground](https://modal.com/playground/get_started)
- [Modal cold Start Guide](https://modal.com/docs/guide/cold-start)
- [Image Generation Models](https://huggingface.co/models?pipeline_tag=text-to-image)
- [Modal Web endpoints](https://modal.com/docs/guide/webhooks)

## Objective
For this project, you are tasked with building an Instagram clone, where instead of users uploading pictures themselves, they can generate images with text prompts. Instead of using existing image generation APIs, you will have to host an image generation model yourself on serverless GPUs and ensure low latency for a smooth user experience.

Getting Started:

- Learn how Modal works here, along with the other resources provided above.
- Set up the backend API using Modal that generates images from a text prompt.
- Clone the GitHub repo here for the web app where users can generate images, and take a look at the TODOs in the codebase.

Project Requirements:

- Host an image generation model (e.g., Stable Diffusion) on serverless GPUs through Modal, ensuring low-latency performance for smooth user experience.
- Create a web app that allows users to generate images from text prompts, manage their creations, and interact socially through likes, comments, and sharing features.
- Incorporate intuitive UI/UX design, authentication, and efficient image management with prompt histories.

Challenges:

- Ensuring the hosted image generation model operates within low-latency thresholds (<2 seconds) while handling multiple concurrent requests.
- Managing the dynamic scaling of GPU resources to handle demand spikes without exceeding cost or causing performance bottlenecks.
- Adding the ability to search for images semantically.
- Preventing harmful or inappropriate content from being generated.
- Building a recommendation system that creates personalized feeds for users, balancing new content discovery with user preferences.
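
To check the latency challenge above, it helps to measure per-request latency under concurrent load. The sketch below is only an illustration: `fake_generate` is a hypothetical stand-in for a call to the deployed endpoint (it just sleeps), and the concurrency level and the nearest-rank 95th percentile are assumptions, not project requirements.

```python
import math
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

LATENCY_BUDGET_SECONDS = 2.0  # the "<2 seconds" threshold from the challenge list


def fake_generate(prompt: str) -> bytes:
    """Hypothetical stand-in for a request to the image generation endpoint."""
    time.sleep(0.01)  # pretend the request takes roughly 10 ms
    return prompt.encode("utf-8")


def timed_call(prompt: str) -> float:
    """Return the wall-clock latency of one request, in seconds."""
    start = time.perf_counter()
    fake_generate(prompt)
    return time.perf_counter() - start


def measure_latency(prompts: list[str], concurrency: int = 4) -> dict[str, float]:
    """Send the prompts concurrently and summarize the per-request latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = sorted(pool.map(timed_call, prompts))
    # Nearest-rank 95th percentile of the sorted latencies
    p95_index = max(0, math.ceil(0.95 * len(latencies)) - 1)
    return {
        "median": statistics.median(latencies),
        "p95": latencies[p95_index],
        "max": latencies[-1],
    }


stats = measure_latency([f"prompt {i}" for i in range(20)])
print(stats, "within budget:", stats["p95"] < LATENCY_BUDGET_SECONDS)
```

Replacing `fake_generate` with a real HTTP call would turn this into a rough load test; the tail latency (p95 or max) is usually the number to compare against the budget.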

## Project Tips

* Diffuser packages often require significant storage space and computational resources. For optimal performance, it's recommended to run them on platforms like Google Colab or virtual machines with sufficient storage and GPU capabilities.

# Preparation

## Installation

In [1]:
%pip install --upgrade modal \
 diffusers \
 requests \
 torch \
 fastapi

Collecting modal
  Downloading modal-0.68.42-py3-none-any.whl.metadata (2.3 kB)
Collecting fastapi
  Downloading fastapi-0.115.6-py3-none-any.whl.metadata (27 kB)
Collecting grpclib==0.4.7 (from modal)
  Downloading grpclib-0.4.7.tar.gz (61 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting synchronicity~=0.9.7 (from modal)
  Downloading synchronicity-0.9.7-py3-none-any.whl.metadata (8.3 kB)
Collecting types-certifi (from modal)
  Downloading types_certifi-2021.10.8.3-py3-none-any.whl.metadata (1.4 kB)
Collecting types-toml (from modal)
  Downloading types_toml-0.10.8.20240310-py3-none-any.whl.metadata (1.5 kB)
Collecting watchfiles (from modal)
  Downloading watchfiles-1.0.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64

## Check Versions

In [3]:
!python --version
!nvcc --version   # CUDA version

Python 3.10.12
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2023 NVIDIA Corporation
Built on Tue_Aug_15_22:02:13_PDT_2023
Cuda compilation tools, release 12.2, V12.2.140
Build cuda_12.2.r12.2/compiler.33191640_0


## Check GPU

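To use a GPU on Colab, select the T4 runtime, which costs about $1.44 per hour. The output below shows `Cuda availability:  False`, which means no GPU runtime was attached when this cell ran.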

In [2]:
import torch
print("torch versions: ", torch.__version__)
print("Cuda availability: ", torch.cuda.is_available())
print("CUDA device count:", torch.cuda.device_count())

torch versions:  2.5.1+cu121
Cuda availability:  False
CUDA device count: 0
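
Since CUDA may be unavailable (as in the output above), a small helper can choose a device string before loading a pipeline. This is a minimal sketch: `pick_device` is a hypothetical helper name, and the preference order CUDA, then MPS, then CPU is an assumption.

```python
def pick_device(cuda_available: bool, mps_available: bool = False) -> str:
    """Return the device string to pass to `pipe.to(...)`."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"  # Apple silicon
    return "cpu"


try:
    import torch

    # torch.backends.mps only exists in newer PyTorch releases, so look it up defensively
    mps_backend = getattr(torch.backends, "mps", None)
    device = pick_device(
        torch.cuda.is_available(),
        bool(mps_backend and mps_backend.is_available()),
    )
except ImportError:
    # PyTorch is not installed in this environment
    device = pick_device(False)

print("Selected device:", device)
```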


## Environment Variables

In [5]:
# Using Colab
from google.colab import userdata
import os

os.environ['HF_TOKEN'] = userdata.get('HF_TOKEN')
os.environ['BFL_API_KEY'] = userdata.get('BFL_API_KEY')

hf_token = os.environ['HF_TOKEN']
bfl_api_key = os.environ['BFL_API_KEY']
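
Outside Colab, `google.colab.userdata` is not available. One possible fallback, sketched here under the assumption that the tokens are exported as ordinary environment variables, is a small helper (the name `require_env` is hypothetical) that fails with a clear message when a variable is missing:

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise if it is unset or empty."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value


try:
    hf_token = require_env("HF_TOKEN")
except RuntimeError as err:
    # Report the problem instead of failing later with a confusing authentication error
    print(f"Skipping Hugging Face setup: {err}")
```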

## Model List

In [3]:
bf_model = "black-forest-labs/FLUX.1-dev"
sd_model = "stabilityai/stable-diffusion-3.5-large-turbo"
sdx1_model = "stabilityai/sdxl-turbo"

adamo_model = "adamo1139/stable-diffusion-3.5-large-turbo-ungated"
adamo_model_id = "9ad870ac0b0e5e48ced156bb02f85d324b7275d2"

# Hugging Face Model: Flux
**SOURCES:**
* [HF: Black-forest-labs](https://huggingface.co/black-forest-labs/FLUX.1-dev)
* [Diffusers for MacOS](https://huggingface.co/docs/diffusers/optimization/mps)

To use gated models with `diffusers`, you need an access token from Hugging Face:
* Go to Settings > Access Tokens > {user access token name} > Edit permissions > "Read access to contents of all public gated repos you can access"
* Go to the terminal and enter:
```terminal
huggingface-cli login
```

This command will prompt you for a token. Paste yours and press Enter. You'll then be asked whether the token should also be saved as a git credential; I chose 'NO'. Finally, it calls the Hub to check that your token is valid and saves it locally.

**Problem:** _DiffusionPipeline keeps crashing_ with the error "Your session crashed after using all available RAM".

**Solution:** Use PyTorch 2.0 and, on Apple silicon, `pipe.to("mps")`


In [3]:
from diffusers import (DiffusionPipeline, FluxPipeline)
import torch

pipe = DiffusionPipeline.from_pretrained(sdx1_model)

# If you are running this on an Apple M1/M2 chip, move the pipeline to the "mps" device
# pipe = pipe.to("mps")

# enable_model_cpu_offload() saves some VRAM by offloading the model to the CPU. Skip it if you have enough GPU memory
# pipe.enable_model_cpu_offload()

# Recommended if your computer has < 64 GB of RAM
pipe.enable_attention_slicing()

prompt = "A cat holding a sign that says hello world"
image = pipe(
    prompt,
    height=1024,
    width=1024,
    guidance_scale=3.5,
    num_inference_steps=50,
    max_sequence_length=512,
).images[0]
image.save("flux-dev.png")


Loading pipeline components...:   0%|          | 0/7 [00:00<?, ?it/s]

  0%|          | 0/50 [00:00<?, ?it/s]

# Modal

## Importing Modal and setting up
**SOURCE**:
* [Modal getting started](https://modal.com/docs/examples/hello_world)

In [4]:
import sys
import modal

In [8]:
%%python -m modal setup

Process is interrupted.


In [11]:
dir(modal)

['App',
 'Client',
 'CloudBucketMount',
 'Cls',
 'Cron',
 'Dict',
 'Error',
 'FilePatternMatcher',
 'Function',
 'Image',
 'Mount',
 'NetworkFileSystem',
 'Period',
 'Proxy',
 'Queue',
 'Retries',
 'Sandbox',
 'SchedulerPlacement',
 'Secret',
 'Stub',
 'Tunnel',
 'Volume',
 '__all__',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '__version__',
 '_ipython',
 '_location',
 '_pty',
 '_resolver',
 '_resources',
 '_runtime',
 '_serialization',
 '_traceback',
 '_tunnel',
 '_utils',
 '_vendor',
 'app',
 'asgi_app',
 'batched',
 'build',
 'call_graph',
 'client',
 'cloud_bucket_mount',
 'cls',
 'config',
 'container_process',
 'current_function_call_id',
 'current_input_id',
 'dict',
 'enable_output',
 'enter',
 'environments',
 'exception',
 'exit',
 'file_io',
 'file_pattern_matcher',
 'forward',
 'functions',
 'gpu',
 'image',
 'interact',
 'io_streams',
 'is_local',
 'method',
 'mount',
 'network_file_system

# Run Stable Diffusion as a CLI, API, and web UI

https://modal.com/docs/examples/text_to_image

This example shows how to run Stable Diffusion 3.5 Large Turbo on Modal to generate images from your local command line, via an API, and as a web UI.

Inference takes about one minute to cold start, at which point images are generated at a rate of one image every 1-2 seconds for batch sizes between one and 16.
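
The cold start dominates when only a few images are generated. As a rough back-of-the-envelope sketch, assuming a 60-second cold start and 1.5 seconds per image (the midpoint of the 1-2 second range quoted above), the amortized time per image is:

```python
def seconds_per_image(
    n_images: int, cold_start_s: float = 60.0, per_image_s: float = 1.5
) -> float:
    """Average wall-clock seconds per image, including one cold start."""
    if n_images <= 0:
        raise ValueError("n_images must be positive")
    return (cold_start_s + n_images * per_image_s) / n_images


for n in (1, 10, 100):
    print(f"{n:>3} image(s): {seconds_per_image(n):.2f} s per image")
```

Under these assumed numbers, a single image costs 61.5 seconds, ten images cost 7.5 seconds each, and one hundred images cost 2.1 seconds each, which is why keeping containers warm matters for interactive use.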

## Basic Setup


## Web endpoints
[SOURCE](https://github.com/modal-labs/modal-examples/blob/main/09_job_queues/doc_ocr_webapp.py)

Run with: `modal serve 07_web_endpoints/basic_web.py`


### Hello world wide web!

Modal makes it easy to turn your Python functions into serverless web services:
access them via a browser or call them from any client that speaks HTTP, all
without having to worry about setting up servers or managing infrastructure.

This tutorial shows the path with the shortest ["time to 200"](https://shkspr.mobi/blog/2021/05/whats-your-apis-time-to-200/):
[`modal.web_endpoint`](https://modal.com/docs/reference/modal.web_endpoint).

On Modal, web endpoints have all the superpowers of Modal Functions:
they can be [accelerated with GPUs](https://modal.com/docs/guide/gpu),
they can access [Secrets](https://modal.com/docs/guide/secrets) or [Volumes](https://modal.com/docs/guide/volumes),
and they [automatically scale](https://modal.com/docs/guide/cold-start) to handle more traffic.


Under the hood, we use the [FastAPI library](https://fastapi.tiangolo.com/),
which has [high-quality documentation](https://fastapi.tiangolo.com/tutorial/),
linked throughout this tutorial.


---
### Turn a Modal Function into an endpoint with a single decorator

Modal Functions are already accessible remotely -- when you add the `@app.function` decorator to a Python function
and run `modal deploy`, you make it possible for your [other Python functions to call it](https://modal.com/docs/guide/trigger-deployed-functions).

That's great, but it's not much help if you want to share what you've written with someone running code in a different language --
or not running code at all!

And that's where most of the power of the Internet comes from: sharing information and functionality across different computer systems.

So we provide the `web_endpoint` decorator to wrap your Modal Functions in the lingua franca of the web: HTTP.
Here's what that looks like:

```
import modal

image = modal.Image.debian_slim().pip_install("fastapi[standard]")
app = modal.App(name="example-lifecycle-web", image=image)


@app.function()
@modal.web_endpoint(
    docs=True  # adds interactive documentation in the browser
)
def hello():
    return "Hello world!"
```

You can turn this function into a web endpoint by running `modal serve basic_web.py`.
In the output, you should see a URL that ends with `hello-dev.modal.run`.
If you navigate to this URL, you should see the `"Hello world!"` message appear in your browser.

You can also find interactive documentation, powered by OpenAPI and Swagger,
if you add `/docs` to the end of the URL.
From this documentation, you can interact with your endpoint, sending HTTP requests and receiving HTTP responses.
For more details, see the [FastAPI documentation](https://fastapi.tiangolo.com/features/#automatic-docs).

By running the endpoint with `modal serve`, you created a temporary endpoint that will disappear if you interrupt your terminal.
These temporary endpoints are great for debugging -- when you save a change to any of your dependent files, the endpoint will redeploy.
Try changing the message to something else, hitting save, and then hitting refresh in your browser or re-sending
the request from `/docs` or the command line. You should see the new message, along with logs in your terminal showing the redeploy and the request.

When you're ready to deploy this endpoint permanently, run `modal deploy basic_web.py`.
Now, your function will be available even when you've closed your terminal or turned off your computer.

---
### Send data to a web endpoint

The web endpoint above was a bit silly: it always returns the same message.

Most endpoints need an input to be useful. There are two ways to send data to a web endpoint:
- in the URL as a [query parameter](#sending-data-in-query-parameters)
- in the [body of the request](#sending-data-in-the-request-body) as JSON


### Sending data in query parameters

By default, your function's arguments are treated as query parameters:
they are extracted from the end of the URL, where they should be added in the form
`?arg1=foo&arg2=bar`.

From the Python side, there's hardly anything to do:
```
@app.function()
@modal.web_endpoint(docs=True)
def greet(user: str) -> str:
    return f"Hello {user}!"
```

If you are already running `modal serve basic_web.py`, this endpoint will be available at a URL, printed in your terminal, that ends with `greet-dev.modal.run`.

We provide Python type-hints to get type information in the docs and
[automatic validation](https://fastapi.tiangolo.com/tutorial/query-params-str-validations/).
For example, if you navigate directly to the URL for `greet`, you will get a detailed error message
indicating that the `user` parameter is missing. Navigate instead to `/docs` to see how to invoke the endpoint properly.

You can read more about query parameters in the [FastAPI documentation](https://fastapi.tiangolo.com/tutorial/query-params/).
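
On the client side, the query string can be built with the standard library instead of by hand, which also takes care of escaping. This is a minimal sketch; the URL below is a placeholder, so substitute the one printed by `modal serve`.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, **params: str) -> str:
    """Append URL-encoded query parameters to an endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


# Placeholder URL: use the one that ends with `greet-dev.modal.run` in your terminal
GREET_URL = "https://your-workspace--greet-dev.modal.run"

url = build_query_url(GREET_URL, user="Ada Lovelace")
print(url)

# To actually call the endpoint (requires the app to be running):
# from urllib.request import urlopen
# print(urlopen(url).read().decode())
```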

---
### Sending data in the request body

For larger and more complex data, it is generally preferable to send data in the body of the HTTP request.
This body is formatted as [JSON](https://developer.mozilla.org/en-US/docs/Learn/JavaScript/Objects/JSON),
the most common data interchange format on the web.

To set up an endpoint that accepts JSON data, add an argument with a `dict` type-hint to your function.
This argument will be populated with the data sent in the request body.

```
@app.function()
@modal.web_endpoint(method="POST", docs=True)
def goodbye(data: dict) -> str:
    name = data.get("name") or "world"
    return f"Goodbye {name}!"
```

Note that we gave a value of `"POST"` for the `method` argument here.
This argument defines the HTTP request method that the endpoint will respond to,
and it defaults to `"GET"`.
If you head to the URL for the `goodbye` endpoint in your browser,
you will get a 405 Method Not Allowed error, because browsers only send GET requests by default.
While this is technically a separate concern from query parameters versus request bodies
and you can define an endpoint that accepts GET requests and uses data from the body,
it is [considered bad form](https://stackoverflow.com/a/983458).

Navigate to `/docs` for more on how to invoke the endpoint properly.
You will need to send a POST request with a JSON body containing a `name` key.
To get the same typing and validation benefits as with query parameters,
use a [Pydantic model](https://fastapi.tiangolo.com/tutorial/body/)
for this argument.

You can read more about request bodies in the [FastAPI documentation](https://fastapi.tiangolo.com/tutorial/body/).
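
A client can send the JSON body with nothing but the standard library. The sketch below only constructs the request, so it runs without a deployed app; the URL is a placeholder and `build_json_post` is a hypothetical helper name.

```python
import json
from urllib.request import Request


def build_json_post(url: str, payload: dict) -> Request:
    """Build a POST request whose body is the JSON-encoded payload."""
    body = json.dumps(payload).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder URL: use the one that ends with `goodbye-dev.modal.run` in your terminal
GOODBYE_URL = "https://your-workspace--goodbye-dev.modal.run"

request = build_json_post(GOODBYE_URL, {"name": "Modal"})
print(request.get_method(), request.data)

# To actually send it (requires the app to be running):
# from urllib.request import urlopen
# print(urlopen(request).read().decode())
```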

---
### Handle expensive startup with `modal.Cls`

Sometimes your endpoint needs to do something before it can handle its first request,
like get a value from a database or set the value of a variable.
If that step is expensive, like [loading a large ML model](https://modal.com/docs/guide/model-weights),
it'd be a shame to have to do it every time a request comes in!

Web endpoints can be methods on a [`modal.Cls`](https://modal.com/docs/guide/lifecycle-functions#container-lifecycle-functions-and-parameters).
Note that they don't need the [`modal.method`](https://modal.com/docs/reference/modal.method) decorator.

This example will only set the `start_time` instance variable once, on container startup.

```
@app.cls()
class WebApp:
    @modal.enter()
    def startup(self):
        from datetime import datetime, timezone

        print("🏁 Starting up!")
        self.start_time = datetime.now(timezone.utc)

    @modal.web_endpoint(docs=True)
    def web(self):
        from datetime import datetime, timezone

        current_time = datetime.now(timezone.utc)
        return {"start_time": self.start_time, "current_time": current_time}
```
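
The lifecycle can be pictured with a plain-Python analogy: the startup hook runs once per container, and every later request reuses the state it created. The class below is not Modal code; `FakeContainer` is a hypothetical stand-in that only mimics the order of calls.

```python
from datetime import datetime, timezone


class FakeContainer:
    """Plain-Python stand-in for one container; not part of Modal's API."""

    def __init__(self):
        self.startup_calls = 0
        self.start_time = None

    def startup(self):
        # Plays the role of the method decorated with `modal.enter`
        self.startup_calls += 1
        self.start_time = datetime.now(timezone.utc)

    def web(self):
        # Plays the role of the web endpoint method
        return {
            "start_time": self.start_time,
            "current_time": datetime.now(timezone.utc),
        }


container = FakeContainer()
container.startup()  # happens once, when the container boots
responses = [container.web() for _ in range(3)]  # three requests, one startup

print("startup calls:", container.startup_calls)
print("same start_time in every response:",
      all(r["start_time"] == responses[0]["start_time"] for r in responses))
```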


### What next?

Modal's `web_endpoint` decorator is opinionated and designed for relatively simple web applications --
one or a few independent Python functions that you want to expose to the web.

Three additional decorators allow you to serve more complex web applications with greater control:
- [`asgi_app`](https://modal.com/docs/guide/webhooks#asgi) to serve applications compliant with the ASGI standard,
like [FastAPI](https://fastapi.tiangolo.com/)
- [`wsgi_app`](https://modal.com/docs/guide/webhooks#wsgi) to serve applications compliant with the WSGI standard,
like [Flask](https://flask.palletsprojects.com/)
- [`web_server`](https://modal.com/docs/guide/webhooks#non-asgi-web-servers) to serve any application that listens on a port

## Running Flux fast with torch.compile

[source](https://modal.com/docs/examples/flux)


## Implementing SD3.5 large turbo inference on Modal

We wrap inference in a Modal Cls that ensures models are downloaded when we `build` our container image (just like our dependencies) and that models are loaded and then moved to the GPU when a new container starts.

The run function just wraps a `diffusers` pipeline. It sends the output image back to the client as bytes.

We also include a web wrapper that makes it possible to trigger inference via an API call. For details, see the `/docs` route of the URL that appears when you deploy the app (in the original Modal example it ends in `inference-web.modal.run`).

The `Model` class (called `Inference` in the original Modal example) will serve multiple users from its own auto-scaling pool of warm GPU containers automatically.
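
Because the web wrapper returns raw PNG bytes, a client can sanity-check the response before writing it to disk. This is a sketch under the assumption that the response body is available as `bytes`; `save_png` is a hypothetical helper, and the fake body below stands in for a real response.

```python
import tempfile
from pathlib import Path

# Every PNG file starts with this fixed 8-byte signature
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def save_png(data: bytes, path: Path) -> Path:
    """Write the bytes to `path` only if they start with the PNG signature."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("response body does not look like a PNG image")
    path.write_bytes(data)
    return path


# Stand-in for the body of a response from the inference endpoint
fake_response_body = PNG_SIGNATURE + b"placeholder image data"

out = save_png(fake_response_body, Path(tempfile.gettempdir()) / "pentagram-example.png")
print("saved", out)
```

A check like this catches the common case where the endpoint returns an HTML or JSON error page instead of an image.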

In [8]:
#================================================================================================
#
# BASIC SETUP
#
#================================================================================================

from io import BytesIO
import random
import time
from pathlib import Path

import modal

# Running Flux fast
# We’ll make use of the full CUDA toolkit in this example, so we’ll build our container image off of the nvidia/cuda base.
cuda_version = "12.4.0"  # should be no greater than host CUDA version
flavor = "devel"  # includes full CUDA toolkit
operating_sys = "ubuntu22.04"
tag = f"{cuda_version}-{flavor}-{operating_sys}"
# diffusers_commit_sha = "81cf3b2f155f1de322079af28f625349ee21ec6b"  # original sha
diffusers_commit_sha = "9c0e20de61a6e0adcec706564cee739520c1d2f4"

cuda_dev_image = modal.Image.from_registry(
    f"nvidia/cuda:{tag}", add_python="3.12"
).entrypoint([])

flux_image = (
    cuda_dev_image.apt_install(
        "git",
        "libglib2.0-0",
        "libsm6",
        "libxrender1",
        "libxext6",
        "ffmpeg",
        "libgl1",
    )
    .pip_install(
        "invisible_watermark==0.2.0",
        "transformers==4.44.0",
        "huggingface_hub[hf_transfer]==0.26.2",
        "accelerate==0.33.0",
        "safetensors==0.4.4",
        "sentencepiece==0.2.0",
        "torch==2.5.1",
        f"git+https://github.com/huggingface/diffusers.git@{diffusers_commit_sha}",
        "numpy<2",
        "accelerate==0.33.0",
        # "diffusers==0.31.0",
        "fastapi[standard]==0.115.4",
        "torchvision==0.20.1",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})
)

# Torch compilation needs to be re-executed each time a new container starts,
# so we turn on some extra caching to reduce compile times for later containers.
flux_image = flux_image.env(
    {"TORCHINDUCTOR_CACHE_DIR": "/root/.inductor-cache"}
).env({"TORCHINDUCTOR_FX_GRAPH_CACHE": "1"})


basic_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        "accelerate==0.33.0",
        "diffusers==0.31.0",
        "fastapi[standard]==0.115.4",
        "huggingface-hub[hf_transfer]==0.25.2",
        "sentencepiece==0.2.0",
        "torch==2.5.1",
        "torchvision==0.20.1",
        "transformers~=4.44.0",
    )
    .env({"HF_HUB_ENABLE_HF_TRANSFER": "1"})  # faster downloads
)

# Switch between BASIC or FLUX
image = flux_image

# Creates the app. All Modal programs need an App — an object that acts as a recipe for the application. Let’s give it a friendly name.
app = modal.App("pentagram-app")

# The `image.imports()` context manager lets us conditionally import in the global scope.
# This is needed because we might not have the dependencies installed locally,
# but we know they are installed inside the custom image.

with image.imports():
    from diffusers import (FluxPipeline, StableDiffusion3Pipeline, DiffusionPipeline)
    import torch
    from fastapi import Response


#================================================================================================
#
# AVAILABLE DIFFUSION MODELS and DEFAULT PARAMS
#
#================================================================================================

MINUTES = 60 #seconds
VARIANT = "dev"  # "schnell" or "dev", but note [dev] requires you to accept terms and conditions on HF
NUM_INFERENCE_STEPS = 50  # use ~50 for [dev], smaller (~4) for [schnell]

bf_model = f"black-forest-labs/FLUX.1-{VARIANT}"
sd_model = "stabilityai/stable-diffusion-3.5-large-turbo"
sdx1_model = "stabilityai/sdxl-turbo"

adamo_model = "adamo1139/stable-diffusion-3.5-large-turbo-ungated"
adamo_revision_id = "9ad870ac0b0e5e48ced156bb02f85d324b7275d2"

#================================================================================================
#
# MODEL CLASS
#
#================================================================================================
@app.cls(
    image=image,
    gpu="a10g",     # A10G: a lower-cost GPU option (not necessarily the cheapest tier)
    container_idle_timeout=20 * MINUTES,
    timeout=60 * MINUTES,  # leave plenty of time for compilation
    volumes={  # add Volumes to store serializable compilation artifacts, see section on torch.compile below
    "/root/.nv": modal.Volume.from_name("nv-cache", create_if_missing=True),
    "/root/.triton": modal.Volume.from_name(
        "triton-cache", create_if_missing=True
    ),
    "/root/.inductor-cache": modal.Volume.from_name(
        "inductor-cache", create_if_missing=True
    ),
},
)
class Model:
    compile: int = (  # see section on torch.compile below for details
      modal.parameter(default=0)
    )

    @modal.build()
    @modal.enter()
    def setup_model(self):
        """
        Initialize our diffusion model
        """
        # DiffusionPipeline resolves the right pipeline class for the checkpoint;
        # StableDiffusion3Pipeline cannot load the SDXL Turbo weights.
        self.diffuser = DiffusionPipeline.from_pretrained(
            sdx1_model,
            torch_dtype=torch.bfloat16,
        )

    @modal.enter()
    def move_to_gpu(self):
        self.diffuser.to("cuda")

    def generateImage(self, prompt, batch_size: int = 4):
        # max_sequence_length is an SD3/Flux (T5 text encoder) option, so it is omitted for SDXL Turbo
        imageOutput = self.diffuser(
            prompt,
            height=1024,
            width=1024,
            num_images_per_prompt=batch_size,  # outputting multiple images per prompt is much cheaper than separate calls
            num_inference_steps=4,  # turbo is tuned to run in four steps
            guidance_scale=0.0,  # turbo doesn't use CFG
        ).images
        return imageOutput

    @modal.method()  # needed so that `run.remote(...)` and `run.local(...)` work
    def run(
        self, prompt: str, batch_size: int = 4, seed: int = None
    ) -> list[bytes]:
        seed = seed if seed is not None else random.randint(0, 2**32 - 1)
        print("seeding RNG with", seed)
        torch.manual_seed(seed)

        images = self.generateImage(prompt, batch_size)

        buffer = []
        for image in images:
            with BytesIO() as buf:
                image.save(buf, format="PNG")
                buffer.append(buf.getvalue())
        torch.cuda.empty_cache()  # reduce fragmentation
        return buffer

    # Sometimes your endpoint needs to do something before it can handle its first request,
    # like get a value from a database or set the value of a variable.
    # If that step is expensive, like [loading a large ML model](https://modal.com/docs/guide/model-weights),
    # it'd be a shame to have to do it every time a request comes in!

    # Web endpoints can be methods on a [`modal.Cls`](https://modal.com/docs/guide/lifecycle-functions#container-lifecycle-functions-and-parameters).
    # Note that they don't need the [`modal.method`](https://modal.com/docs/reference/modal.method) decorator.
    @modal.web_endpoint(docs=True)
    def web(self, prompt: str, seed: int = None):
        return Response(
            content=self.run.local(  # run in the same container
                prompt, batch_size=1, seed=seed
            )[0],
            media_type="image/png",
        )


#================================================================================================
#
# MODAL APP'S ENTRYPOINT
#
# This will trigger the run locally. The first time we run this,
# it will take 1-2 min. When we run this subsequent times, the image is already built,
# and it will run much faster.
#
#================================================================================================


@app.local_entrypoint()
def entrypoint(
    samples: int = 4,
    prompt: str = "A princess riding on a pony",
    batch_size: int = 4,
    seed: int = None,
):
    print(
        f"prompt => {prompt}",
        f"samples => {samples}",
        f"batch_size => {batch_size}",
        f"seed => {seed}",
        sep="\n",
    )

    output_dir = Path("/tmp/stable-diffusion")
    output_dir.mkdir(exist_ok=True, parents=True)

    inference_service = Model()

    for sample_idx in range(samples):
        start = time.time()
        images = inference_service.run.remote(prompt, batch_size, seed)
        duration = time.time() - start
        print(f"Run {sample_idx+1} took {duration:.3f}s")
        if sample_idx:
            print(
                f"\tGenerated {len(images)} image(s) at {(duration)/len(images):.3f}s / image."
            )
        for batch_idx, image_bytes in enumerate(images):
            output_path = (
                output_dir
                / f"output_{slugify(prompt)[:64]}_{str(sample_idx).zfill(2)}_{str(batch_idx).zfill(2)}.png"
            )
            if not batch_idx:
                print("Saving outputs", end="\n\t")
            print(
                output_path,
                end="\n" + ("\t" if batch_idx < len(images) - 1 else ""),
            )
            output_path.write_bytes(image_bytes)


#================================================================================================
#
# FRONT-END
#
#================================================================================================


frontend_path = Path(__file__).parent / "frontend"

web_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install("jinja2==3.1.4", "fastapi[standard]==0.115.4")
    .add_local_dir(frontend_path, remote_path="/assets")
)


@app.function(
    image=web_image,
    allow_concurrent_inputs=1000,
)
@modal.asgi_app()
def ui():
    import fastapi.staticfiles
    from fastapi import FastAPI, Request
    from fastapi.templating import Jinja2Templates

    web_app = FastAPI()
    templates = Jinja2Templates(directory="/assets")

    @web_app.get("/")
    async def read_root(request: Request):
        return templates.TemplateResponse(
            "index.html",
            {
                "request": request,
                "inference_url": Model.web.web_url,  # the class above is named Model, not Inference
                "model_name": "Stable Diffusion 3.5 Large Turbo",
                "default_prompt": "A cinematic shot of a baby raccoon wearing an intricate italian priest robe.",
            },
        )

    web_app.mount(
        "/static",
        fastapi.staticfiles.StaticFiles(directory="/assets"),
        name="static",
    )

    return web_app


def slugify(s: str) -> str:
    return "".join(c if c.isalnum() else "-" for c in s).strip("-")
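
To see what file names the entrypoint produces, the naming logic can be tried on its own. The sketch below repeats `slugify` from above so it runs standalone; `output_name` is a hypothetical helper that mirrors the f-string used in the entrypoint.

```python
def slugify(s: str) -> str:
    # Replace every non-alphanumeric character with "-" and trim dashes at the ends
    return "".join(c if c.isalnum() else "-" for c in s).strip("-")


def output_name(prompt: str, sample_idx: int, batch_idx: int) -> str:
    """Mirror the file name pattern used when saving generated images."""
    return (
        f"output_{slugify(prompt)[:64]}"
        f"_{str(sample_idx).zfill(2)}_{str(batch_idx).zfill(2)}.png"
    )


print(slugify("Hello, world!"))  # → Hello--world
print(output_name("A princess riding on a pony", 0, 1))
# → output_A-princess-riding-on-a-pony_00_01.png
```

Truncating the slug to 64 characters keeps long prompts from producing overly long file names.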


