#**Implementation of Latent Diffusion model**

In [30]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [31]:
# Importing libraries
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import os

# Set the directory path to '/content' or another desired location
base_path = '/content/drive/MyDrive/base_LDM'  # Base path for storing or accessing files

# List files in a specified directory (default: '/content')
for dirname, _, filenames in os.walk(base_path):
    for filename in filenames:
        print(os.path.join(dirname, filename))
# To access files in the 'working' directory, use '/content/your_directory_name'
working_directory = '/content/drive/MyDrive/work_LDM'

os.makedirs(working_directory, exist_ok=True)
with open(os.path.join(working_directory, 'example.txt'), 'w') as f:
    f.write('This is an example file stored in the working directory of Google Colab.')
print(f"File created at {os.path.join(working_directory, 'example.txt')}")


File created at /content/drive/MyDrive/work_LDM/example.txt


In [32]:
!pip install diffusers



#**Latent diffusion model:**

Latent Diffusion Models (LDMs) are a specific type of diffusion model optimized for generating high-quality images in a more memory-efficient manner by operating in a lower-dimensional latent space. Unlike traditional diffusion models that work directly on image pixels and consume considerable memory, LDMs achieve the diffusion process in latent space, significantly reducing the memory requirements. This process involves training the model to denoise random Gaussian noise step by step to eventually produce an image.

Main component:


*   CLIP Text Encoder: Converts input text into text embeddings.
*   Variational Auto Encoder (VAE): Compresses and decompresses images into and from a lower-dimensional latent space.
*   U-Net: Predicts the noise to be removed from the noised latent representations to reconstruct the original image data.

**CLIP text Encoder**:

The CLIP Text Encoder takes text as input and produces text embeddings. These embeddings represent the text in a form that is close in the latent space to the representation of images encoded by a similar process.



*   Tokenization: Break down the input text into sub-words or tokens and convert these into numerical representations using a lookup table.
*   Text to Embedding Conversion: Utilize the CLIPTextModel to convert the numerical tokens into embeddings that encapsulate the semantic meaning of the text.

*Role in Latent Diffusion*:
The text embeddings generated by the CLIP Text Encoder are used as one of the inputs to the U-Net model. This enables the generation of images that are semantically related to the input text.







#**Variational Auto Encoder (VAE)**
The VAE consists of two main parts: an encoder and a decoder. The encoder compresses an image into a lower-dimensional latent representation, and the decoder attempts to reconstruct the image from this latent representation.

* Encoding: Compress an input image into a latent space representation.

* Decoding: Reconstruct the image from the latent representation.
Role in Latent Diffusion

The VAE is essential for reducing the computational load of the diffusion process. By operating in latent space, the diffusion model requires less computational power and memory, facilitating the generation of high-resolution images.

#**U-Net**

The U-Net architecture takes two inputs: noisy latents and text embeddings. It outputs the predicted noise residuals to be subtracted from the noisy latents, effectively denoising them.

* Adding Noise: Apply a series of noise levels to the latent representations according to a predetermined schedule.

* Noise Prediction: For each noise level, predict the noise present in the noisy latents.

* Denoising: Subtract the predicted noise from the noisy latents to move closer to the original image representation.

*Role in Latent Diffusion*:
The U-Net is crucial for the iterative denoising process in latent diffusion. By gradually reducing noise from the latents, it guides the generation process to produce images that correspond to the textual description provided.

**Importing libraries**


In [33]:
import torch, logging

## disable warnings
logging.disable(logging.WARNING)

## Imaging  library
from PIL import Image
from torchvision import transforms as tfms

## Basic libraries
import numpy as np
from tqdm.auto import tqdm
import matplotlib.pyplot as plt
%matplotlib inline
from IPython.display import display
import shutil
import os

## For video display
from IPython.display import HTML
from base64 import b64encode


## Import the CLIP artifacts
from transformers import CLIPTextModel, CLIPTokenizer
from diffusers import AutoencoderKL, UNet2DConditionModel, LMSDiscreteScheduler
from IPython.display import display, clear_output
import os

In [34]:
import os

# Define the directory path
steps_directory = '/content/drive/MyDrive/LDM/steps2'

# Create the directory along with any intermediate directories if they don't exist
os.makedirs(steps_directory, exist_ok=True)

print(f"Directory created or already exists: {steps_directory}")

Directory created or already exists: /content/drive/MyDrive/LDM/steps2


**setting cpu/gpu device**

In [35]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

**Load image**

Loads an image from a specified path, converts it to RGB, and resizes it to a specified dimension.

Parameters:

* p (str): Path to the image file.
* size (tuple, optional): The dimensions to resize the image to. Default is (512, 512).

Returns:

* Image: An image object in RGB format with the specified dimensions.

In [36]:
def load_image(p):
    return Image.open(p).convert('RGB').resize((224,224))

**PIL to latent representation suitable for vae**

Converts a PIL image to a latent representation suitable for input into a VAE model.

Parameters:

* image (PIL.Image): The image to convert.

Returns:

* Tensor: The latent representation of the image.


In [37]:
def pil_to_latents(image):
    init_image = tfms.ToTensor()(image).unsqueeze(0) * 2.0 - 1.0
    init_image = init_image.to(device="cuda", dtype=torch.float16)
    init_latent_dist = vae.encode(init_image).latent_dist.sample() * 0.18215
    return init_latent_dist

**Latent to PIL**

Converts latents back into a PIL image, suitable for visualization and further processing.

Parameters:

* latents (Tensor): The latent representation to convert back to an image.

Returns:

* List[Image]: A list of image objects generated from the latent representations

In [38]:
def latents_to_pil(latents):
    latents = (1 / 0.18215) * latents
    with torch.no_grad():
        image = vae.decode(latents).sample
    image = (image / 2 + 0.5).clamp(0, 1)
    image = image.detach().cpu().permute(0, 2, 3, 1).numpy()
    images = (image * 255).round().astype("uint8")
    pil_images = [Image.fromarray(image) for image in images]
    return pil_images

**Text Encoder**

Encodes textual prompts into embeddings using a CLIP text model.
Parameters:

* prompts (List[str]): A list of textual prompts to encode.
* maxlen (int, optional): Maximum length of the encoded text. Defaults to the model's maximum length.

Returns:

* Tensor: The encoded text embeddings.

In [39]:
def text_enc(prompts, maxlen=None):
    if maxlen is None: maxlen = tokenizer.model_max_length
    inp = tokenizer(prompts, padding="max_length", max_length=maxlen, truncation=True, return_tensors="pt")
    return text_encoder(inp.input_ids.to("cuda"))[0].half()

**Prompt to image**

Converts text prompts into images using a latent diffusion model.

Parameters:

* prompts (List[str]): Text prompts to convert into images.
* g (float): Guidance scale. Higher values enforce stronger adherence to the text prompt.
* seed (int): Random seed for generating images.
* steps (int): Number of diffusion steps.
* dim (int): Dimension of the generated images.
* save_int (bool): Whether to save intermediate images.

Returns:

* List[Image]: A list of generated image objects corresponding to the text prompts.

**Note**: Due to resource constraints here only 10 diffusion steps and 64 dimensions is given.

In [41]:
def prompt_2_img(prompts, g=7.5, seed=100, steps=70, dim=512, save_int=True): #steps=70, dim=512 for optimal result

    # Defining batch size
    bs = len(prompts)

    # Converting textual prompts to embedding
    text = text_enc(prompts)

    # Adding an unconditional prompt , helps in the generation process
    uncond =  text_enc([""] * bs, text.shape[1])
    emb = torch.cat([uncond, text])

    # Setting the seed
    if seed: torch.manual_seed(seed)

    # Initiating random noise
    latents = torch.randn((bs, unet.in_channels, dim//8, dim//8))

    # Setting number of steps in scheduler
    scheduler.set_timesteps(steps)

    # Adding noise to the latents
    latents = latents.to("cuda").half() * scheduler.init_noise_sigma

    print("Processing text prompts:", prompts)
    # Just before the loop starts:
    print("Visualizing initial latents...")
    latents_norm = torch.norm(latents.view(latents.shape[0], -1), dim=1).mean().item()
    print(f"Initial Latents Norm: {latents_norm}")

    # Iterating through defined steps
    for i,ts in enumerate(tqdm(scheduler.timesteps)):
        # We need to scale the i/p latents to match the variance
        inp = scheduler.scale_model_input(torch.cat([latents] * 2), ts)

        # Predicting noise residual using U-Net
        with torch.no_grad(): u,t = unet(inp, ts, encoder_hidden_states=emb).sample.chunk(2)

        # Performing Guidance
        pred = u + g*(t-u)

        # Conditioning  the latents
        latents = scheduler.step(pred, ts, latents).prev_sample

        # Inside your loop, after `latents` have been updated:
        latents_norm = torch.norm(latents.view(latents.shape[0], -1), dim=1).mean().item()
        print(f"Step {i+1}/{steps} Latents Norm: {latents_norm}")

        from IPython.display import display, clear_output
        if save_int and i%10==0:
            !mkdir -p steps2 # Creating the directory if it doesn't exist
            image_path = f'steps2/la_{i:04d}.jpeg'
            latents_to_pil(latents)[0].save(image_path)
            display(latents_to_pil(latents)[0])  # Display the new image

    return latents_to_pil(latents)

**The Diffusion Process**

* The stable diffusion model takes the textual input and a seed.
* The textual input is then passed through the CLIP model to generate textual embedding of size 77x768 and the seed is used to generate Gaussian noise of size 4x64x64 which becomes the first latent image representation.
* Next, the U-Net iteratively denoises the random latent image representations while conditioning on the text embeddings.
* The output of the U-Net is predicted noise residual, which is then used to compute conditioned latents via a scheduler algorithm.
* This process of denoising and text conditioning is repeated N times (We will use 2 as timestep due to gpu constraint) to retrieve a better latent image representation.
* Once this process is complete, the latent image representation (4x64x64) is decoded by the VAE decoder to retrieve the final output image (3x512x512).

In [42]:
## Initiating tokenizer and encoder.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16)
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14", torch_dtype=torch.float16).to("cuda")

## Initiating the VAE
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae", torch_dtype=torch.float16).to("cuda")

## Initializing a scheduler and Setting number of sampling steps
scheduler = LMSDiscreteScheduler(beta_start=0.00085, beta_end=0.012, beta_schedule="scaled_linear", num_train_timesteps=5)
scheduler.set_timesteps(100)  # time step should be 50

## Initializing the U-Net model
unet = UNet2DConditionModel.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="unet", torch_dtype=torch.float16).to("cuda")

**Result**

In [43]:
images = prompt_2_img(["A cat"], save_int=True)
for img in images:display(img)

Output hidden; open in https://colab.research.google.com to view.

In [15]:
images = prompt_2_img(["A cat"], save_int=True)
for img in images:display(img)

Output hidden; open in https://colab.research.google.com to view.

**Result is not impressive as diffusion step is only given 100 and dimension is 512, also denoising step is given as 70 which is less in contrast of optimal denoising step. Noted: LDM is a resource hungry process**