⚠️ GRADIO USER INTERFACE FOR DEMONSTRATION ONLY ⚠️

# Prompt-Driven Image Analysis: Integrating Gen-AI for Segmentation, Object Transformation, and Cognitive Interpretation | By Kaleem Ahmad


> This system interprets natural-language instructions and carries out the corresponding image-processing tasks, so users do not have to operate the underlying segmentation and generation tools directly.

**Key Features:**

* **Natural Language Processing Interface:** At the heart of the system is an interface that allows users to give instructions in simple, everyday language. This makes it accessible even to those without technical expertise in image processing.

* **Advanced Image Segmentation and Transformation:** Leveraging state-of-the-art AI techniques, the system can segment images into distinct components and transform them in various ways. This includes altering image styles, merging elements from different images, and more.

* **Cognitive Analysis Capabilities:** Beyond just editing, the system is designed to perform cognitive analysis of images. This involves understanding the context, content, and potential implications or meanings behind visual elements.

**Technological Backbone:**

* **Grounding DINO by IDEA-Research:** Used for visual grounding: given a textual description, it locates the matching elements in an image and returns a bounding box for each.
> [![arXiv](https://img.shields.io/badge/arXiv-2303.05499-b31b1b.svg)](https://arxiv.org/abs/2303.05499)

* **Segment Anything by Meta AI:** Used for segmentation. Given the bounding boxes from the grounding step, it produces pixel-level masks so that individual elements of an image can be isolated and manipulated.
> [![arXiv](https://img.shields.io/badge/arXiv-2304.02643-b31b1b.svg)](https://arxiv.org/abs/2304.02643)

* **Stable Diffusion by StabilityAI:** Handles image transformation. The notebook uses the Stable Diffusion 2 inpainting pipeline to regenerate the masked region according to the user's prompt.
> [![arXiv](https://img.shields.io/badge/arXiv-2112.10752-b31b1b.svg)](https://arxiv.org/abs/2112.10752)

* **LLaVA (Large Language-and-Vision Assistant):** Called through the Replicate API to describe, and answer questions about, the edited image, adding a language-based interpretation of the visual content.
> [![arXiv](https://img.shields.io/badge/arXiv-2304.08485-b31b1b.svg)](https://arxiv.org/abs/2304.08485)

> **End Goal:** To create a seamless, user-friendly platform that integrates these diverse AI models. This unified system will not only perform complex image processing tasks but also provide insights and analysis, making it a comprehensive tool for a wide range of applications, from creative industries to academic research.
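
The hand-off between the first two models in the list above depends on a coordinate conversion: Grounding DINO reports each box as normalized `(cx, cy, w, h)`, while Segment Anything takes box prompts as pixel `(x0, y0, x1, y1)`. The following is a minimal pure-Python sketch of that conversion for a single box. The function name `cxcywh_to_xyxy_pixels` is hypothetical and chosen for illustration; the notebook itself performs the same step on tensors with `box_ops.box_cxcywh_to_xyxy`.

```python
def cxcywh_to_xyxy_pixels(box, width, height):
    """Convert a normalized (cx, cy, w, h) box to pixel (x0, y0, x1, y1)."""
    cx, cy, w, h = box
    x0 = (cx - w / 2) * width   # left edge
    y0 = (cy - h / 2) * height  # top edge
    x1 = (cx + w / 2) * width   # right edge
    y1 = (cy + h / 2) * height  # bottom edge
    return (x0, y0, x1, y1)


# A box centered in a 640x480 image that covers half of each dimension
print(cxcywh_to_xyxy_pixels((0.5, 0.5, 0.5, 0.5), 640, 480))
# → (160.0, 120.0, 480.0, 360.0)
```

Keeping the boxes normalized until this point means the same detection can be applied to a resized copy of the image simply by passing a different width and height.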

## Install and Initialize Libraries & Models

In [1]:
!pip install gradio diffusers transformers scipy segment_anything accelerate replicate
!git clone https://github.com/IDEA-Research/GroundingDINO.git
%cd GroundingDINO
!pip install -e .
!mkdir -p weights
%cd weights
!wget https://dl.fbaipublicfiles.com/segment_anything/sam_vit_h_4b8939.pth
!wget https://github.com/IDEA-Research/GroundingDINO/releases/download/v0.1.0-alpha/groundingdino_swint_ogc.pth
%cd ..

Collecting gradio
  Downloading gradio-4.31.2-py3-none-any.whl (12.3 MB)
Collecting diffusers
  Downloading diffusers-0.27.2-py3-none-any.whl (2.0 MB)
Collecting segment_anything
  Downloading segment_anything-1.0-py3-none-any.whl (36 kB)
Collecting accelerate
  Downloading accelerate-0.30.1-py3-none-any.whl (302 kB)
Collecting replicate
  Downloading replicate-0.26.0-py3-none-any.whl (40 kB)
Collecting aiofiles<24.0,>=22.0 (from gradio)
  Downloading aiofiles-23.2.1-py3-none-any.whl (15 kB)
Co
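
Because the install cell downloads two large checkpoint files with `wget`, an interrupted download can leave a missing or empty file that only surfaces later as a confusing model-loading error. The sketch below is one way to check for that before loading the models. It assumes the working directory is the cloned `GroundingDINO` folder, as in the cell above; the helper name `missing_weights` is hypothetical.

```python
from pathlib import Path


def missing_weights(paths):
    """Return the paths that do not exist or are empty files."""
    missing = []
    for p in map(Path, paths):
        if not p.is_file() or p.stat().st_size == 0:
            missing.append(str(p))
    return missing


# Checkpoint locations used by the install cell (relative to the repo folder)
expected = [
    "weights/sam_vit_h_4b8939.pth",
    "weights/groundingdino_swint_ogc.pth",
]

problems = missing_weights(expected)
if problems:
    print("Missing or empty checkpoint files:", problems)
else:
    print("All checkpoints found.")
```

This only confirms that the files are present and non-empty; it does not verify their contents.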

In [2]:
import gradio as gr
from segment_anything import SamPredictor, sam_model_registry
from diffusers import StableDiffusionInpaintPipeline
from groundingdino.util.inference import load_model, load_image, predict, annotate
from groundingdino.util import box_ops  # Corrected import path
from PIL import Image
import torch
import cv2
import matplotlib.pyplot as plt
import numpy as np
import os
import replicate

# Load models and initialize
device = "cuda" if torch.cuda.is_available() else "cpu"
sam_model_path = "weights/sam_vit_h_4b8939.pth"
groundingdino_model_config = "/content/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py"
groundingdino_model_weights = "weights/groundingdino_swint_ogc.pth"
predictor = SamPredictor(sam_model_registry["vit_h"](checkpoint=sam_model_path).to(device=device))
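# float16 is intended for GPU inference; fall back to float32 when running on CPU
dtype = torch.float16 if device == "cuda" else torch.float32
pipe = StableDiffusionInpaintPipeline.from_pretrained("stabilityai/stable-diffusion-2-inpainting", torch_dtype=dtype).to(device)
groundingdino_model = load_model(groundingdino_model_config, groundingdino_model_weights, device=device)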

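# Save the uploaded image to disk so the later steps can read it
def upload_image(image):
    if image is None:
        return "No image provided. Please select an image first."
    image.convert("RGB").save("uploaded_image.jpg")  # JPEG cannot store an alpha channel
    return "Image uploaded successfully!"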

# Load the uploaded image for display and subsequent steps
def load_uploaded_image():
    return Image.open("uploaded_image.jpg")

def prompt_for_object(prompt):
    # load_image returns the RGB source array and the transformed tensor the model expects
    src, img = load_image("uploaded_image.jpg")
    boxes, logits, phrases = predict(
        model=groundingdino_model,
        image=img,
        caption=prompt,
        box_threshold=0.5,
        text_threshold=0.35,
        device=device
    )
    # annotate returns a BGR array; reverse the channel axis to get RGB for PIL
    img_annotated = annotate(image_source=src, boxes=boxes, logits=logits, phrases=phrases)[..., ::-1]
    annotated_image = Image.fromarray(np.ascontiguousarray(img_annotated))
    annotated_image.save("annotated_image.jpg")
    np.save("boxes.npy", boxes.cpu().numpy())  # Save the boxes for later use
    return annotated_image

def object_mask():
    image = np.array(load_uploaded_image().convert("RGB"))
    boxes = torch.tensor(np.load("boxes.npy")).to(device)  # Load the saved boxes
    if boxes.shape[0] == 0:
        # Nothing matched the prompt, so there is no box to segment
        raise gr.Error("No object was detected. Try a different prompt in the previous tab.")
    predictor.set_image(image)
    H, W, _ = image.shape
    # Grounding DINO boxes are normalized (cx, cy, w, h); SAM expects pixel (x0, y0, x1, y1)
    boxes_xyxy = box_ops.box_cxcywh_to_xyxy(boxes) * torch.tensor([W, H, W, H], device=device)
    new_boxes = predictor.transform.apply_boxes_torch(boxes_xyxy, (H, W)).to(device)
    masks, _, _ = predictor.predict_torch(
        point_coords=None,
        point_labels=None,
        boxes=new_boxes,
        multimask_output=False,
    )
    mask_np = masks[0][0].cpu().numpy()  # Only the mask of the first detected box is used
    mask_image = Image.fromarray((mask_np * 255).astype(np.uint8))
    mask_image.save("mask_image.png")

    # Create color mask overlay on the original image
    color_mask = np.zeros((H, W, 3), dtype=np.uint8)
    color_mask[mask_np > 0] = [175, 0, 0]  # Red color for the mask
    # Add the half-weighted red mask on top of the unchanged original
    overlay_image = cv2.addWeighted(image, 1, color_mask, 0.5, 0)
    overlay_image = Image.fromarray(overlay_image)
    overlay_image.save("color_mask_image.png")

    return overlay_image, mask_image

def prompt_to_replace_object(prompt):
    image = load_uploaded_image()
    mask = Image.open("mask_image.png")
    edited = pipe(prompt=prompt, image=image, mask_image=mask).images[0]
    edited.save("edited_image.jpg")
    return edited

def display_output_from_diffusion():
    original_image = load_uploaded_image()
    edited_image = Image.open("edited_image.jpg")

    fig, axes = plt.subplots(1, 2, figsize=(15, 15))  # Adjust figsize to ensure images are not shrunken

    axes[0].imshow(original_image)
    axes[0].axis('off')
    axes[0].set_title('Before', fontsize=16, fontweight='bold')

    axes[1].imshow(edited_image)
    axes[1].axis('off')
    axes[1].set_title('After', fontsize=16, fontweight='bold')

    plt.tight_layout()
    plt.savefig("output_comparison.png")
    plt.close()

    return "output_comparison.png"

def prompt_for_llava(prompt):
    # Read the API token from the environment; never hardcode credentials in a notebook
    api_token = os.environ.get("REPLICATE_API_TOKEN")
    if not api_token:
        raise gr.Error("Set the REPLICATE_API_TOKEN environment variable before calling LLaVA.")
    client = replicate.Client(api_token=api_token)
    file_path = 'edited_image.jpg'  # Use the edited image instead
    # Keep the file open until the full response has been collected, then close it
    with open(file_path, "rb") as image_file:
        output = client.run(
            ref="yorickvp/llava-13b:e272157381e2a3bf12df3a8edd1f38d1dbd736bbb7437277c8b34175f8fce358",
            input={"image": image_file, "prompt": prompt}
        )
        formatted_output = ''.join(output)
    image = Image.open(file_path)  # Load the edited image to display
    return image, f"Prompt: '{prompt}'\n\nDescription:\n{formatted_output}"

with gr.Blocks() as demo:
    gr.Markdown("<h2 style='text-align: center;'>Prompt-Driven Image Analysis: Integrating Generative AI for Segmentation, Object Transformation, and Cognitive Interpretation using SAM, Diffusion and LLaVA<br>By<br>Kaleem Ahmad</h2>")

    with gr.Tab("Upload Image"):
        image_input = gr.Image(type="pil")
        upload_button = gr.Button("Upload")
        upload_button.click(upload_image, inputs=image_input, outputs=gr.Textbox())

    with gr.Tab("Display Image"):
        display_button = gr.Button("Display")
        display_button.click(load_uploaded_image, outputs=gr.Image(type="pil"))

    with gr.Tab("Prompt for Object to Segment"):
        text_prompt = gr.Textbox(label="Object to Segment")
        segment_button = gr.Button("Segment")
        segment_button.click(prompt_for_object, inputs=text_prompt, outputs=gr.Image(type="pil"))

    with gr.Tab("Object Mask from the Image"):
        mask_button = gr.Button("Get Mask")
        mask_button.click(object_mask, outputs=[gr.Image(type="pil"), gr.Image(type="pil")])

    with gr.Tab("Prompt to Replace Object"):
        replace_prompt = gr.Textbox(label="Replacement Description")
        replace_button = gr.Button("Replace Object")
        replace_button.click(prompt_to_replace_object, inputs=replace_prompt, outputs=gr.Image(type="pil"))

    with gr.Tab("Display Output from Diffusion"):
        diffusion_button = gr.Button("Display Output")
        diffusion_button.click(display_output_from_diffusion, outputs=gr.Image(type="filepath"))

    with gr.Tab("Prompt for LLaVA and Output"):
        llava_prompt = gr.Textbox(label="LLaVA Prompt")
        llava_button = gr.Button("Get LLaVA Output")
        llava_button.click(prompt_for_llava, inputs=llava_prompt, outputs=[gr.Image(type="pil"), gr.Textbox()])

demo.launch()

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.



final text_encoder_type: bert-base-uncased



Setting queue=True in a Colab notebook requires sharing enabled. Setting `share=True` (you can turn this off by setting `share=False` in `launch()` explicitly).

Colab notebook detected. To show errors in colab notebook, set debug=True in launch()
Running on public URL: https://c4534610a860b2b792.gradio.live

This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
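
The mask overlay produced in `object_mask` relies on `cv2.addWeighted`, which computes `src1 * alpha + src2 * beta + gamma` per channel and saturates the result to the 8-bit range. The sketch below reproduces that arithmetic for a single pixel in pure Python, using the notebook's weights (`alpha=1`, `beta=0.5`) and mask color `(175, 0, 0)`. The helper name `blend_pixel` is hypothetical, and how it rounds exact halves may differ from OpenCV.

```python
def blend_pixel(src, overlay, alpha=1.0, beta=0.5, gamma=0.0):
    """Per-channel weighted sum, rounded and clipped to [0, 255]."""
    return tuple(
        max(0, min(255, round(s * alpha + o * beta + gamma)))
        for s, o in zip(src, overlay)
    )


mask_color = (175, 0, 0)  # color painted where the mask is set
no_mask = (0, 0, 0)       # color_mask stays black outside the mask

# Inside the mask, the red channel is raised by half of 175
print(blend_pixel((100, 150, 200), mask_color))  # → (188, 150, 200)

# Already-bright pixels saturate at 255 instead of wrapping around
print(blend_pixel((240, 240, 240), mask_color))  # → (255, 240, 240)

# Outside the mask, the original pixel is unchanged
print(blend_pixel((100, 150, 200), no_mask))     # → (100, 150, 200)
```

Because `alpha` is 1, the overlay brightens the masked region rather than dimming the rest of the image, so very bright objects show the red tint less clearly.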


